REFEREE REPORT(S):

Referee: 1

COMMENTS TO THE AUTHOR(S)

This paper studies a novel variant of the familiar Bell inequalities: proving limitations on local hidden variable theories via Kolmogorov complexity. This concept, from the field of algorithmic information theory, measures the intrinsic information content of a binary string as the length of the shortest program required to output that string. The idea behind this paper is to show that quantum mechanics violates local hidden variable theories corresponding to computable hidden variables, because (roughly speaking) the string of measurement outcomes produced from a sequence of experiments is incompressible, and hence cannot be produced by a computable function. The authors first do some theoretical work to set this idea up in a form where it can be experimentally tested, and then do a photonic experiment to test it. The experimental test is based on obtaining sequences of measurement outcomes and then compressing them via standard compression software.

I found the main idea behind this paper novel and interesting. It is related to a theoretical paper of Fuchs, but the authors here take a different, less-statistical perspective. Although Bell inequality violations have proven beyond any reasonable doubt that local hidden variable theories do not describe the universe, the computational point of view taken here offers a different perspective on the issue.

However, I do have some significant concerns about this paper, which I believe needs major rewriting and is not acceptable in its current form. These include the following:

1. As far as I can tell, the experiment actually performed is essentially a standard CHSH-type Bell test (the authors seem to imply this themselves just before sec 4.1), so all that is different is the analysis of the measurement results. The authors should justify why their analysis is telling us anything new.

2. It is not clear to me that there is anything essential about nonlocality, or indeed quantumness, in the arguments behind this work. Isn't it the case that any nondeterministic (i.e. randomised) theory of physics would also violate all computable (even nonlocal) hidden variable theories? For example, consider an experiment where we measure a stream of |0> qubits in the X basis. Quantum mechanics says that the answers will be uniformly random bits. This stream will be incompressible and hence any computable function cannot reproduce it -- and this will even hold if our measurements or state preparation are inaccurate. The authors should discuss why their arguments go beyond this simple idea.

3. The authors do not seem to include any evidence (or indeed a proof) that the NCD measure they use is a good approximation of the "actual" (and noncomputable!) NID measure required to show a violation. Intuitively, if compression software manages to significantly compress some data, this implies that its Kolmogorov complexity is low -- but if the software does not compress the data much, this does not imply that the Kolmogorov complexity is high. Cilibrasi and Vitanyi state that "we cannot in general know how well we are doing using the NCD". The authors should justify why they think that their compression software is giving a good enough estimate of the NID to imply a violation.
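To make this concern concrete, below is a minimal sketch of the standard Cilibrasi-Vitanyi NCD estimate as I presume it is being computed; the choice of compressor (bz2) and the string lengths are placeholders of mine, not taken from the manuscript. It also illustrates the one-sided nature of the estimate: a string generated by a few lines of code (hence of low Kolmogorov complexity) can be completely incompressible to an off-the-shelf compressor, so a large NCD value by itself does not certify a large NID.

    import bz2, random

    def clen(s: bytes) -> int:
        # Compressed length; up to an additive constant (and a choice of units)
        # this upper-bounds K(s), but it can badly overestimate K(s).
        return len(bz2.compress(s))

    def ncd(x: bytes, y: bytes) -> float:
        # Cilibrasi-Vitanyi Normalized Compression Distance.
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # A string produced by a tiny seeded program (so its Kolmogorov complexity
    # is small), which bz2 nevertheless cannot compress at all:
    rng = random.Random(0)
    pseudo = bytes(rng.getrandbits(8) for _ in range(10_000))
    print(clen(pseudo), len(pseudo))    # compressed length is not smaller than the original
    print(ncd(pseudo, pseudo[::-1]))    # close to 1, although the true NID of a string
                                        # and its reversal is essentially 0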
4. Further, it is not obvious to me how inequality (4) can *ever* be violated, in a quantum experiment or otherwise. However they are produced, x_{a_0}, x_{a_1}, y_{b_0} and y_{b_1} are just bit-strings. Therefore, once we have these strings, it seems that they must obey inequality (3), and hence inequality (4). The authors should add an explanation of how this inequality could be violated or otherwise resolve this potential confusion (e.g. by giving an example of some strings which would violate the inequality).

5. I found the presentation very confusing, especially at the start of the paper, and especially for those readers not familiar with Kolmogorov complexity. For example:

- the introduction should clearly describe and motivate the main idea at the start, namely that bounds on compressibility of measurement outcomes can be used to falsify certain hidden variable theories.

- start of sec 1.2: why should U_A and U_B generate output strings that are independent? If U_A and U_B have the same program, for example, and are run on the same input, they will output the same thing. Does this assume that they have independent inputs? Even so, U_A and U_B might always output the same answer (e.g. if they are both the program "always output 0").

- Fig 1: I found the caption here confusing, especially the phrase "A Universal Turing Machine... can reproduce the string of length N". I guess you are saying here that, in a world where computable hidden variables described reality, this would be the case? About the figure itself: why should \Lambda be the same program describing both halves of the bipartite system? And even if it is the same program for both halves, why is the sequence of outputs the same, given that the inputs are different?

- p3: "The output x of each analyzer can be described as the output of a UTM"... again, I presume this is assuming that local, computable hidden variable theories describe the universe? The text needs to be clearer about where this is being assumed and where it is not.

- p7, after eq (9): why does a Q value of 0.37 correspond to "failing" the test, while a Q value below 0.1 corresponds to "passing" the test? This seems arbitrary.

- it might be a good idea to refer to other recent work addressing computability and Kolmogorov complexity in quantum physics ( http://arxiv.org/abs/1407.0604 )

The paper could eventually be suitable for publication in NJP following a resubmission with major changes to the presentation, if the above concerns are satisfactorily addressed.

Referee: 2

COMMENTS TO THE AUTHOR(S)

This paper tests an information-theoretic Bell inequality bearing some similarity to equation (8) of reference [14]. The particular inequality of the current paper is formulated in terms of Kolmogorov complexity. The quantities in the inequality are estimated by using compression software, which is an interesting tactic, and applied to data taken from a Bell-type experiment with photons. I think that the authors measure long runs of data in a fixed measurement setting configuration and then directly apply the information-theoretic functions to the resulting long bit strings. This would be an interesting analysis well deserving of attention, but it is not totally clear to me that this is what is being done. In general, the theory underlying the paper is not well elucidated and it is vague to the point that I don't really know what is supposed to be falsified by their results. (Obviously, some version of Local Realism - but what exactly?) So my overall impression is that there may be something interesting here, but I need substantial clarification.
I have the following critiques:

1) "Local universal Turing machine" (abstract, again on p2 line 22) is not a standard concept. The authors should precisely define what they mean by "locality" with respect to a Turing machine. It would also help to define a UTM for physicists reading the paper who may be unfamiliar with the precise concept. Later, on page 2 line 51, the authors "extend the picture to two spatially separated UTMs... the programs fed into the machines can be correlated" -- is the program the "local hidden variable", so to speak? The paper would benefit from more explanation of how their assumptions relate to the "usual" assumptions of a Bell-style argument.

2) Figure 1 suggests that the setting is being toggled randomly just prior to each detection event. But later in the theoretical description (section 1.3), it seems that the setting is fixed for a long time and multiple detection events generate the bit string x_{i,a_j} for a fixed setting a_j. I think the actual experiment was conducted the latter way. Either way is fine, but which way is it? The quick sentence at the beginning of 1.3 is not sufficient to change the assumptions so drastically; please make this explicitly clear, for instance "before, we were assuming the settings were being toggled every trial; now we are going to assume that the settings are fixed for long blocks of time, and our method is to analyze these blocks as-is." (If that is really what you are doing, anyway.) Or just assume that you have a string of successive outcomes in a fixed setting from the start (and change Figure 1). On a related note, I recommend that the authors mention that having multiple consecutive trials in a fixed setting configuration increases the separation necessary if an experimenter were to want to run their experiment under space-like separation conditions.

3) On page 3, line 54, we have "The output x of each analyzer can be described as the output of a UTM, fed with the settings a_j or b_k, and a program \Lambda, which contains the information about generating the correct output for every detection event and for every setting. For a string have of finite length l(x)=N, \Lambda has to describe the 4^N possible events." I have had great difficulty parsing this, and I'm still not sure I get it. It sounds like the "output" is a computable function of the settings and the detection events, but this doesn't make sense. I now think that "detection event" means a single trial when the output can be either 0 or 1, but in the absence of a formal definition, "detection event" sounds like the specific occurrence of a 1. And then we have that "\Lambda describes the 4^N possible events" -- what does "describe" mean? 2^N options come from the settings and these are inputs (assuming there is a new setting for each trial, which I'm not sure is true), and 2^N options come from the choice of detector clicks/nonclicks, which are outputs (I think), so I don't even have a best guess for what is meant here. Please define things clearly with a re-write of this paragraph. Overall, I think more things need to be defined clearly in the paper.

4) The derivation of (4) follows from two iterations of (2), which is a "triangle-inequality" method for deriving Bell inequalities. This was introduced by [B. Schumacher, PRA 44, 7047 (1991)], and so I think this paper is relevant to the current work. Triangle-inequality-based Bell inequalities have been used recently in [Knill et al., PRA 91, 032105 (2015)] and a follow-up piece [B. Christensen et al., 92, 032130 (2015)]. The two 2015 papers are a little dense, but they explore how to analyze data that comes in big blocks of fixed measurement settings -- much like the current paper (as I presently understand it). It is not necessary to cite the 2015 papers, but doing so could provide a richer context for the current paper.
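For the record, my reading is that the two iterations of (2) chain the four strings together in the Schumacher fashion, i.e. something of the form

    NID(x_{a_0}, y_{b_1}) \le NID(x_{a_0}, y_{b_0}) + NID(y_{b_0}, x_{a_1}) + NID(x_{a_1}, y_{b_1}),

but I am reconstructing this from the text, so the authors should correct me if the pairing of settings differs. Spelling the chained inequality out explicitly would also make it much easier to see which physical assumption is supposed to make a violation possible at all.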
5) The paper makes repeated claims that it offers a method to avoid i.i.d. assumptions (i.i.d. = independent, identically distributed). But the authors make assumptions that are effectively equivalent to i.i.d. assumptions -- which is fine, but it undermines the repeated claim that the paper offers a method to avoid such assumptions. The last paragraph on page 4 is essentially an i.i.d. assumption by another name. Here each "i.i.d. trial" would be what the authors are referring to as a single experimental run in a fixed setting. See also the last paragraph of page 10, where a form of an i.i.d. assumption is implicit in quantifying the error statistically. Is the point the authors are trying to make with the i.i.d. discussion as follows? "Some experimentalists might take data in fixed measurement configurations, and then when analyzing it, pretend that it actually came from an experiment where the settings were changed every trial. We don't do this, and rather use our method to analyze our big blocks of data as-is." If this is what the authors are getting at, they should state it more explicitly.

6) I think the claim of the last paragraph of page 4, that Kolmogorov complexity remains the same in successive runs, needs a little more discussion as to why it is a natural assumption. For instance, there is a small nonzero chance that the quantum state will produce a string of all 1's, which has anomalously low Kolmogorov complexity, so the claim must really be a claim about the average complexity, but even this weakened claim could use more justification as to why it is natural.

7) Section 2.1, "The statistical approach uses an ensemble of all possible N-bit strings and looks for their average Kolmogorov complexity." There are 2^N possible N-bit strings; do you mean the average over all of these, or something else? Then "The ensemble average is the Shannon entropy H(X)..." -- the Shannon entropy of what? Please re-write section 2.1 in greater detail so I can understand it.
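For what it is worth, the standard statement I would expect here (see e.g. Li and Vitanyi's textbook; correct me if a different result is intended) is that for a computable probability distribution P the expected prefix complexity equals the Shannon entropy of P up to an additive constant that depends only on P,

    0 \le \sum_x P(x) K(x) - H(P) \le K(P) + O(1),   where   H(P) = -\sum_x P(x) \log_2 P(x).

If something along these lines is what section 2.1 intends, the distribution being averaged over should be made explicit.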
8) The paper uses Normalized Information Distance (NID), which is only approximately a metric, and so the Bell inequality (5) only holds approximately. Furthermore, the paper uses Normalized Compression Distance (NCD) to approximate NID. While the authors offer an assessment of the error on the first approximation (p. 4 line 24), I only see the term "well approximated" for the passage from NID to NCD in Section 2.2. The authors should discuss the error involved in the second approximation with more precision, and I think the results section should revisit all of these accumulated error terms with an estimate of their size.

9) Regarding section 4 and figure 5, what is the contrast of the polarizers? Did the authors use a PBS or something else?

10) I believe the method of section 4.1 requires a version of the fair sampling assumption, and I think the authors should state this. On a related note, anyone familiar with these types of experiments will know that Alice and Bob were not space-like separated, but I still think it is a good idea to explicitly state this somewhere.

11) I'm a little confused about page 10 around line 19: the four options are 00, 01, 10, and 11. Do you mean that Alice can get these four things and Bob can also get these four things, or does 00 mean Alice gets D_V and Bob gets D_V, 01 means Alice D_V and Bob D_H, etc.? If it's the second way, I think this assumes that the probability of getting a simultaneous click in D_H and D_V at one location (say, Alice) is minuscule. Did these events occur rarely and get discarded, or did they never occur? Discarding is fine, but if it occurred it should be discussed.

12) To compensate for variation in the efficiencies of the detectors, runs were repeated with the various detectors swapped for each other, and then "the resulting outcomes are then interleaved" (page 10, line 33). Can the authors be more explicit about the exact interleaving process on the bit strings (I think this means 111 and 000 becomes 101010, but I'm not sure), and about the subsequent method of simultaneously feeding Alice's and Bob's separate bit strings into the compressor (was this a concatenation, as suggested by p. 6, line 17, or an interleaving, as suggested by page 7 line 57)? A small sketch of the two operations I have in mind appears at the end of this report.

13) I would like to see more reporting on the data. Was there only one run in each setting configuration, or were there many in each configuration, so that each term in (8) could have been estimated with an average over many runs? In particular, I'm curious where the figure S on page 10 comes from. How many total trials were there / how long were the bit strings?

14) I don't think the approach allows one to "omit the notion of probability" (p. 11 line 30), as the results implicitly depend on assumptions that certain possibilities (such as a string of all 1s, which has low Kolmogorov complexity) can be ignored by arguing that their probability is very low. I think the estimation of Kolmogorov complexity with NCD probably also depends on bounding low-probability events.

Minor technical points:

M1) Preceding (2), NID is only asymptotically a metric, or approximately a metric, so it's a little funny to say "NID is a metric thus (2) holds" only to immediately qualify that by saying that (2) only approximately holds (which is precisely why NID is only an approximate metric).

M2) There is a typo in the last subscript of (8); it should be y_{b_1}.
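As promised in point 12, here is what I have in mind by "interleaving" versus "concatenation"; this is purely my guess at the two candidate procedures, for the authors to confirm or correct:

    # Two ways of combining a pair of bit strings before compression; which of
    # these (if either) the paper uses is exactly what I am asking in point 12.
    def interleave(s1: str, s2: str) -> str:
        # interleave('111', '000') -> '101010'
        return ''.join(a + b for a, b in zip(s1, s2))

    def concatenate(s1: str, s2: str) -> str:
        # concatenate('111', '000') -> '111000'
        return s1 + s2

    print(interleave('111', '000'))    # 101010
    print(concatenate('111', '000'))   # 111000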