REFEREE REPORT(S):

Referee: 1

COMMENTS TO THE AUTHOR(S)

This paper studies a novel variant of the familiar Bell inequalities: proving limitations on local hidden variable theories via Kolmogorov complexity. This concept, from the field of algorithmic information theory, measures the intrinsic information content of a binary string as the length of the shortest program required to output that string. The idea behind this paper is to show that quantum mechanics violates local hidden variable theories corresponding to computable hidden variables, because (roughly speaking) the string of measurement outcomes produced from a sequence of experiments is incompressible, and hence cannot be produced by a computable function. The authors first do some theoretical work to set this idea up in a form where it can be experimentally tested, and then do a photonic experiment to test it. The experimental test is based on obtaining sequences of measurement outcomes and then compressing them via standard compression software.

I found the main idea behind this paper novel and interesting. It is related to a theoretical paper of Fuchs, but the authors here take a different, less-statistical perspective. Although Bell inequality violations have proven beyond any reasonable doubt that local hidden variable theories do not describe the universe, the computational point of view taken here offers a different perspective on the issue.

However, I do have some significant concerns about this paper, which I believe needs major rewriting and is not acceptable in its current form. These include the following:

1. As far as I can tell, the experiment actually performed is essentially a standard CHSH-type Bell test (the authors seem to imply this themselves just before sec 4.1), so all that is different is the analysis of the measurement results. The authors should justify why their analysis is telling us anything new.

2. It is not clear to me that there is anything essential about nonlocality, or indeed quantumness, in the arguments behind this work. Isn't it the case that any nondeterministic (i.e. randomised) theory of physics would also violate all computable (even nonlocal) hidden variable theories? For example, consider an experiment where we measure a stream of |0> qubits in the X basis. Quantum mechanics says that the answers will be uniformly random bits. This stream will be incompressible and hence any computable function cannot reproduce it -- and this will even hold if our measurements or state preparation are inaccurate. The authors should discuss why their arguments go beyond this simple idea.

3. The authors do not seem to include any evidence (or indeed a proof) that the NCD measure they use is a good approximation of the "actual" (and noncomputable!) NID measure required to show a violation. Intuitively, if compression software manages to significantly compress some data, this implies that its Kolmogorov complexity is low -- but if the software does not compress the data much, this does not imply that the Kolmogorov complexity is high. Cilibrasi and Vitanyi state that "we cannot in general know how well we are doing using the NCD". The authors should justify why they think that their compression software is giving a good enough estimate of the NID to imply a violation.
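To make this concern concrete, below is a minimal sketch of the standard Cilibrasi-Vitanyi NCD estimate as I presume it is being computed; the choice of compressor (bz2) and the string lengths are placeholders of mine, not taken from the manuscript. It also illustrates the one-sided nature of the estimate: a string generated by a few lines of code (hence of low Kolmogorov complexity) can be completely incompressible to an off-the-shelf compressor, so a large NCD value by itself does not certify a large NID.

    import bz2, random

    def clen(s: bytes) -> int:
        # Compressed length; up to an additive constant (and a choice of units)
        # this upper-bounds K(s), but it can badly overestimate K(s).
        return len(bz2.compress(s))

    def ncd(x: bytes, y: bytes) -> float:
        # Cilibrasi-Vitanyi Normalized Compression Distance.
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    # A string produced by a tiny seeded program (so its Kolmogorov complexity
    # is small), which bz2 nevertheless cannot compress at all:
    rng = random.Random(0)
    pseudo = bytes(rng.getrandbits(8) for _ in range(10_000))
    print(clen(pseudo), len(pseudo))    # compressed length is not smaller than the original
    print(ncd(pseudo, pseudo[::-1]))    # close to 1, although the true NID of a string
                                        # and its reversal is essentially 0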
4. Further, it is not obvious to me how inequality (4) can *ever* be violated, in a quantum experiment or otherwise. However they are produced, x_{a_0}, x_{a_1}, y_{b_0} and y_{b_1} are just bit-strings. Therefore, once we have these strings, it seems that they must obey inequality (3), and hence inequality (4). The authors should add an explanation of how this inequality could be violated or otherwise resolve this potential confusion (e.g. by giving an example of some strings which would violate the inequality).

5. I found the presentation very confusing, especially at the start of the paper, and especially for those readers not familiar with Kolmogorov complexity. For example:

- the introduction should clearly describe and motivate the main idea at the start, namely that bounds on compressibility of measurement outcomes can be used to falsify certain hidden variable theories.

- start of sec 1.2: why should U_A and U_B generate output strings that are independent? If U_A and U_B have the same program, for example, and are run on the same input, they will output the same thing. Does this assume that they have independent inputs? Even so, U_A and U_B might always output the same answer (e.g. if they are both the program "always output 0").

- Fig 1: I found the caption here confusing, especially the phrase "A Universal Turing Machine... can reproduce the string of length N". I guess you are saying here that, in a world where computable hidden variables described reality, this would be the case? About the figure itself: why should \Lambda be the same program describing both halves of the bipartite system? And even if it is the same program for both halves, why is the sequence of outputs the same, given that the inputs are different?

- p3: "The output x of each analyzer can be described as the output of a UTM"... again, I presume this is assuming that local, computable hidden variable theories describe the universe? The text needs to be clearer about where this is being assumed and where it is not.

- p7, after eq (9): why does a Q value of 0.37 correspond to "failing" the test, while a Q value below 0.1 corresponds to "passing" the test? This seems arbitrary.

- it might be a good idea to refer to other recent work addressing computability and Kolmogorov complexity in quantum physics ( http://arxiv.org/abs/1407.0604 )

The paper could eventually be suitable for publication in NJP following a resubmission with major changes to the presentation, if the above concerns are satisfactorily addressed.

Referee: 2

COMMENTS TO THE AUTHOR(S)

This paper tests an information-theoretic Bell inequality bearing some similarity to equation (8) of reference [14]. The particular inequality of the current paper is formulated in terms of Kolmogorov complexity. The quantities in the inequality are estimated by using compression software, which is an interesting tactic, and applied to data taken from a Bell-type experiment with photons. I think that the authors measure long runs of data in a fixed measurement setting configuration and then directly apply the information-theoretic functions to the resulting long bit strings. This would be an interesting analysis well deserving of attention, but it is not totally clear to me that this is what is being done. In general, the theory underlying the paper is not well elucidated and it is vague to the point that I don't really know what is supposed to be falsified by their results. (Obviously, some version of Local Realism - but what exactly?) So my overall impression is that there may be something interesting here, but I need substantial clarification.
I have the following critiques:

1) "Local universal Turing machine" (abstract, again on p2 line 22) is not a standard concept. The authors should precisely define what they mean by "locality" with respect to a Turing machine. It would also help to define a UTM for physicists reading the paper who may be unfamiliar with the precise concept. Later, on page 2 line 51, the authors "extend the picture to two spatially separated UTMs... the programs fed into the machines can be correlated" -- is the program the "local hidden variable", so to speak? The paper would benefit from more explanation of how their assumptions relate to the "usual" assumptions of a Bell-style argument.

2) Figure 1 suggests that the setting is being toggled randomly just prior to each detection event. But later in the theoretical description (section 1.3), it seems that the setting is fixed for a long time and multiple detection events generate the bit string x_{i,a_j} for a fixed setting a_j. I think the actual experiment was conducted the latter way. Either way is fine, but which way is it? The quick sentence at the beginning of 1.3 is not sufficient to change the assumptions so drastically; please make this explicitly clear, for instance "before, we were assuming the settings were being toggled every trial; now we are going to assume that the settings are fixed for long blocks of time, and our method is to analyze these blocks as-is." (If that is really what you are doing, anyway.) Or just assume that you have a string of successive outcomes in a fixed setting from the start (and change Figure 1). On a related note, I recommend that the authors mention that having multiple consecutive trials in a fixed setting configuration increases the separation necessary if an experimenter were to want to run their experiment under space-like separation conditions.

3) On page 3, line 54, we have "The output x of each analyzer can be described as the output of a UTM, fed with the settings a_j or b_k, and a program \Lambda, which contains the information about generating the correct output for every detection event and for every setting. For a string have of finite length l(x)=N, \Lambda has to describe the 4^N possible events." I have had great difficulty parsing this, and I'm still not sure I get it. It sounds like the "output" is a computable function of the settings and the detection events, but this doesn't make sense. I now think that "detection event" means a single trial when the output can be either 0 or 1, but in the absence of a formal definition, "detection event" sounds like the specific occurrence of a 1. And then we have that "\Lambda describes the 4^N possible events" -- what does "describe" mean? 2^N options come from the settings and these are inputs (assuming there is a new setting for each trial, which I'm not sure is true), and 2^N options come from the choice of detector clicks/nonclicks, which are outputs (I think), so I don't even have a best guess for what is meant here. Please define things clearly with a re-write of this paragraph. Overall, I think more things need to be defined clearly in the paper.

4) The derivation of (4) follows from two iterations of (2), which is a "triangle-inequality" method for deriving Bell inequalities. This was introduced by [B. Schumacher, PRA 44, 7047 (1991)], and so I think this paper is relevant to the current work. Triangle-inequality-based Bell inequalities have been used recently in [Knill et al., PRA 91, 032105 (2015)] and a follow-up piece [B. Christensen et al., 92, 032130 (2015)]. The two 2015 papers are a little dense, but they explore how to analyze data that comes in big blocks of fixed measurement settings -- much like the current paper (as I presently understand it). It is not necessary to cite the 2015 papers, but doing so could provide a richer context for the current paper.
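For the record, my reading is that the two iterations of (2) chain the four strings together in the Schumacher fashion, i.e. something of the form

    NID(x_{a_0}, y_{b_1}) \le NID(x_{a_0}, y_{b_0}) + NID(y_{b_0}, x_{a_1}) + NID(x_{a_1}, y_{b_1}),

but I am reconstructing this from the text, so the authors should correct me if the pairing of settings differs. Spelling the chained inequality out explicitly would also make it much easier to see which physical assumption is supposed to make a violation possible at all.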
5) The paper makes repeated claims that it offers a method to avoid i.i.d. assumptions (i.i.d. = independent, identically distributed). But the authors make assumptions that are effectively equivalent to i.i.d. assumptions -- which is fine, but it undermines the repeated claim that the paper offers a method to avoid such assumptions. The last paragraph on page 4 is essentially an i.i.d. assumption by another name. Here each "i.i.d. trial" would be what the authors are referring to as a single experimental run in a fixed setting. See also the last paragraph of page 10, where a form of an i.i.d. assumption is implicit in quantifying the error statistically. Is the point the authors are trying to make with the i.i.d. discussion as follows? "Some experimentalists might take data in fixed measurement configurations, and then when analyzing it, pretend that it actually came from an experiment where the settings were changed every trial. We don't do this, and rather use our method to analyze our big blocks of data as-is." If this is what the authors are getting at, they should state it more explicitly.

6) I think the claim of the last paragraph of page 4, that Kolmogorov complexity remains the same in successive runs, needs a little more discussion as to why it is a natural assumption. For instance, there is a small nonzero chance that the quantum state will produce a string of all 1's, which has anomalously low Kolmogorov complexity, so the claim must really be a claim about the average complexity, but even this weakened claim could use more justification as to why it is natural.

7) Section 2.1, "The statistical approach uses an ensemble of all possible N-bit strings and looks for their average Kolmogorov complexity." There are 2^N possible N-bit strings; do you mean the average over all of these, or something else? Then "The ensemble average is the Shannon entropy H(X)..." -- the Shannon entropy of what? Please re-write section 2.1 in greater detail so I can understand it.
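For what it is worth, the standard statement I would expect here (see e.g. Li and Vitanyi's textbook; correct me if a different result is intended) is that for a computable probability distribution P the expected prefix complexity equals the Shannon entropy of P up to an additive constant that depends only on P,

    0 \le \sum_x P(x) K(x) - H(P) \le K(P) + O(1),   where   H(P) = -\sum_x P(x) \log_2 P(x).

If something along these lines is what section 2.1 intends, the distribution being averaged over should be made explicit.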
8) The paper uses Normalized Information Distance (NID), which is only approximately a metric, and so the Bell inequality (5) only holds approximately. Furthermore, the paper uses Normalized Compression Distance (NCD) to approximate NID. While the authors offer an assessment of the error on the first approximation (p. 4 line 24), I only see the term "well approximated" for the passage from NID to NCD in Section 2.2. The authors should discuss the error involved in the second approximation with more precision, and I think the results section should revisit all of these accumulated error terms with an estimate of their size.

9) Regarding section 4 and figure 5, what is the contrast of the polarizers? Did the authors use a PBS or something else?

10) I believe the method of section 4.1 requires a version of the fair sampling assumption, and I think the authors should state this. On a related note, anyone familiar with these types of experiments will know that Alice and Bob were not space-like separated, but I still think it is a good idea to explicitly state this somewhere.

11) I'm a little confused about page 10 around line 19: the four options are 00, 01, 10, and 11. Do you mean that Alice can get these four things and Bob can also get these four things, or does 00 mean Alice gets D_V and Bob gets D_V, 01 means Alice D_V and Bob D_H, etc.? If it's the second way, I think this assumes that the probability of getting a simultaneous click in D_H and D_V at one location (say, Alice) is minuscule. Did these events occur rarely and get discarded, or did they never occur? Discarding is fine, but if it occurred it should be discussed.

12) To compensate for variation in the efficiencies of the detectors, runs were repeated with the various detectors swapped for each other, and then "the resulting outcomes are then interleaved" (page 10, line 33). Can the authors be more explicit about the exact interleaving process on the bit strings (I think this means 111 and 000 becomes 101010, but I'm not sure), and about the subsequent method of simultaneously feeding Alice's and Bob's separate bit strings into the compressor (was this a concatenation, as suggested by p. 6, line 17, or an interleaving, as suggested by page 7 line 57)? A small sketch of the two operations I have in mind appears at the end of this report.

13) I would like to see more reporting on the data. Was there only one run in each setting configuration, or were there many in each configuration, so that each term in (8) could have been estimated with an average over many runs? In particular, I'm curious where the figure S on page 10 comes from. How many total trials were there / how long were the bit strings?

14) I don't think the approach allows one to "omit the notion of probability" (p. 11 line 30), as the results implicitly depend on assumptions that certain possibilities (such as a string of all 1s, which has low Kolmogorov complexity) can be ignored by arguing that their probability is very low. I think the estimation of Kolmogorov complexity with NCD probably also depends on bounding low-probability events.

Minor technical points:

M1) Preceding (2), NID is only asymptotically a metric, or approximately a metric, so it's a little funny to say "NID is a metric thus (2) holds" only to immediately qualify that by saying that (2) only approximately holds (which is precisely why NID is only an approximate metric).

M2) There is a typo in the last subscript of (8); it should be y_{b_1}.
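As promised in point 12, here is what I have in mind by "interleaving" versus "concatenation"; this is purely my guess at the two candidate procedures, for the authors to confirm or correct:

    # Two ways of combining a pair of bit strings before compression; which of
    # these (if either) the paper uses is exactly what I am asking in point 12.
    def interleave(s1: str, s2: str) -> str:
        # interleave('111', '000') -> '101010'
        return ''.join(a + b for a, b in zip(s1, s2))

    def concatenate(s1: str, s2: str) -> str:
        # concatenate('111', '000') -> '111000'
        return s1 + s2

    print(interleave('111', '000'))    # 101010
    print(concatenate('111', '000'))   # 111000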