I mentioned in passing in a previous post that the authors of a new study titled “Your Brain on ChatGPT” may have p-hacked or cherry-picked their results. It’s now time to take a closer look at this possibility. To begin (sorry, Richard!), I think it is fair to say that the practice of p-hacking or data dredging, i.e. manipulating or “massaging” one’s data in order to find a noteworthy result, has reached epidemic proportions across the so-called “social sciences”, even in accounting research (see here, for example). I have blogged about several variants of this problem many times before; see:
- p-hacking primer (14 January 2017)
- Cherry picking (31 March 2019)
- Data dredging (1 April 2019)
- Publication bias (3 April 2019)
- Tentative reply to Gow 2023 (17 October 2023)
Now, let’s turn to “Your Brain on ChatGPT” by Kosmyna et al. (2025). In part 4 of this 19 June blog post, Ben Shindel makes a strong case that the results in this paper are most likely p-hacked. Specifically, he explains that the Kosmyna study “tested virtually every possible qualitative and quantitative measure in order to determine statistical significance and evaluate interesting-looking findings” (emphasis omitted). Moreover, it is the centerpiece of the Kosmyna study, the EEG results, that is most suspect. To see why, check out this revealing methodological disclosure buried on pages 77-78 of the Kosmyna paper:
“For all the sessions [in our experiment] we calculated dDTF for all pairs of electrodes 32 × 32 = 1024 and ran repeated measures analysis of variance (rmANOVA) within the participant and between the participants within the groups. Due to complexity of the data and volume of the collected data we ran rmANOVA ≤ 1000 times each. To denote different levels of significance in figures and results, we adopted the following convention:
- p < 0.05 was considered statistically significant and is marked with a single asterisk (*)
- p < 0.01 with a double asterisk (**)
- p < 0.001 with a triple asterisk (***)”
But as Shindel correctly notes, the MIT Media Lab team was “bound to get tons of false positives” from their EEG results because this particular method will produce “perhaps tens of thousands of possible correlations to test for. Hundreds of these will meet their criteria for statistical significance by chance, probably even with FDR [False Discovery Rate] implemented.” For my part, this is why, as I disclosed when I first began to consider the impact of A.I. models on critical thinking (see here), I remain agnostic on this question.
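
To make Shindel’s point about false positives concrete, here is a minimal simulation sketch. It is my own illustration, not the authors’ code or analysis pipeline: it runs one simple two-sample test per “electrode pair” on pure-noise data, so any significant result is a false positive by construction. The group size, the use of a plain t-test in place of rmANOVA, and the Benjamini-Hochberg FDR procedure are all assumptions made purely for illustration.

```python
# Minimal sketch (my own illustration, not the Kosmyna pipeline): run one
# simple two-sample test per "electrode pair" on pure-noise data and count
# how many clear p < 0.05 by chance, before and after an FDR correction.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
n_pairs = 32 * 32       # 1024 electrode pairs, as in the quoted passage
n_subjects = 18         # hypothetical per-group sample size (an assumption)

p_values = []
for _ in range(n_pairs):
    # Both "groups" are drawn from the same distribution, so any
    # statistically significant difference is a false positive.
    group_a = rng.normal(0.0, 1.0, n_subjects)
    group_b = rng.normal(0.0, 1.0, n_subjects)
    _, p = stats.ttest_ind(group_a, group_b)
    p_values.append(p)

p_values = np.array(p_values)
print("Uncorrected p < 0.05:", int(np.sum(p_values < 0.05)))  # ~5% of 1024, i.e. ~50

# Benjamini-Hochberg False Discovery Rate correction
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print("Significant after BH-FDR:", int(np.sum(reject)))
```

On pure noise, roughly 5% of the uncorrected tests (about 50 of 1024) clear p < 0.05 by chance alone. An FDR correction removes most of these in this idealized setting, but the sheer number of electrode pairs, sessions, and derived measures in the real study multiplies the opportunities for chance findings, which is exactly Shindel’s concern.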