Stop Making Unscientific AGI Performance Claims

Forty-first International Conference on Machine Learning (ICML)

Delft University of Technology

Andrew M. Demetriou
Antony Bartlett
Cynthia C. S. Liem

April 25, 2025

Introduction

Economist by training, previously Bank of England, currently 3rd year PhD in Trustworthy AI @ TU Delft.

Motivation

  • \(A_1\): \(enc(\)“It is essential to bring inflation back to target to avoid drifting into deflation territory.”\()\)
  • \(A_2\): \(enc(\)“It is essential to bring the numbers of doves back to target to avoid drifting into dovelation territory.”\()\)

“They’re exactly the same.”

— Linear probe \(\widehat{cpi}=f(A)\)

Position

Current LLMs embed knowledge. They don’t “understand” anything. They are useful tools, but tools nonetheless.

  • Meaningful patterns in embeddings are like doves in the sky.
  • Humans are prone to seek patterns and anthropomorphize.
  • Observed ‘sparks’ of Artificial General Intelligence are spurious.
  • The academic community should exercise extra caution.
  • Publishing incentives need to be adjusted.

Outline

  • Experiments: We probe models of varying complexity including random projections, matrix decompositions, deep autoencoders and transformers.
    • All of them successfully distill knowledge and yet none of them develop true understanding.
  • Social sciences review: Humans are prone to seek patterns and anthropomorphize.
  • Conclusion and outlook: More caution at the individual level, and different incentives at the institutional level.

“There! It’s sentient!”

The Holy Grail

Achievement of Artificial General Intelligence (AGI) has become a grand challenge, and in some cases, an explicit business goal.

Definition

The definition of AGI itself is neither clear-cut nor consistent:

  • (Loosely) a phenomenon contrasting with ‘narrow AI’ systems, which are trained for specific tasks (Goertzel 2014).

Practice

Researchers have sought to show that AI models generalize to different (and possibly unseen) tasks or show performance considered ‘surprising’ to humans.

  • For example, Google DeepMind claimed their AlphaGeometry model (Trinh et al. 2024) reached a ‘milestone’ towards AGI.

A Perfect Storm

Recent developments in the field have created a ‘perfect storm’ for inflated claims:

  • Early sharing of preprints and code.
  • Volume of publishable work has exploded.
  • Social media influencers have started to play a role in article discovery and citability (Weissburg et al. 2024).
  • Complexity is increasing because it is incentivized (Birhane et al. 2022).

“Not Mere Stochastic Parrots”

  • We consider a recently viral work (Gurnee and Tegmark 2023a), in which claims about the learning of world models by LLMs were made.
    • Linear probes (ridge regression) were successfully used to predict geographical locations from LLM embeddings.
  • Claims were made on X that this indicates that LLMs are not mere ‘stochastic parrots’ (Bender et al. 2021).
  • Reactions on X seemed to largely exhibit excitement and surprise at the authors’ findings.
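
To make the probing setup concrete, the following is a minimal sketch of a ridge-regression linear probe. It is not the code of Gurnee and Tegmark: the activations are synthetic stand-ins in which two coordinates are linearly mixed into a 32-dimensional vector, and the names `fit_ridge_probe` and `r_squared`, the penalty `lam=1.0` and all sizes are assumptions chosen for illustration.

```python
import numpy as np

def fit_ridge_probe(A, y, lam=1.0):
    """Closed-form ridge fit on centered data; returns weights W and intercept b."""
    A_mean, y_mean = A.mean(axis=0), y.mean(axis=0)
    Ac, yc = A - A_mean, y - y_mean
    W = np.linalg.solve(Ac.T @ Ac + lam * np.eye(A.shape[1]), Ac.T @ yc)
    return W, y_mean - A_mean @ W

def r_squared(y_true, y_pred):
    """Coefficient of determination, pooled over all target columns."""
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for embeddings (assumption): coordinates are linearly
# encoded in the activations, plus a little noise.
rng = np.random.default_rng(0)
n, d = 500, 32
coords = rng.uniform(-1, 1, size=(n, 2))          # hypothetical rescaled (lat, long)
mixing = rng.normal(size=(2, d))
A = coords @ mixing + 0.1 * rng.normal(size=(n, d))

# Fit on the first 400 rows, evaluate on the held-out 100.
W, b = fit_ridge_probe(A[:400], coords[:400])
r2 = r_squared(coords[400:], A[400:] @ W + b)
print(f"out-of-sample R^2: {r2:.3f}")
```

A high score here only says that the information is linearly recoverable from the activations, which is true by construction in this toy setup.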

“The human mind is a pattern-seeking device”

Are Neural Networks Born with World Models?

  • The Llama-2 model tested in Gurnee and Tegmark (2023b) has ingested huge amounts of publicly available data (Touvron et al. 2023).
    • Geographical locations are literally in the training data: e.g. Wikipedia article for “London”.
    • Where would this information be encoded if not in the embedding space \(\mathcal{A}\)? Is it surprising that \(A_{\text{LDN}}=enc(\text{"London"}) \not\!\perp\!\!\!\perp (\text{lat}_{\text{LDN}},\text{long}_{\text{LDN}})\)?
  • Figure 1 shows the predicted coordinates of a linear probe on the final-layer activations of an untrained neural network.
  • The model has seen noisy coordinates plus \(d\) random features.
  • It has a single hidden layer with \(h < d\) hidden units.
Figure 1: Predicted coordinate values (out-of-sample) from a linear probe on final-layer activations of an untrained neural network.
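
The untrained-network experiment can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the coordinates are standardized synthetic draws, the sizes `n`, `d`, `h`, the noise level and the `tanh` activation are all hypothetical choices, and the probe is plain least squares with an intercept.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, h = 2000, 50, 40            # h < d as on the slide; the sizes are assumptions

coords = rng.normal(size=(n, 2))                        # standardized synthetic coordinates
X = np.hstack([coords + 0.1 * rng.normal(size=(n, 2)),  # noisy coordinates
               rng.normal(size=(n, d))])                # d random features

# Untrained network: weights are drawn once and never updated.
W1 = rng.normal(size=(X.shape[1], h)) / np.sqrt(X.shape[1])
b1 = 0.1 * rng.normal(size=h)
H = np.tanh(X @ W1 + b1)          # hidden-layer activations that we probe

def with_intercept(M):
    """Append a column of ones so the probe can learn an offset."""
    return np.hstack([M, np.ones((M.shape[0], 1))])

# Linear probe: least squares on the first 1600 rows, evaluated on the rest.
train, test = slice(0, 1600), slice(1600, n)
beta, *_ = np.linalg.lstsq(with_intercept(H[train]), coords[train], rcond=None)
pred = with_intercept(H[test]) @ beta

corr = [float(np.corrcoef(pred[:, j], coords[test, j])[0, 1]) for j in range(2)]
print("out-of-sample correlation per coordinate:", np.round(corr, 2))
```

Because a random projection onto `h` units keeps much of the input's linear structure, the probe recovers the coordinates reasonably well even though nothing was learned.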

PCA as a Yield Curve Interpreter

What are principal components if not model embeddings?

Figure 2: Top chart: The first two principal components of US Treasury yields over time at daily frequency. Bottom chart: Observed average level and 10yr-3mo spread of the yield curve. Vertical stalks roughly indicate the onset (|GFC) and the beginning of the aftermath (GFC|) of the Global Financial Crisis.
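
As a rough illustration of why principal components behave like embeddings, the sketch below runs PCA on a synthetic yield curve and then probes the first two components for level and spread. The data are simulated, not US Treasury yields: the two-factor AR(1) process, the log-maturity slope loading, the noise scale and the helper names `ar1` and `probe_r2` are assumptions made for this example.

```python
import numpy as np

def ar1(T, phi, sigma, rng):
    """Simulate a zero-mean AR(1) series of length T."""
    x = np.zeros(T)
    eps = rng.normal(scale=sigma, size=T)
    for t in range(1, T):
        x[t] = phi * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(1)
T = 2000
maturities = np.array([0.25, 0.5, 1, 2, 3, 5, 7, 10])       # in years
slope_loading = np.log(maturities) - np.log(maturities).mean()

# Synthetic curve (assumption): a level factor, a slope factor and small noise.
level_factor = 3.0 + ar1(T, 0.98, 0.1, rng)
slope_factor = ar1(T, 0.98, 0.1, rng)
yields = (level_factor[:, None]
          + slope_factor[:, None] * slope_loading[None, :]
          + 0.02 * rng.normal(size=(T, len(maturities))))

# PCA via SVD of the centered data; keep the first two components.
centered = yields - yields.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ Vt[:2].T
explained = float((S[:2] ** 2).sum() / (S ** 2).sum())

level = yields.mean(axis=1)                 # average level of the curve
spread = yields[:, -1] - yields[:, 0]       # 10yr minus 3mo

def probe_r2(Z, target):
    """In-sample R^2 of a least-squares probe from Z (plus intercept) to target."""
    Zb = np.column_stack([Z, np.ones(len(Z))])
    coef, *_ = np.linalg.lstsq(Zb, target, rcond=None)
    resid = target - Zb @ coef
    return float(1.0 - resid.var() / target.var())

r2_level, r2_spread = probe_r2(pcs, level), probe_r2(pcs, spread)
print(f"explained variance: {explained:.3f}, R^2 level: {r2_level:.3f}, R^2 spread: {r2_spread:.3f}")
```

The probe uses both components jointly, so it does not rely on each component lining up with exactly one of the two quantities.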

Autoencoders as Economic Growth Predictors

  • We train a neural network with a bottleneck layer to predict GDP growth from the yield curve (Figure 3).
    • Input: UST yields at different maturities.
    • Hidden layer, bottleneck layer, hidden layer.
    • Output: GDP growth.
  • Can we use this for more than just forecasting?
Figure 3: Simple autoencoder architecture.

Autoencoders as Economic Growth Predictors

  • Yes, this can be used for feature extraction and forecasting:
    • Bottleneck layer embeddings predict spread and level of the yield curve.
Figure 4: The left chart shows the actual GDP growth and fitted values from the autoencoder model. The right chart shows the observed average level and spread of the yield curve (solid) along with the predicted values (in-sample) from the linear probe based on the latent embeddings (dashed).

Embedding FOMC comms

  • BERT-based model trained on FOMC minutes, speeches and press conferences to classify statements as hawkish or dovish (or neutral) (Shah, Paturi, and Chava 2023).
  • We linearly probe all layers to predict unseen economic indicators (CPI, PPI, UST yields).
  • Predictive power increases with layer depth and probes outperform simple AR(\(p\)) models.
Figure 5: Out-of-sample root mean squared error (RMSE) for the linear probe plotted against FOMC-RoBERTa’s n-th layer for different indicators.
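
For reference, the kind of AR(\(p\)) baseline that the probes are compared against can be sketched as below. This is a generic least-squares autoregression evaluated by one-step-ahead RMSE on a simulated series; the function names `fit_ar` and `ar_rmse`, the lag order and the data are assumptions, and the actual baseline specification may differ.

```python
import numpy as np

def fit_ar(y, p):
    """OLS fit of y_t = c + sum_i phi_i * y_{t-i} + e_t; returns (c, phi)."""
    target = y[p:]
    lags = np.column_stack([y[p - i: len(y) - i] for i in range(1, p + 1)])
    X = np.column_stack([np.ones(len(target)), lags])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[0], coef[1:]

def ar_rmse(y_train, y_test, p):
    """One-step-ahead out-of-sample RMSE using coefficients fitted on y_train."""
    c, phi = fit_ar(y_train, p)
    full = np.concatenate([y_train[-p:], y_test])   # prepend the lags needed for the first forecast
    lags = np.column_stack([full[p - i: len(full) - i] for i in range(1, p + 1)])
    pred = c + lags @ phi
    return float(np.sqrt(np.mean((y_test - pred) ** 2)))

# Simulated indicator (assumption): an AR(1) process with coefficient 0.8.
rng = np.random.default_rng(7)
T = 2000
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.8 * y[t - 1] + rng.normal()

y_train, y_test = y[:1500], y[1500:]
c_hat, phi_hat = fit_ar(y_train, p=1)
rmse_ar = ar_rmse(y_train, y_test, p=1)
rmse_mean = float(np.sqrt(np.mean((y_test - y_train.mean()) ** 2)))   # naive benchmark
print(f"phi_hat: {phi_hat[0]:.2f}, AR(1) RMSE: {rmse_ar:.2f}, mean-forecast RMSE: {rmse_mean:.2f}")
```

A probe's out-of-sample RMSE would be computed on the same test window so that the two numbers are comparable.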

Sparks of Economic Understanding?

Premise: If probe results were indicative of some intrinsic ‘understanding’ of the economy, then the probe should not be sensitive to random sentences unrelated to economics.

Parrot Test

  1. Select the best-performing probe for each economic indicator.
  2. Predict inflation levels for real (related) and perturbed (unrelated) sentences.
Figure 6: Probe predictions for sentences about inflation of prices (IP), deflation of prices (DP), inflation of birds (IB) and deflation of birds (DB). The vertical axis shows predicted inflation levels minus the probe’s average predicted value for random noise.

As evidenced by Figure 6, the probe is easily fooled.
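
The mechanics of the parrot test can be sketched with toy stand-ins. Everything below is hypothetical: `toy_encode` is a bag-of-words counter rather than FOMC-RoBERTa, the probe weights are set by hand rather than fitted, and the sentences are made up. The probe therefore ignores the subject of the sentence by construction, which only illustrates the procedure and is not evidence about the real model.

```python
import numpy as np

VOCAB = ["inflation", "deflation", "prices", "birds", "rising", "falling"]

def toy_encode(sentence):
    """Hypothetical stand-in for enc(.): bag-of-words counts over VOCAB."""
    tokens = sentence.lower().replace(".", "").split()
    return np.array([tokens.count(w) for w in VOCAB], dtype=float)

# Hand-set probe (assumption): reacts to inflation/deflation wording only.
probe_w = np.array([1.0, -1.0, 0.0, 0.0, 0.5, -0.5])
probe_b = 2.0

def toy_probe(a):
    return float(a @ probe_w + probe_b)

def parrot_test(probe, encode, groups, noise_embeddings):
    """Mean probe prediction per group, minus the mean prediction on random noise."""
    baseline = np.mean([probe(a) for a in noise_embeddings])
    return {name: float(np.mean([probe(encode(s)) for s in sents]) - baseline)
            for name, sents in groups.items()}

groups = {
    "IP": ["Inflation of prices is rising."],
    "DP": ["Deflation of prices is falling."],
    "IB": ["Inflation of birds is rising."],
    "DB": ["Deflation of birds is falling."],
}
rng = np.random.default_rng(3)
noise = rng.integers(0, 2, size=(100, len(VOCAB))).astype(float)

result = parrot_test(toy_probe, toy_encode, groups, noise)
print(result)
```

In this toy setup the bird sentences receive the same baseline-adjusted predictions as the price sentences, because the probe only reacts to surface tokens.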

“We’re fascinated with robots because they are reflections of ourselves.”

Spurious Relationships

Definition: Varies somewhat (Haig 2003), but always conveys that an observed correlation does not imply causation.

  • Humans struggle to tell the difference between random and non-random sequences (Falk and Konold 1997).
  • We rarely expect that patterns hinting at a causal relationship will still arise purely by chance.
  • Even experts perceive correlations of inflated magnitude (Nickerson 1998) and causal relationships where none exist (Zgraggen et al. 2018).

Anthropomorphism

Definition: Human tendency to attribute human-like characteristics to non-human agents and/or objects.

  1. Experience as humans is an always-readily-available template to interpret the world (Epley, Waytz, and Cacioppo 2007).
  2. Motivation to avoid loneliness may lead us to anthropomorphize inanimate objects (Waytz, Epley, and Cacioppo 2010).
  3. Motivation to be competent may lead us to anthropomorphize opaque technologies like LLMs (Waytz, Epley, and Cacioppo 2010).

Confirmation Bias

Definition: Favoring interpretations of evidence that support existing beliefs or hypotheses (Nickerson 1998).

  • Hypotheses in present-day AI research are often implicit, often framed simply as a system being more accurate or efficient, compared to other systems.
    • Failing to articulate a sufficiently strong null hypothesis leads to a ‘weak’ experiment (Claesen et al. 2022).
  • Individuals may place greater emphasis on evidence in support of their hypothesis, and lesser emphasis on evidence that opposes it (Nickerson 1998).

Conclusion and Outlook

  • We call for the community to create explicit room for organized skepticism.
  • Return to the Mertonian norms (communism, universalism, disinterestedness, organized skepticism) (Merton et al. 1942).

Questions?

With thanks to my co-authors Andrew M. Demetriou, Antony Bartlett, and Cynthia C. S. Liem and to the audience for their attention.

References

Bender, Emily M, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the dangers of stochastic parrots: Can language models be too big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23.
Birhane, Abeba, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao. 2022. “The Values Encoded in Machine Learning Research.” In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22).
Claesen, Aline, Daniel Lakens, Noah van Dongen, et al. 2022. “Severity and Crises in Science: Are We Getting It Right When We’re Right and Wrong When We’re Wrong?”
Epley, Nicholas, Adam Waytz, and John T Cacioppo. 2007. “On seeing human: a three-factor theory of anthropomorphism.” Psychological Review 114 (4): 864.
Falk, Ruma, and Clifford Konold. 1997. “Making sense of randomness: Implicit encoding as a basis for judgment.” Psychological Review 104 (2): 301.
Goertzel, Ben. 2014. “Artificial general intelligence: concept, state of the art, and future prospects.” Journal of Artificial General Intelligence 5 (1): 1.
Gurnee, Wes, and Max Tegmark. 2023a. “Language Models Represent Space and Time.” arXiv Preprint arXiv:2310.02207v1.
Gurnee, Wes, and Max Tegmark. 2023b. “Language Models Represent Space and Time.” arXiv Preprint arXiv:2310.02207v2.
Haig, Brian D. 2003. “What is a spurious correlation?” Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 2 (2): 125–32.
Merton, Robert K et al. 1942. “Science and technology in a democratic order.” Journal of Legal and Political Sociology 1 (1): 115–26.
Nickerson, Raymond S. 1998. “Confirmation bias: A ubiquitous phenomenon in many guises.” Review of General Psychology 2 (2): 175–220.
Shah, Agam, Suvan Paturi, and Sudheer Chava. 2023. “Trillion Dollar Words: A New Financial Dataset, Task & Market Analysis.” arXiv Preprint arXiv:2305.07972. https://arxiv.org/abs/2305.07972.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al. 2023. “LLaMA: Open and Efficient Foundation Language Models.” https://arxiv.org/abs/2302.13971.
Trinh, T. H., Y. Wu, Q. V. Le, et al. 2024. “Solving olympiad geometry without human demonstrations.” Nature 625: 476–82. https://doi.org/10.1038/s41586-023-06747-5.
Waytz, Adam, Nicholas Epley, and John T Cacioppo. 2010. “Social cognition unbound: Insights into anthropomorphism and dehumanization.” Current Directions in Psychological Science 19 (1): 58–62.
Weissburg, Iain Xie, Mehir Arora, Liangming Pan, and William Yang Wang. 2024. “Tweets to Citations: Unveiling the Impact of Social Media Influencers on AI Research Visibility.” arXiv Preprint arXiv:2401.13782.
Zgraggen, Emanuel, Zheguang Zhao, Robert Zeleznik, and Tim Kraska. 2018. “Investigating the effect of the multiple comparisons problem in visual analysis.” In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–12.

Quote sources

  • “There! It’s sentient”—that engineer at Google (probably!)
  • “The human mind is a pattern-seeking device”—Daniel Kahneman
  • “We’re fascinated with robots because they are reflections of ourselves.”—Ken Goldberg