May 15, 2026

How do measurements from different sources give us a good understanding of the world?

Current debates on AI seem to mirror the debate between Plato and Aristotle from more than two thousand years ago

Perfumes, colors, and sounds answer one another. — Baudelaire, "Correspondances" (1857)

The parable of the blind men and an elephant dates back to around 500 BCE. In it, a group of blind men touch different parts of an elephant: one feels the trunk, another the tail, a third the tusk, and so on. Each man forms his own internal representation of what the elephant is, yet they are all touching the same animal.

Modern nuclear fusion reactors, specifically tokamaks, present a similar situation. Like the elephant, a single superheated plasma within the reactor can contain whirlpools in one region, smooth currents in another, and rapid fluctuations in yet another. These phenomena unfold on timescales that differ by orders of magnitude, akin to the difference between one day and one millennium. And like the blind men, each diagnostic can observe only one aspect of the plasma at a time, limited by the physics of how it measures.

Traditionally, each sensor has been responsible for measuring a small number of phenomena. That separation is restrictive because plasma phenomena directly influence each other. If the sensors could somehow communicate, they might fill in each other's gaps and even surface information that none of them could pick up alone.

The whole is other than the sum of the parts. — Kurt Koffka, Principles of Gestalt Psychology (1935)

A recent paper in Nature Communications, “Multimodal super-resolution: discovering hidden physics and its application to fusion plasmas,” introduces a model called Diag2Diag that enhances measurements by combining diagnostics. It correlates hundreds of high-frequency data streams from DIII-D, the largest operating tokamak in the United States, to learn their joint structure and predict what a low-frequency sensor would have measured at much higher temporal resolution. Fast magnetic events called edge localized modes (ELMs), which are never observed in the low-frequency signal, show up directly in the upsampled data. In essence, we can argue that Diag2Diag learned a more complete model of this nuclear fusion process by predicting one view from another.

Diag2Diag can directly impute anomalous events
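
The idea of predicting a slow diagnostic from a fast one can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the synthetic signals, the linear relation between the two diagnostics, and the names `make_fast_signal`, `hidden_quantity`, and `fit_line` are hypothetical. The least-squares fit only stands in for a learned model; it is not the Diag2Diag architecture from the paper.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y ~= slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def make_fast_signal(n=200, burst_at=105):
    """Synthetic fast diagnostic: a slow oscillation plus one brief burst."""
    signal = [math.sin(2 * math.pi * t / 50) for t in range(n)]
    signal[burst_at] += 3.0  # stand-in for an ELM-like event (assumption)
    return signal

def hidden_quantity(x):
    """What the slow diagnostic would report (toy assumption: linear in x)."""
    return 2.0 * x + 1.0

STRIDE = 20  # the slow diagnostic samples once every 20 fast samples
fast = make_fast_signal()
slow_times = list(range(0, len(fast), STRIDE))
slow = [hidden_quantity(fast[t]) for t in slow_times]

# Learn the cross-diagnostic relation only where both signals exist...
slope, intercept = fit_line([fast[t] for t in slow_times], slow)
# ...then apply it at every fast time step to upsample the slow diagnostic.
upsampled = [slope * x + intercept for x in fast]

burst_seen_by_slow = 105 in slow_times
```

In this sketch the burst at sample 105 falls between the slow samples at 100 and 120, so `slow` never records it, yet it appears in `upsampled` because the relation learned at the slow time steps is applied at the fast ones.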

This concept isn't unique to fusion. Machine learning research has spent the last decade trying to combine representations across modalities such as text, images, audio, time series, and graphs. The usual recipe is to train an encoder for each modality in isolation and stitch the embeddings together at the end, often by simple concatenation. While this works, it does not seem to reflect how multiple modalities are learned in nature. For example, when we sit in front of a fireplace, we see the glow of its flames, hear the crackling of the wood, smell the smoke, and feel its warmth, all at once; together these give us a holistic concept of the fire. So an interesting question is what happens when modalities are learned jointly from the start.
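
The concatenation recipe can be sketched in a few lines. The two encoders below are hypothetical hand-written feature extractors standing in for separately trained models, and the inputs are made-up numbers; the only point is what concatenation does to the fused vector.

```python
def encode_image(pixels):
    """Hypothetical image encoder: mean brightness and brightness range."""
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]

def encode_audio(samples):
    """Hypothetical audio encoder: mean energy and peak amplitude."""
    energy = sum(s * s for s in samples) / len(samples)
    return [energy, max(abs(s) for s in samples)]

def late_fusion(*embeddings):
    """Fuse by concatenation: each modality keeps its own block of coordinates."""
    return [value for embedding in embeddings for value in embedding]

fire_image = [0.9, 0.8, 0.2, 0.1]   # made-up pixel intensities
crackle = [0.0, 0.5, -0.5, 0.25]    # made-up audio samples
silence = [0.0, 0.0, 0.0, 0.0]

fused_with_sound = late_fusion(encode_image(fire_image), encode_audio(crackle))
fused_silent = late_fusion(encode_image(fire_image), encode_audio(silence))
```

The image block of the fused vector is identical whether or not the fire crackles. Any relation between what is seen and what is heard has to be discovered by whatever model consumes the concatenated vector, since the encoders never saw each other's modality.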

Exciting advances in modern robotics are moving in that direction: multimodal world models now under development fuse inputs from cameras, tactile feedback, proprioception, action, and language into a single unified representation. If the model truly captures the joint manifold of how a scene appears across modalities, it can generate plausible views that were never recorded.

We can argue, though, that multimodality is really just one instance of a more general goal: finding a unified abstraction that holds across different views of the same underlying thing. Yann LeCun's Joint Embedding Predictive Architecture (JEPA) addresses this with the premise that multiple views of the same underlying state should map to nearby points in some abstract representation space. JEPA is elegant in that it sidesteps pixel-level generation, learns through self-supervision, and produces embeddings that retain whatever is predictive across views.
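
A minimal sketch of that premise, under heavy assumptions: the two "sensors", the hand-set encoders, and the identity predictor below are all hypothetical, chosen so the arithmetic is easy to follow. In an actual JEPA the encoders and predictor are trained networks, and training needs additional machinery that is omitted here.

```python
def view_a(state, nuisance):
    """First sensor: mixes the two state variables and adds an unrelated channel."""
    s0, s1 = state
    return (s0 + s1, s0 - s1, nuisance)

def view_b(state, nuisance):
    """Second sensor: rescales the state and adds its own unrelated channel."""
    s0, s1 = state
    return (2.0 * s0, 2.0 * s1, nuisance)

def encode_a(obs):
    """Hand-set encoder for view A (a trained network would have to learn this)."""
    return ((obs[0] + obs[1]) / 2.0, (obs[0] - obs[1]) / 2.0)

def encode_b(obs):
    """Hand-set encoder for view B; like encode_a, it drops the nuisance channel."""
    return (obs[0] / 2.0, obs[1] / 2.0)

def predictor(latent):
    """Maps the view-A embedding to a predicted view-B embedding (identity here)."""
    return latent

def latent_loss(obs_a, obs_b):
    """Squared distance between predicted and actual embeddings, not raw observations."""
    predicted = predictor(encode_a(obs_a))
    target = encode_b(obs_b)
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

# Two views of the same underlying state, with different nuisance values.
same_state = latent_loss(view_a((1.0, 2.0), 0.3), view_b((1.0, 2.0), -0.7))
# Views of two different underlying states.
different_state = latent_loss(view_a((1.0, 2.0), 0.3), view_b((0.0, 0.5), -0.7))
```

Here `same_state` is zero even though the raw observations look nothing alike, while `different_state` is positive. The loss never asks the model to reproduce the nuisance channel, which is the sense in which predicting in embedding space sidesteps detail-level generation.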

But there's an ongoing debate about whether JEPA is really a world model in the strong sense. The critique most often raised is that it can't generate well, since you can't sample plausible futures or counterfactual observations from it directly. For the sake of drawing a dichotomy, we can point to a different camp with a different philosophy. Fei-Fei Li and her collaborators at Stanford and World Labs argue for spatial intelligence, or what Li calls "a new type of generative models," tasked with rendering perceptually and physically consistent worlds rather than embedding them. This resembles a debate from more than two thousand years ago: Plato held that truth lives in the universal Form, Aristotle that it lives only in the particulars that compose it. JEPA leans Platonic; Li leans Aristotelian. The question now is: who is right?

In fairness, the strongest case for JEPA addresses precisely the central worry about the generative camp. When generative models try to render the world in full detail, they face a chaotic cloud of future possibilities and have to commit to one, sometimes the wrong one. We can call this hallucination, and LeCun's argument is that predicting in a learned latent space avoids it entirely. For most domains, this is a compelling tradeoff. But in generative AI for science, we actually want to see fleshed-out examples of a full measurement. In a sense, the two approaches have different goals: JEPA-like world models look for the common thread between all possible events, while generative world models look for the most probable chain of events.

JEPA's tradeoff is attractive because it sidesteps hallucination and learns from huge amounts of unlabeled data. But for certain observations, it may oversimplify. An ELM in a fusion plasma is a high-frequency, locally anomalous feature, exactly the kind of particular that an abstraction built around what views share could smooth away. How to efficiently resolve this tension, between modeling the world as a Platonic abstraction and as a cluster of particulars, is still a work in progress. Maybe this scene from Avengers: Infinity War can provide a clue.

From Avengers: Infinity War (2018), where Dr. Strange sees 14,000,605 possible futures all at once

References

Baudelaire, Charles. Les Fleurs du Mal. Paris: Poulet-Malassis et de Broise, 25 June 1857. "Correspondances" appears as poem IV in the Spleen et Idéal section.
Tittha Sutta (Udāna 6.4), c. 500 BCE. The earliest known version of the blind-men-and-an-elephant parable is found in the Pali Buddhist Udāna, attributed to the Buddha. The familiar English verse adaptation is John Godfrey Saxe, "The Blind Men and the Elephant" (mid-19th century).
Koffka, Kurt. Principles of Gestalt Psychology. New York: Harcourt Brace, 1935. The epigraph uses Koffka's own phrasing — "the whole is other than the sum of the parts" — which he defended explicitly against the popular "greater than" mistranslation often misattributed to Aristotle. Aristotle's nearest passage, Metaphysics Η (8), 1045a 8–10, reads "the whole is something besides the parts."
Jalalvand, A., Kim, S., Seo, J., Hu, Q., Curie, M., Steiner, P., Nelson, A. O., Na, Y.-S., & Kolemen, E. (2025). "Multimodal super-resolution: discovering hidden physics and its application to fusion plasmas." Nature Communications. DOI: 10.1038/s41467-025-63492-1
LeCun, Y. (2022). "A Path Towards Autonomous Machine Intelligence." OpenReview, version 0.9.2, 27 June 2022. https://openreview.net/forum?id=BZ5a1r-kVsf
Li, Fei-Fei (2025). "From Words to Worlds: Spatial Intelligence is AI's Next Frontier." Essay, 10 November 2025. https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence
Chen, N., Bouchiat, K., Steiner, P., Jalalvand, A., Kim, S., & Kolemen, E. (2026). "Towards Large-Scale Heterogeneous Data Organization for Scientific Foundation Models: A Nuclear Fusion Case Study." ICLR 2026 Workshop DATA-FM. https://openreview.net/forum?id=dJX9Eu3LEI
Mei, Z., Yin, T., Shorinwa, O., Badithela, A., Zheng, Z., Bruno, J., Bland, M., Zha, L., Hancock, A., Fisac, J. F., Dames, P., & Majumdar, A. (2026). "Video Generation Models in Robotics — Applications, Research Challenges, Future Directions." arXiv:2601.07823, 12 January 2026. https://arxiv.org/abs/2601.07823

Acknowledgements

Thanks to Zhiting Mei, Yug Oswal, Zander Keith, and Max Bodenstein for suggestions.