New datasets will train AI models to think like scientists

The initiative, called Polymathic AI, uses technology like that powering large language models such as OpenAI’s ChatGPT or Google’s Gemini. But instead of ingesting text, the project’s models learn using scientific datasets from across astrophysics, biology, acoustics, chemistry, fluid dynamics and more, essentially giving the models cross-disciplinary scientific knowledge.

“These datasets are by far the most diverse large-scale collections of high-quality data for machine learning training ever assembled for these fields,” said team member Michael McCabe from the Flatiron Institute in New York City. “Curating these datasets is a critical step in creating multidisciplinary AI models that will enable new discoveries about our universe.”

Today (2 December), the Polymathic AI team has released two of its open-source training dataset collections to the public — a colossal 115 terabytes, from dozens of sources — for the scientific community to use to train AI models and enable new scientific discoveries. For comparison, GPT-3 used 45 terabytes of uncompressed, unformatted text for training, which ended up being around 0.5 terabytes after filtering.

The full datasets are available to download for free on HuggingFace, a platform hosting AI models and datasets. The Polymathic AI team provides further information about the datasets in two papers accepted for presentation at the NeurIPS machine learning conference, to be held later this month in Vancouver, Canada.

“Just as LLMs such as ChatGPT learn to use common grammatical structure across languages, these new scientific foundation models might reveal deep connections across disciplines that we’ve never noticed before,” said Cambridge team lead Dr Miles Cranmer from Cambridge’s Institute of Astronomy. “We might uncover patterns that no human can see, simply because no one has ever had both this breadth of scientific knowledge and the ability to compress it into a single framework.”

AI tools such as machine learning are increasingly common in scientific research, and were recognised in two of this year’s Nobel Prizes. Still, such tools are typically purpose-built for a specific application and trained using data from that field. The Polymathic AI project instead aims to develop models that are truly polymathic, like people whose expert knowledge spans multiple areas. The project’s team reflects intellectual diversity, with physicists, astrophysicists, mathematicians, computer scientists and neuroscientists.

The first of the two new training dataset collections focuses on astrophysics. Dubbed the Multimodal Universe, the dataset contains hundreds of millions of astronomical observations and measurements, such as portraits of galaxies taken by NASA’s James Webb Space Telescope and measurements of our galaxy’s stars made by the European Space Agency’s Gaia spacecraft.

The other collection, called the Well, comprises over 15 terabytes of data from 16 diverse datasets. These datasets contain numerical simulations of biological systems, fluid dynamics, acoustic scattering, supernova explosions and other complicated processes. Cambridge researchers played a major role in developing both dataset collections, working alongside Polymathic AI and other international collaborators.

While these diverse datasets may seem disconnected at first, they all require the modelling of mathematical equations called partial differential equations. Such equations pop up in problems related to everything from quantum mechanics to embryo development and can be incredibly difficult to solve, even for supercomputers. One of the goals of the Well is to enable AI models to churn out approximate solutions to these equations quickly and accurately.

“By uniting these rich datasets, we can drive advancements in artificial intelligence not only for scientific discovery, but also for addressing similar problems in everyday life,” said Ben Boyd, PhD student in the Institute of Astronomy.

Gathering the data for those datasets posed a challenge, said team member Ruben Ohana from the Flatiron Institute. The team collaborated with scientists to gather and create data for the project. “The creators of numerical simulations are sometimes sceptical of machine learning because of all the hype, but they’re curious about it and how it can benefit their research and accelerate scientific discovery,” he said.

The Polymathic AI team is now using the datasets to train AI models. In the coming months, they will deploy these models on various tasks to see how successful these well-rounded, well-trained AIs are at tackling complex scientific problems.

“It will be exciting to see if the complexity of these datasets can push AI models to go beyond merely recognising patterns, encouraging them to reason and generalise across scientific domains,” said Dr Payel Mukhopadhyay from the Institute of Astronomy. “Such generalisation is essential if we ever want to build AI models that can truly assist in conducting meaningful science.”

“Until now, we haven’t had a curated scientific-quality dataset cover such a wide variety of fields,” said Cranmer, who is also a member of Cambridge’s Department of Applied Mathematics and Theoretical Physics. “These datasets are opening the door to true generalist scientific foundation models for the first time. What new scientific principles might we discover? We’re about to find out, and that’s incredibly exciting.”

The Polymathic AI project is run by researchers from the Simons Foundation and its Flatiron Institute, New York University, the University of Cambridge, Princeton University, the French Centre National de la Recherche Scientifique and the Lawrence Berkeley National Laboratory.

Members of the Polymathic AI team from the University of Cambridge include PhD students, postdoctoral researchers and faculty across four departments: the Department of Applied Mathematics and Theoretical Physics, the Department of Pure Mathematics and Mathematical Statistics, the Institute of Astronomy and the Kavli Institute for Cosmology.

