3 June 2026

Stress-testing AI Scientists in parallel universes



At ARIA, we constantly grapple with the bottlenecks that constrain scientific and technological progress. The dawn of the "AI Scientist" – autonomous, closed-loop systems capable of generating hypotheses, designing experiments, and interpreting results – promises a massive acceleration in how we discover knowledge.

Yet, a foundational debate persists: can these models infer genuinely new rules of the universe when granted generous token budgets and computing resources? Or are they fundamentally limited by their training distributions, capable only of uncovering simple derivations of what they have already seen?

To push past speculation, we needed to move beyond traditional benchmarks that merely measure an LLM's capacity to recall existing data. We needed to observe scientific search behaviour in its purest form.

The intervention: Building Albert

Ant Rowstron and Aayush Chadha at ARIA (Advanced Research and Invention Agency) created Albert, a phased multi-agent reasoning layer (in other words, an AI Scientist) built to interact with simulated laboratory instruments. Albert splits scientific workflows among specialised agents: an Explorer for baseline observations, a Theorist for generating risky or conservative hypotheses, a Director to manage the lifecycle, a Technician to execute tool calls, and a Chronicler to compress memory.
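To make the division of labour concrete, here is a minimal sketch of how such roles might hand work to one another. It is not Albert's implementation: the `Notebook` structure, the function names, the linear "hidden law", and the two-point line fit are all assumptions chosen only to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Observation = Tuple[float, float]  # (instrument setting, instrument reading)


@dataclass
class Notebook:
    """Hypothetical shared state passed between the agent roles."""
    observations: List[Observation] = field(default_factory=list)
    hypothesis: Tuple[float, float] = (0.0, 0.0)  # (slope, intercept)
    summary: str = ""


def technician(instrument: Callable[[float], float], x: float) -> Observation:
    # Executes a single "tool call" against the simulated instrument.
    return (x, instrument(x))


def explorer(nb: Notebook, instrument: Callable[[float], float]) -> None:
    # Gathers baseline observations at a few fixed settings.
    for x in (0.0, 1.0, 2.0):
        nb.observations.append(technician(instrument, x))


def theorist(nb: Notebook) -> None:
    # Proposes a conservative hypothesis: a straight line through the
    # first and last observations (an assumption made for this sketch).
    (x0, y0), (x1, y1) = nb.observations[0], nb.observations[-1]
    slope = (y1 - y0) / (x1 - x0)
    nb.hypothesis = (slope, y0 - slope * x0)


def chronicler(nb: Notebook) -> None:
    # Compresses the notebook into a one-line memory.
    slope, intercept = nb.hypothesis
    nb.summary = (
        f"{len(nb.observations)} observations; "
        f"hypothesis y = {slope:g}x + {intercept:g}"
    )


def director(instrument: Callable[[float], float]) -> Notebook:
    # Manages the lifecycle: explore, theorise, run one follow-up
    # measurement at a new setting, then record a compressed summary.
    nb = Notebook()
    explorer(nb, instrument)
    theorist(nb)
    nb.observations.append(technician(instrument, 5.0))
    chronicler(nb)
    return nb


if __name__ == "__main__":
    hidden_law = lambda x: 3 * x + 2  # stand-in for an unknown rule
    print(director(hidden_law).summary)
```

In a real system each role would be a model call with its own prompt and context; plain functions are used here only so the control flow is easy to follow.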

To rigorously isolate reasoning from data recall, we first used Albert to compare how well different frontier models discover rules. Albert was dropped into five virtual worlds (Chemistry, Ecology, Genetics, Physics, and Causal Inference) governed by hidden laws entirely unlike those on Earth. To succeed, the AI had to discover these alien rules independently through trial and error.

To refine Albert's own design, we used another virtual world, an Alloy Discovery Benchmark, in which Albert had to work out the rules for mixing 40 different base compounds to create a specific material with a maximal quality metric. Each experimental iteration incurred a cost penalty, so Albert had to choose when to stop. We then compared how different design choices for Albert affected performance.
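The trade-off between chasing a higher quality score and paying for each experiment can be illustrated with a toy version of such a benchmark. Everything in this sketch is an assumption except the figure of 40 compounds: the additive hidden quality function, the cost value, the random toggling strategy, and the patience-based stopping rule are hypothetical stand-ins, not the benchmark's actual rules or Albert's strategy.

```python
import random
from typing import Callable, FrozenSet, Tuple

N_COMPOUNDS = 40            # number of base compounds, from the article
COST_PER_EXPERIMENT = 0.5   # assumed penalty per experimental iteration

Recipe = FrozenSet[int]


def make_hidden_quality(seed: int) -> Callable[[Recipe], float]:
    # Hypothetical hidden rule: each compound adds or subtracts a fixed,
    # unknown amount of quality. The searcher never sees these weights.
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(N_COMPOUNDS)]

    def quality(recipe: Recipe) -> float:
        return sum(weights[i] for i in recipe)

    return quality


def search(
    quality: Callable[[Recipe], float],
    cost: float = COST_PER_EXPERIMENT,
    patience: int = 5,
    seed: int = 0,
) -> Tuple[Recipe, float, int]:
    # Toggle one randomly chosen compound per experiment. Keep the change
    # only if the gain exceeds the cost of the experiment, and stop after
    # `patience` consecutive experiments that fail that test.
    rng = random.Random(seed)
    recipe: Recipe = frozenset()
    best = quality(recipe)
    experiments = 1
    stale = 0
    while stale < patience:
        candidate = recipe ^ {rng.randrange(N_COMPOUNDS)}
        score = quality(candidate)
        experiments += 1
        if score - best > cost:
            recipe, best, stale = candidate, score, 0
        else:
            stale += 1
    return recipe, best, experiments


if __name__ == "__main__":
    recipe, best, n = search(make_hidden_quality(seed=1))
    net = best - COST_PER_EXPERIMENT * n
    print(f"compounds={sorted(recipe)} quality={best:.2f} "
          f"experiments={n} net={net:.2f}")
```

Varying `patience` or `cost` shows how the stopping decision alone can change the net outcome, which is the kind of design choice the benchmark was used to compare.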

Key insights from the frontier

Our initial adventures with Albert yielded stark, counterintuitive truths about the current state of autonomous R&D systems.

1. Capability is surging, but domain performance is uneven

Plotting the success of frontier models over time reveals a significant spike in reasoning capability over the last six months. However, this capability is far from uniform, varying both across models and virtual world domains. An exciting result.

2. More complex scaffolding is a mixed blessing

For the design of Albert, we tested four configurations, ranging from simple direct tool calls to setups that allowed the model to write its own optimisation code and access long-term strategy memory. Surprisingly, adding complex software scaffolding yielded almost no automatic benefit. An unexpected result.

Charting the next experiment

These are our first experiments, and we plan to leave the main page as a living document that we will update over time. We are actively seeking ideas and collaborations to build better, more robust environments that can push autonomous discovery to its true frontier.

Get the full story


For more information on the work we're doing, explore our AI in Science initiatives and subscribe to our newsletter for the latest updates.