How to Think with AI: Lessons from Stanford's Virtual Lab
Two studies, one week, diametrically opposed conclusions on whether LLMs can help us reason better
Early August was an exciting but confusing time for research into LLMs’ ability to help us reason better. Two major studies were published, and they point in very different directions for the future of this technology. The first, from researchers at Arizona State University, concludes that the reasoning ability of state-of-the-art LLMs is brittle: an illusion of pattern-matching that fails outside the confines of the training data. In the second, a team at Stanford built a multi-agent system for scientific reasoning and project design. This team of agentic scientists didn’t just work; it made a scientific discovery.
On the one hand, we see indications that LLMs have hard limits, with progress reflecting surface fluency rather than deep understanding. On the other, we have actual results—an experiment demonstrates that, when carefully coordinated and instructed, LLMs can generate scientific insight. Together, it’s confusing. AI is fragile and yet still powerful.
So what to make of this? How can both be true? The key lies in the clever design of the Stanford experiment. Their discovery doesn’t erase the charge of brittle intelligence. It reframes it, offering lessons to all of us on how to leverage the strengths of LLMs while still shielding us from their shortfalls. Stanford’s virtual lab can teach us a lot about how to think effectively with LLMs.
Let’s start with the bad news. Researchers at Arizona State University, contributing to a broader line of work on LLM robustness, sought to test the reasoning abilities of these models. They probed whether Chain-of-Thought reasoning, that is, breaking a complex problem into smaller steps and solving those intermediate steps in turn, is evidence of genuine reasoning, as it appears to be, or only the illusion of it.
The team trained smaller GPT-style models from scratch. These models are not as powerful as the big commercial ones, but training them from scratch gave the researchers full control over the training data, and therefore a clear picture of which patterns the models had already seen and which they would have to infer.
They subjected their small models to three stress tests:
Task Shifts: Models were trained on specific rule-based transformations (like shifting letters) and then tested on new or recombined transformations.
Length Shifts: Models were trained on reasoning chains of a fixed length, say, four steps, then tested on shorter or longer ones.
Format Shifts: Prompts were rephrased slightly at test time, with changes to their wording, ordering, or structure.
In each case, the models simply failed to adapt. They choked on new tasks, struggled to adjust to differences in the number of steps in a reasoning chain, and could not rise above cosmetic changes to how a request was stated.
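To make the first of these stress tests concrete, here is a toy illustration in Python. It is my own sketch, not the paper’s actual setup: a model trained only on single letter-shift transformations is then asked to perform a composition of two shifts, a pattern it never saw during training even though every ingredient is familiar.

```python
# Toy illustration of a "task shift" (my sketch, not the ASU paper's exact setup).
# Training data contains single letter-shift transformations; the test task
# composes two shifts, a combination absent from training.

def shift_letters(text: str, k: int) -> str:
    """Shift each lowercase letter forward by k positions, wrapping around."""
    return "".join(chr((ord(c) - ord("a") + k) % 26 + ord("a")) for c in text)

# In-distribution training example: apply a single shift.
train_example = ("apple", shift_letters("apple", 1))       # ('apple', 'bqqmf')

# Out-of-distribution test: shift by 1, then shift the result by 3.
# Each step is familiar on its own, but the composition is new.
test_target = shift_letters(shift_letters("apple", 1), 3)  # 'ettpi'

print(train_example, test_target)
```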
The main takeaway of this study is that LLMs do not reason; they pattern-match. Move them outside of known patterns and their performance quickly degrades. In human terms, they do not “understand” what they are doing and so cannot extrapolate the process they have learned to related contexts. The success we see in bigger models with a wide range of tasks is a “mirage”: it may look like reasoning, but is actually pattern-matching that far exceeds our capacity to do the same.
If LLMs cannot reason, then they should not be much help for making scientific discoveries. And yet, Stanford researchers programmed a team of agentic “scientists” to design nanobodies that could neutralize the SARS-CoV-2 virus. And did they ever! The team of virtual scientists eventually produced blueprints for 92 of these nanobodies, which were then tested by real scientists in the lab for their efficacy and potential to fight COVID-19.
The virtual scientists were not only incredibly prolific but also effective. Over 90% of their blueprints yielded nanobodies that were both expressible, meaning a cell could actually use the genetic code to “express” the protein, and soluble, meaning the protein folded into a stable shape and dissolved in liquid. Both properties are key if the nanobodies are to be used in a drug.
And it all took only a few days. The virtual scientists were prolific, effective, and fast!
The success of this experiment comes down to an important insight: the virtual lab is set up just like a real, interdisciplinary lab. It has an agent designated as the Principal Investigator (PI), whose role is to frame the problem and manage the project. The PI has a team of specialists: agents who take on the roles of biologist, computational scientist, and other experts commonly found on such projects. The specialists can propose tools and methods and introduce ideas that reflect their areas of responsibility and domain expertise.
That’s already interesting! Reasoning is being expressly modelled as a collaborative effort, a process that is shared among many, not internal to a single person.
But here is what I think is noteworthy. The virtual lab has a resident skeptic, a peer reviewer, whose role is to challenge ideas and methods, and to flag errors or potential hallucinations. This departs from the human experience, where quality control is assumed to ultimately rest with the PI but is also shared among specialists. Here, the experimenters explicitly addressed a problem specific to LLMs: they are more error-prone and so need constant monitoring.
Like real scientists, the virtual scientists held “lab meetings.” The agents met to discuss their progress in group sessions as well as in one-on-one meetings between the PI and individual specialists. This iterative workflow created a quick turnaround loop of proposing ideas, critiquing them, and refining what could work.
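For readers who think in code, here is a heavily simplified sketch of that loop. It is not the Stanford team’s implementation, and ask_llm(role, prompt) is a hypothetical helper you would wrap around whatever model you use.

```python
# Highly simplified sketch of a "lab meeting" loop (not the Stanford Virtual Lab's code).
# ask_llm(role, prompt) is a hypothetical helper wrapping whatever LLM API you use.

def ask_llm(role: str, prompt: str) -> str:
    raise NotImplementedError("Wrap your preferred chat model here.")

def lab_meeting(agenda: str, specialists: list[str], rounds: int = 3) -> str:
    # The PI frames the problem and drafts an initial plan.
    plan = ask_llm("Principal Investigator", f"Frame the problem and propose a plan: {agenda}")
    for _ in range(rounds):
        # Each specialist proposes improvements from their own domain.
        proposals = [ask_llm(role, f"As the {role}, improve this plan:\n{plan}") for role in specialists]
        # The resident skeptic challenges the proposals and flags likely errors.
        critique = ask_llm("Scientific Critic", "Flag errors or hallucinations in:\n" + "\n".join(proposals))
        # The PI consolidates proposals and critique into a revised plan.
        plan = ask_llm("Principal Investigator", f"Revise the plan using the proposals and this critique:\n{critique}")
    return plan
```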
The agents did not stop with ideation, however. They developed a series of steps that their human counterparts could execute in a physical lab. This required identifying tools and methods that could take a candidate protein sequence, predict the protein’s shape, and check how well that predicted structure would actually stick to the spike protein of the virus.
For the non-scientist, this may faintly recall AlphaFold, a tool developed by DeepMind to predict a protein’s 3-D structure from its amino acid sequence. (At least, it did for me!) In fact, AlphaFold was one of the tools used in the pipeline. But the Stanford team wasn’t building a predictive model; they were devising a method for using LLMs to reason together about scientific problems.
We can learn a lot from the design choices of the Stanford researchers, even if we aren’t scientists. Foremost is their decision to model reasoning as a team sport rather than a solo activity. The thinking was distributed over multiple agents, with one, the PI, orchestrating the whole. This is an example of what cognitive scientists call distributed cognition, in which thinking is spread across multiple people and tools. The most famous example of this phenomenon is a ship’s navigation crew. No single sailor navigates the ship; cognition emerges from the shared use of charts and instruments and from the coordination of tasks.
We do not naturally default to distributed cognition, and this is where individual users can learn a lot from the virtual lab about how to use GenAI more effectively.
For most of us, agents aren’t necessary. Working alone makes sense, typically with a single chatbot. Even so, we need to recognize that our artificial helpmate isn’t one voice; it is many. We should ask it to take on different personas and perspectives on the same problem, ensuring that we have access to a broad range of expertise during our conversations.
You can do this in a structured way, deliberately imitating the setup of the virtual lab; a rough sketch of such a prompt follows the list below. The key is to use a prompt that:
Assigns multiple roles to the same chatbot.
Spells out a workflow for how those roles interact.
Runs in iterations, improving each round.
Ends with a consolidated output that pulls everything together.
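Here is a minimal sketch of what such a prompt might look like, written as a small Python helper that assembles the text for you. The personas, wording, and example task are my own illustrations, not the Stanford lab’s prompts; the output is plain text you can paste into whatever chatbot you use.

```python
# Minimal sketch of a multi-persona prompt builder (my own wording, not the
# Stanford Virtual Lab's prompts). The output is plain text for any chatbot.

PERSONAS = {
    "Researcher": "gathers relevant facts and market or domain context",
    "Specialist": "proposes concrete methods, tools, or designs",
    "Critic": "challenges assumptions and flags likely errors or hallucinations",
    "Facilitator": "runs the discussion and consolidates the final answer",
}

def build_prompt(task: str, rounds: int = 3) -> str:
    """Assemble one prompt that assigns roles, a workflow, iteration, and a consolidated output."""
    role_lines = "\n".join(f"- {name}: {job}" for name, job in PERSONAS.items())
    return (
        f"Task: {task}\n\n"
        f"Simulate a small team with these roles:\n{role_lines}\n\n"
        "Workflow: each role speaks in the order listed above; the Critic "
        "responds to every proposal before it is accepted.\n"
        f"Run {rounds} rounds, improving the plan each round based on the Critic's feedback.\n"
        "End with a single consolidated output written by the Facilitator."
    )

print(build_prompt("Draft a business plan for a new online course."))
```

Swap in personas that fit your problem and adjust the number of rounds; the structure, not the exact wording, is what does the work.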
I tested this by asking a chatbot to develop a business plan for a coding bootcamp. I gave it a few parameters: what kind of bootcamp, who it would serve, and what it would teach. The chatbot then designed a prompt for me to use with these elements:
Defined personas: in my case, a researcher, CFO, CMO, critic, and facilitator.
Workflow: each persona contributed in order, with the facilitator pulling the pieces together into a draft plan.
Clear goal: a viable business plan, responsive to the context I had given.
Iteration was key, and, as expected, the results improved between rounds as I offered a bit more guidance on real-world conditions and filled in some missing information to better align the output with my expectations.
But if such a structured prompt feels like overkill, you should still adopt one important habit: invite the chatbot to play the role of critic, and do so often. Stress-testing ideas can limit the damage from hallucinations and unproductive approaches, or simply tighten up your thinking.
Finally, returning to the findings of the Arizona team, we need to be mindful of the kinds of problems best suited to GenAI. The Stanford experimenters did not ask the agents to solve a completely novel problem or to engage in blue-sky thinking. Their problem required searching a (huge) design space. The LLMs needed to assemble existing knowledge, filter large amounts of data for useful patterns, and translate that into a process that humans could apply.
It’s a remarkable feat, but none of it requires that LLMs infer patterns outside of their training data. Their reasoning may be brittle, yet in the right context it is very useful. To work with this technology effectively, we must adapt, learning to distribute our thinking over multiple agents or personas, while choosing carefully when such an approach is likely to yield results. The challenge ahead, as I see it, is not to expect LLMs to think like us, but to learn how to think with them.
