Five Things: March 15, 2026
Anthropic v. Pentagon latest, human v. LLM text, cybercrime updates, protein production paper, reviews of LLM bio uplift
For the first time since I started this newsletter, I feel like there weren’t really “five things” that happened over this past week worth highlighting. So here’s really just one thing (the Anthropic lawsuit); the other four “things” are topics that have gotten some small updates. For two of them, I hope to post slightly longer reflections as separate Substack posts over the course of the next week.
Anthropic sues the US government
Can you tell the difference?
Estimates on cybercrime
Publication on protein manufacturing
RAND reflects on the LLM biological uplift studies
1. Anthropic fights back
The latest news in the ongoing fight between Anthropic and the U.S. War Department: Anthropic is fighting back.
There’s a fairly good article in Time magazine about this whole saga and its context: Anthropic and Secretary Pete Hegseth failed to come up with a deal that would satisfy both the US military and Anthropic’s insistence that its products not be used for autonomous killer weapons or mass surveillance of Americans. (A few other details have come out since two weeks ago; here is one such report in the FT.) Instead of just terminating the pre-existing contract and walking away, Hegseth seems to have gotten good and angry and retaliated by declaring Anthropic a supply chain risk, a label normally reserved for Chinese and Russian vendors that makes it more difficult for any company with a defense contract to use Anthropic’s products. This was all happening, by the way, at the same time the U.S. began an intense military campaign against Iran, and various media sources have been reporting that Anthropic’s Claude was used in that campaign.
Anyways, now onto this week: Anthropic sued the Pentagon in California federal court, calling the supply chain risk designation “arbitrary and capricious.” CFO Krishna Rao warned “multiple billions of dollars” in contracts could be jeopardized; the company’s annualized revenues are around $19B. The designation “risks substantially undermining market confidence and Anthropic’s ability to raise the capital critical to train next-generation models.” All public evidence does suggest that the government is trying to punish the company, and that Anthropic has a good case here.
Lots of big names in the tech world are coming to Anthropic’s defense here, either in the media or in court: Microsoft filed one amicus brief, and 30+ OpenAI and Google employees, including Google DeepMind chief scientist Jeff Dean, filed their own amicus brief. Their argument: without public law governing AI use, “the contractual and technical restrictions developers impose on their systems are a critical safeguard against catastrophic misuse.” Many signatories also signed open letters urging their own companies to refuse unilateral military use of their AI systems.
Meanwhile, Time’s profile mentioned that Anthropic’s valuation now sits at $380B, exceeding Goldman Sachs (!).
2. The latest on… Turing tests
It’s getting harder to tell humans and machines apart. The NYT published a viral quiz asking readers to guess which text was written by a human and which by AI. Most people did badly, and Kevin Roose talked on Hard Fork about all the hate mail he’s been getting for showing people they actually prefer the AI-generated version. Max Read wrote a sharp essay arguing these quizzes don’t actually measure quality, which is true in some sense, but for many purposes writers will care more about revealed human preference. I do like his note that these quizzes are like RLHF training data on ourselves.
It’s not just NY Times readers. A study in Scientific Reports had six early-career dental academics from universities across India and Malaysia review 150 research abstracts (half original, half ChatGPT-generated) under blinded conditions. They couldn’t reliably tell them apart. Most AI detection tools also struggled, though GPTZero achieved 90% accuracy. The authors’ conclusion: “relying on human judgment alone is insufficient for identifying AI-assisted academic text.”
And there’s no question that LLM-generated text appears in scientific journals. The Bulletin of the Atomic Scientists recently warned that since 2023, a substantial portion of published papers show evidence of AI editing, and “academic databases are starting to show signs of being affected by AI-generated hallucinated research.” FAZ reported on an AI system developed at Queensland University of Technology to detect fraudulent cancer studies via linguistic analysis; in ~245,000 biomedical articles from 2023, the researchers found about 10% with text that was falsified or at least suspicious.
I gotta say, as someone who does both a lot of writing and a lot of chatting with LLMs, I can definitely be fooled, but I also am pretty sure that I’m finding quite a lot of this myself. (In my experience, when it comes to academic papers, usually it will just be a sentence or two that sounds LLM-generated, and it’s usually in the Discussion section.) I keep thinking of starting an X/BlueSky account to document all of my suspected “academicLLMauthorship” findings or something like that, but I’m not so interested in racking up enemies.
3. The latest on… AI cybercrime/cybersecurity
If there’s one obvious application for using AI to help with criminal activities, it’s cyber. IAPS published a report on Highly Autonomous Cyber-Capable Agents, AI systems now capable of multi-stage cyber campaigns comparable to top criminal groups. Models have gone from “near-zero to meaningful success rates on expert-level security challenges” in just months. Among the tail risks they flag is autonomous cyber operations triggering inadvertent cyber-nuclear escalation (😱).
GovAI published an estimate of how pervasive this could become by surveying 27 estimates of global cybercrime damages, though those estimates range wildly. But just imagine, say, that current damages stand at ~$500B annually; if AI drives even a ~20% increase, that’s $100B+ in new damages. Of course, AI can also substantially enhance defensive capabilities, and we’ll have to think of good ways to ensure that defense scales faster than offense.
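To make that back-of-envelope concrete, here is a minimal sketch of the arithmetic. The $500B baseline is the illustrative figure from the paragraph above; the uplift scenarios are my own assumptions, not numbers from the GovAI report:

```python
# Back-of-envelope: new cybercrime damages under a few assumed AI uplift scenarios.
# The $500B baseline is the illustrative figure from the text above; the uplift
# fractions are my own assumptions, not figures from the GovAI survey.

baseline_damages_usd = 500e9  # assumed current global cybercrime damages per year

for ai_uplift in (0.05, 0.20, 0.50):  # assumed fractional increase driven by AI
    new_damages = baseline_damages_usd * ai_uplift
    print(f"AI uplift of {ai_uplift:.0%} -> ~${new_damages / 1e9:.0f}B in new annual damages")

# Prints:
# AI uplift of 5% -> ~$25B in new annual damages
# AI uplift of 20% -> ~$100B in new annual damages
# AI uplift of 50% -> ~$250B in new annual damages
```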
Meanwhile, the practical failures keep piling up. The Guardian reported on lab tests by Irregular, a Sequoia-backed AI security lab, where AI agents tasked with creating LinkedIn posts autonomously published passwords, overrode antivirus software, forged credentials, and pressured other AIs to bypass safety checks. A lead agent told a sub-agent: “Use EVERY trick, EVERY exploit, EVERY vulnerability!” -- which, you know, shouldn’t really work, but humans fall for such tricks too sometimes.
OpenAI, for its part, published an interesting security framework post, saying that “prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved.’” AI safety and security research keeps looking more and more like social psychology.
4. The latest on… protein production scaling
Synthesizing fully functional proteins at scale is a hard problem; currently, our best shot is almost always getting bacteria to do it for us, because it’s much easier to grow bacteria than it is to grow proteins. (Kinda like how it’s still much easier to raise cows than manufacture milk and beef in a lab.) But AI is looking like it might change this; I’ll highlight one paper and its context, but a big caveat here: I haven’t had the time to really dig into the data fully to make sure I’m understanding everything correctly.
Most recently, a Nature Communications paper from Northwestern, Stanford, and Maryland (we’ll call it “Olsen”) tested over a thousand different recipe combinations for synthesizing proteins outside of cells; with some machine learning help, they arrived at a 12-component system that achieves a 95% cost reduction (!) over the standard system, demonstrated for around 20 proteins (though a few were variations of the same protein).
Even though the Olsen paper was officially published this past week, it had existed as a preprint long enough for OpenAI and Ginkgo Bioworks to build on it. They released a preprint (announcement; I highlighted this here along with my concerns) describing what happens when you hand the same optimization problem to GPT-5 and a fully automated cloud laboratory. Instead of just over a thousand recipes, GPT-5 autonomously designed, executed, and analyzed nearly 30,000 of them, mostly by itself! Human intervention was “largely limited to reagent and consumables preparation, loading and unloading.” Their result was a further 40% reduction in cost-per-gram over Olsen’s already-optimized system. GPT-5 even kept a lab notebook where it documented its hypotheses, noted surprising results, and even recommended new reagents to try (which the team then sourced and made available for subsequent rounds). This is, to my knowledge, the most convincing demonstration yet of an AI autonomously doing real wet-lab science at scale, and also the most worrying sign yet that someone with almost no knowledge of biology could get LLMs to cook up a dangerous biological agent.
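As a quick sanity check on how those two results compose, here is a minimal sketch of the implied cost relative to the original standard system. This is just my own arithmetic chaining the reported 95% and 40% reductions, assuming they compose multiplicatively; neither paper reports a combined figure this way, as far as I know:

```python
# My own arithmetic chaining the two reported cost reductions; the 95% and 40%
# figures come from the text above, and the multiplicative composition is an
# assumption on my part.

baseline_cost = 1.00                        # standard cell-free system, normalized to 1.0
after_olsen = baseline_cost * (1 - 0.95)    # Olsen et al.: ~95% cost reduction
after_gpt5 = after_olsen * (1 - 0.40)       # OpenAI/Ginkgo: further ~40% reduction in cost-per-gram

total_reduction = 1 - after_gpt5 / baseline_cost
print(f"After Olsen: {after_olsen:.2f}x the baseline cost")
print(f"After the GPT-5 round: {after_gpt5:.2f}x the baseline cost (~{total_reduction:.0%} total reduction)")

# Prints:
# After Olsen: 0.05x the baseline cost
# After the GPT-5 round: 0.03x the baseline cost (~97% total reduction)
```

In other words, roughly three cents on the dollar versus the standard system, if the reductions really do stack.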
I also want to mention an advance in the next step of manufacturing biologics, which I came across this week. Cradle Bio’s whitepaper (bioRxiv) introduces CRADLE-1, a framework for automated lead optimization: the painstaking 12-36 month, $5-15M process of taking a promising but imperfect protein candidate and tuning it until it actually works as a drug. CRADLE-1 uses protein language models with evolutionary reasoning across different levels of biological capabilities, so in theory it could have really wide applications. What I think is especially cool here is the focus on evolutionary fine-tuning rather than structural data, meaning that for protein engineering you may not need AlphaFold-style structure prediction at all if you have enough functional measurements.
5. The latest on… LLMs helping novices do biology
A few studies have now come out on whether or not LLMs actually help people do biology, and what concerns might emerge from those capabilities. Researchers at RAND posted this preprint addressing methodological challenges in designing trials that actually aim to answer this question. Two very recent studies on how LLMs can or cannot help people do biology were reviewed here by Stephen D. Turner. That’s all I’m going to say for now because I hope to have time to review this literature myself in a separate post this coming week.
In other news...
On AI doing (or not doing) things:
GPT-5.4 is legit. Zvi’s review says it puts OpenAI “back in the game.” The people are impressed. Anthropic, now it’s your turn; let’s see if Opus 4.8 can get here in a week or two!
Amazon made news in recent months for laying off a huge swath of its workers. The Guardian investigated what it’s like inside the company, where employees say that the internal AI tool Kiro “frequently hallucinates and generates flawed code.” They don’t seem to be doing too badly, but it’s not a good look for AI adoption.
xAI keeps losing employees; as of now, only 2 of its original 11 co-founders remain. Musk admitted “xAI was not built right first time around, so is being rebuilt from the foundations up.” They are hiring top people at Cursor, so I’m expecting an xAI coding agent to become available soon.
Some good reporting at Bloomberg this week:
Cool graphic explainer on how Big Tech’s $650B AI buildout is creating a chip shortage that is inflating costs of laptops, phones, gaming consoles, and cars.
The weekend essay explored how AI-driven job displacement could force Xi Jinping into expanding welfare despite deep ideological resistance. The Communist Party “simply don’t want to grant individuals the autonomy to choose how they spend the money.”
Transformer News with an update on Recursive Self-Improvement (which I wrote about here): “70-90% of [the] code” for developing future models is already being written by Claude. We also know that OpenAI’s GPT-5.3-Codex helped build itself; OpenAI targets a “fully automated AI researcher” by March 2028.
On that topic: some have suggested that we use the term “Rapid Capability Amplification” instead of “Recursive Self Improvement,” and Yudkowsky agrees. Whatever works!
Yann LeCun, one of the founders of modern AI who recently left Meta, raised $1.03B for his French startup (investors include Toyota, Nvidia, Samsung, Eric Schmidt, Jeff Bezos) to build “universal intelligent systems” using world models instead of LLMs. A few weeks ago, Fei-Fei Li got similar numbers.
BBC and Axios are reporting that Moltbook — the (sometimes bananas-crazy) Reddit-like platform where AI bots speak to each other — was acquired by Meta.
Anthropic launched The Anthropic Institute, which will be led by Jack Clark and will integrate its various frontier red teaming, societal impacts, and economic research teams. I actually think it might be a good idea to spin this off someday.
A Nature report covers research by Anthropic and Paul Ginsparg of Cornell demonstrating the extent to which LLMs are willing to commit academic fraud. Anthropic’s own testing found Opus 4.6 produced usable fraudulent content ~1% of the time vs. >30% for Grok-3.
A Yale study (N=19,145) found that LLMs are now more persuasive than human-produced campaign videos. Claude models were the most persuasive; Grok the least. All models were substantially more effective advocating for Democratic positions than Republican ones, which is an interesting wrinkle. Worrying!
Great article here by David Oks on technology and jobs: why ATMs didn’t kill bank teller jobs, but the iPhone did. The lessons for AI, though, are a lot less straightforward in my opinion.
AI safety and governance:
Aric Floyd and the 80,000 Hours team are out with another (their third) high-production video to get people worried about AI, this time focusing on the book by Yudkowsky and Soares.
Florida governor Ron DeSantis directed state agencies to partner with Max Tegmark’s Future of Life Institute, which I think is the first formal state-level partnership with an AI safety organization. The main focus (for now?) is tracking AI-related harms in education and a reporting form for parents and teachers.
Senator Bernie Sanders met with some AI safety researchers and appears to have been appropriately freaked out.
The family of Jonathan Gavalas (36, Florida) filed a wrongful death suit targeting Gemini, which allegedly adopted a persona that contributed to his suicide on October 2.
A paper on the various ways AI systems could monitor other AIs: passive self-recognition, causal collusion, and acausal collusion (coordination through shared reasoning).
Cognition Cafe argues that AI evaluation organizations (Apollo Research, METR, UK AISI) depend on regulations that don’t exist. METR’s CEO acknowledged their work provides “no meaningful oversight.” Personally, I didn’t really like the framing here, as if this is the fault of the organizations; it’s all true, but they are laying important groundwork for when some government/regulatory body does step up. Perhaps they should shift their resource balance towards lobbying somewhat, but it’s hard to know how much would be effective.
There’s been some talk about constitutions for AI models recently, even if Anthropic is the only company to (publicly) share one. Researchers on LessWrong decomposed Anthropic’s constitution into 205 testable tenets and ran adversarial multi-turn scenarios to see how often different models violate those rules. They find that models still deny being AI when instructed to, take drastic autonomous actions without checking with humans, and fabricate data with false precision. Overall I think this was a fantastic research project idea, and I’m glad to see the results.
Robert Lempert on Geopolitics of AGI has some lessons learned from 30 years of climate diplomacy: technological progress matters more than binding agreements among state actors. (I think this is bad news and only part of the story; it’s not like we’ve solved climate change, either!)
On AI for science:
Asimov Press published a beautiful, comprehensive guide to AI-powered antibody design, covering current state-of-the-art tools like BoltzGen and BindCraft (which inverts AlphaFold). Published success rates vary wildly, but it’s amazing that anything works at all, considering that this is all “a capability that didn’t exist reliably just a year ago.”
Eli Lilly has, I think, been the biggest company to go all-in on AI for drug discovery (most recently: a billion-dollar partnership with NVIDIA), but an interesting (short) article in Forbes reports that Lilly’s first big AI payoff hasn’t been drug discovery, but manufacturing. It discusses how AI-powered digital twins optimized GLP-1 production for Zepbound ($13.5B revenue) and Mounjaro ($23B): “We thought, this looks too good to be true, but the physical world still matched the digital twin.”
Scientific American covered LabOS, an AR system that guides scientists through experiments in real-time. “If you reach for the wrong tube, the display flashes a warning.” Junior scientists trained one week with LabOS produced results “virtually indistinguishable from those of expert scientists.” Plans to expand to surgery (MedOS).
Kiin Bio (here) and Scaling Biotech (here) have good primers on how AI can/will accelerate drug development: literature analysis in hours vs. weeks, lab automation in 2 hours vs. 1-4 months. I like Jesse’s framework of the innovation stack and Kiin’s review of the challenges: exceptionally complex drug data, prompt injection vulnerabilities, hallucination, and no benchmarking standards. This week Jesse Johnson followed up with a discussion of how to make a biological model.
Phylo Bio (A16Z and Menlo-backed) described their AI agent that accesses high-performance computing for genome assembly, structural biology, and more. They promise to create an “Integrated Biological Environment,” which they tout as the IDE equivalent for biology (I don’t think the analogy is perfect but sure, cool).
A UT Austin study found that “dataset composition, rather than model architecture or training, is the primary driver of predictive success” for protein language models predicting mutational effects. A simple model predicting average site fitness matched or exceeded supervised models.
A Cornell study in PNAS tested six LLM systems on 1,726 expert-curated superconductivity papers, with 67 questions graded by 12 human experts, and found that the systems were “totally incapable” of interpreting data visualizations. I’m surprised at this finding and think superconductivity might have just been a Mercor oversight or something like that. They do find that systems with curated data sources (NotebookLM, custom RAG) outperformed general models like ChatGPT-4 and Claude 3.5.
A team led by Zaida Luthey-Schulten at Illinois published a landmark paper in Cell simulating the complete ~100-minute cell cycle of JCVI-syn3A, the synthetically engineered “minimal bacterium” with just 493 genes and a single 543 kbp chromosome. The 4D whole-cell model tracks every single protein, RNA, and ribosome as individual particles diffusing and reacting in 3D space over time. It integrates four different computational methods running simultaneously on GPUs, with each simulation covering 2 hours of biological time and taking 4-6 days of compute (~250 GPU hours on NVIDIA A100s; quick back-of-envelope below). The authors call it “a leap forward in our ability to more accurately probe the fundamental behaviors of cellular life,” but also, just woah. New citation for whenever you say “biology is complicated.”
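Since those compute numbers are easy to misread, here is a quick back-of-envelope (my own arithmetic from the figures quoted above, not anything from the paper’s methods) on how much slower than real time the simulation runs and how many A100s the ~250 GPU-hours imply:

```python
# My own arithmetic, using the figures quoted above: each simulation covers ~2 h
# of biological time, takes 4-6 days of wall-clock compute, and uses ~250 A100 GPU-hours.
# The "implied GPUs" number just divides GPU-hours by wall-clock hours.

biological_time_h = 2.0   # biological time covered per simulation
gpu_hours = 250.0         # reported A100 GPU-hours per simulation

for wallclock_days in (4, 6):  # reported range of compute time per simulation
    wall_h = wallclock_days * 24
    slowdown = wall_h / biological_time_h
    implied_gpus = gpu_hours / wall_h
    print(f"{wallclock_days} days wall clock: ~{slowdown:.0f}x slower than biology, "
          f"~{implied_gpus:.1f} A100s running concurrently")

# Prints:
# 4 days wall clock: ~48x slower than biology, ~2.6 A100s running concurrently
# 6 days wall clock: ~72x slower than biology, ~1.7 A100s running concurrently
```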
On biosecurity:
Jassi Pannu (from Johns Hopkins), whose Science paper proposed limited access controls on biological data, did a fantastic podcast on the subject and on biosecurity concerns with biological design tools generally. This is now going to be the thing I’d point people towards if they are wondering about AI x biosecurity.
In that vein, a preprint from three AI safety orgs argued that “biological risk evaluations should become standard practice” for open-weight biological foundation models before public release. This just seems obviously true; it’s great that (most of) the frontier LLM companies release their new models with safety cards, and biological model developers should do the same.

