Five Things: April 19, 2026
Mythos and Glasswing, AI medical advice, GPT-Rosalind, bio-AI models and risks
[Note: this newsletter is coming after a two-week vacation, so I had a bit more catching up to do.]
Five things that happened/were publicized these past few weeks in the worlds of biosecurity and AI/tech:
Project Glasswing: Anthropic’s most dangerous model gets a short leash
Two new papers on AI and clinical decision-making
OpenAI has a new scientist for a select few partners
Biodesign papers, new models, and evaluations (or lack thereof)
A new mini-review on AI x biorisk
1. Glasswing takes flight
At the end of March we heard that a leak from Anthropic revealed a model they were calling “Claude Mythos” which allegedly presents “unprecedented cybersecurity risks.” Since then, that model has been launched in a very carefully managed rollout and doesn’t look like it will be public for a little while. Considering their claims that Claude Mythos would let anyone find millions of zero-day cybersecurity exploits in critical infrastructure (confirmed to some extent by the UK’s AISI), this seems like a good idea. Project Glasswing is Anthropic’s initiative to make “Claude Mythos Preview” available to a small group of twelve technology partners (including AWS, Apple, Google, and Microsoft) specifically so they can find and patch software vulnerabilities before bad actors can exploit them, giving defenders a head start on using this new model.
This is the big story of the month so there’s a lot out there; one of my favorite mainstream media pieces was Bloomberg’s feature on how Anthropic discovered these capabilities and reached this decision. Anthropic’s Frontier Red Team concluded that Mythos had crossed the threshold from “helps humans hack” to “can autonomously chain vulnerabilities without human guidance,” and escalated the issue to a national-security-level conversation inside the company. As usual, I also gain a ton from reading Zvi Mowshowitz’s detailed analysis of Mythos, this time in three parts: the system card, the cybersecurity concerns, and other capabilities. Claude Mythos is “well aligned,” in that it shows a strong reduction in harmful cooperation, but it also recognizes it’s being tested in approximately 29% of interactions, and shows small but nonzero success rates on covert evasion tasks (3.8% on SHADE-Arena stealth tasks, up from near-zero previously). These last points seem kind of terrifying, as in, it’s hard to imagine what is on the other side of when the models are smarter than our ways to test them. Eager to not be left out of the game, this week OpenAI also launched their own cyber-focused model to a limited group, explicitly positioned as competition with Mythos.
Obviously there’s an important government angle here. The White House Office of Management and Budget circulated guidance setting up “guardrails” so major federal agencies could begin using Mythos to probe their own systems. This followed a meeting between Anthropic CEO Dario Amodei and White House chief of staff Susie Wiles, per the FT. Just as a reminder, Anthropic is also suing the Department of War to block its designation of the company as a supply chain risk. (Latest on that front by the way, is that the DC Circuit is much more deferential to the govt but so far allowing Anthropic to remain in business, so to speak; full hearing yet to come.) So yes, the US government is simultaneously trying to block Anthropic from all federal contracting and negotiating access to Anthropic’s most dangerous model for all of their most critical functions.
How does this end? Dean Ball has a poetic but vague piece on the rough future ahead; we are in the era of an AI arms race for cybersecurity. Less poetically:
Andy Hall at Free Systems argues Project Glasswing could eventually evolve into something meaningful, comparable to INPO (the nuclear industry’s Institute of Nuclear Power Operations) or UL (the industry self-regulating body) and he lays out the conditions that would allow this to happen. I’ll add that the Frontier Model Forum already exists (but I haven’t seen any reporting on whether or not they have been involved in the controlled Mythos rollout) and groups like AVERY and Fathom have some good ideas about some kinds of regulation we might be able to see through this kind of coordination. But there are major concerns about democracy here -- ideally we’d like the people’s elected representatives to be involved, but also, ideally we’d have representatives who we can trust to do a good job with this.
Unrelatedly, Anthropic did release Opus 4.7, which, by all accounts, is 0.1 ‘units’ better than Opus 4.6 — but refuses to share biological information much more, according to Stephen Turner. I haven’t really tried to ask it straight-up bioengineering questions, but I’ll note that it had no qualms in helping me understand the papers and write the section for Thing #4, below.
2. AI in the health clinic: two slightly negative updates
Two papers out this month tested how well current models actually do when asked to be quasi-clinical, and the results are not so great.
First, Tiller et al. in BMJ Open audited five of the most popular public-facing chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) on 50 prompts across five misinformation-prone domains (cancer, vaccines, stem cells, nutrition, athletic performance). About half (49.6%) of the 250 total responses were judged “problematic,” with Grok producing significantly more highly problematic responses than chance would predict (but you would probably predict that correctly). Interestingly, the LLMs were weaker on stem cells, nutrition, and athletic performance than they were on vaccines.
When it comes to actual clinical practice, Rao et al. in JAMA Network Open (out of Mass General Brigham) introduce the PrIME-LLM score, a multidimensional benchmark that evaluates LLMs stepwise across the full clinical workflow (differential diagnosis → diagnostic testing → final diagnosis → management → miscellaneous reasoning), rather than as one-shot multiple choice. They ran 21 frontier LLMs through 29 standardized clinical vignettes. The models actually did not do so badly on the final diagnosis, but the authors still didn’t love the results:
Clinicians preserve uncertainty and iteratively refine differential diagnoses, whereas LLMs collapse prematurely onto single answers, a limitation that persists across model generations… [benchmarks] that reward only correct final answers risk reinforcing this shortcutting, widening the gap between marketing claims and the skills actually required at the bedside.
So even though frontier models get the final diagnosis right ≈ 90% of the time, they have trouble with multiple hypotheses and reasoning.
3. GPT-Rosalind
This week OpenAI launched GPT-Rosalind, a frontier reasoning model series designed specifically for life-sciences workflows. The name is presumably a tribute to Rosalind Franklin, the crystallographer whose X-ray diffraction work contributed crucially to discovering DNA’s double-helix structure, though many believe that she was never given the recognition that she deserved. (So… is OpenAI claiming that human scientists can let GPT-Rosalind do all the work but then claim the credit for themselves? Hm…) Initial access is tightly gated: a “trusted-access program” with eligibility and governance requirements. Bloomberg reports Amgen, Moderna, and the Allen Institute are among early users.
The most impressive benchmarks that OpenAI released are less about AI-specific tasks and more like “acting as a real scientist”; the largest gains were on CloningQA, which tests how well an AI can design the entire pipeline for a molecular cloning task. Also interesting is a partner evaluation with Dyno Therapeutics, which tested the model’s predictions on unpublished sequences it was never trained on and found it to be much better than the majority of human experts and RNA prediction tests (but from what I can gather, these are the types of tasks that I would have expected an AI to be better at than a human; we just haven’t had such good models yet).
4. Those new biological models
As every week, there have also been a bunch of more specialized biological models published this week, and I’ve been interested in spending more time looking through those papers. In the abstract, there are some big-picture takeaways on how we are making progress by giving more biological context to bigger AI models (and in general, I’m always curious about how bio-only models pair up with just increasing model size/power; a version of Sutton’s Bitter Lesson). [Note: the next few paragraphs were written mostly by Opus 4.7 to help me synthesize a big group of papers but I think the organization and summaries hold up from my own human-skimming of the same]
UCSD’s EvoLen argues DNA tokenization should reflect evolutionary constraint, not linguistic convention: it groups sequences by cross-species conservation rather than borrowing byte-pair encoding from NLP, and matches or outperforms standard approaches across diverse benchmarks. DefensePredictor, out in Science, leverages ESM2 protein-language-model embeddings to find bacterial defense systems that plain sequence homology misses. The authors experimentally validated 42 of these hypotheses in E. coli, and across 1,000 diverse prokaryotic genomes, they flagged nearly 3,000 protein clusters with no homology to known systems, suggesting “a vast, uncharacterized defense repertoire.” And Stanford’s GATSBI improves protein embeddings via graph attention networks, encoding proteins as graphs of residue relationships rather than flat sequences. Different domains, same lesson: biology-aware structure beats the NLP-borrowed pipeline, and it does so especially on the things generic methods weren’t even seeing.
A second cluster focuses on generative and predictive models across new biological modalities. Stanford and KTH’s ProtiCelli trains generative models on 1.23 million immunofluorescence images to simulate microscopy patterns for 12,800 human proteins (30.7 million synthetic images, now in the Human Protein Atlas); MIT and Broad’s StriMap fuses structural information with deep learning to predict TCR–peptide–HLA interactions and screened 13 million bacterial peptides to identify molecular mimics linked to ankylosing spondylitis and IBD. And LaMGen, in Nature Communications, is a multi-target drug-design framework trained on MTD2025 — 600,000+ quantum-accurate molecular conformations and 700,000+ multi-target associations. (That sounds like a lot until you remember that frontier LLMs are trained on something like 20–40 trillion tokens; humanity has vastly more text than quantum-verified molecular conformations, and that asymmetry is part of why “just scale it” is harder to pull off in biology than in language.)
Finally, a few new releases in the less glamorous but more load-bearing category of infrastructure, benchmarks, and the field’s own self-critique. FAMSA2, from a Polish group in Nature Biotechnology, is ~400× faster than existing multiple-sequence-alignment tools while matching or exceeding their accuracy — aligning a ~3-million-sequence ABC transporter family in 5 minutes with 18 GB of RAM. Not AI itself, but fast-enough MSA is the quiet plumbing that makes evolutionary analysis tractable at the scales modern protein foundation models now consume. Cambridge’s LAS3R is the complementary hardware move: a cheap Raspberry Pi + ESP32 + MQTT-over-TLS framework for wet-lab automation, aimed at biologists without coding expertise — sibling to the agentic home bioreactor project I flagged earlier, and another sign that this is becoming a real tooling category.
On the evaluation side, a Yale/Berkeley team released DrugPlayGround, a benchmark covering LLMs and their embeddings across four drug-discovery tasks (drug function, drug–target interaction, synergy, perturbation); across Claude Sonnet, DeepSeek-V3, GPT-4o, Gemini 1.5 Pro, and Mistral-Large, GPT-4o is strongest on text generation, but the useful finding is that general-purpose LLMs still don’t universally beat purpose-built deep-learning models on drug-property prediction — they just offer easier orchestration across the whole R&D pipeline. And a generative-protein-design survey from Wanasekara et al. is sharp on the field’s failure modes, calling out inconsistent evaluation standards across neural representations, generative architectures (SE(3)-equivariant diffusion, flow matching), and task formulations — and explicitly flagging biosecurity dual-use as a key open challenge that the community has yet to develop safety frameworks for.
5. Mini-review on AI Biodesign risk
A new mini-review in Frontiers in Microbiology lays out the dual-use problem of molecular/biological design tools. While these generative AI tools might enable faster drug discovery, novel therapeutics, and better vaccines, they also expand the design space for harmful proteins in a way that existing safety pipelines weren’t designed to catch. One specific failure mode that the authors discuss is where AI-designed proteins could be functionally toxic while sharing very little sequence similarity with known toxins. Most biological safety screening is homology-based (you compare a new sequence against databases of known dangerous sequences and flag close matches), and so a protein engineered by AI to not resemble anything in the database, but to still fold into a biologically active toxic configuration, would pass through that filter. I agree with them that this is a theoretical possibility, since the ‘natural’ space of molecules that exist thanks to evolutionary processes is waaaay smaller than the entire space of all possible shapes that might fit into those biomolecules and break/disrupt them. I doubt that any current AI models are good enough at finding those molecules… for now. The authors put previous proposals in a little table, but they didn’t dig into the most recent work looking at the costs and benefits of any of these approaches.
I think there’s been some recent development on this front that they could have dug into but it’s still good to have these mini-reviews to put the major papers in one place (even if I think they have some oversights here).
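To make the homology-screening blind spot concrete, here is a deliberately toy sketch of what "compare against a database and flag close matches" means. Everything in it is my own assumption for illustration: the sequences are made-up placeholder strings (not real proteins), the k-mer overlap score is a crude stand-in for the alignment tools real pipelines use, and the 0.5 threshold is arbitrary. The point is only that a similarity filter says nothing about a sequence that looks unlike anything in the database, whatever that sequence might actually do.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of k-mer sets (a crude stand-in for sequence identity)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def screen(query: str, database: Dict[str, str],
           threshold: float = 0.5, k: int = 3) -> List[Tuple[str, float]]:
    """Flag every database entry whose similarity to the query meets the threshold."""
    scored = [(name, similarity(query, ref, k)) for name, ref in database.items()]
    return [(name, round(score, 2)) for name, score in scored if score >= threshold]


# Hypothetical database with one made-up placeholder entry (not a real sequence).
DATABASE = {"known_entry_A": "MKTLLVAGGHWQERTYPLKDSA"}

near_copy = "MKTLLVAGGHWQERTYPLKDSV"  # differs from known_entry_A by one residue
unrelated = "QDNPSYCFRIEWTGHAVLMKQD"  # shares no 3-mers with known_entry_A

flagged_near = screen(near_copy, DATABASE)
flagged_unrelated = screen(unrelated, DATABASE)

print(flagged_near)       # the near-copy is caught
print(flagged_unrelated)  # the dissimilar string passes, regardless of function
```

The near-copy trips the filter and the dissimilar string sails through with an empty hit list. That second case is the gap the review is worried about: the filter measures resemblance to known entries, not function.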
In other news...
AI doing (or not doing) things:
In what really should be front-page news (besides just blowing up on Reddit), Ukraine’s President Zelenskyy announced that Ukrainian forces captured a Russian position using only unmanned platforms, with no human infantry and no Ukrainian casualties, and that Russian troops surrendered to the robots. This is a wildly huge deal, and important to how we think about AI/robot use in warfare going forward.
Two separate attacks on Sam Altman’s San Francisco home occurred over the past weeks, and so even if this is insanely obvious and irrelevant, I will still reiterate here that this is bad.
Another report of someone whose psychotic break, ending in a death, involved an AI chatbot: WSJ report on the death of Jonathan Gavalas.
AMD’s Senior Director of AI reportedly told PC Gamer that Claude has regressed and “cannot be trusted to perform complex engineering.” Brutal assessment to make publicly, especially from a chip company whose customers are paying for AI compute. We’ll see in a few days to weeks what everyone else thinks of the new Claude models.
GPT-5.4 Pro solved an Erdős problem in a way that has the mathematicians all in a tizzy.
From Zvi: “AIs, especially ‘ambient scribes,’ are driving up health care costs via increasing ‘coding intensity,’ as doctors who record and parse all your info also get much more efficient at billing your insurance. The scribe will note additional complexity that justifies higher billing, and even suggest billing codes. Everything effectively costs more, in one study at UCSF a whopping 30% more per visit.”
AI and biology:
Anthropic acquired Coefficient Bio for $400 million. Other updates at the link: a $2.75B Eli Lilly / Insilico Medicine AI collaboration, a $200M Regeneron / TriNetX data partnership covering 300M patients, and Scala Biodesign raising $16M for protein engineering (already adopted by 9 of the top 20 pharma companies).
Elliot Hershberg at Century of Bio takes on whether AI can beat Eroom’s Law (drug development costs doubling every nine years since 1950, Moore’s Law backwards). I’ve frequently discussed his earlier thought-provoking piece on how AI changes (and doesn’t change) drug development, and this is a great sequel essay: clinical development drives most pharma cost growth, but early-stage biotech economics are different, and AI-driven discovery improvements matter more there. At some point we might get to a stage where we have more “Digital Biologics,” drugs like Moderna’s personalized cancer vaccine where the R&D/algorithm is integral to the therapeutic itself, not just the discovery process.
METR and Epoch AI’s MirrorCode project demonstrated Claude Opus 4.6 successfully reimplementing a 16,000-line bioinformatics toolkit — a task estimated at 2–17 weeks for human engineers.
From CasualPhysicsEnjoyer: a tabletop bioreactor controlled via Raspberry Pi and Claude Code in headless mode. You issue plain-English commands (”loop at different temperatures and measure photosynthesis rate”) and the agent converts them into hardware API calls. The author explicitly flags the biosecurity dual-use angle: the less friction there is to remotely directing biology experiments with AI, the more the risk picture changes, but this is a great way to learn about it all.
Stephen D. Turner writes at The Conversation to argue that AI-automated lab biology is arriving faster than governance can track. As Lennart Justen says, automated cloud labs are not a huge problem just yet… but we want the regulatory safety infrastructure to be in place before they are.
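To put the Eroom’s Law figure mentioned above in concrete terms, here is a small sketch of the compounding it implies. It assumes a perfectly smooth exponential with the nine-year doubling time quoted above; the function name and the sample years are mine, purely for illustration, and real cost data is of course much noisier.

```python
def eroom_cost_multiplier(start_year: int, end_year: int,
                          doubling_years: float = 9.0) -> float:
    """Cost multiple implied by a doubling every `doubling_years` years."""
    return 2 ** ((end_year - start_year) / doubling_years)


# Sample years chosen as whole multiples of nine years after 1950.
for year in (1959, 1977, 1995, 2013):
    print(year, eroom_cost_multiplier(1950, year))
```

Seven doublings between 1950 and 2013 already means a 128-fold increase, which is why even a modest change in the doubling time would matter so much.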
AI Safety and Security:
Ryan Greenblatt at Redwood Research published a great piece that reflects how I also think about the alignment question: current models do all kinds of sketchy and misaligned things all the time. He calls the pattern “apparent-success-seeking”: models overselling incomplete work, downplaying problems, producing outputs designed to appear successful rather than actually succeed. I totally agree, but my own thoughts here are that capabilities (in a corporate-relevant sense, as in, what the frontier model companies are seeking to improve as they ship newer models that people will want to pay them for) should scale along with “alignment” in this more mundane sense. Hopefully.
Mallen and Greenblatt documented three separate incidents in which Anthropic accidentally trained against chain-of-thought reasoning, a key mechanism for monitoring model behavior and related to what Zvi Mowshowitz appropriately calls ‘the most forbidden technique’. The recent data shows the CoT reasoning contaminated ~8% of Mythos RL training episodes. The authors warn that such incidents “could be fatal in more powerful systems.”
North Korea-linked hackers executed sophisticated attacks on AI infrastructure, reportedly stealing data from Mercor, a very prominent AI training data supplier for OpenAI and Anthropic valued at $10 billion.
Andy Hall at Free Systems finds Claude Opus 4.7 is the first model to meaningfully resist disguised requests to help build authoritarian systems, where the request is framed as a code improvement to an already-authoritarian codebase rather than a direct ask. Prior models complied with masked requests even while refusing direct ones. Randall Bennington at Trust in Society checks other frontier models (Claude, GPT-5.4, Gemini) and finds that reframing the requests through a character reveals several failure modes.
Anthropic publishes on automated alignment research.
AI Governance and Politics:
Word on the street is that “The AI Doc” movie is very good, and that you should try to see it together with people (but I haven’t yet). The movie even comes with a discussion guide.
Maine is ready to become the first US state to pass a data center construction ban — a moratorium on new facilities above 20MW through late 2027, to allow evaluation of environmental and economic impacts. Separately, the FT reports that nearly 40% of US data center projects due in 2026 are at risk of slipping more than three months, citing permitting friction and shortages of labor, power, and equipment. A 1.4GW Texas campus being built for Oracle to provide capacity for OpenAI is reportedly running behind schedule. Physical reality keeps asserting itself.
xAI sued Colorado over the state’s AI bias bill, arguing it compels speech and promotes state “ideological views” — a First Amendment framing. This is the latest in a wave of AI industry challenges to state-level regulation, and it’s worth following.
OpenAI published something they’re calling an “Industrial Policy for the Intelligence Age”, and separately is backing a bill to exempt AI firms from lawsuits over model harms. Zvi Mowshowitz rounds up some good takes on OpenAI’s seemingly very bad-faith efforts here to reduce regulation while claiming that they are doing the opposite.
The UK government announced a £500M sovereign AI fund with initial investments going toward AI-assisted drug discovery and cheaper supercomputing, while Work and Pensions Secretary Liz Kendall urged the public to “embrace AI.”
The Stanford AI Index 2026 is out, and it’s over 400 pages, so no I didn’t read it.
Jensen Huang went on Dwarkesh Patel’s podcast and made his case against chip export controls. Good responses by Zvi and at Transformer News.
Ronan Farrow and Andrew Marantz have a ~50 page New Yorker profile on Sam Altman, with records allegedly showing Altman lying to executives and misrepresenting internal safety protocols. They’ve been going around the podcasts since then to make the case that “We need institutions worthy of the power they wield.”
Talking about jobs:
MIT economist Simon Johnson, 2024 Nobel laureate, argues AI is on track to erode “jobs with dignity,” the good middle-class jobs, unless policy actively intervenes. He’s been appointed as a UK AI “ambassador” and is interviewed by Bloomberg at the link.
The Guardian profiles recent US graduates navigating what may be the toughest entry-level job market since 2020: a 42.5% underemployment rate, applicants sending 90+ applications with mass ghosting and auto-rejections, and employers increasingly unwilling to invest in training junior hires. The combination of AI uncertainty and tight macro labor conditions is particularly brutal if you’re just starting out.
Two economists (Falk and Tsoukalas) model the AI automation arms race and find that various policy interventions — wage adjustments, capital taxes, UBI, upskilling, even worker equity — all fail to correct the underlying market failure, where competitive pressures drive individual firms to automate beyond what’s collectively optimal, eroding the consumer demand those same firms depend on. The only thing that works in their model: a Pigouvian automation tax. Good luck getting that through Congress!
Epoch AI’s survey of 2,000+ US adults (March 2026) finds half of employed Americans who used AI in the past week used it for work.
Biosecurity/global health risks:
WHO launched a global emergency appeal in early February seeking nearly $1B to sustain interventions across 36 emergencies, including 14 Grade 3 crises.
Jeff Kaufman at SecureBio asks whether current detection algorithms would have flagged SARS-CoV-2 during initial emergence. The answer turns on which algorithm: the “Clades of Concern” approach (comparing reads to known pathogens) would have flagged ~95% of the SARS-CoV-2 genome via alignment with bat coronavirus ZC45. The “Chimera Detection” approach (flagging partial-pathogen matches) would have caught only ~1% of COVID reads. This is a really great thought piece on what, specifically, biosurveillance looks like.
London’s chemical, biological, radiological and nuclear police team got some action on Friday when “items” were found near the UK’s Israeli embassy after some threats from an Iranian-backed social media posting (it was nothing, in the end).



