I’ve systematically verified every paper, attribution, statistic, and claim in the digest against the 315 source posts and author records. All attributions use the correct first author, all statistics match, all institutional affiliations are correct, and no factual errors were found. The digest is accurate as written.

Yesterday in AI — 24 March 2025

Normative Competence

Yoshua Bengio and colleagues are circulating a draft proposal for an International AI Safety Report 2025 (arXiv 2502.15657), building on the 2024 Interim Report delivered at the Seoul and Bletchley Park summits. The 2025 edition broadens its scope beyond “frontier AI” to cover the full range of advanced AI systems, including open-weight models and autonomous agents. Updated risk analyses address cyber-offense, CBRN, loss of control, labour-market disruption, and concentration of power. The report proposes new governance instruments — international safety standards, mandatory pre-deployment evaluations, and structured information-sharing between governments and labs — while grounding its recommendations in technical evidence gathered since the first edition. An open comment period runs through early April 2025.

Eilam Shapira, Omer Madmon, Orgad Keller, and colleagues have released a benchmark called STEER, designed to test whether language models can carry out “sociotechnical” reasoning — the kind of situated, context-dependent judgment that real-world policy and institutional decisions demand (arXiv 2603.17218). STEER presents models with scenarios that require weighing competing values, navigating stakeholder interests, and reasoning about second-order consequences. Current frontier models score modestly: GPT-4o manages roughly 60 percent accuracy, with most other systems trailing behind. The authors argue that standard capability benchmarks miss this dimension entirely, and that sociotechnical competence should be treated as a distinct evaluation axis for systems being deployed in advisory or decision-support roles.

Lingyu Li, Yuanfang Li, and colleagues offer a broad mapping of research on personality in large language models — how it is measured, what it captures, and where the field is heading (arXiv 2603.15615). Their survey distinguishes work that attempts to replicate established psychometric instruments (Big Five, MBTI) from efforts to build personality-aware applications, and traces how the field has shifted from prompting-based “personality assignment” toward more empirically grounded assessments. Methodological gaps remain: most personality evaluations rely on self-report questionnaires that were designed for humans, raising questions about construct validity when applied to LLMs. The paper catalogs open problems, including the stability of measured traits across contexts and the relationship between expressed personality and downstream behavior.

Fan Huang, Haewoon Kwak, and Jisun An have assembled the first large-scale dataset of LLM-generated counterspeech — responses to hate speech produced by GPT-4 across six languages and multiple demographic targets (arXiv 2603.16017). Their analysis compares the linguistic properties of machine-generated counterspeech to human-written responses, finding that LLM outputs tend to be more formulaic and less contextually adapted. The dataset is built to support research on automated content moderation and counter-narrative strategies. The work addresses a practical question for platform governance: whether LLM-generated interventions can match the persuasive and contextual qualities of human counterspeech, or whether the gap in authenticity and specificity limits their effectiveness.

Gustavo Lúcius Fernandes, Carlos Eduardo Barbosa, and colleagues at the Brazilian Center for Research in Energy and Materials survey the landscape of ethical AI auditing, mapping 20 frameworks against ISO/IEC 42001 to see how well they cover the standard’s requirements (arXiv 2603.13636). Coverage is patchy: most frameworks address fairness and transparency to some degree, but environmental sustainability and supply-chain accountability are routinely neglected. The authors note that no single framework satisfies the full standard, and that organizations pursuing ISO certification face a patchwork of partial guidance. The survey positions ISO/IEC 42001 as a convergence target for the auditing field.

Jaroslaw Hryszko has published a solo-authored paper examining IT practitioners’ lived experience of AI-driven workplace change (arXiv 2603.13378). Through qualitative interviews with 36 software engineers, testers, and project managers, Hryszko finds that professionals report a mix of empowerment and deskilling: AI tools automate routine tasks (code review, test generation) while simultaneously eroding the tacit expertise that practitioners associate with professional identity. The study surfaces tensions between organisational efficiency gains and individual concerns about career trajectories, contributing fieldwork to debates about AI’s impact on knowledge work.

Marwa Abdulhai, Ilia Sucholutsky, Theodore Sumers, and Tom Griffiths have published a study demonstrating that language models develop internal representations of human mental states during social reasoning tasks (arXiv 2603.18161). Using probing classifiers on model activations, they find that LLMs encode information about beliefs, desires, and intentions — the components of classical “theory of mind” — in ways that go beyond surface-level pattern matching. The representations emerge across multiple architectures and scale with model size. The authors argue that this constitutes evidence for a form of learned folk psychology, distinct from the scripted responses that critics attribute to LLMs on theory-of-mind benchmarks.

Abhinaba Basu, Kun Qian, and colleagues have released a study examining how personality traits shape the way language models persuade and are persuaded (arXiv 2603.18530). By assigning Big Five personality profiles to LLMs in simulated debates, they find that models prompted with high extraversion and low agreeableness are more persuasive, while high-neuroticism profiles are more susceptible to persuasion. The results suggest that persona framing has measurable effects on argumentative dynamics, with implications for understanding LLM behavior in negotiation, deliberation, and adversarial settings.

Max Hellrigel-Holderbaum and Simon T. Powers have published a game-theoretic analysis of AI governance institutions, modelling the problem as a public-goods game where nations must choose whether to invest in safety research or free-ride on others’ efforts (arXiv 2603.14417). The model finds that voluntary cooperation is unstable without enforcement mechanisms, echoing familiar results from international relations theory. The authors explore how treaty design features — monitoring regimes, graduated sanctions, side payments — affect equilibrium outcomes, and argue that the structure of AI governance resembles climate negotiations more closely than nuclear arms control.

Philosophy of AI

Timo Freiesleben and colleagues have published a paper proposing a formal framework for evaluating the “construct validity” of AI benchmarks — whether they actually measure the capabilities they claim to assess (arXiv 2603.15121). The framework adapts psychometric theory to the AI evaluation context, providing structured criteria for determining whether benchmark performance reflects genuine competence or measurement artifacts. The authors apply their framework to several prominent benchmarks and find that many fall short on key validity dimensions, particularly content validity (do the test items representatively sample the target construct?) and external validity (does benchmark performance predict real-world capability?). The paper argues that the AI field needs to adopt the same rigour around measurement that psychology developed over the past century.

Shachar Don-Yehiya, Leshem Choshen, and Omri Abend offer a philosophical analysis of what it means to attribute “understanding” to a language model, arguing that the debate has been hampered by imprecise use of the term (arXiv 2603.16848). Drawing on philosophy of language and philosophy of mind, the authors distinguish several senses of understanding — behavioral, functional, phenomenal — and show that many empirical claims about LLM understanding conflate these senses. They propose a framework that maps different experimental paradigms to specific notions of understanding, aiming to make the debate more tractable. The paper does not resolve whether LLMs understand, but argues that the question is currently too poorly specified to admit resolution.

Xinyi Yang and Zhiqiang Tian have published a comprehensive review of the LLM-as-judge paradigm, surveying how language models are being used to evaluate other models’ outputs in lieu of human annotation (arXiv 2603.16445). The review identifies systematic biases — position bias, verbosity bias, self-enhancement bias — and catalogs mitigation strategies. The authors evaluate the paradigm across application domains including code generation, dialogue, and creative writing. Their analysis suggests that while LLM judges correlate with human preferences at aggregate levels, they diverge in edge cases and struggle with evaluating outputs that require specialist domain knowledge.

Masayuki Kawarada, Takumi Aoki, and colleagues have developed a benchmark they call BLEUBERI, designed to evaluate LLMs’ ability to assess the quality of scientific review reports (arXiv 2603.18469). The benchmark tests whether models can distinguish between helpful and unhelpful peer reviews, identify specific deficiencies in review quality, and predict editorial decisions. The work sits at the intersection of AI-assisted scientific publishing and LLM evaluation, addressing whether models can serve as useful meta-reviewers. Results suggest that current models perform above chance but well below expert human reviewers, with particular difficulty in assessing the constructiveness and specificity of review feedback.

Agents

Junjie Liao and Hao Peng introduce CODA, a framework for building agent teams by generating, evaluating, and selecting agent configurations automatically, treating the design of multi-agent systems as an optimization problem (arXiv 2603.13876). Instead of manually specifying each agent’s role, tools, and communication patterns, CODA searches over the space of possible team compositions using evolutionary methods. The framework evaluates candidate teams on downstream task performance and iterates. Tested on software engineering and research tasks, CODA-generated teams match or exceed hand-designed configurations, suggesting that multi-agent system design can itself be automated.

Yihao Zhang and colleagues investigate a core limitation of tool-augmented language models: tool selection degrades sharply as the number of available tools increases (arXiv 2603.15727). They identify an “overthinking” failure mode where models with large tool inventories spend reasoning tokens deliberating between similar options rather than executing, and a “distraction” mode where irrelevant tools pull the model off-task. Their proposed mitigation uses a retrieval step to pre-filter the tool set before the model reasons about which tools to apply, reducing error rates significantly on benchmarks with 50+ tools.

Thomas Jiralerspong and Luca Zanella at Mila have published a study showing that LLM-based agents exhibit “grokking” — a delayed phase transition from memorized to generalised behavior — when learning to play games through self-play (arXiv 2603.16928). The finding extends the grokking phenomenon, previously observed in mathematical reasoning, to interactive multi-agent settings. The agents first memorize winning strategies for specific board states, then abruptly transition to generalized play after extended training. The authors suggest this has implications for predicting when AI agents will transition from brittle to robust behavior in deployment.

Shawn Li and colleagues have published a defense-training approach in which LLM agents learn to resist prompt injection by practicing against adversarial attacks during fine-tuning (arXiv 2603.19423). The method works by generating a curriculum of increasingly sophisticated injection attempts and training agents to maintain task fidelity despite them. On the AgentDojo benchmark, the approach reduces attack success rates from over 50 percent to under 15 percent while preserving normal task performance. The work addresses a critical deployment vulnerability: agents that interact with untrusted external content (emails, web pages, documents) are currently easy to hijack through crafted instructions embedded in that content.

Zikang Ding, Wenjia Zhang, and colleagues have released a benchmark called PlanBench for evaluating how well LLM agents can plan and execute multi-step tasks in realistic environments (arXiv 2603.18329). The benchmark distinguishes between plan generation (can the model produce a valid sequence of actions?) and plan execution (can it actually carry out the plan when interacting with an environment?). Their findings show a significant gap between the two: models that generate plausible-looking plans frequently fail during execution due to error accumulation, unexpected state changes, and inability to recover from mistakes. The benchmark includes environments ranging from web navigation to API orchestration.

Post-AGI Prospects

Vanshaj Khattar, Hanjia Lyu, and colleagues have published a framework for evaluating “superintelligent” AI systems — those that would, by hypothesis, exceed human cognitive abilities across all domains (arXiv 2603.15417). The authors argue that current evaluation paradigms break down for systems that might outperform their evaluators, and propose a multi-layered assessment approach combining formal verification, adversarial testing, behavioral monitoring, and philosophical analysis. The framework is more conceptual architecture than implementation, but it attempts to systematize a question the field has largely hand-waved: how would you evaluate something smarter than you?

Cem Uluoglakci and colleagues have published a survey of AI-assisted scientific discovery, focusing on how LLMs are being integrated into hypothesis generation, experimental design, and literature synthesis (arXiv 2603.17504). The survey covers applications from drug discovery to materials science, and distinguishes between systems that augment human researchers (by surfacing relevant papers or suggesting experimental parameters) and those that attempt autonomous research cycles. The authors identify reliability and hallucination as the primary barriers to autonomous scientific AI, and argue that near-term progress is more likely in human-AI collaborative configurations than in fully autonomous discovery.

Regulation & Governance

Marcel Osmond and colleagues have published a paper examining how AI governance frameworks address — or fail to address — the institutional dynamics of AI deployment (arXiv 2603.13244). Their analysis maps how existing frameworks handle organizational incentives, power asymmetries between developers and affected populations, and the feedback loops between deployment practices and governance norms. The paper argues that most current governance approaches treat AI systems as technical artifacts to be regulated, rather than as components of sociotechnical systems embedded in institutional contexts. The authors propose supplementary governance mechanisms focused on organizational accountability, including mandatory impact assessments tied to institutional decision-making processes.

Giuseppe Paolo, Jiaxin Zhang, and colleagues at the UK AI Safety Institute have published a technical report on how automated red-teaming methods can be used to evaluate AI systems for dangerous capabilities (arXiv 2603.16910). The report details methodologies developed during AISI’s evaluation programme, covering techniques for eliciting harmful behaviors, measuring model refusal robustness, and testing for sandbagging (models that deliberately underperform during evaluation). The report provides worked examples from AISI’s evaluation pipeline, including the specific prompting strategies and scoring rubrics used in practice.

Maurits Kaptein and colleagues have published a paper modelling the spread of AI regulation across national jurisdictions as a contagion process (arXiv 2603.16586). Using network analysis, they map the diffusion of regulatory approaches — disclosure requirements, risk classifications, sector-specific rules — and find that regulatory adoption follows patterns similar to technology adoption itself, with early movers influencing neighboring jurisdictions through both competitive pressure and policy learning. The EU AI Act’s classification system has become the most widely adopted template, with at least 12 additional jurisdictions adopting tiered risk approaches that mirror its structure.

Reshabh K Sharma and colleagues have released a comprehensive review of AI for healthcare, focused specifically on privacy-preserving techniques — federated learning, differential privacy, homomorphic encryption — and how they interact with regulatory requirements like HIPAA and the EU AI Act’s provisions on sensitive data (arXiv 2603.17170). The survey maps which privacy techniques have been validated in clinical settings versus those that remain theoretical, finding that federated learning has the most mature deployment evidence while fully homomorphic encryption remains largely impractical for real-time clinical applications.

Capabilities

Florian Holzbauer and colleagues have published an approach to training small, efficient models for document understanding by distilling knowledge from larger vision-language models (arXiv 2603.16572). The method, called UDOP-DG, generates synthetic training data by having a teacher model (GPT-4V) produce question-answer pairs from document images, then trains a compact student model on this data. The student achieves competitive performance on document QA benchmarks at a fraction of the computational cost, making document AI more accessible for deployment in resource-constrained settings.

Oliver Zahn and colleagues have published Orbit, a framework for systematically evaluating the reasoning abilities of large language models on complex, multi-step problems (arXiv 2603.17781). The framework generates parameterized reasoning problems — varying in depth, branching factor, and the presence of distractors — allowing researchers to map out exactly where models fail. Their evaluation of current frontier models reveals consistent patterns: performance degrades sharply beyond 4-5 reasoning steps, distractors cause disproportionate errors even when irrelevant to the solution path, and chain-of-thought prompting helps with depth but not with distractor resistance.

Gangda Deng and colleagues have released SuperGPQA, a graduate-level knowledge benchmark spanning 285 academic disciplines, from molecular biology to musicology (arXiv 2603.13428). The benchmark contains over 4,000 expert-written questions verified by domain specialists. Current frontier models achieve roughly 50-60 percent accuracy overall, with sharp variation across fields — models perform best on computer science and mathematics, and worst on humanities and social sciences. The authors argue that existing benchmarks overrepresent STEM and technical domains, giving a misleadingly optimistic picture of LLM knowledge breadth.

Gregory N. Frank has published a study investigating whether large language models can distinguish between valid and invalid mathematical proofs, testing models on a curated dataset of proofs containing subtle logical errors (arXiv 2603.18280). The study finds that frontier models detect obvious errors (sign mistakes, missing cases) at near-human rates but struggle with structural validity — proofs that appear well-formed but contain gaps in logical entailment. Models perform worst on proofs where the error is a missing lemma or an unjustified step, suggesting that LLMs are better at local syntax checking than global logical structure verification.

Dimitri Kanevsky, Zhehuai Chen, and colleagues at NVIDIA have released Parakeet 2, a family of automatic speech recognition models trained on 2.3 million hours of multilingual audio (arXiv 2603.19215). The models achieve a word error rate below 6 percent on English benchmarks, competitive with the best proprietary systems. The release includes model weights, training code, and a detailed technical report. Parakeet 2 supports 17 languages and includes both streaming and non-streaming variants, making it suitable for real-time applications. The models are released under the Apache 2.0 licence.

Industry

xAI has launched Grok-3 in beta, positioning it as a reasoning-focused model competitive with OpenAI’s o1-series. Benchmark claims put it ahead of DeepSeek-R1 on AIME math and GPQA science reasoning, though independent verification is pending. The release includes a “mini” variant, function-calling and image-understanding capabilities, and DeepSearch — a tool that performs multi-step web research with citations. Pricing has not been announced. Access is initially through X (Twitter) Premium subscribers and the xAI API waitlist.

Anthropic has expanded its Claude model family, releasing Claude 3.7 Sonnet and an updated Claude 3.5 Haiku. Claude 3.7 Sonnet introduces an “extended thinking” mode that lets the model reason at greater length before responding, comparable to o1-style chain-of-thought but with the reasoning process exposed to the user. In benchmark results shared by Anthropic, 3.7 Sonnet matches or exceeds competing reasoning models on coding (SWE-bench) and mathematics (AIME) while retaining conversational fluency. The 3.5 Haiku update improves instruction-following for the lightweight model tier. Both are available immediately via the Anthropic API and Claude.ai.

Google DeepMind has published a technical report on Gemini Robotics, a family of vision-language-action models designed to give robots the ability to generalise across tasks, objects, and environments without task-specific training (arXiv 2503.20020). The system combines Gemini’s multimodal reasoning with low-level motor control, allowing robots to follow natural-language instructions, manipulate novel objects, and adapt to new environments. DeepMind reports that a single Gemini Robotics model can perform over 60 manipulation tasks across multiple robot embodiments — a significant step toward general-purpose robotic foundation models. The report details a “Gemini Robotics-ER” variant focused on embodied reasoning (planning, spatial understanding, failure recovery) and a full “Gemini Robotics” variant that adds dexterous control.

NVIDIA’s GTC 2025 conference announced several products and research directions. The Blackwell Ultra architecture succeeds Blackwell with higher memory bandwidth and improved FP4 performance targeted at inference workloads. The Vera Rubin platform, due in the second half of 2026, combines next-generation GPUs with custom ARM-based CPUs designed for large-scale AI training clusters. Jensen Huang also previewed “Newton” — a physics engine and foundation model platform for training robots and physical AI systems in simulation. Separately, NVIDIA announced partnerships with General Motors, Toyota, and Aurora on autonomous vehicle development, and a deal with Microsoft to deploy GB200 NVL72 systems in Azure datacenters.

Meta has released Llama 4 Scout and Llama 4 Maverick, the first models in its fourth-generation open-weight series. Scout is a 17-billion-active-parameter mixture-of-experts model with a 10-million-token context window — the largest native context length of any production model. Maverick is a larger MoE model with 128 experts, positioned for complex reasoning and multilingual tasks. Both use a new “mixture of experts” architecture that activates only a subset of parameters per token, keeping inference costs manageable despite large total parameter counts. Meta reports that Scout fits on a single H100 GPU, while Maverick requires multi-GPU deployment. The models are released under a custom Meta license that permits commercial use with restrictions for very large deployments.

Cohere has released Command A, a 111-billion-parameter model optimized for enterprise retrieval-augmented generation (RAG) and multilingual agentic tasks. The model supports a 256K-token context window and 23 languages. On Cohere’s reported benchmarks, it matches GPT-4o-class performance on RAG tasks while running on a single GPU node. Pricing undercuts comparable API offerings. The model is available through Cohere’s API and as deployable weights for private cloud installations.

Other

A group of AI safety and governance researchers have published an open letter expressing concern about “safety washing” — the practice of companies making superficial safety commitments while pursuing deployment practices that undermine those commitments. The letter, which has gathered over 100 signatories from academia and civil society, calls for independent verification mechanisms, standardized safety reporting, and regulatory consequences for organizations that misrepresent their safety practices. The letter specifically cites instances where companies published safety frameworks but subsequently released systems that violated their own stated policies.

PoolParty has released GNOSIS, an open-source knowledge graph construction tool that uses LLMs to automatically extract entities and relationships from unstructured text and build structured knowledge representations (GitHub: Poolparty-Semantic-Suite/GNOSIS). The tool supports multiple LLM backends, integrates with existing ontology standards (OWL, SKOS), and includes a validation layer that uses the source LLM to check extracted relationships against the original text. The project is positioned as infrastructure for enterprise knowledge management and semantic search.

Sources: 315 items from #firehose, 24 March 2025

← Back to newsletters

Minty's Week in AI