λ-Files

RSS

Thoughts on computer science, engineering, and technology.

Extracting a Formal Semantics with AI Agents

When we talk about formal verification of programs, one of the first questions is: does the programming language we are verifying have a formal semantics? If the language behaviour is specified only by an implementation or by prose written for humans, then formal reasoning about programs in that language is impossible without first formalising the language itself.

Many programming languages either lack a formal semantics, or have one that was developed separately from the implementation. Even when such a semantics exists, it may cover only part of the language or lag behind the current implementation.

Recently, while attending FMxAI’26, I joined a breakout session titled “How to handle all other languages?”. We discussed the idea of autoformalising semantics from standards documents, implementation source code, and differential testing against real compilers and interpreters. I prefer the term “semantics extraction”: the goal is not to invent a new semantics, but to recover the one already implicit in the implementation and its test suite.

I decided to test the idea on a small but real language: the Unix bc(1) calculator. It is compact enough for an experiment, but it is not a toy. It has arithmetic expressions, mutable variables and arrays, loops, conditionals, functions, recursion, special variables such as scale, ibase, and obase, and arbitrary-precision decimal arithmetic. It also has some interesting language quirks, such as the unusual handling of quit, and numeric literals whose interpretation depends on the current ibase.

The resulting project is bc-lean: an experiment in AI-assisted semantics extraction for the POSIX subset of bc, checked against GNU bc 1.07.1 and formalised in Lean 4.

The Experiment Setup

I set up the experiment so that AI agents would do as much of the engineering work as possible, with limited human guidance. Across the project I used several models and agent environments, including Cursor Composer 2.5, Claude Opus 4.8, and Codex/GPT-5.5.

The agents were allowed to inspect the GNU bc 1.07.1 source tree, especially the parser and execution engine, and to run a local GNU bc binary as a reference implementation. They also used test programs from GNU bc and a curated BSD-2-Clause subset of Gavin Howard’s bc tests for validation.

The target was an executable formal model in Lean 4: a parser, an abstract syntax tree, two operational semantics, tests against the reference implementation, and some metatheory about the semantics themselves.

What Was Built

The repository now contains:

  1. A standalone tree-sitter parser for POSIX bc syntax.
  2. A Lean surface AST and a tree-sitter XML bridge.
  3. A big-step executable semantics.
  4. A structural small-step executable semantics.
  5. Shared runtime support.
  6. A command-line interpreter, bc-lean, which can run either semantics.
  7. Regression tests comparing both Lean evaluators with GNU bc.
  8. A progress theorem for the small-step semantics.
  9. A big-step/small-step equivalence theorem.

The project includes a syntax-only parser, but not a typechecker or diagnostics layer. This matches the scope of the experiment: bc is an untyped calculator language, and GNU bc’s warning modes are a separate concern. The current scope is the POSIX/standard subset, with GNU bc 1.07.1 used as the behavioural reference.

In terms of code size: tree-sitter grammar contains 45 productions, and lean code for the abstract syntax, small and big step semantics, and proofs is about 15 Kloc.

Current Status

At the current checkpoint, the project builds with no sorry in the proof surface. The latest documented verification run reports:

  • AST golden tests: 40/40 passing.
  • Big-step evaluator tests: 39/39 matching GNU bc.
  • Small-step evaluator tests: 39/39 matching GNU bc.
  • lake build green across the Lean tree.

The main proved results are:

  • Bc.Progress.progress: every small-step configuration is either terminal or can take a step.
  • Bc.BigSmall.runProgram_iff: the big-step and small-step semantics agree on final results for ordinary program runs.
  • Bc.BigSmall.runProgramWithState_iff: the same equivalence for custom initial states whose stopped flag is false.

This should still be read as an experiment rather than a finished, audited formalisation. The code and proofs have not yet received enough independent human review to treat the model as authoritative.

Effort

Development effort was non-linear. The initial syntax work was surprisingly quick: the first tree-sitter parser was extracted in well under an hour from the GNU bc yacc/lex sources, then connected to a Lean AST and golden tests in another short session. However, independent review by a different agent/model found real issues: associativity mistakes, over-permissive grammar cases, string handling bugs, and some confusion about which checks belonged in syntax rather than later semantics. This was a useful pattern for the whole project: generation was fast, but review and boundary-setting mattered.

The big-step semantics was relatively straightforward. Once the parser and AST were stable, the agent could read GNU bc’s execution engine, implement an executable Lean evaluator, and validate it against GNU bc. This still required careful modelling of decimal arithmetic, functions, arrays, special variables, and output formatting, but it followed the natural recursive structure of the language. In terms of independent agent work, this was one of the smoother parts of the experiment.

The small-step semantics was where human intervention mattered most. The first version worked as an executable interpreter, but it was not really the structural small-step semantics we wanted: it shared too much structure with the big-step evaluator, and some evaluation was effectively big-step work hidden inside a single small-step transition. We had to split the common runtime layer out, make big-step and small-step independent siblings, and start over with a structural residual semantics. Several of the most important improvements came from human review asking for the right semantic shape, not from simply asking the agent to keep coding.

The proof work was mostly automatic in the sense that agents did the detailed Lean engineering: generating auxiliary lemmas, repairing proof scripts, and iterating until lake build was clean. The progress theorem had to be corrected: the first attempt proved a property of the fuel-bounded runner, not the intended fuel-free one-step semantics. The big-step/small-step equivalence also had to be stated at the final RunResult level, where the two semantics may use different fuel measures. The final equivalence proof took the largest proof engineering effort, with forward and backward simulations, residual-term evaluation, fuel monotonicity lemmas, and determinism arguments, but it was ultimately completed by the agent without sorry or custom axioms.

Summary

I think the experiment was a success. In two days of mostly agent work, with perhaps two hours of human supervision, we extracted a formal semantics for an existing language from reference implementations, unit tests, and documentation. Doing this manually would certainly have taken me much longer. However, the correctness of the result remains an open question until it receives further human review.

Why Forcing Users to Change Passwords Is Bad for Security

Periodic password expiry (e.g. “change your password every 60–90 days”) is no longer considered good practice. Modern guidance is clear:

Do not require password changes unless there is evidence of compromise.

Authoritative Guidance

🇺🇸 United States: National Institute of Standards and Technology (NIST)

“Verifiers and CSPs SHALL NOT require subscribers to change passwords periodically. However, verifiers SHALL force a change if there is evidence that the authenticator has been compromised.”

Source: NIST SP 800-63B-4, Digital Identity Guidelines: Authentication and Authenticator Management, §3.1.1.2, item 6 (final, July 2025).

🇬🇧 United Kingdom: National Cyber Security Centre (NCSC)

“Regular password changing harms rather than improves security.”

“Forcing password expiry carries no real benefits.”

Source: UK NCSC Password Administration for System Owners - Don’t enforce regular password expiry.

🇫🇷 France: Agence nationale de la sécurité des systèmes d’information (ANSSI)

“Si la politique de mots de passe exige des mots de passe robustes et que les systèmes permettent son implémentation, alors il est recommandé de ne pas imposer par défaut de délai d’expiration sur les mots de passe des comptes non sensibles comme les comptes utilisateur.”

Source: ANSSI, Recommandations relatives à l’authentification multifacteur et aux mots de passe, R24, 2021.

ANSSI separately recommends expiration for privileged accounts such as administrator accounts (R25).

Microsoft

“Periodic password expiration is an ancient and obsolete mitigation of very low value, and we don’t believe it’s worthwhile for our baseline to enforce any specific value.”

“When humans are forced to change their passwords, too often they’ll make a small and predictable alteration to their existing passwords, and/or forget their new passwords.”

Source: Aaron Margosis, Security baseline (FINAL) for Windows 10 v1903 and Windows Server v1903, Microsoft Security Baselines blog, 2019. See also: Microsoft 365 password policy recommendations.

Research Evidence

  • S. Chiasson and P. C. van Oorschot, Quantifying the Security Advantage of Password Expiration Policies, Designs, Codes and Cryptography, 2015. School of Computer Science, Carleton University. Quantifies the security gain of expiry under an analytic guessing-attack model and finds it minor at best. Available at: [pdf]

  • Y. Zhang, F. Monrose, and M. K. Reiter, The Security of Modern Password Expiration: An Algorithmic Framework and Empirical Analysis, in Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS), 2010. Empirical study using password histories from 7,752 defunct accounts; new passwords are predictably derived from old ones via simple transforms. Available at: [pdf]

  • H. Habib, P. Emami-Naeini, S. Devlin, M. Oates, C. Swoopes, L. Bauer, N. Christin, and L. F. Cranor, User Behaviors and Attitudes Under Password Expiration Policies, in Proceedings of the Symposium on Usable Privacy and Security (SOUPS), 2018. Workplace survey: forced expiration did not yield stronger replacement passwords and elicited predictable coping strategies (e.g. appending digits), though it did not clearly produce some other feared harms either. Available at: [pdf]

  • D. Florêncio and C. Herley, A Large-Scale Study of Web Password Habits, in Proceedings of the International World Wide Web Conference (WWW), 2007. Empirical evidence on password reuse and user behaviour relevant to expiry policies. Available at: [link]

  • L. F. Cranor, Time to rethink mandatory password changes, FTC Tech Blog, 2 March 2016. Public-facing critique of mandatory rotation by the FTC’s then chief technologist, widely cited in the subsequent policy debate. Available at: [link]

Killing Voicemail with Twilio: A Simple Setup for People Who Hate Voicemail

I hate voicemail. My inbox was mostly full of robocalls, accidental pocket dials, and 'call me back' messages that would have been far clearer in writing. Eventually, I decided that voicemail had to go.

The solution turned out to be wonderfully simple: instead of letting my carrier send unanswered calls to a traditional voicemail box, I now forward them to a Twilio number that plays a short, polite message and immediately hangs up.

Step 1: Get a Twilio Number

After creating a Twilio account, I rented a local phone number. This costs me $1.15 per month.

Step 2: Create a TwiML Bin

Twilio uses TwiML (Twilio Markup Language) to define call behaviour. In the Twilio Console, I created a new TwiML Bin containing my greeting logic.

Here is a sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="male">
    Hello. I do not use voicemail. Please send me a text or email instead. Goodbye.
  </Say>
  <Hangup/>
</Response>

This answers the call, speaks the message, and hangs up immediately. If you prefer a recorded audio file, you can replace <Say> with <Play> and point it to a publicly accessible MP3 file.

Step 3: Connect the TwiML to the Number

In Twilio's number settings, I set the Voice configuration to use my new TwiML Bin. Calling the Twilio number now plays the message and disconnects properly.

Step 4: Forward Unanswered Calls to Twilio

Here is the key part: telling my mobile network to forward all conditional calls (busy, no-answer, unreachable) to this Twilio number instead of voicemail.

On T-Mobile US, that is done with a single GSM code:

**004*<TwilioNumber>*11#

Dial it on your phone, press call, and you will receive a confirmation. This ensures any call that would have gone to voicemail is redirected to Twilio instead.

To remove forwarding later, dial:

##004#

What It Costs

Twilio charges:

  • About $1-$1.50 per month for the virtual phone number.
  • About $0.008-$0.009 per minute for incoming calls.

Because my greeting is under 10 seconds, each forwarded call costs only a fraction of a cent. Even ten missed calls a month add up to only a few pennies.

Future Possibilities

The setup leaves plenty of room for adding smarter call handling later on.. In the future, you can add:

  • Different greetings or handling for different callers.
  • Automatic SMS replies explaining why voicemail is disabled.
  • Integration with email, messaging apps, or your calendar.
  • Logging or analytics to see who tried calling, without needing voicemail.

And That's It - Voicemail Is Gone

Now, whenever I ignore or miss a call, the caller is forwarded to my Twilio line, hears:

"I do not use voicemail. Please send me a text or email instead."

...and the call ends. There is no mechanism for leaving a message. My voicemail inbox is gone, notifications are gone, and life is significantly quieter.

Essence and Accidents of AI-Generated Proofs

There is an ongoing, sometimes heated, argument between people embracing AI "vibe coding" as a great new technology liberating us from mundane tasks and allowing us to focus on high-level problems, vs people considering it is "stupid autocomplete" which produces "AI slop" while dumbing down writers who routinely rely on it, in the process.

We will frame this discussion around AI-assisted proof synthesis (e.g., for the Lean proof assistant). While proof and code synthesis are conceptually similar (by the Curry-Howard correspondence), the former is more illustrative in an academic context, as it brings questions of originality, attribution, and credit assignment into sharper focus. Additionally, there are a couple of important differences pertinent to using AI:

  • Correctness criteria. Once the theorem is formulated, the proof assistant can check the candidate proof and determine whether the given theorem statement is indeed proven.
  • Proof irrelevance. Any two proofs of the same proposition are considered indistinguishable and equally valid.

These differences address concerns about the correctness of generated proof (which can be checked) and eliminate the need for human review, thereby invalidating the "AI slop" concern (style or verbosity does not matter, as the proof term is intended for the checker, not for humans).

🧠 Informal Problem
        │ 👤 Human — formalisation
        ▼
📄 Lean Specification
        │ 🤖 AI — script generation
        ▼
📝 Proof Script
        │ LΞ⋀N — elaboration + checking
        ▼
✔️ Verified Proof Term

This leaves us with more fundamental questions: 1) Can the AI-generated proofs be considered genuinely innovative or just a rehashing of existing prior work? 2) How much of the result could be attributed to the individual vs the machine?

The first question is a topic of ongoing debate about AI, which goes beyond proof or code synthesis applications and is outside the scope of this essay.

To answer the second one, we first informally divide proofs into two broad categories. The first type of proof is primarily mundane and mechanical, for example, as is common in proofs in program verification, which usually involve structural induction, case splits, term unfolding, and normalisation, etc. Such proof generation could be viewed as automation, similar to SAT solving, using decidable theories such as Presburger arithmetic or using tactics such as the "Sledgehammer" in Isabelle/HOL.

The second type is more intricate proofs in which the solution is non-trivial and requires making deep observations about the nature of the problem, linking it to external theories, identifying relevant invariants, and introducing and proving new conjectures.

Intuitively, the first kind involves no innovation or creativity and requires no credit for intellectual contribution, while the second calls for assigning intellectual credit.

These two categories may correspond to the "accidental" and "essential" difficulties described in [1]. In the context of mechanised proofs, essential difficulties are inherent to the nature of the problem. In contrast, accidental difficulties stem from the mechanics of formalisation, such as proof automation, combinatorial explosion, representational bottlenecks, and limitations of the proof assistant. But is there a precise boundary between these two?

Accidental Complexity Essential Complexity
structural induction finding the right abstraction
case splits identifying key invariants
term unfolding / reduction introducing new definitions or concepts
normalisation / rewriting linking to external theories
proof repair / refactoring formulating auxiliary lemmas
heuristic lemma search discovering new conjectures
managing combinatorial explosion reframing the problem conceptually

Let us reflect on the pre-modern-AI era, on computer algebra systems like Maple or Mathematica, which can symbolically solve some differential equations. Mathematicians typically do not list Maple/Mathematica as coauthors. Back then, one might say that the problems such a system could solve are merely accidental difficulties. Conversely, if it could not be solved by it, the difficulty may be either essential or accidental.

Unfortunately, with modern AI, we can no longer use such a simple distinction. It is claimed, albeit still a matter of debate, that modern AI may exhibit emergent properties that go beyond what it was trained on. This leaves the possibility of producing truly innovative results and thus solving problems of essential difficulty.

Unsatisfactorily, this leaves the line between the proofs of accidental and essential difficulty not firmly fixed. It remains a judgment call. Moreover, historically, such a line has shifted: many mathematical proofs of problems that were regarded as essential in the 19th century are now treated as accidental, as the development of new methods turned once-insightful arguments into routine applications.

As of the time of this writing (November 2025), the type of proofs which AI could reliably generate falls into the first category: mundane and mechanical (accidental complexity). So if you are using AI to grind out these proofs, it would be safe to consider it just an automation aid that could be used without reservation. As the state of the art in AI models evolves and they can tackle the essential complexity in proofs, we may need to start crediting them for their contributions.

References:

[1] Brooks, "No Silver Bullet Essence and Accidents of Software Engineering," in Computer, vol. 20, no. 4, pp. 10–19, April 1987, doi: 10.1109/MC.1987.1663532. https://www.cs.unc.edu/techreports/86-020.pdf

Evolution email setup for Tufts.edu

If anyone is trying to set up the Evolution email client to access a Tufts University (tufts.edu) email account, I hope a Google search leads you here! After spending some time on this, I want to document the working steps for the public record.

  1. When adding an account start with your official email address John.Smith@tufts.edu
  2. Untick "Look up mail server details based on entered e-mail address". (the auto-discovery does not work and takes forever.)
  3. Use Microsoft365 account type (not EWS or IMAP/SMTP - they both do not work)
  4. For login name, use something like jsmith01@tufts.edu (not John.Smith@tufts.edu)
  5. Chose OAuth as the authentication method
  6.  For OAuth, in the advanced options section provide Tufts Outlook Tenant ID: fbbf6c60-0344-4c29-9459-725685739b19
  7. Optionally configure Draft, Sent, and Archive folders to reside on the server. 

Hope this helps!