Wednesday, June 27, 2012

Formal Semantics for Top 5 Programming Languages

A recent blog post on undefined behavior in C got me thinking. Being from the ML community, I have a certain appreciation for rigorous formal semantics that can be machine checked. Though practical machine checking is largely a new development, rigorous formal semantics has been with us for decades. Standard ML is the epitome of this approach to language design. The 48-page (128-pp total when including appendices and TOC/index) The Definition of Standard ML - Revised formally specifies the entirety of the language1. Don't get me wrong. The Standard isn't perfect. Indeed, it has some bugs of its own [PDF]. Nevertheless, for the most part, the Standard has raised the level of discourse and enabled succinct yet precise descriptions of a powerful higher-order typed language possible. This approach to modeling, defining, and evolving programming languages has in recent years taken a life of its own having been applied to both new experimental languages as well as time-tested existing languages.

The claimed advantages of formal semantics versus informal prose specification are many-fold:

  1. It remove ambiguities in the specification.
  2. Formal semantics are succinct compared to prose with a similar degree of rigor.
  3. They can be machine-checked and at least lend itself to formal proof of important properties such as type and memory safety.
  4. They serve as a single reference for independent implementors of a single language in order to maintain consistency between different compilers and interpreters.
  5. They can assist in identifying and pinpointing bugs in ad hoc semantics.
  6. Some have argued that formal semantics can be a lot clearer and easier to understand than prose specifications.
There are disadvantages too. Until recently, there was generally a gap between formal semantics and implementation. Even though formal semantics may be correct, an unfaithful implementation may still be buggy. Recent work in certified compilation has mitigated this problem somewhat.

In this post, I'd like to organize representative formal semantics of popular languages. Since most of these languages were not designed with formal semantics in mind as was Standard ML, the respective formal semantics is likely to be incomplete and only modeling a fragment of the language.

Java

Featherweight Java, a simplified form of Java with a formal semantics, was first described in a 2001 paper from Igarashi, Pierce, and Wadler. This idealized, cut-down variation on Java omitted all features except classes, methods, fields, inheritance, and dynamic typecasts. From that foundation, the authors were able to mathematically prove the type safety of Featherweight Java. This was an important achievement considering that Java's type system had known problems beforehand.

C

Clight [PDF] is INRIA's formal semantics for a subset of the C language. It is the basis for the CompCert verified compilation research program. Ellison and Rosu have an alternative semantics in a POPL 2012 paper [PDF] specified using Maude Rewriting Logic.

JavaScript

JavaScript/ECMAScript has a number of formal semantics. Maffeis et al [PDF] developed an operational semantics. Guha et al [PDF] had a follow-on semantics that takes a more Featherweight Java approach to the matter and lends itself more easily to standard type safety proof approaches. Chargueraud has yet another formal semantics implemented in the Coq proof assistant. Herman and Flanagan take a very different approach by attempting to specify ECMAScript semantics through Standard ML [PDF].

Python and Ruby

The popular dynamic scripting languages Python and Ruby are a bit troublesome from the perspective of formal specification. Because both are rapidly changing languages which do not have any generally accepted specification other than the main implementation. This brings up another potential advantage of formal semantics. By formulating a formal semantics, independent implementations of a given language can remain source-compatible. Instead of having each independent implementor guess at the semantics and experimentally determine behavior through trial and error testing with the main implementation, implementors can simply refer to the formal semantics. The claim is that having the Definition of Standard ML helped keep the many implementations of Standard ML compatible and synchronized. Despite this hurdle, there have been a number of researchers working on formal semantics for these dynamic scripting languages. Gideon Smeding developed an operational semantics for Python for his Master's thesis. More recently, a group of researchers from Brown, Argentina, and China prepared a semantics for Python using Racket and PLT Redex. Avik Chaudhuri works on formal semantics for Ruby and now ActionScript.

PHP

Though they do not provide a complete semantics for even a core fragment of the language, IBM Research studied the semantics of PHP's copy-on-write assignment behavior [PDF].

Scheme

Slightly off the beaten path, Matthews and Findler developed the semantics for R5RS Scheme [PDF]. This was an interesting work. Certain parts of R5RS were intentionally left underspecified such as order of evaluation. Furthermore, pretty involved features such as dynamic-wind were also formalized. Most importantly, the resultant semantics was executable in the sense that it was encoded in PLT Redex, a powerful semantics specification and simulation tool that is very helpful for "debugging" semantics.


1With the exception of the higher-order module system whose formal semantics was the subject of my dissertation [PDF].

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.