Schwa

Schwa is a hybrid of Scheme and C which gives you access to the best features of both languages, plus extensions for natural language processing. The low-level and compute-intensive parts of your code are written in ANSI/POSIX C, with its efficiency, succinct notation, and access to the full range of C libraries. This C code can also use an allocator with automatic garbage collection and a range of Scheme-like dynamic data structures. The Schwa interpreter provides a Scheme-like front-end for scripting, debugging, and building simple user interfaces.

Many systems have tried to mate compiled low-level code with an interpreted front-end. However, these have concentrated almost entirely on ease of programming in the interpreted language. A distinctive feature of Schwa is that it adds functional language features to the compiled language. This makes the compiled language easier to use, and the compiled and interpreted environments more similar. This, in turn, simplifies linking new compiled functions into the interpreter. (A less thorough-going version of this approach was used successfully by Elk.)

The target audience for Schwa is researchers in natural language processing (NLP) who want to build experimental tools quickly, but with reasonably good efficiency (storage space, computing time). It is particularly designed for building small packages of closely-related programs (e.g. doing several related types of analysis on several specific types of corpus data) and applications that depend on C libraries (e.g. networking, signal processing). Schwa has been deliberately kept small and general-purpose, to improve reliability and to support extensions for different types of applications.

Schwa also contains new features to simplify building natural language applications. Symbol names can be arbitrary strings (including non-ASCII characters). Native support is provided for ngrams (interned lists of symbols). Users have direct access to mappings from strings and ngrams to packed ID numbers, a normally hidden component of symbol, tuple, and language model implementations. Hash tables and some basic operations on garbage-collected strings (e.g. sprintf, substring) are built in. A new "bundle" datatype can implement a wide range of constructs, including feature bundles and dynamically-typed structs/records.
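To make the idea of mapping strings and ngrams to packed ID numbers concrete, here is a minimal sketch in plain C. It is not Schwa's API or its internal representation: the names InternTable, intern, intern_name, and bigram_key are hypothetical, the table has a fixed capacity, and it uses malloc rather than a garbage-collected allocator. Strings are treated as byte sequences, so non-ASCII (e.g. UTF-8) names work unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed capacity for this sketch; a real table would grow. */
#define MAX_SYMBOLS 1024u
#define NUM_SLOTS   (MAX_SYMBOLS * 2u)   /* power of two, at most half full */

typedef struct {
    char    *names[MAX_SYMBOLS];  /* ID -> string */
    uint32_t slots[NUM_SLOTS];    /* hash slot -> ID + 1, 0 means empty */
    uint32_t count;               /* number of IDs handed out so far */
} InternTable;

/* FNV-1a hash over the bytes of a NUL-terminated string. */
static uint32_t hash_bytes(const char *s) {
    uint32_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Return the ID for s, assigning the next dense ID on first sight. */
uint32_t intern(InternTable *t, const char *s) {
    uint32_t mask = NUM_SLOTS - 1u;
    uint32_t i = hash_bytes(s) & mask;
    while (t->slots[i] != 0) {               /* linear probing */
        uint32_t id = t->slots[i] - 1u;
        if (strcmp(t->names[id], s) == 0)
            return id;                       /* already interned */
        i = (i + 1u) & mask;
    }
    assert(t->count < MAX_SYMBOLS);          /* sketch: no resizing */
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, s, n);
    t->names[t->count] = copy;
    t->slots[i] = t->count + 1u;
    return t->count++;
}

/* Map an ID back to its string, or NULL if the ID was never assigned. */
const char *intern_name(const InternTable *t, uint32_t id) {
    return id < t->count ? t->names[id] : NULL;
}

/* Pack two symbol IDs into one 64-bit key, so that a bigram can itself
   be compared or hashed as a single number. */
uint64_t bigram_key(uint32_t first, uint32_t second) {
    return ((uint64_t)first << 32) | second;
}
```

A zero-initialized table (InternTable t = {0};) is ready to use. Interning the same string twice returns the same ID, so symbol equality reduces to an integer comparison, and a bigram of two symbols becomes a single 64-bit key.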

Schwa is suitable for many tasks in natural language processing. However, there are exceptions. Some tasks (e.g. grep) are already well supported by extremely mature software. Very large datasets or low-memory platforms (e.g. handhelds) may require special-purpose code. Some tasks may eventually work in Schwa, but will need to wait until more memory is available or a major code rewrite is appropriate.

The Schwa distribution includes a user manual and implementation notes, as well as a small collection of tools and demo programs to help get you started. The tools include a simple SGML/XML/HTML parser, a basic tokenizer, and readers for the CMU pronouncing dictionary and the Mississippi State Switchboard transcriptions.

Requirements, downloading a copy, etc.

Schwa is distributed under the University of Illinois's open source license. The most recent version (1.0) can be downloaded from here. This is the first release of Schwa, so there will undoubtedly be a couple of quick rounds of update releases to fix glitches.

Schwa runs on 32-bit Linux computers. It should run on any reasonably recent Linux release.

Schwa depends on Hans Boehm's garbage collector, which can be obtained from Boehm's GC home page or (if that is somehow unavailable) from my local copy of version 6.5.

Contacts

Questions, comments, and bug reports should be sent to Margaret Fleck (mfleck@cs.uiuc.edu or margaretmfleck@yahoo.com).