Interpreter 3: Tokenizer

Table of Contents

For this part of the assignment, you'll build a tokenizer for Scheme, in C. (You must implement the tokenizer from scratch – don't use Lex or any similar tool, even if you happen to know what it is and how to use it.)

1. Pair programming structure

This is a "pair" assignment, which means that if you are working on a team with someone else, you and your partner should engage in the pair programming model. You should be physically together, sharing a computer and both looking at the same screen. Each partner has an active role to play during pair programming - please review the instructions on Moodle about pair programming, and make sure you understand what you should be doing when you are the navigator and when you are the driver.

You should make sure that over the course of an assignment you each spend roughly the same amount of time "driving." My recommendation is to take turns approximately every 15 minutes or so. Set a timer to help you remember. I will also ask you to fill out a survey about your partnership at the end of the term.

For the interpreter project, you are permitted to do a limited amount of debugging separate from your partner. If you do this, you should be sure that you're each contributing similar amounts: it's not fair, either in terms of time or in terms of learning, if one person does all the debugging. Before debugging on your own, make sure that your partner is on board with that. Also, make sure to share what you found with your partner afterwards.

If pair programming in real-time just doesn't work for you and your partner, then you will need to amicably split up and work individually. If you choose this option, you must communicate that to me [Anna] and send me an email with a zip file containing the code checkpoint of any work you did together (if applicable).

If things are not going well with working with your partner or you find yourselves tempted to split up the work, please talk to me. While doing a limited amount of debugging separately is fine, you should be doing the initial design and coding together - if you or your partner cannot explain how part of your interpreter works, then you have not followed the pair programming policies for this class.

2. Get started

You'll download the folder with the starter code from the link below. Like for previous homeworks, unzip the folder and move it to where you'd like: most likely the folder for the Docker container. Then get started in VS Code.

If when running tests you run into a permissions error, you can fix this by running the following code in the terminal:

chmod a+x test-e test-m

This tells the computer that all users should be permitted to execute (run) the files test-e and test-m.

Click here to download the starter files.

3. Assignment details

You should create tokenizer.c to implement the functions specified in tokenizer.h. More generally, your tokenizer should read Scheme code from a file and return a linked list of tokens accompanied by the type of each token. You must include the following token types:

boolean, integer, double, string, symbol, open, close, quote

The last three are for (, ), and ' respectively.

The output format is, per line, token:type.

On comments: You should strip out comments as you tokenize, so that anything after a ; on a line is ignored.

When detecting the end of a line, make sure to be super careful if you have created your test file with an editor on Microsoft Windows. Linux and Mac use an ASCII \n at the end of a line, but Windows uses \r\n. All of our tests run on Linux, so you don't need to worry about this unless you are creating your own test files on Windows. If you are, you need to either create your test files on Mac/Linux (i.e., on the lab computers), or make sure that your tokenizer code can also handle \r\n at the end of a line.

For example, here's the input file t04.scm that I supplied in the exemplary tests:

(+ x2 ( + ( quote x ) "foo;; still a string" 323) ;; now it's a comment!

On the above, your program should output:

(:open
+:symbol
x2:symbol
(:open
+:symbol
(:open
quote:symbol
x:symbol
):close
"foo;; still a string":string
323:integer
):close
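
Since everything after a ; on a line is discarded, one way to strip comments is a small helper that consumes the rest of the line once a ; has been read. This is just a sketch (the helper name is my own, not part of the starter code), reading with fgetc as suggested later; it stops on \r as well as \n so Windows-style line endings are handled too:

```c
#include <stdio.h>

/* Consume the remainder of the current line after a ';' has been
   read. Stops after '\n' or '\r' (so Windows "\r\n" files work),
   or at EOF. Helper name is illustrative only. */
void skipComment(FILE *in) {
    int c = fgetc(in);
    while (c != EOF && c != '\n' && c != '\r') {
        c = fgetc(in);
    }
}
```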

Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Syntax error: untokenizeable", or you can provide more detailed information depending on what the problem is. Whatever you print should start with the text "Syntax error" so that it will pass the automated tests. (Upper or lower case doesn't matter.) After printing the error, your program should exit - don't read any more input. This is why we wrote the function texit. Also feel free to follow your error message by printing the Scheme code around the error if you can, though this is optional.
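
One way to structure this is a single error-reporting helper. The sketch below uses exit so it stands alone; in the real project you'd call texit(1) from talloc.h instead, so that talloc'd memory is freed before exiting (the helper name here is my own):

```c
#include <stdio.h>
#include <stdlib.h>

/* The message must begin with "Syntax error" to pass the tests. */
const char *kSyntaxErrorMsg = "Syntax error: untokenizeable";

/* Print the error and stop without reading any more input.
   In the real tokenizer, call texit(1) instead of exit(1). */
void reportSyntaxError(void) {
    printf("%s\n", kSyntaxErrorMsg);
    exit(1);
}
```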

4. Syntax details

Scheme is a complex language. In the interest of making things more tractable, here is a simplified syntax for numbers that I expect your tokenizer to handle (adapted from Dybvig's book):

<number>   ->  <sign> <ureal> | <ureal>
<sign>     ->  + | -
<ureal>    ->  <uinteger> | <udecimal>
<uinteger> ->  <digit>+
<udecimal> ->  . <digit>+ | <digit>+ . <digit>*
<digit>    ->  0 | 1 | ... | 9

Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions strtol and strtod in the standard library.
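
For instance, once you've collected a number's characters into a string, you might convert it like this (the function names are my own; the real code would store the result in a SchemeItem):

```c
#include <stdlib.h>
#include <string.h>

/* Under the BNF above, a number token is a double exactly when it
   contains a '.'; otherwise it's an integer. */
int isDoubleToken(const char *text) {
    return strchr(text, '.') != NULL;
}

long toInteger(const char *text) {
    return strtol(text, NULL, 10);   /* base 10 */
}

double toDouble(const char *text) {
    return strtod(text, NULL);
}
```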

Similarly, you should recognize symbols (identifiers) with the following syntax (again adapted from Dybvig's book):

<identifier> ->  <initial> <subsequent>* | + | -
<initial>    ->  <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
<subsequent> ->  <initial> | <digit> | . | + | -
<letter>     ->  a | b | ... | z | A | B | ... | Z
<digit>      ->  0 | 1 | ... | 9 

This is a little inconsistent with the actual behavior of Scheme, but it simplifies things a bit (at the cost of some minor incompatibilities).

Symbols are case-sensitive.
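
One convenient way to encode this grammar is a pair of character predicates, sketched below (the helper names are mine; any correct encoding of the BNF is fine):

```c
#include <string.h>

/* Matches <initial> in the identifier BNF above. */
int isInitial(int c) {
    if (c == '\0') {
        return 0;   /* guard: strchr would match the terminator */
    }
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           strchr("!$%&*/:<=>?~_^", c) != NULL;
}

/* Matches <subsequent> in the identifier BNF above. */
int isSubsequent(int c) {
    return isInitial(c) || (c >= '0' && c <= '9') ||
           c == '.' || c == '+' || c == '-';
}
```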

You may also find the syntax section of Dybvig's book to be helpful. The dialect described may not be completely consistent with the above, but it's readable in English. The BNF that I have provided above is considered to be the correct specification for this assignment. It deviates slightly from what Guile allows, but in ways that I think end up making some things, like differentiating numbers and symbols, easier.

Scheme provides many different ways of enclosing lists (parentheses, square brackets, etc.). We will only write code that uses parentheses for the purpose of enclosing lists. None of our test code will use square brackets. You can write your tokenizer and subsequent parts of the interpreter to only work on parentheses. That said, if you want your project to also work on square brackets, it isn't that hard of an extension. If you think you want to go that way, you'll need to make sure that you tokenize parentheses and brackets differently.

We will identify the single quote ' as a special token all of its own. It should be stored in a SchemeItem as a QUOTE_TYPE and displayed by your tokenizer accordingly. Don't get this confused with the word quote, which is tokenized as a regular symbol.

Tokens are separated by various forms of whitespace (space, newline, tab, etc). Use the C function isspace to test if a given character is a whitespace character.
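
A helper built on isspace might look like this sketch (the name is mine), which consumes whitespace and hands back the first interesting character:

```c
#include <ctype.h>
#include <stdio.h>

/* Consume whitespace characters and return the first
   non-whitespace character read, or EOF. */
int skipWhitespace(FILE *in) {
    int c = fgetc(in);
    while (c != EOF && isspace(c)) {
        c = fgetc(in);
    }
    return c;
}
```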

5. Some additional hints

It is theoretically possible to code this assignment so that it produces precisely the right output without actually storing the data in a list. Don't do that: even if it were easier to get results on this portion of the project (and I'm not convinced it would be), it would leave you in a useless position to continue with the next portion.

There are many different ways of reading input files. I found it most useful to use the functions fgetc and occasionally ungetc. You can look those up; documentation is also available here.

At some points in my code, I wanted to find out things like "Is the current character a letter?" Conveniently, characters can be compared using < and >, and the characters a-z, A-Z, and 0-9 all form their own ranges. So, I can write code like:

if (myChar >= 'A' && myChar <= 'Z') {
 ... //Whatever I want to do in this case goes here
}

This tests if myChar is in A-Z - i.e., it tests if myChar is an uppercase letter. I can do similar things for a-z and 0-9; note that if you're comparing to, say, the character 0, you want to compare to '0', not 0 (i.e., don't forget the single quotes!).

The heart of your code will be your tokenize function, which reads the file and returns a list of SchemeItems. Here's a rough sketch of what that function might look like:

SchemeItem *tokenize() {
    int charRead;
    SchemeItem *list = makeEmpty();
    charRead = fgetc(stdin);

    while (charRead != EOF) {

        if (charRead == ...) {
            ...
        } else if (charRead == ...) {
            ...
        } else {
            ...
        }
        charRead = fgetc(stdin);
    }


    SchemeItem *revList = reverse(list);
    return revList;
}

Note that the type of charRead is an int, not a char. This is because the version of clang included with Docker gives a warning if charRead is a char and you compare it with EOF: char may be an unsigned type, while EOF is represented by the negative value -1.

You can still compare charRead, an int, with character literals - e.g.,

if (charRead == '(' ) {
  ...
}

We will guarantee that no token will be longer than 300 characters. That includes strings. Your code doesn't need to handle arbitrarily long tokens.

We will guarantee that no line will be longer than 300 characters. Your code doesn't need to handle arbitrarily long lines.
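
These guarantees mean a fixed-size buffer suffices for collecting a token's characters. Here's a sketch under that assumption (the function name is mine, and it splits only on whitespace purely for illustration - the real tokenizer also splits on parentheses, quotes, and so on):

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_TOKEN_LEN 300

/* Read one whitespace-delimited chunk into buf (301 bytes: 300
   characters plus the terminating '\0'). Returns its length. */
int readChunk(FILE *in, char buf[MAX_TOKEN_LEN + 1]) {
    int c = fgetc(in);
    int len = 0;
    while (c != EOF && !isspace(c) && len < MAX_TOKEN_LEN) {
        buf[len++] = (char)c;
        c = fgetc(in);
    }
    buf[len] = '\0';
    return len;
}
```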

FAQ: Should your tokenizer catch errors with mismatched parentheses, i.e. )(a b ))? No, it should tokenize input like that without error. Catching mismatched parentheses is the parser's job.

6. The files you start with

After you download the starter code, you should be able to see that you get the following starting files:

  • schemeitem.h: essentially the same header we have been using for a SchemeItem, but with some new types added
  • linkedlist.h: the same header file, more or less, from the last assignment
  • talloc.h: the same header file, more or less, from the last assignment
  • tokenizer.h: the header file for the functions you'll need to write for this assignment. You'll also create a number of helper functions for that, but those don't need to appear in the header since they will be "private".
  • main.c: Short main function that drives the whole program.
  • justfile: contains instructions for the command just, which will compile and test your code
  • test-files-m/ and test-files-e/: a directory of Scheme programs as test inputs, and expected outputs
  • test-m and test-e: the usual testing scripts that you run
  • tester.py: a helper script to handle automated testing.
  • lib: a subdirectory of binaries, see below

At this point, you have a choice regarding how to proceed for linkedlist.c and talloc.c. If you want to continue to build your own interpreter that is truly yours, you can copy in these files from the last project, and move forward. Alternatively, if you would prefer to use my versions instead, you can do so. In the lib directory you'll see linkedlist.o and talloc.o files, as well as .h files to go with them. These .o files are compiled binary versions of my linkedlist.c and talloc.c. If you would prefer to use one or both of these in your project instead of your own, you can replace them in the justfile. Specifically, in the justfile, change the first line as indicated. I provide these binaries so that people who have trouble with earlier parts of the project don't get slowed down, but I strongly encourage you to use your own code, if it works, for two reasons:

  • You'll feel a much stronger sense of ownership for the project if you do.
  • If you use my binaries, you won't be able to trace bugs into the linked list / talloc source code if you need to. You'll sometimes have to make do with cryptic errors from my code that say "linked list insertion error" or something like that. My code works as specified, but it will produce bad and sometimes difficult-to-understand output if it gets bad input.

Note that compiled binaries are operating system specific (sometimes version specific), and thus might not work outside of the course Docker container. If you run into troubles using my binaries (either on your own computer, or for unforeseen reasons in the Docker container), please let me know and we can talk about options.

To compile your code, issue the command just build at the command prompt. (This will give an error if you haven't yet created the .c files you need, or if you haven't changed the justfile to refer to the binaries if you need them.)

To run your code, first create a file with some of your own Scheme code in it, and give it a filename (e.g., myprog.scm). Then issue this command at a prompt:

./interpreter < myprog.scm

Your tokenizer code should read data from "stdin." In that case, the above command will redirect the text from the myprog.scm file into your program.

To run your code through valgrind, you can similarly type

valgrind ./interpreter < myprog.scm

One challenge you may find is that when your output gets long, it gets hard to read before it scrolls all the way off the top of your terminal window. Piping your output through the UNIX less program helps. It will pause the output at the bottom of the screen, waiting for you to press the space bar to continue, or q if you want to quit. You can do this with or without valgrind:

./interpreter < myprog.scm | less
valgrind ./interpreter < myprog.scm | less

7. Optional extensions

Work in this section is 100% optional, and not worth any points. Nonetheless, if you're looking for an extra challenge, these are fun additional exercises to try.

Here are some additional details in the Scheme language that you can integrate if you wish; they will be unnecessary for the core part of the project going forward, though they may play into later optional extensions:

  • Dot: .. A dot by itself is useful for handling dotted pairs. A dot token type should be identified if and only if there is a dot with whitespace on both sides of it. A dot at the beginning or end of file cannot be a dot token.
  • Square brackets: [ ]. Add openbracket and closebracket token types, and tokenize these as well.

7.1. Studying for exams

The exam may ask you to be able to identify what sorts of errors your tokenizer can catch, vs. what it might not, given a particular BNF. Think about changing the BNF provided in the assignment, and then identifying what sorts of input should be tokenized. Also think through what sorts of programming errors should not be caught by the tokenizer.

8. How to test and submit your work

8.1. Testing your work

Go back and look at the sections at the end of Scheme Intro 2 labeled "How to test your work". Everything there about how to run M tests and E tests applies identically here. Note that the tests will fail if your code has compiler warnings and/or valgrind errors, as well as if its behavior is observably incorrect.

8.2. Submitting your work

You'll submit your work on Gradescope. You should already have a file CollaborationsAndSources.txt in your working directory (if not, copy it over from the previous assignment). Go to the header for this portion of the assignment and fill in the document. Indicate in what ways (if any) you collaborated with other people on this assignment (did you talk with anyone other than Anna or our prefect? Did you share strategies with anyone? Get or give advice on debugging? These are fine things to do, and you should note them in the CollaborationsAndSources.txt file). If you used any resources outside of your notes from class, that is also something to note in CollaborationsAndSources.txt. If you didn't talk with anyone or use any outside sources, please note that explicitly. Look back at the Scheme Intro 1 instructions if you want more details on the CollaborationsAndSources.txt file.

After updating CollaborationsAndSources.txt and finishing the assignment, run ./zipitup

This will zip up your C files, header files, and collaborations and sources. We will be re-copying in our version of the tests and of the included header files.

Then, upload the created zip to Gradescope.

On Gradescope, make sure to add your partner if you worked with a partner. This page has detailed instructions for how to do that.

On Gradescope, you should see autograder output for the same tests as you ran locally. There is a small but nonzero chance that you have coded something unexpected that causes the tests to pass on the computer you're working on, but not on Gradescope. We will be grading this assignment based on the results of the tests on Gradescope, so make sure to pay attention to how the tests run there when you submit.

For grading, you'll earn at least an M if the following conditions are all met:

  • All M tests pass
  • A visual inspection of your code shows that you have not hyper-tailored your code to pass the tests. Hyper-tailoring your code would be doing things like checking for the exact input from the test, meaning you haven't written your code in a way that would work for similar inputs (see Homework policies for more on this).
  • You have a CollaborationsAndSources.txt with the information described above.
  • The work is submitted by the late deadline.

You'll earn an E if:

  • All conditions above for an M are met.
  • All the E tests pass.
  • Your code is not significantly more complex than needed to accomplish the task.
  • You have a comment before each function describing what the function does (its input/output behavior, not a play-by-play of "first checks… then makes a recursive call …").
  • The work is submitted by the regular deadline, or it is submitted by the late deadline and you elect to use a late token by submitting a regrading request.

You may earn a grade of M or SP (Some Progress) and be offered to resubmit for an improved grade if:

  • All tests pass (all E tests if you can resubmit for an E, and all M tests if you can resubmit for an M)
  • Your code has significant stylistic or efficiency issues that we have not explicitly covered in class or on the feedback for a previous assignment.

If your code does not meet the criteria for an M or an E, then you'll earn a grade of NA (Not Assessable) or SP (Some Progress) for the assignment. Not Assessable means that your submission does not have an attempt at one or more functions in the assignment. Some Progress means that your submission has some code (in the correct language) for all functions in the assignment, but that the code does not pass all of the M tests, is hyper-tailored to the tests, and/or is missing a CollaborationsAndSources.txt file.

Good luck, ask questions, and have fun!

This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Jed Yang, and Laura Effinger-Dean. Thanks for sharing!