Interpreter 3: Tokenizer

Table of Contents

For this part of the assignment, you'll build a tokenizer for Scheme, in C. (You must implement the tokenizer from scratch – don't use Lex or any similar tool, even if you happen to know what it is and how to use it.)

1. Pair programming structure

This is a "pair" assignment, which means that you and your partner should use the pair programming model. You should be physically together, sharing a computer and both looking at the same screen.

Each partner has an active role to play during pair programming - please review the instructions on Moodle about pair programming, and make sure you understand what you should be doing when you are the navigator and when you are the driver. Make sure that, over the course of an assignment, you each spend roughly the same amount of time "driving." My recommendation is to set a timer and switch drivers every 10-15 minutes.

For the interpreter project, you are permitted to do a limited amount of debugging separate from your partner. If you do this, make sure that you're each contributing similar amounts: it's not fair, either in terms of time or in terms of learning, if one person does all the debugging. Before debugging on your own, make sure that your partner is on board with that. Also, make sure to share what you found with your partner afterwards.

If pair programming in real-time just doesn't work for you and your partner, then you will need to amicably split up and work individually. If you choose this option, you must communicate that to me [Anna] and send me an email with a zip file containing the code checkpoint of any work you did together (if applicable).

If things are not going well with working with your partner or you find yourselves tempted to split up the work, please talk to me. While doing a limited amount of debugging separately is fine, you should be doing the initial design and coding together - if you or your partner cannot explain how part of your interpreter works, then you have not followed the pair programming policies for this class.

2. Get started

Download the folder with the starter code from the link below. As with previous homeworks, unzip the folder and move it to where you'd like - most likely the folder for the Docker container. Then get started in VS Code.

If when running tests you run into a permissions error, you can fix this by running the following code in the terminal:

chmod a+x test-e test-m

This tells the computer that all users should be permitted to execute (run) the files test-e and test-m.

Click here to download the starter files.

3. Assignment details

You should create tokenizer.c to implement the functions specified in tokenizer.h. More generally, your tokenizer should read Scheme code from a file and return a linked list of tokens accompanied by the type of each token. You must include the following token types:

boolean, integer, real, string, symbol, open, close, quote

The last three are for (, ), and ' respectively.

The output format is, per line, token:type.

On comments: You should strip out comments as you tokenize, so that anything after a ; on a line is ignored.

When detecting the end of a line, be careful if you created your test file with an editor on Microsoft Windows. Linux and Mac use an ASCII \n at the end of a line, but Windows uses \r\n. All of our tests run on Linux, so you don't need to worry about this unless you are creating your own test files on Windows. If you are, either create your test files on Mac/Linux (e.g., on the lab computers), or make sure that your tokenizer code can also handle \r\n at the ends of lines.
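One way to handle both comment stripping and line endings at once is to consume everything up to (and including) the newline once a ; is seen. Here's a minimal sketch, assuming an fgetc-based reader; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: after reading a ';', consume the rest of the
   line so the comment is ignored. Stopping at '\n' (and consuming it)
   also swallows any preceding '\r', so Windows line endings work too. */
void skipComment(FILE *in) {
    int c = fgetc(in);
    while (c != EOF && c != '\n') {
        c = fgetc(in);
    }
}
```

Stopping at EOF as well as '\n' matters: a comment on the last line of a file may have no newline after it.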

For example, here's the input file t04.scm that I supplied in the exemplary tests:

(+ x2 ( + ( quote x ) "foo;; still a string" 323) ;; now it's a comment!

Given the above input, your program should output:

(:open
+:symbol
x2:symbol
(:open
+:symbol
(:open
quote:symbol
x:symbol
):close
"foo;; still a string":string
323:integer
):close

Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Syntax error: untokenizeable", or you can provide more detailed information depending on what the problem is. Whatever you print, it should start with the text "Syntax error" so that it will pass the automated tests (upper or lower case doesn't matter). After printing the error, your program should exit - don't read any more input. This is why we wrote the function texit. Also feel free to follow your error message by printing the Scheme code around the error to be helpful if you can, though this is optional.
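The requirement is only on the prefix of the output, so one option is to centralize the message formatting. Here's a sketch with a hypothetical helper; in your tokenizer you would print the result and then call texit(1) so talloc'd memory is cleaned up before the program stops:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build an error message with the required
   "Syntax error" prefix. The detail text beyond the prefix is
   yours to choose; only the prefix is checked by the tests. */
void formatSyntaxError(char *buf, size_t len, const char *detail) {
    snprintf(buf, len, "Syntax error: %s", detail);
}
```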

4. Syntax details

Scheme is a complex language. In the interest of making things more tractable, here is a simplified syntax for numbers that I expect your tokenizer to handle (adapted from Dybvig's book):

<number>   ->  <sign> <ureal> | <ureal>
<sign>     ->  + | -
<ureal>    ->  <uinteger> | <udecimal>
<uinteger> ->  <digit>+
<udecimal> ->  . <digit>+ | <digit>+ . <digit>*
<digit>    ->  0 | 1 | ... | 9

Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions strtol and strtod in the standard library.
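As one possible use of that tip, strtol and strtod can both convert a candidate token and, via their endptr argument, confirm that the whole token was consumed. A sketch, with hypothetical names; note the caveat in the comments:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch. Caveat: strtod accepts forms the grammar
   above forbids (e.g., "1e5" or hex floats), so a careful tokenizer
   still matches the BNF character by character and uses these
   library functions only for the final conversion. */
typedef enum { NOT_A_NUMBER, INTEGER_TOKEN, REAL_TOKEN } NumberKind;

NumberKind classifyNumber(const char *tok) {
    char *end;
    strtol(tok, &end, 10);
    if (*tok != '\0' && *end == '\0') {
        return INTEGER_TOKEN;   /* entire token parsed as an integer */
    }
    strtod(tok, &end);
    if (*tok != '\0' && *end == '\0') {
        return REAL_TOKEN;      /* entire token parsed as a decimal */
    }
    return NOT_A_NUMBER;
}
```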

Similarly, you should recognize symbols (identifiers) with the following syntax (again adapted from Dybvig's book):

<identifier> ->  <initial> <subsequent>* | + | -
<initial>    ->  <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
<subsequent> ->  <initial> | <digit> | . | + | -
<letter>     ->  a | b | ... | z | A | B | ... | Z
<digit>      ->  0 | 1 | ... | 9 

This is a little inconsistent with the actual behavior of Scheme, but it simplifies things a bit (at the cost of some minor incompatibilities).

Symbols are case-sensitive.
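The identifier grammar above translates directly into a pair of character predicates. A sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <string.h>

/* The <initial> and <subsequent> productions as predicates.
   Remember the special case in the grammar: "+" and "-" are
   identifiers only when they stand alone, since <initial>
   contains neither character. */
int isInitial(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c != '\0' && strchr("!$%&*/:<=>?~_^", c) != NULL);
}

int isSubsequent(int c) {
    return isInitial(c) || (c >= '0' && c <= '9') ||
           c == '.' || c == '+' || c == '-';
}
```

The explicit ranges mirror the BNF exactly; isalpha/isdigit from <ctype.h> would also work for the letter and digit cases.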

You may also find the syntax section of Dybvig's book to be helpful. The dialect described may not be completely consistent with the above, but it's readable in English. The BNF that I have provided above is considered to be the correct specification for this assignment. It deviates slightly from what Guile allows, but in ways that I think end up making some things, like differentiating numbers and symbols, easier.

Scheme provides many different ways of enclosing lists (parentheses, square brackets, etc.). We will only write code that uses parentheses for the purpose of enclosing lists. None of our test code will use square brackets, so you can write your tokenizer and subsequent parts of the interpreter to work only on parentheses. That said, if you want your project to also work on square brackets, it isn't that hard of an extension. If you think you want to go that way, you'll need to make sure that you tokenize parentheses and brackets differently.

We will identify the single quote ' as a special token all of its own. It should be stored in a SchemeItem as a QUOTE_TYPE and displayed by your tokenizer accordingly. Don't get this confused with the word quote, which is tokenized as a regular symbol.

Tokens are separated by various forms of whitespace (space, newline, tab, etc). Use the C function isspace to test if a given character is a whitespace character.
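Consuming separators with isspace might look like the following sketch (the helper name is hypothetical):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: consume whitespace and return the first
   non-whitespace character (or EOF). isspace expects an unsigned
   char value or EOF, which is one more reason to keep the current
   character in an int rather than a char (see the hints below). */
int skipWhitespace(FILE *in) {
    int c = fgetc(in);
    while (c != EOF && isspace(c)) {
        c = fgetc(in);
    }
    return c;
}
```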

5. Some additional hints

It is theoretically possible to code up this assignment so that it produces precisely the right output without actually storing the tokens in a list. Don't do that: it might make it easier to get results on this portion of the project (although I'm not convinced), but it will leave you in a useless position to continue with the next portion.

There are many different ways of reading input files. I found it most useful to use the functions fgetc and occasionally ungetc. You can look those up; documentation is also available here.

At some points in my code, I wanted to find out things like "Is the current character a letter?" Conveniently, characters can be compared using < and >, and the characters a-z, A-Z, and 0-9 all form their own ranges. So, I can write code like:

if (myChar >= 'A' && myChar <= 'Z') {
 ... //Whatever I want to do in this case goes here
}

This tests whether myChar is an uppercase letter. I can do similar things for a-z and 0-9; note that if you're comparing to, say, the character 0, you want to compare to '0', not 0 (i.e., don't forget the single quotes!).

The heart of your code will be your tokenize function, which reads the file and returns a list of SchemeItems. Here's a rough sketch of what that function might look like:

SchemeItem *tokenize() {
    int charRead;
    SchemeItem *list = makeNull();
    charRead = fgetc(stdin);

    while (charRead != EOF) {

        if (charRead == ...) {
            ...
        } else if (charRead == ...) {
            ...
        } else {
            ...
        }
        charRead = fgetc(stdin);
    }


    SchemeItem *revList = reverse(list);
    return revList;
}

Note that the type of charRead is an int, not a char. fgetc returns an int so that every possible character value can be distinguished from EOF (which is represented by -1); if charRead were a char, the comparison with EOF would be unreliable, and the version of clang that is included with Docker warns about this (on platforms where char is unsigned, comparing it with -1 can never be true).

You can still compare charRead, an int, with character literals - e.g.,

if (charRead == '(' ) {
  ...
}

We will guarantee that no token will be longer than 300 characters; that includes strings. Your code doesn't need to handle arbitrarily long tokens.

We will similarly guarantee that no line will be longer than 300 characters. Your code doesn't need to handle arbitrarily long lines.
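These guarantees mean a fixed-size buffer always suffices when accumulating a token. A sketch, assuming fgetc/ungetc-based reading; the helper name and the exact delimiter set are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_TOKEN_LEN 300

/* Hypothetical sketch: accumulate one symbol token into a buffer of
   at least MAX_TOKEN_LEN + 1 bytes (one extra for '\0'). The
   delimiter that ends the token is pushed back with ungetc so the
   main tokenize loop sees it on its next fgetc. */
void readSymbolToken(FILE *in, int firstChar, char *buf) {
    int i = 0;
    int c = firstChar;
    while (c != EOF && i < MAX_TOKEN_LEN &&
           strchr(" \t\r\n()';", c) == NULL) {
        buf[i++] = (char)c;
        c = fgetc(in);
    }
    buf[i] = '\0';
    if (c != EOF) {
        ungetc(c, in);   /* put the delimiter back for the main loop */
    }
}
```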

FAQ: Should your tokenizer catch errors with mismatched parentheses, i.e. )(a b ))? No, it should tokenize input like that without error. Catching mismatched parentheses is the parser's job.

6. The files you start with

After you download the starter code, you should see the following starter files:

  • schemeitem.h: essentially the same header we have been using for a SchemeItem, but with some new types added
  • linkedlist.h: the same header file, more or less, from the last assignment
  • talloc.h: the same header file, more or less, from the last assignment
  • tokenizer.h: the header file for the functions you'll need to write for this assignment. You'll also create a number of helper functions for that, but those don't need to appear in the header since they will be "private".
  • main.c: Short main function that drives the whole program.
  • justfile: contains instructions for the command just, which will compile and test your code
  • test-files-m/ and test-files-e/: a directory of Scheme programs as test inputs, and expected outputs
  • test-m and test-e: the usual testing scripts that you run
  • tester.py: a helper script to handle automated testing.
  • lib: a subdirectory of binaries, see below

At this point, you have a choice regarding how to proceed for linkedlist.c and talloc.c. If you want to continue to build an interpreter that is truly yours, you can copy in these files from the last project and move forward. Alternatively, you can use my versions instead. In the lib directory you'll see linkedlist.o and talloc.o files, as well as .h files to go with them. These .o files are compiled binary versions of my linkedlist.c and talloc.c. If you would prefer to use one or both of them in your project instead of your own, change the first line of the justfile as indicated there. I provide these binaries so that people who have trouble with earlier parts of the project don't get slowed down, but I strongly encourage you to use your own code if it works, for two reasons:

  • You'll feel a much stronger sense of ownership for the project if you do.
  • You won't be able to trace bugs back to the linked list / talloc source code if you need to. You'll sometimes have to make do with cryptic errors from my code that say "linked list insertion error" or something like that. My code works as specified, but it will produce bad and sometimes difficult-to-understand output if it gets bad input.

Note that compiled binaries are operating system specific (sometimes version specific), and thus might not work outside of the course Docker container. If you run into troubles using my binaries (either on your own computer, or for unforeseen reasons in the Docker container), please let me know and we can talk about options.

To compile your code, issue the command just build at the command prompt. (This will give an error if you haven't yet created the .c files you need, or if you haven't changed the justfile to refer to the binaries if you need them.)

To run your code, first create a file with some of your own Scheme code in it, and give it a filename (e.g., myprog.scm). Then issue this command at a prompt:

./interpreter < myprog.scm

Your tokenizer code should read data from stdin; the above command redirects the text from the myprog.scm file into your program.

To run your code through valgrind, you can similarly type

valgrind ./interpreter < myprog.scm

One challenge you may find is that when your output gets long, it gets hard to read before it scrolls off the top of your terminal window. Piping your output through the UNIX less program helps: it pauses the output at the bottom of the screen, waiting for you to press the space bar to continue (or q to quit). You can do this with or without valgrind:

./interpreter < myprog.scm | less
valgrind ./interpreter < myprog.scm | less

7. Capstone extensions

For the final project submission, you will get to choose what additional functionality you want your interpreter to have, and you'll implement (several small, a few medium, or 1-2 large) enhancements. One option is to be able to handle dot (as in, a dotted pair) or square brackets.

Handling dot would count as a medium contribution, and square brackets would count as a small contribution. In either case, building in the tokenizer functionality is not sufficient; you'll also need to handle these data types in parsing and evaluation.

However, if you'd like to get a start on the capstone portion – or at least keep your options open for what enhancements you choose – the first step for handling dot or square brackets is to tokenize these characters.

  • Dot: . (a dot by itself), useful for handling dotted pairs. A dot token type should be identified if and only if there is a dot with whitespace on both sides of it. A dot at the beginning or end of the file cannot be a dot token.
  • Square brackets: [ ]. Add openbracket and closebracket token types, and tokenize these as well.

If you implement either of these extensions, be sure to test your code!

7.1. Studying for exams

The exam may ask you to be able to identify what sorts of errors your tokenizer can catch, vs. what it might not, given a particular BNF grammar. Think about changing the BNF grammar provided in the assignment, and then identifying what sorts of input should be tokenized. Also think through what sorts of programming errors should not be caught by the tokenizer.

8. How to test and submit your work

8.1. Testing your work

As with previous assignments, there are M tests and E tests. See the section of Scheme Intro 2 labeled "How to test your work" if you need a refresher on how to run these tests. Note that the interpreter tests will fail if your code has compiler warnings and/or valgrind errors. Make sure to run the tests in the Docker environment.

8.2. Submitting your work

8.2.1. Acknowledging help

You should already have a file CollaborationsAndSources.txt in your working directory (if not, copy it over from the previous assignment). Go to the header for this portion of the assignment and fill in the document. Did you share strategies with anyone else? Talk about any annoying errors and get advice? These are fine things to do, and you should note them in the CollaborationsAndSources.txt file. Give the names of people you talked with (other than your assigned partner) and some description of your interactions. If you used any resources outside of notes taken in class or the homework/lab starter files that Anna provided, that is also something to note.

If you didn't talk with anyone or use any outside sources, please note that explicitly in CollaborationsAndSources.txt.

8.2.2. Generating the submission file and uploading to Gradescope

After completing CollaborationsAndSources.txt and finishing the assignment, run ./zipitup (if you get a permissions error, run chmod u+x ./zipitup first). This script will create a zip file XXX_submission.zip, where XXX is the assignment name or number. Note that this zip file will include your C files, any new header files you created, and CollaborationsAndSources.txt. We will re-copy in our versions of the tests and of the provided header files. Upload the zip file to Gradescope.

On Gradescope, make sure to add your partner if you worked with a partner. This page has detailed instructions for how to do that.

On Gradescope, you should see autograder output for the same tests as you ran locally. There is a small chance that the tests will fail on Gradescope, even if they passed on your own computer. Usually this means that you coded something in a different way than we expected when writing the autograders. We will be grading based on the results of the tests on Gradescope, so make sure to pay attention to the results of the tests there. (If you run into this issue, check that you are submitting to the correct assignment on Gradescope. Then try resubmitting and rerunning the autograder. If the error messages are legible, feel free to debug further, but it's also fine to reach out to Anna or the graders at this point.)

8.2.3. Grading

You'll earn at least an M if the following conditions are all met:

  • All M tests pass
  • A visual inspection of your code shows that you have not hyper-tailored your code to pass the tests. Hyper-tailoring means doing things like checking for the exact input from a test, rather than writing your code in a way that would work for similar inputs.
  • You have a CollaborationsAndSources.txt file with the information described above.
  • Your code does not have significant stylistic or efficiency issues that we have explicitly covered in class or in feedback on a previous assignment.

You'll earn an E if:

  • All conditions above for an M are met.
  • All the E tests pass.
  • Your code is not significantly more complex than needed to accomplish the task.
  • You have a comment before each function describing what the function does (its input/output behavior, not a play-by-play of "first checks… then makes a recursive call …").
  • Your code follows good style convention.

If your code does not meet the criteria for an M or an E, then you'll earn a grade of NA (Not Assessable) or SP (Some Progress) for the assignment. Not Assessable means that your submission does not have an attempt at one or more functions in the assignment. Some Progress means that your submission has some code (in the correct language) for all functions in the assignment, but that the code does not pass all of the M tests, is hyper-tailored to the tests, and/or is missing a CollaborationsAndSources.txt file.

You are welcome to revise your submission until the revision deadline (see Gradescope).

If you revise your submission, double-check to make sure that the autograder tests pass and that you have included CollaborationsAndSources.txt.

Good luck, ask questions, and have fun!

This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Jed Yang, and Laura Effinger-Dean. Thanks for sharing!