Interpreter 4: Parser

Table of Contents

For this part of the assignment, you'll implement a parser for Scheme, in C. (You must implement the parser from scratch - don't use Yacc, Bison, ANTLR, or any similar tool, even if you know what it is and how to use it.)

This assignment is to be done individually.

You can talk to other people in the class, me, and any of the course staff (graders, lab assistants, prefects) for ideas and to gain assistance. You can get direct help debugging your code from me and any of the course staff, and you're welcome to show us your code during debugging or other conversations. The code that you write should be your own, and you shouldn’t directly show or share your code with other students (although you may discuss general debugging strategies with others). Similarly, you should not look at other's code. See the course syllabus for more details or just ask me if I can clarify.

Both you and your interpreter partner can access your code from the previous part by downloading it from Gradescope. You should not be working with your interpreter partner on this assignment.

1. Get started

You'll download the folder with the starter code from the link below. Like for previous homeworks, unzip the folder and move it to where you'd like: most likely the folder for the Docker container. Then get started in VS Code.

If when running tests you run into a permissions error, you can fix this by running the following code in the terminal:

chmod a+x test-e test-m

This tells the computer that all users should be permitted to execute (run) the files test-e and test-m.

Click here to download the starter files.

2. Assignment details

You should create a parser.c file that implements the functions specified in parser.h. Particularly, you will need to write a function parse that uses the token list from the last assignment to construct a list of parse trees.

For example, here is a syntactically correct Scheme program:

(define x (quote a))
(define y x)
(car y)

Any Scheme program consists of a list of S-expressions. An S-expression always has parentheses around it, unless it is an atom. Note that a Scheme program itself, then, is not a single S-expression; it is a list of S-expressions.

Your parse function should therefore return a list, using the linked list structure that we have been using, of parse trees.

A parse tree, in turn, uses the linked list structure again to represent a tree. For example, the expression (define x (quote a)) would become the following parse tree, where each box is a SchemeItem:

parse.png

The above parse tree structure would be the first item in a 3-item linked list that represents the above Scheme program.

Do not include parentheses in your parse tree. Their purpose is to tell you the structure of the tree, so once you have an actual tree you don't need them anymore - they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)

Parsing languages can be pretty complicated. The good news is that parsing Scheme is relatively straightforward. Since your Scheme code IS a tree, you just need to be able to read it as such and turn it into a tree in memory. You can do it with a stack of tokens:

  • Initialize an empty stack.
  • While there are more tokens:
    • Get the next token.
    • If the token is anything other than one of the close grouping symbols, push it onto the stack.
    • If the token is close grouping symbol, start popping items from your stack until you pop off the matching open symbol (and form a list of them as you go). Push that list back on the stack.

So you'll need a stack… your linked list is a fine candidate for it.

One small extension to the idea above: you'll end up with a stack at the end, and you'll need to decide what to do with it. You'll want to be sure to remember that you are reading a list of s-expressions, not just a single s-expression, and the list you return should match the order of the file. I.e., in the example above, (define x (quote a)) was the first thing in the file, and its parse tree should be the first item in a 3-item linked list if parsing the entire example above.

At the end of parsing a file, make sure you check that you don't have any open grouping symbols that are still waiting to be closed (i.e., (define a 3 should lead to a parse error!). Using a variable to store the current depth of the tree may be helpful. This variable could start at 0, and track how many open parentheses we're waiting to close at any given time.

You'll also need to write a printTree function. The output format for a parse tree of valid Scheme code is very simple: it looks exactly like Scheme code! To output an internal node in the parse tree, output ( followed by the outputs of the children of that internal node from left to right followed by ). To output a leaf node (a non-parenthetical token), just output its value as you in the tokenizer, sans type.

When printing, you'll always use ( and ), regardless of whether the original code used brackets or parentheses. Your parse tree shouldn't actually know whether the original code used brackets or parentheses!

3. Syntax errors

As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. Here are the different types of syntax errors you should handle:

  1. If the input code ever has too many close grouping symbols (in other words, if you ever encounter a closing symbol that doesn't match a previous open paren), print Syntax error: too many close parentheses.
  2. If the parse function returns an incomplete parse tree and the end of the input is reached and there are too few close parentheses, print Syntax error: not enough close parentheses.
  3. If the parse function encounters mismatched parentheses like ([)], print something beginning with Syntax error. You may follow that with whatever error message you like to describe the situation.

If you ever encounter a syntax error, your program should exit after printing the error - don't do any more parsing. This is why we wrote the function texit. Before exiting, you should output the text "Syntax error". Also feel free to print Scheme code around the error to be helpful if you can, though this is optional.

Most production parsers continue to try to parse after an error to provide additional error messages. Your efforts here may help you to gain some sympathy as to why all error messages after the first one are questionable!

4. Sample executions

$ cat test-in-01.scm
(if 2 3)
(+ ))

$ cat test-in-02.scm
(define map
  [lambda (f L)
    (if (null? L)
        L
        (cons (f (car L))
              (map f (cdr L))))])

$ cat test-in-03.scm
1 2 (3)

$ ./interpreter < test-in-01.scm
Syntax error: too many close parentheses

$ ./interpreter < test-in-02.scm
(define map (lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L))))))

$ ./interpreter < test-in-03.scm
1 2 (3)
  • Sample code

5. Sample code

Here is a rough approximation of part of my parse function. My addToParseTree function takes in a pre-existing tree, a token to add to it, and a pointer to an integer depth. depth is updated to represent the number of unclosed open parentheses in the parse tree. The parse tree is complete if and only if depth is 0 when the parse function returns.

SchemeItem *parse(SchemeItem *tokens) {

    SchemeItem *tree = makeEmpty();
    int depth = 0;

    SchemeItem *current = tokens;
    assert(current != NULL && "Error (parse): null pointer");
    while (current->type != EMPTY_TYPE) {
        SchemeItem *token = car(current);
        tree = addToParseTree(tree, &depth, token);
        current = cdr(current);
    }
    if (depth != 0) {
        syntaxError();
    }
    ...
}

You're not required to use this code skeleton, as there are plenty of other good ways to structure your code, instead. For instance, if you wanted to more explicitly represent the different between the parse stack and tree, you might replace tree with parseStack.

6. Helper functions

You'll note that you may want helper functions in parser.c (or indeed, in any number of other locations). A common question is whether you can/should modify the .h files to add these functions. Generally, you would only do this if you intend for these functions to be called outside of the file in which you're defining them. Since we won't need to do that for this project, you should not change the .h files you're given (and doing so may break the Gradescope tests).

If you're writing helper functions without putting them in .h files, you might experience an error about something being undefined. This occurs during compilation because the compiler doesn't know that the function will be defined later in the file. You have two options:

  • Declare a protoype for the function within your .c file, underneath your includes, just as you would in the .h file. The prototype is just the function return type, name, and parameters, without the body - each function part in the .h file is a prototype. Within that .c file, this is equivalent to having put all the prototypes in the .h file, but outside of that .c file, no one else knows about the function.
  • Write the function prior to calling it. This is basically strategically ordering your file so functions that are called by other functions are defined before the functions that call them.

If you want to add in additional .c and .h files to separate out particular functionality, that is possible (but not at all required). If you do this, please modify the justfile to include these .c and .h files. The grader will be using your version of the justfile, not the original copy (that also allows you to switch to our binaries if need be).

7. The single quote

Dealing with the special case of the single quote (i.e. 'a, or '(a b c)) means that we need to redefine “push onto the stack,” which appears a few times in the parsing algorithm. Here is the new definition of “push onto the stack”, which you only need to pass all of the E tests:

  • Proceed as usual so long if the stack doesn’t have a single quote ' on top.
  • Also proceed as usual if pushing on a left paren, no matter what the stack looks like.
  • In all other cases…
    • Pop the single quote off the stack.
    • Then proceed as if the token sequence had (quote ….) wrapped around the value you’re pushing. In other words, create a new subtree consisting of quote and the new token, and then push that subtree onto the stack instead of the original value you were going to push.

8. How to build your code

Look back at the tokenizer assignment, in the section titled "The files that you start with," on how to build your code as well as how to use my binaries for the previous assignments if you wish.

9. How to test and submit your work

9.1. Testing your work

Go back and look at the sections at the end of Scheme Intro 2 labeled "How to test your work". Everything there about how to run M tests and E tests applies identically here. Note that the tests will fail if your code has compiler warnings and/or valgrind errors, in addition to if its behavior is observably incorrect.

9.2. Submitting your work

You'll submit your work on Gradescope. You should already have a file CollaborationsAndSources.txt in your working directory (if not, copy it over from the previous assignment). Go to the header for this portion of the assignment and fill in the document. Indicate in what ways (if any) you collaborated with other people on this assignment (did you talk with anyone other than Anna or our prefect? Did you share strategies with anyone? Get or give advice on debugging? These are fine things to do, and you should note them in the CollaborationsAndSources.txt file). If you used any resources outside of your notes from class, that is also something to note in CollaborationsAndSources.txt. If you didn't talk with anyone or use any outside sources, please note that explicitly.

After updating CollaborationsAndSources.txt and finishing the assignment, run ./zipitup

This will zip up your C files, header files, and collaborations and sources. We will be re-copying in our version of the tests and of the included header files.

Then, upload the created zip to Gradescope.

On Gradescope, you should see autograder output for the same tests as you ran locally. There is a small but nonzero chance that you have coded something unexpected that cause the tests pass on the computer you’re working on, but not on Gradescope. We will be grading this assignment based on the results of the tests on Gradescope, so you should make sure to pay attention to how the tests run there as well when you submit.

For grading, you'll earn at least an M if the following conditions are all met:

  • All M tests pass
  • A visual inspection of your code shows that you have not hyper-tailored your code to pass the tests. Hyper-tailoring your code would be doing things like checking for the exact input from the test, meaning you haven't written your code in a way that would work for similar inputs see Homework policies for more on this).
  • You have a CollaborationsAndSources.txt with the information described above.
  • The work is submitted by the late deadline.

You'll earn an E if:

  • All conditions above for an M are met.
  • All the E tests pass.
  • Your code is not significantly more complex than needed to accomplish the task.
  • You have a comment before each function describing what the function does (its input/output behavior, not a play-by-play of "first checks… then makes a recursive call …").
  • The work is submitted by the regular deadline or it is submitted by the late deadline and you elect to use a late token by submmitting a regrading request

You may earn a grade of M or SP (Some Progress) and be offered to resubmit for an improved grade if:

  • All tests pass (all E tests if you can resubmit for an E, and all M tests if you can resubmit for an M)
  • Your code has significant stylistic or efficiency issues that we have not explicitly covered in class or on the feedback for a previous assignment.

If your code does not meet the criteria for an M or an E, then you'll earn a grade of NA (Not Assessable) or SP (Some Progress) for the assignment. Not Assessable means that your submission does not have an attempt at one or more functions in the assignment. Some Progress means that your submission has some code (in the correct language) for all functions in the assignment, but that the code does not pass all of the M tests, is hyper-tailored to the tests, and/or is missing a CollaborationsAndSources.txt file. Note that revisions may not be used on individual interpreter assignments to avoid extensions beyond the end of term or revisions in cases where you would have gained access to your partner's code. (To be fair to all students, this policy applies whether or not you were assigned a partner.)

Good luck, ask questions, and have fun!

This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Jed Yang, and Laura Effinger-Dean. Thanks for sharing!