Interpreter 4: Parser
Table of Contents
For this part of the assignment, you'll implement a parser for Scheme, in C. (You must implement the parser from scratch - don't use Yacc, Bison, ANTLR, or any similar tool, even if you know what it is and how to use it.)
This assignment is to be done individually.
You can talk to other people in the class, me, and any of the course staff (graders, lab assistants, prefects) for ideas and to gain assistance. You can get direct help debugging your code from me and any of the course staff, and you're welcome to show us your code during debugging or other conversations. The code that you write should be your own, and you shouldn’t directly show or share your code with other students (although you may discuss general debugging strategies with others). Similarly, you should not look at other's code. See the course syllabus for more details or just ask me if I can clarify.
Both you and your interpreter partner can access your code from the previous part by downloading it from Gradescope. You should not be working with your interpreter partner on this assignment.
1. Get started
You'll download the folder with the starter code from the link below. Like for previous homeworks, unzip the folder and move it to where you'd like: most likely the folder for the Docker container. Then get started in VS Code.
If when running tests you run into a permissions error, you can fix this by running the following code in the terminal:
chmod a+x test-e test-m
This tells the computer that all users should be permitted to execute (run)
the files test-e
and test-m
.
2. Assignment details
You should create a parser.c
file that implements the functions specified in
parser.h
. Particularly, you will need to write a function parse
that uses
the token list from the last assignment to construct a list of parse trees.
For example, here is a syntactically correct Scheme program:
(define x (quote a)) (define y x) (car y)
Any Scheme program consists of a list of S-expressions. An S-expression always has parentheses around it, unless it is an atom. Note that a Scheme program itself, then, is not a single S-expression; it is a list of S-expressions.
Your parse
function should therefore return a list, using the linked list
structure that we have been using, of parse trees.
A parse tree, in turn, uses
the linked list structure again to represent a tree. For example, the expression
(define x (quote a))
would become the following parse tree, where each box is
a SchemeItem
:
The above parse tree structure would be the first item in a 3-item linked list that represents the above Scheme program.
Do not include parentheses in your parse tree. Their purpose is to tell you the structure of the tree, so once you have an actual tree you don't need them anymore - they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)
Parsing languages can be pretty complicated. The good news is that parsing Scheme is relatively straightforward. Since your Scheme code IS a tree, you just need to be able to read it as such and turn it into a tree in memory. You can do it with a stack of tokens:
- Initialize an empty stack.
- While there are more tokens:
- Get the next token.
- If the token is anything other than one of the close grouping symbols, push it onto the stack.
- If the token is close grouping symbol, start popping items from your stack until you pop off the matching open symbol (and form a list of them as you go). Push that list back on the stack.
So you'll need a stack… your linked list is a fine candidate for it.
One small extension to the idea above: you'll end up with a stack at the
end, and you'll need to decide what to do with it. You'll want to be
sure to remember that you are reading a list of
s-expressions, not just a single s-expression, and the list you return
should match the order of the file. I.e., in the example above,
(define x (quote a))
was the first thing in the file, and its parse
tree should be the first item in a 3-item linked list if parsing the
entire example above.
At the end of parsing a file, make sure you check that you don't have
any open grouping symbols that are still waiting to be closed (i.e.,
(define a 3
should lead to a parse error!). Using a variable to
store the current depth of the tree may be helpful. This variable
could start at 0
, and track how many open parentheses we're waiting
to close at any given time.
You'll also need to write a printTree
function. The output format for a parse
tree of valid Scheme code is very simple: it looks exactly like Scheme code! To
output an internal node in the parse tree, output (
followed by the outputs of
the children of that internal node from left to right followed by )
. To
output a leaf node (a non-parenthetical token), just output its value
as you in the tokenizer, sans type.
When printing, you'll always use (
and )
, regardless of whether
the original code used brackets or parentheses. Your parse tree
shouldn't actually know whether the original code used brackets or parentheses!
3. Syntax errors
As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. Here are the different types of syntax errors you should handle:
- If the input code ever has too many close grouping symbols (in other words, if you
ever encounter a closing symbol that doesn't match a previous open paren), print
Syntax error: too many close parentheses
. - If the
parse
function returns an incomplete parse tree and the end of the input is reached and there are too few close parentheses, printSyntax error: not enough close parentheses
. - If the
parse
function encounters mismatched parentheses like([)]
, print something beginning withSyntax error
. You may follow that with whatever error message you like to describe the situation.
If you ever encounter a syntax error, your program should exit after printing
the error - don't do any more parsing. This is why we wrote the function
texit
. Before exiting, you should output the text "Syntax error". Also feel
free to print Scheme code around the error to be helpful if you can,
though this is
optional.
Most production parsers continue to try to parse after an error to provide additional error messages. Your efforts here may help you to gain some sympathy as to why all error messages after the first one are questionable!
4. Sample executions
$ cat test-in-01.scm (if 2 3) (+ )) $ cat test-in-02.scm (define map [lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L))))]) $ cat test-in-03.scm 1 2 (3) $ ./interpreter < test-in-01.scm Syntax error: too many close parentheses $ ./interpreter < test-in-02.scm (define map (lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L)))))) $ ./interpreter < test-in-03.scm 1 2 (3)
- Sample code
5. Sample code
Here is a rough approximation of part of my parse
function. My addToParseTree
function takes in a pre-existing tree, a token to add to it, and a pointer to an
integer depth
. depth
is updated to represent the number of unclosed open
parentheses in the parse tree. The parse tree is complete if and only if
depth
is 0 when the parse function returns.
SchemeItem *parse(SchemeItem *tokens) { SchemeItem *tree = makeEmpty(); int depth = 0; SchemeItem *current = tokens; assert(current != NULL && "Error (parse): null pointer"); while (current->type != EMPTY_TYPE) { SchemeItem *token = car(current); tree = addToParseTree(tree, &depth, token); current = cdr(current); } if (depth != 0) { syntaxError(); } ... }
You're not required to use this code skeleton, as there are plenty of other
good ways to structure your code, instead. For instance, if you wanted to
more explicitly represent the different between the parse stack and tree,
you might replace tree
with parseStack
.
6. Helper functions
You'll note that you may want helper functions in parser.c
(or
indeed, in any number of other locations). A common question is whether you
can/should modify the .h
files to add these functions. Generally, you
would only do this if you intend for these functions to be called
outside of the file in which you're defining them. Since we won't need
to do that for this project, you should not change the .h
files
you're given (and doing so may break the Gradescope tests).
If you're writing helper functions without putting them in .h
files, you might experience an
error about something being undefined. This occurs during compilation
because the compiler doesn't know that the function will be defined
later in the file. You have two options:
- Declare a protoype for the function within your
.c
file, underneath your includes, just as you would in the.h
file. The prototype is just the function return type, name, and parameters, without the body - each function part in the.h
file is a prototype. Within that.c
file, this is equivalent to having put all the prototypes in the.h
file, but outside of that.c
file, no one else knows about the function. - Write the function prior to calling it. This is basically strategically ordering your file so functions that are called by other functions are defined before the functions that call them.
If you want to add in additional .c
and .h
files to separate out particular functionality, that is possible (but not
at all required). If you do this, please modify the justfile
to include
these .c
and .h
files. The grader will be using your version of the
justfile
, not the original copy (that also allows you to switch to our
binaries if need be).
7. The single quote
Dealing with the special case of the single quote (i.e. 'a, or '(a b c)) means that we need to redefine “push onto the stack,” which appears a few times in the parsing algorithm. Here is the new definition of “push onto the stack”, which you only need to pass all of the E tests:
- Proceed as usual so long if the stack doesn’t have a single quote ' on top.
- Also proceed as usual if pushing on a left paren, no matter what the stack looks like.
- In all other cases…
- Pop the single quote off the stack.
- Then proceed as if the token sequence had (quote ….) wrapped around the value you’re pushing. In other words, create a new subtree consisting of quote and the new token, and then push that subtree onto the stack instead of the original value you were going to push.
8. How to build your code
Look back at the tokenizer assignment, in the section titled "The files that you start with," on how to build your code as well as how to use my binaries for the previous assignments if you wish.
9. How to test and submit your work
9.1. Testing your work
Go back and look at the sections at the end of Scheme Intro 2 labeled "How to test your work". Everything there about how to run M tests and E tests applies identically here. Note that the tests will fail if your code has compiler warnings and/or valgrind errors, in addition to if its behavior is observably incorrect.
9.2. Submitting your work
You'll submit your work on Gradescope.
You should already have a file CollaborationsAndSources.txt
in your working directory (if not, copy it over from the previous assignment).
Go to the header
for this portion of the assignment and fill in the document. Indicate
in what ways (if any) you collaborated with other people on this
assignment (did you talk with anyone other than Anna or our prefect?
Did you share strategies with anyone?
Get or give advice on debugging? These are
fine things to do, and you should note them in
the CollaborationsAndSources.txt
file). If you used any resources outside of your notes from class, that
is also something to note in CollaborationsAndSources.txt
.
If you didn't talk with anyone or use any outside sources, please note
that explicitly.
After updating CollaborationsAndSources.txt
and finishing the
assignment, run ./zipitup
This will zip up your C files, header files, and collaborations and sources. We will be re-copying in our version of the tests and of the included header files.
Then, upload the created zip to Gradescope.
On Gradescope, you should see autograder output for the same tests as you ran locally. There is a small but nonzero chance that you have coded something unexpected that cause the tests pass on the computer you’re working on, but not on Gradescope. We will be grading this assignment based on the results of the tests on Gradescope, so you should make sure to pay attention to how the tests run there as well when you submit.
For grading, you'll earn at least an M if the following conditions are all met:
- All M tests pass
- A visual inspection of your code shows that you have not hyper-tailored your code to pass the tests. Hyper-tailoring your code would be doing things like checking for the exact input from the test, meaning you haven't written your code in a way that would work for similar inputs see Homework policies for more on this).
- You have a
CollaborationsAndSources.txt
with the information described above. - The work is submitted by the late deadline.
You'll earn an E if:
- All conditions above for an M are met.
- All the E tests pass.
- Your code is not significantly more complex than needed to accomplish the task.
- You have a comment before each function describing what the function does (its input/output behavior, not a play-by-play of "first checks… then makes a recursive call …").
- The work is submitted by the regular deadline or it is submitted by the late deadline and you elect to use a late token by submmitting a regrading request
You may earn a grade of M or SP (Some Progress) and be offered to resubmit for an improved grade if:
- All tests pass (all E tests if you can resubmit for an E, and all M tests if you can resubmit for an M)
- Your code has significant stylistic or efficiency issues that we have not explicitly covered in class or on the feedback for a previous assignment.
If your code does not meet the criteria for an M or an E, then you'll earn a
grade of NA (Not Assessable) or SP (Some Progress) for the assignment. Not Assessable means that your submission does not
have an attempt at one or more functions in the assignment. Some Progress means
that your submission has some code (in the correct language) for all functions in the
assignment, but that the code does not pass all of the M tests, is
hyper-tailored to the tests, and/or is missing a CollaborationsAndSources.txt
file. Note that revisions may not be used on individual interpreter
assignments to avoid extensions beyond the end of term or revisions in
cases where you would have gained access to your partner's code. (To
be fair to all students, this policy applies whether or not you were
assigned a partner.)
Good luck, ask questions, and have fun!
This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Jed Yang, and Laura Effinger-Dean. Thanks for sharing!