HW 3 (Books data analysis)

Assignment overview

In this assignment, you'll analyze a dataset about books.

Goals

Get practice reading and understanding code, understanding and implementing calculations, using return values with functions, and using loops with the accumulator pattern.

Logistics

This is an individual assignment. You are welcome to discuss any part of the assignment with classmates, course staff, or Anna. Make sure to cite any help you receive in the "acknowledgements" portion of the assignment.

This assignment is due at 10PM on Monday, April 13. This assignment does have autograder tests on Gradescope, so I recommend submitting early to make sure you have time to fix and debug any errors.

Setup

Mount the COURSES drive and create a folder called hw3 in your STUWORK folder. Open the new folder in VSCode. If you need a refresher on how to complete these steps, refer back to the in-class lab from the first day of class.

Next, download this starter zip file. Unzip it and move all three files (dataAnalysis.py, books.csv, and books_small.csv) to your new hw3 folder.

For this assignment, you will modify the code in dataAnalysis.py to analyze the books dataset that you also downloaded.

The dataset was originally collected from Goodreads, a book review platform. I accessed it from this site and made a few modifications for simplicity.

Understand the data

When working with a dataset, you first need to know what's in the data and how it's organized. These data files have the extension ".csv", short for "comma separated values". That is, each line in the data has fields that are separated by commas. Each field is a particular type of data. The fields in these data files are, in order:

  1. Book title (title)
  2. Author name (author)
  3. Publication year (publicationYear)
  4. Genre or type of book (category)
  5. Average user rating out of 5 stars (rating)
  6. Total number of user ratings (numRatings)
  7. Book length, in pages (numPages)
  8. Number of books written by this author according to the Goodreads database (authorBookCount)
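
To make the field order concrete, here is a small sketch of how one line in this format could be split into its fields. The sample line is made up for illustration (it is not taken from books.csv), the variable names are my own and may differ from those in dataAnalysis.py, and the sketch assumes that no field contains a comma of its own.

```python
# A made-up line in the same 8-field format (not real data from books.csv)
line = "Example Title,Example Author,2001,Fiction,4.25,1200,310,7\n"

# Remove the trailing newline, then split on commas to get a list of strings
fields = line.strip().split(",")

# Positions start at 0, so field 1 (title) is at index 0,
# field 2 (author) is at index 1, and so on
title = fields[0]
author = fields[1]
rating = float(fields[4])    # convert the string "4.25" to a number
numPages = int(fields[6])    # convert the string "310" to a number

print(author)
print(rating)
```

Notice that the position in the list is always one less than the field number in the list above, because Python starts counting at 0.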

You downloaded two data files. books.csv has data about over 5000 books, while books_small.csv contains information about just 6 books. It's often helpful to use a smaller data file to test a program because you can check the program's output against your own calculations. I strongly suggest you get your program working with the smaller file first and then move on to the larger file.

It's often helpful to look at a few lines of the data to understand it better. You can open books.csv in an editor like VSCode, or you can use the less command on the command line to see a few lines at a time.

% less books.csv

Pressing space will show additional lines; typing 'q' will take you back to the command line.

Note that if you open a CSV file in Excel or Google Sheets, you won't see the commas that separate each value. Instead, the data will automatically be sorted into columns.

Understand dataAnalysis.py

Later in the assignment, you'll modify dataAnalysis.py to do some additional analyses on the data. Before you do, complete all of the steps in this section; they will make your implementation go more smoothly.

  • Open up dataAnalysis.py and read through the code.
  • Run python3 dataAnalysis.py from the command line and see what happens. (Note: this has to do with the function main() at the end of the program.)
  • Add a print statement so that the for loop prints each line of the file. Check that this works, then remove the print statement. The print statement is only there to help you understand what is happening in the loop. We don't need or want to print anything here longer term!
  • Add a print statement so that only the author is printed for each line in the file. Can you do the same for the rating of the book? This may take some trial and error. Don't be afraid to try different things out and see what works! Again, once you've figured this out you can delete the print statement. Add in any comments that will help you remember what the different variables mean.
  • Now that you know how to get the author of a book, look at my code for setting the variables and the numbers I use like 0, 1, 4, and 6. How do these match up with the fields in a data file? (Look back at the previous section if you're not sure what I mean by field.)

Now, you should have a better understanding of dataAnalysis.py. If there are other parts that still don't make sense, try adding additional print statements to clarify. Feel free to ask questions on Ed or during office hours if you get stuck. The goal of giving you a more complicated piece of code like this is not to teach you all the syntax of everything I use in the file, but to empower you to use the tools we've learned to investigate and make sense of code you don't understand.

Analyzing the data

Now, you'll modify dataAnalysis.py.

  1. Calculate the average rating of all books. Print this average along with a description of what you're printing (similar to my print statement about the total number of lines in the data). You should leave your answer as a decimal (don't round it).
  2. Create a function called myRound(). This function should take a number and return the result of rounding that number to the nearest integer. E.g., 3.5 rounds to 4, 3.2 rounds to 3, and so on. Do not use the built-in function round to do the rounding for you. If you do so, you will not earn credit for this part. Instead, come up with a way to use other operators we've talked about (integer division and arithmetic operators like +, -, *, /) to do the rounding. Walk through your ideas on paper first. Note that to pass the autograder tests, your function needs to be called myRound, needs to accept a single parameter, and needs to return something of type int, not of type float.
  3. Calculate the average number of pages across all books. Use myRound() to print the rounded result. As before, include a description of what you're printing.
  4. Calculate the total number of books that have genre Poetry. Print the result, along with a description of what you're printing. For this part, you'll want to use conditionals (see the readings for Friday's class).
  5. Calculate the number of books in our dataset whose authors have written fewer than 5 books (according to the Goodreads database). Note that this is equivalent to counting the number of rows that have a value less than 5 in the "authorBookCount" field. Print the result, along with a description of what you're printing. Again, you'll want to use conditionals.
  6. Finally, after you are convinced that your code is working smoothly, modify one line of code in dataAnalysis.py so that the larger data file (books.csv) is used instead of the smaller data file.
  7. Optional (not worth any points): what else can we learn from this data? Be creative! Can you find the book with the highest (or lowest) average rating? What about the book with the highest total number of user ratings?
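
Several of the steps above combine the accumulator pattern with a conditional. As a reminder of what that looks like, here is a minimal sketch on an unrelated toy list. The values and variable names are invented for illustration; this is not the solution to any of the questions, and your code will get its values from the data file rather than from a list.

```python
# Toy data, invented for illustration (not from the books dataset)
scores = [3.5, 4.0, 2.5, 5.0]

total = 0.0    # accumulator for the running sum
numHigh = 0    # accumulator for a conditional count

for score in scores:
    total = total + score          # always add to the sum
    if score >= 4.0:
        numHigh = numHigh + 1      # only count scores that pass the test

average = total / len(scores)

print("Average score:", average)
print("Number of scores of at least 4.0:", numHigh)
```

Both accumulators are set up before the loop and updated inside it; the average is computed only after the loop has finished.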

Code notes and tips

  • The starter code uses the functions int and float. This is because the data from the file is read as a string (we actually end up with a "list of strings"). We'll talk more about strings next week, and more about lists the week after that.
  • Some of the questions will require if statements. See the readings for Friday, especially Runestone 5.6, for how to use if statements (also called conditionals). Runestone focuses on if statements with numbers, but one of the questions above asks you to work with strings (the genre) rather than numbers. We can use strings (in quotes) with if statements, as follows:

    userName = input("Type your name: ")
    if userName == "Anna":
        print("Your name is the same as mine!")
    else:
        print("Your name is not the same as mine.")
    
  • For each of the questions above, you should check that your program prints the correct answer for the small data file. I've included a small data file because it's easy to check your work by hand when there are only 6 lines of data, as opposed to over 5000.
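
To see why the conversions with int and float mentioned in the first tip matter, consider this short sketch. The values and variable names are made up for illustration and are not taken from the starter code.

```python
# Values read from a file arrive as strings, even when they look like numbers
pagesText = "310"
ratingText = "4.25"

# With strings, + glues the text together instead of adding
print(pagesText + pagesText)

numPages = int(pagesText)      # now an integer
rating = float(ratingText)     # now a floating-point number

# With numbers, + does arithmetic
print(numPages + numPages)

# type() tells you what kind of value a variable holds
print(type(pagesText), type(numPages), type(rating))
```

If a calculation gives a surprising result, printing the type of each variable involved is a quick way to check whether you are working with strings or numbers.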

Wrap up

When you're finished, make sure to complete the usual documentation steps. This includes adding comments, writing function docstrings, and filling out the acknowledgements and reflection portion of the header.

You should also think about coding style. Have you written everything in a consistent way that is easy to read? Does your code have any unnecessary print statements? (Remove them.) Review the style document on Moodle for the expectations for this assignment.

Assignment submission and misc. notes

Handing in the assignment

You need to hand in dataAnalysis.py on Gradescope.

There are autograder tests on Gradescope. You are responsible for making sure you pass these tests, and you may resubmit the assignment before the deadline as many times as necessary. If you're ever not sure what a test result means, ask! You can post on Ed or stop by for in-person help at office hours or in Olin 310 during the lab assistant hours.

Grading

This assignment is worth 40 points, broken up as follows:

  • Basic functionality (5 points): does your code run without errors when we type python3 dataAnalysis.py at the command line?
  • Average rating (5 points): does your code correctly compute and print the average rating? There will be one autograder test for this part.
  • Round function (7 points): does myRound() work as expected? To get full credit for this question, you need to pass the autograder tests and myRound() cannot use the built-in function round.
  • Average number of pages (5 points): does your code correctly compute the average number of pages and print the rounded result? There will also be one autograder test for this part.
  • Count of poetry books (6 points): does your code correctly compute and print the number of books whose genre is Poetry? There will be one autograder test for this part.
  • Number of books whose authors have fewer than 5 books (6 points): does your code correctly compute and print the number of books whose authors have written fewer than 5 books? There will be one autograder test for this part.
  • Style (6 points): header, comments, following the style guidelines from Moodle. The graders will also check that you are not printing lots of irrelevant information (e.g., printing each line of data in the for loop or printing intermediate calculations).

Anna's acknowledgements

This assignment was adapted from assignments used by David Liben-Nowell, Anna Rafferty, and Layla Oesper. Thanks for sharing!