1. Detailed Course Schedule

This include all topics covered and all assignments given and when they are due.

1.1. Day 1 - 5/18/21 - Introduction and mathematical foundations

1.1.1. Content

1.1.2. Due before next class


1.2. Day 2 - 5/20/21 - Jupyter and a review of Python and pandas

1.2.1. Content

  • Chapter 3: Computational notebooks (Jupyter) - reading and slides

  • Chapter 4: Review of Python and pandas - reading, but no slides

1.2.2. Due before next class


1.3. Day 3 - 5/25/21 - Before and after, single-table verbs

1.3.1. Content

  • Chapter 5: Before and after, in mathematics and communication - reading and slides

  • Chapter 6: Pandas single-table verbs - reading and slides

1.3.2. Due before next class


1.4. Day 4 - 5/27/21 - Abstraction and version control

1.4.1. Content

  • Chapter 7: Abstraction in mathematics and computing - reading and slides

  • Chapter 8: Version control and GitHub - reading and slides

1.4.2. Due before next class


1.5. Day 5 - 6/1/21 - Math and stats in Python, plus Visualization

1.5.1. Content

1.5.2. Due before next class

  • Data preparation exercise

    • (Some steps of this you have probably already completed. What’s new for everyone is making a project that can easily load the file into Jupyter, so we’re ready to experiment with it next week in class.)

    • Look at the 2016 election data on this page of NPR’s website.

    • Extract the table from that page into a CSV file (for example, by copying and pasting into Excel, then touching it up as needed).

    • Write a Jupyter notebook that imports the CSV file.

    • Ensure that you remove all rows that are not for entire states (which you can do in Excel or Jupyter, whichever you prefer).

    • Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.

    • Submit that notebook to your instructor through Blackboard.

  • DataCamp

    • Joining Data with pandas, all chapters

      • NOTE: We will not cover this content in class next time. We will cover it the subsequent class meeting instead. But I’m assigning you to do it now because then you won’t have any homework next time, when the project is due, and you’ll be able to focus on that instead.

    • See here for a cheat sheet of all the content of the above DataCamp lessons. (The cheat sheet needs updating to reflect a recent change in the DataCamp assignments, due to DataCamp’s having archived some old courses. The cheat sheet is very similar to the new content, however.)

  • Reading

  • Other

    • Optional: There are several LOYO opportunities from today’s course notes (chapters 9 and 10). Consider forming a team and siezing one of the opportunities.


1.6. Day 6 - 6/3/21 - Processing the Rows of a DataFrame

1.6.1. Content

  • Chapter 11: Processing the rows of a DataFrame - reading and slides

1.6.2. Due before next class

  • No DataCamp today, so that you can focus on the project.

  • Reading

  • Other

    • Optional: There are a few LOYO opportunities from today’s course notes (chapter 11). Consider forming a team and siezing one of the opportunities.


1.7. Day 7 - 6/8/21 - Concatenation and Merging

1.7.1. Content

1.7.2. Due before next class

It’s a light week, because you just did Project 1 and deserve a little time to rest.


1.8. Day 8 - 6/10/21 - Miscellaneous Munging Methods (ETL)

1.8.1. Content

  • Chapter 13: Miscellaneous Munging Methods (ETL) - reading and slides

1.8.2. Due before next class

  • DataCamp (last one for the whole semester!)

  • Reading

  • Other

    • Ensure that you have the git command installed on your own computer (again, not on Deepnote or Colab).

      • If you’re on Windows and have already installed the GitHub Desktop app, then you just need to tell Windows where to find the git.exe command that’s built into that app. The folder containing it will be something like C:\Users\YOUR-USERNAME\AppData\Local\GitHubDesktop\app-2.8.1\resources\app\git\cmd. Investigate using Windows Explorer to find the correct path for your system. You may need to reveal hidden files and folders to find the AppData folder. (Yes, this is a pain.)

      • If you’re on Mac, you might not have the git command unless you’ve installed Xcode at some time in the past. You can run xcode-select --install to install just the minimal Xcode tools to get git. If that doesn’t work, download them directly from Apple, which may require creating a free developer account.

      • To prove that you successfully got the git command installed, run git --version.

      • Take a screenshot of your terminal window showing the results of the successful git --version so I can see that you got this to work.

    • Install Streamlit on your own computer.

      • The command you need is pip install streamlit.

      • Do not run this command on Deepnote or Colab, but on your own computer.

      • Once you’ve installed it, you should be able to run the command streamlit --version to see what version is installed. Do so to ensure that your installation succeeded.

    • Create a Heroku account.

    • Take a screenshot to prove that all this worked.

      • After you’ve run heroku login and can still see its successful output, just re-run the git --version and streamlit --version commands so that all three outputs are on one screen, making it obvious that you’ve got all three tools installed correctly.

      • Take a screenshot of that terminal window showing those three commands’ successful output.

      • Submit that screenshot via Blackboard as your homework.

      • It should look something like this:

An image of the screenshot described in the text above

  • Project Planning

    • Optional: If you want to get ahead on the final project in a way that’s rather easy and fun, start hunting for datasets that cover a topic you’re interested in and might want to analyze. Try to find a dataset that’s pretty comprehensive, so that there are plenty of options for ways to analyze, visualize, and manipulate it.


1.9. Day 9 - 6/15/21 - Dashboards

1.9.1. Content

1.9.2. Due before next class

  • Network data exercise

    • The purpose of this exercise is to familiarize you with some network data, since next week we will be studying just that. It also gives you another chance to practice pd.merge().

    • Download this Excel workbook of shipping data among U.S. states in 1997.

    • Look over all the sheets in the workbook to familiarize yourself with their meaning.

    • Create a Jupyter notebook that reads all the sheets from the workbook.

      • Note: Reading Excel files requires installing the openpyxl module, which is not present by default in some cloud computing environments. You may need to run pip install openpyxl in the terminal, or at the top of the notebook, or place it in a requirements.txt file.

    • Add code that creates a DataFrame just like the shipping sheet, but with each state abbreviation replaced by its full name.

    • The “adjacent” column in the distances DataFrame should be boolean type; convert it.

    • Add two columns to the shipping table, one containing the distance between the two states, and the other containing the boolean of whether the two states are adjacent, both taken from the distance table.

    • Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.

    • Submit that notebook to your instructor through Blackboard.

  • Reading


1.10. Day 10 - 6/17/21 - Relations, graphs, and networks

1.10.1. Content

  • Chapter 15: Relations as graphs and network analysis - reading and slides

1.10.2. Due before next class

  • Data prep exercise for a music recommendation system

    • In class next time we will build a recommender system for songs (that is, given your preferences and a big database of other people’s preferences, it will try to match you with new songs you might like).

    • Visit this page and read about the data archive, then download it from there in ZIP format. It is almost 1GB in size, so leave some time for this download!

    • Unzip the download and find within it three files; we care only about jams.tsv. Place this file in a folder where you can access it with Python and pandas. It contains every user’s “jams” from 2011-2015.

    • Write some code to load into a pandas Series the full set of unique user IDs in that file. That is, do not include any user more than once in the series. (This code may be slow to run, because the file is large.) This step is asking for just the user IDs, not any jam or song data.

    • Use the sample() method in pandas Series objects to select a random subset of the users to work with, so that we don’t have to deal with the entire jams file, which would take a long time to do computations with. Include at least 1000 in your sample, to get a sufficient representation of the full dataset. I chose 2000 in my own work, but later computations will get much slower if you go beyond about 2000.

    • Write some code to load from the jams.tsv DataFrame every jam by all the users in your sample. There are roughly 15 jams per user on average, so you should end up with 15 times as many results as the number of users you chose (about 15,000 to 30,000).

    • We need only three columns of the result: user ID, artist, and song title. Discard all other columns.

    • To give a song a unique name string, let’s combine the artist and song title into a single column. That is, rather than a column with “Don’t Stop Believin’” for song title and “Journey” as artist, create a new column called “song” that contains text like “Don’t Stop Believin’, by Journey”.

    • Drop the original title and artist columns so that your final jams DataFrame contains just two columns, user and song.

    • Export that DataFrame to a new CSV file that we will analyze in class. Call it jam-sample.csv.

    • Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.

    • Submit that notebook to your instructor through Blackboard.

  • Reading


1.11. Day 11 - 6/22/21 - Relations as matrices

1.11.1. Content

1.11.2. Due before next class

  • Data preparation exercise

    • In class next time we will do an introductory machine learning exercise about predicting mortgage approval/denial.

    • Download the training dataset here. It is a sample from the same mortgage dataset we’ve used many times. Recall that its data dictionary is available online here.

    • Load it into pandas and check the data types of the columns.

    • To make all the data numeric, we will be replacing categorical columns with boolean columns in which false is represented by 0 and true is represented by 1. This will make it possible to use that data in a numerical model.

    • Replace the conforming_loan_limit column with two boolean columns, one that means “conforming loan limit is C (conforming)” and one that means “conforming loan limit is NC (not conforming).” Don’t forget to use 0/1 instead of False/True. (There are other values that column may take on, but we will analyze just those two.)

    • Replace the derived_sex column with two boolean columns, one that means “derived sex is Male” and one that means “derived sex is Female.” Don’t forget to use 0/1 instead of False/True. (There are other values that column may take on, but we will analyze just those two.)

    • The action_taken column contains only 1s and 3s. This is because this dataset was filtered to include only accepted or rejected mortgages (no withdrawals, pre-approvals, etc.). Replace this column with another boolean column, still using 0/1 for False/True, meaning “application accepted.”

    • The debt-to-income ratio column is categorical instead of numeric. Make it numeric by replacing each category with a central value in that category. For instance, the category “20%-<30%” can be replaced with the number 25, the category “43” can be just the number 43, etc. Let’s use 70 for “>60%.”

    • Your newly cleaned data should have all numeric columns. Export it as a CSV file and bring it with you to class for an in-class activity in Week 12.

    • Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.

    • Submit that notebook to your instructor through Blackboard.

  • Reading

(Perhaps more assignments are coming; this section is still incomplete.)


1.12. Day 12 - 6/24/21 - Introduction to machine learning

1.12.1. Content

  • Chapter 17: Introduction to machine learning - reading and slides

No more homework this semester! Use the remaining time to do a great final project!


1.13. Day 13 - 6/29/21 - Final Project Workshop

We will begin class today answering questions to help you review for the final exam, which is on Thursday, 7/1/21. Then we will end class at the regular Zoom link and transition to the office hours Zoom link…

The second half of class today, at the office hours Zoom link, will consist entirely of help with individuals and teams working on their final project, which is due at 11:59pm on this day.


1.14. Day 14 - 7/1/21 - Final Exam

We will take the final exam in class, and students may leave when they complete it.

Recall that the topics for the final exam appear online here.