1. Detailed Course Schedule¶
This includes all topics covered, all assignments given, and when they are due.
1.1. Day 1 - 5/18/21 - Introduction and mathematical foundations¶
1.1.1. Content¶
1.1.2. Due before next class¶
DataCamp
Optional, basic review:
Required (though it may still be review):
Intermediate Python, chapters 1-4
pandas Foundations, just chapter 1 (This course is archived on DataCamp; in the future I will replace it with an available course instead.)
Manipulating DataFrames with pandas, just chapter 1 (This course is archived on DataCamp; in the future I will replace it with an available course instead.)
See here for a cheat sheet of all the content of the above DataCamp lessons.
Reading
Each week, you are expected to read the appropriate chapters from the course notes before class. Since this is the first day of the course, I did not expect you to have read Chapters 1-2 in advance. That means you must now read them, together with Chapters 3-4, before next week.
Chapter 1: Introduction to data science (adds details to today’s class content)
Chapter 2: Mathematical foundations (adds details to today’s class content)
Chapter 3: Computational notebooks (Jupyter) (prepares for next week)
Chapter 4: Python review focusing on pandas and mathematical foundations (prepares for next week)
Other
If you don’t already have a Python environment installed on your computer, see these instructions for installing one. As part of that process, ensure that you can open both Jupyter Lab and VS Code.
Optional: There are many LOYO opportunities from today’s course notes (chapters 1 and 2). See the syllabus for a definition of LOYO (Learning on Your Own) and consider forming a team and seizing one of the opportunities.
1.2. Day 2 - 5/20/21 - Jupyter and a review of Python and pandas¶
1.2.1. Content¶
1.2.2. Due before next class¶
DataCamp
Manipulating DataFrames with pandas, chapters 2-4 (This course is archived on DataCamp; in the future I will replace it with an available course instead.)
See here for a cheat sheet of all the content of the above DataCamp lessons.
Reading
1.3. Day 3 - 5/25/21 - Before and after, single-table verbs¶
1.3.1. Content¶
1.3.2. Due before next class¶
Communication exercise
Create a new Deepnote project and upload into it this Jupyter notebook and this CSV file. (Please be sure to do this in a new Deepnote project, rather than just a new folder in an existing project. Grading becomes error-prone if I have to hunt through your folders for what I’m supposed to grade.)
The first half of the notebook has plenty of comments and explanations, but the second half does not. Use the principles discussed in class today (and covered in Chapter 5 of the course notes) to comment/document/explain the second half of that file.
Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.
Submit that notebook to your instructor through Blackboard.
DataCamp
pandas Foundations, just chapter 2 (This course is archived on DataCamp; in the future I will replace it with an available course instead.)
See here for a cheat sheet of all the content of the above DataCamp lessons.
Reading
1.4. Day 4 - 5/27/21 - Abstraction and version control¶
1.4.1. Content¶
1.4.2. Due before next class¶
Version control exercise
This assignment is described in the final slide for Chapter 8, linked to above.
DataCamp
Intermediate Python, chapter 5
Statistical Thinking in Python, Part 1, all chapters
Introduction to Data Visualization with Matplotlib, just chapter 1
Introduction to Data Visualization with Seaborn, chapters 1 and 3
See here for a cheat sheet of all the content of the above DataCamp lessons. (The cheat sheet needs updating to reflect a recent change in the DataCamp assignments, due to DataCamp’s having archived some old courses. The cheat sheet is very similar to the new content, however.)
Reading
1.5. Day 5 - 6/1/21 - Math and stats in Python, plus Visualization¶
1.5.1. Content¶
1.5.2. Due before next class¶
Data preparation exercise
(You have probably already completed some of these steps. What’s new for everyone is making a project that can easily load the file into Jupyter, so we’re ready to experiment with it next week in class. A sketch of one possible approach appears after these steps.)
Look at the 2016 election data on this page of NPR’s website.
Extract the table from that page into a CSV file (for example, by copying and pasting into Excel, then touching it up as needed).
Write a Jupyter notebook that imports the CSV file.
Ensure that you remove all rows that are not for entire states (which you can do in Excel or Jupyter, whichever you prefer).
Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.
Submit that notebook to your instructor through Blackboard.
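If it helps, here is a minimal sketch of the loading and filtering steps. The file name election2016.csv and the column name State are assumptions; adjust them to match your own CSV.

    import pandas as pd

    # The file and column names here are assumptions; use whatever yours are called.
    df = pd.read_csv( "election2016.csv" )

    # Keep only rows for entire states by matching against the 50 full state names.
    state_names = [ "Alabama", "Alaska", "Arizona" ]   # ...and so on, all 50 states
    df = df[df["State"].isin( state_names )]
    df.head()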
DataCamp
Joining Data with pandas, all chapters
NOTE: We will not cover this content in class next time. We will cover it the subsequent class meeting instead. But I’m assigning you to do it now because then you won’t have any homework next time, when the project is due, and you’ll be able to focus on that instead.
See here for a cheat sheet of all the content of the above DataCamp lessons. (The cheat sheet needs updating to reflect a recent change in the DataCamp assignments, due to DataCamp’s having archived some old courses. The cheat sheet is very similar to the new content, however.)
Reading
Other
Optional: There are several LOYO opportunities from today’s course notes (chapters 9 and 10). Consider forming a team and seizing one of the opportunities.
1.6. Day 6 - 6/3/21 - Processing the Rows of a DataFrame¶
1.6.2. Due before next class¶
No DataCamp today, so that you can focus on the project.
Reading
Other
Optional: There are a few LOYO opportunities from today’s course notes (chapter 11). Consider forming a team and seizing one of the opportunities.
1.7. Day 7 - 6/8/21 - Concatenation and Merging¶
1.7.2. Due before next class¶
It’s a light week, because you just did Project 1 and deserve a little time to rest.
DataCamp
See here for a cheat sheet of all the content of the above DataCamp lessons.
Reading
1.8. Day 8 - 6/10/21 - Miscellaneous Munging Methods (ETL)¶
1.8.2. Due before next class¶
DataCamp (last one for the whole semester!)
Introduction to SQL for Data Science
NOTE: Bentley’s CS350 course goes into this content in far greater detail. You can see this lesson as a small preview or taste of that course.
See here for a cheat sheet of all the content of the above DataCamp lessons.
Reading
Other
Ensure that you have the git command installed on your own computer (again, not on Deepnote or Colab).
If you’re on Windows and have already installed the GitHub Desktop app, then you just need to tell Windows where to find the git.exe command that’s built into that app. The folder containing it will be something like C:\Users\YOUR-USERNAME\AppData\Local\GitHubDesktop\app-2.8.1\resources\app\git\cmd. Investigate using Windows Explorer to find the correct path for your system. You may need to reveal hidden files and folders to find the AppData folder. (Yes, this is a pain.)
If you’re on Mac, you might not have the git command unless you’ve installed Xcode at some time in the past. You can run xcode-select --install to install just the minimal Xcode tools to get git. If that doesn’t work, download them directly from Apple, which may require creating a free developer account.
To prove that you successfully got the git command installed, run git --version.
Take a screenshot of your terminal window showing the results of the successful git --version so I can see that you got this to work.
Install Streamlit on your own computer.
The command you need is pip install streamlit. Do not run this command on Deepnote or Colab, but on your own computer.
Once you’ve installed it, you should be able to run the command streamlit --version to see what version is installed. Do so to ensure that your installation succeeded.
Create a Heroku account.
Then install the Heroku command-line tools on your own computer (again, not on Deepnote or Colab).
Ensure that after doing so, you can get to a terminal and run heroku login successfully.
If you’re on Windows and it can’t find the heroku command even though you just installed it, you may need to add C:\Program Files\Heroku\bin to your system path variable. Follow this tutorial to do so.
Take a screenshot to prove that all this worked.
After you’ve run heroku login and can still see its successful output, re-run the git --version and streamlit --version commands so that all three outputs are on one screen, making it obvious that you’ve got all three tools installed correctly.
Take a screenshot of that terminal window showing those three commands’ successful output.
Submit that screenshot via Blackboard as your homework. It should look something like this:
Project Planning
Optional: If you want to get ahead on the final project in a way that’s rather easy and fun, start hunting for datasets that cover a topic you’re interested in and might want to analyze. Try to find a dataset that’s pretty comprehensive, so that there are plenty of options for ways to analyze, visualize, and manipulate it.
1.9. Day 9 - 6/15/21 - Dashboards¶
1.9.2. Due before next class¶
Network data exercise
The purpose of this exercise is to familiarize you with some network data, since next week we will be studying just that. It also gives you another chance to practice pd.merge().
Download this Excel workbook of shipping data among U.S. states in 1997.
Look over all the sheets in the workbook to familiarize yourself with their meaning.
Create a Jupyter notebook that reads all the sheets from the workbook. (A sketch of one possible approach to this and the following steps appears after this list.)
Note: Reading Excel files requires installing the openpyxl module, which is not present by default in some cloud computing environments. You may need to run pip install openpyxl in the terminal, or at the top of the notebook, or place it in a requirements.txt file.
Add code that creates a DataFrame just like the shipping sheet, but with each state abbreviation replaced by its full name.
The “adjacent” column in the distances DataFrame should be boolean type; convert it.
Add two columns to the shipping table, one containing the distance between the two states, and the other containing the boolean of whether the two states are adjacent, both taken from the distance table.
Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.
Submit that notebook to your instructor through Blackboard.
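Here is a sketch of one way those steps might look in pandas. The file name, sheet names, and column names below are all assumptions; inspect the actual workbook (for example with sheets.keys()) and adjust them to match.

    import pandas as pd

    # sheet_name=None reads every sheet into a dictionary of DataFrames.
    sheets = pd.read_excel( "shipping-1997.xlsx", sheet_name=None )   # file name assumed
    shipping = sheets["shipping"]          # sheet names assumed; check sheets.keys()
    distances = sheets["distances"]
    abbreviations = sheets["abbreviations"]

    # Build a lookup from abbreviation to full state name, then apply it to both tables.
    full_name = dict( zip( abbreviations["abbreviation"], abbreviations["state"] ) )
    for col in [ "origin", "destination" ]:
        shipping[col] = shipping[col].map( full_name )
        distances[col] = distances[col].map( full_name )

    # Convert the "adjacent" column in the distances table to boolean.
    distances["adjacent"] = distances["adjacent"].astype( bool )

    # Bring the distance and adjacency columns into the shipping table with a merge.
    shipping = pd.merge( shipping, distances, on=["origin","destination"], how="left" )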
Reading
1.10. Day 10 - 6/17/21 - Relations, graphs, and networks¶
1.10.2. Due before next class¶
Data prep exercise for a music recommendation system
In class next time we will build a recommender system for songs (that is, given your preferences and a big database of other people’s preferences, it will try to match you with new songs you might like).
Visit this page and read about the data archive, then download it from there in ZIP format. It is almost 1GB in size, so leave some time for this download!
Unzip the download and find within it three files; we care only about jams.tsv. Place this file in a folder where you can access it with Python and pandas. It contains every user’s “jams” from 2011-2015.
Write some code to load into a pandas Series the full set of unique user IDs in that file. That is, do not include any user more than once in the series. (This code may be slow to run, because the file is large.) This step is asking for just the user IDs, not any jam or song data.
Use the sample() method in pandas Series objects to select a random subset of the users to work with, so that we don’t have to deal with the entire jams file, which would take a long time to do computations with. Include at least 1000 in your sample, to get a sufficient representation of the full dataset. I chose 2000 in my own work, but later computations will get much slower if you go beyond about 2000.
Write some code to load from jams.tsv every jam by all the users in your sample. There are roughly 15 jams per user on average, so you should end up with 15 times as many results as the number of users you chose (about 15,000 to 30,000).
We need only three columns of the result: user ID, artist, and song title. Discard all other columns.
To give a song a unique name string, let’s combine the artist and song title into a single column. That is, rather than a column with “Don’t Stop Believin’” for song title and “Journey” as artist, create a new column called “song” that contains text like “Don’t Stop Believin’, by Journey”.
Drop the original title and artist columns so that your final jams DataFrame contains just two columns, user and song.
Export that DataFrame to a new CSV file that we will analyze in class. Call it jam-sample.csv. (A sketch of one possible approach to these steps appears after this list.)
Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.
Submit that notebook to your instructor through Blackboard.
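A sketch of these steps follows, under some assumptions about the file’s column names (user_id, artist, and title); verify them against the real header row of jams.tsv before relying on this.

    import pandas as pd

    # The file is tab-separated; the column names used below are assumptions.
    jams = pd.read_csv( "jams.tsv", sep="\t" )

    # All unique user IDs as a pandas Series, then a random sample of 2000 of them.
    user_ids = pd.Series( jams["user_id"].unique() )
    sampled_users = user_ids.sample( 2000 )

    # Every jam by a sampled user, keeping only the three columns we need.
    my_jams = jams[jams["user_id"].isin( sampled_users )][["user_id","artist","title"]].copy()

    # Combine artist and title into one "song" column, then drop the originals.
    my_jams["song"] = my_jams["title"] + ", by " + my_jams["artist"]
    my_jams = my_jams.drop( columns=["artist","title"] )

    # Export the two-column result (user and song) for use in class.
    my_jams.to_csv( "jam-sample.csv", index=False )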
Reading
1.11. Day 11 - 6/22/21 - Relations as matrices¶
1.11.2. Due before next class¶
Data preparation exercise
In class next time we will do an introductory machine learning exercise about predicting mortgage approval/denial.
Download the training dataset here. It is a sample from the same mortgage dataset we’ve used many times. Recall that its data dictionary is available online here.
Load it into pandas and check the data types of the columns.
To make all the data numeric, we will be replacing categorical columns with boolean columns in which false is represented by 0 and true is represented by 1. This will make it possible to use that data in a numerical model.
Replace the conforming_loan_limit column with two boolean columns, one that means “conforming loan limit is C (conforming)” and one that means “conforming loan limit is NC (not conforming).” Don’t forget to use 0/1 instead of False/True. (There are other values that column may take on, but we will analyze just those two.)
Replace the derived_sex column with two boolean columns, one that means “derived sex is Male” and one that means “derived sex is Female.” Don’t forget to use 0/1 instead of False/True. (There are other values that column may take on, but we will analyze just those two.)
The action_taken column contains only 1s and 3s. This is because this dataset was filtered to include only accepted or rejected mortgages (no withdrawals, pre-approvals, etc.). Replace this column with another boolean column, still using 0/1 for False/True, meaning “application accepted.”
The debt-to-income ratio column is categorical instead of numeric. Make it numeric by replacing each category with a central value in that category. For instance, the category “20%-<30%” can be replaced with the number 25, the category “43” can be just the number 43, etc. Let’s use 70 for “>60%.” (A sketch of one way to do all of these replacements appears after this list.)
Your newly cleaned data should have all numeric columns. Export it as a CSV file and bring it with you to class for an in-class activity on Day 12.
Follow Deepnote’s instructions for how to export the resulting notebook as a PDF.
Submit that notebook to your instructor through Blackboard.
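Here is a sketch of one way to carry out the cleaning above. The file name and the exact category strings in the debt-to-income mapping are assumptions; check the data dictionary (or the column’s unique values) and include one entry for every category that actually appears.

    import pandas as pd

    df = pd.read_csv( "mortgage-training.csv" )   # file name is an assumption

    # Two 0/1 columns built from conforming_loan_limit, then drop the original.
    df["conforming"] = (df["conforming_loan_limit"] == "C").astype( int )
    df["not_conforming"] = (df["conforming_loan_limit"] == "NC").astype( int )
    df = df.drop( columns=["conforming_loan_limit"] )

    # Same idea for derived_sex.
    df["male"] = (df["derived_sex"] == "Male").astype( int )
    df["female"] = (df["derived_sex"] == "Female").astype( int )
    df = df.drop( columns=["derived_sex"] )

    # action_taken is 1 (accepted) or 3 (rejected); make a 0/1 "accepted" column.
    df["accepted"] = (df["action_taken"] == 1).astype( int )
    df = df.drop( columns=["action_taken"] )

    # Replace each debt-to-income category with a central numeric value.
    # The keys below are examples only; add one entry per category in your data.
    dti_values = { "20%-<30%" : 25, "30%-<36%" : 33, "43" : 43,
                   "50%-60%" : 55, ">60%" : 70 }
    df["debt_to_income_ratio"] = df["debt_to_income_ratio"].map( dti_values )

    # Export the all-numeric result for use in class.
    df.to_csv( "mortgage-cleaned.csv", index=False )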
Reading
(Perhaps more assignments are coming; this section is still incomplete.)
1.12. Day 12 - 6/24/21 - Introduction to machine learning¶
1.13. Day 13 - 6/29/21 - Final Project Workshop¶
We will begin class today answering questions to help you review for the final exam, which is on Thursday, 7/1/21. Then we will end class at the regular Zoom link and transition to the office hours Zoom link…
The second half of class today, at the office hours Zoom link, will consist entirely of help with individuals and teams working on their final project, which is due at 11:59pm on this day.