Let's start with some review questions

To make sure everyone remembers the essentials of the chapter:

  1. What does ETL stand for?
  2. Is ETL the same thing as data munging?
  3. Why are ETL/munging skills important for data work?
  4. Information is data plus what?
  5. Who creates a data dictionary and why?
  6. What does provenance mean?

Missing values

Consider whether each of the following is reasonable or not, and be prepared to say why.

  • In a spreadsheet of daily volunteer hours logged by National Honor Society students, some values are missing. The staff member in charge of the spreadsheet filled those values with the mean across all days for the student in question.
  • In a database of historical immigration records, some records mark a person's age with an X. An analyst, before performing statistics on the data, wants to replace all X values with NaN.
  • A data scientist is trying to fit a multi-variable linear model to predict the rate of reported intolerant behaviors from variables measuring the employer's culture and commitments to diversity messaging and training. Some rows in the dataset of companies contain one or more missing values. The data scientist plans to drop these rows before creating the model.
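
For reference during the discussion, the mechanics of the three approaches above might look like the following in pandas. This is only a sketch under stated assumptions: the column names and every value are made up for illustration, and the code shows how each operation is carried out, not whether it is a reasonable choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; none of these values come from a real dataset.
hours = pd.DataFrame({
    "student": ["A", "A", "A", "B", "B", "B"],
    "hours":   [2.0, np.nan, 4.0, 1.0, 1.0, np.nan],
})

# Scenario 1: fill each student's missing days with that student's own mean.
filled = hours.copy()
filled["hours"] = filled.groupby("student")["hours"].transform(
    lambda s: s.fillna(s.mean())
)

# Scenario 2: replace the placeholder "X" with NaN, then convert to numbers
# so that statistics skip the missing ages instead of failing on text.
ages = pd.Series(["34", "X", "27", "X", "51"])
ages_numeric = pd.to_numeric(ages.replace("X", np.nan))

# Scenario 3: drop every row that has at least one missing value.
companies = pd.DataFrame({
    "training_hours": [10.0, np.nan, 8.0, 12.0],
    "culture_score":  [3.5, 4.0, np.nan, 4.5],
    "incident_rate":  [0.02, 0.05, 0.03, 0.01],
})
complete = companies.dropna()
rows_lost = len(companies) - len(complete)

print(filled)
print(ages_numeric)
print(f"Rows dropped before modeling: {rows_lost} of {len(companies)}")
```

When judging each scenario, it may help to compare the data before and after the operation, for example how many rows survive `dropna()` and whether the filled values hide how much was originally missing.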

File formats

  • When you receive a dataset, which file format is the easiest to read into Python? In what situations does this suggest you should create files in that format yourself?
  • What is the value of a Python pickle file?
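
As a starting point for the pickle question, here is a minimal sketch using only the standard library. The record and its field names are hypothetical; the point is that a pickle round trip returns Python objects with their types intact, which a plain-text format would not do without extra parsing.

```python
import pickle
from datetime import date

# Hypothetical record mixing several Python types.
record = {
    "student": "A",
    "logged_on": date(2024, 3, 1),    # a date object, not a string
    "tags": {"nhs", "tutoring"},      # a set
    "hours": [2.0, None, 4.0],        # a list containing a missing value
}

# Serialize to bytes and restore; pickle.dump/pickle.load do the same with files.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(restored == record)             # contents compare equal after the round trip
print(type(restored["logged_on"]))    # still a datetime.date
print(type(restored["tags"]))         # still a set
```

Two caveats are worth raising in discussion: pickle files are specific to Python, and unpickling data from an untrusted source can execute arbitrary code, so the format suits your own intermediate results better than files shared with others.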

Other activities for today

  • An in-class activity on good and bad data cleaning practices
  • Announcement and discussion of the MA346 Final Project