Data in "tall" form

First Last Day Sales
Amy Smith Monday 39
Amy Smith Tuesday 68
Amy Smith Wednesday 10
Bob Jones Monday 93
Bob Jones Tuesday 85
Bob Jones Wednesday 0

From the reading, what is the value of this form of data?

And what is the other name for this form of data?

Data in "wide" form

First Last Monday Tuesday Wednesday
Amy Smith 39 68 10
Bob Jones 93 85 0

From the reading, what is the value of this form of data?

And what are the verbs used to convert from tall to wide form, or wide to tall form?

Pivot details

Names

  • Gray cells are called index
  • Blue cells are called columns
  • Green cells are called values

Requirements

  • The relationship from index and columns to values must be a function.

Guarantees

  • That one function will become many functions, each in its own column, one for each value in the original columns column.

Melt details

Names

  • Gray columns are called id_vars
  • Red column headers are called value_vars

Requirements

  • The id_vars uniquely identify each row.
  • The relationship from id_vars to each value_vars column is a function.

Guarantees

  • The value_vars column headers will be merged into one single column entitled variable.
  • The value_vars column entries will be merged into one single column entitled value.
  • The id_vars entries will be replicated so that the result is still a function from id_vars and variable to value.

Pivot table details

Same as pivot, except:

  • The relationship between index and columns need not be a function.
  • Therefore some values will need to be combined.
  • And so we must provide a function that will combine them, called aggfunc.

Pivot tables are extremely common for summarizing data, especially since there are so many different aggregation functions. Here is a list of all the built-in ones, and you can also code your own.

Exercises

After the break in class today, you'll be diving into working on some datasets for practice.

To prepare for that, let's do a few exercises for discussion, to refresh your memory on other pandas tools, functions, and syntax.

Exercise 1

Which of the following sentences correctly describes the uses of the pandas functions loc and iloc?

  1. For a DataFrame df, you can use df.loc[...] to look up rows, columns, or cells by their names, and df.iloc[...] to look up rows, columns, or cells by their zero-based numerical index.
  2. For a DataFrame df, you can use df.loc[...] to access rows and df.iloc[...] to access columns.
  3. For a DataFrame df, you can use df.loc[...] to look up one or more rows by integer index and df.iloc[...] to do the same, but counting from the end of the DataFrame (iloc = "inverted loc").
  4. For a DataFrame df, you can use df[...], df.loc[...], and df.iloc[...] interchangeably to get access to individual entries in the DataFrame.

Exercise 2

If you have a DataFrame in the variable df, which of the following are situations in which you would want to execute the code df["sales"] = 0?

  1. You want to remove from your DataFrame a column entitled "sales"
  2. You want to replace the data in an existing "sales" row with zero entries
  3. You want to add a new row entitled "sales," initially populated with zero entries
  4. You want to remove from your DataFrame a row entitled "sales"
  5. You want to replace the data in an existing "sales" column with zero entries
  6. You want to add a new column entitled "sales," initially populated with zero entries
  7. You want to replace one existing entry, named "sales," with a zero
  8. You want to add one new entry, named "sales," initially containing a zero
  9. You want to remove from your DataFrame one entry, the one named "sales"

Exercise 3

Assume we have a DataFrame df with several columns, including "Salary" and "Job Title". How would we find the salaries of anyone whose job title is "Engineer"? (Fill in the blanks.)

indices = df[___________] == __________
salaries = df.loc[____________________]

Exercise 4

What happens when we run the code df["column"].apply( f )?

  1. Python replaces each entry x in df["column"] with the result of f(x)
  2. Python computes and returns the result of f(df["column"])
  3. Python computes and returns a new Series containing f(x) for each entry x in df["columns"]
  4. Python replaces the column df["column"] with the result of f(df["column"])

Exercise 5

Assume that we have read a DataFrame df from a CSV file, and provided no default index, so that its index is the integers from 0 to 9.

Assume further that the rows in df each represent data collected in one particular year. The data were collected beginning with the year 1970, and repeating the data collection every five years, so that the first row is from 1970, the second row is from 1975, and so on.

We want the index of df to represent the year of data collection, which is not currently stored in any of the columns of the DataFrame. Which of the following pieces of code would accomplish that goal?

# Option 1:
df.index = df.index*5 + 1965
df.index.name = 'Year'
# Option 2:
df.index = df.index*5 + 1970
df.index.name = 'Year'  
# Option 3:
df.index = range(0,50,5) + 1965
df.index.name = 'Year'
# Option 4:
df.index = range(0,50,5) + 1970
df.index.name = 'Year'