First | Last | Day | Sales |
---|---|---|---|
Amy | Smith | Monday | 39 |
Amy | Smith | Tuesday | 68 |
Amy | Smith | Wednesday | 10 |
Bob | Jones | Monday | 93 |
Bob | Jones | Tuesday | 85 |
Bob | Jones | Wednesday | 0 |
From the reading, what is the value of this form of data?
And what is the other name for this form of data?
First | Last | Monday | Tuesday | Wednesday |
---|---|---|---|---|
Amy | Smith | 39 | 68 | 10 |
Bob | Jones | 93 | 85 | 0 |
From the reading, what is the value of this form of data?
And what are the verbs used to convert from tall to wide form, or wide to tall form?
Pivot

Names
- `index`
- `columns`
- `values`

Requirements
- The relation from `index` and `columns` to `values` must be a function.

Guarantees
- The result has one column for each distinct entry in the `columns` column.

Melt

Names
- `id_vars`
- `value_vars`

Requirements
- The `id_vars` uniquely identify each row.
- The relation from `id_vars` to each `value_vars` column is a function.

Guarantees
- The `value_vars` column headers will be merged into one single column entitled `variable`.
- The `value_vars` column entries will be merged into one single column entitled `value`.
- The `id_vars` entries will be replicated so that the result is still a function from `id_vars` and `variable` to `value`.

Pivot table

Same as pivot, except:
- The relation from `index` and `columns` to `values` need not be a function.
- Entries that share the same `index` and `columns` are combined using `aggfunc`.

Pivot tables are extremely common for summarizing data, especially since there are so many different aggregation functions. Here is a list of all the built-in ones, and you can also code your own.
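The three operations described above can be sketched on the Amy/Bob sales data from the tables at the top of this section. This is only an illustrative sketch: the variable names (`tall`, `wide`, `tall_again`, `totals`) are made up for this example, and passing a list as `index` to `pivot` assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

# Tall form: one row per (person, day) observation, as in the first table.
tall = pd.DataFrame({
    "First": ["Amy", "Amy", "Amy", "Bob", "Bob", "Bob"],
    "Last":  ["Smith", "Smith", "Smith", "Jones", "Jones", "Jones"],
    "Day":   ["Monday", "Tuesday", "Wednesday"] * 2,
    "Sales": [39, 68, 10, 93, 85, 0],
})

# pivot: each (First, Last, Day) combination occurs exactly once, so the
# relation from index and columns to values is a function, as required.
wide = tall.pivot(index=["First", "Last"], columns="Day", values="Sales")
wide = wide.reset_index()   # turn First/Last back into ordinary columns
wide.columns.name = None    # drop the leftover "Day" label on the header row

# melt: the day headers are merged into a column named "variable" and the
# sales entries into a column named "value" (the pandas default names).
tall_again = wide.melt(
    id_vars=["First", "Last"],
    value_vars=["Monday", "Tuesday", "Wednesday"],
)

# pivot_table: "First" alone does not determine "Sales" (three rows per
# person), so the rows for each person are combined with aggfunc.
totals = tall.pivot_table(index="First", values="Sales", aggfunc="sum")

print(wide)
print(tall_again)
print(totals)
```

Calling `tall.pivot(index="First", columns="Last", values="Sales")` instead would raise an error, because several rows share the same `First` and `Last` values. That is the situation in which `pivot_table` with an `aggfunc` is the appropriate tool.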
After the break in class today, you'll be diving into working on some datasets for practice.
To prepare for that, let's do a few exercises for discussion, to refresh your memory on other pandas tools, functions, and syntax.
Which of the following sentences correctly describes the uses of the pandas indexers `loc` and `iloc`?

- For a DataFrame `df`, you can use `df.loc[...]` to look up rows, columns, or cells by their names, and `df.iloc[...]` to look up rows, columns, or cells by their zero-based numerical index.
- For a DataFrame `df`, you can use `df.loc[...]` to access rows and `df.iloc[...]` to access columns.
- For a DataFrame `df`, you can use `df.loc[...]` to look up one or more rows by integer index and `df.iloc[...]` to do the same, but counting from the end of the DataFrame (`iloc` = "inverted loc").
- For a DataFrame `df`, you can use `df[...]`, `df.loc[...]`, and `df.iloc[...]` interchangeably to get access to individual entries in the DataFrame.

If you have a DataFrame in the variable `df`, which of the following are situations in which you would want to execute the code `df["sales"] = 0`?
Assume we have a DataFrame `df` with several columns, including "Salary" and "Job Title". How would we find the salaries of everyone whose job title is "Engineer"? (Fill in the blanks.)
indices = df[___________] == __________
salaries = df.loc[____________________]
What happens when we run the code `df["column"].apply( f )`?

- It replaces each entry `x` in `df["column"]` with the result of `f(x)`.
- It computes `f(df["column"])`.
- It computes `f(x)` for each entry `x` in `df["column"]`.
- It replaces `df["column"]` with the result of `f(df["column"])`.
Assume that we have read a DataFrame `df` from a CSV file without specifying an index column, so that its index is the default one: the integers from 0 to 9.
Assume further that each row of `df` represents data collected in one particular year. Data collection began in 1970 and was repeated every five years, so the first row is from 1970, the second row is from 1975, and so on.
We want the index of `df` to represent the year of data collection, which is not currently stored in any column of the DataFrame. Which of the following pieces of code would accomplish that goal?
# Option 1:
df.index = df.index*5 + 1965
df.index.name = 'Year'
# Option 2:
df.index = df.index*5 + 1970
df.index.name = 'Year'
# Option 3:
df.index = range(0,50,5) + 1965
df.index.name = 'Year'
# Option 4:
df.index = range(0,50,5) + 1970
df.index.name = 'Year'