Chapter 10 - Visualization

See also the corresponding course notes, here.

View a printable version of these slides here.

Summary of plotting tools

Recall the summary of visualization techniques given in the course notes for today.

Let's review it here.

With one numeric column of data:

If you want to see this	Then use this
Just the distribution's quartiles and outliers	Box plot
Simple approximation of the distribution	Histogram
Very good approximation of the distribution, maybe very wide	Swarm plot
Good approximation of the distribution, not too wide	Strip plot
Good approximation of a large distribution, smoothed	Violin plot
Whether the distribution is approximately normal	Overlapping ECDFs

With two numeric columns of data:

If you want to see this	Then use this
A graph of the data when the data is a function	Line plot
The shape of the data when the data is a relation	Scatter plot
The shape of the data when the data is a relation, plus each variable's distribution	Joint plot
The line of best fit through the data	`sns.lmplot`

With many numeric columns of data:

If you want to see this	Then use this
The quartiles and outliers of each	Side-by-side box plots
Simple approximation of the distributions	Histograms with side-by-side bars
Very good approximation of each distribution (can't fit too many)	Side-by-side swarm plots
Good approximation of each distribution (can fit more)	Side-by-side strip plots
Good approximation if the distributions are large (will be smoothed)	Side-by-side violin plots
The shape of all possible two-column relationships	Pair plot
A measurement of all possible correlations	Heat map of correlation coefficients

Let's practice applying those guidelines to real examples...

Question 1

Data: A series of 300 temperature readings from a single, stationary sensor at regular time intervals

Goal: To see the change in temperature over time

Which visualization type should I choose? Recall our options:

Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
Pair plot, heat map of correlation coefficients

Question 2

Data: A series of 100,000 temperature readings from a single, stationary sensor at regular time intervals

Goal: To see the distribution of temperature values over that time interval

Which visualization type should I choose? Recall our options:

Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
Pair plot, heat map of correlation coefficients

Question 3

Data: A large dataset about students who visited the wellness center with stress-related concerns, including columns about their demographic information, health history, academic record, and extracurricular activities

Goal: Ideas for how to predict risk for students who may be under too much pressure

Which visualization type should I choose? Recall our options:

Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
Pair plot, heat map of correlation coefficients

Question 4

Data: The baseball salaries we investigated in Week 3 of this course

Goal: See changes in the distribution of batter salaries throughout the 2000s

Which visualization type should I choose? Recall our options:

Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
Pair plot, heat map of correlation coefficients

Question 5

Data: The baseball salaries we investigated in Week 3 of this course

Goal: We want to do hypothesis testing on salaries of different groups, and need to ensure approximate normality of distributions first

Which visualization type should I choose? Recall our options:

Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
Pair plot, heat map of correlation coefficients

Visualization Exercise

The homework from Week 3 included documenting some code that compared two groups within a home mortgage dataset, using two histograms on one graph. Today we'll extend that work, so your instructor will provide a copy of the solutions for you to use as a starting point.

Place the notebook your instructor provides in a folder with the corresponding data file and ensure the code runs.
Remove the hypothesis testing work from the end of the file; we will not need it today.
Change the plotting code so that it uses side-by-side plots: box plots, swarm plots, strip plots, or violin plots.
Which of those visualizations did you find most useful for this data, and why?

Optional second exercise, time permitting

Write a function that behaves as follows:

It takes as input two boolean Series, which can be used to split the housing data into two subsamples (just like we split it into low and high minority percentage areas).
It also takes as input the names of these two Series, as two strings. In the example we just did, we might use the names "Low Minority %" and "High Minority %".
The function creates a side-by-side box/swarm/strip/violin plot of property values, with the appropriate axis labels.

It might look like this:

def compare_two_groups(group1, group2, name1, name2):
    pass  # replace this with actual code
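
One possible way to fill in the stub above is sketched below, using box plots. It is only a sketch under some assumptions: the two extra parameters `data` and `column` are additions not in the signature above (in the class notebook the DataFrame may simply be a global variable), and the toy DataFrame with its `property_value` and `minority_pct` columns is invented so the example runs on its own.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def compare_two_groups(group1, group2, name1, name2, data, column):
    """Draw side-by-side box plots of `column` for two subsamples of `data`.

    group1, group2: boolean Series aligned with `data`'s index.
    name1, name2:   labels for the two groups.
    data, column:   assumed extra parameters (not in the original stub).
    """
    # Stack the two subsamples into one long table with a group label.
    labeled = pd.concat(
        [
            pd.DataFrame({"Group": name1, column: data.loc[group1, column]}),
            pd.DataFrame({"Group": name2, column: data.loc[group2, column]}),
        ],
        ignore_index=True,
    )
    fig, ax = plt.subplots()
    # Swap in sns.swarmplot, sns.stripplot, or sns.violinplot as desired.
    sns.boxplot(data=labeled, x="Group", y=column, ax=ax)
    ax.set_xlabel("Group")
    ax.set_ylabel(column)
    return ax


# Toy data, invented for illustration; the real exercise uses the mortgage data.
homes = pd.DataFrame({
    "property_value": [120, 150, 200, 310, 95, 180, 260, 140],
    "minority_pct": [10, 60, 15, 5, 80, 70, 20, 55],
})
low = homes["minority_pct"] < 50
high = homes["minority_pct"] >= 50
ax = compare_two_groups(low, high, "Low Minority %", "High Minority %",
                        homes, "property_value")
```

Building one long table with a `Group` column lets seaborn place the two groups side by side on a shared axis, and changing the plot type then only requires changing a single function call.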