Summary of plotting tools

Recall the summary of visualization techniques given in the course notes for today.

Let's review it here.

With one numeric column of data:

If you want to see this                                       Then use this
------------------------------------------------------------  -----------------
Just the distribution's quartiles and outliers                Box plot
Simple approximation of the distribution                      Histogram
Very good approximation of the distribution, maybe very wide  Swarm plot
Good approximation of the distribution, not too wide          Strip plot
Good approximation of a large distribution, smoothed          Violin plot
Whether the distribution is approximately normal              Overlapping ECDFs

With two numeric columns of data:

If you want to see this                                                               Then use this
------------------------------------------------------------------------------------  -------------
A graph of the data when the data is a function                                       Line plot
The shape of the data when the data is a relation                                     Scatter plot
The shape of the data when the data is a relation, plus each variable's distribution  Joint plot
The line of best fit through the data                                                 sns.lmplot

With many numeric columns of data:

If you want to see this                                               Then use this
--------------------------------------------------------------------  ------------------------------------
The quartiles and outliers of each                                    Side-by-side box plots
Simple approximation of the distributions                             Histograms with side-by-side bars
Very good approximation of each distribution (can't fit too many)     Side-by-side swarm plots
Good approximation of each distribution (can fit more)                Side-by-side strip plots
Good approximation if the distributions are large (will be smoothed)  Side-by-side violin plots
The shape of all possible two-column relationships                    Pair plot
A measurement of all possible correlations                            Heat map of correlation coefficients

Let's practice applying those guidelines to real examples...

Question 1

Data: A series of 300 temperature readings from a single, stationary sensor at regular time intervals

Goal: To see the change in temperature over time

Which visualization type should I choose? Recall our options:

  • Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
  • Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
  • Pair plot, heat map of correlation coefficients

Question 2

Data: A series of 100,000 temperature readings from a single, stationary sensor at regular time intervals

Goal: To see the distribution of temperature values over that time interval

Which visualization type should I choose? Recall our options:

  • Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
  • Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
  • Pair plot, heat map of correlation coefficients

Question 3

Data: A large dataset about students who visited the wellness center with stress-related concerns, including columns about their demographic information, health history, academic record, and extracurricular activities

Goal: To find ideas for how to predict which students may be at risk from too much pressure

Which visualization type should I choose? Recall our options:

  • Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
  • Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
  • Pair plot, heat map of correlation coefficients

Question 4

Data: The baseball salaries we investigated in Week 3 of this course

Goal: To see changes in the distribution of batter salaries throughout the 2000s

Which visualization type should I choose? Recall our options:

  • Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
  • Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
  • Pair plot, heat map of correlation coefficients

Question 5

Data: The baseball salaries we investigated in Week 3 of this course

Goal: To check that the salary distributions of different groups are approximately normal before we do hypothesis testing on them

Which visualization type should I choose? Recall our options:

  • Box plot(s), histogram(s), swarm plot(s), strip plot(s), violin plot(s), ECDFs
  • Line plot, scatter plot, joint plot, line of best fit (sns.lmplot)
  • Pair plot, heat map of correlation coefficients

Visualization Exercise

The homework from Week 3 included documenting some code that compared two groups within a home mortgage dataset, using two histograms on one graph. Today we'll extend that work, so your instructor will provide a copy of the solutions for you to use as a starting point.

  1. Place the notebook your instructor provides in a folder with the corresponding data file and ensure the code runs.
  2. Remove the hypothesis testing work from the end of the file; we will not need it today.
  3. Change the plotting code so that it uses side-by-side plots: box plots, swarm plots, strip plots, or violin plots.
  4. Which of those visualizations did you find most useful for this data, and why?

Optional second exercise, time permitting

Write a function that behaves as follows:

  • It takes as input two boolean Series, which can be used to split the housing data into two subsamples (just like we split it into low and high minority percentage areas).
  • It also takes as input the names of these two Series, as two strings. In the example we just did, we might use the names "Low Minority %" and "High Minority %".
  • The function creates a side-by-side box/swarm/strip/violin plot of property values, with the appropriate axis labels.

It might look like this:

def compare_two_groups(group1, group2, name1, name2):
    """Plot property values for two subsamples side by side, labeled name1 and name2."""
    pass  # replace this with actual code