{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Concatenating and Merging DataFrames\n",
    "\n",
    "<a href=\"../../_slides/chapter-12-slides.html\">See also the slides that summarize a portion of this content.</a>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Why join two datasets?\n",
    "\n",
    "This chapter is about two ways to combine DataFrames together.  The concepts we'll be discussing (concatenation and merging) are not unique to pandas DataFrames; they show up wherever tabular data is used, including in SQL.\n",
    "\n",
    "Combining more than one dataset together is a crucial aspect of data work.  Let's see two examples.\n",
    "\n",
    "**Example 1.**  One of my friends runs [a nonprofit organization](https://secondnature.org/) that helps colleges and universities set climate action goals and track their progress toward keeping them.  He asked my graduate data science course in Fall 2019 to look at their database and come up with any insights.  Naturally, their database had records of all the climate goals and progress for schools they were working with, but it didn't have much other information about those schools.  What if we wanted to analyze a variable they weren't tracking, like endowment?  Or what if we wanted to look at schools that hadn't yet partnered with the nonprofit?  That information would need to be brought in from another dataset.  Until we do so, we can't give interesting answers to the question the client posed.\n",
    "\n",
    "**Example 2.**  One of my colleagues in the math department told me about a clever strategy one investment group used to predict the earnings of companies they were considering investing in.  They already had lots of data about each company, including the addresses of the company's various offices and factories.  They could also purchase access to a large database of satellite images.  They used the addresses and some image-detection software to compute the number of cars in the parking lots of the company's properties.  This turned out to be a very useful predictor of growth that they could access before their competing investors had the information.  It involved bringing together two datasets in a clever way.\n",
    "\n",
    "In this chapter, we'll discuss how to combine just two DataFrames, but the ideas apply if you have more than two.  For instance, to concatenate five DataFrames `df1` through `df5`, we can proceed in pairs, combining `df1` and `df2`, then combining that result with `df3`, and so on until we have included `df5`.\n",
    "\n",
    "Let's start by discussing concatenation, which is definitely the easier of the two concepts, before we tackle merging.  The English verb \"concatenate\" means to attach two things together, one after the end of the other."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Concatenation is vertical\n",
    "\n",
    "DataFrames are tables of data, so when combining, we'll either be stacking them vertically or horizontally.  Concatenation is vertical stacking.\n",
    "\n",
    "It is an extremely common operation.  Very often what happens after you get some data is that (not surprisingly) you later get more of the same type of data.\n",
    " * For instance, if you're taking scientific measurements in a lab, one week you get a set of measurements, and the next week you get more data in the same format.\n",
    " * Or if you're following a stock or other financial instrument, its prices one week form a dataset, then the next week, you see more data with the same format.\n",
    "\n",
    "Because the standard way to organize tabular data is to put observations in rows, then getting more observations means we just need to add more rows onto the bottom of our previous table of data.  This is what concatenation is for.  Here's an illustration using the stock prices example, with data that comes from Renewable Energy Group, Inc., whose 2020 data we've seen in an earlier chapter.\n",
    "\n",
    "![Illustration of stacking two tables of stock data vertically into one larger table](_images/concat-of-stock-data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two important things to notice in the picture.\n",
    " 1. All that's happening is that we're stacking data vertically.  It's very straightforward!\n",
    " 2. In order for us to stack two DataFrames, they must have the same columns.  The column headers are highlighted in blue to emphasize that they're the same in every table.\n",
    " \n",
    "(There are ways to deal with the case where new data comes in with different column headers; we're covering the most common case here.)\n",
    "\n",
    "The code to do this is extremely easy; it is a single call to the `pd.concat()` function.  You provide a Python list of all the DataFrames to concatenate; in this case, we have just two.  We tell it to ignore the old indexes and create a new one, so that we don't have duplicate index entries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "       Date   Open   High    Low  Close\n",
       "0  2-Jan-20  27.21  27.95  26.62  27.89\n",
       "1  3-Jan-20  28.16  28.95  27.73  28.82\n",
       "2  6-Jan-20  28.53  28.81  28.00  28.39\n",
       "3  7-Jan-20  28.17  28.28  26.08  26.44\n",
       "4  8-Jan-20  26.37  26.40  24.86  25.19"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Date</th>\n      <th>Open</th>\n      <th>High</th>\n      <th>Low</th>\n      <th>Close</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2-Jan-20</td>\n      <td>27.21</td>\n      <td>27.95</td>\n      <td>26.62</td>\n      <td>27.89</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>3-Jan-20</td>\n      <td>28.16</td>\n      <td>28.95</td>\n      <td>27.73</td>\n      <td>28.82</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>6-Jan-20</td>\n      <td>28.53</td>\n      <td>28.81</td>\n      <td>28.00</td>\n      <td>28.39</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>7-Jan-20</td>\n      <td>28.17</td>\n      <td>28.28</td>\n      <td>26.08</td>\n      <td>26.44</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>8-Jan-20</td>\n      <td>26.37</td>\n      <td>26.40</td>\n      <td>24.86</td>\n      <td>25.19</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 1
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df_jan = pd.read_csv( '_static/regi-prices-jan-2020.csv' )\n",
    "df_feb = pd.read_csv( '_static/regi-prices-feb-2020.csv' )\n",
    "\n",
    "df_2mo = pd.concat( [ df_jan, df_feb ], ignore_index=True )\n",
    "df_2mo.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "         Date   Open   High    Low  Close\n",
       "35  24-Feb-20  29.16  29.47  28.08  29.07\n",
       "36  25-Feb-20  29.40  29.40  26.83  27.60\n",
       "37  26-Feb-20  27.59  28.93  27.30  27.84\n",
       "38  27-Feb-20  27.13  27.56  25.85  25.89\n",
       "39  28-Feb-20  24.90  26.66  24.51  26.45"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Date</th>\n      <th>Open</th>\n      <th>High</th>\n      <th>Low</th>\n      <th>Close</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>35</th>\n      <td>24-Feb-20</td>\n      <td>29.16</td>\n      <td>29.47</td>\n      <td>28.08</td>\n      <td>29.07</td>\n    </tr>\n    <tr>\n      <th>36</th>\n      <td>25-Feb-20</td>\n      <td>29.40</td>\n      <td>29.40</td>\n      <td>26.83</td>\n      <td>27.60</td>\n    </tr>\n    <tr>\n      <th>37</th>\n      <td>26-Feb-20</td>\n      <td>27.59</td>\n      <td>28.93</td>\n      <td>27.30</td>\n      <td>27.84</td>\n    </tr>\n    <tr>\n      <th>38</th>\n      <td>27-Feb-20</td>\n      <td>27.13</td>\n      <td>27.56</td>\n      <td>25.85</td>\n      <td>25.89</td>\n    </tr>\n    <tr>\n      <th>39</th>\n      <td>28-Feb-20</td>\n      <td>24.90</td>\n      <td>26.66</td>\n      <td>24.51</td>\n      <td>26.45</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 2
    }
   ],
   "source": [
    "df_2mo.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `pd.concat()` function is actually much more powerful than just this one little use to which we've put it here.  But we will discuss that more after we've discussed the more complex of the operations in this chapter, merging."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merging is horizontal\n",
    "\n",
    "Concatenation was appropriate when we had new rows (that is, new observations) to add to our dataset.  But what if we had new columns instead?  Keep in mind that, under the standard way we organize tabular data, columns represent the *variables* in our dataset.  So getting new columns means learning more information about the rows we already had.\n",
    "\n",
    "We saw a simple example of this in a recent in-class activity; it was simple enough that we didn't need to learn the full power of merging to handle it.  Recall that we had a dataset of home mortgage applications, and we wanted to add into it a variable that measured political affiliation of the state in which the mortgage took place.  We thus got a table that provided a measure of political alignment for each state, and we used that to *add a new column* to our old home mortgage dataset.  Each row in the mortgage dataset got a new variable measuring political alignment.  The table grew *horizontally* with new information from another table.\n",
    "\n",
    "In fact, when we have only one column to add, the technique from last week's class is easier than the full complexity of merging.  Recall how we did it:\n",
    "\n",
    "```python\n",
    "# make a dictionary that maps state abbreviations to voting measurements\n",
    "repub_votes_in_state = dict( zip( df_election['State'], df_election['Trump'] ) )\n",
    "\n",
    "# apply that dictionary to our home mortgage data to make a new column\n",
    "df_mortgages['Trump2016%'] = df_mortgages['State'].apply( repub_votes_in_state )\n",
    "```\n",
    "\n",
    "But what if the situation is more complicated?  This can happen in several ways.  In each way, `pd.merge()` is there to solve the problem.  Let's look at each way that tables might grow horizontally."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding many columns at once\n",
    "\n",
    "The technique shown above, which we used last week in class, is easy if bringing in only one new column.  If we wanted to bring in many new columns, we'd need to apply that technique repeatedly, in a loop over those columns.  But `pd.merge()` can do it all in one function call, and for the reasons we learned last week, that will probably be faster than a Python loop.\n",
    "\n",
    "Let's consider a concrete example to understand the idea of importing several new columns at once.  Consider a dataset that's been very important over the past year, tracking the number of confirmed COVID-19 cases over time in various countries.  Let's say we wanted to see if the growth patterns in such a dataset were in any way related to health care information about the country, such as how much they spend on health care, how many doctors per capita, and so on.  We'll need to bring in another dataset with all that information about each country, and import it in as new columns.  See the illustration below.\n",
    "\n",
    "(All tables illustrated from here on will have \"...\" in the final rows and columns, to indicate that the table is really much bigger, and we're showing only a portion in the illustration.)\n",
    "\n",
    "![Merging COVID-19 time series data for each country with health care information for each country](_images/merge-of-health-data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The resulting DataFrame, on the bottom of the illustration, has all the data we want about each country, the COVID case data followed by the health data.\n",
    "\n",
    "If the rows were not in exactly the same order in each DataFrame, the ones on the right will be reordered so that they match correctly with the rows on the left.  To do this, we need a unique ID for each row that is consistent across both datasets.  In this case, we would use the country name.\n",
    "\n",
    "We're making two important assumptions here.\n",
    " 1. The list of countries is exactly the same in both datasets, so we don't have any leftover rows in either one.  This is rarely how actual data works; there's usually some discrepancy, so we'll discuss [next](#when-there-is-no-match-for-some-rows) how to handle that.\n",
    " 2. The country names are spelled and formatted exactly the same in both datasets.  This is also not always true, so [at the end of this chapter](#ensuring-a-unique-id-appears-in-both-datasets), we'll talk about how to fix that problem if and when it arises in your own work.\n",
    "\n",
    "This operation is called a *merge* in pandas or a *join* in SQL.  We could do it with code like the following.  We say we \"merge *on*\" the column we're using as the unique ID.  So the illustration above is a merge on country name (or a join on country name).  In the left dataset, the column is called \"Country/Region\" and in the right dataset, it's called \"Country.\"  So the code for this merge looks like the following.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_cases, df_health,\n",
    "    left_on='Country/Region', right_on='Country' )\n",
    "```\n",
    "\n",
    "If the column name had been the same in both DataFrames, we could have done it more succinctly.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_cases, df_health, on='Country' )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When there is no match for some rows\n",
    "\n",
    "The first assumption mentioned above was that each row in the COVID dataset matched up with exactly one row in the health dataset.  The two datasets were the same size and had the same countries.  But what if this had not been the case?  Let's consider two merging examples where the rows of the one dataset don't match up perfectly with those of the other.  First, what if some rows in one dataset don't match up with any rows from the other dataset?\n",
    "\n",
    "Recall the example from the start of this chapter about my friend's nonprofit.  I gave my students a comprehensive database from the U.S. government detailing lots of information about every institution of higher education in the U.S., over 7000 of them.  We wanted to merge that with the list of schools who had partnered with the climate nonprofit, of which there were fewer than 500.  Of course, the nonprofit hadn't partnered with *every* school in the U.S.; that would be impressive!  So clearly some of the rows in the big dataset were not going to match with any of the rows in the climate dataset.  What do we do in that case?\n",
    "\n",
    "Keeping in mind the goal of that project, we want to ensure that we keep in our dataset all the schools in the comprehensive dataset, because we will want to do analytics on those schools who *haven't* signed up with the nonprofit.  There may be interesting patterns that help us see which schools tend not to sign up.  But the rows for those schools will not have any climate data to add, so there will be a lot of missing values in the merged dataset, as shown in the following illustration.\n",
    "\n",
    "![Merging a comprehensive higher education dataset with climate data about a subset of its schools](_images/merge-of-school-data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the comprehensive dataset has over 7000 rows and we add climate data for less than 500 schools, the vast majority of the rows (about 6500/7000, or 93%) of them have no climate data, only missing values.  Those missing values are shown as blank cells in the illustration, but pandas would show them as NaNs.\n",
    "\n",
    "But this is exactly how we wanted it, because then we can consider two subpopulations, the schools with climate data and the schools without.  We could investigate differences in their attributes and perhaps verify some such differences with hypothesis tests or other tools.\n",
    "\n",
    "Because we used the *left* DataFrame as the definitive one, which we did not want to alter, and we brought the *right* DataFrame into it, we call this a *left join.*  The code for doing this operation is exactly like the previous `pd.merge()` example, with one exception: we tell it that the left DataFrame is the definitive one, using the `how` keyword.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_big, df_climate,\n",
    "    left_on='NAME', right_on='fullname', how='left' )\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we had chosen to do `how='right'` instead, the right DataFrame would be considered the definitive one.  Any school from the left DataFrame that didn't appear in the right DataFrame would be discarded, and we would end up with under 500 rows, precisely one row for each school in the climate nonprofit's dataset.\n",
    "\n",
    "Note that we're still making the unrealistic assumption that the school names in the government dataset will match perfectly with those in the nonprofit's dataset, and we'll address that [at the end of the chapter](#ensuring-a-unique-id-appears-in-both-datasets).\n",
    "\n",
    "This example showed what it was like if some of the rows in the left dataset match up with *zero* rows in the right dataset.  But what if they match up with *many* rows in the right dataset?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When there are many matches for some rows\n",
    "\n",
    "Let's consider another example, this one from sports.  We'll use NFL football, but if you're not familiar with the sport, the example will still make sense.  All you need to know is that each team has many players, and that each *play* is a small part of a football game that uses just some of the team's players.  Some plays have a *receiver,* which is the player who catches the ball thrown to him (if any---sometimes the play does not involve throwing the ball).\n",
    "\n",
    "As always in this chapter, imagine two datasets.  The first is the set of all NFL players in a certain year and their stats for that year.  (You can get these datasets online for free; here I'll use a small sample of the players from the 2009 season.)  The second is the set of all plays that happened in that same season, in any game.  (The NFL lets you fetch this data from their website for free; again, I'll use a small sample of plays from the 2009 season.)\n",
    "\n",
    "Perhaps we have a theory we want to test about a team's receivers.  We want to compare certain statistics about the receiver to how the receiver performs in certain plays.  (The details are unimportant.)  So we will need to combine the two datasets, one with player stats and one with the plays from the games.  We will want to match them up so that a row in the merged dataset contains the stats for the player who caught the ball, that is, the receiver for that play.\n",
    "\n",
    "Now let's consider how we will handle the many possibilities for how rows might match across the datasets.  First let's consider rows that match many other rows; this might happen in two ways.\n",
    " * **What if a player is the receiver in more than one play?**  (This happens all the time, of course.  Once a player is hired by a team, they often play in lots of games, and are involved in many plays.)  We will want the player's stats to appear in *every* play for which the player was the receiver.  Good news!  This is how merges always work; if a row in one DataFrame matches many rows in the other, the row is always *copied.*\n",
    " * **What if a play has more than one receiver?**  This actually cannot happen, according to the rules of the NFL.  Once a player has caught the ball, they are not eligible to pass it to another player.  (If you're familiar with football, don't start talking about laterals; that's not a pass!)  So we don't have to consider this possibility.\n",
    " \n",
    "So those two considerations don't seem to change our merging code at all.  It seems like a standard merge will do what we want.\n",
    "\n",
    "But what about a row in one dataset matching zero rows in the other dataset?  This, too, might happen in two ways.\n",
    " * **What if a player is the receiver in no play?**  (This happens often also.  A player may be hired by a team, but is not as good as other players on the team, and thus does not yet get to play in real games.)  We will not want this player to appear at all in our merged dataset, because we care about receivers who showed up in actual plays.\n",
    " * **What if a play has no receiver?**  (This happens often also.  There are many types of plays and not all involve throwing.)  We will not want this play to appear in our merged dataset, because the analysis we want to do is about plays that have a receiver.\n",
    "\n",
    "Putting these two considerations together, it does not seem like we want either a left join or a right join.  Recall that a left join keeps all the rows of the left table and a right join keeps all the rows of the right table.  In this case, however, we want to keep only rows that appear in *both* tables.  This is called an *inner join,* and you can see it working in the illustration below.\n",
    "\n",
    "![Merging NFL player statistics with records of NFL plays](_images/merge-of-nfl-data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code looks the same as before, but only the `how` parameter has changed, now using the value `\"inner\"` rather than `\"left\"` or `\"right\"`.  Actually, `\"inner\"` is the default value for `pd.merge()`, so you can omit it in this case, but I include it for emphasis.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_players, df_plays,\n",
    "    left_on='Player', right_on='Receiver', how='inner' )\n",
    "```\n",
    "\n",
    "Notice that we specifically say that we want the stats for the player who was the receiver in the play, by asking the merge to happen using the Player column from the left dataset and the Receiver column from the right dataset.\n",
    "\n",
    "This kind of merge will not introduce any new missing values, because if a row didn't exist in the left or right dataset, it was not included in the result.  That's the definition of an inner join, and that's why we chose to use that method in this case."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## When I want to keep all the rows\n",
    "\n",
    "An inner join is not appropriate for all merging situations.  Consider a different example.\n",
    "\n",
    "Let's imagine that two Bentley professors found out they had done research on some of the same firms, and wanted to share data.  Let's say Professor Adams had investigated the executives at a set of firms, and had information about those roles, while Professor Cordova had information about the marketing investments of a similar set of firms.\n",
    "\n",
    "When putting their data together, they don't yet know what questions they're going to ask; they'll probably start with some exploratory data analysis.  So they don't want to throw away any of their data yet.\n",
    "\n",
    "If they used an inner join, then they'd keep only the firms that appear in both datasets; that's not what they want.  A left or right join would also discard some firms.  But they want to keep them all.  This is called an *outer join,* and it's shown in the illustration below.\n",
    "\n",
    "(The split of the data into three categories, each of size 50, is just for this example.  A real example is unlikely to be separated so symmetrically.)\n",
    "\n",
    "![Merging data about corporate executives with data about marketing spending for a fictitious set of firms](_images/merge-of-firm-data.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"Firm\" column in the merged dataset will contain each name only once, and the row will be of one of three types.\n",
    " 1. If it was in both datasets, then the row contains data in every column (as long as the original datasets did).\n",
    " 2. If it was in the left dataset, then the row contains data about executives, with missing values for marketing.\n",
    " 3. If it was in the right dataset, then the row contains data about marketing, with missing values for executives.\n",
    "\n",
    "(Obviously, if the firm was in neither dataset, it doesn't show up in the merge.)\n",
    "\n",
    "The code is the same as all the code we've seen up to this point, but with `how='outer'`.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_execs, df_marketing, on='Firm', how='outer' )\n",
    "```"
   ]
  },
  {
   "source": [
    "## Is joining the same as merging?\n",
    "\n",
    "In most data science or database contexts, these two terms refer to the same idea.  However, in pandas, they are two different functions that behave almost exactly the same.  Just like pandas has both `map` and `apply` that behave similarly but not exactly the same (which is frustrating), it also has `merge` and `join` that behave similarly but not exactly the same (which is also frustrating).\n",
    "\n",
    "Because they are so similar in function, if you have learned `pd.merge()`, you probably do not need to bother learning `pd.join()`.  The one exception is that if you want to merge two DataFrames using the *index* from one or both as if it were a column on which to merge, then `pd.join()` makes that easier than `pd.merge()` does.  In fact, merging on DataFrame indexes is the default behavior for `pd.join()`.  So if that's what you need, `pd.join()` is probably easier to use.  In every other case, you can just stick with `pd.merge()`."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "Before we tackle the challenging question of what happens if there is no unique ID to use for merging, let's review where we've been and add some key details.\n",
    "\n",
    "```{admonition} Big Picture - Concat adds rows and merge adds columns (usually!)\n",
    "---\n",
    "class: alert alert-primary\n",
    "---\n",
    "As I've introduced it here, `pd.concat()` combines the rows of two DataFrames together and `pd.merge()` combines the columns.  While `pd.concat()` always adds rows, `pd.merge()` may or may not, depending on whether you use left, right, inner, or outer joins.\n",
    "\n",
    "Although `pd.concat()` and `pd.merge()` have tons of options that let you do merges and concatenations in the opposite direction from what I taught here (e.g., concat horizontally or merge vertically), this is almost never what is called for in a data project, due to the way we typically arrange tabular data.\n",
    "```\n",
    "\n",
    "The `pd.concat()` function is the easy one, and simply unites two datasets vertically.  The `pd.merge()` function is the more complicated of the two.  Let's imagine that we've called `pd.merge(A,B)` for two DataFrames `A` and `B`.\n",
    "\n",
    " * With `how='inner'`, the default, it creates new rows for every pair of rows from `A` and `B` that match on the specified columns, and it discards everything else.\n",
    " * With `how='left'`, it creates new rows for every pair of rows from `A` and `B` that match on the specified columns, plus it also keeps every row from `A` that didn't match anything from `B`, and fills in their `B` columns with missing values.  This sees `A` as the important dataset, into which we're bringing some information from `B` where possible.\n",
    " * With `how='right'`, the reverse happens.  But you don't need this option if you prefer thinking of the left dataset as the important one, into which we're bringing new columns on the right.  Instead of `pd.merge(A,B,how='right')`, you can always just use `pd.merge(B,A,how='left')` instead.\n",
    " * With `how='outer'`, it creates new rows for every pair of rows from `A` and `B` that match on the specified columns.\n",
    "    * It also keeps every row from `A` that didn't match anything from `B`, and fills in their `B` columns with missing values.\n",
    "    * It also keeps every row from `B` that didn't match anything from `A`, and fills in their `A` columns with missing values.\n",
    "    * This throws no data away.\n",
    "\n",
    "And as a final reminder, we're covering merging because it's extremely common and useful to find that you have two related datasets or databases that you want to bring together, so that subsequent analyses can benefit from relating the data in the two sources.\n",
    "\n",
    "And yet it's not common for those two datasets to have been planned carefully enough in advance that they share a unique ID system for their rows.  More than likely, the two datasets were created by different teams, organizations, or software systems, and have quite different contents and formats.  So we come to the final section of this chapter, figuring out how to do a merge even when there isn't an obvious unique ID column to use for merging."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ensuring a unique ID appears in both datasets\n",
    "\n",
    "Ensuring that the datasets you want to merge each have a column that will match perfectly with the other dataset is an essential step before merging.  Sometimes that step is extremely easy and sometimes it is very challenging.  In the examples above, we assumed that the datasets already had columns that would match up perfectly.\n",
    "\n",
    "And that's not always an unrealistic assumption.  For instance, when we merged the NPR voting records from 2016 into the home mortgage dataset in class, we merged on the two-letter abbreviation for each state.  This standard set of abbreviations was established many years ago and is used consistently everywhere U.S. states are mentioned, so it was reliable and required no work on our part.\n",
    "\n",
    "But let's consider some more complex cases, so you're ready for them when you encounter them.\n",
    "\n",
    "### Merging on multiple columns\n",
    "\n",
    "If you don't have a single column that works as a unique ID, but you have a set of columns that togther form a unique ID in the same way in each dataset, pandas supports merging on multiple columns.  For instance, if your datasets each have columns for first and last names of the people in an organization, and you're confident that no names repeat (e.g., only one John Smith, only one Erin Jones, etc.), then you can tell pandas to use more than one column to identify rows when merging.  Just supply the list of column names when merging.\n",
    "\n",
    "```python\n",
    "df_merged = pd.merge( df_members, df_activities,\n",
    "    left_on=['First Name','Last Name'],\n",
    "    right_on=['Given name','Surname'] )\n",
    "```\n",
    "\n",
    "### Changing the format of a column\n",
    "\n",
    "When you plan to merge two datasets, but no column is appropriate for the match, sometimes a quick computation of a new column will do the trick.\n",
    "\n",
    "**Example:**  If you were merging a dataset of customers using their phone numbers, perhaps dataset A contains just the numeric values (e.g., 17818913171) and dataset B contains the phone numbers formatted for human readability (e.g., +1 (781) 891-3171).  You can create a new column in dataset B that removes all the spaces, plusses, minuses, and parentheses from the phone numbers, so that they're ready to match with dataset A.\n",
    "\n",
    "### Joining multiple columns into one\n",
    "\n",
    "It may also be possible to compute an appropriate column for merging by combining more than one column together.  \n",
    "\n",
    "**Example:**  Let's say you were merging two datasets about albums released by recording artists.  The artists have a unique ID in your datasets, but the albums don't.  If you know that no artist released more than one album in the same month, you could combine together the artist's unique ID with the month and year of the album's release to form a unique ID for the album.  E.g., if The Beatles had ID 2789045 and you're considering the Sgt. Pepper album (May 1967), then you would use the code 2789045-May-1967 for that album.  You could compute such a code for each row in each DataFrame.\n",
    "\n",
    "### Sequences with different frequencies\n",
    "\n",
    "Another common problem is merging two types of time-based data that were reported on different time scales.  For instance, let's say you are trying to study police activity and criminal activity in a city.  You have crime data in the form of daily records and police reports in terms of officers' hourly shifts.  If you wanted to combine these two datasets based on time, the difference in reporting frequency means it's not obvious how to do it.\n",
    "\n",
    "So pandas provides two functions for helping with such situations.  These notes do not cover them in detail, but suggest you check out the [documentation for `pd.merge_ordered()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_ordered.html) and the [documentation for `pd.merge_asof()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html) for more sophisticated handling of time-based merge data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What about unstandardized text?\n",
    "\n",
    "This is more or less the hardest scenario.  For instance, in Fall 2019, when my students wanted to merge the government's comprehensive database of universities with the climate commitments of the schools who were working with our nonprofit client, our best option was to merge on the institution's name.  This is problematic due to variations in naming and spelling.  For instance, what if one dataset writes Bentley University and the other writes Bentley Univ.?  Or what if one dataset writes University of North Carolina at Chapel Hill and the other writes UNC Chapel Hill?  How is a computer to know how to match these up?  (That project actually involved merging several datasets about universities, and this same problem arose more than once!)\n",
    "\n",
    "The short answer is that the computer will not figure this out, because `pd.merge()` only matches on exact equality of IDs, and so you as the data scientist are in charge of somehow creating columns of unique IDs in both datasets that will match up perfectly.  This may require learning something about that domain.  In Fall 2019, my students and I spent time googling various schools whose names didn't seem to appear in the government's dataset to figure out why!\n",
    "\n",
    "When you're stuck trying to get two similar-but-not-the-same columns of text to try to match perfectly, I suggest the following method.  Whether this method is quick and easy or long and difficult varies significantly from one problem to the next.  But the outline is the same.\n",
    "\n",
    " 1. Figure out the column in each dataset that is *closest* to being useful as a unique ID.  (In the university example, this was the university name in each dataset, which was written the same in both datasets for many schools, but definitely not all.)\n",
    " 2. Figure out which dataset is to be the definitive one; this is typically the larger dataset.  (In the university example, this was the comprehensive government dataset.)  We will use the merge column from this definitive dataset as the \"official\" ID for each row, and we must adjust the other dataset so that it uses these \"official\" IDs rather than its own versions/spellings.\n",
    " 3. Add a new column to the smaller dataset that contains the official unique ID *from the other, larger dataset* that it should match.  (In the university example, this means labeling each row in the nonprofit's dataset with that school's name as it appears in the government's dataset.)  This is not always easy.\n",
    " 4. Run `pd.merge()` and have it match the unique ID column in the larger dataset with this newly created column in the smaller dataset, which is now a perfect match.\n",
    "\n",
    "Notice that steps 1, 2, and 4 are quick and easy, but step 3 is where problems may or may not arise.  Depending on how well the chosen columns match in the two datasets, step 3 might take a short time or a long time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extended Example\n",
    "\n",
    "Let's actually try to merge two datasets of university data.  I will load here the comprehensive university dataset I mentioned, originally downloaded from [here](https://data.world/kitedwards08/us-university-survey-2014), as well as a US News university rankings dataset, originally downloaded from [here](https://data.world/education/university-rankings-2017)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "            X          Y   FID  IPEDSID  \\\n",
       "0  -92.260490  34.759308  7001   107840   \n",
       "1 -121.289431  38.713353  7002   112181   \n",
       "2 -118.287070  34.101481  7003   116660   \n",
       "3 -121.652662  36.700631  7004   125310   \n",
       "4  -71.070737  42.369930  7005   164368   \n",
       "\n",
       "                                       NAME                 ADDRESS  \\\n",
       "0                           Shorter College           604 Locust St   \n",
       "1             Citrus Heights Beauty College          7518 Baird Way   \n",
       "2  Joe Blasco Makeup Artist Training Center   1670 Hillhurst Avenue   \n",
       "3                  Waynes College of Beauty  1271 North Main Street   \n",
       "4        Hult International Business School      1 Education Street   \n",
       "\n",
       "        ADDRESS2            CITY STATE    ZIP  ...          ALIAS SIZE_SET  \\\n",
       "0  NOT AVAILABLE   N Little Rock    AR  72114  ...  NOT AVAILABLE       -3   \n",
       "1  NOT AVAILABLE  Citris Heights    CA  95610  ...  NOT AVAILABLE       -3   \n",
       "2  NOT AVAILABLE     Los Angeles    CA  90027  ...  NOT AVAILABLE       -3   \n",
       "3  NOT AVAILABLE         Salinas    CA  93906  ...  NOT AVAILABLE       -3   \n",
       "4  NOT AVAILABLE       Cambridge    MA  02141  ...  NOT AVAILABLE       -3   \n",
       "\n",
       "   INST_SIZE PT_ENROLL  FT_ENROLL TOT_ENROLL  HOUSING DORM_CAP  TOT_EMPLOY  \\\n",
       "0          1        24         28         52        2        0          18   \n",
       "1          1         6         24         30        2        0           9   \n",
       "2          1         0         24         24        2        0          11   \n",
       "3          1        18         16         34        2        0           9   \n",
       "4          2         0       2243       2243        2        0         143   \n",
       "\n",
       "      SHELTER_ID  \n",
       "0  NOT AVAILABLE  \n",
       "1  NOT AVAILABLE  \n",
       "2  NOT AVAILABLE  \n",
       "3  NOT AVAILABLE  \n",
       "4  NOT AVAILABLE  \n",
       "\n",
       "[5 rows x 46 columns]"
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>X</th>\n      <th>Y</th>\n      <th>FID</th>\n      <th>IPEDSID</th>\n      <th>NAME</th>\n      <th>ADDRESS</th>\n      <th>ADDRESS2</th>\n      <th>CITY</th>\n      <th>STATE</th>\n      <th>ZIP</th>\n      <th>...</th>\n      <th>ALIAS</th>\n      <th>SIZE_SET</th>\n      <th>INST_SIZE</th>\n      <th>PT_ENROLL</th>\n      <th>FT_ENROLL</th>\n      <th>TOT_ENROLL</th>\n      <th>HOUSING</th>\n      <th>DORM_CAP</th>\n      <th>TOT_EMPLOY</th>\n      <th>SHELTER_ID</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>-92.260490</td>\n      <td>34.759308</td>\n      <td>7001</td>\n      <td>107840</td>\n      <td>Shorter College</td>\n      <td>604 Locust St</td>\n      <td>NOT AVAILABLE</td>\n      <td>N Little Rock</td>\n      <td>AR</td>\n      <td>72114</td>\n      <td>...</td>\n      <td>NOT AVAILABLE</td>\n      <td>-3</td>\n      <td>1</td>\n      <td>24</td>\n      <td>28</td>\n      <td>52</td>\n      <td>2</td>\n      <td>0</td>\n      <td>18</td>\n      <td>NOT AVAILABLE</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>-121.289431</td>\n      <td>38.713353</td>\n      <td>7002</td>\n      <td>112181</td>\n      <td>Citrus Heights Beauty College</td>\n      <td>7518 Baird Way</td>\n      <td>NOT AVAILABLE</td>\n      <td>Citris Heights</td>\n      <td>CA</td>\n      <td>95610</td>\n      <td>...</td>\n      <td>NOT AVAILABLE</td>\n      <td>-3</td>\n      <td>1</td>\n      <td>6</td>\n      <td>24</td>\n      <td>30</td>\n      <td>2</td>\n      <td>0</td>\n      <td>9</td>\n      <td>NOT AVAILABLE</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>-118.287070</td>\n      <td>34.101481</td>\n      <td>7003</td>\n      <td>116660</td>\n      <td>Joe Blasco Makeup Artist Training Center</td>\n      <td>1670 Hillhurst Avenue</td>\n      <td>NOT AVAILABLE</td>\n      <td>Los Angeles</td>\n      <td>CA</td>\n      <td>90027</td>\n      <td>...</td>\n      <td>NOT AVAILABLE</td>\n      <td>-3</td>\n      <td>1</td>\n      <td>0</td>\n      <td>24</td>\n      <td>24</td>\n      <td>2</td>\n      <td>0</td>\n      <td>11</td>\n      <td>NOT AVAILABLE</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>-121.652662</td>\n      <td>36.700631</td>\n      <td>7004</td>\n      <td>125310</td>\n      <td>Waynes College of Beauty</td>\n      <td>1271 North Main Street</td>\n      <td>NOT AVAILABLE</td>\n      <td>Salinas</td>\n      <td>CA</td>\n      <td>93906</td>\n      <td>...</td>\n      <td>NOT AVAILABLE</td>\n      <td>-3</td>\n      <td>1</td>\n      <td>18</td>\n      <td>16</td>\n      <td>34</td>\n      <td>2</td>\n      <td>0</td>\n      <td>9</td>\n      <td>NOT AVAILABLE</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>-71.070737</td>\n      <td>42.369930</td>\n      <td>7005</td>\n      <td>164368</td>\n      <td>Hult International Business School</td>\n      <td>1 Education Street</td>\n      <td>NOT AVAILABLE</td>\n      <td>Cambridge</td>\n      <td>MA</td>\n      <td>02141</td>\n      <td>...</td>\n      <td>NOT AVAILABLE</td>\n      <td>-3</td>\n      <td>2</td>\n      <td>0</td>\n      <td>2243</td>\n      <td>2243</td>\n      <td>2</td>\n      <td>0</td>\n      <td>143</td>\n      <td>NOT AVAILABLE</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 46 columns</p>\n</div>"
     },
     "metadata": {},
     "execution_count": 3
    }
   ],
   "source": [
    "df_big = pd.read_csv( '_static/Colleges_and_Universities.csv' )\n",
    "df_big.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "                    Name       Location  Rank  \\\n",
       "0   Princeton University  Princeton, NJ     1   \n",
       "1     Harvard University  Cambridge, MA     2   \n",
       "2  University of Chicago    Chicago, IL     3   \n",
       "3        Yale University  New Haven, CT     3   \n",
       "4    Columbia University   New York, NY     5   \n",
       "\n",
       "                                         Description Tuition and fees  \\\n",
       "0  Princeton, the fourth-oldest college in the Un...         $45,320    \n",
       "1  Harvard is located in Cambridge, Massachusetts...         $47,074    \n",
       "2  The University of Chicago, situated in Chicago...         $52,491    \n",
       "3  Yale University, located in New Haven, Connect...         $49,480    \n",
       "4  Columbia University, located in Manhattan's Mo...         $55,056    \n",
       "\n",
       "  In-state Undergrad Enrollment  \n",
       "0      NaN                5,402  \n",
       "1      NaN                6,699  \n",
       "2      NaN                5,844  \n",
       "3      NaN                5,532  \n",
       "4      NaN                6,102  "
      ],
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>Name</th>\n      <th>Location</th>\n      <th>Rank</th>\n      <th>Description</th>\n      <th>Tuition and fees</th>\n      <th>In-state</th>\n      <th>Undergrad Enrollment</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>Princeton University</td>\n      <td>Princeton, NJ</td>\n      <td>1</td>\n      <td>Princeton, the fourth-oldest college in the Un...</td>\n      <td>$45,320</td>\n      <td>NaN</td>\n      <td>5,402</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>Harvard University</td>\n      <td>Cambridge, MA</td>\n      <td>2</td>\n      <td>Harvard is located in Cambridge, Massachusetts...</td>\n      <td>$47,074</td>\n      <td>NaN</td>\n      <td>6,699</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>University of Chicago</td>\n      <td>Chicago, IL</td>\n      <td>3</td>\n      <td>The University of Chicago, situated in Chicago...</td>\n      <td>$52,491</td>\n      <td>NaN</td>\n      <td>5,844</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>Yale University</td>\n      <td>New Haven, CT</td>\n      <td>3</td>\n      <td>Yale University, located in New Haven, Connect...</td>\n      <td>$49,480</td>\n      <td>NaN</td>\n      <td>5,532</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>Columbia University</td>\n      <td>New York, NY</td>\n      <td>5</td>\n      <td>Columbia University, located in Manhattan's Mo...</td>\n      <td>$55,056</td>\n      <td>NaN</td>\n      <td>6,102</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "metadata": {},
     "execution_count": 4
    }
   ],
   "source": [
    "df_rank = pd.read_csv( '_static/National Universities Rankings.csv', encoding='latin' )\n",
    "df_rank.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "(7735, 231)"
      ]
     },
     "metadata": {},
     "execution_count": 5
    }
   ],
   "source": [
    "len( df_big ), len( df_rank )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Step 1.** Figure out the closest columns we have to making a match.  The only columns we could have a hope of using to uniquely identify these schools are their names.  No other column in the ranking dataset could possibly be a unique ID that would also be in the big dataset.\n",
    "\n",
    "**Step 2.** Figure out which dataset is to be the definitive one.  Clearly, the comprehensive dataset should be the definitive one, and the rankings merged into it.  So the university names in the big dataset are what we'll use as the schools' official names.\n",
    "\n",
    "**Step 3.** Add a new column to the ranking dataset and, in it, store the correct official school name for each row.  (Remember that official names come from the big dataset.)  This is the tricky part.\n",
    "\n",
    "Let's just get a sense of how many of the 231 rows in the ranking dataset have an exact match in the big dataset, and thus their official names are already in the ranking dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "141"
      ]
     },
     "metadata": {},
     "execution_count": 8
    }
   ],
   "source": [
    "official_names = list( df_big['NAME'] )        # from big dataset\n",
    "sum( df_rank['Name'].isin( official_names ) )  # from rank dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus 90 schools do *not* have an exact match.  Those are the 90 we need to solve.  It would be tedious to match them up by hand, because there are 90.  So we will use a built-in Python text module to try to do some *approximate* string matching for us.  The Python module `difflib` has a function called `get_close_matches()` that will take a piece of text and a list of options, and give you the closest matches.  Here's an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "['Python is cruel', 'Nathan is cool']"
      ]
     },
     "metadata": {},
     "execution_count": 10
    }
   ],
   "source": [
    "from difflib import get_close_matches\n",
    "get_close_matches( 'Python is cool',\n",
    "    [ 'this is not close', 'also not close',\n",
    "      'Python is cruel', 'Nathan is cool' ] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that it doesn't always find a good guess, if there isn't one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "[]"
      ]
     },
     "metadata": {},
     "execution_count": 11
    }
   ],
   "source": [
    "get_close_matches( 'pork', [ 'salad', 'lollipops', 'soda' ] )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `get_close_matches()` to create a function that will match up university names across the two datasets if they're just off by a small amount.  This could automate some of the matching we'd otherwise have to do by hand for those 90 schools that didn't match exactly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "'Bentley University'"
      ]
     },
     "metadata": {},
     "execution_count": 12
    }
   ],
   "source": [
    "def get_closest_official_name ( name_from_df_rank ):\n",
    "\n",
    "    # If there's an exact match, we're already done.\n",
    "    if name_from_df_rank in official_names:\n",
    "        return name_from_df_rank\n",
    "    \n",
    "    # Get the closest matches, if any.\n",
    "    close_matches = get_close_matches( name_from_df_rank, official_names )\n",
    "    \n",
    "    # If there weren't any, return None\n",
    "    if len( close_matches ) == 0:\n",
    "        return None\n",
    "    \n",
    "    # Otherwise, return the first one\n",
    "    return close_matches[0]\n",
    "\n",
    "# Test it\n",
    "get_closest_official_name( 'Bentley Universal' )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's apply that function to every row in the small dataset.  Note that `get_close_matches()` can be a bit slow, so the following code actually takes about 15 seconds to complete executing.  (It would be even slower if we didn't have the first `if` statement in `get_closest_official_name()`, which skips `get_close_matches()` when it's not needed.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>Rank</th>\n",
       "      <th>Description</th>\n",
       "      <th>Tuition and fees</th>\n",
       "      <th>In-state</th>\n",
       "      <th>Undergrad Enrollment</th>\n",
       "      <th>Official Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Princeton University</td>\n",
       "      <td>Princeton, NJ</td>\n",
       "      <td>1</td>\n",
       "      <td>Princeton, the fourth-oldest college in the Un...</td>\n",
       "      <td>$45,320</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,402</td>\n",
       "      <td>Princeton University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Harvard University</td>\n",
       "      <td>Cambridge, MA</td>\n",
       "      <td>2</td>\n",
       "      <td>Harvard is located in Cambridge, Massachusetts...</td>\n",
       "      <td>$47,074</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6,699</td>\n",
       "      <td>Harvard University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>University of Chicago</td>\n",
       "      <td>Chicago, IL</td>\n",
       "      <td>3</td>\n",
       "      <td>The University of Chicago, situated in Chicago...</td>\n",
       "      <td>$52,491</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,844</td>\n",
       "      <td>University of Chicago</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Yale University</td>\n",
       "      <td>New Haven, CT</td>\n",
       "      <td>3</td>\n",
       "      <td>Yale University, located in New Haven, Connect...</td>\n",
       "      <td>$49,480</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,532</td>\n",
       "      <td>Yale University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Columbia University</td>\n",
       "      <td>New York, NY</td>\n",
       "      <td>5</td>\n",
       "      <td>Columbia University, located in Manhattan's Mo...</td>\n",
       "      <td>$55,056</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6,102</td>\n",
       "      <td>Coleman University</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Name       Location  Rank  \\\n",
       "0   Princeton University  Princeton, NJ     1   \n",
       "1     Harvard University  Cambridge, MA     2   \n",
       "2  University of Chicago    Chicago, IL     3   \n",
       "3        Yale University  New Haven, CT     3   \n",
       "4    Columbia University   New York, NY     5   \n",
       "\n",
       "                                         Description Tuition and fees  \\\n",
       "0  Princeton, the fourth-oldest college in the Un...         $45,320    \n",
       "1  Harvard is located in Cambridge, Massachusetts...         $47,074    \n",
       "2  The University of Chicago, situated in Chicago...         $52,491    \n",
       "3  Yale University, located in New Haven, Connect...         $49,480    \n",
       "4  Columbia University, located in Manhattan's Mo...         $55,056    \n",
       "\n",
       "  In-state Undergrad Enrollment          Official Name  \n",
       "0      NaN                5,402   Princeton University  \n",
       "1      NaN                6,699     Harvard University  \n",
       "2      NaN                5,844  University of Chicago  \n",
       "3      NaN                5,532        Yale University  \n",
       "4      NaN                6,102     Coleman University  "
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_rank['Official Name'] = df_rank['Name'].apply( get_closest_official_name )\n",
    "df_rank.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results are correct for the first four schools, which were exact matches, but not so good for Columbia.  The only way to check to see if this worked out well is to do a manual check, because only a human is going to be able to assess whether Columbia University and Coleman University are the same; Python did its best.\n",
    "\n",
    "We can check by taking a glance over the following output, and noting which rows are wrong.  I don't include the full output here of all 90 discrepancies, just to save space, but you can use `pd.set_option( 'display.max_rows', None )` to see them all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Official Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Columbia University</td>\n",
       "      <td>Coleman University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Washington University in St. Louis</td>\n",
       "      <td>Washington University in St Louis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>University of California--Berkeley</td>\n",
       "      <td>University of California-Berkeley</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>University of California--Los Angeles</td>\n",
       "      <td>University of California-Los Angeles</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>University of Virginia</td>\n",
       "      <td>University of Georgia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>222</th>\n",
       "      <td>New Mexico State University</td>\n",
       "      <td>New Mexico State University-Grants</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>225</th>\n",
       "      <td>University of Massachusetts--Boston</td>\n",
       "      <td>University of Massachusetts-Boston</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>226</th>\n",
       "      <td>University of Massachusetts--Dartmouth</td>\n",
       "      <td>University of Massachusetts-Dartmouth</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>227</th>\n",
       "      <td>University of Missouri--St. Louis</td>\n",
       "      <td>University of Missouri-St Louis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>228</th>\n",
       "      <td>University of North Carolina--Greensboro</td>\n",
       "      <td>University of North Carolina at Greensboro</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>90 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                         Name  \\\n",
       "4                         Columbia University   \n",
       "18         Washington University in St. Louis   \n",
       "21         University of California--Berkeley   \n",
       "24      University of California--Los Angeles   \n",
       "25                     University of Virginia   \n",
       "..                                        ...   \n",
       "222               New Mexico State University   \n",
       "225       University of Massachusetts--Boston   \n",
       "226    University of Massachusetts--Dartmouth   \n",
       "227         University of Missouri--St. Louis   \n",
       "228  University of North Carolina--Greensboro   \n",
       "\n",
       "                                  Official Name  \n",
       "4                            Coleman University  \n",
       "18            Washington University in St Louis  \n",
       "21            University of California-Berkeley  \n",
       "24         University of California-Los Angeles  \n",
       "25                        University of Georgia  \n",
       "..                                          ...  \n",
       "222          New Mexico State University-Grants  \n",
       "225          University of Massachusetts-Boston  \n",
       "226       University of Massachusetts-Dartmouth  \n",
       "227             University of Missouri-St Louis  \n",
       "228  University of North Carolina at Greensboro  \n",
       "\n",
       "[90 rows x 2 columns]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows_with_guesses = df_rank[ df_rank['Name'] != df_rank['Official Name'] ]\n",
    "rows_with_guesses[['Name','Official Name']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in many cases, it did a good job, such as in rows 18, 21, 24, and 225 through 228.  We know that rows 4 and 25 are wrong, but is row 222 wrong?  That all depends on whether Grants is the location of the main campus for New Mexico State University.  Now you see why my students and I ended up on Google!\n",
    "\n",
    "After inspecting the full list of 90 discrepancies, I found 30 that I still needed to fix by hand.  So the computer had done two-thirds of its guessing job right, saving me some time.  But how do I manually correct the 30 mistakes I found?  For instance, how do I correct row 4, which clearly isn't right?  I need to know the exact name of Columbia University in `df_big`.\n",
    "\n",
    "Let's do a search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "60                      Paul Mitchell The School-Columbia\n",
       "439                    American Career Institute√Columbia\n",
       "619                                      Columbia College\n",
       "668                             Virginia College-Columbia\n",
       "750                          Columbia Southern University\n",
       "872                              Centura College-Columbia\n",
       "1366                University of Phoenix-Columbia Campus\n",
       "1438                     ITT Technical Institute-Columbia\n",
       "1610                      Southeastern Institute-Columbia\n",
       "1907        Kenneth Shuler School of Cosmetology-Columbia\n",
       "1975                        Columbia Theological Seminary\n",
       "2059                    Regency Beauty Institute-Columbia\n",
       "2099                    Remington College-Columbia Campus\n",
       "2295                                     Columbia College\n",
       "2583                                     Columbia College\n",
       "2704                          Columbia College of Nursing\n",
       "2958                    Columbia-Greene Community College\n",
       "3346                               Lower Columbia College\n",
       "3348                               Columbia Basin College\n",
       "3404                                     Columbia College\n",
       "3622                     Columbia Gorge Community College\n",
       "3936                     Columbia State Community College\n",
       "4042              Teachers College at Columbia University\n",
       "4356        Columbiana County Career and Technical Center\n",
       "4385                 Columbia Centro Universitario-Caguas\n",
       "4509                            South University-Columbia\n",
       "4666    University of the District of Columbia David A...\n",
       "4719                    Columbia International University\n",
       "4723    Kenneth Shuler School of Cosmetology and Nails...\n",
       "4728                University of South Carolina-Columbia\n",
       "4974               Lincoln College of Technology-Columbia\n",
       "5027                                     Columbia College\n",
       "5099                          Columbia Area Career Center\n",
       "5371                             Columbia College-Chicago\n",
       "5581               University of the District of Columbia\n",
       "5664                           Columbia College Hollywood\n",
       "5775              Strayer University-District of Columbia\n",
       "6369                      University of Missouri-Columbia\n",
       "6661          Columbia University in the City of New York\n",
       "7589                  Columbia Centro Universitario-Yauco\n",
       "Name: NAME, dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Show me all names containing Columbia...\n",
    "df_big[df_big['NAME'].str.contains( 'Columbia' )]['NAME']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Holy cow!  Let's try to narrow our search a bit..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "750                          Columbia Southern University\n",
       "1366                University of Phoenix-Columbia Campus\n",
       "4042              Teachers College at Columbia University\n",
       "4509                            South University-Columbia\n",
       "4666    University of the District of Columbia David A...\n",
       "4719                    Columbia International University\n",
       "4728                University of South Carolina-Columbia\n",
       "5581               University of the District of Columbia\n",
       "5775              Strayer University-District of Columbia\n",
       "6369                      University of Missouri-Columbia\n",
       "6661          Columbia University in the City of New York\n",
       "Name: NAME, dtype: object"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Just the rows with Columbia and University...\n",
    "df_big[df_big['NAME'].str.contains( 'Columbia' )\n",
    "     & df_big['NAME'].str.contains( 'University' )]['NAME']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Aha, Columbia University in the City of New York was so long of a phrase that `get_close_matches()` did not think it was \"close\" to Columbia University.  So now I've found that the entry for row 4 in `df_rank['Official Name']` should be Columbia University in the City of New York.  I can simply tell Python to change it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rank.loc[4,'Official Name'] = 'Columbia University in the City of New York'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When I'm done manually investigating the 30 schools that had to be fixed by hand, I will have 30 lines of code that look just like the one above, but for different schools.  Here's a sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_rank.loc[4,'Official Name'] = 'Columbia University in the City of New York'\n",
    "df_rank.loc[34,'Official Name'] = 'Georgia Institute of Technology-Main Campus'\n",
    "df_rank.loc[41,'Official Name'] = 'Tulane University of Louisiana'\n",
    "df_rank.loc[52,'Official Name'] = 'Pennsylvania State University-Main Campus'\n",
    "# and so on, for a total of 30 changes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But if we're trying to follow DRY principles, we notice that there's definitely a lot of repeated code here.  We're copying and pasting the `df_rank.loc[...,'Official Name'] = '...'` part each time.  We could simplify this by creating a Python dictionary with just our corrections.  Here I include all 30 corrections as they would be if we had carefully investigated each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>Rank</th>\n",
       "      <th>Description</th>\n",
       "      <th>Tuition and fees</th>\n",
       "      <th>In-state</th>\n",
       "      <th>Undergrad Enrollment</th>\n",
       "      <th>Official Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Princeton University</td>\n",
       "      <td>Princeton, NJ</td>\n",
       "      <td>1</td>\n",
       "      <td>Princeton, the fourth-oldest college in the Un...</td>\n",
       "      <td>$45,320</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,402</td>\n",
       "      <td>Princeton University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Harvard University</td>\n",
       "      <td>Cambridge, MA</td>\n",
       "      <td>2</td>\n",
       "      <td>Harvard is located in Cambridge, Massachusetts...</td>\n",
       "      <td>$47,074</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6,699</td>\n",
       "      <td>Harvard University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>University of Chicago</td>\n",
       "      <td>Chicago, IL</td>\n",
       "      <td>3</td>\n",
       "      <td>The University of Chicago, situated in Chicago...</td>\n",
       "      <td>$52,491</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,844</td>\n",
       "      <td>University of Chicago</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Yale University</td>\n",
       "      <td>New Haven, CT</td>\n",
       "      <td>3</td>\n",
       "      <td>Yale University, located in New Haven, Connect...</td>\n",
       "      <td>$49,480</td>\n",
       "      <td>NaN</td>\n",
       "      <td>5,532</td>\n",
       "      <td>Yale University</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Columbia University</td>\n",
       "      <td>New York, NY</td>\n",
       "      <td>5</td>\n",
       "      <td>Columbia University, located in Manhattan's Mo...</td>\n",
       "      <td>$55,056</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6,102</td>\n",
       "      <td>Columbia University in the City of New York</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Name       Location  Rank  \\\n",
       "0   Princeton University  Princeton, NJ     1   \n",
       "1     Harvard University  Cambridge, MA     2   \n",
       "2  University of Chicago    Chicago, IL     3   \n",
       "3        Yale University  New Haven, CT     3   \n",
       "4    Columbia University   New York, NY     5   \n",
       "\n",
       "                                         Description Tuition and fees  \\\n",
       "0  Princeton, the fourth-oldest college in the Un...         $45,320    \n",
       "1  Harvard is located in Cambridge, Massachusetts...         $47,074    \n",
       "2  The University of Chicago, situated in Chicago...         $52,491    \n",
       "3  Yale University, located in New Haven, Connect...         $49,480    \n",
       "4  Columbia University, located in Manhattan's Mo...         $55,056    \n",
       "\n",
       "  In-state Undergrad Enrollment                                Official Name  \n",
       "0      NaN                5,402                         Princeton University  \n",
       "1      NaN                6,699                           Harvard University  \n",
       "2      NaN                5,844                        University of Chicago  \n",
       "3      NaN                5,532                              Yale University  \n",
       "4      NaN                6,102  Columbia University in the City of New York  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Store corrections in a dictionary:\n",
    "corrections = {\n",
    "    4   : 'Columbia University in the City of New York',\n",
    "    34  : 'Georgia Institute of Technology-Main Campus',\n",
    "    41  : 'Tulane University of Louisiana',\n",
    "    52  : 'Pennsylvania State University-Main Campus',\n",
    "    54  : 'University of Washington-Seattle Campus',\n",
    "    60  : 'Purdue University-Main Campus',\n",
    "    68  : 'University of Pittsburgh-Pittsburgh Campus',\n",
    "    77  : 'Virginia Polytechnic Institute and State University',\n",
    "    85  : 'SUNY at Binghamton',\n",
    "    109 : 'University of South Carolina-Columbia',\n",
    "    112 : 'University of Missouri-System Office',\n",
    "    114 : 'University of Oklahoma Norman Campus',\n",
    "    130 : 'Colorado State University-Fort Collins',\n",
    "    135 : 'Louisiana State University-System Office',\n",
    "    146 : 'Ohio University-Main Campus',\n",
    "    149 : 'SUNY at Albany',\n",
    "    153 : 'Oklahoma State University-Oklahoma City',\n",
    "    162 : 'University of South Florida-Main Campus',\n",
    "    181 : 'University of New Mexico-Main Campus',\n",
    "    186 : 'Widener University-Main Campus',\n",
    "    187 : 'Kent State University at Kent',\n",
    "    189 : 'Pace University-New York',\n",
    "    193 : 'Bowling Green State University-Main Campus',\n",
    "    222 : 'New Mexico State University-Main Campus'\n",
    "}\n",
    "\n",
    "# Apply all the corrections at once:\n",
    "for row_index, fixed_name in corrections.items():\n",
    "    df_rank.loc[row_index,'Official Name'] = fixed_name\n",
    "\n",
    "# See if at least the top 5 look right:\n",
    "df_rank.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Step 4.** And now that all corrections have been made, we can do the merge with confidence.  We take care to merge the main dataset's `\"NAME\"` column with the smaller dataset's `\"Official Name\"` column.  This merge will be a left join, because we do not want to discard a school just because it wasn't in US News's rankings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X</th>\n",
       "      <th>Y</th>\n",
       "      <th>FID</th>\n",
       "      <th>IPEDSID</th>\n",
       "      <th>NAME</th>\n",
       "      <th>ADDRESS</th>\n",
       "      <th>ADDRESS2</th>\n",
       "      <th>CITY</th>\n",
       "      <th>STATE</th>\n",
       "      <th>ZIP</th>\n",
       "      <th>...</th>\n",
       "      <th>TOT_EMPLOY</th>\n",
       "      <th>SHELTER_ID</th>\n",
       "      <th>Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>Rank</th>\n",
       "      <th>Description</th>\n",
       "      <th>Tuition and fees</th>\n",
       "      <th>In-state</th>\n",
       "      <th>Undergrad Enrollment</th>\n",
       "      <th>Official Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-92.260490</td>\n",
       "      <td>34.759308</td>\n",
       "      <td>7001</td>\n",
       "      <td>107840</td>\n",
       "      <td>Shorter College</td>\n",
       "      <td>604 Locust St</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>N Little Rock</td>\n",
       "      <td>AR</td>\n",
       "      <td>72114</td>\n",
       "      <td>...</td>\n",
       "      <td>18</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-121.289431</td>\n",
       "      <td>38.713353</td>\n",
       "      <td>7002</td>\n",
       "      <td>112181</td>\n",
       "      <td>Citrus Heights Beauty College</td>\n",
       "      <td>7518 Baird Way</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Citris Heights</td>\n",
       "      <td>CA</td>\n",
       "      <td>95610</td>\n",
       "      <td>...</td>\n",
       "      <td>9</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-118.287070</td>\n",
       "      <td>34.101481</td>\n",
       "      <td>7003</td>\n",
       "      <td>116660</td>\n",
       "      <td>Joe Blasco Makeup Artist Training Center</td>\n",
       "      <td>1670 Hillhurst Avenue</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Los Angeles</td>\n",
       "      <td>CA</td>\n",
       "      <td>90027</td>\n",
       "      <td>...</td>\n",
       "      <td>11</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-121.652662</td>\n",
       "      <td>36.700631</td>\n",
       "      <td>7004</td>\n",
       "      <td>125310</td>\n",
       "      <td>Waynes College of Beauty</td>\n",
       "      <td>1271 North Main Street</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Salinas</td>\n",
       "      <td>CA</td>\n",
       "      <td>93906</td>\n",
       "      <td>...</td>\n",
       "      <td>9</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-71.070737</td>\n",
       "      <td>42.369930</td>\n",
       "      <td>7005</td>\n",
       "      <td>164368</td>\n",
       "      <td>Hult International Business School</td>\n",
       "      <td>1 Education Street</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Cambridge</td>\n",
       "      <td>MA</td>\n",
       "      <td>02141</td>\n",
       "      <td>...</td>\n",
       "      <td>143</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 54 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            X          Y   FID  IPEDSID  \\\n",
       "0  -92.260490  34.759308  7001   107840   \n",
       "1 -121.289431  38.713353  7002   112181   \n",
       "2 -118.287070  34.101481  7003   116660   \n",
       "3 -121.652662  36.700631  7004   125310   \n",
       "4  -71.070737  42.369930  7005   164368   \n",
       "\n",
       "                                       NAME                 ADDRESS  \\\n",
       "0                           Shorter College           604 Locust St   \n",
       "1             Citrus Heights Beauty College          7518 Baird Way   \n",
       "2  Joe Blasco Makeup Artist Training Center   1670 Hillhurst Avenue   \n",
       "3                  Waynes College of Beauty  1271 North Main Street   \n",
       "4        Hult International Business School      1 Education Street   \n",
       "\n",
       "        ADDRESS2            CITY STATE    ZIP  ... TOT_EMPLOY     SHELTER_ID  \\\n",
       "0  NOT AVAILABLE   N Little Rock    AR  72114  ...         18  NOT AVAILABLE   \n",
       "1  NOT AVAILABLE  Citris Heights    CA  95610  ...          9  NOT AVAILABLE   \n",
       "2  NOT AVAILABLE     Los Angeles    CA  90027  ...         11  NOT AVAILABLE   \n",
       "3  NOT AVAILABLE         Salinas    CA  93906  ...          9  NOT AVAILABLE   \n",
       "4  NOT AVAILABLE       Cambridge    MA  02141  ...        143  NOT AVAILABLE   \n",
       "\n",
       "   Name Location  Rank Description  Tuition and fees In-state  \\\n",
       "0   NaN      NaN   NaN         NaN               NaN      NaN   \n",
       "1   NaN      NaN   NaN         NaN               NaN      NaN   \n",
       "2   NaN      NaN   NaN         NaN               NaN      NaN   \n",
       "3   NaN      NaN   NaN         NaN               NaN      NaN   \n",
       "4   NaN      NaN   NaN         NaN               NaN      NaN   \n",
       "\n",
       "   Undergrad Enrollment  Official Name  \n",
       "0                   NaN            NaN  \n",
       "1                   NaN            NaN  \n",
       "2                   NaN            NaN  \n",
       "3                   NaN            NaN  \n",
       "4                   NaN            NaN  \n",
       "\n",
       "[5 rows x 54 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged = pd.merge( df_big, df_rank, left_on='NAME', right_on='Official Name', how='left' )\n",
    "df_merged.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have one large dataset containing both the generic data and the ranking data.  Although we see all missing values for ranking columns above, this is just because the first five schools in the dataset didn't happen to be ranked by US News.  This is not surprising; there were over 7700 schools in the dataset and only 231 were ranked by US News.  But we can see that the merge did go correctly if we inspect a row that had ranking data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>X</th>\n",
       "      <th>Y</th>\n",
       "      <th>FID</th>\n",
       "      <th>IPEDSID</th>\n",
       "      <th>NAME</th>\n",
       "      <th>ADDRESS</th>\n",
       "      <th>ADDRESS2</th>\n",
       "      <th>CITY</th>\n",
       "      <th>STATE</th>\n",
       "      <th>ZIP</th>\n",
       "      <th>...</th>\n",
       "      <th>TOT_EMPLOY</th>\n",
       "      <th>SHELTER_ID</th>\n",
       "      <th>Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>Rank</th>\n",
       "      <th>Description</th>\n",
       "      <th>Tuition and fees</th>\n",
       "      <th>In-state</th>\n",
       "      <th>Undergrad Enrollment</th>\n",
       "      <th>Official Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5822</th>\n",
       "      <td>-71.118234</td>\n",
       "      <td>42.374172</td>\n",
       "      <td>87</td>\n",
       "      <td>166027</td>\n",
       "      <td>Harvard University</td>\n",
       "      <td>Massachusetts Hall</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Cambridge</td>\n",
       "      <td>MA</td>\n",
       "      <td>02138</td>\n",
       "      <td>...</td>\n",
       "      <td>17141</td>\n",
       "      <td>NOT AVAILABLE</td>\n",
       "      <td>Harvard University</td>\n",
       "      <td>Cambridge, MA</td>\n",
       "      <td>2.0</td>\n",
       "      <td>Harvard is located in Cambridge, Massachusetts...</td>\n",
       "      <td>$47,074</td>\n",
       "      <td>NaN</td>\n",
       "      <td>6,699</td>\n",
       "      <td>Harvard University</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1 rows × 54 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "              X          Y  FID  IPEDSID                NAME  \\\n",
       "5822 -71.118234  42.374172   87   166027  Harvard University   \n",
       "\n",
       "                 ADDRESS       ADDRESS2       CITY STATE    ZIP  ...  \\\n",
       "5822  Massachusetts Hall  NOT AVAILABLE  Cambridge    MA  02138  ...   \n",
       "\n",
       "     TOT_EMPLOY     SHELTER_ID                Name       Location  Rank  \\\n",
       "5822      17141  NOT AVAILABLE  Harvard University  Cambridge, MA   2.0   \n",
       "\n",
       "                                            Description  Tuition and fees  \\\n",
       "5822  Harvard is located in Cambridge, Massachusetts...          $47,074    \n",
       "\n",
       "     In-state  Undergrad Enrollment       Official Name  \n",
       "5822      NaN                 6,699  Harvard University  \n",
       "\n",
       "[1 rows x 54 columns]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged[df_merged['NAME'] == 'Harvard University']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is one of the most challenging merges you might have to do, but it's good to be prepared for the worst case scenario!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "language": "Python",
   "display_name": "Python 3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}