Reviewing the chapter

In the course notes, we discussed a recommender system, which takes as input a user's preferences about past products (movies, in the chapter) and recommends new products based on those preferences.

Question 1. We stored previous users' preferences in a matrix for reference in helping make recommendations. What were the row and column headings in our preferences matrix? (Not the specific entries, of course, but what real-world groups did they represent?)

      ?   ?   ?
  ?   1   0   0
  ?   0   1   1
  ?   1   0   1

When a new user arrived, we needed their preferences as an input to our recommendation algorithm.

Question 2. In what format did we expect/store those preferences?

We used matrix multiplication (written in Python with the @ symbol) to "combine" the preferences matrix with the new user's preferences, as in prefs_matrix @ new_user_prefs.

Question 3. What type of result did this create? (The specific values are not important, but what kind of object is it, what shape, etc.?)

$$ \texttt{prefs\_matrix @ new\_user\_prefs}=\left[\begin{array}{ccc}1&0&0\\0&1&1\\1&0&1\end{array}\right]\left[\begin{array}{c}1\\0\\0\end{array}\right]=\text{?} $$
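
If you'd like to check your answer experimentally, here is a minimal NumPy sketch of the same computation (the names match the formula above):

import numpy as np

prefs_matrix = np.array( [ [ 1, 0, 0 ],
                           [ 0, 1, 1 ],
                           [ 1, 0, 1 ] ] )
new_user_prefs = np.array( [ 1, 0, 0 ] )

result = prefs_matrix @ new_user_prefs
type( result ), result.shape   # what kind of object, and what shape?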

Question 4. Why did we normalize the rows of the preferences matrix?

Question 5. What does it mean to normalize a row of a matrix?

Question 6. When we had a perfectly good user preferences matrix, with precise data from each user, why did we choose to replace it with an approximation?

About the preparatory homework

Recall that you did your work in two phases:

  1. Sample a subset of all users of the This Is My Jam archive.
  2. Get all "jams" associated with any user in your subset.

Why do you think the instructions were explicitly split into those two steps? Why couldn't we have sampled jams at random from the overall data set in a single step?

Today's exercise

In class today we'll be repeating the example from the Chapter 16 notes, but with a real dataset (the songs data you prepared for class today).

Our schedule will be like this, working in groups:

  1. Create and approximate a normalized preferences matrix from the data you prepared
  2. Take the middle-of-class break
  3. Apply our approximation to our own individual musical tastes
  4. Sample the resulting song recommendations and assess whether they're any good!

Before we begin

You will probably want to do today's work in a new Jupyter notebook or Python script, beginning by importing the CSV file you prepared for class today, jam-sample.csv.
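
For example, assuming you're using pandas and the file sits in your working directory, something like this should load it:

import pandas as pd

jams = pd.read_csv( 'jam-sample.csv' )
jams.head()   # take a quick look at the first few rows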

Step 1: Create an adjacency matrix

Convert the edge list you created for today into an adjacency matrix, as described in this section of the chapter notes.

As a quick check, you can compute your_matrix.shape and verify that

  1. the number of rows is between 1000 and 2000, however many users you chose to sample, and
  2. the number of columns is between 15000 and 30000, however many songs those users happened to like.
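
If your edge list is a pandas DataFrame, one way to build the adjacency matrix is with pd.crosstab. This is a minimal sketch; the column names 'user' and 'song' are assumptions, so adjust them to match your file:

# Each entry counts how many times a user jammed a song;
# clip to 1 so the matrix contains only zeros and ones.
prefs_matrix = pd.crosstab( jams['user'], jams['song'] ).clip( upper=1 )
prefs_matrix.shape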

Step 2: Normalize the rows

Normalize the rows of your adjacency matrix, as described in this section of the chapter notes.

You can verify that this worked by computing the norms of the rows again after you've done the work; all of them should be almost exactly 1.0. (If our computers were perfect, they'd all be exactly 1.0, but there is some slight imprecision in using any finite computing device.)
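
A minimal sketch of both the normalization and the check, assuming prefs_matrix is a pandas DataFrame:

import numpy as np

# Divide each row by its Euclidean norm
norms = np.linalg.norm( prefs_matrix, axis=1 )
prefs_matrix = prefs_matrix.div( norms, axis=0 )

# Verify: every row norm should now be almost exactly 1.0
np.linalg.norm( prefs_matrix, axis=1 )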

Step 3: Compute the SVD

We've chosen to create an approximation of our adjacency matrix using the Singular Value Decomposition (SVD) introduced in the course notes.

Create the SVD for your matrix, using the technique from this section of the chapter notes.

Once you've done so, you can check to see if your results make sense by computing the shapes of your $U$, $\Sigma$, and $V$ matrices.

  1. U.shape should be $n\times n$ for some value of $n$ in the range 1000-2000, the number of rows in your original preferences matrix.
  2. V.shape should be $m\times m$ for some value of $m$ in the range 15000-30000, the number of columns in your original preferences matrix.
  3. Σ.shape should be $(n,)$, a one-entry tuple containing the same $n$ value from U.shape.
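
If you're using NumPy, the call is likely np.linalg.svd. Note that it returns $V$ already transposed (often named Vh), though that makes no difference to the shape check, since an $m\times m$ matrix and its transpose have the same shape. A sketch:

# full_matrices=True is the default, giving the shapes described above
U, Σ, Vh = np.linalg.svd( prefs_matrix )
U.shape, Σ.shape, Vh.shape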

Step 4: Prepare the $\rho$ function

In this section of the chapter notes, we learned how to write a function $\rho$ that takes as input the number of singular values we plan to remove and tells us the error level of the resulting approximation.

Bring that function into your work and verify that you can run it and it produces sensible results. Examples:

  • ρ(0) should give 0 (no error if we remove nothing)
  • ρ(1) should give a very small number
  • ρ(1000) (or whatever your $n$ value is) should give a value close to 1.0
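
In case you need a starting point, here is one common way to define such a function: the Frobenius-norm error of the approximation, relative to the norm of the original matrix. The chapter's exact formula may differ slightly, so treat this as a sketch:

def ρ( i ):
    """Relative error from dropping the i smallest singular values."""
    if i == 0:
        return 0.0
    return np.sqrt( np.sum( Σ[-i:]**2 ) ) / np.sqrt( np.sum( Σ**2 ) )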

Step 5: Evaluate $\rho$ on many possible inputs

Create a table of $\rho$ values, showing $\rho(i)$ for various $i$ between 1 and $n$. Python's range() function can be useful here, especially since its third parameter can create ranges with large jumps.

In [2]:
list( range( 1, 1001, 100 ) )
Out[2]:
[1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
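
For example, assuming the ρ function from Step 4 and the Σ array from Step 3, a quick loop can print the table:

for i in range( 1, len( Σ ), 100 ):
    print( i, ρ( i ) )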

Choose a value of $i$ to use going forward, one that keeps the error level below 0.5.

Step 6: Approximate the preferences matrix

In this section of the chapter notes, we saw code for creating an approximate version of the preferences matrix, by dropping the $i$ lowest singular values.

Apply that technique to your preferences matrix now.

As a quick check, if you compute the shape of the resulting matrix, it should be the same as the shape of the original preferences matrix.
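
A minimal sketch, assuming U, Σ, and Vh from Step 3 and your chosen $i$ from Step 5:

# Keep only the largest k = n - i singular values and rebuild the matrix
k = len( Σ ) - i
A = U[:,:k] @ np.diag( Σ[:k] ) @ Vh[:k,:]

# Restore the pandas row and column labels
A = pd.DataFrame( A, index=prefs_matrix.index, columns=prefs_matrix.columns )
A.shape   # should equal prefs_matrix.shape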

Break

We'll take our 10-minute break at this point.

Step 7: Creating a preferences vector

We'll want to be able to create realistic preferences vectors for the songs in our dataset. Recall that a preferences vector is mostly zeros, but has a few ones to indicate which songs the user liked. A sketch of the two helper functions described below appears after this list.

  1. Create a function that takes some text as input and searches all song names in our matrix, returning the songs that contain the given text, and their corresponding column indices.
  2. Try your function out: Search for some song or artist names that you like until you have found 5 of them in our data set. Write down the column index for each.
  3. Create a function that takes a list of column indices and creates a song preferences vector from it. Recall that such a vector is a pandas Series whose index is all the song names and that contains mostly zeros, but ones in the indices you specify.
  4. Create the song preferences vector from your chosen five songs.
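
Here is a minimal sketch of the two helper functions described above, assuming the columns of prefs_matrix are the song names:

def find_songs( text ):
    """Return (column index, song name) pairs for songs containing the text."""
    return [ ( i, name ) for i, name in enumerate( prefs_matrix.columns )
             if text.lower() in str( name ).lower() ]

def make_prefs_vector( column_indices ):
    """Build a preferences vector: zeros everywhere except the given columns."""
    pv = pd.Series( 0.0, index=prefs_matrix.columns )
    pv.iloc[ list( column_indices ) ] = 1.0
    return pv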

Step 8: Finding users like you

Recall that we can compute how similar your preferences are to those of the users in our dataset with a single matrix multiplication.

  1. If your approximated preferences matrix is called A and your preferences vector is called pv, then compute A @ pv.
    • If you encounter an error, it is probably because the size or index of your preferences vector is incorrect; check the previous slide to be sure you have constructed it correctly.
    • You may also want to check to verify that the result of this computation is a pandas Series associating each user from your jam-sample.csv file with a number that represents their similarity to your preferences.
  2. Compute just the top 1% or 0.5% of those users based on the similarity scores.
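
A sketch of both steps; the 99th-percentile cutoff below corresponds to the top 1% and is easy to adjust:

similarities = A @ pv   # one similarity score per user

# Keep only the users at or above the 99th percentile
cutoff = similarities.quantile( 0.99 )
top_users = similarities[ similarities >= cutoff ]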

Step 9: Finding songs you might like

Now that we know which users share your preferences, let's see what songs they like. These will be your recommendations.

  1. Sum the preference vectors of all the users you selected in the previous step. This will give a single pandas Series associating each song with a relevance score.
  2. Sort the result and take the top 25 values. These are your song recommendations!
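
A sketch of both steps, assuming top_users from Step 8; summing rows of the (normalized) preferences matrix is one reasonable choice here:

# Add up the preference rows of the most similar users
song_scores = prefs_matrix.loc[ top_users.index ].sum()

# The 25 songs with the highest scores are your recommendations
recommendations = song_scores.sort_values( ascending=False ).head( 25 )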

Step 10: The fun part

Create a playlist (e.g., with Spotify, Apple Music, etc.) from your 25 song recommendations.

Shuffle play.

How did the recommender system do?

Be ready to tell us at the end of class!

Important features of our system

Our recommender system is not specific to songs. It used absolutely no data about the content of any song. It didn't know the year, artist, tempo, or genre of any song, or even that the items in the matrix were songs.

The way it was able to make reasonable recommendations was just based on the associations of users with songs.

  • We can see the clear business value of large datasets like the This Is My Jam data, and why companies like Google are content to give away free products in exchange for gathering your data.
  • The song recommendation system might not even have been as good if it had known about the songs' styles and attributes. If it had, it would have focused on those attributes, which any human can already do on their own. (E.g., I like R&B, so I'll go look up some R&B hits.)
  • Instead, we found patterns in the data that may or may not line up with existing genres, artists, decades, etc., and exploited those.

Real recommender systems

The system we just built is a small example, and the real recommender systems used by large online retailers are much more complex, but they are built on foundations like the one we just used.

Discussion: What are some ways to make the simple recommender system we just built more powerful?