{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mathematics and Statistics in Python\n", "\n", "See also the slides that summarize a portion of this content.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Math in Python\n", "\n", "Having had CS230, you are surely familiar with Python's built-in math operators `+`, `-`, `*`, `/`, and `**`. You're probably also familiar with the fact that Python has a `math` module that you can use for things like trigonometry." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1.0" ] }, "metadata": {}, "execution_count": 1 } ], "source": [ "import math\n", "math.cos( 0 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I list here just a few highlights from that module that are relevant for statistical computations.\n", "\n", "`math.exp(x)` is $e^x$, so the following computes $e$." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2.718281828459045" ] }, "metadata": {}, "execution_count": 2 } ], "source": [ "math.exp( 1 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Natural logarithms are written $\\ln x$ in mathematics, but just `log` in Python." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "2.302585092994046" ] }, "metadata": {}, "execution_count": 3 } ], "source": [ "math.log( 10 ) # natural log of 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A few other functions in the `math` module are also useful for data work, but show up much less often. The distance between any two points in the plane (or any number of dimensions) can be computed with `math.dist()`." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "6.324555320336759" ] }, "metadata": {}, "execution_count": 4 } ], "source": [ "math.dist( (1,0), (-5,2) )" ] }, { "source": [ "Combinations and permutations can be computed with `math.comb()` and `math.perm()` (since Python 3.8)." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Naming mathematical variables\n", "\n", "In programming, we almost never name variables with unhelpful names like `k` and `x`, because later readers of the code (or even ourselves reading it in two months) won't know what `k` and `x` actually mean. The one exception to this is in mathematics, where it is normal to use single-letter variables, and indeed sometimes the letters matter.\n", "\n", "**Example 1:** The quadratic formula is almost always written using the letters $a$, $b$, and $c$. Yes, names like `x_squared_coefficient`, `x_coefficient`, and `constant` are more descriptive, but they would lead to much uglier code that's not what anyone expects. 
Compare:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "((2.0, 1.0), (2.0, 1.0))" ] }, "metadata": {}, "execution_count": 5 } ], "source": [ "# not super easy to read, but not bad:\n", "def quadratic_formula_1 ( a, b, c ):\n", " solution1 = ( -b + ( b**2 - 4*a*c )**0.5 ) / ( 2*a )\n", " solution2 = ( -b - ( b**2 - 4*a*c )**0.5 ) / ( 2*a )\n", " return ( solution1, solution2 )\n", "\n", "# oh my make it stop:\n", "def quadratic_formula_2 ( x_squared_coefficient, x_coefficient, constant ):\n", " solution1 = ( -x_coefficient + \\\n", " ( x_coefficient**2 - 4*x_squared_coefficient*constant )**0.5 ) \\\n", " / ( 2*x_squared_coefficient )\n", " solution2 = ( -x_coefficient - \\\n", " ( x_coefficient**2 - 4*x_squared_coefficient*constant )**0.5 ) \\\n", " / ( 2*x_squared_coefficient )\n", " return ( solution1, solution2 )\n", "\n", "# of course both work fine:\n", "quadratic_formula_1(3,-9,6), quadratic_formula_2(3,-9,6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But the first one is so much easier to read." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example 2:** Statistics always uses $\\mu$ for the mean of a population and $\\sigma$ for its standard deviation. If we wrote code where we used `mean` and `standard_deviation` for those, that wouldn't be hard to read, but it wouldn't be as clear, either.\n", "\n", "Interestingly, you can actually type Greek letters into Python code and use them as variable names! In Jupyter, just type a backslash (`\\`) followed by the name of the letter (such as `mu`) and then press the Tab key. It will replace the code `\\mu` with the actual letter $\\mu$. I've done so in the example code below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0.00876415024678427" ] }, "metadata": {}, "execution_count": 6 } ], "source": [ "def normal_pdf ( μ, σ, x ):\n", " \"\"\"The value of the probability density function for\n", " the normal distribution N(μ,σ^2), with mean μ and\n", " variance σ^2.\"\"\"\n", " shifted = ( x - μ ) / σ\n", " return math.exp( -shifted**2 / 2.0 ) \\\n", " / math.sqrt( 2*math.pi ) / σ\n", "\n", "normal_pdf( 10, 2, 15 )" ] }, { "source": [ "The same feature is not (yet?) available in VS Code, but you can copy and paste Greek letters from anywhere into your code in any editor, and they still count as valid Python variable names." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "markdown", "metadata": {}, "source": [ "## But what about NumPy?\n", "\n", "Pandas is built on NumPy, and many data science projects also use NumPy directly. Since NumPy implements tons of mathematical tools, why bother using the ones in Python's built-in `math` module? Well, on the one hand, NumPy doesn't have *everything*; for instance, the `math.comb()` and `math.perm()` functions mentioned above don't exist in NumPy. But when you *can* use NumPy, you *should,* for the following important reason.\n", "\n", "```{admonition} Big Picture - Vectorization and its benefits\n", "---\n", "class: alert alert-primary\n", "---\n", "All the functions in NumPy are *vectorized,* meaning that they will automatically apply themselves to every element of a NumPy array. For instance, you can just as easily compute `square(5)` (and get 25) as you can compute `square(x)` if `x` is a list of 1000 entries. 
NumPy notices that you provided a list of things to square, and it squares them all. What are the benefits to vectorization?\n", "\n", " 1. Using vectorization saves you *the work of writing loops.* You don't have to loop through all 1000 entries in `x` to square each one; NumPy knows what you mean.\n", " 2. Using vectorization saves the readers of your code *the work of reading and understanding loops.*\n", " 3. If you had to write a loop to apply a Python function (like `lambda x: x**2`) to a list of 1000 entries, then the loop would (obviously) run in Python. Although Python is a very convenient language to code in, it does not produce very fast-running code. Tools like NumPy are written in compiled languages like C, which are less convenient to code in, but produce faster-running results. So if you can have NumPy automatically loop over your data, rather than writing a loop in Python, *the code will execute faster.*\n", "```\n", "\n", "We will return to vectorization and loops in Chapter 11 of these notes. For now, let's just run a few NumPy functions. In each case, notice that we give the function an array as input, and it automatically knows that it should act on each entry in the array." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([0.2955275 , 0.60021856, 0.58211714, 0.85181403, 0.63111282,\n", " 0.88591914, 0.47504303, 0.97477104, 0.97928865, 0.85258941,\n", " 0.08108811, 0.33239352, 0.59003071, 0.92885247, 0.11037871,\n", " 0.71557866, 0.23640223, 0.07995767, 0.53925697, 0.01322752,\n", " 0.39700061, 0.11116519, 0.40463647, 0.75140667, 0.27429837,\n", " 0.35094469, 0.02594108, 0.55618244, 0.07358208, 0.33187975])" ] }, "metadata": {}, "execution_count": 7 } ], "source": [ "# Create an array of 30 random numbers to work with.\n", "import numpy as np\n", "values = np.random.rand( 30 )\n", "values" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([0.3 , 0.6 , 0.58, 0.85, 0.63, 0.89, 0.48, 0.97, 0.98, 0.85, 0.08,\n", " 0.33, 0.59, 0.93, 0.11, 0.72, 0.24, 0.08, 0.54, 0.01, 0.4 , 0.11,\n", " 0.4 , 0.75, 0.27, 0.35, 0.03, 0.56, 0.07, 0.33])" ] }, "metadata": {}, "execution_count": 8 } ], "source": [ "np.around( values, 2 ) # round to 2 decimal digits" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([1.34383504, 1.82251708, 1.78982374, 2.3438949 , 1.87970119,\n", " 2.42521247, 1.60808339, 2.65056028, 2.66256156, 2.34571301,\n", " 1.08446645, 1.39430143, 1.80404381, 2.53160241, 1.11670089,\n", " 2.04536992, 1.26668371, 1.08324121, 1.71473229, 1.01331539,\n", " 1.48735684, 1.11757951, 1.49875756, 2.11998004, 1.31560729,\n", " 1.42040876, 1.02628048, 1.74400195, 1.07635688, 1.39358525])" ] }, "metadata": {}, "execution_count": 9 } ], "source": [ "np.exp( values ) # compute e^x for each x in the array" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([8.73365030e-02, 3.60262317e-01, 3.38860369e-01, 7.25587149e-01,\n", " 3.98303393e-01, 7.84852714e-01, 2.25665877e-01, 9.50178588e-01,\n", " 9.59006265e-01, 7.26908705e-01, 6.57528229e-03, 1.10485453e-01,\n", " 3.48136235e-01, 8.62766902e-01, 1.21834593e-02, 5.12052820e-01,\n", " 5.58860149e-02, 6.39322821e-03, 2.90798079e-01, 1.74967341e-04,\n", " 1.57609485e-01, 1.23577002e-02, 1.63730672e-01, 5.64611991e-01,\n", " 7.52395973e-02, 1.23162175e-01, 6.72939802e-04, 3.09338909e-01,\n", " 5.41432265e-03, 1.10144165e-01])" ] }, "metadata": {}, "execution_count": 10 } ], "source": [ "np.square( values ) # square each value" ] },
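{ "cell_type": "markdown", "metadata": {}, "source": [ "To make the speed benefit concrete, here is a quick sketch reusing the `values` array above (the variable names are invented for illustration). Both lines compute exactly the same squares; the difference is where the loop runs.\n", "\n", "```python\n", "# this loop runs in Python: we had to write it, and readers must read it\n", "squared_loop = [ x**2 for x in values ]\n", "\n", "# this loop runs inside NumPy's compiled code: nothing to write or read\n", "squared_vectorized = np.square( values )\n", "```\n" ] },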
{ "cell_type": "markdown", "metadata": {}, "source": [ "Notice that this makes it very easy to compute certain mathematical formulas. For example, when we want to measure the quality of a model, we might compute the RSSE, or Root Sum of Squared Errors, that is, the square root of the sum of the squared differences between each actual data value $y_i$ and its predicted value $\hat y_i$. In math, we write it like this:\n", "\n", "$$ \text{RSSE} = \sqrt{\sum_{i=1}^n (y_i-\hat y_i)^2} $$\n", "\n", "The summation symbol lets you know that a loop will take place. But in NumPy, we can do it without writing any loops." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "3.605551275463989" ] }, "metadata": {}, "execution_count": 11 } ], "source": [ "ys = np.array( [ 1, 2, 3, 4, 5 ] ) # made up data\n", "yhats = np.array( [ 2, 1, 0, 3, 4 ] ) # also made up\n", "RSSE = np.sqrt( np.sum( np.square( ys - yhats ) ) )\n", "RSSE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the NumPy code also reads just like the English: It's the square root of the sum of the squared differences; the code literally says that in the formula itself! If we had had to write it in pure Python, we would have used either a loop or a list comprehension, as in the example below." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "3.605551275463989" ] }, "metadata": {}, "execution_count": 12 } ], "source": [ "RSSE = math.sqrt( sum( [ ( ys[i] - yhats[i] )**2 for i in range(len(ys)) ] ) ) # not as readable\n", "RSSE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A comprehensive list of NumPy's math routines appears [in the NumPy documentation](https://numpy.org/doc/stable/reference/routines.math.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binding function arguments\n", "\n", "Many functions in statistics have two types of parameters. Some of the parameters you change very rarely, and others you change all the time.\n", "\n", "**Example 1:** Consider the `normal_pdf` function whose code appears [in an earlier section](#naming-mathematical-variables) of this chapter. It has three parameters, $\mu$, $\sigma$, and $x$. You'll probably have a particular normal distribution you want to work with, so you'll choose $\mu$ and $\sigma$, and then you'll want to use the function on many different values of $x$. So the first two parameters we choose just once, and the third parameter changes all the time.\n", "\n", "**Example 2:** Consider fitting a linear model $\beta_0+\beta_1x$ to some data $x_1,x_2,\ldots,x_n$. That linear model is technically a function of three variables; we might write it as $f(\beta_0,\beta_1,x)$. But when we fit the model to the data, then $\beta_0$ and $\beta_1$ get chosen, and we don't change them after that. After that, we might plug in hundreds or even thousands of different $x$ values to $f$, using the same $\beta_0$ and $\beta_1$ values each time.\n", "\n", 
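"For instance, here is a small sketch of that pattern (the function name `f` and the coefficient values are invented for illustration):\n", "\n", "```python\n", "def f ( β0, β1, x ): # the linear model, as a function of three variables\n", " return β0 + β1*x\n", "\n", "β0, β1 = 1.5, 0.75 # fitting the model chooses these once...\n", "\n", "# ...and afterward only x changes from call to call:\n", "predictions = [ f( β0, β1, x ) for x in ( 2.1, 2.9, 3.1 ) ]\n", "```\n", "\n", 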
Programmers have a word for this; they call it *binding* the arguments of a function. Binding allows us to tell Python that we've chosen values for some parameters and won't be changing them; Python can thus give us a function with fewer parameters, to make things simpler. Python does this with a tool called `partial` in its `functools` module. Here's how we would apply it to the `normal_pdf` function." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(0.3989422804014327,\n", " 0.24197072451914337,\n", " 0.05399096651318806,\n", " 0.0044318484119380075,\n", " 0.00013383022576488537)" ] }, "metadata": {}, "execution_count": 13 } ], "source": [ "from functools import partial\n", "\n", "# Let's say I want the standard normal distribution, that is,\n", "# I want to fill in the values μ=0 and σ=1 once and for all.\n", "my_pdf = partial( normal_pdf, 0, 1 )\n", "\n", "# now I can use that on as many x inputs as I like, such as:\n", "my_pdf( 0 ), my_pdf( 1 ), my_pdf( 2 ), my_pdf( 3 ), my_pdf( 4 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, SciPy's built-in random number generators let you use them either by binding arguments or not, at your preference. For instance, to generate 10 random floating-point values between 0 and 100, we can do the following. (The name `rvs` stands for \"random variates,\" that is, random values.)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([90.91522687, 60.2291996 , 16.37553438, 7.4363216 , 17.21924831,\n", " 15.01402697, 10.51570455, 72.77763709, 72.93772356, 15.25036107])" ] }, "metadata": {}, "execution_count": 14 } ], "source": [ "import scipy.stats as stats\n", "stats.uniform.rvs( 0, 100, size=10 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we can use built-in SciPy functionality to bind the first two arguments and create a specific random variable (SciPy calls this a \"frozen\" distribution), then call `rvs` on that." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([83.00410977, 44.73415131, 37.87712319, 77.57111227, 32.56369545,\n", " 1.37120851, 52.34039717, 36.2044748 , 17.5404996 , 68.5900541 ])" ] }, "metadata": {}, "execution_count": 15 } ], "source": [ "X = stats.uniform( 0, 100 ) # make a random variable\n", "X.rvs( size=10 ) # generate 10 values from it" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same random variable can, of course, be used to create more values later.\n", "\n", "When given positional arguments, Python's `partial` binds parameters from the left, in order, so it is most convenient when you want to bind the *first* arguments of the function. If you need to bind later ones, one option is to do it yourself using a `lambda`, as in the following example." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "4" ] }, "metadata": {}, "execution_count": 16 } ], "source": [ "def subtract ( a, b ): # silly little example function\n", " return a - b\n", "\n", "subtract_1 = lambda a: subtract( a, 1 ) # bind second argument to 1\n", "\n", "subtract_1( 5 )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will also use the concept of binding function parameters when we come to curve fitting at the end of this chapter."
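] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One more note on `partial`: it can also bind a parameter by *keyword*, which is an alternative to the `lambda` trick when the parameter you want to fix is not the first one. Here is a minimal sketch using the same `subtract` function (the name `subtract_1_again` is invented for illustration):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# bind the second parameter by its keyword name, b:\n", "subtract_1_again = partial( subtract, b=1 )\n", "\n", "subtract_1_again( 5 ) # same answer as the lambda version: 4"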
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GB213 in Python\n", "\n", "All MA346 students have taken GB213 as a prerequisite, and we will not spend time in our course reviewing its content. However, you may very well want to know how to do computations from GB213 using Python, and these notes provide [an appendix that covers exactly that](GB213-review-in-Python). Refer to it whenever you need to use some GB213 content in this course.\n", "\n", "Topics covered there:\n", "\n", " * Discrete and continuous random variables\n", " * creating\n", " * plotting\n", " * generating random values\n", " * computing probabilities\n", " * computing statistics\n", " * Hypothesis testing for a population mean\n", " * one-sided\n", " * two-sided\n", " * Simple linear regression (one predictor variable)\n", " * creating the model from data\n", " * computing $R$ and $R^2$\n", " * visualizing the model\n", "\n", "That appendix does not cover the following topics.\n", "\n", " * Basic probability (covered in every GB213 section)\n", " * ANOVA (covered in some GB213 sections)\n", " * $\\chi^2$ tests (covered in some GB213 sections)\n", "\n", "```{admonition} Learning on Your Own - Pingouin\n", "---\n", "class: alert alert-danger\n", "---\n", "The GB213 review appendix that I linked to above uses the very popular Python statistics tools `statsmodels` and `scipy.stats`. But there is a relatively new toolkit called Pingouin; it's not as popular (yet?) but it has some advantages over the other two. See [this blog post](https://towardsdatascience.com/the-new-kid-on-the-statistics-in-python-block-pingouin-6b353a1db57c) for an introduction and consider a tutorial, video, presentation, or notebook for the class that answers the following questions.\n", "\n", " * For what tasks is Pingouin better than `statsmodels` or `scipy.stats`? Show example code for doing those tasks in Pingouin.\n", " * For what tasks is Pingouin less useful or not yet capable, compared to the others?\n", " * If I want to use Pingouin, how do I get started?\n", "```\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Curve fitting in general\n", "\n", "The final topic covered in the GB213 review mentioned above is simple linear regression, which fits a line to a set of (two-dimensional) data points. But Python's scientific tools permit you to handle much more complex models. We cannot cover mathematical modeling in detail in MA346, because it can take several courses on its own, but you can learn more about regression modeling in particular in [MA252 at Bentley](https://catalog.bentley.edu/search/?P=MA%20252). But we will cover how to fit an arbitrary curve to data in Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's say we have some data...\n", "\n", "We will assume you have data stored in a pandas DataFrame, and we will lift out just two columns of the DataFrame, one that will be used as our $x$ values (independent variable), and the other as our $y$ values (dependent variable). I'll make up some data here just for use in this example." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " salt used (x) ice remaining (y)\n", "0 2.1 7.9\n", "1 2.9 6.5\n", "2 3.1 6.5\n", "3 3.5 6.0\n", "4 3.7 6.2\n", "5 4.6 6.0" ], "text/html": "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>salt used (x)</th>\n      <th>ice remaining (y)</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2.1</td>\n      <td>7.9</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2.9</td>\n      <td>6.5</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>3.1</td>\n      <td>6.5</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3.5</td>\n      <td>6.0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>3.7</td>\n      <td>6.2</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>4.6</td>\n      <td>6.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>