Logistic Regression

In statistics and machine learning, logistic regression is a type of probabilistic classification model. An example of a classification problem is labeling an email as either “spam” or “not spam”. Another example is deciding, based on a student’s scores on multiple entrance exams, whether that student is “admitted” or “not admitted” to a college. As such, logistic regression assigns a binary label y \in \{0,1\} to each instance of a data set, where y = 0 represents a negative label such as not admitting a student, and y = 1 represents a positive label such as admitting a student. Ultimately, what logistic regression tries to do is solve for a threshold that decides whether the data belongs to one label or the other. A classifier has an output function, written h_{\theta}(x), which takes the input values x and gives a numerical output normalized between 0 and 1. h_{\theta}(x) estimates the probability that y = 1 for an input x. We then compare this to the threshold \tau = 0.5 (the threshold can be set to other values; 0.5 is a common default). This gives us our logistic regression classifier:

\text{If } h_{\theta}(x) \geq 0.5 \text{ predict } y = 1

\text{If } h_{\theta}(x) < 0.5 \text{ predict } y = 0
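Here is a minimal sketch of this decision rule in Python. The hypothesis values fed to it below are made-up numbers purely for illustration; how to actually compute h_{\theta}(x) is covered next.

```python
def predict(h_theta_x, threshold=0.5):
    """Decision rule: predict y = 1 when the hypothesis output meets the threshold."""
    return 1 if h_theta_x >= threshold else 0

# Hypothetical hypothesis outputs for two students (values made up for illustration).
print(predict(0.7))   # 1 -> predicted "admitted"
print(predict(0.3))   # 0 -> predicted "not admitted"
```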

In logistic regression we use the logistic function to calculate the output because it can take an input x with any value from negative to positive infinity, whereas the output always takes values between zero and one. It is defined by the sigmoid function h_{\theta}(x) = \frac{1}{1+e^{-x}}. Below we see the logistic function plotted:

[Plot of the logistic (sigmoid) function.]

Going back to our school example of a student’s grades on two exams, if our classifier gave the output h_{\theta}(x) = 0.7, what does it mean? It means that our student has a 70% chance of being admitted, and according to our threshold of 0.5 is predicted to be admitted. More formally we can say h_{\theta}(x) = p(y=1|x;\theta), which states that the output is equal to the probability that y = 1, given x, parametrized by \theta. Note also that p(y=0|x;\theta)+p(y=1|x;\theta)=1.

We see here the variable \theta; what is that? Well, \theta is a parameter that weighs the input x to define the decision boundary. Now you may say, wait a minute, what’s a decision boundary, isn’t that the threshold? Let’s explain what they are. First off we need to change a bit what x is. We know it is the input variable that we use as a training example for classifying. But a training example x may have many features, such as the two scores from our school example. As such we say that one training example is composed of the features (x_{1}, x_{2}, x_{3}, ..., x_{i}), where i is the number of features of the training example. For the school case we have i = 2, or (x_{1}, x_{2}).

Okay, now we have our features defined; how do we pass them through the logistic function to get a classification by our decision rule? Do we pass them in one by one? No, we multiply each of them by a weighting parameter \theta and sum them all together, as in \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...+\theta_{i}x_{i}. We then have h_{\theta}(x) = f(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}), where f is the logistic function. In linear algebra this can be written as h_{\theta}(x) = f(\theta^{T}x), where \theta^{T} is the transpose of the vector of thetas, which multiplies the vector of x’s (with x_{0} = 1 for the intercept term). The output of h_{\theta}(x) is a scalar value that is checked against the threshold \tau in our decision rule above.

Okay, good so far, but what is a decision boundary? Well, it is a way to imagine our decision rule in a visual way. We define the decision boundary as the line where \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...+\theta_{i}x_{i} = 0. To better understand this, let’s go back to the school example. Below is a plot. Every point in the plot represents a student. The coordinates of the points (x, y) represent our two features, the two scores (x_{1}, x_{2}) the student got on the two entrance exams, where (x_{1}, x_{2}) = (x, y). The blue line is our decision boundary, defined by \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2} = 0. If a student got an output h_{\theta}(x) = f(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}) \geq 0.5, it means their scores were above the line and they are classified as having been admitted into college, and vice versa if their output was below the decision boundary.

[Plot of the two exam scores with the linear decision boundary.]

Does something look familiar? It should! The decision boundary, as stated before, is the equation of a line: the boundary itself is \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2} = 0, and points with \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2} \geq 0 fall on the y = 1 side. The equation of a line is defined as y = mx+b, where m is the slope and b is the intercept. I’ll leave it to you as an exercise to see how \theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2} = 0 is similar to y = mx+b.
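To make the linear combination concrete, here is a minimal sketch in Python, assuming we already have values for \theta. The weights and exam scores below are made up purely for illustration; how \theta is actually learned is the topic of the upcoming cost-function post.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = f(theta^T x), where x includes x0 = 1 for the intercept."""
    return sigmoid(theta @ x)

# Made-up parameters theta = (theta0, theta1, theta2) and one student's two exam scores.
theta = np.array([-100.0, 1.0, 1.0])   # hypothetical weights, not learned from data
scores = np.array([1.0, 45.0, 85.0])   # x = (1, x1, x2): the leading 1 is the intercept term

probability = h(theta, scores)         # estimated p(y = 1 | x; theta)
prediction = 1 if probability >= 0.5 else 0
print(probability, prediction)         # close to 1.0, so predict y = 1 -> "admitted"
```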
Now it is possible to define even more complex decision boundaries for more complicated data examples, such as:

[Plot of data with a non-linear (circular) decision boundary.]

Here we have h_{\theta}(x) = f(\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+\theta_{3}x_{1}^{2}+\theta_{4}x_{2}^{2}), whose boundary, if you remember your geometry, can also be the equation of a circle for suitable values of \theta (a quick sketch of this appears below). Did I forget something? Holy cow, I did: we forgot to show how you actually choose the values of \theta! This requires a bit more explanation of what a cost function is and how you minimize it by gradient descent to get the values of \theta. Keep a lookout for a post on Cost Functions.
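As a rough sketch of such a non-linear boundary, the same machinery works if we feed the squared features into the linear combination. The parameter values below, \theta = (-1, 0, 0, 1, 1), are just an illustrative choice (not values learned from data); they make the boundary x_{1}^{2}+x_{2}^{2} = 1, a circle of radius 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h_circle(theta, x1, x2):
    """Hypothesis with polynomial features:
    f(theta0 + theta1*x1 + theta2*x2 + theta3*x1^2 + theta4*x2^2)."""
    features = np.array([1.0, x1, x2, x1**2, x2**2])
    return sigmoid(theta @ features)

# Illustrative parameters: the decision boundary is x1^2 + x2^2 = 1.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

print(h_circle(theta, 0.2, 0.3))  # inside the circle  -> below 0.5, predict y = 0
print(h_circle(theta, 1.5, 1.5))  # outside the circle -> above 0.5, predict y = 1
```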
