Machine Learning Method Logistic Regression

A type of regression model (like Linear Regression) but where the dependent variable (the thing being predicted) is categorical, rather than continuously varying over a range. For example, Logistic Regression might be used to predict if a house split level or ranch based on the number of rooms, where linear regression would be used to predict something like the value or selling price. It discriminates between groups of data, but does not generalize either the best seperation between the groups, nor does it find the probable center or distribution of the groups. It works well and inexpensively when the data is spread out but has a clear and linear boundry. When the boundry is non-linear, or some (training) data may be spread away from the actual boundry, an SVM may work better. When the data is clustered or grouped, a Bayesian Classifier may work better.

Although the prediction is binary, the model uses a Percentage Chance Prediction for each possible outcome. E.g. what is the percentage chance that a house is split level given that it has 5 rooms? So 0 < h_o(x) <= 1. The probability that y=1, given x parameterized by o, can be written as: h_o(x) = P(y=1:x;o) Assuming that y can only be 0 or 1, if P(y=1:x;o) = 0.7, then we know there is a 70% chance that y is 1. There is then, obviously, a 30% chance that y is zero, or P(y=0:x;o) = 0.3. I.e. the total probabilities must add to 1.

As expected, our job is to find o such that x is transformed into a valid prediction of the probability that y is 1. We can then apply a Decision Boundary which divides the probability into a 1 or 0. For example, we might say that y is 1 when the probability of y being 1 is higher than 50%. So our hypothesis function needs to translate values of O^Tx such that g(O^Tx) will be >= 0.5 when O^Tx >= 0 (positive) and less than 0.5 when O^Tx < 0 (i.e. negative) but the result will never be more than 1 or less than -1.

Rather than use a hypothesis funciton of O^Tx as in Linear Regression, we need to use the Logistic Function ^ g(O^Tx) where g(z) is:

1+e^-z

which is also called the Sigmoid Function^. In Octave:

function g = sigmoid(z)
%SIGMOID Compute sigmoid functoon
%   J = SIGMOID(z) computes the sigmoid of z.
% (z can be a matrix, vector, or scalar).
  g = 1 ./ ( 1 .+ exp(-z) ) ;
  end

This sigmoid function simply limits our result into the maximum range we desire, with a rapid transition at O^Tx = 0.

Note: The decision boundary may not be a straight line if the hypotheses includes terms which define other terms. E.g. x²

Cost Function: The standard Mean Squared Error (MSE)^ function doesn't work well as a cost function for sigmoid because it results in graph that is not convex, but instead is "pitted" with many local minima. Instead we use:
-log(h_o(x)) if y = 1
-log(h_o(x)) if y = 0

This produces a nice curve that, when y = 1, slowly approaches 0 as h_o(x) approaches 1 yet quickly becomes infinite near zero 0 and does exactly the opposite when y = 1, i.e. it slowly approaches 0 as h_o(x) approaches 0, but quickly become infinite as it approaches 1. This gives us near infinate cost as our error is large, and very little change in cost as our error decreases toward zero, helping us to not overshoot our goal. Also, the relationship between the loss function and the parameters w still gives us a concave error function. Thus, we can rest assured that there is only a unique minimum on the error surface.

We can combine the two cases (y=0 or y=1) into one equation by multiplying one side by y and the other side by 1-y; which ever term is unwanted, will be multiplied by zero and drop out.

h_O(x⁽ⁱ⁾) = g(O^Tx⁽ⁱ⁾)

Hypothesis Function: If we transpose each x⁽ⁱ⁾ and build a matrix X = [(x⁽¹⁾)^T;(x⁽²⁾)^T;...(x^(m))^T] given O = [O₀;O₁;...O_n] then the matrix product XO is [O^Tx⁽¹⁾;O^Tx⁽²⁾;...O^Tx^(m)]. In other words, if we build a matrix X (note the capital X vs x) where each row is a set of input values (x⁽ⁱ⁾)^T), then O^Tx⁽ⁱ⁾ becomes XO which is a vector with each element being the sum of the values of O times each input. We can then take the sigmoid function of each row to form our vector of hypothesis values. In Octave:

hyp = sigmoid(X*theta);

Calculating our cost then becomes simple matrix math by taking the transpose of the y values or the (1-y) values times the hypothesis or 1 - hypothesis vector. In Octave:

errs = -y' * log(hyp) - (1-y)' * log(1-hyp);

The cost for a given set of parameters over the training set is then simply the sum of the costs divided by the number of errors. In Octave:

J = sum(errs)/m;

Slope Function: As with the MSE in Linear Regression, the Sigmoid function has a very simple derivative h_o(x⁽ⁱ⁾ - y⁽ⁱ⁾) * x_j⁽ⁱ⁾ (it's quite difficult to take the derivative of it, but once you do, the result is this very simple equation) which makes it easy to calculate the slope of the error. This makes the gradient descent quick, which is why we picked it in the first place. Just like in Linear Regression we do:

O_j := O_j - a * sum i=1 to m ( (h_o(x⁽ⁱ⁾) - y⁽ⁱ⁾) * x_j⁽ⁱ⁾ ) / m

We have already calculated our h_o(x⁽ⁱ⁾) for all i as XO which is a vector. The error is then that value less our known values of y. ß_i= h_o(x⁽ⁱ⁾) - y⁽ⁱ⁾ or in Matrix form ß = XO - y which is still just a vector. In Octave:

err = (hyp .- y)

To multiply by x_j⁽ⁱ⁾ and sum, understand that we are calculating sum(ß_i x⁽ⁱ⁾) where x⁽ⁱ⁾ is a vector (one set of inputs) and ß_i is a scaler (one hypothesis less the actual y value for that input). Multiplying each x⁽ⁱ⁾ vector times each ß_i and summing the output, is the same as Matrix multiplying the transpose of X by the vector of scalers ß.

sum(ß_i x⁽ⁱ⁾) = [x⁽¹⁾ x⁽²⁾ ... x^(m) ] * [ß₁ ; ß₂ ; ... ß_m ] = X^Tß

Gradient Decent: Keeping in mind that we must simultaneously update all O and that h_O is now a totally different equation based on the sigmoid function.

% Basic Logistic Regression via gradient decent with multiple parameters in Octave
alpha = 0.01; % try larger values 0.01, 0.03, 0.1, 0.3, etc... 
m = length(y); % number of training examples
p = size(X,2); % number of parameters (second diminsion of X)

for iter = 1:num_iters %for some number of iterations
  hyp = sigmoid(X*theta); %calculate our hypothesis using current parameters
  err = (hyp .- y); %find the error between that and the real data
  s = sum( err .* X )./m; %find the slope of the error. 
  %Note: This is the derivative of our cost function
  theta = theta - alpha .* s; 
  %adjust our parameters by a small distance along that slope.
  end

Under and Over Fitting

Not all data fits well to a straight line. This is called "underfitting" or we may say that the algorithm as a "high bias". We can try fitting a quadratic or even higher order equation. E.g. instead of O₀ + O₁x, we might use O₀ + O₁x + O₂x². But, if we choose an equation of too high an order, then we might "overfit" or have an algorithm with "high variance", which would fit any function and isn't representing the function behind this data. Overfitting can therefore result in predictions for new examples which are not accurate even though it exactly predicts the data in the trianing set. The training data may well have some noise, or outliers, which are not actually representative of the true function.

If the data is in 2 or 3 features, it can be plotted and a human can decide if it is being over or under fit. But when there are many parameters, it can be impossible to plot. And using a human is sort of against the purpose of Machine Learning. It may help to reduce the number of features if we can find features that don't really apply. Another means of reducing overfitting is regularization.

Regularization

To keep the system from over fitting, and instead provide a more generalized fit, we can add the sum values of the theta parameters to the cost and slope of the error. Here is that new term added to the right of our cost function:

Question: Shouldn't we use lower weight parameters (more regularization) for higher order terms?

Don't regularize O₀.

Lambda is used as a parameter for the amount of regularization. e.g. the amount that the parameter values are multiplied by before adding them to the cost function. To large a lambda can result in underfitting.

reg = lambda * sum(theta2.^2) / (2*m);
J = J + reg;
...
reg = lambda .* theta2 ./ m ;
S = S + reg;

Where theta2 is either:

theta2 = theta;
theta2(1) = 0;

theta2 = [0; theta(2:end)]

(the [0; and ] aren't needed for the cost calculation, only for the gradient / slope.

Find Minimum Function: fminunc( @[cost, slope] = cost(theta, X, y), theta, options )

But there are better (and more complex) means of adjusting the theta (parameter) values to minumize cost. fminuc is a common and powerful method. The fminunc function expects a reference to a function which will return cost and (optionally) slope / gradient values for a given set of parameters, training data, and training answers. It starts from an initial set of parameters and there are some options which can be set, such as the maximum number of iterations.

Here is a complete learning function in Octave using fminunc:

function [J, S] = costSlope(theta, X, y, lambda)
%return cost (J) and slope (S) given training data (X, y), current theta, and lambda
  m = length(y);
  hyp = sigmoid(X*theta); 
  %make a guess based on the sigmoid of our training data times our current paramaters. 
  costs = -y' * log(hyp) - (1-y)' * log(1-hyp); %costs with sigmoid function
  %costs = -y .* log(hyp) - (1-y) .* log(1-hyp); %more general version?
  J = sum(costs(:))/m; %mean cost. (:) required for higher dimensions
  reg = (lambda * sum(theta(2:end).^2) / (2*m)); %mean cost + regularization
  J = J + reg; %add in the regularization
  err = (hyp .- y); %actual error. 
  %Note this happens to be the derivative of our cost function.
  S = (X' * err)./m; %slope of the error 
  reg = lambda .* [0;theta(2:end)] ./ m; %Also regularize the slope
  S = S + reg; %add in the regularization
  end

options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = fminunc(@(t)(costSlope(t, X, y)), initial_theta, options);

For more about the @(t) () syntax see Octave: Anonymous Functions

Also: fmincg. This works similarly to fminunc, but is more efficient when we are dealing with large number of parameters.

It's very easy to turn this into a "one hot" logistic classifier

Also: