Logistic_Regression

1. Preface

In Andrew Ng's Machine Learning course on Coursera, Logistic Regression is the second model that beginners encounter, and it is an important one. The course, however, passes over the gradient of its cost function in a single line, and many people wonder why that gradient has the same form as the gradient of the linear regression cost function. This post derives the formula from scratch.

2. Derivation of the Gradient of the Logistic Regression Cost Function

2.1. Introduction

The "Machine Learning" course on Coursera is a one of the most popular course over all of MOOC. This course is famous for its simplified teaching step and other bright spots. Logistics regression, as the most important model for beginner, lacks appropriate detailed derivation of formula. In this blog, I am going to expound this section step by step.

2.2. Logistic Regression

I assume most readers are already familiar with logistic regression, so I will go straight to the cost function.

2.3. Cost Function

For logistic regression, the squared-error cost of linear regression is replaced by the following per-example cost: \[ Cost\left( h_{\theta}\left( x \right) ,y \right) =\begin{cases} -\log \left( h_{\theta}\left( x \right) \right) & \text{if } y=1\\ -\log \left( 1-h_{\theta}\left( x \right) \right) & \text{if } y=0 \end{cases} \]

The overall cost function averages this per-example cost over the training set:

\[ J\left( \theta \right) =\frac{1}{m}\sum_{i=1}^m{Cost\left( h_{\theta}\left( x^{\left( i \right)} \right) ,y^{\left( i \right)} \right)} \]

where the piecewise cost above can be written in a single expression as

\[ Cost\left( h_{\theta}\left( x^{\left( i \right)} \right) ,y^{\left( i \right)} \right) =-y^{\left( i \right)}\log \left( h_{\theta}\left( x^{\left( i \right)} \right) \right) -\left( 1-y^{\left( i \right)} \right) \log \left( 1-h_{\theta}\left( x^{\left( i \right)} \right) \right) \]
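To make the formula concrete, here is a minimal NumPy sketch of this cost function (the helper names `sigmoid` and `cost` are my own, not from the course):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average cross-entropy cost J(theta) for logistic regression.

    X is an (m, n) design matrix, y an (m,) vector of 0/1 labels,
    theta an (n,) parameter vector.
    """
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```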

2.4. Gradient of Cost Function

Next comes the part I want to expand on. In the course, Andrew only briefly mentions that the gradient of this new cost function has the same form as the one for linear regression; no explicit derivation is given.

Writing out the cost function \(J\) and differentiating with respect to \(\theta _j\), we have

\[\begin{equation} \begin{aligned} J\left( \theta \right) &=-\frac{1}{m}\left[ \sum_{i=1}^m{y^{\left( i \right)}\log \left( h_{\theta}\left( x^{\left( i \right)} \right) \right) +\left( 1-y^{\left( i \right)} \right) \log \left( 1-h_{\theta}\left( x^{\left( i \right)} \right) \right)} \right] \\ \frac{\partial}{\partial \theta _j}J\left( \theta \right) &=\frac{\partial}{\partial \theta _j}\left[ -\frac{1}{m}\left[ \sum_{i=1}^m{y^{\left( i \right)}\log \left( h_{\theta}\left( x^{\left( i \right)} \right) \right) +\left( 1-y^{\left( i \right)} \right) \log \left( 1-h_{\theta}\left( x^{\left( i \right)} \right) \right)} \right] \right] \\ &=-\frac{1}{m}\left[ \sum_{i=1}^m{\left( y^{\left( i \right)}\frac{1}{h_{\theta}\left( x^{\left( i \right)} \right)}\cdot \frac{\partial}{\partial \theta _j}h_{\theta}\left( x^{\left( i \right)} \right) +\left( 1-y^{\left( i \right)} \right) \cdot \frac{1}{1-h_{\theta}\left( x^{\left( i \right)} \right)}\cdot \frac{\partial}{\partial \theta _j}\left( -h_{\theta}\left( x^{\left( i \right)} \right) \right) \right)} \right] \\ &=-\frac{1}{m}\left[ \sum_{i=1}^m{\left( y^{\left( i \right)}\frac{1}{h_{\theta}\left( x^{\left( i \right)} \right)}-\left( 1-y^{\left( i \right)} \right) \cdot \frac{1}{1-h_{\theta}\left( x^{\left( i \right)} \right)} \right) \cdot \frac{\partial}{\partial \theta _j}h_{\theta}\left( x^{\left( i \right)} \right)} \right] \\ &=-\frac{1}{m}\left[ \sum_{i=1}^m{\left( y^{\left( i \right)}\frac{1}{g\left( \theta ^Tx^{\left( i \right)} \right)}-\left( 1-y^{\left( i \right)} \right) \cdot \frac{1}{1-g\left( \theta ^Tx^{\left( i \right)} \right)} \right) \cdot \frac{\partial}{\partial \theta _j}g\left( \theta ^Tx^{\left( i \right)} \right)} \right] \end{aligned} \end{equation}\]

In logistic regression, the hypothesis is the logistic (sigmoid) function applied to a linear combination of the features, \(h_{\theta}\left( x \right) =g\left( \theta ^Tx \right) =\frac{1}{1+e^{-\theta ^Tx}}\), where \(T\) denotes the transpose. Therefore,

\[\begin{equation} \begin{aligned} \frac{\partial}{\partial \theta _j}g\left( \theta ^Tx \right) &=\frac{\partial}{\partial \theta _j}\frac{1}{1+e^{-\theta ^Tx}} \\ &=\frac{\partial}{\partial \theta _j}\left( 1+e^{-\theta ^Tx} \right) ^{-1} \\ &=-\left( 1+e^{-\theta ^Tx} \right) ^{-2}\cdot e^{-\theta ^Tx}\cdot \left( -x_j \right) \\ &=\frac{e^{-\theta ^Tx}\cdot x_j}{\left( 1+e^{-\theta ^Tx} \right) ^2} \end{aligned} \end{equation}\]

Setting \(k=e^{-\theta ^Tx}\) and substituting, this becomes

\[\begin{equation} \begin{aligned} &=\frac{k}{\left( 1+k \right) ^2}\cdot x_j \\ &=\left( \frac{1}{1+k}\cdot \frac{1+k-1}{1+k} \right) \cdot x_j \\ &=\left[ \frac{1}{1+k}\cdot \left( 1-\frac{1}{1+k} \right) \right] \cdot x_j \end{aligned} \end{equation}\]

Substituting \(k=e^{-\theta ^Tx}\) back, we recognize the logistic function itself:

\[\begin{equation} \begin{aligned} &=\left[ \frac{1}{1+e^{-\theta ^Tx}}\cdot \left( 1-\frac{1}{1+e^{-\theta ^Tx}} \right) \right] \cdot x_j \end{aligned} \end{equation}\]

Thus we can express this compactly in terms of the logistic function,

\[\begin{equation} \begin{aligned} &=\left[ \frac{1}{1+e^{-\theta ^Tx}}\cdot \left( 1-\frac{1}{1+e^{-\theta ^Tx}} \right) \right] \cdot x_j \\ &=g\left( \theta ^Tx \right) \cdot \left( 1-g\left( \theta ^Tx \right) \right) \cdot x_j \end{aligned} \end{equation}\]
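This uses the well-known identity \(g'\left( z \right) =g\left( z \right) \left( 1-g\left( z \right) \right)\). As a quick numerical sanity check (a small sketch of my own, not part of the course), we can compare a finite-difference derivative of the sigmoid against this closed form:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite-difference approximation of g'(z)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)

# Closed form derived above: g'(z) = g(z) * (1 - g(z))
analytic = sigmoid(z) * (1.0 - sigmoid(z))

print(np.max(np.abs(numeric - analytic)))  # tiny (~1e-11): the two agree
```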

Plugging this back into the gradient of the cost function (1),

\[\begin{equation} \begin{aligned} \frac{\partial}{\partial \theta _j}J\left( \theta \right) &=-\frac{1}{m}\left[ \sum_{i=1}^m{\left( y^{\left( i \right)}\cdot \frac{1}{g\left( \theta ^Tx^{\left( i \right)} \right)}-\left( 1-y^{\left( i \right)} \right) \cdot \frac{1}{1-g\left( \theta ^Tx^{\left( i \right)} \right)} \right)}\cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot \left( 1-g\left( \theta ^Tx^{\left( i \right)} \right) \right) \cdot x_j^{\left( i \right)} \right] \\ &=-\frac{1}{m}\left[ \sum_{i=1}^m{\left[ y^{\left( i \right)}\cdot \left( 1-g\left( \theta ^Tx^{\left( i \right)} \right) \right) \cdot x_j^{\left( i \right)}-\left( 1-y^{\left( i \right)} \right) \cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)} \right]} \right] \\ &=-\frac{1}{m}\sum_{i=1}^m{\left[ y^{\left( i \right)}\cdot x_j^{\left( i \right)}-y^{\left( i \right)}\cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)}-g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)}+y^{\left( i \right)}\cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)} \right]} \\ &\qquad \left( \text{the terms } -y^{\left( i \right)}\cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)}\text{ and } +y^{\left( i \right)}\cdot g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)}\text{ cancel} \right) \\ &=-\frac{1}{m}\sum_{i=1}^m{\left[ y^{\left( i \right)}\cdot x_j^{\left( i \right)}-g\left( \theta ^Tx^{\left( i \right)} \right) \cdot x_j^{\left( i \right)} \right]} \\ &=-\frac{1}{m}\sum_{i=1}^m{\left[ \left( y^{\left( i \right)}-g\left( \theta ^Tx^{\left( i \right)} \right) \right) \cdot x_j^{\left( i \right)} \right]} \\ &=\frac{1}{m}\sum_{i=1}^m{\left[ \left( h_{\theta}\left( x^{\left( i \right)} \right) -y^{\left( i \right)} \right) \cdot x_j^{\left( i \right)} \right]} \end{aligned} \end{equation}\]

which has exactly the same form as the gradient of the linear regression cost function, \(\frac{\partial}{\partial \theta _j}J\left( \theta \right) =\frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{\left( i \right)} \right) -y^{\left( i \right)} \right) x_j^{\left( i \right)}}\), only with the hypothesis \(h_{\theta}\) now being the logistic function rather than a linear one.

Q.E.D.
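To close, here is a small NumPy sketch (my own illustration, not code from the course) that checks the derived gradient \(\frac{1}{m}\sum_{i=1}^m{\left( h_{\theta}\left( x^{\left( i \right)} \right) -y^{\left( i \right)} \right) x_j^{\left( i \right)}}\) against a finite-difference approximation of \(J\left( \theta \right)\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Derived result: (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

# Tiny random problem
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) < 0.5).astype(float)
theta = rng.normal(size=3)

# Finite-difference gradient of J(theta), one coordinate at a time
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(numeric - gradient(theta, X, y))))  # should be ~1e-10
```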