Logistic
Regression
Background: Generative and
Discriminative Classifiers
Logistic Regression
Important analytic tool in natural and
social sciences
Baseline supervised machine learning
tool for classification
Is also the foundation of neural
networks
Generative and Discriminative Classifiers
Naive Bayes is a generative classifier
by contrast:
Logistic regression is a discriminative
classifier
Generative and Discriminative Classifiers
Suppose we're distinguishing cat from dog images
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else
Finding the correct class c from a document d: Generative vs Discriminative Classifiers
Naive Bayes
Logistic Regression
Formally, recall that naive Bayes assigns a class c to a document d not by directly computing P(c|d) but by computing a likelihood and a prior:

$$\hat{c} = \operatorname*{argmax}_{c \in C}\; \overbrace{P(d|c)}^{\text{likelihood}}\; \overbrace{P(c)}^{\text{prior}} \quad (5.1)$$

A generative model like naive Bayes makes use of this likelihood term, which expresses how to generate the features of a document if we knew it was of class c. By contrast, a discriminative model in this text categorization scenario attempts to directly compute the posterior P(c|d). Perhaps it will learn to assign high weight to document features that directly improve its ability to discriminate between possible classes.
Components of a probabilistic machine learning classifier

1. A feature representation of the input. For each input observation x(i), a vector of features [x1, x2, ..., xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
The two phases of logistic regression

Given m input/output pairs (x(i), y(i)):
Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
Logistic
Regression
Background: Generative and
Discriminative Classifiers
Logistic
Regression
Classification in Logistic Regression
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class ŷ ∈ C
Binary Classification in Logistic Regression
Given a series of input/output pairs:
◦ (x(i), y(i))
For each observation x(i)
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class ŷ(i) ∈ {0,1}
Features in logistic regression
• For feature xi, weight wi tells us how important xi is
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = −10
• xk = "review contains 'mediocre'": wk = −2
Logistic Regression for one observation x
Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class ŷ ∈ {0,1}
(multinomial logistic regression: ŷ ∈ {0, 1, 2, 3, 4})
How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias
If this sum is high, we say y=1; if low, then y=0
The bias term b is just another real number that's added to the weighted inputs. To make a decision on a test instance, after we've learned the weights in training, the classifier first multiplies each xi by its weight wi, sums up the weighted features, and adds the bias term b. The resulting single number z expresses the weighted sum of the evidence for the class:

$$z = \left(\sum_{i=1}^{n} w_i x_i\right) + b \quad (5.2)$$

In this book we'll represent such sums using the dot product notation. The dot product of two vectors a and b, written as a·b, is the sum of the products of the corresponding elements of each vector. Thus the following is equivalent to Eq. 5.2:

$$z = w \cdot x + b \quad (5.3)$$
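As a minimal sketch (not from the slides; the function name is ours), the score z is an ordinary dot product plus a bias:

```python
def weighted_sum(w, x, b):
    """Compute z = w . x + b: multiply each feature by its weight, sum, add bias."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

print(weighted_sum([2.0, -1.0], [3, 2], 0.5))  # 2*3 + (-1)*2 + 0.5 = 4.5
```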
But we want a probabilistic classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a
probability, just like Naive Bayes did
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
The problem: z isn't a probability, it's just a
number!
Solution: use a function of z that goes from 0 to 1
Nothing in Eq. 5.3 forces z to be a legal probability; in fact, since weights are real-valued, the output z ranges from −∞ to ∞. To create a probability, we'll pass z through the sigmoid function, σ(z). The sigmoid function (named because it looks like an s) is also called the logistic function, and gives logistic regression its name. The sigmoid has the following equation, shown graphically in Fig. 5.1:

$$y = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+\exp(-z)} \quad (5.4)$$

(For the rest of the book, we'll use the notation exp(x) to mean e^x.) The sigmoid has a number of advantages: it takes a real-valued number and maps it into the range (0,1); it is nearly linear around 0, but outlier values get squashed toward 0 or 1.
The very useful sigmoid or logistic function

$$y = \sigma(z) = \frac{1}{1+e^{-z}} \quad (5.4)$$
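A one-line Python version of the sigmoid, as an illustrative sketch:

```python
import math

def sigmoid(z):
    """Map a real-valued score z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5: no evidence either way
print(sigmoid(5))   # ~0.993: strong positive evidence
print(sigmoid(-5))  # ~0.007: strong negative evidence
```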
Idea of logistic regression
We’ll compute w·x+b
And then we’ll pass it through the
sigmoid function:
σ(w·x+b)
And we'll just treat it as a probability
Making probabilities with sigmoids
To make the two cases, p(y = 1) and p(y = 0), sum to 1, we can do this as follows:

$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1+\exp(-(w \cdot x + b))}$$

$$P(y=0) = 1 - \sigma(w \cdot x + b) = 1 - \frac{1}{1+\exp(-(w \cdot x + b))} = \frac{\exp(-(w \cdot x + b))}{1+\exp(-(w \cdot x + b))} \quad (5.5)$$
By the way:
The sigmoid function has the property

$$1 - \sigma(x) = \sigma(-x) \quad (5.6)$$

so we could also have expressed P(y = 0) as σ(−(w·x+b)).
Turning a probability into a classifier
Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than .5, and no otherwise:

$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$

0.5 here is called the decision boundary.
The probabilistic classifier
[Plot: P(y=1) = σ(w·x+b) as a function of w·x+b, which ranges from −∞ to ∞]

$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1+e^{-(w \cdot x + b)}} \qquad P(y=0) = 1 - \sigma(w \cdot x + b) = \frac{e^{-(w \cdot x + b)}}{1+e^{-(w \cdot x + b)}}$$
Turning a probability into a classifier
For a test instance x, we say yes if the probability P(y = 1|x) is more than .5, and no otherwise. Since σ(z) > 0.5 exactly when z > 0, we can equivalently test the score w·x+b directly:

$$\hat{y} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{if } w \cdot x + b \le 0 \end{cases}$$
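A hedged sketch of the resulting classifier; `classify` is a name we introduce here, not from the slides:

```python
def classify(w, x, b):
    """Return 1 if P(y=1|x) > 0.5, else 0.
    sigmoid(z) > 0.5 exactly when z > 0, so we can test w . x + b directly."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if z > 0 else 0
```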
Logistic
Regression
Classification in Logistic Regression
Logistic
Regression
Logistic Regression: a text example on sentiment classification
Sentiment example: does y=1 or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
Extracted features: x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19

Figure 5.2 A sample mini test document showing the extracted features in the vector x.
5.1.1 Example: sentiment classification

Let's have an example. Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document doc. We'll represent each input observation by the 6 features x1...x6 of the input shown in the following table; Fig. 5.2 shows the features in a sample mini test document.

Var | Definition | Value in Fig. 5.2
x1 | count(positive lexicon words ∈ doc) | 3
x2 | count(negative lexicon words ∈ doc) | 2
x3 | 1 if "no" ∈ doc, 0 otherwise | 1
x4 | count(1st and 2nd person pronouns ∈ doc) | 3
x5 | 1 if "!" ∈ doc, 0 otherwise | 0
x6 | log(word count of doc) | ln(66) = 4.19
Classifying sentiment for input x
Let's assume for the moment that we've already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, −2.0, 0.7], while b = 0.1. (We'll discuss in the next section how the weights are learned.) The weight w1, for example, indicates how important a feature the number of positive lexicon words is to a positive sentiment decision.

Suppose w = [2.5, −5.0, −1.2, 0.5, −2.0, 0.7] and b = 0.1.
Classifying sentiment for input x
Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

$$p(+|x) = P(Y=1|x) = \sigma(w \cdot x + b) = \sigma([2.5, -5.0, -1.2, 0.5, -2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) = \sigma(0.833) = 0.70 \quad (5.6)$$

$$p(-|x) = P(Y=0|x) = 1 - \sigma(w \cdot x + b) = 0.30$$
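The same computation as a runnable sketch; the weights and feature values are the slides' example numbers, while the helper names are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [3, 2, 1, 3, 0, 4.19]              # features x1..x6 from Fig. 5.2
w = [2.5, -5.0, -1.2, 0.5, -2.0, 0.7]  # the example's learned weights
b = 0.1

z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print(round(z, 3))               # 0.833
print(round(sigmoid(z), 2))      # 0.70 = p(+|x)
print(round(1 - sigmoid(z), 2))  # 0.30 = p(-|x)
```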
Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input can be a feature.
We can build features for logistic regression for
any classification task: period disambiguation
Consider the task of period disambiguation: deciding if a period is the end of a sentence or part of a word, by classifying each period into one of two classes, EOS (end-of-sentence) and not-EOS. We might use features like x1 below expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary ("Prof.") and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties. For example a period following an upper-cased word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.

$$x_1 = \begin{cases} 1 & \text{if } \mathrm{Case}(w_i) = \text{Lower} \\ 0 & \text{otherwise} \end{cases}$$

$$x_2 = \begin{cases} 1 & \text{if } w_i \in \text{AcronymDict} \\ 0 & \text{otherwise} \end{cases}$$

$$x_3 = \begin{cases} 1 & \text{if } w_i = \text{``St.''} \text{ and } \mathrm{Case}(w_{i-1}) = \text{Cap} \\ 0 & \text{otherwise} \end{cases}$$
This ends in a period. → End of sentence
The house at 465 Main St. is new. → Not end of sentence
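A sketch of these three indicator features in Python; `ABBREVIATIONS` and the function name are hypothetical stand-ins for a real abbreviation dictionary and feature extractor:

```python
ABBREVIATIONS = {"Prof.", "Dr.", "Mr.", "St."}  # hypothetical AcronymDict

def period_features(word, prev_word):
    """Indicator features for the period attached to `word` (names illustrative)."""
    x1 = 1 if word.islower() else 0                             # current word is lower case
    x2 = 1 if word in ABBREVIATIONS else 0                      # current word is a known abbreviation
    x3 = 1 if word == "St." and prev_word[:1].isupper() else 0  # "St." after a capitalized word
    return [x1, x2, x3]

print(period_features("St.", "Main"))  # [0, 1, 1] -> likely not end-of-sentence
```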
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1 = count("awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ wi for each feature fi
The sigmoid maps its input into the range (0,1), which is just what we want for a probability. Because it is nearly linear around 0 but has a sharp slope toward the ends, it tends to squash outlier values toward 0 or 1. And it's differentiable, which as we'll see in Section 5.8 will be handy for learning. If we apply the sigmoid to the sum of the weighted features, we get a number between 0 and 1. To make it a probability, we just need to make sure the two cases, p(y = 1) and p(y = 0), sum to 1:

$$P(y=1) = \sigma(w \cdot x + b) = \frac{1}{1+e^{-(w \cdot x + b)}}$$
Logistic
Regression
Learning: Cross-Entropy Loss
Wait, where did the W’s come from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate, ŷ
We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss.
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
The distance between ŷ and y

We want to know how far the classifier output
ŷ = σ(w·x+b)
is from the true output:
y [= either 0 or 1]

We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y
Intuition of negative log likelihood loss
= cross-entropy loss
A case of conditional maximum likelihood
estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
Deriving cross-entropy loss for a single observation x
Goal: maximize the probability of the correct label, p(y|x)

Since there are only 2 discrete outcomes (0 or 1), this is a Bernoulli distribution, and we can express the probability p(y|x) that our classifier produces for one observation as

$$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y} \quad (5.9)$$

noting that if y = 1, this simplifies to ŷ; if y = 0, it simplifies to 1 − ŷ.
Deriving cross-entropy loss for a single observation x
Now take the log of both sides (mathematically handy). Whatever values maximize log p(y|x) will also maximize p(y|x).

Maximize: $$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}$$

Maximize: $$\log p(y|x) = \log\left[\hat{y}^{\,y}(1-\hat{y})^{1-y}\right] = y \log \hat{y} + (1-y)\log(1-\hat{y}) \quad (5.10)$$
Deriving cross-entropy loss for a single observation x
Now flip the sign to turn this into a loss: something to minimize. The result is the cross-entropy loss LCE (so called because this is the formula for the cross-entropy between y and ŷ):

Minimize: $$L_{CE}(\hat{y},y) = -\log p(y|x) = -[y \log \hat{y} + (1-y)\log(1-\hat{y})] \quad (5.11)$$

Finally, we can plug in the definition of ŷ = σ(w·x+b):

$$L_{CE}(\hat{y},y) = -[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))] \quad (5.12)$$
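A short sketch of this loss in Python (function name ours), checked against the sentiment example that follows:

```python
import math

def cross_entropy_loss(y_hat, y):
    """L_CE(y_hat, y) = -[y log y_hat + (1-y) log(1-y_hat)] for one observation."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(cross_entropy_loss(0.70, 1), 2))  # 0.36: fairly confident and right
print(round(cross_entropy_loss(0.70, 0), 2))  # 1.2: confident and wrong -> bigger loss
```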
Let's see if this works for our sentiment example
We want the loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused
Let's first suppose the true label of this review is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another nice
touch is the music . I was overcome with the urge to get off the couch and
start dancing . It sucked me in , and it'll do the same to you .
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing?
Pretty well! What's the loss?
From Eq. 5.6 we computed p(+|x) = P(Y=1|x) = σ(w·x+b) = σ(0.833) = 0.70. With y = 1, the second term of Eq. 5.12 drops out, leading to the following loss (we'll use log to mean natural log when the base is not specified):

$$L_{CE}(\hat{y},y) = -[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))] = -\log \sigma(w \cdot x + b) = -\log(0.70) = 0.36$$
Let's see if this works for our sentiment example
Suppose true value instead was y=0.
What's the loss?
By contrast, let's pretend instead that the example in Fig. 5.2 was actually negative, i.e., y = 0 (perhaps the reviewer went on to say "But bottom line, the movie is terrible! I beg you not to see it!"). In this case our model is confused and we'd want the loss to be higher. Now if we plug y = 0 and 1 − σ(w·x+b) = 0.30 into Eq. 5.12, the first term drops out:

$$L_{CE}(\hat{y},y) = -[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))] = -\log(1-\sigma(w \cdot x + b)) = -\log(0.30) = 1.2$$
Let's see if this works for our sentiment example
The loss when model was right (if true y=1)
Is lower than the loss when model was wrong (if true y=0):
Sure enough, loss was bigger when model was wrong!
If true y = 1: LCE = −log σ(w·x+b) = −log(0.70) = 0.36
If true y = 0: LCE = −log(1 − σ(w·x+b)) = −log(0.30) = 1.2

Sure enough, the loss for the first case (0.36) is less than the loss for the second (1.2).
Logistic
Regression
Cross-Entropy Loss
Logistic
Regression
Stochastic Gradient Descent
Our goal: minimize the loss
Let's make explicit that the loss function is parameterized by weights θ = (w, b).
• And we'll represent ŷ as f(x; θ) to make the dependence on θ more obvious.
We want the weights that minimize the loss, averaged over all examples:

$$\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{m}\sum_{i=1}^{m} L_{CE}\!\left(f(x^{(i)};\theta),\, y^{(i)}\right)$$
Intuition of gradient descent
How do I get to the bottom of this river canyon?
Look around me 360°
Find the direction of steepest slope down
Go that way
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
Let's first visualize for a single scalar w
[Plot: the loss as a function of a single scalar weight w, with the current point w1 and the goal wmin; the slope of the loss at w1 is negative]

Should we move right or left from here?
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function. The slope at w1 is negative, so we'll move positive: one step of gradient descent takes w toward wmin.
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
Gradient Descent: Find the gradient of the loss
function at the current point and move in the
opposite direction.
How much do we move in that direction?
• The value of the gradient (the slope, in our single-variable example) d/dw L(f(x;w), y), weighted by a learning rate η
• Higher learning rate means move w faster

$$w^{t+1} = w^{t} - \eta \frac{d}{dw} L(f(x;w),\, y)$$

Let's now extend the intuition from a function of one scalar variable w to many variables.
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
Imagine 2 dimensions, w and b
Visualizing the gradient vector at the red point. It has two dimensions, shown in the x-y plane.

[Plot: the surface Cost(w,b) over the w and b axes, with a 2-dimensional gradient vector taken at the red point]
Real gradients
Are much longer; lots and lots of weights
For each dimension wi the gradient component i
tells us the slope with respect to that variable.
◦ “How much would a small change in wi influence the
total loss function L?”
◦ We express the slope as a partial derivative of the loss, ∂L/∂wi
The gradient is then defined as a vector of these
partials.
The gradient
For each dimension wi, we express the slope as a partial derivative ∂/∂wi of the loss function. The gradient is then defined as a vector of these partials. We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:

$$\nabla_{\theta} L(f(x;\theta), y) = \begin{bmatrix} \frac{\partial}{\partial w_1} L(f(x;\theta), y) \\ \frac{\partial}{\partial w_2} L(f(x;\theta), y) \\ \vdots \\ \frac{\partial}{\partial w_n} L(f(x;\theta), y) \end{bmatrix} \quad (5.15)$$

The final equation for updating θ based on the gradient is thus

$$\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x;\theta),\, y) \quad (5.16)$$
What are these partial derivatives for logistic regression?
The loss function
In order to update θ, we need a definition for the gradient ∇L(f(x;θ), y). Recall that for logistic regression, the cross-entropy loss function is:

$$L_{CE}(\hat{y},y) = -[y \log \sigma(w \cdot x + b) + (1-y)\log(1-\sigma(w \cdot x + b))]$$

It turns out that the derivative of this function for one observation vector x is Eq. 5.18 (the interested reader can see Section 5.8 for the derivation of this equation):

$$\frac{\partial L_{CE}(\hat{y},y)}{\partial w_j} = [\sigma(w \cdot x + b) - y]\, x_j \quad (5.18)$$

Note in Eq. 5.18 that the gradient with respect to a single weight wj represents a very intuitive value: the difference between the true y and our estimated ŷ = σ(w·x+b) for that observation, multiplied by the corresponding input value xj.
The elegant derivative of this function (see textbook 5.8 for derivation)
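A direct transcription of Eq. 5.18 as a sketch (names ours); it returns the gradient with respect to w and b for one observation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, b, x, y):
    """Gradient of L_CE for one observation:
    dL/dw_j = (sigmoid(w.x+b) - y) * x_j, and dL/db = sigmoid(w.x+b) - y."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    err = sigmoid(z) - y                  # the (y_hat - y) error term
    return [err * x_i for x_i in x], err  # (gradient wrt w, gradient wrt b)
```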
function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
# where: L is the loss function
# f is a function parameterized by θ
# x is the set of training inputs x(1), x(2), ..., x(m)
# y is the set of training outputs (labels) y(1), y(2), ..., y(m)
θ ← 0
repeat til done # see caption
  For each training tuple (x(i), y(i)) (in random order)
    1. Optional (for reporting): # How are we doing on this tuple?
       Compute ŷ(i) = f(x(i); θ) # What is our estimated output ŷ?
       Compute the loss L(ŷ(i), y(i)) # How far off is ŷ(i) from the true output y(i)?
    2. g ← ∇θ L(f(x(i); θ), y(i)) # How should we move θ to maximize loss?
    3. θ ← θ − η g # Go the other way instead
return θ

Figure 5.5 The stochastic gradient descent algorithm. Step 1 (computing the loss) is used only to report how well we are doing on the current tuple; it isn't needed for the update in steps 2-3.
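A runnable Python sketch of the same loop, assuming the `gradient()` function from the previous sketch is in scope; a fixed epoch count stands in for "repeat til done":

```python
import random

def sgd(xs, ys, eta=0.1, epochs=100):
    """Stochastic gradient descent for binary logistic regression (a sketch)."""
    w, b = [0.0] * len(xs[0]), 0.0  # theta <- 0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)        # visit training tuples in random order
        for x, y in data:
            grad_w, grad_b = gradient(w, b, x, y)                  # g <- gradient of loss
            w = [w_i - eta * g_i for w_i, g_i in zip(w, grad_w)]   # theta <- theta - eta*g
            b -= eta * grad_b
    return w, b
```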
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
Logistic
Regression
Stochastic Gradient Descent
Logistic
Regression
Stochastic Gradient Descent:
An example and more details
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
Example of gradient descent
Update step for θ is:

$$\theta^{t+1} = \theta^{t} - \eta \nabla_{\theta} L\!\left(f(x^{(i)};\theta),\, y^{(i)}\right)$$

where

$$\frac{\partial L_{CE}(\hat{y},y)}{\partial w_j} = [\sigma(w \cdot x + b) - y]\, x_j$$

In our mini example there are three parameters, so the gradient vector has 3 dimensions, for w1, w2, and b. With w1 = w2 = b = 0, x1 = 3, x2 = 2, and η = 0.1, we can compute the first gradient as follows:

$$\nabla_{w,b} = \begin{bmatrix} \frac{\partial L_{CE}(\hat{y},y)}{\partial w_1} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial w_2} \\ \frac{\partial L_{CE}(\hat{y},y)}{\partial b} \end{bmatrix} = \begin{bmatrix} (\sigma(w \cdot x + b) - y)\,x_1 \\ (\sigma(w \cdot x + b) - y)\,x_2 \\ \sigma(w \cdot x + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)\,x_1 \\ (\sigma(0) - 1)\,x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5\,x_1 \\ -0.5\,x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient (η = 0.1):

$$\theta^{1} = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} - \eta \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.1 \\ 0.05 \end{bmatrix}$$

So after one step of gradient descent, the weights have shifted to be: w1 = .15, w2 = .1, and b = .05.

Note that this observation x happened to be a positive example. We would expect that after seeing more negative examples with high counts of negative words, the weight w2 would shift to have a negative value.
Note that enough negative examples would eventually make w2 negative
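A quick numeric check of this single step as a self-contained sketch (variable names ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One update step with the slides' numbers: y = 1, x = [3, 2], w1 = w2 = b = 0, eta = 0.1
w, b, eta, x, y = [0.0, 0.0], 0.0, 0.1, [3, 2], 1
err = sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b) - y  # sigmoid(0) - 1 = -0.5
grad_w, grad_b = [err * x_i for x_i in x], err                   # [-1.5, -1.0], -0.5
w = [w_i - eta * g_i for w_i, g_i in zip(w, grad_w)]             # move against the gradient
b -= eta * grad_b
print([round(w_i, 2) for w_i in w], round(b, 2))                 # [0.15, 0.1] 0.05
```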
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: compute the gradient over the entire dataset
Mini-batch training: compute the gradient over m examples (e.g., 512 or 1024)
Logistic
Regression
Regularization
Overfitting
A model that perfectly matches the training data has a problem: it will overfit to the data, modeling noise.
◦ A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight.
◦ The model will then fail to generalize to a test set without this word.
A good model should be able to generalize
Overfitting
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
Regularization
A solution for overfitting
Add a regularization term R(θ) to the loss function
(for now written as maximizing logprob rather than minimizing loss)
Idea: choose an R(θ) that penalizes large weights
◦ fitting the data well with lots of big weights not as good
as fitting the data a little less well, with small weights
A model that overfits the training data will have poor generalization to the unseen test set. To avoid overfitting, a new regularization term R(θ) is added to the objective function, resulting in the following objective for a batch of m examples (slightly rewritten from Eq. 5.13 to be maximizing log probability rather than minimizing loss, and removing the 1/m term, which doesn't affect the argmax):

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)}) - \alpha R(\theta) \quad (5.22)$$

The new regularization term R(θ) is used to penalize large weights. Thus a setting of the weights that matches the training data perfectly, but uses many weights with high values to do so, will be penalized more than a setting that matches the data a little less well but does so using smaller weights. There are two common ways to compute this regularization term R(θ).
L2 Regularization (= ridge regression)
The sum of the squares of the weights
The name is because this is the (square of the)
L2 norm ||θ||2, = Euclidean distance of θ to the origin.
L2 regularized objective function:
If θ consists of n weights, then:

$$R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2 \quad (5.23)$$

The L2 regularized objective function becomes:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)})\right] - \alpha \sum_{j=1}^{n} \theta_j^2 \quad (5.24)$$
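As an illustrative sketch (names and values ours), the L2 penalty just subtracts α times the sum of squared weights from the log likelihood:

```python
def l2_penalized_objective(log_likelihood, theta, alpha):
    """Objective to maximize: sum of log P(y|x) minus alpha * sum of theta_j^2."""
    return log_likelihood - alpha * sum(t * t for t in theta)

# Big weights are penalized more, even at the same log likelihood:
print(l2_penalized_objective(-10.0, [2.5, -5.0, -1.2], alpha=0.1))  # -13.269
print(l2_penalized_objective(-10.0, [0.5, -1.0, -0.2], alpha=0.1))  # -10.129
```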
L1 Regularization (= lasso regression)
The sum of the (absolute value of the) weights
Named after the L1 norm ||θ||1, the sum of the absolute values of the weights (the Manhattan distance)
L1 regularized objective function:
L1 regularization is a linear function of the weight values, named after the L1 norm ||θ||1, the sum of the absolute values of the weights, or Manhattan distance (the Manhattan distance is the distance you'd have to walk between two points in a city with a street grid like New York):

$$R(\theta) = \|\theta\|_1 = \sum_{i=1}^{n} |\theta_i| \quad (5.25)$$

The L1 regularized objective function becomes:

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)})\right] - \alpha \sum_{j=1}^{n} |\theta_j| \quad (5.26)$$
Logistic
Regression
Multinomial Logistic
Regression
Multinomial Logistic Regression
Often we need more than 2 classes
◦ Positive/negative/neutral
◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.)
◦ Classify emergency SMSs into different actionable classes
If >2 classes we use multinomial logistic regression
= Softmax regression
= Multinomial logit
= (defunct name: Maximum Entropy modeling, or MaxEnt)
So "logistic regression" will just mean binary (2 output classes)
Recall that the cross-entropy loss for binary logistic regression is:

$$L_{CE}(\hat{y},y) = -\log p(y|x) = -[y \log \hat{y} + (1-y)\log(1-\hat{y})] = -\sum_{k=1}^{2} y_k \log \hat{y}_k = -\log \hat{y}_c \quad \text{(where } c \text{ is the true class)}$$

The loss function for multinomial logistic regression generalizes this to K classes, treating y as a one-hot vector with y_c = 1 for the true class c:

$$L_{CE}(\hat{y},y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c = -\log P(y=c|x)$$
Multinomial Logistic Regression
The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1
Need a generalization of the sigmoid called the softmax
◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
◦ Outputs a probability distribution
◦ each value in the range [0,1]
◦ all the values summing to 1
The softmax function
Turns a vector z = [z1, z2, ... , zk] of k arbitrary values into probabilities
Like the sigmoid, it is an exponential function. For a vector z of dimensionality k, the softmax is defined as:

$$\mathrm{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k}\exp(z_j)} \quad 1 \le i \le k \quad (5.30)$$

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

$$\mathrm{softmax}(z) = \left[\frac{\exp(z_1)}{\sum_{i=1}^{k}\exp(z_i)},\; \frac{\exp(z_2)}{\sum_{i=1}^{k}\exp(z_i)},\; \ldots,\; \frac{\exp(z_k)}{\sum_{i=1}^{k}\exp(z_i)}\right] \quad (5.31)$$

The denominator Σᵢ exp(zᵢ) is used to normalize all the values into probabilities. Thus for example, given a vector:

z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]

the result softmax(z) is

[0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
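A sketch of softmax in Python (names ours), reproducing the example values above:

```python
import math

def softmax(z):
    """Normalize a vector of arbitrary scores into a probability distribution."""
    exps = [math.exp(z_i) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print([round(p, 3) for p in probs])  # [0.055, 0.09, 0.007, 0.1, 0.738, 0.01]
print(sum(probs))                    # 1.0 (up to floating point)
```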
Softmax in multinomial logistic regression
Like the sigmoid, the input to the softmax will be the dot product between a weight vector w and an input vector x (plus a bias). But now we'll need separate weight vectors (and biases) for each of the K classes:

$$p(y=c|x) = \frac{\exp(w_c \cdot x + b_c)}{\sum_{j=1}^{k}\exp(w_j \cdot x + b_j)} \quad (5.33)$$

Like the sigmoid, the softmax has the property of squashing values toward 0 or 1. If one of the inputs is larger than the others, it will tend to push its probability toward 1 and suppress the probabilities of the smaller inputs.
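A sketch of the resulting K-class classifier (names ours, assuming the `softmax()` sketch above is in scope):

```python
def multinomial_predict(W, bs, x):
    """p(y = c | x) for each class c: softmax over per-class scores w_c . x + b_c.
    W is a list of K weight vectors, bs a list of K biases (illustrative names)."""
    z = [sum(w_i * x_i for w_i, x_i in zip(w_c, x)) + b_c
         for w_c, b_c in zip(W, bs)]
    return softmax(z)
```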
For gradient descent we don't need the loss itself, we need its gradient. The gradient for a single example turns out to be very similar to the gradient for binary logistic regression: (ŷ − y)x.
Logistic
Regression
Multinomial Logistic
Regression
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
ijflsjournal087
 
Control Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptxControl Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptx
vvsasane
 
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
roshinijoga
 
Slide share PPT of NOx control technologies.pptx
Slide share PPT of  NOx control technologies.pptxSlide share PPT of  NOx control technologies.pptx
Slide share PPT of NOx control technologies.pptx
vvsasane
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdfML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
rameshwarchintamani
 
PRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Academy - Functional Modeling In Action with PRIZ.pdfPRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Guru
 
SICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introductionSICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introduction
fabienklr
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
Computer Security Fundamentals Chapter 1
Computer Security Fundamentals Chapter 1Computer Security Fundamentals Chapter 1
Computer Security Fundamentals Chapter 1
remoteaimms
 
Nanometer Metal-Organic-Framework Literature Comparison
Nanometer Metal-Organic-Framework  Literature ComparisonNanometer Metal-Organic-Framework  Literature Comparison
Nanometer Metal-Organic-Framework Literature Comparison
Chris Harding
 
Autodesk Fusion 2025 Tutorial: User Interface
Autodesk Fusion 2025 Tutorial: User InterfaceAutodesk Fusion 2025 Tutorial: User Interface
Autodesk Fusion 2025 Tutorial: User Interface
Atif Razi
 
Interfacing PMW3901 Optical Flow Sensor with ESP32
Interfacing PMW3901 Optical Flow Sensor with ESP32Interfacing PMW3901 Optical Flow Sensor with ESP32
Interfacing PMW3901 Optical Flow Sensor with ESP32
CircuitDigest
 
seninarppt.pptx1bhjiikjhggghjykoirgjuyhhhjj
seninarppt.pptx1bhjiikjhggghjykoirgjuyhhhjjseninarppt.pptx1bhjiikjhggghjykoirgjuyhhhjj
seninarppt.pptx1bhjiikjhggghjykoirgjuyhhhjj
AjijahamadKhaji
 
最新版加拿大魁北克大学蒙特利尔分校毕业证(UQAM毕业证书)原版定制
最新版加拿大魁北克大学蒙特利尔分校毕业证(UQAM毕业证书)原版定制最新版加拿大魁北克大学蒙特利尔分校毕业证(UQAM毕业证书)原版定制
最新版加拿大魁北克大学蒙特利尔分校毕业证(UQAM毕业证书)原版定制
Taqyea
 
Building-Services-Introduction-Notes.pdf
Building-Services-Introduction-Notes.pdfBuilding-Services-Introduction-Notes.pdf
Building-Services-Introduction-Notes.pdf
Lawrence Omai
 
Evonik Overview Visiomer Specialty Methacrylates.pdf
Evonik Overview Visiomer Specialty Methacrylates.pdfEvonik Overview Visiomer Specialty Methacrylates.pdf
Evonik Overview Visiomer Specialty Methacrylates.pdf
szhang13
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
6th International Conference on Big Data, Machine Learning and IoT (BMLI 2025)
ijflsjournal087
 
Control Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptxControl Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptx
vvsasane
 
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
Parameter-Efficient Fine-Tuning (PEFT) techniques across language, vision, ge...
roshinijoga
 
Slide share PPT of NOx control technologies.pptx
Slide share PPT of  NOx control technologies.pptxSlide share PPT of  NOx control technologies.pptx
Slide share PPT of NOx control technologies.pptx
vvsasane
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdfML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
rameshwarchintamani
 
PRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Academy - Functional Modeling In Action with PRIZ.pdfPRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Academy - Functional Modeling In Action with PRIZ.pdf
PRIZ Guru
 
SICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introductionSICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introduction
fabienklr
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
Computer Security Fundamentals Chapter 1
Computer Security Fundamentals Chapter 1Computer Security Fundamentals Chapter 1
Computer Security Fundamentals Chapter 1
remoteaimms
 
Ad

Logistic-Regression - Machine learning model

  • 2. Logistic Regression Important analytic tool in natural and social sciences Baseline supervised machine learning tool for classification Is also the foundation of neural networks
  • 3. Generative and Discriminative Classifiers Naive Bayes is a generative classifier by contrast: Logistic regression is a discriminative classifier
  • 4. Generative and Discriminative Classifiers Suppose we're distinguishing cat from dog images
  • 5. Generative Classifier: • Build a model of what's in a cat image • Knows about whiskers, ears, eyes • Assigns a probability to any image: • how cat-y is this image? Also build a model for dog images Now given a new image: Run both models and see which one fits better
  • 6. Discriminative Classifier Just try to distinguish dogs from cats Oh look, dogs have collars! Let's ignore everything else
  • 7. Finding the correct class c from a document d: generative vs. discriminative classifiers. Recall that naive Bayes assigns a class c to a document d not by directly computing P(c|d) but by computing a likelihood and a prior:
      ĉ = argmax_{c ∈ C} P(d|c) · P(c)   [likelihood × prior]
      A generative model like naive Bayes makes use of this likelihood term, which expresses how to generate the features of a document if we knew it was of class c. By contrast, a discriminative model in this text categorization scenario attempts to directly compute the posterior P(c|d). Perhaps it will learn to assign high weight to document features that directly improve its ability to discriminate between possible classes:
      ĉ = argmax_{c ∈ C} P(c|d)   [posterior]
  • 8. Components of a probabilistic machine learning classifier. Given m input/output pairs (x(i), y(i)):
      1. A feature representation of the input: for each input observation x(i), a vector of features [x1, x2, …, xn]. Feature j for input x(i) is xj, more completely xj(i), or sometimes fj(x).
      2. A classification function that computes ŷ, the estimated class, via p(y|x), like the sigmoid or softmax functions.
      3. An objective function for learning, like cross-entropy loss.
      4. An algorithm for optimizing the objective function: stochastic gradient descent.
  • 9. The two phases of logistic regression. Training: we learn weights w and b using stochastic gradient descent and cross-entropy loss. Test: given a test example x, we compute p(y|x) using the learned weights w and b, and return whichever label (y = 1 or y = 0) has the higher probability.
  • 12. Classification Reminder Positive/negative sentiment Spam/not spam Authorship attribution (Hamilton or Madison?) Alexander Hamilton
  • 13. Text Classification: definition. Input: a document x and a fixed set of classes C = {c1, c2, …, cJ}. Output: a predicted class ŷ ∈ C.
  • 14. Binary Classification in Logistic Regression. Given a series of input/output pairs (x(i), y(i)). For each observation x(i), we represent x(i) by a feature vector [x1, x2, …, xn] and compute an output: a predicted class ŷ(i) ∈ {0, 1}.
  • 15. Features in logistic regression. For feature xi, weight wi tells us how important xi is. xi = "review contains 'awesome'": wi = +10. xj = "review contains 'abysmal'": wj = −10. xk = "review contains 'mediocre'": wk = −2.
  • 16. Logistic Regression for one observation x. Input observation: vector x = [x1, x2, …, xn]. Weights: one per feature: W = [w1, w2, …, wn] (sometimes we call the weights θ = [θ1, θ2, …, θn]). Output: a predicted class ŷ ∈ {0, 1}. (In multinomial logistic regression, e.g. ŷ ∈ {0, 1, 2, 3, 4}.)
  • 17. How to do classification. For each feature xi, weight wi tells us the importance of xi (plus we'll have a bias b). We'll sum up all the weighted features and the bias; if this sum is high, we say y = 1; if low, then y = 0. The bias term b is another real number that's added to the weighted inputs. To make a decision on a test instance, after we've learned the weights, the classifier multiplies each xi by its weight wi, sums up the weighted features, and adds the bias term b. The resulting single number z expresses the weighted sum of the evidence for the class:
      z = (Σ_{i=1}^{n} wi xi) + b
      We'll represent such sums using the dot product notation: the dot product of two vectors a and b, written a·b, is the sum of the products of the corresponding elements of each vector. Thus, equivalently:
      z = w·x + b
  • 18. But we want a probabilistic classifier We need to formalize “sum is high”. We’d like a principled classifier that gives us a probability, just like Naive Bayes did We want a model that can tell us: p(y=1|x; θ) p(y=0|x; θ)
  • 19. The problem: z isn't a probability, it's just a number! In fact, since weights are real-valued, the output z = w·x + b ranges from −∞ to ∞. Solution: pass z through a function that maps it into the range from 0 to 1. The sigmoid function (named because it looks like an s), also called the logistic function, gives logistic regression its name:
      y = σ(z) = 1 / (1 + e^(−z)) = 1 / (1 + exp(−z))
      (We use exp(x) to mean e^x.) The sigmoid is nearly linear around 0, but outlier values get squashed toward 0 or 1.
  • 20. The very useful sigmoid or logistic function: σ(z) = 1 / (1 + e^(−z)). It takes a real-valued number z (ranging from −∞ to ∞) and maps it into the range (0, 1).
  • 21. Idea of logistic regression We’ll compute w·x+b And then we’ll pass it through the sigmoid function: σ(w·x+b) And we'll just treat it as a probability
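To make the idea concrete, here is a minimal sketch in plain Python (the helper names sigmoid, score, and p_positive are ours, not from the slides):

```python
import math

def sigmoid(z: float) -> float:
    # Squash a real-valued score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def score(w: list, x: list, b: float) -> float:
    # z = w.x + b: weighted sum of the features plus the bias.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def p_positive(w: list, x: list, b: float) -> float:
    # Treat sigmoid(w.x + b) as P(y = 1 | x).
    return sigmoid(score(w, x, b))
```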
  • 22. Making probabilities with sigmoids. The two cases, P(y = 1) and P(y = 0), must sum to 1. We can do this as follows:
      P(y = 1) = σ(w·x + b) = 1 / (1 + exp(−(w·x + b)))
      P(y = 0) = 1 − σ(w·x + b) = exp(−(w·x + b)) / (1 + exp(−(w·x + b)))
  • 23. By the way: the sigmoid function has the property 1 − σ(x) = σ(−x), so we could also have expressed P(y = 0) as σ(−(w·x + b)).
  • 24. Turning a probability into a classifier. Now we have an algorithm that, given an instance x, computes P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than 0.5, and no otherwise:
      ŷ = 1 if P(y = 1|x) > 0.5, else 0
      The value 0.5 here is called the decision boundary.
  • 25. The probabilistic classifier. Given z = w·x + b, which ranges from −∞ to ∞, the sigmoid gives us P(y = 1) = σ(w·x + b) = 1 / (1 + e^(−(w·x+b))).
  • 26. Turning a probability into a classifier:
      ŷ = 1 if P(y = 1|x) > 0.5, i.e. if w·x + b > 0
      ŷ = 0 otherwise, i.e. if w·x + b ≤ 0
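A minimal sketch of this decision rule, reusing the p_positive helper from the earlier sketch (the 0.5 threshold is the decision boundary):

```python
def classify(w: list, x: list, b: float) -> int:
    # Say yes (1) if P(y = 1 | x) > 0.5, no (0) otherwise.
    # Since sigmoid(z) > 0.5 exactly when z > 0, this is equivalent
    # to checking whether w.x + b > 0.
    return 1 if p_positive(w, x, b) > 0.5 else 0
```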
  • 28. Logistic Regression: a text example on sentiment classification
  • 29. Sentiment example: does y=1 or y=0? It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .
  • 30. The same review, with the extracted features shown in the vector x (Figure 5.2): x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19. We represent each input observation by the 6 features x1…x6:
      x1: count(positive lexicon words ∈ doc) = 3
      x2: count(negative lexicon words ∈ doc) = 2
      x3: 1 if "no" ∈ doc, 0 otherwise = 1
      x4: count(1st and 2nd person pronouns ∈ doc) = 3
      x5: 1 if "!" ∈ doc, 0 otherwise = 0
      x6: log(word count of doc) = ln(66) = 4.19
  • 31. Classifying sentiment for input x. Let's assume for the moment that we've already learned a real-valued weight for each of these features, and that the 6 weights corresponding to the 6 features are w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], with b = 0.1. (We'll discuss later how the weights are learned.)
  • 32. Classifying sentiment for input x. Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed:
      p(+|x) = P(Y = 1|x) = σ(w·x + b) = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1) = σ(0.833) = 0.70
      p(−|x) = P(Y = 0|x) = 1 − σ(w·x + b) = 0.30
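The slide's numbers can be checked directly; a short, self-contained sketch (values copied from the slide):

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # learned weights from the slide
b = 0.1                                 # bias from the slide
x = [3, 2, 1, 3, 0, 4.19]               # features extracted from the review

z = sum(wi * xi for wi, xi in zip(w, x)) + b
p_pos = 1.0 / (1.0 + math.exp(-z))
print(round(z, 3), round(p_pos, 2))     # 0.833 0.7, i.e. P(+|x) = 0.70, P(-|x) = 0.30
```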
  • 33. We can build features for logistic regression for any classification task: period disambiguation, deciding whether a period is the end of a sentence (EOS) or part of a word (not-EOS). We might use features like x1 below, expressing that the current word is lower case and the class is EOS (perhaps with a positive weight), or that the current word is in our abbreviations dictionary ("Prof.") and the class is EOS (perhaps with a negative weight). A feature can also express a quite complex combination of properties: for example, a period following an upper-cased word is likely to be an EOS, but if the word itself is St. and the previous word is capitalized, then the period is likely part of a shortening of the word street.
      x1 = 1 if Case(wi) = Lower, 0 otherwise
      x2 = 1 if wi ∈ AcronymDict, 0 otherwise
      x3 = 1 if wi = St. and Case(wi−1) = Cap, 0 otherwise
      "This ends in a period." → end of sentence; "The house at 465 Main St. is new." → not end
  • 34. Classification in (binary) logistic regression: summary. Given: a set of classes (+ sentiment, − sentiment); a vector x of features [x1, x2, …, xn] (e.g., x1 = count("awesome"), x2 = log(number of words in review)); and a vector w of weights [w1, w2, …, wn], with one wi for each feature fi. Then:
      P(y = 1) = σ(w·x + b) = 1 / (1 + e^(−(w·x+b)))
      The sigmoid outputs a number between 0 and 1, which is just what we want for a probability. Because it is nearly linear around 0 but has a sharp slope toward the ends, it tends to squash outlier values toward 0 or 1. And it's differentiable, which will be handy for learning.
  • 36. Wait, where did the w's come from? Supervised classification: we know the correct label y (either 0 or 1) for each x, but what the system produces is an estimate ŷ. We want to set w and b to minimize the distance between our estimate ŷ(i) and the true y(i). We need a distance estimator (a loss function or a cost function), and we need an optimization algorithm to update w and b to minimize the loss.
  • 37. Learning components A loss function: ◦ cross-entropy loss An optimization algorithm: ◦ stochastic gradient descent
  • 38. The distance between ŷ and y. We want to know how far the classifier output ŷ = σ(w·x + b) is from the true output y (either 0 or 1). We'll call this difference L(ŷ, y): how much ŷ differs from the true y.
  • 39. Intuition of negative log likelihood loss = cross-entropy loss A case of conditional maximum likelihood estimation We choose the parameters w,b that maximize • the log probability • of the true y labels in the training data • given the observations x
  • 40. Deriving cross-entropy loss for a single observation x. Goal: maximize the probability of the correct label, p(y|x). Since there are only 2 discrete outcomes (0 or 1), this is a Bernoulli distribution, and we can express the probability p(y|x) that our classifier produces for one observation as:
      p(y|x) = ŷ^y (1 − ŷ)^(1−y)
      Noting: if y = 1, this simplifies to ŷ; if y = 0, it simplifies to 1 − ŷ.
  • 41. Now take the log of both sides. This turns out to be handy mathematically and doesn't hurt us: whatever values maximize log p(y|x) will also maximize p(y|x):
      log p(y|x) = log[ŷ^y (1 − ŷ)^(1−y)] = y log ŷ + (1 − y) log(1 − ŷ)
      This is a log likelihood that should be maximized.
  • 42. Now flip the sign to turn this into a loss: something to minimize. The result is the cross-entropy loss LCE (so called because it is the formula for the cross-entropy between y and ŷ):
      LCE(ŷ, y) = −log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]
      Or, plugging in the definition ŷ = σ(w·x + b):
      LCE(ŷ, y) = −[y log σ(w·x + b) + (1 − y) log(1 − σ(w·x + b))]
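As a sketch, the loss on this slide translates almost directly into code (natural log, per the slides; the function name is ours):

```python
import math

def cross_entropy_loss(y_hat: float, y: int) -> float:
    # L_CE(y_hat, y) = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
```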
  • 43. Let's see if this works for our sentiment example We want loss to be: • smaller if the model estimate is close to correct • bigger if model is confused Let's first suppose the true label of this is y=1 (positive) It's hokey . There are virtually no surprises , and the writing is second-rate . So why was it so enjoyable ? For one thing , the cast is great . Another nice touch is the music . I was overcome with the urge to get off the couch and start dancing . It sucked me in , and it'll do the same to you .
  • 44. True value is y = 1. How well is our model doing? P(+|x) = σ(w·x + b) = σ(0.833) = 0.70, so pretty well! What's the loss? With y = 1, the right side of the equation drops out (we'll use log to mean natural log when the base is not specified):
      LCE(ŷ, y) = −[y log σ(w·x + b) + (1 − y) log(1 − σ(w·x + b))] = −log σ(w·x + b) = −log(0.70) = 0.36
  • 45. Suppose the true value instead was y = 0 (perhaps the reviewer went on to say "But bottom line, the movie is terrible! I beg you not to see it!"). In this case our model is confused, and we'd want the loss to be higher. Plugging y = 0 and 1 − σ(w·x + b) = 0.30 into the loss, the left side of the equation drops out:
      LCE(ŷ, y) = −[y log σ(w·x + b) + (1 − y) log(1 − σ(w·x + b))] = −log(1 − σ(w·x + b)) = −log(0.30) = 1.2
  • 46. The loss when the model was right (if true y = 1) is −log(0.70) = 0.36, lower than the loss when the model was wrong (if true y = 0): −log(0.30) = 1.2. Sure enough, the loss was bigger when the model was wrong!
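Checking the two worked losses with the cross_entropy_loss sketch above (ŷ = 0.70 from the sentiment example):

```python
print(round(cross_entropy_loss(0.70, 1), 2))  # 0.36: model was right, small loss
print(round(cross_entropy_loss(0.70, 0), 2))  # 1.2: model was wrong, bigger loss
```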
  • 49. Our goal: minimize the loss. Let's make explicit that the loss function is parameterized by weights θ = (w, b), and represent ŷ as f(x; θ) to make the dependence on θ more obvious. The goal of gradient descent is to find the optimal weights: the set of weights that minimizes the loss function, averaged over all m examples:
      θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} LCE(f(x(i); θ), y(i))
  • 50. Intuition of gradient descent. How do I get to the bottom of this river canyon? Look around me 360∘, find the direction of steepest slope down, and go that way.
  • 51. Our goal: minimize the loss For logistic regression, loss function is convex • A convex function has just one minimum • Gradient descent starting from any point is guaranteed to find the minimum • (Loss for neural networks is non-convex)
  • 52. Let's first visualize for a single scalar w. The figure plots the loss against w, marking the current value w1 and the goal wmin. Should we move right or left from here? Q: Given current w, should we make it bigger or smaller? A: Move w in the reverse direction from the slope of the function.
  • 53. The slope of the loss at w1 is negative, so we'll move positive (increase w).
  • 54. Moving positive from w1 toward wmin is one step of gradient descent.
  • 55. Gradients The gradient of a function of many variables is a vector pointing in the direction of the greatest increase in a function. Gradient Descent: Find the gradient of the loss function at the current point and move in the opposite direction.
  • 56. How much do we move in that direction? By the value of the gradient (in our one-variable example, the slope d/dw L(f(x; w), y)), weighted by a learning rate η. A higher learning rate means moving w faster:
      wt+1 = wt − η · d/dw L(f(x; w), y)
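A one-line sketch of this scalar update (lr stands in for the learning rate η; the names are ours):

```python
def gd_step(w: float, slope: float, lr: float = 0.1) -> float:
    # Move w *against* the slope of the loss, scaled by the learning rate.
    return w - lr * slope
```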
  • 57. Now let's consider N dimensions We want to know where in the N-dimensional space (of the N parameters that make up θ ) we should move. The gradient is just such a vector; it expresses the directional components of the sharpest slope along each of the N dimensions.
  • 58. Imagine 2 dimensions, w and b. The figure visualizes the gradient vector of the cost surface Cost(w, b) at the red point; its two dimensions are shown in the plane of w and b.
  • 59. Real gradients are much longer: lots and lots of weights. For each dimension wi, the gradient component i tells us the slope with respect to that variable: "How much would a small change in wi influence the total loss function L?" We express the slope as a partial derivative ∂L/∂wi of the loss. The gradient is then defined as a vector of these partials.
  • 60. The gradient. We'll represent ŷ as f(x; θ) to make the dependence on θ more obvious:
      ∇θ L(f(x; θ), y) = [ ∂L(f(x;θ),y)/∂w1, ∂L(f(x;θ),y)/∂w2, …, ∂L(f(x;θ),y)/∂wn ]
      The final equation for updating θ based on the gradient is thus:
      θt+1 = θt − η ∇L(f(x; θ), y)
  • 61. What are these partial derivatives for logistic regression? The loss function:
      LCE(ŷ, y) = −[y log σ(w·x + b) + (1 − y) log(1 − σ(w·x + b))]
      has an elegant derivative for one observation (see textbook section 5.8 for the derivation):
      ∂LCE(ŷ, y)/∂wj = [σ(w·x + b) − y] xj
      The gradient with respect to a single weight wj is a very intuitive value: the difference between the true y and our estimate ŷ = σ(w·x + b) for that observation, multiplied by the corresponding input value xj.
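The gradient formula on this slide, as a sketch for one observation (sigmoid as defined earlier; grad_w[j] is ∂L/∂wj and grad_b is ∂L/∂b):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient(w, b, x, y):
    # err = sigmoid(w.x + b) - y: the gap between estimate and truth.
    err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
    grad_w = [err * xi for xi in x]   # dL/dw_j = err * x_j
    grad_b = err                      # dL/db   = err (the bias acts like x_j = 1)
    return grad_w, grad_b
```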
  • 62. function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
      # where: L is the loss function
      # f is a function parameterized by θ
      # x is the set of training inputs x(1), x(2), …, x(m)
      # y is the set of training outputs (labels) y(1), y(2), …, y(m)
      θ ← 0
      repeat til done
        For each training tuple (x(i), y(i)) (in random order)
          1. Optional (for reporting): compute ŷ(i) = f(x(i); θ) and the loss L(ŷ(i), y(i))  # How are we doing on this tuple?
          2. g ← ∇θ L(f(x(i); θ), y(i))  # How should we move θ to maximize loss?
          3. θ ← θ − η g  # Go the other way instead
      return θ
      (Figure 5.5: the stochastic gradient descent algorithm.)
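A minimal Python rendering of this pseudocode for binary logistic regression, reusing lr_gradient from the previous sketch (epochs stands in for "repeat til done"; a real implementation would track convergence):

```python
import random

def sgd(xs, ys, lr=0.1, epochs=10):
    n = len(xs[0])
    w, b = [0.0] * n, 0.0                    # theta <- 0
    for _ in range(epochs):                  # repeat til done
        order = list(range(len(xs)))
        random.shuffle(order)                # training tuples in random order
        for i in order:
            grad_w, grad_b = lr_gradient(w, b, xs[i], ys[i])
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b                 # go the other way from the gradient
    return w, b
```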
  • 63. Hyperparameters. The learning rate η is a hyperparameter: too high, and the learner will take big steps and overshoot; too low, and the learner will take too long. Hyperparameters are, briefly, a special kind of parameter for an ML model: instead of being learned by the algorithm from supervision (like regular parameters), they are chosen by the algorithm designer.
  • 66. Working through an example: one step of gradient descent. A mini-sentiment example, where the true y = 1 (positive). Two features: x1 = 3 (count of positive lexicon words), x2 = 2 (count of negative lexicon words). Assume the 3 parameters (2 weights and 1 bias) in θ0 are zero (w1 = w2 = b = 0), and the learning rate η = 0.1.
  • 67. Example of gradient descent. The update step for θ is θt+1 = θt − η ∇θ L(f(x(i); θ), y(i)), where ∂LCE(ŷ, y)/∂wj = [σ(w·x + b) − y] xj. With w1 = w2 = b = 0, x1 = 3, and x2 = 2, the gradient vector has 3 dimensions, for w1, w2, and b. We can compute the first gradient as follows:
      ∇w,b = [ ∂LCE(ŷ,y)/∂w1, ∂LCE(ŷ,y)/∂w2, ∂LCE(ŷ,y)/∂b ]
           = [ (σ(w·x+b) − y)x1, (σ(w·x+b) − y)x2, σ(w·x+b) − y ]
           = [ (σ(0) − 1)x1, (σ(0) − 1)x2, σ(0) − 1 ]
           = [ −0.5x1, −0.5x2, −0.5 ] = [ −1.5, −1.0, −0.5 ]
  • 72. Example of gradient descent, continued. Now that we have a gradient, we compute the new parameter vector θ1 by moving θ0 in the opposite direction from the gradient (η = 0.1):
      θ1 = [w1, w2, b] − η [−1.5, −1.0, −0.5] = [0.15, 0.1, 0.05]
      So after one step of gradient descent, the weights have shifted to w1 = 0.15, w2 = 0.1, and b = 0.05. Note that this observation x happened to be a positive example; enough negative examples with high counts of negative words would eventually make w2 negative.
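The worked update can be reproduced numerically; a short sketch under the slide's setup (w1 = w2 = b = 0, x = [3, 2], y = 1, η = 0.1):

```python
import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

w, b = [0.0, 0.0], 0.0
x, y, eta = [3.0, 2.0], 1, 0.1

err = sigmoid(w[0] * x[0] + w[1] * x[1] + b) - y    # sigmoid(0) - 1 = -0.5
grad = [err * x[0], err * x[1], err]                # [-1.5, -1.0, -0.5]
w = [wj - eta * gj for wj, gj in zip(w, grad[:2])]
b = b - eta * grad[2]
print([round(v, 2) for v in w], round(b, 2))        # [0.15, 0.1] 0.05
```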
  • 76. Mini-batch training. Stochastic gradient descent chooses a single random example at a time, which can result in choppy movements. It is more common to compute the gradient over batches of training instances. Batch training: the entire dataset. Mini-batch training: m examples (e.g., 512 or 1024).
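A sketch of the mini-batch variant, averaging the per-example gradients from the lr_gradient sketch above (the helper name is ours):

```python
def minibatch_gradient(w, b, batch_x, batch_y):
    m = len(batch_x)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in zip(batch_x, batch_y):
        gw, gb = lr_gradient(w, b, x, y)     # per-example gradient (sketch above)
        grad_w = [a + g / m for a, g in zip(grad_w, gw)]
        grad_b += gb / m
    return grad_w, grad_b                    # averaged over the mini-batch
```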
  • 78. Overfitting. A model that perfectly matches the training data has a problem: it will also overfit to the data, modeling noise. A random word that perfectly predicts y (it happens to occur in only one class) will get a very high weight, and the model will fail to generalize to a test set without this word. A good model should be able to generalize.
  • 79. Overfitting. Models that are too powerful can overfit the data, fitting the details of the training data so exactly that the model doesn't generalize well to the test set. How to avoid overfitting? Regularization in logistic regression; dropout in neural networks.
  • 80. Regularization: a solution for overfitting. Add a regularization term R(θ) to the objective (for now written as maximizing log probability rather than minimizing loss, and dropping the 1/m term, which doesn't affect the argmax):
      θ̂ = argmax_θ Σ_{i=1}^{m} log P(y(i)|x(i)) − α R(θ)
      Idea: choose an R(θ) that penalizes large weights. A setting of the weights that matches the training data perfectly but uses many weights with high values will be penalized more than a setting that matches the data a little less well but does so using smaller weights.
  • 81. L2 Regularization (= ridge regression) The sum of the squares of the weights. The name comes from the (square of the) L2 norm ||θ||2, the Euclidean distance of the vector θ from the origin. If θ consists of n weights, then:

R(θ) = ||θ||₂² = Σj=1..n θj² (5.23)

The L2 regularized objective function becomes:

θ̂ = argmaxθ [ Σi=1..m log P(y(i)|x(i)) ] − α Σj=1..n θj²
  • 82. L1 Regularization (= lasso regression) The sum of the absolute values of the weights. Named after the L1 norm ||θ||1, the sum of the absolute values of the weights, also called the Manhattan distance (the distance you'd have to walk between two points in a city with a street grid like New York):

R(θ) = ||θ||₁ = Σi=1..n |θi|

The L1 regularized objective function becomes:

θ̂ = argmaxθ [ Σi=1..m log P(y(i)|x(i)) ] − α Σj=1..n |θj|
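As one hedged illustration of the two penalties, scikit-learn's LogisticRegression exposes both; its parameter C is the inverse of the regularization strength α, so smaller C means stronger regularization. The toy data below is made up for demonstration:

```python
from sklearn.linear_model import LogisticRegression

X = [[3, 2], [0, 4], [5, 0], [1, 3]]   # toy feature vectors (illustrative)
y = [1, 0, 1, 0]                       # toy labels

l2_model = LogisticRegression(penalty="l2", C=1.0)    # ridge-style penalty
l1_model = LogisticRegression(penalty="l1", C=1.0,
                              solver="liblinear")     # lasso-style penalty
l2_model.fit(X, y)
l1_model.fit(X, y)
print(l2_model.coef_)   # L2 shrinks weights toward 0
print(l1_model.coef_)   # L1 can drive some weights exactly to 0
```

This sparsity-inducing behavior of L1 is one practical reason to prefer it when many features are irrelevant.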
  • 84. Multinomial Logistic Regression Often we need more than 2 classes ◦ Positive/negative/neutral ◦ Parts of speech (noun, verb, adjective, adverb, preposition, etc.) ◦ Classify emergency SMSs into different actionable classes If >2 classes we use multinomial logistic regression = softmax regression = multinomial logit (defunct names: maximum entropy modeling or MaxEnt). So "logistic regression" will just mean binary (2 output classes)
  • 85. Recall that the cross-entropy loss for binary logistic regression is:

LCE(ŷ, y) = −log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)] = −log ŷc (where c is the true class)

The loss function for multinomial logistic regression (where there are K classes) generalizes this:

LCE(ŷ, y) = −Σk=1..K yk log ŷk = −log ŷc (where c is the true class) = −log P(y = c|x)
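A minimal numpy sketch of both loss forms, assuming ŷ is a single probability in the binary case and a probability vector indexed by class in the multinomial case:

```python
import numpy as np

def binary_ce(y_hat, y):
    # -[y log y_hat + (1 - y) log(1 - y_hat)]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def multinomial_ce(y_hat, c):
    # with a one-hot gold vector, the sum over classes collapses to
    # -log y_hat[c], the negative log probability of the true class c
    return -np.log(y_hat[c])

print(binary_ce(0.7, 1))                              # -log 0.7
print(multinomial_ce(np.array([0.1, 0.7, 0.2]), 1))   # same value
```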
  • 86. Multinomial Logistic Regression The probability of everything must still sum to 1 P(positive|doc) + P(negative|doc) + P(neutral|doc) = 1 Need a generalization of the sigmoid called the softmax ◦ Takes a vector z = [z1, z2, ..., zk] of k arbitrary values ◦ Outputs a probability distribution ◦ each value in the range (0,1) ◦ all the values summing to 1
  • 87. The softmax function Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities. Like the sigmoid, it is an exponential function. For a vector z of dimensionality k, the softmax is defined as:

softmax(zi) = exp(zi) / Σj=1..k exp(zj),  1 ≤ i ≤ k (5.30)

The softmax of an input vector z = [z1, z2, ..., zk] is thus a vector itself:

softmax(z) = [ exp(z1)/Σi=1..k exp(zi), exp(z2)/Σi=1..k exp(zi), ..., exp(zk)/Σi=1..k exp(zi) ] (5.31)

The denominator Σi=1..k exp(zi) normalizes all the values into probabilities. Thus, for example, given the vector z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1], the result softmax(z) is [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
  • 88. The softmax function ◦ Turns a vector z = [z1, z2, ..., zk] of k arbitrary values into probabilities. The denominator Σi=1..k exp(zi) normalizes the values into probabilities. For example, given the vector z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1], softmax(z) is [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010], as verified in the sketch below
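A minimal numpy version of the softmax, checked against the worked example above; subtracting max(z) before exponentiating is a standard numerical-stability trick that cancels in the ratio and does not change the result:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # shift for numerical stability
    return exp_z / exp_z.sum()      # normalize into probabilities

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))   # ~ [0.055, 0.090, 0.0067, 0.10, 0.74, 0.010]
```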
  • 89. Softmax in multinomial logistic regression Like the sigmoid, the input to the softmax is the dot product between a weight vector w and an input vector x (plus a bias). But now we'll need separate weight vectors (and biases) for each of the K classes:

p(y = c|x) = exp(wc · x + bc) / Σj=1..K exp(wj · x + bj)

Like the sigmoid, the softmax has the property of squashing values toward 0 or 1: if one of the inputs is larger than the others, it will tend to push its probability toward 1 and suppress the probabilities of the smaller inputs.
  • 90. For gradient descent we don't need the loss itself, we need its gradient. The gradient for a single example turns out to be very similar to the gradient for binary logistic regression: for each class k, the gradient with respect to wk is (ŷk − yk)x, where yk is 1 if k is the true class and 0 otherwise
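A minimal sketch of one multinomial update, assuming a K × n weight matrix W (one weight vector per class, as on slide 89), a one-hot gold vector y, and the illustrative input x = [3, 2] reused from the earlier mini example:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

K, n = 3, 2
W = np.zeros((K, n))            # one weight vector per class
b = np.zeros(K)                 # one bias per class
x = np.array([3.0, 2.0])        # illustrative input
y = np.array([0.0, 1.0, 0.0])   # one-hot gold vector: true class is 1

y_hat = softmax(W @ x + b)       # p(y = c | x) for each class c
grad_W = np.outer(y_hat - y, x)  # row k is (y_hat_k - y_k) * x
grad_b = y_hat - y
print(grad_W, grad_b)
```

The binary gradient (ŷ − y)x from earlier slides is the K = 2 special case of this per-class form.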