# 21. Generalized Linear Models

The following content is
MIT OpenCourseWare continue to offer high quality
view additional materials from hundreds of MIT courses,
visit MIT OpenCourseWare and ocw.mit.edu. PHILIPPE RIGOLLET: The
chapter is a natural capstone chapter for this entire course. We’ll see some of
the things we’ve seen during maximum likelihood
and some of the things we’ve seen during linear
regression, some of the things we’ve seen in terms of the basic
modeling that we’ve had before. We’re not going to go back
to much inference questions. It’s really going to
be about modeling. And in a way, generalized
linear models, as the word says, are just a generalization
of linear models. And they’re actually
extremely useful. They’re often forgotten
about and people just jump onto machine learning
and sophisticated techniques. But those things do
the job quite well. So let’s see in what sense
they are a generalization of the linear models. So remember, the linear
model looked like this. We said that y was equal to x
transpose beta plus epsilon, right? That was our linear
regression model. And it’s– another way
to say this is that if– and let’s assume
that those were, say, Gaussian with mean 0 and
identity covariance matrix. Then another way
to say this is that the conditional distribution
of y given x is equal to– sorry, I a Gaussian with mean
x transpose beta and variance– well, we had a sigma squared,
which I will forget as usual– x transpose beta and
then sigma squared. OK, so here, we just assumed
that– so what is regression is just saying I’m trying to
explain why as a function of x. Given x, I’m assuming a
distribution for the y. And this x is just
going to be here to help me model what the mean
of this Gaussian is, right? I mean, I could have
something crazy. I could have something
that looks like y given x is n0 x transpose beta. And then this could
be some other thing which looks like, I don’t
know, some x transpose gamma squared
times, I don’t know, x, x transpose plus identity– some crazy thing that
depends on x here, right? And we deliberately assumed that
all the thing that depends on x shows up in the mean, OK? And so what I have
here is that y given x is a Gaussian
with a mean that depends on x and covariance
matrix sigma square identity. Now the linear model
assumed a very specific form for the mean. It said I want the
mean to be equal to x transpose beta
which, remember, was the sum from, say, j equals
1 to p of beta j xj, right? It’s where the xj’s are
the coordinates of x. But I could do something
also more complicated, right? I could have something
that looks like instead , replace this by, I don’t know,
sum of beta j log of x to the j divided by x to the j squared
or something like this, right? I could do this as well. So there’s two things
that we have assumed. The first one is
that when I look at the conditional
distribution of y given x, x affects only the mean. I also assume that
it was Gaussian and that it affects
only the mean. And the mean is affected
in a very specific way, which is linear in x, right? So this is
essentially the things we’re going to try to relax. So the first thing
that we assume, the fact that y was Gaussian and
had only its mean [INAUDIBLE] dependant no x is what’s
called the random component. It just says that the
response variables, you know, it sort of makes sense to
assume that they’re Gaussian. And everything was
essentially captured, right? So there’s this
property of Gaussians that if you tell me– if
the variance is known, all you need to tell
me to understand exactly what the distribution
of a Gaussian is, all you need to tell me
is its expected value. All right, so
that’s this mu of x. And the second thing is that
we have this link that says, well, I need to find a way
to use my x’s to explain this mu you and the
link was exactly mu of x was equal
to x transpose beta. Now we are talking about
generalized linear models. So this part here where mu
of x is of the form– the way I want my beta, my x,
to show up is linear, this will never be a question. In principle, I could
add a third point, which is just question this
part, the fact that mu of x is x transpose beta. I could have some more
complicated, nonlinear function of x. And then we’ll never do
that because we’re talking about generalized linear model. The only thing with generalize
are the random component, the conditional
distribution of y given x, and the link that just says,
well, once you actually tell me that the only thing I need
to figure out is the mean, I’m just going to slap it
exactly these x transpose beta thing without any transformation
of x transpose beta. So those are the two things. It will become
clear what I mean. This sounds like a
tautology, but let’s just see how we could extend that. So what we’re going to do in
generalized linear models– right, so when I
talk about GLNs, the first thing I’m
going to do with my x is turn it into some
x transpose beta. And that’s just
the l part, right? I’m not going to
be able to change. That’s the way it works. I’m not going to do
anything non-linear. But the two things
I’m going to change is this random
component, which is that y, which used to be some
Gaussian with mean mu of x here in sigma squared– so y given x, sorry– this is going to become y given
x follows some distribution. And I’m not going to
allow any distribution. I want something that comes
from the exponential family. Who knows what the exponential
family of distribution is? This is not the same thing as
the exponential distribution. It’s a family of distributions. All right, so we’ll see that. It’s– wow. What can that be? Oh yeah, that’s
actually [INAUDIBLE].. So– I’m sorry? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: I’m
in presentation mode. That should not happen. OK, so hopefully, this is muted. So essentially, this is going
to be a family of distributions. And what makes them
exponential typically is that there’s an
exponential that shows up in the definition
of the density, all right? We’ll see that the
Gaussian belongs to the exponential family. But they’re slightly
less expected ones because there’s this crazy
thing that a to the x is exponential x log a, which
makes the potential show up without being there. So if there’s an
exponential of some power, it’s going to show up. But it’s more than that. So we’ll actually come
to this particular family of distribution. Why this particular family? Because in a way,
everything we’ve done for the linear
model with Gaussian is going to extend fairly
naturally to this family. All right, and it actually
also, because it encompasses pretty much everything,
all the distributions we’ve discussed before. All right, so the second thing
that I want to question– right, so before,
we just said, well, mu of x was directly
equal to this thing. Mu of x was directly
x transpose beta. So I knew I was going to
have an x transpose beta and I said, well, I could do
something with this x transpose beta before I used it to
explain the expected value. But I’m actually
taking it like that. Here, we’re going to say, let’s
extend this to some function is equal to this thing. Now admittedly, this is
not the most natural way to think about it. What you would probably
feel more comfortable doing is write something like
mu of x is a function. Let’s call it f of
x transpose beta. But here, I decide
to call f g inverse. OK, let’s just my g inverse. Yes. AUDIENCE: Is this different
then just [INAUDIBLE] PHILIPPE RIGOLLET: Yeah. I mean, what transformation
you want to put on your x’s? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: Oh
no, certainly not, right? I mean, if I give you– if I
force you to work with x1 plus x2, you cannot work with
any function of x1 plus any function of x2, right? So this is different. All right, so– yeah. The transformation would
be just the simple part of your linear
regression problem where you would take your
exes, transform them, and then just apply
another linear regression. This is genuinely new. Any other question? All right, so this
function g and the reason why I sort of have to, like,
stick to this slightly less natural way of defining
it is because that’s g that gets a name, not g
inverse that gets a name. And the name of g is
the link function. So if I want to give you a
generalized linear model, I need to give you
two ingredients. The first one is the
random component, which is the distribution
of y given x. And it can be anything in what’s
called the exponential family of distributions. So for example, I
could say, y given x is Gaussian with mean
mu x sigma identity. But I can also
tell you y given x is gamma with shared parameter
equal to alpha of x, OK? I could do some weird
things like this. And the second thing is I need
going to become very clear how you pick a link function. And the only reason that you
actually pick a link function is because of compatibility. This mu of x, I call
it mu because mu of x is always the conditional
expectation of y given x, always, which means
that let’s think of y as being a Bernoulli
random variable. Where does mu of x live? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: 0, 1, right? That’s the expectation
of a Bernoulli. It’s just the probability
that my coin flip gives me 1. So it’s a number
between 0 and 1. But this guy right here, if
my x’s are anything, right– think of any body
measurements plus [INAUDIBLE] linear combinations with
arbitrarily large coefficients. This thing can be
any real number. So the link function, what
it’s effectively going to do is make those two
things compatible. It’s going to take
my number which, for example, is constrained
to be between 0 and 1 and map it into the
entire real line. If I have mu which is forced
to be positive, for example, in an exponential distribution,
the mean is positive, right? That’s the, say, don’t
know, inter-arrival time for Poisson process. This thing is known to be
positive for an exponential. I need to map something
that’s exponential to the entire real line. I need a function that
takes something positive and [INAUDIBLE] everywhere. So we’ll see. By the end of this
chapter, you will have 100 ways of doing this, but
there are some more traditional ones [INAUDIBLE]. So before we go any further,
I gave you the example of a Bernoulli random variable. Let’s see a few examples
that actually fit there. Yes. AUDIENCE: Will it come up
later [INAUDIBLE] already know why do we need the
transformer [INAUDIBLE] why don’t [INAUDIBLE] PHILIPPE RIGOLLET:
Well actually, this will not come up later. It should be very
clear from here because if I actually
have a model, I just want it to
be plausible, right? I mean, what happens if I
suddenly decide that my– so this is what’s
going to happen. You’re going to have only
data to fit this model. Let’s say you actually
going to pretend my y’s just happen to be the realizations
of said Gaussians that happen to be 0 or 1 only. You can always, like, stuff that
in some linear model, right? You will have some least
squares estimated for beta. And it’s going to be fine. For all the points
that you see, it will definitely put
some number that’s actually between 0 and 1. So this is what your picture
is going to look like. You’re going to have a
bunch of values for x. This is your y. And for different– so
these are the values of x that you will get. And for a y, you will see
either a 0 or a 1, right? Right, that’s what your
Bernoulli dataset would look like with a one dimensional x. Now if you do least squares
on this, you will find this. And for this guy,
this line certainly takes values between 0 and 1. But let’s say now
you get an x here. You’re going to actually
start pretending that the probability it spits
out one conditionally in x is like 1.2, and that’s
going to be weird. Any other questions? All right, so let’s
to them through this course. So the first one is– so all these things are taken. So there’s a few
books on generalizing, your models, generalize
[INAUDIBLE] models. And there’s tons of
applications that you can see. Those are extremely
versatile, and as soon as you want to do modeling
to explain some y given x, you sort of need to
do that if you want to go beyond linear models. So this was in the
disease occurring rate. So you have a disease
epidemic and you want to basically model
the expected number of new cases given– at a certain time, OK? So you have time that progresses
is a time stamp– say, I don’t know, 20th day. And your response is
the number of new cases. And you’re going to actually
put your model directly on mu, right? When I looked at
this, everything here was on mu itself, on
the expected, right? Mu of x is always the expected– the conditional
expectation of y given x. right? So all I need to model
is this expected value. So this mu I’m going
to actually say– so I look at some parameters,
and it says, well, it increases exponentially. So I want to say I have some
sort of exponential trend. I can parametrize
that in several ways. And the two parameters
I want to slap in is, like, some sort of gamma,
which is just the coefficient. And then there’s some rate
delta that’s in the exponential. So if I tell you
it’s exponential, that’s a nice family
of functions you might want to think about, OK? So here, mu of x, if I want
to keep the notation, x is gamma exponential
delta x, right? Except that here, my x
are t1, t2, t3, et cetera. And I want to find what the
parameters gamma and delta are because I want to be
able to maybe compare different epidemics and see if
they have the same parameter or maybe just do some
prediction based on the data that I have without– to
extrapolate in the future. So here, clearly mu of
x is not of the form x transpose beta, right? That’s not x
transpose beta at all. And it’s actually not even a
function of x transpose data, right? There’s two parameters,
gamma and delta, and it’s not of the form. So here we have x,
which is 1 and x, right? I have two parameters. So what I do here
is that I say, well, first, let me transform
mu in such a way that I can hope to see
something that’s linear. So if I transform mu, I’m
going to have log of mu, which is log of this thing, right? So log of mu of
x is equal, well, to log of gamma plus
log of exponential delta x, which is delta x. And now this thing is
actually linear in x. So I have that this
guy is my first beta 1. And so that’s beta 1 finds 1. And this guy is beta 2– times, sorry that said beta
0– times 1, and this guy is beta 1 times x. OK, so that looks
like a linear model. I just have to change
my parameters– my parameters beta 1 becomes
the log of gamma and beta 2 becomes delta itself. And the reason why we do this
is because, well, the way we put those gamma and those
delta was just so that we have some parametrization. It just so happens that if
we want this to be linear, we need to just change the
parametrization itself. This is going to
have some effects. We know that it’s going
to have some effect in the fissure information. It’s going to have a bunch of
effect to change those things. But that’s what needs
to be done to have a generalized linear model. Now here, the
function that I took to turn it into something
that’s linear is simple. It came directly from some
natural thing I would do here, which is taking the log. And so the function g,
the link that I take, is called the log
link very creatively. And it’s just the
function that I apply to mu so that I see
something that’s linear and that looks like this. So now this only tells me how
to deal with the link function. But I still have
to deal with 0.1. And this, again, is
just some modeling. Given some data,
some random data, what distribution do you choose
to explain the randomness? And this– I mean,
unless there’s no choice, you know, it’s just a
matter of practice, right? I mean, why would it be
Gaussian and not, you know, doubly exponential? This is– there’s matters
of convenience that come into this, and there’s
just matter of experience that come into this. You know, I remember when
you chat with engineers, they have a very
good notion of what the distribution should be. They have y bold distributions. You know, they do optics
and things like this. So there’s some distributions
that just come up but sometimes just have to work. Now here what do we have? The thing we’re
trying to measure, y– as we said, so mu
is the expectation, the conditional
expectation, of y given x. But y is the number
of new cases, right? Well it’s a number of. And the first thing
you should think of when you think
about number of, if it were bounded above, you
would think binomial, baby. But here, it’s just a number. So you think Poisson. That’s how insurers think. I have a number of, you
know, claims per year. This is a Poisson distribution. And hopefully they can model
the conditional distribution of the number of claims given
everything that they actually ask you in the
surveys that I hear you now fail in 15 minutes. All right, so now you have
this Poisson distribution. And that’s just the
modeling assumption. There’s no particular
reason why you should do this except
that, you know, that might be a good idea. And the expected
value of your Poisson has to be this mu i, OK? At time i. Any question about this slide? OK, so let’s switch
to another example. Another example is the
so-called pray capture rate. So here, what
you’re interested in is the rate capture of
preys yi for a given prey. And you have xy, which
is your explanation. And this is just
the density of pray. So you’re trying to explain the
rate of captures of preys given the density of the prey, OK? And so you need to find
some sort of relationship between the two. And here again,
you talk to experts and what they tell you
is that, well, it’s going to be increasing, right? I mean, animals like predators
are going to just eat more if there’s more preys. But at some point,
they’re just going to level off because they’re
going to be [INAUDIBLE] full and they’re going to stop
capturing those prays. And you’re just going to
have some phenomenon that looks like this. So here is a curve that
sort of makes sense, right? As your capture rate goes from
0 to 1, you’re increasing, and then you see you have
this like [INAUDIBLE] function that says, you know, at
some point it levels up. OK, so here, one way I could– I mean, there’s again
many ways I could just model a function
that looks like this. But a simple one that
has only two parameters is this one, where mu i is
this a function of xi where I have some parameter alpha
here and some parameter h here. OK, so there’s clearly– so this function, there’s one
that essentially tells you– so this thing starts
at 0 for sure. And essentially,
alpha tells you how sharp this thing
is, and h tells you at which points you end here. Well, it’s not exactly what
those values are equal to, but that tells you this. OK, so, you know– simple, and– well, no, OK. Sorry, that’s actually alpha,
which is the maximum capture. The rate and h represent
the pre-density at which the capture weight is. So that’s the half time. OK, so there’s actual
value [INAUDIBLE].. All right, so now I
have this function. It’s certainly not a function. There’s no– I don’t see
it as a function of x. So I need to find something that
looks like a function of x, OK? So then here, there’s no log. There’s no– well, I could
actually take a log here. But I would have log of
x and log of x plus h. So that would be weird. So what we propose to
do here is to look, rather than looking at mu
i, we look 1 over mu i. Right, and so
since your function was mu i, when you
take 1 over mu i, you get h plus xi divided
by alpha xi, which is h over alpha times one
over xi plus 1 over alpha. And now if I’m willing to
make this transformation of variables and say,
actually, I don’t– my x, whether it’s
the density of prey or the inverse density of
prey, it really doesn’t matter. I can always make
this transformation when the data comes. Then I’m actually just
going to think of this as being some linear
function beta 0 plus beta 1, which is this guy,
times 1 over xi. And now my new variable
becomes 1 over xi. And now it’s linear. And the transformation
called the reciprocal link, OK? You can probably guess what the
exponential link is going to be and things like this, all right? So we’ll talk about other
links that have slightly less obvious names. Now again, modeling, right? So this was the
random component. This was the easy part. Now I need to just poor
in some domain knowledge about how do I think this
function, this y, which is which is the rate
of capture of praise, I want to understand how
this thing is actually changing what is the randomness
of the thing around its mean. And you know, something
that– so that comes from this textbook. The standing deviation
of capture rate might be approximately
proportional to the mean rate. You need to find a
distribution that actually has this property. And it turns out
that this happens for gamma distributions, right? In gamma distributions,
just like say, for Poisson distribution, the– well, for Poisson, the variance
and mean are of the same order. Here is the standard
deviation that’s of the same order as the
[INAUDIBLE] for gammas. And it’s a positive
distribution as well. So here is a candidate. Now since we’re
sort of constrained to work under the exponential
family of distributions, then you can just
go through your list and just decide which
one works best for you. All right, third example– so here we have binary response. Here, essentially the
binary response variable indicates the
presence or absence of postoperative deforming
for kyphosis on children. And here, rather than having
one covariance which was before, in the first example, was
time, in the second example was the density, here
there’s three ways that you measure on children. The first one is
age of the child and the second one is
the number of vertebrae involved in the operation. And the third one is
the start of the range, right– so where
it is on the spine. OK, so the response
variable here is, you know, did it work or not, right? I mean, that’s very simple. And so here, it’s nice
because the random component is the easiest one. As I said, any random variable
that takes only two outcomes must be a Bernoulli, right? So that’s nice there’s no
modeling going on here. So you know that y given x
is going to be Bernoulli, but of course, all
your efforts are going to try to understand
what the conditional mean of your Bernoulli, what
the conditional probability of being 1 is going to be, OK? And so in particular–
so I’m just– here, I’m spelling it out before
we close those examples. I cannot say that mu of x is x
transpose data for exactly this picture that I drew
for you here, right? There’s just no
way here– the goal of doing this is certainly
to be able to extrapolate for yet unseen children
whether this is something that we should be doing. And maybe the range
of x is actually going to be slightly out. And so, OK I don’t
want to see that have a negative probability of
outcome or a positive one– sorry, or one that’s
lower than one. So I need to make
this transformation. So what I need to do is
to transform mu, which is, we know only a number. All we know is a
number between 0 and 1. And we need to transform
it in such a way that it maps the
entire real line or reciprocally to say that– or inversely, I should say– that f of x
transpose beta should be a number between 0 and 1. I need to find a function
that takes any real number and maps it into 0 and 1. And we’ll see that
again, but you have an army of functions
that do that for you. What are those functions? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: I’m sorry? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: Trait? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: Oh. AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: Yeah, I want
them to be invertible, right? AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: I
have an army of function. I’m not asking for one
soldier in this army. I want the name of this army. AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: Well, they’re
not really invertible either, right? So they’re actually in
[INAUDIBLE] textbook. Because remember,
statisticians don’t know how to integrate
functions, but they know how to turn a function
into a Gaussian integral. So we know it integrates
to 1 and things like this. Same thing here–
we don’t know how to build functions that
are invertible and map the entire real line
to 0, 1, but there’s all the cumulative distribution
functions that do that for us. So I can you any of
those guys, and that’s what I’m going to
be doing, actually. All right, so just
to recap what I just said as we were speaking, so
normal linear model is not appropriate for these examples
if only because the response variable is not
necessarily Gaussian and also because the
linear model has to be– the mean has to be transformed
before I can actually apply a linear model for all
these plausible nonlinear models that I
actually came up with. OK, so the family
we’re going to go for is the exponential
family of distributions. And we’re going to
be able to show– so one of the nice
part of this is to actually compute
maximum likelihood estimaters for those right? In the linear model,
maximum– like, in the Gauss linear model, maximum likelihood
was as nice as it gets, right? This actually was the
least squares estimator. We had a close form. x transpose x inverse
x transpose y, and that was it, OK? We had to just take
one derivative. Here, we’re going to have a
generally concave likelihood. We’re not going to
be able to actually solve this thing
directly in close form unless it’s Gaussian,
but we will have– we’ll see actually
how this is not just a black box optimization
of a concave function. We have a lot of properties
of this concave function, and we will be able to show
some iterative algorithms. We’ll basically see how, when
you opened the box of convex optimization, you will actually
be able to see how things work and actually implement
it using least squares. So each iteration of
this iterative algorithm will essentially
be a least squares, and that’s actually
quite [INAUDIBLE].. So, very demonstrative
of statisticians being pretty
ingenious so that they don’t have to call in
some statistical software but just can repeatedly
call their least squares Oracle within a
statistical software. OK, so what is the
exponential family, right? I promised to do the
exponential family. Before we go into
this, let me just tell you something about
exponential families, and what’s the only
thing to differentiate an exponential family from
all possible distributions? An exponential family has
two parameters, right? And those are not
really parameters, but there’s this theta parameter
of my distribution, OK? So it’s going to be
indexed by some parameter. Here, I’m only talking
about the distribution of, say, some random variable
or some random vector, OK? So here in this slide, you see
that the parameter theta that indexed those distribution
is k dimensional and the space of the x’s
that I’m looking at– so that should really be y, right? What I’m going to
plug in here is the conditional distribution
of y given x and theta is going to depend on x. But this really is the y. That’s their distribution
of the response variable. And so this is on q, right? So I’m going to
assume that y takes– q dimensional–
is q dimensional. Clearly soon, q is
going to be equal to 1, but I can define those
things generally. OK, so I have this. I have to tell you
what this looks like. And let’s assume that this is
a probability density function. So this, right this notation,
the fact that I just put my theta in
subscript, is just for me to remember that
this is the variable that indicates the random variable,
and this is just the parameter. But I could just write it as a
function of theta and x, right? This is just going to be–
right, if you were in calc, in multivariable
calc, you would have two parameter of theta
and x and you would need to give me a function. Now think of all– think of x and theta as being
one dimensional at this point. Think of all the
functions that can be depending on theta and x. There’s many of them. And in particular, there’s many
ways theta and x can interact. What the exponential
family does for you is that it restricts
the way these things can actually interact
with each other. It’s essentially
saying the following. It’s saying this is going to
be of the form exponential– so this exponential is
really not much because I could put a log next to it. But what I want is that
the way theta and x interact has to be of
the form theta times x in an exponential, OK? So that’s the
simplest– that’s one of the ways you can think of
them interacting is you just the product of the two. Now clearly, this is
not a very rich family. So what I’m allowing
myself is to just slap on some terms that depend only
on theta and depend only on x. So let’s just call this thing, I
don’t know, f of x, g of theta. OK, so here, I’ve restricted the
way theta and x can interact. So I have something
that depends only on x, something that
depends only on theta. And here, I have this
very specific interaction. And that’s all that exponential
families are doing for you, OK? So if we go back to this slide,
this is much more general, right? if I want to go from
theta and x in r to theta and x theta in r– to theta in r k and x in rq,
I cannot take the product of theta and x. I cannot even take the inner
product between theta and x because they’re not even
of compatible dimensions. But what I can do is to first
map my theta into something and map my x into something
so that I actually end up having the same dimensions. And then I can take
the inner product. That’s the natural
generalization of this simple product. OK, so what I have is– right, so if I want
to go from theta to x, when I’m going to first
do is I’m going to take theta, eta of theta– so let’s say eta1 of
theta to eta k of theta. And then I’m going
to actually take x becomes t1 of x all
the way to tk of x. And what I’m going to do
is take the inner product– so let’s call this eta
and let’s call this t. And I’m going to take the inner
product of eta and t, which is just the sum from j equal
1 to k of eta j of theta times tj of x. OK, so that’s just a way to say
I want this simple interaction but in higher dimension. The simplest way I can actually
make those things happen is just by taking inner product. OK, and so now what
it’s telling me is that the distribution– so
I want the exponential times something that depends only
on theta and something that depends only on x. And so what it tells
me is that when I’m going to take
p of theta x, it’s just going to be something
which is exponential times the sum from j equal 1
to k of eta j theta tj of x. And then I’m going to have a
function that depends only– so let me read it for now
like c of theta and then a function that
depends only on x. Let me call it h of x. And for convenience,
there’s no particular reason why I do that. I’m taking this
function c of theta and I’m just actually
pushing it in there. So I can write c of theta as
exponential minus log of 1 over c of theta, right? And now I have exponential
times exponential. So I push it in, and
this thing actually looks like exponential sum
from j equal 1 to k of eta j theta tj of x minus log 1
over c of theta times h of x. And this thing here, log 1 over
c of theta, I call actually b of theta Because
c, I called it c. But I can actually
directly call this guy b, and I don’t actually
care about c itself. Now why don’t I put back
also h of x in there? Because h of x is
really here to just– how to put it– OK, h of x and b of theta
don’t play the same role. B of theta in many ways is a
normalizing constant, right? I want this density
to integrate to 1. If I did not have
this guy, I’m not guaranteed that this
thing integrates to 1. But by tweaking this function
b of theta or c of theta– they’re equivalent– I can actually ensure that
this thing integrates to 1. So b of theta is just
a normalizing constant. H of x is something that’s
going to be funny for us. It’s going to be
something that allows us to be able to treat both
discrete and continuous variables within the framework
of exponential families. So for those that are
familiar with this, this is essentially
saying that that h of x is really just a
change of measure. When I actually look at
the density of p of theta– this is with respect
to some measure– the fact that I just multiplied
by a function of x just means that I’m not looking– that this guy here
without h of theta is not the density with respect
to the original measure, but it’s the density with
respect to the distribution that has h as a density. That’s all I’m saying, right? So I can first transform my
x’s and then take the density with respect to that. If you don’t want to think
about densities or measures, you don’t have to. This is just the way– this is just the definition. Is there any question
looks complicated, but it’s actually
essentially the simplest way you could think about it. You want to be able to
have x and theta interact and you just say, I
want the interaction to be of the form
exponential x times theta. And if they’re
higher dimensions, I’m going to take
the exponential of the function
of x inner product with a function of theta. All right, so I claimed
since the beginning that the Gaussian
was such an example. So let’s just do it. So is the Gaussian of the– is
the interaction between theta and x in a Gaussian of
the form in the product? And the answer is yes. Actually, whether I know or
not what the variance is, OK? So let’s start for the case
where I actually do not know what the variance is. So here, I have x is
n mu sigma squared. This is all one dimensional. And here, I’m going to assume
that my parameter is both mu and sigma square. OK, so what I need to do is
to have some function of mu, some function of stigma square,
and take an inner product of some function of x and
some other function of x. So I want to show that– so p theta of x is what? Well, it’s one over
square root sigma 2 pi exponential minus x minus mu
squared over 2 sigma squared, right? So that’s just my
Gaussian density. And I want to say that
this thing here– so clearly, the exponential
shows up already. I want to show that this
is something that looks like, you know, eta 1 of– sorry, so that was– yeah, eta
1 of, say, mu sigma squared. So I have only
two of those guys, so I’m going to need
only two etas, right? So I want it to be eta 1
of mu and sigma times t1 of x plus eta 2 mu 1 mu sigma
squared times t2 of x, right? So I want to have something
like that that shows up, and the only things
that are left, I want them to depend either
only on theta or only on x. So to find that out,
we just need to expand. OK, so I’m going to first put
everything into my exponential and expand this guy. So the first term here
is going to be minus x squared over 2 sigma square. The second term is
going to be minus mu squared over two sigma squared. And then the cross term is
going to be plus x mu divided by sigma squared. And then I’m going
to put this guy here. So I have a minus log
sigma over 2 pi, OK? OK, is this– so this term
here contains an interaction between X and the parameters. This term here
contains an interaction between X and the parameters. So let me try to write
them in a way that I want. This guy only depends
on the parameters, this guy only depends
on the parameter. So I’m going to
rearrange things. And so I claim that this
is of the form x squared. Well, let’s say– do– who’s getting the minus? Eta, OK. So it’s x squared times
minus 1 over 2 sigma squared plus x times mu
over sigma squared, right? So that’s this term here. That’s this term here. Now I need to get this guy
here, and that’s minus. So I’m going to write
it like this– minus, and now I have mu
squared over 2 sigma squared plus log sigma
square root 2 pi. And now this thing is definitely
of the form t of x times– did I call them the
right way or not? Of course not. OK, so that’s going to
be t2 of x times eta 2 of x eta 2 of theta. This guy is going to be t1
of x times eta 1 of theta. All right, so just a function
of theta times a function of x– just a function of theta
times a function of x. And the way combined is
just by sending them. And this is going
to be my d of theta. What is h of x? AUDIENCE: 1. PHILIPPE RIGOLLET: 1. There’s one thing I
can actually play with, and this is something you’re
going to have some three choices, right? This is not actually completely
determined here is that– for example, so when I write
the log sigma square root 2 pi, this is just log of sigma
plus log square root 2 pi. So I have two choices here. Either my b becomes
this guy, or– so either I have
b of theta, which is mu squared over 2 sigma
squared plus log sigma square root 2 pi and h of
x is equal to 1, or I have that b of theta is mu
square over 2 sigma squared plus log sigma. And h of x is equal to what? Well, I can just push
this guy out, right? I can push it out
of the exponential. And so it’s just square
root of 2 pi, which is a function of x, technically. I mean, it’s a constant function
of x, but it’s a function. So you can see that it’s
not completely clear how you’re going to do
the trade off, right? So the constant terms can
go either in b or in h. But you know, why bother with
tracking down b and h when you can actually stuff
everything into one and just call h one
and call it a day? Right, so you can
just forget about h. You know it’s one and
think about the right. H won’t matter actually for
estimation purposes or anything like this. All right, so that’s basically
everything that’s written. When stigma square
is known, what’s happening is that this
guy here is no longer a function of theta, right? Agreed? This is no longer a parameter. When sigma square is known,
then theta is equal to mu only. There’s no sigma
square going on. So this– everything
depends on sigma square can be thought of as a constant. Think one. So in particular, this
term here does not belong in the interaction
between x and theta. It belongs to h, right? So if sigma is known, then this
guy is only a function of h– of x. So h of x becomes exponential
x squared minus x squared over 2 sigma squared, right? That’s just a function of x. Is that clear? So if you complete this
computation, what you’re going to get is that your new
one parameter thing is that p theta x is not equal to
exponential x times mu over sigma squared minus– well, it’s still the same thing. And then you have your
h of x that comes out– x squared over 2 sigma squared. OK, so that’s my h of x. That’s still my b of theta. And this is my t1 of x. And this is my eta one of theta. And remember, theta is just
equal to mu in this case. So if I ask you prove that
this distribution belongs to an exponential family,
you just have to work it out. Typically, it’s expanding what’s
in the exponential and see what’s– and just write it in
this term and identify all the components, right? So here, notice those guys
don’t even get an index anymore because there’s
just one of them. So I wrote eta 1 and t1, but
it’s really just eta and t. Oh sorry, this guy also goes. This is also a constant, right? So it can actually
just put sigma divided by sigma square root 2 pi. So h of x is what, actually? Is it the density of– AUDIENCE: Standard [INAUDIBLE]. PHILIPPE RIGOLLET:
It’s not standard. It’s centered. It has mean 0. But it variance
sigma squared, right? But it’s the density
of a Gaussian. And this is what I
meant when I said h of x is really just telling
you with respect to which distribution, which measure
you’re taking the density. And so this thing here
is really telling you the density of my
Gaussian with mean mu is equal to– is this with
respect to a centered Gaussian is this guy, right? That’s what it means. If this thing ends
up being a density, it just means that now you
just have a new measure, which is this density. So it’s just saying
that the density of the Gaussian with
mean mu with respect to the Gaussian with mean 0
is just this [INAUDIBLE] here. All right, so let’s move on. So here, as I said,
you could actually do all these computations
and forget about the fact that x is continuous. You can actually do it with PMFs
and do it for x is discrete. This actually also tells
you if you can actually get the same form for
your density, which is of the form exponential
times the product of the the interaction
between theta and x is just
taking this product, then a function only of theta
and of function only of x, for the PMF, it also works. OK, so I claim
that the Bernoulli belongs to this family. So the PMF of a Bernoulli– we say parameter p is p to the
x 1 minus p to the 1 minus x, right? Because we know so that’s
only for x equals 0 or 1. And the reason is because
when x is equal to 0, this is 1 minus p. When x is equal to
1, this is minus 0. OK, we’ve seen that
when we’re looking at likelihoods for Bernoullis. OK, this is not clear this is
going to look like this at all. But let’s do it. OK, so what does
this thing look like? Well, the first
thing I want to do is to make an
exponential show up. So what I’m going
to write is I’m going to write p to the x as
exponential x log p, right? And so I’m going to do
that for the other one. So this thing here– so I’m going to get
exponential x log p plus 1 minus x log 1 minus p. So what I need to do is
to collect my terms in x and my terms in whatever
parameters I have, see here if theta is equal to p. So if I do this,
what I end up having is equal to exponential– so determine x is log
p minus log 1 minus p. So that’s x times
log p over 1 minus p. And then the term
that rest is just– that stays is just 1
times log 1 minus p. But I want to see this as
a minus something, right? It was minus b of theta. So I’m going to
write it as minus– well, I can just keep the
plus, and I’m going to do– and that’s all [INAUDIBLE]. A-ha! Well, this is of the
form exponential– something that depends only on
x times something that depends only on theta– minus a function that
depends only on theta. And then h of x is
equal to 1 again. OK, so let’s see. So I have t1 of x is equal to x. That’s this guy. Eta 1 of theta is equal
to log p1 minus p. And b of theta is equal to
log 1 over 1 minus p, OK? And h of x is equal
to 1, all right? You guys want to do
Poisson, or do you want to have any homework? It’s a dilemma because that’s
an easy homework versus no homework at all but maybe
something more difficult. OK, who wants to do it now? Who does not want to
raise their hand now? Who wants to raise
their hand now? All right, so let’s move on. I’ll just do– do you want
to do the gammas instead in the homework? That’s going to be fun. I’m not even going to
propose to do the gammas. And so this is the
gamma distribution. It’s brilliantly
called gamma because it has the gamma function just
like the beta distribution had the beta function in there. They look very similar. One is defined over r plus,
the positive real line. And remember, the beta was
defined over the interval 0, 1. And it’s of the form x to
some power times exponential of minus x to some– times something, right? So there’s a function of
polynomial [INAUDIBLE] x where the exponent
depends on the parameter. And then there’s the exponential
minus x times something depends on the parameters. So this is going to also look
like some function of x– sorry, like some
exponential distribution. Can somebody guess what
is going to be t2 of x? Oh, those are the functions of
x that show up in this product, right? Remember when we have this– we just need to take
some transformations of x so it looks linear in those
things and not in x itself. Remember, we had x squared
and x, for example, in the Gaussian case. I don’t know if
it’s still there. Yeah, it’s still there, right? t2 was x squared. What do you think x is
going– t2 of x here. So here’s a hint.
t1 is going to be x. AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET:
Yeah, [INAUDIBLE],, what is going to be t1? Yeah, you can–
this one is taken. This one is taken. What? Log x, right? Because this x to
the a minus 1, I’m going to write that as
exponential a minus 1 log x. So basically, eta 1 is
going to be a minus 1. Eta 2 is going to
be minus 1 over b– well, actually the opposite. And then you’re going to have– but this is actually
not too complicated. All right, then those
parameters get names. a is the shape parameter,
b is the scale parameter. It doesn’t really matter. You have other things that
are called the inverse gamma distribution, which
has this form. The difference is that
the parameter alpha shows negatively there and
then the inverse Gaussian distribution. You know, just densities
you can come up with and they just happened
to fall in this family. And there’s other ones that
you can actually put in there that we’ve seen before. The chi-square is actually
part of this family. The beta distribution
is part of this family. The binomial distribution
is part of this family. Well, that’s easy because
the Bernoulli was. The negative binomial, which
is some stopping time– the first time you hit a
certain number of successes when you flip some
Bernoulli coins. So you can check
for all of those, and you will see that you can
actually write them as part of the exponential family. So the main goal
of this slide is to convince you that
this is actually a pretty broad range
of distributions because it basically includes
everything we’ve seen but not anything there– sorry, plus more, OK? Yeah. AUDIENCE: Is there any
example of a distribution that comes up
pretty often that’s not in the exponential family? PHILIPPE RIGOLLET:
Yeah, like uniform. AUDIENCE: Oh, OK, so maybe
a bit more complicated than [INAUDIBLE]. Anything Anything that
has a support that depends on the parameter
is not going to fall– is not going to fit in there. Right, and you can
actually convince yourself why anything that
has the support that does not– that depends
on the parameter is not going to be
part of this guy. It’s kind of a hard thing to– in fact, you proved that it’s
not and you prove this rule. That’s kind of a
little difficult, but the way you can convince
yourself is that remember, the only interaction between
x and theta that I allowed was taking the
product of those guys and then the exponential, right? If you have something that
depends on some parameter– let’s say you’re going to see
something that looks like this. Right, for uniform,
it looks like this. Well, this is not of the form
exponential x times theta. There’s an interaction
between x and theta here, but it’s actually
certainly not of the form x exponential x times theta. So this is definitely
not going to be part of the exponential family. And every time you start
doing things like that, it’s just not going to happen. Actually, to be fair,
I’m not even sure that all these
guys, when you allow them to have all
their parameters free, are actually going
to be part of this. For example– the
beta probably is, but I’m not actually
entirely convinced. There’s books on
experiential families. All right, so let’s go back. So here, we’ve put a lot
of effort understanding how big, how much wider than
the Gaussian distribution can we think of for the
conditional distribution of our response y given x. So let’s go back to the
generalized linear models, right? So [INAUDIBLE] said, OK,
the random component? y has to be part of
some exponential family distribution– check. We know what this means. So now I have to
understand two things. I have to understand what
is the expectation, right? Because that’s actually
what I model, right? I take the expectation, the
conditional expectation, of y given x. So I need to understand
given this guy, it would be nice if you had some
simple rules that would tell me exactly what the expectation
is rather than having to do it over and over again, right? If I told you,
here’s a Gaussian, compute the
expectation, every time you had to use that would
be slightly painful. So hopefully, this thing
being simple enough– we’ve actually
selected a class that’s simple enough so that
we can have rules. Whereas as soon as they give you
those parameters t1, t2, eta 1, eta 2, b and h, you can
actually have some simple rules to compute the mean and
variance and all those things. And so in particular, I’m
interested in the mean, and I’m going to have to
actually say, well, you know, this mean has to be mapped
into the whole real line. So I can actually talk
about modeling this function of the mean as x transpose beta. And we saw that for
the [INAUDIBLE] dataset or whatever other data sets. You actually can– you can
actually do this using the log of the reciprocal or for the– oh, actually, we didn’t
do it for the Bernoulli. We’ll come to this. This is the most important
one, and that’s called a logit it or a logistic link. But before we go there,
this was actually a very broad family, right? When I wrote this thing on the
bottom board– it’s gone now, but when I wrote it
in the first place, the only thing that I wrote
is I wanted x times theta. Wouldn’t it be nice if you
have some distribution that was just x times theta,
not some function of x times some function of theta? The functions seem to be
here so that they actually make things a little– so the functions were here
so that I can actually put a lot of functions there. But first of all,
if I actually decide to re-parametrize my
problem, I can always assume– if I’m
one dimensional, I can always assume
that eta 1 of theta becomes my new theta, right? So this thing–
here for example, I could say, well,
this is actually the parameter of my Bernoulli. Let me call this
guy theta, right? I could do that. Then I could say, well, here
I have x that shows up here. And here since I’m talking
about the response, I cannot really make
any transformations. So here, I’m going to actually
talk about a specific family for which this guy is not x
square or square root of x or log of x or anything I want. I’m just going to actually
look at distributions for which this is x. This exponential
families are called a canonical exponential family. So in the canonical
exponential family, what I have is that I have my x times theta. I’m going to allow myself
some normalization factor phi, and we’ll see, for
example, that it’s very convenient when I talk
about the Gaussian, right? Because even if I know– yeah, even if I know this guy,
which I actually pull into my– oh, that’s over here, right? Right, I know sigma squared. But I don’t want to
change my parameter to be mu over sigma squared. It’s kind of painful. So I just take mu, and
I’m going to keep this guy as being this phi over there. And it’s called the
dispersion parameter from a clear analogy
with the Gaussian, right? That’s the variance and
that’s measuring dispersion. OK, so here, what
I want is I’m going to think throughout this class–
so phi may be known or not. And depending–
when it’s not known, this actually might turn
into some exponential family or it might not. And the main reason is because
this b of theta over phi is not necessarily a function
of theta over phi, right? If I actually have phi
unknown, then y theta over phi has to be– this guy has to be
my new parameter. And b might not be a function
of this new parameter. OK, so in a way,
it may or may not, but this is not really a
concern that we’re going to have because throughout
this class, we’re going to assume that
phi is known, OK? Phi is going to be known all the
time, which means that this is always an exponential family. And it’s just the
simplest one you could think of– one
dimensional parameter, one dimensional response, and I just
have– the product is just y times or, we used to call it x. Now I’ve switched to y, but y
times theta divided by phi, OK? Should I write this or this is
clear to everyone what this is? Let me write it somewhere so
we actually keep track of it toward the [INAUDIBLE]. OK, so this is– remember, we had all
the distributions. And then here we had
the exponential family. And now we have the
canonical exponential family. It’s actually
much, much smaller. Well, actually, it’s probably
sort of a good picture. And what I have is that
my density or my PMF is just exponential
y times theta minus b of theta divided by phi. And I have plus phi of– oh, yeah, plus phi
of y phi, which means that this is really–
if phi is known, h of y is just exponential
c of y phi, agreed? Actually, this is the reason
why it’s not necessarily a canonical family. It might not be that
this depends only on y. It could depend on y and
phi in some annoying way and I may not be
able to break it. OK, but if phi is known,
this is just a function that depends on y, agreed? In particular, I
think you need– I hope you can convince
yourself that this is just a subcase of everything
we’ve seen before. So for example, the Gaussian
when the variance is known is indeed of this form, right? So we still have
it on the board. So here is my y, right? So then let me write
this as f theta of y. So every x is replaceable
with y, blah, blah, blah. This is this guy. And now what I have is that
this is going to be my phi. This is my parameter of theta. So I’m definitely of the form
y times theta divided by phi. And then here I
have a function b that depends only on
theta over phi again. So b of theta is mu
squared divided by 2. OK, then it’s divided
by 6 sigma square. And then I have
this extra stuff. But I really don’t care
what it is for now. It’s just something that depends
only on y and known stuff. So it was just a function
of y just like my h. I stuff everything in there. The b, though, this
thing here, this is actually what’s
important because in the canonical
family, if you think about it, when you know phi– sorry– right, this
is just y times theta scaled by a known
constant– sorry, y times theta scaled by a known
constant is the first term. The second term is b of theta
scaled by some known constant. But b of theta is
what’s going to make the difference between the
Gaussian and Bernoullis and gammas and betas– this is all in this b
of theta. b of theta contains everything
that’s idiosyncratic to this particular distribution. And so this is going
to be important. And we will see that b of theta
is going to capture information about the mean,
about everything. Should I go through
this computation? I mean, it’s the same. We’ve just done it, right? So maybe it’s probably better
if you can redo it on your own. All right, so the canonical
exponential family also has other distributions, right? So there’s the Gaussian
and there’s the Poisson and there’s the Bernoulli. But the other ones may not
be part of this, right? In particular, think about
the gamma distribution. We had this– log x was one
of the things that showed up. I mean, I cannot get
rid of this log x. I mean, that’s part of it
except if a is equal to 1 and I know it for sure, right? So if a is equal to 1, then
I’m going to have a minus 1, which is equal to 0. So I’m going to have
a minus 1 times log x, which is going to be just 0. So log x is going
to vanish from here. But if a is equal to 1,
then this distribution is actually much nicer, and
it actually does not even deserve the name gamma. What is it if a is equal to 1? It’s an exponential, right? Gamma 1 is equal to 1. x to
the a minus 1 is equal to 1. b– so I have exponential
x over b divided by b. So 1 over b– call it lambda. And this is just an
exponential distribution. And so every time you’re
going to see something– so all these guys that
don’t make it to this table, they could be part of those
guys, but they’re just more– they’re just to– they just have another
name in this thing. All right, so you could
compute the value of theta for different values, right? So again, you still have some
continuous or discrete ones. This is my b of theta. And I said this is actually
really what captures my theta. This b is actually called
cumulant generating function, OK? I don’t have time. I could write five
slides to explain to you, but it would just only
tell you why it’s called cumulant generating function. It’s also known as the log of
the moment generating function. And the way it’s called
cumulant generating function is because if I start taking
successive derivatives and evaluating them at 0, I
get the successive cumulance of this distribution, which
are some transformation of the moments. AUDIENCE: What are you
The function b. AUDIENCE: [INAUDIBLE] PHILIPPE RIGOLLET: So this
is just normalization. So this is just to tell
you I can compute this, but I really don’t care. And obviously I don’t care
about stuff that’s complicated. This is actually cute, and this
is what completes everything. And the rest is just like
some general description. You only need to tell
you that the range of y is 0 to infinity, right? And that is
essentially telling me this is going to give me some
hints as to which link function I should be using, right? Because the range
of y tells me what the range of expectation
of y is going to be. All right, so here, it
tells me that the range of y is between 0 and 1. OK, so what I want
to show you is that this captures a
variety of different ranges that you can have. OK, so I’m going to want
to go into the likelihood. And the likelihood
I’m actually going to use to compute
the expectations. But since I actually
don’t have time to do this now, let’s just
go quickly through this and give you spoiler alert to
make sure that you all wake up on Thursday and
really, really want to think about coming
here immediately. All right, so the thing
I’m going to want to do, as I said, is it would
be nice if, at least for this canonical
family, when I give you b, you would be able
to say, oh, here is a simple computation of b
that would actually give me the mean and the variance. The mean and the variance
are also known as moments. b is called cumulant
generating function. So it sounds like
moments being related to cumulance, I might have a
path to finding those, right? And it might involve taking
derivatives of b, as we’ll see. The way we’re
going to prove this by using this thing that
we’ve used several times. So this property we use
when we’re computing, remember, the fisher
information, right? We had two formulas for
the fisher information. One was the expectation of the
second derivative of the log likelihood, and one was negative
expectation of the square– sorry, expectation of the
square, and the other one was negative the expectation of
the second derivative, right? The log likelihood is concave,
so this number is negative, this number is positive. And the way we did this is by
just permuting some derivative and integral here. And there was just– we
used the fact that something that looked like this, right? The log likelihood
is log of f theta. And when I take the derivative
of this guy with respect to theta, then I
have something that looks like the derivative
divided by f theta. And if I start taking the
integral against f theta of this thing, so the
expectation of this thing, those things would cancel. And then I had just the
integral of a derivative, which I would make a leap of faith
and say that it’s actually the derivative of the integral. But this was equal to 1. So this derivative was
actually equal to 0. And so that’s how you
got that the expectation of the derivative of the log
likelihood is equal to 0. And you do it once again
and you get this guy. It’s just some nice
things that happen with the [INAUDIBLE] taking
derivative of the log. We’ve done that,
we’ll do that again. But once you do this, you
can actually apply it. And– missing a
parenthesis over there. So when you write
the log likelihood, it’s just log of an exponential. Huh, that’s actually
pretty nice. Just like the least squares
came naturally, the least squares [INAUDIBLE]
came naturally when we took the log
likelihood of the Gaussians, we’re going to have the
same thing that happens when I take the log of the density. The exponential is
going to go away, and then I’m going
to use this formula. But this formula is
going to actually give me an equation directly–
oh, that’s where it was. So that’s the one
that’s missing up there. And so the expectation
minus this thing is going to be equal
to 0, which tells me that the expectation
is just the derivative. Right, so it’s still
a function of theta, but it’s just a derivative of b. And the variance
is just going to be the second derivative of b. But remember, this was some
sort of a scaling, right? It’s called the
dispersion parameter. So if I had a Gaussian and
the variance of the Gaussian did not depend on
the sigma squared which I stuffed in this phi,
that would be certainly weird. And it cannot depend only
on mu, and so this will– for the Gaussian, this is
definitely going to be equal to 1. And this is just going to
be equal to my variance. So this is just by taking
the second derivative. So basically, the take-home
message is that this function b captures– by taking one derivative
of the expectation and by taking two derivatives
captures the variance. Another thing
that’s actually cool and we’ll come
back to this and I want to think about is if
this second derivative is the variance, what can