Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, . This year (actually also the year before), I discuss extensions to multinomial regressions, where is a function on some simplex. The multinomial logistic model was mention here. The idea is to consider, for instance with three possible classes

the following model

and

Now, what about a *real* Binomial model, , where ‘s are known. How should we run such a regression model ? Consider the following dataset

> set.seed(1) > n=100 > N=1+rpois(n,5) > X1=runif(n) > X2=rexp(n) > s=X2-X1-2 > p=exp(s)/(1+exp(s)) > vY=NULL > for(i in 1:n){ + Y=rbinom(1,prob=p[i],size=N[i]) + vY=c(vY,Y) + } > db=data.frame(Y=vY,N=N,X1,X2) > head(db,4) Y N X1 X2 1 0 5 0.6547239 0.76318001 2 1 5 0.3531973 1.57271671 3 3 6 0.2702601 1.83564098 4 1 9 0.9926841 0.03715227

My first idea was to say that it should be simple since if (and only if)

where are i.i.d. random variables . So, a natural idea is to generate the dataset containing the ‘s

> vY=vX1=vX2=vN=NULL; > for(i in 1:n){ + vY=c(vY,c(rep(0,db$N[i]-db$Y[i]),rep(1,db$Y[i]))) + vX1=c(vX1,rep(db$X1[i],db$N[i])) + vX2=c(vX2,rep(db$X2[i],db$N[i])) + } > largedb=data.frame(Z=vY,X1=vX1,X2=vX2) > head(largedb,16) Z X1 X2 1 0 0.6547239 0.763180 2 0 0.6547239 0.763180 3 0 0.6547239 0.763180 4 0 0.6547239 0.763180 5 0 0.6547239 0.763180 6 0 0.3531973 1.572717 7 0 0.3531973 1.572717 8 0 0.3531973 1.572717 9 0 0.3531973 1.572717 10 1 0.3531973 1.572717 11 0 0.2702601 1.835641 12 0 0.2702601 1.835641 13 0 0.2702601 1.835641 14 1 0.2702601 1.835641 15 1 0.2702601 1.835641 16 1 0.2702601 1.835641

Then, we run a standard Bernoulli regression on those ‘s

> reg1=glm(Z~X1+X2,family=binomial,data=largedb)

But actually, if look around, on the internet, you can see (e.g. in Alan Agresti’s R_web.pdf chapter) that it is possible to run – *directly* – a binomial regression, using the following syntax,

> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)

I was a bit scared because a few weeks ago, I tried two techniques to run a regression on contingency tables, and the output were different (on the standard deviation actually, not the estimation). Here we have the same thing

> coefficients(summary(reg1)) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0547550 0.2875231 -7.146399 8.908380e-13 X1 -0.9159275 0.3970303 -2.306946 2.105781e-02 X2 1.0564059 0.1360305 7.765952 8.103448e-15 > coefficients(summary(reg2)) Estimate Std. Error z value Pr(>|z|) (Intercept) -2.0547550 0.2875234 -7.146392 8.908817e-13 X1 -0.9159275 0.3970313 -2.306941 2.105813e-02 X2 1.0564059 0.1360310 7.765923 8.105285e-15

almost if we take into account the fact that numerical algorithm might be ran with different starting point. But the going thing is that – theoretically – the output should be exactly the same. Simply because here, we solve the same first order conditions ! The likelihood in the first case was

which will be simplified as

which is what we have in the second case. The use of the weight function will insure that the variance here are equal.

A third way would be to provide a response matrix with the number of successes and number of failures, respectively. This is what I personally find easiest to remember:

R> reg3 <- glm(cbind(Y, N-Y) ~ X1 + X2, data = db, family = binomial)

The result is also the same as before:

R> coefficients(summary(reg3))

Estimate Std. Error z value Pr(>|z|)

(Intercept) -2.0547550 0.2875234 -7.146392 8.908817e-13

X1 -0.9159275 0.3970313 -2.306941 2.105813e-02

X2 1.0564059 0.1360310 7.765923 8.105285e-15

Thanks Achim

I really like that one ! I will try to keep it in mind…