Theory Assignment 3

COMP 451 - Fundamentals of Machine Learning

Winter 2021

Preamble

The assignment is due April 6th at 11:59pm via MyCourses. Late work will be automatically

subject to a 20% penalty, and can be submitted up to 5 days after the deadline. You may scan written

answers or submit a typeset assignment, as long as you submit a single pdf file with clear indication of

what question each answer refers to. You may consult with other students in the class regarding solution

strategies, but you must list all the students that you consulted with on the first page of your submitted

assignment. You may also consult published papers, textbooks, and other resources, but you must cite any

source that you use in a non-trivial way (except the course notes). You must write the answer in your own

words and be able to explain the solution to the professor, if asked.

Question 1 [13 points]

In class we introduced the Gaussian mixture model (GMM). In this question, we will consider a mixture

of Bernoulli distributions. Here, our data points will be defined as m-dimensional vectors of binary values

x ∈ {0, 1}^m.

First, we will introduce a single multivariate Bernoulli distribution, which is defined by a mean vector µ:

P(x|µ) = ∏_{j=0}^{m−1} µ[j]^{x[j]} (1 − µ[j])^{(1−x[j])}. (1)

Thus, we see that the individual binary dimensions are independent for a single multivariate Bernoulli.
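As a quick numerical illustration (not part of the assignment), the pmf in Equation 1 can be evaluated in a few lines of NumPy; the example values below are arbitrary:

```python
import numpy as np

def bernoulli_pmf(x, mu):
    """P(x | mu) = prod_j mu[j]^x[j] * (1 - mu[j])^(1 - x[j]), as in Equation 1."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.prod(mu**x * (1.0 - mu)**(1.0 - x)))

# Independence means the joint pmf is just a product of per-dimension factors:
p = bernoulli_pmf([1, 0, 1], [0.9, 0.2, 0.5])  # 0.9 * 0.8 * 0.5 = 0.36
```

Summing `bernoulli_pmf` over all 2^m binary vectors gives 1, confirming it is a valid distribution.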

Now, we can define a mixture of K multivariate Bernoulli distributions as follows:

P(x) = ∑_{k=0}^{K−1} π_k P(x|µ_k), (2)

where {µ_k, π_k, k = 0, .., K − 1} are the parameters of the mixture (the mixing weights satisfy π_k ≥ 0 and ∑_k π_k = 1) and P(x|µ_k) is the probability assigned to the point by each individual component in the model.

Note that the mean of each individual component distribution P(x|µ_k) is given by

E_k[x] = µ_k, (5)

and the covariance matrix of each component is given by

Cov[x] = Σ_k = diag(µ_k ⊙ (1 − µ_k)), (6)

where ⊙ denotes elementwise multiplication. In other words, the covariance matrix Σ_k for each component is a diagonal matrix with diagonal entries given by Σ_k[j, j] = µ_k[j](1 − µ_k[j]). It is a diagonal matrix because each dimension is independent.
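A small simulation can sanity-check Equation 6; this is a sketch with an arbitrary choice of µ_k, not part of the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_k = np.array([0.9, 0.2, 0.5])  # arbitrary component mean

# Equation 6: diagonal covariance with entries mu_k[j] * (1 - mu_k[j])
sigma_k = np.diag(mu_k * (1.0 - mu_k))

# Draw samples from this component: one independent Bernoulli per dimension
samples = (rng.random((200_000, 3)) < mu_k).astype(float)
emp_cov = np.cov(samples, rowvar=False)
# emp_cov is close to sigma_k; the off-diagonal entries are close to 0
```

The near-zero off-diagonal entries of the empirical covariance reflect the independence of the dimensions within a single component.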

Part 1 [8 points]

Derive expressions for the mean vector and the covariance matrix of the full mixture distribution defined in

Equation 2. That is, give expressions for the following:

E[x] = ? Cov[x] = ? (7)

Hint: use the fact that

Cov[x] = E[(x − E[x])(x − E[x])^⊤] = E[xx^⊤] − E[x] E[x]^⊤.


Part 2 [5 points]

Just as with a GMM, we can use the expectation maximization (EM) algorithm to learn the

parameters of a Bernoulli mixture model. Here, we will provide you with the formula for the expectation

step as well as the log-likelihood of the model. You must derive the formula for the maximization step.

Expectation step. In the expectation step of the Bernoulli mixture model, we compute scores r(x, k), which

tell us how likely it is that point x belongs to component k. These scores are computed as follows:

r(x, k) = π_k P(x|µ_k) / ∑_{j=0}^{K−1} π_j P(x|µ_j), (8)

where P(x|µ_k) is defined as in Equation 1.
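In vectorized form, the expectation step in Equation 8 might look like the following sketch; the data and parameter values are made up for illustration:

```python
import numpy as np

def bernoulli_pmf_rows(X, mu):
    # P(x | mu) for each row x of X, as in Equation 1
    return np.prod(mu**X * (1.0 - mu)**(1.0 - X), axis=1)

def e_step(X, pis, mus):
    """Responsibilities r(x, k) = pi_k P(x|mu_k) / sum_j pi_j P(x|mu_j)."""
    likelihoods = np.stack([bernoulli_pmf_rows(X, mu) for mu in mus], axis=1)
    weighted = pis * likelihoods          # numerator, one column per component
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([[1, 0, 1], [0, 0, 1]], dtype=float)
pis = np.array([0.6, 0.4])
mus = np.array([[0.9, 0.1, 0.8], [0.2, 0.3, 0.7]])
r = e_step(X, pis, mus)  # each row of r sums to 1
```

The normalization over components is what makes each row of r a valid probability distribution over the K components.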

Log-likelihood. Given a dataset of n points {x_1, . . . , x_n}, the log-likelihood of the Bernoulli mixture model is

log L = ∑_{i=1}^{n} log ∑_{k=0}^{K−1} π_k P(x_i|µ_k). (9)

Maximization step. You must find the formula for the µk parameters in the maximization step:

µ_k = ? (10)


Question 2 [5 points]

Recall that the low dimensional codes in PCA are defined as

z_i = U^⊤(x_i − µ), (11)

where U is a matrix containing the top-k eigenvectors of the covariance matrix and µ is the mean of the data points,

µ = (1/n) ∑_{i=1}^{n} x_i. (12)

Recall that the reconstruction of a point x_i using its code z_i is given by

x̃_i = U z_i + µ. (13)

Show that

(x̃_i − x_i)^⊤ (x̃_i − µ) = 0. (14)
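Before proving the identity, it can be reassuring to verify it numerically. A possible sketch with random data (the variable names are ours, not the assignment's):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)

# U: top-k eigenvectors (columns) of the covariance matrix
k = 2
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # orthonormal columns

x_i = X[0]
z_i = U.T @ (x_i - mu)      # Equation 11: low dimensional code
x_tilde = U @ z_i + mu      # Equation 13: reconstruction

inner = (x_tilde - x_i) @ (x_tilde - mu)  # Equation 14: ~0 up to float error
```

The check holds for any point and any k because the columns of U are orthonormal, which is the key fact the written proof should exploit.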


Question 3 [short answers; 2 points each]

Answer each question with 1-3 sentences for justification, potentially with equations/examples for support.

a) True or false: It is always possible to choose an initialization so that K-means converges in one iteration.

b) Suppose you are learning a decision tree for email spam classification. Your current sample of the training

data has the following distribution of labels:

[43+, 30−], (15)

i.e., the training sample has 43 examples that are spam and 30 that are not spam. Now, you are choosing

between two candidate tests.

Test 1 (T1) tests whether the number of words in the email is greater than 30 and would result in the

following splits:

• num words > 30 : [5+, 15−]

• num words ≤ 30: [38+, 15−]

Test 2 (T2) tests whether the email contains an external URL link and would result in the following splits:

• has link: [25+, 5−]

• not has link: [18+, 25−]

Which test should you use to split the data? I.e., which test provides a higher information gain?
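For reference, information gain can be computed mechanically from the label counts. A short sketch (the helper names are ours) that you can adapt to check your answer:

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a sample with pos positive and neg negative labels."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h

def info_gain(parent, splits):
    """H(parent) minus the size-weighted average entropy of the child splits."""
    n = sum(p + q for p, q in splits)
    children = sum((p + q) / n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - children

ig_t1 = info_gain((43, 30), [(5, 15), (38, 15)])   # Test 1 splits
ig_t2 = info_gain((43, 30), [(25, 5), (18, 25)])   # Test 2 splits
```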

c) Which of the following statements is false:

1. If the covariance between two variables is zero, then their mutual information is also zero.

2. Adding more features is a useful strategy to combat underfitting.

3. Decision trees can learn non-linear decision boundaries.

4. The Gaussian mixture model contains more parameters than K-means.