INSEA Dimension Reduction Techniques - 2025
TP 3: Numpy: Linear algebra and PCA
Author: Hicham Janati
How to follow this lab:
- The goal is to understand AND retain in the long term: resist copy-pasting, prefer typing manually.
- Getting stuck while programming is completely normal: search online, use documentation, or use AI.
- When prompting the AI, be specific: explain that your goal is to learn, not to get an instant solution at any cost. Ask for short, explained answers with alternatives.
- NEVER ASK THE AI TO PRODUCE MORE THAN ONE LINE OF CODE!
- Adopt the Solve-It method: always try to solve a question or predict the output of code before running it. Learning happens when you confirm your understanding, and even more when you are wrong and surprised.
1 - Eigenvalues, eigenvectors, singular values, singular vectors
Eigenvalues and eigenvectors can be computed using the np.linalg.eig function. Example with a symmetric matrix:
import numpy as np
A = np.random.randn(5, 5)
A = A + A.T
eigenvalues, eigenvectors = np.linalg.eig(A)
Question 1
Verify the eigenvalue property \(Ax = \lambda x\) for all eigenvalues at once, in vectorized form. Do these eigenvectors form an orthogonal matrix?
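One possible vectorized check, sketched under the assumption that A, eigenvalues and eigenvectors are the variables defined above:
# A @ eigenvectors applies A to every eigenvector (stored as columns),
# while eigenvectors * eigenvalues scales each column by its eigenvalue.
print(np.allclose(A @ eigenvectors, eigenvectors * eigenvalues))
# For a symmetric matrix the eigenvectors can be chosen orthonormal,
# so V^T V should be close to the identity (up to numerical precision).
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(A.shape[0])))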
Question 2
If the matrix is Hermitian (for a real matrix, Hermitian is equivalent to symmetric), np.linalg.eigh should be preferred: it is more stable and faster. Compare their outputs and computation times for large matrices.
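A minimal sketch of such a comparison (the matrix B and the timing code below are just one possible setup; exact timings depend on your machine):
import time
B = np.random.randn(1000, 1000)
B = B + B.T
start = time.perf_counter()
vals_eig, vecs_eig = np.linalg.eig(B)
print(f"eig:  {time.perf_counter() - start:.3f} s")
start = time.perf_counter()
vals_eigh, vecs_eigh = np.linalg.eigh(B)
print(f"eigh: {time.perf_counter() - start:.3f} s")
# eig does not return eigenvalues in any particular order, so sort both
# sets of eigenvalues before comparing them.
print(np.allclose(np.sort(vals_eig.real), np.sort(vals_eigh)))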
The SVD can be computed for any rectangular matrix using np.linalg.svd
A = np.random.randn(5, 3)
U, S, V_t = np.linalg.svd(A)
Question 3:
Verify the SVD formula \(A = USV^\top\).
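One subtlety: by default np.linalg.svd returns a full 5 x 5 matrix U while S only holds 3 singular values, so the shapes must be matched when reconstructing A. A minimal sketch, assuming A, U, S, V_t come from the cell above:
# Either keep only the first 3 columns of U...
print(np.allclose(A, U[:, :3] @ np.diag(S) @ V_t))
# ...or request the thin SVD directly.
U_thin, S_thin, Vt_thin = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U_thin @ np.diag(S_thin) @ Vt_thin))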
Question 4:
What is the link between the singular values of A and the eigenvalues of \(A^\top A\)? Verify it numerically.
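A minimal sketch of one numerical check, assuming A and S come from the SVD cell above (note that np.linalg.eigvalsh returns eigenvalues in ascending order while the singular values are returned in descending order):
eigvals_AtA = np.linalg.eigvalsh(A.T @ A)
print(np.allclose(np.sort(eigvals_AtA)[::-1], S**2))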
Question 5:
We can define several different norms of a rectangular matrix of rank r. In the theoretical exercises, we have demonstrated that they can all be obtained using the singular values:
- The usual Frobenius norm: \(\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_{k=1}^r \sigma_k^2}\)
- The spectral norm (also known as the operator norm or L2 norm): \(\|A\|_{op} = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_k \sigma_k\)
- The nuclear norm (also known as the trace norm): \(\|A\|_{*} = \sum_{k} \sigma_k\)
Verify how to compute each of these norms both from the SVD and directly via np.linalg.norm by specifying the correct ord argument. Check the table in the Notes for the different ord values: https://numpy.org/devdocs/reference/generated/numpy.linalg.norm.html
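A minimal sketch comparing both routes on a small random matrix (the variable names below are only illustrative):
M = np.random.randn(6, 4)
sigma = np.linalg.svd(M, compute_uv=False)                  # singular values only
print(np.sqrt(np.sum(sigma**2)), np.linalg.norm(M, 'fro'))  # Frobenius norm
print(sigma.max(), np.linalg.norm(M, 2))                    # spectral norm
print(sigma.sum(), np.linalg.norm(M, 'nuc'))                # nuclear norm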
A = np.random.randn(5, 3)
2 - PCA with Python
The goal of this section is to make your own implementation of PCA with numpy. We will first be working with random data.
X = np.random.randn(100, 5)
Question 6:
Center the data and compute the empirical covariance matrix \(\Sigma\) of \(X\).
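A minimal sketch of one possible way to do this (the 1/n normalization below is a choice; the unbiased 1/(n-1) convention, or np.cov with rowvar=False, is equally valid):
X_centered = X - X.mean(axis=0)                    # subtract the column means
Sigma = X_centered.T @ X_centered / X.shape[0]     # empirical covariance, shape (5, 5)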
Question 7:
Compute the principal components of the PCA using the spectral decomposition of \(\Sigma\) and project the data onto the first k components.
k = 5
Question 8:
How could you have obtained this projection without computing \(\Sigma\)?
Question 9:
Package this logic in a function that takes a dataset and a number of components, and returns the eigenvalues and eigenvectors.
def compute_pca(dataset, n_components):
    # todo
    return eigenvalues, eigenvectors
Question 10:
Compute the explained variance ratio of each component and visualize the scree plot with matplotlib. The plotting code is given below:
from matplotlib import pyplot as plt
percentages = # todo
plt.figure(figsize=(10, 5))
plt.plot(percentages, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Scree Plot')
plt.show()
Question 11:
Using np.cumsum, modify this plot so that it shows the cumulative explained variance percentage.
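As a reminder, np.cumsum returns running sums along an array; a toy example unrelated to the PCA data:
print(np.cumsum([0.4, 0.3, 0.2, 0.1]))   # [0.4 0.7 0.9 1. ]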
Question 12:
How can we obtain the 2D PCA projection of the data? Visualize the projected data in 2D; here is the matplotlib code for that:
pca_2d_projection = # todo
plt.figure(figsize=(10, 5))
plt.scatter(pca_2d_projection[:, 0], pca_2d_projection[:, 1], marker='o')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection in 2D')
Question 13:
We would like to project a new data point onto the PCA space. Complete the following function that does this.
def project_on_pca(data, eigenvalues, eigenvectors, n_components):
    # todo
    return projection
3 - Introduction to Object Oriented Programming (OOP) in Python
The goal of this section is to nicely package the PCA code into a clean abstraction using a class. A class is a way to define a new type of object with attributes. Here's an example: we create a "Player" character in some 2D video game which can move around. Each character is defined by:
- its speed (fixed)
- its coordinates (x, y), which change during the game and are initially set to (0, 0).
A class needs a constructor function called __init__ that sets the attributes speed, x and y. The self argument is the object itself. The __init__ function is called when we create a new instance of the Player class.
class Player():
    def __init__(self, speed):
        self.speed = speed
        self.x = 0
        self.y = 0
We can create a variable of type Player with speed=5:
p1 = Player(5) # here __init__ is called with speed=5
print(f"p1.speed: {p1.speed} | p1.x: {p1.x} | p1.y: {p1.y}")
We can modify the attributes of the object:
p1.x += 10
print(f"p1.x: {p1.x}")
But that is not good practice! To keep things clean, we never change attributes directly; instead, we implement functions inside the class, called methods. We add a move method that changes the coordinates of the player given a direction and a duration:
class Player():
    def __init__(self, speed):
        self.speed = speed
        self.x = 0
        self.y = 0

    def move(self, direction_x, direction_y, dt=1):
        # directions must be -1, 0 or 1
        direction_x = np.sign(direction_x)
        direction_y = np.sign(direction_y)
        self.x += direction_x * self.speed * dt
        self.y += direction_y * self.speed * dt
p1 = Player(3)
p1.move(direction_x=1, direction_y=-1, dt=5)
print(f"p1.x: {p1.x} | p1.y: {p1.y}")
We would like to apply this to create our own PCA class. A PCA class is defined by its number of components and can be applied to any data.
Question 14:
Complete the following code using your previous implementation.
class MyPCA():
    def __init__(self, n_components):
        self.n_components = n_components
        self.eigenvalues = None
        self.eigenvectors = None
        # add attributes if needed

    def computePCA(self, data):
        # todo

    def applyPCA(self, new_data):
        # todo

    def plot_scree_plot(self):
        # todo

    def plot_cumulated_variance(self):
        # todo

    def plot_pca_projection(self, new_data):
        # todo
You can now create your own PCA object. Test its functions on random data:
X = np.random.randn(100, 5)
pca = MyPCA(2)
# todo
Question 15:
Compare the PCA you obtained with scikit-learn’s implementation which can be computed using:
from sklearn.decomposition import PCA
data = np.random.randn(100, 5)
pca_sklearn = PCA(n_components=2)
pca_sklearn.fit(data)
explained_variance_ratio = pca_sklearn.explained_variance_ratio_
projected_data = pca_sklearn.transform(data)
4 - PCA for Machine learning
PCA is mostly used as a preprocessing step when the dimension is very high. We will investigate this using synthetic data. The following code creates a dataset of variables (X) and binary classes (y) that we want to predict using k-NN. To keep things simple, we keep the number of observations (samples) fixed at n_samples=200 and only change the dimension (n_features):
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=200, n_features=300, n_informative=10)
print(X.shape)
print(y.shape)
print(f"the first 5 y are: {y[:5]}")
To test the performance of k-NN, we first split the data into a training set and a test set. The training set is where we assume y is known. The test set, on the other hand, is used to evaluate the model: we predict its labels with the model and evaluate the accuracy of these predictions using y_test.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
We can fit a k-NN classifier with 5 neighbors and predict the labels of X_test:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(f"the first 5 y_pred are: {y_pred[:5]}")
print(f"the first 5 y_test are: {y_test[:5]}")
Question 16:
Write a function with numpy that computes the average accuracy by comparing the predicted labels with the true labels. Compute the accuracy of the predictions on the test data.
def accuracy(y, y_true):
    # todo
Question 17:
Apply PCA keeping only 20 components, then fit the k-NN model on the transformed data. Is the performance better or worse?