INSEA             Dimensionality Reduction Techniques - 2025

TP 4: PCA and non-linear visualization methods

 Monday, 14 December 2025            Author: Hicham Janati

Part 1

The following code loads a dataset of handwritten digit images. The data matrix X contains the images: each row is an 8x8 image flattened into a vector of 64 pixel values. The vector y contains the true digit for each image. In ML, we call y the labels or targets (“étiquettes” in French).

import numpy as np
from sklearn.datasets import load_digits
from matplotlib import pyplot as plt

digits = load_digits()
X = digits.data
y = digits.target

print(f"The data has shape {X.shape} and the labels have shape {y.shape}")
print("The first image looks like:")
plt.figure(figsize=(3, 3))
plt.imshow(X[0].reshape(8, 8), cmap="Greys")
plt.axis("off")
plt.title(f"The label of this image is {y[0]}")
plt.show()

Question 1

Visualize the first 8 images of the dataset in a single figure, with their true label shown as the title for each image.
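One possible starting point, offered as a sketch rather than a reference solution: create a row of 8 subplots with plt.subplots and draw one image per panel. The 1x8 layout, the figure size and the "Label: ..." title format are arbitrary choices made for this illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# One row of 8 panels (layout chosen for illustration); each panel shows one image
fig, axes = plt.subplots(1, 8, figsize=(12, 2))
for ax, image, label in zip(axes, X[:8], y[:8]):
    ax.imshow(image.reshape(8, 8), cmap="Greys")  # back from 64 values to 8x8
    ax.set_title(f"Label: {label}")
    ax.axis("off")
plt.show()
```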

Question 2

Prepare the data to perform a principal component analysis and compute the covariance matrix.
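A minimal sketch of this step, assuming that "preparing" the data means centering each pixel (column) at zero. Whether to also standardize is a judgment call left to you; some pixels appear to be constant in this dataset, so dividing by the standard deviation would need care.

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data
n_samples = X.shape[0]

# Center each feature (pixel) so that its mean is zero
X_centered = X - X.mean(axis=0)

# Sample covariance matrix of the features, shape (64, 64)
cov = X_centered.T @ X_centered / (n_samples - 1)
print(cov.shape)
```

The result can be cross-checked against np.cov(X, rowvar=False), which computes the same quantity.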

Question 3

Determine the principal axes and their variances, then visualize the 2-dimensional projection with plt.scatter, indicating the percentage of variance explained by each principal axis (for example in the axis labels). What can you conclude from this?
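The sketch below shows one way to go from the covariance matrix to a 2D projection, assuming the data has only been centered. It uses np.linalg.eigh, which is intended for symmetric matrices and returns eigenvalues in ascending order, hence the explicit re-sorting. Variable names such as explained and Z are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

X = load_digits().data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigen-decomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort the principal axes by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Percentage of the total variance carried by each axis
explained = 100 * eigvals / eigvals.sum()

# Coordinates of the samples on the first two principal axes
Z = X_centered @ eigvecs[:, :2]

plt.scatter(Z[:, 0], Z[:, 1], s=5)
plt.xlabel(f"PC1 ({explained[0]:.1f}% of variance)")
plt.ylabel(f"PC2 ({explained[1]:.1f}% of variance)")
plt.show()
```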

Question 4

In the 2D projection figure, color each point according to its true label to see whether PCA separates the digits into distinct clusters. Check the arguments of plt.scatter by running plt.scatter? (this syntax works in IPython or Jupyter):

plt.scatter?
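As an illustration, the c and cmap arguments of plt.scatter can map labels to colors. The sketch below uses scikit-learn's PCA class for brevity instead of a manual eigen-decomposition; the "tab10" colormap is simply a convenient choice for 10 classes.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target

# 2D PCA projection (PCA centers the data internally)
Z = PCA(n_components=2).fit_transform(X)

# c=y assigns one color per digit; cmap picks the palette
sc = plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(sc, label="Digit")
plt.show()
```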

Question 5

Visualize the scree plot (the curve of the cumulative percentage of explained variance as a function of the principal component index, in decreasing order of importance) for this PCA. What do you think about it?
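A possible sketch of the cumulative curve, assuming scikit-learn's PCA fitted with all components. The attribute explained_variance_ratio_ is already sorted in decreasing order, so a cumulative sum is enough.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Keep all components so the curve reaches 100%
pca = PCA().fit(X)
cumulative = 100 * np.cumsum(pca.explained_variance_ratio_)

plt.plot(np.arange(1, len(cumulative) + 1), cumulative, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance (%)")
plt.show()
```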

Part 2: Non-linear methods

MDS, Isomap and t-SNE are implemented in scikit-learn’s manifold module:

from sklearn.manifold import Isomap, TSNE, MDS, ClassicalMDS
from sklearn.decomposition import PCA

For example, you can run classical MDS as follows (ClassicalMDS is only available in recent scikit-learn releases, so upgrade if the import fails):

mds = ClassicalMDS(n_components=2)
X_mds = mds.fit_transform(X)
print(X_mds.shape)

Question 6:

Compare the PCA projections with those of classical MDS, both visually and computationally. Are they equivalent?
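For the numerical side of the comparison, here is a hand-written NumPy sketch of classical MDS on Euclidean distances. It is not scikit-learn's ClassicalMDS implementation, and it uses only the first 500 samples to keep the n x n matrices small. Principal axes are defined only up to a sign, so the signs are aligned before comparing.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data[:500]  # subset chosen for speed
n = X.shape[0]

# Squared Euclidean distances between all pairs of samples
D2 = cdist(X, X, metric="sqeuclidean")

# Double centering turns squared distances into a Gram matrix
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ D2 @ J

# The top-2 eigenpairs of B give the classical MDS coordinates
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]
X_cmds = eigvecs[:, top] * np.sqrt(eigvals[top])

X_pca = PCA(n_components=2).fit_transform(X)

# Align the sign of each axis, then compare the coordinates
signs = np.sign(np.sum(X_cmds * X_pca, axis=0))
same = np.allclose(X_cmds * signs, X_pca, atol=1e-6)
print(f"Same coordinates up to sign: {same}")
```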

Question 7:

Now use metric MDS (the MDS class). Is the visualization similar? Compare with PCA and classical MDS.

Question 8:

Run Isomap and experiment with its main argument, n_neighbors. Check the Isomap documentation.
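A sketch of such an exploration, assuming a 500-sample subset for speed and three arbitrary values of n_neighbors. With few neighbors the neighborhood graph may be disconnected, in which case scikit-learn may emit a warning.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

digits = load_digits()
X, y = digits.data[:500], digits.target[:500]  # subset chosen for speed

neighbor_values = (5, 15, 30)  # arbitrary values for illustration
fig, axes = plt.subplots(1, len(neighbor_values), figsize=(12, 4))
embeddings = {}
for ax, k in zip(axes, neighbor_values):
    # One Isomap embedding per neighborhood size
    embeddings[k] = Isomap(n_neighbors=k, n_components=2).fit_transform(X)
    ax.scatter(embeddings[k][:, 0], embeddings[k][:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"n_neighbors = {k}")
plt.show()
```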

Question 9:

Run TSNE and experiment with the perplexity argument. Check the TSNE documentation.
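One way to compare several perplexity values side by side, again on a 500-sample subset to keep the run short. The values 5, 30 and 50 are arbitrary, and random_state is fixed only to make the figure reproducible.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()
X, y = digits.data[:500], digits.target[:500]  # subset chosen for speed

perplexities = (5, 30, 50)  # arbitrary values for illustration
fig, axes = plt.subplots(1, len(perplexities), figsize=(12, 4))
embeddings = {}
for ax, perplexity in zip(axes, perplexities):
    # perplexity must stay below the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    ax.scatter(embeddings[perplexity][:, 0], embeddings[perplexity][:, 1],
               c=y, cmap="tab10", s=5)
    ax.set_title(f"perplexity = {perplexity}")
plt.show()
```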

Question 10:

Install UMAP (pip install umap-learn) and run it on the same dataset. The package follows the same API conventions as scikit-learn, including .fit_transform:

from umap import UMAP

u = UMAP(n_components=2)

# to do

Part 3: SVHN dataset

Now we move on to a more complex dataset: SVHN (Street View House Numbers). The following code downloads a small subset of it:

import os, requests
url = "https://www.dropbox.com/scl/fi/5u0s8hdv7wyzh8rhndi2o/small_svhn.npz?rlkey=l0zsmchiymz8qmdjb16koijqa&dl=1"

local_path = "small_svhn.npz"

if not os.path.exists(local_path):
    print("Downloading small_svhn.npz...")
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
else:
    print("Using cached small_svhn.npz")

# Load the archive once and read both arrays from it
data = np.load(local_path)

X = data["X"]
y = data["y"]

print(f"The shape of the dataset is {X.shape}")
print(f"The first labels are {y[:10]}")

Question 11:

Visualize the first few images and their labels. What do you notice?

Question 12:

Apply PCA, t-SNE and UMAP to this data. Was the result expected?
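scikit-learn estimators expect a 2D array of shape (n_samples, n_features), so image arrays with more than two dimensions must be flattened first. The sketch below illustrates this on random stand-in data. The shape (200, 32, 32, 3) is a hypothetical example, not the confirmed layout of small_svhn.npz, so check X.shape before adapting it.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data with a hypothetical shape (samples first);
# replace X_images with the real array after checking its shape
rng = np.random.default_rng(0)
X_images = rng.random((200, 32, 32, 3))

# Flatten each image into a single row of features
X_flat = X_images.reshape(len(X_images), -1)

Z = PCA(n_components=2).fit_transform(X_flat)
print(X_flat.shape, Z.shape)
```

The same X_flat array can then be passed to TSNE or UMAP through .fit_transform.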