import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from scipy.optimize import check_grad
# Load the digits dataset
digits = load_digits()
X = digits.data
y = digits.target
print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"Unique labels: {np.unique(y)}")INSEA Techniques de réduction de dimension - 2025
TP 5: Optimization and supervised representations
Author: Hicham Janati
How to follow this lab:
- The goal is to understand AND retain in the long term: resist copy-pasting, prefer typing manually.
- Getting stuck while programming is completely normal: search online, use documentation, or use the AI.
- When prompting the AI, you must be specific. Explain that your goal is to learn, not to get an instant solution no matter what. Ask for short, explained answers with alternatives.
- NEVER ASK THE AI TO PRODUCE MORE THAN ONE LINE OF CODE!
- Adopt the Solve-It method: always try to solve a question or predict the output of code before running it. Learning happens when you confirm your understanding, and even more when you're wrong and surprised.
Part 1: Logistic Regression from Scratch
In this first part, we will implement logistic regression from scratch using only NumPy. This will help you understand the fundamental concepts of machine learning: loss functions, gradients, and optimization.
We’ll work with the digits dataset from scikit-learn (that we used last week), which contains images of handwritten digits (0-9). For simplicity, we’ll start with a binary classification problem: distinguishing digit 0 from digit 1.
1.1 Loading and preparing the data
Let’s start by loading the digits dataset and preparing it for binary classification:
Question 1:
Filter the dataset to keep only digits 0 and 1. Then split the data into training and test sets (use 80% for training, 20% for test). What are the shapes of your training and test sets?
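If you get stuck, here is a minimal sketch of one possible approach (the names mask, X_bin and y_bin are only suggestions, not imposed by the lab):
# Sketch (assumes X and y loaded above)
mask = (y == 0) | (y == 1)          # keep only digits 0 and 1
X_bin, y_bin = X[mask], y[mask]
X_train, X_test, y_train, y_test = train_test_split(
    X_bin, y_bin, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # roughly 80% / 20% of about 360 samples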
Question 2:
Normalize the features by subtracting the mean and dividing by the standard deviation. Why is this important? Apply the normalization to both training and test sets, but compute the mean and standard deviation only from the training set.
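A minimal sketch, assuming X_train and X_test from Question 1; the small constant 1e-8 is an arbitrary guard against features with zero variance:
# Statistics are computed on the training set only, then reused for the test set
mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)
X_train_norm = (X_train - mean_train) / (std_train + 1e-8)
X_test_norm = (X_test - mean_train) / (std_train + 1e-8)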
1.2 The logistic regression model
Logistic regression models the probability that a sample belongs to class 1 using the sigmoid function:
\[P(y=1 | x) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}}\]
where \(w\) is the weight vector (including the bias term) and \(\sigma\) is the sigmoid function.
Question 3:
Implement the sigmoid function. What is the range of its output? What happens when the input is very large (positive or negative)?
def sigmoid(z):
"""
Compute the sigmoid function: sigma(z) = 1 / (1 + exp(-z))
Args:
z: Input (can be a scalar or array)
Returns:
Sigmoid of z
"""
# TODO: implement the sigmoid function
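If you want to sanity-check your answer, here is one possible sketch (not the only valid one); the clipping bound of 500 is an arbitrary safeguard against overflow warnings for very negative inputs:
def sigmoid(z):
    # clip to avoid overflow in exp(-z) when z is very negative
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))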
Question 4:
Implement a function that computes the predictions (probabilities) for a given weight vector \(w\) and input data \(X\). The function should return probabilities for each sample.
def predict_proba(X, w):
"""
Compute the probability P(y=1 | x) for each sample in X.
Args:
X: Input data (n_samples, n_features)
w: Weight vector (n_features,)
Returns:
Probabilities (n_samples,)
"""
# test it with random weights
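A possible one-line implementation and a quick test, assuming the sigmoid above and the normalized data X_train_norm from Question 2:
def predict_proba(X, w):
    # probability P(y=1 | x) for each row of X
    return sigmoid(X @ w)

rng = np.random.default_rng(0)
w_random = rng.normal(size=X_train_norm.shape[1])
print(predict_proba(X_train_norm[:5], w_random))  # five probabilities in (0, 1)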
1.3 The loss function
For binary classification, we use the binary cross-entropy loss (also called log loss):
\[L(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\sigma(w^T x_i)) + (1-y_i) \log(1-\sigma(w^T x_i)) \right]\]
This loss function penalizes confident wrong predictions more than uncertain ones.
Question 5:
Implement the loss function. What happens if \(\sigma(w^T x_i) = 0\) when \(y_i = 1\)? How can we avoid numerical issues?
def compute_loss(X, y, w):
"""
Compute the binary cross-entropy loss.
Args:
X: Input data (n_samples, n_features)
y: True labels (n_samples,)
w: Weight vector (n_features,)
Returns:
Loss value (scalar)
"""
return loss
# Test the loss function
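A possible sketch, assuming predict_proba from the previous question and the data from Questions 1-2; clipping the probabilities away from 0 and 1 is one simple way to avoid log(0):
def compute_loss(X, y, w, eps=1e-12):
    p = predict_proba(X, w)
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loss

print(compute_loss(X_train_norm, y_train, w_random))  # sanity check with random weights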
1.4 Computing the gradient
To minimize the loss function using gradient descent, we need to compute its gradient with respect to the weights \(w\). Compute the gradient of the binary cross-entropy loss with pen and paper, then implement the gradient function. Verify that the output has the same shape as the weight vector \(w\).
def compute_gradient(X, y, w):
"""
Compute the gradient of the loss with respect to w.
Args:
X: Input data (n_samples, n_features)
y: True labels (n_samples,)
w: Weight vector (n_features,)
Returns:
Gradient vector (n_features,)
"""
return gradient
# Test the gradient
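A possible sketch, using the vectorized form of the gradient \(\nabla_w L(w) = \frac{1}{n} X^\top (\sigma(Xw) - y)\) and assuming predict_proba from above; the shape check is a quick sanity test:
def compute_gradient(X, y, w):
    n = X.shape[0]
    gradient = X.T @ (predict_proba(X, w) - y) / n
    return gradient

g = compute_gradient(X_train_norm, y_train, w_random)
print(g.shape)  # should match w_random.shape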
1.5 Gradient checking
Before implementing gradient descent, it’s crucial to verify that our gradient computation is correct. We can do this by computing a numerical approximation of the gradient and comparing it with our analytical gradient.
Question 7:
You can use scipy.optimize.check_grad to verify that your gradient implementation of the loss is correct. Here is an example with the function \[ x \mapsto f(x, a, b) = a \|x\|^2 + b^\top x \] whose gradient is given by \[ \nabla_x f(x, a, b) = 2 a x + b\] Adapt this pattern to your own loss and gradient; a possible sketch is given after the example below.
# Wrapper functions for check_grad
import numpy as np
from scipy.optimize import check_grad
dim = 10
a = 5
b = np.random.randn(dim)
def f(x, a, b):
return a * np.linalg.norm(x)**2 + b @ x
def grad_f(x, a, b):
return 2 * a * x + b
def loss_wrapper(x):
return f(x, a, b)
def grad_wrapper(x):
return grad_f(x, a, b)
x_check = np.random.randn(dim)
error = check_grad(loss_wrapper, grad_wrapper, x_check)
print(f"Gradient check error: {error:.2e}")1.6 Gradient descent
1.6 Gradient descent
Now that we’ve verified our gradient, we can implement gradient descent to minimize the loss function. The update rule is:
\[w_{t+1} = w_t - \alpha \nabla_w L(w_t)\]
where \(\alpha\) is the learning rate.
Question 8:
Implement gradient descent. The function should:
1. Initialize the weights (you can use zeros or small random values)
2. For each iteration:
   - Compute the gradient
   - Update the weights
   - Optionally store the loss for visualization
3. Return the final weights and the history of losses
Visualize the loss curve as a function of the number of iterations.
def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000, verbose=True):
"""
Perform gradient descent to minimize the loss.
Args:
X: Input data (n_samples, n_features)
y: True labels (n_samples,)
learning_rate: Step size for gradient descent
n_iterations: Number of iterations
verbose: Whether to print progress
Returns:
w: Final weight vector
loss_history: List of loss values at each iteration
"""
1.7 Effect of learning rate
The learning rate is a crucial hyperparameter. If it’s too small, convergence is slow. If it’s too large, the algorithm might diverge or oscillate.
Question 10:
Run gradient descent with different learning rates (e.g., 0.001, 0.01, 0.1, 1.0) and compare:
- The convergence speed
- The final loss value
- Whether the algorithm converges or diverges
Visualize the loss curves for different learning rates on the same plot and explain what you observe.
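A possible sketch, reusing the gradient_descent function above; 500 iterations is an arbitrary choice to keep the comparison fast:
# Same data, several learning rates on one plot
for lr in [0.001, 0.01, 0.1, 1.0]:
    _, losses_lr = gradient_descent(X_train_norm, y_train,
                                    learning_rate=lr, n_iterations=500, verbose=False)
    plt.plot(losses_lr, label=f"lr = {lr}")
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.legend()
plt.show()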
1.8 Evaluating on test data
Now that we’ve trained our model, we need to evaluate its performance on unseen test data. This is crucial to assess whether our model generalizes well.
Question 12:
Implement a function that:
1. Computes predictions (probabilities) for the test data
2. Converts probabilities to binary predictions (threshold = 0.5)
3. Computes the accuracy: (number of correct predictions) / (total number of samples)
What is the accuracy on the training set? On the test set? Are they similar?
def predict(X, w, threshold=0.5):
"""
Make binary predictions.
Args:
X: Input data (n_samples, n_features)
w: Weight vector (n_features,)
threshold: Decision threshold
Returns:
Binary predictions (n_samples,)
"""
def compute_accuracy(X, y, w):
"""
Compute the accuracy of predictions.
Args:
X: Input data (n_samples, n_features)
y: True labels (n_samples,)
w: Weight vector (n_features,)
Returns:
Accuracy (scalar between 0 and 1)
"""
Part 2: Neural Networks with PyTorch
In this second part, we’ll explore neural networks using PyTorch. We’ll build a neural network with one hidden layer and learn about automatic differentiation, which is one of the key features that makes deep learning frameworks powerful.
2.1 Introduction to PyTorch
PyTorch is a deep learning framework that provides automatic differentiation (autograd). This means we don’t need to manually compute gradients—PyTorch tracks operations and can compute gradients automatically.
Let’s start by importing PyTorch and understanding tensors:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
print(f"PyTorch version: {torch.__version__}")
# Create a simple tensor
x = torch.tensor([1.0, 2.0, 3.0])
print(f"Tensor x: {x}")
print(f"Tensor shape: {x.shape}")
print(f"Tensor dtype: {x.dtype}")2.2 Automatic Differentiation (Autograd)
The key feature of PyTorch is automatic differentiation. When we create a tensor with requires_grad=True, PyTorch tracks all operations on it and can compute gradients automatically.
Question 14:
Run the following code and explain what happens. What is the difference between requires_grad=True and requires_grad=False?
# Create a tensor that requires gradient computation
x = torch.tensor([2.0], requires_grad=True)
print(f"x: {x}")
print(f"x.requires_grad: {x.requires_grad}")
# Define a simple function: y = x^2
y = x ** 2
print(f"y: {y}")
print(f"y.requires_grad: {y.requires_grad}")
print(f"y.grad_fn: {y.grad_fn}") # This shows the operation that created y
# Compute the gradient
y.backward() # This computes dy/dx
print(f"x.grad: {x.grad}") # Should be 2*x = 4.0Question 15:
Try a more complex function: \(z = x^2 + 2xy + y^2\) where \(x=2\) and \(y=3\). Compute \(\frac{\partial z}{\partial x}\) and \(\frac{\partial z}{\partial y}\) using PyTorch’s autograd. Verify manually that the gradients are correct.
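One possible sketch if you want to check your approach (note that it reuses the names x and y from the question statement):
# Partial derivatives of z = x^2 + 2xy + y^2 at x=2, y=3
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
z = x**2 + 2*x*y + y**2
z.backward()
print(x.grad)  # dz/dx = 2x + 2y = 10
print(y.grad)  # dz/dy = 2x + 2y = 10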
2.3 Building a Neural Network
Now let’s build a neural network with one hidden layer. We’ll use PyTorch’s nn.Module class, which provides a convenient way to define neural networks.
Our network will have:
- Input layer: 64 features (the 8x8 image flattened)
- Hidden layer: 32 neurons with ReLU activation
- Output layer: 10 neurons (one for each digit 0-9); the softmax is applied implicitly by the cross-entropy loss during training
Question 17:
We create a neural network class that inherits from nn.Module. What is the purpose of the __init__ and forward methods?
class NeuralNetwork(nn.Module):
def __init__(self, input_size=64, hidden_size=32, output_size=10):
"""
Initialize the neural network.
Args:
input_size: Number of input features
hidden_size: Number of neurons in the hidden layer
output_size: Number of output classes
"""
super(NeuralNetwork, self).__init__()
self.fc1 = nn.Linear(input_size, hidden_size) # Input to hidden
self.fc2 = nn.Linear(hidden_size, output_size) # Hidden to output
self.relu = nn.ReLU() # Activation function
def forward(self, x):
"""
Forward pass through the network.
Args:
x: Input tensor (batch_size, input_size)
Returns:
Output tensor (batch_size, output_size)
"""
x = self.fc1(x)
x = self.relu(x)
x = self.fc2(x)
return x
# Create an instance of the network
model = NeuralNetwork(input_size=64, hidden_size=32, output_size=10)
print(model)
# Test with a random input
x_test = torch.randn(5, 64) # Batch of 5 samples
output = model(x_test)
print(f"\nInput shape: {x_test.shape}")
print(f"Output shape: {output.shape}")2.4 Preparing the Data
Before training, we need to convert our NumPy arrays to PyTorch tensors and create data loaders for efficient batch processing.
Question 18:
We convert the digits dataset to PyTorch tensors. Use all 10 classes (not just 0 and 1). What is the difference between torch.tensor() and torch.from_numpy()?
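A small illustration of the difference, assuming nothing beyond the imports above: torch.from_numpy shares memory with the NumPy array, while torch.tensor makes a copy:
a = np.zeros(3)
t_shared = torch.from_numpy(a)   # view on the same buffer as a
t_copy = torch.tensor(a)         # independent copy
a[0] = 7.0
print(t_shared)  # reflects the change in a
print(t_copy)    # unchanged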
# Load full digits dataset (all 10 classes)
digits = load_digits()
X_full = digits.data
y_full = digits.target
# Split into train and test
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
X_full, y_full, test_size=0.2, random_state=42
)
# Normalize
mean_train_full = X_train_full.mean(axis=0)
std_train_full = X_train_full.std(axis=0)
X_train_full_norm = (X_train_full - mean_train_full) / (std_train_full + 1e-8)
X_test_full_norm = (X_test_full - mean_train_full) / (std_train_full + 1e-8)
# Convert to PyTorch tensors
X_train_tensor = torch.from_numpy(X_train_full_norm).float()
y_train_tensor = torch.from_numpy(y_train_full).long()
X_test_tensor = torch.from_numpy(X_test_full_norm).float()
y_test_tensor = torch.from_numpy(y_test_full).long()
print(f"Training set: {X_train_tensor.shape}, {y_train_tensor.shape}")
print(f"Test set: {X_test_tensor.shape}, {y_test_tensor.shape}")
# Create data loaders for batch processing
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
print(f"\nNumber of batches in training set: {len(train_loader)}")
print(f"Batch size: {batch_size}")2.5 Training the Model
Now we’ll train the neural network. The training loop involves:
1. Forward pass: compute predictions
2. Compute the loss
3. Backward pass: compute gradients
4. Update the weights using an optimizer
We implement the training loop using:
- The cross-entropy loss (nn.CrossEntropyLoss)
- The stochastic gradient descent optimizer (optim.SGD)
- A learning rate of 0.01
Train for 100 epochs and print the loss every 10 epochs.
# Create model, loss function, and optimizer
model = NeuralNetwork(input_size=64, hidden_size=32, output_size=10)
criterion = nn.CrossEntropyLoss() # Loss function for multi-class classification
optimizer = optim.SGD(model.parameters(), lr=0.01) # Stochastic Gradient Descent
# Training loop
n_epochs = 100
train_losses = []
for epoch in range(n_epochs):
epoch_loss = 0.0
n_batches = 0
# Iterate over batches
for batch_X, batch_y in train_loader:
optimizer.zero_grad()
# Forward pass
outputs = model(batch_X)
# Compute loss
loss = criterion(outputs, batch_y)
# Backward pass
loss.backward()
# Update weights
optimizer.step()
epoch_loss += loss.item()
n_batches += 1
avg_loss = epoch_loss / n_batches
train_losses.append(avg_loss)
if (epoch + 1) % 10 == 0:
print(f"Epoch [{epoch+1}/{n_epochs}], Loss: {avg_loss:.4f}")
# Plot training loss
plt.figure(figsize=(10, 5))
plt.plot(train_losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True)
plt.show()
Question 20:
Evaluate the model on the test set. Compute the accuracy. How does it compare to the training accuracy?
def evaluate_model(model, data_loader):
"""
Evaluate the model on a dataset.
Returns:
accuracy: Classification accuracy
"""
model.eval() # Set model to evaluation mode (disables dropout, etc.)
correct = 0
total = 0
with torch.no_grad(): # Disable gradient computation for efficiency
for batch_X, batch_y in data_loader:
outputs = model(batch_X)
_, predicted = torch.max(outputs.data, 1) # Get predicted class
total += batch_y.size(0)
correct += (predicted == batch_y).sum().item()
accuracy = correct / total
return accuracy
# Evaluate on training and test sets
train_accuracy = evaluate_model(model, train_loader)
test_accuracy = evaluate_model(model, test_loader)
print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")