Training neural networks can be really tedious. Both in terms of methodological complexity – tuning hyperparameters, syncing training runs, scheduling learning rates, etc. – and also hardware requirements.

That’s why historically large architectures (like ResNet) have been “pretrained”, or trained on broad tasks (image classification, segmentation, object detection, etc) by more-or-less powerful organizations – like Google – and then, practitioners froze the original weights adding only a few layers at the last part of the network which were the ones being trained.

This approach enabled researchers, individuals and even companies to leverage previously learned knowledge in those pretrained models.

## How do I finetune without a high end GPU / TPU?

Nonetheless, the explosion of big models like LLMs or diffusion models made even more difficult to the general public to gather the necessary compute power for training them from scratch. Only a handful of powerful companies are capable of doing this.

So, in order to do it, a similar scheme emerged: pretraining big models on very generalist tasks (like text modeling), usually in an unsupervised (or semisupervised) setting, releasing those weights to the public and let practitioners further train them on more specific tasks (for example, running a few more epochs on their own corpus).

This, while drastically reducing the required amount of data and compute time, left us with the “compute memory problem”. Most of these new models, even in their simplest versions, barely fit on most consumer-level GPUs (like a RTX3080). (*NOTE: This is usually addressed by setting up multi-GPU settings*)

The problem of being unable to fit these models usually arises when running backpropagation, when it’s necessary to store a copy on memory of all the partial derivatives of all the matrix weights, usually doubling the memory footprint.

Several hacks have been proposed for tacking or easing this constraint, but the one I’m talking about on this post is LoRA: Low-Rank Adaptation of Large Language Models.

## What is LoRA?

LoRA, or “Low Rank Adaptation” is a very clever yet useful technique which allows us to train big models with less memory than if we fully trained the model as usual. Just for reference, using this method, one can reduce the amount of memory for training Stable Diffusion XL from ~32GB down to less than 8GB (and also proportionally faster!).

## How does this method work?

When training a neural net, on each backpropagation run we’re basically calculating a delta of the weights on each layer ($\Delta W$) which are added to the previously weights to update them. In an equation form this is expressed as:

\[\begin{align*} W_{new} = W_{old} + \Delta W \end{align*}\]The main idea starts here: we can decompose the matrix $\Delta W$ into two (smaller) matrices: $W_A$ and $W_B$. That’s to say, we’re factorizing them! And we can rewrite the it as:

\[\begin{align*} W_{new} = W_{old} + \Delta W = W_{old} + W_B W_A \end{align*}\]This matrices should be “rectangular” with one common dimension: $W_A \in \mathbb{R}^{r \times d}; W_B \in \mathbb{R}^{k \times r}$. $d$ and $k$ match the original dimensions of the weight matrix and $r \ll \min(d, k)$.

So, if you noticed the equations above, the number of entries on the original matrix $\Delta W$ is far greater than the $W_A$ and $W_B$ matrices combined. For example, with a $10 \times 10$ sized $\Delta W$, we can create $W_A, W_B$ of $5 \times 2$ and $2 \times 5$ respectively ($10 \times 10 = 100 \gg 10 + 10 = 20$). It’s not hard to imagine this difference will grow bigger and bigger with larger $\Delta W$ matrices.

This method is mainly applied on Dense layers (with are the easiest to apply to). During training, we freeze the original layers, and train $W_A$ and $W_B$ via regular backprop and, after being done, we recompose the original $\Delta W$ by multiplying the two decomposed matrices $W_B W_A$, which in turn we sum to the original weights ($W_{old}$)

## Okay, nice. But WHY does it work?

Previously, papers like The lottery ticket hypothesis have stated that, after trained, neural networks contain sparse weights and can be pruned in order to reduce their size. Related with that, research Aghajanyan et al. pointed out that, after fine-tuning, LLM’s layers could be further reduced in size and still maintain a good performance.

Translated to maths, this mean that layers get “an intrinsic low dimension”, or that weight matrices have not full rank (i.e. the same “knowledge” could be reproduced with fewer dimensions because some of them are being squeezed after the transformation). If we can remove the column dependencies of the matrix, we save compute and training time!

The last question is: *how do we chose “$r$”? Is there a rule? Does it have a tradeoff choosing a smaller “$r$”?*

Well, researchers study this effect in the paper by applying LoRA with different “$r$” to the self-attention layers of GPT-2/3 and evaluating it in several datasets and found no noticeable differences (see tables 5 and 6 of the paper), which could mean that, even compressing the matrices, the same knowledge for the downstream tasks is preserved.

## Implementation

Let’s see a very straightforward example on how to LoRA is applied to the training of a very simple model consisting on only one transformation matrix. The same idea can be applied to larger models by selecting Dense layers and substituting them for LoRA layers, though.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

import torch
import torch.nn as nn
import torch.nn.functional as F
# Let us suppose a pretrained network with shape input_dim x output_dim
W = torch.eye(100, requires_grad=False)
# Just for reference, we build a "target" W matrix which represents the
# final (finetuned) version of the W matrix. As stated by the literature, this
# matrix isn't full-rank, so we artificially set it to have 99 linear combinations
# of the first column.
W_finetuned = torch.ones([100, 100])
for i in range(1, 100):
W_finetuned[:, i] *= i

*Note the $W$ matrix has frozen weights, thus, we won’t be retraining it and the gradients for it won’t be computed and held into memory.*

Then, we initialize the 2 weight sub-matrices as shown in the LoRA paper

1
2
3
4
5
6
7
8
9
10
11

rank = 4 # The rank 'r' for this low-rank adaptation
W_B = nn.Parameter(torch.empty(W.shape[0], rank)) # LoRA weight B
W_A = nn.Parameter(torch.empty(rank, W.shape[1])) # LoRA weight A
# Initialization of LoRA weights
nn.init.zeros_(W_B);
nn.init.normal_(W_A);
print(f"Number of original parameters: {W.shape[0] * W.shape[1]}")
print(f"Number of LoRA parameters: {W_B.shape[0] * W_B.shape[1] + W_A.shape[0] * W_A.shape[1]}")

1
2

Number of original parameters: 10000
Number of LoRA parameters: 800

Let’s create a dummy dataset for training (finetuning) the sub-matrices

1
2
3
4
5
6
7
8

# We create a dummy training dataset (consinsting only on one batch)
input_batch = torch.rand((32, 100))
# We simulate a target dataset (in wich we finetune) by passing the batch through
# the "non-full-rank" matrix.
target_batch = input_batch @ W_finetuned
print((input_batch.shape, target_batch.shape))

1

(torch.Size([32, 100]), torch.Size([32, 100]))

Now, we train the $W_A$ and $W_B$ matrices as told on the paper.

1
2
3
4
5
6
7
8
9
10
11
12
13

losses = []
optimizer = torch.optim.Adam([W_A, W_B], lr=1e-2)
for epoch in range(1_000):
optimizer.zero_grad()
out = input_batch @ (W + (W_B @ W_A))
loss = F.l1_loss(target_batch, out, reduce='mean')
loss.backward()
optimizer.step()
losses.append(float(loss.detach().numpy()))
print(losses[-3:])

1

[0.5827709436416626, 0.6526086330413818, 0.6803646683692932]

Now, we update the original $W$ matrix by summing the product of the factorized matrices:

1
2
3
4
5

alpha = rank
scaling_factor = alpha/rank # in this example the scaling factor is 1
# Update parameters!
W += (W_B @ W_A) * scaling_factor

If everything has gone well, the updated $W$ matrix should be very similar as the expected one (`W_finetuned`

). Let’s check it out:

1
2
3
4

print("Finetuned should be:")
print(W_finetuned)
print("Actual W is:")
print(W)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

Finetuned should be:
tensor([[ 1., 1., 2., ..., 97., 98., 99.],
[ 1., 1., 2., ..., 97., 98., 99.],
[ 1., 1., 2., ..., 97., 98., 99.],
...,
[ 1., 1., 2., ..., 97., 98., 99.],
[ 1., 1., 2., ..., 97., 98., 99.],
[ 1., 1., 2., ..., 97., 98., 99.]])
Actual W is:
tensor([[ 1.9471, 1.0072, 1.9705, ..., 96.9968, 97.9976, 98.9610],
[ 1.0144, 1.9725, 2.0391, ..., 97.0116, 98.0161, 99.0161],
[ 0.9610, 1.0239, 2.9211, ..., 96.9896, 97.9768, 98.9119],
...,
[ 0.9898, 0.9909, 2.0054, ..., 97.9946, 98.0018, 98.9830],
[ 0.9855, 0.9858, 2.0140, ..., 96.9801, 98.9850, 98.9716],
[ 0.9781, 1.0012, 1.9835, ..., 96.9722, 97.9784, 99.9474]],
grad_fn=<AddBackward0>)

Nice! As you can see, the matrices are pretty close despite the fact we’ve finetuned with only 8% of the original number of parameters!

If you want to learn more about how LoRA can be applied to your models, check out this repo: https://github.com/microsoft/LoRA. It contains re-implementations of common layer types supporting LoRA out of the box.

## Closing up

LoRA is a very interesting idea that closes us, the mere mortals, the ability of playing with big models (and also makes it cheap!) by applying a simple matrix decomposition approach on Dense layers, which are currently present on most models using Attention mechanisms. There are also further improvements over LoRA, for example QLoRA, which is a Quantized version of LoRA that reduces even more the memory footprint!

Hopefully, we’ll see how these kind of techniques enable new practitioners enter the community!