Instead of training only module \( B \) while keeping module \( A \) permanently frozen,
LoRA-A2 alternates between the two:
LoRA module \( B \) is frozen during even rounds,
while module \( A \) is frozen during odd rounds.
This method preserves the optimization space while effectively resolving discordance.
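For illustration, the following is a minimal PyTorch-style sketch of the alternating schedule (not the authors' implementation); the names \texttt{lora\_layers}, \texttt{lora\_A}, and \texttt{lora\_B} are illustrative assumptions about how the two factors are exposed.
\begin{verbatim}
# Sketch of the alternating freezing schedule (illustrative only).
# Each LoRA layer is assumed to hold its two factors as
# nn.Parameter attributes named lora_A and lora_B.
def set_trainable_factors(lora_layers, round_idx):
    # Rounds are 1-indexed: odd rounds freeze A and train B
    # (B is trained first), even rounds freeze B and train A.
    train_B = (round_idx % 2 == 1)
    for layer in lora_layers:
        layer.lora_B.requires_grad = train_B
        layer.lora_A.requires_grad = not train_B
\end{verbatim}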
Specifically, when freezing \( A \), the frozen factor is shared across clients (\( A_k = A \) for all \( k \)), so
\[
\Delta W = \left( \sum_{k=1}^K w_k B_k \right) A = \sum_{k=1}^K w_k B_k A = \sum_{k=1}^K w_k B_k A_k = \sum_{k=1}^K w_k \Delta W_k,
\]
and when freezing \( B \) (\( B_k = B \) for all \( k \)), we have
\[
\Delta W = B \left( \sum_{k=1}^K w_k A_k \right) = \sum_{k=1}^K w_k B A_k = \sum_{k=1}^K w_k B_k A_k = \sum_{k=1}^K w_k \Delta W_k.
\]
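Both identities are straightforward to verify numerically; the following short NumPy check uses arbitrary illustrative shapes and client count.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
K, d, r = 3, 8, 2                # clients, width, LoRA rank (arbitrary)
w = rng.dirichlet(np.ones(K))    # aggregation weights summing to one

# Freezing A: clients share A and hold distinct B_k.
A = rng.normal(size=(r, d))
Bk = [rng.normal(size=(d, r)) for _ in range(K)]
agg_fac = sum(w[k] * Bk[k] for k in range(K)) @ A      # aggregate factors
agg_dW = sum(w[k] * (Bk[k] @ A) for k in range(K))     # aggregate updates
assert np.allclose(agg_fac, agg_dW)

# Freezing B: clients share B and hold distinct A_k.
B = rng.normal(size=(d, r))
Ak = [rng.normal(size=(r, d)) for _ in range(K)]
agg_fac = B @ sum(w[k] * Ak[k] for k in range(K))
agg_dW = sum(w[k] * (B @ Ak[k]) for k in range(K))
assert np.allclose(agg_fac, agg_dW)
\end{verbatim}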
In this way, LoRA-A2 trains both \( B \) and \( A \), ensuring that \( A \) does not remain fixed at its initial value.
(Note: Since the standard LoRA convention initializes \( B \) to zero, we train \( B \) first; with \( B = 0 \), the gradient with respect to \( A \) vanishes, so a first round that trains \( A \) would make no progress.
If a different initialization scheme is used, training \( A \) first could also be a reasonable choice.)
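The vanishing gradient under the standard initialization can be checked directly; the following small PyTorch snippet uses arbitrary illustrative shapes.
\begin{verbatim}
import torch

d, r = 8, 2
x = torch.randn(4, d)
A = torch.randn(r, d, requires_grad=True)
B = torch.zeros(d, r, requires_grad=True)  # standard LoRA: B starts at zero

out = x @ (B @ A).T                        # apply Delta W = BA to a batch
out.sum().backward()

print(A.grad.abs().max().item())  # 0.0 -- A gets no signal while B == 0
print(B.grad.abs().max().item())  # > 0 -- B can be trained immediately
\end{verbatim}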