
Towards Robust and Efficient Federated
Low-Rank Adaptation with Heterogeneous Clients

1 Department of Computer Science and Engineering, 2 Graduate School of Artificial Intelligence
POSTECH, South Korea

*Equal Contribution   Corresponding Author

ACL 2025 (Main, Long)

LoRA-A2 comprises two key components: Alternating freeze and Adaptive rank selection.

Abstract

Federated fine-tuning for Large Language Models (LLMs) has recently gained attention due to the heavy communication overhead of transmitting large model updates. Low Rank Adaptation (LoRA) has been proposed as a solution, yet its application in federated learning is complicated by discordance in aggregation. Existing methods addressing this discordance often suffer from performance degradation at low ranks in heterogeneous data settings. In response, we introduce LoRA-A2 (Low Rank Adaptation with Alternating freeze and Adaptive rank selection), which demonstrates robustness in challenging settings with low ranks and high data heterogeneity. Our experimental findings reveal that LoRA-A2 maintains performance even under extreme heterogeneity and low rank conditions, achieving up to a 99.8% reduction in uploaded parameters compared to full fine-tuning without compromising performance. This adaptive mechanism boosts robustness and communication efficiency in federated fine-tuning, enabling the practical deployment of LLMs in resource-constrained environments.

Discordance Problem in Federated LoRA

LoRA approximates the fine-tuned weight as \( W = W_0 + \Delta W = W_0 + BA \), where \( W_0 \) is the frozen pre-trained matrix, and \( B \in \mathbb{R}^{d_1 \times r}, A \in \mathbb{R}^{r \times d_2} \) are low-rank matrices with rank \( r \ll \min\{d_1, d_2\} \). This reduces the number of trainable parameters from \( d_1 \cdot d_2 \) to \( r \cdot (d_1 + d_2) \), which improves both communication and computation efficiency in Federated Learning (FL).
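For concreteness, the following minimal PyTorch sketch shows a LoRA-adapted linear layer and the resulting parameter count; the class name `LoRALinear` and the dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained weight W0 plus a trainable low-rank update BA."""

    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        self.W0 = nn.Parameter(torch.randn(d_out, d_in), requires_grad=False)  # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x d_in}, small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B in R^{d_out x r}, zero init (standard LoRA)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + x (BA)^T, i.e. the effective weight is W0 + BA
        return x @ self.W0.T + x @ (self.B @ self.A).T

layer = LoRALinear(d_in=768, d_out=768, r=4)
y = layer(torch.randn(2, 768))       # (2, 768) output, only A and B receive gradients
full_ft = 768 * 768                  # trainable parameters under full fine-tuning
lora_ft = 4 * (768 + 768)            # trainable parameters under LoRA: r * (d_in + d_out)
print(full_ft, lora_ft, lora_ft / full_ft)  # 589824 6144 ~0.01
```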

However, due to the bilinear parametrization of LoRA, aggregating the client updates \( \Delta W_k = B_k A_k \) for \( k \in [K] \) leads to the discordance problem:

\[ \sum_{k=1}^K w_k B_k A_k \neq \left( \sum_{k=1}^K w_k B_k \right) \left( \sum_{k=1}^K w_k A_k \right), \]

where \( w_k \)'s are non-negative aggregation weights with \( \sum w_k = 1 \). This mismatch degrades performance, which has led to a growing body of research aimed at addressing this issue.
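A quick numerical check makes the discordance concrete; below is a small NumPy sketch with two clients and random low-rank factors (all names and sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, K = 8, 6, 2, 2
w = np.full(K, 1.0 / K)                                # uniform aggregation weights
B = [rng.normal(size=(d1, r)) for _ in range(K)]
A = [rng.normal(size=(r, d2)) for _ in range(K)]

# Ideal aggregate of the local updates: sum_k w_k B_k A_k
ideal = sum(w[k] * B[k] @ A[k] for k in range(K))

# What naive FedAvg on the factors reconstructs: (sum_k w_k B_k)(sum_k w_k A_k)
naive = sum(w[k] * B[k] for k in range(K)) @ sum(w[k] * A[k] for k in range(K))

print(np.linalg.norm(ideal - naive))   # strictly positive: the two aggregates differ
```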

Two representative approaches (both sketched in code below) are:

  • FFA-LoRA [Sun et al.]: keep \( A_k = A_0 \) frozen for all \( k \in [K] \), so that
    \[ \sum_{k=1}^K w_k B_k A_k = \left( \sum_{k=1}^K w_k B_k \right) A_0. \]
  • FlexLoRA [Bai et al.]: aggregate the products \( B_k A_k \) directly, then use a truncated SVD to decompose the aggregate back into rank-\( r \) factors,
    \[ \sum_{k=1}^K w_k B_k A_k \approx BA. \]
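Both aggregation rules fit in a few lines; the NumPy sketch below follows the two equations above, with FlexLoRA's re-factorization done by a rank-\( r \) truncated SVD (function names are illustrative, not from either paper's code).

```python
import numpy as np

def aggregate_ffa(B_list, A0, weights):
    """FFA-LoRA: A is frozen to A0 on every client, so averaging B alone is exact."""
    B_avg = sum(w * B for w, B in zip(weights, B_list))
    return B_avg, A0

def aggregate_flex(B_list, A_list, weights, r):
    """FlexLoRA: average the full products, then refactor with a rank-r truncated SVD."""
    delta = sum(w * B @ A for w, B, A in zip(weights, B_list, A_list))
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    B_new = U[:, :r] * S[:r]          # fold the singular values into B
    A_new = Vt[:r, :]
    return B_new, A_new               # B_new @ A_new is the best rank-r approximation of delta
```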

Limited Parameter Space in LoRA

Figure 1: Accuracy of previous Federated LoRA methods across different rank sizes in heterogeneous data settings: (a) \( Dir(0.1) \), (b) \( Dir(0.01) \).

However, as shown in Figure 1, existing methods struggle to maintain performance, especially with low rank and high data heterogeneity. One possible explanation is that the parameter space available to these methods might be insufficient to accommodate diverse local updates. Our goal is to address the discordance problem while maintaining performance in such challenging settings.

Alternating Freeze

Instead of solely training module \( B \) while keeping module \( A \) frozen permanently, LoRA-A2 alternates between the two: LoRA module \( B \) is frozen during even rounds, while module \( A \) is frozen during odd rounds. This method preserves the optimization space while effectively resolving discordance. Specifically, when freezing \( A \), we have

\[ \Delta W = \sum_{k=1}^K \left( w_k B_k \right) A = \sum_{k=1}^K \left( w_k B_k A \right) = \sum_{k=1}^K \left( w_k B_k A_k \right) = \sum_{k=1}^K \left( w_k \Delta W_k \right), \]
and when freezing \( B \), we have
\[ \Delta W = \sum_{k=1}^K B \left( w_k A_k \right) = \sum_{k=1}^K \left( w_k B A_k \right) = \sum_{k=1}^K \left( w_k B_k A_k \right) = \sum_{k=1}^K \left( w_k \Delta W_k \right). \]
In this way, LoRA-A2 trains both \( B \) and \( A \), ensuring that \( A \) does not remain the same as its initial value. (Note: Since the standard convention of LoRA initializes \( B \) to zero, we first train \( B \). However, if a different initialization scheme is used, training \( A \) first could also be a reasonable choice.)
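A minimal sketch of the alternating schedule, assuming LoRA parameters are named with `lora_A` / `lora_B` substrings (as in common PEFT-style implementations); the helper below is illustrative rather than the paper's exact code.

```python
def set_alternating_freeze(model, round_idx: int) -> None:
    """Freeze B on even rounds and A on odd rounds (rounds are 1-indexed).

    With the standard zero initialization of B, this means B is the first
    factor to be trained, matching the convention described above.
    """
    train_B = (round_idx % 2 == 1)  # odd rounds train B (A frozen), even rounds train A (B frozen)
    for name, param in model.named_parameters():
        if "lora_B" in name:
            param.requires_grad = train_B
        elif "lora_A" in name:
            param.requires_grad = not train_B

# Schematic server loop: in each round only the unfrozen factor is trained,
# uploaded, and averaged, so FedAvg on that factor equals the average of the
# local updates Delta W_k (no discordance).
# for t in range(1, num_rounds + 1):
#     for client in sampled_clients:
#         set_alternating_freeze(client.model, t)
#         ...  # local training and upload of the unfrozen factor
```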

Adaptive Rank Selection

LoRA-A2 adaptively selects, based on each client's local dataset, which ranks of the global LoRA adapter (with rank \( r_G \)) to train, so that the selection fits the local communication rank budget \( r_i \). This approach provides two key benefits (a simplified selection sketch follows the list):

  • It minimizes conflicts between clients by allowing different clients to choose different LoRA ranks under high data heterogeneity.
  • It reallocates rank resources from unimportant LoRA modules to modules that require more fine-tuning, which is especially effective when the communication rank budget is small.
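The exact importance criterion is described in Sections 4.2 and 4.3 of the paper; as a simplified stand-in, the sketch below scores each rank of each module by the norm of its contribution \( \lVert B_{:,j} \rVert \cdot \lVert A_{j,:} \rVert \) and keeps the top-scoring ranks across all modules until the client's budget is met (the scoring rule and helper names are assumptions for illustration only).

```python
import numpy as np

def select_ranks(lora_modules: dict, budget: int) -> dict:
    """Pick the `budget` most important ranks across all LoRA modules of one client.

    `lora_modules` maps a module name to its (B, A) factors with global rank r_G.
    The per-rank score ||B[:, j]|| * ||A[j, :]|| is an illustrative proxy for the
    importance measure used in the paper.
    """
    scored = []
    for name, (B, A) in lora_modules.items():
        for j in range(A.shape[0]):  # one score per rank index j = 0, ..., r_G - 1
            score = np.linalg.norm(B[:, j]) * np.linalg.norm(A[j, :])
            scored.append((score, name, j))

    kept = sorted(scored, key=lambda s: s[0], reverse=True)[:budget]

    selection = {name: [] for name in lora_modules}
    for _, name, j in kept:
        selection[name].append(j)  # ranks this client will train and upload this round
    return selection
```

Modules that receive an empty selection are skipped entirely for that client, which corresponds to the pruning behavior discussed below for Figure 2.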

Figure 2 below shows the number of selected ranks per module for three clients in a pathological data distribution setting. In this setup, clients 0 and 1 are assigned the semantically related "medical" and "space" classes respectively, while client 2 is assigned the "motorcycle" and "religions" classes from the 20 Newsgroups dataset. The resulting selections indicate that clients with semantically related data tend to converge on similar rank subspaces, potentially enabling more effective cooperative training, whereas clients with divergent data distributions select distinct ranks, leading to more independent updates. In addition, we observe that most modules are assigned zero ranks across clients, suggesting that our adaptive mechanism effectively prunes modules that do not benefit from fine-tuning. Please refer to Sections 4.2 and 4.3 of the paper for the detailed methodology of our rank selection mechanism.

Figure 2: Visualization of the number of selected ranks per module for (a) Client 0, (b) Client 1, and (c) Client 2. The x-axis shows RoBERTa module types, while the y-axis indicates layer numbers. Experimented on the 20 Newsgroups dataset with a pathological data distribution; on average, 2 out of 16 ranks were selected per module by our adaptive rank selection algorithm.

Experimental Results

| LoRA Rank | Method | BANKING77 Dir(0.5) | BANKING77 Dir(0.1) | BANKING77 Dir(0.01) | 20 Newsgroups Dir(0.5) | 20 Newsgroups Dir(0.1) | 20 Newsgroups Dir(0.01) | Communicated Parameters* |
|---|---|---|---|---|---|---|---|---|
| w/o LoRA | FL (w/o LoRA) | 92.76 ± 0.30 | 90.29 ± 0.73 | 67.58 ± 0.44 | 70.93 ± 1.04 | 68.82 ± 0.69 | 64.41 ± 0.30 | 186B |
| Rank = 8 | FL + LoRA | 92.80 ± 0.24 | 90.47 ± 0.53 | 60.96 ± 1.47 | 70.44 ± 0.28 | 67.33 ± 0.18 | 43.90 ± 1.08 | 1.99B |
| | FFA-LoRA | 87.20 ± 0.57 | 77.44 ± 1.28 | 40.88 ± 1.04 | 67.00 ± 0.67 | 61.27 ± 0.71 | 37.34 ± 0.30 | 0.991B |
| | FlexLoRA | 93.35 ± 0.24 | 92.14 ± 0.25 | 69.84 ± 0.65 | 70.59 ± 0.22 | 68.10 ± 0.38 | 60.41 ± 1.54 | 1.99B |
| | Ours | 93.24 ± 0.27 | 91.61 ± 0.39 | 70.13 ± 1.22 | 70.26 ± 0.21 | 67.12 ± 0.22 | 54.50 ± 1.44 | 1.31B |
| Rank = 4 | FL + LoRA | 92.86 ± 0.08 | 88.11 ± 0.88 | 54.99 ± 0.59 | 70.33 ± 0.12 | 67.29 ± 0.19 | 43.12 ± 2.67 | 0.991B |
| | FFA-LoRA | 86.90 ± 1.14 | 76.38 ± 0.61 | 37.63 ± 0.80 | 67.75 ± 0.45 | 61.25 ± 0.26 | 36.04 ± 0.80 | 0.497B |
| | FlexLoRA | 92.71 ± 0.31 | 90.53 ± 0.70 | 57.38 ± 1.30 | 70.05 ± 0.14 | 68.00 ± 0.33 | 50.50 ± 2.09 | 0.991B |
| | Ours | 93.22 ± 0.24 | 91.43 ± 0.63 | 69.63 ± 1.52 | 70.28 ± 0.32 | 67.12 ± 0.60 | 53.04 ± 1.68 | 0.888B |
| Rank = 2 | FL + LoRA | 91.97 ± 0.43 | 85.59 ± 1.13 | 49.08 ± 0.56 | 70.14 ± 0.13 | 65.40 ± 0.31 | 39.07 ± 2.23 | 0.497B |
| | FFA-LoRA | 84.65 ± 1.05 | 73.44 ± 0.88 | 34.44 ± 2.15 | 68.12 ± 0.47 | 61.57 ± 0.38 | 36.65 ± 0.52 | 0.249B |
| | FlexLoRA | 92.22 ± 0.50 | 87.31 ± 0.27 | 55.24 ± 2.19 | 70.03 ± 0.31 | 66.17 ± 1.70 | 48.23 ± 1.73 | 0.497B |
| | Ours | 93.10 ± 0.07 | 92.02 ± 0.36 | 69.40 ± 0.48 | 70.12 ± 0.18 | 67.02 ± 0.26 | 52.99 ± 2.56 | 0.528B |
| Rank = 1 | FL + LoRA | 90.61 ± 0.10 | 82.24 ± 1.68 | 45.78 ± 1.04 | 69.40 ± 0.33 | 63.16 ± 0.53 | 36.58 ± 0.98 | 0.249B |
| | FFA-LoRA | 82.51 ± 0.53 | 72.96 ± 0.54 | 33.68 ± 0.20 | 67.73 ± 0.30 | 61.35 ± 0.22 | 34.44 ± 0.68 | 0.124B |
| | FlexLoRA | 90.40 ± 0.54 | 82.20 ± 0.74 | 42.75 ± 0.89 | 69.53 ± 0.25 | 62.98 ± 1.12 | 35.54 ± 0.68 | 0.249B |
| | Ours | 93.21 ± 0.13 | 91.87 ± 0.33 | 68.88 ± 1.15 | 70.31 ± 0.24 | 66.95 ± 0.07 | 54.84 ± 1.15 | 0.270B |

Table: Results with RoBERTa-base on the BANKING77 and 20 Newsgroups datasets. A smaller \( \alpha \) in \( Dir(\alpha) \) corresponds to a more heterogeneous simulated setting. The metric is accuracy (%). *This column reports the total number of uploaded parameters, averaged across rows.

BibTeX

@misc{koo2024robustefficientfederatedlowrank,
  title={Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients},
  author={Jabin Koo and Minwoo Jang and Jungseul Ok},
  year={2024},
  eprint={2410.22815},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.22815},
}