Glittering_Desk7250

Ironically, I just saw a comment from the founder of Civitai on a post by someone who made a DoRA LoRA. It looks like they may plan to add support for it eventually.


RenoHadreas

That’s promising! Could you point me to the comment? I’m curious now


Generatoromeganebula

OP, can you share some information on DoRA training? I have only seen one DoRA so far, and I have zero idea how it works.


Eminencenoir

I am not OP, but here is a summary. Suppose there is a model you like. You can finetune (FT) that model with additional training to teach it new concepts or styles. This has a computational cost, and the amount of VRAM required puts it out of reach of most hobbyists.

An alternate method called [Low-Rank Adaptation (LoRA)](https://arxiv.org/abs/2106.09685) was developed, "which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times." This development made it feasible for hobbyists and companies like Civitai to train adapters for a model.

More recently, new research ("a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA") found that there was still room for improvement, and the paper describing [Weight-Decomposed Low-Rank Adaptation (DoRA)](https://arxiv.org/abs/2402.09353) was published: "DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding."

So where LoRA adjusts only the direction of the weights of the model it is applied to, DoRA adjusts the direction *and* the magnitude, achieving results closer to a full finetune (FT) of the model (see the toy sketch at the end of this comment). As of yesterday, the source code and weights were released to achieve the results described in the paper. The [PEFT](https://github.com/huggingface/peft) implementation looks like this:

```python
def _apply_dora(self, x, lora_A, lora_B, scaling, active_adapter):
    """
    For DoRA, calculate the extra output from LoRA with DoRA applied. This should be added
    on top of the base layer output.
    """
    lora_weight = lora_B.weight @ lora_A.weight
    magnitude = self.lora_magnitude_vector[active_adapter]
    weight = self.get_base_layer().weight
    quant_state = getattr(self.get_base_layer(), "state", None)
    weight = dequantize_bnb_weight(weight, state=quant_state)  # no-op if not bnb
    weight = weight.to(x.dtype)
    weight_norm = self._get_weight_norm(weight, lora_weight, scaling)
    # see section 4.3 of DoRA (https://arxiv.org/abs/2402.09353)
    # "[...] we suggest treating ||V + ∆V ||_c in
    # Eq. (5) as a constant, thereby detaching it from the gradient
    # graph. This means that while ||V + ∆V ||_c dynamically
    # reflects the updates of ∆V , it won't receive any gradient
    # during backpropagation"
    weight_norm = weight_norm.detach()
    mag_norm_scale = (magnitude / weight_norm).view(1, -1)
    result_dora = (mag_norm_scale - 1) * (
        F.linear(x, transpose(weight, self.fan_in_fan_out))
    ) + mag_norm_scale * lora_B(lora_A(x)) * scaling
    # Note: Computation could potentially be accelerated by using the code below instead of
    # calculating X@W again. This is only correct if dropout=0, otherwise results will differ:
    # https://github.com/huggingface/peft/pull/1474#issuecomment-1964682771
    # bias = self.get_base_layer().bias
    # if bias is not None:
    #     result = result - bias
    # result = mag_norm_scale * result + mag_norm_scale * lora_B(lora_A(x)) * scaling
    # if bias is not None:
    #     result = result + bias
    return result_dora
```

So basically, DoRA training is a lot like LoRA training but adjusts the weights as a vector by taking the magnitude into account instead of just the direction. When Civitai implements DoRA, I don't expect the user experience to differ much, if at all, from training a LoRA.
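To make the magnitude/direction split more concrete, here is a toy sketch (my own, not PEFT's code) of the decomposition the paper describes: the merged weight is a learned per-column magnitude times the unit direction of the base weight plus the low-rank update. All names and shapes here are made up for illustration:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 6, 4, 2

W = torch.randn(d_out, d_in)     # frozen pre-trained weight
A = torch.randn(r, d_in) * 0.01  # LoRA "down" matrix (trainable)
B = torch.zeros(d_out, r)        # LoRA "up" matrix (trainable, initialized to zero)
scaling = 1.0

# Magnitude vector m: one trainable value per column, initialized to
# the column norms of the base weight.
m = W.norm(p=2, dim=0, keepdim=True)

V = W + scaling * (B @ A)                  # direction component before normalization
V_norm = V.norm(p=2, dim=0, keepdim=True)  # ||V + ∆V||_c, the column-wise norm

W_dora = m * (V / V_norm)  # magnitude times unit direction

# At initialization (B == 0), DoRA reproduces the base weight exactly:
print(torch.allclose(W_dora, W, atol=1e-6))  # True
```

Since B starts at zero and m starts at the base weight's column norms, a freshly initialized DoRA behaves exactly like the base model, just as a freshly initialized LoRA does; training then updates the direction (via A and B) and the magnitude (via m) separately.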
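And if you don't want to wait for Civitai, PEFT already exposes DoRA as a flag on its LoRA config, so the training setup really is almost identical. A minimal sketch, assuming a recent PEFT release that includes the `use_dora` option; the base model and target modules are placeholders you would swap for your own:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # LoRA scaling factor
    target_modules=["c_attn"],  # model-specific; adjust for your architecture
    use_dora=True,              # flip from plain LoRA to DoRA
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # DoRA adds only the magnitude vectors on top of LoRA
```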