The ability to train a LoRA is an amazing thing, whether you’re using Civitai’s LoRA trainer or one of the popular local training scripts, but the technical descriptions of what each of the options actually does are… complicated! This Glossary was created to bring together as many of these terms as possible in a single, searchable list, offering short explanations of the most common LoRA training options!
Some of the terms below might be missing a description – this document is in a constant state of development to keep up to speed with additions and changes – it’s being worked on!
Term | Tags | Description |
---|---|---|
Kohya SS | User, Software | A set of training scripts by Kohya-ss for Stable Diffusion, allowing us to train DreamBooth, LoRA, and Textual Inversion. No GUI! Github link |
Tensorboard | Software | A visualization toolkit to track and visualize training metrics like loss and accuracy. Helps determine whether training is going well. |
Configuration file | Software | After configuring Kohya scripts via the GUI, we can save our training settings in a Configuration file for easy subsequent training setup, or for sharing. |
LoRA | Concept | Low-Rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning; a lightweight fine-tuning technique which trains small low-rank weight matrices alongside the base model rather than updating the full model. |
Bmaltais | User | Owner of the Kohya GUI repository on Github. Provides a (primarily) Windows Gradio GUI for Kohya-ss's Stable Diffusion training scripts. The GUI allows easy creation of the necessary parameters for Kohya SS scripts to run. Github link |
Caption Extension | Configuration Setting | Captions can be in .txt or .caption file format. Beware - the default is .caption! |
Captioning | Concept | The process of describing your input training images to help Stable Diffusion understand what it's looking at. Captioning can be done by hand, or via a number of Caption creation tools. |
BLIP (Captioning) | Concept, Software | Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (BLIP), created by Salesforce, is a solid, simple image-to-text processor. |
GIT (Captioning) | Concept, Software | GenerativeImage2Text (GIT), first discussed in this paper, was trained on 20 million image-text pairs, and further fine-tuned on TextCaps. Another robust image-to-text processor. |
WD14 (Captioning) | Concept, Software | WD1.4 by SmilingWolf is a processor with a whole host of variations (including vit-tagger, convnext-tagger, vit-tagger-v2, convnext-tagger-v2, swinv2-tagger-v2). Trained on Danbooru images to various Epochs, with different filtering. An excellent image-to-text tagger. |
LoRA-LierLa | Concept, Model | LoRA for Linear layers and Conv2d layers with a 1x1 kernel |
LoRA-C3Lier | Concept, Model | LoRA for Conv2d layers with a 3x3 kernel, in addition to the Linear and 1x1 Conv2d layers covered by LoRA-LierLa |
Resizing (LoRA) | Concept, Software | A LoRA can be Resized after training by altering its Network Rank (Dim), typically to reduce file size. |
Image Folder | Configuration Setting | Training image directory |
Output Folder | Configuration Setting | LoRA output directory |
Regularization Images | Configuration Setting | Regularization images are images of the same class as the training data - but not the training data itself. They provide a Dampening effect and prevent class-drift. Example: if training on images of a female celebrity, regularization images would be images of other women (same class), but NOT your celebrity. |
Regularization Folder | Configuration Setting | Folder for Regularization images |
Logging Folder | Configuration Setting | Directory for LoRA training logs and metadata (read by Tensorboard) |
Print Training Command | Configuration Setting | Shows the command which will be submitted to the Kohya-ss script when training begins. |
Batch (Train batch size) | Configuration Setting, Concept | A "batch" is the number of training images to read at once. A batch size of 2 reads, and trains on, two images simultaneously. Larger batches shorten training time, but because multiple pictures are learned at the same time, the accuracy for each picture drops; many people increase the learning rate to compensate. Higher batch sizes also consume more VRAM. The default is 1. |
Save Every N Epochs | Configuration Setting | We can save the progress as our LoRA trains by outputting each Epoch as it completes. If we specify that we want the LoRA to run for 10 Epochs, we can use this setting to output a LoRA file after every N Epoch completions. An extremely useful tool in testing our LoRA outputs! |
Epoch | Configuration Setting, Concept | An Epoch is one complete pass through the training data, including repeats. If we have 20 data set images and have specified we want to repeat those images 10 times, 1 epoch will be 20x10, resulting in 200 steps of training (at batch size 1). |
Mixed Precision | Configuration Setting | Weight data is stored as 32bit values by default, but we can gain considerable VRAM savings by training with 16bit precision. LoRA can be successfully trained with FP16 (16bit precision). BF16 is a format devised to provide the VRAM savings of FP16 with the numerical range of FP32 (32bit). BF16 may only work on the latest generation GPUs. |
Save Precision | Configuration Setting | Specifies the type of weight data to save the LoRA file as. Float is 32bit. The default is FP16. |
Seed | Configuration Setting, Concept | A Seed can be specified to help replicate future training sessions, but note that not every Kohya process uses the Seed. There will be an element of unpredictability in each training session, even with a Seed specified. |
Learning Rate | Configuration Setting | The larger the Learning Rate value, the faster the LoRA training, but the more details are missed during training. Low Learning Rates are typically desirable, to retain flexibility in the LoRA. |
LR Scheduler | Configuration Setting | Learning Rate is a parameter of the Optimizer. A Learning Rate Scheduler adjusts the learning rate according to a predefined schedule during the training process. |
Optimizer | Configuration Setting | The Optimizer controls how the neural network weights are updated during training. There are various options, and different LoRA guides will suggest different Optimizers and various associated settings. The most commonly used Optimizer for SD 1.5 training is AdamW8bit (the default), which uses the least VRAM and has sufficiently good accuracy. Alternatives include DAdapt, which automatically adjusts the learning rate as training progresses, and Adafactor, which combines low memory usage with automatic learning rate adjustment. |
Buckets | Configuration Setting | With Kohya’s LoRA training scripts, there’s no need to pre-crop your training images to 512x512 (or 2.0’s 768x768, or SDXL’s 1024x1024). Bucketing will sort the images into various "containers" based on resolution and aspect ratio, as images of different sizes cannot be trained at the same time. Similarly sized images will be grouped for training. |
Text Encoder Learning Rate | Configuration Setting | The learning rate for the Text Encoder, which, while training, associates tokens (parts of your prompt) with blocks in the neural network. The default is 5e-5 (0.00005). Lowering this value can reduce unwanted objects showing up in images generated with your LoRA. If you can’t get things to appear which should have been trained into the LoRA, you’ve set this too low. |
Unet Learning Rate | Configuration Setting | The Unet is like the visual memory of the neural network, and the thing that causes most problems with LoRA. It’s extremely sensitive, and very easy to over- or under-bake. The default is 0.0001 (1e-4). |
Optimizer Extra Arguments | Configuration Setting | Some Optimizers accept (or require) extra command line arguments for specific features. Many LoRA guides will specify the values required for each Optimizer. |
Network Rank (Dimension) | Configuration Setting | Also expressed as Net Dim or Rank. This setting affects the “power” of the model in displaying the concepts trained within. Higher values result in a larger LoRA and more training time, but may capture the element to be trained with better fidelity. |
Network Alpha | Configuration Setting | Closely related to the Network Rank (Dim): the LoRA update is scaled by Alpha / Rank, so the smaller the Network Alpha value, the larger the saved LoRA neural net weights, and the more learning is Dampened. An Alpha of 16 with a Network Rank (Dim) of 32 effectively halves the Learning Rate. If Alpha and Network Rank are set to the same value, there is no effect on the Learning Rate (see the Network Alpha sketch below the table). |
Clip Skip | Configuration Setting | The Text Encoder uses a mechanism called “CLIP”, made up of 12 layers (corresponding to the 12 layers of the neural network). Clip Skip specifies which layer, counted from the end, supplies the output: a Clip Skip of 2 will send the penultimate layer’s output vector to the Attention block. Unless the base model you’re training against was trained (or Mixed) with Clip Skip 2, you can use 1. SDXL does not benefit from Clip Skip 2. |
Noise Offset | Configuration Setting | First described by researchers at Crosslabs. A method of introducing true "darkness" (and highlights) into models at the training stage. See Noise Offset for SD 1.5. Note that Noise Offset will increase dampening. A Setting of 0.1 will make a LoRA's colors more vivid. Default is 0. |
Gradient Checkpointing | Configuration Setting | Saves VRAM by recomputing intermediate activations during the backward pass instead of keeping them all in memory; this reduces overall training speed but uses less VRAM. Has no effect on the training results. |
Persistent Data Loader | Configuration Setting | Normally, the data required for training (the latent images, etc.) is discarded (unloaded from memory) and reloaded after each epoch completes. Turning on Persistent Data Loader keeps the data loaded, which speeds up training but uses significantly more memory. |
Memory Efficient Attention | Configuration Setting | When enabled, results in greatly lowered VRAM consumption while training, but is slower than Xformers. Default is OFF. |
Use Xformers | Configuration Setting | Xformers is a Python library providing memory-efficient attention, greatly reducing VRAM usage at a small cost in speed. Turn this on if you have OOM errors. Defaults to ON. |
Flip Augmentation | Configuration Setting | Artificially doubles the number of training images by performing a horizontal flip. If your subject is not left-right symmetrical (which is usually the case when training humans and other asymmetric subjects), this option should be avoided. |
Color Augmentation | Configuration Setting | Artificially increases the number of training image variations by changing the image hue during learning; supposedly improving model fidelity. When enabled, Cache latents cannot be used due to the training images changing dynamically during training. |
Shuffle caption | Configuration Setting | When enabled, comma-separated Captions are shuffled to produce more captioning variation in the training data. We can “fix” a set number of leading tokens in place, not to be shuffled, using the Keep n tokens slider (see the caption shuffling sketch below the table). |
Keep n tokens | Configuration Setting | A slider allowing us to specify how many of our leading comma-separated caption tokens are excluded from caption shuffling. |
Cache latents | Configuration Setting | To speed up the training process, the latent representations of the training images are generated in advance and held in system memory. |
Cache latents to disk | Configuration Setting | Cache the pre-generated latent images to disk, as temporary numpy .npz format files, saved alongside the training images. |
v_parameterization | Configuration Setting | Must be checked when training against Stable Diffusion 2.x 768 base models, which use v-prediction. |
Learn Dampening | Concept | The goal of training is (generally) to fit in as many Steps as possible without Overcooking. Certain settings, by design or coincidentally, “dampen” learning, allowing us to train more steps before the LoRA appears Overcooked. Settings which affect Dampening include Network Alpha and Noise Offset. |
Unet | Concept | The Unet is the part of the model architecture which is the "visual memory" of the neural network. |
Overcooking | Concept | See Overfitting |
Overfitting | Concept | A model which tries to reproduce the training data too aggressively - resulting in a LoRA which is hard to work with, doesn’t follow prompts well. |
Undercooking | Concept | See Underbaking |
Underbaking | Concept | Underbaking is apparent when testing a LoRA and the effect is weak or needs to be pushed past 1.0 strength to show. In this case, training could benefit from more steps. |
Training Set (or Data Set) | Concept | The images you will be passing into the interface to be trained into the LoRA. The data set can also include captions - files containing descriptions of your image contents. |
TEnc | Concept, Software | The Text Encoder (TEnc) controls how the AI interprets text (prompts) while generating, and while training associates tokens (caption words) to blocks in the neural network. |
Baking | Concept | The process of training a model, TI, LoRA, etc. |
Steps | Concept, Configuration Setting | Total computed number of steps for training: # of images * # of repeats * # of epochs / batch size (see the worked Steps example below the table). |
OOM Error | Concept | Out of Memory (OOM) errors occur when the required VRAM for training exceeds the available VRAM on the GPU. Some LoRA training settings are designed to lower the amount of VRAM used, with a tradeoff in training speed. |
v2 | Configuration Setting | Must be checked when training against Stable Diffusion 2.0 base models. |
Everydream2 | Software | Training software, specializing in processing/training large data sets. Github link |
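
A quick worked example of the Steps formula above, as a minimal Python sketch. The numbers are arbitrary illustrations, not recommended settings:

```python
# Minimal sketch of the Steps formula: images * repeats * epochs / batch size.
# The values below are arbitrary examples, not recommended settings.
num_images = 20    # images in the training set
num_repeats = 10   # repeats per image (in Kohya, set via the "10_name" folder prefix)
num_epochs = 5     # total epochs to train
batch_size = 2     # train batch size

total_steps = (num_images * num_repeats * num_epochs) // batch_size
print(total_steps)  # (20 * 10 * 5) / 2 = 500 steps
```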
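
The Network Alpha dampening described above can be thought of as a simple Alpha / Rank scale factor applied to the LoRA update, which behaves like a learning rate multiplier. A minimal sketch of that idea (illustrative only, not Kohya’s actual implementation):

```python
# Sketch of Network Alpha dampening: the LoRA contribution is scaled by
# alpha / rank, which behaves like a multiplier on the learning rate.
# Illustrative only - not Kohya's actual code.
def effective_learning_rate(base_lr: float, network_alpha: float, network_rank: int) -> float:
    return base_lr * (network_alpha / network_rank)

print(effective_learning_rate(1e-4, 16, 32))  # 5e-05: Alpha 16 with Rank 32 halves the rate
print(effective_learning_rate(1e-4, 32, 32))  # 0.0001: Alpha equal to Rank leaves it unchanged
```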
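
Finally, a minimal sketch of how Shuffle caption and Keep n tokens interact: the first `keep_n` comma-separated tags stay fixed (typically your trigger word), and the rest are reshuffled each time the caption is read. Illustrative only; the function name and example caption are made up for this sketch:

```python
import random

# Sketch of Shuffle caption with Keep n tokens: the first `keep_n` tags are
# kept in place, the remaining tags are shuffled. Illustrative only.
def shuffle_caption(caption: str, keep_n: int = 1) -> str:
    tags = [t.strip() for t in caption.split(",")]
    fixed, rest = tags[:keep_n], tags[keep_n:]
    random.shuffle(rest)
    return ", ".join(fixed + rest)

print(shuffle_caption("mychar, red dress, smiling, outdoors", keep_n=1))
# e.g. "mychar, outdoors, red dress, smiling"
```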