
Multi-LoRA Serving: One Base Model, Ten Adapters
The Multi-Tenant Architecture. Learn how to serve dozens of specialized expert models on a single GPU by sharing a base model and hot-swapping tiny LoRA adapters.
Imagine you are building a SaaS platform for writers. You want to offer 50 different "Writing Styles": one for poetry, one for legal briefs, one for Gen-Z slang, and so on.
If you used Full Fine-Tuning (FFT), you would need to host 50 separate models. Each model is 14 GB, so that's 700 GB of VRAM—roughly nine 80 GB NVIDIA A100 GPUs. At current cloud prices, this would cost you over $10,000 per month just to keep the models loaded.
Multi-LoRA Serving solves this. Because LoRA adapters are tiny (around 50 MB) and sit on top of a frozen base model (14 GB), you can host one base model and "plug in" all 50 adapters simultaneously.
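The back-of-the-envelope math makes the gap obvious:
- Full fine-tuning: 50 × 14 GB = 700 GB of VRAM.
- Multi-LoRA: 14 GB (shared base) + 50 × 50 MB (adapters) ≈ 16.5 GB of VRAM.
Instead of nine GPUs, you need one.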
In this lesson, we will learn how to build this "Multi-Tenant" AI architecture.
1. How it Works: The Shared Backbone
In a Multi-LoRA setup:
- The Base Model is loaded once into GPU memory.
- The Adapters are loaded into a small secondary region of VRAM.
- The Request: When a user sends a message, they also send an "Adapter ID."
- The Swap: The inference engine (such as vLLM) routes the computation through the shared base plus the specific adapter requested.
Because the base weights are shared, the only extra memory you need is roughly 50 MB per adapter. You can easily fit 50 adapters on a single GPU that would normally hold only one full model.
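To make the "shared base plus per-request adapter" idea concrete, here is a minimal NumPy sketch of a single linear layer serving two tenants. The sizes, adapter names, and scaling factor are illustrative assumptions, not vLLM's internal code.

import numpy as np

hidden, rank = 64, 8  # illustrative sizes, not a real 7B model

# The base weight is loaded once and shared by every tenant.
W_base = np.random.randn(hidden, hidden)

# Each adapter is just a pair of small low-rank matrices (A, B).
adapters = {
    "poetry-bot": (np.random.randn(hidden, rank), np.random.randn(rank, hidden)),
    "legal-bot": (np.random.randn(hidden, rank), np.random.randn(rank, hidden)),
}

def forward(x, adapter_id, scaling=1.0):
    # Shared base computation plus the tiny adapter-specific correction.
    A, B = adapters[adapter_id]
    return x @ W_base + scaling * (x @ A @ B)

# Two requests in the same "batch", each routed through a different adapter.
x = np.random.randn(2, hidden)
poetry_out = forward(x[:1], "poetry-bot")
legal_out = forward(x[1:], "legal-bot")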
2. Dynamic Adapter Selection
Modern serving engines allow you to switch adapters on-the-fly without restarting the server. You can even handle a batch of 10 users where every single user is using a different adapter!
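With vLLM's offline API, for example, each request can carry its own LoRARequest while the engine keeps a single shared base model in memory. A minimal sketch, assuming the base model and adapter paths below exist locally:

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Hypothetical paths; substitute your own base model and adapters.
llm = LLM(model="/path/to/base-model", enable_lora=True)
params = SamplingParams(max_tokens=128)

poetry = LoRARequest("poetry-bot", 1, "/path/to/poetry-adapter")
legal = LoRARequest("legal-bot", 2, "/path/to/legal-adapter")

# Same engine, same base weights, different adapter per call.
poems = llm.generate(["Write a haiku about GPUs."], params, lora_request=poetry)
briefs = llm.generate(["Summarize this clause in plain English."], params, lora_request=legal)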
Visualizing Multi-LoRA Efficiency
graph TD
A["User 1 (Poetry)"] --> B["API Gateway"]
C["User 2 (Code)"] --> B
D["User 3 (Legal)"] --> B
B --> E["vLLM Multi-LoRA Server"]
subgraph "Single GPU Memory"
E --> F["Shared Base Model (Mistral 7B)"]
F --> G["Poetry Adapter"]
F --> H["Code Adapter"]
F --> I["Legal Adapter"]
end
G --> J["Poetry Output"]
H --> K["Code Output"]
I --> L["Legal Output"]
3. Implementation: Using Multi-LoRA with vLLM
To enable this, you simply tell vLLM where your adapters are located when you start the server.
Command Line:
python -m vllm.entrypoints.openai.api_server \
--model /path/to/base-model \
--enable-lora \
--lora-modules \
poetry-bot=/path/to/poetry-adapter \
legal-bot=/path/to/legal-adapter \
code-bot=/path/to/code-adapter
Usage (OpenAI Client):
# To use the poetry bot, point an OpenAI-compatible client at the vLLM server
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM's default port
response = client.chat.completions.create(
    model="poetry-bot",  # this name maps to the adapter path above
    messages=[{"role": "user", "content": "Write a sonnet about GPUs."}]
)
print(response.choices[0].message.content)
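Switching tenants is then just a matter of changing the model name on the next request. Reusing the same client (the prompt below is only an example):

response = client.chat.completions.create(
    model="legal-bot",  # same server, same base model, different adapter
    messages=[{"role": "user", "content": "Draft a one-sentence confidentiality clause."}]
)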
4. The Performance "Tax" of Multi-LoRA
There is a small performance cost to Multi-LoRA. Because the engine has to manage different adapter weights for different users in the same batch, tokens per second (TPS) may drop by roughly 5-10%. However, for most businesses, a 10% drop in speed is a tiny price to pay for a roughly 90% drop in hardware costs.
Summary and Key Takeaways
- Multi-LoRA is the "economical king" of specialized AI.
- Shared weights: You only pay the "memory price" of the base model once.
- Dynamic switching: You can host dozens of behaviors on a single server.
- vLLM Support: Enable LoRA with --enable-lora and map each adapter name to its path via --lora-modules.
- Modular Business: This architecture allows you to scale to thousands of specialized use cases without exploding your cloud bill.
In the next lesson, we will look at the final decision in your deployment journey: Local vs. Cloud Deployment Trade-offs.
Reflection Exercise
- If you have 100 adapters and each is 100MB, how much extra VRAM do you need in addition to the base model? (Hint: 100 * 100MB = ?).
- Can you use Multi-LoRA with QLoRA (Lesson 3 of Module 9)? (Hint: Yes, but the base model must be in the same quantized state as the training base).
SEO Metadata & Keywords
Focus Keywords: Multi-LoRA serving vLLM, hosting multiple adapters one model, dynamic adapter switching, multi-tenant AI architecture, reducing AI infrastructure costs.
Meta Description: Save thousands on GPU costs. Learn how to use Multi-LoRA serving to host dozens of specialized expert models on a single GPU by sharing a base model and hot-swapping tiny adapters.