AI trading agent: Fusing LLM adapters and converting to Ollama
Deploy specialized LoRA adapters through three methods: direct usage, model fusion, and Ollama conversion for production-ready API access and scalable deployment.
Previous section: AI trading agent: Generative adversarial networks and synthetic data
Project repository: Web3 AI trading agent
After completing teacher-student distillation, you have specialized LoRA adapters containing domain-specific trading knowledge. This section covers deployment options: direct LoRA usage, model fusion, and Ollama conversion for production deployment.
Understanding model deployment options
Choose your deployment strategy based on requirements
The fine-tuning process produces LoRA (Low-Rank Adaptation) adapters that modify the base model’s behavior without altering the original weights. You have three deployment options:
Option 1: Direct LoRA usage
- Pros: Smallest memory footprint, fastest deployment
- Cons: Requires MLX runtime, adapter loading overhead
- Best for: Development, testing, resource-constrained environments
Option 2: Fused model deployment
- Pros: Single model file, no adapter dependencies, consistent performance
- Cons: Larger file size, permanent modification
- Best for: Production deployment, sharing, simplified distribution
Option 3: Ollama integration
- Pros: Easy API access, model versioning, production-ready serving
- Cons: Additional quantization step, external dependency
- Best for: API-based integration, multi-user access, scalable deployment
If you are still in the learning stage, I suggest trying out all three to get a feel for the process and to see the differences in behavior (e.g., inference time). It doesn't take much time; just follow the instructions later in this section.
Direct LoRA adapter usage
Direct LoRA usage provides immediate access to your specialized trading model. If you ran a quick test/validation of your fine-tuned model loaded with adapters.safetensors, you have already done this direct LoRA adapter usage test.
Here it is again:
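As a sketch, assuming the mlx-lm package is installed, the adapters live in an `adapters/` directory, and `Qwen/Qwen2.5-3B-Instruct` is the base model you fine-tuned (adjust the paths and prompt to your setup):

```bash
# Generate with the base model plus the LoRA adapters loaded at runtime
mlx_lm.generate \
  --model Qwen/Qwen2.5-3B-Instruct \
  --adapter-path adapters \
  --prompt "ETH/USDC pool: price 2450, volatility rising. Should the agent buy, sell, or hold?" \
  --max-tokens 200
```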
Model fusion for production deployment
Fusion combines LoRA adapters with base model weights, creating a single model file with embedded trading knowledge.
Understanding the fusion process
LoRA fusion is a technique to directly integrate specialized knowledge from adapter weights into the base model parameters.
Mathematically, this involves taking the original Qwen 2.5 3B model parameters and combining them with the adapter’s low-rank matrices.
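In standard LoRA terms (generic symbols, not variable names from the project), each adapted weight matrix is rebuilt as the base weight plus a scaled low-rank update:

```latex
% Fused weight = base weight + scaled low-rank update
W_{\text{fused}} = W_0 + \Delta W = W_0 + \frac{\alpha}{r}\, B A
% W_0: original weight, B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times k},
% r: LoRA rank, \alpha: scaling factor
```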
Practically, this is something of a double-edged sword: on the one hand, the model grows from roughly 2 GB (when using adapters separately) to around 6 GB in its fully fused state; on the other hand, loading times improve significantly because the adapter loading overhead is eliminated. The fused model also maintains consistent, stable inference speeds without adapter-related delays.
Performing model fusion
Execute fusion with appropriate settings
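A sketch using the mlx_lm.fuse utility; the base model name, adapter directory, and the fused-trader-qwen output path are placeholders for your own values:

```bash
# Fuse the LoRA adapters into the base weights and save a standalone model.
# --de-quantize is only needed if the base model you loaded was quantized; it writes
# full-precision weights, which is what gives the ~6 GB fused size mentioned above.
mlx_lm.fuse \
  --model Qwen/Qwen2.5-3B-Instruct \
  --adapter-path adapters \
  --save-path fused-trader-qwen \
  --de-quantize
```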
Verify fusion success
Test the fused model:
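For example (placeholder paths, same caveats as above):

```bash
# The fused model loads on its own -- no --adapter-path needed
mlx_lm.generate \
  --model fused-trader-qwen \
  --prompt "ETH/USDC pool: price 2450, volatility rising. Should the agent buy, sell, or hold?" \
  --max-tokens 200
```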
Compare with adapter version:
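Run the same prompt through the adapter version to compare output quality and load time (again, a sketch with placeholder paths):

```bash
mlx_lm.generate \
  --model Qwen/Qwen2.5-3B-Instruct \
  --adapter-path adapters \
  --prompt "ETH/USDC pool: price 2450, volatility rising. Should the agent buy, sell, or hold?" \
  --max-tokens 200
```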
Converting to Ollama format
Ollama provides production-grade model serving with API access, and it just works well and smoothly.
Set up llama.cpp for model conversion:
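For example, cloning into your project root (the clone location is up to you):

```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
```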
Build with optimizations (macOS with Apple Silicon):
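A CMake-based build sketch; Metal is typically enabled by default on Apple Silicon, so the explicit flag is mostly for clarity:

```bash
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j
```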
Alternative: Build for CUDA (Linux/Windows with NVIDIA GPU):
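Assuming the CUDA toolkit is installed:

```bash
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```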
Install Python conversion dependencies
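From inside the llama.cpp directory:

```bash
# Dependencies for the GGUF conversion scripts
python3 -m pip install -r requirements.txt
```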
Verify the conversion script is available from the root directory (cd .. out of llama.cpp first):
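A quick check, assuming llama.cpp was cloned into the project root; in older checkouts the script may still be named convert-hf-to-gguf.py:

```bash
cd ..
ls llama.cpp/convert_hf_to_gguf.py
```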
Converting MLX to GGUF format
Run the conversion script:
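A sketch assuming the fused model was saved to fused-trader-qwen (an MLX fused model directory is in Hugging Face format, which the llama.cpp converter can generally read); the output file name is a placeholder:

```bash
# Convert the fused model directory to a single f16 GGUF file
python3 llama.cpp/convert_hf_to_gguf.py fused-trader-qwen \
  --outfile trader-qwen-f16.gguf \
  --outtype f16
```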
Quantizing for efficiency
Optionally, apply quantization for performance, for example Q4_K_M (4-bit with the K-quant method):
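For example, with the llama-quantize binary built earlier (file names are placeholders):

```bash
# 4-bit K-quant: roughly a quarter of the f16 size with modest quality loss
./llama.cpp/build/bin/llama-quantize \
  trader-qwen-f16.gguf \
  trader-qwen-q4_k_m.gguf \
  Q4_K_M
```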
Quantization options and trade-offs
| Format | Size (vs. F16) | Speed | Quality | Use Case |
|---|---|---|---|---|
| F16 | 100% | Medium | Highest | Development/Testing |
| Q8_0 | ~50% | Fast | High | Balanced Production |
| Q4_K_M | ~25% | Fastest | Good | Resource Constrained |
| Q2_K | ~12% | Very Fast | Lower | Extreme Efficiency |
Creating Ollama model
Register the quantized model with Ollama using the following instructions. Note that I'm using the trader-qwen:latest model name throughout; make sure to change it if you have a different one.
Create Ollama Modelfile with configuration:
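A minimal Modelfile sketch; the GGUF file name, sampling parameters, and system prompt are placeholders to adapt to your setup:

```
FROM ./trader-qwen-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM "You are a crypto trading assistant specialized in ETH/USDC pool analysis. Respond with a clear buy, sell, or hold recommendation and a short rationale."
```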
Import model to Ollama:
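From the directory containing the Modelfile and the GGUF file:

```bash
ollama create trader-qwen:latest -f Modelfile
ollama list   # confirm the model shows up
```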
Test Ollama deployment
Test model with trading prompt:
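For example:

```bash
ollama run trader-qwen:latest "ETH/USDC price is 2450 and volatility is rising. Should the agent buy, sell, or hold?"
```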
Test API endpoint:
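Ollama listens on port 11434 by default:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "trader-qwen:latest",
  "prompt": "ETH/USDC price is 2450 and volatility is rising. Should the agent buy, sell, or hold?",
  "stream": false
}'
```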
Integrating custom models with trading agents
Update your trading agent configuration to leverage the custom-trained model.
Configuration for Ollama integration
Update config.py for Ollama model usage:
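A sketch only; the variable names below are illustrative placeholders, not necessarily the names the project's config.py actually uses:

```python
# config.py (sketch) -- point the agent at the local Ollama server
# and register the custom fine-tuned model. Setting names are placeholders.
OLLAMA_URL = "http://localhost:11434"

MODELS = {
    "trader-qwen": {
        "model": "trader-qwen:latest",   # the name created with `ollama create`
        "temperature": 0.7,
        "max_tokens": 512,
    },
    # ...existing model entries...
}
```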
Set custom model as default:
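And the default selection, again with a placeholder setting name:

```python
# config.py (sketch) -- make the fine-tuned model the default choice
MODEL_KEY = "trader-qwen"
```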
Configuration for direct MLX usage
Alternative: Use MLX adapters directly.
MLX-based configuration in config.py:
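A sketch with illustrative setting names; adapt them to whatever config.py actually exposes:

```python
# config.py (sketch) -- route inference through mlx-lm with the adapters.
# All names and paths below are placeholders.
USE_MLX = True                                  # flag to bypass Ollama and use MLX directly
MLX_BASE_MODEL = "Qwen/Qwen2.5-3B-Instruct"     # base model the adapters were trained on
MLX_ADAPTER_PATH = "adapters"                   # path to the directory with adapters.safetensors
```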
Running agents with custom models
Run stateful agent with Ollama model:
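A sketch; the script name below is a placeholder for the stateful agent entry point in the project repository, and Ollama must already be running locally:

```bash
# Make sure the Ollama server is up (the desktop app or `ollama serve`), then launch the agent
python on_chain/uniswap_v4_stateful_trading_agent.py
```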
Alternative: Direct MLX execution:
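With the MLX settings from config.py enabled, the launch itself looks the same (placeholder script name again):

```bash
python on_chain/uniswap_v4_stateful_trading_agent.py
```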