Project repository: Web3 AI trading agent

After completing teacher-student distillation, you have specialized LoRA adapters containing domain-specific trading knowledge. This section covers deployment options: direct LoRA usage, model fusion, and Ollama conversion for production deployment.

Understanding model deployment options

Choose your deployment strategy based on requirements

The fine-tuning process produces LoRA (Low-Rank Adaptation) adapters that modify the base model’s behavior without altering the original weights. You have three deployment options:

Option 1: Direct LoRA usage

  • Pros: Smallest memory footprint, fastest deployment
  • Cons: Requires MLX runtime, adapter loading overhead
  • Best for: Development, testing, resource-constrained environments

Option 2: Fused model deployment

  • Pros: Single model file, no adapter dependencies, consistent performance
  • Cons: Larger file size, permanent modification
  • Best for: Production deployment, sharing, simplified distribution

Option 3: Ollama integration

  • Pros: Easy API access, model versioning, production-ready serving
  • Cons: Additional quantization step, external dependency
  • Best for: API-based integration, multi-user access, scalable deployment

If you are following the learning path, I suggest trying all three to get a feel for the process and to observe the differences in behavior (e.g., inference time). It doesn't take long; follow the instructions later in this section.

Direct LoRA adapter usage

Direct LoRA usage provides immediate access to your specialized trading model. If you ran a quick test or validation of your fine-tuned model loaded with adapters.safetensors, you have already performed this direct LoRA adapter usage test.

Here it is again:

mlx_lm.generate --model Qwen/Qwen2.5-3B \
  --adapter-path off-chain/models/trading_model_lora \
  --prompt "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and 9507.14 USDC, what trading action should I take on Uniswap?" \
  --temp 0.3
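
If you prefer to drive the adapter from Python instead of the CLI, the mlx_lm Python API exposes the same functionality. The sketch below is an assumption-laden example: it relies on mlx_lm's load() accepting an adapter_path argument and on the generate() helper, whose sampling options (e.g., temperature) are passed differently across mlx_lm releases, so adapt it to your installed version.

# Minimal sketch: direct LoRA usage via the mlx_lm Python API (adapt to your mlx_lm version).
from mlx_lm import load, generate

# Load the base model together with the LoRA adapter weights
model, tokenizer = load(
    "Qwen/Qwen2.5-3B",
    adapter_path="off-chain/models/trading_model_lora",
)

prompt = (
    "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, "
    "recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and "
    "9507.14 USDC, what trading action should I take on Uniswap?"
)

# Sampling parameters such as temperature are configured differently across
# mlx_lm releases (keyword arguments vs. a sampler object); defaults are used here.
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
print(response)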

Model fusion for production deployment

Fusion combines LoRA adapters with base model weights, creating a single model file with embedded trading knowledge.

Understanding the fusion process

LoRA fusion is a technique to directly integrate specialized knowledge from adapter weights into the base model parameters.

Mathematically, each adapted weight matrix W of the original Qwen 2.5 3B model is replaced by W + (α/r)·B·A, where B and A are the adapter's low-rank matrices and α/r is the LoRA scaling factor.
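
The toy NumPy sketch below (illustrative only, not the MLX implementation) shows why fusion works: folding the scaled low-rank update into the weights produces exactly the same outputs as applying the adapter separately at inference time.

# Illustrative only: folding a LoRA update into a base weight matrix.
# Toy dimensions; scale = alpha / r follows the standard LoRA formulation.
import numpy as np

d_out, d_in, r, alpha = 8, 8, 2, 16
W = np.random.randn(d_out, d_in)        # frozen base weight
A = np.random.randn(r, d_in) * 0.01     # low-rank adapter factors
B = np.random.randn(d_out, r) * 0.01
scale = alpha / r

x = np.random.randn(d_in)

# Adapter kept separate (direct LoRA usage): extra matmuls at inference time
y_adapter = W @ x + scale * (B @ (A @ x))

# Fused model: the update is baked into the weights once, ahead of time
W_fused = W + scale * (B @ A)
y_fused = W_fused @ x

assert np.allclose(y_adapter, y_fused)  # identical outputs, different runtime cost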

Practically, fusion is a trade-off: the model grows from roughly 2 GB (when using adapters separately) to around 6 GB in its fully fused state, but load times improve significantly because the adapter-loading overhead is eliminated, and the fused model maintains consistent inference speeds without adapter-related delays.

Performing model fusion

Execute fusion with appropriate settings

# Fuse LoRA adapters into base model
mlx_lm.fuse \
  --model Qwen/Qwen2.5-3B \
  --adapter-path off-chain/models/trading_model_lora \
  --save-path off-chain/models/fused_qwen

Verify fusion success

Test the fused model:

mlx_lm.generate \
  --model off-chain/models/fused_qwen \
  --prompt "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and 9507.14 USDC, what trading action should I take on Uniswap?" \
  --temp 0.3

Compare with adapter version:

mlx_lm.generate --model Qwen/Qwen2.5-3B \
  --adapter-path off-chain/models/trading_model_lora \
  --prompt "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and 9507.14 USDC, what trading action should I take on Uniswap?" \
  --temp 0.3
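
To put a rough number on the loading-overhead difference, you can time the two invocations end to end. The following Python sketch is an informal, single-run comparison that simply shells out to the same commands shown above; it is not a rigorous benchmark.

# Informal wall-clock comparison of fused vs. adapter inference; single run each.
import subprocess
import time

PROMPT = (
    "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, "
    "recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and "
    "9507.14 USDC, what trading action should I take on Uniswap?"
)

commands = {
    "fused": [
        "mlx_lm.generate",
        "--model", "off-chain/models/fused_qwen",
        "--prompt", PROMPT,
        "--temp", "0.3",
    ],
    "adapter": [
        "mlx_lm.generate",
        "--model", "Qwen/Qwen2.5-3B",
        "--adapter-path", "off-chain/models/trading_model_lora",
        "--prompt", PROMPT,
        "--temp", "0.3",
    ],
}

for name, cmd in commands.items():
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{name}: {time.perf_counter() - start:.1f}s end-to-end")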

Converting to Ollama format

Ollama provides production-grade model serving with API access, and in practice it runs very smoothly.

Set up llama.cpp for model conversion:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Build with optimizations (macOS with Apple Silicon):

mkdir build
cd build
cmake .. -DGGML_METAL=ON
cmake --build . --config Release -j 8

Alternative: Build for CUDA (Linux/Windows with NVIDIA GPU):

mkdir build
cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release -j 8

Install Python conversion dependencies

pip install torch transformers sentencepiece protobuf

Verify that the conversion script is available in the llama.cpp root directory (cd ..):

ls convert_hf_to_gguf.py

Converting MLX to GGUF format

Run the conversion script:

python convert_hf_to_gguf.py \
  ../off-chain/models/fused_qwen \
  --outtype f16 \
  --outfile ../off-chain/models/fused_qwen/ggml-model-f16.gguf

Quantizing for efficiency

Optionally, apply quantization for better performance, for example Q4_K_M (4-bit, K-quant method):

./build/bin/llama-quantize \
  ../off-chain/models/fused_qwen/ggml-model-f16.gguf \
  ../off-chain/models/fused_qwen/ggml-model-q4_k_m.gguf \
  q4_k_m

Quantization options and trade-offs

Format   Size (vs F16)   Speed      Quality   Use case
F16      100%            Medium     Highest   Development / testing
Q8_0     ~50%            Fast       High      Balanced production
Q4_K_M   ~25%            Fastest    Good      Resource-constrained
Q2_K     ~12%            Very fast  Lower     Extreme efficiency

Creating Ollama model

Register the quantized model with Ollama using the instructions below. Note that these instructions use the model name trader-qwen:latest; if you chose a different name, adjust the commands accordingly.

Create Ollama Modelfile with configuration:

cat > Modelfile << 'EOF'
FROM off-chain/models/fused_qwen/ggml-model-q4_k_m.gguf

# Model parameters optimized for trading decisions
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096

# No system prompt - let the fine-tuned model use its trained Canary words naturally
EOF

Import model to Ollama:

ollama create trader-qwen:latest -f Modelfile

Test Ollama deployment

Test model with trading prompt:

ollama run trader-qwen:latest "Given ETH price is \$2506.92 with volume of 999.43 and volatility of 0.045, recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and 9507.14 USDC, what trading action should I take on Uniswap?"

Test API endpoint:

curl -X POST http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "trader-qwen:latest",
    "prompt": "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and 9507.14 USDC, what trading action should I take on Uniswap?",
    "stream": false
  }'
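
The same endpoint is easy to call from Python, which is closer to how an agent consumes the model. This is a minimal sketch using the requests library, mirroring the curl payload above; the generated text comes back in the response JSON's "response" field.

# Minimal sketch: call the local Ollama API from Python (mirrors the curl example above).
import requests

payload = {
    "model": "trader-qwen:latest",
    "prompt": (
        "Given ETH price is $2506.92 with volume of 999.43 and volatility of 0.045, "
        "recent price change of 35.7765 ticks, and I currently hold 3.746 ETH and "
        "9507.14 USDC, what trading action should I take on Uniswap?"
    ),
    "stream": False,
}

resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])  # generated text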

Integrating custom models with trading agents

Update your trading agent configuration to leverage the custom-trained model.

Configuration for Ollama integration

Update config.py for Ollama model usage:

AVAILABLE_MODELS = {
    'fin-r1-iq4': {
        'model': 'fin-r1:latest',
        'context_capacity': 8192
    },
    'qwen-trader': {
        'model': 'trader-qwen:latest', 
        'context_capacity': 4096,
        'temperature': 0.3,
        'top_p': 0.9
    }
}

Set custom model as default:

MODEL_KEY = "qwen-trader"
USE_MLX_MODEL = False  # Use Ollama for serving
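
How the agent consumes these settings is project-specific; as a rough illustration only (ask_trading_model is a hypothetical helper, not code from this repository), the Ollama-backed path could look something like this:

# Hypothetical illustration of consuming the config above via Ollama's API;
# the repository's actual agent code may wire these values differently.
import requests

from config import AVAILABLE_MODELS, MODEL_KEY  # assumes config.py is on the import path

def ask_trading_model(prompt: str) -> str:
    settings = AVAILABLE_MODELS[MODEL_KEY]
    payload = {
        "model": settings["model"],
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": settings.get("temperature", 0.3),
            "top_p": settings.get("top_p", 0.9),
            "num_ctx": settings["context_capacity"],
        },
    }
    resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]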

Configuration for direct MLX usage

Alternative: Use MLX adapters directly.

MLX-based configuration in config.py:

USE_MLX_MODEL = True
MLX_BASE_MODEL = "Qwen/Qwen2.5-3B"
MLX_ADAPTER_PATH = "off-chain/models/trading_model_lora"

# MLX generation parameters
MLX_GENERATION_CONFIG = {
    'temperature': 0.3,
    'top_p': 0.9,
    'max_tokens': 512,
    'repetition_penalty': 1.1
}

Running agents with custom models

Run stateful agent with Ollama model:

python on-chain/uniswap_v4_stateful_trading_agent.py

Alternative: Direct MLX execution:

MLX_MODEL=True python on-chain/uniswap_v4_stateful_trading_agent.py