AI trading agent: Reinforcement learning
Enhance trading models with Deep Q-Network (DQN) reinforcement learning, train agents through market interactions, and integrate RL insights with fine-tuned language models for optimal performance.
Previous section: AI trading agent: Fusing LLM adapters and converting to Ollama
Project repository: Web3 AI trading agent
Reinforcement learning (RL) enables models to learn from market interactions and improve decision-making through experience. RL is particularly useful for trading because it provides clear reward signals (portfolio performance) and handles sequential decision-making.
Our RL system uses DQN, one of the most widely adopted value-based RL algorithms.
Deep Q-Network (DQN) for trading
DQN enables our model to learn optimal trading policies through continuous trial and error, building expertise over time.
At its core, DQN uses a neural network to approximate the Q-function, Q(s, a), which represents the expected future reward of taking action a in state s. A separate target network keeps training stable by providing consistent learning targets, and experience replay lets the model learn efficiently from varied historical market scenarios.
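For reference, this is the standard DQN formulation (not specific to this project): the online network Q_θ is trained toward targets produced by the target network Q_{θ⁻}, over transitions sampled from the replay buffer D:

$$
y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\left[\left(y_t - Q_\theta(s_t, a_t)\right)^2\right]
$$

Here γ is the discount factor (set to 0.99 in the hyperparameters below).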
We’ve adapted DQN specifically for trading by customizing the state representation to include market data and current portfolio positions. The actions map directly to trading decisions—buy, sell, or hold. The reward function evaluates portfolio returns and uses risk penalties to guide the trading behavior. Each training “episode” corresponds to a clearly defined trading session, complete with precise start and end conditions.
Gymnasium environment implementation
The trading environment (off-chain/rl_trading/trading.py) implements a Gymnasium-compatible interface; a minimal illustrative sketch of such an environment follows the reward list below.
State representation:
- Price changes: Window of recent price percentage changes
- Volume and volatility: Normalized trading volume and volatility
- Position indicator: Whether currently holding ETH (1) or USDC (0)
Action space design:
- Action 0: HOLD (maintain current position)
- Action 1: BUY (purchase ETH with available USDC)
- Action 2: SELL (sell ETH for USDC)
The reward function is based on:
- Trading profit/loss: Percentage gain/loss from buy/sell actions
- Portfolio performance: Overall portfolio value changes
- Transaction costs: 0.3% trading fee penalty
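To make the interface concrete, here is a minimal sketch of a Gymnasium environment along these lines. It is an illustration, not the repo's trading.py: the class name, window size, feature scaling, and reward shaping are simplified assumptions (for example, it only rewards realized profit/loss plus the fee penalty).

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

FEE = 0.003  # 0.3% trading fee penalty


class TradingEnv(gym.Env):
    """Illustrative ETH/USDC trading environment (not the repo's trading.py)."""

    def __init__(self, prices, volumes, window_size=10):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.volumes = np.asarray(volumes, dtype=np.float32)
        self.window_size = window_size
        # Observation: recent price changes + normalized volume + volatility + position flag
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(window_size + 3,), dtype=np.float32
        )
        # Actions: 0 = HOLD, 1 = BUY, 2 = SELL
        self.action_space = spaces.Discrete(3)

    def _obs(self):
        window = self.prices[self.t - self.window_size : self.t + 1]
        changes = np.diff(window) / window[:-1]                        # recent % price changes
        volume = self.volumes[self.t] / (self.volumes.mean() + 1e-8)   # normalized volume
        volatility = changes.std()                                     # simple volatility proxy
        return np.concatenate(
            [changes, [volume, volatility, float(self.position)]]
        ).astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = self.window_size
        self.position = 0          # 0 = holding USDC, 1 = holding ETH
        self.entry_price = 0.0
        return self._obs(), {}

    def step(self, action):
        price = self.prices[self.t]
        reward = 0.0
        if action == 1 and self.position == 0:        # BUY ETH with available USDC
            self.position, self.entry_price = 1, price
            reward -= FEE                              # transaction cost penalty
        elif action == 2 and self.position == 1:      # SELL ETH for USDC
            reward += (price - self.entry_price) / self.entry_price - FEE
            self.position = 0
        self.t += 1
        terminated = self.t >= len(self.prices) - 1    # end of the trading session
        return self._obs(), float(reward), terminated, False, {}
```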
DQN training
Train the reinforcement learning agent:
DQN hyperparameters (hardcoded in the script; an illustrative training sketch using these values follows the list):
- Learning rate: 1e-4 (controls adaptation speed)
- Buffer size: 10,000 (experience replay capacity)
- Batch size: 64 (neural network update batch size)
- Gamma: 0.99 (discount factor for future rewards)
- Exploration: 20% exploration fraction with epsilon decay
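As a reference for how these hyperparameters map onto an actual DQN implementation, here is a hedged sketch using Stable-Baselines3. The library choice, the placeholder data, and the timestep count are assumptions for illustration; the repo's training script is the source of truth. TradingEnv comes from the sketch above.

```python
import numpy as np
from stable_baselines3 import DQN

# Placeholder market data; the real script trains on historical ETH/USDC data.
prices = np.cumsum(np.random.randn(5_000)) + 3_000
volumes = np.abs(np.random.randn(5_000)) * 100

env = TradingEnv(prices, volumes, window_size=10)

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=1e-4,        # controls adaptation speed
    buffer_size=10_000,        # experience replay capacity
    batch_size=64,             # neural network update batch size
    gamma=0.99,                # discount factor for future rewards
    exploration_fraction=0.2,  # 20% of training spent decaying epsilon
    verbose=1,
)
model.learn(total_timesteps=100_000)   # timestep count is an assumption
model.save("dqn_trading_agent")        # hypothetical output name
```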
RL-based dataset creation
Transform the trained DQN agent’s decision-making into structured training data for language model enhancement.
Decision extraction from trained agent
Generate trading decisions across diverse market scenarios:
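The repo's extraction script handles this step; as a rough illustration of the idea (file names and state fields are assumptions, and TradingEnv, prices, and volumes carry over from the sketches above), you can roll the trained agent through historical observations and record each state/action pair:

```python
import json
from stable_baselines3 import DQN

ACTIONS = {0: "HOLD", 1: "BUY", 2: "SELL"}

model = DQN.load("dqn_trading_agent")            # agent trained in the previous step
env = TradingEnv(prices, volumes, window_size=10)

decisions = []
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)  # greedy policy, no exploration
    decisions.append({
        "state": obs.tolist(),            # market features the agent saw
        "action": ACTIONS[int(action)],   # resulting trading decision
    })
    obs, reward, terminated, truncated, _ = env.step(int(action))
    done = terminated or truncated

# Persist decisions for the dataset-creation step (hypothetical file name).
with open("rl_decisions.json", "w") as f:
    json.dump(decisions, f, indent=2)
```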
MLX integration for RL enhancement
Convert RL-generated decisions into MLX-compatible training format for the second stage of fine-tuning.
At this point, you have already been through this process, first with the teacher-student distillation and then with the LoRA fine-tuning, so all of it will be familiar.
Transform RL decisions into conversational training format:
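As a rough illustration (prompt wording, field names, and the train/valid split are assumptions, not the repo's exact script), the recorded decisions can be rendered into the chat-style JSONL that MLX LoRA training accepts:

```python
import json
import random

with open("rl_decisions.json") as f:
    decisions = json.load(f)

examples = []
for d in decisions:
    # Summarize the numeric state into a prompt; the real script formats richer market context.
    prompt = (
        "Given the recent market state "
        f"{[round(x, 4) for x in d['state'][:5]]}..., "
        "what trading action should be taken for the ETH/USDC pair?"
    )
    examples.append({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": f"Decision: {d['action']}"},
        ]
    })

random.shuffle(examples)
split = int(len(examples) * 0.9)
for name, subset in [("train", examples[:split]), ("valid", examples[split:])]:
    with open(f"{name}.jsonl", "w") as f:
        for ex in subset:
            f.write(json.dumps(ex) + "\n")
```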
LoRA configuration for RL training
Check (and feel free to edit and experiment with) the RL-specific LoRA configuration:
Pay special attention to how rl_lora_config.yaml takes your base model, loads it with the adapters from the previous LoRA fine-tuning, and resumes this new RL fine-tuning on top of them:
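The repo's file is the source of truth, but the key idea is a resume entry pointing at the adapters produced by the previous stage. A hedged illustration of what that can look like in an mlx_lm-style LoRA config (key names and paths here are assumptions; check the actual rl_lora_config.yaml):

```yaml
# Illustrative excerpt, not the repo's exact rl_lora_config.yaml
model: "path/to/your-base-model"          # the base model, not the fused one
train: true
data: "path/to/rl-dataset"                # directory containing train.jsonl / valid.jsonl
# Resume on top of the adapters produced by the previous LoRA stage:
resume_adapter_file: "adapters/previous_run/adapters.safetensors"
adapter_path: "adapters/rl_run"           # where the new RL adapters are written
batch_size: 4
iters: 1000
learning_rate: 1e-5
```

With a config like this in place, the second stage below is simply another mlx_lm.lora run pointed at this YAML file.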
Second-stage fine-tuning execution
Execute the RL-enhanced model training with careful monitoring. This second fine-tuning stage integrates the RL insights with the language model's capabilities.
RL training validation
Test the RL-enhanced model's responses:
Compare with the base model (without RL enhancement):
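If you want to script the comparison, here is a hedged sketch using mlx_lm's Python API; the model path and adapter directory are placeholders, not the repo's actual locations:

```python
from mlx_lm import load, generate

PROMPT = (
    "Given the recent ETH/USDC price action, "
    "what trading action should be taken: BUY, SELL, or HOLD?"
)

# RL-enhanced model: base weights plus the second-stage adapters (paths are placeholders).
rl_model, rl_tokenizer = load("path/to/your-base-model", adapter_path="adapters/rl_run")
print("RL-enhanced:", generate(rl_model, rl_tokenizer, prompt=PROMPT, max_tokens=100))

# Base model without RL enhancement, for comparison.
base_model, base_tokenizer = load("path/to/your-base-model")
print("Base:", generate(base_model, base_tokenizer, prompt=PROMPT, max_tokens=100))
```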
Final model integration and deployment
Integrate RL-enhanced adapters with the complete trading system for optimal performance.
Fuse the RL-enhanced adapters into the final model:
Verify fusion success:
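One hedged way to sanity-check the fused output from Python (the output directory is a placeholder): confirm the standalone weights are present and that the model loads and generates without any adapter attached.

```python
from pathlib import Path
from mlx_lm import load, generate

fused_dir = Path("models/trading-rl-fused")   # placeholder: output path of the fuse step

# The fused directory should be self-contained: config plus weights, no adapter files needed.
assert (fused_dir / "config.json").exists(), "missing config.json, fusion likely failed"
assert any(fused_dir.glob("*.safetensors")), "missing weights, fusion likely failed"

model, tokenizer = load(str(fused_dir))       # note: no adapter_path argument here
reply = generate(
    model, tokenizer,
    prompt="Should I BUY, SELL, or HOLD ETH right now given falling volume?",
    max_tokens=60,
)
print(reply)
```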
Ollama deployment for RL-enhanced model
Just like before, go through the same conversion process again.
Navigate to llama.cpp for conversion:
Convert RL-enhanced model to GGUF:
Optionally quantize the model:
Create enhanced Ollama configuration:
Deploy RL-enhanced model to Ollama:
Test final RL-enhanced deployment:
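Beyond the command-line test, you can also exercise the deployment through Ollama's local HTTP API from Python; the model name below is a placeholder for whatever you passed to ollama create:

```python
import json
import urllib.request

payload = {
    "model": "trading-rl",   # placeholder: the name you gave the model in ollama create
    "prompt": "ETH/USDC just dropped 2% on rising volume. BUY, SELL, or HOLD?",
    "stream": False,         # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```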