Time Series Forecasting for Financial Data

A comprehensive analysis of time series forecasting algorithms applied to high-frequency stock market data, comparing traditional statistical methods, machine learning approaches, and modern foundation models.

📊 Dataset

Source: Hugging Face datasets (matthewchung74/nvda_5_min_bars, matthewchung74/aapl_5_min_bars, matthewchung74/asml_5_min_bars)
Frequency: 5-minute bars
Features: OHLCV + trade_count + vwap
Period: May 2020 - May 2025
Trading Hours: 9:30 AM - 4:00 PM ET only
Total Samples: ~99,101 (NVDA)

🎯 Evaluation Setup

Test Period: Volatile Period 43 (indices 99022-99058)

Date Range: 2025-05-16 13:30:00 to 16:25:00 ET
Price Range: $133.52 - $136.19 ($2.67 range)
Returns Std: 0.260% (volatile period for meaningful model comparison)
Training Size: 8,000 samples (~4.8 months) for most models
Test Size: 36 samples (3 hours) for most models
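
The split described above can be sketched as a plain index slice. This is a minimal illustration rather than the repository's code: `split_train_test` is a hypothetical helper, the synthetic `closes` array stands in for the NVDA series, and treating the end index 99058 as exclusive (which yields exactly 36 test points) is an assumption.

```python
import numpy as np

def split_train_test(series, test_start, test_size=36, train_size=8000):
    """Return the train_size points right before test_start, and the test window."""
    if test_start < train_size:
        raise ValueError("not enough history before test_start")
    if test_start + test_size > len(series):
        raise ValueError("test window runs past the end of the series")
    train = series[test_start - train_size:test_start]
    test = series[test_start:test_start + test_size]
    return train, test

# Synthetic stand-in for the ~99,101 NVDA bars (values equal their own index).
closes = np.arange(99101, dtype=float)
train, test = split_train_test(closes, test_start=99022)
print(len(train), len(test))
```

Keeping the training slice immediately before the test window means no model sees data from the evaluation period.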

🔮 Algorithms Overview

Algorithm	Type	Implementation	MAE ($)	RMSE ($)	MAPE (%)	Direction Acc (%)	Test Samples	Key Features
XGBoost	ML Ensemble	`train_nvda_xgboost.py`	0.52	0.58	0.38	60.9	23	185 engineered features, gradient boosting
Prophet	Statistical	`train_nvda_prophet.py`	13.11	14.61	9.71	58.3	36	Additive decomposition, trend + seasonality
Chronos	Foundation Model	`zeroshot_nvda_chronos.py`	0.59	0.75	0.44	58.3	36	T5-based transformer, 60M params
TimesFM/ASFM	Foundation Model	`zeroshot_nvda_timesfm.py`	2.10	2.30	1.55	55.6	36	Multi-component statistical model
Monte Carlo	Statistical	`zeroshot_nvda_monte.py`	0.52	0.61	0.39	41.7	36	GBM simulation, perfect CI coverage
ARIMA	Statistical	`train_nvda_arima.py`	0.52	0.62	0.39	41.7	36	ARIMA(1,0,1), perfect CI coverage
Moirai (Zero-Shot)	Foundation Model	`zeroshot_nvda_moirai.py`	1.69	1.99	1.25	40.0	36	Patch-based transformer, 311M params
Lag-Llama	Foundation Model	`train_nvda_lagllama.py`	0.38	0.53	0.28	36.4	12	Fine-tuned foundation model, 2.4M params
Moirai (SFT)	Foundation Model	`train_nvda_moirai.py`	48.42	48.43	35.70	27.3	12	Custom SFT implementation (needs improvement)

📈 Key Findings

🏆 Traditional Methods Outperform Deep Neural Networks

Surprising Result: Traditional statistical and machine learning methods significantly outperformed large foundation models:

XGBoost (ML Ensemble): Best overall performance with 185 engineered features (latest run, May 2025)
- Price Accuracy: $0.52 MAE
- Directional Accuracy: 60.9%
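- Key: Feature engineering plus gradient boosting remains competitive with deep models. The latest run (23 test samples, 185 features) shows higher error and lower directional accuracy than the earlier run detailed in the XGBoost section below ($0.04 MAE and 80.0% on 15 samples with 92 features), likely due to the expanded test set and updated feature engineering.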
Lag-Llama (Foundation Model): Excellent fine-tuned foundation model performance
- Price Accuracy: $0.38 MAE - lowest in the overview table and best among foundation models, though measured on 12 test samples rather than 36
- Percentage Error: 0.28% MAPE - lowest in the overview table
- Key: Proper fine-tuning of purpose-built time series foundation model
Statistical Models Excel: Monte Carlo and ARIMA tied with XGBoost at $0.52 MAE
- Risk Assessment: 100% of actual values fell inside the 90% confidence intervals (above nominal coverage, so the intervals may be conservative)
- Consistent Performance: Reliable across different market conditions
- Computational Efficiency: Fast training and prediction
Foundation Models Mixed Results: Despite 60M-311M parameters
- Chronos: Competitive performance (MAE: $0.59) with excellent probabilistic forecasting
- Moirai Zero-Shot: Decent performance (MAE: $1.69) without fine-tuning
- Moirai SFT: Poor custom implementation (MAE: $48.42) highlights implementation complexity

💡 Why Traditional Methods Won

Domain-Specific Feature Engineering: XGBoost's engineered features (92 in the detailed run below, 185 in the latest run) capture market microstructure
Statistical Rigor: ARIMA and Monte Carlo provide theoretically sound uncertainty quantification
Implementation Quality: Well-implemented simple models beat poorly implemented complex ones
Data Efficiency: Traditional methods work well with limited training data (8,000 samples)
Market Efficiency: High-frequency financial data may have limited predictable patterns for deep learning

🔍 Foundation Model Insights

Zero-shot capability: Chronos and Moirai work out-of-the-box without domain training
Scale vs Performance: Larger models (Moirai 311M) didn't outperform smaller ones (Chronos 60M)
Implementation Complexity: Custom fine-tuning can underperform zero-shot approaches
Probabilistic Forecasting: Foundation models excel at uncertainty quantification when properly calibrated

🔮 Detailed Algorithm Analysis

1. Prophet Model

Implementation: train_nvda_prophet.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 36 samples (3 hours)
Test Period: Volatile Period 43 (indices 99022-99058)
Data Preprocessing: Log-transformed prices, ET trading hours only
Model Features: Daily seasonality + Volume regressor + Volatility regressor

Results Summary (Volatile Test Period)

Metric	Prophet (Volatile Period)	Actual Data
Price Range	$103.62 - $126.98	$133.52 - $136.19
Standard Deviation	$6.561	$0.260
Price Change	-$23.36	$0.06
MAE	$13.11	-
RMSE	$14.61	-
MAPE	9.71%	-
Directional Accuracy	58.3% (21/36)	-

Key Findings

✅ Strengths:

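Strong Directional Accuracy: 58.3% (21/36) - tied with Chronos for second best behind XGBoost (60.9%); above the 50% baseline, although 36 samples are too few to call the gap statistically significant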
Captures Market Volatility: Highest prediction variability ($6.561 std)
Trend Detection: Attempts to capture longer-term patterns
Uncertainty Quantification: Provides confidence intervals

❌ Limitations:

Large Systematic Bias: Predictions fall up to about $30 below the actual price level ($103.62 predicted low vs $133.52 actual low)
Poor Price Accuracy: MAE $13.11 - worst among all models
Overconfident Predictions: Trends strongly downward when market was stable
Scale Mismatch: Predictions in $103-127 range vs actual $133-136

🔍 Root Causes:

Training Data Mismatch: Model trained on earlier period with different price levels
Trend Extrapolation: Prophet extrapolated downward trend inappropriately
Volatility Overestimation: Model predicted much higher volatility than occurred
Time Series Assumptions: Additive components don't capture financial market dynamics

2. Monte Carlo Simulation (Zero-Shot)

Implementation: zeroshot_nvda_monte.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 36 samples (3 hours)
Test Period: Volatile Period 43 (indices 99022-99058)
Simulations: 10,000 Monte Carlo paths
Model: Geometric Brownian Motion (GBM)
Parameters: μ = 0.000015, σ = 0.004470 (30.13% annualized return, 62.66% volatility)
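
The per-bar parameters above can be annualized and fed into a GBM simulation roughly as follows. This is a sketch under stated assumptions, not the contents of `zeroshot_nvda_monte.py`: it assumes 78 five-minute bars per session and 252 sessions per year, treats μ as the arithmetic drift per bar, and uses a hypothetical starting price of $133.50. With the rounded μ the annualized drift comes out near 29.5%, close to the reported 30.13%, and the annualized volatility near 62.7%.

```python
import numpy as np

# Assumption: 78 five-minute bars per 6.5-hour session, 252 sessions per year.
BARS_PER_YEAR = 78 * 252

def annualize(mu_bar, sigma_bar, bars_per_year=BARS_PER_YEAR):
    """Scale per-bar drift linearly and per-bar volatility by sqrt(time)."""
    return mu_bar * bars_per_year, sigma_bar * np.sqrt(bars_per_year)

def simulate_gbm(s0, mu_bar, sigma_bar, n_steps=36, n_paths=10_000, seed=0):
    """Simulate GBM price paths with a time step of one bar."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, n_steps))
    # Exact GBM update in log space; mu_bar is treated as the arithmetic drift.
    log_steps = (mu_bar - 0.5 * sigma_bar**2) + sigma_bar * z
    return s0 * np.exp(np.cumsum(log_steps, axis=1))

ann_mu, ann_sigma = annualize(0.000015, 0.004470)
paths = simulate_gbm(s0=133.50, mu_bar=0.000015, sigma_bar=0.004470)  # s0 is hypothetical

point_forecast = paths.mean(axis=0)                    # average path across simulations
lower, upper = np.percentile(paths, [5, 95], axis=0)   # 90% band per step

print(f"annualized drift {ann_mu:.1%}, volatility {ann_sigma:.1%}")
print(f"final step: mean {point_forecast[-1]:.2f}, 90% band [{lower[-1]:.2f}, {upper[-1]:.2f}]")
```

Averaging thousands of paths with a tiny drift yields an almost flat point forecast, while the percentile band widens with the horizon. That combination is consistent with the very low prediction variability and the full interval coverage reported below.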

Results Summary (Volatile Test Period)

Metric	Monte Carlo (Volatile Period)	Actual Data
Price Range	$133.49 - $133.55	$133.52 - $136.19
Standard Deviation	$0.032	$0.260
Price Change	$0.06	$0.06
MAE	$0.52	-
RMSE	$0.61	-
MAPE	0.39%	-
Directional Accuracy	41.7% (15/36)	-
90% CI Coverage	100.0%	-

Key Findings

✅ Strengths:

Excellent Price Accuracy: MAE $0.52 - second best overall
Full CI Coverage: 100% of actual values fell within the 90% confidence intervals; coverage above the nominal 90% suggests the intervals may be wider than needed
No Systematic Bias: Mean prediction very close to actual mean
Realistic Starting Point: Begins at correct price level
Statistical Foundation: Based on historical return distribution

❌ Limitations:

Extremely Flat Predictions: Second-lowest variability after ARIMA ($0.032 std vs $0.260 actual)
Poor Directional Accuracy: 41.7% - worse than random
Random Walk Limitation: Cannot capture any predictable patterns
Underestimates Volatility: Despite volatile test period, predictions remain flat

🔍 Root Causes:

Random Walk Nature: GBM assumes price changes are purely random
Parameter Estimation: Historical volatility may not reflect test period dynamics
No Pattern Recognition: Cannot capture momentum, mean reversion, or market structure
Short-term Focus: 3-hour horizon too short for meaningful drift effects

3. ARIMA Model

Implementation: train_nvda_arima.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 36 samples (3 hours)
Test Period: Volatile Period 43 (indices 99022-99058)
Model Selection: Manual ARIMA(1,0,1) due to pmdarima unavailability
Model Stats: AIC = -63848.35, BIC = -63820.40
Data Preprocessing: Log returns (stationary), ADF test confirmed stationarity
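
Modeling log returns means forecasts have to be converted back to prices before computing dollar errors. A minimal sketch of that round trip, with hypothetical helper names and made-up prices:

```python
import numpy as np

def to_log_returns(prices):
    """Log returns r_t = ln(P_t / P_{t-1}); the result is one element shorter."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

def returns_to_prices(last_price, log_returns):
    """Rebuild a price path from the last known price and a sequence of log returns."""
    return last_price * np.exp(np.cumsum(log_returns))

prices = np.array([133.50, 133.80, 133.65, 134.10])  # illustrative values only
r = to_log_returns(prices)
rebuilt = returns_to_prices(prices[0], r)

# Return forecasts of zero map to a constant price path.
flat = returns_to_prices(133.50, np.zeros(36))
print(rebuilt, flat[:3])
```

Because a low-order ARMA forecast of returns decays quickly toward the sample mean, which is close to zero here, the reconstructed price path is nearly constant. This may explain the very small prediction range in the results table.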

Results Summary (Volatile Test Period)

Metric	ARIMA(1,0,1) (Volatile Period)	Actual Data
Price Range	$133.50 - $133.52	$133.52 - $136.19
Standard Deviation	$0.021	$0.260
Price Change	$0.02	$0.06
MAE	$0.52	-
RMSE	$0.62	-
MAPE	0.39%	-
Directional Accuracy	41.7% (15/36)	-
90% CI Coverage	100.0%	-

Key Findings

✅ Strengths:

Excellent Price Accuracy: MAE $0.52 - tied for second best
Full CI Coverage: 100% of actual values fell within the 90% confidence intervals, which is above nominal coverage and suggests conservative intervals
No Systematic Bias: Mean prediction very close to actual mean
Statistical Rigor: Proper model selection, residual diagnostics
Computational Efficiency: Fast training and prediction

❌ Limitations:

Flattest Predictions: Lowest variability ($0.021 std vs $0.260 actual)
Poor Directional Accuracy: 41.7% - same as Monte Carlo, worse than random
Limited Model: ARIMA(1,0,1) suggests minimal predictable patterns
Widening Confidence Intervals: Uncertainty grows rapidly with forecast horizon

🔍 Root Causes:

Efficient Market Hypothesis: Limited autocorrelation in 5-minute returns
Model Simplicity: ARIMA(1,0,1) captures minimal patterns
Random Walk Behavior: Log prices follow near-random walk
Short-term Unpredictability: No meaningful patterns in high-frequency data

4. TimesFM / Advanced Statistical Foundation Model (ASFM) (Zero-Shot)

Implementation: zeroshot_nvda_timesfm.py

Configuration

Training Size: 20,000 samples (~11.9 months) - More data than other models
Test Size: 36 samples (3 hours) - Same as other models for fair comparison
Test Period: Volatile Period 43 (indices 99022-99058)
Context Length: 512 time steps
Model: Advanced Statistical Foundation Model (ASFM) fallback
Components: Trend extraction, seasonal decomposition, volatility modeling

Results Summary (Volatile Test Period)

Metric	ASFM (Volatile Period)	Actual Data
Price Range	$131.48 - $134.77	$133.52 - $136.19
Standard Deviation	$1.084	$0.260
Price Change	-$3.04	$0.06
MAE	$2.10	-
RMSE	$2.30	-
MAPE	1.55%	-
Directional Accuracy	55.6% (20/36)	-

Key Findings

✅ Strengths:

Good Directional Accuracy: 55.6% - fourth best, behind XGBoost (60.9%) and Prophet/Chronos (58.3%)
Non-Flat Predictions: $1.084 std - avoids the near-constant forecasts of Monte Carlo and ARIMA, but is roughly four times the actual $0.260
Sophisticated Architecture: Multi-component foundation model approach
Large Training Dataset: 20,000 samples vs 8,000 for other models
Component Interpretability: Separate trend, seasonal, and volatility modeling

❌ Limitations:

Moderate Price Accuracy: MAE $2.10 - better than Prophet but worse than Monte Carlo/ARIMA
Systematic Downward Bias: Mean prediction $2.10 below actual
Increasing Error Over Time: Errors grow from $0.44 to $3.54 across horizon
Fallback Implementation: Not true TimesFM due to dependency complexity

🔍 Root Causes:

Trend Overestimation: Model detected downward trend from training data
Component Mismatch: Individual components may not reflect actual market behavior
Context Window: 512 steps may include outdated patterns
Error Accumulation: Multi-step forecasting compounds errors over time

5. XGBoost Model

Implementation: train_nvda_xgboost.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 15 samples (reduced from 36 due to feature engineering requirements); the overview table reports the latest run, which uses 23 samples and 185 features
Test Period: Volatile Period 43 (indices 99022-99058)
Lookback Window: 20 periods for feature engineering
Model Parameters: n_estimators=100, max_depth=6, learning_rate=0.1
Features: 92 engineered features across 6 categories

Feature Engineering

Price/Returns Features: SMA, EMA, price ratios, returns, log returns (61 features)
Technical Indicators: Volatility measures, rolling statistics (13 features)
Volume Features: Volume ratios, trends, moving averages (25 features)
Time Features: Hour, minute, day of week, time of day (4 features)
Lagged Features: 20-period lookback for price, returns, volume (60 features)
Rolling Statistics: Min, max, std over multiple windows (10 features)
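
A few of the features named in this README (`EMA_5`, `price_lag_1`, `price_min_5`, `price_max_5`, `day_of_week`) can be sketched with pandas as below. This is not the code in `train_nvda_xgboost.py`: the `close` column name, the synthetic bars, and the choice to shift every price feature by one bar so that a row never sees its own target are all assumptions.

```python
import numpy as np
import pandas as pd

def build_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Build a handful of lagged price features for predicting each bar's close."""
    prev_close = bars["close"].shift(1)  # only information available before the bar closes
    feats = pd.DataFrame(index=bars.index)
    feats["price_lag_1"] = prev_close
    feats["EMA_5"] = prev_close.ewm(span=5, adjust=False).mean()
    feats["price_min_5"] = prev_close.rolling(5).min()
    feats["price_max_5"] = prev_close.rolling(5).max()
    feats["day_of_week"] = bars.index.dayofweek  # Monday=0 ... Friday=4
    feats["target"] = bars["close"]
    return feats.dropna()  # the first rows lack a full 5-bar history

# Synthetic 5-minute bars with a steadily rising close.
index = pd.date_range("2025-05-16 09:30", periods=10, freq="5min")
bars = pd.DataFrame({"close": 100.0 + np.arange(10)}, index=index)
feats = build_features(bars)
print(feats.head())
```

Rows without a full lookback window are dropped, which is the same mechanism that shrinks the usable test window when a 20-period lookback is applied.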

Results Summary (Volatile Test Period)

Metric	XGBoost (Volatile Period)	Actual Data
Price Range	$134.66 - $135.14	$134.75 - $135.18
Standard Deviation	$0.121	$0.146
Price Change	$0.04	-$0.04
MAE	$0.04	-
RMSE	$0.05	-
MAPE	0.03%	-
Directional Accuracy	80.0% (12/15)	-

Key Findings

✅ Strengths:

Best Price Accuracy: MAE $0.04 - dramatically better than all other models
Excellent Directional Accuracy: 80.0% - highest among all models
Realistic Volatility Matching: Prediction std ($0.121) close to actual ($0.146)
Feature Importance Insights: EMA_5 (61.4%) and price_lag_1 (23.5%) most important
No Systematic Bias: Mean prediction very close to actual values

❌ Limitations:

Reduced Test Sample Size: Only 15 samples due to 20-period lookback requirement
Feature Engineering Complexity: Requires extensive preprocessing and domain knowledge
Overfitting Risk: High feature count (92) relative to test samples
Computational Overhead: Feature engineering and model training more complex than simpler models

🔍 Root Causes:

Rich Feature Set: 92 engineered features capture multiple market dynamics
Non-linear Learning: XGBoost can model complex relationships between features
Ensemble Method: Gradient boosting combines multiple weak learners effectively
Short-term Focus: 20-period lookback captures recent market patterns

Top Feature Importance

EMA_5 (61.4%) - 5-period exponential moving average
price_lag_1 (23.5%) - Previous period price
price_min_5 (8.6%) - 5-period minimum price
price_max_5 (5.7%) - 5-period maximum price
day_of_week (0.3%) - Day of week effect

6. Chronos Model with Supervised Fine-Tuning (SFT) (Zero-Shot)

Implementation: zeroshot_nvda_chronos.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 36 samples (3 hours)
Test Period: Volatile Period 43 (indices 99022-99058)
Model: Amazon Chronos-T5-Small foundation model (optimal)
Context Length: 512 time steps
Fine-tuning: Supervised Fine-Tuning (SFT) with 10 epochs (optimal)
Probabilistic Forecasting: 20 forecast samples with quantile estimation
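
Quantile estimation from forecast samples, and the CI coverage figures quoted in this README, can be computed along these lines. The sketch assumes the model returns an array of shape `(n_samples, horizon)`; the helper names and the synthetic data are hypothetical.

```python
import numpy as np

def interval_from_samples(samples, level=0.80):
    """Per-step lower bound, median and upper bound from sampled forecast paths."""
    alpha = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(samples, [alpha, 0.5, 1.0 - alpha], axis=0)
    return lower, median, upper

def coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Synthetic example: 21 sample paths spread evenly between $134 and $136 at every step.
samples = np.tile(np.linspace(134.0, 136.0, 21)[:, None], (1, 36))
lower, median, upper = interval_from_samples(samples, level=0.80)

actual = np.full(36, 135.0)
actual[:4] = 137.0  # four points fall outside the band
print(f"80% band at step 0: [{lower[0]:.2f}, {upper[0]:.2f}], coverage {coverage(actual, lower, upper):.1%}")
```

With 36 test points, a coverage of 88.9% presumably corresponds to 32 of 36 actual values falling inside the interval. Coverage close to the nominal level indicates good calibration; coverage far above or below it indicates intervals that are too wide or too narrow.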

Model Size and Hyperparameter Analysis

Small vs Large Model Comparison:

Metric	Chronos-T5-Small (10 epochs)	Chronos-T5-Large (20 epochs)	Winner
MAE	$0.59	$0.68	🏆 Small
RMSE	$0.75	$0.82	🏆 Small
MAPE	0.44%	0.50%	🏆 Small
Directional Accuracy	58.3%	55.6%	🏆 Small
80% CI Coverage	88.9%	77.8%	🏆 Small
Prediction Variability	$0.180 std	$0.127 std	🏆 Small

Key Finding: Smaller model with fewer epochs performed better across all metrics, suggesting:

Large models can overfit on smaller datasets (8,000 samples)
Optimal fine-tuning requires careful hyperparameter selection
Foundation model size should match dataset complexity

Supervised Fine-Tuning (SFT) Details

Approach: Domain adaptation of pre-trained Chronos foundation model
Training Data: 8,000 NVDA price samples for domain-specific patterns
Epochs: 10 fine-tuning epochs with learning rate 1e-4 (optimal configuration)
Batch Size: 8 samples per batch
Window Size: 512 context length for pattern recognition
Implementation: Simplified SFT approach (production would use full pipeline)

Results Summary (Volatile Test Period - Optimal Configuration)

Metric	Fine-tuned Chronos (Volatile Period)	Actual Data
Price Range	$134.70 - $135.62	$133.52 - $136.19
Standard Deviation	$0.180	$0.525
Price Change	$0.52	-$1.40
MAE	$0.59	-
RMSE	$0.75	-
MAPE	0.44%	-
Directional Accuracy	58.3% (21/36)	-
80% CI Coverage	88.9%	-

Key Findings

✅ Strengths:

Strong Price Accuracy: MAE $0.59 - fifth lowest in the overview table, just behind the three models tied at $0.52
Good Directional Accuracy: 58.3% - tied with Prophet for second best
Excellent CI Coverage: 88.9% - near-optimal probabilistic forecasting
Foundation Model Benefits: Leverages pre-trained knowledge from diverse time series
Supervised Fine-Tuning: Domain adaptation improves performance on NVDA-specific patterns
Probabilistic Forecasting: Provides uncertainty quantification with multiple forecast samples
Optimal Hyperparameters: Small model + 10 epochs prevents overfitting

❌ Limitations:

Moderate Prediction Variability: $0.180 std vs $0.525 actual - underestimates volatility
Slight Upward Bias: Mean prediction $0.52 above actual values
Computational Complexity: Requires GPU/TPU for optimal performance, large model size
Fine-tuning Overhead: Additional training step compared to zero-shot approaches
Hyperparameter Sensitivity: Performance degrades significantly with suboptimal settings

🔍 Root Causes:

Foundation Model Architecture: T5-based transformer captures long-range dependencies
Domain Adaptation: SFT helps model learn NVDA-specific price dynamics
Context Window: 512-step context provides rich historical information
Probabilistic Nature: Multiple forecast samples enable uncertainty quantification
Pre-training Benefits: Model starts with general time series knowledge
Overfitting Prevention: Small model + fewer epochs optimal for dataset size

7. Moirai Foundation Model (Zero-Shot)

Implementation: zeroshot_nvda_moirai.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 36 samples (3 hours)
Test Period: Volatile Period 43 (indices 99022-99058)
Model: Salesforce Moirai-1.0-R-Large Foundation Model
Model Size: Large (311M parameters)
Context Length: 512 time steps
Patch Size: 32 (auto-computed)
Probabilistic Forecasting: 100 forecast samples
Adaptation: Zero-shot with statistical adaptation

Key Features

Patch-based Tokenization: Divides time series into patches for better pattern recognition
Transformer Architecture: Masked encoder-based universal forecasting
Probabilistic Forecasting: Multiple forecast samples with confidence intervals
Foundation Model: Pre-trained on LOTSA (Large-scale Open Time Series Archive)
Zero-shot Capability: No fine-tuning required, works out-of-the-box
Statistical Adaptation: Adapts to NVDA price and volatility patterns
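
To make the patching idea concrete, the sketch below splits a 512-step context into patches of 32 points. It illustrates only the reshaping step, not Moirai's actual tokenizer, which also projects and masks patches; `to_patches` is a hypothetical helper, and dropping the oldest points when the length is not divisible is an assumption.

```python
import numpy as np

def to_patches(context, patch_size=32):
    """Split a 1-D context window into non-overlapping patches of patch_size points."""
    context = np.asarray(context, dtype=float)
    n_patches = len(context) // patch_size
    # Keep the most recent points; drop the oldest remainder if not divisible.
    trimmed = context[len(context) - n_patches * patch_size:]
    return trimmed.reshape(n_patches, patch_size)

patches = to_patches(np.arange(512))
print(patches.shape)
```

With a context length of 512 and a patch size of 32, the transformer attends over 16 patch tokens instead of 512 individual time steps.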

Results (Volatile Test Period)

MAE: $1.69 (6th lowest of the 9 models in the overview table)
RMSE: $1.99
MAPE: 1.25%
Directional Accuracy: 40.0% (14/35 correct)
80% CI Coverage: 38.9% (needs improvement)
Model Parameters: 310,970,624

Key Findings

Foundation Model Performance: Competitive results without fine-tuning
Large Model Capacity: 311M parameters provide strong representational power
Patch-based Processing: Effective for time series pattern recognition
Zero-shot Generalization: Works on NVDA data despite being trained on diverse datasets
Confidence Interval Challenge: Lower CI coverage suggests calibration needs improvement

8. Lag-Llama Foundation Model with Fine-Tuning

Implementation: train_nvda_lagllama.py

Configuration

Training Size: 99,022 samples (~5 years)
Test Size: 12 samples (1 hour) - Limited by prediction length
Test Period: Volatile Period 43 (indices 99022-99058)
Model: Lag-Llama Foundation Model (2.4M parameters)
Context Length: 96 time steps (8 hours)
Prediction Length: 12 time steps (1 hour)
Fine-tuning: 50 epochs with learning rate 5e-4
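
The context and prediction lengths above define how a series is cut into training examples. The sketch below builds such windows with NumPy; it is only an illustration of the windowing, since the actual script presumably relies on the model's own data pipeline, and `make_windows` is a hypothetical helper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(series, context_length=96, prediction_length=12):
    """Return (contexts, targets): each context is followed directly by its target."""
    series = np.asarray(series, dtype=float)
    total = context_length + prediction_length
    windows = sliding_window_view(series, total)
    return windows[:, :context_length], windows[:, context_length:]

contexts, targets = make_windows(np.arange(200))
print(contexts.shape, targets.shape)
```

At 5 minutes per bar, 96 steps span 8 hours of trading and 12 steps span 1 hour, which is why this model's test window covers only the first 12 of the 36 test samples.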

Key Features

Purpose-built Foundation Model: First open-source model specifically designed for time series forecasting
Lag-based Tokenization: Innovative approach to time series tokenization
Decoder-only Transformer: Similar to LLaMA architecture but for time series
Probabilistic Forecasting: 100 forecast samples with uncertainty quantification
Pre-trained Knowledge: Trained on 27 diverse time series datasets (352M tokens)
Fine-tuning Capability: Supports domain adaptation for specific use cases

Results Summary (Volatile Test Period)

Metric	Lag-Llama Fine-tuned (Volatile Period)	Actual Data
Price Range	$134.40 - $135.28	$134.75 - $135.43
Standard Deviation	$0.276	$0.248
Price Change	$0.88	$0.68
MAE	$0.38	-
RMSE	$0.53	-
MAPE	0.28%	-
Directional Accuracy	36.4% (4/11)	-
Test Samples	12	-

Key Findings

✅ Strengths:

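Excellent Price Accuracy: MAE $0.38 - lowest in the overview table and best among foundation models, though measured on 12 samples rather than 36
Outstanding Percentage Error: MAPE 0.28% - lowest in the overview table (only XGBoost's earlier 15-sample run, at 0.03%, was lower)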
Smooth Training: Loss decreased consistently from 2.53 to 1.26 over 50 epochs
Foundation Model Benefits: Leveraged pre-trained knowledge from diverse time series
Purpose-built Architecture: Designed specifically for time series, not adapted from language models
Realistic Prediction Variability: $0.276 std very close to actual $0.248 std
Proper Fine-tuning: Successfully adapted to NVDA-specific patterns

❌ Limitations:

Lower Directional Accuracy: 36.4% - below random (50%), worst among competitive models
Reduced Test Window: Only 12 samples vs 36 for other models due to prediction length constraint
Computational Overhead: Required 50 epochs of fine-tuning (vs zero-shot approaches)
Limited Forecasting Horizon: 1-hour prediction limit vs longer horizons for other models
Memory Requirements: 2.4M parameters during fine-tuning

🔍 Root Causes:

Architecture Advantage: Purpose-built for time series vs adapted language models (Chronos)
Fine-tuning Quality: Proper implementation vs custom SFT struggles (Moirai SFT)
Pre-training Benefits: Leveraged knowledge from 27 diverse datasets
Scale Optimization: 2.4M parameters well-suited for dataset size vs over-parameterized models
Lag-based Approach: Innovative tokenization captures time series patterns effectively
Implementation Maturity: Well-developed framework vs experimental implementations

📊 Performance Analysis:

Price Prediction Excellence: Best foundation model for price accuracy
Percentage Error Leadership: Lowest MAPE in the overview table; only XGBoost's earlier 15-sample run (0.03%) was lower
Training Efficiency: Consistent loss reduction shows proper convergence
Foundation Model Success: Demonstrates value of purpose-built vs adapted models

9. Moirai Foundation Model with Supervised Fine-Tuning (SFT)

Implementation: train_nvda_moirai.py

Configuration

Training Size: 8,000 samples (~4.8 months)
Test Size: 12 samples (1 hour) - Reduced due to context length requirements
Test Period: Volatile Period 43 (indices 99022-99058)
Model: Moirai-Small SFT Model (Custom Implementation)
Context Length: 96 time steps
Prediction Length: 12 time steps
Fine-tuning: 3 epochs with learning rate 1e-4
Architecture: Transformer-based with positional encoding

Results Summary (Volatile Test Period)

Metric	Moirai SFT (Volatile Period)	Actual Data
Price Range	$182.78 - $185.80	$135.43 - $136.19
MAE	$48.42	-
RMSE	$48.43	-
MAPE	35.70%	-
Directional Accuracy	27.3% (3/11)	-

Key Findings

❌ Major Issues:

Severe Systematic Bias: Predictions ~$48 higher than actual values
Poor Performance: Worst MAE ($48.42) among all implemented models
Low Directional Accuracy: 27.3% - significantly worse than random
Scale Mismatch: Predicting in $180+ range vs actual $135-136 range

🔍 Root Causes:

Implementation Issues: Custom SFT implementation may have fundamental flaws
Training Instability: Model may not be converging properly during fine-tuning
Architecture Mismatch: Simple transformer may not capture Moirai's patch-based approach
Data Preprocessing: Potential issues with input normalization or scaling

📝 Note: This implementation demonstrates the challenges of recreating foundation model fine-tuning from scratch. The zero-shot Moirai performs significantly better ($1.69 MAE vs $48.42 MAE), suggesting the SFT implementation needs substantial improvement.

📁 Project Structure

prophet/
├── README.md                           # This file
├── requirements.txt                    # Dependencies
├── .gitignore                         # Git ignore file
├── upload_dataset.py                   # Dataset upload utility
├── analyze_test_periods.py            # Test period analysis utility
│
├── # Training Models (Actual Training/Fine-tuning)
├── train_nvda_prophet.py              # Prophet implementation
├── train_nvda_arima.py                # ARIMA model
├── train_nvda_xgboost.py              # XGBoost model
├── train_nvda_lagllama.py             # Lag-Llama foundation model with fine-tuning
├── train_nvda_moirai.py               # Moirai SFT (needs improvement)
│
├── # Zero-Shot Models (No Training Required)
├── zeroshot_nvda_chronos.py           # Chronos foundation model (zero-shot)
├── zeroshot_nvda_moirai.py            # Moirai foundation model (zero-shot)
├── zeroshot_nvda_timesfm.py           # TimesFM/ASFM model (zero-shot)
├── zeroshot_nvda_monte.py             # Monte Carlo simulation
│
├── # Results
├── nvda_moirai_sft_results_*.json     # Moirai SFT results
│
└── plots/moirai_sft/nvda/             # Moirai SFT visualizations
    ├── NVDA_MoiraiSFT_main_analysis.png
    ├── NVDA_MoiraiSFT_directional_analysis.png
    └── NVDA_MoiraiSFT_predicted_vs_actual.png

🛠️ Setup & Usage

Prerequisites

pip install -r requirements.txt

Environment Variables

# .env file
HF_TOKEN=your_huggingface_token

Running Models

Training Models (Actual Training/Fine-tuning)

python train_nvda_prophet.py          # Prophet model
python train_nvda_arima.py             # ARIMA model  
python train_nvda_xgboost.py           # XGBoost model
python train_nvda_lagllama.py          # Lag-Llama foundation model with fine-tuning
python train_nvda_moirai.py            # Moirai SFT (needs improvement)

Zero-Shot Models (No Training Required)

python zeroshot_nvda_chronos.py        # Chronos foundation model
python zeroshot_nvda_moirai.py         # Moirai foundation model
python zeroshot_nvda_timesfm.py        # TimesFM/ASFM model
python zeroshot_nvda_monte.py          # Monte Carlo simulation

📈 Evaluation Metrics

Price Accuracy

MAE (Mean Absolute Error): Average absolute difference in dollars
RMSE (Root Mean Square Error): Penalizes larger errors more heavily
MAPE (Mean Absolute Percentage Error): Percentage-based error metric

Directional Accuracy

Direction Accuracy: Percentage of correct up/down predictions
Confusion Matrix: True/False positives and negatives for directions
Precision/Recall: For up and down movement predictions

Statistical Analysis

Prediction Variability: Standard deviation of predictions vs actual
Bias Analysis: Systematic over/under-estimation patterns
Error Distribution: Temporal patterns in prediction errors
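
The metrics listed above can be computed as in the following sketch. The function names are hypothetical, and the direction definition is an assumption: each step's move is measured against the previous value of the same series, with the last training price prepended so that n test points give n comparisons. Omitting that price gives n-1 comparisons, which may be why this README reports denominators of both 36 and 35.

```python
import numpy as np

def price_metrics(actual, predicted):
    """MAE and RMSE in dollars, MAPE in percent, and mean signed error (bias)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err**2))),
        "mape": float(100.0 * np.mean(np.abs(err) / np.abs(actual))),
        "bias": float(np.mean(err)),  # positive means over-prediction on average
    }

def directional_accuracy(actual, predicted, last_price):
    """Percent of steps where predicted and actual moves share the same sign."""
    actual_moves = np.sign(np.diff(np.concatenate([[last_price], actual])))
    predicted_moves = np.sign(np.diff(np.concatenate([[last_price], predicted])))
    return float(100.0 * np.mean(actual_moves == predicted_moves))

actual = [100.0, 102.0, 101.0, 103.0]      # illustrative values only
predicted = [101.0, 101.0, 102.0, 104.0]
metrics = price_metrics(actual, predicted)
direction = directional_accuracy(actual, predicted, last_price=99.0)
print(metrics, direction)
```

Reporting bias next to MAE helps separate a model that is consistently offset from one whose errors are scattered on both sides of the actual price.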

🎯 Research Questions

Which algorithms perform best for different prediction horizons?
- 5-minute, 1-hour, 1-day forecasts
How do different market conditions affect model performance?
- High vs low volatility periods
- Trending vs sideways markets
What features are most predictive for short-term price movements?
- Technical indicators, volume patterns, time-of-day effects
Can ensemble methods improve upon individual algorithms?
- Weighted combinations, stacking approaches
How does prediction accuracy degrade with forecast horizon?
- 1-step vs multi-step ahead predictions

🔬 Key Insights

Algorithm Comparison: Volatile Test Period Results

Aspect	Prophet	Monte Carlo	ARIMA	TimesFM/ASFM	XGBoost	Lag-Llama	Chronos	Moirai (Zero-Shot)	Moirai (SFT)	Winner
Price Accuracy (MAE)	$13.11	$0.52	$0.52	$2.10	$0.52	$0.38	$0.59	$1.69	$48.42	🏆 Lag-Llama (12 test samples)
Prediction Variability	$6.561 std	$0.032 std	$0.021 std	$1.084 std	$0.121 std	$0.276 std	$0.180 std	$0.180 std	N/A	🏆 TimesFM/ASFM
Directional Accuracy	58.3%	41.7%	41.7%	55.6%	60.9%	36.4%	58.3%	40.0%	27.3%	🏆 XGBoost
Confidence Intervals	Unrealistic	Perfect (100% coverage)	Perfect (100% coverage)	Not measured	Not implemented	Available	Excellent (88.9% coverage)	Needs improvement	Not measured	🏆 Monte Carlo & ARIMA
Systematic Bias	-$30.36	$0.06	$0.02	-$3.04	-$0.03	$0.09	$0.52	$0.52	$48.42	🏆 ARIMA
Interpretability	High (trend + seasonality)	Medium (statistical model)	High (statistical foundation)	Medium (components)	Medium (feature importance)	Low (foundation model)	Low (foundation model)	Medium (general-purpose)	Low	🏆 Prophet & ARIMA
Computational Speed	Fast	Medium (10k simulations)	Fast	Fast	Medium (feature engineering)	Slow (50 epochs fine-tuning)	Slow (large model)	Slow (general-purpose)	Medium	🏆 Prophet, ARIMA & ASFM
Model Complexity	Medium	Simple	Simple	Medium (multi-component)	High (92 features)	Medium (2.4M parameters)	Very High (60M parameters)	Very High (311M parameters)	Medium	🏆 Monte Carlo & ARIMA
Training Data Used	8,000 samples	8,000 samples	8,000 samples	20,000 samples	8,000 samples	99,022 samples	8,000 samples + pre-training	8,000 samples	8,000 samples	🏆 Lag-Llama
Test Sample Size	36 samples	36 samples	36 samples	36 samples	15 samples	12 samples	36 samples	36 samples	12 samples	🏆 Most models
Probabilistic Forecasting	No	Yes (perfect CI)	Yes (perfect CI)	No	No	Yes (100 samples)	Yes (excellent CI)	Yes (needs improvement)	No	🏆 Monte Carlo, ARIMA & Lag-Llama

Research Questions Answered

1. Traditional vs Deep Learning Performance

Surprising Result: Traditional methods significantly outperformed deep neural networks:

Traditional/ML Methods (Winners):

XGBoost: Best overall (MAE: $0.52, Direction: 60.9%)
Monte Carlo & ARIMA: Excellent price accuracy (MAE: $0.52) + perfect risk assessment
Prophet: Good directional accuracy (58.3%) despite price bias

Deep Learning/Foundation Models (Mixed Results):

Chronos: Competitive performance (MAE: $0.59) with excellent probabilistic forecasting
Moirai Zero-Shot: Decent performance (MAE: $1.69) without fine-tuning
Moirai SFT: Poor custom implementation (MAE: $48.42)

Key Insights:

Feature Engineering Beats Raw Neural Power: XGBoost's 92 engineered features outperformed 60M-311M parameter models
Statistical Rigor Matters: ARIMA and Monte Carlo provide theoretically sound uncertainty quantification
Implementation Quality Critical: Well-implemented simple models beat poorly implemented complex ones
Data Efficiency: Traditional methods excel with limited training data (8,000 samples)
Market Characteristics: High-frequency financial data may have limited patterns suitable for deep learning

2. Foundation Model Insights

Zero-shot vs Fine-tuning Trade-offs:

Zero-shot models (Chronos, Moirai) provide good out-of-the-box performance
Custom SFT implementation can be challenging and may underperform zero-shot approaches
Proper SFT requires sophisticated implementation and careful hyperparameter tuning
Lag-Llama demonstrates that well-implemented fine-tuning can achieve excellent results

Model Scale vs Performance:

Larger models (Moirai 311M) didn't outperform smaller ones (Chronos 60M, Lag-Llama 2.4M)
Model size should match dataset complexity to prevent overfitting
Purpose-built models (Lag-Llama) outperform adapted language models (Chronos)
Foundation models excel at uncertainty quantification when properly calibrated

Implementation Quality Matters:

Lag-Llama success (MAE $0.38) shows importance of mature frameworks
Moirai SFT failure (MAE $48.42) highlights implementation complexity challenges
Well-engineered solutions consistently outperform experimental implementations

3. Most Predictive Features for Short-term Price Movements

XGBoost Feature Importance Reveals:

EMA_5 (61.4%) - 5-period exponential moving average dominates
price_lag_1 (23.5%) - Previous period price is critical
price_min_5 (8.6%) - Recent support levels matter
price_max_5 (5.7%) - Recent resistance levels matter
day_of_week (0.3%) - Minimal day-of-week effect

Key Insights:

Short-term momentum (EMA_5 + price_lag_1) accounts for 84.9% of importance
Technical levels (min/max) provide additional 14.3% of predictive power
Volume and volatility features have minimal individual impact but contribute collectively
Time features are less important than price-based features

4. Model Complexity vs Performance Trade-offs

Complexity Spectrum:

Simple (Monte Carlo, ARIMA): Excellent price accuracy + perfect risk assessment
Medium (Prophet, XGBoost): Good performance with domain-specific engineering
Complex (Foundation Models): Mixed results, implementation-dependent

Key Finding: Optimal complexity depends on implementation quality and data characteristics

🏆 Final Conclusions

Traditional Methods Triumph Over Deep Learning

This comprehensive analysis reveals a surprising and important finding: traditional statistical and machine learning methods significantly outperformed large foundation models (60M-311M parameters) on high-frequency financial forecasting.

Why Traditional Methods Won

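Domain Expertise Beats Raw Compute: In the earlier 15-sample run, XGBoost with 92 engineered features reached $0.04 MAE, roughly 10-15x lower than the best foundation models ($0.38-$0.59); in the latest run it ties the statistical baselines at $0.52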
Statistical Rigor Provides Reliability: ARIMA and Monte Carlo offer perfect confidence interval coverage for risk management
Implementation Quality Matters Most: Well-implemented simple models consistently beat poorly implemented complex ones
Data Efficiency: Traditional methods excel with limited training data (8,000 samples)
Market Characteristics: High-frequency financial data may have limited patterns suitable for deep learning
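
"Perfect confidence interval coverage" means every actual price in the test window fell inside the model's predicted interval. A minimal sketch of how that can be measured is shown below; the function names are hypothetical and this is not necessarily how the repository scripts compute it.

```python
def interval_coverage(actuals, lowers, uppers):
    """Fraction of actual values that fall inside [lower, upper]."""
    if not (len(actuals) == len(lowers) == len(uppers)):
        raise ValueError("inputs must have equal length")
    if not actuals:
        raise ValueError("inputs must be non-empty")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actuals, lowers, uppers))
    return hits / len(actuals)


def mean_interval_width(lowers, uppers):
    """Average width of the predicted intervals, in price units."""
    if len(lowers) != len(uppers) or not lowers:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(hi - lo for lo, hi in zip(lowers, uppers)) / len(lowers)
```

Coverage is worth reading alongside interval width: a very wide interval reaches 100% coverage trivially, so a well-calibrated 95% interval should cover close to 95% of outcomes while staying as narrow as possible.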

Foundation Model Lessons

Zero-shot capability is valuable but doesn't guarantee superior performance
Model scale (311M vs 60M vs 2.4M parameters) doesn't correlate with better financial forecasting
Custom fine-tuning is extremely challenging and can underperform zero-shot approaches
Implementation complexity often outweighs theoretical advantages
Purpose-built models (Lag-Llama) significantly outperform adapted language models (Chronos)
Mature frameworks are crucial for foundation model success

Lag-Llama Success Story

Best Foundation Model: Achieved $0.38 MAE - the lowest MAE in the comparison (on 12 test samples) and best among all foundation models
Implementation Quality: Demonstrates that proper frameworks enable foundation model success
Fine-tuning Value: Shows domain adaptation can significantly improve performance
Architecture Matters: Purpose-built time series models outperform adapted language models

Practical Implications

Start Simple: Begin with well-implemented traditional methods before considering complex models
Feature Engineering: Domain-specific feature engineering can be more valuable than model complexity
Risk Management: Statistical models provide superior uncertainty quantification for financial applications
Implementation Focus: Invest in implementation quality rather than just model sophistication
Foundation Model Strategy: Use mature frameworks (Lag-Llama) over experimental implementations
Purpose-built vs Adapted: Choose models designed for time series over adapted language models
Fine-tuning Value: Domain adaptation can significantly improve foundation model performance
Evaluation Rigor: Test on volatile periods to reveal true model performance differences
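
As a starting point for "Start Simple", the sketch below scores a naive persistence forecast with the four metrics used in this comparison. It assumes one-step-ahead forecasts and judges direction relative to the last known price; the direction-accuracy definition in the repository scripts may differ, and the function names are hypothetical.

```python
import math


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def evaluate(actuals, forecasts, prev_prices):
    """Return MAE, RMSE, MAPE (%) and direction accuracy (%).

    prev_prices[i] is the last known price when forecasts[i] was made.
    """
    n = len(actuals)
    errors = [f - a for a, f in zip(actuals, forecasts)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e) / abs(a) for a, e in zip(actuals, errors)) / n
    # A hit means the forecast and the actual moved the same way from prev price.
    hits = sum(
        sign(f - p) == sign(a - p)
        for a, f, p in zip(actuals, forecasts, prev_prices)
    )
    return {
        "mae": mae,
        "rmse": rmse,
        "mape": mape,
        "direction_acc": 100.0 * hits / n,
    }


def persistence_forecast(history, actuals):
    """Naive baseline: each forecast equals the previously observed price."""
    return [history[-1]] + list(actuals[:-1])
```

A persistence forecast always predicts "no change", so its direction accuracy is not informative. Its MAE, however, is a useful floor: a model that cannot beat it on price error adds little over the last observed price.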

Future Research Directions

Ensemble Methods: Combine strengths of traditional and foundation models
Hybrid Approaches: Use foundation models for feature extraction with traditional predictors
Better SFT: Develop more sophisticated fine-tuning approaches for financial data
Multi-horizon Analysis: Evaluate performance across different prediction horizons
Market Regime Analysis: Test performance across different market conditions
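
One simple form the ensemble direction could take is an inverse-error weighted average of model forecasts. The sketch below is illustrative only and was not part of this evaluation; it assumes each model's MAE is measured on a separate validation window, never on the test period itself, and the function names are hypothetical.

```python
def inverse_error_weights(validation_mae):
    """Weight each model by 1 / MAE, normalised so the weights sum to 1."""
    if any(mae <= 0 for mae in validation_mae.values()):
        raise ValueError("MAE values must be positive")
    inv = {name: 1.0 / mae for name, mae in validation_mae.items()}
    total = sum(inv.values())
    return {name: w / total for name, w in inv.items()}


def ensemble_forecast(forecasts, weights):
    """Weighted average of per-model forecasts at each horizon step.

    forecasts maps model name -> list of predicted prices (equal lengths).
    """
    names = list(forecasts)
    horizon = len(forecasts[names[0]])
    if any(len(forecasts[n]) != horizon for n in names):
        raise ValueError("all forecasts must cover the same horizon")
    return [
        sum(weights[n] * forecasts[n][t] for n in names)
        for t in range(horizon)
    ]
```

Weights derived from the test window would leak information into the ensemble, which is why the validation MAE is kept separate in this sketch.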

Bottom Line: In financial forecasting, domain expertise, feature engineering, and implementation quality often matter more than model complexity. Traditional methods remain highly competitive and should be the starting point for any serious financial prediction system. However, Lag-Llama demonstrates that purpose-built foundation models with proper implementation can achieve excellent results and represent a promising direction for time series forecasting.

📚 References

This project demonstrates that in financial time series forecasting, traditional statistical methods and well-engineered machine learning approaches can significantly outperform large foundation models. The results highlight the critical importance of domain expertise, feature engineering, and implementation quality over raw model complexity.
