Transform Module (src.transform)
Convert enriched Parquet to Qlib binary format with validation.
Qlib Binary Writer
Module: src.transform.qlib_binary_writer
Convert enriched Parquet to Qlib binary format for ML training.
Class: QlibBinaryWriter
QlibBinaryWriter(
    enriched_root: Path,
    qlib_root: Path,
    config: ConfigLoader
)
Parameters:
enriched_root: Input directory (enriched Parquet data)
qlib_root: Output directory (Qlib binary format)
config: Configuration loader
Key Methods:
convert_data_type(data_type: str, start_date: str, end_date: str, incremental: bool = True, metadata_manager: Optional[object] = None) -> Dict[str, Any]
Convert entire data type to Qlib binary.
from src.transform import QlibBinaryWriter
from src.core import ConfigLoader
from pathlib import Path
config = ConfigLoader()
writer = QlibBinaryWriter(
    enriched_root=Path('data/enriched'),
    qlib_root=Path('data/qlib'),
    config=config
)
result = writer.convert_data_type(
    data_type='stocks_daily',
    start_date='2024-01-01',
    end_date='2024-01-31',
    incremental=True
)
print(f"Converted {result['symbols_converted']} symbols")
print(f"Total features: {result['features_written']}")
close()
Close DuckDB connection.
Context Manager:
with QlibBinaryWriter(enriched_root, qlib_root, config) as writer:
    result = writer.convert_data_type('stocks_daily', '2024-01-01', '2024-01-31')
Binary Format:
Qlib uses a custom binary format for fast data access:
qlib_root/{data_type}/
├── instruments/
│   └── all.txt        # Tab-separated: symbol start_date end_date
├── calendars/
│   └── day.txt        # One date per line: YYYY-MM-DD
└── features/
    └── {symbol}/
        ├── close.bin  # Binary feature data
        ├── volume.bin
        ├── return_1d.bin
        └── ...
Binary File Format:
4 bytes: count (uint32, little-endian)
N * 4 bytes: float32 values (little-endian)
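As an illustration of the layout described above (a uint32 count followed by float32 values, both little-endian), the following sketch writes and reads such a file with the standard library only. The helper names `write_feature_file` and `read_feature_file` are hypothetical and are not part of `QlibBinaryWriter`; the sketch assumes nothing beyond the two-line layout stated here.

```python
import struct
from pathlib import Path
from typing import List


def write_feature_file(path: Path, values: List[float]) -> None:
    """Write values as: 4-byte uint32 count, then one float32 per value (little-endian)."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_feature_file(path: Path) -> List[float]:
    """Read a file in the same layout and return its values as Python floats."""
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
        payload = f.read(4 * count)
    # A short payload means the header count and the file contents disagree.
    if len(payload) != 4 * count:
        raise ValueError(f"expected {4 * count} payload bytes, found {len(payload)}")
    return list(struct.unpack(f"<{count}f", payload))
```

Values are stored as float32, so anything read back carries float32 precision (roughly seven significant digits) even if the Parquet source was float64.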
Processing Modes:
Streaming: One symbol at a time (memory-efficient)
Batch: Multiple symbols (future)
Parallel: All symbols in parallel (future)
Critical Fixes Applied:
See docs/changelog/QLIB_BINARY_WRITER_UPDATES.md for details on 6 critical fixes:
Filter NULL symbols in SQL query
Tab-separated instruments file
Create .qlib/dataset_info.json with frequency
Clean macOS metadata files (._*)
Proper date ranges in instruments file
Validated binary format
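The macOS metadata cleanup mentioned above can be pictured roughly as follows. This is a minimal sketch, assuming the files in question are the `._*` AppleDouble files macOS creates next to real files on some volumes; the function name `remove_macos_metadata` is hypothetical, and the writer's actual cleanup code is documented in the changelog, not here.

```python
from pathlib import Path
from typing import List


def remove_macos_metadata(root: Path) -> List[Path]:
    """Delete every file named ._* under root and return the paths that were removed."""
    removed = []
    # rglob walks all subdirectories; pathlib patterns match dot-prefixed names.
    for path in sorted(root.rglob("._*")):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Removing these files matters because a stray `._close.bin` inside `features/{symbol}/` could otherwise be mistaken for a feature file.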
Qlib Binary Validator
Module: src.transform.qlib_binary_validator
Validate Qlib binary format conversions.
Class: QlibBinaryValidator
QlibBinaryValidator(qlib_root: Path)
Key Methods:
validate_conversion(data_type: str) -> Dict[str, Any]
Run all validation checks.
from src.transform import QlibBinaryValidator
from pathlib import Path
validator = QlibBinaryValidator(Path('data/qlib'))
results = validator.validate_conversion('stocks_daily')
if results['all_passed']:
    print("✓ All validation checks passed")
else:
    print("✗ Validation failed:")
    for check, passed in results['checks'].items():
        print(f"  {check}: {'✓' if passed else '✗'}")
Validation Checks:
Instruments file: Exists, correct format, valid date ranges
Calendar file: Exists, correct format, business days only
Binary files: Exist for each symbol, correct format
File structure: Proper directory structure
Metadata: .qlib/dataset_info.json exists and is valid
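To make the instruments-file check concrete, here is a small sketch of what "correct format, valid date ranges" could mean for individual lines. It is an illustration only: `check_instrument_lines` is a hypothetical helper, and the rules it applies (exactly three tab-separated fields, dates parsed with `%Y-%m-%d`, start not after end) are assumptions inferred from the file description above, not the validator's actual implementation.

```python
from datetime import datetime
from typing import List


def check_instrument_lines(lines: List[str]) -> List[str]:
    """Return a list of problems; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            problems.append(f"line {number}: expected 3 tab-separated fields, got {len(parts)}")
            continue
        symbol, start_text, end_text = parts
        if not symbol:
            problems.append(f"line {number}: empty symbol")
        try:
            start = datetime.strptime(start_text, "%Y-%m-%d").date()
            end = datetime.strptime(end_text, "%Y-%m-%d").date()
        except ValueError:
            problems.append(f"line {number}: dates do not parse as YYYY-MM-DD")
            continue
        if start > end:
            problems.append(f"line {number}: start_date is after end_date")
    return problems
```

A space-separated line fails the first rule, which is the failure mode the tab-separation fix listed earlier addresses.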
read_binary_feature(data_type: str, symbol: str, feature: str) -> np.ndarray
Read binary feature for testing.
feature_data = validator.read_binary_feature('stocks_daily', 'AAPL', 'close')
print(f"AAPL close prices: shape={feature_data.shape}")
get_feature_list(data_type: str, symbol: str) -> List[str]
Get list of features for symbol.
features = validator.get_feature_list('stocks_daily', 'AAPL')
print(f"AAPL features: {', '.join(features)}")
compare_with_parquet(data_type: str, symbol: str, feature: str, parquet_df, tolerance: float = 1e-6) -> Dict[str, Any]
Compare binary with original Parquet (roundtrip test).
import pandas as pd
# Load original data
df = pd.read_parquet('data/enriched/stocks_daily/AAPL.parquet')
# Compare
comparison = validator.compare_with_parquet(
    data_type='stocks_daily',
    symbol='AAPL',
    feature='close',
    parquet_df=df,
    tolerance=1e-6
)
if comparison['match']:
    print("✓ Binary matches Parquet within tolerance")
else:
    print(f"✗ Mismatch: {comparison['max_diff']} max difference")
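The idea behind a roundtrip comparison can be sketched without pandas. The function below is hypothetical and only mirrors the `match` and `max_diff` keys used in the example above; it assumes an absolute tolerance and treats a NaN on both sides as a match, neither of which is confirmed for `compare_with_parquet`.

```python
import math
from typing import Dict, Sequence


def compare_values(
    binary: Sequence[float],
    original: Sequence[float],
    tolerance: float = 1e-6,
) -> Dict[str, object]:
    """Compare two sequences element by element using an absolute tolerance."""
    if len(binary) != len(original):
        return {"match": False, "max_diff": math.inf}
    max_diff = 0.0
    for b, o in zip(binary, original):
        if math.isnan(b) and math.isnan(o):
            continue  # assumption: NaN in both places counts as a match
        if math.isnan(b) or math.isnan(o):
            return {"match": False, "max_diff": math.inf}
        max_diff = max(max_diff, abs(b - o))
    return {"match": max_diff <= tolerance, "max_diff": max_diff}
```

One thing to keep in mind when choosing a tolerance: float32 values between 128 and 256 are spaced about 1.5e-5 apart, so a float64 price in that range can differ from its float32 copy by several millionths. If a comparison against float64 Parquet data fails by a margin of that size, the cause may be float32 rounding and not a conversion bug.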
Using Qlib Binary Data
After conversion, initialize Qlib with the binary data:
import qlib
from qlib.data import D
# Initialize Qlib
qlib.init(
    provider_uri='data/qlib/stocks_daily',
    region='us'
)
# Query data
symbols = ['AAPL', 'MSFT', 'GOOGL']
fields = ['$close', '$volume', '$return_1d', '$alpha_daily']
data = D.features(
    symbols,
    fields,
    start_time='2024-01-01',
    end_time='2024-01-31'
)
print(data.head())
Field Naming:
Prefix with $ when querying: $close, $volume, $return_1d
Binary files have no prefix: close.bin, volume.bin, return_1d.bin
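The naming rule can be expressed as a one-line mapping. The helper `field_to_filename` below is hypothetical and only restates the convention described here; it is handy when checking by hand that a queried field has a matching file under `features/{symbol}/`.

```python
def field_to_filename(field: str) -> str:
    """Map a query field such as '$close' to its binary file name 'close.bin'."""
    if not field.startswith("$") or len(field) == 1:
        raise ValueError(f"expected a field like '$close', got {field!r}")
    return f"{field[1:]}.bin"
```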
Incremental Updates
The binary writer supports incremental updates:
from src.storage import MetadataManager
metadata = MetadataManager(Path('data/metadata'))
# First conversion
writer.convert_data_type(
    'stocks_daily',
    start_date='2024-01-01',
    end_date='2024-01-31',
    incremental=True,
    metadata_manager=metadata
)
# Later, add new data
writer.convert_data_type(
    'stocks_daily',
    start_date='2024-02-01',
    end_date='2024-02-29',
    incremental=True,
    metadata_manager=metadata
)
# Only converts symbols not already converted
Benefits:
Skip already-converted symbols
Faster updates
Avoid reprocessing
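The skip step can be pictured as a set difference. How the writer records and looks up converted symbols through `metadata_manager` is not shown in this document, so the sketch below uses a plain set as a stand-in, and `symbols_to_convert` is a hypothetical name.

```python
from typing import Iterable, List, Set


def symbols_to_convert(available: Iterable[str], already_converted: Set[str]) -> List[str]:
    """Return the symbols that still need conversion, in a stable sorted order."""
    return sorted(set(available) - already_converted)


# Example: only MSFT remains once AAPL and GOOGL are recorded as converted.
remaining = symbols_to_convert(["AAPL", "MSFT", "GOOGL"], {"AAPL", "GOOGL"})
```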
Troubleshooting
Issue: qlib.init() fails with “Provider not found”
Solution: Check binary format structure:
ls data/qlib/stocks_daily/
# Should see: instruments/, calendars/, features/
Issue: Validation fails on instruments file
Solution: Check tab-separated format:
head data/qlib/stocks_daily/instruments/all.txt
# Format: SYMBOL\tSTART_DATE\tEND_DATE
Issue: Binary files have wrong size
Solution: Verify count matches calendar:
import struct
# Read first 4 bytes (count)
with open('data/qlib/stocks_daily/features/AAPL/close.bin', 'rb') as f:
    count = struct.unpack('<I', f.read(4))[0]
print(f"Binary has {count} values")
# Compare with calendar (blank lines are not counted as days)
with open('data/qlib/stocks_daily/calendars/day.txt') as f:
    calendar_days = sum(1 for line in f if line.strip())
print(f"Calendar has {calendar_days} days")
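A related check is whether the file size agrees with its own header. Under the layout given in the Binary File Format section (a 4-byte count plus 4 bytes per value), a consistent file is exactly `4 + 4 * count` bytes long. The helper below is a hypothetical sketch of that check, not part of the validator.

```python
import struct
from pathlib import Path


def feature_file_size_ok(path: Path) -> bool:
    """Return True if the file is exactly 4 + 4 * count bytes, per its own header."""
    size = path.stat().st_size
    if size < 4:
        return False  # too short to even hold the count header
    with open(path, "rb") as f:
        (count,) = struct.unpack("<I", f.read(4))
    return size == 4 + 4 * count
```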
See docs/changelog/QLIB_BINARY_WRITER_UPDATES.md for more troubleshooting.