Features Module (src.features)
Feature engineering and technical indicators calculation.
Feature Engineer
Module: src.features.feature_engineer
Compute features with adaptive memory usage using DuckDB.
Class: FeatureEngineer
FeatureEngineer(
parquet_root: Path,
enriched_root: Path,
config: ConfigLoader,
metadata_root: Optional[Path] = None
)
Parameters:
parquet_root: Input directory (raw Parquet data)enriched_root: Output directory (enriched Parquet data)config: Configuration loadermetadata_root: Optional metadata directory
Key Methods:
enrich_date_range(data_type: str, start_date: str, end_date: str, incremental: bool = True) -> Dict
Enrich date range with features.
from src.features import FeatureEngineer
from src.core import ConfigLoader
from pathlib import Path
config = ConfigLoader()
engineer = FeatureEngineer(
parquet_root=Path('data/lake'),
enriched_root=Path('data/enriched'),
config=config
)
result = engineer.enrich_date_range(
data_type='stocks_daily',
start_date='2024-01-01',
end_date='2024-01-31',
incremental=True
)
print(f"Enriched {result['records_processed']} records")
get_enrichment_status(data_type: str) -> Dict
Get enrichment status.
Returns:
{
'completed': List[str], # Dates completed
'pending': List[str], # Dates pending
'failed': List[str] # Dates failed
}
close()
Close DuckDB connection.
Context Manager:
with FeatureEngineer(parquet_root, enriched_root, config) as engineer:
result = engineer.enrich_date_range('stocks_daily', '2024-01-01', '2024-01-31')
Processing Modes:
Streaming: <32GB RAM (one symbol at a time)
Batch: 32-64GB RAM (multiple symbols)
Parallel: >64GB RAM (all symbols in parallel, future)
Feature Definitions
Module: src.features.definitions
Define features to calculate for each data type.
Functions:
get_feature_definitions(data_type: str) -> Dict[str, str]
Get feature definitions (name -> SQL expression).
from src.features.definitions import get_feature_definitions
features = get_feature_definitions('stocks_daily')
for name, expr in features.items():
print(f"{name}: {expr}")
get_feature_list(data_type: str) -> List[str]
Get list of feature names.
features = get_feature_list('stocks_daily')
print(f"Total features: {len(features)}")
build_feature_sql(data_type: str, base_view: str = 'raw_data') -> str
Build complete SQL query with all features.
Feature Sets:
Stock Daily Features:
Returns:
return_1d,return_5d,return_20dAlpha:
alpha_daily(daily return - market return)Price features:
price_range,daily_return_pctVolume features:
vwap,avg_trade_size,volume_change_pctVolatility:
volatility_20d
Stock Minute Features:
Returns:
minute_return,minute_return_pctIntraday:
intraday_vwap,intraday_high,intraday_lowVolume:
volume_change,spread,typical_price
Options Daily Features:
Ticker parsing:
underlying,expiration_date,contract_type,strike_priceMoneyness:
moneyness(strike/spot ratio)Returns:
return_1dVolume:
volume_change_pct
Options Minute Features:
Ticker parsing:
underlying,expiration_date,contract_type,strike_priceReturns:
minute_returnVolume:
volume_change
Simple Definitions
Module: src.features.definitions_simple
Simplified features for testing (no nested window functions).
Functions:
build_simple_stock_daily_sql(base_view: str = 'raw_data') -> str
Build simple SQL for stock daily features.
Features:
return_1d: Daily returnalpha_daily: Price change vs previous dayprice_range: (high - low) / closedaily_return_pct: Daily return percentagevwap: Volume-weighted average price
When to Use:
Testing and development
DuckDB compatibility issues
Simpler feature requirements
Options Parser
Module: src.features.options_parser
Parse Polygon.io options ticker format.
Class: OptionsTickerParser
Class Methods:
parse(ticker: str) -> Optional[Dict[str, Any]]
Parse options ticker.
from src.features import OptionsTickerParser
result = OptionsTickerParser.parse('O:SPY230327P00390000')
print(result)
# {
# 'underlying': 'SPY',
# 'expiration_date': datetime.date(2023, 3, 27),
# 'contract_type': 'P',
# 'strike_price': 390.0
# }
parse_batch(tickers: List[str]) -> Dict[str, Dict]
Parse multiple tickers efficiently.
tickers = ['O:SPY230327P00390000', 'O:AAPL230317C00150000']
results = OptionsTickerParser.parse_batch(tickers)
for ticker, parsed in results.items():
if parsed:
print(f"{ticker}: {parsed['underlying']} {parsed['strike_price']}")
is_valid_ticker(ticker: str) -> bool
Check if ticker matches options format.
if OptionsTickerParser.is_valid_ticker('O:SPY230327P00390000'):
print("Valid options ticker")
extract_underlying(ticker: str) -> Optional[str]
Quick extraction of underlying symbol.
underlying = OptionsTickerParser.extract_underlying('O:SPY230327P00390000')
# Returns: 'SPY'
extract_expiration(ticker: str) -> Optional[date]
Quick extraction of expiration date.
Ticker Format:
O:UNDERLYING[YY]MMDD[C/P]STRIKE
Examples:
O:SPY230327P00390000 -> SPY put, exp 2023-03-27, strike $390
O:AAPL230317C00150000 -> AAPL call, exp 2023-03-17, strike $150
Components:
Prefix:
O:Underlying: Stock symbol
Expiration: YYMMDD (6 digits)
Type:
C(call) orP(put)Strike: 8 digits (dollars * 1000)
Custom Features
To add custom features, edit src/features/definitions.py:
STOCK_DAILY_FEATURES = {
'return_1d': '(close / LAG(close, 1) OVER (PARTITION BY symbol ORDER BY date)) - 1',
# Add your custom feature:
'my_custom_feature': 'YOUR SQL EXPRESSION HERE',
}
Features are computed using DuckDB SQL, supporting:
Window functions:
LAG,LEAD,ROW_NUMBER, etc.Aggregations:
SUM,AVG,MAX,MIN, etc.Date functions:
DATE_DIFF,EXTRACT, etc.Math functions:
LOG,SQRT,POWER, etc.
Best Practices:
Use memory-efficient types (CAST to FLOAT if needed)
Partition by symbol for stock-level features
Order by date for time-series features
Handle NULLs explicitly
Test with
definitions_simplefirst