# Features Module (`src.features`) Feature engineering and technical indicators calculation. ## Feature Engineer **Module**: `src.features.feature_engineer` Compute features with adaptive memory usage using DuckDB. ### Class: `FeatureEngineer` ```python FeatureEngineer( parquet_root: Path, enriched_root: Path, config: ConfigLoader, metadata_root: Optional[Path] = None ) ``` **Parameters:** - `parquet_root`: Input directory (raw Parquet data) - `enriched_root`: Output directory (enriched Parquet data) - `config`: Configuration loader - `metadata_root`: Optional metadata directory **Key Methods:** #### `enrich_date_range(data_type: str, start_date: str, end_date: str, incremental: bool = True) -> Dict` Enrich date range with features. ```python from src.features import FeatureEngineer from src.core import ConfigLoader from pathlib import Path config = ConfigLoader() engineer = FeatureEngineer( parquet_root=Path('data/lake'), enriched_root=Path('data/enriched'), config=config ) result = engineer.enrich_date_range( data_type='stocks_daily', start_date='2024-01-01', end_date='2024-01-31', incremental=True ) print(f"Enriched {result['records_processed']} records") ``` #### `get_enrichment_status(data_type: str) -> Dict` Get enrichment status. Returns: ```python { 'completed': List[str], # Dates completed 'pending': List[str], # Dates pending 'failed': List[str] # Dates failed } ``` #### `close()` Close DuckDB connection. **Context Manager:** ```python with FeatureEngineer(parquet_root, enriched_root, config) as engineer: result = engineer.enrich_date_range('stocks_daily', '2024-01-01', '2024-01-31') ``` **Processing Modes:** - Streaming: <32GB RAM (one symbol at a time) - Batch: 32-64GB RAM (multiple symbols) - Parallel: >64GB RAM (all symbols in parallel, future) --- ## Feature Definitions **Module**: `src.features.definitions` Define features to calculate for each data type. **Functions:** #### `get_feature_definitions(data_type: str) -> Dict[str, str]` Get feature definitions (name -> SQL expression). ```python from src.features.definitions import get_feature_definitions features = get_feature_definitions('stocks_daily') for name, expr in features.items(): print(f"{name}: {expr}") ``` #### `get_feature_list(data_type: str) -> List[str]` Get list of feature names. ```python features = get_feature_list('stocks_daily') print(f"Total features: {len(features)}") ``` #### `build_feature_sql(data_type: str, base_view: str = 'raw_data') -> str` Build complete SQL query with all features. **Feature Sets:** **Stock Daily Features:** - Returns: `return_1d`, `return_5d`, `return_20d` - Alpha: `alpha_daily` (daily return - market return) - Price features: `price_range`, `daily_return_pct` - Volume features: `vwap`, `avg_trade_size`, `volume_change_pct` - Volatility: `volatility_20d` **Stock Minute Features:** - Returns: `minute_return`, `minute_return_pct` - Intraday: `intraday_vwap`, `intraday_high`, `intraday_low` - Volume: `volume_change`, `spread`, `typical_price` **Options Daily Features:** - Ticker parsing: `underlying`, `expiration_date`, `contract_type`, `strike_price` - Moneyness: `moneyness` (strike/spot ratio) - Returns: `return_1d` - Volume: `volume_change_pct` **Options Minute Features:** - Ticker parsing: `underlying`, `expiration_date`, `contract_type`, `strike_price` - Returns: `minute_return` - Volume: `volume_change` --- ## Simple Definitions **Module**: `src.features.definitions_simple` Simplified features for testing (no nested window functions). **Functions:** #### `build_simple_stock_daily_sql(base_view: str = 'raw_data') -> str` Build simple SQL for stock daily features. **Features:** - `return_1d`: Daily return - `alpha_daily`: Price change vs previous day - `price_range`: (high - low) / close - `daily_return_pct`: Daily return percentage - `vwap`: Volume-weighted average price **When to Use:** - Testing and development - DuckDB compatibility issues - Simpler feature requirements --- ## Options Parser **Module**: `src.features.options_parser` Parse Polygon.io options ticker format. ### Class: `OptionsTickerParser` **Class Methods:** #### `parse(ticker: str) -> Optional[Dict[str, Any]]` Parse options ticker. ```python from src.features import OptionsTickerParser result = OptionsTickerParser.parse('O:SPY230327P00390000') print(result) # { # 'underlying': 'SPY', # 'expiration_date': datetime.date(2023, 3, 27), # 'contract_type': 'P', # 'strike_price': 390.0 # } ``` #### `parse_batch(tickers: List[str]) -> Dict[str, Dict]` Parse multiple tickers efficiently. ```python tickers = ['O:SPY230327P00390000', 'O:AAPL230317C00150000'] results = OptionsTickerParser.parse_batch(tickers) for ticker, parsed in results.items(): if parsed: print(f"{ticker}: {parsed['underlying']} {parsed['strike_price']}") ``` #### `is_valid_ticker(ticker: str) -> bool` Check if ticker matches options format. ```python if OptionsTickerParser.is_valid_ticker('O:SPY230327P00390000'): print("Valid options ticker") ``` #### `extract_underlying(ticker: str) -> Optional[str]` Quick extraction of underlying symbol. ```python underlying = OptionsTickerParser.extract_underlying('O:SPY230327P00390000') # Returns: 'SPY' ``` #### `extract_expiration(ticker: str) -> Optional[date]` Quick extraction of expiration date. **Ticker Format:** ``` O:UNDERLYING[YY]MMDD[C/P]STRIKE Examples: O:SPY230327P00390000 -> SPY put, exp 2023-03-27, strike $390 O:AAPL230317C00150000 -> AAPL call, exp 2023-03-17, strike $150 ``` **Components:** - Prefix: `O:` - Underlying: Stock symbol - Expiration: YYMMDD (6 digits) - Type: `C` (call) or `P` (put) - Strike: 8 digits (dollars * 1000) --- ## Custom Features To add custom features, edit `src/features/definitions.py`: ```python STOCK_DAILY_FEATURES = { 'return_1d': '(close / LAG(close, 1) OVER (PARTITION BY symbol ORDER BY date)) - 1', # Add your custom feature: 'my_custom_feature': 'YOUR SQL EXPRESSION HERE', } ``` Features are computed using DuckDB SQL, supporting: - Window functions: `LAG`, `LEAD`, `ROW_NUMBER`, etc. - Aggregations: `SUM`, `AVG`, `MAX`, `MIN`, etc. - Date functions: `DATE_DIFF`, `EXTRACT`, etc. - Math functions: `LOG`, `SQRT`, `POWER`, etc. **Best Practices:** 1. Use memory-efficient types (CAST to FLOAT if needed) 2. Partition by symbol for stock-level features 3. Order by date for time-series features 4. Handle NULLs explicitly 5. Test with `definitions_simple` first