Feature Construction and Variable Enhancement

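Feature construction and variable enhancement is a key step in improving time series forecasting models: well-chosen features help a model capture temporal patterns and the influence of external factors. Good feature engineering can substantially improve the predictive power of the models in Time Series Analysis - Machine Learning Models and of some models in Time Series Analysis - Deep Learning Models.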

Date and Time Features

Time is the most basic feature dimension of time series data, and many useful features can be derived from it:

# Assumes df has a DatetimeIndex
import numpy as np
from holidays import China

# Basic time features
df['hour'] = df.index.hour
df['day'] = df.index.day
df['day_of_week'] = df.index.dayofweek  # 0-6 for Monday through Sunday
df['month'] = df.index.month
df['quarter'] = df.index.quarter
df['year'] = df.index.year
df['is_weekend'] = df.index.dayofweek >= 5  # weekend flag
 
# Special periods
df['is_business_hour'] = (df.index.hour >= 9) & (df.index.hour < 18)
 
# Holiday feature (requires the holidays library)
holidays_cn = China()
df['is_holiday'] = df.index.map(lambda x: x in holidays_cn)
 
# Cyclical encoding (avoids the break where a cyclic feature wraps around)
df['hour_sin'] = np.sin(2 * np.pi * df.index.hour / 24)
df['hour_cos'] = np.cos(2 * np.pi * df.index.hour / 24)
df['day_sin'] = np.sin(2 * np.pi * df.index.dayofweek / 7)  # day of week
df['day_cos'] = np.cos(2 * np.pi * df.index.dayofweek / 7)
df['month_sin'] = np.sin(2 * np.pi * df.index.month / 12)
df['month_cos'] = np.cos(2 * np.pi * df.index.month / 12)
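
To illustrate why the sin/cos encoding avoids the break at the wrap-around point, here is a minimal sketch using only the standard library. The helper names `encode_cyclic` and `circle_distance` are hypothetical and exist only for this illustration.

```python
import math

def encode_cyclic(value, period):
    """Map a cyclic value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period):
    """Euclidean distance between two cyclic values after sin/cos encoding."""
    sin_a, cos_a = encode_cyclic(a, period)
    sin_b, cos_b = encode_cyclic(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# As raw integers, 23:00 and 00:00 are 23 units apart even though they are
# adjacent hours. After encoding they are as close as any other adjacent pair.
dist_23_to_0 = circle_distance(23, 0, 24)
dist_11_to_12 = circle_distance(11, 12, 24)
dist_0_to_12 = circle_distance(0, 12, 24)  # opposite sides of the circle
```

Both columns are needed: the sine alone cannot distinguish, for example, 01:00 from 11:00, which share the same sine value.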

Lag Features and Rolling-Window Features

In time series forecasting, historical values are usually the strongest predictors:

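# Lag features (the value N days earlier; assumes one row per day)
for lag in [1, 7, 14, 28]:
    df[f'lag_{lag}d'] = df['value'].shift(lag)
 
# Rolling-window statistics
# shift(1) keeps the current value out of each window, so these features
# do not leak the target when predicting df['value'] at the same timestamp
past = df['value'].shift(1)
for window in [7, 14, 30]:
    df[f'rolling_mean_{window}d'] = past.rolling(window=window).mean()
    df[f'rolling_std_{window}d'] = past.rolling(window=window).std()
    df[f'rolling_min_{window}d'] = past.rolling(window=window).min()
    df[f'rolling_max_{window}d'] = past.rolling(window=window).max()
    
# Period-over-period and year-over-year features
# Both use the current value; shift them by one step before using them to predict it
df['mom_change'] = df['value'] / df['value'].shift(1) - 1  # change vs. the previous period (here, the previous day)
df['yoy_change'] = df['value'] / df['value'].shift(365) - 1  # year-over-year change (365 rows back; ignores leap days)
 
# Differencing features (these also include the current value)
df['diff_1d'] = df['value'].diff(1)  # first-order difference
df['diff_7d'] = df['value'].diff(7)  # weekly difference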
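
One thing to watch with history-based features is whether each row only sees values from before its own timestamp. The sketch below is one possible way to guarantee that. It assumes a daily series in a column named `value`; the function name `add_history_features` and the toy data are hypothetical.

```python
import pandas as pd

def add_history_features(df, target="value", lags=(1, 7), windows=(3,)):
    """Add lag and rolling-mean features that only use values before each row."""
    out = df.copy()
    past = out[target].shift(1)  # the series as known one step earlier
    for lag in lags:
        out[f"lag_{lag}d"] = out[target].shift(lag)
    for window in windows:
        # Rolling over the shifted series excludes the current row's value
        out[f"rolling_mean_{window}d"] = past.rolling(window=window).mean()
    return out

# Toy daily series with values 1..10
index = pd.date_range("2023-01-01", periods=10, freq="D")
demo = pd.DataFrame({"value": range(1, 11)}, index=index)
features = add_history_features(demo)
```

For the fourth row (value 4), `rolling_mean_3d` is the mean of the three preceding values 1, 2, and 3, so the current value never enters its own feature. The first rows are NaN because not enough history exists yet; whether to drop or impute them depends on the model.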

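Integrating External Variables

Bringing external factors that are related to the target variable into the model can noticeably improve forecast accuracy: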

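import pandas as pd

# Example: merging weather data
weather_df = pd.read_csv('weather_data.csv', index_col='date', parse_dates=True)
df = pd.merge(df, weather_df[['temperature', 'rainfall']], 
              left_index=True, right_index=True, how='left')
 
# Example: marketing campaign data
promotion_dates = pd.to_datetime(['2023-01-01', '2023-02-14', '2023-06-18'])
# Compare on calendar date; works whether or not every promotion date is in the index
df['is_promotion'] = df.index.normalize().isin(promotion_dates).astype(int)
    
# Example: search index or social media popularity
# (search_trend_df is assumed to be loaded already and indexed by date)
df = pd.merge(df, search_trend_df[['search_index']], 
              left_index=True, right_index=True, how='left')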
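
An index-on-index left merge only matches rows whose timestamps are identical, which matters when the external data is coarser than the target series. The following sketch uses made-up hourly and daily frames (all names and values are illustrative) to show the mismatch and one way to align on the calendar date instead.

```python
import pandas as pd

# Hypothetical data: an hourly target series and a daily weather table
hourly = pd.DataFrame(
    {"value": range(48)},
    index=pd.date_range("2023-01-01", periods=48, freq="h"),
)
weather = pd.DataFrame(
    {"temperature": [3.5, 6.0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02"]),
)

# A plain index-on-index left join only matches the two midnight rows
naive = hourly.join(weather, how="left")

# Look up each hour's calendar date so every hour inherits that day's reading
aligned = hourly.copy()
aligned["temperature"] = (
    weather["temperature"].reindex(hourly.index.normalize()).to_numpy()
)

missing_naive = int(naive["temperature"].isna().sum())
missing_aligned = int(aligned["temperature"].isna().sum())
```

Counting missing values after a left merge, as done here, is a cheap check that the join keys actually lined up. Whether a same-day reading is available at prediction time depends on the use case; for forecasting, a forecast or a lagged value may be the honest input.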

Features Based on Domain Knowledge

Depending on the business scenario, features that encode specific domain knowledge can be added:

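import numpy as np

# Retail example (uses the month/day/hour columns created earlier)
# Days to the nearest payday, assuming payday on the 15th and 30-day months
df['days_to_payday'] = df.index.map(lambda x: min((x.day - 15) % 30, (30 - x.day + 15) % 30))
# Shopping season: November 20 through December 25
df['shopping_season'] = (((df['month'] == 11) & (df['day'] >= 20)) |
                         ((df['month'] == 12) & (df['day'] <= 25))).astype(int)
 
# Restaurant example (hours outside the three meal windows map to 'other')
df['meal_time'] = np.select(
    [df['hour'].between(6, 9), df['hour'].between(11, 13), df['hour'].between(17, 20)],
    ['breakfast', 'lunch', 'dinner'], default='other')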

Automated Feature Construction

For complex problems, automated feature engineering tools can be used:

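# Automatically extract time series features with tsfresh
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
 
# Choose which features to extract (MinimalFCParameters is a small, fast set)
extraction_settings = MinimalFCParameters()
# Extract features; df_formatted is assumed to be in long format with
# "id", "timestamp", and "value" columns
extracted_features = extract_features(df_formatted, 
                                     column_id="id", 
                                     column_sort="timestamp",
                                     column_value="value", 
                                     default_fc_parameters=extraction_settings)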
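
The call above expects `df_formatted` in long format, with one row per observation and the columns named in `column_id`, `column_sort`, and `column_value`. As a rough sketch, assuming the raw data is a wide frame with one column per series (the `store_a` and `store_b` columns are made up), it could be reshaped with pandas alone:

```python
import pandas as pd

# Hypothetical wide frame: one column per series, indexed by timestamp
wide = pd.DataFrame(
    {"store_a": [10, 12, 11], "store_b": [7, 9, 8]},
    index=pd.date_range("2023-01-01", periods=3, freq="D"),
)
wide.index.name = "timestamp"

# Melt to long format: one row per (timestamp, series id) pair
df_formatted = (
    wide.reset_index()
        .melt(id_vars="timestamp", var_name="id", value_name="value")
        .sort_values(["id", "timestamp"])
        .reset_index(drop=True)
)
```

With this layout, `extract_features` returns one row of features per distinct `id`.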

Feature Importance and Selection

Not every feature helps prediction, so features need to be evaluated and filtered:

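import pandas as pd
import shap
from xgboost import XGBRegressor

# Filter features by correlation with the target
# numeric_only=True skips non-numeric columns such as meal_time;
# drop('target') removes the target's trivial correlation with itself
correlation = df.corr(numeric_only=True)['target'].drop('target').abs().sort_values(ascending=False)
high_corr_features = correlation[correlation > 0.1].index.tolist()
 
# Use the model's built-in feature importance (X: feature DataFrame, y: target)
xgb_model = XGBRegressor().fit(X, y)
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)
 
# Analyze feature contributions with SHAP values
explainer = shap.Explainer(xgb_model)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)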
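
As a complement to importance scores, a feature can also be judged by whether it lowers the error on a later, held-out part of the series. The sketch below swaps in ordinary least squares from NumPy in place of XGBoost to stay dependency-free, and runs on synthetic data. The helper `holdout_mae` and all values are illustrative assumptions.

```python
import numpy as np

def holdout_mae(X, y, n_train):
    """Fit least squares on the first n_train rows, return MAE on the rest."""
    X = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(X[:n_train], y[:n_train], rcond=None)
    return float(np.mean(np.abs(X[n_train:] @ coef - y[n_train:])))

# Synthetic daily series: linear trend plus a weekend uplift
t = np.arange(140)
is_weekend = (t % 7 >= 5).astype(float)
y = 0.5 * t + 10.0 * is_weekend

trend_only = t.reshape(-1, 1)
with_weekend = np.column_stack([t, is_weekend])

# Chronological split: train on the first 112 days, validate on the last 28
mae_without = holdout_mae(trend_only, y, n_train=112)
mae_with = holdout_mae(with_weekend, y, n_train=112)
```

The split is chronological rather than random so that the validation rows always come after the training rows, which mirrors how the model would be used. On this toy data the weekend flag removes essentially all of the held-out error; on real data the gain will be smaller and should be weighed against noise.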

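Relationship to Other Modules

Feature construction should build on the results of Time Series Analysis - Data Preprocessing and Trend Identification, for example by using the components from a trend decomposition as features. It has the greatest impact on the models in Time Series Analysis - Machine Learning Models; for Time Series Analysis - Traditional Statistical Models, its main role is to supply external regressors. Feature engineering plays an equally important role in Time Series Analysis - Hybrid Methods. Time Series Analysis - Evaluation Metrics and Error Analysis can be used to verify that features are effective, and the most suitable feature combination can then be chosen for each of the Time Series Analysis - Application Scenarios.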
