
[Python] 60 Feature-Engineering Aggregation Functions (Python Code)


Recently several friends have asked me how to do feature engineering and whether there are effective operations suitable for beginners.

Feature engineering usually has to be tailored to the specific problem, but there are also some brute-force strategies that can bring large gains in the early stages of a competition; in many competitions this kind of information alone already yields very good results, and the rest comes from business logic and other techniques. Here we list the aggregation operations we use most often.

I recently came across an article that collects a large number of aggregation functions, so I excerpt them below for readers who are new to competitions.

Summary of aggregation features

pandas built-in aggregation functions
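The original post shows the built-ins as an image; as a quick illustration, pandas' built-in aggregations are referenced by their string names through `groupby(...).agg(...)` (the `id` and `value` column names here are invented for the example):

```python
import pandas as pd

# Toy data; the 'id' and 'value' column names are illustrative only.
df = pd.DataFrame({
    "id":    ["a", "a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Built-in aggregations are referenced by their string names.
agg = df.groupby("id")["value"].agg(["mean", "sum", "min", "max", "std", "count"])
print(agg)
```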

Other important aggregation functions

The other important aggregation functions, grouped by category, are listed below.

import numpy as np
import pandas as pd

def median(x):
    return np.median(x)

def variation_coefficient(x):
    mean = np.mean(x)
    if mean != 0:
        return np.std(x) / mean
    else:
        return np.nan

def variance(x):
    return np.var(x)

def skewness(x):
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return pd.Series.skew(x)
def kurtosis(x):
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    return pd.Series.kurtosis(x)

def standard_deviation(x):
    return np.std(x)

def large_standard_deviation(x):
    if (np.max(x)-np.min(x)) == 0:
        return np.nan
    else:
        return np.std(x)/(np.max(x)-np.min(x))

def variance_std_ratio(x):
    y = np.var(x)
    if y != 0:
        return y/np.sqrt(y)
    else:
        return np.nan

def ratio_beyond_r_sigma(x, r):
    if x.size == 0:
        return np.nan
    else:
        return np.sum(np.abs(x - np.mean(x)) > r * np.asarray(np.std(x))) / x.size

def range_ratio(x):
    mean_median_difference = np.abs(np.mean(x) - np.median(x))
    max_min_difference = np.max(x) - np.min(x)
    if max_min_difference == 0:
        return np.nan
    else:
        return mean_median_difference / max_min_difference
    
def has_duplicate_max(x):
    return np.sum(x == np.max(x)) >= 2

def has_duplicate_min(x):
    return np.sum(x == np.min(x)) >= 2

def has_duplicate(x):
    return x.size != np.unique(x).size

def count_duplicate_max(x):
    return np.sum(x == np.max(x))

def count_duplicate_min(x):
    return np.sum(x == np.min(x))

def count_duplicate(x):
    return x.size - np.unique(x).size

def sum_values(x):
    if len(x) == 0:
        return 0
    return np.sum(x)

def log_return(list_stock_prices):
    # expects a pd.Series; np.log preserves the Series type, so .diff() works
    return np.log(list_stock_prices).diff()

def realized_volatility(series):
    return np.sqrt(np.sum(series**2))

def realized_abs_skew(series):
    return np.power(np.abs(np.sum(series**3)),1/3)

def realized_skew(series):
    return np.sign(np.sum(series**3))*np.power(np.abs(np.sum(series**3)),1/3)

def realized_vol_skew(series):
    return np.power(np.abs(np.sum(series**6)),1/6)

def realized_quarticity(series):
    return np.power(np.sum(series**4),1/4)

def count_unique(series):
    return len(np.unique(series))

def count(series):
    return series.size

# drawdown functions are mine
def maximum_drawdown(series):
    series = np.asarray(series)
    if len(series)<2:
        return 0
    i = np.argmax(np.maximum.accumulate(series) - series)  # trough index
    if len(series[:i]) < 1:
        return np.nan
    j = np.max(series[:i])  # highest peak before the trough
    return j - series[i]
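A quick sanity check of `maximum_drawdown` on a toy price series (self-contained copy of the function for the example):

```python
import numpy as np

def maximum_drawdown(series):
    # Largest peak-to-trough drop: running max minus current value, maximised.
    series = np.asarray(series)
    if len(series) < 2:
        return 0
    i = np.argmax(np.maximum.accumulate(series) - series)  # trough index
    if len(series[:i]) < 1:
        return np.nan
    j = np.max(series[:i])  # highest peak before the trough
    return j - series[i]

prices = [3, 5, 1, 4]  # peak 5 at t=1, trough 1 at t=2 -> drawdown 4
print(maximum_drawdown(prices))  # 4
```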

def maximum_drawup(series):
    series = np.asarray(series)
    if len(series)<2:
        return 0
    

    series = -series
    i = np.argmax(np.maximum.accumulate(series) - series)
    if len(series[:i]) < 1:
        return np.nan
    j = np.max(series[:i])
    return j - series[i]

def drawdown_duration(series):
    series = np.asarray(series)
    if len(series)<2:
        return 0

    i = np.argmax(np.maximum.accumulate(series) - series)  # trough index
    if len(series[:i]) == 0:
        j = i
    else:
        j = np.argmax(series[:i])
    return i - j

def drawup_duration(series):
    series = np.asarray(series)
    if len(series)<2:
        return 0

    series = -series
    i = np.argmax(np.maximum.accumulate(series) - series)
    if len(series[:i]) == 0:
        j = i
    else:
        j = np.argmax(series[:i])
    return i - j

def max_over_min(series):
    if len(series)<2:
        return 0
    if np.min(series) == 0:
        return np.nan
    return np.max(series)/np.min(series)

def mean_n_absolute_max(x, number_of_maxima = 1):
    """ Calculates the arithmetic mean of the n absolute maximum values of the time series."""
    assert (
        number_of_maxima > 0
    ), f"number_of_maxima={number_of_maxima} which is not greater than 0"

    n_absolute_maximum_values = np.sort(np.absolute(x))[-number_of_maxima:]

    return np.mean(n_absolute_maximum_values) if len(x) > number_of_maxima else np.nan


def count_above(x, t):
    if len(x)==0:
        return np.nan
    else:
        return np.sum(x >= t) / len(x)

def count_below(x, t):
    if len(x)==0:
        return np.nan
    else:
        return np.sum(x <= t) / len(x)

# number of valleys = number_peaks(-x, n)
def _roll(a, shift):
    # Circularly shift a 1-D array by `shift` positions (helper for number_peaks).
    if not isinstance(a, np.ndarray):
        a = np.asarray(a)
    idx = shift % len(a)
    return np.concatenate([a[-idx:], a[:-idx]])

def number_peaks(x, n):
    """
    Calculates the number of peaks of at least support n in the time series x.
    A peak of support n is a value that is bigger than its n neighbours to the
    left and to the right.
    """
    x_reduced = x[n:-n]

    res = None
    for i in range(1, n + 1):
        result_first = x_reduced > _roll(x, i)[n:-n]

        if res is None:
            res = result_first
        else:
            res &= result_first

        res &= x_reduced > _roll(x, -i)[n:-n]
    return np.sum(res)

def mean_abs_change(x):
    return np.mean(np.abs(np.diff(x)))

def mean_change(x):
    x = np.asarray(x)
    return (x[-1] - x[0]) / (len(x) - 1) if len(x) > 1 else np.nan

def mean_second_derivative_central(x):
    x = np.asarray(x)
    return (x[-1] - x[-2] - x[1] + x[0]) / (2 * (len(x) - 2)) if len(x) > 2 else np.nan

def root_mean_square(x):
    return np.sqrt(np.mean(np.square(x))) if len(x) > 0 else np.nan

def absolute_sum_of_changes(x):
    return np.sum(np.abs(np.diff(x)))

import itertools

def _get_length_sequences_where(x):
    # Lengths of consecutive True runs in a boolean mask (helper for the strike features).
    if len(x) == 0:
        return [0]
    res = [len(list(group)) for value, group in itertools.groupby(x) if value == 1]
    return res if len(res) > 0 else [0]

def longest_strike_below_mean(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.max(_get_length_sequences_where(x < np.mean(x))) if x.size > 0 else 0

def longest_strike_above_mean(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.max(_get_length_sequences_where(x > np.mean(x))) if x.size > 0 else 0

def count_above_mean(x):
    m = np.mean(x)
    return np.where(x > m)[0].size

def count_below_mean(x):
    m = np.mean(x)
    return np.where(x < m)[0].size

def last_location_of_maximum(x):
    x = np.asarray(x)
    return 1.0 - np.argmax(x[::-1]) / len(x) if len(x) > 0 else np.nan

def first_location_of_maximum(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.argmax(x) / len(x) if len(x) > 0 else np.nan

def last_location_of_minimum(x):
    x = np.asarray(x)
    return 1.0 - np.argmin(x[::-1]) / len(x) if len(x) > 0 else np.nan

def first_location_of_minimum(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.argmin(x) / len(x) if len(x) > 0 else np.nan

# Test non-consecutive non-reoccurring values?
def percentage_of_reoccurring_values_to_all_values(x):
    if len(x) == 0:
        return np.nan
    unique, counts = np.unique(x, return_counts=True)
    if counts.shape[0] == 0:
        return 0
    return np.sum(counts > 1) / float(counts.shape[0])

def percentage_of_reoccurring_datapoints_to_all_datapoints(x):
    if len(x) == 0:
        return np.nan
    if not isinstance(x, pd.Series):
        x = pd.Series(x)
    value_counts = x.value_counts()
    reoccuring_values = value_counts[value_counts > 1].sum()
    if np.isnan(reoccuring_values):
        return 0

    return reoccuring_values / x.size


def sum_of_reoccurring_values(x):
    unique, counts = np.unique(x, return_counts=True)
    counts[counts < 2] = 0
    counts[counts > 1] = 1
    return np.sum(counts * unique)

def sum_of_reoccurring_data_points(x):
    unique, counts = np.unique(x, return_counts=True)
    counts[counts < 2] = 0
    return np.sum(counts * unique)

def ratio_value_number_to_time_series_length(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if x.size == 0:
        return np.nan

    return np.unique(x).size / x.size

def abs_energy(x):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    return np.dot(x, x)

def quantile(x, q):
    if len(x) == 0:
        return np.nan
    return np.quantile(x, q)

 # crossing the mean ? other levels ?  
def number_crossing_m(x, m):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
     # From https://stackoverflow.com/questions/3843017/efficiently-detect-sign-changes-in-python 
    positive = x > m
    return np.where(np.diff(positive))[0].size

def absolute_maximum(x):
    return np.max(np.absolute(x)) if len(x) > 0 else np.nan

def value_count(x, value):
    if not isinstance(x, (np.ndarray, pd.Series)):
        x = np.asarray(x)
    if np.isnan(value):
        return np.isnan(x).sum()
    else:
        return x[x == value].size

def range_count(x, min_val, max_val):
    # renamed parameters so they do not shadow the built-ins min/max
    return np.sum((x >= min_val) & (x < max_val))

def mean_diff(x):
    return np.nanmean(np.diff(x.values))
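The category lists below reference several parameterized wrappers (`quantile_01`, `number_peaks_2`, `count_above_0`, `ratio_beyond_01_sigma`, and so on) that the excerpt never defines. A minimal sketch of the missing definitions, with parameter values inferred from the names; these are assumptions, not necessarily the original notebook's exact choices:

```python
import numpy as np

# Quantile wrappers; the suffix encodes the assumed quantile level.
def quantile_01(x):  return np.quantile(x, 0.1)  if len(x) else np.nan
def quantile_025(x): return np.quantile(x, 0.25) if len(x) else np.nan
def quantile_075(x): return np.quantile(x, 0.75) if len(x) else np.nan
def quantile_09(x):  return np.quantile(x, 0.9)  if len(x) else np.nan

# Threshold counts around 0.
def count_above_0(x):
    x = np.asarray(x)
    return np.sum(x >= 0) / len(x) if len(x) else np.nan

def count_below_0(x):
    x = np.asarray(x)
    return np.sum(x <= 0) / len(x) if len(x) else np.nan

def value_count_0(x):
    return np.sum(np.asarray(x) == 0)

def count_near_0(x, eps=1e-2):  # the "near" tolerance is a guess
    return np.sum(np.abs(np.asarray(x)) < eps)

def number_crossing_0(x):
    x = np.asarray(x)
    return np.where(np.diff(x > 0))[0].size

def _ratio_beyond(x, r):
    x = np.asarray(x)
    return np.sum(np.abs(x - np.mean(x)) > r * np.std(x)) / x.size if x.size else np.nan

# "_01_" etc. are read here as 1/2/3 sigma; the notebook may use other values.
def ratio_beyond_01_sigma(x): return _ratio_beyond(x, 1)
def ratio_beyond_02_sigma(x): return _ratio_beyond(x, 2)
def ratio_beyond_03_sigma(x): return _ratio_beyond(x, 3)

# The peak wrappers delegate to number_peaks / mean_n_absolute_max defined earlier:
# number_peaks_2        = lambda x: number_peaks(x, 2)
# mean_n_absolute_max_2 = lambda x: mean_n_absolute_max(x, 2)
# ... and likewise for 5 and 10.
```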

base_stats = ['mean','sum','size','count','std','first','last','min','max',median,skewness,kurtosis]
higher_order_stats = [abs_energy,root_mean_square,sum_values,realized_volatility,realized_abs_skew,realized_skew,realized_vol_skew,realized_quarticity]
additional_quantiles = [quantile_01,quantile_025,quantile_075,quantile_09]
other_min_max = [absolute_maximum,max_over_min]
min_max_positions = [last_location_of_maximum,first_location_of_maximum,last_location_of_minimum,first_location_of_minimum]
peaks = [number_peaks_2, mean_n_absolute_max_2, number_peaks_5, mean_n_absolute_max_5, number_peaks_10, mean_n_absolute_max_10]
counts = [count_unique,count,count_above_0,count_below_0,value_count_0,count_near_0]
reoccuring_values = [count_above_mean,count_below_mean,percentage_of_reoccurring_values_to_all_values,percentage_of_reoccurring_datapoints_to_all_datapoints,sum_of_reoccurring_values,sum_of_reoccurring_data_points,ratio_value_number_to_time_series_length]
count_duplicates = [count_duplicate,count_duplicate_min,count_duplicate_max]  # renamed so the list does not shadow the count_duplicate function
variations = [mean_diff,mean_abs_change,mean_change,mean_second_derivative_central,absolute_sum_of_changes,number_crossing_0]
ranges = [variance_std_ratio,ratio_beyond_01_sigma,ratio_beyond_02_sigma,ratio_beyond_03_sigma,large_standard_deviation,range_ratio]
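With the functions above in scope, these category lists plug straight into `groupby(...).agg(...)`, mixing pandas' string names with plain functions exactly as `base_stats` does. A small self-contained sketch (two of the functions re-declared here, column names invented for the example):

```python
import numpy as np
import pandas as pd

def median(x):
    return np.median(x)

def abs_energy(x):
    return np.dot(x, x)

# Toy data; 'id' and 'value' are illustrative column names.
df = pd.DataFrame({
    "id":    ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 2.0],
})

# Strings select pandas built-ins; bare functions become columns named
# after their __name__.
feats = df.groupby("id")["value"].agg(["mean", "max", median, abs_energy])
print(feats)
```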

References:

https://www.kaggle.com/code/lucasmorin/amex-feature-engineering-2-aggreg-functions

机器学习与Python社区 (Machine Learning & Python Community) · published 2025-01-07 18:23:00