查看: 1106|回复: 0
打印 上一主题 下一主题

[其他] 2023年美赛C题Wordle预测问题二建模及Python代码详细讲解

[复制链接]

418

主题

21

听众

7923

积分

管理员

Rank: 9Rank: 9Rank: 9

纳金币
7448
精华
30

活跃会员 优秀版主 荣誉管理 论坛元老 周年庆

跳转到指定楼层
楼主
发表于 2024-9-24 14:09:23 |只看该作者 |倒序浏览

相关链接

(1)2023年美赛C题Wordle预测问题一建模及Python代码详细讲解
(2)2023年美赛C题Wordle预测问题二建模及Python代码详细讲解
(3)2023年美赛C题Wordle预测问题三、四建模及Python代码详细讲解
(4)2023年美赛C题Wordle预测问题27中文页论文

1 数据分析与特征工程

(1)将2023-3-1的EERIR样本,加入到数据集中,和所有数据集一起预处理和做特征工程。

特征工程中,和问题一第二问中类同的是,提取了’w1’,‘w2’,‘w3’,‘w4’,‘w5’,‘Vowel_fre’,'Consonant_fre’这几个特征,再加上时间特征包括年、月、日、季节、样本序号。

import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsdf = pd.read_excel('data/Problem_C_Data_Wordle.xlsx',header=1)data = df.drop(columns='Unnamed: 0')data.loc[len(data)] = ['2023-3-1',np.nan,'eerie',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]data['Date'] = pd.to_datetime(data['Date'])df =data.copy()df['Words']  = df['Word'].apply(lambda x:str(list(x))[1:-1].replace("'","").replace(" ",""))df['w1'], df['w2'],df['w3'], df['w4'],df['w5'] = df['Words'].str.split(',',n=4).strdf
small = [str(chr(i)) for i in range(ord('a'),ord('z')+1)]letter_map = dict(zip(small,range(1,27)))letter_map

{‘a’: 1, ‘b’: 2, ‘c’: 3, ‘d’: 4, ‘e’: 5, ‘f’: 6, ‘g’: 7, ‘h’: 8, ‘i’: 9, ‘j’: 10, ‘k’: 11, ‘l’: 12, ‘m’: 13, ‘n’: 14, ‘o’: 15, ‘p’: 16, ‘q’: 17, ‘r’: 18, ‘s’: 19, ‘t’: 20, ‘u’: 21, ‘v’: 22, ‘w’: 23, ‘x’: 24, ‘y’: 25, ‘z’: 26}

df['w1'] = df['w1'].map(letter_map)df['w2'] = df['w2'].map(letter_map)df['w3'] = df['w3'].map(letter_map) df['w4'] = df['w4'].map(letter_map)df['w5'] = df['w5'].map(letter_map)df.set_index('Date',inplace=True)df.sort_index(ascending=True,inplace=True)df

(2)统计元音辅音频率

Vowel = ['a','e','i','o','u'] Consonant = list(set(small).difference(set(Vowel)))def count_Vowel(s):    c = 0    for i in range(len(s)):        if s in Vowel:            c+=1    return cdef count_Consonant(s):    c = 0    for i in range(len(s)):        if s in Consonant:            c+=1    return cdf['Vowel_fre'] = df['Word'].apply(lambda x:count_Vowel(x)) df['Consonant_fre'] = df['Word'].apply(lambda x:count_Consonant(x))

(3)提取时间特征

df["year"] = df.index.yeardf["qtr"] = df.index.quarterdf["mon"] = df.index.monthdf["week"] = df.index.weekdf["day"] = df.index.weekdaydf["ix"] = range(0,len(data))time_features = ['year','qtr','mon','week','day','ix']df

(3)构造数据集

from sklearn.preprocessing import StandardScalerfeatures = ['w1','w2','w3','w4','w5','Vowel_fre','Consonant_fre']+time_featureslabel = ['1 try','6 tries','6 tries','6 tries','6 tries','6 tries','7 or more tries (X)']Trian_all = df[features+label].copy().dropna()X = Trian_all[features]# 标准化ss = StandardScaler()X_1 = ss.fit_transform(X)Y_1= Trian_all[label[0]]Y_2= Trian_all[label[1]]Y_3= Trian_all[label[2]]Y_4= Trian_all[label[3]]Y_5= Trian_all[label[4]]Y_6= Trian_all[label[5]]Y_7= Trian_all[label[6]]Trian_all
2 模型预测与评估from sklearn.model_selection import train_test_splitfrom sklearn.linear_model import LinearRegressionX_train, X_test, y_train, y_test = train_test_split(X_1,Y_1, test_size=0.1, random_state=0)reg = LinearRegression().fit(X_train, y_train)p_pred = reg.predict(X_test)test_df =pd.DataFrame(y_test,columns=label)test_df['pred_1'] = p_pred# 计算误差from sklearn.metrics import mean_squared_errorRMSE_1 = np.sqrt(mean_squared_error(test_df[label[0]],test_df['pred_1']))print(f'第1个模型,RMSE误差是:{RMSE_1}')# 预测结果可视化test_df[[label[0],'pred_1']].plot()plt.legend()plt.savefig('img/3.png',dpi=300)plt.show()

然后训练7个回归模型。

第1个模型,RMSE误差是:0.901305736956438

剩余的6个模型,复制粘贴代码,改一下标签就行。


3 预测EERIE难度# 训练所有模型from sklearn.model_selection import train_test_splitfrom sklearn.linear_model import LinearRegressionfrom sklearn.preprocessing import StandardScalerfeatures = ['w1','w2','w3','w4','w5','Vowel_fre','Consonant_fre']+time_featureslabel = ['1 try','2 tries','3 tries','4 tries','5 tries','6 tries','7 or more tries (X)']Trian_all = df[features+label].copy().dropna()X = Trian_all[features]# 标准化ss = StandardScaler()X_1 = ss.fit_transform(X)Y_1= Trian_all[label[0]]Y_2= Trian_all[label[1]]Y_3= Trian_all[label[2]]Y_4= Trian_all[label[3]]Y_5= Trian_all[label[4]]Y_6= Trian_all[label[5]]Y_7= Trian_all[label[6]]reg1 = LinearRegression().fit(X_1, Y_1)reg2 = LinearRegression().fit(X_1, Y_2)reg3 = LinearRegression().fit(X_1, Y_3)reg4 = LinearRegression().fit(X_1, Y_4)reg5 = LinearRegression().fit(X_1, Y_5)reg6 = LinearRegression().fit(X_1, Y_6)reg7 = LinearRegression().fit(X_1, Y_7)X_pred = ss.fit_transform(np.array(df.loc['2023-3-1'][features]).reshape(1,-1))p_pred1 = reg1.predict(X_pred)p_pred2 = reg2.predict(X_pred)p_pred3 = reg3.predict(X_pred)p_pred4 = reg4.predict(X_pred)p_pred5 = reg5.predict(X_pred)p_pred6 = reg6.predict(X_pred)p_pred7 = reg7.predict(X_pred)print(p_pred1,p_pred2,p_pred3,p_pred4,p_pred5,p_pred6,p_pred7)

进行预测3月1号的EERIR的百分比,结果如下

[0.46327684] [5.77683616] [22.67231638] [32.97457627] [23.68361582] [11.5819209] [2.81920904]

评价模型的好坏,除了上面的RMSE还有MAPE、MAE、MSE等误差的计算方法,都可以计算一下,做一个表格来评价模型。

改进的地方,就是可以对比其他的机器学习回归模型,比如KNN回归、随机森林回归等。

3 CodeCode获取,在浏览器中输入:betterbench.top/#/40/detail,或者Si我
分享到: QQ好友和群QQ好友和群 腾讯微博腾讯微博 腾讯朋友腾讯朋友 微信微信
转播转播0 分享淘帖0 收藏收藏0 支持支持0 反对反对0
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

手机版|纳金网 ( 闽ICP备2021016425号-2/3

GMT+8, 2024-11-13 08:09 , Processed in 0.091577 second(s), 29 queries .

Powered by Discuz!-创意设计 X2.5

© 2008-2019 Narkii Inc.

回顶部