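Assignment requirements and references

Assignment requirements

Predict the PM2.5 value of the 10th hour from the 18 features (including PM2.5) of the previous 9 hours.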

References

http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML20.html
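This mostly follows the reference code, with some comments added to make it easier to understand.

All the print calls can be removed; they are only there to check that the data being output is correct.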

Loading the training data

train.csv covers 12 months, with 20 days taken from each month and 24 hours of data per day (each hour has 18 features).

import sys
import pandas as pd
import numpy as np

data = pd.read_csv('./hw1_train.csv', encoding = 'big5')

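Processing the data

Keep only the numeric part, that is, the data from the fourth column onward.
Comparing the printed output with train.csv shows what has changed.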

data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()

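Feature extraction 1

Regroup the data by month: the original 4320 × 24 array (12 months × 20 days × 18 features as rows, 24 hours as columns) becomes 12 monthly arrays, each of 18 features × 480 hours.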

month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample

print(month_data)
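
To make the regrouping easier to follow, here is a small sketch on synthetic data, assuming the same layout as above (one block of feature rows per day, one column per hour). The toy sizes and the name `toy_raw` are made up for illustration, and the vectorized reshape is an alternative way to reach the same result, not the reference code's method.

```python
import numpy as np

n_features, n_days, n_hours = 3, 2, 4   # toy sizes instead of 18, 20, 24

# Rows are stacked as day 0 (all features), day 1 (all features), ...
toy_raw = np.arange(n_days * n_features * n_hours, dtype=float)
toy_raw = toy_raw.reshape(n_days * n_features, n_hours)

# Loop version, same indexing pattern as the month loop (for one month)
sample = np.empty([n_features, n_days * n_hours])
for day in range(n_days):
    rows = slice(n_features * day, n_features * (day + 1))
    sample[:, day * n_hours:(day + 1) * n_hours] = toy_raw[rows, :]

# Vectorized version: split rows into (day, feature), move feature to the
# front, then merge (day, hour) into one continuous time axis
vectorized = (toy_raw.reshape(n_days, n_features, n_hours)
              .transpose(1, 0, 2)
              .reshape(n_features, n_days * n_hours))

print(sample.shape)                        # (3, 8)
print(np.array_equal(sample, vectorized))  # True
```

Each row of `sample` is one feature followed across all hours of all days, which is what makes the sliding windows in the next step possible.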

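Feature extraction 2

Each month has 20 × 24 = 480 hours. Every 9 consecutive hours form one sample, so each month yields 471 samples and there are 471 × 12 samples in total, each with 9 × 18 features.
The corresponding targets (the PM2.5 value of the 10th hour) also number 471 × 12.

Note: where does 471 come from? Each month has 480 hours, and each group of 9 hours is used to predict the 10th hour. For example, hours 1-9 form one group that predicts hour 10, hours 2-10 form the next group that predicts hour 11, and so on, until the last group, hours 471-479, predicts hour 480.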
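
The count can be checked with a few lines of plain Python. This is a minimal sketch; `count_windows` is a hypothetical helper name introduced here and is not part of the reference code.

```python
def count_windows(total_hours, window, horizon=1):
    """Number of (input window, target) pairs in a series of total_hours.

    Each sample uses `window` input hours and a target `horizon` hours
    after the last input hour.
    """
    return max(0, total_hours - window - horizon + 1)


hours_per_month = 20 * 24          # 480 hours in each month of data
samples_per_month = count_windows(hours_per_month, window=9)

# 1-based (first input hour, last input hour, target hour) of each sample
windows = [(s, s + 8, s + 9) for s in range(1, samples_per_month + 1)]

print(samples_per_month)           # 471
print(windows[0], windows[-1])     # (1, 9, 10) (471, 479, 480)
print(12 * samples_per_month)      # 5652 samples over 12 months
```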

x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                # a window starting here would run past hour 480, so skip it and move on to the next month
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day * 24 + hour + 9].reshape(1, -1) # vector dim: 18 * 9 (9 hours for each of the 18 features)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9] # target value: PM2.5 of the 10th hour
print(x)
print(y)
[[14.  14.  14.  ...  2.   2.   0.5]
 [14.  14.  13.  ...  2.   0.5  0.3]
 [14.  13.  12.  ...  0.5  0.3  0.8]
 ...
 [17.  18.  19.  ...  1.1  1.4  1.3]
 [18.  19.  18.  ...  1.4  1.3  1.6]
 [19.  18.  17.  ...  1.3  1.6  1.8]]
[[30.]
 [41.]
 [44.]
 ...
 [17.]
 [24.]
 [29.]]
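
The call `reshape(1, -1)` flattens each 18 × 9 block feature by feature, so the 9 hours of one feature sit next to each other in the sample vector. The sketch below illustrates that ordering on a labelled dummy block; `flat_index` is a hypothetical helper, and row 9 is used only because the code above reads the PM2.5 target from row 9.

```python
import numpy as np

n_features, window = 18, 9

# Label every cell with (feature * 100 + hour) so its origin stays visible
block = np.array([[f * 100 + h for h in range(window)]
                  for f in range(n_features)], dtype=float)

flat = block.reshape(1, -1)        # same call as in the extraction loop


def flat_index(feature, hour):
    """Column of (feature, hour) in the flattened 18 * 9 vector."""
    return feature * window + hour


print(flat.shape)                  # (1, 162)
print(flat_index(9, 8))            # 89
print(flat[0, flat_index(9, 8)])   # 908.0, i.e. feature row 9, last input hour
```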

Normalization

There are several ways to normalize data; see the blog post below. This article uses the Z-score method.
https://www.cnblogs.com/lvdongjie/p/11349701.html

mean_x = np.mean(x, axis = 0) #18 * 9
std_x = np.std(x, axis = 0) #18 * 9
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x
array([[-1.35825331, -1.35883937, -1.359222  , ...,  0.26650729,
         0.2656797 , -1.14082131],
       [-1.35825331, -1.35883937, -1.51819928, ...,  0.26650729,
        -1.13963133, -1.32832904],
       [-1.35825331, -1.51789368, -1.67717656, ..., -1.13923451,
        -1.32700613, -0.85955971],
       ...,
       [-0.88092053, -0.72262212, -0.56433559, ..., -0.57693779,
        -0.29644471, -0.39079039],
       [-0.7218096 , -0.56356781, -0.72331287, ..., -0.29578943,
        -0.39013211, -0.1095288 ],
       [-0.56269867, -0.72262212, -0.88229015, ..., -0.38950555,
        -0.10906991,  0.07797893]])
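
The double loop applies the Z-score formula z = (x - mean) / std column by column and leaves zero-variance columns untouched. A vectorized sketch of the same computation on made-up data follows; `zscore_columns` is a hypothetical helper name, not something from the reference code.

```python
import numpy as np


def zscore_columns(a):
    """Z-score each column; columns with zero std are left unchanged."""
    mean = a.mean(axis=0)
    std = a.std(axis=0)
    out = a.astype(float)          # astype returns a copy, the input is untouched
    nonzero = std != 0
    out[:, nonzero] = (out[:, nonzero] - mean[nonzero]) / std[nonzero]
    return out, mean, std


toy = np.array([[1.0, 5.0, 10.0],
                [2.0, 5.0, 20.0],
                [3.0, 5.0, 30.0]])   # the middle column is constant

normalized, mean, std = zscore_columns(toy)
print(normalized[:, 0])   # roughly [-1.2247, 0, 1.2247]
print(normalized[:, 1])   # [5. 5. 5.], unchanged because its std is 0
```

The returned mean and std play the role of `mean_x` and `std_x`, which the testing step reuses so that test data is scaled the same way as the training data.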
# Split into a training set and a validation set: train_set is used for training, validation is used to check the result
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8):, :]
y_validation = y[math.floor(len(y) * 0.8):, :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))

Training

dim = 18 * 9 + 1 # 18 * 9 features plus one bias term
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float) # prepend a column of ones for the bias
learning_rate = 100
iter_time = 1000
adagrad = np.zeros([dim, 1]) # running sum of squared gradients
eps = 0.0000000001 # avoids division by zero
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12) # rmse
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('myweight.npy', w)
w
0:27.071214829194115
100:33.78905859777455
200:19.913751298197102
300:13.531068193689693
400:10.64546615844617
500:9.277353455475062
600:8.518042045956497
700:8.014061987588418
800:7.636756824775688
900:7.336563740371121
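
The training loop runs on all of `x`, so the `x_train_set` / `x_validation` split made earlier is never actually used. One possible way to use it is to fit on the training part with the same Adagrad update and report RMSE on the held-out part. The sketch below does this on synthetic data; the helper names `rmse` and `train_adagrad`, the toy weights, and the hyperparameters are assumptions chosen for illustration, not values from the assignment.

```python
import numpy as np


def rmse(x, y, w):
    """Root-mean-square error of the linear model x @ w against y."""
    return float(np.sqrt(np.mean((x @ w - y) ** 2)))


def train_adagrad(x, y, learning_rate=1.0, iter_time=2000, eps=1e-10):
    """Linear regression fitted with the same Adagrad update rule."""
    w = np.zeros([x.shape[1], 1])
    adagrad = np.zeros([x.shape[1], 1])
    for _ in range(iter_time):
        gradient = 2 * x.T @ (x @ w - y)   # gradient of the squared error
        adagrad += gradient ** 2           # running sum of squared gradients
        w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
    return w


# Synthetic stand-in for the normalized features (not the real data)
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
x_all = np.concatenate((np.ones([500, 1]), features), axis=1)  # bias column
true_w = np.array([[2.0], [1.0], [-3.0], [0.5]])
y_all = x_all @ true_w + rng.normal(scale=0.1, size=(500, 1))

split = int(len(x_all) * 0.8)              # 80/20 split, as in the article
x_tr, y_tr = x_all[:split], y_all[:split]
x_val, y_val = x_all[split:], y_all[split:]

w_fit = train_adagrad(x_tr, y_tr)
print("train RMSE:", rmse(x_tr, y_tr, w_fit))
print("validation RMSE:", rmse(x_val, y_val, w_fit))
```

A validation RMSE close to the training RMSE suggests the model is not overfitting; on the real data the two numbers would come from `x_train_set` and `x_validation` (with the bias column added to both).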

Testing

testdata = pd.read_csv('./hw1_test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:].copy() # copy so the assignment below does not write into a slice of testdata
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18*9], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)
# normalize with the mean and std computed on the training data
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x
array([[ 1.        , -0.24447681, -0.24545919, ..., -0.67065391,
        -1.04594393,  0.07797893],
       [ 1.        , -1.35825331, -1.51789368, ...,  0.17279117,
        -0.10906991, -0.48454426],
       [ 1.        ,  1.5057434 ,  1.34508393, ..., -1.32666675,
        -1.04594393, -0.57829812],
       ...,
       [ 1.        ,  0.3919669 ,  0.54981237, ...,  0.26650729,
        -0.20275731,  1.20302531],
       [ 1.        , -1.8355861 , -1.8360023 , ..., -1.04551839,
        -1.13963133, -1.14082131],
       [ 1.        , -1.35825331, -1.35883937, ...,  2.98427476,
         3.26367657,  1.76554849]])

Prediction

w = np.load('myweight.npy')
ans_y = np.dot(test_x, w)
ans_y

Saving the results to CSV

import csv
with open('my_submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
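
A quick way to sanity-check the submission format without touching the real predictions is to write a few rows to an in-memory buffer and read them back. This sketch uses only the standard library; the prediction values are placeholders, not results from the model.

```python
import csv
import io

fake_predictions = [[5.2], [18.7], [23.1]]   # placeholder values, shape (n, 1)

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "value"])
for i, pred in enumerate(fake_predictions):
    writer.writerow(["id_" + str(i), pred[0]])

# Read the text back and check the structure of what was written
buffer.seek(0)
rows = list(csv.reader(buffer))

print(rows[0])        # ['id', 'value']
print(rows[1])        # ['id_0', '5.2']
print(len(rows) - 1)  # 3 data rows
```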