Hung-yi Lee Machine Learning 2021

HW1 Regression

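Until now I had only used PyTorch to train on image data. This assignment works directly with CSV data, so here are some notes on the relevant steps.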

Loading the data

import pandas as pd

train_data = pd.read_csv('./ml2021spring-hw1/covid.train.csv')
test_data = pd.read_csv('./ml2021spring-hw1/covid.test.csv')
train_data.shape, test_data.shape
# ((2700, 95), (893, 94))

Data dimensions

● States (40, encoded to one-hot vectors)

○ e.g. AL, AK, AZ, …

● COVID-like illness (4)

○ e.g. cli,ili (influenza-like illness), …

● Behavior Indicators (8)

○ e.g. wearing_mask, travel_outside_state, …

● Mental Health Indicators (5)

○ e.g. anxious, depressed, …

● Tested Positive Cases (1)

○ tested_positive (this is what we want to predict)
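The groups listed above add up to 58 columns, while `shape` reports 95 and 94. The sketch below is a sanity check under an assumption that is not stated in this post: each row has one id column, the 40 state columns, and the 18 survey columns repeated for 3 consecutive days, with the last day's tested_positive missing from the test file. The constant and function names are made up for illustration.

```python
# Assumed layout (not stated in the post): id + states + 3 days of survey columns.
N_STATES = 40
PER_DAY = 4 + 8 + 5 + 1   # COVID-like illness + behavior + mental health + tested_positive
N_DAYS = 3
ID_COLS = 1


def expected_columns(has_target: bool) -> int:
    """Column count of a CSV file under the assumed layout."""
    total = ID_COLS + N_STATES + PER_DAY * N_DAYS
    # The test file lacks the final tested_positive column
    return total if has_target else total - 1


print(expected_columns(has_target=True))   # train file
print(expected_columns(has_target=False))  # test file
```

If this layout holds, the slice `iloc[index, :94]` used later in this post also feeds the id column to the network as an input feature.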

Defining the network

import torch
from torch import nn
from torch.optim import Adam
from pytorch_lightning import LightningModule

class Net(LightningModule):
    def __init__(self, n_feature):
        super().__init__()
        # One hidden layer; the output layer stays linear because this is regression
        self.net = nn.Sequential(
            nn.Linear(n_feature, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        self.criterion = nn.MSELoss(reduction='mean')

    def forward(self, x):
        # (b, n_feature) -> (b, 1) -> (b,), so the output shape matches y
        return self.net(x).squeeze(1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        pred = self(x)
        loss = self.criterion(pred, y)  # same value as torch.mean((pred - y) ** 2)
        self.log('my_loss', loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        return Adam(self.parameters(), lr=1e-2)
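The `squeeze(1)` in `forward` matters more than it looks. A minimal sketch, using made-up tensors rather than the real model, of what can happen when a `(b, 1)` prediction is subtracted from a `(b,)` target without it:

```python
import torch

pred_2d = torch.zeros(4, 1)                  # shape of a raw nn.Linear(64, 1) output for a batch of 4
y = torch.tensor([1.0, 2.0, 3.0, 4.0])       # targets, shape (4,)

diff_wrong = pred_2d - y                     # broadcasting pairs every prediction with every target
diff_right = pred_2d.squeeze(1) - y          # element-wise difference, one value per sample

print(diff_wrong.shape, diff_right.shape)
```

The first difference has one entry per (prediction, target) pair, so a mean over it is no longer the per-sample squared error, and no exception is raised to signal the mistake.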

Pitfall

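This was my own silly mistake: I reused the network structure from an earlier hexadecimal digit recognition task, whose last layer was a sigmoid. This time the task is regression, so the output is a real number rather than a class. As a result the loss stayed stuck at around 60 and would not go down, and I did not expect this to be the cause.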
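A rough illustration of why a sigmoid output layer can stall a regression loss: its output never exceeds 1, so when the targets are larger than 1 the squared error has a floor that training cannot get below. The target values here are hypothetical and are not taken from the dataset.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


targets = [5.0, 10.0, 15.0]   # hypothetical regression targets, all above 1

# Even with a very large pre-activation, the sigmoid output never exceeds 1
best_sigmoid_preds = [sigmoid(50.0) for _ in targets]

# Loss if every prediction were exactly 1, the upper limit of a sigmoid
floor = mse([1.0] * len(targets), targets)

print(mse(best_sigmoid_preds, targets), floor)
```

Removing the final activation, as in the `Net` definition above, lets the last linear layer produce values of any size.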

DataLoader

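import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# No torchvision transform is needed here: the rows are already numeric, not images

class MySet(Dataset):
    # Store the DataFrame
    def __init__(self, df):
        self.df = df

    # Return the sample at the given index: columns 0-93 are the inputs, column 94 is the target
    def __getitem__(self, index):
        X = torch.tensor(np.array(self.df.iloc[index, :94]), dtype=torch.float)
        y = torch.tensor(np.array(self.df.iloc[index, 94]), dtype=torch.float)
        return X, y

    # Return the total number of samples
    def __len__(self):
        return self.df.shape[0]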

train_data = MySet(train_data)
train = DataLoader(train_data, batch_size=128, shuffle=True)
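`MySet.__getitem__` goes through `iloc` and builds new tensors for every sample. One possible alternative, shown here as a sketch rather than what this post actually ran, is to convert the whole table to tensors once and wrap them in `TensorDataset`. A random tensor stands in for the CSV, and its size is only illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the CSV: 10 samples, 94 inputs + 1 target
table = torch.randn(10, 95)
X, y = table[:, :94], table[:, 94]

dataset = TensorDataset(X, y)                          # indexes both tensors together
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)
```

With real data, `table` could come from `torch.tensor(train_data.values, dtype=torch.float)`, assuming every column is numeric.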

Training

from pytorch_lightning import Trainer

model = Net(94)
# gpus=0 trains on the CPU; Lightning 2.x removed this argument in favor of accelerator='cpu'
trainer = Trainer(gpus=0, max_epochs=500)
trainer.fit(model, train)

Prediction

X = torch.tensor(np.array(test_data.iloc[:, :94]), dtype=torch.float)
model.eval()
with torch.no_grad():  # no gradients are needed for inference
    result = model(X).numpy().reshape(-1)
sub = pd.DataFrame(columns=['id', 'tested_positive'])
sub['id'] = list(range(test_data.shape[0]))
sub['tested_positive'] = result
sub.to_csv('result.csv', index=False)

Results

First place scored 0.86536, so there is still a lot of room for improvement.
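To compare a training loss with a leaderboard number, it helps to know which metric is used. Assuming the leaderboard reports root mean squared error (this post does not say so), the score is the square root of the MSE that the training step logs. A small sketch with hypothetical values:

```python
import math


def rmse(preds, targets):
    """Root mean squared error: the square root of the mean squared error."""
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)
    return math.sqrt(mse)


# Hypothetical predictions and targets, only to show the computation
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```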