Riiid Questions Correct Prediction
A summary of good notebooks from Kaggle competitions
Source: 'Riiid: Comprehensive EDA + Baseline' https://www.kaggle.com/erikbruin/riiid-comprehensive-eda-baseline
1. EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns
import os
from matplotlib.ticker import FuncFormatter
for dirname, _, filenames in os.walk('/kaggle/input/riiid-test-answer-prediction'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
As the train dataset is huge, I am gladly using the pickle that Rohan Rao prepared in this kernel: https://www.kaggle.com/rohanrao/tutorial-on-reading-large-datasets/ (Thanks Rohan!). I actually do this at work all the time, and in this case it reduces the time to load the dataset (with the data types specified in the file description) from close to 9 minutes to about 16 seconds.
As we can see, we have over 101 million rows in the train set.
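For reference, a pickle like this can be created once from the raw train.csv by reading it with explicit column dtypes and then serializing the DataFrame; every later load then skips CSV parsing entirely. The sketch below only illustrates that idea, with an assumed dtype mapping and output file name, and is not Rohan's exact code:
# One-off conversion sketch: read the CSV with explicit dtypes, then pickle it.
# The dtype mapping and output file name here are assumptions for illustration.
dtypes = {
    'row_id': 'int64',
    'timestamp': 'int64',
    'user_id': 'int32',
    'content_id': 'int16',
    'content_type_id': 'int8',
    'task_container_id': 'int16',
    'user_answer': 'int8',
    'answered_correctly': 'int8',
    'prior_question_elapsed_time': 'float32'
}
raw = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/train.csv', dtype=dtypes)
raw.to_pickle('riiid_train.pkl')  # later pd.read_pickle() calls load this in seconds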
%%time
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")
print("Train size:", train.shape)
train.memory_usage(deep=True)
Index 128
row_id 809842656
timestamp 809842656
user_id 404921328
content_id 202460664
content_type_id 101230332
task_container_id 202460664
user_answer 101230332
answered_correctly 101230332
prior_question_elapsed_time 404921328
prior_question_had_explanation 3594972816
dtype: int64
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101230332 entries, 0 to 101230331
Data columns (total 10 columns):
# Column Dtype
--- ------ -----
0 row_id int64
1 timestamp int64
2 user_id int32
3 content_id int16
4 content_type_id bool
5 task_container_id int16
6 user_answer int8
7 answered_correctly int8
8 prior_question_elapsed_time float32
9 prior_question_had_explanation object
dtypes: bool(1), float32(1), int16(2), int32(1), int64(2), int8(2), object(1)
memory usage: 3.7+ GB
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')
train.memory_usage(deep=True)
Index 128
row_id 809842656
timestamp 809842656
user_id 404921328
content_id 202460664
content_type_id 101230332
task_container_id 202460664
user_answer 101230332
answered_correctly 101230332
prior_question_elapsed_time 404921328
prior_question_had_explanation 202460664
dtype: int64
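The cast to the nullable 'boolean' dtype is what makes the difference above: instead of one Python object per row, the column is stored as a value byte plus an NA-mask byte, so it shrinks from about 3.6 GB to 2 bytes × 101,230,332 rows = 202,460,664 bytes while keeping the missing values that a plain bool cast would drop. A toy illustration of the same effect (my example, not from the notebook):
# Toy example: object column of True/False/None versus the nullable 'boolean' dtype.
s_obj = pd.Series([True, False, None] * 1_000_000, dtype='object')
s_bool = s_obj.astype('boolean')
print(s_obj.memory_usage(deep=True))   # large: 8-byte pointers plus the Python objects behind them
print(s_bool.memory_usage(deep=True))  # ~6 MB: one value byte and one mask byte per element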
%%time
questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')
example_sample_submission = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv')
CPU times: user 15.3 ms, sys: 4.43 ms, total: 19.7 ms
Wall time: 22.7 ms
1.1 Exploring the Train Set
print(f'We have {train.user_id.nunique()} unique users in our train set')
We have 393656 unique users in our train set
train.content_type_id.value_counts()
False 99271300
True 1959032
Name: content_type_id, dtype: int64
print(f'We have {train.content_id.nunique()} content ids in our train set, of which {train[train.content_type_id == False].content_id.nunique()} are questions.')
We have 13782 content ids in our train set, of which 13523 are questions.
cids = train.content_id.value_counts()[:30]
fig = plt.figure(figsize=(12,6))
ax = cids.plot.bar()
plt.title("Thirty most used content id's")
plt.xticks(rotation=90)
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ','))) #add thousands separator
plt.show()
print(f'We have {train.task_container_id.nunique()} unique batches of questions or lectures.')
We have 10000 unique batches of questions or lectures.
train.user_answer.value_counts()
0 28186489
1 26990007
3 26084784
2 18010020
-1 1959032
Name: user_answer, dtype: int64
#1 year = 31536000000 ms
ts = train['timestamp']/(31536000000/12)
fig = plt.figure(figsize=(12,6))
ts.plot.hist(bins=100)
plt.title("Histogram of timestamp")
plt.xticks(rotation=0)
plt.xlabel("Months between this user interaction and the first event completion from that user")
plt.show()
print(f'Of the {train.user_id.nunique()} users in train we have {train[train.timestamp == 0].user_id.nunique()} users with a timestamp zero row.')
Of the 393656 users in train we have 393656 users with a timestamp zero row.
The target: answered_correctly
correct = train[train.answered_correctly != -1].answered_correctly.value_counts(ascending=True)
fig = plt.figure(figsize=(12,4))
correct.plot.barh()
for i, v in zip(correct.index, correct.values):
    plt.text(v, i, '{:,}'.format(v), color='white', fontweight='bold', fontsize=14, ha='right', va='center')
plt.title("Questions answered correctly")
plt.xticks(rotation=0)
plt.show()
bin_labels_5 = ['Bin_1', 'Bin_2', 'Bin_3', 'Bin_4', 'Bin_5']
train['ts_bin'] = pd.qcut(train['timestamp'], q=5, labels=bin_labels_5)
#make a function that can also be used for other fields
def correct(field):
    correct = train[train.answered_correctly != -1].groupby([field, 'answered_correctly'], as_index=False).size()
    correct = correct.pivot(index=field, columns='answered_correctly', values='size')
    correct['Percent_correct'] = round(correct.iloc[:,1]/(correct.iloc[:,0] + correct.iloc[:,1]), 2)
    correct = correct.sort_values(by="Percent_correct", ascending=False)
    correct = correct.iloc[:,2]
    return correct
bins_correct = correct("ts_bin")
bins_correct = bins_correct.sort_index()
fig = plt.figure(figsize=(12,6))
plt.bar(bins_correct.index, bins_correct.values)
for i, v in zip(bins_correct.index, bins_correct.values):
    plt.text(i, v, v, color='white', fontweight='bold', fontsize=14, va='top', ha='center')
plt.title("Percent answered_correctly for 5 bins of timestamp")
plt.xticks(rotation=0)
plt.show()
task_id_correct = correct("task_container_id")
fig = plt.figure(figsize=(12,6))
task_id_correct.plot.hist(bins=40)
plt.title("Histogram of percent_correct grouped by task_container_id")
plt.xticks(rotation=0)
plt.show()
user_percent = train[train.answered_correctly != -1].groupby('user_id')['answered_correctly'].agg(Mean='mean', Answers='count')
print(f'The highest number of questions answered by a user is {user_percent.Answers.max()}')
The highest number of questions answered by a user is 17609
user_percent = user_percent.query('Answers <= 1000').sample(n=200, random_state=1)
fig = plt.figure(figsize=(12,6))
x = user_percent.Answers
y = user_percent.Mean
plt.scatter(x, y, marker='o')
plt.title("Percent answered correctly versus number of questions answered User")
plt.xticks(rotation=0)
plt.xlabel("Number of questions answered")
plt.ylabel("Percent answered correctly")
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
content_percent = train[train.answered_correctly != -1].groupby('content_id')['answered_correctly'].agg(Mean='mean', Answers='count')
print(f'The highest number of questions asked for a single content_id is {content_percent.Answers.max()}.')
print(f'Of {len(content_percent)} content_ids, {len(content_percent[content_percent.Answers > 25000])} content_ids had more than 25,000 questions asked.')
The highest number of questions asked for a single content_id is 213605.
Of 13523 content_ids, 529 content_ids had more than 25,000 questions asked.
content_percent = content_percent.query('Answers <= 25000').sample(n=200, random_state=1)
fig = plt.figure(figsize=(12,6))
x = content_percent.Answers
y = content_percent.Mean
plt.scatter(x, y, marker='o')
plt.title("Percent answered correctly versus number of questions answered Content_id")
plt.xticks(rotation=0)
plt.xlabel("Number of questions answered")
plt.ylabel("Percent answered correctly")
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
pq = train[train.answered_correctly != -1].groupby(['prior_question_had_explanation'], dropna=False).agg({'answered_correctly': ['mean', 'count']})
#pq.index = pq.index.astype(str)
print(pq.iloc[:,1])
pq = pq.iloc[:,0]
fig = plt.figure(figsize=(12,4))
pq.plot.barh()
# for i, v in zip(pq.index, pq.values):
# plt.text(v, i, round(v,2), color='white', fontweight='bold', fontsize=14, ha='right', va='center')
plt.title("Answered_correctly versus Prior Question had explanation")
plt.xlabel("Percent answered correctly")
plt.ylabel("Prior question had explanation")
plt.xticks(rotation=0)
plt.show()
prior_question_had_explanation
False 9193234
True 89685560
NaN 392506
Name: (answered_correctly, count), dtype: int64
pq = train[train.answered_correctly != -1]
pq = pq[['prior_question_elapsed_time', 'answered_correctly']]
pq = pq.groupby(['answered_correctly']).agg({'answered_correctly': ['count'], 'prior_question_elapsed_time': ['mean']})
pq
answered_correctly | answered_correctly (count) | prior_question_elapsed_time (mean)
---|---|---
0 | 34026673 | 25641.992188
1 | 65244627 | 25309.976562
#please be aware that there is an issue with train.prior_question_elapsed_time.mean()
#see https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/195032
mean_pq = train.prior_question_elapsed_time.astype("float64").mean()
condition = ((train.answered_correctly != -1) & (train.prior_question_elapsed_time.notna()))
pq = train[condition][['prior_question_elapsed_time', 'answered_correctly']].sample(n=200, random_state=1)
pq = pq.set_index('prior_question_elapsed_time').iloc[:,0]
fig = plt.figure(figsize=(12,6))
x = pq.index
y = pq.values
plt.scatter(x, y, marker='o')
plt.title("Answered_correctly versus prior_question_elapsed_time")
plt.xticks(rotation=0)
plt.xlabel("Prior_question_elapsed_time")
plt.ylabel("Answered_correctly")
plt.vlines(mean_pq, ymin=-0.1, ymax=1.1)
plt.text(x= 27000, y=0.4, s='mean')
plt.text(x=80000, y=0.6, s='trend')
z = np.polyfit(x, y, 1)
p = np.poly1d(z)
plt.plot(x,p(x),"r--")
plt.show()
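The astype("float64") before taking the mean is deliberate: the column is stored as float32, and with roughly 100 million values of around 25,000 ms the running sum climbs into the trillions, where float32 can no longer resolve a single additional value, so a float32 mean can come out badly wrong (this is the issue in the discussion thread linked above). A tiny illustration of why, not from the notebook:
# At the magnitude the column's total reaches, float32 silently drops further additions.
big_sum = np.float32(4e12)                     # roughly the order of the column's total sum
print(big_sum + np.float32(25000) == big_sum)  # True: the addition rounds away in float32
print(np.float64(4e12) + 25000 == 4e12)        # False: float64 still resolves it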
1.2 Exploring Questions
1.3 Exploring Lectures
Example Test
2. Model - Baseline
import numpy as np
import pandas as pd
import riiideducation
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.style as style
style.use('fivethirtyeight')
import seaborn as sns
import os
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder
import gc
import sys
pd.set_option('display.max_rows', None)
%%time
cols_to_load = ['row_id', 'user_id', 'answered_correctly', 'content_id', 'prior_question_had_explanation', 'prior_question_elapsed_time']
train = pd.read_pickle("../input/riiid-train-data-multiple-formats/riiid_train.pkl.gzip")[cols_to_load]
train['prior_question_had_explanation'] = train['prior_question_had_explanation'].astype('boolean')
print("Train size:", train.shape)
%%time
questions = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/questions.csv')
lectures = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/lectures.csv')
example_test = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_test.csv')
example_sample_submission = pd.read_csv('/kaggle/input/riiid-test-answer-prediction/example_sample_submission.csv')
train.head()
 | row_id | user_id | answered_correctly | content_id | prior_question_had_explanation | prior_question_elapsed_time
---|---|---|---|---|---|---
0 | 0 | 115 | 1 | 5692 | NaN | NaN
1 | 1 | 115 | 1 | 5716 | False | 37000.0
2 | 2 | 115 | 1 | 128 | False | 55000.0
3 | 3 | 115 | 1 | 7860 | False | 19000.0
4 | 4 | 115 | 1 | 7922 | False | 11000.0
train.shape
(101230332, 6)
%%time
#adding user features
user_df = train[train.answered_correctly != -1].groupby('user_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
user_df.columns = ['user_id', 'user_questions', 'user_mean']
user_lect = train.groupby(["user_id", "answered_correctly"]).size().unstack()
user_lect.columns = ['Lecture', 'Wrong', 'Right']
user_lect['Lecture'] = user_lect['Lecture'].fillna(0)
user_lect = user_lect.astype('Int64')
user_lect['watches_lecture'] = np.where(user_lect.Lecture > 0, 1, 0)
user_lect = user_lect.reset_index()
user_lect = user_lect[['user_id', 'watches_lecture']]
user_df = user_df.merge(user_lect, on = "user_id", how = "left")
del user_lect
user_df.head()
CPU times: user 19.9 s, sys: 4.77 s, total: 24.7 s
Wall time: 24.7 s
 | user_id | user_questions | user_mean | watches_lecture
---|---|---|---|---
0 | 115 | 46 | 0.695652 | 0
1 | 124 | 30 | 0.233333 | 0
2 | 2746 | 19 | 0.578947 | 1
3 | 5382 | 125 | 0.672000 | 1
4 | 8623 | 109 | 0.642202 | 1
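Because answered_correctly is -1 for lecture rows, grouping on (user_id, answered_correctly) and unstacking produces one column per value (-1, 0, 1), which the code above renames to Lecture, Wrong and Right; watches_lecture then simply flags users whose Lecture count is positive. A toy example of the unstack step (made-up user ids, not competition data):
# Toy data: -1 marks a lecture row, 0/1 a wrongly/correctly answered question.
toy = pd.DataFrame({'user_id': [1, 1, 1, 2, 2],
                    'answered_correctly': [-1, 0, 1, 1, 1]})
print(toy.groupby(['user_id', 'answered_correctly']).size().unstack())
# answered_correctly   -1    0    1
# user_id
# 1                   1.0  1.0  1.0
# 2                   NaN  NaN  2.0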
%%time
#adding content features
content_df = train[train.answered_correctly != -1].groupby('content_id').agg({'answered_correctly': ['count', 'mean']}).reset_index()
content_df.columns = ['content_id', 'content_questions', 'content_mean']
CPU times: user 15.8 s, sys: 2.79 s, total: 18.6 s
Wall time: 18.6 s
%%time
#using one of the validation sets composed by tito
cv2_train = pd.read_pickle("../input/riiid-cross-validation-files/cv2_train.pickle")['row_id']
cv2_valid = pd.read_pickle("../input/riiid-cross-validation-files/cv2_valid.pickle")['row_id']
CPU times: user 1.77 s, sys: 7.91 s, total: 9.68 s
Wall time: 14.1 s
train = train[train.answered_correctly != -1]
#save mean before splitting
#please be aware that there is an issue with train.prior_question_elapsed_time.mean()
#see https://www.kaggle.com/c/riiid-test-answer-prediction/discussion/195032
mean_prior = train.prior_question_elapsed_time.astype("float64").mean()
validation = train[train.row_id.isin(cv2_valid)]
train = train[train.row_id.isin(cv2_train)]
validation = validation.drop(columns = "row_id")
train = train.drop(columns = "row_id")
del cv2_train, cv2_valid
gc.collect()
64
label_enc = LabelEncoder()
train = train.merge(user_df, on = "user_id", how = "left")
train = train.merge(content_df, on = "content_id", how = "left")
train['content_questions'].fillna(0, inplace = True)
train['content_mean'].fillna(0.5, inplace = True)
train['watches_lecture'].fillna(0, inplace = True)
train['user_questions'].fillna(0, inplace = True)
train['user_mean'].fillna(0.5, inplace = True)
train['prior_question_elapsed_time'].fillna(mean_prior, inplace = True)
train['prior_question_had_explanation'].fillna(False, inplace = True)
train['prior_question_had_explanation'] = label_enc.fit_transform(train['prior_question_had_explanation'])
train[['content_questions', 'user_questions']] = train[['content_questions', 'user_questions']].astype(int)
train.sample(5)
 | user_id | answered_correctly | content_id | prior_question_had_explanation | prior_question_elapsed_time | user_questions | user_mean | watches_lecture | content_questions | content_mean
---|---|---|---|---|---|---|---|---|---|---
81374868 | 1857010359 | 0 | 8710 | 1 | 6000.0 | 1178 | 0.726655 | 1 | 4561 | 0.650296
52733143 | 1199106875 | 1 | 3871 | 1 | 13000.0 | 2871 | 0.558690 | 1 | 58320 | 0.795953
71174462 | 1625702355 | 0 | 6257 | 1 | 14000.0 | 5032 | 0.648649 | 1 | 17200 | 0.717791
93540588 | 2130517850 | 1 | 860 | 1 | 16000.0 | 518 | 0.762548 | 1 | 9911 | 0.797800
26006872 | 595305625 | 1 | 5010 | 1 | 16000.0 | 479 | 0.649269 | 1 | 10766 | 0.789987
validation = validation.merge(user_df, on = "user_id", how = "left")
validation = validation.merge(content_df, on = "content_id", how = "left")
validation['content_questions'].fillna(0, inplace = True)
validation['content_mean'].fillna(0.5, inplace = True)
validation['watches_lecture'].fillna(0, inplace = True)
validation['user_questions'].fillna(0, inplace = True)
validation['user_mean'].fillna(0.5, inplace = True)
validation['prior_question_elapsed_time'].fillna(mean_prior, inplace = True)
validation['prior_question_had_explanation'].fillna(False, inplace = True)
validation['prior_question_had_explanation'] = label_enc.transform(validation['prior_question_had_explanation']) #reuse the encoder fitted on train rather than refitting
validation[['content_questions', 'user_questions']] = validation[['content_questions', 'user_questions']].astype(int)
validation.sample(5)
 | user_id | answered_correctly | content_id | prior_question_had_explanation | prior_question_elapsed_time | user_questions | user_mean | watches_lecture | content_questions | content_mean
---|---|---|---|---|---|---|---|---|---|---
163923 | 152507119 | 1 | 9825 | 1 | 28000.0 | 138 | 0.514493 | 1 | 4527 | 0.776894
2206154 | 1929293949 | 0 | 6786 | 1 | 35500.0 | 434 | 0.449309 | 1 | 18230 | 0.758640
2261289 | 1980024567 | 0 | 9875 | 1 | 16000.0 | 104 | 0.625000 | 1 | 6313 | 0.369555
2433892 | 2130775706 | 0 | 6043 | 0 | 33000.0 | 31 | 0.580645 | 0 | 15324 | 0.662099
1978050 | 1739122369 | 1 | 4120 | 0 | 26000.0 | 249 | 0.771084 | 1 | 199372 | 0.275585
# features = ['user_questions', 'user_mean', 'content_questions', 'content_mean', 'watches_lecture',
# 'prior_question_elapsed_time', 'prior_question_had_explanation']
features = ['user_questions', 'user_mean', 'content_questions', 'content_mean', 'prior_question_elapsed_time']
#for now just taking 10,000,000 rows for training
train = train.sample(n=10000000, random_state = 1)
y_train = train['answered_correctly']
train = train[features]
y_val = validation['answered_correctly']
validation = validation[features]
params = {'objective': 'binary',
'metric': 'auc',
'seed': 2020,
'learning_rate': 0.1, #default
"boosting_type": "gbdt" #default
}
lgb_train = lgb.Dataset(train, y_train, categorical_feature = None)
lgb_eval = lgb.Dataset(validation, y_val, categorical_feature = None)
del train, y_train, validation, y_val
gc.collect()
80
%%time
model = lgb.train(
params, lgb_train,
valid_sets=[lgb_train, lgb_eval],
verbose_eval=50,
num_boost_round=10000,
early_stopping_rounds=8
)
Training until validation scores don't improve for 8 rounds
[50] training's auc: 0.757093 valid_1's auc: 0.762699
[100] training's auc: 0.757629 valid_1's auc: 0.763182
[150] training's auc: 0.75783 valid_1's auc: 0.76329
[200] training's auc: 0.757989 valid_1's auc: 0.763359
[250] training's auc: 0.758123 valid_1's auc: 0.763419
Early stopping, best iteration is:
[284] training's auc: 0.758218 valid_1's auc: 0.763463
CPU times: user 38min 23s, sys: 46.6 s, total: 39min 9s
Wall time: 12min 11s
lgb.plot_importance(model)
plt.show()
env = riiideducation.make_env()
iter_test = env.iter_test()
for (test_df, sample_prediction_df) in iter_test:
    test_df = test_df.merge(user_df, on = "user_id", how = "left")
    test_df = test_df.merge(content_df, on = "content_id", how = "left")
    test_df['content_questions'].fillna(0, inplace = True)
    test_df['content_mean'].fillna(0.5, inplace = True)
    test_df['watches_lecture'].fillna(0, inplace = True)
    test_df['user_questions'].fillna(0, inplace = True)
    test_df['user_mean'].fillna(0.5, inplace = True)
    test_df['prior_question_elapsed_time'].fillna(mean_prior, inplace = True)
    test_df['prior_question_had_explanation'].fillna(False, inplace = True)
    test_df['prior_question_had_explanation'] = label_enc.transform(test_df['prior_question_had_explanation']) #reuse the encoder fitted on train
    test_df[['content_questions', 'user_questions']] = test_df[['content_questions', 'user_questions']].astype(int)
    test_df['answered_correctly'] = model.predict(test_df[features])
    env.predict(test_df.loc[test_df['content_type_id'] == 0, ['row_id', 'answered_correctly']])
If this post was helpful, please click the recommend button :)