Creating Time Splits
In the video, we learned why splitting data randomly can be dangerous for time series: data from the future can leak into training and cause overfitting in our model. Often with time series, you acquire new data as it is made available, and you will want to retrain your model using the newest data. In the video, we showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use for a test set.
In this exercise, we will create a function that finds the split date, so that the last 45 days of data are used for testing and the rest for training. Note that `timedelta()` has already been imported for you from the standard Python library `datetime`.
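As a small illustration of the date arithmetic involved, here is a minimal sketch in plain Python. The value of `max_date` is a made-up example, not a date taken from the course dataset.

```python
from datetime import date, timedelta

# Hypothetical last date in the dataset; real data would supply this value.
max_date = date(2018, 1, 31)

# Subtracting a timedelta from a date moves it back by a whole number of days.
split_date = max_date - timedelta(days=45)

print(split_date)  # 2017-12-17
```

Everything before `split_date` would go to training, and everything from `split_date` onward to testing.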
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Create a function `train_test_split_date()` that takes in a dataframe, `df`, the date column to use for splitting, `split_col`, and the number of days to use for the test set, `test_days`, and set it to have a default value of 45.
- Find the `min` and `max` dates for `split_col` using,().
- Find the date to split the test and training sets using `max_date` and subtract `test_days` from it by using `timedelta()`, which takes a `days` parameter; in this case, pass in `test_days`.
- Using `OFFMKTDATE` as the `split_col`, find `split_date` and use it to filter the dataframe into two new ones, `train_df` and `test_df`, where `test_df` is only the last 45 days of the data. Additionally, ensure that `test_df` only contains homes listed as of the split date by filtering `df['LISTDATE']` less than or equal to the `split_date`.
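To make the two filter conditions in the last instruction concrete, here is a rough sketch in plain Python rather than PySpark. The three listings and their ids are invented for illustration; only the column names `LISTDATE` and `OFFMKTDATE` come from the exercise.

```python
from datetime import date, timedelta

# Hypothetical listings, invented for this example.
homes = [
    {"id": "A", "LISTDATE": date(2017, 10, 1), "OFFMKTDATE": date(2017, 11, 20)},
    {"id": "B", "LISTDATE": date(2017, 12, 1), "OFFMKTDATE": date(2018, 1, 10)},
    {"id": "C", "LISTDATE": date(2017, 12, 25), "OFFMKTDATE": date(2018, 1, 31)},
]

# The split date is 45 days before the latest off-market date.
max_date = max(h["OFFMKTDATE"] for h in homes)
split_date = max_date - timedelta(days=45)

# Training set: homes that went off the market before the split date.
train = [h["id"] for h in homes if h["OFFMKTDATE"] < split_date]

# Test set: homes that went off the market on or after the split date
# AND were already listed as of the split date.
test = [
    h["id"]
    for h in homes
    if h["OFFMKTDATE"] >= split_date and h["LISTDATE"] <= split_date
]

print(train, test)  # ['A'] ['B']
```

Home C falls in neither set: it went off the market inside the test window, but it was not yet listed on the split date, so the `LISTDATE` condition excludes it.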
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
def train_test_split_date(df, split_col, test_days=____):
    """Calculate the date to split test and training sets"""
    # Find the latest and earliest dates in split_col
    max_date = df.____({____: ____}).collect()[0][0]
    min_date = df.____({____: ____}).collect()[0][0]
    # Subtract an integer number of days from the last date in dataset
    split_date = ____ - timedelta(days=____)
    return split_date
# Find the date to use in splitting test and train
split_date = train_test_split_date(df, ____)
# Create Sequential Test and Training Sets
____ = df.where(df[____] < split_date)
____ = df.where(df[____] >= split_date).where(df[____] <= split_date)
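For reference, one possible way to fill in the blanks is sketched below. It assumes the missing method is PySpark's `DataFrame.agg()` called with a `{column: function}` dictionary, and that `split_col` holds date or timestamp values; neither assumption is confirmed by the exercise text, so treat this as a sketch rather than the official solution.

```python
from datetime import timedelta


def train_test_split_date(df, split_col, test_days=45):
    """Calculate the date to split test and training sets."""
    # Assumption: agg() with a {column: function} dict returns a one-row
    # DataFrame, and collect()[0][0] extracts the single aggregated value.
    max_date = df.agg({split_col: "max"}).collect()[0][0]
    # min_date mirrors the template; it is not needed to compute the split.
    min_date = df.agg({split_col: "min"}).collect()[0][0]
    # Subtract an integer number of days from the last date in the dataset
    split_date = max_date - timedelta(days=test_days)
    return split_date


# Usage, assuming `df` is a Spark DataFrame whose OFFMKTDATE and LISTDATE
# columns are date-typed:
#
# split_date = train_test_split_date(df, "OFFMKTDATE")
# train_df = df.where(df["OFFMKTDATE"] < split_date)
# test_df = (df.where(df["OFFMKTDATE"] >= split_date)
#              .where(df["LISTDATE"] <= split_date))
```

If the date columns are stored as strings, they would need to be converted to a date type first, since the subtraction relies on `max_date` being a Python `date` or `datetime`.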