The dataset is based on the 2015 NYC Yellow Cab trip record data published by the NYC Taxi and Limousine Commission (TLC). The training data is a random sample of 2% of the full-year dataset. The test data does not overlap with the training data and is 1% of the full-year dataset.

File descriptions

train.csv - the training set (contains 2874263 trip records)
test.csv - the testing set (contains 1446517 trip records)

Update: I have uploaded the train and test datasets with the geodesic distance precomputed using the geopy library (from the coordinates in the dataset), to save computation time.
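
For readers who want to sanity-check the precomputed distance, here is a minimal sketch of a trip-distance calculation from pickup and dropoff coordinates. It is not the code used to build the uploaded files: it uses the haversine (great-circle) formula on a spherical Earth, which only approximates the ellipsoidal geodesic that geopy computes, so small differences from the uploaded values are expected. The function name, the Earth radius constant, and the sample coordinates are assumptions made for illustration.

```python
import math

# Mean Earth radius in km (an assumption of the spherical model used here).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical pickup and dropoff points in Manhattan (not taken from the dataset).
pickup = (40.7580, -73.9855)
dropoff = (40.7128, -74.0060)
print(f"approximate trip distance: {haversine_km(*pickup, *dropoff):.3f} km")

# With geopy installed, the ellipsoidal geodesic equivalent would be:
#   from geopy.distance import geodesic
#   geodesic(pickup, dropoff).km
```

The haversine version needs only the standard library, which makes it handy for a quick check. For values that should match the uploaded column closely, geopy's `geodesic` is the better choice.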