```r
knitr::opts_chunk$set(echo = TRUE, error = FALSE)
```
Introduction
This is a comprehensive Exploratory Data Analysis for the New York City Taxi Trip Duration competition with tidy R and ggplot2.
The goal of this playground challenge is to predict the duration of taxi rides in NYC based on features like trip coordinates or pickup date and time. The data comes in the shape of 1.5 million training observations (../input/train.csv) and 630k test observations (../input/test.csv). Each row contains one taxi trip.
In this notebook, we will first study and visualise the original data, engineer new features, and examine potential outliers. Then we add two external data sets on the NYC weather and on the theoretically fastest routes. We visualise and analyse the new features within these data sets and their impact on the target trip_duration values. Finally, we will make a brief excursion into viewing this challenge as a classification problem and finish this notebook with a simple XGBoost model that provides a basic prediction.
I hope that this notebook will help you in getting started with this challenge. There is lots of room for you to develop your own ideas for new features and visualisations. In particular the classification approach and the final model are only a basic starting point for you to improve and optimise them for better performance. As always, any feedback, questions, or constructive criticism are much appreciated.
Originally, this analysis contained a few hidden figures, due to the memory/run-time limitations of the Kernels environment at the time. This means that I had to disable a few auxiliary visualisations as the notebook grew towards its final stage. I tried to only remove those plots that didn’t affect the flow of the analysis too much. All of the corresponding code was still contained in this notebook and you can easily recover these hidden plots. Now, with the new and improved Kernel specs those plots are back in the “official” version. (On related matters, the movie Hidden Figures is a pretty awesome piece of history for maths (and space) geeks like us ;-) )
Finally, I would like to thank everyone of you for viewing, upvoting, or commenting on this kernel! I’m still a beginner here on Kaggle and your support means a lot to me :-) . Also, thanks to NockedDown for suggesting the new updated title pun in the comments! ;-)
Enough talk. Let’s get started:
Load libraries and helper functions
```r
# note: only the ggplot2 line survives in the original chunk; the imports
# below are inferred from the tools used later in this notebook
library('ggplot2')        # visualisation
library('dplyr')          # data manipulation
library('readr')          # input/output
library('tibble')         # data wrangling
library('stringr')        # string manipulation
library('data.table')     # fast file reading
library('lubridate')      # date and time
library('geosphere')      # geospatial distances
library('maps')           # map outlines
library('leaflet')        # interactive maps
library('leaflet.extras') # interactive maps
library('corrplot')       # correlation visualisation
library('alluvial')       # alluvial plots
library('caret')          # modelling tools
library('xgboost')        # modelling
```
We use the multiplot function, courtesy of R Cookbooks to create multi-panel plots.
```r
# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   number of columns in the layout
# - layout: a matrix specifying the layout; if present, 'cols' is ignored
multiplot <- function(..., plotlist = NULL, cols = 1, layout = NULL) {
  library(grid)
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```
Load data
We use data.table’s fread function to speed up reading in the data:
```r
train <- as.tibble(fread('../input/nyc-taxi-trip-duration/train.csv'))
# the test and sample submission files are assumed to live in the same directory
test <- as.tibble(fread('../input/nyc-taxi-trip-duration/test.csv'))
sample_submit <- as.tibble(fread('../input/nyc-taxi-trip-duration/sample_submission.csv'))
```
File structure and content
Let’s have an overview of the data sets using the summary and glimpse tools. First the training data:
```r
summary(train)
glimpse(train)
```
And then the testing data:
```r
summary(test)
glimpse(test)
```
We find:
- vendor_id only takes the values 1 or 2, presumably to differentiate two taxi companies.
- pickup_datetime and (in the training set) dropoff_datetime are combinations of date and time that we will have to re-format into a more useful shape.
- passenger_count takes a median of 1 and a maximum of 9 in both data sets.
- The pickup/dropoff_longitude/latitude features describe the geographical coordinates where the meter was activated/deactivated.
- store_and_fwd_flag is a flag that indicates whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”). Maybe there could be a correlation with certain geographical areas with bad reception?
- trip_duration, our target feature in the training data, is measured in seconds.
Missing values
Knowing about missing values is important because they indicate how much we don’t know about our data. Making inferences based on just a few cases is often unwise. In addition, many modelling procedures break down when missing values are involved and the corresponding rows will either have to be removed completely or the values need to be estimated somehow.
Here, we are in the fortunate position that our data is complete and there are no missing values:
```r
sum(is.na(train))
sum(is.na(test))
```
Combining train and test
In preparation for our eventual modelling analysis we combine the train and test data sets into a single one. I find it generally best not to examine the test data too closely, since this bears the risk of overfitting your analysis to this data. However, a few simple consistency checks between the two data sets can be of advantage.
```r
combine <- bind_rows(train %>% mutate(dset = "train"),
                     test %>% mutate(dset = "test",
                                     dropoff_datetime = NA,
                                     trip_duration = NA))
combine <- combine %>% mutate(dset = factor(dset))
```
Reformating features
For our following analysis, we will turn the date and time from characters into date-time objects. We also recode vendor_id as a factor. This makes it easier to visualise relationships that involve these features.
```r
train <- train %>%
  mutate(pickup_datetime = ymd_hms(pickup_datetime),
         dropoff_datetime = ymd_hms(dropoff_datetime),
         vendor_id = factor(vendor_id))
```
Consistency check
It is worth checking whether the trip_durations are consistent with the intervals between the pickup_datetime and dropoff_datetime. Presumably the former were directly computed from the latter, but you never know. Below, the check variable shows “TRUE” if the two intervals are not consistent:
```r
# sketch (reconstructed): check is TRUE if the two intervals are inconsistent
train %>%
  mutate(check = abs(int_length(interval(dropoff_datetime, pickup_datetime)) + trip_duration) > 0) %>%
  select(check, pickup_datetime, dropoff_datetime, trip_duration) %>%
  group_by(check) %>%
  count()
```
And we find that everything fits perfectly.
Individual feature visualisations
Visualisations of feature distributions and their relations are key to understanding a data set, and they often open up new lines of inquiry. I always recommend examining the data from as many different perspectives as possible to notice even subtle trends and correlations.
In this section we will begin by having a look at the distributions of the individual data features.
We start with a map of NYC and overlay a manageable number of pickup coordinates to get a general overview of the locations and distances in question. For this visualisation we use the leaflet package, which includes a variety of cool tools for interactive maps. In this map you can zoom and pan through the pickup locations:
```r
# sketch (reconstructed): only set.seed survives from the original chunk;
# the sample size is an assumption
set.seed(1234)
foo <- sample_n(train, 8e3)

leaflet(data = foo) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(~pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "blue", fillOpacity = 0.3)
```
It turns out that almost all of our trips were in fact taking place in Manhattan only. Another notable hot-spot is JFK airport towards the south-east of the city.
The map gives us an idea of what some of our distributions could look like. Let’s start with plotting the target feature trip_duration:
```r
# sketch (reconstructed): note the logarithmic x-axis and square-root y-axis
train %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "red", bins = 150) +
  scale_x_log10() +
  scale_y_sqrt()
```
Note the logarithmic x-axis and square-root y-axis.
We find:
- The majority of rides follow a rather smooth distribution that looks almost log-normal with a peak just short of 1000 seconds, i.e. about 17 minutes.
- There are several suspiciously short rides with less than 10 seconds duration.
Additionally, there is a strange delta-shaped peak of trip_duration just before the 1e5 seconds mark and even a few way above it:
```r
# sketch (reconstructed): zoom in on the suspiciously long durations
train %>%
  filter(trip_duration > 5e4) %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "red", bins = 50) +
  scale_x_log10()
```
Those records would correspond to 24-hour trips and beyond, with a maximum of almost 12 days. I know that rush hour can be bad, but those values are a little unbelievable.
Over the year, the distributions of pickup_datetime and dropoff_datetime look like this:
```r
# sketch (reconstructed): pickup and dropoff dates over the year
p1 <- train %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120) +
  labs(x = "Pickup dates")
p2 <- train %>%
  ggplot(aes(dropoff_datetime)) +
  geom_histogram(fill = "blue", bins = 120) +
  labs(x = "Dropoff dates")
multiplot(p1, p2, layout = matrix(1:2, 2, 1))
```
Fairly homogeneous, covering half a year between January and July 2016. There is an interesting drop around late January / early February:
```r
# sketch (reconstructed): trips per day around the dip
train %>%
  mutate(date = date(pickup_datetime)) %>%
  filter(date > ymd("2016-01-20"), date < ymd("2016-02-10")) %>%
  group_by(date) %>%
  count() %>%
  ggplot(aes(date, n)) +
  geom_line(size = 1.5, color = "red") +
  labs(x = "Date", y = "Trips per day")
```
That’s winter in NYC, so maybe snow storms or other heavy weather? Events like this should be taken into account, maybe through some handy external data set?
In the plot above we can already see some daily and weekly modulations in the number of trips. Let’s investigate these variations together with the distributions of passenger_count and vendor_id by creating a multi-plot panel with different components:
```r
# sketch (reconstructed): passenger counts plus trips per weekday and hour;
# note the logarithmic y-axes in the lower panels
p1 <- train %>%
  group_by(passenger_count) %>%
  count() %>%
  ggplot(aes(passenger_count, n, fill = passenger_count)) +
  geom_col() +
  scale_y_sqrt() +
  theme(legend.position = "none")
p2 <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  count() %>%
  ggplot(aes(wday, n, colour = vendor_id)) +
  geom_point(size = 4) +
  scale_y_log10() +
  labs(x = "Day of the week", y = "Total number of pickups")
p3 <- train %>%
  mutate(hpick = hour(pickup_datetime)) %>%
  group_by(hpick, vendor_id) %>%
  count() %>%
  ggplot(aes(hpick, n, color = vendor_id)) +
  geom_point(size = 4) +
  scale_y_log10() +
  labs(x = "Hour of the day", y = "Total number of pickups")
multiplot(p1, p2, p3, layout = matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
```
We find:
- There are a few trips with zero, or seven to nine, passengers but they are a rare exception:
```r
train %>%
  group_by(passenger_count) %>%
  count()
```
- The vast majority of rides had only a single passenger, with two passengers being the (distant) second most popular option.
- Towards larger passenger numbers we are seeing a smooth decline through 3 to 4, until the larger crowds (and larger cars) give us another peak at 5 to 6 passengers.
- Vendor 2 has significantly more trips in this data set than vendor 1 (note the logarithmic y-axis). This is true for every day of the week.
- We find an interesting pattern with Monday being the quietest day and Friday very busy. This is the same for the two different vendors, with vendor_id == 2 showing significantly higher trip numbers.
- As one would intuitively expect, there is a strong dip during the early morning hours. There we also see little difference between the two vendors. We find another dip around 4pm, after which the numbers increase towards the evening.
The store_and_fwd_flag values, indicating whether the trip data was sent immediately to the vendor (“N”) or held in the memory of the taxi because there was no connection to the server (“Y”), show that there was almost no storing taking place (note again the logarithmic y-axis):
```r
train %>%
  group_by(vendor_id, store_and_fwd_flag) %>%
  count()
```
These numbers are equivalent to about half a percent of trips not being transmitted immediately.
The trip volume per hour of the day depends somewhat on the month and strongly on the day of the week:
```r
# sketch (reconstructed): trip volume per hour, by month and by day of the week
p1 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         Month = month(pickup_datetime, label = TRUE)) %>%
  group_by(hpick, Month) %>%
  count() %>%
  ggplot(aes(hpick, n, color = Month)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "Trips")
p2 <- train %>%
  mutate(hpick = hour(pickup_datetime),
         wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(hpick, wday) %>%
  count() %>%
  ggplot(aes(hpick, n, color = wday)) +
  geom_line(size = 1.5) +
  labs(x = "Hour of the day", y = "Trips")
multiplot(p1, p2, layout = matrix(1:2, 2, 1))
```
We find:
- January and June have fewer trips, whereas March and April are busier months. This tendency is observed for both vendor_ids.
- The weekend (Sat and Sun, plus Fri to an extent) has higher trip numbers during the early morning hours but lower ones in the morning between 5 and 10, which can most likely be attributed to the contrast between NYC business days and weekend night life. In addition, trip numbers drop on a Sunday evening/night.
Finally, we will look at a simple overview visualisation of the pickup/dropoff latitudes and longitudes:
```r
# sketch (reconstructed): pickup coordinates; the dropoff panels follow the same pattern
p1 <- train %>%
  filter(pickup_longitude > -74.05 & pickup_longitude < -73.7) %>%
  ggplot(aes(pickup_longitude)) +
  geom_histogram(fill = "red", bins = 40)
p2 <- train %>%
  filter(pickup_latitude > 40.6 & pickup_latitude < 40.9) %>%
  ggplot(aes(pickup_latitude)) +
  geom_histogram(fill = "red", bins = 40)
multiplot(p1, p2, layout = matrix(1:2, 1, 2))
```
Here we had to constrain the range of latitude and longitude values, because there are a few cases which are way outside the NYC boundaries. The resulting distributions are consistent with the focus on Manhattan that we had already seen on the map. These are the most extreme values from the pickup_latitude feature:
```r
# sketch (reconstructed): minimum and maximum pickup_latitude values
train %>%
  arrange(pickup_latitude) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)

train %>%
  arrange(desc(pickup_latitude)) %>%
  select(pickup_latitude, pickup_longitude) %>%
  head(5)
```
We need to keep the existence of these (rather astonishing) values in mind so that they don’t bias our analysis.
Feature relations
While the previous section looked primarily at the distributions of the individual features, here we will examine in more detail how those features are related to each other and to our target trip_duration.
Pickup date/time vs trip_duration
How does the variation in trip numbers throughout the day and the week affect the average trip duration? Do quieter days and hours lead to faster trips? Here we include the vendor_id as an additional feature. Furthermore, for the hours of the day we add a smoothing layer to indicate the extent of the variation and its uncertainties:
```r
# sketch (reconstructed): median durations per weekday and hour, with a smoother
p1 <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  summarise(median_duration = median(trip_duration) / 60) %>%
  ggplot(aes(wday, median_duration, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week", y = "Median trip duration [min]")
p2 <- train %>%
  mutate(hpick = hour(pickup_datetime)) %>%
  group_by(hpick, vendor_id) %>%
  summarise(median_duration = median(trip_duration) / 60) %>%
  ggplot(aes(hpick, median_duration, color = vendor_id)) +
  geom_smooth(method = "loess", span = 1/2) +
  geom_point(size = 4) +
  labs(x = "Hour of the day", y = "Median trip duration [min]")
multiplot(p1, p2, layout = matrix(1:2, 2, 1))
```
We find:
- There is indeed a pattern similar to the one for the trip volume per day of the week. Vendor 2, the one with the more frequent trips, also has consistently higher trip durations than vendor 1. It will be worth adding the vendor_id feature to a model to test its predictive importance.
- Over the course of a typical day we find a peak in the early afternoon and dips around 5-6am and 8pm. The weekday and hour of a trip appear to be important features for predicting its duration and should be included in a successful model.
Passenger count and Vendor vs trip_duration
The next question we are asking is whether different numbers of passengers and/or the different vendors are correlated with the duration of the trip. We choose to examine this issue using a series of boxplots for the passenger_counts together with a facet wrap which contrasts the two vendor_ids:
```r
# sketch (reconstructed): note the logarithmic y-axis
train %>%
  ggplot(aes(factor(passenger_count), trip_duration, color = factor(passenger_count))) +
  geom_boxplot() +
  scale_y_log10() +
  facet_wrap(~ vendor_id) +
  theme(legend.position = "none") +
  labs(x = "Number of passengers", y = "Trip duration [s]")
```
We find:
- Both vendors have short trips without any passengers.
- Vendor 1 has all of the trips beyond 24 hours, whereas vendor 2 has all of the (five) trips with more than six passengers and many more trips that approach the 24-hour limit.
- Between 1 and 6 passengers the median trip durations are remarkably similar, in particular for vendor 2. There might be differences for vendor 1, but they are small (note the logarithmic y-axis):
```r
# sketch (reconstructed): median durations for 1-6 passengers per vendor
train %>%
  filter(passenger_count %in% 1:6) %>%
  group_by(vendor_id, passenger_count) %>%
  summarise(median_duration = median(trip_duration)) %>%
  ggplot(aes(passenger_count, median_duration, color = vendor_id)) +
  geom_point(size = 4) +
  scale_y_log10()
```
Comparing the densities of the trip_duration distribution for the two vendors we find that the medians are very similar, whereas the means are likely skewed by vendor 2 containing most of the long-duration outliers:
```r
# sketch (reconstructed): trip_duration densities per vendor
train %>%
  ggplot(aes(trip_duration, fill = vendor_id)) +
  geom_density(alpha = 0.5) +
  scale_x_log10()
```
Store and Forward vs trip_duration
The temporary storing of the trip data only occurred for Vendor 1:
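The original figures are not preserved here, but a minimal sketch of the comparison could look like this (the boxplot choice is an assumption):
```r
train %>%
  filter(vendor_id == "1") %>%
  ggplot(aes(store_and_fwd_flag, trip_duration, color = store_and_fwd_flag)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none")
```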
We find that there are no overwhelming differences between the stored and non-stored trips. The stored ones might be slightly longer, though, and don’t include any of the suspiciously long trips.
Feature engineering
In this section we build new features from the existing ones, trying to find better predictors for our target variable. I prefer to define all these new features in a single code block below and then study them in the following subsections. This does not correspond to a linear analysis but it makes the notebook more readable and ensures that we don’t miss any stray feature definitions.
The new temporal features (date, month, wday, hour) are derived from the pickup_datetime. We got the JFK and La Guardia airport coordinates from Wikipedia. The blizzard feature is based on the external weather data.
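The original chunk is not preserved, but a sketch of these definitions could look like the following. The airport coordinates are approximate values from Wikipedia, and the exact blizzard window is an assumption (the week following the Jan 23rd snow storm; see the weather section below):
```r
jfk_coord <- tibble(lon = -73.778889, lat = 40.639722)
la_guardia_coord <- tibble(lon = -73.872611, lat = 40.77725)

pick_coord <- train %>% select(pickup_longitude, pickup_latitude)
drop_coord <- train %>% select(dropoff_longitude, dropoff_latitude)

train <- train %>%
  mutate(dist = distCosine(as.matrix(pick_coord), as.matrix(drop_coord)),
         bearing = bearing(as.matrix(pick_coord), as.matrix(drop_coord)),
         jfk_dist_pick = distCosine(as.matrix(pick_coord), as.matrix(jfk_coord)),
         jfk_dist_drop = distCosine(as.matrix(drop_coord), as.matrix(jfk_coord)),
         lg_dist_pick = distCosine(as.matrix(pick_coord), as.matrix(la_guardia_coord)),
         lg_dist_drop = distCosine(as.matrix(drop_coord), as.matrix(la_guardia_coord)),
         speed = dist / trip_duration * 3.6,   # km/h from m and s
         date = date(pickup_datetime),
         month = month(pickup_datetime, label = TRUE),
         wday = wday(pickup_datetime, label = TRUE),
         hour = hour(pickup_datetime),
         work = (hour %in% 8:18) & !(wday %in% c("Sat", "Sun")),
         jfk_trip = (jfk_dist_pick < 2e3) | (jfk_dist_drop < 2e3),
         lg_trip = (lg_dist_pick < 2e3) | (lg_dist_drop < 2e3),
         blizzard = date >= ymd("2016-01-23") & date <= ymd("2016-01-29"))  # assumed window
```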
Direct distance of the trip
From the coordinates of the pickup and dropoff points we can calculate the direct distance (as the crow flies) between the two points, and compare it to our trip_durations. Since taxis aren’t crows (in most practical scenarios), these values correspond to the minimum possible travel distance.
To compute these distances we are using the distCosine function of the geosphere package for spherical trigonometry. This method gives us the shortest distance between two points on a spherical earth. For the purpose of this localised analysis we choose to ignore ellipsoidal distortion of the earth’s shape. Here are the raw values of distance vs duration (based on a down-sized sample to speed up the kernel).
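Since the original chunk is hidden here, a minimal sketch (the sample size is an assumption):
```r
set.seed(2017)
train %>%
  sample_n(5e4) %>%
  ggplot(aes(dist, trip_duration)) +
  geom_point(alpha = 0.1) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Direct distance [m]", y = "Trip duration [s]")
```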
We find:
- The distance generally increases with increasing trip_duration.
- Here, the 24-hour trips look even more suspicious and are even more likely to be artefacts in the data.
- In addition, there are a number of trips with very short distances, down to 1 metre, but with a large range of apparent trip_durations.
Let’s filter the data a little bit to remove the extreme (and the extremely suspicious) data points, and bin the data into a 2-d histogram. This plot shows that in log-log space the trip_duration increases more slowly than linearly for larger distance values.
Travel speed
Distance over time is of course velocity, and by computing the average apparent velocity of our taxis we will have another diagnostic to remove bogus values. Of course, we won’t be able to use speed as a predictor for our model, since it requires knowing the travel time, but it can still be helpful in cleaning up our training data and finding other features with predictive power. This is the speed distribution:
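A minimal sketch of the distribution plot (the cut values for "most extreme" are assumptions):
```r
train %>%
  filter(speed > 2, speed < 1e2) %>%
  ggplot(aes(speed)) +
  geom_histogram(fill = "red", bins = 50) +
  labs(x = "Average speed [km/h] (direct distance)")
```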
Well, after removing the most extreme values this looks way better than I would have expected. An average speed of around 15 km/h sounds reasonable for NYC. Everything above 50 km/h certainly requires magical cars (or highway travel). Also keep in mind that this refers to the direct distance, so the real velocity would always have been higher.
In a similar way to the average duration per day and hour, we can also investigate the average speed for these time bins.
We find:
- Our taxis appear to be travelling faster on the weekend and on a Monday than during the rest of the week.
- The early morning hours allow for a speedier trip, with everything from 8am to 6pm being similarly slow.
- There are almost no differences between the two vendors.
- The heatmap in the lower panel visualises how these trends combine to create a “low-speed-zone” in the middle of the day and week. Based on this, we create a new feature work, which we define as 8am-6pm on Mon-Fri.
Bearing direction
If the direct distance is the magnitude of the trip vector then the bearing is its (initial) direction. Easily estimated through the geosphere package, it tells us whether our trip started out for instance in the direction of North-West or South-East. Here we visualise the bearing distribution and its relation to trip_duration, direct distance, and speed:
This is one of the few cases where polar coordinates might be actually helpful. (Ignore the small discontinuities caused by the binning.)
We find:
- The bearing direction has two prominent peaks around 30 and -150 degrees. Intuitively, those might relate to the orientation of Manhattan. Three or four smaller, less sharp peaks are visible in between those larger peaks.
- The 2-d histograms of distance and trip_duration don’t reveal much structure. There are indications for shorter durations and distances at the two peaks, but those might simply be caused by the larger number of observations in these peaks. Faint clusters towards higher trip_durations might be visible near -60 and 125 degrees bearing.
- The bright “rings” we see in the distance and trip_duration reflect the familiar distribution peaks on the logarithmic scale.
- Speed, on the other hand, shows an interestingly clustered structure along the bearing directions. Zones of higher speed can be identified at the aforementioned bearing peaks as well as roughly perpendicular to them (most prominently around 125 degrees). It is possible that we are seeing the Manhattan street grid in this data. Note that here the radial scale is not logarithmic.
Airport distance
In our maps (above) and trip paths (below) we noticed that a number of trips began or ended at either of the two NYC airports: JFK and La Guardia. Since airports are usually not in the city centre it is reasonable to assume that the pickup/dropoff distance from the airport could be a useful predictor for longer trip_durations. Above, we defined the coordinates of the two airports and computed the corresponding distances.
Based on these numbers, we can define a JFK/La Guardia trip as having a pickup or dropoff distance of less than 2 km from the corresponding airport.
What are the trip_durations of these journeys?
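A sketch of the comparison, using the airport features defined above:
```r
p1 <- train %>%
  ggplot(aes(factor(jfk_trip), trip_duration, color = jfk_trip)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(x = "JFK trip")
p2 <- train %>%
  ggplot(aes(factor(lg_trip), trip_duration, color = lg_trip)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(x = "La Guardia trip")
multiplot(p1, p2, layout = matrix(1:2, 1, 2))
```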
We find that our hypothesis was correct and that trips to the airport, in particular the more distant JFK, have significantly longer average trip_durations. These two features should definitely be part of our model.
Data cleaning
Before we turn to the modelling it is time to clean up our training data. We have waited to do this until now to have a more complete overview of the problematic observations. The aim here is to remove trips that have improbable features, such as extreme trip durations or very low average speed.
While there might also be a number of bogus trip durations in the test data we shouldn’t be able to predict them in any case (unless there were some real correlations). By removing these training data values we will make our model more robust and more likely to generalise to unseen data, which is always our primary goal in machine learning.
Extreme trip durations
Let’s visualise the distances of the trips that took a day or longer. Unless someone took a taxi from NYC to LA it is unlikely that those values are accurate. Here we make use of the maps package to draw an outline of Manhattan, where most of the trips begin or end. We then overlay the pickup coordinates in red, and the dropoff coordinates in blue.
To further add the trip connections (direct distance) we use another handy tool from the geosphere package: gcIntermediate allows us to interpolate the path between two sets of coordinates. I’ve seen this tool first in action in this truly outstanding kernel by Jonathan Bouchet. (If you haven’t seen his kernel, then I strongly recommend you check it out; right after reading this one, of course ;-) ).
Longer than a day
We start with the few trips that pretend to have taken several days to complete:
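The original table is hidden here; a sketch of how to list these trips:
```r
day_plus_trips <- train %>%
  filter(trip_duration > 24*3600)

day_plus_trips %>%
  arrange(desc(trip_duration)) %>%
  select(pickup_datetime, dropoff_datetime, trip_duration, dist, speed) %>%
  head(10)
```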
We find nothing out of the ordinary here. While a trip to JFK can seem like an eternity if your flight is boarding soon, it is unlikely to take this long in real time. The average taxi speeds don’t look very likely either.
Decision: These values should be removed from the training data set for continued exploration and modelling.
Close to 24 hours
Call me crazy, but I don’t think it is inconceivable that someone takes a taxi for a trip that lasts almost a day (with breaks, of course). On very rare occasions this might happen; provided, of course, that the distance travelled was sufficiently long.
Here we define day-long trips as taking between 22 and 24 hours, which covers a small peak in our raw trip_duration distribution. Those are the top 5 direct distances (in m) among the day-long trips:
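A sketch of the corresponding filter and table:
```r
day_trips <- train %>%
  filter(trip_duration > 22*3600, trip_duration < 24*3600)

day_trips %>%
  arrange(desc(dist)) %>%
  select(dist, trip_duration, speed) %>%
  head(5)
```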
The top one is about 60 km (about 37 miles), which is not particularly far. The average speed wouldn’t suggest a generous tip, either. What do these trips look like on the map?
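A sketch of the map overlay (the maps region name and the sample size are assumptions):
```r
set.seed(2017)
day_map <- day_trips %>% sample_n(200)

map("county", "new york,new york", xlim = c(-74.05, -73.75), ylim = c(40.6, 40.9))
points(day_map$pickup_longitude, day_map$pickup_latitude, col = "red", pch = 16, cex = 0.5)
points(day_map$dropoff_longitude, day_map$dropoff_latitude, col = "blue", pch = 16, cex = 0.5)
for (i in seq_len(nrow(day_map))) {
  path <- gcIntermediate(c(day_map$pickup_longitude[i], day_map$pickup_latitude[i]),
                         c(day_map$dropoff_longitude[i], day_map$dropoff_latitude[i]),
                         n = 25, addStartEnd = TRUE)
  lines(path, col = "grey60")
}
```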
Here we are plotting only 200 of the about 1800 connections to keep the map reasonably readable and the script fast. Pickup points are red and dropoff points are blue.
We find:
- A few longer distances stand out, but they are exceptions. The two major groups of trips are those within Manhattan and those between Manhattan and the airports.
- There is little to suggest that these extreme trip_durations were real.
- There is another insight here which is rather intuitive: trips to or from any of the airports (most prominently JFK) are unlikely to be very short. Thus, a close distance of either pickup or dropoff to the airport could be a valuable predictor for a longer trip_duration. This is an insight that we carried over into the feature engineering above.
Decision: We will remove trip_durations longer than 22 hours from the exploration and possibly from the modelling.
Shorter than a few minutes
On the other side of the trip_duration distribution we have those rides that appear to only have lasted for a couple of minutes. While such short trips are entirely possible, let’s check their durations and speeds to make sure that they are realistic.
Zero-distance trips
In doing so, we notice that there are a relatively large number of zero-distance trips:
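A minimal sketch of this count:
```r
zero_dist <- train %>%
  filter(near(dist, 0))
nrow(zero_dist)
```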
What are their nominal top durations?
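Again as a sketch:
```r
zero_dist %>%
  arrange(desc(trip_duration)) %>%
  select(trip_duration, vendor_id, passenger_count) %>%
  head(5)
```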
There really are a few taxis where the data wants to tell us that they have not moved at all for about a day. While carrying a passenger. We choose not to believe the data in this case.
Once we remove the extreme cases, we can look at the distribution of the remaining zero-distance trip durations.
We find:
- trip_durations of about a minute might still be somehow possible, assuming that someone got into a taxi but then changed their mind before the taxi could move. Whether that should count as a “trip” is a different question. But trip durations of about 15 minutes (900 s) without any distance covered seem hardly possible. Unless they involve traffic jams, but who gets into a taxi that’s stuck in a traffic jam?
- It is also noteworthy that most trips in the less-than-a-minute-group were from vendor 1, whereas the 10-minute-group predominantly consists of vendor 2 taxis.
Decision: We will remove those zero-distance trips that took more than a minute for our continued analysis. Removing them from the modelling might be detrimental if there are similar trips in the test sample.
Short trips with above zero distance
After removing the zero-distance trips, those are the short rides with the highest average speed:
Clearly, the top speed values are impossible. We include a map plot of the general distribution of distances within the sample:
We find (from the hidden plot):
- Most distances are in fact short, which means that combined with setting an average speed limit we should be able to remove those values that are way beyond being realistic. This should also get rid of many trips that appear to have durations of seconds only.
Decision: We impose a lower trip_duration limit of 10 seconds and a (very conservative) speed limit of 100 km/h (62 mph). (Remember that this refers to the direct distance.)
Intermission - The best spurious trips
Every data set has a few entries that are just flat out ridiculous. Here are the best ones from this one, with pickup or dropoff locations more than 300 km away from NYC (i.e. from JFK airport):
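A sketch of the corresponding filter, using the JFK distances defined above:
```r
long_dist <- train %>%
  filter((jfk_dist_pick > 3e5) | (jfk_dist_drop > 3e5))

long_dist %>%
  arrange(desc(dist)) %>%
  select(id, jfk_dist_pick, jfk_dist_drop, dist, trip_duration) %>%
  head(10)
```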
Many zero-distance trips with more than a minute duration, which we would remove anyway. But just out of curiosity, where did they happen? (We will again use the amazing leaflet package, this time with individual markers that give us id and direct distance information for mouse-over and click actions.)
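A minimal sketch of such a marker map (tile provider and popup text are assumptions):
```r
leaflet(long_dist) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addMarkers(~pickup_longitude, ~pickup_latitude,
             label = ~id,
             popup = ~paste("Direct distance:", round(dist/1e3, 1), "km"))
```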
See what I mean?
Not only are there two(!) lone NYC taxis near San Francisco, but 9 others are actually ocean-going taxis, who knew ;-) . Most of the others will be removed by our previous cleaning limits.
These long-distance locations represent outliers that should be removed to improve the robustness of predictive models.
Final cleaning
Here we apply the cleaning filters that are discussed above. This code block is likely to expand as the analysis progresses.
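A sketch that implements the decisions from the previous sub-sections:
```r
train <- train %>%
  filter(trip_duration < 22*3600,                          # no day-long trips
         dist > 0 | (near(dist, 0) & trip_duration < 60),  # no long zero-distance trips
         jfk_dist_pick < 3e5, jfk_dist_drop < 3e5,         # nothing far, far away from NYC
         trip_duration > 10,                               # lower duration limit
         speed < 100)                                      # conservative speed limit
```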
External data
In this playground competition we have been encouraged to supplement our analysis with additional data sources. These are published as Kaggle data sets and are being collected in this discussion thread.
In fact, there are monetary prizes for publishing the top 4 data sets, which together with the Kernel awards gives this competition a fresh and interesting spin.
If you want to know more about how to include multiple data sources in your Kaggle kernel then check out this article.
Weather reports
We start by incorporating a list of NYC weather data provided by Mathijs Waegemakers right here. In the next paragraph I copy the data description verbatim:
Weather data collected from the National Weather Service. It contains the first six months of 2016, for a weather station in central park. It contains for each day the minimum temperature, maximum temperature, average temperature, precipitation, new snow fall, and current snow depth. The temperature is measured in Fahrenheit and the depth is measured in inches. T means that there is a trace of precipitation.
Of particular interest here will be the rain and snow fall statistics, which I’m curious to compare to the dip in trip numbers in late January. Check also this kernel which presents an independent analysis of the same data set.
Data import, overview, formatting, joining
The data can be found in the data set sub-directory of the ../input directory and it looks like this:
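A sketch of the import (the exact file name within the data set directory is an assumption):
```r
weather <- as.tibble(fread("../input/weather-data-in-new-york-city-2016/weather_data_nyc_centralpark_2016.csv"))
glimpse(weather)
```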
We turn the date into a lubridate object and convert the traces (“T”) of rain and snow into small numeric amounts. We also save the maximum and minimum temperature in a shorter form:
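A sketch of this formatting step (the raw column names are assumptions based on the data description):
```r
weather <- weather %>%
  mutate(date = dmy(date),
         rain = as.numeric(ifelse(precipitation == "T", "0.01", precipitation)),
         s_fall = as.numeric(ifelse(`snow fall` == "T", "0.01", `snow fall`)),
         s_depth = as.numeric(ifelse(`snow depth` == "T", "0.01", `snow depth`)),
         all_precip = s_fall + rain,
         has_snow = (s_fall > 0) | (s_depth > 0),
         has_rain = rain > 0,
         max_temp = `maximum temperature`,
         min_temp = `minimum temperature`)
```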
Then we join this information to our training data using the common date column:
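A sketch of the join:
```r
train <- left_join(train, weather, by = "date")
```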
Visualisation and impact on trip_duration
Let’s compare the snow fall statistics to our trip numbers per day:
I knew it! The dip in trip volume corresponds to the largest (and first?) snow fall of the winter in NYC (Jan 23rd). In fact, NYC was hit by a blizzard and experienced record-breaking snowfall. The impact on the traffic patterns was enormous, and the median speed slowed down notably. (Note the square-root y-axis in the two middle plots.)
After this large spike, we clearly see that the following periods of snow fall, albeit significantly less heavy, have nowhere near as large an effect on the number of taxi trips or their velocities. A possible exception might have been the last snow in mid March, which, due to the fact that it hadn’t snowed in a while, might have caught people by surprise.
Now, do lower numbers of trips in snowy weather also lead to longer journeys? To answer this question we look at a scatter plot between the average trip duration and the total precipitation (rain + snow) for all days in our sample:
We find that except for a snow day with long-duration trips it seems more like snow would lead to shorter trips. However, this could simply mean that passengers were more likely to travel shorter distances, so how does the average speed compare?
We find that there is no significant difference here. Moreover, Jan 23rd (blizzard) is not at the top of the list of snow days with the longest median trip_duration (it’s in the top 10, though).
We can spot a different interesting pattern in this table, though, when looking at the top 5 slowest snow days. Interestingly, the overall lowest median speeds were observed two days after the blizzard: from Jan 25th until 27th (although note that the day immediately after the blizzard, Jan 24th, was a Sunday).
Conclusion: From an exploratory point of view this doesn’t look too promising. Still, I’m inclined to include the rain and snow fall occurrence in a tentative model to see whether it shows any interaction terms that we can’t see at the moment. The “blizzard effect” on the (working) week after the heavy snow fall is probably worth taking into account separately. Thus, we create a blizzard feature in our engineering code block.
Fastest Routes
Another interesting external data set is an estimate of the fastest routes for each trip provided by oscarleo using the Open Source Routing Machine, OSRM. The data can be found here and includes the pickup/dropoff streets and total distance/duration between these two points together with a sequence of travel steps such as turns or entering a highway.
Note that according to discussion items those really are the “fastest”, not the shortest, routes between the two points. This will most likely not account for traffic volume, and the fastest route on one day might not be the fastest on another day of the week. Still, here we have actual driving distance and duration measures that should be valuable for our prediction goal.
Data import and overview
The data comes in three separate files, split into the train (2 files) vs test trip IDs. Here, we will load and examine the data corresponding to the training observations. Note that this data set is more than twice as large as our original training data.
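A sketch of the import (the file names within the OSRM data set are assumptions):
```r
fast_1 <- as.tibble(fread("../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_1.csv"))
fast_2 <- as.tibble(fread("../input/new-york-city-taxi-with-osrm/fastest_routes_train_part_2.csv"))
fastest_route <- bind_rows(fast_1, fast_2)
glimpse(fastest_route)
```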
The glimpse view shows us that indeed for each step the streets are named (streets_for_each_step) alongside a detailed account of the distance_per_step, travel_time_per_step, and the step_maneuvers and step_direction describing the journey. The step_maneuvers are named in a pretty self-explanatory way and are also described in the data set overview. Depart and arrive are individual steps. The direction of a step_maneuver “turn” is listed in the step_directions. The total distance is measured in meters and the duration in seconds, like in the original data.
New data visualisations
Before joining the “fastest routes” data to our training set, let’s first visualise the distributions of the numerical parameters:
We find:
- The total_travel_time and total_distance peak at values of around 300 s (5 min) and 2800 m, respectively. These are the median values. Really high values are rare in both features, but there are indications of secondary peaks towards longer times and distances. Very short distances (a few metres) and travel times (a few seconds) are present in the data.
- The single most frequent number_of_steps is 5. Since “depart” and “arrive” are always present, there are no trips with fewer than 2 steps. There are almost 60k trips that only contain these two steps, which would mean that they travelled in essentially a straight line, as far as I can see.
- Larger numbers of steps are present, but only about 20% are above 10 steps and less than 2% are above 20 steps.
- total_distance and total_travel_time are strongly correlated in a pretty linear way; except for possibly the very short distances. There is a certain amount of scatter around this linear dependency and it looks rather homogeneous over the entire range of distances and times.
We also find that the median travel time correlates well with the number_of_steps, but that there is also much overlap between neighbouring step counts, along with numerous outliers predominantly towards longer durations.
The distributions of total_distance per number_of_steps look very similar to the plot above.
Joining and derived features
In a first look at the joined data, we will mainly focus on the total_distance and total_travel_time, when combined with our original training data set:
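A sketch of the join, together with the derived features described in the bullets below (str_count is from stringr):
```r
foo <- fastest_route %>%
  select(id, total_distance, total_travel_time, number_of_steps,
         step_direction, step_maneuvers)

train <- left_join(train, foo, by = "id") %>%
  mutate(fastest_speed = total_distance / total_travel_time * 3.6,
         fast_speed_trip = total_distance / trip_duration * 3.6,
         left_turns = str_count(step_direction, "left"),
         right_turns = str_count(step_direction, "right"),
         turns = left_turns + right_turns)
```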
In addition, we also create the following new features:
- fastest_speed is the average speed of the fastest route based on the OSRM values, and fast_speed_trip is the speed assuming the fastest route and the actual trip duration. Both features are in units of km/h.
- left_turns, right_turns, and turns are the total number of those maneuvers per route. The step_directions “left” and “right” might not always be associated with the step_maneuver “turn”, but in most cases they are, so we are using them as a first estimate of the number of left and right turns. Especially left turns might slow down travel.
Beyond that, for instance the detailed list of streets used and their respective travel_time_per_step might allow us to take into account known areas of slow traffic.
Visualisation and impact on trip_duration
Here the fastest features are in blue and the original trip_duration and direct distance in red. In the binned scatter plots the red line has slope 1.
We find:
- The actual trip_durations are on average significantly longer than those for the fastest route. The shape of the distributions is also subtly different, with the fastest ones having more of a right skew than a left skew in log space. The widths of the distributions, however, are broadly similar.
- The binned scatter plot of trip_duration vs total_travel_time (lower left corner) reveals that there are numerous outliers towards long trip_durations. Interestingly, almost all of those appear below 2000 s (about 0.55 h) of total_travel_time. We will study those trips in more detail below.
- The direct distances (as the crow flies) are longer than the fastest distances, but not by that much. By and large, the direct distance is a pretty good proxy for the actual, fastest distance according to the binned scatter plot. There are (rare) outliers towards longer direct distances, which indicate potentially problematic data entries.
What about the speed? Comparing our original speed estimate to the fastest_speed from the OSRM data should give us an idea how realistic the values of the former feature are.
For this figure, the fastest values are in blue in the kernel-density plot, and the red line in the binned scatter plot has a slope of one.
Here we find several effects:
- By not accounting for traffic, our fastest speeds are always above 20 km/h (12 mph), a value that is not frequently reached in our actual data.
- The theoretical speed distribution shows several peaks, most prominently a very sharp one between 25 and 30 km/h (15 - 18 mph) and a broader one around 40 km/h (25 mph). Without knowing very much about the traffic management in NYC I would assume that they correspond to zones of different speed limits and their linear combinations.
- There is no correlation between speed and fastest_speed. This reflects the large number of delays in the scatter plot of trip_duration vs total_travel_time which we will study in a separate step.
At face value, the impact of the number of turns on the total_travel_time is very similar for left_turns and right_turns. A main reason behind this will be that the total_travel_time increases with increasing number_of_steps of any kind and that the number of turns is of course correlated with the number_of_steps. We will learn more from the percentage of turns per trip, which we will investigate in a moment. From these plots we find the following results:
- There is an “NA” entry here, caused by a single observation (“id3008062”) that is missing from the fastest_routes data. This is a very minor issue without any practical impact on our predictions, but it’s good to be aware of it. It turns out that OSRM does not provide a route for this data point.
- Right turns are slightly more frequent than left turns (up to 21 vs 19).
- For the total number of turns there is a first maximum in the median total_travel_time at 13 turns, after which the median declines somewhat.
Now let’s move on to the turn percentages:
We find:
- Right turns are indeed somewhat more frequent than left turns. Both kinds of turns mostly make up less than 50% of all travel maneuvers but can reach 80%. There is a significant number of trips without any turns at all.
- A higher fraction of turns does not correlate with higher total_travel_times. Instead, the longest times are reached at around 1/3 of maneuvers being turns.
- On the other hand, the fastest_speed (i.e. total_distance / total_travel_time) shows indications for lower average values for larger number of left_turns for the percentage range of 50% to 80%. Notably, there is no such obvious impact for the number of right_turns up until 75%. Thus, excessive numbers of left turns really appear to be slowing down our travel speed.
trip_duration vs total_travel_time - a look at curious delays
In this section we will compare the OSRM total_travel_time to the taxi trip_duration in our original data. We will specifically focus on understanding those curious cases that suggest significant delays in our data.
Here we plot a scatter plot of the two timing features, colour-coded by the percentage of left turns in the OSRM route (the solid red line again has a slope of 1).
Note the double-log axes.
We find:
- There is a population of data points where the trip_duration (the actual timing of the taxi ride) is lower than the total_travel_time based on the fastest OSRM route. These differences can be significant for the trips that take less than 1000 s (i.e. about 17 min) on the fastest route but become less severe for rides longer than that.
- There are no “fastest” routes with more than 10 ks total_travel_time. The maximum in the cleaned sample is at about 5000 s or 1.5 h. Almost all of the significant long-delay outliers in trip_duration (indicated by the horizontal dashed lines) lie in the range of 100 - 2000 s total_travel_time and extend beyond 10 ks trip_duration.
- There appears to be a tendency for longer trips (especially in total_travel_time) to contain a higher fraction of left_turns. However, this apparent effect might be due to the shortest rides containing no turns and longer rides (above 1 min) including an average number of turns. In any case, within the long-delay outliers there is no obvious impact of the percentage of left_turns.
The most prominent outliers only make up a very small percentage of the overall rides:
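A sketch of this estimate (the thresholds are assumptions read off the plot):
```r
train %>%
  mutate(long_delay = total_travel_time < 2e3 & trip_duration > 1e4) %>%
  summarise(frac_long_delay = mean(long_delay, na.rm = TRUE))
```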
We will briefly examine those data points using a comparison of relative frequencies in bar plots and density overlays:
Here the facets (true vs false) refer to the observations being part of the long-delay group or not.
We find:
- Almost all long delays happen for the taxis of vendor 2, even though vendor 2 has only a slightly higher percentage of rides overall. This discrepancy is definitely significant enough to explain most of the effect.
- Another interesting fact is that in the long-delay sample the fraction of JFK trips is significantly higher, and the fraction of La Guardia trips is slightly higher, than in the general population. We could hypothesise that for an airport journey the trip_duration might include the waiting time while queuing for passengers. JFK trips make up almost 30% of the long-delay trips (60 out of 207).
- The percentage of delays during work hours is slightly higher but not with practical significance.
- Furthermore, in the long-delay distribution the long distances are clearly favoured over shorter ones. Remember that here we are only comparing trips within the same bracket of total_travel_time (50 - 2000 s).
- In addition, not shown here is the observation that long delays appear to be slightly more frequent on Tuesdays or in January and March, compared to Wednesdays and Thursdays or February. However, considering the number of categories, and that the deviations reach at most 5 percentage points, these effects might be random.
Another curious group of outliers can be found in the lower middle portion of the trip_duration vs total_travel_time plot. Here we encode the colour of the trips by their vendor_id, to further highlight the impact of vendor 2 on the long-duration outliers at the top right and on the few at the bottom right.
The second group of outliers lies at total_travel_time < 10 s and trip_duration = 600 - 4000 s (see the hidden figure, or Fig. 26). Therefore, those are trips over very short distances that should have only taken a few seconds, but in our data they took between 10 minutes and 1.4 hours. In a way, there is a continuum of data points in the lower left to middle part of the plot, but the ones over 10 minutes show another peak in density.
We find:
- Again, vendor 2 has systematically longer trip_durations than vendor 1.
- Delays of this second kind appear to be more frequent on weekends, compared to the overall distribution.
- The number of passengers is higher too for these delays, so maybe we are counting the time it takes them (and their luggage) to get in and out of the taxi?
- The distances for the delays are again longer, but notice the different scale here. Distances of less than 100 m should not take more than 10 min.
- Not shown: jfk_trip and lg_trip have no impact here because the overall distances are so short.
In summary, we find that the vendor_id has distinguishing power for the two outlier groups, with airport trips very likely to contribute to the long-delay outliers. Other factors like wday or passenger_count might contribute additional signal.
Correlations overview
After engineering new features and before starting the modelling, we will visualise the relations between our parameters using a correlation matrix. For this, we need to change all the input features into a numerical format. The visualisation uses the corrplot function from the eponymous package. Corrplot gives us great flexibility in manipulating the style of our plot.
What we see below, are the colour-coded correlation coefficients for each combination of two features. In simplest terms: this shows whether two features are connected so that one changes with a predictable trend if you change the other. The closer this coefficient is to zero the weaker is the correlation. Both 1 and -1 are the ideal cases of perfect correlation and anti-correlation (dark blue and dark red in the plots below).
Here, we are of course interested if and how strongly our features correlate with the trip_duration, the prediction of which is the ultimate goal of this challenge. But we also want to know whether our potential predictors are correlated among each other, so that we can reduce the collinearity in our data set and improve the robustness of our prediction:
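A sketch of such a matrix (the feature selection and the use of Spearman rank correlations are assumptions):
```r
train %>%
  select(trip_duration, dist, total_distance, total_travel_time, number_of_steps,
         turns, speed, vendor_id, passenger_count, hour, jfk_trip, lg_trip, work) %>%
  mutate_all(funs(as.numeric)) %>%
  cor(use = "complete.obs", method = "spearman") %>%
  corrplot(type = "lower", method = "circle", diag = FALSE)
```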
If we remove the features with little to no significant impact on other features that are not directly related then our (effective) correlation matrix looks like this. In this plot (and in the hidden figure) the fading serves the purpose that everything that would be hard to see is not worth seeing:
We find:
- The strongest correlations with the trip_duration are seen for the direct distance as well as for the total_distance and total_travel_time derived from the OSRM data. As we have already seen above, the distances are highly correlated with one another and with the total_travel_time. Also the numbers of turns, presumably mostly via the number_of_steps, have an impact on the trip_duration.
- Another effect on the trip_duration can be seen for our engineered features jfk_trip and lg_trip, indicating journeys to either airport. A similar statement is true for the average speed and airport travel.
- The pickup and dropoff coordinates are correlated, which is a bit puzzling but might be partly explained by the shape of Manhattan stretching from south-west to north-east. Another part of the explanation might be short trips within lower or upper Manhattan only.
- vendor_id is correlated with passenger_count because of vendor 2 having all the (five) trips with more than 6 passengers.
- Among the weather data (mostly seen in the full hidden plot), the minimum and maximum temperatures show a strong correlation with each other and the month of the year, which is intuitively understandable. Also the different ways of measuring precipitation are naturally correlated with each other and, in case of snow, with the month.
We should be able to see these relations in our model.
An excursion into classification
Our challenge is by nature a regression problem: we want to predict a continuous variable, trip_duration, from a set of features. In this kernel we will continue to treat it as a regression problem, but in this section I want to present a brief excursion into the possibility of turning it into a classification problem.
In a classification context our target variable can only take a (small) number of discrete values. Often, we are faced with a binary outcome, for instance “Survived vs Not Survived” in the popular Titanic challenge. Here we want to examine which factors contribute to shorter or longer trip_durations.
Recall the distribution of trip_durations from the beginning of this notebook, now cleaned with the more spurious trips removed. Here we separate it into three parts: short, middle, and long duration:
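The exact cut points are not preserved here; a sketch using the duration quartiles as an assumption:
```r
train <- train %>%
  mutate(fast = case_when(trip_duration < quantile(trip_duration, 0.25) ~ "fast",
                          trip_duration > quantile(trip_duration, 0.75) ~ "slow",
                          TRUE ~ "middle"),
         fast = factor(fast, levels = c("fast", "middle", "slow")))
```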
(Ignore the apparent spikes caused by the binning resolution.)
For the purpose of this exercise, we are not interested in the middle bit. Our aim is to compare the properties of the fast vs slow trips.
We find:
- Our airport features, jfk_trip and lg_trip, behave as expected and are predominantly slow trips. Also the blizzard feature shows a promising increase in slow trips.
- Throughout the week, there is a notable difference between the weekend and the weekdays. It might be worth encoding the wday feature according to this trend. We see that the monthly trend already shows a gradually increasing “slow” proportion.
- During the day, we find again the difference between night and (working) day that we had seen previously in the scatter plot plus heatmap, and which is partly resolved in our engineered work feature.
For a visualisation of the pickup coordinates for the fast vs slow rides we build another map using the great leaflet package. Those maps don’t require external API calls and run fully contained in a Kaggle kernel. We visualise our two groups of trips using a heatmap which we add through tools in the leaflet.extras package.
In this leaflet visualisation, you can toggle and overlay the fast vs slow rides independently, and also choose a dark vs light background map (in addition to a free zoom and pan). These layers are easy to define within a leaflet object; have a look at the code to see how concise the addLayersControl element is. Zoom in to see street-level resolution:
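A sketch of the layered map (the sample sizes, tile providers, and heatmap radius are assumptions):
```r
set.seed(1234)
fast_pick <- train %>% filter(fast == "fast") %>% sample_n(5e3)
slow_pick <- train %>% filter(fast == "slow") %>% sample_n(5e3)

leaflet() %>%
  addProviderTiles("CartoDB.Positron", group = "Light") %>%
  addProviderTiles("CartoDB.DarkMatter", group = "Dark") %>%
  addHeatmap(data = fast_pick, lng = ~pickup_longitude, lat = ~pickup_latitude,
             radius = 4, group = "Fast") %>%
  addHeatmap(data = slow_pick, lng = ~pickup_longitude, lat = ~pickup_latitude,
             radius = 4, group = "Slow") %>%
  addLayersControl(baseGroups = c("Light", "Dark"),
                   overlayGroups = c("Fast", "Slow"))
```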
We find:
- As expected, the airports stand out as origins of slow rides. Interestingly, a similar observation can be made for the areas surrounding the airports.
- There are a number of fast trip pickups just across the Hudson river and near Newark airport in Jersey.
- Interestingly, except for the airport locations the fast trip pickups appear to be more spatially extended than the slow ones.
For a different kind of visualisation of our classification problem I would like to introduce the alluvial plot. Made available through the alluvial package, these plots are a kind of mix between a flow chart and a bar plot, and they show how the target categories relate to various discrete features.
I would like to thank retrospectprospect for introducing me to alluvial plots in their very elegant EDA kernel for the Titanic challenge! The same user also has shared an informative and concise kernel in this competition, and if you haven’t seen it yet then I definitely recommend you to check it out.
Without further ado, here is the plot:
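A sketch with a small set of categories (the choice of axes and the hide threshold are assumptions):
```r
ally <- train %>%
  filter(fast != "middle") %>%
  group_by(fast, work, jfk_trip) %>%
  count()

alluvial(ally %>% select(-n), freq = ally$n,
         col = ifelse(ally$fast == "fast", "red", "blue"),
         border = ifelse(ally$fast == "fast", "red", "blue"),
         hide = ally$n < 1e3,
         cex = 0.8)
```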
In this plot, the vertical sizes of the blocks and the widths of the stripes (called “alluvia”) are proportional to the frequency. We decided to hide the alluvia with the lowest frequencies using the parameter of the same name. Within a category block you can re-order the vertical layering of the alluvia to emphasise certain aspects of your data set. Here, we see nicely how the fast vs slow groups fan out into the different categories.
We find:
- Outside of work hours (work == FALSE) the majority of trips are fast, which we can see from the larger proportion of red alluvia.
- In terms of wday, Sat and Sun contain a larger proportion of fast trips than the weekdays. In addition, Sat and Sun have of course no contribution to work == TRUE, since no alluvia connect the corresponding boxes. This is a useful consistency check.
- Airport trips are almost entirely slow ones. We removed the few thin alluvia (using the hide parameter) because they were hard to see in the first place.
In summary: We find that looking at this problem from the point of view of classifying fast vs slow trips helps us to see certain effects more clearly and to notice useful details about, for instance, the spatial distributions or the impact of work hours. Furthermore, we derive helpful hints on how to encode otherwise randomly ordered categorical variables like wday (day of the week) or month (randomly with respect to trip duration).
This concludes our brief excursion into classification. I encourage any of you who are interested to take this classification approach and see how far you can take it.
A simple model and prediction
In a brief final step we will feed our exploratory and engineering insights into a simple model to predict the target trip_duration for the test data set. However, this is primarily an EDA kernel and not a modelling one. This section leaves plenty of room for improvement for anyone who wants to optimise it. We will be using XGBoost, but you can try out any other algorithm.
I’m expecting that as time progresses in this competition the focus will shift towards the intricacies of model selection and tuning. I can think of a few more masters and grandmasters that I would like to see demonstrating their skills here, so that I can learn from them. For instance, this kernel won’t touch on the subject of model stacking at all.
Preparations
Train vs test overlap
In order to make sure that we are really training on features that are relevant to our test data set we will now briefly compare the temporal and spatial properties of the train and test data. This is another consistency check. We could have done this before the exploration, but my personal preference is to examine the training data first before looking at the test data set so that my analysis is as unbiased as possible. Here are the two relevant comparison plots:
We find that our train and test data sets do indeed cover the same time range and geographical area.
Data formatting
Here we will format the selected features to turn them into integer columns, since many classifiers cannot deal with categorical values. For the encoding we make use of our exploratory knowledge, including the insights gained in our classification excursion. This is necessary since most classifiers would assume a natural ordering for integer features (i.e. 1 < 2 < 3 with respect to impact on the target). An alternative would be to use one-hot encoding.
In order for the encoding to be homogeneous between the train and test data sets it is a useful habit to perform it on the combined data, in case there are differences in some of the factor levels. For instance, if the passenger_count were a genuine factor feature then here the train and test data would contain different levels (since test is missing 7 and 8 passenger trips).
This formatting stage is practically a repeat of what we did at the beginning of the Feature Engineering section. In principle we could have already worked with the combine data set back then, but for the purpose of keeping it simple we used only the train sample back then. You are very welcome to take this kernel and improve it for better efficiency.
Nothing magical here in our engineering/formatting code block:
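A sketch, assuming the engineering and OSRM-join blocks from above have been applied to combine as well; the wday ordering follows the classification section:
```r
combine <- combine %>%
  mutate(vendor_id = as.integer(vendor_id),
         store_and_fwd_flag = as.integer(factor(store_and_fwd_flag)),
         month = as.integer(month),
         wday = as.integer(factor(wday, levels = c("Sun", "Sat", "Mon", "Tue",
                                                   "Wed", "Thu", "Fri"))),  # assumed order
         work = as.integer(work),
         jfk_trip = as.integer(jfk_trip),
         lg_trip = as.integer(lg_trip),
         blizzard = as.integer(blizzard))
```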
You might have noticed that we now recode the weekdays (wday) according to the results from the classification section. Conveniently, the months are already in a sequence of increasing fraction of slow trips. (Anybody got an idea why this could be?)
Consistency check:
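Presumably just a glimpse at the encoded set; a sketch:
```r
glimpse(combine)
```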
Looks good. The only non-numerical features are id, pickup_datetime, dropoff_datetime, and date, which we will remove in any case, together with dset, which we will use now to separate the train vs test data again.
Feature selection, metric adjustment, validation split, and careful cleaning
Not all features in our data set will be useful. Here we only include meaningful variables and remove for instance the id feature and most of the weather features.
In principle, we could include all features and leave the selection of the useful ones to our modelling algorithm. However, in our case a pre-selection could be useful because we have (a) engineered a couple of features from existing ones (such as work or blizzard), and (b) imported additional features with strong correlation (such as total_distance and total_travel_time). Both strategies can cause significant collinearity within our training feature set, which will make it more difficult to interpret the result of our model in terms of the impact of individual features. Furthermore, from a pragmatic point of view any strongly correlated features don’t add much new information and should (in my opinion) be removed for reasons of statistical parsimony.
Feature selection is normally an iterative process where we run an initial model with either few or many features and then step-by-step add or remove some features based on the results of the previous run. For the sake of kernel run time here we only include one model fitting step. You are very welcome to use it as a starting point for your own analysis and combine it with insights from other kernels. We will base our feature selection on the results of the exploratory analysis, in particular the final correlation matrix.
In this code block, the feature selection is structured with a little more detail than necessary, in order to make it easier to generalise it to a new problem:
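A sketch of this structure (the exact feature set is a choice, based on the correlation matrix above):
```r
train <- combine %>% filter(dset == "train")
test <- combine %>% filter(dset == "test")
test_id <- test$id   # keep the ids for the submission file

time_features <- c("month", "wday", "hour", "work", "blizzard")
space_features <- c("dist", "bearing", "jfk_trip", "lg_trip")
osrm_features <- c("total_distance", "total_travel_time", "number_of_steps", "turns")
other_features <- c("vendor_id", "passenger_count", "store_and_fwd_flag")
train_cols <- c(time_features, space_features, osrm_features, other_features)

test <- test %>% select(one_of(train_cols))
train <- train %>% select(one_of(c(train_cols, "trip_duration")))
```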
For our taxi challenge, the evaluation metric is RMSLE, the Root Mean Squared Logarithmic Error. Essentially, this means that we are optimising the prediction vs data deviations in log space. This has the advantage that large individual outliers don’t get as much weight as they would in a linear metric.
In order to easily simulate the evaluation metric in our model fitting, we replace the trip_duration with its logarithm. (The + 1 is added to avoid an undefined log(0); we need to remember to remove this 1 second from the prediction file, although it wouldn’t make a huge difference if we didn’t.)
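A sketch:
```r
train <- train %>%
  mutate(trip_duration = log(trip_duration + 1))
```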
In order to assess how well our model generalises we will perform a cross-validation step and also split our training data into a train vs validation data set. Thereby, the model performance can be evaluated on a sample that the algorithm has not seen. We split our data into 80/20 fractions using a tool from the caret package. We wouldn’t really need this package here for such a simple task, but I’m including it to give it a mention. Caret is a multi-purpose ML package that is certainly worth checking out.
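A sketch of the 80/20 split:
```r
set.seed(4321)
trainIndex <- as.vector(createDataPartition(train$trip_duration, p = 0.8,
                                            list = FALSE, times = 1))
valid <- train[-trainIndex, ]
train <- train[trainIndex, ]
```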
Finally, we are performing careful cleaning steps on our train sample only. By removing observations from the training data we always run the risk of overfitting our model to a more “idealised” version of reality. In particular for the current data set, where lots of values don’t make obvious sense, this idealised model might not generalise well to unseen data, which could have similar “quirks”. For this reason, we are keeping the validation sample unchanged, so that any noticeable deviations between train and valid model score can alert us to overfitting.
Why clean at all? Well, in my opinion obvious outliers should be removed to make the model robust. It’s true that these outliers might also exist in the test data, but if there is a way to predict them then this suggests that they might not be outliers after all. And if we can’t predict them there’s no way to take them into account anyway. Lastly, the answer also depends on the goal of your analysis (and your evaluation metric), which might not always aim at fitting a given test set as well as possible but sometimes have a more pragmatic target that allows you to ignore extreme values.
Here we only remove the few trip_duration that are longer than a day and the couple of data points far, far away from NYC:
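A sketch (note that trip_duration is in log space by now; the direct distance serves as an assumed proxy for "far, far away from NYC"):
```r
train <- train %>%
  filter(trip_duration < log(24*3600 + 1),
         dist < 3e5)
```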
XGBoost parameters and fitting
For the model fitting we will use everybody’s favourite XGBoost - eXtreme Gradient Boosting. This is a decision-tree gradient boosting algorithm with a high degree of popularity here on Kaggle. In this kernel we will implement a fairly straight-forward fitting strategy. Feel free to optimise the parameters and get some inspiration from other kernels that are more focussed on model fitting. Also try out other classifiers, such as Megan’s baseline Random Forest to see how they treat our features.
A few brief words about gradient boosting, as far as I understand it: Boosting is what we call the step-by-step improvement of a weak learner (like a relatively shallow decision tree of max_depth levels) by successively applying it to the results of the previous learning step (for nrounds times in total). Gradient Boosting focusses on minimising the Loss Function (according to our evaluation metric) by training the algorithm on the gradient of this function. The method of Gradient Descent iteratively moves in the direction of the greatest descent (i.e. most negative first derivative) of the loss function. The step sizes can vary from iteration to iteration but have a multiplicative shrinkage factor eta in (0,1] associated with them for additional tuning. Smaller values of eta result in a slower descent and require a higher nrounds.
In order for XGBoost to properly ingest our data samples we need to re-format them slightly:
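A sketch of the conversion into xgb.DMatrix objects:
```r
foo <- train %>% select(-trip_duration)
bar <- valid %>% select(-trip_duration)

dtrain <- xgb.DMatrix(as.matrix(foo), label = train$trip_duration)
dvalid <- xgb.DMatrix(as.matrix(bar), label = valid$trip_duration)
dtest <- xgb.DMatrix(as.matrix(test))
```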
Now we define the meta-parameters that govern how XGBoost operates. See here for more details. The watchlist parameter tells the algorithm to keep an eye on both the training and validation sample metrics. Those parameters are not optimised for performance, so feel free to play with them:
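A sketch of such a parameter list (the specific values are starting points, not tuned):
```r
xgb_params <- list(colsample_bytree = 0.7,
                   subsample = 0.7,
                   booster = "gbtree",
                   max_depth = 5,
                   eta = 0.3,
                   objective = "reg:linear",
                   eval_metric = "rmse")
watchlist <- list(train = dtrain, valid = dvalid)
```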
And here we train our classifier, i.e. fit it to the training data. To ensure reproducibility we set an R seed here. To make this model run in the kernel environment, given our extensive EDA run time, we will restrict the learning to 60 sample rounds.
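A sketch of the training call with the 60 rounds mentioned above:
```r
set.seed(4321)
gb_dt <- xgb.train(params = xgb_params,
                   data = dtrain,
                   nrounds = 60,
                   watchlist = watchlist,
                   print_every_n = 5,
                   early_stopping_rounds = 10)
```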
After the fitting we are running a 5-fold cross-validation (CV) to estimate our model’s performance. Also this stage would exceed the Kaggle run-time limit for a larger number of rounds; therefore I’m limiting it here to 15 sample rounds to demonstrate the principle. You should use at least a few hundred in your analysis, depending on your XGBoost parameters. The early-stopping parameter will make sure that the CV fitting is stopped once the model can’t be improved through additional steps.
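A sketch of the CV call:
```r
xgb_cv <- xgb.cv(params = xgb_params,
                 data = dtrain,
                 nrounds = 15,
                 nfold = 5,
                 print_every_n = 5,
                 early_stopping_rounds = 5)
```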
Feature importance
After training we will check which features are the most important for our model. This can provide the starting point for an iterative process where we identify, step by step, the significant features for a model. Here we will simply visualise these features:
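A sketch:
```r
imp_matrix <- xgb.importance(feature_names = colnames(foo), model = gb_dt)
xgb.ggplot.importance(importance_matrix = imp_matrix)
```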
We find that, unsurprisingly, the OSRM features like total_distance and total_travel_time are by far the most important predictors for the trip_duration. Beyond that, our engineered features like wday or bearing are not doing badly either.
Prediction and submission file
Once we are reasonably happy with our CV score we use the corresponding model to make a prediction for the test data, write the submission file, and submit it to the Kaggle leaderboard. This gives a performance score based on an independent data set. For this very reason, it is advisable to keep the number of leaderboard submissions low. Otherwise you run the risk of having your model influenced too much by this public leaderboard data, which will result in overfitting. It’s better to trust your local CV. Just ask anyone who has just completed the Mercedes challenge ;-)
Practically, this penultimate code block will apply the XGBoost model to the testing data and write a submission file according to the competition specification. You will then find the file submit.csv in the “Output” tab of the Kaggle kernel or, if you have run the script locally, in the source directory.
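A sketch, using the test_id values we set aside during the feature selection:
```r
test_preds <- predict(gb_dt, dtest)
pred <- tibble(id = test_id, trip_duration = exp(test_preds) - 1)
write_csv(pred, "submit.csv")
```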
All that’s left to do is to double-check this submission file to make sure that we are not wasting a (potentially) valuable submission with a mal-formatted output file. We are comparing its shape and content with the sample submission file that we read in at the very beginning, and also plot the distribution of the predicted values compared to the training values for an approximate comparison:
Prediction file format and trip_duration range successfully verified.
In its present state, this model will give you a score of 0.410 on the public leaderboard which, at the time of writing, will put you in the top 40% of submissions. Clearly this is just a starting point for you to explore and improve the modelling process. Here are a few suggestions:
- First of all, run the training and cross-validation for a larger number of rounds. This alone will immediately improve your score.
- Add further features to the model and explore the impact of more or less cleaning.
- Try out different XGBoost parameters and see how they affect the CV score.
The rest is up to you. I hope that this extensive EDA gave you useful insights into the data as well as ideas to get you started with your own analysis. I wish you the best of success.
Many thanks to all of you for reading, commenting on, and upvoting this kernel! This competition was an awesome experience for me and I hope you enjoyed it as well.
See you at the next challenge! :-)