NYC Taxi

Plotting very large datasets meaningfully, using datashader

There are a variety of approaches for plotting large datasets, but most of them are very unsatisfactory. Here we first show some of the issues, then demonstrate how the datashader library helps make large datasets truly practical.

We'll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let's approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.

Load NYC Taxi data

(takes 10-20 seconds, since it's in the inefficient but widely supported CSV file format...)

In [1]:
import pandas as pd
%time df = pd.read_csv('../data/nyc_taxi.csv', usecols= \
                       ['pickup_x', 'pickup_y', 'dropoff_x', 'dropoff_y', 'passenger_count', 'tpep_pickup_datetime'])
df.tail()
CPU times: user 19 s, sys: 1.28 s, total: 20.3 s
Wall time: 20.3 s
Out[1]:
tpep_pickup_datetime passenger_count pickup_x pickup_y dropoff_x dropoff_y
11842089 2015-01-10 19:01:44 2 -8.232298e+06 4.980860e+06 -8.232492e+06 4.979234e+06
11842090 2015-01-10 19:01:44 2 -8.235721e+06 4.972331e+06 -8.234857e+06 4.971131e+06
11842091 2015-01-10 19:01:44 1 -8.235341e+06 4.975470e+06 -8.234203e+06 4.981092e+06
11842092 2015-01-10 19:01:44 1 -8.237594e+06 4.973844e+06 -8.235618e+06 4.973722e+06
11842093 2015-01-10 19:01:45 1 -8.233229e+06 4.977946e+06 -8.234152e+06 4.977120e+06

As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.
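Since reparsing the CSV takes tens of seconds every time, it can be worth caching the table in a binary columnar format after the first load. The following optional sketch assumes pyarrow is installed; the nyc_taxi.parq filename is just an example:

import os

parq = '../data/nyc_taxi.parq'      # hypothetical cache file next to the CSV
if not os.path.exists(parq):
    df.to_parquet(parq)             # one-time conversion of the DataFrame loaded above
df = pd.read_parquet(parq)          # binary columnar reload, much faster than parsing the CSV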

Define a simple plot

In [2]:
from bokeh.models import BoxZoomTool
from bokeh.plotting import figure, output_notebook, show

output_notebook()

NYC = x_range, y_range = ((-8242000,-8210000), (4965000,4990000))

plot_width  = int(750)
plot_height = int(plot_width//1.2)

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)
    
    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    
    p.add_tools(BoxZoomTool(match_aspect=True))
    
    return p
    
options = dict(line_color=None, fill_color='blue', size=5)

1000-point scatterplot: undersampling

Any plotting program should be able to handle a plot of 1000 datapoints. Here the points are initially overplotting each other, but if you zoom in a bit (using the wheel-zoom or box-zoom tools at the top right of the plot; the Reset button returns you to the original view), nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even see the outline of Manhattan Island and Central Park from the pattern of dots. We've included geographic map data here to help get you situated, though for a genuine data mining task in an abstract data space you might not have any such landmarks. In any case, because this plot is discarding 99.99% of the data, it reveals very little of what might be contained in the dataset, a problem called undersampling.

In [3]:
%%time
from bokeh.tile_providers import STAMEN_TERRAIN

samples = df.sample(n=1000)
p = base_plot()
p.add_tile(STAMEN_TERRAIN)
p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)
show(p)
CPU times: user 581 ms, sys: 20.8 ms, total: 602 ms
Wall time: 600 ms

10,000-point scatterplot: overplotting

We can of course plot more points to reduce the amount of undersampling. However, even if we only try to plot 0.1% of the data, ignoring the other 99.9%, we will find major problems with overplotting, such that the true density of dropoffs in central Manhattan is impossible to see due to occlusion:

In [4]:
%%time
samples = df.sample(n=10000)
p = base_plot()

p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)
show(p)
CPU times: user 631 ms, sys: 13.6 ms, total: 645 ms
Wall time: 642 ms

Overplotting is reduced if you zoom in on a particular region (you may need to click to enable the wheel-zoom tool in the upper right of the plot first, then use the scroll wheel). However, the problem then switches back to serious undersampling, as the too-sparsely sampled datapoints are revealed in the zoomed-in region, even though much more data is available.

100,000-point scatterplot: saturation

If you make the dot size smaller, you can reduce the overplotting that occurs when you try to combat undersampling. Even so, with enough opaque data points, overplotting will be unavoidable in popular dropoff locations. You can then adjust the alpha (opacity) parameter that most plotting programs provide, so that multiple points need to overlap before full color saturation is achieved. With enough data, such a plot can approximate the probability density function for dropoffs, showing where dropoffs were most common:

%%time
options = dict(line_color=None, fill_color='blue', size=1, alpha=0.1)
samples = df.sample(n=100000)
p = base_plot(webgl=True)
p.circle(x=samples['dropoff_x'], y=samples['dropoff_y'], **options)
show(p)

[ Here we've shown static output as a PNG rather than a live Bokeh plot, to reduce the file size for distributing full notebooks and because some browsers will have trouble with plots this large. The above cell can be converted into code and executed to get the full interactive plot. ]
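If you want to regenerate such a static image yourself after running the cell above as code, Bokeh can export a plot to PNG. This is a sketch only; it assumes the optional selenium and webdriver dependencies are installed, and the filename is just an example:

from bokeh.io import export_png

export_png(p, filename="nyc_taxi_100k_dropoffs.png")   # saves the plot above as a static image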

However, it's very tricky to set the size and alpha parameters. How do we know if certain regions are saturating, unable to show peaks in dropoff density? Here we've manually set the alpha to show a clear structure of streets and blocks, as one would intuitively expect to see, but the density of dropoffs still seems approximately the same on nearly all Manhattan streets (just wider in some locations), which is unlikely to be true. We can of course reduce the alpha value to reduce saturation further, but there's no way to tell when it's been set correctly, and it's already low enough that nothing other than Manhattan and La Guardia is showing up at all. Plus, this alpha value will only work reasonably well at the one zoom level shown. Try zooming in (you may need to enable the wheel-zoom tool in the upper right) to see that at higher zooms there is less overlap between dropoff locations, so the points all start to become transparent due to lack of overlap. Yet without setting the size and alpha to low values in the first place, the structure is invisible when zoomed out, due to overplotting. Thus even though Bokeh provides rich support for interactively revealing structure by zooming, it is of limited utility for large data: either the data is invisible when zoomed in, or there's no large-scale structure visible when zoomed out to indicate where zooming would be informative.

Moreover, we're still ignoring 99% of the data. Many plotting programs will have trouble with plots even this large, but Bokeh can handle 100-200,000 points in most browsers. Here we've enabled Bokeh's WebGL support, which gives smoother zooming behavior, but the non-WebGL mode also works well. Still, at such sizes the plots become slow because of the large HTML file sizes involved, since each of the data points is encoded as text in the web page, and for even larger samples the browser will fail to render the page at all.
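As a rough back-of-the-envelope illustration (the exact overhead depends on how Bokeh serializes the data), the text needed just for the coordinates of a 100,000-point sample is already on the order of megabytes:

# Assumed figures: two coordinates per point, ~18 characters of serialized
# text per coordinate (Web Mercator values like -8232298.12345678)
n_points, coords_per_point, chars_per_coord = 100000, 2, 18
print("~{:.1f} MB of coordinate text".format(n_points * coords_per_point * chars_per_coord / 1e6))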

10-million-point datashaded plots: auto-ranging, but limited dynamic range

To let us work with truly large datasets without discarding most of the data, we can take an entirely different approach. Instead of using a Bokeh scatterplot, which encodes every point into JSON and stores it in the HTML file read by the browser, we can use the datashader library to render the entire dataset into a pixel buffer in a separate Python process, and then provide a fixed-size image to the browser containing only the data currently visible. This approach decouples the data processing from the visualization. The data processing is then limited only by the computational power available, while the visualization has much more stringent constraints determined by your display device (a web browser and your particular monitor, in this case). This approach works particularly well when your data is in a far-off server, but it is also useful whenever your dataset is larger than your display device can render easily.

Because the number of points involved is no longer a limiting factor, you can now use the entire dataset (including the full 150 million trips that have been made public, if you download that data separately). Most importantly, because datashader allows computation on the intermediate stages of plotting, you can easily define operations like auto-ranging (which is on by default), so that we can be sure there is no overplotting or saturation and no need to set parameters like alpha.

The steps involved in datashading are (1) creating a Canvas object with the shape of the eventual plot (i.e., having one storage bin for collecting points, per final pixel), (2) aggregating all points into that set of bins, incrementally counting them, and (3) mapping the resulting counts into a visible color from a specified range to make an image:

In [5]:
import datashader as ds
from datashader import transfer_functions as tf
from datashader.colors import Greys9
Greys9_r = list(reversed(Greys9))[:-2]
In [6]:
%%time
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))
img = tf.shade(agg, cmap=["white", 'darkblue'], how='linear')
CPU times: user 410 ms, sys: 22 ms, total: 432 ms
Wall time: 431 ms

The resulting image is similar to the 100,000-point Bokeh plot above, but (a) makes use of all 12 million datapoints, (b) is computed in only a tiny fraction of the time, (c) does not require any magic-number parameters like size and alpha, and (d) automatically ensures that there is no saturation or overplotting:

In [7]:
img
Out[7]:

This plot renders the count at every pixel as a color from the specified range (here from white to dark blue), mapped linearly. If your display device were linear, and the data were distributed evenly across this color range, then the result of such linear, auto-ranged processing would be an effective, parameter-free way to visualize your dataset.
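Conceptually, the linear mapping amounts to normalizing the counts and interpolating between the two endpoint colors per pixel. Here is a minimal NumPy sketch of that idea (a simplification only; tf.shade itself also handles masking, alpha, and other colormap options):

import numpy as np

def shade_linear_sketch(counts, low=(255, 255, 255), high=(0, 0, 139)):
    # Scale counts to [0, 1], then interpolate each RGB channel
    # between 'low' (white) and 'high' (darkblue)
    norm = counts / counts.max()
    low, high = np.array(low, dtype=float), np.array(high, dtype=float)
    return (low + norm[..., None] * (high - low)).astype(np.uint8)

# e.g. shade_linear_sketch(agg.values) gives an RGB array the size of the final image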

However, real display devices are not typically linear, and more importantly, real data is rarely distributed evenly. Here, it is clear that there are "hotspots" in dropoffs, with a very high count for areas around Penn Station and Madison Square Garden, relatively low counts for the rest of Manhattan's streets, and apparently no dropoffs anywhere else but La Guardia airport. NYC taxis definitely cover a larger geographic range than this, so what is the problem? To see, let's look at the histogram of counts for the above image:

In [8]:
import numpy as np

def histogram(x, colors=None):
    hist, edges = np.histogram(x, bins=100)
    p = figure(y_axis_label="Pixels",
               tools='', height=130, outline_line_color=None,
               min_border=0, min_border_left=0, min_border_right=0,
               min_border_top=0, min_border_bottom=0)
    # Drop the first bin (the many empty or near-empty pixels) so it doesn't dominate the plot
    p.quad(top=hist[1:], bottom=0, left=edges[1:-1], right=edges[2:])
    print("min: {}, max: {}".format(np.min(x), np.max(x)))
    show(p)
In [9]:
histogram(agg.values)
min: 0, max: 15394

Clearly, most of the pixels have very low counts (under 3000), while a very few pixels have much larger counts (over 15,000, in the run shown above). When these values are mapped into colors for display, nearly all of the pixels end up being colored with the lowest colors in the range, i.e. white or nearly white, while the other colors in the available range are used for only a few dozen pixels at most. Thus most of the pixels in this plot convey very little information about the data, wasting nearly all of the dynamic range available on your display device. It's thus very likely that we are missing a lot of the structure in this data that we could be seeing.
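To quantify this (a quick check reusing agg from above; the exact fraction will vary with the data and with how many color levels you assume):

counts = agg.values.ravel()
nonzero = counts[counts > 0]

# Fraction of nonzero pixels that would land in the lowest of, say, 9
# equal-width (linear) color bins, i.e. pixels rendered as nearly white
print("{:.1%} of nonzero pixels fall in the bottom linear color bin".format(np.mean(nonzero < nonzero.max() / 9)))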

10-million-point datashaded plots: high dynamic range

For the typical case of data that is distributed nonlinearly over the available range, we can use a nonlinear scaling to map the data range into the visible color range. For example, first transforming the values via a log function will help flatten out this histogram and reveal much more of the structure of the data:

In [10]:
histogram(np.log1p(agg.values))

tf.shade(agg, cmap=Greys9_r, how='log')
min: 0.0, max: 9.641798060358601
Out[10]:

We can now see that there is rich structure throughout this dataset -- geographic features like streets and buildings are clearly modulating the values in both the high-dropoff regions in Manhattan and the relatively low-dropoff regions in the surrounding areas. Still, this choice is arbitrary -- why the log function in particular? It clearly flattened the histogram somewhat, but it was just a guess. We can instead explicitly equalize the histogram of the data before building the image, making structure visible at every data level (and thus at all the geographic locations covered) in a general way:

In [11]:
histogram(tf.eq_hist(agg.values))
    
tf.shade(agg, cmap=Greys9_r, how='eq_hist')
min: 0.7400661333333334, max: 1.0
Out[11]:
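
For intuition, histogram equalization maps each count to (approximately) its quantile in the overall distribution, so that every output level covers roughly the same number of pixels. A minimal NumPy sketch of the idea follows; it is not datashader's actual tf.eq_hist, which additionally handles integer data and masked regions:

import numpy as np

def eq_hist_sketch(values, nbins=256):
    # Map values onto [0, 1] via their empirical CDF, which flattens the histogram
    flat = values.ravel()
    hist, edges = np.histogram(flat, bins=nbins)
    cdf = hist.cumsum().astype(float)
    cdf /= cdf[-1]
    return np.interp(flat, edges[1:], cdf).reshape(values.shape)

# e.g. histogram(eq_hist_sketch(agg.values)) should look nearly flat by construction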