nyc taxi

Plotting very large datasets meaningfully, using datashader

There are a variety of approaches for plotting large datasets, but most of them are very unsatisfactory. Here we first show some of the issues, then demonstrate how the datashader library helps make large datasets truly practical.

We'll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let's approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.

NOTE: This dataset is also explorable through the Datashader example dashboard. From inside the examples directory, run: DS_DATASET=nyc_taxi panel serve --show dashboard.ipynb

Load NYC Taxi data

(takes 10-20 seconds, since it's in the inefficient but widely supported CSV file format...)

In [1]:
import pandas as pd
%time df = pd.read_csv('../data/nyc_taxi.csv',usecols= \
                       ['pickup_x', 'pickup_y', 'dropoff_x','dropoff_y', 'passenger_count','tpep_pickup_datetime'])
CPU times: user 16.7 s, sys: 1.59 s, total: 18.3 s
Wall time: 17.9 s
tpep_pickup_datetime passenger_count pickup_x pickup_y dropoff_x dropoff_y
10679302 2015-01-10 19:01:44 2 -8.232298e+06 4.980860e+06 -8.232492e+06 4.979234e+06
10679303 2015-01-10 19:01:44 2 -8.235721e+06 4.972331e+06 -8.234857e+06 4.971131e+06
10679304 2015-01-10 19:01:44 1 -8.235341e+06 4.975470e+06 -8.234203e+06 4.981092e+06
10679305 2015-01-10 19:01:44 1 -8.237594e+06 4.973844e+06 -8.235618e+06 4.973722e+06
10679306 2015-01-10 19:01:45 1 -8.233229e+06 4.977946e+06 -8.234152e+06 4.977120e+06

As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.

Define a simple plot

In [2]:
from bokeh.models import BoxZoomTool
from bokeh.plotting import figure, output_notebook, show


NYC = x_range, y_range = ((-8242000,-8210000), (4965000,4990000))

plot_width  = int(750)
plot_height = int(plot_width//1.2)

def base_plot(tools='pan,wheel_zoom,reset',plot_width=plot_width, plot_height=plot_height, **plot_args):
    p = figure(tools=tools, plot_width=plot_width, plot_height=plot_height,
        x_range=x_range, y_range=y_range, outline_line_color=None,
        min_border=0, min_border_left=0, min_border_right=0,
        min_border_top=0, min_border_bottom=0, **plot_args)

    p.axis.visible = False
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None


    return p

options = dict(line_color=None, fill_color='blue', size=5)
Loading BokehJS ...

1000-point scatterplot: undersampling

Any plotting program should be able to handle a plot of 1000 datapoints. Here the points are initially overplotting each other, but if you hit the Reset button (top right of plot) to zoom in a bit, nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even see the outline of Manhattan Island and Central Park from the pattern of dots. We've included geographic map data here to help get you situated, though for a genuine data mining task in an abstract data space you might not have any such landmarks. In any case, because this plot is discarding 99.99% of the data, it reveals very little of what might be contained in the dataset, a problem called undersampling.

In [3]:
from bokeh.tile_providers import STAMEN_TERRAIN

samples = df.sample(n=1000)
p = base_plot()
p.add_tile(STAMEN_TERRAIN)['dropoff_x'], y=samples['dropoff_y'], **options)
CPU times: user 339 ms, sys: 8.81 ms, total: 348 ms
Wall time: 347 ms

10,000-point scatterplot: overplotting

We can of course plot more points to reduce the amount of undersampling. However, even if we only try to plot 0.1% of the data, ignoring the other 99.9%, we will find major problems with overplotting, such that the true density of dropoffs in central Manhattan is impossible to see due to occlusion:

In [4]:
samples = df.sample(n=10000)
p = base_plot()['dropoff_x'], y=samples['dropoff_y'], **options)
CPU times: user 519 ms, sys: 10.5 ms, total: 529 ms
Wall time: 529 ms

Overplotting is reduced if you zoom in on a particular region (may need to click to enable the wheel-zoom tool in the upper right of the plot first, then use the scroll wheel). However, then the problem switches to back to serious undersampling, as the too-sparsely sampled datapoints get revealed for zoomed-in regions, even though much more data is available.

100,000-point scatterplot: saturation

If you make the dot size smaller, you can reduce the overplotting that occurs when you try to combat undersampling. Even so, with enough opaque data points, overplotting will be unavoidable in popular dropoff locations. So you can then adjust the alpha (opacity) parameter of most plotting programs, so that multiple points need to overlap before full color saturation is achieved. With enough data, such a plot can approximate the probability density function for dropoffs, showing where dropoffs were most common:

options = dict(line_color=None, fill_color='blue', size=1, alpha=0.1)
samples = df.sample(n=100000)
p = base_plot(webgl=True)['dropoff_x'], y=samples['dropoff_y'], **options)

[Here we've shown static output as a PNG rather than a live Bokeh plot, to reduce the file size for distributing full notebooks and because some browsers will have trouble with plots this large. The above cell can be converted into code and executed to get the full interactive plot.]

However, it's very tricky to set the size and alpha parameters. How do we know if certain regions are saturating, unable to show peaks in dropoff density? Here we've manually set the alpha to show a clear structure of streets and blocks, as one would intuitively expect to see, but the density of dropoffs still seems approximately the same on nearly all Manhattan streets (just wider in some locations), which is unlikely to be true. We can of course reduce the alpha value to reduce saturation further, but there's no way to tell when it's been set correctly, and it's already low enough that nothing other than Manhattan and La Guardia is showing up at all. Plus, this alpha value will only work even reasonably well at the one zoom level shown. Try zooming in (may need to enable the wheel zoom tool in the upper right) to see that at higher zooms, there is less overlap between dropoff locations, so that the points all start to become transparent due to lack of overlap. Yet without setting the size and alpha to a low value in the first place, the stucture is invisible when zoomed out, due to overplotting. Thus even though Bokeh provides rich support for interactively revealing structure by zooming, it is of limited utility for large data; either the data is invisible when zoomed in, or there's no large-scale structure when zoomed out, which is necessary to indicate where zooming would be informative.

Moreover, we're still ignoring 99% of the data. Many plotting programs will have trouble with plots even this large, but Bokeh can handle 100-200,000 points in most browsers. Here we've enabled Bokeh's WebGL support, which gives smoother zooming behavior, but the non-WebGL mode also works well. Still, for such large sizes the plots become slow due to the large HTML file sizes involved, because each of the data points are encoded as text in the web page, and for even larger samples the browser will fail to render the page at all.

10-million-point datashaded plots: auto-ranging, but limited dynamic range

To let us work with truly large datasets without discarding most of the data, we can take an entirely different approach. Instead of using a Bokeh scatterplot, which encodes every point into JSON and stores it in the HTML file read by the browser, we can use the datashader library to render the entire dataset into a pixel buffer in a separate Python process, and then provide a fixed-size image to the browser containing only the data currently visible. This approach decouples the data processing from the visualization. The data processing is then limited only by the computational power available, while the visualization has much more stringent constraints determined by your display device (a web browser and your particular monitor, in this case). This approach works particularly well when your data is in a far-off server, but it is also useful whenever your dataset is larger than your display device can render easily.

Because the number of points involved is no longer a limiting factor, you can now use the entire dataset (including the full 150 million trips that have been made public, if you download that data separately). Most importantly, because datashader allows computation on the intermediate stages of plotting, you can easily define operations like auto-ranging (which is on by default), so that we can be sure there is no overplotting or saturation and no need to set parameters like alpha.

The steps involved in datashading are (1) create a Canvas object with the shape of the eventual plot (i.e. having one storage bin for collecting points, per final pixel), (2) aggregating all points into that set of bins, incrementally counting them, and (3) mapping the resulting counts into a visible color from a specified range to make an image:

In [5]:
import datashader as ds
from datashader import transfer_functions as tf
from datashader.colors import Greys9
Greys9_r = list(reversed(Greys9))[:-2]
In [6]:
cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)
agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))
img = tf.shade(agg, cmap=["white", 'darkblue'], how='linear')
CPU times: user 484 ms, sys: 84.3 ms, total: 568 ms
Wall time: 568 ms

The resulting image is similar to the 100,000-point Bokeh plot above, but (a) makes use of all 12 million datapoints, (b) is computed in only a tiny fraction of the time, (c) does not require any magic-number parameters like size and alpha, and (d) automatically ensures that there is no saturation or overplotting:

In [7]:

This plot renders the count at every pixel as a color from the specified range (here from white to dark blue), mapped linearly. If your display device were linear, and the data were distributed evenly across this color range, then the result of such linear, auto-ranged processing would be an effective, parameter-free way to visualize your dataset.

However, real display devices are not typically linear, and more importantly, real data is rarely distributed evenly. Here, it is clear that there are "hotspots" in dropoffs, with a very high count for areas around Penn Station and Madison Square Garden, relatively low counts for the rest of Manhattan's streets, and apparently no dropoffs anywhere else but La Guardia airport. NYC taxis definitely cover a larger geographic range than this, so what is the problem? To see, let's look at the histogram of counts for the above image:

In [8]:
import numpy as np

def histogram(x,colors=None):
    hist,edges = np.histogram(x, bins=100)
    p = figure(y_axis_label="Pixels",
               tools='', height=130, outline_line_color=None,
               min_border=0, min_border_left=0, min_border_right=0,
               min_border_top=0, min_border_bottom=0)
    p.quad(top=hist[1:], bottom=0, left=edges[1:-1], right=edges[2:])
    print("min: {}, max: {}".format(np.min(x),np.max(x)))
In [9]:
min: 0, max: 15205

Clearly, most of the pixels have very low counts (under 3000), while a very few pixels have much larger counts (up to 22000, in this case). When these values are mapped into colors for display, nearly all of the pixels will end up being colored with the lowest colors in the range, i.e. white or nearly white, while the other colors in the available range will be used for only a few dozen pixels at most. Thus most of the pixels in this plot convey very little information about the data, wasting nearly all of dynamic range available on your display device. It's thus very likely that we are missing a lot of the structure in this data that we could be seeing.

10-million-point datashaded plots: high dynamic range

For the typical case of data that is distributed nonlinearly over the available range, we can use nonlinear scaling to map the data range into the visible color range. E.g. first transforming the values via a log function will help flatten out this histogram and reveal much more of the structure of this data:

In [10]:

tf.shade(agg, cmap=Greys9_r, how='log')
min: 0.0, max: 9.629445365788381

We can now see that there is rich structure throughout this dataset -- geographic features like streets and buildings are clearly modulating the values in both the high-dropoff regions in Manhattan and the relatively low-dropoff regions in the surrounding areas. Still, this choice is arbitrary -- why the log function in particular? It clearly flattened the histogram somewhat, but it was just a guess. We can instead explicitly equalize the histogram of the data before building the image, making structure visible at every data level (and thus at all the geographic locations covered) in a general way:

In [11]:

tf.shade(agg, cmap=Greys9_r, how='eq_hist')
min: 0.8699626666666667, max: 1.0

The histogram is now fully flat (apart from the spacing of bins caused by the discrete nature of integer counting). Effectively, the visualization now shows a rank-order or percentile distribution of the data. I.e., pixels are now colored according to where their corresponding counts fall in the distribution of all counts, with one end of the color range for the lowest counts, one end for the highest ones, and every colormap step in between having similar numbers of counts. Such a visualization preserves the ordering between count values, faithfully displaying local differences in these counts, but discards absolute magnitudes (as the top 1% of the color range will be used for the top 1% of the data values, whatever those may be).

Now that the data is visible at every level, we can immediately see that there are some clear problems with the quality of the data -- there is a surprising number of trips that claim to drop off in the water or in the roadless areas of Central park, as well as in the middle of most of the tallest buildings in central Manhattan. These locations are likely to be GPS errors being made visible, perhaps partly because of poor GPS performance in between the tallest buildings.

Histogram equalization does not require any magic parameters, and in theory it should convey the maximum information available about the relative values between pixels, by mapping each of the observed ranges of values into visibly discriminable colors. And it's clearly a good start in practice, because it shows both low values (avoiding undersaturation) and relatively high values clearly, without arbitrary settings.

Even so, the results will depend on the nonlinearities of your visual system, your specific display device, and any automatic compensation or calibration being applied to your display device. Thus in practice, the resulting range of colors may not map directly into a linearly perceivable range for your particular setup, and so you may want to further adjust the values to more accurately reflect the underlying structure, by adding additional calibration or compensation steps.

Moreover, at this point you can now bring in your human-centered goals for the visualization -- once the overall structure has been clearly revealed, you can select specific aspects of the data to highlight or bring out, based on your own questions about the data. These questions can be expressed at whatever level of the pipeline is most appropriate, as shown in the examples below. For instance, histogram equalization was done on the counts in the aggregate array, because if we waited until the image had been created, we would have been working with data truncated to the 256 color levels available per channel in most display devices, greatly reducing precision. Or you may want to focus specifically on the highest peaks (as shown below), which again should be done at the aggregate level so that you can use the full color range of your display device to represent the narrow range of data that you are interested in. Throughout, the goal is to map from the data of interest into the visible, clearly perceptible range available on your display device.

10-million-point datashaded plots: interactive

Although the above plots reveal the entire dataset at once, the full power of datashading requires an interactive plot, because a big dataset will usually have structure at very many different levels (such as different geographic regions). Datashading allows auto-ranging and other automatic operations to be recomputed dynamically for the specific selected viewport, automatically revealing local structure that may not be visible from a global view. Here we'll embed the generated images into a Bokeh plot to support fully interactive zooming. For the highest detail on large monitors, you should increase the plot width and height above.

In [12]:
import datashader as ds
from datashader.bokeh_ext import InteractiveImage
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9, Hot, inferno

background = "black"
export = partial(export_image, export_path="export", background=background)
cm = partial(colormap_select, reverse=(background=="black"))

def create_image(x_range, y_range, w=plot_width, h=plot_height):
    cvs = ds.Canvas(plot_width=w, plot_height=h, x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'dropoff_x', 'dropoff_y',  ds.count('passenger_count'))
    img = tf.shade(agg, cmap=Hot, how='eq_hist')
    return tf.dynspread(img, threshold=0.5, max_px=4)

p = base_plot(background_fill_color=background)
InteractiveImage(p, create_image)