Introduction
What is Datashader?
Datashader turns even the largest datasets into images, faithfully preserving the data’s distribution.
Datashader is an open-source Python library for analyzing and visualizing large datasets. Specifically, Datashader is designed to “rasterize” or “aggregate” datasets into regular grids that can be analyzed further or viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core, distributed, or GPU processing for even larger datasets.
This page of the getting-started guide will give a simple example to show how it works, and the following page will show how to use Datashader as a standalone library for generating arrays or images directly (Pipeline). Next we’ll show how to use Datashader as a component in a larger visualization system like HoloViews or Bokeh that provides interactive plots with dynamic zooming, labeled axes, and overlays and layouts (3-Interactivity). More detailed information about each topic is then provided in the User Guide.
Example: NYC taxi trips
To illustrate how this process works, we will demonstrate some of the key features of Datashader using a standard “big-data” example: millions of taxi trips from New York City, USA. First let’s import the libraries we are going to use and then read the dataset.
import datashader as ds, pandas as pd, colorcet as cc
df = pd.read_csv('../data/nyc_taxi.csv', usecols=['dropoff_x', 'dropoff_y'])
df.head()
| | dropoff_x | dropoff_y |
|---|---|---|
| 0 | -8.234835e+06 | 4.975627e+06 |
| 1 | -8.237021e+06 | 4.976875e+06 |
| 2 | -8.238124e+06 | 4.971127e+06 |
| 3 | -8.238108e+06 | 4.974457e+06 |
| 4 | -8.236804e+06 | 4.975483e+06 |
Here you can see that we have a simple columnar dataset with x and y dropoff locations (in Web Mercator coordinates) for each of the 10 million taxi trips included; other columns were skipped during loading. With Datashader, we can choose what we want to plot on the x and y axes and see the full data immediately, with no parameter tweaking, magic numbers, subsampling, or approximation, up to the resolution of the display:
agg = ds.Canvas().points(df, 'dropoff_x', 'dropoff_y')
ds.tf.set_background(ds.tf.shade(agg, cmap=cc.fire), "black")
Here you can immediately see that the data points are aligned to a street grid, that some areas have much more traffic than others, and that the quality of the signal varies spatially (with some areas having blurry patterns that indicate GPS errors, perhaps due to tall buildings). Getting a plot like this with other approaches would take quite a bit of time and effort, but with Datashader it appears in milliseconds without trial and error.
The output above is just a bare image, which is all that Datashader knows how to generate directly. But Datashader can integrate closely with Bokeh, HoloViews, and GeoViews, which makes it simple to allow interactive zooming, axis labeling, overlays and layouts, and complex web apps. For example, making a zoomable interactive overlay on a geographic map requires just a few more lines of code:
import holoviews as hv
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import datashade
hv.extension('bokeh')
map_tiles = EsriImagery().opts(alpha=0.5, width=900, height=480, bgcolor='black')
points = hv.Points(df, ['dropoff_x', 'dropoff_y'])
taxi_trips = datashade(points, x_sampling=1, y_sampling=1, cmap=cc.fire, width=900, height=480)
map_tiles * taxi_trips
You can select the “Wheel Zoom” tool on the right and then pan and zoom using the scroll wheel. As long as you have a network connection, the map tiles will update as you zoom, but the datashaded image will only update if you have a live Python process running. If you do have Python “live”, each time you zoom in, the data will be re-aggregated at the new zoom level, converted to an image, and displayed as an overlay on the map, making it simple to explore and understand the data.
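The effect of re-aggregating on zoom can be pictured with a small pure-Python sketch. This is not Datashader's implementation: the `aggregate` and `cluster` helpers and the two point clusters are hypothetical, chosen so that both clusters share a single pixel in the full view but separate once the same pixel grid is applied to a narrower range.

```python
import random
from collections import Counter

def aggregate(points, x_range, y_range, width, height):
    """Count points per (row, col) pixel for the given viewport."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        # Points outside the current viewport are ignored
        if x0 <= x < x1 and y0 <= y < y1:
            col = min(int((x - x0) / (x1 - x0) * width), width - 1)
            row = min(int((y - y0) / (y1 - y0) * height), height - 1)
            counts[(row, col)] += 1
    return counts

def cluster(cx, cy, n=1000, spread=0.002):
    """Hypothetical helper: n points scattered tightly around (cx, cy)."""
    return [(cx + random.uniform(-spread, spread),
             cy + random.uniform(-spread, spread)) for _ in range(n)]

random.seed(0)
points = cluster(0.505, 0.505) + cluster(0.525, 0.505)

# Same 10x10 pixel grid, two different viewports
full_view = aggregate(points, (0.0, 1.0), (0.0, 1.0), 10, 10)
zoomed_view = aggregate(points, (0.45, 0.55), (0.45, 0.55), 10, 10)

print(len(full_view))    # 1 occupied pixel: the clusters are indistinguishable
print(len(zoomed_view))  # 2 occupied pixels: the clusters are resolved
```

Because the pixel grid stays the same size while the viewport shrinks, each pixel covers a smaller region of data space, which is why detail that was merged at the coarse level becomes visible after zooming.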
At the most basic level, Datashader can accept scatterplot points (as above), line segments (for time series and trajectories), areas (for filled-area plots), polygons (for choropleths), or gridded data (rasters, quadmeshes, and trimeshes to be regridded), and can turn each of these into a regularly sampled array or the corresponding pixel-based image. The rest of this getting-started guide shows how to go from your data to either images or interactive plots, as simply as possible. The next getting-started section breaks down each of the steps taken by Datashader, using a synthetic dataset so that you can see precisely how the data relates to the images. The user guide then explains each of the steps in much more detail.