What is Datashader?
Datashader turns even the largest datasets into images, faithfully preserving the data's distribution.
Datashader is an open-source Python 2 and 3 library for analyzing and visualizing large datasets. Specifically, Datashader is designed to "rasterize" or "aggregate" datasets into regular grids that can be viewed as images, making it simple and quick to see the properties and patterns of your data. Datashader can plot a billion points in a second or so on a 16GB laptop, and scales up easily to out-of-core, distributed, or GPU processing for even larger datasets.
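To make "rasterizing" or "aggregating" concrete, here is a minimal pure-Python sketch of the underlying idea: binning points into a regular grid of counts. It is an illustration only and not how Datashader is implemented; the function name `rasterize_counts` and the sample points are hypothetical.

```python
# Conceptual sketch only: bin 2D points into a regular grid of counts.
# This is NOT Datashader's implementation, just the idea behind aggregation.

def rasterize_counts(points, x_range, y_range, width, height):
    """Count how many (x, y) points fall into each cell of a height x width grid."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Skip points that fall outside the viewport
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue
        # Map each coordinate to a column/row index; clamp the upper edge
        # into the last cell so x == x_max does not index out of bounds
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid

# Hypothetical sample data: four points inside the unit square, one outside
points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (0.5, 0.5), (2.0, 2.0)]
grid = rasterize_counts(points, (0.0, 1.0), (0.0, 1.0), width=4, height=4)

for row in grid:
    print(row)
```

The output grid has a fixed size no matter how many points go in, which is why this kind of aggregation can be displayed as an image even for very large datasets.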
This page of the getting-started guide will give a simple example to show how it works, and the following page will show how to use Datashader as a standalone library for generating arrays or images directly (2-Pipeline). Next we'll show how to use Datashader as a component in a larger visualization system like HoloViews or Bokeh that provides interactive plots with dynamic zooming, labeled axes, and overlays and layouts (3-Interactivity). More detailed information about each topic is then provided in the User Guide.
Example: NYC taxi trips
To illustrate how this process works, we will demonstrate some of the key features of Datashader using a standard "big-data" example: millions of taxi trips from New York City, USA. First let's import the libraries we are going to use and then read the dataset.
import datashader as ds
import pandas as pd
from colorcet import fire
from datashader import transfer_functions as tf

df = pd.read_csv('../data/nyc_taxi.csv', usecols=['dropoff_x', 'dropoff_y'])
df.head()
The dataset has a variety of columns with data about each of the 10 million taxi trips, such as the locations in Web Mercator coordinates, the distance, etc.; here we load only the dropoff locations. With Datashader, we can choose what we want to plot on the x and y axes and see the full data immediately, with no parameter tweaking, magic numbers, subsampling, or approximation, up to the resolution of the display:
agg = ds.Canvas().points(df, 'dropoff_x', 'dropoff_y')
tf.set_background(tf.shade(agg, cmap=fire), "black")
Here you can immediately see that the data points are aligned to a street grid, that some areas have much more traffic than others, and that the quality of the signal varies spatially (with some areas having blurry patterns that indicate GPS errors, perhaps due to tall buildings). Getting a plot like this with other approaches would take quite a bit of time and effort, but with Datashader it appears in milliseconds without trial and error.
The output above is just a bare image, which is all that Datashader knows how to generate directly. But Datashader can integrate closely with Bokeh, HoloViews, and GeoViews, which makes it simple to allow interactive zooming, axis labeling, overlays and layouts, and complex web apps. For example, making a zoomable interactive overlay on a geographic map requires just a few more lines of code:
import holoviews as hv
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import datashade
hv.extension('bokeh')

map_tiles = EsriImagery().opts(alpha=0.5, width=900, height=480, bgcolor='black')
points = hv.Points(df, ['dropoff_x', 'dropoff_y'])
taxi_trips = datashade(points, x_sampling=1, y_sampling=1, cmap=fire, width=900, height=480)
map_tiles * taxi_trips