Datashader

Turns even the largest data into images, accurately.

Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way. Using highly optimized rendering routines written in Python but compiled to machine code using Numba, datashader makes it practical to work with extremely large datasets even on standard hardware.

To make it concrete, here’s an example of what datashader code looks like:

>>> import datashader as ds
>>> import datashader.transfer_functions as tf
>>> import pandas as pd
>>> df = pd.read_csv('user_data.csv')

>>> cvs = ds.Canvas(plot_width=400, plot_height=400)
>>> agg = cvs.points(df, 'x_col', 'y_col', ds.mean('z_col'))
>>> img = tf.shade(agg, cmap=['lightblue', 'darkblue'], how='log')

This code reads a data file into a Pandas dataframe df, and then projects the fields x_col and y_col onto the x and y dimensions of a 400x400 grid, aggregating the datapoints that fall into each grid cell by the mean of their z_col values. The results are rendered into an image where the minimum aggregated value is plotted in lightblue, the maximum in darkblue, and values in between are scaled logarithmically.
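To make the aggregation and shading steps described above more tangible, here is a minimal plain-Python sketch of what they compute. It is an illustration only, not datashader's implementation (which is optimized and compiled with Numba); the function names mean_aggregate and log_scale, the tiny 2x2 grid, and the sample points are all hypothetical.

```python
import math


def mean_aggregate(points, x_range, y_range, width, height):
    """Bin (x, y, z) points onto a width x height grid; return the mean z per cell.

    Cells that receive no points hold None. Illustrative sketch only.
    """
    (x0, x1), (y0, y1) = x_range, y_range
    sums = [[0.0] * width for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for x, y, z in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points that fall outside the canvas
        # Map data coordinates to integer cell indices, clamping the upper edge.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        sums[row][col] += z
        counts[row][col] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else None
             for c in range(width)] for r in range(height)]


def log_scale(agg):
    """Map each non-empty cell to the interval [0, 1] on a logarithmic scale."""
    values = [v for row in agg for v in row if v is not None]
    lo, hi = min(values), max(values)
    span = math.log1p(hi - lo)
    return [[None if v is None else (math.log1p(v - lo) / span if span else 0.0)
             for v in row] for row in agg]


# Hypothetical sample data: (x, y, z) triples.
points = [(0.1, 0.1, 2.0), (0.2, 0.2, 4.0), (0.9, 0.9, 10.0)]
agg = mean_aggregate(points, (0.0, 1.0), (0.0, 1.0), width=2, height=2)
shade = log_scale(agg)
```

Here the first two points land in the same cell, so that cell holds their mean (3.0), and the third point gets a cell of its own (10.0). After scaling, a value of 0.0 would correspond to the first color of the colormap and 1.0 to the last, with intermediate values interpolated between them.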

And here are some sample outputs for data from the 2010 US census, each produced with similar code:

[Images: usa_census.jpg, nyc_races.jpg]

Documentation for datashader is primarily provided in the form of Jupyter notebooks. To understand which plotting problems datashader helps you avoid, you can start with our Plotting Pitfalls notebook. To see the steps in the datashader pipeline in detail, you can start with our Pipeline notebook. Or you may want to start with detailed case studies of datashader in action, such as our NYC Taxi, US Census, and OpenSky notebooks. In most cases, the easiest way to use Datashader is via the high-level HoloViews package, which lets you flexibly switch between Datashader and non-Datashader plots generated by Matplotlib or Bokeh. Additional notebooks showing how to use datashader for other applications or data types are viewable on Anaconda Cloud and can be downloaded in runnable form as described on the datashader examples page.

Other resources

You can watch a short talk about datashader on YouTube: Datashader: Revealing the Structure of Genuinely Big Data. The video and slides from a one-hour talk introducing Datashader, Visualizing Billions of Points of Data (February 2016), are also available, but they do not cover more recent extensions to the library.

Some of the original ideas for datashader were developed under the name Abstract Rendering, which is described in a 2014 SPIE VDA paper.

The source code for datashader is maintained at our GitHub site, and is documented using the API link on this page.

We recommend the Getting Started Guide to learn the basic concepts and start using Datashader as quickly as possible.

The User Guide covers specific topics in more detail.

The API is the definitive guide to each part of Datashader, but the same information is available more conveniently via the help() command as needed when using each component.

Please feel free to report issues or contribute code. You are also welcome to chat with the developers on Gitter.