Performance

Datashader is designed to make it simple to work with even very large datasets. To get good performance, it is essential that each step in the overall processing pipeline be set up appropriately. Below we share some suggestions based on our own benchmarking and optimization experience, which should help you obtain good performance in your own work.

File formats

Based on our testing with various file formats, we recommend storing any large columnar datasets in the Apache Parquet format when possible, using the fastparquet library with "Snappy" compression:

>>> import dask.dataframe as dd
>>> # df is a Dask DataFrame; filename is the path for the Parquet output
>>> dd.to_parquet(df, filename, compression="snappy", engine="fastparquet")
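
When you later need the data, the same file can be read back lazily as a Dask DataFrame; this is a minimal sketch assuming the same filename and the fastparquet engine recommended above:

>>> ddf = dd.read_parquet(filename, engine="fastparquet")  # partitions are loaded lazily, on demand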

If your data includes categorical values that take on a limited, fixed number of possible values (e.g. "Male", "Female"), converting those columns to the Pandas 'category' dtype gives a more memory-efficient representation that is optimized for common operations such as sorting and finding unique values, and Parquet preserves this dtype when saving. Before saving, just convert the column as follows:

>>> df[colname] = df[colname].astype('category')

By default, numerical datasets typically use 64-bit floats, but many applications do not require 64-bit precision when aggregating over a very large number of datapoints to show a distribution. Using 32-bit floats cuts storage and memory requirements in half, and typically also speeds up computations substantially, because only half as much data needs to be read from memory. If that precision is sufficient for your particular situation, just convert the data type before generating the file:

>>> import numpy
>>> df[colname] = df[colname].astype(numpy.float32)
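
If you want to confirm the savings, Pandas can report a column's memory footprint before and after conversion; this quick check is optional and assumes df is an in-memory Pandas DataFrame:

>>> df[colname].astype(numpy.float64).memory_usage()  # bytes used at 64-bit precision
>>> df[colname].astype(numpy.float32).memory_usage()  # roughly half as many bytes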

Data libraries

Datashader performance on a given system will vary significantly depending on the library used to represent the data. All glyph types support data in a Pandas DataFrame, but Pandas will generally be the lowest-performance option: it does not always make use of the multiple processing cores available on a given CPU, and it does not support distributed computation across multiple CPUs, out-of-core operation for larger-than-memory datasets, or GPU computation. Where supported, you can get much higher performance using another data library:

  • Pandas: supports every glyph type (points, lines, areas, trimesh, raster)
  • Dask: supports points, lines, and areas, using distributed, multi-core, and/or out of core computation
  • cuDF: supports points, lines, and areas, with computation on single NVIDIA GPUs
  • Dask-cuDF: supports points, lines, and areas, with computation on multiple NVIDIA GPUs

In each case, simply pass a DataFrame of the indicated type to a Datashader call, just as you would do with a Pandas DataFrame.
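
For example, if you have an NVIDIA GPU available, a cuDF DataFrame can be passed in directly. The following is a minimal sketch, assuming a CUDA-capable GPU, a Parquet file as above, and hypothetical column names 'x' and 'y':

>>> import cudf
>>> import datashader
>>> gdf = cudf.read_parquet(filename)  # DataFrame held in GPU memory
>>> cvs = datashader.Canvas(plot_width=800, plot_height=600)
>>> agg = cvs.points(gdf, 'x', 'y')  # same call as with a Pandas DataFrame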

Multi-core operation using Dask

Even on a single machine, a Dask DataFrame typically gives higher performance than Pandas, because it makes good use of all available cores, and it also supports out-of-core operation for datasets larger than memory.

Dask works on chunks of the data, called partitions, rather than on the whole dataset at once. With Dask on a single machine, a rule of thumb for the number of partitions to use is multiprocessing.cpu_count(), which allows Dask to use one thread per core for parallelizing computations.

When the entire dataset fits into memory at once, you can (and should) persist the data as a Dask DataFrame prior to passing it into Datashader, to ensure that the data only needs to be loaded once:

>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> import datashader
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df = dask_df.persist()  # persist() returns a new DataFrame; keep the result
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
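
If the dataset is instead larger than memory, Dask's out-of-core support lets you skip the persist step and read partitions from Parquet on demand; this is a sketch assuming the file from the File formats section and hypothetical 'x' and 'y' columns:

>>> dask_df = dd.read_parquet(filename)  # lazy; partitions are read from disk as needed
>>> agg = cvs.points(dask_df, 'x', 'y')  # aggregation proceeds partition by partition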
