
Visualizing datasets larger than memory using Datashader with Dask

Datashading a 2.7-billion-point Open Street Map database

Most Datashader examples use "medium-sized" datasets, because they need to be small enough to be distributed over the internet without racking up huge bandwidth charges for the project maintainers. Even though these datasets can be relatively large (such as the 1-billion-point Open Street Map example), they still fit into memory on a 16GB laptop.

Because Datashader supports Dask dataframes, it also works well with truly large datasets, much bigger than will fit in any one machine's physical memory. On a single machine, Dask will automatically and efficiently page in the data as needed, and you can also easily distribute the data and computation across multiple machines. Here we illustrate how to work "out of core" on a single machine using a 22GB OSM dataset containing 2.7 billion points.
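(Scaling out to multiple machines is not shown in this notebook, but in practice it usually just means connecting a dask.distributed Client before running the same code. A minimal sketch, where the scheduler address is hypothetical and Client() with no arguments would instead start a local cluster:

# Optional: run the same workflow on a cluster instead of a single machine.
# Sketch only; 'tcp://scheduler-address:8786' is a hypothetical address.
from dask.distributed import Client

client = Client('tcp://scheduler-address:8786')  # or simply Client() for a local cluster
)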

The data is taken from Open Street Map's (OSM) bulk GPS point data, and is unfortunately too large to distribute with Datashader (7GB compressed). The data was collected by OSM contributors' GPS devices, and was provided as a CSV file of latitude, longitude coordinates. The data was downloaded from their website, extracted, converted to positions in Web Mercator format using datashader.utils.lnglat_to_meters(), and then stored in a Parquet file for faster disk access. To run this notebook, you would need to do the same process yourself to obtain osm.snappy.parq. Once you have it, you can follow the steps below to load and plot the data.
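For reference, that preprocessing might look roughly like the sketch below. The CSV filename and its column names are hypothetical (adjust them to match the actual download), and whether lnglat_to_meters can be applied directly to Dask columns as shown, rather than per pandas chunk, depends on the Datashader version:

# Sketch of the preprocessing described above (not part of this notebook).
# 'gps-points.csv' and its 'longitude'/'latitude' column names are hypothetical.
import dask.dataframe as dd
from datashader.utils import lnglat_to_meters

csv = dd.read_csv('gps-points.csv')                                       # lazy, out-of-core read
csv['x'], csv['y'] = lnglat_to_meters(csv['longitude'], csv['latitude'])  # convert to Web Mercator
csv[['x', 'y']].to_parquet('../data/osm.snappy.parq', compression='snappy')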

In [1]:
import dask.dataframe as dd
import dask.diagnostics as diag
import datashader as ds
import datashader.transfer_functions as tf
In [2]:
df = dd.read_parquet('../data/osm.snappy.parq')
df.head()
Out[2]:
            x            y
0  15028218.0  -19219556.0
1  12836614.0  -19219556.0
2  11584269.0  -19219556.0
3  11271185.0  -19219556.0
4  11271180.0  -19219556.0

Aggregation

First, we create a canvas to provide pixel-shaped bins in which points can be aggregated, and then aggregate the data to produce a fixed-size aggregate array. This process may take up to a minute, so we provide a progress bar using Dask:

In [3]:
bound = 20026376.39
bounds = dict(x_range = (-bound, bound), y_range = (int(-bound*0.4), int(bound*0.6)))
plot_width = 1000
plot_height = int(plot_width*0.5)

cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, **bounds)

with diag.ProgressBar(), diag.Profiler() as prof, diag.ResourceProfiler(0.5) as rprof:
    agg = cvs.points(df, 'x', 'y', ds.count())
[########################################] | 100% Completed |  1.7s
[########################################] | 100% Completed | 29.3s

We can now visualize this data very quickly, ignoring low-count noise as described in the 1-billion-point OSM version:

In [4]:
tf.shade(agg.where(agg > 15), cmap=["lightblue", "darkblue"])
Out[4]:
[Image: datashaded OSM GPS points with counts above 15, shaded from light blue to dark blue]
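If you want to keep a copy of the rendered image on disk rather than only displaying it inline, Datashader provides an export helper. A minimal sketch, where the output filename is arbitrary:

# Save the shaded image as a PNG (sketch only; the filename is arbitrary).
from datashader.utils import export_image

img = tf.shade(agg.where(agg > 15), cmap=["lightblue", "darkblue"])
export_image(img, 'osm_gps_counts')  # writes osm_gps_counts.png to the current directory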
Performance Profile

Dask offers some tools to visualize how memory and processing power are being used during these calculations:

In [5]:
from bokeh.io import output_notebook
from bokeh.resources import CDN
output_notebook(CDN, hide_banner=True)
In [6]:
diag.visualize([prof, rprof])
Out[6]:
[Bokeh plot: Dask Profiler (task timeline) and ResourceProfiler (CPU and memory usage over time)]

Performance notes:

  • On a 16GB machine, most of the time is spent reading the data from disk (the purple rectangles).
  • Reading time includes not just disk I/O but also decompressing the chunks of data.
  • The disk reads don't release the Global Interpreter Lock (GIL), and so CPU usage (see second chart above) drops to only one core during those periods.
  • During the aggregation steps (the green rectangles), CPU usage on this machine with 8 hyperthreaded cores (4 full cores) spikes to nearly 800%, because the aggregation function is implemented in parallel.
  • The data takes up 22 GB uncompressed, but only a peak of around 6 GB of physical memory is ever used because the data is paged in as needed.
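The profiling records can also be inspected programmatically. As a rough sketch (not part of the original notebook), the Profiler collected above exposes a results list of per-task records, documented for dask.diagnostics as (key, task, start_time, end_time, worker_id) tuples, which can be summed by task type:

# Sketch: total time per task type from the Profiler records gathered above.
# Times are summed across threads, so totals can exceed wall-clock time.
from collections import defaultdict

totals = defaultdict(float)
for r in prof.results:
    key = r.key[0] if isinstance(r.key, tuple) else r.key  # e.g. ('read-parquet-<hash>', 3)
    name = key.rsplit('-', 1)[0]                           # strip the trailing hash
    totals[name] += r.end_time - r.start_time

for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(name, round(seconds, 1), 'sec')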
