census

2010 US Census data

The 2010 Census collected a variety of demographic information for all the more than 300 million people in the USA. Here we'll focus on the subset of the data selected by the Cooper Center, who produced a map of the population density and the racial makeup of the USA . Each dot in this map corresponds to a specific person counted in the census, located approximately at their residence. (To protect privacy, the precise locations have been randomized at the census block level, so that the racial category can only be determined to within a rough geographic precision.) The Cooper Center website delivers pre-rendered tiles, which are fast to view but limited to the specific plotting choices they made. Here we will show how to run novel analyses focusing on whatever aspects of the data that you select yourself, rendered dynamically as requested using the datashader library.

Load data and set up

First, let's load this data into a Dask dataframe. Dask is similar to Pandas , but with extra support for distributed or out-of-core (larger than memory) operation. If you have at least 16GB of RAM, you should be able to run this notebook-as is, using fast in-core operations. If you have less memory, you'll need to use the slower out-of-core operations by commenting out the .persist() call.

In [1]:
import datashader as ds
import datashader.transfer_functions as tf
import dask.dataframe as dd
import numpy as np
In [2]:
%%time
df = dd.io.parquet.read_parquet('../data/census.snappy.parq')
df = df.persist()
CPU times: user 17.8 s, sys: 5.17 s, total: 23 s
Wall time: 10.6 s
In [3]:
df.head()
Out[3]:
easting northing race
0 -13700737.0 6275190.0 w
1 -13700711.0 6275195.0 w
2 -13702081.0 6274898.5 w
3 -13701948.0 6274931.0 w
4 -13701793.0 6275088.5 w

There are 306675004 rows in this dataframe (one per person counted in the census), each with a location in Web Mercator format and a race encoded as a single character (where 'w' is white, 'b' is black, 'a' is Asian, 'h' is Hispanic, and 'o' is other (typically Native American)). (Try len(df) to see the size, if you want to check, though that forces the dataset to be loaded so it's skipped here.)

Let's define some geographic ranges to look at later, and also a default plot size. Feel free to increase plot_width to 2000 or more if you have a very large monitor or want to save big files to disk, which shouldn't greatly affect the processing time or memory requirements.

In [4]:
USA           = ((-124.72,  -66.95), (23.55, 50.06))
LakeMichigan  = (( -91.68,  -83.97), (40.75, 44.08))
Chicago       = (( -88.29,  -87.30), (41.57, 42.00))
Chinatown     = (( -87.67,  -87.63), (41.84, 41.86))
NewYorkCity   = (( -74.39,  -73.44), (40.51, 40.91))
LosAngeles    = ((-118.53, -117.81), (33.63, 33.96))
Houston       = (( -96.05,  -94.68), (29.45, 30.11))
Austin        = (( -97.91,  -97.52), (30.17, 30.37))
NewOrleans    = (( -90.37,  -89.89), (29.82, 30.05))
Atlanta       = (( -84.88,  -84.04), (33.45, 33.84))

from datashader.utils import lnglat_to_meters as webm
x_range,y_range = [list(r) for r in webm(*USA)]

plot_width  = int(900)
plot_height = int(plot_width*7.0/12)

Let's also choose a background color for our results. A black background makes bright colors more vivid, and works well when later adding relatively dark satellite image backgrounds, but white backgrounds ( background=None ) are good for examining the weakest patterns, and work well when overlaying on maps that use light colors. Try it both ways and decide for yourself!

In [5]:
background = "black"

We'll also need some utility functions and colormaps, and to make the page as big as possible:

In [6]:
from functools import partial
from datashader.utils import export_image
from datashader.colors import colormap_select, Greys9
from IPython.core.display import HTML, display

export = partial(export_image, background = background, export_path="export")
cm = partial(colormap_select, reverse=(background!="black"))

display(HTML("<style>.container { width:100% !important; }</style>"))

Population density

For our first examples, let's ignore the race data for now, focusing on population density alone (as for the nyc_taxi.ipynb example). We'll first aggregate all the points from the continental USA into a grid containing the population density per pixel:

In [7]:
%%time
cvs = ds.Canvas(plot_width, plot_height, *webm(*USA))
agg = cvs.points(df, 'easting', 'northing')
CPU times: user 2.09 s, sys: 241 ms, total: 2.33 s
Wall time: 1.09 s

Computing this aggregate grid will take some CPU power (5 seconds on a MacBook Pro), because datashader has to iterate through the entire dataset, with hundreds of millions of points. Once the agg array has been computed, subsequent processing will now be nearly instantaneous, because there are far fewer pixels on a screen than points in the original database.

If we now plot the aggregate grid linearly, we can clearly see...

In [8]:
export(tf.shade(agg, cmap = cm(Greys9), how='linear'),"census_gray_linear")
Out[8]:

...almost nothing. If you know what to look for, you can see hotspots (high population densities) in New York City, Los Angeles, Chicago, and a few other places. More hotspots can dimly be seen when using a white background than with a black, on most monitors, though they are very dim either way. In any case, for feeding 300 million points in, we're getting almost nothing back in terms of visualization.

The first thing we can do is to prevent undersampling (see plotting_pitfalls.ipynb). In the plot above, there is no way to distinguish between pixels that are part of the background, and those that have low but nonzero counts; both are mapped to black or nearly black on a linear scale. Instead, let's map all values that are not background to a dimly visible gray, leaving the highest-density values at white. I.e., let's discard the first 25% of the gray colormap, and linearly interpolate the population densities over the remaining range:

In [9]:
export(tf.shade(agg, cmap = cm(Greys9,0.25), how='linear'),"census_gray_linear")
Out[9]:

The above plot reveals at least that data has been measured only within the political boundaries of the continental United States, and also that many areas in the West are so poorly populated that many pixels contained not even a single person. (In datashader images, the background color is shown for pixels that have no data at all, using the alpha channel of a PNG image, while the colormap range is shown for pixels that do have data.) Some additional population centers are now visible, at least on some monitors. But mainly what the above plot indicates is that population in the USA is extremely non-uniformly distributed, with hotspots in a few regions, and nearly all other pixels having much, much lower (but nonzero) values. Again, that's not much information to be getting out out of 300 million datapoints!

The problem is that of the available intensity scale in this gray colormap, nearly all pixels are colored the same low-end gray value, with only a few urban areas using any other colors. Thus this version of the map conveys very little information as well. Because the data are clearly distributed so non-uniformly, let's instead try a nonlinear mapping from population counts into the colormap. A logarithmic mapping is often a good choice:

In [10]:
export(tf.shade(agg, cmap = cm(Greys9,0.2), how='log'),"census_gray_log")
Out[10]:

Suddenly, we can see an amazing amount of structure! There are clearly meaningful patterns at nearly every location, ranging from the geographic variations in the mountainous West, to the densely spaced urban centers in New England, and the many towns stretched out along roadsides in the midwest (especially those leading to Denver, the hot spot towards the right of the Rocky Mountains).

Clearly, we can now see much more of what's going on in this dataset, thanks to the logarithmic mapping. Worryingly, though, the choice of 'log' was purely arbitrary, basically just a hunch based on how many datasets behave. One could easily imagine that other nonlinear functions would show other interesting patterns, or that different functions would be needed for different datasets.

Instead of blindly searching through the space of all such functions, we can step back and notice that the main effect of the log transform has been to reveal local patterns at all population densities -- small towns show up clearly even if they are just slightly more dense than their immediate, rural neighbors, yet large cities with high population density also show up well against the surrounding suburban regions, even if those regions are more dense than the small towns on an absolute scale.

With this idea of showing relative differences across a large range of data values in mind, let's try the image-processing technique called histogram equalization. I.e., given a set of raw counts, map these into a range for display such that every available color on the screen represents about the same number of samples in the original dataset. The result is similar to that from the log transform, but is now non-parametric -- it will equalize any linearly or nonlinearly distributed integer data, regardless of the distribution:

In [11]:
export(tf.shade(agg, cmap = cm(Greys9,0.2), how='eq_hist'),"census_gray_eq_hist")
Out[11]: