Plotting Pitfalls

Common plotting pitfalls that get worse with large data

When working with large datasets, visualizations are often the only way available to understand the properties of that dataset -- there are simply too many data points to examine each one! Thus it is very important to be aware of some common plotting problems that are minor inconveniences with small datasets but very serious problems with larger ones.

We'll cover:

  1. Overplotting
  2. Oversaturation
  3. Undersampling
  4. Undersaturation
  5. Underutilized range
  6. Nonuniform colormapping

You can skip to the end if you just want to see an illustration of these problems.

This notebook requires HoloViews, colorcet, and matplotlib, and optionally scikit-image, which can be installed with:

conda install holoviews colorcet matplotlib scikit-image

We'll first load the plotting libraries and set up some defaults:

In [1]:
import numpy as np

import holoviews as hv
from holoviews.operation.datashader import datashade
from holoviews import opts, dim

from colorcet import fire
In [2]:
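# Select the matplotlib backend; the style options below (s, fig_size, edgecolor) are matplotlib-specific
hv.extension("matplotlib")

opts.defaults(
    opts.Image(cmap="gray_r", axiswise=True),
    opts.Points(cmap="bwr", edgecolors='k', s=50, alpha=1.0), # Remove color_index=2
    opts.RGB(bgcolor="black", show_grid=False),
    opts.Scatter3D(color=dim('c'), fig_size=250, cmap='bwr', edgecolor='k', s=50, alpha=1.0)) #color_index=3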

1. Overplotting

Let's consider plotting some 2D data points that come from two separate categories, here plotted as blue and red in A and B below. When the two categories are overlaid, the appearance of the result can be very different depending on which one is plotted first:

In [3]:
def blue_points(offset=0.5,pts=300):
    blues = (np.random.normal( offset,size=pts), np.random.normal( offset,size=pts), -1 * np.ones((pts)))
    return hv.Points(blues, vdims=['c']).opts(color=dim('c'))
def red_points(offset=0.5,pts=300):
    reds  = (np.random.normal(-offset,size=pts), np.random.normal(-offset,size=pts),  1*np.ones((pts)))
    return hv.Points(reds, vdims=['c']).opts(color=dim('c'))

blues, reds = blue_points(), red_points()
blues + reds + (reds * blues) + (blues * reds)

Plots C and D show the same distribution of points, yet they give very different impressions of which category is more common, which can lead to incorrect decisions based on this data. Of course, both are equally common in this case, so neither C nor D accurately reflects the data. The cause of this problem is simply occlusion:

In [4]:
hmap = hv.HoloMap({0:blues,0.000001:reds,1:blues,2:reds}, kdims=['level'])
hv.Scatter3D(hmap.table(), kdims=['x','y','level'], vdims=['c'])

Occlusion of data by other data is called overplotting or overdrawing, and it occurs whenever a datapoint or curve is plotted on top of another datapoint or curve, obscuring it. It's thus a problem not just for scatterplots, as here, but for curve plots, 3D surface plots, 3D bar graphs, and any other plot type where data can be obscured.
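The effect of draw order can also be checked numerically, without any plotting library. The following is a minimal sketch, not part of the original notebook: it paints two hypothetical categories onto a small integer canvas where each later point overwrites whatever is underneath, then counts how many pixels of each category remain visible. The canvas size, dot radius, and helper names (`paint`, `visible_counts`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = 300
blue_xy = rng.normal(0.5, 1.0, size=(pts, 2))   # same offsets as blue_points()
red_xy = rng.normal(-0.5, 1.0, size=(pts, 2))   # same offsets as red_points()

BLUE, RED = 1, 2  # arbitrary integer labels; 0 means "empty pixel"

def paint(canvas, xy, label, lo=-4.0, hi=4.0, radius=1):
    """Paint each point as a small square; later paints overwrite earlier ones."""
    n = canvas.shape[0]
    idx = np.clip(((xy - lo) / (hi - lo) * n).astype(int), 0, n - 1)
    for i, j in idx:
        canvas[max(i - radius, 0):i + radius + 1,
               max(j - radius, 0):j + radius + 1] = label
    return canvas

def visible_counts(first, second, size=64):
    """Draw `first` and then `second`; return visible pixel counts per label."""
    canvas = np.zeros((size, size), dtype=int)
    paint(canvas, first[0], first[1])
    paint(canvas, second[0], second[1])
    return {label: int((canvas == label).sum()) for label in (first[1], second[1])}

red_on_top = visible_counts((blue_xy, BLUE), (red_xy, RED))
blue_on_top = visible_counts((red_xy, RED), (blue_xy, BLUE))
print("red drawn last: ", red_on_top)
print("blue drawn last:", blue_on_top)
```

Whichever category is drawn last keeps every pixel it touches, while the other keeps only the pixels that were not covered, so the visible balance between the two shifts with draw order even though the underlying points are identical.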

2. Oversaturation

You can reduce problems with overplotting by using transparency/opacity, via the alpha parameter provided to control opacity in most plotting programs. E.g. if alpha is 0.1, full color saturation will be achieved only when 10 points overlap, reducing the effects of plot ordering but making it harder to see individual points:

In [5]:
layout = blues + reds + (reds * blues) + (blues * reds)
layout.opts(opts.Points(s=50, alpha=0.1))

Here C and D look very similar (as they should, since the distributions are identical), but there are still a few locations with oversaturation, a problem that will occur when more than 10 points overlap. In this example the oversaturated points are located near the middle of the plot, but the only way to know whether they are there would be to plot both versions and compare, or to examine the pixel values to see if any have reached full saturation (a necessary but not sufficient condition for oversaturation). Locations where saturation has been reached have problems similar to overplotting, because only the last 10 points plotted will affect the final color (for alpha of 0.1).
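To make the saturation arithmetic concrete, here is a small sketch (not from the original notebook) that computes the resulting opacity for a given number of overlapping points under two models. The additive model matches the rule of thumb used in the text, where each point contributes alpha until the total reaches 1. The second model is standard "source-over" alpha blending, which is what many renderers use; there, opacity approaches 1 gradually rather than hitting it at exactly 1/alpha points. Both function names are hypothetical.

```python
import numpy as np

def additive_opacity(n, alpha=0.1):
    """Opacity if each overlapping point simply adds alpha, clipped at 1."""
    return np.minimum(1.0, n * alpha)

def source_over_opacity(n, alpha=0.1):
    """Opacity after blending n layers with the standard 'over' operator."""
    return 1.0 - (1.0 - alpha) ** n

counts = np.array([1, 5, 10, 20, 2000])
additive = additive_opacity(counts)
source_over = source_over_opacity(counts)

for n, a, s in zip(counts, additive, source_over):
    print(f"{int(n):5d} overlapping points: additive={a:.3f}, source-over={s:.3f}")
```

Under either model, the difference between 20 and 2000 overlapping points is small or invisible, which is the practical problem: beyond a few tens of points, the rendered color no longer tracks the underlying density.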

Worse, even if one has set the alpha value to approximately or usually avoid oversaturation, as in the plot above, the correct value depends on the dataset. If there are more points overlapping in that particular region, a manually adjusted alpha setting that worked well for a previous dataset will systematically misrepresent the new dataset:

In [6]:
blues, reds = blue_points(pts=600), red_points(pts=600)
layout = blues + reds + (reds * blues) + (blues * reds)
layout.opts(opts.Points(s=50, alpha=0.1))

Here C and D again look qualitatively different, yet still represent the same distributions. Since we're assuming that the point of the visualization is to reveal the underlying dataset, having to tune visualization parameters manually based on the properties of the dataset itself is a serious problem.
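One rough way to see this dataset dependence numerically is to bin the points and count how many bins hold at least 1/alpha points, the level at which the text's rule of thumb predicts saturation. The sketch below is an illustration under stated assumptions, not the notebook's method: the helper `saturated_bins` is hypothetical, and a fixed histogram bin is only a crude stand-in for the area covered by a dot.

```python
import numpy as np

def saturated_bins(x, y, alpha=0.1, bins=20, extent=(-4, 4)):
    """Number of histogram bins holding at least 1/alpha points."""
    counts, _, _ = np.histogram2d(x, y, bins=bins, range=[extent, extent])
    threshold = int(np.ceil(1.0 / alpha))
    return int((counts >= threshold).sum())

rng = np.random.default_rng(1)
blue = rng.normal(0.5, 1.0, size=(600, 2))
red = rng.normal(-0.5, 1.0, size=(600, 2))

# The small dataset is a subset of the large one, so per-bin counts can only grow.
small = np.vstack([blue[:300], red[:300]])
large = np.vstack([blue, red])

n_small = saturated_bins(small[:, 0], small[:, 1])
n_large = saturated_bins(large[:, 0], large[:, 1])
print(f"bins at or above 1/alpha: {n_small} (300 per category), {n_large} (600 per category)")
```

With the same alpha, adding points can only increase the number of saturated bins, so a setting tuned on the smaller dataset has no guarantee of remaining appropriate for the larger one.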

To make it even more complicated, the correct alpha also depends on the dot size, because smaller dots have less overlap for the same dataset. With smaller dots, C and D look more similar, but the color of the dots is now difficult to see in all cases because the dots are too transparent for this size:

In [7]:
layout = blues + reds + (reds * blues) + (blues * reds)
layout.opts(opts.Points(s=10, alpha=0.1, edgecolor=None))

As you can see, it is very difficult to find settings for the dot size and alpha parameters that correctly reveal the data, even for relatively small and obvious datasets like these. With larger datasets with unknown contents, it is difficult to detect that such problems are occurring, leading to false conclusions based on inappropriately visualized data.

3. Undersampling

With a single category instead of the multiple categories shown above, oversaturation simply obscures spatial differences in density. For instance, 10, 20, and 2000 single-category points overlapping will all look the same visually, for alpha=0.1. Let's again consider an example that has a sum of two normal distributions slightly offset from one another, but no longer using color to separate them into categories:

In [8]:
def gaussians(specs=[(1.5,0,1.0),(-1.5,0,1.0)],num=100):
    """
    A concatenated list of points taken from 2D Gaussian distributions.
    Each distribution is specified as a tuple (x,y,s), where x,y is the mean
    and s is the standard deviation.  Defaults to two horizontally
    offset unit-variance Gaussians.
    """
    dists = [(np.random.normal(x,s,num), np.random.normal(y,s,num)) for x,y,s in specs]
    return np.hstack([d[0] for d in dists]), np.hstack([d[1] for d in dists])

points = (hv.Points(gaussians(num=600),   label="600 points",   group="Small dots") +
          hv.Points(gaussians(num=60000), label="60000 points", group="Small dots") +
          hv.Points(gaussians(num=600),   label="600 points",   group="Tiny dots")  +
          hv.Points(gaussians(num=60000), label="60000 points", group="Tiny dots"))

points.opts(
    opts.Points('Small_dots', s=1, alpha=1),
    opts.Points('Tiny_dots', s=0.1, alpha=0.1))