Timeseries#

In many domains it is common to plot scalar values as a function of time (or other single dimensions). As long as the total number of datapoints is relatively low (in the tens of thousands, perhaps) and there are only a few separate curves involved, most plotting packages will do well. However, for longer or more frequent sampling, you will typically need to subsample your data before plotting, potentially missing important peaks or troughs. And even just a few timeseries visualized together quickly run into highly misleading overplotting issues, with the most recently plotted curves becoming unduly prominent.

For applications with many datapoints or when visualizing multiple curves, datashader provides a principled way to view all of your data. In this example, we will synthesize several time-series curves so that we know their properties, and then show how datashader can reveal them.

import datetime
import pandas as pd
import numpy as np
import xarray as xr
import datashader as ds
import datashader.transfer_functions as tf

Create some fake timeseries data#

Here we create a fake time series signal, then generate many noisy versions of it. We will also add a couple of “rogue” lines, with different statistical properties, and see how well those stand out from the rest.

# Constants
np.random.seed(2)
n = 100000                               # Number of points
cols = list('abcdefg')                   # Column names of samples
start = datetime.datetime(2010, 10, 1, 0)   # Start time

# Generate a fake signal
signal = np.random.normal(0, 0.3, size=n).cumsum() + 50

# Generate many noisy samples from the signal
noise = lambda var, bias, n: np.random.normal(bias, var, n)
data = {c: signal + noise(1, 10*(np.random.random() - 0.5), n) for c in cols}

# Add some "rogue lines" that differ from the rest
cols += ['x'] ; data['x'] = signal + np.random.normal(0, 0.02, size=n).cumsum() # Gradually diverges
cols += ['y'] ; data['y'] = signal + noise(1, 20*(np.random.random() - 0.5), n) # Much noisier
cols += ['z'] ; data['z'] = signal # No noise at all

# Pick a few samples from the first line and really blow them out
locs = np.random.choice(n, 10)
data['a'][locs] *= 2

# Create a dataframe
data['Time'] = [start + datetime.timedelta(minutes=1)*i for i in range(n)]

df = pd.DataFrame(data)
df.tail()
a b c d e f g x y z Time
99995 -43.266888 -51.619471 -42.867714 -45.723507 -47.229012 -44.246641 -46.509765 -54.598808 -44.922483 -48.523868 2010-12-09 10:35:00
99996 -42.733578 -51.773237 -42.984615 -45.269990 -45.537120 -45.258385 -44.763207 -54.623914 -45.116631 -48.523325 2010-12-09 10:36:00
99997 -42.614768 -51.268256 -42.201864 -45.497689 -46.426903 -45.170198 -45.393405 -53.837847 -44.981804 -47.741589 2010-12-09 10:37:00
99998 -44.684017 -50.131985 -42.204521 -45.856321 -48.606589 -45.880804 -46.653791 -53.910759 -43.639239 -47.806325 2010-12-09 10:38:00
99999 -46.027779 -50.496361 -42.615223 -45.025591 -46.462669 -43.848973 -47.319626 -54.216269 -45.546920 -48.076312 2010-12-09 10:39:00

The native datashader API illustrated here does not support datetimes directly, so we create a new column by casting the datetimes to integers:

df['ITime'] = pd.to_datetime(df['Time']).astype('int64')

Now we can compute the x- and y-ranges:

# Default plot ranges:
x_range = (df.iloc[0].ITime, df.iloc[-1].ITime)
y_range = (1.2*signal.min(), 1.2*signal.max())

print("x_range: {0} y_range: {1}".format(x_range,y_range))
x_range: (np.int64(1285891200000000000), np.int64(1291891140000000000)) y_range: (np.float64(-104.8094774985483), np.float64(100.09816883332905))

Plotting all the datapoints#

With datashader, we can plot all the datapoints for a given timeseries. Let’s select the first curve ‘a’ and draw it into an aggregate grid, connecting each datapoint in the series:

%%time
cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=300, plot_width=900)
aggs = {c: cvs.line(df, 'ITime', c) for c in cols}
img = tf.shade(aggs['a'])
CPU times: user 9.33 s, sys: 63.8 ms, total: 9.39 s
Wall time: 9.39 s
img

The result looks similar to what you might find in any plotting program, but it uses all 100,000 datapoints, and would work similarly for 1, 10, or 100 million points (controlled by the value of n above).

Why is using all the datapoints important? To see, let’s downsample the data by a factor of 10, plotting 10,000 datapoints for the same curve:

mask = (df.index % 10) == 0
tf.shade(cvs.line(df[mask][['a','ITime']], 'ITime', 'a'))

The resulting plot is similar, but now none of the “blown up” datapoints (sharp spikes) that were clearly visible in the first version are visible at all! They didn't happen to be among the sampled points and so do not show up in this plot, a problem that never arises when datashader renders every datapoint.
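
We can confirm that this is purely a sampling artifact by counting how many of the spiked locations survive the downsampling. This is a quick check reusing the locs and mask variables defined in the cells above:

# Count how many of the 10 artificially spiked points fall on the retained (every 10th) samples
print(mask[locs].sum(), "of", len(locs), "spiked points remain after downsampling")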

Antialiasing#

With the default ds.count() aggregator, Canvas.line() uses a very simple representation for a line (strictly, a polyline, i.e., a connected set of line segments). This representation has a clear mathematical interpretation: each pixel's count is incremented by 1 when a line segment passes through that pixel. Having such a clear interpretation is useful when summing large numbers of curves, as illustrated in later sections below. However, the result has sharp transitions between pixels, giving it a blocky appearance. If you prefer a smoother rendering where the count is spread across multiple pixels, or if you simply want a thicker line for visibility, you can pass a line_width greater than zero, and Datashader will draw a (potentially thicker) line and antialias it with intermediate values:

cvs2 = ds.Canvas(x_range=(12879023 * 1E11, 12879070 * 1E11),
                 y_range=(37, 50), plot_height=200, plot_width=500)

w0   = tf.shade(cvs2.line(df, 'ITime', 'a', line_width=0), name="line_width 0")
w1   = tf.shade(cvs2.line(df, 'ITime', 'a', line_width=1), name="line_width 1")
w5   = tf.shade(cvs2.line(df, 'ITime', 'a', line_width=5), name="line_width 5")

tf.Images(w0, w1, w5).cols(1)
(Output: three images labeled line_width 0, line_width 1, and line_width 5.)

Antialiasing and large line widths do slow down the rendering and can make large collections of line segments harder to interpret, so they are not enabled by default.
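
The difference is also visible in the aggregate values themselves. With line_width=0, the default ds.count() aggregator produces whole-number counts, whereas the antialiased version spreads fractional, floating-point values across neighbouring pixels. A quick check, reusing cvs2 from above:

agg_w0 = cvs2.line(df, 'ITime', 'a', line_width=0)   # plain per-pixel counts
agg_w1 = cvs2.line(df, 'ITime', 'a', line_width=1)   # antialiased, fractional values
print(agg_w0.dtype, agg_w1.dtype)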

Overplotting problems#

What happens if we overlay multiple curves? In a traditional plotting program, there would be serious issues with overplotting, because these curves are highly overlapping. To show what would typically happen, let’s brute-force merge the images corresponding to each of the curves:

renamed = [aggs[key].rename({key: 'value'}) for key in aggs]
merged = xr.concat(renamed, 'cols')
tf.shade(merged.any(dim='cols'))

The any operator merges all the data such that any pixel that is lit up for any curve is lit up in the final result. Clearly, it is difficult to see any structure in this fully overplotted data; all we can see is the envelope of these curves, i.e., the minimum and maximum values across the curves at each point in time. It remains completely unclear how the various curves in the set relate to each other, or even how many curves are being plotted here. We already know that we put in one particularly noisy curve, which presumably determines the envelope, but there's no way to tell that from the plot.
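
One way to preserve more information is to sum the aggregates instead of taking any, so that pixels crossed by many curves become brighter than pixels crossed by just one. A minimal sketch using the merged aggregate from above (the resulting image, called total here, is reused in the highlighting section below):

total = tf.shade(merged.sum(dim='cols'), how='linear')
total

Summing still cannot separate the individual curves, but it does show where most of the data is concentrated.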

We can of course try giving each curve a different color:

colors = ["red", "grey", "black", "purple", "pink",
          "yellow", "brown", "green", "orange", "blue"]
imgs = [tf.shade(aggs[i], cmap=[c]) for i, c in zip(cols, colors)]
tf.stack(*imgs)

But that doesn’t help much; there are 10 curves, but only three or four colors are visible, due to overplotting. Problems like that will just get much worse if there are 100, 1000, or 1 million curves. Moreover, the results will look entirely different if we plot them in the opposite order:

tf.stack(*reversed(imgs))

Having the visualization look completely different for an arbitrary choice like the plotting order is a serious problem if we want to understand the properties of this data, from the data itself.

Highlighting specific curves#

The data set also includes a couple of traces that are difficult to detect in the .sum() plot above, one with no noise and one with much higher noise. One way to detect such issues is to highlight each of the curves in turn, and display it in relation to the datashaded total of all the curves. For instance, those two curves (each highlighted in red below) stand out against the pack as having less or more noise than is typical:

tf.stack(total, tf.shade(aggs['z'], cmap=["lightblue", "red"]))
tf.stack(total, tf.shade(aggs['y'], cmap=["lightblue", "red"]))

Area plots#

As an alternative to plotting time-series data as a line, the same data can be plotted as a filled area plot. This approach emphasizes whether the time series has a positive or negative sign. Here is an example of a filled area plot for the ‘a’ curve:

cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=300, plot_width=900)
agg = cvs.area(df, x='ITime', y='a')
img = tf.shade(agg)
img

By specifying the y_stack argument, an area plot can also be used to display the difference between two curves. Here is an example of an area plot of the difference between the ‘a’ and ‘b’ curves:

cvs = ds.Canvas(x_range=x_range, y_range=y_range, plot_height=300, plot_width=900)
agg = cvs.area(df, x='ITime', y='a', y_stack='b')
img = tf.shade(agg)
img

Dynamic Plots#

In practice, it might be difficult to cycle through each of the curves to find one that’s different, as done above. Perhaps a criterion based on similarity could be devised, choosing the curve most dissimilar from the rest to plot in this way, which would be an interesting topic for future research. In any case, one thing that can be achieved with HoloViews is to make the plot fully interactive, with direct support for datetimes so that the viewer can zoom in and discover such patterns dynamically with correctly formatted axes.

import holoviews as hv
from holoviews.operation.datashader import datashade
from holoviews.streams import PlotSize
PlotSize.scale=2
hv.extension('bokeh')

If you are looking at a live, running version of this notebook, just enable the wheel zoom or box zoom tools, and then zoom and pan as you wish:

opts = hv.opts.RGB(width=600, height=300)
ndoverlay = hv.NdOverlay({c: hv.Curve((df['Time'], df[c]), kdims=['Time'], vdims=['Value']) for c in cols})
datashade(ndoverlay, cnorm='linear', aggregator=ds.count(), line_width=2).opts(opts)

Here the diverging “rogue line” is immediately apparent as a light blue region, and if you zoom in towards the right you can see precisely how it differs from the rest. The low-noise “rogue line” is much harder to see, but if you zoom in enough (particularly if you stretch out the x axis by zooming on the axis itself), you may be able to see that one line goes through the middle of the pack, with different properties from the rest. HoloViews should soon be adding an inspect_curve operation to make it easier to find out the identity of each line by hovering on it.

Plotting large numbers of time series together#

The examples above all used a small number of very long time series, which is one important use case for Datashader. Another important use case is visualizing very large numbers of time series, even if each individual curve is relatively short. If you have hundreds of thousands of timeseries, putting each one into a Pandas dataframe column and aggregating it individually will not be very efficient.

Luckily, Datashader can render arbitrarily many separate curves, limited only by what you can fit into a Dask dataframe (which in turn is limited only by your system's total disk storage). Instead of having a dataframe where each pair of columns (one for x, one for y) represents a curve, you can have a dataframe where each row represents a fixed-length curve. In this case, the x and y arguments should be set to lists of the labels of the columns that hold the curve coordinates, and the axis argument should be set to 1. If all of the lines share the same coordinates for one of the dimensions, then the corresponding argument (either x or y) can instead be a 1-dimensional NumPy array containing those coordinates.

In this way you can plot millions or billions of fixed-length curves efficiently.

As an example, let's generate a NumPy array containing 100,000 sequences of 10 steps each, where each sequence represents a 1-dimensional random walk starting at zero with steps drawn from the standard normal distribution (so each row holds 11 values, including the starting point):

n = 100000
points = 10
time = np.linspace(0, 1, points)
data = np.cumsum(np.c_[np.zeros((n, 1)), np.random.randn(n, points)], axis=1)
data.shape
(100000, 11)

We can create a Pandas dataframe from this NumPy array directly, where each row in the resulting dataframe represents an independent trial of the random walk:

walks = pd.DataFrame(data)
walks.head(15)
0 1 2 3 4 5 6 7 8 9 10
0 0.0 0.377222 -0.249735 -1.194367 -1.380634 -0.671281 0.040787 -0.156663 -0.759360 -1.081526 1.554707
1 0.0 -1.403119 -2.378427 -4.830572 -4.935300 -4.924859 -2.114318 -1.906418 -1.540874 -0.510023 2.564636
2 0.0 -1.352935 -0.608178 -0.667442 -0.502931 -1.786608 -2.136291 -4.458136 -4.872798 -3.712484 -4.120521
3 0.0 -0.716378 -0.244081 -0.951915 -2.497982 -2.557246 -3.183032 -3.065062 -4.547952 -5.034308 -7.086219
4 0.0 0.762461 -0.156942 -1.275631 -0.866515 -0.972335 -1.165401 -0.391618 -0.070743 -0.955226 -0.381761
5 0.0 -0.070304 -1.927411 -4.105959 -4.774451 -4.660079 -4.846771 -3.551492 -3.181119 -2.579117 -2.828520
6 0.0 0.860222 -0.533911 -0.910240 -0.202988 0.236928 -1.937484 -4.459690 -5.201604 -5.191556 -5.331570
7 0.0 -0.496851 1.005509 1.465258 0.580072 -0.483865 -0.071584 -1.954305 -2.183242 -0.257736 -0.202729
8 0.0 2.195455 1.946487 0.243154 2.269577 3.185899 4.204509 2.511518 1.103528 2.888355 3.317844
9 0.0 0.073215 -0.361441 -0.074260 -0.055021 0.451899 -0.522455 -1.334748 -1.011224 -2.124819 -2.311535
10 0.0 0.329212 0.594557 -0.532055 0.378673 0.745532 1.055768 1.243595 2.097591 3.091042 3.351349
11 0.0 -0.815413 -2.223627 -0.809400 -2.561691 -3.117988 -4.467533 -7.045296 -6.748950 -6.662267 -6.865935
12 0.0 0.358719 1.029473 0.944431 -0.047894 -1.121161 -0.529860 -1.632196 -1.512300 -1.607074 -1.750311
13 0.0 0.584170 1.392100 1.240867 2.669086 2.555953 2.512450 4.762607 2.489497 1.581741 0.497938
14 0.0 0.252747 1.392970 0.957619 0.077459 0.189394 -1.974626 -1.744716 -1.760675 -1.828831 -0.953880

To render these lines we set the x argument to time (the 1-dimensional NumPy array containing the shared x coordinates), the y argument to list(range(points)) (a list of the column labels of walks that contain the y coordinates of each line), and the axis argument to 1 (indicating that each row of the dataframe represents a line rather than each pair of columns).

%%time
cvs = ds.Canvas(plot_height=400, plot_width=1000)
agg = cvs.line(walks, x=time, y=list(range(points)), agg=ds.count(), axis=1, line_width=0)
img = tf.shade(agg, how='eq_hist')
img
CPU times: user 1.29 s, sys: 3.97 ms, total: 1.29 s
Wall time: 1.29 s

Here, each line represents an independent trial of this random walk process. All lines start from the same point at time 0 (all the way to the left). At each subsequent time step, each line moves upward or downward from its prior position by a distance drawn from a normal distribution. Thanks to the nonlinear eq-hist colorization, you can see the dispersion in the density of the overall distribution as time advances, at the same time as you can see the individual outliers at the extremes of the distribution. You’ll see a similar plot for 1,000,000 or 10,000,000 curves, and much more interesting plots if you have real data to show!
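
To see how much the eq-hist colorization is contributing here, you can compare it with a linear colormapping of the same aggregate, which tends to emphasize only the dense center of the distribution and hide the low-count outliers:

tf.shade(agg, how='linear')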
