A common issue when dealing with more than a few thousand data points is how to make effective scatterplots. There is a lot of research on this topic that I won’t go into, but in this post I’ll point out a few features of rbokeh that help you work with larger scatterplots: level of detail thresholds, WebGL, hexbins, and finally faceting.

Data

For the examples in this post we’ll use a set of data that I have stored in a github gist.

dat <- read.csv("https://gist.githubusercontent.com/hafen/d75cbb8b2b6047572d2d/raw/e365aa7b187caf42f81694b7f877b55e4584616f/data.R", stringsAsFactors = FALSE)

head(dat)
#            x          y f1 f2
# 1 -1.2070657 -1.8168975 d1 -1
# 2  0.2774292  0.6271668 d1  3
# 3  1.0844412  0.5180921 d1  3
# 4 -2.3456977  0.1409218 d1 -3
# 5  0.4291247  1.4572719 d1  3
# 6  0.5060559 -0.4935965 d1  1

There are only 10k points, but that’s big enough to demonstrate some of the limitations / features of different approaches.

A simple scatterplot

Let’s start with a simple scatterplot of y vs x.

# install.packages("rbokeh", repos = "http://packages.tessera.io")
library(rbokeh)

figure() %>%
  ly_points(x, y, data = dat)

The variables seem to look like draws from standard normal distributions.

There is a lot of overplotting. We might hope to get a better feel for the data through some interaction with the plot, such as zooming. This isn’t a very large data set, but 10k points is already kind of pushing the boundary of smooth interactivity. For example, try panning around (click and drag inside the plot) or using the wheel zoom tool (click the wheel zoom icon and then use your mouse wheel to zoom in and out), and you will see that there is some sluggishness in the response.

Set level of detail threshold

One approach for making interaction transitions smoother is to set a “level of detail threshold”, a Bokeh feature that specifies a number of points above which downsampling will occur while the user is interacting with the plot. For example, we can set the parameter lod_threshold to downsample when the plot has more than 100 points:

figure(lod_threshold = 100) %>%
  ly_points(x, y, data = dat)

Now when you pan and zoom you will notice that only a subset of the points is shown during the transitions and that the transitions are a bit smoother.

For finer control over level of detail, see the lod_ arguments to figure() described here and also here.
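To get a feel for what downsampling during interaction means, here is a rough sketch in base R. It only illustrates the idea and is not how Bokeh implements it: decimate_rows() is a made-up helper, its factor argument is my own assumption, and the data are simulated stand-ins rather than the gist data.

```r
# Conceptual sketch only: mimic level-of-detail downsampling in base R.
# `decimate_rows` is a hypothetical helper, not part of rbokeh or Bokeh.
decimate_rows <- function(d, threshold = 100, factor = 10) {
  n <- nrow(d)
  if (n <= threshold) {
    return(d)  # few enough points: keep everything
  }
  # keep every `factor`-th row, starting from the first
  d[seq(1, n, by = factor), , drop = FALSE]
}

set.seed(1)
toy <- data.frame(x = rnorm(10000), y = rnorm(10000))  # stand-in data

nrow(decimate_rows(toy))           # rows drawn while interacting
nrow(decimate_rows(toy[1:50, ]))   # below the threshold, nothing dropped
```

With 10k rows and the assumed factor of 10, only a tenth of the points would be drawn while you pan or zoom, and the full set would come back once the interaction stops.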

  • Pros: We can see all the data points with more responsive interactivity.
  • Cons: The transition between the downsampled view and full detail after each interaction can have a bit too much latency and can be confusing to a viewer who doesn’t know that downsampling is occurring.

Render with WebGL

Another nice feature in Bokeh is that for certain glyphs (currently circles, squares, and lines), you can use WebGL to render the plot. Note that this feature works best with Chrome. If you are using another browser such as Safari, you may not see a plot at all or it may not have the same performance.

figure(webgl = TRUE) %>%
  ly_points(x, y, data = dat)

Wow! That’s some smooth interaction! Go ahead and zoom in and out quickly with the mouse wheel zoom tool. Warp speed!

As I noted, there are some issues with using WebGL: limitations on the types of glyphs that can be plotted, and browser compatibility. You can mix WebGL and non-WebGL layers, but currently the WebGL layer always renders on top. See here for more details on WebGL in Bokeh.

  • Pros: Rendering and interactivity are really fast.
  • Cons: WebGL support in Bokeh is currently not flexible enough for more involved plots.

Hexbins

A well-known approach to plotting many points in a scatterplot is to bin the points into hexagons and plot the hexagons, varying either the size or the color of each hexagon to depict the count of observations that fall into the bin. This has been made popular in R by the hexbin package.

Note that in all of the previous examples, even though we found some ways to render all the points faster, we still had to send all the points to the browser, which at some point will not be a good idea. With hexbins, we only send the data needed to draw the hexagons, which will be bounded as the number of points grows.
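To make the “bounded” claim a little more concrete, here is a rough sketch in base R. It is not what ly_hexbin() does internally: it uses square bins from cut() as a stand-in for hexagons, simulated standard normal draws rather than the gist data, and a made-up helper count_bins() with an assumed 30 x 30 grid over [-5, 5].

```r
# Sketch: count occupied bins on a fixed grid for growing sample sizes.
# Square bins via cut() stand in for hexagons; `count_bins` is a
# hypothetical helper, not part of rbokeh or hexbin.
count_bins <- function(n, nbins = 30, lim = c(-5, 5)) {
  x <- rnorm(n)
  y <- rnorm(n)
  breaks <- seq(lim[1], lim[2], length.out = nbins + 1)
  counts <- table(cut(x, breaks), cut(y, breaks))
  sum(counts > 0)  # bins that would need to be sent to the browser
}

set.seed(42)
sizes <- c(1e3, 1e4, 1e5)
occupied <- vapply(sizes, count_bins, integer(1))
data.frame(points = sizes, occupied_bins = occupied)
```

However many points you throw at it, the number of occupied bins can never exceed the size of the grid (900 here), whereas a points layer has to ship every single observation.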

We can add a hexbin layer instead of a points layer easily in rbokeh.

figure() %>%
  ly_hexbin(x, y, data = dat)

Do you notice anything interesting from looking at this plot? The densest area of the plot does not occur at (0,0) but at (-1,-1), so these are not standard normal draws. The overplotting in the previous plots made this difficult to see, even after zooming in (at least for me, and I knew how I had generated the data).

  • Pros: Hexbinning sends little data to the browser and can help find features that overplotting hides.
  • Cons: Binning doesn’t let us see the raw data.

JavaScript callback teaser

An interesting idea would be to show hexbins when zoomed out and then show points when zoomed in far enough. Bokeh’s nice JavaScript callback features allow us to do this. JavaScript callback support in rbokeh is an experimental feature I’ve been working on with Saptarshi Guha. I’m just going to tease it here without code as we haven’t released it yet, but here’s such a plot.

Zoom in until the hexbins start to get large and you will see them transform into the raw data points. Make sure to try this out – it’s pretty cool (cool that this was generated from R). There are some things that could be done to make this more useful but it illustrates the idea. Stay tuned for a post about how to make such a plot and others with callbacks, as well as integrating rbokeh user interactions into Shiny applications.

Abstract rendering

Another interesting approach for scatterplots with a very large number of points and severe overplotting is abstract rendering from Joseph Cottam, which has been available in Bokeh but is not yet available in rbokeh.

Faceting

I just want to mention that when you are plotting a lot of data, there are better things to do than throw it all up on a single plot. Of course it’s always good to look at summaries such as hexbins, but usually if there is anything really interesting going on in the data, you’re bound to miss it if you only look at summaries.

A powerful technique for visualizing a larger or more complex data set in more detail is Trellis Display. Those familiar with the grammar of graphics will know it as faceting, and those familiar with Edward Tufte will know it as small multiples. Having studied with Bill Cleveland, I tend to think of it all as Trellis Display, but faceting is a simpler term. The idea is to take your data and break it into pieces, usually according to a variable or combination of variables in the data, and then apply the same plot to each piece. This is powerful for several reasons. In the interest of keeping this post from going on and on, I’ll save these reasons for future posts.

How do you decide how to break up the data? It is often dictated by the domain of the data being analyzed, but it is also often a trial-and-error process. The data I have provided here is artificial, but you probably noticed that we have variables “f1” and “f2”. Let’s break the data up by “f1” and plot a hexbin for each. Make sure you have the latest version of rbokeh (0.3.5) before you run these examples.

bounds <- range(c(dat$x, dat$y))

grid_plot(lapply(split(dat, dat$f1), function(d) {
  figure(width = 300, height = 350) %>%
    ly_hexbin(x, y, data = d, xbnds = bounds, ybnds = bounds)
}), nrow = 1, same_axes = TRUE)

Here we see that “f1” is a variable that seems to separate two distributions that are present in the data. This is very useful information that became very apparent with faceting. We already knew from our hexbin summary that something beyond random standard normals was going on, but not if there was something else in the data that could explain it, as we see here.

So can faceting help us discover something we haven’t already noticed in our data? You can probably guess that since I designed the data, the answer is yes and that it lies in the other variable “f2”.

grid_plot(lapply(split(dat, dat$f2), function(d) {
  figure(width = 300, height = 350) %>%
    ly_hexbin(x, y, data = d, xbnds = bounds, ybnds = bounds)
}), nrow = 2, same_axes = TRUE)

So “f2” is a variable that tells us which quadrant of the Cartesian plane each point is in.
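If you would rather confirm this with a table than by eye, here is a small sketch. To keep it self-contained it uses only the six rows printed by head(dat) earlier in the post; the quadrant labels are names I made up for illustration.

```r
# First six rows of `dat`, copied from the head(dat) output above.
first_rows <- data.frame(
  x  = c(-1.2070657, 0.2774292, 1.0844412, -2.3456977, 0.4291247, 0.5060559),
  y  = c(-1.8168975, 0.6271668, 0.5180921, 0.1409218, 1.4572719, -0.4935965),
  f2 = c(-1, 3, 3, -3, 3, 1)
)

# Label each point by the signs of its coordinates (hypothetical labels).
quadrant <- paste0(
  ifelse(first_rows$x > 0, "x+", "x-"),
  ifelse(first_rows$y > 0, "y+", "y-")
)

# If f2 encodes the quadrant, each f2 value lands in exactly one column.
tab <- table(f2 = first_rows$f2, quadrant)
tab
```

Swapping first_rows for the full dat should give the same kind of one-to-one table, assuming the pattern in these six rows holds for the rest of the data.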

Note that in these examples the rbokeh code isn’t quite as elegant as the faceting you can do with something like:

xyplot(y ~ x | f1, data = dat)

in lattice, or

ggplot(dat, aes(x, y)) + geom_point() + facet_wrap(~ f1)

in ggplot2, but I’m putting some thought into how to make the faceting interface to rbokeh simpler.

There are many other very favorable properties of Trellis Display that in my opinion make it superior to other more interactive techniques for visualizing large data sets. I’ll be posting more on this topic as it’s probably apparent that I’m rather partial to it.

What about really really big data sets?

As a final parting thought, I’d like to point you to a recent post that shows a faceted plot made against hundreds of gigabytes of data stored on Hadoop. This is made possible by the Trelliscope package. Check out that post here.