Ryan Hafen, Ph.D.

Statistical Consultant, Data Scientist

@hafenstats

I am an independent consultant and (remote) adjunct assistant professor at Purdue University. My work and research focus on tools, methodology, and applications in exploratory analysis, visualization, statistical model building, and machine learning on large, complex datasets.

I write a lot of code and work on a lot of interesting applications, some of which you can learn more about below.

Projects

I spend a lot of time building tools for data analysis. Some of the most potentially useful projects I have worked on or am working on are listed here. For a list of other projects I am involved in, visit my GitHub page.

DeltaRho

DeltaRho (formerly known as Tessera) is an environment that enables deep statistical analysis and visualization of large complex data through a simple interface in R, providing all of R’s functionality at scale. While designed for and used with data sets in the multi-terabyte range, it is useful for small and moderate size data as well. DeltaRho consists of interface packages I have written, datadr and trelliscope for analysis and visualization, and ties to scalable back ends like Hadoop through packages like RHIPE. Learn more at deltarho.org.

rbokeh

rbokeh is an R package that provides an interface to the Bokeh plotting library, offering a simple but powerful declarative interface for building interactive web-based visualizations. Learn more here.

stlplus

stlplus is an R package for seasonal trend decomposition using loess (original paper). Compared to the stl method that ships with base R, stlplus provides several enhancements including the ability to deal with missing values and higher order polynomial smoothing.

ed

Ed is a nonparametric density estimation method that turns density estimation into a regression problem, providing much-needed diagnostics for evaluating bandwidth choice and peak identification, and helping to deal with some common problems encountered using traditional KDE approaches, such as fixed bandwidths, boundaries / discontinuities, bias, and choosing optimal smoothing parameters.

packagedocs

packagedocs is an R package for generating a nice-looking web site for package documentation / vignettes. The rbokeh web page is an example of a page generated by packagedocs.

Consulting

Aside from building tools for data analysis, I spend a significant amount of time analyzing data. During my graduate studies at Purdue and my time as a research scientist at PNNL, I worked on many difficult problems involving large complex data in several domains including disease surveillance, computer network modeling, power systems engineering, nuclear forensics, high energy physics, and finance.

I currently work as an independent consultant, providing tool development and analytic / visualization services for big data problems. I am currently working on the following projects:

DARPA XDATA

The DARPA XDATA program is part of the White House’s Big Data Initiative, funding research and development of open source tools for analysis and visualization of big data. I am Co-PI (with Purdue University and Stanford University) on an XDATA project that has funded much of my work on DeltaRho and has funded the application of our tools to several interesting data sets such as the Bitcoin blockchain, Akamai traceroute data, and high frequency trading data from Nanex. Information about the program’s teams and products can be found in the XDATA Open Catalog.

More to come as listing content is approved by clients...

Publications / Talks

Upcoming & Recent Talks / Presentations

Modern Approaches to Data Exploration with Trellis Display
Yale Biostatistics Seminar, New Haven, CT, March 28, 2017

Exploration and Analysis of Longitudinal Growth Data (tutorial)
Bill & Melinda Gates Foundation Grand Challenges India, March 23, 2017

TrelliscopeJS: Visualization in the Tidyverse
rstudio::conf, Kissimmee, FL, January 13, 2017

Rbokeh: An R Interface to the Bokeh Plotting Library
JSM 2016, Chicago, IL, August 2, 2016

The Need for Flexibility in Distributed Computing with R
DSC 2016, Stanford, CA, July 2, 2016

rbokeh: A Simple, Flexible, Declarative Framework for Interactive Graphics
useR! 2016, Stanford, CA, June 30, 2016

Tools for analysis and visualization of large complex data in R
Rencontres R, Toulouse, France, June 23, 2016

Analysis and Visualization of Large Complex Data with Tessera
Short Course at Queensland University of Technology, Brisbane, Australia, February 16, 2016

Analysis and Visualization of Large Complex Data with Tessera
Short Course at University of Technology Sydney, Sydney, Australia, October 13, 2015

Tessera: A System for Deep Analysis of Large Complex Data in R
ISM HPC Week, Tokyo, Japan, October 11, 2015

Tessera Tutorial
useR! 2015, Aalborg, Denmark, June 30, 2015

Tessera: Analysis of Large Complex Data in R
R Summit 2015, Copenhagen, Denmark, June 27-28, 2015

Tessera Tutorial
Interface Symposium, Morgantown, WV, June 12, 2015

A Simple Scalable Visualization Approach for Large Complex Data
HSARPA Big Data Series: Data Visualization, Washington, D.C., June 10, 2015

Tessera: open source environment for deep analysis of large complex data
Bay Area R User’s Group. San Francisco, CA, January 20, 2015

Divide and Recombine: A distributed data analysis paradigm
HP Workshop on Distributed Computing. San Francisco, CA, January 27, 2015

Tessera: open source environment for deep analysis of large complex data
Seattle R Meetup. Seattle, WA, November 4, 2014

Selected Articles, Tech Reports, Book Chapters

R. Hafen
“Divide and Recombine: Approach for Detailed Analysis and Visualization of Large Complex Data.”
Chapter 1 in Handbook of Big Data, ed. P Bühlmann et al., pp. 35-46. Chapman and Hall/CRC, 2016.

W. S. Cleveland and R. Hafen
Divide and recombine (D&R): Data science for large complex data
Statistical Analysis and Data Mining: The ASA Data Science Journal 7 425–433 (2014)

R. Hafen, L. Gosink, J. McDermott, K. Rodland, K. K.-V. Dam, and W. S. Cleveland.
Trelliscope: a system for detailed visualization in the deep analysis of large complex data.
In Large-Scale Data Analysis and Visualization (LDAV), 2013 IEEE Symposium on, pages 105-112. IEEE, 2013.

S. Guha, R. Hafen, J. Rounds, J. Xia, J. Li, B. Xi, and W. S. Cleveland.
Large complex data: divide and recombine (D&R) with RHIPE.
Stat, 1(1):53-67, 2012.

R. Hafen and T. Critchlow.
EDA and ML - A perfect pair for large-scale data analysis.
In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2013 IEEE 27th International, pages 1894-1898. IEEE, 2013.

R. Hafen, T. D. Gibson, K. K. van Dam, and T. Critchlow.
Large-scale exploratory analysis, cleaning, and modeling for event detection in real-world power systems data.
In Proceedings of the 3rd International Workshop on High Performance Computing, Networking and Analytics for the Power Grid, page 4. ACM, 2013.

D. M. Best, R. Hafen, B. K. Olsen, and W. A. Pike.
Atypical behavior identification in large-scale network traffic.
In Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on, pages 15-22. IEEE, 2011.

R. Hafen, D. E. Anderson, W. S. Cleveland, R. Maciejewski, D. S. Ebert, A. Abusalah, M. Yakout, M. Ouzzani, and S. J. Grannis.
Syndromic surveillance: STL for modeling, visualizing, and monitoring disease counts.
BMC Medical Informatics and Decision Making 2009, 9:21

S. Guha, P. Kidwell, R. Hafen, W. S. Cleveland.
Visualization Databases for the Analysis of Large Complex Datasets.
Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 5:193-200, 2009.

R. Hafen, TD Gibson, K Kleese van Dam, and TJ Critchlow. 2014.
“Power Grid Data Analysis with R and Hadoop.”
Chapter 1 in Data Mining Applications with R, ed. Y Zhao and Y Cen, pp. 1-34. Academic Press, Waltham, MA.

R. Hafen and W. S. Cleveland.
Ed: A method for density estimation and diagnostic checking.
Technical report.

R. Hafen
Local regression models: advancements, applications, and new methods
PhD Thesis.

Other Talks / Presentations

Tessera: open source environment for deep analysis of large complex data
Genentech seminar. Redwood City, CA, January 21, 2015

Tessera: a computational environment for the analysis of large, complex data
International Indian Statistical Association (IISA) Conference. Riverside, CA, July 11-13, 2014

Developing a statistics curriculum for the data aevolution - invited panel
Joint Statistical Meetings, Boston, MA, August 4, 2014.

Trelliscope: a system for detailed visualization in the deep analysis of large complex data
Large-Scale Data Analysis and Visualization (LDAV), 2013 IEEE Symposium on, Atlanta, GA, October 14, 2013.

Divide and recombine for large complex data
Joint Statistical Meetings, Montreal, Canada, August 5, 2013.

EDA and ML - a perfect pair for large scale data analysis
ParLearning / IPDPS, Boston, MA, May 24, 2013.

Multi-resolution data model and directed data reduction, reconstruction, and aggregation
FPGI Annual Review, May 8, 2013.

Exploratory data analysis and statistical model building with large and complex data.
Conference on Data Analysis (CODA). Santa Fe, NM, March 2, 2012.

The ed Method for Nonparametric Density Estimation and Diagnostic Checking
Joint Statistical Meetings, Washington D.C., August 2009.

STL: Seasonal-Trend Decomposition by Loess with applications in Syndromic Surveillance
Stat Day 2009, Purdue University, March 2009.

The ed Method for Nonparametric Density Estimation and Diagnostic Checking
Machine Learning Seminar, Purdue University, March 2009.

Other Articles, Tech Reports

R. Hafen, N. A. Samaan, Y. V. Makarov, R. Diao, and N. Lu.
Joint seasonal ARMA approach for modeling of load forecast errors in planning studies.
In IEEE PES Transmission and Distribution Conference and Exposition, April, 2014.

E. D. Merkley, S. Rysavy, A. Kahraman, R. Hafen, V. Daggett, and J. N. Adkins.
Distance Restraints from Cross-Linking Mass Spectrometry: Mining a Molecular Dynamics Simulation Database to Evaluate Lysine-Lysine Distances.
Protein Science, Epub ahead of print, doi:10.1002/pro.2458

N. Lu, R. Diao, R. Hafen, N. Samaan, and Y. V. Makarov.
A comparison of forecast error generators for modeling wind and load uncertainty.
In Power and Energy Society General Meeting (PES), 2013 IEEE, pages 1-5. IEEE, 2013.

J. E. McDermott, J. Wang, H. Mitchell, B.-J. Webb-Robertson, R. Hafen, J. Ramey, and K. D. Rodland.
Challenges in biomarker discovery: combining expert insights with statistical analysis of complex omics data.
Expert opinion on medical diagnostics, 7(1):37-51, 2013.

R. Maciejewski, A. Pattath, S. Ko, R. Hafen, W. S. Cleveland, and D. S. Ebert.
Automated Box-Cox transformations for improved visual encoding.
Visualization and Computer Graphics, IEEE Transactions on, 19(1):130-140, 2013.

R. Diao, N. Samaan, Y. Makarov, R. Hafen, and J. Ma.
Planning for variable generation integration through balancing authorities consolidation.
In Power and Energy Society General Meeting, 2012 IEEE, pages 1-8. IEEE, 2012.

R. Hafen and M. J. Henry.
Speech information retrieval: a review.
Multimedia systems, 18(6):499-518, 2012.

J. Ma, S. Lu, R. Hafen, P. V. Etingov, Y. V. Makarov, and V. Chadliev.
The impact of solar photovoltaic generation on balancing requirements in the Southern Nevada system.
In Transmission and Distribution Conference and Exposition (T&D), 2012 IEEE PES, pages 1-9. IEEE, 2012.

R. Maciejewski, R. Hafen, S. Rudolph, S. G. Larew, M. A. Mitchell, W. S. Cleveland, and D. S. Ebert.
Forecasting hotspots—a predictive analytics approach.
Visualization and Computer Graphics, IEEE Transactions on, 17(4):440-453, 2011.

Y. V. Makarov, S. Lu, N. Samaan, Z. Huang, K. Subbarao, P. V. Etingov, J. Ma, R. Hafen, R. Diao, and N. Lu.
Integration of uncertainty information into power system operations.
In Power and Energy Society General Meeting, 2011 IEEE, pages 1-13. IEEE, 2011.

S. Lu, P. V. Etingov, N. A. Samaan, R. P. Hafen, Y. V. Makarov, and J. Ma. 2011.
Evaluating Impact of Solar Generation on Balancing Requirements in Southern Nevada System.
In The 1st International Workshop on Integration of Solar Power into Power Systems, Oct. 24, 2011, Aarhus, Denmark.

Y. V. Makarov, J. F. Reyes Spindola, N. A. Samaan, R. Diao, and R. Hafen.
Wind and Load Forecast Error Model for Multiple Geographically Distributed Forecasts.
In Proceedings of the 9th International Workshop on Large-Scale Integration of Wind Power into Power Systems as well as on Transmission Networks for Offshore Wind Power Plants, 2010.

R. Maciejewski, S. Rudolph, R. Hafen, A. Abusalah, M. Yakout, M. Ouzzani, W. S. Cleveland, S. J. Grannis, D. S. Ebert.
A Visual Analytics Approach to Understanding Spatiotemporal Hotspots.
IEEE Transactions on Visualization and Computer Graphics. 205-220, 2009.

R. Maciejewski, S. Rudolph, R. Hafen, A. Abusalah, M. Yakout, M. Ouzzani, W. S. Cleveland, S. J. Grannis, M. Wade, and D. S. Ebert.
Understanding syndromic hotspots - a visual analytics approach.
IEEE Symposium on Visual Analytics Science and Technology (VAST), pages 35-42, 2008.

R. Maciejewski, R. Hafen, S. Rudolph, G. Tebbetts, W. S. Cleveland, S. J. Grannis, and D. S. Ebert.
Generating synthetic syndromic surveillance data for evaluating visual analytics techniques.
IEEE Computer Graphics and Applications 29(3): 18-28, May/June 2009.

M. Vlachopoulou, L. J. Gosink, T. C. Pulsipher, R. Hafen, N. Zhou, and J. Tong.
Net Interchange Schedule Forecasting Using Bayesian Model Aggregation.
Technical Report. Pacific Northwest National Laboratory, Richland, WA, 2013.

K. L. Gordon, R. Hafen, J. E. Hathaway, and J. J. McCullough.
Lumen maintenance testing of the Philips 60-watt replacement lamp L Prize entry.
Technical report, Pacific Northwest National Laboratory (PNNL), Richland, WA (US), 2012.

R. Hafen, K. Subbarao, V. V. Viswanathan, and M. C. Kintner-Meyer.
Requirements for Defining Utility Drive Cycles: An Exploratory Analysis of Grid Frequency Regulation Data for Establishing Battery Performance Testing Standards.
Technical Report. Pacific Northwest National Laboratory, 2011.

S. Lu, P. V. Etingov, R. Diao, J. Ma, N. A. Samaan, Y. V. Makarov, X. Guo, R. Hafen, C. Jin, H. Kirkham, et al.
Large-Scale PV Integration Study.
Technical Report. Pacific Northwest National Laboratory, 2011.

R. Hafen.
Topics in Empirical Distribution Functions and Change-Point Analysis.
Master's thesis, University of Utah, 2006.

Patents

R. Hafen, T. J. Critchlow, and T. D. Gibson.
Methods and apparatus of analyzing electrical power grid data, June 26 2013.
US Patent App. 13/928,108.

R. Maciejewski, R. Hafen, S. Rudolph, W. Cleveland, and D. Ebert.
Forecasting hotspots using predictive visual analytics approach, Jan. 31 2013.
US Patent 20,130,031,041.

Contact