Explore large-scale data analysis and visualization for ARM using noSQL technologies

Authors

Kyle K Dumas (Quicklooks) — Oak Ridge National Laboratory
Giri Prakash — Oak Ridge National Laboratory
William I. Gustafson — Pacific Northwest National Laboratory
Andrew M. Vogelmann — Brookhaven National Laboratory
Tami Fairless — Pacific Northwest National Laboratory

Category

ARM next generation – Megasite and LES activities

Description

This poster presents a new way of providing large-scale data analysis and visualization services for ARM data. Currently, ARM data are searched through their metadata, such as site name, instrument name, and date. NoSQL technologies were explored to extend data searching beyond metadata to the data values themselves. A few technologies have been explored for this purpose; the two currently being tested with ARM data are Apache Cassandra (a NoSQL database) and Apache Spark (a distributed analytics framework that can run on top of NoSQL stores). Both were developed to work in a distributed environment and can therefore handle large data volumes for both storage and analytics. For visualization, D3.js is a JavaScript library that generates interactive data visualizations in web browsers using the widely adopted SVG, HTML5, and CSS standards.

NoSQL can also benefit the LASSO work. LASSO will require that users be able to analyze observations and LES model output, either individually or together, across multiple time periods. LASSO's intent is to present the simulation output, corresponding observations, associated statistics, estimated uncertainties, and metadata in an accessible, easy-to-use form by combining them into a unified package called a "data cube." This cube will include both raw and processed simulation information. The example given in the LASSO implementation strategy suggests that storing these quantities will require enormous data storage, and NoSQL can potentially provide a powerful means to store portions of the data and then give users efficient access through tools such as Spark and D3.js. The author will demonstrate these capabilities and collect feedback from participants.
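The difference between searching by metadata and searching by data values, as described above, can be illustrated with a minimal sketch. It is not the poster's implementation: it uses plain Python lists as a stand-in for a distributed store such as Cassandra or Spark, and the record layout, the variable name temp_mean, and all sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Measurement:
    site: str        # metadata: site name
    instrument: str  # metadata: instrument name
    day: date        # metadata: observation date
    variable: str    # name of the measured quantity (hypothetical)
    value: float     # the data value itself


# Made-up sample rows, for illustration only (not real ARM data).
ROWS = [
    Measurement("SGP", "met", date(2015, 6, 1), "temp_mean", 24.5),
    Measurement("SGP", "met", date(2015, 6, 2), "temp_mean", 31.2),
    Measurement("NSA", "met", date(2015, 6, 1), "temp_mean", -2.0),
    Measurement("SGP", "met", date(2015, 6, 3), "temp_mean", 33.8),
]


def search_by_metadata(rows, site, instrument):
    """Metadata-only search: filter on site and instrument name."""
    return [r for r in rows if r.site == site and r.instrument == instrument]


def search_by_value(rows, variable, minimum):
    """Value-based search: keep rows whose data value meets a threshold."""
    return [r for r in rows if r.variable == variable and r.value >= minimum]


# Combine both: days at one site where the stored value is at least 30.0.
hot_days = search_by_value(search_by_metadata(ROWS, "SGP", "met"), "temp_mean", 30.0)
print([r.day.isoformat() for r in hot_days])
```

In a distributed setting the same two filters would presumably be pushed down to the database or analytics engine rather than evaluated in application code; the sketch only shows the shape of the query.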