Breakout Summary Report

 

ARM/ASR User and PI Meeting

19 - 23 March 2018

Applications of machine learning to the analysis of observational and modeling data sets
20 March 2018
1:30 PM - 3:30 PM
46
Kotamarthi and Comstock

Breakout Description

The goal of the breakout session is to discuss ideas for using machine learning (ML) techniques for analysis of data sets and models, and for learning from models and observations to improve model parameterizations. Eventually we hope to bring together those interested in using ML for their research and integrate the development of algorithms, analysis, model parameterizations, and data products that are facilitated using the rapidly emerging ML capabilities at the DOE HPC clusters. Ideas from current projects and future directions will be discussed.

Main Discussion

The breakout was well attended, with nearly 50 participants and eight technical presentations. The presentations covered topics ranging from the application of ML to ARM data sets to the development of process-scale model parameterizations. The session ended with a discussion of future directions for ML in the ARM/ASR community. It was agreed that the application areas for ML will continue to grow as more ML software and techniques become available and as the forecasted increase in computing power at the DOE Leadership Computing Facilities materializes. An appropriate next step is likely to form an interest group that would collect the ML tools finding use within the community and discuss data set requirements for ML.

Key Findings

The talks presented at the breakout can be categorized into the following:

1) Feature Identification/Clustering:
One of the most obvious activities with potential for a significant outcome is applying ML to feature identification in the vast ARM collection of remote-sensing and surface meteorological, aerosol, and cloud data sets. All of the ARM-related talks focused on this area of research. The feature identification algorithms under development aim to detect anomalies in the archived ARM data in order to establish automatic QA/QC guidelines.
Shaocheng Xie and Xiao Chen (LLNL) presented the process for developing a novel machine learning framework for anomaly detection and data quality assessment. They proposed exploring three advanced ML algorithms for anomaly detection and QA/QC of the ARM data set: (a) automatic relevance determination (ARD), which uses Bayesian feature selection to identify the relevant features and thereby reduce ML fitting errors; (b) Gaussian process regression, which assumes a Gaussian distribution of the training samples to build a probabilistic ML model and automatically produces error bars for neighboring data points, based on the preceding ARD error estimation from the observational data points; and (c) Autoregressive Integrated Moving Average (ARIMA) modeling to remove the seasonal and trend components.
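As a rough illustration of item (c), the sketch below removes a repeating cycle and a slow trend from a time series by differencing, which is the "integrated" step of an ARIMA-type model, and then flags samples whose residual is unusually large. It is a minimal sketch and not the LLNL framework: it fits no autoregressive or moving-average terms, and the function names, the 24-sample season, the threshold, and the synthetic hourly series are all assumptions made for illustration.

```python
import math
import random
import statistics


def difference(series, lag=1):
    """Return series[t] - series[t - lag]; the result is `lag` samples shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def flag_anomalies(series, season=24, threshold=5.0):
    """Return indices of `series` whose differenced residual is an outlier."""
    # Seasonal differencing removes a cycle that repeats every `season` samples;
    # first differencing then removes a slowly varying trend.
    residual = difference(difference(series, lag=season), lag=1)
    offset = season + 1  # residual[i] corresponds to series[i + offset]

    # Robust z-score built from the median absolute deviation (MAD).
    med = statistics.median(residual)
    mad = statistics.median(abs(r - med) for r in residual)
    scale = 1.4826 * mad  # makes the MAD comparable to a Gaussian std. dev.
    if scale == 0:
        return []
    return [i + offset for i, r in enumerate(residual)
            if abs(r - med) > threshold * scale]


# Synthetic hourly series (NOT ARM data): linear trend + diurnal cycle + noise.
random.seed(0)
series = [15.0 + 0.01 * t + 5.0 * math.sin(2 * math.pi * t / 24)
          + random.gauss(0, 0.2) for t in range(24 * 10)]
series[100] += 8.0  # inject one artificial spike

flagged = flag_anomalies(series)
print(flagged)
```

Because differencing compares each sample with earlier ones, a single spike also leaves echoes one step and one season later, so flagged indices should be grouped back to the originating sample before they are used for QA/QC.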
Edward Luke, Bernat Puigdomenec, and Pavlos Kollias (BNL) presented a method for detecting non-meteorological features in ARM facility radar data using a convolutional neural network (CNN) as a key part of the deep learning neural network they propose to build. They will first use supervised learning on data sets with the following labels: “no echo”, “clutter-only”, “hydrometeor-only”, or “clutter and hydrometeor”. From these training data sets they propose to train and test a CNN-based classifier. The second phase will exploit spatiotemporal differences between clutter and meteorological echoes to emphasize unsupervised learning.
Joseph Hardin, Nitin Bharadwaj, Mahantesh Halappanavar, and Adam Theisen (PNNL) proposed developing an unsupervised learning approach to predict periods of anomalous data quality in the ARM data set, that is, unsupervised clustering to detect statistically independent clusters. The approach will be trained on and applied to the Oliktok KAZR radar data set. It will start with an unsupervised model classification scheme that applies a clustering algorithm on a graph/b-matching, followed by a region-based aggregation. The parameters of this unsupervised model will be further optimized using expert knowledge.
Erol Cromwell and Donna Flynn (PNNL) proposed using a deep neural network to identify clouds from micro-pulse lidar (MPL) backscatter images and evaluating the performance of this method against the currently implemented MPLCMASK. Preliminary results from the detection algorithm were presented, and the audience supported using this type of algorithm to produce the cloud boundaries needed as input to ARSCL.
Adam Theisen (DQO) presented a number of ongoing activities at the Data Quality Office to implement ML algorithms for anomaly detection and to assist in the automatic QA/QC of the data. The talk highlighted the case of MARCUS AOS data and its possible contamination with ship exhaust. Initial results obtained by using Scikit-Learn to train a random forest classifier on periods of known ship contamination identified by manual inspection have shown promise and will be explored further. A path forward for implementing an automated ML framework for developing DQRs was also discussed.
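The random forest approach mentioned for the MARCUS AOS data can be sketched with Scikit-Learn as follows. This is only an illustrative sketch under stated assumptions, not the Data Quality Office workflow: the two input features (particle number concentration and CO mixing ratio), their value ranges, and the data themselves are hypothetical and synthetic, standing in for manually labeled clean and ship-contaminated periods.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for labeled aerosol samples (NOT real MARCUS AOS data).
# Hypothetical columns: particle number concentration (cm^-3), CO (ppm).
n_clean, n_ship = 400, 100
clean = np.column_stack([rng.normal(300, 50, n_clean),
                         rng.normal(0.08, 0.01, n_clean)])
ship = np.column_stack([rng.normal(3000, 800, n_ship),
                        rng.normal(0.15, 0.03, n_ship)])
X = np.vstack([clean, ship])
# Label 1 marks periods a human reviewer would have tagged as ship exhaust.
y = np.concatenate([np.zeros(n_clean, dtype=int), np.ones(n_ship, dtype=int)])

# Hold out a quarter of the samples to check the classifier on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

In practice the labeled periods would come from manual inspection, as described in the talk, and time-ordered data would call for splitting by period instead of by random sample so that neighboring measurements do not appear in both the training and test sets.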
2) Process Identification Model Development/Regression:
Another class of problems for which machine learning has shown promise is the identification and development of process models using output from very high-resolution and fully resolved model simulations. The methodology is generally some form of regression analysis and deep neural network algorithms. The talks presented in this topical area are summarized below.
Matt West, Mike Hughes, John Kodros, Jeff Pierce, and Nicole Riemer (UIUC) presented the application of ML for developing a parametric approach to predict mixing state metrics of aerosols. To learn the global mixing state, output from the GEOS-Chem-TOMAS model was used to create grid-cell-scale chemical and meteorological drivers for the single-particle model PartMC-MOSAIC. A population of GEOS-Chem grid cells was selected and PartMC-MOSAIC simulations were performed for these cells. This formed the training set for the ML technique. A gradient-boosted regression tree was employed to derive the output mixing state from this data set. Initial results from this work are promising and further studies are planned.
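The gradient-boosted regression tree step can be sketched with Scikit-Learn as shown below. This is a minimal sketch under stated assumptions and not the UIUC workflow: the three driver columns, the functional form of the target, and all values are invented for illustration and do not come from GEOS-Chem-TOMAS or PartMC-MOSAIC output.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-grid-cell drivers (hypothetical, scaled to 0..1),
# e.g. an emission strength, a humidity proxy, and an aging-time proxy.
n_cells = 1000
drivers = rng.uniform(0.0, 1.0, size=(n_cells, 3))

# Invented target: a mixing state metric bounded between 0 and 1.
chi = 0.2 + 0.5 * drivers[:, 0] + 0.2 * drivers[:, 1] * drivers[:, 2]
chi = np.clip(chi + rng.normal(0.0, 0.02, n_cells), 0.0, 1.0)

# Hold out a quarter of the cells to evaluate the fitted model.
X_train, X_test, y_train, y_test = train_test_split(
    drivers, chi, test_size=0.25, random_state=0)

# Each new shallow tree is fitted to the residual errors of the ensemble so far.
model = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_train, y_train)

predicted = model.predict(X_test)
r2 = r2_score(y_test, predicted)
print(f"R^2 on held-out cells: {r2:.3f}")
```

In a setting like the one described, each training row would pair the drivers of one selected grid cell with the mixing state metric computed from the corresponding particle-resolved simulation, and the fitted model would then be evaluated on the remaining grid cells.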

Rao Kotamarthi, Won Cheng, Prasanna Balakrishnan, Jiali Wang, Liz Moyer, and Michael Stein (Argonne) presented results from a feature identification and tracking algorithm and the implementation of a deep learning neural network for developing physical parameterizations in an atmospheric model. The feature identification and tracking algorithm, which follows the location and size of storm systems over the CONUS, was trained using output from a high-resolution, regional-scale climate model simulation. A second application used a deep NN (DNN) to develop a parametric model representing the planetary boundary layer (PBL) physical parameterization present in most earth system models. The DNN learned the physics of the boundary layer from a large training data set covering approximately 10 years of model output available every three hours. The DNN PBL parameterization was successful in predicting the boundary-layer wind and temperature profiles on the validation data set. Further development of this approach is underway.
Arnoldas Kurbanovas, Kara Sulia, Mark Beauharnois, and Will May (SUNY at Albany) presented ideas being explored at the XCITE lab at Albany for applying ML to develop a short-term weather forecasting technique. This work is at a preliminary stage.

Issues

Most of the work presented is in the early stages of development, and many of the projects started only this FY. Most of the approaches presented use off-the-shelf software, and the applications are along expected lines for an early-stage effort.

Needs

A more concerted effort is needed to identify opportunities and develop labeled data sets for identifying and tracking features of ASR/ARM-targeted phenomena in the ARM/ASR data sets. A group of investigators should be developed who would prepare codes and approaches, as well as more adventurous applications of ML, that fit into the high-performance/exascale future.

Decisions

None at this point.

Future Plans

We plan to start a mailing list for machine learning within the ARM/ASR community. We also plan to hold the breakout again at next year's science meeting.

Action Items

Start a mailing list and initiate further discussions with the group on future direction for the application of ML to ARM/ASR.