Breakout Summary Report

 

ARM/ASR User and PI Meeting

13 - 17 March 2017

Beyond quality assurance: Using machine learning and multiple instruments to quantify retrieval quality and uncertainty
15 March 2017
1:30 PM - 3:30 PM
25
Evgueni Kassianov

Breakout Description

Improving operational data quality and assigning uncertainties to retrievals have been consistent requests to the Atmospheric Radiation Measurement (ARM) Climate Research Facility, particularly for retrievals of aerosol and hydrometeor microphysical properties. While large uncertainties exist in retrievals due to factors such as retrieval assumptions and instrument data quality in the field, quantifying those uncertainties has remained an unsolved challenge.

Main Discussion

A number of new efforts in the ARM Facility and the Atmospheric System Research (ASR) program have begun to meet these challenges with new techniques such as machine learning, comparisons of multiple instruments, and Bayesian statistics. The goals of this session were: 1) to update and get feedback from the broader community on new work being done to tackle these challenges, and 2) to develop a community discussion and potential collaborations on using these techniques for multiple retrieval problems. The session agenda was as follows:



1:30 Introduction – Laura Riihimaki (presented by Shaocheng Xie)



Comparing Multiple Instruments


  • 1:40 Multi-Filter Rotating Shadowband Radiometer (MFRSR) Data Quality Assessment – Evgueni Kassianov


Machine Learning


  • 2:00 Machine Learning for the ARM Climate Research Facility – Jeff Mitchell

  • 2:20 Estimation of Data Error Bars, Data Probability and Anomaly Source Identification through Machine Learning – Xiao Chen



Bayesian Approaches


  • 2:40 Using Bayesian MCMC to Retrieve Cloud Properties and their Uncertainties from Active and Passive Measurements – Derek Posselt

  • 3:00 A Bayesian Approach to Development and Observational Constraint of a new Cloud Microphysical Parametrization Scheme – Marcus van Lier-Walqui



3:20 General Discussion

Key Findings

Shaocheng Xie highlighted a need in the ARM Facility and ASR program to improve data quality control and uncertainty quantification. Currently, data quality control is often done manually with limited resources. Moreover, uncertainty quantification is limited by assumptions and unknown instrument performance in the field. These challenging issues can be addressed by: 1) comparisons of data from multiple instruments, 2) machine learning, and 3) new uncertainty quantification techniques based on Bayesian approaches.



Evgueni Kassianov introduced a two-step approach for multi-filter rotating shadowband radiometer (MFRSR) data quality assessment following valuable discussions during the ARM-supported Workshop in Boulder (January 27-29, 2016).



  • Screening of “bad” cases using available information about head and logger changes and documented data quality issues (the first step). To do this, the team cross-referenced three major sources: 1) head_id and logger_id global attributes in mfrsr.b1 Network Common Data Form files, 2) Southern Great Plains atmospheric observatory instrument logs, and 3) MFRSR Ingest calibration files on the ARM Data Management Facility server. “Good” cases for a 20-year period were selected, where “good” refers to cases with MFRSR aerosol optical depth points available for comparison between the C1 and E13 sites. An increasing trend in the number of selected “good” cases was demonstrated; the increase was associated mostly with fewer reported data quality issues and hardware changes over time. A minimal sketch of how this screening can be automated follows the list.

  • Identification of potential problems using comparison of data from available simultaneous measurements (the second step). Aerosol optical depths from colocated MFRSRs (C1 and E13 sites) and the AERONET Cimel sun photometer (CSPHOT, C1 site) were compared. An inconsistency in the CSPHOT files was found: for example, two CSPHOT files with different numbers of aerosol optical depth points but nearly identical start times (only one second apart) were located in the same directory. This inconsistency substantially complicates data concatenation and generation of the corresponding plots. Comparison plots of aerosol optical depths from the MFRSR (C1), MFRSR (E13), and AERONET were generated, and the corresponding basic statistics, such as the root-mean-square error, were calculated; a sketch of such a comparison follows the list.

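The cross-referencing in the first step lends itself to simple automation. The lines below are a minimal sketch, assuming Python with the netCDF4 package and a local directory of daily mfrsr.b1 files; the directory name is a placeholder, and only the head_id and logger_id global attributes described above are used.

    # Sketch: flag days on which the MFRSR head or logger changed, using the
    # head_id / logger_id global attributes of daily mfrsr.b1 netCDF files.
    # Assumes the netCDF4 package; the directory name is hypothetical.
    import glob
    from netCDF4 import Dataset

    previous = None
    suspect_days = []                                  # days adjacent to a hardware change
    for path in sorted(glob.glob("mfrsr_b1/*.nc")):    # hypothetical local directory
        with Dataset(path) as nc:
            current = (getattr(nc, "head_id", None), getattr(nc, "logger_id", None))
        if previous is not None and current != previous:
            suspect_days.append(path)                  # screen this day as a potential "bad" case
        previous = current

    print(f"{len(suspect_days)} day(s) flagged around head/logger changes")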

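For the second step, the instrument-to-instrument comparison reduces to matching the aerosol optical depth time series in time and computing summary statistics such as the root-mean-square error. A minimal sketch, assuming pandas and two already-loaded series with datetime indexes (the variable names and the 10-minute matching tolerance are illustrative):

    # Sketch: match MFRSR and CSPHOT aerosol optical depths in time and compute RMSE.
    # Assumes pandas Series with datetime indexes; the tolerance is illustrative.
    import numpy as np
    import pandas as pd

    def aod_rmse(mfrsr_aod, csphot_aod, tolerance="10min"):
        # Nearest-in-time match of the two series within the given tolerance.
        left = mfrsr_aod.rename("mfrsr").rename_axis("time").sort_index().reset_index()
        right = csphot_aod.rename("csphot").rename_axis("time").sort_index().reset_index()
        matched = pd.merge_asof(left, right, on="time", direction="nearest",
                                tolerance=pd.Timedelta(tolerance)).dropna()
        return float(np.sqrt(np.mean((matched["mfrsr"] - matched["csphot"]) ** 2)))

The same matching could be reused for the MFRSR C1 versus E13 comparison or for other colocated instrument pairs.
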
Jeff Mitchell reviewed three machine learning projects: 1) anomaly detection for the CSPHOT using a Random Forest algorithm, 2) anomaly detection for the MFRSR using multivariate regression, and 3) aerosol observing system (AOS) local source emission detection using a neural network and a Support Vector Machine model.



  • CSPHOT anomaly detection: The machine learning model uses multiple data features at the same time, which creates context for detecting anomalies. That context is especially useful because the CSPHOT data contain large variations due to weather, the environment, and gaps in the data (10-minute sampling rate). Defining the features properly is important because they are used to train the model. A Random Forest algorithm is used because it generalizes well to new data. The model is trained to predict good data, so any data that deviate show up with a high residual root-mean-square error. With a runtime of 15 seconds per year of data, all anomalies and periods of missing data were found and were verified against existing Data Quality Reports. A residual-based sketch of this approach follows the list.

  • MFRSR anomaly detection: The application is similar to the CSPHOT application. A multivariate regression model is used and trained on three months of good data; the root-mean-square error is reported, and an issue documented in a Data Quality Report was successfully detected, with no false positives for this test case. In addition, the anomaly detection applies a Fast Fourier Transform method to detect the shadowband misalignment problem, automated here for the first time (an illustrative sketch of the idea follows the list). One year of data is analyzed in two minutes.

  • AOS local source emission detection: The AOS trailer was located next to an airport runway, and a camera was set up to view the runway and detect when airplanes land or trucks are present. A neural network is used to detect the presence of airplanes; it can process 700 images in 10 seconds on a laptop with 96 percent accuracy (the 4 percent missed could probably be found with a more complex neural network). The camera-derived information was compared with five AOS instruments simultaneously. Events can be detected in the datastreams, but multiple datastreams are needed because not all instruments detect every anomaly. A Support Vector Machine model was also applied, using five days of training data; it is better than 99 percent accurate with a runtime of 15 seconds (a training sketch follows the list). Wind data were taken into account because the instruments cannot detect a local source when the wind blows the emissions away from the detectors. The model can work even if a sensor is missing, and the lowest sampling rate was used to align the data.
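
The residual-based detection described for the CSPHOT could be sketched as follows, assuming scikit-learn and a feature matrix already extracted from known-good data; the estimator settings and the threshold are illustrative rather than the operational configuration.

    # Sketch: train a Random Forest on known-good data, then flag samples whose
    # standardized prediction residuals are large. scikit-learn is assumed.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def fit_good_data_model(X_good, y_good):
        # Train on known-good samples only, so the model learns normal behavior.
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(X_good, y_good)
        return model

    def flag_anomalies(model, X, y, threshold=3.0):
        # Large standardized residuals mark samples that deviate from good behavior.
        residuals = y - model.predict(X)
        score = np.abs(residuals) / np.std(residuals)
        return score > threshold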

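The exact form of the Fast Fourier Transform test for shadowband misalignment was not detailed in the session; the sketch below only illustrates the general idea, under the assumption that misalignment produces excess spectral power in a known frequency band of the irradiance signal (the band limits and ratio are placeholders).

    # Sketch of the general idea only: assume misalignment shows up as excess
    # spectral power in a known frequency band of the measured irradiance.
    import numpy as np

    def misalignment_suspected(irradiance, sample_rate_hz, band=(0.01, 0.05), ratio=10.0):
        # Power spectrum of the mean-removed signal; flag excess power in the band.
        spectrum = np.abs(np.fft.rfft(irradiance - np.mean(irradiance))) ** 2
        freqs = np.fft.rfftfreq(len(irradiance), d=1.0 / sample_rate_hz)
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        return bool(spectrum[in_band].mean() > ratio * np.median(spectrum[1:]))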

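The Support Vector Machine step for local source detection can likewise be sketched with scikit-learn, assuming time-aligned AOS readings plus wind information as features and camera-derived event labels for training; all names are placeholders.

    # Sketch: classify time-aligned AOS samples as "local source present" or not,
    # using camera-derived labels for training. scikit-learn is assumed.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_source_detector(X_train, labels_train):
        # X_train: one row per sample at the lowest common sampling rate; columns
        # are AOS instrument readings plus wind speed/direction.
        # labels_train: 0/1 camera-derived labels (local source present or not).
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
        clf.fit(X_train, labels_train)
        return clf
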
Xiao Chen discussed various types of machine learning-related uncertainty, including: 1) instrument errors (precision), 2) measurement issues (e.g., random error and bias), 3) random variability, 4) retrieval uncertainty, 5) environmental effects, and 6) modeling errors. Xiao Chen also discussed application of the machine learning model to three broad categories of data: 1) “good” data (can be trusted), 2) “potentially good” data (have small errors and thus can be corrected), and 3) “bad” data (cannot be trusted, but some features can be used to find the cause). There are two main limitations of this application: 1) only average behavior on a daily basis is represented, and 2) there is a strong dependence on the machine learning algorithm and its parameters. More effort is needed to refine the workflow and algorithms. Future work includes application to other instruments and feature reduction. Three major products are expected: 1) error bars on the data, 2) event-related probabilities (which help to detect natural or seasonal variations), and 3) anomaly source identification (which helps to identify the source of an anomaly, something that is difficult with level-1 machine learning alone).

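One simple way to realize the three categories is to threshold standardized model residuals, as in the illustrative sketch below; the thresholds are placeholders rather than the project's actual criteria.

    # Sketch: sort samples into "good" / "potentially good" / "bad" from model residuals.
    # The thresholds are illustrative placeholders only.
    import numpy as np

    def categorize(observed, predicted, small=1.0, large=3.0):
        z = np.abs(observed - predicted) / np.std(observed - predicted)
        return np.where(z < small, "good",
                        np.where(z < large, "potentially good", "bad"))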


Marcus van Lier-Walqui, on behalf of Derek Posselt, outlined the Bayesian MCMC approach and its application to retrieving cloud properties. It is well known that cloud retrievals have a range of possible solutions for given measurement uncertainties; thus, there is a need to quantify the range of solutions, the bias, and the benefits of integrated measurements. Such quantification applies Bayes’ theorem, which allows one to estimate the likelihood of a state and identify the range of solutions consistent with the given information. The current project uses a Bayesian MCMC approach to infer cloud properties from Doppler radar data.

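As a toy illustration of the Bayesian MCMC idea (drawing samples whose spread represents the range of retrieval solutions consistent with the measurements), a minimal Metropolis-Hastings sketch is shown below; the forward model, prior bounds, and noise level are purely illustrative and are not the cloud-property retrieval itself.

    # Sketch: Metropolis-Hastings MCMC for a one-parameter toy retrieval.
    # forward(), the prior bounds, and the noise level are illustrative only.
    import numpy as np

    def forward(x):                        # hypothetical forward model
        return 2.0 * x + 1.0

    def log_posterior(x, obs, noise_sd=0.5, lo=0.0, hi=10.0):
        if not (lo <= x <= hi):            # uniform prior support
            return -np.inf
        return -0.5 * np.sum((obs - forward(x)) ** 2) / noise_sd ** 2

    def mcmc(obs, n_steps=5000, step=0.2, x0=5.0, seed=0):
        rng = np.random.default_rng(seed)
        x, lp = x0, log_posterior(x0, obs)
        samples = []
        for _ in range(n_steps):
            x_new = x + step * rng.standard_normal()
            lp_new = log_posterior(x_new, obs)
            if np.log(rng.uniform()) < lp_new - lp:    # Metropolis acceptance
                x, lp = x_new, lp_new
            samples.append(x)
        return np.array(samples)           # sample spread ~ retrieval uncertainty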


Marcus van Lier-Walqui summarized recent updates on the Bayesian approach and its application to developing a new cloud microphysical parametrization scheme. The current project, BOSS (Bayesian Observationally-constrained Statistical-physical Scheme), aims to constrain parameters using radar observations and to estimate uncertainties associated with different sources, such as noise, calibration, and sampling uncertainty.

Issues

N/A

Needs

N/A

Decisions

All methods hold promise for either improving data quality or quantifying uncertainties. The group decided to continue developing and improving the new techniques introduced in the previous section while creating stronger links between the existing activities; the expected future efforts are described in the Future Plans section.

Future Plans

Our group is directing its future efforts toward creating stronger links between the existing activities while continuing to develop and improve the new techniques introduced in the previous section.



Comparisons of multiple instruments. So far, only aerosol optical depths from three instruments have been considered. It was decided to include: 1) aerosol optical depths from additional instruments, such as a new radiometer (SAS-He) with improved spectral resolution and extended spectral coverage (from ultraviolet to near-infrared), and 2) calibration-independent quantities, such as the wavelength dependence of direct-to-diffuse ratios from colocated instruments (the calibration factor cancels in this ratio because the same detector measures both components). As part of the “creating stronger links” task, it was suggested that the selected “good” cases be used as potential candidates for the training data required by the machine learning projects.



Machine learning projects. The discussion led to plans for: 1) getting the output into the archive, and 2) trying to automate Data Quality Reports. A benefit of Data Quality Reports is that data can be filtered upon delivery to the user. The output could also be provided in a value-added product, perhaps as a C1 data set with flags.



Data uncertainty quantification projects. We will consider reducing the machine learning fitting errors by applying regression algorithms such as Gaussian processes, autoregressive moving-average models, and deep-learning algorithms. The extracted data feature vector will be a high-dimensional random vector with a limited number of realizations (the number of training days), which poses a challenge to building an adequately accurate machine learning model. We will address this challenge by considering nonlinear dimension reduction methods that use kernel-based manifold learning techniques, such as the polynomial, Gaussian, and diffusion kernels. These algorithms and techniques allow the underlying machine learning and uncertainty quantification models to be trained more effectively, increasing the accuracy of the predictions. Once adequately accurate machine learning and uncertainty quantification models are obtained for the training data set, statistical error bars and other statistical products can be provided for the test data set based on the feature extraction and correction algorithms.

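As one possible realization of this direction, a kernel-based dimension reduction followed by Gaussian process regression yields predictions with error bars directly. The sketch below uses scikit-learn; the kernels, number of components, and variable names are illustrative assumptions rather than the project's actual configuration.

    # Sketch: kernel PCA for nonlinear dimension reduction of the feature vector,
    # followed by Gaussian process regression to obtain predictions with error bars.
    # scikit-learn is assumed; kernels, dimensions, and names are illustrative.
    from sklearn.decomposition import KernelPCA
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel
    from sklearn.pipeline import make_pipeline

    def fit_uq_model(X_train, y_train, n_components=5):
        model = make_pipeline(
            KernelPCA(n_components=n_components, kernel="rbf"),
            GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True))
        model.fit(X_train, y_train)
        return model

    # Predictive mean and standard deviation (the error bars) for new data:
    # y_mean, y_std = fit_uq_model(X_train, y_train).predict(X_test, return_std=True)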


Bayesian approaches. Additional development and testing of the methods will be done as the projects continue. Continued communication between ASR principal investigators and translators is planned to make the techniques and products developed in the ASR projects available for use by ARM as appropriate. Marcus van Lier-Walqui volunteered to share generic Bayesian code with the ARM infrastructure for use in other projects as well.

Action Items

N/A