Work in Progress: 20190129

Online data analysis for scientific simulations with Apache Flink

Student

Vo Thi Quynh Yen

Supervision

Bruno Raffin (Inria Rhône-Alpes, Univ. Grenoble Alpes)
Emilio José Padrón González (UDC)

Brief description

Large-scale simulations produce an ever-growing amount of data that is becoming prohibitively costly, in terms of time and energy, to save to disk and then to retrieve and process during the post-hoc data analysis phase. To circumvent this bottleneck, in-situ analytics [1] proposes to start processing data online, as soon as the simulation makes it available in the memory of the compute nodes (or on other nodes of the same cluster, an approach known as in-transit analysis). The benefits are:

  • Raw data produced by the simulation can start to be reduced before it leaves the compute nodes, saving on data movement and on the amount of data stored to disk.
  • Part of the data analysis can be performed on the same supercomputer as the one booked for the simulation. The process can be massively parallelized, reading data from memory rather than from disk, which reduces the time needed for these tasks.

This integration of data analytics with large-scale simulations represents a new kind of workflow. Scientists need to rethink how they use the available data-movement and storage budgets, and how they take advantage of the compute resources for advanced data processing. So far, only a few framework prototypes have been developed to investigate some key concepts, experimenting with simple analysis scenarios.

The goal of this project proposal is to investigate and develop algorithms that enable advanced in-situ/in-transit processing of scientific data from numerical simulations with the ‘Big Data’ framework Apache Flink. Map/Reduce solutions initially targeted batch data processing, but the need to process continuous streams of data, such as tweets, led to a new breed of tools like Flink [2], able to connect to stream sources and trigger online analysis every time a user-defined window of events is filled. These stream processing approaches have only recently been investigated for analysing the results of large-scale parallel simulations [3].
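The window-triggered model described above can be illustrated with a minimal pure-Python sketch (this is not Flink code; the stream contents and the window size of 3 are made-up assumptions for illustration only):

```python
from statistics import mean

def tumbling_windows(events, size):
    """Group a stream of events into fixed-size (tumbling) count windows."""
    window = []
    for event in events:
        window.append(event)
        if len(window) == size:   # window is full: trigger the analysis
            yield window
            window = []

# Hypothetical stream of scalar values emitted by a simulation.
stream = [0.5, 1.5, 2.0, 4.0, 3.0, 5.0]

# Trigger an online analysis (here, a mean) each time a window of 3 fills.
results = [mean(w) for w in tumbling_windows(stream, size=3)]
print(results)  # [1.3333333333333333, 4.0]
```

In Flink the same idea is expressed declaratively (e.g. with count or time windows on a DataStream), with the runtime handling parallelism and distribution.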

In-situ processing can in fact be seen as a special case of stream processing where the data are produced not by a web server but by a large-scale parallel simulation. Expected benefits include a user interface that does not require extensive parallel-programming expertise to develop analysis kernels, kernels that can be used for both in-situ and post-hoc analysis, interoperability with advanced massive key/value stores such as Cassandra, and out-of-the-box support for fault tolerance and multi-tenant analysis execution.
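As a toy illustration of a kernel shared between in-situ and post-hoc use (a pure-Python sketch; the center-of-mass kernel and the two-timestep trajectory are made-up assumptions):

```python
def center_of_mass(positions):
    """Analysis kernel: mean position of a set of particles (unit masses)."""
    n = len(positions)
    return tuple(sum(p[i] for p in positions) / n for i in range(3))

# Hypothetical trajectory: per-timestep lists of particle positions.
trajectory = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],  # timestep 0
    [(0.0, 2.0, 0.0), (2.0, 2.0, 0.0)],  # timestep 1
]

# Post-hoc: read the stored trajectory and apply the kernel to every step.
post_hoc = [center_of_mass(frame) for frame in trajectory]
print(post_hoc)  # [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]

# In-situ/in-transit: the very same kernel runs as each timestep arrives,
# before the data ever reaches the disk.
def on_new_timestep(frame, sink=print):
    sink(center_of_mass(frame))

for frame in trajectory:
    on_new_timestep(frame)
```

The point is that the kernel itself is agnostic to whether its input comes from a file (batch) or from the running simulation (stream), which is precisely what a unified batch/stream engine like Flink exploits.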

[1] Lessons Learned from Building In Situ Coupling Frameworks.
Matthieu Dorier, Matthieu Dreher, Tom Peterka, Gabriel Antoniu, Bruno Raffin, Justin M. Wozniak.
ISAV 2015 – First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (held in conjunction with SC15), Nov 2015, Austin, United States.
https://hal.inria.fr/hal-01224846
[2] Apache Flink: Scalable Stream and Batch Data Processing.
https://flink.apache.org
[3] In-Transit Molecular Dynamics Analysis with Apache Flink.
Henrique C. Zanuz, Bruno Raffin, Omar A. Mures, Emilio J. Padrón.
ISAV 2018 – Fourth Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (held in conjunction with SC18), Nov 2018, Dallas, United States.
https://hal.inria.fr/hal-01889939

Specific objectives

  • The main objective of this project is to develop analysis kernels for the online processing of scientific data from large-scale numerical simulations with Flink.

  • These kernels will operate within an HPC infrastructure for the in-transit analysis of scientific data based on Flink, currently under development.

  • The specific application domain(s) will be determined during the development of the project, but we will probably target (at least) Molecular Dynamics simulations.

Methodology

An Agile development method will guide the project, with relatively short sprints to build the different analysis kernels after a preliminary phase of study and documentation.

Development steps

  • Analysis of requirements and project scheduling, according to the student's availability.

  • Study and documentation.

    • The Map/Reduce paradigm and the Apache Flink framework.
    • Molecular Dynamics simulations (and other numerical simulations we may target for online analysis kernels).
  • Incremental, iterative work sequences (sprints) to develop analysis kernels using Flink and to integrate them into the work-in-progress HPC infrastructure for the in-transit analysis of scientific data mentioned above.
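To fix ideas about the Map/Reduce paradigm listed in the study phase, here is a minimal pure-Python sketch of the model (the speed values and the binning scheme are made-up assumptions, unrelated to Flink's actual API):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal Map/Reduce: map each record to (key, value) pairs,
    group the values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# Hypothetical per-particle speeds; bin them into a coarse histogram.
speeds = [0.2, 0.7, 1.4, 1.6, 0.9]
histogram = map_reduce(
    speeds,
    mapper=lambda v: [(int(v), 1)],  # key: integer speed bin
    reducer=sum,                     # count per bin
)
print(histogram)  # {0: 3, 1: 2}
```

In a distributed engine, the map and reduce phases run in parallel across workers, with a shuffle grouping values by key in between; Flink generalizes this model to unbounded streams.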

Material

  • Personal computer with internet access.

  • Access to HPC resources will be provided to the student.
