22/04a: Work in Progress

Online Data Analytics for Scientic Numerical Simulations with Python and Ray

(gl) Análise de datos en liña para simulacións científicas con Python e Ray
(es) Análisis de datos en línea para simulaciones científicas con Python y Ray

Student

Iago Fernández Picos

Supervision

Emilio José Padrón González (UDC)

Brief description

Large-scale simulations are producing an ever-growing amount of data that is becoming prohibitively costly, in terms of time and energy, to save to disks, and next to retrieve and process during the post-hoc data analysis phase. To circumvent this bottleneck, in-situ analytics [1] proposes to start processing data online, as soon as made available by the simulation in the memories of the compute nodes (or using other nodes in the same cluster, known as in-transit analysis). The benefits are:

  • Alleviate I/O pressure: Raw data produced by the simulation can start to be reduced before moving out of the compute nodes, saving on data movements and on the amount of data to store to disk.
  • High performance data analysis at scale: Part of data analysis can be performed on the same supercomputer as the one booked for the simulation. The process can be massively parallelized, reading data from memory and not from disk, reducing the time for performing these tasks.

This integration of data analytics with large-scale simulations represents a new kind of workflow. Scientists need to rethink the way to use the available data movement and storage budgets and the way to take advantage of the compute resources for advanced data processing. So far, only a few framework prototypes have been developed to investigate some key concepts, with experiments with simple analysis scenarios.

The goal of this project proposal is to investigate and develop algorithms to enable advanced in-situ/in-transit processing of scientific data from numerical simulations with the Python framework Ray. The Python language has become one of the most popular programming languages in the scientific computing community, mainly because of the benefits it offers for fast code development. Python’s strength in general purpose programming, combined with a great ecosystem and library support, makes it an excellent choice for data analysis tasks and building data-centric applications.

Although the performance of pure Python code is often far from being optimal, there are multiple ways to optimise and parallelise Python programs, particularly in the context of scientific and high performance computing. Particularly, frameworks such as Dask [2] and Ray [3] are focused on developing large scale distributed applications. This kind of frameworks can be a good fit for a stream processing approach for analysing in-situ/in-transit results from parallel numerical simulations, analogous to [4], an analytics stack based on big data technologies (Apache Flink, HBase…), but with a less restrictive programming model than map/reduce-like paradigms.

Figure 1: Overview of the proposed system

Even though Dask is getting quite a lot of attention in the HPC world, whereas Ray is probably more popular in the machine learning community, we think Ray could maybe have more potential, so Ray is the platform we will use in this work for the distributed python execution.

[1] Lessons Learned from Building In Situ Coupling Frameworks.
Matthieu Dorier, Matthieu Dreher, Tom Peterka, Gabriel Antoniu, Bruno Raffin, Justin M. Wozniak.
ISAV 2015 – First Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (held in conjunction with SC15), Nov 2015, Austin, United States.
https://hal.inria.fr/hal-01224846
[2] DASK: Dask natively scales Python
https://dask.org
[3] RAY: Fast and Simple Distributed Computing
https://ray.io
[4] In-Transit Molecular Dynamics Analysis with Apache Flink.
Henrique C. Zanuz, Bruno Raffin, Omar A. Mures, Emilio J. Padrón.
ISAV 2018 – Fourth Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (held in conjunction with SC18), Nov 2018, Dallas, United States.
https://hal.inria.fr/hal-01889939

Specific objectives

  • The main objective of this project is to develop an analytics pipeline based on Ray for the online processing of scientic data from large-scale numerical simulations.

  • The proposed analytics pipeline will be able to be deployed at scale in the same HPC environment the numerical simulations is running on.

  • Different analysis kernels will be implemented to test the proposal.

Methodology

An Agile development method will guide the project, with relatively short sprints to build the different analysis kernels, after a preliminary work of study and documentation.

Development steps

  • Analysis of requirements and project scheduling, according to student disponibility.

  • Study and documentation.

    • Ray basics and deployment of distributied Python applications with Ray
    • Communicating Ray with an external application
  • Incremental, iterative work sequences (sprints) to develop the analytics pipeline, and some analysis kernels, and integrate it with a numerical simulation.

Material

  • Personal computer with internet access.

  • Access to HPC resources will be provided to the student.

Teaching and Researching in Computer Science/Engineering

My research interests include High Performance Computing (HPC) and Computer Graphics.