Data Analytics with HPC
The growing volume of information available through the Internet demands efficient processing of large amounts of data. This has driven the development of new storage and processing techniques, known as Big Data techniques, that adapt naturally to distributed systems.
The main goal of this subject is to learn processing techniques suited to large volumes of information in the Big Data world, particularly those of the Hadoop ecosystem, and to compare them with the traditional techniques used in HPC environments. This will allow the student to select the most appropriate tools for a particular problem.
Educational and learning outcomes (RD 822/2021 degree programs) or competences (RD 1393/2007 degree programs)
- [AP01] CE1 - Define, evaluate and select the most appropriate architecture and software to solve a problem
- [AP02] CE2 - Analyze and improve the performance of a given architecture or software
- [BP01] CB6 - Possess and understand knowledge that provides a basis or opportunity for originality in developing and/or applying ideas, often in a research context
- [BP02] CB7 - Students should be able to apply their acquired knowledge and problem-solving ability in new or unfamiliar environments within broader (or multidisciplinary) contexts related to their field of study
- [BP06] CG1 - Be able to search for and select the useful information needed to solve complex problems, handling the bibliographic sources of the field with ease
- [BP08] CG3 - Be able to maintain and extend well-founded theoretical approaches to enable the introduction and exploitation of new and advanced technologies in the field
- [BP10] CG5 - Be able to work in a team, especially a multidisciplinary one, and be skilled in managing time and people and in decision making
- [CP01] CT1 - Use the basic tools of information and communication technologies (ICT) necessary for professional practice and for lifelong learning
- [CP04] CT4 - Value the importance of research, innovation and technological development in the socioeconomic and cultural advancement of society
Learning outcomes (RD 1393/2007 degree programs)
| Learning outcomes | Study programme competences / results |
|---|---|
| The student will be capable of installing, configuring, and managing the basic software for massive data processing. | AP1, AP2, BP2, BP6, BP8, BP10, CP1 |
| The student will be capable of coding massive data processing applications using domain-specific languages. | AP2, BP1, BP2, BP10, CP1 |
| The student will learn about Data Engineering tools (for Intake/Storage/Processing/Visualization). | AP1, AP2, BP1, BP2, CP1, CP4 |
| The student will learn the skills to search, select and manage Big Data-related resources (bibliography, software, etc.). | AP1, AP2, BP1, BP6, CP1, CP4 |
Contents
- Introduction to Data Engineering
  - HPC vs Big Data: similarities and differences in data management
  - Hardware and Software Technologies for High Performance Data Engineering
  - Data Engineering in HPC infrastructures vs. Cloud environments
- Introduction to Data Analytics
  - Exploratory Data Analytics
  - Introduction to Machine Learning
- Data Engineering phases
  - Modeling (Formats, Compression, Designing Schemas)
  - Intake (Periodicity, Transformations, Tools)
  - Storage (HDFS and NoSQL DBs, HBase, MongoDB, Cassandra)
  - Processing (Batch, Real-Time)
  - Orchestration
  - Analysis (SQL, Machine Learning, Graphs, UI)
  - Governance
  - Integration with BI (Visualization)
- Use cases
  - Applications to Internet of Things (Smart environments and Industry 4.0)
  - Applications to sciences and engineering