Do moldable applications perform better on failure-prone HPC platforms? - INRIA - Institut National de Recherche en Informatique et en Automatique
Report (Research Report) Year: 2018

Do moldable applications perform better on failure-prone HPC platforms?

Abstract

This paper compares the performance of different approaches to tolerate failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to use rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with realistic application scenarios and make it publicly available for further use.
Main file
rr9174.pdf (1.11 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-01799498, version 1 (24-05-2018)

Identifiers

  • HAL Id: hal-01799498, version 1

Cite

Valentin Le Fèvre, George Bosilca, Aurelien Bouteiller, Thomas Herault, Atsushi Hori, et al. Do moldable applications perform better on failure-prone HPC platforms? [Research Report] RR-9174, Inria Grenoble Rhône-Alpes. 2018, pp.1-24. ⟨hal-01799498⟩
