Addressing failures in exascale computing M Snir, RW Wisniewski, JA Abraham, SV Adve, S Bagchi, P Balaji, J Belak, ... The International Journal of High Performance Computing Applications 28 (2 …, 2014 | 451 | 2014 |
Memory errors in modern systems: The good, the bad, and the ugly V Sridharan, N DeBardeleben, S Blanchard, KB Ferreira, J Stearley, ... ACM SIGARCH Computer Architecture News 43 (1), 297-310, 2015 | 307 | 2015 |
Feng shui of supercomputer memory: positional effects in DRAM and SRAM faults V Sridharan, J Stearley, N DeBardeleben, S Blanchard, S Gurumurthi SC'13: Proceedings of the International Conference on High Performance …, 2013 | 204 | 2013 |
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation D Tiwari, S Gupta, J Rogers, D Maxwell, P Rech, S Vazhkudai, D Oliveira, ... 2015 IEEE 21st International Symposium on High Performance Computer …, 2015 | 146 | 2015 |
On the diversity of cluster workloads and its impact on research results G Amvrosiadis, JW Park, GR Ganger, GA Gibson, E Baseman, ... 2018 USENIX Annual Technical Conference (USENIX ATC 18), 533-546, 2018 | 95 | 2018 |
High-end computing resilience: Analysis of issues facing the HEC community and path-forward for research and development N DeBardeleben, J Laros, JT Daly, SL Scott, C Engelmann, B Harrod Whitepaper, Dec, 2009 | 81 | 2009 |
F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability Q Guan, N DeBardeleben, S Blanchard, S Fu Proceedings of the 2014 IEEE 28th International Parallel and Distributed …, 2014 | 77 | 2014 |
GPGPUs: How to Combine High Computational Power with High Reliability LB Gomez, F Cappello, L Carro, N DeBardeleben, B Fang, S Gurumurthi, ... | 75 | 2014 |
Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters WM Jones, JT Daly, N DeBardeleben Proceedings of the 19th ACM International Symposium on High Performance …, 2010 | 54 | 2010 |
Inter-agency workshop on HPC resilience at extreme scale J Daly, B Harrod, T Hoang, L Nowell, B Adolf, S Borkar, N DeBardeleben, ... National Security Agency Advanced Computing Systems, 2012 | 42 | 2012 |
Application monitoring and checkpointing in HPC: looking towards exascale systems WM Jones, JT Daly, N DeBardeleben Proceedings of the 50th Annual Southeast Regional Conference, 262-267, 2012 | 41 | 2012 |
BinFI: an efficient fault injector for safety-critical machine learning systems Z Chen, G Li, K Pattabiraman, N DeBardeleben Proceedings of the International Conference for High Performance Computing …, 2019 | 40 | 2019 |
Experimental and analytical study of Xeon Phi reliability D Oliveira, L Pilla, N DeBardeleben, S Blanchard, H Quinn, I Koren, ... Proceedings of the International Conference for High Performance Computing …, 2017 | 40 | 2017 |
Towards practical algorithm based fault tolerance in dense linear algebra P Wu, Q Guan, N DeBardeleben, S Blanchard, D Tao, X Liang, J Chen, ... Proceedings of the 25th ACM International Symposium on High-Performance …, 2016 | 37 | 2016 |
Developing scientific applications using Eclipse GR Watson, NA DeBardeleben Computing in Science & Engineering 8 (4), 50-61, 2006 | 37 | 2006 |
GPU behavior on a large HPC cluster N DeBardeleben, S Blanchard, L Monroe, P Romero, D Grunau, C Idler, ... European Conference on Parallel Processing, 680-689, 2013 | 34 | 2013 |
Experimental framework for injecting logic errors in a virtual machine to profile applications for soft error resilience N DeBardeleben, S Blanchard, Q Guan, Z Zhang, S Fu European Conference on Parallel Processing, 282-291, 2011 | 32 | 2011 |
TensorFI: A configurable fault injector for TensorFlow applications G Li, K Pattabiraman, N DeBardeleben 2018 IEEE International Symposium on Software Reliability Engineering …, 2018 | 30 | 2018 |
Interpretable anomaly detection for monitoring of high performance computing systems E Baseman, S Blanchard, N DeBardeleben, A Bonnie, A Morrow Outlier Definition, Detection, and Description on Demand Workshop at ACM …, 2016 | 26 | 2016 |
LetGo: A lightweight continuous framework for HPC applications under failures B Fang, Q Guan, N DeBardeleben, K Pattabiraman, M Ripeanu Proceedings of the 26th International Symposium on High-Performance Parallel …, 2017 | 25 | 2017 |