Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU

Yim, K.S., Phan, C., Saleheen, M., Kalbarczyk, Z., Iyer, R.
Citation:

IEEE International Parallel & Distributed Processing Symposium (IPDPS)pp.287-300, 16-20 May 2011.

Visit Publisher Online Entry:
Abstract:

High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general purpose high performance computing (HPC). However, the emerging HPC applications have usually stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data corruption (SDC) errors in the HPC programs executing on GPU. This SDC ratio is significantly higher than that measured in CPU programs (<;2.3%). In order to tolerate SDC errors, customized error detectors are strategically placed in the source code of target GPU programs so as to minimize performance impact and error propagation and maximize recoverability. The presented Hauberk technique is deployed in seven HPC benchmark programs and evaluated using a fault injection. The results show a high average error detection coverage (~87%) with a small performance overhead (~15%).

Publication Status:
Published
Publication Type:
Proceedings
Publication Date:
05/16/2011
Copyright Notice:

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

  1. The following copyright notice applies to all of the above items that appear in IEEE publications: "Personal use of this material is permitted. However, permission to reprint/publish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from IEEE."

  2. The following copyright notice applies to all of the above items that appear in ACM publications: "© ACM, effective the year of publication shown in the bibliographic information. This file is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the journal or proceedings indicated in the bibliographic data for each item."

  3. The following copyright notice applies to all of the above items that appear in IFAC publications: "Document is being reproduced under permission of the Copyright Holder. Use or reproduction of the Document is for informational or personal use only."