Argonne Leadership Computing Facility
an Office of Science user facility
DATA CATALOG
ALCF Public Data
Data Set: RAS_EVENT
Source: Argonne Leadership Computing Facility

COLUMN | DATA TYPE | DESCRIPTION | EXAMPLE |
---|---|---|---|
RECID | INTEGER | Unique ID for this RAS event | |
MSG_ID | VARCHAR(16) | The message ID referred to in the RAS Event Book | 00020013 |
CATEGORY | VARCHAR(32) | The category of a RAS message identifies the entity that encountered the error. See the RAS Event Book. Category names can be up to 16 characters long, and each category can be used by multiple components. Table 7-2 lists the possible RAS categories: BQC (Blue Gene/Q compute card), BQL (Blue Gene/Q link module), DDR (Double Data Rate memory), PCI (PCI adapter card), Ethernet (Ethernet adapter card), InfiniBand (InfiniBand adapter card), AC_TO_DC_PWR (Bulk Power Supply), DC_TO_DC_PWR (Power Module), Cable (Cable), Message_Unit (Message unit), Card (Generic Card/Board), Clocks (Clocks), Clock_FPGA (Clock FPGA), Service_Card (Service Card), IO_Board (I/O Board), Node_Board (Node Board), Icon (Icon FPGA), Palomino (Palomino FPGA), DCA (Direct Current Assembly Card), Fan (Fan), Fan_Assembly (Fan Assembly), Optical_Module (Optical Module), Temp_Sensor (Temperature sensor on a card or board), Job (Job), Block (Block), Process (Process or daemon), Coolant_Monitor (Coolant Monitor), Software_Error (Software error condition), ELF_Image (ELF Image error condition). | BQL |
COMPONENT | VARCHAR(16) | Hardware, software or operating system component as reported by the vendor of the supercomputer. | DIAGS |
SEVERITY | VARCHAR(8) | Each RAS message is assigned one of three severities. FATAL: a severe error event that presumably leads the application to fail or abort. WARN: a potentially harmful situation, such as exceeding a soft error threshold or failure of a redundant component. INFO: an informational message that highlights the progress of system software. | FATAL |
EVENT_TIME | TIMESTAMP | The timestamp of the event | 2015-11-21 17:34:33.58887 |
JOBID | INTEGER | A unique ID for each job on a machine. | 26587 |
BLOCK | VARCHAR(32) | Compute blocks on the system, defined by the vendor or the owner of the system. See section 2.2.7 of the IBM Redbook. | |
LOCATION | VARCHAR(64) | BLOCK and LOCATION together identify which midplanes were used for the event. | R28-M1-S |
SERIALNUMBER | VARCHAR(32) | Serial Number of the hardware component | |
CPU | INTEGER | Processor ID | 9 |
COUNT | INTEGER | | 1000023479 |
CTLACTION | VARCHAR(256) | Control Action referred to in the RAS Event Book. The Control Action identifies an action to be taken by the Control System when a RAS event occurs. Multiple Control Actions can be specified. Certain RAS events might specify Control Actions to end the job and free the block. Although the event is fatal to the job, the condition might be transient or soft, which means that hardware failures occur and cause an application to fail, but that the application can be restarted on the same hardware without hardware repair. Soft errors are most likely radiation induced and are a natural consequence of the impact of terrestrial and cosmic radiation sources, as well as radioactive impurities in the semiconductor and packaging materials close to the CMOS circuits. Soft errors can also occur because of defects in system software. Other failures are hard. Hard failures mean that the failed part permanently exhibits the failure mode, and the hardware must be replaced to fix the failure. A RAS event can include multiple Control Actions that indicate that the condition is likely hard and warrants marking the hardware in error. Marking hardware in error helps job schedulers avoid dispatching jobs to failing hardware. The possible Control Actions are: COMPUTE_IN_ERROR: marks a compute node or I/O node in Error state in the database. Any compute block that uses the compute node is prevented from booting. Any I/O block that uses the I/O node will boot, but any I/O links from compute nodes to the failed I/O node are not usable. BOARD_IN_ERROR: marks a node board or I/O drawer in Error state in the database and prevents any blocks that use the node board or I/O drawer from booting. CABLE_IN_ERROR: marks a cable in Error state in the database and prevents any blocks that use the cable from booting. END_JOB: ends all jobs associated with the compute node, I/O node, midplane, or node board. FREE_COMPUTE_BLOCK: frees a block associated with the compute node. SOFTWARE_IN_ERROR: marks a node in software error state in the database. The state is reset back to the Available state upon freeing the block. BQL_SPARE: spares the bad wire on the link chip. RACK_IN_ERROR: marks all node boards and I/O drawers in the rack in Error state in the database and prevents any blocks that use the node boards and I/O drawers in the rack from booting. DCA_IN_ERROR: marks the DCA in Error state in the database. (Source: IBM System Blue Gene Solution: Blue Gene/Q System Administration.) | COMPUTE_IN_ERROR |
MESSAGE | VARCHAR(1024) | Message referred to in the RAS Event Book | |
DIAGS | VARCHAR(1) | Diagnostics, usually not useful information | F |
QUALIFIER | VARCHAR(32) | | 973625195 |
MACHINE_NAME | VARCHAR(32) | The name of the supercomputer | |
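
The column list above can be exercised with a small, self-contained Python sketch. It is illustrative only and not part of the ALCF data set: it assumes the events are loaded into an in-memory SQLite table named `RAS_EVENT`, and the helper names (`control_actions`, `fatal_events`, `demo_connection`) are hypothetical. The sample rows reuse values shown in the catalog, but the comma-separated `CTLACTION` value is an assumption, since the catalog does not say how multiple Control Actions are delimited. For that reason the parser matches the documented action names rather than splitting on a separator.

```python
import re
import sqlite3

# Columns and types as listed in the catalog. SQLite accepts the VARCHAR(n)
# and TIMESTAMP type names but does not enforce the declared lengths.
SCHEMA = """
CREATE TABLE RAS_EVENT (
    RECID        INTEGER,
    MSG_ID       VARCHAR(16),
    CATEGORY     VARCHAR(32),
    COMPONENT    VARCHAR(16),
    SEVERITY     VARCHAR(8),
    EVENT_TIME   TIMESTAMP,
    JOBID        INTEGER,
    BLOCK        VARCHAR(32),
    LOCATION     VARCHAR(64),
    SERIALNUMBER VARCHAR(32),
    CPU          INTEGER,
    "COUNT"      INTEGER,
    CTLACTION    VARCHAR(256),
    MESSAGE      VARCHAR(1024),
    DIAGS        VARCHAR(1),
    QUALIFIER    VARCHAR(32),
    MACHINE_NAME VARCHAR(32)
)
"""

# Control Action names documented for the CTLACTION column.
CONTROL_ACTIONS = (
    "COMPUTE_IN_ERROR", "BOARD_IN_ERROR", "CABLE_IN_ERROR", "END_JOB",
    "FREE_COMPUTE_BLOCK", "SOFTWARE_IN_ERROR", "BQL_SPARE",
    "RACK_IN_ERROR", "DCA_IN_ERROR",
)
_ACTION_RE = re.compile(r"\b(" + "|".join(CONTROL_ACTIONS) + r")\b")


def control_actions(ctlaction):
    """Return the documented Control Action names found in a CTLACTION value.

    Matching by name avoids guessing the delimiter between multiple actions.
    """
    return _ACTION_RE.findall(ctlaction or "")


def fatal_events(conn):
    """Return (RECID, LOCATION, CTLACTION) for every FATAL event."""
    cur = conn.execute(
        "SELECT RECID, LOCATION, CTLACTION FROM RAS_EVENT "
        "WHERE SEVERITY = 'FATAL' ORDER BY RECID"
    )
    return cur.fetchall()


def demo_connection():
    """Build an in-memory database holding two hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    rows = [
        # The CTLACTION value and its comma delimiter are assumptions.
        (1, "00020013", "BQL", "DIAGS", "FATAL",
         "2015-11-21 17:34:33.58887", 26587, "R28-M1-S", 9,
         "END_JOB,FREE_COMPUTE_BLOCK"),
        (2, "00020013", "BQL", "DIAGS", "INFO",
         "2015-11-21 17:34:33.58887", 26587, "R28-M1-S", 9, None),
    ]
    conn.executemany(
        "INSERT INTO RAS_EVENT (RECID, MSG_ID, CATEGORY, COMPONENT, SEVERITY, "
        "EVENT_TIME, JOBID, LOCATION, CPU, CTLACTION) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn


conn = demo_connection()
for recid, location, ctlaction in fatal_events(conn):
    print(recid, location, control_actions(ctlaction))
```

The same query pattern should carry over to other SQL engines, although `COUNT` may need quoting there as well because it doubles as a function name.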