Argonne Leadership Computing Facility Home
an Office of Science user facility

ALCF Public Data
Data Set: RAS_EVENT
Source: Argonne Leadership Computing Facility

See the RAS Event Book for more information.

COLUMN | DATA TYPE | DESCRIPTION | EXAMPLE
RECID | INTEGER | Unique ID for this RAS event |
MSG_ID | VARCHAR(16) | The MSG ID referred to in the RAS Event Book | 00020013
CATEGORY | VARCHAR(32) | The category portion of the RAS message is the entity that encountered an error. See the RAS Event Book. Category names can be up to 16 characters long. Each category can be used by multiple components. Table 7-2 lists the possible RAS categories. | BQL
  Table 7-2 RAS categories
  RAS category | Description
  BQC | Blue Gene/Q compute card
  BQL | Blue Gene/Q link module
  DDR | Double Data Rate Memory
  PCI | PCI adapter card
  Ethernet | Ethernet adapter card
  InfiniBand | InfiniBand adapter card
  AC_TO_DC_PWR | Bulk Power Supply
  DC_TO_DC_PWR | Power Module
  Cable | Cable
  Message_Unit | Message unit
  Card | Generic Card/Board
  Clocks | Clocks
  Clock_FPGA | Clock FPGA
  Service_Card | Service Card
  IO_Board | I/O Board
  Node_Board | Node Board
  Icon | Icon FPGA
  Palomino | Palomino FPGA
  DCA | Direct Current Assembly Card
  Fan | Fan
  Fan_Assembly | Fan Assembly
  Optical_Module | Optical Module
  Temp_Sensor | Temperature sensor on a card or board
  Job | Job
  Block | Block
  Process | Process or daemon
  Coolant_Monitor | Coolant Monitor
  Software_Error | Software error condition
  ELF_Image | ELF Image error condition
COMPONENT | VARCHAR(16) | Hardware, software, or operating system component as reported by the vendor of the supercomputer. | DIAGS
SEVERITY | VARCHAR(8) | Each RAS message is assigned one of three severities: FATAL, WARN, or INFO. | FATAL
  FATAL: Designates a severe error event that presumably leads the application to fail or abort.
  WARN: Designates potentially harmful situations, such as exceeding a soft error threshold or failure of a redundant component.
  INFO: Designates informational messages that highlight the progress of system software.
EVENT_TIME | TIMESTAMP | The timestamp of the event | 2015-11-21 17:34:33.58887
JOBID | INTEGER | A unique ID for each job on a machine. | 26587
BLOCK | VARCHAR(32) | Compute blocks on the system, defined by the vendor or the owner of the system. See section 2.2.7 of the IBM Redbook. |
LOCATION | VARCHAR(64) | Block and Location are used to determine which midplanes were used for that event. | R28-M1-S
SERIALNUMBER | VARCHAR(32) | Serial number of the hardware component |
CPU | INTEGER | Processor ID | 9
COUNT | INTEGER | | 1000023479
CTLACTION | VARCHAR(256) | Control Action referred to in the RAS Event Book. The Control Action identifies an action to be taken by the Control System when a RAS event occurs. Multiple Control Actions can be specified.
Certain RAS events might specify Control Actions to end the job and free the block. Although the event is fatal to the job, the condition might be transient or soft: a hardware failure occurs and causes an application to fail, but the application can be restarted on the same hardware without hardware repair. Soft errors are most likely radiation-induced and are a natural consequence of the impact of terrestrial and cosmic radiation sources, as well as radioactive impurities in the semiconductor and packaging materials close to the CMOS circuits. Soft errors can also occur because of defects in system software.
Other failures are hard. A hard failure means that the failed part permanently exhibits the failure mode, and the hardware must be replaced to fix the failure. A RAS event can include multiple Control Actions that indicate that the condition is likely hard and warrants marking the hardware in error. Marking hardware in error helps job schedulers avoid dispatching jobs to failing hardware.
The possible Control Actions are:
COMPUTE_IN_ERROR: This action marks a compute node or I/O node in Error state in the database. Any compute block that uses the compute node is prevented from booting. Any I/O block that uses the I/O node will boot, but any I/O links from compute nodes to the failed I/O node are not usable.
BOARD_IN_ERROR: This action marks a node board or I/O drawer in Error state in the database and prevents any blocks that use the node board or I/O drawer from booting.
CABLE_IN_ERROR: This action marks a cable in Error state in the database and prevents any blocks that use the cable from booting.
END_JOB: This action ends all jobs associated with the compute node, I/O node, midplane, or node board.
FREE_COMPUTE_BLOCK: This action frees a block associated with the compute node.
SOFTWARE_IN_ERROR: This action marks a node in software error state in the database. The state is reset back to the Available state upon freeing the block.
BQL_SPARE: This action spares the bad wire on the link chip.
RACK_IN_ERROR: This action marks all node boards and I/O drawers in the rack in Error state in the database and prevents any blocks that use the node boards and I/O drawers in the rack from booting.
DCA_IN_ERROR: This action marks the DCA in Error state in the database.
Example: COMPUTE_IN_ERROR
MESSAGE | VARCHAR(1024) | Message referred to in the RAS Event Book |
DIAGS | VARCHAR(1) | Diagnostics, usually not useful information | F
QUALIFIER | VARCHAR(32) | | 973625195
MACHINE_NAME | VARCHAR(32) | The name of the supercomputer |
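
As an illustration of how the columns above might be consumed, the following is a minimal Python sketch that validates SEVERITY, parses EVENT_TIME, and splits CTLACTION into individual Control Actions. Several things here are assumptions rather than facts from this catalog: the names RasEvent, parse_ctlaction, and parse_event are hypothetical; rows are assumed to arrive as dictionaries of strings keyed by column name; multiple Control Actions are assumed to be comma-separated, which the catalog does not state; and the sample row is invented by combining example values from the table (the RECID value is made up).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

# Values copied from the SEVERITY and CTLACTION descriptions in the catalog.
SEVERITIES = {"FATAL", "WARN", "INFO"}
CONTROL_ACTIONS = {
    "COMPUTE_IN_ERROR", "BOARD_IN_ERROR", "CABLE_IN_ERROR", "END_JOB",
    "FREE_COMPUTE_BLOCK", "SOFTWARE_IN_ERROR", "BQL_SPARE",
    "RACK_IN_ERROR", "DCA_IN_ERROR",
}
# Our reading of the descriptions: actions that mark hardware in Error state.
HARDWARE_IN_ERROR = {
    "COMPUTE_IN_ERROR", "BOARD_IN_ERROR", "CABLE_IN_ERROR",
    "RACK_IN_ERROR", "DCA_IN_ERROR",
}


@dataclass
class RasEvent:
    """Hypothetical container for a subset of the RAS_EVENT columns."""
    recid: int
    msg_id: str
    severity: str
    event_time: datetime
    jobid: Optional[int]
    location: str
    ctlactions: List[str]

    def marks_hardware_in_error(self) -> bool:
        """True if any Control Action marks hardware in Error state."""
        return any(a in HARDWARE_IN_ERROR for a in self.ctlactions)


def parse_ctlaction(value: str) -> List[str]:
    """Split a CTLACTION value into individual Control Actions.

    ASSUMPTION: multiple actions are comma-separated.
    """
    actions = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [a for a in actions if a not in CONTROL_ACTIONS]
    if unknown:
        raise ValueError(f"unknown Control Action(s): {unknown}")
    return actions


def parse_event(row: Dict[str, str]) -> RasEvent:
    """Build a RasEvent from a dict of strings keyed by column name."""
    severity = row["SEVERITY"].strip()
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    jobid = row.get("JOBID", "").strip()
    return RasEvent(
        recid=int(row["RECID"]),
        msg_id=row["MSG_ID"].strip(),
        severity=severity,
        # ASSUMPTION: timestamps always carry fractional seconds.
        event_time=datetime.strptime(
            row["EVENT_TIME"].strip(), "%Y-%m-%d %H:%M:%S.%f"
        ),
        jobid=int(jobid) if jobid else None,
        location=row.get("LOCATION", "").strip(),
        ctlactions=parse_ctlaction(row.get("CTLACTION", "")),
    )


# Invented sample row assembled from the example values in the table.
sample = {
    "RECID": "1",  # made-up value; the catalog gives no RECID example
    "MSG_ID": "00020013",
    "SEVERITY": "FATAL",
    "EVENT_TIME": "2015-11-21 17:34:33.58887",
    "JOBID": "26587",
    "LOCATION": "R28-M1-S",
    "CTLACTION": "END_JOB,FREE_COMPUTE_BLOCK,COMPUTE_IN_ERROR",
}
event = parse_event(sample)
```

Rejecting unknown severities and Control Actions up front is a design choice for this sketch; a more tolerant loader could collect unrecognized values instead of raising an error.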