This document lists the possible RAS Events, along with a link to the details of each event.
RAS Events are uniquely identified by their message ID. The component is the software component that detects and reports the event. The list of components includes:
RAS Events can have one of the following severities:
Generated Sat May 17 15:26:07 2014
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00010001 | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel unexpected operation. IP=$(Address) LR=$(LR) ESR=$(ESR) DEAR=$(DEAR) MSR=$(MSR) IntCode=$(CODE) | ||
00010002 | FATAL | BQC | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel invalid number of cores. CoreMask=$(MASK) Number=$(COUNT) | ||
00010003 | FATAL | BQC | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel invalid personality options were specified. NodeConfig=$(Config) | ||
00010004 | FATAL | BQC | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel Network CRC Exchange failed. Link=$(LINK) Expected=$(%x,EXPECTED) Actual=$(%x,ACTUAL) | ||
00010005 | WARN | BQC | Power threshold exceeded on processor domain. Current=$(current) mA | |||
00010006 | WARN | DDR | Power threshold exceeded on memory domain. Current=$(current) mA | |||
00010007 | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel Internal assertion failure. FileStringPtr=$(FILE1) on line $(LINE). Function=$(FUNC) Assert=$(ASSERT) | ||
00010008 | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel Preload Application failure. JobID $(jobid). LoadStateError=$(loaderr) exitStatus=$(exitstatus) | ||
00010009 | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Kernel Unexpected Exit. $(core) $(processor) $(timebase) | ||
0001000A | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | CNK Unexpected GEA Interrupt | ||
0001000B | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | For message service $(%d,service) CNK protocol version $(%d,myver) does not match CIOS protocol version $(%d,ciosver). | ||
0001000C | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | CNK: Unable to connect CNV to address $(%lx,ADDR) port $(%ld,PORT) with return code $(%ld,RETCODE). Total failed nodes $(%ld,TOTALFAILS) | ||
0001000D | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | CNK Unexpected MU or ND interrupt. ND NFatal=$(nfe0) $(nfe1) Fatal=$(fe0) $(fe1) $(fe2) $(fe3) $(fe4) $(fe5) $(fe6) $(fe7) $(fe8) $(fe9) $(fe10) MU INTS=$(mu0) $(mu1) $(mu2) $(mu3) $(mu4) $(mu5) $(mu6) $(mu7) $(mu8) $(mu9) $(mu10) $(mu11) $(mu12) | ||
0001000E | WARN | Software_Error | CNK detected DCR violation. DCR_STATUS $(dcrnum)=$(status) | |||
0001000F | WARN | Software_Error | CNK detected un-delivered IPI Message. Sent from processor id $(%d,fromcpu) to processor id $(%d,tocpu) | |||
00010010 | WARN | Software_Error | CNK detected a NULL IPI target function pointer. Sent from processor id $(%d,fromcpu) to processor id $(%d,tocpu) | |||
00010011 | WARN | UPC | UPC Hardware error detected. $(UPC_C_INT_STATE), $(INTERNAL_ERROR_STATE), $(UPC_C_INT_FIRST), $(INTERNAL_ERROR_FIRST), $(INTERNAL_SW_INFO), $(INTERNAL_HW_INFO), $(SRAM_PARITY_INFO), $(IOSRAM_PARITY_INFO) | |||
00010012 | WARN | Software_Error | CNK could not open the specified mapfile |
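Rows in this listing carry fewer populated cells than the seven-column header, and the empty cells appear to have been pushed to the end of each row as bare `|` characters. The sketch below is one heuristic way to map such a row back onto named fields. It assumes (this is not stated in the listing) that a cell made only of uppercase names separated by commas is CTRL_ACTION, that a cell made only of digits is COUNT, and that the last populated cell is MESSAGE. The function name `parse_row` and the dictionary keys are hypothetical. PERIOD is not handled because no row shown here populates it.

```python
import re

# Assumption: control actions look like END_JOB,FREE_COMPUTE_BLOCK
ACTION_RE = re.compile(r"^[A-Z_]+(,[A-Z_]+)*$")
# Assumption: message IDs are exactly eight hexadecimal digits
MSG_ID_RE = re.compile(r"^[0-9A-Fa-f]{8}$")


def parse_row(line):
    """Parse one flattened row; return None for header or malformed lines."""
    # Drop the empty cells that were collapsed to the end of the row.
    cells = [c.strip() for c in line.split("|")]
    cells = [c for c in cells if c]
    if len(cells) < 4 or not MSG_ID_RE.match(cells[0]):
        return None

    row = {
        "msg_id": cells[0],
        "severity": cells[1],
        "category": cells[2],
        "control_actions": [],
        "count": None,
        "message": cells[-1],  # last populated cell
    }
    # Anything between CATEGORY and MESSAGE is classified by its shape.
    for cell in cells[3:-1]:
        if cell.isdigit():
            row["count"] = int(cell)
        elif ACTION_RE.match(cell):
            row["control_actions"] = cell.split(",")
    return row


if __name__ == "__main__":
    sample = ("00010002 | FATAL | BQC | SOFTWARE_IN_ERROR,END_JOB,"
              "FREE_COMPUTE_BLOCK | Kernel invalid number of cores. "
              "CoreMask=$(MASK) Number=$(COUNT) | ||")
    print(parse_row(sample))
```

Classifying the middle cells by shape rather than by position is a deliberate choice here, because the position of a cell in the flattened row no longer says which column it came from.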
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00020000 | FATAL | BQL | COMPUTE_IN_ERROR | BQL $(MISR) miscompare. Expected $(EXPECT) but read $(ACTUAL). | ||
00020001 | FATAL | BQL | COMPUTE_IN_ERROR | BQL $(PRPG) miscompare. Expected $(EXPECT) but read $(ACTUAL). | ||
00020002 | FATAL | BQL | Unable to read BQL $(MISR). | |||
00020003 | FATAL | BQL | Unable to read BQL $(PRPG). | |||
00020004 | FATAL | BQL | COMPUTE_IN_ERROR | ASIC Lbist has not been started. Lbist Status = $(VALUE). | ||
00020005 | INFO | BQL | LBIST is running. Lbist Status = $(VALUE). | |||
00020006 | INFO | BQL | LBIST complete. Lbist Status = $(VALUE). | |||
00020007 | FATAL | BQL | LBIST never completed. Lbist Status = $(VALUE). | |||
0002000E | FATAL | BQL | BQL-HSS BIST (PRBS) check failed on at least 1 4G port. Expected $(EXPECT) but read $(ACTUAL). | |||
0002000F | FATAL | BQL | BQL-HSS BIST (PRBS) check failed on at least 1 10G port. | |||
00020010 | INFO | BQL | All BQL-HSS BIST (PRBS) test passed. This BQL looks good. | |||
00020011 | INFO | BQL | Asic Lbist has passed for Clock Domain 0. | |||
00020012 | FATAL | BQL | Asic Lbist has failed for Clock Domain 0. | |||
00020013 | FATAL | BQL | Command Collision Error | |||
00020014 | FATAL | BQL | COMPUTE_IN_ERROR | BQL ACC SCSTAT Error = $(VALUE) $(MSG) | ||
00020015 | INFO | BQL | $(MESSAGE) | |||
00020016 | FATAL | BQL | $(MESSAGE) | |||
00020017 | INFO | BQL | $(MESSAGE) | |||
00020018 | FATAL | BQL | $(MESSAGE) | |||
00020019 | INFO | BQL | $(MESSAGE) | |||
0002001A | FATAL | BQL | $(MESSAGE) | |||
0002001B | INFO | BQL | $(MESSAGE) | |||
0002001C | FATAL | BQL | $(MESSAGE) | |||
0002001D | FATAL | BQL | $(MESSAGE) | |||
0002001E | INFO | BQL | $(MESSAGE) | |||
0002001F | FATAL | BQL | $(MESSAGE) | |||
00020020 | INFO | BQC | PULBIST passed on core $(CORE). | |||
00020021 | WARN | BQC | PULBIST failed on core $(CORE) after comparison with $(COUNT) signature(s). $(MISR) miscompare. Expected 0x$(EXPECT) but read 0x$(ACTUAL). Watch output for final service action. | |||
00020022 | WARN | BQC | PULBIST failed on core $(CORE) after comparison with $(COUNT) signature(s). $(PRPG) miscompare. Expected 0x$(EXPECT) but read 0x$(ACTUAL). Watch output for final service action. | |||
00020023 | WARN | BQC | $(WHAT) failed on the following $(NUMBER) core(s): $(CORE) after comparison with $(COUNT) signatures. Watch output for final service action. | |||
00020024 | INFO | BQC | PUABIST passed on core $(CORE). | |||
00020025 | WARN | BQC | PUABIST failed on core $(CORE). Failing array at $(ARRAY) with $(HEX). Watch output for final service action. | |||
00020026 | WARN | BQC | PUABIST failed on the following $(NUMBER) core(s): $(CORE). Watch output for final service action. | |||
00020027 | WARN | BQC | PU Selftest determined to activate core sparing on core: $(CORE) | |||
00020028 | FATAL | BQC | PU Selftest determined to send this node back to IBM for FA because core $(CORE) failed. | |||
00020029 | INFO | BQC | MABIST passed. Mabist Tdr = $(VALUE) | |||
0002002A | FATAL | BQC | COMPUTE_IN_ERROR | MABIST failed. Mabist Tdr = $(VALUE), Controller value expected $(EXPECT), read $(ACTUAL), invalid bit $(BITPOS)=0b$(ACTUALBIT), expected 0b$(EXPBIT), Bistinfo: Number=$(NUMBER), VMACName=$(VMACNAME), Array type=$(TYPE), InstanceName1=$(INSTANCENAME1), Description=$(DESCRIPTION), CellName=$(CELLNAME), InstanceName1=$(INSTANCENAME2), X location=$(XLOC), Y location=$(YLOC), Orientation=$(ORIENT) | ||
0002002B | FATAL | BQC | COMPUTE_IN_ERROR | $(MESSAGE) | ||
00020031 | FATAL | BQC | BQC-HSS BIST (PRBS) check failed on at least 1 link. | |||
00020032 | INFO | BQC | All BQC-HSS BIST (PRBS) test passed. This BQC looks good. | |||
00020033 | INFO | BQC | ASICLBIST passed on $(DOMAIN). | |||
00020034 | FATAL | BQC | COMPUTE_IN_ERROR | ASIC Lbist has not been started on $(DOMAIN). Lbist Status = $(VALUE). | ||
00020035 | FATAL | BQC | COMPUTE_IN_ERROR | $(MISR) miscompare in $(DOMAIN). Expected $(EXPECT) but read $(ACTUAL). | ||
00020036 | FATAL | BQC | COMPUTE_IN_ERROR | $(PRPG) miscompare in $(DOMAIN). Expected $(EXPECT) but read $(ACTUAL). | ||
00020037 | FATAL | BQC | COMPUTE_IN_ERROR | $(REG) has its read error bit set while running $(DOMAIN) Read: $(READ) | ||
0002003D | INFO | BQC | $(MESSAGE) | |||
0002003E | WARN | BQC | $(MESSAGE) | |||
0002003F | FATAL | BQC | COMPUTE_IN_ERROR | $(MESSAGE) | ||
00020040 | FATAL | DDR | DDR Miscompare. Value at address $(ADDRESS) was $(ACTUAL) but expected $(EXPECTED). | |||
00020041 | WARN | DDR | $(COUNT) DDR single symbol errors encountered. | |||
00020042 | FATAL | DDR | $(COUNT) DDR double symbol errors encountered. | |||
00020043 | FATAL | DDR | $(COUNT) DDR chipkill errors encountered. | |||
00020044 | FATAL | DDR | Node has one or more bad DRAM. Value is $(MC). | |||
00020060 | FATAL | DDR | DDR Miscompare: $(PTR1) -> $(VALUE1) vs. $(PTR2) -> $(VALUE2) operation=$(OPERATION). | |||
00020061 | WARN | DDR | DDR single symbol errors encountered SSECOUNT($(I)): $(COUNT) | |||
00020062 | FATAL | DDR | DDR double symbol errors encountered DSECOUNT($(I)): $(COUNT) | |||
00020063 | FATAL | DDR | DDR chipkill errors encountered: CKCOUNT($(I)): $(COUNT) | |||
00020064 | WARN | DDR | DDR single wire encountered: SWECOUNTA($(I)): $(COUNT) | |||
00020065 | WARN | DDR | DDR single wire encountered: SWECOUNTB($(I)): $(COUNT) | |||
00020066 | WARN | DDR | DDR single wire encountered: SWECOUNTO($(I)): $(COUNT) | |||
00020067 | FATAL | DDR | DDR Fault detected: MCFIR($(I)): $(MCFIR) MSR: $(MSR) DR-INT: $(DRINT) CTRL: $(CTRL) | |||
00020080 | FATAL | BQC | A bad block number was generated; it was out of range. The block number was $(BLOCKNUM), which is greater than the maximum of $(NUMBLOCKS) - 1. | |||
00020081 | FATAL | BQC | The memory size to use is too large for the system. The calculated memory size was $(MEM) bytes but the maximum available is $(MAXMEM) bytes. | |||
00020082 | FATAL | BQC | Failed to generate a random sequence. | |||
00020083 | FATAL | BQC | A pointer went out of range while generating a random sequence. | |||
00020084 | FATAL | BQC | A miscompare occurred at byte $(BYTE). L1 core $(CORE1) address $(ADDRESS1) had a value $(VALUE1), L1 core $(CORE2) address $(ADDRESS2) had a value $(VALUE2). The address for the miscompare is $(ADDRESS). | |||
00020085 | FATAL | BQC | Unable to compute a log base 2 for value $(VAL). | |||
000200A0 | FATAL | BQC | Coherency checksum fail: csum: $(CSUM) Buf value: $(BUF) Buf number: $(BUFNUMBER). | |||
000200C0 | FATAL | BQC | Dgemm malloc call failed to retrieve memory. | |||
000200C1 | FATAL | BQC | Dgemm miscompare. Thread $(%d,THREADID): Cptr[$(%ld,ERR_ARR_INDEX)] was not close enough to 0.0. It is $(%ld,ERR_ARR_VAL) x the allowed tolerance. | |||
000200C2 | FATAL | BQC | Link errors encountered: receive error count= $(%ld,RECV_ERR_CNT), threshold is $(%ld,RECV_ERR_CNT_THRESH); sender retransmissions=$(%ld,RETRANS_CNT), threshold is $(%ld,RETRANS_CNT_THRESH). | |||
000200C3 | FATAL | BQC | Dgemm miscompare. The CRC-like check indicates that thread $(%d,THREADID) did not match half of $(%d,NUMTHREADS) threads. | |||
00020100 | FATAL | BQC | Dgemm pulse Initialization of MU or ND failed. | |||
00020101 | FATAL | BQC | Dgemm pulse Global Barrier setup or initialization failed. | |||
00020102 | FATAL | BQC | Dgemm pulse Global Barrier timeout. MUSPI_GIBarrierPollWithTimeout failed, timeout value $(TIMEOUT). | |||
00020103 | FATAL | BQC | Dgemm pulse L2 Barrier timeout. Num threads $(NUMTHREADS) and timeout $(TIMEOUT) | |||
00020104 | FATAL | BQC | Dgemm pulse Required Memory for the test $(REQUIRED) is greater than system memory $(ACTUAL). | |||
00020120 | INFO | BQC | Trash loop complete. Thread ID $(%d,THREADID), init data address $(INITDATADDR), work area address $(WRKADDR), pass number $(%d,PASSNUMBER), pass 2 loop count $(%d,PASS2LOOPCOUNT). | |||
00020121 | FATAL | BQC | Trash failed. Thread ID $(%d,THREADID), Init data address $(INITDATADDR), Work area address $(WRKADDR), Pass number $(%d,PASSNUMBER), Pass 2 loop count $(%d,PASS2LOOPCOUNT), GPR0 $(GPR0), GPR1 $(GPR1), GPR2 $(GPR2), GPR3 $(GPR3), GPR4 $(GPR4), GPR5 $(GPR5), GPR6 $(GPR6), GPR7 $(GPR7), GPR8 $(GPR8), GPR9 $(GPR9), GPR10 $(GPR10), GPR11 $(GPR11), GPR28 $(GPR28), GPR29 $(GPR29), GPR30 $(GPR30), GPR31 $(GPR31) | |||
00020140 | INFO | BQC | Grub successfully completed $(%d,ITERATION) of $(%d,ITERATIONS) passes and no hardware errors were found by this thread. | |||
00020150 | FATAL | BQC | Grub failed for an undefined error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020151 | FATAL | BQC | Grub failed for a tryhang error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020152 | FATAL | BQC | Grub failed for an external interrupt failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020153 | FATAL | BQC | Grub failed for a dozer failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020154 | FATAL | BQC | Grub failed for a GPR miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020155 | FATAL | BQC | Grub failed for an AXU miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020156 | FATAL | BQC | Grub failed for an SPR miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020157 | FATAL | BQC | Grub failed for a data miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020158 | FATAL | BQC | Grub failed for a user function miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020159 | FATAL | BQC | Grub failed for a vector function miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015A | FATAL | BQC | Grub failed for a memory corruption failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015B | FATAL | BQC | Grub failed for a data storage interrupt failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015C | FATAL | BQC | Grub failed for an instruction storage interrupt failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015D | FATAL | BQC | Grub failed for a setlock2() failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015E | FATAL | BQC | Grub failed for a ptecheck() miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002015F | FATAL | BQC | Grub failed for a system dead failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020160 | FATAL | BQC | Grub failed for a VR miscompare at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020161 | INFO | BQC | Grub reached its test limit and finished successfully. | |||
00020162 | FATAL | BQC | Grub failed for a random number generator failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020163 | FATAL | BQC | Grub failed for an illegal instruction program interrupt at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020164 | FATAL | BQC | Grub failed for an unimplemented opcode program interrupt at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020165 | FATAL | BQC | Grub failed for a vconstm error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020166 | FATAL | BQC | Grub failed for a first tlbwe/ivax error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020167 | FATAL | BQC | Grub failed for a second tlbwe/ivax error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020168 | FATAL | BQC | Grub failed for a DCR dead handler at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020169 | FATAL | BQC | Grub failed for an RO not 0 error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016A | FATAL | BQC | Grub failed for a coherence failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016B | FATAL | BQC | Grub failed for a full vector DCR failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016C | FATAL | BQC | Grub failed for a lane assignement DCR failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016D | FATAL | BQC | Grub failed for a lane assignment random data DCR failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016E | FATAL | BQC | Grub failed for a configuration handler failure at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
0002016F | FATAL | BQC | Grub failed for an exception error at instruction address $(0x%016X,FAILADDR) on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED), thread ID $(%d,THREAD). | |||
00020180 | INFO | BQC | TPSM with pid $(%d,COPY) successfully completed iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED1) alloc seed $(0x%016X,SEED2). | |||
00020190 | FATAL | BQC | TPSM with pid $(%d,COPY) failed for an undefined error on iteration $(%d,ITERATION) of $(%d,ITERATIONS) with random seed $(0x%016X,SEED1) alloc seed $(0x%016X,SEED2). | |||
00020191 | FATAL | BQC | TPSM got user panic error: $(DETAILS) | |||
00020200 | FATAL | BQC | Rank $(RANK) at ($(A),$(B),$(C),$(D),$(E)) was not able to complete an MPI_Isend to rank $(NEIGHBOR). | |||
00020220 | FATAL | DDR | DDR Error Stress Mismatch during SSE correction test for s1:$(S1) and pattern:$(PATTERN). | |||
00020221 | FATAL | DDR | DDR Error Stress Mismatch during DSE correction test for s1:$(S1), s2:$(S2) and pattern:$(PATTERN). | |||
00020222 | FATAL | DDR | DDR Error Stress Mismatch during Chipkill correction test for chip:$(CHIP) and pattern:$(PATTERN). | |||
00020223 | FATAL | DDR | DDR Error Stress Mismatch during Chipkill plus SSE correction test for chip:$(CHIP), s2:$(S2) and pattern:$(PATTERN). | |||
00020240 | FATAL | BQC | MU_ND library malloc call failed to retrieve memory. | |||
00020241 | INFO | BQC | MU_ND library: function memalign failed. | |||
00020242 | INFO | BQC | MU_ND library: Barrier function failed. | |||
00020243 | FATAL | BQC | MU_ND library: Barrier Timeout failed. Timeout value is $(TIMEOUT). | |||
00020244 | INFO | BQC | MU_ND library: Test Setup failure | |||
00020245 | INFO | BQC | MU_ND library: Message Check Probe failure | |||
00020246 | INFO | BQC | MU_ND library: msg_ran_check_and_clear_upc failure with errors $(ERROR) | |||
00020247 | INFO | BQC | MU_ND library: Firmware function failed. | |||
00020248 | FATAL | BQC | MU_ND library: msg_CheckBuffer: DATA MISMATCH: Buffer size = $(SIZE) , Mismatch at buffer offset $(OFFSET), bufStart=$(BUFSTART), CompareAddress $(COMP_ADDR), Buffer Data=$(BUF_DATA) Expected Data=$(EXP_DATA). | |||
00020249 | FATAL | BQC | MU_ND library: Poll Timeout. Timeout value=$(TIMEOUT) | |||
0002024A | INFO | BQC | MU_ND library: Kernel Function Failed. | |||
0002024B | INFO | BQC | MU_ND library: MUSPI Function Failed. | |||
0002024C | FATAL | BQC | MU_ND library: Out-of-order packet arrival detected. Current sequence number: $(CURR) Previous sequence number $(PRE). max_pkt_size: $(PKT_SIZE) inj_fifo_id: $(INJ_FIFO_ID) recv_partner_idx: $(RECV_ID) VC: $(VC) nd_fifo: $(ND_FIFO) src_id: $(SRC_ID) | |||
0002024D | FATAL | BQC | MU_ND library: injection FIFO threshold crossing status bit NOT set but it should be. free_space=$(FREE_SPACE) imu_thold=$(IMU_THOLD) thold_cross=$(THOLD_CROSS) MASK=$(MASK) sgroup=$(SGROUP) inj_fifo_id=$(FIFO_ID) | |||
0002024E | FATAL | BQC | MU_ND library: reception FIFO threshold crossing status bit NOT set but it should be. free_space=$(FREE_SPACE) rmu_thold=$(RMU_THOLD) thold_cross=$(THOLD_CROSS) MASK=$(MASK) sgroup=$(SGROUP) rmfifo offset==$(OFFSET) head=$(HEAD) tail=$(TAIL) size=$(SIZE) | |||
0002024F | FATAL | BQC | MU_ND library: Receive Count Underflow Tid=$(TID), Count=$(COUNT) | |||
00020250 | FATAL | BQC | MU_ND library: Timeout value=$(TIMEOUT) | |||
00020251 | FATAL | BQC | MU_ND library: MMIO READ ERROR: unexpected MMIO reg read value. reg[$(REG)]=$(VALUE) != Expected=$(EXPECTED) | |||
00020252 | INFO | BQC | MU_ND library: Internal error | |||
00020253 | FATAL | BQC | MU_ND library: MU_DCR error | |||
00020254 | FATAL | BQC | MU_ND library: Interrupt handler hit fatal condition. IP=$(IP) LR=$(LR). | |||
00020255 | INFO | BQC | MU_ND library: PUT ERROR. | |||
00020256 | INFO | BQC | MU_ND library: GET ERROR. | |||
00020257 | FATAL | BQC | MU_ND library: nd_termcheck_compare_values error: a!=b Addr=$(ADDR) line number=$(LINE) A=$(A_VALUE) and B=$(B_VALUE). | |||
00020258 | INFO | BQC | MSG_RAN_DIAG1: Tid ($(TID)) >= num_threads $(NUM_THREADS). | |||
00020259 | INFO | BQC | MU_ND library: fifo send_recv error. | |||
0002025A | INFO | BQC | MU_ND library: C-assert failure | |||
00020260 | FATAL | BQC | MSG_DIAG_CONNECTIVITY: No neighboors. | |||
00020261 | INFO | BQC | MSG_DIAG_CONNECTIVITY: Tid ($(TID)) >= MAX dgemm threads $(NUM_THREADS). | |||
00020262 | FATAL | BQC | Link errors encountered on link $(%c,LINK)$(%c,DIR). Link error rates = $(%u,LINKERR) e-15, TX retran count = $(%u,TXERRCNT), RX error count = $(%u,RXERRCNT), tvs_data=0x$(%016x,TVSDATA), tv0=0x$(%d,TV0), tv1=$(%d,TV1), tv2=$(%d,TV2). | |||
00020263 | WARN | BQC | Link errors encountered on link $(%c,LINK)$(%c,DIR) (PHY $(%c,PLINK)$(%c,PDIR)). TX retran count = $(%u,TXERRCNT), RX error count = $(%u,RXERRCNT). | |||
00020280 | FATAL | DDR | Survey: Bist error result for link direction $(LINK): ES $(%.2f,EYESIZE) CdRerr $(CDRERR) PRBS: Errors: $(ERRORS) Align: $(ALIGN) | |||
000202A0 | WARN | UPC | Add Event failure configuring UPC hardware performance counters. Errcode=$(RC), Module Lineno $(LINE). | |||
000202A1 | WARN | UPC | Apply Events failure configuring UPC hardware performance counters. Errcode=$(RC), Module Lineno $(LINE). | |||
000202A2 | WARN | UPC | This node failed the UPC tests due to an unexpected write/read miscompare in the UPC_C unit at address=$(ADDR). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A3 | WARN | UPC | This node failed the UPC tests due to an unexpected write/read miscompare in the UPC_P unit at core $(CORE) address=$(ADDR). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A4 | WARN | UPC | This node failed the UPC tests due to an unexpected miscompare in the UPC_C SRAM at offset=$(OFFSET). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A5 | WARN | UPC | This node failed the UPC tests due to an unexpected register state at offset=$(OFFSET). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A6 | WARN | UPC | This node failed the UPC tests due to an unexpected miscompare in the UPC_C I/O SRAM at offset=$(OFFSET). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A7 | WARN | UPC | This node failed the UPC tests due to UPC Ring miscompare. UPC_C Counter value: Address=$(ADDR). Expected Value=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A8 | WARN | UPC | This node failed the UPC tests due to an unexpected count value the UPC_C I/O SRAM at offset=$(OFFSET). Expected Value>=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202A9 | WARN | UPC | This node failed the UPC tests due to an unexpected count value the UPC_C L2 SRAM for slice=$(SLICE), counter=${CTR}. Expected Value>=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202AA | WARN | UPC | This node failed the UPC tests due to an unexpected count value the Punit SRAM for core=$(CORE), counter=${CTR}. Expected Value>=$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202AB | WARN | UPC | This node failed the UPC tests due to an unexpected count value during A2 event signal map test: eventid=%d, hwThread=$(HWTHREAD), eventIndex=${EVTIDX}. Expected Value!=0 or ==$(EXPVAL). Actual Value=$(ACTVAL). | |||
000202AC | WARN | UPC | This node failed the UPC tests due to an unexpected interrupt: code=%d, hwThread=$(HWTHREAD) | |||
000202C0 | FATAL | BQC | Unable to obtain sync on $(%c,TORUS)$(%c,TORUSDIR), bit $(%d,BIT) (lane $(%d,LANE)). Check the midplane pins, midplane cage, and node board connector blocks for damage, alignment, or seating issues. Also ensure there are no loose materials in and around the node board. | |||
000202C1 | FATAL | BQC | Static data eye size bad for link $(%d,LINK), lane $(%d,LANE) ($(%c,TORUS)$(%c,TORUSDIR), bit $(%d,BIT)). Eye size is $(%d,EYE).$(%02d,FRACTION)%. Check the midplane pins, midplane cage, and node board connector blocks for damage, alignment, or seating issues. Also ensure there are no loose materials in and around the node board. | |||
000202C2 | FATAL | BQC | Sync was achieved but the error bit is set on $(%c,TORUS)$(%c,TORUSDIR), bit $(%d,BIT) (lane $(%d,LANE)). Check the midplane pins, midplane cage, and node board connector blocks for damage, alignment, or seating issues. Also ensure there are no loose materials in and around the node board. | |||
000202C3 | FATAL | BQC | The control system barrier timed out. | |||
000202C4 | FATAL | BQC | An unrecognized phase $(PHASE) was encountered for torus dimension $(DIMENSION) during PRBS startup. This is a software failure. | |||
000202C5 | FATAL | BQC | An unrecognized phase $(PHASE) was encountered for torus dimension $(DIMENSION) during PRBS shutdown. This is a software failure. | |||
000202E0 | FATAL | BQC | Miscompare between MPI rank $(%d,RANK) and MPI rank 0: $(%le,VAL) vs $(%le,DEST). Differing bits are $(%16.16lX,DIFFERINGBITS). | |||
000202E1 | FATAL | BQC | QCD Failure: $(MESSAGE). | |||
000202F0 | FATAL | BQC | L2Tester failed. | |||
00020300 | FATAL | BQC | $(%c,TORUS)$(%c,TORUSDIR) lane $(%d,LANE) failed to complete calibration. |
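The MESSAGE column above uses two placeholder shapes: a bare name such as `$(CORE)`, and a name preceded by a printf-style format such as `$(%d,THREADID)` or `$(0x%016X,SEED)`. The sketch below shows one way such a template could be filled in for display. It is an illustration only: the function name `expand` is hypothetical, and the substitution rules actually used by the system that emits these events are not described in this listing. Placeholders with no supplied value are left untouched, as are the few `${NAME}` and bare `%d` forms that appear in some UPC messages.

```python
import re

# $(NAME) or $(FORMAT,NAME), where FORMAT contains a printf-style conversion
PLACEHOLDER = re.compile(r"\$\((?:([^,()]*%[^,()]*),)?(\w+)\)")


def expand(template, values):
    """Fill $(NAME) and $(FORMAT,NAME) placeholders from a dict of values."""
    def substitute(match):
        fmt, name = match.group(1), match.group(2)
        if name not in values:
            return match.group(0)  # leave unknown placeholders as they are
        if fmt:
            # Python's % operator accepts C-style conversions such as
            # %d, %016X and %.2f, and ignores the l length modifier.
            return fmt % (values[name],)
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


if __name__ == "__main__":
    template = ("Grub successfully completed $(%d,ITERATION) of "
                "$(%d,ITERATIONS) passes and no hardware errors were "
                "found by this thread.")
    print(expand(template, {"ITERATION": 5, "ITERATIONS": 10}))
```

Leaving unmatched placeholders in place keeps a partially filled message readable and makes missing fields easy to spot.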
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00030000 | INFO | Process | bgmaster_server has been started in process $(PID) | |||
00030001 | INFO | Process | bgmaster_server process $(PID) stopped | |||
00030002 | INFO | Process | bgmaster_server started binary $(BIN) for alias $(ALIAS) | |||
00030003 | INFO | Process | bgmaster_server stopped binary $(BIN) for alias $(ALIAS) with signal $(SIGNAL). | |||
00030004 | WARN | Process | 1 | bgmaster_server has detected a failure of binary $(BIN) for alias $(ALIAS) with signal $(SIGNAL) and exit status $(ESTAT). Error message is $(EMSG). | ||
00030005 | INFO | Process | bgmaster_server has executed a restart policy for alias $(ALIAS) | |||
00030006 | INFO | Process | bgmaster_server has executed a failover policy for alias $(ALIAS) from $(SOURCE) to $(TARGET) | |||
00030007 | FATAL | Process | 1 | bgmaster_server has detected a failure of bgagentd $(AGENT_ID) | ||
00030008 | INFO | Process | bgmaster_server has been requested to end bgagentd $(AGENT_ID) | |||
00030009 | FATAL | Process | 1 | bgmaster_server process $(PID) has failed with signal $(SIGNAL) | ||
00030010 | FATAL | Process | bgmaster_server process $(PID) has failed with a configuration error $(ERROR) | |||
00030011 | INFO | Process | bgmaster_server failed to start alias $(ALIAS) |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00040001 | WARN | Temp_Sensor | 10 | Nonresponsive or missing Temperature Sensor: $(ARG) | ||
00040002 | WARN | Temp_Sensor | 10 | Temperature data is unavailable: $(ARG) | ||
00040003 | WARN | Optical_Module | 10 | Optical module environmental data is unavailable: $(ARG) | ||
00040004 | WARN | Optical_Module | 10 | Optical module environmental data is unavailable: $(ARG) | ||
00040005 | WARN | DCA | 10 | Power domain environmental data is unavailable: $(ARG) | ||
00040006 | WARN | DCA | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 10 | Unable to disable a Power Domain: $(ARG) | |
00040007 | WARN | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 3 | Unable to take I2C and JTAG out of reset : $(ARG) | |
00040008 | WARN | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to initialize LCD display : $(ARG) | |
00040009 | WARN | DCA | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 5 | Unable to bring up a power domain : $(ARG) | |
00040010 | WARN | Clocks | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to initialize system clock : $(ARG) | |
00040011 | WARN | Clocks | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to initialize system clock : $(ARG) | |
00040012 | WARN | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to bring up BPMs : $(ARG) | |
00040013 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set card's jtag speed : $(ARG) | |
00040014 | WARN | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to bring up Link Chips : $(ARG) | |
00040015 | WARN | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to bring up Optical Modules : $(ARG) | |
00040016 | WARN | AC_TO_DC_PWR | 1 | Unable to bring up BPMs : $(ARG) | ||
00040017 | WARN | AC_TO_DC_PWR | Unable to bring up BPMs : $(ARG) | |||
00040018 | WARN | AC_TO_DC_PWR | 1 | Unable to bring up BPMs : $(ARG) | ||
00040019 | WARN | AC_TO_DC_PWR | Unable to bring up BPM : $(ARG) | |||
00040020 | WARN | AC_TO_DC_PWR | BPM is reporting a problem : $(ARG) | |||
00040021 | WARN | AC_TO_DC_PWR | Unable to 'restart' BPM : $(ARG) | |||
00040022 | FATAL | AC_TO_DC_PWR | 1 | Unable to 'restart' BPM (even after retrying this BPM is still reporting a problem): $(ARG) | ||
00040023 | WARN | AC_TO_DC_PWR | 1 | Unable to bring up BPMs : $(ARG) | ||
00040024 | WARN | Palomino | 2 | Unable to lower PGOOD on this card's link chips : $(ARG) | ||
00040025 | WARN | Palomino | 2 | Unable to raise PGOOD on this card's link chips : $(ARG) | ||
00040026 | WARN | Palomino | 2 | Unable to take link chips out of reset : $(ARG) | ||
00040027 | FATAL | BQL | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to read this link chip's JTAG ID : $(ARG) | |
00040028 | FATAL | BQL | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Incoherent link chip : $(ARG) | |
00040029 | WARN | Palomino | 1 | Unable to take this card's Optical Modules out of reset : $(ARG) | ||
00040030 | WARN | Optical_Module | 1 | Unable to communicate with the RX Optical Modules : $(ARG) | ||
00040031 | WARN | Optical_Module | 1 | Unable to communicate with the TX Optical Modules : $(ARG) | ||
00040032 | WARN | DCA | 1 | Unable to communicate with the DC-DC Power Module : $(ARG) | ||
00040033 | WARN | DCA | 1 | DC-DC Power Module is non-responsive : $(ARG) | ||
00040034 | WARN | Palomino | 1 | Unable to enable this card DC-DC Power Module's global enable : $(ARG) | ||
00040035 | WARN | DCA | 1 | Unable to enable this Power Domain. : $(ARG) | ||
00040036 | WARN | Palomino | 1 | Unable to read card's environmental data : $(ARG) | ||
00040037 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Detected a power rail with an incorrect voltage : $(ARG) | |
00040038 | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
00040039 | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003A | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003B | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003C | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003D | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003E | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
0004003F | WARN | Palomino | 1 | Unable to update the LCD display : $(ARG) | ||
00040040 | WARN | Clock_FPGA | 1 | Unable to set clock frequency : $(ARG) | ||
00040041 | WARN | Clock_FPGA | 1 | Unable to set clock frequency : $(ARG) | ||
00040042 | WARN | Clock_FPGA | 1 | Unable to set up gsync pulse interval : $(ARG) | ||
00040043 | WARN | Clock_FPGA | 1 | Unable to set up gsync pulse interval : $(ARG) | ||
00040044 | WARN | Clock_FPGA | 1 | Unable to set up gsync pulse interval : $(ARG) | ||
00040045 | WARN | Clock_FPGA | 1 | Unable to set up gsync pulse interval : $(ARG) | ||
00040046 | WARN | Clock_FPGA | 1 | Unable to set up gsync pulse interval : $(ARG) | ||
00040047 | WARN | Card | 1 | Unable to start up this board : $(ARG) | ||
00040048 | WARN | Software_Error | Problem with FPGA image file : $(ARG) | |||
00040049 | WARN | Card | Card has incorrect location values. : $(ARG) | |||
0004004A | FATAL | Card | Card has incorrect License Plate value. : $(ARG) | |||
0004004B | FATAL | Card | Card has incorrect location value. : $(ARG) | |||
0004004C | FATAL | Card | Card is connected to the incorrect subnet. : $(ARG) | |||
0004004D | FATAL | Card | Card conflicts with a previously found card. : $(ARG) | |||
0004004E | FATAL | Card | Card has incorrect IP address value. : $(ARG) | |||
0004004F | FATAL | Card | Card has incorrect IP address value. : $(ARG) | |||
00040050 | WARN | Service_Card | 1 | Unable to start up this Service card : $(ARG) | ||
00040051 | WARN | Node_Board | 1 | Unable to start up this Node board : $(ARG) | ||
00040052 | WARN | Clocks | 1 | Unable to save away the system clock M and N values. : $(ARG) | ||
00040053 | WARN | Card | This card does not have an iCon associated with it | |||
00040054 | WARN | Palomino | 1 | Unable to lower PGOOD on this card's computes : $(ARG) | ||
00040055 | WARN | Palomino | 1 | Unable to raise PGOOD on this card's computes : $(ARG) | ||
00040056 | WARN | Palomino | Unable to reinitialize this card : $(ARG) | |||
00040057 | WARN | Card | Unable to find this card on the Service Network : $(ARG) | |||
00040058 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to initialize this card : $(ARG) | ||
00040059 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to initialize this card : $(ARG) | ||
0004005A | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to initialize this card : $(ARG) | ||
0004005B | INFO | Card | Successfully initialized this card | |||
0004005C | WARN | Node_Board | 1 | Unable to read the local temperature from a temperature sensor : $(ARG) | ||
0004005D | INFO | Card | MC was halted. : $(ARG) | |||
0004005E | WARN | Node_Board | 1 | Unable to enable this DCA's i2c channel : $(ARG) | ||
0004005F | WARN | Icon | 1 | Unable to see if iCon is in fiber or copper mode. : $(ARG) | ||
00040060 | FATAL | Icon | 1 | Unable to see if iCon is in full duplex mode. : $(ARG) | ||
00040061 | WARN | Node_Board | 1 | Unable to disable power domain 7. : $(ARG) | ||
00040062 | FATAL | Node_Board | 1 | Unable to initialize the I2C switch. : $(ARG) | ||
00040063 | FATAL | Node_Board | 1 | Unable to find power domain 7's DAC. | ||
00040064 | FATAL | Node_Board | 1 | Unable to enable the I2C channel for power domain 7's DAC. : $(ARG) | ||
00040065 | FATAL | Node_Board | 1 | Unable to initialize power domain 7's DAC. : $(ARG) | ||
00040066 | FATAL | Node_Board | 1 | Unable to set power domain 7 DAC's voltage. : $(ARG) | ||
00040067 | FATAL | Node_Board | 1 | Unable to enable power domain 7. : $(ARG) | ||
00040068 | FATAL | Service_Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to enable iConFork facility. : $(ARG) | |
00040069 | FATAL | Service_Card | 1 | Unable to enable the broadcom switch. : $(ARG) | ||
0004006A | FATAL | Service_Card | 1 | Unable to find the broadcom switch. : $(ARG) | ||
0004006B | FATAL | Service_Card | 1 | Unable to initialize the broadcom switch. : $(ARG) | ||
0004006C | FATAL | Service_Card | 1 | Unable to zero the broadcom switch's counters. : $(ARG) | ||
0004006D | FATAL | Service_Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to load the node boards hanging off of this service card. : $(ARG) | |
0004006E | FATAL | Service_Card | 3 | Unable to put the node boards hanging off of this service card into reset. : $(ARG) | ||
0004006F | FATAL | Service_Card | 3 | Unable to ensure that the node boards hanging off of this service card are in reset. : $(ARG) | ||
00040070 | FATAL | Service_Card | 1 | Node board did not go into reset. : $(ARG) | ||
00040071 | FATAL | Software_Error | Problem with Node board FPGA image file : $(ARG) | |||
00040072 | FATAL | Service_Card | Unable to load the Node board FPGA image into a node board. : $(ARG) | |||
00040073 | FATAL | Service_Card | Unable to put node board's LP and IP into node board's FPGA. : $(ARG) | |||
00040074 | FATAL | Service_Card | Unable to put node board's LP and IP into node board's FPGA. : $(ARG) | |||
00040075 | FATAL | Service_Card | Unable to check and see if the node boards hanging off of this service card came out of reset. : $(ARG) | |||
00040076 | FATAL | Service_Card | Child node board has incorrect LP and IP information. : $(ARG) | |||
00040077 | FATAL | Service_Card | Child node board did not come out of reset. : $(ARG) | |||
00040078 | FATAL | Service_Card | Unable to check the specified child node board's LP and IP to ensure that it has the expected values. : $(ARG) | |||
00040079 | FATAL | Service_Card | While checking the specified child node board's LP and IP, we found that it does not have the expected values. : $(ARG) | |||
0004007A | INFO | Card | Successfully prepared this card for service. : $(ARG) | |||
0004007B | INFO | Node_Board | Successfully restarted this card, it has been initialized and is found on the service network. : $(ARG) | |||
0004007C | FATAL | Node_Board | Unable to restart this card. | |||
0004007D | FATAL | Software_Error | An error was detected during bringup processing. : $(ARG) | |||
0004007E | FATAL | Software_Error | Unable to detect any Service cards or IO boards in this machine. : $(ARG) | |||
0004007F | INFO | Software_Error | Machine is ready for use. | |||
00040080 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to execute the requested JTAG instruction stream against this card. : $(ARG) | |
00040081 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to execute the requested I2C instruction stream against this card. : $(ARG) | |
00040082 | WARN | Card | The specified card is not present in the machine. | |||
00040083 | WARN | Node_Board | This card's parent service card is not functional/initialized, so we are unable to initialize this card. : $(ARG) | |||
00040084 | WARN | Node_Board | 1 | This card's parent service card's I2C bus is not functional, so we are unable to initialize this card. : $(ARG) | ||
00040085 | WARN | Node_Board | 1 | Unable to load this node board's iCon/palomino FPGA image. : $(ARG) | ||
00040086 | FATAL | ELF_Image | Unable to load a specified ELF image, an error was detected. : $(ARG) | |||
00040087 | FATAL | ELF_Image | Unable to load a specified ELF image, an error was detected. : $(ARG) | |||
00040088 | FATAL | ELF_Image | Unable to load a specified ELF image, an error was detected. : $(ARG) | |||
00040089 | WARN | DCA | 1 | Unable to perform the specified operation on this DCA. : $(ARG) | ||
0004008A | WARN | DCA | Unable to perform the specified operation on this DCA. : $(ARG) | |||
0004008B | WARN | Software_Error | Unable to perform the specified operation on this DCA. : $(ARG) | |||
0004008C | WARN | Software_Error | Unable to perform the specified operation on this DCA. : $(ARG) | |||
0004008D | WARN | Software_Error | Unable to perform the specified operation on this DCA. : $(ARG) | |||
0004008E | WARN | DCA | 1 | Unable to perform the specified operation on this DCA. : $(ARG) | ||
0004008F | WARN | DCA | 1 | Unable to perform the specified operation on this DCA. : $(ARG) | ||
00040090 | INFO | Card | Successfully performed the specified operation on this DCA. : $(ARG) | |||
00040091 | FATAL | Node_Board | 1 | Unable to enable this Optical Module's i2c channel : $(ARG) | ||
00040092 | FATAL | ELF_Image | Error loading firmware image: $(ARG) | |||
00040093 | FATAL | ELF_Image | Error loading node image: $(ARG) | |||
00040094 | FATAL | ELF_Image | The symbol associated with the address of the firmware personality could not be found in the firmware image: $(ARG) | |||
00040095 | FATAL | Software_Error | Unable to create a pthread to process a request that came into a SubnetMc: $(ARG) | |||
00040096 | INFO | BQC | The node requested a JTAG barrier | |||
00040097 | FATAL | Card | 1 | Unable to bring up Computes : $(ARG) | ||
00040098 | FATAL | Software_Error | An error was detected during restart of a subnet. : $(ARG) | |||
00040099 | FATAL | Card | 1 | Mailbox verification failed for all the nodes being booted on the board. | ||
0004009A | FATAL | BQC | Mailbox verification failed for this node | |||
0004009B | FATAL | DCA | 1 | Unable to get this DCA's firmware level : $(ARG) | ||
0004009C | FATAL | Node_Board | 1 | Unable to read this DCA's VPD : $(ARG) | ||
0004009D | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to enable this compute's i2c channel : $(ARG) | |
0004009E | FATAL | Node_Board | 1 | Unable to read this compute's VPD : $(ARG) | ||
0004009F | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to write a Palomino's register : $(ARG) | |
000400A0 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to read a Palomino's register : $(ARG) | |
000400A1 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to perform a write operation on a Palomino's register : $(ARG) | |
000400A2 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to perform a write operation on a Palomino's register : $(ARG) | |
000400A3 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to perform a write operation on a Palomino's register : $(ARG) | |
000400A4 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to initialize computes : $(ARG) | |
000400A5 | FATAL | Card | The block failed to boot because compute nodes did not report READY as expected. | |||
000400A6 | FATAL | DCA | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set the voltage for the specified Power Domain. : $(ARG) | |
000400A7 | FATAL | DCA | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to turn on power domain 7 because power domain 1 is not turned on. : $(ARG) | |
000400A8 | FATAL | DCA | 1 | Unable to disable this Power Domain. : $(ARG) | ||
000400A9 | FATAL | DCA | 1 | Unable to enable this Power Domain. : $(ARG) | ||
000400AA | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to disable this card DC-DC Power Module's global enable : $(ARG) | |
000400AB | INFO | Card | The compute node test did not report TERMINATE within the timeout period. | |||
000400AC | WARN | Card | 1 | Unable to bring up PCIE cards : $(ARG) | ||
000400AD | WARN | IO_Board | 1 | Unable to enable the PCIE cards' clock : $(ARG) | ||
000400AE | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to enable this board's PCIE clock enable. : $(ARG) | |
000400AF | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | dev bus error condition : $(ARG) | ||
000400B0 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | dcr access error : $(ARG) | ||
000400B1 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | dcr access error : $(ARG) | ||
000400B2 | FATAL | BQC | 1 | Unable to read this compute's JTAG ID : $(ARG) | ||
000400B3 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | This compute's VPD does not contain ecid : $(ARG) | |
000400B4 | WARN | DCA | Disabled a Power Domain: $(ARG) | |||
000400B5 | FATAL | DCA | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | Unable to turn on power domain 7 because power domain 4 is not turned on. : $(ARG) | |
000400B6 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | Incoherent compute : $(ARG) | |
000400B7 | INFO | Card | Successfully powered off this device : $(ARG) | |||
000400B8 | WARN | Palomino | 1 | Unable to power off this card : $(ARG) | ||
000400B9 | FATAL | Software_Error | Unable to create a pthread to process a MC command request: $(ARG) | |||
000400BA | WARN | Clock_FPGA | 1 | Unable to set up spread spectrum : $(ARG) | ||
000400BB | WARN | Clock_FPGA | 1 | Unable to set up spread spectrum : $(ARG) | ||
000400BC | WARN | Clock_FPGA | 1 | Unable to set up spread spectrum : $(ARG) | ||
000400BD | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | This compute's VPD is missing some important data fields : $(ARG) | |
000400BE | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | This compute is unable to tell us its ecid : $(ARG) | |
000400BF | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to refresh the I2C channel to include optical modules. : $(ARG) | |
000400C0 | WARN | Node_Board | 1 | Unable to find DCA's vpd i2c channel. : $(ARG) | ||
000400C1 | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to find the correct number of temperature sensors : $(ARG) | |
000400C2 | WARN | DCA | 1 | Unable to enable a Power Domain Loop: $(ARG) | ||
000400C3 | WARN | DCA | 1 | Unable to disable a Power Domain Loop: $(ARG) | ||
000400C4 | WARN | DCA | 1 | Unable to perform the specified operation on this DCA. : $(ARG) | ||
000400C5 | WARN | DCA | Disabled this DCA. : $(ARG) | |||
000400C6 | WARN | DCA | Enabled this DCA. : $(ARG) | |||
000400C7 | WARN | DCA | Attempted to disable an IO board DCA, this is an unsupported operation. : $(ARG) | |||
000400C8 | WARN | DCA | Attempted to enable an IO board DCA, this is an unsupported operation. : $(ARG) | |||
000400C9 | WARN | Node_Board | 1 | During shutdown of a switchable power domain we detected an under-voltage situation : $(ARG) | ||
000400CA | WARN | DCA | 1 | Unable to gather an execution trace of the specified DCA. : $(ARG) | ||
000400CB | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | This compute's VPD does not contain CCIN : $(ARG) | |
000400CD | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | ACCESS alert. $(ARG) | ||
000400CE | FATAL | BQL | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to start the clocks for this link chip : $(ARG) | |
000400CF | FATAL | BQL | 1 | TVSense logic never came active : $(ARG) | ||
000400D0 | WARN | BQL | 1 | TVSense temperature is unavailable : $(ARG) | ||
000400D1 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set the overtemperature limit for this board : $(ARG) | |
000400D2 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to read this card's VPD : $(ARG) | |
000400D3 | WARN | Card | 1 | Power domain loop data is not available in board's VPD : $(ARG) | ||
000400D4 | FATAL | Card | 1 | Compute has gone over-temperature : $(ARG) | ||
000400D5 | FATAL | Card | 1 | Link chip has gone over-temperature : $(ARG) | ||
000400D6 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set DC-DC Power Module's uC to reset : $(ARG) | |
000400D7 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set DC-DC Power Module's uC to normal state: $(ARG) | |
000400D8 | WARN | Optical_Module | 1 | Unable to perform Optical Link Connectivity Test (aka OLCT) : $(ARG) | ||
000400D9 | WARN | Optical_Module | 1 | Unable to perform Optical Link Connectivity Test (aka OLCT) : $(ARG) | ||
000400DA | WARN | Optical_Module | 1 | Unable to disable Optical Link Connectivity Test (aka OLCT) : $(ARG) | ||
000400DB | WARN | Optical_Module | 1 | Unable to perform Optical Link Connectivity Test (aka OLCT) : $(ARG) | ||
000400DC | FATAL | Card | 1 | Unable to initialize the cache of compute information : $(ARG) | ||
000400DD | WARN | DCA | 1 | Unable to update this DCA's firmware level : $(ARG) | ||
000400DE | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to disable Power Domain VTMs : $(ARG) | |
000400DF | WARN | Node_Board | 1 | Unable to enable this Power Domain VTMs : $(ARG) | ||
000400E0 | WARN | Node_Board | 1 | Unable to disable this Power Domain VTMs : $(ARG) | ||
000400E1 | WARN | Card | 1 | Compute's VPD does not contain HSS calibration data : $(ARG) | ||
000400E2 | INFO | BQL | TVSense temperature is unavailable : $(ARG) | |||
000400E3 | WARN | BQC | SOFTWARE_IN_ERROR | 1 | Verification of the I/O link shutdown failed. $(STATUS) | |
000400E4 | FATAL | BQC | 1 | BQC device bus write of the personality failed. | ||
000400E5 | WARN | BQC | 1 | Send of a kernel shutdown message failed. Return code=$(ARG) | ||
000400E6 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | BQC mailbox stdin failed. Return code=$(ARG) | |
000400E7 | FATAL | BQC | COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | 1 | BQC configure domain failed. Return code=$(ARG) | |
000400E8 | FATAL | Software_Error | Kernel configuration data address is out of range. $(STATUS) | |||
000400E9 | FATAL | BQC | COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | BQC write mailbox failed for kernel configuration data. | ||
000400EA | FATAL | Card | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | 1 | BQC barrier acknowledgement failed. Return code=$(ARG) | |
000400EB | WARN | BQL | BQL_SPARE | BQL error threshold exceeded. | ||
000400EC | FATAL | BQC | COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | The compute node did not report its mailbox was READY as expected. | ||
000400ED | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Detected that this board has become unusable | ||
000400EE | FATAL | Fan | Detected a failed fan : $(ARG) | |||
000400EF | WARN | IO_Board | 1 | All of the fans on this board have been set to run at full speed. | ||
000400F0 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to read board's fan speeds : $(ARG) | |
000400F1 | FATAL | Palomino | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to set the fan speed for this board : $(ARG) | |
000400F2 | WARN | BQC | SOFTWARE_IN_ERROR | 1 | The notify I/O link shutdown failed. $(STATUS) | |
000400F3 | WARN | Card | 1 | Unable to reset the eDRAM charge pumps on one or more computes on this board : $(ARG) | ||
000400F4 | WARN | AC_TO_DC_PWR | 1 | Unable to perform the specified operation on this BPM / Bulk Power Module. : $(ARG) | ||
000400F5 | FATAL | AC_TO_DC_PWR | Unable to perform the specified command/operation on this BPM / Bulk Power Module. : $(ARG) | |||
000400F6 | WARN | AC_TO_DC_PWR | 1 | Unable to perform the specified operation on this BPM / Bulk Power Module. : $(ARG) | ||
000400F7 | WARN | AC_TO_DC_PWR | 1 | Unable to update this BPM's firmware level : $(ARG) | ||
000400F8 | FATAL | AC_TO_DC_PWR | Unable to clear this BPM's faults : $(ARG) | |||
000400F9 | WARN | Card | The broadcast install of a kernel image failed, $(STATUS) | |||
000400FA | FATAL | BQC | COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | 1 | The install of a kernel image failed, $(STATUS) | |
000400FB | WARN | Optical_Module | 1 | Unable to perform the specified operation on this Optics Module. : $(ARG) | ||
000400FC | WARN | Optical_Module | 1 | Unable to update this Optics Module's firmware level : $(ARG) | ||
000400FD | WARN | Optical_Module | 1 | Unable to perform the specified operation on this Optics Module. : $(ARG) | ||
000400FE | WARN | Optical_Module | 1 | Unable to perform the specified operation on this Fiber Optics Module. : $(ARG) | ||
000400FF | WARN | Optical_Module | 1 | Unable to perform the specified operation on this Optics Module. : $(ARG) | ||
00040100 | FATAL | Optical_Module | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Invalid voltage on this board, so we cannot talk to this Optics Module. : $(ARG) | |
00040101 | WARN | Card | Stale card found during restart of SubnetMc. | |||
00040102 | WARN | BQL | 1 | This link chip has mismatched ecids : $(ARG) | ||
00040103 | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to find the board's vpd i2c channel. : $(ARG) | |
00040104 | WARN | Software_Error | Valid location not found in ras event. Message ID: $(PROB_ID) Location: $(PROB_LOC) | |||
00040105 | INFO | Software_Error | Modifying some of this SubnetMc's logging levels: $(ARG) | |||
00040106 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | CFAM Machine Check. $(ARG) | ||
00040107 | INFO | BQC | CFAM Special Attention. $(ARG) | |||
00040108 | WARN | BQC | CFAM Recoverable Error. $(ARG) | |||
00040109 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | CFAM alert. $(ARG) | ||
0004010A | FATAL | Card | FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | An attempt was made to execute a boot step on a board that is not initialized/usable, since the board is unavailable we are failing this boot step. : $(ARG) | |
0004010B | FATAL | Software_Error | The PrimaryMc has detected that this SubnetMc process has terminated. : $(ARG) | |||
0004010C | FATAL | Coolant_Monitor | RACK_IN_ERROR | 1 | Unable to bring up Coolant Monitor : $(ARG) | |
0004010D | WARN | Card | 1 | Unable to bring up the Coolant Monitor for this rack. | ||
0004010E | WARN | Coolant_Monitor | 1 | Unable to get this Coolant Monitor's environmental data. | ||
0004010F | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset Link Chips : $(ARG) | |
00040110 | INFO | Card | Successfully reset this card (did not do a full reinitialization). | |||
00040111 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset Computes : $(ARG) | |
00040112 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset this board : $(ARG) | |
00040113 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset this board : $(ARG) | |
00040114 | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | This card's parent service card is not functional/initialized, so we are unable to reconnect to this card. : $(ARG) | ||
00040115 | FATAL | Node_Board | This board's VPD does not contain CCIN. | |||
00040116 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to initialize this card | ||
00040117 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to initialize this card | ||
00040118 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset this board : $(ARG) | |
00040119 | WARN | Software_Error | Unable to create a log file for this board's bist run, $(ARG) | |||
0004011A | WARN | Software_Error | Unable to create a log file for this board's eDRAM Charge Pumps, $(ARG) | |||
0004011B | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to reset Power Domains : $(ARG) | |
0004011C | WARN | Coolant_Monitor | 1 | Unable to perform the specified operation on this Coolant Monitor. : $(ARG) | ||
0004011D | WARN | Coolant_Monitor | 1 | Unable to perform the specified operation on this Coolant Monitor. : $(ARG) | ||
0004011E | WARN | Coolant_Monitor | 1 | Unable to perform the specified operation on this Coolant Monitor. : $(ARG) | ||
0004011F | WARN | Coolant_Monitor | 1 | Unable to update the firmware on this Coolant Monitor. : $(ARG) | ||
00040120 | WARN | Coolant_Monitor | 1 | Unable to perform the specified operation on this Coolant Monitor. : $(ARG) | ||
00040121 | INFO | Card | This RAS event has been deprecated, it should never again occur. : $(ARG) | |||
00040122 | FATAL | Node_Board | 1 | Unable to disable the EnvMon polling of this board's DCAs. : $(ARG) | ||
00040123 | FATAL | Node_Board | 1 | Unable to enable the EnvMon polling of this board's DCAs. : $(ARG) | ||
00040124 | FATAL | AC_TO_DC_PWR | Unable to disable this BPM : $(ARG) | |||
00040125 | FATAL | AC_TO_DC_PWR | Unable to enable this BPM : $(ARG) | |||
00040126 | FATAL | Software_Error | Unable to create a pthread in scanForAndShutdownUninitializedBoards() : $(ARG) | |||
00040127 | WARN | BQC | Verification of the kernel shutdown failed. | |||
00040128 | WARN | Coolant_Monitor | 1 | Unable to read the specified Coolant Monitor register. : $(ARG) | ||
00040129 | WARN | Coolant_Monitor | 1 | Unable to write the specified Coolant Monitor register. : $(ARG) | ||
0004012A | WARN | Software_Error | Unable to restart this subnet as this subnet has already been initialized. : $(ARG) | |||
0004012B | WARN | Software_Error | Unable to perform bringup on this machine, as this subnet has already been initialized. : $(ARG) | |||
0004012C | WARN | Card | BOARD_IN_ERROR | 1 | Unable to disable optics module interrupts : $(ARG) | |
0004012D | WARN | Card | 1 | Unable to enable optics module interrupts : $(ARG) | ||
0004012E | WARN | DC_TO_DC_PWR | DCA_IN_ERROR | 1 | Detected that one of the DCAs on this board has experienced a Domain 1 power failure. : $(ARG) | |
0004012F | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Unable to disable DCA Alert Controls : $(ARG) | |
00040130 | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | Both DCAs are reporting that their domain 1 has failed. : $(ARG) | |
00040131 | FATAL | Card | The compute nodes on this board did not report mailbox READY as expected. | |||
00040132 | WARN | DC_TO_DC_PWR | END_JOB,FREE_COMPUTE_BLOCK,DCA_IN_ERROR | 1 | Unable to bump this DCA's domain 1 voltage. : $(ARG) | |
00040133 | WARN | DC_TO_DC_PWR | Bumped this DCA's domain 1 voltage up by DefaultDomain1VoltageBump mV. : $(ARG) | |||
00040134 | FATAL | Node_Board | 1 | Unable to enable the DCA alarm/alert processing for this board's DCAs. : $(ARG) | ||
00040135 | FATAL | Node_Board | 1 | Unable to disable the DCA alarm/alert processing for this board's DCAs. : $(ARG) | ||
00040136 | WARN | DC_TO_DC_PWR | END_JOB,FREE_COMPUTE_BLOCK,DCA_IN_ERROR | 1 | Unable to read this DCA's domain 1 voltage. : $(ARG) | |
00040137 | INFO | Software_Error | Received a request to restart this subnet, but the subnet is already initialized. : $(ARG) | |||
00040138 | WARN | Software_Error | Received a request to restart this subnet - so marking all cards in this subnet as not present. : $(ARG) | |||
00040139 | WARN | Optical_Module | 1 | Unable to ensure that this Optical Module has nominal channel status. : $(ARG) | ||
0004013A | FATAL | Card | BOARD_IN_ERROR | 1 | This board has an out of spec system clock signal. : $(ARG) | |
0004013B | FATAL | Service_Card | BOARD_IN_ERROR | 1 | This card is not up and in a usable state. : $(ARG) | |
0004013C | WARN | Card | The broadcast launch of the kernel failed, $(STATUS) | |||
0004013D | FATAL | BQC | FREE_COMPUTE_BLOCK,COMPUTE_IN_ERROR | The kernel launch failed, $(STATUS) | ||
0004013E | FATAL | Card | Mailbox register read failed: $(STATUS) | |||
00040140 | WARN | Card | The broadcast verify (read) of a kernel image failed, $(STATUS) | |||
00040141 | FATAL | BQC | FREE_COMPUTE_BLOCK,COMPUTE_IN_ERROR | The verify (read) of a kernel image failed, $(STATUS) | ||
00040142 | FATAL | BQC | The node sent an invalid mailbox message. The header indicated it was a RAS message but it had no payload. | |||
00040143 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Detected that this board has become unusable (due to invalid power rail voltages) | ||
00040144 | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Unable to reset/reconnect to this card : $(ARG) | ||
00040145 | WARN | BQC | The node sent an unexpected mailbox command, $(STATUS) | |||
00040146 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | The node failed to send a control system barrier request. | ||
00040147 | FATAL | Card | BOARD_IN_ERROR | 1 | Incorrect number of link chips on this board. : $(ARG) | |
00040148 | FATAL | Card | BOARD_IN_ERROR | 1 | Incorrect number of optics modules on this board. : $(ARG) | |
00040149 | FATAL | Card | BOARD_IN_ERROR | 1 | Incorrect number of computes on this board. : $(ARG) | |
0004014A | FATAL | Node_Board | BOARD_IN_ERROR | Unable to load the FPGA image into this node board (from the service card). : $(ARG) | ||
0004014B | FATAL | Software_Error | Unable to create a pthread to process a ServiceNetworkSubnet command request: $(ARG) | |||
0004014C | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | Detected that this board has become unusable (due to unresponsive fpga) | ||
0004014D | FATAL | Card | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | 1 | This board was powered off due to overtemperature. : $(ARG) | |
0004014E | INFO | Software_Error | Received a request to restart the special subnet MCSERVER_DIE, but there is no dirty hardware, so there is nothing else to do. : $(ARG) | |||
0004014F | INFO | Software_Error | PrimaryMc was told to perform a non-controlled shutdown. : $(ARG) | |||
00040150 | INFO | Software_Error | PrimaryMc was told to perform a controlled shutdown. : $(ARG) | |||
00040151 | FATAL | Software_Error | Uncorrectable error occurred during bringup of this machine. : $(ARG) | |||
00040152 | WARN | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Encountered an exception while servicing this compute's mailbox. $(STATUS) | ||
00040153 | FATAL | Software_Error | Unable to create a pthread to delete/shutdown a board: $(ARG) | |||
00040154 | FATAL | Node_Board | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | This card's parent service card is not functional/initialized, so we are unable to fully reinitialize this card. : $(ARG) | ||
00040155 | WARN | Node_Board | During failover, unable to discover this board even after resetting service card : $(ARG) | |||
00040156 | WARN | Node_Board | During failover, unable to discover any node boards for this entire midplane. : $(ARG) | |||
00040157 | FATAL | BQC | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | CFAM Live Lock Buster Failure. $(ARG) | ||
00040158 | FATAL | Clocks | BOARD_IN_ERROR | 1 | Unable to set up the system master clock card : $(ARG) | |
00040159 | FATAL | BQC | Compute has invalid EnvMon data | |||
0004015A | FATAL | BQC | Compute has an EnvMon data parity error | |||
0004015B | FATAL | BQC | Compute has an EnvMon data sync error | |||
0004015C | WARN | Card | Verification of the kernel shutdown failed for all nodes being shut down on this board. | |||
0004015D | WARN | Card | 1 | Send of a kernel shutdown message failed for the nodes being shut down on this board. Return code=$(ARG) |
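Because empty cells are collapsed in the rows above, the column a value belongs to has to be inferred from its shape. The following Python sketch shows one way to do that. It is not part of any RAS tooling: the function name `parse_ras_row`, the field names, and the rule that a lone number is the COUNT rather than the PERIOD are assumptions made for illustration.

```python
import re

# Severities that appear in the rows of this listing.
SEVERITIES = {"INFO", "WARN", "FATAL"}

# Shape of a CTRL_ACTION cell, e.g. "END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR".
CTRL_ACTION_RE = re.compile(r"[A-Z_]+(,[A-Z_]+)*")


def parse_ras_row(line):
    """Split one flattened row into named fields (heuristic, hypothetical helper)."""
    cells = [c.strip() for c in line.split("|")]
    cells = [c for c in cells if c]  # collapsed empty cells carry no position
    if len(cells) < 4 or cells[1] not in SEVERITIES:
        raise ValueError(f"not a RAS event row: {line!r}")

    row = {
        "msg_id": cells[0],
        "severity": cells[1],
        "category": cells[2],
        "ctrl_action": [],      # empty list when the row names no action
        "count": None,          # assumption: a lone number is COUNT
        "message": cells[-1],   # the message is always the last non-empty cell
    }
    # Anything between CATEGORY and MESSAGE is classified by its shape.
    for cell in cells[3:-1]:
        if cell.isdigit():
            row["count"] = int(cell)
        elif CTRL_ACTION_RE.fullmatch(cell):
            row["ctrl_action"] = cell.split(",")
        else:
            raise ValueError(f"unrecognized cell {cell!r} in row: {line!r}")
    return row


if __name__ == "__main__":
    sample = ("00040009 | WARN | DCA | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR"
              " | 5 | Unable to bring up a power domain : $(ARG) | |")
    print(parse_ras_row(sample))
```

Header rows are rejected with `ValueError` because their second cell is `SEV` rather than a severity, so a caller can skip them while iterating over the listing.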
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00050000 | INFO | Process | dummy RAS event for debug. | |||
00050001 | WARN | Process | Console message queue full. | |||
00050002 | WARN | Process | RAS message queue full. |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00060009 | WARN | Block | Midplane has inconsistent node memory. | |||
0006000A | WARN | Block | Block failed to boot. | |||
0006000D | INFO | Process | I/O node usage error. Block $(BLOCK) cannot boot. Limit exceeded for $(NODE). Usage count is $(COUNT). Limit is $(LIMIT). | |||
00061001 | WARN | Node_Board | MMCS could not contact node board at location $(BG_LOC). | |||
00061002 | WARN | Service_Card | MMCS could not contact service card at location $(BG_LOC). | |||
00061003 | WARN | AC_TO_DC_PWR | MMCS could not contact bulk power module at location $(BG_LOC). | |||
00061004 | WARN | IO_Board | MMCS could not contact IO board at location $(BG_LOC). | |||
00061005 | WARN | Coolant_Monitor | MMCS could not contact coolant monitor at location $(BG_LOC). | |||
00061006 | WARN | Coolant_Monitor | Health Check detected an error on the coolant monitor connected to the service card at location $(BG_LOC). The condition is related to $(COND). | |||
00061007 | WARN | Service_Card | Health Check detected an incorrect clock frequency $(FREQ) on the service card at location $(BG_LOC). | |||
00061008 | WARN | Service_Card | Health Check detected an overtemp condition on the service card at location $(BG_LOC). The temperature $(ACTUAL) is above the expected maximum temperature of $(EXP). | |||
00061009 | WARN | IO_Board | Health Check detected an incorrect clock frequency $(FREQ) on the IO board at location $(BG_LOC). | |||
0006100A | WARN | IO_Board | Health Check detected an abnormal status flag set for the IO board at location $(BG_LOC). The status flag is for $(COMP). | |||
0006100B | WARN | DCA | Health Check detected an abnormal condition for the Direct Current Assembly (DCA) card at location $(BG_LOC). The condition is $(COND). The invalid value is $(BADVAL). | |||
0006100C | WARN | DCA | Health Check detected an abnormal condition for the Direct Current Assembly (DCA) card at location $(BG_LOC). The condition is related to $(COND). The invalid value is $(BADVAL). | |||
0006100D | WARN | Optical_Module | Health Check detected an abnormal condition for the optical module at location $(BG_LOC). The condition is related to $(COND). | |||
0006100E | WARN | Optical_Module | Health Check detected an abnormal condition for the optical module at location $(BG_LOC). The condition is related to $(COND). | |||
0006100F | WARN | AC_TO_DC_PWR | Health Check detected an abnormal condition for the bulk power module at location $(BG_LOC). The condition is related to $(COND). The invalid value is $(BADVAL). | |||
00061010 | WARN | Node_Board | Health Check detected an incorrect clock frequency $(FREQ) on the node board at location $(BG_LOC). | |||
00061011 | WARN | Node_Board | Health Check detected an abnormal status flag set for the node board at location $(BG_LOC). The status flag is for $(COMP). | |||
00061012 | FATAL | AC_TO_DC_PWR | BOARD_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Health Check detected multiple failed bulk power modules in an enclosure. Hardware at location $(BG_LOC) is being marked in error. | ||
00061013 | FATAL | AC_TO_DC_PWR | BOARD_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Health Check detected multiple failed bulk power modules in an enclosure. Hardware at location $(BG_LOC) is being marked in error. | ||
00061014 | FATAL | Coolant_Monitor | 1 | Health Check detected the coolant monitor connected to the service card at location $(BG_LOC) has been shut-off: $(COND) | ||
00061015 | WARN | BQC | Health Check detected an overtemp condition on the compute node at location $(BG_LOC). The temperature $(ACTUAL) is above the expected maximum temperature of $(EXP). | |||
00062000 | INFO | Job | killing job $(JOB) timed out after $(TIMEOUT) seconds. $(NODE_COUNT) nodes are now unavailable. | |||
00062001 | FATAL | Block | SOFTWARE_IN_ERROR | Failed to authenticate with the CIOS $(DAEMON) daemon running on I/O node $(BG_LOC). | ||
00062002 | INFO | Job | The prolog program failed with error $(ERROR) on I/O node $(BG_LOC). | |||
00062003 | INFO | Job | The epilog program failed with error $(ERROR) on I/O node $(BG_LOC). | |||
00062004 | WARN | Block | A job was submitted to block $(BG_BLOCKID) that is not ready to run jobs. | |||
00062005 | FATAL | Block | SOFTWARE_IN_ERROR,FREE_COMPUTE_BLOCK,END_JOB | A CIOS jobctl daemon running on I/O node $(BG_LOC) failed to acknowledge a heartbeat. | ||
00062006 | WARN | Job | Maximum secondary group limit exceeded for user $(USER) running job $(JOB). | |||
00063000 | INFO | Process | user $(USER) denied $(ACTION) authority on $(OBJECT) $(ID) | |||
00063001 | INFO | Process | user $(USER) denied administrative authority for $(COMMAND) |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00070101 | FATAL | Card | VPD for $(CARD_TYPE) $(BG_LOC) is not coherent. | |||
00070102 | FATAL | Card | ECID chip value $(BG_ECID) does not match ECID VPD value $(VPD_VALUE) for $(CARD_TYPE) $(BG_LOC). | |||
00070103 | FATAL | BQC | ECID value $(BG_ECID) is not supported for $(CARD_TYPE) $(BG_LOC). $(ERROR_DATA) | |||
00070104 | FATAL | BQC | CCIN field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070105 | FATAL | BQC | VT field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070107 | FATAL | BQC | VD field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070108 | FATAL | BQC | VPD for $(CARD_TYPE) $(BG_LOC) does not contain HSS calibration data. | |||
00070109 | FATAL | BQC | $(CARD_TYPE) $(BG_LOC) (CCIN $(VPD_VALUE)) does not follow the compute card mix rules. | |||
00070201 | FATAL | Card | VPD for $(CARD_TYPE) $(BG_LOC) is not coherent. | |||
00070202 | FATAL | Card | ECID value $(BG_ECID) read from chip does not match VPD ECID value $(VPD_VALUE) for $(CARD_TYPE) $(BG_LOC). | |||
00070203 | FATAL | Card | ECID value $(BG_ECID) is not supported for $(CARD_TYPE) $(BG_LOC). $(ERROR_DATA) | |||
00070204 | WARN | BQC | CCIN field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070205 | FATAL | BQC | VT field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070206 | INFO | Card | Service Action $(IDSA) started to service $(BG_LOC) by $(USER). | |||
00070207 | INFO | Card | Service Action $(IDSA) completed on $(BG_LOC) by $(USER). | |||
00070208 | INFO | Card | Service Action $(IDSA) on $(BG_LOC) was forced closed by $(USER). | |||
00070209 | INFO | Card | Service Action $(IDSA) on $(BG_LOC) failed with error code $(ERROR). $(ERROR_DATA) | |||
0007020A | INFO | Card | Service Action $(IDSA) turned off $(CARD_TYPE) $(BG_LOC). | |||
0007020C | INFO | AC_TO_DC_PWR | Service Action $(IDSA) turned off $(CARD_TYPE) $(BG_LOC). | |||
0007020F | INFO | Card | Service Action $(IDSA) restarted $(CARD_TYPE) $(BG_LOC). | |||
00070212 | INFO | AC_TO_DC_PWR | Service Action $(IDSA) verified that $(CARD_TYPE) $(BG_LOC) is functional. | |||
00070214 | FATAL | AC_TO_DC_PWR | $(CARD_TYPE) $(BG_LOC) is not functional. $(ERROR_DATA) | |||
00070216 | FATAL | BQC | VD field in the VPD is not valid for $(CARD_TYPE) $(BG_LOC): '$(VPD_VALUE)'. | |||
00070217 | INFO | Cable | Verification of the cable from $(FROMPORT) to $(TOPORT) on $(CARD_TYPE) $(BG_LOC) failed. $(ERROR_DATA) | |||
00070218 | WARN | Cable | Cable from $(FROMPORT) to $(TOPORT) on $(CARD_TYPE) $(BG_LOC) contains bad wires. $(ERROR_DATA) | |||
00070219 | FATAL | Cable | Cable from $(FROMPORT) to $(TOPORT) on $(CARD_TYPE) $(BG_LOC) contains bad wires. $(ERROR_DATA) | |||
0007021B | FATAL | Card | VPD is not available for $(CARD_TYPE) $(BG_LOC). $(ERROR_DATA) | |||
0007021C | FATAL | BQC | VPD for $(CARD_TYPE) $(BG_LOC) does not contain HSS calibration data. | |||
0007021D | FATAL | BQC | $(CARD_TYPE) $(BG_LOC) (CCIN $(VPD_VALUE)) does not follow the compute card mix rules. | |||
0007021E | INFO | Coolant_Monitor | Service Action $(IDSA) verified that $(CARD_TYPE) $(BG_LOC) is functional. | |||
0007021F | FATAL | Coolant_Monitor | $(CARD_TYPE) $(BG_LOC) is not functional. $(ERROR_DATA) | |||
00070301 | FATAL | Process | BlueGene Resource Agent has been stopped on service node $(HOSTNAME) by user $(USER). | |||
00070302 | INFO | Process | BlueGene Resource Agent has been started on service node $(HOSTNAME) by user $(USER). | |||
00070303 | INFO | Process | BlueGene Resource Agent has completed on service node $(HOSTNAME) by user $(USER). |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00080001 | FATAL | PCI | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | PCIe Error. $(DETAILS) | ||
00080002 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | CRC error detected. address=$(ADDDRESS) size=$(SIZE) expected-CRC=$(EXPECTED) actual-CRC=$(ACTUAL) | ||
00080004 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | BeDRAM Machine Check : $(DETAILS) | ||
00080005 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | ClockStop Unit Machine Check : $(DETAILS) | ||
00080006 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | DCR Arbiter Machine Check : $(DETAILS) | ||
00080007 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | DDR Arbiter Machine Check : $(DETAILS) | ||
00080008 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | DevBus Machine Check : $(DETAILS) | ||
00080009 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | EnvMon Machine Check : $(DETAILS) | ||
0008000A | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | GEA Machine Check : $(DETAILS) | ||
0008000B | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | L1P Machine Check : $(DETAILS) | ||
0008000C | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | L2 Machine Check : $(DETAILS) | ||
0008000D | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | L2C Machine Check : $(DETAILS) | ||
0008000E | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | L2 Counter Machine Check : $(DETAILS) | ||
0008000F | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | MSGC Machine Check : $(DETAILS) | ||
00080010 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | TestInt Machine Check : $(DETAILS) | ||
00080011 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | UPC Machine Check : $(DETAILS) | ||
00080012 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | Wakeup Unit Machine Check : $(DETAILS) | ||
00080013 | WARN | BQC | RAS Storm Warning: Firmware has detected a burst of similar RAS events and has compressed them. There were $(%d,COUNT) similar events detected for message code $(%08x,CODE). The burst has subsided. | |||
00080014 | FATAL | BQC | END_JOB,FREE_COMPUTE_BLOCK | RAS Storm Error: Firmware has detected a significant burst of similar RAS events and has compressed them. There were $(%d,COUNT) similar events detected for message code $(%08x,CODE). | ||
00080015 | WARN | Message_Unit | Message Unit Recoverable Error: $(DETAILS) | |||
00080016 | FATAL | Message_Unit | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | Message Unit Error: $(DETAILS) | ||
00080017 | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | SerDes Machine Check: $(DETAILS) | ||
00080018 | WARN | Message_Unit | ND Correctable Error: $(DETAILS) | |||
00080019 | FATAL | Message_Unit | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | ND Fatal Error: $(DETAILS) | ||
0008001A | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | A2 Processor Machine Check : $(DETAILS) | ||
0008001B | WARN | BQC | DDR Arbiter Machine Check (Recoverable) : $(DETAILS) | |||
0008001C | WARN | BQC | L1P Correctable : $(DETAILS) | |||
0008001D | WARN | BQC | DDR Arbiter Machine Check (Recoverable) : $(DETAILS) | |||
0008001E | WARN | BQC | L2 Machine Check (Recoverable) : $(DETAILS) | |||
0008001F | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | Unrecoverable Machine Check. | ||
00080020 | WARN | BQC | 10 | Memory Controller Initialization Warning: $(DETAILS) | ||
00080021 | FATAL | PCI | PCIe Root Complex Initialization Failed at Step $(I). | |||
00080022 | WARN | Software_Error | (WARNING) $(MSG) | |||
00080023 | FATAL | Software_Error | (ERROR) $(MSG) | |||
00080024 | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK,SOFTWARE_IN_ERROR | Unexpected Interrupt: $(DETAILS). | ||
00080025 | INFO | Software_Error | Firmware termination: status:$(STATUS) LR:$(LR) SRR0:$(SRR0) SRR1:$(SRR1) ESR:$(ESR) DEAR:$(DEAR) | |||
00080026 | INFO | BQC | DDR Drilldown : $(DETAILS) | |||
00080027 | INFO | PCI | PCIe Initialization took $(%d,MILLIS) milliseconds. | |||
00080028 | WARN | PCI | PCIe PL_LINKUP status has not locked after $(%d,DURATION) milliseconds | |||
00080029 | FATAL | Software_Error | SOFTWARE_IN_ERROR,FREE_COMPUTE_BLOCK | The actual DDR memory size of $(%d,ACTUAL)MB is less than the configured size of $(%d,CONFIGURED)MB. | ||
0008002A | INFO | Software_Error | The actual DDR memory size of $(%d,ACTUAL)MB is larger than the configured size of $(%d,CONFIGURED)MB. | |||
0008002B | INFO | Software_Error | DDR memory size has been automatically adjusted to $(%d,ACTUAL)MB from $(%d,CONFIGURED)MB to match the hardware. | |||
0008002C | WARN | Software_Error | A control system barrier has gone unacknowledged for $(%d,MICROS) microseconds. | |||
0008002D | WARN | BQC | 10 | Bad DRAM was detected - $(DETAILS) | ||
0008002E | WARN | BQC | 10 | Bad PHY was detected - MC $(%d,$MC) Byte $(%d,BYTE) | ||
0008002F | INFO | BQC | 100000 | 1 day | L1P Correctable Error Summary : $(DETAILS) | |
00080030 | INFO | BQC | 2400 | 1 day | L2 Array Correctable Error Summary : $(DETAILS) | |
00080031 | INFO | BQC | L2 Directory Correctable Error Summary : $(DETAILS) | |||
00080032 | WARN | BQC | END_JOB | Illegal DCR Access : $(DETAILS) | ||
00080033 | INFO | DDR | DDR Correctable Error Summary : $(DETAILS) | |||
00080034 | INFO | DDR | DDR Maintenance Correctable Error Summary : $(DETAILS) | |||
00080036 | INFO | Message_Unit | 10 | 1 hour | Message Unit ECC Summary : $(DETAILS) | |
00080037 | INFO | Message_Unit | ND Receiver Link Error : $(LINK) count=$(%d,2) $(DETAILS) | |||
00080038 | INFO | Message_Unit | ND Sender Retransmission Correctable Error : $(LINK) count=$(%d,2) $(DETAILS) | |||
00080039 | INFO | Message_Unit | 5 | 1 hour | ND Receiver Correctable Error : $(LINK) count=$(%d,2) $(DETAILS) | |
0008003A | FATAL | BQC | END_JOB,SOFTWARE_IN_ERROR,FREE_COMPUTE_BLOCK | A2 Processor Machine Check : $(DETAILS) | ||
0008003B | WARN | BQC | 10 | A2 TLB Parity Error : MMUCR1=$(MMUCR1) MCSR=$(MCSR) : $(MCSR_DETAILS) | ||
0008003C | INFO | Software_Error | $(MSG) | |||
0008003D | FATAL | BQC | END_JOB,COMPUTE_IN_ERROR,FREE_COMPUTE_BLOCK | Memory Controller Initialization Error: $(DETAILS) | ||
0008003E | INFO | BQC | Memory Controller Initialization Information : $(DETAILS) | |||
0008003F | FATAL | BQC | Barrier Initialization Error : $(DETAILS) | |||
00080099 | WARN | BQC | This is a test. |
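The MESSAGE column in these tables uses two placeholder shapes: a plain `$(NAME)` and a formatted `$(%d,NAME)` or `$(%08x,CODE)`. The following is a minimal sketch of how such a template could be expanded, assuming the format part behaves like a printf conversion. The function name `expand` and the regular expression are illustrative and are not taken from the control system.

```python
import re

# Matches $(NAME) and $(%fmt,NAME). This grammar is inferred from the
# table rows, not from a published specification.
PLACEHOLDER = re.compile(r"\$\((?:(%[^,()]+),)?([^(),]+)\)")

def expand(template, values):
    """Substitute placeholders in a RAS message template (hypothetical helper)."""
    def substitute(match):
        fmt, name = match.group(1), match.group(2)
        if name not in values:
            # Leave unknown placeholders untouched so gaps stay visible.
            return match.group(0)
        value = values[name]
        # Assumption: the format part is a printf-style conversion.
        return fmt % value if fmt else str(value)
    return PLACEHOLDER.sub(substitute, template)

storm = ("There were $(%d,COUNT) similar events detected "
         "for message code $(%08x,CODE).")
print(expand(storm, {"COUNT": 12, "CODE": 0x00080015}))
```

Placeholders with no supplied value are left as written, which makes a missing detail field easy to spot in the expanded text.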
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
00090001 | FATAL | BQL | A Link Chip did not indicate HSS Ready: $(STATUS) | |||
00090002 | WARN | BQC | PLL problem on BQC chip: $(STATUS) | |||
00090003 | FATAL | BQC | POR DONE indicator bit not set on BQC chip. | |||
00090004 | FATAL | BQC | Zero-scan LBIST failed to complete on BQC chip. | |||
00090005 | FATAL | BQC | Fuse Download procedure failed to complete on BQC chip using the following method: $(STATUS) | |||
00090006 | FATAL | BQC | Miscompare on debug TDR readback: $(STATUS) | |||
00090007 | FATAL | BQC | JTAG2PIB interface dummy read failed. | |||
00090008 | FATAL | BQC | Scan0 procedure failed to complete successfully on BQC chip. | |||
00090009 | FATAL | BQC | Fuse Sense Done bit not set in ECID register. | |||
0009000A | FATAL | BQC | ABIST procedure failed to complete successfully on BQC chip. | |||
0009000B | FATAL | BQC | An error has occurred while trying to update the spare core value: $(STATUS) | |||
0009000C | FATAL | BQC | Invalid number of cores to release, must be between 1 and 17 inclusive: $(STATUS) | |||
0009000D | FATAL | BQC | An error occurred while trying to read the JTAG ID of the BQC chip: $(STATUS) | |||
0009000E | FATAL | BQC | Check for Success of POR sequence failed: $(STATUS) | |||
0009000F | WARN | BQC | The BQC clocks are not in the correct state: $(STATUS) | |||
00090010 | FATAL | BQC | Unexpected status bits are active in ACCESS clock status register: $(STATUS) | |||
00090011 | FATAL | BQC | IRSTAT failure. | |||
00090012 | INFO | BQC | JTAG2PIB interface dummy read failed at least one time. | |||
00090013 | INFO | BQC | Redundancy value updated to: $(STATUS) | |||
0009009E | FATAL | Software_Error | Invalid input parameters detected: $(STATUS) | |||
0009009F | FATAL | BQC | SerDes training failure | |||
000900A0 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical A-) receiver. | |||
000900A1 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical A+) receiver. | |||
000900B0 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical B-) receiver. | |||
000900B1 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical B+) receiver. | |||
000900C0 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical C-) receiver. | |||
000900C1 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical C+) receiver. | |||
000900D0 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical D-) receiver. | |||
000900D1 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical D+) receiver. | |||
000900E0 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical E-) receiver. | |||
000900E1 | WARN | BQC | Training failure detected by the Torus logical $(LOGICAL) (SerDes physical E+) receiver. | |||
000900F0 | WARN | BQC | Training failure detected by the SerDes I/O link receiver. | |||
000900FF | FATAL | BQL | A Link Chip write scom failed: $(STATUS) | |||
00090100 | FATAL | BQL | A link chip did not align along the A port on switch 0: $(STATUS) | |||
00090101 | FATAL | BQL | A link chip did not align along the A port on switch 1: $(STATUS) | |||
00090102 | FATAL | BQL | A link chip did not align along the A port on switch 2: $(STATUS) | |||
00090103 | FATAL | BQL | A link chip did not align along the A and B ports on switch 3: $(STATUS) | |||
00090104 | FATAL | BQL | CABLE_IN_ERROR | A link chip did not bit align along the C port: $(STATUS) | ||
00090105 | FATAL | BQL | CABLE_IN_ERROR | A link chip did not byte align along the C port: $(STATUS) | ||
00090106 | WARN | BQC | A write to a RCB register failed | |||
00090107 | FATAL | BQL | PRBS failure detected by the link chip B port transmitter: $(STATUS) | |||
00090108 | FATAL | BQL | PRBS failure detected by the link chip A port receiver: $(STATUS) | |||
00090109 | WARN | BQC | PLL did not lock | |||
000901A0 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical A-) receiver. | |||
000901A1 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical A+) receiver. | |||
000901B0 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical B-) receiver. | |||
000901B1 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical B+) receiver. | |||
000901C0 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical C-) receiver. | |||
000901C1 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical C+) receiver. | |||
000901D0 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical D-) receiver. | |||
000901D1 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical D+) receiver. | |||
000901E0 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical E-) receiver. | |||
000901E1 | WARN | BQC | PRBS failure detected by the Torus logical $(LOGICAL) (Serdes physical E+) receiver. | |||
000901F0 | WARN | BQC | PRBS failure detected by the Serdes I/O link receiver. | |||
000901F1 | FATAL | BQL | PRBS failure detected by the link chip C port receiver: $(STATUS) | |||
000901F2 | WARN | BQC | Data eye opening is smaller than 20%: $(STATUS) | |||
000901F3 | WARN | BQC | Lane not ready: $(STATUS) | |||
000901F4 | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | Invalid serdes rate settings detected: $(STATUS) | ||
000901F5 | WARN | BQC | IRstat failure during program load: $(STATUS) | |||
000901F6 | FATAL | Software_Error | ELF image filename is null | |||
000901F7 | FATAL | Software_Error | ELF image file OPEN failed: open return code = $(STATUS) | |||
000901F8 | FATAL | BQC | ELF image load failed: return code = $(STATUS) | |||
000901F9 | FATAL | Software_Error | ELF image segment has invalid data: segment number = $(STATUS) | |||
000901FA | FATAL | Software_Error | ELF image segment get data failed: $(STATUS) | |||
000901FB | FATAL | BQL | This link chip's ACCESS macro raised an alert: $(STATUS) | |||
000901FC | FATAL | BQL | This link chip's ACCESS macro raised an alert due to an LBIST SCOM attention: $(STATUS) | |||
000901FD | FATAL | BQL | This link chip's ACCESS macro raised an alert due to a Clock Tree SCOM attention: $(STATUS) | |||
000901FE | FATAL | BQL | This link chip's ACCESS macro raised an alert due to an ACCESS SCOM attention: $(STATUS) | |||
000901FF | FATAL | BQL | This link chip's ACCESS macro raised an alert due to a Machine Check attention. : $(STATUS) | |||
00090200 | WARN | BQL | A BQL single bit error threshold was exceeded. | |||
00090201 | WARN | BQL | A BQL single bit error threshold was exceeded but sparing is not possible. | |||
00090202 | WARN | BQL | A BQL double bit error threshold was exceeded for Switch $(SWITCH) Group $(GROUP) | |||
00090203 | FATAL | BQL | The Access controller's SCOM bus is hung: $(STATUS) | |||
00090204 | FATAL | BQL | This link chip's ClkInt Stat indicates an scom command collision: $(STATUS) | |||
00090205 | INFO | BQL | The link chip rate was changed from fullrate to halfrate. | |||
00090206 | WARN | BQL | The link chip rate was changed from halfrate to fullrate. | |||
00090207 | WARN | BQL | A BQL double bit error was observed for group(s) $(SWITCH) | |||
00090208 | WARN | BQL | A BQL 4G HSS lost ready $(SWITCH) | |||
00090209 | WARN | BQL | A BQL 4G HSS lost PLL Lock $(SWITCH) | |||
0009020A | WARN | BQL | A BQL 10G HSS lost PLL Lock $(SWITCH) | |||
0009020B | WARN | BQL | A BQL 10G HSS has degraded eye quality $(SWITCH) | |||
0009020C | WARN | BQL | A BQL sparing register was misconfigured. | |||
0009020D | INFO | BQL | BQL_SPARE | A BQL lane was spared. | ||
0009020E | WARN | BQC | An exception was caught during BqcSerdes::enableLanes() processing. | |||
0009020F | FATAL | BQC | The transmitting node did not align with link chip $(STATUS) | |||
00090210 | FATAL | BQL | BQL_SPARE | A link chip did not bit align along the receiver C port: $(STATUS). The control system will attempt to replace the failing lane(s) with spare(s). | ||
00090211 | FATAL | BQL | BQL_SPARE | A link chip did not byte align along the receiver C port: $(STATUS). The control system will attempt to replace the failing lane(s) with spare(s). | ||
00090212 | FATAL | BQC | PCIE-Clock PLL problem on BQC chip: $(STATUS) | |||
00090213 | FATAL | BQC | 1 | Link failure detected between nodes connected via copper links. Neighbor location=$(NEIGHBOR) | ||
00090214 | FATAL | BQL | CABLE_IN_ERROR | BQL receiver sparing failed: $(STATUS) | ||
00090215 | FATAL | BQL | BQL transmitter sparing failed: $(STATUS) | |||
00090216 | FATAL | BQC | 1 | Link failure detected between nodes connected via copper and optical links. Neighbor location=$(NEIGHBOR) | ||
00090217 | FATAL | BQC | 1 | Serdes link failure. |
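In the table above, the torus receiver events follow a visible numbering pattern: IDs 000900A0 through 000900E1 report training failures and 000901A0 through 000901E1 report PRBS failures, with the second-to-last hex digit naming the SerDes physical dimension (A to E) and the last digit giving the direction (0 for minus, 1 for plus). The sketch below decodes an ID on that basis. The pattern is read off the rows, not taken from a specification, and the function name `torus_link` is hypothetical.

```python
def torus_link(msg_id):
    """Decode a torus receiver event ID into (kind, dimension, direction).

    Returns None for any ID outside the two ranges seen in the table.
    The encoding is an inference from the listed rows.
    """
    code = int(msg_id, 16)
    if code >> 16 != 0x0009:
        return None
    kinds = {0x00: "training", 0x01: "PRBS"}
    kind = kinds.get((code >> 8) & 0xFF)
    dim, sign = (code >> 4) & 0xF, code & 0xF
    if kind is None or not 0xA <= dim <= 0xE or sign not in (0, 1):
        return None
    return kind, "ABCDE"[dim - 0xA], "-" if sign == 0 else "+"

print(torus_link("000900C1"))
print(torus_link("000901A0"))
print(torus_link("000900F0"))  # I/O link receiver, not a torus dimension
```

IDs such as 000900F0 and 000901F0 (the SerDes I/O link receiver) fall outside the A to E range and are deliberately returned as `None`.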
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
000A0001 | WARN | PCI | [PCIe] An unsupported PCIe adapter was detected: $(DETAILS) | |||
000A0002 | FATAL | PCI | COMPUTE_IN_ERROR | [PCIe] No PCIe adapter VPD detected. | ||
000A0003 | FATAL | Software_Error | SOFTWARE_IN_ERROR | [BOOT] All attempts to mount /bgsys failed: $(DETAILS) | ||
000A0004 | FATAL | Software_Error | SOFTWARE_IN_ERROR | [GPFS] GPFS failed to start: $(DETAILS) | ||
000A0005 | FATAL | Software_Error | SOFTWARE_IN_ERROR | [BOOT] No network interface was defined for the node: $(DETAILS) | ||
000A0006 | WARN | Software_Error | [LINUX] An init script has encountered an error or failed to properly execute: $(DETAILS) | |||
000A0007 | FATAL | Software_Error | SOFTWARE_IN_ERROR | [BOOT] The specified BG/Q Linux Distribution path is missing or invalid: $(DETAILS) | ||
000A0008 | FATAL | Software_Error | SOFTWARE_IN_ERROR | [GPFS] GPFS on the specified cluster node failed to initialize: $(DETAILS) | ||
000A0009 | WARN | PCI | [PCIe] The PCIe adapter is running in a suboptimal configuration. $(DETAILS) | |||
000A000A | FATAL | Software_Error | SOFTWARE_IN_ERROR | [BOOT] Network configuration failed for the indicated node: $(DETAILS) | ||
000A000B | FATAL | Software_Error | SOFTWARE_IN_ERROR | [BOOT] The installation of interrupt vectors has failed. | ||
000A000C | FATAL | Software_Error | [GPFS] GPFS was unable to resolve a hostname for the indicated node: $(DETAILS) | |||
000A000D | FATAL | Software_Error | SOFTWARE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | [LINUX] A kernel panic has occurred: $(DETAILS) | ||
000A000E | FATAL | Ethernet | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | [ETHERNET] An Ethernet link was not established: $(DETAILS) | ||
000A000F | FATAL | Infiniband | COMPUTE_IN_ERROR,END_JOB,FREE_COMPUTE_BLOCK | [INFINIBAND] An Infiniband link was not established: $(DETAILS) | ||
000A0010 | WARN | Software_Error | [BGHEALTHMON] The Blue Gene Node Health Monitor has detected a potential resource problem: $(DETAILS) | |||
000A0011 | FATAL | Ethernet | END_JOB,FREE_COMPUTE_BLOCK | [ETHERNET] The Ethernet link was lost: $(DETAILS) | ||
000A0012 | FATAL | Infiniband | END_JOB,FREE_COMPUTE_BLOCK | [INFINIBAND] The Infiniband link was lost: $(DETAILS) | ||
000A0013 | WARN | Software_Error | [LINUX] The node's root filesystem is unresponsive: $(DETAILS) | |||
000A0014 | WARN | Software_Error | [LINUX] A problem was encountered while processing the configuration service data for the specified node: $(DETAILS) |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
000B0001 | WARN | Software_Error | CIOS daemon in process $(%d,PID) received signal $(%d,SIGNAL). | |||
000B0002 | FATAL | Software_Error | SOFTWARE_IN_ERROR | iosd failed to start a daemon with process $(%d,PID), errno $(%d,ERRNO). | ||
000B0003 | FATAL | Software_Error | SOFTWARE_IN_ERROR | A CIOS daemon failed to initialize and is not ready, errno $(%d,ERRNO). | ||
000B0004 | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | A CIOS daemon running in process $(%d,OLDPID) was restarted after it failed and is now running in process $(%d,NEWPID) after $(%d,RESTARTS) restart attempts. | ||
000B0005 | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | A CIOS daemon running in process $(%d,PID) has reached the restart limit after being restarted $(%d,RESTARTS) times and was not restarted. | ||
000B0006 | WARN | Software_Error | The sysiod process seems to be stuck in a system call while running. Flight log $(DETAILS) | |||
000B0007 | WARN | Software_Error | The sysiod process seems to be stuck in a system call while ending. Flight log $(DETAILS) | |||
000B0008 | WARN | Software_Error | SOFTWARE_IN_ERROR | A CIOS daemon is testing RAS. |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
000C0001 | FATAL | Software_Error | MUDM encountered a fatal error $(ERROR) | |||
000C0002 | WARN | Software_Error | MUDM encountered an error $(ERROR) | |||
000C0039 | WARN | Software_Error | A system packet appears to be stuck on the torus MU transmission queue. Packet descriptor $(PKTD), timestamp $(TIMESTAMP) packet descriptor queued timestamp $(PKTDTIMESTAMP) packet descriptor count $(PKTDCOUNT) current reference count $(COUNT). | |||
000C0040 | WARN | Software_Error | A descriptor appears to be stuck on a system torus MU injection FIFO. The head pointer is currently injecting $(HEX1) $(HEX2) $(HEX3) $(HEX4) $(HEX5) $(HEX6) $(HEX7) $(HEX8). | |||
000C0042 | WARN | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | A descriptor appears to be stuck on the torus MU injection FIFO. The FIFO pointers are start $(START), end $(END), head $(HEAD), and tail $(TAIL). There are $(%d,UNINJECTED) injections pending. The stuck packet has target torus location $(%d,A) $(%d,B) $(%d,C) $(%d,D) $(%d,E) | ||
000C0043 | WARN | Software_Error | A descriptor appears to be stuck on the torus MU injection FIFO. The FIFO pointers are start $(START), end $(END), head $(HEAD), and tail $(TAIL). There are $(%d,UNINJECTED) injections pending. | |||
000CD000 | WARN | Software_Error | The callback on the receiving of a packet had error $(HEX0). The packet header is $(HEX1) $(HEX2) $(HEX3) $(HEX4). | |||
000CD001 | FATAL | Software_Error | END_JOB,FREE_COMPUTE_BLOCK | The callback on the receiving of a packet had error $(HEX0). The packet header is $(HEX1) $(HEX2) $(HEX3) $(HEX4). | ||
000CD002 | WARN | Software_Error | A remote node rejected a transmitted packet. The debug code is $(HEX0). The debug data is $(HEX1) $(HEX2) $(HEX3) $(HEX4). | |||
000CD003 | WARN | Software_Error | A connection is taking a long time to complete. The packet header is $(HEX0) $(HEX1) $(HEX2) $(HEX3). |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
000D0001 | WARN | Message_Unit | MU non-fatal error has been detected in the Network Device Hardware: $(DETAILS) | |||
000D0002 | FATAL | Message_Unit | MU fatal error has been detected in the Network Device Hardware: $(DETAILS) | |||
000D0003 | WARN | Message_Unit | END_JOB,FREE_COMPUTE_BLOCK | MU Network termination check has failed. Details: $(DETAILS) |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
000E0000 | WARN | UPC | Bgpm Performance Monitor interrupts have been disabled due to low level UPC Counter maximum overflow. This is likely a soft hardware error. Details: $(DETAILS) |
MSG ID | SEV | CATEGORY | CTRL_ACTION | COUNT | PERIOD | MESSAGE |
FFFE0000 | FATAL | Software_Error | SOFTWARE_IN_ERROR | This is a test ras message. | ||
FFFE0001 | FATAL | Coolant_Monitor | RACK_IN_ERROR | This is a test ras message. | ||
FFFE0002 | WARN | BQL | CABLE_IN_ERROR | This is a test ras message. | ||
FFFE0003 | WARN | BQL | BQL_SPARE | This is a test ras message. | ||
FFFE0004 | FATAL | Software_Error | FREE_COMPUTE_BLOCK,SOFTWARE_IN_ERROR | This is a test ras message. | ||
FFFE0007 | FATAL | BQC | END_JOB,FREE_COMPUTE_BLOCK | This is a test ras message. | ||
FFFE0008 | FATAL | BQC | END_JOB,FREE_COMPUTE_BLOCK,COMPUTE_IN_ERROR | This is a test ras message. | ||
FFFE0009 | FATAL | BQC | END_JOB,FREE_COMPUTE_BLOCK,BOARD_IN_ERROR | This is a test ras message. | ||
FFFE000A | FATAL | BQC | END_JOB,FREE_COMPUTE_BLOCK,SOFTWARE_IN_ERROR | This is a test ras message. | ||
FFFE000B | WARN | BQC | 10 | This is a test ras message. | ||
FFFE000C | WARN | BQC | 10 | 2 HOURS | This is a test ras message. | |
FFFE000D | FATAL | BQC | COMPUTE_IN_ERROR | This is a test ras message. | ||
FFFE000E | FATAL | BQC | BOARD_IN_ERROR | This is a test ras message. | ||
FFFE000F | FATAL | Software_Error | END_JOB | This is a test ras message. | ||
FFFE0010 | FATAL | BQL | COMPUTE_IN_ERROR | This is a test ras message. | ||
FFFE0011 | FATAL | BQL | END_JOB | This is a test ras message. | ||
FFFE0013 | FATAL | BQL | FREE_COMPUTE_BLOCK | This is a test ras message. | ||
FFFE0014 | WARN | Job | 1 | This is a test ras message. | ||
FFFE0015 | FATAL | BQC | COMPUTE_IN_ERROR | 1 | 1 HOUR | This is a test ras message. |
FFFE0016 | FATAL | BQC | BOARD_IN_ERROR | 1 | 1 HOUR | This is a test ras message. |
FFFE0017 | FATAL | BQC | END_JOB | 1 | 1 HOUR | This is a test ras message. |
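Every data row in these tables begins with MSG ID, SEV and CATEGORY, and ends with the MESSAGE text followed by empty cells. The sketch below reads a row on that assumption. Because empty cells are collapsed to the end of each row, the cells between CATEGORY and MESSAGE cannot be assigned to CTRL_ACTION, COUNT or PERIOD by position alone, so they are kept as an unlabelled list. The function name `parse_row` and the dictionary keys are hypothetical.

```python
def parse_row(line):
    """Split one table row into its reliably positioned fields."""
    cells = [cell.strip() for cell in line.split("|")]
    if len(cells) < 4 or cells[0] == "MSG ID":
        return None  # header row or malformed line
    filled = [cell for cell in cells[3:] if cell]
    return {
        "id": cells[0],
        "severity": cells[1],
        "category": cells[2],
        # Assumption: the last non-empty cell is the message text.
        "message": filled[-1] if filled else "",
        # Control actions, count and period, in the order they appear.
        "other": filled[:-1],
    }

row = parse_row("FFFE0015 | FATAL | BQC | COMPUTE_IN_ERROR | 1 | 1 HOUR | "
                "This is a test ras message. |")
print(row["severity"], row["other"])
```

Keeping the middle cells unlabelled avoids guessing, for example, whether a lone `1` is a count or part of something else.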