Mike Buckley
Platforms TRE Sun Microsystems
Goals
Improve Customer Satisfaction by reducing time to resolution through increased technical proficiency
Reduce the incidence of inaccurate Onsite Action Plans and wrong parts ordered
Replace memory DIMMs and other parts only when necessary
Correctly identify proper DIMM size and part number
Accurately diagnose various memory issues
Save SUN money $$$
Topics
Types of memory errors
Sun's Best Practices regarding memory errors (review)
DIMM size and part number identification techniques
Diagnostic tools and utilities (some new ones)
Troubleshooting tips and techniques
Examples of error messages and tool usage
Known memory issues (PLL DIMM chip)
Resources

CE: Correctable (single bit) errors
Because ECC can correct single bit flips, single bit errors are referred to as Correctable Errors. These are detected and corrected and generally do not impact performance.
Types of CEs:
  Intermittent
  Persistent
  Sticky Bit

UE: Uncorrectable (multi bit) errors
Multi-bit errors are referred to as Uncorrectable Errors. These are detected, but not corrected. These will result in machine reset (panic or reboot).
Sun Proprietary/Confidential: Internal Use Only
When a CE is detected, the device that reads the word and detected the error can correct the data read and continue on unimpeded. However, this does not address the fact that the referenced word could still be resident in memory uncorrected (i.e. a subsequent read of this word could result in another CE event). If, over time, this word in memory is never corrected the possibility starts to arise that another bit may flip in the same word. This would lead to a UE event which will result in a loss of system service. To avoid this possibility, the detection of a CE causes a trap to Solaris. The Solaris error handling code logs the error and scrubs the affected memory word by writing the corrected word back into memory.
Intermittent: Means the error was not detected on a re-read of the affected memory word. "Intermittent" is not the best choice of words because it implies that this same error can be expected to manifest itself at irregular intervals. This CE is also known as a transient soft error. No DIMM with this sort of error should be considered for replacement without first examining the soft error rate (SER) of this DIMM.

Persistent: Means the error was detected again on a re-read of the affected memory word but the scrub operation corrected it. This CE is also known as a temporary soft error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.

Sticky (aka Sticky Bit): Means that the error still exists in memory even after the scrub operation. This CE is also known as a stuck-at hard error. No DIMM with this sort of error should be considered for replacement without first examining the SER of this DIMM.
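The three CE types above come down to two follow-up checks on the affected word: was the error seen again on the re-read, and was it still there after the scrub. The sketch below only illustrates that decision; the function and parameter names are hypothetical and this is not the Solaris error handling code.

```python
def classify_ce(error_on_reread: bool, error_after_scrub: bool) -> str:
    """Classify a correctable error from two follow-up checks.

    error_on_reread:   the error showed up again when the word was re-read
    error_after_scrub: the error was still present after the corrected
                       word was written back (scrubbed)
    Both parameter names are hypothetical, chosen for this illustration.
    """
    if not error_on_reread:
        return "Intermittent"   # transient soft error
    if not error_after_scrub:
        return "Persistent"     # temporary soft error, fixed by the scrub
    return "Sticky"             # stuck-at hard error


# Walk through the three cases described above.
for reread, after_scrub in [(False, False), (True, False), (True, True)]:
    print(reread, after_scrub, classify_ce(reread, after_scrub))
```

Whatever the classification, the replacement decision still depends on the soft error rate of the DIMM, not on a single event.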
# cat messages | grep -i memory
May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info] [AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent
May 24 16:07:34 smro97 SUNW,UltraSPARC-III+: [ID 631608 kern.info] [AFT0] errID 0x00055be6.99821550 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Persistent
May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info] [AFT0] errID 0x00055c2d.d4cf4320 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Sticky
May 24 16:12:40 smro97 SUNW,UltraSPARC-III+: [ID 910566 kern.info] [AFT0] errID 0x00055c2d.d4cf4320 Corrected Memory Error on /N0/SB3/P2/B0/D2 J15500 is Sticky
May 24 16:12:40 smro97 unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module /N0/SB3/P2/B0/D2 J15500
May 24 16:12:40 smro97 unix: [ID 752700 kern.warning] WARNING: [AFT0] Sticky Softerror encountered on Memory Module /N0/SB3/P2/B0/D2 J15500
Here is an example of a memory error that does not involve a CPU module. A PCI controller was reading data from memory.
May 17 18:45:01 j2kweb06 unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction
May 17 18:45:01 j2kweb06 unix: AFSR=40f10000.9f800000 AFAR=00000000.4fbc1e60,
May 17 18:45:01 j2kweb06 <U0402> port id 31. double word offset=4, Memory Module
If this is just a single event then there is nothing to worry about. There was simply a correctable ECC event on a read from memory. The only difference from the "normal" CE events is that this one happened to be detected by the PCI controller (since it was doing the read) instead of a CPU. That is the reason for having ECC-protected memory: the system is doing its job and functioning normally.
Uncorrectable Errors (UE): If a UE is detected, the device that read the word and detected the error cannot correct the data and continue. A UE will cause Solaris to panic if the UE is in kernel memory. Otherwise, Solaris kills the particular user process that contained the memory in error and then issues an orderly shutdown and reboot to protect the other processes in the domain. Either way, whether via panic or shutdown and reboot, the customer is considerably impacted (and will likely call for support).
Memory Scrubber: The Solaris OS runs a memory "scrubber" routine as part of its normal operation. The time interval is 12 hours for scrubbing stale (unused, idle) memory pages. This scrubber does nothing special beyond ensuring that every memory location is accessed at least once every 12 hours. If the access finds a CE, then the normal trap to the Solaris OS that occurs for any CE will scrub the affected memory word by writing the corrected word back into memory and log the event. This ensures that multiple CEs do not have time to build up and form a UE at memory locations that are infrequently accessed.

Correctable memory errors reported EXACTLY every twelve hours are a result of the memory scrubber (see Infodoc 74049: How often does the memory scrubber run?). The normal rules apply: a DIMM should only be replaced if it meets the criteria described in the Sun DIMM Replacement Policy (Infodoc 79928: Sun Enhanced Memory DIMM Replacement Policy).
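One way to recognize the scrubber signature is to check whether the CE timestamps for a DIMM are spaced twelve hours apart. The sketch below is an illustration only: the function name, the one-minute tolerance and the sample timestamps are assumptions, not part of any Sun tool.

```python
from datetime import datetime, timedelta

SCRUB_INTERVAL = timedelta(hours=12)  # scrub interval stated above


def looks_like_scrubber(timestamps, tolerance=timedelta(minutes=1)):
    """Return True if every gap between consecutive events is 12 hours,
    give or take `tolerance` (an assumed allowance for logging jitter)."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return False  # a single event has no interval to measure
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return all(abs(gap - SCRUB_INTERVAL) <= tolerance for gap in gaps)


# Hypothetical event times, not taken from a real messages file.
first = datetime(2006, 5, 24, 4, 0, 0)
events = [first, first + timedelta(hours=12), first + timedelta(hours=24)]
print(looks_like_scrubber(events))
```

A match only explains why the errors surface on that schedule; the replacement policy still decides whether the DIMM is changed.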
BADWRITERS:

1. Sometimes multiple memory DIMMs within a system can start reporting soft errors. Examining the messages may reveal that the same data bit (or error syndrome) is in error on each DIMM. This indicates that some other component is actually writing the bad data to RAM and consistently creating errors at the same bit address, regardless of the physical DIMM. Recognizing this pattern and troubleshooting further can prevent much wasted downtime and cost, and the replacement of perfectly good memory DIMMs.

2. When a DIMM is replaced and the errors persist, or return with the same data bit in error, some other component in the system is likely causing the memory errors. Again, recognizing this possibility can head off assumptions that replacing memory will solve the problem.

3. In terms of CE/ECC, a system may only reveal errors when the failing address range is utilized by a particular application or combination of applications. This is almost always a hardware fault. In very rare instances, bad code may generate errors that appear to be hardware.

A good first step when troubleshooting a reproducible CE memory issue is to isolate or disable the suspect memory component(s) via asr-disable, setenv disabled-memory-list X, setenv disabled-board-list X (all under OBP), psradm -f X or cfgadm (under the OS). If disabling the suspect memory components is not possible, it may be advisable (especially on lower-end machines) to swap the suspect DIMM with another DIMM in the same bank. If the problem follows the DIMM, replace it. If the problem persists in the same location, it is not a bad DIMM issue.

Note: FINDAFT is especially useful when diagnosing Bad Writer scenarios. Look for a common CPU (the one implicated more often than the other CPUs) as the possible Bad Writer.
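The pattern in point 1 (the same data bit in error on several DIMMs) can be checked mechanically once the DIMM name and data bit have been pulled from each CE message. The sketch below assumes that extraction has already been done; the function name and the sample events are hypothetical.

```python
from collections import defaultdict


def suspect_bad_writer_bits(events):
    """events: iterable of (dimm, data_bit) pairs taken from CE messages.

    Returns {data_bit: sorted list of DIMMs} for every data bit that was
    reported in error on two or more different DIMMs.
    """
    dimms_by_bit = defaultdict(set)
    for dimm, bit in events:
        dimms_by_bit[bit].add(dimm)
    return {bit: sorted(dimms)
            for bit, dimms in dimms_by_bit.items()
            if len(dimms) > 1}


# Hypothetical sample: bit 31 fails on three DIMMs, bit 12 on only one.
sample = [("U0302", 31), ("U0401", 31), ("U0303", 31), ("U0302", 12)]
print(suspect_bad_writer_bits(sample))
```

A non-empty result is a prompt to look at the component writing the data (for example the CPU implicated most often) before ordering any DIMMs.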
4B. For all UltraSPARC II-based systems and all other systems without Predictive Self-Healing (Solaris 9 and earlier), whenever Solaris reports two or more CEs from two or more different physical addresses on each of three or more different outputs from the same DRAM within 24 hours of each other, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.

[Note: This means at least 6 CEs; two from one DRAM output signal, with unique addresses, two from another output from the same DRAM, also with unique addresses, and two more from yet another output from the same DRAM, again with unique addresses, as long as the three outputs do not all correspond to the same relative bit position in their respective checkwords.]

5. For Solaris 8 and 9 systems with page retirement (Solaris 8, patch level 108528-24 or later; Solaris 9, patch level 112233-11 or later), as well as for UltraSPARC II-based systems running Solaris 10 and later, when the system indicates that the page retirement limit of 0.1% of physical memory has been reached and denotes one and only one DIMM as suspect (i.e., it has accumulated 130 or more non-intermittent CEs). If more than one DIMM is marked as suspect, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs. [Note: Determining these factors is aided by the CEDIAG diagnostic tool set.] In the unlikely event that the system indicates that the page retirement limit has been reached but no DIMM is marked as suspect, contact a Sun Support specialist for assistance in determining any necessary action.

Example:
connole 73 =>uname -a
SunOS connole 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-5_10
6. For older Solaris releases and patch levels, when Solaris reports more than 24 non-intermittent CEs in 24 hours from a single DIMM. If more than one DIMM has experienced more than 24 non-intermittent CEs in 24 hours, then other possible causes of CEs have to be ruled out by a qualified Sun Support specialist before replacing any DIMMs.

Limitations: Prior to Solaris 10, retired pages are returned to service whenever a system is rebooted, and will be re-retired if and when Solaris encounters CEs from them again. POST may fail a DIMM that contained retired pages; if it does, replace the DIMM at that time.

---------------------------------(end of official policy)---------------------------------

Note: Exceptions MAY be made to the Policy in the interest of Customer Satisfaction. Consult with your lead, backline or manager if necessary. When making an exception, always record it in the case notes.

Example: Advised customer of Sun's Enhanced Memory DIMM Replacement Policy and suggested that they employ the cediag utility. Referenced Infodocs 79928 & 82264, which explain more about Sun's Enhanced Memory DIMM Replacement Policy and the recommended CEDIAG utility. Customer declined to follow recommendations and insists upon DIMM replacement.
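Rule 6 of the policy (more than 24 non-intermittent CEs in 24 hours from a single DIMM) is a sliding-window count. The sketch below shows one way to evaluate it over the CE timestamps of one DIMM; the function name is hypothetical, treating the 24-hour window as inclusive is an assumption, and cediag remains the supported way to check the rules.

```python
from datetime import datetime, timedelta


def exceeds_rule6(timestamps, limit=24, window=timedelta(hours=24)):
    """True if more than `limit` events fall within any `window`-long span.

    `timestamps` are the non-intermittent CE times for ONE DIMM.
    """
    ordered = sorted(timestamps)
    start = 0
    for end, current in enumerate(ordered):
        # Drop events from the front that are too old for this window.
        while current - ordered[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False


# Hypothetical data: 25 CEs, one every 30 minutes, all within 12 hours.
base = datetime(2006, 6, 1, 0, 0, 0)
burst = [base + timedelta(minutes=30 * i) for i in range(25)]
print(exceeds_rule6(burst))
```

If the check is true for more than one DIMM, the policy calls for ruling out other causes of CEs before any DIMM is replaced.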
Identifying the correct DIMM size / part number

Variables that need to be known:
  DIMM size
  DIMM type (speed)
  DIMM quantity (some DIMMs are always replaced in pairs, e.g. V440)

Useful utilities to identify DIMM size:
  prtdiag -v (/usr/platform/sun4u/sbin/prtdiag -v)
  prtfru -x output (applies to newer machines)
  POST diagnostic output (when available)
  memconf utility (now able to be run against Explorer output)
  showfru ALOM command, displays all FRU info

Depending upon the machine platform, the prtdiag output may report only the total memory installed, the physical bank size, the logical bank size, or the actual DIMM size.
Each DIMM is shown individually
4 Banks of memory
3 Banks of 128 meg DIMMs, 1 Bank of 64 meg DIMMs
========================= Memory =========================
                                              Intrlv.  Intrlv.
Brd   Bank   MB    Status   Condition  Speed   Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 0     0    2048   Active      OK       60ns   4-way      A
 0     1    1024   Active      OK       60ns   8-way      B
 2     0    2048   Active      OK       60ns   4-way      A
 2     1    1024   Active      OK       60ns   8-way      B
 4     0    2048   Active      OK       60ns   4-way      A
 4     1    1024   Active      OK       60ns   8-way      B
 5     0    2048   Active      OK       60ns   4-way      A
 5     1    1024   Active      OK       60ns   8-way      B

Board 0 / Bank 0 (8 DIMMs per bank): 2048 / 8 = 256 meg DIMMs
Board 0 / Bank 1 (8 DIMMs per bank): 1024 / 8 = 128 meg DIMMs
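When prtdiag reports only the bank size, the DIMM size is the bank size divided by the number of DIMMs per bank, as in the two calculations above. A small sketch of that arithmetic (the function name is hypothetical, and the DIMMs-per-bank figure must come from the platform documentation):

```python
def dimm_size_mb(bank_mb: int, dimms_per_bank: int) -> int:
    """Size of each DIMM in a bank, assuming all DIMMs in it are equal."""
    if bank_mb % dimms_per_bank:
        raise ValueError("bank size is not an even multiple of the DIMM count")
    return bank_mb // dimms_per_bank


print(dimm_size_mb(2048, 8))  # Board 0 / Bank 0 -> 256
print(dimm_size_mb(1024, 8))  # Board 0 / Bank 1 -> 128
```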
memconf is a perl script that reports the size of each SIMM/DIMM memory module that is installed in a Sun system. It also reports the system type and any empty memory sockets. In verbose mode, it also reports:
* banner name, model, and CPU/system frequencies
* address range and bank numbers for each module

External URLs (for customers):
http://www.sunfreeware.com/
http://myweb.cableone.net/4schmidts/memconf.html

Usage: memconf [ -v | -D | -h ] [ explorer_dir ]
  -v            verbose mode
  -D            send results to memconf maintainer
  -h            print help
  explorer_dir  Sun Explorer output directory
Notice that the prtdiag output from this Ultra 10 shows only the TOTAL memory installed. It does not show how many DIMMs are installed or what size they are.
# prtdiag -v
System Configuration: Sun Microsystems sun4u Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
System clock frequency: 90 MHz
Memory size: 1024 Megabytes

========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      360     0.2     12     9.1

========================= IO Cards =========================
     Bus#  Freq
Brd  Type  MHz   Slot  Name                              Model
---  ----  ----  ----  --------------------------------  ----------------------
 0   PCI-1  33     1   ebus
 0   PCI-1  33     1   network-SUNW,hme
 0   PCI-1  33     2   SUNW,m64B                         ATY,GT-C
 0   PCI-1  33     3   ide-pci1095,646.1095.646.3

No failures found in System

========================= HW Revisions =========================
ASIC Revisions:
---------------
Cheerio: ebus Rev 1

System PROM revisions:
----------------------
OBP 3.31.0 2001/07/25 20:36 POST 3.1.0 2000/06/27 13:56
connole 167 =>./memconf
hostname: connole
Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
Memory Interleave Factor = 2-way
socket DIMM1 has a 256MB DIMM
socket DIMM2 has a 256MB DIMM
socket DIMM3 has a 256MB DIMM
socket DIMM4 has a 256MB DIMM
empty sockets: None
total memory = 1024MB (1GB)

(verbose mode)
connole 168 =>./memconf -v
memconf: V1.65 13-Feb-2006 http://www.4schmidts.com/unix.html
hostname: connole
banner: Sun Ultra 5/10 UPA/PCI (UltraSPARC-IIi 360MHz)
model: Ultra-5_10
Sun development name: Darwin/Otter (Ultra 5), Darwin/SeaLion (Ultra 10)
Solaris 9 4/04 s9s_u6wos_08a SPARC, 64-bit kernel, SunOS 5.9
1 UltraSPARC-IIi 360MHz cpu, system freq: 90MHz
CPU Units:
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      360     0.2     12     9.1
Memory Units:
Memory Interleave Factor = 2-way
socket DIMM1 has a 256MB DIMM (bank 0L, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)
socket DIMM2 has a 256MB DIMM (bank 0H, address 0x00000000-0x0fffffff, 0x20000000-0x2fffffff)
socket DIMM3 has a 256MB DIMM (bank 1L, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)
socket DIMM4 has a 256MB DIMM (bank 1H, address 0x10000000-0x1fffffff, 0x30000000-0x3fffffff)
empty sockets: None
total memory = 1024MB (1GB)
420R prtdiag

System Configuration: Sun Microsystems sun4u Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)
System clock frequency: 113 MHz

========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      450     4.0   US-II   10.0
 0     1     1      450     4.0   US-II   10.0
 0     2     2      450     4.0   US-II   10.0
 0     3     3      450     4.0   US-II   10.0

========================= IO Cards =========================
     Bus   Freq
Brd  Type  MHz   Slot  Name                               Model
---  ----  ----  ----  ---------------------------------  ----------------------
 0   PCI    33     0   SUNW,qfe-pci108e,1001              SUNW,pci-qfe
 0   PCI    33     1   network-SUNW,hme
 0   PCI    33     1   SUNW,qfe-pci108e,1001              SUNW,pci-qfe
 0   PCI    33     2   fibre-channel-pci10df,f800.10df.+
 0   PCI    33     2   SUNW,qfe-pci108e,1001              SUNW,pci-qfe
 0   PCI    33     3   scsi-glm/disk (block)              Symbios,53C875
 0   PCI    33     3   scsi-glm/disk (block)              Symbios,53C875
 0   PCI    33     3   SUNW,qfe-pci108e,1001              SUNW,pci-qfe
 0   PCI    33     4   fibre-channel-pci10df,f800.10df.+

========================= HW Revisions =========================
ASIC Revisions:
---------------
PCI: pci Rev 4
PCI: pci Rev 4
Cheerio: ebus Rev 1

System PROM revisions:
----------------------
OBP 3.31.0 2001/07/25 20:35 POST 1.2.8 2000/08/22 19:50
(individual DIMM size reported)
connole 170 =>memconf /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04
hostname: njocsprd2
Sun Explorer directory: /home/mbuckley/Explorers/64834462_420R/explorer.80e8b7a9.njocsprd2-2005.12.05.17.04
Sun Enterprise 420R (4 X UltraSPARC-II 450MHz)
socket U0301 has a 256MB DIMM
socket U0302 has a 256MB DIMM
socket U1301 has a 256MB DIMM
socket U1302 has a 256MB DIMM
socket U0401 has a 256MB DIMM
socket U0402 has a 256MB DIMM
socket U1401 has a 256MB DIMM
socket U1402 has a 256MB DIMM
socket U0303 has a 256MB DIMM
socket U0304 has a 256MB DIMM
socket U1303 has a 256MB DIMM
socket U1304 has a 256MB DIMM
socket U0403 has a 256MB DIMM
socket U0404 has a 256MB DIMM
socket U1403 has a 256MB DIMM
socket U1404 has a 256MB DIMM
empty sockets: None
total memory = 4096MB (4GB)
WARNING: Layout of memory sockets not completely recognized on this system.
The memory configuration displayed should be correct though since this is a fully stuffed system.
This is a known bug due to Sun's 'prtconf', 'prtdiag' and 'prtfru' commands not providing enough detail for the memory layout of this SunOS 5.8 SUNW,Ultra-80 system to be accurately determined.
This is a bug in Sun's OBP, not a bug in memconf. The latest release (OBP 3.33.0 2003/10/07) still has this bug.
This system is using OBP 3.31.0 2001/07/25 20:35
256MB + 256MB,   SC#0   (512 meg DIMM)
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
256MB + 256MB,   SC#0
SubTool output:
Part#: 501-5030
Desc: FRU,ASSY,SDRAM,DIMM,512MB
Category: Boards
Is a FRU but has no substitutable parts.
prtfru -x output from V880:
<Location name="dimm-slot?Label=J8001">
  <Container name="dimm-module">
    <ContainerData>
      <Segment name="SD">
        <ManR>
          <UNIX_Timestamp32 value="Mon Mar 3 19:39:25 MST 2003"/>
          <Fru_Description value="256 MB NG SDRAM DIMM"/>
          <Manufacture_Loc value="ONYANG,KOREA"/>
          <Sun_Part_No value="5015401"/>
          <Sun_Serial_No value="A4663A"/>
          <Vendor_Name value="Samsung"/>
          <Initial_HW_Dash_Level value="03"/>
          <Initial_HW_Rev_Level value="50"/>
          <Fru_Shortname value="DIMM"/>
        </ManR>
        <Fru_Type value="256 MB DIMM"/>
        <DIMM_R>
          <DIMM_Speed value="75"/>
          <DIMM_Size value="256"/>
        </DIMM_R>
      </Segment>
    </ContainerData>
  </Container> <!-- dimm-module -->
</Location> <!-- dimm-slot?Label=J8001 -->
SubTool output:
Part#: 501-5401
Desc: FRU,ASSY,SDRAM,DIMM,256MB,18X8MX16
Category: Boards
Is a FRU but has no substitutable parts.
Showfru is a command-line script that summarizes prtfru -x output, available online from: http://pts-appl-z1.holland/showfru.html

The Showfru script aims to provide a concise summary of FRU data from prtfru -x output. This allows quick identification of the FRUs installed; depending on the platform, additional information is available.

From the command line, showfru needs to be run on Solaris 10 FCS or later, where the XML perl modules are installed by default. If you are stuck on a Solaris 9 Sun Ray, use the online version at the URL above.

NOTE: Please link to the script rather than taking a private copy.
##################################################################
Latest version 0.74 /net/cores.uk/export/hotline/hotlocal/bin/showfru
Report bugs, RFEs or if you have questions email doug.baker@sun.com
Further info from http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru
##################################################################
Non RoHS example: http://gmpweb.uk/~db124859/showfru/v240_mixed_dimm_sizes.html
Further details and example outputs: http://pts-platform/twiki/bin/view/Tools/ToolPageShowfru
More on RoHS: http://sunsolve2.central.sun.com/handbook_internal/Systems/commondocs/RoHS_Communication.html#meaning
$ /net/cores.uk/export/hotline/hotlocal/bin/showfru prtfru_-x.out
################################################################################
FRU part and serial number info, use -v for install date and vendor
################################################################################
MB            MOTHERBOARD
PS0           PS
IFB           CHASSIS
PS1           PS
MB.P0.B0.D0   1 GB
MB.P0.B0.D1   1 GB
MB.P0.B1.D0   1 GB
MB.P0.B1.D1   1 GB
375-3346 RoHS H00ORF
################################################################################
SPD DIMM info - FRU, vendor name, vendor part and serial number
################################################################################
MB.P0.B0.D0   Infineon (formerly Siemens)   72D128320GBR6C   0403E910
MB.P0.B0.D1   Infineon (formerly Siemens)   72D128320GBR6C   0403EA10
MB.P0.B1.D0   Infineon (formerly Siemens)   72D128320GBR6C   0403EA12
MB.P0.B1.D1   Infineon (formerly Siemens)   72D128320GBR6C   0409FD27
sc> showfru
FRU_PROM at PS0.SEEPROM
  Manufacturer Record
  Timestamp: TUE JUL 01 19:53:52 UTC 2003
  Description: P/S,SSI MPS,680W,HOT PLUG
  Manufacture Location: DELTA ELECTRONICS THAILAND
  Sun Part No: 3001501
  Sun Serial No: T00541
  Vendor: Delta Electronics
  Initial HW Dash Level: 06
  Initial HW Rev Level: 50
  Shortname: A42_PSU
FRU_PROM at C0.P0.B0.D0.SEEPROM
  Timestamp: MON JUN 02 12:00:00 UTC 2003
  Description: SDRAM DDR, 512 MB
  Manufacture Location:
  Vendor: Samsung
  Vendor Part No: M3 12L6420DT0-CA2
FRU_PROM at C0.P0.B0.D1.SEEPROM
  Timestamp: MON JUN 02 12:00:00 UTC 2003
  Description: SDRAM DDR, 512 MB
  Manufacture Location:
  Vendor: Samsung
  Vendor Part No: M3 12L6420DT0-CA2
The Findaft script aims to provide a concise summary of AFT, CPU and PCI ECC errors found in the Solaris Operating System /var/adm/messages files. This summary can then be used to assist in diagnosing a customer's hardware fault.

Note: Findaft is Sun Internal only and cannot be sent to customers.

Provides a concise summary of all CPU/Memory/PCI/ECC errors found in the messages. (Makes an ideal case note or start point for an SGR template.)
Assists with identification of memory UE errors.
Features highlighting of E-Cache events.
Directs TSEs towards Best Practices, including when to do "nothing".
Features highlighting of Datapath faults.
Helps to identify the true root cause of errors.
Helps to prevent misdiagnosis which could result in "wrong" parts being replaced.
Findaft is a standalone perl script. The latest version is runnable from here:
/net/cores.uk/export/hotline/hotlocal/bin/findaft (always use the latest available versions of tools)
Or downloadable from here:
http://gmpweb.uk/~db124859/findaft/

Findaft is always a good starting step for troubleshooting and diagnosing memory issues. Read the docs that findaft suggests; these will usually assist diagnosis.

Reference: Infodoc 80270: "Findaft an AFT, CPU, Memory and PCI ECC error message summary script"
http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft
An alias is available to provide tool support: findaft-interest@sun.com
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 862595 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7f9abdc2
May 28 22:43:55 cht1ds004     AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4
May 28 22:43:55 cht1ds004     UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 339206 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 368593 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 748639 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa13b30
May 28 22:43:55 cht1ds004     AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2ac
May 28 22:43:55 cht1ds004     UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 233778 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 315879 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 712106 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa59597
May 28 22:43:55 cht1ds004     AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2b4
May 28 22:43:55 cht1ds004     UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 957346 kern.info] [AFT0] errID 0x0015bd43.7fa59597 Corrected Memory Error on U0302 is Intermittent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 356654 kern.info] [AFT0] errID 0x0015bd43.7fa59597 ECC Data Bit 31 was in error and corrected
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 585299 kern.info] [AFT0] Corrected Memory Error detected by CPU0, errID 0x0015bd43.7fa993cf
May 28 22:43:55 cht1ds004     AFSR 0x00000000.00100000<CE> AFAR 0x00000000.55955a30
May 28 22:43:55 cht1ds004     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0xff07b2a8
May 28 22:43:55 cht1ds004     UDBH Syndrome 0x58 Memory Module U0302
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 342081 kern.info] [AFT0] errID 0x0015bd43.7fa993cf Corrected Memory Error on U0302 is Persistent
May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: [ID 499012 kern.info] [AFT0] errID 0x0015bd43.7fa993cf ECC Data Bit 31 was in error and corrected
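Counting the "Corrected Memory Error on ... is ..." lines per DIMM and per CE type is the kind of summary that makes messages like those above readable. The sketch below is a minimal illustration of that tally using a regular expression; it is not the findaft script, and the function and variable names are hypothetical. The four sample lines are the classification lines from the messages above.

```python
import re
from collections import Counter

# Matches the classification line that Solaris logs for each CE.
CE_LINE = re.compile(
    r"Corrected Memory Error on (?P<dimm>.+?) is "
    r"(?P<kind>Intermittent|Persistent|Sticky)"
)


def tally_ce(lines):
    """Count CE classification lines per (DIMM, type)."""
    counts = Counter()
    for line in lines:
        match = CE_LINE.search(line)
        if match:
            counts[(match.group("dimm"), match.group("kind"))] += 1
    return counts


PREFIX = "May 28 22:43:55 cht1ds004 SUNW,UltraSPARC-II: "
sample = [
    PREFIX + "[ID 339206 kern.info] [AFT0] errID 0x0015bd43.7f9abdc2 "
             "Corrected Memory Error on U0302 is Intermittent",
    PREFIX + "[ID 233778 kern.info] [AFT0] errID 0x0015bd43.7fa13b30 "
             "Corrected Memory Error on U0302 is Intermittent",
    PREFIX + "[ID 957346 kern.info] [AFT0] errID 0x0015bd43.7fa59597 "
             "Corrected Memory Error on U0302 is Intermittent",
    PREFIX + "[ID 342081 kern.info] [AFT0] errID 0x0015bd43.7fa993cf "
             "Corrected Memory Error on U0302 is Persistent",
]

for (dimm, kind), count in sorted(tally_ce(sample).items()):
    print(count, dimm, "is", kind)
```

The "detected by CPUn" lines are deliberately not matched, so each CE is counted once.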
# /net/cores.uk/export/hotline/hotlocal/bin/findaft /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages
################################################################################
This script looks for Hardware errors including all AFT and pci ECC events
Written for 108528-16/112233-01 or above. Some tests may fail on other revisions
Report bugs,RFEs or if you have questions email findaft-interest@sun.com
Version 2.00 homepage http://pts-platform/twiki/bin/view/Tools/ToolPageFindaft
Or runnable from /net/cores.uk/export/hotline/hotlocal/bin/findaft
Infodoc 80270 Findaft an AFT CPU Memory and PCI ECC error message summary script
################################################################################
Input file /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages is 0.1 MB
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Syndrome errors CE and UE errors are included
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 10 Syndrome 0x58 Memory Module U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other AFT Events
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  5 [AFT0] Corrected Memory Error detected by CPU0,
  1 [AFT0] Corrected Memory Error detected by CPU1,
  2 [AFT0] Corrected Memory Error detected by CPU2,
  2 [AFT0] Corrected Memory Error detected by CPU3,
  3 [AFT0] Sticky Softerror encountered on Memory Module U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Main Memory Correctable ECC events sorted by date
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  3 May 28 U0302 is Intermittent
  1 May 28 U0302 is Persistent
  3 May 30 U0302 is Sticky
  2 May 31 U0302 is Persistent
  1 Jun 01 U0302 is Persistent

(continued)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Panics, Reboots, Fatal errors etc
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
  1 Jun 01 cht1ds004 SunOS Release 5.8 Version Generic_117000-03 64-bit
################################################################################
Correctable memory errors found, use cediag to determine if a DIMM needs
to be replaced, see Infodoc 83216 for examples of the cediag rule failure messages
Infodoc 79928: Sun Enhanced Memory DIMM Replacement Policy
################################################################################
cediag -e explorer_directory/
cediag -c SunOS,cht1ds004,5.8,sparc -k 117000-03 -u 2 /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/messages/messages
################################################################################
Start of Ultrasparc II CE specific checks
Unique Simms total 1
################################################################################
U0302
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Unique Syndromes total 1
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
CE Event Syndrome 0x58 Data Bit 31
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
USII CE Event type reported by each CPU
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Reporting CPU   Intermittent   Persistent   Sticky
CPU0                 3              2          0
CPU1                 0              0          1     <<
CPU2                 0              1          1     <<
CPU3                 0              1          1     <<
CEDIAG is a memory error analysis tool, comprised of shell scripts and a few binary executables. It currently runs on Solaris SPARC architectures only.

Reference: http://onestop/qco/dimm/tools/cediag.shtml
Internally runnable from: /net/cores.uk/export/hotline/hotlocal/bin/cediag

Usage:
# cediag -e unpacked_explorer_dir

May also be run in verbose mode to gather additional information (such as total number of memory pages retired).
Syntax example:
# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /explorer-directory

Customers may download from: http://sunsolve.sun.com (Diagnostic Tools)
Memory DIMM Replacement Management Tool (Download, install cediag 1.2.1)
# /net/cores.uk/export/hotline/hotlocal/bin/cediag -v -e /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/ cediag: Revision: 1.78 @ 2005/02/11 15:54:29 UTC cediag: info: cediag directory: /net/cores.uk/export/hotline/hotlocal/bin cediag: info: Explorer directory: /home/mbuckley/Explorers/65041327_420R/explorer.80e93f26.cht1ds004-2006.06.02.05.22/ cediag: info: UltraSPARC Version: 2 (2) cediag: info: OS Type: SunOS cediag: info: OS Version: 5.8 cediag: info: Hostname: cht1ds004 cediag: info: Memory size: 524032 (8KB pages) cediag: info: MPR (deduced) PRL pages: 497 (8KB pages) cediag: info: MPR-capable OS: true cediag: info: KJP: 117000-03 cediag: info: MPR-aware kernel in-use: true cediag: info: MPR enabled: true cediag: info: MPR disabled in /etc/system: false cediag: info: MPR force mode: n/a cediag: info: MPR state: active cediag: info: Rule#3 check: true cediag: info: Rule#4 check: true cediag: info: Rule#5 check: true cediag: info: Rule#5 check via cestat: false cediag: info: Rule#6 check: false cediag: #### CE Summary prior to reboot at Jun 1 14:55:33 ################### cediag: info: DIMM U0302 had 10 CE(s) cediag: info: DIMM U0302 had 7 non-intermittent CE(s) cediag: info: DIMM U0302 @ Data Bit 31 had 10 CE(s) cediag: info: DIMM U0302 @ Data Bit 31 @ AFAR%64=48 had 10 CE(s) across 1 AFARs
cediag: info: messages files: 1 pages scheduled for retirement
cediag: info: messages files: 1 pages successfully retired
cediag: info: messages files: 0 pages scheduled for clearing
cediag: info: messages files: 0 pages successfully cleared
cediag: info: PRL deduced status: PRL reached = false
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
cediag: #### CE Summary since last detected reboot ###########################
cediag: #### last detected reboot at Jun 1 14:55:33 #########################
cediag: info: messages files: 0 pages scheduled for retirement
cediag: info: messages files: 0 pages successfully retired
cediag: info: messages files: 0 pages scheduled for clearing
cediag: info: messages files: 0 pages successfully cleared
cediag: info: PRL deduced status: PRL reached = false
cediag: findings: 0 datapath fault message(s) found
cediag: findings: 0 UE(s) found - there is no rule#3 match
cediag: findings: 0 DIMMs with a failure pattern matching rule#4
cediag: findings: 0 DIMMs with a failure pattern matching rule#5
#
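The per-DIMM summary lines in the cediag output above follow a regular pattern, so they can be tallied with a short script when reviewing many Explorers. The following is a minimal Python sketch and is not part of cediag; the function name ce_counts_by_dimm and the regular expression are assumptions inferred only from the sample lines shown above.

```python
import re
from collections import defaultdict

# Matches total-count lines such as "cediag: info: DIMM U0302 had 10 CE(s)".
# The pattern is inferred from the sample output, not from cediag itself.
DIMM_TOTAL = re.compile(r"cediag: info: DIMM (\S+) had (\d+) CE\(s\)")


def ce_counts_by_dimm(lines):
    """Return {dimm_label: total CE count} from cediag output lines.

    Only the plain per-DIMM total lines are counted; the per-data-bit and
    non-intermittent breakdown lines do not match the pattern.
    """
    counts = defaultdict(int)
    for line in lines:
        match = DIMM_TOTAL.search(line)
        if match:
            counts[match.group(1)] += int(match.group(2))
    return dict(counts)


sample = [
    "cediag: info: DIMM U0302 had 10 CE(s)",
    "cediag: info: DIMM U0302 had 7 non-intermittent CE(s)",
    "cediag: info: DIMM U0302 @ Data Bit 31 had 10 CE(s)",
]
print(ce_counts_by_dimm(sample))
```

Note that this sketch sums totals across both summary sections (before and since the last reboot) if the whole output is passed in; split the input at the "CE Summary" banner lines if the sections need to be kept apart.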
Example cediag messages for the more complex faults.

Uncorrectable errors (UEs) are often seen as a result of single DIMM Rule 4 failures.
cediag: findings: 1 UE(s) found - potential rule#3 match
cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible

Datapath fault: see Infodocs 70134 and 80288 for diagnosis of bad writers and datapath faults from Solaris messages.
cediag: findings: 4 datapath fault message(s) found
cediag: findings: 8 DIMM(s) having CEs with Esynd of 0x0010 found
cediag: advice:HIGH: possible datapath fault - refer to Sun Support ASAP

Whenever more than one DIMM fails rules 4, 5 or 6 you will get the following message. Make sure you really do have multiple failures before replacing any DIMMs.
cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs
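When scanning long cediag output, it can help to pull out just the advice lines and their severity. The following is a minimal Python sketch, not part of cediag; the function name advice_lines is hypothetical, and the pattern assumes only the "cediag: advice:SEVERITY: text" layout seen in the example messages above.

```python
import re

# Layout inferred from the example messages: "cediag: advice:HIGH: <text>".
# The severity is captured as any run of capital letters, since only HIGH
# and MEDIUM appear in the examples and other levels are not confirmed.
ADVICE = re.compile(r"cediag: advice:([A-Z]+):\s*(.*)")


def advice_lines(lines):
    """Return a list of (severity, text) tuples for cediag advice lines."""
    found = []
    for line in lines:
        match = ADVICE.search(line)
        if match:
            found.append((match.group(1), match.group(2).strip()))
    return found


sample = [
    "cediag: findings: 1 UE(s) found - potential rule#3 match",
    "cediag: advice:HIGH: refer UE(s) to Sun Support [A]s [S]oon [A]s [P]ossible",
    "cediag: advice:MEDIUM: consult Sun Support to rule out other causes of CEs before replacing any DIMMs",
]
for severity, text in advice_lines(sample):
    print(severity, "-", text)
```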
FIND_UE Utility
Used to identify those UE errors where a single DIMM from a memory bank can be reliably identified as the cause of the fault, or at least to narrow down the number of suspect DIMMs. (Enhanced algorithms are being implemented to reduce the number of suspect components.)
Also identifies those UEs which are likely to have been caused by FCO A0258.
Field Change Order A0258-1: Mitsubishi 256MB DIMMs (Sun p/n 501-5658) showing significantly lower than expected reliability.
Info available at: http://pts-platform/twiki/bin/view/Tools/ToolPageFindUE
An alias is available for tool support: findue-interest@sun.com
FindUE is a command-line syndrome decoder.
Usage: /net/cores.uk/export/hotline/hotlocal/bin/findUE messages
###################################################################
FindUE was written to assist in ECC syndrome history analysis,
the script will understand Solaris messages, console logs, msgbuf, showlogs and wfail outputs
Supported systems include USIII and USIV systems the E3000-6500s but not USIIIi.
Version 1.33 from /net/cores.uk/export/hotline/hotlocal/bin/findUE
Infodocs 75538 and 74624 have further details.
If you find bugs in the script email doug.baker@sun.com
for syndrome decode bugs benoit.baguette@sun.com
##################################################################
Infodoc 80346: Using the fin954 script to diagnose main memory versus L2SRAM errors
The fin954 script was written by Mike Arnott in 2003. The aim was to automate the diagnosis of main memory versus L2SRAM errors on UltraSPARC III systems, using the errors found in a Solaris messages file.
The fin954 script implements the rules described in FIN I0954-1. These rules apply to all USIII and USIV systems, including the VSP systems, SunBlade 1000/2000, 280R, Netra 20, 480/490 and the 880/890, but not to the USIIIi-based systems.
The latest version of fin954 is available from http://fde.aus/tools/fin954. It is a Perl script and needs to be run from the command line.
Further information is available from:
FIN I0760-2: Sun Enhanced Memory DIMM Replacement Policy
Infodoc 52427: L2SRAM/DIMM Misdiagnosis Issues
Infodoc 75538: Sun Fire[TM] Server: Using ECC Syndrome History to Troubleshoot Uncorrectable Errors (UE) in Memory
Example:
$ fin954 sample.4.messages.txt
============================================================================
Findings: from analysing sample.4.messages.txt
Total "Events" logged: 167
Total *significant* "Events" logged: 28
Total insignificant "Events" logged: 138
.
FRU "SB10/P1/B1 J14301 J14401 J14501 J14601" implicated as error source 22 times.
FRU "SB10/P1/B1/D0 J14301" implicated as error source 6 times.
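The FRU lines in the fin954 findings above can be ranked by how often each FRU was implicated. The following is a minimal Python sketch, not part of fin954 (which is a Perl script); the function name implicated_frus is hypothetical, and the pattern is an assumption based only on the two FRU lines in the example output.

```python
import re

# Pattern inferred from the example output:
#   FRU "<name>" implicated as error source <N> times.
FRU_LINE = re.compile(r'FRU "([^"]+)" implicated as error source (\d+) times')


def implicated_frus(text):
    """Return (fru_name, count) tuples sorted by count, highest first."""
    pairs = [(fru, int(count)) for fru, count in FRU_LINE.findall(text)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


sample = (
    'FRU "SB10/P1/B1/D0 J14301" implicated as error source 6 times.\n'
    'FRU "SB10/P1/B1 J14301 J14401 J14501 J14601" implicated as error source 22 times.\n'
)
for fru, count in implicated_frus(sample):
    print(count, fru)
```

In the example, the whole bank (SB10/P1/B1) is implicated more often than the single DIMM J14301, so a ranking like this is only a starting point for diagnosis and not a replacement decision.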
fin954 is a standalone script; the latest version is runnable from: /net/cores.uk/export/hotline/hotlocal/bin/fin954
Memory reference site: http://onestop/qco/dimm/
https://onestop.sfbay.sun.com/qco/dimm/index_dimm.shtml
Infodoc 70361: "Introduction to Solaris[TM] Operating System CE/UE/ECC/CBB/CBI/DBB/DBI Error Messages"
Infodoc 72846: "Event Messages for UltraSPARC-III[R], UltraSPARC-III+[R], UltraSPARC-IIIi[R], and UltraSPARC-IV[R] CPU Modules"
Infodoc 72775: "How to determine if a correctable error (CE) on a memory DIMM should result in replacement of FRU"
Infodoc 70134: "Diagnosis of bad writers and datapath faults from Solaris messages"
Infodoc 79928: "Sun Enhanced Memory DIMM Replacement Policy"
Infodoc 82264: "Memory DIMM Replacement Management Tool - cediag 1.2.1 FAQ"
FIN 100271 (formerly I0760-2): "Sun Enhanced Memory DIMM Replacement Policy"