EDAC - Error Detection And Correction

Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007     Updated
   8
   9
  10EDAC is maintained and written by:
  11
  12        Doug Thompson, Dave Jiang, Dave Peterson et al,
  13        original author: Thayne Harbaugh,
  14
  15Contact:
  16        website:        bluesmoke.sourceforge.net
  17        mailing list:   bluesmoke-devel@lists.sourceforge.net
  18
  19"bluesmoke" was the name for this device driver when it was "out-of-tree"
  20and maintained at sourceforge.net.  When it was pushed into 2.6.16 for the
  21first time, it was renamed to 'EDAC'.
  22
The bluesmoke project at sourceforge.net is now used as a 'staging area'
for EDAC development, before it is sent upstream to kernel.org.

The bluesmoke/EDAC project site hosts a series of quilt patches against
recent kernels, stored in an SVN repository. For easier downloading, there
is also a tarball snapshot available.
  29
  30============================================================================
  31EDAC PURPOSE
  32
The goal of the 'edac' kernel module is to detect and report errors that
occur within a computer system running under Linux.
  35
  36MEMORY
  37
  38In the initial release, memory Correctable Errors (CE) and Uncorrectable
  39Errors (UE) are the primary errors being harvested. These types of errors
  40are harvested by the 'edac_mc' class of device.
  41
  42Detecting CE events, then harvesting those events and reporting them,
  43CAN be a predictor of future UE events.  With CE events, the system can
  44continue to operate, but with less safety. Preventive maintenance and
  45proactive part replacement of memory DIMMs exhibiting CEs can reduce
  46the likelihood of the dreaded UE events and system 'panics'.
  47
  48NON-MEMORY
  49
  50A new feature for EDAC, the edac_device class of device, was added in
  51the 2.6.23 version of the kernel.
  52
This new device type allows non-memory ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
engines, fabric switches, main data path switches, interconnections,
and various other hardware data paths. If the hardware reports it, then
an edac_device device can probably be constructed to harvest and present
that to userspace.
  62
  63
  64PCI BUS SCANNING
  65
  66In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
  67in order to determine if errors are occurring on data transfers.
  68
The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

In the kernel there is a PCI device attribute located in sysfs that is
checked by the EDAC PCI scanning code. If that attribute is set,
PCI parity/error scanning is skipped for that device. The attribute
is:

        broken_parity_status

and is located in the /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
PCI devices.
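
As a minimal sketch, the attribute can be inspected from a shell. The
function name list_broken_parity is hypothetical, and the default directory
/sys/bus/pci/devices (which links to the same per-device directories) is an
assumption of this example rather than something EDAC requires:

```shell
# Print the address of every PCI device whose broken_parity_status is 1.
# The directory argument is optional; the default is an assumption.
list_broken_parity() (
    dir="${1:-/sys/bus/pci/devices}"
    for f in "$dir"/*/broken_parity_status; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        if [ "$(cat "$f")" = "1" ]; then
            basename "$(dirname "$f")"   # the device directory name
        fi
    done
)
```

Devices printed by this function would be skipped by the EDAC PCI scan.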
  85
  86FUTURE HARDWARE SCANNING
  87
In the future, the following error detection mechanisms will be
integrated with EDAC or added to it:
  90
  91        MCE     Machine Check Exception
  92        MCA     Machine Check Architecture
  93        NMI     NMI notification of ECC errors
  94        MSRs    Machine Specific Register error cases
  95        and other mechanisms.
  96
  97These errors are usually bus errors, ECC errors, thermal throttling
  98and the like.
  99
 100
 101============================================================================
 102EDAC VERSIONING
 103
EDAC is composed of a "core" module (edac_core.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver (or edac_device driver) have individual versions that reflect
the current release level of their respective modules.
 109
 110Thus, to "report" on what version a system is running, one must report both
 111the CORE's and the MC driver's versions.
 112
 113
 114LOADING
 115
 116If 'edac' was statically linked with the kernel then no loading is
 117necessary.  If 'edac' was built as modules then simply modprobe the
 118'edac' pieces that you need.  You should be able to modprobe
 119hardware-specific modules and have the dependencies load the necessary core
 120modules.
 121
 122Example:
 123
 124$> modprobe amd76x_edac
 125
loads both the amd76x_edac.ko memory controller module and the edac_core.ko
core module.
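
To confirm which EDAC pieces ended up loaded, the module list can be
filtered. This is only a sketch: the function name list_edac_modules is
hypothetical, and it assumes that every relevant module has "edac" in its
name, as amd76x_edac and edac_core do:

```shell
# Print the name of every loaded module whose name contains "edac".
# The first field of each /proc/modules line is the module name.
list_edac_modules() (
    modules_file="${1:-/proc/modules}"
    awk '$1 ~ /edac/ { print $1 }' "$modules_file"
)
```

If a listed module declares a version, it is normally also readable from
/sys/module/<name>/version.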
 128
 129
 130============================================================================
 131EDAC sysfs INTERFACE
 132
 133EDAC presents a 'sysfs' interface for control, reporting and attribute
 134reporting purposes.
 135
 136EDAC lives in the /sys/devices/system/edac directory.
 137
 138Within this directory there currently reside 2 'edac' components:
 139
 140        mc      memory controller(s) system
 141        pci     PCI control and status system
 142
 143
 144============================================================================
 145Memory Controller (mc) Model
 146
First, some background on the memory controller model abstracted in EDAC.
 148Each 'mc' device controls a set of DIMM memory modules. These modules are
 149laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
 150be multiple csrows and multiple channels.
 151
 152Memory controllers allow for several csrows, with 8 csrows being a typical value.
 153Yet, the actual number of csrows depends on the electrical "loading"
 154of a given motherboard, memory controller and DIMM characteristics.
 155
Dual channels allow for 128 bit data transfers to the CPU from memory.
Some newer chipsets, such as those for Fully Buffered DIMMs (FB-DIMMs),
allow for more than 2 channels. The following example will assume 2 channels:
 159
 160
 161                Channel 0       Channel 1
 162        ===================================
 163        csrow0  | DIMM_A0       | DIMM_B0 |
 164        csrow1  | DIMM_A0       | DIMM_B0 |
 165        ===================================
 166
 167        ===================================
 168        csrow2  | DIMM_A1       | DIMM_B1 |
 169        csrow3  | DIMM_A1       | DIMM_B1 |
 170        ===================================
 171
 172In the above example table there are 4 physical slots on the motherboard
 173for memory DIMMs:
 174
 175        DIMM_A0
 176        DIMM_B0
 177        DIMM_A1
 178        DIMM_B1
 179
 180Labels for these slots are usually silk screened on the motherboard. Slots
 181labeled 'A' are channel 0 in this example. Slots labeled 'B'
 182are channel 1. Notice that there are two csrows possible on a
 183physical DIMM. These csrows are allocated their csrow assignment
 184based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
 185is placed in each Channel, the csrows cross both DIMMs.
 186
 187Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
 188Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
 189will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
 190when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
 191csrow1 will be populated. The pattern repeats itself for csrow2 and
 192csrow3.
 193
 194The representation of the above is reflected in the directory tree
 195in EDAC's sysfs interface. Starting in directory
 196/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.
 198
 199
 200        ..../edac/mc/
 201                   |
 202                   |->mc0
 203                   |->mc1
 204                   |->mc2
 205                   ....
 206
Under each 'mcX' directory, each populated csrow is represented by a
'csrowX' directory, where 'X' is the csrow index:
 209
 210
 211        .../mc/mc0/
 212                |
 213                |->csrow0
 214                |->csrow2
 215                |->csrow3
 216                ....
 217
Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.
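
The populated csrows can be enumerated by walking the tree described above.
This is a minimal sketch; the function name list_csrows is hypothetical and
the optional argument exists only so that another root can be supplied:

```shell
# Print "mcX/csrowY" for every csrow directory under the EDAC mc root.
list_csrows() (
    mc_root="${1:-/sys/devices/system/edac/mc}"
    for row in "$mc_root"/mc*/csrow*; do
        [ -d "$row" ] || continue        # glob matched nothing
        echo "$(basename "$(dirname "$row")")/$(basename "$row")"
    done
)
```

A gap in the printed csrow indexes corresponds to an unpopulated rank.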
 223
 224
 225Within each of the 'mcX' and 'csrowX' directories are several
 226EDAC control and attribute files.
 227
 228============================================================================
 229'mcX' DIRECTORIES
 230
 231
In the 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controller:
 234
 235
 236Counter reset control file:
 237
 238        'reset_counters'
 239
 240        This write-only control file will zero all the statistical counters
 241        for UE and CE errors.  Zeroing the counters will also reset the timer
 242        indicating how long since the last counter zero.  This is useful
 243        for computing errors/time.  Since the counters are always reset at
 244        driver initialization time, no module/kernel parameter is available.
 245
        RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters
 247
 248                This resets the counters on memory controller 0
 249
 250
 251Seconds since last counter reset control file:
 252
 253        'seconds_since_reset'
 254
 255        This attribute file displays how many seconds have elapsed since the
 256        last counter reset. This can be used with the error counters to
 257        measure error rates.
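
An error rate can be derived from 'ce_count' and 'seconds_since_reset'.
The following is a sketch only: the function name ce_per_hour is
hypothetical, and it uses integer arithmetic, so fractional rates are
truncated:

```shell
# Print correctable errors per hour since the last counter reset.
ce_per_hour() (
    mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    ce=$(cat "$mc_dir/ce_count") || exit 1
    secs=$(cat "$mc_dir/seconds_since_reset") || exit 1
    # Avoid dividing by zero immediately after a reset.
    [ "$secs" -gt 0 ] || { echo 0; exit 0; }
    echo $(( ce * 3600 / secs ))
)
```

Writing to 'reset_counters' afterwards starts a new measurement interval.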
 258
 259
 260
 261Memory Controller name attribute file:
 262
 263        'mc_name'
 264
 265        This attribute file displays the type of memory controller
 266        that is being utilized.
 267
 268
 269Total memory managed by this memory controller attribute file:
 270
 271        'size_mb'
 272
        This attribute file displays the amount of memory, in megabytes,
        that this instance of memory controller manages.
 275
 276
 277Total Uncorrectable Errors count attribute file:
 278
 279        'ue_count'
 280
 281        This attribute file displays the total count of uncorrectable
 282        errors that have occurred on this memory controller. If panic_on_ue
 283        is set this counter will not have a chance to increment,
 284        since EDAC will panic the system.
 285
 286
Total UE count that had no information attribute file:

        'ue_noinfo_count'

        This attribute file displays the number of UEs that
        have occurred with no information as to which DIMM
        slot is having errors.
 294
 295
 296Total Correctable Errors count attribute file:
 297
 298        'ce_count'
 299
        This attribute file displays the total count of correctable
        errors that have occurred on this memory controller. This
        count is very important to examine. CEs provide early
        indications that a DIMM is beginning to fail. This count
        field should be monitored for non-zero values, and any such
        values should be reported to the system administrator.
 306
 307
Total CE count that had no information attribute file:

        'ce_noinfo_count'

        This attribute file displays the number of CEs that
        have occurred with no information as to which DIMM slot
        is having errors. Memory is handicapped, but operational,
        yet no information is available to indicate which slot
        the failing memory is in. This count field should also
        be monitored for non-zero values.
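
Monitoring for non-zero CE counts could be scripted along the following
lines, for example from a periodic job. The function name report_ce and the
output format are hypothetical; how the result reaches the administrator is
left open:

```shell
# Print one line for each memory controller that has seen correctable errors.
report_ce() (
    mc_root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$mc_root"/mc*; do
        [ -r "$mc/ce_count" ] || continue
        ce=$(cat "$mc/ce_count")
        # Treat a missing ce_noinfo_count file as zero.
        noinfo=$(cat "$mc/ce_noinfo_count" 2>/dev/null || echo 0)
        if [ "$ce" -gt 0 ] || [ "$noinfo" -gt 0 ]; then
            echo "$(basename "$mc"): ce_count=$ce ce_noinfo_count=$noinfo"
        fi
    done
)
```

The function prints nothing when all counters are zero, so any output can
be treated as a reason to notify someone.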
 318
 319Device Symlink:
 320
 321        'device'
 322
 323        Symlink to the memory controller device.
 324
 325Sdram memory scrubbing rate:
 326
 327        'sdram_scrub_rate'
 328
 329        Read/Write attribute file that controls memory scrubbing. The scrubbing
 330        rate is set by writing a minimum bandwidth in bytes/sec to the attribute
 331        file. The rate will be translated to an internal value that gives at
 332        least the specified rate.
 333
 334        Reading the file will return the actual scrubbing rate employed.
 335
 336        If configuration fails or memory scrubbing is not implemented, the value
 337        of the attribute file will be -1.
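
A sketch of setting the scrub rate and checking the result follows. The
function name set_scrub_rate is hypothetical; the requested rate is a
minimum bandwidth in bytes/sec, so the value read back may be higher than
the value written:

```shell
# Request a scrub rate and print the rate actually employed.
# Fails if the attribute reads back as -1 (not configured or unsupported).
set_scrub_rate() (
    rate="$1"
    mc_dir="${2:-/sys/devices/system/edac/mc/mc0}"
    printf '%s\n' "$rate" > "$mc_dir/sdram_scrub_rate" || exit 1
    actual=$(cat "$mc_dir/sdram_scrub_rate")
    if [ "$actual" = "-1" ]; then
        echo "memory scrubbing not configured or not implemented" >&2
        exit 1
    fi
    echo "$actual"
)
```

For example, "set_scrub_rate 1000000" would request at least 1000000
bytes/sec on memory controller 0.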
 338
 339
 340
 341============================================================================
 342'csrowX' DIRECTORIES
 343
In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:
 346
 347
 348Total Uncorrectable Errors count attribute file:
 349
 350        'ue_count'
 351
 352        This attribute file displays the total count of uncorrectable
 353        errors that have occurred on this csrow. If panic_on_ue is set
 354        this counter will not have a chance to increment, since EDAC
 355        will panic the system.
 356
 357
 358Total Correctable Errors count attribute file:
 359
 360        'ce_count'
 361
        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This
        count is very important to examine. CEs provide early
        indications that a DIMM is beginning to fail. This count
        field should be monitored for non-zero values, and any such
        values should be reported to the system administrator.
 368
 369
 370Total memory managed by this csrow attribute file:
 371
 372        'size_mb'
 373
        This attribute file displays the amount of memory, in megabytes,
        that this csrow contains.
 376
 377
 378Memory Type attribute file:
 379
 380        'mem_type'
 381
 382        This attribute file will display what type of memory is currently
 383        on this csrow. Normally, either buffered or unbuffered memory.
 384        Examples:
 385                Registered-DDR
 386                Unbuffered-DDR
 387
 388
 389EDAC Mode of operation attribute file:
 390
 391        'edac_mode'
 392
 393        This attribute file will display what type of Error detection
 394        and correction is being utilized.
 395
 396
 397Device type attribute file:
 398
 399        'dev_type'
 400
 401        This attribute file will display what type of DRAM device is
 402        being utilized on this DIMM.
 403        Examples:
 404                x1
 405                x2
 406                x4
 407                x8
 408
 409
 410Channel 0 CE Count attribute file:
 411
 412        'ch0_ce_count'
 413
 414        This attribute file will display the count of CEs on this
 415        DIMM located in channel 0.
 416
 417
 418Channel 0 UE Count attribute file:
 419
 420        'ch0_ue_count'
 421
 422        This attribute file will display the count of UEs on this
 423        DIMM located in channel 0.
 424
 425
 426Channel 0 DIMM Label control file:
 427
 428        'ch0_dimm_label'
 429
 430        This control file allows this DIMM to have a label assigned
 431        to it. With this label in the module, when errors occur
 432        the output can provide the DIMM label in the system log.
 433        This becomes vital for panic events to isolate the
 434        cause of the UE event.
 435
 436        DIMM Labels must be assigned after booting, with information
 437        that correctly identifies the physical slot with its
 438        silk screen label. This information is currently very
 439        motherboard specific and determination of this information
 440        must occur in userland at this time.
 441
 442
 443Channel 1 CE Count attribute file:
 444
 445        'ch1_ce_count'
 446
 447        This attribute file will display the count of CEs on this
 448        DIMM located in channel 1.
 449
 450
 451Channel 1 UE Count attribute file:
 452
 453        'ch1_ue_count'
 454
 455        This attribute file will display the count of UEs on this
        DIMM located in channel 1.
 457
 458
 459Channel 1 DIMM Label control file:
 460
 461        'ch1_dimm_label'
 462
 463        This control file allows this DIMM to have a label assigned
 464        to it. With this label in the module, when errors occur
 465        the output can provide the DIMM label in the system log.
 466        This becomes vital for panic events to isolate the
 467        cause of the UE event.
 468
 469        DIMM Labels must be assigned after booting, with information
 470        that correctly identifies the physical slot with its
 471        silk screen label. This information is currently very
 472        motherboard specific and determination of this information
 473        must occur in userland at this time.
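
Assigning labels after boot could look like the following sketch. The
function name label_dimm is hypothetical, and the mapping from csrow and
channel to silk screen label must come from knowledge of the particular
motherboard:

```shell
# Write a DIMM label for one csrow/channel pair.
# Usage: label_dimm <csrow index> <channel index> <label> [mc directory]
label_dimm() (
    csrow="$1"
    channel="$2"
    label="$3"
    mc_dir="${4:-/sys/devices/system/edac/mc/mc0}"
    printf '%s\n' "$label" > "$mc_dir/csrow$csrow/ch${channel}_dimm_label"
)
```

For the two-channel layout shown earlier, "label_dimm 0 0 DIMM_A0" and
"label_dimm 0 1 DIMM_B0" would label both halves of csrow0.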
 474
 475============================================================================
 476SYSTEM LOGGING
 477
If logging for UEs and CEs is enabled, then system logs will contain
error notices for the errors that have been detected:
 480
 481EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
 482channel 1 "DIMM_B1": amd76x_edac
 483
 484EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
 485channel 1 "DIMM_B1": amd76x_edac
 486
 487
The structure of the message is:
        the memory controller                   (MC0)
        Error type                              (CE)
        memory page                             (0x283)
        offset in the page                      (0xce0)
        the byte granularity                    (grain 8)
                or resolution of the error
        the error syndrome                      (0x6ec3)
        memory row                              (row 0)
        memory channel                          (channel 1)
        DIMM label, if set prior                ("DIMM_B1")
        and then an optional, driver-specific message that may
                have additional information.
 501
 502Both UEs and CEs with no info will lack all but memory controller,
 503error type, a notice of "no info" and then an optional,
 504driver-specific error message.
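
The fields of such a message can be pulled apart with sed. This is a sketch
under two assumptions: each message arrives on a single log line (the
examples above are wrapped only for display), and the field layout matches
those examples. The function name parse_edac_msg is hypothetical:

```shell
# Read log lines on stdin and print the fields of each EDAC CE/UE message
# as key=value pairs. Lines that do not match (including "no info"
# messages) produce no output.
parse_edac_msg() {
    sed -n 's/.*EDAC \(MC[0-9]*\): \([CU]E\) page \(0x[0-9a-fA-F]*\), offset \(0x[0-9a-fA-F]*\), grain \([0-9]*\), syndrome \(0x[0-9a-fA-F]*\), row \([0-9]*\), channel \([0-9]*\).*/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8/p'
}
```

The output of a command such as dmesg could be piped into the function to
summarize logged errors.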
 505
 506
 507============================================================================
 508PCI Bus Parity Detection
 509
 510
On Header Type 00 devices, the primary status register is examined
for any parity error, regardless of whether parity is enabled on the
device.  (The spec indicates parity is generated in some cases.)
On Header Type 01 bridges, the secondary status register is also
examined to see if a parity error occurred on the bus on the other side of
the bridge.
 517
 518
 519SYSFS CONFIGURATION
 520
 521Under /sys/devices/system/edac/pci are control and attribute files as follows:
 522
 523
 524Enable/Disable PCI Parity checking control file:
 525
 526        'check_pci_parity'
 527
 528
 529        This control file enables or disables the PCI Bus Parity scanning
 530        operation. Writing a 1 to this file enables the scanning. Writing
 531        a 0 to this file disables the scanning.
 532
 533        Enable:
 534        echo "1" >/sys/devices/system/edac/pci/check_pci_parity
 535
 536        Disable:
 537        echo "0" >/sys/devices/system/edac/pci/check_pci_parity
 538
 539
 540Parity Count:
 541
 542        'pci_parity_count'
 543
 544        This attribute file will display the number of parity errors that
 545        have been detected.
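
The two files can be read together to summarize the PCI scanning state.
This is a minimal sketch; the function name pci_parity_report and the
wording of its output are hypothetical:

```shell
# Print whether PCI parity scanning is enabled and how many errors were seen.
pci_parity_report() (
    pci_dir="${1:-/sys/devices/system/edac/pci}"
    enabled=$(cat "$pci_dir/check_pci_parity") || exit 1
    count=$(cat "$pci_dir/pci_parity_count") || exit 1
    if [ "$enabled" = "1" ]; then
        state=enabled
    else
        state=disabled
    fi
    echo "PCI parity scanning $state, $count parity error(s) detected"
)
```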
 546
 547
 548============================================================================
 549MODULE PARAMETERS
 550
 551Panic on UE control file:
 552
 553        'edac_mc_panic_on_ue'
 554
 555        An uncorrectable error will cause a machine panic.  This is usually
 556        desirable.  It is a bad idea to continue when an uncorrectable error
 557        occurs - it is indeterminate what was uncorrected and the operating
 558        system context might be so mangled that continuing will lead to further
 559        corruption. If the kernel has MCE configured, then EDAC will never
 560        notice the UE.
 561
 562        LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]
 563
 564        RUN TIME:  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
 565
 566
 567Log UE control file:
 568
 569        'edac_mc_log_ue'
 570
 571        Generate kernel messages describing uncorrectable errors.  These errors
 572        are reported through the system message log system.  UE statistics
 573        will be accumulated even when UE logging is disabled.
 574
 575        LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]
 576
 577        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
 578
 579
 580Log CE control file:
 581
 582        'edac_mc_log_ce'
 583
 584        Generate kernel messages describing correctable errors.  These
 585        errors are reported through the system message log system.
 586        CE statistics will be accumulated even when CE logging is disabled.
 587
 588        LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]
 589
 590        RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
 591
 592
 593Polling period control file:
 594
 595        'edac_mc_poll_msec'
 596
        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources.  Too large a value might delay
        necessary handling of errors and might lose valuable information for
        locating the error.  1000 milliseconds (once each second) is the current
        default. Systems which require all the bandwidth they can get may
        increase this.

        LOAD TIME: module/kernel parameter: edac_mc_poll_msec=<milliseconds>
 605
 606        RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
 607
 608
 609Panic on PCI PARITY Error:
 610
 611        'panic_on_pci_parity'
 612
 613
        This control file enables or disables panicking when a parity
        error has been detected.
 616
 617
 618        module/kernel parameter: edac_panic_on_pci_pe=[0|1]
 619
 620        Enable:
 621        echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
 622
 623        Disable:
 624        echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
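
The current run-time values of all of these parameters can be listed in one
pass. This is a sketch; the function name show_edac_params is hypothetical,
and it simply prints whatever files exist in the parameters directory:

```shell
# Print "name=value" for every readable parameter of the EDAC core module.
show_edac_params() (
    param_dir="${1:-/sys/module/edac_core/parameters}"
    for p in "$param_dir"/*; do
        [ -r "$p" ] || continue          # glob matched nothing, or unreadable
        echo "$(basename "$p")=$(cat "$p")"
    done
)
```

Capturing this output before and after a change is a simple way to confirm
that a run-time write took effect.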
 625
 626
 627
 628=======================================================================
 629
 630
 631EDAC_DEVICE type of device
 632
 633In the header file, edac_core.h, there is a series of edac_device structures
 634and APIs for the EDAC_DEVICE.
 635
 636User space access to an edac_device is through the sysfs interface.
 637
 638At the location /sys/devices/system/edac (sysfs) new edac_device devices will
 639appear.
 640
There is a three level tree beneath the above 'edac' directory. For example,
the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
installs itself as:

        /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more 'instance'
directories.
 649
 650The standard default controls are:
 651
 652        log_ce          boolean to log CE events
 653        log_ue          boolean to log UE events
 654        panic_on_ue     boolean to 'panic' the system if an UE is encountered
 655                        (default off, can be set true via startup script)
 656        poll_msec       time period between POLL cycles for events
 657
The test_device_edac device adds at least one custom control of its own:
 659
 660        test_bits       which in the current test driver does nothing but
 661                        show how it is installed. A ported driver can
 662                        add one or more such controls and/or attributes
 663                        for specific uses.
 664                        One out-of-tree driver uses controls here to allow
 665                        for ERROR INJECTION operations to hardware
 666                        injection registers
 667
 668The symlink points to the 'struct dev' that is registered for this edac_device.
 669
 670INSTANCES
 671
 672One or more instance directories are present. For the 'test_device_edac' case:
 673
 674        test-instance0
 675
 676
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
 679
 680        ce_count        total of CE events of subdirectories
 681        ue_count        total of UE events of subdirectories
 682
 683BLOCKS
 684
 685At the lowest directory level is the 'block' directory. There can be 0, 1
 686or more blocks specified in each instance.
 687
 688        test-block0
 689
 690
 691In this directory the default attributes are:
 692
 693        ce_count        which is counter of CE events for this 'block'
 694                        of hardware being monitored
 695        ue_count        which is counter of UE events for this 'block'
 696                        of hardware being monitored
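
Because the instance counters are totals of the block counters, the
relationship can be checked by summing the blocks. This is a sketch; the
function name sum_block_counts is hypothetical, and it assumes that every
subdirectory of the instance directory holding the named counter is a block:

```shell
# Sum one counter (ce_count by default) over all blocks of an instance.
# Usage: sum_block_counts <instance directory> [counter name]
sum_block_counts() (
    instance_dir="$1"
    counter="${2:-ce_count}"
    total=0
    for f in "$instance_dir"/*/"$counter"; do
        [ -r "$f" ] || continue          # glob matched nothing
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
)
```

The printed sum can then be compared with the counter of the same name in
the instance directory itself.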
 697
 698
 699The 'test_device_edac' device adds 4 attributes and 1 control:
 700
 701        test-block-bits-0       for every POLL cycle this counter
 702                                is incremented
 703        test-block-bits-1       every 10 cycles, this counter is bumped once,
 704                                and test-block-bits-0 is set to 0
 705        test-block-bits-2       every 100 cycles, this counter is bumped once,
 706                                and test-block-bits-1 is set to 0
 707        test-block-bits-3       every 1000 cycles, this counter is bumped once,
 708                                and test-block-bits-2 is set to 0
 709
 710
 711        reset-counters          writing ANY thing to this control will
 712                                reset all the above counters.
 713
 714
Use of the 'test_device_edac' driver should help others to create their own
unique drivers for their hardware systems.
 717
 718The 'test_device_edac' sample driver is located at the
 719bluesmoke.sourceforge.net project site for EDAC.
 720
 721