linux/Documentation/mtd/nand_ecc.txt
<<
>>
Prefs
   1Introduction
   2============
   3
   4Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
   5I felt there was room for optimisation. I bashed the code for a few hours
   6performing tricks like table lookup removing superfluous code etc.
   7After that the speed was increased by 35-40%.
   8Still I was not too happy as I felt there was additional room for improvement.
   9
  10Bad! I was hooked.
  11I decided to annotate my steps in this file. Perhaps it is useful to someone
  12or someone learns something from it.
  13
  14
  15The problem
  16===========
  17
  18NAND flash (at least SLC one) typically has sectors of 256 bytes.
  19However NAND flash is not extremely reliable so some error detection
  20(and sometimes correction) is needed.
  21
  22This is done by means of a Hamming code. I'll try to explain it in
  23laymans terms (and apologies to all the pro's in the field in case I do
  24not use the right terminology, my coding theory class was almost 30
  25years ago, and I must admit it was not one of my favourites).
  26
  27As I said before the ecc calculation is performed on sectors of 256
  28bytes. This is done by calculating several parity bits over the rows and
  29columns. The parity used is even parity which means that the parity bit = 1
  30if the data over which the parity is calculated is 1 and the parity bit = 0
  31if the data over which the parity is calculated is 0. So the total
  32number of bits over the data over which the parity is calculated + the
  33parity bit is even. (see wikipedia if you can't follow this).
  34Parity is often calculated by means of an exclusive or operation,
  35sometimes also referred to as xor. In C the operator for xor is ^
  36
  37Back to ecc.
  38Let's give a small figure:
  39
  40byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
  41byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
  42byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
  43byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
  44byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
  45....
  46byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
  47byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
  48           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
  49           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
  50           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
  51
  52This figure represents a sector of 256 bytes.
  53cp is my abbreviation for column parity, rp for row parity.
  54
  55Let's start to explain column parity.
  56cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
  57so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
  58Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
  59cp2 is the parity over bit0, bit1, bit4 and bit5
  60cp3 is the parity over bit2, bit3, bit6 and bit7.
  61cp4 is the parity over bit0, bit1, bit2 and bit3.
  62cp5 is the parity over bit4, bit5, bit6 and bit7.
  63Note that each of cp0 .. cp5 is exactly one bit.
  64
  65Row parity actually works almost the same.
  66rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
  67rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
  68rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
  69(so handle two bytes, then skip 2 bytes).
  70rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
  71for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
  72so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
  73and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
  74The story now becomes quite boring. I guess you get the idea.
  75rp6 covers 8 bytes then skips 8 etc
  76rp7 skips 8 bytes then covers 8 etc
  77rp8 covers 16 bytes then skips 16 etc
  78rp9 skips 16 bytes then covers 16 etc
  79rp10 covers 32 bytes then skips 32 etc
  80rp11 skips 32 bytes then covers 32 etc
  81rp12 covers 64 bytes then skips 64 etc
  82rp13 skips 64 bytes then covers 64 etc
  83rp14 covers 128 bytes then skips 128
  84rp15 skips 128 bytes then covers 128
  85
  86In the end the parity bits are grouped together in three bytes as
  87follows:
  88ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
  89ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
  90ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
  91ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
  92
  93I detected after writing this that ST application note AN1823
  94(http://www.st.com/stonline/) gives a much
  95nicer picture.(but they use line parity as term where I use row parity)
  96Oh well, I'm graphically challenged, so suffer with me for a moment :-)
  97And I could not reuse the ST picture anyway for copyright reasons.
  98
  99
 100Attempt 0
 101=========
 102
 103Implementing the parity calculation is pretty simple.
 104In C pseudocode:
 105for (i = 0; i < 256; i++)
 106{
 107    if (i & 0x01)
 108       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
 109    else
 110       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
 111    if (i & 0x02)
 112       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
 113    else
 114       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
 115    if (i & 0x04)
 116      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
 117    else
 118      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
 119    if (i & 0x08)
 120      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
 121    else
 122      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
 123    if (i & 0x10)
 124      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
 125    else
 126      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
 127    if (i & 0x20)
 128      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
 129    else
 130    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
 131    if (i & 0x40)
 132      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
 133    else
 134      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
 135    if (i & 0x80)
 136      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
 137    else
 138      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
 139    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
 140    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
 141    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
 142    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
 143    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
 144    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
 145}
 146
 147
 148Analysis 0
 149==========
 150
 151C does have bitwise operators but not really operators to do the above
 152efficiently (and most hardware has no such instructions either).
 153Therefore without implementing this it was clear that the code above was
 154not going to bring me a Nobel prize :-)
 155
 156Fortunately the exclusive or operation is commutative, so we can combine
 157the values in any order. So instead of calculating all the bits
 158individually, let us try to rearrange things.
 159For the column parity this is easy. We can just xor the bytes and in the
 160end filter out the relevant bits. This is pretty nice as it will bring
 161all cp calculation out of the if loop.
 162
 163Similarly we can first xor the bytes for the various rows.
 164This leads to:
 165
 166
 167Attempt 1
 168=========
 169
 170const char parity[256] = {
 171    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 172    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 173    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 174    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 175    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 176    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 177    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 178    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 179    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 180    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 181    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 182    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 183    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
 184    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 185    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
 186    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
 187};
 188
 189void ecc1(const unsigned char *buf, unsigned char *code)
 190{
 191    int i;
 192    const unsigned char *bp = buf;
 193    unsigned char cur;
 194    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
 195    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
 196    unsigned char par;
 197
 198    par = 0;
 199    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
 200    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
 201    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
 202    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
 203
 204    for (i = 0; i < 256; i++)
 205    {
 206        cur = *bp++;
 207        par ^= cur;
 208        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
 209        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
 210        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
 211        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
 212        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
 213        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
 214        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
 215        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
 216    }
 217    code[0] =
 218        (parity[rp7] << 7) |
 219        (parity[rp6] << 6) |
 220        (parity[rp5] << 5) |
 221        (parity[rp4] << 4) |
 222        (parity[rp3] << 3) |
 223        (parity[rp2] << 2) |
 224        (parity[rp1] << 1) |
 225        (parity[rp0]);
 226    code[1] =
 227        (parity[rp15] << 7) |
 228        (parity[rp14] << 6) |
 229        (parity[rp13] << 5) |
 230        (parity[rp12] << 4) |
 231        (parity[rp11] << 3) |
 232        (parity[rp10] << 2) |
 233        (parity[rp9]  << 1) |
 234        (parity[rp8]);
 235    code[2] =
 236        (parity[par & 0xf0] << 7) |
 237        (parity[par & 0x0f] << 6) |
 238        (parity[par & 0xcc] << 5) |
 239        (parity[par & 0x33] << 4) |
 240        (parity[par & 0xaa] << 3) |
 241        (parity[par & 0x55] << 2);
 242    code[0] = ~code[0];
 243    code[1] = ~code[1];
 244    code[2] = ~code[2];
 245}
 246
 247Still pretty straightforward. The last three invert statements are there to
 248give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
 249all data is 0xff, so the checksum then matches.
 250
 251I also introduced the parity lookup. I expected this to be the fastest
 252way to calculate the parity, but I will investigate alternatives later
 253on.
 254
 255
 256Analysis 1
 257==========
 258
 259The code works, but is not terribly efficient. On my system it took
 260almost 4 times as much time as the linux driver code. But hey, if it was
 261*that* easy this would have been done long before.
 262No pain. no gain.
 263
 264Fortunately there is plenty of room for improvement.
 265
 266In step 1 we moved from bit-wise calculation to byte-wise calculation.
 267However in C we can also use the unsigned long data type and virtually
 268every modern microprocessor supports 32 bit operations, so why not try
 269to write our code in such a way that we process data in 32 bit chunks.
 270
 271Of course this means some modification as the row parity is byte by
 272byte. A quick analysis:
 273for the column parity we use the par variable. When extending to 32 bits
 274we can in the end easily calculate p0 and p1 from it.
 275(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
 276respectively)
 277also rp2 and rp3 can be easily retrieved from par as rp3 covers the
 278first two bytes and rp2 the last two bytes.
 279
 280Note that of course now the loop is executed only 64 times (256/4).
 281And note that care must taken wrt byte ordering. The way bytes are
 282ordered in a long is machine dependent, and might affect us.
 283Anyway, if there is an issue: this code is developed on x86 (to be
 284precise: a DELL PC with a D920 Intel CPU)
 285
 286And of course the performance might depend on alignment, but I expect
 287that the I/O buffers in the nand driver are aligned properly (and
 288otherwise that should be fixed to get maximum performance).
 289
 290Let's give it a try...
 291
 292
 293Attempt 2
 294=========
 295
 296extern const char parity[256];
 297
 298void ecc2(const unsigned char *buf, unsigned char *code)
 299{
 300    int i;
 301    const unsigned long *bp = (unsigned long *)buf;
 302    unsigned long cur;
 303    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
 304    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
 305    unsigned long par;
 306
 307    par = 0;
 308    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
 309    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
 310    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
 311    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
 312
 313    for (i = 0; i < 64; i++)
 314    {
 315        cur = *bp++;
 316        par ^= cur;
 317        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
 318        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
 319        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
 320        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
 321        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
 322        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
 323    }
 324    /*
 325       we need to adapt the code generation for the fact that rp vars are now
 326       long; also the column parity calculation needs to be changed.
 327       we'll bring rp4 to 15 back to single byte entities by shifting and
 328       xoring
 329    */
 330    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
 331    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
 332    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
 333    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
 334    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
 335    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
 336    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
 337    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
 338    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
 339    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
 340    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
 341    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
 342    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
 343    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
 344    par ^= (par >> 16);
 345    rp1 = (par >> 8); rp1 &= 0xff;
 346    rp0 = (par & 0xff);
 347    par ^= (par >> 8); par &= 0xff;
 348
 349    code[0] =
 350        (parity[rp7] << 7) |
 351        (parity[rp6] << 6) |
 352        (parity[rp5] << 5) |
 353        (parity[rp4] << 4) |
 354        (parity[rp3] << 3) |
 355        (parity[rp2] << 2) |
 356        (parity[rp1] << 1) |
 357        (parity[rp0]);
 358    code[1] =
 359        (parity[rp15] << 7) |
 360        (parity[rp14] << 6) |
 361        (parity[rp13] << 5) |
 362        (parity[rp12] << 4) |
 363        (parity[rp11] << 3) |
 364        (parity[rp10] << 2) |
 365        (parity[rp9]  << 1) |
 366        (parity[rp8]);
 367    code[2] =
 368        (parity[par & 0xf0] << 7) |
 369        (parity[par & 0x0f] << 6) |
 370        (parity[par & 0xcc] << 5) |
 371        (parity[par & 0x33] << 4) |
 372        (parity[par & 0xaa] << 3) |
 373        (parity[par & 0x55] << 2);
 374    code[0] = ~code[0];
 375    code[1] = ~code[1];
 376    code[2] = ~code[2];
 377}
 378
 379The parity array is not shown any more. Note also that for these
 380examples I kinda deviated from my regular programming style by allowing
 381multiple statements on a line, not using { } in then and else blocks
 382with only a single statement and by using operators like ^=
 383
 384
 385Analysis 2
 386==========
 387
 388The code (of course) works, and hurray: we are a little bit faster than
 389the linux driver code (about 15%). But wait, don't cheer too quickly.
 390THere is more to be gained.
 391If we look at e.g. rp14 and rp15 we see that we either xor our data with
 392rp14 or with rp15. However we also have par which goes over all data.
 393This means there is no need to calculate rp14 as it can be calculated from
 394rp15 through rp14 = par ^ rp15;
 395(or if desired we can avoid calculating rp15 and calculate it from
 396rp14).  That is why some places refer to inverse parity.
 397Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
 398Effectively this means we can eliminate the else clause from the if
 399statements. Also we can optimise the calculation in the end a little bit
 400by going from long to byte first. Actually we can even avoid the table
 401lookups
 402
 403Attempt 3
 404=========
 405
 406Odd replaced:
 407        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
 408        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
 409        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
 410        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
 411        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
 412        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
 413with
 414        if (i & 0x01) rp5 ^= cur;
 415        if (i & 0x02) rp7 ^= cur;
 416        if (i & 0x04) rp9 ^= cur;
 417        if (i & 0x08) rp11 ^= cur;
 418        if (i & 0x10) rp13 ^= cur;
 419        if (i & 0x20) rp15 ^= cur;
 420
 421        and outside the loop added:
 422    rp4  = par ^ rp5;
 423    rp6  = par ^ rp7;
 424    rp8  = par ^ rp9;
 425    rp10  = par ^ rp11;
 426    rp12  = par ^ rp13;
 427    rp14  = par ^ rp15;
 428
 429And after that the code takes about 30% more time, although the number of
 430statements is reduced. This is also reflected in the assembly code.
 431
 432
 433Analysis 3
 434==========
 435
 436Very weird. Guess it has to do with caching or instruction parallellism
 437or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
 438observation was that this one is only 30% slower (according to time)
 439executing the code as my 3Ghz D920 processor.
 440
 441Well, it was expected not to be easy so maybe instead move to a
 442different track: let's move back to the code from attempt2 and do some
 443loop unrolling. This will eliminate a few if statements. I'll try
 444different amounts of unrolling to see what works best.
 445
 446
 447Attempt 4
 448=========
 449
 450Unrolled the loop 1, 2, 3 and 4 times.
 451For 4 the code starts with:
 452
 453    for (i = 0; i < 4; i++)
 454    {
 455        cur = *bp++;
 456        par ^= cur;
 457        rp4 ^= cur;
 458        rp6 ^= cur;
 459        rp8 ^= cur;
 460        rp10 ^= cur;
 461        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
 462        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
 463        cur = *bp++;
 464        par ^= cur;
 465        rp5 ^= cur;
 466        rp6 ^= cur;
 467        ...
 468
 469
 470Analysis 4
 471==========
 472
 473Unrolling once gains about 15%
 474Unrolling twice keeps the gain at about 15%
 475Unrolling three times gives a gain of 30% compared to attempt 2.
 476Unrolling four times gives a marginal improvement compared to unrolling
 477three times.
 478
 479I decided to proceed with a four time unrolled loop anyway. It was my gut
 480feeling that in the next steps I would obtain additional gain from it.
 481
 482The next step was triggered by the fact that par contains the xor of all
 483bytes and rp4 and rp5 each contain the xor of half of the bytes.
 484So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
 485that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
 486eliminate rp5 (or rp4, but I already foresaw another optimisation).
 487The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
 488
 489
 490Attempt 5
 491=========
 492
 493Effectively so all odd digit rp assignments in the loop were removed.
 494This included the else clause of the if statements.
 495Of course after the loop we need to correct things by adding code like:
 496    rp5 = par ^ rp4;
 497Also the initial assignments (rp5 = 0; etc) could be removed.
 498Along the line I also removed the initialisation of rp0/1/2/3.
 499
 500
 501Analysis 5
 502==========
 503
 504Measurements showed this was a good move. The run-time roughly halved
 505compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
 506of the processor time compared to the current code in the linux kernel.
 507
 508However, still I thought there was more. I didn't like all the if
 509statements. Why not keep a running parity and only keep the last if
 510statement. Time for yet another version!
 511
 512
 513Attempt 6
 514=========
 515
 516THe code within the for loop was changed to:
 517
 518    for (i = 0; i < 4; i++)
 519    {
 520        cur = *bp++; tmppar  = cur; rp4 ^= cur;
 521        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
 522        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 523        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
 524
 525        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
 526        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
 527            cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 528            cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
 529
 530            cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
 531        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
 532            cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
 533        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
 534
 535        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
 536        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
 537        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 538        cur = *bp++; tmppar ^= cur;
 539
 540            par ^= tmppar;
 541        if ((i & 0x1) == 0) rp12 ^= tmppar;
 542        if ((i & 0x2) == 0) rp14 ^= tmppar;
 543    }
 544
 545As you can see tmppar is used to accumulate the parity within a for
 546iteration. In the last 3 statements is is added to par and, if needed,
 547to rp12 and rp14.
 548
 549While making the changes I also found that I could exploit that tmppar
 550contains the running parity for this iteration. So instead of having:
 551rp4 ^= cur; rp6 = cur;
 552I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
 553statement. A similar change was done for rp8 and rp10
 554
 555
 556Analysis 6
 557==========
 558
 559Measuring this code again showed big gain. When executing the original
 560linux code 1 million times, this took about 1 second on my system.
 561(using time to measure the performance). After this iteration I was back
 562to 0.075 sec. Actually I had to decide to start measuring over 10
 563million iterations in order not to lose too much accuracy. This one
 564definitely seemed to be the jackpot!
 565
 566There is a little bit more room for improvement though. There are three
 567places with statements:
 568rp4 ^= cur; rp6 ^= cur;
 569It seems more efficient to also maintain a variable rp4_6 in the while
 570loop; This eliminates 3 statements per loop. Of course after the loop we
 571need to correct by adding:
 572    rp4 ^= rp4_6;
 573    rp6 ^= rp4_6
 574Furthermore there are 4 sequential assignments to rp8. This can be
 575encoded slightly more efficiently by saving tmppar before those 4 lines
 576and later do rp8 = rp8 ^ tmppar ^ notrp8;
 577(where notrp8 is the value of rp8 before those 4 lines).
 578Again a use of the commutative property of xor.
 579Time for a new test!
 580
 581
 582Attempt 7
 583=========
 584
 585The new code now looks like:
 586
 587    for (i = 0; i < 4; i++)
 588    {
 589        cur = *bp++; tmppar  = cur; rp4 ^= cur;
 590        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
 591        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 592        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
 593
 594        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
 595        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
 596            cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 597            cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
 598
 599            notrp8 = tmppar;
 600            cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
 601        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
 602            cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 603        cur = *bp++; tmppar ^= cur;
 604            rp8 = rp8 ^ tmppar ^ notrp8;
 605
 606        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
 607        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
 608        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
 609        cur = *bp++; tmppar ^= cur;
 610
 611            par ^= tmppar;
 612        if ((i & 0x1) == 0) rp12 ^= tmppar;
 613        if ((i & 0x2) == 0) rp14 ^= tmppar;
 614    }
 615    rp4 ^= rp4_6;
 616    rp6 ^= rp4_6;
 617
 618
 619Not a big change, but every penny counts :-)
 620
 621
 622Analysis 7
 623==========
 624
 625Actually this made things worse. Not very much, but I don't want to move
 626into the wrong direction. Maybe something to investigate later. Could
 627have to do with caching again.
 628
 629Guess that is what there is to win within the loop. Maybe unrolling one
 630more time will help. I'll keep the optimisations from 7 for now.
 631
 632
 633Attempt 8
 634=========
 635
 636Unrolled the loop one more time.
 637
 638
 639Analysis 8
 640==========
 641
 642This makes things worse. Let's stick with attempt 6 and continue from there.
 643Although it seems that the code within the loop cannot be optimised
 644further there is still room to optimize the generation of the ecc codes.
 645We can simply calculate the total parity. If this is 0 then rp4 = rp5
 646etc. If the parity is 1, then rp4 = !rp5;
 647But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
 648in the result byte and then do something like
 649    code[0] |= (code[0] << 1);
 650Lets test this.
 651
 652
 653Attempt 9
 654=========
 655
 656Changed the code but again this slightly degrades performance. Tried all
 657kind of other things, like having dedicated parity arrays to avoid the
 658shift after parity[rp7] << 7; No gain.
 659Change the lookup using the parity array by using shift operators (e.g.
 660replace parity[rp7] << 7 with:
 661rp7 ^= (rp7 << 4);
 662rp7 ^= (rp7 << 2);
 663rp7 ^= (rp7 << 1);
 664rp7 &= 0x80;
 665No gain.
 666
 667The only marginal change was inverting the parity bits, so we can remove
 668the last three invert statements.
 669
 670Ah well, pity this does not deliver more. Then again 10 million
 671iterations using the linux driver code takes between 13 and 13.5
 672seconds, whereas my code now takes about 0.73 seconds for those 10
 673million iterations. So basically I've improved the performance by a
 674factor 18 on my system. Not that bad. Of course on different hardware
 675you will get different results. No warranties!
 676
 677But of course there is no such thing as a free lunch. The codesize almost
 678tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
 679
 680
 681Correcting errors
 682=================
 683
 684For correcting errors I again used the ST application note as a starter,
 685but I also peeked at the existing code.
 686The algorithm itself is pretty straightforward. Just xor the given and
 687the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
 688are 1 we have one correctable bit error. If there is 1 bit 1, we have an
 689error in the given ecc code.
 690It proved to be fastest to do some table lookups. Performance gain
 691introduced by this is about a factor 2 on my system when a repair had to
 692be done, and 1% or so if no repair had to be done.
 693Code size increased from 330 bytes to 686 bytes for this function.
 694(gcc 4.2, -O3)
 695
 696
 697Conclusion
 698==========
 699
 700The gain when calculating the ecc is tremendous. Om my development hardware
 701a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
 702embedded system with a MIPS core a factor 7 was obtained.
 703On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
 7045 (big endian mode, gcc 4.1.2, -O3)
 705For correction not much gain could be obtained (as bitflips are rare). Then
 706again there are also much less cycles spent there.
 707
 708It seems there is not much more gain possible in this, at least when
 709programmed in C. Of course it might be possible to squeeze something more
 710out of it with an assembler program, but due to pipeline behaviour etc
 711this is very tricky (at least for intel hw).
 712
 713Author: Frans Meulenbroeks
 714Copyright (C) 2008 Koninklijke Philips Electronics NV.
 715
lxr.linux.no kindly hosted by Redpill Linpro AS, provider of Linux consulting and operations services since 1995.