A brief CRC tutorial.

A CRC is a long-division remainder.  You add the CRC to the message,
and the whole thing (message+CRC) is a multiple of the given
CRC polynomial.  To check the CRC, you can either check that the
CRC matches the recomputed value, *or* you can check that the
remainder computed on the message+CRC is 0.  This latter approach
is used by a lot of hardware implementations, and is why so many
protocols put the end-of-frame flag after the CRC.

It's actually the same long division you learned in school, except that
- We're working in binary, so the digits are only 0 and 1, and
- When dividing polynomials, there are no carries.  Rather than add and
  subtract, we just xor.  Thus, we tend to get a bit sloppy about
  the difference between adding and subtracting.

Like all division, the remainder is always smaller than the divisor.
To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
Since it's 33 bits long, bit 32 is always going to be set, so usually the
CRC is written in hex with the most significant bit omitted.  (If you're
familiar with the IEEE 754 floating-point format, it's the same idea.)

Note that a CRC is computed over a string of *bits*, so you have
to decide on the endianness of the bits within each byte.  To get
the best error-detecting properties, this should correspond to the
order they're actually sent.  For example, standard RS-232 serial is
little-endian; the most significant bit (sometimes used for parity)
is sent last.
And when appending a CRC word to a message, you should
do it in the right order, matching the endianness.

Just like with ordinary division, you proceed one digit (bit) at a time.
At each step of the division, you take one more digit (bit) of the dividend
and append it to the current remainder.  Then you figure out the
appropriate multiple of the divisor to subtract to bring the remainder
back into range.  In binary, this is easy - it has to be either 0 or 1,
and to make the XOR cancel, it's just a copy of bit 32 of the remainder.

When computing a CRC, we don't care about the quotient, so we can
throw the quotient bit away, but subtract the appropriate multiple of
the polynomial from the remainder and we're back to where we started,
ready to process the next bit.

A big-endian CRC written this way would be coded like:
for (i = 0; i < input_bits; i++) {
	multiple = remainder & 0x80000000 ? CRCPOLY : 0;
	remainder = (remainder << 1 | next_input_bit()) ^ multiple;
}

Notice how, to get at bit 32 of the shifted remainder, we look
at bit 31 of the remainder *before* shifting it.

But also notice how the next_input_bit() bits we're shifting into
the remainder don't actually affect any decision-making until
32 bits later.  Thus, the first 32 cycles of this are pretty boring.
Also, to add the CRC to a message, we need a 32-bit-long hole for it at
the end, so we have to add 32 extra cycles shifting in zeros at the
end of every message.

These details lead to a standard trick: delay merging in the
next_input_bit() until the moment it's needed.  Then the first 32 cycles
can be precomputed, and merging in the final 32 zero bits to make room
for the CRC can be skipped entirely.
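
As a rough, self-contained C sketch of the plain long-division loop
(before the trick is applied), something like the following could be
used.  The names crc32_be_longdiv and pad_bits are made up for this
illustration, and CRCPOLY_BE = 0x04C11DB7 (the common CRC-32 polynomial
with bit 32 omitted) is an assumed choice; there is no initial value or
inversion step here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* assumed polynomial, bit 32 omitted */

/*
 * Divide the message (followed by pad_bits zero bits) by the polynomial,
 * one bit at a time, most significant bit of each byte first.
 * Hypothetical helper, not taken from any existing library.
 */
static uint32_t crc32_be_longdiv(const unsigned char *msg, size_t len,
				 unsigned int pad_bits)
{
	uint32_t remainder = 0;
	size_t msg_bits = len * 8;
	size_t i;

	assert(msg != NULL || len == 0);
	for (i = 0; i < msg_bits + pad_bits; i++) {
		/* plays the role of next_input_bit(): 0 once in the padding */
		uint32_t bit = 0;
		uint32_t multiple;

		if (i < msg_bits)
			bit = (msg[i / 8] >> (7 - i % 8)) & 1;
		/* bit 31 before the shift is bit 32 after it */
		multiple = (remainder & 0x80000000u) ? CRCPOLY_BE : 0;
		remainder = ((remainder << 1) | bit) ^ multiple;
	}
	return remainder;
}
```

Calling it with pad_bits = 32 gives the CRC of the message.  Running it
with pad_bits = 0 over the message followed by that CRC (high byte first)
should leave a remainder of 0, since message+CRC is a multiple of the
polynomial.
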
Applying this trick changes the code to:

for (i = 0; i < input_bits; i++) {
	/* cast so the shift is done on an unsigned 32-bit value */
	remainder ^= (uint32_t)next_input_bit() << 31;
	multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
	remainder = (remainder << 1) ^ multiple;
}

With this optimization, the little-endian code is particularly simple:
for (i = 0; i < input_bits; i++) {
	remainder ^= next_input_bit();
	multiple = (remainder & 1) ? CRCPOLY : 0;
	remainder = (remainder >> 1) ^ multiple;
}

The most significant coefficient of the remainder polynomial is stored
in the least significant bit of the binary "remainder" variable.
The other details of endianness have been hidden in CRCPOLY (which must
be bit-reversed) and next_input_bit().

As long as next_input_bit is returning the bits in a sensible order, we don't
*have* to wait until the last possible moment to merge in additional bits.
We can do it 8 bits at a time rather than 1 bit at a time:
for (i = 0; i < input_bytes; i++) {
	remainder ^= (uint32_t)next_input_byte() << 24;
	for (j = 0; j < 8; j++) {
		multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
		remainder = (remainder << 1) ^ multiple;
	}
}

Or in little-endian:
for (i = 0; i < input_bytes; i++) {
	remainder ^= next_input_byte();
	for (j = 0; j < 8; j++) {
		multiple = (remainder & 1) ? CRCPOLY : 0;
		remainder = (remainder >> 1) ^ multiple;
	}
}

If the input is a multiple of 32 bits, you can even XOR in a 32-bit
word at a time and increase the inner loop count to 32.

You can also mix and match the two loop styles, for example doing the
bulk of a message byte-at-a-time and adding bit-at-a-time processing
for any fractional bytes at the end.

To reduce the number of conditional branches, software commonly uses
the byte-at-a-time table method, popularized by Dilip V. Sarwate,
"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
v.31 no.8 (August 1988) p. 1008-1013.

Here, rather than just shifting one bit of the remainder to decide
on the correct multiple to subtract, we can shift a byte at a time.
This produces a 40-bit (rather than a 33-bit) intermediate remainder,
and the correct multiple of the polynomial to subtract is found using
a 256-entry lookup table indexed by the high 8 bits.

(The table entries are simply the CRC-32 of the given one-byte messages.)

When space is more constrained, smaller tables can be used, e.g. two
4-bit shifts followed by a lookup in a 16-entry table.

It is not practical to process much more than 8 bits at a time using this
technique, because tables larger than 256 entries use too much memory and,
more importantly, too much of the L1 cache.

To get higher software performance, a "slicing" technique can be used.
See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf

This does not change the number of table lookups, but does increase
the parallelism.
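
For reference, the byte-at-a-time table method might be sketched in C as
follows for the big-endian case.  This is only an illustration: the names
crc32_be_table, crc32_be_table_init and crc32_be_update are invented
here, and CRCPOLY_BE = 0x04C11DB7 is an assumed polynomial.  The caller
supplies the starting remainder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_BE 0x04C11DB7u	/* assumed polynomial, bit 32 omitted */

static uint32_t crc32_be_table[256];

/* Entry n is the remainder of the one-byte message n (zero start value). */
static void crc32_be_table_init(void)
{
	uint32_t n;
	int j;

	for (n = 0; n < 256; n++) {
		uint32_t r = n << 24;

		for (j = 0; j < 8; j++)
			r = (r << 1) ^ ((r & 0x80000000u) ? CRCPOLY_BE : 0);
		crc32_be_table[n] = r;
	}
}

/* One table lookup per input byte instead of eight conditional XORs. */
static uint32_t crc32_be_update(uint32_t remainder,
				const unsigned char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		remainder ^= (uint32_t)buf[i] << 24;
		/* the high 8 bits pick the multiple to subtract */
		remainder = (remainder << 8) ^ crc32_be_table[remainder >> 24];
	}
	return remainder;
}
```

Note that the index for each lookup is taken from the remainder produced
by the previous lookup.
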
With the classic Sarwate algorithm, each table lookup
must be completed before the index of the next can be computed.

A "slicing by 2" technique would shift the remainder 16 bits at a time,
producing a 48-bit intermediate remainder.  Rather than doing a single
lookup in a 65536-entry table, the two high bytes are looked up in
two different 256-entry tables.  Each contains the remainder required
to cancel out the corresponding byte.  The tables are different because the
polynomials to cancel are different.  One has non-zero coefficients from
x^32 to x^39, while the other goes from x^40 to x^47.

Since modern processors can handle many parallel memory operations, this
takes barely longer than a single table look-up and thus performs almost
twice as fast as the basic Sarwate algorithm.

This can be extended to "slicing by 4" using four 256-entry tables.
Each step, 32 bits of data is fetched, XORed with the CRC, and the result
broken into bytes and looked up in the tables.  Because the 32-bit shift
leaves the low-order bits of the intermediate remainder zero, the
final CRC is simply the XOR of the 4 table look-ups.

But this still enforces sequential execution: a second group of table
look-ups cannot begin until the previous group's 4 table look-ups have all
been completed.  Thus, the processor's load/store unit is sometimes idle.

To make maximum use of the processor, "slicing by 8" performs 8 look-ups
in parallel.  Each step, the 32-bit CRC is shifted 64 bits and XORed
with 64 bits of input data.  What is important to note is that 4 of
those 8 bytes are simply copies of the input data; they do not depend
on the previous CRC at all.  Thus, those 4 table look-ups may commence
immediately, without waiting for the previous loop iteration.
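
A little-endian "slicing by 8" loop could look roughly like the sketch
below.  It is an illustration under assumptions, not the Intel code: the
names slice_tab, slice_tab_init, load_le32 and crc32_le_slice8 are
invented here, and CRCPOLY_LE = 0xEDB88320 is the bit-reversed form of
the assumed polynomial 0x04C11DB7.  Leftover bytes fall back to the
ordinary one-table method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_LE 0xEDB88320u	/* assumed: bit-reversed 0x04C11DB7 */

static uint32_t slice_tab[8][256];

/* slice_tab[0] is the ordinary byte table; table k covers k more bytes. */
static void slice_tab_init(void)
{
	uint32_t n;
	int j, k;

	for (n = 0; n < 256; n++) {
		uint32_t r = n;

		for (j = 0; j < 8; j++)
			r = (r >> 1) ^ ((r & 1) ? CRCPOLY_LE : 0);
		slice_tab[0][n] = r;
	}
	for (k = 1; k < 8; k++)
		for (n = 0; n < 256; n++) {
			uint32_t prev = slice_tab[k - 1][n];

			slice_tab[k][n] = (prev >> 8) ^ slice_tab[0][prev & 0xFF];
		}
}

/* Portable little-endian 32-bit load. */
static uint32_t load_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint32_t crc32_le_slice8(uint32_t remainder,
				const unsigned char *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	while (len >= 8) {
		uint32_t one = load_le32(buf) ^ remainder; /* needs old CRC */
		uint32_t two = load_le32(buf + 4);	   /* input data only */

		remainder = slice_tab[7][one & 0xFF] ^
			    slice_tab[6][(one >> 8) & 0xFF] ^
			    slice_tab[5][(one >> 16) & 0xFF] ^
			    slice_tab[4][one >> 24] ^
			    slice_tab[3][two & 0xFF] ^
			    slice_tab[2][(two >> 8) & 0xFF] ^
			    slice_tab[1][(two >> 16) & 0xFF] ^
			    slice_tab[0][two >> 24];
		buf += 8;
		len -= 8;
	}
	while (len--)	/* tail: one byte per lookup */
		remainder = (remainder >> 8) ^
			    slice_tab[0][(remainder ^ *buf++) & 0xFF];
	return remainder;
}
```

The four lookups indexed by "two" use nothing but input bytes, so they do
not have to wait for the previous iteration's result.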

By always having 4 loads in flight, a modern superscalar processor can
be kept busy and make full use of its L1 cache.

Two more details about CRC implementation in the real world:

Normally, appending zero bits to a message which is already a multiple
of a polynomial produces a larger multiple of that polynomial.  Thus,
a basic CRC will not detect appended zero bits (or bytes).  To enable
a CRC to detect this condition, it's common to invert the CRC before
appending it.  This makes the remainder of the message+crc come out not
as zero, but some fixed non-zero value.  (The CRC of the inversion
pattern, 0xffffffff.)

The same problem applies to zero bits prepended to the message, and a
similar solution is used.  Instead of starting the CRC computation with
a remainder of 0, an initial remainder of all ones is used.  As long as
you start the same way on decoding, it doesn't make a difference.
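
These two details can be tried out with a small bit-at-a-time
little-endian routine.  The sketch below is illustrative only: the name
crc32_le_bits is made up, CRCPOLY_LE = 0xEDB88320 is the bit-reversed
form of the assumed polynomial 0x04C11DB7, and the initial remainder is
left to the caller so that both starting values can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRCPOLY_LE 0xEDB88320u	/* assumed: bit-reversed 0x04C11DB7 */

/*
 * Little-endian CRC, one bit at a time.  Returns the raw remainder;
 * the caller inverts it (or not) before appending it to the message.
 */
static uint32_t crc32_le_bits(uint32_t remainder,
			      const unsigned char *buf, size_t len)
{
	size_t i;
	int j;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		remainder ^= buf[i];
		for (j = 0; j < 8; j++) {
			uint32_t multiple = (remainder & 1) ? CRCPOLY_LE : 0;

			remainder = (remainder >> 1) ^ multiple;
		}
	}
	return remainder;
}
```

Starting from 0xffffffff and inverting the result, the ASCII string
"123456789" gives 0xCBF43926.  Appending those four bytes, least
significant first, and running the routine again from 0xffffffff leaves
the fixed remainder 0xDEBB20E3 rather than 0.  Starting from 0 instead,
a message with extra zero bytes in front produces the same remainder as
the message alone, which is the problem the all-ones start avoids.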