| b.liu | e958203 | 2025-04-17 19:18:16 +0800 | [diff] [blame] | 1 | ========================== | 
|  | 2 | NAND Error-correction Code | 
|  | 3 | ========================== | 
|  | 4 |  | 
|  | 5 | Introduction | 
|  | 6 | ============ | 
|  | 7 |  | 
|  | 8 | Having looked at the linux mtd/nand driver and more specific at nand_ecc.c | 
|  | 9 | I felt there was room for optimisation. I bashed the code for a few hours | 
|  | 10 | performing tricks like table lookup removing superfluous code etc. | 
|  | 11 | After that the speed was increased by 35-40%. | 
|  | 12 | Still I was not too happy as I felt there was additional room for improvement. | 
|  | 13 |  | 
|  | 14 | Bad! I was hooked. | 
|  | 15 | I decided to annotate my steps in this file. Perhaps it is useful to someone | 
|  | 16 | or someone learns something from it. | 
|  | 17 |  | 
|  | 18 |  | 
|  | 19 | The problem | 
|  | 20 | =========== | 
|  | 21 |  | 
|  | 22 | NAND flash (at least SLC one) typically has sectors of 256 bytes. | 
|  | 23 | However NAND flash is not extremely reliable so some error detection | 
|  | 24 | (and sometimes correction) is needed. | 
|  | 25 |  | 
|  | 26 | This is done by means of a Hamming code. I'll try to explain it in | 
|  | 27 | laymans terms (and apologies to all the pro's in the field in case I do | 
|  | 28 | not use the right terminology, my coding theory class was almost 30 | 
|  | 29 | years ago, and I must admit it was not one of my favourites). | 
|  | 30 |  | 
|  | 31 | As I said before the ecc calculation is performed on sectors of 256 | 
|  | 32 | bytes. This is done by calculating several parity bits over the rows and | 
|  | 33 | columns. The parity used is even parity which means that the parity bit = 1 | 
|  | 34 | if the data over which the parity is calculated is 1 and the parity bit = 0 | 
|  | 35 | if the data over which the parity is calculated is 0. So the total | 
|  | 36 | number of bits over the data over which the parity is calculated + the | 
|  | 37 | parity bit is even. (see wikipedia if you can't follow this). | 
|  | 38 | Parity is often calculated by means of an exclusive or operation, | 
|  | 39 | sometimes also referred to as xor. In C the operator for xor is ^ | 
|  | 40 |  | 
|  | 41 | Back to ecc. | 
|  | 42 | Let's give a small figure: | 
|  | 43 |  | 
|  | 44 | =========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ==== | 
|  | 45 | byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14 | 
|  | 46 | byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14 | 
|  | 47 | byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14 | 
|  | 48 | byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14 | 
|  | 49 | byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14 | 
|  | 50 | ... | 
|  | 51 | byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15 | 
|  | 52 | byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15 | 
|  | 53 | cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0 | 
|  | 54 | cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2 | 
|  | 55 | cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4 | 
|  | 56 | =========  ==== ==== ==== ==== ==== ==== ==== ====   === === === === ==== | 
|  | 57 |  | 
|  | 58 | This figure represents a sector of 256 bytes. | 
|  | 59 | cp is my abbreviation for column parity, rp for row parity. | 
|  | 60 |  | 
|  | 61 | Let's start to explain column parity. | 
|  | 62 |  | 
|  | 63 | - cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. | 
|  | 64 |  | 
|  | 65 | so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. | 
|  | 66 |  | 
|  | 67 | Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. | 
|  | 68 |  | 
|  | 69 | - cp2 is the parity over bit0, bit1, bit4 and bit5 | 
|  | 70 | - cp3 is the parity over bit2, bit3, bit6 and bit7. | 
|  | 71 | - cp4 is the parity over bit0, bit1, bit2 and bit3. | 
|  | 72 | - cp5 is the parity over bit4, bit5, bit6 and bit7. | 
|  | 73 |  | 
|  | 74 | Note that each of cp0 .. cp5 is exactly one bit. | 
|  | 75 |  | 
|  | 76 | Row parity actually works almost the same. | 
|  | 77 |  | 
|  | 78 | - rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) | 
|  | 79 | - rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) | 
|  | 80 | - rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... | 
|  | 81 | (so handle two bytes, then skip 2 bytes). | 
|  | 82 | - rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) | 
|  | 83 | - for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. | 
|  | 84 |  | 
|  | 85 | so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) | 
|  | 86 | - and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. | 
|  | 87 |  | 
|  | 88 | The story now becomes quite boring. I guess you get the idea. | 
|  | 89 |  | 
|  | 90 | - rp6 covers 8 bytes then skips 8 etc | 
|  | 91 | - rp7 skips 8 bytes then covers 8 etc | 
|  | 92 | - rp8 covers 16 bytes then skips 16 etc | 
|  | 93 | - rp9 skips 16 bytes then covers 16 etc | 
|  | 94 | - rp10 covers 32 bytes then skips 32 etc | 
|  | 95 | - rp11 skips 32 bytes then covers 32 etc | 
|  | 96 | - rp12 covers 64 bytes then skips 64 etc | 
|  | 97 | - rp13 skips 64 bytes then covers 64 etc | 
|  | 98 | - rp14 covers 128 bytes then skips 128 | 
|  | 99 | - rp15 skips 128 bytes then covers 128 | 
|  | 100 |  | 
|  | 101 | In the end the parity bits are grouped together in three bytes as | 
|  | 102 | follows: | 
|  | 103 |  | 
|  | 104 | =====  ===== ===== ===== ===== ===== ===== ===== ===== | 
|  | 105 | ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 | 
|  | 106 | =====  ===== ===== ===== ===== ===== ===== ===== ===== | 
|  | 107 | ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00 | 
|  | 108 | ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08 | 
|  | 109 | ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1 | 
|  | 110 | =====  ===== ===== ===== ===== ===== ===== ===== ===== | 
|  | 111 |  | 
|  | 112 | I detected after writing this that ST application note AN1823 | 
|  | 113 | (http://www.st.com/stonline/) gives a much | 
|  | 114 | nicer picture.(but they use line parity as term where I use row parity) | 
|  | 115 | Oh well, I'm graphically challenged, so suffer with me for a moment :-) | 
|  | 116 |  | 
|  | 117 | And I could not reuse the ST picture anyway for copyright reasons. | 
|  | 118 |  | 
|  | 119 |  | 
|  | 120 | Attempt 0 | 
|  | 121 | ========= | 
|  | 122 |  | 
|  | 123 | Implementing the parity calculation is pretty simple. | 
|  | 124 | In C pseudocode:: | 
|  | 125 |  | 
|  | 126 | for (i = 0; i < 256; i++) | 
|  | 127 | { | 
|  | 128 | if (i & 0x01) | 
|  | 129 | rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | 
|  | 130 | else | 
|  | 131 | rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp0; | 
|  | 132 | if (i & 0x02) | 
|  | 133 | rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; | 
|  | 134 | else | 
|  | 135 | rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; | 
|  | 136 | if (i & 0x04) | 
|  | 137 | rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; | 
|  | 138 | else | 
|  | 139 | rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; | 
|  | 140 | if (i & 0x08) | 
|  | 141 | rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; | 
|  | 142 | else | 
|  | 143 | rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; | 
|  | 144 | if (i & 0x10) | 
|  | 145 | rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; | 
|  | 146 | else | 
|  | 147 | rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; | 
|  | 148 | if (i & 0x20) | 
|  | 149 | rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; | 
|  | 150 | else | 
|  | 151 | rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; | 
|  | 152 | if (i & 0x40) | 
|  | 153 | rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; | 
|  | 154 | else | 
|  | 155 | rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; | 
|  | 156 | if (i & 0x80) | 
|  | 157 | rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; | 
|  | 158 | else | 
|  | 159 | rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; | 
|  | 160 | cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; | 
|  | 161 | cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; | 
|  | 162 | cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; | 
|  | 163 | cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 | 
|  | 164 | cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 | 
|  | 165 | cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 | 
|  | 166 | } | 
|  | 167 |  | 
|  | 168 |  | 
|  | 169 | Analysis 0 | 
|  | 170 | ========== | 
|  | 171 |  | 
|  | 172 | C does have bitwise operators but not really operators to do the above | 
|  | 173 | efficiently (and most hardware has no such instructions either). | 
|  | 174 | Therefore without implementing this it was clear that the code above was | 
|  | 175 | not going to bring me a Nobel prize :-) | 
|  | 176 |  | 
|  | 177 | Fortunately the exclusive or operation is commutative, so we can combine | 
|  | 178 | the values in any order. So instead of calculating all the bits | 
|  | 179 | individually, let us try to rearrange things. | 
|  | 180 | For the column parity this is easy. We can just xor the bytes and in the | 
|  | 181 | end filter out the relevant bits. This is pretty nice as it will bring | 
|  | 182 | all cp calculation out of the for loop. | 
|  | 183 |  | 
|  | 184 | Similarly we can first xor the bytes for the various rows. | 
|  | 185 | This leads to: | 
|  | 186 |  | 
|  | 187 |  | 
|  | 188 | Attempt 1 | 
|  | 189 | ========= | 
|  | 190 |  | 
|  | 191 | :: | 
|  | 192 |  | 
|  | 193 | const char parity[256] = { | 
|  | 194 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 195 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 196 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 197 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 198 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 199 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 200 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 201 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 202 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 203 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 204 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 205 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 206 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | 
|  | 207 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 208 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | 
|  | 209 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 | 
|  | 210 | }; | 
|  | 211 |  | 
|  | 212 | void ecc1(const unsigned char *buf, unsigned char *code) | 
|  | 213 | { | 
|  | 214 | int i; | 
|  | 215 | const unsigned char *bp = buf; | 
|  | 216 | unsigned char cur; | 
|  | 217 | unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | 
|  | 218 | unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | 
|  | 219 | unsigned char par; | 
|  | 220 |  | 
|  | 221 | par = 0; | 
|  | 222 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | 
|  | 223 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | 
|  | 224 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | 
|  | 225 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | 
|  | 226 |  | 
|  | 227 | for (i = 0; i < 256; i++) | 
|  | 228 | { | 
|  | 229 | cur = *bp++; | 
|  | 230 | par ^= cur; | 
|  | 231 | if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; | 
|  | 232 | if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; | 
|  | 233 | if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; | 
|  | 234 | if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; | 
|  | 235 | if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; | 
|  | 236 | if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; | 
|  | 237 | if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; | 
|  | 238 | if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; | 
|  | 239 | } | 
|  | 240 | code[0] = | 
|  | 241 | (parity[rp7] << 7) | | 
|  | 242 | (parity[rp6] << 6) | | 
|  | 243 | (parity[rp5] << 5) | | 
|  | 244 | (parity[rp4] << 4) | | 
|  | 245 | (parity[rp3] << 3) | | 
|  | 246 | (parity[rp2] << 2) | | 
|  | 247 | (parity[rp1] << 1) | | 
|  | 248 | (parity[rp0]); | 
|  | 249 | code[1] = | 
|  | 250 | (parity[rp15] << 7) | | 
|  | 251 | (parity[rp14] << 6) | | 
|  | 252 | (parity[rp13] << 5) | | 
|  | 253 | (parity[rp12] << 4) | | 
|  | 254 | (parity[rp11] << 3) | | 
|  | 255 | (parity[rp10] << 2) | | 
|  | 256 | (parity[rp9]  << 1) | | 
|  | 257 | (parity[rp8]); | 
|  | 258 | code[2] = | 
|  | 259 | (parity[par & 0xf0] << 7) | | 
|  | 260 | (parity[par & 0x0f] << 6) | | 
|  | 261 | (parity[par & 0xcc] << 5) | | 
|  | 262 | (parity[par & 0x33] << 4) | | 
|  | 263 | (parity[par & 0xaa] << 3) | | 
|  | 264 | (parity[par & 0x55] << 2); | 
|  | 265 | code[0] = ~code[0]; | 
|  | 266 | code[1] = ~code[1]; | 
|  | 267 | code[2] = ~code[2]; | 
|  | 268 | } | 
|  | 269 |  | 
|  | 270 | Still pretty straightforward. The last three invert statements are there to | 
|  | 271 | give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash | 
|  | 272 | all data is 0xff, so the checksum then matches. | 
|  | 273 |  | 
|  | 274 | I also introduced the parity lookup. I expected this to be the fastest | 
|  | 275 | way to calculate the parity, but I will investigate alternatives later | 
|  | 276 | on. | 
|  | 277 |  | 
|  | 278 |  | 
|  | 279 | Analysis 1 | 
|  | 280 | ========== | 
|  | 281 |  | 
|  | 282 | The code works, but is not terribly efficient. On my system it took | 
|  | 283 | almost 4 times as much time as the linux driver code. But hey, if it was | 
|  | 284 | *that* easy this would have been done long before. | 
|  | 285 | No pain. no gain. | 
|  | 286 |  | 
|  | 287 | Fortunately there is plenty of room for improvement. | 
|  | 288 |  | 
|  | 289 | In step 1 we moved from bit-wise calculation to byte-wise calculation. | 
|  | 290 | However in C we can also use the unsigned long data type and virtually | 
|  | 291 | every modern microprocessor supports 32 bit operations, so why not try | 
|  | 292 | to write our code in such a way that we process data in 32 bit chunks. | 
|  | 293 |  | 
|  | 294 | Of course this means some modification as the row parity is byte by | 
|  | 295 | byte. A quick analysis: | 
|  | 296 | for the column parity we use the par variable. When extending to 32 bits | 
|  | 297 | we can in the end easily calculate rp0 and rp1 from it. | 
|  | 298 | (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 | 
|  | 299 | respectively, from MSB to LSB) | 
|  | 300 | also rp2 and rp3 can be easily retrieved from par as rp3 covers the | 
|  | 301 | first two MSBs and rp2 covers the last two LSBs. | 
|  | 302 |  | 
|  | 303 | Note that of course now the loop is executed only 64 times (256/4). | 
|  | 304 | And note that care must taken wrt byte ordering. The way bytes are | 
|  | 305 | ordered in a long is machine dependent, and might affect us. | 
|  | 306 | Anyway, if there is an issue: this code is developed on x86 (to be | 
|  | 307 | precise: a DELL PC with a D920 Intel CPU) | 
|  | 308 |  | 
|  | 309 | And of course the performance might depend on alignment, but I expect | 
|  | 310 | that the I/O buffers in the nand driver are aligned properly (and | 
|  | 311 | otherwise that should be fixed to get maximum performance). | 
|  | 312 |  | 
|  | 313 | Let's give it a try... | 
|  | 314 |  | 
|  | 315 |  | 
|  | 316 | Attempt 2 | 
|  | 317 | ========= | 
|  | 318 |  | 
|  | 319 | :: | 
|  | 320 |  | 
|  | 321 | extern const char parity[256]; | 
|  | 322 |  | 
|  | 323 | void ecc2(const unsigned char *buf, unsigned char *code) | 
|  | 324 | { | 
|  | 325 | int i; | 
|  | 326 | const unsigned long *bp = (unsigned long *)buf; | 
|  | 327 | unsigned long cur; | 
|  | 328 | unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | 
|  | 329 | unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | 
|  | 330 | unsigned long par; | 
|  | 331 |  | 
|  | 332 | par = 0; | 
|  | 333 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | 
|  | 334 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | 
|  | 335 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | 
|  | 336 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | 
|  | 337 |  | 
|  | 338 | for (i = 0; i < 64; i++) | 
|  | 339 | { | 
|  | 340 | cur = *bp++; | 
|  | 341 | par ^= cur; | 
|  | 342 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | 
|  | 343 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | 
|  | 344 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | 
|  | 345 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | 
|  | 346 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | 
|  | 347 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | 
|  | 348 | } | 
|  | 349 | /* | 
|  | 350 | we need to adapt the code generation for the fact that rp vars are now | 
|  | 351 | long; also the column parity calculation needs to be changed. | 
|  | 352 | we'll bring rp4 to 15 back to single byte entities by shifting and | 
|  | 353 | xoring | 
|  | 354 | */ | 
|  | 355 | rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; | 
|  | 356 | rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; | 
|  | 357 | rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; | 
|  | 358 | rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; | 
|  | 359 | rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; | 
|  | 360 | rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; | 
|  | 361 | rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; | 
|  | 362 | rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; | 
|  | 363 | rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; | 
|  | 364 | rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; | 
|  | 365 | rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; | 
|  | 366 | rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; | 
|  | 367 | rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; | 
|  | 368 | rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; | 
|  | 369 | par ^= (par >> 16); | 
|  | 370 | rp1 = (par >> 8); rp1 &= 0xff; | 
|  | 371 | rp0 = (par & 0xff); | 
|  | 372 | par ^= (par >> 8); par &= 0xff; | 
|  | 373 |  | 
|  | 374 | code[0] = | 
|  | 375 | (parity[rp7] << 7) | | 
|  | 376 | (parity[rp6] << 6) | | 
|  | 377 | (parity[rp5] << 5) | | 
|  | 378 | (parity[rp4] << 4) | | 
|  | 379 | (parity[rp3] << 3) | | 
|  | 380 | (parity[rp2] << 2) | | 
|  | 381 | (parity[rp1] << 1) | | 
|  | 382 | (parity[rp0]); | 
|  | 383 | code[1] = | 
|  | 384 | (parity[rp15] << 7) | | 
|  | 385 | (parity[rp14] << 6) | | 
|  | 386 | (parity[rp13] << 5) | | 
|  | 387 | (parity[rp12] << 4) | | 
|  | 388 | (parity[rp11] << 3) | | 
|  | 389 | (parity[rp10] << 2) | | 
|  | 390 | (parity[rp9]  << 1) | | 
|  | 391 | (parity[rp8]); | 
|  | 392 | code[2] = | 
|  | 393 | (parity[par & 0xf0] << 7) | | 
|  | 394 | (parity[par & 0x0f] << 6) | | 
|  | 395 | (parity[par & 0xcc] << 5) | | 
|  | 396 | (parity[par & 0x33] << 4) | | 
|  | 397 | (parity[par & 0xaa] << 3) | | 
|  | 398 | (parity[par & 0x55] << 2); | 
|  | 399 | code[0] = ~code[0]; | 
|  | 400 | code[1] = ~code[1]; | 
|  | 401 | code[2] = ~code[2]; | 
|  | 402 | } | 
|  | 403 |  | 
|  | 404 | The parity array is not shown any more. Note also that for these | 
|  | 405 | examples I kinda deviated from my regular programming style by allowing | 
|  | 406 | multiple statements on a line, not using { } in then and else blocks | 
|  | 407 | with only a single statement and by using operators like ^= | 
|  | 408 |  | 
|  | 409 |  | 
|  | 410 | Analysis 2 | 
|  | 411 | ========== | 
|  | 412 |  | 
|  | 413 | The code (of course) works, and hurray: we are a little bit faster than | 
|  | 414 | the linux driver code (about 15%). But wait, don't cheer too quickly. | 
|  | 415 | There is more to be gained. | 
|  | 416 | If we look at e.g. rp14 and rp15 we see that we either xor our data with | 
|  | 417 | rp14 or with rp15. However we also have par which goes over all data. | 
|  | 418 | This means there is no need to calculate rp14 as it can be calculated from | 
|  | 419 | rp15 through rp14 = par ^ rp15, because par = rp14 ^ rp15; | 
|  | 420 | (or if desired we can avoid calculating rp15 and calculate it from | 
|  | 421 | rp14).  That is why some places refer to inverse parity. | 
|  | 422 | Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. | 
|  | 423 | Effectively this means we can eliminate the else clause from the if | 
|  | 424 | statements. Also we can optimise the calculation in the end a little bit | 
|  | 425 | by going from long to byte first. Actually we can even avoid the table | 
|  | 426 | lookups | 
|  | 427 |  | 
|  | 428 | Attempt 3 | 
|  | 429 | ========= | 
|  | 430 |  | 
|  | 431 | Odd replaced:: | 
|  | 432 |  | 
|  | 433 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | 
|  | 434 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | 
|  | 435 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | 
|  | 436 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | 
|  | 437 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | 
|  | 438 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | 
|  | 439 |  | 
|  | 440 | with:: | 
|  | 441 |  | 
|  | 442 | if (i & 0x01) rp5 ^= cur; | 
|  | 443 | if (i & 0x02) rp7 ^= cur; | 
|  | 444 | if (i & 0x04) rp9 ^= cur; | 
|  | 445 | if (i & 0x08) rp11 ^= cur; | 
|  | 446 | if (i & 0x10) rp13 ^= cur; | 
|  | 447 | if (i & 0x20) rp15 ^= cur; | 
|  | 448 |  | 
|  | 449 | and outside the loop added:: | 
|  | 450 |  | 
|  | 451 | rp4  = par ^ rp5; | 
|  | 452 | rp6  = par ^ rp7; | 
|  | 453 | rp8  = par ^ rp9; | 
|  | 454 | rp10  = par ^ rp11; | 
|  | 455 | rp12  = par ^ rp13; | 
|  | 456 | rp14  = par ^ rp15; | 
|  | 457 |  | 
|  | 458 | And after that the code takes about 30% more time, although the number of | 
|  | 459 | statements is reduced. This is also reflected in the assembly code. | 
|  | 460 |  | 
|  | 461 |  | 
|  | 462 | Analysis 3 | 
|  | 463 | ========== | 
|  | 464 |  | 
|  | 465 | Very weird. Guess it has to do with caching or instruction parallellism | 
|  | 466 | or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting | 
|  | 467 | observation was that this one is only 30% slower (according to time) | 
|  | 468 | executing the code as my 3Ghz D920 processor. | 
|  | 469 |  | 
|  | 470 | Well, it was expected not to be easy so maybe instead move to a | 
|  | 471 | different track: let's move back to the code from attempt2 and do some | 
|  | 472 | loop unrolling. This will eliminate a few if statements. I'll try | 
|  | 473 | different amounts of unrolling to see what works best. | 
|  | 474 |  | 
|  | 475 |  | 
|  | 476 | Attempt 4 | 
|  | 477 | ========= | 
|  | 478 |  | 
|  | 479 | Unrolled the loop 1, 2, 3 and 4 times. | 
|  | 480 | For 4 the code starts with:: | 
|  | 481 |  | 
|  | 482 | for (i = 0; i < 4; i++) | 
|  | 483 | { | 
|  | 484 | cur = *bp++; | 
|  | 485 | par ^= cur; | 
|  | 486 | rp4 ^= cur; | 
|  | 487 | rp6 ^= cur; | 
|  | 488 | rp8 ^= cur; | 
|  | 489 | rp10 ^= cur; | 
|  | 490 | if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; | 
|  | 491 | if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; | 
|  | 492 | cur = *bp++; | 
|  | 493 | par ^= cur; | 
|  | 494 | rp5 ^= cur; | 
|  | 495 | rp6 ^= cur; | 
|  | 496 | ... | 
|  | 497 |  | 
|  | 498 |  | 
|  | 499 | Analysis 4 | 
|  | 500 | ========== | 
|  | 501 |  | 
|  | 502 | Unrolling once gains about 15% | 
|  | 503 |  | 
|  | 504 | Unrolling twice keeps the gain at about 15% | 
|  | 505 |  | 
|  | 506 | Unrolling three times gives a gain of 30% compared to attempt 2. | 
|  | 507 |  | 
|  | 508 | Unrolling four times gives a marginal improvement compared to unrolling | 
|  | 509 | three times. | 
|  | 510 |  | 
|  | 511 | I decided to proceed with a four time unrolled loop anyway. It was my gut | 
|  | 512 | feeling that in the next steps I would obtain additional gain from it. | 
|  | 513 |  | 
|  | 514 | The next step was triggered by the fact that par contains the xor of all | 
|  | 515 | bytes and rp4 and rp5 each contain the xor of half of the bytes. | 
|  | 516 | So in effect par = rp4 ^ rp5. But as xor is commutative we can also say | 
|  | 517 | that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can | 
|  | 518 | eliminate rp5 (or rp4, but I already foresaw another optimisation). | 
|  | 519 | The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. | 
|  | 520 |  | 
|  | 521 |  | 
|  | 522 | Attempt 5 | 
|  | 523 | ========= | 
|  | 524 |  | 
|  | 525 | Effectively so all odd digit rp assignments in the loop were removed. | 
|  | 526 | This included the else clause of the if statements. | 
|  | 527 | Of course after the loop we need to correct things by adding code like:: | 
|  | 528 |  | 
|  | 529 | rp5 = par ^ rp4; | 
|  | 530 |  | 
|  | 531 | Also the initial assignments (rp5 = 0; etc) could be removed. | 
|  | 532 | Along the line I also removed the initialisation of rp0/1/2/3. | 
|  | 533 |  | 
|  | 534 |  | 
|  | 535 | Analysis 5 | 
|  | 536 | ========== | 
|  | 537 |  | 
|  | 538 | Measurements showed this was a good move. The run-time roughly halved | 
|  | 539 | compared with attempt 4 with 4 times unrolled, and we only require 1/3rd | 
|  | 540 | of the processor time compared to the current code in the linux kernel. | 
|  | 541 |  | 
|  | 542 | However, still I thought there was more. I didn't like all the if | 
|  | 543 | statements. Why not keep a running parity and only keep the last if | 
|  | 544 | statement. Time for yet another version! | 
|  | 545 |  | 
|  | 546 |  | 
|  | 547 | Attempt 6 | 
|  | 548 | ========= | 
|  | 549 |  | 
|  | 550 | THe code within the for loop was changed to:: | 
|  | 551 |  | 
|  | 552 | for (i = 0; i < 4; i++) | 
|  | 553 | { | 
|  | 554 | cur = *bp++; tmppar  = cur; rp4 ^= cur; | 
|  | 555 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | 
|  | 556 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 557 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | 
|  | 558 |  | 
|  | 559 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | 
|  | 560 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | 
|  | 561 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 562 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | 
|  | 563 |  | 
|  | 564 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; | 
|  | 565 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; | 
|  | 566 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; | 
|  | 567 | cur = *bp++; tmppar ^= cur; rp8 ^= cur; | 
|  | 568 |  | 
|  | 569 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | 
|  | 570 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | 
|  | 571 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 572 | cur = *bp++; tmppar ^= cur; | 
|  | 573 |  | 
|  | 574 | par ^= tmppar; | 
|  | 575 | if ((i & 0x1) == 0) rp12 ^= tmppar; | 
|  | 576 | if ((i & 0x2) == 0) rp14 ^= tmppar; | 
|  | 577 | } | 
|  | 578 |  | 
|  | 579 | As you can see tmppar is used to accumulate the parity within a for | 
|  | 580 | iteration. In the last 3 statements is added to par and, if needed, | 
|  | 581 | to rp12 and rp14. | 
|  | 582 |  | 
|  | 583 | While making the changes I also found that I could exploit that tmppar | 
|  | 584 | contains the running parity for this iteration. So instead of having: | 
|  | 585 | rp4 ^= cur; rp6 ^= cur; | 
|  | 586 | I removed the rp6 ^= cur; statement and did rp6 ^= tmppar; on next | 
|  | 587 | statement. A similar change was done for rp8 and rp10 | 
|  | 588 |  | 
|  | 589 |  | 
|  | 590 | Analysis 6 | 
|  | 591 | ========== | 
|  | 592 |  | 
|  | 593 | Measuring this code again showed big gain. When executing the original | 
|  | 594 | linux code 1 million times, this took about 1 second on my system. | 
|  | 595 | (using time to measure the performance). After this iteration I was back | 
|  | 596 | to 0.075 sec. Actually I had to decide to start measuring over 10 | 
|  | 597 | million iterations in order not to lose too much accuracy. This one | 
|  | 598 | definitely seemed to be the jackpot! | 
|  | 599 |  | 
|  | 600 | There is a little bit more room for improvement though. There are three | 
|  | 601 | places with statements:: | 
|  | 602 |  | 
|  | 603 | rp4 ^= cur; rp6 ^= cur; | 
|  | 604 |  | 
|  | 605 | It seems more efficient to also maintain a variable rp4_6 in the while | 
|  | 606 | loop; This eliminates 3 statements per loop. Of course after the loop we | 
|  | 607 | need to correct by adding:: | 
|  | 608 |  | 
|  | 609 | rp4 ^= rp4_6; | 
|  | 610 | rp6 ^= rp4_6 | 
|  | 611 |  | 
|  | 612 | Furthermore there are 4 sequential assignments to rp8. This can be | 
|  | 613 | encoded slightly more efficiently by saving tmppar before those 4 lines | 
|  | 614 | and later do rp8 = rp8 ^ tmppar ^ notrp8; | 
|  | 615 | (where notrp8 is the value of rp8 before those 4 lines). | 
|  | 616 | Again a use of the commutative property of xor. | 
|  | 617 | Time for a new test! | 
|  | 618 |  | 
|  | 619 |  | 
|  | 620 | Attempt 7 | 
|  | 621 | ========= | 
|  | 622 |  | 
|  | 623 | The new code now looks like:: | 
|  | 624 |  | 
|  | 625 | for (i = 0; i < 4; i++) | 
|  | 626 | { | 
|  | 627 | cur = *bp++; tmppar  = cur; rp4 ^= cur; | 
|  | 628 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | 
|  | 629 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 630 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | 
|  | 631 |  | 
|  | 632 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | 
|  | 633 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | 
|  | 634 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 635 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | 
|  | 636 |  | 
|  | 637 | notrp8 = tmppar; | 
|  | 638 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | 
|  | 639 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | 
|  | 640 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 641 | cur = *bp++; tmppar ^= cur; | 
|  | 642 | rp8 = rp8 ^ tmppar ^ notrp8; | 
|  | 643 |  | 
|  | 644 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | 
|  | 645 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | 
|  | 646 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | 
|  | 647 | cur = *bp++; tmppar ^= cur; | 
|  | 648 |  | 
|  | 649 | par ^= tmppar; | 
|  | 650 | if ((i & 0x1) == 0) rp12 ^= tmppar; | 
|  | 651 | if ((i & 0x2) == 0) rp14 ^= tmppar; | 
|  | 652 | } | 
|  | 653 | rp4 ^= rp4_6; | 
|  | 654 | rp6 ^= rp4_6; | 
|  | 655 |  | 
|  | 656 |  | 
|  | 657 | Not a big change, but every penny counts :-) | 
|  | 658 |  | 
|  | 659 |  | 
|  | 660 | Analysis 7 | 
|  | 661 | ========== | 
|  | 662 |  | 
|  | 663 | Actually this made things worse. Not very much, but I don't want to move | 
|  | 664 | into the wrong direction. Maybe something to investigate later. Could | 
|  | 665 | have to do with caching again. | 
|  | 666 |  | 
|  | 667 | Guess that is what there is to win within the loop. Maybe unrolling one | 
|  | 668 | more time will help. I'll keep the optimisations from 7 for now. | 
|  | 669 |  | 
|  | 670 |  | 
|  | 671 | Attempt 8 | 
|  | 672 | ========= | 
|  | 673 |  | 
|  | 674 | Unrolled the loop one more time. | 
|  | 675 |  | 
|  | 676 |  | 
|  | 677 | Analysis 8 | 
|  | 678 | ========== | 
|  | 679 |  | 
|  | 680 | This makes things worse. Let's stick with attempt 6 and continue from there. | 
|  | 681 | Although it seems that the code within the loop cannot be optimised | 
|  | 682 | further there is still room to optimize the generation of the ecc codes. | 
|  | 683 | We can simply calculate the total parity. If this is 0 then rp4 = rp5 | 
|  | 684 | etc. If the parity is 1, then rp4 = !rp5; | 
|  | 685 |  | 
|  | 686 | But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits | 
|  | 687 | in the result byte and then do something like:: | 
|  | 688 |  | 
|  | 689 | code[0] |= (code[0] << 1); | 
|  | 690 |  | 
|  | 691 | Lets test this. | 
|  | 692 |  | 
|  | 693 |  | 
|  | 694 | Attempt 9 | 
|  | 695 | ========= | 
|  | 696 |  | 
|  | 697 | Changed the code but again this slightly degrades performance. Tried all | 
|  | 698 | kind of other things, like having dedicated parity arrays to avoid the | 
|  | 699 | shift after parity[rp7] << 7; No gain. | 
|  | 700 | Change the lookup using the parity array by using shift operators (e.g. | 
|  | 701 | replace parity[rp7] << 7 with:: | 
|  | 702 |  | 
|  | 703 | rp7 ^= (rp7 << 4); | 
|  | 704 | rp7 ^= (rp7 << 2); | 
|  | 705 | rp7 ^= (rp7 << 1); | 
|  | 706 | rp7 &= 0x80; | 
|  | 707 |  | 
|  | 708 | No gain. | 
|  | 709 |  | 
|  | 710 | The only marginal change was inverting the parity bits, so we can remove | 
|  | 711 | the last three invert statements. | 
|  | 712 |  | 
|  | 713 | Ah well, pity this does not deliver more. Then again 10 million | 
|  | 714 | iterations using the linux driver code takes between 13 and 13.5 | 
|  | 715 | seconds, whereas my code now takes about 0.73 seconds for those 10 | 
|  | 716 | million iterations. So basically I've improved the performance by a | 
|  | 717 | factor 18 on my system. Not that bad. Of course on different hardware | 
|  | 718 | you will get different results. No warranties! | 
|  | 719 |  | 
|  | 720 | But of course there is no such thing as a free lunch. The codesize almost | 
|  | 721 | tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. | 
|  | 722 |  | 
|  | 723 |  | 
|  | 724 | Correcting errors | 
|  | 725 | ================= | 
|  | 726 |  | 
|  | 727 | For correcting errors I again used the ST application note as a starter, | 
|  | 728 | but I also peeked at the existing code. | 
|  | 729 |  | 
|  | 730 | The algorithm itself is pretty straightforward. Just xor the given and | 
|  | 731 | the calculated ecc. If all bytes are 0 there is no problem. If 11 bits | 
|  | 732 | are 1 we have one correctable bit error. If there is 1 bit 1, we have an | 
|  | 733 | error in the given ecc code. | 
|  | 734 |  | 
|  | 735 | It proved to be fastest to do some table lookups. Performance gain | 
|  | 736 | introduced by this is about a factor 2 on my system when a repair had to | 
|  | 737 | be done, and 1% or so if no repair had to be done. | 
|  | 738 |  | 
|  | 739 | Code size increased from 330 bytes to 686 bytes for this function. | 
|  | 740 | (gcc 4.2, -O3) | 
|  | 741 |  | 
|  | 742 |  | 
|  | 743 | Conclusion | 
|  | 744 | ========== | 
|  | 745 |  | 
|  | 746 | The gain when calculating the ecc is tremendous. Om my development hardware | 
|  | 747 | a speedup of a factor of 18 for ecc calculation was achieved. On a test on an | 
|  | 748 | embedded system with a MIPS core a factor 7 was obtained. | 
|  | 749 |  | 
|  | 750 | On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor | 
|  | 751 | 5 (big endian mode, gcc 4.1.2, -O3) | 
|  | 752 |  | 
|  | 753 | For correction not much gain could be obtained (as bitflips are rare). Then | 
|  | 754 | again there are also much less cycles spent there. | 
|  | 755 |  | 
|  | 756 | It seems there is not much more gain possible in this, at least when | 
|  | 757 | programmed in C. Of course it might be possible to squeeze something more | 
|  | 758 | out of it with an assembler program, but due to pipeline behaviour etc | 
|  | 759 | this is very tricky (at least for intel hw). | 
|  | 760 |  | 
|  | 761 | Author: Frans Meulenbroeks | 
|  | 762 |  | 
|  | 763 | Copyright (C) 2008 Koninklijke Philips Electronics NV. |