[RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance

public inbox for linux-mtd@lists.infradead.org
 help / color / mirror / Atom feed

* [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
@ 2008-07-29 17:58 Frans Meulenbroeks
  2008-07-29 20:04 ` Ricard Wanderlof
  2008-07-30  6:17 ` Artem Bityutskiy
  0 siblings, 2 replies; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-07-29 17:58 UTC (permalink / raw)
  To: linux-mtd

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="0-415199513-1217354338=:23881"

--0-415199513-1217354338=:23881
Content-Type: text/plain; charset=us-ascii

Dear all,

A resubmit of my patch, with all comments from Thomas addressed.

This patch improves the performance of the ecc generation code by a factor of 18 on an INTEL D920 CPU,
by a factor of 7 on MIPS and by a factor of 5 on ARM (NSLU2).

As my email client wraps lines at 79 chars I've added the patch as an attachement instead of inlining it (the line with the filenames created by diff exceeds 79 chars).

Please let me know if additional changes are needed.

Best regards, Frans.


      
--0-415199513-1217354338=:23881
Content-Type: text/x-patch; name="ecc.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="ecc.patch"

ZGlmZiAtdXJOIGxpbnV4LTIuNi4yNS4xMC9Eb2N1bWVudGF0aW9uL25hbmQv
ZWNjLnR4dCBsaW51eC0yLjYuMjUuMTAud29yay9Eb2N1bWVudGF0aW9uL25h
bmQvZWNjLnR4dAotLS0gbGludXgtMi42LjI1LjEwL0RvY3VtZW50YXRpb24v
bmFuZC9lY2MudHh0CTE5NzAtMDEtMDEgMDE6MDA6MDAuMDAwMDAwMDAwICsw
MTAwCisrKyBsaW51eC0yLjYuMjUuMTAud29yay9Eb2N1bWVudGF0aW9uL25h
bmQvZWNjLnR4dAkyMDA4LTA3LTI5IDE5OjI1OjE5LjAwMDAwMDAwMCArMDIw
MApAQCAtMCwwICsxLDcxNCBAQAorSW50cm9kdWN0aW9uCis9PT09PT09PT09
PT0KKworSGF2aW5nIGxvb2tlZCBhdCB0aGUgbGludXggbXRkL25hbmQgZHJp
dmVyIGFuZCBtb3JlIHNwZWNpZmljIGF0IG5hbmRfZWNjLmMgCitJIGZlbHQg
dGhlcmUgd2FzIHJvb20gZm9yIG9wdGltaXNhdGlvbi4gSSBiYXNoZWQgdGhl
IGNvZGUgZm9yIGEgZmV3IGhvdXJzCitwZXJmb3JtaW5nIHRyaWNrcyBsaWtl
IHRhYmxlIGxvb2t1cCByZW1vdmluZyBzdXBlcmZsdW91cyBjb2RlIGV0Yy4g
CitBZnRlciB0aGF0IHRoZSBzcGVlZCB3YXMgaW5jcmVhc2VkIGJ5IDM1LTQw
JS4gCitTdGlsbCBJIHdhcyBub3QgdG9vIGhhcHB5IGFzIEkgZmVsdCB0aGVy
ZSB3YXMgYWRkaXRpb25hbCByb29tIGZvciBpbXByb3ZlbWVudC4KKworQmFk
ISBJIHdhcyBob29rZWQuCitJIGRlY2lkZWQgdG8gYW5ub3RhdGUgbXkgc3Rl
cHMgaW4gdGhpcyBmaWxlLiBQZXJoYXBzIGl0IGlzIHVzZWZ1bCB0byBzb21l
b25lCitvciBzb21lb25lIGxlYXJucyBzb21ldGhpbmcgZnJvbSBpdC4KKwor
CitUaGUgcHJvYmxlbQorPT09PT09PT09PT0KKworTkFORCBmbGFzaCAoYXQg
bGVhc3QgU0xDIG9uZSkgdHlwaWNhbGx5IGhhcyBzZWN0b3JzIG9mIDI1NiBi
eXRlcy4KK0hvd2V2ZXIgTkFORCBmbGFzaCBpcyBub3QgZXh0cmVtZWx5IHJl
bGlhYmxlIHNvIHNvbWUgZXJyb3IgZGV0ZWN0aW9uCisoYW5kIHNvbWV0aW1l
cyBjb3JyZWN0aW9uKSBpcyBuZWVkZWQuCisKK1RoaXMgaXMgZG9uZSBieSBt
ZWFucyBvZiBhIEhhbW1pbmcgY29kZS4gSSdsbCB0cnkgdG8gZXhwbGFpbiBp
dCBpbgorbGF5bWFucyB0ZXJtcyAoYW5kIGFwb2xvZ2llcyB0byBhbGwgdGhl
IHBybydzIGluIHRoZSBmaWVsZCBpbiBjYXNlIEkgZG8KK25vdCB1c2UgdGhl
IHJpZ2h0IHRlcm1pbm9sb2d5LCBteSBjb2RpbmcgdGhlb3J5IGNsYXNzIHdh
cyBhbG1vc3QgMzAKK3llYXJzIGFnbywgYW5kIEkgbXVzdCBhZG1pdCBpdCB3
YXMgbm90IG9uZSBvZiBteSBmYXZvdXJpdGVzKS4KKworQXMgSSBzYWlkIGJl
Zm9yZSB0aGUgZWNjIGNhbGN1bGF0aW9uIGlzIHBlcmZvcm1lZCBvbiBzZWN0
b3JzIG9mIDI1NgorYnl0ZXMuIFRoaXMgaXMgZG9uZSBieSBjYWxjdWxhdGlu
ZyBzZXZlcmFsIHBhcml0eSBiaXRzIG92ZXIgdGhlIHJvd3MgYW5kCitjb2x1
bW5zLiBUaGUgcGFyaXR5IHVzZWQgaXMgZXZlbiBwYXJpdHkgd2hpY2ggbWVh
bnMgdGhhdCB0aGUgcGFyaXR5IGJpdCA9IDEKK2lmIHRoZSBkYXRhIG92ZXIg
d2hpY2ggdGhlIHBhcml0eSBpcyBjYWxjdWxhdGVkIGlzIDEgYW5kIHRoZSBw
YXJpdHkgYml0ID0gMAoraWYgdGhlIGRhdGEgb3ZlciB3aGljaCB0aGUgcGFy
aXR5IGlzIGNhbGN1bGF0ZWQgaXMgMC4gU28gdGhlIHRvdGFsCitudW1iZXIg
b2YgYml0cyBvdmVyIHRoZSBkYXRhIG92ZXIgd2hpY2ggdGhlIHBhcml0eSBp
cyBjYWxjdWxhdGVkICsgdGhlCitwYXJpdHkgYml0IGlzIGV2ZW4uIChzZWUg
d2lraXBlZGlhIGlmIHlvdSBjYW4ndCBmb2xsb3cgdGhpcykuCitQYXJpdHkg
aXMgb2Z0ZW4gY2FsY3VsYXRlZCBieSBtZWFucyBvZiBhbiBleGNsdXNpdmUg
b3Igb3BlcmF0aW9uLAorc29tZXRpbWVzIGFsc28gcmVmZXJyZWQgdG8gYXMg
eG9yLiBJbiBDIHRoZSBvcGVyYXRvciBmb3IgeG9yIGlzIF4KKworQmFjayB0
byBlY2MuCitMZXQncyBnaXZlIGEgc21hbGwgZmlndXJlOgorCitieXRlICAg
MDogIGJpdDcgYml0NiBiaXQ1IGJpdDQgYml0MyBiaXQyIGJpdDEgYml0MCAg
IHJwMCBycDIgcnA0IC4uLiBycDE0CitieXRlICAgMTogIGJpdDcgYml0NiBi
aXQ1IGJpdDQgYml0MyBiaXQyIGJpdDEgYml0MCAgIHJwMSBycDIgcnA0IC4u
LiBycDE0CitieXRlICAgMjogIGJpdDcgYml0NiBiaXQ1IGJpdDQgYml0MyBi
aXQyIGJpdDEgYml0MCAgIHJwMCBycDMgcnA0IC4uLiBycDE0CitieXRlICAg
MzogIGJpdDcgYml0NiBiaXQ1IGJpdDQgYml0MyBiaXQyIGJpdDEgYml0MCAg
IHJwMSBycDMgcnA0IC4uLiBycDE0CitieXRlICAgNDogIGJpdDcgYml0NiBi
aXQ1IGJpdDQgYml0MyBiaXQyIGJpdDEgYml0MCAgIHJwMCBycDIgcnA1IC4u
LiBycDE0CisuLi4uCitieXRlIDI1NDogIGJpdDcgYml0NiBiaXQ1IGJpdDQg
Yml0MyBiaXQyIGJpdDEgYml0MCAgIHJwMCBycDMgcnA1IC4uLiBycDE1Citi
eXRlIDI1NTogIGJpdDcgYml0NiBiaXQ1IGJpdDQgYml0MyBiaXQyIGJpdDEg
Yml0MCAgIHJwMSBycDMgcnA1IC4uLiBycDE1CisgICAgICAgICAgIGNwMSAg
Y3AwICBjcDEgIGNwMCAgY3AxICBjcDAgIGNwMSAgY3AwCisgICAgICAgICAg
IGNwMyAgY3AzICBjcDIgIGNwMiAgY3AzICBjcDMgIGNwMiAgY3AyCisgICAg
ICAgICAgIGNwNSAgY3A1ICBjcDUgIGNwNSAgY3A0ICBjcDQgIGNwNCAgY3A0
CisKK1RoaXMgZmlndXJlIHJlcHJlc2VudHMgYSBzZWN0b3Igb2YgMjU2IGJ5
dGVzLgorY3AgaXMgbXkgYWJicmV2aWF0b24gZm9yIGNvbHVtbiBwYXJpdHks
IHJwIGZvciByb3cgcGFyaXR5LgorCitMZXQncyBzdGFydCB0byBleHBsYWlu
IGNvbHVtbiBwYXJpdHkuCitjcDAgaXMgdGhlIHBhcml0eSB0aGF0IGJlbG9u
Z3MgdG8gYWxsIGJpdDAsIGJpdDIsIGJpdDQsIGJpdDYuCitzbyB0aGUgc3Vt
IG9mIGFsbCBiaXQwLCBiaXQyLCBiaXQ0IGFuZCBiaXQ2IHZhbHVlcyArIGNw
MCBpdHNlbGYgaXMgZXZlbi4KK1NpbWlsYXJseSBjcDEgaXMgdGhlIHN1bSBv
ZiBhbGwgYml0MSwgYml0MywgYml0NSBhbmQgYml0Ny4KK2NwMiBpcyB0aGUg
cGFyaXR5IG92ZXIgYml0MCwgYml0MSwgYml0NCBhbmQgYml0NQorY3AzIGlz
IHRoZSBwYXJpdHkgb3ZlciBiaXQyLCBiaXQzLCBiaXQ2IGFuZCBiaXQ3Lgor
Y3A0IGlzIHRoZSBwYXJpdHkgb3ZlciBiaXQwLCBiaXQxLCBiaXQyIGFuZCBi
aXQzLgorY3A1IGlzIHRoZSBwYXJpdHkgb3ZlciBiaXQ0LCBiaXQ1LCBiaXQ2
IGFuZCBiaXQ3LgorTm90ZSB0aGF0IGVhY2ggb2YgY3AwIC4uIGNwNSBpcyBl
eGFjdGx5IG9uZSBiaXQuCisKK1JvdyBwYXJpdHkgYWN0dWFsbHkgd29ya3Mg
YWxtb3N0IHRoZSBzYW1lLgorcnAwIGlzIHRoZSBwYXJpdHkgb2YgYWxsIGV2
ZW4gYnl0ZXMgKDAsIDIsIDQsIDYsIC4uLiAyNTIsIDI1NCkKK3JwMSBpcyB0
aGUgcGFyaXR5IG9mIGFsbCBvZGQgYnl0ZXMgKDEsIDMsIDUsIDcsIC4uLiwg
MjUzLCAyNTUpCitycDIgaXMgdGhlIHBhcml0eSBvZiBhbGwgYnl0ZXMgMCwg
MSwgNCwgNSwgOCwgOSwgLi4uIAorKHNvIGhhbmRsZSB0d28gYnl0ZXMsIHRo
ZW4gc2tpcCAyIGJ5dGVzKS4KK3JwMyBpcyBjb3ZlcnMgdGhlIGhhbGYgcnAy
IGRvZXMgbm90IGNvdmVyIChieXRlcyAyLCAzLCA2LCA3LCAxMCwgMTEsIC4u
LikKK2ZvciBycDQgdGhlIHJ1bGUgaXMgY292ZXIgNCBieXRlcywgc2tpcCA0
IGJ5dGVzLCBjb3ZlciA0IGJ5dGVzLCBza2lwIDQgZXRjLgorc28gcnA0IGNh
bGN1bGF0ZXMgcGFyaXR5IG92ZXIgYnl0ZXMgMCwgMSwgMiwgMywgOCwgOSwg
MTAsIDExLCAxNiwgLi4uKQorYW5kIHJwNSBjb3ZlcnMgdGhlIG90aGVyIGhh
bGYsIHNvIGJ5dGVzIDQsIDUsIDYsIDcsIDEyLCAxMywgMTQsIDE1LCAyMCwg
Li4KK1RoZSBzdG9yeSBub3cgYmVjb21lcyBxdWl0ZSBib3JpbmcuIEkgZ3Vl
c3MgeW91IGdldCB0aGUgaWRlYS4KK3JwNiBjb3ZlcnMgOCBieXRlcyB0aGVu
IHNraXBzIDggZXRjCitycDcgc2tpcHMgOCBieXRlcyB0aGVuIGNvdmVycyA4
IGV0YworcnA4IGNvdmVycyAxNiBieXRlcyB0aGVuIHNraXBzIDE2IGV0Ywor
cnA5IHNraXBzIDE2IGJ5dGVzIHRoZW4gY292ZXJzIDE2IGV0YworcnAxMCBj
b3ZlcnMgMzIgYnl0ZXMgdGhlbiBza2lwcyAzMiBldGMKK3JwMTEgc2tpcHMg
MzIgYnl0ZXMgdGhlbiBjb3ZlcnMgMzIgZXRjCitycDEyIGNvdmVycyA2NCBi
eXRlcyB0aGVuIHNraXBzIDY0IGV0YworcnAxMyBza2lwcyA2NCBieXRlcyB0
aGVuIGNvdmVycyA2NCBldGMKK3JwMTQgY292ZXJzIDEyOCBieXRlcyB0aGVu
IHNraXBzIDEyOAorcnAxNSBza2lwcyAxMjggYnl0ZXMgdGhlbiBjb3ZlcnMg
MTI4IAorCitJbiB0aGUgZW5kIHRoZSBwYXJpdHkgYml0cyBhcmUgZ3JvdXBl
ZCB0b2dldGhlciBpbiB0aHJlZSBieXRlcyBhcworZm9sbG93czoKK0VDQyAg
ICBCaXQgNyBCaXQgNiBCaXQgNSBCaXQgNCBCaXQgMyBCaXQgMiBCaXQgMSBC
aXQgMAorRUNDIDAgICBycDA3ICBycDA2ICBycDA1ICBycDA0ICBycDAzICBy
cDAyICBycDAxICBycDAwCitFQ0MgMSAgIHJwMTUgIHJwMTQgIHJwMTMgIHJw
MTIgIHJwMTEgIHJwMTAgIHJwMDkgIHJwMDgKK0VDQyAyICAgY3A1ICAgY3A0
ICAgY3AzICAgY3AyICAgY3AxICAgY3AwICAgICAgMSAgICAgMQorCitJIGRl
dGVjdGVkIGFmdGVyIHdyaXRpbmcgdGhpcyB0aGF0IFNUIGFwcGxpY2F0aW9u
IG5vdGUgQU4xODIzCisoaHR0cDovL3d3dy5zdC5jb20vc3RvbmxpbmUvYm9v
a3MvcGRmL2RvY3MvMTAxMjMucGRmKSBnaXZlcyBhIG11Y2gKK25pY2VyIHBp
Y3R1cmUuKGJ1dCB0aGV5IHVzZSBsaW5lIHBhcml0eSBhcyB0ZXJtIHdoZXJl
IEkgdXNlIHJvdyBwYXJpdHkpCitPaCB3ZWxsLCBJJ20gZ3JhcGhpY2FsbHkg
Y2hhbGxlbmdlZCwgc28gc3VmZmVyIHdpdGggbWUgZm9yIGEgbW9tZW50IDot
KQorQW5kIEkgY291bGQgbm90IHJldXNlIHRoZSBTVCBwaWN0dXJlIGFueXdh
eSBmb3IgY29weXJpZ2h0IHJlYXNvbnMuCisKKworQXR0ZW1wdCAwCis9PT09
PT09PT0KKworSW1wbGVtZW50aW5nIHRoZSBwYXJpdHkgY2FsY3VsYXRpb24g
aXMgcHJldHR5IHNpbXBsZS4KK0luIEMgcHNldWRvY29kZToKK2ZvciAoaSA9
IDA7IGkgPCAyNTY7IGkrKykKK3sKKyAgICBpZiAoaSAmIDB4MDEpCisgICAg
ICAgcnAxID0gYml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBi
aXQyIF4gYml0MSBeIGJpdDAgXiBycDE7CisgICAgZWxzZQorICAgICAgIHJw
MCA9IGJpdDcgXiBiaXQ2IF4gYml0NSBeIGJpdDQgXiBiaXQzIF4gYml0MiBe
IGJpdDEgXiBiaXQwIF4gcnAxOworICAgIGlmIChpICYgMHgwMikKKyAgICAg
ICBycDMgPSBiaXQ3IF4gYml0NiBeIGJpdDUgXiBiaXQ0IF4gYml0MyBeIGJp
dDIgXiBiaXQxIF4gYml0MCBeIHJwMzsKKyAgICBlbHNlCisgICAgICAgcnAy
ID0gYml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4g
Yml0MSBeIGJpdDAgXiBycDI7CisgICAgaWYgKGkgJiAweDA0KQorICAgICAg
cnA1ID0gYml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQy
IF4gYml0MSBeIGJpdDAgXiBycDU7CisgICAgZWxzZQorICAgICAgcnA0ID0g
Yml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0
MSBeIGJpdDAgXiBycDQ7CisgICAgaWYgKGkgJiAweDA4KQorICAgICAgcnA3
ID0gYml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4g
Yml0MSBeIGJpdDAgXiBycDc7CisgICAgZWxzZQorICAgICAgcnA2ID0gYml0
NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0MSBe
IGJpdDAgXiBycDY7CisgICAgaWYgKGkgJiAweDEwKQorICAgICAgcnA5ID0g
Yml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0
MSBeIGJpdDAgXiBycDk7CisgICAgZWxzZQorICAgICAgcnA4ID0gYml0NyBe
IGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0MSBeIGJp
dDAgXiBycDg7CisgICAgaWYgKGkgJiAweDIwKQorICAgICAgcnAxMSA9IGJp
dDcgXiBiaXQ2IF4gYml0NSBeIGJpdDQgXiBiaXQzIF4gYml0MiBeIGJpdDEg
XiBiaXQwIF4gcnAxMTsKKyAgICBlbHNlCisgICAgcnAxMCA9IGJpdDcgXiBi
aXQ2IF4gYml0NSBeIGJpdDQgXiBiaXQzIF4gYml0MiBeIGJpdDEgXiBiaXQw
IF4gcnAxMDsKKyAgICBpZiAoaSAmIDB4NDApCisgICAgICBycDEzID0gYml0
NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0MSBe
IGJpdDAgXiBycDEzOworICAgIGVsc2UKKyAgICAgIHJwMTIgPSBiaXQ3IF4g
Yml0NiBeIGJpdDUgXiBiaXQ0IF4gYml0MyBeIGJpdDIgXiBiaXQxIF4gYml0
MCBeIHJwMTI7CisgICAgaWYgKGkgJiAweDgwKQorICAgICAgcnAxNSA9IGJp
dDcgXiBiaXQ2IF4gYml0NSBeIGJpdDQgXiBiaXQzIF4gYml0MiBeIGJpdDEg
XiBiaXQwIF4gcnAxNTsKKyAgICBlbHNlCisgICAgICBycDE0ID0gYml0NyBe
IGJpdDYgXiBiaXQ1IF4gYml0NCBeIGJpdDMgXiBiaXQyIF4gYml0MSBeIGJp
dDAgXiBycDE0OworICAgIGNwMCA9IGJpdDYgXiBiaXQ0IF4gYml0MiBeIGJp
dDAgXiBjcDA7CisgICAgY3AxID0gYml0NyBeIGJpdDUgXiBiaXQzIF4gYml0
MSBeIGNwMTsKKyAgICBjcDIgPSBiaXQ1IF4gYml0NCBeIGJpdDEgXiBiaXQw
IF4gY3AyOworICAgIGNwMyA9IGJpdDcgXiBiaXQ2IF4gYml0MyBeIGJpdDIg
XiBjcDMKKyAgICBjcDQgPSBiaXQzIF4gYml0MiBeIGJpdDEgXiBiaXQwIF4g
Y3A0CisgICAgY3A1ID0gYml0NyBeIGJpdDYgXiBiaXQ1IF4gYml0NCBeIGNw
NQorfQorCisKK0FuYWx5c2lzIDAKKz09PT09PT09PT0KKworQyBkb2VzIGhh
dmUgYml0d2lzZSBvcGVyYXRvcnMgYnV0IG5vdCByZWFsbHkgb3BlcmF0b3Jz
IHRvIGRvIHRoZSBhYm92ZQorZWZmaWNpZW50bHkgKGFuZCBtb3N0IGhhcmR3
YXJlIGhhcyBubyBzdWNoIGluc3RydWN0aW9ucyBlaXRoZXIpLgorVGhlcmVm
b3JlIHdpdGhvdXQgaW1wbGVtZW50aW5nIHRoaXMgaXQgd2FzIGNsZWFyIHRo
YXQgdGhlIGNvZGUgYWJvdmUgd2FzCitub3QgZ29pbmcgdG8gYnJpbmcgbWUg
YSBOb2JlbCBwcml6ZSA6LSkKKworRm9ydHVuYXRlbHkgdGhlIGV4Y2x1c2l2
ZSBvciBvcGVyYXRpb24gaXMgY29tbXV0YXRpdmUsIHNvIHdlIGNhbiBjb21i
aW5lCit0aGUgdmFsdWVzIGluIGFueSBvcmRlci4gU28gaW5zdGVhZCBvZiBj
YWxjdWxhdGluZyBhbGwgdGhlIGJpdHMKK2luZGl2aWR1YWxseSwgbGV0IHVz
IHRyeSB0byByZWFycmFuZ2UgdGhpbmdzLgorRm9yIHRoZSBjb2x1bW4gcGFy
aXR5IHRoaXMgaXMgZWFzeS4gV2UgY2FuIGp1c3QgeG9yIHRoZSBieXRlcyBh
bmQgaW4gdGhlCitlbmQgZmlsdGVyIG91dCB0aGUgcmVsZXZhbnQgYml0cy4g
VGhpcyBpcyBwcmV0dHkgbmljZSBhcyBpdCB3aWxsIGJyaW5nCithbGwgY3Ag
Y2FsY3VsYXRpb24gb3V0IG9mIHRoZSBpZiBsb29wLgorCitTaW1pbGFybHkg
d2UgY2FuIGZpcnN0IHhvciB0aGUgYnl0ZXMgZm9yIHRoZSB2YXJpb3VzIHJv
d3MuCitUaGlzIGxlYWRzIHRvOgorCisKK0F0dGVtcHQgMQorPT09PT09PT09
CisKK2NvbnN0IGNoYXIgcGFyaXR5WzI1Nl0gPSB7CisgICAgMCwgMSwgMSwg
MCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgCisgICAg
MSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwgMSwgMSwgMCwgMSwgMCwgMCwg
MSwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwgMSwgMSwgMCwg
MSwgMCwgMCwgMSwgCisgICAgMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwg
MCwgMCwgMSwgMCwgMSwgMSwgMCwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwg
MSwgMCwgMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgCisgICAgMCwgMSwgMSwg
MCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgCisgICAg
MCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwg
MCwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwgMSwgMSwgMCwg
MSwgMCwgMCwgMSwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwg
MSwgMSwgMCwgMSwgMCwgMCwgMSwgCisgICAgMCwgMSwgMSwgMCwgMSwgMCwg
MCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgCisgICAgMCwgMSwgMSwg
MCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgCisgICAg
MSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwgMSwgMSwgMCwgMSwgMCwgMCwg
MSwgCisgICAgMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwg
MCwgMSwgMSwgMCwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwgMCwg
MSwgMSwgMCwgMSwgMCwgMCwgMSwgCisgICAgMSwgMCwgMCwgMSwgMCwgMSwg
MSwgMCwgMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgCisgICAgMCwgMSwgMSwg
MCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMAorfTsKKwor
dm9pZCBlY2MxKGNvbnN0IHVuc2lnbmVkIGNoYXIgKmJ1ZiwgdW5zaWduZWQg
Y2hhciAqY29kZSkKK3sKKyAgICBpbnQgaTsKKyAgICBjb25zdCB1bnNpZ25l
ZCBjaGFyICpicCA9IGJ1ZjsKKyAgICB1bnNpZ25lZCBjaGFyIGN1cjsKKyAg
ICB1bnNpZ25lZCBjaGFyIHJwMCwgcnAxLCBycDIsIHJwMywgcnA0LCBycDUs
IHJwNiwgcnA3OworICAgIHVuc2lnbmVkIGNoYXIgcnA4LCBycDksIHJwMTAs
IHJwMTEsIHJwMTIsIHJwMTMsIHJwMTQsIHJwMTU7CisgICAgdW5zaWduZWQg
Y2hhciBwYXI7CisKKyAgICBwYXIgPSAwOworICAgIHJwMCA9IDA7IHJwMSA9
IDA7IHJwMiA9IDA7IHJwMyA9IDA7CisgICAgcnA0ID0gMDsgcnA1ID0gMDsg
cnA2ID0gMDsgcnA3ID0gMDsKKyAgICBycDggPSAwOyBycDkgPSAwOyBycDEw
ID0gMDsgcnAxMSA9IDA7CisgICAgcnAxMiA9IDA7IHJwMTMgPSAwOyBycDE0
ID0gMDsgcnAxNSA9IDA7CisKKyAgICBmb3IgKGkgPSAwOyBpIDwgMjU2OyBp
KyspCisgICAgeworICAgICAgICBjdXIgPSAqYnArKzsKKyAgICAgICAgcGFy
IF49IGN1cjsKKyAgICAgICAgaWYgKGkgJiAweDAxKSBycDEgXj0gY3VyOyBl
bHNlIHJwMCBePSBjdXI7CisgICAgICAgIGlmIChpICYgMHgwMikgcnAzIF49
IGN1cjsgZWxzZSBycDIgXj0gY3VyOworICAgICAgICBpZiAoaSAmIDB4MDQp
IHJwNSBePSBjdXI7IGVsc2UgcnA0IF49IGN1cjsKKyAgICAgICAgaWYgKGkg
JiAweDA4KSBycDcgXj0gY3VyOyBlbHNlIHJwNiBePSBjdXI7CisgICAgICAg
IGlmIChpICYgMHgxMCkgcnA5IF49IGN1cjsgZWxzZSBycDggXj0gY3VyOwor
ICAgICAgICBpZiAoaSAmIDB4MjApIHJwMTEgXj0gY3VyOyBlbHNlIHJwMTAg
Xj0gY3VyOworICAgICAgICBpZiAoaSAmIDB4NDApIHJwMTMgXj0gY3VyOyBl
bHNlIHJwMTIgXj0gY3VyOworICAgICAgICBpZiAoaSAmIDB4ODApIHJwMTUg
Xj0gY3VyOyBlbHNlIHJwMTQgXj0gY3VyOworICAgIH0KKyAgICBjb2RlWzBd
ID0KKyAgICAgICAgKHBhcml0eVtycDddIDw8IDcpIHwKKyAgICAgICAgKHBh
cml0eVtycDZdIDw8IDYpIHwKKyAgICAgICAgKHBhcml0eVtycDVdIDw8IDUp
IHwKKyAgICAgICAgKHBhcml0eVtycDRdIDw8IDQpIHwKKyAgICAgICAgKHBh
cml0eVtycDNdIDw8IDMpIHwKKyAgICAgICAgKHBhcml0eVtycDJdIDw8IDIp
IHwKKyAgICAgICAgKHBhcml0eVtycDFdIDw8IDEpIHwKKyAgICAgICAgKHBh
cml0eVtycDBdKTsKKyAgICBjb2RlWzFdID0KKyAgICAgICAgKHBhcml0eVty
cDE1XSA8PCA3KSB8CisgICAgICAgIChwYXJpdHlbcnAxNF0gPDwgNikgfAor
ICAgICAgICAocGFyaXR5W3JwMTNdIDw8IDUpIHwKKyAgICAgICAgKHBhcml0
eVtycDEyXSA8PCA0KSB8CisgICAgICAgIChwYXJpdHlbcnAxMV0gPDwgMykg
fAorICAgICAgICAocGFyaXR5W3JwMTBdIDw8IDIpIHwKKyAgICAgICAgKHBh
cml0eVtycDldICA8PCAxKSB8CisgICAgICAgIChwYXJpdHlbcnA4XSk7Cisg
ICAgY29kZVsyXSA9CisgICAgICAgIChwYXJpdHlbcGFyICYgMHhmMF0gPDwg
NykgfAorICAgICAgICAocGFyaXR5W3BhciAmIDB4MGZdIDw8IDYpIHwKKyAg
ICAgICAgKHBhcml0eVtwYXIgJiAweGNjXSA8PCA1KSB8CisgICAgICAgIChw
YXJpdHlbcGFyICYgMHgzM10gPDwgNCkgfAorICAgICAgICAocGFyaXR5W3Bh
ciAmIDB4YWFdIDw8IDMpIHwKKyAgICAgICAgKHBhcml0eVtwYXIgJiAweDU1
XSA8PCAyKTsKKyAgICBjb2RlWzBdID0gfmNvZGVbMF07CisgICAgY29kZVsx
XSA9IH5jb2RlWzFdOworICAgIGNvZGVbMl0gPSB+Y29kZVsyXTsKK30KKwor
U3RpbGwgcHJldHR5IHN0cmFpZ2h0Zm9yd2FyZC4gVGhlIGxhc3QgdGhyZWUg
aW52ZXJ0IHN0YXRlbWVudHMgYXJlIHRoZXJlIHRvIAorZ2l2ZSBhIGNoZWNr
c3VtIG9mIDB4ZmYgMHhmZiAweGZmIGZvciBhbiBlbXB0eSBmbGFzaC4gSW4g
YW4gZW1wdHkgZmxhc2gKK2FsbCBkYXRhIGlzIDB4ZmYsIHNvIHRoZSBjaGVj
a3N1bSB0aGVuIG1hdGNoZXMuCisKK0kgYWxzbyBpbnRyb2R1Y2VkIHRoZSBw
YXJpdHkgbG9va3VwLiBJIGV4cGVjdGVkIHRoaXMgdG8gYmUgdGhlIGZhc3Rl
c3QKK3dheSB0byBjYWxjdWxhdGUgdGhlIHBhcml0eSwgYnV0IEkgd2lsbCBp
bnZlc3RpZ2F0ZSBhbHRlcm5hdGl2ZXMgbGF0ZXIKK29uLgorCisKK0FuYWx5
c2lzIDEKKz09PT09PT09PT0KKworVGhlIGNvZGUgd29ya3MsIGJ1dCBpcyBu
b3QgdGVycmlibHkgZWZmaWNpZW50LiBPbiBteSBzeXN0ZW0gaXQgdG9vawor
YWxtb3N0IDQgdGltZXMgYXMgbXVjaCB0aW1lIGFzIHRoZSBsaW51eCBkcml2
ZXIgY29kZS4gQnV0IGhleSwgaWYgaXQgd2FzCisqdGhhdCogZWFzeSB0aGlz
IHdvdWxkIGhhdmUgYmVlbiBkb25lIGxvbmcgYmVmb3JlLgorTm8gcGFpbi4g
bm8gZ2Fpbi4KKworRm9ydHVuYXRlbHkgdGhlcmUgaXMgcGxlbnR5IG9mIHJv
b20gZm9yIGltcHJvdmVtZW50LgorCitJbiBzdGVwIDEgd2UgbW92ZWQgZnJv
bSBiaXQtd2lzZSBjYWxjdWxhdGlvbiB0byBieXRlLXdpc2UgY2FsY3VsYXRp
b24uCitIb3dldmVyIGluIEMgd2UgY2FuIGFsc28gdXNlIHRoZSB1bnNpZ25l
ZCBsb25nIGRhdGEgdHlwZSBhbmQgdmlydHVhbGx5CitldmVyeSBtb2Rlcm4g
bWljcm9wcm9jZXNzb3Igc3VwcG9ydHMgMzIgYml0IG9wZXJhdGlvbnMsIHNv
IHdoeSBub3QgdHJ5Cit0byB3cml0ZSBvdXIgY29kZSBpbiBzdWNoIGEgd2F5
IHRoYXQgd2UgcHJvY2VzcyBkYXRhIGluIDMyIGJpdCBjaHVua3MuCisKK09m
IGNvdXJzZSB0aGlzIG1lYW5zIHNvbWUgbW9kaWZpY2F0aW9uIGFzIHRoZSBy
b3cgcGFyaXR5IGlzIGJ5dGUgYnkKK2J5dGUuIEEgcXVpY2sgYW5hbHlzaXM6
Citmb3IgdGhlIGNvbHVtbiBwYXJpdHkgd2UgdXNlIHRoZSBwYXIgdmFyaWFi
bGUuIFdoZW4gZXh0ZW5kaW5nIHRvIDMyIGJpdHMgCit3ZSBjYW4gaW4gdGhl
IGVuZCBlYXNpbHkgY2FsY3VsYXRlIHAwIGFuZCBwMSBmcm9tIGl0LgorKGJl
Y2F1c2UgcGFyIG5vdyBjb25zaXN0cyBvZiA0IGJ5dGVzLCBjb250cmlidXRp
bmcgdG8gcnAxLCBycDAsIHJwMSwgcnAwCityZXNwZWN0aXZlbHkpCithbHNv
IHJwMiBhbmQgcnAzIGNhbiBiZSBlYXNpbHkgcmV0cmlldmVkIGZyb20gcGFy
IGFzIHJwMyBjb3ZlcnMgdGhlCitmaXJzdCB0d28gYnl0ZXMgYW5kIHJwMiB0
aGUgbGFzdCB0d28gYnl0ZXMuCisKK05vdGUgdGhhdCBvZiBjb3Vyc2Ugbm93
IHRoZSBsb29wIGlzIGV4ZWN1dGVkIG9ubHkgNjQgdGltZXMgKDI1Ni80KS4g
CitBbmQgbm90ZSB0aGF0IGNhcmUgbXVzdCB0YWtlbiB3cnQgYnl0ZSBvcmRl
cmluZy4gVGhlIHdheSBieXRlcyBhcmUKK29yZGVyZWQgaW4gYSBsb25nIGlz
IG1hY2hpbmUgZGVwZW5kZW50LCBhbmQgbWlnaHQgYWZmZWN0IHVzLiAKK0Fu
eXdheSwgaWYgdGhlcmUgaXMgYW4gaXNzdWU6IHRoaXMgY29kZSBpcyBkZXZl
bG9wZWQgb24geDg2ICh0byBiZQorcHJlY2lzZTogYSBERUxMIFBDIHdpdGgg
YSBEOTIwIEludGVsIENQVSkKKworQW5kIG9mIGNvdXJzZSB0aGUgcGVyZm9y
bWFuY2UgbWlnaHQgZGVwZW5kIG9uIGFsaWdubWVudCwgYnV0IEkgZXhwZWN0
Cit0aGF0IHRoZSBJL08gYnVmZmVycyBpbiB0aGUgbmFuZCBkcml2ZXIgYXJl
IGFsaWduZWQgcHJvcGVybHkgKGFuZAorb3RoZXJ3aXNlIHRoYXQgc2hvdWxk
IGJlIGZpeGVkIHRvIGdldCBtYXhpbXVtIHBlcmZvcm1hbmNlKS4KKworTGV0
J3MgZ2l2ZSBpdCBhIHRyeS4uLgorCisKK0F0dGVtcHQgMgorPT09PT09PT09
CisKK2V4dGVybiBjb25zdCBjaGFyIHBhcml0eVsyNTZdOworCit2b2lkIGVj
YzIoY29uc3QgdW5zaWduZWQgY2hhciAqYnVmLCB1bnNpZ25lZCBjaGFyICpj
b2RlKQoreworICAgIGludCBpOworICAgIGNvbnN0IHVuc2lnbmVkIGxvbmcg
KmJwID0gKHVuc2lnbmVkIGxvbmcgKilidWY7CisgICAgdW5zaWduZWQgbG9u
ZyBjdXI7CisgICAgdW5zaWduZWQgbG9uZyBycDAsIHJwMSwgcnAyLCBycDMs
IHJwNCwgcnA1LCBycDYsIHJwNzsKKyAgICB1bnNpZ25lZCBsb25nIHJwOCwg
cnA5LCBycDEwLCBycDExLCBycDEyLCBycDEzLCBycDE0LCBycDE1OworICAg
IHVuc2lnbmVkIGxvbmcgcGFyOworCisgICAgcGFyID0gMDsKKyAgICBycDAg
PSAwOyBycDEgPSAwOyBycDIgPSAwOyBycDMgPSAwOworICAgIHJwNCA9IDA7
IHJwNSA9IDA7IHJwNiA9IDA7IHJwNyA9IDA7CisgICAgcnA4ID0gMDsgcnA5
ID0gMDsgcnAxMCA9IDA7IHJwMTEgPSAwOworICAgIHJwMTIgPSAwOyBycDEz
ID0gMDsgcnAxNCA9IDA7IHJwMTUgPSAwOworCisgICAgZm9yIChpID0gMDsg
aSA8IDY0OyBpKyspCisgICAgeworICAgICAgICBjdXIgPSAqYnArKzsKKyAg
ICAgICAgcGFyIF49IGN1cjsKKyAgICAgICAgaWYgKGkgJiAweDAxKSBycDUg
Xj0gY3VyOyBlbHNlIHJwNCBePSBjdXI7CisgICAgICAgIGlmIChpICYgMHgw
MikgcnA3IF49IGN1cjsgZWxzZSBycDYgXj0gY3VyOworICAgICAgICBpZiAo
aSAmIDB4MDQpIHJwOSBePSBjdXI7IGVsc2UgcnA4IF49IGN1cjsKKyAgICAg
ICAgaWYgKGkgJiAweDA4KSBycDExIF49IGN1cjsgZWxzZSBycDEwIF49IGN1
cjsKKyAgICAgICAgaWYgKGkgJiAweDEwKSBycDEzIF49IGN1cjsgZWxzZSBy
cDEyIF49IGN1cjsKKyAgICAgICAgaWYgKGkgJiAweDIwKSBycDE1IF49IGN1
cjsgZWxzZSBycDE0IF49IGN1cjsKKyAgICB9CisgICAgLyoKKyAgICAgICB3
ZSBuZWVkIHRvIGFkYXB0IHRoZSBjb2RlIGdlbmVyYXRpb24gZm9yIHRoZSBm
YWN0IHRoYXQgcnAgdmFycyBhcmUgbm93CisgICAgICAgbG9uZzsgYWxzbyB0
aGUgY29sdW1uIHBhcml0eSBjYWxjdWxhdGlvbiBuZWVkcyB0byBiZSBjaGFu
Z2VkLgorICAgICAgIHdlJ2xsIGJyaW5nIHJwNCB0byAxNSBiYWNrIHRvIHNp
bmdsZSBieXRlIGVudGl0aWVzIGJ5IHNoaWZ0aW5nIGFuZAorICAgICAgIHhv
cmluZworICAgICovCisgICAgcnA0IF49IChycDQgPj4gMTYpOyBycDQgXj0g
KHJwNCA+PiA4KTsgcnA0ICY9IDB4ZmY7CisgICAgcnA1IF49IChycDUgPj4g
MTYpOyBycDUgXj0gKHJwNSA+PiA4KTsgcnA1ICY9IDB4ZmY7CisgICAgcnA2
IF49IChycDYgPj4gMTYpOyBycDYgXj0gKHJwNiA+PiA4KTsgcnA2ICY9IDB4
ZmY7CisgICAgcnA3IF49IChycDcgPj4gMTYpOyBycDcgXj0gKHJwNyA+PiA4
KTsgcnA3ICY9IDB4ZmY7CisgICAgcnA4IF49IChycDggPj4gMTYpOyBycDgg
Xj0gKHJwOCA+PiA4KTsgcnA4ICY9IDB4ZmY7CisgICAgcnA5IF49IChycDkg
Pj4gMTYpOyBycDkgXj0gKHJwOSA+PiA4KTsgcnA5ICY9IDB4ZmY7CisgICAg
cnAxMCBePSAocnAxMCA+PiAxNik7IHJwMTAgXj0gKHJwMTAgPj4gOCk7IHJw
MTAgJj0gMHhmZjsKKyAgICBycDExIF49IChycDExID4+IDE2KTsgcnAxMSBe
PSAocnAxMSA+PiA4KTsgcnAxMSAmPSAweGZmOworICAgIHJwMTIgXj0gKHJw
MTIgPj4gMTYpOyBycDEyIF49IChycDEyID4+IDgpOyBycDEyICY9IDB4ZmY7
CisgICAgcnAxMyBePSAocnAxMyA+PiAxNik7IHJwMTMgXj0gKHJwMTMgPj4g
OCk7IHJwMTMgJj0gMHhmZjsKKyAgICBycDE0IF49IChycDE0ID4+IDE2KTsg
cnAxNCBePSAocnAxNCA+PiA4KTsgcnAxNCAmPSAweGZmOworICAgIHJwMTUg
Xj0gKHJwMTUgPj4gMTYpOyBycDE1IF49IChycDE1ID4+IDgpOyBycDE1ICY9
IDB4ZmY7CisgICAgcnAzID0gKHBhciA+PiAxNik7IHJwMyBePSAocnAzID4+
IDgpOyBycDMgJj0gMHhmZjsKKyAgICBycDIgPSBwYXIgJiAweGZmZmY7IHJw
MiBePSAocnAyID4+IDgpOyBycDIgJj0gMHhmZjsKKyAgICBwYXIgXj0gKHBh
ciA+PiAxNik7CisgICAgcnAxID0gKHBhciA+PiA4KTsgcnAxICY9IDB4ZmY7
CisgICAgcnAwID0gKHBhciAmIDB4ZmYpOworICAgIHBhciBePSAocGFyID4+
IDgpOyBwYXIgJj0gMHhmZjsKKworICAgIGNvZGVbMF0gPQorICAgICAgICAo
cGFyaXR5W3JwN10gPDwgNykgfAorICAgICAgICAocGFyaXR5W3JwNl0gPDwg
NikgfAorICAgICAgICAocGFyaXR5W3JwNV0gPDwgNSkgfAorICAgICAgICAo
cGFyaXR5W3JwNF0gPDwgNCkgfAorICAgICAgICAocGFyaXR5W3JwM10gPDwg
MykgfAorICAgICAgICAocGFyaXR5W3JwMl0gPDwgMikgfAorICAgICAgICAo
cGFyaXR5W3JwMV0gPDwgMSkgfAorICAgICAgICAocGFyaXR5W3JwMF0pOwor
ICAgIGNvZGVbMV0gPQorICAgICAgICAocGFyaXR5W3JwMTVdIDw8IDcpIHwK
KyAgICAgICAgKHBhcml0eVtycDE0XSA8PCA2KSB8CisgICAgICAgIChwYXJp
dHlbcnAxM10gPDwgNSkgfAorICAgICAgICAocGFyaXR5W3JwMTJdIDw8IDQp
IHwKKyAgICAgICAgKHBhcml0eVtycDExXSA8PCAzKSB8CisgICAgICAgIChw
YXJpdHlbcnAxMF0gPDwgMikgfAorICAgICAgICAocGFyaXR5W3JwOV0gIDw8
IDEpIHwKKyAgICAgICAgKHBhcml0eVtycDhdKTsKKyAgICBjb2RlWzJdID0K
KyAgICAgICAgKHBhcml0eVtwYXIgJiAweGYwXSA8PCA3KSB8CisgICAgICAg
IChwYXJpdHlbcGFyICYgMHgwZl0gPDwgNikgfAorICAgICAgICAocGFyaXR5
W3BhciAmIDB4Y2NdIDw8IDUpIHwKKyAgICAgICAgKHBhcml0eVtwYXIgJiAw
eDMzXSA8PCA0KSB8CisgICAgICAgIChwYXJpdHlbcGFyICYgMHhhYV0gPDwg
MykgfAorICAgICAgICAocGFyaXR5W3BhciAmIDB4NTVdIDw8IDIpOworICAg
IGNvZGVbMF0gPSB+Y29kZVswXTsKKyAgICBjb2RlWzFdID0gfmNvZGVbMV07
CisgICAgY29kZVsyXSA9IH5jb2RlWzJdOworfQorCitUaGUgcGFyaXR5IGFy
cmF5IGlzIG5vdCBzaG93biBhbnkgbW9yZS4gTm90ZSBhbHNvIHRoYXQgZm9y
IHRoZXNlCitleGFtcGxlcyBJIGtpbmRhIGRldmlhdGVkIGZyb20gbXkgcmVn
dWxhciBwcm9ncmFtbWluZyBzdHlsZSBieSBhbGxvd2luZworbXVsdGlwbGUg
c3RhdGVtZW50cyBvbiBhIGxpbmUsIG5vdCB1c2luZyB7IH0gaW4gdGhlbiBh
bmQgZWxzZSBibG9ja3MKK3dpdGggb25seSBhIHNpbmdsZSBzdGF0ZW1lbnQg
YW5kIGJ5IHVzaW5nIG9wZXJhdG9ycyBsaWtlIF49CisKKworQW5hbHlzaXMg
MgorPT09PT09PT09PQorCitUaGUgY29kZSAob2YgY291cnNlKSB3b3Jrcywg
YW5kIGh1cnJheTogd2UgYXJlIGEgbGl0dGxlIGJpdCBmYXN0ZXIgdGhhbgor
dGhlIGxpbnV4IGRyaXZlciBjb2RlIChhYm91dCAxNSUpLiBCdXQgd2FpdCwg
ZG9uJ3QgY2hlZXIgdG9vIHF1aWNrbHkuCitUSGVyZSBpcyBtb3JlIHRvIGJl
IGdhaW5lZC4KK0lmIHdlIGxvb2sgYXQgZS5nLiBycDE0IGFuZCBycDE1IHdl
IHNlZSB0aGF0IHdlIGVpdGhlciB4b3Igb3VyIGRhdGEgd2l0aAorcnAxNCBv
ciB3aXRoIHJwMTUuIEhvd2V2ZXIgd2UgYWxzbyBoYXZlIHBhciB3aGljaCBn
b2VzIG92ZXIgYWxsIGRhdGEuCitUaGlzIG1lYW5zIHRoZXJlIGlzIG5vIG5l
ZWQgdG8gY2FsY3VsYXRlIHJwMTQgYXMgaXQgY2FuIGJlIGNhbGN1bGF0ZWQg
ZnJvbQorcnAxNSB0aHJvdWdoIHJwMTQgPSBwYXIgXiBycDE1OworKG9yIGlm
IGRlc2lyZWQgd2UgY2FuIGF2b2lkIGNhbGN1bGF0aW5nIHJwMTUgYW5kIGNh
bGN1bGF0ZSBpdCBmcm9tCitycDE0KS4gIFRoYXQgaXMgd2h5IHNvbWUgcGxh
Y2VzIHJlZmVyIHRvIGludmVyc2UgcGFyaXR5LgorT2YgY291cnNlIHRoZSBz
YW1lIHRoaW5nIGhvbGRzIGZvciBycDQvNSwgcnA2LzcsIHJwOC85LCBycDEw
LzExIGFuZCBycDEyLzEzLgorRWZmZWN0aXZlbHkgdGhpcyBtZWFucyB3ZSBj
YW4gZWxpbWluYXRlIHRoZSBlbHNlIGNsYXVzZSBmcm9tIHRoZSBpZgorc3Rh
dGVtZW50cy4gQWxzbyB3ZSBjYW4gb3B0aW1pc2UgdGhlIGNhbGN1bGF0aW9u
IGluIHRoZSBlbmQgYSBsaXR0bGUgYml0CitieSBnb2luZyBmcm9tIGxvbmcg
dG8gYnl0ZSBmaXJzdC4gQWN0dWFsbHkgd2UgY2FuIGV2ZW4gYXZvaWQgdGhl
IHRhYmxlCitsb29rdXBzCisKK0F0dGVtcHQgMworPT09PT09PT09CisKK09k
ZCByZXBsYWNlZDoKKyAgICAgICAgaWYgKGkgJiAweDAxKSBycDUgXj0gY3Vy
OyBlbHNlIHJwNCBePSBjdXI7CisgICAgICAgIGlmIChpICYgMHgwMikgcnA3
IF49IGN1cjsgZWxzZSBycDYgXj0gY3VyOworICAgICAgICBpZiAoaSAmIDB4
MDQpIHJwOSBePSBjdXI7IGVsc2UgcnA4IF49IGN1cjsKKyAgICAgICAgaWYg
KGkgJiAweDA4KSBycDExIF49IGN1cjsgZWxzZSBycDEwIF49IGN1cjsKKyAg
ICAgICAgaWYgKGkgJiAweDEwKSBycDEzIF49IGN1cjsgZWxzZSBycDEyIF49
IGN1cjsKKyAgICAgICAgaWYgKGkgJiAweDIwKSBycDE1IF49IGN1cjsgZWxz
ZSBycDE0IF49IGN1cjsKK3dpdGgKKyAgICAgICAgaWYgKGkgJiAweDAxKSBy
cDUgXj0gY3VyOworICAgICAgICBpZiAoaSAmIDB4MDIpIHJwNyBePSBjdXI7
CisgICAgICAgIGlmIChpICYgMHgwNCkgcnA5IF49IGN1cjsgCisgICAgICAg
IGlmIChpICYgMHgwOCkgcnAxMSBePSBjdXI7CisgICAgICAgIGlmIChpICYg
MHgxMCkgcnAxMyBePSBjdXI7CisgICAgICAgIGlmIChpICYgMHgyMCkgcnAx
NSBePSBjdXI7CisKKyAgICAgICAgYW5kIG91dHNpZGUgdGhlIGxvb3AgYWRk
ZWQ6CisgICAgcnA0ICA9IHBhciBeIHJwNTsKKyAgICBycDYgID0gcGFyIF4g
cnA3OworICAgIHJwOCAgPSBwYXIgXiBycDk7CisgICAgcnAxMCAgPSBwYXIg
XiBycDExOworICAgIHJwMTIgID0gcGFyIF4gcnAxMzsKKyAgICBycDE0ICA9
IHBhciBeIHJwMTU7CisKK0FuZCBhZnRlciB0aGF0IHRoZSBjb2RlIHRha2Vz
IGFib3V0IDMwJSBtb3JlIHRpbWUsIGFsdGhvdWdoIHRoZSBudW1iZXIgb2YK
K3N0YXRlbWVudHMgaXMgcmVkdWNlZC4gVGhpcyBpcyBhbHNvIHJlZmxlY3Rl
ZCBpbiB0aGUgYXNzZW1ibHkgY29kZS4KKworCitBbmFseXNpcyAzCis9PT09
PT09PT09CisKK1Zlcnkgd2VpcmQuIEd1ZXNzIGl0IGhhcyB0byBkbyB3aXRo
IGNhY2hpbmcgb3IgaW5zdHJ1Y3Rpb24gcGFyYWxsZWxsaXNtCitvciBzby4g
SSBhbHNvIHRyaWVkIG9uIGFuIGVlZVBDIChDZWxlcm9uLCBjbG9ja2VkIGF0
IDkwMCBNaHopLiBJbnRlcmVzdGluZworb2JzZXJ2YXRpb24gd2FzIHRoYXQg
dGhpcyBvbmUgaXMgb25seSAzMCUgc2xvd2VyIChhY2NvcmRpbmcgdG8gdGlt
ZSkKK2V4ZWN1dGluZyB0aGUgY29kZSBhcyBteSAzR2h6IEQ5MjAgcHJvY2Vz
c29yLgorCitXZWxsLCBpdCB3YXMgZXhwZWN0ZWQgbm90IHRvIGJlIGVhc3kg
c28gbWF5YmUgaW5zdGVhZCBtb3ZlIHRvIGEKK2RpZmZlcmVudCB0cmFjazog
bGV0J3MgbW92ZSBiYWNrIHRvIHRoZSBjb2RlIGZyb20gYXR0ZW1wdDIgYW5k
IGRvIHNvbWUKK2xvb3AgdW5yb2xsaW5nLiBUaGlzIHdpbGwgZWxpbWluYXRl
IGEgZmV3IGlmIHN0YXRlbWVudHMuIEknbGwgdHJ5CitkaWZmZXJlbnQgYW1v
dW50cyBvZiB1bnJvbGxpbmcgdG8gc2VlIHdoYXQgd29ya3MgYmVzdC4KKwor
CitBdHRlbXB0IDQKKz09PT09PT09PQorCitVbnJvbGxlZCB0aGUgbG9vcCAx
LCAyLCAzIGFuZCA0IHRpbWVzLgorRm9yIDQgdGhlIGNvZGUgc3RhcnRzIHdp
dGg6CisKKyAgICBmb3IgKGkgPSAwOyBpIDwgNDsgaSsrKQorICAgIHsKKyAg
ICAgICAgY3VyID0gKmJwKys7CisgICAgICAgIHBhciBePSBjdXI7CisgICAg
ICAgIHJwNCBePSBjdXI7CisgICAgICAgIHJwNiBePSBjdXI7CisgICAgICAg
IHJwOCBePSBjdXI7CisgICAgICAgIHJwMTAgXj0gY3VyOworICAgICAgICBp
ZiAoaSAmIDB4MSkgcnAxMyBePSBjdXI7IGVsc2UgcnAxMiBePSBjdXI7Cisg
ICAgICAgIGlmIChpICYgMHgyKSBycDE1IF49IGN1cjsgZWxzZSBycDE0IF49
IGN1cjsKKyAgICAgICAgY3VyID0gKmJwKys7CisgICAgICAgIHBhciBePSBj
dXI7CisgICAgICAgIHJwNSBePSBjdXI7CisgICAgICAgIHJwNiBePSBjdXI7
CisgICAgICAgIC4uLgorCisKK0FuYWx5c2lzIDQKKz09PT09PT09PT0KKwor
VW5yb2xsaW5nIG9uY2UgZ2FpbnMgYWJvdXQgMTUlCitVbnJvbGxpbmcgdHdp
Y2Uga2VlcHMgdGhlIGdhaW4gYXQgYWJvdXQgMTUlCitVbnJvbGxpbmcgdGhy
ZWUgdGltZXMgZ2l2ZXMgYSBnYWluIG9mIDMwJSBjb21wYXJlZCB0byBhdHRl
bXB0IDIuCitVbnJvbGxpbmcgZm91ciB0aW1lcyBnaXZlcyBhIG1hcmdpbmFs
IGltcHJvdmVtZW50IGNvbXBhcmVkIHRvIHVucm9sbGluZwordGhyZWUgdGlt
ZXMuCisKK0kgZGVjaWRlZCB0byBwcm9jZWVkIHdpdGggYSBmb3VyIHRpbWUg
dW5yb2xsZWQgbG9vcCBhbnl3YXkuIEl0IHdhcyBteSBndXQKK2ZlZWxpbmcg
dGhhdCBpbiB0aGUgbmV4dCBzdGVwcyBJIHdvdWxkIG9idGFpbiBhZGRpdGlv
bmFsIGdhaW4gZnJvbSBpdC4KKworVGhlIG5leHQgc3RlcCB3YXMgdHJpZ2dl
cmVkIGJ5IHRoZSBmYWN0IHRoYXQgcGFyIGNvbnRhaW5zIHRoZSB4b3Igb2Yg
YWxsCitieXRlcyBhbmQgcnA0IGFuZCBycDUgZWFjaCBjb250YWluIHRoZSB4
b3Igb2YgaGFsZiBvZiB0aGUgYnl0ZXMuCitTbyBpbiBlZmZlY3QgcGFyID0g
cnA0IF4gcnA1LiBCdXQgYXMgeG9yIGlzIGNvbW11dGF0aXZlIHdlIGNhbiBh
bHNvIHNheQordGhhdCBycDUgPSBwYXIgXiBycDQuIFNvIG5vIG5lZWQgdG8g
a2VlcCBib3RoIHJwNCBhbmQgcnA1IGFyb3VuZC4gV2UgY2FuCitlbGltaW5h
dGUgcnA1IChvciBycDQsIGJ1dCBJIGFscmVhZHkgZm9yZXNhdyBhbm90aGVy
IG9wdGltaXNhdGlvbikuCitUaGUgc2FtZSBob2xkcyBmb3IgcnA2LzcsIHJw
OC85LCBycDEwLzExIHJwMTIvMTMgYW5kIHJwMTQvMTUuCisKKworQXR0ZW1w
dCA1Cis9PT09PT09PT0KKworRWZmZWN0aXZlbHkgc28gYWxsIG9kZCBkaWdp
dCBycCBhc3NpZ25tZW50cyBpbiB0aGUgbG9vcCB3ZXJlIHJlbW92ZWQuCitU
aGlzIGluY2x1ZGVkIHRoZSBlbHNlIGNsYXVzZSBvZiB0aGUgaWYgc3RhdGVt
ZW50cy4KK09mIGNvdXJzZSBhZnRlciB0aGUgbG9vcCB3ZSBuZWVkIHRvIGNv
cnJlY3QgdGhpbmdzIGJ5IGFkZGluZyBjb2RlIGxpa2U6CisgICAgcnA1ID0g
cGFyIF4gcnA0OworQWxzbyB0aGUgaW5pdGlhbCBhc3NpZ25tZW50cyAocnA1
ID0gMDsgZXRjKSBjb3VsZCBiZSByZW1vdmVkLgorQWxvbmcgdGhlIGxpbmUg
SSBhbHNvIHJlbW92ZWQgdGhlIGluaXRpYWxpc2F0aW9uIG9mIHJwMC8xLzIv
My4KKworCitBbmFseXNpcyA1Cis9PT09PT09PT09CisKK01lYXN1cmVtZW50
cyBzaG93ZWQgdGhpcyB3YXMgYSBnb29kIG1vdmUuIFRoZSBydW4tdGltZSBy
b3VnaGx5IGhhbHZlZAorY29tcGFyZWQgd2l0aCBhdHRlbXB0IDQgd2l0aCA0
IHRpbWVzIHVucm9sbGVkLCBhbmQgd2Ugb25seSByZXF1aXJlIDEvM3JkCitv
ZiB0aGUgcHJvY2Vzc29yIHRpbWUgY29tcGFyZWQgdG8gdGhlIGN1cnJlbnQg
Y29kZSBpbiB0aGUgbGludXgga2VybmVsLgorCitIb3dldmVyLCBzdGlsbCBJ
IHRob3VnaHQgdGhlcmUgd2FzIG1vcmUuIEkgZGlkbid0IGxpa2UgYWxsIHRo
ZSBpZgorc3RhdGVtZW50cy4gV2h5IG5vdCBrZWVwIGEgcnVubmluZyBwYXJp
dHkgYW5kIG9ubHkga2VlcCB0aGUgbGFzdCBpZgorc3RhdGVtZW50LiBUaW1l
IGZvciB5ZXQgYW5vdGhlciB2ZXJzaW9uIQorCisKK0F0dGVtcHQgNgorPT09
PT09PT09CisKK1RIZSBjb2RlIHdpdGhpbiB0aGUgZm9yIGxvb3Agd2FzIGNo
YW5nZWQgdG86CisKKyAgICBmb3IgKGkgPSAwOyBpIDwgNDsgaSsrKQorICAg
IHsKKyAgICAgICAgY3VyID0gKmJwKys7IHRtcHBhciAgPSBjdXI7IHJwNCBe
PSBjdXI7CisgICAgICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBy
cDYgXj0gdG1wcGFyOworICAgICAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49
IGN1cjsgcnA0IF49IGN1cjsKKyAgICAgICAgY3VyID0gKmJwKys7IHRtcHBh
ciBePSBjdXI7IHJwOCBePSB0bXBwYXI7CisKKyAgICAgICAgY3VyID0gKmJw
Kys7IHRtcHBhciBePSBjdXI7IHJwNCBePSBjdXI7IHJwNiBePSBjdXI7Cisg
ICAgICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDYgXj0gY3Vy
OworCSAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsgcnA0IF49IGN1
cjsKKwkgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwMTAgXj0g
dG1wcGFyOworCisJICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBy
cDQgXj0gY3VyOyBycDYgXj0gY3VyOyBycDggXj0gY3VyOworICAgICAgICBj
dXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsgcnA2IF49IGN1cjsgcnA4IF49
IGN1cjsKKwkgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwNCBe
PSBjdXI7IHJwOCBePSBjdXI7CisgICAgICAgIGN1ciA9ICpicCsrOyB0bXBw
YXIgXj0gY3VyOyBycDggXj0gY3VyOworCisgICAgICAgIGN1ciA9ICpicCsr
OyB0bXBwYXIgXj0gY3VyOyBycDQgXj0gY3VyOyBycDYgXj0gY3VyOworICAg
ICAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsgcnA2IF49IGN1cjsK
KyAgICAgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwNCBePSBj
dXI7CisgICAgICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOworCisJ
ICAgIHBhciBePSB0bXBwYXI7CisgICAgICAgIGlmICgoaSAmIDB4MSkgPT0g
MCkgcnAxMiBePSB0bXBwYXI7CisgICAgICAgIGlmICgoaSAmIDB4MikgPT0g
MCkgcnAxNCBePSB0bXBwYXI7CisgICAgfQorCitBcyB5b3UgY2FuIHNlZSB0
bXBwYXIgaXMgdXNlZCB0byBhY2N1bXVsYXRlIHRoZSBwYXJpdHkgd2l0aGlu
IGEgZm9yCitpdGVyYXRpb24uIEluIHRoZSBsYXN0IDMgc3RhdGVtZW50cyBp
cyBpcyBhZGRlZCB0byBwYXIgYW5kLCBpZiBuZWVkZWQsCit0byBycDEyIGFu
ZCBycDE0LgorCitXaGlsZSBtYWtpbmcgdGhlIGNoYW5nZXMgSSBhbHNvIGZv
dW5kIHRoYXQgSSBjb3VsZCBleHBsb2l0IHRoYXQgdG1wcGFyCitjb250YWlu
cyB0aGUgcnVubmluZyBwYXJpdHkgZm9yIHRoaXMgaXRlcmF0aW9uLiBTbyBp
bnN0ZWFkIG9mIGhhdmluZzoKK3JwNCBePSBjdXI7IHJwNiA9IGN1cjsKK0kg
cmVtb3ZlZCB0aGUgcnA2ID0gY3VyOyBzdGF0ZW1lbnQgYW5kIGRpZCBycDYg
Xj0gdG1wcGFyOyBvbiBuZXh0CitzdGF0ZW1lbnQuIEEgc2ltaWxhciBjaGFu
Z2Ugd2FzIGRvbmUgZm9yIHJwOCBhbmQgcnAxMAorCisKK0FuYWx5c2lzIDYK
Kz09PT09PT09PT0KKworTWVhc3VyaW5nIHRoaXMgY29kZSBhZ2FpbiBzaG93
ZWQgYmlnIGdhaW4uIFdoZW4gZXhlY3V0aW5nIHRoZSBvcmlnaW5hbAorbGlu
dXggY29kZSAxIG1pbGxpb24gdGltZXMsIHRoaXMgdG9vayBhYm91dCAxIHNl
Y29uZCBvbiBteSBzeXN0ZW0uCisodXNpbmcgdGltZSB0byBtZWFzdXJlIHRo
ZSBwZXJmb3JtYW5jZSkuIEFmdGVyIHRoaXMgaXRlcmF0aW9uIEkgd2FzIGJh
Y2sKK3RvIDAuMDc1IHNlYy4gQWN0dWFsbHkgSSBoYWQgdG8gZGVjaWRlIHRv
IHN0YXJ0IG1lYXN1cmluZyBvdmVyIDEwCittaWxsaW9uIGludGVyYXRpb25z
IGluIG9yZGVyIG5vdCB0byBsb29zZSB0b28gbXVjaCBhY2N1cmFjeS4gVGhp
cyBvbmUKK2RlZmluaXRlbHkgc2VlbWVkIHRvIGJlIHRoZSBqYWNrcG90IQor
CitUaGVyZSBpcyBhIGxpdHRsZSBiaXQgbW9yZSByb29tIGZvciBpbXByb3Zl
bWVudCB0aG91Z2guIFRoZXJlIGFyZSB0aHJlZQorcGxhY2VzIHdpdGggc3Rh
dGVtZW50czoKK3JwNCBePSBjdXI7IHJwNiBePSBjdXI7CitJdCBzZWVtcyBt
b3JlIGVmZmljaWVudCB0byBhbHNvIG1haW50YWluIGEgdmFyaWFibGUgcnA0
XzYgaW4gdGhlIHdoaWxlCitsb29wOyBUaGlzIGVsaW1pbmF0ZXMgMyBzdGF0
ZW1lbnRzIHBlciBsb29wLiBPZiBjb3Vyc2UgYWZ0ZXIgdGhlIGxvb3Agd2UK
K25lZWQgdG8gY29ycmVjdCBieSBhZGRpbmc6CisgICAgcnA0IF49IHJwNF82
OworICAgIHJwNiBePSBycDRfNgorRnVydGhlcm1vcmUgdGhlcmUgYXJlIDQg
c2VxdWVudGlhbCBhc3NpbmdtZW50cyB0byBycDguIFRoaXMgY2FuIGJlCitl
bmNvZGVkIHNsaWdodGx5IG1vcmUgZWZmaWNpZW50IGJ5IHNhdmluZyB0bXBw
YXIgYmVmb3JlIHRob3NlIDQgbGluZXMKK2FuZCBsYXRlciBkbyBycDggPSBy
cDggXiB0bXBwYXIgXiBub3RycDg7Cisod2hlcmUgbm90cnA4IGlzIHRoZSB2
YWx1ZSBvZiBycDggYmVmb3JlIHRob3NlIDQgbGluZXMpLgorQWdhaW4gYSB1
c2Ugb2YgdGhlIGNvbW11dGF0aXZlIHByb3BlcnR5IG9mIHhvci4KK1RpbWUg
Zm9yIGEgbmV3IHRlc3QhCisKKworQXR0ZW1wdCA3Cis9PT09PT09PT0KKwor
VGhlIG5ldyBjb2RlIG5vdyBsb29rcyBsaWtlOgorCisgICAgZm9yIChpID0g
MDsgaSA8IDQ7IGkrKykKKyAgICB7CisgICAgICAgIGN1ciA9ICpicCsrOyB0
bXBwYXIgID0gY3VyOyBycDQgXj0gY3VyOworICAgICAgICBjdXIgPSAqYnAr
KzsgdG1wcGFyIF49IGN1cjsgcnA2IF49IHRtcHBhcjsKKyAgICAgICAgY3Vy
ID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwNCBePSBjdXI7CisgICAgICAg
IGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDggXj0gdG1wcGFyOwor
CisgICAgICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDRfNiBe
PSBjdXI7CisgICAgICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBy
cDYgXj0gY3VyOworCSAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsg
cnA0IF49IGN1cjsKKwkgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7
IHJwMTAgXj0gdG1wcGFyOworCisJICAgIG5vdHJwOCA9IHRtcHBhcjsKKwkg
ICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwNF82IF49IGN1cjsK
KyAgICAgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7IHJwNiBePSBj
dXI7CisJICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDQgXj0g
Y3VyOworICAgICAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsKKwkg
ICAgcnA4ID0gcnA4IF4gdG1wcGFyIF4gbm90cnA4OworCisgICAgICAgIGN1
ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDRfNiBePSBjdXI7CisgICAg
ICAgIGN1ciA9ICpicCsrOyB0bXBwYXIgXj0gY3VyOyBycDYgXj0gY3VyOwor
ICAgICAgICBjdXIgPSAqYnArKzsgdG1wcGFyIF49IGN1cjsgcnA0IF49IGN1
cjsKKyAgICAgICAgY3VyID0gKmJwKys7IHRtcHBhciBePSBjdXI7CisKKwkg
ICAgcGFyIF49IHRtcHBhcjsKKyAgICAgICAgaWYgKChpICYgMHgxKSA9PSAw
KSBycDEyIF49IHRtcHBhcjsKKyAgICAgICAgaWYgKChpICYgMHgyKSA9PSAw
KSBycDE0IF49IHRtcHBhcjsKKyAgICB9CisgICAgcnA0IF49IHJwNF82Owor
ICAgIHJwNiBePSBycDRfNjsKKworCitOb3QgYSBiaWcgY2hhbmdlLCBidXQg
ZXZlcnkgcGVubnkgY291bnRzIDotKQorCisKK0FuYWx5c2lzIDcKKz09PT09
PT09PT0KKworQWN1dGFsbHkgdGhpcyBtYWRlIHRoaW5ncyB3b3JzZS4gTm90
IHZlcnkgbXVjaCwgYnV0IEkgZG9uJ3Qgd2FudCB0byBtb3ZlCitpbnRvIHRo
ZSB3cm9uZyBkaXJlY3Rpb24uIE1heWJlIHNvbWV0aGluZyB0byBpbnZlc3Rp
Z2F0ZSBsYXRlci4gQ291bGQKK2hhdmUgdG8gZG8gd2l0aCBjYWNoaW5nIGFn
YWluLgorCitHdWVzcyB0aGF0IGlzIHdoYXQgdGhlcmUgaXMgdG8gd2luIHdp
dGhpbiB0aGUgbG9vcC4gTWF5YmUgdW5yb2xsaW5nIG9uZQorbW9yZSB0aW1l
IHdpbGwgaGVscC4gSSdsbCBrZWVwIHRoZSBvcHRpbWlzYXRpb25zIGZyb20g
NyBmb3Igbm93LgorCisKK0F0dGVtcHQgOAorPT09PT09PT09CisKK1Vucm9s
bGVkIHRoZSBsb29wIG9uZSBtb3JlIHRpbWUuCisKKworQW5hbHlzaXMgOAor
PT09PT09PT09PQorCitUaGlzIG1ha2VzIHRoaW5ncyB3b3JzZS4gTGV0J3Mg
c3RpY2sgd2l0aCBhdHRlbXB0IDYgYW5kIGNvbnRpbnVlIGZyb20gdGhlcmUu
CitBbHRob3VnaCBpdCBzZWVtcyB0aGF0IHRoZSBjb2RlIHdpdGhpbiB0aGUg
bG9vcCBjYW5ub3QgYmUgb3B0aW1pc2VkCitmdXJ0aGVyIHRoZXJlIGlzIHN0
aWxsIHJvb20gdG8gb3B0aW1pemUgdGhlIGdlbmVyYXRpb24gb2YgdGhlIGVj
YyBjb2Rlcy4KK1dlIGNhbiBzaW1wbHkgY2FsY3VhbGF0ZSB0aGUgdG90YWwg
cGFyaXR5LiBJZiB0aGlzIGlzIDAgdGhlbiBycDQgPSBycDUKK2V0Yy4gSWYg
dGhlIHBhcml0eSBpcyAxLCB0aGVuIHJwNCA9ICFycDU7CitCdXQgaWYgcnA0
ID0gcnA1IHdlIGRvIG5vdCBuZWVkIHJwNSBldGMuIFdlIGNhbiBqdXN0IHdy
aXRlIHRoZSBldmVuIGJpdHMKK2luIHRoZSByZXN1bHQgYnl0ZSBhbmQgdGhl
biBkbyBzb21ldGhpbmcgbGlrZQorICAgIGNvZGVbMF0gfD0gKGNvZGVbMF0g
PDwgMSk7CitMZXRzIHRlc3QgdGhpcy4KKworCitBdHRlbXB0IDkKKz09PT09
PT09PQorCitDaGFuZ2VkIHRoZSBjb2RlIGJ1dCBhZ2FpbiB0aGlzIHNsaWdo
dGx5IGRlZ3JhZGVzIHBlcmZvcm1hbmNlLiBUcmllZCBhbGwKK2tpbmQgb2Yg
b3RoZXIgdGhpbmdzLCBsaWtlIGhhdmluZyBkZWRpY2F0ZWQgcGFyaXR5IGFy
cmF5cyB0byBhdm9pZCB0aGUKK3NoaWZ0IGFmdGVyIHBhcml0eVtycDddIDw8
IDc7IE5vIGdhaW4uCitDaGFuZ2UgdGhlIGxvb2t1cCB1c2luZyB0aGUgcGFy
aXR5IGFycmF5IGJ5IHVzaW5nIHNoaWZ0IG9wZXJhdG9ycyAoZS5nLgorcmVw
bGFjZSBwYXJpdHlbcnA3XSA8PCA3IHdpdGg6CitycDcgXj0gKHJwNyA8PCA0
KTsKK3JwNyBePSAocnA3IDw8IDIpOworcnA3IF49IChycDcgPDwgMSk7City
cDcgJj0gMHg4MDsKK05vIGdhaW4uCisKK1RoZSBvbmx5IG1hcmdpbmFsIGNo
YW5nZSB3YXMgaW52ZXJ0aW5nIHRoZSBwYXJpdHkgYml0cywgc28gd2UgY2Fu
IHJlbW92ZQordGhlIGxhc3QgdGhyZWUgaW52ZXJ0IHN0YXRlbWVudHMuCisK
K0FoIHdlbGwsIHBpdHkgdGhpcyBkb2VzIG5vdCBkZWxpdmVyIG1vcmUuIFRo
ZW4gYWdhaW4gMTAgbWlsbGlvbgoraXRlcmF0aW9ucyB1c2luZyB0aGUgbGlu
dXggZHJpdmVyIGNvZGUgdGFrZXMgYmV0d2VlbiAxMyBhbmQgMTMuNQorc2Vj
b25kcywgd2hlcmVhcyBteSBjb2RlIG5vdyB0YWtlcyBhYm91dCAwLjczIHNl
Y29uZHMgZm9yIHRob3NlIDEwCittaWxsaW9uIGl0ZXJhdGlvbnMuIFNvIGJh
c2ljYWxseSBJJ3ZlIGltcHJvdmVkIHRoZSBwZXJmb3JtYW5jZSBieSBhCitm
YWN0b3IgMTggb24gbXkgc3lzdGVtLiBOb3QgdGhhdCBiYWQuIE9mIGNvdXJz
ZSBvbiBkaWZmZXJlbnQgaGFyZHdhcmUKK3lvdSB3aWxsIGdldCBkaWZmZXJl
bnQgcmVzdWx0cy4gTm8gd2FycmFudGllcyEKKworQnV0IG9mIGNvdXJzZSB0
aGVyZSBpcyBubyBzdWNoIHRoaW5nIGFzIGEgZnJlZSBsdW5jaC4gVGhlIGNv
ZGVzaXplIGFsbW9zdAordHJpcGxlZCAoZnJvbSA1NjIgYnl0ZXMgdG8gMTQz
NCBieXRlcykuIFRoZW4gYWdhaW4sIGl0IGlzIG5vdCB0aGF0IG11Y2guCisK
KworQ29ycmVjdGluZyBlcnJvcnMKKz09PT09PT09PT09PT09PT09CisKK0Zv
ciBjb3JyZWN0aW5nIGVycm9ycyBJIGFnYWluIHVzZWQgdGhlIFNUIGFwcGxp
Y2F0aW9uIG5vdGUgYXMgYSBzdGFydGVyLAorYnV0IEkgYWxzbyBwZWVrZWQg
YXQgdGhlIGV4aXN0aW5nIGNvZGUuCitUaGUgYWxnb3JpdGhtIGl0c2VsZiBp
cyBwcmV0dHkgc3RyYWlnaHRmb3J3YXJkLiBKdXN0IHhvciB0aGUgZ2l2ZW4g
YW5kCit0aGUgY2FsY3VsYXRlZCBlY2MuIElmIGFsbCBieXRlcyBhcmUgMCB0
aGVyZSBpcyBubyBwcm9ibGVtLiBJZiAxMSBiaXRzCithcmUgMSB3ZSBoYXZl
IG9uZSBjb3JyZWN0YWJsZSBiaXQgZXJyb3IuIElmIHRoZXJlIGlzIDEgYml0
IDEsIHdlIGhhdmUgYW4KK2Vycm9yIGluIHRoZSBnaXZlbiBlY2MgY29kZS4g
CitJdCBwcm92ZWQgdG8gYmUgZmFzdGVzdCB0byBkbyBzb21lIHRhYmxlIGxv
b2t1cHMuIFBlcmZvcm1hbmNlIGdhaW4KK2ludHJvZHVjZWQgYnkgdGhpcyBp
cyBhYm91dCBhIGZhY3RvciAyIG9uIG15IHN5c3RlbSB3aGVuIGEgcmVwYWly
IGhhZCB0bworYmUgZG9uZSwgYW5kIDElIG9yIHNvIGlmIG5vIHJlcGFpciBo
YWQgdG8gYmUgZG9uZS4KK0NvZGUgc2l6ZSBpbmNyZWFzZWQgZnJvbSAzMzAg
Ynl0ZXMgdG8gNjg2IGJ5dGVzIGZvciB0aGlzIGZ1bmN0aW9uLgorKGdjYyA0
LjIsIC1PMykKKworCitDb25jbHVzaW9uCis9PT09PT09PT09CisKK1RoZSBn
YWluIHdoZW4gY2FsY3VsYXRpbmcgdGhlIGVjYyBpcyB0cmVtZW5kb3VzLiBP
bSBteSBkZXZlbG9wbWVudCBoYXJkd2FyZSAKK2Egc3BlZWR1cCBvZiBhIGZh
Y3RvciBvZiAxOCBmb3IgZWNjIGNhbGN1bGF0aW9uIHdhcyBhY2hpZXZlZC4g
T24gYSB0ZXN0IG9uIGFuCitlbWJlZGRlZCBzeXN0ZW0gd2l0aCBhIE1JUFMg
Y29yZSBhIGZhY3RvciA3IHdhcyBvYnRhaW5lZC4KK09uICBhIHRlc3Qgd2l0
aCBhIExpbmtzeXMgTlNMVTIgKEFSTXY1VEUgcHJvY2Vzc29yKSB0aGUgc3Bl
ZWR1cCB3YXMgYSBmYWN0b3IKKzUgKGJpZyBlbmRpYW4gbW9kZSwgZ2NjIDQu
MS4yLCAtTzMpCitGb3IgY29ycmVjdGlvbiBub3QgbXVjaCBnYWluIGNvdWxk
IGJlIG9idGFpbmVkIChhcyBiaXRmbGlwcyBhcmUgcmFyZSkuIFRoZW4KK2Fn
YWluIHRoZXJlIGFyZSBhbHNvIG11Y2ggbGVzcyBjeWNsZXMgc3BlbnQgdGhl
cmUuCisKK0l0IHNlZW1zIHRoZXJlIGlzIG5vdCBtdWNoIG1vcmUgZ2FpbiBw
b3NzaWJsZSBpbiB0aGlzLCBhdCBsZWFzdCB3aGVuCitwcm9ncmFtbWVkIGlu
IEMuIE9mIGNvdXJzZSBpdCBtaWdodCBiZSBwb3NzaWJsZSB0byBzcXVlZXpl
IHNvbWV0aGluZyBtb3JlCitvdXQgb2YgaXQgd2l0aCBhbiBhc3NlbWJsZXIg
cHJvZ3JhbSwgYnV0IGR1ZSB0byBwaXBlbGluZSBiZWhhdmlvdXIgZXRjCit0
aGlzIGlzIHZlcnkgdHJpY2t5IChhdCBsZWFzdCBmb3IgaW50ZWwgaHcpLgor
CitBdXRob3I6IEZyYW5zIE1ldWxlbmJyb2VrcworQ29weXJpZ2h0IChDKSAy
MDA4IEtvbmlua2xpamtlIFBoaWxpcHMgRWxlY3Ryb25pY3MgTlYuCmRpZmYg
LXVyTiBsaW51eC0yLjYuMjUuMTAvZHJpdmVycy9tdGQvbmFuZC9uYW5kX2Vj
Yy5jIGxpbnV4LTIuNi4yNS4xMC53b3JrL2RyaXZlcnMvbXRkL25hbmQvbmFu
ZF9lY2MuYwotLS0gbGludXgtMi42LjI1LjEwL2RyaXZlcnMvbXRkL25hbmQv
bmFuZF9lY2MuYwkyMDA4LTA3LTAzIDA1OjQ2OjQ3LjAwMDAwMDAwMCArMDIw
MAorKysgbGludXgtMi42LjI1LjEwLndvcmsvZHJpdmVycy9tdGQvbmFuZC9u
YW5kX2VjYy5jCTIwMDgtMDctMjkgMTk6MzM6MTkuMDAwMDAwMDAwICswMjAw
CkBAIC0xLDE1ICsxLDE4IEBACiAvKgotICogVGhpcyBmaWxlIGNvbnRhaW5z
IGFuIEVDQyBhbGdvcml0aG0gZnJvbSBUb3NoaWJhIHRoYXQgZGV0ZWN0cyBh
bmQKLSAqIGNvcnJlY3RzIDEgYml0IGVycm9ycyBpbiBhIDI1NiBieXRlIGJs
b2NrIG9mIGRhdGEuCisgKiBUaGlzIGZpbGUgY29udGFpbnMgYW4gRUNDIGFs
Z29yaXRobSB0aGF0IGRldGVjdHMgYW5kIGNvcnJlY3RzIDEgYml0CisgKiBl
cnJvcnMgaW4gYSAyNTYgYnl0ZSBibG9jayBvZiBkYXRhLgogICoKICAqIGRy
aXZlcnMvbXRkL25hbmQvbmFuZF9lY2MuYwogICoKLSAqIENvcHlyaWdodCAo
QykgMjAwMC0yMDA0IFN0ZXZlbiBKLiBIaWxsIChzamhpbGxAcmVhbGl0eWRp
bHV0ZWQuY29tKQotICogICAgICAgICAgICAgICAgICAgICAgICAgVG9zaGli
YSBBbWVyaWNhIEVsZWN0cm9uaWNzIENvbXBvbmVudHMsIEluYy4KKyAqIENv
cHlyaWdodCAoQykgMjAwOCBLb25pbmtsaWprZSBQaGlsaXBzIEVsZWN0cm9u
aWNzIE5WLgorICogICAgICAgICAgICAgICAgICAgIEF1dGhvcjogRnJhbnMg
TWV1bGVuYnJvZWtzCiAgKgotICogQ29weXJpZ2h0IChDKSAyMDA2IFRob21h
cyBHbGVpeG5lciA8dGdseEBsaW51dHJvbml4LmRlPgorICogQ29tcGxldGVs
eSByZXBsYWNlcyB0aGUgcHJldmlvdXMgRUNDIGltcGxlbWVudGF0aW9uIHdo
aWNoIHdhcyB3cml0dGVuIGJ5OgorICogICBTdGV2ZW4gSi4gSGlsbCAoc2po
aWxsQHJlYWxpdHlkaWx1dGVkLmNvbSkKKyAqICAgVGhvbWFzIEdsZWl4bmVy
ICh0Z2x4QGxpbnV0cm9uaXguZGUpCiAgKgotICogJElkOiBuYW5kX2VjYy5j
LHYgMS4xNSAyMDA1LzExLzA3IDExOjE0OjMwIGdsZWl4bmVyIEV4cCAkCisg
KiBJbmZvcm1hdGlvbiBvbiBob3cgdGhpcyBhbGdvcml0aG0gd29ya3MgYW5k
IGhvdyBpdCB3YXMgZGV2ZWxvcGVkCisgKiBjYW4gYmUgZm91bmQgaW4gRG9j
dW1lbnRhdGlvbi9uYW5kL2VjYy50eHQKICAqCiAgKiBUaGlzIGZpbGUgaXMg
ZnJlZSBzb2Z0d2FyZTsgeW91IGNhbiByZWRpc3RyaWJ1dGUgaXQgYW5kL29y
IG1vZGlmeSBpdAogICogdW5kZXIgdGhlIHRlcm1zIG9mIHRoZSBHTlUgR2Vu
ZXJhbCBQdWJsaWMgTGljZW5zZSBhcyBwdWJsaXNoZWQgYnkgdGhlCkBAIC0y
NSwxNzQgKzI4LDQxNiBAQAogICogd2l0aCB0aGlzIGZpbGU7IGlmIG5vdCwg
d3JpdGUgdG8gdGhlIEZyZWUgU29mdHdhcmUgRm91bmRhdGlvbiwgSW5jLiwK
ICAqIDU5IFRlbXBsZSBQbGFjZSwgU3VpdGUgMzMwLCBCb3N0b24sIE1BIDAy
MTExLTEzMDcgVVNBLgogICoKLSAqIEFzIGEgc3BlY2lhbCBleGNlcHRpb24s
IGlmIG90aGVyIGZpbGVzIGluc3RhbnRpYXRlIHRlbXBsYXRlcyBvciB1c2UK
LSAqIG1hY3JvcyBvciBpbmxpbmUgZnVuY3Rpb25zIGZyb20gdGhlc2UgZmls
ZXMsIG9yIHlvdSBjb21waWxlIHRoZXNlCi0gKiBmaWxlcyBhbmQgbGluayB0
aGVtIHdpdGggb3RoZXIgd29ya3MgdG8gcHJvZHVjZSBhIHdvcmsgYmFzZWQg
b24gdGhlc2UKLSAqIGZpbGVzLCB0aGVzZSBmaWxlcyBkbyBub3QgYnkgdGhl
bXNlbHZlcyBjYXVzZSB0aGUgcmVzdWx0aW5nIHdvcmsgdG8gYmUKLSAqIGNv
dmVyZWQgYnkgdGhlIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlLiBIb3dl
dmVyIHRoZSBzb3VyY2UgY29kZSBmb3IKLSAqIHRoZXNlIGZpbGVzIG11c3Qg
c3RpbGwgYmUgbWFkZSBhdmFpbGFibGUgaW4gYWNjb3JkYW5jZSB3aXRoIHNl
Y3Rpb24gKDMpCi0gKiBvZiB0aGUgR05VIEdlbmVyYWwgUHVibGljIExpY2Vu
c2UuCi0gKgotICogVGhpcyBleGNlcHRpb24gZG9lcyBub3QgaW52YWxpZGF0
ZSBhbnkgb3RoZXIgcmVhc29ucyB3aHkgYSB3b3JrIGJhc2VkIG9uCi0gKiB0
aGlzIGZpbGUgbWlnaHQgYmUgY292ZXJlZCBieSB0aGUgR05VIEdlbmVyYWwg
UHVibGljIExpY2Vuc2UuCiAgKi8KIAorLyoKKyAqIFRoZSBTVEFOREFMT05F
IG1hY3JvIGlzIHVzZWZ1bCB3aGVuIHJ1bm5pbmcgdGhlIGNvZGUgb3V0c2lk
ZSB0aGUga2VybmVsCisgKiBlLmcuIHdoZW4gcnVubmluZyB0aGUgY29kZSBp
biBhIHRlc3RiZWQgb3IgYSBiZW5jaG1hcmsgcHJvZ3JhbS4KKyAqIFdoZW4g
U1RBTkRBTE9ORSBpcyB1c2VkLCB0aGUgbW9kdWxlIHJlbGF0ZWQgbWFjcm9z
IGFyZSBjb21tZW50ZWQgb3V0CisgKiBhcyB3ZWxsIGFzIHRoZSBsaW51eCBp
bmNsdWRlIGZpbGVzLgorICogSW5zdGVhZCBhIHByaXZhdGUgZGVmaW5pdGlv
biBvZiBtdGRfaW50byBpcyBnaXZlbiB0byBzYXRpc2Z5IHRoZSBjb21waWxl
cgorICogKHRoZSBjb2RlIGRvZXMgbm90IHVzZSBtdGRfaW5mbywgc28gdGhl
IGNvZGUgZG9lcyBub3QgY2FyZSkKKyAqLworI2lmbmRlZiBTVEFOREFMT05F
CiAjaW5jbHVkZSA8bGludXgvdHlwZXMuaD4KICNpbmNsdWRlIDxsaW51eC9r
ZXJuZWwuaD4KICNpbmNsdWRlIDxsaW51eC9tb2R1bGUuaD4KICNpbmNsdWRl
IDxsaW51eC9tdGQvbmFuZF9lY2MuaD4KKyNlbHNlCitzdHJ1Y3QgbXRkX2lu
Zm8geworCWludCBkdW1teTsKK307CisjZW5kaWYKIAogLyoKLSAqIFByZS1j
YWxjdWxhdGVkIDI1Ni13YXkgMSBieXRlIGNvbHVtbiBwYXJpdHkKKyAqIGlu
dnBhcml0eSBpcyBhIDI1NiBieXRlIHRhYmxlIHRoYXQgY29udGFpbnMgdGhl
IG9kZCBwYXJpdHkKKyAqIGZvciBlYWNoIGJ5dGUuIFNvIGlmIHRoZSBudW1i
ZXIgb2YgYml0cyBpbiBhIGJ5dGUgaXMgZXZlbiwKKyAqIHRoZSBhcnJheSBl
bGVtZW50IGlzIDEsIGFuZCB3aGVuIHRoZSBudW1iZXIgb2YgYml0cyBpcyBv
ZGQKKyAqIHRoZSBhcnJheSBlbGVlbW50IGlzIDAuCiAgKi8KLXN0YXRpYyBj
b25zdCB1X2NoYXIgbmFuZF9lY2NfcHJlY2FsY190YWJsZVtdID0gewotCTB4
MDAsIDB4NTUsIDB4NTYsIDB4MDMsIDB4NTksIDB4MGMsIDB4MGYsIDB4NWEs
IDB4NWEsIDB4MGYsIDB4MGMsIDB4NTksIDB4MDMsIDB4NTYsIDB4NTUsIDB4
MDAsCi0JMHg2NSwgMHgzMCwgMHgzMywgMHg2NiwgMHgzYywgMHg2OSwgMHg2
YSwgMHgzZiwgMHgzZiwgMHg2YSwgMHg2OSwgMHgzYywgMHg2NiwgMHgzMywg
MHgzMCwgMHg2NSwKLQkweDY2LCAweDMzLCAweDMwLCAweDY1LCAweDNmLCAw
eDZhLCAweDY5LCAweDNjLCAweDNjLCAweDY5LCAweDZhLCAweDNmLCAweDY1
LCAweDMwLCAweDMzLCAweDY2LAotCTB4MDMsIDB4NTYsIDB4NTUsIDB4MDAs
IDB4NWEsIDB4MGYsIDB4MGMsIDB4NTksIDB4NTksIDB4MGMsIDB4MGYsIDB4
NWEsIDB4MDAsIDB4NTUsIDB4NTYsIDB4MDMsCi0JMHg2OSwgMHgzYywgMHgz
ZiwgMHg2YSwgMHgzMCwgMHg2NSwgMHg2NiwgMHgzMywgMHgzMywgMHg2Niwg
MHg2NSwgMHgzMCwgMHg2YSwgMHgzZiwgMHgzYywgMHg2OSwKLQkweDBjLCAw
eDU5LCAweDVhLCAweDBmLCAweDU1LCAweDAwLCAweDAzLCAweDU2LCAweDU2
LCAweDAzLCAweDAwLCAweDU1LCAweDBmLCAweDVhLCAweDU5LCAweDBjLAot
CTB4MGYsIDB4NWEsIDB4NTksIDB4MGMsIDB4NTYsIDB4MDMsIDB4MDAsIDB4
NTUsIDB4NTUsIDB4MDAsIDB4MDMsIDB4NTYsIDB4MGMsIDB4NTksIDB4NWEs
IDB4MGYsCi0JMHg2YSwgMHgzZiwgMHgzYywgMHg2OSwgMHgzMywgMHg2Niwg
MHg2NSwgMHgzMCwgMHgzMCwgMHg2NSwgMHg2NiwgMHgzMywgMHg2OSwgMHgz
YywgMHgzZiwgMHg2YSwKLQkweDZhLCAweDNmLCAweDNjLCAweDY5LCAweDMz
LCAweDY2LCAweDY1LCAweDMwLCAweDMwLCAweDY1LCAweDY2LCAweDMzLCAw
eDY5LCAweDNjLCAweDNmLCAweDZhLAotCTB4MGYsIDB4NWEsIDB4NTksIDB4
MGMsIDB4NTYsIDB4MDMsIDB4MDAsIDB4NTUsIDB4NTUsIDB4MDAsIDB4MDMs
IDB4NTYsIDB4MGMsIDB4NTksIDB4NWEsIDB4MGYsCi0JMHgwYywgMHg1OSwg
MHg1YSwgMHgwZiwgMHg1NSwgMHgwMCwgMHgwMywgMHg1NiwgMHg1NiwgMHgw
MywgMHgwMCwgMHg1NSwgMHgwZiwgMHg1YSwgMHg1OSwgMHgwYywKLQkweDY5
LCAweDNjLCAweDNmLCAweDZhLCAweDMwLCAweDY1LCAweDY2LCAweDMzLCAw
eDMzLCAweDY2LCAweDY1LCAweDMwLCAweDZhLCAweDNmLCAweDNjLCAweDY5
LAotCTB4MDMsIDB4NTYsIDB4NTUsIDB4MDAsIDB4NWEsIDB4MGYsIDB4MGMs
IDB4NTksIDB4NTksIDB4MGMsIDB4MGYsIDB4NWEsIDB4MDAsIDB4NTUsIDB4
NTYsIDB4MDMsCi0JMHg2NiwgMHgzMywgMHgzMCwgMHg2NSwgMHgzZiwgMHg2
YSwgMHg2OSwgMHgzYywgMHgzYywgMHg2OSwgMHg2YSwgMHgzZiwgMHg2NSwg
MHgzMCwgMHgzMywgMHg2NiwKLQkweDY1LCAweDMwLCAweDMzLCAweDY2LCAw
eDNjLCAweDY5LCAweDZhLCAweDNmLCAweDNmLCAweDZhLCAweDY5LCAweDNj
LCAweDY2LCAweDMzLCAweDMwLCAweDY1LAotCTB4MDAsIDB4NTUsIDB4NTYs
IDB4MDMsIDB4NTksIDB4MGMsIDB4MGYsIDB4NWEsIDB4NWEsIDB4MGYsIDB4
MGMsIDB4NTksIDB4MDMsIDB4NTYsIDB4NTUsIDB4MDAKK3N0YXRpYyBjb25z
dCBjaGFyIGludnBhcml0eVsyNTZdID0geworCTEsIDAsIDAsIDEsIDAsIDEs
IDEsIDAsIDAsIDEsIDEsIDAsIDEsIDAsIDAsIDEsCisJMCwgMSwgMSwgMCwg
MSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwgMSwgMCwKKwkwLCAxLCAx
LCAwLCAxLCAwLCAwLCAxLCAxLCAwLCAwLCAxLCAwLCAxLCAxLCAwLAorCTEs
IDAsIDAsIDEsIDAsIDEsIDEsIDAsIDAsIDEsIDEsIDAsIDEsIDAsIDAsIDEs
CisJMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwg
MSwgMCwKKwkxLCAwLCAwLCAxLCAwLCAxLCAxLCAwLCAwLCAxLCAxLCAwLCAx
LCAwLCAwLCAxLAorCTEsIDAsIDAsIDEsIDAsIDEsIDEsIDAsIDAsIDEsIDEs
IDAsIDEsIDAsIDAsIDEsCisJMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwg
MCwgMCwgMSwgMCwgMSwgMSwgMCwKKwkwLCAxLCAxLCAwLCAxLCAwLCAwLCAx
LCAxLCAwLCAwLCAxLCAwLCAxLCAxLCAwLAorCTEsIDAsIDAsIDEsIDAsIDEs
IDEsIDAsIDAsIDEsIDEsIDAsIDEsIDAsIDAsIDEsCisJMSwgMCwgMCwgMSwg
MCwgMSwgMSwgMCwgMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwKKwkwLCAxLCAx
LCAwLCAxLCAwLCAwLCAxLCAxLCAwLCAwLCAxLCAwLCAxLCAxLCAwLAorCTEs
IDAsIDAsIDEsIDAsIDEsIDEsIDAsIDAsIDEsIDEsIDAsIDEsIDAsIDAsIDEs
CisJMCwgMSwgMSwgMCwgMSwgMCwgMCwgMSwgMSwgMCwgMCwgMSwgMCwgMSwg
MSwgMCwKKwkwLCAxLCAxLCAwLCAxLCAwLCAwLCAxLCAxLCAwLCAwLCAxLCAw
LCAxLCAxLCAwLAorCTEsIDAsIDAsIDEsIDAsIDEsIDEsIDAsIDAsIDEsIDEs
IDAsIDEsIDAsIDAsIDEKK307CisKKy8qCisgKiBiaXRzcGVyYnl0ZSBjb250
YWlucyB0aGUgbnVtYmVyIG9mIGJpdHMgcGVyIGJ5dGUKKyAqIHRoaXMgaXMg
b25seSB1c2VkIGZvciB0ZXN0aW5nIGFuZCByZXBhaXJpbmcgcGFyaXR5Cisg
KiAoYSBwcmVjYWxjdWxhdGVkIHZhbHVlIHNsaWdodGx5IGltcHJvdmVzIHBl
cmZvcm1hbmNlKQorICovCitzdGF0aWMgY29uc3QgY2hhciBiaXRzcGVyYnl0
ZVsyNTZdID0geworCTAsIDEsIDEsIDIsIDEsIDIsIDIsIDMsIDEsIDIsIDIs
IDMsIDIsIDMsIDMsIDQsCisJMSwgMiwgMiwgMywgMiwgMywgMywgNCwgMiwg
MywgMywgNCwgMywgNCwgNCwgNSwKKwkxLCAyLCAyLCAzLCAyLCAzLCAzLCA0
LCAyLCAzLCAzLCA0LCAzLCA0LCA0LCA1LAorCTIsIDMsIDMsIDQsIDMsIDQs
IDQsIDUsIDMsIDQsIDQsIDUsIDQsIDUsIDUsIDYsCisJMSwgMiwgMiwgMywg
MiwgMywgMywgNCwgMiwgMywgMywgNCwgMywgNCwgNCwgNSwKKwkyLCAzLCAz
LCA0LCAzLCA0LCA0LCA1LCAzLCA0LCA0LCA1LCA0LCA1LCA1LCA2LAorCTIs
IDMsIDMsIDQsIDMsIDQsIDQsIDUsIDMsIDQsIDQsIDUsIDQsIDUsIDUsIDYs
CisJMywgNCwgNCwgNSwgNCwgNSwgNSwgNiwgNCwgNSwgNSwgNiwgNSwgNiwg
NiwgNywKKwkxLCAyLCAyLCAzLCAyLCAzLCAzLCA0LCAyLCAzLCAzLCA0LCAz
LCA0LCA0LCA1LAorCTIsIDMsIDMsIDQsIDMsIDQsIDQsIDUsIDMsIDQsIDQs
IDUsIDQsIDUsIDUsIDYsCisJMiwgMywgMywgNCwgMywgNCwgNCwgNSwgMywg
NCwgNCwgNSwgNCwgNSwgNSwgNiwKKwkzLCA0LCA0LCA1LCA0LCA1LCA1LCA2
LCA0LCA1LCA1LCA2LCA1LCA2LCA2LCA3LAorCTIsIDMsIDMsIDQsIDMsIDQs
IDQsIDUsIDMsIDQsIDQsIDUsIDQsIDUsIDUsIDYsCisJMywgNCwgNCwgNSwg
NCwgNSwgNSwgNiwgNCwgNSwgNSwgNiwgNSwgNiwgNiwgNywKKwkzLCA0LCA0
LCA1LCA0LCA1LCA1LCA2LCA0LCA1LCA1LCA2LCA1LCA2LCA2LCA3LAorCTQs
IDUsIDUsIDYsIDUsIDYsIDYsIDcsIDUsIDYsIDYsIDcsIDYsIDcsIDcsIDgs
Cit9OworCisvKgorICogYWRkcmVzc2JpdHMgaXMgYSBsb29rdXAgdGFibGUg
dG8gZmlsdGVyIG91dCB0aGUgYml0cyBmcm9tIHRoZSB4b3ItZWQKKyAqIGVj
YyBkYXRhIHRoYXQgaWRlbnRpZnkgdGhlIGZhdWx0eSBsb2NhdGlvbi4KKyAq
IHRoaXMgaXMgb25seSB1c2VkIGZvciByZXBhaXJpbmcgcGFyaXR5CisgKiBz
ZWUgdGhlIGNvbW1lbnRzIGluIG5hbmRfY29ycmVjdF9kYXRhIGZvciBtb3Jl
IGRldGFpbHMKKyAqLworc3RhdGljIGNvbnN0IGNoYXIgYWRkcmVzc2JpdHNb
MjU2XSA9IHsKKwkweDAwLCAweDAwLCAweDAxLCAweDAxLCAweDAwLCAweDAw
LCAweDAxLCAweDAxLAorCTB4MDIsIDB4MDIsIDB4MDMsIDB4MDMsIDB4MDIs
IDB4MDIsIDB4MDMsIDB4MDMsCisJMHgwMCwgMHgwMCwgMHgwMSwgMHgwMSwg
MHgwMCwgMHgwMCwgMHgwMSwgMHgwMSwKKwkweDAyLCAweDAyLCAweDAzLCAw
eDAzLCAweDAyLCAweDAyLCAweDAzLCAweDAzLAorCTB4MDQsIDB4MDQsIDB4
MDUsIDB4MDUsIDB4MDQsIDB4MDQsIDB4MDUsIDB4MDUsCisJMHgwNiwgMHgw
NiwgMHgwNywgMHgwNywgMHgwNiwgMHgwNiwgMHgwNywgMHgwNywKKwkweDA0
LCAweDA0LCAweDA1LCAweDA1LCAweDA0LCAweDA0LCAweDA1LCAweDA1LAor
CTB4MDYsIDB4MDYsIDB4MDcsIDB4MDcsIDB4MDYsIDB4MDYsIDB4MDcsIDB4
MDcsCisJMHgwMCwgMHgwMCwgMHgwMSwgMHgwMSwgMHgwMCwgMHgwMCwgMHgw
MSwgMHgwMSwKKwkweDAyLCAweDAyLCAweDAzLCAweDAzLCAweDAyLCAweDAy
LCAweDAzLCAweDAzLAorCTB4MDAsIDB4MDAsIDB4MDEsIDB4MDEsIDB4MDAs
IDB4MDAsIDB4MDEsIDB4MDEsCisJMHgwMiwgMHgwMiwgMHgwMywgMHgwMywg
MHgwMiwgMHgwMiwgMHgwMywgMHgwMywKKwkweDA0LCAweDA0LCAweDA1LCAw
eDA1LCAweDA0LCAweDA0LCAweDA1LCAweDA1LAorCTB4MDYsIDB4MDYsIDB4
MDcsIDB4MDcsIDB4MDYsIDB4MDYsIDB4MDcsIDB4MDcsCisJMHgwNCwgMHgw
NCwgMHgwNSwgMHgwNSwgMHgwNCwgMHgwNCwgMHgwNSwgMHgwNSwKKwkweDA2
LCAweDA2LCAweDA3LCAweDA3LCAweDA2LCAweDA2LCAweDA3LCAweDA3LAor
CTB4MDgsIDB4MDgsIDB4MDksIDB4MDksIDB4MDgsIDB4MDgsIDB4MDksIDB4
MDksCisJMHgwYSwgMHgwYSwgMHgwYiwgMHgwYiwgMHgwYSwgMHgwYSwgMHgw
YiwgMHgwYiwKKwkweDA4LCAweDA4LCAweDA5LCAweDA5LCAweDA4LCAweDA4
LCAweDA5LCAweDA5LAorCTB4MGEsIDB4MGEsIDB4MGIsIDB4MGIsIDB4MGEs
IDB4MGEsIDB4MGIsIDB4MGIsCisJMHgwYywgMHgwYywgMHgwZCwgMHgwZCwg
MHgwYywgMHgwYywgMHgwZCwgMHgwZCwKKwkweDBlLCAweDBlLCAweDBmLCAw
eDBmLCAweDBlLCAweDBlLCAweDBmLCAweDBmLAorCTB4MGMsIDB4MGMsIDB4
MGQsIDB4MGQsIDB4MGMsIDB4MGMsIDB4MGQsIDB4MGQsCisJMHgwZSwgMHgw
ZSwgMHgwZiwgMHgwZiwgMHgwZSwgMHgwZSwgMHgwZiwgMHgwZiwKKwkweDA4
LCAweDA4LCAweDA5LCAweDA5LCAweDA4LCAweDA4LCAweDA5LCAweDA5LAor
CTB4MGEsIDB4MGEsIDB4MGIsIDB4MGIsIDB4MGEsIDB4MGEsIDB4MGIsIDB4
MGIsCisJMHgwOCwgMHgwOCwgMHgwOSwgMHgwOSwgMHgwOCwgMHgwOCwgMHgw
OSwgMHgwOSwKKwkweDBhLCAweDBhLCAweDBiLCAweDBiLCAweDBhLCAweDBh
LCAweDBiLCAweDBiLAorCTB4MGMsIDB4MGMsIDB4MGQsIDB4MGQsIDB4MGMs
IDB4MGMsIDB4MGQsIDB4MGQsCisJMHgwZSwgMHgwZSwgMHgwZiwgMHgwZiwg
MHgwZSwgMHgwZSwgMHgwZiwgMHgwZiwKKwkweDBjLCAweDBjLCAweDBkLCAw
eDBkLCAweDBjLCAweDBjLCAweDBkLCAweDBkLAorCTB4MGUsIDB4MGUsIDB4
MGYsIDB4MGYsIDB4MGUsIDB4MGUsIDB4MGYsIDB4MGYKIH07CiAKIC8qKgog
ICogbmFuZF9jYWxjdWxhdGVfZWNjIC0gW05BTkQgSW50ZXJmYWNlXSBDYWxj
dWxhdGUgMy1ieXRlIEVDQyBmb3IgMjU2LWJ5dGUgYmxvY2sKLSAqIEBtdGQ6
CU1URCBibG9jayBzdHJ1Y3R1cmUKKyAqIEBtdGQ6CU1URCBibG9jayBzdHJ1
Y3R1cmUgKHVudXNlZCkKICAqIEBkYXQ6CXJhdyBkYXRhCiAgKiBAZWNjX2Nv
ZGU6CWJ1ZmZlciBmb3IgRUNDCiAgKi8KLWludCBuYW5kX2NhbGN1bGF0ZV9l
Y2Moc3RydWN0IG10ZF9pbmZvICptdGQsIGNvbnN0IHVfY2hhciAqZGF0LAot
CQkgICAgICAgdV9jaGFyICplY2NfY29kZSkKK2ludCBuYW5kX2NhbGN1bGF0
ZV9lY2Moc3RydWN0IG10ZF9pbmZvICptdGQsIGNvbnN0IHVuc2lnbmVkIGNo
YXIgKmJ1ZiwKKwkJICAgICAgIHVuc2lnbmVkIGNoYXIgKmNvZGUpCiB7Ci0J
dWludDhfdCBpZHgsIHJlZzEsIHJlZzIsIHJlZzMsIHRtcDEsIHRtcDI7CiAJ
aW50IGk7Ci0KLQkvKiBJbml0aWFsaXplIHZhcmlhYmxlcyAqLwotCXJlZzEg
PSByZWcyID0gcmVnMyA9IDA7Ci0KLQkvKiBCdWlsZCB1cCBjb2x1bW4gcGFy
aXR5ICovCi0JZm9yKGkgPSAwOyBpIDwgMjU2OyBpKyspIHsKLQkJLyogR2V0
IENQMCAtIENQNSBmcm9tIHRhYmxlICovCi0JCWlkeCA9IG5hbmRfZWNjX3By
ZWNhbGNfdGFibGVbKmRhdCsrXTsKLQkJcmVnMSBePSAoaWR4ICYgMHgzZik7
Ci0KLQkJLyogQWxsIGJpdCBYT1IgPSAxID8gKi8KLQkJaWYgKGlkeCAmIDB4
NDApIHsKLQkJCXJlZzMgXj0gKHVpbnQ4X3QpIGk7Ci0JCQlyZWcyIF49IH4o
KHVpbnQ4X3QpIGkpOwotCQl9CisJY29uc3QgdW5zaWduZWQgbG9uZyAqYnAg
PSAodW5zaWduZWQgbG9uZyAqKWJ1ZjsKKwl1bnNpZ25lZCBsb25nIGN1cjsJ
LyogY3VycmVudCB2YWx1ZSBpbiBidWZmZXIgKi8KKwkvKiBycDAuLnJwMTUg
YXJlIHRoZSB2YXJpb3VzIGFjY3VtdWxhdGVkIHBhcml0aWVzIChwZXIgYnl0
ZSkgKi8KKwl1bnNpZ25lZCBsb25nIHJwMCwgcnAxLCBycDIsIHJwMywgcnA0
LCBycDUsIHJwNiwgcnA3OworCXVuc2lnbmVkIGxvbmcgcnA4LCBycDksIHJw
MTAsIHJwMTEsIHJwMTIsIHJwMTMsIHJwMTQsIHJwMTU7CisJdW5zaWduZWQg
bG9uZyBwYXI7CS8qIHRoZSBjdW11bGF0aXZlIHBhcml0eSBmb3IgYWxsIGRh
dGEgKi8KKwl1bnNpZ25lZCBsb25nIHRtcHBhcjsJLyogdGhlIGN1bXVsYXRp
dmUgcGFyaXR5IGZvciB0aGlzIGl0ZXJhdGlvbjsKKwkJCQkgICBmb3IgcnAx
MiBhbmQgcnAxNCBhdCB0aGUgZW5kIG9mIHRoZSBsb29wICovCisKKwlwYXIg
PSAwOworCXJwNCA9IDA7CisJcnA2ID0gMDsKKwlycDggPSAwOworCXJwMTAg
PSAwOworCXJwMTIgPSAwOworCXJwMTQgPSAwOworCisJLyoKKwkgKiBUaGUg
bG9vcCBpcyB1bnJvbGxlZCBhIG51bWJlciBvZiB0aW1lczsKKwkgKiBUaGlz
IGF2b2lkcyBpZiBzdGF0ZW1lbnRzIHRvIGRlY2lkZSBvbiB3aGljaCBycCB2
YWx1ZSB0byB1cGRhdGUKKwkgKiBBbHNvIHdlIHByb2Nlc3MgdGhlIGRhdGEg
YnkgbG9uZ3dvcmRzLgorCSAqIE5vdGU6IHBhc3NpbmcgdW5hbGlnbmVkIGRh
dGEgbWlnaHQgZ2l2ZSBhIHBlcmZvcm1hbmNlIHBlbmFsdHkuCisJICogSXQg
aXMgYXNzdW1lZCB0aGF0IHRoZSBidWZmZXJzIGFyZSBhbGlnbmVkLgorCSAq
IHRtcHBhciBpcyB0aGUgY3VtdWxhdGl2ZSBzdW0gb2YgdGhpcyBpdGVyYXRp
b24uCisJICogbmVlZGVkIGZvciBjYWxjdWxhdGluZyBycDEyLCBycDE0IGFu
ZCBwYXIKKwkgKiBhbHNvIHVzZWQgYXMgYSBwZXJmb3JtYW5jZSBpbXByb3Zl
bWVudCBmb3IgcnA2LCBycDggYW5kIHJwMTAKKwkgKi8KKwlmb3IgKGkgPSAw
OyBpIDwgNDsgaSsrKSB7CisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIgPSBj
dXI7CisJCXJwNCBePSBjdXI7CisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIg
Xj0gY3VyOworCQlycDYgXj0gdG1wcGFyOworCQljdXIgPSAqYnArKzsKKwkJ
dG1wcGFyIF49IGN1cjsKKwkJcnA0IF49IGN1cjsKKwkJY3VyID0gKmJwKys7
CisJCXRtcHBhciBePSBjdXI7CisJCXJwOCBePSB0bXBwYXI7CisKKwkJY3Vy
ID0gKmJwKys7CisJCXRtcHBhciBePSBjdXI7CisJCXJwNCBePSBjdXI7CisJ
CXJwNiBePSBjdXI7CisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIgXj0gY3Vy
OworCQlycDYgXj0gY3VyOworCQljdXIgPSAqYnArKzsKKwkJdG1wcGFyIF49
IGN1cjsKKwkJcnA0IF49IGN1cjsKKwkJY3VyID0gKmJwKys7CisJCXRtcHBh
ciBePSBjdXI7CisJCXJwMTAgXj0gdG1wcGFyOworCisJCWN1ciA9ICpicCsr
OworCQl0bXBwYXIgXj0gY3VyOworCQlycDQgXj0gY3VyOworCQlycDYgXj0g
Y3VyOworCQlycDggXj0gY3VyOworCQljdXIgPSAqYnArKzsKKwkJdG1wcGFy
IF49IGN1cjsKKwkJcnA2IF49IGN1cjsKKwkJcnA4IF49IGN1cjsKKwkJY3Vy
ID0gKmJwKys7CisJCXRtcHBhciBePSBjdXI7CisJCXJwNCBePSBjdXI7CisJ
CXJwOCBePSBjdXI7CisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIgXj0gY3Vy
OworCQlycDggXj0gY3VyOworCisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIg
Xj0gY3VyOworCQlycDQgXj0gY3VyOworCQlycDYgXj0gY3VyOworCQljdXIg
PSAqYnArKzsKKwkJdG1wcGFyIF49IGN1cjsKKwkJcnA2IF49IGN1cjsKKwkJ
Y3VyID0gKmJwKys7CisJCXRtcHBhciBePSBjdXI7CisJCXJwNCBePSBjdXI7
CisJCWN1ciA9ICpicCsrOworCQl0bXBwYXIgXj0gY3VyOworCisJCXBhciBe
PSB0bXBwYXI7CisJCWlmICgoaSAmIDB4MSkgPT0gMCkKKwkJCXJwMTIgXj0g
dG1wcGFyOworCQlpZiAoKGkgJiAweDIpID09IDApCisJCQlycDE0IF49IHRt
cHBhcjsKIAl9CiAKLQkvKiBDcmVhdGUgbm9uLWludmVydGVkIEVDQyBjb2Rl
IGZyb20gbGluZSBwYXJpdHkgKi8KLQl0bXAxICA9IChyZWczICYgMHg4MCkg
Pj4gMDsgLyogQjcgLT4gQjcgKi8KLQl0bXAxIHw9IChyZWcyICYgMHg4MCkg
Pj4gMTsgLyogQjcgLT4gQjYgKi8KLQl0bXAxIHw9IChyZWczICYgMHg0MCkg
Pj4gMTsgLyogQjYgLT4gQjUgKi8KLQl0bXAxIHw9IChyZWcyICYgMHg0MCkg
Pj4gMjsgLyogQjYgLT4gQjQgKi8KLQl0bXAxIHw9IChyZWczICYgMHgyMCkg
Pj4gMjsgLyogQjUgLT4gQjMgKi8KLQl0bXAxIHw9IChyZWcyICYgMHgyMCkg
Pj4gMzsgLyogQjUgLT4gQjIgKi8KLQl0bXAxIHw9IChyZWczICYgMHgxMCkg
Pj4gMzsgLyogQjQgLT4gQjEgKi8KLQl0bXAxIHw9IChyZWcyICYgMHgxMCkg
Pj4gNDsgLyogQjQgLT4gQjAgKi8KLQotCXRtcDIgID0gKHJlZzMgJiAweDA4
KSA8PCA0OyAvKiBCMyAtPiBCNyAqLwotCXRtcDIgfD0gKHJlZzIgJiAweDA4
KSA8PCAzOyAvKiBCMyAtPiBCNiAqLwotCXRtcDIgfD0gKHJlZzMgJiAweDA0
KSA8PCAzOyAvKiBCMiAtPiBCNSAqLwotCXRtcDIgfD0gKHJlZzIgJiAweDA0
KSA8PCAyOyAvKiBCMiAtPiBCNCAqLwotCXRtcDIgfD0gKHJlZzMgJiAweDAy
KSA8PCAyOyAvKiBCMSAtPiBCMyAqLwotCXRtcDIgfD0gKHJlZzIgJiAweDAy
KSA8PCAxOyAvKiBCMSAtPiBCMiAqLwotCXRtcDIgfD0gKHJlZzMgJiAweDAx
KSA8PCAxOyAvKiBCMCAtPiBCMSAqLwotCXRtcDIgfD0gKHJlZzIgJiAweDAx
KSA8PCAwOyAvKiBCNyAtPiBCMCAqLwotCi0JLyogQ2FsY3VsYXRlIGZpbmFs
IEVDQyBjb2RlICovCisJLyoKKwkgKiBoYW5kbGUgdGhlIGZhY3QgdGhhdCB3
ZSB1c2UgbG9uZ3dvcmQgb3BlcmF0aW9ucworCSAqIHdlJ2xsIGJyaW5nIHJw
NC4ucnAxNCBiYWNrIHRvIHNpbmdsZSBieXRlIGVudGl0aWVzIGJ5IHNoaWZ0
aW5nIGFuZAorCSAqIHhvcmluZyBmaXJzdCBmb2xkIHRoZSB1cHBlciBhbmQg
bG93ZXIgMTYgYml0cywKKwkgKiB0aGVuIHRoZSB1cHBlciBhbmQgbG93ZXIg
OCBiaXRzLgorCSAqLworCXJwNCBePSAocnA0ID4+IDE2KTsKKwlycDQgXj0g
KHJwNCA+PiA4KTsKKwlycDQgJj0gMHhmZjsKKwlycDYgXj0gKHJwNiA+PiAx
Nik7CisJcnA2IF49IChycDYgPj4gOCk7CisJcnA2ICY9IDB4ZmY7CisJcnA4
IF49IChycDggPj4gMTYpOworCXJwOCBePSAocnA4ID4+IDgpOworCXJwOCAm
PSAweGZmOworCXJwMTAgXj0gKHJwMTAgPj4gMTYpOworCXJwMTAgXj0gKHJw
MTAgPj4gOCk7CisJcnAxMCAmPSAweGZmOworCXJwMTIgXj0gKHJwMTIgPj4g
MTYpOworCXJwMTIgXj0gKHJwMTIgPj4gOCk7CisJcnAxMiAmPSAweGZmOwor
CXJwMTQgXj0gKHJwMTQgPj4gMTYpOworCXJwMTQgXj0gKHJwMTQgPj4gOCk7
CisJcnAxNCAmPSAweGZmOworCisJLyoKKwkgKiB3ZSBhbHNvIG5lZWQgdG8g
Y2FsY3VsYXRlIHRoZSByb3cgcGFyaXR5IGZvciBycDAuLnJwMworCSAqIFRo
aXMgaXMgcHJlc2VudCBpbiBwYXIsIGJlY2F1c2UgcGFyIGlzIG5vdworCSAq
IHJwMyBycDMgcnAyIHJwMgorCSAqIGFzIHdlbGwgYXMKKwkgKiBycDEgcnAw
IHJwMSBycDAKKwkgKiBGaXJzdCBjYWxjdWxhdGUgcnAyIGFuZCBycDMKKwkg
KiAoYW5kIHllczogcnAyID0gKHBhciBeIHJwMykgJiAweGZmOyBidXQgZG9p
bmcgdGhhdCBkaWQgbm90CisJICogZ2l2ZSBhIHBlcmZvcm1hbmNlIGltcHJv
dmVtZW50KQorCSAqLworCXJwMyA9IChwYXIgPj4gMTYpOworCXJwMyBePSAo
cnAzID4+IDgpOworCXJwMyAmPSAweGZmOworCXJwMiA9IHBhciAmIDB4ZmZm
ZjsKKwlycDIgXj0gKHJwMiA+PiA4KTsKKwlycDIgJj0gMHhmZjsKKworCS8q
IHJlZHVjZSBwYXIgdG8gMTYgYml0cyB0aGVuIGNhbGN1bGF0ZSBycDEgYW5k
IHJwMCAqLworCXBhciBePSAocGFyID4+IDE2KTsKKwlycDEgPSAocGFyID4+
IDgpICYgMHhmZjsKKwlycDAgPSAocGFyICYgMHhmZik7CisKKwkvKiBmaW5h
bGx5IHJlZHVjZSBwYXIgdG8gOCBiaXRzICovCisJcGFyIF49IChwYXIgPj4g
OCk7CisJcGFyICY9IDB4ZmY7CisKKwkvKgorCSAqIGFuZCBjYWxjdWxhdGUg
cnA1Li5ycDE1CisJICogbm90ZSB0aGF0IHBhciA9IHJwNCBeIHJwNSBhbmQg
ZHVlIHRvIHRoZSBjb21tdXRhdGl2ZSBwcm9wZXJ0eQorCSAqIG9mIHRoZSBe
IG9wZXJhdG9yIHdlIGNhbiBzYXk6CisJICogcnA1ID0gKHBhciBeIHJwNCk7
CisJICogVGhlICYgMHhmZiBzZWVtcyBzdXBlcmZsdW91cywgYnV0IGJlbmNo
bWFya2luZyBsZWFybmVkIHRoYXQKKwkgKiBsZWF2aW5nIGl0IG91dCBnaXZl
cyBzbGlnaHRseSB3b3JzZSByZXN1bHRzLiBObyBpZGVhIHdoeSwgcHJvYmFi
bHkKKwkgKiBpdCBoYXMgdG8gZG8gd2l0aCB0aGUgd2F5IHRoZSBwaXBlbGlu
ZSBpbiBwZW50aXVtIGlzIG9yZ2FuaXplZC4KKwkgKi8KKwlycDUgPSAocGFy
IF4gcnA0KSAmIDB4ZmY7CisJcnA3ID0gKHBhciBeIHJwNikgJiAweGZmOwor
CXJwOSA9IChwYXIgXiBycDgpICYgMHhmZjsKKwlycDExID0gKHBhciBeIHJw
MTApICYgMHhmZjsKKwlycDEzID0gKHBhciBeIHJwMTIpICYgMHhmZjsKKwly
cDE1ID0gKHBhciBeIHJwMTQpICYgMHhmZjsKKworCS8qCisJICogRmluYWxs
eSBjYWxjdWxhdGUgdGhlIGVjYyBiaXRzLgorCSAqIEFnYWluIGhlcmUgaXQg
bWlnaHQgc2VlbSB0aGF0IHRoZXJlIGFyZSBwZXJmb3JtYW5jZSBvcHRpbWlz
YXRpb25zCisJICogcG9zc2libGUsIGJ1dCBiZW5jaG1hcmtzIHNob3dlZCB0
aGF0IG9uIHRoZSBzeXN0ZW0gdGhpcyBpcyBkZXZlbG9wZWQKKwkgKiB0aGUg
Y29kZSBiZWxvdyBpcyB0aGUgZmFzdGVzdAorCSAqLwogI2lmZGVmIENPTkZJ
R19NVERfTkFORF9FQ0NfU01DCi0JZWNjX2NvZGVbMF0gPSB+dG1wMjsKLQll
Y2NfY29kZVsxXSA9IH50bXAxOworCWNvZGVbMF0gPQorCSAgICAoaW52cGFy
aXR5W3JwN10gPDwgNykgfAorCSAgICAoaW52cGFyaXR5W3JwNl0gPDwgNikg
fAorCSAgICAoaW52cGFyaXR5W3JwNV0gPDwgNSkgfAorCSAgICAoaW52cGFy
aXR5W3JwNF0gPDwgNCkgfAorCSAgICAoaW52cGFyaXR5W3JwM10gPDwgMykg
fAorCSAgICAoaW52cGFyaXR5W3JwMl0gPDwgMikgfAorCSAgICAoaW52cGFy
aXR5W3JwMV0gPDwgMSkgfAorCSAgICAoaW52cGFyaXR5W3JwMF0pOworCWNv
ZGVbMV0gPQorCSAgICAoaW52cGFyaXR5W3JwMTVdIDw8IDcpIHwKKwkgICAg
KGludnBhcml0eVtycDE0XSA8PCA2KSB8CisJICAgIChpbnZwYXJpdHlbcnAx
M10gPDwgNSkgfAorCSAgICAoaW52cGFyaXR5W3JwMTJdIDw8IDQpIHwKKwkg
ICAgKGludnBhcml0eVtycDExXSA8PCAzKSB8CisJICAgIChpbnZwYXJpdHlb
cnAxMF0gPDwgMikgfAorCSAgICAoaW52cGFyaXR5W3JwOV0gPDwgMSkgIHwK
KwkgICAgKGludnBhcml0eVtycDhdKTsKICNlbHNlCi0JZWNjX2NvZGVbMF0g
PSB+dG1wMTsKLQllY2NfY29kZVsxXSA9IH50bXAyOworCWNvZGVbMV0gPQor
CSAgICAoaW52cGFyaXR5W3JwN10gPDwgNykgfAorCSAgICAoaW52cGFyaXR5
W3JwNl0gPDwgNikgfAorCSAgICAoaW52cGFyaXR5W3JwNV0gPDwgNSkgfAor
CSAgICAoaW52cGFyaXR5W3JwNF0gPDwgNCkgfAorCSAgICAoaW52cGFyaXR5
W3JwM10gPDwgMykgfAorCSAgICAoaW52cGFyaXR5W3JwMl0gPDwgMikgfAor
CSAgICAoaW52cGFyaXR5W3JwMV0gPDwgMSkgfAorCSAgICAoaW52cGFyaXR5
W3JwMF0pOworCWNvZGVbMF0gPQorCSAgICAoaW52cGFyaXR5W3JwMTVdIDw8
IDcpIHwKKwkgICAgKGludnBhcml0eVtycDE0XSA8PCA2KSB8CisJICAgIChp
bnZwYXJpdHlbcnAxM10gPDwgNSkgfAorCSAgICAoaW52cGFyaXR5W3JwMTJd
IDw8IDQpIHwKKwkgICAgKGludnBhcml0eVtycDExXSA8PCAzKSB8CisJICAg
IChpbnZwYXJpdHlbcnAxMF0gPDwgMikgfAorCSAgICAoaW52cGFyaXR5W3Jw
OV0gPDwgMSkgIHwKKwkgICAgKGludnBhcml0eVtycDhdKTsKICNlbmRpZgot
CWVjY19jb2RlWzJdID0gKCh+cmVnMSkgPDwgMikgfCAweDAzOwotCi0JcmV0
dXJuIDA7CisJY29kZVsyXSA9CisJICAgIChpbnZwYXJpdHlbcGFyICYgMHhm
MF0gPDwgNykgfAorCSAgICAoaW52cGFyaXR5W3BhciAmIDB4MGZdIDw8IDYp
IHwKKwkgICAgKGludnBhcml0eVtwYXIgJiAweGNjXSA8PCA1KSB8CisJICAg
IChpbnZwYXJpdHlbcGFyICYgMHgzM10gPDwgNCkgfAorCSAgICAoaW52cGFy
aXR5W3BhciAmIDB4YWFdIDw8IDMpIHwKKwkgICAgKGludnBhcml0eVtwYXIg
JiAweDU1XSA8PCAyKSB8CisJICAgIDM7CiB9Ci1FWFBPUlRfU1lNQk9MKG5h
bmRfY2FsY3VsYXRlX2VjYyk7Ci0KLXN0YXRpYyBpbmxpbmUgaW50IGNvdW50
Yml0cyh1aW50MzJfdCBieXRlKQotewotCWludCByZXMgPSAwOwogCi0JZm9y
ICg7Ynl0ZTsgYnl0ZSA+Pj0gMSkKLQkJcmVzICs9IGJ5dGUgJiAweDAxOwot
CXJldHVybiByZXM7Ci19CisjaWZuZGVmIFNUQU5EQUxPTkUKK0VYUE9SVF9T
WU1CT0wobmFuZF9jYWxjdWxhdGVfZWNjKTsKKyNlbmRpZgogCiAvKioKICAq
IG5hbmRfY29ycmVjdF9kYXRhIC0gW05BTkQgSW50ZXJmYWNlXSBEZXRlY3Qg
YW5kIGNvcnJlY3QgYml0IGVycm9yKHMpCi0gKiBAbXRkOglNVEQgYmxvY2sg
c3RydWN0dXJlCisgKiBAbXRkOglNVEQgYmxvY2sgc3RydWN0dXJlICh1bnVz
ZWQpCiAgKiBAZGF0OglyYXcgZGF0YSByZWFkIGZyb20gdGhlIGNoaXAKICAq
IEByZWFkX2VjYzoJRUNDIGZyb20gdGhlIGNoaXAKICAqIEBjYWxjX2VjYzoJ
dGhlIEVDQyBjYWxjdWxhdGVkIGZyb20gcmF3IGRhdGEKICAqCiAgKiBEZXRl
Y3QgYW5kIGNvcnJlY3QgYSAxIGJpdCBlcnJvciBmb3IgMjU2IGJ5dGUgYmxv
Y2sKICAqLwotaW50IG5hbmRfY29ycmVjdF9kYXRhKHN0cnVjdCBtdGRfaW5m
byAqbXRkLCB1X2NoYXIgKmRhdCwKLQkJICAgICAgdV9jaGFyICpyZWFkX2Vj
YywgdV9jaGFyICpjYWxjX2VjYykKK2ludCBuYW5kX2NvcnJlY3RfZGF0YShz
dHJ1Y3QgbXRkX2luZm8gKm10ZCwgdW5zaWduZWQgY2hhciAqYnVmLAorCQkg
ICAgICB1bnNpZ25lZCBjaGFyICpyZWFkX2VjYywgdW5zaWduZWQgY2hhciAq
Y2FsY19lY2MpCiB7Ci0JdWludDhfdCBzMCwgczEsIHMyOwotCisJaW50IG5y
X2JpdHM7CisJdW5zaWduZWQgY2hhciBiMCwgYjEsIGIyOworCXVuc2lnbmVk
IGNoYXIgYnl0ZV9hZGRyLCBiaXRfYWRkcjsKKworCS8qCisJICogYjAgdG8g
YjIgaW5kaWNhdGUgd2hpY2ggYml0IGlzIGZhdWx0eSAoaWYgYW55KQorCSAq
IHdlIG1pZ2h0IG5lZWQgdGhlIHhvciByZXN1bHQgIG1vcmUgdGhhbiBvbmNl
LAorCSAqIHNvIGtlZXAgdGhlbSBpbiBhIGxvY2FsIHZhcgorCSovCiAjaWZk
ZWYgQ09ORklHX01URF9OQU5EX0VDQ19TTUMKLQlzMCA9IGNhbGNfZWNjWzBd
IF4gcmVhZF9lY2NbMF07Ci0JczEgPSBjYWxjX2VjY1sxXSBeIHJlYWRfZWNj
WzFdOwotCXMyID0gY2FsY19lY2NbMl0gXiByZWFkX2VjY1syXTsKKwliMCA9
IHJlYWRfZWNjWzBdIF4gY2FsY19lY2NbMF07CisJYjEgPSByZWFkX2VjY1sx
XSBeIGNhbGNfZWNjWzFdOwogI2Vsc2UKLQlzMSA9IGNhbGNfZWNjWzBdIF4g
cmVhZF9lY2NbMF07Ci0JczAgPSBjYWxjX2VjY1sxXSBeIHJlYWRfZWNjWzFd
OwotCXMyID0gY2FsY19lY2NbMl0gXiByZWFkX2VjY1syXTsKKwliMCA9IHJl
YWRfZWNjWzFdIF4gY2FsY19lY2NbMV07CisJYjEgPSByZWFkX2VjY1swXSBe
IGNhbGNfZWNjWzBdOwogI2VuZGlmCi0JaWYgKChzMCB8IHMxIHwgczIpID09
IDApCi0JCXJldHVybiAwOworCWIyID0gcmVhZF9lY2NbMl0gXiBjYWxjX2Vj
Y1syXTsKIAotCS8qIENoZWNrIGZvciBhIHNpbmdsZSBiaXQgZXJyb3IgKi8K
LQlpZiggKChzMCBeIChzMCA+PiAxKSkgJiAweDU1KSA9PSAweDU1ICYmCi0J
ICAgICgoczEgXiAoczEgPj4gMSkpICYgMHg1NSkgPT0gMHg1NSAmJgotCSAg
ICAoKHMyIF4gKHMyID4+IDEpKSAmIDB4NTQpID09IDB4NTQpIHsKKwkvKiBj
aGVjayBpZiB0aGVyZSBhcmUgYW55IGJpdGZhdWx0cyAqLwogCi0JCXVpbnQz
Ml90IGJ5dGVvZmZzLCBiaXRudW07CisJLyogY291bnQgbnIgb2YgYml0czsg
dXNlIHRhYmxlIGxvb2t1cCwgZmFzdGVyIHRoYW4gY2FsY3VsYXRpbmcgaXQg
Ki8KKwlucl9iaXRzID0gYml0c3BlcmJ5dGVbYjBdICsgYml0c3BlcmJ5dGVb
YjFdICsgYml0c3BlcmJ5dGVbYjJdOwogCi0JCWJ5dGVvZmZzID0gKHMxIDw8
IDApICYgMHg4MDsKLQkJYnl0ZW9mZnMgfD0gKHMxIDw8IDEpICYgMHg0MDsK
LQkJYnl0ZW9mZnMgfD0gKHMxIDw8IDIpICYgMHgyMDsKLQkJYnl0ZW9mZnMg
fD0gKHMxIDw8IDMpICYgMHgxMDsKLQotCQlieXRlb2ZmcyB8PSAoczAgPj4g
NCkgJiAweDA4OwotCQlieXRlb2ZmcyB8PSAoczAgPj4gMykgJiAweDA0Owot
CQlieXRlb2ZmcyB8PSAoczAgPj4gMikgJiAweDAyOwotCQlieXRlb2ZmcyB8
PSAoczAgPj4gMSkgJiAweDAxOwotCi0JCWJpdG51bSA9IChzMiA+PiA1KSAm
IDB4MDQ7Ci0JCWJpdG51bSB8PSAoczIgPj4gNCkgJiAweDAyOwotCQliaXRu
dW0gfD0gKHMyID4+IDMpICYgMHgwMTsKLQotCQlkYXRbYnl0ZW9mZnNdIF49
ICgxIDw8IGJpdG51bSk7Ci0KLQkJcmV0dXJuIDE7CisJLyogcmVwZWF0ZWQg
aWYgc3RhdGVtZW50cyBhcmUgc2xpZ2h0bHkgbW9yZSBlZmZpY2llbnQgdGhh
biBzd2l0Y2ggLi4uICovCisJLyogb3JkZXJlZCBpbiBvcmRlciBvZiBsaWtl
bGlob29kICovCisJaWYgKG5yX2JpdHMgPT0gMCkKKwkJcmV0dXJuICgwKTsJ
Lyogbm8gZXJyb3IgKi8KKwlpZiAobnJfYml0cyA9PSAxMSkgewkvKiBjb3Jy
ZWN0YWJsZSBlcnJvciAqLworCQkvKgorCQkgKiBycDE1LzEzLzExLzkvNy81
LzMvMSBpbmRpY2F0ZSB3aGljaCBieXRlIGlzIHRoZSBmYXVsdHkgYnl0ZQor
CQkgKiBjcCA1LzMvMSBpbmRpY2F0ZSB0aGUgZmF1bHR5IGJpdC4KKwkJICog
QSBsb29rdXAgdGFibGUgKGNhbGxlZCBhZGRyZXNzYml0cykgaXMgdXNlZCB0
byBmaWx0ZXIKKwkJICogdGhlIGJpdHMgZnJvbSB0aGUgYnl0ZSB0aGV5IGFy
ZSBpbi4KKwkJICogQSBtYXJnaW5hbCBvcHRpbWlzYXRpb24gaXMgcG9zc2li
bGUgYnkgaGF2aW5nIHRocmVlCisJCSAqIGRpZmZlcmVudCBsb29rdXAgdGFi
bGVzLgorCQkgKiBPbmUgYXMgd2UgaGF2ZSBub3cgKGZvciBiMCksIG9uZSBm
b3IgYjIKKwkJICogKHRoYXQgd291bGQgYXZvaWQgdGhlID4+IDEpLCBhbmQg
b25lIGZvciBiMSAod2l0aCBhbGwgdmFsdWVzCisJCSAqIDw8IDQpLiBIb3dl
dmVyIGl0IHdhcyBmZWx0IHRoYXQgaW50cm9kdWNpbmcgdHdvIG1vcmUgdGFi
bGVzCisJCSAqIGhhcmRseSBqdXN0aWZ5IHRoZSBnYWluLgorCQkgKgorCQkg
KiBUaGUgYjIgc2hpZnQgaXMgdGhlcmUgdG8gZ2V0IHJpZCBvZiB0aGUgbG93
ZXN0IHR3byBiaXRzLgorCQkgKiBXZSBjb3VsZCBhbHNvIGRvIGFkZHJlc3Ni
aXRzW2IyXSA+PiAxIGJ1dCBmb3IgdGhlCisJCSAqIHBlcmZvcm1hY2UgaXQg
ZG9lcyBub3QgbWFrZSBhbnkgZGlmZmVyZW5jZQorCQkgKi8KKwkJYnl0ZV9h
ZGRyID0gKGFkZHJlc3NiaXRzW2IxXSA8PCA0KSArIGFkZHJlc3NiaXRzW2Iw
XTsKKwkJYml0X2FkZHIgPSBhZGRyZXNzYml0c1tiMiA+PiAyXTsKKwkJLyog
ZmxpcCB0aGUgYml0ICovCisJCWJ1ZltieXRlX2FkZHJdIF49ICgxIDw8IGJp
dF9hZGRyKTsKKwkJcmV0dXJuICgxKTsKIAl9Ci0KLQlpZihjb3VudGJpdHMo
czAgfCAoKHVpbnQzMl90KXMxIDw8IDgpIHwgKCh1aW50MzJfdClzMiA8PDE2
KSkgPT0gMSkKLQkJcmV0dXJuIDE7Ci0KLQlyZXR1cm4gLUVCQURNU0c7CisJ
aWYgKG5yX2JpdHMgPT0gMSkKKwkJcmV0dXJuICgxKTsJLyogZXJyb3IgaW4g
ZWNjIGRhdGE7IG5vIGFjdGlvbiBuZWVkZWQgKi8KKwlyZXR1cm4gLTE7CiB9
CisKKyNpZm5kZWYgU1RBTkRBTE9ORQogRVhQT1JUX1NZTUJPTChuYW5kX2Nv
cnJlY3RfZGF0YSk7CiAKIE1PRFVMRV9MSUNFTlNFKCJHUEwiKTsKLU1PRFVM
RV9BVVRIT1IoIlN0ZXZlbiBKLiBIaWxsIDxzamhpbGxAcmVhbGl0eWRpbHV0
ZWQuY29tPiIpOworTU9EVUxFX0FVVEhPUigiRnJhbnMgTWV1bGVuYnJvZWtz
Iik7CiBNT0RVTEVfREVTQ1JJUFRJT04oIkdlbmVyaWMgTkFORCBFQ0Mgc3Vw
cG9ydCIpOworI2VuZGlmClNpZ25lZC1vZmYtYnk6IEZyYW5zIE1ldWxlbmJy
b2Vrcwo=

--0-415199513-1217354338=:23881--

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-07-29 17:58 Frans Meulenbroeks
@ 2008-07-29 20:04 ` Ricard Wanderlof
  2008-07-30  6:17 ` Artem Bityutskiy
  1 sibling, 0 replies; 29+ messages in thread
From: Ricard Wanderlof @ 2008-07-29 20:04 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd@lists.infradead.org


On Tue, 29 Jul 2008, Frans Meulenbroeks wrote:

> As my email client wraps lines at 79 chars I've added the patch as an 
> attachement instead of inlining it (the line with the filenames created 
> by diff exceeds 79 chars).

At least my email client (pine) doesn't seem to recognize it as an 
attachment but shows it inline. Don't know if anyone else had this 
problem.

/Ricard

>
>
> --0-415199513-1217354338=:23881
> Content-Type: text/x-patch; name="ecc.patch"
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment; filename="ecc.patch"
>
> ZGlmZiAtdXJOIGxpbnV4LTIuNi4yNS4xMC9Eb2N1bWVudGF0aW9uL25hbmQv
> ZWNjLnR4dCBsaW51eC0yLjYuMjUuMTAud29yay9Eb2N1bWVudGF0aW9uL25h
> ....

--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-07-29 17:58 Frans Meulenbroeks
  2008-07-29 20:04 ` Ricard Wanderlof
@ 2008-07-30  6:17 ` Artem Bityutskiy
  1 sibling, 0 replies; 29+ messages in thread
From: Artem Bityutskiy @ 2008-07-30  6:17 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd

On Tue, 2008-07-29 at 10:58 -0700, Frans Meulenbroeks wrote:
> Dear all,
> 
> A resubmit of my patch, with all comments from Thomas addressed.

I'd suggest you to glance at Documentation/email-clients.txt

-- 
Best regards,
Artem Bityutskiy (Битюцкий Артём)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
@ 2008-07-31  8:35 frans
  2008-08-11 11:35 ` Frans Meulenbroeks
  2008-08-11 16:30 ` David Woodhouse
  0 siblings, 2 replies; 29+ messages in thread
From: frans @ 2008-07-31  8:35 UTC (permalink / raw)
  To: linux-mtd

Dear all,

A resubmit of my patch, with all comments from Thomas addressed, and this 
time wiht the patch inlined and submitted using pine/gmail

This patch improves the performance of the ecc generation code by a factor
of 18 on an INTEL D920 CPU, a factor of 7 on MIPS and a factor of 5 on ARM 
(NSLU2)

Please let me know if additional changes are needed.

Best regards, Frans.

diff -urN linux-2.6.25.10/Documentation/nand/ecc.txt linux-2.6.25.10.work/Documentation/nand/ecc.txt
--- linux-2.6.25.10/Documentation/nand/ecc.txt	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.25.10.work/Documentation/nand/ecc.txt	2008-07-29 19:25:19.000000000 +0200
@@ -0,0 +1,714 @@
+Introduction
+============
+
+Having looked at the linux mtd/nand driver and more specific at nand_ecc.c 
+I felt there was room for optimisation. I bashed the code for a few hours
+performing tricks like table lookup removing superfluous code etc. 
+After that the speed was increased by 35-40%. 
+Still I was not too happy as I felt there was additional room for improvement.
+
+Bad! I was hooked.
+I decided to annotate my steps in this file. Perhaps it is useful to someone
+or someone learns something from it.
+
+
+The problem
+===========
+
+NAND flash (at least SLC one) typically has sectors of 256 bytes.
+However NAND flash is not extremely reliable so some error detection
+(and sometimes correction) is needed.
+
+This is done by means of a Hamming code. I'll try to explain it in
+laymans terms (and apologies to all the pro's in the field in case I do
+not use the right terminology, my coding theory class was almost 30
+years ago, and I must admit it was not one of my favourites).
+
+As I said before the ecc calculation is performed on sectors of 256
+bytes. This is done by calculating several parity bits over the rows and
+columns. The parity used is even parity which means that the parity bit = 1
+if the data over which the parity is calculated is 1 and the parity bit = 0
+if the data over which the parity is calculated is 0. So the total
+number of bits over the data over which the parity is calculated + the
+parity bit is even. (see wikipedia if you can't follow this).
+Parity is often calculated by means of an exclusive or operation,
+sometimes also referred to as xor. In C the operator for xor is ^
+
+Back to ecc.
+Let's give a small figure:
+
+byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
+byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
+byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
+byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
+byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
+....
+byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
+byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
+           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
+           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
+           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
+
+This figure represents a sector of 256 bytes.
+cp is my abbreviaton for column parity, rp for row parity.
+
+Let's start to explain column parity.
+cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
+so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
+Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
+cp2 is the parity over bit0, bit1, bit4 and bit5
+cp3 is the parity over bit2, bit3, bit6 and bit7.
+cp4 is the parity over bit0, bit1, bit2 and bit3.
+cp5 is the parity over bit4, bit5, bit6 and bit7.
+Note that each of cp0 .. cp5 is exactly one bit.
+
+Row parity actually works almost the same.
+rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
+rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
+rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... 
+(so handle two bytes, then skip 2 bytes).
+rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
+for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
+so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
+and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
+The story now becomes quite boring. I guess you get the idea.
+rp6 covers 8 bytes then skips 8 etc
+rp7 skips 8 bytes then covers 8 etc
+rp8 covers 16 bytes then skips 16 etc
+rp9 skips 16 bytes then covers 16 etc
+rp10 covers 32 bytes then skips 32 etc
+rp11 skips 32 bytes then covers 32 etc
+rp12 covers 64 bytes then skips 64 etc
+rp13 skips 64 bytes then covers 64 etc
+rp14 covers 128 bytes then skips 128
+rp15 skips 128 bytes then covers 128 
+
+In the end the parity bits are grouped together in three bytes as
+follows:
+ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
+ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
+ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
+
+I detected after writing this that ST application note AN1823
+(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much
+nicer picture.(but they use line parity as term where I use row parity)
+Oh well, I'm graphically challenged, so suffer with me for a moment :-)
+And I could not reuse the ST picture anyway for copyright reasons.
+
+
+Attempt 0
+=========
+
+Implementing the parity calculation is pretty simple.
+In C pseudocode:
+for (i = 0; i < 256; i++)
+{
+    if (i & 0x01)
+       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    else
+       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    if (i & 0x02)
+       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
+    else
+       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
+    if (i & 0x04)
+      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
+    else
+      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
+    if (i & 0x08)
+      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
+    else
+      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
+    if (i & 0x10)
+      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
+    else
+      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
+    if (i & 0x20)
+      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
+    else
+    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
+    if (i & 0x40)
+      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
+    else
+      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
+    if (i & 0x80)
+      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
+    else
+      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
+    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
+    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
+    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
+    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
+    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
+    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
+}
+
+
+Analysis 0
+==========
+
+C does have bitwise operators but not really operators to do the above
+efficiently (and most hardware has no such instructions either).
+Therefore without implementing this it was clear that the code above was
+not going to bring me a Nobel prize :-)
+
+Fortunately the exclusive or operation is commutative, so we can combine
+the values in any order. So instead of calculating all the bits
+individually, let us try to rearrange things.
+For the column parity this is easy. We can just xor the bytes and in the
+end filter out the relevant bits. This is pretty nice as it will bring
+all cp calculation out of the if loop.
+
+Similarly we can first xor the bytes for the various rows.
+This leads to:
+
+
+Attempt 1
+=========
+
+const char parity[256] = {
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
+};
+
+void ecc1(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned char *bp = buf;
+    unsigned char cur;
+    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned char par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 256; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
+        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
+        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
+    }
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+Still pretty straightforward. The last three invert statements are there to 
+give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
+all data is 0xff, so the checksum then matches.
+
+I also introduced the parity lookup. I expected this to be the fastest
+way to calculate the parity, but I will investigate alternatives later
+on.
+
+
+Analysis 1
+==========
+
+The code works, but is not terribly efficient. On my system it took
+almost 4 times as much time as the linux driver code. But hey, if it was
+*that* easy this would have been done long before.
+No pain. no gain.
+
+Fortunately there is plenty of room for improvement.
+
+In step 1 we moved from bit-wise calculation to byte-wise calculation.
+However in C we can also use the unsigned long data type and virtually
+every modern microprocessor supports 32 bit operations, so why not try
+to write our code in such a way that we process data in 32 bit chunks.
+
+Of course this means some modification as the row parity is byte by
+byte. A quick analysis:
+for the column parity we use the par variable. When extending to 32 bits 
+we can in the end easily calculate p0 and p1 from it.
+(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
+respectively)
+also rp2 and rp3 can be easily retrieved from par as rp3 covers the
+first two bytes and rp2 the last two bytes.
+
+Note that of course now the loop is executed only 64 times (256/4). 
+And note that care must taken wrt byte ordering. The way bytes are
+ordered in a long is machine dependent, and might affect us. 
+Anyway, if there is an issue: this code is developed on x86 (to be
+precise: a DELL PC with a D920 Intel CPU)
+
+And of course the performance might depend on alignment, but I expect
+that the I/O buffers in the nand driver are aligned properly (and
+otherwise that should be fixed to get maximum performance).
+
+Let's give it a try...
+
+
+Attempt 2
+=========
+
+extern const char parity[256];
+
+void ecc2(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned long *bp = (unsigned long *)buf;
+    unsigned long cur;
+    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned long par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 64; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+    }
+    /*
+       we need to adapt the code generation for the fact that rp vars are now
+       long; also the column parity calculation needs to be changed.
+       we'll bring rp4 to 15 back to single byte entities by shifting and
+       xoring
+    */
+    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
+    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
+    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
+    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
+    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
+    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
+    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
+    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
+    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
+    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
+    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
+    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
+    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
+    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
+    par ^= (par >> 16);
+    rp1 = (par >> 8); rp1 &= 0xff;
+    rp0 = (par & 0xff);
+    par ^= (par >> 8); par &= 0xff;
+
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+The parity array is not shown any more. Note also that for these
+examples I kinda deviated from my regular programming style by allowing
+multiple statements on a line, not using { } in then and else blocks
+with only a single statement and by using operators like ^=
+
+
+Analysis 2
+==========
+
+The code (of course) works, and hurray: we are a little bit faster than
+the linux driver code (about 15%). But wait, don't cheer too quickly.
+THere is more to be gained.
+If we look at e.g. rp14 and rp15 we see that we either xor our data with
+rp14 or with rp15. However we also have par which goes over all data.
+This means there is no need to calculate rp14 as it can be calculated from
+rp15 through rp14 = par ^ rp15;
+(or if desired we can avoid calculating rp15 and calculate it from
+rp14).  That is why some places refer to inverse parity.
+Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
+Effectively this means we can eliminate the else clause from the if
+statements. Also we can optimise the calculation in the end a little bit
+by going from long to byte first. Actually we can even avoid the table
+lookups
+
+Attempt 3
+=========
+
+Odd replaced:
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+with
+        if (i & 0x01) rp5 ^= cur;
+        if (i & 0x02) rp7 ^= cur;
+        if (i & 0x04) rp9 ^= cur; 
+        if (i & 0x08) rp11 ^= cur;
+        if (i & 0x10) rp13 ^= cur;
+        if (i & 0x20) rp15 ^= cur;
+
+        and outside the loop added:
+    rp4  = par ^ rp5;
+    rp6  = par ^ rp7;
+    rp8  = par ^ rp9;
+    rp10  = par ^ rp11;
+    rp12  = par ^ rp13;
+    rp14  = par ^ rp15;
+
+And after that the code takes about 30% more time, although the number of
+statements is reduced. This is also reflected in the assembly code.
+
+
+Analysis 3
+==========
+
+Very weird. Guess it has to do with caching or instruction parallellism
+or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
+observation was that this one is only 30% slower (according to time)
+executing the code as my 3Ghz D920 processor.
+
+Well, it was expected not to be easy so maybe instead move to a
+different track: let's move back to the code from attempt2 and do some
+loop unrolling. This will eliminate a few if statements. I'll try
+different amounts of unrolling to see what works best.
+
+
+Attempt 4
+=========
+
+Unrolled the loop 1, 2, 3 and 4 times.
+For 4 the code starts with:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        rp4 ^= cur;
+        rp6 ^= cur;
+        rp8 ^= cur;
+        rp10 ^= cur;
+        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
+        cur = *bp++;
+        par ^= cur;
+        rp5 ^= cur;
+        rp6 ^= cur;
+        ...
+
+
+Analysis 4
+==========
+
+Unrolling once gains about 15%
+Unrolling twice keeps the gain at about 15%
+Unrolling three times gives a gain of 30% compared to attempt 2.
+Unrolling four times gives a marginal improvement compared to unrolling
+three times.
+
+I decided to proceed with a four time unrolled loop anyway. It was my gut
+feeling that in the next steps I would obtain additional gain from it.
+
+The next step was triggered by the fact that par contains the xor of all
+bytes and rp4 and rp5 each contain the xor of half of the bytes.
+So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
+that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
+eliminate rp5 (or rp4, but I already foresaw another optimisation).
+The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
+
+
+Attempt 5
+=========
+
+Effectively so all odd digit rp assignments in the loop were removed.
+This included the else clause of the if statements.
+Of course after the loop we need to correct things by adding code like:
+    rp5 = par ^ rp4;
+Also the initial assignments (rp5 = 0; etc) could be removed.
+Along the line I also removed the initialisation of rp0/1/2/3.
+
+
+Analysis 5
+==========
+
+Measurements showed this was a good move. The run-time roughly halved
+compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
+of the processor time compared to the current code in the linux kernel.
+
+However, still I thought there was more. I didn't like all the if
+statements. Why not keep a running parity and only keep the last if
+statement. Time for yet another version!
+
+
+Attempt 6
+=========
+
+THe code within the for loop was changed to:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+
+As you can see tmppar is used to accumulate the parity within a for
+iteration. In the last 3 statements is is added to par and, if needed,
+to rp12 and rp14.
+
+While making the changes I also found that I could exploit that tmppar
+contains the running parity for this iteration. So instead of having:
+rp4 ^= cur; rp6 = cur;
+I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
+statement. A similar change was done for rp8 and rp10
+
+
+Analysis 6
+==========
+
+Measuring this code again showed big gain. When executing the original
+linux code 1 million times, this took about 1 second on my system.
+(using time to measure the performance). After this iteration I was back
+to 0.075 sec. Actually I had to decide to start measuring over 10
+million interations in order not to loose too much accuracy. This one
+definitely seemed to be the jackpot!
+
+There is a little bit more room for improvement though. There are three
+places with statements:
+rp4 ^= cur; rp6 ^= cur;
+It seems more efficient to also maintain a variable rp4_6 in the while
+loop; This eliminates 3 statements per loop. Of course after the loop we
+need to correct by adding:
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6
+Furthermore there are 4 sequential assingments to rp8. This can be
+encoded slightly more efficient by saving tmppar before those 4 lines
+and later do rp8 = rp8 ^ tmppar ^ notrp8;
+(where notrp8 is the value of rp8 before those 4 lines).
+Again a use of the commutative property of xor.
+Time for a new test!
+
+
+Attempt 7
+=========
+
+The new code now looks like:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    notrp8 = tmppar;
+	    cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+	    rp8 = rp8 ^ tmppar ^ notrp8;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6;
+
+
+Not a big change, but every penny counts :-)
+
+
+Analysis 7
+==========
+
+Acutally this made things worse. Not very much, but I don't want to move
+into the wrong direction. Maybe something to investigate later. Could
+have to do with caching again.
+
+Guess that is what there is to win within the loop. Maybe unrolling one
+more time will help. I'll keep the optimisations from 7 for now.
+
+
+Attempt 8
+=========
+
+Unrolled the loop one more time.
+
+
+Analysis 8
+==========
+
+This makes things worse. Let's stick with attempt 6 and continue from there.
+Although it seems that the code within the loop cannot be optimised
+further there is still room to optimize the generation of the ecc codes.
+We can simply calcualate the total parity. If this is 0 then rp4 = rp5
+etc. If the parity is 1, then rp4 = !rp5;
+But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
+in the result byte and then do something like
+    code[0] |= (code[0] << 1);
+Lets test this.
+
+
+Attempt 9
+=========
+
+Changed the code but again this slightly degrades performance. Tried all
+kind of other things, like having dedicated parity arrays to avoid the
+shift after parity[rp7] << 7; No gain.
+Change the lookup using the parity array by using shift operators (e.g.
+replace parity[rp7] << 7 with:
+rp7 ^= (rp7 << 4);
+rp7 ^= (rp7 << 2);
+rp7 ^= (rp7 << 1);
+rp7 &= 0x80;
+No gain.
+
+The only marginal change was inverting the parity bits, so we can remove
+the last three invert statements.
+
+Ah well, pity this does not deliver more. Then again 10 million
+iterations using the linux driver code takes between 13 and 13.5
+seconds, whereas my code now takes about 0.73 seconds for those 10
+million iterations. So basically I've improved the performance by a
+factor 18 on my system. Not that bad. Of course on different hardware
+you will get different results. No warranties!
+
+But of course there is no such thing as a free lunch. The codesize almost
+tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
+
+
+Correcting errors
+=================
+
+For correcting errors I again used the ST application note as a starter,
+but I also peeked at the existing code.
+The algorithm itself is pretty straightforward. Just xor the given and
+the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
+are 1 we have one correctable bit error. If there is 1 bit 1, we have an
+error in the given ecc code. 
+It proved to be fastest to do some table lookups. Performance gain
+introduced by this is about a factor 2 on my system when a repair had to
+be done, and 1% or so if no repair had to be done.
+Code size increased from 330 bytes to 686 bytes for this function.
+(gcc 4.2, -O3)
+
+
+Conclusion
+==========
+
+The gain when calculating the ecc is tremendous. Om my development hardware 
+a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
+embedded system with a MIPS core a factor 7 was obtained.
+On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
+5 (big endian mode, gcc 4.1.2, -O3)
+For correction not much gain could be obtained (as bitflips are rare). Then
+again there are also much less cycles spent there.
+
+It seems there is not much more gain possible in this, at least when
+programmed in C. Of course it might be possible to squeeze something more
+out of it with an assembler program, but due to pipeline behaviour etc
+this is very tricky (at least for intel hw).
+
+Author: Frans Meulenbroeks
+Copyright (C) 2008 Koninklijke Philips Electronics NV.
diff -urN linux-2.6.25.10/drivers/mtd/nand/nand_ecc.c linux-2.6.25.10.work/drivers/mtd/nand/nand_ecc.c
--- linux-2.6.25.10/drivers/mtd/nand/nand_ecc.c	2008-07-03 05:46:47.000000000 +0200
+++ linux-2.6.25.10.work/drivers/mtd/nand/nand_ecc.c	2008-07-30 09:53:59.000000000 +0200
@@ -1,15 +1,18 @@
 /*
- * This file contains an ECC algorithm from Toshiba that detects and
- * corrects 1 bit errors in a 256 byte block of data.
+ * This file contains an ECC algorithm that detects and corrects 1 bit
+ * errors in a 256 byte block of data.
  *
  * drivers/mtd/nand/nand_ecc.c
  *
- * Copyright (C) 2000-2004 Steven J. Hill (sjhill@realitydiluted.com)
- *                         Toshiba America Electronics Components, Inc.
+ * Copyright (C) 2008 Koninklijke Philips Electronics NV.
+ *                    Author: Frans Meulenbroeks
  *
- * Copyright (C) 2006 Thomas Gleixner <tglx@linutronix.de>
+ * Completely replaces the previous ECC implementation which was written by:
+ *   Steven J. Hill (sjhill@realitydiluted.com)
+ *   Thomas Gleixner (tglx@linutronix.de)
  *
- * $Id: nand_ecc.c,v 1.15 2005/11/07 11:14:30 gleixner Exp $
+ * Information on how this algorithm works and how it was developed
+ * can be found in Documentation/nand/ecc.txt
  *
  * This file is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
@@ -25,174 +28,415 @@
  * with this file; if not, write to the Free Software Foundation, Inc.,
  * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
  *
- * As a special exception, if other files instantiate templates or use
- * macros or inline functions from these files, or you compile these
- * files and link them with other works to produce a work based on these
- * files, these files do not by themselves cause the resulting work to be
- * covered by the GNU General Public License. However the source code for
- * these files must still be made available in accordance with section (3)
- * of the GNU General Public License.
- *
- * This exception does not invalidate any other reasons why a work based on
- * this file might be covered by the GNU General Public License.
  */

+/*
+ * The STANDALONE macro is useful when running the code outside the kernel
+ * e.g. when running the code in a testbed or a benchmark program.
+ * When STANDALONE is used, the module related macros are commented out
+ * as well as the linux include files.
+ * Instead a private definition of mtd_into is given to satisfy the compiler
+ * (the code does not use mtd_info, so the code does not care)
+ */
+#ifndef STANDALONE
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mtd/nand_ecc.h>
+#else
+struct mtd_info {
+	int dummy;
+};
+#define EXPORT_SYMBOL(x)  /* x */
+
+#define MODULE_LICENSE(x)	/* x */
+#define MODULE_AUTHOR(x)	/* x */
+#define MODULE_DESCRIPTION(x)	/* x */
+#endif
+
+/*
+ * invparity is a 256 byte table that contains the odd parity
+ * for each byte. So if the number of bits in a byte is even,
+ * the array element is 1, and when the number of bits is odd
+ * the array eleemnt is 0.
+ */
+static const char invparity[256] = {
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1
+};
+
+/*
+ * bitsperbyte contains the number of bits per byte
+ * this is only used for testing and repairing parity
+ * (a precalculated value slightly improves performance)
+ */
+static const char bitsperbyte[256] = {
+	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
+};

 /*
- * Pre-calculated 256-way 1 byte column parity
+ * addressbits is a lookup table to filter out the bits from the xor-ed
+ * ecc data that identify the faulty location.
+ * this is only used for repairing parity
+ * see the comments in nand_correct_data for more details
  */
-static const u_char nand_ecc_precalc_table[] = {
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00
+static const char addressbits[256] = {
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f
 };

 /**
  * nand_calculate_ecc - [NAND Interface] Calculate 3-byte ECC for 256-byte block
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
  * @dat:	raw data
  * @ecc_code:	buffer for ECC
  */
-int nand_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
-		       u_char *ecc_code)
+int nand_calculate_ecc(struct mtd_info *mtd, const unsigned char *buf,
+		       unsigned char *code)
 {
-	uint8_t idx, reg1, reg2, reg3, tmp1, tmp2;
 	int i;
-
-	/* Initialize variables */
-	reg1 = reg2 = reg3 = 0;
-
-	/* Build up column parity */
-	for(i = 0; i < 256; i++) {
-		/* Get CP0 - CP5 from table */
-		idx = nand_ecc_precalc_table[*dat++];
-		reg1 ^= (idx & 0x3f);
-
-		/* All bit XOR = 1 ? */
-		if (idx & 0x40) {
-			reg3 ^= (uint8_t) i;
-			reg2 ^= ~((uint8_t) i);
-		}
+	const unsigned long *bp = (unsigned long *)buf;
+	unsigned long cur;	/* current value in buffer */
+	/* rp0..rp15 are the various accumulated parities (per byte) */
+	unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+	unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+	unsigned long par;	/* the cumulative parity for all data */
+	unsigned long tmppar;	/* the cumulative parity for this iteration;
+				   for rp12 and rp14 at the end of the loop */
+
+	par = 0;
+	rp4 = 0;
+	rp6 = 0;
+	rp8 = 0;
+	rp10 = 0;
+	rp12 = 0;
+	rp14 = 0;
+
+	/*
+	 * The loop is unrolled a number of times;
+	 * This avoids if statements to decide on which rp value to update
+	 * Also we process the data by longwords.
+	 * Note: passing unaligned data might give a performance penalty.
+	 * It is assumed that the buffers are aligned.
+	 * tmppar is the cumulative sum of this iteration.
+	 * needed for calculating rp12, rp14 and par
+	 * also used as a performance improvement for rp6, rp8 and rp10
+	 */
+	for (i = 0; i < 4; i++) {
+		cur = *bp++;
+		tmppar = cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= tmppar;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= tmppar;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp10 ^= tmppar;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= cur;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+
+		par ^= tmppar;
+		if ((i & 0x1) == 0)
+			rp12 ^= tmppar;
+		if ((i & 0x2) == 0)
+			rp14 ^= tmppar;
 	}

-	/* Create non-inverted ECC code from line parity */
-	tmp1  = (reg3 & 0x80) >> 0; /* B7 -> B7 */
-	tmp1 |= (reg2 & 0x80) >> 1; /* B7 -> B6 */
-	tmp1 |= (reg3 & 0x40) >> 1; /* B6 -> B5 */
-	tmp1 |= (reg2 & 0x40) >> 2; /* B6 -> B4 */
-	tmp1 |= (reg3 & 0x20) >> 2; /* B5 -> B3 */
-	tmp1 |= (reg2 & 0x20) >> 3; /* B5 -> B2 */
-	tmp1 |= (reg3 & 0x10) >> 3; /* B4 -> B1 */
-	tmp1 |= (reg2 & 0x10) >> 4; /* B4 -> B0 */
-
-	tmp2  = (reg3 & 0x08) << 4; /* B3 -> B7 */
-	tmp2 |= (reg2 & 0x08) << 3; /* B3 -> B6 */
-	tmp2 |= (reg3 & 0x04) << 3; /* B2 -> B5 */
-	tmp2 |= (reg2 & 0x04) << 2; /* B2 -> B4 */
-	tmp2 |= (reg3 & 0x02) << 2; /* B1 -> B3 */
-	tmp2 |= (reg2 & 0x02) << 1; /* B1 -> B2 */
-	tmp2 |= (reg3 & 0x01) << 1; /* B0 -> B1 */
-	tmp2 |= (reg2 & 0x01) << 0; /* B7 -> B0 */
-
-	/* Calculate final ECC code */
+	/*
+	 * handle the fact that we use longword operations
+	 * we'll bring rp4..rp14 back to single byte entities by shifting and
+	 * xoring first fold the upper and lower 16 bits,
+	 * then the upper and lower 8 bits.
+	 */
+	rp4 ^= (rp4 >> 16);
+	rp4 ^= (rp4 >> 8);
+	rp4 &= 0xff;
+	rp6 ^= (rp6 >> 16);
+	rp6 ^= (rp6 >> 8);
+	rp6 &= 0xff;
+	rp8 ^= (rp8 >> 16);
+	rp8 ^= (rp8 >> 8);
+	rp8 &= 0xff;
+	rp10 ^= (rp10 >> 16);
+	rp10 ^= (rp10 >> 8);
+	rp10 &= 0xff;
+	rp12 ^= (rp12 >> 16);
+	rp12 ^= (rp12 >> 8);
+	rp12 &= 0xff;
+	rp14 ^= (rp14 >> 16);
+	rp14 ^= (rp14 >> 8);
+	rp14 &= 0xff;
+
+	/*
+	 * we also need to calculate the row parity for rp0..rp3
+	 * This is present in par, because par is now
+	 * rp3 rp3 rp2 rp2
+	 * as well as
+	 * rp1 rp0 rp1 rp0
+	 * First calculate rp2 and rp3
+	 * (and yes: rp2 = (par ^ rp3) & 0xff; but doing that did not
+	 * give a performance improvement)
+	 */
+	rp3 = (par >> 16);
+	rp3 ^= (rp3 >> 8);
+	rp3 &= 0xff;
+	rp2 = par & 0xffff;
+	rp2 ^= (rp2 >> 8);
+	rp2 &= 0xff;
+
+	/* reduce par to 16 bits then calculate rp1 and rp0 */
+	par ^= (par >> 16);
+	rp1 = (par >> 8) & 0xff;
+	rp0 = (par & 0xff);
+
+	/* finally reduce par to 8 bits */
+	par ^= (par >> 8);
+	par &= 0xff;
+
+	/*
+	 * and calculate rp5..rp15
+	 * note that par = rp4 ^ rp5 and due to the commutative property
+	 * of the ^ operator we can say:
+	 * rp5 = (par ^ rp4);
+	 * The & 0xff seems superfluous, but benchmarking learned that
+	 * leaving it out gives slightly worse results. No idea why, probably
+	 * it has to do with the way the pipeline in pentium is organized.
+	 */
+	rp5 = (par ^ rp4) & 0xff;
+	rp7 = (par ^ rp6) & 0xff;
+	rp9 = (par ^ rp8) & 0xff;
+	rp11 = (par ^ rp10) & 0xff;
+	rp13 = (par ^ rp12) & 0xff;
+	rp15 = (par ^ rp14) & 0xff;
+
+	/*
+	 * Finally calculate the ecc bits.
+	 * Again here it might seem that there are performance optimisations
+	 * possible, but benchmarks showed that on the system this is developed
+	 * the code below is the fastest
+	 */
 #ifdef CONFIG_MTD_NAND_ECC_SMC
-	ecc_code[0] = ~tmp2;
-	ecc_code[1] = ~tmp1;
+	code[0] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[1] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
 #else
-	ecc_code[0] = ~tmp1;
-	ecc_code[1] = ~tmp2;
+	code[1] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[0] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
 #endif
-	ecc_code[2] = ((~reg1) << 2) | 0x03;
-
-	return 0;
+	code[2] =
+	    (invparity[par & 0xf0] << 7) |
+	    (invparity[par & 0x0f] << 6) |
+	    (invparity[par & 0xcc] << 5) |
+	    (invparity[par & 0x33] << 4) |
+	    (invparity[par & 0xaa] << 3) |
+	    (invparity[par & 0x55] << 2) |
+	    3;
 }
 EXPORT_SYMBOL(nand_calculate_ecc);

-static inline int countbits(uint32_t byte)
-{
-	int res = 0;
-
-	for (;byte; byte >>= 1)
-		res += byte & 0x01;
-	return res;
-}
-
 /**
  * nand_correct_data - [NAND Interface] Detect and correct bit error(s)
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
  * @dat:	raw data read from the chip
  * @read_ecc:	ECC from the chip
  * @calc_ecc:	the ECC calculated from raw data
  *
  * Detect and correct a 1 bit error for 256 byte block
  */
-int nand_correct_data(struct mtd_info *mtd, u_char *dat,
-		      u_char *read_ecc, u_char *calc_ecc)
+int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
+		      unsigned char *read_ecc, unsigned char *calc_ecc)
 {
-	uint8_t s0, s1, s2;
-
+	int nr_bits;
+	unsigned char b0, b1, b2;
+	unsigned char byte_addr, bit_addr;
+
+	/*
+	 * b0 to b2 indicate which bit is faulty (if any)
+	 * we might need the xor result  more than once,
+	 * so keep them in a local var
+	*/
 #ifdef CONFIG_MTD_NAND_ECC_SMC
-	s0 = calc_ecc[0] ^ read_ecc[0];
-	s1 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[0] ^ calc_ecc[0];
+	b1 = read_ecc[1] ^ calc_ecc[1];
 #else
-	s1 = calc_ecc[0] ^ read_ecc[0];
-	s0 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[1] ^ calc_ecc[1];
+	b1 = read_ecc[0] ^ calc_ecc[0];
 #endif
-	if ((s0 | s1 | s2) == 0)
-		return 0;
-
-	/* Check for a single bit error */
-	if( ((s0 ^ (s0 >> 1)) & 0x55) == 0x55 &&
-	    ((s1 ^ (s1 >> 1)) & 0x55) == 0x55 &&
-	    ((s2 ^ (s2 >> 1)) & 0x54) == 0x54) {
-
-		uint32_t byteoffs, bitnum;
-
-		byteoffs = (s1 << 0) & 0x80;
-		byteoffs |= (s1 << 1) & 0x40;
-		byteoffs |= (s1 << 2) & 0x20;
-		byteoffs |= (s1 << 3) & 0x10;
+	b2 = read_ecc[2] ^ calc_ecc[2];

-		byteoffs |= (s0 >> 4) & 0x08;
-		byteoffs |= (s0 >> 3) & 0x04;
-		byteoffs |= (s0 >> 2) & 0x02;
-		byteoffs |= (s0 >> 1) & 0x01;
+	/* check if there are any bitfaults */

-		bitnum = (s2 >> 5) & 0x04;
-		bitnum |= (s2 >> 4) & 0x02;
-		bitnum |= (s2 >> 3) & 0x01;
+	/* count nr of bits; use table lookup, faster than calculating it */
+	nr_bits = bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2];

-		dat[byteoffs] ^= (1 << bitnum);
-
-		return 1;
+	/* repeated if statements are slightly more efficient than switch ... */
+	/* ordered in order of likelihood */
+	if (nr_bits == 0)
+		return (0);	/* no error */
+	if (nr_bits == 11) {	/* correctable error */
+		/*
+		 * rp15/13/11/9/7/5/3/1 indicate which byte is the faulty byte
+		 * cp 5/3/1 indicate the faulty bit.
+		 * A lookup table (called addressbits) is used to filter
+		 * the bits from the byte they are in.
+		 * A marginal optimisation is possible by having three
+		 * different lookup tables.
+		 * One as we have now (for b0), one for b2
+		 * (that would avoid the >> 1), and one for b1 (with all values
+		 * << 4). However it was felt that introducing two more tables
+		 * hardly justify the gain.
+		 *
+		 * The b2 shift is there to get rid of the lowest two bits.
+		 * We could also do addressbits[b2] >> 1 but for the
+		 * performace it does not make any difference
+		 */
+		byte_addr = (addressbits[b1] << 4) + addressbits[b0];
+		bit_addr = addressbits[b2 >> 2];
+		/* flip the bit */
+		buf[byte_addr] ^= (1 << bit_addr);
+		return (1);
 	}
-
-	if(countbits(s0 | ((uint32_t)s1 << 8) | ((uint32_t)s2 <<16)) == 1)
-		return 1;
-
-	return -EBADMSG;
+	if (nr_bits == 1)
+		return (1);	/* error in ecc data; no action needed */
+	return -1;
 }
 EXPORT_SYMBOL(nand_correct_data);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
+MODULE_AUTHOR("Frans Meulenbroeks");
 MODULE_DESCRIPTION("Generic NAND ECC support");
Signed-off-by: Frans Meulenbroeks

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-07-31  8:35 [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance frans
@ 2008-08-11 11:35 ` Frans Meulenbroeks
  2008-08-11 16:30 ` David Woodhouse
  1 sibling, 0 replies; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-11 11:35 UTC (permalink / raw)
  To: linux-mtd

Haven't seen any response on my patch, but didn't see it in git either.
Is there a problem or an issue I need to resolve?
Or am I just impatient?

Frans.

2008/7/31 frans <fransmeulenbroeks@gmail.com>:
> Dear all,
>
> A resubmit of my patch, with all comments from Thomas addressed, and this
> time wiht the patch inlined and submitted using pine/gmail
>
> This patch improves the performance of the ecc generation code by a factor
> of 18 on an INTEL D920 CPU, a factor of 7 on MIPS and a factor of 5 on ARM
> (NSLU2)
>
> Please let me know if additional changes are needed.
>
> Best regards, Frans.
>
> diff -urN linux-2.6.25.10/Documentation/nand/ecc.txt linux-2.6.25.10.work/Documentation/nand/ecc.txt
> --- linux-2.6.25.10/Documentation/nand/ecc.txt  1970-01-01 01:00:00.000000000 +0100
> +++ linux-2.6.25.10.work/Documentation/nand/ecc.txt     2008-07-29 19:25:19.000000000 +0200
> @@ -0,0 +1,714 @@
> +Introduction
> +============
> +
> +Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
> +I felt there was room for optimisation. I bashed the code for a few hours
> +performing tricks like table lookup removing superfluous code etc.
> +After that the speed was increased by 35-40%.
> +Still I was not too happy as I felt there was additional room for improvement.
> +
> +Bad! I was hooked.
> +I decided to annotate my steps in this file. Perhaps it is useful to someone
> +or someone learns something from it.
> +
> +
> +The problem
> +===========
> +
> +NAND flash (at least SLC one) typically has sectors of 256 bytes.
> +However NAND flash is not extremely reliable so some error detection
> +(and sometimes correction) is needed.
> +
> +This is done by means of a Hamming code. I'll try to explain it in
> +laymans terms (and apologies to all the pro's in the field in case I do
> +not use the right terminology, my coding theory class was almost 30
> +years ago, and I must admit it was not one of my favourites).
> +
> +As I said before the ecc calculation is performed on sectors of 256
> +bytes. This is done by calculating several parity bits over the rows and
> +columns. The parity used is even parity which means that the parity bit = 1
> +if the data over which the parity is calculated is 1 and the parity bit = 0
> +if the data over which the parity is calculated is 0. So the total
> +number of bits over the data over which the parity is calculated + the
> +parity bit is even. (see wikipedia if you can't follow this).
> +Parity is often calculated by means of an exclusive or operation,
> +sometimes also referred to as xor. In C the operator for xor is ^
> +
> +Back to ecc.
> +Let's give a small figure:
> +
> +byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
> +byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
> +byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
> +byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
> +byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
> +....
> +byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
> +byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
> +           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
> +           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
> +           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
> +
> +This figure represents a sector of 256 bytes.
> +cp is my abbreviaton for column parity, rp for row parity.
> +
> +Let's start to explain column parity.
> +cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
> +so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
> +Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
> +cp2 is the parity over bit0, bit1, bit4 and bit5
> +cp3 is the parity over bit2, bit3, bit6 and bit7.
> +cp4 is the parity over bit0, bit1, bit2 and bit3.
> +cp5 is the parity over bit4, bit5, bit6 and bit7.
> +Note that each of cp0 .. cp5 is exactly one bit.
> +
> +Row parity actually works almost the same.
> +rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
> +rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
> +rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
> +(so handle two bytes, then skip 2 bytes).
> +rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
> +for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
> +so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
> +and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
> +The story now becomes quite boring. I guess you get the idea.
> +rp6 covers 8 bytes then skips 8 etc
> +rp7 skips 8 bytes then covers 8 etc
> +rp8 covers 16 bytes then skips 16 etc
> +rp9 skips 16 bytes then covers 16 etc
> +rp10 covers 32 bytes then skips 32 etc
> +rp11 skips 32 bytes then covers 32 etc
> +rp12 covers 64 bytes then skips 64 etc
> +rp13 skips 64 bytes then covers 64 etc
> +rp14 covers 128 bytes then skips 128
> +rp15 skips 128 bytes then covers 128
> +
> +In the end the parity bits are grouped together in three bytes as
> +follows:
> +ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
> +ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
> +ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
> +ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
> +
> +I detected after writing this that ST application note AN1823
> +(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much
> +nicer picture.(but they use line parity as term where I use row parity)
> +Oh well, I'm graphically challenged, so suffer with me for a moment :-)
> +And I could not reuse the ST picture anyway for copyright reasons.
> +
> +
> +Attempt 0
> +=========
> +
> +Implementing the parity calculation is pretty simple.
> +In C pseudocode:
> +for (i = 0; i < 256; i++)
> +{
> +    if (i & 0x01)
> +       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
> +    else
> +       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
> +    if (i & 0x02)
> +       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
> +    else
> +       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
> +    if (i & 0x04)
> +      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
> +    else
> +      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
> +    if (i & 0x08)
> +      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
> +    else
> +      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
> +    if (i & 0x10)
> +      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
> +    else
> +      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
> +    if (i & 0x20)
> +      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
> +    else
> +    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
> +    if (i & 0x40)
> +      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
> +    else
> +      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
> +    if (i & 0x80)
> +      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
> +    else
> +      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
> +    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
> +    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
> +    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
> +    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
> +    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
> +    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
> +}
> +
> +
> +Analysis 0
> +==========
> +
> +C does have bitwise operators but not really operators to do the above
> +efficiently (and most hardware has no such instructions either).
> +Therefore without implementing this it was clear that the code above was
> +not going to bring me a Nobel prize :-)
> +
> +Fortunately the exclusive or operation is commutative, so we can combine
> +the values in any order. So instead of calculating all the bits
> +individually, let us try to rearrange things.
> +For the column parity this is easy. We can just xor the bytes and in the
> +end filter out the relevant bits. This is pretty nice as it will bring
> +all cp calculation out of the if loop.
> +
> +Similarly we can first xor the bytes for the various rows.
> +This leads to:
> +
> +
> +Attempt 1
> +=========
> +
> +const char parity[256] = {
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
> +};
> +
> +void ecc1(const unsigned char *buf, unsigned char *code)
> +{
> +    int i;
> +    const unsigned char *bp = buf;
> +    unsigned char cur;
> +    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
> +    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
> +    unsigned char par;
> +
> +    par = 0;
> +    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
> +    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
> +    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
> +    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
> +
> +    for (i = 0; i < 256; i++)
> +    {
> +        cur = *bp++;
> +        par ^= cur;
> +        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
> +        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
> +        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
> +        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
> +        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
> +        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
> +        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
> +        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
> +    }
> +    code[0] =
> +        (parity[rp7] << 7) |
> +        (parity[rp6] << 6) |
> +        (parity[rp5] << 5) |
> +        (parity[rp4] << 4) |
> +        (parity[rp3] << 3) |
> +        (parity[rp2] << 2) |
> +        (parity[rp1] << 1) |
> +        (parity[rp0]);
> +    code[1] =
> +        (parity[rp15] << 7) |
> +        (parity[rp14] << 6) |
> +        (parity[rp13] << 5) |
> +        (parity[rp12] << 4) |
> +        (parity[rp11] << 3) |
> +        (parity[rp10] << 2) |
> +        (parity[rp9]  << 1) |
> +        (parity[rp8]);
> +    code[2] =
> +        (parity[par & 0xf0] << 7) |
> +        (parity[par & 0x0f] << 6) |
> +        (parity[par & 0xcc] << 5) |
> +        (parity[par & 0x33] << 4) |
> +        (parity[par & 0xaa] << 3) |
> +        (parity[par & 0x55] << 2);
> +    code[0] = ~code[0];
> +    code[1] = ~code[1];
> +    code[2] = ~code[2];
> +}
> +
> +Still pretty straightforward. The last three invert statements are there to
> +give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
> +all data is 0xff, so the checksum then matches.
> +
> +I also introduced the parity lookup. I expected this to be the fastest
> +way to calculate the parity, but I will investigate alternatives later
> +on.
> +
> +
> +Analysis 1
> +==========
> +
> +The code works, but is not terribly efficient. On my system it took
> +almost 4 times as much time as the linux driver code. But hey, if it was
> +*that* easy this would have been done long before.
> +No pain. no gain.
> +
> +Fortunately there is plenty of room for improvement.
> +
> +In step 1 we moved from bit-wise calculation to byte-wise calculation.
> +However in C we can also use the unsigned long data type and virtually
> +every modern microprocessor supports 32 bit operations, so why not try
> +to write our code in such a way that we process data in 32 bit chunks.
> +
> +Of course this means some modification as the row parity is byte by
> +byte. A quick analysis:
> +for the column parity we use the par variable. When extending to 32 bits
> +we can in the end easily calculate p0 and p1 from it.
> +(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
> +respectively)
> +also rp2 and rp3 can be easily retrieved from par as rp3 covers the
> +first two bytes and rp2 the last two bytes.
> +
> +Note that of course now the loop is executed only 64 times (256/4).
> +And note that care must taken wrt byte ordering. The way bytes are
> +ordered in a long is machine dependent, and might affect us.
> +Anyway, if there is an issue: this code is developed on x86 (to be
> +precise: a DELL PC with a D920 Intel CPU)
> +
> +And of course the performance might depend on alignment, but I expect
> +that the I/O buffers in the nand driver are aligned properly (and
> +otherwise that should be fixed to get maximum performance).
> +
> +Let's give it a try...
> +
> +
> +Attempt 2
> +=========
> +
> +extern const char parity[256];
> +
> +void ecc2(const unsigned char *buf, unsigned char *code)
> +{
> +    int i;
> +    const unsigned long *bp = (unsigned long *)buf;
> +    unsigned long cur;
> +    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
> +    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
> +    unsigned long par;
> +
> +    par = 0;
> +    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
> +    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
> +    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
> +    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
> +
> +    for (i = 0; i < 64; i++)
> +    {
> +        cur = *bp++;
> +        par ^= cur;
> +        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
> +        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
> +        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
> +        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
> +        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
> +        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
> +    }
> +    /*
> +       we need to adapt the code generation for the fact that rp vars are now
> +       long; also the column parity calculation needs to be changed.
> +       we'll bring rp4 to 15 back to single byte entities by shifting and
> +       xoring
> +    */
> +    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
> +    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
> +    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
> +    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
> +    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
> +    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
> +    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
> +    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
> +    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
> +    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
> +    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
> +    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
> +    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
> +    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
> +    par ^= (par >> 16);
> +    rp1 = (par >> 8); rp1 &= 0xff;
> +    rp0 = (par & 0xff);
> +    par ^= (par >> 8); par &= 0xff;
> +
> +    code[0] =
> +        (parity[rp7] << 7) |
> +        (parity[rp6] << 6) |
> +        (parity[rp5] << 5) |
> +        (parity[rp4] << 4) |
> +        (parity[rp3] << 3) |
> +        (parity[rp2] << 2) |
> +        (parity[rp1] << 1) |
> +        (parity[rp0]);
> +    code[1] =
> +        (parity[rp15] << 7) |
> +        (parity[rp14] << 6) |
> +        (parity[rp13] << 5) |
> +        (parity[rp12] << 4) |
> +        (parity[rp11] << 3) |
> +        (parity[rp10] << 2) |
> +        (parity[rp9]  << 1) |
> +        (parity[rp8]);
> +    code[2] =
> +        (parity[par & 0xf0] << 7) |
> +        (parity[par & 0x0f] << 6) |
> +        (parity[par & 0xcc] << 5) |
> +        (parity[par & 0x33] << 4) |
> +        (parity[par & 0xaa] << 3) |
> +        (parity[par & 0x55] << 2);
> +    code[0] = ~code[0];
> +    code[1] = ~code[1];
> +    code[2] = ~code[2];
> +}
> +
> +The parity array is not shown any more. Note also that for these
> +examples I kinda deviated from my regular programming style by allowing
> +multiple statements on a line, not using { } in then and else blocks
> +with only a single statement and by using operators like ^=
> +
> +
> +Analysis 2
> +==========
> +
> +The code (of course) works, and hurray: we are a little bit faster than
> +the linux driver code (about 15%). But wait, don't cheer too quickly.
> +THere is more to be gained.
> +If we look at e.g. rp14 and rp15 we see that we either xor our data with
> +rp14 or with rp15. However we also have par which goes over all data.
> +This means there is no need to calculate rp14 as it can be calculated from
> +rp15 through rp14 = par ^ rp15;
> +(or if desired we can avoid calculating rp15 and calculate it from
> +rp14).  That is why some places refer to inverse parity.
> +Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
> +Effectively this means we can eliminate the else clause from the if
> +statements. Also we can optimise the calculation in the end a little bit
> +by going from long to byte first. Actually we can even avoid the table
> +lookups
> +
> +Attempt 3
> +=========
> +
> +Odd replaced:
> +        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
> +        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
> +        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
> +        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
> +        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
> +        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
> +with
> +        if (i & 0x01) rp5 ^= cur;
> +        if (i & 0x02) rp7 ^= cur;
> +        if (i & 0x04) rp9 ^= cur;
> +        if (i & 0x08) rp11 ^= cur;
> +        if (i & 0x10) rp13 ^= cur;
> +        if (i & 0x20) rp15 ^= cur;
> +
> +        and outside the loop added:
> +    rp4  = par ^ rp5;
> +    rp6  = par ^ rp7;
> +    rp8  = par ^ rp9;
> +    rp10  = par ^ rp11;
> +    rp12  = par ^ rp13;
> +    rp14  = par ^ rp15;
> +
> +And after that the code takes about 30% more time, although the number of
> +statements is reduced. This is also reflected in the assembly code.
> +
> +
> +Analysis 3
> +==========
> +
> +Very weird. Guess it has to do with caching or instruction parallellism
> +or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
> +observation was that this one is only 30% slower (according to time)
> +executing the code as my 3Ghz D920 processor.
> +
> +Well, it was expected not to be easy so maybe instead move to a
> +different track: let's move back to the code from attempt2 and do some
> +loop unrolling. This will eliminate a few if statements. I'll try
> +different amounts of unrolling to see what works best.
> +
> +
> +Attempt 4
> +=========
> +
> +Unrolled the loop 1, 2, 3 and 4 times.
> +For 4 the code starts with:
> +
> +    for (i = 0; i < 4; i++)
> +    {
> +        cur = *bp++;
> +        par ^= cur;
> +        rp4 ^= cur;
> +        rp6 ^= cur;
> +        rp8 ^= cur;
> +        rp10 ^= cur;
> +        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
> +        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
> +        cur = *bp++;
> +        par ^= cur;
> +        rp5 ^= cur;
> +        rp6 ^= cur;
> +        ...
> +
> +
> +Analysis 4
> +==========
> +
> +Unrolling once gains about 15%
> +Unrolling twice keeps the gain at about 15%
> +Unrolling three times gives a gain of 30% compared to attempt 2.
> +Unrolling four times gives a marginal improvement compared to unrolling
> +three times.
> +
> +I decided to proceed with a four time unrolled loop anyway. It was my gut
> +feeling that in the next steps I would obtain additional gain from it.
> +
> +The next step was triggered by the fact that par contains the xor of all
> +bytes and rp4 and rp5 each contain the xor of half of the bytes.
> +So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
> +that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
> +eliminate rp5 (or rp4, but I already foresaw another optimisation).
> +The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
> +
> +
> +Attempt 5
> +=========
> +
> +Effectively so all odd digit rp assignments in the loop were removed.
> +This included the else clause of the if statements.
> +Of course after the loop we need to correct things by adding code like:
> +    rp5 = par ^ rp4;
> +Also the initial assignments (rp5 = 0; etc) could be removed.
> +Along the line I also removed the initialisation of rp0/1/2/3.
> +
> +
> +Analysis 5
> +==========
> +
> +Measurements showed this was a good move. The run-time roughly halved
> +compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
> +of the processor time compared to the current code in the linux kernel.
> +
> +However, still I thought there was more. I didn't like all the if
> +statements. Why not keep a running parity and only keep the last if
> +statement. Time for yet another version!
> +
> +
> +Attempt 6
> +=========
> +
> +THe code within the for loop was changed to:
> +
> +    for (i = 0; i < 4; i++)
> +    {
> +        cur = *bp++; tmppar  = cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
> +
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
> +
> +           cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
> +
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur;
> +
> +           par ^= tmppar;
> +        if ((i & 0x1) == 0) rp12 ^= tmppar;
> +        if ((i & 0x2) == 0) rp14 ^= tmppar;
> +    }
> +
> +As you can see tmppar is used to accumulate the parity within a for
> +iteration. In the last 3 statements is is added to par and, if needed,
> +to rp12 and rp14.
> +
> +While making the changes I also found that I could exploit that tmppar
> +contains the running parity for this iteration. So instead of having:
> +rp4 ^= cur; rp6 = cur;
> +I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
> +statement. A similar change was done for rp8 and rp10
> +
> +
> +Analysis 6
> +==========
> +
> +Measuring this code again showed big gain. When executing the original
> +linux code 1 million times, this took about 1 second on my system.
> +(using time to measure the performance). After this iteration I was back
> +to 0.075 sec. Actually I had to decide to start measuring over 10
> +million interations in order not to loose too much accuracy. This one
> +definitely seemed to be the jackpot!
> +
> +There is a little bit more room for improvement though. There are three
> +places with statements:
> +rp4 ^= cur; rp6 ^= cur;
> +It seems more efficient to also maintain a variable rp4_6 in the while
> +loop; This eliminates 3 statements per loop. Of course after the loop we
> +need to correct by adding:
> +    rp4 ^= rp4_6;
> +    rp6 ^= rp4_6
> +Furthermore there are 4 sequential assingments to rp8. This can be
> +encoded slightly more efficient by saving tmppar before those 4 lines
> +and later do rp8 = rp8 ^ tmppar ^ notrp8;
> +(where notrp8 is the value of rp8 before those 4 lines).
> +Again a use of the commutative property of xor.
> +Time for a new test!
> +
> +
> +Attempt 7
> +=========
> +
> +The new code now looks like:
> +
> +    for (i = 0; i < 4; i++)
> +    {
> +        cur = *bp++; tmppar  = cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
> +
> +        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
> +
> +           notrp8 = tmppar;
> +           cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
> +           cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur;
> +           rp8 = rp8 ^ tmppar ^ notrp8;
> +
> +        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
> +        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
> +        cur = *bp++; tmppar ^= cur;
> +
> +           par ^= tmppar;
> +        if ((i & 0x1) == 0) rp12 ^= tmppar;
> +        if ((i & 0x2) == 0) rp14 ^= tmppar;
> +    }
> +    rp4 ^= rp4_6;
> +    rp6 ^= rp4_6;
> +
> +
> +Not a big change, but every penny counts :-)
> +
> +
> +Analysis 7
> +==========
> +
> +Acutally this made things worse. Not very much, but I don't want to move
> +into the wrong direction. Maybe something to investigate later. Could
> +have to do with caching again.
> +
> +Guess that is what there is to win within the loop. Maybe unrolling one
> +more time will help. I'll keep the optimisations from 7 for now.
> +
> +
> +Attempt 8
> +=========
> +
> +Unrolled the loop one more time.
> +
> +
> +Analysis 8
> +==========
> +
> +This makes things worse. Let's stick with attempt 6 and continue from there.
> +Although it seems that the code within the loop cannot be optimised
> +further there is still room to optimize the generation of the ecc codes.
> +We can simply calcualate the total parity. If this is 0 then rp4 = rp5
> +etc. If the parity is 1, then rp4 = !rp5;
> +But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
> +in the result byte and then do something like
> +    code[0] |= (code[0] << 1);
> +Lets test this.
> +
> +
> +Attempt 9
> +=========
> +
> +Changed the code but again this slightly degrades performance. Tried all
> +kind of other things, like having dedicated parity arrays to avoid the
> +shift after parity[rp7] << 7; No gain.
> +Change the lookup using the parity array by using shift operators (e.g.
> +replace parity[rp7] << 7 with:
> +rp7 ^= (rp7 << 4);
> +rp7 ^= (rp7 << 2);
> +rp7 ^= (rp7 << 1);
> +rp7 &= 0x80;
> +No gain.
> +
> +The only marginal change was inverting the parity bits, so we can remove
> +the last three invert statements.
> +
> +Ah well, pity this does not deliver more. Then again 10 million
> +iterations using the linux driver code takes between 13 and 13.5
> +seconds, whereas my code now takes about 0.73 seconds for those 10
> +million iterations. So basically I've improved the performance by a
> +factor 18 on my system. Not that bad. Of course on different hardware
> +you will get different results. No warranties!
> +
> +But of course there is no such thing as a free lunch. The codesize almost
> +tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
> +
> +
> +Correcting errors
> +=================
> +
> +For correcting errors I again used the ST application note as a starter,
> +but I also peeked at the existing code.
> +The algorithm itself is pretty straightforward. Just xor the given and
> +the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
> +are 1 we have one correctable bit error. If there is 1 bit 1, we have an
> +error in the given ecc code.
> +It proved to be fastest to do some table lookups. Performance gain
> +introduced by this is about a factor 2 on my system when a repair had to
> +be done, and 1% or so if no repair had to be done.
> +Code size increased from 330 bytes to 686 bytes for this function.
> +(gcc 4.2, -O3)
> +
> +
> +Conclusion
> +==========
> +
> +The gain when calculating the ecc is tremendous. Om my development hardware
> +a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
> +embedded system with a MIPS core a factor 7 was obtained.
> +On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
> +5 (big endian mode, gcc 4.1.2, -O3)
> +For correction not much gain could be obtained (as bitflips are rare). Then
> +again there are also much less cycles spent there.
> +
> +It seems there is not much more gain possible in this, at least when
> +programmed in C. Of course it might be possible to squeeze something more
> +out of it with an assembler program, but due to pipeline behaviour etc
> +this is very tricky (at least for intel hw).
> +
> +Author: Frans Meulenbroeks
> +Copyright (C) 2008 Koninklijke Philips Electronics NV.
> diff -urN linux-2.6.25.10/drivers/mtd/nand/nand_ecc.c linux-2.6.25.10.work/drivers/mtd/nand/nand_ecc.c
> --- linux-2.6.25.10/drivers/mtd/nand/nand_ecc.c 2008-07-03 05:46:47.000000000 +0200
> +++ linux-2.6.25.10.work/drivers/mtd/nand/nand_ecc.c    2008-07-30 09:53:59.000000000 +0200
> @@ -1,15 +1,18 @@
>  /*
> - * This file contains an ECC algorithm from Toshiba that detects and
> - * corrects 1 bit errors in a 256 byte block of data.
> + * This file contains an ECC algorithm that detects and corrects 1 bit
> + * errors in a 256 byte block of data.
>  *
>  * drivers/mtd/nand/nand_ecc.c
>  *
> - * Copyright (C) 2000-2004 Steven J. Hill (sjhill@realitydiluted.com)
> - *                         Toshiba America Electronics Components, Inc.
> + * Copyright (C) 2008 Koninklijke Philips Electronics NV.
> + *                    Author: Frans Meulenbroeks
>  *
> - * Copyright (C) 2006 Thomas Gleixner <tglx@linutronix.de>
> + * Completely replaces the previous ECC implementation which was written by:
> + *   Steven J. Hill (sjhill@realitydiluted.com)
> + *   Thomas Gleixner (tglx@linutronix.de)
>  *
> - * $Id: nand_ecc.c,v 1.15 2005/11/07 11:14:30 gleixner Exp $
> + * Information on how this algorithm works and how it was developed
> + * can be found in Documentation/nand/ecc.txt
>  *
>  * This file is free software; you can redistribute it and/or modify it
>  * under the terms of the GNU General Public License as published by the
> @@ -25,174 +28,415 @@
>  * with this file; if not, write to the Free Software Foundation, Inc.,
>  * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
>  *
> - * As a special exception, if other files instantiate templates or use
> - * macros or inline functions from these files, or you compile these
> - * files and link them with other works to produce a work based on these
> - * files, these files do not by themselves cause the resulting work to be
> - * covered by the GNU General Public License. However the source code for
> - * these files must still be made available in accordance with section (3)
> - * of the GNU General Public License.
> - *
> - * This exception does not invalidate any other reasons why a work based on
> - * this file might be covered by the GNU General Public License.
>  */
>
> +/*
> + * The STANDALONE macro is useful when running the code outside the kernel
> + * e.g. when running the code in a testbed or a benchmark program.
> + * When STANDALONE is used, the module related macros are commented out
> + * as well as the linux include files.
> + * Instead a private definition of mtd_into is given to satisfy the compiler
> + * (the code does not use mtd_info, so the code does not care)
> + */
> +#ifndef STANDALONE
>  #include <linux/types.h>
>  #include <linux/kernel.h>
>  #include <linux/module.h>
>  #include <linux/mtd/nand_ecc.h>
> +#else
> +struct mtd_info {
> +       int dummy;
> +};
> +#define EXPORT_SYMBOL(x)  /* x */
> +
> +#define MODULE_LICENSE(x)      /* x */
> +#define MODULE_AUTHOR(x)       /* x */
> +#define MODULE_DESCRIPTION(x)  /* x */
> +#endif
> +
> +/*
> + * invparity is a 256 byte table that contains the odd parity
> + * for each byte. So if the number of bits in a byte is even,
> + * the array element is 1, and when the number of bits is odd
> + * the array eleemnt is 0.
> + */
> +static const char invparity[256] = {
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
> +       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1
> +};
> +
> +/*
> + * bitsperbyte contains the number of bits per byte
> + * this is only used for testing and repairing parity
> + * (a precalculated value slightly improves performance)
> + */
> +static const char bitsperbyte[256] = {
> +       0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
> +       1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
> +       1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
> +       1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
> +       2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
> +       3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
> +       3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
> +       4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
> +};
>
>  /*
> - * Pre-calculated 256-way 1 byte column parity
> + * addressbits is a lookup table to filter out the bits from the xor-ed
> + * ecc data that identify the faulty location.
> + * this is only used for repairing parity
> + * see the comments in nand_correct_data for more details
>  */
> -static const u_char nand_ecc_precalc_table[] = {
> -       0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00,
> -       0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
> -       0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
> -       0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
> -       0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
> -       0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
> -       0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
> -       0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
> -       0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
> -       0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
> -       0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
> -       0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
> -       0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
> -       0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
> -       0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
> -       0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00
> +static const char addressbits[256] = {
> +       0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
> +       0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
> +       0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
> +       0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
> +       0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
> +       0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
> +       0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
> +       0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
> +       0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
> +       0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
> +       0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
> +       0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
> +       0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
> +       0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
> +       0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
> +       0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
> +       0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
> +       0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
> +       0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
> +       0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
> +       0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
> +       0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
> +       0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
> +       0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
> +       0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
> +       0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
> +       0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
> +       0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
> +       0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
> +       0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
> +       0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
> +       0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f
>  };
>
>  /**
>  * nand_calculate_ecc - [NAND Interface] Calculate 3-byte ECC for 256-byte block
> - * @mtd:       MTD block structure
> + * @mtd:       MTD block structure (unused)
>  * @dat:       raw data
>  * @ecc_code:  buffer for ECC
>  */
> -int nand_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
> -                      u_char *ecc_code)
> +int nand_calculate_ecc(struct mtd_info *mtd, const unsigned char *buf,
> +                      unsigned char *code)
>  {
> -       uint8_t idx, reg1, reg2, reg3, tmp1, tmp2;
>        int i;
> -
> -       /* Initialize variables */
> -       reg1 = reg2 = reg3 = 0;
> -
> -       /* Build up column parity */
> -       for(i = 0; i < 256; i++) {
> -               /* Get CP0 - CP5 from table */
> -               idx = nand_ecc_precalc_table[*dat++];
> -               reg1 ^= (idx & 0x3f);
> -
> -               /* All bit XOR = 1 ? */
> -               if (idx & 0x40) {
> -                       reg3 ^= (uint8_t) i;
> -                       reg2 ^= ~((uint8_t) i);
> -               }
> +       const unsigned long *bp = (unsigned long *)buf;
> +       unsigned long cur;      /* current value in buffer */
> +       /* rp0..rp15 are the various accumulated parities (per byte) */
> +       unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
> +       unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
> +       unsigned long par;      /* the cumulative parity for all data */
> +       unsigned long tmppar;   /* the cumulative parity for this iteration;
> +                                  for rp12 and rp14 at the end of the loop */
> +
> +       par = 0;
> +       rp4 = 0;
> +       rp6 = 0;
> +       rp8 = 0;
> +       rp10 = 0;
> +       rp12 = 0;
> +       rp14 = 0;
> +
> +       /*
> +        * The loop is unrolled a number of times;
> +        * This avoids if statements to decide on which rp value to update
> +        * Also we process the data by longwords.
> +        * Note: passing unaligned data might give a performance penalty.
> +        * It is assumed that the buffers are aligned.
> +        * tmppar is the cumulative sum of this iteration.
> +        * needed for calculating rp12, rp14 and par
> +        * also used as a performance improvement for rp6, rp8 and rp10
> +        */
> +       for (i = 0; i < 4; i++) {
> +               cur = *bp++;
> +               tmppar = cur;
> +               rp4 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp6 ^= tmppar;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp8 ^= tmppar;
> +
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               rp6 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp6 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp10 ^= tmppar;
> +
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               rp6 ^= cur;
> +               rp8 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp6 ^= cur;
> +               rp8 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               rp8 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp8 ^= cur;
> +
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               rp6 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp6 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +               rp4 ^= cur;
> +               cur = *bp++;
> +               tmppar ^= cur;
> +
> +               par ^= tmppar;
> +               if ((i & 0x1) == 0)
> +                       rp12 ^= tmppar;
> +               if ((i & 0x2) == 0)
> +                       rp14 ^= tmppar;
>        }
>
> -       /* Create non-inverted ECC code from line parity */
> -       tmp1  = (reg3 & 0x80) >> 0; /* B7 -> B7 */
> -       tmp1 |= (reg2 & 0x80) >> 1; /* B7 -> B6 */
> -       tmp1 |= (reg3 & 0x40) >> 1; /* B6 -> B5 */
> -       tmp1 |= (reg2 & 0x40) >> 2; /* B6 -> B4 */
> -       tmp1 |= (reg3 & 0x20) >> 2; /* B5 -> B3 */
> -       tmp1 |= (reg2 & 0x20) >> 3; /* B5 -> B2 */
> -       tmp1 |= (reg3 & 0x10) >> 3; /* B4 -> B1 */
> -       tmp1 |= (reg2 & 0x10) >> 4; /* B4 -> B0 */
> -
> -       tmp2  = (reg3 & 0x08) << 4; /* B3 -> B7 */
> -       tmp2 |= (reg2 & 0x08) << 3; /* B3 -> B6 */
> -       tmp2 |= (reg3 & 0x04) << 3; /* B2 -> B5 */
> -       tmp2 |= (reg2 & 0x04) << 2; /* B2 -> B4 */
> -       tmp2 |= (reg3 & 0x02) << 2; /* B1 -> B3 */
> -       tmp2 |= (reg2 & 0x02) << 1; /* B1 -> B2 */
> -       tmp2 |= (reg3 & 0x01) << 1; /* B0 -> B1 */
> -       tmp2 |= (reg2 & 0x01) << 0; /* B7 -> B0 */
> -
> -       /* Calculate final ECC code */
> +       /*
> +        * handle the fact that we use longword operations
> +        * we'll bring rp4..rp14 back to single byte entities by shifting and
> +        * xoring first fold the upper and lower 16 bits,
> +        * then the upper and lower 8 bits.
> +        */
> +       rp4 ^= (rp4 >> 16);
> +       rp4 ^= (rp4 >> 8);
> +       rp4 &= 0xff;
> +       rp6 ^= (rp6 >> 16);
> +       rp6 ^= (rp6 >> 8);
> +       rp6 &= 0xff;
> +       rp8 ^= (rp8 >> 16);
> +       rp8 ^= (rp8 >> 8);
> +       rp8 &= 0xff;
> +       rp10 ^= (rp10 >> 16);
> +       rp10 ^= (rp10 >> 8);
> +       rp10 &= 0xff;
> +       rp12 ^= (rp12 >> 16);
> +       rp12 ^= (rp12 >> 8);
> +       rp12 &= 0xff;
> +       rp14 ^= (rp14 >> 16);
> +       rp14 ^= (rp14 >> 8);
> +       rp14 &= 0xff;
> +
> +       /*
> +        * we also need to calculate the row parity for rp0..rp3
> +        * This is present in par, because par is now
> +        * rp3 rp3 rp2 rp2
> +        * as well as
> +        * rp1 rp0 rp1 rp0
> +        * First calculate rp2 and rp3
> +        * (and yes: rp2 = (par ^ rp3) & 0xff; but doing that did not
> +        * give a performance improvement)
> +        */
> +       rp3 = (par >> 16);
> +       rp3 ^= (rp3 >> 8);
> +       rp3 &= 0xff;
> +       rp2 = par & 0xffff;
> +       rp2 ^= (rp2 >> 8);
> +       rp2 &= 0xff;
> +
> +       /* reduce par to 16 bits then calculate rp1 and rp0 */
> +       par ^= (par >> 16);
> +       rp1 = (par >> 8) & 0xff;
> +       rp0 = (par & 0xff);
> +
> +       /* finally reduce par to 8 bits */
> +       par ^= (par >> 8);
> +       par &= 0xff;
> +
> +       /*
> +        * and calculate rp5..rp15
> +        * note that par = rp4 ^ rp5 and due to the commutative property
> +        * of the ^ operator we can say:
> +        * rp5 = (par ^ rp4);
> +        * The & 0xff seems superfluous, but benchmarking learned that
> +        * leaving it out gives slightly worse results. No idea why, probably
> +        * it has to do with the way the pipeline in pentium is organized.
> +        */
> +       rp5 = (par ^ rp4) & 0xff;
> +       rp7 = (par ^ rp6) & 0xff;
> +       rp9 = (par ^ rp8) & 0xff;
> +       rp11 = (par ^ rp10) & 0xff;
> +       rp13 = (par ^ rp12) & 0xff;
> +       rp15 = (par ^ rp14) & 0xff;
> +
> +       /*
> +        * Finally calculate the ecc bits.
> +        * Again here it might seem that there are performance optimisations
> +        * possible, but benchmarks showed that on the system this is developed
> +        * the code below is the fastest
> +        */
>  #ifdef CONFIG_MTD_NAND_ECC_SMC
> -       ecc_code[0] = ~tmp2;
> -       ecc_code[1] = ~tmp1;
> +       code[0] =
> +           (invparity[rp7] << 7) |
> +           (invparity[rp6] << 6) |
> +           (invparity[rp5] << 5) |
> +           (invparity[rp4] << 4) |
> +           (invparity[rp3] << 3) |
> +           (invparity[rp2] << 2) |
> +           (invparity[rp1] << 1) |
> +           (invparity[rp0]);
> +       code[1] =
> +           (invparity[rp15] << 7) |
> +           (invparity[rp14] << 6) |
> +           (invparity[rp13] << 5) |
> +           (invparity[rp12] << 4) |
> +           (invparity[rp11] << 3) |
> +           (invparity[rp10] << 2) |
> +           (invparity[rp9] << 1)  |
> +           (invparity[rp8]);
>  #else
> -       ecc_code[0] = ~tmp1;
> -       ecc_code[1] = ~tmp2;
> +       code[1] =
> +           (invparity[rp7] << 7) |
> +           (invparity[rp6] << 6) |
> +           (invparity[rp5] << 5) |
> +           (invparity[rp4] << 4) |
> +           (invparity[rp3] << 3) |
> +           (invparity[rp2] << 2) |
> +           (invparity[rp1] << 1) |
> +           (invparity[rp0]);
> +       code[0] =
> +           (invparity[rp15] << 7) |
> +           (invparity[rp14] << 6) |
> +           (invparity[rp13] << 5) |
> +           (invparity[rp12] << 4) |
> +           (invparity[rp11] << 3) |
> +           (invparity[rp10] << 2) |
> +           (invparity[rp9] << 1)  |
> +           (invparity[rp8]);
>  #endif
> -       ecc_code[2] = ((~reg1) << 2) | 0x03;
> -
> -       return 0;
> +       code[2] =
> +           (invparity[par & 0xf0] << 7) |
> +           (invparity[par & 0x0f] << 6) |
> +           (invparity[par & 0xcc] << 5) |
> +           (invparity[par & 0x33] << 4) |
> +           (invparity[par & 0xaa] << 3) |
> +           (invparity[par & 0x55] << 2) |
> +           3;
>  }
>  EXPORT_SYMBOL(nand_calculate_ecc);
>
> -static inline int countbits(uint32_t byte)
> -{
> -       int res = 0;
> -
> -       for (;byte; byte >>= 1)
> -               res += byte & 0x01;
> -       return res;
> -}
> -
>  /**
>  * nand_correct_data - [NAND Interface] Detect and correct bit error(s)
> - * @mtd:       MTD block structure
> + * @mtd:       MTD block structure (unused)
>  * @dat:       raw data read from the chip
>  * @read_ecc:  ECC from the chip
>  * @calc_ecc:  the ECC calculated from raw data
>  *
>  * Detect and correct a 1 bit error for 256 byte block
>  */
> -int nand_correct_data(struct mtd_info *mtd, u_char *dat,
> -                     u_char *read_ecc, u_char *calc_ecc)
> +int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
> +                     unsigned char *read_ecc, unsigned char *calc_ecc)
>  {
> -       uint8_t s0, s1, s2;
> -
> +       int nr_bits;
> +       unsigned char b0, b1, b2;
> +       unsigned char byte_addr, bit_addr;
> +
> +       /*
> +        * b0 to b2 indicate which bit is faulty (if any)
> +        * we might need the xor result  more than once,
> +        * so keep them in a local var
> +       */
>  #ifdef CONFIG_MTD_NAND_ECC_SMC
> -       s0 = calc_ecc[0] ^ read_ecc[0];
> -       s1 = calc_ecc[1] ^ read_ecc[1];
> -       s2 = calc_ecc[2] ^ read_ecc[2];
> +       b0 = read_ecc[0] ^ calc_ecc[0];
> +       b1 = read_ecc[1] ^ calc_ecc[1];
>  #else
> -       s1 = calc_ecc[0] ^ read_ecc[0];
> -       s0 = calc_ecc[1] ^ read_ecc[1];
> -       s2 = calc_ecc[2] ^ read_ecc[2];
> +       b0 = read_ecc[1] ^ calc_ecc[1];
> +       b1 = read_ecc[0] ^ calc_ecc[0];
>  #endif
> -       if ((s0 | s1 | s2) == 0)
> -               return 0;
> -
> -       /* Check for a single bit error */
> -       if( ((s0 ^ (s0 >> 1)) & 0x55) == 0x55 &&
> -           ((s1 ^ (s1 >> 1)) & 0x55) == 0x55 &&
> -           ((s2 ^ (s2 >> 1)) & 0x54) == 0x54) {
> -
> -               uint32_t byteoffs, bitnum;
> -
> -               byteoffs = (s1 << 0) & 0x80;
> -               byteoffs |= (s1 << 1) & 0x40;
> -               byteoffs |= (s1 << 2) & 0x20;
> -               byteoffs |= (s1 << 3) & 0x10;
> +       b2 = read_ecc[2] ^ calc_ecc[2];
>
> -               byteoffs |= (s0 >> 4) & 0x08;
> -               byteoffs |= (s0 >> 3) & 0x04;
> -               byteoffs |= (s0 >> 2) & 0x02;
> -               byteoffs |= (s0 >> 1) & 0x01;
> +       /* check if there are any bitfaults */
>
> -               bitnum = (s2 >> 5) & 0x04;
> -               bitnum |= (s2 >> 4) & 0x02;
> -               bitnum |= (s2 >> 3) & 0x01;
> +       /* count nr of bits; use table lookup, faster than calculating it */
> +       nr_bits = bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2];
>
> -               dat[byteoffs] ^= (1 << bitnum);
> -
> -               return 1;
> +       /* repeated if statements are slightly more efficient than switch ... */
> +       /* ordered in order of likelihood */
> +       if (nr_bits == 0)
> +               return (0);     /* no error */
> +       if (nr_bits == 11) {    /* correctable error */
> +               /*
> +                * rp15/13/11/9/7/5/3/1 indicate which byte is the faulty byte
> +                * cp 5/3/1 indicate the faulty bit.
> +                * A lookup table (called addressbits) is used to filter
> +                * the bits from the byte they are in.
> +                * A marginal optimisation is possible by having three
> +                * different lookup tables.
> +                * One as we have now (for b0), one for b2
> +                * (that would avoid the >> 1), and one for b1 (with all values
> +                * << 4). However it was felt that introducing two more tables
> +                * hardly justify the gain.
> +                *
> +                * The b2 shift is there to get rid of the lowest two bits.
> +                * We could also do addressbits[b2] >> 1 but for the
> +                * performace it does not make any difference
> +                */
> +               byte_addr = (addressbits[b1] << 4) + addressbits[b0];
> +               bit_addr = addressbits[b2 >> 2];
> +               /* flip the bit */
> +               buf[byte_addr] ^= (1 << bit_addr);
> +               return (1);
>        }
> -
> -       if(countbits(s0 | ((uint32_t)s1 << 8) | ((uint32_t)s2 <<16)) == 1)
> -               return 1;
> -
> -       return -EBADMSG;
> +       if (nr_bits == 1)
> +               return (1);     /* error in ecc data; no action needed */
> +       return -1;
>  }
>  EXPORT_SYMBOL(nand_correct_data);
>
>  MODULE_LICENSE("GPL");
> -MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
> +MODULE_AUTHOR("Frans Meulenbroeks");
>  MODULE_DESCRIPTION("Generic NAND ECC support");
> Signed-off-by: Frans Meulenbroeks
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-07-31  8:35 [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance frans
  2008-08-11 11:35 ` Frans Meulenbroeks
@ 2008-08-11 16:30 ` David Woodhouse
       [not found]   ` <ac9c93b10808120153m7435424ci3e49a70d3599cc06@mail.gmail.com>
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2008-08-11 16:30 UTC (permalink / raw)
  To: frans; +Cc: linux-mtd

On Thu, 2008-07-31 at 10:35 +0200, frans wrote:
> Dear all,
> 
> A resubmit of my patch, with all comments from Thomas addressed, and this 
> time wiht the patch inlined and submitted using pine/gmail
> 
> This patch improves the performance of the ecc generation code by a factor
> of 18 on an INTEL D920 CPU, a factor of 7 on MIPS and a factor of 5 on ARM 
> (NSLU2)
> 
> Please let me know if additional changes are neede

Needs a Signed-Off-By, and I can apply it.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
       [not found]     ` <1218535872.2977.133.camel@pmac.infradead.org>
@ 2008-08-14 18:07       ` frans
  2008-08-14 19:10         ` Troy Kisky
  0 siblings, 1 reply; 29+ messages in thread
From: frans @ 2008-08-14 18:07 UTC (permalink / raw)
  To: linux-mtd, David Woodhouse; +Cc: Frans Meulenbroeks

Fixed the last remaining issues, made sure to diff with the very latest mtd
git version.

Attached is a complete rewrite of nand_ecc.c including documentation.
This rewrite improves performance about 18 times on intel (D920),
7 times on MIPS and 5 times on ARM (NSLU2)

Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>

diff -urN git/Documentation/nand/ecc.txt work/Documentation/nand/ecc.txt
--- git/Documentation/nand/ecc.txt	1970-01-01 01:00:00.000000000 +0100
+++ work/Documentation/nand/ecc.txt	2008-08-14 19:44:43.000000000 +0200
@@ -0,0 +1,714 @@
+Introduction
+============
+
+Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
+I felt there was room for optimisation. I bashed the code for a few hours
+performing tricks like table lookup removing superfluous code etc.
+After that the speed was increased by 35-40%.
+Still I was not too happy as I felt there was additional room for improvement.
+
+Bad! I was hooked.
+I decided to annotate my steps in this file. Perhaps it is useful to someone
+or someone learns something from it.
+
+
+The problem
+===========
+
+NAND flash (at least SLC one) typically has sectors of 256 bytes.
+However NAND flash is not extremely reliable so some error detection
+(and sometimes correction) is needed.
+
+This is done by means of a Hamming code. I'll try to explain it in
+laymans terms (and apologies to all the pro's in the field in case I do
+not use the right terminology, my coding theory class was almost 30
+years ago, and I must admit it was not one of my favourites).
+
+As I said before the ecc calculation is performed on sectors of 256
+bytes. This is done by calculating several parity bits over the rows and
+columns. The parity used is even parity which means that the parity bit = 1
+if the data over which the parity is calculated is 1 and the parity bit = 0
+if the data over which the parity is calculated is 0. So the total
+number of bits over the data over which the parity is calculated + the
+parity bit is even. (see wikipedia if you can't follow this).
+Parity is often calculated by means of an exclusive or operation,
+sometimes also referred to as xor. In C the operator for xor is ^
+
+Back to ecc.
+Let's give a small figure:
+
+byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
+byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
+byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
+byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
+byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
+....
+byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
+byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
+           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
+           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
+           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
+
+This figure represents a sector of 256 bytes.
+cp is my abbreviaton for column parity, rp for row parity.
+
+Let's start to explain column parity.
+cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
+so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
+Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
+cp2 is the parity over bit0, bit1, bit4 and bit5
+cp3 is the parity over bit2, bit3, bit6 and bit7.
+cp4 is the parity over bit0, bit1, bit2 and bit3.
+cp5 is the parity over bit4, bit5, bit6 and bit7.
+Note that each of cp0 .. cp5 is exactly one bit.
+
+Row parity actually works almost the same.
+rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
+rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
+rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
+(so handle two bytes, then skip 2 bytes).
+rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
+for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
+so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
+and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
+The story now becomes quite boring. I guess you get the idea.
+rp6 covers 8 bytes then skips 8 etc
+rp7 skips 8 bytes then covers 8 etc
+rp8 covers 16 bytes then skips 16 etc
+rp9 skips 16 bytes then covers 16 etc
+rp10 covers 32 bytes then skips 32 etc
+rp11 skips 32 bytes then covers 32 etc
+rp12 covers 64 bytes then skips 64 etc
+rp13 skips 64 bytes then covers 64 etc
+rp14 covers 128 bytes then skips 128
+rp15 skips 128 bytes then covers 128
+
+In the end the parity bits are grouped together in three bytes as
+follows:
+ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
+ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
+ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
+
+I detected after writing this that ST application note AN1823
+(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much
+nicer picture.(but they use line parity as term where I use row parity)
+Oh well, I'm graphically challenged, so suffer with me for a moment :-)
+And I could not reuse the ST picture anyway for copyright reasons.
+
+
+Attempt 0
+=========
+
+Implementing the parity calculation is pretty simple.
+In C pseudocode:
+for (i = 0; i < 256; i++)
+{
+    if (i & 0x01)
+       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    else
+       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    if (i & 0x02)
+       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
+    else
+       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
+    if (i & 0x04)
+      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
+    else
+      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
+    if (i & 0x08)
+      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
+    else
+      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
+    if (i & 0x10)
+      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
+    else
+      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
+    if (i & 0x20)
+      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
+    else
+    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
+    if (i & 0x40)
+      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
+    else
+      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
+    if (i & 0x80)
+      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
+    else
+      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
+    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
+    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
+    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
+    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
+    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
+    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
+}
+
+
+Analysis 0
+==========
+
+C does have bitwise operators but not really operators to do the above
+efficiently (and most hardware has no such instructions either).
+Therefore without implementing this it was clear that the code above was
+not going to bring me a Nobel prize :-)
+
+Fortunately the exclusive or operation is commutative, so we can combine
+the values in any order. So instead of calculating all the bits
+individually, let us try to rearrange things.
+For the column parity this is easy. We can just xor the bytes and in the
+end filter out the relevant bits. This is pretty nice as it will bring
+all cp calculation out of the if loop.
+
+Similarly we can first xor the bytes for the various rows.
+This leads to:
+
+
+Attempt 1
+=========
+
+const char parity[256] = {
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
+};
+
+void ecc1(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned char *bp = buf;
+    unsigned char cur;
+    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned char par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 256; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
+        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
+        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
+    }
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+Still pretty straightforward. The last three invert statements are there to
+give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
+all data is 0xff, so the checksum then matches.
+
+I also introduced the parity lookup. I expected this to be the fastest
+way to calculate the parity, but I will investigate alternatives later
+on.
+
+
+Analysis 1
+==========
+
+The code works, but is not terribly efficient. On my system it took
+almost 4 times as much time as the linux driver code. But hey, if it was
+*that* easy this would have been done long before.
+No pain. no gain.
+
+Fortunately there is plenty of room for improvement.
+
+In step 1 we moved from bit-wise calculation to byte-wise calculation.
+However in C we can also use the unsigned long data type and virtually
+every modern microprocessor supports 32 bit operations, so why not try
+to write our code in such a way that we process data in 32 bit chunks.
+
+Of course this means some modification as the row parity is byte by
+byte. A quick analysis:
+for the column parity we use the par variable. When extending to 32 bits
+we can in the end easily calculate p0 and p1 from it.
+(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
+respectively)
+also rp2 and rp3 can be easily retrieved from par as rp3 covers the
+first two bytes and rp2 the last two bytes.
+
+Note that of course now the loop is executed only 64 times (256/4).
+And note that care must taken wrt byte ordering. The way bytes are
+ordered in a long is machine dependent, and might affect us.
+Anyway, if there is an issue: this code is developed on x86 (to be
+precise: a DELL PC with a D920 Intel CPU)
+
+And of course the performance might depend on alignment, but I expect
+that the I/O buffers in the nand driver are aligned properly (and
+otherwise that should be fixed to get maximum performance).
+
+Let's give it a try...
+
+
+Attempt 2
+=========
+
+extern const char parity[256];
+
+void ecc2(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned long *bp = (unsigned long *)buf;
+    unsigned long cur;
+    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned long par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 64; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+    }
+    /*
+       we need to adapt the code generation for the fact that rp vars are now
+       long; also the column parity calculation needs to be changed.
+       we'll bring rp4 to 15 back to single byte entities by shifting and
+       xoring
+    */
+    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
+    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
+    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
+    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
+    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
+    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
+    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
+    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
+    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
+    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
+    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
+    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
+    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
+    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
+    par ^= (par >> 16);
+    rp1 = (par >> 8); rp1 &= 0xff;
+    rp0 = (par & 0xff);
+    par ^= (par >> 8); par &= 0xff;
+
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+The parity array is not shown any more. Note also that for these
+examples I kinda deviated from my regular programming style by allowing
+multiple statements on a line, not using { } in then and else blocks
+with only a single statement and by using operators like ^=
+
+
+Analysis 2
+==========
+
+The code (of course) works, and hurray: we are a little bit faster than
+the linux driver code (about 15%). But wait, don't cheer too quickly.
+THere is more to be gained.
+If we look at e.g. rp14 and rp15 we see that we either xor our data with
+rp14 or with rp15. However we also have par which goes over all data.
+This means there is no need to calculate rp14 as it can be calculated from
+rp15 through rp14 = par ^ rp15;
+(or if desired we can avoid calculating rp15 and calculate it from
+rp14).  That is why some places refer to inverse parity.
+Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
+Effectively this means we can eliminate the else clause from the if
+statements. Also we can optimise the calculation in the end a little bit
+by going from long to byte first. Actually we can even avoid the table
+lookups
+
+Attempt 3
+=========
+
+Odd replaced:
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+with
+        if (i & 0x01) rp5 ^= cur;
+        if (i & 0x02) rp7 ^= cur;
+        if (i & 0x04) rp9 ^= cur;
+        if (i & 0x08) rp11 ^= cur;
+        if (i & 0x10) rp13 ^= cur;
+        if (i & 0x20) rp15 ^= cur;
+
+        and outside the loop added:
+    rp4  = par ^ rp5;
+    rp6  = par ^ rp7;
+    rp8  = par ^ rp9;
+    rp10  = par ^ rp11;
+    rp12  = par ^ rp13;
+    rp14  = par ^ rp15;
+
+And after that the code takes about 30% more time, although the number of
+statements is reduced. This is also reflected in the assembly code.
+
+
+Analysis 3
+==========
+
+Very weird. Guess it has to do with caching or instruction parallellism
+or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
+observation was that this one is only 30% slower (according to time)
+executing the code as my 3Ghz D920 processor.
+
+Well, it was expected not to be easy so maybe instead move to a
+different track: let's move back to the code from attempt2 and do some
+loop unrolling. This will eliminate a few if statements. I'll try
+different amounts of unrolling to see what works best.
+
+
+Attempt 4
+=========
+
+Unrolled the loop 1, 2, 3 and 4 times.
+For 4 the code starts with:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        rp4 ^= cur;
+        rp6 ^= cur;
+        rp8 ^= cur;
+        rp10 ^= cur;
+        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
+        cur = *bp++;
+        par ^= cur;
+        rp5 ^= cur;
+        rp6 ^= cur;
+        ...
+
+
+Analysis 4
+==========
+
+Unrolling once gains about 15%
+Unrolling twice keeps the gain at about 15%
+Unrolling three times gives a gain of 30% compared to attempt 2.
+Unrolling four times gives a marginal improvement compared to unrolling
+three times.
+
+I decided to proceed with a four time unrolled loop anyway. It was my gut
+feeling that in the next steps I would obtain additional gain from it.
+
+The next step was triggered by the fact that par contains the xor of all
+bytes and rp4 and rp5 each contain the xor of half of the bytes.
+So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
+that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
+eliminate rp5 (or rp4, but I already foresaw another optimisation).
+The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
+
+
+Attempt 5
+=========
+
+Effectively so all odd digit rp assignments in the loop were removed.
+This included the else clause of the if statements.
+Of course after the loop we need to correct things by adding code like:
+    rp5 = par ^ rp4;
+Also the initial assignments (rp5 = 0; etc) could be removed.
+Along the line I also removed the initialisation of rp0/1/2/3.
+
+
+Analysis 5
+==========
+
+Measurements showed this was a good move. The run-time roughly halved
+compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
+of the processor time compared to the current code in the linux kernel.
+
+However, still I thought there was more. I didn't like all the if
+statements. Why not keep a running parity and only keep the last if
+statement. Time for yet another version!
+
+
+Attempt 6
+=========
+
+THe code within the for loop was changed to:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+
+As you can see tmppar is used to accumulate the parity within a for
+iteration. In the last 3 statements is is added to par and, if needed,
+to rp12 and rp14.
+
+While making the changes I also found that I could exploit that tmppar
+contains the running parity for this iteration. So instead of having:
+rp4 ^= cur; rp6 = cur;
+I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
+statement. A similar change was done for rp8 and rp10
+
+
+Analysis 6
+==========
+
+Measuring this code again showed big gain. When executing the original
+linux code 1 million times, this took about 1 second on my system.
+(using time to measure the performance). After this iteration I was back
+to 0.075 sec. Actually I had to decide to start measuring over 10
+million interations in order not to loose too much accuracy. This one
+definitely seemed to be the jackpot!
+
+There is a little bit more room for improvement though. There are three
+places with statements:
+rp4 ^= cur; rp6 ^= cur;
+It seems more efficient to also maintain a variable rp4_6 in the while
+loop; This eliminates 3 statements per loop. Of course after the loop we
+need to correct by adding:
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6
+Furthermore there are 4 sequential assingments to rp8. This can be
+encoded slightly more efficient by saving tmppar before those 4 lines
+and later do rp8 = rp8 ^ tmppar ^ notrp8;
+(where notrp8 is the value of rp8 before those 4 lines).
+Again a use of the commutative property of xor.
+Time for a new test!
+
+
+Attempt 7
+=========
+
+The new code now looks like:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    notrp8 = tmppar;
+	    cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+	    rp8 = rp8 ^ tmppar ^ notrp8;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6;
+
+
+Not a big change, but every penny counts :-)
+
+
+Analysis 7
+==========
+
+Acutally this made things worse. Not very much, but I don't want to move
+into the wrong direction. Maybe something to investigate later. Could
+have to do with caching again.
+
+Guess that is what there is to win within the loop. Maybe unrolling one
+more time will help. I'll keep the optimisations from 7 for now.
+
+
+Attempt 8
+=========
+
+Unrolled the loop one more time.
+
+
+Analysis 8
+==========
+
+This makes things worse. Let's stick with attempt 6 and continue from there.
+Although it seems that the code within the loop cannot be optimised
+further there is still room to optimize the generation of the ecc codes.
+We can simply calcualate the total parity. If this is 0 then rp4 = rp5
+etc. If the parity is 1, then rp4 = !rp5;
+But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
+in the result byte and then do something like
+    code[0] |= (code[0] << 1);
+Lets test this.
+
+
+Attempt 9
+=========
+
+Changed the code but again this slightly degrades performance. Tried all
+kind of other things, like having dedicated parity arrays to avoid the
+shift after parity[rp7] << 7; No gain.
+Change the lookup using the parity array by using shift operators (e.g.
+replace parity[rp7] << 7 with:
+rp7 ^= (rp7 << 4);
+rp7 ^= (rp7 << 2);
+rp7 ^= (rp7 << 1);
+rp7 &= 0x80;
+No gain.
+
+The only marginal change was inverting the parity bits, so we can remove
+the last three invert statements.
+
+Ah well, pity this does not deliver more. Then again 10 million
+iterations using the linux driver code takes between 13 and 13.5
+seconds, whereas my code now takes about 0.73 seconds for those 10
+million iterations. So basically I've improved the performance by a
+factor 18 on my system. Not that bad. Of course on different hardware
+you will get different results. No warranties!
+
+But of course there is no such thing as a free lunch. The codesize almost
+tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
+
+
+Correcting errors
+=================
+
+For correcting errors I again used the ST application note as a starter,
+but I also peeked at the existing code.
+The algorithm itself is pretty straightforward. Just xor the given and
+the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
+are 1 we have one correctable bit error. If there is 1 bit 1, we have an
+error in the given ecc code.
+It proved to be fastest to do some table lookups. Performance gain
+introduced by this is about a factor 2 on my system when a repair had to
+be done, and 1% or so if no repair had to be done.
+Code size increased from 330 bytes to 686 bytes for this function.
+(gcc 4.2, -O3)
+
+
+Conclusion
+==========
+
+The gain when calculating the ecc is tremendous. Om my development hardware
+a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
+embedded system with a MIPS core a factor 7 was obtained.
+On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
+5 (big endian mode, gcc 4.1.2, -O3)
+For correction not much gain could be obtained (as bitflips are rare). Then
+again there are also much less cycles spent there.
+
+It seems there is not much more gain possible in this, at least when
+programmed in C. Of course it might be possible to squeeze something more
+out of it with an assembler program, but due to pipeline behaviour etc
+this is very tricky (at least for intel hw).
+
+Author: Frans Meulenbroeks
+Copyright (C) 2008 Koninklijke Philips Electronics NV.
diff -urN git/drivers/mtd/nand/nand_ecc.c work/drivers/mtd/nand/nand_ecc.c
--- git/drivers/mtd/nand/nand_ecc.c	2008-08-14 19:47:14.000000000 +0200
+++ work/drivers/mtd/nand/nand_ecc.c	2008-08-14 19:59:32.000000000 +0200
@@ -1,13 +1,18 @@
  /*
- * This file contains an ECC algorithm from Toshiba that detects and
- * corrects 1 bit errors in a 256 byte block of data.
+ * This file contains an ECC algorithm that detects and corrects 1 bit
+ * errors in a 256 byte block of data.
   *
   * drivers/mtd/nand/nand_ecc.c
   *
- * Copyright (C) 2000-2004 Steven J. Hill (sjhill@realitydiluted.com)
- *                         Toshiba America Electronics Components, Inc.
+ * Copyright (C) 2008 Koninklijke Philips Electronics NV.
+ *                    Author: Frans Meulenbroeks
   *
- * Copyright (C) 2006 Thomas Gleixner <tglx@linutronix.de>
+ * Completely replaces the previous ECC implementation which was written by:
+ *   Steven J. Hill (sjhill@realitydiluted.com)
+ *   Thomas Gleixner (tglx@linutronix.de)
+ *
+ * Information on how this algorithm works and how it was developed
+ * can be found in Documentation/nand/ecc.txt
   *
   * This file is free software; you can redistribute it and/or modify it
   * under the terms of the GNU General Public License as published by the
@@ -23,174 +28,415 @@
   * with this file; if not, write to the Free Software Foundation, Inc.,
   * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
   *
- * As a special exception, if other files instantiate templates or use
- * macros or inline functions from these files, or you compile these
- * files and link them with other works to produce a work based on these
- * files, these files do not by themselves cause the resulting work to be
- * covered by the GNU General Public License. However the source code for
- * these files must still be made available in accordance with section (3)
- * of the GNU General Public License.
- *
- * This exception does not invalidate any other reasons why a work based on
- * this file might be covered by the GNU General Public License.
   */

+/*
+ * The STANDALONE macro is useful when running the code outside the kernel
+ * e.g. when running the code in a testbed or a benchmark program.
+ * When STANDALONE is used, the module related macros are commented out
+ * as well as the linux include files.
+ * Instead a private definition of mtd_into is given to satisfy the compiler
+ * (the code does not use mtd_info, so the code does not care)
+ */
+#ifndef STANDALONE
  #include <linux/types.h>
  #include <linux/kernel.h>
  #include <linux/module.h>
  #include <linux/mtd/nand_ecc.h>
+#else
+struct mtd_info {
+	int dummy;
+};
+#define EXPORT_SYMBOL(x)  /* x */
+
+#define MODULE_LICENSE(x)	/* x */
+#define MODULE_AUTHOR(x)	/* x */
+#define MODULE_DESCRIPTION(x)	/* x */
+#endif
+
+/*
+ * invparity is a 256 byte table that contains the odd parity
+ * for each byte. So if the number of bits in a byte is even,
+ * the array element is 1, and when the number of bits is odd
+ * the array eleemnt is 0.
+ */
+static const char invparity[256] = {
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1
+};
+
+/*
+ * bitsperbyte contains the number of bits per byte
+ * this is only used for testing and repairing parity
+ * (a precalculated value slightly improves performance)
+ */
+static const char bitsperbyte[256] = {
+	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
+};

  /*
- * Pre-calculated 256-way 1 byte column parity
+ * addressbits is a lookup table to filter out the bits from the xor-ed
+ * ecc data that identify the faulty location.
+ * this is only used for repairing parity
+ * see the comments in nand_correct_data for more details
   */
-static const u_char nand_ecc_precalc_table[] = {
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00
+static const char addressbits[256] = {
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f
  };

  /**
   * nand_calculate_ecc - [NAND Interface] Calculate 3-byte ECC for 256-byte block
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
   * @dat:	raw data
   * @ecc_code:	buffer for ECC
   */
-int nand_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
-		       u_char *ecc_code)
+int nand_calculate_ecc(struct mtd_info *mtd, const unsigned char *buf,
+		       unsigned char *code)
  {
-	uint8_t idx, reg1, reg2, reg3, tmp1, tmp2;
  	int i;
-
-	/* Initialize variables */
-	reg1 = reg2 = reg3 = 0;
-
-	/* Build up column parity */
-	for(i = 0; i < 256; i++) {
-		/* Get CP0 - CP5 from table */
-		idx = nand_ecc_precalc_table[*dat++];
-		reg1 ^= (idx & 0x3f);
-
-		/* All bit XOR = 1 ? */
-		if (idx & 0x40) {
-			reg3 ^= (uint8_t) i;
-			reg2 ^= ~((uint8_t) i);
-		}
+	const unsigned long *bp = (unsigned long *)buf;
+	unsigned long cur;	/* current value in buffer */
+	/* rp0..rp15 are the various accumulated parities (per byte) */
+	unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+	unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+	unsigned long par;	/* the cumulative parity for all data */
+	unsigned long tmppar;	/* the cumulative parity for this iteration;
+				   for rp12 and rp14 at the end of the loop */
+
+	par = 0;
+	rp4 = 0;
+	rp6 = 0;
+	rp8 = 0;
+	rp10 = 0;
+	rp12 = 0;
+	rp14 = 0;
+
+	/*
+	 * The loop is unrolled a number of times;
+	 * This avoids if statements to decide on which rp value to update
+	 * Also we process the data by longwords.
+	 * Note: passing unaligned data might give a performance penalty.
+	 * It is assumed that the buffers are aligned.
+	 * tmppar is the cumulative sum of this iteration.
+	 * needed for calculating rp12, rp14 and par
+	 * also used as a performance improvement for rp6, rp8 and rp10
+	 */
+	for (i = 0; i < 4; i++) {
+		cur = *bp++;
+		tmppar = cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= tmppar;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= tmppar;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp10 ^= tmppar;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= cur;
+
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+
+		par ^= tmppar;
+		if ((i & 0x1) == 0)
+			rp12 ^= tmppar;
+		if ((i & 0x2) == 0)
+			rp14 ^= tmppar;
  	}

-	/* Create non-inverted ECC code from line parity */
-	tmp1  = (reg3 & 0x80) >> 0; /* B7 -> B7 */
-	tmp1 |= (reg2 & 0x80) >> 1; /* B7 -> B6 */
-	tmp1 |= (reg3 & 0x40) >> 1; /* B6 -> B5 */
-	tmp1 |= (reg2 & 0x40) >> 2; /* B6 -> B4 */
-	tmp1 |= (reg3 & 0x20) >> 2; /* B5 -> B3 */
-	tmp1 |= (reg2 & 0x20) >> 3; /* B5 -> B2 */
-	tmp1 |= (reg3 & 0x10) >> 3; /* B4 -> B1 */
-	tmp1 |= (reg2 & 0x10) >> 4; /* B4 -> B0 */
-
-	tmp2  = (reg3 & 0x08) << 4; /* B3 -> B7 */
-	tmp2 |= (reg2 & 0x08) << 3; /* B3 -> B6 */
-	tmp2 |= (reg3 & 0x04) << 3; /* B2 -> B5 */
-	tmp2 |= (reg2 & 0x04) << 2; /* B2 -> B4 */
-	tmp2 |= (reg3 & 0x02) << 2; /* B1 -> B3 */
-	tmp2 |= (reg2 & 0x02) << 1; /* B1 -> B2 */
-	tmp2 |= (reg3 & 0x01) << 1; /* B0 -> B1 */
-	tmp2 |= (reg2 & 0x01) << 0; /* B7 -> B0 */
-
-	/* Calculate final ECC code */
+	/*
+	 * handle the fact that we use longword operations
+	 * we'll bring rp4..rp14 back to single byte entities by shifting and
+	 * xoring first fold the upper and lower 16 bits,
+	 * then the upper and lower 8 bits.
+	 */
+	rp4 ^= (rp4 >> 16);
+	rp4 ^= (rp4 >> 8);
+	rp4 &= 0xff;
+	rp6 ^= (rp6 >> 16);
+	rp6 ^= (rp6 >> 8);
+	rp6 &= 0xff;
+	rp8 ^= (rp8 >> 16);
+	rp8 ^= (rp8 >> 8);
+	rp8 &= 0xff;
+	rp10 ^= (rp10 >> 16);
+	rp10 ^= (rp10 >> 8);
+	rp10 &= 0xff;
+	rp12 ^= (rp12 >> 16);
+	rp12 ^= (rp12 >> 8);
+	rp12 &= 0xff;
+	rp14 ^= (rp14 >> 16);
+	rp14 ^= (rp14 >> 8);
+	rp14 &= 0xff;
+
+	/*
+	 * we also need to calculate the row parity for rp0..rp3
+	 * This is present in par, because par is now
+	 * rp3 rp3 rp2 rp2
+	 * as well as
+	 * rp1 rp0 rp1 rp0
+	 * First calculate rp2 and rp3
+	 * (and yes: rp2 = (par ^ rp3) & 0xff; but doing that did not
+	 * give a performance improvement)
+	 */
+	rp3 = (par >> 16);
+	rp3 ^= (rp3 >> 8);
+	rp3 &= 0xff;
+	rp2 = par & 0xffff;
+	rp2 ^= (rp2 >> 8);
+	rp2 &= 0xff;
+
+	/* reduce par to 16 bits then calculate rp1 and rp0 */
+	par ^= (par >> 16);
+	rp1 = (par >> 8) & 0xff;
+	rp0 = (par & 0xff);
+
+	/* finally reduce par to 8 bits */
+	par ^= (par >> 8);
+	par &= 0xff;
+
+	/*
+	 * and calculate rp5..rp15
+	 * note that par = rp4 ^ rp5 and due to the commutative property
+	 * of the ^ operator we can say:
+	 * rp5 = (par ^ rp4);
+	 * The & 0xff seems superfluous, but benchmarking learned that
+	 * leaving it out gives slightly worse results. No idea why, probably
+	 * it has to do with the way the pipeline in pentium is organized.
+	 */
+	rp5 = (par ^ rp4) & 0xff;
+	rp7 = (par ^ rp6) & 0xff;
+	rp9 = (par ^ rp8) & 0xff;
+	rp11 = (par ^ rp10) & 0xff;
+	rp13 = (par ^ rp12) & 0xff;
+	rp15 = (par ^ rp14) & 0xff;
+
+	/*
+	 * Finally calculate the ecc bits.
+	 * Again here it might seem that there are performance optimisations
+	 * possible, but benchmarks showed that on the system this is developed
+	 * the code below is the fastest
+	 */
  #ifdef CONFIG_MTD_NAND_ECC_SMC
-	ecc_code[0] = ~tmp2;
-	ecc_code[1] = ~tmp1;
+	code[0] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[1] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
  #else
-	ecc_code[0] = ~tmp1;
-	ecc_code[1] = ~tmp2;
+	code[1] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[0] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
  #endif
-	ecc_code[2] = ((~reg1) << 2) | 0x03;
-
-	return 0;
+	code[2] =
+	    (invparity[par & 0xf0] << 7) |
+	    (invparity[par & 0x0f] << 6) |
+	    (invparity[par & 0xcc] << 5) |
+	    (invparity[par & 0x33] << 4) |
+	    (invparity[par & 0xaa] << 3) |
+	    (invparity[par & 0x55] << 2) |
+	    3;
  }
  EXPORT_SYMBOL(nand_calculate_ecc);

-static inline int countbits(uint32_t byte)
-{
-	int res = 0;
-
-	for (;byte; byte >>= 1)
-		res += byte & 0x01;
-	return res;
-}
-
  /**
   * nand_correct_data - [NAND Interface] Detect and correct bit error(s)
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
   * @dat:	raw data read from the chip
   * @read_ecc:	ECC from the chip
   * @calc_ecc:	the ECC calculated from raw data
   *
   * Detect and correct a 1 bit error for 256 byte block
   */
-int nand_correct_data(struct mtd_info *mtd, u_char *dat,
-		      u_char *read_ecc, u_char *calc_ecc)
+int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
+		      unsigned char *read_ecc, unsigned char *calc_ecc)
  {
-	uint8_t s0, s1, s2;
-
+	int nr_bits;
+	unsigned char b0, b1, b2;
+	unsigned char byte_addr, bit_addr;
+
+	/*
+	 * b0 to b2 indicate which bit is faulty (if any)
+	 * we might need the xor result  more than once,
+	 * so keep them in a local var
+	*/
  #ifdef CONFIG_MTD_NAND_ECC_SMC
-	s0 = calc_ecc[0] ^ read_ecc[0];
-	s1 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[0] ^ calc_ecc[0];
+	b1 = read_ecc[1] ^ calc_ecc[1];
  #else
-	s1 = calc_ecc[0] ^ read_ecc[0];
-	s0 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[1] ^ calc_ecc[1];
+	b1 = read_ecc[0] ^ calc_ecc[0];
  #endif
-	if ((s0 | s1 | s2) == 0)
-		return 0;
-
-	/* Check for a single bit error */
-	if( ((s0 ^ (s0 >> 1)) & 0x55) == 0x55 &&
-	    ((s1 ^ (s1 >> 1)) & 0x55) == 0x55 &&
-	    ((s2 ^ (s2 >> 1)) & 0x54) == 0x54) {
-
-		uint32_t byteoffs, bitnum;
-
-		byteoffs = (s1 << 0) & 0x80;
-		byteoffs |= (s1 << 1) & 0x40;
-		byteoffs |= (s1 << 2) & 0x20;
-		byteoffs |= (s1 << 3) & 0x10;
+	b2 = read_ecc[2] ^ calc_ecc[2];

-		byteoffs |= (s0 >> 4) & 0x08;
-		byteoffs |= (s0 >> 3) & 0x04;
-		byteoffs |= (s0 >> 2) & 0x02;
-		byteoffs |= (s0 >> 1) & 0x01;
+	/* check if there are any bitfaults */

-		bitnum = (s2 >> 5) & 0x04;
-		bitnum |= (s2 >> 4) & 0x02;
-		bitnum |= (s2 >> 3) & 0x01;
+	/* count nr of bits; use table lookup, faster than calculating it */
+	nr_bits = bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2];

-		dat[byteoffs] ^= (1 << bitnum);
-
-		return 1;
+	/* repeated if statements are slightly more efficient than switch ... */
+	/* ordered in order of likelihood */
+	if (nr_bits == 0)
+		return (0);	/* no error */
+	if (nr_bits == 11) {	/* correctable error */
+		/*
+		 * rp15/13/11/9/7/5/3/1 indicate which byte is the faulty byte
+		 * cp 5/3/1 indicate the faulty bit.
+		 * A lookup table (called addressbits) is used to filter
+		 * the bits from the byte they are in.
+		 * A marginal optimisation is possible by having three
+		 * different lookup tables.
+		 * One as we have now (for b0), one for b2
+		 * (that would avoid the >> 1), and one for b1 (with all values
+		 * << 4). However it was felt that introducing two more tables
+		 * hardly justify the gain.
+		 *
+		 * The b2 shift is there to get rid of the lowest two bits.
+		 * We could also do addressbits[b2] >> 1 but for the
+		 * performace it does not make any difference
+		 */
+		byte_addr = (addressbits[b1] << 4) + addressbits[b0];
+		bit_addr = addressbits[b2 >> 2];
+		/* flip the bit */
+		buf[byte_addr] ^= (1 << bit_addr);
+		return (1);
  	}
-
-	if(countbits(s0 | ((uint32_t)s1 << 8) | ((uint32_t)s2 <<16)) == 1)
-		return 1;
-
-	return -EBADMSG;
+	if (nr_bits == 1)
+		return (1);	/* error in ecc data; no action needed */
+	return -1;
  }
  EXPORT_SYMBOL(nand_correct_data);

  MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
+MODULE_AUTHOR("Frans Meulenbroeks <fransmeulenbroeks@gmail.com>");
  MODULE_DESCRIPTION("Generic NAND ECC support");

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-14 18:07       ` frans
@ 2008-08-14 19:10         ` Troy Kisky
  2008-08-15  8:41           ` Frans Meulenbroeks
  0 siblings, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-14 19:10 UTC (permalink / raw)
  To: frans; +Cc: David Woodhouse, linux-mtd

frans wrote:
> Fixed the last remaining issues, made sure to diff with the very latest mtd
> git version.
> 
> Attached is a complete rewrite of nand_ecc.c including documentation.
> This rewrite improves performance about 18 times on intel (D920),
> 7 times on MIPS and 5 times on ARM (NSLU2)
> 
> Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>

This look very complex to me. How about something like this.
Note, I could make it much smaller if allowed to result in a different
ecc value than current implementation.


#ifdef CONFIG_MTD_NAND_ECC_SMC
#define LOW_ORDER_INDEX 0
#define HIGH_ORDER_INDEX 1
#else
#define LOW_ORDER_INDEX 1
#define HIGH_ORDER_INDEX 0
#endif

/**
 * nand_calculate_ecc - [NAND Interface] Calculate 3-byte ECC for 256/512 byte block
 * @mtd:	MTD block structure
 * @dat:	raw data
 * @ecc_code:	buffer for ECC
 */
int nand_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
		       u_char *ecc_code)
{
	uint32_t j = ((struct nand_chip *)mtd->priv)->ecc.size;			/* 256 or 512 bytes/ecc  */
	uint32_t k=0;
	uint32_t xor = 0;
	uint32_t tecc = 0;
	uint32_t ecc = 0;
	uint32_t * p = (uint32_t *)dat;
	uint32_t v;
//#define FORCE_ECC_ERROR
#ifdef FORCE_ECC_ERROR
	uint32_t forceBitNum;
	{
		/* force single bit ecc error to test ecc correction code */
		u_char* p = (u_char*)dat;
		forceBitNum = p[0] | (p[1]<<8);	/* 1st word of block determines which bit is made in error */
		forceBitNum &= 0xfff;
		if ((forceBitNum&0x800)==0) {
			/* limit error to 1st 256 bytes */
			p[forceBitNum>>3] ^= 1<<(forceBitNum&7);
			printk(KERN_INFO "Forcing ecc error, byte:0x%x, bit:%i\n",forceBitNum>>3,forceBitNum&7);
		}
	}
#endif

	do {
		v = *p++;
		xor ^= v;
		v ^= (v>>16);
		v ^= (v>>8);
		v ^= (v>>4);
		v ^= (v>>2);
		v ^= (v>>1);
		if (v&1) tecc ^= k;
		k++;
		j-=4;
	} while (j);
	__cpu_to_le64s(xor);
	v = (xor>>16)^xor;
	v ^= ((v>>8)&(0x00ff00ff&0x00ffffff));
	v ^= ((v>>4)&(0x0f0f0f0f&0x000f0fff));
	v ^= ((v>>2)&(0x33333333&0x0003033f));
	v ^= ((v>>1)&(0x55555555&0x00010117));
	
	/* now duplicate all bits */
	if (tecc&(1<<6)) ecc ^= 3<<22;
	if (tecc&(1<<5)) ecc ^= 3<<20;
	if (tecc&(1<<4)) ecc ^= 3<<18;
	if (tecc&(1<<3)) ecc ^= 3<<16;
	if (tecc&(1<<2)) ecc ^= 3<<14;
	if (tecc&(1<<1)) ecc ^= 3<<12;
	if (tecc&(1<<0)) ecc ^= 3<<10;

	if (v&(1<<16)) ecc ^= 3<<8;
	if (v&(1<<8)) ecc ^= 3<<6;
	if (v&(1<<4)) ecc ^= 3<<4;
	if (v&(1<<2)) ecc ^= 3<<2;
	if (v&(1<<1)) ecc ^= 3<<0;
	if (v&(1<<0)) ecc ^= 0x555555;		/* if parity is odd, low bits are opposite of high bits */

#ifdef FORCE_ECC_ERROR
	if (forceBitNum&0x800) {
		ecc ^= 1<<(forceBitNum&0xf);
		printk(KERN_INFO "Forcing single bit error in ecc itself bit %i\n",forceBitNum&0xf);
	}
#endif
	if (((struct nand_chip *)mtd->priv)->ecc.size==256) ecc <<= 2;
	ecc = ~ecc;


#ifdef VERIFY_NEW_ECC_ALG
	{
		uint32_t s;
		nand_calculate_ecc_old(mtd,dat,ecc_code);
		ecc &= 0xffffff;
		s = (ecc_code[HIGH_ORDER_INDEX]<<16) | (ecc_code[LOW_ORDER_INDEX]<<8) | ecc_code[2];

		if (s != ecc) {
			printk(KERN_ERR "New algorithm is buggy!!!! s=%x, ecc=%x\n",s,ecc);
			return 0;
		}
	}
#endif

	/* Calculate final ECC code */
	ecc_code[HIGH_ORDER_INDEX] = (u_char)(ecc>>16);
	ecc_code[LOW_ORDER_INDEX] = (u_char)(ecc>>8);
	ecc_code[2] = (u_char)ecc;
	return 0;
}

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-14 19:10         ` Troy Kisky
@ 2008-08-15  8:41           ` Frans Meulenbroeks
  2008-08-15  8:46             ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-15  8:41 UTC (permalink / raw)
  To: Troy Kisky; +Cc: David Woodhouse, linux-mtd

2008/8/14, Troy Kisky <troy.kisky@boundarydevices.com>:
> frans wrote:
>  > Fixed the last remaining issues, made sure to diff with the very latest mtd
>  > git version.
>  >
>  > Attached is a complete rewrite of nand_ecc.c including documentation.
>  > This rewrite improves performance about 18 times on intel (D920),
>  > 7 times on MIPS and 5 times on ARM (NSLU2)
>  >
>  > Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>
>
>
> This look very complex to me. How about something like this.
>  Note, I could make it much smaller if allowed to result in a different
>  ecc value than current implementation.
>

To some extend my code is complex. That is why I also provided
documentation to explain how things work.
Your code is indeed much simpler and smaller, but unfortunately it is
also slower.
I did a quick benchmark test. Your code (on x86) is about three times
as fast as the original code. However it is still about 5 times as
slow as the code I submitted.

Your speedup is mostly caused by going from 8 bit processing in the
original code to 32 bit processing in your code.
I've achieved additional gain by loop unrolling.

Since this code is very often executed in embedded systems that have
nand flash and no hardware ecc correction, I feel the focus on
performance.

Best regards, Frans.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15  8:41           ` Frans Meulenbroeks
@ 2008-08-15  8:46             ` David Woodhouse
  2008-08-15  9:23               ` Frans Meulenbroeks
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2008-08-15  8:46 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd, Troy Kisky

Frans, thank you very much for your work. One question -- has this been
tested on 64-bit machines?

The patch still doesn't apply, unfortunately -- something seems to be
mangling it. In general, it's a good idea to send patches to yourself in
email first, then save them off from your own INBOX and try to apply
them to a clean kernel tree.

Please could you let me have a copy as an attachment, or just a straight
copy of both files? 

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15  8:46             ` David Woodhouse
@ 2008-08-15  9:23               ` Frans Meulenbroeks
  2008-08-15  9:41                 ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-15  9:23 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd, Troy Kisky

2008/8/15, David Woodhouse <dwmw2@infradead.org>:
> Frans, thank you very much for your work. One question -- has this been
>  tested on 64-bit machines?

Good question, and more difficult than I thought.
Yes, It has been tested on 64 bit hardware. More specific a DELL 9150
with a D920 (presler) dual core processor.
However, not too sure if this system runs 64bit linux. Basically it
runs an out-of-the-box install of Ubuntu 7.10.

Note also that I tested this on this PC by running a test application
and comparing the results with the original code as my PC does not
have (user accessible) NAND flash.

If desired I can install and test on a 64 bit kernel.
Then again, I do not really expect issues (it is only an algorithm
with simple operations).
Also 64 bit is not that interesting. I doubt if we will ever see
systems 64 bit processors that use NAND flash without hardware ECC
calculations...

>
>  The patch still doesn't apply, unfortunately -- something seems to be
>  mangling it. In general, it's a good idea to send patches to yourself in
>  email first, then save them off from your own INBOX and try to apply
>  them to a clean kernel tree.

Yikes. Can you email me the error log?
I know about sending patches to myself and testing them, but again
things are more complicated here.
I've developed this on a fresh kernel from kernel.org. The previous
version failed because the mtd git has a change from beginning of june
where all cvs comments have been removed. This change has not
propagated to mainline yet (2.6.26.2 still contains the cvs comment).
As I have no git experience, what I did was pull a copy of nand_ecc.c
(the only file I change) from the mtd git using my browser
(http://git.infradead.org/mtd-2.6.git?a=blob_plain;f=drivers/mtd/nand/nand_ecc.c;hb=HEAD)
and overwrite the one in 2.6.26.2. The only thing I can imagine is
that something went wrong with that. (e.g. tabs and spaces).

>
>  Please could you let me have a copy as an attachment, or just a straight
>  copy of both files?

I'll send you both files later today (they are on a different system).

Best regards, Frans. (and apologies for the trouble I caused with this).

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15  9:23               ` Frans Meulenbroeks
@ 2008-08-15  9:41                 ` David Woodhouse
  2008-08-15 10:04                   ` Frans Meulenbroeks
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2008-08-15  9:41 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd

On Fri, 2008-08-15 at 11:23 +0200, Frans Meulenbroeks wrote:
> 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
> If desired I can install and test on a 64 bit kernel.

No need -- if you've thought about it and believe it should work, that's
probably enough. I just saw some 'unsigned long' data types, which are
going to have a different size between 32-bit and 64-bit systems, and
wondered if that would introduce differences.

> Yikes. Can you email me the error log?

patching file Documentation/nand/ecc.txt
patching file drivers/mtd/nand/nand_ecc.c
Hunk #1 FAILED at 1.
Hunk #2 FAILED at 28.
2 out of 2 hunks FAILED -- saving rejects to file drivers/mtd/nand/nand_ecc.c.rej

I see stuff like this in the patch file:

  }
  EXPORT_SYMBOL(nand_correct_data);

  MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
+MODULE_AUTHOR("Frans Meulenbroeks <fransmeulenbroeks@gmail.com>");
  MODULE_DESCRIPTION("Generic NAND ECC support");

But those lines at the end of the file aren't indented by a space,
although the patch seems to expect them to be.

> I know about sending patches to myself and testing them, but again
> things are more complicated here.
> I've developed this on a fresh kernel from kernel.org. The previous
> version failed because the mtd git has a change from beginning of june
> where all cvs comments have been removed. This change has not
> propagated to mainline yet (2.6.26.2 still contains the cvs comment).
> As I have no git experience, what I did was pull a copy of nand_ecc.c
> (the only file I change) from the mtd git using my browser
> (http://git.infradead.org/mtd-2.6.git?a=blob_plain;f=drivers/mtd/nand/nand_ecc.c;hb=HEAD)
> and overwrite the one in 2.6.26.2. The only thing I can imagine is
> that something went wrong with that. (e.g. tabs and spaces).

You'd do better with just 
 git clone git://git.infradead.org/mtd-2.6.git
 cp ~/nand_ecc.c drivers/mtd/nand
 git-diff drivers/mtd/nand/nand_ecc.c | mail dwmw2

But I actually suspect it was mangled in transit in the patch -- an
attachment might survive, although you should really work out what's
eating your mail.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15  9:41                 ` David Woodhouse
@ 2008-08-15 10:04                   ` Frans Meulenbroeks
  2008-08-15 10:12                     ` David Woodhouse
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-15 10:04 UTC (permalink / raw)
  To: David Woodhouse; +Cc: linux-mtd

2008/8/15, David Woodhouse <dwmw2@infradead.org>:
> On Fri, 2008-08-15 at 11:23 +0200, Frans Meulenbroeks wrote:
>  > 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
>
> No need -- if you've thought about it and believe it should work, that's
>  probably enough. I just saw some 'unsigned long' data types, which are
>  going to have a different size between 32-bit and 64-bit systems, and
>  wondered if that would introduce differences.

I was unaware of that. For me unsigned long is more or less a synonym
for 32 bit. Maybe I'm just getting too old :-(
Anyway, if you have a suggestion for a better type, I'll happily change things.
Would uint32_t be better?

>
>
>  > Yikes. Can you email me the error log?
>
>
> patching file Documentation/nand/ecc.txt
>  patching file drivers/mtd/nand/nand_ecc.c
>  Hunk #1 FAILED at 1.
>  Hunk #2 FAILED at 28.
>  2 out of 2 hunks FAILED -- saving rejects to file drivers/mtd/nand/nand_ecc.c.rej
>
>  I see stuff like this in the patch file:
>
>
>   }
>   EXPORT_SYMBOL(nand_correct_data);
>
>   MODULE_LICENSE("GPL");
>  -MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
>  +MODULE_AUTHOR("Frans Meulenbroeks <fransmeulenbroeks@gmail.com>");
>   MODULE_DESCRIPTION("Generic NAND ECC support");
>
>
> But those lines at the end of the file aren't indented by a space,
>  although the patch seems to expect them to be.

This is odd, as the patch of ecc.txt did succeed. Just browsed back to
the email I did sent
did not have a space between the + and the text in ecc.txt either. and
patch did not make a fuzz out of that one.
Btw the patch is made with diff (the std one from ubuntu 7.10). No
fancy new stuff like quilt here (I'm just an old fart :-) )

> You'd do better with just
>   git clone git://git.infradead.org/mtd-2.6.git
>   cp ~/nand_ecc.c drivers/mtd/nand
>   git-diff drivers/mtd/nand/nand_ecc.c | mail dwmw2

Will try this tonight (which is in 8 hrs or so).
Not sure if the | mail will work.
I mostly use web based mail, and for this I installed alpine to avoid
html makeup and line wrapping. Alpine is coupled to my gmail account.

Anyway, I'll execute the commands you gave to me and send the result
of git-diff to you inlined as well as in an attachment.
>
>  But I actually suspect it was mangled in transit in the patch -- an
>  attachment might survive, although you should really work out what's
>  eating your mail.

I'll dig into this. It could also be that the nand_ecc.c I pulled from
git through the browser got mangled. Will get back to you. Thanks for
your patience & help!

Frans.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 10:04                   ` Frans Meulenbroeks
@ 2008-08-15 10:12                     ` David Woodhouse
  2008-08-15 18:56                       ` Troy Kisky
  0 siblings, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2008-08-15 10:12 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd

On Fri, 2008-08-15 at 12:04 +0200, Frans Meulenbroeks wrote:
> 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
> > On Fri, 2008-08-15 at 11:23 +0200, Frans Meulenbroeks wrote:
> >  > 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
> >
> > No need -- if you've thought about it and believe it should work, that's
> >  probably enough. I just saw some 'unsigned long' data types, which are
> >  going to have a different size between 32-bit and 64-bit systems, and
> >  wondered if that would introduce differences.
> 
> I was unaware of that. For me unsigned long is more or less a synonym
> for 32 bit. Maybe I'm just getting too old :-(
> Anyway, if you have a suggestion for a better type, I'll happily change things.
> Would uint32_t be better?

If it needs to be 32-bit, then yes -- uint32_t is the correct type to
use.

If 64-bit is OK, then 'unsigned long' is likely to be more efficient on
some platforms.

> >  I see stuff like this in the patch file:
> >
> >
> >   }
> >   EXPORT_SYMBOL(nand_correct_data);
> >
> >   MODULE_LICENSE("GPL");
> >  -MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
> >  +MODULE_AUTHOR("Frans Meulenbroeks <fransmeulenbroeks@gmail.com>");
> >   MODULE_DESCRIPTION("Generic NAND ECC support");
> >
> >
> > But those lines at the end of the file aren't indented by a space,
> >  although the patch seems to expect them to be.

Er, stop. What you quoted above _isn't_ what I sent you. In what I sent,
there were _two_ spaces before the 'context' lines (MODULE_LICENSE...
and MODULE_DESCRIPTION...).

Your mail setup is corrupting your mail.

> Will try this tonight (which is in 8 hrs or so).
> Not sure if the | mail will work.

It's fairly unlikely to -- that was kind of a placeholder for sending
mail _somehow_ that doesn't get it corrupted.

> I mostly use web based mail, and for this I installed alpine to avoid
> html makeup and line wrapping. Alpine is coupled to my gmail account.

Make sure you turn off flowed text in alpine.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 10:12                     ` David Woodhouse
@ 2008-08-15 18:56                       ` Troy Kisky
  2008-08-15 21:14                         ` frans
  0 siblings, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-15 18:56 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Frans Meulenbroeks, linux-mtd

David Woodhouse wrote:
> On Fri, 2008-08-15 at 12:04 +0200, Frans Meulenbroeks wrote:
>> 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
>>> On Fri, 2008-08-15 at 11:23 +0200, Frans Meulenbroeks wrote:
>>>  > 2008/8/15, David Woodhouse <dwmw2@infradead.org>:
>>>
>>> No need -- if you've thought about it and believe it should work, that's
>>>  probably enough. I just saw some 'unsigned long' data types, which are
>>>  going to have a different size between 32-bit and 64-bit systems, and
>>>  wondered if that would introduce differences.
>> I was unaware of that. For me unsigned long is more or less a synonym
>> for 32 bit. Maybe I'm just getting too old :-(
>> Anyway, if you have a suggestion for a better type, I'll happily change things.
>> Would uint32_t be better?
> 
> If it needs to be 32-bit, then yes -- uint32_t is the correct type to
> use.
> 
> If 64-bit is OK, then 'unsigned long' is likely to be more efficient on
> some platforms.
> 

You might also test it on a big endian system to be safe.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 18:56                       ` Troy Kisky
@ 2008-08-15 21:14                         ` frans
  2008-08-16 10:04                           ` David Woodhouse
                                             ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: frans @ 2008-08-15 21:14 UTC (permalink / raw)
  To: Troy Kisky; +Cc: linux-mtd, Frans Meulenbroeks, David Woodhouse

[-- Attachment #1: Type: TEXT/PLAIN, Size: 47441 bytes --]



On Fri, 15 Aug 2008, Troy Kisky wrote:
>
> You might also test it on a big endian system to be safe.
>
> Troy
>
Good remark!
I should have mentioned this more explicitly. As mentioned before I did test
on Linksys NSLU2. This is an ARM and standard Linksys software (as well as
the std unslung one from nslu2-linux.org) is big endian (and yes, that
is the part I did not mention).

Wrt the patch: I've performed the git commands as given by David. Before
doing so I created an empty Documentation/nand/ecc.txt file and did a local 
commit. After that I've added the two files and did a git diff (didn't know 
another way to add a new file, hope that does not break things).

Wrt the previous patch failing. No idea what happened. The patch file on my system is sound, but the file in the sent folder of alpine is already bad. Did a test post to myself and now it is good. No idea what went wrong. Probably I did something clumsy in the alpine editor when inserting the file. (ok, I'll admit it, I am an alpine n00 <blush> )

Anyway, here is the patch again.
I hope it is ok this time, but I'll also attach a .tgz file with diff and
complete source files.
And again apologies that this is not a first-time-right job (but I am 
learning quickly).

Frans.

Signed-off-by: Frans Meulenbroeks <fransmeulenbroeks@gmail.com>

diff --git a/Documentation/nand/ecc.txt b/Documentation/nand/ecc.txt
index e69de29..bdf93b7 100644
--- a/Documentation/nand/ecc.txt
+++ b/Documentation/nand/ecc.txt
@@ -0,0 +1,714 @@
+Introduction
+============
+
+Having looked at the linux mtd/nand driver and more specific at nand_ecc.c
+I felt there was room for optimisation. I bashed the code for a few hours
+performing tricks like table lookup removing superfluous code etc.
+After that the speed was increased by 35-40%.
+Still I was not too happy as I felt there was additional room for improvement.
+
+Bad! I was hooked.
+I decided to annotate my steps in this file. Perhaps it is useful to someone
+or someone learns something from it.
+
+
+The problem
+===========
+
+NAND flash (at least SLC one) typically has sectors of 256 bytes.
+However NAND flash is not extremely reliable so some error detection
+(and sometimes correction) is needed.
+
+This is done by means of a Hamming code. I'll try to explain it in
+laymans terms (and apologies to all the pro's in the field in case I do
+not use the right terminology, my coding theory class was almost 30
+years ago, and I must admit it was not one of my favourites).
+
+As I said before the ecc calculation is performed on sectors of 256
+bytes. This is done by calculating several parity bits over the rows and
+columns. The parity used is even parity which means that the parity bit = 1
+if the data over which the parity is calculated is 1 and the parity bit = 0
+if the data over which the parity is calculated is 0. So the total
+number of bits over the data over which the parity is calculated + the
+parity bit is even. (see wikipedia if you can't follow this).
+Parity is often calculated by means of an exclusive or operation,
+sometimes also referred to as xor. In C the operator for xor is ^
+
+Back to ecc.
+Let's give a small figure:
+
+byte   0:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp4 ... rp14
+byte   1:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp2 rp4 ... rp14
+byte   2:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp4 ... rp14
+byte   3:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp4 ... rp14
+byte   4:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp2 rp5 ... rp14
+....
+byte 254:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp0 rp3 rp5 ... rp15
+byte 255:  bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0   rp1 rp3 rp5 ... rp15
+           cp1  cp0  cp1  cp0  cp1  cp0  cp1  cp0
+           cp3  cp3  cp2  cp2  cp3  cp3  cp2  cp2
+           cp5  cp5  cp5  cp5  cp4  cp4  cp4  cp4
+
+This figure represents a sector of 256 bytes.
+cp is my abbreviaton for column parity, rp for row parity.
+
+Let's start to explain column parity.
+cp0 is the parity that belongs to all bit0, bit2, bit4, bit6.
+so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even.
+Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7.
+cp2 is the parity over bit0, bit1, bit4 and bit5
+cp3 is the parity over bit2, bit3, bit6 and bit7.
+cp4 is the parity over bit0, bit1, bit2 and bit3.
+cp5 is the parity over bit4, bit5, bit6 and bit7.
+Note that each of cp0 .. cp5 is exactly one bit.
+
+Row parity actually works almost the same.
+rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254)
+rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255)
+rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ...
+(so handle two bytes, then skip 2 bytes).
+rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...)
+for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc.
+so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...)
+and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, ..
+The story now becomes quite boring. I guess you get the idea.
+rp6 covers 8 bytes then skips 8 etc
+rp7 skips 8 bytes then covers 8 etc
+rp8 covers 16 bytes then skips 16 etc
+rp9 skips 16 bytes then covers 16 etc
+rp10 covers 32 bytes then skips 32 etc
+rp11 skips 32 bytes then covers 32 etc
+rp12 covers 64 bytes then skips 64 etc
+rp13 skips 64 bytes then covers 64 etc
+rp14 covers 128 bytes then skips 128
+rp15 skips 128 bytes then covers 128
+
+In the end the parity bits are grouped together in three bytes as
+follows:
+ECC    Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0
+ECC 0   rp07  rp06  rp05  rp04  rp03  rp02  rp01  rp00
+ECC 1   rp15  rp14  rp13  rp12  rp11  rp10  rp09  rp08
+ECC 2   cp5   cp4   cp3   cp2   cp1   cp0      1     1
+
+I detected after writing this that ST application note AN1823
+(http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much
+nicer picture.(but they use line parity as term where I use row parity)
+Oh well, I'm graphically challenged, so suffer with me for a moment :-)
+And I could not reuse the ST picture anyway for copyright reasons.
+
+
+Attempt 0
+=========
+
+Implementing the parity calculation is pretty simple.
+In C pseudocode:
+for (i = 0; i < 256; i++)
+{
+    if (i & 0x01)
+       rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    else
+       rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1;
+    if (i & 0x02)
+       rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3;
+    else
+       rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2;
+    if (i & 0x04)
+      rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5;
+    else
+      rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4;
+    if (i & 0x08)
+      rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7;
+    else
+      rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6;
+    if (i & 0x10)
+      rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9;
+    else
+      rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8;
+    if (i & 0x20)
+      rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11;
+    else
+    rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10;
+    if (i & 0x40)
+      rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13;
+    else
+      rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12;
+    if (i & 0x80)
+      rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15;
+    else
+      rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14;
+    cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0;
+    cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1;
+    cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2;
+    cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3
+    cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4
+    cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5
+}
+
+
+Analysis 0
+==========
+
+C does have bitwise operators but not really operators to do the above
+efficiently (and most hardware has no such instructions either).
+Therefore without implementing this it was clear that the code above was
+not going to bring me a Nobel prize :-)
+
+Fortunately the exclusive or operation is commutative, so we can combine
+the values in any order. So instead of calculating all the bits
+individually, let us try to rearrange things.
+For the column parity this is easy. We can just xor the bytes and in the
+end filter out the relevant bits. This is pretty nice as it will bring
+all cp calculation out of the if loop.
+
+Similarly we can first xor the bytes for the various rows.
+This leads to:
+
+
+Attempt 1
+=========
+
+const char parity[256] = {
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+    0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0
+};
+
+void ecc1(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned char *bp = buf;
+    unsigned char cur;
+    unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned char par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 256; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp1 ^= cur; else rp0 ^= cur;
+        if (i & 0x02) rp3 ^= cur; else rp2 ^= cur;
+        if (i & 0x04) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x08) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x10) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x20) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x40) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x80) rp15 ^= cur; else rp14 ^= cur;
+    }
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+Still pretty straightforward. The last three invert statements are there to
+give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash
+all data is 0xff, so the checksum then matches.
+
+I also introduced the parity lookup. I expected this to be the fastest
+way to calculate the parity, but I will investigate alternatives later
+on.
+
+
+Analysis 1
+==========
+
+The code works, but is not terribly efficient. On my system it took
+almost 4 times as much time as the linux driver code. But hey, if it was
+*that* easy this would have been done long before.
+No pain. no gain.
+
+Fortunately there is plenty of room for improvement.
+
+In step 1 we moved from bit-wise calculation to byte-wise calculation.
+However in C we can also use the unsigned long data type and virtually
+every modern microprocessor supports 32 bit operations, so why not try
+to write our code in such a way that we process data in 32 bit chunks.
+
+Of course this means some modification as the row parity is byte by
+byte. A quick analysis:
+for the column parity we use the par variable. When extending to 32 bits
+we can in the end easily calculate p0 and p1 from it.
+(because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0
+respectively)
+also rp2 and rp3 can be easily retrieved from par as rp3 covers the
+first two bytes and rp2 the last two bytes.
+
+Note that of course now the loop is executed only 64 times (256/4).
+And note that care must taken wrt byte ordering. The way bytes are
+ordered in a long is machine dependent, and might affect us.
+Anyway, if there is an issue: this code is developed on x86 (to be
+precise: a DELL PC with a D920 Intel CPU)
+
+And of course the performance might depend on alignment, but I expect
+that the I/O buffers in the nand driver are aligned properly (and
+otherwise that should be fixed to get maximum performance).
+
+Let's give it a try...
+
+
+Attempt 2
+=========
+
+extern const char parity[256];
+
+void ecc2(const unsigned char *buf, unsigned char *code)
+{
+    int i;
+    const unsigned long *bp = (unsigned long *)buf;
+    unsigned long cur;
+    unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+    unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+    unsigned long par;
+
+    par = 0;
+    rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0;
+    rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0;
+    rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0;
+    rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0;
+
+    for (i = 0; i < 64; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+    }
+    /*
+       we need to adapt the code generation for the fact that rp vars are now
+       long; also the column parity calculation needs to be changed.
+       we'll bring rp4 to 15 back to single byte entities by shifting and
+       xoring
+    */
+    rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff;
+    rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff;
+    rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff;
+    rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff;
+    rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff;
+    rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff;
+    rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff;
+    rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff;
+    rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff;
+    rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff;
+    rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff;
+    rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff;
+    rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff;
+    rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff;
+    par ^= (par >> 16);
+    rp1 = (par >> 8); rp1 &= 0xff;
+    rp0 = (par & 0xff);
+    par ^= (par >> 8); par &= 0xff;
+
+    code[0] =
+        (parity[rp7] << 7) |
+        (parity[rp6] << 6) |
+        (parity[rp5] << 5) |
+        (parity[rp4] << 4) |
+        (parity[rp3] << 3) |
+        (parity[rp2] << 2) |
+        (parity[rp1] << 1) |
+        (parity[rp0]);
+    code[1] =
+        (parity[rp15] << 7) |
+        (parity[rp14] << 6) |
+        (parity[rp13] << 5) |
+        (parity[rp12] << 4) |
+        (parity[rp11] << 3) |
+        (parity[rp10] << 2) |
+        (parity[rp9]  << 1) |
+        (parity[rp8]);
+    code[2] =
+        (parity[par & 0xf0] << 7) |
+        (parity[par & 0x0f] << 6) |
+        (parity[par & 0xcc] << 5) |
+        (parity[par & 0x33] << 4) |
+        (parity[par & 0xaa] << 3) |
+        (parity[par & 0x55] << 2);
+    code[0] = ~code[0];
+    code[1] = ~code[1];
+    code[2] = ~code[2];
+}
+
+The parity array is not shown any more. Note also that for these
+examples I kinda deviated from my regular programming style by allowing
+multiple statements on a line, not using { } in then and else blocks
+with only a single statement and by using operators like ^=
+
+
+Analysis 2
+==========
+
+The code (of course) works, and hurray: we are a little bit faster than
+the linux driver code (about 15%). But wait, don't cheer too quickly.
+THere is more to be gained.
+If we look at e.g. rp14 and rp15 we see that we either xor our data with
+rp14 or with rp15. However we also have par which goes over all data.
+This means there is no need to calculate rp14 as it can be calculated from
+rp15 through rp14 = par ^ rp15;
+(or if desired we can avoid calculating rp15 and calculate it from
+rp14).  That is why some places refer to inverse parity.
+Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13.
+Effectively this means we can eliminate the else clause from the if
+statements. Also we can optimise the calculation in the end a little bit
+by going from long to byte first. Actually we can even avoid the table
+lookups
+
+Attempt 3
+=========
+
+Odd replaced:
+        if (i & 0x01) rp5 ^= cur; else rp4 ^= cur;
+        if (i & 0x02) rp7 ^= cur; else rp6 ^= cur;
+        if (i & 0x04) rp9 ^= cur; else rp8 ^= cur;
+        if (i & 0x08) rp11 ^= cur; else rp10 ^= cur;
+        if (i & 0x10) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x20) rp15 ^= cur; else rp14 ^= cur;
+with
+        if (i & 0x01) rp5 ^= cur;
+        if (i & 0x02) rp7 ^= cur;
+        if (i & 0x04) rp9 ^= cur;
+        if (i & 0x08) rp11 ^= cur;
+        if (i & 0x10) rp13 ^= cur;
+        if (i & 0x20) rp15 ^= cur;
+
+        and outside the loop added:
+    rp4  = par ^ rp5;
+    rp6  = par ^ rp7;
+    rp8  = par ^ rp9;
+    rp10  = par ^ rp11;
+    rp12  = par ^ rp13;
+    rp14  = par ^ rp15;
+
+And after that the code takes about 30% more time, although the number of
+statements is reduced. This is also reflected in the assembly code.
+
+
+Analysis 3
+==========
+
+Very weird. Guess it has to do with caching or instruction parallellism
+or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting
+observation was that this one is only 30% slower (according to time)
+executing the code as my 3Ghz D920 processor.
+
+Well, it was expected not to be easy so maybe instead move to a
+different track: let's move back to the code from attempt2 and do some
+loop unrolling. This will eliminate a few if statements. I'll try
+different amounts of unrolling to see what works best.
+
+
+Attempt 4
+=========
+
+Unrolled the loop 1, 2, 3 and 4 times.
+For 4 the code starts with:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++;
+        par ^= cur;
+        rp4 ^= cur;
+        rp6 ^= cur;
+        rp8 ^= cur;
+        rp10 ^= cur;
+        if (i & 0x1) rp13 ^= cur; else rp12 ^= cur;
+        if (i & 0x2) rp15 ^= cur; else rp14 ^= cur;
+        cur = *bp++;
+        par ^= cur;
+        rp5 ^= cur;
+        rp6 ^= cur;
+        ...
+
+
+Analysis 4
+==========
+
+Unrolling once gains about 15%
+Unrolling twice keeps the gain at about 15%
+Unrolling three times gives a gain of 30% compared to attempt 2.
+Unrolling four times gives a marginal improvement compared to unrolling
+three times.
+
+I decided to proceed with a four time unrolled loop anyway. It was my gut
+feeling that in the next steps I would obtain additional gain from it.
+
+The next step was triggered by the fact that par contains the xor of all
+bytes and rp4 and rp5 each contain the xor of half of the bytes.
+So in effect par = rp4 ^ rp5. But as xor is commutative we can also say
+that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can
+eliminate rp5 (or rp4, but I already foresaw another optimisation).
+The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15.
+
+
+Attempt 5
+=========
+
+Effectively so all odd digit rp assignments in the loop were removed.
+This included the else clause of the if statements.
+Of course after the loop we need to correct things by adding code like:
+    rp5 = par ^ rp4;
+Also the initial assignments (rp5 = 0; etc) could be removed.
+Along the line I also removed the initialisation of rp0/1/2/3.
+
+
+Analysis 5
+==========
+
+Measurements showed this was a good move. The run-time roughly halved
+compared with attempt 4 with 4 times unrolled, and we only require 1/3rd
+of the processor time compared to the current code in the linux kernel.
+
+However, still I thought there was more. I didn't like all the if
+statements. Why not keep a running parity and only keep the last if
+statement. Time for yet another version!
+
+
+Attempt 6
+=========
+
+THe code within the for loop was changed to:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= cur;
+
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+
+As you can see tmppar is used to accumulate the parity within a for
+iteration. In the last 3 statements is is added to par and, if needed,
+to rp12 and rp14.
+
+While making the changes I also found that I could exploit that tmppar
+contains the running parity for this iteration. So instead of having:
+rp4 ^= cur; rp6 = cur;
+I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next
+statement. A similar change was done for rp8 and rp10
+
+
+Analysis 6
+==========
+
+Measuring this code again showed big gain. When executing the original
+linux code 1 million times, this took about 1 second on my system.
+(using time to measure the performance). After this iteration I was back
+to 0.075 sec. Actually I had to decide to start measuring over 10
+million interations in order not to loose too much accuracy. This one
+definitely seemed to be the jackpot!
+
+There is a little bit more room for improvement though. There are three
+places with statements:
+rp4 ^= cur; rp6 ^= cur;
+It seems more efficient to also maintain a variable rp4_6 in the while
+loop; This eliminates 3 statements per loop. Of course after the loop we
+need to correct by adding:
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6
+Furthermore there are 4 sequential assingments to rp8. This can be
+encoded slightly more efficient by saving tmppar before those 4 lines
+and later do rp8 = rp8 ^ tmppar ^ notrp8;
+(where notrp8 is the value of rp8 before those 4 lines).
+Again a use of the commutative property of xor.
+Time for a new test!
+
+
+Attempt 7
+=========
+
+The new code now looks like:
+
+    for (i = 0; i < 4; i++)
+    {
+        cur = *bp++; tmppar  = cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= tmppar;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp8 ^= tmppar;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp10 ^= tmppar;
+
+	    notrp8 = tmppar;
+	    cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+	    cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+	    rp8 = rp8 ^ tmppar ^ notrp8;
+
+        cur = *bp++; tmppar ^= cur; rp4_6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp6 ^= cur;
+        cur = *bp++; tmppar ^= cur; rp4 ^= cur;
+        cur = *bp++; tmppar ^= cur;
+
+	    par ^= tmppar;
+        if ((i & 0x1) == 0) rp12 ^= tmppar;
+        if ((i & 0x2) == 0) rp14 ^= tmppar;
+    }
+    rp4 ^= rp4_6;
+    rp6 ^= rp4_6;
+
+
+Not a big change, but every penny counts :-)
+
+
+Analysis 7
+==========
+
+Acutally this made things worse. Not very much, but I don't want to move
+into the wrong direction. Maybe something to investigate later. Could
+have to do with caching again.
+
+Guess that is what there is to win within the loop. Maybe unrolling one
+more time will help. I'll keep the optimisations from 7 for now.
+
+
+Attempt 8
+=========
+
+Unrolled the loop one more time.
+
+
+Analysis 8
+==========
+
+This makes things worse. Let's stick with attempt 6 and continue from there.
+Although it seems that the code within the loop cannot be optimised
+further there is still room to optimize the generation of the ecc codes.
+We can simply calcualate the total parity. If this is 0 then rp4 = rp5
+etc. If the parity is 1, then rp4 = !rp5;
+But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits
+in the result byte and then do something like
+    code[0] |= (code[0] << 1);
+Lets test this.
+
+
+Attempt 9
+=========
+
+Changed the code but again this slightly degrades performance. Tried all
+kind of other things, like having dedicated parity arrays to avoid the
+shift after parity[rp7] << 7; No gain.
+Change the lookup using the parity array by using shift operators (e.g.
+replace parity[rp7] << 7 with:
+rp7 ^= (rp7 << 4);
+rp7 ^= (rp7 << 2);
+rp7 ^= (rp7 << 1);
+rp7 &= 0x80;
+No gain.
+
+The only marginal change was inverting the parity bits, so we can remove
+the last three invert statements.
+
+Ah well, pity this does not deliver more. Then again 10 million
+iterations using the linux driver code takes between 13 and 13.5
+seconds, whereas my code now takes about 0.73 seconds for those 10
+million iterations. So basically I've improved the performance by a
+factor 18 on my system. Not that bad. Of course on different hardware
+you will get different results. No warranties!
+
+But of course there is no such thing as a free lunch. The codesize almost
+tripled (from 562 bytes to 1434 bytes). Then again, it is not that much.
+
+
+Correcting errors
+=================
+
+For correcting errors I again used the ST application note as a starter,
+but I also peeked at the existing code.
+The algorithm itself is pretty straightforward. Just xor the given and
+the calculated ecc. If all bytes are 0 there is no problem. If 11 bits
+are 1 we have one correctable bit error. If there is 1 bit 1, we have an
+error in the given ecc code.
+It proved to be fastest to do some table lookups. Performance gain
+introduced by this is about a factor 2 on my system when a repair had to
+be done, and 1% or so if no repair had to be done.
+Code size increased from 330 bytes to 686 bytes for this function.
+(gcc 4.2, -O3)
+
+
+Conclusion
+==========
+
+The gain when calculating the ecc is tremendous. Om my development hardware
+a speedup of a factor of 18 for ecc calculation was achieved. On a test on an
+embedded system with a MIPS core a factor 7 was obtained.
+On  a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor
+5 (big endian mode, gcc 4.1.2, -O3)
+For correction not much gain could be obtained (as bitflips are rare). Then
+again there are also much less cycles spent there.
+
+It seems there is not much more gain possible in this, at least when
+programmed in C. Of course it might be possible to squeeze something more
+out of it with an assembler program, but due to pipeline behaviour etc
+this is very tricky (at least for intel hw).
+
+Author: Frans Meulenbroeks
+Copyright (C) 2008 Koninklijke Philips Electronics NV.
diff --git a/drivers/mtd/nand/nand_ecc.c b/drivers/mtd/nand/nand_ecc.c
index 918a806..7129da5 100644
--- a/drivers/mtd/nand/nand_ecc.c
+++ b/drivers/mtd/nand/nand_ecc.c
@@ -1,13 +1,18 @@
 /*
- * This file contains an ECC algorithm from Toshiba that detects and
- * corrects 1 bit errors in a 256 byte block of data.
+ * This file contains an ECC algorithm that detects and corrects 1 bit
+ * errors in a 256 byte block of data.
  *
  * drivers/mtd/nand/nand_ecc.c
  *
- * Copyright (C) 2000-2004 Steven J. Hill (sjhill@realitydiluted.com)
- *                         Toshiba America Electronics Components, Inc.
+ * Copyright (C) 2008 Koninklijke Philips Electronics NV.
+ *                    Author: Frans Meulenbroeks
  *
- * Copyright (C) 2006 Thomas Gleixner <tglx@linutronix.de>
+ * Completely replaces the previous ECC implementation which was written by:
+ *   Steven J. Hill (sjhill@realitydiluted.com)
+ *   Thomas Gleixner (tglx@linutronix.de)
+ *
+ * Information on how this algorithm works and how it was developed
+ * can be found in Documentation/nand/ecc.txt
  *
  * This file is free software; you can redistribute it and/or modify it
  * under the terms of the GNU General Public License as published by the
@@ -23,174 +28,417 @@
  * with this file; if not, write to the Free Software Foundation, Inc.,
  * 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
  *
- * As a special exception, if other files instantiate templates or use
- * macros or inline functions from these files, or you compile these
- * files and link them with other works to produce a work based on these
- * files, these files do not by themselves cause the resulting work to be
- * covered by the GNU General Public License. However the source code for
- * these files must still be made available in accordance with section (3)
- * of the GNU General Public License.
- *
- * This exception does not invalidate any other reasons why a work based on
- * this file might be covered by the GNU General Public License.
  */

+/*
+ * The STANDALONE macro is useful when running the code outside the kernel
+ * e.g. when running the code in a testbed or a benchmark program.
+ * When STANDALONE is used, the module related macros are commented out
+ * as well as the linux include files.
+ * Instead a private definition of mtd_into is given to satisfy the compiler
+ * (the code does not use mtd_info, so the code does not care)
+ */
+#ifndef STANDALONE
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mtd/nand_ecc.h>
+#else
+typedef uint32_t unsigned long
+struct mtd_info {
+	int dummy;
+};
+#define EXPORT_SYMBOL(x)  /* x */
+
+#define MODULE_LICENSE(x)	/* x */
+#define MODULE_AUTHOR(x)	/* x */
+#define MODULE_DESCRIPTION(x)	/* x */
+#endif
+
+/*
+ * invparity is a 256 byte table that contains the odd parity
+ * for each byte. So if the number of bits in a byte is even,
+ * the array element is 1, and when the number of bits is odd
+ * the array eleemnt is 0.
+ */
+static const char invparity[256] = {
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
+	1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1
+};

 /*
- * Pre-calculated 256-way 1 byte column parity
+ * bitsperbyte contains the number of bits per byte
+ * this is only used for testing and repairing parity
+ * (a precalculated value slightly improves performance)
  */
-static const u_char nand_ecc_precalc_table[] = {
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x6a, 0x3f, 0x3c, 0x69, 0x33, 0x66, 0x65, 0x30, 0x30, 0x65, 0x66, 0x33, 0x69, 0x3c, 0x3f, 0x6a,
-	0x0f, 0x5a, 0x59, 0x0c, 0x56, 0x03, 0x00, 0x55, 0x55, 0x00, 0x03, 0x56, 0x0c, 0x59, 0x5a, 0x0f,
-	0x0c, 0x59, 0x5a, 0x0f, 0x55, 0x00, 0x03, 0x56, 0x56, 0x03, 0x00, 0x55, 0x0f, 0x5a, 0x59, 0x0c,
-	0x69, 0x3c, 0x3f, 0x6a, 0x30, 0x65, 0x66, 0x33, 0x33, 0x66, 0x65, 0x30, 0x6a, 0x3f, 0x3c, 0x69,
-	0x03, 0x56, 0x55, 0x00, 0x5a, 0x0f, 0x0c, 0x59, 0x59, 0x0c, 0x0f, 0x5a, 0x00, 0x55, 0x56, 0x03,
-	0x66, 0x33, 0x30, 0x65, 0x3f, 0x6a, 0x69, 0x3c, 0x3c, 0x69, 0x6a, 0x3f, 0x65, 0x30, 0x33, 0x66,
-	0x65, 0x30, 0x33, 0x66, 0x3c, 0x69, 0x6a, 0x3f, 0x3f, 0x6a, 0x69, 0x3c, 0x66, 0x33, 0x30, 0x65,
-	0x00, 0x55, 0x56, 0x03, 0x59, 0x0c, 0x0f, 0x5a, 0x5a, 0x0f, 0x0c, 0x59, 0x03, 0x56, 0x55, 0x00
+static const char bitsperbyte[256] = {
+	0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+	4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8,
+};
+
+/*
+ * addressbits is a lookup table to filter out the bits from the xor-ed
+ * ecc data that identify the faulty location.
+ * this is only used for repairing parity
+ * see the comments in nand_correct_data for more details
+ */
+static const char addressbits[256] = {
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x00, 0x00, 0x01, 0x01, 0x00, 0x00, 0x01, 0x01,
+	0x02, 0x02, 0x03, 0x03, 0x02, 0x02, 0x03, 0x03,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x04, 0x04, 0x05, 0x05, 0x04, 0x04, 0x05, 0x05,
+	0x06, 0x06, 0x07, 0x07, 0x06, 0x06, 0x07, 0x07,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x08, 0x08, 0x09, 0x09, 0x08, 0x08, 0x09, 0x09,
+	0x0a, 0x0a, 0x0b, 0x0b, 0x0a, 0x0a, 0x0b, 0x0b,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f,
+	0x0c, 0x0c, 0x0d, 0x0d, 0x0c, 0x0c, 0x0d, 0x0d,
+	0x0e, 0x0e, 0x0f, 0x0f, 0x0e, 0x0e, 0x0f, 0x0f
 };

 /**
  * nand_calculate_ecc - [NAND Interface] Calculate 3-byte ECC for 256-byte block
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
  * @dat:	raw data
  * @ecc_code:	buffer for ECC
  */
-int nand_calculate_ecc(struct mtd_info *mtd, const u_char *dat,
-		       u_char *ecc_code)
+int nand_calculate_ecc(struct mtd_info *mtd, const unsigned char *buf,
+		       unsigned char *code)
 {
-	uint8_t idx, reg1, reg2, reg3, tmp1, tmp2;
 	int i;
+	const uint32_t *bp = (uint32_t *)buf;
+	uint32_t cur;		/* current value in buffer */
+	/* rp0..rp15 are the various accumulated parities (per byte) */
+	uint32_t rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7;
+	uint32_t rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15;
+	uint32_t par;		/* the cumulative parity for all data */
+	uint32_t tmppar;	/* the cumulative parity for this iteration;
+				   for rp12 and rp14 at the end of the loop */
+
+	par = 0;
+	rp4 = 0;
+	rp6 = 0;
+	rp8 = 0;
+	rp10 = 0;
+	rp12 = 0;
+	rp14 = 0;
+
+	/*
+	 * The loop is unrolled a number of times;
+	 * This avoids if statements to decide on which rp value to update
+	 * Also we process the data by longwords.
+	 * Note: passing unaligned data might give a performance penalty.
+	 * It is assumed that the buffers are aligned.
+	 * tmppar is the cumulative sum of this iteration.
+	 * needed for calculating rp12, rp14 and par
+	 * also used as a performance improvement for rp6, rp8 and rp10
+	 */
+	for (i = 0; i < 4; i++) {
+		cur = *bp++;
+		tmppar = cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= tmppar;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= tmppar;

-	/* Initialize variables */
-	reg1 = reg2 = reg3 = 0;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp10 ^= tmppar;

-	/* Build up column parity */
-	for(i = 0; i < 256; i++) {
-		/* Get CP0 - CP5 from table */
-		idx = nand_ecc_precalc_table[*dat++];
-		reg1 ^= (idx & 0x3f);
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp8 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp8 ^= cur;

-		/* All bit XOR = 1 ? */
-		if (idx & 0x40) {
-			reg3 ^= (uint8_t) i;
-			reg2 ^= ~((uint8_t) i);
-		}
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp6 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+		rp4 ^= cur;
+		cur = *bp++;
+		tmppar ^= cur;
+
+		par ^= tmppar;
+		if ((i & 0x1) == 0)
+			rp12 ^= tmppar;
+		if ((i & 0x2) == 0)
+			rp14 ^= tmppar;
 	}

-	/* Create non-inverted ECC code from line parity */
-	tmp1  = (reg3 & 0x80) >> 0; /* B7 -> B7 */
-	tmp1 |= (reg2 & 0x80) >> 1; /* B7 -> B6 */
-	tmp1 |= (reg3 & 0x40) >> 1; /* B6 -> B5 */
-	tmp1 |= (reg2 & 0x40) >> 2; /* B6 -> B4 */
-	tmp1 |= (reg3 & 0x20) >> 2; /* B5 -> B3 */
-	tmp1 |= (reg2 & 0x20) >> 3; /* B5 -> B2 */
-	tmp1 |= (reg3 & 0x10) >> 3; /* B4 -> B1 */
-	tmp1 |= (reg2 & 0x10) >> 4; /* B4 -> B0 */
-
-	tmp2  = (reg3 & 0x08) << 4; /* B3 -> B7 */
-	tmp2 |= (reg2 & 0x08) << 3; /* B3 -> B6 */
-	tmp2 |= (reg3 & 0x04) << 3; /* B2 -> B5 */
-	tmp2 |= (reg2 & 0x04) << 2; /* B2 -> B4 */
-	tmp2 |= (reg3 & 0x02) << 2; /* B1 -> B3 */
-	tmp2 |= (reg2 & 0x02) << 1; /* B1 -> B2 */
-	tmp2 |= (reg3 & 0x01) << 1; /* B0 -> B1 */
-	tmp2 |= (reg2 & 0x01) << 0; /* B7 -> B0 */
-
-	/* Calculate final ECC code */
+	/*
+	 * handle the fact that we use longword operations
+	 * we'll bring rp4..rp14 back to single byte entities by shifting and
+	 * xoring first fold the upper and lower 16 bits,
+	 * then the upper and lower 8 bits.
+	 */
+	rp4 ^= (rp4 >> 16);
+	rp4 ^= (rp4 >> 8);
+	rp4 &= 0xff;
+	rp6 ^= (rp6 >> 16);
+	rp6 ^= (rp6 >> 8);
+	rp6 &= 0xff;
+	rp8 ^= (rp8 >> 16);
+	rp8 ^= (rp8 >> 8);
+	rp8 &= 0xff;
+	rp10 ^= (rp10 >> 16);
+	rp10 ^= (rp10 >> 8);
+	rp10 &= 0xff;
+	rp12 ^= (rp12 >> 16);
+	rp12 ^= (rp12 >> 8);
+	rp12 &= 0xff;
+	rp14 ^= (rp14 >> 16);
+	rp14 ^= (rp14 >> 8);
+	rp14 &= 0xff;
+
+	/*
+	 * we also need to calculate the row parity for rp0..rp3
+	 * This is present in par, because par is now
+	 * rp3 rp3 rp2 rp2
+	 * as well as
+	 * rp1 rp0 rp1 rp0
+	 * First calculate rp2 and rp3
+	 * (and yes: rp2 = (par ^ rp3) & 0xff; but doing that did not
+	 * give a performance improvement)
+	 */
+	rp3 = (par >> 16);
+	rp3 ^= (rp3 >> 8);
+	rp3 &= 0xff;
+	rp2 = par & 0xffff;
+	rp2 ^= (rp2 >> 8);
+	rp2 &= 0xff;
+
+	/* reduce par to 16 bits then calculate rp1 and rp0 */
+	par ^= (par >> 16);
+	rp1 = (par >> 8) & 0xff;
+	rp0 = (par & 0xff);
+
+	/* finally reduce par to 8 bits */
+	par ^= (par >> 8);
+	par &= 0xff;
+
+	/*
+	 * and calculate rp5..rp15
+	 * note that par = rp4 ^ rp5 and due to the commutative property
+	 * of the ^ operator we can say:
+	 * rp5 = (par ^ rp4);
+	 * The & 0xff seems superfluous, but benchmarking learned that
+	 * leaving it out gives slightly worse results. No idea why, probably
+	 * it has to do with the way the pipeline in pentium is organized.
+	 */
+	rp5 = (par ^ rp4) & 0xff;
+	rp7 = (par ^ rp6) & 0xff;
+	rp9 = (par ^ rp8) & 0xff;
+	rp11 = (par ^ rp10) & 0xff;
+	rp13 = (par ^ rp12) & 0xff;
+	rp15 = (par ^ rp14) & 0xff;
+
+	/*
+	 * Finally calculate the ecc bits.
+	 * Again here it might seem that there are performance optimisations
+	 * possible, but benchmarks showed that on the system this is developed
+	 * the code below is the fastest
+	 */
 #ifdef CONFIG_MTD_NAND_ECC_SMC
-	ecc_code[0] = ~tmp2;
-	ecc_code[1] = ~tmp1;
+	code[0] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[1] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
 #else
-	ecc_code[0] = ~tmp1;
-	ecc_code[1] = ~tmp2;
+	code[1] =
+	    (invparity[rp7] << 7) |
+	    (invparity[rp6] << 6) |
+	    (invparity[rp5] << 5) |
+	    (invparity[rp4] << 4) |
+	    (invparity[rp3] << 3) |
+	    (invparity[rp2] << 2) |
+	    (invparity[rp1] << 1) |
+	    (invparity[rp0]);
+	code[0] =
+	    (invparity[rp15] << 7) |
+	    (invparity[rp14] << 6) |
+	    (invparity[rp13] << 5) |
+	    (invparity[rp12] << 4) |
+	    (invparity[rp11] << 3) |
+	    (invparity[rp10] << 2) |
+	    (invparity[rp9] << 1)  |
+	    (invparity[rp8]);
 #endif
-	ecc_code[2] = ((~reg1) << 2) | 0x03;
-
+	code[2] =
+	    (invparity[par & 0xf0] << 7) |
+	    (invparity[par & 0x0f] << 6) |
+	    (invparity[par & 0xcc] << 5) |
+	    (invparity[par & 0x33] << 4) |
+	    (invparity[par & 0xaa] << 3) |
+	    (invparity[par & 0x55] << 2) |
+	    3;
 	return 0;
 }
 EXPORT_SYMBOL(nand_calculate_ecc);

-static inline int countbits(uint32_t byte)
-{
-	int res = 0;
-
-	for (;byte; byte >>= 1)
-		res += byte & 0x01;
-	return res;
-}
-
 /**
  * nand_correct_data - [NAND Interface] Detect and correct bit error(s)
- * @mtd:	MTD block structure
+ * @mtd:	MTD block structure (unused)
  * @dat:	raw data read from the chip
  * @read_ecc:	ECC from the chip
  * @calc_ecc:	the ECC calculated from raw data
  *
  * Detect and correct a 1 bit error for 256 byte block
  */
-int nand_correct_data(struct mtd_info *mtd, u_char *dat,
-		      u_char *read_ecc, u_char *calc_ecc)
+int nand_correct_data(struct mtd_info *mtd, unsigned char *buf,
+		      unsigned char *read_ecc, unsigned char *calc_ecc)
 {
-	uint8_t s0, s1, s2;
+	int nr_bits;
+	unsigned char b0, b1, b2;
+	unsigned char byte_addr, bit_addr;

+	/*
+	 * b0 to b2 indicate which bit is faulty (if any)
+	 * we might need the xor result  more than once,
+	 * so keep them in a local var
+	*/
 #ifdef CONFIG_MTD_NAND_ECC_SMC
-	s0 = calc_ecc[0] ^ read_ecc[0];
-	s1 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[0] ^ calc_ecc[0];
+	b1 = read_ecc[1] ^ calc_ecc[1];
 #else
-	s1 = calc_ecc[0] ^ read_ecc[0];
-	s0 = calc_ecc[1] ^ read_ecc[1];
-	s2 = calc_ecc[2] ^ read_ecc[2];
+	b0 = read_ecc[1] ^ calc_ecc[1];
+	b1 = read_ecc[0] ^ calc_ecc[0];
 #endif
-	if ((s0 | s1 | s2) == 0)
-		return 0;
-
-	/* Check for a single bit error */
-	if( ((s0 ^ (s0 >> 1)) & 0x55) == 0x55 &&
-	    ((s1 ^ (s1 >> 1)) & 0x55) == 0x55 &&
-	    ((s2 ^ (s2 >> 1)) & 0x54) == 0x54) {
-
-		uint32_t byteoffs, bitnum;
+	b2 = read_ecc[2] ^ calc_ecc[2];

-		byteoffs = (s1 << 0) & 0x80;
-		byteoffs |= (s1 << 1) & 0x40;
-		byteoffs |= (s1 << 2) & 0x20;
-		byteoffs |= (s1 << 3) & 0x10;
+	/* check if there are any bitfaults */

-		byteoffs |= (s0 >> 4) & 0x08;
-		byteoffs |= (s0 >> 3) & 0x04;
-		byteoffs |= (s0 >> 2) & 0x02;
-		byteoffs |= (s0 >> 1) & 0x01;
+	/* count nr of bits; use table lookup, faster than calculating it */
+	nr_bits = bitsperbyte[b0] + bitsperbyte[b1] + bitsperbyte[b2];

-		bitnum = (s2 >> 5) & 0x04;
-		bitnum |= (s2 >> 4) & 0x02;
-		bitnum |= (s2 >> 3) & 0x01;
-
-		dat[byteoffs] ^= (1 << bitnum);
-
-		return 1;
+	/* repeated if statements are slightly more efficient than switch ... */
+	/* ordered in order of likelihood */
+	if (nr_bits == 0)
+		return (0);	/* no error */
+	if (nr_bits == 11) {	/* correctable error */
+		/*
+		 * rp15/13/11/9/7/5/3/1 indicate which byte is the faulty byte
+		 * cp 5/3/1 indicate the faulty bit.
+		 * A lookup table (called addressbits) is used to filter
+		 * the bits from the byte they are in.
+		 * A marginal optimisation is possible by having three
+		 * different lookup tables.
+		 * One as we have now (for b0), one for b2
+		 * (that would avoid the >> 1), and one for b1 (with all values
+		 * << 4). However it was felt that introducing two more tables
+		 * hardly justify the gain.
+		 *
+		 * The b2 shift is there to get rid of the lowest two bits.
+		 * We could also do addressbits[b2] >> 1 but for the
+		 * performace it does not make any difference
+		 */
+		byte_addr = (addressbits[b1] << 4) + addressbits[b0];
+		bit_addr = addressbits[b2 >> 2];
+		/* flip the bit */
+		buf[byte_addr] ^= (1 << bit_addr);
+		return (1);
 	}
-
-	if(countbits(s0 | ((uint32_t)s1 << 8) | ((uint32_t)s2 <<16)) == 1)
-		return 1;
-
-	return -EBADMSG;
+	if (nr_bits == 1)
+		return (1);	/* error in ecc data; no action needed */
+	return -1;
 }
 EXPORT_SYMBOL(nand_correct_data);

 MODULE_LICENSE("GPL");
-MODULE_AUTHOR("Steven J. Hill <sjhill@realitydiluted.com>");
+MODULE_AUTHOR("Frans Meulenbroeks <fransmeulenbroeks@gmail.com>");
 MODULE_DESCRIPTION("Generic NAND ECC support");

[-- Attachment #2: Type: APPLICATION/x-gtar, Size: 23492 bytes --]

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 21:14                         ` frans
@ 2008-08-16 10:04                           ` David Woodhouse
  2008-08-17 21:09                           ` Troy Kisky
  2008-08-17 23:30                           ` Troy Kisky
  2 siblings, 0 replies; 29+ messages in thread
From: David Woodhouse @ 2008-08-16 10:04 UTC (permalink / raw)
  To: frans; +Cc: linux-mtd, Troy Kisky

On Fri, 2008-08-15 at 23:14 +0200, frans wrote:
> Anyway, here is the patch again.
> I hope it is ok this time,

Applied; thanks.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 21:14                         ` frans
  2008-08-16 10:04                           ` David Woodhouse
@ 2008-08-17 21:09                           ` Troy Kisky
  2008-08-18  6:33                             ` Frans Meulenbroeks
  2008-08-17 23:30                           ` Troy Kisky
  2 siblings, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-17 21:09 UTC (permalink / raw)
  To: frans; +Cc: linux-mtd, David Woodhouse

frans wrote:
> 
> On Fri, 15 Aug 2008, Troy Kisky wrote:
>> You might also test it on a big endian system to be safe.
>>
>> Troy
>>
> Good remark!
> I should have mentioned this more explicitly. As mentioned before I did test
> on Linksys NSLU2. This is an ARM and standard Linksys software (as well as
> the std unslung one from nslu2-linux.org) is big endian (and yes, that
> is the part I did not mention).
> 

So, you have tried reading a file system that was created before your patch
was applied on a big endian system? Sorry to be a pest, but I didn't see any
endian-ness test in your patch. But I will admit that my eyes aren't very good.
I'm especially bad at spotting the jelly in the fridge. :^)

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-15 21:14                         ` frans
  2008-08-16 10:04                           ` David Woodhouse
  2008-08-17 21:09                           ` Troy Kisky
@ 2008-08-17 23:30                           ` Troy Kisky
  2008-08-18  6:40                             ` Frans Meulenbroeks
  2 siblings, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-17 23:30 UTC (permalink / raw)
  To: frans; +Cc: linux-mtd, David Woodhouse

> +	if (nr_bits == 11) {	/* correctable error */

This is a necessary, but NOT sufficient condition to
determine that it is a 1 bit error.

Big MAYBE below

This is probably why you passed the big endian test.
It was seeing every read as a single bit error
and silently correcting. Adding a debug print
when correcting or failing to correct would be
useful.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-17 21:09                           ` Troy Kisky
@ 2008-08-18  6:33                             ` Frans Meulenbroeks
  2008-08-18 17:20                               ` Troy Kisky
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-18  6:33 UTC (permalink / raw)
  To: Troy Kisky; +Cc: linux-mtd, David Woodhouse

2008/8/17, Troy Kisky <troy.kisky@boundarydevices.com>:
> frans wrote:
>  >
>  > On Fri, 15 Aug 2008, Troy Kisky wrote:
>  >> You might also test it on a big endian system to be safe.
>  >>
>  >> Troy
>  >>
>  > Good remark!
>  > I should have mentioned this more explicitly. As mentioned before I did test
>  > on Linksys NSLU2. This is an ARM and standard Linksys software (as well as
>  > the std unslung one from nslu2-linux.org) is big endian (and yes, that
>  > is the part I did not mention).
>  >
>
>
> So, you have tried reading a file system that was created before your patch
>  was applied on a big endian system? Sorry to be a pest, but I didn't see any
>  endian-ness test in your patch. But I will admit that my eyes aren't very good.
>  I'm especially bad at spotting the jelly in the fridge. :^)
>
Yes, the NSLU2 had a filesystem that was created before the patch was applied.
But actually I think the filesystem is irrelevant.
I verified the proper operation by comparing the values generated by
the original code with the values generated by my code over a set of
input blocks.
Guess there is no endianness dependency and that if the data is big
endian the ecc is too.

Frans.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-17 23:30                           ` Troy Kisky
@ 2008-08-18  6:40                             ` Frans Meulenbroeks
  2008-08-18 17:08                               ` Troy Kisky
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-18  6:40 UTC (permalink / raw)
  To: Troy Kisky; +Cc: linux-mtd, David Woodhouse

2008/8/18, Troy Kisky <troy.kisky@boundarydevices.com>:
> > +     if (nr_bits == 11) {    /* correctable error */
>
>
> This is a necessary, but NOT sufficient condition to
>  determine that it is a 1 bit error.

The test for 11 bits is in accordance to the ST datasheet I used
(http://www.st.com/stonline/books/pdf/docs/10123.pdf, see section
3.4).
What other check do you feel should be needed.
>
>
>  Big MAYBE below
>
>  This is probably why you passed the big endian test.
>  It was seeing every read as a single bit error
>  and silently correcting. Adding a debug print
>  when correcting or failing to correct would be
>  useful.

As long as they are debug print, that could indeed be done. No problem
with that.
We could even have a printk to indicate a failure on the console/in the log.
Having a printk when a correction is done is probably not that
desirable (newer generations of nand seem more and more often to
require corrections, which could mean too many msges in the log).

Frans.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18  6:40                             ` Frans Meulenbroeks
@ 2008-08-18 17:08                               ` Troy Kisky
  0 siblings, 0 replies; 29+ messages in thread
From: Troy Kisky @ 2008-08-18 17:08 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd, David Woodhouse

Frans Meulenbroeks wrote:
> 2008/8/18, Troy Kisky <troy.kisky@boundarydevices.com>:
>>> +     if (nr_bits == 11) {    /* correctable error */
>>
>> This is a necessary, but NOT sufficient condition to
>>  determine that it is a 1 bit error.
> 
> The test for 11 bits is in accordance to the ST datasheet I used
> (http://www.st.com/stonline/books/pdf/docs/10123.pdf, see section
> 3.4).
> What other check do you feel should be needed.

The original check. I'll try to send you a program to demonstrate
the difference after I get off work.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18  6:33                             ` Frans Meulenbroeks
@ 2008-08-18 17:20                               ` Troy Kisky
  2008-08-18 21:09                                 ` Frans Meulenbroeks
  0 siblings, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-18 17:20 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd, David Woodhouse

Frans Meulenbroeks wrote:
> Yes, the NSLU2 had a filesystem that was created before the patch was applied.
> But actually I think the filesystem is irrelevant.
> I verified the proper operation by comparing the values generated by
> the original code with the values generated by my code over a set of
> input blocks.
> Guess there is no endianness dependency and that if the data is big
> endian the ecc is too.
> 
> Frans.
> 

Does that make logical sense to you? The correction routine
accesses the data as a byte and flips a bit. If it accessed it as
an uint32 and flipped the bit, then I can see that there would be
no endianness dependency. I'm not suggesting you do that, as it would be
incompatible with current ecc, just explaining my logic. I'd very
much appreciate an explanation of why I'm wrong. I would expect
big endian ecc to have 4 bits differences whenever the entire
block parity is odd. These would be the bits that select the byte
within the uint32.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 17:20                               ` Troy Kisky
@ 2008-08-18 21:09                                 ` Frans Meulenbroeks
  2008-08-18 21:29                                   ` Troy Kisky
  0 siblings, 1 reply; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-18 21:09 UTC (permalink / raw)
  To: Troy Kisky; +Cc: linux-mtd, David Woodhouse

2008/8/18 Troy Kisky <troy.kisky@boundarydevices.com>:
> Frans Meulenbroeks wrote:
>> Yes, the NSLU2 had a filesystem that was created before the patch was applied.
>> But actually I think the filesystem is irrelevant.
>> I verified the proper operation by comparing the values generated by
>> the original code with the values generated by my code over a set of
>> input blocks.
>> Guess there is no endianness dependency and that if the data is big
>> endian the ecc is too.
>
> Does that make logical sense to you? The correction routine
> accesses the data as a byte and flips a bit. If it accessed it as
> an uint32 and flipped the bit, then I can see that there would be
> no endianness dependency. I'm not suggesting you do that, as it would be
> incompatible with current ecc, just explaining my logic. I'd very
> much appreciate an explanation of why I'm wrong. I would expect
> big endian ecc to have 4 bits differences whenever the entire
> block parity is odd. These would be the bits that select the byte
> within the uint32.

Troy, did a further investigation.
Your explanation is correct. My test program had a flaw causing this
case to be undetected.
Indeed in case of odd parity the 4 bits selecting the byte are flipped
on big endian systems.
(little endian is ok).

Still looking at what the best way to fix it. In the code you posted
before you used __cpu_to_le64s.
Not sure why you are using the 64 variant. As it is an uint32_t, I
would expect __cpu_to_le32s to suffice.

Then again I am not too eager to use that function as it generates
some overhead. I'd rather use the builtin gcc macro __BIG_ENDIAN__ (in
that case I can just use an #ifdef to distinguish the two cases and in
case of BE no byte swapping is needed.
What is your opinion on this?

Frans.

PS: wrt the 11 bits check for the other message. Can't really envision
why this fails, but maybe it is just too late.
If you have an ecc and a faulty 256 byte data block that would be
erroneously accepted by my code and that would be rightfully rejected
by the original code, I'll be more than happy to change it.
Performancewise the difference is very small and it is a rare
situation anyway. The original test is definitely more rigid than just
the nr of bits test.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 21:09                                 ` Frans Meulenbroeks
@ 2008-08-18 21:29                                   ` Troy Kisky
  2008-08-18 21:31                                     ` David Woodhouse
  2008-08-18 22:10                                     ` Troy Kisky
  0 siblings, 2 replies; 29+ messages in thread
From: Troy Kisky @ 2008-08-18 21:29 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd, David Woodhouse

Frans Meulenbroeks wrote:
> 2008/8/18 Troy Kisky <troy.kisky@boundarydevices.com>:
>> Frans Meulenbroeks wrote:
>>> Yes, the NSLU2 had a filesystem that was created before the patch was applied.
>>> But actually I think the filesystem is irrelevant.
>>> I verified the proper operation by comparing the values generated by
>>> the original code with the values generated by my code over a set of
>>> input blocks.
>>> Guess there is no endianness dependency and that if the data is big
>>> endian the ecc is too.
>> Does that make logical sense to you? The correction routine
>> accesses the data as a byte and flips a bit. If it accessed it as
>> an uint32 and flipped the bit, then I can see that there would be
>> no endianness dependency. I'm not suggesting you do that, as it would be
>> incompatible with current ecc, just explaining my logic. I'd very
>> much appreciate an explanation of why I'm wrong. I would expect
>> big endian ecc to have 4 bits differences whenever the entire
>> block parity is odd. These would be the bits that select the byte
>> within the uint32.
> 
> Troy, did a further investigation.
> Your explanation is correct. My test program had a flaw causing this
> case to be undetected.
> Indeed in case of odd parity the 4 bits selecting the byte are flipped
> on big endian systems.
> (little endian is ok).

I appreciate you digging into it, as I don't have a big endian system.

> 
> Still looking at what the best way to fix it. In the code you posted
> before you used __cpu_to_le64s.
> Not sure why you are using the 64 variant. As it is an uint32_t, I
> would expect __cpu_to_le32s to suffice.
My bug.
> 
> Then again I am not too eager to use that function as it generates
> some overhead. I'd rather use the builtin gcc macro __BIG_ENDIAN__ (in
> that case I can just use an #ifdef to distinguish the two cases and in
> case of BE no byte swapping is needed.
> What is your opinion on this?

I agree.

> 
> Frans.
> 
> PS: wrt the 11 bits check for the other message. Can't really envision
> why this fails, but maybe it is just too late.
> If you have an ecc and a faulty 256 byte data block that would be
> erroneously accepted by my code and that would be rightfully rejected
> by the original code, I'll be more than happy to change it.
> Performancewise the difference is very small and it is a rare
> situation anyway. The original test is definitely more rigid than just
> the nr of bits test.
> 

(ignoring inversions)
Example: You have a block of all zeros.

The ecc stored in the spare bytes of this is also 0.
Now, upon reading this block of zeroes, a two bit ecc occurs. The bits that happen to be
read incorrectly are bit # 0 & bit # 0x3f of the block
The hardware calculated ecc will be
0:0 ^ 0:fff = 0:fff after bit 0
0:fff ^ 3f:fc00 = 3f:3f after bit 3f

Now, when your algorithm counts bits you get 12, and decide
it is a single bit ecc error.

The old way however will xor the high and low 12 bits 3f ^ 3f = 0, 0 != fff and
decide it is multi bit ecc error and give an error.

Note, that both approaches would have decide it was a single bit error, if the second
error wouldn't have happened.


So, try a block of zeroes and flip bits 0 and 0x3f.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 21:29                                   ` Troy Kisky
@ 2008-08-18 21:31                                     ` David Woodhouse
  2008-08-18 22:14                                       ` Troy Kisky
  2008-08-18 22:10                                     ` Troy Kisky
  1 sibling, 1 reply; 29+ messages in thread
From: David Woodhouse @ 2008-08-18 21:31 UTC (permalink / raw)
  To: Troy Kisky; +Cc: Frans Meulenbroeks, linux-mtd

On Mon, 2008-08-18 at 14:29 -0700, Troy Kisky wrote:
> I appreciate you digging into it, as I don't have a big endian system.

Either of you can mail me a SSH public key if you want access to a
decent big-endian desktop system to play with.

-- 
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@intel.com                              Intel Corporation

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 21:29                                   ` Troy Kisky
  2008-08-18 21:31                                     ` David Woodhouse
@ 2008-08-18 22:10                                     ` Troy Kisky
  2008-08-19  6:00                                       ` Frans Meulenbroeks
  1 sibling, 1 reply; 29+ messages in thread
From: Troy Kisky @ 2008-08-18 22:10 UTC (permalink / raw)
  To: Frans Meulenbroeks; +Cc: linux-mtd, David Woodhouse

Troy Kisky wrote:
> Frans Meulenbroeks wrote:
> (ignoring inversions)
> Example: You have a block of all zeros.
> 
> The ecc stored in the spare bytes of this is also 0.
> Now, upon reading this block of zeroes, a two bit ecc occurs. The bits that happen to be
> read incorrectly are bit # 0 & bit # 0x3f of the block
> The hardware calculated ecc will be
> 0:0 ^ 0:fff = 0:fff after bit 0
> 0:fff ^ 3f:fc00 = 3f:3f after bit 3f
> 
> Now, when your algorithm counts bits you get 12, and decide
> it is a single bit ecc error.
> 
> The old way however will xor the high and low 12 bits 3f ^ 3f = 0, 0 != fff and
> decide it is multi bit ecc error and give an error.
> 
> Note, that both approaches would have decide it was a single bit error, if the second
> error wouldn't have happened.
> 
> 
> So, try a block of zeroes and flip bits 0 and 0x3f.
> 
> Troy
> 

Whoops, that's a 512 bytes ecc example (as that's what I'm used to).

The 256 byte ecc may be harder. How about
bit 0, 0x3f, and the 1st bit of the ecc

 0:0 ^ 0:7ff = 0:7ff after bit 0
 0:7ff ^ 3f:7c00 = 3f:3f after bit 3f
 3f:3f ^ 0:1 = 3f:3e



This is 11 bits but 3f^3e = 1, 1 !=7ff and the old algorithm will refuse to correct.
So the new behavior is different.

Any extra detection is worth it to me.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 21:31                                     ` David Woodhouse
@ 2008-08-18 22:14                                       ` Troy Kisky
  0 siblings, 0 replies; 29+ messages in thread
From: Troy Kisky @ 2008-08-18 22:14 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Frans Meulenbroeks, linux-mtd

David Woodhouse wrote:
> On Mon, 2008-08-18 at 14:29 -0700, Troy Kisky wrote:
>> I appreciate you digging into it, as I don't have a big endian system.
> 
> Either of you can mail me a SSH public key if you want access to a
> decent big-endian desktop system to play with.
> 
Thanks,

I'll remember that.

Troy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance
  2008-08-18 22:10                                     ` Troy Kisky
@ 2008-08-19  6:00                                       ` Frans Meulenbroeks
  0 siblings, 0 replies; 29+ messages in thread
From: Frans Meulenbroeks @ 2008-08-19  6:00 UTC (permalink / raw)
  To: Troy Kisky; +Cc: linux-mtd, David Woodhouse

@David:

I have no need for access to a big-endian system as I have the NSLU2

@Troy:

I'll look into this tonight (I'm about to leave to work now).
I already saw that the 11 bits count would go wrong if either of the
two lower bits were to flip by accident, so as a minimum these should
be masked off before counting.
But I will dig further into this & will follow up on this.

Best regards, Frans.



2008/8/19 Troy Kisky <troy.kisky@boundarydevices.com>:
> Troy Kisky wrote:
>> Frans Meulenbroeks wrote:
>> (ignoring inversions)
>> Example: You have a block of all zeros.
>>
>> The ecc stored in the spare bytes of this is also 0.
>> Now, upon reading this block of zeroes, a two bit ecc occurs. The bits that happen to be
>> read incorrectly are bit # 0 & bit # 0x3f of the block
>> The hardware calculated ecc will be
>> 0:0 ^ 0:fff = 0:fff after bit 0
>> 0:fff ^ 3f:fc00 = 3f:3f after bit 3f
>>
>> Now, when your algorithm counts bits you get 12, and decide
>> it is a single bit ecc error.
>>
>> The old way however will xor the high and low 12 bits 3f ^ 3f = 0, 0 != fff and
>> decide it is multi bit ecc error and give an error.
>>
>> Note, that both approaches would have decide it was a single bit error, if the second
>> error wouldn't have happened.
>>
>>
>> So, try a block of zeroes and flip bits 0 and 0x3f.
>>
>> Troy
>>
>
> Whoops, that's a 512 bytes ecc example (as that's what I'm used to).
>
> The 256 byte ecc may be harder. How about
> bit 0, 0x3f, and the 1st bit of the ecc
>
>  0:0 ^ 0:7ff = 0:7ff after bit 0
>  0:7ff ^ 3f:7c00 = 3f:3f after bit 3f
>  3f:3f ^ 0:1 = 3f:3e
>
>
>
> This is 11 bits but 3f^3e = 1, 1 !=7ff and the old algorithm will refuse to correct.
> So the new behavior is different.
>
> Any extra detection is worth it to me.
>
> Troy
>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2008-08-19  6:00 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-07-31  8:35 [RESUBMIT] [PATCH] [MTD] NAND nand_ecc.c: rewrite for improved performance frans
2008-08-11 11:35 ` Frans Meulenbroeks
2008-08-11 16:30 ` David Woodhouse
     [not found]   ` <ac9c93b10808120153m7435424ci3e49a70d3599cc06@mail.gmail.com>
     [not found]     ` <1218535872.2977.133.camel@pmac.infradead.org>
2008-08-14 18:07       ` frans
2008-08-14 19:10         ` Troy Kisky
2008-08-15  8:41           ` Frans Meulenbroeks
2008-08-15  8:46             ` David Woodhouse
2008-08-15  9:23               ` Frans Meulenbroeks
2008-08-15  9:41                 ` David Woodhouse
2008-08-15 10:04                   ` Frans Meulenbroeks
2008-08-15 10:12                     ` David Woodhouse
2008-08-15 18:56                       ` Troy Kisky
2008-08-15 21:14                         ` frans
2008-08-16 10:04                           ` David Woodhouse
2008-08-17 21:09                           ` Troy Kisky
2008-08-18  6:33                             ` Frans Meulenbroeks
2008-08-18 17:20                               ` Troy Kisky
2008-08-18 21:09                                 ` Frans Meulenbroeks
2008-08-18 21:29                                   ` Troy Kisky
2008-08-18 21:31                                     ` David Woodhouse
2008-08-18 22:14                                       ` Troy Kisky
2008-08-18 22:10                                     ` Troy Kisky
2008-08-19  6:00                                       ` Frans Meulenbroeks
2008-08-17 23:30                           ` Troy Kisky
2008-08-18  6:40                             ` Frans Meulenbroeks
2008-08-18 17:08                               ` Troy Kisky
  -- strict thread matches above, loose matches on Subject: below --
2008-07-29 17:58 Frans Meulenbroeks
2008-07-29 20:04 ` Ricard Wanderlof
2008-07-30  6:17 ` Artem Bityutskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox