Message-ID: <456AF5A2.1070708@vnet.ibm.com>
Date: Mon, 27 Nov 2006 15:26:42 +0100
From: Timo Lindhorst
To: linux-mtd@lists.infradead.org
Cc: tglx@linutronix.de
Subject: wrong ECC byte order with ndfc causes fatal consequences
List-Id: Linux MTD discussion mailing list

Hey,

I've been testing the error detection and correction with the ndfc
driver and observed some fatal consequences of using the wrong ECC
byte order.

MTD detects one-bit errors and claims to correct them. But instead of
correcting the bit, it introduces a second bit error in the same
256-byte ECC region. More precisely, it toggles the bit at the correct
bit position, but at a byte offset whose two address nibbles are
swapped: if the error occurs at 0x????e4, the bit is toggled at offset
0x????4e.

I do not understand this ECC magic. I assumed that the ECC would fail
completely if different byte orders are used in calculation and
correction.
But this effect is worse: data is silently damaged bit by bit if you
blindly trust MTD.

Using CONFIG_MTD_NAND_ECC_SMC (byte order according to Smart Media
Specification) solved this problem (see also previous posting:
[PATCH] [MTD] NAND: fix ifdef option in nand_ecc.c).

What would be a sensible way to connect this option to NDFC? Something
like

#if defined(CONFIG_MTD_NAND_ECC_SMC) || defined(CONFIG_MTD_NAND_NDFC)

in nand_ecc.c? Or is there a way to connect these options in the kernel
configuration?

I attached a shell session showing the behavior. Note that the prompt
switches between the card and my notebook as host.

Kind regards,
Timo


### generate data and dump it with error code ####
/var/tmp/debug $ dd if=/dev/urandom of=data.img bs=2048 count=1
1+0 records in
1+0 records out
2048 bytes (2.0 kB) copied, 0.002122 seconds, 965 kB/s
/var/tmp/debug $ flash_erase /dev/mtd5 0 1
Erase Total 1 Units
Performing Flash Erase of length 131072 at offset 0x0 done
/var/tmp/debug $ nandwrite /dev/mtd5 data.img
Writing data to block 0
/var/tmp/debug $ nanddump -f data.dump -l 2048 /dev/mtd5
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...
/var/tmp/debug $ nanddump -n -f data.noecc.dump -l 2048 /dev/mtd5
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...

### the data is what it should be. Toggle one bit ###
lindhors@lapt /tftpboot/192.168.1.44/var/tmp/debug $ md5sum data*dump
ea7a75606e9f3615d70b0ce1391d18b0  data.dump
ea7a75606e9f3615d70b0ce1391d18b0  data.noecc.dump
lindhors@lapt /tftpboot/192.168.1.44/var/tmp/debug $ khexedit data.dump
lindhors@lapt /tftpboot/192.168.1.44/var/tmp/debug $ hexdiff data.dump \
                                                     data.err.dump
4c4
< 00000030 - 79 24 1f 2f 6a a2 6b 2b bc 46 28 eb 48 16 6e 4b
---
> 00000030 - 79 04 1f 2f 6a a2 6b 2b bc 46 28 eb 48 16 6e 4b

### write to flash and read back ###
/var/tmp/debug $ flash_erase /dev/mtd5 0 1
Erase Total 1 Units
Performing Flash Erase of length 131072 at offset 0x0 done
/var/tmp/debug $ nandwrite -o -n /dev/mtd5 data.err.dump
Writing data to block 0
/var/tmp/debug $ nanddump -f data.back.dump -l 2048 /dev/mtd5
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...
ECC: 1 corrected bitflip(s) at offset 0x00000000

### Here MTD claims everything is fine. ###
/var/tmp/debug $ nanddump -n -f data.back.noecc.dump -l 2048 /dev/mtd5
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00000800...

### Compare the dumps: data.back.dump should be the same as     ###
### data.dump since the error should be corrected, but actually ###
### there is another error.                                     ###
lindhors@lapt /tftpboot/192.168.1.44/var/tmp/debug $ md5sum data*dump
883eb72100032362a6d3f23f86c13f73  data.back.noecc.dump
883eb72100032362a6d3f23f86c13f73  data.err.dump
ea7a75606e9f3615d70b0ce1391d18b0  data.dump
ea7a75606e9f3615d70b0ce1391d18b0  data.noecc.dump
f994a582e5c7b78a3a7a10096772c353  data.back.dump
lindhors@lapt /tftpboot/192.168.1.44/var/tmp/debug $ hexdiff data.dump \
                                                     data.back.dump
2c2
< 00000010 - a7 07 2b a1 91 20 ad 1b 71 93 90 b2 a6 79 57 8e
---
> 00000010 - a7 07 2b 81 91 20 ad 1b 71 93 90 b2 a6 79 57 8e
4c4
< 00000030 - 79 24 1f 2f 6a a2 6b 2b bc 46 28 eb 48 16 6e 4b
---
> 00000030 - 79 04 1f 2f 6a a2 6b 2b bc 46 28 eb 48 16 6e 4b
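
### a possible explanation, as a small model ###

The session shows the injected error at offset 0x31 (0x24 -> 0x04) and
the bogus "correction" at offset 0x13 (0xa1 -> 0x81), both on bit 5. A
rough Python model may help to see why a byte-order mismatch relocates
the fix instead of making the ECC fail. This is only a sketch under
assumptions, not the code in nand_ecc.c: the bit layout inside each ECC
byte and the names calc_ecc, correct and swapped are invented for
illustration. It assumes, as the observed 0x31 -> 0x13 behavior
suggests, that one ECC byte encodes the low four bits of the byte offset
and another the high four bits, so exchanging them exchanges the offset
nibbles while the single-bit check still passes.

```python
import random


def calc_ecc(data):
    """Return (lo, hi, col) parity bytes for a 256-byte block.

    For address bit k, parity bit 2k+1 covers the data bits whose
    address bit k is 1, and parity bit 2k those where it is 0.
    lo covers byte-offset bits 0..3, hi covers byte-offset bits 4..7,
    col covers the bit position inside the byte (bits 0..2).
    """
    assert len(data) == 256
    lo = hi = col = 0
    for offs, byte in enumerate(data):
        for bit in range(8):
            if not (byte >> bit) & 1:
                continue
            for k in range(4):
                lo ^= 1 << (2 * k + ((offs >> k) & 1))
                hi ^= 1 << (2 * k + ((offs >> (k + 4)) & 1))
            for k in range(3):
                col ^= 1 << (2 * k + ((bit >> k) & 1))
    return lo, hi, col


def _odd_bits(value, count):
    """Pack bits 1, 3, 5, ... of value into a small integer."""
    return sum(((value >> (2 * k + 1)) & 1) << k for k in range(count))


def correct(data, stored, computed, swapped=False):
    """Try to fix one bit in data (a bytearray) in place.

    Returns 0 (no error), 1 (one bit toggled) or -1 (uncorrectable).
    swapped=True models a decoder that reads lo and hi in the
    opposite order from the one used when the ECC was generated.
    """
    s_lo, s_hi, s_col = (s ^ c for s, c in zip(stored, computed))
    if swapped:
        s_lo, s_hi = s_hi, s_lo
    if not (s_lo | s_hi | s_col):
        return 0
    # A single-bit error sets exactly one bit in every parity pair.
    # This test is symmetric in s_lo and s_hi, so a swap goes unnoticed.
    single = (((s_lo ^ (s_lo >> 1)) & 0x55) == 0x55
              and ((s_hi ^ (s_hi >> 1)) & 0x55) == 0x55
              and ((s_col ^ (s_col >> 1)) & 0x15) == 0x15)
    if not single:
        return -1
    offs = _odd_bits(s_lo, 4) | (_odd_bits(s_hi, 4) << 4)
    data[offs] ^= 1 << _odd_bits(s_col, 3)
    return 1


# Replay the session: flip bit 5 at offset 0x31, then "correct" it.
good = bytes(random.Random(0).randrange(256) for _ in range(256))
stored = calc_ecc(good)
for swapped in (False, True):
    page = bytearray(good)
    page[0x31] ^= 0x20
    result = correct(page, stored, calc_ecc(page), swapped)
    diff = [hex(i) for i in range(256) if page[i] != good[i]]
    print("swapped:", swapped, "result:", result, "diff:", diff)
```

With matching byte order the model restores the block. With the two
bytes swapped it still reports one corrected bit, but the block then
differs from the original at 0x13 and 0x31, which is the same pattern
as in the hexdiff of data.dump and data.back.dump.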