From mboxrd@z Thu Jan 1 00:00:00 1970
From: pieterg@gmx.com (pieterg)
Date: Mon, 27 Sep 2010 13:38:20 +0200
Subject: pxa3xx_nand issues
In-Reply-To:
References: <201009221912.24905.pieterg@gmx.com> <201009231729.48147.pieterg@gmx.com>
Message-ID: <201009271338.21084.pieterg@gmx.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Saturday 25 September 2010 04:50:04 Haojian Zhuang wrote:
> On Thu, Sep 23, 2010 at 11:29 PM, pieterg wrote:
> > On Thursday 23 September 2010 13:32:26 pieterg wrote:
> >> On Thursday 23 September 2010 08:05:56 Eric Miao wrote:
> >> > On Thu, Sep 23, 2010 at 1:12 AM, pieterg wrote:
> >> > > In my search for the cause of the huge number of single/double bit
> >> > > errors I'm experiencing on colibri pxa320/310 devices, I've come
> >> > > across this commit
> >
> > http://git.kernel.org/?p=linux/kernel/git/ycmiao/pxa-linux-2.6.git;a=commit;h=7f9938d0fd6c778bd0ce296a3e3b50266de2b892
> >
> >> > > According to the commitlog, it attempts to work around an issue
> >> > > regarding non-page-aligned reads.
> >> > > The workaround seems to force page-aligned access, by dropping the
> >> > > offset within the page (column address bytes).
> >> > > However, in my setup (with a jffs2 filesystem on nand),
> >> > > non-page-aligned reads never occur, but non-page-aligned writes
> >> > > occur very frequently. (during the jffs2 gc).
> >> > > These are also affected by this commit, while the commitlog does
> >> > > not state whether or not the same issue would occur for the
> >> > > program command, and in that case, whether or not the same
> >> > > workaround would apply.
> >> > >
> >> > > I've tried to revert the commit, but unfortunately this doesn't
> >> > > reduce the huge number of single/double bit errors (and jffs2 crc
> >> > > errors as a result) I'm getting.
> >> > >
> >> > > But having these non-aligned writes during GC, would that indicate
> >> > > a problem with my jffs2 image parameters perhaps?
> >> > > (though I cannot imagine this could actually cause double bit
> >> > > errors)
> >> >
> >> > It might not be related to the commit above. The NAND controller
> >> > will always read the whole page and ignoring the column address,
> >> > that patch tries to make less confusion. The offset is actually
> >> > handled completely by software (memorized).
> >>
> >> I can see how the read offset works, but I do not quite see how this
> >> would work for writes (which call the same prepare_read_prog_cmd, and
> >> have their column address stripped as well).
> >> Found out that this happens when writing oob data by the way; these
> >> are writes with offset 2048 within the page. Jffs2 does this when
> >> writing cleanmarkers.
> >
> > Tested this, and found out that this commit is actually quite essential
> > for writes as well.
> > Without it, the OOB data doesn't get written.
> > So we can close this part of the topic, commit 7f9938d0 is perfectly
> > fine.
> >
> >> I could identify about 10 eraseblocks with pages which produce
> >> single/double bit errors.
> >> After I marked them bad (manually), I've seen no more bit errors, and
> >> the jffs2 rootfs has remained perfectly healthy.
> >
> > Turned out to be a short-term solution.
> > After a while I got more double-bit errors, and ended up bad-marking a
> > dozen or so other eraseblocks, and it does not seem to stop.
> >
> > Strangest thing is that when I write a new jffs2 image with uboot (nand
> > erase, nand write) or with the kernel (flash_eraseall, nandwrite), it
> > never contains any biterrors when I mount it.
> > Only after the filesystem has been mounted, gets modified, and then
> > after the first reboot, the biterrors are there.
>
> Could you make sure whether these "wrong" block are truely bad block?
> Maybe you can erase/write them continuously multi-times in XDB.

Unfortunately I don't have XDB.
However, I can erase/write/read them with u-boot and with the kernel
(flash_eraseall / nandwrite) several times, without ever getting an
NDSR_CS0_BBD status.
I do get many NDSR_DBERR and NDSR_SBERR interrupts, though. But because
these occur during a read, the kernel never takes any action, and the
blocks are not marked bad.

(And I find it hard to believe that such a huge number of blocks on a
brand new chip would actually be bad.)

Rgds,
Pieter
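
For illustration only, the behaviour described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the pxa3xx_nand driver code: the bit positions and the helper name classify_status are hypothetical and should be checked against the NDSR definitions in the driver. It only captures the point made in this message, namely that ECC errors seen during a read are reported but never lead to a block being marked bad, while a bad-block status would only be acted on after an erase or program.

```python
# Assumed bit positions, for illustration only; verify against the
# NDSR_* definitions in the pxa3xx_nand driver before relying on them.
NDSR_CS0_BBD = 1 << 6  # bad block detected on chip select 0 (assumed)
NDSR_DBERR = 1 << 4    # double-bit, uncorrectable ECC error (assumed)
NDSR_SBERR = 1 << 3    # single-bit, correctable ECC error (assumed)


def classify_status(ndsr, operation):
    """Hypothetical helper: map a status word to the action taken.

    operation is one of "read", "program" or "erase".
    """
    if operation in ("program", "erase"):
        # Only a failed erase/program can flag a block as bad.
        if ndsr & NDSR_CS0_BBD:
            return "mark-bad"
        return "ok"
    if operation == "read":
        # ECC errors on read are reported, but the block is never
        # marked bad as a result.
        if ndsr & NDSR_DBERR:
            return "uncorrectable"
        if ndsr & NDSR_SBERR:
            return "corrected"
        return "ok"
    raise ValueError("unknown operation: %r" % (operation,))


if __name__ == "__main__":
    for status, op in [(NDSR_SBERR, "read"),
                       (NDSR_DBERR, "read"),
                       (0, "erase")]:
        print(op, hex(status), classify_status(status, op))
```

In this model no read status ever returns "mark-bad", which matches the observation that blocks producing NDSR_DBERR on read stay in use until they are marked bad by hand.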