From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from down.free-electrons.com ([37.187.137.238] helo=mail.free-electrons.com)
	by bombadil.infradead.org with esmtp (Exim 4.80.1 #2 (Red Hat Linux))
	id 1agZLK-0004O9-Fh
	for linux-mtd@lists.infradead.org; Thu, 17 Mar 2016 14:56:07 +0000
Date: Thu, 17 Mar 2016 15:55:44 +0100
From: Boris Brezillon
To: Martin Townsend
Cc: Ricard Wanderlof, Richard Weinberger, "linux-mtd@lists.infradead.org"
Subject: Re: UBIFS question
Message-ID: <20160317155544.3b43bbb9@bbrezillon>
References: <56EA7148.7050008@nod.at> <56EA943C.4000505@nod.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
List-Id: Linux MTD discussion mailing list

Hi Martin,

On Thu, 17 Mar 2016 12:54:43 +0000
Martin Townsend wrote:

> Hi Ricard, Richard
>
> On Thu, Mar 17, 2016 at 11:43 AM, Ricard Wanderlof
> wrote:
> >
> >> > We expect the flash devices to start failing quicker than normally
> >> > expected due to the environment in which they will be operating in, so
> >> > sudden NAND blocks turning bad will eventually happen and what we
> >> > would like to do is try and capture this as soon as possible.
> >> > The boards are not accessible as they will be located in very remote
> >> > locations so detecting these failures before the system locks up would
> >> > be an advantage so we can report home with the information and fail
> >> > over to the other filesystem (providing that hasn't also been
> >> > corrupted).
> >>
> >> Dealing with sudden bad NAND blocks is almost impossible.
> >> Unless you have a copy of each block.
> >> NAND is not expected to gain bad blocks without an indication like
> >> correctable bitflips.
>
> I'm not interested in dealing with sudden bad NAND blocks, I accept
> this will more than likely happen at some point but what I am
> interested in is early detection.
> Once the system has booted most
> files will be cached to memory and the product that the flash devices
> are in is designed to run for many months without being power cycled
> so what I'm looking to do is monitor the health of the flash devices.
> Ideally I would like to know FEC counts but I doubt I will get this
> information :) But checking LEBs, pages etc for bad checksums would be
> great.
>
> >
> > Yes, although the NAND flash documentation sometimes reads like blocks can
> > suddenly 'go bad' for no special reason, in practice it is due to
> > excessive erase/write cycles, i.e. its a wear problem.
> >
> > However, I don't know, if you are operating the flash in an environment
> > where there is cosmic radiation that can actually damage the chip for
> > instance, then of course any part of the chip could fail randomly with a
> > fairly high probability. But NAND bad block management is not designed to
> > take care of that case, which is why bad block detection is only done
> > during block erasure (i.e. when a block fails to erase).
> >
> I'm not sure how much I can say I'm afraid as I'm under NDA but assume
> that it is going to be operating in an environment where it's
> receiving more cosmic radiation than expected. So I could look at the
> bad block detection code to get some ideas? I don't necessary want to
> mark blocks as bad I just want to detect them so I have an idea that
> the flash is failing.

I guess you're more worried about bitflips than blocks becoming bad
(which, AFAIK, can only happen when writing or erasing a block, not
when reading it).
If bitflip detection/prevention is what you're looking for, I guess
ubihealthd (developed by Richard) could help.

[1] https://lwn.net/Articles/663751/
[2] https://lkml.org/lkml/2015/3/29/31

-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com
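
On the "FEC counts" question quoted above, here is a rough sketch (not
something taken from this thread) of how the per-device ECC counters
could be polled from userspace. It assumes a kernel recent enough to
expose the corrected_bits, ecc_failures, bad_blocks and bbt_blocks
attributes under /sys/class/mtd/mtdX, as described in the kernel's
sysfs-class-mtd ABI document. The function names and the "mtd0" default
are made up for illustration. Keep in mind that these counters only
move when the flash is actually read, so on a system serving
everything from the page cache something still has to trigger reads.

```python
import os

# Attribute names as listed in the kernel's sysfs-class-mtd ABI document.
# Older kernels may not provide them; missing files are reported as None.
ATTRS = ("corrected_bits", "ecc_failures", "bad_blocks", "bbt_blocks")


def read_mtd_ecc_stats(mtd="mtd0", sysfs_root="/sys/class/mtd"):
    """Return a dict of ECC-related counters for one MTD device.

    'mtd' and 'sysfs_root' are illustrative defaults, adjust to the board.
    """
    stats = {}
    for attr in ATTRS:
        path = os.path.join(sysfs_root, mtd, attr)
        try:
            with open(path) as f:
                stats[attr] = int(f.read().strip())
        except (OSError, ValueError):
            # Attribute absent or unreadable on this kernel/device.
            stats[attr] = None
    return stats


def stats_delta(old, new):
    """Per-counter increase between two snapshots (None if unknown)."""
    delta = {}
    for attr in ATTRS:
        a, b = old.get(attr), new.get(attr)
        delta[attr] = None if a is None or b is None else b - a
    return delta


if __name__ == "__main__":
    print(read_mtd_ecc_stats())
```

A monitoring job could take a snapshot periodically and report home
when the corrected_bits delta starts growing or ecc_failures becomes
non-zero. Which thresholds make sense depends on the chip and its ECC
strength.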