Subject: Re: UBIFS Corruption
From: Artem Bityutskiy
To: Reginald Perrin
Cc: MTD Mailing List
Date: Mon, 18 Jul 2011 14:33:22 +0300
Reply-To: dedekind1@gmail.com
In-Reply-To: <1310130765.68852.YahooMailRC@web114617.mail.gq1.yahoo.com>
Message-ID: <1310988807.20738.44.camel@sauron>
List-Id: Linux MTD discussion mailing list

Hi,

On Fri, 2011-07-08 at 06:12 -0700, Reginald Perrin wrote:
> Hi folks,
>
> We're using ubifs in an embedded uclinux system (based on ADI Blackfin's BF524).
> Been working great for us for a while. Kernel is 2.6.34.7 (uclinux).
>
> However, we just saw 2 corruptions within the past 48h that we can't explain.
> We've been doing the same basic operation (in terms of flashing/reading/writing
> images) for quite some time, and have reflashed our units many times (over
> thousands of different hardware units).

OK, have any more happened in the meantime?

> Device #1 failure:
> * Device was running out of a partition mounted to /home (a 93MB partition from
> a 128MB NAND device)
> * Our app was running normally and locked up (not sure why). Our code may have
> been updating a sqlite database located in that partition
> * When we power cycled, the partition had the corruption issue noted.

Do you still have these devices? You need to enable UBIFS recovery
debugging messages and send them to me; maybe then I can help.

>
> Device #1 boot log:
> UBI device number 1, total 750 LEBs (96768000 bytes, 92.3 MiB), available 0 LEBs
> (0 bytes), LEB size 129024 bytes (126.0 KiB)
> [ 5.228000] UBIFS: recovery needed
> [ 5.320000] UBIFS error (pid 363): ubifs_scanned_corruption: corruption at LEB
> 172:45056
> [ 5.348000] UBIFS error (pid 363): ubifs_recover_leb: LEB 172 scanning failed
> mount: mounting ubi1:home on /home failed: Structure needs cleaning

You need to enable at least UBIFS debugging, and preferably the recovery
messages as well, then try to mount the UBI volume again and send the
UBIFS output. Make sure you send all of the messages, not only those
you see on your console; see here for more details:

http://www.linux-mtd.infradead.org/doc/ubifs.html#L_how_send_bugreport

>
> Device #2 failure:
> * Device was running normally.
> * We upgraded our application (which involved updating executables on that
> partition)
> * After the successful upgrade, we powered the unit down and stored
> * Days later, powered up the device and the above invalid CRC as noted
>
> Device #2 boot log:
> UBI device number 1, total 750 LEBs (96768000 bytes, 92.3 MiB), available 0 LEBs
> (0 bytes), LEB size 129024 bytes (126.0 KiB)
> [ 5.488000] UBIFS: recovery needed
> [ 5.492000] UBIFS error (pid 365): check_lpt_crc: invalid crc in LPT node: crc
> a0 calc 9013
> mount: mounting ubi1:home on /home failed: Invalid argument
>
>
> So, what is concerning is the sheer randomness of these failures. In neither
> case were we doing anything new (vs. standard operations we have been performing
> for over a year on many devices per day). Additionally, there's no additional
> logging available, because this *never* happens. We have never needed (after we
> got UBIFS working) to have the debug output enabled in the driver. To make
> matters worse, if you ask me to reproduce this, I don't know any way of doing
> it. We have automated tests that run continually, and they never see these
> issues.

Well, if UBIFS is unable to mount, then you should not re-flash the
device. That way you can at least re-compile the kernel with debugging
enabled and try to figure out what makes UBIFS reject the flash.

> One corruption could be written off as a fluke, but 2 happening within 48h is
> very unusual.
>
> Can anybody give me any insight into this?

Yeah, 2 is enough to start worrying. Note, we have probably also fixed
some recovery failures since 2.6.34; you might check the UBIFS back-port
trees.

-- 
Best Regards,
Artem Bityutskiy