Subject: Re: UBIFS failed to recover master node
From: Artem Bityutskiy
To: re
Cc: linux-mtd@lists.infradead.org, twebb
Date: Tue, 13 Jul 2010 11:48:43 +0300

Hi

On Mon, 2010-06-28 at 10:21 +0200, re wrote:
> On 25.05.2010 07:06, Artem Bityutskiy wrote:
> > On Mon, 2010-05-24 at 11:22 -0400, twebb wrote:
> >> I've had several cases where our MLC NAND flash appears corrupted in
> >> such a way that one of three UBIFS volumes can not be mounted due to
> >> "failed to recover master node". I haven't been able to reproduce the
> >> problem, but we've had at least 5 incidents where this has occurred.
> >> (A partial capture from one of the failures is below.)
> >>
> >> I'm starting to investigate this problem and don't know if this is a
> >> UBIFS/UBI problem or a NAND driver problem. I'm starting the process
> >> of back-porting the latest UBIFS code to our 2.6.29 kernel - hoping
> >> that new UBIFS code will solve the problem. However, this may also be
> >> a driver problem and I wonder if I also need to update that driver
> >> (pxa3xx_nand). Any suggestions for debugging this problem?
> >>
> >> Thanks,
> >> twebb
> >>
> >>
> >> capture:
> >> [root@ESIedge mtd-utils]# mount -t ubifs ubi0_0 /mnt/
> >> [ 239.605869] UBI error: ubi_io_read: error -74 while reading 516096
> >> bytes from PEB 4:8192, read 516096 bytes
> >> [ 239.616317] UBIFS error (pid 676): ubifs_scan: corrupt empty space
> >> at LEB 2:268135
> >> [ 239.623996] UBIFS error (pid 676): ubifs_scanned_corruption:
> >> corruption at LEB 2:268135
> >> [ 239.642101] UBIFS error (pid 676): ubifs_scan: LEB 2 scanning failed
> >> [ 239.976396] UBI error: ubi_io_read: error -74 while reading 516096
> >> bytes from PEB 4:8192, read 516096 bytes
> >> [ 239.986742] UBIFS error (pid 676): ubifs_recover_master_node:
> >> failed to recover master node
> >> mount: mounting ubi0_0 on /mnt/ failed: Invalid argument
> > And BTW, it is a good idea not to erase/re-flash this device if you want
> > to fix this problem.
> >
> Our power-off tests cause this sporadic error too (ubifs_recover_master_node: failed
> to recover master node).
> We use kernel 2.6.29 with the git-patch (from 3/2010) for a 47MB NOR flash partition.
>
> I tried to find the reason for the error by debugging.
> The recovery of the master node reads master_node1 and master_node2.
> The master_node1 was empty.
> The error was detected in:
> int ubifs_recover_master_node(struct ubifs_info *c)
> ....
>     if (mst1) {
> ......
>     } else {
>         if (!mst2)
>             goto out_err;
>         /* 1st LEB was unmapped and about to be written, so there must
>          * be no room left in 2nd LEB. */
>         offs2 = (void *)mst2 - buf2;
>         if (offs2 + sz + sz <= c->leb_size)
>             goto out_err; !!!!!!!!!!!!!!!!!!!
>         mst = mst2;
>     }
> I checked the values of the comparison "if (115712 + 512 + 512 (=116736) <= 130944)".
> I skipped this error for test purposes. The master_node was recovered. I saw no problems
> with the FS. I was not able to follow this check.

But how could this situation happen? UBIFS updates the master nodes by
writing them one after another, as long as there is space to write.
And when there is no space, it unmaps the 1st LEB, writes the master
node, then unmaps the 2nd LEB, and writes the master node. How could we
end up with a situation where the 1st LEB is empty while the 2nd has
room for more master nodes?

This sounds like the problem is somewhere else, maybe in UBI? Do you
have any explanation?

I mean, the only code path which changes the master nodes in UBIFS is
'ubifs_write_master()'. If this function is the only one which changes
them, your situation cannot happen.

Did you try to enable recovery debugging messages? Did you look at what
is in your LEB2 after 'offs2'? Are there 0xFFs? I think if you enable
recovery debugging, UBIFS will print a hexdump. Or you can just inject
some 'dbg_dump_node()' or 'print_hex_dump()' calls.

I mean, if you just remove that check, you may hide the real problem.

> I was able to provoke this error manual.

Well, yes, you break UBIFS assumptions about which kinds of errors it
fixes. As I answered in another e-mail today to twebb, UBIFS fixes only
problems caused by power cuts. If it sees a problem which cannot happen
because of a power cut, it panics.

So, as I explained above, your issue should not happen due to power
cuts. But it happens, which means there is probably a bug somewhere
else. Reproducing the problem and dumping the flash contents at the end
of LEB2 would be interesting.

Here you can find some notes about debugging UBIFS:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_how_send_bugreport

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
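
For readers following the thread, the check discussed above can be tried in isolation. The following is only a sketch: it is not UBIFS code, the helper names 'mst_leb_is_full()' and 'can_recover_from_second_leb()' are hypothetical, and the only thing taken from the message is the quoted comparison 'offs2 + sz + sz <= c->leb_size' and the reported values (offs2 = 115712, sz = 512, leb_size = 130944).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a UBIFS function: returns true when a master
 * LEB whose last valid master node starts at 'offs' has no room for one
 * more node of 'sz' bytes, i.e. the LEB counts as full.
 */
static bool mst_leb_is_full(int offs, int sz, int leb_size)
{
	assert(offs >= 0 && sz > 0 && leb_size > 0);
	/* Last node occupies [offs, offs + sz); the next would end at offs + 2 * sz. */
	return offs + sz + sz > leb_size;
}

/*
 * Mirrors the quoted 'else' branch: the 1st master LEB is empty, so
 * recovery from the 2nd LEB is accepted only if that LEB is full.
 */
static bool can_recover_from_second_leb(bool have_mst2, int offs2, int sz,
					int leb_size)
{
	if (!have_mst2)
		return false;
	return mst_leb_is_full(offs2, sz, leb_size);
}
```

With the reported values, 115712 + 512 + 512 = 116736, which is below 130944, so the 2nd LEB is not full and this sketch refuses recovery, matching the 'goto out_err' in the quoted code. With a last node at offset 130048 the next node would end at 131072, past the end of the LEB, and recovery would be accepted.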