UBIFS master node corruption

linux-mtd.lists.infradead.org archive mirror

* UBIFS master node corruption
@ 2012-05-31 13:52 Romain Izard
  2012-06-01  7:47 ` Adrian Hunter
  0 siblings, 1 reply; 5+ messages in thread
From: Romain Izard @ 2012-05-31 13:52 UTC
  To: linux-mtd

Sirs,

While using a system based on UBI and UBIFS, I am encountering a rare
but recurring corruption of the master node of the UBIFS partitions.

This is happening on a device using an MLC flash with 8 KiB write
pages, 2 MiB erase blocks, and an embedded hardware controller providing
24-bit/KiB BCH error correction. The flash is split into multiple MTD
partitions, and UBI/UBIFS is only used on some of them. Because the
system reuses a legacy bootloader, other MTD partitions are used as raw
MTD areas, or as UBI devices containing static cramfs volumes.

The system is derived from the BSP provided by my IC vendor, based on
linux-2.6.32 with Android patches, to which various bug fixes and
additional features were added, as well as the UBI and UBIFS bug fixes
from the ubifs-v2.6.32 repository.

The most common corruption I observe is that LEBs 1 and 2, which
contain the master nodes, are no longer synchronized: one of the LEBs
contains many additional versions of the master node, as if the other
LEB had been restored from an earlier state. I can see this by
analyzing the contents of each LEB from the beginning: for each node
written at the start of the erase block, the only differences are the
sequence number and the CRC. Thus the shorter LEB does not look
corrupted, only cut short. Unfortunately, because the issue is
difficult to reproduce, I have no trace of the events that led to it.
I only notice it because the kernel refuses to mount the file system.

Have you ever encountered this kind of issue before?
Do you have any idea what could be triggering this problem?

If you could provide any help on this issue, I'd be glad to accept it.

Regards,
-- 
Romain Izard


* Re: UBIFS master node corruption
  2012-05-31 13:52 UBIFS master node corruption Romain Izard
@ 2012-06-01  7:47 ` Adrian Hunter
  2012-06-01  8:08   ` Artem Bityutskiy
  2012-06-01  9:04   ` Romain Izard
  0 siblings, 2 replies; 5+ messages in thread
From: Adrian Hunter @ 2012-06-01  7:47 UTC
  To: Romain Izard; +Cc: linux-mtd

On 31/05/12 16:52, Romain Izard wrote:
> Have you ever encountered this kind of issue before?

You need to make sure you have this patch from the 
linux-2.6.32.y branch of linux-stable:

From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date: Thu, 21 Apr 2011 14:49:55 +0300
Subject: [PATCH] UBIFS: fix master node recovery

commit 6e0d9fd38b750d678bf9fd07db23582f52fafa55 upstream.

This patch fixes the following symptoms:
1. Unmount UBIFS cleanly.
2. Start mounting UBIFS R/W and have a power cut immediately.
3. Start mounting UBIFS R/O, this succeeds
4. Try to re-mount UBIFS R/W - this fails immediately or later on,
   because UBIFS will write the master node to the flash area
   which has been written before.

The analysis of the problem:

1. UBIFS is unmounted cleanly, both copies of the master node are clean.
2. UBIFS is being mounted R/W, starts changing master node copy 1, and
   a power cut happens. The copy N1 becomes corrupted.
3. UBIFS is being mounted R/O. It notices the copy N1 is corrupted and
   reads copy N2. Copy N2 is clean.
4. Because of R/O mode, UBIFS cannot recover copy 1.
5. The mount code (ubifs_mount()) sees that the master node is clean,
   so it decides that no recovery is needed.
6. We are re-mounting R/W. UBIFS believes no recovery is needed and
   starts updating the master node, but copy N1 is still corrupted
   and was not recovered!

Fix this problem by marking the master node as dirty every time we
recover it and we are in R/O mode. This forces further recovery and
UBIFS cleans up the corruption and recovers copy N1 when
re-mounting R/W later.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
---
 fs/ubifs/recovery.c |   26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
index f94ddf7..31d09d1 100644
--- a/fs/ubifs/recovery.c
+++ b/fs/ubifs/recovery.c
@@ -299,6 +299,32 @@ int ubifs_recover_master_node(struct ubifs_info *c)
 			goto out_free;
 		}
 		memcpy(c->rcvrd_mst_node, c->mst_node, UBIFS_MST_NODE_SZ);
+
+		/*
+		 * We had to recover the master node, which means there was an
+		 * unclean reboot. However, it is possible that the master node
+		 * is clean at this point, i.e., %UBIFS_MST_DIRTY is not set.
+		 * E.g., consider the following chain of events:
+		 *
+		 * 1. UBIFS was cleanly unmounted, so the master node is clean
+		 * 2. UBIFS is being mounted R/W and starts changing the master
+		 *    node in the first LEB (%UBIFS_MST_LNUM). A power cut happens,
+		 *    so this LEB ends up with some amount of garbage at the
+		 *    end.
+		 * 3. UBIFS is being mounted R/O. We reach this place and
+		 *    recover the master node from the second LEB
+		 *    (%UBIFS_MST_LNUM + 1). But we cannot update the media
+		 *    because we are being mounted R/O. We have to defer the
+		 *    operation.
+		 * 4. However, this master node (@c->mst_node) is marked as
+		 *    clean (since the step 1). And if we just return, the
+		 *    mount code will be confused and won't recover the master
+		 *    node when it is re-mounted R/W later.
+		 *
+		 *    Thus, force the recovery by marking the master node as
+		 *    dirty.
+		 */
+		c->mst_node->flags |= cpu_to_le32(UBIFS_MST_DIRTY);
 	} else {
 		/* Write the recovered master node */
 		c->max_sqnum = le64_to_cpu(mst->ch.sqnum) - 1;
-- 
1.7.10.2



Otherwise, could you send copies of the corrupted LEBs 1 and 2?

And the output from:

	git log v2.6.32..HEAD -- fs/ubifs




* Re: UBIFS master node corruption
  2012-06-01  7:47 ` Adrian Hunter
@ 2012-06-01  8:08   ` Artem Bityutskiy
  2012-06-01  9:04   ` Romain Izard
  1 sibling, 0 replies; 5+ messages in thread
From: Artem Bityutskiy @ 2012-06-01  8:08 UTC
  To: Adrian Hunter; +Cc: Romain Izard, linux-mtd


On Fri, 2012-06-01 at 10:47 +0300, Adrian Hunter wrote:
> You need to make sure you have this patch from the 
> linux-2.6.32.y branch of linux-stable:

Thanks, Adrian. It is probably even better to pull the backport tree:

http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source

-- 
Best Regards,
Artem Bityutskiy



* Re: UBIFS master node corruption
  2012-06-01  7:47 ` Adrian Hunter
  2012-06-01  8:08   ` Artem Bityutskiy
@ 2012-06-01  9:04   ` Romain Izard
  2012-06-05 12:15     ` Artem Bityutskiy
  1 sibling, 1 reply; 5+ messages in thread
From: Romain Izard @ 2012-06-01  9:04 UTC
  To: linux-mtd

On 2012-06-01, Adrian Hunter <adrian.hunter@intel.com> wrote:

>> Have you ever encountered this kind of issue before?
>
> You need to make sure you have this patch from the 
> linux-2.6.32.y branch of linux-stable:
>

I got it from Artem Bityutskiy's ubifs-v2.6.32.git repository on
git.infradead.org, as commit ea0d024b63251232c60d76990e96d4453b5ceec1.
The tests were run with all the patches from this repository merged,
up to the following one:

> commit 6fef28bc82d0592d939e4c662449e93cbcfd08be
> Author: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
> Date:  2012-01-09 15:33:21
> Subject: x86: fix gcc 4.6 compilation

I also checked the subjects of all the newer patches on this branch,
but none of them seems related to my case.


My repository also contains the following patch, which has the same subject:

> commit d6a68cebd4838e2d273d717f7258601454901359
> Author: Anatolij Gustschin <agust@denx.de>
> Date:   Thu Jul 7 12:25:02 2011 +0200
>
>    UBIFS: fix master node recovery
>    
>    When the 1st LEB was unmapped and written but 2nd LEB not,
>    the master node recovery doesn't succeed after power cut.
>    We see following error when mounting UBIFS partition on NOR
>    flash:
>    
>    UBIFS error (pid 1137): ubifs_recover_master_node: failed to recover master node
>    
>    Correct 2nd master node offset check is needed to fix the
>    problem. If the 2nd master node is at the end in the 2nd LEB,
>    first master node is used for recovery. When checking for this
>    condition we should check whether the master node is exactly at
>    the end of the LEB (without remaining empty space) or whether
>    it is followed by an empty space less than the master node size.
>    
>    Artem: when the error happened, offs2 = 261120, sz = 512, c->leb_size = 262016.
>    
>    Signed-off-by: Anatolij Gustschin <agust@denx.de>
>    Signed-off-by: Artem Bityutskiy <dedekind1@gmail.com>
>
> diff --git a/fs/ubifs/recovery.c b/fs/ubifs/recovery.c
> index 5256f42..2c98d77 100644
> --- a/fs/ubifs/recovery.c
> +++ b/fs/ubifs/recovery.c
> @@ -273,7 +273,8 @@ int ubifs_recover_master_node(struct ubifs_info *c)
> 				if (cor1)
> 					goto out_err;
> 				mst = mst1;
> -			} else if (offs1 == 0 && offs2 + sz >= c->leb_size) {
> +			} else if (offs1 == 0 &&
> +				   c->leb_size - offs2 - sz < sz) {
> 				/* 1st LEB was unmapped and written, 2nd not */
> 				if (cor1)
> 					goto out_err;
>


From what I understand of the user-space application running on the
devices, there are many operations related to switching UBI and UBIFS to
a read-only mode, to support a 'boot snapshot' feature implemented with
Linux suspend. Since the patch you pointed me to was related to changing
the read-only property of the volume, I guess I should continue to
search in this direction.

-- 
Romain Izard


* Re: UBIFS master node corruption
  2012-06-01  9:04   ` Romain Izard
@ 2012-06-05 12:15     ` Artem Bityutskiy
  0 siblings, 0 replies; 5+ messages in thread
From: Artem Bityutskiy @ 2012-06-05 12:15 UTC
  To: Romain Izard; +Cc: linux-mtd


On Fri, 2012-06-01 at 09:04 +0000, Romain Izard wrote:
> From what I understand of the user-space application running on the
> devices, there are many operations related to switching UBI and UBIFS to
> a read-only mode, to support a 'boot snapshot' feature implemented with
> Linux suspend. Since the patch you pointed me to was related to changing
> the read-only property of the volume, I guess I should continue to
> search in this direction.

Can you share your corrupted image?

-- 
Best Regards,
Artem Bityutskiy




Thread overview: 5 messages
2012-05-31 13:52 UBIFS master node corruption Romain Izard
2012-06-01  7:47 ` Adrian Hunter
2012-06-01  8:08   ` Artem Bityutskiy
2012-06-01  9:04   ` Romain Izard
2012-06-05 12:15     ` Artem Bityutskiy
