* UBIFS volume corruption (bad node at LEB 0:0)
@ 2009-01-08 4:13 David Bergeron
2009-01-08 6:46 ` Artem Bityutskiy
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-08 4:13 UTC (permalink / raw)
To: linux-mtd
Hello all,
I'm getting some sort of volume corruption problem with UBIFS after
doing rootfs updates using rsync.
I've cooked up a minimalist test trying to eliminate possible
interference. The following steps will trigger the corruption almost
every time. No errors or warnings are produced during this procedure,
every step behaves as expected:
boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/bin/bash
# mount -t proc none /proc
# ifconfig ...
# mount -o remount,rw,sync /
# rsync -aHxvi --delete ... /
# mount -o remount,ro /
# reboot -d -f
When rebooting, the kernel fails to mount the rootfs with the
following error:
[ 61.033142] UBIFS error (pid 1): ubifs_read_node: bad node type (11 but expected 6)
[ 61.040965] UBIFS error (pid 1): ubifs_read_node: bad node at LEB 0:0
I'm running a fresh 2.6.28.0. mtd-utils were built from git on
2008-11-19. About 3 weeks ago I also tried with an mtd-2.6.git master
branch patched kernel to no avail.
Note that in my few attempts, when I do NOT remount read-only before
rebooting, the filesystem has so far remained functional (albeit being
left unclean) and it boots as expected. The error msg is always the
same. A certain amount of filesystem changes is necessary to trigger
the problem, simply touching a file is not enough. I have not
determined exactly what operations rsync needs to perform to reach
breaking point, but sometimes everything goes well.
Every other usage pattern works great so far on a second UBIFS volume.
rsync is the only thing I do on the rootfs, which is otherwise always
read-only.
UBIFS rocks, and I don't want to stick with jffs2.
I can try anything, any help or pointers would be appreciated.
Thanks, cheers
-david
kernel output of my setup:
[ 10.303908] NAND device: Manufacturer ID: 0xad, Chip ID: 0xf1 (Hynix NAND 128MiB 3,3V 8-bit)
[ 10.312429] Scanning device for bad blocks
[ 10.369515] Using static partition definition
[ 10.373927] Creating 1 MTD partitions on "NAND 128MiB 3,3V 8-bit":
[ 10.380120] 0x000000000000-0x000008000000 : "cmx270-0"
[ 10.385525] mtd: Giving out device 0 to cmx270-0
[ 10.392720] UBI: attaching mtd0 to ubi0
[ 10.396766] UBI DBG (pid 1): io_init: min_io_size 2048
[ 10.402316] UBI DBG (pid 1): io_init: hdrs_min_io_size 512
[ 10.407866] UBI DBG (pid 1): io_init: ec_hdr_alsize 512
[ 10.413367] UBI DBG (pid 1): io_init: vid_hdr_alsize 512
[ 10.418827] UBI DBG (pid 1): io_init: vid_hdr_offset 512
[ 10.424327] UBI DBG (pid 1): io_init: vid_hdr_aloffset 512
[ 10.429795] UBI DBG (pid 1): io_init: vid_hdr_shift 0
[ 10.435118] UBI DBG (pid 1): io_init: leb_start 2048
[ 10.440676] UBI: physical eraseblock size: 131072 bytes (128 KiB)
[ 10.446942] UBI: logical eraseblock size: 129024 bytes
[ 10.452320] UBI: smallest flash I/O unit: 2048
[ 10.457030] UBI: sub-page size: 512
[ 10.461634] UBI: VID header offset: 512 (aligned 512)
[ 10.467473] UBI: data offset: 2048
[ 10.472726] UBI DBG (pid 1): process_eb: scan PEB 0
...(no red flags seen)...
[ 70.361901] UBI DBG (pid 1): ubi_scan: scanning is finished
[ 70.505807] UBI: attached mtd0 to ubi0
[ 70.509646] UBI: MTD device name: "cmx270-0"
[ 70.514969] UBI: MTD device size: 128 MiB
[ 70.519908] UBI: number of good PEBs: 1024
[ 70.524710] UBI: number of bad PEBs: 0
[ 70.529136] UBI: max. allowed volumes: 128
[ 70.533775] UBI: wear-leveling threshold: 4096
[ 70.538474] UBI: number of internal volumes: 1
[ 70.542939] UBI: number of user volumes: 2
[ 70.547371] UBI: available PEBs: 33
[ 70.551887] UBI: total number of reserved PEBs: 991
[ 70.556782] UBI: number of PEBs reserved for bad PEB handling: 10
[ 70.562879] UBI: max/mean erase counter: 162/57
[ 70.568050] UBI: background thread "ubi_bgt0d" started, PID 264
the bad node type error happens here when it breaks; this is the output when it mounts fine:
[ 71.251007] UBIFS: mounted UBI device 0, volume 0, name "rootfs"
[ 71.257202] UBIFS: mounted read-only
[ 71.260762] UBIFS: file system size: 82704384 bytes (80766 KiB, 78 MiB, 641 LEBs)
[ 71.268447] UBIFS: journal size: 4128768 bytes (4032 KiB, 3 MiB, 32 LEBs)
[ 71.275751] UBIFS: media format: 4 (latest is 4)
[ 71.280866] UBIFS: default compressor: LZO
[ 71.284975] UBIFS: reserved for root: 4087320 bytes (3991 KiB)
[ 71.292580] VFS: Mounted root (ubifs filesystem) readonly.
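As a side note, the sizes UBI and UBIFS report above are internally consistent: the logical eraseblock is the physical eraseblock minus the data offset, and the file system size is the LEB count times the LEB size. A quick sanity check of the arithmetic (plain shell, no hardware involved):

```shell
# Sanity-check the geometry reported in the log above.
peb=131072         # UBI: physical eraseblock size
data_offset=2048   # UBI: data offset
leb=$((peb - data_offset))
echo "LEB size: $leb bytes"            # matches "logical eraseblock size: 129024"

lebs=641           # UBIFS: LEBs in the file system
fs=$((lebs * leb))
echo "FS size: $fs bytes, $((fs / 1024)) KiB"  # matches 82704384 bytes (80766 KiB)
```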
MTD & UBI .config grep:
CONFIG_ARCH_MTD_XIP=y
CONFIG_MTD=y
CONFIG_MTD_DEBUG=y
CONFIG_MTD_DEBUG_VERBOSE=0
CONFIG_MTD_PARTITIONS=y
CONFIG_MTD_CHAR=y
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
CONFIG_MTD_NAND=y
CONFIG_MTD_NAND_IDS=y
CONFIG_MTD_NAND_CM_X270=y
CONFIG_MTD_UBI=y
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_RESERVE=1
CONFIG_MTD_UBI_DEBUG=y
CONFIG_MTD_UBI_DEBUG_PARANOID=y
CONFIG_MTD_UBI_DEBUG_MSG_BLD=y
CONFIG_UBIFS_FS=y
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
--
^ permalink raw reply [flat|nested] 16+ messages in thread

* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-08 4:13 UBIFS volume corruption (bad node at LEB 0:0) David Bergeron
@ 2009-01-08 6:46 ` Artem Bityutskiy
2009-01-16 15:34 ` David Bergeron
2009-01-14 16:17 ` Artem Bityutskiy
2009-01-26 9:18 ` Artem Bityutskiy
2 siblings, 1 reply; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-08 6:46 UTC (permalink / raw)
To: David Bergeron; +Cc: linux-mtd
On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
> Hello all,
>
> I'm getting some sort of volume corruption problem with UBIFS after
> doing rootfs updates using rsync.
>
> I've cooked up a minimalist test trying to eliminate possible
> interference. The following steps will trigger the corruption almost
> every time. No errors or warnings are produced during this procedure,
> every step behaves as expected:
>
> boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/bin/bash
>
> # mount -t proc none /proc
> # ifconfig ...
> # mount -o remount,rw,sync /
> # rsync -aHxvi --delete ... /
> # mount -o remount,ro /
> # reboot -d -f
>
> When rebooting, the kernel fails to mount the rootfs with the
> following error:
>
> [ 61.033142] UBIFS error (pid 1): ubifs_read_node: bad node type (11 but expected 6)
> [ 61.040965] UBIFS error (pid 1): ubifs_read_node: bad node at LEB 0:0
Hmm, 11 is an orphan node, 6 is the superblock node. Indeed LEB 0 has
to contain the superblock node and cannot contain orphans.
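For readers decoding similar errors: the numeric node types come from the node-type enumeration in fs/ubifs/ubifs-media.h (ordering as of 2.6.28; double-check against your tree). A small, purely illustrative lookup helper:

```shell
# Map a UBIFS node type number (as printed by ubifs_read_node errors)
# to a human-readable name. Numbering follows the enum order in
# fs/ubifs/ubifs-media.h: 0 = inode ... 6 = superblock ... 11 = orphan.
ubifs_node_name() {
    case "$1" in
        0)  echo inode ;;
        1)  echo data ;;
        2)  echo dentry ;;
        3)  echo "extended attribute entry" ;;
        4)  echo truncation ;;
        5)  echo padding ;;
        6)  echo superblock ;;
        7)  echo master ;;
        8)  echo "LEB reference" ;;
        9)  echo index ;;
        10) echo "commit start" ;;
        11) echo orphan ;;
        *)  echo unknown ;;
    esac
}

ubifs_node_name 11   # the node type found in LEB 0
ubifs_node_name 6    # the node type LEB 0 should contain
```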
We do not think we tested UBIFS re-mounting well enough, so I would not
be surprised to see bugs there.
> I'm running a fresh 2.6.28.0. mtd-utils were built from git on
> 2008-11-19. About 3 weeks ago I also tried with an mtd-2.6.git master
> branch patched kernel to no avail.
>
> Note that in my few attempts, when I do NOT remount read-only before
> rebooting, the filesystem has so far remained functional (albeit being
> left unclean) and it boots as expected. The error msg is always the
> same. A certain amount of filesystem changes is necessary to trigger
> the problem, simply touching a file is not enough. I have not
> determined exactly what operations rsync needs to perform to reach
> breaking point, but sometimes everything goes well.
Hmm, OK. I'll try to look at this and figure out what is going wrong.
What would help a lot is if I was able to reproduce this at my setup. So
you may help by sending a shell script which reproduces this issue, if
you can. And it is better to work with nandsim, because this is the tool
I use here
(http://www.linux-mtd.infradead.org/faq/nand.html#L_nand_nandsim)
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-08 6:46 ` Artem Bityutskiy
@ 2009-01-16 15:34 ` David Bergeron
2009-01-19 8:56 ` Artem Bityutskiy
0 siblings, 1 reply; 16+ messages in thread
From: David Bergeron @ 2009-01-16 15:34 UTC (permalink / raw)
To: dedekind; +Cc: linux-mtd
On 2009-01-08, at 1:46, Artem Bityutskiy wrote:
> On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
>> # mount -o remount,rw,sync /
>> # rsync -aHxvi --delete ... /
>> # mount -o remount,ro /
>> # reboot -d -f
>>
>> When rebooting, the kernel fails to mount the rootfs with the
>> following error:
>>
>> [ 61.033142] UBIFS error (pid 1): ubifs_read_node: bad node type (11 but expected 6)
>> [ 61.040965] UBIFS error (pid 1): ubifs_read_node: bad node at LEB 0:0
>
> Hmm, OK. I'll try to look at this and figure out what is going wrong.
> What would help a lot is if I was able to reproduce this at my setup.
> So you may help by sending a shell script which reproduces this issue,
> if you can. And it is better to work with nandsim, because this is the
> tool I use here
Hi Artem,
So I am able to reproduce it on nandsim, with the following setup, it
takes on average ~30 cycles of rsync & remount before it breaks, which
is much more resilient than with my real setup.
Couple of observations:
- It is the read-only mount followed by a 'remount,rw' that is the
problem enabler, nothing bad happens without doing that.
- I first tried to play with extracting tarballs but it ran fine for
hours, when I went back to rsync'ing files it broke almost immediately.
- rsync hops between syncing two rootfs userlands, mostly identical
besides a bunch of mtime differences and one having more files (55% vs
88% used capacity), so far it always breaks after rsync has grown the
data footprint, shrinking seems to go well.
I will keep poking around this issue, let me know if you want me to
try anything.
The setup:
64MiB, 2048 bytes page
nandsim.first_id_byte=0x20 nandsim.second_id_byte=0xa2
nandsim.third_id_byte=0x00 nandsim.fourth_id_byte=0x15
ubiformat /dev/mtd0 -s 512 -y
ubiattach /dev/ubi_ctrl -m 0
ubimkvol /dev/ubi0 -N rootfs -s 40MiB
Mounted once rw to init the empty filesystem, then I run this:
#!/bin/sh -x
UBIFS=ubi0:rootfs
MNT=/mnt
step=1
count=0
while true; do
    mount -t ubifs -o ro $UBIFS $MNT || { echo GAME OVER score $count; break; }
    mount -o remount,rw,sync $MNT
    case $step in
        1)
            rsync -aHx --delete host::systemA $MNT
            step=2 ;;
        2)
            rsync -aHx --delete host::systemB $MNT
            step=1 ;;
    esac
    umount $MNT
    count=$((count+1))
done
-EOF-
Regards,
-david
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-16 15:34 ` David Bergeron
@ 2009-01-19 8:56 ` Artem Bityutskiy
2009-01-20 3:44 ` David Bergeron
0 siblings, 1 reply; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-19 8:56 UTC (permalink / raw)
To: David Bergeron; +Cc: linux-mtd
On Fri, 2009-01-16 at 10:34 -0500, David Bergeron wrote:
> On 2009-01-08, at 1:46, Artem Bityutskiy wrote:
> > On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
> >> # mount -o remount,rw,sync /
> >> # rsync -aHxvi --delete ... /
> >> # mount -o remount,ro /
> >> # reboot -d -f
> >>
> >> When rebooting, the kernel fails to mount the rootfs with the
> >> following error:
> >>
> >> [ 61.033142] UBIFS error (pid 1): ubifs_read_node: bad node type (11 but expected 6)
> >> [ 61.040965] UBIFS error (pid 1): ubifs_read_node: bad node at LEB 0:0
> >
> > Hmm, OK. I'll try to look at this and figure out what is going wrong.
> > What would help a lot is if I was able to reproduce this at my setup.
> > So you may help by sending a shell script which reproduces this issue,
> > if you can. And it is better to work with nandsim, because this is the
> > tool I use here
>
> Hi Artem,
>
> So I am able to reproduce it on nandsim, with the following setup, it
> takes on average ~30 cycles of rsync & remount before it breaks, which
> is much more resilient than with my real setup.
>
> Couple of observations:
> - It is the read-only mount followed by a 'remount,rw' that is the
> problem enabler, nothing bad happens without doing that.
> - I first tried to play with extracting tarballs but it ran fine for
> hours, when I went back to rsync'ing files it broke almost immediately.
> - rsync hops between syncing two rootfs userlands, mostly identical
> besides a bunch of mtime differences and one having more files (55% vs
> 88% used capacity), so far it always breaks after rsync has grown the
> data footprint, shrinking seems to go well.
>
> I will keep poking around this issue, let me know if you want me to
> try anything.
Just tried to reproduce this on my x86_64 host without success.
Below is the script I used. I guess the SystemA and SystemB contents
matter. I tried to put /bin from Fedora into SystemA, and /bin from
Debian into SystemB.
Would you share your SystemA and SystemB?
#!/bin/sh -x
UBIFS=ubi0:rootfs
MNT=/mnt/ubifs
SystemA=/home/dedekind/tmp/rsync/A
SystemB=/home/dedekind/tmp/rsync/B
step=1
count=0
umount $MNT &> /dev/null
rmmod ubifs &> /dev/null
rmmod ubi &> /dev/null
rmmod nandsim &> /dev/null
# Prepare UBIFS
modprobe nandsim first_id_byte=0x20 second_id_byte=0xa2 third_id_byte=0x00 fourth_id_byte=0x15 || exit 1
modprobe ubi
udevsettle
ubiformat /dev/mtd0 -s 512 -y || { echo ubiformat; exit 1; }
ubiattach /dev/ubi_ctrl -m 0 || { echo ubiattach; exit 1; }
ubimkvol /dev/ubi0 -N rootfs -s 40MiB || { echo ubimkvol; exit 1; }
mount -t ubifs $UBIFS $MNT || { echo mount; exit 1; }
umount $MNT
# Start the test
while true; do
    mount -t ubifs -o ro $UBIFS $MNT || { echo GAME OVER score $count; break; }
    mount -o remount,rw,sync $MNT
    case $step in
        1)
            rsync -aHx --delete $SystemA $MNT
            step=2 ;;
        2)
            rsync -aHx --delete $SystemB $MNT
            step=1 ;;
    esac
    umount $MNT
    count=$((count+1))
done
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-19 8:56 ` Artem Bityutskiy
@ 2009-01-20 3:44 ` David Bergeron
2009-01-20 5:40 ` David Bergeron
2009-01-20 9:01 ` Artem Bityutskiy
0 siblings, 2 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-20 3:44 UTC (permalink / raw)
To: dedekind; +Cc: linux-mtd
On 2009-01-19, at 3:56, Artem Bityutskiy wrote:
> Just tried to reproduce this on my x86_64 host without success.
Well, today I had a bit of a breakthrough.
Put the nandsim + rsync loop on the side for now, and let's go back to
a slightly evolved version of my original scenario.
boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/root/updater.sh
/root/updater.sh:
#!/bin/sh -x
mount -t proc none /proc
ifconfig eth0 ...
mount -o remount,rw /
rsync -aHx --delete systemA /
rsync -aHx --delete systemB /
sync
exec /bin/sh -c "mount -o remount,ro /; reboot -df;"
-EOF-
no more 'sync' flag, but still an orphan LEB 0:0 on first run, every
time, 100% reproducibility. (reboot -d just means don't update
/var/log/wtmp)
So to break it, it's necessary to re-mount rw,SYNC in the nandsim
context, but not in the rootfs update scenario (a stripped down
version of a procedure I've used on jffs2 for two years now)
I've always been puzzled by the lower reproducibility rates in the
nandsim context, because I could replace nandsim with the real flash
and run the same steps (exactly the same hardware, kernel, userland,
rsync, sync data, nand flash, volume layout and what not). The only
difference? One scenario is a live rootfs, and the other is not...
Which finally led me to the real cause:
It turns out that my test script /root/updater.sh differed between
systemA and systemB, so it was being replaced by rsync *while being
open & running*.
Indeed, once the script was identical on both rsync targets, UBIFS no
longer broke. (I believe at the time of my first report, my rsync
binary itself was different.)
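The failure mode described here — a running script unlinked out from under itself, then exec'ing a fresh shell — can be sketched in miniature on any filesystem (the directory, file name, and message below are made up for illustration):

```shell
#!/bin/sh
# Build a script that deletes itself while running, much like
# updater.sh being replaced by rsync mid-execution.
dir=$(mktemp -d)
cat > "$dir/updater.sh" <<'EOF'
#!/bin/sh
rm "$0"     # the script file is now unlinked but still open & running
exec /bin/sh -c 'echo survived the exec'
EOF
chmod +x "$dir/updater.sh"
"$dir/updater.sh"   # prints: survived the exec
ls "$dir"           # the script file is already gone
```

The shell keeps reading the script through its open file descriptor, so deleting the name does not stop execution; only the final close lets the filesystem reclaim the inode.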
So here's my sysadmin grade speculation (I'm no filesystem guru):
- A script is executed & open
- It is soon deleted and replaced by a new one but cannot be released
just yet
- There is little time between the file becoming unreferenced and the
filesystem becoming read-only
- [wild speculation] Something goes wrong, the LEB it used had to be
orphaned but it's too late, some bad pointer gives LEB 0 the axe
Now clearly nothing is executed or left open for a long time in the
nandsim + rsync based test, I guess the remount w/sync flag is helping
get the edge case. Try to add the "--delay-updates" option to rsync to
have unlinking rushed at the end. Maybe the slower arm hardware
(cm-x270) I'm using is more prone to this.
I'll try to come up with a sure fire & simple way to replicate this on
nandsim without the rsync hassle.
will keep you posted
Best regards,
-david
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-20 3:44 ` David Bergeron
@ 2009-01-20 5:40 ` David Bergeron
2009-01-20 9:01 ` Artem Bityutskiy
1 sibling, 0 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-20 5:40 UTC (permalink / raw)
To: linux-mtd
On 2009-01-19, at 22:44, David Bergeron wrote:
> - There is little time between the file becoming unreferenced and the
> filesystem becoming read-only
> Now clearly nothing is executed or left open for a long time in the
> nandsim + rsync based test, I guess the remount w/sync flag is helping
> get the edge case. Try to add the "--delay-updates" option to rsync to
> have unlinking rushed at the end. Maybe the slower arm hardware
> (cm-x270) I'm using is more prone to this.
I was mistaken here, timing doesn't seem to matter. I added plenty of
syncs and sleeps.
rsync ...
sync; sync; sleep 4; sync;
exec /bin/sh -xc "lsof; sync; sleep 2; mount -o remount,ro /; sync; sleep 2; lsof; reboot -df;"
The deleted shell script is listed by the first lsof but not by the
second.
Right before rebooting, fs usage boils down to this:
sh 1 root cwd DIR 253,1 1520 1 /
sh 1 root rtd DIR 253,1 1520 1 /
sh 1 root txt REG 253,1 88220 1113 /bin/dash
sh 1 root mem REG 253,1 121636 1685 /lib/ld-2.7.so
sh 1 root mem REG 253,1 1168924 1693 /lib/libc-2.7.so
sh 1 root 0u CHR 5,1 96 /dev/console
It just seems like as soon as rsync changes an open file, I lose
the filesystem.
-david
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-20 3:44 ` David Bergeron
2009-01-20 5:40 ` David Bergeron
@ 2009-01-20 9:01 ` Artem Bityutskiy
2009-01-20 21:47 ` David Bergeron
1 sibling, 1 reply; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-20 9:01 UTC (permalink / raw)
To: David Bergeron; +Cc: linux-mtd
On Mon, 2009-01-19 at 22:44 -0500, David Bergeron wrote:
> So here's my sysadmin grade speculation (I'm no filesystem guru):
> - A script is executed & open
> - It is soon deleted and replaced by a new one but cannot be released
> just yet
Right, and UBIFS (well, and other FSes) calls such files "orphans" -
they are opened, but deleted. Indeed UBIFS does not do final removal for
such files until after the final close().
> - There is little time between the file becoming unreferenced and the
> filesystem becoming read-only
> - [wild speculation] Something goes wrong, the LEB it used had to be
> orphaned but it's too late, some bad pointer gives LEB 0 the axe
Yeah, our current theory is that you have your script running, which
means it is opened, and it is orphan now, and you re-mount the FS R/O,
and end up with a R/O FS + an orphan. We never considered this scenario
before. And the scenario is a little nasty because UBIFS may want to
write when you release the orphan (close the file), but the FS is R/O.
We'll work on this, thanks for excellent bug description!
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-20 9:01 ` Artem Bityutskiy
@ 2009-01-20 21:47 ` David Bergeron
0 siblings, 0 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-20 21:47 UTC (permalink / raw)
To: dedekind; +Cc: linux-mtd
On 2009-01-20, at 4:01, Artem Bityutskiy wrote:
> Yeah, our current theory is that you have your script running, which
> means it is opened, and it is orphan now, and you re-mount the FS R/O,
> and end up with a R/O FS + an orphan. We never considered this
> scenario
> before. And the scenario is a little nasty because UBIFS may want to
> write when you release the orphan (close the file), but the FS is R/O.
> We'll work on this, thanks for excellent bug description!
I'm afraid I've misled you here. It's actually not the case: the
script is *not* open when I re-mount r/o; the kernel simply will not
allow that to happen, and I get a "Device or resource busy" error if I
even try.
considering:
exec /bin/sh -xc "lsof; sync; sleep 2; lsof; sync; sleep 2; \
mount -o remount,ro /; sleep 2; reboot -df;"
The running & orphaned script is history as soon as 'exec' runs; lsof
confirms this. UBIFS gets two sync() calls + 4 seconds to clean up
before the filesystem goes read-only, which should be plenty of time
and opportunity.
There's gotta be something else. Especially since it doesn't break if
I tell the kernel to mount 'rw' in the first place (bad practice, but
a possible temporary workaround for me nonetheless).
One thing I wonder: UBIFS itself mentions "UBIFS: mounted read-only";
is it just information for the human reader, or is UBIFS perhaps
behaving as in read-only operation somewhere, even /after/ becoming
writable, which could cause it to mishandle the cleaning up of the orphan?
I very much appreciate all the time you've put looking at this.
Best regards,
-david
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-08 4:13 UBIFS volume corruption (bad node at LEB 0:0) David Bergeron
2009-01-08 6:46 ` Artem Bityutskiy
@ 2009-01-14 16:17 ` Artem Bityutskiy
2009-01-16 21:23 ` David Bergeron
2009-01-26 9:18 ` Artem Bityutskiy
2 siblings, 1 reply; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-14 16:17 UTC (permalink / raw)
To: David Bergeron; +Cc: linux-mtd
Hi,
I've just upgraded a few of my tests to do re-mount testing, and
remounting works fine - I couldn't find any problems.
On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
> I've cooked up a minimalist test trying to eliminate possible
> interference. The following steps will trigger the corruption almost
> every time. No errors or warnings are produced during this procedure,
> every step behaves as expected:
>
> boot kernel ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs ro init=/bin/bash
>
> # mount -t proc none /proc
> # ifconfig ...
> # mount -o remount,rw,sync /
> # rsync -aHxvi --delete ... /
> # mount -o remount,ro /
> # reboot -d -f
I wonder what the -d option means. Could you please try to play with
things like:
1. Try to remove the "sync" parameter of "mount -o remount,rw,sync".
BTW, -o sync makes your rsync work much slower than if you had async
mount and call sync after rsync is done.
2. Try to avoid re-mounting the fs into RO mode, but just call "sync".
3. When you reboot, I guess your boot scripts try to mount the FS in R/O
mode first. Try to tweak them and mount the FS read-write for the first
time.
Please see what happens; maybe this will give us some clue.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-14 16:17 ` Artem Bityutskiy
@ 2009-01-16 21:23 ` David Bergeron
0 siblings, 0 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-16 21:23 UTC (permalink / raw)
To: dedekind; +Cc: linux-mtd
Sorry, my primary MX was acting up badly; I just got a bunch of spooled
mail...
On 2009-01-14, at 11:17, Artem Bityutskiy wrote:
> 1. Try to remove the "sync" parameter of "mount -o remount,rw,sync".
> BTW, -o sync makes your rsync work much slower than if you had async
> mount and call sync after rsync is done.
Ah! Continuing on my nandsim testbed, it appears that re-mounting with
sync is indeed responsible, so:
ro -> rw,sync = bad (breaks in ~20 rounds)
ro -> rw = good (survived 630 rounds before I moved on)
rw,sync = good (survived 370 rounds and counting...)
I can live with ro -> rw; `rsync & sync`; rw -> ro; reboot
I intend to use UBIFS on a solar+battery powered device, so I really
want to opt out of write-back caching altogether; write performance is
less important to me than reliability. Since a rw,sync initial mount
seems to work fine, I will be using that on my writable data volume.
> 2. Try to avoid re-mounting the fs into RO mode, but just call "sync".
AFAIK, most init scripts will remount the rootfs read-only at
shutdown; it's pretty much due process, since if it's not remounted
read-only, the rootfs would remain dirty (I don't think '/' is ever
unmounted). Anyway, it doesn't look like the rw -> ro direction is
causing any problems.
> 3. When you reboot, I guess your boot scripts try to mount the FS in
> R/O mode first. Try to tweak them and mount the FS read-write for the
> first time.
It's the kernel that mounts read-only first. I do want my rootfs to be
read-only during normal operations, and I have no control over the
kernel cmdline arguments from userland.
On 2009-01-14, at 11:23, Artem Bityutskiy wrote:
> Hmm, could you please check if -f is essential here. I mean, can you
> still reproduce the bug if you reboot cleanly, with unmount. When you
> reboot uncleanly, UBIFS starts so called "recovery" process, so I want
> to check whether recovery may be guilty.
I have no recovery at all in any of my test cases. Because I remount
read-only before I reboot -f, the filesystem is clean.
Many thanks for your help on this.
Regards,
-david
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-08 4:13 UBIFS volume corruption (bad node at LEB 0:0) David Bergeron
2009-01-08 6:46 ` Artem Bityutskiy
2009-01-14 16:17 ` Artem Bityutskiy
@ 2009-01-26 9:18 ` Artem Bityutskiy
2009-01-28 2:20 ` Kyungmin Park
2009-01-28 16:31 ` David Bergeron
2 siblings, 2 replies; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-26 9:18 UTC (permalink / raw)
To: David Bergeron; +Cc: linux-mtd
On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
> Hello all,
>
> I'm getting some sort of volume corruption problem with UBIFS after
> doing rootfs updates using rsync.
>
> I've cooked up a minimalist test trying to eliminate possible
> interference. The following steps will trigger the corruption almost
> every time. No errors or warnings are produced during this procedure,
> every step behaves as expected:
David,
we think we've fixed this bug. The reason was that when mounting RO,
then re-mounting read-write, the orphan sub-system was not properly
initialized. The fix is available at the ubifs-v2.6.28.git tree. I've
back-ported all stable changes and fixes to the tree. You need this fix:
commit b3718eedf4ab91307f64a79ad8eb911ca401bd22
Author: Adrian Hunter <ext-adrian.hunter@nokia.com>
Date: Mon Jan 26 10:55:40 2009 +0200
UBIFS: ensure orphan area head is initialized
When mounting read-only the orphan area head is
not initialized. It must be initialized when
remounting read/write, but it was not. This patch
fixes that.
[Artem: sorry, added comment tweaking noise]
Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
However, I recommend you to incorporate all UBI/UBIFS changes since
2.6.28, because there were other bug-fixes.
You may find information about back-port trees here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source
Please, send us some feed-back about whether your problem has been
solved.
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-26 9:18 ` Artem Bityutskiy
@ 2009-01-28 2:20 ` Kyungmin Park
2009-01-28 7:51 ` Artem Bityutskiy
2009-01-28 8:24 ` Adrian Hunter
2009-01-28 16:31 ` David Bergeron
1 sibling, 2 replies; 16+ messages in thread
From: Kyungmin Park @ 2009-01-28 2:20 UTC (permalink / raw)
To: dedekind; +Cc: David Bergeron, linux-mtd
Hi,
On Mon, Jan 26, 2009 at 6:18 PM, Artem Bityutskiy
<dedekind@infradead.org> wrote:
> On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
>> Hello all,
>>
>> I'm getting some sort of volume corruption problem with UBIFS after
>> doing rootfs updates using rsync.
>>
>> I've cooked up a minimalist test trying to eliminate possible
>> interference. The following steps will trigger the corruption almost
>> every time. No errors or warnings are produced during this procedure,
>> every step behaves as expected:
>
> David,
>
> we think we've fixed this bug. The reason was that when mounting RO,
> then re-mounting read-write, the orphan sub-system was not properly
> initialized. The fix is available at the ubifs-v2.6.28.git tree. I've
> back-ported all stable changes and fixes to the tree. You need this fix:
>
> commit b3718eedf4ab91307f64a79ad8eb911ca401bd22
> Author: Adrian Hunter <ext-adrian.hunter@nokia.com>
> Date: Mon Jan 26 10:55:40 2009 +0200
>
> UBIFS: ensure orphan area head is initialized
>
> When mounting read-only the orphan area head is
> not initialized. It must be initialized when
> remounting read/write, but it was not. This patch
> fixes that.
>
> [Artem: sorry, added comment tweaking noise]
> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
>
> However, I recommend you to incorporate all UBI/UBIFS changes since
> 2.6.28, because there were other bug-fixes.
>
> You may find information about back-port trees here:
> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source
>
> Please, send us some feed-back about whether your problem has been
> solved.
>
I got a similar problem with orphan node handling; with this patch,
it's solved. But there is a strange one.
I boot with UBIFS in ro mode, and then I remount it as rw. After
some work, I remount it as ro. In that case,
it displays "Device busy", so I can't change the mode. Do you have any clue?
Thank you,
Kyungmin Park
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-28 2:20 ` Kyungmin Park
@ 2009-01-28 7:51 ` Artem Bityutskiy
2009-01-28 8:24 ` Adrian Hunter
1 sibling, 0 replies; 16+ messages in thread
From: Artem Bityutskiy @ 2009-01-28 7:51 UTC (permalink / raw)
To: kyungmin78; +Cc: David Bergeron, linux-mtd
On Wed, 2009-01-28 at 11:20 +0900, Kyungmin Park wrote:
> I got a similar problem with orphan node handling; with this patch,
> it's solved. But there is a strange one.
>
> I boot with UBIFS in ro mode, and then I remount it as rw. After
> some work, I remount it as ro. In that case,
> it displays "Device busy", so I can't change the mode. Do you have any clue?
Hmm, no, I haven't seen this. Could you turn debugging on, reproduce the
bug, and see if there is anything interesting in the kernel logs?
--
Best regards,
Artem Bityutskiy (Битюцкий Артём)
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-28 2:20 ` Kyungmin Park
2009-01-28 7:51 ` Artem Bityutskiy
@ 2009-01-28 8:24 ` Adrian Hunter
2009-01-29 4:24 ` Kyungmin Park
1 sibling, 1 reply; 16+ messages in thread
From: Adrian Hunter @ 2009-01-28 8:24 UTC (permalink / raw)
To: kyungmin78@gmail.com; +Cc: David Bergeron, linux-mtd@lists.infradead.org
Kyungmin Park wrote:
> Hi,
>
> On Mon, Jan 26, 2009 at 6:18 PM, Artem Bityutskiy
> <dedekind@infradead.org> wrote:
>> On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
>>> Hello all,
>>>
>>> I'm getting some sort of volume corruption problem with UBIFS after
>>> doing rootfs updates using rsync.
>>>
>>> I've cooked up a minimalist test trying to eliminate possible
>>> interference. The following steps will trigger the corruption almost
>>> every time. No errors or warnings are produced during this procedure,
>>> every step behaves as expected:
>> David,
>>
>> we think we've fixed this bug. The reason was that when mounting RO,
>> then re-mounting read-write, the orphan sub-system was not properly
>> initialized. The fix is available at the ubifs-v2.6.28.git tree. I've
>> back-ported all stable changes and fixes to the tree. You need this fix:
>>
>> commit b3718eedf4ab91307f64a79ad8eb911ca401bd22
>> Author: Adrian Hunter <ext-adrian.hunter@nokia.com>
>> Date: Mon Jan 26 10:55:40 2009 +0200
>>
>> UBIFS: ensure orphan area head is initialized
>>
>> When mounting read-only the orphan area head is
>> not initialized. It must be initialized when
>> remounting read/write, but it was not. This patch
>> fixes that.
>>
>> [Artem: sorry, added comment tweaking noise]
>> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
>> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
>>
>> However, I recommend that you incorporate all UBI/UBIFS changes
>> since 2.6.28, because there were other bug fixes.
>>
>> You may find information about back-port trees here:
>> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source
>>
>> Please send us some feedback about whether your problem has been
>> solved.
>>
>
> I got a similar problem with orphan node handling. With this patch
> it is solved, but there is still something strange:
>
> I boot UBIFS in ro mode, then remount it rw. After some work, I
> remount it ro. At that point it displays "Device busy", so I can't
> change the mode. Do you have any clue?
Speaking of orphans, it is not possible to remount rw->ro while files are
pending deletion, and that is true on any file system: e.g. open a file,
unlink it, try to remount ro => "device busy" error. The unlinked file must
be closed first so that the file system can delete it.
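The semantics Adrian describes are standard POSIX unlink-while-open behaviour rather than anything UBIFS-specific; a minimal sketch (hypothetical file names, any POSIX system):

```python
import os
import tempfile

# Create a scratch file and keep a descriptor open on it.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "victim")
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.write(fd, b"some data")

# Unlink while the descriptor is still open: the name disappears,
# but the inode (an "orphan") lives on until the last close.
os.unlink(path)
assert not os.path.exists(path)          # the name is gone
os.lseek(fd, 0, os.SEEK_SET)
assert os.read(fd, 9) == b"some data"    # the data is still readable

# Only after the final close can the file system actually delete the
# inode -- which is why a remount to read-only fails with "device
# busy" while such orphans exist.
os.close(fd)
```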
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-28 8:24 ` Adrian Hunter
@ 2009-01-29 4:24 ` Kyungmin Park
0 siblings, 0 replies; 16+ messages in thread
From: Kyungmin Park @ 2009-01-29 4:24 UTC (permalink / raw)
To: Adrian Hunter; +Cc: David Bergeron, linux-mtd@lists.infradead.org
On Wed, Jan 28, 2009 at 5:24 PM, Adrian Hunter
<ext-adrian.hunter@nokia.com> wrote:
> Kyungmin Park wrote:
>> Hi,
>>
>> On Mon, Jan 26, 2009 at 6:18 PM, Artem Bityutskiy
>> <dedekind@infradead.org> wrote:
>>> On Wed, 2009-01-07 at 23:13 -0500, David Bergeron wrote:
>>>> Hello all,
>>>>
>>>> I'm getting some sort of volume corruption problem with UBIFS
>>>> after doing rootfs updates using rsync.
>>>>
>>>> I've cooked up a minimalist test trying to eliminate possible
>>>> interference. The following steps will trigger the corruption
>>>> almost every time. No errors or warnings are produced during this
>>>> procedure; every step behaves as expected:
>>> David,
>>>
>>> we think we've fixed this bug. The reason was that when mounting RO,
>>> then re-mounting read-write, the orphan sub-system was not properly
>>> initialized. The fix is available at the ubifs-v2.6.28.git tree. I've
>>> back-ported all stable changes and fixes to the tree. You need this fix:
>>>
>>> commit b3718eedf4ab91307f64a79ad8eb911ca401bd22
>>> Author: Adrian Hunter <ext-adrian.hunter@nokia.com>
>>> Date: Mon Jan 26 10:55:40 2009 +0200
>>>
>>> UBIFS: ensure orphan area head is initialized
>>>
>>> When mounting read-only the orphan area head is
>>> not initialized. It must be initialized when
>>> remounting read/write, but it was not. This patch
>>> fixes that.
>>>
>>> [Artem: sorry, added comment tweaking noise]
>>> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
>>> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
>>>
>>> However, I recommend that you incorporate all UBI/UBIFS changes
>>> since 2.6.28, because there were other bug fixes.
>>>
>>> You may find information about back-port trees here:
>>> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source
>>>
>>> Please send us some feedback about whether your problem has been
>>> solved.
>>>
>>
>> I got a similar problem with orphan node handling. With this patch
>> it is solved, but there is still something strange:
>>
>> I boot UBIFS in ro mode, then remount it rw. After some work, I
>> remount it ro. At that point it displays "Device busy", so I can't
>> change the mode. Do you have any clue?
>
> Speaking of orphans, it is not possible to remount rw->ro while files are
> pending deletion, and that is true on any file system: e.g. open a file,
> unlink it, try to remount ro => "device busy" error. The unlinked file must
> be closed first so that the file system can delete it.
>
It seems so: normal operations are okay, but some more complicated
operations are not. I will try to reproduce it and give a more
detailed description.
Thank you,
Kyungmin Park
* Re: UBIFS volume corruption (bad node at LEB 0:0)
2009-01-26 9:18 ` Artem Bityutskiy
2009-01-28 2:20 ` Kyungmin Park
@ 2009-01-28 16:31 ` David Bergeron
1 sibling, 0 replies; 16+ messages in thread
From: David Bergeron @ 2009-01-28 16:31 UTC (permalink / raw)
To: dedekind; +Cc: linux-mtd
On 2009-01-26, at 4:18, Artem Bityutskiy wrote:
> we think we've fixed this bug. The reason was that when mounting RO,
> then re-mounting read-write, the orphan sub-system was not properly
> initialized. The fix is available at the ubifs-v2.6.28.git tree. I've
> back-ported all stable changes and fixes to the tree. You need this
> fix:
>
> commit b3718eedf4ab91307f64a79ad8eb911ca401bd22
> Author: Adrian Hunter <ext-adrian.hunter@nokia.com>
> Date: Mon Jan 26 10:55:40 2009 +0200
>
> UBIFS: ensure orphan area head is initialized
>
> When mounting read-only the orphan area head is
> not initialized. It must be initialized when
> remounting read/write, but it was not. This patch
> fixes that.
>
> [Artem: sorry, added comment tweaking noise]
> Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
>
> Please send us some feedback about whether your problem has been
> solved.
Hi Artem,
I've patched my kernel with ubifs-v2.6.28.git and it seems to be
working perfectly!
Thanks a lot to you and Adrian.
Best Regards,
-david
end of thread, other threads:[~2009-01-29 4:24 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-08 4:13 UBIFS volume corruption (bad node at LEB 0:0) David Bergeron
2009-01-08 6:46 ` Artem Bityutskiy
2009-01-16 15:34 ` David Bergeron
2009-01-19 8:56 ` Artem Bityutskiy
2009-01-20 3:44 ` David Bergeron
2009-01-20 5:40 ` David Bergeron
2009-01-20 9:01 ` Artem Bityutskiy
2009-01-20 21:47 ` David Bergeron
2009-01-14 16:17 ` Artem Bityutskiy
2009-01-16 21:23 ` David Bergeron
2009-01-26 9:18 ` Artem Bityutskiy
2009-01-28 2:20 ` Kyungmin Park
2009-01-28 7:51 ` Artem Bityutskiy
2009-01-28 8:24 ` Adrian Hunter
2009-01-29 4:24 ` Kyungmin Park
2009-01-28 16:31 ` David Bergeron