Re: [dm-devel] can't recover ext4 on lvm from ext4_mb_generate_buddy:739: group 1687, 32254 clusters in bitmap, 32258 in gd

linux-ext4.vger.kernel.org archive mirror

From: Mikulas Patocka @ 2012-01-06 16:40 UTC (permalink / raw)
  To: device-mapper development; +Cc: Sander Eikelenboom, linux-ext4, linux-kernel



On Thu, 5 Jan 2012, Ted Ts'o wrote:

> On Thu, Jan 05, 2012 at 05:14:28PM +0100, Sander Eikelenboom wrote:
> > 
> > OK spoke too soon, i have been able to trigger it again:
> > - copying files from LV to the same LV without the snapshot went OK
> > - copying from the RO snapshot of a LV to the same LV gave the error while copying the file again:
> 
> OK.  Originally, you said you did this:
> 
> 1) fsck -v -p -f the filesystem
> 2) mount the filesystem
> 3) Try to copy a file
> 4) filesystem will be mounted RO on error  (see below)
> 5) fsck again, journal will be recovered, no other errors
> 6) start at 1)
> 
> Was this with a read-only snapshot always being in existence
> through all of these five steps?  When was the RO snapshot created?
> 
> If a RO snapshot has to be there in order for this to happen, then
> this is almost certainly a device-mapper regression.  (dm-devel folks,

The existence of a snapshot changes I/O completion times significantly, so 
it may be a race condition in ext4 that gets triggered by the changed 
timings.

Mikulas

> this is a problem which apparently occurred when the user went from
> v3.1.5 to v3.2, so this looks like a 3.2 regression.)
> 
> 						- Ted

From: WIMPy @ 2012-01-28  4:53 UTC (permalink / raw)
  To: linux-ext4

Mikulas Patocka <mpatocka <at> redhat.com> writes:

> The existence of a snapshot changes I/O completion times significantly, so 
> it may be a race condition in ext4 that gets triggered by the changed 
> timings.

The idea that timing might cause issues on a FS is disturbing.

> > this is a problem which apparently occurred when the user went from
> > v3.1.5 to v3.2, so this looks like a 3.2 regression.)

I am on 3.2.0 as well.

It happened for me on a freshly created FS, made with
"mke2fs -j -O sparse_super -O dir_index -O extents -O filetype -O uninit_bg".
It was mounted with no additional options when I first got an
"EXT4-fs error (device md127): ext4_mb_generate_buddy:739: group 28671, 32765 clusters in bitmap, 32766 in gd"
after writing about 3TB of data.
I do not have RO snapshots like the OP, but my md sits on top of LUKS containers. 
So we do have the device mapper in common.

Just for the record: unlike the contents, the hardware is not new and did not 
have any known issues.

  Greetings,
    WIMPy

From: WIMPy @ 2012-01-28  8:14 UTC (permalink / raw)
  To: linux-ext4

Update:

> > > this is a problem which apparently occurred when the user went from
> > > v3.1.5 to v3.2, so this looks like a 3.2 regression.)
> 
> I am on 3.2.0 as well.

I didn't spot anything obvious in the logs.
 
> It happened for me on a freshly created FS.
> "mke2fs -j -O sparse_super -O dir_index -O extents -O filetype -O uninit_bg"
> mounted with no additional options for the first time I got an
> "EXT4-fs error (device md127): ext4_mb_generate_buddy:739: group 28671, 32765 
> clusters in bitmap, 32766 in gd"
> after writing about 3TB of data.
> I do not have RO snapshots like the OP, but my md sits on top of LUKS
> containers.
> So we do have the device mapper in common.

After I did an fsck and tried to continue, I didn't get that far.
After another 200GB or so it happened again.
And now it's reproducible:
I can run fsck and then try to continue (using rsync). But as soon as writing 
starts, the process hangs for a long time. At least one minute, probably longer.
Then the ext4_mb_generate_buddy error comes again.

I upgraded e2fsprogs from 1.41.14 to 1.42 and the kernel to 3.2.2.
No difference.
That FS is unusable.


From: Andreas Dilger @ 2012-01-28  8:34 UTC (permalink / raw)
  To: WIMPy; +Cc: linux-ext4@vger.kernel.org

Could you please try to bisect the problem, if it is reproducible?

I was looking for a change which I thought might be responsible (removal of block bitmap initialization when inodes are first allocated from an uninitialized inode table) but I couldn't see it in the git log, so maybe that change has not landed yet.

I don't have any other ideas of which recent patches might be responsible at this point. 

Cheers, Andreas
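
For anyone unsure what bisecting involves: it is a binary search over the ordered history between a kernel known to be good and one known to be bad, which is what `git bisect` automates. Below is a minimal Python sketch of the idea only, assuming a linear history. The revision labels and the `is_bad` test are hypothetical stand-ins for building and booting each candidate kernel and rerunning the workload that triggers the error.

```python
from typing import Callable, Sequence


def first_bad(history: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first entry for which is_bad() is true.

    Assumes history[0] is known good, history[-1] is known bad, and that
    every entry after the first bad one is also bad.
    """
    good, bad = 0, len(history) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(history[mid]):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression is after mid
    return history[bad]


# Hypothetical linear history; "c5" is arbitrarily chosen as the culprit.
history = ["known-good"] + [f"c{i}" for i in range(1, 9)] + ["known-bad"]
tested = []


def is_bad(rev: str) -> bool:
    # Stand-in for "build this revision, boot it, run the failing workload".
    tested.append(rev)
    return history.index(rev) >= history.index("c5")


culprit = first_bad(history, is_bad)
print(culprit, "found after", len(tested), "tests")
```

Each step halves the remaining range, so the number of kernels to build grows only logarithmically with the number of commits. The real cost is the time a single test run takes when the failure is slow to appear.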


From: WIMPy @ 2012-01-28 15:31 UTC (permalink / raw)
  To: linux-ext4

Andreas Dilger <adilger <at> dilger.ca> writes:

> 
> Could you please try to bisect the problem, if it is reproducible?

If you or someone else has an idea how to do so, I will try to collect more 
information.

There is actually an important bit I forgot to mention in the last message: 
after I get the error and unmount the FS, I get lots of journal commit I/O 
errors, but no indication as to what fails or why.

> I was looking for a change which I thought might be responsible (removal of
> block bitmap initialization when inodes are first allocated from an
> uninitialized inode table) but I couldn't see it in the git log, so
> maybe that change has not landed yet.
> 
> I don't have any other ideas of which recent patches might be responsible at
> this point.

As there was a mention at the beginning that this may have happened after an 
upgrade from 3.1.5 to 3.2, I will build a 3.1.5 and see if that really makes a 
difference.


From: WIMPy @ 2012-01-28 21:04 UTC (permalink / raw)
  To: linux-ext4

... and another update:

> As there was a mention at the beginning that this may have happened after an 
> upgrade from 3.1.5 to 3.2, I will build a 3.1.5 and see if that really makes
> a difference.

Yes it does.
3.1.5 has been working for 4.5 hours now, continuing from the point where 3.2 
and 3.2.2 reproducibly barfed.
I see some changes to ext4 on January 9 and 10, but nothing thereafter, so I'm 
not sure if it's worth trying something like 3.3-rc1.
The bad thing is that 3.2 had been working for about 20 hours before the first 
error, so it's not a quick test.

  Greetings,
    WIMPy



From: WIMPy @ 2012-02-03  5:30 UTC (permalink / raw)
  To: linux-ext4

WIMPy <WIMPy <at> yeti.dk> writes:

> ... and another update:

I don't know what the cause is, but I think I've got the trigger.
Those errors appeared when using rsync on a directory containing a file that was 
being written to (extended) while the rsync was running. That seems to be a 
situation where rsync causes a lot of stress. It certainly takes a hell of a 
lot of time.
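
The access pattern described above (one process extending a file while another keeps copying the directory that contains it) can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses `shutil.copyfile` instead of rsync, works in temporary directories with a tiny file, and the file name `growing.bin` is hypothetical. It makes no claim to reproduce the ext4 error.

```python
import os
import shutil
import tempfile
import threading


def grow_file(path: str, chunks: int, chunk: bytes, done: threading.Event) -> None:
    """Append `chunks` pieces to `path`, syncing after each one."""
    with open(path, "ab") as f:
        for _ in range(chunks):
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
    done.set()  # set only after the file has been closed


def copy_while_growing(src_dir: str, dst_dir: str, chunks: int = 100) -> int:
    """Copy src_dir's files repeatedly while one of them is being extended.

    Returns the number of copy passes completed.
    """
    growing = os.path.join(src_dir, "growing.bin")  # hypothetical file name
    open(growing, "wb").close()
    done = threading.Event()
    writer = threading.Thread(
        target=grow_file, args=(growing, chunks, b"x" * 4096, done)
    )
    writer.start()
    passes = 0
    while True:
        finished = done.is_set()  # sampled before this pass starts
        for name in os.listdir(src_dir):
            shutil.copyfile(os.path.join(src_dir, name),
                            os.path.join(dst_dir, name))
        passes += 1
        if finished:  # this pass ran entirely after the writer ended
            break
    writer.join()
    return passes


with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    passes = copy_while_growing(src, dst)
    src_size = os.path.getsize(os.path.join(src, "growing.bin"))
    dst_size = os.path.getsize(os.path.join(dst, "growing.bin"))
    print(passes, src_size, dst_size)
```

To exercise a real filesystem, the two temporary directories would be replaced by paths on the affected mount, with far larger files.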

I suspect one of the ext4-related commits from Jan 9th/10th. From the log I 
guess they should still exist in 3.3-rc1. I'm currently testing that, but 
unfortunately that might take some time.

And a short repeat: I'm using an md, but no lvm.

  Greetings,
    WIMPy

From: Tony Hoyle @ 2012-03-19 23:06 UTC (permalink / raw)
  To: linux-ext4

I looked at the changelogs for 3.2.x and couldn't see anything that
obviously related to this issue - hence posting on this (slightly old)
thread, since I can't find any followup.  I've downgraded to 2.6.32
(last debian kernel available, since they don't seem to keep historical
kernels around) for now, which is running solidly.

WIMPy <wimpy <at> yeti.dk> writes:

> written to (extended) while the rsync was running, which seems to be 
> a situation, where rsync causes a lot of stress. It certainly takes a
> hell of a lot of time.

I get it when I'm writing large files over nfs - exactly the same
symptoms as mentioned elsewhere in the thread, followed by nfsd going
into D state and things generally going downhill from there.

Started when I upgraded to 3.1.0 and continued up to 3.2.0.  fsck shows
no errors on the disk, but the logs fill up with
ext4_mb_generate_buddy:739 errors anyway.

> And a short repeat: I'm using an md, but no lvm.
> 
Same setup here - md, but no lvm.  Another non-raid drive doesn't show
the same symptoms, if it's any help.

Tony

nb:  Some logs, FWIW.  As mentioned above, fsck says there are no errors
on the drive:

Mar 19 20:50:52 goliath kernel: [ 1721.686880] EXT4-fs error (device md0): ext4_mb_generate_buddy:739: group 21345, 32254 clusters in bitmap, 32258 in gd
Mar 19 20:50:52 goliath kernel: [ 1721.703397] JBD2: Spotted dirty metadata buffer (dev = md0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
Mar 19 20:51:38 goliath kernel: [ 1767.622399] EXT4-fs error (device md0): ext4_mb_generate_buddy:739: group 21346, 32254 clusters in bitmap, 32258 in gd
Mar 19 20:52:18 goliath kernel: [ 1808.268856] EXT4-fs error (device md0): ext4_mb_generate_buddy:739: group 21347, 32254 clusters in bitmap, 32258 in gd
Mar 19 20:53:29 goliath kernel: [ 1879.257332] EXT4-fs error (device md0): ext4_mb_generate_buddy:739: group 21348, 32254 clusters in bitmap, 32258 in gd
Mar 19 20:54:45 goliath kernel: [ 1955.083019] EXT4-fs error (device md0): ext4_mb_generate_buddy:739: group 21349, 32254 clusters in bitmap, 32258 in gd
..etc.  They don't vary much.  A few thousand of these in rapid succession.
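
With thousands of such lines, it may help to pull out the group number and the two counts mechanically. As far as I understand the message, it compares a free-cluster count derived from the block bitmap with the count stored in the group descriptor ("gd"). Below is a small Python sketch, assuming the message format is exactly as quoted in this thread; the helper name `parse_errors` is made up for illustration, and the sample text reuses lines from the reports above.

```python
import re

# Pattern for the kernel message as quoted in this thread.
ERR = re.compile(
    r"EXT4-fs error \(device (?P<dev>\w+)\): ext4_mb_generate_buddy:\d+: "
    r"group (?P<group>\d+), (?P<bitmap>\d+) clusters in bitmap, "
    r"(?P<gd>\d+) in gd"
)


def parse_errors(text: str):
    """Yield (device, group, bitmap_count, gd_count) for each match."""
    # Mail clients may wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    for m in ERR.finditer(flat):
        yield m["dev"], int(m["group"]), int(m["bitmap"]), int(m["gd"])


sample = """\
Mar 19 20:50:52 goliath kernel: [ 1721.686880] EXT4-fs error (device
md0): ext4_mb_generate_buddy:739: group 21345, 32254 clusters in bitmap,
32258 in gd
Mar 19 20:51:38 goliath kernel: [ 1767.622399] EXT4-fs error (device
md0): ext4_mb_generate_buddy:739: group 21346, 32254 clusters in bitmap,
32258 in gd
"EXT4-fs error (device md127): ext4_mb_generate_buddy:739: group 28671, 32765
clusters in bitmap, 32766 in gd"
"""

rows = list(parse_errors(sample))
for dev, group, bitmap, gd in rows:
    print(f"{dev} group {group}: gd - bitmap = {gd - bitmap}")
```

Run over a full log, this shows quickly whether the difference between the two counts is constant across groups, as it appears to be in the md0 excerpt.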
