public inbox for linux-btrfs@vger.kernel.org
* assertion failures
@ 2010-02-24 13:45 Bill Pemberton
  2010-02-25  0:40 ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Pemberton @ 2010-02-24 13:45 UTC (permalink / raw)
  To: linux-btrfs

I've had 2 different filesystems fail in the same way recently.  In
both cases the server crashed (probably due to something unrelated to
btrfs).  On reboot the fsck fails as follows:

#  btrfsck /dev/raidvg/vol4
parent transid verify failed on 20971520 wanted 206856 found 214247
parent transid verify failed on 20971520 wanted 206856 found 214247
parent transid verify failed on 20971520 wanted 206856 found 214247
btrfsck: disk-io.c:723: open_ctree_fd: Assertion `!(!chunk_root->node)' failed.
Aborted (core dumped)


I tried compiling the latest tools from git, but the error is the
same.  This is on a Fedora 12 machine running kernel
2.6.31.12-174.2.22.fc12.x86_64.


-- 
Bill


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-24 13:45 assertion failures Bill Pemberton
@ 2010-02-25  0:40 ` Chris Mason
  2010-02-25 14:04   ` Bill Pemberton
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-25  0:40 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: linux-btrfs

On Wed, Feb 24, 2010 at 08:45:32AM -0500, Bill Pemberton wrote:
> I've had 2 different filesystems fail in the same way recently.  In
> both cases the server crashed (probably due to something unrelated to
> btrfs).  On reboot the fsck fails as follows:
> 
> #  btrfsck /dev/raidvg/vol4
> parent transid verify failed on 20971520 wanted 206856 found 214247

I don't suppose you have the dmesg errors from the crash?  This error
shows the header in the block is incorrect, so either something was
written to the wrong place or not written at all.

Have you run memtest86 on this system?

How did it crash...was a power off used to reset the machine?

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-25  0:40 ` Chris Mason
@ 2010-02-25 14:04   ` Bill Pemberton
  2010-02-25 18:28     ` Gustavo Alves
  2010-02-26 16:17     ` Chris Mason
  0 siblings, 2 replies; 24+ messages in thread
From: Bill Pemberton @ 2010-02-25 14:04 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

> 
> I don't suppose you have the dmesg errors from the crash?  This error
> shows the header in the block is incorrect, so either something was
> written to the wrong place or not written at all.
> 
> Have you run memtest86 on this system?
> 
> How did it crash...was a power off used to reset the machine?
> 

No dmesg.  This has happened on two different machines that both have
other active btrfs filesystems, so I suspect it's not a memory issue.
In both cases it was the same data that was being copied when the
crash occurred.

I didn't deal with the reboot in the first case, so I don't have much
in the way of details.  In the second case the kernel seemed convinced
the array was having problems (and the load went way up), but the
array was convinced it was fine.  A normal reboot hung and the server
had to be powered off.

Since it appears that the same operation caused the problem in both
cases, I'm going to try to reproduce it.  I'll let you know if I can
reproduce it.

-- 
Bill



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-25 14:04   ` Bill Pemberton
@ 2010-02-25 18:28     ` Gustavo Alves
  2010-02-26 16:13       ` Chris Mason
  2010-02-26 16:17     ` Chris Mason
  1 sibling, 1 reply; 24+ messages in thread
From: Gustavo Alves @ 2010-02-25 18:28 UTC (permalink / raw)
  To: linux-btrfs

I've got the same error before in a similar situation (24 partitions,
only two with problems). Unfortunately I erased all data after this
error. Strangely, all I did was shut down and power on the
machine.

----
Gustavo Junior Alves


On Thu, Feb 25, 2010 at 11:04 AM, Bill Pemberton
<wfp5p@viridian.itc.virginia.edu> wrote:
>
> >
> > I don't suppose you have the dmesg errors from the crash?  This error
> > shows the header in the block is incorrect, so either something was
> > written to the wrong place or not written at all.
> >
> > Have you run memtest86 on this system?
> >
> > How did it crash...was a power off used to reset the machine?
> >
>
> No dmesg.  This has happened on two different machines that both have
> other active btrfs filesystems, so I suspect it's not a memory issue.
> In both cases it was the same data that was being copied when the
> crash occurred.
>
> I didn't deal with the reboot in the first case, so I don't have much
> in the way of details.  In the second case the kernel seemed convinced
> the array was having problems (and the load went way up), but the
> array was convinced it was fine.  A normal reboot hung and the server
> had to be powered off.
>
> Since it appears that the same operation caused the problem in both
> cases, I'm going to try to reproduce it.  I'll let you know if I can
> reproduce it.
>
> --
> Bill
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-25 18:28     ` Gustavo Alves
@ 2010-02-26 16:13       ` Chris Mason
  2010-02-26 16:15         ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 16:13 UTC (permalink / raw)
  To: Gustavo Alves; +Cc: linux-btrfs

On Thu, Feb 25, 2010 at 03:28:19PM -0300, Gustavo Alves wrote:
> I've got the same error before in a similar situation (24 partitions,
> only two with problems). Unfortunately I erased all data after this
> error. Strangely, all I did was shut down and power on the
> machine.

Basically it looks like the tree of data checksums isn't right.  Which
kernels were you running when you had these problems?

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 16:13       ` Chris Mason
@ 2010-02-26 16:15         ` Chris Mason
  2010-02-26 19:57           ` Gustavo Alves
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 16:15 UTC (permalink / raw)
  To: Gustavo Alves, linux-btrfs

On Fri, Feb 26, 2010 at 11:13:32AM -0500, Chris Mason wrote:
> On Thu, Feb 25, 2010 at 03:28:19PM -0300, Gustavo Alves wrote:
> > I've got the same error before in a similar situation (24 partitions,
> > only two with problems). Unfortunately I erased all data after this
> > error. Strangely, all I did was shut down and power on the
> > machine.
> 
> Basically it looks like the tree of data checksums isn't right.  Which
> kernels were you running when you had these problems?

Sorry, I mixed up this corruption with one farther down.  The same
question stands, though: this error generally means that IO either didn't
happen or happened in the wrong place.

So, the more details you can give about your config the easier it will
be to nail it down.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-25 14:04   ` Bill Pemberton
  2010-02-25 18:28     ` Gustavo Alves
@ 2010-02-26 16:17     ` Chris Mason
  2010-02-26 16:41       ` Bill Pemberton
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 16:17 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: linux-btrfs

On Thu, Feb 25, 2010 at 09:04:20AM -0500, Bill Pemberton wrote:
> > 
> > I don't suppose you have the dmesg errors from the crash?  This error
> > shows the header in the block is incorrect, so either something was
> > written to the wrong place or not written at all.
> > 
> > Have you run memtest86 on this system?
> > 
> > How did it crash...was a power off used to reset the machine?
> > 
> 
> No dmesg.  This has happened on two different machines that both have
> other active btrfs filesystems, so I suspect it's not a memory issue.
> In both cases it was the same data that was being copied when the
> crash occurred.

Ok, is there anything special about this data?

> 
> I didn't deal with the reboot in the first case, so I don't have much
> in the way of details.  In the second case the kernel seemed convinced
> the array was having problems (and the load went way up), but the
> array was convinced it was fine.  A normal reboot hung and the server
> had to be powered off.
> 
> Since it appears that the same operation caused the problem in both
> cases, I'm going to try to reproduce it.  I'll let you know if I can
> reproduce it.

What kind of array is this?  It really sounds like the IO isn't
happening properly.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 16:17     ` Chris Mason
@ 2010-02-26 16:41       ` Bill Pemberton
  2010-02-26 17:59         ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Pemberton @ 2010-02-26 16:41 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

> > 
> > No dmesg.  This has happened on two different machines that both have
> > other active btrfs filesystems, so I suspect it's not a memory issue.
> > In both cases it was the same data that was being copied when the
> > crash occurred.
> 
> Ok, is there anything special about this data?
> 

There shouldn't be; it's the various backup files from a few
machines.  The only odd thing I've seen in the data is that there is
at least 1 file with some less-than-normally-used characters in the
name.

Rsyncing it from an ext4 fs to a btrfs filesystem on the same array
didn't cause the problem.  The original user was doing the rsync
remotely, so I don't know the exact options he was using.


> 
> What kind of array is this?  It really sounds like the IO isn't
> happening properly.
> 

Yeah, I'd be keen on blaming the arrays and/or machines if it weren't
for the fact that we have other btrfs filesystems on these machines
that hum along fine.

The array is from RAIDKing.  It's SCSI attached using SATA disks.  I
can get the exact model number if it'll help.  The SCSI card is an LSI
53c1030 (again, let me know if you need exact make/model).

-- 
Bill



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 16:41       ` Bill Pemberton
@ 2010-02-26 17:59         ` Chris Mason
  2010-02-26 18:11           ` Bill Pemberton
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 17:59 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: linux-btrfs

On Fri, Feb 26, 2010 at 11:41:51AM -0500, Bill Pemberton wrote:
> > > 
> > > No dmesg.  This has happened on two different machines that both have
> > > other active btrfs filesystems, so I suspect it's not a memory issue.
> > > In both cases it was the same data that was being copied when the
> > > crash occurred.
> > 
> > Ok, is there anything special about this data?
> > 
> 
> There shouldn't be; it's the various backup files from a few
> machines.  The only odd thing I've seen in the data is that there is
> at least 1 file with some less-than-normally-used characters in the
> name.
> 
> Rsyncing it from an ext4 fs to a btrfs filesystem on the same array
> didn't cause the problem.  The original user was doing the rsync
> remotely, so I don't know the exact options he was using.

Ok, rsync doesn't do anything especially scary.

> 
> 
> > 
> > What kind of array is this?  It really sounds like the IO isn't
> > happening properly.
> > 
> 
> Yeah, I'd be keen on blaming the arrays and/or machines if it weren't
> for the fact that we have other btrfs filesystems on these machines
> that hum along fine.
> 
> The array is from RAIDKing.  It's SCSI attached using SATA disks.  I
> can get the exact model number if it'll help.  The SCSI card is an LSI
> 53c1030 (again, let me know if you need exact make/model).

Does the array have any kind of writeback cache?

Are all of the filesystems spread across all of the drives?  Or do some
filesystems use some drives only?

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 17:59         ` Chris Mason
@ 2010-02-26 18:11           ` Bill Pemberton
  2010-02-26 19:09             ` Chris Mason
  2010-02-26 19:11             ` Mike Fedyk
  0 siblings, 2 replies; 24+ messages in thread
From: Bill Pemberton @ 2010-02-26 18:11 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

> 
> Does the array have any kind of writeback cache?
> 

Yes, the array has a writeback cache.

>
> Are all of the filesystems spread across all of the drives?  Or do some
> filesystems use some drives only?
> 

In all cases the array is presenting 1 physical volume to the host
system (which is RAID 6 on the array itself).  That physical volume is
made into a volume group and the filesystems are on logical volumes in
that volume group.

-- 
Bill



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 18:11           ` Bill Pemberton
@ 2010-02-26 19:09             ` Chris Mason
  2010-02-26 20:43               ` Bill Pemberton
  2010-02-26 20:49               ` Diego Calleja
  2010-02-26 19:11             ` Mike Fedyk
  1 sibling, 2 replies; 24+ messages in thread
From: Chris Mason @ 2010-02-26 19:09 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: linux-btrfs

On Fri, Feb 26, 2010 at 01:11:57PM -0500, Bill Pemberton wrote:
> > 
> > Does the array have any kind of writeback cache?
> > 
> 
> Yes, the array has a writeback cache.

Ok, this would be my top suspect then, especially if it had to be
powered off to reset it.  The errors you sent look like some IO just
didn't happen, which the btrfs code goes to great lengths to
detect and complain about.

Going back to the errors:

parent transid verify failed on 20971520 wanted 206856 found 214247
parent transid verify failed on 20971520 wanted 206856 found 214247
parent transid verify failed on 20971520 wanted 206856 found 214247
btrfsck: disk-io.c:723: open_ctree_fd: Assertion `!(!chunk_root->node)' failed.
Aborted (core dumped)

You're actually hitting this very early in the mount.  We read in the
super block and then we read all the tree roots it points to.  Each
pointer includes the generation number it expects to find.

The generation number is similar to a version counter.  Each transaction
that updates that block increments the generation number.

So, the super block says: go read block number 20971520, and it is
supposed to be generation 206856.  Instead we find: 214247, which is
much newer.

The most likely cause of this is that a write to either the super block
or block 20971520 went to the writeback cache but never made it to the
drive.

My guess would be the super block; it is updated more often and so more
likely to get stuck in the array's cache.
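
For illustration only, here's a rough sketch of what that check amounts to.
This is not the real btrfs code (the actual check lives in disk-io.c and
works on extent buffers); the struct and names below are simplified
stand-ins, but the numbers are the ones from your btrfsck output:

/*
 * Simplified illustration -- not the real btrfs code.  A parent (here,
 * the super block) records the generation it expects a child block to
 * carry; if the block on disk carries a different generation, one of
 * the two writes never made it to the media, or went to the wrong place.
 */
#include <stdio.h>
#include <stdint.h>

struct tree_block {
        uint64_t bytenr;        /* logical address of the block */
        uint64_t generation;    /* transid stamped into the block's header */
};

static int verify_parent_transid(const struct tree_block *child,
                                 uint64_t parent_transid)
{
        if (child->generation == parent_transid)
                return 0;

        fprintf(stderr,
                "parent transid verify failed on %llu wanted %llu found %llu\n",
                (unsigned long long)child->bytenr,
                (unsigned long long)parent_transid,
                (unsigned long long)child->generation);
        return -1;
}

int main(void)
{
        /* the chunk root block as found on disk */
        struct tree_block chunk_root = { .bytenr = 20971520, .generation = 214247 };

        /* the super block claims this block should be at generation 206856 */
        return verify_parent_transid(&chunk_root, 206856) ? 1 : 0;
}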

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 18:11           ` Bill Pemberton
  2010-02-26 19:09             ` Chris Mason
@ 2010-02-26 19:11             ` Mike Fedyk
  2010-02-26 19:15               ` Chris Mason
  2010-02-26 20:44               ` Bill Pemberton
  1 sibling, 2 replies; 24+ messages in thread
From: Mike Fedyk @ 2010-02-26 19:11 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: Chris Mason, linux-btrfs

On Fri, Feb 26, 2010 at 10:11 AM, Bill Pemberton
<wfp5p@viridian.itc.virginia.edu> wrote:
>>
>> Does the array have any kind of writeback cache?
>>
>
> Yes, the array has a writeback cache.
>
>>
>> Are all of the filesystems spread across all of the drives?  Or do some
>> filesystems use some drives only?
>>
>
> In all cases the array is presenting 1 physical volume to the host
> system (which is RAID 6 on the array itself).  That physical volume is
> made into a volume group and the filesystems are on logical volumes in
> that volume group.
>

I wonder if the barrier messages are making it to this write back
cache.  Do you see any messages about barriers in your kernel logs?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:11             ` Mike Fedyk
@ 2010-02-26 19:15               ` Chris Mason
  2010-02-26 20:45                 ` Bill Pemberton
  2010-02-26 20:44               ` Bill Pemberton
  1 sibling, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 19:15 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Bill Pemberton, linux-btrfs

On Fri, Feb 26, 2010 at 11:11:07AM -0800, Mike Fedyk wrote:
> On Fri, Feb 26, 2010 at 10:11 AM, Bill Pemberton
> <wfp5p@viridian.itc.virginia.edu> wrote:
> >>
> >> Does the array have any kind of writeback cache?
> >>
> >
> > Yes, the array has a writeback cache.
> >
> >>
> >> Are all of the filesystems spread across all of the drives?  Or do some
> >> filesystems use some drives only?
> >>
> >
> > In all cases the array is presenting 1 physical volume to the host
> > system (which is RAID 6 on the array itself).  That physical volume is
> > made into a volume group and the filesystems are on logical volumes in
> > that volume group.
> >
>
> I wonder if the barrier messages are making it to this write back
> cache.  Do you see any messages about barriers in your kernel logs?

Most drives with writeback caches will honor the barriers.  Most arrays
with writeback caches will ignore them.  Usually they also have their
own battery backup, which should be safe enough to continue using the
writeback cache.

But, that depends on the array.

Bill, I've got a great little application that you can use to test the
safety of the array against power failures.  You'll have to pull the
plug on the poor machine about 10 times to be sure, just let me know if
you're interested.

If the raid array works, the power failure test won't hurt any of the
existing filesystems.  If not, it's possible they will all get
corrupted, so I wouldn't blame you for not wanting to run it.

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 16:15         ` Chris Mason
@ 2010-02-26 19:57           ` Gustavo Alves
  2010-02-26 21:10             ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Gustavo Alves @ 2010-02-26 19:57 UTC (permalink / raw)
  To: Chris Mason, linux-btrfs

In my case, kernel 2.6.32-0.51.rc7.git2.fc13.i686.PAE and BTRFS under LVM2.

----
Gustavo Junior Alves
Specchio Soluções em TI


On Fri, Feb 26, 2010 at 1:15 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Fri, Feb 26, 2010 at 11:13:32AM -0500, Chris Mason wrote:
>> On Thu, Feb 25, 2010 at 03:28:19PM -0300, Gustavo Alves wrote:
>> > I've got the same error before in a similar situation (24 partitions,
>> > only two with problems). Unfortunately I erased all data after this
>> > error. Strangely, all I did was shut down and power on the
>> > machine.
>>
>> Basically it looks like the tree of data checksums isn't right.  Which
>> kernels were you running when you had these problems?
>
> Sorry, I mixed up this corruption with one farther down.  The same
> question stands, though: this error generally means that IO either didn't
> happen or happened in the wrong place.
>
> So, the more details you can give about your config the easier it will
> be to nail it down.
>
> -chris
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:09             ` Chris Mason
@ 2010-02-26 20:43               ` Bill Pemberton
  2010-02-26 20:49               ` Diego Calleja
  1 sibling, 0 replies; 24+ messages in thread
From: Bill Pemberton @ 2010-02-26 20:43 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

> > 
> > Yes, the array has a writeback cache.
> 
> Ok, this would be my top suspect then, especially if it had to be
> powered off to reset it.  The errors you sent look like some IO just
> didn't happen, which the btrfs code goes to great lengths to
> detect and complain about.
> 

While the arrays were powered off in both cases, it was well after the
original problem was observed, so I would have expected the array to
flush the cache by then.

In any event, the biggest issue is that I'm left with a totally
unrecoverable filesystem.  On these filesystems I can live with
"something strange happened and you lost some files", but in this case
it's "something strange happened you lost 1TB".

-- 
Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:11             ` Mike Fedyk
  2010-02-26 19:15               ` Chris Mason
@ 2010-02-26 20:44               ` Bill Pemberton
  1 sibling, 0 replies; 24+ messages in thread
From: Bill Pemberton @ 2010-02-26 20:44 UTC (permalink / raw)
  To: Mike Fedyk; +Cc: Chris Mason, linux-btrfs

> 
> I wonder if the barrier messages are making it to this write back
> cache.  Do you see any messages about barriers in your kernel logs?
> 

None relating to the array.  The only barrier messages I see are for
filesystems on the server's internal disks.


-- 
Bill


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:15               ` Chris Mason
@ 2010-02-26 20:45                 ` Bill Pemberton
  2010-02-26 20:53                   ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Bill Pemberton @ 2010-02-26 20:45 UTC (permalink / raw)
  To: Chris Mason; +Cc: Mike Fedyk, linux-btrfs

> 
> Bill, I've got a great little application that you can use to test the
> safety of the array against power failures.  You'll have to pull the
> plug on the poor machine about 10 times to be sure, just let me know if
> you're interested.
> 
> If the raid array works, the power failure test won't hurt any of the
> existing filesystems.  If not, it's possible they will all get
> corrupted, so I wouldn't blame you for not wanting to run it.
> 

I have one of the servers idled now, so I can abuse it any way you'd
like.

-- 
Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:09             ` Chris Mason
  2010-02-26 20:43               ` Bill Pemberton
@ 2010-02-26 20:49               ` Diego Calleja
  2010-02-26 21:08                 ` Chris Mason
  1 sibling, 1 reply; 24+ messages in thread
From: Diego Calleja @ 2010-02-26 20:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: Bill Pemberton, linux-btrfs

On Friday, 26 February 2010 20:09:15, Chris Mason wrote:
> My guess would be the super block; it is updated more often and so more
> likely to get stuck in the array's cache.

IIRC, this is exactly the same problem that ZFS users have been
hitting. Some users got cheap disks that don't honour barriers
correctly, so their uberblock didn't have the correct data.
They developed an app that tries to roll back transactions to
get the pool into a sane state...I guess that fsck will be able
to do that at some point?

Stupid question from someone who is not a fs dev...it's not possible
to solve this issue by doing some sort of "superblock journaling"?
Since there are several superblock copies you could:
 -Modify a secondary superblock copy to point to the tree root block
  that still has not been written to disk
 -Write whatever tree root block has been COW'ed
 -Modify the primary superblock

So in case of these failures, mount code could look in the secondary
superblock copy before failing. Since barriers are not being honoured,
there's still a possibility that the tree root blocks would be written
before the secondary superblock block that was submitted before, but
that problem would be much harder to hit I guess. But maybe the fs code
cannot know where the tree root blocks are going to be written before
writing them, and hence it can't generate a valid superblock?

Sorry if all this makes no sense at all; I'm just wondering if there's
a way to solve these drive issues without any kind of recovery tools.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 20:45                 ` Bill Pemberton
@ 2010-02-26 20:53                   ` Chris Mason
  2010-02-27 22:56                     ` Bill Pemberton
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 20:53 UTC (permalink / raw)
  To: Bill Pemberton; +Cc: Mike Fedyk, linux-btrfs

On Fri, Feb 26, 2010 at 03:45:34PM -0500, Bill Pemberton wrote:
> > 
> > Bill, I've got a great little application that you can use to test the
> > safety of the array against power failures.  You'll have to pull the
> > plug on the poor machine about 10 times to be sure, just let me know if
> > you're interested.
> > 
> > If the raid array works, the power failure test won't hurt any of the
> > existing filesystems.  If not, it's possible they will all get
> > corrupted, so I wouldn't blame you for not wanting to run it.
> > 
> 
> I have one of the servers idled now, so I can abuse it any way you'd
> like.

http://oss.oracle.com/~mason/barrier-test

I'd run this on an ext3 partition on your raid array.  You can use btrfs
too, but ext3 will give us some third-party verification.

mount ext3 with mount -o barrier=1

Then run barrier-test -p <70% of your system ram in MB> -s 128 -d <path to ext3>

It will print some status:

Memory pin ready
fsyncs ready
Renames ready

Once you see all three ready lines, turn off power.  Don't use the
power button on the front of the machine; either pull the plug, use
the power switch on the back of the machine, or use an external controller.

When the machine comes back, run fsck -f on the ext3 partition.  If you get
errors, things have gone horribly wrong.

Ripping the power out isn't nice; if this is a production system, please
make sure you have backups of all the partitions.  I suggest running
sync a few times before running the application.

If the write cache isn't working, you'll get errors about 50% of the
time.  If you run it 10 times without any errors you're probably safe.
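
If you're curious what the test is actually exercising, here's a rough
standalone sketch of the idea (illustrative only -- this is not the
barrier-test program itself): write a counter to a temp file, fsync it,
rename it into place, repeat.  After the power cut, whatever the counter
file holds must be a value that was really fsynced; garbage there means
the cache acknowledged data it never made durable.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : ".";
        char tmp[4096], final[4096];
        unsigned long counter = 0;

        snprintf(tmp, sizeof(tmp), "%s/counter.tmp", dir);
        snprintf(final, sizeof(final), "%s/counter", dir);

        for (;;) {
                FILE *f = fopen(tmp, "w");
                if (!f) { perror("fopen"); return 1; }
                fprintf(f, "%lu\n", counter);
                fflush(f);
                if (fsync(fileno(f))) { perror("fsync"); return 1; }
                fclose(f);

                /* data was fsynced before the rename, so the file named
                 * "counter" should only ever hold a value that reached
                 * stable storage -- unless the cache lied about the flush */
                if (rename(tmp, final)) { perror("rename"); return 1; }

                counter++;
                if (counter % 1000 == 0)
                        printf("%lu updates fsynced, pull the plug whenever\n", counter);
        }
}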

-chris


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 20:49               ` Diego Calleja
@ 2010-02-26 21:08                 ` Chris Mason
  2010-02-28  3:05                   ` Cláudio Martins
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 21:08 UTC (permalink / raw)
  To: Diego Calleja; +Cc: Bill Pemberton, linux-btrfs

On Fri, Feb 26, 2010 at 09:49:14PM +0100, Diego Calleja wrote:
> On Friday, 26 February 2010 20:09:15, Chris Mason wrote:
> > My guess would be the super block; it is updated more often and so more
> > likely to get stuck in the array's cache.
>=20
> IIRC, this is exactly the same problem that ZFS users have been
> hitting. Some users got cheap disks that don't honour barriers
> correctly, so their uberblock didn't have the correct data.

This isn't new, XFS and reiserfs v3 have had problems as well.  But,
this is just my first suspect, Bill might be hitting something entirely
different.

> They developed an app that tries to roll back transactions to
> get the pool into a sane state...I guess that fsck will be able
> to do that at some point?

Yes, this is something that fsck will need to fix.  This corruption is
hardest because it involves the tree that maps all the other trees
(ugh).

The ioctl I'm working on for snapshot/subvol listing will make it easier
to create a program to back up the chunk tree externally.

>
> Stupid question from someone who is not a fs dev...it's not possible
> to solve this issue by doing some sort of "superblock journaling"?
> Since there are several superblock copies you could:
>  -Modify a secondary superblock copy to point to the tree root block
>   that still has not been written to disk
>  -Write whatever tree root block has been COW'ed
>  -Modify the primary superblock
>
> So in case of these failures, mount code could look in the secondary
> superblock copy before failing. Since barriers are not being honoured,
> there's still a possibility that the tree root blocks would be written
> before the secondary superblock block that was submitted before, but
> that problem would be much harder to hit I guess. But maybe the fs code
> cannot know where the tree root blocks are going to be written before
> writing them, and hence it can't generate a valid superblock?
>
> Sorry if all this makes no sense at all; I'm just wondering if there's
> a way to solve these drive issues without any kind of recovery tools.

The problem is that with a writeback cache, any write is likely
to be missed on power failures.  journalling in general requires some
notion of being able to wait for block A to be on disk before you write
block B, and that's difficult to do when the disk lies about what is
really there ;)
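
To make that ordering requirement concrete, here's a tiny userspace
illustration (nothing btrfs-specific): write A, make sure A is durable,
only then write B.  fsync() is how an application asks for it; inside
the kernel a filesystem uses barriers/cache flushes for the same thing,
and a cache that acknowledges writes it hasn't made durable defeats both:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char block_a[4096], block_b[4096];
        int fd = open("ordered-blocks.img", O_RDWR | O_CREAT, 0644);

        if (fd < 0) { perror("open"); return 1; }
        memset(block_a, 'A', sizeof(block_a));  /* say, a new tree root */
        memset(block_b, 'B', sizeof(block_b));  /* say, a super block pointing at it */

        /* step 1: write block A */
        if (pwrite(fd, block_a, sizeof(block_a), 4096) != sizeof(block_a)) {
                perror("pwrite A"); return 1;
        }
        /* step 2: wait until A is really on stable storage */
        if (fsync(fd)) { perror("fsync A"); return 1; }

        /* step 3: only now write block B, which refers to A */
        if (pwrite(fd, block_b, sizeof(block_b), 0) != sizeof(block_b)) {
                perror("pwrite B"); return 1;
        }
        if (fsync(fd)) { perror("fsync B"); return 1; }

        close(fd);
        return 0;
}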

To make things especially difficult, you can't really just roll back to
an older state.  Internally the filesystem does something like this:

allocate a bunch of blocks
free a bunch of blocks

commit

reuse blocks that were freed

Basically once that commit is on disk, we're allowed to (and likely to)
start writing over blocks that were freed in the earlier transaction.
If you try to roll back to the state at the start of that transaction
many of those blocks won't have the same data they did before.

Now, the size of the corruption might be smaller in the rolled back
transaction than in the main transaction or it might be much worse.

-chris

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 19:57           ` Gustavo Alves
@ 2010-02-26 21:10             ` Chris Mason
  2010-02-26 21:26               ` Gustavo Alves
  0 siblings, 1 reply; 24+ messages in thread
From: Chris Mason @ 2010-02-26 21:10 UTC (permalink / raw)
  To: Gustavo Alves; +Cc: linux-btrfs

On Fri, Feb 26, 2010 at 04:57:27PM -0300, Gustavo Alves wrote:
> In my case, kernel 2.6.32-0.51.rc7.git2.fc13.i686.PAE and BTRFS under LVM2.
>

Did you also have power-off based reboots?  Depending on the
configuration, LVM (anything other than a single drive) won't send barriers to
the device.

-chris

> ----
> Gustavo Junior Alves
> Specchio Soluções em TI
>
>
> On Fri, Feb 26, 2010 at 1:15 PM, Chris Mason <chris.mason@oracle.com> wrote:
> > On Fri, Feb 26, 2010 at 11:13:32AM -0500, Chris Mason wrote:
> >> On Thu, Feb 25, 2010 at 03:28:19PM -0300, Gustavo Alves wrote:
> >> > I've got the same error before in a similar situation (24 partitions,
> >> > only two with problems). Unfortunately I erased all data after this
> >> > error. Strangely, all I did was shut down and power on the
> >> > machine.
> >>
> >> Basically it looks like the tree of data checksums isn't right.  Which
> >> kernels were you running when you had these problems?
> >
> > Sorry, I mixed up this corruption with one farther down.  The same
> > question stands, though: this error generally means that IO either didn't
> > happen or happened in the wrong place.
> >
> > So, the more details you can give about your config the easier it will
> > be to nail it down.
> >
> > -chris
> >
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 21:10             ` Chris Mason
@ 2010-02-26 21:26               ` Gustavo Alves
  0 siblings, 0 replies; 24+ messages in thread
From: Gustavo Alves @ 2010-02-26 21:26 UTC (permalink / raw)
  To: Chris Mason, Gustavo Alves, linux-btrfs

On the tragic day I ran a "halt" command, but, as usual, it never
finished, as btrfs froze before umount. I waited almost 5 minutes and
then pressed the power button.

The structure of the machine was very similar to the one listed below.

 --- Physical volume ---
  PV Name               /dev/sda
  VG Name               specchio
  PV Size               465.76 GiB / not usable 12.02 MiB
  Allocatable           yes (but full)
  PE Size               128.00 MiB
  Total PE              3726
  Free PE               0
  Allocated PE          3726
  PV UUID               YpNYXY-AI1i-1D8D-9BAa-1DeY-hWNL-ZOtBIh

  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               specchio
  PV Size               931.51 GiB / not usable 11.19 MiB
  Allocatable           yes
  PE Size               128.00 MiB
  Total PE              7452
  Free PE               24
  Allocated PE          7428
  PV UUID               G1mWRL-I7zx-bwbi-zYe8-tc2q-DrtD-tH6jIi

  --- Physical volume ---
  PV Name               /dev/sdd
  VG Name               specchio
  PV Size               931.51 GiB / not usable 13.71 MiB
  Allocatable           yes (but full)
  PE Size               128.00 MiB
  Total PE              7452
  Free PE               0
  Allocated PE          7452
  PV UUID               TWCEea-UURq-8HbM-NgkC-2J3Y-1qYE-TN7vud

  --- Physical volume ---
  PV Name               /dev/sdc
  VG Name               specchio
  PV Size               1.36 TiB / not usable 15.40 MiB
  Allocatable           yes
  PE Size               128.00 MiB
  Total PE              11178
  Free PE               11024
  Allocated PE          154
  PV UUID               hUbv4T-2Q0i-plGb-EO6a-2vAo-uHwj-Vda8zz

--- Volume group ---
  VG Name               specchio
  System ID
  Format                lvm2
  Metadata Areas        4
  Metadata Sequence No  37
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                15
  Open LV               15
  Max PV                0
  Cur PV                4
  Act PV                4
  VG Size               3.64 TiB
  PE Size               128.00 MiB
  Total PE              29808
  Alloc PE / Size       18760 / 2.29 TiB
  Free  PE / Size       11048 / 1.35 TiB
  VG UUID               AKxGo3-MVTJ-7XdC-GeTW-23e8-5Ee2-HY2PZ0

----
Gustavo Junior Alves

On Fri, Feb 26, 2010 at 6:10 PM, Chris Mason <chris.mason@oracle.com> wrote:
> On Fri, Feb 26, 2010 at 04:57:27PM -0300, Gustavo Alves wrote:
>> In my case, kernel 2.6.32-0.51.rc7.git2.fc13.i686.PAE and BTRFS under LVM2.
>>
>
> Did you also have power-off based reboots?  Depending on the
> configuration, LVM (anything other than a single drive) won't send barriers to
> the device.
>
> -chris
>
>> ----
>> Gustavo Junior Alves
>> Specchio Soluções em TI
>>
>>
>> On Fri, Feb 26, 2010 at 1:15 PM, Chris Mason <chris.mason@oracle.com> wrote:
>> > On Fri, Feb 26, 2010 at 11:13:32AM -0500, Chris Mason wrote:
>> >> On Thu, Feb 25, 2010 at 03:28:19PM -0300, Gustavo Alves wrote:
>> >> > I've got the same error before in a similar situation (24 partitions,
>> >> > only two with problems). Unfortunately I erased all data after this
>> >> > error. Strangely, all I did was shut down and power on the
>> >> > machine.
>> >>
>> >> Basically it looks like the tree of data checksums isn't right.  Which
>> >> kernels were you running when you had these problems?
>> >
>> > Sorry, I mixed up this corruption with one farther down.  The same
>> > question stands, though: this error generally means that IO either didn't
>> > happen or happened in the wrong place.
>> >
>> > So, the more details you can give about your config the easier it will
>> > be to nail it down.
>> >
>> > -chris
>> >
>> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 20:53                   ` Chris Mason
@ 2010-02-27 22:56                     ` Bill Pemberton
  0 siblings, 0 replies; 24+ messages in thread
From: Bill Pemberton @ 2010-02-27 22:56 UTC (permalink / raw)
  To: Chris Mason; +Cc: Mike Fedyk, linux-btrfs

> 
> If the write cache isn't working, you'll get errors about 50% of the
> time.  If you run it 10 times without any errors you're probably safe.
> 

Ok, I managed 12 times with no errors, so there's at least another
data point.....

-- 
Bill

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: assertion failures
  2010-02-26 21:08                 ` Chris Mason
@ 2010-02-28  3:05                   ` Cláudio Martins
  0 siblings, 0 replies; 24+ messages in thread
From: Cláudio Martins @ 2010-02-28  3:05 UTC (permalink / raw)
  To: Chris Mason; +Cc: Diego Calleja, Bill Pemberton, linux-btrfs


On Fri, 26 Feb 2010 16:08:53 -0500 Chris Mason <chris.mason@oracle.com> wrote:
>
> The problem is that with a writeback cache, any write is likely
> to be missed on power failures.  journalling in general requires some
> notion of being able to wait for block A to be on disk before you write
> block B, and that's difficult to do when the disk lies about what is
> really there ;)
>

 Hi,

 Has anyone managed to make a list of drive models which are suspected of
lying about what has really hit the disk surface, or of not honouring
barriers?

 Can anyone share any cases where your investigation turned up positive
evidence that a drive model was defective in this respect?

 This is so worrying that I think we should understand why manufacturers
are doing this: is it firmware bugs? Is it to inflate benchmarks? Are they
shoving crap into the cheaper drives' firmware to force "enterprise" people
to buy the more expensive SAS models?

 If one could publish a list of known defective drives, maybe we could
put some shame on the manufacturers or even convince them to fix their
firmware. Failing that, we might at least be able to avoid the known
bad models.

 Any insight into this subject will be appreciated.

Best regards

Cláudio


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2010-02-28  3:05 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-02-24 13:45 assertion failures Bill Pemberton
2010-02-25  0:40 ` Chris Mason
2010-02-25 14:04   ` Bill Pemberton
2010-02-25 18:28     ` Gustavo Alves
2010-02-26 16:13       ` Chris Mason
2010-02-26 16:15         ` Chris Mason
2010-02-26 19:57           ` Gustavo Alves
2010-02-26 21:10             ` Chris Mason
2010-02-26 21:26               ` Gustavo Alves
2010-02-26 16:17     ` Chris Mason
2010-02-26 16:41       ` Bill Pemberton
2010-02-26 17:59         ` Chris Mason
2010-02-26 18:11           ` Bill Pemberton
2010-02-26 19:09             ` Chris Mason
2010-02-26 20:43               ` Bill Pemberton
2010-02-26 20:49               ` Diego Calleja
2010-02-26 21:08                 ` Chris Mason
2010-02-28  3:05                   ` Cláudio Martins
2010-02-26 19:11             ` Mike Fedyk
2010-02-26 19:15               ` Chris Mason
2010-02-26 20:45                 ` Bill Pemberton
2010-02-26 20:53                   ` Chris Mason
2010-02-27 22:56                     ` Bill Pemberton
2010-02-26 20:44               ` Bill Pemberton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox