“bio too big” regression and silent data corruption in 3.0

* “bio too big” regression and silent data corruption in 3.0
@ 2011-08-08  1:00 Alexandre Oliva
  2011-08-08 22:39 ` Alexandre Oliva
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Alexandre Oliva @ 2011-08-08  1:00 UTC (permalink / raw)
  To: linux-btrfs

tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
corrupts data in “meta-raid1/data-single” configurations on disks with
different max_hw_sectors, where 2.6.38 worked fine.

tl;dr side-issue: on-line removal of partitions holding “single” data
attempts to create raid0 (rather than single) block groups.  If it can't
get enough room for raid0 over all remaining disks, it fails, leaving
the available space incorrect (even underflowed).  If it succeeds, it
creates raid0 block groups and permanently (?) switches the FS to raid0.

I've been (more or less) happily using btrfs on various machines with
internal and external disks combined into raid1(m/d) and
raid1(m)/single(d) -o compress filesystems, using Freed-ora
2.6.38.8-libre.35.fc15.

Once I upgraded to 2.6.40(AKA 3.0)-libre.4.fc15 and created a ceph OSD
on one of those machines, I hit some I/O errors that turned out to be
related to writing out updates to the ceph journal to the external
USB-connected disk (an odd choice, considering the internal disk has
more I/O bandwidth, though much less space; it seems that 3.0 changed
the block group allocation heuristics to avoid filling up disks too
soon, I suppose, but that's another issue).  So far so good.  I could
split out the filesystem, or just refrain from using a journal, but at
least I knew I'd get hard errors should I keep on with the split
filesystem.

Except that I couldn't count on getting hard errors, as I learned the
hard way yesterday.  I decided to shuffle some data around on an old
server with several internal SATA and PATA disks, plus one larger
external USB disk I decided to install on that server to give me enough
room for the shuffling.  That was an unfortunate decision of mine for a
few reasons:

1. Copying (rsync) the first few hundred GBs of data from one
internal-only (fast) filesystem to the internal/external filesystem was
very fast, which was not unexpected given that I thought it was copying
to the internal disk.  But it wasn't: it ended up choosing the larger
external disk for most writes, and *discarding* nearly all of the big
writes with no more than “bio too big” warnings logged to dmesg, noticed
only after the fact.  No hard errors, just (nearly)-silent data
corruption, detected by data checksums that didn't match when trying to
use the newly-created copy.  Oops ;-) That's Bad (TM)

A bit of investigation showed that max_hw_sectors for the USB disk was
120, much lower than the internal SATA and PATA disks.  Unfortunately,
by just looking at the code in fs/btrfs, I couldn't tell how a bio that
exceeds max_hw_sectors size could possibly be created, but it was the
first time I even looked at the btrfs kernel code, or any in-kernel
filesystem code, so it doesn't surprise me that I couldn't figure it out
on my own ;-) Anyway, I couldn't see changes between 2.6.38 and 3.0 that
might be related to that either, so I'm at a loss as to how this
extremely serious regression might have come about.
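
For scale: assuming max_hw_sectors counts 512-byte sectors, as block-layer
limits usually do, a limit of 120 caps a single request at 60 KiB.  The
sketch below only illustrates that arithmetic; the function names are made
up for this example and are not kernel or btrfs interfaces.

```python
SECTOR_SIZE = 512  # assumption: the limit is expressed in 512-byte sectors


def max_request_bytes(max_hw_sectors):
    """Largest single request, in bytes, that a device with this limit accepts."""
    return max_hw_sectors * SECTOR_SIZE


def is_bio_too_big(request_bytes, max_hw_sectors):
    """True if a request of request_bytes would exceed the device limit."""
    return request_bytes > max_request_bytes(max_hw_sectors)


if __name__ == "__main__":
    usb_limit = 120  # value reported for the USB disk in this message
    print(max_request_bytes(usb_limit))           # 61440 bytes, i.e. 60 KiB
    print(is_bio_too_big(128 * 1024, usb_limit))  # a 128 KiB write does not fit
```

Any request sized for a disk with a larger limit would trip this check
when it is sent to the USB disk instead.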

2. Removing a partition from the filesystem (say, the external disk)
didn't relocate “single” block groups as such to other disks, as
expected.  Raid0 block groups were created to hold data from single
block groups and, if it couldn't create big-enough raid0 blocks because
*any* of the other disks was nearly-full, removal would fail.  This can
make it tricky to remove any partition from a filesystem that has two or
more partition members nearly full.  I suppose rebalancing might do the
trick, though it adds an unnecessary step.

Worse: after the failure, the available space, as reported by /bin/df,
remains lower than before the request for removal.  The difference
appears to be the amount of space that would have been made unavailable
by the removal of the requested partition.  Repeating the request for
removal doesn't make it go lower, but asking for *another* member
partition to be removed (and failing in just the same way) does make it
go lower.  Asking for one large partition to be removed, after the first
failure, caused the amount of available space to underflow!  Wheee,
nearly-infinite storage ;-)  At least until the next reboot, that would
fix the reported available space.
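
The underflow is what one would expect if the available-space counter is
an unsigned 64-bit value that is decremented on each failed removal and
never restored.  That is only a guess at the mechanism; the sketch below
models it in Python with made-up sizes, and none of the names correspond
to btrfs code.

```python
U64_MASK = (1 << 64) - 1  # model an unsigned 64-bit counter


def reserve_for_removal(available, device_bytes):
    """Subtract a device's size the way u64 arithmetic would (wraps below zero)."""
    return (available - device_bytes) & U64_MASK


if __name__ == "__main__":
    gib = 1 << 30
    available = 300 * gib                                   # hypothetical free space
    available = reserve_for_removal(available, 200 * gib)   # first failed removal
    available = reserve_for_removal(available, 500 * gib)   # second, larger device
    print(available)  # a value just under 2**64 rather than a negative number
```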

3. Sometimes failure is better than success.  In this case, successful
removal of a partition meant the filesystem would no longer allocate
single block groups: it would only allocate raid0 groups, a very
unfortunate choice for a filesystem containing disks of very different
sizes.  I haven't tried to fill it up to check that it wouldn't revert
to single blocks after exhausting all the space that could be devoted to
creating raid0 block groups, but the reported available space gave me the
impression that it would only create block groups while it could get an
equal number of blocks from each of the remaining disks.

I could reduce the space taken up by RAID0 block groups by asking for
removal of partitions holding such raid0 block groups; the blocks would
be happily relocated to available space in other pre-existing single
groups.  However, once it got to single groups, it would allocate raid0
groups, and any further block group allocations on that filesystem would
get raid0 block groups, rather than single.  I couldn't find a way to go
back, in very much the same way that it appears to be impossible to go
back from RAID1 to DUP metadata once you temporarily add a second disk,
and any metadata block group happens to be allocated before you remove
it (why couldn't it go back to DUP, rather than refusing the removal
outright, which prevents even single block groups from being moved?)

4. I ended up re-creating the filesystem with single data, as intended,
and using 2.6.38.8 to safely use the external disk for the copying.  I
decided to keep it in for the time being, in part because I'm scared of
attempting a removal and ending up with raid0 block groups and
highly-reduced available disk space.  Instead of the large external
disk, however, 2.6.38.8 preferred the faster but smaller internal disks,
and it would happily fill them up with the large, long-term storage data
that was meant to remain mostly in the external disk (as 3.0 would have
done), leaving no room for raid1 metadata allocations.  I'd get
-ENOSPC errors every now and again while copying data onto this
filesystem, even though there was plenty of available space, and even
plenty of available space in already-allocated metadata block groups.
So much so that retrying the same copies after a few seconds would
succeed.  Oh well...  That's a 2.6.38 issue that's AFAICT heuristically
fixed in 3.0.  Too bad I can't really take advantage of this fix because
of the “bio too big” problem.

5. This long message reminded me that another machine that has been
running 3.0 seems to have got *much* slower recently.  I thought it had
to do with the 98% full filesystem (though 40GB available for new block
group allocations would seem to be plenty), and the constant metadata
activity caused by ceph creating and removing snapshots all the time.
It seems that the removals lagged behind for a long time and kept the
disk in constant activity in spite of very little actual ceph activity.
I had decided to shuffle disks around precisely to make more disk space
available for that one machine.  However, once I switched back to
2.6.38, the machine seems to have gotten much faster again, in spite of
the larger ceph activity due to resyncing data to a re-created OSD.
This suggests some large inefficiency in 3.0's btrfs, at least for such
nearly-full disks, and/or for such frequent snapshot creation and
removal as done by ceph.  Indeed, I had noticed a significant slow down
of the ceph cluster, which I had associated with the nearly-full disk
under constant metadata activity, but after I switched back to 2.6.38,
the speed of the cluster was back to normal.  I'm afraid I don't have
enough data to be any more specific about this issue.

6. On a more positive note, I was totally amazed by btrfs's ability to
recover from a goof of mine.  While shuffling disks, removing them from
one filesystem and adding to another, I accidentally added to one of the
data filesystems a partition that was in use by the btrfs raid1
filesystem containing my root (I mean the stuff mounted in /, including
usr, bin, lib, etc).  Oops.  I promptly noticed the mistake and removed
it from the data filesystem and rebooted, already reaching for the
recovery disk.  I didn't need it.  The root filesystem mounted
successfully, reporting a bunch of checksum errors and using the other
raid1 copy of the data.  Wow!  I removed the partition I had double-used
and it again reported lots of errors, but succeeded, and then I added it
back, and everything was fine.  I even compared the root filesystem
image with a recent backup, and all the data was correct, and the
filesystem was consistent.  Great stuff, thanks!

I wonder, why can't btrfs mark at least mounted partitions as busy, in
much the same way that swap, md and various filesystems do, to avoid
such accidental reuses?  I recall another occasion in which I attempted
to add a live swap partition to a btrfs filesystem (@&@#$@&@# disks that
get assigned different /dev/sd* names on each reboot!), and it refused,
because the swap partition was busy.  Couldn't btrfs use the same
mechanisms to protect its own mounted partitions from accidents?

Thanks in advance for any advice, fixes, or improvements,

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08  1:00 “bio too big” regression and silent data corruption in 3.0 Alexandre Oliva
@ 2011-08-08 22:39 ` Alexandre Oliva
  2011-08-09 14:02   ` Josef Bacik
  2011-08-09  2:53 ` Alexandre Oliva
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Alexandre Oliva @ 2011-08-08 22:39 UTC (permalink / raw)
  To: linux-btrfs

On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
> corrupts data in “meta-raid1/data-single” configurations on disks with
> different max_hw_sectors, where 2.6.38 worked fine.

FWIW, I just got the same problem with 2.6.38.  No idea how I hadn't hit
it before, but it's not a 3.0 regression, just a regular (but IMHO very
serious) bug.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08 22:39 ` Alexandre Oliva
@ 2011-08-09 14:02   ` Josef Bacik
  0 siblings, 0 replies; 8+ messages in thread
From: Josef Bacik @ 2011-08-09 14:02 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs

On 08/08/2011 06:39 PM, Alexandre Oliva wrote:
> On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
>
>> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
>> corrupts data in “meta-raid1/data-single” configurations on disks with
>> different max_hw_sectors, where 2.6.38 worked fine.
>
> FWIW, I just got the same problem with 2.6.38.  No idea how I hadn't hit
> it before, but it's not a 3.0 regression, just a regular (but IMHO very
> serious) bug.
>

This is worrisome, I will try and find a USB disk with a small
sectorsize and see if I can reproduce.  Thanks,

Josef

* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08  1:00 “bio too big” regression and silent data corruption in 3.0 Alexandre Oliva
  2011-08-08 22:39 ` Alexandre Oliva
@ 2011-08-09  2:53 ` Alexandre Oliva
  2011-08-09 14:01   ` Josef Bacik
  2011-08-09  4:04 ` Alexandre Oliva
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Alexandre Oliva @ 2011-08-09  2:53 UTC (permalink / raw)
  To: linux-btrfs

On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> 2. Removing a partition from the filesystem (say, the external disk)
> didn't relocate “single” block groups as such to other disks, as
> expected.

/me reads some code and resets expectations about RAID0 in btrfs ;-)

update_block_group_flags is what does this.  It doesn't care what was
chosen when the filesystem was created, it just forces RAID0 if more
than 1 disk remains:

		/* turn single device chunks into raid0 */
		return stripped | BTRFS_BLOCK_GROUP_RAID0;

Is this really intended?  Given my current understanding that RAID0
doesn't mean striping over all disks, but only over two disks, I guess I
might even be interested in it, but...  I still think the user's choice
should be honored, but I don't see where the choice is stored (if it is
at all).
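
The behaviour described above can be modelled in a few lines.  This is a
simplified userspace illustration in Python, not the kernel function: the
name flags_after_relocation is made up, the bit values are believed to
match the btrfs headers but should be treated as illustrative, and leaving
chunks that already carry a profile untouched is an assumption.

```python
# Illustrative bit values; not authoritative.
BLOCK_GROUP_DATA = 1 << 0
BLOCK_GROUP_RAID0 = 1 << 3
BLOCK_GROUP_RAID1 = 1 << 4
BLOCK_GROUP_DUP = 1 << 5
BLOCK_GROUP_RAID10 = 1 << 6

PROFILE_BITS = (BLOCK_GROUP_RAID0 | BLOCK_GROUP_RAID1 |
                BLOCK_GROUP_RAID10 | BLOCK_GROUP_DUP)


def flags_after_relocation(flags, remaining_devices):
    """Model of the profile a relocated chunk ends up with.

    Note that the profile chosen at mkfs time is not an input at all.
    """
    if remaining_devices > 1 and not (flags & PROFILE_BITS):
        # "turn single device chunks into raid0"
        return flags | BLOCK_GROUP_RAID0
    return flags  # assumption: anything else is left as it was


if __name__ == "__main__":
    single_data = BLOCK_GROUP_DATA
    relocated = flags_after_relocation(single_data, remaining_devices=3)
    print(bool(relocated & BLOCK_GROUP_RAID0))  # single data came back as raid0
```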

> I wonder, why can't btrfs mark at least mounted partitions as busy, in
> much the same way that swap, md and various filesystems do, to avoid
> such accidental reuses?

Heh.  And *unmark* them when they're removed, too...  As in, it won't
let me create a new filesystem in a partition that was just removed from
a filesystem, if that was the partition listed in /etc/mtab.

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer

* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-09  2:53 ` Alexandre Oliva
@ 2011-08-09 14:01   ` Josef Bacik
  0 siblings, 0 replies; 8+ messages in thread
From: Josef Bacik @ 2011-08-09 14:01 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs

On 08/08/2011 10:53 PM, Alexandre Oliva wrote:
> On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
>
>> 2. Removing a partition from the filesystem (say, the external disk)
>> didn't relocate “single” block groups as such to other disks, as
>> expected.
>
> /me reads some code and resets expectations about RAID0 in btrfs ;-)
>
> update_block_group_flags is what does this.  It doesn't care what was
> chosen when the filesystem was created, it just forces RAID0 if more
> than 1 disk remains:
>
> 		/* turn single device chunks into raid0 */
> 		return stripped | BTRFS_BLOCK_GROUP_RAID0;
>
> Is this really intended?  Given my current understanding that RAID0
> doesn't mean striping over all disks, but only over two disks, I guess I
> might even be interested in it, but...  I still think the user's choice
> should be honored, but I don't see where the choice is stored (if it is
> at all).

Well -m single -d single means that we only have one disk and we don't
want duplication (usually one just does -m single since metadata is the
only thing duplicated by default).  But if you add more disks we want to
do RAID0 as we should be striping across all the devices in the fs.

>
>
>> I wonder, why can't btrfs mark at least mounted partitions as busy, in
>> much the same way that swap, md and various filesystems do, to avoid
>> such accidental reuses?
>
> Heh.  And *unmark* them when they're removed, too...  As in, it won't
> let me create a new filesystem in a partition that was just removed from
> a filesystem, if that was the partition listed in /etc/mtab.
>

Yeah our "what is busy" thing should be a little smarter.  Thanks,

Josef

* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08  1:00 “bio too big” regression and silent data corruption in 3.0 Alexandre Oliva
  2011-08-08 22:39 ` Alexandre Oliva
  2011-08-09  2:53 ` Alexandre Oliva
@ 2011-08-09  4:04 ` Alexandre Oliva
  2011-08-09 19:05 ` Josef Bacik
  2011-08-16 16:56 ` Alexandre Oliva
  4 siblings, 0 replies; 8+ messages in thread
From: Alexandre Oliva @ 2011-08-09  4:04 UTC (permalink / raw)
  To: linux-btrfs

On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> in very much the same way that it appears to be impossible to go
> back from RAID1 to DUP metadata once you temporarily add a second disk,
> and any metadata block group happens to be allocated before you remove
> it (why couldn't it go back to DUP, rather than refusing the removal
> outright, which prevents even single block groups from being moved?)

Which also appears to be intentional.  The code to support this is right
there in update_block_group_flags, but btrfs_rm_device refuses to let it
do its job, denying the removal attempt right away, without any means to
bypass the test.  Could at least an option to bypass the test be
introduced, through say a mount option, some /sys setting, whatever?
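
A bypass could be as small as one extra condition on the existing check.
The sketch below models the idea in Python rather than kernel C.
allow_profile_downgrade is a hypothetical knob that does not exist in
btrfs, the two-device threshold is an assumption based on the refusal
described above, and the bit value and message strings are illustrative.

```python
BLOCK_GROUP_RAID1 = 1 << 4  # illustrative bit value


def may_remove_device(profiles_in_use, num_devices,
                      allow_profile_downgrade=False):
    """Return (allowed, message) for a device removal request."""
    if (profiles_in_use & BLOCK_GROUP_RAID1) and num_devices <= 2:
        if not allow_profile_downgrade:
            # today's behaviour as described: a hard error, no way around it
            return False, "refused: raid1 needs at least two devices"
        # proposed behaviour: warn and let the relocation code do its job
        return True, "warning: raid1 block groups will be downgraded"
    return True, ""


if __name__ == "__main__":
    print(may_remove_device(BLOCK_GROUP_RAID1, 2))
    print(may_remove_device(BLOCK_GROUP_RAID1, 2, allow_profile_downgrade=True))
```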

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer


* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08  1:00 “bio too big” regression and silent data corruption in 3.0 Alexandre Oliva
                   ` (2 preceding siblings ...)
  2011-08-09  4:04 ` Alexandre Oliva
@ 2011-08-09 19:05 ` Josef Bacik
  2011-08-16 16:56 ` Alexandre Oliva
  4 siblings, 0 replies; 8+ messages in thread
From: Josef Bacik @ 2011-08-09 19:05 UTC (permalink / raw)
  To: Alexandre Oliva; +Cc: linux-btrfs

On 08/07/2011 09:00 PM, Alexandre Oliva wrote:
> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
> corrupts data in “meta-raid1/data-single” configurations on disks with
> different max_hw_sectors, where 2.6.38 worked fine.
>

I've reproduced this, but I'm stuck on something else atm; I should
get to it soon.  Thanks,

Josef

* Re: “bio too big” regression and silent data corruption in 3.0
  2011-08-08  1:00 “bio too big” regression and silent data corruption in 3.0 Alexandre Oliva
                   ` (3 preceding siblings ...)
  2011-08-09 19:05 ` Josef Bacik
@ 2011-08-16 16:56 ` Alexandre Oliva
  4 siblings, 0 replies; 8+ messages in thread
From: Alexandre Oliva @ 2011-08-16 16:56 UTC (permalink / raw)
  To: linux-btrfs

Here's some additional information and work-arounds.

On Aug  7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> A bit of investigation showed that max_hw_sectors for the USB disk was
> 120, much lower than the internal SATA and PATA disks.

FWIW, overriding /sys/class/block/sd*/queue/max_sectors_kb of all disks
used by the filesystem to the lowest max_hw_sectors_kb works around this
problem, at least as long as you don't hit it before you get a chance to
change the setting.
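
A minimal sketch of that work-around, assuming the usual sysfs layout
(<disk>/queue/max_hw_sectors_kb and <disk>/queue/max_sectors_kb).  The
function name, the sysfs_root parameter and the dry_run default are
inventions of this example; writing the setting requires root.

```python
from pathlib import Path


def clamp_max_sectors_kb(disks, sysfs_root="/sys/class/block", dry_run=True):
    """Set max_sectors_kb of every disk to the lowest max_hw_sectors_kb.

    Returns the value chosen.  With dry_run=True nothing is written.
    """
    queues = [Path(sysfs_root) / disk / "queue" for disk in disks]
    lowest = min(int((q / "max_hw_sectors_kb").read_text()) for q in queues)
    if not dry_run:
        for q in queues:
            (q / "max_sectors_kb").write_text(f"{lowest}\n")
    return lowest
```

For a filesystem spanning, say, sda, sdb and sdc (hypothetical names),
clamp_max_sectors_kb(["sda", "sdb", "sdc"], dry_run=False) would apply the
lowest limit to all three.  The setting does not persist across reboots,
so it has to be reapplied before the filesystem sees any large writes.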

> Raid0 block groups were created to hold data from single block groups
> and, if it couldn't create big-enough raid0 blocks because *any* of
> the other disks was nearly-full, removal would fail.

AFAICT this was my misunderstanding of the situation.  Apparently btrfs
can rebalance the disk space in other partitions so as to create raid0
blocks during removal.  However, in my case it didn't because there was
some metadata inconsistency in the partition I was trying to remove that
led to block tree checksum errors being printed when it hit that part of
the partition, aborting the removal.  The checksum errors were likely
caused by the bio too big problem.

> it appears to be impossible to go back from RAID1 to DUP metadata once
> you temporarily add a second disk, and any metadata block group
> happens to be allocated before you remove it (why couldn't it go back
> to DUP, rather than refusing the removal outright, which prevents even
> single block groups from being moved?)

FWIW, I disabled the test that refuses to shrink a filesystem containing
RAID1 to a single disk and issued such a request while running this
modified kernel, and it completed successfully and perfectly.  Can we
change it from hard error to warning?

> 5. This long message reminded me that another machine that has been
> running 3.0 seems to have got *much* slower recently.  I thought it had
> to do with the 98% full filesystem (though 40GB available for new block
> group allocations would seem to be plenty), and the constant metadata
> activity caused by ceph creating and removing snapshots all the time.

AFAICT it had to do with extended attributes (heavily used by ceph),
that caused a large number of metadata block groups to be allocated,
even though only a tiny fraction of the space in them ended up being
used.  I've observed this in two of the ceph object stores.

I've also noticed that rsyncing the OSDs with all extended attributes
(-A -X) caused the source to use up a *lot* of CPU and far longer than
without.  I don't know why that is, but getfattr --dump at the source
and setfattr --restore at the target does pretty much the same, without
incurring such large CPU and time costs, so there's something to be
improved somewhere, in rsync and/or in btrfs.
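
The getfattr/setfattr route could be scripted roughly as follows.  This is
a sketch for two local trees with the same relative layout; copying
between hosts would need the dump shipped across.  The -R and -m - flags
(recurse, match every attribute namespace) are additions of this example,
not something stated above, and restoring non-user attributes normally
requires root.

```python
import subprocess


def xattr_dump_command(path="."):
    """argv for dumping all extended attributes below path."""
    return ["getfattr", "-R", "-m", "-", "--dump", path]


def xattr_restore_command(dump_file="-"):
    """argv for restoring a dump; '-' makes setfattr read standard input."""
    return ["setfattr", f"--restore={dump_file}"]


def copy_xattrs(source_dir, target_dir):
    """Dump xattrs relative to source_dir and restore them under target_dir."""
    dump = subprocess.run(xattr_dump_command(), cwd=source_dir,
                          check=True, capture_output=True).stdout
    subprocess.run(xattr_restore_command(), cwd=target_dir,
                   input=dump, check=True)
```

Running both commands from inside their respective directories keeps the
path names in the dump relative, so they resolve against the target tree.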

-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer
