btrfs stability

linux-btrfs.vger.kernel.org archive mirror

* btrfs stability
@ 2013-01-25 20:05 Andrew McNabb
  2013-01-25 20:37 ` Josef Bacik
  2013-01-25 20:53 ` Josef Bacik
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew McNabb @ 2013-01-25 20:05 UTC (permalink / raw)
  To: linux-btrfs

I tried creating a multi-device btrfs filesystem for the first time (on
Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems.  I
had heard that btrfs is now reasonably stable, and though I expected to
possibly see a problem here or there, I was a little surprised at just
how many problems I encountered in such a short period of time.  I now
have about a thousand error messages in my kernel logs related to
several different problems.  Is this roughly the expected level of
stability for btrfs with multiple devices, or am I just particularly
lucky? :)

Am I correct in assuming that I'll need to switch to md for a few months
and try btrfs again later, or are there known problems in the specific
kernel I'm running that I could avoid by trying a different version?

For the sake of being specific, I'll detail a few of the problems I've
hit:

These two may have been caused by a possibly faulty disk (I'm still
trying to determine whether it was faulty or whether the bug was purely
in btrfs):

https://bugzilla.redhat.com/show_bug.cgi?id=903794
https://bugzilla.redhat.com/show_bug.cgi?id=904143

This one was triggered when I tried to remove a possibly faulty disk:

https://bugzilla.redhat.com/show_bug.cgi?id=904197

With a freshly created filesystem, I got a kernel bug, associated with a
hang in most filesystem operations.  This occurred in the middle of
ordinary operation and without any sort of hardware-related errors in
the kernel logs.

https://bugzilla.redhat.com/show_bug.cgi?id=904223

I've noticed that a lot of the reports in the Fedora bugzilla and kernel
bugzilla don't seem to include much discussion; is there any specific
type of information that bug submitters should try to include to make
the reports more helpful?  Thanks.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

* Re: btrfs stability
  2013-01-25 20:05 btrfs stability Andrew McNabb
@ 2013-01-25 20:37 ` Josef Bacik
  2013-01-25 21:22   ` Andrew McNabb
  2013-01-25 20:53 ` Josef Bacik
  1 sibling, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2013-01-25 20:37 UTC (permalink / raw)
  To: Andrew McNabb; +Cc: linux-btrfs@vger.kernel.org

On Fri, Jan 25, 2013 at 01:05:14PM -0700, Andrew McNabb wrote:
> I tried creating a multi-device btrfs filesystem for the first time (on
> Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems.  I
> had heard that btrfs is now reasonably stable, and though I expected to
> possibly see a problem here or there, I was a little surprised at just
> how many problems I encountered in such a short period of time.  I now
> have about a thousand error messages in my kernel logs related to
> several different problems.  Is this roughly the expected level of
> stability for btrfs with multiple devices, or am I just particularly
> lucky? :)
> 
> Am I correct in assuming that I'll need to switch to md for a few months
> and try btrfs again later, or are there known problems in the specific
> kernel I'm running that I could avoid by trying a different version?
> 
> For the sake of being specific, I'll detail a few of the problems I've
> hit:
> 
> These two may have been caused by a possibly faulty disk (I'm still
> trying to determine whether it was faulty or whether the bug was purely
> in btrfs):
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=903794

This one is just an allocator warning because the relocator doesn't do the right
accounting for relocation.  It's just complaining; we need to fix it, but it won't
keep it from working.

> https://bugzilla.redhat.com/show_bug.cgi?id=904143

This I'm almost certain (I have to check) was just a result of me making fsync
faster and forgetting to remove this warn on.  It's fixed upstream.  Again,
nothing to worry about, but annoying.

> 
> This one was triggered when I tried to remove a possibly faulty disk:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=904197
> 

Ok this is a bug, I can fix this.  Basically we tried to read from the faulty
disk, it failed, we read from the other copy, and then tried to write the good
copy back to the failed disk and when we saw that the IO wasn't actually going
to go to the bad disk we panic'ed.  Silly but easy enough to understand/fix.
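
A minimal sketch of the read-repair path described above, in Python rather
than kernel code. The Device class and read_with_repair function are
hypothetical names used only for illustration; this is not how btrfs
implements it. It assumes a mirrored block and shows the graceful option:
if the write-back to the failed disk cannot happen, skip the repair instead
of treating it as fatal.

```python
class Device:
    """Hypothetical in-memory stand-in for one disk in a mirrored pair."""

    def __init__(self, name, blocks, missing=False, bad=()):
        self.name = name
        self.blocks = dict(blocks)
        self.missing = missing      # device is gone entirely
        self.bad = set(bad)         # block numbers that fail to read

    def read(self, blockno):
        if self.missing or blockno in self.bad:
            raise OSError(f"read error on {self.name}, block {blockno}")
        return self.blocks[blockno]

    def write(self, blockno, data):
        if self.missing:
            raise OSError(f"{self.name} is missing, cannot write")
        self.blocks[blockno] = data
        self.bad.discard(blockno)


def read_with_repair(mirrors, blockno):
    """Return (data, repaired_names) using the first copy that reads cleanly."""
    failed = []
    for dev in mirrors:
        try:
            data = dev.read(blockno)
        except OSError:
            failed.append(dev)
            continue
        repaired = []
        for bad_dev in failed:
            try:
                # Write the good copy back over the copy that failed to read.
                bad_dev.write(blockno, data)
                repaired.append(bad_dev.name)
            except OSError:
                # The target is really gone: skip the repair, keep serving data.
                pass
        return data, repaired
    raise OSError(f"all copies of block {blockno} failed")
```

With one flaky and one healthy device the flaky copy is rewritten; with a
missing device the read still succeeds and the repair is simply skipped.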

> With a freshly created filesystem, I got a kernel bug, associated with a
> hang in most filesystem operations.  This occurred in the middle of
> ordinary operation and without any sort of hardware-related errors in
> the kernel logs.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=904223
> 

So this is from the fsync stuff, and I'm sure I fixed this somewhere but I can't
account for where I did it.  Can you give btrfs-next a try and see if you can
still reproduce it?  Thanks,

Josef

* Re: btrfs stability
  2013-01-25 20:05 btrfs stability Andrew McNabb
  2013-01-25 20:37 ` Josef Bacik
@ 2013-01-25 20:53 ` Josef Bacik
  2013-01-25 21:39   ` Andrew McNabb
  1 sibling, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2013-01-25 20:53 UTC (permalink / raw)
  To: Andrew McNabb; +Cc: linux-btrfs@vger.kernel.org

On Fri, Jan 25, 2013 at 01:05:14PM -0700, Andrew McNabb wrote:
> I tried creating a multi-device btrfs filesystem for the first time (on
> Fedora 18 with 3.7.2-204.fc18.x86_64), and I ran into some problems.  I
> had heard that btrfs is now reasonably stable, and though I expected to
> possibly see a problem here or there, I was a little surprised at just
> how many problems I encountered in such a short period of time.  I now
> have about a thousand error messages in my kernel logs related to
> several different problems.  Is this roughly the expected level of
> stability for btrfs with multiple devices, or am I just particularly
> lucky? :)
> 
> Am I correct in assuming that I'll need to switch to md for a few months
> and try btrfs again later, or are there known problems in the specific
> kernel I'm running that I could avoid by trying a different version?
> 
> For the sake of being specific, I'll detail a few of the problems I've
> hit:
> 
> These two may have been caused by a possibly faulty disk (I'm still
> trying to determine whether it was faulty or whether the bug was purely
> in btrfs):
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=903794
> https://bugzilla.redhat.com/show_bug.cgi?id=904143
> 
> This one was triggered when I tried to remove a possibly faulty disk:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=904197

Actually for this one, how did you remove the disk?  Did you just yank it out
while the box was running?  Did you mount -o degraded and then delete the device
and then remove it?  How exactly did you get to this situation?  Thanks,

Josef

* Re: btrfs stability
  2013-01-25 20:37 ` Josef Bacik
@ 2013-01-25 21:22   ` Andrew McNabb
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew McNabb @ 2013-01-25 21:22 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs@vger.kernel.org

On Fri, Jan 25, 2013 at 03:37:17PM -0500, Josef Bacik wrote:
> > https://bugzilla.redhat.com/show_bug.cgi?id=903794
> 
> This one is just an allocator warning because the relocator doesn't do the right
> accounting for relocation.  It's just complaining; we need to fix it, but it won't
> keep it from working.

I won't worry about this one, then.

> > https://bugzilla.redhat.com/show_bug.cgi?id=904143
> 
> This I'm almost certain (I have to check) was just a result of me making fsync
> faster and forgetting to remove this warn on.  It's fixed upstream.  Again,
> nothing to worry about, but annoying.

Sounds good.

> > This one was triggered when I tried to remove a possibly faulty disk:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=904197
> > 
> 
> Ok this is a bug, I can fix this.  Basically we tried to read from the faulty
> disk, it failed, we read from the other copy, and then tried to write the good
> copy back to the failed disk and when we saw that the IO wasn't actually going
> to go to the bad disk we panic'ed.  Silly but easy enough to understand/fix.

I was a little surprised that this happened after I had already done a
"btrfs dev delete"--is there a way to tell btrfs that a disk really is
gone?

> > With a freshly created filesystem, I got a kernel bug, associated with a
> > hang in most filesystem operations.  This occurred in the middle of
> > ordinary operation and without any sort of hardware-related errors in
> > the kernel logs.
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=904223
> > 
> 
> So this is from the fsync stuff, and I'm sure I fixed this somewhere but I can't
> account for where I did it.

Would this also be the cause of the hangs that I'm seeing?  In the end,
a hang with the load rising to 260.10 is the most serious problem.  It's
happened a few times, and it gets temporarily fixed by a reboot, but
then tends to recur fairly soon.

> Can you give btrfs-next a try and see if you can
> still reproduce it?  Thanks,

Is there a pre-built RPM for btrfs-next, or what's the best way to try
it out in Fedora without breaking other things?

Thanks for your quick response, and sorry for not responding sooner
(I've been interrupted by a few phone calls).

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

* Re: btrfs stability
  2013-01-25 20:53 ` Josef Bacik
@ 2013-01-25 21:39   ` Andrew McNabb
  2013-01-26 20:27     ` Andrew McNabb
  0 siblings, 1 reply; 10+ messages in thread
From: Andrew McNabb @ 2013-01-25 21:39 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs@vger.kernel.org

On Fri, Jan 25, 2013 at 03:53:22PM -0500, Josef Bacik wrote:
> 
> Actually for this one, how did you remove the disk?  Did you just yank it out
> while the box was running?  Did you mount -o degraded and then delete the device
> and then remove it?  How exactly did you get to this situation?  Thanks,

I've moved my answer over to IRC to reduce the latency in the
conversation.  Thanks again for all the help.

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

* Re: btrfs stability
  2013-01-25 21:39   ` Andrew McNabb
@ 2013-01-26 20:27     ` Andrew McNabb
  2013-01-28 14:17       ` Josef Bacik
  2013-01-28 15:10       ` Josef Bacik
  0 siblings, 2 replies; 10+ messages in thread
From: Andrew McNabb @ 2013-01-26 20:27 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs@vger.kernel.org

Here's an update.  I tried the new kernel, and I seem to be having some
new (possibly worse) problems.  In my ssh session, I'm seeing many errors
of this sort:

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[  308.223834] BUG: soft lockup - CPU#0 stuck for 23s!
 [btrfs-endio-wri:2073]

Message from syslogd@guru at Jan 26 13:13:14 ...
 kernel:[  308.248754] BUG: soft lockup - CPU#2 stuck for 23s!
 [btrfs-delalloc-:594]

In the logs, I'm seeing several warnings and bugs, including:

WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
BUG: unable to handle kernel NULL pointer dereference at     (null)
BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]

Kernel logs (across a few reboots) are at:

http://students.cs.byu.edu/~amcnabb/messages2
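
When a log holds hundreds of messages like the ones listed above, a per-type
count can make a report easier to read. The following is a rough sketch,
assuming one kernel message per line; summarize and normalize are
hypothetical helper names, and the patterns only cover the WARNING and BUG
forms quoted in this message.

```python
import re
from collections import Counter

# Matches the severity tag and the rest of the message.
MESSAGE = re.compile(r"\b(WARNING|BUG): (.*)")


def normalize(text):
    """Blank out the parts of a soft-lockup message that vary per occurrence."""
    text = re.sub(r"CPU#\d+", "CPU#N", text)      # CPU number
    text = re.sub(r"for \d+s", "for Ns", text)    # stuck duration
    text = re.sub(r":\d+\]", ":N]", text)         # PID after the task name
    return text.strip()


def summarize(lines):
    """Count WARNING/BUG messages, grouped by (severity, normalized text)."""
    counts = Counter()
    for line in lines:
        match = MESSAGE.search(line)
        if match:
            counts[(match.group(1), normalize(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage (the path is whatever log file you want to summarize):
    #   python summarize.py /var/log/messages
    with open(sys.argv[1], errors="replace") as log:
        for (severity, text), n in summarize(log).most_common():
            print(f"{n:6d}  {severity}: {text}")
```

File and line numbers in WARNING messages are kept as-is, so warnings from
different code locations stay separate in the summary.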

--
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868

* Re: btrfs stability
  2013-01-26 20:27     ` Andrew McNabb
@ 2013-01-28 14:17       ` Josef Bacik
  2013-01-28 15:10       ` Josef Bacik
  1 sibling, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2013-01-28 14:17 UTC (permalink / raw)
  To: Andrew McNabb; +Cc: Josef Bacik, linux-btrfs@vger.kernel.org

On Sat, Jan 26, 2013 at 01:27:11PM -0700, Andrew McNabb wrote:
> Here's an update.  I tried the new kernel, and I seem to be having some
> new (possibly worse) problems.  In my ssh session, I'm seeing many errors
> of this sort:
> 
> Message from syslogd@guru at Jan 26 13:13:14 ...
>  kernel:[  308.223834] BUG: soft lockup - CPU#0 stuck for 23s!
>  [btrfs-endio-wri:2073]
> 
> Message from syslogd@guru at Jan 26 13:13:14 ...
>  kernel:[  308.248754] BUG: soft lockup - CPU#2 stuck for 23s!
>  [btrfs-delalloc-:594]
> 
> In the logs, I'm seeing several warnings and bugs, including:
> 
> WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
> BUG: unable to handle kernel NULL pointer dereference at     (null)
> BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
> BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]
> 
> Kernel logs (across a few reboots) are at:
> 
> http://students.cs.byu.edu/~amcnabb/messages2
> 

Hrm well I didn't expect that.  I will look into this and see what I can come up
with.  Thanks,

Josef

* Re: btrfs stability
  2013-01-26 20:27     ` Andrew McNabb
  2013-01-28 14:17       ` Josef Bacik
@ 2013-01-28 15:10       ` Josef Bacik
  1 sibling, 0 replies; 10+ messages in thread
From: Josef Bacik @ 2013-01-28 15:10 UTC (permalink / raw)
  To: Andrew McNabb; +Cc: Josef Bacik, linux-btrfs@vger.kernel.org

On Sat, Jan 26, 2013 at 01:27:11PM -0700, Andrew McNabb wrote:
> Here's an update.  I tried the new kernel, and I seem to be having some
> new (possibly worse) problems.  In my ssh session, I'm seeing many errors
> of this sort:
> 
> Message from syslogd@guru at Jan 26 13:13:14 ...
>  kernel:[  308.223834] BUG: soft lockup - CPU#0 stuck for 23s!
>  [btrfs-endio-wri:2073]
> 
> Message from syslogd@guru at Jan 26 13:13:14 ...
>  kernel:[  308.248754] BUG: soft lockup - CPU#2 stuck for 23s!
>  [btrfs-delalloc-:594]
> 
> In the logs, I'm seeing several warnings and bugs, including:
> 
> WARNING: at fs/btrfs/extent_map.c:78 free_extent_map+0x79/0x90 [btrfs]()
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x82/0xd0()
> BUG: unable to handle kernel NULL pointer dereference at     (null)
> BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-endio-wri:1489]
> BUG: soft lockup - CPU#1 stuck for 22s! [btrfs-delalloc-:607]
> 
> Kernel logs (across a few reboots) are at:
> 
> http://students.cs.byu.edu/~amcnabb/messages2
> 

OK, I think I figured it out; can you give this a whirl?  Let me know when you
get tester's fatigue ;)

http://koji.fedoraproject.org/koji/taskinfo?taskID=4908932

Thanks,

Josef

* btrfs stability
@ 2016-05-26 22:42 Diego Torres
  2016-05-27  5:14 ` Roman Mamedov
  0 siblings, 1 reply; 10+ messages in thread
From: Diego Torres @ 2016-05-26 22:42 UTC (permalink / raw)
  To: linux-btrfs

Hi there,

I've been using btrfs with a raid5 configuration with 3 disks for 6
months, and then with 4 disks for a couple of months more. I run a
weekly scrub, and a monthly balance. Btrfs is the only fs that can add
drives one by one to an existing raid setup, and use the new space
immediately, without replacing all the drives. For me, this is one of
the strongest points.

And, as far as I understand, if I keep an eye on the free space
available, and no drives fail, the filesystem would last indefinitely.
However, the code to replace a failed/missing drive is not yet final,
as I have discovered reading some wikis and this mailing list. Maybe
I'm wrong.

I haven't been able to find a timeline/roadmap about when the replace
command will be stable/ready for use.

Is this someone's priority? Is it planned for the next one, two, or
three years?

Thanks in advance.

-- 
-- Use of a keyboard or mouse may be linked to serious injuries or disorders.
diego dot torres at gmail dot com - Madrid / Spain

* Re: btrfs stability
  2016-05-26 22:42 Diego Torres
@ 2016-05-27  5:14 ` Roman Mamedov
  0 siblings, 0 replies; 10+ messages in thread
From: Roman Mamedov @ 2016-05-27  5:14 UTC (permalink / raw)
  To: Diego Torres; +Cc: linux-btrfs

On Fri, 27 May 2016 00:42:07 +0200
Diego Torres <diego.torres@gmail.com> wrote:

> Btrfs is the only fs that can add drives one by one to an existing raid
> setup, and use the new space immediately, without replacing all the drives.

Ext4, XFS, JFS or pretty much any FS which can be resized upwards can also do
that, when placed on top of mdadm RAID5/6. It's not like you are absolutely
locked in to using Btrfs if you need that particular feature.
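
As a rough illustration of what growing an array one drive at a time buys,
here is the standard usable-capacity arithmetic for mdadm RAID5/6. The
function name and the 2 TB disk size are assumptions made for the example,
not figures from this thread.

```python
def md_usable_capacity(level, disk_sizes_tb):
    """Usable capacity of an md RAID5/6 array: (n - parity) * smallest disk."""
    parity = {5: 1, 6: 2}[level]        # RAID5 spends one disk's worth on parity, RAID6 two
    n = len(disk_sizes_tb)
    if n < parity + 2:
        raise ValueError(f"RAID{level} needs at least {parity + 2} disks")
    return (n - parity) * min(disk_sizes_tb)


# Hypothetical 2 TB disks: growing RAID5 from 3 to 4 members.
before = md_usable_capacity(5, [2, 2, 2])
after = md_usable_capacity(5, [2, 2, 2, 2])
print(f"RAID5: {before} TB -> {after} TB usable")
```

The filesystem on top then has to be resized upwards to use the added space.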

"Some of us" also prefer to use Btrfs on top of mdadm RAID, to benefit both
from Btrfs' advanced features such as snapshots, compression and checksum
verification (but not corruption resilience in this case), and from mdadm's
mature, well-tested and performant RAID implementations.

--
With respect,
Roman
