Still not production ready

linux-btrfs.vger.kernel.org archive mirror

* Still not production ready
@ 2015-12-13 22:35 Martin Steigerwald
  2015-12-13 23:19 ` Marc MERLIN
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-13 22:35 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi!

For me it is still not production ready. Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random 
write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

No matter whether SLES 12 uses it as default for root, no matter whether 
Fujitsu and Facebook use it: I will not let this onto any customer machine 
without lots and lots of underprovisioning and rigorous free space monitoring. 
Actually I will renew my recommendations in my trainings to be careful with 
BTRFS.

From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.31GiB
        devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home

If "used" is the same as "size", raise a big fat alarm. This condition alone is
not sufficient for the problem to happen: the filesystem can run for quite some
time just fine without any issues. But I have never seen a kworker thread use
100% of one core for an extended period of time, blocking everything else on
the fs, without this condition being met.
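
A minimal sketch of such a monitoring check, assuming the `btrfs fi show`
output format shown above. The function names and the 0.95 threshold are
hypothetical choices for illustration, not part of btrfs-progs:

```python
import re

# Matches device lines of `btrfs fi show`, for example:
#   devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
DEVID_RE = re.compile(
    r"devid\s+(?P<devid>\d+)"
    r"\s+size\s+(?P<size>[\d.]+)(?P<size_unit>[KMGTP]iB|B)"
    r"\s+used\s+(?P<used>[\d.]+)(?P<used_unit>[KMGTP]iB|B)"
    r"\s+path\s+(?P<path>\S+)"
)

UNIT_BYTES = {
    "B": 1, "KiB": 2**10, "MiB": 2**20,
    "GiB": 2**30, "TiB": 2**40, "PiB": 2**50,
}

# Sample taken from the output quoted in this mail.
SAMPLE = """\
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.31GiB
        devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
"""


def allocation_ratios(show_output):
    """Return {device path: fraction of the device allocated to chunks}."""
    ratios = {}
    for m in DEVID_RE.finditer(show_output):
        size = float(m["size"]) * UNIT_BYTES[m["size_unit"]]
        used = float(m["used"]) * UNIT_BYTES[m["used_unit"]]
        if size > 0:
            ratios[m["path"]] = used / size
    return ratios


def devices_over(show_output, threshold=0.95):
    """Return the sorted device paths whose allocation ratio is at or above
    the threshold (0.95 is an arbitrary, hypothetical default)."""
    ratios = allocation_ratios(show_output)
    return sorted(path for path, ratio in ratios.items() if ratio >= threshold)


if __name__ == "__main__":
    for path in devices_over(SAMPLE):
        print("ALARM: almost all of", path, "is allocated to chunks")
```

On a live system the text could be obtained by running `btrfs filesystem show`
on the mount point, which may require root. Note that "used" on the devid
lines is space allocated to chunks, not space occupied by file data.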

In addition to that, the last time I tried, scrub aborted on every one of my
BTRFS filesystems. I reported this in another thread here that has been
completely ignored so far. I think I could go back to the 4.2 kernel to make
this work.

I am not going to bother to go into more detail on any of this, as I get the
impression that my bug reports and feedback get ignored. So I will spare
myself the time to do this work for now.

The only thing I wonder now is whether all this could be because my /home is
already more than one and a half years old. Maybe newly created filesystems
are created in a way that prevents these issues? But it already has a nice
global reserve:

merkaba:~> btrfs fi df /
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B

Actually when I see that this free space thing is still not fixed for good I 
wonder whether it is fixable at all. Is this an inherent issue of BTRFS or 
more generally COW filesystem design?

I think it got somewhat better. It took much longer to come into that state 
again than last time, but still, blocking like this is *no* option for a 
*production ready* filesystem.

I am seriously considering switching to XFS for my production laptop again,
because I never saw any of these free space issues with any of the XFS or Ext4
filesystems I used in the last 10 years.

Thanks,
-- 
Martin


* Re: Still not production ready
  2015-12-13 22:35 Still not production ready Martin Steigerwald
@ 2015-12-13 23:19 ` Marc MERLIN
  2015-12-14  7:59   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
  2015-12-14  2:08 ` Still not production ready Qu Wenruo
  2016-03-20 11:24 ` kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready) Martin Steigerwald
  2 siblings, 1 reply; 25+ messages in thread
From: Marc MERLIN @ 2015-12-13 23:19 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Btrfs BTRFS

On Sun, Dec 13, 2015 at 11:35:08PM +0100, Martin Steigerwald wrote:
> Hi!
> 
> For me it is still not production ready. Again I ran into:
> 
> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random 
> write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
 
Sorry you're having issues. I haven't seen this before myself.
I couldn't find the kernel version you're using in your Email or the bug
you filed (quick scan).

That's kind of important :)

Marc
 
> No matter whether SLES 12 uses it as default for root, no matter whether 
> Fujitsu and Facebook use it: I will not let this onto any customer machine 
> without lots and lots of underprovisioning and rigorous free space monitoring. 
> Actually I will renew my recommendations in my trainings to be careful with 
> BTRFS.
> 
> From my experience the monitoring would check for:
> 
> merkaba:~> btrfs fi show /home
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 156.31GiB
>         devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>         devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> 
> If "used" is same as "size" then make big fat alarm. It is not sufficient for 
> it to happen. It can run for quite some time just fine without any issues, but 
> I never have seen a kworker thread using 100% of one core for extended period 
> of time blocking everything else on the fs without this condition being met.
> 
> 
> In addition to that last time I tried it aborts scrub any of my BTRFS 
> filesstems. Reported in another thread here that got completely ignored so 
> far. I think I could go back to 4.2 kernel to make this work.
> 
> 
> I am not going to bother to go into more detail on any on this, as I get the 
> impression that my bug reports and feedback get ignored. So I spare myself the 
> time to do this work for now.
> 
> 
> Only thing I wonder now whether this all could be cause my /home is already 
> more than one and a half year old. Maybe newly created filesystems are created 
> in a way that prevents these issues? But it already has a nice global reserve:
> 
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.98GiB, used=24.07GiB
> System, RAID1: total=19.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=536.80MiB
> GlobalReserve, single: total=192.00MiB, used=0.00B
> 
> 
> Actually when I see that this free space thing is still not fixed for good I 
> wonder whether it is fixable at all. Is this an inherent issue of BTRFS or 
> more generally COW filesystem design?
> 
> I think it got somewhat better. It took much longer to come into that state 
> again than last time, but still, blocking like this is *no* option for a 
> *production ready* filesystem.
> 
> 
> 
> I am seriously consider to switch to XFS for my production laptop again. Cause 
> I never saw any of these free space issues with any of the XFS or Ext4 
> filesystems I used in the last 10 years.
> 
> Thanks,
> -- 
> Martin

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready)
  2015-12-13 23:19 ` Marc MERLIN
@ 2015-12-14  7:59   ` Martin Steigerwald
  0 siblings, 0 replies; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-14  7:59 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Btrfs BTRFS

On Sunday, 13 December 2015, 15:19:14 CET, Marc MERLIN wrote:
> On Sun, Dec 13, 2015 at 11:35:08PM +0100, Martin Steigerwald wrote:
> > Hi!
> > 
> > For me it is still not production ready. Again I ran into:
> > 
> > btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Sorry you're having issues. I haven't seen this before myself.
> I couldn't find the kernel version you're using in your Email or the bug
> you filed (quick scan).
> 
> That's kind of important :)

I definitely know this much. :) It happened with 4.3 yesterday. The other
kernel version was 3.18. The information should be in the bug report: 3.18
as mentioned in the Kernel Version field, and 4.3 as I mentioned in the last
comment of the bug report.

I think the scrubbing issue has been there since 4.3. I believe I have also
seen it with 4.4-rc2/rc4, but I didn't go back then to check more thoroughly.
I haven't reported the scrubbing issue in bugzilla yet, as I have got no
feedback on my mailing list posts so far. I will bump the thread in a moment
and suggest we discuss the free space issue here and the scrubbing issue in
the other thread. I went back to 4.3 because 4.4-rc2/4 does not even boot on
my machine most of the time. I also reported this (a BTRFS-unrelated issue).

Thanks,
-- 
Martin


* Re: Still not production ready
  2015-12-13 22:35 Still not production ready Martin Steigerwald
  2015-12-13 23:19 ` Marc MERLIN
@ 2015-12-14  2:08 ` Qu Wenruo
  2015-12-14  6:21   ` Duncan
                     ` (2 more replies)
  2016-03-20 11:24 ` kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready) Martin Steigerwald
  2 siblings, 3 replies; 25+ messages in thread
From: Qu Wenruo @ 2015-12-14  2:08 UTC (permalink / raw)
  To: Martin Steigerwald, Btrfs BTRFS

Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> Hi!
>
> For me it is still not production ready.

Yes, this is the *FACT* and not everyone has a good reason to deny it.

> Again I ran into:
>
> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
> write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401

I'm not sure about the guidelines for other filesystems, but it will attract
more developers' attention if it is posted to the mailing list.

>
>
> No matter whether SLES 12 uses it as default for root, no matter whether
> Fujitsu and Facebook use it: I will not let this onto any customer machine
> without lots and lots of underprovisioning and rigorous free space monitoring.
> Actually I will renew my recommendations in my trainings to be careful with
> BTRFS.
>
>  From my experience the monitoring would check for:
>
> merkaba:~> btrfs fi show /home
> Label: 'home'  uuid: […]
>          Total devices 2 FS bytes used 156.31GiB
>          devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>          devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
>
> If "used" is same as "size" then make big fat alarm. It is not sufficient for
> it to happen. It can run for quite some time just fine without any issues, but
> I never have seen a kworker thread using 100% of one core for extended period
> of time blocking everything else on the fs without this condition being met.
>

And a special piece of advice on device size from me:
Don't use devices larger than 100G but smaller than 500G.
Over 100G leads btrfs to use big chunks, where data chunks can be at most
10G and metadata chunks 1G.

I have seen a lot of users with devices of about 100~200G hit unbalanced
chunk allocation (a 10G data chunk easily takes the last available space
and leaves later metadata with nowhere to be stored).

And unfortunately, your fs is already in the dangerous zone.
(And you are using RAID1, which means it's the same as one 170G btrfs 
with SINGLE data/meta)
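
A toy illustration of that failure mode, not the kernel's allocator. It
assumes a chunk request takes up to its maximum size from the unallocated
pool, uses the 10G data / 1G metadata sizes mentioned above, and uses the
per-device figures from the `btrfs fi show` output (170.00GiB size,
164.13GiB allocated). The smaller 1G / 0.25G sizes are an assumed comparison
case, and all names are hypothetical:

```python
def allocate(unallocated_gib, max_chunk_gib):
    """Toy allocator: take a chunk of up to max_chunk_gib from the pool.

    Returns (chunk size taken, unallocated space left), both in GiB.
    """
    taken = min(unallocated_gib, max_chunk_gib)
    return taken, unallocated_gib - taken


def metadata_fits_after_data(unallocated_gib, data_chunk_gib, meta_chunk_gib):
    """After one data chunk allocation, is there still room for a full
    metadata chunk?"""
    _, left = allocate(unallocated_gib, data_chunk_gib)
    return left >= meta_chunk_gib


if __name__ == "__main__":
    # Unallocated space per device, from the figures quoted in this thread.
    unallocated = 170.00 - 164.13

    # Big chunks: one data allocation swallows everything that is left.
    print("big chunks:  ", metadata_fits_after_data(unallocated, 10.0, 1.0))

    # Assumed smaller chunks for comparison: metadata still has room.
    print("small chunks:", metadata_fits_after_data(unallocated, 1.0, 0.25))
```

In this model a single data allocation with the big chunk size consumes the
whole remaining 5.87GiB, so a later metadata allocation finds nothing
unallocated, which is the unbalanced situation described above.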

>
> In addition to that last time I tried it aborts scrub any of my BTRFS
> filesstems. Reported in another thread here that got completely ignored so
> far. I think I could go back to 4.2 kernel to make this work.

Unfortunately, this happens a lot, even when you post to the mailing list.
Devs here are always busy locating bugs, adding new features, or
enhancing current behavior.

So *PLEASE* be patient with such slow responses.

BTW, you may not want to revert to 4.2 until some bug fixes are backported
to 4.2, as the qgroup rework in 4.2 broke delayed refs and caused some scrub
bugs. (My fault)

>
>
> I am not going to bother to go into more detail on any on this, as I get the
> impression that my bug reports and feedback get ignored. So I spare myself the
> time to do this work for now.
>
>
> Only thing I wonder now whether this all could be cause my /home is already
> more than one and a half year old. Maybe newly created filesystems are created
> in a way that prevents these issues? But it already has a nice global reserve:
>
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.98GiB, used=24.07GiB
> System, RAID1: total=19.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=536.80MiB
> GlobalReserve, single: total=192.00MiB, used=0.00B
>
>
> Actually when I see that this free space thing is still not fixed for good I
> wonder whether it is fixable at all. Is this an inherent issue of BTRFS or
> more generally COW filesystem design?

GlobalReserve is just space reserved *INSIDE* metadata for some corner
cases, so its profile is always single.

The real problem is how we represent it in btrfs-progs.

If the output looked like the following, I think you wouldn't complain about
it any more:
 > merkaba:~> btrfs fi df /
 > Data, RAID1: total=27.98GiB, used=24.07GiB
 > System, RAID1: total=19.00MiB, used=16.00KiB
 > Metadata, RAID1: total=2.00GiB, used=728.80MiB

Or
 > merkaba:~> btrfs fi df /
 > Data, RAID1: total=27.98GiB, used=24.07GiB
 > System, RAID1: total=19.00MiB, used=16.00KiB
 > Metadata, RAID1: total=2.00GiB, used=(536.80 + 192.00)MiB
 >  \ GlobalReserve: total=192.00MiB, used=0.00B
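
The arithmetic behind those two alternative outputs can be sketched as
follows. This only post-processes the current `btrfs fi df` text and is not
what btrfs-progs does; the function names are hypothetical:

```python
import re

# Matches lines such as:
#   Metadata, RAID1: total=2.00GiB, used=536.80MiB
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<total_unit>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<used_unit>[KMGT]iB|B)$"
)

# Conversion factors from each unit to MiB.
TO_MIB = {"B": 1 / 2**20, "KiB": 1 / 2**10, "MiB": 1, "GiB": 2**10, "TiB": 2**20}

# Sample taken from the output quoted in this thread.
SAMPLE = """\
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B
"""


def parse_df(text):
    """Return {kind: {"total": MiB, "used": MiB}} for each recognised line."""
    usage = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            usage[m["kind"]] = {
                "total": float(m["total"]) * TO_MIB[m["total_unit"]],
                "used": float(m["used"]) * TO_MIB[m["used_unit"]],
            }
    return usage


def metadata_used_with_reserve(text):
    """Metadata 'used' in MiB, counting the whole GlobalReserve as used."""
    usage = parse_df(text)
    return usage["Metadata"]["used"] + usage["GlobalReserve"]["total"]


if __name__ == "__main__":
    print("Metadata used incl. reserve: %.2fMiB" % metadata_used_with_reserve(SAMPLE))
```

With the figures above this adds the 192.00MiB reserve to the 536.80MiB of
used metadata, matching the 728.80MiB shown in the first proposed output.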

>
> I think it got somewhat better. It took much longer to come into that state
> again than last time, but still, blocking like this is *no* option for a
> *production ready* filesystem.
>
>
>
> I am seriously consider to switch to XFS for my production laptop again. Cause
> I never saw any of these free space issues with any of the XFS or Ext4
> filesystems I used in the last 10 years.

Yes, xfs and ext4 are very stable for normal use cases.

But at least I won't recommend xfs yet, and considering the nature of
journal-based filesystems, I'd recommend a backup power supply during crash
recovery for both of them.

Xfs has already messed up several test environments of mine, and an
unfortunate double power loss destroyed my whole /home ext4 partition
years ago.

[xfs story]
After several crashes, xfs left several corrupted files truncated to 0 size,
including my kernel .git directory. Since then I haven't trusted it.
Not to mention that grub2 support for xfs v5 is not here yet.

[ext4 story]
For ext4, while recovering my /home partition after a power loss, another
power loss happened, and my home partition was doomed.
Only a few nonsensical files were salvaged.

Thanks,
Qu
>
> Thanks,
>


* Re: Still not production ready
  2015-12-14  2:08 ` Still not production ready Qu Wenruo
@ 2015-12-14  6:21   ` Duncan
  2015-12-14  7:32     ` Qu Wenruo
  2015-12-14  8:18   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
  2015-12-15 21:59   ` Still not production ready Chris Mason
  2 siblings, 1 reply; 25+ messages in thread
From: Duncan @ 2015-12-14  6:21 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:

> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>> Hi!
>>
>> For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.

In the above sentence, I /think/ you (Qu) agree with Martin (and me) that 
btrfs shouldn't be considered production ready... yet, and the first part 
of the sentence makes it very clear that you feel strongly about the 
*FACT*, but the second half of the sentence (after *FACT*) doesn't parse 
well in English, thus leaving the entire sentence open to interpretation, 
tho it's obvious either way that you feel strongly about it. =:^\

At the risk of getting it completely wrong, what I /think/ you meant to 
say is (as expanded in typically Duncan fashion =:^)...

Yes, this is the *FACT*, though some people have reasons to deny it.

Presumably, said reasons would include the fact that various distros are 
trying to sell enterprise support contracts to customers very eager to 
have the features that btrfs provides, and said customers are willing to 
pay for assurances that the solutions they're buying are "production 
ready", whether that's actually the case or not, presumably because said 
payment is (in practice) simply ensuring there's someone else to pin the 
blame on if things go bad.

And the demonstration of that would be the continued fact that people 
otherwise unnecessarily continue to pay rather large sums of money for 
that very assurance, when in practice, they'd get equal or better support 
not worrying about that payment, but instead actually making use of free-
of-cost resources such as this list.

[Linguistic analysis, see frequent discussion of this topic at Language 
Log, which I happen to subscribe to as I find this sort of thing 
interesting, for more commentary and examples of the same general issue: 
http://languagelog.net ]

The problem with the sentence as originally written, is that English 
doesn't deal well with multi-negation, sometimes considering each 
negation an inversion of the previous (as do most programming languages 
and thus programmers), while other times or as read/heard/interpreted by 
others repeated negation may be considered a strengthening of the 
original negation.

Regardless, mis-negation due to speaker/writer confusion is quite common 
even among native English speakers/writers.

The negating words in question here are "not" and "deny".  If you will 
note, my rewrite kept "deny", but rewrote the "not" out of the sentence, 
so there's only one negative to worry about, making the meaning much 
clearer as the reader's mind isn't left trying to figure out what the 
speaker meant with the double-negative (mistake? deliberate canceling out 
of the first negative with the second? deliberate intensifier?)  and thus 
unable to be sure one way or the other what was meant.

And just in case there would have been doubt, the explanation then makes 
doubly obvious what I think your intent was by expanding on it.  Of 
course that's easy to do as I entirely agree.

OTOH if I'm mistaken as to your intent and you meant it the other way... 
well then you'll need to do the explaining as then the implication is 
that some people have good reasons to deny it and you agree with them, 
but without further expansion, I wouldn't know where you're trying to go 
with that claim.

Just in case there's any doubt left of my own opinion on the original 
claim of not production ready in the above discussion, let me be 
explicit:  I (too) agree with Martin (and I think with Qu) that btrfs 
isn't yet production ready.  But I don't believe you'll find many on the 
list taking issue with that, as I think everybody on-list agrees, btrfs 
/isn't/ production ready.  Certainly pretty much just that has been 
repeatedly stated in individualized style by many posters including 
myself, and I've yet to see anyone take serious issue with it.

>> No matter whether SLES 12 uses it as default for root, no matter
>> whether Fujitsu and Facebook use it: I will not let this onto any
>> customer machine without lots and lots of underprovisioning and
>> rigorous free space monitoring.
>> Actually I will renew my recommendations in my trainings to be careful
>> with BTRFS.

... And were I to put money on it, my money would be on every regular on-
list poster 100% agreeing with that. =:^)

>>
>>  From my experience the monitoring would check for:
>>
>> merkaba:~> btrfs fi show /home
>>          Label: 'home'  uuid: […]
>>          Total devices 2 FS bytes used 156.31GiB
>>          devid    1 size 170.00GiB used 164.13GiB path /dev/[path1]
>>          devid    2 size 170.00GiB used 164.13GiB path /dev/[path2]
>>
>> If "used" is same as "size" then make big fat alarm. It is not
>> sufficient for it to happen. It can run for quite some time just fine
>> without any issues, but I never have seen a kworker thread using 100%
>> of one core for extended period of time blocking everything else on the
>> fs without this condition being met.

Astutely observed. =:^)

> And specially advice on the device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will leads btrfs to use big chunks, where data chunks can be
> at most 10G and metadata to be 1G.

Thanks, Qu.  This is the first time I've seen such specifics both in 
terms of the big-chunks trigger (minimum 100 GiB effective usable 
filesystem size) and in terms of how big those big chunks are (10 GiB 
data, 1 GiB metadata).

Filed away for further reference. =:^)

> I have seen a lot of users with about 100~200G device, and hit
> unbalanced chunk allocation (10G data chunk easily takes the last
> available space and makes later metadata no where to store)

That does indeed seem to be a reoccurring theme.  Now I know why, and 
where the big-chunks trigger is. =:^)

And to add, while the kernel now does empty-chunk reaping, returning them 
to the unallocated pool, the chances of a 10 GiB chunk being mostly empty 
but still having at least one small extent still locking it in place as 
not entirely empty, and thus not reapable, are obviously going to be at 
least an order of magnitude higher (and in practice likely more, due to a 
likely nonlinearly greater share of files being under 10 GiB size than 
under 1 GiB size) than the chances at the 1 GiB chunk size.

> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs
> with SINGLE data/meta)

That raid1 parenthetical is why I chose the "effective usable filesystem 
size" wording above, to try to word it broadly enough to include all the 
different replication/parity variants.

>> Reported in another thread here that got completely ignored
>> so far. I think I could go back to 4.2 kernel to make this work.
> 
> Unfortunately, this happens a lot of times, even you posted it to mail
> list.
> Devs here are always busy locating bugs or adding new features or
> enhancing current behavior.
> 
> So *PLEASE* be patient about such slow response.

Yes indeed.

Generally speaking, one post/thread alone isn't likely to get the eye of 
a dev unless they happen to be between bug-hunting projects at that 
moment.  But several posts/threads, particularly over a couple kernel 
cycles or from multiple posters, a trend makes, and then it's much more 
likely to catch attention.

> BTW, you may not want to revert to 4.2 until some bug fix is backported
> to 4.2.
> As qgroup rework in 4.2 has broken delayed ref and caused some scrub
> bugs. (My fault)

Good point.  (Tho I never happened to trigger those scrub bugs here, but 
I strongly suspect that's because I both use quite small filesystems, 
well under that 100 GiB effective size barrier mentioned above, and 
relatively fast ssds, so my scrubs are done in under a minute and don't 
tend to be subject to the same sort of IO bottlenecking and races that 
scrubs on spinning rust at 100 GiB plus filesystem sizes tend to be.)

>> I think it got somewhat better. It took much longer to come into that
>> state again than last time, but still, blocking like this is *no*
>> option for a *production ready* filesystem.

Agreed on both counts.  The problem should be markedly better since the 
empty-chunk-reaping went into (IIRC) 3.17, to the point that we're only 
now beginning to see reports of it being triggered again, while 
previously people were seeing it repeatedly, often monthly or more 
frequently.

But it's still not hitting the expectations for a production-ready 
filesystem, but then again, I've yet to see a list regular actually make 
anything like a claim that btrfs is in fact production ready; rather the 
opposite, in fact, and repeatedly.

What distros might be claiming is another matter, but arguably, people 
relying on their claims should be following up by demanding support from 
the distros making them, based on the claims they made.  Meanwhile, on 
this list we're /not/ making those claims and thus cannot reasonably be 
held to them as if we were.

>> I am seriously consider to switch to XFS for my production laptop
>> again. Cause I never saw any of these free space issues with any of the
>> XFS or Ext4 filesystems I used in the last 10 years.
> 
> Yes, xfs and ext4 is very stable for normal use case.
> 
> But at least, I won't recommend xfs yet, and considering the nature or
> journal based fs, I'll recommend backup power supply in crash recovery
> for both of them.
> 
> Xfs already messed up several test environment of mine, and an
> unfortunate double power loss has destroyed my whole /home ext4
> partition years ago.
> 
> [xfs story]
> After several crash, xfs makes several corrupted file just to 0 size.
> Including my kernel .git directory. Then I won't trust it any longer.
> No to mention that grub2 support for xfs v5 is not here yet.
> 
> [ext4 story]
> For ext4, when recovering my /home partition after a power loss, a new
> power loss happened, and my home partition is doomed.
> Only several non-sense files are savaged.

As they say YMMV, but FWIW, despite the stories from the pre-data=ordered-
by-default era, and with the acknowledgment that a single anecdote or 
even a small but unrandomized sampling of anecdotes doesn't a scientific 
study make, I've actually had surprisingly good luck with reiserfs here, 
even on hardware that I had little reason to expect a filesystem to 
actually work reliably on (bad memory incidents, overheated and head-
crashed drive incident where after cooldown I took the mounted at the 
time partitions out of use and successfully and reliably continued to use 
other partitions on the drive, old and burst capacitor and thus power-
unstable mobo incident,... etc, tho not all at once, fortunately!).

ATM I use btrfs on my SSDs but continue to use reiserfs on my spinning 
rust, and FWIW, reiserfs has continued to be as reliable as I'd expect a 
deeply mature and stable filesystem to be, while btrfs... has been as 
occasionally but arguably dependably buggy as I'd expect a still under 
heavy development tho past "experimental", still stabilizing and not yet 
mature filesystem to be.

Tho pre-ordered-by-default era, I remember a few of those 0-size-
truncated files on reiserfs, too.  But the ordered-by-default 
introduction was long in the past even when the 3.0 kernel was new, so is 
pretty well pre-history, by now (which I guess qualifies me as a Linux 
old fogey by now, even if I didn't really get into it to speak of until 
the turn of the century or so, after MS gave me the push by very 
specifically and deliberately shipping malware in eXPrivacy, thus 
crossing a line I was never to cross with them).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Still not production ready
  2015-12-14  6:21   ` Duncan
@ 2015-12-14  7:32     ` Qu Wenruo
  2015-12-14 12:10       ` Duncan
  0 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2015-12-14  7:32 UTC (permalink / raw)
  To: Duncan, linux-btrfs



Duncan wrote on 2015/12/14 06:21 +0000:
> Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:
>
>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>> Hi!
>>>
>>> For me it is still not production ready.
>>
>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>
> In the above sentence, I /think/ you (Qu) agree with Martin (and I) that
> btrfs shouldn't be considered production ready... yet, and the first part
> of the sentence makes it very clear that you feel strongly about the
> *FACT*, but the second half of the sentence (after *FACT*) doesn't parse
> well in English, thus leaving the entire sentence open to interpretation,
> tho it's obvious either way that you feel strongly about it. =:^\

Oh, my poor English... :(

The latter half is just in case someone considers btrfs stable in some
respects.

>
> At the risk of getting it completely wrong, what I /think/ you meant to
> say is (as expanded in typically Duncan fashion =:^)...
>
> Yes, this is the *FACT*, though some people have reasons to deny it.

Right! That's what I want to say!!

>
> Presumably, said reasons would include the fact that various distros are
> trying to sell enterprise support contracts to customers very eager to
> have the features that btrfs provides, and said customers are willing to
> pay for assurances that the solutions they're buying are "production
> ready", whether that's actually the case or not, presumably because said
> payment is (in practice) simply ensuring there's someone else to pin the
> blame on if things go bad.
>
> And the demonstration of that would be the continued fact that people
> otherwise unnecessarily continue to pay rather large sums of money for
> that very assurance, when in practice, they'd get equal or better support
> not worrying about that payment, but instead actually making use of free-
> of-cost resources such as this list.
>
>
> [Linguistic analysis, see frequent discussion of this topic at Language
> Log, which I happen to subscribe to as I find this sort of thing
> interesting, for more commentary and examples of the same general issue:
> http://languagelog.net ]
>
> The problem with the sentence as originally written, is that English
> doesn't deal well with multi-negation, sometimes considering each
> negation an inversion of the previous (as do most programming languages
> and thus programmers), while other times or as read/heard/interpreted by
> others repeated negation may be considered a strengthening of the
> original negation.
>
> Regardless, mis-negation due to speaker/writer confusion is quite common
> even among native English speakers/writers.
>
> The negating words in question here are "not" and "deny".  If you will
> note, my rewrite kept "deny", but rewrote the "not" out of the sentence,
> so there's only one negative to worry about, making the meaning much
> clearer as the reader's mind isn't left trying to figure out what the
> speaker meant with the double-negative (mistake? deliberate canceling out
> of the first negative with the second? deliberate intensifier?)  and thus
> unable to be sure one way or the other what was meant.
>
> And just in case there would have been doubt, the explanation then makes
> doubly obvious what I think your intent was by expanding on it.  Of
> course that's easy to do as I entirely agree.
>
> OTOH if I'm mistaken as to your intent and you meant it the other way...
> well then you'll need to do the explaining as then the implication is
> that some people have good reasons to deny it and you agree with them,
> but without further expansion, I wouldn't know where you're trying to go
> with that claim.
>
>
> Just in case there's any doubt left of my own opinion on the original
> claim of not production ready in the above discussion, let me be
> explicit:  I (too) agree with Martin (and I think with Qu) that btrfs
> isn't yet production ready.  But I don't believe you'll find many on the
> list taking issue with that, as I think everybody on-list agrees, btrfs
> /isn't/ production ready.  Certainly pretty much just that has been
> repeatedly stated in individualized style by many posters including
> myself, and I've yet to see anyone take serious issue with it.
>
>>> No matter whether SLES 12 uses it as default for root, no matter
>>> whether Fujitsu and Facebook use it: I will not let this onto any
>>> customer machine without lots and lots of underprovisioning and
>>> rigorous free space monitoring.
>>> Actually I will renew my recommendations in my trainings to be careful
>>> with BTRFS.
>
> ... And were I to put money on it, my money would be on every regular on-
> list poster 100% agreeing with that. =:^)
>
>>>
>>>   From my experience the monitoring would check for:
>>>
>>> merkaba:~> btrfs fi show /home
>>>           Label: 'home'  uuid: […]
>>>           Total devices 2 FS bytes used 156.31GiB
>>>           devid    1 size 170.00GiB used 164.13GiB path /dev/[path1]
>>>           devid    2 size 170.00GiB used 164.13GiB path /dev/[path2]
>>>
>>> If "used" is same as "size" then make big fat alarm. It is not
>>> sufficient for it to happen. It can run for quite some time just fine
>>> without any issues, but I never have seen a kworker thread using 100%
>>> of one core for extended period of time blocking everything else on the
>>> fs without this condition being met.
>
> Astutely observed. =:^)
>
>
>> And specially advice on the device size from myself:
>> Don't use devices over 100G but less than 500G.
>> Over 100G will leads btrfs to use big chunks, where data chunks can be
>> at most 10G and metadata to be 1G.
>
> Thanks, Qu.  This is the first time I've seen such specifics both in
> terms of the big-chunks trigger (minimum 100 GiB effective usable
> filesystem size) and in terms of how big those big chunks are (10 GiB
> data, 1 GiB metadata).
>
> Filed away for further reference. =:^)
>
>> I have seen a lot of users with about 100~200G device, and hit
>> unbalanced chunk allocation (10G data chunk easily takes the last
>> available space and makes later metadata no where to store)
>
> That does indeed seem to be a reoccurring theme.  Now I know why, and
> where the big-chunks trigger is. =:^)
>
> And to add, while the kernel now does empty-chunk reaping, returning them
> to the unallocated pool, the chances of a 10 GiB chunk being mostly empty
> but still having at least one small extent still locking it in place as
> not entirely empty, and thus not reapable, are obviously going to be at
> least an order of magnitude higher (and in practice likely more, due to a
> likely unlinearly greater share of files being under 10 GiB size than
> under 1 GiB size) than the chances at the 1 GiB chunk size.
>
>> And unfortunately, your fs is already in the dangerous zone.
>> (And you are using RAID1, which means it's the same as one 170G btrfs
>> with SINGLE data/meta)
>
> That raid1 parenthetical is why I chose the "effective usable filesystem
> size" wording above, to try to word it broadly enough to include all the
> different replication/parity variants.
>
>>> Reported in another thread here that got completely ignored
>>> so far. I think I could go back to 4.2 kernel to make this work.
>>
>> Unfortunately, this happens a lot of times, even you posted it to mail
>> list.
>> Devs here are always busy locating bugs or adding new features or
>> enhancing current behavior.
>>
>> So *PLEASE* be patient about such slow response.
>
> Yes indeed.
>
> Generally speaking, one post/thread alone isn't likely to get the eye of
> a dev unless they happen to be between bug-hunting projects at that
> moment.  But several posts/threads, particularly over a couple kernel
> cycles or from multiple posters, a trend makes, and then it's much more
> likely to catch attention.
>
>> BTW, you may not want to revert to 4.2 until some bug fix is backported
>> to 4.2.
>> As qgroup rework in 4.2 has broken delayed ref and caused some scrub
>> bugs. (My fault)
>
> Good point.  (Tho I never happened to trigger those scrub bugs here, but
> I strongly suspect that's because I both use quite small filesystems,
> well under that 100 GiB effective size barrier mentioned above, and
> relatively fast ssds, so my scrubs are done in under a minute and don't
> tend to be subject to the same sort of IO bottlenecking and races that
> scrubs on spinning rust at 100 GiB plus filesystem sizes tend to be.)
>
>>> I think it got somewhat better. It took much longer to come into that
>>> state again than last time, but still, blocking like this is *no*
>>> option for a *production ready* filesystem.
>
> Agreed on both counts.  The problem should be markedly better since the
> empty-chunk-reaping went into (IIRC) 3.17, to the point that we're only
> now beginning to see reports of it being triggered again, while
> previously people were seeing it repeatedly, often monthly or more
> frequently.
>
> But it's still not hitting the expectations for a production-ready
> filesystem, but then again, I've yet to see a list regular actually make
> anything like a claim that btrfs is in fact production ready; rather the
> opposite, in fact, and repeatedly.
>
> What distros might be claiming is another matter, but arguably, people
> relying on their claims should be following up by demanding support from
> the distros making them, based on the claims they made.  Meanwhile, on
> this list we're /not/ making those claims and thus cannot reasonably be
> held to them as if we were.
>
>>> I am seriously consider to switch to XFS for my production laptop
>>> again. Cause I never saw any of these free space issues with any of the
>>> XFS or Ext4 filesystems I used in the last 10 years.
>>
>> Yes, xfs and ext4 is very stable for normal use case.
>>
>> But at least, I won't recommend xfs yet, and considering the nature or
>> journal based fs, I'll recommend backup power supply in crash recovery
>> for both of them.
>>
>> Xfs already messed up several test environment of mine, and an
>> unfortunate double power loss has destroyed my whole /home ext4
>> partition years ago.
>>
>> [xfs story]
>> After several crash, xfs makes several corrupted file just to 0 size.
>> Including my kernel .git directory. Then I won't trust it any longer.
>> No to mention that grub2 support for xfs v5 is not here yet.
>>
>> [ext4 story]
>> For ext4, when recovering my /home partition after a power loss, a new
>> power loss happened, and my home partition is doomed.
>> Only several non-sense files are savaged.
>
> As they say YMMV, but FWIW, despite the stories from the pre-data=ordered-
> by-default era, and with the acknowledgment that a single anecdote or
> even a small but unrandomized sampling of anecdotes doesn't a scientific
> study make,

Yes, that's right, all I had was just an unfortunate sample.
But for people, that will still leave a bad impression.

Thanks,
Qu


> I've actually had surprisingly good luck with reiserfs here,
> even on hardware that I had little reason to expect a filesystem to
> actually work reliably on (bad memory incidents, overheated and head-
> crashed drive incident where after cooldown I took the mounted at the
> time partitions out of use and successfully and reliably continued to use
> other partitions on the drive, old and burst capacitor and thus power-
> unstable mobo incident,... etc, tho not all at once, fortunately!).
>
> ATM I use btrfs on my SSDs but continue to use reiserfs on my spinning
> rust, and FWIW, reiserfs has continued to be as reliable as I'd expect a
> deeply mature and stable filesystem to be, while btrfs... has been as
> occasionally but arguably dependably buggy as I'd expect a still under
> heavy development tho past "experimental", still stabilizing and not yet
> mature filesystem to be.
>
>
> Tho pre-ordered-by-default era, I remember a few of those 0-size-
> truncated files on reiserfs, too.  But the ordered-by-default
> introduction was long in the past even when the 3.0 kernel was new, so is
> pretty well pre-history, by now (which I guess qualifies me as a Linux
> old fogey by now, even if I didn't really get into it to speak of until
> the turn of the century or so, after MS gave me the push by very
> specifically and deliberately shipping malware in eXPrivacy, thus
> crossing a line I was never to cross with them).
>




* Re: Still not production ready
  2015-12-14  7:32     ` Qu Wenruo
@ 2015-12-14 12:10       ` Duncan
  2015-12-14 19:08         ` Chris Murphy
  0 siblings, 1 reply; 25+ messages in thread
From: Duncan @ 2015-12-14 12:10 UTC (permalink / raw)
  To: linux-btrfs

Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:

> Oh, my poor English... :(

Well, as I said, native English speakers commonly enough mis-negate...

The real issue seems to be that English simply lacks proper support for 
the double-negatives feature that people keep wanting to use, despite the 
fact that it yields an officially undefined result that compilers (people 
reading/hearing) don't quite know what to do with, with actual results 
often throwing warnings and generally changing from compiler to
compiler. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Still not production ready
  2015-12-14 12:10       ` Duncan
@ 2015-12-14 19:08         ` Chris Murphy
  2015-12-14 20:33           ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Murphy @ 2015-12-14 19:08 UTC (permalink / raw)
  Cc: Btrfs BTRFS

On Mon, Dec 14, 2015 at 5:10 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:
>
>> Oh, my poor English... :(
>
> Well, as I said, native English speakers commonly enough mis-negate...
>
> The real issue seems to be that English simply lacks proper support for
> the double-negatives feature that people keep wanting to use, despite the
> fact that it yields an officially undefined result that compilers (people
> reading/hearing) don't quite know what to do with, with actual results
> often throwing warnings and generally changing from compiler to
> compiler . =:^)

It's a trap! Haha. Yeah like you say, it's not a matter of poor
English. Qu writes very understandable English. Officially in English
the negatives should cancel, which is different in many other
languages where additional negatives amplify. But even native English
speakers have dialects where it amplifies, rather than cancels. So I'd
consider the double or multiple negative in English as a
colloquialism. And a trap!


-- 
Chris Murphy


* Re: Still not production ready
  2015-12-14 19:08         ` Chris Murphy
@ 2015-12-14 20:33           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 25+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-14 20:33 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2015-12-14 14:08, Chris Murphy wrote:
> On Mon, Dec 14, 2015 at 5:10 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Qu Wenruo posted on Mon, 14 Dec 2015 15:32:02 +0800 as excerpted:
>>
>>> Oh, my poor English... :(
>>
>> Well, as I said, native English speakers commonly enough mis-negate...
>>
>> The real issue seems to be that English simply lacks proper support for
>> the double-negatives feature that people keep wanting to use, despite the
>> fact that it yields an officially undefined result that compilers (people
>> reading/hearing) don't quite know what to do with, with actual results
>> often throwing warnings and generally changing from compiler to
>> compiler . =:^)
>
> It's a trap! Haha. Yeah like you say, it's not a matter of poor
> English. Qu writes very understandable English. Officially in English
> the negatives should cancel, which is different in many other
> languages where additional negatives amplify. But even native English
> speakers have dialects where it amplifies, rather than cancels. So I'd
> consider the double or multiple negative in English as a
> colloquialism. And a trap!
>
Some days I really wish Esperanto or Interlingua had actually caught on...

Or even Lojban, at least then the language would be more like the 
systems being discussed, even if it would be a serious pain to learn and 
use.


* still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready)
  2015-12-14  2:08 ` Still not production ready Qu Wenruo
  2015-12-14  6:21   ` Duncan
@ 2015-12-14  8:18   ` Martin Steigerwald
  2015-12-14  8:48     ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Qu Wenruo
  2015-12-15 21:59   ` Still not production ready Chris Mason
  2 siblings, 1 reply; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-14  8:18 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Am Montag, 14. Dezember 2015, 10:08:16 CET schrieb Qu Wenruo:
> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> > Hi!
> > 
> > For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> 
> > Again I ran into:
> > 
> > btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > random write into big file
> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Not sure about guideline for other fs, but it will attract more dev's
> attention if it can be posted to maillist.

I did, as mentioned in the bug report:

BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald | 26 Dec 14:37 2014
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790

> > No matter whether SLES 12 uses it as default for root, no matter whether
> > Fujitsu and Facebook use it: I will not let this onto any customer machine
> > without lots and lots of underprovisioning and rigorous free space
> > monitoring. Actually I will renew my recommendations in my trainings to
> > be careful with BTRFS.
> > 
> >  From my experience the monitoring would check for:
> > merkaba:~> btrfs fi show /home
> > Label: 'home'  uuid: […]
> > 
> >          Total devices 2 FS bytes used 156.31GiB
> >          devid    1 size 170.00GiB used 164.13GiB path
> >          /dev/mapper/msata-home
> >          devid    2 size 170.00GiB used 164.13GiB path
> >          /dev/mapper/sata-home
> > 
> > If "used" is same as "size" then make big fat alarm. It is not sufficient
> > for it to happen. It can run for quite some time just fine without any
> > issues, but I never have seen a kworker thread using 100% of one core for
> > extended period of time blocking everything else on the fs without this
> > condition being met.
> And specially advice on the device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will leads btrfs to use big chunks, where data chunks can be
> at most 10G and metadata to be 1G.
> 
> I have seen a lot of users with about 100~200G device, and hit
> unbalanced chunk allocation (10G data chunk easily takes the last
> available space and makes later metadata no where to store)

Interesting, but in my case there is still quite some free space in already 
allocated metadata chunks. Anyway, I did have ENOSPC issues when trying to
balance the chunks.

> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs
> with SINGLE data/meta)

Well, I know that for any FS it is recommended not to let it run full and to
leave at least about 10-15% free, but while it is not 10-15% anymore, it is
still a whopping 11-12 GiB of free space. I would accept somewhat slower
operation in this case, but not a kworker at 100% for about 10-30 seconds
blocking everything else going on on the filesystem. For whatever reason
Plasma seems to access the fs on almost every action I do with it, so during
that time not even the panels slide out and the activity switcher does not
work.
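
The monitoring condition quoted above (alarm when a device's "used" gets
close to its "size" in `btrfs fi show`) could be sketched roughly as below.
This is only an illustration: it parses the sample output from this thread
rather than calling btrfs itself, the function names are made up, and the
10 GiB threshold is an assumption picked only because it matches the largest
data chunk size Qu mentions.

```python
import re

# Sample output copied from the `btrfs fi show /home` listing in this thread.
SAMPLE = """\
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.31GiB
        devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# One "devid" line: device id, size, allocated ("used") space, device path.
DEVID_RE = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)(\w+)\s+used\s+([\d.]+)(\w+)\s+path\s+(\S+)"
)


def to_bytes(number, unit):
    """Convert a value such as ('170.00', 'GiB') to bytes."""
    return float(number) * UNITS[unit]


def unallocated_per_device(text):
    """Return {path: bytes not yet allocated to chunks} for each devid line."""
    result = {}
    for match in DEVID_RE.finditer(text):
        _devid, size, size_unit, used, used_unit, path = match.groups()
        result[path] = to_bytes(size, size_unit) - to_bytes(used, used_unit)
    return result


def devices_to_alarm(text, min_unallocated=10 * 2**30):
    """Paths whose unallocated space is below the (hypothetical) threshold."""
    free = unallocated_per_device(text)
    return sorted(path for path, left in free.items() if left < min_unallocated)


if __name__ == "__main__":
    for path in devices_to_alarm(SAMPLE):
        print(f"ALARM: {path} is almost fully allocated to chunks")
```

As noted in the thread, this condition is necessary but not sufficient for
the hang, so such a check can only serve as an early warning.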

> > In addition to that last time I tried it aborts scrub any of my BTRFS
> > filesstems. Reported in another thread here that got completely ignored so
> > far. I think I could go back to 4.2 kernel to make this work.
> 
> Unfortunately, this happens a lot of times, even you posted it to mail list.
> Devs here are always busy locating bugs or adding new features or
> enhancing current behavior.
> 
> So *PLEASE* be patient about such slow response.

Okay, thanks at least for the acknowledgement of this. I will try to be even
more patient.

> BTW, you may not want to revert to 4.2 until some bug fix is backported
> to 4.2.
> As qgroup rework in 4.2 has broken delayed ref and caused some scrub
> bugs. (My fault)

Hm, well, scrubbing does not work for me either, but that is since
4.3/4.4rc2/4. I just bumped the thread:

Re: [4.3-rc4] scrubbing aborts before finishing

by replying to it a third time (not a fourth, I miscounted :).

> > I am not going to bother to go into more detail on any on this, as I get
> > the impression that my bug reports and feedback get ignored. So I spare
> > myself the time to do this work for now.
> > 
> > 
> > Only thing I wonder now whether this all could be cause my /home is
> > already
> > more than one and a half year old. Maybe newly created filesystems are
> > created in a way that prevents these issues? But it already has a nice
> > global reserve:
> > 
> > merkaba:~> btrfs fi df /
> > Data, RAID1: total=27.98GiB, used=24.07GiB
> > System, RAID1: total=19.00MiB, used=16.00KiB
> > Metadata, RAID1: total=2.00GiB, used=536.80MiB
> > GlobalReserve, single: total=192.00MiB, used=0.00B
> > 
> > 
> > Actually when I see that this free space thing is still not fixed for good
> > I wonder whether it is fixable at all. Is this an inherent issue of BTRFS
> > or more generally COW filesystem design?
> 
> GlobalReserve is just a reserved space *INSIDE* metadata for some corner
> case. So its profile is always single.
> 
> The real problem is, how we represent it in btrfs-progs.
> 
> If it output like below, I think you won't complain about it more:
>  > merkaba:~> btrfs fi df /
>  > Data, RAID1: total=27.98GiB, used=24.07GiB
>  > System, RAID1: total=19.00MiB, used=16.00KiB
>  > Metadata, RAID1: total=2.00GiB, used=728.80MiB
> 
> Or
> 
>  > merkaba:~> btrfs fi df /
>  > Data, RAID1: total=27.98GiB, used=24.07GiB
>  > System, RAID1: total=19.00MiB, used=16.00KiB
>  > Metadata, RAID1: total=2.00GiB, used=(536.80 + 192.00)MiB
>  > 
>  >  \ GlobalReserve: total=192.00MiB, used=0.00B

Oh, the global reserve is *inside* the existing metadata chunks? That's
interesting. I didn't know that.
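
To make the arithmetic in Qu's proposed output concrete: since the global
reserve lives inside the metadata chunks, its total can be counted as used
metadata space. A minimal sketch, assuming the `btrfs fi df` output format
shown in this thread (the helper names are made up for illustration):

```python
import re

# Sample output copied from the `btrfs fi df /` listing in this thread.
SAMPLE = """\
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# One line per block group type: name, profile, total and used space.
LINE_RE = re.compile(
    r"^(\w+), (\w+): total=([\d.]+)(\w+), used=([\d.]+)(\w+)$", re.MULTILINE
)


def parse_fi_df(text):
    """Return {block group type: (total_bytes, used_bytes)}."""
    groups = {}
    for match in LINE_RE.finditer(text):
        name, _profile, total, total_unit, used, used_unit = match.groups()
        groups[name] = (float(total) * UNITS[total_unit],
                        float(used) * UNITS[used_unit])
    return groups


def metadata_used_with_reserve(text):
    """Metadata 'used' plus the whole GlobalReserve, in bytes."""
    groups = parse_fi_df(text)
    metadata_used = groups["Metadata"][1]
    reserve_total = groups.get("GlobalReserve", (0.0, 0.0))[0]
    return metadata_used + reserve_total


if __name__ == "__main__":
    used_mib = metadata_used_with_reserve(SAMPLE) / 2**20
    print(f"Metadata used including reserve: {used_mib:.2f}MiB")
```

For the numbers above this gives 536.80 MiB + 192.00 MiB = 728.80 MiB, the
value in the first proposed output.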

> > I am seriously consider to switch to XFS for my production laptop again.
> > Cause I never saw any of these free space issues with any of the XFS or
> > Ext4 filesystems I used in the last 10 years.
> 
> Yes, xfs and ext4 is very stable for normal use case.
> 
> But at least, I won't recommend xfs yet, and considering the nature or
> journal based fs, I'll recommend backup power supply in crash recovery
> for both of them.
> 
> Xfs already messed up several test environment of mine, and an
> unfortunate double power loss has destroyed my whole /home ext4
> partition years ago.

Wow. I have never seen this. Actually, I teach that journaling filesystems
are quite safe on power loss as long as cache flush (formerly barrier)
functionality is active and working. With one caveat: it relies on a sector
being either completely written or not written at all. I have never seen any
scientific proof for that on usual storage devices.

> [xfs story]
> After several crash, xfs makes several corrupted file just to 0 size.
> Including my kernel .git directory. Then I won't trust it any longer.
> No to mention that grub2 support for xfs v5 is not here yet.

That is no filesystem metadata structure crash. It is a known issue with 
delayed allocation. Same with Ext4. I teach this as well in my performance 
analysis & tuning course.

Main cause is the following: Both XFS and Ext4 use delayed allocation, i.e.:

dd if=/dev/zero of=zeros bs=1M count=100 ; rm zeros

will neither allocate nor write a single byte of file data, as the file is
deleted before delayed allocation kicks in.

Now, on renaming or truncating a file, the journal may record the change
before the data is actually allocated. If the machine crashes in that window,
the file can come back with zero length.

There is an epic Ubuntu bug report from the time Ext4 introduced delayed
allocation, and there has been an epic discussion. Theodore Ts'o said: Use
fsync()! Linus said: Don't break userspace. We know the app is broken, but it
worked with Ext3, so fix it. Meanwhile Ext4 has a "fix" or workaround for apps
not using fsync() properly, for the rename-over-old-file and truncate cases.
It does not use delayed allocation in these cases, basically lowering
performance.

XFS has a fix for the truncate case, but *not* for the rename case.

I believe BTRFS in principle has this issue as well. As far as I am aware it
has a fix for the rename case, not using delayed allocation in that case. Due
to its COW nature it may not be affected at all, however; I don't know.
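
For reference, the application-side pattern behind the "Use fsync()!" advice
is to write the new content to a temporary file, fsync it, and only then
rename it over the old file. A minimal sketch in Python, assuming a POSIX
system; `atomic_write` is a made-up helper name, not something from any of
the projects discussed here:

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so a crash leaves the old or new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target,
    # otherwise the rename below would not be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # data reaches stable storage first
        os.replace(tmp_path, path)  # then the rename over the old file
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory as well so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this ordering the rename can never be recorded before the data it
points to, so the application does not depend on a filesystem-specific
workaround. Note that `mkstemp` creates the file with restrictive
permissions, which a real application may need to adjust.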

> [ext4 story]
> For ext4, when recovering my /home partition after a power loss, a new
> power loss happened, and my home partition is doomed.
> Only several non-sense files are savaged.

During an fsck? Well, that is quite a special condition, I'd say. Of course I
think aborting an fsck should be safe at all times, but I wouldn't be
surprised if it wasn't.

Thanks,
-- 
Martin


* Re: still kworker at 100% cpu in all of device size allocated with chunks situations with write load
  2015-12-14  8:18   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
@ 2015-12-14  8:48     ` Qu Wenruo
  2015-12-14  8:59       ` Martin Steigerwald
  2015-12-14  9:10       ` safety of journal based fs (was: Re: still kworker at 100% cpu…) Martin Steigerwald
  0 siblings, 2 replies; 25+ messages in thread
From: Qu Wenruo @ 2015-12-14  8:48 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Btrfs BTRFS



Martin Steigerwald wrote on 2015/12/14 09:18 +0100:
> Am Montag, 14. Dezember 2015, 10:08:16 CET schrieb Qu Wenruo:
>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>> Hi!
>>>
>>> For me it is still not production ready.
>>
>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>>
>>> Again I ran into:
>>>
>>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
>>> random write into big file
>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>
>> Not sure about guideline for other fs, but it will attract more dev's
>> attention if it can be posted to maillist.
>
> I did, as mentioned in the bug report:
>
> BTRFS free space handling still needs more work: Hangs again
> Martin Steigerwald | 26 Dec 14:37 2014
> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41790
>
>>> No matter whether SLES 12 uses it as default for root, no matter whether
>>> Fujitsu and Facebook use it: I will not let this onto any customer machine
>>> without lots and lots of underprovisioning and rigorous free space
>>> monitoring. Actually I will renew my recommendations in my trainings to
>>> be careful with BTRFS.
>>>
>>>   From my experience the monitoring would check for:
>>> merkaba:~> btrfs fi show /home
>>> Label: 'home'  uuid: […]
>>>
>>>           Total devices 2 FS bytes used 156.31GiB
>>>           devid    1 size 170.00GiB used 164.13GiB path
>>>           /dev/mapper/msata-home
>>>           devid    2 size 170.00GiB used 164.13GiB path
>>>           /dev/mapper/sata-home
>>>
>>> If "used" is same as "size" then make big fat alarm. It is not sufficient
>>> for it to happen. It can run for quite some time just fine without any
>>> issues, but I never have seen a kworker thread using 100% of one core for
>>> extended period of time blocking everything else on the fs without this
>>> condition being met.
>> And specially advice on the device size from myself:
>> Don't use devices over 100G but less than 500G.
>> Over 100G will leads btrfs to use big chunks, where data chunks can be
>> at most 10G and metadata to be 1G.
>>
>> I have seen a lot of users with about 100~200G device, and hit
>> unbalanced chunk allocation (10G data chunk easily takes the last
>> available space and makes later metadata no where to store)
>
> Interesting, but in my case there is still quite some free space in already
> allocated metadata chunks. Anyway, I did had enospc issues on trying to
> balance the chunks.
>
>> And unfortunately, your fs is already in the dangerous zone.
>> (And you are using RAID1, which means it's the same as one 170G btrfs
>> with SINGLE data/meta)
>
> Well, I know for any FS its not recommended to let it run to full and leave
> about 10-15% free at least, but while it is not 10-15% anymore, its still a
> whopping 11-12 GiB of free space. I would accept a somewhat slower operation
> in this case, but no kworker at 100% for about 10-30 seconds blocking
> everything else on going on on the filesystem. For whatever reason Plasma
> seems to access the fs on almost every action I do with it, so not even panels
> slide out anymore or activity switcher works during that time.
>
>>> In addition to that last time I tried it aborts scrub any of my BTRFS
>>> filesstems. Reported in another thread here that got completely ignored so
>>> far. I think I could go back to 4.2 kernel to make this work.
>>
>> Unfortunately, this happens a lot of times, even you posted it to mail list.
>> Devs here are always busy locating bugs or adding new features or
>> enhancing current behavior.
>>
>> So *PLEASE* be patient about such slow response.
>
> Okay, thanks at least for the acknowledgement of this. I try to be even more
> patient.
>
>> BTW, you may not want to revert to 4.2 until some bug fix is backported
>> to 4.2.
>> As qgroup rework in 4.2 has broken delayed ref and caused some scrub
>> bugs. (My fault)
>
> Hm, well scrubbing does not work for me either. But since 4.3/4.4rc2/4. I just
> bumped the thread:
>
> Re: [4.3-rc4] scrubbing aborts before finishing
>
> by replying a well by replying a third time to it (not fourth, miscounted:).
>
>>> I am not going to bother to go into more detail on any on this, as I get
>>> the impression that my bug reports and feedback get ignored. So I spare
>>> myself the time to do this work for now.
>>>
>>>
>>> Only thing I wonder now whether this all could be cause my /home is
>>> already
>>> more than one and a half year old. Maybe newly created filesystems are
>>> created in a way that prevents these issues? But it already has a nice
>>> global reserve:
>>>
>>> merkaba:~> btrfs fi df /
>>> Data, RAID1: total=27.98GiB, used=24.07GiB
>>> System, RAID1: total=19.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=2.00GiB, used=536.80MiB
>>> GlobalReserve, single: total=192.00MiB, used=0.00B
>>>
>>>
>>> Actually when I see that this free space thing is still not fixed for good
>>> I wonder whether it is fixable at all. Is this an inherent issue of BTRFS
>>> or more generally COW filesystem design?
>>
>> GlobalReserve is just a reserved space *INSIDE* metadata for some corner
>> case. So its profile is always single.
>>
>> The real problem is, how we represent it in btrfs-progs.
>>
>> If it output like below, I think you won't complain about it more:
>>   > merkaba:~> btrfs fi df /
>>   > Data, RAID1: total=27.98GiB, used=24.07GiB
>>   > System, RAID1: total=19.00MiB, used=16.00KiB
>>   > Metadata, RAID1: total=2.00GiB, used=728.80MiB
>>
>> Or
>>
>>   > merkaba:~> btrfs fi df /
>>   > Data, RAID1: total=27.98GiB, used=24.07GiB
>>   > System, RAID1: total=19.00MiB, used=16.00KiB
>>   > Metadata, RAID1: total=2.00GiB, used=(536.80 + 192.00)MiB
>>   >
>>   >  \ GlobalReserve: total=192.00MiB, used=0.00B
>
> Oh, the global reserve is *inside* the existing metadata chunks? Thats
> interesting. I didn´t know that.
>

And I have already submitted a btrfs-progs patch to change the default output
of 'fi df'.

Hopefully that solves the problem.

>>> I am seriously consider to switch to XFS for my production laptop again.
>>> Cause I never saw any of these free space issues with any of the XFS or
>>> Ext4 filesystems I used in the last 10 years.
>>
>> Yes, xfs and ext4 is very stable for normal use case.
>>
>> But at least, I won't recommend xfs yet, and considering the nature or
>> journal based fs, I'll recommend backup power supply in crash recovery
>> for both of them.
>>
>> Xfs already messed up several test environment of mine, and an
>> unfortunate double power loss has destroyed my whole /home ext4
>> partition years ago.
>
> Wow. I have never seen this. Actual I teach journal filesystems being quite
> safe on power losses as long as cache flushes (former barrier) functionality
> is active and working. With one caveat: It relies on one sector being either
> completely written or not. I never seen any scientific proof for that on usual
> storage devices.

The journal is there to keep the fs safe against power loss.
That's OK.

But the problem is that, when recovering the journal, there is no journal of
the journal to keep the journal recovery itself safe from power loss.

And that's the advantage of a COW file system: no need for a journal at all.
Although Btrfs is still less safe than the stable journal-based filesystems.

>
>> [xfs story]
>> After several crash, xfs makes several corrupted file just to 0 size.
>> Including my kernel .git directory. Then I won't trust it any longer.
>> No to mention that grub2 support for xfs v5 is not here yet.
>
> That is no filesystem metadata structure crash. It is a known issue with
> delayed allocation. Same with Ext4. I teach this as well in my performance
> analysis & tuning course.

Unfortunately, it's not about delayed allocation, as these were not new
files; they already existed, with contents, in a previous transaction.
The workload should only have rewritten the files. (Not sure, though.)

And in the ext4 case, I see corrupted files, but not files truncated to 0 size.
So IMHO it may be related to xfs recovery behavior.
But I'm not sure, as I have never read the xfs code.

>
> Main cause is the following: Both XFS and Ext4 use delayed allocation, i.e.:
>
> dd if=/dev/zero of=zeros bs=1M count=100 ; rm zeros
>
> will not allocate nor write a single byte of file data. As the file is deleted
> before delayed allocation kicks in.
>
> Now on renaming or truncating a file the journal may record the change already
> before the data is actually allocated.

Yes, I know delayed allocation, as it's also used in Btrfs.

>
> There is an epic Ubuntu bug report about when Ext4 introduced delayed
> allocation. There has been an epic discussion. Theodore T´so said: Use
> fsync()! Linus said: Don´t break userspace. We know the app is broke, but it
> worked with Ext3, so fix it. Ext4 has a "fix" or workaround for apps not using
> fsync() properly meanwhile, for the rename over old file and truncate case. It
> does not use delayed allocation in these case, basically lowering performance.
>
> XFS has a fix for truncating case, but *not* for rename case.
>
> Also BTRFS in principle has this issue I believe.  As far as I am aware it has
> a fix for the rename case, not using delayed allocation in the case. Due to
> its COW nature it may not be affected at all however, I don´t know.

Anyway, for the rewrite case, none of these filesystems should truncate the
file size to 0.
However, it seems xfs doesn't behave that way.
I'm not 100% sure, though, as after that disaster I reinstalled my test
box using ext4.

(Maybe next time I should try btrfs; at least when it fails, I get the
chance to submit new patches to the kernel or btrfsck.)

>
>> [ext4 story]
>> For ext4, when recovering my /home partition after a power loss, a new
>> power loss happened, and my home partition is doomed.
>> Only several non-sense files are savaged.
>
> During a fsck? Well that is quite a special condition I´d say. Of course I
> think aborting an fsck should be safe at all time, but I wouldn´t be surprised
> if it wasn´t.

Not only fsck; anything that does journal replay is affected, like
mounting a dirty fs.

But you're right, the case is quite minor, and even I have only
encountered it once.

Thanks,
Qu

>
> Thanks,
>




* Re: still kworker at 100% cpu in all of device size allocated with chunks situations with write load
  2015-12-14  8:48     ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Qu Wenruo
@ 2015-12-14  8:59       ` Martin Steigerwald
  2015-12-14  9:10       ` safety of journal based fs (was: Re: still kworker at 100% cpu…) Martin Steigerwald
  1 sibling, 0 replies; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-14  8:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Hi Qu.

I reply to the journal fs things in a mail with a different subject.

Am Montag, 14. Dezember 2015, 16:48:58 CET schrieb Qu Wenruo:
> Martin Steigerwald wrote on 2015/12/14 09:18 +0100:
> > Am Montag, 14. Dezember 2015, 10:08:16 CET schrieb Qu Wenruo:
> >> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
[…]
> >> GlobalReserve is just a reserved space *INSIDE* metadata for some corner
> >> case. So its profile is always single.
> >> 
> >> The real problem is, how we represent it in btrfs-progs.
> >> 
> >> If it output like below, I think you won't complain about it more:
> >>   > merkaba:~> btrfs fi df /
> >>   > Data, RAID1: total=27.98GiB, used=24.07GiB
> >>   > System, RAID1: total=19.00MiB, used=16.00KiB
> >>   > Metadata, RAID1: total=2.00GiB, used=728.80MiB
> >> 
> >> Or
> >> 
> >>   > merkaba:~> btrfs fi df /
> >>   > Data, RAID1: total=27.98GiB, used=24.07GiB
> >>   > System, RAID1: total=19.00MiB, used=16.00KiB
> >>   > Metadata, RAID1: total=2.00GiB, used=(536.80 + 192.00)MiB
> >>   > 
> >>   >  \ GlobalReserve: total=192.00MiB, used=0.00B
> > 
> > Oh, the global reserve is *inside* the existing metadata chunks? Thats
> > interesting. I didn´t know that.
> 
> And I have already submit btrfs-progs patch to change the default output
> of 'fi df'.
> 
> Hopes to solve the problem.

Nice. Thank you. It clarifies it quite a bit. I always wondered why it's 
single. On which device does it allocate it in a RAID 1? Also, can the data 
stored in there temporarily be recreated in case of losing a device? If 
not, BTRFS would not guarantee that one device of a RAID 1 can fail at 
all times.

Ciao,
-- 
Martin


* safety of journal based fs (was: Re: still kworker at 100% cpu…)
  2015-12-14  8:48     ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Qu Wenruo
  2015-12-14  8:59       ` Martin Steigerwald
@ 2015-12-14  9:10       ` Martin Steigerwald
  2015-12-22  2:34         ` Kai Krakow
  1 sibling, 1 reply; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-14  9:10 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

Hi!

Using a different subject for the journal fs related things which are off 
topic, but still interesting. Might make sense to move to fsdevel-ml or ext4/
XFS mailing lists? Otherwise, I suggest we focus on BTRFS here. Still wanted 
to reply.

Am Montag, 14. Dezember 2015, 16:48:58 CET schrieb Qu Wenruo:
> Martin Steigerwald wrote on 2015/12/14 09:18 +0100:
> > Am Montag, 14. Dezember 2015, 10:08:16 CET schrieb Qu Wenruo:
> >> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
[…]
> >>> I am seriously consider to switch to XFS for my production laptop again.
> >>> Cause I never saw any of these free space issues with any of the XFS or
> >>> Ext4 filesystems I used in the last 10 years.
> >> 
> >> Yes, xfs and ext4 is very stable for normal use case.
> >> 
> >> But at least, I won't recommend xfs yet, and considering the nature or
> >> journal based fs, I'll recommend backup power supply in crash recovery
> >> for both of them.
> >> 
> >> Xfs already messed up several test environment of mine, and an
> >> unfortunate double power loss has destroyed my whole /home ext4
> >> partition years ago.
> > 
> > Wow. I have never seen this. Actual I teach journal filesystems being
> > quite
> > safe on power losses as long as cache flushes (former barrier)
> > functionality is active and working. With one caveat: It relies on one
> > sector being either completely written or not. I never seen any
> > scientific proof for that on usual storage devices.
> 
> The journal is used to be safe against power loss.
> That's OK.
> 
> But the problem is, when recovering journal, there is no journal of
> journal, to keep journal recovering safe from power loss.

But the journal should be safe due to a journal commit being one sector? Of 
course, for the last changes without a journal commit it's: The stuff is gone.

> And that's the advantage of COW file system, no need of journal completely.
> Although Btrfs is less safe than stable journal based fs yet.
> 
> >> [xfs story]
> >> After several crash, xfs makes several corrupted file just to 0 size.
> >> Including my kernel .git directory. Then I won't trust it any longer.
> >> No to mention that grub2 support for xfs v5 is not here yet.
> > 
> > That is no filesystem metadata structure crash. It is a known issue with
> > delayed allocation. Same with Ext4. I teach this as well in my performance
> > analysis & tuning course.
> 
> Unfortunately, it's not about delayed allocation, as it's not a new
> file, it's file already here with contents in previous transaction.
> The workload should only rewrite the files.(Not sure though)

From what I know, the overwriting after truncating case is also related to the 
delayed allocation, deferred write thing: The file has been truncated to zero 
bytes in the journal, while no data has been written.

But well for Ext4 / XFS it doesn´t need to reallocate in this case.

> And for ext4 case, I'll see corrupted files, but not truncated to 0 size.
> So IMHO it may be related to xfs recovery behavior.
> But not sure as I never read xfs codes.

Journals only provide *metadata* consistency. Unless you use Ext4 with 
data=journal, which is supposed to be much slower, but in some workloads it's 
actually faster. Even Andrew Morton had no explanation for that, however I 
do have an idea about it. Also data=journal is interesting, if you put the journal 
for harddisk based Ext4 onto an SSD or an SSD RAID 1 or so.

> > Also BTRFS in principle has this issue I believe.  As far as I am aware it
> > has a fix for the rename case, not using delayed allocation in the case.
> > Due to its COW nature it may not be affected at all however, I don´t
> > know.
> Anyway for rewrite case, none of these fs should truncate fs size to 0.
> However, it seems xfs doesn't follow the way though.
> Although I'm not 100% sure, as after that disaster I reinstall my test
> box using ext4.
> 
> (Maybe next time I should try btrfs, at least when it fails, I have my
> chance to submit new patches to kernel or btrfsck)

I do think it's the applications doing that on overwriting a file. Rewriting a 
config file for example. It's either write new file, rename over old, or truncate 
to zero bytes and rewrite.

Of course, it's different for databases or other files written into without 
rewriting them. But there you need data=journal on Ext4. XFS doesn´t guarantee 
file consistency at all in that case, unless the application serializes 
changes with fsync() properly by using an in-application journal for the data 
to write.
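
To make the "write new file, rename over old" variant concrete, here is a 
minimal Python sketch of how an application could do it with fsync(). The 
function name atomic_rewrite is made up for this example, and the directory 
fsync at the end is an assumption about what a careful application would do 
on Linux, not something any of the filesystems discussed here prescribes.

```python
import os
import tempfile


def atomic_rewrite(path: str, data: bytes) -> None:
    """Hypothetical helper: replace `path` with `data` via write-new-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Force the new data to stable storage before the rename,
            # so the rename can never expose an empty or partial file.
            os.fsync(tmp.fileno())
        # Atomically swap the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also sync the directory so the rename itself is persistent.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

With this pattern a reader sees either the complete old file or the complete 
new one. The truncate-and-rewrite variant has no such property, which is why 
it is the one that ends up with zero-byte files after a crash.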

> >> [ext4 story]
> >> For ext4, when recovering my /home partition after a power loss, a new
> >> power loss happened, and my home partition is doomed.
> >> Only several non-sense files are savaged.
> > 
> > During a fsck? Well that is quite a special condition I´d say. Of course I
> > think aborting an fsck should be safe at all time, but I wouldn´t be
> > surprised if it wasn´t.
> 
> Not only a fsck, any timing doing journal replay will be affected, like
> mounting a dirty fs.
> 
> But you're right, the case is quite minor, and even myself only
> encountered it once.

Hmmm, okay, but still not nice. I thought a journal replay should be safe. 
Cause:

It will check the last log entries with a commit marker; only these are 1) complete, 
2) not fully applied. It will apply all changes that are not yet applied, or 
even reapply those that were already applied; I am not sure about that. And then it 
will remove the commit marker, which should be an atomic operation.

At least I thought that this is the whole point of using a journal in the 
first place.
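
As a toy model of that reasoning, a small Python sketch. It is not how jbd2 
or the XFS log are implemented. It assumes the journal stores full block 
images, so applying an entry twice gives the same result as applying it 
once, and it assumes writes reach the disk in order. All names are made up 
for illustration.

```python
def replay(journal, disk, crash_after=None):
    """Toy journal replay. Returns True if replay finished, False on a
    simulated power loss after `crash_after` individual block writes."""
    writes = 0
    for txn in journal["transactions"]:
        if not txn["committed"]:
            # No commit marker: this is the incomplete tail of the log,
            # so stop here and discard it.
            break
        for block, value in txn["writes"]:
            if crash_after is not None and writes == crash_after:
                # Power loss in the middle of replay. The journal itself
                # has not been modified, so it can be replayed again.
                return False
            disk[block] = value  # full block image: reapplying is harmless
            writes += 1
    # Stands in for the single atomic step that marks the journal clean.
    journal["transactions"].clear()
    return True


def example_state():
    """Hypothetical journal and disk contents for the demonstration."""
    journal = {"transactions": [
        {"committed": True, "writes": [(1, "A2"), (2, "B2")]},
        {"committed": True, "writes": [(2, "B3"), (3, "C3")]},
        {"committed": False, "writes": [(4, "D4")]},
    ]}
    disk = {1: "A1", 2: "B1", 3: "C1", 4: "D1"}
    return journal, disk
```

Replaying with a crash after three writes and then replaying again ends in 
the same disk state as one uninterrupted replay. In this model a second power 
loss is only dangerous if the journal is marked clean before the replayed 
blocks are really on disk.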

Thanks,
-- 
Martin


* Re: safety of journal based fs (was: Re: still kworker at 100% cpu…)
  2015-12-14  9:10       ` safety of journal based fs (was: Re: still kworker at 100% cpu…) Martin Steigerwald
@ 2015-12-22  2:34         ` Kai Krakow
  0 siblings, 0 replies; 25+ messages in thread
From: Kai Krakow @ 2015-12-22  2:34 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 14 Dec 2015 10:10:51 +0100
schrieb Martin Steigerwald <martin@lichtvoll.de>:

> > But the problem is, when recovering journal, there is no journal of
> > journal, to keep journal recovering safe from power loss.  
> 
> But the journal should be safe due to a journal commit being one
> sector? Of course for the last changes without a journal commit its:
> The stuff is gone.

This may not be true for disks having write caching enabled and write
barriers off. Then there's no barrier at a journal checkpoint.

Next thing is: At least for ext4, journal is meta-data only.

But I think what was meant here: The case of power loss during
log replay... Though I think the journal should simply be fully replayed
again if it wasn't marked clean before.

Which turns us back to write barriers and write caching... ;-)

It could be that the checkpointing (or marking the journal clean after
replay) could make it to disk before the actual data made it to disk,
due to write-reordering of the hard disk - which can be effectively
circumvented by disabling write caching and enabling write barriers
(the latter should be default while I would always check the former).
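
One way to check the write caching part from a script, as a hedged sketch: 
newer Linux kernels expose a queue/write_cache attribute per block device in 
sysfs, which reads "write back" or "write through". Whether it is available 
depends on the kernel version, and the function names below are made up for 
this example.

```python
from pathlib import Path


def write_cache_mode(device: str, sysfs_root: str = "/sys/block") -> str:
    """Return the kernel's view of the device write cache, e.g. 'write back'.

    Assumes the queue/write_cache sysfs attribute exists on this kernel.
    """
    attribute = Path(sysfs_root) / device / "queue" / "write_cache"
    return attribute.read_text().strip()


def cache_is_volatile(device: str, sysfs_root: str = "/sys/block") -> bool:
    """True if the device reports a write-back (volatile) cache, in which
    case working cache flushes are needed to keep the journal ordering safe."""
    return write_cache_mode(device, sysfs_root) == "write back"
```

For example, cache_is_volatile("sda") would tell whether sda relies on cache 
flushes. The sysfs_root parameter only exists so the functions can be tried 
against a test directory.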

-- 
Regards,
Kai

Replies to list-only preferred.


* Re: Still not production ready
  2015-12-14  2:08 ` Still not production ready Qu Wenruo
  2015-12-14  6:21   ` Duncan
  2015-12-14  8:18   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
@ 2015-12-15 21:59   ` Chris Mason
  2015-12-15 23:16     ` Martin Steigerwald
  2015-12-16  1:20     ` Qu Wenruo
  2 siblings, 2 replies; 25+ messages in thread
From: Chris Mason @ 2015-12-15 21:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Martin Steigerwald, Btrfs BTRFS

On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> 
> 
> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >Hi!
> >
> >For me it is still not production ready.
> 
> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> 
> >Again I ran into:
> >
> >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
> >write into big file
> >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> 
> Not sure about guideline for other fs, but it will attract more dev's
> attention if it can be posted to maillist.
> 
> >
> >
> >No matter whether SLES 12 uses it as default for root, no matter whether
> >Fujitsu and Facebook use it: I will not let this onto any customer machine
> >without lots and lots of underprovisioning and rigorous free space monitoring.
> >Actually I will renew my recommendations in my trainings to be careful with
> >BTRFS.
> >
> > From my experience the monitoring would check for:
> >
> >merkaba:~> btrfs fi show /home
> >Label: 'home'  uuid: […]
> >         Total devices 2 FS bytes used 156.31GiB
> >         devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> >         devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> >
> >If "used" is same as "size" then make big fat alarm. It is not sufficient for
> >it to happen. It can run for quite some time just fine without any issues, but
> >I never have seen a kworker thread using 100% of one core for extended period
> >of time blocking everything else on the fs without this condition being met.
> >
> 
> And specially advice on the device size from myself:
> Don't use devices over 100G but less than 500G.
> Over 100G will leads btrfs to use big chunks, where data chunks can be at
> most 10G and metadata to be 1G.
> 
> I have seen a lot of users with about 100~200G device, and hit unbalanced
> chunk allocation (10G data chunk easily takes the last available space and
> makes later metadata no where to store)

Maybe we should tune things so the size of the chunk is based on the
space remaining instead of the total space?

> 
> And unfortunately, your fs is already in the dangerous zone.
> (And you are using RAID1, which means it's the same as one 170G btrfs with
> SINGLE data/meta)
> 
> >
> >In addition to that last time I tried it aborts scrub any of my BTRFS
> >filesstems. Reported in another thread here that got completely ignored so
> >far. I think I could go back to 4.2 kernel to make this work.

We'll pick this thread up again, the ones that get fixed the fastest are
the ones that we can easily reproduce.  The rest need a lot of think
time.

-chris


* Re: Still not production ready
  2015-12-15 21:59   ` Still not production ready Chris Mason
@ 2015-12-15 23:16     ` Martin Steigerwald
  2015-12-16  1:20     ` Qu Wenruo
  1 sibling, 0 replies; 25+ messages in thread
From: Martin Steigerwald @ 2015-12-15 23:16 UTC (permalink / raw)
  To: Chris Mason; +Cc: Qu Wenruo, Btrfs BTRFS

Am Dienstag, 15. Dezember 2015, 16:59:58 CET schrieb Chris Mason:
> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> > Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> > >Hi!
> > >
> > >For me it is still not production ready.
> > 
> > Yes, this is the *FACT* and not everyone has a good reason to deny it.
> > 
> > >Again I ran into:
> > >
> > >btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> > >random write into big file
> > >https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > Not sure about guideline for other fs, but it will attract more dev's
> > attention if it can be posted to maillist.
> > 
> > >No matter whether SLES 12 uses it as default for root, no matter whether
> > >Fujitsu and Facebook use it: I will not let this onto any customer
> > >machine
> > >without lots and lots of underprovisioning and rigorous free space
> > >monitoring. Actually I will renew my recommendations in my trainings to
> > >be careful with BTRFS.
> > >
> > > From my experience the monitoring would check for:
> > >merkaba:~> btrfs fi show /home
> > >Label: 'home'  uuid: […]
> > >
> > >         Total devices 2 FS bytes used 156.31GiB
> > >         devid    1 size 170.00GiB used 164.13GiB path
> > >         /dev/mapper/msata-home
> > >         devid    2 size 170.00GiB used 164.13GiB path
> > >         /dev/mapper/sata-home
> > >
> > >If "used" is same as "size" then make big fat alarm. It is not sufficient
> > >for it to happen. It can run for quite some time just fine without any
> > >issues, but I never have seen a kworker thread using 100% of one core
> > >for extended period of time blocking everything else on the fs without
> > >this condition being met.
> > 
> > And specially advice on the device size from myself:
> > Don't use devices over 100G but less than 500G.
> > Over 100G will leads btrfs to use big chunks, where data chunks can be at
> > most 10G and metadata to be 1G.
> > 
> > I have seen a lot of users with about 100~200G device, and hit unbalanced
> > chunk allocation (10G data chunk easily takes the last available space and
> > makes later metadata no where to store)
> 
> Maybe we should tune things so the size of the chunk is based on the
> space remaining instead of the total space?

Still, on my filesystem there was over 1 GiB free in metadata chunks, so…

… my theory still is: BTRFS has trouble finding free space in chunks at some 
point.

> > And unfortunately, your fs is already in the dangerous zone.
> > (And you are using RAID1, which means it's the same as one 170G btrfs with
> > SINGLE data/meta)
> > 
> > >In addition to that last time I tried it aborts scrub any of my BTRFS
> > >filesstems. Reported in another thread here that got completely ignored
> > >so
> > >far. I think I could go back to 4.2 kernel to make this work.
> 
> We'll pick this thread up again, the ones that get fixed the fastest are
> the ones that we can easily reproduce.  The rest need a lot of think
> time.

I understand. Maybe I just wanted to see at least some sort of reaction.

I now have 4.4-rc5 running, the boot crash I had appears to be fixed. Oh, and 
I see that scrubbing / at least worked now:

merkaba:~> btrfs scrub status -d /
scrub status for […]
scrub device /dev/dm-5 (id 1) history
        scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:42
        total bytes scrubbed: 23.94GiB with 0 errors
scrub device /dev/mapper/msata-debian (id 2) history
        scrub started at Wed Dec 16 00:13:20 2015 and finished after 00:01:34
        total bytes scrubbed: 23.94GiB with 0 errors

Okay, I will test the other ones tomorrow, so maybe this one is fixed meanwhile.

Yay!

Thanks,
-- 
Martin


* Re: Still not production ready
  2015-12-15 21:59   ` Still not production ready Chris Mason
  2015-12-15 23:16     ` Martin Steigerwald
@ 2015-12-16  1:20     ` Qu Wenruo
  2015-12-16  1:53       ` Liu Bo
  2016-01-01 10:44       ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Martin Steigerwald
  1 sibling, 2 replies; 25+ messages in thread
From: Qu Wenruo @ 2015-12-16  1:20 UTC (permalink / raw)
  To: Chris Mason, Martin Steigerwald, Btrfs BTRFS



Chris Mason wrote on 2015/12/15 16:59 -0500:
> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
>>
>>
>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>> Hi!
>>>
>>> For me it is still not production ready.
>>
>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>>
>>> Again I ran into:
>>>
>>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
>>> write into big file
>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>
>> Not sure about guideline for other fs, but it will attract more dev's
>> attention if it can be posted to maillist.
>>
>>>
>>>
>>> No matter whether SLES 12 uses it as default for root, no matter whether
>>> Fujitsu and Facebook use it: I will not let this onto any customer machine
>>> without lots and lots of underprovisioning and rigorous free space monitoring.
>>> Actually I will renew my recommendations in my trainings to be careful with
>>> BTRFS.
>>>
>>>  From my experience the monitoring would check for:
>>>
>>> merkaba:~> btrfs fi show /home
>>> Label: 'home'  uuid: […]
>>>          Total devices 2 FS bytes used 156.31GiB
>>>          devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>>>          devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
>>>
>>> If "used" is same as "size" then make big fat alarm. It is not sufficient for
>>> it to happen. It can run for quite some time just fine without any issues, but
>>> I never have seen a kworker thread using 100% of one core for extended period
>>> of time blocking everything else on the fs without this condition being met.
>>>
>>
>> And specially advice on the device size from myself:
>> Don't use devices over 100G but less than 500G.
>> Over 100G will leads btrfs to use big chunks, where data chunks can be at
>> most 10G and metadata to be 1G.
>>
>> I have seen a lot of users with about 100~200G device, and hit unbalanced
>> chunk allocation (10G data chunk easily takes the last available space and
>> makes later metadata no where to store)
>
> Maybe we should tune things so the size of the chunk is based on the
> space remaining instead of the total space?

I submitted such a patch before.
David pointed out that such behavior will cause a lot of small 
fragmented chunks in the last several GB.
Which may make balance behavior not as predictable as before.


At least, we can just change the current 10% chunk size limit to 5% to 
make such a problem harder to trigger.
It's a simple and easy solution.

Another cause of the problem is, we underestimated the chunk size change 
for fs at the borderline of big chunk.

For 99G, its chunk size limit is 1G, and it needs 99 data chunks to 
fully cover the fs.
But for 100G, it only needs 10 chunks to cover the fs.
And it needs to be 990G to match the number again.

The sudden drop of chunk number is the root cause.

So we'd better reconsider both the big chunk size limit and chunk size 
limit to find a balanced solution for it.

Thanks,
Qu
>
>>
>> And unfortunately, your fs is already in the dangerous zone.
>> (And you are using RAID1, which means it's the same as one 170G btrfs with
>> SINGLE data/meta)
>>
>>>
>>> In addition to that last time I tried it aborts scrub any of my BTRFS
>>> filesstems. Reported in another thread here that got completely ignored so
>>> far. I think I could go back to 4.2 kernel to make this work.
>
> We'll pick this thread up again, the ones that get fixed the fastest are
> the ones that we can easily reproduce.  The rest need a lot of think
> time.
>
> -chris
>
>




* Re: Still not production ready
  2015-12-16  1:20     ` Qu Wenruo
@ 2015-12-16  1:53       ` Liu Bo
  2015-12-16  2:19         ` Qu Wenruo
  2016-01-01 10:44       ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Martin Steigerwald
  1 sibling, 1 reply; 25+ messages in thread
From: Liu Bo @ 2015-12-16  1:53 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Martin Steigerwald, Btrfs BTRFS

On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:
> 
> 
> Chris Mason wrote on 2015/12/15 16:59 -0500:
> >On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >>>Hi!
> >>>
> >>>For me it is still not production ready.
> >>
> >>Yes, this is the *FACT* and not everyone has a good reason to deny it.
> >>
> >>>Again I ran into:
> >>>
> >>>btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
> >>>write into big file
> >>>https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >>
> >>Not sure about guideline for other fs, but it will attract more dev's
> >>attention if it can be posted to maillist.
> >>
> >>>
> >>>
> >>>No matter whether SLES 12 uses it as default for root, no matter whether
> >>>Fujitsu and Facebook use it: I will not let this onto any customer machine
> >>>without lots and lots of underprovisioning and rigorous free space monitoring.
> >>>Actually I will renew my recommendations in my trainings to be careful with
> >>>BTRFS.
> >>>
> >>> From my experience the monitoring would check for:
> >>>
> >>>merkaba:~> btrfs fi show /home
> >>>Label: 'home'  uuid: […]
> >>>         Total devices 2 FS bytes used 156.31GiB
> >>>         devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> >>>         devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> >>>
> >>>If "used" is same as "size" then make big fat alarm. It is not sufficient for
> >>>it to happen. It can run for quite some time just fine without any issues, but
> >>>I never have seen a kworker thread using 100% of one core for extended period
> >>>of time blocking everything else on the fs without this condition being met.
> >>>
> >>
> >>And specially advice on the device size from myself:
> >>Don't use devices over 100G but less than 500G.
> >>Over 100G will leads btrfs to use big chunks, where data chunks can be at
> >>most 10G and metadata to be 1G.
> >>
> >>I have seen a lot of users with about 100~200G device, and hit unbalanced
> >>chunk allocation (10G data chunk easily takes the last available space and
> >>makes later metadata no where to store)
> >
> >Maybe we should tune things so the size of the chunk is based on the
> >space remaining instead of the total space?
> 
> Submitted such patch before.
> David pointed out that such behavior will cause a lot of small fragmented
> chunks at last several GB.
> Which may make balance behavior not as predictable as before.
> 
> 
> At least, we can just change the current 10% chunk size limit to 5% to make
> such problem less easier to trigger.
> It's a simple and easy solution.
> 
> Another cause of the problem is, we understated the chunk size change for fs
> at the borderline of big chunk.
> 
> For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully
> cover the fs.
> But for 100G, it only needs 10 chunks to covert the fs.
> And it need to be 990G to match the number again.

max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes,
may I know how your partition gets a 10GB chunk?
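
To spell out that arithmetic, a small Python sketch. The 1 GiB cap and the 
stripe_size * data_stripes formula are as stated above. The remark about 
which profiles give which number of data stripes is an assumption for 
illustration, not taken from the kernel source.

```python
GIB = 1024 ** 3
MAX_STRIPE_SIZE = 1 * GIB  # cap on a single stripe, as stated above


def chunk_size(stripe_size: int, data_stripes: int) -> int:
    """Logical chunk size: capped stripe size times the number of data stripes."""
    return min(stripe_size, MAX_STRIPE_SIZE) * data_stripes


# Assumption: SINGLE and RAID1 carry one data stripe per chunk
# (RAID1 stores a second copy of it, not additional data).
raid1_chunk = chunk_size(10 * GIB, data_stripes=1)

# Assumption: a striped profile over ten devices carries ten data stripes.
striped_chunk = chunk_size(1 * GIB, data_stripes=10)
```

Under these assumptions a two-device RAID1 filesystem gets chunks of at most 
1 GiB, and a 10 GiB chunk would need ten data stripes.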


Thanks,

-liubo
 

> 
> The sudden drop of chunk number is the root cause.
> 
> So we'd better reconsider both the big chunk size limit and chunk size limit
> to find a balanaced solution for it.
> 
> Thanks,
> Qu
> >
> >>
> >>And unfortunately, your fs is already in the dangerous zone.
> >>(And you are using RAID1, which means it's the same as one 170G btrfs with
> >>SINGLE data/meta)
> >>
> >>>
> >>>In addition to that last time I tried it aborts scrub any of my BTRFS
> >>>filesstems. Reported in another thread here that got completely ignored so
> >>>far. I think I could go back to 4.2 kernel to make this work.
> >
> >We'll pick this thread up again, the ones that get fixed the fastest are
> >the ones that we can easily reproduce.  The rest need a lot of think
> >time.
> >
> >-chris
> >
> >
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Still not production ready
  2015-12-16  1:53       ` Liu Bo
@ 2015-12-16  2:19         ` Qu Wenruo
  2015-12-16  2:30           ` Liu Bo
  0 siblings, 1 reply; 25+ messages in thread
From: Qu Wenruo @ 2015-12-16  2:19 UTC (permalink / raw)
  To: bo.li.liu; +Cc: Chris Mason, Martin Steigerwald, Btrfs BTRFS



Liu Bo wrote on 2015/12/15 17:53 -0800:
> On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:
>>
>>
>> Chris Mason wrote on 2015/12/15 16:59 -0500:
>>> On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
>>>>
>>>>
>>>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>>>> Hi!
>>>>>
>>>>> For me it is still not production ready.
>>>>
>>>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>>>>
>>>>> Again I ran into:
>>>>>
>>>>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
>>>>> write into big file
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>>>
>>>> Not sure about guideline for other fs, but it will attract more dev's
>>>> attention if it can be posted to maillist.
>>>>
>>>>>
>>>>>
>>>>> No matter whether SLES 12 uses it as default for root, no matter whether
>>>>> Fujitsu and Facebook use it: I will not let this onto any customer machine
>>>>> without lots and lots of underprovisioning and rigorous free space monitoring.
>>>>> Actually I will renew my recommendations in my trainings to be careful with
>>>>> BTRFS.
>>>>>
>>>>>  From my experience the monitoring would check for:
>>>>>
>>>>> merkaba:~> btrfs fi show /home
>>>>> Label: 'home'  uuid: […]
>>>>>          Total devices 2 FS bytes used 156.31GiB
>>>>>          devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>>>>>          devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
>>>>>
>>>>> If "used" is same as "size" then make big fat alarm. It is not sufficient for
>>>>> it to happen. It can run for quite some time just fine without any issues, but
>>>>> I never have seen a kworker thread using 100% of one core for extended period
>>>>> of time blocking everything else on the fs without this condition being met.
>>>>>
>>>>
>>>> And specially advice on the device size from myself:
>>>> Don't use devices over 100G but less than 500G.
>>>> Over 100G will leads btrfs to use big chunks, where data chunks can be at
>>>> most 10G and metadata to be 1G.
>>>>
>>>> I have seen a lot of users with about 100~200G device, and hit unbalanced
>>>> chunk allocation (10G data chunk easily takes the last available space and
>>>> makes later metadata no where to store)
>>>
>>> Maybe we should tune things so the size of the chunk is based on the
>>> space remaining instead of the total space?
>>
>> Submitted such patch before.
>> David pointed out that such behavior will cause a lot of small fragmented
>> chunks at last several GB.
>> Which may make balance behavior not as predictable as before.
>>
>>
>> At least, we can just change the current 10% chunk size limit to 5% to make
>> such problem less easier to trigger.
>> It's a simple and easy solution.
>>
>> Another cause of the problem is, we understated the chunk size change for fs
>> at the borderline of big chunk.
>>
>> For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully
>> cover the fs.
>> But for 100G, it only needs 10 chunks to covert the fs.
>> And it need to be 990G to match the number again.
>
> max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes,
> may I know how your partition gets a 10GB chunk?

Oh, it seems that I remembered the wrong size.
After checking the code, yes you're right.
A stripe won't be larger than 1G, so my assumption above is totally wrong.

And the problem is not in the 10% limit.

Please forget it.

Thanks,
Qu

>
>
> Thanks,
>
> -liubo
>
>
>>
>> The sudden drop of chunk number is the root cause.
>>
>> So we'd better reconsider both the big chunk size limit and chunk size limit
>> to find a balanaced solution for it.
>>
>> Thanks,
>> Qu
>>>
>>>>
>>>> And unfortunately, your fs is already in the dangerous zone.
>>>> (And you are using RAID1, which means it's the same as one 170G btrfs with
>>>> SINGLE data/meta)
>>>>
>>>>>
>>>>> In addition to that last time I tried it aborts scrub any of my BTRFS
>>>>> filesstems. Reported in another thread here that got completely ignored so
>>>>> far. I think I could go back to 4.2 kernel to make this work.
>>>
>>> We'll pick this thread up again, the ones that get fixed the fastest are
>>> the ones that we can easily reproduce.  The rest need a lot of think
>>> time.
>>>
>>> -chris
>>>
>>>
>>
>>
>
>




* Re: Still not production ready
  2015-12-16  2:19         ` Qu Wenruo
@ 2015-12-16  2:30           ` Liu Bo
  2015-12-16 14:27             ` Chris Mason
  0 siblings, 1 reply; 25+ messages in thread
From: Liu Bo @ 2015-12-16  2:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Martin Steigerwald, Btrfs BTRFS

On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote:
> 
> 
> Liu Bo wrote on 2015/12/15 17:53 -0800:
> >On Wed, Dec 16, 2015 at 09:20:45AM +0800, Qu Wenruo wrote:
> >>
> >>
> >>Chris Mason wrote on 2015/12/15 16:59 -0500:
> >>>On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> >>>>
> >>>>
> >>>>Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >>>>>Hi!
> >>>>>
> >>>>>For me it is still not production ready.
> >>>>
> >>>>Yes, this is the *FACT* and not everyone has a good reason to deny it.
> >>>>
> >>>>>Again I ran into:
> >>>>>
> >>>>>btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random
> >>>>>write into big file
> >>>>>https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >>>>
> >>>>Not sure about guideline for other fs, but it will attract more dev's
> >>>>attention if it can be posted to maillist.
> >>>>
> >>>>>
> >>>>>
> >>>>>No matter whether SLES 12 uses it as default for root, no matter whether
> >>>>>Fujitsu and Facebook use it: I will not let this onto any customer machine
> >>>>>without lots and lots of underprovisioning and rigorous free space monitoring.
> >>>>>Actually I will renew my recommendations in my trainings to be careful with
> >>>>>BTRFS.
> >>>>>
> >>>>> From my experience the monitoring would check for:
> >>>>>
> >>>>>merkaba:~> btrfs fi show /home
> >>>>>Label: 'home'  uuid: […]
> >>>>>         Total devices 2 FS bytes used 156.31GiB
> >>>>>         devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> >>>>>         devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> >>>>>
> >>>>>If "used" is same as "size" then make big fat alarm. It is not sufficient for
> >>>>>it to happen. It can run for quite some time just fine without any issues, but
> >>>>>I never have seen a kworker thread using 100% of one core for extended period
> >>>>>of time blocking everything else on the fs without this condition being met.
> >>>>>
> >>>>
> >>>>And specially advice on the device size from myself:
> >>>>Don't use devices over 100G but less than 500G.
> >>>>Over 100G will leads btrfs to use big chunks, where data chunks can be at
> >>>>most 10G and metadata to be 1G.
> >>>>
> >>>>I have seen a lot of users with about 100~200G device, and hit unbalanced
> >>>>chunk allocation (10G data chunk easily takes the last available space and
> >>>>makes later metadata no where to store)
> >>>
> >>>Maybe we should tune things so the size of the chunk is based on the
> >>>space remaining instead of the total space?
> >>
> >>Submitted such patch before.
> >>David pointed out that such behavior will cause a lot of small fragmented
> >>chunks at last several GB.
> >>Which may make balance behavior not as predictable as before.
> >>
> >>
> >>At least, we can just change the current 10% chunk size limit to 5% to make
> >>such problem less easier to trigger.
> >>It's a simple and easy solution.
> >>
> >>Another cause of the problem is, we understated the chunk size change for fs
> >>at the borderline of big chunk.
> >>
> >>For 99G, its chunk size limit is 1G, and it needs 99 data chunks to fully
> >>cover the fs.
> >>But for 100G, it only needs 10 chunks to covert the fs.
> >>And it need to be 990G to match the number again.
> >
> >max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes,
> >may I know how your partition gets a 10GB chunk?
> 
> Oh, it seems that I remembered the wrong size.
> After checking the code, yes you're right.
> A stripe won't be larger than 1G, so my assumption above is totally wrong.
> 
> And the problem is not in the 10% limit.
> 
> Please forget it.

No problem, glad to see people talking about the space issue again.

Thanks,

-liubo
> 
> Thanks,
> Qu
> 
> >
> >
> >Thanks,
> >
> >-liubo
> >
> >
> >>
> >>The sudden drop of chunk number is the root cause.
> >>
> >>So we'd better reconsider both the big chunk size limit and chunk size limit
> >>to find a balanaced solution for it.
> >>
> >>Thanks,
> >>Qu
> >>>
> >>>>
> >>>>And unfortunately, your fs is already in the dangerous zone.
> >>>>(And you are using RAID1, which means it's the same as one 170G btrfs with
> >>>>SINGLE data/meta)
> >>>>
> >>>>>
> >>>>>In addition to that last time I tried it aborts scrub any of my BTRFS
> >>>>>filesstems. Reported in another thread here that got completely ignored so
> >>>>>far. I think I could go back to 4.2 kernel to make this work.
> >>>
> >>>We'll pick this thread up again, the ones that get fixed the fastest are
> >>>the ones that we can easily reproduce.  The rest need a lot of think
> >>>time.
> >>>
> >>>-chris
> >>>
> >>>
> >>
> >>
> >
> >
> 
> 


* Re: Still not production ready
  2015-12-16  2:30           ` Liu Bo
@ 2015-12-16 14:27             ` Chris Mason
  0 siblings, 0 replies; 25+ messages in thread
From: Chris Mason @ 2015-12-16 14:27 UTC (permalink / raw)
  To: Liu Bo; +Cc: Qu Wenruo, Martin Steigerwald, Btrfs BTRFS

On Tue, Dec 15, 2015 at 06:30:58PM -0800, Liu Bo wrote:
> On Wed, Dec 16, 2015 at 10:19:00AM +0800, Qu Wenruo wrote:
> > >max_stripe_size is fixed at 1GB and the chunk size is stripe_size * data_stripes,
> > >may I know how your partition gets a 10GB chunk?
> > 
> > Oh, it seems that I remembered the wrong size.
> > After checking the code, yes you're right.
> > A stripe won't be larger than 1G, so my assumption above is totally wrong.
> > 
> > And the problem is not in the 10% limit.
> > 
> > Please forget it.
> 
> No problem, glad to see people talking about the space issue again.

You can still end up with larger block groups if you have a lot of
drives.  We've had different problems with that in the past, but it is
limited now to 10G.
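
As a rough illustration of the arithmetic discussed here (chunk size = stripe_size * data_stripes, with the stripe at most 1G and the block group limited to 10G), here is a minimal sketch. It is a simplification, not the kernel's allocator: the function name and parameters are hypothetical, and it assumes every stripe is at its maximum size.

```python
GIB = 2**30

def data_chunk_size(data_stripes, max_stripe_size=1 * GIB, max_chunk_size=10 * GIB):
    """Hypothetical helper: stripe_size * data_stripes, capped at max_chunk_size.

    Assumes each stripe is at its maximum size; the real allocator
    applies further constraints that are not modelled here.
    """
    return min(max_stripe_size * data_stripes, max_chunk_size)

# With a single data stripe the chunk stays at 1 GiB; with many drives
# it grows with the stripe count until the cap applies.
for stripes in (1, 4, 16):
    print(stripes, "data stripes ->", data_chunk_size(stripes) // GIB, "GiB")
```

Under these assumptions a 10G data chunk only shows up with ten or more data stripes, which matches the point that a small two-device setup should not get one.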

At any rate if things are still getting badly out of balance we need to
tweak the allocator some more.

It's hard to reproduce because you need a burst of allocations for
whatever type is full.  I'll give it another shot.

-chris



* still kworker at 100% cpu in all of device size allocated with chunks situations with write load
  2015-12-16  1:20     ` Qu Wenruo
  2015-12-16  1:53       ` Liu Bo
@ 2016-01-01 10:44       ` Martin Steigerwald
  1 sibling, 0 replies; 25+ messages in thread
From: Martin Steigerwald @ 2016-01-01 10:44 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Mason, Btrfs BTRFS

First: Happy New Year to you!

Second: Take your time. I know it's the holidays for many. For me that means I
easily have time to follow up on this.

On Wednesday, 16 December 2015, 09:20:45 CET, Qu Wenruo wrote:
> Chris Mason wrote on 2015/12/15 16:59 -0500:
> > On Mon, Dec 14, 2015 at 10:08:16AM +0800, Qu Wenruo wrote:
> >> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
> >>> Hi!
> >>> 
> >>> For me it is still not production ready.
> >> 
> >> Yes, this is the *FACT* and not everyone has a good reason to deny it.
> >> 
> >>> Again I ran into:
> >>> 
> >>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> >>> random write into big file
> >>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
> >> 
> >> Not sure about guideline for other fs, but it will attract more dev's
> >> attention if it can be posted to maillist.
> >> 
> >>> No matter whether SLES 12 uses it as default for root, no matter whether
> >>> Fujitsu and Facebook use it: I will not let this onto any customer
> >>> machine
> >>> without lots and lots of underprovisioning and rigorous free space
> >>> monitoring. Actually I will renew my recommendations in my trainings to
> >>> be careful with BTRFS.
> >>> 
> >>>  From my experience the monitoring would check for:
> >>> merkaba:~> btrfs fi show /home
> >>> Label: 'home'  uuid: […]
> >>>          Total devices 2 FS bytes used 156.31GiB
> >>>          devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
> >>>          devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> >>> 
> >>> If "used" is same as "size" then make big fat alarm. It is not
> >>> sufficient for it to happen. It can run for quite some time just fine
> >>> without any issues, but I never have seen a kworker thread using 100%
> >>> of one core for extended period of time blocking everything else on the
> >>> fs without this condition being met.
> >>
> >> And specially advice on the device size from myself:
> >> Don't use devices over 100G but less than 500G.
> >> Over 100G will leads btrfs to use big chunks, where data chunks can be at
> >> most 10G and metadata to be 1G.
> >> 
> >> I have seen a lot of users with about 100~200G device, and hit unbalanced
> >> chunk allocation (10G data chunk easily takes the last available space
> >> and
> >> makes later metadata no where to store)
> > 
> > Maybe we should tune things so the size of the chunk is based on the
> > space remaining instead of the total space?
> 
> Submitted such patch before.
> David pointed out that such behavior will cause a lot of small
> fragmented chunks at last several GB.
> Which may make balance behavior not as predictable as before.
> 
> 
> At least, we can just change the current 10% chunk size limit to 5% to
> make such problem less easier to trigger.
> It's a simple and easy solution.
> 
> Another cause of the problem is, we understated the chunk size change
> for fs at the borderline of big chunk.
> 
> For 99G, its chunk size limit is 1G, and it needs 99 data chunks to
> fully cover the fs.
> But for 100G, it only needs 10 chunks to covert the fs.
> And it need to be 990G to match the number again.
> 
> The sudden drop of chunk number is the root cause.
> 
> So we'd better reconsider both the big chunk size limit and chunk size
> limit to find a balanaced solution for it.

Did you come to any conclusion here? Is there anything I can change on my
home BTRFS filesystem to try to find out what works? The challenge here is that
it doesn't happen under defined circumstances. So far I only know the necessary
condition, but not the sufficient condition for it to happen.

Another user ran into the issue and reported his findings in the bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=90401#c14

Thanks,
-- 
Martin


* kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready)
  2015-12-13 22:35 Still not production ready Martin Steigerwald
  2015-12-13 23:19 ` Marc MERLIN
  2015-12-14  2:08 ` Still not production ready Qu Wenruo
@ 2016-03-20 11:24 ` Martin Steigerwald
  2016-09-07  9:53   ` Christian Rohmann
  2 siblings, 1 reply; 25+ messages in thread
From: Martin Steigerwald @ 2016-03-20 11:24 UTC (permalink / raw)
  To: BTRFS

On Sunday, 13 December 2015 23:35:08 CET Martin Steigerwald wrote:
> Hi!
> 
> For me it is still not production ready. Again I ran into:
> 
> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401

I think I saw this up to kernel 4.3. I don't think I have seen it with 4.4
anymore, and definitely not with 4.5.

So it may be fixed.

Did anyone else see kworker threads using 100% of a core for minutes with 4.4 
/ 4.5?


For me this would be a big step forward. And yes, I am aware some people have
new and other issues. But for me a non-working balance (it may be broken here
as well, it errored out with "no space left on device" often enough) is still
something different from having to switch off the machine hard, unless you want
to give it a ton of time to eventually shut down, which is not an option if you
just want to work with your system.


In any case many thanks to all the developers working on improving BTRFS, and 
especially those who bring in bug fixes. I do think BTRFS still needs more 
stability work when I read through the recent mailing list threads.

Thanks,
Martin

> No matter whether SLES 12 uses it as default for root, no matter whether
> Fujitsu and Facebook use it: I will not let this onto any customer machine
> without lots and lots of underprovisioning and rigorous free space
> monitoring. Actually I will renew my recommendations in my trainings to be
> careful with BTRFS.
> 
> From my experience the monitoring would check for:
> 
> merkaba:~> btrfs fi show /home
> Label: 'home'  uuid: […]
>         Total devices 2 FS bytes used 156.31GiB
>         devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
>         devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
> 
> If "used" is same as "size" then make big fat alarm. It is not sufficient
> for it to happen. It can run for quite some time just fine without any
> issues, but I never have seen a kworker thread using 100% of one core for
> extended period of time blocking everything else on the fs without this
> condition being met.
> 
> 
> In addition to that last time I tried it aborts scrub any of my BTRFS
> filesstems. Reported in another thread here that got completely ignored so
> far. I think I could go back to 4.2 kernel to make this work.
> 
> 
> I am not going to bother to go into more detail on any on this, as I get the
> impression that my bug reports and feedback get ignored. So I spare myself
> the time to do this work for now.
> 
> 
> Only thing I wonder now whether this all could be cause my /home is already
> more than one and a half year old. Maybe newly created filesystems are
> created in a way that prevents these issues? But it already has a nice
> global reserve:
> 
> merkaba:~> btrfs fi df /
> Data, RAID1: total=27.98GiB, used=24.07GiB
> System, RAID1: total=19.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=536.80MiB
> GlobalReserve, single: total=192.00MiB, used=0.00B
> 
> 
> Actually when I see that this free space thing is still not fixed for good I
> wonder whether it is fixable at all. Is this an inherent issue of BTRFS or
> more generally COW filesystem design?
> 
> I think it got somewhat better. It took much longer to come into that state
> again than last time, but still, blocking like this is *no* option for a
> *production ready* filesystem.
> 
> 
> 
> I am seriously consider to switch to XFS for my production laptop again.
> Cause I never saw any of these free space issues with any of the XFS or
> Ext4 filesystems I used in the last 10 years.
> 
> Thanks,


-- 
Martin


* Re: kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready)
  2016-03-20 11:24 ` kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready) Martin Steigerwald
@ 2016-09-07  9:53   ` Christian Rohmann
  2016-09-07 14:28     ` Martin Steigerwald
  0 siblings, 1 reply; 25+ messages in thread
From: Christian Rohmann @ 2016-09-07  9:53 UTC (permalink / raw)
  To: Martin Steigerwald, BTRFS



On 03/20/2016 12:24 PM, Martin Steigerwald wrote:
>> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
>> > random write into big file
>> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> I think I saw this up to kernel 4.3. I think I didn't see this with 4.4
> anymore and definitely not with 4.5.
> 
> So it may be fixed.
> 
> Did anyone else see kworker threads using 100% of a core for minutes with 4.4 
> / 4.5?

I run 4.8-rc5 and currently see this issue. A kworker has been running at
100% for hours now and seems stuck there.

Anything I should look at in order to narrow this down to a root cause?


Regards

Christian


* Re: kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready)
  2016-09-07  9:53   ` Christian Rohmann
@ 2016-09-07 14:28     ` Martin Steigerwald
  0 siblings, 0 replies; 25+ messages in thread
From: Martin Steigerwald @ 2016-09-07 14:28 UTC (permalink / raw)
  To: Christian Rohmann; +Cc: BTRFS

On Wednesday, 7 September 2016, 11:53:04 CEST, Christian Rohmann wrote:
> On 03/20/2016 12:24 PM, Martin Steigerwald wrote:
> >> btrfs kworker thread uses up 100% of a Sandybridge core for minutes on
> >> 
> >> > random write into big file
> >> > https://bugzilla.kernel.org/show_bug.cgi?id=90401
> > 
> > I think I saw this up to kernel 4.3. I think I didn't see this with 4.4
> > anymore and definitely not with 4.5.
> > 
> > So it may be fixed.
> > 
> > Did anyone else see kworker threads using 100% of a core for minutes with
> > 4.4 / 4.5?
> 
> I run 4.8rc5 and currently see this issue. kworking has been running at
> 100% for hours now, seems stuck there.
> 
> Anything I should look at in order to narrow this down to a root cause?

I haven't seen any issues since my last post; I am currently running 4.8-rc5 myself.

I suggest you look at the kernel log and review this thread and my bug
report for the other information I came up with. In particular, in my case the
issue only happened when BTRFS had allocated all device space into chunks, but
the space in the chunks was not fully used up yet, i.e. when BTRFS had to search
for free space in existing chunks and couldn't just allocate a new chunk
anymore. In addition to that, note your BTRFS configuration, storage
configuration, and so on. Just review what I reported to get an idea.
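
A minimal sketch of a check for that condition, assuming the `btrfs fi show` output format quoted earlier in this thread, where the per-device "used" value is the space allocated to chunks. The function names and the 95% threshold are hypothetical choices for illustration, and the sample text is copied from the thread rather than read from a live system.

```python
import re

# Sample output copied from earlier in this thread.
SAMPLE = """\
Label: 'home'  uuid: […]
        Total devices 2 FS bytes used 156.31GiB
        devid    1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
        devid    2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home
"""

_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
_DEVID = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)(\w+)\s+used\s+([\d.]+)(\w+)\s+path\s+(\S+)"
)

def to_bytes(number, unit):
    """Convert a value such as ('170.00', 'GiB') to bytes."""
    return float(number) * _UNITS[unit]

def allocation_report(text, threshold=0.95):
    """Return (path, allocated_ratio, alarm) for every devid line.

    alarm is True when the allocated share of the device reaches the
    (hypothetical) threshold, i.e. almost no room is left for new chunks.
    """
    report = []
    for m in _DEVID.finditer(text):
        size = to_bytes(m.group(2), m.group(3))
        used = to_bytes(m.group(4), m.group(5))
        if size == 0:
            continue  # skip devices reporting no size
        ratio = used / size
        report.append((m.group(6), ratio, ratio >= threshold))
    return report

for path, ratio, alarm in allocation_report(SAMPLE):
    print(f"{path}: {ratio:.1%} allocated{' ALARM' if alarm else ''}")
```

As noted in this thread, a fully allocated device is only a necessary condition for the hang, not a sufficient one, so an alarm from a check like this is a warning sign rather than proof of trouble.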

If you are sufficiently sure from looking at the kernel log that your issue is
the same, i.e. if the backtraces look sufficiently similar, then I'd add to my
bug report. Otherwise I'd open a new one.

Good luck.
-- 
Martin


end of thread, other threads:[~2016-09-07 14:35 UTC | newest]

Thread overview: 25+ messages
2015-12-13 22:35 Still not production ready Martin Steigerwald
2015-12-13 23:19 ` Marc MERLIN
2015-12-14  7:59   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
2015-12-14  2:08 ` Still not production ready Qu Wenruo
2015-12-14  6:21   ` Duncan
2015-12-14  7:32     ` Qu Wenruo
2015-12-14 12:10       ` Duncan
2015-12-14 19:08         ` Chris Murphy
2015-12-14 20:33           ` Austin S. Hemmelgarn
2015-12-14  8:18   ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load (was: Re: Still not production ready) Martin Steigerwald
2015-12-14  8:48     ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Qu Wenruo
2015-12-14  8:59       ` Martin Steigerwald
2015-12-14  9:10       ` safety of journal based fs (was: Re: still kworker at 100% cpu…) Martin Steigerwald
2015-12-22  2:34         ` Kai Krakow
2015-12-15 21:59   ` Still not production ready Chris Mason
2015-12-15 23:16     ` Martin Steigerwald
2015-12-16  1:20     ` Qu Wenruo
2015-12-16  1:53       ` Liu Bo
2015-12-16  2:19         ` Qu Wenruo
2015-12-16  2:30           ` Liu Bo
2015-12-16 14:27             ` Chris Mason
2016-01-01 10:44       ` still kworker at 100% cpu in all of device size allocated with chunks situations with write load Martin Steigerwald
2016-03-20 11:24 ` kworker threads may be working saner now instead of using 100% of a CPU core for minutes (Re: Still not production ready) Martin Steigerwald
2016-09-07  9:53   ` Christian Rohmann
2016-09-07 14:28     ` Martin Steigerwald
