linux-btrfs.vger.kernel.org archive mirror
From: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
To: Zygo Blaxell <zblaxell@furryterror.org>, Robert White <rwhite@pobox.com>
Cc: Grzegorz Kowal <custos.mentis@gmail.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Standards Problems [Was: [PATCH v2 1/3] Btrfs: get more accurate output in df command.]
Date: Mon, 5 Jan 2015 17:56:00 +0800
Message-ID: <54AA5FB0.5020102@cn.fujitsu.com>
In-Reply-To: <20141231001514.GA13116@hungrycats.org>

On 12/31/2014 08:15 AM, Zygo Blaxell wrote:
> On Wed, Dec 17, 2014 at 08:07:27PM -0800, Robert White wrote:
>> [...]
> There are a number of pathological examples in here, but I think there
> are justifiable correct answers for each of them that emerge from a
> single interpretation of the meanings of f_bavail, f_blocks, and f_bfree.
>
> One gotcha is that some of the numbers required may be difficult to
> calculate precisely before all space is allocated to chunks; however,
> some error is tolerable as long as free space is not overestimated.
> In other words:  when in doubt, guess low.
>
> statvfs(2) gives us six numbers, three of which are block counts.
> Very few users or programs ever bother to look at the inode counts
> (f_files, f_ffree, f_favail), but they could be overloaded for metadata
> block counts.
>
> The f_blocks parameter is mostly irrelevant to application behavior,
> except to the extent that the ratio between f_bavail and f_blocks is
> used by applications to calculate a percentage of occupied or free space.
> f_blocks must always be greater than or equal to f_bavail and f_bfree,
> and preferably f_blocks would be scaled to use the same effective unit
> size as f_bavail and f_bfree within a percent or two.
>
> Nobody cares about f_bfree since traditionally only root could use the
> difference between f_bfree and f_bavail.  f_bfree is effectively space
> conditionally available (e.g. if the process euid is root or the process
> egid matches a configured group id), while f_bavail is space available
> without conditions (e.g. processes without privilege can use it).
>
> The most important number is f_bavail.  It's what a bunch of software
> (archive unpackers, assorted garbage collectors, email MTAs, snapshot
> remover scripts, download managers, etc) uses to estimate how much space
> is available without conditions (except quotas, although arguably those
> should be included too).  Applications that are privileged still use
> the unprivileged f_bavail number so their decisions based on free space
> don't disrupt unprivileged applications.
>
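
As a rough illustration of how the three block counts discussed above are typically consumed, here is a minimal Python sketch. The helper name summarize() is hypothetical, and the percentage formula (used / (used + available), rounded up) is an assumption about one common df-style convention, not something stated in this thread. The input is a synthetic record so the numbers are reproducible.

```python
import math
from types import SimpleNamespace

def summarize(st):
    """Turn statvfs-style block counts into bytes and a use percentage."""
    unit = st.f_frsize                  # block counts are in f_frsize units
    used = st.f_blocks - st.f_bfree     # blocks currently in use
    avail = st.f_bavail                 # blocks usable without privilege
    denom = used + avail
    # Assumed convention: percentage of (used + available), rounded up.
    pct = math.ceil(100 * used / denom) if denom else 0
    return {
        "total_bytes": st.f_blocks * unit,
        "free_bytes": st.f_bfree * unit,
        "avail_bytes": avail * unit,
        "use_percent": pct,
    }

# Synthetic example: 1000 blocks of 4 KiB, 300 free, 250 available to non-root.
example = SimpleNamespace(f_frsize=4096, f_blocks=1000, f_bfree=300, f_bavail=250)
print(summarize(example))
```

On a Unix system the same function should accept the result of os.statvfs(path), since that object exposes the same field names.
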
> It's generally better to underestimate than to overestimate f_bavail.
> Historically filesystems have reserved extra space to avoid various
> problems in low-disk conditions, and application software has adapted
> to that well over the years.  Also, admin people are more pleasantly
> surprised when it turns out that they had more space than f_bavail,
> instead of when they had less.
>
> The rule should be:  if we have some space, but it is not available for
> data extents in the current allocation mode, don't add it to f_bavail
> in statvfs.  I think this rule handles all of these examples well.
>
> That would mean that we get cases where we add a drive to a full
> filesystem and it doesn't immediately give you any new f_bavail space.
> That may be an unexpected result for a naive admin, but much less
> unexpected than having all the new space show up in f_bavail when it
> is not available for allocation in the current data profile!  Better
> to have the surprising behavior earlier than later.
>
> On to examples...
>
>> But a more even case is downright common and likely. Say you run a
>> nice old-fashioned MUTT mail-spool. "most" of your files are small
>> enough to live in metadata. You start with one drive and allocate 2
>> single-data and 10 metadata (5xDup). Then you add a second drive of
>> equal size. (the metadata just switched to DUP-as-RAID1-alike mode)
>> And then you do a dconvert=raid0.
>>
>> That uneven allocation of metadata will be a 2GiB difference between
>> the two drives forever.
>> So do you shave 2GiB off of your @size?
> Yes.  f_blocks is the total size of all allocated chunks plus all free
> space allocated by the current data profile.

Agreed. This is what my patch was designed to do.
>    That 2GiB should disappear
> from such a calculation.
>
>> Do you shave @2GiB off your @available?
> Yes, because it's _not_ available until something changes to make it
> available (e.g. balance to get rid of the dup metadata, change the
> metadata profile to dup or single, or change the data profile to single).
>
> The 2GiB could be added to f_bfree, but that might still be confusing
> for people and software.
>
>> Do you overreport your available by @2GiB and end up _still_ having
>> things "available" when you get your ENOSPC?
> No.  ENOSPC when f_bavail > 0 is very bad.

Yes, it is very bad.
>   Low-available-space admin
> alerts will not be triggered.  Automated mitigation software will not be
> activated.  Service daemons will start transactions they cannot complete.
>
>> How about this ::
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free                                 |
>>
>> Operator fills his drive, then adds a second one, then _foolishly_
>> tries to convert it to RAID0 when the power fails. In order to check
>> the FS he boots with no_balance. Then his maintenance window closes
>> and he has to go back into production, at which point he forgets (or
>> isn't allowed) to do the balance. The flags are set but now no more
>> extents can be allocated.
>>
>> Size is 20GiB, slack is 10.5GiB. Operator is about to get ENOSPC.

I am not clear about this use case. Is the current profile raid0? If so,
@available is 10.5G. If it is raid1, @available is 0.5G.
> f_bavail should be 0.5GB or so.  Operator is now aware that ENOSPC is
> imminent, and can report to whoever grants permission to do things that
> the machine will be continuing to balance outside of the maintenance
> window.  This is much better than the alternative, which is that the
> lack of available space is detected by application failure outside of
> a maintenance window.
>
> Even better:  if f_bavail is reflective of reality, the operator can
> minimize out-of-window balance time by monitoring f_bavail and pausing
> the balance when there is enough space to operate without ENOSPC until
> the next maintenance window.
>
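
The monitoring idea above (pause the balance once there is enough space to last until the next window) can be sketched as a simple check. Everything here is hypothetical: the function names, the growth-rate input, and the 25% safety margin are illustrative assumptions, chosen so that errors lean toward underestimating headroom.

```python
import os

def bavail_bytes(path):
    """Bytes available to unprivileged processes on the filesystem at path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

def headroom_ok(avail_bytes, growth_bytes_per_hour, hours_to_window, margin=1.25):
    """True if avail_bytes should last until the next maintenance window.

    The margin inflates the expected growth, so the check guesses low.
    """
    needed = growth_bytes_per_hour * hours_to_window * margin
    return avail_bytes >= needed

GiB = 2**30
print(headroom_ok(10 * GiB, 1 * GiB, 6))   # 7.5 GiB needed, 10 GiB available
print(headroom_ok(5 * GiB, 1 * GiB, 6))    # 7.5 GiB needed, 5 GiB available
```

A script could poll bavail_bytes() on the mount point during the balance and pause it once headroom_ok() returns True.
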
>> Yes a balance would fix it, but that's not the question.
> The question is "how much space is available at this time?" and the
> correct answer is "almost none," and it stays that way until and unless
> someone runs a balance, adds more drives, deletes a lot of data, etc.
>
> balance changes the way space will be allocated, so it also changes
> the output of df to match.
>
>> In the meantime what does your patch report?
> It should report that there's almost no available space.
> If the patch doesn't report that, the patch needs rework.
>
>> Or...
>>
>> /dev/sda == |Sf|Sf|Mf|Mf|Mf|Mf|Sf|Sf|Sp|Mp|Mp| .5GiB free|
>> /dev/sdb == |10 GiB free                                 |
>> /dev/sdc == |10 GiB free                                 |
>>
>> Does a -dconvert=raid5 and immediately gets ENOSPC for all the
>> blocks. According to the flags we've got 10GiB free...
> 10.5GiB is correct:  1.0GiB in a 3-way RAID5 on sd[abc] (2x 0.5GiB
> data, 1x 0.5GiB parity), and 9.5GiB in a 2-way RAID5 (1x 9.5GiB data,
> 1x 9.5GiB parity) on sd[bc].  The current RAID5 implementation might
> not use space that way, but if that's true, that's arguably a bug in
> the current RAID5 implementation.

The current RAID5 allocator does not work like this. It will allocate 10G
(0.5 on sd[ab] + 9.5 on sd[bc]).

The calculation in statfs() is the same as the calculation in the current
allocator, so it will report 10G available.

Yes, it would be better to make the allocator cleverer in these cases,
but that is a separate topic about the allocator.
>
> If it was -dconvert=raid6, there would be only 0.5GiB in f_bavail (1x
> 0.5GiB data, 2x 0.5GiB parity) since raid6 requires 3 disks per chunk.
>
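
The RAID5 and RAID6 figures quoted above (10.5GiB and 0.5GiB) can be reproduced with a small greedy model. This is only a sketch of the idealized arithmetic in the quoted text, not the btrfs allocator, which as noted earlier arrives at 10G for the RAID5 case. The function name and parameters are hypothetical. Sizes are in MiB so the arithmetic stays in integers.

```python
def usable_data(free, min_devs, parity):
    """Greedy estimate of data space for a striped profile.

    free:     per-device unallocated space (any integer unit)
    min_devs: fewest devices a chunk may span
    parity:   devices' worth of each stripe spent on parity
    """
    free = sorted(f for f in free if f > 0)
    total = 0
    while len(free) >= min_devs:
        width = len(free)
        step = free[0]                      # smallest device limits this round
        total += (width - parity) * step    # data portion of the stripes
        free = [f - step for f in free if f - step > 0]
    return total

MIB_FREE = [512, 10240, 10240]              # sda, sdb, sdc from the example
print(usable_data(MIB_FREE, min_devs=2, parity=1) / 1024)   # RAID5, in GiB
print(usable_data(MIB_FREE, min_devs=3, parity=2) / 1024)   # RAID6, in GiB
```

The first round stripes across all three devices until sda is full. RAID5 can then continue on the two remaining devices, while RAID6 stops because fewer than three devices have free space.
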
>> See you keep giving me these examples where the history of the
>> filesystem is uniform. It was made a certain way and stayed that
>> way. But in real life this sort of thing is going to happen and your
>> patch simply reports a _different_ _wrong_ number. A _friendlier_
>> wrong number, I'll grant you that, but still wrong.
> Who cares if the number is wrong, as long as useful decisions can still be
> made with it?  It doesn't have to be byte-accurate in all possible cases.
>
> Existing software and admin practice is OK with underreporting free
> space, but not overreporting it.  All the errors should be biased in
> that direction.

Thanks Zygo and Robert. I agree that my patch did not cover the situation
where block groups have different raid levels. I will update my patch and
send it out soon.

Thanks for your suggestions.

Yang
>


