Re: Very slow balance / btrfs-transaction

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: <fdmanana@gmail.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>,
	Jorg Bornschein <jb@capsec.org>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: Very slow balance / btrfs-transaction
Date: Thu, 9 Feb 2017 09:13:07 +0800	[thread overview]
Message-ID: <2c726883-c0c9-9ea2-cfa5-764e47bcc21f@cn.fujitsu.com> (raw)
In-Reply-To: <CAL3q7H4z1S1xCv10VGutEGJWF+ROQB0-pc9nPWAdqFyue08S9g@mail.gmail.com>



At 02/08/2017 09:56 PM, Filipe Manana wrote:
> On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo <quwenruo@cn.fujitsu.com> wrote:
>>
>>
>> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>>>
>>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo <quwenruo@cn.fujitsu.com>
>>> wrote:
>>>>
>>>>
>>>>
>>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi Qu,
>>>>>
>>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>>>
>>>>>>>
>>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgoldwyn@suse.de>
>>>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> <snipped>
>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Quata support was indeed active -- and it warned me that the qroup
>>>>>>> data was inconsistent.
>>>>>>>
>>>>>>> Disabling quotas had an immediate impact on balance throughput -- it's
>>>>>>> *much* faster now!
>>>>>>> From a quick glance at iostat I would guess it's at least a factor 100
>>>>>>> faster.
>>>>>>>
>>>>>>>
>>>>>>> Should quota support generally be disabled during balances? Or did I
>>>>>>> somehow push my fs into a weired state where it triggered a slow-path?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>    j
>>>>>>
>>>>>>
>>>>>>
>>>>>> Would you please provide the kernel version?
>>>>>>
>>>>>> v4.9 introduced a bad fix for qgroup balance, which doesn't completely
>>>>>> fix qgroup bytes leaking, but also hugely slow down the balance
>>>>>> process:
>>>>>>
>>>>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>>> Author: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>>
>>>>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>>
>>>>>> Sorry for that.
>>>>>>
>>>>>> And in v4.10, a better method is applied to fix the byte leaking
>>>>>> problem, and should be a little faster than previous one.
>>>>>>
>>>>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>>> Author: Qu Wenruo <quwenruo@cn.fujitsu.com>
>>>>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>>
>>>>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>>
>>>>>>
>>>>>> However, using balance with qgroup is still slower than balance without
>>>>>> qgroup, the root fix needs us to rework current backref iteration.
>>>>>>
>>>>>
>>>>> This patch has made the btrfs balance performance worse. The balance
>>>>> task has become more CPU intensive compared to earlier and takes longer
>>>>> to complete, besides hogging resources. While correctness is important,
>>>>> we need to figure out how this can be made more efficient.
>>>>>
>>>> The cause is already known.
>>>>
>>>> It's find_parent_node() which takes most of the time to find all
>>>> referencer
>>>> of an extent.
>>>>
>>>> And it's also the cause for FIEMAP softlockup (fixed in recent release by
>>>> early quit).
>>>>
>>>> The biggest problem is, current find_parent_node() uses list to iterate,
>>>> which is quite slow especially it's done in a loop.
>>>> In real world find_parent_node() is about O(n^3).
>>>> We can either improve find_parent_node() by using rb_tree, or introduce
>>>> some
>>>> cache for find_parent_node().
>>>
>>>
>>> Even if anyone is able to reduce that function's complexity from
>>> O(n^3) down to lets say O(n^2) or O(n log n) for example, the current
>>> implementation of qgroups will always be a problem. The real problem
>>> is that this more recent rework of qgroups does all this accounting
>>> inside the critical section of a transaction - blocking any other
>>> tasks that want to start a new transaction or attempt to join the
>>> current transaction. Not to mention that on systems with small amounts
>>> of memory (2Gb or 4Gb from what I've seen from user reports) we also
>>> OOM due this allocation of struct btrfs_qgroup_extent_record per
>>> delayed data reference head, that are used for that accounting phase
>>> in the critical section of a transaction commit.
>>>
>>> Let's face it and be realistic, even if someone manages to make
>>> find_parent_node() much much better, like O(n) for example, it will
>>> always be a problem due to the reasons mentioned before. Many extents
>>> touched per transaction and many subvolumes/snapshots, will always
>>> expose that root problem - doing the accounting in the transaction
>>> commit critical section.
>>
>>
>> You must accept the fact that we must call find_parent_node() at least twice
>> to get correct owner modification for each touched extent.
>> Or qgroup number will never be correct.
>>
>> One for old_roots by searching commit root, and one for new_roots by
>> searching current root.
>>
>> You can call find_parent_node() as many time as you like, but that's just
>> wasting your CPU time.
>>
>> Only the final find_parent_node() will determine new_roots for that extent,
>> and there is no better timing than commit_transaction().
>
> You're missing my point.
>
> My point is not about needing to call find_parent_nodes() nor how many
> times to call it, or whether it's needed or not. My point is about
> doing expensive things inside the critical section of a transaction
> commit, which leads not only to low performance but getting a system
> becoming unresponsive and with too high latency - and this is not
> theory or speculation, there are upstream reports about this as well
> as several in suse's bugzilla, all caused when qgroups are enabled on
> 4.2+ kernels (when the last qgroups major changes landed).
>
> Judging from that code and from your reply to this and other threads
> it seems you didn't understand the consequences of doing all that
> accounting stuff inside the critical section of a transaction commit.

NO, I know what you're talking about.
Or I won't send the patch to move half of the find_all_roots() call out 
of commit_trans().
(OK, I just like refer to commit_trans() to its critical section, as 
quick exit or other part doesn't make much impact in this context)

While it seems that you still don't understand the necessity to call 
find_all_roots() in critical section.
It's the base stone of qgroup, or we get back to the qgroup mismatch days.

As I already mentioned, only before we switch fs commit roots, we can 
avoid calling extra expensive find_all_roots() and get correct new_roots.

That's to say, at least one find_all_roots() should be called at 
critical section for extent modified extent, to keep qgroup correct.

If you could have any better solution to keep qgroup correct and avoid 
calling find_all_roots() in critical section, I'm happy to listen.

>
> Forget find_parent_nodes() for a moment, yes it has problems, but it's
> not what I'm trying to make you understand. Just look at
> btrfs_qgroup_account_extents(), called within the critical section -
> it iterates all elements of a red black tree, and each element
> corresponds to some data extent allocated in the current transaction -
> if we have thousands, tens of thousands, or more, even if whatever the
> loop calls had an awesome complexity of O(1) or O(log N) it would
> still be bad, exactly because it's blocking future transactions to
> start and tasks from joining the current transaction. CPU time and
> memory consumption (used for those struct btrfs_qgroup_extent_record)
> are also concerns, but to a smaller extent imo.

That's another problem.

For worst case, we could limit the number of delayed ref head before 
triggering a transaction commit.
But I still doubt if qgroup extent record is the cause for OOM.

Memory consumption of qgroup_extent_record is quite small, it's only 48 
bytes for one extent.
(And delayed ref head is much larger than qgroup_extent_record)

While for one extent, its minimum size is 4K, overhead would be at most 1%.

IIRC before we hit OOM, memory pressure should trigger 
commit_transaction(), and freeing all the qgroup_extent_record().
So there is some other problem related to this.

>
>>
>> Or you can wasting more time calling find_parent_node() every time you
>> touched a extent, saving one find_parent_node() in commit_transaction() with
>> the cost of more find_parent_node() in other place.
>> Is that what you want?
>>
>> I can move the find_parent_node() for old_roots out of commit_transaction().
>> But that will only reduce 50% of the time spent on commit_transaction().
>>
>> Compared to O(n^3) find_parent_node(), that's not the determining fact even.
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>>>
>>>>
>>>> IIRC SUSE guys(maybe Jeff?) are working on it with the first method, but
>>>> I
>>>> didn't hear anything about it recently.
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>>
>>
>>
>
>
>

next prev parent reply	other threads:[~2017-02-09  1:16 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-03 22:13 Very slow balance / btrfs-transaction jb
2017-02-03 23:25 ` Goldwyn Rodrigues
2017-02-04  0:30 ` Jorg Bornschein
2017-02-04  1:07   ` Goldwyn Rodrigues
2017-02-04  1:47   ` Jorg Bornschein
2017-02-04  2:55     ` Lakshmipathi.G
2017-02-04  8:22       ` Duncan
2017-02-06  1:45     ` Qu Wenruo
2017-02-06 16:09       ` Goldwyn Rodrigues
2017-02-07  0:22         ` Qu Wenruo
2017-02-07 15:55           ` Filipe Manana
2017-02-08  0:39             ` Qu Wenruo
2017-02-08 13:56               ` Filipe Manana
2017-02-09  1:13                 ` Qu Wenruo [this message]
2017-02-06  9:14     ` Jorg Bornschein
2017-02-06  9:29       ` Qu Wenruo
2017-02-04 20:50   ` Jorg Bornschein
2017-02-04 21:10     ` Kai Krakow
2017-02-06 13:19       ` Austin S. Hemmelgarn
2017-02-07 19:47         ` Kai Krakow
2017-02-07 19:58           ` Austin S. Hemmelgarn
  -- strict thread matches above, loose matches on Subject: below --
2017-07-01 14:24 Sidney San Martín

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2c726883-c0c9-9ea2-cfa5-764e47bcc21f@cn.fujitsu.com \
    --to=quwenruo@cn.fujitsu.com \
    --cc=fdmanana@gmail.com \
    --cc=jb@capsec.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=rgoldwyn@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).