Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2

public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed

From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hans van Kranenburg <hans@knorrie.org>, linux-btrfs@vger.kernel.org
Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
Date: Tue, 28 Jul 2020 09:52:39 +0800	[thread overview]
Message-ID: <5155a008-d534-18d3-5416-1c879031b45d@gmx.com> (raw)
In-Reply-To: <b4d0c916-71d5-753a-1c10-cd28f1c3ebec@gmx.com>



On 2020/7/28 上午8:51, Qu Wenruo wrote:
>
>
> On 2020/7/28 上午1:17, Hans van Kranenburg wrote:
>> Hi!
>>
>> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/25 下午10:24, Hans van Kranenburg wrote:
>>>> Hi,
>>>>
>>>> I have a filesystem here that I'm filling up with data from elsewhere.
>>>> Most of it is done by rsync, and part by send/receive. So, receiving
>>>> data over the network, and then writing the files to disk. There can be
>>>> a dozen of these processes running in parallel.
>>>>
>>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>>> varying X) often is using nearly 100% cpu, while enormously slowing down
>>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>>>> processes.
>>>
>>> The name is in fact pretty strange.
>>> It doesn't follow the btrfs workqueue names.
>>>
>>> There are two type of kernel threads used by btrfs:
>>>   [...]
>>>
>>> As you can see, there is no one named like "flush-btrfs".
>>
>> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
>> currently. ;]
>>
>>> Thus I guess it's from other part of the stack.
>>>
>>> Also, the calltrace also shows that, that kernel thread is only doing
>>> page writeback, which calls back to the page write hooks of btrfs.
>>>
>>> So I guess it may not be btrfs, but something else trying to do all the
>>> writeback.
>>
>> Yes, so, from the writeback path it re-enters btrfs code, right?
>
> Yep. Btrfs is still responsible for the problem.
>
>>
>>> But still, the CPU usage is still a problem, it shouldn't cost so much
>>> CPU time just writing back pages from btrfs.
>>
>> It's find_free_extent which is the culprit.
>>
>>>> [...]
>>>>
>>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>>>
>>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>>>> shot' of it. Also attached an example of what nmon shows to give an idea
>>>> about the situation.
>>>>
>>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>>>> see network and disk write speed ramping up, easily over 200 MiB/s,
>>>> until it soon plummets again.
>>>>
>>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>>>> is booted now).
>>>>
>>>> So, what I'm looking for is:
>>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>>> * What does this flush thing do?
>>>> * Why is it using 100% cpu all the time?
>>>> * How can I debug this more?
>>>
>>> bcc based runtime probes I guess?
>>>
>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>> detector won't help much.
>>
>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>
> That's really awesome!
>
>>
>>> You can try trace events first to see which trace event get executed the
>>> most frequently, then try to add probe points to pin down the real cause.
>>
>> From the default collection, I already got the following, a few days
>> ago, by enabling find_free_extent and btrfs_cow_block:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

This output is in fact pretty confusing, and maybe give you a false view
on the callers of find_free_extent().

It always shows "EXTENT_TREE" as the owner, but that's due to a bad
decision on the trace event.

I have submitted a patch addressing it, and added you to the CC.

Would you mind to re-capture the events with that patch?
So that we could have a clearer idea on which tree is having the most
amount of concurrency?

Thanks,
Qu

>>
>> From the timestamps you can see how long this takes. It's not much that
>> gets done per second.
>>
>> The spin lock part must be spin_lock(&space_info->lock) because that's
>> the only one in find_free_extent.
>>
>> So, what I think is that, like I mentioned on saturday already,
>> find_free_extent is probably doing things in a really inefficient way,
>> scanning around for a small single free space gap all the time in a
>> really expensive way, and doing that over again for each tiny metadata
>> block or file that needs to be placed somewhere (also notice the
>> empty_size=0), instead of just throwing all of it on disk after each
>> other, when it's otherwise slowing down everyone.
>
> I think you're mostly right, find_free_extent() is not that efficient.
>
> But that only happens when there are a lot of parallel requests for it.
>
> One example is btrfs_get_block_group().
> That function is only calling one function, refcount_inc(), which
> normally shouldn't be that slow, unless there are tons of other callers
> here.
>
>>
>> And then we have the spin lock part and btrfs_get_block_group, which are
>> also a significant part of the time spent. All together is the
>> continuous 100% cpu.
>
> So is the spinlock part. That spinlock itself should be pretty fast
> (that's why we use spinlock), but when it's not, it's mostly caused by
> high concurrency.
>
>>
>> What I *can* try is switch to the ssd_spread option, to force it to do
>> much more yolo allocation, but I'm afraid this will result in a sudden
>> blast of metadata block groups getting allocated. Or, maybe try it for a
>> minute or so and compare the trace pipe output...
>
> Another thing can't be explained is, that slow down you're experiencing
> is from data write back.
> It should never conflicts with metadata space reservation, as data and
> metadata are separated by space_info first, thus they shouldn't affect
> each other.
> Either I'm missing something, or there are other problems.
>
>>
>>> But personally speaking, it's better to shrink the workload, to find a
>>> minimal workload to reproduce the 100% CPU burn, so that you need less
>>> probes/time to pindown the problem.
>>
>> I think I can reproduce it with a single btrfs receive operation, as
>> long as it has a large amount of small files in it.
>
> IMHO, if the culprit is really find_free_extent() itself, then single
> receive won't reach the concurrency we need to expose the slowdown.
>
> Since receive is just doing serial writes, that shouldn't cause enough
> race for find_free_extents().
>
> But that's worthy trying to see if it proofs my assumption of concurrency.
>
> Thanks,
> Qu
>
>>
>> Currently I'm almost only copying data onto the filesystem, only data
>> removals are some targeted dedupe script for known big files that are
>> identical, but which could not easily be handled by rsync.
>>
>> So, when find_free_extent wants to write extent tree blocks, it might be
>> playing outside in the woods all the time, searching around for little
>> holes that were caused by previous cow operations.
>>
>> Large files are not a problem, like postgresql database dumps. Copying
>> them over just runs >200-300M/s without problem, where the read speed of
>> the old storage I'm moving things off is the bottleneck, actually.
>>
>>>> * Ultimately, of course... how can we improve this?
>>>>
>>>> [...]
>>
>> I will see if I can experiment a bit with putting more trace points in.
>>
>> I remember reading around in the extent allocator code, until I got
>> dizzy from it. Deja vu...
>>
>> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>>
>> Thanks
>> Hans
>>

next prev parent reply	other threads:[~2020-07-28  1:52 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
2020-07-25 15:37 ` Holger Hoffstätte
2020-07-25 16:43   ` Hans van Kranenburg
2020-07-25 19:44     ` Holger Hoffstätte
2020-07-25 21:03       ` Hans van Kranenburg
2020-07-26  1:00         ` Chris Murphy
2020-07-25 21:27 ` Hans van Kranenburg
2020-07-26  8:10   ` A L
2020-07-26  0:50 ` Chris Murphy
2020-07-27 11:09 ` Qu Wenruo
2020-07-27 17:17   ` Hans van Kranenburg
2020-07-27 19:23     ` Chris Murphy
2020-07-27 23:16     ` Chris Murphy
2020-07-28  0:51     ` Qu Wenruo
2020-07-28  1:52       ` Qu Wenruo [this message]
2020-07-28 14:52         ` Hans van Kranenburg
2020-07-29  0:15           ` Qu Wenruo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5155a008-d534-18d3-5416-1c879031b45d@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=hans@knorrie.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox