From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.3 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5007FC433E0 for ; Tue, 28 Jul 2020 00:51:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 17B4420775 for ; Tue, 28 Jul 2020 00:51:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=gmx.net header.i=@gmx.net header.b="fg0VhiIi" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726327AbgG1Av0 (ORCPT ); Mon, 27 Jul 2020 20:51:26 -0400 Received: from mout.gmx.net ([212.227.15.18]:59519 "EHLO mout.gmx.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726194AbgG1AvZ (ORCPT ); Mon, 27 Jul 2020 20:51:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=gmx.net; s=badeba3b8450; t=1595897481; bh=b8bn6Xr9v4dZ50MzaEMvxNYzVjQxf4dD035d/c+WBH8=; h=X-UI-Sender-Class:Subject:To:References:From:Date:In-Reply-To; b=fg0VhiIii/i2Q/0QiHcz3BFS2feiPODv1Et6wwIoieLgGPaIcvFhm3QEpz4V3EBCK QuADI6VJEQncFLei4wiePP2ZLshg70u3ZdD8tjFkcRcQLXs7J4GAP/9wZoRF3uJzps JOSJdYHEKJHnriDo1/a5Gy1EhJUlPaESlSS5LzjY= X-UI-Sender-Class: 01bb95c1-4bf8-414a-932a-4f6e2808ef9c Received: from [0.0.0.0] ([149.28.201.231]) by mail.gmx.com (mrgmx005 [212.227.17.184]) with ESMTPSA (Nemesis) id 1MlNpH-1kekj73i0L-00lkVu; Tue, 28 Jul 2020 02:51:21 +0200 Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 To: Hans van Kranenburg , linux-btrfs@vger.kernel.org References: <2523ce77-31a3-ecec-f36d-8d74132eae02@knorrie.org> From: Qu Wenruo Autocrypt: addr=quwenruo.btrfs@gmx.com; prefer-encrypt=mutual; keydata= mQENBFnVga8BCACyhFP3ExcTIuB73jDIBA/vSoYcTyysFQzPvez64TUSCv1SgXEByR7fju3o 8RfaWuHCnkkea5luuTZMqfgTXrun2dqNVYDNOV6RIVrc4YuG20yhC1epnV55fJCThqij0MRL 1NxPKXIlEdHvN0Kov3CtWA+R1iNN0RCeVun7rmOrrjBK573aWC5sgP7YsBOLK79H3tmUtz6b 9Imuj0ZyEsa76Xg9PX9Hn2myKj1hfWGS+5og9Va4hrwQC8ipjXik6NKR5GDV+hOZkktU81G5 gkQtGB9jOAYRs86QG/b7PtIlbd3+pppT0gaS+wvwMs8cuNG+Pu6KO1oC4jgdseFLu7NpABEB AAG0IlF1IFdlbnJ1byA8cXV3ZW5ydW8uYnRyZnNAZ214LmNvbT6JAVQEEwEIAD4CGwMFCwkI BwIGFQgJCgsCBBYCAwECHgECF4AWIQQt33LlpaVbqJ2qQuHCPZHzoSX+qAUCWdWCnQUJCWYC bgAKCRDCPZHzoSX+qAR8B/94VAsSNygx1C6dhb1u1Wp1Jr/lfO7QIOK/nf1PF0VpYjTQ2au8 ihf/RApTna31sVjBx3jzlmpy+lDoPdXwbI3Czx1PwDbdhAAjdRbvBmwM6cUWyqD+zjVm4RTG rFTPi3E7828YJ71Vpda2qghOYdnC45xCcjmHh8FwReLzsV2A6FtXsvd87bq6Iw2axOHVUax2 FGSbardMsHrya1dC2jF2R6n0uxaIc1bWGweYsq0LXvLcvjWH+zDgzYCUB0cfb+6Ib/ipSCYp 3i8BevMsTs62MOBmKz7til6Zdz0kkqDdSNOq8LgWGLOwUTqBh71+lqN2XBpTDu1eLZaNbxSI ilaVuQENBFnVga8BCACqU+th4Esy/c8BnvliFAjAfpzhI1wH76FD1MJPmAhA3DnX5JDORcga CbPEwhLj1xlwTgpeT+QfDmGJ5B5BlrrQFZVE1fChEjiJvyiSAO4yQPkrPVYTI7Xj34FnscPj /IrRUUka68MlHxPtFnAHr25VIuOS41lmYKYNwPNLRz9Ik6DmeTG3WJO2BQRNvXA0pXrJH1fN GSsRb+pKEKHKtL1803x71zQxCwLh+zLP1iXHVM5j8gX9zqupigQR/Cel2XPS44zWcDW8r7B0 q1eW4Jrv0x19p4P923voqn+joIAostyNTUjCeSrUdKth9jcdlam9X2DziA/DHDFfS5eq4fEv ABEBAAGJATwEGAEIACYCGwwWIQQt33LlpaVbqJ2qQuHCPZHzoSX+qAUCXZw1rgUJCWpOfwAK CRDCPZHzoSX+qFcEB/95cs8cM1OQdE/GgOfCGxwgckMeWyzOR7bkAWW0lDVp2hpgJuxBW/gy fmtBnUaifnggx3EE3ev8HTysZU9q0h+TJwwJKGv6sUc8qcTGFDtavnnl+r6xDUY7A6GvXEsS oCEEynby72byGeSovfq/4AWGNPBG1L61Exl+gbqfvbECP3ziXnob009+z9I4qXodHSYINfAk ZkA523JGap12LndJeLk3gfWNZfXEWyGnuciRGbqESkhIRav8ootsCIops/SqXm0/k+Kcl4gG UO/iD/T5oagaDh0QtOd8RWSMwLxwn8uIhpH84Q4X1LadJ5NCgGa6xPP5qqRuiC+9gZqbq4Nj Message-ID: Date: Tue, 28 Jul 2020 08:51:18 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K1:CbSx8sY/066uUmX5dLEIC30EjY41TP/Tn+Z40ppbFTLCSJzgWEU rMGCW8gjWNKXdiJxKsPEPB4xSptR1DFwd0cYj1Y6IDR4T0Yz0YXRr2XJA+I+fPke+UBxnCH bKeLq9POzJgFiD85V32R7X5LChsTUUgIjbvMj8Ve7rFHrvwAElH1fauIiuV5vXzapgjtL6Z XU1caQ/QAa1bI5WqmFdrQ== X-UI-Out-Filterresults: notjunk:1;V03:K0:AlLShXShX24=:qgz4serpt+V+hrE8oe98Gb vNd+9TwYjtvDzxrTqrSov/qIFSbk4kU1ORNf63zpwbp20iZ4INj40QJMdKE0Cb9p7sdCWW51O 1KpCEEltCnBKPljxVqBvSdiBdJAursxzdRrsamhl14fKuf9S7OxjZCKd/BHhJ08lH/QwqRnYy 5z7yK8SKoSkxEyM762pTDv/vkEYBkEdZGNQSTw0OionnJletchY1TZOnYNxBLPVvucv/1nSzm hdPyj+W43gBppLBLHzgYS6IpKIi6HoLRdyN70JrsBYOpo1TtWCmoAiCrlQuGxdJAdNzsbQwiD 51ZtSn/IeZfnz30a477eCLWFdKpNXwiNlhuakfZHkB5WEcbS1IW/4hv62cU1dfBTiOOn81tXb jZhEQ4OQvy7eBzvocIV6YbJGbwhJ/oMjpXxT1vipgYSm3urqXxX2gT4JJOucKDUPkBIcAsziF FCnJnvAj+23Mw2jxkf8b+VgTG6gWJJcGqb565YysI4FD66yYudCkmxzB6vTYzghMcZjoXrZcE TCNdBGGhurmbleau80CR8xGzZ8G96Osl9/RE0SFP0JuHLWvazGNaIYPntzmA9n8GbfaXM30Kg ex+HuPh8HByJDoKKxq0DjaTw1N/9rYPiUK0yWSPZ6T+IjeF2UBZ8w5ifvqJIEuecJs3F2GRmS XATab1Z3eNwl0GcwKnFqiE9/lRE0n9FdRQITEB/tFMOFOcYF2tgBOTcF4HG7z54Gun6tEpvbm hXAbv9TBHixcVHT4lZ7N2/aRvZPikCi8He4jbNNWMCCQGBiE86EM8mpT1c4tsdud04jk/bG3m 6iLDUFn2IEomRARbAy4fLRElXqxDDRNa1ubXJzu90CLXRVqlfoX6n2GQK89cKjTTWGJRBfLrK ap3jUbQRmtsQbZiLbUpHocIDu5tUGooZ4cEfSN3FIZee83MnQQd/TzDUM9xAiC2ot+4sQsoti 3X+m79vhJNcm8/p9zHodZ3+qDzeKZ6LVkb+82K6BLGtYA1pwsWVnjnRHtgwCv3urzTJsPAMMk WjP668tydbxR+DSVwo1FkYeajcWARm3KimfuM9Nbk6D5G44EJt+JsR5yWcSZETx3fPdQTIaIO M1Vx86fMjGuB7OAH1977n3HuMJHb/KDBtbDVaZJHDpyWFnEPa1dWgSXeoC8mLI1oaP9sLBulf mp21EZiGgO9xNhbZTgO6pcEHDH5hGEMTOJ/EwDn9iuHcvjJSkNbBwKfW/+vb7WSMG5DbOI55n xBRw4tNLEZ99fe8oF+0hF5fVF9itZmYE/zvX8vw== Sender: linux-btrfs-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org On 2020/7/28 =E4=B8=8A=E5=8D=881:17, Hans van Kranenburg wrote: > Hi! > > On 7/27/20 1:09 PM, Qu Wenruo wrote: >> >> >> On 2020/7/25 =E4=B8=8B=E5=8D=8810:24, Hans van Kranenburg wrote: >>> Hi, >>> >>> I have a filesystem here that I'm filling up with data from elsewhere. >>> Most of it is done by rsync, and part by send/receive. So, receiving >>> data over the network, and then writing the files to disk. There can b= e >>> a dozen of these processes running in parallel. >>> >>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with >>> varying X) often is using nearly 100% cpu, while enormously slowing do= wn >>> disk writes. This shows as disk IO wait for the rsync and btrfs receiv= e >>> processes. >> >> The name is in fact pretty strange. >> It doesn't follow the btrfs workqueue names. >> >> There are two type of kernel threads used by btrfs: >> [...] >> >> As you can see, there is no one named like "flush-btrfs". > > Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here, > currently. ;] > >> Thus I guess it's from other part of the stack. >> >> Also, the calltrace also shows that, that kernel thread is only doing >> page writeback, which calls back to the page write hooks of btrfs. >> >> So I guess it may not be btrfs, but something else trying to do all the >> writeback. > > Yes, so, from the writeback path it re-enters btrfs code, right? Yep. Btrfs is still responsible for the problem. > >> But still, the CPU usage is still a problem, it shouldn't cost so much >> CPU time just writing back pages from btrfs. > > It's find_free_extent which is the culprit. > >>> [...] >>> >>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. >>> >>> I just did a 'perf record -g -a sleep 60' while disk writes were down = to >>> under 1MiB (!) per second and then 'perf report'. Attached some 'scree= n >>> shot' of it. Also attached an example of what nmon shows to give an id= ea >>> about the situation. >>> >>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately >>> see network and disk write speed ramping up, easily over 200 MiB/s, >>> until it soon plummets again. >>> >>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (whic= h >>> is booted now). >>> >>> So, what I'm looking for is: >>> * Does anyone else see this happen when doing a lot of concurrent writ= es? >>> * What does this flush thing do? >>> * Why is it using 100% cpu all the time? >>> * How can I debug this more? >> >> bcc based runtime probes I guess? >> >> Since it's almost a dead CPU burner loop, regular sleep based lockup >> detector won't help much. > > Here's a flame graph of 180 seconds, taken from the kernel thread pid: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-f= lush-btrfs.svg That's really awesome! > >> You can try trace events first to see which trace event get executed th= e >> most frequently, then try to add probe points to pin down the real caus= e. > > From the default collection, I already got the following, a few days > ago, by enabling find_free_extent and btrfs_cow_block: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_exte= nt.txt > > From the timestamps you can see how long this takes. It's not much that > gets done per second. > > The spin lock part must be spin_lock(&space_info->lock) because that's > the only one in find_free_extent. > > So, what I think is that, like I mentioned on saturday already, > find_free_extent is probably doing things in a really inefficient way, > scanning around for a small single free space gap all the time in a > really expensive way, and doing that over again for each tiny metadata > block or file that needs to be placed somewhere (also notice the > empty_size=3D0), instead of just throwing all of it on disk after each > other, when it's otherwise slowing down everyone. I think you're mostly right, find_free_extent() is not that efficient. But that only happens when there are a lot of parallel requests for it. One example is btrfs_get_block_group(). That function is only calling one function, refcount_inc(), which normally shouldn't be that slow, unless there are tons of other callers here. > > And then we have the spin lock part and btrfs_get_block_group, which are > also a significant part of the time spent. All together is the > continuous 100% cpu. So is the spinlock part. That spinlock itself should be pretty fast (that's why we use spinlock), but when it's not, it's mostly caused by high concurrency. > > What I *can* try is switch to the ssd_spread option, to force it to do > much more yolo allocation, but I'm afraid this will result in a sudden > blast of metadata block groups getting allocated. Or, maybe try it for a > minute or so and compare the trace pipe output... Another thing can't be explained is, that slow down you're experiencing is from data write back. It should never conflicts with metadata space reservation, as data and metadata are separated by space_info first, thus they shouldn't affect each other. Either I'm missing something, or there are other problems. > >> But personally speaking, it's better to shrink the workload, to find a >> minimal workload to reproduce the 100% CPU burn, so that you need less >> probes/time to pindown the problem. > > I think I can reproduce it with a single btrfs receive operation, as > long as it has a large amount of small files in it. IMHO, if the culprit is really find_free_extent() itself, then single receive won't reach the concurrency we need to expose the slowdown. Since receive is just doing serial writes, that shouldn't cause enough race for find_free_extents(). But that's worthy trying to see if it proofs my assumption of concurrency. Thanks, Qu > > Currently I'm almost only copying data onto the filesystem, only data > removals are some targeted dedupe script for known big files that are > identical, but which could not easily be handled by rsync. > > So, when find_free_extent wants to write extent tree blocks, it might be > playing outside in the woods all the time, searching around for little > holes that were caused by previous cow operations. > > Large files are not a problem, like postgresql database dumps. Copying > them over just runs >200-300M/s without problem, where the read speed of > the old storage I'm moving things off is the bottleneck, actually. > >>> * Ultimately, of course... how can we improve this? >>> >>> [...] > > I will see if I can experiment a bit with putting more trace points in. > > I remember reading around in the extent allocator code, until I got > dizzy from it. Deja vu... > > https://www.spinics.net/lists/linux-btrfs/msg70624.html > > Thanks > Hans >