Subject: Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Hans van Kranenburg, linux-btrfs@vger.kernel.org
Date: Tue, 28 Jul 2020 09:52:39 +0800

On 2020/7/28 上午8:51, Qu Wenruo wrote:
>
>
> On 2020/7/28 上午1:17, Hans van Kranenburg wrote:
>> Hi!
>>
>> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/25 下午10:24, Hans van Kranenburg wrote:
>>>> Hi,
>>>>
>>>> I have a filesystem here that I'm filling up with data from
>>>> elsewhere. Most of it is done by rsync, and part by send/receive.
>>>> So, receiving data over the network, and then writing the files to
>>>> disk. There can be a dozen of these processes running in parallel.
>>>>
>>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>>> varying X) is often using nearly 100% cpu, while enormously slowing
>>>> down disk writes. This shows up as disk IO wait for the rsync and
>>>> btrfs receive processes.
>>>
>>> The name is in fact pretty strange.
>>> It doesn't follow the btrfs workqueue names.
>>>
>>> There are two types of kernel threads used by btrfs:
>>> [...]
>>>
>>> As you can see, none of them is named like "flush-btrfs".
>>
>> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2
>> here, currently. ;]
>>
>>> Thus I guess it's from another part of the stack.
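
One quick way to confirm what such a kworker is actually doing is to
sample its kernel stack from /proc/<pid>/stack. A rough sketch of that,
not from the thread itself (the pid below is a placeholder; it needs
root and CONFIG_STACKTRACE):

#!/usr/bin/env python3
# Sample the kernel stack of a busy kworker such as
# kworker/u16:7+flush-btrfs-2 via /proc/<pid>/stack and print the most
# common stacks seen. Placeholder pid; run as root.
import collections
import sys
import time

pid = sys.argv[1] if len(sys.argv) > 1 else "1234"   # kworker pid (placeholder)
seen = collections.Counter()

for _ in range(100):                                  # ~10 seconds of samples
    try:
        with open(f"/proc/{pid}/stack") as f:
            # Each line looks like "[<0>] wb_workfn+0x274/0x580";
            # keep just the function name.
            frames = tuple(line.split()[-1].split("+")[0]
                           for line in f if line.strip())
    except (FileNotFoundError, ProcessLookupError):
        break                                         # thread went away
    seen[frames] += 1
    time.sleep(0.1)

for frames, count in seen.most_common(5):
    print(f"--- seen {count} times ---")
    print("\n".join(frames))

For a thread stuck in writeback you would expect frames along the lines
of wb_workfn / do_writepages / the btrfs extent writepage hooks near the
top of the most common stacks.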
>>>
>>> Also, the calltrace shows that that kernel thread is only doing
>>> page writeback, which calls back into the page write hooks of btrfs.
>>>
>>> So I guess it may not be btrfs, but something else trying to do all
>>> the writeback.
>>
>> Yes, so, from the writeback path it re-enters btrfs code, right?
>
> Yep. Btrfs is still responsible for the problem.
>
>>
>>> But still, the CPU usage is a problem; it shouldn't cost so much CPU
>>> time just to write back pages from btrfs.
>>
>> It's find_free_extent which is the culprit.
>>
>>>> [...]
>>>>
>>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck
>>>> here.
>>>>
>>>> I just did a 'perf record -g -a sleep 60' while disk writes were
>>>> down to under 1MiB (!) per second and then 'perf report'. Attached
>>>> some 'screen shot' of it. Also attached an example of what nmon
>>>> shows to give an idea about the situation.
>>>>
>>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I
>>>> immediately see network and disk write speed ramping up, easily
>>>> over 200 MiB/s, until it soon plummets again.
>>>>
>>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6
>>>> (which is booted now).
>>>>
>>>> So, what I'm looking for is:
>>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>>> * What does this flush thing do?
>>>> * Why is it using 100% cpu all the time?
>>>> * How can I debug this more?
>>>
>>> bcc based runtime probes, I guess?
>>>
>>> Since it's almost a dead CPU-burner loop, a regular sleep-based
>>> lockup detector won't help much.
>>
>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>
> That's really awesome!
>
>>
>>> You can try trace events first to see which trace event gets
>>> executed the most frequently, then add probe points to pin down the
>>> real cause.
>>
>> From the default collection, I already got the following, a few days
>> ago, by enabling find_free_extent and btrfs_cow_block:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

This output is in fact pretty confusing, and may give you a false view
of the callers of find_free_extent().

It always shows "EXTENT_TREE" as the owner, but that's due to a bad
decision in the trace event.

I have submitted a patch addressing it, and added you to the CC.

Would you mind re-capturing the events with that patch, so that we can
get a clearer idea of which tree is seeing the most concurrency?

Thanks,
Qu

>>
>> From the timestamps you can see how long this takes. It's not much
>> that gets done per second.
>>
>> The spin lock part must be spin_lock(&space_info->lock), because
>> that's the only one in find_free_extent.
>>
>> So, what I think is that, like I mentioned on Saturday already,
>> find_free_extent is probably doing things in a really inefficient
>> way, scanning around for a small single free space gap all the time
>> in a really expensive way, and doing that over again for each tiny
>> metadata block or file that needs to be placed somewhere (also notice
>> the empty_size=0), instead of just throwing all of it on disk after
>> each other, when it's otherwise slowing down everyone.
>
> I think you're mostly right, find_free_extent() is not that efficient.
>
> But that only happens when there are a lot of parallel requests for it.
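
A minimal bcc sketch for measuring that directly, i.e. how often
find_free_extent() runs and how long each search takes per thread, could
look like the following. This is illustrative, not from the thread, and
assumes bcc is installed and the symbol is visible to kprobes on this
kernel build:

#!/usr/bin/env python3
# Count calls to find_free_extent() per thread and histogram how long
# each search takes. Illustrative sketch; run as root.
from time import sleep
from bcc import BPF

prog = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32, u64);    // tid -> entry timestamp
BPF_HASH(calls, u32, u64);    // tid -> number of calls
BPF_HISTOGRAM(lat_us);        // log2 histogram of search time, in usecs

int on_entry(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&tid, &ts);
    calls.increment(tid);
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&tid);
    if (tsp) {
        lat_us.increment(bpf_log2l((bpf_ktime_get_ns() - *tsp) / 1000));
        start.delete(&tid);
    }
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="find_free_extent", fn_name="on_entry")
b.attach_kretprobe(event="find_free_extent", fn_name="on_return")

print("Tracing find_free_extent()... hit Ctrl-C to stop.")
try:
    sleep(600)
except KeyboardInterrupt:
    pass

b["lat_us"].print_log2_hist("usecs")
top = sorted(b["calls"].items(), key=lambda kv: kv[1].value, reverse=True)
for tid, count in top[:10]:
    print(f"tid {tid.value}: {count.value} calls")

If many different tids show up at once and the per-call latency climbs
along with them, that would support the concurrency explanation.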
> One example is btrfs_get_block_group().
> That function is only calling one function, refcount_inc(), which
> normally shouldn't be that slow, unless there are tons of other
> callers here.
>
>>
>> And then we have the spin lock part and btrfs_get_block_group, which
>> are also a significant part of the time spent. All together it adds
>> up to the continuous 100% cpu.
>
> So is the spinlock part. The spinlock itself should be pretty fast
> (that's why we use a spinlock), but when it's not, it's mostly caused
> by high concurrency.
>
>>
>> What I *can* try is switching to the ssd_spread option, to force it
>> to do much more yolo allocation, but I'm afraid this will result in a
>> sudden blast of metadata block groups getting allocated. Or, maybe
>> try it for a minute or so and compare the trace pipe output...
>
> Another thing that can't be explained is that the slowdown you're
> experiencing is from data writeback.
> It should never conflict with metadata space reservation, as data and
> metadata are separated by space_info first, thus they shouldn't affect
> each other.
> Either I'm missing something, or there are other problems.
>
>>
>>> But personally speaking, it's better to shrink the workload, to find
>>> a minimal workload that reproduces the 100% CPU burn, so that you
>>> need fewer probes and less time to pin down the problem.
>>
>> I think I can reproduce it with a single btrfs receive operation, as
>> long as it has a large amount of small files in it.
>
> IMHO, if the culprit is really find_free_extent() itself, then a
> single receive won't reach the concurrency we need to expose the
> slowdown.
>
> Since receive is just doing serial writes, that shouldn't cause enough
> contention in find_free_extent().
>
> But it's worth trying, to see whether it proves my assumption about
> concurrency.
>
> Thanks,
> Qu
>
>>
>> Currently I'm almost only copying data onto the filesystem; the only
>> data removals are from a targeted dedupe script for known big files
>> that are identical, but which could not easily be handled by rsync.
>>
>> So, when find_free_extent wants to write extent tree blocks, it might
>> be playing outside in the woods all the time, searching around for
>> little holes that were caused by previous cow operations.
>>
>> Large files are not a problem, like postgresql database dumps.
>> Copying them over just runs at >200-300M/s without problem, where the
>> read speed of the old storage I'm moving things off is the
>> bottleneck, actually.
>>
>>>> * Ultimately, of course... how can we improve this?
>>>>
>>>> [...]
>>
>> I will see if I can experiment a bit with putting more trace points
>> in.
>>
>> I remember reading around in the extent allocator code, until I got
>> dizzy from it. Deja vu...
>>
>> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>>
>> Thanks
>> Hans
>>
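
For the trace pipe comparison mentioned above (for example before and
after switching to ssd_spread, or after re-capturing with the
trace-event patch), a small sketch that enables the find_free_extent
and btrfs_cow_block trace events and counts events per second. This is
illustrative, not from the thread; it assumes tracefs is mounted at the
usual /sys/kernel/debug/tracing and that you run it as root:

#!/usr/bin/env python3
# Enable the btrfs find_free_extent / btrfs_cow_block trace events and
# print how many events fire per second from trace_pipe.
import time

TRACING = "/sys/kernel/debug/tracing"
EVENTS = ("btrfs/find_free_extent", "btrfs/btrfs_cow_block")

def set_event(name, enabled):
    with open(f"{TRACING}/events/{name}/enable", "w") as f:
        f.write("1" if enabled else "0")

for ev in EVENTS:
    set_event(ev, True)

try:
    # trace_pipe is a blocking, consuming stream; count lines per second.
    count, t0 = 0, time.monotonic()
    with open(f"{TRACING}/trace_pipe") as pipe:
        for line in pipe:
            count += 1
            now = time.monotonic()
            if now - t0 >= 1.0:
                print(f"{count} events/s")
                count, t0 = 0, now
except KeyboardInterrupt:
    pass
finally:
    for ev in EVENTS:
        set_event(ev, False)

Comparing the events-per-second rate between runs should show fairly
directly whether a different allocation pattern takes pressure off
find_free_extent.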