Date: Thu, 27 Nov 2025 09:33:32 +1100
From: Dave Chinner
To: Karim Manaouil
Cc: Carlos Maiolino, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Too many xfs-conv kworker threads
References: <20251125194942.iphwjfx2a4bw6i7g@wrangler> <20251126132721.tagdhjs2mcbbkdjr@wrangler>
In-Reply-To: <20251126132721.tagdhjs2mcbbkdjr@wrangler>

On Wed, Nov 26, 2025 at 01:27:21PM +0000, Karim Manaouil wrote:
> Hi Dave,
>
> Thanks for looking at this.
>
> On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> > On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > > Hi folks,
> > >
> > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> > > with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> > > achieve 25GB/s and more than 2M IOPS. The CPU is a dual socket 24-cores
> > > AMD EPYC 9224.
> >
> > The mkfs.xfs (or xfs_info) output for the filesystem on this
> > device is?
>
> Here is xfs_info
>
> meta-data=/dev/md127             isize=512    agcount=48, agsize=20346496 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1

rmapbt is enabled. Important.

> This is the last 20/30s from iostat -dxm 5 during the test. It's been the
> same consistently throughout the test at ~80/89% utilization.
>
> Device         w/s    wMB/s  wrqm/s  %wrqm w_await wareq-sz  aqu-sz  %util
> md127     68713.80  1051.87    0.00   0.00    1.05    15.68   72.14  89.52
> md127     66888.40   943.12    0.00   0.00    0.92    14.44   61.68  88.08
> md127     68453.80   653.24    0.00   0.00    1.23     9.77   84.37  87.12
> md127     82154.80   604.90    0.00   0.00    1.64     7.54  134.87  86.88
> md127     70320.60   295.50    0.00   0.00    1.97     4.30  138.60  87.12
> md127     19574.60    84.99    0.00   0.00    2.27     4.45   44.48  24.96
                                                         ^^^^
And the average write IO size is between 4-16kB, and it's reaching
well over a hundred IOs in flight at the block layer at once. So,
yeah, the stress test is definitely resulting in inefficient IO
patterns as intended.

As for the writeback IO rate, this is pretty typical for delayed
allocation - writeback is single threaded and can block. Best case
for delayed allocation is 100-120k allocations per second. Every IO
in your workload requires allocation, and it's running at about
70-80k allocations a second. So, yeah, that seems a bit low, but not
unexpectedly low.

> In addition, I got the kernel profile with perf record -a -g.
>
> Please find at the end of this email the output of (~500 lines of) perf report.
>
> I have also generated the flamegraph here to make life easy.
>
> https://limewire.com/d/b5lJ1#ZigjlrS9mg

The vast majority of IO completion work is updating the rmapbt in
xfs_rmap_convert(). There looks to be ~10x the CPU overhead in
updating the rmapbt (5%) vs the bmapbt (0.5%) during unwritten
extent conversion.
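
The IO size and queue depth figures read off the iostat output above can be cross-checked with a little arithmetic. The following is only a minimal sketch: the function names are made up for illustration, the rows are copied from the quoted iostat output, and it assumes wMB/s is reported with 1024 kB per MB (which is what makes the ratio agree with the wareq-sz column here). The in-flight estimate is Little's law, so it only approximates aqu-sz because w_await is rounded.

```python
# Rows copied from the quoted iostat output: (w/s, wMB/s, w_await in ms).
rows = [
    (68713.80, 1051.87, 1.05),
    (66888.40, 943.12, 0.92),
    (68453.80, 653.24, 1.23),
    (82154.80, 604.90, 1.64),
    (70320.60, 295.50, 1.97),
    (19574.60, 84.99, 2.27),
]


def avg_write_kb(w_per_s, w_mb_per_s):
    """Average write request size in kB: throughput divided by IO rate."""
    # Assumption: 1 MB == 1024 kB in iostat's -m output.
    return w_mb_per_s * 1024.0 / w_per_s


def in_flight(w_per_s, w_await_ms):
    """Little's law estimate: IOs in flight = IO rate * time per IO."""
    return w_per_s * w_await_ms / 1000.0


for w, mb, await_ms in rows:
    print(f"{avg_write_kb(w, mb):6.2f} kB/IO  {in_flight(w, await_ms):7.2f} in flight")
```

Since every write in this workload needs an allocation, the w/s column is also a reasonable stand-in for allocations per second.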

And I'd suggest that all the xfs-conv kworker threads are being
created because the rmapbt updates are contending on the AGF lock
to be able to perform the rmapbt update. i.e. unwritten extent
conversion bmbt updates are per-inode (no global resources needed),
whilst the rmapbt updates are per-AG. Every file that is in the same
AG will contend for the same AGF lock to do rmap updates. It will
also contend with IO submission, because submission is doing
allocation and that requires holding the AGF locked.

IOWs, the contention point here is AGF locking for the rmapbt
updates during IO submission and IO completion. If you turn off
rmapbt it will go somewhat faster, but it won't magically run at
device speed because writeback is single threaded.

I have some ideas on how to reduce contention on the AGF for
allocation and rmapbt updates, but they are just ideas at this
point.

> > > I am not sure if this has any effect on performance, but potentially,
> > > there is some scheduling overhead?!
> >
> > It probably does, but a trainsmash of stalled in-progress work like
> > this is typically a symptom of some other misbehaviour occurring.
> >
> > FWIW, for a workload intended to produce "inefficient write IO",
> > this sort of behaviour is definitely indicating something
> > "inefficient" is occurring during write IO. So, in the end, there is
> > a definite possibility that there may not actually be anything that
> > can be "fixed" here....
>
> You're right, but having 45k kworker threads still looks questionable to me
> even with the inefficiency in mind.

The explosion of kworker threads is a result of scheduler behaviour.
It moves the writeback thread around because it is unbound and
frequently blocks, whilst other kernel tasks that are bound to a
specific CPU (like xfs-conv processing) take scheduling priority.
It's not ideal behaviour in this particular corner case, but for a
stress test that is intended to create "inefficient IO patterns",
this is exactly the sort of behaviour it should be exercising.
Remember, this is an artificial stress test....

-Dave.
-- 
Dave Chinner
david@fromorbit.com