Date: Thu, 27 Nov 2025 09:33:32 +1100
From: Dave Chinner <david@fromorbit.com>
To: Karim Manaouil
Cc: Carlos Maiolino, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Too many xfs-conv kworker threads
References: <20251125194942.iphwjfx2a4bw6i7g@wrangler>
 <20251126132721.tagdhjs2mcbbkdjr@wrangler>
In-Reply-To: <20251126132721.tagdhjs2mcbbkdjr@wrangler>

On Wed, Nov 26, 2025 at 01:27:21PM +0000, Karim Manaouil wrote:
> Hi Dave,
>
> Thanks for looking at this.
>
> On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> > On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > > Hi folks,
> > >
> > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel
> > > 6.15 with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The
> > > setup can achieve 25GB/s and more than 2M IOPS. The CPU is a
> > > dual-socket 24-core AMD EPYC 9224.
> >
> > The mkfs.xfs (or xfs_info) output for the filesystem on this
> > device is?
>
> Here is xfs_info:
>
> meta-data=/dev/md127    isize=512    agcount=48, agsize=20346496 blks
>          =              sectsz=512   attr=2, projid32bit=1
>          =              crc=1        finobt=1, sparse=1, rmapbt=1

rmapbt is enabled. Important.

> This is the last 20-30s from iostat -dxm 5 during the test. It's been
> the same consistently throughout the test at ~80-89% utilization.
>
> Device       w/s    wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  aqu-sz  %util
> md127   68713.80  1051.87    0.00   0.00     1.05     15.68   72.14  89.52
> md127   66888.40   943.12    0.00   0.00     0.92     14.44   61.68  88.08
> md127   68453.80   653.24    0.00   0.00     1.23      9.77   84.37  87.12
> md127   82154.80   604.90    0.00   0.00     1.64      7.54  134.87  86.88
> md127   70320.60   295.50    0.00   0.00     1.97      4.30  138.60  87.12
> md127   19574.60    84.99    0.00   0.00     2.27      4.45   44.48  24.96
                                                        ^^^^

And the average write IO size is between 4-16kB, and it's reaching
hundreds of IOs in flight at the block layer at once. So, yeah, the
stress test is definitely resulting in inefficient IO patterns, as
intended.
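As a quick sanity check of those numbers: the average write size is
just wMB/s divided by w/s, and it matches the wareq-sz column that
iostat reports in kB. Taking the first and fifth samples above:

    1051.87 MB/s * 1024 / 68713.80 w/s ~= 15.68 kB per write
     295.50 MB/s * 1024 / 70320.60 w/s ~=  4.30 kB per write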
As for the writeback IO rate, this is pretty typical for delayed
allocation - writeback is single threaded and can block. Best case
for delayed allocation is 100-120k allocations per second. Every IO
in your workload requires allocation, and it's running at about
70-80k allocations a second. So, yeah, that seems a bit low, but not
unexpectedly low.

> In addition, I got the kernel profile with perf record -a -g.
>
> Please find at the end of this email the output of (~500 lines of)
> perf report.
>
> I have also generated the flamegraph here to make life easy.
>
> https://limewire.com/d/b5lJ1#ZigjlrS9mg

The vast majority of IO completion work is updating the rmapbt in
xfs_rmap_convert(). There looks to be ~10x the CPU overhead in
updating the rmapbt (5%) vs the bmapbt (0.5%) during unwritten extent
conversion.

And I'd suggest that all the xfs-conv kworker threads are being
created because the rmapbt updates are contending on the AGF lock to
be able to perform the rmapbt update. i.e. unwritten extent conversion
bmbt updates are per-inode (no global resources needed), whilst the
rmapbt updates are per-AG. Every file that is in the same AG will
contend for the same AGF lock to do rmap updates. IO completion will
also contend with IO submission, because submission is doing
allocation and that requires holding the AGF locked too.

IOWs, the contention point here is AGF locking for the rmapbt updates
during IO submission and IO completion. If you turn off rmapbt it will
go somewhat faster, but it won't magically run at device speed because
writeback is single threaded.

I have some ideas on how to reduce contention on the AGF for
allocation and rmapbt updates, but they are just ideas at this point.
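If you want to quantify the rmapbt overhead, the simple experiment is
to remake the filesystem without the rmap btree and re-run the stress
test. Something like this should do it with any recent xfsprogs (the
mount point is just for illustration, and mkfs wipes the device, so
don't do this on a filesystem you care about):

    # WARNING: destroys all data on /dev/md127
    umount /mnt/scratch
    mkfs.xfs -f -m rmapbt=0 /dev/md127
    mount /dev/md127 /mnt/scratch

    # verify the new filesystem reports rmapbt=0
    xfs_info /mnt/scratch

I'd expect the xfs-conv thread trainsmash to back off noticeably, but
as I said, writeback is still single threaded, so it won't suddenly
run at device speed.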
> > > I am not sure if this has any effect on performance, but
> > > potentially, there is some scheduling overhead?!
> >
> > It probably does, but a trainsmash of stalled in-progress work like
> > this is typically a symptom of some other misbehaviour occurring.
> >
> > FWIW, for a workload intended to produce "inefficient write IO",
> > this sort of behaviour is definitely indicating something
> > "inefficient" is occurring during write IO. So, in the end, there
> > is a definite possibility that there may not actually be anything
> > that can be "fixed" here....
>
> You're right, but having 45k kworker threads still looks questionable
> to me even with the inefficiency in mind.

The explosion of kworker threads is a result of scheduler behaviour.
It moves the writeback thread around because it is unbound and
frequently blocks, whilst other kernel tasks that are bound to a
specific CPU (like xfs-conv processing) take scheduling priority.

It's not ideal behaviour in this particular corner case, but for a
stress test that is intended to create "inefficient IO patterns", this
is exactly the sort of behaviour it should be exercising. Remember,
this is an artificial stress test....

-Dave.
-- 
Dave Chinner
david@fromorbit.com