Date: Thu, 27 Nov 2025 09:33:32 +1100
From: Dave Chinner <david@fromorbit.com>
To: Karim Manaouil
Cc: Carlos Maiolino, linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Too many xfs-conv kworker threads
References: <20251125194942.iphwjfx2a4bw6i7g@wrangler>
 <20251126132721.tagdhjs2mcbbkdjr@wrangler>
In-Reply-To: <20251126132721.tagdhjs2mcbbkdjr@wrangler>

On Wed, Nov 26, 2025 at 01:27:21PM +0000, Karim Manaouil wrote:
> Hi Dave,
>
> Thanks for looking at this.
>
> On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> > On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > > Hi folks,
> > >
> > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel
> > > 6.15 with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The
> > > setup can achieve 25GB/s and more than 2M IOPS. The CPU is a
> > > dual-socket 24-core AMD EPYC 9224.
> >
> > The mkfs.xfs (or xfs_info) output for the filesystem on this
> > device is?
>
> Here is xfs_info:
>
> meta-data=/dev/md127    isize=512    agcount=48, agsize=20346496 blks
>          =              sectsz=512   attr=2, projid32bit=1
>          =              crc=1        finobt=1, sparse=1, rmapbt=1

rmapbt is enabled. Important.

> This is the last 20-30s from iostat -dxm 5 during the test. It's been
> the same consistently throughout the test at ~80-89% utilization.
>
> Device       w/s    wMB/s  wrqm/s  %wrqm  w_await  wareq-sz  aqu-sz  %util
> md127   68713.80  1051.87    0.00   0.00     1.05     15.68   72.14  89.52
> md127   66888.40   943.12    0.00   0.00     0.92     14.44   61.68  88.08
> md127   68453.80   653.24    0.00   0.00     1.23      9.77   84.37  87.12
> md127   82154.80   604.90    0.00   0.00     1.64      7.54  134.87  86.88
> md127   70320.60   295.50    0.00   0.00     1.97      4.30  138.60  87.12
> md127   19574.60    84.99    0.00   0.00     2.27      4.45   44.48  24.96
                                                        ^^^^

And the average write IO size is between 4-16kB, and it's reaching
hundreds of IOs in flight at the block layer at once. So, yeah, the
stress test is definitely resulting in inefficient IO patterns, as
intended.
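As a quick sanity check of those numbers: the average write size is
just wMB/s divided by w/s, and it matches the wareq-sz column that
iostat reports in kB. Taking the first and fifth samples above:

    1051.87 MB/s * 1024 / 68713.80 w/s ~= 15.68 kB per write
     295.50 MB/s * 1024 / 70320.60 w/s ~=  4.30 kB per write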
As for the writeback IO rate, this is pretty typical for delayed
allocation - writeback is single threaded and can block. Best case
for delayed allocation is 100-120k allocations per second. Every IO
in your workload requires allocation, and it's running at about
70-80k allocations a second. So, yeah, that seems a bit low, but not
unexpectedly low.

> In addition, I got the kernel profile with perf record -a -g.
>
> Please find at the end of this email the output of (~500 lines of)
> perf report.
>
> I have also generated the flamegraph here to make life easy.
>
> https://limewire.com/d/b5lJ1#ZigjlrS9mg

The vast majority of IO completion work is updating the rmapbt in
xfs_rmap_convert(). There looks to be ~10x the CPU overhead in
updating the rmapbt (5%) vs the bmapbt (0.5%) during unwritten extent
conversion.

And I'd suggest that all the xfs-conv kworker threads are being
created because the rmapbt updates are contending on the AGF lock to
be able to perform the rmapbt update. i.e. unwritten extent conversion
bmbt updates are per-inode (no global resources needed), whilst the
rmapbt updates are per-AG. Every file that is in the same AG will
contend for the same AGF lock to do rmap updates. IO completion will
also contend with IO submission, because submission is doing
allocation and that requires holding the AGF locked too.

IOWs, the contention point here is AGF locking for the rmapbt updates
during IO submission and IO completion. If you turn off rmapbt it will
go somewhat faster, but it won't magically run at device speed because
writeback is single threaded.

I have some ideas on how to reduce contention on the AGF for
allocation and rmapbt updates, but they are just ideas at this point.
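If you want to quantify the rmapbt overhead, the simple experiment is
to remake the filesystem without the rmap btree and re-run the stress
test. Something like this should do it with any recent xfsprogs (the
mount point is just for illustration, and mkfs wipes the device, so
don't do this on a filesystem you care about):

    # WARNING: destroys all data on /dev/md127
    umount /mnt/scratch
    mkfs.xfs -f -m rmapbt=0 /dev/md127
    mount /dev/md127 /mnt/scratch

    # verify the new filesystem reports rmapbt=0
    xfs_info /mnt/scratch

I'd expect the xfs-conv thread trainsmash to back off noticeably, but
as I said, writeback is still single threaded, so it won't suddenly
run at device speed.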
> > > I am not sure if this has any effect on performance, but
> > > potentially, there is some scheduling overhead?!
> >
> > It probably does, but a trainsmash of stalled in-progress work like
> > this is typically a symptom of some other misbehaviour occurring.
> >
> > FWIW, for a workload intended to produce "inefficient write IO",
> > this sort of behaviour is definitely indicating something
> > "inefficient" is occurring during write IO. So, in the end, there
> > is a definite possibility that there may not actually be anything
> > that can be "fixed" here....
>
> You're right, but having 45k kworker threads still looks questionable
> to me even with the inefficiency in mind.

The explosion of kworker threads is a result of scheduler behaviour.
It moves the writeback thread around because it is unbound and
frequently blocks, whilst other kernel tasks that are bound to a
specific CPU (like xfs-conv processing) take scheduling priority.

It's not ideal behaviour in this particular corner case, but for a
stress test that is intended to create "inefficient IO patterns", this
is exactly the sort of behaviour it should be exercising. Remember,
this is an artificial stress test....

-Dave.
-- 
Dave Chinner
david@fromorbit.com