From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f178.google.com (mail-pl1-f178.google.com [209.85.214.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5629F23EAAB for ; Tue, 25 Nov 2025 22:32:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.178 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764109925; cv=none; b=U+dj6TmHpIL1EXaL7vt/goX9mjxh1SUyiW4cBshkdGOeIpHWn+SRF35ZNhPuzfroZWoklo3o7CuEVLMT7IkgR/p3utdmueuqTh9nN85nP8kd4hHJiFRm7cp7TCyg0Z4iUsHBNxE8hD8dik2RonCLriGL5ZIbOSCtam5uBa7LOAY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1764109925; c=relaxed/simple; bh=Rc2fSKj/afJDksfHpvKCQkO/06r9s7DxoTkLz3gz4zI=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=jLNR3vLifYcKtws0cpQPWV5GqRbuvH/agQgWJn5WyKN9EnNpTCzZlHuM0F+pZd5cCsncuOUX2x/QsmDRrfJGc4BsemcTdLwc+bVYobYsLLUis3IO7m/4YHQh17CQ5DOpDYhNC6XCbtAJZ2n16NuUmJPUPv1NaWGp6kmVDDUdpDw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=LZIBJVQT; arc=none smtp.client-ip=209.85.214.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="LZIBJVQT" Received: by mail-pl1-f178.google.com with SMTP id d9443c01a7336-298287a26c3so70240085ad.0 for ; Tue, 25 Nov 2025 14:32:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1764109923; x=1764714723; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=0Y9+kPYFv8ucX2yEpYHlHRyhgJvjtCKMKbzQomdQ8o4=; b=LZIBJVQT6ihfxWYhoQe+nzWZvZG8rDLZrf2bfc9/V/echK3hXi5Ivmnc1heFVQy/jc 0BFZ9lbNj6c1bQZS9aZSfx+kb6jd8Y3ufm1SYrPRjsxQ9+vL7w3ZUuwmf8YgcYminAb8 7K/mspnu75c8McX/I5tjX9OJTpyUPjETEkTOcIdh4NCdGgmEFqoZGEWAn2816/r2ZQi7 9NWcXU0HWMSVVpnjk4/EzBRpsbywjtnVHuG82Mbku6aojTdO4z+6qE18gAjoKzv1yQRF lvFUNtgUOzu9xzj/thCTnu8hoKaEKWikuMO3Gox7w8lQGiD8788iZRlxCW5/KIj5m5EW 7nkQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764109923; x=1764714723; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-gg:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0Y9+kPYFv8ucX2yEpYHlHRyhgJvjtCKMKbzQomdQ8o4=; b=r7zqIkrh5urtq/lFLThQKwFpB7dRpLKewakx7wa9ZScIK3+fFuj/TW+bjuJ2VuHw2B Kh1LeGDjxBmERZTxHNmtwGIJqkvbDOb7l/mvNsKZJ1h6xt4ZLWitf9Pxd9KXuqljgj/z WWfK8pxHEeuM09As95MYOvRpXne+8ZH7KPmLlyQqdygZDStQCRlJI/qEgig4Oe4AfI1A yvO77/nN/0ry28z0hqrHiJheFMWVyq+KstcMX0YjsX7S6KCaQiPgSOZLnNUk7Q/5H7nD 8KjllywLUA6cQZUhKo83giQpwEU+k91iEDR61PLGRmzTCjNDAtES5s0p5pAXgOHPf8Qo COHg== X-Forwarded-Encrypted: i=1; AJvYcCUwXP0n8Upzev1C+PZsjItyiDiuj3wIBe1lBdpzehdoYNRtv/XXtUSltRF1xycWTF+JSUEH3hONLV4=@vger.kernel.org X-Gm-Message-State: AOJu0YwuJFIwCcp55HdisfOA06uMqGxeSt4HhxL6sPo/wbLDzvLo/FzA w7Dj2fu83dOzHu99MvXMDib3+PbFLHMaiKhmXWTbdAhf2sLuzhi8zKVy1iRn5PCJlVc= X-Gm-Gg: ASbGnctVZinx6UZYuNCLwYV13K+7/zYyGcVbLSC/jOJOksc3idcLOkwYBZqfZn44lj2 kqKWXKJxAiPppcQYFRxra3lRDDP/QDFk71dqgJ2rHQK2hENa2KqKgKnGAmnhdDSRNC4dRqCUM7p xfROFraIee9nHif2UCTG3po195lQLgHxXgRu/guk+rjXGVdMv9Ooi9OvEihQt81BK6Wbu1nDCwe j8yufdes0dwr0nXCKsSd6btJ3ZW6dRpjNDcYNYrWqXaFbQswPbVrGOkVx1nVxMZ/fG1vBnSOSa2 jeElEgkzYbSRWXLq0CGPHMvvhkPL+MB3zlD8s07378j6r+fSUIRt+BME2eaPu43mnVLYsvEhtm/ AK8GRCQnwqIBm7qtt+jzYEzETn+lBkI/0yxUaG1ZsBnjsh4I070INW6C/9CxPzdPcbqn8+I3bed hk24v2l0siQJCgS92vQskBtcSfXqg6cVJaurhM5g1u8kOc0Q8qV04vp6ZUs25row== X-Google-Smtp-Source: AGHT+IHVJBaArcUiHcfGBDNgJ5sM7FipaT41dEFAlDXLKA6CypV693FqSFJaQcUA8/7FA7L54SL/UA== X-Received: by 2002:a17:903:3845:b0:298:6a79:397b with SMTP id d9443c01a7336-29b6bf80914mr210728855ad.56.1764109922381; Tue, 25 Nov 2025 14:32:02 -0800 (PST) Received: from dread.disaster.area (pa49-181-58-136.pa.nsw.optusnet.com.au. [49.181.58.136]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-29b5b00e1fasm175594855ad.0.2025.11.25.14.32.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 25 Nov 2025 14:32:01 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.98.2) (envelope-from ) id 1vO1Zn-0000000FjIW-1Ls5; Wed, 26 Nov 2025 09:31:59 +1100 Date: Wed, 26 Nov 2025 09:31:59 +1100 From: Dave Chinner To: Karim Manaouil Cc: Carlos Maiolino , linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: Too many xfs-conv kworker threads Message-ID: References: <20251125194942.iphwjfx2a4bw6i7g@wrangler> Precedence: bulk X-Mailing-List: linux-xfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20251125194942.iphwjfx2a4bw6i7g@wrangler> On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote: > Hi folks, > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15 > with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can > achieve 25GB/s and more than 2M IOPS. The CPU is a dual socket 24-cores > AMD EPYC 9224. The mkfs.xfs (or xfs_info) output for the filesystem is on this device is? > I am running thpchallenge-fio from mmtests (its purpose is described > here [1]). It's a fio job that inefficiently writes a large number of 64K > files. On a system with 128GiB of RAM, it could create up to 100K files. > A typical fio config looks like this: > > [global] > direct=0 > ioengine=sync > blocksize=4096 > invalidate=0 > fallocate=none > create_on_open=1 > > [writer] > nrfiles=785988 That's ~800k files, not 100k. > filesize=65536 > readwrite=write > numjobs=4 > filename_format=$jobnum/workfile.$filenum > > I noticed that, at some point, top reports around 42650 sleeping tasks, > example: > > Tasks: 42651 total, 1 running, 42650 sleeping, 0 stopped, 0 zombie > > This is a test machine from a fresh boot running vanilla Debian. > > After checking, it turned out, it was a massive list of xfs-conv > kworkers. Something like this (truncated): > > 58214 ? I 0:00 [kworker/47:203-xfs-conv/md127] > 58215 ? I 0:00 [kworker/47:204-xfs-conv/md127] > 58216 ? I 0:00 [kworker/47:205-xfs-conv/md127] > 58217 ? I 0:00 [kworker/47:206-xfs-conv/md127] > 58219 ? I 0:00 [kworker/12:539-xfs-conv/md127] > 58220 ? I 0:00 [kworker/12:540-xfs-conv/md127] > 58221 ? I 0:00 [kworker/12:541-xfs-conv/md127] > 58222 ? I 0:00 [kworker/12:542-xfs-conv/md127] > 58223 ? I 0:00 [kworker/12:543-xfs-conv/md127] > 58224 ? I 0:00 [kworker/12:544-xfs-conv/md127] > 58225 ? I 0:00 [kworker/12:545-xfs-conv/md127] > 58227 ? I 0:00 [kworker/38:155-xfs-conv/md127] > 58228 ? I 0:00 [kworker/38:156-xfs-conv/md127] > 58230 ? I 0:00 [kworker/38:158-xfs-conv/md127] > 58233 ? I 0:00 [kworker/38:161-xfs-conv/md127] > 58235 ? I 0:00 [kworker/8:537-xfs-conv/md127] > 58237 ? I 0:00 [kworker/8:539-xfs-conv/md127] > 58238 ? I 0:00 [kworker/8:540-xfs-conv/md127] > 58239 ? I 0:00 [kworker/8:541-xfs-conv/md127] > 58240 ? I 0:00 [kworker/8:542-xfs-conv/md127] > 58241 ? I 0:00 [kworker/8:543-xfs-conv/md127] > > It seems like the kernel is creating too many kworkers on each CPU. Or there are tens of thousands of small random write IOs in flight at any given time and this process is serialising on unwritten extent conversion during IO completion processing. i.e. so many kworker threads indicates work processing is blocking, so each new work queued gets a new kworker thread created to process it. I suspect that the filesystem is running out of journal space, and then unwritten extent conversion transactions start lock-stepping waiting for journal space. Hence the question about mkfs.xfs output. Also info about IO load (e.g. `iostat -dxm 5` output) whilst the test is running would be useful, because even fast devices can end up being really slow when something stupid is being done... Kernel profiles across all CPUs whilst the workload is running and in this state would be useful. > I am not sure if this has any effect on performance, but potentially, > there is some scheduling overhead?! It probably does, but a trainsmash of stalled in-progress work like this is typically a symptom of some other misbehaviour occuring. FWIW, for a workload intended to produce "inefficient write IO", this is sort of behaviour is definitely indicating something "inefficient" is occurring during write IO. So, in the end, there is a definite possiblity that there may not actually be anything that can be "fixed" here.... -Dave. -- Dave Chinner david@fromorbit.com