From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f50.google.com (mail-pj1-f50.google.com [209.85.216.50]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F01CA142E7C for ; Tue, 4 Feb 2025 02:50:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.50 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738637414; cv=none; b=hxHgpNNxCZNgUhY46coGGXAheFHY3e2I1J4bGi8K+mjdasGgbvObF6uxYp4Cnax0PHqB/k2oDMUJxeEOCTdGzIf3Q5FPFG6JTfdx8AHchx/3h9/uR2DdyognZYr9VN3ta61GsnZVo81/+LVm6PFi3QdUDHVxZFMaQu/PV/mKAEM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738637414; c=relaxed/simple; bh=a/4dPfwimuygysPiFeAAPX5FA9FA5oLRFfesDyLq75s=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=rG9MfJmztEbcujfDJ1z7uffShIXJbWs3bjrNyOo3qhm/s5iyvy0BaQ7rrkkSEDz3V/i4t5TvVQooMP5Q2/LilVwYV5cpNbM+fuYmZWbTja4kVN2l6rlr0oMs7+2WZExJiy+azYc/7CuFrQM67MsJcLmRYVzMky63KBhIxZX+zqI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com; spf=pass smtp.mailfrom=fromorbit.com; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b=tx096NvQ; arc=none smtp.client-ip=209.85.216.50 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=fromorbit.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=fromorbit-com.20230601.gappssmtp.com header.i=@fromorbit-com.20230601.gappssmtp.com header.b="tx096NvQ" Received: by mail-pj1-f50.google.com with SMTP id 98e67ed59e1d1-2f9bd7c480eso584194a91.1 for ; Mon, 03 Feb 2025 18:50:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1738637412; x=1739242212; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=D7C2oVqc8y98TDUj5uETh96/okBC18wzfttscQcZivI=; b=tx096NvQW/hKvTMIbV79VsQQpU4beOdysEU9nEPr5SWRCWQXTfkbK7sfQDfgW+bF2z s44RfDYU1QLZ6+IcFKQjhCxgq3N52rO1W+1ttVq7TNw37hfuUdbdHDLyjTAcm9lEVqMg dW0H6aF5xGU40Gqfr1gbJ+TjetmUQRDB3y1b1g/9e1raMjGgfW9frhks393vDllIPet/ 0YVQQkHRZPdflsuo6pSw7/kkIm8vh1yneLxsOFPfxWVFfaWebhTVureMT++77YV53tYp IU/WSCkgHbz34muwEB/5Dbn6Oc5WPKzBcjbMRt74ucc+mcNUxy94ql4DieB2Gd7RgsOi Epdw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738637412; x=1739242212; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=D7C2oVqc8y98TDUj5uETh96/okBC18wzfttscQcZivI=; b=dWUGKJoRbcIVK3MAdiitpkz5WGlJFFlk0MrtsJRLmXt5d9SmEXTkYyJRKNkpEwI6iq +C2s9Z3sHRVM25wps1ZByFR8H2Tex2Od3H60+H1rUukNwILYESMRpyRL9vWEujpIeIRL qQ2sPHxvkiHwOgk9e5kLMvitVKWrFztUUt3oAbWPGoJI1vYZlhEN5s//ysnVwWnXgzAR Bnh1GZlalteUGuZesMU0OkvPXbBHKCTw0T6cOQmqvmV5/CiW210X5ewqzRLa6Fe+42AM UdErl2e6/FHYxrJAb17FwkydXEvBPwcxDy05iOdCmWoaS2cejxmge+VeBRjVMkudSPYG FlnQ== X-Forwarded-Encrypted: i=1; AJvYcCVCe9VwwZOTD475JMFJqMj2PToU+XiTW4Gj5gENsMWKOuVHB9dhZuxuw5cxznj9RfUez2mbE+DbGPazxCoC@vger.kernel.org X-Gm-Message-State: AOJu0YwhcuaNwjMvST23da9tfnvqmdljTaDAyZrhucyI2lu1Fi0u1Fe6 7b+U4GyhbvPRirx2BV01X+JcDFPvGnfE0czl4YlzzaMJiSol0XoXD6K1ug9HL3E= X-Gm-Gg: ASbGncvUUBdXn7VPI4VV6Zuz2BH8EO5lQun2XhE7cfDSYy1Kg7vWms/I36A59Z344Ky lhmodkms8MTmqhlAC2dE1S1ffa0Q+8PCJz9XHpExeWgNYtkLnXoAbS4rTR/EpRi9uevDswWzZRC ArjV2X7l7hmFrToSLEbpv1D7i2gZ5UnJl3fXCYlZQQNfPWPTyP08lcQMVBTw/njguO/C1X79HCi 12eaG+C1cnkknYnyRGF6qm9M3JZRVGUIdC+Swoti06MYDc6jDvhLs1WD1u2R3YL5N6WQgtZ6Rq9 PuFDJUVY9a1aS+mg46XNgqiE9/i+6RvaTGBZF8Jk4/aPxGW5klxH1z0m X-Google-Smtp-Source: AGHT+IEbu6lKgiSMu+12rH5y7oPnX/8rXLdh6wzR4AN94QF3xHqW0iTaIhx1St4DIxryEzddVufMHw== X-Received: by 2002:a17:90b:2f50:b0:2ee:d024:e4fc with SMTP id 98e67ed59e1d1-2f83ac84521mr41724266a91.33.1738637411880; Mon, 03 Feb 2025 18:50:11 -0800 (PST) Received: from dread.disaster.area (pa49-186-89-135.pa.vic.optusnet.com.au. [49.186.89.135]) by smtp.gmail.com with ESMTPSA id 98e67ed59e1d1-2f83bc97d25sm12249635a91.5.2025.02.03.18.50.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 03 Feb 2025 18:50:11 -0800 (PST) Received: from dave by dread.disaster.area with local (Exim 4.98) (envelope-from ) id 1tf90q-0000000ELN8-2QqM; Tue, 04 Feb 2025 13:50:08 +1100 Date: Tue, 4 Feb 2025 13:50:08 +1100 From: Dave Chinner To: Kundan Kumar Cc: lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org, anuj20.g@samsung.com, mcgrof@kernel.org, joshi.k@samsung.com, axboe@kernel.dk, clm@meta.com, hch@lst.de, willy@infradead.org, gost.dev@samsung.com Subject: Re: [LSF/MM/BPF TOPIC] Parallelizing filesystem writeback Message-ID: References: <20250129102627.161448-1-kundan.kumar@samsung.com> <20250131093209.6luwm4ny5kj34jqc@green245> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20250131093209.6luwm4ny5kj34jqc@green245> On Fri, Jan 31, 2025 at 03:02:09PM +0530, Kundan Kumar wrote: > > IOWs, having too much parallelism in writeback for the underlying > > storage and/or filesystem can be far more harmful to system > > performance under load than having too little parallelism to drive > > the filesystem/hardware to it's maximum performance. > > With increasing speed of devices we would like to improve the performance of > buffered IO as well. This will help the applications(DB, AI/ML) using buffered > I/O. If more parallelism is causing side effect, we can reduce it using some > factor like: > 1) writeback context per NUMA node. I doubt that will create enough concurrency for a typical small server or desktop machine that only has a single NUMA node but has a couple of fast nvme SSDs in it. > 2) Fixed number of writeback contexts, say min(10, numcpu). > 3) NUMCPU/N number of writeback contexts. These don't take into account the concurrency available from the underlying filesystem or storage. That's the point I was making - CPU count has -zero- relationship to the concurrency the filesystem and/or storage provide the system. It is fundamentally incorrect to base decisions about IO concurrency on the number of CPU cores in the system. Same goes for using NUMA nodes, a fixed minimum, etc. > 4) Writeback context based on FS geometry like per AG for XFS, as per your > suggestion. Leave the concurrency and queue affinity selection up to the filesystem. If they don't specify anything, then we remain with a single queue and writeback thread. > > i.e. with enough RAM, this random write workload using buffered IO > > is pretty much guaranteed to outperform direct IO regardless of the > > underlying writeback concurrency. > > We tested making sure RAM is available for both buffered and direct IO. On a > system with 32GB RAM we issued 24GB IO through 24 jobs on a PMEM device. > > fio --directory=/mnt --name=test --bs=4k --iodepth=1024 --rw=randwrite \ > --ioengine=io_uring --time_based=1 -runtime=120 --numjobs=24 --size=1G \ > --direct=1 --eta-interval=1 --eta-newline=1 --group_reporting > > We can see the results which show direct IO exceed buffered IO by big margin. > > BW (MiB/s) buffered dontcache %improvement direct %improvement > randwrite (bs=4k) 3393 5397 59.06% 9315 174.53% DIO is faster on PMEM because it only copies the data once (i.e. user to PMEM) whilst buffered IO copies the data twice (user to page cache to PMEM). The IO has exactly the same CPU overhead regardless of in-memory page cache aggregation for buffered IO because IO cpu usage is dominated by the data copy overhead. Throughput benchmarks on PMEM are -always- CPU and memory bandwidth bound doing synchronous data copying. PMEM benchamrks may respond to CPU count based optimisations, but that should not be taken as "infrastructure needs to be CPU-count optimised. This is why it is fundamentally wrong to use PMEM for IO infrastructure optimisation: the behaviour it displays is completely different to pretty much all other types of IO device we support. PMEM is a niche, and it performs very poorly compared to a handful of NVMe SSDs being driven by the same CPUs... > > IMO, we need writeback to be optimised for is asynchronous IO > > dispatch through each filesystems; our writeback IOPS problems in > > XFS largely stem from the per-IO cpu overhead of block allocation in > > the filesystems (i.e. delayed allocation). > > This is a good idea, but it means we will not be able to paralellize within > an AG. The XFS allocation concurrency model scales out across AGs, not within AGs. If we need writeback concurrency within an AG (e.g. for pure overwrite concurrency) then we want async dispatch worker threads per writeback queue, not multiple writeback queues per AG.... > I will spend some time to build a POC with per AG writeback context, and > compare it with per-cpu writeback performance and extent fragmentation. Other > filesystems using delayed allocation will also need a similar scheme. The two are not exclusive like you seem to think they are. If you want to optimise your filesystem geometry for CPU count based concurrency, then on XFS all you need to do is this: # mkfs.xfs -d concurrency=nr_cpus /dev/pmem0 mkfs will then create a geometry that has sufficient concurrency for the number of CPUs in the host system. Hence we end up with -all- in-memory and on-disk filesystem structures on that PMEM device being CPU count optimised, not just writeback. Writeback is just a small part of a much larger system, and we already optimise filesystem and IO concurrency via filesystem geometry. Writeback needs to work within those same controls we already provide users with. -Dave. -- Dave Chinner david@fromorbit.com