From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 4 Feb 2026 16:37:13 -0500
From: "Theodore Tso"
To: Mario Lohajner
Cc: Andreas Dilger, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
Message-ID: <20260204213713.GD31420@macsyma.lan>
References: <20260204033112.406079-1-mario_lohajner.ref@rocketmail.com> <20260204033112.406079-1-mario_lohajner@rocketmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:

On Wed, Feb 04, 2026 at 12:07:57PM +0100, Mario Lohajner wrote:
> 
> Yes, the main motive for this allocator is flash wear leveling,
> but it is not strictly a wear leveling mechanism, and it is not named
> as such for a reason.

If the device needs such a flash wear leveling scheme, it's very
likely that ext4 is not going to work very well on it, because there
will be *far* more writes to statically located metadata --- the
superblock, the inode tables, and the allocation bitmaps, which are
scattered across the LBA space --- and those writes will potentially
cause problems for such a flash device.
In practice, even the simplest Flash Translation Layer (FTL)
implementations do not require this, so I question whether devices
that would need it exist in practice.  Even the cheapest flash
devices, for low-cost mobile phones and digital cameras, have not
needed this in the 30-plus years that commercial flash storage has
been around, and the micro-controllers which implement the FTL have
been getting more sophisticated, not less.  Do you have a specific
flash storage device where this would be helpful?  Or is this a
hypothetical exercise?

> This policy helps avoid allocation hotspots at mount start by
> distributing allocations sequentially across the entire mount,
> not just a file or allocation stream.

Why are you worrying about allocation hotspots?  What's the
high-level problem that you are trying to address, if it is not
wear leveling?

> At the block/group allocation level, the file system is fairly stochastic
> and timing-sensitive. Rather than providing raw benchmark data, I prefer
> to explain the design analytically:

Whether you use raw benchmarks or thought experiments, you really
need to specify your assumptions about the nature of (a) the storage
device, and (b) the workload.  For example, if the flash device has
such a primitive, terrible flash translation layer that the file
system needs to handle wear leveling, it's generally the cheapest,
most trashy storage device that can be imagined.  In those cases, the
bottleneck will likely be the read/write speed.  So we probably don't
need to worry about block allocator performance while writing to this
storage device, because the I/O throughput and latency are probably
comparable to the worst possible USB thumb drive that you might find
in the checkout line of a drug store.

From the workload perspective, how many files are you expecting the
system will be writing in parallel?  For example, is the user going
to be running "make -j32" while building some software project?
Probably not, because why would you connect a really powerful AMD
Threadripper CPU to the cheapest possible trash flash device?  That
would be a system that is very much out of balance.  But if this is
going to be a low-demand, low-performance system, then you might be
able to use an even simpler allocator --- say, like what the FAT
file system uses.

Speaking of FAT, depending on the quality of the storage device and
the benchmark results, perhaps another file system would be a better
choice.  In addition to FAT, another file system to consider is f2fs,
which is a log-structured file system that avoids the static inode
table that might be a problem with a flash device that needs file
system aware wear-leveling.

> Of course, this is not optimal for classic HDDs, but NVMe drives behave
> differently.

I'm not aware of *any* NVMe devices that would find this to be
advantageous.  This is where some real benchmarks with real hardware,
and with a specific workload that is used in real-world devices,
would be really helpful.

Cheers,

						- Ted