From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 4 Feb 2026 16:37:13 -0500
From: "Theodore Tso"
To: Mario Lohajner
Cc: Andreas Dilger, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] ext4: add optional rotating block allocation policy
Message-ID: <20260204213713.GD31420@macsyma.lan>
References: <20260204033112.406079-1-mario_lohajner.ref@rocketmail.com> <20260204033112.406079-1-mario_lohajner@rocketmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:

On Wed, Feb 04, 2026 at 12:07:57PM +0100, Mario Lohajner wrote:
> 
> Yes, the main motive for this allocator is flash wear leveling,
> but it is not strictly a wear leveling mechanism, and it is not named
> as such for a reason.

If the device needs such a flash wear leveling scheme, it's very
likely that ext4 is not going to work very well on it, because there
will be *far* more writes to statically located metadata --- the
superblock, the inode tables, and the allocation bitmaps, which are
scattered across the LBA space --- and those writes will potentially
cause problems for such a flash device.
In practice, even the simplest Flash Translation Layer (FTL)
implementations do not require this, so I question whether devices
that would need it exist in practice.  Even the cheapest flash
devices, for low-cost mobile phones and digital cameras, have not
needed this in the 30-plus years that commercial flash storage has
been around, and the micro-controllers which implement the FTL have
been getting more sophisticated, not less.  Do you have a specific
flash storage device where this would be helpful?  Or is this a
hypothetical exercise?

> This policy helps avoid allocation hotspots at mount start by
> distributing allocations sequentially across the entire mount,
> not just a file or allocation stream.

Why are you worrying about allocation hotspots?  What's the
high-level problem that you are trying to address, if it is not
wear leveling?

> At the block/group allocation level, the file system is fairly stochastic
> and timing-sensitive. Rather than providing raw benchmark data, I prefer
> to explain the design analytically:

Whether you use raw benchmarks or thought experiments, you really
need to specify your assumptions about the nature of (a) the storage
device, and (b) the workload.  For example, if the flash device has
such a primitive, terrible flash translation layer that the file
system needs to handle wear leveling, it's generally the cheapest,
most trashy storage device that can be imagined.  In those cases, the
bottleneck will likely be the read/write speed.  So we probably don't
need to worry about block allocator performance while writing to this
storage device, because the I/O throughput and latency are probably
comparable to the worst possible USB thumb drive that you might find
in the checkout line of a drug store.

From the workload perspective, how many files are you expecting the
system will be writing in parallel?  For example, is the user going
to be running "make -j32" while building some software project?
Probably not, because why would you connect a really powerful AMD
Threadripper CPU to the cheapest possible trash flash device?  That
would be a system that is very much out of balance.  But if this is
going to be a low-demand, low-performance system, then you might be
able to use an even simpler allocator --- say, like what the FAT
file system uses.

Speaking of FAT, depending on the quality of the storage device and
the benchmark results, perhaps another file system would be a better
choice.  In addition to FAT, another file system to consider is f2fs,
which is a log-structured file system that avoids the static inode
table that might be a problem with a flash device that needs file
system aware wear-leveling.

> Of course, this is not optimal for classic HDDs, but NVMe drives behave
> differently.

I'm not aware of *any* NVMe devices that would find this to be
advantageous.  This is where some real benchmarks with real hardware,
and with a specific workload that is used in real-world devices,
would be really helpful.

Cheers,

						- Ted