* Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems):
@ 2025-01-11 9:17 Artem S. Tashkinov
2025-01-11 10:33 ` Amir Goldstein
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Artem S. Tashkinov @ 2025-01-11 9:17 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Hello,
I had this idea on 2021-11-07, then I thought it was wrong/stupid, now
I've asked AI and it said it was actually not bad, so I'm bringing it
forward now:
Imagine the following scenarios:
* You need to delete tens of thousands of files.
* You need to change the permissions, ownership, or security context
(chmod, chown, chcon) for tens of thousands of files.
* You need to update timestamps for tens of thousands of files.
All these operations are currently relatively slow because they are
executed sequentially, generating significant I/O overhead.
What if these operations could be spooled and performed as a single
transaction? By bundling metadata updates into one atomic operation,
such tasks could become near-instant or significantly faster. This would
also reduce the number of writes, leading to less wear and tear on
storage devices.
Does this idea make sense? If it already exists, or if there’s a reason
it wouldn’t work, please let me know.
Best regards,
Artem
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-11 9:17 Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): Artem S. Tashkinov @ 2025-01-11 10:33 ` Amir Goldstein 2025-01-12 5:27 ` Theodore Ts'o 2025-01-13 23:31 ` Dave Chinner 2 siblings, 0 replies; 8+ messages in thread From: Amir Goldstein @ 2025-01-11 10:33 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, linux-fsdevel, linux-xfs On Sat, Jan 11, 2025 at 10:18 AM Artem S. Tashkinov <aros@gmx.com> wrote: > > Hello, > > I had this idea on 2021-11-07, then I thought it was wrong/stupid, now > I've asked AI and it said it was actually not bad, so I'm bringing it > forward now: > > Imagine the following scenarios: > > * You need to delete tens of thousands of files. > * You need to change the permissions, ownership, or security context > (chmod, chown, chcon) for tens of thousands of files. > * You need to update timestamps for tens of thousands of files. > > All these operations are currently relatively slow because they are > executed sequentially, generating significant I/O overhead. > > What if these operations could be spooled and performed as a single > transaction? By bundling metadata updates into one atomic operation, atomicity is not implied from the use case you described. IOW, the use case should not care in how many sub-transactions the changes are executed. > such tasks could become near-instant or significantly faster. This would > also reduce the number of writes, leading to less wear and tear on > storage devices. > > Does this idea make sense? If it already exists, or if there’s a reason > it wouldn’t work, please let me know. Yes it is already how journaled filesystems work, but userspace can only request to commit the current transaction (a.k.a fsync), so transactions can be committed too frequently or at inefficient manner for the workload (e.g. rm -rf). There was a big effort IIRC around v6.1 to improve scalability of rm -rf workload in xfs which led to a long series of regressions and fixes cycles. I think that an API for rm -rf is interesting because: - It is a *very* common use case, which is often very inefficient - filesystems already have "orphan" lists to deal with deferred work on deleted inodes What could be done in principle: 1. Filesystems could opt-in to implement unlink(path, AT_REMOVE_NONEMPTY_DIR) 2. This API will fail if the directory has subdirs (i_nlink != 2) 3. If the directory has only files, it can be unlinked and its inode added to an "orphan" list as a special "batch delete" transaction 4. When executed, the "batch delete" transaction will iterate the directory entries, decrement nlink of inodes, likely adding those inodes to the "orphan" list 5. rm -rf will iterate DFS, calling unlink(path, AT_REMOVE_NONEMPTY_DIR) on leaf directories whose nlink is 2 Among other complications, this API does not take into account permissions for unlinking the child inodes, based on the child inode attributes such as immutable flag or LSM security policies. This could be an interesting as TOPIC for LSFMM. Thanks, Amir. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-11 9:17 Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): Artem S. Tashkinov 2025-01-11 10:33 ` Amir Goldstein @ 2025-01-12 5:27 ` Theodore Ts'o 2025-01-12 11:58 ` Matthew Wilcox 2025-01-13 7:41 ` Artem S. Tashkinov 2025-01-13 23:31 ` Dave Chinner 2 siblings, 2 replies; 8+ messages in thread From: Theodore Ts'o @ 2025-01-12 5:27 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, linux-fsdevel On Sat, Jan 11, 2025 at 09:17:49AM +0000, Artem S. Tashkinov wrote: > Hello, > > I had this idea on 2021-11-07, then I thought it was wrong/stupid, now > I've asked AI and it said it was actually not bad, so I'm bringing it > forward now: > > Imagine the following scenarios: > > * You need to delete tens of thousands of files. > * You need to change the permissions, ownership, or security context > (chmod, chown, chcon) for tens of thousands of files. > * You need to update timestamps for tens of thousands of files. > > All these operations are currently relatively slow because they are > executed sequentially, generating significant I/O overhead. > > What if these operations could be spooled and performed as a single > transaction? By bundling metadata updates into one atomic operation, > such tasks could become near-instant or significantly faster. This would > also reduce the number of writes, leading to less wear and tear on > storage devices. As Amir has stated, pretty much all journalled file systems will combine a large number of file system operations into a single transation, unless there is an explicit request via an fsync(2) system call. For example, ext4 in general only closes a journal transaction every five seconds, or there isn't enough space in the journal (athough in practice this isn't an issue if you are using a reasonably modern mkfs.ext4, since we've increased the default size of the journal). The reason why deleting a large number of files, or changing the permissions, ownership, timestamps, etc., of a large number of files is because you need to read the directory blocks to find the inodes that you need to modify, read a large number of inodes, update a large number of inodes, and if you are deleting the inodes, also update the block allocation metadata (bitmaps, or btrees) so that those blocks are marked as no longer in use. Some of the directory entries might be cached in the dentry cache, and some of the inodes might be cached in the inode cache, but that's not always the case. If all of the metadata blocks that you need to read in order to accomplish the operation are already cached in memory, then what you propose is something that pretty much all journaled file systems will do already, today. That is, the modifications that need to be made to the metadata will be first written to the journal first, and only after the journal transaction has been committed, will the actual metadata blocks be written to the storage device, and this will be done asynchronously. In pratice, the actual delay in doing one of these large operations is the need to read the metadata blocks into memory, and this must be done synchronously, since (for example), if you are deleting 100,000 files, you first need to know which inodes for those 100,000 files by reading the directory blocks; you then need to know which blocks will be freed by deleting each of those 100,000 files, which means you will need to read 100,000 inodes and their extent tree blocks, and then you need to update the block allocation information, and that will require that you read the block allocation bitmaps so they can be updated. > Does this idea make sense? If it already exists, or if there’s a reason > it wouldn’t work, please let me know. So yes, it basically exists, although in practice, it doesn't work as well as you might think, because of the need to read potentially a large number of the metdata blocks. But for example, if you make sure that all of the inode information is already cached, e.g.: ls -lR /path/to/large/tree > /dev/null Then the operation to do a bulk update will be fast: time chown -R root:root /path/to/large/tree This demonstrates that the bottleneck tends to be *reading* the metdata blocks, not *writing* the metadata blocks. Cheers, - Ted ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-12 5:27 ` Theodore Ts'o @ 2025-01-12 11:58 ` Matthew Wilcox 2025-01-12 18:12 ` Darrick J. Wong 2025-01-13 7:41 ` Artem S. Tashkinov 1 sibling, 1 reply; 8+ messages in thread From: Matthew Wilcox @ 2025-01-12 11:58 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Artem S. Tashkinov, linux-kernel, linux-fsdevel On Sun, Jan 12, 2025 at 12:27:43AM -0500, Theodore Ts'o wrote: > So yes, it basically exists, although in practice, it doesn't work as > well as you might think, because of the need to read potentially a > large number of the metdata blocks. But for example, if you make sure > that all of the inode information is already cached, e.g.: > > ls -lR /path/to/large/tree > /dev/null > > Then the operation to do a bulk update will be fast: > > time chown -R root:root /path/to/large/tree > > This demonstrates that the bottleneck tends to be *reading* the > metdata blocks, not *writing* the metadata blocks. So if we presented more of the operations to the kernel at once, it could pipeline the reading of the metadata, providing a user-visible win. However, I don't know that we need a new user API to do it. This is something that could be done in the "rm" tool; it has the information it needs, and it's better to put heuristics like "how far to read ahead" in userspace than the kernel. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-12 11:58 ` Matthew Wilcox @ 2025-01-12 18:12 ` Darrick J. Wong 0 siblings, 0 replies; 8+ messages in thread From: Darrick J. Wong @ 2025-01-12 18:12 UTC (permalink / raw) To: Matthew Wilcox Cc: Theodore Ts'o, Artem S. Tashkinov, linux-kernel, linux-fsdevel On Sun, Jan 12, 2025 at 11:58:53AM +0000, Matthew Wilcox wrote: > On Sun, Jan 12, 2025 at 12:27:43AM -0500, Theodore Ts'o wrote: > > So yes, it basically exists, although in practice, it doesn't work as > > well as you might think, because of the need to read potentially a > > large number of the metdata blocks. But for example, if you make sure > > that all of the inode information is already cached, e.g.: > > > > ls -lR /path/to/large/tree > /dev/null > > > > Then the operation to do a bulk update will be fast: > > > > time chown -R root:root /path/to/large/tree > > > > This demonstrates that the bottleneck tends to be *reading* the > > metdata blocks, not *writing* the metadata blocks. > > So if we presented more of the operations to the kernel at once, it > could pipeline the reading of the metadata, providing a user-visible > win. > > However, I don't know that we need a new user API to do it. This is > something that could be done in the "rm" tool; it has the information > it needs, and it's better to put heuristics like "how far to read ahead" > in userspace than the kernel. nr_cpus=$(getconf _NPROCESSORS_ONLN) find $path -print0 | xargs -P $nr_cpus -0 chown root:root deltree is probably harder, because while you can easily parallelize deleting the leaves, find isn't so good at telling you what are the leaves. I suppose you could do: find $path ! -type d -print0 | xargs -P $nr_cpus -0 rm -f rm -r -f $path which would serialize on all the directories, but hopefully there aren't that many of those? FWIW as Amir said, xfs truncates and frees inodes in the background now so most of the upfront overhead of rm -r -f is reading in metadata, deleting directory entries, and putting the files on the unlinked list. --D ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-12 5:27 ` Theodore Ts'o 2025-01-12 11:58 ` Matthew Wilcox @ 2025-01-13 7:41 ` Artem S. Tashkinov 2025-01-13 14:00 ` Theodore Ts'o 1 sibling, 1 reply; 8+ messages in thread From: Artem S. Tashkinov @ 2025-01-13 7:41 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-kernel, linux-fsdevel My use case wasn't about the journal per se, I'm not using it for my ext4 partitions. Correct me if I'm wrong but I was thinking about this: == Issue number one == Let's say you chmod two files sequentially: What's happening: 1) the kernel looks up an inode 2) the kernel then updates the metadata (at the very least one logical block is being written, that's 4K bytes) ditto for the second file. Now let's say we are updating 10000 files. Does this mean that at least 40MB of data will be written, when probably less than 500KB needs to be written to disk? == Issue number two == At least when you write data to the disk, the kernel doesn't flush it immediately and your system remains responsive due to the use of dirty buffers. For operations involving metadata updates, the kernel may not have this luxury, because the system must be in a consistent state even if it's accidentally or intentionally powered off. So, metadata updates must be carried out immediately, and they can bring the system to a halt while flushing the above 40MB of data, as opposed to the 500KB that needs to be updated in terms of what is actually being updated on disk. So, the feature I'm looking for would be to say to the kernel: hey I'm about to batch 10000 operations, please be considerate, do your thing in one fell swoop while optimizing intermediate operations or writes to the disk, and there's no rush, so you may as well postpone the whole thing as much as you want. Best regards, Artem On 1/12/25 5:27 AM, Theodore Ts'o wrote: > On Sat, Jan 11, 2025 at 09:17:49AM +0000, Artem S. Tashkinov wrote: >> Hello, >> >> I had this idea on 2021-11-07, then I thought it was wrong/stupid, now >> I've asked AI and it said it was actually not bad, so I'm bringing it >> forward now: >> >> Imagine the following scenarios: >> >> * You need to delete tens of thousands of files. >> * You need to change the permissions, ownership, or security context >> (chmod, chown, chcon) for tens of thousands of files. >> * You need to update timestamps for tens of thousands of files. >> >> All these operations are currently relatively slow because they are >> executed sequentially, generating significant I/O overhead. >> >> What if these operations could be spooled and performed as a single >> transaction? By bundling metadata updates into one atomic operation, >> such tasks could become near-instant or significantly faster. This would >> also reduce the number of writes, leading to less wear and tear on >> storage devices. > > As Amir has stated, pretty much all journalled file systems will > combine a large number of file system operations into a single > transation, unless there is an explicit request via an fsync(2) system > call. For example, ext4 in general only closes a journal transaction > every five seconds, or there isn't enough space in the journal > (athough in practice this isn't an issue if you are using a reasonably > modern mkfs.ext4, since we've increased the default size of the > journal). > > The reason why deleting a large number of files, or changing the > permissions, ownership, timestamps, etc., of a large number of files > is because you need to read the directory blocks to find the inodes > that you need to modify, read a large number of inodes, update a large > number of inodes, and if you are deleting the inodes, also update the > block allocation metadata (bitmaps, or btrees) so that those blocks > are marked as no longer in use. Some of the directory entries might > be cached in the dentry cache, and some of the inodes might be cached > in the inode cache, but that's not always the case. > > If all of the metadata blocks that you need to read in order to > accomplish the operation are already cached in memory, then what you > propose is something that pretty much all journaled file systems will > do already, today. That is, the modifications that need to be made to > the metadata will be first written to the journal first, and only > after the journal transaction has been committed, will the actual > metadata blocks be written to the storage device, and this will be > done asynchronously. > > In pratice, the actual delay in doing one of these large operations is > the need to read the metadata blocks into memory, and this must be > done synchronously, since (for example), if you are deleting 100,000 > files, you first need to know which inodes for those 100,000 files by > reading the directory blocks; you then need to know which blocks will > be freed by deleting each of those 100,000 files, which means you will > need to read 100,000 inodes and their extent tree blocks, and then you > need to update the block allocation information, and that will require > that you read the block allocation bitmaps so they can be updated. > >> Does this idea make sense? If it already exists, or if there’s a reason >> it wouldn’t work, please let me know. > > So yes, it basically exists, although in practice, it doesn't work as > well as you might think, because of the need to read potentially a > large number of the metdata blocks. But for example, if you make sure > that all of the inode information is already cached, e.g.: > > ls -lR /path/to/large/tree > /dev/null > > Then the operation to do a bulk update will be fast: > > time chown -R root:root /path/to/large/tree > > This demonstrates that the bottleneck tends to be *reading* the > metdata blocks, not *writing* the metadata blocks. > > Cheers, > > - Ted ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-13 7:41 ` Artem S. Tashkinov @ 2025-01-13 14:00 ` Theodore Ts'o 0 siblings, 0 replies; 8+ messages in thread From: Theodore Ts'o @ 2025-01-13 14:00 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, linux-fsdevel On Mon, Jan 13, 2025 at 07:41:53AM +0000, Artem S. Tashkinov wrote: > Let's say you chmod two files sequentially: > > What's happening: > > 1) the kernel looks up an inode > 2) the kernel then updates the metadata (at the very least one logical > block is being written, that's 4K bytes) > > ditto for the second file. > > Now let's say we are updating 10000 files. > > Does this mean that at least 40MB of data will be written, when probably > less than 500KB needs to be written to disk? No, for pretty much all file systems, we don't force the data to be written when you do a chmod. This is true for both journalled and non-journalled file systems. For a non-journalled file system, we will modify the in-memory inode structure, and we will wait until the buffer cache writeback (typically 30 seconds after the block was first dirtied) before the metadata block is written back. So if you modify the same block 32 times for 32 inodes, it won't get written until 30 seconds go by. So as long as you actually complete the chmod -R operation within 30 seconds, you'll be fine. For non-journalled file system, the disk writes won't take place until the transaction close time takes place, which is 5 seconds (by default) before the transaction closes. > == Issue number two == > > At least when you write data to the disk, the kernel doesn't flush it > immediately and your system remains responsive due to the use of dirty > buffers. > > For operations involving metadata updates, the kernel may not have this > luxury, because the system must be in a consistent state even if it's > accidentally or intentionally powered off. > > So, metadata updates must be carried out immediately, and they can bring > the system to a halt while flushing the above 40MB of data, as opposed > to the 500KB that needs to be updated in terms of what is actually being > updated on disk. Nope; POSIX does not require this. As described above, there will be a certain amount of file system updates that won't be completed if someone kicks the power plug out of the wall and the system has an unclean shutdown. > So, the feature I'm looking for would be to say to the kernel: hey I'm > about to batch 10000 operations, please be considerate, do your thing in > one fell swoop while optimizing intermediate operations or writes to the > disk, and there's no rush, so you may as well postpone the whole thing > as much as you want. This is what the kernel *always* does. It's what all system has done for decades, including in legacy Unix systems, and it's what people have always expected and programmers who want something better (e.g., databases providing ACID guarantees) will use fsync(2) and explicitly tell the kernel that reliability is more important than performance or flash durability (all of the extra writes means extra write cycles, leading to more write cycles and decreasing the lifespan of SSD's and newer HDD's with HAMR or MAMR where the laser or masers built in the HDD heads can wear out --- this is why newer HDD's have write limits in their warantees. Fortunately, this is what storage specialists at hyperscaler cloud companies worry about so you don't have to.) You can force metadata to be written out by using the sync(2), fsync(2) or syncfs(2) system calls, but we don't optimize for the uncommon case where someone might yank the power out or the kernel crashes unexpectedly. Cheers, - Ted ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): 2025-01-11 9:17 Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): Artem S. Tashkinov 2025-01-11 10:33 ` Amir Goldstein 2025-01-12 5:27 ` Theodore Ts'o @ 2025-01-13 23:31 ` Dave Chinner 2 siblings, 0 replies; 8+ messages in thread From: Dave Chinner @ 2025-01-13 23:31 UTC (permalink / raw) To: Artem S. Tashkinov; +Cc: linux-kernel, linux-fsdevel On Sat, Jan 11, 2025 at 09:17:49AM +0000, Artem S. Tashkinov wrote: > Hello, > > I had this idea on 2021-11-07, then I thought it was wrong/stupid, now > I've asked AI and it said it was actually not bad, Which shows just how poor "AI" is at analysing complex systems. > so I'm bringing it > forward now: > > Imagine the following scenarios: > > * You need to delete tens of thousands of files. > * You need to change the permissions, ownership, or security context > (chmod, chown, chcon) for tens of thousands of files. > * You need to update timestamps for tens of thousands of files. > > All these operations are currently relatively slow because they are > executed sequentially, generating significant I/O overhead. Yes, they are executed sequentially by the -application- not the filesystem. i.e. the performance limiter is application concurrency, not the filesystem. > What if these operations could be spooled and performed as a single > transaction? You get nothing extra - they'd still executed "sequentially" within the transaction. Operational concurrency is required to speed up these operations, and we have io_uring and/or threads for that... > By bundling metadata updates into one atomic operation, > such tasks could become near-instant or significantly faster. No. The filesystem still has to do exactly the same amount of work to modify thousands of files. Transactional atomicity has nothing to do with the cost of modification of an otherwise unrelated set of filesystem objects... The overhead of 'rm -rf' could be optimised, but the filesystem would still have to do the directory traversal first to find all the inodes that have to be unlinked/rmdir'd and process them before the parent directory is freed. i.e. we can make it *look fast*, but it still has largely the same cost in terms of IO, CPU and memory overhead. And, of course, operations like sync() would have to block on an offloaded 'rm -rf' operation. That is likely to cause people more unexpected issues than userspace implementing a concurrent 'rm -rf' based on sub-dir concurrency.... > This would > also reduce the number of writes, leading to less wear and tear on > storage devices. Not for a typical journalling filesystem. They aggregate and journal delta changes efficiently, then do batched writeback of the modified metadata efficiently, eliding all writes possible. > Does this idea make sense? If it already exists, or if there’s a reason > it wouldn’t work, please let me know. Filesystems can already do operations concurrently. As long as concurrency for directory traversal based operations is done on a per-directory level, they scale out to the inherent concurrency the filesytem can provide. In the case of XFS, we can scale out to around 600-700,000 file creates/s, about 1 million chmod/chown/chcon/futimes modifications per second and about 500,000 unlinks per second. With some VFS scalablility mods, we can get up around the 1 million file creates/s mark.... IOWs, if application are having problems with sequential filesystem operations being slow, the problem is generally application level algorithms and concurrency and not the filesystem implementations or syscall interfaces. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-01-13 23:31 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-11 9:17 Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): Artem S. Tashkinov 2025-01-11 10:33 ` Amir Goldstein 2025-01-12 5:27 ` Theodore Ts'o 2025-01-12 11:58 ` Matthew Wilcox 2025-01-12 18:12 ` Darrick J. Wong 2025-01-13 7:41 ` Artem S. Tashkinov 2025-01-13 14:00 ` Theodore Ts'o 2025-01-13 23:31 ` Dave Chinner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox