Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems):

From: "Theodore Ts'o" <tytso@mit.edu>
To: "Artem S. Tashkinov" <aros@gmx.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems):
Date: Sun, 12 Jan 2025 00:27:43 -0500
Message-ID: <20250112052743.GH1323402@mit.edu>
In-Reply-To: <ba4f3df5-027b-405e-8e6e-a3630f7eef93@gmx.com>

On Sat, Jan 11, 2025 at 09:17:49AM +0000, Artem S. Tashkinov wrote:
> Hello,
> 
> I had this idea on 2021-11-07, then I thought it was wrong/stupid, now
> I've asked AI and it said it was actually not bad, so I'm bringing it
> forward now:
> 
> Imagine the following scenarios:
> 
>  * You need to delete tens of thousands of files.
>  * You need to change the permissions, ownership, or security context
> (chmod, chown, chcon) for tens of thousands of files.
>  * You need to update timestamps for tens of thousands of files.
> 
> All these operations are currently relatively slow because they are
> executed sequentially, generating significant I/O overhead.
> 
> What if these operations could be spooled and performed as a single
> transaction? By bundling metadata updates into one atomic operation,
> such tasks could become near-instant or significantly faster. This would
> also reduce the number of writes, leading to less wear and tear on
> storage devices.

As Amir has stated, pretty much all journaled file systems will
combine a large number of file system operations into a single
transaction, unless there is an explicit request via an fsync(2) system
call.  For example, ext4 in general only closes a journal transaction
every five seconds, or when there isn't enough space in the journal
(although in practice this isn't an issue if you are using a reasonably
modern mkfs.ext4, since we've increased the default size of the
journal).
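
As a rough illustration of the batching described above, here is a toy
model in Python. It is not ext4 or jbd2 code: the class ToyJournal, its
capacity figure, and the simulated clock are all hypothetical, and only
the five-second interval comes from the text. It shows that many updates
issued without an fsync(2) land in one transaction.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJournal:
    """Toy model of a journal that batches metadata updates (not real ext4 code)."""

    commit_interval: float = 5.0  # seconds; the five-second figure from the text
    capacity: int = 100_000       # hypothetical: max updates per transaction
    opened_at: float = 0.0        # simulated time the open transaction started
    running: list = field(default_factory=list)    # updates in the open transaction
    committed: list = field(default_factory=list)  # one (reason, size) per commit

    def _commit(self, reason: str, now: float) -> None:
        # Close the open transaction (if it has any updates) and start a new one.
        if self.running:
            self.committed.append((reason, len(self.running)))
            self.running = []
        self.opened_at = now

    def update(self, op: str, now: float) -> None:
        # Close the open transaction first if its timer has expired.
        if now - self.opened_at >= self.commit_interval:
            self._commit("timer", now)
        self.running.append(op)
        # Close it early if the (hypothetical) journal space is used up.
        if len(self.running) >= self.capacity:
            self._commit("journal full", now)

    def fsync(self, now: float) -> None:
        # An explicit fsync forces the open transaction to commit.
        self._commit("fsync", now)


journal = ToyJournal()
# 10,000 chmod-style updates issued over one simulated second.
for i in range(10_000):
    journal.update(f"chmod inode {i}", now=i / 10_000)
journal.fsync(now=1.0)
print(journal.committed)  # [('fsync', 10000)]
```

All 10,000 updates end up in a single transaction. Calling fsync()
after every update in this model would instead produce 10,000
transactions of one update each.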

The reason why deleting a large number of files, or changing the
permissions, ownership, timestamps, etc., of a large number of files
is slow is that you need to read the directory blocks to find the inodes
that you need to modify, read a large number of inodes, update a large
number of inodes, and if you are deleting the inodes, also update the
block allocation metadata (bitmaps, or btrees) so that those blocks
are marked as no longer in use.  Some of the directory entries might
be cached in the dentry cache, and some of the inodes might be cached
in the inode cache, but that's not always the case.

If all of the metadata blocks that you need to read in order to
accomplish the operation are already cached in memory, then what you
propose is something that pretty much all journaled file systems will
do already, today. That is, the modifications that need to be made to
the metadata will first be written to the journal, and only
after the journal transaction has been committed, will the actual
metadata blocks be written to the storage device, and this will be
done asynchronously.

In practice, the actual delay in doing one of these large operations is
the need to read the metadata blocks into memory, and this must be
done synchronously, since (for example), if you are deleting 100,000
files, you first need to know which inodes belong to those 100,000
files by reading the directory blocks; you then need to know which
blocks will be freed by deleting each of those 100,000 files, which
means you will need to read 100,000 inodes and their extent tree
blocks, and then you need to update the block allocation information,
and that will require that you read the block allocation bitmaps so
they can be updated.
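
To put rough numbers on the reads listed above, here is a
back-of-the-envelope sketch in Python. Every parameter default is an
assumption chosen for illustration (4 KiB blocks, 256-byte inodes,
32-byte directory entries, 32768 blocks per allocation bitmap, 4 blocks
per file), and the function name estimate_metadata_reads is
hypothetical. It computes a best case: inodes and data blocks are
assumed to be packed contiguously, and separate extent tree blocks are
ignored.

```python
import math


def estimate_metadata_reads(
    n_files: int,
    block_size: int = 4096,         # assumed file system block size in bytes
    inode_size: int = 256,          # assumed on-disk inode size in bytes
    dirent_size: int = 32,          # assumed average directory entry size in bytes
    blocks_per_group: int = 32768,  # assumed blocks covered by one allocation bitmap
    blocks_per_file: int = 4,       # assumed average file size in blocks
) -> dict:
    """Best-case count of metadata blocks to read when deleting n_files."""
    # Directory blocks that map names to inode numbers.
    dir_blocks = math.ceil(n_files * dirent_size / block_size)
    # Best case: the inodes sit next to each other in the inode table.
    inode_blocks = math.ceil(n_files * inode_size / block_size)
    # Best case: the data blocks are contiguous, so few bitmaps are touched.
    bitmap_blocks = math.ceil(n_files * blocks_per_file / blocks_per_group)
    return {
        "directory_blocks": dir_blocks,
        "inode_table_blocks": inode_blocks,
        "bitmap_blocks": bitmap_blocks,
        "total": dir_blocks + inode_blocks + bitmap_blocks,
    }


print(estimate_metadata_reads(100_000))
```

Under these assumptions, deleting 100,000 files needs about 7,000 block
reads even in the best case. A fragmented layout, where the inodes and
data blocks are scattered across the device, would push the number
higher.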

> Does this idea make sense? If it already exists, or if there’s a reason
> it wouldn’t work, please let me know.

So yes, it basically exists, although in practice, it doesn't work as
well as you might think, because of the need to read potentially a
large number of the metadata blocks.  But for example, if you make sure
that all of the inode information is already cached, e.g.:

   ls -lR /path/to/large/tree > /dev/null

Then the operation to do a bulk update will be fast:

   time chown -R root:root /path/to/large/tree

This demonstrates that the bottleneck tends to be *reading* the
metadata blocks, not *writing* the metadata blocks.
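
The same two-step experiment can be sketched in Python. This is only an
illustration under stated assumptions: the helper names warm_metadata,
bulk_touch, and timed are hypothetical, and the bulk update sets
timestamps with os.utime() instead of running chown, so that it does
not need root. The demo runs on a small temporary tree that it creates
itself.

```python
import os
import tempfile
import time


def warm_metadata(root: str) -> int:
    """Stat every entry under root, roughly what `ls -lR` does; returns entries seen."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.lstat(os.path.join(dirpath, name))
            count += 1
    return count


def bulk_touch(root: str, timestamp: float) -> int:
    """Set atime and mtime on every entry under root; returns entries updated."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            os.utime(path, (timestamp, timestamp), follow_symlinks=False)
            count += 1
    return count


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for i in range(1000):
            open(os.path.join(root, f"file{i}"), "w").close()
        seen, warm_s = timed(warm_metadata, root)
        updated, update_s = timed(bulk_touch, root, time.time())
        print(f"warmed {seen} entries in {warm_s:.4f}s, "
              f"updated {updated} entries in {update_s:.4f}s")
```

A freshly created tree is already cached, so this demo will not show
the cold-versus-warm difference by itself. To see it, point the two
functions at a large existing tree after the caches have been dropped
(for example, as root, by writing 3 to /proc/sys/vm/drop_caches), and
compare bulk_touch() timings with and without a preceding
warm_metadata() pass.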

Cheers,

				- Ted

Thread overview: 8+ messages
2025-01-11  9:17 Spooling large metadata updates / Proposal for a new API/feature in the Linux Kernel (VFS/Filesystems): Artem S. Tashkinov
2025-01-11 10:33 ` Amir Goldstein
2025-01-12  5:27 ` Theodore Ts'o [this message]
2025-01-12 11:58   ` Matthew Wilcox
2025-01-12 18:12     ` Darrick J. Wong
2025-01-13  7:41   ` Artem S. Tashkinov
2025-01-13 14:00     ` Theodore Ts'o
2025-01-13 23:31 ` Dave Chinner
