public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Deleting a bunch of files takes long.
@ 2025-09-05 10:35 Rogier Wolff
  2025-09-05 13:11 ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Rogier Wolff @ 2025-09-05 10:35 UTC (permalink / raw)
  To: linux-kernel

Hi all, 

I have a little Linux server machine that has two 8 TB hard drives.
I noticed that they were becoming a bit "full", so I decided to
remove a bunch of files.  OK, maybe a lot of files.  About 121
million of them.

How long do you estimate that this would take? Ten minutes? An hour? 

I'm using EXT4. 

Now think about how much metadata needs to be read from the disk to
know which inodes to wipe and which directories to clear...

So a file takes a directory entry of, say, 30 bytes (depends on the
filename)?  How about we round that up to 50.  Then an inode: that
used to be 128 bytes, but nowadays it's 256, right?  Added together
we're talking about 300 bytes per file.  So the metadata comes to
about 36 GB.

Let's round the disk speed to 100 MB/s.  So the OS would get to
"touch" all of the necessary metadata in about 360 seconds.  Add in
some seeking and we're talking ten minutes to an hour.  Right?

Weeeeelllll.... it's been running for over a week now. 

I would think that improving this would also improve things like
find . -name zasdflasdkjf 

I'm hoping that someone with the right skills agrees with me: "this is
absurd" and has the time to do something about it. 

To reproduce: I backup my homedir of about 700k files every day using
rsync. Then I cp -lr the whole thing to a new directory with the
timestamp as the name. I have about 120-200 copies when the disk fills
up and then I batch delete a bunch.

Why do I think that it should be technically possible to achieve
"reasonable" performance? 

I think that the directory contents need to be clustered better.
And then, when the system needs to read a directory, it should
prefetch more directories, say as if there were a "blocksize" of 128k
or even 1M.  Even if the "hit ratio" on the extra 990k is only 10%,
you're already gaining a lot.

To get better clustering, the system may need to keep "stats" about
the filesystem in the superblock.  What percentage of the disk is
dedicated to directory contents?  If, say (as I suspect in my case),
that's 30%, then reserve the first 30% of each block group for
directories only, i.e. allocate directory blocks from there and
data blocks from elsewhere.

Or grow the two different kinds of data from different ends. 

Maybe "readahead" is better served if inodes are allocated from the
end of the inode area.  All this would not require a significant
rewrite of the filesystem, just a place in the superblock where the
current stats are stored on each unmount.

Another strategy might be to allocate the directory info for the
current block group top-down from the previous block group.


	Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** https://www.BitWizard.nl/ ** +31-15-2049110 **
** Verl. Spiegelmakerstraat 37 2645 LZ  Delfgauw, The Netherlands.  
** KVK: 27239233    **
f equals m times a. When your f is steady, and your m is going down
your a** is going up.  -- Chris Hadfield about flying up the space shuttle.
**  'a' for accelleration.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Deleting a bunch of files takes long.
  2025-09-05 10:35 Deleting a bunch of files takes long Rogier Wolff
@ 2025-09-05 13:11 ` Theodore Ts'o
  2025-09-08 16:18   ` Rogier Wolff
  0 siblings, 1 reply; 4+ messages in thread
From: Theodore Ts'o @ 2025-09-05 13:11 UTC (permalink / raw)
  To: Rogier Wolff; +Cc: linux-kernel

On Fri, Sep 05, 2025 at 12:35:53PM +0200, Rogier Wolff wrote:
> 
> I think that the directory contents needs to be clustered better. And
> then when the system needs to read a directory it should prefetch more
> directories, say as if there is a "blocksize" of 128k or even 1M. Even
> if the "hit ratio" of the "extra 990k" is only 10% you're already
> gaining a lot.

I don't think that's the main issue; ext4 already clusters directory
blocks in the first block group of each flexbg (by default there are
16 block groups in a flexbg).  Instead, the issue is that when you
stat or unlink each file that readdir(2) returns, as you iterate over
all of the files in a directory hierarchy, the system needs to make a
random seek to read the inode from the inode table.  This results in
a huge number of random seeks, which is especially painful on hard
drives.

You can see the difference by running these two find commands in a
very large directory and comparing the times:

# echo 3 > /proc/sys/vm/drop_caches
# time find . -print >& /dev/null
# echo 3 > /proc/sys/vm/drop_caches
# time find . -ls >& /dev/null

There is a workaround; see the attached spd_readdir.c, which you can
use as an LD_PRELOAD.  It overrides readdir() by sorting the returned
directory entries by inode number.  The reason why we don't do this
in the kernel is that POSIX requires that if a directory changes by
adding or removing a file, that file must be returned exactly zero or
one times, even if the user uses telldir(3) and seekdir(3), and
hours, days, weeks, months, or years can elapse between the telldir()
and seekdir(), so long as the file descriptor stays open.  Using
spd_readdir.so can potentially violate this POSIX guarantee.  This
won't matter for an rm -rf run, but it might matter for programs
like, say, Samba.

Cheers,

						- Ted



/*
 * readdir accelerator
 *
 * (C) Copyright 2003, 2004 by Theodore Ts'o.
 *
 * Compile using the command:
 *
 * gcc -o spd_readdir.so -fPIC -shared spd_readdir.c -ldl
 *
 * Use it by setting the LD_PRELOAD environment variable:
 * 
 * export LD_PRELOAD=/usr/local/sbin/spd_readdir.so
 *
 * Note that this preload is not going to work for all programs.  In
 * particular, although it does supply readdir_r(), it is *not* thread
 * safe.  So I can't recommend this as something to be dropped in
 * /etc/ld.so.preload.
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 * 
 */

#define ALLOC_STEPSIZE	100
#define MAX_DIRSIZE	0

#define DEBUG

#ifdef DEBUG
#define DEBUG_DIR(x)	{if (do_debug) { x; }}
#else
#define DEBUG_DIR(x)
#endif

#define _GNU_SOURCE
#define __USE_LARGEFILE64

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <errno.h>
#include <dlfcn.h>

struct dirent_s {
	unsigned long long d_ino;
	long long d_off;
	unsigned short int d_reclen;
	unsigned char d_type;
	char *d_name;
};

struct dir_s {
	DIR	*dir;
	int	num;
	int	max;
	struct dirent_s *dp;
	int	pos;
	int	direct;
	struct dirent ret_dir;
	struct dirent64 ret_dir64;
};

static int (*real_closedir)(DIR *dir) = 0;
static DIR *(*real_opendir)(const char *name) = 0;
static DIR *(*real_fdopendir)(int fd) = 0;
static void (*real_rewinddir)(DIR *dirp) = 0;
static struct dirent *(*real_readdir)(DIR *dir) = 0;
static int (*real_readdir_r)(DIR *dir, struct dirent *entry,
			     struct dirent **result) = 0;
static struct dirent64 *(*real_readdir64)(DIR *dir) = 0;
static off_t (*real_telldir)(DIR *dir) = 0;
static void (*real_seekdir)(DIR *dir, off_t offset) = 0;
static int (*real_dirfd)(DIR *dir) = 0;
static unsigned long max_dirsize = MAX_DIRSIZE;
static int num_open = 0;
#ifdef DEBUG
static int do_debug = 0;
#endif

static void setup_ptr()
{
	char *cp;

	real_opendir = dlsym(RTLD_NEXT, "opendir");
	real_fdopendir = dlsym(RTLD_NEXT, "fdopendir");
	real_closedir = dlsym(RTLD_NEXT, "closedir");
	real_rewinddir = dlsym(RTLD_NEXT, "rewinddir");
	real_readdir = dlsym(RTLD_NEXT, "readdir");
	real_readdir_r = dlsym(RTLD_NEXT, "readdir_r");
	real_readdir64 = dlsym(RTLD_NEXT, "readdir64");
	real_telldir = dlsym(RTLD_NEXT, "telldir");
	real_seekdir = dlsym(RTLD_NEXT, "seekdir");
	real_dirfd = dlsym(RTLD_NEXT, "dirfd");
	if ((cp = getenv("SPD_READDIR_MAX_SIZE")) != NULL) {
		max_dirsize = atol(cp);
	}
#ifdef DEBUG
	if (getenv("SPD_READDIR_DEBUG")) {
		printf("initialized!\n");
		do_debug++;
	}
#endif
}

static void free_cached_dir(struct dir_s *dirstruct)
{
	int i;

	if (!dirstruct->dp)
		return;

	for (i=0; i < dirstruct->num; i++) {
		free(dirstruct->dp[i].d_name);
	}
	free(dirstruct->dp);
	dirstruct->dp = 0;
	dirstruct->max = dirstruct->num = 0;
}	

static int ino_cmp(const void *a, const void *b)
{
	const struct dirent_s *ds_a = (const struct dirent_s *) a;
	const struct dirent_s *ds_b = (const struct dirent_s *) b;
	ino_t i_a, i_b;
	
	i_a = ds_a->d_ino;
	i_b = ds_b->d_ino;

	if (ds_a->d_name[0] == '.') {
		if (ds_a->d_name[1] == 0)
			i_a = 0;
		else if ((ds_a->d_name[1] == '.') && (ds_a->d_name[2] == 0))
			i_a = 1;
	}
	if (ds_b->d_name[0] == '.') {
		if (ds_b->d_name[1] == 0)
			i_b = 0;
		else if ((ds_b->d_name[1] == '.') && (ds_b->d_name[2] == 0))
			i_b = 1;
	}

	/* Compare instead of subtracting, so large inode numbers
	 * don't get truncated or overflow the int return value. */
	return (i_a < i_b) ? -1 : ((i_a > i_b) ? 1 : 0);
}

struct dir_s *alloc_dirstruct(DIR *dir)
{
	struct dir_s	*dirstruct;

	dirstruct = malloc(sizeof(struct dir_s));
	if (!dirstruct)
		return NULL;	/* don't dereference a failed malloc */
	memset(dirstruct, 0, sizeof(struct dir_s));
	dirstruct->dir = dir;
	return dirstruct;
}

void cache_dirstruct(struct dir_s *dirstruct)
{
	struct dirent_s *ds, *dnew;
	struct dirent64 *d;

	while ((d = (*real_readdir64)(dirstruct->dir)) != NULL) {
		if (dirstruct->num >= dirstruct->max) {
			dirstruct->max += ALLOC_STEPSIZE;
			DEBUG_DIR(printf("Reallocating to size %d\n", 
					 dirstruct->max));
			dnew = realloc(dirstruct->dp,
				       dirstruct->max * sizeof(struct dirent_s));
			if (!dnew)
				goto nomem;
			dirstruct->dp = dnew;
		}
		ds = &dirstruct->dp[dirstruct->num++];
		ds->d_ino = d->d_ino;
		ds->d_off = d->d_off;
		ds->d_reclen = d->d_reclen;
		ds->d_type = d->d_type;
		if ((ds->d_name = malloc(strlen(d->d_name)+1)) == NULL) {
			dirstruct->num--;
			goto nomem;
		}
		strcpy(ds->d_name, d->d_name);
		DEBUG_DIR(printf("readdir: %lu %s\n", 
				 (unsigned long) d->d_ino, d->d_name));
	}
	qsort(dirstruct->dp, dirstruct->num, sizeof(struct dirent_s), ino_cmp);
	return;
nomem:
	DEBUG_DIR(printf("No memory, backing off to direct readdir\n"));
	free_cached_dir(dirstruct);
	dirstruct->direct = 1;
}

DIR *opendir(const char *name)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct stat st;

	if (!real_opendir)
		setup_ptr();

	DEBUG_DIR(printf("Opendir(%s) (%d open)\n", name, num_open++));
	dir = (*real_opendir)(name);
	if (!dir)
		return NULL;

	dirstruct = alloc_dirstruct(dir);
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;
		return NULL;
	}

	if (max_dirsize && (stat(name, &st) == 0) && 
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 (long) st.st_size));
		dirstruct->direct = 1;
		return (DIR *) dirstruct;
	}

	cache_dirstruct(dirstruct);
	return ((DIR *) dirstruct);
}

DIR *fdopendir(int fd)
{
	DIR *dir;
	struct dir_s	*dirstruct;
	struct stat st;

	if (!real_fdopendir)
		setup_ptr();

	DEBUG_DIR(printf("fdopendir(%d) (%d open)\n", fd, num_open++));
	dir = (*real_fdopendir)(fd);
	if (!dir)
		return NULL;

	dirstruct = alloc_dirstruct(dir);
	if (!dirstruct) {
		(*real_closedir)(dir);
		errno = ENOMEM;
		return NULL;
	}

	if (max_dirsize && (fstat(fd, &st) == 0) && 
	    (st.st_size > max_dirsize)) {
		DEBUG_DIR(printf("Directory size %ld, using direct readdir\n",
				 (long) st.st_size));
		dirstruct->dir = dir;
		dirstruct->direct = 1;
		return (DIR *) dirstruct;
	}

	cache_dirstruct(dirstruct);
	return ((DIR *) dirstruct);
}

int closedir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	DEBUG_DIR(printf("Closedir (%d open)\n", --num_open));
	if (dirstruct->dir)
		(*real_closedir)(dirstruct->dir);

	free_cached_dir(dirstruct);
	free(dirstruct);
	return 0;
}

struct dirent *readdir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->direct)
		return (*real_readdir)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir.d_ino = ds->d_ino;
	dirstruct->ret_dir.d_off = ds->d_off;
	dirstruct->ret_dir.d_reclen = ds->d_reclen;
	dirstruct->ret_dir.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir.d_name));
	/* strncpy doesn't NUL-terminate on truncation */
	dirstruct->ret_dir.d_name[sizeof(dirstruct->ret_dir.d_name) - 1] = 0;

	return (&dirstruct->ret_dir);
}

int readdir_r(DIR *dir, struct dirent *entry, struct dirent **result)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->direct)
		return (*real_readdir_r)(dirstruct->dir, entry, result);

	if (dirstruct->pos >= dirstruct->num) {
		*result = NULL;
		return 0;
	}

	ds = &dirstruct->dp[dirstruct->pos++];
	entry->d_ino = ds->d_ino;
	entry->d_off = ds->d_off;
	entry->d_reclen = ds->d_reclen;
	entry->d_type = ds->d_type;
	strncpy(entry->d_name, ds->d_name, sizeof(entry->d_name));
	entry->d_name[sizeof(entry->d_name) - 1] = 0;
	*result = entry;

	return 0;
}

struct dirent64 *readdir64(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	struct dirent_s *ds;

	if (dirstruct->direct)
		return (*real_readdir64)(dirstruct->dir);

	if (dirstruct->pos >= dirstruct->num)
		return NULL;

	ds = &dirstruct->dp[dirstruct->pos++];
	dirstruct->ret_dir64.d_ino = ds->d_ino;
	dirstruct->ret_dir64.d_off = ds->d_off;
	dirstruct->ret_dir64.d_reclen = ds->d_reclen;
	dirstruct->ret_dir64.d_type = ds->d_type;
	strncpy(dirstruct->ret_dir64.d_name, ds->d_name,
		sizeof(dirstruct->ret_dir64.d_name));
	dirstruct->ret_dir64.d_name[sizeof(dirstruct->ret_dir64.d_name) - 1] = 0;

	return (&dirstruct->ret_dir64);
}

off_t telldir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->direct)
		return (*real_telldir)(dirstruct->dir);

	return ((off_t) dirstruct->pos);
}

void seekdir(DIR *dir, off_t offset)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	if (dirstruct->direct) {
		(*real_seekdir)(dirstruct->dir, offset);
		return;
	}

	dirstruct->pos = offset;
}

void rewinddir(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;

	(*real_rewinddir)(dirstruct->dir);
	if (dirstruct->direct)
		return;
	
	dirstruct->pos = 0;
	free_cached_dir(dirstruct);
	cache_dirstruct(dirstruct);
}

int dirfd(DIR *dir)
{
	struct dir_s	*dirstruct = (struct dir_s *) dir;
	int fd = (*real_dirfd)(dirstruct->dir);

	DEBUG_DIR(printf("dirfd %d, %p\n", fd, real_dirfd));
	return fd;
}


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Deleting a bunch of files takes long.
  2025-09-05 13:11 ` Theodore Ts'o
@ 2025-09-08 16:18   ` Rogier Wolff
  2025-09-08 19:40     ` Theodore Ts'o
  0 siblings, 1 reply; 4+ messages in thread
From: Rogier Wolff @ 2025-09-08 16:18 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-kernel

On Fri, Sep 05, 2025 at 09:11:30AM -0400, Theodore Ts'o wrote:
 
> There is a workaround; see the attached spd_readdir.c which you can
> use as a LD_PRELOAD.  It overrides readdir() by sorting the returned

I've now had the chance to test this.  With the LD_PRELOAD I'm
getting times of about 10 minutes per toplevel directory deleted,
with quite a spread in the data: 4:09 at the lowest, 12:05 at the
highest, and many around the 10:10 mark.

Without the spd_readdir, I'm seeing 5:40 as the lowest and 13:03 as
the highest. So the performance increase is "about 20%".

My technical intuition says that a factor of 2-10 should be possible
(a 50-90% reduction in time), while I'm hoping for a factor of 100
(a 99% reduction in runtime).

Is the "logging file system" a contributing factor?  Maybe after
each rm or rmdir something needs to be written to the log?

At one point in time I accidentally deleted my whole kernel tree.
That took "longer than expected", which made me think and realize
what I had done.  It took about 2 or 3 seconds.  The operating system
was Minix and the PC was a PC-XT at 10 MHz.  I double-checked my
reasoning for a couple of seconds and powered the machine off before
the cached results (everything deleted) could be written to disk.
After the expected fsck my kernel sources were still there.

I have the impression that Linux has been worse than Minix in such
cases for about three decades now.

	Roger.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Deleting a bunch of files takes long.
  2025-09-08 16:18   ` Rogier Wolff
@ 2025-09-08 19:40     ` Theodore Ts'o
  0 siblings, 0 replies; 4+ messages in thread
From: Theodore Ts'o @ 2025-09-08 19:40 UTC (permalink / raw)
  To: Rogier Wolff; +Cc: linux-kernel

On Mon, Sep 08, 2025 at 06:18:51PM +0200, Rogier Wolff wrote:
> Is the "logging file system" a contributing factor? Maybe after each
> rm or after each rmdir that something needs to be written to the log?

If you want to avoid running fsck after a crash, it's not free; there
is always a certain amount of overhead in journalling.  Comparing
Linux with Minix is comparing apples and oranges, since with Minix
you have to run fsck after every crash or power failure.

You *can* run ext4 without the journal.  If the file system has been
cleanly unmounted, or you've run fsck, you can mount the file system
using -o noload to disable journalling.  Or you can format the file
system without the journal ("mkfs.ext4 -O ^has_journal").  BTW, this
is something that Google contributed some 15+ years ago, because
Google uses a cluster file system (back then, GFS), and at very large
scales they need to make sure data isn't lost when a hard drive dies,
or when a power supply on a particular server dies, or the router at
the top of a rack gives up the ghost.  If data gets lost after a
crash or power failure, the cluster file system can recover, since it
has to handle much worse (e.g., an entire rack of servers becoming
inaccessible when a router dies, or the power management unit taking
out multiple racks in a power failure domain), and so the ext4
journal was unnecessary overhead.

So if you don't care about reliable recovery after a power failure, by
all means, you can disable the journal with ext4.  That *will* make
certain workloads faster.  But users tend to get cranky when they lose
data after a crash, unless you have some kind of higher-level data
recovery (e.g., a cluster file system with erasure coding or
replication across servers in different failure domains).

The other thing ext4 does is spread files across the entire file
system, which reduces file fragmentation, but it does mean that if
you create a huge number of files, and then you want to delete a
huge number of files, a larger number of block groups will need to
be updated compared to Minix.  But this was a deliberate design
decision, because reducing performance degradation over the long
term is something that we considered far more important than
optimizing for "rm -rf".

Cheers,

							- Ted

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-09-08 19:44 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-05 10:35 Deleting a bunch of files takes long Rogier Wolff
2025-09-05 13:11 ` Theodore Ts'o
2025-09-08 16:18   ` Rogier Wolff
2025-09-08 19:40     ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox