ext3/ext4 directories don't shrink after deleting lots of files

public inbox for linux-kernel@vger.kernel.org

* ext3/ext4 directories don't shrink after deleting lots of files
@ 2009-05-14 22:02 Timo Sirainen
  2009-05-15  0:32 ` Josef Bacik
  0 siblings, 1 reply; 10+ messages in thread
From: Timo Sirainen @ 2009-05-14 22:02 UTC (permalink / raw)
  To: linux-kernel

I've noticed that if you create e.g. 100k files in a directory and then
delete the files, the directory entry still seems to take a couple of
megabytes. Later whenever accessing the (almost empty) directory it can
take a few seconds to load it into cache.

Is there a way to shrink the directory somehow without having to rmdir()
it? Would be nice if kernel did it automatically, but I could live with
a manual userspace syscall/tool as well.
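
The effect is easy to check from userspace. The following is a minimal
Python sketch, not something from the kernel tree: the helper name
dir_size_after_churn and the file-name pattern are made up for
illustration, and it assumes the directory's st_size reflects the blocks
allocated to it, which holds on ext3/ext4 but not necessarily elsewhere.

```python
import os
import tempfile


def dir_size_after_churn(parent, count):
    """Create `count` empty files in a fresh directory under `parent`,
    delete them all, and return (size_empty, size_full, size_after):
    the directory's own st_size at each of the three points."""
    path = tempfile.mkdtemp(dir=parent)
    try:
        size_empty = os.stat(path).st_size
        names = [os.path.join(path, "f%06d" % i) for i in range(count)]
        for name in names:
            # O_CREAT is enough; the files themselves stay empty.
            os.close(os.open(name, os.O_CREAT | os.O_WRONLY, 0o600))
        size_full = os.stat(path).st_size
        for name in names:
            os.unlink(name)
        size_after = os.stat(path).st_size
        return size_empty, size_full, size_after
    finally:
        # The scratch directory is empty again at this point.
        os.rmdir(path)
```

Calling dir_size_after_churn(".", 100000) on an ext3/ext4 mount would be
expected to return a third value close to the second rather than back at
the first.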

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-14 22:02 ext3/ext4 directories don't shrink after deleting lots of files Timo Sirainen
@ 2009-05-15  0:32 ` Josef Bacik
  2009-05-15  0:45   ` Timo Sirainen
  0 siblings, 1 reply; 10+ messages in thread
From: Josef Bacik @ 2009-05-15  0:32 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: linux-kernel

On Thu, May 14, 2009 at 6:02 PM, Timo Sirainen <tss@iki.fi> wrote:
> I've noticed that if you create e.g. 100k files in a directory and then
> delete the files, the directory entry still seems to take a couple of
> megabytes. Later whenever accessing the (almost empty) directory it can
> take a few seconds to load it into cache.
>
> Is there a way to shrink the directory somehow without having to rmdir()
> it? Would be nice if kernel did it automatically, but I could live with
> a manual userspace syscall/tool as well.
>

fsck -D <device>, when it's unmounted of course.

Josef

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15  0:32 ` Josef Bacik
@ 2009-05-15  0:45   ` Timo Sirainen
  2009-05-15 10:58     ` Theodore Tso
  2009-05-17 21:33     ` Theodore Tso
  0 siblings, 2 replies; 10+ messages in thread
From: Timo Sirainen @ 2009-05-15  0:45 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-kernel

On May 14, 2009, at 8:32 PM, Josef Bacik wrote:

> On Thu, May 14, 2009 at 6:02 PM, Timo Sirainen <tss@iki.fi> wrote:
>> I've noticed that if you create e.g. 100k files in a directory and then
>> delete the files, the directory entry still seems to take a couple of
>> megabytes. Later whenever accessing the (almost empty) directory it can
>> take a few seconds to load it into cache.
>>
>> Is there a way to shrink the directory somehow without having to rmdir()
>> it? Would be nice if kernel did it automatically, but I could live with
>> a manual userspace syscall/tool as well.
>>
>
> fsck -D <device>, when it's unmounted of course.

I was rather thinking something that I could run while the system was  
fully operational. Otherwise just moving the files to a temp directory  
+ rmdir() + rename() would have been fine too.

I just tested that xfs, jfs and reiserfs all shrink the directories  
immediately. Is it more difficult to implement for ext* or has no one  
else found this to be a problem?
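
The workaround mentioned above (move the files to a temporary directory,
rmdir() the old one, rename() the new one into place) can be sketched in
Python as follows. This is an illustration under assumptions:
rebuild_directory and the ".rebuild" suffix are hypothetical names,
nothing else may be touching the directory while it runs, and ownership
and extended attributes of the directory are not carried over.

```python
import os


def rebuild_directory(path):
    """Replace `path` with a freshly created directory holding the same
    entries, so that the new directory is only as large as it needs to be.

    Not atomic: between rmdir() and the final rename() the name `path`
    does not exist, so a concurrent reader can fail.
    """
    path = os.path.abspath(path)
    tmp = path + ".rebuild"  # hypothetical scratch name in the same parent
    os.mkdir(tmp, os.stat(path).st_mode & 0o7777)
    for name in os.listdir(path):
        # rename() within one filesystem relinks the entry; no data is copied.
        os.rename(os.path.join(path, name), os.path.join(tmp, name))
    os.rmdir(path)        # the old, oversized directory is now empty
    os.rename(tmp, path)  # the new directory takes over the old name
```

The window between the last two calls is why this only works with the
directory's users stopped.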


* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15  0:45   ` Timo Sirainen
@ 2009-05-15 10:58     ` Theodore Tso
  2009-05-15 17:29       ` Timo Sirainen
  2009-05-16  9:42       ` david
  2009-05-17 21:33     ` Theodore Tso
  1 sibling, 2 replies; 10+ messages in thread
From: Theodore Tso @ 2009-05-15 10:58 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Josef Bacik, linux-kernel

On Thu, May 14, 2009 at 08:45:38PM -0400, Timo Sirainen wrote:
>
> I was rather thinking something that I could run while the system was  
> fully operational. Otherwise just moving the files to a temp directory + 
> rmdir() + rename() would have been fine too.
>
> I just tested that xfs, jfs and reiserfs all shrink the directories  
> immediately. Is it more difficult to implement for ext* or has no one  
> else found this to be a problem?

It's probably fairest to say no one has thought it worth the effort.
It would require some fancy games to swap out block locations in the
extent trees (life would be easier with non-extent-using inodes), and
in the case of htree, we would have to keep track of the index block
so we could remove it from the htree index.  So it's all doable, if a
bit tricky in terms of the technical details; it's just that the
people who could do it have been busy enough with other things.

It hasn't been considered high priority because most of the time
directories don't go from holding thousands of files down to a small
handful.  

						- Ted

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15 10:58     ` Theodore Tso
@ 2009-05-15 17:29       ` Timo Sirainen
  2009-05-15 18:25         ` Theodore Tso
  2009-05-16  9:42       ` david
  1 sibling, 1 reply; 10+ messages in thread
From: Timo Sirainen @ 2009-05-15 17:29 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Josef Bacik, linux-kernel

On Fri, 2009-05-15 at 06:58 -0400, Theodore Tso wrote:
> > I was rather thinking something that I could run while the system was  
> > fully operational. Otherwise just moving the files to a temp directory + 
> > rmdir() + rename() would have been fine too.
> >
> > I just tested that xfs, jfs and reiserfs all shrink the directories  
> > immediately. Is it more difficult to implement for ext* or has no one  
> > else found this to be a problem?
> 
> It's probably fairest to say no one has thought it worth the effort.

My problem is with mail servers and Maildir format where it's possible
that a user has tons of emails and wants to delete them. The mailbox
may slowly grow back to its huge size, but in the meantime it's
slower than necessary.

I can't really fix those directories while the system is running because
mail reading doesn't use any locking (and adding locking would be
unnecessary overhead). Writing does use locking though, so I could
create a new duplicate directory and switch it with the original
directory. But I suppose there's no way to atomically replace (or swap)
a non-empty directory with another?

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15 17:29       ` Timo Sirainen
@ 2009-05-15 18:25         ` Theodore Tso
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Tso @ 2009-05-15 18:25 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Josef Bacik, linux-kernel

On Fri, May 15, 2009 at 01:29:04PM -0400, Timo Sirainen wrote:
> On Fri, 2009-05-15 at 06:58 -0400, Theodore Tso wrote:
> > > I was rather thinking something that I could run while the system was  
> > > fully operational. Otherwise just moving the files to a temp directory + 
> > > rmdir() + rename() would have been fine too.
> > >
> > > I just tested that xfs, jfs and reiserfs all shrink the directories  
> > > immediately. Is it more difficult to implement for ext* or has no one  
> > > else found this to be a problem?
> > 
> > It's probably fairest to say no one has thought it worth the effort.
> 
> My problem is with mail servers and Maildir format where it's possible
> that a user has tons of emails and wants to delete them. The mailbox
> may slowly grow back to its huge size, but in the meantime it's
> slower than necessary.

The problem is that unless the user is deleting a *huge* number of
files, it's rare that a directory entry block goes completely empty.
If you shrink from 15,000 messages to 12,000 messages, say, the leaf
blocks in the btree generally still contain some directory entries,
because our data structure is a hashed b-tree: entries are placed by
hash value, so deletions thin out many blocks rather than emptying one.
So to fix this we need to actually coalesce directory leaf blocks on
the fly, on top of everything else that I had mentioned.  It's
certainly doable, but again, someone would have to submit a patch.  We
might get around to it one of these days, but the plates of those of us
who are doing ext4 are pretty full with higher priority items at
present.

There is an off-line fix that works quite well -- e2fsck -fD, but
obviously that requires scheduling downtime.

How big of a deal is this for you?  I use a local maildir myself, and
they can get quite large:

% ls /home/tytso/isync/mit
total 2132
1412 cur/   716 new/     4 tmp/

But once they are in cache, it's no longer a major problem.  I suppose
on a mail server where you have a very large number of users, caching
2 megs of directory data per user could get ugly; and it does take
time the first time you pull their directory entry into the cache.
What sort of performance degradation are you measuring, and what are
the impacts operationally at the moment for you?  Is this just a
theoretical concern, or are you measuring a significant slowdown as a result?

	    	     	    		    - Ted

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15 10:58     ` Theodore Tso
  2009-05-15 17:29       ` Timo Sirainen
@ 2009-05-16  9:42       ` david
  1 sibling, 0 replies; 10+ messages in thread
From: david @ 2009-05-16  9:42 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Timo Sirainen, Josef Bacik, linux-kernel

On Fri, 15 May 2009, Theodore Tso wrote:

> On Thu, May 14, 2009 at 08:45:38PM -0400, Timo Sirainen wrote:
>>
>> I was rather thinking something that I could run while the system was
>> fully operational. Otherwise just moving the files to a temp directory +
>> rmdir() + rename() would have been fine too.
>>
>> I just tested that xfs, jfs and reiserfs all shrink the directories
>> immediately. Is it more difficult to implement for ext* or has no one
>> else found this to be a problem?
>
> It's probably fairest to say no one has thought it worth the effort.
> It would require some fancy games to swap out block locations in the
> extent trees (life would be easier with non-extent-using inodes), and
> in the case of htree, we would have to keep track of the index block
> so we could remove it from the htree index.  So it's all doable, if a
> bit tricky in terms of the technical details; it's just that the
> people who could do it have been busy enough with other things.
>
> It hasn't been considered high priority because most of the time
> directories don't go from holding thousands of files down to a small
> handful.

I see it on a fairly regular basis on mail servers. When sending, a large
queue builds up due to a remote system being down, but then after the
remote system recovers, all access to that directory is slow, hurting
everything else on the system.

David Lang

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-15  0:45   ` Timo Sirainen
  2009-05-15 10:58     ` Theodore Tso
@ 2009-05-17 21:33     ` Theodore Tso
  2009-05-18  2:49       ` david
  1 sibling, 1 reply; 10+ messages in thread
From: Theodore Tso @ 2009-05-17 21:33 UTC (permalink / raw)
  To: Timo Sirainen; +Cc: Josef Bacik, linux-kernel, david, linux-ext4

On Thu, May 14, 2009 at 08:45:38PM -0400, Timo Sirainen wrote:
> I was rather thinking something that I could run while the system was  
> fully operational. Otherwise just moving the files to a temp directory + 
> rmdir() + rename() would have been fine too.
>
> I just tested that xfs, jfs and reiserfs all shrink the directories  
> immediately. Is it more difficult to implement for ext* or has no one  
> else found this to be a problem?

I've sketched out a design that shouldn't be too hard to implement
that will address the problem which you've raised.  I'm not sure when
I will have time to implement it, so in case there's an ext4 developer who
has time, I thought I would throw it out there.  For folks who are
looking for something simple to get started, perhaps after submitting
a few bug fixes or cleanups, this should be a fairly straightforward
project.

The constraints we have are these: for backwards compatibility's
sake, we can't support spares directories.  So if a block of
the directory becomes empty, we can't just unallocate it unless it
is at the very end of the directory.  In addition, if htree support is
enabled, we also need to make sure the hash tree index is updated to
remove the reference to the block we are about to remove.  Finally, if
journalling is enabled, we need to know in advance how many blocks the
unlink() operations will need to touch.

So the basic design is as follows.  We add a new parameter to
ext4_delete_entry(), which is a pointer to a new data structure,
ext4_dir_cleanup.  This gets filled in with information about the
directory block containing the directory entry which was removed:
directory inode, logical and physical block number, the directory
index blocks if present, etc.  Then the callers of ext4_delete_entry()
(ext4_rmdir, ext4_rename, and ext4_unlink) take that information and
pass it to another function which tries to shrink the directory ---
but this function gets called *after* the call to ext4_journal_stop().
That way we don't have to change any of the journal accounting credits
and the ext4_shrink_directory() function does purely optional work.

At least initially, the ext4_shrink_directory() might only do
something useful if the last directory block in the directory is
empty, and htree is not enabled; in that case, it can just simply
truncate the last block, and return.   

The next step would be to teach ext4_shrink_directory() how to handle
removing the last directory block for htree directories; this means
that it will need to find the entry in the htree index block, and
remove the entry in the htree index.

Next, to handle the case where the empty directory block is *not* the
last block in the directory, what ext4_shrink_directory() can do is to
take the contents of the last directory block, and copy it to the
empty directory block, and then do the truncate operation.  In the
case of htree directories, the htree index blocks would also have to
be updated (both removing the index entry pointing to the empty
directory block, as well as updating the index entry which had been
pointing to the last directory block).

Finally, ext4_shrink_directory() could be taught how to take an
*almost* empty directory block, and attempt to move the directory
entries to the previous and/or next directory block.

The basic idea is that ext4_shrink_directory() could be implemented
and tested incrementally, becoming more aggressive at each stage
about being able to shrink directories.

Anyway, if there's someone interested in trying to implement this,
give me a holler; I'd be happy to give more details as necessary.

     	  	      	       	  	- Ted
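
The incremental stages described above can be modelled in a few lines of
Python. This is a toy model under stated assumptions, not kernel code: a
directory is a list of blocks, each block a list of names;
shrink_directory here is only a stand-in for the proposed
ext4_shrink_directory(); the htree index update, journalling, and the
"almost empty" merge stage are left out.

```python
def shrink_directory(blocks, emptied):
    """Toy model of the proposed shrink step.

    `blocks` is a list of lists of entry names, one inner list per
    directory block. `emptied` is the index of the block from which an
    unlink has just removed an entry. Returns the number of blocks freed.
    """
    if blocks[emptied]:
        return 0  # the block still holds entries; nothing to do here

    last = len(blocks) - 1
    if emptied != last:
        # Hole in the middle: the directory may not be sparse, so fill the
        # hole with the contents of the last block instead of freeing it.
        blocks[emptied] = blocks[last]
        blocks[last] = []

    # Truncate empty blocks off the end, always keeping the first block.
    freed = 0
    while len(blocks) > 1 and not blocks[-1]:
        blocks.pop()
        freed += 1
    return freed
```

For example, with blocks [["a"], [], ["b", "c"]] and emptied=1, the model
moves ["b", "c"] into block 1 and frees one block.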

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-17 21:33     ` Theodore Tso
@ 2009-05-18  2:49       ` david
  2009-05-18  3:21         ` Theodore Tso
  0 siblings, 1 reply; 10+ messages in thread
From: david @ 2009-05-18  2:49 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Timo Sirainen, Josef Bacik, linux-kernel, linux-ext4

On Sun, 17 May 2009, Theodore Tso wrote:

> On Thu, May 14, 2009 at 08:45:38PM -0400, Timo Sirainen wrote:
>> I was rather thinking something that I could run while the system was
>> fully operational. Otherwise just moving the files to a temp directory +
>> rmdir() + rename() would have been fine too.
>>
>> I just tested that xfs, jfs and reiserfs all shrink the directories
>> immediately. Is it more difficult to implement for ext* or has no one
>> else found this to be a problem?
>
> I've sketched out a design that shouldn't be too hard to implement
> that will address the problem which you've raised.  I'm not sure when
> I will have time to implement it, so in case there's an ext4 developer who
> has time, I thought I would throw it out there.  For folks who are
> looking for something simple to get started, perhaps after submitting
> a few bug fixes or cleanups, this should be a fairly straightforward
> project.
>
> The constraints we have are these: for backwards compatibility's
> sake, we can't support spares directories.  So if a block of

s/spares/sparse/ ?

> the directory becomes empty, we can't just unallocate it unless it
> is at the very end of the directory.  In addition, if htree support is
> enabled, we also need to make sure the hash tree index is updated to
> remove the reference to the block we are about to remove.  Finally, if
> journalling is enabled, we need to know in advance how many blocks the
> unlink() operations will need to touch.
>
> So the basic design is as follows.  We add a new parameter to
> ext4_delete_entry(), which is a pointer to a new data structure,
> ext4_dir_cleanup.  This gets filled in with information about the
> directory block containing the directory entry which was removed:
> directory inode, logical and physical block number, the directory
> index blocks if present, etc.  Then the callers of ext4_delete_entry()
> (ext4_rmdir, ext4_rename, and ext4_unlink) take that information and
> pass it to another function which tries to shrink the directory ---
> but this function gets called *after* the call to ext4_journal_stop().
> That way we don't have to change any of the journal accounting credits
> and the ext4_shrink_directory() function does purely optional work.
>
> At least initially, the ext4_shrink_directory() might only do
> something useful if the last directory block in the directory is
> empty, and htree is not enabled; in that case, it can just simply
> truncate the last block, and return.
>
> The next step would be to teach ext4_shrink_directory() how to handle
> removing the last directory block for htree directories; this means
> that it will need to find the entry in the htree index block, and
> remove the entry in the htree index.
>
> Next, to handle the case where the empty directory block is *not* the
> last block in the directory, what ext4_shrink_directory() can do is to
> take the contents of the last directory block, and copy it to the
> empty directory block, and then do the truncate operation.  In the
> case of htree directories, the htree index blocks would also have to
> be updated (both removing the index entry pointing to the empty
> directory block, as well as updating the index entry which had been
> pointing to the last directory block).

I think this is more complex. I think you can't just move the last
directory block to one earlier because that would change the order of
things in the directory, messing up things that do a partial readdir of
the directory and then come back to pick up where they left off. You would
need to move all blocks after the empty one up by one.

Another thing: you don't necessarily want to do this movement immediately
when a directory block becomes empty. It's very possible that the user is
deleting a lot of things from the directory, and so may delete enough
stuff so that all (or almost all) of the directory blocks could be deleted
through the 'last block in the directory' method. It may be that the best
thing to do at this point is to wait for instructions from the user to do
this (more below).

> Finally, ext4_shrink_directory() could be taught how to take an
> *almost* empty directory block, and attempt to move the directory
> entries to the previous and/or next directory block.

This sounds like something that's best implemented as a nightly cron job
(run similar to updatedb) to defrag the directory blocks. Given changes
over the years to disk technology (both how much slower seeks have become
relative to sequential reads on rotating media, and how SSDs really have
much larger block sizes internally than what's exposed to users), it may
make sense to consider having a defrag tool for the data blocks as well.

David Lang

> The basic idea is that ext4_shrink_directory() could be implemented
> and tested incrementally, becoming more aggressive at each stage
> about being able to shrink directories.
>
> Anyway, if there's someone interested in trying to implement this,
> give me a holler; I'd be happy to give more details as necessary.
>
>     	  	      	       	  	- Ted
>

* Re: ext3/ext4 directories don't shrink after deleting lots of files
  2009-05-18  2:49       ` david
@ 2009-05-18  3:21         ` Theodore Tso
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Tso @ 2009-05-18  3:21 UTC (permalink / raw)
  To: david; +Cc: Timo Sirainen, Josef Bacik, linux-kernel, linux-ext4

On Sun, May 17, 2009 at 07:49:09PM -0700, david@lang.hm wrote:
>> The constraints we have are these: for backwards compatibility's
>> sake, we can't support spares directories.  So if a block of
>
> s/spares/sparse/ ?

Yes, sorry "sparse"

>> Next, to handle the case where the empty directory block is *not* the
>> last block in the directory, what ext4_shrink_directory() can do is to
>> take the contents of the last directory block, and copy it to the
>> empty directory block, and then do the truncate operation.  In the
>> case of htree directories, the htree index blocks would also have to
>> be updated (both removing the index entry pointing to the empty
>> directory block, as well as updating the index entry which had been
>> pointing to the last directory block).
>
> I think this is more complex. I think you can't just move the last  
> directory block to one earlier because that would change the order of  
> things in the directory, messing up things that do a partial readdir of  
> the directory and then come back to pick up where they left off. You
> would need to move all blocks after the empty one up by one.

For htree directories we can do this, because we iterate over the
directory in hash sort order, and moving the directory blocks around
doesn't change this.  For non-htree directories, you're right;
ext4_shrink_directory() would have to bail and not do anything if
there was a readdir() in progress for the directory in question.

> This sounds like something that's best implemented as a nightly cron job
> (run similar to updatedb) to defrag the directory blocks. Given changes
> over the years to disk technology (both how much slower seeks have become 
> relative to sequential reads on rotating media, and how SSDs really have  
> much larger block sizes internally than what's exposed to users), it may  
> make sense to consider having a defrag tool for the data blocks as well.

Yes, that's the other way to do this; we could have an ioctl which
defrags a directory, and which will return an error if there is
another fd open on the directory (which would imply that there was a
readdir() in progress) and then do a complete defrag operation on the
directory.  It would have to be done in kernel space so the filesystem
wouldn't have to be unmounted.  Doing this all at once is more
efficient from an I/O perspective, but it's trickier to do in the
kernel because for very large directories, the method used in
e2fsck/rehash.c assumes you can allocate enough memory for all of the
directory entries all at once, which might not be true in the kernel,
since kernel memory can't be swapped or paged out.

Doing a little bit at a time means that we're O(1) in time/space for
each unlink operation.  Doing it all at once is O(n) in space, and for
very, *very* large directories that could be problematic.  It's not
impossible, but try sketching out the algorithm first.  You may find
it's more complicated than you first thought.

Regards,

					- Ted
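
The readdir ordering point discussed above can be illustrated with a
small Python model. It is a sketch under assumptions, not a description
of the ext4 code: linear_resume, hash_resume and name_hash are
hypothetical helpers, and crc32 merely stands in for the filesystem's own
directory hash.

```python
import zlib


def name_hash(name):
    # Stand-in hash; ext4 uses its own directory hash functions.
    return zlib.crc32(name.encode("utf-8"))


def linear_resume(blocks, next_block):
    """Names a linear (non-htree) scan returns when it resumes at block
    number `next_block`: its cursor is a position in the directory."""
    return [name for block in blocks[next_block:] for name in block]


def hash_resume(blocks, last_hash):
    """Names a hash-ordered scan returns when it resumes after hash value
    `last_hash`: its cursor does not depend on where blocks are placed."""
    names = [name for block in blocks for name in block]
    return sorted((n for n in names if name_hash(n) > last_hash),
                  key=name_hash)


before = [["a", "b"], [], ["c", "d"]]  # block 1 has just become empty
after = [["a", "b"], ["c", "d"]]       # last block moved into the hole

# A linear reader that had already finished blocks 0 and 1 never sees
# the entries that were moved behind its cursor.
missed = set(linear_resume(before, 2)) - set(linear_resume(after, 2))
```

In this model the linear reader misses the moved entries, while
hash_resume() returns the same names before and after the move for any
cursor value.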
