linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Lukáš Czerner" <lczerner@redhat.com>
To: Radek Pazdera <rpazdera@redhat.com>
Cc: linux-ext4@vger.kernel.org, kasparek@fit.vutbr.cz
Subject: Re: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index
Date: Thu, 27 Jun 2013 11:24:19 +0200 (CEST)	[thread overview]
Message-ID: <alpine.LFD.2.00.1306271123000.1781@localhost.localdomain> (raw)
In-Reply-To: <1367702922-3236-1-git-send-email-rpazdera@redhat.com>

On Sat, 4 May 2013, Radek Pazdera wrote:

> Date: Sat,  4 May 2013 23:28:33 +0200
> From: Radek Pazdera <rpazdera@redhat.com>
> To: linux-ext4@vger.kernel.org
> Cc: lczerner@redhat.com, kasparek@fit.vutbr.cz,
>     Radek Pazdera <rpazdera@redhat.com>
> Subject: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index

Hi Radek,

patches do not apply cleanly any more on the ext4 dev branch. Can
you rebase and resend the whole patch set ?

Thanks!
-Lukas

> 
> Hello everyone,
> 
> I am an university student from Brno /CZE/. I decided to try to optimise
> the readdir/stat scenario in ext4 as the final project to school. I
> posted some test results I got few months ago [1].
> 
> I tried to implement an additional tree for ext4's directory index
> that would be sorted by inode numbers. The tree then would be used
> by ext4_readdir() which should lead to substantial increase of
> performance of operations that manipulate a whole directory at once.
> 
> The performance increase should be visible especially with large
> directories or in case of low memory or cache pressure.
> 
> This patch series is what I've got so far. I must say, I originally
> thought it would be *much* simpler :).
> 
> TLDR
> ====
> 
> The series contains the implementation of the new tree and several
> rather small changes to the original directory index. I also added
> a new implementation of ext4_readdir() that uses this new tree
> instead of the original one.
> 
> It applies on 3.9, it is basically working, however, it is highly
> experimental at the moment. It doesn't survive XFS tests yet (I
> still need to work on that).
> 
> DESIGN
> ======
> 
> The tree comes with a new read-only compatible file system feature
> called "itree". It is created in a directory at the same time as the
> original dx tree -- when the directory file exceeds a signle block of
> size. It is indicated by an inode flag.
> 
> The tree resides completely outside of the directory file. I am using
> 64bit block numbers (as suggested by Ted Ts'o), and the pointer to its
> root is stored in the end of the dx_root block.
> 
> I tried to keep the structure of the tree as close as possible to the
> original dx tree. It is a B+ tree with a 64bit long compound key. The
> key consists of a inode number and a hash.
> 
> The hash is there because of hard links. On ext4 there can be as much
> as 65 000 names for a single file which is, from the point of the inode
> tree, a collision. It is a very unlikely scenario to crate over 60 000
> names for a file from a single directory. But it would be a very serious
> problem for readdir() if someone tried to do that, so I decided to add
> the hash to the key as well. This reduces the problems of collisions to
> the same as of the hash tree.
> 
> The depth of the tree is limited by a constant in the code to 3 levels,
> but it can be increased anytime.
> 
> I decided to keep the directory entries within the leaf nodes sorted,
> which might have been a bad idea (and it brought me quite a few
> sleepless nights of pointer debugging). However, it is more convenient
> for readdir and splits. And in the case of the inode tree, there
> shouldn't be that much memmoving, because inodes are allocated linearly,
> therefore we're adding to the end most of the time anyway.
> 
> I also had to implement coalesce-on-delete. Unlike the hash values,
> the inode numbers are not random, so the tree would get fragmented
> really easily. For example when a range of inodes would be freed
> and allocated somewhere else, that part of the tree would stay empty
> forever.
> 
> In this implementation I used 8 bits for "node fullness". Neighbour
> nodes in the tree are merged as soon as their total fullness is smaller
> than maximum fullness. This way, coalescing happens too often. I plan
> to reduce it to only 2 bits (25% full, 50% full, 75% full, 100% full).
> 
> There are also some optimisations to increase the fullness of blocks
> within the tree, because if the file system adds a contiguous sequence
> of inodes to a directory and we split the nodes in half, there will be
> some tree nodes that will never get another entry and still be only 50%
> full. In the current implementation, an append is performed instead of
> a split in case the new entry should be added to the end of the full
> block.
> 
> BENCHMARKS
> ==========
> 
> I did some benchmarks and compared the performance with ext4/htree,
> XFS, and btrfs up to 5 000 000 of files in a single directory. Not
> all of them are done though (they run for days).
> 
> Probably the biggest improvement can be observed with copying files.
> I used two physical SATA disks for the test. For 5M files, the itree
> implementation is 14x faster. With metadata only, ext4 is even a tiny
> bit faster than xfs:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/cp.png
> 
> With each files 4kB big, the inode tree is 11x faster:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/cp.png
> 
> On the other hand, probably the biggest drawback of this implementation
> is that it slows down file creates, as we now have to insert the entry
> into both trees. The difference gets bigger as both trees grow (and their
> blocks get further apart). For 5M directory entries, the creation is
> roughly 40% slower. Hopefully, I'll be able to optimise it a bit in the
> future:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/create.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/create.png
> 
> Deleting entries should also very much benefit from this. However, the
> increase of performance in my test is far lower than expected. I think
> that is due to my implementation of coalesce-on-delete, which happens
> too often and during these massive deletes, it coallesces all the time.
> I hope that when I fix that, it will get somewhere close to XFS as well:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/delete.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/delete.png
> 
> All of the results have a graph and a table with precise values.
> You can always get the table by replacing the suffix of the graph
> *.png to *.results in the end of the link.
> 
> Full results are available here:
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/
> 
> 
> I also did some tests on an aged file system (I used the simple 0.8
> chance to create, 0.2 to delete a file) where the results of ext4
> with itree are much better even than xfs, which gets fragmented:
> 
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/cp.png
>     http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/readdir-stat.png
> 
> 
> There are still some things to be done, the checksums are not yet
> finished and I certainly need to do a bit of cleaning up at some
> places. But as far as features go, it all should be there already.
> 
> I'm working on this along with Lukas Czerner, who acts as my mentor
> and helps me with different things (big thanks!).
> 
> Any feedback/ideas are greatly appeciated :)!
> 
> Cheers,
> Radek
> 
> 
> Radek Pazdera (9):
>   ext4: Adding itree feature and inode flags
>   ext4: Allow sorting dx_map by inode as well
>   ext4: Adding a link to itree to the dx_root struct
>   ext4: Adding itree structures
>   ext4: Adding itree implementation I - Core
>   ext4: Adding itree implementation II - Inserting
>   ext4: Adding itree implementation III - Deleting
>   ext4: Make directory operations use itree
>   ext4: Make ext4_readdir() use itree if available
> 
>  fs/ext4/dir.c   |  177 +++-
>  fs/ext4/ext4.h  |   21 +-
>  fs/ext4/namei.c | 2442 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 2565 insertions(+), 75 deletions(-)
> 
> 

  parent reply	other threads:[~2013-06-27  9:24 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-04 21:28 [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index Radek Pazdera
2013-05-04 21:28 ` [RFC 1/9] ext4: Adding itree feature and inode flags Radek Pazdera
2013-05-04 21:28 ` [RFC 2/9] ext4: Allow sorting dx_map by inode as well Radek Pazdera
2013-05-04 21:28 ` [RFC 3/9] ext4: Adding a link to itree to the dx_root struct Radek Pazdera
2013-05-04 21:28 ` [RFC 4/9] ext4: Adding itree structures Radek Pazdera
2013-05-04 21:28 ` [RFC 5/9] ext4: Adding itree implementation I - Core Radek Pazdera
2013-05-04 21:28 ` [RFC 6/9] ext4: Adding itree implementation II - Inserting Radek Pazdera
2013-05-04 21:28 ` [RFC 7/9] ext4: Adding itree implementation III - Deleting Radek Pazdera
2013-05-04 21:28 ` [RFC 8/9] ext4: Make directory operations use itree Radek Pazdera
2013-05-04 21:28 ` [RFC 9/9] ext4: Make ext4_readdir() use itree if available Radek Pazdera
2013-05-11 13:28 ` [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index Zheng Liu
2013-05-11 21:18   ` Radek Pazdera
2013-06-16  0:55 ` Dave Chinner
2013-06-17  8:58   ` Lukáš Czerner
2013-06-19 12:10     ` Radek Pazdera
2013-06-27  9:24 ` Lukáš Czerner [this message]
2013-07-01 11:40   ` Radek Pazdera
2013-07-01 12:17     ` Lukáš Czerner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.2.00.1306271123000.1781@localhost.localdomain \
    --to=lczerner@redhat.com \
    --cc=kasparek@fit.vutbr.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=rpazdera@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).