public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: "Jörn Engel" <joern@logfs.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Woodhouse <dwmw2@infradead.org>,
	Arnd Bergmann <arnd@arndb.de>,
	linux-kernel@vger.kernel.org
Subject: Re: LogFS merge
Date: Fri, 2 May 2008 23:31:46 +0200	[thread overview]
Message-ID: <20080502213146.GD24080@logfs.org> (raw)
In-Reply-To: <alpine.LFD.1.10.0805021328200.5994@woody.linux-foundation.org>

On Fri, 2 May 2008 13:33:12 -0700, Linus Torvalds wrote:
> On Fri, 2 May 2008, Jörn Engel wrote:
> > 
> > Currently performance sucks badly on block device flashes (usb stick,
> > etc.) when creating/removing/renaming files.  The combination of logfs
> > and the built-in logic can result in 1-2MB of data written to create a
> > single empty file.  Yuck!
> 
> Can you talk about why, and describe these kinds of things? Is it just 
> because of deep directory trees and having to rebuild the tree from the 
> root up, or is it something else going on?

Logfs has the concept of "levels".  Level 0 contains file data.  Level 1
has indirect blocks, level 2 doubly indirect blocks, etc.  Inodes are
stored in an inode file, which is on level 6 for the inodes, 7 for
indirect blocks, etc.

GC requires that data for each level is kept seperate from data for
other levels.  It is the only make deadlocks impossible, any alternative
will just reduce deadlock likelyhood afaiks.  Both regular files and the
inode file can currently go up to 3x indirect, so you have up to 8
levels open for writing at any given time.

Writing data synchronously requires wandering the entire tree, i.e.
writing a block on level 0, then one on level 1, 2 and 3 if indirect
blocks are required, write the inode at level 6 and again writing blocks
on levels 7, 8 and 9 if the inode number is high.  When creating a file,
both the dentry and the created inode are written synchronously.

So on a block device level, all this translates to several writes, none
of them being adjacent.  Each write is fairly small by itself.  But the
FTL inside your favorite type of consumer flash media will turn any
small write into a write of the complete eraseblock.  So somewhere on an
internal bus, megabytes of data are happily shuffled around.


I have a solution for this, but it would require an incompatible change
to the format.  And right now I have fairly good confidence in the
format wrt. ensuring correctness.  So the plan is to merge logfs as-is
(modulo bugfixes, review fallout, etc.) and handle the changes for this
and other performance problems with compat flags.  And someday rename
the whole mess to log2fs and remove some support for old format
variants.

> > Fragmentation is neither actively avoided nor actively enforced.
> 
> I was more thinking about the fragmentation in terms of how much free 
> space you need for reasonable performance behavior - these kinds of things 
> tend to easily start behaving really badly when the disk fills up and you 
> need to GC all the time just to make room for new erase blocks for the 
> trivial inode mtime/atime updates etc.
> 
> Maybe logfs doesn't have that problem for some reason, but in many cases 
> there are rules like "we will consider the filesystem full when it goes 
> over 90% theoretical fill", and it's interesting to know.

For any flash filesystem there is what I call the "evil workload".  Fill
the filesystem 100% and randomly replace data.  In the best case (jffs2)
the filesystem has to GC one segment worth of free space to write one
block, then GC another segment for the next block, etc.  Non-compressed
log-structured filesystems can cheat their way around that by avoiding
GC and doing hole-plugging instead.  Not sure if any actually do that.

Logfs tries hard to behave the same way on any medium - if only to
increase test coverage and fix bugs for everyone when possible.  And it
has to write indirect blocks, which makes the evil workload worse yet.
However, with just 1-2MiB free, it gets back to jffs2 behaviour or at
least close, as it can GC a couple of segments and collect dirty
indirect blocks, combining updates to those.

So in effect, the question is how fill the filesystem is and how random
the write pattern.  50% full and completely random means on average
logfs has to GC one segment to write one and performance is down to 50%.
90% full and completely random means GC 9 to write 1, so we're at 10%.
90% full, 85% of which remain untouched means 5% get written and 10% are
full, so on average GC 1 to write 2.

I cannot give you a hard limit, but the math is simple enough.

> I did try looking at gitweb to see if I could find some documentation 
> file. I didn't find anything.

Noone else has asked for one before, I believe.

Jörn

-- 
Joern's library part 3:
http://inst.eecs.berkeley.edu/~cs152/fa05/handouts/clark-test.pdf

  reply	other threads:[~2008-05-02 21:32 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-05-02 13:32 LogFS merge Jörn Engel
2008-05-02 14:49 ` Pekka Enberg
2008-05-02 15:31   ` Christoph Hellwig
2008-05-02 20:33     ` Jörn Engel
2008-05-02 16:52 ` Arnd Bergmann
2008-05-02 21:47   ` Jörn Engel
2008-05-02 17:17 ` Linus Torvalds
2008-05-02 20:21   ` Jörn Engel
2008-05-02 20:33     ` Linus Torvalds
2008-05-02 21:31       ` Jörn Engel [this message]
2008-05-02 21:39         ` Linus Torvalds
2008-05-02 21:58           ` Jörn Engel
2008-05-03  7:03             ` Willy Tarreau
2008-05-03  9:11               ` Adrian Bunk
2008-05-03  9:18                 ` Willy Tarreau
2008-05-03  9:37                   ` Adrian Bunk
2008-05-03  9:44                 ` Pekka Enberg
2008-05-03 11:06                   ` Adrian Bunk
2008-05-03 17:16               ` Linus Torvalds
2008-05-03 17:59                 ` Offtopic to: " David Collier-Brown
2008-05-02 17:29 ` Andi Kleen
2008-05-02 21:34   ` Jörn Engel
2008-05-05 20:31 ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2008-05-03  9:29 matthieu castet
2008-05-03 12:00 ` Jörn Engel
2008-05-04 16:20   ` Thomas Gleixner
2008-05-04 18:04     ` Josh Boyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080502213146.GD24080@logfs.org \
    --to=joern@logfs.org \
    --cc=akpm@linux-foundation.org \
    --cc=arnd@arndb.de \
    --cc=dwmw2@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sfr@canb.auug.org.au \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox