* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
From: Theodore Tso @ 2006-12-29 23:32 UTC
  To: Linus Torvalds
  Cc: Andrew Morton, Segher Boessenkool, David Miller, nickpiggin,
	kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma,
	gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa,
	linux-ext4

On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in 
> places where it really really shouldn't, and as a result, things like 
> directory accesses are simply slower than they should be. Sadly, I don't 
> think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well.  Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime.  Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
grief this is causing the page I/O and VM routines.  OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
make it avoid buffer heads, especially in order to support
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds.  So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

						- Ted


* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
From: Linus Torvalds @ 2006-12-29 23:59 UTC
  To: Theodore Tso
  Cc: Andrew Morton, Segher Boessenkool, David Miller, nickpiggin,
	kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma,
	gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa,
	linux-ext4



On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_ 
later), but the real problem with it isn't actually the VM behaviour, but 
simply the fact that cached performance absolutely _sucks_ with the buffer 
cache.

With the physically indexed buffer cache thing, you end up always having 
to do these complicated translations into block numbers for every single 
access, and at some point when I benchmarked it, it was a huge overhead 
for doing simple things like readdir.
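
To make the contrast concrete, here is a minimal sketch of the two
lookup shapes. The two wrapper functions are invented for
illustration; ext3_bread() and read_mapping_page() are the real
2.6-era helpers:

    #include <linux/fs.h>
    #include <linux/pagemap.h>
    #include <linux/ext3_fs.h>

    /* ext3-style, physically indexed: every access first walks the
     * indirect-block tree for a disk block number, and only then can
     * the buffer cache be probed. */
    static struct buffer_head *dir_block_physical(struct inode *dir, int n)
    {
            int err;

            return ext3_bread(NULL /* no handle: read-only */, dir, n, 0, &err);
    }

    /* ext2-style, virtually indexed: the page cache is keyed by
     * (mapping, index), so the cached case is a single radix-tree
     * lookup with no translation at all. */
    static struct page *dir_block_virtual(struct inode *dir, pgoff_t n)
    {
            return read_mapping_page(dir->i_mapping, n, NULL);
    }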

It's also a major pain for read-ahead, partly due to exactly that high 
cost of translation - because you can't cheaply check whether the next 
block is there, the cost of even asking the question "should I try to 
read ahead?" is much, much higher. As a result, read-ahead is seriously 
limited, because it's so expensive for the cached case (which is still 
hopefully the _common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.
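
A hedged sketch of why the probe costs differ so much. The two
*_cached() helpers are invented for the example; find_get_page(),
bmap() and __find_get_block() are real interfaces of this era:

    #include <linux/pagemap.h>
    #include <linux/buffer_head.h>

    static int next_page_cached(struct address_space *mapping, pgoff_t index)
    {
            struct page *page = find_get_page(mapping, index + 1);  /* one probe */

            if (page)
                    page_cache_release(page);
            return page != NULL;
    }

    static int next_block_cached(struct inode *inode, sector_t lblock)
    {
            /* fs-specific logical->physical walk first; this can itself
             * have to read indirect blocks from disk */
            sector_t phys = bmap(inode, lblock + 1);
            struct buffer_head *bh;

            if (!phys)
                    return 0;
            bh = __find_get_block(inode->i_sb->s_bdev, phys,
                                  inode->i_sb->s_blocksize);
            if (bh)
                    __brelse(bh);
            return bh != NULL;
    }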

It was somewhat fixed in a really god-awful fashion by having 
ext3_readdir() actually do _readahead_ through the page cache, even though 
it does everything else through the buffer cache. And that just happens to 
work because we hopefully have physically contiguous blocks, but when that 
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any 
problems for the VM, since directories cannot be mmap'ed anyway. But it's 
really pitiful, and it really doesn't work very well. Of course, other 
filesystems _also_ suck at this, and other operating systems have even 
MORE problems, so people don't always seem to realize how horribly 
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache 
directories well. Open some horrible file manager on /usr/bin with cold 
caches, and weep. The biggest problem is the inode indirection, but at 
some point when I looked at why it sucked, it was doing basically 
synchronous single-buffer reads on the directory too, because readahead 
didn't work properly.

I was hoping that something like SpadFS would actually take off, because 
it seemed to make a lot of good design choices (having inodes in-line in 
the directory for the case where there are no hardlinks is probably a 
requirement for a good filesystem these days; the separate inode table 
had its uses, but indirection in a filesystem really does suck, and stat 
information is too important to be indirect unless it absolutely has to 
be).
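
As a purely illustrative sketch - this is not SpadFS's actual on-disk
format - the inline-inode idea amounts to a directory entry that
carries its stat data with it, so readdir+stat never has to chase a
separate inode table:

    #include <linux/types.h>

    struct inline_dirent {
            __u64   size;
            __u32   mode, uid, gid;
            __u32   atime, mtime, ctime;
            __u32   nlink;          /* >1: punt to an external inode */
            __u8    name_len;
            char    name[];         /* stat data and name in one read */
    };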

But I suspect it needs more than somebody who just wants to get his thesis 
written ;)

		Linus


* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
From: Andrew Morton @ 2006-12-30  0:05 UTC
  To: Theodore Tso
  Cc: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin,
	kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma,
	gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa,
	linux-ext4

On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <tytso@mit.edu> wrote:

> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in 
> > places where it really really shouldn't, and as a result, things like 
> > directory accesses are simply slower than they should be. Sadly, I don't 
> > think ext4 is going to fix any of this, either.
> 
> Not just ext3; ocfs2 is using the jbd layer as well.  I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well.  Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime.  Unfortunately, I don't think this is going to
> be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache.  The whole thing is fundamentally
block-based.  But that only matters for metadata - there's no strong
reason why ext3/4 needs to manipulate file data via buffer_heads if
data=journal and chattr +j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up.  The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.
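
Roughly, and only as a sketch against the block APIs of the time
(page_to_bio() is invented; error and hole handling are elided), the
mpage idea looks like this - the fallback paths are exactly where the
"block" bit in the middle refuses to disappear:

    #include <linux/bio.h>
    #include <linux/buffer_head.h>
    #include <linux/pagemap.h>

    static struct bio *page_to_bio(struct inode *inode, struct page *page,
                                   get_block_t *get_block)
    {
            unsigned blkbits = inode->i_blkbits;
            unsigned blocks = PAGE_CACHE_SIZE >> blkbits;
            sector_t iblock = (sector_t)page->index <<
                                            (PAGE_CACHE_SHIFT - blkbits);
            sector_t first = 0;
            struct buffer_head bh;
            struct bio *bio;
            unsigned i;

            /* translate every block; any error or discontiguity means
             * falling back to per-buffer_head I/O */
            for (i = 0; i < blocks; i++) {
                    bh.b_state = 0;
                    bh.b_size = 1 << blkbits;
                    if (get_block(inode, iblock + i, &bh, 0))
                            return NULL;
                    if (i == 0)
                            first = bh.b_blocknr;
                    else if (bh.b_blocknr != first + i)
                            return NULL;
            }

            /* contiguous: one bio covers the whole page */
            bio = bio_alloc(GFP_KERNEL, 1);
            bio->bi_sector = first << (blkbits - 9);
            bio->bi_bdev = inode->i_sb->s_bdev;
            bio_add_page(bio, page, PAGE_CACHE_SIZE, 0);
            return bio;             /* caller does submit_bio(READ, bio) */
    }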

buffer_heads become more important with large PAGE_CACHE_SIZE.  I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages.  We
need that representation of the state (and location) of the block-sized
hunks which make up the page.
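
A minimal sketch of that per-block state, assuming a locked page that
already has buffers (count_dirty_hunks() is invented for the example;
page_buffers() and the b_this_page ring are real):

    #include <linux/buffer_head.h>

    static unsigned count_dirty_hunks(struct page *page)
    {
            struct buffer_head *head = page_buffers(page), *bh = head;
            unsigned n = 0;

            do {
                    if (buffer_dirty(bh))
                            n++;    /* only this hunk needs writeback */
                    bh = bh->b_this_page;
            } while (bh != head);
            return n;
    }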

> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines.  OTOH, reiser3
> probably has the same problems, and I suspect the changes to ext3 to
> make it avoid buffer heads, especially in order to support
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds.  So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.

We'll see.


* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)
From: Linus Torvalds @ 2006-12-30  0:50 UTC
  To: Andrew Morton
  Cc: Theodore Tso, Segher Boessenkool, David Miller, nickpiggin,
	kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma,
	gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa,
	linux-ext4



On Fri, 29 Dec 2006, Andrew Morton wrote:
> 
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up.  The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.

Using the buffer cache as a translation layer to the physical address is 
fine. That's what _any_ block device will do.

I'm not at all saying that "buffer heads must go away". They work fine.

What I'm saying is that

 - if you index by buffer heads, you're screwed.
 - if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache 
layer. Then, when you actually need to do IO, you look at the buffers. But 
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own 
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the 
"virtual->physical" translation. It's what they were designed for. But 
they were _originally_ also designed for "lookup" and "driving the start 
of IO", and that is wrong, and has been wrong for a long time now, because

 - lookup based on physical address is fundamentally slow and inefficient. 
   You have to look up the virtual->physical translation somewhere else, 
   so it's by design an unnecessary indirection _and_ that "somewhere 
   else" is also by definition filesystem-specific, so you can't do any 
   of these things at the VFS layer.

   Ergo: anything that needs to look up the physical address in order to 
   find the buffer head is BROKEN in this day and age. We look up the 
   _virtual_ page cache page, and then we can trivially find the buffer 
   heads within that page thanks to page->buffers.

   Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. 

 - starting IO based on the physical entity is insane. It's insane exactly 
   _because_ the VM doesn't actually think in physical addresses, or in 
   buffer-sized blocks. The VM only really knows about whole pages, and 
   all the VM decisions fundamentally have to be page-based. We don't ever 
   "free a buffer". We free a whole page, and as such, doing writeback 
   based on buffers is pointless, because it doesn't actually say anything 
   about the "page state" which is what the VM tracks.

But neither of these means that "buffer_head" itself has to go away. They 
both really boil down to the same thing: you should never KEY things by 
the buffer head. All actions should be based on virtual indexes as far as 
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page, 
it's then easy to look up the actual buffer head within the page, and use 
that as a way to do the actual _IO_ on the physical address. So the buffer 
heads still exist in ext2, for example, but they don't drive the show 
quite as much.
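
As a hedged sketch of that rule (writeback_virtual_page() is invented;
find_lock_page() and page_buffers() are real): key everything by
(mapping, index), and only drop to the buffers for physical addresses
once the page-level decision to do I/O has been made.

    #include <linux/pagemap.h>
    #include <linux/buffer_head.h>

    static void writeback_virtual_page(struct address_space *mapping,
                                       pgoff_t index)
    {
            struct page *page = find_lock_page(mapping, index); /* virtual key */
            struct buffer_head *head, *bh;

            if (!page)
                    return;
            if (page_has_buffers(page)) {
                    head = bh = page_buffers(page);
                    do {
                            /* bh->b_blocknr is consulted for the physical
                             * address here - it is never the lookup key */
                            bh = bh->b_this_page;
                    } while (bh != head);
            }
            unlock_page(page);
            page_cache_release(page);
    }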

(They still do in some areas: the allocation bitmaps, the xattr code etc. 
But as long as none of those have big VM footprints, and as long as no 
_common_ operations really care deeply, and as long as those data 
structures never need to be touched by the VM or VFS layer, nobody will 
ever really care).

The directory case comes up just because "readdir()" actually is very 
common, and sometimes very slow. And it can have a big VM working set 
footprint ("find"), so trying to be page-based actually really helps, 
because it all drives things like writeback on the _right_ issues, and we 
can do things like LRU's and writeback decisions on the level that really 
matters.

I actually suspect that the inode tables could benefit from being in the 
page cache too (although I think that the inode buffer address is actually 
"physical", so there's no indirection for inode tables, which means that 
the virtual vs physical addressing doesn't matter). For directories, there 
definitely is a big cost to continually doing the virtual->physical 
translation all the time.

		Linus
