* Re: [Tux3] Tux3 report: Tux3 Git tree available [not found] ` <200903120315.07610.phillips@phunq.net> @ 2009-03-12 11:03 ` Nick Piggin 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 17:06 ` [Tux3] " Theodore Tso 0 siblings, 2 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 11:03 UTC (permalink / raw) To: Daniel Phillips, linux-fsdevel; +Cc: tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote: > By the way, I just spotted your fsblock effort on LWN, and I should > mention there is a lot of commonality with a side project we have > going in Tux3, called "block handles", which aims to get rid of buffers > entirely, leaving a tiny structure attached to the page->private that > just records the block states. Currently, four bits per block. This > can be done entirely _within_ a filesystem. We are already running > some of the code that has to be in place before switching over to this > model. > > Tux3 block handles (as prototyped, not in the current code base) are > 16 bytes per page, which for 1K block size on a 32 bit arch is a factor > of 14 savings, more on 64 bit arch. More importantly, it saves lots of > individual slab allocations, a feature I gather is part of fsblock as > well. That's interesting. Do you handle 1K block sizes with 64K page size? :) fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah, so it will do a single 80 byte allocation for 4K page 1K block. That's good for cache efficiency. As far as total # slab allocations themselves go, fsblock probably tends to do more of them than buffer.c because it frees them proactively when their refcounts reach 0 (by default, one can switch to a lazy mode like buffer heads). That's one of the most important things, so we don't end up with lots of these things lying around. eg. I could make it 16 bytes I think, but it would be a little harder and would make support for block size > page size much harder so I wouldn't bother. Or share the refcount field for all blocks in a page and just duplicate state etc, but it just makes code larger and slower and harder. > We represent up to 8 block states in one 16 byte object. To make this > work, we don't try to emulate the behavior of the venerable block > library, but first refactor the filesystem data flow so that we are > only using the parts of buffer_head that will be emulated by the block > handle. For example, no caching of physical block address. It keeps > changing in Tux3 anyway, so this is really a useless thing to do. fsblocks in their refcount mode don't tend to _cache_ physical block addresses either, because they're only kept around for as long as they are required (eg. to write out the page to avoid memory allocation deadlock problems). But some filesystems don't do very fast block lookups and do want a cache. I did a little extent map library on the side for that. > Anyway, that is more than I meant to write about it. Good luck to you, > you will need it. Keep in mind that some of the nastiest kernel bugs > ever arose from interactions between page and buffer state bits. You Yes, I even fixed several of them too :) fsblock simplifies a lot of those games. It protects pagecache state and fsblock state for all associated blocks under a lock, so no weird ordering issues, and the two are always kept coherent (to the point that I can do writeout by walking dirty fsblocks in block device sector-order, although that requires bloat to fsblock struct and isn't straightforward with delalloc).
Of course it is new code so it will have more bugs, but it is better code. > may run into understandable reluctance to change stable filesystems > over to the new model. Even if it reduces the chance for confusion, > the fact that it touches cache state at all is going to make people > jumpy. I would suggest that you get Ext3 working perfectly to prove > your idea, no small task. One advantage: Ext3 uses block handles for > directories, as opposed to Ext2 which operates on pages. So Ext3 will > be easier to deal with in that one area. But with Ext3 you have to > deal with jbd too, which is going to keep you busy for a while. It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to). Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock. ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it. > The block handles patch is one of those fun things we have on hold for > the time being while we get the more mundane Good luck with it. I suspect that doing filesystem-specific layers to duplicate basically the same functionality but slightly optimised for the specific filesystem may not be a big win. As you say, this is where lots of nasty problems have been, so sharing as much code as possible is a really good idea. I would be very interested in anything like this that could beat fsblock in functionality or performance anywhere, even if it is taking shortcuts by being less generic. If there is a significant gain to be had from less generic, perhaps it could still be made into a library usable by more than 1 fs. ^ permalink raw reply [flat|nested] 23+ messages in thread
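To make the space figures in this exchange concrete, here is a minimal sketch of the kind of per-page object the block-handles idea describes: four bits of state per block and one refcount shared by every block on the page. The names and layout are invented for illustration; this is not the actual Tux3 or fsblock code.

#include <stdint.h>
#include <assert.h>

/*
 * One of these hangs off page->private and covers every block in the
 * page.  With 1K blocks on a 4K page that is four blocks, so the whole
 * object is ~16 bytes, versus four separately slab-allocated
 * buffer_heads of roughly 56 bytes each on a 32-bit arch.
 */
struct block_handles {
	uint32_t states;	/* 4 bits of state per block, up to 8 blocks */
	uint16_t refcount;	/* shared by all blocks on the page */
	uint16_t lockmap;	/* one read-lock bit per block */
	void *page;		/* back pointer to the owning page */
};

static inline unsigned get_block_state(const struct block_handles *h, unsigned i)
{
	assert(i < 8);
	return (h->states >> (4 * i)) & 0xf;
}

static inline void set_block_state(struct block_handles *h, unsigned i, unsigned state)
{
	assert(i < 8 && state <= 0xf);
	h->states = (h->states & ~(0xfu << (4 * i))) | (state << (4 * i));
}

Eight four-bit states fit in the single 32-bit map, which is where the "up to 8 block states in one 16 byte object" figure quoted above comes from.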
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin @ 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 13:04 ` Nick Piggin 2009-03-12 17:06 ` [Tux3] " Theodore Tso 1 sibling, 2 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 12:24 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote: > > > By the way, I just spotted your fsblock effort on LWN, and I should > > mention there is a lot of commonality with a side project we have > > going in Tux3, called "block handles", which aims to get rid of buffers > > entirely, leaving a tiny structure attached to the page->private that > > just records the block states. Currently, four bits per block. This > > can be done entirely _within_ a filesystem. We are already running > > some of the code that has to be in place before switching over to this > > model. > > > > Tux3 block handles (as prototyped, not in the current code base) are > > 16 bytes per page, which for 1K block size on a 32 bit arch is a factor > > of 14 savings, more on 64 bit arch. More importantly, it saves lots of > > individual slab allocations, a feature I gather is part of fsblock as > > well. > > That's interesting. Do you handle 1K block sizes with 64K page size? :) Not in its current incarnation. That would require 32 bytes worth of state while the current code just has a 4 byte map (4 bits X 8 blocks). I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has somebody spotted a 64K page? > fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah, > so it will do a single 80 byte allocation for 4K page 1K block. > > That's good for cache efficiency. As far as total # slab allocations > themselves go, fsblock probably tends to do more of them than buffer.c > because it frees them proactively when their refcounts reach 0 (by > default, one can switch to a lazy mode like buffer heads). I think that's a very good thing to do and intend to do the same. If it shows on a profiler, then the filesystem should keep its own free list to avoid whatever slab thing creates the bottleneck. > That's one of the most important things, so we don't end up with lots > of these things lying around. Amen. Doing nothing. > eg. I could make it 16 bytes I think, but it would be a little harder > and would make support for block size > page size much harder so I > wouldn't bother. Or share the refcount field for all blocks in a page > and just duplicate state etc, but it just makes code larger and slower > and harder. Right, block handles share the refcount for all blocks on one page. > > We represent up to 8 block states in one 16 byte object. To make this > > work, we don't try to emulate the behavior of the venerable block > > library, but first refactor the filesystem data flow so that we are > > only using the parts of buffer_head that will be emulated by the block > > handle. For example, no caching of physical block address. It keeps > > changing in Tux3 anyway, so this is really a useless thing to do. > > fsblocks in their refcount mode don't tend to _cache_ physical block addresses > either, because they're only kept around for as long as they are required > (eg. to write out the page to avoid memory allocation deadlock problems). 
> > But some filesystems don't do very fast block lookups and do want a cache. > I did a little extent map library on the side for that. Sure, good plan. We are attacking the transfer path, so that all the transfer state goes directly from the filesystem into a BIO and doesn't need that twisty path back and forth to the block library. The BIO remembers the physical address across the transfer cycle. If you must still support those twisty paths for compatibility with the existing buffer.c scheme, you have a much harder project. > > Anyway, that is more than I meant to write about it. Good luck to you, > > you will need it. Keep in mind that some of the nastiest kernel bugs > > ever arose from interactions between page and buffer state bits. You > > Yes, I even fixed several of them too :) > > fsblock simplifies a lot of those games. It protects pagecache state and > fsblock state for all assocated blocks under a lock, so no weird ordering > issues, and the two are always kept coherent (to the point that I can do > writeout by walking dirty fsblocks in block device sector-order, although > that requires bloat to fsblock struct and isn't straightforward with > delalloc). > > Of course it is new code so it will have more bugs, but it is better code. I've started poking at it, starting with the manifesto. It's a pretty big reading project. > > The block handles patch is one of those fun things we have on hold for > > the time being while we get the more mundane > > Good luck with it. I suspect that doing filesystem-specific layers to > duplicate basically the same functionality but slightly optimised for > the specific filesystem may not be a big win. As you say, this is where > lots of nasty problems have been, so sharing as much code as possible > is a really good idea. The big win will come from avoiding the use of struct buffer_head as an API element for mapping logical cache to disk, which is a narrow constriction when the filesystem wants to do things with extents in btrees. It is quite painful doing a btree probe for every ->get_block the way it is now. We want probe... page page page page... submit bio (or put it on a list for delayed allocation). Once we have the desired, nice straight path above then we don't need most of the fields in buffer_head, so tightening it up into a bitmap, a refcount and a pointer back to the page makes a lot of sense. This in itself may not make a huge difference, but the reduction in cache pressure ought to be measurable and worth the not very many lines of code for the implementation. > I would be very interested in anything like this that could beat fsblock > in functionality or performance anywhere, even if it is taking shortcuts > by being less generic. > > If there is a significant gain to be had from less generic, perhaps it > could still be made into a library usable by more than 1 fs. I don't see any reason right off that it is not generic, except that it does not try to fill the API role that buffer_head has, and so it isn't a small, easy change to an existing filesystem. It ought to be useful for new designs though. Mind you, the code hasn't been tried yet, it is currently just a state-smashing API waiting for the filesystem to evolve into the necessary shape, which is going to take another month or two. 
The Tux3 userspace buffer emulation already works much like the kernel block handles will work, in that it doesn't cache a physical address, and maintains cache state as a scalar value instead of a set of bits, so we already have a fair amount of experience with the model. When it does get to the top of the list of things to do, it should slot in smoothly. At that point we could hand it to you to try your generic API, which seems to implement similar ideas. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
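The proactive freeing behaviour both sides describe, create the per-page object lazily, hang it off page->private, and drop it the moment the last reference goes away, might look roughly like the following, building on the block_handles sketch above. The helper names are invented and all locking is omitted; this is not the actual fsblock or Tux3 code.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

static struct block_handles *page_handles(struct page *page)
{
	return PagePrivate(page) ? (struct block_handles *)page_private(page) : NULL;
}

static struct block_handles *get_handles(struct page *page)
{
	struct block_handles *h = page_handles(page);

	if (!h) {
		h = kzalloc(sizeof(*h), GFP_NOFS);
		if (!h)
			return NULL;
		h->page = page;
		set_page_private(page, (unsigned long)h);
		SetPagePrivate(page);
	}
	h->refcount++;		/* one count shared by all blocks on the page */
	return h;
}

static void put_handles(struct page *page)
{
	struct block_handles *h = page_handles(page);

	if (h && !--h->refcount) {	/* free proactively, not lazily */
		ClearPagePrivate(page);
		set_page_private(page, 0);
		kfree(h);
	}
}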
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:24 ` Daniel Phillips @ 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 12:45 ` Nick Piggin 2009-03-12 13:06 ` Daniel Phillips 2009-03-12 13:04 ` Nick Piggin 1 sibling, 2 replies; 23+ messages in thread From: Matthew Wilcox @ 2009-03-12 12:32 UTC (permalink / raw) To: Daniel Phillips Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > Not in its current incarnation. That would require 32 bytes worth of > state while the current code just has a 4 byte map (4 bits X 8 blocks). > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > somebody spotted a 64K page? I believe SGI ship their ia64 kernels configured this way. Certainly 16k ia64 kernels are common, which would (if I understand your scheme correctly) be 8 bytes worth of state in your scheme. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 12:32 ` Matthew Wilcox @ 2009-03-12 12:45 ` Nick Piggin 2009-03-12 13:12 ` [Tux3] " Daniel Phillips 2009-03-12 13:06 ` Daniel Phillips 1 sibling, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-12 12:45 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Daniel Phillips On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote: > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > On Thursday 12 March 2009, Nick Piggin wrote: > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > Not in its current incarnation. That would require 32 bytes worth of > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > somebody spotted a 64K page? > > I believe SGI ship their ia64 kernels configured this way. Certainly > 16k ia64 kernels are common, which would (if I understand your scheme > correctly) be 8 bytes worth of state in your scheme. I think some distros will (or do) ship configs with 64K page size for ia64 and powerpc too. I think I have heard of people using 64K pages with ARM. There was some (public) talk of x86-64 getting a 16K or 64K page size too (and even if not HW, some people want to be able to go bigger SW pagecache size). I wouldn't expect 64K page and 1K block to be worth optimising for (although 64K page systems could easily use older or shared 4K block filesystems). But just keep in mind that a good solution should not rely on PAGE_CACHE_SIZE for correctness. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:45 ` Nick Piggin @ 2009-03-12 13:12 ` Daniel Phillips 0 siblings, 0 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 13:12 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote: > > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > > On Thursday 12 March 2009, Nick Piggin wrote: > > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > > > Not in its current incarnation. That would require 32 bytes worth of > > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > > somebody spotted a 64K page? > > > > I believe SGI ship their ia64 kernels configured this way. Certainly > > 16k ia64 kernels are common, which would (if I understand your scheme > > correctly) be 8 bytes worth of state in your scheme. > > I think some distros will (or do) ship configs with 64K page size for > ia64 and powerpc too. I think I have heard of people using 64K pages > with ARM. There was some (public) talk of x86-64 getting a 16K or 64K > page size too (and even if not HW, some people want to be able to go > bigger SW pagecache size). > > I wouldn't expect 64K page and 1K block to be worth optimising for > (although 64K page systems could easily use older or shared 4K block > filesystems). But just keep in mind that a good solution should not > rely on PAGE_CACHE_SIZE for correctness. Not worth optimizing for, but it better work. Which the current nasty circular buffer list will, and I better keep that in mind for the next round of effort on block handles. On the other hand, 4K blocks on 64K pages better work really well or those 64K systems will be turkeys. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 12:45 ` Nick Piggin @ 2009-03-12 13:06 ` Daniel Phillips 1 sibling, 0 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 13:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Matthew Wilcox wrote: > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > On Thursday 12 March 2009, Nick Piggin wrote: > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > Not in its current incarnation. That would require 32 bytes worth of > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > somebody spotted a 64K page? > > I believe SGI ship their ia64 kernels configured this way. Certainly > 16k ia64 kernels are common, which would (if I understand your scheme > correctly) be 8 bytes worth of state in your scheme. Yes, correct, and after that the state object would have to expand by a binary factor, which probably doesn't matter because at that scale it is really small compared to the blocks it maps. And the mapped blocks should just be metadata like index nodes, directory entry blocks and bitmap blocks, which need per block data handles and locking while regular file data can work in full pages, which is the same equation that keeps the pain of buffer_heads down to a dull roar. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
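A tiny standalone sketch of the sizing arithmetic being discussed here, 4 bits of state per block packed into whole 32-bit words; the figures it prints match the ones quoted above (illustrative only):

#include <stdio.h>

/* bytes of block-state map needed per page at 4 bits per block */
static unsigned map_bytes(unsigned page_size, unsigned block_size)
{
	unsigned blocks = page_size / block_size;	/* blocks per page */
	unsigned bits = 4 * blocks;			/* 4 state bits each */
	return 4 * ((bits + 31) / 32);			/* round up to u32 words */
}

int main(void)
{
	printf("4K page, 1K blocks:  %u bytes\n", map_bytes(4096, 1024));	/* 4  */
	printf("16K page, 1K blocks: %u bytes\n", map_bytes(16384, 1024));	/* 8  */
	printf("64K page, 1K blocks: %u bytes\n", map_bytes(65536, 1024));	/* 32 */
	printf("64K page, 4K blocks: %u bytes\n", map_bytes(65536, 4096));	/* 8  */
	return 0;
}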
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 12:32 ` Matthew Wilcox @ 2009-03-12 13:04 ` Nick Piggin 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox 2009-03-15 2:41 ` Daniel Phillips 1 sibling, 2 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 13:04 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > That's good for cache efficiency. As far as total # slab allocations > > themselves go, fsblock probably tends to do more of them than buffer.c > > because it frees them proactively when their refcounts reach 0 (by > > default, one can switch to a lazy mode like buffer heads). > > I think that's a very good thing to do and intend to do the same. If > it shows on a profiler, then the filesystem should keep its own free > list to avoid whatever slab thing creates the bottleneck. slab allocation/free fastpath is on the order of 100 cycles, which is a cache miss. I have a feeling that actually doing lots of allocs and frees can work better because it keeps reusing the same memory for different objects being operated on, so you get fewer cache misses. (anyway it doesn't seem to be measurable in fsblock when switching between cached and refcounted mode). > > fsblocks in their refcount mode don't tend to _cache_ physical block > > addresses either, because they're only kept around for as long as they > > are required (eg. to write out the page to avoid memory allocation > > deadlock problems). > > > > But some filesystems don't do very fast block lookups and do want a > > cache. I did a little extent map library on the side for that. > > Sure, good plan. We are attacking the transfer path, so that all the > transfer state goes directly from the filesystem into a BIO and doesn't > need that twisty path back and forth to the block library. The BIO > remembers the physical address across the transfer cycle. If you must > still support those twisty paths for compatibility with the existing > buffer.c scheme, you have a much harder project. I don't quite know what you mean. You have a set of dirty cache that needs to be written. So you need to know the block addresses in order to create the bio of course. fsblock allocates the block and maps[*] the block at pagecache *dirty* time, and holds onto it until writeout is finished. In something like ext2, finding the offset->block map can require buffercache allocations so it is technically deadlocky if you have to do it at writeout time. [*] except in the case of delalloc. fsblock does its best, but for complex filesystems like delalloc, some memory reservation would have to be done by the fs. > > > The block handles patch is one of those fun things we have on hold for > > > the time being while we get the more mundane > > > > Good luck with it. I suspect that doing filesystem-specific layers to > > duplicate basically the same functionality but slightly optimised for > > the specific filesystem may not be a big win. As you say, this is where > > lots of nasty problems have been, so sharing as much code as possible > > is a really good idea. > > The big win will come from avoiding the use of struct buffer_head as > an API element for mapping logical cache to disk, which is a narrow > constriction when the filesystem wants to do things with extents in > btrees. It is quite painful doing a btree probe for every ->get_block > the way it is now. 
We want probe... page page page page... submit bio > (or put it on a list for delayed allocation). > > Once we have the desired, nice straight path above then we don't need > most of the fields in buffer_head, so tightening it up into a bitmap, > a refcount and a pointer back to the page makes a lot of sense. This > in itself may not make a huge difference, but the reduction in cache > pressure ought to be measurable and worth the not very many lines of > code for the implementation. I haven't done much about this in fsblock yet. I think some things need a bit of changing in the pagecache layer (in the block library, eg. write_begin/write_end doesn't have enough info to reserve/allocate a big range of blocks -- we need a callback higher up to tell the filesystem that we will be writing xxx range in the file, so get things ready for us). As far as the per-block pagecache state (as opposed to the per-block fs state), I don't see any reason it is a problem for efficiency. We have to do per-page operations anyway. > > I would be very interested in anything like this that could beat fsblock > > in functionality or performance anywhere, even if it is taking shortcuts > > by being less generic. > > > > If there is a significant gain to be had from less generic, perhaps it > > could still be made into a library usable by more than 1 fs. > > I don't see any reason right off that it is not generic, except that it > does not try to fill the API role that buffer_head has, and so it isn't > a small, easy change to an existing filesystem. It ought to be useful > for new designs though. Mind you, the code hasn't been tried yet, it > is currently just a state-smashing API waiting for the filesystem to > evolve into the necessary shape, which is going to take another month > or two. > > The Tux3 userspace buffer emulation already works much like the kernel > block handles will work, in that it doesn't cache a physical address, > and maintains cache state as a scalar value instead of a set of bits, > so we already have a fair amount of experience with the model. When it > does get to the top of the list of things to do, it should slot in > smoothly. At that point we could hand it to you to try your generic > API, which seems to implement similar ideas. Cool. I will be interested to see how it works. Thanks, Nick ^ permalink raw reply [flat|nested] 23+ messages in thread
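The "probe... page page page page... submit bio" flow being quoted here could be sketched roughly as below: one mapping probe covers a whole extent, then the pages that extent spans are stuffed into a single bio. fs_probe_extent() is an invented stand-in for a filesystem's extent lookup, and page locking, dirty checks, bio completion handling and error paths are all omitted, so this is only the shape of the path, not working writeout code.

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

struct extent {
	sector_t start;		/* first physical sector of the run */
	unsigned pages;		/* length in whole pages, assumed >= 1 */
};

/* filesystem-provided (hypothetical): one btree probe maps a whole run */
int fs_probe_extent(struct inode *inode, pgoff_t index, struct extent *ex);

static int write_dirty_range(struct inode *inode, pgoff_t index, pgoff_t end)
{
	while (index < end) {
		struct extent ex;
		struct bio *bio;
		pgoff_t last;
		int err;

		err = fs_probe_extent(inode, index, &ex);	/* probe... */
		if (err)
			return err;

		bio = bio_alloc(GFP_NOFS, ex.pages);
		bio->bi_sector = ex.start;
		bio->bi_bdev = inode->i_sb->s_bdev;

		/* ...page page page page... */
		for (last = index + ex.pages; index < last && index < end; index++) {
			struct page *page = find_get_page(inode->i_mapping, index);
			if (page) {
				bio_add_page(bio, page, PAGE_CACHE_SIZE, 0);
				page_cache_release(page);
			}
		}
		submit_bio(WRITE, bio);		/* ...submit bio */
	}
	return 0;
}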
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:04 ` Nick Piggin @ 2009-03-12 13:59 ` Matthew Wilcox 2009-03-12 14:19 ` Nick Piggin 2009-03-15 3:24 ` Daniel Phillips 2009-03-15 2:41 ` Daniel Phillips 1 sibling, 2 replies; 23+ messages in thread From: Matthew Wilcox @ 2009-03-12 13:59 UTC (permalink / raw) To: Nick Piggin Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > As far as the per-block pagecache state (as opposed to the per-block fs > state), I don't see any reason it is a problem for efficiency. We have to > do per-page operations anyway. Why? Wouldn't it be nice if we could do arbitrary extents? I suppose superpages or soft page sizes get us most of the way there, but the rounding or pieces at the end are a bit of a pain. Sure, it'll be a huge upheaval for the VM, but we're good at huge upheavals ;-) -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox @ 2009-03-12 14:19 ` Nick Piggin 0 siblings, 0 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 14:19 UTC (permalink / raw) To: Matthew Wilcox Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Friday 13 March 2009 00:59:40 Matthew Wilcox wrote: > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > superpages or soft page sizes get us most of the way there, but the > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > huge upheaval for the VM, but we're good at huge upheavals ;-) Sounds nice in theory. I personally think it would be very very hard to do well, and wouldn't end up being much of a win if at all anyway. But if you have some structure representing arbitrary pagecache extents, then you would probably be able to naturally use those pagecache extents themselves, or parallel very similar data structures for holding block state. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox 2009-03-12 14:19 ` Nick Piggin @ 2009-03-15 3:24 ` Daniel Phillips 2009-03-15 3:50 ` [Tux3] " Nick Piggin 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 3:24 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-fsdevel, Nick Piggin, Andrew Morton, linux-kernel, tux3 On Thursday 12 March 2009, Matthew Wilcox wrote: > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > superpages or soft page sizes get us most of the way there, but the > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > huge upheaval for the VM, but we're good at huge upheavals ;-) Actually, filesystem extents tend to erode the argument for superpages. There are three reasons we have seen for wanting big pages: 1) support larger block buffers without adding messy changes to buffer.c; 2) TLB efficiency; 3) less per-page state in kernel memory. TLB efficiency is only there if the hardware supports it, which X86 arguably doesn't. The main argument for larger block buffers is less per-block transfer setup overhead, but the BIO model combined with filesystem extents does that job better, or at least it will when filesystems learn to take better advantage of this. VM extents on the other hand could possibly do a really good job of reducing per-page VM overhead, if anybody still cares about that now that 64 bit machines rule the big iron world. I expect implementing VM extents to be a brutally complex project, as filesystem extents always turn out to be, even though one tends to enter such projects thinking, how hard could this be? Answer: harder than you think. But VM extents would be good for a modest speedup, so somebody is sure to get brave enough to try it sometime. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 3:24 ` Daniel Phillips @ 2009-03-15 3:50 ` Nick Piggin 2009-03-15 4:08 ` Daniel Phillips 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-15 3:50 UTC (permalink / raw) To: Daniel Phillips Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > On Thursday 12 March 2009, Matthew Wilcox wrote: > > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > > As far as the per-block pagecache state (as opposed to the per-block fs > > > state), I don't see any reason it is a problem for efficiency. We have > > > to do per-page operations anyway. > > > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > > superpages or soft page sizes get us most of the way there, but the > > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > > huge upheaval for the VM, but we're good at huge upheavals ;-) > > Actually, filesystem extents tend to erode the argument for superpages. > There are three reasons we have seen for wanting big pages: 1) support > larger block buffers without adding messy changes to buffer.c; 2) TLB > efficiency; 3) less per-page state in kernel memory. TLB efficiency is > only there if the hardware supports it, which X86 arguably doesn't. > The main argument for larger block buffers is less per-block transfer > setup overhead, but the BIO model combined with filesystem extents > does that job better, or at least it will when filesystems learn to > take better advantage of this. > > VM extents on the other hand could possibly do a really good job of > reducing per-page VM overhead, if anybody still cares about that now > that 64 bit machines rule the big iron world. > > I expect implementing VM extents to be a brutally complex project, as > filesystem extents always turn out to be, even though one tends to > enter such projects thinking, how hard could this be? Answer: harder > than you think. But VM extents would be good for a modest speedup, so > somebody is sure to get brave enough to try it sometime. I don't think there is enough evidence to be able to make such an assertion. When you actually implement extent splitting and merging in a deadlock free manner and synchronize everything properly I wouldn't be surprised if it is slower most of the time. If it was significantly faster, then memory fragmentation means that it is going to get significantly slower over the uptime of the kernel, so you would have to virtually map the kernel and implement memory defragmentation, at which point you get even slower and more complex. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-15 3:50 ` [Tux3] " Nick Piggin @ 2009-03-15 4:08 ` Daniel Phillips 2009-03-15 4:14 ` [Tux3] " Nick Piggin 0 siblings, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 4:08 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Matthew Wilcox On Saturday 14 March 2009, Nick Piggin wrote: > On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > > I expect implementing VM extents to be a brutally complex project, as > > filesystem extents always turn out to be, even though one tends to > > enter such projects thinking, how hard could this be? Answer: harder > > than you think. But VM extents would be good for a modest speedup, so > > somebody is sure to get brave enough to try it sometime. > > I don't think there is enough evidence to be able to make such an > assertion. > > When you actually implement extent splitting and merging in a deadlock > free manner and synchronize everything properly I wouldn't be surprised > if it is slower most of the time. If it was significantly faster, then > memory fragmentation means that it is going to get significantly slower > over the uptime of the kernel, so you would have to virtually map the > kernel and implement memory defragmentation, at which point you get even > slower and more complex. You can make exactly the same argument about filesystem extents, and we know that extents are faster there. So what is the fundamental difference? Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 4:08 ` Daniel Phillips @ 2009-03-15 4:14 ` Nick Piggin 0 siblings, 0 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-15 4:14 UTC (permalink / raw) To: Daniel Phillips Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 15:08:52 Daniel Phillips wrote: > On Saturday 14 March 2009, Nick Piggin wrote: > > On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > > > I expect implementing VM extents to be a brutally complex project, as > > > filesystem extents always turn out to be, even though one tends to > > > enter such projects thinking, how hard could this be? Answer: harder > > > than you think. But VM extents would be good for a modest speedup, so > > > somebody is sure to get brave enough to try it sometime. > > > > I don't think there is enough evidence to be able to make such an > > assertion. > > > > When you actually implement extent splitting and merging in a deadlock > > free manner and synchronize everything properly I wouldn't be surprised > > if it is slower most of the time. If it was significantly faster, then > > memory fragmentation means that it is going to get significantly slower > > over the uptime of the kernel, so you would have to virtually map the > > kernel and implement memory defragmentation, at which point you get even > > slower and more complex. > > You can make exactly the same argument about filesystem extents, and > we know that extents are faster there. So what is the fundamental > difference? Uh, aside from all the obvious fundamental differences there are, you can only make such an assertion if performance characteristics and usage patterns are very similar, nevermind fundamentally different... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:04 ` Nick Piggin 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox @ 2009-03-15 2:41 ` Daniel Phillips 2009-03-15 3:45 ` Nick Piggin 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 2:41 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > > > fsblocks in their refcount mode don't tend to _cache_ physical block > > > addresses either, because they're only kept around for as long as > > > they are required (eg. to write out the page to avoid memory > > > allocation deadlock problems). > > > > > > But some filesystems don't do very fast block lookups and do want a > > > cache. I did a little extent map library on the side for that. > > > > > > Sure, good plan. We are attacking the transfer path, so that all the > > > transfer state goes directly from the filesystem into a BIO and doesn't > > > need that twisty path back and forth to the block library. The BIO > > > remembers the physical address across the transfer cycle. If you must > > > still support those twisty paths for compatibility with the existing > > > buffer.c scheme, you have a much harder project. > > > > I don't quite know what you mean. You have a set of dirty cache that > > needs to be written. So you need to know the block addresses in order > > to create the bio of course. > > > > fsblock allocates the block and maps[*] the block at pagecache *dirty* > > time, and holds onto it until writeout is finished. As it happens, Tux3 also physically allocates each _physical_ metadata block (i.e., what is currently called buffer cache) at the time it is dirtied. I don't know if this is the best thing to do, but it is interesting that you do the same thing. I also don't know if I want to trust a library to get this right, before having completely proved out the idea in a non-trivial filesystem. But good luck with that! It seems to me like a very good idea to take Ted up on his offer and try out your library on Ext4. This is just a gut feeling, but I think you will need many iterations to refine the idea. Just working, and even showing benchmark improvement is not enough. If it is a core API proposal, it needs a huge body of proof. If you end up like JBD with just one user, because it actually only implements the semantics of exactly one filesystem, then the extra overhead of unused generality will just mean more lines of code to maintain and more places for bugs to hide. This is all general philosophy of course. Actually reading your code would help a lot. By comparison, I intend the block handles library to be a few hundred lines of code, including new incarnations of buffer.c functionality like block_read/write_*. If this is indeed possible, and it does the job with 4 bytes per block on a 1K block/4K page configuration as it does in the prototype, then I think I would prefer a per-filesystem solution and let it evolve that way for a long time before attempting a library. But that is just me. I suppose you would like to see some code? > In something like > ext2, finding the offset->block map can require buffercache allocations > so it is technically deadlocky if you have to do it at writeout time. I am not sure what "technically" means. Pretty much everything you do in this area has high deadlock risk. That is one of the things that scares me about trying to handle every filesystem uniformly.
How would filesystem writers even know what the deadlock avoidance rules are, thus what they need to do in their own filesystem to avoid it? Anyway, the Tux3 reason for doing the allocation at dirty time is, this is the only time the filesystem knows what the parent block of a given metadata block is. Note that we move btree blocks around when they are dirtied, and thus need to know the parent in order to update the parent pointer to the child. This is a complication you will not run into in any of the filesystems you have poked at so far. This subtle detail is very much filesystem specific, or it is specific to the class of filesystems that do remap on write. Good luck knowing how to generalize that before Linux has seen even one of them up and doing real production work. > [*] except in the case of delalloc. fsblock does its best, but for > complex filesystems like delalloc, some memory reservation would have > to be done by the fs. And that is a whole, huge and critical topic. Again, something that I feel needs to be analyzed per filesystem, until we have considerably more experience with the issues. > > > > The block handles patch is one of those fun things we have on hold for > > > > the time being while we get the more mundane > > > > > > Good luck with it. I suspect that doing filesystem-specific layers to > > > duplicate basically the same functionality but slightly optimised for > > > the specific filesystem may not be a big win. As you say, this is where > > > lots of nasty problems have been, so sharing as much code as possible > > > is a really good idea. > > > > The big win will come from avoiding the use of struct buffer_head as > > an API element for mapping logical cache to disk, which is a narrow > > constriction when the filesystem wants to do things with extents in > > btrees. It is quite painful doing a btree probe for every ->get_block > > the way it is now. We want probe... page page page page... submit bio > > (or put it on a list for delayed allocation). > > > > Once we have the desired, nice straight path above then we don't need > > most of the fields in buffer_head, so tightening it up into a bitmap, > > a refcount and a pointer back to the page makes a lot of sense. This > > in itself may not make a huge difference, but the reduction in cache > > pressure ought to be measurable and worth the not very many lines of > > code for the implementation. > > I haven't done much about this in fsblock yet. I think some things need > a bit of changing in the pagecache layer (in the block library, eg. > write_begin/write_end doesn't have enough info to reserve/allocate a big > range of blocks -- we need a callback higher up to tell the filesystem > that we will be writing xxx range in the file, so get things ready for > us). That would be write_cache_pages, it already exists and seems perfectly serviceable. > As far as the per-block pagecache state (as opposed to the per-block fs > state), I don't see any reason it is a problem for efficiency. We have to > do per-page operations anyway. I don't see a distinction between page cache vs fs state for a block. Tux3 has these scalar block states: EMPTY - not read into cache yet CLEAN - cache data matches disk data (which might be a hole) DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates Besides the per-page block reference count (hmm, do we really need it? Why not rely on the page reference count?) there is no cache-specific state, it is all "fs" state. 
To complete the enumeration of state Tux3 represents in block handles, there is also a per-block lock bit, used for reading blocks, the same as buffer lock. So far there is no writeback bit, which does not seem to be needed, because the flow of block writeback is subtly different from page writeback. I am not prepared to defend that assertion just yet! But I think the reason for this is, there is no such thing as redirty for metadata blocks in Tux3, there is only "dirty in a later delta", and that implies redirecting the block to a new physical location that has its own, separate block state. Anyway, this is a pretty good example of why you may find it difficult to generalize your library to handle every filesystem. Is there any existing filesystem that works this way? How would you know in advance what features to include in your library to handle it? Will some future filesystem have very different requirements, not handled by your library? If you have finally captured every feature, will they interact? Will all these features be confusing to use and hard to analyze? I am not saying you can't solve all these problems, just that it is bound to be hard, take a long time, and might possibly end up less elegant than a more lightweight approach that leaves the top level logic in the hands of the filesystem. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
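For illustration, the scalar state enumeration described above fits comfortably in the 4-bit per-block field of the earlier block_handles sketch; something like the following, with invented names rather than the actual Tux3 code:

/*
 * Scalar per-block cache state: a block is either not read yet, clean,
 * or dirty in exactly one of up to four pipelined deltas.  There is no
 * separate writeback or redirty state; dirtying a block again in a later
 * delta means redirecting it to a new physical location with its own
 * state.  The per-block read lock lives in a separate bitmap.
 */
enum block_state {
	BLOCK_EMPTY,		/* not read into cache yet */
	BLOCK_CLEAN,		/* cache matches disk (which might be a hole) */
	BLOCK_DIRTY0,		/* dirty in pipelined delta N modulo 4 */
	BLOCK_DIRTY1,
	BLOCK_DIRTY2,
	BLOCK_DIRTY3,
};

static inline enum block_state block_dirty_state(unsigned delta)
{
	return BLOCK_DIRTY0 + (delta & 3);	/* which delta owns the block */
}

static inline int block_is_dirty(enum block_state state)
{
	return state >= BLOCK_DIRTY0;
}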
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 2:41 ` Daniel Phillips @ 2009-03-15 3:45 ` Nick Piggin 2009-03-15 21:44 ` Theodore Tso 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-15 3:45 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 13:41:09 Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > > > > fsblocks in their refcount mode don't tend to _cache_ physical block > > > > addresses either, because they're only kept around for as long as > > > > they are required (eg. to write out the page to avoid memory > > > > allocation deadlock problems). > > > > > > > > But some filesystems don't do very fast block lookups and do want a > > > > cache. I did a little extent map library on the side for that. > > > > > > Sure, good plan. We are attacking the transfer path, so that all the > > > transfer state goes directly from the filesystem into a BIO and doesn't > > > need that twisty path back and forth to the block library. The BIO > > > remembers the physical address across the transfer cycle. If you must > > > still support those twisty paths for compatibility with the existing > > > buffer.c scheme, you have a much harder project. > > > > I don't quite know what you mean. You have a set of dirty cache that > > needs to be written. So you need to know the block addresses in order > > to create the bio of course. > > > > fsblock allocates the block and maps[*] the block at pagecache *dirty* > > time, and holds onto it until writeout is finished. > > As it happens, Tux3 also physically allocates each _physical_ metadata > block (i.e., what is currently called buffer cache) at the time it is > dirtied. I don't know if this is the best thing to do, but it is > interesting that you do the same thing. I also don't know if I want to > trust a library to get this right, before having completely proved out > the idea in a non-trival filesystem. But good luck with that! It I'm not sure why it would be a big problem. fsblock isn't allocating the block itself of course, it just asks the filesystem to. It's trivial to do for fsblock. > seems to me like a very good idea to take Ted up on his offer and try > out your library on Ext4. This is just a gut feeling, but I think you > will need many iterations to refine the idea. Just working, and even > showing benchmark improvement is not enough. If it is a core API > proposal, it needs a huge body of proof. If you end up like JBD with > just one user, because it actually only implements the semantics of > exactly one filesystem, then the extra overhead of unused generality > will just mean more lines of code to maintain and more places for bugs > to hide. I don't know what you're thinking is so difficult with it. I've already converted minix, ext2, and xfs and they seem to work fine. There is not really fundamentally anything that buffer heads can do that fsblock can't. > This is all general philosophy of course. Actually reading your code > would help a lot. By comparision, I intend the block handles library > to be a few hundred lines of code, including new incarnations of > buffer.c functionality like block_read/write_*. 
If this is indeed > possible, and it does the job with 4 bytes per block on a 1K block/4K > page configuration as it does in the prototype, then I think I would > prefer a per-filesystem solution and let it evolve that way for a long > time before attempting a library. But that is just me. If you're tracking pagecache state in these things, then I can't see how it can get any easier just because it is smaller. In which case, your concerns about duplicating functionality of this layer apply just as much to your own scheme. > I suppose you would like to see some code? > > > In something like > > ext2, finding the offset->block map can require buffercache allocations > > so it is technically deadlocky if you have to do it at writeout time. > > I am not sure what "technically" means. Pretty much everything you do Technically means that it is deadlocky. Today, practically every Linux filesystem technically has memory deadlocks. In practice, the mm does keep reserves around to help this and so it is very very hard to hit. > in this area has high deadlock risk. That is one of the things that > scares me about trying to handle every filesystem uniformly. How would > filesystem writers even know what the deadlock avoidance rules are, > thus what they need to do in their own filesystem to avoid it? The rule is simple: if forward progress requires resource allocation, then you must ensure resource deadlocks are avoided or can be recovered from. I don't think many fs developers actually care very much, but obviously a rewrite of such core functionality must not introduce such deadlocks by design. > Anyway, the Tux3 reason for doing the allocation at dirty time is, this > is the only time the filesystem knows what the parent block of a given > metadata block is. Note that we move btree blocks around when they are > dirtied, and thus need to know the parent in order to update the parent > pointer to the child. This is a complication you will not run into in > any of the filesystems you have poked at so far. This subtle detail is > very much filesystem specific, or it is specific to the class of > filesystems that do remap on write. Good luck knowing how to generalize > that before Linux has seen even one of them up and doing real production > work. Uh, this kind of stuff is completely not what fsblock would try to do. fsblock gives the filesystem notifications when the block gets dirtied, when the block is prepared for writeout, etc. It is up to the filesystem to do everything else (with the postcondition that the block is mapped after being prepared for writeout). > > [*] except in the case of delalloc. fsblock does its best, but for > > complex filesystems like delalloc, some memory reservation would have > > to be done by the fs. > > And that is a whole, huge and critical topic. Again, something that I > feel needs to be analyzed per filesystem, until we have considerably > more experience with the issues. Again, fsblock does as much as it can, up to guaranteeing that fsblock metadata (and hence any filesystem private data attached to the fsblock) stays allocated as long as the block is dirty. Of course the actual delalloc scheme is filesystem specific and can't be handled by fsblock. > > I haven't done much about this in fsblock yet. I think some things need > a bit of changing in the pagecache layer (in the block library, eg. 
> > write_begin/write_end doesn't have enough info to reserve/allocate a big > > range of blocks -- we need a callback higher up to tell the filesystem > > that we will be writing xxx range in the file, so get things ready for > > us). > > That would be write_cache_pages, it already exists and seems perfectly > serviceable. No it isn't. That's completely different. > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > I don't see a distinction between page cache vs fs state for a block. > Tux3 has these scalar block states: > > EMPTY - not read into cache yet > CLEAN - cache data matches disk data (which might be a hole) > DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates > > Besides the per-page block reference count (hmm, do we really need it? > Why not rely on the page reference count?) there is no cache-specific > state, it is all "fs" state. dirty / uptodate is a property of the cache. > To complete the enumeration of state Tux3 represents in block handles, > there is also a per-block lock bit, used for reading blocks, the same > as buffer lock. So far there is no writeback bit, which does not seem > to be needed, because the flow of block writeback is subtly different > from page writeback. I am not prepared to defend that assertion just > yet! But I think the reason for this is, there is no such thing as > redirty for metadata blocks in Tux3, there is only "dirty in a later > delta", and that implies redirecting the block to a new physical > location that has its own, separate block state. Anyway, this is a > pretty good example of why you may find it difficult to generalize your > library to handle every filesystem. Is there any existing filesystem > that works this way? How would you know in advance what features to > include in your library to handle it? Will some future filesystem > have very different requirements, not handled by your library? If you > have finally captured every feature, will they interact? Will all > these features be confusing to use and hard to analyze? I am not > saying you can't solve all these problems, just that it is bound to be > hard, take a long time, and might possibly end up less elegant than a > more lightweight approach that leaves the top level logic in the hands > of the filesystem. It's not meant to handle every possible feature of every current and future fs! It's meant to replace buffer-head. If there is some common filesystem feature in future that makes sense to generalise and support in fsblock then great. ^ permalink raw reply [flat|nested] 23+ messages in thread
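The division of labour Nick describes, where the generic layer owns the cache state and calls into the filesystem at the interesting transitions, with the one postcondition that a block is mapped once it has been prepared for writeout, could be pictured as a small ops table along these lines. This is an invented illustration of the idea, not the actual fsblock interface:

#include <linux/fs.h>

struct fsblock;			/* generic per-block object, owned by the library */

struct fsblock_fs_ops {
	/* a clean block just became dirty (via write, mmap store, etc.) */
	void (*block_dirty)(struct inode *inode, struct fsblock *blk,
			    sector_t logical);

	/*
	 * about to go into a bio; postcondition: *physical is valid.
	 * A simple filesystem allocates here (or earlier, at dirty time);
	 * a delalloc filesystem resolves its reservation here.
	 */
	int (*prepare_writeout)(struct inode *inode, struct fsblock *blk,
				sector_t logical, sector_t *physical);

	/* the bio completed; the block can return to plain clean state */
	void (*writeout_done)(struct inode *inode, struct fsblock *blk);
};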
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 3:45 ` Nick Piggin @ 2009-03-15 21:44 ` Theodore Tso 2009-03-15 22:41 ` Daniel Phillips 2009-03-16 5:12 ` Dave Chinner 0 siblings, 2 replies; 23+ messages in thread From: Theodore Tso @ 2009-03-15 21:44 UTC (permalink / raw) To: Nick Piggin Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote: > > As it happens, Tux3 also physically allocates each _physical_ metadata > > block (i.e., what is currently called buffer cache) at the time it is > > dirtied. I don't know if this is the best thing to do, but it is > > interesting that you do the same thing. I also don't know if I want to > > trust a library to get this right, before having completely proved out > > the idea in a non-trival filesystem. But good luck with that! It > > I'm not sure why it would be a big problem. fsblock isn't allocating > the block itself of course, it just asks the filesystem to. It's > trivial to do for fsblock. So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location. XFS, for example, keeps a btree of free regions indexed by size so that it can select the perfect location for a newly written file which is 24k or 56k long. If fsblock forces the physical allocation of blocks the moment the page is dirty, it will destroy XFS's capability to select the perfect location for the file. In addition, XFS uses delayed allocation to avoid the problem of uninitialized data becoming visible in the event of a crash. If fsblock immediately allocates the physical block, then either the uninitialized data might become available on a system crash (which is a security problem), or XFS is going to have to force all newly written data blocks to disk before a commit. If that sounds familiar it's what ext3's data=ordered mode does, and it's what is responsible for the Firefox 3.0 fsync performance problem. A similar issue exists for ext4's delayed allocation. - Ted ^ permalink raw reply [flat|nested] 23+ messages in thread
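In outline, the delayed allocation Ted describes means the dirty-time hook only reserves space and remembers that an allocation is owed, and the physical placement happens at writeout, when the size and shape of the whole dirty range is known. A schematic sketch with invented helper names (not the ext4 or XFS code):

#include <linux/fs.h>

/* filesystem-specific pieces, names invented for the sketch */
int fs_reserve_blocks(struct inode *inode, unsigned count);
void fs_tag_delalloc(struct inode *inode, sector_t logical, unsigned count);
int fs_allocate_extent(struct inode *inode, sector_t logical, unsigned count,
		       sector_t *physical);

/* dirty time: no physical blocks are chosen yet */
static int fs_block_dirty(struct inode *inode, sector_t logical, unsigned count)
{
	int err = fs_reserve_blocks(inode, count);	/* so writeout cannot fail for space */
	if (err)
		return err;				/* typically -ENOSPC */
	fs_tag_delalloc(inode, logical, count);		/* remember "allocation owed" */
	return 0;
}

/*
 * writeout time: the whole dirty range is visible, so the allocator can
 * place it in one well-chosen extent instead of deciding block by block
 * as individual pages were dirtied.
 */
static int fs_map_for_writeout(struct inode *inode, sector_t logical,
			       unsigned count, sector_t *physical)
{
	return fs_allocate_extent(inode, logical, count, physical);
}

The ordering problem Ted raises then concerns the window between this allocation committing and the data actually reaching disk, which is what ext3's data=ordered mode, or the unwritten-extent conversion discussed further down the thread, is meant to close.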
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 21:44 ` Theodore Tso @ 2009-03-15 22:41 ` Daniel Phillips 2009-03-16 10:32 ` Nick Piggin 2009-03-16 5:12 ` Dave Chinner 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 22:41 UTC (permalink / raw) To: Theodore Tso Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel Hi Ted, On Sunday 15 March 2009, Theodore Tso wrote: > On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote: > > > As it happens, Tux3 also physically allocates each _physical_ metadata > > > block (i.e., what is currently called buffer cache) at the time it is > > > dirtied. I don't know if this is the best thing to do, but it is > > > interesting that you do the same thing. I also don't know if I want to > > > trust a library to get this right, before having completely proved out > > > the idea in a non-trival filesystem. But good luck with that! It > > > > I'm not sure why it would be a big problem. fsblock isn't allocating > > the block itself of course, it just asks the filesystem to. It's > > trivial to do for fsblock. > > So the really unfortunate thing about allocating the block as soon as > the page is dirty is that it spikes out delayed allocation. By > delaying the physical allocation of the logical->physical mapping as > long as possible, the filesystem can select the best possible physical > location. Tux3 does not dirty the metadata until data cache is flushed, so the allocation decisions for data and metadata are made at the same time. That is the reason for the distinction between physical metadata above, and logical metadata such as directory data and bitmaps, which are delayed. Though physical metadata is positioned when first dirtied, physical metadata dirtying is delayed until delta commit. Implementing this model (we are still working on it) requires taking care of a lot of subtle details that are specific to the Tux3 cache model. I have a hard time imagining those allocation decisions driven by callbacks from a buffer-like library. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-15 22:41 ` Daniel Phillips
@ 2009-03-16 10:32 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-16 10:32 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Tso, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Monday 16 March 2009 09:41:35 Daniel Phillips wrote:
> Hi Ted,
>
> > So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location.
>
> Tux3 does not dirty the metadata until the data cache is flushed, so the allocation decisions for data and metadata are made at the same time. That is the reason for the distinction between physical metadata above, and logical metadata such as directory data and bitmaps, which are delayed. Though physical metadata is positioned when first dirtied, physical metadata dirtying is delayed until delta commit.
>
> Implementing this model (we are still working on it) requires taking care of a lot of subtle details that are specific to the Tux3 cache model. I have a hard time imagining those allocation decisions driven by callbacks from a buffer-like library.

The filesystem can get pagecache-block-dirty events in a few ways (often a combination of): write_begin/write_end, set_page_dirty, page_mkwrite, etc. Short of implementing your own write path entirely (and even then you need to hook at least page_mkwrite to catch mmapped writes, for completeness), I don't see why a get_block(BLOCK_DIRTY) kind of callback is much harder for you to imagine than any of the other callbacks. Actually, I imagine the block-based callback should be easier for filesystems that support any block size != page size, because all the others are page based.

I would definitely like to hear firm details about any problems, because I would like to try to make it more generic even if your filesystem won't use it :)

Now this is not to say the current buffer APIs are totally _optimal_. As I said, I would like to see at least something along the lines of a "we are about to dirty range (x,y)" callback in the higher level generic write code. But that's another story (which I am planning to get to).

^ permalink raw reply [flat|nested] 23+ messages in thread
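For readers less familiar with these hooks, here is a rough sketch of the kind of block-granular dirty callback being discussed. It is not the fsblock API (fsblock never went upstream) and the event names are invented; the only point is that a filesystem already answering page-based dirty events can just as easily answer a per-block one.

/* Illustrative only: a block-granular dirty callback in the spirit of the
 * get_block(BLOCK_DIRTY) idea mentioned above.  These names are invented;
 * this is not the fsblock API or any in-tree kernel API. */
enum blk_event { BLK_MAP, BLK_DIRTY };

struct toy_block {
	unsigned long index;	/* block number within the file */
	long phys;		/* physical block, or -1 if not mapped yet */
	unsigned int dirty:1;
	unsigned int delalloc:1;
};

typedef int (*blk_event_fn)(struct toy_block *blk, enum blk_event ev);

/* An ext2-like filesystem can simply allocate when told a block is dirty. */
static int eager_fs_event(struct toy_block *blk, enum blk_event ev)
{
	if (ev == BLK_DIRTY) {
		if (blk->phys < 0)
			blk->phys = 1000 + (long)blk->index;	/* pretend allocation */
		blk->dirty = 1;
	}
	return 0;
}

/* A delayed-allocation filesystem only reserves space here and maps the
 * block later, at writeback time. */
static int delalloc_fs_event(struct toy_block *blk, enum blk_event ev)
{
	if (ev == BLK_DIRTY) {
		blk->delalloc = 1;
		blk->dirty = 1;
	}
	return 0;
}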
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-15 21:44 ` Theodore Tso
  2009-03-15 22:41   ` Daniel Phillips
@ 2009-03-16  5:12 ` Dave Chinner
  2009-03-16  6:38   ` Theodore Tso
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2009-03-16 5:12 UTC (permalink / raw)
  To: Theodore Tso, Nick Piggin, Daniel Phillips, linux-fsdevel, tux3, Andrew Morton

On Sun, Mar 15, 2009 at 05:44:26PM -0400, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata block (i.e., what is currently called buffer cache) at the time it is dirtied. I don't know if this is the best thing to do, but it is interesting that you do the same thing. I also don't know if I want to trust a library to get this right, before having completely proved out the idea in a non-trivial filesystem. But good luck with that! It
> >
> > I'm not sure why it would be a big problem. fsblock isn't allocating the block itself of course, it just asks the filesystem to. It's trivial to do for fsblock.
>
> So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location.

This is no different to the way delayed allocation with bufferheads works. Both XFS and ext4 set the buffer_delay flag instead of allocating up front so that later on in ->writepages we can do optimal delayed allocation. AFAICT fsblock works the same way....

> XFS, for example, keeps a btree of free regions indexed by size so that it can select the perfect location for a newly written file which is 24k or 56k long.

Ah, no. It's far more complex than that. To begin with, XFS has *two* freespace trees per allocation group - one indexed by extent size, the other by extent starting block. XFS looks for an exact or nearby extent start block match that is big enough in the by-block tree. If it can't find a nearby match, then it looks up a size match in the by-size tree. i.e. the fundamental allocation assumption is that locality of data placement matters far more than filling holes in the freespace trees.....

> In addition, XFS uses delayed allocation to avoid the problem of uninitialized data becoming visible in the event of a crash.

No it doesn't. Delayed allocation minimises the problem but doesn't prevent it. It has been known for years (since before I joined SGI in 2002) that there is a theoretical timing gap in XFS where the allocation transaction can commit and a crash occur before data hits the disk, hence exposing stale data. The reality is that no-one has ever reported exposing stale data in this scenario, and there has been plenty of effort expended trying to trigger it. Hence it has remained in the realm of a theoretical problem....

> If fsblock immediately allocates the physical block, then either the uninitialized data might become visible after a system crash (which is a security problem), or XFS is going to have to force all newly written data blocks to disk before a commit. If that sounds familiar, it's what ext3's data=ordered mode does, and it's what is responsible for the Firefox 3.0 fsync performance problem.

If this were to occur, the obvious solution to this problem is to allocate unwritten extents and do the conversion after data I/O completion.
That would result in correct metadata/data ordering in all cases with only a small performance impact and without introducing ext3-sync-the-world-like issues...

Ted, I appreciate you telling the world over and over again how bad XFS is and what you think needs to be done to fix it. Truth is, this would have been a much better email had you written about it from an ext4 perspective. That way it wouldn't have been full of errors or sounded like a kid caught with his hand in the cookie jar: "It's not my fault! I was only copying XFS! He did it first!"

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 23+ messages in thread
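For readers following the allocator details, here is a toy sketch of the "locality first, size second" policy outlined above. It is plain C with invented structures, not the XFS allocator, and linear scans stand in for the two btree lookups.

/* Toy model of the two-lookup policy described above.  Not the XFS
 * allocator: invented structures, and linear scans replace the btrees. */
#include <stddef.h>

struct free_extent { long start; long len; };

struct ag_freespace {
	const struct free_extent *by_block;	/* sorted by starting block */
	const struct free_extent *by_size;	/* sorted by extent length */
	int nr;
};

static long distance(long a, long b)
{
	return a > b ? a - b : b - a;
}

static const struct free_extent *
toy_alloc(const struct ag_freespace *ag, long hint, long want, long max_dist)
{
	const struct free_extent *best = NULL;
	int i;

	/* 1. Locality: the extent nearest to the hint that is big enough
	 *    (stands in for the by-block btree lookup). */
	for (i = 0; i < ag->nr; i++) {
		const struct free_extent *fe = &ag->by_block[i];

		if (fe->len < want || distance(fe->start, hint) > max_dist)
			continue;
		if (!best || distance(fe->start, hint) < distance(best->start, hint))
			best = fe;
	}
	if (best)
		return best;

	/* 2. Otherwise, a size match: the first (smallest) extent in the
	 *    by-size index that is large enough. */
	for (i = 0; i < ag->nr; i++)
		if (ag->by_size[i].len >= want)
			return &ag->by_size[i];

	return NULL;	/* no suitable free space in this allocation group */
}

The point is only the ordering of the two lookups: a nearby extent that is big enough wins over a better size fit somewhere far away.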
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-16  5:12 ` Dave Chinner
@ 2009-03-16  6:38 ` Theodore Tso
  2009-03-16 10:14   ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2009-03-16 6:38 UTC (permalink / raw)
  To: Nick Piggin, Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

Dave,

It wasn't my intention to say that XFS was bad; in fact, I thought I was actually complimenting XFS by talking about some of the advanced features that XFS has (many of which I have always said that ext3 lacks, and some of which ext4 still does not have, and probably never will have). I stand corrected on some of the details that I got wrong.

What I was trying to say was that *if* (and perhaps I'm misunderstanding fsblock) fsblock requires that, as soon as a page is dirty, the filesystem assign a block allocation to the buffers attached to the dirty page, then this would spike out delayed allocation, which would be unfortunate for *both* ext4 and XFS.

But maybe I'm misunderstanding what fsblock is doing, and there isn't a problem here.

Regards,

- Ted

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available
  2009-03-16  6:38 ` Theodore Tso
@ 2009-03-16 10:14 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-16 10:14 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Daniel Phillips

On Monday 16 March 2009 17:38:30 Theodore Tso wrote:
> Dave,
>
> It wasn't my intention to say that XFS was bad; in fact, I thought I was actually complimenting XFS by talking about some of the advanced features that XFS has (many of which I have always said that ext3 lacks, and some of which ext4 still does not have, and probably never will have). I stand corrected on some of the details that I got wrong.
>
> What I was trying to say was that *if* (and perhaps I'm misunderstanding fsblock) fsblock requires that, as soon as a page is dirty, the filesystem assign a block allocation to the buffers attached to the dirty page, then this would spike out delayed allocation, which would be unfortunate for *both* ext4 and XFS.
>
> But maybe I'm misunderstanding what fsblock is doing, and there isn't a problem here.

Yeah, Dave's understanding of fsblock is correct. I might have stated things confusingly... fsblock allocates the in-memory fsblock metadata structure (~= struct buffer_head) at the time of block dirtying. It also asks the filesystem to respond to the dirtying event appropriately. In the case of say ext2, this means allocating a block on disk. Whereas XFS does the delalloc/reserve thing (yes, XFS appears to be working with fsblock well enough to get this far).

fsblock really isn't too much different to buffer_head from an abstract capability / functionality point of view, except that it is often more strict where I feel it makes sense. So for this particular example: in buffer.c, buffers do tend to be allocated when a page is dirtied, but not always, and even when they are, they can get reclaimed while the page is still dirty. fsblock tightens this up.

^ permalink raw reply [flat|nested] 23+ messages in thread
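A small sketch of the stricter lifetime rule described above: the per-block structure is created when the block is dirtied, the filesystem is told about the event, and the structure stays pinned until the block is clean again. Plain C with invented names; fsblock itself was never merged, so none of this is a real kernel API.

/* Illustrative lifetime rule: created at dirty time, pinned while dirty,
 * freed proactively once clean.  Invented names, not kernel code. */
#include <assert.h>
#include <stdlib.h>

struct pinned_block {
	int refcount;
	int dirty;
	long phys;	/* -1 until the filesystem maps or allocates it */
};

static struct pinned_block *
block_dirty(struct pinned_block *blk, void (*fs_dirty_event)(struct pinned_block *))
{
	if (!blk) {
		blk = calloc(1, sizeof(*blk));	/* created at dirty time */
		if (!blk)
			return NULL;
		blk->phys = -1;
	}
	blk->refcount++;			/* pinned while dirty */
	blk->dirty = 1;
	fs_dirty_event(blk);			/* ext2: allocate; XFS: reserve */
	return blk;
}

static struct pinned_block *block_clean(struct pinned_block *blk)
{
	assert(blk && blk->dirty);
	blk->dirty = 0;
	if (--blk->refcount == 0) {		/* freed once nothing needs it */
		free(blk);
		return NULL;
	}
	return blk;
}

Compare this with buffer heads, which, as noted above, can sometimes be reclaimed while the page is still dirty.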
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin
  2009-03-12 12:24   ` Daniel Phillips
@ 2009-03-12 17:06 ` Theodore Tso
  2009-03-13  9:32   ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2009-03-12 17:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to).
>
> Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock.
>
> ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it.

One possibility would be to do this with ext4 instead, since there are fewer users, and it has more of a "development" feel to it. OTOH, there are people (including myself) who are using ext4 in production already, and I'd appreciate not having my source trees on my laptop getting toasted. :-)

Is it going to be possible to make the fsblock conversion something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the changes too invasive to really allow that? (Also note BTW that ocfs2 is also using jbd2, so we need to be careful we don't break ocfs2 while we're doing the fsblock conversion.)

- Ted

^ permalink raw reply [flat|nested] 23+ messages in thread
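For illustration only, the kind of compile-time switch being asked about might look like the following toy. CONFIG_EXT4_FSBLOCK never existed, the helpers are invented, and this is plain C showing one function built two ways, not ext4 code.

/* Toy compile-time switch between two code paths; invented names. */
#include <stdio.h>

#ifdef CONFIG_EXT4_FSBLOCK
static const char *metadata_io_engine(void)
{
	return "fsblock";	/* converted code path */
}
#else
static const char *metadata_io_engine(void)
{
	return "buffer_head";	/* existing code path */
}
#endif

int main(void)
{
	/* Build with -DCONFIG_EXT4_FSBLOCK to select the converted path. */
	printf("ext4 metadata I/O via %s\n", metadata_io_engine());
	return 0;
}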
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-12 17:06 ` [Tux3] " Theodore Tso
@ 2009-03-13  9:32 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-13 9:32 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Friday 13 March 2009 04:06:53 Theodore Tso wrote:
> On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> > It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to).
> >
> > Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock.
> >
> > ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it.
>
> One possibility would be to do this with ext4 instead, since there are fewer users, and it has more of a "development" feel to it. OTOH, there

Yes, I think ext4 would be the best candidate for the next conversion.

> are people (including myself) who are using ext4 in production already, and I'd appreciate not having my source trees on my laptop getting toasted. :-)

Definitely ;)

> Is it going to be possible to make the fsblock conversion something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the changes too invasive to really allow that? (Also note BTW that ocfs2 is also using jbd2, so we need to be careful we don't break ocfs2 while we're doing the fsblock conversion.)

Hmm, it would be difficult. I think once I get a patch working, it wouldn't be too hard to maintain out of tree though (there tends to be just a smallish number of patterns used many times). I'd start by doing that, and see how it looks from there.

^ permalink raw reply [flat|nested] 23+ messages in thread
Thread overview: 23+ messages
[not found] <200903110925.37614.phillips@phunq.net>
[not found] ` <200903122010.31282.nickpiggin@yahoo.com.au>
[not found] ` <200903120315.07610.phillips@phunq.net>
2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin
2009-03-12 12:24 ` Daniel Phillips
2009-03-12 12:32 ` Matthew Wilcox
2009-03-12 12:45 ` Nick Piggin
2009-03-12 13:12 ` [Tux3] " Daniel Phillips
2009-03-12 13:06 ` Daniel Phillips
2009-03-12 13:04 ` Nick Piggin
2009-03-12 13:59 ` [Tux3] " Matthew Wilcox
2009-03-12 14:19 ` Nick Piggin
2009-03-15 3:24 ` Daniel Phillips
2009-03-15 3:50 ` [Tux3] " Nick Piggin
2009-03-15 4:08 ` Daniel Phillips
2009-03-15 4:14 ` [Tux3] " Nick Piggin
2009-03-15 2:41 ` Daniel Phillips
2009-03-15 3:45 ` Nick Piggin
2009-03-15 21:44 ` Theodore Tso
2009-03-15 22:41 ` Daniel Phillips
2009-03-16 10:32 ` Nick Piggin
2009-03-16 5:12 ` Dave Chinner
2009-03-16 6:38 ` Theodore Tso
2009-03-16 10:14 ` Nick Piggin
2009-03-12 17:06 ` [Tux3] " Theodore Tso
2009-03-13 9:32 ` Nick Piggin