* Re: [Tux3] Tux3 report: Tux3 Git tree available [not found] ` <200903120315.07610.phillips@phunq.net> @ 2009-03-12 11:03 ` Nick Piggin 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 17:06 ` [Tux3] " Theodore Tso 0 siblings, 2 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 11:03 UTC (permalink / raw) To: Daniel Phillips, linux-fsdevel; +Cc: tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote: > By the way, I just spotted your fsblock effort on LWN, and I should > mention there is a lot of commonality with a side project we have > going in Tux3, called "block handles", which aims to get rid of buffers > entirely, leaving a tiny structure attached to the page->private that > just records the block states. Currently, four bits per block. This > can be done entirely _within_ a filesystem. We are already running > some of the code that has to be in place before switching over to this > model. > > Tux3 block handles (as prototyped, not in the current code base) are > 16 bytes per page, which for 1K block size on a 32 bit arch is a factor > of 14 savings, more on 64 bit arch. More importantly, it saves lots of > individual slab allocations, a feature I gather is part of fsblock as > well. That's interesting. Do you handle 1K block sizes with 64K page size? :) fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah, so it will do a single 80 byte allocation for 4K page 1K block. That's good for cache efficiency. As far as total # slab allocations themselves go, fsblock probably tends to do more of them than buffer.c because it frees them proactively when their refcounts reach 0 (by default, one can switch to a lazy mode like buffer heads). That's one of the most important things, so we don't end up with lots of these things lying around. eg. I could make it 16 bytes I think, but it would be a little harder and would make support for block size > page size much harder so I wouldn't bother. Or share the refcount field for all blocks in a page and just duplicate state etc, but it just makes code larger and slower and harder. > We represent up to 8 block states in one 16 byte object. To make this > work, we don't try to emulate the behavior of the venerable block > library, but first refactor the filesystem data flow so that we are > only using the parts of buffer_head that will be emulated by the block > handle. For example, no caching of physical block address. It keeps > changing in Tux3 anyway, so this is really a useless thing to do. fsblocks in their refcount mode don't tend to _cache_ physical block addresses either, because they're only kept around for as long as they are required (eg. to write out the page to avoid memory allocation deadlock problems). But some filesystems don't do very fast block lookups and do want a cache. I did a little extent map library on the side for that. > Anyway, that is more than I meant to write about it. Good luck to you, > you will need it. Keep in mind that some of the nastiest kernel bugs > ever arose from interactions between page and buffer state bits. You Yes, I even fixed several of them too :) fsblock simplifies a lot of those games. It protects pagecache state and fsblock state for all associated blocks under a lock, so no weird ordering issues, and the two are always kept coherent (to the point that I can do writeout by walking dirty fsblocks in block device sector-order, although that requires bloat to fsblock struct and isn't straightforward with delalloc).
Of course it is new code so it will have more bugs, but it is better code. > may run into understandable reluctance to change stable filesystems > over to the new model. Even if it reduces the chance for confusion, > the fact that it touches cache state at all is going to make people > jumpy. I would suggest that you get Ext3 working perfectly to prove > your idea, no small task. One advantage: Ext3 uses block handles for > directories, as opposed to Ext2 which operates on pages. So Ext3 will > be easier to deal with in that one area. But with Ext3 you have to > deal with jbd too, which is going to keep you busy for a while. It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to). Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock. ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it. > The block handles patch is one of those fun things we have on hold for > the time being while we get the more mundane Good luck with it. I suspect that doing filesystem-specific layers to duplicate basically the same functionality but slightly optimised for the specific filesystem may not be a big win. As you say, this is where lots of nasty problems have been, so sharing as much code as possible is a really good idea. I would be very interested in anything like this that could beat fsblock in functionality or performance anywhere, even if it is taking shortcuts by being less generic. If there is a significant gain to be had from less generic, perhaps it could still be made into a library usable by more than 1 fs. ^ permalink raw reply [flat|nested] 23+ messages in thread
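To make the space figures in this exchange concrete, here is a minimal sketch of the kind of per-page object the block-handles idea describes: four bits of state per block and one refcount shared by every block on the page. The names and layout are invented for illustration; this is not the actual Tux3 or fsblock code.

#include <stdint.h>
#include <assert.h>

/*
 * One of these hangs off page->private and covers every block in the
 * page.  With 1K blocks on a 4K page that is four blocks, so the whole
 * object is ~16 bytes, versus four separately slab-allocated
 * buffer_heads of roughly 56 bytes each on a 32-bit arch.
 */
struct block_handles {
	uint32_t states;	/* 4 bits of state per block, up to 8 blocks */
	uint16_t refcount;	/* shared by all blocks on the page */
	uint16_t lockmap;	/* one read-lock bit per block */
	void *page;		/* back pointer to the owning page */
};

static inline unsigned get_block_state(const struct block_handles *h, unsigned i)
{
	assert(i < 8);
	return (h->states >> (4 * i)) & 0xf;
}

static inline void set_block_state(struct block_handles *h, unsigned i, unsigned state)
{
	assert(i < 8 && state <= 0xf);
	h->states = (h->states & ~(0xfu << (4 * i))) | (state << (4 * i));
}

Eight four-bit states fit in the single 32-bit map, which is where the "up to 8 block states in one 16 byte object" figure quoted above comes from.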
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin @ 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 13:04 ` Nick Piggin 2009-03-12 17:06 ` [Tux3] " Theodore Tso 1 sibling, 2 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 12:24 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote: > > > By the way, I just spotted your fsblock effort on LWN, and I should > > mention there is a lot of commonality with a side project we have > > going in Tux3, called "block handles", which aims to get rid of buffers > > entirely, leaving a tiny structure attached to the page->private that > > just records the block states. Currently, four bits per block. This > > can be done entirely _within_ a filesystem. We are already running > > some of the code that has to be in place before switching over to this > > model. > > > > Tux3 block handles (as prototyped, not in the current code base) are > > 16 bytes per page, which for 1K block size on a 32 bit arch is a factor > > of 14 savings, more on 64 bit arch. More importantly, it saves lots of > > individual slab allocations, a feature I gather is part of fsblock as > > well. > > That's interesting. Do you handle 1K block sizes with 64K page size? :) Not in its current incarnation. That would require 32 bytes worth of state while the current code just has a 4 byte map (4 bits X 8 blocks). I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has somebody spotted a 64K page? > fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah, > so it will do a single 80 byte allocation for 4K page 1K block. > > That's good for cache efficiency. As far as total # slab allocations > themselves go, fsblock probably tends to do more of them than buffer.c > because it frees them proactively when their refcounts reach 0 (by > default, one can switch to a lazy mode like buffer heads). I think that's a very good thing to do and intend to do the same. If it shows on a profiler, then the filesystem should keep its own free list to avoid whatever slab thing creates the bottleneck. > That's one of the most important things, so we don't end up with lots > of these things lying around. Amen. Doing nothing. > eg. I could make it 16 bytes I think, but it would be a little harder > and would make support for block size > page size much harder so I > wouldn't bother. Or share the refcount field for all blocks in a page > and just duplicate state etc, but it just makes code larger and slower > and harder. Right, block handles share the refcount for all blocks on one page. > > We represent up to 8 block states in one 16 byte object. To make this > > work, we don't try to emulate the behavior of the venerable block > > library, but first refactor the filesystem data flow so that we are > > only using the parts of buffer_head that will be emulated by the block > > handle. For example, no caching of physical block address. It keeps > > changing in Tux3 anyway, so this is really a useless thing to do. > > fsblocks in their refcount mode don't tend to _cache_ physical block addresses > either, because they're only kept around for as long as they are required > (eg. to write out the page to avoid memory allocation deadlock problems). 
> > But some filesystems don't do very fast block lookups and do want a cache. > I did a little extent map library on the side for that. Sure, good plan. We are attacking the transfer path, so that all the transfer state goes directly from the filesystem into a BIO and doesn't need that twisty path back and forth to the block library. The BIO remembers the physical address across the transfer cycle. If you must still support those twisty paths for compatibility with the existing buffer.c scheme, you have a much harder project. > > Anyway, that is more than I meant to write about it. Good luck to you, > > you will need it. Keep in mind that some of the nastiest kernel bugs > > ever arose from interactions between page and buffer state bits. You > > Yes, I even fixed several of them too :) > > fsblock simplifies a lot of those games. It protects pagecache state and > fsblock state for all assocated blocks under a lock, so no weird ordering > issues, and the two are always kept coherent (to the point that I can do > writeout by walking dirty fsblocks in block device sector-order, although > that requires bloat to fsblock struct and isn't straightforward with > delalloc). > > Of course it is new code so it will have more bugs, but it is better code. I've started poking at it, starting with the manifesto. It's a pretty big reading project. > > The block handles patch is one of those fun things we have on hold for > > the time being while we get the more mundane > > Good luck with it. I suspect that doing filesystem-specific layers to > duplicate basically the same functionality but slightly optimised for > the specific filesystem may not be a big win. As you say, this is where > lots of nasty problems have been, so sharing as much code as possible > is a really good idea. The big win will come from avoiding the use of struct buffer_head as an API element for mapping logical cache to disk, which is a narrow constriction when the filesystem wants to do things with extents in btrees. It is quite painful doing a btree probe for every ->get_block the way it is now. We want probe... page page page page... submit bio (or put it on a list for delayed allocation). Once we have the desired, nice straight path above then we don't need most of the fields in buffer_head, so tightening it up into a bitmap, a refcount and a pointer back to the page makes a lot of sense. This in itself may not make a huge difference, but the reduction in cache pressure ought to be measurable and worth the not very many lines of code for the implementation. > I would be very interested in anything like this that could beat fsblock > in functionality or performance anywhere, even if it is taking shortcuts > by being less generic. > > If there is a significant gain to be had from less generic, perhaps it > could still be made into a library usable by more than 1 fs. I don't see any reason right off that it is not generic, except that it does not try to fill the API role that buffer_head has, and so it isn't a small, easy change to an existing filesystem. It ought to be useful for new designs though. Mind you, the code hasn't been tried yet, it is currently just a state-smashing API waiting for the filesystem to evolve into the necessary shape, which is going to take another month or two. 
The Tux3 userspace buffer emulation already works much like the kernel block handles will work, in that it doesn't cache a physical address, and maintains cache state as a scalar value instead of a set of bits, so we already have a fair amount of experience with the model. When it does get to the top of the list of things to do, it should slot in smoothly. At that point we could hand it to you to try your generic API, which seems to implement similar ideas. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
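The proactive freeing behaviour both sides describe, create the per-page object lazily, hang it off page->private, and drop it the moment the last reference goes away, might look roughly like the following, building on the block_handles sketch above. The helper names are invented and all locking is omitted; this is not the actual fsblock or Tux3 code.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/slab.h>

static struct block_handles *page_handles(struct page *page)
{
	return PagePrivate(page) ? (struct block_handles *)page_private(page) : NULL;
}

static struct block_handles *get_handles(struct page *page)
{
	struct block_handles *h = page_handles(page);

	if (!h) {
		h = kzalloc(sizeof(*h), GFP_NOFS);
		if (!h)
			return NULL;
		h->page = page;
		set_page_private(page, (unsigned long)h);
		SetPagePrivate(page);
	}
	h->refcount++;		/* one count shared by all blocks on the page */
	return h;
}

static void put_handles(struct page *page)
{
	struct block_handles *h = page_handles(page);

	if (h && !--h->refcount) {	/* free proactively, not lazily */
		ClearPagePrivate(page);
		set_page_private(page, 0);
		kfree(h);
	}
}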
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:24 ` Daniel Phillips @ 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 12:45 ` Nick Piggin 2009-03-12 13:06 ` Daniel Phillips 2009-03-12 13:04 ` Nick Piggin 1 sibling, 2 replies; 23+ messages in thread From: Matthew Wilcox @ 2009-03-12 12:32 UTC (permalink / raw) To: Daniel Phillips Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > Not in its current incarnation. That would require 32 bytes worth of > state while the current code just has a 4 byte map (4 bits X 8 blocks). > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > somebody spotted a 64K page? I believe SGI ship their ia64 kernels configured this way. Certainly 16k ia64 kernels are common, which would (if I understand your scheme correctly) be 8 bytes worth of state in your scheme. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 12:32 ` Matthew Wilcox @ 2009-03-12 12:45 ` Nick Piggin 2009-03-12 13:12 ` [Tux3] " Daniel Phillips 2009-03-12 13:06 ` Daniel Phillips 1 sibling, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-12 12:45 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Daniel Phillips On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote: > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > On Thursday 12 March 2009, Nick Piggin wrote: > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > Not in its current incarnation. That would require 32 bytes worth of > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > somebody spotted a 64K page? > > I believe SGI ship their ia64 kernels configured this way. Certainly > 16k ia64 kernels are common, which would (if I understand your scheme > correctly) be 8 bytes worth of state in your scheme. I think some distros will (or do) ship configs with 64K page size for ia64 and powerpc too. I think I have heard of people using 64K pages with ARM. There was some (public) talk of x86-64 getting a 16K or 64K page size too (and even if not HW, some people want to be able to go bigger SW pagecache size). I wouldn't expect 64K page and 1K block to be worth optimising for (although 64K page systems could easily use older or shared 4K block filesystems). But just keep in mind that a good solution should not rely on PAGE_CACHE_SIZE for correctness. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:45 ` Nick Piggin @ 2009-03-12 13:12 ` Daniel Phillips 0 siblings, 0 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 13:12 UTC (permalink / raw) To: Nick Piggin Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote: > > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > > On Thursday 12 March 2009, Nick Piggin wrote: > > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > > > Not in its current incarnation. That would require 32 bytes worth of > > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > > somebody spotted a 64K page? > > > > I believe SGI ship their ia64 kernels configured this way. Certainly > > 16k ia64 kernels are common, which would (if I understand your scheme > > correctly) be 8 bytes worth of state in your scheme. > > I think some distros will (or do) ship configs with 64K page size for > ia64 and powerpc too. I think I have heard of people using 64K pages > with ARM. There was some (public) talk of x86-64 getting a 16K or 64K > page size too (and even if not HW, some people want to be able to go > bigger SW pagecache size). > > I wouldn't expect 64K page and 1K block to be worth optimising for > (although 64K page systems could easily use older or shared 4K block > filesystems). But just keep in mind that a good solution should not > rely on PAGE_CACHE_SIZE for correctness. Not worth optimizing for, but it better work. Which the current nasty circular buffer list will, and I better keep that in mind for the next round of effort on block handles. On the other hand, 4K blocks on 64K pages better work really well or those 64K systems will be turkeys. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 12:32 ` Matthew Wilcox 2009-03-12 12:45 ` Nick Piggin @ 2009-03-12 13:06 ` Daniel Phillips 1 sibling, 0 replies; 23+ messages in thread From: Daniel Phillips @ 2009-03-12 13:06 UTC (permalink / raw) To: Matthew Wilcox Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Matthew Wilcox wrote: > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote: > > On Thursday 12 March 2009, Nick Piggin wrote: > > > That's interesting. Do you handle 1K block sizes with 64K page size? :) > > > > Not in its current incarnation. That would require 32 bytes worth of > > state while the current code just has a 4 byte map (4 bits X 8 blocks). > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has > > somebody spotted a 64K page? > > I believe SGI ship their ia64 kernels configured this way. Certainly > 16k ia64 kernels are common, which would (if I understand your scheme > correctly) be 8 bytes worth of state in your scheme. Yes, correct, and after that the state object would have to expand by a binary factor, which probably doesn't matter because at that scale it is really small compared to the blocks it maps. And the mapped blocks should just be metadata like index nodes, directory entry blocks and bitmap blocks, which need per block data handles and locking while regular file data can work in full pages, which is the same equation that keeps the pain of buffer_heads down to a dull roar. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
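A tiny standalone sketch of the sizing arithmetic being discussed here, 4 bits of state per block packed into whole 32-bit words; the figures it prints match the ones quoted above (illustrative only):

#include <stdio.h>

/* bytes of block-state map needed per page at 4 bits per block */
static unsigned map_bytes(unsigned page_size, unsigned block_size)
{
	unsigned blocks = page_size / block_size;	/* blocks per page */
	unsigned bits = 4 * blocks;			/* 4 state bits each */
	return 4 * ((bits + 31) / 32);			/* round up to u32 words */
}

int main(void)
{
	printf("4K page, 1K blocks:  %u bytes\n", map_bytes(4096, 1024));	/* 4  */
	printf("16K page, 1K blocks: %u bytes\n", map_bytes(16384, 1024));	/* 8  */
	printf("64K page, 1K blocks: %u bytes\n", map_bytes(65536, 1024));	/* 32 */
	printf("64K page, 4K blocks: %u bytes\n", map_bytes(65536, 4096));	/* 8  */
	return 0;
}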
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 12:24 ` Daniel Phillips 2009-03-12 12:32 ` Matthew Wilcox @ 2009-03-12 13:04 ` Nick Piggin 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox 2009-03-15 2:41 ` Daniel Phillips 1 sibling, 2 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 13:04 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > That's good for cache efficiency. As far as total # slab allocations > > themselves go, fsblock probably tends to do more of them than buffer.c > > because it frees them proactively when their refcounts reach 0 (by > > default, one can switch to a lazy mode like buffer heads). > > I think that's a very good thing to do and intend to do the same. If > it shows on a profiler, then the filesystem should keep its own free > list to avoid whatever slab thing creates the bottleneck. slab allocation/free fastpath is on the order of 100 cycles, which is a cache miss. I have a feeling that actually doing lots of allocs and frees can work better because it keeps reusing the same memory for different objects being operated on, so you get fewer cache misses. (anyway it doesn't seem to be measurable in fsblock when switching between cached and refcounted mode). > > fsblocks in their refcount mode don't tend to _cache_ physical block > > addresses either, because they're only kept around for as long as they > > are required (eg. to write out the page to avoid memory allocation > > deadlock problems). > > > > But some filesystems don't do very fast block lookups and do want a > > cache. I did a little extent map library on the side for that. > > Sure, good plan. We are attacking the transfer path, so that all the > transfer state goes directly from the filesystem into a BIO and doesn't > need that twisty path back and forth to the block library. The BIO > remembers the physical address across the transfer cycle. If you must > still support those twisty paths for compatibility with the existing > buffer.c scheme, you have a much harder project. I don't quite know what you mean. You have a set of dirty cache that needs to be written. So you need to know the block addresses in order to create the bio of course. fsblock allocates the block and maps[*] the block at pagecache *dirty* time, and holds onto it until writeout is finished. In something like ext2, finding the offset->block map can require buffercache allocations so it is technically deadlocky if you have to do it at writeout time. [*] except in the case of delalloc. fsblock does its best, but for complex filesystems like delalloc, some memory reservation would have to be done by the fs. > > > The block handles patch is one of those fun things we have on hold for > > > the time being while we get the more mundane > > > > Good luck with it. I suspect that doing filesystem-specific layers to > > duplicate basically the same functionality but slightly optimised for > > the specific filesystem may not be a big win. As you say, this is where > > lots of nasty problems have been, so sharing as much code as possible > > is a really good idea. > > The big win will come from avoiding the use of struct buffer_head as > an API element for mapping logical cache to disk, which is a narrow > constriction when the filesystem wants to do things with extents in > btrees. It is quite painful doing a btree probe for every ->get_block > the way it is now. 
We want probe... page page page page... submit bio > (or put it on a list for delayed allocation). > > Once we have the desired, nice straight path above then we don't need > most of the fields in buffer_head, so tightening it up into a bitmap, > a refcount and a pointer back to the page makes a lot of sense. This > in itself may not make a huge difference, but the reduction in cache > pressure ought to be measurable and worth the not very many lines of > code for the implementation. I haven't done much about this in fsblock yet. I think some things need a bit of changing in the pagecache layer (in the block library, eg. write_begin/write_end doesn't have enough info to reserve/allocate a big range of blocks -- we need a callback higher up to tell the filesystem that we will be writing xxx range in the file, so get things ready for us). As far as the per-block pagecache state (as opposed to the per-block fs state), I don't see any reason it is a problem for efficiency. We have to do per-page operations anyway. > > I would be very interested in anything like this that could beat fsblock > > in functionality or performance anywhere, even if it is taking shortcuts > > by being less generic. > > > > If there is a significant gain to be had from less generic, perhaps it > > could still be made into a library usable by more than 1 fs. > > I don't see any reason right off that it is not generic, except that it > does not try to fill the API role that buffer_head has, and so it isn't > a small, easy change to an existing filesystem. It ought to be useful > for new designs though. Mind you, the code hasn't been tried yet, it > is currently just a state-smashing API waiting for the filesystem to > evolve into the necessary shape, which is going to take another month > or two. > > The Tux3 userspace buffer emulation already works much like the kernel > block handles will work, in that it doesn't cache a physical address, > and maintains cache state as a scalar value instead of a set of bits, > so we already have a fair amount of experience with the model. When it > does get to the top of the list of things to do, it should slot in > smoothly. At that point we could hand it to you to try your generic > API, which seems to implement similar ideas. Cool. I will be interested to see how it works. Thanks, Nick ^ permalink raw reply [flat|nested] 23+ messages in thread
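The "probe... page page page page... submit bio" flow being quoted here could be sketched roughly as below: one mapping probe covers a whole extent, then the pages that extent spans are stuffed into a single bio. fs_probe_extent() is an invented stand-in for a filesystem's extent lookup, and page locking, dirty checks, bio completion handling and error paths are all omitted, so this is only the shape of the path, not working writeout code.

#include <linux/bio.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

struct extent {
	sector_t start;		/* first physical sector of the run */
	unsigned pages;		/* length in whole pages, assumed >= 1 */
};

/* filesystem-provided (hypothetical): one btree probe maps a whole run */
int fs_probe_extent(struct inode *inode, pgoff_t index, struct extent *ex);

static int write_dirty_range(struct inode *inode, pgoff_t index, pgoff_t end)
{
	while (index < end) {
		struct extent ex;
		struct bio *bio;
		pgoff_t last;
		int err;

		err = fs_probe_extent(inode, index, &ex);	/* probe... */
		if (err)
			return err;

		bio = bio_alloc(GFP_NOFS, ex.pages);
		bio->bi_sector = ex.start;
		bio->bi_bdev = inode->i_sb->s_bdev;

		/* ...page page page page... */
		for (last = index + ex.pages; index < last && index < end; index++) {
			struct page *page = find_get_page(inode->i_mapping, index);
			if (page) {
				bio_add_page(bio, page, PAGE_CACHE_SIZE, 0);
				page_cache_release(page);
			}
		}
		submit_bio(WRITE, bio);		/* ...submit bio */
	}
	return 0;
}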
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:04 ` Nick Piggin @ 2009-03-12 13:59 ` Matthew Wilcox 2009-03-12 14:19 ` Nick Piggin 2009-03-15 3:24 ` Daniel Phillips 2009-03-15 2:41 ` Daniel Phillips 1 sibling, 2 replies; 23+ messages in thread From: Matthew Wilcox @ 2009-03-12 13:59 UTC (permalink / raw) To: Nick Piggin Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > As far as the per-block pagecache state (as opposed to the per-block fs > state), I don't see any reason it is a problem for efficiency. We have to > do per-page operations anyway. Why? Wouldn't it be nice if we could do arbitrary extents? I suppose superpages or soft page sizes get us most of the way there, but the rounding or pieces at the end are a bit of a pain. Sure, it'll be a huge upheaval for the VM, but we're good at huge upheavals ;-) -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox @ 2009-03-12 14:19 ` Nick Piggin 0 siblings, 0 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-12 14:19 UTC (permalink / raw) To: Matthew Wilcox Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Friday 13 March 2009 00:59:40 Matthew Wilcox wrote: > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > superpages or soft page sizes get us most of the way there, but the > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > huge upheaval for the VM, but we're good at huge upheavals ;-) Sounds nice in theory. I personally think it would be very very hard to do well, and wouldn't end up being much of a win if at all anyway. But if you have some structure representing arbitrary pagecache extents, then you would probably be able to naturally use those pagecache extents themselves, or parallel very similar data structures for holding block state. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox 2009-03-12 14:19 ` Nick Piggin @ 2009-03-15 3:24 ` Daniel Phillips 2009-03-15 3:50 ` [Tux3] " Nick Piggin 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 3:24 UTC (permalink / raw) To: Matthew Wilcox Cc: linux-fsdevel, Nick Piggin, Andrew Morton, linux-kernel, tux3 On Thursday 12 March 2009, Matthew Wilcox wrote: > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > superpages or soft page sizes get us most of the way there, but the > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > huge upheaval for the VM, but we're good at huge upheavals ;-) Actually, filesystem extents tend to erode the argument for superpages. There are three reasons we have seen for wanting big pages: 1) support larger block buffers without adding messy changes to buffer.c; 2) TLB efficiency; 3) less per-page state in kernel memory. TLB efficiency is only there if the hardware supports it, which X86 arguably doesn't. The main argument for larger block buffers is less per-block transfer setup overhead, but the BIO model combined with filesystem extents does that job better, or at least it will when filesystems learn to take better advantage of this. VM extents on the other hand could possibly do a really good job of reducing per-page VM overhead, if anybody still cares about that now that 64 bit machines rule the big iron world. I expect implementing VM extents to be a brutally complex project, as filesystem extents always turn out to be, even though one tends to enter such projects thinking, how hard could this be? Answer: harder than you think. But VM extents would be good for a modest speedup, so somebody is sure to get brave enough to try it sometime. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 3:24 ` Daniel Phillips @ 2009-03-15 3:50 ` Nick Piggin 2009-03-15 4:08 ` Daniel Phillips 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-15 3:50 UTC (permalink / raw) To: Daniel Phillips Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > On Thursday 12 March 2009, Matthew Wilcox wrote: > > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote: > > > As far as the per-block pagecache state (as opposed to the per-block fs > > > state), I don't see any reason it is a problem for efficiency. We have > > > to do per-page operations anyway. > > > > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose > > superpages or soft page sizes get us most of the way there, but the > > rounding or pieces at the end are a bit of a pain. Sure, it'll be a > > huge upheaval for the VM, but we're good at huge upheavals ;-) > > Actually, filesystem extents tend to erode the argument for superpages. > There are three reasons we have seen for wanting big pages: 1) support > larger block buffers without adding messy changes to buffer.c; 2) TLB > efficiency; 3) less per-page state in kernel memory. TLB efficiency is > only there if the hardware supports it, which X86 arguably doesn't. > The main argument for larger block buffers is less per-block transfer > setup overhead, but the BIO model combined with filesystem extents > does that job better, or at least it will when filesystems learn to > take better advantage of this. > > VM extents on the other hand could possibly do a really good job of > reducing per-page VM overhead, if anybody still cares about that now > that 64 bit machines rule the big iron world. > > I expect implementing VM extents to be a brutally complex project, as > filesystem extents always turn out to be, even though one tends to > enter such projects thinking, how hard could this be? Answer: harder > than you think. But VM extents would be good for a modest speedup, so > somebody is sure to get brave enough to try it sometime. I don't think there is enough evidence to be able to make such an assertion. When you actually implement extent splitting and merging in a deadlock free manner and synchronize everything properly I wouldn't be surprised if it is slower most of the time. If it was significantly faster, then memory fragmentation means that it is going to get significantly slower over the uptime of the kernel, so you would have to virtually map the kernel and implement memory defragmentation, at which point you get even slower and more complex. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available 2009-03-15 3:50 ` [Tux3] " Nick Piggin @ 2009-03-15 4:08 ` Daniel Phillips 2009-03-15 4:14 ` [Tux3] " Nick Piggin 0 siblings, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 4:08 UTC (permalink / raw) To: Nick Piggin Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Matthew Wilcox On Saturday 14 March 2009, Nick Piggin wrote: > On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > > I expect implementing VM extents to be a brutally complex project, as > > filesystem extents always turn out to be, even though one tends to > > enter such projects thinking, how hard could this be? Answer: harder > > than you think. But VM extents would be good for a modest speedup, so > > somebody is sure to get brave enough to try it sometime. > > I don't think there is enough evidence to be able to make such an > assertion. > > When you actually implement extent splitting and merging in a deadlock > free manner and synchronize everything properly I wouldn't be surprised > if it is slower most of the time. If it was significantly faster, then > memory fragmentation means that it is going to get significantly slower > over the uptime of the kernel, so you would have to virtually map the > kernel and implement memory defragmentation, at which point you get even > slower and more complex. You can make exactly the same argument about filesystem extents, and we know that extents are faster there. So what is the fundamental difference? Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 4:08 ` Daniel Phillips @ 2009-03-15 4:14 ` Nick Piggin 0 siblings, 0 replies; 23+ messages in thread From: Nick Piggin @ 2009-03-15 4:14 UTC (permalink / raw) To: Daniel Phillips Cc: Matthew Wilcox, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 15:08:52 Daniel Phillips wrote: > On Saturday 14 March 2009, Nick Piggin wrote: > > On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote: > > > I expect implementing VM extents to be a brutally complex project, as > > > filesystem extents always turn out to be, even though one tends to > > > enter such projects thinking, how hard could this be? Answer: harder > > > than you think. But VM extents would be good for a modest speedup, so > > > somebody is sure to get brave enough to try it sometime. > > > > I don't think there is enough evidence to be able to make such an > > assertion. > > > > When you actually implement extent splitting and merging in a deadlock > > free manner and synchronize everything properly I wouldn't be surprised > > if it is slower most of the time. If it was significantly faster, then > > memory fragmentation means that it is going to get significantly slower > > over the uptime of the kernel, so you would have to virtually map the > > kernel and implement memory defragmentation, at which point you get even > > slower and more complex. > > You can make exactly the same argument about filesystem extents, and > we know that extents are faster there. So what is the fundamental > difference? Uh, aside from all the obvious fundamental differences there are, you can only make such an assertion if performance characteristics and usage patterns are very similar, nevermind fundamentally different... ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-12 13:04 ` Nick Piggin 2009-03-12 13:59 ` [Tux3] " Matthew Wilcox @ 2009-03-15 2:41 ` Daniel Phillips 2009-03-15 3:45 ` Nick Piggin 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 2:41 UTC (permalink / raw) To: Nick Piggin; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Thursday 12 March 2009, Nick Piggin wrote: > On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > > > fsblocks in their refcount mode don't tend to _cache_ physical block > > > addresses either, because they're only kept around for as long as > > > they are required (eg. to write out the page to avoid memory > > > allocation deadlock problems). > > > > > > But some filesystems don't do very fast block lookups and do want a > > > cache. I did a little extent map library on the side for that. > > > > > > Sure, good plan. We are attacking the transfer path, so that all the > > > transfer state goes directly from the filesystem into a BIO and doesn't > > > need that twisty path back and forth to the block library. The BIO > > > remembers the physical address across the transfer cycle. If you must > > > still support those twisty paths for compatibility with the existing > > > buffer.c scheme, you have a much harder project. > > > > I don't quite know what you mean. You have a set of dirty cache that > > needs to be written. So you need to know the block addresses in order > > to create the bio of course. > > > > fsblock allocates the block and maps[*] the block at pagecache *dirty* > > time, and holds onto it until writeout is finished. As it happens, Tux3 also physically allocates each _physical_ metadata block (i.e., what is currently called buffer cache) at the time it is dirtied. I don't know if this is the best thing to do, but it is interesting that you do the same thing. I also don't know if I want to trust a library to get this right, before having completely proved out the idea in a non-trivial filesystem. But good luck with that! It seems to me like a very good idea to take Ted up on his offer and try out your library on Ext4. This is just a gut feeling, but I think you will need many iterations to refine the idea. Just working, and even showing benchmark improvement is not enough. If it is a core API proposal, it needs a huge body of proof. If you end up like JBD with just one user, because it actually only implements the semantics of exactly one filesystem, then the extra overhead of unused generality will just mean more lines of code to maintain and more places for bugs to hide. This is all general philosophy of course. Actually reading your code would help a lot. By comparison, I intend the block handles library to be a few hundred lines of code, including new incarnations of buffer.c functionality like block_read/write_*. If this is indeed possible, and it does the job with 4 bytes per block on a 1K block/4K page configuration as it does in the prototype, then I think I would prefer a per-filesystem solution and let it evolve that way for a long time before attempting a library. But that is just me. I suppose you would like to see some code? > In something like > ext2, finding the offset->block map can require buffercache allocations > so it is technically deadlocky if you have to do it at writeout time. I am not sure what "technically" means. Pretty much everything you do in this area has high deadlock risk. That is one of the things that scares me about trying to handle every filesystem uniformly.
How would filesystem writers even know what the deadlock avoidance rules are, thus what they need to do in their own filesystem to avoid it? Anyway, the Tux3 reason for doing the allocation at dirty time is, this is the only time the filesystem knows what the parent block of a given metadata block is. Note that we move btree blocks around when they are dirtied, and thus need to know the parent in order to update the parent pointer to the child. This is a complication you will not run into in any of the filesystems you have poked at so far. This subtle detail is very much filesystem specific, or it is specific to the class of filesystems that do remap on write. Good luck knowing how to generalize that before Linux has seen even one of them up and doing real production work. > [*] except in the case of delalloc. fsblock does its best, but for > complex filesystems like delalloc, some memory reservation would have > to be done by the fs. And that is a whole, huge and critical topic. Again, something that I feel needs to be analyzed per filesystem, until we have considerably more experience with the issues. > > > > The block handles patch is one of those fun things we have on hold for > > > > the time being while we get the more mundane > > > > > > Good luck with it. I suspect that doing filesystem-specific layers to > > > duplicate basically the same functionality but slightly optimised for > > > the specific filesystem may not be a big win. As you say, this is where > > > lots of nasty problems have been, so sharing as much code as possible > > > is a really good idea. > > > > The big win will come from avoiding the use of struct buffer_head as > > an API element for mapping logical cache to disk, which is a narrow > > constriction when the filesystem wants to do things with extents in > > btrees. It is quite painful doing a btree probe for every ->get_block > > the way it is now. We want probe... page page page page... submit bio > > (or put it on a list for delayed allocation). > > > > Once we have the desired, nice straight path above then we don't need > > most of the fields in buffer_head, so tightening it up into a bitmap, > > a refcount and a pointer back to the page makes a lot of sense. This > > in itself may not make a huge difference, but the reduction in cache > > pressure ought to be measurable and worth the not very many lines of > > code for the implementation. > > I haven't done much about this in fsblock yet. I think some things need > a bit of changing in the pagecache layer (in the block library, eg. > write_begin/write_end doesn't have enough info to reserve/allocate a big > range of blocks -- we need a callback higher up to tell the filesystem > that we will be writing xxx range in the file, so get things ready for > us). That would be write_cache_pages, it already exists and seems perfectly serviceable. > As far as the per-block pagecache state (as opposed to the per-block fs > state), I don't see any reason it is a problem for efficiency. We have to > do per-page operations anyway. I don't see a distinction between page cache vs fs state for a block. Tux3 has these scalar block states: EMPTY - not read into cache yet CLEAN - cache data matches disk data (which might be a hole) DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates Besides the per-page block reference count (hmm, do we really need it? Why not rely on the page reference count?) there is no cache-specific state, it is all "fs" state. 
To complete the enumeration of state Tux3 represents in block handles, there is also a per-block lock bit, used for reading blocks, the same as buffer lock. So far there is no writeback bit, which does not seem to be needed, because the flow of block writeback is subtly different from page writeback. I am not prepared to defend that assertion just yet! But I think the reason for this is, there is no such thing as redirty for metadata blocks in Tux3, there is only "dirty in a later delta", and that implies redirecting the block to a new physical location that has its own, separate block state. Anyway, this is a pretty good example of why you may find it difficult to generalize your library to handle every filesystem. Is there any existing filesystem that works this way? How would you know in advance what features to include in your library to handle it? Will some future filesystem have very different requirements, not handled by your library? If you have finally captured every feature, will they interact? Will all these features be confusing to use and hard to analyze? I am not saying you can't solve all these problems, just that it is bound to be hard, take a long time, and might possibly end up less elegant than a more lightweight approach that leaves the top level logic in the hands of the filesystem. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
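For illustration, the scalar state enumeration described above fits comfortably in the 4-bit per-block field of the earlier block_handles sketch; something like the following, with invented names rather than the actual Tux3 code:

/*
 * Scalar per-block cache state: a block is either not read yet, clean,
 * or dirty in exactly one of up to four pipelined deltas.  There is no
 * separate writeback or redirty state; dirtying a block again in a later
 * delta means redirecting it to a new physical location with its own
 * state.  The per-block read lock lives in a separate bitmap.
 */
enum block_state {
	BLOCK_EMPTY,		/* not read into cache yet */
	BLOCK_CLEAN,		/* cache matches disk (which might be a hole) */
	BLOCK_DIRTY0,		/* dirty in pipelined delta N modulo 4 */
	BLOCK_DIRTY1,
	BLOCK_DIRTY2,
	BLOCK_DIRTY3,
};

static inline enum block_state block_dirty_state(unsigned delta)
{
	return BLOCK_DIRTY0 + (delta & 3);	/* which delta owns the block */
}

static inline int block_is_dirty(enum block_state state)
{
	return state >= BLOCK_DIRTY0;
}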
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 2:41 ` Daniel Phillips @ 2009-03-15 3:45 ` Nick Piggin 2009-03-15 21:44 ` Theodore Tso 0 siblings, 1 reply; 23+ messages in thread From: Nick Piggin @ 2009-03-15 3:45 UTC (permalink / raw) To: Daniel Phillips; +Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sunday 15 March 2009 13:41:09 Daniel Phillips wrote: > On Thursday 12 March 2009, Nick Piggin wrote: > > On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote: > > > > fsblocks in their refcount mode don't tend to _cache_ physical block > > > > addresses either, because they're only kept around for as long as > > > > they are required (eg. to write out the page to avoid memory > > > > allocation deadlock problems). > > > > > > > > But some filesystems don't do very fast block lookups and do want a > > > > cache. I did a little extent map library on the side for that. > > > > > > Sure, good plan. We are attacking the transfer path, so that all the > > > transfer state goes directly from the filesystem into a BIO and doesn't > > > need that twisty path back and forth to the block library. The BIO > > > remembers the physical address across the transfer cycle. If you must > > > still support those twisty paths for compatibility with the existing > > > buffer.c scheme, you have a much harder project. > > > > I don't quite know what you mean. You have a set of dirty cache that > > needs to be written. So you need to know the block addresses in order > > to create the bio of course. > > > > fsblock allocates the block and maps[*] the block at pagecache *dirty* > > time, and holds onto it until writeout is finished. > > As it happens, Tux3 also physically allocates each _physical_ metadata > block (i.e., what is currently called buffer cache) at the time it is > dirtied. I don't know if this is the best thing to do, but it is > interesting that you do the same thing. I also don't know if I want to > trust a library to get this right, before having completely proved out > the idea in a non-trival filesystem. But good luck with that! It I'm not sure why it would be a big problem. fsblock isn't allocating the block itself of course, it just asks the filesystem to. It's trivial to do for fsblock. > seems to me like a very good idea to take Ted up on his offer and try > out your library on Ext4. This is just a gut feeling, but I think you > will need many iterations to refine the idea. Just working, and even > showing benchmark improvement is not enough. If it is a core API > proposal, it needs a huge body of proof. If you end up like JBD with > just one user, because it actually only implements the semantics of > exactly one filesystem, then the extra overhead of unused generality > will just mean more lines of code to maintain and more places for bugs > to hide. I don't know what you're thinking is so difficult with it. I've already converted minix, ext2, and xfs and they seem to work fine. There is not really fundamentally anything that buffer heads can do that fsblock can't. > This is all general philosophy of course. Actually reading your code > would help a lot. By comparision, I intend the block handles library > to be a few hundred lines of code, including new incarnations of > buffer.c functionality like block_read/write_*. 
If this is indeed > possible, and it does the job with 4 bytes per block on a 1K block/4K > page configuration as it does in the prototype, then I think I would > prefer a per-filesystem solution and let it evolve that way for a long > time before attempting a library. But that is just me. If you're tracking pagecache state in these things, then I can't see how it can get any easier just because it is smaller. In which case, your concerns about duplicating functionality of this layer apply just as much to your own scheme. > I suppose you would like to see some code? > > > In something like > > ext2, finding the offset->block map can require buffercache allocations > > so it is technically deadlocky if you have to do it at writeout time. > > I am not sure what "technically" means. Pretty much everything you do Technically means that it is deadlocky. Today, practically every Linux filesystem technically has memory deadlocks. In practice, the mm does keep reserves around to help this and so it is very very hard to hit. > in this area has high deadlock risk. That is one of the things that > scares me about trying to handle every filesystem uniformly. How would > filesystem writers even know what the deadlock avoidance rules are, > thus what they need to do in their own filesystem to avoid it? The rule is simple: if forward progress requires resource allocation, then you must ensure resource deadlocks are avoided or can be recovered from. I don't think many fs developers actually care very much, but obviously a rewrite of such core functionality must not introduce such deadlocks by design. > Anyway, the Tux3 reason for doing the allocation at dirty time is, this > is the only time the filesystem knows what the parent block of a given > metadata block is. Note that we move btree blocks around when they are > dirtied, and thus need to know the parent in order to update the parent > pointer to the child. This is a complication you will not run into in > any of the filesystems you have poked at so far. This subtle detail is > very much filesystem specific, or it is specific to the class of > filesystems that do remap on write. Good luck knowing how to generalize > that before Linux has seen even one of them up and doing real production > work. Uh, this kind of stuff is completely not what fsblock would try to do. fsblock gives the filesystem notifications when the block gets dirtied, when the block is prepared for writeout, etc. It is up to the filesystem to do everything else (with the postcondition that the block is mapped after being prepared for writeout). > > [*] except in the case of delalloc. fsblock does its best, but for > > complex filesystems like delalloc, some memory reservation would have > > to be done by the fs. > > And that is a whole, huge and critical topic. Again, something that I > feel needs to be analyzed per filesystem, until we have considerably > more experience with the issues. Again, fsblock does as much as it can, up to guaranteeing that fsblock metadata (and hence any filesystem private data attached to the fsblock) stays allocated as long as the block is dirty. Of course the actual delalloc scheme is filesystem specific and can't be handled by fsblock. > > I haven't done much about this in fsblock yet. I think some things need > a bit of changing in the pagecache layer (in the block library, eg. 
> > write_begin/write_end doesn't have enough info to reserve/allocate a big > > range of blocks -- we need a callback higher up to tell the filesystem > > that we will be writing xxx range in the file, so get things ready for > > us). > > That would be write_cache_pages, it already exists and seems perfectly > serviceable. No it isn't. That's completely different. > > As far as the per-block pagecache state (as opposed to the per-block fs > > state), I don't see any reason it is a problem for efficiency. We have to > > do per-page operations anyway. > > I don't see a distinction between page cache vs fs state for a block. > Tux3 has these scalar block states: > > EMPTY - not read into cache yet > CLEAN - cache data matches disk data (which might be a hole) > DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates > > Besides the per-page block reference count (hmm, do we really need it? > Why not rely on the page reference count?) there is no cache-specific > state, it is all "fs" state. dirty / uptodate is a property of the cache. > To complete the enumeration of state Tux3 represents in block handles, > there is also a per-block lock bit, used for reading blocks, the same > as buffer lock. So far there is no writeback bit, which does not seem > to be needed, because the flow of block writeback is subtly different > from page writeback. I am not prepared to defend that assertion just > yet! But I think the reason for this is, there is no such thing as > redirty for metadata blocks in Tux3, there is only "dirty in a later > delta", and that implies redirecting the block to a new physical > location that has its own, separate block state. Anyway, this is a > pretty good example of why you may find it difficult to generalize your > library to handle every filesystem. Is there any existing filesystem > that works this way? How would you know in advance what features to > include in your library to handle it? Will some future filesystem > have very different requirements, not handled by your library? If you > have finally captured every feature, will they interact? Will all > these features be confusing to use and hard to analyze? I am not > saying you can't solve all these problems, just that it is bound to be > hard, take a long time, and might possibly end up less elegant than a > more lightweight approach that leaves the top level logic in the hands > of the filesystem. It's not meant to handle every possible feature of every current and future fs! It's meant to replace buffer-head. If there is some common filesystem feature in future that makes sense to generalise and support in fsblock then great. ^ permalink raw reply [flat|nested] 23+ messages in thread
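The division of labour Nick describes, where the generic layer owns the cache state and calls into the filesystem at the interesting transitions, with the one postcondition that a block is mapped once it has been prepared for writeout, could be pictured as a small ops table along these lines. This is an invented illustration of the idea, not the actual fsblock interface:

#include <linux/fs.h>

struct fsblock;			/* generic per-block object, owned by the library */

struct fsblock_fs_ops {
	/* a clean block just became dirty (via write, mmap store, etc.) */
	void (*block_dirty)(struct inode *inode, struct fsblock *blk,
			    sector_t logical);

	/*
	 * about to go into a bio; postcondition: *physical is valid.
	 * A simple filesystem allocates here (or earlier, at dirty time);
	 * a delalloc filesystem resolves its reservation here.
	 */
	int (*prepare_writeout)(struct inode *inode, struct fsblock *blk,
				sector_t logical, sector_t *physical);

	/* the bio completed; the block can return to plain clean state */
	void (*writeout_done)(struct inode *inode, struct fsblock *blk);
};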
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 3:45 ` Nick Piggin @ 2009-03-15 21:44 ` Theodore Tso 2009-03-15 22:41 ` Daniel Phillips 2009-03-16 5:12 ` Dave Chinner 0 siblings, 2 replies; 23+ messages in thread From: Theodore Tso @ 2009-03-15 21:44 UTC (permalink / raw) To: Nick Piggin Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote: > > As it happens, Tux3 also physically allocates each _physical_ metadata > > block (i.e., what is currently called buffer cache) at the time it is > > dirtied. I don't know if this is the best thing to do, but it is > > interesting that you do the same thing. I also don't know if I want to > > trust a library to get this right, before having completely proved out > > the idea in a non-trival filesystem. But good luck with that! It > > I'm not sure why it would be a big problem. fsblock isn't allocating > the block itself of course, it just asks the filesystem to. It's > trivial to do for fsblock. So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location. XFS, for example, keeps a btree of free regions indexed by size so that it can select the perfect location for a newly written file which is 24k or 56k long. If fsblock forces the physical allocation of blocks the moment the page is dirty, it will destroy XFS's capability to select the perfect location for the file. In addition, XFS uses delayed allocation to avoid the problem of uninitialized data becoming visible in the event of a crash. If fsblock immediately allocates the physical block, then either the uninitialized data might become available on a system crash (which is a security problem), or XFS is going to have to force all newly written data blocks to disk before a commit. If that sounds familiar it's what ext3's data=ordered mode does, and it's what is responsible for the Firefox 3.0 fsync performance problem. A similar issue exists for ext4's delayed allocation. - Ted ^ permalink raw reply [flat|nested] 23+ messages in thread
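In outline, the delayed allocation Ted describes means the dirty-time hook only reserves space and remembers that an allocation is owed, and the physical placement happens at writeout, when the size and shape of the whole dirty range is known. A schematic sketch with invented helper names (not the ext4 or XFS code):

#include <linux/fs.h>

/* filesystem-specific pieces, names invented for the sketch */
int fs_reserve_blocks(struct inode *inode, unsigned count);
void fs_tag_delalloc(struct inode *inode, sector_t logical, unsigned count);
int fs_allocate_extent(struct inode *inode, sector_t logical, unsigned count,
		       sector_t *physical);

/* dirty time: no physical blocks are chosen yet */
static int fs_block_dirty(struct inode *inode, sector_t logical, unsigned count)
{
	int err = fs_reserve_blocks(inode, count);	/* so writeout cannot fail for space */
	if (err)
		return err;				/* typically -ENOSPC */
	fs_tag_delalloc(inode, logical, count);		/* remember "allocation owed" */
	return 0;
}

/*
 * writeout time: the whole dirty range is visible, so the allocator can
 * place it in one well-chosen extent instead of deciding block by block
 * as individual pages were dirtied.
 */
static int fs_map_for_writeout(struct inode *inode, sector_t logical,
			       unsigned count, sector_t *physical)
{
	return fs_allocate_extent(inode, logical, count, physical);
}

The ordering problem Ted raises then concerns the window between this allocation committing and the data actually reaching disk, which is what ext3's data=ordered mode, or the unwritten-extent conversion discussed further down the thread, is meant to close.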
* Re: [Tux3] Tux3 report: Tux3 Git tree available 2009-03-15 21:44 ` Theodore Tso @ 2009-03-15 22:41 ` Daniel Phillips 2009-03-16 10:32 ` Nick Piggin 2009-03-16 5:12 ` Dave Chinner 1 sibling, 1 reply; 23+ messages in thread From: Daniel Phillips @ 2009-03-15 22:41 UTC (permalink / raw) To: Theodore Tso Cc: Nick Piggin, linux-fsdevel, tux3, Andrew Morton, linux-kernel Hi Ted, On Sunday 15 March 2009, Theodore Tso wrote: > On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote: > > > As it happens, Tux3 also physically allocates each _physical_ metadata > > > block (i.e., what is currently called buffer cache) at the time it is > > > dirtied. I don't know if this is the best thing to do, but it is > > > interesting that you do the same thing. I also don't know if I want to > > > trust a library to get this right, before having completely proved out > > > the idea in a non-trival filesystem. But good luck with that! It > > > > I'm not sure why it would be a big problem. fsblock isn't allocating > > the block itself of course, it just asks the filesystem to. It's > > trivial to do for fsblock. > > So the really unfortunate thing about allocating the block as soon as > the page is dirty is that it spikes out delayed allocation. By > delaying the physical allocation of the logical->physical mapping as > long as possible, the filesystem can select the best possible physical > location. Tux3 does not dirty the metadata until data cache is flushed, so the allocation decisions for data and metadata are made at the same time. That is the reason for the distinction between physical metadata above, and logical metadata such as directory data and bitmaps, which are delayed. Though physical metadata is positioned when first dirtied, physical metadata dirtying is delayed until delta commit. Implementing this model (we are still working on it) requires taking care of a lot of subtle details that are specific to the Tux3 cache model. I have a hard time imagining those allocation decisions driven by callbacks from a buffer-like library. Regards, Daniel ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-15 22:41 ` Daniel Phillips
@ 2009-03-16 10:32 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-16 10:32 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Theodore Tso, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Monday 16 March 2009 09:41:35 Daniel Phillips wrote:
> Hi Ted,
>
> > So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location.
>
> Tux3 does not dirty the metadata until the data cache is flushed, so the allocation decisions for data and metadata are made at the same time. That is the reason for the distinction between physical metadata above, and logical metadata such as directory data and bitmaps, which are delayed. Though physical metadata is positioned when first dirtied, physical metadata dirtying is delayed until delta commit.
>
> Implementing this model (we are still working on it) requires taking care of a lot of subtle details that are specific to the Tux3 cache model. I have a hard time imagining those allocation decisions driven by callbacks from a buffer-like library.

The filesystem can get pagecache-block-dirty events in a few ways (often a combination of): write_begin/write_end, set_page_dirty, page_mkwrite, etc. Short of implementing your own write path entirely (and even then you need to hook at least page_mkwrite to catch mmapped writes, for completeness), I don't see why a get_block(BLOCK_DIRTY) kind of callback is much harder for you to imagine than any of the other callbacks. Actually, I imagine the block-based callback should be easier for filesystems that support any block size != page size, because all the others are page based.

I would definitely like to hear firm details about any problems, because I would like to try to make it more generic even if your filesystem won't use it :)

Now this is not to say the current buffer APIs are totally _optimal_. As I said, I would like to see at least something along the lines of a "we are about to dirty range (x,y)" callback in the higher level generic write code. But that's another story (which I am planning to get to).

^ permalink raw reply [flat|nested] 23+ messages in thread
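For readers less familiar with these hooks, here is a rough sketch of the kind of block-granular dirty callback being discussed. It is not the fsblock API (fsblock never went upstream) and the event names are invented; the only point is that a filesystem already answering page-based dirty events can just as easily answer a per-block one.

/* Illustrative only: a block-granular dirty callback in the spirit of the
 * get_block(BLOCK_DIRTY) idea mentioned above.  These names are invented;
 * this is not the fsblock API or any in-tree kernel API. */
enum blk_event { BLK_MAP, BLK_DIRTY };

struct toy_block {
	unsigned long index;	/* block number within the file */
	long phys;		/* physical block, or -1 if not mapped yet */
	unsigned int dirty:1;
	unsigned int delalloc:1;
};

typedef int (*blk_event_fn)(struct toy_block *blk, enum blk_event ev);

/* An ext2-like filesystem can simply allocate when told a block is dirty. */
static int eager_fs_event(struct toy_block *blk, enum blk_event ev)
{
	if (ev == BLK_DIRTY) {
		if (blk->phys < 0)
			blk->phys = 1000 + (long)blk->index;	/* pretend allocation */
		blk->dirty = 1;
	}
	return 0;
}

/* A delayed-allocation filesystem only reserves space here and maps the
 * block later, at writeback time. */
static int delalloc_fs_event(struct toy_block *blk, enum blk_event ev)
{
	if (ev == BLK_DIRTY) {
		blk->delalloc = 1;
		blk->dirty = 1;
	}
	return 0;
}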
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-15 21:44 ` Theodore Tso
  2009-03-15 22:41   ` Daniel Phillips
@ 2009-03-16  5:12 ` Dave Chinner
  2009-03-16  6:38   ` Theodore Tso
  1 sibling, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2009-03-16 5:12 UTC (permalink / raw)
  To: Theodore Tso, Nick Piggin, Daniel Phillips, linux-fsdevel, tux3, Andrew Morton

On Sun, Mar 15, 2009 at 05:44:26PM -0400, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata block (i.e., what is currently called buffer cache) at the time it is dirtied. I don't know if this is the best thing to do, but it is interesting that you do the same thing. I also don't know if I want to trust a library to get this right, before having completely proved out the idea in a non-trivial filesystem. But good luck with that! It
> >
> > I'm not sure why it would be a big problem. fsblock isn't allocating the block itself of course, it just asks the filesystem to. It's trivial to do for fsblock.
>
> So the really unfortunate thing about allocating the block as soon as the page is dirty is that it spikes out delayed allocation. By delaying the physical allocation of the logical->physical mapping as long as possible, the filesystem can select the best possible physical location.

This is no different to the way delayed allocation with bufferheads works. Both XFS and ext4 set the buffer_delay flag instead of allocating up front so that later on in ->writepages we can do optimal delayed allocation. AFAICT fsblock works the same way....

> XFS, for example, keeps a btree of free regions indexed by size so that it can select the perfect location for a newly written file which is 24k or 56k long.

Ah, no. It's far more complex than that. To begin with, XFS has *two* freespace trees per allocation group - one indexed by extent size, the other by extent starting block. XFS looks for an exact or nearby extent start block match that is big enough in the by-block tree. If it can't find a nearby match, then it looks up a size match in the by-size tree. i.e. the fundamental allocation assumption is that locality of data placement matters far more than filling holes in the freespace trees.....

> In addition, XFS uses delayed allocation to avoid the problem of uninitialized data becoming visible in the event of a crash.

No it doesn't. Delayed allocation minimises the problem but doesn't prevent it. It has been known for years (since before I joined SGI in 2002) that there is a theoretical timing gap in XFS where the allocation transaction can commit and a crash occur before data hits the disk, hence exposing stale data. The reality is that no-one has ever reported exposing stale data in this scenario, and there has been plenty of effort expended trying to trigger it. Hence it has remained in the realm of a theoretical problem....

> If fsblock immediately allocates the physical block, then either the uninitialized data might become visible after a system crash (which is a security problem), or XFS is going to have to force all newly written data blocks to disk before a commit. If that sounds familiar, it's what ext3's data=ordered mode does, and it's what is responsible for the Firefox 3.0 fsync performance problem.

If this were to occur, the obvious solution to this problem is to allocate unwritten extents and do the conversion after data I/O completion.
That would result in correct metadata/data ordering in all cases with only a small performance impact and without introducing ext3-sync-the-world-like issues...

Ted, I appreciate you telling the world over and over again how bad XFS is and what you think needs to be done to fix it. Truth is, this would have been a much better email had you written about it from an ext4 perspective. That way it wouldn't have been full of errors or sounded like a kid caught with his hand in the cookie jar: "It's not my fault! I was only copying XFS! He did it first!"

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 23+ messages in thread
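For readers following the allocator details, here is a toy sketch of the "locality first, size second" policy outlined above. It is plain C with invented structures, not the XFS allocator, and linear scans stand in for the two btree lookups.

/* Toy model of the two-lookup policy described above.  Not the XFS
 * allocator: invented structures, and linear scans replace the btrees. */
#include <stddef.h>

struct free_extent { long start; long len; };

struct ag_freespace {
	const struct free_extent *by_block;	/* sorted by starting block */
	const struct free_extent *by_size;	/* sorted by extent length */
	int nr;
};

static long distance(long a, long b)
{
	return a > b ? a - b : b - a;
}

static const struct free_extent *
toy_alloc(const struct ag_freespace *ag, long hint, long want, long max_dist)
{
	const struct free_extent *best = NULL;
	int i;

	/* 1. Locality: the extent nearest to the hint that is big enough
	 *    (stands in for the by-block btree lookup). */
	for (i = 0; i < ag->nr; i++) {
		const struct free_extent *fe = &ag->by_block[i];

		if (fe->len < want || distance(fe->start, hint) > max_dist)
			continue;
		if (!best || distance(fe->start, hint) < distance(best->start, hint))
			best = fe;
	}
	if (best)
		return best;

	/* 2. Otherwise, a size match: the first (smallest) extent in the
	 *    by-size index that is large enough. */
	for (i = 0; i < ag->nr; i++)
		if (ag->by_size[i].len >= want)
			return &ag->by_size[i];

	return NULL;	/* no suitable free space in this allocation group */
}

The point is only the ordering of the two lookups: a nearby extent that is big enough wins over a better size fit somewhere far away.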
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-16  5:12 ` Dave Chinner
@ 2009-03-16  6:38 ` Theodore Tso
  2009-03-16 10:14   ` Nick Piggin
  0 siblings, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2009-03-16 6:38 UTC (permalink / raw)
  To: Nick Piggin, Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

Dave,

It wasn't my intention to say that XFS was bad; in fact, I thought I was actually complimenting XFS by talking about some of the advanced features that XFS has (many of which I have always said that ext3 lacks, and some of which ext4 still does not have, and probably never will have). I stand corrected on some of the details that I got wrong.

What I was trying to say was that *if* (and perhaps I'm misunderstanding fsblock) fsblock requires that, as soon as a page is dirty, the filesystem assign a block allocation to the buffers attached to the dirty page, then this would spike out delayed allocation, which would be unfortunate for *both* ext4 and XFS.

But maybe I'm misunderstanding what fsblock is doing, and there isn't a problem here.

Regards,

- Ted

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: Tux3 report: Tux3 Git tree available
  2009-03-16  6:38 ` Theodore Tso
@ 2009-03-16 10:14 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-16 10:14 UTC (permalink / raw)
  To: Theodore Tso
  Cc: linux-fsdevel, tux3, Andrew Morton, linux-kernel, Daniel Phillips

On Monday 16 March 2009 17:38:30 Theodore Tso wrote:
> Dave,
>
> It wasn't my intention to say that XFS was bad; in fact, I thought I was actually complimenting XFS by talking about some of the advanced features that XFS has (many of which I have always said that ext3 lacks, and some of which ext4 still does not have, and probably never will have). I stand corrected on some of the details that I got wrong.
>
> What I was trying to say was that *if* (and perhaps I'm misunderstanding fsblock) fsblock requires that, as soon as a page is dirty, the filesystem assign a block allocation to the buffers attached to the dirty page, then this would spike out delayed allocation, which would be unfortunate for *both* ext4 and XFS.
>
> But maybe I'm misunderstanding what fsblock is doing, and there isn't a problem here.

Yeah, Dave's understanding of fsblock is correct. I might have stated things confusingly... fsblock allocates the in-memory fsblock metadata structure (~= struct buffer_head) at the time of block dirtying. It also asks the filesystem to respond to the dirtying event appropriately. In the case of say ext2, this means allocating a block on disk. Whereas XFS does the delalloc/reserve thing (yes, XFS appears to be working with fsblock well enough to get this far).

fsblock really isn't too much different to buffer_head from an abstract capability / functionality point of view, except that it is often more strict where I feel it makes sense. So for this particular example: in buffer.c, buffers do tend to be allocated when a page is dirtied, but not always, and even when they are, they can get reclaimed while the page is still dirty. fsblock tightens this up.

^ permalink raw reply [flat|nested] 23+ messages in thread
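A small sketch of the stricter lifetime rule described above: the per-block structure is created when the block is dirtied, the filesystem is told about the event, and the structure stays pinned until the block is clean again. Plain C with invented names; fsblock itself was never merged, so none of this is a real kernel API.

/* Illustrative lifetime rule: created at dirty time, pinned while dirty,
 * freed proactively once clean.  Invented names, not kernel code. */
#include <assert.h>
#include <stdlib.h>

struct pinned_block {
	int refcount;
	int dirty;
	long phys;	/* -1 until the filesystem maps or allocates it */
};

static struct pinned_block *
block_dirty(struct pinned_block *blk, void (*fs_dirty_event)(struct pinned_block *))
{
	if (!blk) {
		blk = calloc(1, sizeof(*blk));	/* created at dirty time */
		if (!blk)
			return NULL;
		blk->phys = -1;
	}
	blk->refcount++;			/* pinned while dirty */
	blk->dirty = 1;
	fs_dirty_event(blk);			/* ext2: allocate; XFS: reserve */
	return blk;
}

static struct pinned_block *block_clean(struct pinned_block *blk)
{
	assert(blk && blk->dirty);
	blk->dirty = 0;
	if (--blk->refcount == 0) {		/* freed once nothing needs it */
		free(blk);
		return NULL;
	}
	return blk;
}

Compare this with buffer heads, which, as noted above, can sometimes be reclaimed while the page is still dirty.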
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin
  2009-03-12 12:24   ` Daniel Phillips
@ 2009-03-12 17:06 ` Theodore Tso
  2009-03-13  9:32   ` Nick Piggin
  1 sibling, 1 reply; 23+ messages in thread
From: Theodore Tso @ 2009-03-12 17:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to).
>
> Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock.
>
> ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it.

One possibility would be to do this with ext4 instead, since there are fewer users, and it has more of a "development" feel to it. OTOH, there are people (including myself) who are using ext4 in production already, and I'd appreciate not having my source trees on my laptop getting toasted. :-)

Is it going to be possible to make the fsblock conversion something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the changes too invasive to really allow that? (Also note BTW that ocfs2 is also using jbd2, so we need to be careful we don't break ocfs2 while we're doing the fsblock conversion.)

- Ted

^ permalink raw reply [flat|nested] 23+ messages in thread
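For illustration only, the kind of compile-time switch being asked about might look like the following toy. CONFIG_EXT4_FSBLOCK never existed, the helpers are invented, and this is plain C showing one function built two ways, not ext4 code.

/* Toy compile-time switch between two code paths; invented names. */
#include <stdio.h>

#ifdef CONFIG_EXT4_FSBLOCK
static const char *metadata_io_engine(void)
{
	return "fsblock";	/* converted code path */
}
#else
static const char *metadata_io_engine(void)
{
	return "buffer_head";	/* existing code path */
}
#endif

int main(void)
{
	/* Build with -DCONFIG_EXT4_FSBLOCK to select the converted path. */
	printf("ext4 metadata I/O via %s\n", metadata_io_engine());
	return 0;
}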
* Re: [Tux3] Tux3 report: Tux3 Git tree available
  2009-03-12 17:06 ` [Tux3] " Theodore Tso
@ 2009-03-13  9:32 ` Nick Piggin
  0 siblings, 0 replies; 23+ messages in thread
From: Nick Piggin @ 2009-03-13 9:32 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Daniel Phillips, linux-fsdevel, tux3, Andrew Morton, linux-kernel

On Friday 13 March 2009 04:06:53 Theodore Tso wrote:
> On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> > It is basically already proven. It is faster with ext2 and it works with XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't really able to grok XFS enough to convert it). And works with minix with larger block size than page size (except some places where core pagecache code needs some hacking that I haven't got around to).
> >
> > Yes an ext3 conversion would probably reveal some tweaks or fixes to fsblock. I might try doing ext3 next. I suspect most of the problems would be fitting ext3 to much stricter checks and consistency required by fsblock, rather than adding ext3-required features to fsblock.
> >
> > ext3 will be a tough one to convert because it is complex, very stable, and widely used so there are lots of reasons not to make big changes to it.
>
> One possibility would be to do this with ext4 instead, since there are fewer users, and it has more of a "development" feel to it. OTOH, there

Yes, I think ext4 would be the best candidate for the next conversion.

> are people (including myself) who are using ext4 in production already, and I'd appreciate not having my source trees on my laptop getting toasted. :-)

Definitely ;)

> Is it going to be possible to make the fsblock conversion something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the changes too invasive to really allow that? (Also note BTW that ocfs2 is also using jbd2, so we need to be careful we don't break ocfs2 while we're doing the fsblock conversion.)

Hmm, it would be difficult. I think once I get a patch working, it wouldn't be too hard to maintain out of tree though (there tends to be just a smallish number of patterns used many times). I'd start by doing that, and see how it looks from there.

^ permalink raw reply [flat|nested] 23+ messages in thread
Thread overview: 23+ messages
[not found] <200903110925.37614.phillips@phunq.net>
[not found] ` <200903122010.31282.nickpiggin@yahoo.com.au>
[not found] ` <200903120315.07610.phillips@phunq.net>
2009-03-12 11:03 ` [Tux3] Tux3 report: Tux3 Git tree available Nick Piggin
2009-03-12 12:24 ` Daniel Phillips
2009-03-12 12:32 ` Matthew Wilcox
2009-03-12 12:45 ` Nick Piggin
2009-03-12 13:12 ` [Tux3] " Daniel Phillips
2009-03-12 13:06 ` Daniel Phillips
2009-03-12 13:04 ` Nick Piggin
2009-03-12 13:59 ` [Tux3] " Matthew Wilcox
2009-03-12 14:19 ` Nick Piggin
2009-03-15 3:24 ` Daniel Phillips
2009-03-15 3:50 ` [Tux3] " Nick Piggin
2009-03-15 4:08 ` Daniel Phillips
2009-03-15 4:14 ` [Tux3] " Nick Piggin
2009-03-15 2:41 ` Daniel Phillips
2009-03-15 3:45 ` Nick Piggin
2009-03-15 21:44 ` Theodore Tso
2009-03-15 22:41 ` Daniel Phillips
2009-03-16 10:32 ` Nick Piggin
2009-03-16 5:12 ` Dave Chinner
2009-03-16 6:38 ` Theodore Tso
2009-03-16 10:14 ` Nick Piggin
2009-03-12 17:06 ` [Tux3] " Theodore Tso
2009-03-13 9:32 ` Nick Piggin