* Re: [patch] delayed disk block allocation
@ 2002-03-04 14:54 rwhron
  2002-03-04 15:10 ` Jeff Garzik
  0 siblings, 1 reply; 22+ messages in thread
From: rwhron @ 2002-03-04 14:54 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm
It's not on a big box, but I have a side by side of 2.5.6-pre2
and 2.5.6-pre2 with patches from
http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/
2.5.6-pre2-akpm compiled with MPIO_DEBUG = 0 and ext2
filesystem mounted with delalloc.
tiobench and dbench were on ext2.
bonnie++ and most other tests were on reiserfs.
2.5.6-pre2-akpm throughput on ext2 is much improved.
http://home.earthlink.net/~rwhron/kernel/akpm.html
One odd number in lmbench is page fault latency.  Lmbench also
showed high page fault latency in 2.4.18-pre9 with make_request,
read-latency2, and low-latency.  2.4.19-pre1aa1 with read_latency2
(2.4.19pre1aa1rl) did not show a bump in page fault latency.
--
Randy Hron
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04 14:54 [patch] delayed disk block allocation rwhron
@ 2002-03-04 15:10 ` Jeff Garzik
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Garzik @ 2002-03-04 15:10 UTC (permalink / raw)
To: rwhron; +Cc: linux-kernel, akpm
rwhron@earthlink.net wrote:
> 2.5.6-pre2-akpm compiled with MPIO_DEBUG = 0 and ext2
> filesystem mounted with delalloc.
>
> tiobench and dbench were on ext2.
> bonnie++ and most other tests were on reiserfs.
>
> 2.5.6-pre2-akpm throughput on ext2 is much improved.
>
> http://home.earthlink.net/~rwhron/kernel/akpm.html
This page of statistics is pretty gnifty -- I wish more people posted
such :)  I also like your other comparisons at the top of that link...
I was thinking, what would be even more neat and more useful as well
would be to post comparisons -between- the various kernels you have
tested, i.e. a single page that includes benchmarks for 2.4.x-official,
2.5.x-official, -aa, -rmap, etc.
	Regards,
	Jeff
--
Jeff Garzik      | Building 1024 | MandrakeSoft | Choose life.
^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
@ 2002-03-07 12:06 Etienne Lorrain
2002-03-07 14:47 ` Steve Lord
0 siblings, 1 reply; 22+ messages in thread
From: Etienne Lorrain @ 2002-03-07 12:06 UTC (permalink / raw)
To: linux-kernel
> With "allocate on flush", (aka delayed allocation), file data is
> assigned a disk mapping when the data is being written out, rather than
> at write(2) time. This has the following advantages:
I do agree that this is a better solution than the current one,
but (even if I have not had time to test the patch) I have
a question: what about bootloaders?
IMHO all current bootloaders need to write to disk a "chain" of sectors
to load for their own initialisation, i.e. for loading, from the 512-byte
bootcode, the remaining part of their code stored in a file on one
filesystem.  This "chain" of sectors can only be known once the file
has been allocated on disk - and it has to be written into that same file,
at its allocated location.
So can you upgrade LILO or GRUB with your patch installed?
It is not so big a problem (the solution being to install the
bootloader on an unmounted filesystem with tools like e2fsprogs),
but it seems incompatible with the current executables.
Comments?
Etienne.
^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] delayed disk block allocation
  2002-03-07 12:06 Etienne Lorrain
@ 2002-03-07 14:47 ` Steve Lord
  2002-03-07 17:30   ` Mike Fedyk
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Lord @ 2002-03-07 14:47 UTC (permalink / raw)
To: Etienne Lorrain; +Cc: Linux Kernel
On Thu, 2002-03-07 at 06:06, Etienne Lorrain wrote:
> > With "allocate on flush", (aka delayed allocation), file data is
> > assigned a disk mapping when the data is being written out, rather than
> > at write(2) time.  This has the following advantages:
>
>  I do agree that this is a better solution than current one,
> but (even if I did not had time to test the patch), I have
> a question: How about bootloaders?
>
>  IHMO all current bootloaders need to write to disk a "chain" of sector
> to load for their own initialisation, i.e. loading the remainning
> part of code stored on a file in one filesystem from the 512 bytes
> bootcode. This "chain" of sector can only be known once the file
> has been allocated to disk - and it has to be written on the same file,
> at its allocated space.
>
>  So can you upgrade LILO or GRUB with your patch installed?
> It is not a so big problem (the solution being to install the
> bootloader on an unmounted filesystem with tools like e2fsprogs),
> but it seems incompatible with the current executables.

The interface used by lilo to read the kernel location needs to flush
data out to disk before returning results.  It's not too hard to do.

Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-07 14:47 ` Steve Lord
@ 2002-03-07 17:30   ` Mike Fedyk
  0 siblings, 0 replies; 22+ messages in thread
From: Mike Fedyk @ 2002-03-07 17:30 UTC (permalink / raw)
To: Steve Lord; +Cc: Etienne Lorrain, Linux Kernel
On Thu, Mar 07, 2002 at 08:47:24AM -0600, Steve Lord wrote:
> On Thu, 2002-03-07 at 06:06, Etienne Lorrain wrote:
> > > With "allocate on flush", (aka delayed allocation), file data is
> > > assigned a disk mapping when the data is being written out, rather than
> > > at write(2) time.  This has the following advantages:
> >
> >  I do agree that this is a better solution than current one,
> > but (even if I did not had time to test the patch), I have
> > a question: How about bootloaders?
> >
> >  IHMO all current bootloaders need to write to disk a "chain" of sector
> > to load for their own initialisation, i.e. loading the remainning
> > part of code stored on a file in one filesystem from the 512 bytes
> > bootcode. This "chain" of sector can only be known once the file
> > has been allocated to disk - and it has to be written on the same file,
> > at its allocated space.
> >
> >  So can you upgrade LILO or GRUB with your patch installed?
> > It is not a so big problem (the solution being to install the
> > bootloader on an unmounted filesystem with tools like e2fsprogs),
> > but it seems incompatible with the current executables.
>
> The interface used by lilo to read the kernel location needs to flush
> data out to disk before returning results.  It's not too hard to do.

Also, GRUB shouldn't have this problem since it reads the filesystems
directly on boot.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [patch] delayed disk block allocation
@ 2002-03-01 8:26 Andrew Morton
2002-03-04 2:08 ` Daniel Phillips
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2002-03-01 8:26 UTC (permalink / raw)
To: lkml
A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
available at
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
- Core functions
and
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
- delalloc implementation for ext2.
Also,
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-30-ratcache.patch
- Momchil Velikov/Christoph Hellwig radix-tree pagecache ported onto delalloc.
With "allocate on flush", (aka delayed allocation), file data is
assigned a disk mapping when the data is being written out, rather than
at write(2) time. This has the following advantages:
- Less disk fragmentation
- By deferring writeout, we can send larger amounts of data into
the filesystem, which gives the fs a better chance of laying the
blocks out contiguously.
This defeats the current (really bad) performance bug in ext2
and ext3 wherein the blocks for two streaming files are
intermingled in AAAAAAAABBBBBBBBAAAAAAAABBBBBBBB manner (ext2) and
ABABABABAB manner (ext3).
- Short-lived files *never* get a disk mapping, which prevents
them from fragmenting disk free space.
- Faster
- Enables complete bypass of the buffer_head layer (that's the
next patch series).
- Less work to do for short-lived files
- Unifies the handling of shared mappings and write(2) data.
The patches as they stand work fine, but they enable much more. I'd be
interested in hearing some test results of this plus the mpio patch on
big boxes.
Delayed allocate for ext2 is turned on with the `-o delalloc' mount
option. Without this, nothing much has changed (well, things will be a
bit faster, but not much).
The `MPIO_DEBUG' define in mm.h should be set to zero prior to running
any benchmarking.
There is a reservation API by which the kernel tells the filesystem to
reserve space for the delalloc page. It is completely impractical to
perform exact reservations for ext2, at least. So what the code does
is to take worst-case reservations based on the page's file offset, and
to force a writeback when ENOSPC is encountered. The writeback
collapses all existing reservations into real disk mappings. If we're
still out of space, we're really out of space. This happens within
delalloc_prepare_write() at present. I'll be moving it elsewhere soon.
There are new writeback mechanisms - an adaptively-sized pool of
threads called "pdflush" is available for writeback. These perform the
kupdate and bdflush function for dirty pages. They are designed to
avoid the situation where bdflush gets stuck on a particular device,
starving writeout for other devices. pdflush should provide increased
writeout bandwidth on many-disk machines.
There are a minimum of two instances of pdflush. Additional threads
are created and destroyed on-demand according to a
simple-but-seems-to-work algorithm, which is described in the code.
The pdflush threads are used to perform writeback within shrink_cache,
thus making kswapd almost non-blocking.
Global accounting of locked and dirty pages has been introduced. This
permits accurate writeback/throttle decisions in balance_dirty_pages().
Testing is showing considerable improvements in system tractability
under heavy load, while approximately doubling heavy dbench throughput.
Other benchmarks are pretty much unchanged, apart from those which are
affected by file fragmentation, which show improvement.
With this patch, writepage() is still using the buffer layer, so lock
contention will still be high.
A few performance glitches in the dirty-data writeout path have been
fixed.
The PG_locked and PG_dirty flags have been renamed to prevent people
from using them, which would bypass locked- and dirty-page accounting.
A number of functions in fs/inode.c have been renamed. We have a huge
and confusing array of sync_foo() functions in there. I've attempted
to differentiate between `writeback_foo()', which starts I/O, and
`sync_foo()', which starts I/O and waits on it. It's still a bit of a
mess, and needs revisiting.
The ratcache patch removes generic_buffer_fdatasync() from the kernel.
Nothing in the tree is using it.
Within the VM, the concept of ->writepage() has been replaced with the
concept of "write back a mapping". This means that rather than writing
back a single page, we write back *all* dirty pages against the mapping
to which the LRU page belongs.
Despite its crudeness, this actually works rather well. And it's
important, because disk blocks are allocated at ->writepage time, and
we *need* to write out good chunks of data, in ascending file offset
order so that the files are laid out well. Random page-at-a-time
sprinkliness won't cut it.
A simple future refinement is to change the API to be "write back N
pages around this one, including this one". At present, I'll have to
pull a number out of the air (128 pages?). Some additional
intelligence from the VM may help here.
Or not. It could be that writing out all the mapping's pages is
always the right thing to do - it's what bdflush is doing at present,
and it *has* to have the best bandwidth. But it may come unstuck
when applied to swapcache.
Things which must still be done include:
- Closely review the ratcache patch. I fixed several fairly fatal
bugs in it, and it works just fine now. But it seems I was working
from an old version. Still, it would benefit from a careful
walkthrough. This version might be flakey in the tmpfs/shmfs area.
- Remove bdflush and kupdate - use the pdflush pool to provide these
functions.
- Expose the three writeback tunables to userspace (/proc/sys/vm/pdflush?)
- Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
abuse.
- Move the page_cache_size accounting into the per-cpu accumulators.
- Use delalloc for ext2 directories and symlinks.
- Throttle tasks which are dirtying pages via shared mappings.
- Make preallocation and quotas play together.
- Implement non-blocking try_to_free_buffers in the VM for buffers
against delalloc filesystems.
Overall system behaviour is noticeably improved by these patches,
but memory-requesters can still occasionally get blocked for a long
time in sync_page_buffers(). Which is fairly pointless, because the
amount of non-dirty-page memory which is backed by buffers is tiny.
Just hand this work off to pdflush if the backing filesystem is
delayed-allocate.
- Verify that the new APIs and implementation are suitable for XFS.
- Prove the API by implementing delalloc on other filesystems.
Looks to be fairly simple for reiserfs and ext3. But implementing
multi-page no-buffer_head pagecache writeout will be harder for these
filesystems.
Nice-to-do things include:
- Maybe balance_dirty_state() calculations for buffer_head based
filesystems can take into account locked and dirty page counts to
make better flush/throttling decisions.
- Turn swap space into a delayed allocate address_space. Allocation
of swapspace at ->writepage() time should provide improved swap
bandwidth.
- Unjumble the writeout order.
In the current code (as in the current 2.4 and 2.5 kernels), disk
blocks are laid out in the order in which the application wrote(2)
them. So files which are created by lseeky applications are all
jumbled up.
I can't really see a practical solution for this in the general
case, even with radix-tree pagecache. And it may be that we don't
*need* a solution, because it'll often be the case that the read
pattern for the file is also lseeky.
But when the "write some pages around this one" function is
implemented, it will perform this unjumbling. It'll be OK to
implement this by probing the pagecache, or by walking the radix
tree.
- Don't perform synchronous I/O (reads) inside lock_super() on ext2.
Massive numbers of threads get piled up on the superblock lock when
using silly workloads.
^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [patch] delayed disk block allocation
  2002-03-01  8:26 Andrew Morton
@ 2002-03-04  2:08 ` Daniel Phillips
  2002-03-04  3:10   ` Jeff Garzik
                     ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Daniel Phillips @ 2002-03-04 2:08 UTC (permalink / raw)
To: Andrew Morton, lkml; +Cc: Steve Lord
On March 1, 2002 09:26 am, Andrew Morton wrote:
> A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
> available at
>
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
>   - Core functions
> and
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
>   - delalloc implementation for ext2.
>
> Also,
> http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-30-ratcache.patch
>   - Momchil Velikov/Christoph Hellwig radix-tree pagecache ported onto
>     delalloc.

Wow, this is massive.  Why did you write [patch] instead of [PATCH]? ;-)
I'm surprised there aren't any comments on this patch so far; that should
teach you to post on a Friday afternoon.

> Global accounting of locked and dirty pages has been introduced.

Alan seems to be working on this as well.  Besides locked and dirty we
also have 'pinned', i.e., pages that somebody has taken a use count on,
beyond the number of pte+pcache references.

I'm just going to poke at a couple of inconsequential things for now, to
show I've read the post.  In general this is really important work
because it starts to move away from the vfs's current dumb filesystem
orientation.  I doubt all the subproblems you've addressed are tackled in
the simplest possible way, and because of that it's a cinch Linus isn't
just going to apply this.  But hopefully the benchmarking team will
descend upon this and find out if it does/doesn't suck, and hopefully you
plan to maintain it through 2.5.
> Testing is showing considerable improvements in system tractability
> under heavy load, while approximately doubling heavy dbench throughput.
> Other benchmarks are pretty much unchanged, apart from those which are
> affected by file fragmentation, which show improvement.

What is system tractability?

> With this patch, writepage() is still using the buffer layer, so lock
> contention will still be high.

Right, and buffers are going away one way or another.

> A number of functions in fs/inode.c have been renamed.  We have a huge
> and confusing array of sync_foo() functions in there.  I've attempted
> to differentiate between `writeback_foo()', which starts I/O, and
> `sync_foo()', which starts I/O and waits on it.  It's still a bit of a
> mess, and needs revisiting.

Yup.

> Within the VM, the concept of ->writepage() has been replaced with the
> concept of "write back a mapping".  This means that rather than writing
> back a single page, we write back *all* dirty pages against the mapping
> to which the LRU page belongs.

This is a good and natural step, but don't we want to go even more global
than that and look at all the dirty data on a superblock, so the decision
on what to write out is optimized across files for better locality?

> Despite its crudeness, this actually works rather well.  And it's
> important, because disk blocks are allocated at ->writepage time, and
> we *need* to write out good chunks of data, in ascending file offset
> order so that the files are laid out well.  Random page-at-a-time
> sprinkliness won't cut it.

Right.

> A simple future refinement is to change the API to be "write back N
> pages around this one, including this one".  At present, I'll have to
> pull a number out of the air (128 pages?).  Some additional
> intelligence from the VM may help here.
>
> Or not.  It could be that writing out all the mapping's pages is
> always the right thing to do - it's what bdflush is doing at present,
> and it *has* to have the best bandwidth.
This has the desirable result of deferring the work on how best to "write
back N pages around this one, including this one" above.  I really think
that if we sit back and let the vfs evolve on its own a little more, that
question will get much easier to answer.

> But it may come unstuck when applied to swapcache.

You're not even trying to apply this to swap cache right now, are you?

> Things which must still be done include:
>
> [...]
>
> - Remove bdflush and kupdate - use the pdflush pool to provide these
>   functions.

The main disconnect there is sub-page sized writes: you will bundle
together young and old 1K buffers.  Since it's getting harder to find a
1K blocksize filesystem, we might not care.  There is also my nefarious
plan to make struct pages refer to variable-binary-sized objects,
including smaller than 4K PAGE_SIZE.

> - Expose the three writeback tunables to userspace (/proc/sys/vm/pdflush?)

/proc/sys/vm/pageflush  <- we know the pages are dirty

> - Use pdflush for try_to_sync_unused_inodes(), to stop the keventd
>   abuse.

Could you explain, please?

> - Verify that the new APIs and implementation are suitable for XFS.

Hey, I was going to mention that :-)  The question is, how long will it
be before vfs-level delayed allocation is in 2.5?  Steve might see that
as a way to get infinitely sidetracked.

Well, I used up way more than the time budget I have for reading your
patch and post, and I barely scratched the surface.  I hope the rest of
the filesystem gang gets involved at this point, or maybe not.
Cooks/soup.

I guess the thing to do is start thinking about parts that can be broken
out because of obvious correctness.  The dirty/locked accounting would be
one candidate, the multiple flush threads another, and I'm sure there are
more, because you don't seem to have treated much as sacred ;-)

Something tells me the vfs is going to look quite different by the end
of 2.6.

--
Daniel

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  2:08 ` Daniel Phillips
@ 2002-03-04  3:10   ` Jeff Garzik
  2002-03-04  7:27     ` Andrew Morton
  2002-03-04  5:04   ` Mike Fedyk
  2002-03-04  7:20   ` Andrew Morton
  2 siblings, 1 reply; 22+ messages in thread
From: Jeff Garzik @ 2002-03-04 3:10 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, Steve Lord
Daniel Phillips wrote:
> On March 1, 2002 09:26 am, Andrew Morton wrote:
> > A bunch of patches which implement allocate-on-flush for 2.5.6-pre1 are
> > available at
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-10-core.patch
> >   - Core functions
> > and
> > http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre1/dalloc-20-ext2.patch
> >   - delalloc implementation for ext2.
> Wow, this is massive.  Why did you write [patch] instead of [PATCH]? ;-)
> I'm surprised there aren't any comments on this patch so far, that
> should teach you to post on a Friday afternoon.

My only comment is: how fast can we get delalloc into 2.5.x for further
testing and development?  IMNSHO there are few comments because I believe
that few people actually realize the benefits of delalloc.  My ext2
filesystem with --10-- percent fragmentation could sure use code like
this, though.

> > But it may come unstuck when applied to swapcache.
>
> You're not even trying to apply this to swap cache right now are you?

This is a disagreement akpm and I have, actually :)  I actually would
rather that it was made a requirement that all swapfiles are "dense", so
that new block allocation NEVER must be performed when swapping.

> There is also my nefarious plan to make
> struct pages refer to variable-binary-sized objects, including smaller
> than 4K PAGE_SIZE.

sigh...

--
Jeff Garzik      | Building 1024 | MandrakeSoft | Choose life.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  3:10   ` Jeff Garzik
@ 2002-03-04  7:27     ` Andrew Morton
  0 siblings, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2002-03-04 7:27 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Daniel Phillips, lkml, Steve Lord
Jeff Garzik wrote:
>
> ...
> >
> > You're not even trying to apply this to swap cache right now are you?
>
> This is a disagreement akpm and I have, actually :)

Misunderstanding, rather.  Swapfiles aren't interesting, IMO.  And I
agree that mkswap or swapon should just barf if the file has any holes
in it.

But what I refer to here is, simply, delayed allocate for swapspace.  So
swap_out() sticks the target page into the swapcache, marks it dirty and
takes a space reservation for the page out of the swapcache's
address_space.  But no disk space is allocated at swap_out() time.
Instead, the real disk mapping is created when the VM calls
a_ops->vm_writeback() against the swapcache page's address_space.

All of which rather implies a ripup-and-rewrite of half the swap code.
It would certainly require a new allocator.  So I just mentioned the
possibility.  Glad you're interested :)

-

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  2:08 ` Daniel Phillips
  2002-03-04  3:10   ` Jeff Garzik
@ 2002-03-04  5:04   ` Mike Fedyk
  2002-03-04  5:31     ` Andreas Dilger
  2002-03-04  7:20   ` Andrew Morton
  2 siblings, 1 reply; 22+ messages in thread
From: Mike Fedyk @ 2002-03-04 5:04 UTC (permalink / raw)
To: Daniel Phillips; +Cc: Andrew Morton, lkml, Steve Lord
On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> The main disconnect there is sub-page sized writes, you will bundle together
> young and old 1K buffers.  Since it's getting harder to find a 1K blocksize
> filesystem, we might not care.

Please don't do that.

Hopefully, once this is in, 1k blocks will work much better.  There are
many cases where people work with lots of small files, and using 1k
blocks is bad enough; 4k would be worse.

Also, with dhash going into ext2/3, lots of tiny files in one dir will
be feasible and comparable with reiserfs.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  5:04   ` Mike Fedyk
@ 2002-03-04  5:31     ` Andreas Dilger
  2002-03-04  5:40       ` Mike Fedyk
                         ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Andreas Dilger @ 2002-03-04 5:31 UTC (permalink / raw)
To: Daniel Phillips, Andrew Morton, lkml
On Mar 03, 2002 21:04 -0800, Mike Fedyk wrote:
> On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> > The main disconnect there is sub-page sized writes, you will bundle together
> > young and old 1K buffers.  Since it's getting harder to find a 1K blocksize
> > filesystem, we might not care.
>
> Please don't do that.
>
> Hopefully, once this is in, 1k blocks will work much better.  There are many
> cases where people work with lots of small files, and using 1k blocks is bad
> enough, 4k would be worse.
>
> Also, with dhash going into ext2/3 lots of tiny files in one dir will be
> feasible and comparible with reiserfs.

Actually, there are a whole bunch of performance issues with 1kB block
ext2 filesystems.  For very small files, you are probably better off
to have tails in EAs stored with the inode, or with other tails/EAs in
a shared block.  We discussed this on ext2-devel a few months ago, and
while the current ext2 EA design is totally unsuitable for that, it
isn't impossible to fix.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  5:31     ` Andreas Dilger
@ 2002-03-04  5:40       ` Mike Fedyk
  2002-03-04  6:14         ` Andreas Dilger
  2002-03-04  5:41       ` Jeff Garzik
  2002-03-04  7:53       ` Andrew Morton
  2 siblings, 1 reply; 22+ messages in thread
From: Mike Fedyk @ 2002-03-04 5:40 UTC (permalink / raw)
To: Daniel Phillips, Andrew Morton, lkml
On Sun, Mar 03, 2002 at 10:31:03PM -0700, Andreas Dilger wrote:
> On Mar 03, 2002 21:04 -0800, Mike Fedyk wrote:
> > On Mon, Mar 04, 2002 at 03:08:54AM +0100, Daniel Phillips wrote:
> > > The main disconnect there is sub-page sized writes, you will bundle together
> > > young and old 1K buffers.  Since it's getting harder to find a 1K blocksize
> > > filesystem, we might not care.
> >
> > Please don't do that.
> >
> > Hopefully, once this is in, 1k blocks will work much better.  There are many
> > cases where people work with lots of small files, and using 1k blocks is bad
> > enough, 4k would be worse.
> >
> > Also, with dhash going into ext2/3 lots of tiny files in one dir will be
> > feasible and comparible with reiserfs.
>
> Actually, there are a whole bunch of performance issues with 1kB block
> ext2 filesystems.  For very small files, you are probably better off
> to have tails in EAs stored with the inode, or with other tails/EAs in
> a shared block.  We discussed this on ext2-devel a few months ago, and
> while the current ext2 EA design is totally unsuitable for that, it
> isn't impossible to fix.

Great, we're finally heading toward dual sized blocks (or clusters or
etc).  I'll be looking forward to that. :)

Do you think it'll look like block tails (like ffs?) or will it be more
like tail packing in reiserfs?

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  5:40       ` Mike Fedyk
@ 2002-03-04  6:14         ` Andreas Dilger
  0 siblings, 0 replies; 22+ messages in thread
From: Andreas Dilger @ 2002-03-04 6:14 UTC (permalink / raw)
To: Daniel Phillips, Andrew Morton, lkml
On Mar 03, 2002 21:40 -0800, Mike Fedyk wrote:
> On Sun, Mar 03, 2002 at 10:31:03PM -0700, Andreas Dilger wrote:
> > Actually, there are a whole bunch of performance issues with 1kB block
> > ext2 filesystems.  For very small files, you are probably better off
> > to have tails in EAs stored with the inode, or with other tails/EAs in
> > a shared block.  We discussed this on ext2-devel a few months ago, and
> > while the current ext2 EA design is totally unsuitable for that, it
> > isn't impossible to fix.
>
> Do you think it'll look like block tails (like ffs?) or will it be more like
> tail packing in reiserfs?

It will be like tail packing in reiserfs, I believe.  FFS has fixed-sized
'fragments' while reiserfs has arbitrary-sized 'tails'.  The ext2 tail
packing will be arbitrary-sized tails, stored as an extended attribute
(along with any other EAs that this inode has).  With proper EA design,
you can share multiple inodes' EAs in the same block.  We also discussed,
for very small files, that you could store the tail (or other EA data)
within the inode itself.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  5:31     ` Andreas Dilger
  2002-03-04  5:40       ` Mike Fedyk
@ 2002-03-04  5:41       ` Jeff Garzik
  2002-03-04  6:18         ` Andreas Dilger
  2002-03-04  7:53       ` Andrew Morton
  2 siblings, 1 reply; 22+ messages in thread
From: Jeff Garzik @ 2002-03-04 5:41 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Daniel Phillips, Andrew Morton, lkml
Andreas Dilger wrote:
> Actually, there are a whole bunch of performance issues with 1kB block
> ext2 filesystems.  For very small files, you are probably better off
> to have tails in EAs stored with the inode, or with other tails/EAs in
> a shared block.  We discussed this on ext2-devel a few months ago, and
> while the current ext2 EA design is totally unsuitable for that, it
> isn't impossible to fix.

IMO the ext2 filesystem design is on its last legs ;-)  I tend to think
that a new filesystem efficiently handling these features is far better
than dragging ext2 kicking and screaming into the 2002's :)

--
Jeff Garzik      | Building 1024 | MandrakeSoft | Choose life.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation
  2002-03-04  5:41       ` Jeff Garzik
@ 2002-03-04  6:18         ` Andreas Dilger
  2002-03-04 15:02           ` Hans Reiser
  0 siblings, 1 reply; 22+ messages in thread
From: Andreas Dilger @ 2002-03-04 6:18 UTC (permalink / raw)
To: Jeff Garzik; +Cc: Daniel Phillips, Andrew Morton, lkml
On Mar 04, 2002 00:41 -0500, Jeff Garzik wrote:
> Andreas Dilger wrote:
> > Actually, there are a whole bunch of performance issues with 1kB block
> > ext2 filesystems.  For very small files, you are probably better off
> > to have tails in EAs stored with the inode, or with other tails/EAs in
> > a shared block.  We discussed this on ext2-devel a few months ago, and
> > while the current ext2 EA design is totally unsuitable for that, it
> > isn't impossible to fix.
>
> IMO the ext2 filesystem design is on it's last legs ;-)  I tend to
> think that a new filesystem efficiently handling these features is far
> better than dragging ext2 kicking and screaming into the 2002's :)

That's why we have ext3 ;-).  Given that reiserfs just barely has an
fsck that finally works most of the time, and they are about to re-do
the entire filesystem for reiser-v4 in 6 months, I'd rather stick with
glueing features onto an ext2 core than rebuilding everything from
scratch each time.

Given that ext3, and htree, and all of the other ext2 'hacks' seem to
do very well, I think it will continue to improve for some time to come.
A wise man once said "I'm not dead yet".

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation 2002-03-04 6:18 ` Andreas Dilger @ 2002-03-04 15:02 ` Hans Reiser 2002-03-06 12:59 ` Pavel Machek 0 siblings, 1 reply; 22+ messages in thread From: Hans Reiser @ 2002-03-04 15:02 UTC (permalink / raw) To: Andreas Dilger; +Cc: Jeff Garzik, Daniel Phillips, Andrew Morton, lkml Andreas Dilger wrote: >On Mar 04, 2002 00:41 -0500, Jeff Garzik wrote: > >>Andreas Dilger wrote: >> >>>Actually, there are a whole bunch of performance issues with 1kB block >>>ext2 filesystems. For very small files, you are probably better off >>>to have tails in EAs stored with the inode, or with other tails/EAs in >>>a shared block. We discussed this on ext2-devel a few months ago, and >>>while the current ext2 EA design is totally unsuitable for that, it >>>isn't impossible to fix. >>> >>IMO the ext2 filesystem design is on its last legs ;-) I tend to >>think that a new filesystem efficiently handling these features is far >>better than dragging ext2 kicking and screaming into the 2002's :) >> > >That's why we have ext3 ;-). Given that reiserfs just barely has an >fsck that finally works most of the time, and they are about to re-do >the entire filesystem for reiser-v4 in 6 months, I'd rather stick with >gluing features onto an ext2 core than rebuilding everything from >scratch each time. > Isn't this that old evolution vs. design argument? ReiserFS follows a design, code, tweak-for-a-while, then design-again methodology. We take novel designs and algorithms, and stick with them until they are stable production code, and then we go back and do more novel algorithms. Such a methodology is well suited for doing deep rewrites at 5 year intervals. No pain, no gain, or so we think. Rewriting the core is hard work. Some people get success and then coast. Some people get success and then accelerate. We're keeping the pedal on the metal. We're throwing every penny we make into rebuilding the foundation of our filesystem. 
ReiserFS V3 is pretty stable right now. Fsck is usually the last thing to stabilize for a filesystem, and we were no exception to that rule. ReiserFS V3 will last for probably another 3 years (though it will remain supported for I imagine at least a decade, maybe more), with V4 gradually edging it out of the market as V4 gets more and more stable. During at least the next year we will keep some staff adding minor tweaks to V3. For instance, our layout policies will improve over the next few months, journal relocation is about to move from Linux 2.5 to 2.4, and data write ordering code is being developed by Chris. I don't know how long fsck maintenance for V3 will continue, but it will be at least as long as users find bugs in it. The biggest reason we are writing V4 from scratch is that it is a thing of beauty. If you don't understand this, words will not explain it. > > >Given that ext3, and htree, and all of the other ext2 'hacks' seem to >do very well, I think it will continue to improve for some time to come. >A wise man once said "I'm not dead yet". > >Cheers, Andreas >-- >Andreas Dilger >http://sourceforge.net/projects/ext2resize/ >http://www-mddsp.enel.ucalgary.ca/People/adilger/ > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation 2002-03-04 15:02 ` Hans Reiser @ 2002-03-06 12:59 ` Pavel Machek 0 siblings, 0 replies; 22+ messages in thread From: Pavel Machek @ 2002-03-06 12:59 UTC (permalink / raw) To: Hans Reiser Cc: Andreas Dilger, Jeff Garzik, Daniel Phillips, Andrew Morton, lkml Hi! > ReiserFS V3 is pretty stable right now. Fsck is usually the last thing > to stabilize for a filesystem, and we were no exception to that rule. > > ReiserFS V3 will last for probably another 3 years (though it will > remain supported for I imagine at least a decade, maybe more), with V4 > gradually edging it out of the market as V4 gets more and more stable. > During at least the next year we will keep some staff adding minor > tweaks to V3. For instance, our layout policies will improve over the > next few months, journal relocation is about to move from Linux 2.5 to > 2.4, and data write ordering code is being developed by Chris. I don't > know how long fsck maintenance for V3 will continue, but it will be at > least as long as users find bugs in it. When you think your fsck works, let me know. I have some tests to find bugs in fsck.... Pavel -- Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt, details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation 2002-03-04 5:31 ` Andreas Dilger 2002-03-04 5:40 ` Mike Fedyk 2002-03-04 5:41 ` Jeff Garzik @ 2002-03-04 7:53 ` Andrew Morton 2002-03-04 8:06 ` Daniel Phillips 2 siblings, 1 reply; 22+ messages in thread From: Andrew Morton @ 2002-03-04 7:53 UTC (permalink / raw) To: Andreas Dilger; +Cc: Daniel Phillips, lkml Andreas Dilger wrote: > > Actually, there are a whole bunch of performance issues with 1kB block > ext2 filesystems. I don't see any new problems with file tails here. They're catered for OK. What I have not catered for is file holes. With the delalloc patches, whole pages are always written out (except for at eof). So if your file has lots of very small non-holes in it, these will become larger non-holes. If we're serious about 64k PAGE_CACHE_SIZE then this becomes more of a problem. In the worst case, a file which used to consist of 4k block | (1 meg - 4k) hole | 4k block | (1 meg - 4k) hole | ... will become: 64k block | (1 meg - 64k) hole | 64k block | (1 meg - 64k) hole | ... Which is a *lot* more disk space. It'll happen right now, if the file is written via mmap. But with no-buffer-head delayed allocate, it'll happen with write(2) as well. Is such space wastage on sparse files a showstopper? Maybe, but probably not if there is always at least one filesystem which handles this scenario well, because it _is_ a specialised scenario. - ^ permalink raw reply [flat|nested] 22+ messages in thread
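[Editor's note: the hole layout Andrew describes is easy to reproduce. A minimal sketch (Python, run on any filesystem with hole support; the file name and sizes are illustrative only, scaled down to one repeat of the pattern):

```python
import os
import tempfile

# Lay a file out as: 4k data | (1 MB - 4k) hole | 4k data -- one repeat
# of the worst-case pattern from the message above.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)               # 4k block at offset 0
    os.lseek(fd, 1024 * 1024, os.SEEK_SET)  # seek past the hole
    os.write(fd, b"x" * 4096)               # 4k block at offset 1 MB
    os.fsync(fd)
    st = os.fstat(fd)
    logical_size = st.st_size               # 1 MB + 4k = 1052672 bytes
    allocated = st.st_blocks * 512          # bytes actually allocated
finally:
    os.close(fd)
    os.remove(path)

print(logical_size, allocated)
```

On a filesystem that preserves the hole, `allocated` stays near 8k; the concern in the message is that page-at-a-time delayed allocation with a 64k PAGE_CACHE_SIZE would fill in 64k at each end of the hole instead of 4k.]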
* Re: [patch] delayed disk block allocation 2002-03-04 7:53 ` Andrew Morton @ 2002-03-04 8:06 ` Daniel Phillips 2002-03-04 8:34 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Daniel Phillips @ 2002-03-04 8:06 UTC (permalink / raw) To: Andrew Morton, Andreas Dilger; +Cc: lkml On March 4, 2002 08:53 am, Andrew Morton wrote: > Andreas Dilger wrote: > > > > Actually, there are a whole bunch of performance issues with 1kB block > > ext2 filesystems. > > I don't see any new problems with file tails here. They're catered for > OK. What I have not catered for is file holes. With the delalloc > patches, whole pages are always written out (except for at eof). So > if your file has lots of very small non-holes in it, these will become > larger non-holes. > > If we're serious about 64k PAGE_CACHE_SIZE then this becomes more of > a problem. In the worst case, a file which used to consist of > > 4k block | (1 meg - 4k) hole | 4k block | (1 meg - 4k) hole | ... > > will become: > > 64k block | (1 meg - 64k) hole | 64k block | (1 meg - 64k) hole | ... > > Which is a *lot* more disk space. It'll happen right now, if the > file is written via mmap. But with no-buffer-head delayed allocate, > it'll happen with write(2) as well. > > Is such space wastage on sparse files a showstopper? Maybe, > but probably not if there is always at least one filesystem > which handles this scenario well, because it _is_ a specialised > scenario. I guess 4K PAGE_CACHE_SIZE will serve us well for another couple of years, and in that time I hope to produce a patch that generalizes the notion of page size so we can use the size that works best for each address_space, i.e., the same as the filesystem blocksize. By the way, have you ever seen a sparse 1K blocksize file? I haven't, and I wouldn't get too worked up about treating the holes a little less than optimally. -- Daniel ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation 2002-03-04 8:06 ` Daniel Phillips @ 2002-03-04 8:34 ` Andrew Morton 0 siblings, 0 replies; 22+ messages in thread From: Andrew Morton @ 2002-03-04 8:34 UTC (permalink / raw) To: Daniel Phillips; +Cc: Andreas Dilger, lkml Daniel Phillips wrote: > > .. > I guess 4K PAGE_CACHE_SIZE will serve us well for another couple of years, Having reviewed the archives, it seems that the multipage PAGE_CACHE_SIZE patches which Hugh and Ben were working on were mainly designed to increase I/O efficiency. If that's the only reason for large pages then yeah, I think we can stick with 4k PAGE_CACHE_SIZE :). There really are tremendous efficiencies available in the current code. Another (and very significant) reason for large pages is to decrease TLB misses. Said to be very important for large-working-set scientific apps and such. But that doesn't seem to have a lot to do with PAGE_CACHE_SIZE? > ... > By the way, have you ever seen a sparse 1K blocksize file? > ... Sure I have. I just created one. (I'm writing test cases for my emails now. Sheesh). - ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch] delayed disk block allocation 2002-03-04 2:08 ` Daniel Phillips 2002-03-04 3:10 ` Jeff Garzik 2002-03-04 5:04 ` Mike Fedyk @ 2002-03-04 7:20 ` Andrew Morton 2002-03-04 9:23 ` Daniel Phillips 2 siblings, 1 reply; 22+ messages in thread From: Andrew Morton @ 2002-03-04 7:20 UTC (permalink / raw) To: Daniel Phillips; +Cc: lkml, Steve Lord Daniel Phillips wrote: > > ... > Why did you write [patch] instead of [PATCH]? ;-) It's a start ;) >... > > Global accounting of locked and dirty pages has been introduced. > > Alan seems to be working on this as well. Besides locked and dirty we also > have 'pinned', i.e., pages that somebody has taken a use count on, beyond the > number of pte+pcache references. Well one little problem with tree owners writing code is that it's rather hard for mortals to see what they've done. Because the diffs come with so much other stuff. Which is my lame excuse for not having reviewed Alan's work. But if he has global (as opposed to per-mm, per-vma, etc) data then yes, it can go into the page_states[] array. > I'm just going to poke at a couple of inconsequential things for now, to show > I've read the post. In general this is really important work because it > starts to move away from the vfs's current dumb filesystem orientation. > > I doubt all the subproblems you've addressed are tackled in the simplest > possible way, and because of that it's a cinch Linus isn't just going to > apply this. But hopefully the benchmarking team descend upon this and find > out if it does/does't suck, and hopefully you plan to maintain it through 2.5. The problem with little, incremental patches is that they require a high degree of planning, trust and design. A belief that the end outcome will be right. That's hard, and it generates a lot of talk, and the end outcome may *not* be right. So in the best of worlds, we have the end outcome in-hand, and testable. If it works, then we go back to the incremental patches. 
> > Testing is showing considerable improvements in system tractability > > under heavy load, while approximately doubling heavy dbench throughput. > > Other benchmarks are pretty much unchanged, apart from those which are > > affected by file fragmentation, which show improvement. > > What is system tractability? Sorry. General usability when the system is under load. With these patches it's better, but still bad. Look. A process calls the page allocator to, duh, allocate some pages. Processes do *not* call the page allocator because they suddenly feel like spending fifteen seconds asleep on the damned request queue. We need to throttle the writers and only the writers. We need other tasks to be able to obtain full benefit of the rate at which the disks can clean memory. You know where this is headed, don't you: - writeout is performed by the writers, and by the gang-of-flush-threads. - kswapd is 100% non-blocking. It never does I/O. - kswapd is the only process which runs page_launder/shrink_caches. - Memory requesters do not perform I/O. They sleep until memory is available. kswapd gives them pages as they become available, and wakes them up. So that's the grand plan. It may be fatally flawed - I remember Linus had a serious-sounding objection to it some time back, but I forget what that was. We come badly unstuck if it's a disk-writer who goes to sleep on the i-want-some-memory queue, but I don't think it was that. Still, this is just a VM rant. It's not the objective of this work. > > With this patch, writepage() is still using the buffer layer, so lock > > contention will still be high. > > Right, and buffers are going away one way or another. This is a problem. I'm adding new stuff which does old things in a new way, with no believable plan in place for getting rid of the old stuff. I don't think it's humanly possible to do away with struct buffer_head. It is *the* way of representing a disk block. 
And unless we plan to live with 4k pages and 4k blocks for ever, the problem is about to get worse. Think 64k pages with 4k blocks. Possibly we could handle sub-page segments of memory via a per-page up-to-date bitmask. And then a `dirty' bitmask. And then a `locked' bitmask, etc. I suspect eventually we'll end up with, say, a vector of structures attached to each page which represents the state of each of the page's sub-segments. whoops. So as a tool for representing disk blocks - for subdividing individual pages of the block device's pagecache entries, buffer_heads make sense, and I doubt if they're going away. But as a tool for getting bulk file data on and off disk, buffer_heads really must die. Note how submit_bh() now adds an extra memory allocation into each buffer as it goes past. Look at some 2.5 kernel profiles.... > ... > > Within the VM, the concept of ->writepage() has been replaced with the > > concept of "write back a mapping". This means that rather than writing > > back a single page, we write back *all* dirty pages against the mapping > > to which the LRU page belongs. > > This is a good and natural step, but don't we want to go even more global > than that and look at all the dirty data on a superblock, so the decision on > what to write out is optimized across files for better locality. Possibly, yes. The way I'm performing writeback now is quite different from the 2.4 way. Instead of: for (buffer = oldest; buffer != newest; buffer++) write(buffer); it's for (superblock = first; superblock != last; superblock++) for (dirty_inode = first; dirty_inode != last; dirty_inode++) filemap_fdatasync(inode); Again, by luck and by design, it turns out that this almost always works. Care is taken to ensure that the ordering of the various lists is preserved, and that we end up writing data in program-creation order. Which works OK, because filesystems allocate inodes and blocks in the way we expect (and desire). 
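[Editor's note: the two-level writeback loop Andrew sketches in C-like pseudocode above can be modeled concretely. A toy Python model, with all names illustrative (`sync_all` stands in for the pdflush work; `filemap_fdatasync` here just records what would be written), showing that walking superblocks and then each superblock's dirty inodes in dirtying order preserves program-creation order per filesystem:

```python
from collections import OrderedDict

class Superblock:
    """Toy superblock: tracks dirty inodes in the order they were dirtied."""
    def __init__(self, name):
        self.name = name
        self.dirty_inodes = OrderedDict()   # insertion order == dirtying order

    def mark_dirty(self, inode):
        self.dirty_inodes[inode] = True

written = []

def filemap_fdatasync(sb, inode):
    written.append((sb.name, inode))        # stand-in for real writeout

def sync_all(superblocks):
    # The outer/inner loop from the message:
    #   for each superblock, for each dirty inode, flush it.
    for sb in superblocks:
        for inode in list(sb.dirty_inodes):
            filemap_fdatasync(sb, inode)
            del sb.dirty_inodes[inode]

a, b = Superblock("sda1"), Superblock("sdb1")
a.mark_dirty("f1")
b.mark_dirty("g1")
a.mark_dirty("f2")
sync_all([a, b])
print(written)   # [('sda1', 'f1'), ('sda1', 'f2'), ('sdb1', 'g1')]
```

Note the contrast with the oldest-buffer-first 2.4 scheme: here `f2` is written immediately after `f1` even though `g1` was dirtied in between, which is the per-filesystem locality the message is after.]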
What you're proposing is that, within the VM, we opportunistically flush out more inodes - those which neighbour the one which owns the page which the VM wants to free. That would work. It doesn't particularly help us in the case where the VM is trying to get free pages against a specific zone, but it would perhaps provide some overall bandwidth benefits. However, I'm kind of already doing this. Note how the VM's wakeup_bdflush() call also wakes pdflush. pdflush will wake up, walk through all the superblocks, find one which doesn't currently have a pdflush instance working it and will start writing back that superblock's dirty pages. (And the next wakeup_bdflush call will wake another pdflush thread, which will go off and find a different superblock to sync, which is in theory tons better than using a single bdflush thread for all dirty data in the machine. But I haven't demonstrated practical benefit from this yet). > ... > > > But it may come unstuck when applied to swapcache. > > You're not even trying to apply this to swap cache right now are you? No. > > Things which must still be done include: > > > > [...] > > > > - Remove bdflush and kupdate - use the pdflush pool to provide these > > functions. > > The main disconnect there is sub-page sized writes, you will bundle together > young and old 1K buffers. Since it's getting harder to find a 1K blocksize > filesystem, we might not care. There is also my nefarious plan to make > struct pages refer to variable-binary-sized objects, including smaller than > 4K PAGE_SIZE. I was merely suggesting a tidy-up here. pdflush provides a dynamically-sized pool of threads for writing data back to disk. So we can remove the dedicated kupdate and bdflush kernel threads and replace them with: wakeup_bdflush() { pdflush_operation(sync_old_buffers, NULL); } Additionally, we do need to provide ways of turning the kupdate, bdflush and pdflush functions off and on. For laptops, swsusp, etc. 
But these are really strange interfaces which have sort of crept up on us over time. In this case we need to go back, work out what we're really trying to do here and provide a proper set of interfaces. Rather than `kill -STOP $(pidof kupdate)' or whatever the heck people are using. > ... > > - Use pdflush for try_to_sync_unused_inodes(), to stop the keventd > > abuse. > > Could you explain please? keventd is a "process context bottom half handler". It should provide the caller with reasonably-good response times. Recently, schedule_task() has been used for writing ginormous gobs of discontiguous data out to disk because the VM happened to get itself into a sticky corner. So it's another little tidy-up. Use the pdflush pool for this operation, and restore keventd's righteousness. > ... > I guess the thing to do is start thinking about parts that can be broken out > because of obvious correctness. The dirty/locked accounting would be one > candidate, the multiple flush threads another, and I'm sure there are more > because you don't seem to have treated much as sacred ;-) Yes, that's a reasonable ordering. pdflush is simple and powerful enough to be useful even if the rest founders - rationalise kupdate, bdflush, keventd non-abuse, etc. ratcache is ready, IMO. The global page-accounting is not provably needed yet. Here's another patch for you: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.6-pre2/dallocbase-10-readahead.patch It's against 2.5.6-pre2 base. It's a partial redesign and a big tidy-up of the readahead code. It's largely described in the comments (naturally). - Unifies the current three readahead functions (mmap reads, read(2) and sys_readhead) into a single implementation. - More aggressive in building up the readahead windows. - More conservative in tearing them down. - Special start-of-file heuristics. 
- Preallocates the readahead pages, to avoid the (never demonstrated, but potentially catastrophic) scenario where allocation of readahead pages causes the allocator to perform VM writeout. - (hidden agenda): Gets all the readahead pages gathered together in one spot, so they can be marshalled into big BIOs. - reinstates the readahead tunables which Mr Dalecki cheerfully chainsawed. So hdparm(8) and blockdev(8) are working again. The readahead settings are now per-request-queue, and the drivers never have to know about it. - Identifies readahead thrashing. Note "identifies". This is 100% reliable - it detects readahead thrashing beautifully. It just doesn't do anything useful about it :( Currently, I just perform a massive shrink on the readahead window when thrashing occurs. This greatly reduces the amount of pointless I/O which we perform, and will reduce the CPU load. But big deal. It doesn't do anything to reduce the seek load, and it's the seek load which is the killer here. I have a little test case which goes from 40 seconds with 40 files to eight minutes with 50 files, because the 50 file case invokes thrashing. Still thinking about this one. - Provides almost unmeasurable throughput speedups! - ^ permalink raw reply [flat|nested] 22+ messages in thread
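[Editor's note: the window policy described in the patch summary above — build the window up aggressively on sequential hits, tear it down conservatively on misses, and collapse it hard when thrashing is detected — can be sketched as a toy model. The constants and function name below are illustrative, not the values from the actual dallocbase-10-readahead patch:

```python
MAX_WINDOW = 128    # pages; illustrative cap, not the patch's tunable
MIN_WINDOW = 4

def next_window(current, sequential_hit, thrashed):
    """One step of the toy readahead policy."""
    if thrashed:
        return MIN_WINDOW                      # massive shrink on thrash
    if sequential_hit:
        return min(current * 2, MAX_WINDOW)    # aggressive build-up
    return max(current // 2, MIN_WINDOW)       # conservative teardown

w = MIN_WINDOW
for _ in range(6):                    # six sequential hits in a row
    w = next_window(w, True, False)
w_after_hits = w                      # 4 -> 8 -> 16 -> 32 -> 64 -> 128 -> 128
w_after_thrash = next_window(w, True, True)
print(w_after_hits, w_after_thrash)   # 128 4
```

As the message notes, the shrink only cuts wasted bandwidth and CPU; it does nothing for the seek load that actually causes the 40-files-to-50-files collapse.]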
* Re: [patch] delayed disk block allocation 2002-03-04 7:20 ` Andrew Morton @ 2002-03-04 9:23 ` Daniel Phillips 0 siblings, 0 replies; 22+ messages in thread From: Daniel Phillips @ 2002-03-04 9:23 UTC (permalink / raw) To: Andrew Morton; +Cc: lkml, Steve Lord On March 4, 2002 08:20 am, Andrew Morton wrote: > You know where this is headed, don't you: Yes I do, because it's more or less a carbon copy of what I had in mind. > - writeout is performed by the writers, and by the gang-of-flush-threads. > - kswapd is 100% non-blocking. It never does I/O. > - kswapd is the only process which runs page_launder/shrink_caches. > - Memory requesters do not perform I/O. They sleep until memory > is available. kswapd gives them pages as they become available, and > wakes them up. > > So that's the grand plan. It may be fatally flawed - I remember Linus > had a serious-sounding objection to it some time back, but I forget > what that was. I remember, since he gently roasted me last autumn for thinking such thoughts. The idea is that by making threads do their own vm scanning they throttle themselves. I don't think the resulting chaotic scanning behavior is worth it. > > > With this patch, writepage() is still using the buffer layer, so lock > > > contention will still be high. > > > > Right, and buffers are going away one way or another. > > This is a problem. I'm adding new stuff which does old things in > a new way, with no believable plan in place for getting rid of the > old stuff. > > I don't think it's humanly possible to do away with struct buffer_head. > It is *the* way of representing a disk block. And unless we plan > to live with 4k pages and 4k blocks for ever, the problem is about > to get worse. Think 64k pages with 4k blocks. If struct page refers to an object the same size as a disk block then struct page can take the place of a buffer in the getblk interface. For IO we have other ways of doing things, kvecs, bio's and so on. 
We don't need buffers for that, the only thing we need them for is handles for disk blocks, and locking thereof. If you actually had this now your patch would be quite a lot simpler. It's going to take a while though, because first we have to do active physical defragmentation, and that requires rmap. So a prototype is a few months away at least, but six months ago I would have said two years. > Possibly we could handle sub-page segments of memory via a per-page up-to-date > bitmask. And then a `dirty' bitmask. And then a `locked' bitmask, etc. I > suspect eventually we'll end up with, say, a vector of structures attached to > each page which represents the state of each of the page's sub-segments. whoops. You could, but that would be a lot messier than what I have in mind. Your work fits really nicely with that since it gets rid of the IO function of buffers. -- Daniel ^ permalink raw reply [flat|nested] 22+ messages in thread
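[Editor's note: for concreteness, the per-page sub-segment state Andrew floats (and Daniel calls messy) — a 64k page carrying sixteen 4k blocks, each with its own uptodate/dirty state — could look roughly like this. A toy Python model with hypothetical names; it is not a proposal for the real struct page:

```python
PAGE_SIZE = 64 * 1024
BLOCK_SIZE = 4 * 1024
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE   # 16 -> one bit per block

class SubPageState:
    """Toy per-page state: one bitmask per attribute, one bit per 4k block."""
    def __init__(self):
        self.uptodate = 0    # bit i set => block i is up to date
        self.dirty = 0       # bit i set => block i needs writeback

    def mark_dirty(self, block):
        self.dirty |= 1 << block
        self.uptodate |= 1 << block   # freshly written data is current

    def blocks_to_write(self):
        """Block numbers within the page whose dirty bit is set."""
        return [i for i in range(BLOCKS_PER_PAGE) if (self.dirty >> i) & 1]

    def fully_uptodate(self):
        return self.uptodate == (1 << BLOCKS_PER_PAGE) - 1

page = SubPageState()
page.mark_dirty(0)
page.mark_dirty(5)
print(page.blocks_to_write())   # [0, 5]
print(page.fully_uptodate())    # False
```

The "whoops" in the thread is visible even in the toy: every new per-block attribute (locked, mapped, ...) means another parallel bitmask, which is the bookkeeping both correspondents want to avoid.]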
end of thread, other threads:[~2002-03-07 20:52 UTC | newest] Thread overview: 22+ messages -- 2002-03-04 14:54 [patch] delayed disk block allocation rwhron 2002-03-04 15:10 ` Jeff Garzik -- strict thread matches above, loose matches on Subject: below -- 2002-03-07 12:06 Etienne Lorrain 2002-03-07 14:47 ` Steve Lord 2002-03-07 17:30 ` Mike Fedyk 2002-03-01 8:26 Andrew Morton 2002-03-04 2:08 ` Daniel Phillips 2002-03-04 3:10 ` Jeff Garzik 2002-03-04 7:27 ` Andrew Morton 2002-03-04 5:04 ` Mike Fedyk 2002-03-04 5:31 ` Andreas Dilger 2002-03-04 5:40 ` Mike Fedyk 2002-03-04 6:14 ` Andreas Dilger 2002-03-04 5:41 ` Jeff Garzik 2002-03-04 6:18 ` Andreas Dilger 2002-03-04 15:02 ` Hans Reiser 2002-03-06 12:59 ` Pavel Machek 2002-03-04 7:53 ` Andrew Morton 2002-03-04 8:06 ` Daniel Phillips 2002-03-04 8:34 ` Andrew Morton 2002-03-04 7:20 ` Andrew Morton 2002-03-04 9:23 ` Daniel Phillips