* [RFC] kernel facilities for cache prefetching
@ 2006-05-02  7:50 ` Wu Fengguang
3 siblings, 4 replies; 48+ messages in thread
From: Wu Fengguang
To: linux-kernel
Cc: Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

Pre-caching reloaded ;)

I have been exploring cache prefetching on desktop systems for quite some time, and recently found approaches with _proven_ effects. The basic ideas are described in the following Google SoC proposal, with a proof-of-concept patch adding I/O priority support attached.

I'd like to know whether the two proposed kernel components would be acceptable for the mainline kernel, and would welcome any recommendations for improving them.

Thanks,
Wu
------------------------------------------------------------------------
SOC PROPOSAL

Rapid Linux desktop startup through pre-caching

MOTIVATION

KDE, GNOME, OpenOffice, and Firefox all take too long to start up. Boot-time pre-caching seems to be the single most straightforward and effective way to improve startup time and make the Linux desktop experience more comfortable. It is a great pleasure for me to take up the work.

QUALIFICATION

I have been working on Linux kernel readahead for over a year, and developed an adaptive readahead patch (http://lwn.net/Articles/155510/) that brings a number of new capabilities to Linux. During that time I gained knowledge of the VM/IO subsystems of the Linux kernel, became acquainted with the slow-startup problem and the existing solutions, and learned why they do not work as one would expect and how to get the job done better.

PREVIOUS WORK

There have been several similar efforts, e.g.:
- Linux: Boot Time Speedups Through Precaching
  http://kerneltrap.org/node/2157
- Andrew Morton's kernel module solution
  http://www.zip.com.au/~akpm/linux/fboot.tar.gz
- preload - adaptive readahead daemon
  http://sourceforge.net/projects/preload

The former two are mainly kernel-side work, while the latter is a pure user-land solution. But none seems to do the trick for system startup time: Andrew saw ~10% speedup, while preload actually saw a slowdown.

IMHO, Andrew's kernel module - which dumps the contents of the pagecache and restores it at boot time - is one step towards the final solution. However, it takes one more step to actually win the time: optimizing the I/O at restore time.

I/O ANALYSIS

How come the prefetcher gains little or even negative time? After some hunting in the source tree and some experiments, I came to the following conclusions:

1. The prefetcher has to readahead one file _after_ another, and thus loses the opportunity to reorder the blocks and reduce seek time.

   == reasoning ==
   It tends to block on calling open(), where the kernel has to do some dir/inode/bmap lookups and perform tiny I/Os _synchronously_ and with _locks_ held.

2. The readahead I/O stands in the way of normal I/O, so the prefetcher blocks normal apps and loses the opportunity to overlap CPU/IO time.

   == reasoning ==
   Support for I/O priority is still incomplete in Linux. Of the three I/O elevators, anticipatory and deadline simply ignore priority; CFQ is built for fair I/O priorities, though it is still not enough for the readahead case. Imagine that the prefetcher first issues an I/O request for page A with low priority; then the real app comes along, needs page A, and simply waits for the page to become available - which takes a rather long time, because the elevator still treats the request as a low-priority one.

So we get the amazing fact that prefetching can actually slow things down!
SCHEME/GOAL

1) kernel facility to provide the necessary I/O priority support
   - add basic I/O priority support to the AS/deadline elevators:
     never let readahead I/O stand in the way of normal I/O
   - enable AS/CFQ/deadline to react to I/O priority changes:
     reschedule a previous readahead I/O that is now actually demanded
     by a legacy program

2) kernel module to query the file cache
   - on loading:  create /proc/filecache
   - setting up:  echo "param value" > /proc/filecache
   - info query:  cat /proc/filecache
   - sample sessions:

     # modprobe filecache
     $ cat /proc/filecache
     # file ALL
     # mask 0
     #
     # ino     size    cached  status  refcnt  name
     0         2929846 3181    D       1       /dev/hda1
     81455     9       9       _       1       /sbin/init
     ......

     $ echo "file /dev/hda1" > /proc/filecache
     $ cat /proc/filecache
     # file /dev/hda1
     # mask 0
     #
     # idx     len
     0         24
     48        2
     53        5
     ......

3) user-land tools to dump the current contents of the file cache,
   and to restore them at boot time
   - store as plain text files to be script/user friendly
   - be able to run at the very beginning of the boot process
   - be able to trim down the cache records (for small systems)
   - optional poor man's defrag ;)
   - record and replay I/O for any task by taking two cache snapshots
     and doing a set-subtract

A proof-of-concept implementation has been developed and run on my desktop. According to the experimental results, the expected effect of the final work would be:
- the power-on to login time remains roughly the same
- most GUI files are ready at login time
- the login to usable-desktop time comes close to a warm startup

BIO

I am currently pursuing a PhD degree at the University of Science and Technology of China. I enjoy computing and GNU/Linux.
- programming since 1993
- using linux since 1998
- playing PXE since 2000
- developing OSS since 2004
- developing adaptive readahead for linux since 2005

------------------------------------------------------------------------

 block/deadline-iosched.c |   35 +++++++++++++++++++++++++++++++++++
 block/ll_rw_blk.c        |    8 ++++++++
 fs/buffer.c              |    5 +++--
 include/linux/elevator.h |    2 ++
 4 files changed, 48 insertions(+), 2 deletions(-)

--- linux.orig/block/deadline-iosched.c
+++ linux/block/deadline-iosched.c
@@ -310,6 +310,40 @@ deadline_add_request(struct request_queu
 }

 /*
+ * kick a page for io
+ */
+static int
+deadline_kick_page(struct request_queue *q, struct page *page)
+{
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct deadline_rq *drq;
+	struct request *rq;
+	struct list_head *pos;
+	struct bio_vec *bvec;
+	struct bio *bio;
+	int i;
+
+	list_for_each(pos, &dd->fifo_list[READ]) {
+		drq = list_entry_fifo(pos);
+		rq = drq->request;
+		rq_for_each_bio(bio, rq) {
+			bio_for_each_segment(bvec, bio, i) {
+				if (page == bvec->bv_page)
+					goto found;
+			}
+		}
+	}
+
+	return -1;
+
+found:
+	rq->ioprio = 1;
+	list_del(&drq->fifo);
+	deadline_add_drq_fifo(dd, rq);
+	return 0;
+}
+
+/*
  * remove rq from rbtree, fifo, and hash
  */
 static void deadline_remove_request(request_queue_t *q, struct request *rq)
@@ -784,6 +818,7 @@ static struct elevator_type iosched_dead
 	.elevator_merge_req_fn =	deadline_merged_requests,
 	.elevator_dispatch_fn =		deadline_dispatch_requests,
 	.elevator_add_req_fn =		deadline_add_request,
+	.elevator_kick_page_fn =	deadline_kick_page,
 	.elevator_queue_empty_fn =	deadline_queue_empty,
 	.elevator_former_req_fn =	deadline_former_request,
 	.elevator_latter_req_fn =	deadline_latter_request,
--- linux.orig/block/ll_rw_blk.c
+++ linux/block/ll_rw_blk.c
@@ -1620,6 +1620,14 @@ static void blk_backing_dev_unplug(struc
 {
 	request_queue_t *q = bdi->unplug_io_data;

+	if (page &&
+	    IOPRIO_PRIO_CLASS(current->ioprio) != IOPRIO_CLASS_IDLE &&
+	    q->elevator &&
+	    q->elevator->ops->elevator_kick_page_fn) {
+		spin_lock_irq(q->queue_lock);
+		q->elevator->ops->elevator_kick_page_fn(q, page);
+		spin_unlock_irq(q->queue_lock);
+	}
+
 	/*
 	 * devices don't necessarily have an ->unplug_fn defined
 	 */
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -63,8 +63,9 @@ static int sync_buffer(void *word)
 	smp_mb();
 	bd = bh->b_bdev;
-	if (bd)
-		blk_run_address_space(bd->bd_inode->i_mapping);
+	if (bd && bd->bd_inode && bd->bd_inode->i_mapping)
+		blk_run_backing_dev(bd->bd_inode->i_mapping->backing_dev_info,
+				    bh->b_page);
 	io_schedule();
 	return 0;
 }
--- linux.orig/include/linux/elevator.h
+++ linux/include/linux/elevator.h
@@ -20,6 +20,7 @@ typedef int (elevator_set_req_fn) (reque
 typedef void (elevator_put_req_fn) (request_queue_t *, struct request *);
 typedef void (elevator_activate_req_fn) (request_queue_t *, struct request *);
 typedef void (elevator_deactivate_req_fn) (request_queue_t *, struct request *);
+typedef int (elevator_kick_page_fn) (request_queue_t *, struct page *);
 typedef int (elevator_init_fn) (request_queue_t *, elevator_t *);
 typedef void (elevator_exit_fn) (elevator_t *);
@@ -34,6 +35,7 @@ struct elevator_ops
 	elevator_add_req_fn *elevator_add_req_fn;
 	elevator_activate_req_fn *elevator_activate_req_fn;
 	elevator_deactivate_req_fn *elevator_deactivate_req_fn;
+	elevator_kick_page_fn *elevator_kick_page_fn;
 	elevator_queue_empty_fn *elevator_queue_empty_fn;
 	elevator_completed_req_fn *elevator_completed_req_fn;

^ permalink raw reply	[flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 12:46 ` Diego Calleja
From: Diego Calleja
To: Wu Fengguang
Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, 2 May 2006 15:50:49 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> 2) kernel module to query the file cache

Can't mincore() + /proc/$PID/* stuff be used to replace that?

Improving boot time is nice, and querying the file cache would work for that; but improving the start-up time of individual programs once the system is running (i.e. starting OpenOffice 6 hours after booting) is something that preloaders do in other OSes as well, and querying the full file cache wouldn't be that useful for such cases.

The main reason why I believe that the pure userspace solution (preload.sf.net) slows things down in some cases is because it uses Bayesian heuristics (!) as a magic ball to guess the future, which is a flawed idea IMHO. I started (but didn't finish) a preloader which uses the process event connector to get notifications of what processes are being launched, then profiles them (using mincore(), /proc/$PID/* stuff, etc.) and preloads things optimally the next time it gets a notification for the same app.
Mac OS X has a program that implements your idea; the sources are available at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/

Also, Mac OS X uses launchd as an init/init.d replacement (http://darwinsource.opendarwin.org/projects/apsl/launchd-106/). If (as Arjan noticed) the bootup is not really I/O-bound, launchd could help reduce the CPU time wasted in shell interpreters (a more Unixy port has been done by the FreeBSD people - http://wikitest.freebsd.org/moin.cgi/launchd).
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 14:42 ` Wu Fengguang
From: Wu Fengguang
To: Diego Calleja
Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

Diego,

On Tue, May 02, 2006 at 02:46:41PM +0200, Diego Calleja wrote:
> > 2) kernel module to query the file cache
>
> Can't mincore() + /proc/$PID/* stuff be used to replace that ?

Nope. mincore() only provides info about files that are currently opened, by the process itself. The majority of files in the file cache are closed files.

> Improving boot time is nice and querying the file cache would work
> for that, but improving the boot time of some programs once the system
> is running (ie: running openoffice 6 hours after booting) is something
> that other preloaders do in other OSes as well; querying the full file
> cache wouldn't be that useful for such cases.

Yes, it can still be useful after booting :) One can get the cache footprint of any task started at any time by taking snapshots of the cache before and after the task, and doing a set-subtract on them.

> The main reason why I believe that the pure userspace (preload.sf.net)
> solution slows down in some cases is because it uses bayesian heuristics
> (!) as a magic ball to guess the future, which is a flawed idea IMHO.
> I started (but didn't finish) a preloader which uses the process event
> connector to get notifications of what processes are being launched,
> then profiles them (using mincore(), /proc/$PID/* stuff, etc) and
> preloads things optimally the next time it gets a notification of the
> same app.

My thought is that any prefetcher that ignores the thousands of small data files is incomplete. Only the kernel can provide that info.
> Mac OS X has a program that implements your idea; the sources are
> available at http://darwinsource.opendarwin.org/projects/apsl/BootCache-25/

Ah, thanks for mentioning it. It seems to do the job mostly in the kernel, comprising 2400+ LOC. My plan, in contrast, is to write a module of ~300 LOC to _only_ provide the necessary info, and leave the other jobs to smart user-land tools.

Wu
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 16:07 ` Diego Calleja
From: Diego Calleja
To: Wu Fengguang
Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, 2 May 2006 22:42:03 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> Nope. mincore() only provides info about files that are currently
> opened, by the process itself. The majority of files in the file
> cache are closed files.

Yes; what I meant was:
- start listening to the process event connector as the first task on startup
- get a list of what files are being used (every second or so) by
  looking at /proc/$PID/* stuff
- mincore() all those files at the end of the bootup

> Yes, it can still be useful after booting :) One can get the cache
> footprint of any task started at any time by taking snapshots of the
> cache before and after the task, and doing a set-subtract on them.

Although this certainly looks simpler for userspace (and complete, if you want to get absolutely all the info about files that get opened and closed faster than the profile interval of a prefetcher).

(another useful tool would be a dtrace-like thing)
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 6:45 ` Wu Fengguang
From: Wu Fengguang
To: Diego Calleja
Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Tue, May 02, 2006 at 06:07:53PM +0200, Diego Calleja wrote:
> > Nope. mincore() only provides info about files that are currently
> > opened, by the process itself. The majority of files in the file
> > cache are closed files.
>
> Yes; what I meant was:
> - start listening to the process event connector as the first task on startup
> - get a list of what files are being used (every second or so)
>   by looking at /proc/$PID/* stuff
> - mincore() all those files at the end of the bootup
>
> Although this certainly looks simpler for userspace (and complete, if
> you want to get absolutely all the info about files that get opened and
> closed faster than the profile interval of a prefetcher)

Thanks - so it's a question of simplicity vs. completeness :)

> (another useful tool would be a dtrace-like thing)

Lubos Lunak also reminded me of SUSE's preload (http://en.opensuse.org/index.php?title=SUPER_preloading_internals), which is a user-land solution using strace to collect the info.

And there's Andrea Arcangeli's "bootcache userspace logging" kernel patch (http://lkml.org/lkml/2004/8/6/216).

Wu
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 18:14 ` Diego Calleja
From: Diego Calleja
To: Wu Fengguang
Cc: linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Wed, 3 May 2006 14:45:03 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:

> Lubos Lunak also reminds me of SUSE's preload
> (http://en.opensuse.org/index.php?title=SUPER_preloading_internals)
> which is a user-land solution using strace to collect the info.
>
> And there's Andrea Arcangeli's "bootcache userspace logging" kernel
> patch (http://lkml.org/lkml/2004/8/6/216).

Just for completeness: Windows Vista will include an enhanced prefetcher called (sic) SuperFetch. The idea behind it seems to be to analyze I/O patterns and then "mirror" the most frequently used disk blocks onto a USB flash drive; when the flash drive is plugged in, the system will read those blocks from it, instead of from the hard drive, the next time you run the app (http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html).
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 23:39 ` Zan Lynx
From: Zan Lynx
To: Diego Calleja
Cc: Wu Fengguang, linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Wed, 2006-05-03 at 20:14 +0200, Diego Calleja wrote:
> Just for completeness: Windows Vista will include an enhanced
> prefetcher called (sic) SuperFetch. The idea behind it seems to be to
> analyze I/O patterns and then "mirror" the most frequently used disk
> blocks onto a USB flash drive; when the flash drive is plugged in,
> the system will read those blocks from it, instead of from the hard
> drive, the next time you run the app
> (http://www.windowsitpro.com/Windows/Article/ArticleID/48085/48085.html)

Linux should be able to do something like this using unionfs. It could be worthwhile to try it with one of the very fastest flash cards or USB drives.

With slower cards and USB keys it's more of a loss unless the faster seek speed can make up for it, because sequential hard-drive access is faster. For comparison, an OCZ USB 2.0 flash drive with dual channel works at about 30 MB/s. One of my 7,200 RPM SATA desktop drives does 45 MB/s. A 15k SCSI drive can do over 60 MB/s.

It'd be great for laptops though. My slug of a laptop drive does 20 MB/s on a good day.
-- 
Zan Lynx <zlynx@acm.org>
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-04 1:37 ` Diego Calleja
From: Diego Calleja
To: Zan Lynx
Cc: wfg, linux-kernel, torvalds, akpm, axboe, nickpiggin, pbadari, arjan

On Wed, 03 May 2006 17:39:51 -0600,
Zan Lynx <zlynx@acm.org> wrote:

> Linux should be able to do something like this using unionfs. It could
> be worthwhile to try it with one of the very fastest flash cards or USB
> drives.

BTW, I forgot that the next Intel laptops (and apparently, Mactel laptops) will have a small flash memory built in for this very purpose. According to an article at extremetech.com, the flash memory requires an "I/O controller"; IOW, it is not transparent and seems to need support from the OS (although I guess that Vista's "SuperFetch" was designed precisely for this hardware). Apparently, the main purpose here is to improve battery life (disks seeking for many seconds can eat a lot of power, I guess).

"Prefetchers" are becoming trendy, it seems (page 14): http://www.intel.com/pressroom/kits/events/idfspr_2006/20060307_MaloneyTranscript.pdf

> With slower cards and USB keys it's more of a loss unless the faster
> seek speed can make up for it, because sequential hard-drive access is
> faster.

That's where the gain is; if hard-drive access were sequential, people wouldn't be talking about prefetchers. My SATA desktop drive also does ~45 MB/s, but it doesn't go beyond 2 MB/s when seeking.
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 15:55 ` Linus Torvalds
From: Linus Torvalds
To: Wu Fengguang
Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Tue, 2 May 2006, Wu Fengguang wrote:
>
>  block/deadline-iosched.c |   35 +++++++++++++++++++++++++++++++++++
>  block/ll_rw_blk.c        |    8 ++++++++
>  fs/buffer.c              |    5 +++--
>  include/linux/elevator.h |    2 ++
>  4 files changed, 48 insertions(+), 2 deletions(-)

Now, regardless of the other issues people have brought up, I'd like to say that I think this is broken. Doing prefetching on a physical block basis is simply not a valid approach, for several reasons:

- it misses several important cases (you can surely prefetch over NFS too)

- it gives you only a very limited view into what is actually going on.

- it doesn't much allow you to _fix_ any problems, it just allows you to try to paper them over.

- it's useless anyway, since pretty much all kernel caching is based on virtual caches, so if you "pre-read" the physical buffers, it won't help: you'll just waste time reading the data into a buffer that will never be used, and when the real request comes in, the read will be done _again_.

So I would _seriously_ claim that the place to do all the statistics gathering is in anything that ends up having to call "->readpage()", and to do it all at the virtual mapping level.

Yes, it isn't perfect either (I'll mention some problems), but it's a _lot_ better. It means that when you gather the statistics, you can see the actual _files_ and offsets being touched. You can even get the filenames by following the address space -> inode -> i_dentry list.
This is important for several reasons:

(a) it makes it a hell of a lot more readable, and the user gets a lot more information that may make him see the higher-level issues involved.

(b) it's in the form that we cache things, so if you read-ahead in that form, you'll actually get real information.

(c) it's in a form where you can actually _do_ something about things like fragmentation etc ("Oh, I could move these files all to a separate area")

Now, admittedly it has a few downsides:

- right now "readpage()" is called in several places, and you'd have to create some kind of nice wrapper for the most common "mapping->a_ops->readpage()" thing and hook into there to avoid duplicating the effort.

  Alternatively, you could decide that you only want to do this at the filesystem level, which actually simplifies some things. If you instrument "mpage_readpage[2]()", you'll already get several of the ones you care about, and you could do the others individually.

  [ As a third alternative, you might decide that the only thing you actually care about is when you have to wait on a locked page, and instrument the page wait-queues instead. ]

- it will miss any situation where a filesystem does a read some other way. Notably, in many loads, the _directory_ accesses are the important ones, and if you want statistics for those you'd often have to do that separately (not always - some of the filesystems just use the same page reading stuff).

The downsides basically boil down to the fact that it's not as clearly just one single point. You can't just look at the request queue and see what physical requests go out.

NOTE! You can obviously do both, and try to correlate one against the other, and you'd get the best possible information ("is this seek because we started reading another file, or is it because the file itself is fragmented" kind of stuff). So physical statistics aren't meaningless.
They're just _less_ important than the virtual ones, and you should do them only if you already do virtual stats.

Linus
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-02 16:35 ` Andi Kleen
From: Andi Kleen
To: Linus Torvalds
Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty, wfg

Linus Torvalds <torvalds@osdl.org> writes:
>
> The downsides basically boil down to the fact that it's not as clearly
> just one single point. You can't just look at the request queue and see
> what physical requests go out.

The other problem is that readpages has no idea about the layout of the disk, so just doing it dumbly might end up with a lot of seeking. I suppose you would at least need some tweaks to the I/O scheduler to make sure it doesn't give these reads any priority, and keeps them in the queue long enough to get really good sorting and large requests.

The problem is that this would likely need to be asynchronous to be of any use, unless you were willing to block a thread for a potentially long time for each file to be prefetched.

-Andi
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 4:11 ` Wu Fengguang
From: Wu Fengguang
To: Linus Torvalds
Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> Doing prefetching on a physical block basis is simply not a valid
> approach, for several reasons:

Sorry! I made a misleading introduction. I'll try to explain it in more detail.

DATA ACQUISITION

/proc/filecache provides an interface to query the cached pages of any file. This information is expressed in tuples of <idx, len>, which more specifically means <mapping-offset, pages>.

Normally one should use 'echo' to set up two parameters before doing 'cat':
@file
	the filename;
	use 'ALL' to get a list of all cached files
@mask
	only show the pages with non-zero (page-flags & @mask);
	for simplicity, use '0' to show all present pages (0 is taken as ~0)

Normally, one should first get the file list using the param 'file ALL', and then iterate through all the files and pages of interest with the params 'file filename' and 'mask pagemask'.

The param 'mask' acts as a filter for different users: it allows sysadmins to know where their memory goes, and the prefetcher to ignore pages from false readahead.

One can use 'mask hex(PG_active|PG_referenced|PG_mapped)' in its hex form to show only accessed pages (here PG_mapped is a faked flag), and 'mask hex(PG_dirty)' to show only dirtied pages.

One can use
	$ echo "file /sbin/init" > /proc/filecache
	$ echo "mask 0" > /proc/filecache
	$ cat /proc/filecache
to get an idea which pages of /sbin/init are currently cached.
In the proposal, I used the following example, which proved to be rather misleading:
	$ echo "file /dev/hda1" > /proc/filecache
	$ cat /proc/filecache
The intention of that example was to show that filesystem dir/inode buffer status - which is the key data for user-land pre-caching - can also be retrieved through this interface.

So the proposed solution is to
- prefetch normal files at the virtual mapping level
- prefetch fs dir/inode buffers on a physical block basis

I/O SUBMISSION

How can we avoid unnecessary seeks when prefetching at the virtual mapping level? The answer is to leave this job to the I/O elevators. What we should do is present the elevators with most of the readahead requests before too many requests have been submitted to the disk drivers.

The proposed scheme is:
1) (first things first)
   issue all readahead requests for filesystem buffers
2) (in background, often blocked)
   issue all readahead requests for normal files
-) make sure the above requests are of really _low_ priority
3) regular system boot continues
4) promote the priority of any request that is now demanded by
   legacy programs

In this scheme, most work is done by user-land tools. The required kernel support is minimal and general-purpose:
- a /proc/filecache interface
- the ability to promote I/O priority on demanded pages

With this approach, we avoid the complicated OSX BootCache solution, which is a physical-blocks-based, special-handling-in-kernel solution that is exactly what Linus is against.

Thanks,
Wu
* Re: [RFC] kernel facilities for cache prefetching
@ 2006-05-03 17:28 ` Badari Pulavarty
From: Badari Pulavarty
To: Wu Fengguang
Cc: Linus Torvalds, lkml, Andrew Morton, Jens Axboe, Nick Piggin

On Wed, 2006-05-03 at 12:11 +0800, Wu Fengguang wrote:
> On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote:
> > Doing prefetching on a physical block basis is simply not a valid
> > approach, for several reasons:
>
> Sorry! I made a misleading introduction. I'll try to explain it in
> more detail.
>
> DATA ACQUISITION
>
> /proc/filecache provides an interface to query the cached pages of any
> file. This information is expressed in tuples of <idx, len>, which
> more specifically means <mapping-offset, pages>.
> [...]
> One can use
> 	$ echo "file /sbin/init" > /proc/filecache
> 	$ echo "mask 0" > /proc/filecache
> 	$ cat /proc/filecache
> to get an idea which pages of /sbin/init are currently cached.
> In the proposal, I used the following example, which proved to be
> rather misleading:
> 	$ echo "file /dev/hda1" > /proc/filecache
> 	$ cat /proc/filecache
> The intention of that example was to show that filesystem dir/inode
> buffer status -- which is the key data for user-land pre-caching --
> can also be retrieved through this interface.
> [...]
> In the scheme, most work is done by user-land tools. The required
> kernel support is minimal and general-purpose:
> - a /proc/filecache interface
> - the ability to promote I/O priority on demanded pages
>
> By this approach, we avoid the complicated OSX BootCache solution,
> which is a physical-blocks-based, special-handling-in-kernel solution
> that is exactly what Linus is against.

Wu,

A while ago, I hacked up a similar /proc interface:

	echo "<filesystem-name>" > /proc/pagecache-usage

which showed the pagecache usage of every file in that filesystem (filename, #num pages). My main objective was to shoot down the pagecache for all the files in a given filesystem. I ended up using it to do posix_fadvise(POSIX_FADV_DONTNEED) on those files.
(Initially, I tried to do this without the /proc interface, by doing fadvise() on all files in the filesystem - but that ended up bloating the inode and dentry caches). Yeah, having this kind of information would be useful. But I am not sure how much it can benefit regular workloads - unless one is willing to tweak things heavily. Bottom line is, we need a better strategy for how this information would be used. Thanks, Badari ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <346733486.30800@ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <346733486.30800@ustc.edu.cn> @ 2006-05-04 15:03 ` Linus Torvalds 2006-05-04 16:57 ` Badari Pulavarty [not found] ` <20060505144451.GA6134@mail.ustc.edu.cn> 0 siblings, 2 replies; 48+ messages in thread From: Linus Torvalds @ 2006-05-04 15:03 UTC (permalink / raw) To: Wu Fengguang Cc: Badari Pulavarty, lkml, Andrew Morton, Jens Axboe, Nick Piggin On Thu, 4 May 2006, Wu Fengguang wrote: > > On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote: > > A while ago, I hacked up a similar /proc interface > > echo "<filesystem-name>" > /proc/pagecache-usage > > > > which showed the pagecache usage of every file in that filesystem > > (filename, #num pages). My main objective was to shoot down the pagecache > > for all the files in a given filesystem. I ended up using it to do > > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, I tried > > to do this without the /proc interface, by doing fadvise() on all files in the > > filesystem - but that ended up bloating the inode and dentry caches). > > Ah, I have not thought of the possibility of querying the page cache > just to drop some caches -- a subset of sysctl_drop_caches functions. Actually, I did something even simpler for a totally one-off thing: you don't actually even need to drop caches or track pretty much anything, it's sufficient for most analysis to just have a timestamp on each page cache page, and then have some way to read out the current cached contents. You can then use the timestamps to get a pretty good idea of what order things happened in. You'll lose the temporary file information (files that are created and deleted), because deleting a file will also flush the page cache for it, but temporary files tend to not be hugely interesting. The really nice thing was that you don't even have to set the timestamp in any complex place: you do it at page _allocation_ time. 
That automatically gets the right answer for any page cache page, and you can do it in a single place. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-04 15:03 ` Linus Torvalds @ 2006-05-04 16:57 ` Badari Pulavarty [not found] ` <20060505144451.GA6134@mail.ustc.edu.cn> 1 sibling, 0 replies; 48+ messages in thread From: Badari Pulavarty @ 2006-05-04 16:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Wu Fengguang, lkml, Andrew Morton, Jens Axboe, Nick Piggin On Thu, 2006-05-04 at 08:03 -0700, Linus Torvalds wrote: > > On Thu, 4 May 2006, Wu Fengguang wrote: > > > > On Wed, May 03, 2006 at 10:28:00AM -0700, Badari Pulavarty wrote: > > > A while ago, I hacked up a similar /proc interface > > > echo "<filesystem-name>" > /proc/pagecache-usage > > > > > > which showed the pagecache usage of every file in that filesystem > > > (filename, #num pages). My main objective was to shoot down the pagecache > > > for all the files in a given filesystem. I ended up using it to do > > > posix_fadvise(POSIX_FADV_DONTNEED) on those files. (Initially, I tried > > > to do this without the /proc interface, by doing fadvise() on all files in the > > > filesystem - but that ended up bloating the inode and dentry caches). > > > > Ah, I have not thought of the possibility of querying the page cache > > just to drop some caches -- a subset of sysctl_drop_caches functions. > > Actually, I did something even simpler for a totally one-off thing: you > don't actually even need to drop caches or track pretty much anything, > it's sufficient for most analysis to just have a timestamp on each page > cache page, and then have some way to read out the current cached contents. > > You can then use the timestamps to get a pretty good idea of what order > things happened in. > > You'll lose the temporary file information (files that are created and > deleted), because deleting a file will also flush the page cache for it, > but temporary files tend to not be hugely interesting. > > The really nice thing was that you don't even have to set the timestamp in > any complex place: you do it at page _allocation_ time. 
That automatically > gets the right answer for any page cache page, and you can do it in a > single place. Sounds like an interesting idea. The problem I was trying to address was that one of our large database customers was complaining about the variation (degradation) in database performance when other applications like backup, ftp, scp, or tar are running on other filesystems on the system. The VM ends up swapping out some of the database process's data to make room for these reads/writes. They wanted a way to completely flush out the pagecache data for a given filesystem to see if it improves their situation (although "swappiness" should in *theory* do the same thing). In fact, they want a way to control the amount of pagecache that filesystems populate, through /proc :( My patch was an initial attempt to unwind the mysteries of VM behaviour :) (The VM may be doing the right thing from the kernel's perspective - but the customer was complaining that it's thrashing their workload). Thanks, Badari ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060505144451.GA6134@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060505144451.GA6134@mail.ustc.edu.cn> @ 2006-05-05 14:44 ` Wu Fengguang 0 siblings, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-05 14:44 UTC (permalink / raw) To: Linus Torvalds Cc: Badari Pulavarty, lkml, Andrew Morton, Jens Axboe, Nick Piggin On Thu, May 04, 2006 at 08:03:50AM -0700, Linus Torvalds wrote: > Actually, I did something even simpler for a totally one-off thing: you > don't actually even need to drop caches or track pretty much anything, > it's sufficient for most analysis to just have a timestamp on each page > cache, and then have some way to read out the current cached contents. > > You can then use the timestamps to get a pretty good idea of what order > things happened in. > The really nice thing was that you don't even have to set the timestamp in > any complex place: you do it at page _allocation_ time. That automatically > gets the right answer for any page cache page, and you can do it in a > single place. Looks nice. A detailed scheme might be: 1) ctime/atime for each radix tree node (or cluster of pages) It seems to be a good accuracy/overhead compromise, and is perfect for the most common case of sequential accesses. struct radix_tree_node { + unsigned long ctime, atime; } radix_tree_node_alloc() { + node->ctime = some_virtual_time(); } radix_tree_lookup_slot() { + node->atime = some_virtual_time(); } 2) eviction-time for each recently evicted page Store the eviction-time _in place_ in the slot that used to store the page's address, with minimal space/time impact. +#define SLOT_IS_EVICTION_TIME 1 radix_tree_delete() { - pathp->node->slots[pathp->offset] = NULL; + pathp->node->slots[pathp->offset] = + some_virtual_time() | SLOT_IS_EVICTION_TIME; } Regards, Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060503071325.GC4781@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060503071325.GC4781@mail.ustc.edu.cn> @ 2006-05-03 7:13 ` Wu Fengguang 0 siblings, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-03 7:13 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, May 02, 2006 at 08:55:06AM -0700, Linus Torvalds wrote: > So I would _seriously_ claim that the place to do all the statistics > allocation is in anything that ends up having to call "->readpage()", and > do it all on a virtual mapping level. > > Yes, it isn't perfect either (I'll mention some problems), but it's a > _lot_ better. It means that when you gather the statistics, you can see > the actual _files_ and offsets being touched. You can even get the > filenames by following the address space -> inode -> i_dentry list. > > This is important for several reasons: > (a) it makes it a hell of a lot more readable, and the user gets a > lot more information that may make him see the higher-level issues > involved. > (b) it's in the form that we cache things, so if you read-ahead in > that form, you'll actually get real information. > (c) it's in a form where you can actually _do_ something about things > like fragmentation etc ("Oh, I could move these files all to a > separate area") There have been two alternatives for me: 1) static/passive interface i.e. the /proc/filecache querier - user-land tools request to dump the cache contents on demand 2) dynamic/active interface i.e. the readpage() logger - user-land daemon accepts live page access/io activities > Now, admittedly it has a few downsides: > > - right now "readpage()" is called in several places, and you'd have to > create some kind of nice wrapper for the most common > "mapping->a_ops->readpage()" thing and hook into there to avoid > duplicating the effort. 
> > Alternatively, you could decide that you only want to do this at the > filesystem level, which actually simplifies some things. If you > instrument "mpage_readpage[2]()", you'll already get several of the > ones you care about, and you could do the others individually. > > [ As a third alternative, you might decide that the only thing you > actually care about is when you have to wait on a locked page, and > instrument the page wait-queues instead. ] > > - it will miss any situation where a filesystem does a read some other > way. Notably, in many loads, the _directory_ accesses are the important > ones, and if you want statistics for those you'd often have to do that > separately (not always - some of the filesystems just use the same > page reading stuff). > > The downsides basically boil down to the fact that it's not as clearly > just one single point. You can't just look at the request queue and see > what physical requests go out. Good insights. The readpage() activities logging idea has been appealing for me. We might even go further to log mark_page_accessed() calls for more information. This approach is more precise, and provides process/page correlations and time info that the /proc/filecache interface cannot provide. Though it involves more complexity and overhead(for me they mean the possibility of being rejected:). Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 15:55 ` Linus Torvalds ` (2 preceding siblings ...) [not found] ` <20060503071325.GC4781@mail.ustc.edu.cn> @ 2006-05-03 12:59 ` Nikita Danilov 2006-05-03 22:20 ` Rik van Riel 2006-05-04 0:28 ` Linda Walsh 5 siblings, 0 replies; 48+ messages in thread From: Nikita Danilov @ 2006-05-03 12:59 UTC (permalink / raw) To: Linus Torvalds Cc: linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty Linus Torvalds writes: > [...] > > Now, admittedly it has a few downsides: > > - right now "readpage()" is called in several places, and you'd have to > create some kind of nice wrapper for the most common > "mapping->a_ops->readpage()" thing and hook into there to avoid > duplicating the effort. > > Alternatively, you could decide that you only want to do this at the > filesystem level, which actually simplifies some things. If you > instrument "mpage_readpage[2]()", you'll already get several of the > ones you care about, and you could do the others individually. > > [ As a third alternative, you might decide that the only thing you > actually care about is when you have to wait on a locked page, and > instrument the page wait-queues instead. ] > > - it will miss any situation where a filesystem does a read some other > way. Notably, in many loads, the _directory_ accesses are the important > ones, and if you want statistics for those you'd often have to do that > separately (not always - some of the filesystems just use the same > page reading stuff). Another disadvantage is that ->readpage() can only do read-ahead within single file, which is not helpful for the case of reading a lot of small files (and this is what happens during startup). And to implement reasonable multi-file read-ahead at the file system layer one needs asynchronous inode loading interface implemented for every file system. Nikita. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 15:55 ` Linus Torvalds ` (3 preceding siblings ...) 2006-05-03 12:59 ` Nikita Danilov @ 2006-05-03 22:20 ` Rik van Riel [not found] ` <20060506011125.GA9099@mail.ustc.edu.cn> 2006-05-04 0:28 ` Linda Walsh 5 siblings, 1 reply; 48+ messages in thread From: Rik van Riel @ 2006-05-03 22:20 UTC (permalink / raw) To: Linus Torvalds Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, 2 May 2006, Linus Torvalds wrote: > On Tue, 2 May 2006, Wu Fengguang wrote: > > > > 4 files changed, 48 insertions(+), 2 deletions(-) > > So I would _seriously_ claim that the place to do all the statistics > allocation is in anything that ends up having to call "->readpage()", and > do it all on a virtual mapping level. Why not simply read everything in a whole file at a time at boot time, while we still have enough free memory ? We can have a small modification to the readahead code to read in the whole file on the first read or fault, or maybe even on open. Once the system is done booting, it can switch this bootup readahead mode off through a tunable in /proc. If the system is booting on a system with not enough memory to load everything file-at-a-time, the bootup scripts can switch this off earlier (or not switch it on). The kernel modifications needed to make this work are minimal. It might need some tweaking so we don't try to read in truly enormous files, but that is easy. Does this sound reasonable ? -- All Rights Reversed ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060506011125.GA9099@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060506011125.GA9099@mail.ustc.edu.cn> @ 2006-05-06 1:11 ` Wu Fengguang 0 siblings, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-06 1:11 UTC (permalink / raw) To: Rik van Riel Cc: Linus Torvalds, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Wed, May 03, 2006 at 06:20:29PM -0400, Rik van Riel wrote: > Why not simply read everything in a whole file at a time > at boot time, while we still have enough free memory ? Sure it helps, though it may be unnecessary once we can rely on userspace pre-caching tools to do it in a more accurate and efficient way. The main concern would be inter-file/inter-buffer readahead. I have some data to support it:

% wc -l hda* startup.files
 1743 hda2_buffers   <== my /
  335 hda3_buffers   <== my /home
 1130 startup.files
 3208 total

The above cache sizes sum up to ~110MB, so it's about 36KB per seek. Thanks, Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 15:55 ` Linus Torvalds ` (4 preceding siblings ...) 2006-05-03 22:20 ` Rik van Riel @ 2006-05-04 0:28 ` Linda Walsh 2006-05-04 1:31 ` Linus Torvalds 5 siblings, 1 reply; 48+ messages in thread From: Linda Walsh @ 2006-05-04 0:28 UTC (permalink / raw) To: Linus Torvalds Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty Linus Torvalds wrote: > Now, regardless of the other issues people have brought up, I'd like to > say that I think this is broken. > > Doing prefetching on a physical block basis is simply not a valid > approach, for several reasons: > > - it misses several important cases (you can surely prefetch over NFS > too) --- I don't think fetching over a network is a case that needs to be "optimized". I don't think there would be sufficient gain in "pre-fetching" over a network (even if it was one long network read) to outweigh network latencies. I would strongly suggest this be ignored. > - it gives you only a very limited view into what is actually going on. --- ??? In what way? I don't think we need a *complex* view of what is going on. If anything, complexity can be a performance drag (though in some cases, an excellent algorithm can outweigh the extra computational complexity). > - it doesn't much allow you to _fix_ any problems, it just allows you to > try to paper them over. ---- It isn't designed to be the one "true" fix for every problem known. It's meant to provide speedups for a common subset of I/O requests. If one comes up with the one-uber algorithm in three years, then replace this one; meanwhile... > - it's useless anyway, since pretty much all kernel caching is based on > virtual caches, so if you "pre-read" the physical buffers, it won't > help: you'll just waste time reading the data into a buffer that will > never be used, and when the real request comes in, the read will > be done _again_. ---- This is the big drawback/problem. 
But this leads me to an obvious question. If I read a file into memory, it is cached, _seemingly_, relative to the filename. If I read the same physical block into memory using _block_-i/o, relative to the device, the claim is that two separate reads are done into the block cache? Can't this result in inconsistent results between the two copies? If I change the file-relative block, how is this coordinated with the device-relative block? Surely there is some block coalescing done at some point to ensure the two different views of the block are consistent? If not, I'd tend to view this as a "bug". But presuming this is "dealt with" at some layer, can't the same information that is used to merge the two blocks be used to identify that the same block is already in memory? If file-blockno addresses can't be converted to device-blockno's, how can the blocks be written to disk? If the file-blockno's are convertible, can't this be used to compare which physical blocks are already in the cache? Certainly this would be faster than the milliseconds it could take to do another physical read, no? > (a) it makes it a hell of a lot more readable, and the user gets a > lot more information that may make him see the higher-level issues > involved. --- I'm not sure this is a benefit that outweighs the elimination of having to go through the file system (following directory chains, indirect blocks, extents, etc). > (b) it's in the form that we cache things, so if you read-ahead in > that form, you'll actually get real information. ---- This may not be a great reason. Using the fact that "this is the way things are already done" may not allow optimal implementation of faster algorithms. > (c) it's in a form where you can actually _do_ something about things > like fragmentation etc ("Oh, I could move these files all to a > separate area") ---- This could theoretically be done solely on a block basis. 
While file pointers would have to be updated to point to "out-of-sequence" blocks, it might be more optimal to store _some_ files discontiguously in order to pack the desired blocks into a (relatively) "small" area that can be read into memory in some minimal number of physical i/o requests. > - it will miss any situation where a filesystem does a read some other > way. Notably, in many loads, the _directory_ accesses are the important > ones, and if you want statistics for those you'd often have to do that > separately (not always - some of the filesystems just use the same > page reading stuff). --- Recording physical block reads could automatically include this situation. > > The downsides basically boil down to the fact that it's not as clearly > just one single point. You can't just look at the request queue and see > what physical requests go out. --- You also have the downside of having to go through the file-system and following each path and file-chain in order to read a potentially discontiguous section of a file. It also, _seemingly_, has the downside of reading in "entire files" when only subsections of files may be needed on startup. > > NOTE! You can obviously do both, and try to correlate one against the > other, and you'd get the best possible information ("is this seek because > we started reading another file, or is it because the file itself is > fragmented" kind of stuff). So physical statistics aren't meaningless. > They're just _less_ important than the virtual ones, and you should do > them only if you already do virtual stats. --- Sounds like you might need to have 2 sets of "addresses" / block: file relative block addresses, and device relative block addresses. -l ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-04 0:28 ` Linda Walsh @ 2006-05-04 1:31 ` Linus Torvalds 2006-05-04 7:08 ` Ph. Marek 0 siblings, 1 reply; 48+ messages in thread From: Linus Torvalds @ 2006-05-04 1:31 UTC (permalink / raw) To: Linda Walsh Cc: Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Wed, 3 May 2006, Linda Walsh wrote: > > > - it gives you only a very limited view into what is actually going on. > > ??? In what way? I don't think we need a *complex* view of what > is going on. The block-based view is a very simple one, but it's _too_ simple. It doesn't actually tell the user what is happening. It doesn't tell you why the request happened in the first place, so it leaves you no way to really do anything sane about it. You can't prefetch (because the indexing is wrong), and you can't actually even analyze it (because you don't know any background to what you're seeing). In short, you can't _do_ anything with it. Linus ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-04 1:31 ` Linus Torvalds @ 2006-05-04 7:08 ` Ph. Marek 2006-05-04 7:33 ` Arjan van de Ven 0 siblings, 1 reply; 48+ messages in thread From: Ph. Marek @ 2006-05-04 7:08 UTC (permalink / raw) To: Linus Torvalds Cc: Linda Walsh, Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Thursday 04 May 2006 03:31, Linus Torvalds wrote: > On Wed, 3 May 2006, Linda Walsh wrote: > > > - it gives you only a very limited view into what is actually going > > > on. > > > > ??? In what way? I don't think we need a *complex* view of what > > is going on. > > The block-based view is a very simple one, but it's _too_ simple. It > doesn't actually tell the user what is happening. It doesn't tell you why > the request happened in the first place, so it leaves you no way to really > do anything sane about it. > > You can't prefetch (because the indexing is wrong), and you can't actually > even analyze it (because you don't know any background to what you're > seeing). In short, you can't _do_ anything with it. Please forgive me if I'm dumb, but what about the following reasoning: The user as such (the person just *using* the computer) won't do anything with "program xyz uses blocks a, b, and c.". But it wouldn't help much if he were told "program xyz uses files a, b, and d." That's a task for package maintainers and distributions to optimize the *number* of accesses. So the simple information about which block numbers we need would not benefit the users (or the distribution developers), but it's enough for caching things: Ascending block numbers on disk can be read very fast, as the disk needs little or no seeking. That's even true for stripes and mirrors. (I grant you that there are complicated setups out there, but these could be handled similarly.) 
If we know which blocks are read (directories, inodes, data), an early userspace application (possibly even from initrd) could
- read these blocks into a ramdisk device (linearly),
- define a device mapper as a mirror of
  - the original root partition,
  - and a sparse device composed from parts of the ramdisk, interleaved by non-existing (error-producing) ranges, and
- set up the device mapper as the root device.
- At a specified point during startup, the mapping device gets set to just a linear interpretation of the root partition, and the ramdisk gets discarded.

Example:
--------
N specifies a needed block, the numbers other blocks.

Harddisk:           1 2 3 N N N 7 8 N 10 N 12 13 14 N 16 N ...
RAM-Disk contains:  4 5 6 9 11 15 17

The mirror set would look like this, with E representing an error-returning block:

Harddisk:              1 2 3 N N N 7 8 N 10 N 12 13 14 N 16 N ...
mapped from RAM-Disk:  E E E 4 5 6 E E 9 E 11 E E E 15 E 17 ...

Advantages:
- As long as there's a good chance that the needed blocks are read from the ramdisk (which was filled linearly, at 20-60MB/sec as reported by others), *all* disk-access times (inodes, directories, data) should be nearly zero. Of course, there will be blocks which now have to be read that weren't needed on the last boot, and which will have to be taken from the harddisk, but the majority should come from the ramdisk.

Problems:
- We'd need to define a priority in the mirror set, so that *everything* possible is taken from the ramdisk
- There has to be a way to *not* disable the ram-disk when an error-block is read.

That's completely doable in userspace (except for knowing which blocks have to be read), and doesn't sound like much overhead to me. Suggestions, ideas? Does that work? Does it not? Regards, Phil ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-04 7:08 ` Ph. Marek @ 2006-05-04 7:33 ` Arjan van de Ven [not found] ` <20060504121454.GB6008@mail.ustc.edu.cn> 0 siblings, 1 reply; 48+ messages in thread From: Arjan van de Ven @ 2006-05-04 7:33 UTC (permalink / raw) To: Ph. Marek Cc: Linus Torvalds, Linda Walsh, Wu Fengguang, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty > Ascending block numbers on disk can be read very fast, as the disk needs no or > less seeking. That's even true for stripes and mirrors. (I grant you that > there are complicated setups out there, but these could be handled similar.) > btw this all really spells out that you may want to do this as a device mapper thing; eg have a device mapper module that can do "lookaside" to a different order/mirror block whatever. The filesystem just doesn't want to know; do it at the DM level ;) That also solves the entire caching layering problem etc ;) ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060504121454.GB6008@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060504121454.GB6008@mail.ustc.edu.cn> @ 2006-05-04 12:14 ` Wu Fengguang 2006-05-04 12:34 ` Arjan van de Ven 1 sibling, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-04 12:14 UTC (permalink / raw) To: Arjan van de Ven Cc: Ph. Marek, Linus Torvalds, Linda Walsh, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote: > > > > Ascending block numbers on disk can be read very fast, as the disk needs no or > > less seeking. That's even true for stripes and mirrors. (I grant you that > > there are complicated setups out there, but these could be handled similar.) > > > > > btw this all really spells out that you may want to do this as a device > mapper thing; eg have a device mapper module that can do "lookaside" to > a different order/mirror block whatever. The filesystem just doesn't > want to know; do it at the DM level ;) That also solves the entire > caching layering problem etc ;) I guess some big corps might want to install such a layer into their storage products ;) ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060504121454.GB6008@mail.ustc.edu.cn> 2006-05-04 12:14 ` Wu Fengguang @ 2006-05-04 12:34 ` Arjan van de Ven 1 sibling, 0 replies; 48+ messages in thread From: Arjan van de Ven @ 2006-05-04 12:34 UTC (permalink / raw) To: Wu Fengguang Cc: Ph. Marek, Linus Torvalds, Linda Walsh, linux-kernel, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Thu, 2006-05-04 at 20:14 +0800, Wu Fengguang wrote: > On Thu, May 04, 2006 at 09:33:24AM +0200, Arjan van de Ven wrote: > > > > > > > Ascending block numbers on disk can be read very fast, as the disk needs no or > > > less seeking. That's even true for stripes and mirrors. (I grant you that > > > there are complicated setups out there, but these could be handled similar.) > > > > > > > > > btw this all really spells out that you may want to do this as a device > > mapper thing; eg have a device mapper module that can do "lookaside" to > > a different order/mirror block whatever. The filesystem just doesn't > > want to know; do it at the DM level ;) That also solves the entire > > caching layering problem etc ;) > > I guess some big corps might want to install such a layer into their > storage products ;) maybe, could be. Doing it at this level also has the advantage that fs metadata becomes seek free as well; since that is mostly fixed-location inside the fs, reordering that on an fs-level is really hard.. on a dm level.. easy. ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 7:50 ` [RFC] kernel facilities for cache prefetching Wu Fengguang 2006-05-02 12:46 ` Diego Calleja 2006-05-02 15:55 ` Linus Torvalds @ 2006-05-03 21:45 ` Linda Walsh [not found] ` <20060504121212.GA6008@mail.ustc.edu.cn> 2006-05-04 9:02 ` Helge Hafting 3 siblings, 1 reply; 48+ messages in thread From: Linda Walsh @ 2006-05-03 21:45 UTC (permalink / raw) To: Wu Fengguang, linux-kernel Wu Fengguang wrote: > Pre-caching reloaded ;) > I/O ANALYSE... > SCHEME/GOAL(s)... > Some good analysis and ideas. I don't know if it is wanted, but I'd like to add a 'few cents' referring to the pre-fetch mechanism in XP, which addresses both boot and application prefetch and has the benefit of showing measurable performance improvements (compare the boot time of an NT4 system to XP; maybe a 5-8x performance boost?).

1. As you mention, reading files "sequentially" through the file system is "bad" for several reasons. Areas of interest:
a) Don't go through the file system. Don't waste time doing directory lookups and following file-allocation maps; instead, use raw-disk i/o and read sectors in using device & block number.
b) Be "dynamic": "trace" (record dev & blockno/range) blocks starting ASAP after system boot and continuing for some "configurable" number of seconds past reaching the desired "run-level" (coinciding with initial disk quiescence). Save a "configurable" (~6-8?) number of traces to allow finding the common initial subset of blocks needed.
c) Allow specification of a max # of blocks and a max number of "sections" (discontiguous areas on disk);
d) "Ideally", have a way to "defrag" the common set of blocks. I.e. -- moving the needed blocks from potentially disparate areas of files into 1 or 2 contiguous areas, hopefully near the beginning of the disk (or partition(s)).

That's the area of "boot" pre-caching. Next is doing something similar for "application" starts. 
Start tracing when an application is loaded & observe what blocks are requested for that app for the first 20 ("configurable") seconds of execution. Store traces on a per-application basis. Again, it would be ideal if the different files (blocks, really), needed by an application could be grouped so that sequentially needed disk-blocks are stored sequentially on disk (this _could_ imply the containing files are not contiguous). Essentially, one wants to do for applications, the same thing one does for booting. On small applications, the benefit would likely be negligible, but on loading a large app like a windowing system, IDE, or database app, multiple configuration files could be read into the cache in one large read. That's "application" pre-caching. A third area -- that can't be easily done in the kernel, but would require a higher skill level on the part of application and library developers, is to move towards using "delay-loaded" libraries. In Windows, it seems common among system libraries to use this feature. An obvious benefit -- if certain features of a program are not used, the associated libraries are never loaded. Not loading unneeded parts of a program should speed up initial application load time, significantly. I don't know where the "cross-over" point is, but moving to demand loaded "so's" can cause extreme benefits for interactive usage. In addition to load-time benefits, additional benefits are gained by not wasting memory on unused libraries and program features. In looking at the distro I use, many unused libraries are linked in with commonly used programs. For a small office or home setup, I rarely have a need for LDAP, internationalized libraries, & application support for virtually any hardware & software configuration. This isn't a kernel problem -- the kernel has dynamically loadable modules to allow loading only of hardware & software drivers that are needed in a particular kernel. 
User applications that I have been exposed to in Linux usually don't have this adaptability -- virtually everything is loaded at execution time -- not as needed during program execution. I've seen this feature used on Unix systems to dynamically present feature interfaces depending on the user's software configuration. On linux, I more often see an "everything, including the kitchen sink" approach, where every possible software configuration is supported via "static", shared libraries that must be present and loaded into memory before the program begins execution. This has the potential to have a greater benefit as the application environment becomes more complex, if you think about how the number of statically loaded, sharable libraries has increased (I have seen the addition of ldap, pam, and most recently, selinux libraries that are required for loading before application execution). Good luck in speeding these things up. It might require some level of cooperation in different areas (kernel, fs utils, distro configuration, application design and build, etc.). -linda ^ permalink raw reply [flat|nested] 48+ messages in thread
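The "delay-loaded library" idea Linda describes can be approximated today without linker support by resolving an optional library at the point of first use, via dlopen(3). A hedged sketch using Python's ctypes wrapper around dlopen — the `fast_sqrt` wrapper is invented for illustration, and libm is assumed to be present as on any glibc system:

```python
import ctypes
import ctypes.util

_libm = None  # not loaded until a math feature is actually used


def fast_sqrt(x):
    """Load libm lazily on the first call (dlopen() under the hood)."""
    global _libm
    if _libm is None:
        # find_library may need ldconfig; fall back to the usual soname.
        path = ctypes.util.find_library("m") or "libm.so.6"
        _libm = ctypes.CDLL(path)            # this is the dlopen()
        _libm.sqrt.restype = ctypes.c_double
        _libm.sqrt.argtypes = [ctypes.c_double]
    return _libm.sqrt(x)


print(fast_sqrt(9.0))  # 3.0 -- libm was mapped only now, not at startup
```

If the feature is never exercised, the library is never opened, never read, and never mapped — which is exactly the startup win being argued for.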
[parent not found: <20060504121212.GA6008@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060504121212.GA6008@mail.ustc.edu.cn> @ 2006-05-04 12:12 ` Wu Fengguang 2006-05-04 18:57 ` Linda Walsh 0 siblings, 1 reply; 48+ messages in thread From: Wu Fengguang @ 2006-05-04 12:12 UTC (permalink / raw) To: Linda Walsh; +Cc: linux-kernel On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote: > 1. As you mention; reading files "sequentially" through the file > system is "bad" for several reasons. Areas of interest: > a) don't go through the file system. Don't waste time doing > directory lookups and following file-allocation maps; Instead, > use raw-disk i/o and read sectors in using device & block number. Sorry, it does not fit linux's cache model. > b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks > starting ASAP after system boot and continuing for some "configurable" > number of seconds past reaching the desired "run-level" (coinciding with > initial disk quiescence). Save as "configurable" (~6-8?) number of > traces to allow finding the common initial subset of blocks needed. It is an alternative way of doing the same job: more precise, but with more complexity and more overhead. However, the 'blockno' way is not so tasteful. > c) Allow specification of max# of blocks and max number of "sections" > (discontiguous areas on disk); Good point, will do it in my work. > d) "Ideally", would have a way to "defrag" the common set of blocks. > I.e. -- moving the needed blocks from potentially disparate areas of > files into 1 or 2 contiguous areas, hopefully near the beginning of > the disk (or partition(s)). > That's the area of "boot" pre-caching. I guess a poor man's defrag would be good enough for the seeking storm. > Next is doing something similar for "application" starts. Start tracing > when an application is loaded & observe what blocks are requested for > that app for the first 20 ("configurable") seconds of execution. Store > traces on a per-application basis. 
Again, it would be ideal if the > different files (blocks, really), needed by an application could be > grouped so that sequentially needed disk-blocks are stored sequentially > on disk (this _could_ imply the containing files are not contiguous). > > Essentially, one wants to do for applications, the same thing one does > for booting. On small applications, the benefit would likely be negligible, > but on loading a large app like a windowing system, IDE, or database app, > multiple configuration files could be read into the cache in one large > read. > > That's "application" pre-caching. Yes, it is a planned feature, will do it. > A third area -- that can't be easily done in the kernel, but would > require a higher skill level on the part of application and library > developers, is to move towards using "delay-loaded" libraries. In > Windows, it seems common among system libraries to use this feature. > An obvious benefit -- if certain features of a program are not used, > the associated libraries are never loaded. Not loading unneeded parts > of a program should speed up initial application load time, significantly. Linux already does lazy loading for linked libs. 
The only pitfall is that /lib/ld-linux.so.2 seems to touch the first 512 bytes of every lib before doing mmap():

% strace date
execve("/bin/date", ["date"], [/* 41 vars */]) = 0
uname({sys="Linux", node="lark", ...}) = 0
brk(0) = 0x8053000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0e000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f0d000
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=77643, ...}) = 0
mmap2(NULL, 77643, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7efa000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/librt.so.1", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`\35\0\000"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0644, st_size=30612, ...}) = 0
mmap2(NULL, 29264, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7ef2000
mmap2(0xb7ef8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6) = 0xb7ef8000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\260O\1"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=1270928, ...}) = 0
mmap2(NULL, 1276892, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7dba000
mmap2(0xb7ee8000, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x12e) = 0xb7ee8000
mmap2(0xb7ef0000, 7132, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7ef0000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/libpthread.so.0", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\340G\0"..., 512) = 512
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~HERE~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
fstat64(3, {st_mode=S_IFREG|0755, st_size=85770, ...}) = 0
mmap2(NULL, 70104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7da8000
mmap2(0xb7db6000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd) = 0xb7db6000
mmap2(0xb7db8000, 4568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7db8000
close(3) = 0

These reads can lead to more seeks, and also greatly disturb the readahead logic. Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
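One way a prefetcher can sidestep those synchronous 512-byte header reads is to tell the kernel up front which library files will be needed, so their pages are already queued as readahead before ld.so opens them. A sketch using posix_fadvise(2) through Python's os module — the path list is hypothetical, and this is an illustration of the hinting mechanism, not any patch from this thread:

```python
import os


def prefetch(paths):
    """Hint the kernel to pull whole files into the page cache ahead of use."""
    for p in paths:
        try:
            fd = os.open(p, os.O_RDONLY)
        except OSError:
            continue  # missing file: nothing to hint, just skip it
        try:
            # POSIX_FADV_WILLNEED queues asynchronous readahead for the
            # range; a length of 0 means "to the end of the file".
            if hasattr(os, "posix_fadvise"):  # Linux; a no-op elsewhere
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        finally:
            os.close(fd)


# Paths would come from a recorded boot/app trace:
prefetch(["/lib/ld-linux.so.2", "/lib/tls/libc.so.6"])
```

Because WILLNEED only queues I/O and returns, many files can be hinted back-to-back, giving the elevator a batch of requests to sort — unlike the open()/read() sequence above, which blocks on each file in turn.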
* Re: [RFC] kernel facilities for cache prefetching 2006-05-04 12:12 ` Wu Fengguang @ 2006-05-04 18:57 ` Linda Walsh [not found] ` <20060505152007.GB6134@mail.ustc.edu.cn> 0 siblings, 1 reply; 48+ messages in thread From: Linda Walsh @ 2006-05-04 18:57 UTC (permalink / raw) To: Wu Fengguang, linux-kernel Wu Fengguang wrote: > On Wed, May 03, 2006 at 02:45:53PM -0700, Linda Walsh wrote: > >> 1. As you mention; reading files "sequentially" through the file >> system is "bad" for several reasons. Areas of interest: >> a) don't go through the file system. Don't waste time doing >> directory lookups and following file-allocation maps; Instead, >> use raw-disk i/o and read sectors in using device & block number. >> > > Sorry, it does not fit linux's cache model. > --- Maybe linux's cache model needs to be _improved_ to better allow for hardware acceleration?^** It is the job of the OS to provide sufficiently low-level facilities to allow optimal use of the system hardware, while at the same time providing enough high-level facilities to support applications that don't require such tuning. >> b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks >> starting ASAP after system boot and continuing for some "configurable" >> number of seconds past reaching the desired "run-level" (coinciding with >> initial disk quiescence). Save as "configurable" (~6-8?) number of >> traces to allow finding the common initial subset of blocks needed. >> > > It is an alternative way of doing the same job: more precise, but with more > complexity and more overhead. However, the 'blockno' way is not so > tasteful. > ---- Maybe not so tasteful to you, but it is an alternate path that circumvents unnecessary i/o redirections. An additional optimization is to have a "cache" of frequently used "system applications" that have their pathnames registered, so no run-time searching of the user PATH is necessary. 
>> c) Allow specification of max# of blocks and max number of "sections" >> (discontiguous areas on disk); >> > > Good point, will do it in my work. > > >> d) "Ideally", would have a way to "defrag" the common set of blocks. >> I.e. -- moving the needed blocks from potentially disparate areas of >> files into 1 or 2 contiguous areas, hopefully near the beginning of >> the disk (or partition(s)). >> That's the area of "boot" pre-caching. >> > > > I guess a poor man's defrag would be good enough for the seeking storm. > --- I disagree. The "poor man's defrag", as you call it, puts entire files into contiguous sections -- each of which will have to be referenced by following a PATH and directory chain. The optimization I'm talking about would only store the referenced data-blocks actually used in the files. This would allow a direct read into memory of a *bunch* of needed sectors while not including sectors from the same files that are not actually read from during the boot *or* app. initialization process. >> That's "application" pre-caching. >> > Yes, it is a planned feature, will do it. > Très cool. >> A third area -- that can't be easily done in the kernel, but would >> require a higher skill level on the part of application and library >> developers, is to move towards using "delay-loaded" libraries. In >> Windows, it seems common among system libraries to use this feature. >> An obvious benefit -- if certain features of a program are not used, >> the associated libraries are never loaded. Not loading unneeded parts >> of a program should speed up initial application load time, significantly. >> > Linux already does lazy loading for linked libs. The only pitfall > is that /lib/ld-linux.so.2 seems to touch the first 512 bytes of every > lib before doing mmap(): > ---- Yes -- and this is the "killer". 
If my app may "potentially" use 50 run-time libs, but in any single invocation, only uses 20 of those libs, the page tables and fixups for the unused 30 libraries don't need to be done. In fact, in some configurations, those 30 libs may not even need to be present on disk! Typical example - "Active Directory"; I don't use it. I don't need the libraries on my system or need them "resolved" at runtime. It would be far preferable if programs would only load those libraries actually used at run-time -- and load them *dynamically*, as needed (for libraries that may not actually be called or used). This way, the initial time to start the program is significantly reduced to the "core" set of libraries needed to run the program. Optional features are loaded as those features are called for off disk. Delays of loading optional libraries one-two at a time, interactively, are not likely to be noticed, but if you load all of those "optional" libraries prior to execution, the sum-total will be noticed in an interactive environment. ^** -- "somewhere", it _seems_, the physical, device relative sector must be resolved. If it is not, how is i/o-block buffer consistency maintained when the user references "device "hda", sector "1000", then the same sector as "hda1", sector 500, and also as file "/etc/passwd", sector 0? _If_ cache consistency is maintained (and I _assume_^**2 it is), they all need to be mapped to a physical sector at some point. ^**2 - Use of assumption noted; feel free to correct me and tell me this isn't the case if linux doesn't maintain disk-block cache consistency. -linda ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060505152007.GB6134@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060505152007.GB6134@mail.ustc.edu.cn> @ 2006-05-05 15:20 ` Wu Fengguang 0 siblings, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-05 15:20 UTC (permalink / raw) To: Linda Walsh; +Cc: linux-kernel On Thu, May 04, 2006 at 11:57:21AM -0700, Linda Walsh wrote: > Wu Fengguang wrote: > >> b) Be "dynamic"; "Trace" (record (dev&blockno/range) blocks > >>starting ASAP after system boot and continuing for some "configurable" > >>number of seconds past reaching the desired "run-level" (coinciding with > >>initial disk quiescence). Save as "configurable" (~6-8?) number of > >>traces to allow finding the common initial subset of blocks needed. > >> > > > >It is an alternative way of doing the same job: more precise, but with more > >complexity and more overhead. However, the 'blockno' way is not so > >tasteful. > > > ---- > Maybe not so tasteful to you, but it is an alternate path that > circumvents unnecessary i/o redirections. An additional optimization Yes, it's about the trade-off between the overhead and the generality that the indirections bring. > is to have a "cache" of frequently used "system applications" that have > their pathnames registered, so no run-time searching of the user PATH > is necessary. The facility is already there: it is called the dcache. > >I guess a poor man's defrag would be good enough for the seeking storm. > > > --- > I disagree. The "poor man's defrag", as you call it, puts entire files > into contiguous sections -- each of which will have to be referenced by > following > a PATH and directory chain. The optimization I'm talking about would > only store > the referenced data-blocks actually used in the files. This would allow a > direct read into memory of a *bunch* of needed sectors while not including > sectors from the same files that are not actually read from during the > boot *or* > app. initialization process. 
The seek storm is mostly caused by small files (which are read in whole), and the corresponding scattered dir/inode buffers. By moving these files into one single directory, both the file data and inode buffers are kept contiguous enough. > ^** -- "somewhere", it _seems_, the physical, device relative sector > must be resolved. If it is not, how is i/o-block buffer consistency > maintained when the user references "device "hda", sector "1000", then > the same sector as "hda1", sector 500, and also as file "/etc/passwd", > sector 0? _If_ cache consistency is maintained (and I _assume_^**2 > it is), they all need to be mapped to a physical sector at some point. > > ^**2 - Use of assumption noted; feel free to correct me and tell me > this isn't the case if linux doesn't maintain disk-block cache > consistency. The linux cache model is something like:

    file /bin/ls        <----->  physical drive
    file /boot/vmlinuz  <----->  physical drive
    file /dev/hda1      <----->  physical drive

As opposed to this one:

    file /bin/ls        <----->  file /dev/hda1  <----->  physical drive
    file /boot/vmlinuz  <----->  file /dev/hda1

That is, every file (including the block device file /dev/hda1 itself) caches its pages directly against the physical drive; a regular file's cache is not layered on top of the block device's cache, so the two are not kept coherent with each other. Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
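This distinction is exactly why raw-device prefetching "does not fit linux's cache model": pages are cached per inode (keyed by inode and page index), and /dev/hda1 is just another inode with its own cache, so blocks pulled in through the raw device can never satisfy a later read of /bin/ls. A toy model of the first diagram (all names invented; this is an illustration of the indexing, not kernel code):

```python
# Toy page-cache model: one tree per inode, keyed by page index -- NOT
# keyed by physical block, so data cached under the block device's inode
# cannot satisfy a later read of a regular file, even when both names
# refer to the same physical sectors.
page_cache = {}   # (inode, page_index) -> data
disk_reads = []   # log of physical reads, to expose hits vs. misses


def read_page(inode, index):
    key = (inode, index)
    if key not in page_cache:
        disk_reads.append(key)           # miss: must touch the drive
        page_cache[key] = f"data@{key}"  # stand-in for the real sectors
    return page_cache[key]


read_page("/dev/hda1", 500)   # "prefetch" through the raw device
read_page("/bin/ls", 0)       # still a miss: different inode, own cache
assert len(disk_reads) == 2   # the raw-device prefetch did not help
```

This is why the proposals in this thread prefetch through file pathnames (or reorder I/O beneath the page cache) rather than reading raw sectors into memory.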
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 7:50 ` [RFC] kernel facilities for cache prefetching Wu Fengguang ` (2 preceding siblings ...) 2006-05-03 21:45 ` Linda Walsh @ 2006-05-04 9:02 ` Helge Hafting 3 siblings, 0 replies; 48+ messages in thread From: Helge Hafting @ 2006-05-04 9:02 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty Wu Fengguang wrote: > Rapid linux desktop startup through pre-caching > > >MOTIVATION > > KDE, Gnome, OpenOffice, and Firefox all take too long to start up. > Boot time pre-caching seems to be the single most straightforward and > effective way to improve it and make linux desktop experience more > comfortable. It is a great pleasure for me to take up the work. > > Actually, the best way is to not run so much software. And yes, that is an option. I won't say no to an improved kernel too though. :-) The apps mentioned are popular, but few need *all* of them. One can do without KDE and gnome, and run a nice lightweight window manager instead. Take the kde/gnome performance hit only when you actually need some kde/gnome app. Not every day. A nice window manager like icewm or fluxbox brings the login delay down to 3s or so for me. Openoffice has lightweight alternatives for every task (abiword, lyx, gnumeric, ...). Strange that this bloated sw is as popular as it is, given the many alternatives. Not something I use every month, and I use linux exclusively for my office tasks. Another alternative is to profile the slow apps and improve them. Fix algorithms, optimize stuff. The slow boot is fixable by: 1) run boot scripts in parallel instead of sequentially - somewhat experimental but helps. Especially if you can bring up X before the slowest stuff completes. 2) Don't run what you don't use/need! Don't install everything and the kitchen sink just because it is free software. 
I am guilty of installing too much myself, so I suffer 40-second bootup time. But I don't reboot my office pc every week, normally I only have that 3s login delay. Helge Hafting ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching [not found] <20060502075049.GA5000@mail.ustc.edu.cn> 2006-05-02 7:50 ` [RFC] kernel facilities for cache prefetching Wu Fengguang @ 2006-05-02 7:58 ` Arjan van de Ven [not found] ` <20060502080619.GA5406@mail.ustc.edu.cn> 2006-05-02 8:09 ` Jens Axboe 2006-05-02 19:10 ` Pavel Machek 3 siblings, 1 reply; 48+ messages in thread From: Arjan van de Ven @ 2006-05-02 7:58 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty > > PREVIOUS WORKS > > There has been some similar efforts, i.e. > - Linux: Boot Time Speedups Through Precaching > http://kerneltrap.org/node/2157 > - Andrew Morton's kernel module solution > http://www.zip.com.au/~akpm/linux/fboot.tar.gz > - preload - adaptive readahead daemon > http://sourceforge.net/projects/preload you missed the solution Fedora deploys since over a year using readahead ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060502080619.GA5406@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060502080619.GA5406@mail.ustc.edu.cn> @ 2006-05-02 8:06 ` Wu Fengguang 2006-05-02 8:30 ` Arjan van de Ven 1 sibling, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-02 8:06 UTC (permalink / raw) To: Arjan van de Ven Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote: > > > > > PREVIOUS WORKS > > > > There has been some similar efforts, i.e. > > - Linux: Boot Time Speedups Through Precaching > > http://kerneltrap.org/node/2157 > > - Andrew Morton's kernel module solution > > http://www.zip.com.au/~akpm/linux/fboot.tar.gz > > - preload - adaptive readahead daemon > > http://sourceforge.net/projects/preload > > you missed the solution Fedora deploys since over a year using readahead Thanks, and sorry for more previous works that I failed to mention here :) Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060502080619.GA5406@mail.ustc.edu.cn> 2006-05-02 8:06 ` Wu Fengguang @ 2006-05-02 8:30 ` Arjan van de Ven [not found] ` <20060502085325.GA9190@mail.ustc.edu.cn> 2006-05-02 22:03 ` Dave Jones 1 sibling, 2 replies; 48+ messages in thread From: Arjan van de Ven @ 2006-05-02 8:30 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, 2006-05-02 at 16:06 +0800, Wu Fengguang wrote: > On Tue, May 02, 2006 at 09:58:44AM +0200, Arjan van de Ven wrote: > > > > > > > > PREVIOUS WORKS > > > > > > There has been some similar efforts, i.e. > > > - Linux: Boot Time Speedups Through Precaching > > > http://kerneltrap.org/node/2157 > > > - Andrew Morton's kernel module solution > > > http://www.zip.com.au/~akpm/linux/fboot.tar.gz > > > - preload - adaptive readahead daemon > > > http://sourceforge.net/projects/preload > > > > you missed the solution Fedora deploys since over a year using readahead > > Thanks, and sorry for more previous works that I failed to mention here :) one interesting thing that came out of the fedora readahead work is that most of the bootup isn't actually IO bound. My test machine for example can load all the data into ram in about 10 seconds, so that the rest of the boot is basically IO-less. But that still takes 2 minutes.... So I'm not entirely sure how much you can win by just attacking this. Another interesting approach would be to actually put all the data you want to use in a non-fragmented, sequential area on disk somehow (there is an OLS paper submitted about that by Ben) so that at least the disk side is seekless... ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060502085325.GA9190@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060502085325.GA9190@mail.ustc.edu.cn> @ 2006-05-02 8:53 ` Wu Fengguang 2006-05-06 6:49 ` Denis Vlasenko 2006-05-02 8:55 ` Arjan van de Ven 1 sibling, 1 reply; 48+ messages in thread From: Wu Fengguang @ 2006-05-02 8:53 UTC (permalink / raw) To: Arjan van de Ven Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote: > one interesting thing that came out of the fedora readahead work is that > most of the bootup isn't actually IO bound. My test machine for example > can load all the data into ram in about 10 seconds, so that the rest of > the boot is basically IO-less. But that still takes 2 minutes.... > So I'm not entirely sure how much you can win by just attacking this. Yes, I find it hard to improve the boot time of the init.d stage. However, it is perfectly ok to preload all GUI stuff during that timespan, by overlapping CPU/IO activities. > Another interesting approach would be to actually put all the data you > want to use in a non-fragmented, sequential area on disk somehow (there > is an OLS paper submitted about that by Ben) so that at least the disk > side is seekless... You are right; reducing seek distances alone does not help much. My fluxbox desktop requires nearly 3k seeks, which can be loaded in the 20s init.d booting time. But for KDE/GNOME desktops, some defragging would be necessary to fit them into the 20s time span. I found ext3 to be a rather good filesystem for supporting the poor man's defrag described by Chris Mason: http://www.gelato.unsw.edu.au/archives/git/0504/1690.html This certainly can help. Based on some ideas from andrea I made a poor man's defrag script last year that was similar. It worked by copying files into a flat dir in the order you expected to read them in, deleting the original, then hard linking them into their original name. 
Making the 'flat dir' a top-level dir would do the trick for ext3. We have a good chance of merging 10k seeks into 3k seeks with this trick :) Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
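The script Wu quotes can be sketched as follows: copy each file into one flat directory in expected read order (so the filesystem allocates the copies close together), remove the fragmented original, then hard-link the copy back under its old name. This is a reconstruction of the described approach, not the original script; directory and function names are illustrative:

```python
import os
import shutil


def poor_mans_defrag(paths, flatdir):
    """Re-create files contiguously in flatdir, relinked under their
    original names. `paths` should be in expected boot read order."""
    os.makedirs(flatdir, exist_ok=True)
    for i, path in enumerate(paths):
        copy = os.path.join(flatdir, "%06d" % i)
        shutil.copy2(path, copy)   # rewrite: new, hopefully adjacent blocks
        os.unlink(path)            # drop the fragmented original
        os.link(copy, path)        # same inode, old pathname restored
```

Afterwards both names refer to a single inode whose blocks were allocated in one sequential pass, so a boot-time sweep of the flat directory reads them with few seeks while normal pathname lookups still work.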
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 8:53 ` Wu Fengguang @ 2006-05-06 6:49 ` Denis Vlasenko 0 siblings, 0 replies; 48+ messages in thread From: Denis Vlasenko @ 2006-05-06 6:49 UTC (permalink / raw) To: Wu Fengguang Cc: Arjan van de Ven, linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tuesday 02 May 2006 11:53, Wu Fengguang wrote: > On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote: > > one interesting thing that came out of the fedora readahead work is that > > most of the bootup isn't actually IO bound. My test machine for example > > can load all the data into ram in about 10 seconds, so that the rest of > > the boot is basically IO-less. But that still takes 2 minutes.... > > So I'm not entirely sure how much you can win by just attacking this. > > Yes, I find it hard to improve the boot time of the init.d stage. Then _get rid of your init.d_. SysV init sequence is insane. I use daemontools-based setup and it starts up to login prompt in 5-10 secs (counting from the root fs mount). In my setup, after a few simple startup scripts a svscan process is called, which starts all services in parallel. Gettys are services, and start right away. -- vda ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060502085325.GA9190@mail.ustc.edu.cn> 2006-05-02 8:53 ` Wu Fengguang @ 2006-05-02 8:55 ` Arjan van de Ven 2006-05-02 11:39 ` Jan Engelhardt [not found] ` <20060502114853.GA9983@mail.ustc.edu.cn> 1 sibling, 2 replies; 48+ messages in thread From: Arjan van de Ven @ 2006-05-02 8:55 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, 2006-05-02 at 16:53 +0800, Wu Fengguang wrote: > On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote: > > one interesting thing that came out of the fedora readahead work is that > > most of the bootup isn't actually IO bound. My test machine for example > > can load all the data into ram in about 10 seconds, so that the rest of > > the boot is basically IO-less. But that still takes 2 minutes.... > > So I'm not entirely sure how much you can win by just attacking this. > > Yes, I find it hard to improve the boot time of the init.d stage. > However, it is perfectly ok to preload all GUI stuff during that > timespan, by overlapping CPU/IO activities. fwiw fedora even loads a bunch of GUI apps into memory already > > > Another interesting approach would be to actually put all the data you > > want to use in a non-fragmented, sequential area on disk somehow (there > > is an OLS paper submitted about that by Ben) so that at least the disk > > side is seekless... > > You are right; reducing seek distances alone does not help much. My fluxbox > desktop requires nearly 3k seeks, which can be loaded in the 20s init.d > booting time. But for KDE/GNOME desktops, some defragging would be > necessary to fit them into the 20s time span. or just move the blocks (or copy them) to a reserved area... ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 8:55 ` Arjan van de Ven @ 2006-05-02 11:39 ` Jan Engelhardt [not found] ` <20060502114853.GA9983@mail.ustc.edu.cn> 1 sibling, 0 replies; 48+ messages in thread From: Jan Engelhardt @ 2006-05-02 11:39 UTC (permalink / raw) To: Arjan van de Ven Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty >> > Another interesting approach would be to actually put all the data you >> > want to use in a non-fragmented, sequential area on disk somehow (there >> > is an OLS paper submitted about that by Ben) so that at least the disk >> > side is seekless... >> >> You are right, reducing seeking distances helps not much. My fluxbox >> desktop requires near 3k seeks, which can be loaded in the 20s init.d >> booting time. But for KDE/GNOME desktops, some defragging would be >> necessary to fit them into the 20s time span. > >or just move the blocks (or copy them) to a reserved area... Or even put it into a ramdisk, and then fix userspace. When Gnome loads more than 10000 files [http://kerneltrap.org/node/2157], it sounds like a design problem. Jan Engelhardt -- ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060502114853.GA9983@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching [not found] ` <20060502114853.GA9983@mail.ustc.edu.cn> @ 2006-05-02 11:48 ` Wu Fengguang 0 siblings, 0 replies; 48+ messages in thread From: Wu Fengguang @ 2006-05-02 11:48 UTC (permalink / raw) To: Arjan van de Ven Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, May 02, 2006 at 10:55:29AM +0200, Arjan van de Ven wrote: > > Yes, I find it hard to improve the boot time of the init.d stage. > > However, it is perfectly ok to preload all GUI stuff during that > > timespan, by overlapping CPU/IO activities. > > fwiw fedora even loads a bunch of GUI apps into memory already Hopefully the warm cache time has been improving. I have very good performance with fluxbox. Lubos Lunak said (http://wiki.kde.org/tiki-index.php?page=KDE%20Google%20SoC%202006%20ideas#id106304) that "KDE startup with cold caches is several times slower than with warm caches." The latest GNOME release also saw a good speedup. > > > Another interesting approach would be to actually put all the data you > > > want to use in a non-fragmented, sequential area on disk somehow (there > > > is an OLS paper submitted about that by Ben) so that at least the disk > > > side is seekless... > > > > You are right; reducing seek distances alone does not help much. My fluxbox > > desktop requires nearly 3k seeks, which can be loaded in the 20s init.d > > booting time. But for KDE/GNOME desktops, some defragging would be > > necessary to fit them into the 20s time span. > > or just move the blocks (or copy them) to a reserved area... Ah, we could save discontinuous pages into the journal file on unmounting and restore them on mounting. But let's do it the simple way first :) Wu ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching 2006-05-02 8:30 ` Arjan van de Ven [not found] ` <20060502085325.GA9190@mail.ustc.edu.cn> @ 2006-05-02 22:03 ` Dave Jones 1 sibling, 0 replies; 48+ messages in thread From: Dave Jones @ 2006-05-02 22:03 UTC (permalink / raw) To: Arjan van de Ven Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe, Nick Piggin, Badari Pulavarty On Tue, May 02, 2006 at 10:30:17AM +0200, Arjan van de Ven wrote: > one interesting thing that came out of the fedora readahead work is that > most of the bootup isn't actually IO bound. Here's another interesting datapoint. A huge proportion of I/O done during bootup is _completely_ unnecessary. Profiling just a few of the places that seemed to stall yielded some lovely things, like cupsd reading in and parsing descriptions for every printer ever known to man, even if there's no printer connected. Or my favorite.. hald reloading and reparsing the same XML hardware description files.. ~50 times, _each_. Utterly insane. (And these aren't Fedora specific problems btw, every distro shipping those bits will have the same dumb behaviour [modulo versions]) Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [RFC] kernel facilities for cache prefetching [not found] <20060502075049.GA5000@mail.ustc.edu.cn> 2006-05-02 7:50 ` [RFC] kernel facilities for cache prefetching Wu Fengguang 2006-05-02 7:58 ` Arjan van de Ven @ 2006-05-02 8:09 ` Jens Axboe [not found] ` <20060502082009.GA9038@mail.ustc.edu.cn> 2006-05-03 22:05 ` Benjamin LaHaise 2006-05-02 19:10 ` Pavel Machek 3 siblings, 2 replies; 48+ messages in thread From: Jens Axboe @ 2006-05-02 8:09 UTC (permalink / raw) To: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin, Badari Pulavarty On Tue, May 02 2006, Wu Fengguang wrote: > I'd like to know whether the two proposed kernel components would be > acceptable for the mainline kernel, and any recommends to improve them. I tried something very similar to this years ago, except I made it explicit instead of hiding it in the blk_run_backing_dev() which we didn't have at that time. My initial results showed that you would get a load of requests for different pages so would end up doing io randomly instead again. Your patch isn't exactly pretty, but that is fixable. I'm more interested in how you'd control that sanely. -- Jens Axboe ^ permalink raw reply [flat|nested] 48+ messages in thread
[parent not found: <20060502082009.GA9038@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching
       [not found] ` <20060502082009.GA9038@mail.ustc.edu.cn>
@ 2006-05-02  8:20 ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-02 8:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Nick Piggin,
	Badari Pulavarty

On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
 > I tried something very similar to this years ago, except I made it
 > explicit instead of hiding it in the blk_run_backing_dev() which we
 > didn't have at that time. My initial results showed that you would get a
 > load of requests for different pages so would end up doing io randomly
 > instead again.

Yes, the hard part would be _not_ to impact normal I/O in any way.
A simple solution for the case of the deadline scheduler would be:

- static const int read_expire = HZ / 2;   /* max time before a read is submitted. */
+ static const int read_expire = HZ / 2;   /* max time before an impending read is submitted. */
+ static const int reada_expire = HZ * 30; /* max time before a read-ahead is submitted. */

Wu
* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02  8:09 ` Jens Axboe
       [not found]   ` <20060502082009.GA9038@mail.ustc.edu.cn>
@ 2006-05-03 22:05   ` Benjamin LaHaise
  1 sibling, 0 replies; 48+ messages in thread
From: Benjamin LaHaise @ 2006-05-03 22:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Nick Piggin, Badari Pulavarty

On Tue, May 02, 2006 at 10:09:36AM +0200, Jens Axboe wrote:
 > Your patch isn't exactly pretty, but that is fixable. I'm more
 > interested in how you'd control that sanely.

I have a quick and dirty hack called BootCache (only afterwards did I
notice that Apple used the same name) that snoops all the read requests
during startup and uses them to populate a file, which gets read in to
populate the cache on subsequent boots. The results are impressive:
eliminating the seeks (since the file is contiguous) cuts 15s off the
45s from init starting to the login prompt on ext3. Prefetch without
reordering isn't particularly effective, as the cost of the seeks to
read in the inodes and file data tends to block critical-path I/O in
the startup scripts.

I hope to release the code sometime soon and rework it to be a better
fit for merging with the kernel -- it should be possible to use the
block I/O tracing facility to log the blocks required and then feed
that back into the filesystem to reallocate blocks. That would
eliminate the messy part of BootCache that has to snoop block writes to
maintain the cache. This is all part of what I'm aiming to present at
OLS.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.
* Re: [RFC] kernel facilities for cache prefetching
       [not found] <20060502075049.GA5000@mail.ustc.edu.cn>
                   ` (2 preceding siblings ...)
  2006-05-02  8:09 ` Jens Axboe
@ 2006-05-02 19:10 ` Pavel Machek
  2006-05-02 23:36   ` Nigel Cunningham
                     ` (3 more replies)
  3 siblings, 4 replies; 48+ messages in thread
From: Pavel Machek @ 2006-05-02 19:10 UTC (permalink / raw)
  To: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Nigel Cunningham,
	Rafael J. Wysocki

Hi!

 > SOC PROPOSAL
 >
 > Rapid linux desktop startup through pre-caching

Looks nice to me...

 > SCHEME/GOAL
 > 2) kernel module to query the file cache
 >    - on loading:  create /proc/filecache
 >    - setting up:  echo "param value" > /proc/filecache
 >    - info query:  cat /proc/filecache
 >    - sample sessions:
 >
 > 	# modprobe filecache
 > 	$ cat /proc/filecache
 > 	# file ALL
 > 	# mask 0
 > 	#
 > 	# ino	size	cached	status	refcnt	name
 > 	0	2929846	3181	D	1	/dev/hda1
 > 	81455	9	9	_	1	/sbin/init
 > 	......
 >
 > 	$ echo "file /dev/hda1" > /proc/filecache
 > 	$ cat /proc/filecache
 > 	# file /dev/hda1
 > 	# mask 0
 > 	#
 > 	# idx	len
 > 	0	24
 > 	48	2
 > 	53	5
 > 	......

Could we use this instead of blockdev freezing / big suspend image
support? It should permit us to resume quickly (with a small image), and
then do readahead... that would give us a usable machine quickly, and
still a very responsive desktop after resume?

								Pavel
-- 
Thanks for all the (sleeping) penguins.
* Re: [RFC] kernel facilities for cache prefetching
  2006-05-02 19:10 ` Pavel Machek
@ 2006-05-02 23:36   ` Nigel Cunningham
       [not found]     ` <20060503023505.GB5915@mail.ustc.edu.cn>
       [not found]   ` <20060503023223.GA5915@mail.ustc.edu.cn>
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 48+ messages in thread
From: Nigel Cunningham @ 2006-05-02 23:36 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Wu Fengguang, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Rafael J. Wysocki

Hi.

On Wednesday 03 May 2006 05:10, Pavel Machek wrote:
 > Could we use this instead of blockdev freezing/big suspend image
 > support? It should permit us to resume quickly (with small image), and
 > then do readahead. ... that will give us usable machine quickly, still
 > very responsive desktop after resume?

Unrelated to bdev freezing, and it will involve far more seeking than
reading a contiguous image (as they normally are).

Regards,

Nigel
-- 
See our web page for Howtos, FAQs, the Wiki and mailing list info.
http://www.suspend2.net  IRC: #suspend2 on Freenode
[parent not found: <20060503023505.GB5915@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching
       [not found] ` <20060503023505.GB5915@mail.ustc.edu.cn>
@ 2006-05-03  2:35 ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03 2:35 UTC (permalink / raw)
  To: Nigel Cunningham
  Cc: Pavel Machek, linux-kernel, Linus Torvalds, Andrew Morton,
	Jens Axboe, Nick Piggin, Badari Pulavarty, Rafael J. Wysocki

On Wed, May 03, 2006 at 09:36:12AM +1000, Nigel Cunningham wrote:
 > > Could we use this instead of blockdev freezing/big suspend image
 > > support? It should permit us to resume quickly (with small image), and
 > > then do readahead. ... that will give us usable machine quickly, still
 > > very responsive desktop after resume?
 >
 > Unrelated to bdev freezing, and will involve far more seeking than
 > reading a contiguous image (as they normally are).

Yes, this issue needs more pondering...
[parent not found: <20060503023223.GA5915@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching
       [not found] ` <20060503023223.GA5915@mail.ustc.edu.cn>
@ 2006-05-03  2:32 ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03 2:32 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham, Rafael J. Wysocki

Hi Pavel,

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
 > > SOC PROPOSAL
 > >
 > > Rapid linux desktop startup through pre-caching
 >
 > Looks nice me...

Thanks.

 > > SCHEME/GOAL
 > > 2) kernel module to query the file cache
 > >    - on loading:  create /proc/filecache
 > >    - setting up:  echo "param value" > /proc/filecache
 > >    - info query:  cat /proc/filecache
 > >    - sample sessions:
 > >
 > > 	# modprobe filecache
 > > 	$ cat /proc/filecache
 > > 	# file ALL
 > > 	# mask 0
 > > 	#
 > > 	# ino	size	cached	status	refcnt	name
 > > 	0	2929846	3181	D	1	/dev/hda1
 > > 	81455	9	9	_	1	/sbin/init
 > > 	......
 > >
 > > 	$ echo "file /dev/hda1" > /proc/filecache
 > > 	$ cat /proc/filecache
 > > 	# file /dev/hda1
 > > 	# mask 0
 > > 	#
 > > 	# idx	len
 > > 	0	24
 > > 	48	2
 > > 	53	5
 > > 	......
 >
 > Could we use this instead of blockdev freezing/big suspend image
 > support? It should permit us to resume quickly (with small image), and
 > then do readahead. ... that will give us usable machine quickly, still
 > very responsive desktop after resume?

Sure. Indeed I have considered the possibility of integrating my work
with the foresighted userland-swsusp idea. The added value of this
approach is that user-land tools can do all kinds of smart processing of
the contents, and present users with different policies.

Another imagined user of /proc/filecache is system admins :)

Wu
[parent not found: <20060503071948.GD4781@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching
       [not found] ` <20060503071948.GD4781@mail.ustc.edu.cn>
@ 2006-05-03  7:19 ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-03 7:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham, Rafael J. Wysocki

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
 > Could we use this instead of blockdev freezing/big suspend image
 > support? It should permit us to resume quickly (with small image), and
 > then do readahead. ... that will give us usable machine quickly, still
 > very responsive desktop after resume?

It seems that it can help in reducing the image size: write only the
small ranges of file pages to the suspend image (maybe 80MB = 10k ranges
* 8k average size), and let the prefetcher restore the other, larger
chunks of code/data, depending on user-specified policies.

Wu
[parent not found: <20060504122830.GA6205@mail.ustc.edu.cn>]
* Re: [RFC] kernel facilities for cache prefetching
       [not found] ` <20060504122830.GA6205@mail.ustc.edu.cn>
@ 2006-05-04 12:28 ` Wu Fengguang
  0 siblings, 0 replies; 48+ messages in thread
From: Wu Fengguang @ 2006-05-04 12:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Jens Axboe,
	Nick Piggin, Badari Pulavarty, Nigel Cunningham, Rafael J. Wysocki

On Tue, May 02, 2006 at 09:10:01PM +0200, Pavel Machek wrote:
 > > 	$ echo "file /dev/hda1" > /proc/filecache
 > > 	$ cat /proc/filecache
 > > 	# file /dev/hda1
 > > 	# mask 0
 > > 	#
 > > 	# idx	len
 > > 	0	24
 > > 	48	2
 > > 	53	5
 > > 	......
 >
 > Could we use this instead of blockdev freezing/big suspend image
 > support? It should permit us to resume quickly (with small image), and
 > then do readahead. ... that will give us usable machine quickly, still
 > very responsive desktop after resume?

Badari's use case inspired me: on suspension we can
- first invoke a user-land tool to do all the cache
  status-logging/analyzing/selective-dropping jobs;
- then let the kernel write all the remaining caches (made up of many
  small chunks) to the suspend image.
And do the reverse on restoring. With that we move all the strategies
to userspace.

Wu
end of thread, other threads:[~2006-05-06 6:50 UTC | newest]
Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20060502075049.GA5000@mail.ustc.edu.cn>
2006-05-02 7:50 ` [RFC] kernel facilities for cache prefetching Wu Fengguang
2006-05-02 12:46 ` Diego Calleja
[not found] ` <20060502144203.GA10594@mail.ustc.edu.cn>
2006-05-02 14:42 ` Wu Fengguang
2006-05-02 16:07 ` Diego Calleja
[not found] ` <20060503064503.GA4781@mail.ustc.edu.cn>
2006-05-03 6:45 ` Wu Fengguang
2006-05-03 18:14 ` Diego Calleja
2006-05-03 23:39 ` Zan Lynx
2006-05-04 1:37 ` Diego Calleja
2006-05-02 15:55 ` Linus Torvalds
2006-05-02 16:35 ` Andi Kleen
[not found] ` <20060503041106.GC5915@mail.ustc.edu.cn>
2006-05-03 4:11 ` Wu Fengguang
2006-05-03 17:28 ` Badari Pulavarty
[not found] ` <346733486.30800@ustc.edu.cn>
2006-05-04 15:03 ` Linus Torvalds
2006-05-04 16:57 ` Badari Pulavarty
[not found] ` <20060505144451.GA6134@mail.ustc.edu.cn>
2006-05-05 14:44 ` Wu Fengguang
[not found] ` <20060503071325.GC4781@mail.ustc.edu.cn>
2006-05-03 7:13 ` Wu Fengguang
2006-05-03 12:59 ` Nikita Danilov
2006-05-03 22:20 ` Rik van Riel
[not found] ` <20060506011125.GA9099@mail.ustc.edu.cn>
2006-05-06 1:11 ` Wu Fengguang
2006-05-04 0:28 ` Linda Walsh
2006-05-04 1:31 ` Linus Torvalds
2006-05-04 7:08 ` Ph. Marek
2006-05-04 7:33 ` Arjan van de Ven
[not found] ` <20060504121454.GB6008@mail.ustc.edu.cn>
2006-05-04 12:14 ` Wu Fengguang
2006-05-04 12:34 ` Arjan van de Ven
2006-05-03 21:45 ` Linda Walsh
[not found] ` <20060504121212.GA6008@mail.ustc.edu.cn>
2006-05-04 12:12 ` Wu Fengguang
2006-05-04 18:57 ` Linda Walsh
[not found] ` <20060505152007.GB6134@mail.ustc.edu.cn>
2006-05-05 15:20 ` Wu Fengguang
2006-05-04 9:02 ` Helge Hafting
2006-05-02 7:58 ` Arjan van de Ven
[not found] ` <20060502080619.GA5406@mail.ustc.edu.cn>
2006-05-02 8:06 ` Wu Fengguang
2006-05-02 8:30 ` Arjan van de Ven
[not found] ` <20060502085325.GA9190@mail.ustc.edu.cn>
2006-05-02 8:53 ` Wu Fengguang
2006-05-06 6:49 ` Denis Vlasenko
2006-05-02 8:55 ` Arjan van de Ven
2006-05-02 11:39 ` Jan Engelhardt
[not found] ` <20060502114853.GA9983@mail.ustc.edu.cn>
2006-05-02 11:48 ` Wu Fengguang
2006-05-02 22:03 ` Dave Jones
2006-05-02 8:09 ` Jens Axboe
[not found] ` <20060502082009.GA9038@mail.ustc.edu.cn>
2006-05-02 8:20 ` Wu Fengguang
2006-05-03 22:05 ` Benjamin LaHaise
2006-05-02 19:10 ` Pavel Machek
2006-05-02 23:36 ` Nigel Cunningham
[not found] ` <20060503023505.GB5915@mail.ustc.edu.cn>
2006-05-03 2:35 ` Wu Fengguang
[not found] ` <20060503023223.GA5915@mail.ustc.edu.cn>
2006-05-03 2:32 ` Wu Fengguang
[not found] ` <20060503071948.GD4781@mail.ustc.edu.cn>
2006-05-03 7:19 ` Wu Fengguang
[not found] ` <20060504122830.GA6205@mail.ustc.edu.cn>
2006-05-04 12:28 ` Wu Fengguang