* [PATCH 0/3] readahead drop behind and size adjustment
@ 2007-07-21 21:00 Peter Zijlstra
2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw)
To: linux-kernel
Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper,
Chris Snook, Peter Zijlstra
The various readahead bits I have lying about.
Wu, would you agree with asking Andrew to stick these behind your latest
readahead series?
^ permalink raw reply [flat|nested] 46+ messages in thread* [PATCH 1/3] readahead: drop behind 2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra @ 2007-07-21 21:00 ` Peter Zijlstra 2007-07-21 20:29 ` Eric St-Laurent 2007-07-25 3:55 ` Eric St-Laurent 2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 2 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw) To: linux-kernel Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook, Peter Zijlstra [-- Attachment #1: readahead-useonce.patch --] [-- Type: text/plain, Size: 5776 bytes --] Use the read-ahead code to provide hints to page reclaim. This patch has the potential to solve the streaming-IO trashes my desktop problem. It tries to aggressively reclaim pages that were loaded in a strong sequential pattern and have been consumed. Thereby limiting the damage to the current resident set. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/swap.h | 1 + mm/readahead.c | 39 ++++++++++++++++++++++++++++++++++++++- mm/swap.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 90 insertions(+), 1 deletion(-) Index: linux-2.6/mm/swap.c =================================================================== --- linux-2.6.orig/mm/swap.c +++ linux-2.6/mm/swap.c @@ -30,6 +30,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/init.h> +#include <linux/rmap.h> /* How many pages do we try to swap or page in/out together? 
*/ int page_cluster; @@ -176,6 +177,7 @@ EXPORT_SYMBOL(mark_page_accessed); */ static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, }; static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, }; +static DEFINE_PER_CPU(struct pagevec, lru_demote_pvecs) = { 0, }; void fastcall lru_cache_add(struct page *page) { @@ -197,6 +199,37 @@ void fastcall lru_cache_add_active(struc put_cpu_var(lru_add_active_pvecs); } +static void __pagevec_lru_demote(struct pagevec *pvec) +{ + int i; + struct zone *zone = NULL; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct zone *pagezone = page_zone(page); + + if (pagezone != zone) { + if (zone) + spin_unlock_irq(&zone->lru_lock); + zone = pagezone; + spin_lock_irq(&zone->lru_lock); + } + if (PageLRU(page)) { + page_referenced(page, 0); + if (PageActive(page)) { + ClearPageActive(page); + __dec_zone_state(zone, NR_ACTIVE); + __inc_zone_state(zone, NR_INACTIVE); + } + list_move_tail(&page->lru, &zone->inactive_list); + } + } + if (zone) + spin_unlock_irq(&zone->lru_lock); + release_pages(pvec->pages, pvec->nr, pvec->cold); + pagevec_reinit(pvec); +} + static void __lru_add_drain(int cpu) { struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu); @@ -207,6 +240,9 @@ static void __lru_add_drain(int cpu) pvec = &per_cpu(lru_add_active_pvecs, cpu); if (pagevec_count(pvec)) __pagevec_lru_add_active(pvec); + pvec = &per_cpu(lru_demote_pvecs, cpu); + if (pagevec_count(pvec)) + __pagevec_lru_demote(pvec); } void lru_add_drain(void) @@ -403,6 +439,21 @@ void __pagevec_lru_add_active(struct pag } /* + * Function used to forcefully demote a page to the tail of the inactive + * list. 
+ */ +void fastcall lru_demote(struct page *page) +{ + if (likely(get_page_unless_zero(page))) { + struct pagevec *pvec = &get_cpu_var(lru_demote_pvecs); + + if (!pagevec_add(pvec, page)) + __pagevec_lru_demote(pvec); + put_cpu_var(lru_demote_pvecs); + } +} + +/* * Try to drop buffers from the pages in a pagevec */ void pagevec_strip(struct pagevec *pvec) Index: linux-2.6/include/linux/swap.h =================================================================== --- linux-2.6.orig/include/linux/swap.h +++ linux-2.6/include/linux/swap.h @@ -180,6 +180,7 @@ extern unsigned int nr_free_pagecache_pa /* linux/mm/swap.c */ extern void FASTCALL(lru_cache_add(struct page *)); extern void FASTCALL(lru_cache_add_active(struct page *)); +extern void FASTCALL(lru_demote(struct page *)); extern void FASTCALL(activate_page(struct page *)); extern void FASTCALL(mark_page_accessed(struct page *)); extern void lru_add_drain(void); Index: linux-2.6/mm/readahead.c =================================================================== --- linux-2.6.orig/mm/readahead.c +++ linux-2.6/mm/readahead.c @@ -15,6 +15,7 @@ #include <linux/backing-dev.h> #include <linux/task_io_accounting_ops.h> #include <linux/pagevec.h> +#include <linux/swap.h> void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page) { @@ -448,13 +449,19 @@ EXPORT_SYMBOL_GPL(page_cache_sync_readah * page_cache_async_ondemand() should be called when a page is used which * has the PG_readahead flag: this is a marker to suggest that the application * has used up enough of the readahead window that we should start pulling in - * more pages. */ + * more pages. 
+ */ void page_cache_async_readahead(struct address_space *mapping, struct file_ra_state *ra, struct file *filp, struct page *page, pgoff_t offset, unsigned long req_size) { + pgoff_t demote_idx = offset - min_t(pgoff_t, offset, ra->size); + struct page *pages[16]; + unsigned nr_pages; + unsigned i; + /* no read-ahead */ if (!ra->ra_pages) return; @@ -473,6 +480,36 @@ page_cache_async_readahead(struct addres if (bdi_read_congested(mapping->backing_dev_info)) return; + /* + * Read-ahead use once: when the ra window is maximal this is a good + * hint that there is sequential IO, which implies that the pages that + * have been used thus far can be reclaimed + */ + if (ra->size == ra->ra_pages) do { + nr_pages = find_get_pages(mapping, + demote_idx, ARRAY_SIZE(pages), pages); + + for (i = 0; i < nr_pages; i++) { + page = pages[i]; + demote_idx = page_index(page); + + /* + * The page is active. This means there are other + * users. We should not take away somebody else's + * pages, so do not drop behind beyond this point. + */ + if (demote_idx < offset && !PageActive(page)) { + lru_demote(page); + } else { + demote_idx = offset; + break; + } + } + demote_idx++; + + release_pages(pages, nr_pages, 0); + } while (demote_idx < offset); + /* do read-ahead */ ondemand_readahead(mapping, ra, filp, true, offset, req_size); } -- ^ permalink raw reply [flat|nested] 46+ messages in thread
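The drop-behind scan in the patch above can be illustrated with a small userspace sketch. The helper names here are hypothetical; in the kernel the loop walks real pagecache pages via find_get_pages(), while here a callback stands in for the PageActive() test:

```c
/* Start of the drop-behind window: one readahead window behind the
 * current read position, clamped at the start of the file. */
static unsigned long demote_start(unsigned long offset, unsigned long ra_size)
{
    return offset - (offset < ra_size ? offset : ra_size);
}

/* Walk [demote_start, offset) and count the pages we would demote,
 * stopping at the first "active" page just as the patch does; the
 * page_active callback stands in for PageActive(). */
static unsigned long drop_behind(unsigned long offset, unsigned long ra_size,
                                 int (*page_active)(unsigned long))
{
    unsigned long demoted = 0;
    unsigned long idx;

    for (idx = demote_start(offset, ra_size); idx < offset; idx++) {
        if (page_active(idx))
            break;  /* other users own this page: stop dropping behind */
        demoted++;
    }
    return demoted;
}

/* sample predicates for experimenting with the scan */
static int never_active(unsigned long idx) { (void)idx; return 0; }
static int active_at_70(unsigned long idx) { return idx == 70; }
```

With never_active the whole window behind the read position is demoted; an active page anywhere in the window cuts the scan short, which is exactly the "do not take away somebody else's pages" rule in the patch.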
* Re: [PATCH 1/3] readahead: drop behind
From: Eric St-Laurent @ 2007-07-21 20:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook

On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (readahead-useonce.patch)
> Use the read-ahead code to provide hints to page reclaim.
>
> This patch has the potential to solve the streaming-IO trashes my
> desktop problem.
>
> It tries to aggressively reclaim pages that were loaded in a strong
> sequential pattern and have been consumed. Thereby limiting the damage
> to the current resident set.
>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the fadvise change, it looks like the right solution to me.

Which kernel are the patches against? They don't apply cleanly to
2.6.22.1.

It would be useful to have a temporary /proc tunable to enable/disable
the heuristic to help test the effects.

- Eric
* Re: [PATCH 1/3] readahead: drop behind 2007-07-21 20:29 ` Eric St-Laurent @ 2007-07-21 20:37 ` Peter Zijlstra 2007-07-21 20:59 ` Eric St-Laurent 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-07-21 20:37 UTC (permalink / raw) To: Eric St-Laurent Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sat, 2007-07-21 at 16:29 -0400, Eric St-Laurent wrote: > On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote: > > plain text document attachment (readahead-useonce.patch) > > Use the read-ahead code to provide hints to page reclaim. > > > > This patch has the potential to solve the streaming-IO trashes my > > desktop problem. > > > > It tries to aggressively reclaim pages that were loaded in a strong > > sequential pattern and have been consumed. Thereby limiting the damage > > to the current resident set. > > > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > With the fadvise change, it looks like the right solution to me. > > The patches are for which kernel? They doesn't apply cleanly to > 2.6.22.1. They are against git of a few hours ago and the latest readahead patches from Wu (which don't apply cleanly either, but the rejects are trivial). > It would be useful to have a temporary /proc tunable to enable/disable > the heuristic to help test the effects. Right, I had such a patch somewhere,.. won't apply cleanly but should be obvious.. 
--- kernel/sysctl.c | 9 +++++++++ mm/readahead.c | 6 ++++-- 2 files changed, 13 insertions(+), 2 deletions(-) Index: linux-2.6/kernel/sysctl.c =================================================================== --- linux-2.6.orig/kernel/sysctl.c +++ linux-2.6/kernel/sysctl.c @@ -168,6 +168,7 @@ extern int dirty_ratio_handler(ctl_table extern int prove_locking; extern int lock_stat; +extern int sysctl_dropbehind; /* The default sysctl tables: */ @@ -1075,6 +1076,14 @@ static ctl_table vm_table[] = { .extra1 = &zero, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "drop_behind", + .data = &sysctl_dropbehind, + .maxlen = sizeof(sysctl_dropbehind), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt Index: linux-2.6/mm/readahead.c =================================================================== --- linux-2.6.orig/mm/readahead.c +++ linux-2.6/mm/readahead.c @@ -429,6 +429,8 @@ void page_cache_sync_readahead(struct ad } EXPORT_SYMBOL_GPL(page_cache_sync_readahead); +int sysctl_dropbehind = 2; + /** * page_cache_async_readahead - file readahead for marked pages * @mapping: address_space which holds the pagecache and I/O vectors @@ -473,7 +475,7 @@ page_cache_async_readahead(struct addres * hint that there is sequential IO, which implies that the pages that * have been used thus far can be reclaimed */ - if (ra->size == ra->ra_pages) { + if (sysctl_dropbehind && ra->size == ra->ra_pages) { unsigned long demote_idx = offset - min(offset, ra->size); read_lock_irq(&mapping->tree_lock); @@ -489,7 +491,7 @@ page_cache_async_readahead(struct addres * users. We should not take away somebody else's * pages, so do not drop behind beyond this point. */ - if (PageActive(page)) + if (sysctl_dropbehind > 1 && PageActive(page)) break; lru_demote(page); ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] readahead: drop behind
From: Eric St-Laurent @ 2007-07-21 20:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook

> They are against git of a few hours ago and the latest readahead patches
> from Wu (which don't apply cleanly either, but the rejects are trivial).
>
> > It would be useful to have a temporary /proc tunable to enable/disable
> > the heuristic to help test the effects.
>
> Right, I had such a patch somewhere,.. won't apply cleanly but should be
> obvious..

Thanks, I will merge these and report back with some results.

After copying large files, I find my system sluggish. I hope your
changes will help.

- Eric
* Re: [PATCH 1/3] readahead: drop behind 2007-07-21 20:59 ` Eric St-Laurent @ 2007-07-21 21:06 ` Peter Zijlstra 0 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-07-21 21:06 UTC (permalink / raw) To: Eric St-Laurent Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sat, 2007-07-21 at 16:59 -0400, Eric St-Laurent wrote: > > They are against git of a few hours ago and the latest readahead patches > > from Wu (which don't apply cleanly either, but the rejects are trivial). > > > > > It would be useful to have a temporary /proc tunable to enable/disable > > > the heuristic to help test the effects. > > > > Right, I had such a patch somewhere,.. won't apply cleanly but should be > > obvious.. > > Thanks, I will merge theses and report back with some results. > > After copying large files, I find my system sluggish. I hope your > changes will help. Yeah, copying will still hurt. This will only help reading large files, not writing :-/ For that I still need an inspiring idea. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 1/3] readahead: drop behind 2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra 2007-07-21 20:29 ` Eric St-Laurent @ 2007-07-25 3:55 ` Eric St-Laurent 1 sibling, 0 replies; 46+ messages in thread From: Eric St-Laurent @ 2007-07-25 3:55 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook, Nick Piggin [-- Attachment #1: Type: text/plain, Size: 3307 bytes --] On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote: > Use the read-ahead code to provide hints to page reclaim. > > This patch has the potential to solve the streaming-IO trashes my > desktop problem. > > It tries to aggressively reclaim pages that were loaded in a strong > sequential pattern and have been consumed. Thereby limiting the damage > to the current resident set. > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> (sorry for the delay) Ok, I've done some tests with your patches, I came up with a test program that should approximate my use case. It simply mmap() and scan (read) a 375M file which represent the usual used memory on my desktop system. This data is frequently used, and should stay cached as much as possible in preference over the "used once" data read in the page cache when copying large files. I don't claim that the test program is perfect or even correct, I'm open for suggestions. Test system: - Linux x86_64 2.6.23-rc1 - 1G of RAM - I use the basic drop behind and sysctl patches. The readahead size patch is _not_ included. 
Setting up:

  dd if=/dev/zero of=/tmp/375M_file bs=1M count=375
  dd if=/dev/zero of=/tmp/5G_file bs=1M count=5120

Tests with stock kernel (drop behind disabled):

  echo 0 >/proc/sys/vm/drop_behind

Base test:

  sync; echo 1 >/proc/sys/vm/drop_caches
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.146s
2nd execution: 0m1.119s
3rd execution: 0m1.109s
4th execution: 0m1.105s

Reading a large file test:

  sync; echo 1 >/proc/sys/vm/drop_caches
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file
  cp /tmp/5G_file /dev/null
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.224s
2nd execution: 0m1.114s
3rd execution: 0m7.178s  <<< Much slower
4th execution: 0m1.115s

Copying (read+write) a large file test:

  sync; echo 1 >/proc/sys/vm/drop_caches
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file
  cp /tmp/5G_file /tmp/copy_of_5G_file
  time ./large_app_load_simul /tmp/375M_file
  time ./large_app_load_simul /tmp/375M_file
  rm /tmp/copy_of_5G_file

1st execution: 0m7.203s
2nd execution: 0m1.147s
3rd execution: 0m7.238s  <<< Much slower
4th execution: 0m1.129s

Tests with drop behind enabled:

  echo 1 >/proc/sys/vm/drop_behind

Base test: [same tests as above]

1st execution: 0m7.206s
2nd execution: 0m1.110s
3rd execution: 0m1.102s
4th execution: 0m1.106s

Reading a large file test: [same tests as above]

1st execution: 0m7.197s
2nd execution: 0m1.116s
3rd execution: 0m1.114s  <<< Great!!!
4th execution: 0m1.111s

Copying (read+write) a large file test: [same tests as above]

1st execution: 0m7.186s
2nd execution: 0m1.111s
3rd execution: 0m7.339s  <<< Not fixed
4th execution: 0m1.121s

Conclusion:

- The drop-behind patch works and really prevents the page cache content
  from being filled with useless read-once data.
- It doesn't help the copy (read+write) case. This should also be fixed, as it's a common workload. Tested-By: Eric St-Laurent (ericstl34@xxxxxxxxx.xx) Best regards, - Eric (*) Test program and batch file are attached. [-- Attachment #2: drop-behind+sysctl.patch --] [-- Type: text/x-patch, Size: 5642 bytes --] diff -urN linux-2.6/include/linux/swap.h linux-2.6-drop-behind/include/linux/swap.h --- linux-2.6/include/linux/swap.h 2007-07-21 18:26:00.000000000 -0400 +++ linux-2.6-drop-behind/include/linux/swap.h 2007-07-22 16:22:48.000000000 -0400 @@ -180,6 +180,7 @@ /* linux/mm/swap.c */ extern void FASTCALL(lru_cache_add(struct page *)); extern void FASTCALL(lru_cache_add_active(struct page *)); +extern void FASTCALL(lru_demote(struct page *)); extern void FASTCALL(activate_page(struct page *)); extern void FASTCALL(mark_page_accessed(struct page *)); extern void lru_add_drain(void); diff -urN linux-2.6/kernel/sysctl.c linux-2.6-drop-behind/kernel/sysctl.c --- linux-2.6/kernel/sysctl.c 2007-07-21 18:26:01.000000000 -0400 +++ linux-2.6-drop-behind/kernel/sysctl.c 2007-07-22 16:20:27.000000000 -0400 @@ -163,6 +163,7 @@ extern int prove_locking; extern int lock_stat; +extern int sysctl_dropbehind; /* The default sysctl tables: */ @@ -1048,6 +1049,14 @@ .extra1 = &zero, }, #endif + { + .ctl_name = CTL_UNNUMBERED, + .procname = "drop_behind", + .data = &sysctl_dropbehind, + .maxlen = sizeof(sysctl_dropbehind), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, /* * NOTE: do not add new entries to this table unless you have read * Documentation/sysctl/ctl_unnumbered.txt diff -urN linux-2.6/mm/readahead.c linux-2.6-drop-behind/mm/readahead.c --- linux-2.6/mm/readahead.c 2007-07-21 18:26:01.000000000 -0400 +++ linux-2.6-drop-behind/mm/readahead.c 2007-07-22 16:41:47.000000000 -0400 @@ -15,6 +15,7 @@ #include <linux/backing-dev.h> #include <linux/task_io_accounting_ops.h> #include <linux/pagevec.h> +#include <linux/swap.h> void default_unplug_io_fn(struct backing_dev_info 
*bdi, struct page *page) { @@ -429,6 +430,8 @@ } EXPORT_SYMBOL_GPL(page_cache_sync_readahead); +int sysctl_dropbehind = 1; + /** * page_cache_async_readahead - file readahead for marked pages * @mapping: address_space which holds the pagecache and I/O vectors @@ -449,6 +452,11 @@ struct page *page, pgoff_t offset, unsigned long req_size) { + pgoff_t demote_idx = offset - min_t(pgoff_t, offset, ra->size); + struct page *pages[16]; + unsigned nr_pages; + unsigned i; + /* no read-ahead */ if (!ra->ra_pages) return; @@ -467,6 +475,36 @@ if (bdi_read_congested(mapping->backing_dev_info)) return; + /* + * Read-ahead use once: when the ra window is maximal this is a good + * hint that there is sequential IO, which implies that the pages that + * have been used thus far can be reclaimed + */ + if (sysctl_dropbehind && ra->size == ra->ra_pages) { + nr_pages = find_get_pages(mapping, + demote_idx, ARRAY_SIZE(pages), pages); + + for (i = 0; i < nr_pages; i++) { + page = pages[i]; + demote_idx = page_index(page); + + /* + * The page is active. This means there are other + * users. We should not take away somebody else's + * pages, so do not drop behind beyond this point. + */ + if (demote_idx < offset && !PageActive(page)) { + lru_demote(page); + } else { + demote_idx = offset; + break; + } + } + demote_idx++; + + release_pages(pages, nr_pages, 0); + } while (demote_idx < offset); + /* do read-ahead */ ondemand_readahead(mapping, ra, filp, true, offset, req_size); } diff -urN linux-2.6/mm/swap.c linux-2.6-drop-behind/mm/swap.c --- linux-2.6/mm/swap.c 2007-07-21 18:26:01.000000000 -0400 +++ linux-2.6-drop-behind/mm/swap.c 2007-07-22 16:26:35.000000000 -0400 @@ -30,6 +30,7 @@ #include <linux/cpu.h> #include <linux/notifier.h> #include <linux/init.h> +#include <linux/rmap.h> /* How many pages do we try to swap or page in/out together? 
*/ int page_cluster; @@ -176,6 +177,7 @@ */ static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, }; static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, }; +static DEFINE_PER_CPU(struct pagevec, lru_demote_pvecs) = { 0, }; void fastcall lru_cache_add(struct page *page) { @@ -197,6 +199,37 @@ put_cpu_var(lru_add_active_pvecs); } +static void __pagevec_lru_demote(struct pagevec *pvec) +{ + int i; + struct zone *zone = NULL; + + for (i = 0; i < pagevec_count(pvec); i++) { + struct page *page = pvec->pages[i]; + struct zone *pagezone = page_zone(page); + + if (pagezone != zone) { + if (zone) + spin_unlock_irq(&zone->lru_lock); + zone = pagezone; + spin_lock_irq(&zone->lru_lock); + } + if (PageLRU(page)) { + page_referenced(page, 0); + if (PageActive(page)) { + ClearPageActive(page); + __dec_zone_state(zone, NR_ACTIVE); + __inc_zone_state(zone, NR_INACTIVE); + } + list_move_tail(&page->lru, &zone->inactive_list); + } + } + if (zone) + spin_unlock_irq(&zone->lru_lock); + release_pages(pvec->pages, pvec->nr, pvec->cold); + pagevec_reinit(pvec); +} + static void __lru_add_drain(int cpu) { struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu); @@ -207,6 +240,9 @@ pvec = &per_cpu(lru_add_active_pvecs, cpu); if (pagevec_count(pvec)) __pagevec_lru_add_active(pvec); + pvec = &per_cpu(lru_demote_pvecs, cpu); + if (pagevec_count(pvec)) + __pagevec_lru_demote(pvec); } void lru_add_drain(void) @@ -403,6 +439,21 @@ } /* + * Function used to forcefully demote a page to the tail of the inactive + * list. 
+ */ +void fastcall lru_demote(struct page *page) +{ + if (likely(get_page_unless_zero(page))) { + struct pagevec *pvec = &get_cpu_var(lru_demote_pvecs); + + if (!pagevec_add(pvec, page)) + __pagevec_lru_demote(pvec); + put_cpu_var(lru_demote_pvecs); + } +} + +/* * Try to drop buffers from the pages in a pagevec */ void pagevec_strip(struct pagevec *pvec) [-- Attachment #3: large_app_load_simul.cpp --] [-- Type: text/x-c++src, Size: 459 bytes --] #include <stdio.h> #include <stdlib.h> #include <assert.h> #include <sys/mman.h> int main(int argc, char *argv[]) { if (argc != 2) return EXIT_FAILURE; FILE *file = fopen(argv[1], "rb"); fseek(file, 0, SEEK_END); long size = ftell(file); char *buf = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fileno(file), 0); for (long i = 0; i < size; i++) char temp = buf[i]; int result = munmap(buf, size); fclose(file); return EXIT_SUCCESS; } [-- Attachment #4: Makefile --] [-- Type: text/x-makefile, Size: 69 bytes --] all: gcc large_app_load_simul.cpp -o large_app_load_simul -lstdc++ [-- Attachment #5: useonce-test.sh --] [-- Type: application/x-shellscript, Size: 1737 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
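The lru_demote()/__pagevec_lru_demote() pair in the patch follows the kernel's usual pagevec batching pattern: queue pages in a small per-CPU vector and drain the whole batch under the LRU lock only when the vector fills. A minimal userspace sketch of that pattern (names are illustrative; the batch size mirrors the kernel's PAGEVEC_SIZE of 14 in this era, and plain ints stand in for struct page pointers):

```c
#define PAGEVEC_SIZE 14

struct pagevec {
    unsigned int nr;
    int pages[PAGEVEC_SIZE];
};

/* Add a page; returns the space remaining, 0 when the vector is full
 * (same convention as the kernel's pagevec_add()). */
static unsigned int pagevec_add(struct pagevec *pvec, int page)
{
    pvec->pages[pvec->nr++] = page;
    return PAGEVEC_SIZE - pvec->nr;
}

static unsigned int flush_count;  /* how many full batches were drained */

static void pagevec_demote_flush(struct pagevec *pvec)
{
    /* in the kernel this takes zone->lru_lock once per run of same-zone
     * pages and moves each page to the tail of the inactive list */
    flush_count++;
    pvec->nr = 0;
}

/* analogue of lru_demote(): queue, and drain when the vector fills */
static void lru_demote(struct pagevec *pvec, int page)
{
    if (!pagevec_add(pvec, page))
        pagevec_demote_flush(pvec);
}

/* demote n pages and report how many full-batch flushes that cost */
static unsigned int demote_pages(int n)
{
    struct pagevec pvec = { 0, {0} };
    int i;

    flush_count = 0;
    for (i = 0; i < n; i++)
        lru_demote(&pvec, i);
    /* in the kernel, leftover entries are drained later by __lru_add_drain() */
    return flush_count;
}
```

The point of the batching is that the expensive step (taking the per-zone LRU lock) is paid once per 14 pages instead of once per page.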
* [PATCH 2/3] readahead: fadvise drop behind controls 2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra 2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra @ 2007-07-21 21:00 ` Peter Zijlstra 2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn> 3 siblings, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw) To: linux-kernel Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook, Peter Zijlstra [-- Attachment #1: readahead-fadvise.patch --] [-- Type: text/plain, Size: 2490 bytes --] currently drop behind is controlled by a heuristic: ra->size == ra->ra_pages provide some fadvise() control: POSIX_FADV_NORMAL - heuristic POSIX_FADV_RANDOM - always disable drop behind POSIX_FADV_SEQUENTIAL - always enable drop behind Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- include/linux/fs.h | 7 +++++++ mm/fadvise.c | 5 +++++ mm/readahead.c | 4 +++- 3 files changed, 15 insertions(+), 1 deletion(-) Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -705,8 +705,15 @@ struct file_ra_state { unsigned int ra_pages; /* Maximum readahead window */ int mmap_miss; /* Cache miss stat for mmap accesses */ loff_t prev_pos; /* Cache last read() position */ + unsigned int flags; /* readahead flags */ }; +#define F_RA_DROP_BEHIND_NEVER 1 +#define F_RA_DROP_BEHIND_ALWAYS 2 + +#define F_RA_DROP_BEHIND_MASK \ + (F_RA_DROP_BEHIND_NEVER|F_RA_DROP_BEHIND_ALWAYS) + /* * Check if @index falls in the readahead windows. 
*/ Index: linux-2.6/mm/fadvise.c =================================================================== --- linux-2.6.orig/mm/fadvise.c +++ linux-2.6/mm/fadvise.c @@ -65,12 +65,17 @@ asmlinkage long sys_fadvise64_64(int fd, switch (advice) { case POSIX_FADV_NORMAL: file->f_ra.ra_pages = bdi->ra_pages; + file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK; break; case POSIX_FADV_RANDOM: file->f_ra.ra_pages = 0; + file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK; + file->f_ra.flags |= F_RA_DROP_BEHIND_NEVER; break; case POSIX_FADV_SEQUENTIAL: file->f_ra.ra_pages = bdi->ra_pages * 2; + file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK; + file->f_ra.flags |= F_RA_DROP_BEHIND_ALWAYS; break; case POSIX_FADV_WILLNEED: if (!mapping->a_ops->readpage) { Index: linux-2.6/mm/readahead.c =================================================================== --- linux-2.6.orig/mm/readahead.c +++ linux-2.6/mm/readahead.c @@ -485,7 +485,9 @@ page_cache_async_readahead(struct addres * hint that there is sequential IO, which implies that the pages that * have been used thus far can be reclaimed */ - if (ra->size == ra->ra_pages) do { + if (!(ra->flags & F_RA_DROP_BEHIND_NEVER) && + ((ra->flags & F_RA_DROP_BEHIND_ALWAYS) || + ra->size == ra->ra_pages)) do { nr_pages = find_get_pages(mapping, demote_idx, ARRAY_SIZE(pages), pages); -- ^ permalink raw reply [flat|nested] 46+ messages in thread
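From userspace the new controls map onto the standard posix_fadvise(2) call; on a kernel with this patch, POSIX_FADV_SEQUENTIAL would also force drop-behind on. A minimal sketch of issuing the hint (the drop-behind effect is specific to the patched kernel; the syscall itself is standard POSIX and returns 0 on success):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>

/* Open an anonymous scratch file and advise sequential access on it.
 * Returns the posix_fadvise() result (0 on success). */
static int advise_sequential_demo(void)
{
    FILE *f = tmpfile();   /* scratch file, unlinked automatically */
    int ret;

    if (!f)
        return -1;
    ret = posix_fadvise(fileno(f), 0, 0, POSIX_FADV_SEQUENTIAL);
    fclose(f);
    return ret;
}
```

A streaming reader such as a backup tool would issue the same call on its data file right after open(), before the first read.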
* [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra 2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra 2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra @ 2007-07-21 21:00 ` Peter Zijlstra 2007-07-22 8:24 ` Jens Axboe [not found] ` <20070722084526.GB6317@mail.ustc.edu.cn> [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn> 3 siblings, 2 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw) To: linux-kernel Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook, Peter Zijlstra [-- Attachment #1: ra_pages.patch --] [-- Type: text/plain, Size: 2488 bytes --] Scale the default max readahead size with the system memory size. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- block/ll_rw_blk.c | 2 +- include/linux/fs.h | 1 + mm/readahead.c | 32 ++++++++++++++++++++++++++++++++ 3 files changed, 34 insertions(+), 1 deletion(-) Index: linux-2.6/block/ll_rw_blk.c =================================================================== --- linux-2.6.orig/block/ll_rw_blk.c +++ linux-2.6/block/ll_rw_blk.c @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS); blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS); q->make_request_fn = mfn; - q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; + bdi_ra_init(&q->backing_dev_info); q->backing_dev_info.state = 0; q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY; blk_queue_max_sectors(q, SAFE_MAX_SECTORS); Index: linux-2.6/include/linux/fs.h =================================================================== --- linux-2.6.orig/include/linux/fs.h +++ linux-2.6/include/linux/fs.h @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file extern void file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); 
+extern void bdi_ra_init(struct backing_dev_info *bdi); extern loff_t no_llseek(struct file *file, loff_t offset, int origin); extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); Index: linux-2.6/mm/readahead.c =================================================================== --- linux-2.6.orig/mm/readahead.c +++ linux-2.6/mm/readahead.c @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state } EXPORT_SYMBOL_GPL(file_ra_state_init); +static unsigned long ra_pages; + +static __init int readahead_init(void) +{ + /* + * Scale the max readahead window with system memory + * + * 64M: 128K + * 128M: 180K + * 256M: 256K + * 512M: 360K + * 1G: 512K + * 2G: 724K + * 4G: 1024K + * 8G: 1448K + * 16G: 2048K + */ + ra_pages = int_sqrt(totalram_pages/16); + if (ra_pages > (2 << (20 - PAGE_SHIFT))) + ra_pages = 2 << (20 - PAGE_SHIFT); + + return 0; +} + +subsys_initcall(readahead_init); + +void bdi_ra_init(struct backing_dev_info *bdi) +{ + bdi->ra_pages = ra_pages; +} +EXPORT_SYMBOL(bdi_ra_init); + #define list_to_page(head) (list_entry((head)->prev, struct page, lru)) /** -- ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra @ 2007-07-22 8:24 ` Jens Axboe 2007-07-22 8:36 ` Peter Zijlstra [not found] ` <20070722084526.GB6317@mail.ustc.edu.cn> 1 sibling, 1 reply; 46+ messages in thread From: Jens Axboe @ 2007-07-22 8:24 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sat, Jul 21 2007, Peter Zijlstra wrote: > Scale the default max readahead size with the system memory size. > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > block/ll_rw_blk.c | 2 +- > include/linux/fs.h | 1 + > mm/readahead.c | 32 ++++++++++++++++++++++++++++++++ > 3 files changed, 34 insertions(+), 1 deletion(-) > > Index: linux-2.6/block/ll_rw_blk.c > =================================================================== > --- linux-2.6.orig/block/ll_rw_blk.c > +++ linux-2.6/block/ll_rw_blk.c > @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu > blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS); > blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS); > q->make_request_fn = mfn; > - q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; > + bdi_ra_init(&q->backing_dev_info); > q->backing_dev_info.state = 0; > q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY; > blk_queue_max_sectors(q, SAFE_MAX_SECTORS); > Index: linux-2.6/include/linux/fs.h > =================================================================== > --- linux-2.6.orig/include/linux/fs.h > +++ linux-2.6/include/linux/fs.h > @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file > > extern void > file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); > +extern void bdi_ra_init(struct backing_dev_info *bdi); > extern loff_t no_llseek(struct file *file, loff_t offset, int origin); > extern loff_t generic_file_llseek(struct 
file *file, loff_t offset, int origin); > extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); > Index: linux-2.6/mm/readahead.c > =================================================================== > --- linux-2.6.orig/mm/readahead.c > +++ linux-2.6/mm/readahead.c > @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state > } > EXPORT_SYMBOL_GPL(file_ra_state_init); > > +static unsigned long ra_pages; > + > +static __init int readahead_init(void) > +{ > + /* > + * Scale the max readahead window with system memory > + * > + * 64M: 128K > + * 128M: 180K > + * 256M: 256K > + * 512M: 360K > + * 1G: 512K > + * 2G: 724K > + * 4G: 1024K > + * 8G: 1448K > + * 16G: 2048K > + */ > + ra_pages = int_sqrt(totalram_pages/16); > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > + ra_pages = 2 << (20 - PAGE_SHIFT); > + > + return 0; > +} How did you come up with these numbers? -- Jens Axboe ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 8:24 ` Jens Axboe @ 2007-07-22 8:36 ` Peter Zijlstra 2007-07-22 8:50 ` Jens Axboe 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-07-22 8:36 UTC (permalink / raw) To: Jens Axboe Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote: > On Sat, Jul 21 2007, Peter Zijlstra wrote: > > +static __init int readahead_init(void) > > +{ > > + /* > > + * Scale the max readahead window with system memory > > + * > > + * 64M: 128K > > + * 128M: 180K > > + * 256M: 256K > > + * 512M: 360K > > + * 1G: 512K > > + * 2G: 724K > > + * 4G: 1024K > > + * 8G: 1448K > > + * 16G: 2048K > > + */ > > + ra_pages = int_sqrt(totalram_pages/16); > > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > > + ra_pages = 2 << (20 - PAGE_SHIFT); > > + > > + return 0; > > +} > > How did you come up with these numbers? Well, most other places in the kernel where we scale by memory size we use the a sqrt curve, and the specific scale was the result of some fiddling, these numbers looked sane to me, nothing special. Would you suggest a different set, and if so, do you have any rationale for them? ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 8:36 ` Peter Zijlstra @ 2007-07-22 8:50 ` Jens Axboe 2007-07-22 9:17 ` Peter Zijlstra 2007-07-22 23:52 ` Rik van Riel 0 siblings, 2 replies; 46+ messages in thread From: Jens Axboe @ 2007-07-22 8:50 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22 2007, Peter Zijlstra wrote: > On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote: > > On Sat, Jul 21 2007, Peter Zijlstra wrote: > > > > +static __init int readahead_init(void) > > > +{ > > > + /* > > > + * Scale the max readahead window with system memory > > > + * > > > + * 64M: 128K > > > + * 128M: 180K > > > + * 256M: 256K > > > + * 512M: 360K > > > + * 1G: 512K > > > + * 2G: 724K > > > + * 4G: 1024K > > > + * 8G: 1448K > > > + * 16G: 2048K > > > + */ > > > + ra_pages = int_sqrt(totalram_pages/16); > > > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > > > + ra_pages = 2 << (20 - PAGE_SHIFT); > > > + > > > + return 0; > > > +} > > > > How did you come up with these numbers? > > Well, most other places in the kernel where we scale by memory size we > use the a sqrt curve, and the specific scale was the result of some > fiddling, these numbers looked sane to me, nothing special. > > Would you suggest a different set, and if so, do you have any rationale > for them? I just wish you had a rationale behind them, I don't think it's that great of a series. I agree with the low point of 128k. Then it'd be sane to try and determine what the upper limit of ra window size goodness is, which is probably impossible since it depends on the hardware a lot. But lets just say the upper value is 2mb, then I think it's pretty silly _not_ to use 2mb on a 1g machine for instance. So more aggressive scaling. Then there's the relationship between nr of requests and ra size. 
When you leave everything up to a simple sqrt of total_ram type thing, then you are sure to hit stupid values that cause a queue size of a number of full requests, plus a small one at the end. Clearly not optimal! -- Jens Axboe ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 8:50 ` Jens Axboe @ 2007-07-22 9:17 ` Peter Zijlstra 2007-07-22 16:44 ` Jens Axboe 2007-07-22 23:52 ` Rik van Riel 1 sibling, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-07-22 9:17 UTC (permalink / raw) To: Jens Axboe Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 10:50 +0200, Jens Axboe wrote: > On Sun, Jul 22 2007, Peter Zijlstra wrote: > > On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote: > > > On Sat, Jul 21 2007, Peter Zijlstra wrote: > > > > > > +static __init int readahead_init(void) > > > > +{ > > > > + /* > > > > + * Scale the max readahead window with system memory > > > > + * > > > > + * 64M: 128K > > > > + * 128M: 180K > > > > + * 256M: 256K > > > > + * 512M: 360K > > > > + * 1G: 512K > > > > + * 2G: 724K > > > > + * 4G: 1024K > > > > + * 8G: 1448K > > > > + * 16G: 2048K > > > > + */ > > > > + ra_pages = int_sqrt(totalram_pages/16); > > > > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > > > > + ra_pages = 2 << (20 - PAGE_SHIFT); > > > > + > > > > + return 0; > > > > +} > > > > > > How did you come up with these numbers? > > > > Well, most other places in the kernel where we scale by memory size we > > use the a sqrt curve, and the specific scale was the result of some > > fiddling, these numbers looked sane to me, nothing special. > > > > Would you suggest a different set, and if so, do you have any rationale > > for them? > > I just wish you had a rationale behind them, I don't think it's that > great of a series. Well, I was quite ignorant of the issues you just pointed out. Thanks those do indeed provide basis for a more solid set. > I agree with the low point of 128k. Perhaps that should be enforced then, because currently a system with <64M will get less. 
> Then it'd be sane > to try and determine what the upper limit of ra window size goodness is, > which is probably impossible since it depends on the hardware a lot. But > lets just say the upper value is 2mb, then I think it's pretty silly > _not_ to use 2mb on a 1g machine for instance. So more aggressive > scaling. Right, I was being a little conservative here. > Then there's the relationship between nr of requests and ra size. When > you leave everything up to a simple sqrt of total_ram type thing, then > you are sure to hit stupid values that cause a queue size of a number of > full requests, plus a small one at the end. Clearly not optimal! And this is where Wu's point of power of two series comes into play, right? So something like: roundup_pow_of_two(int_sqrt((totalram_pages << (PAGE_SHIFT-10)))) memory in MB RA window in KB 64 128 128 256 256 256 512 512 1024 512 2048 1024 4096 1024 8192 2048 16384 2048 32768 4096 65536 4096 ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 9:17 ` Peter Zijlstra @ 2007-07-22 16:44 ` Jens Axboe 2007-07-23 10:04 ` Jörn Engel 0 siblings, 1 reply; 46+ messages in thread From: Jens Axboe @ 2007-07-22 16:44 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22 2007, Peter Zijlstra wrote: > On Sun, 2007-07-22 at 10:50 +0200, Jens Axboe wrote: > > On Sun, Jul 22 2007, Peter Zijlstra wrote: > > > On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote: > > > > On Sat, Jul 21 2007, Peter Zijlstra wrote: > > > > > > > > +static __init int readahead_init(void) > > > > > +{ > > > > > + /* > > > > > + * Scale the max readahead window with system memory > > > > > + * > > > > > + * 64M: 128K > > > > > + * 128M: 180K > > > > > + * 256M: 256K > > > > > + * 512M: 360K > > > > > + * 1G: 512K > > > > > + * 2G: 724K > > > > > + * 4G: 1024K > > > > > + * 8G: 1448K > > > > > + * 16G: 2048K > > > > > + */ > > > > > + ra_pages = int_sqrt(totalram_pages/16); > > > > > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > > > > > + ra_pages = 2 << (20 - PAGE_SHIFT); > > > > > + > > > > > + return 0; > > > > > +} > > > > > > > > How did you come up with these numbers? > > > > > > Well, most other places in the kernel where we scale by memory size we > > > use the a sqrt curve, and the specific scale was the result of some > > > fiddling, these numbers looked sane to me, nothing special. > > > > > > Would you suggest a different set, and if so, do you have any rationale > > > for them? > > > > I just wish you had a rationale behind them, I don't think it's that > > great of a series. > > Well, I was quite ignorant of the issues you just pointed out. Thanks > those do indeed provide basis for a more solid set. > > > I agree with the low point of 128k. > > Perhaps that should be enforced then, because currently a system with > <64M will get less. 
I think it should remain the low point. > > Then it'd be sane > > to try and determine what the upper limit of ra window size goodness is, > > which is probably impossible since it depends on the hardware a lot. But > > lets just say the upper value is 2mb, then I think it's pretty silly > > _not_ to use 2mb on a 1g machine for instance. So more aggressive > > scaling. > > Right, I was being a little conservative here. > > > Then there's the relationship between nr of requests and ra size. When > > you leave everything up to a simple sqrt of total_ram type thing, then > > you are sure to hit stupid values that cause a queue size of a number of > > full requests, plus a small one at the end. Clearly not optimal! > > And this is where Wu's point of power of two series comes into play, > right? > > So something like: > > roundup_pow_of_two(int_sqrt((totalram_pages << (PAGE_SHIFT-10)))) > > > memory in MB RA window in KB > 64 128 > 128 256 > 256 256 > 512 512 > 1024 512 > 2048 1024 > 4096 1024 > 8192 2048 > 16384 2048 > 32768 4096 > 65536 4096 Only if you assume that max request size is always a power of 2. That's usually true, but there are cases where it's 124kb for instance. And there's still an issue when max_sectors isn't the deciding factor, if we end up having to stop merging on a request because we hit other limitations. So there's definitely room for improvement! Even today, btw, it's not all because of these changes. -- Jens Axboe ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 16:44 ` Jens Axboe @ 2007-07-23 10:04 ` Jörn Engel 2007-07-23 10:11 ` Jens Axboe 2007-07-23 22:44 ` Rusty Russell 0 siblings, 2 replies; 46+ messages in thread From: Jörn Engel @ 2007-07-23 10:04 UTC (permalink / raw) To: Jens Axboe Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, 22 July 2007 18:44:03 +0200, Jens Axboe wrote: > > > > > I agree with the low point of 128k. > > > > Perhaps that should be enforced then, because currently a system with > > <64M will get less. > > I think it should remain the low point. I believe this whole thing is fundamentally flawed. The perfect readahead size depends on the device in question. If we set a single system-wide value depending on memory size, it may easily be too small and too large at the same time. Think hard disk and SSD. It may make sense to have a _maximum_ readahead size which depends on memory size. But the preferred size (if there is enough RAM) should depend solely on the device, possibly as a function of the bandwidth * read latency product. Jörn -- Joern's library part 12: http://physics.nist.gov/cuu/Units/binary.html ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-23 10:04 ` Jörn Engel @ 2007-07-23 10:11 ` Jens Axboe 2007-07-23 22:44 ` Rusty Russell 1 sibling, 0 replies; 46+ messages in thread From: Jens Axboe @ 2007-07-23 10:11 UTC (permalink / raw) To: Jörn Engel Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Mon, Jul 23 2007, Jörn Engel wrote: > On Sun, 22 July 2007 18:44:03 +0200, Jens Axboe wrote: > > > > > > > I agree with the low point of 128k. > > > > > > Perhaps that should be enforced then, because currently a system with > > > <64M will get less. > > > > I think it should remain the low point. > > I believe this whole thing is fundamentally flawed. The perfect > readahead size depends on the device in question. If we set a single > system-wide value depending on memory size, it may easily be too small > and too large at the same time. Think hard disk and SSD. > > It may make sense to have a _maximum_ readahead size which depends on > memory size. But the preferred size (if there is enough RAM) should > depend solely on the device, possibly as a function of the > bandwidth * read latency product. The value IS the maximum read-ahead value, not an enforced minimum. However, there's definitely room for improvement in the queue feedback. Even for seekless devices like SSD, read-ahead may be beneficial due to zero request -> request latency. -- Jens Axboe ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-23 10:04 ` Jörn Engel 2007-07-23 10:11 ` Jens Axboe @ 2007-07-23 22:44 ` Rusty Russell 1 sibling, 0 replies; 46+ messages in thread From: Rusty Russell @ 2007-07-23 22:44 UTC (permalink / raw) To: Jörn Engel Cc: Jens Axboe, Peter Zijlstra, linux-kernel, Fengguang Wu, riel, Andrew Morton, Tim Pepper, Chris Snook On Mon, 2007-07-23 at 12:04 +0200, Jörn Engel wrote: > I believe this whole thing is fundamentally flawed. The perfect > readahead size depends on the device in question. If we set a single > system-wide value depending on memory size, it may easily be too small > and too large at the same time. Think hard disk and SSD. Well, I think the filesystem should get the first shot at altering the readahead heuristic (we should add a hook so it can then ask the underlying device). Something like: === Hook for filesystems to tweak readahead logic It was suggested at Fengguang's OLS talk that filesystems can make a useful contribution to the readahead logic: avoiding seeks and lock boundaries (on cluster filesystems) and other such filesystem-specific things. The alter_readahead hook should return the number of pages to read. 
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> --- Documentation/filesystems/vfs.txt | 10 ++++++---- include/linux/fs.h | 1 + mm/readahead.c | 12 ++++++++++++ 3 files changed, 19 insertions(+), 4 deletions(-) --- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h +++ linux-2.6.22-rc6-mm1/include/linux/fs.h @@ -1204,7 +1204,8 @@ struct file_operations { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*revoke)(struct file *, struct address_space *); + unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long); }; struct inode_operations { --- linux-2.6.22-rc6-mm1.orig/mm/readahead.c +++ linux-2.6.22-rc6-mm1/mm/readahead.c @@ -394,7 +394,19 @@ ondemand_readahead(struct address_space } readit: + /* + * Give filesystem a chance to tweak readahead size (eg. to + * contiguous block boundary). It should probably not change + * by more than 50% otherwise it's really ignoring our + * readahead advice. 
+ */ + if (filp->f_op->alter_readahead) { + ra->size = filp->f_op->alter_readahead(filp, offset, req_size, + ra->start, ra->size); + if (ra->async_size > ra->size) + ra->async_size = ra->size; + } return ra_submit(ra, mapping, filp); } --- linux-2.6.22-rc6-mm1.orig/Documentation/filesystems/vfs.txt +++ linux-2.6.22-rc6-mm1/Documentation/filesystems/vfs.txt @@ -777,11 +777,10 @@ struct file_operations { int (*check_flags)(int); int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); - ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned -int); - ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned -int); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); + ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); int (*revoke)(struct file *); + unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long); }; Again, all methods are called without any locks being held, unless @@ -859,6 +858,9 @@ otherwise noted. to an open file. This method must ensure that all currently blocked writes are flushed and reads will fail. + alter_readahead: called when the readahead logic is about to submit a + readahead request for I/O + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special ^ permalink raw reply [flat|nested] 46+ messages in thread
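To make the hook's intent concrete, here is a userspace sketch of the kind of adjustment a filesystem's `->alter_readahead` might perform: trimming the window so it does not cross an allocation boundary such as a RAID stride or block group, as the changelog suggests. The function name, the boundary policy, and the standalone form are all illustrative assumptions, not part of Rusty's patch.

```c
typedef unsigned long pgoff_t;

/*
 * Hypothetical alter_readahead() helper: shrink a readahead window
 * of 'size' pages starting at page 'start' so that it stops at the
 * next multiple of 'boundary_pages' (e.g. a RAID stripe or block
 * group boundary), avoiding a request that straddles two regions.
 */
static unsigned long clamp_ra_to_boundary(pgoff_t start, unsigned long size,
					  unsigned long boundary_pages)
{
	pgoff_t end = start + size;
	pgoff_t next_boundary = (start / boundary_pages + 1) * boundary_pages;

	if (end > next_boundary)
		size = next_boundary - start;	/* stop at the boundary */
	return size;
}
```

Note this only shrinks the window, in line with the comment's advice that a filesystem should not diverge from the readahead heuristic's request by more than about 50%.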
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 8:50 ` Jens Axboe 2007-07-22 9:17 ` Peter Zijlstra @ 2007-07-22 23:52 ` Rik van Riel 2007-07-23 5:22 ` Jens Axboe 1 sibling, 1 reply; 46+ messages in thread From: Rik van Riel @ 2007-07-22 23:52 UTC (permalink / raw) To: Jens Axboe Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook Jens Axboe wrote: > I just wish you had a rationale behind them, I don't think it's that > great of a series. I agree with the low point of 128k. Then it'd be sane > to try and determine what the upper limit of ra window size goodness is, > which is probably impossible since it depends on the hardware a lot. But > lets just say the upper value is 2mb, then I think it's pretty silly > _not_ to use 2mb on a 1g machine for instance. So more aggressive > scaling. 1 or 2 MB is a nice number. Seek time (plus rotational latency) on disks still takes on the order of 10 ms, while disks commonly transfer data on the order of 50MB/second. That means one disk seek (10ms) takes as long as it takes to read around 512kB of data. The current 128kB means that if you have lots of streaming IO going on, you spend only 20% of the time transferring data and get roughly 10MB/s. Seek 10ms, read 2.5ms worth of data. OTOH, if you do 2MB per request for the same heavy streaming workload (say, an ftp or nfs server doing media files), you can get 80% of the disk throughput, or 40MB/s. This is because you spend 40ms transferring data for every 10ms seek time. Yes, filesystem metadata will reduce this "occasionally", but the general idea holds. -- Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 23:52 ` Rik van Riel @ 2007-07-23 5:22 ` Jens Axboe 0 siblings, 0 replies; 46+ messages in thread From: Jens Axboe @ 2007-07-23 5:22 UTC (permalink / raw) To: Rik van Riel Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22 2007, Rik van Riel wrote: > Jens Axboe wrote: > >> I just wish you had a rationale behind them, I don't think it's that >> great of a series. I agree with the low point of 128k. Then it'd be sane >> to try and determine what the upper limit of ra window size goodness is, >> which is probably impossible since it depends on the hardware a lot. But >> lets just say the upper value is 2mb, then I think it's pretty silly >> _not_ to use 2mb on a 1g machine for instance. So more aggressive >> scaling. > > 1 or 2 MB is a nice number. > > Seek time (plus rotational latency) on disks still takes > on the order of 10 ms, while disks commonly transfer data > on the order of 50MB/second. > > That means one disk seek (10ms) takes as long as it takes > to read around 512kB of data. > > The current 128kB means that if you have lots of streaming > IO going on, you spend only 20% of the time transferring > data and get roughly 10MB/s. Seek 10ms, read 2.5ms worth > of data. > > OTOH, if you do 2MB per request for the same heavy streaming > workload (say, an ftp or nfs server doing media files), you > can get 80% of the disk throughput, or 40MB/s. This is because > you spend 40ms transferring data for every 10ms seek time. > > Yes, filesystem metadata will reduce this "occasionally", > but the general idea holds. I meant real numbers, the above is just rudimentary math. AS and CFQ makes sure that you get a window of more than one request in, so huge read-ahead sizes aren't as interesting. So - more testing, less hand waving :-) -- Jens Axboe ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size [not found] ` <20070722084526.GB6317@mail.ustc.edu.cn> @ 2007-07-22 8:45 ` Fengguang Wu 2007-07-22 8:59 ` Peter Zijlstra 0 siblings, 1 reply; 46+ messages in thread From: Fengguang Wu @ 2007-07-22 8:45 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sat, Jul 21, 2007 at 11:00:08PM +0200, Peter Zijlstra wrote: > Scale the default max readahead size with the system memory size. > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > block/ll_rw_blk.c | 2 +- > include/linux/fs.h | 1 + > mm/readahead.c | 32 ++++++++++++++++++++++++++++++++ > 3 files changed, 34 insertions(+), 1 deletion(-) > > Index: linux-2.6/block/ll_rw_blk.c > =================================================================== > --- linux-2.6.orig/block/ll_rw_blk.c > +++ linux-2.6/block/ll_rw_blk.c > @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu > blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS); > blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS); > q->make_request_fn = mfn; > - q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; > + bdi_ra_init(&q->backing_dev_info); fs/fuse/inode.c has another line to be converted. 
> q->backing_dev_info.state = 0; > q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY; > blk_queue_max_sectors(q, SAFE_MAX_SECTORS); > Index: linux-2.6/include/linux/fs.h > =================================================================== > --- linux-2.6.orig/include/linux/fs.h > +++ linux-2.6/include/linux/fs.h > @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file > > extern void > file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); > +extern void bdi_ra_init(struct backing_dev_info *bdi); > extern loff_t no_llseek(struct file *file, loff_t offset, int origin); > extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); > extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); > Index: linux-2.6/mm/readahead.c > =================================================================== > --- linux-2.6.orig/mm/readahead.c > +++ linux-2.6/mm/readahead.c > @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state > } > EXPORT_SYMBOL_GPL(file_ra_state_init); > > +static unsigned long ra_pages; > + > +static __init int readahead_init(void) > +{ > + /* > + * Scale the max readahead window with system memory > + * > + * 64M: 128K > + * 128M: 180K > + * 256M: 256K > + * 512M: 360K > + * 1G: 512K > + * 2G: 724K > + * 4G: 1024K > + * 8G: 1448K > + * 16G: 2048K > + */ > + ra_pages = int_sqrt(totalram_pages/16); > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > + ra_pages = 2 << (20 - PAGE_SHIFT); We can elaborate on the numbers ;) How about the following rules? - limit it under 1MB: we have to consider latencies - make them alignment-friendly, i.e. 128K, 256K, 512K, 1M. My original plan is to simply do the following: - #define VM_MAX_READAHEAD 128 /* kbytes */ + #define VM_MAX_READAHEAD 512 /* kbytes */ I'd like to post some numbers to back-up the discussion: readahead readahead size miss 128K 38% 512K 45% 1024K 49% The numbers are measured on a fresh booted KDE desktop. 
The majority misses come from the larger mmap read-arounds. Sequential readahead hits are pretty high and not quite affected by the readahead size, thanks to its size ramp-up process. > + > + return 0; > +} > + > +subsys_initcall(readahead_init); Remove the global ra_pages and fold readahead_init() into bdi_ra_init()? bdi_ra_init will only be called several times I guess. > + > +void bdi_ra_init(struct backing_dev_info *bdi) > +{ > + bdi->ra_pages = ra_pages; > +} > +EXPORT_SYMBOL(bdi_ra_init); ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size 2007-07-22 8:45 ` Fengguang Wu @ 2007-07-22 8:59 ` Peter Zijlstra [not found] ` <20070722095313.GA8136@mail.ustc.edu.cn> 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-07-22 8:59 UTC (permalink / raw) To: Fengguang Wu Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 16:45 +0800, Fengguang Wu wrote: > On Sat, Jul 21, 2007 at 11:00:08PM +0200, Peter Zijlstra wrote: > > Scale the default max readahead size with the system memory size. > > > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > --- > > block/ll_rw_blk.c | 2 +- > > include/linux/fs.h | 1 + > > mm/readahead.c | 32 ++++++++++++++++++++++++++++++++ > > 3 files changed, 34 insertions(+), 1 deletion(-) > > > > Index: linux-2.6/block/ll_rw_blk.c > > =================================================================== > > --- linux-2.6.orig/block/ll_rw_blk.c > > +++ linux-2.6/block/ll_rw_blk.c > > @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu > > blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS); > > blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS); > > q->make_request_fn = mfn; > > - q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; > > + bdi_ra_init(&q->backing_dev_info); > > fs/fuse/inode.c has another line to be converted. Drad, right you are. Will grep a bit. 
> > q->backing_dev_info.state = 0; > > q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY; > > blk_queue_max_sectors(q, SAFE_MAX_SECTORS); > > Index: linux-2.6/include/linux/fs.h > > =================================================================== > > --- linux-2.6.orig/include/linux/fs.h > > +++ linux-2.6/include/linux/fs.h > > @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file > > > > extern void > > file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping); > > +extern void bdi_ra_init(struct backing_dev_info *bdi); > > extern loff_t no_llseek(struct file *file, loff_t offset, int origin); > > extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin); > > extern loff_t remote_llseek(struct file *file, loff_t offset, int origin); > > Index: linux-2.6/mm/readahead.c > > =================================================================== > > --- linux-2.6.orig/mm/readahead.c > > +++ linux-2.6/mm/readahead.c > > @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state > > } > > EXPORT_SYMBOL_GPL(file_ra_state_init); > > > > +static unsigned long ra_pages; > > + > > +static __init int readahead_init(void) > > +{ > > + /* > > + * Scale the max readahead window with system memory > > + * > > + * 64M: 128K > > + * 128M: 180K > > + * 256M: 256K > > + * 512M: 360K > > + * 1G: 512K > > + * 2G: 724K > > + * 4G: 1024K > > + * 8G: 1448K > > + * 16G: 2048K > > + */ > > + ra_pages = int_sqrt(totalram_pages/16); > > + if (ra_pages > (2 << (20 - PAGE_SHIFT))) > > + ra_pages = 2 << (20 - PAGE_SHIFT); > > We can elaborate on the numbers ;) > > How about the following rules? > - limit it under 1MB: we have to consider latencies readahead is done async and we have these cond_resched() things sprinkled all over, no? > - make them alignment-friendly, i.e. 128K, 256K, 512K, 1M. Would that actually matter? but yeah, that seems like a sane suggestion. roundup_pow_of_two() comes to mind. 
> My original plan is to simply do the following: > > - #define VM_MAX_READAHEAD 128 /* kbytes */ > + #define VM_MAX_READAHEAD 512 /* kbytes */ Yeah, the trouble I have with that is that it might adversely affect tiny systems (although the trash detection might mitigate that impact) > I'd like to post some numbers to back-up the discussion: > > readahead readahead > size miss > 128K 38% > 512K 45% > 1024K 49% > > The numbers are measured on a fresh booted KDE desktop. > > The majority misses come from the larger mmap read-arounds. the mmap code never gets into readahead unless madvise(MADV_SEQUENTIAL) is used afaik. > Sequential readahead hits are pretty high and not quite affected by > the readahead size, thanks to its size ramp-up process. > > > + > > + return 0; > > +} > > + > > +subsys_initcall(readahead_init); > > Remove the global ra_pages and fold readahead_init() into bdi_ra_init()? > bdi_ra_init will only be called several times I guess. I guess we could, this just seemed like a proper setup where more things could grow into. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size [not found] ` <20070722095313.GA8136@mail.ustc.edu.cn> @ 2007-07-22 9:53 ` Fengguang Wu 0 siblings, 0 replies; 46+ messages in thread From: Fengguang Wu @ 2007-07-22 9:53 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22, 2007 at 10:59:11AM +0200, Peter Zijlstra wrote: > On Sun, 2007-07-22 at 16:45 +0800, Fengguang Wu wrote: > > How about the following rules? > > - limit it under 1MB: we have to consider latencies > > readahead is done async and we have these cond_resched() things > sprinkled all over, no? Yeah, it should not be a big problem. > > - make them alignment-friendly, i.e. 128K, 256K, 512K, 1M. > > Would that actually matter? but yeah, that seems like a sane suggestion. > roundup_pow_of_two() comes to mind. E.g. RAID stride size, and the max_sectors_kb. Typically they are power-of-two. > > My original plan is to simply do the following: > > > > - #define VM_MAX_READAHEAD 128 /* kbytes */ > > + #define VM_MAX_READAHEAD 512 /* kbytes */ > > Yeah, the trouble I have with that is that it might adversely affect > tiny systems (although the trash detection might mitigate that impact) I'm also OK with the scaling up scheme. It's reasonable. > > I'd like to post some numbers to back-up the discussion: > > > > readahead readahead > > size miss > > 128K 38% > > 512K 45% > > 1024K 49% > > > > The numbers are measured on a fresh booted KDE desktop. > > > > The majority misses come from the larger mmap read-arounds. > > the mmap code never gets into readahead unless madvise(MADV_SEQUENTIAL) > is used afaik. Sadly mmap read-around reuses the same readahead size. - for read-around, VM_MAX_READAHEAD is the _real_ readahead size - for readahead, VM_MAX_READAHEAD is the _max_ readahead size If we simply increasing VM_MAX_READAHEAD, tiny systems can be immediately hurt by large read-arounds. That's the problem. 
[parent not found: <20070722023923.GA6438@mail.ustc.edu.cn>]
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn> @ 2007-07-22 2:39 ` Fengguang Wu 2007-07-22 2:44 ` Dave Jones 1 sibling, 0 replies; 46+ messages in thread From: Fengguang Wu @ 2007-07-22 2:39 UTC (permalink / raw) To: Peter Zijlstra Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook, Dave Jones On Sat, Jul 21, 2007 at 11:00:05PM +0200, Peter Zijlstra wrote: > The various readahead bits I have lying about. > > Wu, would you agree with asking Andrew to stick these behind your latest > readahead series? They are generally good features to have. Thank you! - default readahead size It makes sense to raise it beyond 128K. 1M default readahead absolutely makes sense for sequential workloads. For the desktop, this increases boot speed and readahead misses, both due to more aggressive mmap read-around. Most users will be glad to feel the speedup, and happily ignore the readahead misses, which may be "invisible" in case of large memory. In theory, the distributions can do the same tuning. So we have an interesting question for Dave: Does fedora desktop raise the default readahead size? Why or why not? It goes so far as to do userland readahead ;) - drop behind Sorry, I still doubt it will benefit all/most workloads. Leave it off by default, and leave the enabling decision to Dave? I do hope that it helps general desktops. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn> 2007-07-22 2:39 ` [PATCH 0/3] readahead drop behind and size adjustment Fengguang Wu @ 2007-07-22 2:44 ` Dave Jones [not found] ` <20070722081010.GA6317@mail.ustc.edu.cn> 1 sibling, 1 reply; 46+ messages in thread From: Dave Jones @ 2007-07-22 2:44 UTC (permalink / raw) To: Peter Zijlstra, linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22, 2007 at 10:39:23AM +0800, Fengguang Wu wrote: > It makes sense to raise it beyond 128K. 1M default readahead > absolutely makes sense for sequential workloads. For the desktop, > this increases boot speed and readahead misses, both due to more > aggressive mmap read-around. Most users will be glad to feel the > speedup, and happily ignore the readahead misses, which may be > "invisible" in case of large memory. > > In theory, the distributions can do the same tuning. So we have an > interesting question for Dave: > Does fedora desktop raise the default readahead size? Why or > why not? It goes so far to do userland readahead ;) Fedora takes whatever defaults for readahead the kernel.org kernel has. The only reasoning being that if anyone reported VM bugs, we'd be able to say to interested upstream developers "we're running the stock VM", without having to get the user to try and reproduce on unpatched kernels. > - drop behind > > Sorry, I still doubt it will benefit all/most workloads. Leave it off > by default, and leave the enabling decision to Dave? I do hope that > it help general desktops. It's not a subject that I'm intimately familiar with, and when it comes to decisions like this, I tend to just take whatever the upstream defaults are. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <20070722081010.GA6317@mail.ustc.edu.cn>]
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070722081010.GA6317@mail.ustc.edu.cn> @ 2007-07-22 8:10 ` Fengguang Wu 2007-07-22 8:24 ` Peter Zijlstra 2007-07-22 8:33 ` Rusty Russell 1 sibling, 1 reply; 46+ messages in thread From: Fengguang Wu @ 2007-07-22 8:10 UTC (permalink / raw) To: Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sat, Jul 21, 2007 at 10:44:28PM -0400, Dave Jones wrote: > On Sun, Jul 22, 2007 at 10:39:23AM +0800, Fengguang Wu wrote: > > > It makes sense to raise it beyond 128K. 1M default readahead > > absolutely makes sense for sequential workloads. For the desktop, > > this increases boot speed and readahead misses, both due to more > > aggressive mmap read-around. Most users will be glad to feel the > > speedup, and happily ignore the readahead misses, which may be > > "invisible" in case of large memory. > > > > In theory, the distributions can do the same tuning. So we have an > > interesting question for Dave: > > Does fedora desktop raise the default readahead size? Why or > > why not? It goes so far to do userland readahead ;) > > Fedora takes whatever defaults for readahead the kernel.org kernel has. > The only reasoning being if anyone reported VM bugs, we'd be able > to say to interested upstream developers "we're running the stock VM". > without having to get the user to try and reproduce on unpatched > kernels. Thank you. Now I'm more confident that the kernel should have a reasonable default readahead size. The current one is 5+ years old and should be updated now. > > - drop behind > > > > Sorry, I still doubt it will benefit all/most workloads. Leave it off > > by default, and leave the enabling decision to Dave? I do hope that > > it help general desktops. > > It's not a subject that I'm intimatly familiar with, and when it > comes to decisions like this, I tend to just take whatever the > upstream defaults are. 
- It will avoid large-file-reads-thrashing-my-desktop problem, so most desktop users should like it. But sure there will be counter cases when a user want to keep the data cached. - File servers may hurt from it. Imagine a mp3/png file server. The files are large enough to trigger drop-behind, but small (and hot) enough to be cached. Also when a new fedora DVD iso is released, it may be cached for some days. These are only the obvious cases. So I opt for it being made tunable, safe, and turned off by default. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment 2007-07-22 8:10 ` Fengguang Wu @ 2007-07-22 8:24 ` Peter Zijlstra [not found] ` <20070722082923.GA7790@mail.ustc.edu.cn> 0 siblings, 1 reply; 46+ messages in thread From: Peter Zijlstra @ 2007-07-22 8:24 UTC (permalink / raw) To: Fengguang Wu Cc: Dave Jones, linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > - It will avoid large-file-reads-thrashing-my-desktop problem, > so most desktop users should like it. But sure there will be counter > cases when a user want to keep the data cached. > - File servers may hurt from it. Imagine a mp3/png file server. The > files are large enough to trigger drop-behind, but small (and hot) > enough to be cached. Also when a new fedora DVD iso is released, it > may be cached for some days. These are only the obvious cases. > > So I opt for it being made tunable, safe, and turned off by default. I'm still not convinced (Rik wasn't either last time around). When these files really are hot, they will be kept in memory due to them becoming Active. Also, by scaling up the max readahead size it takes a larger file before it starts dropping. If say this server has 4G of memory (not much at all for a server) resulting in a 1M readahead window, the file needs to be > ~2M before it starts drop behind. But I guess it takes someone to try this IRL before we can settle this debate :-/ ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <20070722082923.GA7790@mail.ustc.edu.cn>]
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070722082923.GA7790@mail.ustc.edu.cn> @ 2007-07-22 8:29 ` Fengguang Wu 0 siblings, 0 replies; 46+ messages in thread From: Fengguang Wu @ 2007-07-22 8:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Jones, linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper, Chris Snook On Sun, Jul 22, 2007 at 10:24:37AM +0200, Peter Zijlstra wrote: > On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > > > - It will avoid large-file-reads-thrashing-my-desktop problem, > > so most desktop users should like it. But sure there will be counter > > cases when a user want to keep the data cached. > > - File servers may hurt from it. Imagine a mp3/png file server. The > > files are large enough to trigger drop-behind, but small (and hot) > > enough to be cached. Also when a new fedora DVD iso is released, it > > may be cached for some days. These are only the obvious cases. > > > > So I opt for it being made tunable, safe, and turned off by default. > > I'm still not convinced (Rik wasn't either last time around). When these > files really are hot, they will be kept in memory due to them becoming > Active. > > Also, by scaling up the max readahead size it takes a larger file before > it starts dropping. If say this server has 4G of memory (not much at all > for a server) resulting in a 1M readahead window, the file needs to be > > ~2M before it starts drop behind. [snip] > But I guess it takes someone to try this IRL before we can settle this > debate :-/ Yeah, some real workload numbers would help. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070722081010.GA6317@mail.ustc.edu.cn> 2007-07-22 8:10 ` Fengguang Wu @ 2007-07-22 8:33 ` Rusty Russell 2007-07-22 8:45 ` Peter Zijlstra 2007-07-23 9:00 ` Nick Piggin 1 sibling, 2 replies; 46+ messages in thread From: Rusty Russell @ 2007-07-22 8:33 UTC (permalink / raw) To: Fengguang Wu Cc: Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > - It will avoid large-file-reads-thrashing-my-desktop problem, > so most desktop users should like it. But sure there will be counter > cases when a user want to keep the data cached. > - File servers may hurt from it. Imagine a mp3/png file server. The > files are large enough to trigger drop-behind, but small (and hot) > enough to be cached. Also when a new fedora DVD iso is released, it > may be cached for some days. These are only the obvious cases. I'm still not convinced of the latter case. If a second user accesses the file before its pages are thrown out, from my understanding there won't be readahead but the pages will be put back on the active list (ie. the dropbehind is undone). AFAICT this patch truly biases towards releasing pages from sequentially accessed files. So file servers where memory pressure comes from lots of such files should be fine (bias will work correctly against least-served files). The main problem I can see is a low-memory server where we are currently swapping out pages from non-sequential files (eg. gigantic running java apps), and now performance might decline because we'll discard file pages first. > So I opt for it being made tunable, safe, and turned off by default. I'd like to see it turned on by default in -mm, and try to come up with some server-like workload to measure the effect. Should be easy to simulate something (eg. apache server, where clients grab some files in preference, and apache server where clients grab different files). Peter? Thanks, Rusty. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment 2007-07-22 8:33 ` Rusty Russell @ 2007-07-22 8:45 ` Peter Zijlstra 2007-07-23 9:00 ` Nick Piggin 1 sibling, 0 replies; 46+ messages in thread From: Peter Zijlstra @ 2007-07-22 8:45 UTC (permalink / raw) To: Rusty Russell Cc: Fengguang Wu, Dave Jones, linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook On Sun, 2007-07-22 at 18:33 +1000, Rusty Russell wrote: > On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > > - It will avoid large-file-reads-thrashing-my-desktop problem, > > so most desktop users should like it. But sure there will be counter > > cases when a user want to keep the data cached. > > - File servers may hurt from it. Imagine a mp3/png file server. The > > files are large enough to trigger drop-behind, but small (and hot) > > enough to be cached. Also when a new fedora DVD iso is released, it > > may be cached for some days. These are only the obvious cases. > > I'm still not convinced of the latter case. If a second user accesses > the file before it pages are thrown out, from my understanding there > won't be readahead but the pages will be put back on the active list > (ie. the dropbehind is undone). > > AFAICT this patch truly biasses towards releasing pages from > sequentially accessed files. So file servers where memory pressure > comes from lots of such files should be fine (bias will work correctly > against least-served files). > > The main problem I can see is a low-memory server where we are currently > swapping out pages from non-sequential files (eg. gigantic running java > apps), and now performance might decline because we'll discard file > pages first. > > > So I opt for it being made tunable, safe, and turned off by default. > > I'd like to see it turned on by default in -mm, and try to come up with > some server-like workload to measure the effect. Should be easy to > simulate something (eg. 
apache server, where clients grab some files in > preference, and apache server where clients grab different files). > > Peter? I guess we could do that. Just populate the server with a lot of randomly sized files, and make a client do a random pull or one with a poisson distribution or something like that. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment 2007-07-22 8:33 ` Rusty Russell 2007-07-22 8:45 ` Peter Zijlstra @ 2007-07-23 9:00 ` Nick Piggin [not found] ` <20070723142457.GA10130@mail.ustc.edu.cn> 2007-07-25 4:35 ` Eric St-Laurent 1 sibling, 2 replies; 46+ messages in thread From: Nick Piggin @ 2007-07-23 9:00 UTC (permalink / raw) To: Rusty Russell Cc: Fengguang Wu, Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook Rusty Russell wrote: > On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: >>So I opt for it being made tunable, safe, and turned off by default. I hate tunables :) Unless we have workload A that gets a reasonable benefit from something and workload B that gets a significant regression, and no clear way to reconcile them... > I'd like to see it turned on by default in -mm, and try to come up with > some server-like workload to measure the effect. Should be easy to > simulate something (eg. apache server, where clients grab some files in > preference, and apache server where clients grab different files). I don't like this kind of conditional information going from something like readahead into page reclaim. Unless it is for readahead _specific_ data such as "I got these all wrong, so you can reclaim them" (which this isn't). Possibly it makes sense to realise that the given pages are cheaper to read back in as they are apparently being read-ahead very nicely. But I don't like it as a use-once thing. The VM should be able to get that right. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <20070723142457.GA10130@mail.ustc.edu.cn>]
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070723142457.GA10130@mail.ustc.edu.cn> @ 2007-07-23 14:24 ` Fengguang Wu 2007-07-23 19:40 ` Andrew Morton 0 siblings, 1 reply; 46+ messages in thread From: Fengguang Wu @ 2007-07-23 14:24 UTC (permalink / raw) To: Nick Piggin Cc: Rusty Russell, Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook, Jens Axboe On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote: > Rusty Russell wrote: > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > > >>So I opt for it being made tunable, safe, and turned off by default. > > I hate tunables :) Unless we have workload A that gets a reasonable > benefit from something and workload B that gets a significant regression, > and no clear way to reconcile them... Me too ;) But sometimes we really want to avoid flushing the cache. Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here. > >I'd like to see it turned on by default in -mm, and try to come up with > >some server-like workload to measure the effect. Should be easy to > >simulate something (eg. apache server, where clients grab some files in > >preference, and apache server where clients grab different files). > > I don't like this kind of conditional information going from something > like readahead into page reclaim. Unless it is for readahead _specific_ > data such as "I got these all wrong, so you can reclaim them" (which > this isn't). > > Possibly it makes sense to realise that the given pages are cheaper > to read back in as they are apparently being read-ahead very nicely. In fact I have talked to Jens about it in last year's kernel summit. The patch below explains itself. --- Subject: cost based page reclaim Cost based page reclaim - a minimalist implementation. Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a large file. 
Should we first drop the 32-pages which are read in one I/O, or drop the 32 distinct pages, each costs one I/O? (Given that the files are of equal hotness.) Page replacement algorithms should be designed to minimize the number of I/Os, instead of the number of page faults. Dividing the cost of I/O by the number of pages it bring in, we get the cost of the page. The bigger page cost, the more 'lives/bloods' the page should have. This patch adds life to costly pages by pretending that they are referenced more times. Possible downsides: - burdens the pressure of vmscan - active pages are no longer that 'active' Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> --- include/linux/backing-dev.h | 1 + mm/readahead.c | 27 +++++++++++++++++++++++++++ mm/swap.c | 5 ++++- 3 files changed, 32 insertions(+), 1 deletion(-) --- linux-2.6.22.orig/mm/readahead.c +++ linux-2.6.22/mm/readahead.c @@ -125,6 +125,7 @@ static void ra_account(struct file_ra_st struct backing_dev_info default_backing_dev_info = { .ra_pages = MAX_RA_PAGES, + .avg_ra_size = MAX_RA_PAGES * PAGE_CACHE_SIZE, .ra_pages0 = INITIAL_RA_PAGES, .ra_thrash_bytes = MAX_RA_PAGES * PAGE_CACHE_SIZE, .state = 0, @@ -133,6 +134,26 @@ struct backing_dev_info default_backing_ }; EXPORT_SYMBOL_GPL(default_backing_dev_info); +#define log2(n) fls(n) +static int readahead_cost_per_page(struct address_space *mapping, + unsigned long pages) +{ + unsigned long avg; + int cost; + + avg = mapping->backing_dev_info->avg_ra_size; + avg = (pages * PAGE_CACHE_SIZE + avg * 1023) / 1024; + mapping->backing_dev_info->avg_ra_size = avg; + + avg = avg / PAGE_CACHE_SIZE; + if (!avg || !pages) + cost = 0; + else + cost = (log2(avg) - log2(pages)) * 5 / log2(avg); + + return cost; +} + /* * Initialise a struct file's readahead state. Assumes that the caller has * memset *ra to zero. 
@@ -418,6 +439,7 @@ __do_page_cache_readahead(struct address struct page *page; unsigned long end_index; /* The last page we want to read */ LIST_HEAD(page_pool); + int cost; int page_idx; int ret = 0; loff_t isize = i_size_read(inode); @@ -426,6 +448,7 @@ __do_page_cache_readahead(struct address goto out; end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + cost = readahead_cost_per_page(mapping, nr_to_read); /* * Preallocate as many pages as we will need. @@ -448,6 +471,10 @@ __do_page_cache_readahead(struct address if (!page) break; page->index = page_offset; + if (cost >= 3) + SetPageActive(page); + else if (cost == 2) + SetPageReferenced(page); list_add(&page->lru, &page_pool); if (page_idx == nr_to_read - lookahead_size) SetPageReadahead(page); --- linux-2.6.22.orig/mm/swap.c +++ linux-2.6.22/mm/swap.c @@ -365,7 +365,10 @@ void __pagevec_lru_add(struct pagevec *p } VM_BUG_ON(PageLRU(page)); SetPageLRU(page); - add_page_to_inactive_list(zone, page); + if (!PageActive(page)) + add_page_to_inactive_list(zone, page); + else + add_page_to_active_list(zone, page); } if (zone) spin_unlock_irq(&zone->lru_lock); --- linux-2.6.22.orig/include/linux/backing-dev.h +++ linux-2.6.22/include/linux/backing-dev.h @@ -26,6 +26,7 @@ typedef int (congested_fn)(void *, int); struct backing_dev_info { unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ + unsigned long avg_ra_size; unsigned long ra_pages0; /* min readahead on start of file */ unsigned long ra_thrash_bytes; /* estimated thrashing threshold */ unsigned long state; /* Always use atomic bitops on this */ ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment 2007-07-23 14:24 ` Fengguang Wu @ 2007-07-23 19:40 ` Andrew Morton [not found] ` <20070724004728.GA8026@mail.ustc.edu.cn> 0 siblings, 1 reply; 46+ messages in thread From: Andrew Morton @ 2007-07-23 19:40 UTC (permalink / raw) To: Fengguang Wu Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra, linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe On Mon, 23 Jul 2007 22:24:57 +0800 Fengguang Wu <fengguang.wu@gmail.com> wrote: > On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote: > > Rusty Russell wrote: > > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > > > > >>So I opt for it being made tunable, safe, and turned off by default. > > > > I hate tunables :) Unless we have workload A that gets a reasonable > > benefit from something and workload B that gets a significant regression, > > and no clear way to reconcile them... > > Me too ;) > > But sometimes we really want to avoid flushing the cache. > Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here. It's the only way to go in some situations. Sometimes the kernel just cannot predict the future sufficiently well, and the costs of making a mistake are terribly high. We need human help. And it should be administration-time help, not programming-time help. > > >I'd like to see it turned on by default in -mm, and try to come up with > > >some server-like workload to measure the effect. Should be easy to > > >simulate something (eg. apache server, where clients grab some files in > > >preference, and apache server where clients grab different files). > > > > I don't like this kind of conditional information going from something > > like readahead into page reclaim. Unless it is for readahead _specific_ > > data such as "I got these all wrong, so you can reclaim them" (which > > this isn't). 
> > > > Possibly it makes sense to realise that the given pages are cheaper > > to read back in as they are apparently being read-ahead very nicely. > > In fact I have talked to Jens about it in last year's kernel summit. > The patch below explains itself. > --- > Subject: cost based page reclaim > > Cost based page reclaim - a minimalist implementation. > > Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a > large file. Should we first drop the 32-pages which are read in one I/O, or > drop the 32 distinct pages, each costs one I/O? (Given that the files are of > equal hotness.) > > Page replacement algorithms should be designed to minimize the number of I/Os, > instead of the number of page faults. Dividing the cost of I/O by the number of > pages it bring in, we get the cost of the page. The bigger page cost, the more > 'lives/bloods' the page should have. > > This patch adds life to costly pages by pretending that they are > referenced more times. Possible downsides: > - burdens the pressure of vmscan > - active pages are no longer that 'active' > This is all fun stuff, but how do we find out that changes like this are good ones, apart from shipping it and seeing who gets hurt 12 months later? > +#define log2(n) fls(n) <look at include/linux/log2.h> ^ permalink raw reply [flat|nested] 46+ messages in thread
[parent not found: <20070724004728.GA8026@mail.ustc.edu.cn>]
* Re: [PATCH 0/3] readahead drop behind and size adjustment [not found] ` <20070724004728.GA8026@mail.ustc.edu.cn> @ 2007-07-24 0:47 ` Fengguang Wu 2007-07-24 1:17 ` Andrew Morton 2007-07-24 4:30 ` Nick Piggin 0 siblings, 2 replies; 46+ messages in thread From: Fengguang Wu @ 2007-07-24 0:47 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra, linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe, linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna, David Mosberger-Tang, Kyle McMartin On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote: > On Mon, 23 Jul 2007 22:24:57 +0800 > Fengguang Wu <fengguang.wu@gmail.com> wrote: > > > On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote: > > > Rusty Russell wrote: > > > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote: > > > > > > >>So I opt for it being made tunable, safe, and turned off by default. > > > > > > I hate tunables :) Unless we have workload A that gets a reasonable > > > benefit from something and workload B that gets a significant regression, > > > and no clear way to reconcile them... > > > > Me too ;) > > > > But sometimes we really want to avoid flushing the cache. > > Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here. > > It's the only way to go in some situations. Sometimes the kernel just > cannot predict the future sufficiently well, and the costs of making a > mistake are terribly high. We need human help. And it should be > administration-time help, not programming-time help. Agreed. I feel that drop behind is not universally applicable. Cost based reclaim sounds better, but who knows before field use ;) > > > >I'd like to see it turned on by default in -mm, and try to come up with > > > >some server-like workload to measure the effect. Should be easy to > > > >simulate something (eg. apache server, where clients grab some files in > > > >preference, and apache server where clients grab different files). 
> > > > > > I don't like this kind of conditional information going from something > > > like readahead into page reclaim. Unless it is for readahead _specific_ > > > data such as "I got these all wrong, so you can reclaim them" (which > > > this isn't). > > > > > > Possibly it makes sense to realise that the given pages are cheaper > > > to read back in as they are apparently being read-ahead very nicely. > > > > In fact I have talked to Jens about it in last year's kernel summit. > > The patch below explains itself. > > --- > > Subject: cost based page reclaim > > > > Cost based page reclaim - a minimalist implementation. > > > > Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a > > large file. Should we first drop the 32-pages which are read in one I/O, or > > drop the 32 distinct pages, each costs one I/O? (Given that the files are of > > equal hotness.) > > > > Page replacement algorithms should be designed to minimize the number of I/Os, > > instead of the number of page faults. Dividing the cost of I/O by the number of > > pages it bring in, we get the cost of the page. The bigger page cost, the more > > 'lives/bloods' the page should have. > > > > This patch adds life to costly pages by pretending that they are > > referenced more times. Possible downsides: > > - burdens the pressure of vmscan > > - active pages are no longer that 'active' > > > > This is all fun stuff, but how do we find out that changes like this are > good ones, apart from shipping it and seeing who gets hurt 12 months later? One thing I can imagine now is that the first pages may get more life because of the conservative initial readahead size. Generally file servers use sendfile(wholefile), so not a problem. 
> > +#define log2(n) fls(n) > > <look at include/linux/log2.h> Thank you, this comment lead to another patch :) --- Subject: convert ill defined log2() to ilog2() It's *wrong* to have #define log2(n) ffz(~(n)) It should be *reversed*: #define log2(n) flz(~(n)) or #define log2(n) fls(n) or just use ilog2(n) defined in linux/log2.h. This patch follows the last solution, recommended by Andrew Morton. //Or are they simply the wrong naming, and is in fact correct in behavior? Cc: linux-ext4@vger.kernel.org Cc: Mingming Cao <cmm@us.ibm.com> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Chris Ahna <christopher.j.ahna@intel.com> Cc: David Mosberger-Tang <davidm@hpl.hp.com> Cc: Kyle McMartin <kyle@parisc-linux.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> --- drivers/char/agp/hp-agp.c | 9 +++------ drivers/char/agp/i460-agp.c | 5 ++--- drivers/char/agp/parisc-agp.c | 7 ++----- fs/ext4/super.c | 6 ++---- 4 files changed, 9 insertions(+), 18 deletions(-) --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c @@ -14,15 +14,12 @@ #include <linux/pci.h> #include <linux/init.h> #include <linux/agp_backend.h> +#include <linux/log2.h> #include <asm/acpi-ext.h> #include "agp.h" -#ifndef log2 -#define log2(x) ffz(~(x)) -#endif - #define HP_ZX1_IOC_OFFSET 0x1000 /* ACPI reports SBA, we want IOC */ /* HP ZX1 IOC registers */ @@ -256,7 +253,7 @@ hp_zx1_configure (void) readl(hp->ioc_regs+HP_ZX1_IMASK); writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE); readl(hp->ioc_regs+HP_ZX1_IBASE); - writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM); + writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM); readl(hp->ioc_regs+HP_ZX1_PCOM); } @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem) { struct _hp_private *hp = &hp_private; - writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM); + writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM); 
readq(hp->ioc_regs+HP_ZX1_PCOM); } --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c @@ -13,6 +13,7 @@ #include <linux/string.h> #include <linux/slab.h> #include <linux/agp_backend.h> +#include <linux/log2.h> #include "agp.h" @@ -59,8 +60,6 @@ */ #define WR_FLUSH_GATT(index) RD_GATT(index) -#define log2(x) ffz(~(x)) - static struct { void *gatt; /* ioremap'd GATT area */ @@ -148,7 +147,7 @@ static int i460_fetch_size (void) * values[i].size. */ values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12); - values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT); + values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT); } for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) { --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c @@ -18,6 +18,7 @@ #include <linux/init.h> #include <linux/klist.h> #include <linux/agp_backend.h> +#include <linux/log2.h> #include <asm-parisc/parisc-device.h> #include <asm-parisc/ropes.h> @@ -27,10 +28,6 @@ #define DRVNAME "quicksilver" #define DRVPFX DRVNAME ": " -#ifndef log2 -#define log2(x) ffz(~(x)) -#endif - #define AGP8X_MODE_BIT 3 #define AGP8X_MODE (1 << AGP8X_MODE_BIT) @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m { struct _parisc_agp_info *info = &parisc_agp_info; - writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM); + writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM); readq(info->ioc_regs+IOC_PCOM); /* flush */ } --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct sb->s_flags = s_flags; /* Restore MS_RDONLY status */ } -#define log2(n) ffz(~(n)) - /* * Maximal file size. 
There is a direct, and {,double-,triple-}indirect * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks. @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb); sbi->s_sbh = bh; sbi->s_mount_state = le16_to_cpu(es->s_state); - sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb)); - sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb)); + sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb)); + sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb)); for (i=0; i < 4; i++) sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]); sbi->s_def_hash_version = es->s_def_hash_version; ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  0:47 ` Fengguang Wu
@ 2007-07-24  1:17   ` Andrew Morton
  2007-07-24  8:50     ` Andreas Dilger
  2007-07-24  4:30   ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2007-07-24  1:17 UTC (permalink / raw)
To: Fengguang Wu
Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe,
	linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin, Dave Jones, Dave Airlie

On Tue, 24 Jul 2007 08:47:28 +0800 Fengguang Wu <fengguang.wu@gmail.com> wrote:

> Subject: convert ill defined log2() to ilog2()
>
> It's *wrong* to have
> 	#define log2(n) ffz(~(n))
> It should be *reversed*:
> 	#define log2(n) flz(~(n))
> or
> 	#define log2(n) fls(n)
> or just use
> 	ilog2(n) defined in linux/log2.h.
>
> This patch follows the last solution, recommended by Andrew Morton.
>
> //Or are they simply the wrong naming, and is in fact correct in behavior?
>
> Cc: linux-ext4@vger.kernel.org
> Cc: Mingming Cao <cmm@us.ibm.com>
> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
> Cc: Chris Ahna <christopher.j.ahna@intel.com>
> Cc: David Mosberger-Tang <davidm@hpl.hp.com>
> Cc: Kyle McMartin <kyle@parisc-linux.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  drivers/char/agp/hp-agp.c     |    9 +++------
>  drivers/char/agp/i460-agp.c   |    5 ++---
>  drivers/char/agp/parisc-agp.c |    7 ++-----
>  fs/ext4/super.c               |    6 ++----
>  4 files changed, 9 insertions(+), 18 deletions(-)

hm, yes, there is a risk that the code was accidentally correct.  Or the
code has only ever dealt with power-of-2 inputs, in which case it happens
to work either way.

David(s) and ext4-people: could we please have a close review of these
changes?

Thanks.

> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
> @@ -14,15 +14,12 @@
>  #include <linux/pci.h>
>  #include <linux/init.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
>  #include <asm/acpi-ext.h>
>
>  #include "agp.h"
>
> -#ifndef log2
> -#define log2(x)	ffz(~(x))
> -#endif
> -
>  #define HP_ZX1_IOC_OFFSET	0x1000	/* ACPI reports SBA, we want IOC */
>
>  /* HP ZX1 IOC registers */
> @@ -256,7 +253,7 @@ hp_zx1_configure (void)
>  	readl(hp->ioc_regs+HP_ZX1_IMASK);
>  	writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
>  	readl(hp->ioc_regs+HP_ZX1_IBASE);
> -	writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
> +	writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
>  	readl(hp->ioc_regs+HP_ZX1_PCOM);
>  }
>
> @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
>  {
>  	struct _hp_private *hp = &hp_private;
>
> -	writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
> +	writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
>  	readq(hp->ioc_regs+HP_ZX1_PCOM);
>  }
>
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
> @@ -13,6 +13,7 @@
>  #include <linux/string.h>
>  #include <linux/slab.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
>  #include "agp.h"
>
> @@ -59,8 +60,6 @@
>   */
>  #define WR_FLUSH_GATT(index)	RD_GATT(index)
>
> -#define log2(x)	ffz(~(x))
> -
>  static struct {
>  	void *gatt;	/* ioremap'd GATT area */
>
> @@ -148,7 +147,7 @@ static int i460_fetch_size (void)
>  	 * values[i].size.
>  	 */
>  	values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12);
> -	values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
> +	values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
>  	}
>
>  	for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
> @@ -18,6 +18,7 @@
>  #include <linux/init.h>
>  #include <linux/klist.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>
>  #include <asm-parisc/parisc-device.h>
>  #include <asm-parisc/ropes.h>
>
> @@ -27,10 +28,6 @@
>  #define DRVNAME	"quicksilver"
>  #define DRVPFX	DRVNAME ": "
>
> -#ifndef log2
> -#define log2(x)	ffz(~(x))
> -#endif
> -
>  #define AGP8X_MODE_BIT	3
>  #define AGP8X_MODE	(1 << AGP8X_MODE_BIT)
>
> @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
>  {
>  	struct _parisc_agp_info *info = &parisc_agp_info;
>
> -	writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM);
> +	writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM);
>  	readq(info->ioc_regs+IOC_PCOM);	/* flush */
>  }
>
> --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
> +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
> @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct
>  	sb->s_flags = s_flags; /* Restore MS_RDONLY status */
>  }
>
> -#define log2(n) ffz(~(n))
> -
>  /*
>   * Maximal file size.  There is a direct, and {,double-,triple-}indirect
>   * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
> @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
>  	sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
>  	sbi->s_sbh = bh;
>  	sbi->s_mount_state = le16_to_cpu(es->s_state);
> -	sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> -	sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> +	sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> +	sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
>  	for (i=0; i < 4; i++)
>  		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
>  	sbi->s_def_hash_version = es->s_def_hash_version;

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  1:17     ` Andrew Morton
@ 2007-07-24  8:50       ` Andreas Dilger
  0 siblings, 0 replies; 46+ messages in thread
From: Andreas Dilger @ 2007-07-24  8:50 UTC (permalink / raw)
To: Andrew Morton
Cc: Fengguang Wu, Nick Piggin, Rusty Russell, Dave Jones,
	Peter Zijlstra, linux-kernel, riel, Tim Pepper, Chris Snook,
	Jens Axboe, linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin, Dave Jones, Dave Airlie

On Jul 23, 2007  18:17 -0700, Andrew Morton wrote:
> hm, yes, there is a risk that the code was accidentally correct.  Or the
> code has only ever dealt with power-of-2 inputs, in which case it happens
> to work either way.
>
> David(s) and ext4-people: could we please have a close review of these
> changes?
>
> > @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
> > -	sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> > -	sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> > +	sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> > +	sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));

For the ext[234] code there has only ever been power-of-two values
for ADDR_PER_BLOCK() and DESC_PER_BLOCK().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  0:47 ` Fengguang Wu
  2007-07-24  1:17   ` Andrew Morton
@ 2007-07-24  4:30   ` Nick Piggin
  1 sibling, 0 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-24  4:30 UTC (permalink / raw)
To: Fengguang Wu
Cc: Andrew Morton, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe,
	linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin

Fengguang Wu wrote:
> On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
>>This is all fun stuff, but how do we find out that changes like this are
>>good ones, apart from shipping it and seeing who gets hurt 12 months later?
>
> One thing I can imagine now is that the first pages may get more life
> because of the conservative initial readahead size.

Yeah I don't think it is really worth the complexity and corner cases
it would introduce at this stage.

People are still complaining about their nightly cron job *swapping*
stuff out of their 1-2GB desktop systems.  There must still be some low
hanging fruit or obvious problems with page reclaim so I'd like to try
working out exactly what is going wrong and fix it in page reclaim
rather than adding more complexity in the hope that it might help.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-23  9:00 ` Nick Piggin
       [not found]   ` <20070723142457.GA10130@mail.ustc.edu.cn>
@ 2007-07-25  4:35   ` Eric St-Laurent
  2007-07-25  5:19     ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  4:35 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Mon, 2007-23-07 at 19:00 +1000, Nick Piggin wrote:

> I don't like this kind of conditional information going from something
> like readahead into page reclaim.  Unless it is for readahead _specific_
> data such as "I got these all wrong, so you can reclaim them" (which
> this isn't).
>
> But I don't like it as a use-once thing.  The VM should be able to get
> that right.

Question: How work the use-once code in the current kernel?  Is there
any?  I doesn't quite work for me...

See my previous email today, I've done a small test case to demonstrate
the problem and the effectiveness of Peter's patch.  The only piece
missing is the copy case (read once + write once).

Regardless of how it's implemented, I think a similar mechanism must be
added.  This is a long standing issue.

In the end, I think it's a pagecache resources allocation problem.  the
VM lacks fair-share limits between processes.  The kernel doesn't have
enough information to make the right decisions.

You can refine or use more advanced page reclaim, but some fair-share
splitting (like the CPU scheduler) between the processes must be
present.  Of course some process should have large or unlimited VM
limits, like databases.

Maybe the "containers" patchset and memory controller can help.  With
some specific configuration and/or a userspace daemon to adjust the
limits on the fly.

Independently, the basic large file streaming read (or copy) once cases
should not trash the pagecache.  Can we agree on that?

I say, let's add some code to fix the problem.  If we hear about any
regression in some workloads, we can add a tunable to limit or disable
its effects, _if_ a better compromised solution cannot be found.

Surely it's possible to have a acceptable solution.

Best regards,

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  4:35   ` Eric St-Laurent
@ 2007-07-25  5:19     ` Nick Piggin
  2007-07-25  6:18       ` Eric St-Laurent
  2007-07-25 15:28       ` Rik van Riel
  0 siblings, 2 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-25  5:19 UTC (permalink / raw)
To: Eric St-Laurent
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

Eric St-Laurent wrote:
> On Mon, 2007-23-07 at 19:00 +1000, Nick Piggin wrote:
>
>>I don't like this kind of conditional information going from something
>>like readahead into page reclaim.  Unless it is for readahead _specific_
>>data such as "I got these all wrong, so you can reclaim them" (which
>>this isn't).
>>
>>But I don't like it as a use-once thing.  The VM should be able to get
>>that right.
>
> Question: How work the use-once code in the current kernel?  Is there
> any?  I doesn't quite work for me...

What *I* think is supposed to happen is that newly read in pages get
put on the inactive list, and unless they get accessed again before
being reclaimed, they are allowed to fall off the end of the list
without disturbing active data too much.

I think there is a missing piece here, that we used to ease the reclaim
pressure off the active list when the inactive list grows relatively
much larger than it (which could indicate a lot of use-once pages in
the system).

Andrew got rid of that logic for some reason which I don't know, but I
can't see that use-once would be terribly effective today (so your
results don't surprise me too much).

I think I've been banned from touching vmscan.c, but if you're keen to
try a patch, I might be convinced to come out of retirement :)

> See my previous email today, I've done a small test case to demonstrate
> the problem and the effectiveness of Peter's patch.  The only piece
> missing is the copy case (read once + write once).
>
> Regardless of how it's implemented, I think a similar mechanism must be
> added.  This is a long standing issue.
>
> In the end, I think it's a pagecache resources allocation problem.  the
> VM lacks fair-share limits between processes.  The kernel doesn't have
> enough information to make the right decisions.
>
> You can refine or use more advanced page reclaim, but some fair-share
> splitting (like the CPU scheduler) between the processes must be
> present.  Of course some process should have large or unlimited VM
> limits, like databases.
>
> Maybe the "containers" patchset and memory controller can help.  With
> some specific configuration and/or a userspace daemon to adjust the
> limits on the fly.
>
> Independently, the basic large file streaming read (or copy) once cases
> should not trash the pagecache.  Can we agree on that?

One man's trash is another's treasure: some people will want the
files to remain in cache because they'll use them again (copy it
somewhere else, or start editing it after being copied or whatever).

But yeah, we can probably do better at the sequential read/write
case.

> I say, let's add some code to fix the problem.  If we hear about any
> regression in some workloads, we can add a tunable to limit or disable
> its effects, _if_ a better compromised solution cannot be found.

Sure, but let's figure out the workloads and look at all the
alternatives first.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  5:19     ` Nick Piggin
@ 2007-07-25  6:18       ` Eric St-Laurent
  2007-07-25  7:09         ` Nick Piggin
  2007-07-25 15:28       ` Rik van Riel
  1 sibling, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  6:18 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:

> What *I* think is supposed to happen is that newly read in pages get
> put on the inactive list, and unless they get accessed again before
> being reclaimed, they are allowed to fall off the end of the list
> without disturbing active data too much.
>
> I think there is a missing piece here, that we used to ease the reclaim
> pressure off the active list when the inactive list grows relatively
> much larger than it (which could indicate a lot of use-once pages in
> the system).

Maybe a new list should be added to put newly read pages in it.  If they
are not used or used once after a certain period, they can be moved to
the inactive list (or whatever).

Newly read pages...

- ... not used after this period are excessive readahead, we discard
  immediately.
- ... used only once after this period, we discard soon.
- ... used many/frequently are moved to active list.

Surely the scan rate (do I make sense?) should be different for this
newly-read list and the inactive list.

I also remember your split mapped/unmapped active list patches from a
while ago.

Can someone point me to a up-to-date documentation about the Linux VM?
The books and documents I've seen are outdated.

> I think I've been banned from touching vmscan.c, but if you're keen to
> try a patch, I might be convinced to come out of retirement :)

I'm more than willing!  Now that CFS is merged, redirect your energies
from nicksched to nick-vm ;)

Patches against any tree (stable, linus, mm, rt) are good.  But I prefer
the last stable release because it narrows down the possible problems
that a moving target like the development tree may have.

I test this on my main system, so patches with basic testing and
reasonable stability are preferred.  I just want to avoid data corruption
bugs.  FYI, I used to run the -rt tree most of the time.

> One man's trash is another's treasure: some people will want the
> files to remain in cache because they'll use them again (copy it
> somewhere else, or start editing it after being copied or whatever).
>
> But yeah, we can probably do better at the sequential read/write
> case.

Sure, but there are many hints to detect this: *large* (> most of the
RAM), *streaming*, *used once*

But if a program mmap() a 3/4 of the RAM area and "play" in it, it's a
good sign that the streaming code shouldn't be active.

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  6:18       ` Eric St-Laurent
@ 2007-07-25  7:09         ` Nick Piggin
  2007-07-25  7:48           ` Eric St-Laurent
                             ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-25  7:09 UTC (permalink / raw)
To: Eric St-Laurent
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Eric St-Laurent wrote:
> On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:
>
>>What *I* think is supposed to happen is that newly read in pages get
>>put on the inactive list, and unless they get accessed again before
>>being reclaimed, they are allowed to fall off the end of the list
>>without disturbing active data too much.
>>
>>I think there is a missing piece here, that we used to ease the reclaim
>>pressure off the active list when the inactive list grows relatively
>>much larger than it (which could indicate a lot of use-once pages in
>>the system).
>
> Maybe a new list should be added to put newly read pages in it.  If they
> are not used or used once after a certain period, they can be moved to
> the inactive list (or whatever).
>
> Newly read pages...
>
> - ... not used after this period are excessive readahead, we discard
>   immediately.
> - ... used only once after this period, we discard soon.
> - ... used many/frequently are moved to active list.
>
> Surely the scan rate (do I make sense?) should be different for this
> newly-read list and the inactive list.

A new list could be a possibility.  One problem with adding lists is just
trying to work out how to balance scanning rates between them, another
problem is CPU overhead of moving pages from one to another... but don't
let me stop you if you want to jump in and try something :)

> I also remember your split mapped/unmapped active list patches from a
> while ago.
>
> Can someone point me to a up-to-date documentation about the Linux VM?
> The books and documents I've seen are outdated.

If you just want to play with page reclaim algorithms, try reading over
mm/vmscan.c.  If you don't know much about the Linux VM internals before,
don't worry too much about the fine details and start by getting an idea
of how pages move between the active and inactive lists.

I have Mel Gorman's, but I don't recall whether it covers the fine
details of page reclaim.  But anyway it is still a good book.

>>I think I've been banned from touching vmscan.c, but if you're keen to
>>try a patch, I might be convinced to come out of retirement :)
>
> I'm more than willing!  Now that CFS is merged, redirect your energies
> from nicksched to nick-vm ;)
>
> Patches against any tree (stable, linus, mm, rt) are good.  But I prefer
> the last stable release because it narrows down the possible problems
> that a moving target like the development tree may have.
>
> I test this on my main system, so patches with basic testing and
> reasonable stability are preferred.  I just want to avoid data corruption
> bugs.  FYI, I used to run the -rt tree most of the time.

OK here is one which just changes the rate that the active and inactive
lists get scanned.  Data corruption bugs should be minimal ;)

--
SUSE Labs, Novell Inc.

[-- Attachment #2: inactive-useonce.patch --]
[-- Type: text/plain, Size: 2375 bytes --]

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -1011,6 +1011,8 @@ static unsigned long shrink_zone(int pri
 {
 	unsigned long nr_active;
 	unsigned long nr_inactive;
+	unsigned long scan_active;
+	unsigned long scan_inactive;
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;

@@ -1020,34 +1022,47 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
+	nr_active = zone_page_state(zone, NR_INACTIVE);
+	nr_inactive = zone_page_state(zone, NR_ACTIVE);
+
+	scan_inactive = (nr_inactive >> priority) + 1;
+
+	if (nr_active >= 4*(nr_inactive*2 + 1))
+		scan_active = 4*scan_inactive;
+	else {
+		unsigned long long tmp;
+
+		tmp = (unsigned long long)scan_inactive * nr_active;
+		do_div(tmp, nr_inactive*2 + 1);
+		scan_active = (unsigned long)tmp + 1;
+	}
+
+	zone->nr_scan_active += scan_active;
+	scan_active = zone->nr_scan_active;
+	if (scan_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
 	else
-		nr_active = 0;
+		scan_active = 0;

-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
+	zone->nr_scan_inactive += scan_inactive;
+	scan_inactive = zone->nr_scan_inactive;
+	if (scan_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
 	else
-		nr_inactive = 0;
+		scan_inactive = 0;

-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
+	while (scan_active || scan_inactive) {
+		if (scan_active) {
+			nr_to_scan = min(scan_active,
 				(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
+			scan_active -= nr_to_scan;
 			shrink_active_list(nr_to_scan, zone, sc, priority);
 		}

-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
+		if (scan_inactive) {
+			nr_to_scan = min(scan_inactive,
 				(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
+			scan_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
 		}

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09         ` Nick Piggin
@ 2007-07-25  7:48           ` Eric St-Laurent
  2007-07-25 15:36             ` Rik van Riel
  0 siblings, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  7:48 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:

> A new list could be a possibility.  One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another...

Disk sizes seem to increase more rapidly that the ability to read them
quickly.  Fortunately the processing power increase greatly too.

It may be a good idea to spend more CPU cycles to better decide how the
VM should juggle with this data.  We've got to keep those multi-cores cpu
busy.

> but don't
> let me stop you if you want to jump in and try something :)

Well I might try a few things along the way.  But I prefer the thorough
approach versus tinkering...

- Read all research, check competition
- Build test virtual machines, with benchmarks and typical workloads
- Add (or use) some instrumentation to the pagecache
- Code a simulator
- Try all algorithms, tune them

This is way overkill for a part-time hobby.

If we don't see much work on this area it's surely because it's really
not a problem anymore for most workloads.  Database have their own cache
management and disk scheduling, file servers just add more ram or
processors, etc.

> OK here is one which just changes the rate that the active and inactive
> lists get scanned.  Data corruption bugs should be minimal ;)

Will test.

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:48           ` Eric St-Laurent
@ 2007-07-25 15:36             ` Rik van Riel
  0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:36 UTC (permalink / raw)
To: Eric St-Laurent
Cc: Nick Piggin, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Eric St-Laurent wrote:
> On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:
>
>> A new list could be a possibility.  One problem with adding lists is just
>> trying to work out how to balance scanning rates between them, another
>> problem is CPU overhead of moving pages from one to another...
>
> Disk sizes seem to increase more rapidly that the ability to read them
> quickly.  Fortunately the processing power increase greatly too.
>
> It may be a good idea to spend more CPU cycles to better decide how the
> VM should juggle with this data.  We've got to keep those multi-cores cpu
> busy.

Agreed, up to a level.  You cannot let hundreds of gigabytes worth of
memory sit around with (stale) accessed bits set, and then have to scan
millions of pages all at once when free memory finally runs low.

The amount of CPU time spent in the pageout code at any point in time
needs to be limited.

> If we don't see much work on this area it's surely because it's really
> not a problem anymore for most workloads.  Database have their own cache
> management and disk scheduling, file servers just add more ram or
> processors, etc.

It is a really big problem for some (fairly common) workloads, though.

One of the main reasons there has been little activity in this area
upstream is that nobody (at least, nobody I talked to) had real fixes
in mind for the problem.  All we had were workarounds for RHEL, not
the kind of real fixes that are interesting for upstream.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.
Each group calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09         ` Nick Piggin
  2007-07-25  7:48           ` Eric St-Laurent
@ 2007-07-25 15:33           ` Rik van Riel
  2007-07-29  7:44           ` Eric St-Laurent
  2 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric St-Laurent, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Nick Piggin wrote:
> Eric St-Laurent wrote:
>> On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:
>>
>>> What *I* think is supposed to happen is that newly read in pages get
>>> put on the inactive list, and unless they get accessed again before
>>> being reclaimed, they are allowed to fall off the end of the list
>>> without disturbing active data too much.
>>>
>>> I think there is a missing piece here, that we used to ease the reclaim
>>> pressure off the active list when the inactive list grows relatively
>>> much larger than it (which could indicate a lot of use-once pages in
>>> the system).
>>
>> Maybe a new list should be added to put newly read pages in it.  If they
>> are not used or used once after a certain period, they can be moved to
>> the inactive list (or whatever).
>>
>> Newly read pages...
>>
>> - ... not used after this period are excessive readahead, we discard
>>   immediately.
>> - ... used only once after this period, we discard soon.
>> - ... used many/frequently are moved to active list.
>>
>> Surely the scan rate (do I make sense?) should be different for this
>> newly-read list and the inactive list.
>
> A new list could be a possibility.  One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another... but don't
> let me stop you if you want to jump in and try something :)
>
>> I also remember your split mapped/unmapped active list patches from a
>> while ago.
>>
>> Can someone point me to a up-to-date documentation about the Linux VM?
>> The books and documents I've seen are outdated.
>
> If you just want to play with page reclaim algorithms, try reading over
> mm/vmscan.c.  If you don't know much about the Linux VM internals before,
> don't worry too much about the fine details and start by getting an idea
> of how pages move between the active and inactive lists.

I've got a nice list of problems with the VM at
http://linux-mm.org/PageoutFailureModes :)

I should post my latest version of the split LRU lists code.  I'm
still working on SEQ replacement for the anonymous pages...

> OK here is one which just changes the rate that the active and inactive
> lists get scanned.  Data corruption bugs should be minimal ;)

I had something like that in mind for the file pages.  I am a bit
nervous about introducing such a change while file and anonymous
pages share the same LRU, though...

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.
Each group calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09         ` Nick Piggin
  2007-07-25  7:48           ` Eric St-Laurent
  2007-07-25 15:33           ` Rik van Riel
@ 2007-07-29  7:44           ` Eric St-Laurent
  2 siblings, 0 replies; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-29  7:44 UTC (permalink / raw)
To: Nick Piggin
Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:
> Eric St-Laurent wrote:
>
> > I test this on my main system, so patches with basic testing and
> > reasonable stability are preferred.  I just want to avoid data corruption
> > bugs.  FYI, I used to run the -rt tree most of the time.
>
> OK here is one which just changes the rate that the active and inactive
> lists get scanned.  Data corruption bugs should be minimal ;)

Nick,

I have tried your patch with my test case, unfortunately it doesn't
help.

Numbers did vary a little bit more, and it seemed drop_caches was not
working as well as usual (used between the runs).  Also, overall the
runs took about .1s more to complete.

Linux 2.6.23-rc1-nick PREEMPT x86_64

Base test:

1st run: 0m9.123s
2nd run: 0m3.565s
3rd run: 0m3.553s
4th run: 0m3.565s

Reading a large file test:

1st run: 0m9.146s
2nd run: 0m3.560s
`/tmp/large_file' -> `/dev/null'
3rd run: 0m19.759s
4th run: 0m3.515s

Copying (using cp) a large file test:

1st run: 0m9.085s
2nd run: 0m3.522s
`/tmp/large_file' -> `/tmp/large_file.copy'
3rd run: 0m9.977s
4th run: 0m3.518s

Anyway, what is the theory behind the patch?

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  5:19     ` Nick Piggin
  2007-07-25  6:18       ` Eric St-Laurent
@ 2007-07-25 15:28       ` Rik van Riel
  1 sibling, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:28 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric St-Laurent, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Nick Piggin wrote:
> What *I* think is supposed to happen is that newly read in pages get
> put on the inactive list, and unless they get accessed again before
> being reclaimed, they are allowed to fall off the end of the list
> without disturbing active data too much.
>
> I think there is a missing piece here, that we used to ease the reclaim
> pressure off the active list when the inactive list grows relatively
> much larger than it (which could indicate a lot of use-once pages in
> the system).
>
> Andrew got rid of that logic for some reason which I don't know, but I
> can't see that use-once would be terribly effective today (so your
> results don't surprise me too much).

I suspect that reason is that use-once works great for page cache
pages, but anonymous pages share the same LRUs and need a more
balanced aging approach.

The result: an LRU list that works ok for both types of pages most
of the time, but falls over in some workloads.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.
Each group calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread
* Re: [PATCH 0/3] readahead drop behind and size adjustment
@ 2007-07-22 11:11 Al Boldi
0 siblings, 0 replies; 46+ messages in thread
From: Al Boldi @ 2007-07-22 11:11 UTC (permalink / raw)
To: linux-kernel
Fengguang Wu wrote:
> On Sun, Jul 22, 2007 at 10:24:37AM +0200, Peter Zijlstra wrote:
> > On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > > - It will avoid large-file-reads-thrashing-my-desktop problem,
> > > so most desktop users should like it. But sure there will be counter
> > > cases when a user want to keep the data cached.
> > > - File servers may hurt from it. Imagine a mp3/png file server. The
> > > files are large enough to trigger drop-behind, but small (and hot)
> > > enough to be cached. Also when a new fedora DVD iso is released, it
> > > may be cached for some days. These are only the obvious cases.
> > >
> > > So I opt for it being made tunable, safe, and turned off by default.
> >
> > I'm still not convinced (Rik wasn't either last time around). When these
> > files really are hot, they will be kept in memory due to them becoming
> > Active.
> >
> > Also, by scaling up the max readahead size it takes a larger file before
> > it starts dropping. If say this server has 4G of memory (not much at all
> > for a server) resulting in a 1M readahead window, the file needs to be >
> > ~2M before it starts drop behind.
>
> [snip]
>
> > But I guess it takes someone to try this IRL before we can settle this
> > debate :-/
>
> Yeah, some real workload numbers would help.
A patch against 2.6.22 may help too.
Thanks!
--
Al
^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2007-07-29  7:44 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra
2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
2007-07-21 20:29 ` Eric St-Laurent
2007-07-21 20:37 ` Peter Zijlstra
2007-07-21 20:59 ` Eric St-Laurent
2007-07-21 21:06 ` Peter Zijlstra
2007-07-25 3:55 ` Eric St-Laurent
2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra
2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra
2007-07-22 8:24 ` Jens Axboe
2007-07-22 8:36 ` Peter Zijlstra
2007-07-22 8:50 ` Jens Axboe
2007-07-22 9:17 ` Peter Zijlstra
2007-07-22 16:44 ` Jens Axboe
2007-07-23 10:04 ` Jörn Engel
2007-07-23 10:11 ` Jens Axboe
2007-07-23 22:44 ` Rusty Russell
2007-07-22 23:52 ` Rik van Riel
2007-07-23 5:22 ` Jens Axboe
[not found] ` <20070722084526.GB6317@mail.ustc.edu.cn>
2007-07-22 8:45 ` Fengguang Wu
2007-07-22 8:59 ` Peter Zijlstra
[not found] ` <20070722095313.GA8136@mail.ustc.edu.cn>
2007-07-22 9:53 ` Fengguang Wu
[not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
2007-07-22 2:39 ` [PATCH 0/3] readahead drop behind and size adjustment Fengguang Wu
2007-07-22 2:44 ` Dave Jones
[not found] ` <20070722081010.GA6317@mail.ustc.edu.cn>
2007-07-22 8:10 ` Fengguang Wu
2007-07-22 8:24 ` Peter Zijlstra
[not found] ` <20070722082923.GA7790@mail.ustc.edu.cn>
2007-07-22 8:29 ` Fengguang Wu
2007-07-22 8:33 ` Rusty Russell
2007-07-22 8:45 ` Peter Zijlstra
2007-07-23 9:00 ` Nick Piggin
[not found] ` <20070723142457.GA10130@mail.ustc.edu.cn>
2007-07-23 14:24 ` Fengguang Wu
2007-07-23 19:40 ` Andrew Morton
[not found] ` <20070724004728.GA8026@mail.ustc.edu.cn>
2007-07-24 0:47 ` Fengguang Wu
2007-07-24 1:17 ` Andrew Morton
2007-07-24 8:50 ` Andreas Dilger
2007-07-24 4:30 ` Nick Piggin
2007-07-25 4:35 ` Eric St-Laurent
2007-07-25 5:19 ` Nick Piggin
2007-07-25 6:18 ` Eric St-Laurent
2007-07-25 7:09 ` Nick Piggin
2007-07-25 7:48 ` Eric St-Laurent
2007-07-25 15:36 ` Rik van Riel
2007-07-25 15:33 ` Rik van Riel
2007-07-29 7:44 ` Eric St-Laurent
2007-07-25 15:28 ` Rik van Riel
-- strict thread matches above, loose matches on Subject: below --
2007-07-22 11:11 Al Boldi