* [PATCH 00/19] Adaptive read-ahead V8
@ 2005-11-25 15:12 Wu Fengguang
2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
` (19 more replies)
0 siblings, 20 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton
Changelog
=========
V8 2005-11-25
- balance zone aging only in page reclaim paths, and do it right
- do the aging of slabs in the same way as zones
- add debug code to dump the detailed page reclaim steps
- undo exposing of struct radix_tree_node and uninline related functions
- work better with nfsd
- generalize accelerated context based read-ahead
- account smooth read-ahead aging based on page referenced/activate bits
- avoid divide error in compute_thrashing_threshold()
- more low-latency efforts
- update some comments
- rebase debug actions on debugfs entries instead of magic readahead_ratio values
V7 2005-11-09
- new tunable parameters: readahead_hit_rate/readahead_live_chunk
- support sparse sequential accesses
- delay look-ahead if the drive is spun down in laptop mode
- disable look-ahead for loopback file
- make mandatory thrashing protection simpler and more robust
- attempt to improve responsiveness on large read-ahead size
V6 2005-11-01
- cancel look-ahead in laptop mode
- increase read-ahead limit to 0xFFFF pages
V5 2005-10-28
- rewrite context based method to make it clean and robust
- improved accuracy of stateful thrashing threshold estimation
- make page aging equal to the number of cold pages scanned
- sort out the thrashing protection logic
- enhanced debug/accounting facilities
V4 2005-10-15
- detect and save live chunks on page reclaim
- support database workload
- support reading backward
- radix tree lookup look-aside cache
V3 2005-10-06
- major code reorganization and documentation
- stateful estimation of thrashing-threshold
- context method with accelerated grow-up phase
- adaptive look-ahead
- early detection and rescue of pages in danger
- statistics data collection
- synchronized page aging between zones
V2 2005-09-15
- delayed page activation
- look-ahead: towards pipelined read-ahead
V1 2005-09-13
Initial release which features:
o stateless (for now)
o adapts to available memory / read speed
o free of thrashing (in theory)
And handles:
o large number of slow streams (FTP server)
o open/read/close access patterns (NFS server)
o multiple interleaved, sequential streams in one file
(multithread / multimedia / database)
Overview
========
The current read-ahead logic uses an inflexible algorithm with a 128KB
VM_MAX_READAHEAD: with less memory it thrashes, and more memory brings
no extra throughput. The new logic is simply safer and faster. It makes
sure every single read-ahead request is safe for the current load.
Memory-tight systems are expected to benefit a lot: no more thrashing.
It can also help boost I/O throughput on large-memory systems, since
VM_MAX_READAHEAD now defaults to 1MB. The value is no longer tightly
coupled with, and therefore constrained by, the thrashing problem.
Thanks,
Wu Fengguang
--
Dept. Automation University of Science and Technology of China
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 01/19] mm: delayed page activation
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
` (18 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: mm-delayed-activation.patch --]
[-- Type: text/plain, Size: 4764 bytes --]
When a page is referenced the second time in inactive_list, mark it with
PG_activate instead of moving it into active_list immediately. The actual
moving work is delayed to vmscan time.
This implies two essential changes:
- keeps the adjacency of pages in the LRU;
- lifts the page reference counter max from 1 to 3.
And leads to the following improvements:
- read-ahead for a leading reader will not be disturbed by a following reader;
- enables the thrashing protection logic to save pages for following readers;
- keeping relevant pages together helps improve I/O efficiency;
- and also helps decrease VM fragmentation;
- increased refcnt space might help page replacement algorithms.
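The state machine above can be modeled in userspace. A minimal sketch,
assuming only the two flags matter (the real mark_page_accessed() also
tests PageLRU, and the actual PG_* bit numbers differ):

```c
#include <assert.h>

/* Userspace model of the delayed-activation state machine; the bit
 * values are arbitrary stand-ins, only their combination matters. */
#define REF	(1u << 0)	/* stands in for PG_referenced */
#define ACT	(1u << 1)	/* stands in for PG_activate */

/* mark_page_accessed() after the patch: advance the reference state
 * __ -> _R -> A_ -> AR without moving the page between LRU lists. */
static unsigned mark_accessed(unsigned flags)
{
	if (!(flags & ACT) && (flags & REF)) {
		flags |= ACT;		/* second reference: flag for... */
		flags &= ~REF;		/* ...later activation in vmscan */
	} else if (!(flags & REF)) {
		flags |= REF;		/* first reference */
	}
	return flags;
}

/* page_refcnt(): the two flags read together as a 0..3 counter */
static unsigned page_refcnt(unsigned flags)
{
	return flags & (ACT | REF);
}
```

Note that a page saturates at AR: further accesses change nothing, and
the actual move to the active list is left to shrink_list().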
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/page-flags.h | 31 +++++++++++++++++++++++++++++++
mm/page_alloc.c | 1 +
mm/swap.c | 9 ++++-----
mm/vmscan.c | 6 ++++++
4 files changed, 42 insertions(+), 5 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -76,6 +76,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_activate 20 /* delayed activate */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -304,6 +305,12 @@ extern void __mod_page_state(unsigned lo
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageActivate(page) test_bit(PG_activate, &(page)->flags)
+#define SetPageActivate(page) set_bit(PG_activate, &(page)->flags)
+#define ClearPageActivate(page) clear_bit(PG_activate, &(page)->flags)
+#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
+#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
@@ -329,4 +336,28 @@ static inline void set_page_writeback(st
#define ClearPageFsMisc(page) clear_bit(PG_fs_misc, &(page)->flags)
#define TestClearPageFsMisc(page) test_and_clear_bit(PG_fs_misc, &(page)->flags)
+#if PG_activate < PG_referenced
+#error unexpected page flags order
+#endif
+
+#define PAGE_REFCNT_0 0
+#define PAGE_REFCNT_1 (1 << PG_referenced)
+#define PAGE_REFCNT_2 (1 << PG_activate)
+#define PAGE_REFCNT_3 ((1 << PG_activate) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK PAGE_REFCNT_3
+
+/*
+ * STATUS REFERENCE COUNT
+ * __ 0
+ * _R PAGE_REFCNT_1
+ * A_ PAGE_REFCNT_2
+ * AR PAGE_REFCNT_3
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+ return page->flags & PAGE_REFCNT_MASK;
+}
+
#endif /* PAGE_FLAGS_H */
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -508,6 +508,7 @@ static int prep_new_page(struct page *pa
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_activate |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
--- linux-2.6.15-rc2-mm1.orig/mm/swap.c
+++ linux-2.6.15-rc2-mm1/mm/swap.c
@@ -29,7 +29,6 @@
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
-#include <linux/init.h>
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -115,13 +114,13 @@ void fastcall activate_page(struct page
* Mark a page as having seen activity.
*
* inactive,unreferenced -> inactive,referenced
- * inactive,referenced -> active,unreferenced
- * active,unreferenced -> active,referenced
+ * inactive,referenced -> activate,unreferenced
+ * activate,unreferenced -> activate,referenced
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
+ if (!PageActivate(page) && PageReferenced(page) && PageLRU(page)) {
+ SetPageActivate(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -454,6 +454,12 @@ static int shrink_list(struct list_head
if (PageWriteback(page))
goto keep_locked;
+ if (PageActivate(page)) {
+ ClearPageActivate(page);
+ ClearPageReferenced(page);
+ goto activate_locked;
+ }
+
referenced = page_referenced(page, 1, sc->priority <= 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
--
* [PATCH 02/19] vm: kswapd incmin
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
` (17 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Nick Piggin, Wu Fengguang
[-- Attachment #1: vm-kswapd-incmin.patch --]
[-- Type: text/plain, Size: 4789 bytes --]
Explicitly teach kswapd about the incremental min logic instead of just scanning
all zones below the first low zone. This should keep more even pressure
applied to the zones.
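A toy model of the resulting kswapd pass, with made-up reserve numbers
(the kernel derives the incremental min from lowmem_reserve; zone and
field names here are simplified):

```c
#include <assert.h>

#define MAX_ZONES 3

/* Toy zone: free/high watermark pages, plus the extra reserve a zone
 * must keep when serving fallback allocations for a higher zone. */
struct zone { unsigned long free, high, reserve[MAX_ZONES]; };

/* zone_watermark_ok() with the incremental min: the required minimum
 * grows with the classzone the allocation originally targeted. */
static int zone_watermark_ok(const struct zone *z, int classzone_idx)
{
	return z->free > z->high + z->reserve[classzone_idx];
}

/* The patch's highmem->dma scan: remember the highest zone that failed
 * its watermark, so lower zones are checked against that classzone. */
static int first_low_zone_after_scan(const struct zone zones[], int nr)
{
	int i, first_low = 0;

	for (i = nr - 1; i >= 0; i--)
		if (!zone_watermark_ok(&zones[i], first_low) && first_low < i)
			first_low = i;
	return first_low;
}

static int demo(void)
{
	struct zone zones[MAX_ZONES] = {
		{ 100, 50, { 0, 30, 60 } },	/* "DMA": ok, owes reserve */
		{  40, 50, { 0,  0, 20 } },	/* "Normal": below watermark */
		{ 200, 50, { 0,  0,  0 } },	/* "HighMem": plenty free */
	};
	return first_low_zone_after_scan(zones, MAX_ZONES);
}
```

In the demo, "Normal" fails its own watermark, so "DMA" is then judged
against the larger reserve it owes Normal-class allocations.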
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
This patch is taken unchanged from Nick Piggin's work.
mm/vmscan.c | 105 ++++++++++++++++++++----------------------------------------
1 files changed, 35 insertions(+), 70 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -1314,97 +1314,63 @@ loop_again:
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ int first_low_zone = 0;
all_zones_ok = 1;
+ sc.nr_scanned = 0;
+ sc.nr_reclaimed = 0;
+ sc.priority = priority;
+ sc.swap_cluster_max = nr_pages ? nr_pages : SWAP_CLUSTER_MAX;
- if (nr_pages == 0) {
- /*
- * Scan in the highmem->dma direction for the highest
- * zone which needs scanning
- */
- for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ /* Scan in the highmem->dma direction */
+ for (i = pgdat->nr_zones - 1; i >= 0; i--) {
+ struct zone *zone = pgdat->node_zones + i;
- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;
- if (zone->all_unreclaimable &&
- priority != DEF_PRIORITY)
+ if (nr_pages == 0) { /* Not software suspend */
+ if (zone_watermark_ok(zone, order,
+ zone->pages_high, first_low_zone, 0))
continue;
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, 0, 0)) {
- end_zone = i;
- goto scan;
- }
+ all_zones_ok = 0;
+ if (first_low_zone < i)
+ first_low_zone = i;
}
- goto out;
- } else {
- end_zone = pgdat->nr_zones - 1;
- }
-scan:
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- lru_pages += zone->nr_active + zone->nr_inactive;
- }
-
- /*
- * Now scan the zone in the dma->highmem direction, stopping
- * at the last zone which needs scanning.
- *
- * We do this because the page allocator works in the opposite
- * direction. This prevents the page allocator from allocating
- * pages behind kswapd's direction of progress, which would
- * cause too much scanning of the lower zones.
- */
- for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
- int nr_slab;
-
- if (!populated_zone(zone))
- continue;
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;
- if (nr_pages == 0) { /* Not software suspend */
- if (!zone_watermark_ok(zone, order,
- zone->pages_high, end_zone, 0))
- all_zones_ok = 0;
- }
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
- sc.nr_scanned = 0;
- sc.nr_reclaimed = 0;
- sc.priority = priority;
- sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
+ lru_pages += zone->nr_active + zone->nr_inactive;
+
atomic_inc(&zone->reclaim_in_progress);
shrink_zone(zone, &sc);
atomic_dec(&zone->reclaim_in_progress);
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
- total_reclaimed += sc.nr_reclaimed;
- total_scanned += sc.nr_scanned;
- if (zone->all_unreclaimable)
- continue;
- if (nr_slab == 0 && zone->pages_scanned >=
+
+ if (zone->pages_scanned >=
(zone->nr_active + zone->nr_inactive) * 4)
zone->all_unreclaimable = 1;
- /*
- * If we've done a decent amount of scanning and
- * the reclaim ratio is low, start doing writepage
- * even in laptop mode
- */
- if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
- total_scanned > total_reclaimed+total_reclaimed/2)
- sc.may_writepage = 1;
}
+ reclaim_state->reclaimed_slab = 0;
+ shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+ sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+ total_reclaimed += sc.nr_reclaimed;
+ total_scanned += sc.nr_scanned;
+
+ /*
+ * If we've done a decent amount of scanning and
+ * the reclaim ratio is low, start doing writepage
+ * even in laptop mode
+ */
+ if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+ total_scanned > total_reclaimed+total_reclaimed/2)
+ sc.may_writepage = 1;
+
if (nr_pages && to_free > total_reclaimed)
continue; /* swsusp: need to do more work */
if (all_zones_ok)
@@ -1425,7 +1391,6 @@ scan:
if ((total_reclaimed >= SWAP_CLUSTER_MAX) && (!nr_pages))
break;
}
-out:
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;
--
* [PATCH 03/19] mm: balance page aging between zones and slabs
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
` (16 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel
Cc: Andrew Morton, Marcelo Tosatti, Magnus Damm, Nick Piggin,
Andrea Arcangeli, Wu Fengguang
[-- Attachment #1: mm-balanced-aging.patch --]
[-- Type: text/plain, Size: 16543 bytes --]
The zone aging rates are currently imbalanced; the gap can be as large
as 3x, which can severely damage read-ahead requests and shorten their
effective lifetime.
The current slab shrinking code is way too fragile. It should manage aging
pace by itself, and provide a simple and robust interface.
THE METHOD
==========
This patch adds three variables in struct zone and shrinker
- aging_total
- aging_milestone
- page_age
to keep track of page aging rate, and keep it in sync on page reclaim time.
The aging_total is just a per-zone counterpart to the per-cpu
pgscan_{kswapd,direct}_{zone name}. But it is not directly comparable
between zones, so aging_milestone/page_age are maintained based on
aging_total.
The main design aspects of page balancing:
- In the kswapd reclaim path,
- reclaim also the least aged zone to help it catch up with the most
aged zone;
- the reclaim for watermark is avoided whenever possible by calling
watermark_ok() with classzone_idx = 0;
- in this way, the reclaims for aging are guaranteed to outnumber the
reclaims for watermark if there's a big imbalance, so that there
will not be any chance of a growing gap.
- In the direct reclaim path,
- reclaim only the least aged local/headless zone in the first two
rounds;
- two rounds are normally enough to get enough free pages;
- this avoids unnecessarily disturbing remote nodes.
- In shrink_cache()/shrink_zone()/shrink_list(),
- zone->nr_scan_inactive is purged, the requested zone will be
scanned at least SWAP_CLUSTER_MAX pages;
- sc->nr_scanned reflects exactly the number of cold pages scanned.
- In shrink_slab(),
- keep the age of slabs in line with that of the largest zone, using
the same syncing logic as that of the zones;
- a minimal number of unused slabs is normally kept in reserve;
- the slabs are shrunk more if VM pressure is high;
- the `mmap pages found' - `shrink more slabs' - `avoid swapping'
logic is removed: I don't think they are strongly related.
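The age bookkeeping and the wrap-around-safe comparison can be modeled
in userspace almost verbatim (struct and helper names here are
simplified stand-ins for the patch's zone/shrinker fields):

```c
#include <assert.h>

#define PAGE_AGE_SHIFT	12
#define PAGE_AGE_MASK	((1ul << PAGE_AGE_SHIFT) - 1)

/* page_age is the scan progress through one full inactive list,
 * normalized to 0..PAGE_AGE_MASK so different-sized zones compare. */
struct aged { unsigned long aging_total, aging_milestone, page_age; };

static void update_age(struct aged *a, unsigned long len,
		       unsigned long nr_scan)
{
	a->aging_total += nr_scan;
	if (a->aging_total - a->aging_milestone > len)
		a->aging_milestone += len;	/* one full sweep done */
	a->page_age = ((a->aging_total - a->aging_milestone)
			<< PAGE_AGE_SHIFT) / len;
}

/* (a->page_age > b->page_age), made safe against wrap-around: ages
 * further apart than 1/8 of the range are treated as out of sync. */
static int more_aged(const struct aged *a, const struct aged *b)
{
	return ((b->page_age - a->page_age) & PAGE_AGE_MASK) >
		PAGE_AGE_MASK - (1ul << (PAGE_AGE_SHIFT - 3));
}

static int age_gt(unsigned long a, unsigned long b)
{
	struct aged A = { 0, 0, a }, B = { 0, 0, b };
	return more_aged(&A, &B);
}

static unsigned long demo_update(void)
{
	struct aged z = { 0, 0, 0 };
	update_age(&z, 1000, 100);	/* scan 100 of 1000 cold pages */
	return z.page_age;
}
```

A zone 512 ages ahead wrapped past another is deliberately not treated
as "more aged", which keeps a freshly reset counter from being chased.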
THE RESULT
==========
- It does not touch the time-critical page allocation path, so it will
not hurt performance at all.
- Slab shrinking becomes simple, consistent and robust. The callers no
longer have to worry about the tricky aging logic.
- Experiments with 64M/512M/2G memory show that it keeps the zone ages in
sync perfectly. One can check the effects by running:
tar c / | cat > /dev/null &
watch -n1 'grep "age " /proc/zoneinfo'
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 4
include/linux/mmzone.h | 14 +++
mm/page_alloc.c | 16 ++-
mm/vmscan.c | 222 +++++++++++++++++++++++++++++++------------------
4 files changed, 173 insertions(+), 83 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.15-rc2-mm1/include/linux/mmzone.h
@@ -149,6 +149,20 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
+ /* Fields for balanced page aging:
+ * aging_total - The accumulated number of activities that may
+ * cause page aging, that is, make some pages closer
+ * to the tail of inactive_list.
+ * aging_milestone - A snapshot of total_scan every time a full
+ * inactive_list of pages become aged.
+ * page_age - A normalized value showing the percent of pages
+ * have been aged. It is compared between zones to
+ * balance the rate of page aging.
+ */
+ unsigned long aging_total;
+ unsigned long aging_milestone;
+ unsigned long page_age;
+
/*
* Does the allocator try to reclaim pages from the zone as soon
* as it fails a watermark_ok() in __alloc_pages?
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -775,7 +775,9 @@ struct shrinker {
shrinker_t shrinker;
struct list_head list;
int seeks; /* seeks to recreate an obj */
- long nr; /* objs pending delete */
+ unsigned long aging_total;
+ unsigned long aging_milestone;
+ unsigned long page_age;
struct shrinker_stats *s_stats;
};
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -1468,6 +1468,8 @@ void show_free_areas(void)
" active:%lukB"
" inactive:%lukB"
" present:%lukB"
+ " aging:%lukB"
+ " age:%lu"
" pages_scanned:%lu"
" all_unreclaimable? %s"
"\n",
@@ -1479,6 +1481,8 @@ void show_free_areas(void)
K(zone->nr_active),
K(zone->nr_inactive),
K(zone->present_pages),
+ K(zone->aging_total),
+ zone->page_age,
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
);
@@ -2089,9 +2093,11 @@ static void __init free_area_init_core(s
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_list);
zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
+ zone->aging_total = 0;
+ zone->aging_milestone = 0;
+ zone->page_age = 0;
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
@@ -2240,7 +2246,9 @@ static int zoneinfo_show(struct seq_file
"\n high %lu"
"\n active %lu"
"\n inactive %lu"
- "\n scanned %lu (a: %lu i: %lu)"
+ "\n aging %lu"
+ "\n age %lu"
+ "\n scanned %lu (a: %lu)"
"\n spanned %lu"
"\n present %lu",
zone->free_pages,
@@ -2249,8 +2257,10 @@ static int zoneinfo_show(struct seq_file
zone->pages_high,
zone->nr_active,
zone->nr_inactive,
+ zone->aging_total,
+ zone->page_age,
zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->nr_scan_active,
zone->spanned_pages,
zone->present_pages);
seq_printf(m,
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -123,6 +123,56 @@ static long total_memory;
static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);
+#ifdef CONFIG_HIGHMEM64G
+#define PAGE_AGE_SHIFT 8
+#elif BITS_PER_LONG == 32
+#define PAGE_AGE_SHIFT 12
+#elif BITS_PER_LONG == 64
+#define PAGE_AGE_SHIFT 20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define PAGE_AGE_MASK ((1 << PAGE_AGE_SHIFT) - 1)
+
+/*
+ * The simplified code is: (a->page_age > b->page_age)
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ */
+#define pages_more_aged(a, b) \
+ ((b->page_age - a->page_age) & PAGE_AGE_MASK) > \
+ PAGE_AGE_MASK - (1 << (PAGE_AGE_SHIFT - 3)) \
+
+/*
+ * Keep track of the percent of cold pages that have been scanned / aged.
+ * It's not really ##%, but a high resolution normalized value.
+ */
+static inline void update_zone_age(struct zone *z, int nr_scan)
+{
+ unsigned long len = z->nr_inactive + z->free_pages;
+
+ z->aging_total += nr_scan;
+
+ if (z->aging_total - z->aging_milestone > len)
+ z->aging_milestone += len;
+
+ z->page_age = ((z->aging_total - z->aging_milestone)
+ << PAGE_AGE_SHIFT) / len;
+}
+
+static inline void update_slab_age(struct shrinker *s,
+ unsigned long len, int nr_scan)
+{
+ s->aging_total += nr_scan;
+
+ if (s->aging_total - s->aging_milestone > len)
+ s->aging_milestone += len;
+
+ s->page_age = ((s->aging_total - s->aging_milestone)
+ << PAGE_AGE_SHIFT) / len;
+}
+
/*
* Add a shrinker callback to be called from the vm
*/
@@ -134,7 +184,9 @@ struct shrinker *set_shrinker(int seeks,
if (shrinker) {
shrinker->shrinker = theshrinker;
shrinker->seeks = seeks;
- shrinker->nr = 0;
+ shrinker->aging_total = 0;
+ shrinker->aging_milestone = 0;
+ shrinker->page_age = 0;
shrinker->s_stats = alloc_percpu(struct shrinker_stats);
if (!shrinker->s_stats) {
kfree(shrinker);
@@ -170,80 +222,61 @@ EXPORT_SYMBOL(remove_shrinker);
* percentages of the lru and ageable caches. This should balance the seeks
* generated by these structures.
*
- * If the vm encounted mapped pages on the LRU it increase the pressure on
- * slab to avoid swapping.
- *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
- *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt. It is used for balancing
- * slab reclaim versus page reclaim.
+ * If the vm pressure is high, shrink the slabs more.
*
* Returns the number of slab objects which we shrunk.
*/
-static int shrink_slab(unsigned long scanned, gfp_t gfp_mask,
- unsigned long lru_pages)
+static int shrink_slab(gfp_t gfp_mask)
{
struct shrinker *shrinker;
- int ret = 0;
-
- if (scanned == 0)
- scanned = SWAP_CLUSTER_MAX;
+ struct pglist_data *pgdat;
+ struct zone *zone;
+ int n;
if (!down_read_trylock(&shrinker_rwsem))
return 1; /* Assume we'll be able to shrink next time */
- list_for_each_entry(shrinker, &shrinker_list, list) {
- unsigned long long delta;
- unsigned long total_scan;
- unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);
-
- delta = (4 * scanned) / shrinker->seeks;
- delta *= max_pass;
- do_div(delta, lru_pages + 1);
- shrinker->nr += delta;
- if (shrinker->nr < 0) {
- printk(KERN_ERR "%s: nr=%ld\n",
- __FUNCTION__, shrinker->nr);
- shrinker->nr = max_pass;
- }
+ /* find the major zone for the slabs to catch up age with */
+ pgdat = NODE_DATA(numa_node_id());
+ zone = pgdat->node_zones;
+ for (n = 1; n < pgdat->nr_zones; n++) {
+ struct zone *z = pgdat->node_zones + n;
- /*
- * Avoid risking looping forever due to too large nr value:
- * never try to free more than twice the estimate number of
- * freeable entries.
- */
- if (shrinker->nr > max_pass * 2)
- shrinker->nr = max_pass * 2;
-
- total_scan = shrinker->nr;
- shrinker->nr = 0;
+ if (zone->present_pages < z->present_pages)
+ zone = z;
+ }
- while (total_scan >= SHRINK_BATCH) {
- long this_scan = SHRINK_BATCH;
- int shrink_ret;
+ n = 0;
+ list_for_each_entry(shrinker, &shrinker_list, list) {
+ while (pages_more_aged(zone, shrinker)) {
int nr_before;
+ int nr_after;
nr_before = (*shrinker->shrinker)(0, gfp_mask);
- shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
- if (shrink_ret == -1)
+ if (nr_before <= SHRINK_BATCH * zone->prev_priority)
+ break;
+
+ nr_after = (*shrinker->shrinker)(SHRINK_BATCH, gfp_mask);
+ if (nr_after == -1)
break;
- if (shrink_ret < nr_before) {
- ret += nr_before - shrink_ret;
- shrinker_stat_add(shrinker, nr_freed,
- (nr_before - shrink_ret));
+
+ if (nr_after < nr_before) {
+ int nr_freed = nr_before - nr_after;
+
+ n += nr_freed;
+ shrinker_stat_add(shrinker, nr_freed, nr_freed);
}
- shrinker_stat_add(shrinker, nr_req, this_scan);
- mod_page_state(slabs_scanned, this_scan);
- total_scan -= this_scan;
+ shrinker_stat_add(shrinker, nr_req, SHRINK_BATCH);
+ mod_page_state(slabs_scanned, SHRINK_BATCH);
+ update_slab_age(shrinker, nr_before * DEF_PRIORITY,
+ SHRINK_BATCH * shrinker->seeks *
+ zone->prev_priority);
cond_resched();
}
-
- shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
- return ret;
+ return n;
}
/* Called without lock on whether page is mapped, so answer is unstable */
@@ -446,11 +479,6 @@ static int shrink_list(struct list_head
BUG_ON(PageActive(page));
- sc->nr_scanned++;
- /* Double the slab pressure for mapped and swapcache pages */
- if (page_mapped(page) || PageSwapCache(page))
- sc->nr_scanned++;
-
if (PageWriteback(page))
goto keep_locked;
@@ -894,11 +922,13 @@ static void shrink_cache(struct zone *zo
&page_list, &nr_scan);
zone->nr_inactive -= nr_taken;
zone->pages_scanned += nr_scan;
+ update_zone_age(zone, nr_scan);
spin_unlock_irq(&zone->lru_lock);
if (nr_taken == 0)
goto done;
+ sc->nr_scanned += nr_scan;
max_scan -= nr_scan;
if (current_is_kswapd())
mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
@@ -1101,12 +1131,7 @@ shrink_zone(struct zone *zone, struct sc
else
nr_active = 0;
- zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
- if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
- else
- nr_inactive = 0;
+ nr_inactive = (zone->nr_inactive >> sc->priority) + 1;
sc->nr_to_reclaim = sc->swap_cluster_max;
@@ -1153,6 +1178,7 @@ static void
shrink_caches(struct zone **zones, struct scan_control *sc)
{
int i;
+ struct zone *z = NULL;
for (i = 0; zones[i] != NULL; i++) {
struct zone *zone = zones[i];
@@ -1170,8 +1196,31 @@ shrink_caches(struct zone **zones, struc
if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
+ /*
+ * Balance page aging in local zones and following headless
+ * zones in the first two rounds of direct reclaim.
+ */
+ if (sc->priority > DEF_PRIORITY - 2) {
+ if (zone->zone_pgdat != zones[0]->zone_pgdat) {
+ cpumask_t cpu = node_to_cpumask(
+ zone->zone_pgdat->node_id);
+ if (!cpus_empty(cpu))
+ break;
+ }
+
+ if (!z)
+ z = zone;
+ else if (pages_more_aged(z, zone))
+ z = zone;
+
+ continue;
+ }
+
shrink_zone(zone, sc);
}
+
+ if (z)
+ shrink_zone(z, sc);
}
/*
@@ -1194,7 +1243,6 @@ int try_to_free_pages(struct zone **zone
int total_scanned = 0, total_reclaimed = 0;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;
- unsigned long lru_pages = 0;
int i;
delay_prefetch();
@@ -1212,7 +1260,6 @@ int try_to_free_pages(struct zone **zone
continue;
zone->temp_priority = DEF_PRIORITY;
- lru_pages += zone->nr_active + zone->nr_inactive;
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1222,7 +1269,7 @@ int try_to_free_pages(struct zone **zone
sc.priority = priority;
sc.swap_cluster_max = SWAP_CLUSTER_MAX;
shrink_caches(zones, &sc);
- shrink_slab(sc.nr_scanned, gfp_mask, lru_pages);
+ shrink_slab(gfp_mask);
if (reclaim_state) {
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -1296,6 +1343,23 @@ static int balance_pgdat(pg_data_t *pgda
int total_scanned, total_reclaimed;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;
+ struct zone *youngest_zone;
+ struct zone *oldest_zone;
+
+ youngest_zone = oldest_zone = NULL;
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (zone->present_pages == 0)
+ continue;
+
+ if (!oldest_zone)
+ youngest_zone = oldest_zone = zone;
+ else if (pages_more_aged(zone, oldest_zone))
+ oldest_zone = zone;
+ else if (pages_more_aged(youngest_zone, zone))
+ youngest_zone = zone;
+ }
loop_again:
total_scanned = 0;
@@ -1314,9 +1378,6 @@ loop_again:
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- unsigned long lru_pages = 0;
- int first_low_zone = 0;
-
all_zones_ok = 1;
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
@@ -1331,13 +1392,17 @@ loop_again:
continue;
if (nr_pages == 0) { /* Not software suspend */
- if (zone_watermark_ok(zone, order,
- zone->pages_high, first_low_zone, 0))
+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high,
+ 0, 0)) {
+ all_zones_ok = 0;
+ } else if (zone == youngest_zone &&
+ pages_more_aged(oldest_zone,
+ youngest_zone)) {
+ /* if (priority > DEF_PRIORITY - 2) */
+ /* all_zones_ok = 0; */
+ } else
continue;
-
- all_zones_ok = 0;
- if (first_low_zone < i)
- first_low_zone = i;
}
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
@@ -1346,7 +1411,6 @@ loop_again:
zone->temp_priority = priority;
if (zone->prev_priority > priority)
zone->prev_priority = priority;
- lru_pages += zone->nr_active + zone->nr_inactive;
atomic_inc(&zone->reclaim_in_progress);
shrink_zone(zone, &sc);
@@ -1357,7 +1421,7 @@ loop_again:
zone->all_unreclaimable = 1;
}
reclaim_state->reclaimed_slab = 0;
- shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages);
+ shrink_slab(GFP_KERNEL);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_reclaimed += sc.nr_reclaimed;
total_scanned += sc.nr_scanned;
--
* [PATCH 04/19] mm: debug page reclaim
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (2 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
` (15 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: mm-debug-page-reclaim.patch --]
[-- Type: text/plain, Size: 8784 bytes --]
Show the detailed steps of early/direct/kswapd page reclaim.
To enable the printk traces:
# echo y > /debug/debug_page_reclaim
Sample lines:
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-31, age 1248, hot+cold+free pages 92563+25703+860
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-32, age 1253, hot+cold+free pages 92547+25688+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1258, hot+cold+free pages 92547+25656+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-29, age 1262, hot+cold+free pages 92532+25639+924
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-28, age 1269, hot+cold+free pages 92531+25611+956
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-31, age 1274, hot+cold+free pages 92535+25579+988
reclaim zone2 from kswapd for watermark, prio 11, scan-reclaimed 33-29, age 1278, hot+cold+free pages 92518+25565+1020
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-29, age 1282, hot+cold+free pages 92507+25547+1052
reclaim zone2 from kswapd for aging, prio 11, scan-reclaimed 33-27, age 1287, hot+cold+free pages 92506+25519+1084
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-29, age 1292, hot+cold+free pages 92505+25494+1116
reclaim zone2 from direct for aging, prio 11, scan-reclaimed 33-32, age 1297, hot+cold+free pages 92506+25464+1148
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1302, hot+cold+free pages 92506+25746+860
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1307, hot+cold+free pages 92488+25732+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1312, hot+cold+free pages 92488+25700+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1316, hot+cold+free pages 92481+25675+956
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1322, hot+cold+free pages 92481+25643+988
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1326, hot+cold+free pages 92474+25618+1020
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1331, hot+cold+free pages 92474+25586+1052
reclaim zone2 from kswapd for aging, prio 12, scan-reclaimed 33-32, age 1336, hot+cold+free pages 92474+25554+1084
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-32, age 1341, hot+cold+free pages 92474+25522+1116
reclaim zone2 from direct for aging, prio 12, scan-reclaimed 33-32, age 1347, hot+cold+free pages 92470+25744+892
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-32, age 1352, hot+cold+free pages 92463+25712+924
reclaim zone2 from kswapd for watermark, prio 12, scan-reclaimed 33-25, age 1357, hot+cold+free pages 92462+25681+956
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/vmscan.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 75 insertions(+), 15 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -38,6 +38,7 @@
#include <asm/div64.h>
#include <linux/swapops.h>
+#include <linux/debugfs.h>
/* possible outcome of pageout() */
typedef enum {
@@ -72,10 +73,7 @@ struct scan_control {
/* This context's GFP mask */
gfp_t gfp_mask;
- int may_writepage;
-
- /* Can pages be swapped as part of reclaim? */
- int may_swap;
+ unsigned long flags;
/* This context's SWAP_CLUSTER_MAX. If freeing memory for
* suspend, we effectively ignore SWAP_CLUSTER_MAX.
@@ -84,6 +82,59 @@ struct scan_control {
int swap_cluster_max;
};
+#define SC_RECLAIM_FROM_KSWAPD 0x01
+#define SC_RECLAIM_FROM_DIRECT 0x02
+#define SC_RECLAIM_FROM_EARLY 0x04
+#define SC_RECLAIM_FOR_WATERMARK 0x10
+#define SC_RECLAIM_FOR_AGING 0x20
+#define SC_RECLAIM_MASK 0xFF
+
+#define SC_MAY_WRITEPAGE 0x100
+#define SC_MAY_SWAP 0x200 /* Can pages be swapped as part of reclaim? */
+
+#ifdef CONFIG_DEBUG_FS
+static u32 debug_page_reclaim;
+
+static inline void debug_reclaim(struct scan_control *sc, unsigned long flags)
+{
+ sc->flags = (sc->flags & ~SC_RECLAIM_MASK) | flags;
+}
+
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+ if (!debug_page_reclaim)
+ return;
+
+ printk(KERN_DEBUG "reclaim zone%d from %s for %s, "
+ "prio %d, scan-reclaimed %lu-%lu, age %lu, "
+ "hot+cold+free pages %lu+%lu+%lu\n",
+ zone_idx(z),
+ (sc->flags & SC_RECLAIM_FROM_KSWAPD) ? "kswapd" :
+ ((sc->flags & SC_RECLAIM_FROM_DIRECT) ? "direct" :
+ "early"),
+ (sc->flags & SC_RECLAIM_FOR_AGING) ?
+ "aging" : "watermark",
+ sc->priority, sc->nr_scanned, sc->nr_reclaimed,
+ z->page_age, z->nr_active, z->nr_inactive, z->free_pages);
+}
+
+static inline void debug_reclaim_init(void)
+{
+ debugfs_create_bool("debug_page_reclaim", 0644, NULL,
+ &debug_page_reclaim);
+}
+#else
+static inline void debug_reclaim(struct scan_control *sc, int flags)
+{
+}
+static inline void debug_reclaim_report(struct scan_control *sc, struct zone *z)
+{
+}
+static inline void debug_reclaim_init(void)
+{
+}
+#endif
+
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
#ifdef ARCH_HAS_PREFETCH
@@ -499,7 +550,7 @@ static int shrink_list(struct list_head
* Try to allocate it some swap space here.
*/
if (PageAnon(page) && !PageSwapCache(page)) {
- if (!sc->may_swap)
+ if (!(sc->flags & SC_MAY_SWAP))
goto keep_locked;
if (!add_to_swap(page, GFP_ATOMIC))
goto activate_locked;
@@ -530,7 +581,7 @@ static int shrink_list(struct list_head
goto keep_locked;
if (!may_enter_fs)
goto keep_locked;
- if (laptop_mode && !sc->may_writepage)
+ if (laptop_mode && !(sc->flags & SC_MAY_WRITEPAGE))
goto keep_locked;
/* Page is dirty, try to write it out here */
@@ -939,6 +990,7 @@ static void shrink_cache(struct zone *zo
mod_page_state(kswapd_steal, nr_freed);
mod_page_state_zone(zone, pgsteal, nr_freed);
sc->nr_to_reclaim -= nr_freed;
+ debug_reclaim_report(sc, zone);
spin_lock_irq(&zone->lru_lock);
/*
@@ -1216,11 +1268,14 @@ shrink_caches(struct zone **zones, struc
continue;
}
+ debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT);
shrink_zone(zone, sc);
}
- if (z)
+ if (z) {
+ debug_reclaim(sc, SC_RECLAIM_FROM_DIRECT|SC_RECLAIM_FOR_AGING);
shrink_zone(z, sc);
+ }
}
/*
@@ -1248,8 +1303,7 @@ int try_to_free_pages(struct zone **zone
delay_prefetch();
sc.gfp_mask = gfp_mask;
- sc.may_writepage = 0;
- sc.may_swap = 1;
+ sc.flags = SC_MAY_SWAP;
inc_page_state(allocstall);
@@ -1290,7 +1344,7 @@ int try_to_free_pages(struct zone **zone
*/
if (total_scanned > sc.swap_cluster_max + sc.swap_cluster_max/2) {
wakeup_pdflush(laptop_mode ? 0 : total_scanned);
- sc.may_writepage = 1;
+ sc.flags |= SC_MAY_WRITEPAGE;
}
/* Take a nap, wait for some writeback to complete */
@@ -1365,8 +1419,7 @@ loop_again:
total_scanned = 0;
total_reclaimed = 0;
sc.gfp_mask = GFP_KERNEL;
- sc.may_writepage = 0;
- sc.may_swap = 1;
+ sc.flags = SC_MAY_SWAP;
sc.nr_mapped = read_page_state(nr_mapped);
inc_page_state(pageoutrun);
@@ -1395,10 +1448,16 @@ loop_again:
if (!zone_watermark_ok(zone, order,
zone->pages_high,
0, 0)) {
+ debug_reclaim(&sc,
+ SC_RECLAIM_FROM_KSWAPD|
+ SC_RECLAIM_FOR_WATERMARK);
all_zones_ok = 0;
} else if (zone == youngest_zone &&
pages_more_aged(oldest_zone,
youngest_zone)) {
+ debug_reclaim(&sc,
+ SC_RECLAIM_FROM_KSWAPD|
+ SC_RECLAIM_FOR_AGING);
/* if (priority > DEF_PRIORITY - 2) */
/* all_zones_ok = 0; */
} else
@@ -1433,7 +1492,7 @@ loop_again:
*/
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > total_reclaimed+total_reclaimed/2)
- sc.may_writepage = 1;
+ sc.flags |= SC_MAY_WRITEPAGE;
if (nr_pages && to_free > total_reclaimed)
continue; /* swsusp: need to do more work */
@@ -1623,6 +1682,7 @@ static int __init kswapd_init(void)
= find_task_by_pid(kernel_thread(kswapd, pgdat, CLONE_KERNEL));
total_memory = nr_free_pagecache_pages();
hotcpu_notifier(cpu_callback, 0);
+ debug_reclaim_init();
return 0;
}
@@ -1645,8 +1705,7 @@ int zone_reclaim(struct zone *zone, gfp_
return 0;
sc.gfp_mask = gfp_mask;
- sc.may_writepage = 0;
- sc.may_swap = 0;
+ sc.flags = 0;
sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
@@ -1662,6 +1721,7 @@ int zone_reclaim(struct zone *zone, gfp_
if (atomic_read(&zone->reclaim_in_progress) > 0)
goto out;
+ debug_reclaim(&sc, SC_RECLAIM_FROM_EARLY);
shrink_zone(zone, &sc);
total_reclaimed = sc.nr_reclaimed;
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 05/19] radixtree: sync with mainline
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (3 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
` (14 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Christoph Lameter, Wu Fengguang
[-- Attachment #1: radixtree-sync-with-mainline.patch --]
[-- Type: text/plain, Size: 1182 bytes --]
The patch from Christoph Lameter:
[PATCH] radix-tree: Remove unnecessary indirections and clean up code
has only been partially merged into the -mm tree. This patch completes it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
lib/radix-tree.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)
--- linux-2.6.14-mm1.orig/lib/radix-tree.c
+++ linux-2.6.14-mm1/lib/radix-tree.c
@@ -291,27 +291,25 @@ static inline void **__lookup_slot(struc
unsigned long index)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
+ struct radix_tree_node *slot;
height = root->height;
if (index > radix_tree_maxindex(height))
return NULL;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
+ slot = root->rnode;
while (height > 0) {
- if (*slot == NULL)
+ if (slot == NULL)
return NULL;
- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
+ slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
- return (void **)slot;
+ return slot;
}
/**
--
* [PATCH 06/19] radixtree: look-aside cache
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (4 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
` (13 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Nick Piggin, Wu Fengguang
[-- Attachment #1: radixtree-lookaside-cache.patch --]
[-- Type: text/plain, Size: 11999 bytes --]
This introduces a set of lookup functions to the radix tree for the read-ahead
logic. Other access patterns with high locality may also benefit from them.
- radix_tree_lookup_node(root, index, level)
Perform partial lookup, return the @level'th parent of the slot at
@index.
- radix_tree_cache_xxx()
Init/Query the cache.
- radix_tree_cache_lookup(root, cache, index)
Perform lookup with the aid of a look-aside cache.
For sequential scans, the cost is 64*O(1) + 1*O(log n) per 64 accesses: only one full tree descent per cached node.
Typical usage:
void func() {
+ struct radix_tree_cache cache;
+
+ radix_tree_cache_init(&cache);
read_lock_irq(&mapping->tree_lock);
for(;;) {
- page = radix_tree_lookup(&mapping->page_tree, index);
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache, index);
}
read_unlock_irq(&mapping->tree_lock);
}
- radix_tree_lookup_head(root, index, max_scan)
- radix_tree_lookup_tail(root, index, max_scan)
Assume [head, tail) to be a segment of contiguous pages. The two
functions search for the head and tail index of the segment containing @index.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/radix-tree.h | 80 +++++++++++++++++-
lib/radix-tree.c | 196 ++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 254 insertions(+), 22 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/radix-tree.h
+++ linux-2.6.15-rc2-mm1/include/linux/radix-tree.h
@@ -22,12 +22,24 @@
#include <linux/preempt.h>
#include <linux/types.h>
+#define RADIX_TREE_MAP_SHIFT 6
+#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
+
struct radix_tree_root {
unsigned int height;
gfp_t gfp_mask;
struct radix_tree_node *rnode;
};
+/*
+ * Support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+ unsigned long first_index;
+ struct radix_tree_node *tree_node;
+};
+
#define RADIX_TREE_INIT(mask) { \
.height = 0, \
.gfp_mask = (mask), \
@@ -45,9 +57,18 @@ do { \
} while (0)
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_node(struct radix_tree_root *, unsigned long,
+ unsigned int);
+void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+int radix_tree_cache_count(struct radix_tree_cache *cache);
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level);
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan);
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
@@ -69,4 +90,59 @@ static inline void radix_tree_preload_en
preempt_enable();
}
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+ unsigned long index)
+{
+ return radix_tree_lookup_node(root, index, 0);
+}
+
+/**
+ * radix_tree_cache_init - init the cache
+ * @cache: look-aside cache
+ *
+ * Init the @cache.
+ */
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+ cache->first_index = RADIX_TREE_MAP_MASK;
+ cache->tree_node = NULL;
+}
+
+/**
+ * radix_tree_cache_lookup - cached lookup page
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index)
+{
+ return radix_tree_cache_lookup_node(root, cache, index, 0);
+}
+
+static inline int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+ return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+ return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline int radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+ return cache->first_index;
+}
+
#endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.15-rc2-mm1.orig/lib/radix-tree.c
+++ linux-2.6.15-rc2-mm1/lib/radix-tree.c
@@ -32,16 +32,7 @@
#include <linux/bitops.h>
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT 6
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
-#endif
#define RADIX_TREE_TAGS 2
-
-#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
-
#define RADIX_TREE_TAG_LONGS \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
@@ -287,8 +278,21 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-static inline void **__lookup_slot(struct radix_tree_root *root,
- unsigned long index)
+/**
+ * radix_tree_lookup_node - low level lookup routine
+ * @root: radix tree root
+ * @index: index key
+ * @level: stop at that many levels from bottom
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ * The return value is:
+ * @level == 0: page at @index;
+ * @level == 1: the corresponding bottom level tree node;
+ * @level < height: (height - @level)th level tree node;
+ * @level >= height: root node.
+ */
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+ unsigned long index, unsigned int level)
{
unsigned int height, shift;
struct radix_tree_node *slot;
@@ -300,7 +304,7 @@ static inline void **__lookup_slot(struc
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
- while (height > 0) {
+ while (height > level) {
if (slot == NULL)
return NULL;
@@ -311,6 +315,49 @@ static inline void **__lookup_slot(struc
return slot;
}
+EXPORT_SYMBOL(radix_tree_lookup_node);
+
+/**
+ * radix_tree_cache_lookup_node - cached lookup node
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and return the node @level levels from the bottom in the search path.
+ * @cache stores the last accessed upper level tree node by this
+ * function, and is always checked first before searching in the tree.
+ * It can improve speed for access patterns with strong locality.
+ * NOTE:
+ * - The cache becomes invalid on leaving the lock;
+ * - Do not intermix calls with different @level.
+ */
+void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level)
+{
+ struct radix_tree_node *node;
+ unsigned long i;
+ unsigned long mask;
+
+ if (level >= root->height)
+ return root->rnode;
+
+ i = ((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK);
+ mask = ~((RADIX_TREE_MAP_SIZE << (level * RADIX_TREE_MAP_SHIFT)) - 1);
+
+ if ((index & mask) == cache->first_index)
+ return cache->tree_node->slots[i];
+
+ node = radix_tree_lookup_node(root, index, level + 1);
+ if (!node)
+ return 0;
+
+ cache->tree_node = node;
+ cache->first_index = (index & mask);
+ return node->slots[i];
+}
+EXPORT_SYMBOL(radix_tree_cache_lookup_node);
/**
* radix_tree_lookup_slot - lookup a slot in a radix tree
@@ -322,25 +369,134 @@ static inline void **__lookup_slot(struc
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
- return __lookup_slot(root, index);
+ struct radix_tree_node *node;
+
+ node = radix_tree_lookup_node(root, index, 1);
+ return node->slots + (index & RADIX_TREE_MAP_MASK);
}
EXPORT_SYMBOL(radix_tree_lookup_slot);
/**
- * radix_tree_lookup - perform lookup operation on a radix tree
+ * radix_tree_cache_count - items in the cached node
+ * @cache: radix tree look-aside cache
+ *
+ * Query the number of items contained in the cached node.
+ */
+int radix_tree_cache_count(struct radix_tree_cache *cache)
+{
+ if (!(cache->first_index & RADIX_TREE_MAP_MASK))
+ return cache->tree_node->count;
+ else
+ return 0;
+}
+EXPORT_SYMBOL(radix_tree_cache_count);
+
+/**
+ * radix_tree_lookup_head - lookup the head index
* @root: radix tree root
* @index: index key
+ * @max_scan: max items to scan
*
- * Lookup the item at the position @index in the radix tree @root.
+ * Lookup head index of the segment which contains @index. A segment is
+ * a set of continuous pages in a file.
+ * CASE RETURN VALUE
+ * no page at @index (not head) = @index + 1
+ * found in the range @index - @max_scan < (head index) <= @index
+ * not found in range (unfinished head) <= @index - @max_scan
*/
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan)
{
- void **slot;
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ int i;
+ unsigned long origin;
- slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ origin = index;
+ if (unlikely(max_scan > index))
+ max_scan = index;
+ radix_tree_cache_init(&cache);
+
+next_node:
+ if (origin - index > max_scan)
+ goto out;
+
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ goto out;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ if (index < RADIX_TREE_MAP_SIZE) {
+ index = -1;
+ goto out;
+ }
+ index = (index - RADIX_TREE_MAP_SIZE) | RADIX_TREE_MAP_MASK;
+ goto next_node;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+ goto next_node;
+
+out:
+ return index + 1;
+}
+EXPORT_SYMBOL(radix_tree_lookup_head);
+
+/**
+ * radix_tree_lookup_tail - lookup the tail index
+ * @root: radix tree root
+ * @index: index key
+ * @max_scan: max items to scan
+ *
+ * Lookup tail(pass the end) index of the segment which contains @index.
+ * A segment is a set of continuous pages in a file.
+ * CASE RETURN VALUE
+ * found in the range @index <= (tail index) < @index + @max_scan
+ * not found in range @index + @max_scan <= (non tail)
+ */
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan)
+{
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ int i;
+ unsigned long origin;
+
+ origin = index;
+ if (unlikely(index + max_scan < index))
+ max_scan = LONG_MAX - index;
+ radix_tree_cache_init(&cache);
+
+next_node:
+ if (index - origin >= max_scan)
+ goto out;
+
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ goto out;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index | RADIX_TREE_MAP_MASK) + 1;
+ if (unlikely(!index))
+ goto out;
+ goto next_node;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++, index++) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+ goto next_node;
+
+out:
+ return index;
}
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_lookup_tail);
/**
* radix_tree_tag_set - set a tag on a radix tree node
--
* [PATCH 07/19] readahead: some preparation
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (5 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 08/19] readahead: call scheme Wu Fengguang
` (12 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-prepare.patch --]
[-- Type: text/plain, Size: 8900 bytes --]
Some miscellaneous changes that do not fit elsewhere.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 7 +
mm/readahead.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 191 insertions(+), 9 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -986,6 +986,13 @@ void handle_ra_miss(struct address_space
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+#ifdef CONFIG_DEBUG_FS
+extern u32 readahead_debug_level;
+#define READAHEAD_DEBUG_LEVEL(n) (readahead_debug_level >= n)
+#else
+#define READAHEAD_DEBUG_LEVEL(n) (0)
+#endif
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -15,13 +15,58 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+/* The default max/min read-ahead pages. */
+#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
+#define MAX_RA_PAGES KB(VM_MAX_READAHEAD)
+#define MIN_RA_PAGES KB(VM_MIN_READAHEAD)
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
+
+#define dprintk(args...) \
+ do { if (READAHEAD_DEBUG_LEVEL(1)) printk(KERN_DEBUG args); } while(0)
+#define ddprintk(args...) \
+ do { if (READAHEAD_DEBUG_LEVEL(2)) printk(KERN_DEBUG args); } while(0)
+
+/*
+ * Debug facilities.
+ */
+#ifdef CONFIG_DEBUG_FS
+#define DEBUG_READAHEAD
+#endif
+
+#ifdef DEBUG_READAHEAD
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+
+u32 readahead_debug_level = 0;
+
+static int __init readahead_init(void)
+{
+ struct dentry *root;
+
+ root = debugfs_create_dir("readahead", NULL);
+
+ debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+
+ return 0;
+}
+
+module_init(readahead_init)
+#else /* !DEBUG_READAHEAD */
+
+#endif /* DEBUG_READAHEAD */
+
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+ .ra_pages = MAX_RA_PAGES,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
@@ -50,7 +95,7 @@ static inline unsigned long get_max_read
static inline unsigned long get_min_readahead(struct file_ra_state *ra)
{
- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ return MIN_RA_PAGES;
}
static inline void ra_off(struct file_ra_state *ra)
@@ -258,10 +303,11 @@ out:
*/
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- pgoff_t offset, unsigned long nr_to_read)
+ pgoff_t offset, unsigned long nr_to_read,
+ unsigned long lookahead_size)
{
struct inode *inode = mapping->host;
- struct page *page;
+ struct page *page = NULL;
unsigned long end_index; /* The last page we want to read */
LIST_HEAD(page_pool);
int page_idx;
@@ -271,7 +317,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
/*
* Preallocate as many pages as we will need.
@@ -284,8 +330,14 @@ __do_page_cache_readahead(struct address
break;
page = radix_tree_lookup(&mapping->page_tree, page_offset);
- if (page)
+ if (page) {
+#ifdef READAHEAD_STREAMING
+ if (prefer_adaptive_readahead() &&
+ page_idx == nr_to_read - lookahead_size)
+ SetPageReadahead(page);
+#endif
continue;
+ }
read_unlock_irq(&mapping->tree_lock);
page = page_cache_alloc_cold(mapping);
@@ -294,6 +346,9 @@ __do_page_cache_readahead(struct address
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
+ if (prefer_adaptive_readahead() &&
+ page_idx == nr_to_read - lookahead_size)
+ SetPageReadahead(page);
ret++;
}
read_unlock_irq(&mapping->tree_lock);
@@ -330,7 +385,7 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk);
+ offset, this_chunk, 0);
if (err < 0) {
ret = err;
break;
@@ -377,7 +432,7 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
@@ -397,7 +452,10 @@ blockable_page_cache_readahead(struct ad
if (!block && bdi_read_congested(mapping->backing_dev_info))
return 0;
- actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+ dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+ mapping->host->i_ino, offset, nr_to_read, actual);
return check_ra_success(ra, nr_to_read, actual);
}
@@ -575,3 +633,120 @@ unsigned long max_sane_readahead(unsigne
__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
+
+/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressures.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ * 1. state based - the default one
+ * 2. context based - the fail safe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ * a read-ahead chunk
+ * +-----------------------------------------+
+ * | # PG_readahead |
+ * +-----------------------------------------+
+ * ^ When this page is read, notify me for the next read-ahead.
+ *
+ *
+ * Here are some variable names used frequently:
+ *
+ * |<------- la_size ------>|
+ * +-----------------------------------------+
+ * | # |
+ * +-----------------------------------------+
+ * ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+/*
+ * The nature of read-ahead allows most tests to fail or even be wrong.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+ pgoff_t offset)
+{
+ struct page *page;
+
+ read_lock_irq(&mapping->tree_lock);
+ page = __find_page(mapping, offset);
+ read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (page)
+ BUG_ON(page->index != offset);
+#endif
+ return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static int rescue_pages(struct page *page, pgoff_t nr_pages)
+{
+ unsigned long pgrescue;
+ pgoff_t index;
+ struct address_space *mapping;
+ struct zone *zone;
+
+ BUG_ON(!nr_pages || !page);
+ pgrescue = 0;
+ index = page_index(page);
+ mapping = page_mapping(page);
+
+ dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+ mapping->host->i_ino, index, nr_pages);
+
+ for(;;) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ if (!PageLRU(page))
+ goto out_unlock;
+
+ while (page_mapping(page) == mapping &&
+ page_index(page) == index) {
+ struct page *the_page = page;
+ page = next_page(page);
+ if (!PageActive(the_page) &&
+ !PageActivate(the_page) &&
+ !PageLocked(the_page) &&
+ page_count(the_page) == 1) {
+ list_move(&the_page->lru, &zone->inactive_list);
+ pgrescue++;
+ }
+ index++;
+ if (!--nr_pages)
+ goto out_unlock;
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+
+ page = find_page(mapping, index);
+ if (!page)
+ goto out;
+ }
+out_unlock:
+ spin_unlock_irq(&zone->lru_lock);
+out:
+ ra_account(0, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+
+ return nr_pages ? index : 0;
+}
--
* [PATCH 08/19] readahead: call scheme
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (6 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 09/19] readahead: parameters Wu Fengguang
` (11 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-call-scheme.patch --]
[-- Type: text/plain, Size: 13823 bytes --]
A new page flag, PG_readahead, is introduced as a look-ahead mark.
The look-ahead mark corresponds to `ahead_start' of the current logic.
The read-ahead logic is called when
- a read reaches a look-ahead mark;
- a read hits a non-present page.
In addition, ra_access() is called on every page reference to maintain the
cache_hit counter.
This scheme has the following benefits:
- makes all stateful/stateless methods happy;
- eliminates the cache hit problem naturally;
- lives in harmony with application-managed read-ahead via
fadvise/madvise.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 13 +++
include/linux/page-flags.h | 5 +
mm/filemap.c | 74 ++++++++++++++++---
mm/page_alloc.c | 2
mm/readahead.c | 170 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 252 insertions(+), 12 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -77,6 +77,7 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
#define PG_activate 20 /* delayed activate */
+#define PG_readahead 21 /* check readahead when reading this page */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -311,6 +312,10 @@ extern void __mod_page_state(unsigned lo
#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -985,6 +985,13 @@ unsigned long page_cache_readahead(struc
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t first_index,
+ pgoff_t index, pgoff_t last_index);
+void fastcall ra_access(struct file_ra_state *ra, struct page *page);
#ifdef CONFIG_DEBUG_FS
extern u32 readahead_debug_level;
@@ -993,6 +1000,12 @@ extern u32 readahead_debug_level;
#define READAHEAD_DEBUG_LEVEL(n) (0)
#endif
+extern int readahead_ratio;
+static inline int prefer_adaptive_readahead(void)
+{
+ return readahead_ratio > 9;
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -508,7 +508,7 @@ static int prep_new_page(struct page *pa
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_activate |
+ 1 << PG_activate | 1 << PG_readahead |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
--- linux-2.6.15-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.15-rc2-mm1/mm/filemap.c
@@ -765,10 +765,12 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
loff_t isize;
struct page *cached_page;
+ struct page *prev_page;
int error;
struct file_ra_state ra = *_ra;
cached_page = NULL;
+ prev_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -779,6 +781,10 @@ void do_generic_mapping_read(struct addr
if (!isize)
goto out;
+ if (READAHEAD_DEBUG_LEVEL(5))
+ printk(KERN_DEBUG "read-file(ino=%lu, req=%lu+%lu)\n",
+ inode->i_ino, index, last_index - index);
+
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
for (;;) {
struct page *page;
@@ -797,16 +803,36 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- if (index == next_index)
+
+ if (!prefer_adaptive_readahead() && index == next_index)
next_index = page_cache_readahead(mapping, &ra, filp,
index, last_index - index);
find_page:
page = find_get_page(mapping, index);
+ if (prefer_adaptive_readahead()) {
+ if (unlikely(page == NULL)) {
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, NULL,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ page = find_get_page(mapping, index);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, page,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ }
+ }
if (unlikely(page == NULL)) {
- handle_ra_miss(mapping, &ra, index);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
+ ra_access(&ra, page);
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -841,7 +868,6 @@ page_ok:
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
- page_cache_release(page);
if (ret == nr && desc->count)
continue;
goto out;
@@ -853,7 +879,6 @@ page_not_up_to_date:
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
unlock_page(page);
- page_cache_release(page);
continue;
}
@@ -883,7 +908,6 @@ readpage:
* invalidate_inode_pages got it
*/
unlock_page(page);
- page_cache_release(page);
goto find_page;
}
unlock_page(page);
@@ -904,7 +928,6 @@ readpage:
isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(!isize || index > end_index)) {
- page_cache_release(page);
goto out;
}
@@ -913,7 +936,6 @@ readpage:
if (index == end_index) {
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
- page_cache_release(page);
goto out;
}
}
@@ -923,7 +945,6 @@ readpage:
readpage_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
- page_cache_release(page);
goto out;
no_cached_page:
@@ -948,15 +969,22 @@ no_cached_page:
}
page = cached_page;
cached_page = NULL;
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
goto readpage;
}
out:
*_ra = ra;
+ if (prefer_adaptive_readahead())
+ _ra->prev_page = prev_index;
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
page_cache_release(cached_page);
+ if (prev_page)
+ page_cache_release(prev_page);
if (filp)
file_accessed(filp);
}
@@ -1237,6 +1265,12 @@ struct page *filemap_nopage(struct vm_ar
pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
+ if (READAHEAD_DEBUG_LEVEL(5))
+ printk(KERN_DEBUG "read-mmap(ino=%lu, idx=%lu, hint=%s)\n",
+ inode->i_ino, pgoff,
+ VM_RandomReadHint(area) ? "random" :
+ (VM_SequentialReadHint(area) ? "sequential" : "none"));
+
retry_all:
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (pgoff >= size)
@@ -1252,19 +1286,33 @@ retry_all:
*
* For sequential accesses, we use the generic readahead logic.
*/
- if (VM_SequentialReadHint(area))
+ if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
+
/*
* Do we have something in the page cache already?
*/
retry_find:
page = find_get_page(mapping, pgoff);
+ if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+ if (!page) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, NULL,
+ pgoff, pgoff, pgoff + 1);
+ page = find_get_page(mapping, pgoff);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, page,
+ pgoff, pgoff, pgoff + 1);
+ }
+ }
if (!page) {
unsigned long ra_pages;
if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ if (!prefer_adaptive_readahead())
+ handle_ra_miss(mapping, ra, pgoff);
goto no_cached_page;
}
ra->mmap_miss++;
@@ -1301,6 +1349,8 @@ retry_find:
if (!did_readaround)
ra->mmap_hit++;
+ ra_access(ra, page);
+
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
@@ -1315,6 +1365,8 @@ success:
mark_page_accessed(page);
if (type)
*type = majmin;
+ if (prefer_adaptive_readahead())
+ ra->prev_page = page->index;
return page;
outside_data_content:
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -20,6 +20,43 @@
#define MAX_RA_PAGES KB(VM_MAX_READAHEAD)
#define MIN_RA_PAGES KB(VM_MIN_READAHEAD)
+/* Detailed classification of read-ahead behaviors. */
+#define RA_CLASS_SHIFT 3
+#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+ RA_CLASS_ALL,
+ RA_CLASS_NEWFILE,
+ RA_CLASS_STATE,
+ RA_CLASS_CONTEXT,
+ RA_CLASS_CONTEXT_ACCELERATED,
+ RA_CLASS_BACKWARD,
+ RA_CLASS_RANDOM_THRASHING,
+ RA_CLASS_RANDOM_SEEK,
+ RA_CLASS_END,
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+ RA_EVENT_CACHE_MISS, /* read cache misses */
+ RA_EVENT_READRANDOM, /* random reads */
+ RA_EVENT_IO_CONGESTION, /* io congestion */
+ RA_EVENT_IO_CACHE_HIT, /* canceled io due to cache hit */
+ RA_EVENT_IO_BLOCK, /* read on locked page */
+
+ RA_EVENT_READAHEAD, /* read-ahead issued */
+ RA_EVENT_READAHEAD_HIT, /* read-ahead page hit */
+ RA_EVENT_LOOKAHEAD, /* look-ahead issued */
+ RA_EVENT_LOOKAHEAD_HIT, /* look-ahead mark hit */
+ RA_EVENT_LOOKAHEAD_NOACTION, /* look-ahead mark ignored */
+ RA_EVENT_READAHEAD_EOF, /* read-ahead reaches EOF */
+ RA_EVENT_READAHEAD_SHRINK, /* ra_size decreased, reflects var. */
+ RA_EVENT_READAHEAD_THRASHING, /* read-ahead thrashing happened */
+ RA_EVENT_READAHEAD_MUTILATE, /* read-ahead request mutilated */
+ RA_EVENT_READAHEAD_RESCUE, /* read-ahead rescued */
+
+ RA_EVENT_END
+};
+
#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
@@ -750,3 +787,136 @@ out:
return nr_pages ? index : 0;
}
+
+/*
+ * This is the entry point of the adaptive read-ahead logic.
+ *
+ * It is only called on two conditions:
+ * 1. page == NULL
+ * A cache miss happened; it can be either a random read or a sequential one.
+ * 2. page != NULL
+ * There is a look-ahead mark (PG_readahead) from a previous sequential read.
+ * It's time to do some checking and submit the next read-ahead IO.
+ *
+ * That has the merits of:
+ * - makes all stateful/stateless methods happy;
+ * - eliminates the cache hit problem naturally;
+ * - lives in harmony with application managed read-aheads via fadvise/madvise.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ pgoff_t begin_index,
+ pgoff_t index, pgoff_t end_index)
+{
+ unsigned long size;
+ unsigned long ra_min;
+ unsigned long ra_max;
+ int ret;
+
+ if (page) {
+ if (!TestClearPageReadahead(page))
+ return 0;
+ if (bdi_read_congested(mapping->backing_dev_info))
+ return 0;
+ }
+
+ if (page)
+ ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+ ra->readahead_index - ra->lookahead_index);
+ else if (index)
+ ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+ size = end_index - index;
+ get_readahead_bounds(ra, &ra_min, &ra_max);
+
+ /* readahead disabled? */
+ if (unlikely(!ra_min || !readahead_ratio)) {
+ size = max_sane_readahead(size);
+ goto readit;
+ }
+
+ /*
+ * Start of file.
+ */
+ if (index == 0)
+ return newfile_readahead(mapping, filp, ra, end_index, ra_min);
+
+ /*
+ * State based sequential read-ahead.
+ */
+ if (!disable_stateful_method() &&
+ index == ra->lookahead_index &&
+ (page || index == ra->readahead_index) &&
+ (ra_cache_hit_ok(ra) ||
+ end_index - begin_index >= ra_max))
+ return state_based_readahead(mapping, filp, ra, page, ra_max);
+
+ /*
+ * Backward read-ahead.
+ */
+ if (try_read_backward(ra, begin_index, end_index, size, ra_min, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Context based sequential read-ahead.
+ */
+ ret = try_context_based_readahead(mapping, ra, prev_page, page,
+ index, size, ra_min, ra_max);
+ if (ret > 0)
+ return ra_dispatch(ra, mapping, filp);
+ if (ret < 0)
+ return 0;
+
+ /* No action at look-ahead time? */
+ if (page) {
+ ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+ ra->readahead_index - index);
+ return 0;
+ }
+
+ /*
+ * Random read that follows a sequential one.
+ */
+ if (try_random_readahead(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Random read.
+ */
+ if (size > ra_max)
+ size = ra_max;
+
+readit:
+ size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+ ra_account(ra, RA_EVENT_READRANDOM, size);
+ dprintk("readrandom(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+ mapping->host->i_ino, mapping->nrpages,
+ begin_index, index, end_index, size);
+
+ return size;
+}
+
+/*
+ * Called on every page reference to maintain the cache_hit counter.
+ */
+void fastcall ra_access(struct file_ra_state *ra, struct page *page)
+{
+ if (page->flags & ((1 << PG_active) |
+ (1 << PG_activate) |
+ (1 << PG_referenced)))
+ return;
+
+ if (!ra_has_index(ra, page->index))
+ return;
+
+ ra->cache_hit++;
+
+ if (page->index >= ra->ra_index)
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+ else
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
--
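
The hit accounting in ra_access() above hinges on where the accessed index falls inside the request window. A simplified userspace sketch of that classification (struct and helper are ours, modeled on ra_has_index() in the patch): indices in [la_index, readahead_index) belong to the request, and those before ra_index are credited to the previous chunk.

```c
#include <assert.h>

struct ra_window {
	unsigned long la_index;		/* start of previous chunk */
	unsigned long ra_index;		/* start of current chunk */
	unsigned long readahead_index;	/* end of current chunk */
};

/* 0: outside the request; 1: hit in current chunk; -1: hit in previous chunk */
static int ra_has_index(const struct ra_window *ra, unsigned long index)
{
	if (index < ra->la_index || index >= ra->readahead_index)
		return 0;
	return index >= ra->ra_index ? 1 : -1;
}
```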
* [PATCH 09/19] readahead: parameters
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
- new entry in /proc/sys/vm/readahead_ratio with default value of 50;
- new entry in /proc/sys/vm/readahead_hit_rate with default value of 2;
- new entry in /proc/sys/vm/readahead_live_chunk with default value of 2M;
- limit mmap read-around size to 256KB;
- dynamic minimal/initial read-ahead size.
readahead_ratio is divided into several ranges:
condition action
===================================================================
readahead_ratio == 0 disable read-ahead
readahead_ratio < 9 select old read-ahead logic
readahead_ratio >= 9 select new read-ahead logic
readahead_ratio >= 80 enable live pages protection
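
The ranges in the table can be decoded as follows (the helper name is ours for illustration; the patch itself only exposes the `readahead_ratio > 9` test as prefer_adaptive_readahead()):

```c
#include <assert.h>
#include <string.h>

/* Map a readahead_ratio value to the behavior selected by the table. */
static const char *readahead_mode(int readahead_ratio)
{
	if (readahead_ratio == 0)
		return "disabled";
	if (readahead_ratio < 9)
		return "legacy";		/* old read-ahead logic */
	if (readahead_ratio >= 80)
		return "adaptive+protect";	/* live pages protection on */
	return "adaptive";
}
```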
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Documentation/sysctl/vm.txt | 51 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 5 +++-
include/linux/sysctl.h | 3 ++
kernel/sysctl.c | 34 +++++++++++++++++++++++++++++
mm/filemap.c | 7 ++++++
mm/readahead.c | 39 +++++++++++++++++++++++++++++++++
6 files changed, 138 insertions(+), 1 deletion(-)
--- linux-2.6.15-rc2-mm1.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.15-rc2-mm1/Documentation/sysctl/vm.txt
@@ -27,6 +27,9 @@ Currently, these files are in /proc/sys/
- laptop_mode
- block_dump
- swap_prefetch
+- readahead_ratio
+- readahead_hit_rate
+- readahead_live_chunk
==============================================================
@@ -114,3 +117,51 @@ except when laptop_mode is enabled and t
Setting it to 0 disables prefetching entirely.
The default value is dependant on ramsize.
+
+==============================================================
+
+readahead_ratio
+
+This limits read-ahead size to a percentage of the thrashing-threshold.
+The thrashing-threshold is dynamically estimated according to the
+_historical_ read speed and system load, and is used to limit the
+_future_ read-ahead request size.
+
+Set it to a low value if you do not have enough memory to counteract
+I/O load fluctuations. If there is plenty of memory, a larger value
+may help increase read speed. Also note that a value >= 80 activates
+mandatory thrashing protection (see readahead_live_chunk).
+
+The default value is 50.
+
+==============================================================
+
+readahead_hit_rate
+
+This is the max allowed value of the (read-ahead pages : accessed pages)
+ratio. If the previous read-ahead request had a bad hit rate, the
+kernel will be very conservative about issuing the next read-ahead.
+
+A large value helps speed up some sparse access patterns, at the cost
+of more memory consumption. It is recommended to keep the value below
+(max-readahead-pages / 8).
+
+The default value is 2.
+
+==============================================================
+
+readahead_live_chunk
+
+In a file server, there are typically one or more sequential
+readers working on each file. The kernel can detect most live
+chunks (a sequence of pages about to be accessed by an active
+reader) and save them for their imminent readers. This is called
+mandatory thrashing protection, and is only in effect when
+(readahead_ratio >= 80).
+
+This parameter controls the max allowed chunk size, i.e. the max
+number of pages pinned for an active reader.
+
+The default value is 2MB worth of pages, i.e. 512 pages on most
+architectures. Increase it if you have enough memory.
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -968,11 +968,14 @@ extern int filemap_populate(struct vm_ar
int write_one_page(struct page *page, int wait);
/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MAX_READAHEAD 1024 /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
#define VM_MAX_CACHE_HIT 256 /* max pages in a row in cache before
* turning readahead off */
+/* turn on read-ahead thrashing protection if (readahead_ratio >= ##) */
+#define VM_READAHEAD_PROTECT_RATIO 80
+
int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
--- linux-2.6.15-rc2-mm1.orig/include/linux/sysctl.h
+++ linux-2.6.15-rc2-mm1/include/linux/sysctl.h
@@ -182,6 +182,9 @@ enum
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
VM_SWAP_PREFETCH=29, /* int: amount to swap prefetch */
+ VM_READAHEAD_RATIO=30, /* percent of read-ahead size to thrashing-threshold */
+ VM_READAHEAD_HIT_RATE=31, /* one accessed page legitimizes so many read-ahead pages */
+ VM_READAHEAD_LIVE_CHUNK=32, /* pin no more than that many pages for a live reader */
};
--- linux-2.6.15-rc2-mm1.orig/kernel/sysctl.c
+++ linux-2.6.15-rc2-mm1/kernel/sysctl.c
@@ -68,6 +68,9 @@ extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
extern int pid_max_min, pid_max_max;
+extern int readahead_ratio;
+extern int readahead_hit_rate;
+extern int readahead_live_chunk;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -668,6 +671,7 @@ static ctl_table kern_table[] = {
/* Constants for minimum and maximum testing in vm_table.
We use these as one-element integer vectors. */
static int zero;
+static int one = 1;
static int one_hundred = 100;
@@ -867,6 +871,36 @@ static ctl_table vm_table[] = {
},
#endif
#endif
+ {
+ .ctl_name = VM_READAHEAD_RATIO,
+ .procname = "readahead_ratio",
+ .data = &readahead_ratio,
+ .maxlen = sizeof(readahead_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
+ {
+ .ctl_name = VM_READAHEAD_HIT_RATE,
+ .procname = "readahead_hit_rate",
+ .data = &readahead_hit_rate,
+ .maxlen = sizeof(readahead_hit_rate),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &one,
+ },
+ {
+ .ctl_name = VM_READAHEAD_LIVE_CHUNK,
+ .procname = "readahead_live_chunk",
+ .data = &readahead_live_chunk,
+ .maxlen = sizeof(readahead_live_chunk),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};
--- linux-2.6.15-rc2-mm1.orig/mm/filemap.c
+++ linux-2.6.15-rc2-mm1/mm/filemap.c
@@ -1336,6 +1336,13 @@ retry_find:
if (ra_pages) {
pgoff_t start = 0;
+ /*
+ * Max read-around should be much smaller than
+ * max read-ahead.
+ * How about adding a tunable parameter for this?
+ */
+ if (ra_pages > 64)
+ ra_pages = 64;
if (pgoff > ra_pages / 2)
start = pgoff - ra_pages / 2;
do_page_cache_readahead(mapping, file, start, ra_pages);
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -20,6 +20,24 @@
#define MAX_RA_PAGES KB(VM_MAX_READAHEAD)
#define MIN_RA_PAGES KB(VM_MIN_READAHEAD)
+/* In laptop mode, poll delayed look-ahead on every ## pages read. */
+#define LAPTOP_POLL_INTERVAL 16
+
+/* Set look-ahead size to 1/# of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 50;
+EXPORT_SYMBOL(readahead_ratio);
+
+/* Readahead as long as cache hit ratio keeps above 1/##. */
+int readahead_hit_rate = 2;
+EXPORT_SYMBOL(readahead_hit_rate);
+
+/* Scan backward ## pages to find a live reader. */
+int readahead_live_chunk = 2 * MAX_RA_PAGES;
+EXPORT_SYMBOL(readahead_live_chunk);
+
/* Detailed classification of read-ahead behaviors. */
#define RA_CLASS_SHIFT 3
#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
@@ -789,6 +807,27 @@ out:
}
/*
+ * ra_size is mainly determined by:
+ * 1. sequential-start: min(MIN_RA_PAGES + (pages>>14), KB(128))
+ * 2. sequential-max: min(ra->ra_pages, 0xFFFF)
+ * 3. sequential: (thrashing-threshold) * readahead_ratio / 100
+ *
+ * Table of concrete numbers for 4KB page size:
+ * (inactive + free) (in MB): 4 8 16 32 64 128 256 512 1024
+ * initial ra_size (in KB): 16 16 16 16 20 24 32 48 64
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+ unsigned long *ra_min,
+ unsigned long *ra_max)
+{
+ unsigned long pages;
+
+ pages = node_free_and_cold_pages();
+ *ra_max = min(min(pages/2, 0xFFFFUL), ra->ra_pages);
+ *ra_min = min(min(MIN_RA_PAGES + (pages>>14), KB(128)), *ra_max/2);
+}
+
+/*
* This is the entry point of the adaptive read-ahead logic.
*
* It is only called on two conditions:
--
* [PATCH 10/19] readahead: state based method
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
This is the fast code path.
Major steps:
- estimate a thrashing safe ra_size;
- assemble the next read-ahead request in file_ra_state;
- submit it.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
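
The widened 64-bit cache_hit field in this patch packs four 16-bit slots: slot 0 is the live counter and slots 1-3 hold accumulated history. A userspace model of the two helpers (mirroring ra_cache_hit() and ra_addup_cache_hit() below, but written as pure functions for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Extract 16-bit slot nr (0 = live counter, 1-3 = history). */
static int ra_cache_hit(uint64_t cache_hit, int nr)
{
	return (cache_hit >> (nr * 16)) & 0xFFFF;
}

/* Fold the live counter (slot 0) into slot 1 and zero slot 0. */
static uint64_t ra_addup_cache_hit(uint64_t cache_hit)
{
	int n = ra_cache_hit(cache_hit, 0);

	cache_hit -= n;			/* clear slot 0 */
	cache_hit += (uint64_t)n << 16;	/* accumulate into slot 1 */
	return cache_hit;
}
```

Starting a new sequence then shifts everything up one slot (`cache_hit <<= 16`), as ra_state_init() does in the patch.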
include/linux/fs.h | 8 +
include/linux/mm.h | 6
mm/memory.c | 1
mm/readahead.c | 355 +++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap.c | 2
mm/vmscan.c | 3
6 files changed, 374 insertions(+), 1 deletion(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -604,13 +604,19 @@ struct file_ra_state {
unsigned long start; /* Current window */
unsigned long size;
unsigned long flags; /* ra flags RA_FLAG_xxx*/
- unsigned long cache_hit; /* cache hit count*/
+ uint64_t cache_hit; /* cache hit count*/
unsigned long prev_page; /* Cache last read() position */
unsigned long ahead_start; /* Ahead window */
unsigned long ahead_size;
unsigned long ra_pages; /* Maximum readahead window */
unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+
+ unsigned long age;
+ pgoff_t la_index;
+ pgoff_t ra_index;
+ pgoff_t lookahead_index;
+ pgoff_t readahead_index;
};
#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -1009,6 +1009,12 @@ static inline int prefer_adaptive_readah
return readahead_ratio > 9;
}
+DECLARE_PER_CPU(unsigned long, readahead_aging);
+static inline void inc_readahead_aging(void)
+{
+ __get_cpu_var(readahead_aging)++;
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.15-rc2-mm1.orig/mm/memory.c
+++ linux-2.6.15-rc2-mm1/mm/memory.c
@@ -1894,6 +1894,7 @@ static int do_anonymous_page(struct mm_s
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte_none(*page_table))
goto release;
+ inc_readahead_aging();
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(page);
SetPageReferenced(page);
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -539,6 +539,9 @@ static int shrink_list(struct list_head
goto activate_locked;
}
+ if (!PageReferenced(page))
+ inc_readahead_aging();
+
referenced = page_referenced(page, 1, sc->priority <= 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
--- linux-2.6.15-rc2-mm1.orig/mm/swap.c
+++ linux-2.6.15-rc2-mm1/mm/swap.c
@@ -124,6 +124,8 @@ void fastcall mark_page_accessed(struct
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
+ if (PageLRU(page) && !PageActivate(page))
+ inc_readahead_aging();
}
}
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -38,6 +38,12 @@ EXPORT_SYMBOL(readahead_hit_rate);
int readahead_live_chunk = 2 * MAX_RA_PAGES;
EXPORT_SYMBOL(readahead_live_chunk);
+/* Analog to zone->aging_total, but mainly increased on fresh page
+ * references, so it is much smoother.
+ */
+DEFINE_PER_CPU(unsigned long, readahead_aging);
+EXPORT_PER_CPU_SYMBOL(readahead_aging);
+
/* Detailed classification of read-ahead behaviors. */
#define RA_CLASS_SHIFT 3
#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
@@ -97,6 +103,7 @@ enum ra_event {
#include <linux/init.h>
u32 readahead_debug_level = 0;
+u32 debug_disable_stateful_method = 0;
static int __init readahead_init(void)
{
@@ -105,13 +112,26 @@ static int __init readahead_init(void)
root = debugfs_create_dir("readahead", NULL);
debugfs_create_u32("debug_level", 0644, root, &readahead_debug_level);
+ debugfs_create_bool("disable_stateful_method", 0644, root,
+ &debug_disable_stateful_method);
return 0;
}
module_init(readahead_init)
+
+static inline int disable_stateful_method(void)
+{
+ return debug_disable_stateful_method;
+}
+
#else /* !DEBUG_READAHEAD */
+static inline int disable_stateful_method(void)
+{
+ return 0;
+}
+
#endif /* DEBUG_READAHEAD */
@@ -807,6 +827,341 @@ out:
}
/*
+ * State based calculation of read-ahead request.
+ *
+ * This figure shows the meaning of file_ra_state members:
+ *
+ * chunk A chunk B
+ * +---------------------------+-------------------------------------------+
+ * | # | # |
+ * +---------------------------+-------------------------------------------+
+ * ^ ^ ^ ^
+ * la_index ra_index lookahead_index readahead_index
+ */
+
+/*
+ * The global effective length of the inactive_list(s).
+ */
+static unsigned long node_free_and_cold_pages(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].nr_inactive +
+ zones[i].free_pages - zones[i].pages_low;
+
+ return sum;
+}
+
+/*
+ * A smoother and more timely analog to zone->aging_total.
+ */
+static unsigned long node_readahead_aging(void)
+{
+ unsigned long cpu;
+ unsigned long sum = 0;
+ cpumask_t mask = node_to_cpumask(numa_node_id());
+
+ for_each_cpu_mask(cpu, mask)
+ sum += per_cpu(readahead_aging, cpu);
+
+ return sum;
+}
+
+/*
+ * Set class of read-ahead
+ */
+static inline void set_ra_class(struct file_ra_state *ra,
+ enum ra_class ra_class)
+{
+ unsigned long FLAGS_MASK;
+ unsigned long flags;
+ unsigned long old_ra_class;
+
+ FLAGS_MASK = ~(RA_CLASS_MASK | (RA_CLASS_MASK << RA_CLASS_SHIFT));
+ flags = ra->flags & FLAGS_MASK;
+
+ old_ra_class = (ra->flags & RA_CLASS_MASK) << RA_CLASS_SHIFT;
+
+ ra->flags = flags | old_ra_class | ra_class;
+}
+
+/*
+ * The 64bit cache_hit stores three accumulated value and one counter value.
+ * MSB LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static inline int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+ return (ra->cache_hit >> (nr * 16)) & 0xFFFF;
+}
+
+/*
+ * Conceptual code:
+ * ra_cache_hit(ra, 1) += ra_cache_hit(ra, 0);
+ * ra_cache_hit(ra, 0) = 0;
+ */
+static inline void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+ int n;
+
+ n = ra_cache_hit(ra, 0);
+ ra->cache_hit -= n;
+ n <<= 16;
+ ra->cache_hit += n;
+}
+
+/*
+ * The read-ahead is deemed successful if cache-hit-rate >= 1/readahead_hit_rate.
+ */
+static inline int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+ return ra_cache_hit(ra, 0) * readahead_hit_rate >=
+ (ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the @ra request.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
+{
+ if (index < ra->la_index || index >= ra->readahead_index)
+ return 0;
+
+ if (index >= ra->ra_index)
+ return 1;
+ else
+ return -1;
+}
+
+/*
+ * Prepare file_ra_state for a new read-ahead sequence.
+ */
+static inline void ra_state_init(struct file_ra_state *ra,
+ pgoff_t la_index, pgoff_t ra_index)
+{
+ ra_addup_cache_hit(ra);
+ ra->cache_hit <<= 16;
+ ra->lookahead_index = la_index;
+ ra->readahead_index = ra_index;
+}
+
+/*
+ * Take down a new read-ahead request in file_ra_state.
+ */
+static inline void ra_state_update(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+#ifdef DEBUG_READAHEAD
+ unsigned long old_ra = ra->readahead_index - ra->ra_index;
+ if (ra_size < old_ra && ra_cache_hit(ra, 0))
+ ra_account(ra, RA_EVENT_READAHEAD_SHRINK, old_ra - ra_size);
+#endif
+ ra_addup_cache_hit(ra);
+ ra->ra_index = ra->readahead_index;
+ ra->la_index = ra->lookahead_index;
+ ra->readahead_index += ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+ ra->age = node_readahead_aging();
+}
+
+/*
+ * Adjust the read-ahead request in file_ra_state.
+ */
+static inline void ra_state_adjust(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+ ra->readahead_index = ra->ra_index + ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+ struct address_space *mapping, struct file *filp)
+{
+ pgoff_t eof_index;
+ unsigned long ra_size;
+ unsigned long la_size;
+ int actual;
+ enum ra_class ra_class;
+
+ ra_class = (ra->flags & RA_CLASS_MASK);
+ BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
+
+ eof_index = ((i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT) + 1;
+ ra_size = ra->readahead_index - ra->ra_index;
+ la_size = ra->readahead_index - ra->lookahead_index;
+
+ /* Snap to EOF. */
+ if (unlikely(ra->ra_index >= eof_index))
+ return 0;
+ if (ra->readahead_index + ra_size / 2 > eof_index) {
+ if (ra_class == RA_CLASS_CONTEXT_ACCELERATED &&
+ eof_index > ra->lookahead_index + 1)
+ la_size = eof_index - ra->lookahead_index;
+ else
+ la_size = 0;
+ ra_size = eof_index - ra->ra_index;
+ ra_state_adjust(ra, ra_size, la_size);
+ }
+
+ actual = __do_page_cache_readahead(mapping, filp,
+ ra->ra_index, ra_size, la_size);
+
+#ifdef READAHEAD_STREAMING
+ if (actual < ra_size) {
+ struct page *page = find_page(mapping, ra->ra_index + actual);
+ if (page)
+ rescue_pages(page, ra_size);
+ }
+#endif
+
+ if (ra->readahead_index == eof_index)
+ ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+ if (la_size)
+ ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+ dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+ ra_class_name[ra_class],
+ mapping->host->i_ino, ra->la_index,
+ ra->ra_index, ra_size, la_size, actual);
+
+ return actual;
+}
+
+/*
+ * Determine the request parameters from primitive values.
+ *
+ * It applies the following rules:
+ * - Subtract the old look-ahead from ra_size to get the really safe read-ahead;
+ * - Set new la_size according to the (still large) ra_size;
+ * - Apply upper limits;
+ * - Make sure stream_shift is not too small.
+ * (So that the next global_shift will not be too small.)
+ *
+ * Input:
+ * ra_size stores the estimated thrashing-threshold.
+ * la_size stores the look-ahead size of previous request.
+ */
+static inline int adjust_rala(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ unsigned long stream_shift = *la_size;
+
+ if (*ra_size > *la_size)
+ *ra_size -= *la_size;
+ else
+ return 0;
+
+ *la_size = *ra_size / LOOKAHEAD_RATIO;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ stream_shift += (*ra_size - *la_size);
+ if (stream_shift < *ra_size / 4)
+ *la_size -= (*ra_size / 4 - stream_shift);
+
+ return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ * It is returned to make the next read-ahead request.
+ * 2. the remaining space for the current chunk
+ * It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, but will
+ * fluctuate more under light load (small global_shift), so the growth
+ * speed of ra_size must be limited, and a moderately large stream_shift
+ * must be ensured.
+ *
+ * This figure illustrates the formula used in the function:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ * chunk A chunk B
+ * |<=============== global_shift ================|
+ * +-------------+ +-------------------+ |
+ * | # | | # | inactive_list |
+ * +-------------+ +-------------------+ head |
+ * |---->| |---------->|
+ * | |
+ * +-- stream_shift --+
+ */
+static inline unsigned long compute_thrashing_threshold(
+ struct file_ra_state *ra,
+ unsigned long *remain)
+{
+ unsigned long global_size;
+ unsigned long global_shift;
+ unsigned long stream_shift;
+ unsigned long ra_size;
+
+ global_size = node_free_and_cold_pages();
+ global_shift = node_readahead_aging() - ra->age + 1;
+ stream_shift = ra_cache_hit(ra, 0);
+
+ ra_size = stream_shift *
+ global_size * readahead_ratio / (100 * global_shift);
+
+ if (global_size > global_shift)
+ *remain = stream_shift *
+ (global_size - global_shift) / global_shift;
+ else
+ *remain = 0;
+
+ ddprintk("compute_thrashing_threshold: "
+ "ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
+ ra_size, stream_shift, global_size, global_shift,
+ *remain, ra->readahead_index - ra->lookahead_index);
+
+ return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static inline unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+ struct file_ra_state *ra, struct page *page,
+ unsigned long ra_max)
+{
+ unsigned long ra_old;
+ unsigned long ra_size;
+ unsigned long la_size;
+ unsigned long remain_space;
+ unsigned long growth_limit;
+
+ la_size = ra->readahead_index - ra->lookahead_index;
+ ra_old = ra->readahead_index - ra->ra_index;
+ ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+ if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+ remain_space <= la_size && la_size > 1) {
+ rescue_pages(page, la_size);
+ return 0;
+ }
+
+ growth_limit = 2 * ra_old + (ra_max - ra_old) / 16;
+ if (!adjust_rala(min(ra_max, growth_limit), &ra_size, &la_size))
+ return 0;
+
+ set_ra_class(ra, RA_CLASS_STATE);
+ ra_state_update(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+
+/*
* ra_size is mainly determined by:
* 1. sequential-start: min(MIN_RA_PAGES + (pages>>14), KB(128))
* 2. sequential-max: min(ra->ra_pages, 0xFFFF)
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 11/19] readahead: context based method
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (9 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
` (8 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 10430 bytes --]
This is the slow code path.
No valid state info is available, so the page cache is queried to obtain the
required position/timing information.
Major steps:
- look back/forward to find the ra_index;
- look back to estimate a thrashing safe ra_size;
- assemble the next read-ahead request in file_ra_state;
- submit it.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 345 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 345 insertions(+)
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -1160,6 +1160,351 @@ state_based_readahead(struct address_spa
return ra_dispatch(ra, mapping, filp);
}
+/*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks around to find the start point of next read-ahead,
+ * and then, if necessary, looks backward in the inactive_list to get an
+ * estimation of the thrashing-threshold.
+ *
+ * The estimation theory can be illustrated with the following figure:
+ *
+ * chunk A chunk B chunk C head
+ *
+ * l01 l11 l12 l21 l22
+ *| |-->|-->| |------>|-->| |------>|
+ *| +-------+ +-----------+ +-------------+ |
+ *| | # | | # | | # | |
+ *| +-------+ +-----------+ +-------------+ |
+ *| |<==============|<===========================|<============================|
+ * L0 L1 L2
+ *
+ * Let f(l) = L be a map from
+ * l: the number of pages read by the stream
+ * to
+ * L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * f(l01) <= L0
+ * f(l11 + l12) = L1
+ * f(l21 + l22) = L2
+ * ...
+ * f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ * <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower estimation of the true thrashing-threshold.
+ */
+
+/*
+ * STATUS REFERENCE COUNT TYPE
+ * A__ 0 not in inactive list
+ * ___ 0 fresh
+ * __R PAGE_REFCNT_1 stale
+ * _a_ PAGE_REFCNT_2 disturbed once
+ * _aR PAGE_REFCNT_3 disturbed twice
+ *
+ * A/a/R: Active / aCTIVATE / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+ if (!page || PageActive(page))
+ return 0;
+
+ return page_refcnt(page);
+}
+
+static inline char page_refcnt_symbol(struct page *page)
+{
+ if (!page)
+ return 'X';
+ if (PageActive(page))
+ return 'A';
+ switch (page_refcnt(page)) {
+ case 0:
+ return '_';
+ case PAGE_REFCNT_1:
+ return '-';
+ case PAGE_REFCNT_2:
+ return '=';
+ case PAGE_REFCNT_3:
+ return '#';
+ }
+ return '?';
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+ pgoff_t first_index, pgoff_t last_index)
+{
+ struct page *page;
+ int size = last_index - first_index + 1;
+ int count = 0;
+ int i;
+
+ read_lock_irq(&mapping->tree_lock);
+
+ /*
+ * The first page may well be the chunk head and have been accessed,
+ * so it is index 0 that makes the estimation optimistic. This
+ * behavior guarantees a readahead when (size < ra_max) and
+ * (readahead_hit_rate >= 16).
+ */
+ for (i = 0; i < 16;) {
+ page = __find_page(mapping, first_index +
+ size * ((i++ * 29) & 15) / 16);
+ if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+ break;
+ }
+
+ read_unlock_irq(&mapping->tree_lock);
+
+ return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static int query_page_cache(struct address_space *mapping,
+ struct file_ra_state *ra,
+ unsigned long *remain, pgoff_t offset,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ int count;
+ pgoff_t index;
+ unsigned long nr_lookback;
+ struct radix_tree_cache cache;
+
+ /*
+ * Scan backward and check the near @ra_max pages.
+ * The count here determines ra_size.
+ */
+ read_lock_irq(&mapping->tree_lock);
+ index = radix_tree_lookup_head(&mapping->page_tree, offset, ra_max);
+ read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (index <= offset) {
+ WARN_ON(!find_page(mapping, index));
+ if (index + ra_max > offset)
+ WARN_ON(find_page(mapping, index - 1));
+ } else {
+ BUG_ON(index > offset + 1);
+ WARN_ON(find_page(mapping, offset));
+ }
+#endif
+
+ *remain = offset - index + 1;
+
+ if (unlikely(*remain <= ra_min))
+ return ra_min;
+
+ if (!index)
+ return *remain;
+
+ if (offset + 1 == ra->readahead_index && ra_cache_hit_ok(ra))
+ count = *remain;
+ else if (count_cache_hit(mapping, index, offset) *
+ readahead_hit_rate >= *remain)
+ count = *remain;
+ else
+ return ra_min;
+
+ if (count < ra_max)
+ goto out;
+
+ /*
+ * Check the far pages coarsely.
+ * The big count here helps increase la_size.
+ */
+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+ 100 / (readahead_ratio + 1);
+ if (nr_lookback > offset)
+ nr_lookback = offset;
+
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (count += ra_max; count < nr_lookback; count += ra_max) {
+ struct radix_tree_node *node;
+ node = radix_tree_cache_lookup_node(&mapping->page_tree,
+ &cache, offset - count, 1);
+ if (!node)
+ break;
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (node != radix_tree_lookup_node(&mapping->page_tree,
+ offset - count, 1)) {
+ read_unlock_irq(&mapping->tree_lock);
+ printk(KERN_ERR "check radix_tree_cache_lookup_node!\n");
+ return 1;
+ }
+#endif
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ /*
+ * For a sequential read that extends from index 0, the counted value
+ * may well be far under the true threshold, so return it unmodified
+ * for further processing in adjust_rala_accelerated().
+ */
+ if (count >= offset)
+ return offset + 1;
+
+out:
+ count = count * readahead_ratio / 100;
+ return count;
+}
+
+/*
+ * Scan backward in the file for the first non-present page.
+ */
+static inline pgoff_t first_absent_page_bw(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct page *page;
+ pgoff_t origin;
+
+ origin = index;
+ if (max_scan > index)
+ max_scan = index;
+
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (; origin - index <= max_scan;) {
+ page = radix_tree_cache_lookup(&mapping->page_tree,
+ &cache, --index);
+ if (page) {
+ index++;
+ break;
+ }
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ return index;
+}
+
+/*
+ * Scan forward in the file for the first non-present page.
+ */
+static inline pgoff_t first_absent_page(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ pgoff_t ra_index;
+
+ read_lock_irq(&mapping->tree_lock);
+ ra_index = radix_tree_lookup_tail(&mapping->page_tree,
+ index + 1, max_scan);
+ read_unlock_irq(&mapping->tree_lock);
+
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ BUG_ON(ra_index <= index);
+ if (index + max_scan > index) {
+ if (ra_index <= index + max_scan)
+ WARN_ON(find_page(mapping, ra_index));
+ WARN_ON(!find_page(mapping, ra_index - 1));
+ }
+#endif
+
+ if (ra_index <= index + max_scan)
+ return ra_index;
+ else
+ return 0;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of the stateless method is perhaps the slow ramp-up
+ * speed of ra_size. The logic tries to make up for this in the important case
+ * of sequential reads that extend from the start of file. In this case,
+ * ra_size is not chosen to make the whole next chunk safe (as in normal
+ * cases); only half of it is. The added 'unsafe' half is the look-ahead part,
+ * which is expected to be safeguarded by rescue_pages() when the previous
+ * chunks are lost.
+ */
+static inline int adjust_rala_accelerated(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ pgoff_t index = *ra_size;
+
+ *ra_size -= min(*ra_size, *la_size);
+ *ra_size = *ra_size * readahead_ratio / 100;
+ *la_size = index * readahead_ratio / 100;
+ *ra_size += *la_size;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ */
+static inline int
+try_context_based_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct page *prev_page, struct page *page,
+ pgoff_t index, unsigned long ra_size,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ pgoff_t ra_index;
+ unsigned long la_size;
+ unsigned long remain_pages;
+
+ /* Where to start read-ahead?
+ * NFSv3 daemons may process adjacent requests in parallel,
+ * leading to many locally disordered, globally sequential reads.
+ * So do not require nearby history pages to be present or accessed.
+ */
+ if (page) {
+ ra_index = first_absent_page(mapping, index, ra_max * 5 / 4);
+ if (unlikely(!ra_index))
+ return -1;
+ } else if (!prev_page) {
+ ra_index = first_absent_page_bw(mapping, index,
+ readahead_hit_rate + ra_min);
+ if (index - ra_index > readahead_hit_rate + ra_min)
+ return 0;
+ ra_min += 2 * (index - ra_index);
+ index = ra_index;
+ } else {
+ ra_index = index;
+ if (ra_has_index(ra, index))
+ ra_account(ra, RA_EVENT_READAHEAD_MUTILATE,
+ ra->readahead_index - index);
+ }
+
+ ra_size = query_page_cache(mapping, ra, &remain_pages,
+ index - 1, ra_min, ra_max);
+
+ la_size = ra_index - index;
+ if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+ remain_pages <= la_size && la_size > 1) {
+ rescue_pages(page, la_size);
+ return -1;
+ }
+
+ if (ra_size == index) {
+ if (!adjust_rala_accelerated(ra_max, &ra_size, &la_size))
+ return -1;
+ set_ra_class(ra, RA_CLASS_CONTEXT_ACCELERATED);
+ } else {
+ if (!adjust_rala(ra_max, &ra_size, &la_size))
+ return -1;
+ set_ra_class(ra, RA_CLASS_CONTEXT);
+ }
+
+ ra_state_init(ra, index, ra_index);
+ ra_state_update(ra, ra_size, la_size);
+
+ return 1;
+}
+
/*
* ra_size is mainly determined by:
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 12/19] readahead: other methods
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (10 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 11/19] readahead: context " Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
` (7 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-method-others.patch --]
[-- Type: text/plain, Size: 3633 bytes --]
Various read-ahead strategies for:
- fresh read from start of file
- backward prefetching
- seek-and-read-one-record pattern (db workload)
- quick recovery from thrashing
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 111 insertions(+)
--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -1505,6 +1505,117 @@ try_context_based_readahead(struct addre
return 1;
}
+/*
+ * Read-ahead on start of file.
+ *
+ * It is most important for small files.
+ * 1. Set a moderately large read-ahead size;
+ * 2. Issue the next read-ahead request as soon as possible.
+ *
+ * But be careful, there are some applications that dip into only the very head
+ * of a file. The most important thing is to prevent them from triggering the
+ * next (much larger) read-ahead request, which leads to lots of cache misses.
+ * Two pages should be enough for them, correct me if I'm wrong.
+ */
+static inline unsigned long
+newfile_readahead(struct address_space *mapping,
+ struct file *filp, struct file_ra_state *ra,
+ unsigned long req_size, unsigned long ra_min)
+{
+ unsigned long ra_size;
+ unsigned long la_size;
+
+ if (req_size > ra_min)
+ req_size = ra_min;
+
+ ra_size = 4 * req_size;
+ la_size = 2 * req_size;
+
+ set_ra_class(ra, RA_CLASS_NEWFILE);
+ ra_state_init(ra, 0, 0);
+ ra_state_update(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * Backward prefetching.
+ * No look-ahead or thrashing-threshold estimation for the backward stepping
+ * pattern: it should be unnecessary.
+ */
+static inline int
+try_read_backward(struct file_ra_state *ra,
+ pgoff_t begin_index, pgoff_t end_index,
+ unsigned long ra_size,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ if (ra_size > ra_max || end_index > ra->prev_page)
+ return 0;
+
+ if (ra_has_index(ra, ra->prev_page)) {
+ if (end_index > ra->la_index)
+ return 0;
+ ra_size += 2 * ra_cache_hit(ra, 0);
+ end_index = ra->la_index;
+ } else {
+ ra_size += readahead_hit_rate + ra_min;
+ end_index = ra->prev_page;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ if (end_index > begin_index + ra_size)
+ return 0;
+
+ begin_index = end_index - ra_size;
+
+ set_ra_class(ra, RA_CLASS_BACKWARD);
+ ra_state_init(ra, begin_index, begin_index);
+ ra_state_update(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ * Databases are known to have this seek-and-read-one-record pattern.
+ */
+static inline int
+try_random_readahead(struct file_ra_state *ra, pgoff_t index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ unsigned long hit0 = ra_cache_hit(ra, 0);
+ unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+ unsigned long hit2 = ra_cache_hit(ra, 2);
+ unsigned long hit3 = ra_cache_hit(ra, 3);
+
+ if (!ra_has_index(ra, ra->prev_page))
+ return 0;
+
+ if (index == ra->prev_page + 1) { /* read after thrashing */
+ ra_size = hit0;
+ set_ra_class(ra, RA_CLASS_RANDOM_THRASHING);
+ ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+ ra->readahead_index - index);
+ } else if (ra_size < hit1 && /* read after seeking */
+ hit1 > hit2 / 2 &&
+ hit2 > hit3 / 2 &&
+ hit3 > hit1 / 2) {
+ ra_size = max(hit1, hit2);
+ set_ra_class(ra, RA_CLASS_RANDOM_SEEK);
+ } else
+ return 0;
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ ra_state_init(ra, index, index);
+ ra_state_update(ra, ra_size, 0);
+
+ return 1;
+}
/*
* ra_size is mainly determined by:
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 13/19] readahead: detect and rescue live pages
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (11 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
` (6 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-protect-live-pages.patch --]
[-- Type: text/plain, Size: 13422 bytes --]
It tries to identify and protect live pages (sequential pages that are going
to be accessed in the near future) at vmscan time. Almost all live pages can
be sorted out and saved.
The cost: dead pages that won't be read will be kept in inactive_list for
another round. Hopefully there won't be many, for file servers normally have
a read-ahead hit rate as high as 99%.
This feature is greatly demanded by file servers, though it should be useless
for desktops. It covers the case where one fast reader flushes the cache and
strips down the read-ahead size of slow readers, and the case where caching is
just a waste of memory. There you can raise readahead_ratio to 200 or more to
enforce a larger read-ahead size and strip the useless cached data out of the lru.
Its use is currently limited, for the context based method is not ready for
this case. That problem will be solved when the timing info of evicted pages
is available. It can also be extended to further benefit large memory
systems. I'll leave these as future work.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 2
include/linux/page-flags.h | 2
mm/page_alloc.c | 2
mm/readahead.c | 318 ++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 10 +
5 files changed, 333 insertions(+), 1 deletion(-)
--- linux-2.6.15-rc2-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.15-rc2-mm1/include/linux/page-flags.h
@@ -107,6 +107,8 @@ struct page_state {
unsigned long pgfree; /* page freeings */
unsigned long pgactivate; /* pages moved inactive->active */
unsigned long pgdeactivate; /* pages moved active->inactive */
+ unsigned long pgkeephot; /* pages sent back to active */
+ unsigned long pgkeepcold; /* pages sent back to inactive */
unsigned long pgfault; /* faults (major+minor) */
unsigned long pgmajfault; /* faults (major only) */
--- linux-2.6.15-rc2-mm1.orig/include/linux/mm.h
+++ linux-2.6.15-rc2-mm1/include/linux/mm.h
@@ -995,6 +995,8 @@ page_cache_readahead_adaptive(struct add
pgoff_t first_index,
pgoff_t index, pgoff_t last_index);
void fastcall ra_access(struct file_ra_state *ra, struct page *page);
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list);
+
#ifdef CONFIG_DEBUG_FS
extern u32 readahead_debug_level;
--- linux-2.6.15-rc2-mm1.orig/mm/page_alloc.c
+++ linux-2.6.15-rc2-mm1/mm/page_alloc.c
@@ -2351,6 +2351,8 @@ static char *vmstat_text[] = {
"pgfree",
"pgactivate",
"pgdeactivate",
+ "pgkeephot",
+ "pgkeepcold",
"pgfault",
"pgmajfault",
--- linux-2.6.15-rc2-mm1.orig/mm/vmscan.c
+++ linux-2.6.15-rc2-mm1/mm/vmscan.c
@@ -501,6 +501,8 @@ cannot_free:
return 0;
}
+extern int readahead_live_chunk;
+
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
@@ -509,10 +511,15 @@ static int shrink_list(struct list_head
LIST_HEAD(ret_pages);
struct pagevec freed_pvec;
int pgactivate = 0;
+ int pgkeep = 0;
int reclaimed = 0;
cond_resched();
+ if (readahead_ratio >= VM_READAHEAD_PROTECT_RATIO &&
+ readahead_live_chunk)
+ pgkeep += rescue_ra_pages(page_list, &ret_pages);
+
pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
struct address_space *mapping;
@@ -661,11 +668,13 @@ keep_locked:
keep:
list_add(&page->lru, &ret_pages);
BUG_ON(PageLRU(page));
+ pgkeep++;
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
__pagevec_release_nonlru(&freed_pvec);
mod_page_state(pgactivate, pgactivate);
+ mod_page_state(pgkeepcold, pgkeep - pgactivate);
sc->nr_reclaimed += reclaimed;
return reclaimed;
}
@@ -1162,6 +1171,7 @@ refill_inactive_zone(struct zone *zone,
mod_page_state_zone(zone, pgrefill, pgscanned);
mod_page_state(pgdeactivate, pgdeactivate);
+ mod_page_state(pgkeephot, pgmoved);
}
/*
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -400,7 +400,7 @@ __do_page_cache_readahead(struct address
read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
pgoff_t page_offset = offset + page_idx;
-
+
if (page_offset > end_index)
break;
@@ -1770,3 +1770,319 @@ void fastcall ra_access(struct file_ra_s
ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
}
+/*
+ * Detect and protect live read-ahead pages.
+ *
+ * This function provides a safety guarantee for file servers with a big
+ * readahead_ratio (>= VM_READAHEAD_PROTECT_RATIO) set. The goal is to save all
+ * and only the sequential pages that are to be accessed in the near future.
+ *
+ * This function is called when pages in @page_list are to be freed,
+ * it protects live read-ahead pages by moving them into @save_list.
+ *
+ * The general idea is to classify pages of a file into groups of sequential
+ * accessed pages. Dead sequential pages are left over, live sequential pages
+ * are saved.
+ *
+ * Live read-ahead pages are defined as sequential pages that have reading in
+ * progress. They are detected by reference count pattern of:
+ *
+ * live head live pages
+ * ra pages group --> ------------___________________
+ * [ pages to save ] (*)
+ *
+ * (*) for now, an extra page from the live head may also be saved.
+ *
+ * In practice, the group of pages is fragmented into chunks. To tell whether
+ * pages inside a chunk are alive, we must check:
+ * 1) Are there any live heads inside the chunk?
+ * 2) Are there any live heads in the group before the chunk?
+ * 3) Special case: does a live head sit right on the boundary of the current chunk?
+ *
+ * The detailed rules employed must ensure:
+ * - no page is pinned in inactive_list.
+ * - no excessive pages are saved.
+ *
+ * A picture of common cases:
+ * back search chunk case
+ * -----___________|[____________________] Normal
+ * ----------------|----[________________] Normal
+ * |----[________________] Normal
+ * ----------------|---------------------- Normal
+ * |---------------------- Normal
+ * ________________|______________________ ra miss
+ * |______________________ ra miss
+ * ________________|_______--------[_____] two readers
+ * ----____________|[______--------______] two readers
+ * |_______--------[_____] two readers
+ * |----[____------______] two readers
+ * ----------------|----[____------______] two readers
+ * _______---------|---------------[_____] two readers
+ * ----___---------|[--------------______] two readers
+ * ________________|---------------[_____] two readers
+ * ----____________|[--------------______] two readers
+ * ====------------|[---_________________] two readers
+ * |====[----------______] two readers
+ * |###======[-----------] three readers
+ */
+static int save_chunk(struct page *head, struct page *live_head,
+ struct page *tail, struct list_head *save_list)
+{
+ struct page *page;
+ struct address_space *mapping;
+ struct radix_tree_cache cache;
+ int i;
+ pgoff_t index;
+ pgoff_t head_index;
+ unsigned long refcnt;
+
+#ifdef DEBUG_READAHEAD
+ static char static_buf[PAGE_SIZE];
+ static char *zone_names[] = {"DMA", "DMA32", "Normal", "HighMem"};
+ char *pat = static_buf;
+ int pidx = 0;
+#define log_symbol(symbol) \
+ do { \
+ if (READAHEAD_DEBUG_LEVEL(3) && pidx < PAGE_SIZE - 1) \
+ pat[pidx++] = symbol; \
+ } while (0)
+
+ if (READAHEAD_DEBUG_LEVEL(3)) {
+ pat = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!pat)
+ pat = static_buf;
+ }
+#else
+#define log_symbol(symbol) do {} while (0)
+#endif
+#define log_page(page) log_symbol(page_refcnt_symbol(page))
+
+ head_index = head->index;
+ mapping = head->mapping;
+ radix_tree_cache_init(&cache);
+
+ BUG_ON(!mapping); /* QUESTION: in what case mapping will be NULL ? */
+ read_lock_irq(&mapping->tree_lock);
+
+ /*
+ * Common case test.
+ * Does the far end indicates a leading live head?
+ */
+ index = radix_tree_lookup_head(&mapping->page_tree,
+ head_index, readahead_live_chunk);
+ if (index >= head_index)
+ goto skip_scan_locked;
+
+ page = __find_page(mapping, index);
+ BUG_ON(!page);
+ log_symbol('|');
+ log_page(page);
+ refcnt = cold_page_refcnt(page);
+ if (head_index - index < readahead_live_chunk &&
+ refcnt > page_refcnt(head)) {
+ live_head = head;
+ goto skip_scan_locked;
+ }
+
+ /*
+ * The slow path.
+ * Scan page by page to see if the whole chunk should be saved.
+ */
+ if (next_page(head) != tail)
+ head_index = next_page(head)->index;
+ else
+ head_index++;
+ for (i = 0, index++; index <= head_index; index++) {
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+ index);
+ if (index == head->index)
+ log_symbol('|');
+ log_page(page);
+
+ if (!page) {
+ WARN_ON(index < head->index);
+ break;
+ }
+
+ if (refcnt == page_refcnt(page))
+ i++;
+ else if (refcnt < page_refcnt(page))
+ i = 0;
+ else if (i < 1)
+ i = INT_MIN;
+ else {
+ live_head = head;
+ break;
+ }
+
+ refcnt = page_refcnt(page);
+ }
+
+skip_scan_locked:
+#ifdef DEBUG_READAHEAD
+ if (index < head->index)
+ log_symbol('*');
+ index = prev_page(tail)->index;
+
+ log_symbol('|');
+ for (page = head; page != tail; page = next_page(page)) {
+ BUG_ON(PageAnon(page));
+ BUG_ON(PageSwapCache(page));
+ /* BUG_ON(page_mapped(page)); */
+
+ if (page == live_head)
+ log_symbol('[');
+ log_page(page);
+ }
+ if (live_head)
+ log_symbol(']');
+#endif
+
+ /*
+ * Special case work around.
+ *
+ * Save one extra page if it is a live head of the following chunk.
+ * Just to be safe. It protects the rare situation when the reader
+ * is just crossing the chunk boundary, and the following chunk is not
+ * far away from tail of inactive_list.
+ *
+ * The special case is awkwardly dealt with for now. It will be properly
+ * handled when the timing information of recently evicted pages is available.
+ * Dead pages can also be purged earlier with the timing info.
+ */
+ if (live_head != head) {
+ struct page *last_page = prev_page(tail);
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+ last_page->index + 1);
+ log_symbol('|');
+ log_page(page);
+ if (page && !live_head) {
+ refcnt = page_refcnt(last_page);
+ if (page_refcnt(page) >= refcnt)
+ page = radix_tree_cache_lookup(
+ &mapping->page_tree, &cache,
+ last_page->index + 2);
+ log_page(page);
+ if (page && page_refcnt(page) < refcnt) {
+ live_head = last_page;
+ log_symbol('I');
+ }
+ } else if (!page && live_head) {
+ live_head = next_page(live_head);
+ log_symbol('D');
+ }
+ }
+ log_symbol('\0');
+
+ read_unlock_irq(&mapping->tree_lock);
+
+ /*
+ * Now save the alive pages.
+ */
+ i = 0;
+ if (live_head) {
+ for (; live_head != tail;) { /* never dereference tail! */
+ page = next_page(live_head);
+ if (!PageActivate(live_head)) {
+ list_move(&live_head->lru, save_list);
+ i++;
+ if (!page_refcnt(live_head))
+ inc_readahead_aging();
+ }
+ live_head = page;
+ }
+
+ if (i)
+ ra_account(0, RA_EVENT_READAHEAD_RESCUE, i);
+ }
+
+#ifdef DEBUG_READAHEAD
+ if (READAHEAD_DEBUG_LEVEL(3)) {
+ ddprintk("save_chunk(ino=%lu, idx=%lu-%lu, %s@%s:%s)"
+ " = %d\n",
+ mapping->host->i_ino,
+ head->index, index,
+ mapping_mapped(mapping) ? "mmap" : "file",
+ zone_names[page_zonenum(head)], pat, i);
+ if (pat != static_buf)
+ free_page((unsigned long)pat);
+ }
+#endif
+
+ return i;
+}
+
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list)
+{
+ struct address_space *mapping;
+ struct page *chunk_head;
+ struct page *live_head;
+ struct page *page;
+ unsigned long refcnt;
+ unsigned long min_refcnt;
+ int n;
+ int ret = 0;
+
+ page = list_to_page(page_list);
+
+next_chunk:
+ chunk_head = page;
+ live_head = NULL;
+ mapping = page->mapping;
+ n = 0;
+ min_refcnt = LONG_MAX;
+
+next_head_page:
+ refcnt = page_refcnt(page);
+ if (min_refcnt > refcnt)
+ min_refcnt = refcnt;
+ page = next_page(page);
+
+ if (mapping != page->mapping || &page->lru == page_list)
+ goto save_chunk;
+
+ /* At least 2 pages followed by a fall in refcnt makes a live head:
+ * --_
+ * ^ live_head
+ */
+ if (refcnt == page_refcnt(page))
+ n++;
+ else if (refcnt < page_refcnt(page))
+ n = 0;
+ else if (n < 1)
+ n = INT_MIN;
+ else
+ goto got_live_head;
+
+ goto next_head_page;
+
+got_live_head:
+ n = 0;
+ live_head = prev_page(page);
+
+next_page:
+ if (refcnt < page_refcnt(page)) /* limit the number of rises */
+ n++;
+ refcnt = page_refcnt(page);
+ if (min_refcnt > refcnt)
+ min_refcnt = refcnt;
+ page = next_page(page);
+
+ if (mapping != page->mapping || &page->lru == page_list)
+ goto save_chunk;
+
+ goto next_page;
+
+save_chunk:
+ if (mapping && !PageAnon(chunk_head) &&
+ !PageSwapCache(chunk_head) &&
+ /* !page_mapped(chunk_head) && */
+ min_refcnt < PAGE_REFCNT_2 &&
+ n <= 3)
+ ret += save_chunk(chunk_head, live_head, page, save_list);
+
+ if (&page->lru != page_list)
+ goto next_chunk;
+
+ return ret;
+}
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 14/19] readahead: events accounting
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (12 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
` (5 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Jörn Engel, Ingo Oeser
[-- Attachment #1: readahead-account-events.patch --]
[-- Type: text/plain, Size: 10352 bytes --]
A debugfs file named `readahead/events' is created according to advice from
Jörn Engel, Andrew Morton and Ingo Oeser. It yields much better
readability than the previous /proc/vmstat interface :)
It reveals various read-ahead activities/events, and is vital to the testing.
-------------------------------------------------------------------------
If you are experiencing performance problems, or want to help improve the
read-ahead logic, please send me the debug data. Thanks.
- Preparations
## Compile with CONFIG_DEBUG_FS('Kernel hacking ---> Debug Filesystem')
mkdir /debug
mount -t debug none /debug
- For each session with distinct access pattern
echo > /debug/readahead/events # reset the counters
# echo > /var/log/kern.log # you may want to backup it first
# echo 5 > /debug/readahead/debug_level # show verbose printk traces
## do one benchmark/task
# echo 0 > /debug/readahead/debug_level # revert to normal value
cp /debug/readahead/events readahead-events-`date +'%F_%R'`
# bzip2 -c /var/log/kern.log > kern.log-`date +'%F_%R'`.bz2
The commented out commands can uncover the very detailed file accesses,
which are useful sometimes. But the log file will be rather huge.
-------------------------------------------------------------------------
This is a trimmed down output on my PC:
# cat /debug/readahead/events
[table requests] total newfile state context none
cache_miss 403 56 12 69 263
read_random 260 37 5 17 201
io_congestion 0 0 0 0 0
io_cache_hit 85 0 24 46 0
io_block 9796 5613 822 143 3203
readahead 5956 5418 383 139 0
lookahead 961 650 212 98 0
lookahead_hit 449 181 164 58 41
lookahead_ignore 0 0 0 0 0
readahead_eof 4981 4768 171 28 0
readahead_shrink 0 0 0 0 0
readahead_thrash 0 0 0 0 0
readahead_mutilt 0 0 0 0 0
readahead_rescue 45 0 0 0 45
[table pages] total newfile state context none
cache_miss 5590 72 2506 181 2826
read_random 265 37 5 17 206
io_congestion 0 0 0 0 0
io_cache_hit 2440 0 1054 1366 0
io_block 165848 11117 147794 3668 3203
readahead 43080 11360 28949 2621 0
readahead_hit 38251 10716 25669 1818 9
lookahead 24013 1718 21641 647 0
lookahead_hit 20161 770 18679 712 0
lookahead_ignore 0 0 0 0 0
readahead_eof 15961 7924 7440 461 0
readahead_shrink 0 0 0 0 0
readahead_thrash 0 0 0 0 0
readahead_mutilt 0 0 0 0 0
readahead_rescue 240 0 0 0 240
[table summary] total newfile state context none
random_rate 4% 0% 1% 10% 99%
ra_hit_rate 88% 94% 88% 69% 900%
la_hit_rate 46% 27% 76% 58% 4100%
avg_ra_size 7 2 75 19 0
avg_la_size 25 3 102 7 0
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 192 insertions(+), 2 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -105,6 +105,9 @@ enum ra_event {
u32 readahead_debug_level = 0;
u32 debug_disable_stateful_method = 0;
+static struct dentry *readahead_events_dentry;
+extern struct file_operations ra_debug_fops;
+
static int __init readahead_init(void)
{
struct dentry *root;
@@ -115,6 +118,9 @@ static int __init readahead_init(void)
debugfs_create_bool("disable_stateful_method", 0644, root,
&debug_disable_stateful_method);
+ readahead_events_dentry = debugfs_create_file("events",
+ 0644, root, NULL, &ra_debug_fops);
+
return 0;
}
@@ -132,6 +138,178 @@ static inline int disable_stateful_metho
return 0;
}
+#endif
+
+/*
+ * Read-ahead events accounting.
+ */
+#ifdef DEBUG_READAHEAD
+
+static char *ra_class_name[] = {
+ "total",
+ "newfile",
+ "state",
+ "context",
+ "contexta",
+ "backward",
+ "onthrash",
+ "onraseek",
+ "none",
+};
+
+static char *ra_event_name[] = {
+ "cache_miss",
+ "read_random",
+ "io_congestion",
+ "io_cache_hit",
+ "io_block",
+ "readahead",
+ "readahead_hit",
+ "lookahead",
+ "lookahead_hit",
+ "lookahead_ignore",
+ "readahead_eof",
+ "readahead_shrink",
+ "readahead_thrash",
+ "readahead_mutilt",
+ "readahead_rescue",
+};
+
+static unsigned long ra_event_count[RA_CLASS_END+1][RA_EVENT_END][2];
+
+static inline void ra_account(struct file_ra_state *ra,
+ enum ra_event e, int pages)
+{
+ enum ra_class c;
+
+ c = (ra ? ra->flags & RA_CLASS_MASK : RA_CLASS_END);
+ if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+ c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+ pages = -pages;
+ }
+ if (!c)
+ c = RA_CLASS_END;
+ BUG_ON(c > RA_CLASS_END);
+
+ ra_event_count[c][e][0] += 1;
+ ra_event_count[c][e][1] += pages;
+}
+
+static int ra_account_show(struct seq_file *s, void *_)
+{
+ int i;
+ int c;
+ int e;
+ static char event_fmt[] = "%-16s";
+ static char class_fmt[] = "%11s";
+ static char item_fmt[] = "%11lu";
+ static char percent_format[] = "%10lu%%";
+ static char *table_name[] = {
+ "[table requests]",
+ "[table pages]",
+ "[table summary]"};
+
+ for (i = 0; i <= 1; i++) {
+ for (e = 0; e < RA_EVENT_END; e++) {
+ ra_event_count[0][e][i] = 0;
+ for (c = 1; c <= RA_CLASS_END; c++)
+ ra_event_count[0][e][i] +=
+ ra_event_count[c][e][i];
+ }
+
+ seq_printf(s, event_fmt, table_name[i]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ for (e = 0; e < RA_EVENT_END; e++) {
+ if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+ continue;
+
+ seq_printf(s, event_fmt, ra_event_name[e]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ ra_event_count[c][e][i]);
+ seq_puts(s, "\n");
+ }
+ seq_puts(s, "\n");
+ }
+
+ seq_printf(s, event_fmt, table_name[2]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "random_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_READRANDOM][0] * 100) /
+ (ra_event_count[c][RA_EVENT_READRANDOM][0] +
+ ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "ra_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+ (ra_event_count[c][RA_EVENT_READAHEAD][1] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "la_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_ra_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_event_count[c][RA_EVENT_READAHEAD][1] +
+ ra_event_count[c][RA_EVENT_READAHEAD][0] / 2) /
+ (ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_la_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][1] +
+ ra_event_count[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
+static int ra_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ra_account_show, NULL);
+}
+
+static ssize_t ra_debug_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ if (file->f_dentry == readahead_events_dentry)
+ memset(ra_event_count, 0, sizeof(ra_event_count));
+ return 1;
+}
+
+static struct file_operations ra_debug_fops = {
+ .owner = THIS_MODULE,
+ .open = ra_debug_open,
+ .write = ra_debug_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+#else /* !DEBUG_READAHEAD */
+
+static inline void ra_account(struct file_ra_state *ra,
+ enum ra_event e, int pages)
+{
+}
+
#endif /* DEBUG_READAHEAD */
@@ -1024,6 +1202,8 @@ static int ra_dispatch(struct file_ra_st
ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
if (la_size)
ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ if (ra_size > actual)
+ ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
ra_account(ra, RA_EVENT_READAHEAD, actual);
dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
@@ -1668,8 +1848,11 @@ page_cache_readahead_adaptive(struct add
if (page) {
if(!TestClearPageReadahead(page))
return 0;
- if (bdi_read_congested(mapping->backing_dev_info))
+ if (bdi_read_congested(mapping->backing_dev_info)) {
+ ra_account(ra, RA_EVENT_IO_CONGESTION,
+ end_index - index);
return 0;
+ }
}
if (page)
@@ -1759,8 +1942,15 @@ void fastcall ra_access(struct file_ra_s
(1 << PG_referenced)))
return;
- if (!ra_has_index(ra, page->index))
+ if (ra_has_index(ra, page->index)) {
+ if (!PageUptodate(page))
+ ra_account(ra, RA_EVENT_IO_BLOCK,
+ ra->readahead_index - page->index);
+ } else {
+ if (!PageUptodate(page))
+ ra_account(0, RA_EVENT_IO_BLOCK, 1);
return;
+ }
ra->cache_hit++;
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 15/19] readahead: page aging accounting
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (13 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
` (4 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-account-aging.patch --]
[-- Type: text/plain, Size: 7977 bytes --]
The accuracy of stateful thrashing-threshold estimation depends largely on the
measurement of cold page aging speed.
A file named `page_aging' is created in debugfs to monitor the trace of two
measurement variables: per-zone `aging_total' and per-cpu `readahead_aging'.
Their values and the jiffies are recorded each time one of them has a delta
of 1, 1/2, 1/4, 1/16, 1/256, 1/4096 (nr_inactive + nr_free).
Sample series of collected data show that smooth_aging is more stable at
small sampling granularity:
time dt page_aging8 smooth_aging8
872765 26 520056 33 653782 163
872791 12 520089 132 653945 51
872803 4 520221 132 653996 66
872807 17 520353 165 654062 107
872824 22 520518 99 654169 74
872846 372 520617 99 654243 78
873218 294 520716 99 654321 73
873512 196 520815 99 654394 130
873708 15 520914 231 654524 28
873723 15 521145 198 654552 9
873738 881 521343 99 654561 182
874619 700 521442 0 654743 198
875319 384 521442 66 654941 110
875703 2119 521508 99 655051 1632
877822 3960 521607 0 656683 980
881782 904 521607 0 657663 216
time dt page_aging1 smooth_aging1
-90822 12418 5775 12999 33302 10767
-78404 17510 18774 10303 44069 10345
-60894 24757 29077 9871 54414 14615
-36137 19194 38948 10404 69029 13726
-16943 19636 49352 10440 82755 12865
2693 16299 59792 12453 95620 10734
18992 19851 72245 10073 106354 15960
38843 16099 82318 10767 122314 14059
54942 16094 93085 10041 136373 12117
71036 19888 103126 12595 148490 16155
90924 18452 115721 9782 164645 11705
109376 22395 125503 10214 176350 13679
131771 19310 135717 10759 190029 11843
151081 20793 146476 10699 201872 12595
171874 22308 157175 10321 214467 13157
194182 17954 167496 10773 227624 14803
212136 19946 178269 10554 242427 13391
232082 21051 188823 11179 255818 11783
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 155 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 154 insertions(+), 1 deletion(-)
--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -108,6 +108,9 @@ u32 debug_disable_stateful_method;
static struct dentry *readahead_events_dentry;
extern struct file_operations ra_debug_fops;
+static struct dentry *page_aging_dentry;
+extern struct file_operations aginginfo_fops;
+
static int __init readahead_init(void)
{
struct dentry *root;
@@ -120,6 +123,8 @@ static int __init readahead_init(void)
readahead_events_dentry = debugfs_create_file("events",
0644, root, NULL, &ra_debug_fops);
+ page_aging_dentry = debugfs_create_file("page_aging",
+ 0644, root, NULL, &aginginfo_fops);
return 0;
}
@@ -281,9 +286,14 @@ static int ra_account_show(struct seq_fi
return 0;
}
+extern struct seq_operations aginginfo_ops;
+
static int ra_debug_open(struct inode *inode, struct file *file)
{
- return single_open(file, ra_account_show, NULL);
+ if (file->f_dentry == readahead_events_dentry)
+ return single_open(file, ra_account_show, NULL);
+ else
+ return seq_open(file, &aginginfo_ops);
}
static ssize_t ra_debug_write(struct file *file, const char __user *buf,
@@ -312,6 +322,146 @@ static inline void ra_account(struct fil
#endif /* DEBUG_READAHEAD */
+/*
+ * Measure the aging progress of cold pages over time.
+ */
+#ifdef DEBUG_READAHEAD
+
+#define AGING_INFO_SIZE (1 << 8)
+#define AGING_INFO_MASK (AGING_INFO_SIZE - 1)
+static int aging_info_shift[] = {0, 1, 2, 4, 8, 12};
+#define AGING_INFO_SHIFTS (sizeof(aging_info_shift)/\
+ sizeof(aging_info_shift[0]))
+static int aging_info_index[AGING_INFO_SHIFTS];
+static unsigned long aging_info[AGING_INFO_SIZE][AGING_INFO_SHIFTS*3];
+static spinlock_t aging_info_lock = SPIN_LOCK_UNLOCKED;
+
+static unsigned long node_free_and_cold_pages(void);
+static unsigned long node_readahead_aging(void);
+
+/*
+ * The accumulated count of pages pushed into inactive_list(s).
+ */
+static unsigned long aging_total(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].aging_total;
+
+ return sum;
+}
+
+static void collect_aging_info(void)
+{
+ int i;
+ unsigned long mem;
+ unsigned long page_aging;
+ unsigned long smooth_aging;
+
+ mem = node_free_and_cold_pages();
+ page_aging = aging_total();
+ smooth_aging = node_readahead_aging();
+
+ spin_lock_irq(&aging_info_lock);
+
+ for (i = AGING_INFO_SHIFTS - 1; i >= 0; i--) {
+ if (smooth_aging - aging_info[aging_info_index[i]][i*3+2] +
+ page_aging - aging_info[aging_info_index[i]][i*3+1] >
+ 2 * (mem >> aging_info_shift[i])) {
+ aging_info_index[i]++;
+ aging_info_index[i] &= AGING_INFO_MASK;
+ aging_info[aging_info_index[i]][i*3] = jiffies;
+ aging_info[aging_info_index[i]][i*3+1] = page_aging;
+ aging_info[aging_info_index[i]][i*3+2] = smooth_aging;
+ } else
+ break;
+ }
+
+ spin_unlock_irq(&aging_info_lock);
+}
+
+static void *aginginfo_start(struct seq_file *s, loff_t *pos)
+{
+ int n = *pos;
+ int i;
+
+ spin_lock_irq(&aging_info_lock);
+
+ if (!n) {
+ for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+ seq_printf(s, "%12s %10s %18s%d %18s%d\t", "time","dt",
+ "page_aging", aging_info_shift[i],
+ "smooth_aging", aging_info_shift[i]);
+ }
+ seq_puts(s, "\n");
+ }
+
+ if (++n < AGING_INFO_SIZE)
+ return (void *)n;
+ else
+ return NULL;
+}
+
+static void *aginginfo_next(struct seq_file *s, void *p, loff_t *pos)
+{
+ int n = (int)p;
+
+ ++*pos;
+ return (void *)(++n < AGING_INFO_SIZE ? n : 0);
+}
+
+static void aginginfo_stop(struct seq_file *s, void *p)
+{
+ spin_unlock_irq(&aging_info_lock);
+}
+
+static int aginginfo_show(struct seq_file *s, void *p)
+{
+ int n = (int)p;
+ int i;
+ int index0;
+ int index1;
+ long time;
+ unsigned long nr1;
+ unsigned long nr2;
+
+ for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+ index0 = aging_info_index[i] + n;
+ index1 = aging_info_index[i] + n + 1;
+ index0 &= AGING_INFO_MASK;
+ index1 &= AGING_INFO_MASK;
+ time = aging_info[index0][i*3];
+ nr1 = aging_info[index1][i*3+1] - aging_info[index0][i*3+1];
+ nr2 = aging_info[index1][i*3+2] - aging_info[index0][i*3+2];
+ seq_printf(s, "%12ld %10lu %10lu %8lu %10lu %8lu\t",
+ time, aging_info[index1][i*3] - time,
+ aging_info[index0][i*3+1], nr1,
+ aging_info[index0][i*3+2], nr2);
+ }
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
+
+static struct seq_operations aginginfo_ops = {
+ .start = aginginfo_start,
+ .next = aginginfo_next,
+ .stop = aginginfo_stop,
+ .show = aginginfo_show,
+};
+
+static struct file_operations aginginfo_fops = {
+ .open = ra_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+#endif /* DEBUG_READAHEAD */
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -1298,6 +1448,9 @@ static inline unsigned long compute_thra
else
*remain = 0;
+#ifdef DEBUG_READAHEAD
+ collect_aging_info();
+#endif
ddprintk("compute_thrashing_threshold: "
"ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
ra_size, stream_shift, global_size, global_shift,
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 16/19] readahead: laptop mode support
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (14 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 16:06 ` Bart Samwel
2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
` (3 subsequent siblings)
19 siblings, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Bart Samwel
[-- Attachment #1: readahead-laptop-mode.patch --]
[-- Type: text/plain, Size: 3313 bytes --]
When the laptop drive is spun down, defer look-ahead until spin-up time.
The implementation employs a poll-based method, since performance is not a
concern in this code path. The poll interval is 64KB, which should be small
enough for movies/music. The user space application is responsible for
proper caching to hide the spin-up-and-read delay.
------------------------------------------------------------------------
For crazy laptop users who prefer aggressive read-ahead, here is the way:
# echo 1000 > /proc/sys/vm/readahead_ratio
# blockdev --setra 524280 /dev/hda # this is the max possible value
Notes:
- It is still an untested feature.
- It is safer to use blockdev+fadvise to increase ra-max for a single file,
which requires patching your movie player.
- Be sure to restore them to sane values in normal operations!
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/writeback.h | 6 ++++++
mm/page-writeback.c | 2 +-
mm/readahead.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 37 insertions(+), 1 deletion(-)
--- linux-2.6.15-rc1-mm2.orig/include/linux/writeback.h
+++ linux-2.6.15-rc1-mm2/include/linux/writeback.h
@@ -85,6 +85,12 @@ void laptop_io_completion(void);
void laptop_sync_completion(void);
void throttle_vm_writeout(void);
+extern struct timer_list laptop_mode_wb_timer;
+static inline int laptop_spinned_down(void)
+{
+ return !timer_pending(&laptop_mode_wb_timer);
+}
+
/* These are exported to sysctl. */
extern int dirty_background_ratio;
extern int vm_dirty_ratio;
--- linux-2.6.15-rc1-mm2.orig/mm/page-writeback.c
+++ linux-2.6.15-rc1-mm2/mm/page-writeback.c
@@ -369,7 +369,7 @@ static void wb_timer_fn(unsigned long un
static void laptop_timer_fn(unsigned long unused);
static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
-static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
+DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
/*
* Periodic writeback of "old" data.
--- linux-2.6.15-rc1-mm2.orig/mm/readahead.c
+++ linux-2.6.15-rc1-mm2/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/writeback.h>
/* The default max/min read-ahead pages. */
#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
@@ -1155,6 +1156,30 @@ out:
}
/*
+ * Set a new look-ahead mark at @new_index.
+ * Return 0 if the new mark is successfully set.
+ */
+static inline int renew_lookahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ pgoff_t index, pgoff_t new_index)
+{
+ struct page *page;
+
+ if (index == ra->lookahead_index &&
+ new_index >= ra->readahead_index)
+ return 1;
+
+ page = find_page(mapping, new_index);
+ if (!page)
+ return 1;
+
+ SetPageReadahead(page);
+ if (ra->lookahead_index == index)
+ ra->lookahead_index = new_index;
+ return 0;
+}
+
+/*
* State based calculation of read-ahead request.
*
* This figure shows the meaning of file_ra_state members:
@@ -2006,6 +2031,11 @@ page_cache_readahead_adaptive(struct add
end_index - index);
return 0;
}
+ if (laptop_mode && laptop_spinned_down()) {
+ if (!renew_lookahead(mapping, ra, index,
+ index + LAPTOP_POLL_INTERVAL))
+ return 0;
+ }
}
if (page)
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 17/19] readahead: disable look-ahead for loopback file
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (15 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
` (2 subsequent siblings)
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-disable-lookahead-for-loopback.patch --]
[-- Type: text/plain, Size: 2027 bytes --]
Loopback files normally contain a filesystem, in which case the upper layer
already performs proper look-ahead; more look-ahead on the loopback file
only ruins the read-ahead hit rate.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
I'd like to thank Tero Grundström for uncovering the loopback problem.
drivers/block/loop.c | 6 ++++++
include/linux/fs.h | 1 +
mm/readahead.c | 8 ++++++++
3 files changed, 15 insertions(+)
--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -620,6 +620,7 @@ struct file_ra_state {
};
#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
+#define RA_FLAG_NO_LOOKAHEAD (1UL<<31) /* disable look-ahead */
struct file {
/*
--- linux-2.6.15-rc2-mm1.orig/drivers/block/loop.c
+++ linux-2.6.15-rc2-mm1/drivers/block/loop.c
@@ -782,6 +782,12 @@ static int loop_set_fd(struct loop_devic
mapping = file->f_mapping;
inode = mapping->host;
+ /*
+ * The upper layer should already do proper look-ahead,
+ * one more look-ahead here only ruins the cache hit rate.
+ */
+ file->f_ra.flags |= RA_FLAG_NO_LOOKAHEAD;
+
if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -1312,6 +1312,11 @@ static inline void ra_state_update(struc
if (ra_size < old_ra && ra_cache_hit(ra, 0))
ra_account(ra, RA_EVENT_READAHEAD_SHRINK, old_ra - ra_size);
#endif
+
+ /* Disable look-ahead for loopback file. */
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ la_size = 0;
+
ra_addup_cache_hit(ra);
ra->ra_index = ra->readahead_index;
ra->la_index = ra->lookahead_index;
@@ -1672,6 +1677,9 @@ static int query_page_cache(struct addre
if (count < ra_max)
goto out;
+ if (unlikely(ra->flags & RA_FLAG_NO_LOOKAHEAD))
+ goto out;
+
/*
* Check the far pages coarsely.
* The big count here helps increase la_size.
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 18/19] readahead: nfsd support
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (16 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Neil Brown
[-- Attachment #1: readahead-nfsd-support.patch --]
[-- Type: text/plain, Size: 4985 bytes --]
- disable nfsd raparms: the new logic does not rely on them
- disable look-ahead at the start of a file: leave it to the client
For the case of NFS service, the new read-ahead logic
+ can handle disordered nfsd requests
+ can handle concurrent sequential requests on large files
with the help of look-ahead
- still has trouble with concurrent requests on small files
------------------------------------------------------------------------
Notes about the concurrent nfsd requests issue:
nfsd read requests can be out of order, concurrent, and carry no ra-state info.
They are handled by the context based read-ahead method, which does the job
in the following steps:
1. scan in page cache
2. make read-ahead decisions
3. alloc new pages
4. insert new pages to page cache
A single read-ahead chunk on the client side will be disassembled and serviced
by many concurrent nfsd instances on the server side. It is highly likely that
two or more of these parallel nfsd instances will be in step 1/2/3 at the same
time. Without knowing that others are working on the same file region, they
will issue overlapping read-ahead requests, which lead to many conflicts at
step 4.
There is no general and efficient way to eliminate this concurrency problem.
But for small to medium NFS servers where the bottleneck lies in the storage
devices, here is a performance tip:
# for pid in `pidof nfsd`; do taskset -p 1 $pid; done
This command effectively serializes all nfsd requests. It would be nice if
someone could implement this serialization on a per-file basis.
------------------------------------------------------------------------
Here is some test output(8 nfsd; local mount with tcp,rsize=8192):
SERIALIZED, SMALL FILES
=======================
readahead_ratio = 0, ra_max = 128kb (old logic, the ra_max is really not relevant)
96.51s real 11.32s system 3.27s user 160334+2829 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb (new read-ahead logic)
94.88s real 11.53s system 3.20s user 152415+3777 cs diff -r $NFSDIR $NFSDIR2
PARALLEL, SMALL FILES
=====================
readahead_ratio = 0, ra_max = 128kb
99.87s real 11.41s system 3.15s user 173945+9163 cs diff -r $NFSDIR $NFSDIR2
readahead_ratio = 70, ra_max = 1024kb
100.14s real 12.06s system 3.16s user 170865+13406 cs diff -r $NFSDIR $NFSDIR2
SERIALIZED, BIG FILES
=====================
readahead_ratio = 0, ra_max = 128kb
56.52s real 3.38s system 1.23s user 47930+5256 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
32.54s real 5.71s system 1.38s user 23851+17007 cs diff $NFSFILE $NFSFILE2
PARALLEL, BIG FILES
===================
readahead_ratio = 0, ra_max = 128kb
63.35s real 5.68s system 1.57s user 82594+48747 cs diff $NFSFILE $NFSFILE2
readahead_ratio = 70, ra_max = 1024kb
33.87s real 10.17s system 1.55s user 72291+100079 cs diff $NFSFILE $NFSFILE2
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
fs/nfsd/vfs.c | 6 +++++-
include/linux/fs.h | 1 +
mm/readahead.c | 11 +++++++++--
3 files changed, 15 insertions(+), 3 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/fs/nfsd/vfs.c
+++ linux-2.6.15-rc2-mm1/fs/nfsd/vfs.c
@@ -826,10 +826,14 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st
#endif
/* Get readahead parameters */
- ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
+ if (prefer_adaptive_readahead())
+ ra = NULL;
+ else
+ ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
if (ra && ra->p_set)
file->f_ra = ra->p_ra;
+ file->f_ra.flags |= RA_FLAG_NFSD;
if (file->f_op->sendfile) {
svc_pushback_unused_pages(rqstp);
--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
@@ -621,6 +621,7 @@ struct file_ra_state {
#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
#define RA_FLAG_NO_LOOKAHEAD (1UL<<31) /* disable look-ahead */
+#define RA_FLAG_NFSD (1UL<<30) /* request from nfsd */
struct file {
/*
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -15,11 +15,13 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>
+#include <linux/nfsd/const.h>
/* The default max/min read-ahead pages. */
#define KB(size) (((size)*1024 + PAGE_CACHE_SIZE-1) / PAGE_CACHE_SIZE)
#define MAX_RA_PAGES KB(VM_MAX_READAHEAD)
#define MIN_RA_PAGES KB(VM_MIN_READAHEAD)
+#define MIN_NFSD_PAGES KB(NFSSVC_MAXBLKSIZE/1024)
/* In laptop mode, poll delayed look-ahead on every ## pages read. */
#define LAPTOP_POLL_INTERVAL 16
@@ -1894,8 +1896,13 @@ newfile_readahead(struct address_space *
if (req_size > ra_min)
req_size = ra_min;
- ra_size = 4 * req_size;
- la_size = 2 * req_size;
+ if (unlikely(ra->flags & RA_FLAG_NFSD)) {
+ ra_size = MIN_NFSD_PAGES;
+ la_size = 0;
+ } else {
+ ra_size = 4 * req_size;
+ la_size = 2 * req_size;
+ }
set_ra_class(ra, RA_CLASS_NEWFILE);
ra_state_init(ra, 0, 0);
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH 19/19] io: avoid too much latency from read-ahead
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (17 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
@ 2005-11-25 15:12 ` Wu Fengguang
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
19 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-25 15:12 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Con Kolivas
[-- Attachment #1: io-low-latency.patch --]
[-- Type: text/plain, Size: 4666 bytes --]
Since VM_MAX_READAHEAD is greatly enlarged and the algorithm has become more
complex, it may be necessary to insert some cond_resched() calls in the
read-ahead path.
----------------------------------------------------------------------------
If you notice more latency with the new read-ahead code, the cause can
either be the complex read-ahead code itself, or some low level routines not
ready to support the larger default 1M read-ahead size. It can be sorted out
with the following commands:
dd if=/dev/zero of=sparse bs=1M seek=10000 count=1
cp sparse /dev/null
If the above copy does not lead to audio jitters, then it's a low level
issue. Such issues are hard to track down for now, though there's a workaround:
blockdev --setra 256 /dev/hda # or whatever device you are using
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
This was recommended by Con Kolivas to improve response time on the desktop.
fs/mpage.c | 4 +++-
mm/readahead.c | 16 +++++++++++++++-
2 files changed, 18 insertions(+), 2 deletions(-)
--- linux-2.6.15-rc2-mm1.orig/mm/readahead.c
+++ linux-2.6.15-rc2-mm1/mm/readahead.c
@@ -368,6 +368,7 @@ static void collect_aging_info(void)
page_aging = aging_total();
smooth_aging = node_readahead_aging();
+ cond_resched();
spin_lock_irq(&aging_info_lock);
for (i = AGING_INFO_SHIFTS - 1; i >= 0; i--) {
@@ -625,8 +626,10 @@ static int read_pages(struct address_spa
page->index, GFP_KERNEL)) {
ret = mapping->a_ops->readpage(filp, page);
if (ret != AOP_TRUNCATED_PAGE) {
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
__pagevec_lru_add(&lru_pvec);
+ cond_resched();
+ }
continue;
} /* else fall through to release */
}
@@ -746,6 +749,7 @@ __do_page_cache_readahead(struct address
}
read_unlock_irq(&mapping->tree_lock);
+ cond_resched();
page = page_cache_alloc_cold(mapping);
read_lock_irq(&mapping->tree_lock);
if (!page)
@@ -1145,6 +1149,7 @@ static int rescue_pages(struct page *pag
spin_unlock_irq(&zone->lru_lock);
+ cond_resched();
page = find_page(mapping, index);
if (!page)
goto out;
@@ -1219,6 +1224,7 @@ static unsigned long node_readahead_agin
unsigned long sum = 0;
cpumask_t mask = node_to_cpumask(numa_node_id());
+ cond_resched();
for_each_cpu_mask(cpu, mask)
sum += per_cpu(readahead_aging, cpu);
@@ -1349,6 +1355,7 @@ static int ra_dispatch(struct file_ra_st
int actual;
enum ra_class ra_class;
+ cond_resched();
ra_class = (ra->flags & RA_CLASS_MASK);
BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
@@ -1371,6 +1378,7 @@ static int ra_dispatch(struct file_ra_st
actual = __do_page_cache_readahead(mapping, filp,
ra->ra_index, ra_size, la_size);
+ cond_resched();
#ifdef READAHEAD_STREAMING
if (actual < ra_size) {
@@ -1609,6 +1617,7 @@ static int count_cache_hit(struct addres
int count = 0;
int i;
+ cond_resched();
read_lock_irq(&mapping->tree_lock);
/*
@@ -1646,6 +1655,7 @@ static int query_page_cache(struct addre
* Scan backward and check the near @ra_max pages.
* The count here determines ra_size.
*/
+ cond_resched();
read_lock_irq(&mapping->tree_lock);
index = radix_tree_lookup_head(&mapping->page_tree, offset, ra_max);
read_unlock_irq(&mapping->tree_lock);
@@ -1691,6 +1701,7 @@ static int query_page_cache(struct addre
if (nr_lookback > offset)
nr_lookback = offset;
+ cond_resched();
radix_tree_cache_init(&cache);
read_lock_irq(&mapping->tree_lock);
for (count += ra_max; count < nr_lookback; count += ra_max) {
@@ -1737,6 +1748,7 @@ static inline pgoff_t first_absent_page_
if (max_scan > index)
max_scan = index;
+ cond_resched();
radix_tree_cache_init(&cache);
read_lock_irq(&mapping->tree_lock);
for (; origin - index <= max_scan;) {
@@ -1760,6 +1772,7 @@ static inline pgoff_t first_absent_page(
{
pgoff_t ra_index;
+ cond_resched();
read_lock_irq(&mapping->tree_lock);
ra_index = radix_tree_lookup_tail(&mapping->page_tree,
index + 1, max_scan);
@@ -2469,6 +2482,7 @@ save_chunk:
n <= 3)
ret += save_chunk(chunk_head, live_head, page, save_list);
+ cond_resched();
if (&page->lru != page_list)
goto next_chunk;
--- linux-2.6.15-rc2-mm1.orig/fs/mpage.c
+++ linux-2.6.15-rc2-mm1/fs/mpage.c
@@ -343,8 +343,10 @@ mpage_readpages(struct address_space *ma
bio = do_mpage_readpage(bio, page,
nr_pages - page_idx,
&last_block_in_bio, get_block);
- if (!pagevec_add(&lru_pvec, page))
+ if (!pagevec_add(&lru_pvec, page)) {
__pagevec_lru_add(&lru_pvec);
+ cond_resched();
+ }
} else {
page_cache_release(page);
}
--
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 10/19] readahead: state based method
2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
@ 2005-11-25 15:21 ` Eric Dumazet
2005-11-25 15:31 ` Eric Dumazet
2005-11-26 3:09 ` Wu Fengguang
0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2005-11-25 15:21 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, Andrew Morton
Wu Fengguang wrote:
> include/linux/fs.h | 8 +
>
> --- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
> +++ linux-2.6.15-rc2-mm1/include/linux/fs.h
> @@ -604,13 +604,19 @@ struct file_ra_state {
> unsigned long start; /* Current window */
> unsigned long size;
> unsigned long flags; /* ra flags RA_FLAG_xxx*/
> - unsigned long cache_hit; /* cache hit count*/
> + uint64_t cache_hit; /* cache hit count*/
> unsigned long prev_page; /* Cache last read() position */
> unsigned long ahead_start; /* Ahead window */
> unsigned long ahead_size;
> unsigned long ra_pages; /* Maximum readahead window */
> unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
> unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
> +
> + unsigned long age;
> + pgoff_t la_index;
> + pgoff_t ra_index;
> + pgoff_t lookahead_index;
> + pgoff_t readahead_index;
> };
Hum... This sizeof(struct file) increase seems quite large...
Have you ever considered to change struct file so that file_ra_state is not
embedded, but dynamically allocated (or other strategy) for regular files ?
I mean, sockets, pipes cannot readahead... And some machines use far more
sockets than regular files.
I wrote such a patch in the past I could resend...
Eric
* Re: [PATCH 10/19] readahead: state based method
2005-11-25 15:21 ` Eric Dumazet
@ 2005-11-25 15:31 ` Eric Dumazet
2005-11-26 3:09 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2005-11-25 15:31 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Wu Fengguang, linux-kernel, Andrew Morton
Eric Dumazet wrote:
> Hum... This sizeof(struct file) increase seems quite large...
>
> Have you ever considered to change struct file so that file_ra_state is
> not embedded, but dynamically allocated (or other strategy) for regular
> files ?
>
> I mean, sockets, pipes cannot readahead... And some machines use far
> more sockets than regular files.
>
> I wrote such a patch in the past I could resend...
http://marc.theaimsgroup.com/?l=linux-kernel&m=112435605407020&w=2
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
` (18 preceding siblings ...)
2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
@ 2005-11-25 15:43 ` Diego Calleja
2005-11-25 19:31 ` Lee Revell
2005-11-26 3:17 ` Wu Fengguang
19 siblings, 2 replies; 32+ messages in thread
From: Diego Calleja @ 2005-11-25 15:43 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, akpm
On Fri, 25 Nov 2005 23:12:10 +0800,
Wu Fengguang <wfg@mail.ustc.edu.cn> wrote:
> Overview
> ========
>
> The current read-ahead logic uses an inflexible algorithm with 128KB
> VM_MAX_READAHEAD. Less memory leads to thrashing; more memory does
> not help throughput. The new logic is simply safer and faster. It
> makes sure every single read-ahead request is safe for the current
> load. Memory-tight systems are expected to benefit a lot: no more
> thrashing. It can also help boost I/O throughput for large memory
> systems, for VM_MAX_READAHEAD now defaults to 1MB. The value is no
> longer tightly coupled with, and therefore no longer constrained by,
> the thrashing problem.
Recently, an openoffice hacker wrote in his blog that the kernel was
the culprit for applications not starting as fast as on other systems.
Most of the reasons he gave were wrong, but there was an interesting
one: when you start your system, you have lots of free memory. Since
you have lots of memory, he said it was reasonable to expect that the
kernel would readahead *heavily* everything it can to fill that
memory as soon as possible (hoping that what you readahead'ed was
part of the kde/gnome/openoffice libraries etc.) and go back to
normal behaviour when your free memory is used by caches etc.
"Theoretically" it looks like a nice heuristic for desktops. Does
adaptive readahead do something like this?
* Re: [PATCH 16/19] readahead: laptop mode support
2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
@ 2005-11-25 16:06 ` Bart Samwel
2005-11-26 3:33 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Bart Samwel @ 2005-11-25 16:06 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, Andrew Morton
Wu Fengguang wrote:
> When the laptop drive is spun down, defer look-ahead to spin-up time.
Just a n00b question: isn't readahead something that happens at read
time at the block device level? And doesn't the fact that you're reading
something imply that the drive is spun up? Or can readahead be triggered
by reading from cache?
> For crazy laptop users who prefer aggressive read-ahead, here is the way:
>
> # echo 1000 > /proc/sys/vm/readahead_ratio
> # blockdev --setra 524280 /dev/hda # this is the max possible value
These amounts of readahead are absolutely useless though. I did
measurements about a year ago which show that at a spindown time of two
minutes you've basically achieved all the power savings you can get.
More than 10 minutes of spindown is absolutely useless unless you have a
desktop drive, because those don't normally support more than 50,000
spinup cycles. The only apps I can think of that work on this amount of
data in such a short period of time are all apps where you shouldn't be
concerned about power drawn by the hard drive. :)
--Bart
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
@ 2005-11-25 19:31 ` Lee Revell
2005-11-26 23:03 ` Mark van der Made
2005-11-26 3:17 ` Wu Fengguang
1 sibling, 1 reply; 32+ messages in thread
From: Lee Revell @ 2005-11-25 19:31 UTC (permalink / raw)
To: Diego Calleja; +Cc: Wu Fengguang, linux-kernel, akpm
On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> Recently, an openoffice hacker wrote in his blog that the kernel was
> the culprit for applications not starting as fast as on other systems.
Useless without a link ;-)
Lee
* Re: [PATCH 10/19] readahead: state based method
2005-11-25 15:21 ` Eric Dumazet
2005-11-25 15:31 ` Eric Dumazet
@ 2005-11-26 3:09 ` Wu Fengguang
1 sibling, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26 3:09 UTC (permalink / raw)
To: Eric Dumazet; +Cc: linux-kernel, Andrew Morton
On Fri, Nov 25, 2005 at 04:21:22PM +0100, Eric Dumazet wrote:
> Wu Fengguang wrote:
>
> > include/linux/fs.h | 8 +
> >
> >--- linux-2.6.15-rc2-mm1.orig/include/linux/fs.h
> >+++ linux-2.6.15-rc2-mm1/include/linux/fs.h
> >@@ -604,13 +604,19 @@ struct file_ra_state {
> > unsigned long start; /* Current window */
> > unsigned long size;
> > unsigned long flags; /* ra flags RA_FLAG_xxx*/
> >- unsigned long cache_hit; /* cache hit count*/
> >+ uint64_t cache_hit; /* cache hit count*/
> > unsigned long prev_page; /* Cache last read() position */
> > unsigned long ahead_start; /* Ahead window */
> > unsigned long ahead_size;
> > unsigned long ra_pages; /* Maximum readahead window */
> > unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
> > unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
> >+
> >+ unsigned long age;
> >+ pgoff_t la_index;
> >+ pgoff_t ra_index;
> >+ pgoff_t lookahead_index;
> >+ pgoff_t readahead_index;
> > };
>
> Hum... This sizeof(struct file) increase seems quite large...
Thanks.
I'm not sure whether the two read-ahead logics should coexist in the long term.
If so, the file_ra_state can be changed as follows to save memory:
struct file_ra_state {
union {
struct {
unsigned long start; /* Current window */
unsigned long size;
unsigned long ahead_start; /* Ahead window */
unsigned long ahead_size;
};
struct {
unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
};
struct {
unsigned long age;
pgoff_t la_index;
pgoff_t ra_index;
pgoff_t lookahead_index;
pgoff_t readahead_index;
};
};
uint64_t cache_hit; /* cache hit count*/
unsigned long flags; /* ra flags RA_FLAG_xxx*/
unsigned long prev_page; /* Cache last read() position */
unsigned long ra_pages; /* Maximum readahead window */
};
The mmap_hit/mmap_miss fields should only be used by the mmap read-around logic.
> Have you ever considered to change struct file so that file_ra_state is not
> embedded, but dynamically allocated (or other strategy) for regular files ?
>
> I mean, sockets, pipes cannot readahead... And some machines use far more
> sockets than regular files.
>
> I wrote such a patch in the past I could resend...
Yes, I noticed it, and I think it is generally a good idea. The only problem
is that I'm afraid the patch might couple file_ra_state tightly with struct
file. See this comment:
* Note that @filp is purely used for passing on to the ->readpage[s]()
* handler: it may refer to a different file from @mapping (so we may not use
* @filp->f_mapping or @filp->f_dentry->d_inode here).
* Also, @ra may not be equal to &@filp->f_ra.
And this patch:
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.15-rc2/2.6.15-rc2-mm1/broken-out/ext3_readdir-use-generic-readahead.patch
Linus points out that ext3_readdir's readahead only cuts in when
ext3_readdir() is operating at the very start of the directory. So for large
directories we end up performing no readahead at all and we suck.
So take it all out and use the core VM's page_cache_readahead(). This means
that ext3 directory reads will use all of readahead's dynamic sizing goop.
Note that we're using the directory's filp->f_ra to hold the readahead state,
but readahead is actually being performed against the underlying blockdev's
address_space. Fortunately the readahead code is all set up to handle this.
Regards,
Wu
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
2005-11-25 19:31 ` Lee Revell
@ 2005-11-26 3:17 ` Wu Fengguang
2005-11-26 13:25 ` Tomasz Torcz
1 sibling, 1 reply; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26 3:17 UTC (permalink / raw)
To: Diego Calleja; +Cc: linux-kernel, akpm
On Fri, Nov 25, 2005 at 04:43:17PM +0100, Diego Calleja wrote:
> Recently, an openoffice hacker wrote in his blog that the kernel was
> the culprit for applications not starting as fast as on other systems.
> Most of the reasons he gave were wrong, but there was an interesting
> one: when you start your system, you have lots of free memory. Since
> you have lots of memory, he said it was reasonable to expect that the
> kernel would readahead *heavily* everything it can to fill that
> memory as soon as possible (hoping that what you readahead'ed was
> part of the kde/gnome/openoffice libraries etc.) and go back to
> normal behaviour when your free memory is used by caches etc.
> "Theoretically" it looks like a nice heuristic for desktops. Does
> adaptive readahead do something like this?
It's interesting ;)
In fact some distributions do have a read-ahead script to preload files on
startup. The readahead system call should be enough for this purpose:
NAME
readahead - perform file readahead into page cache
SYNOPSIS
#include <fcntl.h>
ssize_t readahead(int fd, off64_t offset, size_t count);
Thanks,
Wu
* Re: [PATCH 16/19] readahead: laptop mode support
2005-11-25 16:06 ` Bart Samwel
@ 2005-11-26 3:33 ` Wu Fengguang
0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26 3:33 UTC (permalink / raw)
To: Bart Samwel; +Cc: linux-kernel, Andrew Morton
On Fri, Nov 25, 2005 at 05:06:25PM +0100, Bart Samwel wrote:
> Wu Fengguang wrote:
> >When the laptop drive is spun down, defer look-ahead to spin-up time.
>
> Just a n00b question: isn't readahead something that happens at read
> time at the block device level? And doesn't the fact that you're reading
> something imply that the drive is spun up? Or can readahead be triggered
> by reading from cache?
Yes, both the old and new read-ahead logic issue read-ahead requests before
the pages are immediately needed. It is called look-ahead in the new logic,
which achieves I/O pipelining, and helps hide the I/O latency.
> >For crazy laptop users who prefer aggressive read-ahead, here is the way:
> >
> ># echo 1000 > /proc/sys/vm/readahead_ratio
> ># blockdev --setra 524280 /dev/hda # this is the max possible value
>
> These amounts of readahead are absolutely useless though. I did
> measurements about a year ago which show that at a spindown time of two
> minutes you've basically achieved all the power savings you can get.
> More than 10 minutes of spindown is absolutely useless unless you have a
> desktop drive, because those don't normally support more than 50,000
> spinup cycles. The only apps I can think of that work on this amount of
> data in such a short period of time are all apps where you shouldn't be
> concerned about power drawn by the hard drive. :)
Thanks, I have read the paper; quite informative :)
It's just that someone suggested the feature, and it was just a matter of
lifting some limit values in the code - so I did it :)
Regards,
Wu
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-26 3:17 ` Wu Fengguang
@ 2005-11-26 13:25 ` Tomasz Torcz
2005-11-26 14:11 ` Wu Fengguang
0 siblings, 1 reply; 32+ messages in thread
From: Tomasz Torcz @ 2005-11-26 13:25 UTC (permalink / raw)
To: Wu Fengguang, Diego Calleja, linux-kernel, akpm
On Sat, Nov 26, 2005 at 11:17:55AM +0800, Wu Fengguang wrote:
> On Fri, Nov 25, 2005 at 04:43:17PM +0100, Diego Calleja wrote:
> > Recently, an openoffice hacker wrote in his blog that the kernel was
> > the culprit for applications not starting as fast as on other systems.
> > Most of the reasons he gave were wrong, but there was an interesting
> > one: when you start your system, you have lots of free memory. Since
> > you have lots of memory, he said it was reasonable to expect that the
> > kernel would readahead *heavily* everything it can to fill that
> > memory as soon as possible (hoping that what you readahead'ed was
> > part of the kde/gnome/openoffice libraries etc.) and go back to
> > normal behaviour when your free memory is used by caches etc.
> > "Theoretically" it looks like a nice heuristic for desktops. Does
> > adaptive readahead do something like this?
>
> It's interesting ;)
> In fact some distributions do have a read-ahead script to preload files on
> startup. The readahead system call should be enough for this purpose:
>
> NAME
> readahead - perform file readahead into page cache
posix_fadvise() with POSIX_FADV_WILLNEED hint?
The specified data will be accessed in the near future.
--
Tomasz Torcz Morality must always be based on practicality.
zdzichu@irc.-nie.spam-.pl -- Baron Vladimir Harkonnen
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-26 13:25 ` Tomasz Torcz
@ 2005-11-26 14:11 ` Wu Fengguang
0 siblings, 0 replies; 32+ messages in thread
From: Wu Fengguang @ 2005-11-26 14:11 UTC (permalink / raw)
To: Diego Calleja, linux-kernel, akpm
On Sat, Nov 26, 2005 at 02:25:24PM +0100, Tomasz Torcz wrote:
> > It's interesting ;)
> > In fact some distributions do have a read-ahead script to preload files on
> > startup. The readahead system call should be enough for this purpose:
> >
> > NAME
> > readahead - perform file readahead into page cache
>
> posix_fadvise() with POSIX_FADV_WILLNEED hint?
> The specified data will be accessed in the near future.
Nod, this one is better.
They do roughly the same thing, while the latter interface is more portable.
Another note: if you do not know precisely which files to readahead,
turning on the aggressive readahead feature temporarily might help. There
are mainly three parameters that control the aggressiveness; the following
example values should be large enough:
echo 16 > /proc/sys/vm/readahead_hit_rate # overlook hit rate?
echo 1000 > /proc/sys/vm/readahead_ratio # grow up ra-size quickly?
blockdev --setra 8192 /dev/hda # large ra-size limit?
Thanks,
Wu
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-25 19:31 ` Lee Revell
@ 2005-11-26 23:03 ` Mark van der Made
2005-11-27 19:54 ` Lee Revell
0 siblings, 1 reply; 32+ messages in thread
From: Mark van der Made @ 2005-11-26 23:03 UTC (permalink / raw)
To: Lee Revell; +Cc: Diego Calleja, Wu Fengguang, linux-kernel, akpm
On 11/25/05, Lee Revell <rlrevell@joe-job.com> wrote:
> On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> > Recently, an openoffice hacker wrote in his blog that the kernel was
> > the culprit for applications not starting as fast as on other systems.
>
> Useless without a link ;-)
>
I think Diego refers to Michael Meeks' blog:
http://www.gnome.org/~michael/activity.html#2005-11-04
Michael Meeks also speaks about kernel issues in an interview in Linux
Format 72 (Nov 2005).
Mark
* Re: [PATCH 00/19] Adaptive read-ahead V8
2005-11-26 23:03 ` Mark van der Made
@ 2005-11-27 19:54 ` Lee Revell
0 siblings, 0 replies; 32+ messages in thread
From: Lee Revell @ 2005-11-27 19:54 UTC (permalink / raw)
To: Mark van der Made; +Cc: Diego Calleja, Wu Fengguang, linux-kernel, akpm
On Sun, 2005-11-27 at 00:03 +0100, Mark van der Made wrote:
> On 11/25/05, Lee Revell <rlrevell@joe-job.com> wrote:
> > On Fri, 2005-11-25 at 16:43 +0100, Diego Calleja wrote:
> > > Recently, an openoffice hacker wrote in his blog that the kernel was
> > > the culprit for applications not starting as fast as on other systems.
> >
> > Useless without a link ;-)
> >
> I think Diego refers to Michael Meeks blog:
> http://www.gnome.org/~michael/activity.html#2005-11-04
> Michael Meek also speaks about kernel issues in an interview in Linux
> Format 72 (nov 2005).
>
This link tells a much more interesting story:
http://www.gnome.org/~lcolitti/gnome-startup/analysis/
Seems like most of the problems are in userspace at the application
level.
Maybe the excessive seeking when loading libraries could be solved by
physically rearranging the libraries on disk so that the "hot paths"
don't induce as many seeks.
Lee
end of thread, other threads:[~2005-11-27 19:54 UTC | newest]
Thread overview: 32+ messages
2005-11-25 15:12 [PATCH 00/19] Adaptive read-ahead V8 Wu Fengguang
2005-11-25 15:12 ` [PATCH 01/19] mm: delayed page activation Wu Fengguang
2005-11-25 15:12 ` [PATCH 02/19] vm: kswapd incmin Wu Fengguang
2005-11-25 15:12 ` [PATCH 03/19] mm: balance page aging between zones and slabs Wu Fengguang
2005-11-25 15:12 ` [PATCH 04/19] mm: debug page reclaim Wu Fengguang
2005-11-25 15:12 ` [PATCH 05/19] radixtree: sync with mainline Wu Fengguang
2005-11-25 15:12 ` [PATCH 06/19] radixtree: look-aside cache Wu Fengguang
2005-11-25 15:12 ` [PATCH 07/19] readahead: some preparation Wu Fengguang
2005-11-25 15:12 ` [PATCH 08/19] readahead: call scheme Wu Fengguang
2005-11-25 15:12 ` [PATCH 09/19] readahead: parameters Wu Fengguang
2005-11-25 15:12 ` [PATCH 10/19] readahead: state based method Wu Fengguang
2005-11-25 15:21 ` Eric Dumazet
2005-11-25 15:31 ` Eric Dumazet
2005-11-26 3:09 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 11/19] readahead: context " Wu Fengguang
2005-11-25 15:12 ` [PATCH 12/19] readahead: other methods Wu Fengguang
2005-11-25 15:12 ` [PATCH 13/19] readahead: detect and rescue live pages Wu Fengguang
2005-11-25 15:12 ` [PATCH 14/19] readahead: events accounting Wu Fengguang
2005-11-25 15:12 ` [PATCH 15/19] readahead: page aging accounting Wu Fengguang
2005-11-25 15:12 ` [PATCH 16/19] readahead: laptop mode support Wu Fengguang
2005-11-25 16:06 ` Bart Samwel
2005-11-26 3:33 ` Wu Fengguang
2005-11-25 15:12 ` [PATCH 17/19] readahead: disable look-ahead for loopback file Wu Fengguang
2005-11-25 15:12 ` [PATCH 18/19] readahead: nfsd support Wu Fengguang
2005-11-25 15:12 ` [PATCH 19/19] io: avoid too much latency from read-ahead Wu Fengguang
2005-11-25 15:43 ` [PATCH 00/19] Adaptive read-ahead V8 Diego Calleja
2005-11-25 19:31 ` Lee Revell
2005-11-26 23:03 ` Mark van der Made
2005-11-27 19:54 ` Lee Revell
2005-11-26 3:17 ` Wu Fengguang
2005-11-26 13:25 ` Tomasz Torcz
2005-11-26 14:11 ` Wu Fengguang