[PATCH 01/13] mm: delayed page activation
In reply to: [PATCH 00/13] Adaptive read-ahead V5
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
When a page is referenced for the second time on the inactive_list, mark it
with PG_activate instead of moving it into the active_list immediately. The
actual moving work is delayed until vmscan time.
This means two essential changes:
- keeps the adjacency of pages in the LRU;
- lifts the maximum page reference count from 1 to 3 (see the walk-through
  sketch after this list).
And leads to the following improvements:
- read-ahead for a leading reader will not be disturbed by a following reader;
- enables the thrashing protection logic to save pages for following readers;
- keeping relevant pages together helps improve I/O efficiency;
- and also helps decrease VM fragmentation;
- the increased refcnt space might help page replacement algorithms.
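For illustration only (this helper is not part of the patch; refcnt_walkthrough()
is a hypothetical function), here is how the new two-bit counter advances under
repeated mark_page_accessed() calls, assuming the page stays on the inactive LRU
in between; the state names match the table added to page-flags.h:

static void refcnt_walkthrough(struct page *page)
{
	/* freshly added to the inactive_list: __ (PAGE_REFCNT_0) */
	mark_page_accessed(page);   /* __ -> _R   PG_referenced set */
	mark_page_accessed(page);   /* _R -> A_   PG_activate set,
	                             *            PG_referenced cleared */
	mark_page_accessed(page);   /* A_ -> AR   PG_referenced set again */

	/*
	 * shrink_list() will later test PG_activate, clear both bits and
	 * move the page to the active_list, so the actual activation is
	 * delayed to vmscan time.
	 */
}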
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/page-flags.h | 31 +++++++++++++++++++++++++++++++
mm/page_alloc.c | 1 +
mm/swap.c | 9 ++++-----
mm/vmscan.c | 6 ++++++
4 files changed, 42 insertions(+), 5 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.14-rc5-mm1/include/linux/page-flags.h
@@ -76,6 +76,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_activate 20 /* delayed activate */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -308,6 +309,12 @@ extern void __mod_page_state(unsigned lo
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageActivate(page) test_bit(PG_activate, &(page)->flags)
+#define SetPageActivate(page) set_bit(PG_activate, &(page)->flags)
+#define ClearPageActivate(page) clear_bit(PG_activate, &(page)->flags)
+#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
+#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
@@ -333,4 +340,28 @@ static inline void set_page_writeback(st
#define ClearPageFsMisc(page) clear_bit(PG_fs_misc, &(page)->flags)
#define TestClearPageFsMisc(page) test_and_clear_bit(PG_fs_misc, &(page)->flags)
+#if PG_activate < PG_referenced
+#error unexpected page flags order
+#endif
+
+#define PAGE_REFCNT_0 0
+#define PAGE_REFCNT_1 (1 << PG_referenced)
+#define PAGE_REFCNT_2 (1 << PG_activate)
+#define PAGE_REFCNT_3 ((1 << PG_activate) | (1 << PG_referenced))
+#define PAGE_REFCNT_MASK PAGE_REFCNT_3
+
+/*
+ * STATUS REFERENCE COUNT
+ * __ 0
+ * _R PAGE_REFCNT_1
+ * A_ PAGE_REFCNT_2
+ * AR PAGE_REFCNT_3
+ *
+ * A/R: Active / Referenced
+ */
+static inline unsigned long page_refcnt(struct page *page)
+{
+ return page->flags & PAGE_REFCNT_MASK;
+}
+
#endif /* PAGE_FLAGS_H */
--- linux-2.6.14-rc5-mm1.orig/mm/page_alloc.c
+++ linux-2.6.14-rc5-mm1/mm/page_alloc.c
@@ -488,6 +488,7 @@ static void prep_new_page(struct page *p
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_activate |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
--- linux-2.6.14-rc5-mm1.orig/mm/swap.c
+++ linux-2.6.14-rc5-mm1/mm/swap.c
@@ -29,7 +29,6 @@
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
-#include <linux/init.h>
/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -117,13 +116,13 @@ void fastcall activate_page(struct page
* Mark a page as having seen activity.
*
* inactive,unreferenced -> inactive,referenced
- * inactive,referenced -> active,unreferenced
- * active,unreferenced -> active,referenced
+ * inactive,referenced -> activate,unreferenced
+ * activate,unreferenced -> activate,referenced
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
+ if (!PageActivate(page) && PageReferenced(page) && PageLRU(page)) {
+ SetPageActivate(page);
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c
@@ -407,6 +407,12 @@ static int shrink_list(struct list_head
if (PageWriteback(page))
goto keep_locked;
+ if (PageActivate(page)) {
+ ClearPageActivate(page);
+ ClearPageReferenced(page);
+ goto activate_locked;
+ }
+
referenced = page_referenced(page, 1, sc->priority <= 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
--
[PATCH 02/13] mm: balance page aging between zones
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Marcelo Tosatti, Magnus Damm, Wu Fengguang
The page aging rates of the zones are currently imbalanced; the gap can be as
large as 3x, which can severely damage read-ahead requests and shorten their
effective lifetime.
This patch adds three variables to struct zone to keep track of the page aging
rate, and keeps them in sync at vmscan/pagealloc time. nr_page_aging is a
per-zone counterpart to the per-cpu pgscan_{kswapd,direct}_{zone name} counters.
The direct page reclaim path is not touched, for it needs non-trivial changes.
That is acceptable as long as (pgscan_kswapd_* > pgscan_direct_*).
__alloc_pages() is changed a lot, which needs confirmation from NUMA gurus.
The basic idea is to do reclaim in batches, and keep the zones inside one
batch balanced. There can be different policies for the partitioning. One
obvious choice for the first batch would be the zones on the current node and
on nearby memory nodes without a CPU.
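As a worked example of the new arithmetic (numbers chosen for illustration, not
taken from the patch): on a 32-bit kernel PAGE_AGE_SHIFT is 12, so page_age is a
fixed-point value in 0..4095. A zone with nr_inactive = 1000 that has scanned
250 pages since its last milestone gets

	page_age = (250 << 12) / (1 + 1000) = 1022

i.e. roughly a quarter of its inactive list has been aged since the milestone.
pages_more_aged(a, b) then reports zone a as more aged than zone b only when
a->page_age is ahead of b->page_age by at least 1 and at most a quarter of the
range (1 << (PAGE_AGE_SHIFT - 2) = 1024), computed modulo the wrap-around;
larger gaps are treated as out of sync and ignored, as the comment in the patch
explains.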
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mmzone.h | 51 +++++++++++++++++++++++++
mm/page_alloc.c | 97 ++++++++++++++++++++++++++++++++++++++++---------
mm/vmscan.c | 32 +++++++++++++---
3 files changed, 158 insertions(+), 22 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/mmzone.h
+++ linux-2.6.14-rc5-mm1/include/linux/mmzone.h
@@ -160,6 +160,20 @@ struct zone {
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
+ /* Fields for balanced page aging:
+ * nr_page_aging - The accumulated number of activities that may
+ * cause page aging, that is, make some pages closer
+ * to the tail of inactive_list.
+ * aging_milestone - A snapshot of nr_page_aging every time a full
+ * inactive_list of pages become aged.
+ * page_age - A normalized value showing the percentage of pages
+ * that have been aged. It is compared between zones to
+ * balance the rate of page aging.
+ */
+ unsigned long nr_page_aging;
+ unsigned long aging_milestone;
+ unsigned long page_age;
+
/*
* Does the allocator try to reclaim pages from the zone as soon
* as it fails a watermark_ok() in __alloc_pages?
@@ -343,6 +357,43 @@ static inline void memory_present(int ni
unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
#endif
+#ifdef CONFIG_HIGHMEM64G
+#define PAGE_AGE_SHIFT 8
+#elif BITS_PER_LONG == 32
+#define PAGE_AGE_SHIFT 12
+#elif BITS_PER_LONG == 64
+#define PAGE_AGE_SHIFT 20
+#else
+#error unknown BITS_PER_LONG
+#endif
+#define PAGE_AGE_MASK ((1 << PAGE_AGE_SHIFT) - 1)
+
+/*
+ * The percent of pages in inactive_list that have been scanned / aged.
+ * It's not really ##%, but a high resolution normalized value.
+ */
+static inline void update_page_age(struct zone *z)
+{
+ if (z->nr_page_aging - z->aging_milestone > z->nr_inactive)
+ z->aging_milestone += z->nr_inactive;
+
+ z->page_age = ((z->nr_page_aging - z->aging_milestone)
+ << PAGE_AGE_SHIFT) / (1 + z->nr_inactive);
+}
+
+/*
+ * The simplified code is:
+ * return (a->page_age > b->page_age);
+ * The complexity deals with the wrap-around problem.
+ * Two page ages not close enough should also be ignored:
+ * they are out of sync and the comparison may be nonsense.
+ */
+static inline int pages_more_aged(struct zone *a, struct zone *b)
+{
+ return ((b->page_age - a->page_age) & PAGE_AGE_MASK) >
+ PAGE_AGE_MASK - (1 << (PAGE_AGE_SHIFT - 2));
+}
+
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
--- linux-2.6.14-rc5-mm1.orig/mm/page_alloc.c
+++ linux-2.6.14-rc5-mm1/mm/page_alloc.c
@@ -488,7 +488,7 @@ static void prep_new_page(struct page *p
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_activate |
+ 1 << PG_activate | 1 << PG_readahead |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
@@ -871,9 +871,15 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
struct task_struct *p = current;
int i;
int classzone_idx;
+ int do_reclaim;
int do_retry;
int can_try_harder;
int did_some_progress;
+ unsigned long zones_mask;
+ int left_count;
+ int batch_size;
+ int batch_base;
+ int batch_idx;
might_sleep_if(wait);
@@ -893,13 +899,62 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
classzone_idx = zone_idx(zones[0]);
-restart:
/*
* Go through the zonelist once, looking for a zone with enough free.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- for (i = 0; (z = zones[i]) != NULL; i++) {
- int do_reclaim = should_reclaim_zone(z, gfp_mask);
+restart:
+ /*
+ * To fulfill three goals:
+ * - balanced page aging
+ * - locality
+ * - predefined zonelist priority
+ *
+ * The logic employs the following rules:
+ * 1. Zones are checked in predefined order in general.
+ * 2. Skip to the next zone if it has lower page_age.
+ * 3. Checkings are carried out in batch, all zones in a batch must be
+ * checked before entering the next batch.
+ * 4. All local zones in the zonelist forms the first batch.
+ */
+
+ /* TODO: Avoid this loop by putting the values into struct zonelist.
+ * The (more general) desired batch counts can also go there.
+ */
+ for (batch_size = 0, i = 0; (z = zones[i]) != NULL; i++) {
+ if (z->zone_pgdat == zones[0]->zone_pgdat)
+ batch_size++;
+ }
+ BUG_ON(!batch_size);
+
+ left_count = i - batch_size;
+ batch_base = 0;
+ batch_idx = 0;
+ zones_mask = 0;
+
+ for (;;) {
+ if (zones_mask == (1 << batch_size) - 1) {
+ if (left_count <= 0) {
+ break;
+ }
+ batch_base += batch_size;
+ batch_size = min(left_count, (int)sizeof(zones_mask) * 8);
+ left_count -= batch_size;
+ batch_idx = 0;
+ zones_mask = 0;
+ }
+
+ do {
+ i = batch_idx;
+ do {
+ if (++batch_idx >= batch_size)
+ batch_idx = 0;
+ } while (zones_mask & (1 << batch_idx));
+ } while (pages_more_aged(zones[batch_base + i],
+ zones[batch_base + batch_idx]));
+
+ zones_mask |= (1 << i);
+ z = zones[batch_base + i];
if (!cpuset_zone_allowed(z, __GFP_HARDWALL))
continue;
@@ -909,11 +964,12 @@ restart:
* will try to reclaim pages and check the watermark a second
* time before giving up and falling back to the next zone.
*/
+ do_reclaim = should_reclaim_zone(z, gfp_mask);
zone_reclaim_retry:
if (!zone_watermark_ok(z, order, z->pages_low,
classzone_idx, 0, 0)) {
if (!do_reclaim)
- continue;
+ goto try_harder;
else {
zone_reclaim(z, gfp_mask, order);
/* Only try reclaim once */
@@ -925,20 +981,18 @@ zone_reclaim_retry:
page = buffered_rmqueue(z, order, gfp_mask);
if (page)
goto got_pg;
- }
- for (i = 0; (z = zones[i]) != NULL; i++)
+try_harder:
wakeup_kswapd(z, order);
- /*
- * Go through the zonelist again. Let __GFP_HIGH and allocations
- * coming from realtime tasks to go deeper into reserves
- *
- * This is the last chance, in general, before the goto nopage.
- * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
- * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
- */
- for (i = 0; (z = zones[i]) != NULL; i++) {
+ /*
+ * Put stress on the zone. Let __GFP_HIGH and allocations
+ * coming from realtime tasks to go deeper into reserves.
+ *
+ * This is the last chance, in general, before the goto nopage.
+ * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
+ * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
+ */
if (!zone_watermark_ok(z, order, z->pages_min,
classzone_idx, can_try_harder,
gfp_mask & __GFP_HIGH))
@@ -1446,6 +1500,8 @@ void show_free_areas(void)
" active:%lukB"
" inactive:%lukB"
" present:%lukB"
+ " aging:%lukB"
+ " age:%lu"
" pages_scanned:%lu"
" all_unreclaimable? %s"
"\n",
@@ -1457,6 +1513,8 @@ void show_free_areas(void)
K(zone->nr_active),
K(zone->nr_inactive),
K(zone->present_pages),
+ K(zone->nr_page_aging),
+ zone->page_age,
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
);
@@ -2073,6 +2131,9 @@ static void __init free_area_init_core(s
zone->nr_scan_inactive = 0;
zone->nr_active = 0;
zone->nr_inactive = 0;
+ zone->nr_page_aging = 0;
+ zone->aging_milestone = 0;
+ zone->page_age = 0;
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
@@ -2221,6 +2282,8 @@ static int zoneinfo_show(struct seq_file
"\n high %lu"
"\n active %lu"
"\n inactive %lu"
+ "\n aging %lu"
+ "\n age %lu"
"\n scanned %lu (a: %lu i: %lu)"
"\n spanned %lu"
"\n present %lu",
@@ -2230,6 +2293,8 @@ static int zoneinfo_show(struct seq_file
zone->pages_high,
zone->nr_active,
zone->nr_inactive,
+ zone->nr_page_aging,
+ zone->page_age,
zone->pages_scanned,
zone->nr_scan_active, zone->nr_scan_inactive,
zone->spanned_pages,
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c
@@ -653,6 +653,8 @@ static void shrink_cache(struct zone *zo
goto done;
max_scan -= nr_scan;
+ zone->nr_page_aging += nr_scan;
+ update_page_age(zone);
if (current_is_kswapd())
mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
else
@@ -1068,6 +1070,7 @@ loop_again:
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
+ int begin_zone = -1;
unsigned long lru_pages = 0;
all_zones_ok = 1;
@@ -1089,16 +1092,33 @@ loop_again:
if (!zone_watermark_ok(zone, order,
zone->pages_high, 0, 0, 0)) {
- end_zone = i;
- goto scan;
+ if (!end_zone)
+ begin_zone = end_zone = i;
+ else /* if (begin_zone == i + 1) */
+ begin_zone = i;
}
}
- goto out;
+ if (begin_zone < 0)
+ goto out;
} else {
+ begin_zone = 0;
end_zone = pgdat->nr_zones - 1;
}
-scan:
- for (i = 0; i <= end_zone; i++) {
+
+ /*
+ * Prepare enough free pages for zones with small page_age,
+ * they are going to be reclaimed in the page allocation.
+ */
+ while (end_zone < pgdat->nr_zones - 1 &&
+ pages_more_aged(pgdat->node_zones + end_zone,
+ pgdat->node_zones + end_zone + 1))
+ end_zone++;
+ while (begin_zone &&
+ pages_more_aged(pgdat->node_zones + begin_zone,
+ pgdat->node_zones + begin_zone - 1))
+ begin_zone--;
+
+ for (i = begin_zone; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
lru_pages += zone->nr_active + zone->nr_inactive;
@@ -1113,7 +1133,7 @@ scan:
* pages behind kswapd's direction of progress, which would
* cause too much scanning of the lower zones.
*/
- for (i = 0; i <= end_zone; i++) {
+ for (i = begin_zone; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
int nr_slab;
--
[PATCH 03/13] radixtree: sync with mainline
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Christoph Lameter, Wu Fengguang
The patch from Christoph Lameter:
[PATCH] radix-tree: Remove unnecessary indirections and clean up code
is only partially merged into the -mm tree. This patch completes it.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
lib/radix-tree.c | 12 +++++-------
1 files changed, 5 insertions(+), 7 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/lib/radix-tree.c
+++ linux-2.6.14-rc5-mm1/lib/radix-tree.c
@@ -286,27 +286,25 @@ static inline void **__lookup_slot(struc
unsigned long index)
{
unsigned int height, shift;
- struct radix_tree_node **slot;
+ struct radix_tree_node *slot;
height = root->height;
if (index > radix_tree_maxindex(height))
return NULL;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
- slot = &root->rnode;
+ slot = root->rnode;
while (height > 0) {
- if (*slot == NULL)
+ if (slot == NULL)
return NULL;
- slot = (struct radix_tree_node **)
- ((*slot)->slots +
- ((index >> shift) & RADIX_TREE_MAP_MASK));
+ slot = slot->slots[(index >> shift) & RADIX_TREE_MAP_MASK];
shift -= RADIX_TREE_MAP_SHIFT;
height--;
}
- return (void **)slot;
+ return slot;
}
/**
--
[PATCH 04/13] radixtree: look-aside cache
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Nick Piggin, Wu Fengguang
This introduces a set of lookup functions to the radix tree for the read-ahead
logic. Other access patterns with high locality may also benefit from them;
a usage sketch follows the function list below.
Most of them are best inlined, so some macros/structs are moved from the .c
file into the .h file.
- radix_tree_lookup_node(root, index, level)
Perform partial lookup, return the @level'th parent of the slot at
@index.
- radix_tree_cache_lookup(root, cache, index)
Perform lookup with the aid of a look-aside cache.
- radix_tree_cache_xxx()
Query various information about the cache.
- radix_tree_lookup_head(root, index, max_scan)
- radix_tree_lookup_tail(root, index, max_scan)
[head, tail) is a segment of contiguous pages. The functions
search for the head and tail index of the segment at @index.
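A minimal usage sketch (not code from this patch; count_present_pages() is a
hypothetical caller) of the intended pattern for radix_tree_cache_lookup():
initialize the cache once, then reuse it for a run of nearby indices while
holding the tree lock, since the cache becomes invalid once the lock is dropped:

static unsigned long count_present_pages(struct address_space *mapping,
					 unsigned long start,
					 unsigned long end)
{
	struct radix_tree_cache cache;
	unsigned long index;
	unsigned long count = 0;

	radix_tree_cache_init(&cache);
	read_lock_irq(&mapping->tree_lock);
	for (index = start; index < end; index++)
		if (radix_tree_cache_lookup(&mapping->page_tree,
					    &cache, index))
			count++;
	read_unlock_irq(&mapping->tree_lock);

	return count;
}

Consecutive lookups that fall into the same bottom-level node are then satisfied
from the cached node without walking the tree again.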
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/radix-tree.h | 149 ++++++++++++++++++++++++++++++++++++++++++++-
lib/radix-tree.c | 142 ++++++++++++++++++++++++++++++++----------
2 files changed, 254 insertions(+), 37 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/radix-tree.h
+++ linux-2.6.14-rc5-mm1/include/linux/radix-tree.h
@@ -22,12 +22,39 @@
#include <linux/preempt.h>
#include <linux/types.h>
+#ifdef __KERNEL__
+#define RADIX_TREE_MAP_SHIFT 6
+#else
+#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
+#endif
+#define RADIX_TREE_TAGS 2
+
+#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
+
+#define RADIX_TREE_TAG_LONGS \
+ ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
+
+struct radix_tree_node {
+ unsigned int count;
+ void *slots[RADIX_TREE_MAP_SIZE];
+ unsigned long tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
+};
+
struct radix_tree_root {
unsigned int height;
unsigned int gfp_mask;
struct radix_tree_node *rnode;
};
+/*
+ * Support access patterns with strong locality.
+ */
+struct radix_tree_cache {
+ unsigned long first_index;
+ struct radix_tree_node *tree_node;
+};
+
#define RADIX_TREE_INIT(mask) { \
.height = 0, \
.gfp_mask = (mask), \
@@ -45,9 +72,13 @@ do { \
} while (0)
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
-void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
-void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_lookup_node(struct radix_tree_root *, unsigned long,
+ unsigned int);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan);
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned long first_index, unsigned int max_items);
@@ -69,4 +100,118 @@ static inline void radix_tree_preload_en
preempt_enable();
}
+/**
+ * radix_tree_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_lookup(struct radix_tree_root *root,
+ unsigned long index)
+{
+ return radix_tree_lookup_node(root, index, 0);
+}
+
+/**
+ * radix_tree_lookup_slot - lookup a slot in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Lookup the slot corresponding to the position @index in the radix tree
+ * @root. This is useful for update-if-exists operations.
+ */
+static inline void **radix_tree_lookup_slot(struct radix_tree_root *root,
+ unsigned long index)
+{
+ struct radix_tree_node *node;
+
+ node = radix_tree_lookup_node(root, index, 1);
+ return node->slots + (index & RADIX_TREE_MAP_MASK);
+}
+
+/**
+ * radix_tree_cache_lookup_node - cached lookup node
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root,
+ * and return the node @level levels from the bottom in the search path.
+ * @cache stores the last accessed upper level tree node by this
+ * function, and is always checked first before searching in the tree.
+ * It can improve speed for access patterns with strong locality.
+ * NOTE:
+ * - The cache becomes invalid on leaving the lock;
+ * - Do not intermix calls with different @level.
+ */
+static inline void *radix_tree_cache_lookup_node(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index, unsigned int level)
+{
+ struct radix_tree_node *node;
+ unsigned long i;
+ unsigned long mask;
+
+ if (level && level >= root->height)
+ return root->rnode;
+
+ i = ((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK);
+ mask = ~((RADIX_TREE_MAP_SIZE << (level * RADIX_TREE_MAP_SHIFT)) - 1);
+
+ if ((index & mask) == cache->first_index)
+ return cache->tree_node->slots[i];
+
+ node = radix_tree_lookup_node(root, index, level + 1);
+ if (!node)
+ return 0;
+
+ cache->tree_node = node;
+ cache->first_index = (index & mask);
+ return node->slots[i];
+}
+
+/**
+ * radix_tree_cache_lookup - cached lookup page
+ * @root: radix tree root
+ * @cache: look-aside cache
+ * @index: index key
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ */
+static inline void *radix_tree_cache_lookup(struct radix_tree_root *root,
+ struct radix_tree_cache *cache,
+ unsigned long index)
+{
+ return radix_tree_cache_lookup_node(root, cache, index, 0);
+}
+
+static inline void radix_tree_cache_init(struct radix_tree_cache *cache)
+{
+ cache->first_index = 0x77;
+}
+
+static inline int radix_tree_cache_size(struct radix_tree_cache *cache)
+{
+ return RADIX_TREE_MAP_SIZE;
+}
+
+static inline int radix_tree_cache_count(struct radix_tree_cache *cache)
+{
+ if (cache->first_index != 0x77)
+ return cache->tree_node->count;
+ else
+ return 0;
+}
+
+static inline int radix_tree_cache_full(struct radix_tree_cache *cache)
+{
+ return radix_tree_cache_count(cache) == radix_tree_cache_size(cache);
+}
+
+static inline int radix_tree_cache_first_index(struct radix_tree_cache *cache)
+{
+ return cache->first_index;
+}
+
#endif /* _LINUX_RADIX_TREE_H */
--- linux-2.6.14-rc5-mm1.orig/lib/radix-tree.c
+++ linux-2.6.14-rc5-mm1/lib/radix-tree.c
@@ -32,25 +32,6 @@
#include <linux/bitops.h>
-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT 6
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
-#endif
-#define RADIX_TREE_TAGS 2
-
-#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
-
-#define RADIX_TREE_TAG_LONGS \
- ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
-
-struct radix_tree_node {
- unsigned int count;
- void *slots[RADIX_TREE_MAP_SIZE];
- unsigned long tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
struct radix_tree_path {
struct radix_tree_node *node;
int offset;
@@ -282,8 +263,21 @@ int radix_tree_insert(struct radix_tree_
}
EXPORT_SYMBOL(radix_tree_insert);
-static inline void **__lookup_slot(struct radix_tree_root *root,
- unsigned long index)
+/**
+ * radix_tree_lookup_node - low level lookup routine
+ * @root: radix tree root
+ * @index: index key
+ * @level: stop at that many levels from bottom
+ *
+ * Lookup the item at the position @index in the radix tree @root.
+ * The return value is:
+ * @level == 0: page at @index;
+ * @level == 1: the corresponding bottom level tree node;
+ * @level < height: (height - @level)th level tree node;
+ * @level >= height: root node.
+ */
+void *radix_tree_lookup_node(struct radix_tree_root *root,
+ unsigned long index, unsigned int level)
{
unsigned int height, shift;
struct radix_tree_node *slot;
@@ -295,7 +289,7 @@ static inline void **__lookup_slot(struc
shift = (height-1) * RADIX_TREE_MAP_SHIFT;
slot = root->rnode;
- while (height > 0) {
+ while (height > level) {
if (slot == NULL)
return NULL;
@@ -306,36 +300,114 @@ static inline void **__lookup_slot(struc
return slot;
}
+EXPORT_SYMBOL(radix_tree_lookup_node);
/**
- * radix_tree_lookup_slot - lookup a slot in a radix tree
+ * radix_tree_lookup_head - lookup the head index
* @root: radix tree root
* @index: index key
+ * @max_scan: max items to scan
*
- * Lookup the slot corresponding to the position @index in the radix tree
- * @root. This is useful for update-if-exists operations.
+ * Lookup head index of the segment which contains @index. A segment is
+ * a set of contiguous pages in a file.
+ * CASE RETURN VALUE
+ * no page at @index (not head) = @index + 1
+ * found in the range @index - @max_scan < (head index) <= @index
+ * not found in range (unfinished head) <= @index - @max_scan
*/
-void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
+unsigned long radix_tree_lookup_head(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan)
{
- return __lookup_slot(root, index);
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ int i;
+ unsigned long origin;
+
+ origin = index;
+ if (unlikely(max_scan > index))
+ max_scan = index;
+ radix_tree_cache_init(&cache);
+
+next_node:
+ if (origin - index > max_scan)
+ goto out;
+
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ goto out;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ if (index < RADIX_TREE_MAP_SIZE) {
+ index = -1;
+ goto out;
+ }
+ index = (index - RADIX_TREE_MAP_SIZE) | RADIX_TREE_MAP_MASK;
+ goto next_node;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i >= 0; i--, index--) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+ goto next_node;
+
+out:
+ return index + 1;
}
-EXPORT_SYMBOL(radix_tree_lookup_slot);
+EXPORT_SYMBOL(radix_tree_lookup_head);
/**
- * radix_tree_lookup - perform lookup operation on a radix tree
+ * radix_tree_lookup_tail - lookup the tail index
* @root: radix tree root
* @index: index key
+ * @max_scan: max items to scan
*
- * Lookup the item at the position @index in the radix tree @root.
+ * Lookup tail (past the end) index of the segment which contains @index.
+ * A segment is a set of contiguous pages in a file.
+ * CASE RETURN VALUE
+ * found in the range @index <= (tail index) < @index + @max_scan
+ * not found in range @index + @max_scan <= (non tail)
*/
-void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
+unsigned long radix_tree_lookup_tail(struct radix_tree_root *root,
+ unsigned long index, unsigned int max_scan)
{
- void **slot;
+ struct radix_tree_cache cache;
+ struct radix_tree_node *node;
+ int i;
+ unsigned long origin;
+
+ origin = index;
+ if (unlikely(index + max_scan < index))
+ max_scan = LONG_MAX - index;
+ radix_tree_cache_init(&cache);
+
+next_node:
+ if (index - origin >= max_scan)
+ goto out;
- slot = __lookup_slot(root, index);
- return slot != NULL ? *slot : NULL;
+ node = radix_tree_cache_lookup_node(root, &cache, index, 1);
+ if (!node)
+ goto out;
+
+ if (node->count == RADIX_TREE_MAP_SIZE) {
+ index = (index | RADIX_TREE_MAP_MASK) + 1;
+ if (unlikely(!index))
+ goto out;
+ goto next_node;
+ }
+
+ for (i = index & RADIX_TREE_MAP_MASK; i < RADIX_TREE_MAP_SIZE; i++, index++) {
+ if (!node->slots[i])
+ goto out;
+ }
+
+ goto next_node;
+
+out:
+ return index;
}
-EXPORT_SYMBOL(radix_tree_lookup);
+EXPORT_SYMBOL(radix_tree_lookup_tail);
/**
* radix_tree_tag_set - set a tag on a radix tree node
--
[PATCH 05/13] readahead: some preparation
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
Some miscellaneous changes that do not fit in anywhere else.
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 157 insertions(+), 8 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -15,13 +15,36 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+/*
+ * Debug facilities.
+ */
+#define DEBUG_READAHEAD
+#ifdef DEBUG_READAHEAD
+
+#define dprintk(args...) \
+ if (readahead_ratio & 1) printk(KERN_DEBUG args)
+#define ddprintk(args...) \
+ if ((readahead_ratio & 3) == 3) printk(KERN_DEBUG args)
+
+#else /* DEBUG_READAHEAD */
+
+#define dprintk(args...) do {} while(0)
+#define ddprintk(args...) do {} while(0)
+
+#endif /* DEBUG_READAHEAD */
+
+
+/* The default max/min read-ahead pages. */
+#define MAX_RA_PAGES (VM_MAX_READAHEAD >> (PAGE_CACHE_SHIFT - 10))
+#define MIN_RA_PAGES (VM_MIN_READAHEAD >> (PAGE_CACHE_SHIFT - 10))
+
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
}
EXPORT_SYMBOL(default_unplug_io_fn);
struct backing_dev_info default_backing_dev_info = {
- .ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
+ .ra_pages = MAX_RA_PAGES,
.state = 0,
.capabilities = BDI_CAP_MAP_COPY,
.unplug_io_fn = default_unplug_io_fn,
@@ -50,7 +73,7 @@ static inline unsigned long get_max_read
static inline unsigned long get_min_readahead(struct file_ra_state *ra)
{
- return (VM_MIN_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+ return MIN_RA_PAGES;
}
static inline void ra_off(struct file_ra_state *ra)
@@ -255,10 +278,11 @@ out:
*/
static int
__do_page_cache_readahead(struct address_space *mapping, struct file *filp,
- unsigned long offset, unsigned long nr_to_read)
+ unsigned long offset, unsigned long nr_to_read,
+ unsigned long lookahead_size)
{
struct inode *inode = mapping->host;
- struct page *page;
+ struct page *page = NULL;
unsigned long end_index; /* The last page we want to read */
LIST_HEAD(page_pool);
int page_idx;
@@ -268,7 +292,7 @@ __do_page_cache_readahead(struct address
if (isize == 0)
goto out;
- end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+ end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
/*
* Preallocate as many pages as we will need.
@@ -291,6 +315,9 @@ __do_page_cache_readahead(struct address
break;
page->index = page_offset;
list_add(&page->lru, &page_pool);
+ if (readahead_ratio > 9 &&
+ page_idx == nr_to_read - lookahead_size)
+ SetPageReadahead(page);
ret++;
}
read_unlock_irq(&mapping->tree_lock);
@@ -327,7 +354,7 @@ int force_page_cache_readahead(struct ad
if (this_chunk > nr_to_read)
this_chunk = nr_to_read;
err = __do_page_cache_readahead(mapping, filp,
- offset, this_chunk);
+ offset, this_chunk, 0);
if (err < 0) {
ret = err;
break;
@@ -374,7 +401,7 @@ int do_page_cache_readahead(struct addre
if (bdi_read_congested(mapping->backing_dev_info))
return -1;
- return __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
}
/*
@@ -394,7 +421,10 @@ blockable_page_cache_readahead(struct ad
if (!block && bdi_read_congested(mapping->backing_dev_info))
return 0;
- actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read);
+ actual = __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
+
+ dprintk("blockable-readahead(ino=%lu, ra=%lu+%lu) = %d\n",
+ mapping->host->i_ino, offset, nr_to_read, actual);
return check_ra_success(ra, nr_to_read, actual);
}
@@ -559,3 +589,122 @@ unsigned long max_sane_readahead(unsigne
__get_zone_counts(&active, &inactive, &free, NODE_DATA(numa_node_id()));
return min(nr, (inactive + free) / 2);
}
+
+/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressures.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ * 1. state based - the default one
+ * 2. context based - the fail safe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In every read-ahead chunk, it selects one page and tag it with PG_readahead.
+ * Later when the page with PG_readahead is to be read, the logic knows that
+ * it's time to carry out the next read-ahead chunk in advance.
+ *
+ * a read-ahead chunk
+ * +-----------------------------------------+
+ * | # PG_readahead |
+ * +-----------------------------------------+
+ * ^ When this page is read, we submit I/O for the next read-ahead.
+ *
+ *
+ * Here are some variable names used frequently:
+ *
+ * |<------- la_size ------>|
+ * +-----------------------------------------+
+ * | # |
+ * +-----------------------------------------+
+ * ra_index -->|<---------------- ra_size -------------->|
+ *
+ */
+
+#define next_page(pg) (list_entry((pg)->lru.prev, struct page, lru))
+#define prev_page(pg) (list_entry((pg)->lru.next, struct page, lru))
+
+/*
+ * The nature of read-ahead allows most tests to fail or even be wrong.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+ unsigned long offset)
+{
+ return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+struct page *find_page(struct address_space *mapping, unsigned long offset)
+{
+ struct page *page;
+
+ read_lock_irq(&mapping->tree_lock);
+ page = __find_page(mapping, offset);
+ read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (page)
+ BUG_ON(page->index != offset);
+#endif
+ return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static int rescue_pages(struct page *page, unsigned long nr_pages)
+{
+ unsigned long pgrescue;
+ unsigned long index;
+ struct address_space *mapping;
+ struct zone *zone;
+
+ BUG_ON(!nr_pages || !page);
+ pgrescue = 0;
+ index = page_index(page);
+ mapping = page_mapping(page);
+
+ dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+ mapping->host->i_ino, index, nr_pages);
+
+ for(;;) {
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ if (!PageLRU(page))
+ goto out_unlock;
+
+ while (page_mapping(page) == mapping &&
+ page_index(page) == index) {
+ struct page *the_page = page;
+ page = next_page(page);
+ if (!PageActive(the_page) &&
+ !PageActivate(the_page) &&
+ !PageLocked(the_page) &&
+ page_count(the_page) == 1) {
+ list_move(&the_page->lru, &zone->inactive_list);
+ pgrescue++;
+ }
+ index++;
+ if (!--nr_pages)
+ goto out_unlock;
+ }
+
+ spin_unlock_irq(&zone->lru_lock);
+
+ page = find_page(mapping, index);
+ if (!page)
+ goto out;
+ }
+out_unlock:
+ spin_unlock_irq(&zone->lru_lock);
+out:
+ ra_account(0, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+
+ return nr_pages ? index : 0;
+}
--
[PATCH 06/13] readahead: call scheme
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
A new page flag, PG_readahead, is introduced as a look-ahead mark.
The look-ahead mark corresponds to `ahead_start' of the current logic.
The read-ahead logic is called when
- a read reaches a look-ahead mark;
- a read hits a non-present page.
In addition, ra_access() is called on every page reference to maintain the
cache_hit counter. A distilled caller-side sketch follows the benefit list
below.
This scheme has the following benefits:
- makes all stateful/stateless methods happy;
- eliminates the cache hit problem naturally;
- lives in harmony with application managed read-aheads via
fadvise/madvise.
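A distilled sketch of the call scheme (illustrative only; read_one_page() is a
hypothetical caller, the real call sites are in do_generic_mapping_read() and
the mmap fault path below): the adaptive logic runs on a cache miss or when the
PG_readahead mark is crossed, never on a plain cache hit:

static struct page *read_one_page(struct address_space *mapping,
				  struct file_ra_state *ra,
				  struct file *filp,
				  pgoff_t index, pgoff_t last_index)
{
	struct page *page = find_get_page(mapping, index);

	if (!page) {
		/* cache miss: may be a random or a sequential read */
		page_cache_readahead_adaptive(mapping, ra, filp,
					      NULL, NULL,
					      index, index, last_index);
		page = find_get_page(mapping, index);
	} else if (PageReadahead(page)) {
		/* look-ahead mark hit: submit the next chunk in advance */
		page_cache_readahead_adaptive(mapping, ra, filp,
					      NULL, page,
					      index, index, last_index);
	}

	if (page)
		ra_access(ra, page);	/* maintain the cache_hit counter */
	return page;
}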
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 7 +
include/linux/page-flags.h | 5 +
mm/filemap.c | 66 +++++++++++++++---
mm/readahead.c | 162 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 229 insertions(+), 11 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.14-rc5-mm1/include/linux/page-flags.h
@@ -77,6 +77,7 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
#define PG_activate 20 /* delayed activate */
+#define PG_readahead 21 /* check readahead when reading this page */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -315,6 +316,10 @@ extern void __mod_page_state(unsigned lo
#define TestClearPageActivate(page) test_and_clear_bit(PG_activate, &(page)->flags)
#define TestSetPageActivate(page) test_and_set_bit(PG_activate, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
+#define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- linux-2.6.14-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.14-rc5-mm1/include/linux/mm.h
@@ -944,6 +944,13 @@ unsigned long page_cache_readahead(stru
void handle_ra_miss(struct address_space *mapping,
struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ unsigned long first_index,
+ unsigned long index, unsigned long last_index);
+void fastcall ra_access(struct file_ra_state *ra, struct page *page);
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-2.6.14-rc5-mm1.orig/mm/filemap.c
+++ linux-2.6.14-rc5-mm1/mm/filemap.c
@@ -724,6 +724,8 @@ grab_cache_page_nowait(struct address_sp
EXPORT_SYMBOL(grab_cache_page_nowait);
+extern int readahead_ratio;
+
/*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level
@@ -751,10 +753,12 @@ void do_generic_mapping_read(struct addr
unsigned long prev_index;
loff_t isize;
struct page *cached_page;
+ struct page *prev_page;
int error;
struct file_ra_state ra = *_ra;
cached_page = NULL;
+ prev_page = NULL;
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
@@ -783,16 +787,36 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- if (index == next_index)
+
+ if (readahead_ratio <= 9 && index == next_index)
next_index = page_cache_readahead(mapping, &ra, filp,
index, last_index - index);
find_page:
page = find_get_page(mapping, index);
+ if (readahead_ratio > 9) {
+ if (unlikely(page == NULL)) {
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, NULL,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ page = find_get_page(mapping, index);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, &ra,
+ filp, prev_page, page,
+ *ppos >> PAGE_CACHE_SHIFT,
+ index, last_index);
+ }
+ }
if (unlikely(page == NULL)) {
- handle_ra_miss(mapping, &ra, index);
+ if (readahead_ratio <= 9)
+ handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
+ ra_access(&ra, page);
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -808,8 +832,9 @@ page_ok:
* When (part of) the same page is read multiple times
* in succession, only mark it as accessed the first time.
*/
- if (prev_index != index)
+ if (prev_index != index) {
mark_page_accessed(page);
+ }
prev_index = index;
/*
@@ -827,7 +852,6 @@ page_ok:
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
- page_cache_release(page);
if (ret == nr && desc->count)
continue;
goto out;
@@ -839,7 +863,6 @@ page_not_up_to_date:
/* Did it get unhashed before we got the lock? */
if (!page->mapping) {
unlock_page(page);
- page_cache_release(page);
continue;
}
@@ -864,7 +887,6 @@ readpage:
* invalidate_inode_pages got it
*/
unlock_page(page);
- page_cache_release(page);
goto find_page;
}
unlock_page(page);
@@ -885,7 +907,6 @@ readpage:
isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
if (unlikely(!isize || index > end_index)) {
- page_cache_release(page);
goto out;
}
@@ -894,7 +915,6 @@ readpage:
if (index == end_index) {
nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
if (nr <= offset) {
- page_cache_release(page);
goto out;
}
}
@@ -904,7 +924,6 @@ readpage:
readpage_error:
/* UHHUH! A synchronous read error occurred. Report it */
desc->error = error;
- page_cache_release(page);
goto out;
no_cached_page:
@@ -929,15 +948,22 @@ no_cached_page:
}
page = cached_page;
cached_page = NULL;
+ if (prev_page)
+ page_cache_release(prev_page);
+ prev_page = page;
goto readpage;
}
out:
*_ra = ra;
+ if (readahead_ratio > 9)
+ _ra->prev_page = prev_index;
*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
if (cached_page)
page_cache_release(cached_page);
+ if (prev_page)
+ page_cache_release(prev_page);
if (filp)
file_accessed(filp);
}
@@ -1235,19 +1261,33 @@ retry_all:
*
* For sequential accesses, we use the generic readahead logic.
*/
- if (VM_SequentialReadHint(area))
+ if (readahead_ratio <= 9 && VM_SequentialReadHint(area))
page_cache_readahead(mapping, ra, file, pgoff, 1);
+
/*
* Do we have something in the page cache already?
*/
retry_find:
page = find_get_page(mapping, pgoff);
+ if (VM_SequentialReadHint(area) && readahead_ratio > 9) {
+ if (!page) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, NULL,
+ pgoff, pgoff, pgoff + 1);
+ page = find_get_page(mapping, pgoff);
+ } else if (PageReadahead(page)) {
+ page_cache_readahead_adaptive(mapping, ra,
+ file, NULL, page,
+ pgoff, pgoff, pgoff + 1);
+ }
+ }
if (!page) {
unsigned long ra_pages;
if (VM_SequentialReadHint(area)) {
- handle_ra_miss(mapping, ra, pgoff);
+ if (readahead_ratio <= 9)
+ handle_ra_miss(mapping, ra, pgoff);
goto no_cached_page;
}
ra->mmap_miss++;
@@ -1284,6 +1324,8 @@ retry_find:
if (!did_readaround)
ra->mmap_hit++;
+ ra_access(ra, page);
+
/*
* Ok, found a page in the page cache, now we need to check
* that it's up-to-date.
@@ -1298,6 +1340,8 @@ success:
mark_page_accessed(page);
if (type)
*type = majmin;
+ if (readahead_ratio > 9)
+ ra->prev_page = page->index;
return page;
outside_data_content:
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -15,6 +15,38 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+/* Detailed classification of read-ahead behaviors. */
+#define RA_CLASS_SHIFT 3
+#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
+enum ra_class {
+ RA_CLASS_ALL,
+ RA_CLASS_NEWFILE,
+ RA_CLASS_STATE,
+ RA_CLASS_CONTEXT,
+ RA_CLASS_CONTEXT_ACCELERATED,
+ RA_CLASS_BACKWARD,
+ RA_CLASS_RANDOM_THRASHING,
+ RA_CLASS_RANDOM_SEEK,
+ RA_CLASS_END,
+};
+
+/* Read-ahead events to be accounted. */
+enum ra_event {
+ RA_EVENT_CACHE_MISS, /* read cache misses */
+ RA_EVENT_READRANDOM, /* random reads */
+
+ RA_EVENT_READAHEAD, /* read-ahead issued */
+ RA_EVENT_READAHEAD_HIT, /* read-ahead page hit */
+ RA_EVENT_LOOKAHEAD, /* look-ahead issued */
+ RA_EVENT_LOOKAHEAD_HIT, /* look-ahead mark hit */
+ RA_EVENT_READAHEAD_EOF, /* read-ahead reaches EOF */
+ RA_EVENT_READAHEAD_SHRINK, /* ra_size decreased, reflects var. */
+ RA_EVENT_READAHEAD_THRASHING, /* read-ahead thrashing happened */
+ RA_EVENT_READAHEAD_RESCUE, /* read-ahead rescued */
+
+ RA_EVENT_END
+};
+
/*
* Debug facilities.
*/
@@ -708,3 +740,133 @@ out:
return nr_pages ? index : 0;
}
+
+/*
+ * This is the entry point of the adaptive read-ahead logic.
+ *
+ * It is only called on two conditions:
+ * 1. page == NULL
+ * A cache miss happened, it can be either a random read or a sequential one.
+ * 2. page != NULL
+ * There is a look-ahead mark(PG_readahead) from a previous sequential read.
+ * It's time to do some checking and submit the next read-ahead IO.
+ *
+ * That has the merits of:
+ * - makes all stateful/stateless methods happy;
+ * - eliminates the cache hit problem naturally;
+ * - lives in harmony with application managed read-aheads via fadvise/madvise.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *prev_page, struct page *page,
+ unsigned long begin_index,
+ unsigned long index, unsigned long end_index)
+{
+ unsigned long size;
+ unsigned long ra_min;
+ unsigned long ra_max;
+ int ret;
+
+ if (page) {
+ if(!TestClearPageReadahead(page))
+ return 0;
+ if (bdi_read_congested(mapping->backing_dev_info))
+ return 0;
+ }
+
+ if (page)
+ ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+ ra->readahead_index - ra->lookahead_index);
+ else if (index)
+ ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+ size = end_index - index;
+ get_readahead_bounds(ra, &ra_min, &ra_max);
+
+ /* readahead disabled? */
+ if (unlikely(!ra_min || !readahead_ratio)) {
+ size = max_sane_readahead(size);
+ goto readit;
+ }
+
+ /*
+ * Start of file.
+ */
+ if (index == 0)
+ return newfile_readahead(mapping, filp, ra, end_index, ra_min);
+
+ /*
+ * State based sequential read-ahead.
+ */
+ if ((readahead_ratio % 5) == 0 &&
+ index == ra->lookahead_index &&
+ (page || index == ra->readahead_index) &&
+ (ra_cache_hit_ok(ra) ||
+ end_index - begin_index >= ra_max))
+ return state_based_readahead(mapping, filp, ra, page, ra_max);
+
+ /*
+ * Backward read-ahead.
+ */
+ if (try_read_backward(ra, begin_index, end_index, size, ra_min, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Context based sequential read-ahead.
+ */
+ ret = try_context_based_readahead(mapping, ra, prev_page, page,
+ index, size, ra_min, ra_max);
+ if (ret > 0)
+ return ra_dispatch(ra, mapping, filp);
+ if (ret < 0)
+ return 0;
+
+ /* No action on look ahead time? */
+ if (page)
+ return 0;
+
+ /*
+ * Random read that follows a sequential one.
+ */
+ if (try_random_readahead(ra, index, size, ra_max))
+ return ra_dispatch(ra, mapping, filp);
+
+ /*
+ * Random read.
+ */
+ if (size > ra_max)
+ size = ra_max;
+
+readit:
+ size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+ ra_account(ra, RA_EVENT_READRANDOM, size);
+ dprintk("readrandom(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+ mapping->host->i_ino, mapping->nrpages,
+ begin_index, index, end_index, size);
+
+ return size;
+}
+
+/*
+ * Call me!
+ */
+void fastcall ra_access(struct file_ra_state *ra, struct page *page)
+{
+ if (page->flags & ((1 << PG_active) |
+ (1 << PG_activate) |
+ (1 << PG_referenced)))
+ return;
+
+ if (!ra_has_index(ra, page->index))
+ return;
+
+ ra->cache_hit++;
+
+ if (page->index >= ra->ra_index)
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+ else
+ ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+
--
[PATCH 07/13] readahead: tunable parameters
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
- new entry in /proc/sys/vm/readahead_ratio;
- limit loopdev file read-ahead size to 32KB;
- limit mmap read-around size to 256KB;
- dynamic read-ahead min/max bounds (a worked example follows this list).
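To make the dynamic bounds concrete, here is a worked example using the formulas
documented in get_readahead_bounds() below (the 256MB figure is chosen only for
illustration): with 4KB pages and 256MB of inactive + free memory, the initial
read-ahead size is min(16 + 256/16, 64) = 32KB and the maximum is
min(64 + 256*64, 2048) = 2048KB, matching the 256MB column of the table in the
patch. The actual maximum is further clamped by the per-file ra_pages value.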
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
Documentation/sysctl/vm.txt | 14 ++++++++++++++
drivers/block/loop.c | 5 +++++
include/linux/mm.h | 5 ++++-
include/linux/sysctl.h | 1 +
kernel/sysctl.c | 11 +++++++++++
mm/filemap.c | 7 +++++++
mm/readahead.c | 31 +++++++++++++++++++++++++++++++
7 files changed, 73 insertions(+), 1 deletion(-)
--- linux-2.6.14-rc5-mm1.orig/Documentation/sysctl/vm.txt
+++ linux-2.6.14-rc5-mm1/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/
- laptop_mode
- block_dump
- swap_prefetch
+- readahead_ratio
==============================================================
@@ -114,3 +115,16 @@ except when laptop_mode is enabled and t
Setting it to 0 disables prefetching entirely.
The default value is dependant on ramsize.
+
+==============================================================
+
+readahead_ratio
+
+This limits the read-ahead size to a percentage of the thrashing-threshold.
+The thrashing-threshold is dynamically estimated according to the _history_
+read speed and system load, and is used to limit the _future_ read-ahead
+request size. So you should set it to a low value if you do not have enough
+memory to counteract the I/O load fluctuation.
+
+The default value is 50.
--- linux-2.6.14-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.14-rc5-mm1/include/linux/mm.h
@@ -927,11 +927,14 @@ extern int filemap_populate(struct vm_ar
int write_one_page(struct page *page, int wait);
/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MAX_READAHEAD 1024 /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */
#define VM_MAX_CACHE_HIT 256 /* max pages in a row in cache before
* turning readahead off */
+/* turn on read-ahead thrashing protection if (readahead_ratio >= ##) */
+#define VM_READAHEAD_PROTECT_RATIO 80
+
int do_page_cache_readahead(struct address_space *mapping, struct file *filp,
unsigned long offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
--- linux-2.6.14-rc5-mm1.orig/include/linux/sysctl.h
+++ linux-2.6.14-rc5-mm1/include/linux/sysctl.h
@@ -182,6 +182,7 @@ enum
VM_LEGACY_VA_LAYOUT=27, /* legacy/compatibility virtual address space layout */
VM_SWAP_TOKEN_TIMEOUT=28, /* default time for token time out */
VM_SWAP_PREFETCH=29, /* int: amount to swap prefetch */
+ VM_READAHEAD_RATIO=30, /* percent of read-ahead size to thrashing-threshold */
};
--- linux-2.6.14-rc5-mm1.orig/kernel/sysctl.c
+++ linux-2.6.14-rc5-mm1/kernel/sysctl.c
@@ -67,6 +67,7 @@ extern int min_free_kbytes;
extern int printk_ratelimit_jiffies;
extern int printk_ratelimit_burst;
extern int pid_max_min, pid_max_max;
+extern int readahead_ratio;
#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
int unknown_nmi_panic;
@@ -869,6 +870,16 @@ static ctl_table vm_table[] = {
},
#endif
#endif
+ {
+ .ctl_name = VM_READAHEAD_RATIO,
+ .procname = "readahead_ratio",
+ .data = &readahead_ratio,
+ .maxlen = sizeof(readahead_ratio),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ .strategy = &sysctl_intvec,
+ .extra1 = &zero,
+ },
{ .ctl_name = 0 }
};
--- linux-2.6.14-rc5-mm1.orig/drivers/block/loop.c
+++ linux-2.6.14-rc5-mm1/drivers/block/loop.c
@@ -768,6 +768,11 @@ static int loop_set_fd(struct loop_devic
mapping = file->f_mapping;
inode = mapping->host;
+ /*
+ * The upper layer should already do proper read-ahead,
+ * unlimited read-ahead here only ruins the cache hit rate.
+ */
+ file->f_ra.ra_pages = 32 >> (PAGE_CACHE_SHIFT - 10);
if (!(file->f_mode & FMODE_WRITE))
lo_flags |= LO_FLAGS_READ_ONLY;
--- linux-2.6.14-rc5-mm1.orig/mm/filemap.c
+++ linux-2.6.14-rc5-mm1/mm/filemap.c
@@ -1312,6 +1312,13 @@ retry_find:
if (ra_pages) {
pgoff_t start = 0;
+ /*
+ * Max read-around should be much smaller than
+ * max read-ahead.
+ * How about adding a tunable parameter for this?
+ */
+ if (ra_pages > 64)
+ ra_pages = 64;
if (pgoff > ra_pages / 2)
start = pgoff - ra_pages / 2;
do_page_cache_readahead(mapping, file, start, ra_pages);
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -15,6 +15,13 @@
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+/* Set look-ahead size to 1/8 of the thrashing-threshold. */
+#define LOOKAHEAD_RATIO 8
+
+/* Set read-ahead size to ##% of the thrashing-threshold. */
+int readahead_ratio = 0;
+EXPORT_SYMBOL(readahead_ratio);
+
/* Detailed classification of read-ahead behaviors. */
#define RA_CLASS_SHIFT 3
#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
@@ -742,6 +749,30 @@ out:
}
/*
+ * ra_size is mainly determined by:
+ * 1. sequential-start: min(KB(16 + mem_mb/16), KB(64))
+ * 2. sequential-max: min(KB(64 + mem_mb*64), KB(2048))
+ * 3. sequential: (thrashing-threshold) * readahead_ratio / 100
+ *
+ * Table of concrete numbers for 4KB page size:
+ * (inactive + free) (in MB): 4 8 16 32 64 128 256 512 1024
+ * initial ra_size (in KB): 16 16 16 16 20 24 32 48 64
+ * max ra_size (in KB): 320 576 1088 2048 2048 2048 2048 2048 2048
+ */
+static inline void get_readahead_bounds(struct file_ra_state *ra,
+ unsigned long *ra_min,
+ unsigned long *ra_max)
+{
+ unsigned long mem_mb;
+
+#define KB(size) (((size) * 1024) / PAGE_CACHE_SIZE)
+ mem_mb = nr_free_inactive() * PAGE_CACHE_SIZE / 1024 / 1024;
+ *ra_max = min(min(KB(64 + mem_mb*64), KB(2048)), ra->ra_pages);
+ *ra_min = min(min(KB(VM_MIN_READAHEAD + mem_mb/16), KB(128)), *ra_max/2);
+#undef KB
+}
+
+/*
* This is the entry point of the adaptive read-ahead logic.
*
* It is only called on two conditions:
--
[PATCH 08/13] readahead: state based method
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
This is the fast code path.
Major steps:
- estimate a thrashing-safe ra_size;
- assemble the next read-ahead request in file->f_ra (see the sketch after
  this list);
- submit it.
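A minimal sketch of the second step, assuming the four new indexes advance
exactly as in the chunk A / chunk B figure added to mm/readahead.c below
(ra_advance_window() is a hypothetical helper, not the patch's actual code):

static void ra_advance_window(struct file_ra_state *ra,
			      unsigned long ra_size,
			      unsigned long la_size)
{
	/* the previous ahead chunk becomes the current chunk */
	ra->la_index = ra->lookahead_index;
	ra->ra_index = ra->readahead_index;

	/* plan a new ahead chunk of ra_size pages, with the
	 * PG_readahead mark placed la_size pages before its end */
	ra->readahead_index = ra->ra_index + ra_size;
	ra->lookahead_index = ra->readahead_index - la_size;
}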
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/fs.h | 8 +
mm/readahead.c | 333 +++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/swap.c | 3
mm/vmscan.c | 5
4 files changed, 348 insertions(+), 1 deletion(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/fs.h
+++ linux-2.6.14-rc5-mm1/include/linux/fs.h
@@ -564,13 +564,19 @@ struct file_ra_state {
unsigned long start; /* Current window */
unsigned long size;
unsigned long flags; /* ra flags RA_FLAG_xxx*/
- unsigned long cache_hit; /* cache hit count*/
+ uint64_t cache_hit; /* cache hit count*/
unsigned long prev_page; /* Cache last read() position */
unsigned long ahead_start; /* Ahead window */
unsigned long ahead_size;
unsigned long ra_pages; /* Maximum readahead window */
unsigned long mmap_hit; /* Cache hit stat for mmap accesses */
unsigned long mmap_miss; /* Cache miss stat for mmap accesses */
+
+ unsigned long age;
+ unsigned long la_index;
+ unsigned long ra_index;
+ unsigned long lookahead_index;
+ unsigned long readahead_index;
};
#define RA_FLAG_MISS 0x01 /* a cache miss occured against this file */
#define RA_FLAG_INCACHE 0x02 /* file is already in cache */
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c
@@ -370,6 +370,8 @@ static pageout_t pageout(struct page *pa
return PAGE_CLEAN;
}
+DECLARE_PER_CPU(unsigned long, smooth_aging);
+
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
@@ -417,6 +419,8 @@ static int shrink_list(struct list_head
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
goto activate_locked;
+ if (!referenced)
+ __get_cpu_var(smooth_aging)++;
#ifdef CONFIG_SWAP
/*
@@ -774,6 +778,7 @@ refill_inactive_zone(struct zone *zone,
list_add(&page->lru, &l_active);
continue;
}
+ __get_cpu_var(smooth_aging)++;
}
list_add(&page->lru, &l_inactive);
}
--- linux-2.6.14-rc5-mm1.orig/mm/swap.c
+++ linux-2.6.14-rc5-mm1/mm/swap.c
@@ -112,6 +112,8 @@ void fastcall activate_page(struct page
spin_unlock_irq(&zone->lru_lock);
}
+DECLARE_PER_CPU(unsigned long, smooth_aging);
+
/*
* Mark a page as having seen activity.
*
@@ -126,6 +128,7 @@ void fastcall mark_page_accessed(struct
ClearPageReferenced(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
+ __get_cpu_var(smooth_aging)++;
}
}
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -22,6 +22,12 @@
int readahead_ratio = 0;
EXPORT_SYMBOL(readahead_ratio);
+/* Analog to nr_page_aging.
+ * But it is mainly increased on fresh page references, so it is much smoother.
+ */
+DEFINE_PER_CPU(unsigned long, smooth_aging);
+EXPORT_PER_CPU_SYMBOL(smooth_aging);
+
/* Detailed classification of read-ahead behaviors. */
#define RA_CLASS_SHIFT 3
#define RA_CLASS_MASK ((1 << RA_CLASS_SHIFT) - 1)
@@ -749,6 +755,333 @@ out:
}
/*
+ * State based calculation of read-ahead request.
+ *
+ * This figure shows the meaning of file_ra_state members:
+ *
+ * chunk A chunk B
+ * +---------------------------+-------------------------------------------+
+ * | # | # |
+ * +---------------------------+-------------------------------------------+
+ * ^ ^ ^ ^
+ * la_index ra_index lookahead_index readahead_index
+ */
+
+/*
+ * The global effective length of the inactive_list(s).
+ */
+static unsigned long nr_free_inactive(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].nr_inactive +
+ zones[i].free_pages - zones[i].pages_low;
+
+ return sum;
+}
+
+/*
+ * The accumulated count of pages pushed into inactive_list(s).
+ */
+static unsigned long nr_page_aging(void)
+{
+ unsigned int i;
+ unsigned long sum = 0;
+ struct zone *zones = NODE_DATA(numa_node_id())->node_zones;
+
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ sum += zones[i].nr_page_aging;
+
+ return sum;
+}
+
+/*
+ * A much smoother analog to nr_page_aging.
+ */
+static unsigned long nr_smooth_aging(void)
+{
+ unsigned long cpu;
+ unsigned long sum = 0;
+ cpumask_t mask = node_to_cpumask(numa_node_id());
+
+ for_each_cpu_mask(cpu, mask)
+ sum += per_cpu(smooth_aging, cpu);
+
+ return sum;
+}
+
+/*
+ * Set class of read-ahead
+ */
+static inline void set_ra_class(struct file_ra_state *ra,
+ enum ra_class ra_class)
+{
+ ra->flags <<= RA_CLASS_SHIFT;
+ ra->flags += ra_class;
+}
+
+/*
+ * The 64bit cache_hit stores three accumulated values and one counter value.
+ * MSB LSB
+ * 3333333333333333 : 2222222222222222 : 1111111111111111 : 0000000000000000
+ */
+static inline int ra_cache_hit(struct file_ra_state *ra, int nr)
+{
+ return (ra->cache_hit >> (nr * 16)) & 0xFFFF;
+}
+
+static inline void ra_addup_cache_hit(struct file_ra_state *ra)
+{
+ int n;
+
+ n = ra_cache_hit(ra, 0);
+ ra->cache_hit -= n;
+ n <<= 16;
+ ra->cache_hit += n;
+}
+
+/*
+ * The read-ahead is deemed success if cache-hit-rate > 50%.
+ */
+static inline int ra_cache_hit_ok(struct file_ra_state *ra)
+{
+ return ra_cache_hit(ra, 0) * 2 > (ra->lookahead_index - ra->la_index);
+}
+
+/*
+ * Check if @index falls in the ra request.
+ */
+static inline int ra_has_index(struct file_ra_state *ra, unsigned long index)
+{
+ if (index < ra->la_index || index >= ra->readahead_index)
+ return 0;
+
+ if (index >= ra->ra_index)
+ return 1;
+ else
+ return -1;
+}
+
+/*
+ * Prepare file_ra_state for a new read-ahead sequence.
+ */
+static inline void ra_state_init(struct file_ra_state *ra,
+ unsigned long la_index, unsigned long ra_index)
+{
+ ra_addup_cache_hit(ra);
+ ra->cache_hit <<= 16;
+ ra->lookahead_index = la_index;
+ ra->readahead_index = ra_index;
+}
+
+/*
+ * Take down a new read-ahead request in file_ra_state.
+ */
+static inline void ra_state_update(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+#ifdef DEBUG_READAHEAD
+ unsigned long old_ra = ra->readahead_index - ra->ra_index;
+ if (ra_size < old_ra && ra_cache_hit(ra, 0))
+ ra_account(ra, RA_EVENT_READAHEAD_SHRINK, old_ra - ra_size);
+#endif
+ ra_addup_cache_hit(ra);
+ ra->ra_index = ra->readahead_index;
+ ra->la_index = ra->lookahead_index;
+ ra->readahead_index += ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+ ra->age = nr_smooth_aging();
+}
+
+/*
+ * Adjust the read-ahead request in file_ra_state.
+ */
+static inline void ra_state_adjust(struct file_ra_state *ra,
+ unsigned long ra_size, unsigned long la_size)
+{
+ ra->readahead_index = ra->ra_index + ra_size;
+ ra->lookahead_index = ra->readahead_index - la_size;
+}
+
+/*
+ * Submit IO for the read-ahead request in file_ra_state.
+ */
+static int ra_dispatch(struct file_ra_state *ra,
+ struct address_space *mapping, struct file *filp)
+{
+ unsigned long eof_index;
+ unsigned long ra_size;
+ unsigned long la_size;
+ int actual;
+ enum ra_class ra_class;
+
+ ra_class = (ra->flags & RA_CLASS_MASK);
+ BUG_ON(ra_class == 0 || ra_class > RA_CLASS_END);
+
+ eof_index = ((i_size_read(mapping->host) - 1) >> PAGE_CACHE_SHIFT) + 1;
+ ra_size = ra->readahead_index - ra->ra_index;
+ la_size = ra->readahead_index - ra->lookahead_index;
+
+ /* Snap to EOF. */
+ if (unlikely(ra->ra_index >= eof_index))
+ return 0;
+ if (ra->readahead_index + ra_size / 2 > eof_index) {
+ if (ra_class == RA_CLASS_CONTEXT_ACCELERATED &&
+ eof_index > ra->lookahead_index + 1)
+ la_size = eof_index - ra->lookahead_index;
+ else
+ la_size = 0;
+ ra_size = eof_index - ra->ra_index;
+ ra_state_adjust(ra, ra_size, la_size);
+ }
+
+ actual = __do_page_cache_readahead(mapping, filp,
+ ra->ra_index, ra_size, la_size);
+
+ if (ra->readahead_index == eof_index)
+ ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
+ if (la_size)
+ ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ ra_account(ra, RA_EVENT_READAHEAD, actual);
+
+ dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
+ ra_class_name[ra_class],
+ mapping->host->i_ino, ra->la_index,
+ ra->ra_index, ra_size, la_size, actual);
+
+ return actual;
+}
+
+/*
+ * Determine the request parameters from primitive values.
+ *
+ * It applies the following rules:
+ * - Subtract the old look-ahead from ra_size to get the real safe read-ahead;
+ * - Set new la_size according to the (still large) ra_size;
+ * - Apply upper limits;
+ * - Make sure stream_shift is not too small.
+ * (So that the next global_shift will not be too small.)
+ *
+ * Input:
+ * ra_size stores the estimated thrashing-threshold.
+ * la_size stores the look-ahead size of previous request.
+ */
+static inline int adjust_rala(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ unsigned long stream_shift = *la_size;
+
+ if (*ra_size > *la_size)
+ *ra_size -= *la_size;
+ else
+ return 0;
+
+ *la_size = *ra_size / LOOKAHEAD_RATIO;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ stream_shift += (*ra_size - *la_size);
+ if (stream_shift < *ra_size / 4)
+ *la_size -= (*ra_size / 4 - stream_shift);
+
+ return 1;
+}
+
+/*
+ * The function estimates two values:
+ * 1. thrashing-threshold for the current stream
+ * It is returned to make the next read-ahead request.
+ * 2. the remaining space for the current chunk
+ * It will be checked to ensure that the current chunk is safe.
+ *
+ * The computation will be pretty accurate under heavy load, and will change
+ * vastly with light load (small global_shift), so the growth speed of ra_size
+ * must be limited, and a moderately large stream_shift must be ensured.
+ *
+ * This figure illustrates the formula:
+ * While the stream reads stream_shift pages inside the chunks,
+ * the chunks are shifted global_shift pages inside inactive_list.
+ *
+ * chunk A chunk B
+ * |<=============== global_shift ================|
+ * +-------------+ +-------------------+ |
+ * | # | | # | inactive_list |
+ * +-------------+ +-------------------+ head |
+ * |---->| |---------->|
+ * | |
+ * +-- stream_shift --+
+ */
+static inline unsigned long compute_thrashing_threshold(
+ struct file_ra_state *ra,
+ unsigned long *remain)
+{
+ unsigned long global_size;
+ unsigned long global_shift;
+ unsigned long stream_shift;
+ unsigned long ra_size;
+
+ global_size = nr_free_inactive();
+ global_shift = nr_smooth_aging() - ra->age;
+ stream_shift = ra_cache_hit(ra, 0);
+
+ ra_size = stream_shift *
+ global_size * readahead_ratio / (100 * global_shift);
+
+ if (global_size > global_shift)
+ *remain = stream_shift *
+ (global_size - global_shift) / global_shift;
+ else
+ *remain = 0;
+
+ ddprintk("compute_thrashing_threshold: "
+ "ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
+ ra_size, stream_shift, global_size, global_shift,
+ *remain, ra->readahead_index - ra->lookahead_index);
+
+ return ra_size;
+}
+
+/*
+ * Main function for file_ra_state based read-ahead.
+ */
+static inline unsigned long
+state_based_readahead(struct address_space *mapping, struct file *filp,
+ struct file_ra_state *ra, struct page *page,
+ unsigned long ra_max)
+{
+ unsigned long ra_old;
+ unsigned long ra_size;
+ unsigned long la_size;
+ unsigned long remain_space;
+
+ la_size = ra->readahead_index - ra->lookahead_index;
+ ra_old = ra->readahead_index - ra->ra_index;
+ ra_size = compute_thrashing_threshold(ra, &remain_space);
+
+ if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+ remain_space <= la_size && la_size > 1) {
+ rescue_pages(page, la_size);
+ return 0;
+ }
+
+ if (!adjust_rala(min(ra_max, 2 * ra_old + (ra_max - ra_old) / 16),
+ &ra_size, &la_size))
+ return 0;
+
+ set_ra_class(ra, RA_CLASS_STATE);
+ ra_state_update(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+
+/*
* ra_size is mainly determined by:
* 1. sequential-start: min(KB(16 + mem_mb/16), KB(64))
* 2. sequential-max: min(KB(64 + mem_mb*64), KB(2048))
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 09/13] readahead: context based method
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (7 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 08/13] readahead: state based method Wu Fengguang
@ 2005-10-29 6:02 ` Wu Fengguang
2005-10-29 6:02 ` [PATCH 10/13] readahead: other methods Wu Fengguang
` (4 subsequent siblings)
13 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-method-context.patch --]
[-- Type: text/plain, Size: 10398 bytes --]
This is the slow code path.
No valid state info is available, so the page cache is queried to obtain the
required position/timing information.
Major steps:
- look back/forward to find the ra_index;
- look back to estimate a thrashing-safe ra_size (sketched below);
- assemble the next read-ahead request in file->f_ra;
- submit it.
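The look-back estimate can be pictured with a toy model: walk backwards from
the current read position and count how many of the stream's history pages
are still in the page cache. Since history pages are evicted at the global
aging speed, that run length is a conservative thrashing threshold. The sketch
below is a simplified user-space model with a faked page cache; the real
query_page_cache() walks the radix tree and handles many more cases:

#include <stdio.h>

#define FILE_PAGES 64

/* toy page cache: nonzero means the page is still present */
static char cache[FILE_PAGES];

/* count the consecutive history pages surviving just before @offset */
static unsigned long history_pages(unsigned long offset, unsigned long max_scan)
{
	unsigned long count = 0;

	while (count < max_scan && offset && cache[--offset])
		count++;
	return count;
}

int main(void)
{
	unsigned long i;

	for (i = 10; i < 30; i++)       /* pretend pages 10..29 were read and kept */
		cache[i] = 1;

	/* the next read starts at page 30 */
	printf("thrashing-safe ra_size: %lu pages\n",
	       history_pages(30, FILE_PAGES));  /* prints 20 */
	return 0;
}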
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 336 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 335 insertions(+), 1 deletion(-)
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -1080,6 +1080,340 @@ state_based_readahead(struct address_spa
return ra_dispatch(ra, mapping, filp);
}
+/*
+ * Page cache context based estimation of read-ahead/look-ahead size/index.
+ *
+ * The logic first looks backward in the inactive_list to get an estimation of
+ * the thrashing-threshold, and then, if necessary, looks forward to determine
+ * the start point of next read-ahead.
+ *
+ * The estimation theory can be illustrated with figure:
+ *
+ * chunk A chunk B chunk C head
+ *
+ * l01 l11 l12 l21 l22
+ *| |-->|-->| |------>|-->| |------>|
+ *| +-------+ +-----------+ +-------------+ |
+ *| | # | | # | | # | |
+ *| +-------+ +-----------+ +-------------+ |
+ *| |<==============|<===========================|<============================|
+ * L0 L1 L2
+ *
+ * Let f(l) = L be a map from
+ * l: the number of pages read by the stream
+ * to
+ * L: the number of pages pushed into inactive_list in the mean time
+ * then
+ * f(l01) <= L0
+ * f(l11 + l12) = L1
+ * f(l21 + l22) = L2
+ * ...
+ * f(l01 + l11 + ...) <= Sum(L0 + L1 + ...)
+ * <= Length(inactive_list) = f(thrashing-threshold)
+ *
+ * So the count of continuous history pages left in the inactive_list is always
+ * a lower bound of the true thrashing-threshold.
+ */
+
+/*
+ * STATUS REFERENCE COUNT TYPE
+ * A__ 0 not in inactive list
+ * ___ 0 fresh
+ * __R PAGE_REFCNT_1 stale
+ * _a_ PAGE_REFCNT_2 disturbed once
+ * _aR PAGE_REFCNT_3 disturbed twice
+ *
+ * A/a/R: Active / aCTIVATE / Referenced
+ */
+static inline unsigned long cold_page_refcnt(struct page *page)
+{
+ if (!page || PageActive(page))
+ return 0;
+
+ return page_refcnt(page);
+}
+
+static inline char page_refcnt_symbol(struct page *page)
+{
+ if (!page)
+ return 'X';
+ if (PageActive(page))
+ return 'A';
+ switch (page_refcnt(page)) {
+ case 0:
+ return '_';
+ case PAGE_REFCNT_1:
+ return '-';
+ case PAGE_REFCNT_2:
+ return '=';
+ case PAGE_REFCNT_3:
+ return '#';
+ }
+ return '?';
+}
+
+/*
+ * Count/estimate cache hits in range [first_index, last_index].
+ * The estimation is simple and a bit optimistic.
+ */
+static int count_cache_hit(struct address_space *mapping,
+ unsigned long first_index, unsigned long last_index)
+{
+ static int steps[8] = {0, 4, 2, 6, 1, 3, 5, 7};
+ struct page *page;
+ int size = last_index - first_index + 1;
+ int count = 0;
+ int i;
+
+ read_lock_irq(&mapping->tree_lock);
+
+ for (i = 0; i < 8;) {
+ page = __find_page(mapping,
+ first_index + size * steps[i++] / 8);
+ if (cold_page_refcnt(page) >= PAGE_REFCNT_1 && ++count >= 2)
+ break;
+ }
+
+ read_unlock_irq(&mapping->tree_lock);
+
+ return size * count / i;
+}
+
+/*
+ * Look back and check history pages to estimate thrashing-threshold.
+ */
+static int query_page_cache(struct address_space *mapping,
+ unsigned long *remain, unsigned long offset,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ int step;
+ int count;
+ unsigned long index;
+ unsigned long nr_lookback;
+ struct radix_tree_cache cache;
+
+ /*
+ * Scan backward and check the near @ra_max pages.
+ * The count here determines ra_size.
+ */
+ read_lock_irq(&mapping->tree_lock);
+ index = radix_tree_lookup_head(&mapping->page_tree, offset, ra_max);
+ read_unlock_irq(&mapping->tree_lock);
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (index <= offset) {
+ WARN_ON(!find_page(mapping, index));
+ if (index + ra_max > offset)
+ WARN_ON(find_page(mapping, index - 1));
+ } else {
+ BUG_ON(index > offset + 1);
+ WARN_ON(find_page(mapping, offset));
+ }
+#endif
+
+ *remain = offset - index + 1;
+
+ if (unlikely(*remain <= ra_min)) {
+ count = ra_min;
+ goto out;
+ }
+
+ count = count_cache_hit(mapping, index, offset);
+ if (count < ra_min)
+ count = ra_min;
+ if (unlikely(count * 2 < offset - index))
+ goto out;
+
+ if (*remain < ra_max)
+ goto out;
+
+ /*
+ * Check the far pages coarsely.
+ * The big count here helps increase la_size.
+ */
+ nr_lookback = ra_max * (LOOKAHEAD_RATIO + 1) *
+ 100 / (readahead_ratio + 1);
+ if (nr_lookback > offset)
+ nr_lookback = offset;
+
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (step = 2 * ra_max; step < nr_lookback; step += ra_max) {
+ struct radix_tree_node *node;
+ node = radix_tree_cache_lookup_node(&mapping->page_tree,
+ &cache, offset - step, 1);
+ if (!node)
+ break;
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ if (node != radix_tree_lookup_node(&mapping->page_tree,
+ offset - step, 1)) {
+ read_unlock_irq(&mapping->tree_lock);
+ printk(KERN_ERR "check radix_tree_cache_lookup_node!\n");
+ return 1;
+ }
+#endif
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ /*
+ * For sequential read that extends from index 0, the counted value
+ * may well be far under the true threshold, so return it unmodified
+ * for further processing in adjust_rala_accelerated().
+ */
+ if (step < offset)
+ count = step * readahead_ratio / 100;
+ else
+ count = offset;
+
+out:
+ return count;
+}
+
+/*
+ * Scan backward in the file for the first non-present page.
+ */
+static inline unsigned long first_absent_page_bw(struct address_space *mapping,
+ unsigned long index, unsigned long max_scan)
+{
+ struct radix_tree_cache cache;
+ struct page *page;
+ unsigned long origin;
+
+ origin = index;
+ if (max_scan > index)
+ max_scan = index;
+ radix_tree_cache_init(&cache);
+ read_lock_irq(&mapping->tree_lock);
+ for (;;) {
+ page = radix_tree_cache_lookup(&mapping->page_tree,
+ &cache, --index);
+ if (page) {
+ index++;
+ break;
+ }
+ if (origin - index > max_scan)
+ break;
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ return index;
+}
+
+/*
+ * Scan forward in the file for the first non-present page.
+ */
+static inline unsigned long first_absent_page(struct address_space *mapping,
+ unsigned long index, unsigned long max_scan)
+{
+ unsigned long ra_index;
+
+ read_lock_irq(&mapping->tree_lock);
+ ra_index = radix_tree_lookup_tail(&mapping->page_tree,
+ index + 1, max_scan);
+ read_unlock_irq(&mapping->tree_lock);
+
+#ifdef DEBUG_READAHEAD_RADIXTREE
+ BUG_ON(ra_index <= index);
+ if (index + max_scan > index) {
+ if (ra_index <= index + max_scan)
+ WARN_ON(find_page(mapping, ra_index));
+ WARN_ON(!find_page(mapping, ra_index - 1));
+ }
+#endif
+
+ if (ra_index <= index + max_scan)
+ return ra_index;
+ else
+ return 0;
+}
+
+/*
+ * Determine the request parameters for context based read-ahead that extends
+ * from start of file.
+ *
+ * The major weakness of the stateless method is perhaps the slow growth of
+ * ra_size. The logic tries to make up for this in the important case of
+ * sequential reads that extend from start of file. In this case, the ra_size
+ * is not chosen to make the whole next chunk safe (as in normal cases); only
+ * half of it is safe. The added 'unsafe' half is the look-ahead part. It
+ * is expected to be safeguarded by rescue_pages() when the previous chunks are
+ * lost.
+ */
+static inline int adjust_rala_accelerated(unsigned long ra_max,
+ unsigned long *ra_size, unsigned long *la_size)
+{
+ if (*ra_size <= *la_size)
+ return 0;
+
+ *la_size = (*ra_size - *la_size) * readahead_ratio / 100;
+ *ra_size = *la_size * 2;
+
+ if (*ra_size > ra_max)
+ *ra_size = ra_max;
+ if (*la_size > *ra_size)
+ *la_size = *ra_size;
+
+ return 1;
+}
+
+/*
+ * Main function for page context based read-ahead.
+ */
+static inline int
+try_context_based_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct page *prev_page, struct page *page,
+ unsigned long index,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ unsigned long ra_index;
+ unsigned long la_size;
+ unsigned long ra_size;
+ unsigned long remain_pages;
+
+ /* Where to start read-ahead?
+ * NFSv3 daemons may process adjacent requests in parallel,
+ * leading to many locally disordered, globally sequential reads.
+ * So do not require nearby history pages to be present or accessed.
+ */
+ if (page) {
+ ra_index = first_absent_page(mapping, index, ra_max * 5 / 4);
+ if (unlikely(!ra_index))
+ return -1;
+ } else if (!prev_page) {
+ ra_index = first_absent_page_bw(mapping, index, ra_min);
+ if (index - ra_index > ra_min)
+ return 0;
+ ra_min += index - ra_index;
+ index = ra_index;
+ } else
+ ra_index = index;
+
+ ra_size = query_page_cache(mapping, &remain_pages,
+ index - 1, ra_min, ra_max);
+
+ la_size = ra_index - index;
+ if (readahead_ratio < VM_READAHEAD_PROTECT_RATIO &&
+ remain_pages <= la_size && la_size > 1) {
+ rescue_pages(page, la_size);
+ return -1;
+ }
+
+ if (ra_size == index) {
+ if (!adjust_rala_accelerated(ra_max, &ra_size, &la_size))
+ return -1;
+ set_ra_class(ra, RA_CLASS_CONTEXT_ACCELERATED);
+ } else {
+ if (!adjust_rala(ra_max, &ra_size, &la_size))
+ return -1;
+ set_ra_class(ra, RA_CLASS_CONTEXT);
+ }
+
+ ra_state_init(ra, index, ra_index);
+ ra_state_update(ra, ra_size, la_size);
+
+ return 1;
+}
+
/*
* ra_size is mainly determined by:
@@ -1180,7 +1514,7 @@ page_cache_readahead_adaptive(struct add
* Context based sequential read-ahead.
*/
ret = try_context_based_readahead(mapping, ra, prev_page, page,
- index, size, ra_min, ra_max);
+ index, ra_min, ra_max);
if (ret > 0)
return ra_dispatch(ra, mapping, filp);
if (ret < 0)
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 10/13] readahead: other methods
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (8 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 09/13] readahead: context " Wu Fengguang
@ 2005-10-29 6:02 ` Wu Fengguang
2005-10-29 6:02 ` [PATCH 11/13] readahead: thrashing protection Wu Fengguang
` (3 subsequent siblings)
13 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-method-others.patch --]
[-- Type: text/plain, Size: 3585 bytes --]
Various read-ahead strategies for:
- fresh read from start of file (a sizing sketch follows this list)
- backward prefetching
- seek-and-read-one-record pattern (db workload)
- quick recovery from thrashing
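As a concrete case, the start-of-file strategy (newfile_readahead() in this
patch) sizes the first window at four times the requested read size and places
the look-ahead mark at twice the requested size, both bounded by ra_min. A
user-space rendering of that sizing rule with assumed numbers, for
illustration only:

#include <stdio.h>

int main(void)
{
	unsigned long ra_min   = 8;     /* assumed lower read-ahead bound, in pages */
	unsigned long req_size = 2;     /* assumed pages wanted by the first read() */
	unsigned long ra_size, la_size;

	if (req_size > ra_min)          /* never start bigger than ra_min allows */
		req_size = ra_min;

	ra_size = 4 * req_size;         /* a moderately large first window */
	la_size = 2 * req_size;         /* trigger the next request early */

	printf("first window: %lu pages, look-ahead %lu pages before its end\n",
	       ra_size, la_size);       /* prints 8 and 4 */
	return 0;
}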
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 108 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 108 insertions(+)
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -1414,6 +1414,114 @@ try_context_based_readahead(struct addre
return 1;
}
+/*
+ * Read-ahead on start of file.
+ *
+ * It is most important for small files.
+ * 1. Set a moderate large read-ahead size;
+ * 2. Issue the next read-ahead request as soon as possible.
+ *
+ * But be careful, there are some applications that dip into only the very head
+ * of a file. The most important thing is to prevent them from triggering the
+ * next (much larger) read-ahead request, which leads to lots of cache misses.
+ * Two pages should be enough for them, correct me if I'm wrong.
+ */
+static inline unsigned long
+newfile_readahead(struct address_space *mapping,
+ struct file *filp, struct file_ra_state *ra,
+ unsigned long req_size, unsigned long ra_min)
+{
+ unsigned long ra_size;
+ unsigned long la_size;
+
+ if (req_size > ra_min)
+ req_size = ra_min;
+
+ ra_size = 4 * req_size;
+ la_size = 2 * req_size;
+
+ set_ra_class(ra, RA_CLASS_NEWFILE);
+ ra_state_init(ra, 0, 0);
+ ra_state_update(ra, ra_size, la_size);
+
+ return ra_dispatch(ra, mapping, filp);
+}
+
+/*
+ * Backward prefetching.
+ * No look ahead and thrashing threshold estimation for stepping backward
+ * pattern: should be unnecessary.
+ */
+static inline int
+try_read_backward(struct file_ra_state *ra,
+ unsigned long begin_index, unsigned long end_index,
+ unsigned long ra_size,
+ unsigned long ra_min, unsigned long ra_max)
+{
+ if (ra_size > ra_max || end_index > ra->prev_page)
+ return 0;
+
+ if (ra_has_index(ra, ra->prev_page)) {
+ if (end_index > ra->la_index)
+ return 0;
+ ra_size += 2 * ra_cache_hit(ra, 0);
+ end_index = ra->la_index;
+ } else {
+ ra_size += ra_min;
+ end_index = ra->prev_page;
+ }
+
+ if (ra_size > ra_max)
+ ra_size = ra_max;
+
+ if (end_index > begin_index + ra_size)
+ return 0;
+
+ begin_index = end_index - ra_size;
+
+ set_ra_class(ra, RA_CLASS_BACKWARD);
+ ra_state_init(ra, begin_index, begin_index);
+ ra_state_update(ra, ra_size, 0);
+
+ return 1;
+}
+
+/*
+ * If there is a previous sequential read, it is likely to be another
+ * sequential read at the new position.
+ * Databases are known to have this seek-and-read-one-record pattern.
+ */
+static inline int
+try_random_readahead(struct file_ra_state *ra, unsigned long index,
+ unsigned long ra_size, unsigned long ra_max)
+{
+ unsigned long hit0 = ra_cache_hit(ra, 0);
+ unsigned long hit1 = ra_cache_hit(ra, 1) + hit0;
+ unsigned long hit2 = ra_cache_hit(ra, 2);
+ unsigned long hit3 = ra_cache_hit(ra, 3);
+
+ if (!ra_has_index(ra, ra->prev_page))
+ return 0;
+
+ if (index == ra->prev_page + 1) { /* read after thrashing */
+ ra_size = hit0;
+ set_ra_class(ra, RA_CLASS_RANDOM_THRASHING);
+ ra_account(ra, RA_EVENT_READAHEAD_THRASHING,
+ ra->readahead_index - index);
+ } else if (ra_size < hit1 && /* read after seeking */
+ hit1 > hit2 / 2 &&
+ hit2 > hit3 / 2 &&
+ hit3 > hit1 / 2) {
+ ra_size = min(ra_max, hit1);
+ set_ra_class(ra, RA_CLASS_RANDOM_SEEK);
+ } else
+ return 0;
+
+ ra_state_init(ra, index, index);
+ ra_state_update(ra, ra_size, 0);
+
+ return 1;
+}
/*
* ra_size is mainly determined by:
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 11/13] readahead: thrashing protection
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (9 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 10/13] readahead: other methods Wu Fengguang
@ 2005-10-29 6:02 ` Wu Fengguang
2005-10-29 6:02 ` [PATCH 12/13] readahead: events accounting Wu Fengguang
` (2 subsequent siblings)
13 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-thrashing-protection.patch --]
[-- Type: text/plain, Size: 13856 bytes --]
It tries to identify and protect live pages (sequential pages that are going
to be accessed in the near future) at vmscan time.
Almost all live pages can be sorted out and saved. The only problem is that
relevant pages can be mutilated into tiny chunks and ignored as random pages.
The cost: dead pages that won't be read will be kept in inactive_list for
another round. Hopefully there won't be many.
This feature is in great demand by file servers, though it should be useless
for desktops. It saves the case when one fast reader flushes the cache and
strips the read-ahead size of slow readers, or the case where caching is just
a waste of memory. Then you can raise readahead_ratio to 200 or more to
enforce a larger read-ahead size and strip the useless cached data out of the lru.
Its use is currently limited, for the context based method is not ready for
that case. This problem will be solved when the timing info of evicted pages
is available. It can also be extended to further benefit large memory
systems. I'll leave them as future work. A toy illustration of the live-head
pattern used by the detection follows.
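The detection keys off per-page reference counts: a chunk whose counts are
high for the pages already read and drop for the pages still ahead has a
reader in the middle of it, and everything from that drop onwards is worth
saving. Below is a toy user-space scan of such a pattern, for illustration
only; the real rescue_ra_pages()/save_chunk() walk the reclaim list and cover
many corner cases:

#include <stdio.h>

int main(void)
{
	/* per-page reference counts of one read-ahead chunk, oldest page first */
	int refcnt[] = { 1, 1, 1, 1, 0, 0, 0, 0 };
	int n = sizeof(refcnt) / sizeof(refcnt[0]);
	int i, live_head = -1;

	/* the first drop in refcnt marks where the reader currently is */
	for (i = 1; i < n; i++) {
		if (refcnt[i] < refcnt[i - 1]) {
			live_head = i;
			break;
		}
	}

	if (live_head >= 0)
		printf("reading in progress: save pages %d..%d\n",
		       live_head, n - 1);       /* prints 4..7 */
	else
		printf("no live head: treat the chunk as dead or random\n");
	return 0;
}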
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
include/linux/mm.h | 2
include/linux/page-flags.h | 2
mm/page_alloc.c | 2
mm/readahead.c | 319 ++++++++++++++++++++++++++++++++++++++++++++-
mm/vmscan.c | 8 +
5 files changed, 332 insertions(+), 1 deletion(-)
--- linux-2.6.14-rc5-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.14-rc5-mm1/include/linux/page-flags.h
@@ -107,6 +107,8 @@ struct page_state {
unsigned long pgfree; /* page freeings */
unsigned long pgactivate; /* pages moved inactive->active */
unsigned long pgdeactivate; /* pages moved active->inactive */
+ unsigned long pgkeephot; /* pages sent back to active */
+ unsigned long pgkeepcold; /* pages sent back to inactive */
unsigned long pgfault; /* faults (major+minor) */
unsigned long pgmajfault; /* faults (major only) */
--- linux-2.6.14-rc5-mm1.orig/include/linux/mm.h
+++ linux-2.6.14-rc5-mm1/include/linux/mm.h
@@ -954,6 +954,8 @@ page_cache_readahead_adaptive(struct add
unsigned long first_index,
unsigned long index, unsigned long last_index);
void fastcall ra_access(struct file_ra_state *ra, struct page *page);
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list);
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
--- linux-2.6.14-rc5-mm1.orig/mm/page_alloc.c
+++ linux-2.6.14-rc5-mm1/mm/page_alloc.c
@@ -2389,6 +2389,8 @@ static char *vmstat_text[] = {
"pgfree",
"pgactivate",
"pgdeactivate",
+ "pgkeephot",
+ "pgkeepcold",
"pgfault",
"pgmajfault",
--- linux-2.6.14-rc5-mm1.orig/mm/vmscan.c
+++ linux-2.6.14-rc5-mm1/mm/vmscan.c
@@ -370,6 +370,7 @@ static pageout_t pageout(struct page *pa
return PAGE_CLEAN;
}
+extern int readahead_ratio;
DECLARE_PER_CPU(unsigned long, smooth_aging);
/*
@@ -380,10 +381,14 @@ static int shrink_list(struct list_head
LIST_HEAD(ret_pages);
struct pagevec freed_pvec;
int pgactivate = 0;
+ int pgkeep = 0;
int reclaimed = 0;
cond_resched();
+ if (readahead_ratio >= VM_READAHEAD_PROTECT_RATIO)
+ rescue_ra_pages(page_list, &ret_pages);
+
pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
struct address_space *mapping;
@@ -569,11 +574,13 @@ keep_locked:
keep:
list_add(&page->lru, &ret_pages);
BUG_ON(PageLRU(page));
+ pgkeep++;
}
list_splice(&ret_pages, page_list);
if (pagevec_count(&freed_pvec))
__pagevec_release_nonlru(&freed_pvec);
mod_page_state(pgactivate, pgactivate);
+ mod_page_state(pgkeepcold, pgkeep - pgactivate);
sc->nr_reclaimed += reclaimed;
return reclaimed;
}
@@ -837,6 +844,7 @@ refill_inactive_zone(struct zone *zone,
mod_page_state_zone(zone, pgrefill, pgscanned);
mod_page_state(pgdeactivate, pgdeactivate);
+ mod_page_state(pgkeephot, pgmoved);
}
/*
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -345,7 +345,7 @@ __do_page_cache_readahead(struct address
read_lock_irq(&mapping->tree_lock);
for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
unsigned long page_offset = offset + page_idx;
-
+
if (page_offset > end_index)
break;
@@ -1676,3 +1676,320 @@ void fastcall ra_access(struct file_ra_s
ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
}
+/*
+ * Detect and protect live read-ahead pages.
+ *
+ * This function provides a safety guarantee for file servers with big
+ * readahead_ratio(>=VM_READAHEAD_PROTECT_RATIO) set. The goal is to save all
+ * and only the sequential pages that are to be accessed in the near future.
+ *
+ * This function is called when pages in @page_list are to be freed,
+ * it protects live read-ahead pages by moving them into @save_list.
+ *
+ * The general idea is to classify pages of a file into random pages and groups
+ * of sequentially accessed pages. Random pages and dead sequential pages are
+ * left over; live sequential pages are saved.
+ *
+ * Live read-ahead pages are defined as sequential pages that have reading in
+ * progress. They are detected by reference count pattern of:
+ *
+ * live head live pages
+ * ra pages group --> ------------___________________
+ * [ pages to save ] (*)
+ *
+ * (*) for now, an extra page from the live head may also be saved.
+ *
+ * In practice, the group of pages is fragmented into chunks. To tell whether
+ * pages inside a chunk are alive, we must check:
+ * 1) Are there any live heads inside the chunk?
+ * 2) Are there any live heads in the group before the chunk?
+ * 3) Special case: live head just sits on the boundary of the current chunk?
+ *
+ * The detailed rules employed must ensure:
+ * - no page is pinned in inactive_list.
+ * - no excessive pages are saved.
+ *
+ * A picture of common cases:
+ * back search chunk case
+ * -----___________|[____________________] Normal
+ * ----------------|----[________________] Normal
+ * |----[________________] Normal
+ * ----------------|---------------------- Normal
+ * |---------------------- Normal
+ * ________________|______________________ ra miss
+ * |______________________ ra miss
+ * ________________|_______--------[_____] two readers
+ * ----____________|[______--------______] two readers
+ * |_______--------[_____] two readers
+ * |----[____------______] two readers
+ * ----------------|----[____------______] two readers
+ * _______---------|---------------[_____] two readers
+ * ----___---------|[--------------______] two readers
+ * ________________|---------------[_____] two readers
+ * ----____________|[--------------______] two readers
+ * ====------------|[---_________________] two readers
+ * |====[----------______] two readers
+ * |###======[-----------] three readers
+ *
+ * Read backward pattern support is possible, in which case the pages should be
+ * pushed into inactive_list in reverse order.
+ *
+ * The two special cases are awkwardly dealt with for now. They will be all set
+ * when the timing information of recently evicted pages is available.
+ * Dead pages can also be purged earlier with the timing info.
+ */
+static int save_chunk(struct page *head, struct page *live_head,
+ struct page *tail, struct list_head *save_list)
+{
+ struct page *page;
+ struct address_space *mapping;
+ struct radix_tree_cache cache;
+ int i;
+ unsigned long index;
+ unsigned long refcnt;
+
+#ifdef DEBUG_READAHEAD
+ static char static_buf[PAGE_SIZE];
+ static char *zone_names[1 << ZONES_SHIFT] = {
+ "DMA", "DMA32", "Normal", "HighMem", "Z5", "Z6", "Z7" };
+ char *pat = static_buf;
+ unsigned long pidx = PAGE_SIZE / 2;
+
+ if ((readahead_ratio & 3) == 3) {
+ pat = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!pat)
+ pat = static_buf;
+ }
+#endif
+
+#define LIVE_PAGE_SCAN (4 * MAX_RA_PAGES)
+ index = head->index;
+ refcnt = page_refcnt(head);
+ mapping = head->mapping;
+ radix_tree_cache_init(&cache);
+
+ BUG_ON(!mapping); /* QUESTION: in what case mapping will be NULL ? */
+ read_lock_irq(&mapping->tree_lock);
+
+ /*
+ * Common case test:
+ * Does the far end indicates a leading live head?
+ */
+ index = radix_tree_lookup_head(&mapping->page_tree,
+ index, LIVE_PAGE_SCAN);
+ page = __find_page(mapping, index);
+ if (cold_page_refcnt(page) > refcnt) {
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3) {
+ pat[--pidx] = '.';
+ pat[--pidx] = '.';
+ pat[--pidx] = '.';
+ pat[--pidx] = page_refcnt_symbol(page);
+ pat[--pidx] = '|';
+ }
+#endif
+ live_head = head;
+ goto skip_scan_locked;
+ }
+
+ /*
+ * Special case 1:
+ * If @head is a live head, rescue_ra_pages() will not detect it.
+ * Check it here.
+ */
+ index = head->index;
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache, --index);
+ if (!page || PageActive(page)) {
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3)
+ pat[--pidx] = page_refcnt_symbol(page);
+#endif
+ goto skip_scan_locked;
+ }
+ if (refcnt > page_refcnt(next_page(head)) &&
+ page_refcnt(page) > page_refcnt(next_page(head))) {
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3)
+ pat[--pidx] = page_refcnt_symbol(page);
+#endif
+ live_head = head;
+ goto skip_scan_locked;
+ }
+
+ /*
+ * Scan backward to see if the whole chunk should be saved.
+ * It can be costly. But can be made rare in future.
+ */
+ for (i = LIVE_PAGE_SCAN; i >= 0; i--) {
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+ --index);
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3 && pidx)
+ pat[--pidx] = page_refcnt_symbol(page);
+#endif
+
+ if (!page)
+ break;
+
+ /* Avoid being pinned by active page. */
+ if (unlikely(PageActive(page)))
+ break;
+
+ if (page_refcnt(page) > refcnt) { /* So we are alive! */
+ live_head = head;
+ break;
+ }
+
+ refcnt = page_refcnt(page);
+ }
+
+skip_scan_locked:
+ /*
+ * Special case 2:
+ * Save one extra page if it is a live head of the following chunk.
+ * Just to be safe. It protects the rare situation when the reader
+ * is just crossing the chunk boundary, and the following chunk is not
+ * far away from tail of inactive_list.
+ */
+ if (live_head != head) {
+ struct page *last_page = prev_page(tail);
+ page = radix_tree_cache_lookup(&mapping->page_tree, &cache,
+ last_page->index + 1);
+ if (page && !live_head) {
+ refcnt = page_refcnt(last_page);
+ if (page_refcnt(page) >= refcnt)
+ page = radix_tree_cache_lookup(
+ &mapping->page_tree, &cache,
+ last_page->index + 2);
+ if (page && page_refcnt(page) < refcnt)
+ live_head = last_page;
+ } else if (!page && live_head)
+ live_head = next_page(live_head);
+ }
+
+ read_unlock_irq(&mapping->tree_lock);
+
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3) {
+ for (i = 0; pidx < PAGE_SIZE / 2;)
+ pat[i++] = pat[pidx++];
+ pat[i++] = '|';
+ for (page = head; page != tail; page = next_page(page)) {
+ pidx = page->index;
+ if (page == live_head)
+ pat[i++] = '[';
+ pat[i++] = page_refcnt_symbol(page);
+ BUG_ON(PageAnon(page));
+ BUG_ON(PageSwapCache(page));
+ /* BUG_ON(page_mapped(page)); */
+ if (i >= PAGE_SIZE - 2)
+ break;
+ }
+ if (live_head)
+ pat[i++] = ']';
+ pat[i] = 0;
+ pat[PAGE_SIZE - 1] = 0;
+ }
+#endif
+
+ /*
+ * Now save the alive pages.
+ */
+ i = 0;
+ if (live_head) {
+ for (; live_head != tail;) { /* never dereference tail! */
+ page = next_page(live_head);
+ if (!PageActivate(live_head)) {
+ if (!page_refcnt(live_head))
+ __get_cpu_var(smooth_aging)++;
+ i++;
+ list_move(&live_head->lru, save_list);
+ }
+ live_head = page;
+ }
+
+ if (i)
+ ra_account(0, RA_EVENT_READAHEAD_RESCUE, i);
+ }
+
+#ifdef DEBUG_READAHEAD
+ if ((readahead_ratio & 3) == 3) {
+ ddprintk("save_chunk(ino=%lu, idx=%lu-%lu-%lu, %s@%s:%s)"
+ " = %d\n",
+ mapping->host->i_ino,
+ index, head->index, pidx,
+ mapping_mapped(mapping) ? "mmap" : "file",
+ zone_names[page_zonenum(head)], pat, i);
+ if (pat != static_buf)
+ free_page((unsigned long)pat);
+ }
+#endif
+
+ return i;
+}
+
+int rescue_ra_pages(struct list_head *page_list, struct list_head *save_list)
+{
+ struct address_space *mapping;
+ struct page *chunk_head;
+ struct page *live_head;
+ struct page *page;
+ unsigned long refcnt;
+ int n;
+ int ret = 0;
+
+ page = list_to_page(page_list);
+
+next_chunk:
+ chunk_head = page;
+ live_head = NULL;
+ mapping = page->mapping;
+ n = 0;
+
+next_rs_page:
+ refcnt = page_refcnt(page);
+ page = next_page(page);
+
+ if (mapping != page->mapping || &page->lru == page_list)
+ goto save_chunk;
+
+ if (refcnt == page_refcnt(page))
+ n++;
+ else if (refcnt < page_refcnt(page))
+ n = 0;
+ else if (n < 1)
+ n = INT_MIN;
+ else
+ goto got_live_head;
+
+ goto next_rs_page;
+
+got_live_head:
+ n = 0;
+ live_head = prev_page(page);
+
+next_page:
+ if (refcnt < page_refcnt(page))
+ n++;
+ refcnt = page_refcnt(page);
+ page = next_page(page);
+
+ if (mapping != page->mapping || &page->lru == page_list)
+ goto save_chunk;
+
+ goto next_page;
+
+save_chunk:
+ if (mapping && !PageAnon(chunk_head) &&
+ !PageSwapCache(chunk_head) &&
+ /* !page_mapped(chunk_head) && */
+ n <= 3 &&
+ (!refcnt ||
+ prev_page(page)->index >= chunk_head->index + 5))
+ ret += save_chunk(chunk_head, live_head, page, save_list);
+
+ if (&page->lru != page_list)
+ goto next_chunk;
+
+ return ret;
+}
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 12/13] readahead: events accounting
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (10 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 11/13] readahead: thrashing protection Wu Fengguang
@ 2005-10-29 6:02 ` Wu Fengguang
2005-10-29 6:02 ` [PATCH 13/13] readahead: page aging accounting Wu Fengguang
2005-10-31 21:45 ` [PATCH 00/13] Adaptive read-ahead V5 Ram Pai
13 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Jörn Engel, Ingo Oeser, Wu Fengguang
[-- Attachment #1: readahead-account-events.patch --]
[-- Type: text/plain, Size: 9331 bytes --]
A debugfs file named `readahead' is created according to advice from
Jörn Engel, Andrew Morton and Ingo Oeser. It yields much better
readability than the previous /proc/vmstat interface :)
It reveals various read-ahead activities/events, and is vital to the testing.
This is a trimmed down output on my PC:
# cat /debugfs/readahead
[table requests] total newfile state context contexta
cache_miss 234 50 10 54 0
read_random 116 37 4 21 0
io_congestion 0 0 0 0 0
io_cache_hit 61 0 14 37 0
io_block 4994 1576 232 120 2
readahead 1812 1522 171 107 2
lookahead 596 442 76 78 0
lookahead_hit 214 89 70 32 1
readahead_eof 1208 1080 95 21 2
readahead_shrink 0 0 0 0 0
readahead_thrash 0 0 0 0 0
readahead_rescue 1275 0 0 0 0
[table pages] total newfile state context contexta
cache_miss 401 98 50 87 0
read_random 120 37 4 21 0
io_congestion 0 0 0 0 0
io_cache_hit 872 0 412 448 0
io_block 22833 3857 12177 3680 21
readahead 15081 3827 9683 1484 21
readahead_hit 12684 3441 8164 1001 22
lookahead 8191 936 6926 329 0
lookahead_hit 8140 230 6368 1542 0
readahead_eof 5192 1955 2777 373 21
readahead_shrink 0 0 0 0 0
readahead_thrash 0 0 0 0 0
readahead_rescue 13656 0 0 0 0
[table summary] total newfile state context contexta
random_rate 6% 2% 2% 16% 0%
ra_hit_rate 84% 89% 84% 67% 100%
la_hit_rate 35% 20% 90% 40% 100%
avg_ra_size 8 3 56 14 7
avg_la_size 14 2 90 4 0
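The summary table is derived from the two tables above it. For example, the
total ra_hit_rate is the readahead_hit page count over the readahead page
count: 12684 * 100 / (15081 + 1) ~= 84%, and avg_ra_size is
(15081 + 1812/2) / (1812 + 1) ~= 8 pages, matching the figures shown; the
exact integer arithmetic is in ra_account_show() below.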
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 192 insertions(+), 2 deletions(-)
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -47,6 +47,9 @@ enum ra_class {
enum ra_event {
RA_EVENT_CACHE_MISS, /* read cache misses */
RA_EVENT_READRANDOM, /* random reads */
+ RA_EVENT_IO_CONGESTION, /* io congestion */
+ RA_EVENT_IO_CACHE_HIT, /* canceled io due to cache hit */
+ RA_EVENT_IO_BLOCK, /* read on locked page */
RA_EVENT_READAHEAD, /* read-ahead issued */
RA_EVENT_READAHEAD_HIT, /* read-ahead page hit */
@@ -65,6 +68,177 @@ enum ra_event {
*/
#define DEBUG_READAHEAD
#ifdef DEBUG_READAHEAD
+#include <linux/jiffies.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+#include <linux/init.h>
+
+static char *ra_class_name[] = {
+ "total",
+ "newfile",
+ "state",
+ "context",
+ "contexta",
+ "backward",
+ "onthrash",
+ "onraseek",
+ "none",
+};
+
+static char *ra_event_name[] = {
+ "cache_miss",
+ "read_random",
+ "io_congestion",
+ "io_cache_hit",
+ "io_block",
+ "readahead",
+ "readahead_hit",
+ "lookahead",
+ "lookahead_hit",
+ "readahead_eof",
+ "readahead_shrink",
+ "readahead_thrash",
+ "readahead_rescue",
+};
+
+static unsigned long ra_event_count[RA_CLASS_END+1][RA_EVENT_END][2];
+
+static inline void ra_account(struct file_ra_state *ra,
+ enum ra_event e, int pages)
+{
+ enum ra_class c;
+
+ c = (ra ? ra->flags & RA_CLASS_MASK : RA_CLASS_END);
+ if (e == RA_EVENT_READAHEAD_HIT && pages < 0) {
+ c = (ra->flags >> RA_CLASS_SHIFT) & RA_CLASS_MASK;
+ pages = -pages;
+ }
+ if (!c)
+ c = RA_CLASS_END;
+ BUG_ON(c > RA_CLASS_END);
+
+ ra_event_count[c][e][0] += 1;
+ ra_event_count[c][e][1] += pages;
+}
+
+static int ra_account_show(struct seq_file *s, void *_)
+{
+ int i;
+ int c;
+ int e;
+ static char event_fmt[] = "%-16s";
+ static char class_fmt[] = "%11s";
+ static char item_fmt[] = "%11lu";
+ static char percent_format[] = "%10lu%%";
+ static char *table_name[] = {
+ "[table requests]",
+ "[table pages]",
+ "[table summary]"};
+
+ for (i = 0; i <= 1; i++) {
+ for (e = 0; e < RA_EVENT_END; e++) {
+ ra_event_count[0][e][i] = 0;
+ for (c = 1; c <= RA_CLASS_END; c++)
+ ra_event_count[0][e][i] +=
+ ra_event_count[c][e][i];
+ }
+
+ seq_printf(s, event_fmt, table_name[i]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ for (e = 0; e < RA_EVENT_END; e++) {
+ if (e == RA_EVENT_READAHEAD_HIT && i == 0)
+ continue;
+
+ seq_printf(s, event_fmt, ra_event_name[e]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ ra_event_count[c][e][i]);
+ seq_puts(s, "\n");
+ }
+ seq_puts(s, "\n");
+ }
+
+ seq_printf(s, event_fmt, table_name[2]);
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, class_fmt, ra_class_name[c]);
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "random_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_READRANDOM][0] * 100) /
+ (ra_event_count[c][RA_EVENT_READRANDOM][0] +
+ ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "ra_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_READAHEAD_HIT][1] * 100) /
+ (ra_event_count[c][RA_EVENT_READAHEAD][1] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "la_hit_rate");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, percent_format,
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD_HIT][0] * 100) /
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_ra_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_event_count[c][RA_EVENT_READAHEAD][1] +
+ ra_event_count[c][RA_EVENT_READAHEAD][0] / 2) /
+ (ra_event_count[c][RA_EVENT_READAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ seq_printf(s, event_fmt, "avg_la_size");
+ for (c = 0; c <= RA_CLASS_END; c++)
+ seq_printf(s, item_fmt,
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][1] +
+ ra_event_count[c][RA_EVENT_LOOKAHEAD][0] / 2) /
+ (ra_event_count[c][RA_EVENT_LOOKAHEAD][0] + 1));
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
+static struct dentry *readahead_dentry;
+
+static int ra_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, ra_account_show, NULL);
+}
+
+static ssize_t ra_debug_write(struct file *file, const char __user *buf,
+ size_t size, loff_t *offset)
+{
+ if (file->f_dentry == readahead_dentry)
+ memset(ra_event_count, 0, sizeof(ra_event_count));
+ return 1;
+}
+
+static struct file_operations ra_debug_fops = {
+ .owner = THIS_MODULE,
+ .open = ra_debug_open,
+ .write = ra_debug_write,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init readahead_init(void)
+{
+ readahead_dentry = debugfs_create_file("readahead",
+ 0644, NULL, NULL, &ra_debug_fops);
+ return 0;
+}
+
+module_init(readahead_init)
#define dprintk(args...) \
if (readahead_ratio & 1) printk(KERN_DEBUG args)
@@ -73,6 +247,10 @@ enum ra_event {
#else /* DEBUG_READAHEAD */
+static inline void ra_account(struct file_ra_state *ra,
+ enum ra_event e, int pages)
+{
+}
#define dprintk(args...) do {} while(0)
#define ddprintk(args...) do {} while(0)
@@ -945,6 +1123,8 @@ static int ra_dispatch(struct file_ra_st
ra_account(ra, RA_EVENT_READAHEAD_EOF, actual);
if (la_size)
ra_account(ra, RA_EVENT_LOOKAHEAD, la_size);
+ if (ra_size > actual)
+ ra_account(ra, RA_EVENT_IO_CACHE_HIT, ra_size - actual);
ra_account(ra, RA_EVENT_READAHEAD, actual);
dprintk("readahead-%s(ino=%lu, index=%lu, ra=%lu+%lu-%lu) = %d\n",
@@ -1577,8 +1757,11 @@ page_cache_readahead_adaptive(struct add
if (page) {
if(!TestClearPageReadahead(page))
return 0;
- if (bdi_read_congested(mapping->backing_dev_info))
+ if (bdi_read_congested(mapping->backing_dev_info)) {
+ ra_account(ra, RA_EVENT_IO_CONGESTION,
+ end_index - index);
return 0;
+ }
}
if (page)
@@ -1665,8 +1848,15 @@ void fastcall ra_access(struct file_ra_s
(1 << PG_referenced)))
return;
- if (!ra_has_index(ra, page->index))
+ if (ra_has_index(ra, page->index)) {
+ if (PageLocked(page))
+ ra_account(ra, RA_EVENT_IO_BLOCK,
+ ra->readahead_index - page->index);
+ } else {
+ if (PageLocked(page))
+ ra_account(0, RA_EVENT_IO_BLOCK, 1);
return;
+ }
ra->cache_hit++;
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 13/13] readahead: page aging accounting
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (11 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 12/13] readahead: events accounting Wu Fengguang
@ 2005-10-29 6:02 ` Wu Fengguang
2005-10-31 21:45 ` [PATCH 00/13] Adaptive read-ahead V5 Ram Pai
13 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-10-29 6:02 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Wu Fengguang
[-- Attachment #1: readahead-account-aging.patch --]
[-- Type: text/plain, Size: 7237 bytes --]
The accuracy of stateful thrashing-threshold estimation depends largely on the
measurement of cold page aging speed.
A file named `pageaging' is created in debugfs to monitor the trace of two
measurement variables: per-zone `nr_page_aging' and per-cpu `smooth_aging'.
Their values and the jiffies are recorded each time one of them has a delta
of 1, 1/2, 1/4, 1/16, 1/256 or 1/4096 of (nr_inactive + nr_free).
Sample series of collected data shows that smooth_aging is more stable at
small sampling granularities:
time dt page_aging8 smooth_aging8
872765 26 520056 33 653782 163
872791 12 520089 132 653945 51
872803 4 520221 132 653996 66
872807 17 520353 165 654062 107
872824 22 520518 99 654169 74
872846 372 520617 99 654243 78
873218 294 520716 99 654321 73
873512 196 520815 99 654394 130
873708 15 520914 231 654524 28
873723 15 521145 198 654552 9
873738 881 521343 99 654561 182
874619 700 521442 0 654743 198
875319 384 521442 66 654941 110
875703 2119 521508 99 655051 1632
877822 3960 521607 0 656683 980
881782 904 521607 0 657663 216
time dt page_aging1 smooth_aging1
-90822 12418 5775 12999 33302 10767
-78404 17510 18774 10303 44069 10345
-60894 24757 29077 9871 54414 14615
-36137 19194 38948 10404 69029 13726
-16943 19636 49352 10440 82755 12865
2693 16299 59792 12453 95620 10734
18992 19851 72245 10073 106354 15960
38843 16099 82318 10767 122314 14059
54942 16094 93085 10041 136373 12117
71036 19888 103126 12595 148490 16155
90924 18452 115721 9782 164645 11705
109376 22395 125503 10214 176350 13679
131771 19310 135717 10759 190029 11843
151081 20793 146476 10699 201872 12595
171874 22308 157175 10321 214467 13157
194182 17954 167496 10773 227624 14803
212136 19946 178269 10554 242427 13391
232082 21051 188823 11179 255818 11783
Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn>
---
mm/readahead.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 133 insertions(+), 1 deletion(-)
--- linux-2.6.14-rc5-mm1.orig/mm/readahead.c
+++ linux-2.6.14-rc5-mm1/mm/readahead.c
@@ -207,11 +207,130 @@ static int ra_account_show(struct seq_fi
return 0;
}
+/*
+ * Measure the aging progress of cold pages over time.
+ */
+#define AGING_INFO_SIZE (1 << 8)
+#define AGING_INFO_MASK (AGING_INFO_SIZE - 1)
+static int aging_info_shift[] = {0, 1, 2, 4, 8, 12};
+#define AGING_INFO_SHIFTS (sizeof(aging_info_shift)/\
+ sizeof(aging_info_shift[0]))
+static int aging_info_index[AGING_INFO_SHIFTS];
+static unsigned long aging_info[AGING_INFO_SIZE][AGING_INFO_SHIFTS*3];
+static spinlock_t aging_info_lock = SPIN_LOCK_UNLOCKED;
+
+static unsigned long nr_free_inactive(void);
+static unsigned long nr_page_aging(void);
+static unsigned long nr_smooth_aging(void);
+
+static void collect_aging_info(void)
+{
+ int i;
+ unsigned long mem;
+ unsigned long page_aging;
+ unsigned long smooth_aging;
+
+ mem = nr_free_inactive();
+ page_aging = nr_page_aging();
+ smooth_aging = nr_smooth_aging();
+
+ spin_lock_irq(&aging_info_lock);
+
+ for (i = AGING_INFO_SHIFTS - 1; i >= 0; i--) {
+ if (smooth_aging - aging_info[aging_info_index[i]][i*3+2] +
+ page_aging - aging_info[aging_info_index[i]][i*3+1] >
+ 2 * (mem >> aging_info_shift[i])) {
+ aging_info_index[i]++;
+ aging_info_index[i] &= AGING_INFO_MASK;
+ aging_info[aging_info_index[i]][i*3] = jiffies;
+ aging_info[aging_info_index[i]][i*3+1] = page_aging;
+ aging_info[aging_info_index[i]][i*3+2] = smooth_aging;
+ } else
+ break;
+ }
+
+ spin_unlock_irq(&aging_info_lock);
+}
+
+static void *aginginfo_start(struct seq_file *s, loff_t *pos)
+{
+ int n = *pos;
+ int i;
+
+ spin_lock_irq(&aging_info_lock);
+
+ if (!n) {
+ for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+ seq_printf(s, "%12s %10s %18s%d %18s%d\t", "time","dt",
+ "page_aging", aging_info_shift[i],
+ "smooth_aging", aging_info_shift[i]);
+ }
+ seq_puts(s, "\n");
+ }
+
+ if (++n < AGING_INFO_SIZE)
+ return (void *)n;
+ else
+ return NULL;
+}
+
+static void *aginginfo_next(struct seq_file *s, void *p, loff_t *pos)
+{
+ int n = (int)p;
+
+ ++*pos;
+ return (void *)(++n < AGING_INFO_SIZE ? n : 0);
+}
+
+static void aginginfo_stop(struct seq_file *s, void *p)
+{
+ spin_unlock_irq(&aging_info_lock);
+}
+
+static int aginginfo_show(struct seq_file *s, void *p)
+{
+ int n = (int)p;
+ int i;
+ int index0;
+ int index1;
+ long time;
+ unsigned long nr1;
+ unsigned long nr2;
+
+ for (i = 0; i < AGING_INFO_SHIFTS; i++) {
+ index0 = aging_info_index[i] + n;
+ index1 = aging_info_index[i] + n + 1;
+ index0 &= AGING_INFO_MASK;
+ index1 &= AGING_INFO_MASK;
+ time = aging_info[index0][i*3];
+ nr1 = aging_info[index1][i*3+1] - aging_info[index0][i*3+1];
+ nr2 = aging_info[index1][i*3+2] - aging_info[index0][i*3+2];
+ seq_printf(s, "%12ld %10lu %10lu %8lu %10lu %8lu\t",
+ time, aging_info[index1][i*3] - time,
+ aging_info[index0][i*3+1], nr1,
+ aging_info[index0][i*3+2], nr2);
+ }
+ seq_puts(s, "\n");
+
+ return 0;
+}
+
static struct dentry *readahead_dentry;
+static struct dentry *pageaging_dentry;
+
+struct seq_operations aginginfo_ops = {
+ .start = aginginfo_start,
+ .next = aginginfo_next,
+ .stop = aginginfo_stop,
+ .show = aginginfo_show,
+};
static int ra_debug_open(struct inode *inode, struct file *file)
{
- return single_open(file, ra_account_show, NULL);
+ if (file->f_dentry == readahead_dentry)
+ return single_open(file, ra_account_show, NULL);
+ else
+ return seq_open(file, &aginginfo_ops);
}
static ssize_t ra_debug_write(struct file *file, const char __user *buf,
@@ -231,10 +350,20 @@ static struct file_operations ra_debug_f
.release = single_release,
};
+static struct file_operations aginginfo_fops = {
+ .open = ra_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+
static int __init readahead_init(void)
{
readahead_dentry = debugfs_create_file("readahead",
0644, NULL, NULL, &ra_debug_fops);
+ pageaging_dentry = debugfs_create_file("pageaging",
+ 0644, NULL, NULL, &aginginfo_fops);
return 0;
}
@@ -1219,6 +1348,9 @@ static inline unsigned long compute_thra
else
*remain = 0;
+#ifdef DEBUG_READAHEAD
+ collect_aging_info();
+#endif
ddprintk("compute_thrashing_threshold: "
"ra=%lu=%lu*%lu/%lu, remain %lu for %lu\n",
ra_size, stream_shift, global_size, global_shift,
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 00/13] Adaptive read-ahead V5
2005-10-29 6:02 [PATCH 00/13] Adaptive read-ahead V5 Wu Fengguang
` (12 preceding siblings ...)
2005-10-29 6:02 ` [PATCH 13/13] readahead: page aging accounting Wu Fengguang
@ 2005-10-31 21:45 ` Ram Pai
2005-11-01 3:04 ` Wu Fengguang
13 siblings, 1 reply; 16+ messages in thread
From: Ram Pai @ 2005-10-31 21:45 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-kernel, Andrew Morton, slpratt
Wu,
I will run some standard benchmarks and post the results soon.
The benchmarks I have are (1) DSS (TPCH) (2) iozone (3) sysbench (4)
tiobench
It will take at least a week, since some disks on my machine have to be
replaced and the setup has to be remade.
RP
On Sat, Oct 29, 2005 at 02:02:16PM +0800, Wu Fengguang wrote:
> This version of adaptive read-ahead patch features various clean ups.
> Changes include:
> - rewrote context based method to make it clean and robust
> - improved accuracy of stateful thrashing threshold estimation
> - nr_page_aging made right
> - sorted out the thrashing protection logic
> - enhanced debug/accounting facilities
>
> Regards,
> Wu Fengguang
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 00/13] Adaptive read-ahead V5
2005-10-31 21:45 ` [PATCH 00/13] Adaptive read-ahead V5 Ram Pai
@ 2005-11-01 3:04 ` Wu Fengguang
0 siblings, 0 replies; 16+ messages in thread
From: Wu Fengguang @ 2005-11-01 3:04 UTC (permalink / raw)
To: Ram Pai; +Cc: linux-kernel, Andrew Morton, slpratt
On Mon, Oct 31, 2005 at 01:45:14PM -0800, Ram Pai wrote:
> Wu,
>
> I will run some standard benchmarks and post the results soon.
> The benchmarks I have are (1) DSS (TPCH) (2) iozone (3) sysbench (4)
> tiobench
>
> It will take atleast a week, since some disks on my machine have to be
> replaced and the setup has to be remade.
Thank you for taking the trouble. In the meantime I will prepare a patch
against the stable 2.6.14, so that you can comfortably run vanilla 2.6.14
as the reference point.
Regards,
Wu Fengguang
^ permalink raw reply [flat|nested] 16+ messages in thread