Re: [PATCH 1/3] readahead: drop behind

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/3] readahead: drop behind
  2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
@ 2007-07-21 20:29   ` Eric St-Laurent
  2007-07-21 20:37     ` Peter Zijlstra
  2007-07-25  3:55   ` Eric St-Laurent
  1 sibling, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-21 20:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:
> plain text document attachment (readahead-useonce.patch)
> Use the read-ahead code to provide hints to page reclaim.
> 
> This patch has the potential to solve the streaming-IO trashes my
> desktop problem.
> 
> It tries to aggressively reclaim pages that were loaded in a strong
> sequential pattern and have been consumed. Thereby limiting the damage
> to the current resident set.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

With the fadvise change, it looks like the right solution to me.

The patches are for which kernel? They doesn't apply cleanly to
2.6.22.1.

It would be useful to have a temporary /proc tunable to enable/disable
the heuristic to help test the effects.


- Eric



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] readahead: drop behind
  2007-07-21 20:29   ` Eric St-Laurent
@ 2007-07-21 20:37     ` Peter Zijlstra
  2007-07-21 20:59       ` Eric St-Laurent
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 20:37 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sat, 2007-07-21 at 16:29 -0400, Eric St-Laurent wrote:
> On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:
> > plain text document attachment (readahead-useonce.patch)
> > Use the read-ahead code to provide hints to page reclaim.
> > 
> > This patch has the potential to solve the streaming-IO trashes my
> > desktop problem.
> > 
> > It tries to aggressively reclaim pages that were loaded in a strong
> > sequential pattern and have been consumed. Thereby limiting the damage
> > to the current resident set.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> 
> With the fadvise change, it looks like the right solution to me.
> 
> The patches are for which kernel? They doesn't apply cleanly to
> 2.6.22.1.

They are against git of a few hours ago and the latest readahead patches
from Wu (which don't apply cleanly either, but the rejects are trivial).

> It would be useful to have a temporary /proc tunable to enable/disable
> the heuristic to help test the effects.

Right, I had such a patch somewhere,.. won't apply cleanly but should be
obvious..

---
 kernel/sysctl.c |    9 +++++++++
 mm/readahead.c  |    6 ++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c
+++ linux-2.6/kernel/sysctl.c
@@ -168,6 +168,7 @@ extern int dirty_ratio_handler(ctl_table
 
 extern int prove_locking;
 extern int lock_stat;
+extern int sysctl_dropbehind;
 
 /* The default sysctl tables: */
 
@@ -1075,6 +1076,14 @@ static ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "drop_behind",
+		.data		= &sysctl_dropbehind,
+		.maxlen		= sizeof(sysctl_dropbehind),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -429,6 +429,8 @@ void page_cache_sync_readahead(struct ad
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
 
+int sysctl_dropbehind = 2;
+
 /**
  * page_cache_async_readahead - file readahead for marked pages
  * @mapping: address_space which holds the pagecache and I/O vectors
@@ -473,7 +475,7 @@ page_cache_async_readahead(struct addres
 	 * hint that there is sequential IO, which implies that the pages that
 	 * have been used thus far can be reclaimed
 	 */
-	if (ra->size == ra->ra_pages) {
+	if (sysctl_dropbehind && ra->size == ra->ra_pages) {
 		unsigned long demote_idx = offset - min(offset, ra->size);
 
 		read_lock_irq(&mapping->tree_lock);
@@ -489,7 +491,7 @@ page_cache_async_readahead(struct addres
 			 * users. We should not take away somebody else's
 			 * pages, so do not drop behind beyond this point.
 			 */
-			if (PageActive(page))
+			if (sysctl_dropbehind > 1 && PageActive(page))
 				break;
 
 			lru_demote(page);



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] readahead: drop behind
  2007-07-21 20:37     ` Peter Zijlstra
@ 2007-07-21 20:59       ` Eric St-Laurent
  2007-07-21 21:06         ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-21 20:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook


> They are against git of a few hours ago and the latest readahead patches
> from Wu (which don't apply cleanly either, but the rejects are trivial).
> 
> > It would be useful to have a temporary /proc tunable to enable/disable
> > the heuristic to help test the effects.
> 
> Right, I had such a patch somewhere,.. won't apply cleanly but should be
> obvious..

Thanks, I will merge theses and report back with some results.

After copying large files, I find my system sluggish. I hope your
changes will help.

- Eric



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 0/3] readahead drop behind and size adjustment
@ 2007-07-21 21:00 Peter Zijlstra
  2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook, Peter Zijlstra

The various readahead bits I have lying about.

Wu, would you agree with asking Andrew to stick these behind your latest
readahead series?



^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/3] readahead: drop behind
  2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra
@ 2007-07-21 21:00 ` Peter Zijlstra
  2007-07-21 20:29   ` Eric St-Laurent
  2007-07-25  3:55   ` Eric St-Laurent
  2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook, Peter Zijlstra

[-- Attachment #1: readahead-useonce.patch --]
[-- Type: text/plain, Size: 5776 bytes --]

Use the read-ahead code to provide hints to page reclaim.

This patch has the potential to solve the streaming-IO trashes my
desktop problem.

It tries to aggressively reclaim pages that were loaded in a strong
sequential pattern and have been consumed. Thereby limiting the damage
to the current resident set.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/swap.h |    1 +
 mm/readahead.c       |   39 ++++++++++++++++++++++++++++++++++++++-
 mm/swap.c            |   51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+), 1 deletion(-)

Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c
+++ linux-2.6/mm/swap.c
@@ -30,6 +30,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/init.h>
+#include <linux/rmap.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -176,6 +177,7 @@ EXPORT_SYMBOL(mark_page_accessed);
  */
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_demote_pvecs) = { 0, };
 
 void fastcall lru_cache_add(struct page *page)
 {
@@ -197,6 +199,37 @@ void fastcall lru_cache_add_active(struc
 	put_cpu_var(lru_add_active_pvecs);
 }
 
+static void __pagevec_lru_demote(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		if (PageLRU(page)) {
+			page_referenced(page, 0);
+			if (PageActive(page)) {
+				ClearPageActive(page);
+				__dec_zone_state(zone, NR_ACTIVE);
+				__inc_zone_state(zone, NR_INACTIVE);
+			}
+			list_move_tail(&page->lru, &zone->inactive_list);
+		}
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
 static void __lru_add_drain(int cpu)
 {
 	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
@@ -207,6 +240,9 @@ static void __lru_add_drain(int cpu)
 	pvec = &per_cpu(lru_add_active_pvecs, cpu);
 	if (pagevec_count(pvec))
 		__pagevec_lru_add_active(pvec);
+	pvec = &per_cpu(lru_demote_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_demote(pvec);
 }
 
 void lru_add_drain(void)
@@ -403,6 +439,21 @@ void __pagevec_lru_add_active(struct pag
 }
 
 /*
+ * Function used to forcefully demote a page to the tail of the inactive
+ * list.
+ */
+void fastcall lru_demote(struct page *page)
+{
+	if (likely(get_page_unless_zero(page))) {
+		struct pagevec *pvec = &get_cpu_var(lru_demote_pvecs);
+
+		if (!pagevec_add(pvec, page))
+			__pagevec_lru_demote(pvec);
+		put_cpu_var(lru_demote_pvecs);
+	}
+}
+
+/*
  * Try to drop buffers from the pages in a pagevec
  */
 void pagevec_strip(struct pagevec *pvec)
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h
+++ linux-2.6/include/linux/swap.h
@@ -180,6 +180,7 @@ extern unsigned int nr_free_pagecache_pa
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_demote(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -15,6 +15,7 @@
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -448,13 +449,19 @@ EXPORT_SYMBOL_GPL(page_cache_sync_readah
  * page_cache_async_ondemand() should be called when a page is used which
  * has the PG_readahead flag: this is a marker to suggest that the application
  * has used up enough of the readahead window that we should start pulling in
- * more pages. */
+ * more pages.
+ */
 void
 page_cache_async_readahead(struct address_space *mapping,
 			   struct file_ra_state *ra, struct file *filp,
 			   struct page *page, pgoff_t offset,
 			   unsigned long req_size)
 {
+	pgoff_t demote_idx = offset - min_t(pgoff_t, offset, ra->size);
+	struct page *pages[16];
+	unsigned nr_pages;
+	unsigned i;
+
 	/* no read-ahead */
 	if (!ra->ra_pages)
 		return;
@@ -473,6 +480,36 @@ page_cache_async_readahead(struct addres
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return;
 
+	/*
+	 * Read-ahead use once: when the ra window is maximal this is a good
+	 * hint that there is sequential IO, which implies that the pages that
+	 * have been used thus far can be reclaimed
+	 */
+	if (ra->size == ra->ra_pages) do {
+		nr_pages = find_get_pages(mapping,
+				demote_idx, ARRAY_SIZE(pages), pages);
+
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+			demote_idx = page_index(page);
+
+			/*
+			 * The page is active. This means there are other
+			 * users. We should not take away somebody else's
+			 * pages, so do not drop behind beyond this point.
+			 */
+			if (demote_idx < offset && !PageActive(page)) {
+				lru_demote(page);
+			} else {
+				demote_idx = offset;
+				break;
+			}
+		}
+		demote_idx++;
+
+		release_pages(pages, nr_pages, 0);
+	} while (demote_idx < offset);
+
 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
 }

--


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2/3] readahead: fadvise drop behind controls
  2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra
  2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
@ 2007-07-21 21:00 ` Peter Zijlstra
  2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra
       [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
  3 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook, Peter Zijlstra

[-- Attachment #1: readahead-fadvise.patch --]
[-- Type: text/plain, Size: 2490 bytes --]

currently drop behind is controlled by a heuristic:

  ra->size == ra->ra_pages

provide some fadvise() control:

POSIX_FADV_NORMAL     - heuristic
POSIX_FADV_RANDOM     - always disable drop behind
POSIX_FADV_SEQUENTIAL - always enable drop behind

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/fs.h |    7 +++++++
 mm/fadvise.c       |    5 +++++
 mm/readahead.c     |    4 +++-
 3 files changed, 15 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -705,8 +705,15 @@ struct file_ra_state {
 	unsigned int ra_pages;		/* Maximum readahead window */
 	int mmap_miss;			/* Cache miss stat for mmap accesses */
 	loff_t prev_pos;		/* Cache last read() position */
+	unsigned int flags;		/* readahead flags */
 };
 
+#define F_RA_DROP_BEHIND_NEVER		1
+#define F_RA_DROP_BEHIND_ALWAYS		2
+
+#define F_RA_DROP_BEHIND_MASK \
+	(F_RA_DROP_BEHIND_NEVER|F_RA_DROP_BEHIND_ALWAYS)
+
 /*
  * Check if @index falls in the readahead windows.
  */
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c
+++ linux-2.6/mm/fadvise.c
@@ -65,12 +65,17 @@ asmlinkage long sys_fadvise64_64(int fd,
 	switch (advice) {
 	case POSIX_FADV_NORMAL:
 		file->f_ra.ra_pages = bdi->ra_pages;
+		file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK;
 		break;
 	case POSIX_FADV_RANDOM:
 		file->f_ra.ra_pages = 0;
+		file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK;
+		file->f_ra.flags |= F_RA_DROP_BEHIND_NEVER;
 		break;
 	case POSIX_FADV_SEQUENTIAL:
 		file->f_ra.ra_pages = bdi->ra_pages * 2;
+		file->f_ra.flags &= ~F_RA_DROP_BEHIND_MASK;
+		file->f_ra.flags |= F_RA_DROP_BEHIND_ALWAYS;
 		break;
 	case POSIX_FADV_WILLNEED:
 		if (!mapping->a_ops->readpage) {
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -485,7 +485,9 @@ page_cache_async_readahead(struct addres
 	 * hint that there is sequential IO, which implies that the pages that
 	 * have been used thus far can be reclaimed
 	 */
-	if (ra->size == ra->ra_pages) do {
+	if (!(ra->flags & F_RA_DROP_BEHIND_NEVER) &&
+	    ((ra->flags & F_RA_DROP_BEHIND_ALWAYS) ||
+	     ra->size == ra->ra_pages)) do {
 		nr_pages = find_get_pages(mapping,
 				demote_idx, ARRAY_SIZE(pages), pages);
 

--


^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra
  2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
  2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra
@ 2007-07-21 21:00 ` Peter Zijlstra
  2007-07-22  8:24   ` Jens Axboe
       [not found]   ` <20070722084526.GB6317@mail.ustc.edu.cn>
       [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
  3 siblings, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 21:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Fengguang Wu, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook, Peter Zijlstra

[-- Attachment #1: ra_pages.patch --]
[-- Type: text/plain, Size: 2488 bytes --]

Scale the default max readahead size with the system memory size.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 block/ll_rw_blk.c  |    2 +-
 include/linux/fs.h |    1 +
 mm/readahead.c     |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

Index: linux-2.6/block/ll_rw_blk.c
===================================================================
--- linux-2.6.orig/block/ll_rw_blk.c
+++ linux-2.6/block/ll_rw_blk.c
@@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu
 	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
 	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
 	q->make_request_fn = mfn;
-	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
+	bdi_ra_init(&q->backing_dev_info);
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
 	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file
 
 extern void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
+extern void bdi_ra_init(struct backing_dev_info *bdi);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c
+++ linux-2.6/mm/readahead.c
@@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state 
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
 
+static unsigned long ra_pages;
+
+static __init int readahead_init(void)
+{
+	/*
+	 * Scale the max readahead window with system memory
+	 *
+	 *   64M:   128K
+	 *  128M:   180K
+	 *  256M:   256K
+	 *  512M:   360K
+	 *    1G:   512K
+	 *    2G:   724K
+	 *    4G:  1024K
+	 *    8G:  1448K
+	 *   16G:  2048K
+	 */
+	ra_pages = int_sqrt(totalram_pages/16);
+	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
+		ra_pages = 2 << (20 - PAGE_SHIFT);
+
+	return 0;
+}
+
+subsys_initcall(readahead_init);
+
+void bdi_ra_init(struct backing_dev_info *bdi)
+{
+	bdi->ra_pages = ra_pages;
+}
+EXPORT_SYMBOL(bdi_ra_init);
+
 #define list_to_page(head) (list_entry((head)->prev, struct page, lru))
 
 /**

--


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] readahead: drop behind
  2007-07-21 20:59       ` Eric St-Laurent
@ 2007-07-21 21:06         ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-21 21:06 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sat, 2007-07-21 at 16:59 -0400, Eric St-Laurent wrote:
> > They are against git of a few hours ago and the latest readahead patches
> > from Wu (which don't apply cleanly either, but the rejects are trivial).
> > 
> > > It would be useful to have a temporary /proc tunable to enable/disable
> > > the heuristic to help test the effects.
> > 
> > Right, I had such a patch somewhere,.. won't apply cleanly but should be
> > obvious..
> 
> Thanks, I will merge theses and report back with some results.
> 
> After copying large files, I find my system sluggish. I hope your
> changes will help.

Yeah, copying will still hurt. This will only help reading large files,
not writing :-/

For that I still need an inspiring idea.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
@ 2007-07-22  2:39   ` Fengguang Wu
  2007-07-22  2:44   ` Dave Jones
  1 sibling, 0 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-07-22  2:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook, Dave Jones

On Sat, Jul 21, 2007 at 11:00:05PM +0200, Peter Zijlstra wrote:
> The various readahead bits I have lying about.
> 
> Wu, would you agree with asking Andrew to stick these behind your latest
> readahead series?

They are generally good feature to have. Thank you!

- default readahead size

It makes sense to raise it beyond 128K.  1M default readahead
absolutely makes sense for sequential workloads.  For the desktop,
this increases boot speed and readahead misses, both due to more
aggressive mmap read-around. Most users will be glad to feel the
speedup, and happily ignore the readahead misses, which may be
"invisible" in case of large memory.

In theory, the distributions can do the same tuning.  So we have an
interesting question for Dave:
        Does fedora desktop raise the default readahead size? Why or
        why not?  It goes so far to do userland readahead ;)

- drop behind

Sorry, I still doubt it will benefit all/most workloads. Leave it off
by default, and leave the enabling decision to Dave? I do hope that
it help general desktops.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
  2007-07-22  2:39   ` [PATCH 0/3] readahead drop behind and size adjustment Fengguang Wu
@ 2007-07-22  2:44   ` Dave Jones
       [not found]     ` <20070722081010.GA6317@mail.ustc.edu.cn>
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Jones @ 2007-07-22  2:44 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, Jul 22, 2007 at 10:39:23AM +0800, Fengguang Wu wrote:
 
 > It makes sense to raise it beyond 128K.  1M default readahead
 > absolutely makes sense for sequential workloads.  For the desktop,
 > this increases boot speed and readahead misses, both due to more
 > aggressive mmap read-around. Most users will be glad to feel the
 > speedup, and happily ignore the readahead misses, which may be
 > "invisible" in case of large memory.
 > 
 > In theory, the distributions can do the same tuning.  So we have an
 > interesting question for Dave:
 >         Does fedora desktop raise the default readahead size? Why or
 >         why not?  It goes so far to do userland readahead ;)
 
Fedora takes whatever defaults for readahead the kernel.org kernel has.
The only reasoning being if anyone reported VM bugs, we'd be able
to say to interested upstream developers "we're running the stock VM".
without having to get the user to try and reproduce on unpatched
kernels.
 
 > - drop behind
 > 
 > Sorry, I still doubt it will benefit all/most workloads. Leave it off
 > by default, and leave the enabling decision to Dave? I do hope that
 > it help general desktops.

It's not a subject that I'm intimatly familiar with, and when it
comes to decisions like this, I tend to just take whatever the
upstream defaults are.

	Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found]     ` <20070722081010.GA6317@mail.ustc.edu.cn>
@ 2007-07-22  8:10       ` Fengguang Wu
  2007-07-22  8:24         ` Peter Zijlstra
  2007-07-22  8:33       ` Rusty Russell
  1 sibling, 1 reply; 46+ messages in thread
From: Fengguang Wu @ 2007-07-22  8:10 UTC (permalink / raw)
  To: Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton,
	Rusty Russell, Tim Pepper, Chris Snook

On Sat, Jul 21, 2007 at 10:44:28PM -0400, Dave Jones wrote:
> On Sun, Jul 22, 2007 at 10:39:23AM +0800, Fengguang Wu wrote:
>  
>  > It makes sense to raise it beyond 128K.  1M default readahead
>  > absolutely makes sense for sequential workloads.  For the desktop,
>  > this increases boot speed and readahead misses, both due to more
>  > aggressive mmap read-around. Most users will be glad to feel the
>  > speedup, and happily ignore the readahead misses, which may be
>  > "invisible" in case of large memory.
>  > 
>  > In theory, the distributions can do the same tuning.  So we have an
>  > interesting question for Dave:
>  >         Does fedora desktop raise the default readahead size? Why or
>  >         why not?  It goes so far to do userland readahead ;)
>  
> Fedora takes whatever defaults for readahead the kernel.org kernel has.
> The only reasoning being if anyone reported VM bugs, we'd be able
> to say to interested upstream developers "we're running the stock VM".
> without having to get the user to try and reproduce on unpatched
> kernels.

Thank you.  Now I'm more confident that the kernel should have a
reasonable default readahead size. The current one is 5+ years old and
should be updated now.

>  > - drop behind
>  > 
>  > Sorry, I still doubt it will benefit all/most workloads. Leave it off
>  > by default, and leave the enabling decision to Dave? I do hope that
>  > it help general desktops.
> 
> It's not a subject that I'm intimatly familiar with, and when it
> comes to decisions like this, I tend to just take whatever the
> upstream defaults are.

- It will avoid large-file-reads-thrashing-my-desktop problem,
  so most desktop users should like it. But sure there will be counter
  cases when a user want to keep the data cached.
- File servers may hurt from it. Imagine a mp3/png file server. The
  files are large enough to trigger drop-behind, but small (and hot)
  enough to be cached. Also when a new fedora DVD iso is released, it
  may be cached for some days. These are only the obvious cases.

So I opt for it being made tunable, safe, and turned off by default.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra
@ 2007-07-22  8:24   ` Jens Axboe
  2007-07-22  8:36     ` Peter Zijlstra
       [not found]   ` <20070722084526.GB6317@mail.ustc.edu.cn>
  1 sibling, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2007-07-22  8:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sat, Jul 21 2007, Peter Zijlstra wrote:
> Scale the default max readahead size with the system memory size.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  block/ll_rw_blk.c  |    2 +-
>  include/linux/fs.h |    1 +
>  mm/readahead.c     |   32 ++++++++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/ll_rw_blk.c
> ===================================================================
> --- linux-2.6.orig/block/ll_rw_blk.c
> +++ linux-2.6/block/ll_rw_blk.c
> @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu
>  	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>  	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>  	q->make_request_fn = mfn;
> -	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> +	bdi_ra_init(&q->backing_dev_info);
>  	q->backing_dev_info.state = 0;
>  	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
>  	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file
>  
>  extern void
>  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
> +extern void bdi_ra_init(struct backing_dev_info *bdi);
>  extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
>  extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
>  extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
> Index: linux-2.6/mm/readahead.c
> ===================================================================
> --- linux-2.6.orig/mm/readahead.c
> +++ linux-2.6/mm/readahead.c
> @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state 
>  }
>  EXPORT_SYMBOL_GPL(file_ra_state_init);
>  
> +static unsigned long ra_pages;
> +
> +static __init int readahead_init(void)
> +{
> +	/*
> +	 * Scale the max readahead window with system memory
> +	 *
> +	 *   64M:   128K
> +	 *  128M:   180K
> +	 *  256M:   256K
> +	 *  512M:   360K
> +	 *    1G:   512K
> +	 *    2G:   724K
> +	 *    4G:  1024K
> +	 *    8G:  1448K
> +	 *   16G:  2048K
> +	 */
> +	ra_pages = int_sqrt(totalram_pages/16);
> +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> +		ra_pages = 2 << (20 - PAGE_SHIFT);
> +
> +	return 0;
> +}

How did you come up with these numbers?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-22  8:10       ` Fengguang Wu
@ 2007-07-22  8:24         ` Peter Zijlstra
       [not found]           ` <20070722082923.GA7790@mail.ustc.edu.cn>
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-22  8:24 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Dave Jones, linux-kernel, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:

> - It will avoid large-file-reads-thrashing-my-desktop problem,
>   so most desktop users should like it. But sure there will be counter
>   cases when a user want to keep the data cached.
> - File servers may hurt from it. Imagine a mp3/png file server. The
>   files are large enough to trigger drop-behind, but small (and hot)
>   enough to be cached. Also when a new fedora DVD iso is released, it
>   may be cached for some days. These are only the obvious cases.
> 
> So I opt for it being made tunable, safe, and turned off by default.

I'm still not convinced (Rik wasn't either last time around). When these
files really are hot, they will be kept in memory due to them becoming
Active.

Also, by scaling up the max readahead size it takes a larger file before
it starts dropping. If say this server has 4G of memory (not much at all
for a server) resulting in a 1M readahead window, the file needs to be >
~2M before it starts drop behind.

But I guess it takes someone to try this IRL before we can settle this
debate :-/


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found]           ` <20070722082923.GA7790@mail.ustc.edu.cn>
@ 2007-07-22  8:29             ` Fengguang Wu
  0 siblings, 0 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-07-22  8:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, linux-kernel, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, Jul 22, 2007 at 10:24:37AM +0200, Peter Zijlstra wrote:
> On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> 
> > - It will avoid large-file-reads-thrashing-my-desktop problem,
> >   so most desktop users should like it. But sure there will be counter
> >   cases when a user want to keep the data cached.
> > - File servers may hurt from it. Imagine a mp3/png file server. The
> >   files are large enough to trigger drop-behind, but small (and hot)
> >   enough to be cached. Also when a new fedora DVD iso is released, it
> >   may be cached for some days. These are only the obvious cases.
> > 
> > So I opt for it being made tunable, safe, and turned off by default.
> 
> I'm still not convinced (Rik wasn't either last time around). When these
> files really are hot, they will be kept in memory due to them becoming
> Active.
> 
> Also, by scaling up the max readahead size it takes a larger file before
> it starts dropping. If say this server has 4G of memory (not much at all
> for a server) resulting in a 1M readahead window, the file needs to be >
> ~2M before it starts drop behind.

[snip]

> But I guess it takes someone to try this IRL before we can settle this
> debate :-/

Yeah, some real workload numbers would help.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found]     ` <20070722081010.GA6317@mail.ustc.edu.cn>
  2007-07-22  8:10       ` Fengguang Wu
@ 2007-07-22  8:33       ` Rusty Russell
  2007-07-22  8:45         ` Peter Zijlstra
  2007-07-23  9:00         ` Nick Piggin
  1 sibling, 2 replies; 46+ messages in thread
From: Rusty Russell @ 2007-07-22  8:33 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Dave Jones, Peter Zijlstra, linux-kernel, riel, Andrew Morton,
	Tim Pepper, Chris Snook

On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> - It will avoid large-file-reads-thrashing-my-desktop problem,
>   so most desktop users should like it. But sure there will be counter
>   cases when a user want to keep the data cached.
> - File servers may hurt from it. Imagine a mp3/png file server. The
>   files are large enough to trigger drop-behind, but small (and hot)
>   enough to be cached. Also when a new fedora DVD iso is released, it
>   may be cached for some days. These are only the obvious cases.

I'm still not convinced of the latter case.  If a second user accesses
the file before it pages are thrown out, from my understanding there
won't be readahead but the pages will be put back on the active list
(ie. the dropbehind is undone).

AFAICT this patch truly biasses towards releasing pages from
sequentially accessed files.  So file servers where memory pressure
comes from lots of such files should be fine (bias will work correctly
against least-served files).

The main problem I can see is a low-memory server where we are currently
swapping out pages from non-sequential files (eg. gigantic running java
apps), and now performance might decline because we'll discard file
pages first.

> So I opt for it being made tunable, safe, and turned off by default.

I'd like to see it turned on by default in -mm, and try to come up with
some server-like workload to measure the effect.  Should be easy to
simulate something (eg. apache server, where clients grab some files in
preference, and apache server where clients grab different files).

Peter?

Thanks,
Rusty.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  8:24   ` Jens Axboe
@ 2007-07-22  8:36     ` Peter Zijlstra
  2007-07-22  8:50       ` Jens Axboe
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-22  8:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote:
> On Sat, Jul 21 2007, Peter Zijlstra wrote:

> > +static __init int readahead_init(void)
> > +{
> > +	/*
> > +	 * Scale the max readahead window with system memory
> > +	 *
> > +	 *   64M:   128K
> > +	 *  128M:   180K
> > +	 *  256M:   256K
> > +	 *  512M:   360K
> > +	 *    1G:   512K
> > +	 *    2G:   724K
> > +	 *    4G:  1024K
> > +	 *    8G:  1448K
> > +	 *   16G:  2048K
> > +	 */
> > +	ra_pages = int_sqrt(totalram_pages/16);
> > +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> > +		ra_pages = 2 << (20 - PAGE_SHIFT);
> > +
> > +	return 0;
> > +}
> 
> How did you come up with these numbers?

Well, most other places in the kernel where we scale by memory size we
use the a sqrt curve, and the specific scale was the result of some
fiddling, these numbers looked sane to me, nothing special.

Would you suggest a different set, and if so, do you have any rationale
for them?


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
       [not found]   ` <20070722084526.GB6317@mail.ustc.edu.cn>
@ 2007-07-22  8:45     ` Fengguang Wu
  2007-07-22  8:59       ` Peter Zijlstra
  0 siblings, 1 reply; 46+ messages in thread
From: Fengguang Wu @ 2007-07-22  8:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook

On Sat, Jul 21, 2007 at 11:00:08PM +0200, Peter Zijlstra wrote:
> Scale the default max readahead size with the system memory size.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
>  block/ll_rw_blk.c  |    2 +-
>  include/linux/fs.h |    1 +
>  mm/readahead.c     |   32 ++++++++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6/block/ll_rw_blk.c
> ===================================================================
> --- linux-2.6.orig/block/ll_rw_blk.c
> +++ linux-2.6/block/ll_rw_blk.c
> @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu
>  	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
>  	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
>  	q->make_request_fn = mfn;
> -	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> +	bdi_ra_init(&q->backing_dev_info);

fs/fuse/inode.c has another line to be converted.

>  	q->backing_dev_info.state = 0;
>  	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
>  	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file
>  
>  extern void
>  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
> +extern void bdi_ra_init(struct backing_dev_info *bdi);
>  extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
>  extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
>  extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
> Index: linux-2.6/mm/readahead.c
> ===================================================================
> --- linux-2.6.orig/mm/readahead.c
> +++ linux-2.6/mm/readahead.c
> @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state 
>  }
>  EXPORT_SYMBOL_GPL(file_ra_state_init);
>  
> +static unsigned long ra_pages;
> +
> +static __init int readahead_init(void)
> +{
> +	/*
> +	 * Scale the max readahead window with system memory
> +	 *
> +	 *   64M:   128K
> +	 *  128M:   180K
> +	 *  256M:   256K
> +	 *  512M:   360K
> +	 *    1G:   512K
> +	 *    2G:   724K
> +	 *    4G:  1024K
> +	 *    8G:  1448K
> +	 *   16G:  2048K
> +	 */
> +	ra_pages = int_sqrt(totalram_pages/16);
> +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> +		ra_pages = 2 << (20 - PAGE_SHIFT);

We can elaborate on the numbers ;)

How about the following rules?
- limit it under 1MB: we have to consider latencies
- make them alignment-friendly, i.e. 128K, 256K, 512K, 1M.

My original plan is to simply do the following:

- #define VM_MAX_READAHEAD        128     /* kbytes */
+ #define VM_MAX_READAHEAD        512     /* kbytes */

I'd like to post some numbers to back-up the discussion:

  readahead   readahead
       size        miss
       128K         38%
       512K         45%
      1024K         49%

The numbers are measured on a fresh booted KDE desktop.

The majority misses come from the larger mmap read-arounds.
Sequential readahead hits are pretty high and not quite affected by
the readahead size, thanks to its size ramp-up process.

> +
> +	return 0;
> +}
> +
> +subsys_initcall(readahead_init);

Remove the global ra_pages and fold readahead_init() into bdi_ra_init()?
bdi_ra_init will only be called several times I guess.

> +
> +void bdi_ra_init(struct backing_dev_info *bdi)
> +{
> +	bdi->ra_pages = ra_pages;
> +}
> +EXPORT_SYMBOL(bdi_ra_init);


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-22  8:33       ` Rusty Russell
@ 2007-07-22  8:45         ` Peter Zijlstra
  2007-07-23  9:00         ` Nick Piggin
  1 sibling, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-22  8:45 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Fengguang Wu, Dave Jones, linux-kernel, riel, Andrew Morton,
	Tim Pepper, Chris Snook

On Sun, 2007-07-22 at 18:33 +1000, Rusty Russell wrote:
> On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > - It will avoid large-file-reads-thrashing-my-desktop problem,
> >   so most desktop users should like it. But sure there will be counter
> >   cases when a user want to keep the data cached.
> > - File servers may hurt from it. Imagine a mp3/png file server. The
> >   files are large enough to trigger drop-behind, but small (and hot)
> >   enough to be cached. Also when a new fedora DVD iso is released, it
> >   may be cached for some days. These are only the obvious cases.
> 
> I'm still not convinced of the latter case.  If a second user accesses
> the file before it pages are thrown out, from my understanding there
> won't be readahead but the pages will be put back on the active list
> (ie. the dropbehind is undone).
> 
> AFAICT this patch truly biasses towards releasing pages from
> sequentially accessed files.  So file servers where memory pressure
> comes from lots of such files should be fine (bias will work correctly
> against least-served files).
> 
> The main problem I can see is a low-memory server where we are currently
> swapping out pages from non-sequential files (eg. gigantic running java
> apps), and now performance might decline because we'll discard file
> pages first.
> 
> > So I opt for it being made tunable, safe, and turned off by default.
> 
> I'd like to see it turned on by default in -mm, and try to come up with
> some server-like workload to measure the effect.  Should be easy to
> simulate something (eg. apache server, where clients grab some files in
> preference, and apache server where clients grab different files).
> 
> Peter?

I guess we could do that. Just populate the server with a lot of
randomly sized files, and make a client do a random pull or one with a
poisson distribution or something like that.




^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  8:36     ` Peter Zijlstra
@ 2007-07-22  8:50       ` Jens Axboe
  2007-07-22  9:17         ` Peter Zijlstra
  2007-07-22 23:52         ` Rik van Riel
  0 siblings, 2 replies; 46+ messages in thread
From: Jens Axboe @ 2007-07-22  8:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, Jul 22 2007, Peter Zijlstra wrote:
> On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote:
> > On Sat, Jul 21 2007, Peter Zijlstra wrote:
> 
> > > +static __init int readahead_init(void)
> > > +{
> > > +	/*
> > > +	 * Scale the max readahead window with system memory
> > > +	 *
> > > +	 *   64M:   128K
> > > +	 *  128M:   180K
> > > +	 *  256M:   256K
> > > +	 *  512M:   360K
> > > +	 *    1G:   512K
> > > +	 *    2G:   724K
> > > +	 *    4G:  1024K
> > > +	 *    8G:  1448K
> > > +	 *   16G:  2048K
> > > +	 */
> > > +	ra_pages = int_sqrt(totalram_pages/16);
> > > +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> > > +		ra_pages = 2 << (20 - PAGE_SHIFT);
> > > +
> > > +	return 0;
> > > +}
> > 
> > How did you come up with these numbers?
> 
> Well, most other places in the kernel where we scale by memory size we
> use the a sqrt curve, and the specific scale was the result of some
> fiddling, these numbers looked sane to me, nothing special.
> 
> Would you suggest a different set, and if so, do you have any rationale
> for them?

I just wish you had a rationale behind them, I don't think it's that
great of a series. I agree with the low point of 128k. Then it'd be sane
to try and determine what the upper limit of ra window size goodness is,
which is probably impossible since it depends on the hardware a lot. But
lets just say the upper value is 2mb, then I think it's pretty silly
_not_ to use 2mb on a 1g machine for instance. So more aggressive
scaling.

Then there's the relationship between nr of requests and ra size. When
you leave everything up to a simple sqrt of total_ram type thing, then
you are sure to hit stupid values that cause a queue size of a number of
full requests, plus a small one at the end. Clearly not optimal!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  8:45     ` Fengguang Wu
@ 2007-07-22  8:59       ` Peter Zijlstra
       [not found]         ` <20070722095313.GA8136@mail.ustc.edu.cn>
  0 siblings, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-22  8:59 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook

On Sun, 2007-07-22 at 16:45 +0800, Fengguang Wu wrote:
> On Sat, Jul 21, 2007 at 11:00:08PM +0200, Peter Zijlstra wrote:
> > Scale the default max readahead size with the system memory size.
> > 
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> >  block/ll_rw_blk.c  |    2 +-
> >  include/linux/fs.h |    1 +
> >  mm/readahead.c     |   32 ++++++++++++++++++++++++++++++++
> >  3 files changed, 34 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/block/ll_rw_blk.c
> > ===================================================================
> > --- linux-2.6.orig/block/ll_rw_blk.c
> > +++ linux-2.6/block/ll_rw_blk.c
> > @@ -208,7 +208,7 @@ void blk_queue_make_request(request_queu
> >  	blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
> >  	blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
> >  	q->make_request_fn = mfn;
> > -	q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> > +	bdi_ra_init(&q->backing_dev_info);
> 
> fs/fuse/inode.c has another line to be converted.

Drad, right you are. Will grep a bit.

> >  	q->backing_dev_info.state = 0;
> >  	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
> >  	blk_queue_max_sectors(q, SAFE_MAX_SECTORS);
> > Index: linux-2.6/include/linux/fs.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/fs.h
> > +++ linux-2.6/include/linux/fs.h
> > @@ -1696,6 +1696,7 @@ extern long do_splice_direct(struct file
> >  
> >  extern void
> >  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping);
> > +extern void bdi_ra_init(struct backing_dev_info *bdi);
> >  extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
> >  extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
> >  extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
> > Index: linux-2.6/mm/readahead.c
> > ===================================================================
> > --- linux-2.6.orig/mm/readahead.c
> > +++ linux-2.6/mm/readahead.c
> > @@ -42,6 +42,38 @@ file_ra_state_init(struct file_ra_state 
> >  }
> >  EXPORT_SYMBOL_GPL(file_ra_state_init);
> >  
> > +static unsigned long ra_pages;
> > +
> > +static __init int readahead_init(void)
> > +{
> > +	/*
> > +	 * Scale the max readahead window with system memory
> > +	 *
> > +	 *   64M:   128K
> > +	 *  128M:   180K
> > +	 *  256M:   256K
> > +	 *  512M:   360K
> > +	 *    1G:   512K
> > +	 *    2G:   724K
> > +	 *    4G:  1024K
> > +	 *    8G:  1448K
> > +	 *   16G:  2048K
> > +	 */
> > +	ra_pages = int_sqrt(totalram_pages/16);
> > +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> > +		ra_pages = 2 << (20 - PAGE_SHIFT);
> 
> We can elaborate on the numbers ;)
> 
> How about the following rules?
> - limit it under 1MB: we have to consider latencies

readahead is done async and we have these cond_resched() things
sprinkled all over, no?

> - make them alignment-friendly, i.e. 128K, 256K, 512K, 1M.

Would that actually matter? but yeah, that seems like a sane suggestion.
roundup_pow_of_two() comes to mind.

> My original plan is to simply do the following:
> 
> - #define VM_MAX_READAHEAD        128     /* kbytes */
> + #define VM_MAX_READAHEAD        512     /* kbytes */

Yeah, the trouble I have with that is that it might adversely affect
tiny systems (although the trash detection might mitigate that impact)

> I'd like to post some numbers to back-up the discussion:
> 
>   readahead   readahead
>        size        miss
>        128K         38%
>        512K         45%
>       1024K         49%
> 
> The numbers are measured on a fresh booted KDE desktop.
> 
> The majority misses come from the larger mmap read-arounds.

the mmap code never gets into readahead unless madvise(MADV_SEQUENTIAL)
is used afaik.

> Sequential readahead hits are pretty high and not quite affected by
> the readahead size, thanks to its size ramp-up process.
> 
> > +
> > +	return 0;
> > +}
> > +
> > +subsys_initcall(readahead_init);
> 
> Remove the global ra_pages and fold readahead_init() into bdi_ra_init()?
> bdi_ra_init will only be called several times I guess.

I guess we could, this just seemed like a proper setup where more things
could grow into.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  8:50       ` Jens Axboe
@ 2007-07-22  9:17         ` Peter Zijlstra
  2007-07-22 16:44           ` Jens Axboe
  2007-07-22 23:52         ` Rik van Riel
  1 sibling, 1 reply; 46+ messages in thread
From: Peter Zijlstra @ 2007-07-22  9:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, 2007-07-22 at 10:50 +0200, Jens Axboe wrote:
> On Sun, Jul 22 2007, Peter Zijlstra wrote:
> > On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote:
> > > On Sat, Jul 21 2007, Peter Zijlstra wrote:
> > 
> > > > +static __init int readahead_init(void)
> > > > +{
> > > > +	/*
> > > > +	 * Scale the max readahead window with system memory
> > > > +	 *
> > > > +	 *   64M:   128K
> > > > +	 *  128M:   180K
> > > > +	 *  256M:   256K
> > > > +	 *  512M:   360K
> > > > +	 *    1G:   512K
> > > > +	 *    2G:   724K
> > > > +	 *    4G:  1024K
> > > > +	 *    8G:  1448K
> > > > +	 *   16G:  2048K
> > > > +	 */
> > > > +	ra_pages = int_sqrt(totalram_pages/16);
> > > > +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> > > > +		ra_pages = 2 << (20 - PAGE_SHIFT);
> > > > +
> > > > +	return 0;
> > > > +}
> > > 
> > > How did you come up with these numbers?
> > 
> > Well, most other places in the kernel where we scale by memory size we
> > use the a sqrt curve, and the specific scale was the result of some
> > fiddling, these numbers looked sane to me, nothing special.
> > 
> > Would you suggest a different set, and if so, do you have any rationale
> > for them?
> 
> I just wish you had a rationale behind them, I don't think it's that
> great of a series.

Well, I was quite ignorant of the issues you just pointed out. Thanks
those do indeed provide basis for a more solid set.

>  I agree with the low point of 128k.

Perhaps that should be enforced then, because currently a system with
<64M will get less.

>  Then it'd be sane
> to try and determine what the upper limit of ra window size goodness is,
> which is probably impossible since it depends on the hardware a lot. But
> lets just say the upper value is 2mb, then I think it's pretty silly
> _not_ to use 2mb on a 1g machine for instance. So more aggressive
> scaling.

Right, I was being a little conservative here.

> Then there's the relationship between nr of requests and ra size. When
> you leave everything up to a simple sqrt of total_ram type thing, then
> you are sure to hit stupid values that cause a queue size of a number of
> full requests, plus a small one at the end. Clearly not optimal!

And this is where Wu's point of power of two series comes into play,
right?

So something like:

  roundup_pow_of_two(int_sqrt((totalram_pages << (PAGE_SHIFT-10))))


		memory in MB		RA window in KB
                            64                            128
                           128                            256
                           256                            256
                           512                            512
                          1024                            512
                          2048                           1024
                          4096                           1024
                          8192                           2048
                         16384                           2048
                         32768                           4096
                         65536                           4096





^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
       [not found]         ` <20070722095313.GA8136@mail.ustc.edu.cn>
@ 2007-07-22  9:53           ` Fengguang Wu
  0 siblings, 0 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-07-22  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, riel, Andrew Morton, Rusty Russell, Tim Pepper,
	Chris Snook

On Sun, Jul 22, 2007 at 10:59:11AM +0200, Peter Zijlstra wrote:
> On Sun, 2007-07-22 at 16:45 +0800, Fengguang Wu wrote:
> > How about the following rules?
> > - limit it under 1MB: we have to consider latencies
> 
> readahead is done async and we have these cond_resched() things
> sprinkled all over, no?

Yeah, it should not be a big problem.

> > - make them alignment-friendly, i.e. 128K, 256K, 512K, 1M.
> 
> Would that actually matter? but yeah, that seems like a sane suggestion.
> roundup_pow_of_two() comes to mind.

E.g. RAID stride size, and the max_sectors_kb.
Typically they are power-of-two.

> > My original plan is to simply do the following:
> > 
> > - #define VM_MAX_READAHEAD        128     /* kbytes */
> > + #define VM_MAX_READAHEAD        512     /* kbytes */
> 
> Yeah, the trouble I have with that is that it might adversely affect
> tiny systems (although the trash detection might mitigate that impact)

I'm also OK with the scaling up scheme. It's reasonable.

> > I'd like to post some numbers to back-up the discussion:
> > 
> >   readahead   readahead
> >        size        miss
> >        128K         38%
> >        512K         45%
> >       1024K         49%
> > 
> > The numbers are measured on a fresh booted KDE desktop.
> > 
> > The majority misses come from the larger mmap read-arounds.
> 
> the mmap code never gets into readahead unless madvise(MADV_SEQUENTIAL)
> is used afaik.

Sadly mmap read-around reuses the same readahead size.
- for read-around, VM_MAX_READAHEAD is the _real_ readahead size
- for readahead, VM_MAX_READAHEAD is the _max_ readahead size
If we simply increasing VM_MAX_READAHEAD, tiny systems can be
immediately hurt by large read-arounds. That's the problem.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
@ 2007-07-22 11:11 Al Boldi
  0 siblings, 0 replies; 46+ messages in thread
From: Al Boldi @ 2007-07-22 11:11 UTC (permalink / raw)
  To: linux-kernel

Fengguang Wu wrote:
> On Sun, Jul 22, 2007 at 10:24:37AM +0200, Peter Zijlstra wrote:
> > On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > > - It will avoid large-file-reads-thrashing-my-desktop problem,
> > >   so most desktop users should like it. But sure there will be counter
> > >   cases when a user want to keep the data cached.
> > > - File servers may hurt from it. Imagine a mp3/png file server. The
> > >   files are large enough to trigger drop-behind, but small (and hot)
> > >   enough to be cached. Also when a new fedora DVD iso is released, it
> > >   may be cached for some days. These are only the obvious cases.
> > >
> > > So I opt for it being made tunable, safe, and turned off by default.
> >
> > I'm still not convinced (Rik wasn't either last time around). When these
> > files really are hot, they will be kept in memory due to them becoming
> > Active.
> >
> > Also, by scaling up the max readahead size it takes a larger file before
> > it starts dropping. If say this server has 4G of memory (not much at all
> > for a server) resulting in a 1M readahead window, the file needs to be >
> > ~2M before it starts drop behind.
>
> [snip]
>
> > But I guess it takes someone to try this IRL before we can settle this
> > debate :-/
>
> Yeah, some real workload numbers would help.

A patch against 2.6.22 may help too.


Thanks!

--
Al


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  9:17         ` Peter Zijlstra
@ 2007-07-22 16:44           ` Jens Axboe
  2007-07-23 10:04             ` Jörn Engel
  0 siblings, 1 reply; 46+ messages in thread
From: Jens Axboe @ 2007-07-22 16:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook

On Sun, Jul 22 2007, Peter Zijlstra wrote:
> On Sun, 2007-07-22 at 10:50 +0200, Jens Axboe wrote:
> > On Sun, Jul 22 2007, Peter Zijlstra wrote:
> > > On Sun, 2007-07-22 at 10:24 +0200, Jens Axboe wrote:
> > > > On Sat, Jul 21 2007, Peter Zijlstra wrote:
> > > 
> > > > > +static __init int readahead_init(void)
> > > > > +{
> > > > > +	/*
> > > > > +	 * Scale the max readahead window with system memory
> > > > > +	 *
> > > > > +	 *   64M:   128K
> > > > > +	 *  128M:   180K
> > > > > +	 *  256M:   256K
> > > > > +	 *  512M:   360K
> > > > > +	 *    1G:   512K
> > > > > +	 *    2G:   724K
> > > > > +	 *    4G:  1024K
> > > > > +	 *    8G:  1448K
> > > > > +	 *   16G:  2048K
> > > > > +	 */
> > > > > +	ra_pages = int_sqrt(totalram_pages/16);
> > > > > +	if (ra_pages > (2 << (20 - PAGE_SHIFT)))
> > > > > +		ra_pages = 2 << (20 - PAGE_SHIFT);
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > 
> > > > How did you come up with these numbers?
> > > 
> > > Well, most other places in the kernel where we scale by memory size we
> > > use the a sqrt curve, and the specific scale was the result of some
> > > fiddling, these numbers looked sane to me, nothing special.
> > > 
> > > Would you suggest a different set, and if so, do you have any rationale
> > > for them?
> > 
> > I just wish you had a rationale behind them, I don't think it's that
> > great of a series.
> 
> Well, I was quite ignorant of the issues you just pointed out. Thanks
> those do indeed provide basis for a more solid set.
> 
> >  I agree with the low point of 128k.
> 
> Perhaps that should be enforced then, because currently a system with
> <64M will get less.

I think it should remain the low point.

> >  Then it'd be sane
> > to try and determine what the upper limit of ra window size goodness is,
> > which is probably impossible since it depends on the hardware a lot. But
> > lets just say the upper value is 2mb, then I think it's pretty silly
> > _not_ to use 2mb on a 1g machine for instance. So more aggressive
> > scaling.
> 
> Right, I was being a little conservative here.
> 
> > Then there's the relationship between nr of requests and ra size. When
> > you leave everything up to a simple sqrt of total_ram type thing, then
> > you are sure to hit stupid values that cause a queue size of a number of
> > full requests, plus a small one at the end. Clearly not optimal!
> 
> And this is where Wu's point of power of two series comes into play,
> right?
> 
> So something like:
> 
>   roundup_pow_of_two(int_sqrt((totalram_pages << (PAGE_SHIFT-10))))
> 
> 
> 		memory in MB		RA window in KB
>                             64                            128
>                            128                            256
>                            256                            256
>                            512                            512
>                           1024                            512
>                           2048                           1024
>                           4096                           1024
>                           8192                           2048
>                          16384                           2048
>                          32768                           4096
>                          65536                           4096

Only if you assume that max request size is always a power of 2. That's
usually true, but there are cases where it's 124kb for instance.

And there's still an issue when max_sectors isn't the deciding factor,
if we end up having to stop merging on a request because we hit other
limitations.

So there's definitely room for improvement! Even today, btw, it's not
all because of these changes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22  8:50       ` Jens Axboe
  2007-07-22  9:17         ` Peter Zijlstra
@ 2007-07-22 23:52         ` Rik van Riel
  2007-07-23  5:22           ` Jens Axboe
  1 sibling, 1 reply; 46+ messages in thread
From: Rik van Riel @ 2007-07-22 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, Andrew Morton,
	Rusty Russell, Tim Pepper, Chris Snook

Jens Axboe wrote:

> I just wish you had a rationale behind them, I don't think it's that
> great of a series. I agree with the low point of 128k. Then it'd be sane
> to try and determine what the upper limit of ra window size goodness is,
> which is probably impossible since it depends on the hardware a lot. But
> lets just say the upper value is 2mb, then I think it's pretty silly
> _not_ to use 2mb on a 1g machine for instance. So more aggressive
> scaling.

1 or 2 MB is a nice number.

Seek time (plus rotational latency) on disks still takes
on the order of 10 ms, while disks commonly transfer data
on the order of 50MB/second.

That means one disk seek (10ms) takes as long as it takes
to read around 512kB of data.

The current 128kB means that if you have lots of streaming
IO going on, you spend only 20% of the time transferring
data and get roughly 10MB/s.  Seek 10ms, read 2.5ms worth
of data.

OTOH, if you do 2MB per request for the same heavy streaming
workload (say, an ftp or nfs server doing media files), you
can get 80% of the disk throughput, or 40MB/s.  This is because
you spend 40ms transferring data for every 10ms seek time.

Yes, filesystem metadata will reduce this "occasionally",
but the general idea holds.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22 23:52         ` Rik van Riel
@ 2007-07-23  5:22           ` Jens Axboe
  0 siblings, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2007-07-23  5:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, Andrew Morton,
	Rusty Russell, Tim Pepper, Chris Snook

On Sun, Jul 22 2007, Rik van Riel wrote:
> Jens Axboe wrote:
>
>> I just wish you had a rationale behind them, I don't think it's that
>> great of a series. I agree with the low point of 128k. Then it'd be sane
>> to try and determine what the upper limit of ra window size goodness is,
>> which is probably impossible since it depends on the hardware a lot. But
>> lets just say the upper value is 2mb, then I think it's pretty silly
>> _not_ to use 2mb on a 1g machine for instance. So more aggressive
>> scaling.
>
> 1 or 2 MB is a nice number.
>
> Seek time (plus rotational latency) on disks still takes
> on the order of 10 ms, while disks commonly transfer data
> on the order of 50MB/second.
>
> That means one disk seek (10ms) takes as long as it takes
> to read around 512kB of data.
>
> The current 128kB means that if you have lots of streaming
> IO going on, you spend only 20% of the time transferring
> data and get roughly 10MB/s.  Seek 10ms, read 2.5ms worth
> of data.
>
> OTOH, if you do 2MB per request for the same heavy streaming
> workload (say, an ftp or nfs server doing media files), you
> can get 80% of the disk throughput, or 40MB/s.  This is because
> you spend 40ms transferring data for every 10ms seek time.
>
> Yes, filesystem metadata will reduce this "occasionally",
> but the general idea holds.

I meant real numbers, the above is just rudimentary math. AS and CFQ
makes sure that you get a window of more than one request in, so huge
read-ahead sizes aren't as interesting.

So - more testing, less hand waving :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-22  8:33       ` Rusty Russell
  2007-07-22  8:45         ` Peter Zijlstra
@ 2007-07-23  9:00         ` Nick Piggin
       [not found]           ` <20070723142457.GA10130@mail.ustc.edu.cn>
  2007-07-25  4:35           ` Eric St-Laurent
  1 sibling, 2 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-23  9:00 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Fengguang Wu, Dave Jones, Peter Zijlstra, linux-kernel, riel,
	Andrew Morton, Tim Pepper, Chris Snook

Rusty Russell wrote:
> On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:

>>So I opt for it being made tunable, safe, and turned off by default.

I hate tunables :) Unless we have workload A that gets a reasonable
benefit from something and workload B that gets a significant regression,
and no clear way to reconcile them...

> I'd like to see it turned on by default in -mm, and try to come up with
> some server-like workload to measure the effect.  Should be easy to
> simulate something (eg. apache server, where clients grab some files in
> preference, and apache server where clients grab different files).

I don't like this kind of conditional information going from something
like readahead into page reclaim. Unless it is for readahead _specific_
data such as "I got these all wrong, so you can reclaim them" (which
this isn't).

Possibly it makes sense to realise that the given pages are cheaper
to read back in as they are apparently being read-ahead very nicely.

But I don't like it as a use-once thing. The VM should be able to get
that right.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-22 16:44           ` Jens Axboe
@ 2007-07-23 10:04             ` Jörn Engel
  2007-07-23 10:11               ` Jens Axboe
  2007-07-23 22:44               ` Rusty Russell
  0 siblings, 2 replies; 46+ messages in thread
From: Jörn Engel @ 2007-07-23 10:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, riel, Andrew Morton,
	Rusty Russell, Tim Pepper, Chris Snook

On Sun, 22 July 2007 18:44:03 +0200, Jens Axboe wrote:
> > 
> > >  I agree with the low point of 128k.
> > 
> > Perhaps that should be enforced then, because currently a system with
> > <64M will get less.
> 
> I think it should remain the low point.

I believe this whole thing is fundamentally flawed.  The perfect
readahead size depends on the device in question.  If we set a single
system-wide value depending on memory size, it may easily be too small
and too large at the same time.  Think hard disk and SSD.

It may make sense to have a _maximum_ readahead size which depends on
memory size.  But the preferred size (if there is enough RAM) should
depend solely on the device, possibly as a function of the
bandwidth * read latency product.

Jörn

-- 
Joern's library part 12:
http://physics.nist.gov/cuu/Units/binary.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-23 10:04             ` Jörn Engel
@ 2007-07-23 10:11               ` Jens Axboe
  2007-07-23 22:44               ` Rusty Russell
  1 sibling, 0 replies; 46+ messages in thread
From: Jens Axboe @ 2007-07-23 10:11 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Peter Zijlstra, linux-kernel, Fengguang Wu, riel, Andrew Morton,
	Rusty Russell, Tim Pepper, Chris Snook

On Mon, Jul 23 2007, Jörn Engel wrote:
> On Sun, 22 July 2007 18:44:03 +0200, Jens Axboe wrote:
> > > 
> > > >  I agree with the low point of 128k.
> > > 
> > > Perhaps that should be enforced then, because currently a system with
> > > <64M will get less.
> > 
> > I think it should remain the low point.
> 
> I believe this whole thing is fundamentally flawed.  The perfect
> readahead size depends on the device in question.  If we set a single
> system-wide value depending on memory size, it may easily be too small
> and too large at the same time.  Think hard disk and SSD.
> 
> It may make sense to have a _maximum_ readahead size which depends on
> memory size.  But the preferred size (if there is enough RAM) should
> depend solely on the device, possibly as a function of the
> bandwidth * read latency product.

The value IS the maximum read-ahead value, not an enforced minimum.

However, there's definitely room for improvement in the queue feedback.
Even for seekless devices like SSD, read-ahead may be beneficial due to
zero request -> request latency.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found]           ` <20070723142457.GA10130@mail.ustc.edu.cn>
@ 2007-07-23 14:24             ` Fengguang Wu
  2007-07-23 19:40               ` Andrew Morton
  0 siblings, 1 reply; 46+ messages in thread
From: Fengguang Wu @ 2007-07-23 14:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Dave Jones, Peter Zijlstra, linux-kernel, riel,
	Andrew Morton, Tim Pepper, Chris Snook, Jens Axboe

On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> Rusty Russell wrote:
> >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> 
> >>So I opt for it being made tunable, safe, and turned off by default.
> 
> I hate tunables :) Unless we have workload A that gets a reasonable
> benefit from something and workload B that gets a significant regression,
> and no clear way to reconcile them...

Me too ;)

But sometimes we really want to avoid flushing the cache.
Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here.

> >I'd like to see it turned on by default in -mm, and try to come up with
> >some server-like workload to measure the effect.  Should be easy to
> >simulate something (eg. apache server, where clients grab some files in
> >preference, and apache server where clients grab different files).
> 
> I don't like this kind of conditional information going from something
> like readahead into page reclaim. Unless it is for readahead _specific_
> data such as "I got these all wrong, so you can reclaim them" (which
> this isn't).
> 
> Possibly it makes sense to realise that the given pages are cheaper
> to read back in as they are apparently being read-ahead very nicely.

In fact I have talked to Jens about it in last year's kernel summit.
The patch below explains itself.
---
Subject: cost based page reclaim

Cost based page reclaim - a minimalist implementation.

Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a
large file.  Should we first drop the 32-pages which are read in one I/O, or
drop the 32 distinct pages, each costs one I/O? (Given that the files are of
equal hotness.)

Page replacement algorithms should be designed to minimize the number of I/Os,
instead of the number of page faults. Dividing the cost of I/O by the number of
pages it bring in, we get the cost of the page. The bigger page cost, the more
'lives/bloods' the page should have.

This patch adds life to costly pages by pretending that they are
referenced more times. Possible downsides:
- burdens the pressure of vmscan
- active pages are no longer that 'active'

Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 include/linux/backing-dev.h |    1 +
 mm/readahead.c              |   27 +++++++++++++++++++++++++++
 mm/swap.c                   |    5 ++++-
 3 files changed, 32 insertions(+), 1 deletion(-)

--- linux-2.6.22.orig/mm/readahead.c
+++ linux-2.6.22/mm/readahead.c
@@ -125,6 +125,7 @@ static void ra_account(struct file_ra_st
 
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= MAX_RA_PAGES,
+	.avg_ra_size	= MAX_RA_PAGES * PAGE_CACHE_SIZE,
 	.ra_pages0	= INITIAL_RA_PAGES,
 	.ra_thrash_bytes = MAX_RA_PAGES * PAGE_CACHE_SIZE,
 	.state		= 0,
@@ -133,6 +134,26 @@ struct backing_dev_info default_backing_
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
+#define log2(n) fls(n)
+static int readahead_cost_per_page(struct address_space *mapping,
+						unsigned long pages)
+{
+	unsigned long avg;
+	int cost;
+
+	avg = mapping->backing_dev_info->avg_ra_size;
+	avg = (pages * PAGE_CACHE_SIZE + avg * 1023) / 1024;
+	mapping->backing_dev_info->avg_ra_size = avg;
+
+	avg = avg / PAGE_CACHE_SIZE;
+	if (!avg || !pages)
+		cost = 0;
+	else
+		cost = (log2(avg) - log2(pages)) * 5 / log2(avg);
+
+	return cost;
+}
+
 /*
  * Initialise a struct file's readahead state.  Assumes that the caller has
  * memset *ra to zero.
@@ -418,6 +439,7 @@ __do_page_cache_readahead(struct address
 	struct page *page;
 	unsigned long end_index;	/* The last page we want to read */
 	LIST_HEAD(page_pool);
+	int cost;
 	int page_idx;
 	int ret = 0;
 	loff_t isize = i_size_read(inode);
@@ -426,6 +448,7 @@ __do_page_cache_readahead(struct address
 		goto out;
 
 	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	cost = readahead_cost_per_page(mapping, nr_to_read);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -448,6 +471,10 @@ __do_page_cache_readahead(struct address
 		if (!page)
 			break;
 		page->index = page_offset;
+		if (cost >= 3)
+			SetPageActive(page);
+		else if (cost == 2)
+			SetPageReferenced(page);
 		list_add(&page->lru, &page_pool);
 		if (page_idx == nr_to_read - lookahead_size)
 			SetPageReadahead(page);
--- linux-2.6.22.orig/mm/swap.c
+++ linux-2.6.22/mm/swap.c
@@ -365,7 +365,10 @@ void __pagevec_lru_add(struct pagevec *p
 		}
 		VM_BUG_ON(PageLRU(page));
 		SetPageLRU(page);
-		add_page_to_inactive_list(zone, page);
+		if (!PageActive(page))
+			add_page_to_inactive_list(zone, page);
+		else
+			add_page_to_active_list(zone, page);
 	}
 	if (zone)
 		spin_unlock_irq(&zone->lru_lock);
--- linux-2.6.22.orig/include/linux/backing-dev.h
+++ linux-2.6.22/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ typedef int (congested_fn)(void *, int);
 
 struct backing_dev_info {
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
+	unsigned long avg_ra_size;
 	unsigned long ra_pages0; /* min readahead on start of file */
 	unsigned long ra_thrash_bytes;	/* estimated thrashing threshold */
 	unsigned long state;	/* Always use atomic bitops on this */


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-23 14:24             ` Fengguang Wu
@ 2007-07-23 19:40               ` Andrew Morton
       [not found]                 ` <20070724004728.GA8026@mail.ustc.edu.cn>
  0 siblings, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2007-07-23 19:40 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe

On Mon, 23 Jul 2007 22:24:57 +0800
Fengguang Wu <fengguang.wu@gmail.com> wrote:

> On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> > Rusty Russell wrote:
> > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > 
> > >>So I opt for it being made tunable, safe, and turned off by default.
> > 
> > I hate tunables :) Unless we have workload A that gets a reasonable
> > benefit from something and workload B that gets a significant regression,
> > and no clear way to reconcile them...
> 
> Me too ;)
> 
> But sometimes we really want to avoid flushing the cache.
> Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here.

It's the only way to go in some situations.  Sometimes the kernel just
cannot predict the future sufficiently well, and the costs of making a
mistake are terribly high.  We need human help.  And it should be
administration-time help, not programming-time help.

> > >I'd like to see it turned on by default in -mm, and try to come up with
> > >some server-like workload to measure the effect.  Should be easy to
> > >simulate something (eg. apache server, where clients grab some files in
> > >preference, and apache server where clients grab different files).
> > 
> > I don't like this kind of conditional information going from something
> > like readahead into page reclaim. Unless it is for readahead _specific_
> > data such as "I got these all wrong, so you can reclaim them" (which
> > this isn't).
> > 
> > Possibly it makes sense to realise that the given pages are cheaper
> > to read back in as they are apparently being read-ahead very nicely.
> 
> In fact I have talked to Jens about it in last year's kernel summit.
> The patch below explains itself.
> ---
> Subject: cost based page reclaim
> 
> Cost based page reclaim - a minimalist implementation.
> 
> Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a
> large file.  Should we first drop the 32-pages which are read in one I/O, or
> drop the 32 distinct pages, each costs one I/O? (Given that the files are of
> equal hotness.)
> 
> Page replacement algorithms should be designed to minimize the number of I/Os,
> instead of the number of page faults. Dividing the cost of I/O by the number of
> pages it bring in, we get the cost of the page. The bigger page cost, the more
> 'lives/bloods' the page should have.
> 
> This patch adds life to costly pages by pretending that they are
> referenced more times. Possible downsides:
> - burdens the pressure of vmscan
> - active pages are no longer that 'active'
> 

This is all fun stuff, but how do we find out that changes like this are
good ones, apart from shipping it and seeing who gets hurt 12 months later?

> +#define log2(n) fls(n)

<look at include/linux/log2.h>


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/3] readahead: scale max readahead size depending on memory size
  2007-07-23 10:04             ` Jörn Engel
  2007-07-23 10:11               ` Jens Axboe
@ 2007-07-23 22:44               ` Rusty Russell
  1 sibling, 0 replies; 46+ messages in thread
From: Rusty Russell @ 2007-07-23 22:44 UTC (permalink / raw)
  To: Jörn Engel
  Cc: Jens Axboe, Peter Zijlstra, linux-kernel, Fengguang Wu, riel,
	Andrew Morton, Tim Pepper, Chris Snook

On Mon, 2007-07-23 at 12:04 +0200, Jörn Engel wrote:
> I believe this whole thing is fundamentally flawed.  The perfect
> readahead size depends on the device in question.  If we set a single
> system-wide value depending on memory size, it may easily be too small
> and too large at the same time.  Think hard disk and SSD.

Well, I think the filesystem should get the first shot at altering the
readahead heuristic (we should add a hook so it can then ask the
underlying device).

Something like:
===
Hook for filesystems to tweak readahead logic

It was suggested at Fengguang's OLS talk that filesystems can make a
useful contribution to the readahead logic: avoiding seeks and lock
boundaries (on cluster filesystems) and other such filesystem-specific
things.

The alter_readahead hook should return the number of pages to read.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

---
 Documentation/filesystems/vfs.txt |   10 ++++++----
 include/linux/fs.h                |    1 +
 mm/readahead.c                    |   12 ++++++++++++
 3 files changed, 19 insertions(+), 4 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/include/linux/fs.h
+++ linux-2.6.22-rc6-mm1/include/linux/fs.h
@@ -1204,7 +1204,8 @@ struct file_operations {
 	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
 	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*revoke)(struct file *, struct address_space *);
+	unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long);
 };
 
 struct inode_operations {
--- linux-2.6.22-rc6-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc6-mm1/mm/readahead.c
@@ -394,7 +394,19 @@ ondemand_readahead(struct address_space 
 	}
 
 readit:
+	/*
+	 * Give filesystem a chance to tweak readahead size (eg. to
+	 * contiguous block boundary).  It should probably not change
+	 * by more than 50% otherwise it's really ignoring our
+	 * readahead advice.
+	 */
+	if (filp->f_op->alter_readahead) {
+		ra->size = filp->f_op->alter_readahead(filp, offset, req_size,
+						       ra->start, ra->size);
+		if (ra->async_size > ra->size)
+			ra->async_size = ra->size;
+	}
 	return ra_submit(ra, mapping, filp);
 }
 
--- linux-2.6.22-rc6-mm1.orig/Documentation/filesystems/vfs.txt
+++ linux-2.6.22-rc6-mm1/Documentation/filesystems/vfs.txt
@@ -777,11 +777,10 @@ struct file_operations {
 	int (*check_flags)(int);
 	int (*dir_notify)(struct file *filp, unsigned long arg);
 	int (*flock) (struct file *, int, struct file_lock *);
-	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned 
-int);
-	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned  
-int);
+	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
+	ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
 	int (*revoke)(struct file *);
+	unsigned long (*alter_readahead)(struct file *, pgoff_t, unsigned long, pgoff_t, unsigned long);
 };
 
 Again, all methods are called without any locks being held, unless
@@ -859,6 +858,9 @@ otherwise noted.
 	  to an open file. This method must ensure that all currently blocked
 	  writes are flushed and reads will fail.
 
+  alter_readahead: called when the readahead logic is about to submit a
+                   readahead request for I/O
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides. When opening a device node
 (character or block special) most filesystems will call special



^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
       [not found]                 ` <20070724004728.GA8026@mail.ustc.edu.cn>
@ 2007-07-24  0:47                   ` Fengguang Wu
  2007-07-24  1:17                     ` Andrew Morton
  2007-07-24  4:30                     ` Nick Piggin
  0 siblings, 2 replies; 46+ messages in thread
From: Fengguang Wu @ 2007-07-24  0:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe,
	linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin

On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:
> On Mon, 23 Jul 2007 22:24:57 +0800
> Fengguang Wu <fengguang.wu@gmail.com> wrote:
> 
> > On Mon, Jul 23, 2007 at 07:00:59PM +1000, Nick Piggin wrote:
> > > Rusty Russell wrote:
> > > >On Sun, 2007-07-22 at 16:10 +0800, Fengguang Wu wrote:
> > > 
> > > >>So I opt for it being made tunable, safe, and turned off by default.
> > > 
> > > I hate tunables :) Unless we have workload A that gets a reasonable
> > > benefit from something and workload B that gets a significant regression,
> > > and no clear way to reconcile them...
> > 
> > Me too ;)
> > 
> > But sometimes we really want to avoid flushing the cache.
> > Andrew's user space LD_PRELOAD+fadvise based tool fit nicely here.
> 
> It's the only way to go in some situations.  Sometimes the kernel just
> cannot predict the future sufficiently well, and the costs of making a
> mistake are terribly high.  We need human help.  And it should be
> administration-time help, not programming-time help.

Agreed. I feel that drop behind is not a universal applicable.
Cost based reclaim sounds better, but who knows before field use ;)

> > > >I'd like to see it turned on by default in -mm, and try to come up with
> > > >some server-like workload to measure the effect.  Should be easy to
> > > >simulate something (eg. apache server, where clients grab some files in
> > > >preference, and apache server where clients grab different files).
> > > 
> > > I don't like this kind of conditional information going from something
> > > like readahead into page reclaim. Unless it is for readahead _specific_
> > > data such as "I got these all wrong, so you can reclaim them" (which
> > > this isn't).
> > > 
> > > Possibly it makes sense to realise that the given pages are cheaper
> > > to read back in as they are apparently being read-ahead very nicely.
> > 
> > In fact I have talked to Jens about it in last year's kernel summit.
> > The patch below explains itself.
> > ---
> > Subject: cost based page reclaim
> > 
> > Cost based page reclaim - a minimalist implementation.
> > 
> > Suppose we cached 32 small files each with 1 page, and one 32-page chunk from a
> > large file.  Should we first drop the 32-pages which are read in one I/O, or
> > drop the 32 distinct pages, each costs one I/O? (Given that the files are of
> > equal hotness.)
> > 
> > Page replacement algorithms should be designed to minimize the number of I/Os,
> > instead of the number of page faults. Dividing the cost of I/O by the number of
> > pages it bring in, we get the cost of the page. The bigger page cost, the more
> > 'lives/bloods' the page should have.
> > 
> > This patch adds life to costly pages by pretending that they are
> > referenced more times. Possible downsides:
> > - burdens the pressure of vmscan
> > - active pages are no longer that 'active'
> > 
> 
> This is all fun stuff, but how do we find out that changes like this are
> good ones, apart from shipping it and seeing who gets hurt 12 months later?

One thing I can imagine now is that the first pages may get more life
because of the conservative initial readahead size.

Generally file servers use sendfile(wholefile), so not a problem.

> > +#define log2(n) fls(n)
> 
> <look at include/linux/log2.h>

Thank you, this comment lead to another patch :)
---
Subject: convert ill defined log2() to ilog2()

It's *wrong* to have
			#define log2(n) ffz(~(n))
It should be *reversed*:
			#define log2(n) flz(~(n))
or
			#define log2(n) fls(n)
or just use
			ilog2(n) defined in linux/log2.h.

This patch follows the last solution, recommended by Andrew Morton.

//Or are they simply the wrong naming, and is in fact correct in behavior?

Cc: linux-ext4@vger.kernel.org
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: Chris Ahna <christopher.j.ahna@intel.com>
Cc: David Mosberger-Tang <davidm@hpl.hp.com>
Cc: Kyle McMartin <kyle@parisc-linux.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
 drivers/char/agp/hp-agp.c     |    9 +++------
 drivers/char/agp/i460-agp.c   |    5 ++---
 drivers/char/agp/parisc-agp.c |    7 ++-----
 fs/ext4/super.c               |    6 ++----
 4 files changed, 9 insertions(+), 18 deletions(-)

--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
@@ -14,15 +14,12 @@
 #include <linux/pci.h>
 #include <linux/init.h>
 #include <linux/agp_backend.h>
+#include <linux/log2.h>
 
 #include <asm/acpi-ext.h>
 
 #include "agp.h"
 
-#ifndef log2
-#define log2(x)		ffz(~(x))
-#endif
-
 #define HP_ZX1_IOC_OFFSET	0x1000  /* ACPI reports SBA, we want IOC */
 
 /* HP ZX1 IOC registers */
@@ -256,7 +253,7 @@ hp_zx1_configure (void)
 		readl(hp->ioc_regs+HP_ZX1_IMASK);
 		writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
 		readl(hp->ioc_regs+HP_ZX1_IBASE);
-		writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
+		writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
 		readl(hp->ioc_regs+HP_ZX1_PCOM);
 	}
 
@@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
 {
 	struct _hp_private *hp = &hp_private;
 
-	writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
+	writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
 	readq(hp->ioc_regs+HP_ZX1_PCOM);
 }
 
--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
@@ -13,6 +13,7 @@
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/agp_backend.h>
+#include <linux/log2.h>
 
 #include "agp.h"
 
@@ -59,8 +60,6 @@
  */
 #define WR_FLUSH_GATT(index)	RD_GATT(index)
 
-#define log2(x)			ffz(~(x))
-
 static struct {
 	void *gatt;				/* ioremap'd GATT area */
 
@@ -148,7 +147,7 @@ static int i460_fetch_size (void)
 		 * values[i].size.
 		 */
 		values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12);
-		values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
+		values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
 	}
 
 	for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
--- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
+++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
@@ -18,6 +18,7 @@
 #include <linux/init.h>
 #include <linux/klist.h>
 #include <linux/agp_backend.h>
+#include <linux/log2.h>
 
 #include <asm-parisc/parisc-device.h>
 #include <asm-parisc/ropes.h>
@@ -27,10 +28,6 @@
 #define DRVNAME	"quicksilver"
 #define DRVPFX	DRVNAME ": "
 
-#ifndef log2
-#define log2(x)		ffz(~(x))
-#endif
-
 #define AGP8X_MODE_BIT		3
 #define AGP8X_MODE		(1 << AGP8X_MODE_BIT)
 
@@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
 {
 	struct _parisc_agp_info *info = &parisc_agp_info;
 
-	writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM);
+	writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM);
 	readq(info->ioc_regs+IOC_PCOM);	/* flush */
 }
 
--- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
+++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
@@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct 
 	sb->s_flags = s_flags; /* Restore MS_RDONLY status */
 }
 
-#define log2(n) ffz(~(n))
-
 /*
  * Maximal file size.  There is a direct, and {,double-,triple-}indirect
  * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
@@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
 	sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
 	sbi->s_sbh = bh;
 	sbi->s_mount_state = le16_to_cpu(es->s_state);
-	sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
-	sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
+	sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
+	sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
 	for (i=0; i < 4; i++)
 		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
 	sbi->s_def_hash_version = es->s_def_hash_version;


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  0:47                   ` Fengguang Wu
@ 2007-07-24  1:17                     ` Andrew Morton
  2007-07-24  8:50                       ` Andreas Dilger
  2007-07-24  4:30                     ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Andrew Morton @ 2007-07-24  1:17 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Nick Piggin, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe,
	linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin, Dave Jones, Dave Airlie

On Tue, 24 Jul 2007 08:47:28 +0800
Fengguang Wu <fengguang.wu@gmail.com> wrote:

> Subject: convert ill defined log2() to ilog2()
> 
> It's *wrong* to have
> 			#define log2(n) ffz(~(n))
> It should be *reversed*:
> 			#define log2(n) flz(~(n))
> or
> 			#define log2(n) fls(n)
> or just use
> 			ilog2(n) defined in linux/log2.h.
> 
> This patch follows the last solution, recommended by Andrew Morton.
> 
> //Or are they simply the wrong naming, and is in fact correct in behavior?
> 
> Cc: linux-ext4@vger.kernel.org
> Cc: Mingming Cao <cmm@us.ibm.com>
> Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
> Cc: Chris Ahna <christopher.j.ahna@intel.com>
> Cc: David Mosberger-Tang <davidm@hpl.hp.com>
> Cc: Kyle McMartin <kyle@parisc-linux.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  drivers/char/agp/hp-agp.c     |    9 +++------
>  drivers/char/agp/i460-agp.c   |    5 ++---
>  drivers/char/agp/parisc-agp.c |    7 ++-----
>  fs/ext4/super.c               |    6 ++----
>  4 files changed, 9 insertions(+), 18 deletions(-)

hm, yes, there is a risk that the code was accidentally correct.  Or the
code has only ever dealt with power-of-2 inputs, in which case it happens
to work either way.

David(s) and ext4-people: could we please have a close review of these
changes?

Thanks.

> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/hp-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/hp-agp.c
> @@ -14,15 +14,12 @@
>  #include <linux/pci.h>
>  #include <linux/init.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>  
>  #include <asm/acpi-ext.h>
>  
>  #include "agp.h"
>  
> -#ifndef log2
> -#define log2(x)		ffz(~(x))
> -#endif
> -
>  #define HP_ZX1_IOC_OFFSET	0x1000  /* ACPI reports SBA, we want IOC */
>  
>  /* HP ZX1 IOC registers */
> @@ -256,7 +253,7 @@ hp_zx1_configure (void)
>  		readl(hp->ioc_regs+HP_ZX1_IMASK);
>  		writel(hp->iova_base|1, hp->ioc_regs+HP_ZX1_IBASE);
>  		readl(hp->ioc_regs+HP_ZX1_IBASE);
> -		writel(hp->iova_base|log2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
> +		writel(hp->iova_base|ilog2(HP_ZX1_IOVA_SIZE), hp->ioc_regs+HP_ZX1_PCOM);
>  		readl(hp->ioc_regs+HP_ZX1_PCOM);
>  	}
>  
> @@ -284,7 +281,7 @@ hp_zx1_tlbflush (struct agp_memory *mem)
>  {
>  	struct _hp_private *hp = &hp_private;
>  
> -	writeq(hp->gart_base | log2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
> +	writeq(hp->gart_base | ilog2(hp->gart_size), hp->ioc_regs+HP_ZX1_PCOM);
>  	readq(hp->ioc_regs+HP_ZX1_PCOM);
>  }
>  
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/i460-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/i460-agp.c
> @@ -13,6 +13,7 @@
>  #include <linux/string.h>
>  #include <linux/slab.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>  
>  #include "agp.h"
>  
> @@ -59,8 +60,6 @@
>   */
>  #define WR_FLUSH_GATT(index)	RD_GATT(index)
>  
> -#define log2(x)			ffz(~(x))
> -
>  static struct {
>  	void *gatt;				/* ioremap'd GATT area */
>  
> @@ -148,7 +147,7 @@ static int i460_fetch_size (void)
>  		 * values[i].size.
>  		 */
>  		values[i].num_entries = (values[i].size << 8) >> (I460_IO_PAGE_SHIFT - 12);
> -		values[i].page_order = log2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
> +		values[i].page_order = ilog2((sizeof(u32)*values[i].num_entries) >> PAGE_SHIFT);
>  	}
>  
>  	for (i = 0; i < agp_bridge->driver->num_aperture_sizes; i++) {
> --- linux-2.6.22-rc6-mm1.orig/drivers/char/agp/parisc-agp.c
> +++ linux-2.6.22-rc6-mm1/drivers/char/agp/parisc-agp.c
> @@ -18,6 +18,7 @@
>  #include <linux/init.h>
>  #include <linux/klist.h>
>  #include <linux/agp_backend.h>
> +#include <linux/log2.h>
>  
>  #include <asm-parisc/parisc-device.h>
>  #include <asm-parisc/ropes.h>
> @@ -27,10 +28,6 @@
>  #define DRVNAME	"quicksilver"
>  #define DRVPFX	DRVNAME ": "
>  
> -#ifndef log2
> -#define log2(x)		ffz(~(x))
> -#endif
> -
>  #define AGP8X_MODE_BIT		3
>  #define AGP8X_MODE		(1 << AGP8X_MODE_BIT)
>  
> @@ -92,7 +89,7 @@ parisc_agp_tlbflush(struct agp_memory *m
>  {
>  	struct _parisc_agp_info *info = &parisc_agp_info;
>  
> -	writeq(info->gart_base | log2(info->gart_size), info->ioc_regs+IOC_PCOM);
> +	writeq(info->gart_base | ilog2(info->gart_size), info->ioc_regs+IOC_PCOM);
>  	readq(info->ioc_regs+IOC_PCOM);	/* flush */
>  }
>  
> --- linux-2.6.22-rc6-mm1.orig/fs/ext4/super.c
> +++ linux-2.6.22-rc6-mm1/fs/ext4/super.c
> @@ -1433,8 +1433,6 @@ static void ext4_orphan_cleanup (struct 
>  	sb->s_flags = s_flags; /* Restore MS_RDONLY status */
>  }
>  
> -#define log2(n) ffz(~(n))
> -
>  /*
>   * Maximal file size.  There is a direct, and {,double-,triple-}indirect
>   * block limit, and also a limit of (2^32 - 1) 512-byte sectors in i_blocks.
> @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
>  	sbi->s_desc_per_block = blocksize / EXT4_DESC_SIZE(sb);
>  	sbi->s_sbh = bh;
>  	sbi->s_mount_state = le16_to_cpu(es->s_state);
> -	sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> -	sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> +	sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> +	sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));
>  	for (i=0; i < 4; i++)
>  		sbi->s_hash_seed[i] = le32_to_cpu(es->s_hash_seed[i]);
>  	sbi->s_def_hash_version = es->s_def_hash_version;

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  0:47                   ` Fengguang Wu
  2007-07-24  1:17                     ` Andrew Morton
@ 2007-07-24  4:30                     ` Nick Piggin
  1 sibling, 0 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-24  4:30 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Rusty Russell, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Tim Pepper, Chris Snook, Jens Axboe,
	linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin

Fengguang Wu wrote:
> On Mon, Jul 23, 2007 at 12:40:09PM -0700, Andrew Morton wrote:

>>This is all fun stuff, but how do we find out that changes like this are
>>good ones, apart from shipping it and seeing who gets hurt 12 months later?
> 
> 
> One thing I can imagine now is that the first pages may get more life
> because of the conservative initial readahead size.

Yeah I don't think it is really worth the complexity and corner cases
it would introduce at this stage.

People are still complaining about their nightly cron job *swapping*
stuff out of their 1-2GB desktop systems. There must still be some low
hanging fruit or obvious problems with page reclaim so I'd like to try
working out exactly what is going wrong and fix it in page reclaim
rather than adding more complexity in the hope that it might help.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-24  1:17                     ` Andrew Morton
@ 2007-07-24  8:50                       ` Andreas Dilger
  0 siblings, 0 replies; 46+ messages in thread
From: Andreas Dilger @ 2007-07-24  8:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, Nick Piggin, Rusty Russell, Dave Jones,
	Peter Zijlstra, linux-kernel, riel, Tim Pepper, Chris Snook,
	Jens Axboe, linux-ext4, Mingming Cao, Bjorn Helgaas, Chris Ahna,
	David Mosberger-Tang, Kyle McMartin, Dave Jones, Dave Airlie

On Jul 23, 2007  18:17 -0700, Andrew Morton wrote:
> hm, yes, there is a risk that the code was accidentally correct.  Or the
> code has only ever dealt with power-of-2 inputs, in which case it happens
> to work either way.
> 
> David(s) and ext4-people: could we please have a close review of these
> changes?
> > @@ -1706,8 +1704,8 @@ static int ext4_fill_super (struct super
> > -	sbi->s_addr_per_block_bits = log2(EXT4_ADDR_PER_BLOCK(sb));
> > -	sbi->s_desc_per_block_bits = log2(EXT4_DESC_PER_BLOCK(sb));
> > +	sbi->s_addr_per_block_bits = ilog2(EXT4_ADDR_PER_BLOCK(sb));
> > +	sbi->s_desc_per_block_bits = ilog2(EXT4_DESC_PER_BLOCK(sb));

For the ext[234] code there has only ever been power-of-two values for
ADDR_PER_BLOCK() and DESC_PER_BLOCK().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/3] readahead: drop behind
  2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
  2007-07-21 20:29   ` Eric St-Laurent
@ 2007-07-25  3:55   ` Eric St-Laurent
  1 sibling, 0 replies; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  3:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Fengguang Wu, riel, Andrew Morton, Rusty Russell,
	Tim Pepper, Chris Snook, Nick Piggin

[-- Attachment #1: Type: text/plain, Size: 3307 bytes --]

On Sat, 2007-21-07 at 23:00 +0200, Peter Zijlstra wrote:

> Use the read-ahead code to provide hints to page reclaim.
> 
> This patch has the potential to solve the streaming-IO trashes my
> desktop problem.
> 
> It tries to aggressively reclaim pages that were loaded in a strong
> sequential pattern and have been consumed. Thereby limiting the damage
> to the current resident set.
> 
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

(sorry for the delay)

Ok, I've done some tests with your patches,

I came up with a test program that should approximate my use case. It
simply mmap() and scan (read) a 375M file which represent the usual used
memory on my desktop system.  This data is frequently used, and should
stay cached as much as possible in preference over the "used once" data
read in the page cache when copying large files. I don't claim that the
test program is perfect or even correct, I'm open for suggestions.

Test system:

- Linux x86_64 2.6.23-rc1
- 1G of RAM
- I use the basic drop behind and sysctl patches. The readahead size
patch is _not_ included.


Setting up:

dd if=/dev/zero of=/tmp/375M_file bs=1M count=375
dd if=/dev/zero of=/tmp/5G_file bs=1M count=5120

Tests with stock kernel (drop behind disabled):

echo 0 >/proc/sys/vm/drop_behind

Base test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.146s
2nd execution: 0m1.119s
3rd execution: 0m1.109s
4th execution: 0m1.105s

Reading a large file test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
cp /tmp/5G_file /dev/null
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file

1st execution: 0m7.224s
2nd execution: 0m1.114s
3rd execution: 0m7.178s <<< Much slower
4th execution: 0m1.115s

Copying (read+write) a large file test:

sync; echo 1 >/proc/sys/vm/drop_caches
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
cp /tmp/5G_file /tmp/copy_of_5G_file
time ./large_app_load_simul /tmp/375M_file
time ./large_app_load_simul /tmp/375M_file
rm /tmp/copy_of_5G_file

1st execution: 0m7.203s
2nd execution: 0m1.147s
3rd execution: 0m7.238s <<< Much slower
4th execution: 0m1.129s

Tests with drop behind enabled:

echo 1 >/proc/sys/vm/drop_behind

Base test:

[same tests as above]

1st execution: 0m7.206s
2nd execution: 0m1.110s
3rd execution: 0m1.102s
4th execution: 0m1.106s

Reading a large file test:

[same tests as above]

1st execution: 0m7.197s
2nd execution: 0m1.116s
3rd execution: 0m1.114s <<< Great!!!
4th execution: 0m1.111s

Copying (read+write) a large file test:

[same tests as above]

1st execution: 0m7.186s
2nd execution: 0m1.111s
3rd execution: 0m7.339s <<< Not fixed
4th execution: 0m1.121s


Conclusion:

- The drop-behind patch works and really prevents the page cache content
from being fulled with useless read-once data.

- It doesn't help the copy (read+write) case. This should also be fixed,
as it's a common workload.

Tested-By: Eric St-Laurent (ericstl34@xxxxxxxxx.xx)



Best regards,

- Eric

(*) Test program and batch file are attached.


[-- Attachment #2: drop-behind+sysctl.patch --]
[-- Type: text/x-patch, Size: 5642 bytes --]

diff -urN linux-2.6/include/linux/swap.h linux-2.6-drop-behind/include/linux/swap.h
--- linux-2.6/include/linux/swap.h	2007-07-21 18:26:00.000000000 -0400
+++ linux-2.6-drop-behind/include/linux/swap.h	2007-07-22 16:22:48.000000000 -0400
@@ -180,6 +180,7 @@
 /* linux/mm/swap.c */
 extern void FASTCALL(lru_cache_add(struct page *));
 extern void FASTCALL(lru_cache_add_active(struct page *));
+extern void FASTCALL(lru_demote(struct page *));
 extern void FASTCALL(activate_page(struct page *));
 extern void FASTCALL(mark_page_accessed(struct page *));
 extern void lru_add_drain(void);
diff -urN linux-2.6/kernel/sysctl.c linux-2.6-drop-behind/kernel/sysctl.c
--- linux-2.6/kernel/sysctl.c	2007-07-21 18:26:01.000000000 -0400
+++ linux-2.6-drop-behind/kernel/sysctl.c	2007-07-22 16:20:27.000000000 -0400
@@ -163,6 +163,7 @@
 
 extern int prove_locking;
 extern int lock_stat;
+extern int sysctl_dropbehind;
 
 /* The default sysctl tables: */
 
@@ -1048,6 +1049,14 @@
 		.extra1		= &zero,
 	},
 #endif
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "drop_behind",
+		.data		= &sysctl_dropbehind,
+		.maxlen		= sizeof(sysctl_dropbehind),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 /*
  * NOTE: do not add new entries to this table unless you have read
  * Documentation/sysctl/ctl_unnumbered.txt
diff -urN linux-2.6/mm/readahead.c linux-2.6-drop-behind/mm/readahead.c
--- linux-2.6/mm/readahead.c	2007-07-21 18:26:01.000000000 -0400
+++ linux-2.6-drop-behind/mm/readahead.c	2007-07-22 16:41:47.000000000 -0400
@@ -15,6 +15,7 @@
 #include <linux/backing-dev.h>
 #include <linux/task_io_accounting_ops.h>
 #include <linux/pagevec.h>
+#include <linux/swap.h>
 
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 {
@@ -429,6 +430,8 @@
 }
 EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
 
+int sysctl_dropbehind = 1;
+
 /**
  * page_cache_async_readahead - file readahead for marked pages
  * @mapping: address_space which holds the pagecache and I/O vectors
@@ -449,6 +452,11 @@
 			   struct page *page, pgoff_t offset,
 			   unsigned long req_size)
 {
+	pgoff_t demote_idx = offset - min_t(pgoff_t, offset, ra->size);
+	struct page *pages[16];
+	unsigned nr_pages;
+	unsigned i;
+
 	/* no read-ahead */
 	if (!ra->ra_pages)
 		return;
@@ -467,6 +475,36 @@
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return;
 
+	/*
+	 * Read-ahead use once: when the ra window is maximal this is a good
+	 * hint that there is sequential IO, which implies that the pages that
+	 * have been used thus far can be reclaimed
+	 */
+	if (sysctl_dropbehind && ra->size == ra->ra_pages) {
+		nr_pages = find_get_pages(mapping,
+				demote_idx, ARRAY_SIZE(pages), pages);
+
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+			demote_idx = page_index(page);
+
+			/*
+			 * The page is active. This means there are other
+			 * users. We should not take away somebody else's
+			 * pages, so do not drop behind beyond this point.
+			 */
+			if (demote_idx < offset && !PageActive(page)) {
+				lru_demote(page);
+			} else {
+				demote_idx = offset;
+				break;
+			}
+		}
+		demote_idx++;
+
+		release_pages(pages, nr_pages, 0);
+	} while (demote_idx < offset);
+
 	/* do read-ahead */
 	ondemand_readahead(mapping, ra, filp, true, offset, req_size);
 }
diff -urN linux-2.6/mm/swap.c linux-2.6-drop-behind/mm/swap.c
--- linux-2.6/mm/swap.c	2007-07-21 18:26:01.000000000 -0400
+++ linux-2.6-drop-behind/mm/swap.c	2007-07-22 16:26:35.000000000 -0400
@@ -30,6 +30,7 @@
 #include <linux/cpu.h>
 #include <linux/notifier.h>
 #include <linux/init.h>
+#include <linux/rmap.h>
 
 /* How many pages do we try to swap or page in/out together? */
 int page_cluster;
@@ -176,6 +177,7 @@
  */
 static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
 static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_demote_pvecs) = { 0, };
 
 void fastcall lru_cache_add(struct page *page)
 {
@@ -197,6 +199,37 @@
 	put_cpu_var(lru_add_active_pvecs);
 }
 
+static void __pagevec_lru_demote(struct pagevec *pvec)
+{
+	int i;
+	struct zone *zone = NULL;
+
+	for (i = 0; i < pagevec_count(pvec); i++) {
+		struct page *page = pvec->pages[i];
+		struct zone *pagezone = page_zone(page);
+
+		if (pagezone != zone) {
+			if (zone)
+				spin_unlock_irq(&zone->lru_lock);
+			zone = pagezone;
+			spin_lock_irq(&zone->lru_lock);
+		}
+		if (PageLRU(page)) {
+			page_referenced(page, 0);
+			if (PageActive(page)) {
+				ClearPageActive(page);
+				__dec_zone_state(zone, NR_ACTIVE);
+				__inc_zone_state(zone, NR_INACTIVE);
+			}
+			list_move_tail(&page->lru, &zone->inactive_list);
+		}
+	}
+	if (zone)
+		spin_unlock_irq(&zone->lru_lock);
+	release_pages(pvec->pages, pvec->nr, pvec->cold);
+	pagevec_reinit(pvec);
+}
+
 static void __lru_add_drain(int cpu)
 {
 	struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
@@ -207,6 +240,9 @@
 	pvec = &per_cpu(lru_add_active_pvecs, cpu);
 	if (pagevec_count(pvec))
 		__pagevec_lru_add_active(pvec);
+	pvec = &per_cpu(lru_demote_pvecs, cpu);
+	if (pagevec_count(pvec))
+		__pagevec_lru_demote(pvec);
 }
 
 void lru_add_drain(void)
@@ -403,6 +439,21 @@
 }
 
 /*
+ * Function used to forcefully demote a page to the tail of the inactive
+ * list.
+ */
+void fastcall lru_demote(struct page *page)
+{
+	if (likely(get_page_unless_zero(page))) {
+		struct pagevec *pvec = &get_cpu_var(lru_demote_pvecs);
+
+		if (!pagevec_add(pvec, page))
+			__pagevec_lru_demote(pvec);
+		put_cpu_var(lru_demote_pvecs);
+	}
+}
+
+/*
  * Try to drop buffers from the pages in a pagevec
  */
 void pagevec_strip(struct pagevec *pvec)


[-- Attachment #3: large_app_load_simul.cpp --]
[-- Type: text/x-c++src, Size: 459 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	if (argc != 2)
		return EXIT_FAILURE;

	FILE *file = fopen(argv[1], "rb");

	fseek(file, 0, SEEK_END);
	long size = ftell(file);

	char *buf = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, fileno(file), 0);

	for (long i = 0; i < size; i++)
		char temp = buf[i];

	int result = munmap(buf, size);

	fclose(file);

	return EXIT_SUCCESS;
}

[-- Attachment #4: Makefile --]
[-- Type: text/x-makefile, Size: 69 bytes --]

all:
	gcc large_app_load_simul.cpp -o large_app_load_simul -lstdc++


[-- Attachment #5: useonce-test.sh --]
[-- Type: application/x-shellscript, Size: 1737 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-23  9:00         ` Nick Piggin
       [not found]           ` <20070723142457.GA10130@mail.ustc.edu.cn>
@ 2007-07-25  4:35           ` Eric St-Laurent
  2007-07-25  5:19             ` Nick Piggin
  1 sibling, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  4:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Mon, 2007-23-07 at 19:00 +1000, Nick Piggin wrote:

> I don't like this kind of conditional information going from something
> like readahead into page reclaim. Unless it is for readahead _specific_
> data such as "I got these all wrong, so you can reclaim them" (which
> this isn't).
> 
> But I don't like it as a use-once thing. The VM should be able to get
> that right.
> 

Question: How work the use-once code in the current kernel? Is there
any? I doesn't quite work for me...

See my previous email today, I've done a small test case to demonstrate 
the problem and the effectiveness of Peter's patch.  The only piece
missing is the copy case (read once + write once).

Regardless of how it's implemented, I think a similar mechanism must be
added. This is a long standing issue.

In the end, I think it's a pagecache resources allocation problem. the
VM lacks fair-share limits between processes. The kernel doesn't have
enough information to make the right decisions.

You can refine or use more advanced page reclaim, but some fair-share
splitting (like the CPU scheduler) between the processes must be
present.  Of course some process should have large or unlimited VM
limits, like databases.

Maybe the "containers" patchset and memory controller can help.  With
some specific configuration and/or a userspace daemon to adjust the
limits on the fly.

Independently, the basic large file streaming read (or copy) once cases
should not trash the pagecache. Can we agree on that?

I say, let's add some code to fix the problem.  If we hear about any
regression in some workloads, we can add a tunable to limit or disable
its effects, _if_ a better compromised solution cannot be found.

Surely it's possible to have a acceptable solution.

Best regards,

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  4:35           ` Eric St-Laurent
@ 2007-07-25  5:19             ` Nick Piggin
  2007-07-25  6:18               ` Eric St-Laurent
  2007-07-25 15:28               ` Rik van Riel
  0 siblings, 2 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-25  5:19 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

Eric St-Laurent wrote:
> On Mon, 2007-23-07 at 19:00 +1000, Nick Piggin wrote:
> 
> 
>>I don't like this kind of conditional information going from something
>>like readahead into page reclaim. Unless it is for readahead _specific_
>>data such as "I got these all wrong, so you can reclaim them" (which
>>this isn't).
>>
>>But I don't like it as a use-once thing. The VM should be able to get
>>that right.
>>
> 
> 
> 
> Question: How work the use-once code in the current kernel? Is there
> any? I doesn't quite work for me...

What *I* think is supposed to happen is that newly read in pages get
put on the inactive list, and unless they get accessed againbefore
being reclaimed, they are allowed to fall off the end of the list
without disturbing active data too much.

I think there is a missing piece here, that we used to ease the reclaim
pressure off the active list when the inactive list grows relatively
much larger than it (which could indicate a lot of use-once pages in
the system).

Andrew got rid of that logic for some reason which I don't know, but I
can't see that use-once would be terribly effective today (so your
results don't surprise me too much).

I think I've been banned from touching vmscan.c, but if you're keen to
try a patch, I might be convinced to come out of retirement :)


> See my previous email today, I've done a small test case to demonstrate 
> the problem and the effectiveness of Peter's patch.  The only piece
> missing is the copy case (read once + write once).
> 
> Regardless of how it's implemented, I think a similar mechanism must be
> added. This is a long standing issue.
> 
> In the end, I think it's a pagecache resources allocation problem. the
> VM lacks fair-share limits between processes. The kernel doesn't have
> enough information to make the right decisions.
> 
> You can refine or use more advanced page reclaim, but some fair-share
> splitting (like the CPU scheduler) between the processes must be
> present.  Of course some process should have large or unlimited VM
> limits, like databases.
> 
> Maybe the "containers" patchset and memory controller can help.  With
> some specific configuration and/or a userspace daemon to adjust the
> limits on the fly.
> 
> Independently, the basic large file streaming read (or copy) once cases
> should not trash the pagecache. Can we agree on that?

One man's trash is another's treasure: some people will want the
files to remain in cache because they'll use them again (copy it
somewhere else, or start editing it after being copied or whatever).

But yeah, we can probably do better at the sequential read/write
case.


> I say, let's add some code to fix the problem.  If we hear about any
> regression in some workloads, we can add a tunable to limit or disable
> its effects, _if_ a better compromised solution cannot be found.

Sure, but let's figure out the workloads and look at all the
alternatives first.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  5:19             ` Nick Piggin
@ 2007-07-25  6:18               ` Eric St-Laurent
  2007-07-25  7:09                 ` Nick Piggin
  2007-07-25 15:28               ` Rik van Riel
  1 sibling, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  6:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:

> What *I* think is supposed to happen is that newly read in pages get
> put on the inactive list, and unless they get accessed againbefore
> being reclaimed, they are allowed to fall off the end of the list
> without disturbing active data too much.
> 
> I think there is a missing piece here, that we used to ease the reclaim
> pressure off the active list when the inactive list grows relatively
> much larger than it (which could indicate a lot of use-once pages in
> the system).

Maybe a new list should be added to put newly read pages in it. If they
are not used or used once after a certain period, they can be moved to
the inactive list (or whatever).

Newly read pages...

- ... not used after this period are excessive readahead, we discard
immediately.
- ... used only once after this period, we discard soon.
- ... used many/frequently are moved to active list.

Surely the scan rate (do I make sense?) should be different for this
newly-read list and the inactive list. 

I also remember your split mapped/unmapped active list patches from a
while ago.

Can someone point me to a up-to-date documentation about the Linux VM?
The books and documents I've seen are outdated.

> I think I've been banned from touching vmscan.c, but if you're keen to
> try a patch, I might be convinced to come out of retirement :)

I'm more than willing!  Now that CFS is merged, redirect your energies
from nicksched to nick-vm ;)

Patches against any tree (stable, linus, mm, rt) are good. But I prefer
the last stable release because it narrows down the possible problems
that a moving target like the development tree may have.

I test this on my main system, so patches with basic testing and
reasonable stability are preferred. I just want to avoid data corruption
bugs. FYI, I used to run the -rt tree most of the time.

> One man's trash is another's treasure: some people will want the
> files to remain in cache because they'll use them again (copy it
> somewhere else, or start editing it after being copied or whatever).
> 
> But yeah, we can probably do better at the sequential read/write
> case.

Sure, but there are many hints to detect this: *large* (> most of the
RAM), *streaming*, *used once*

But if a program mmap() a 3/4 of the RAM area and "play" in it, it's a
good sign that the streaming code shouldn't be active.

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  6:18               ` Eric St-Laurent
@ 2007-07-25  7:09                 ` Nick Piggin
  2007-07-25  7:48                   ` Eric St-Laurent
                                     ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Nick Piggin @ 2007-07-25  7:09 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

[-- Attachment #1: Type: text/plain, Size: 2821 bytes --]

Eric St-Laurent wrote:
> On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:
> 
> 
>>What *I* think is supposed to happen is that newly read in pages get
>>put on the inactive list, and unless they get accessed againbefore
>>being reclaimed, they are allowed to fall off the end of the list
>>without disturbing active data too much.
>>
>>I think there is a missing piece here, that we used to ease the reclaim
>>pressure off the active list when the inactive list grows relatively
>>much larger than it (which could indicate a lot of use-once pages in
>>the system).
> 
> 
> Maybe a new list should be added to put newly read pages in it. If they
> are not used or used once after a certain period, they can be moved to
> the inactive list (or whatever).
> 
> Newly read pages...
> 
> - ... not used after this period are excessive readahead, we discard
> immediately.
> - ... used only once after this period, we discard soon.
> - ... used many/frequently are moved to active list.
> 
> Surely the scan rate (do I make sense?) should be different for this
> newly-read list and the inactive list. 

A new list could be a possibility. One problem with adding lists is just
trying to work out how to balance scanning rates between them, another
problem is CPU overhead of moving pages from one to another... but don't
let me stop you if you want to jump in and try something :)


> I also remember your split mapped/unmapped active list patches from a
> while ago.
> 
> Can someone point me to a up-to-date documentation about the Linux VM?
> The books and documents I've seen are outdated.

If you just want to play with page reclaim algorithms, try reading over
mm/vmscan.c. If you don't know much about the Linux VM internals before,
don't worry too much about the fine details and start by getting an idea
of how pages move between the active and inactive lists.

I have Mel Gorman's, but I don't recall whether it covers the fine details
of page reclaim. But anyway it is still a good book.


>>I think I've been banned from touching vmscan.c, but if you're keen to
>>try a patch, I might be convinced to come out of retirement :)
> 
> 
> I'm more than willing!  Now that CFS is merged, redirect your energies
> from nicksched to nick-vm ;)
> 
> Patches against any tree (stable, linus, mm, rt) are good. But I prefer
> the last stable release because it narrows down the possible problems
> that a moving target like the development tree may have.
> 
> I test this on my main system, so patches with basic testing and
> reasonable stability are preferred. I just want to avoid data corruption
> bugs. FYI, I used to run the -rt tree most of the time.

OK here is one which just changes the rate that the active and inactive
lists get scanned. Data corruption bugs should be minimal ;)

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: inactive-useonce.patch --]
[-- Type: text/plain, Size: 2375 bytes --]

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -1011,6 +1011,8 @@ static unsigned long shrink_zone(int pri
 {
 	unsigned long nr_active;
 	unsigned long nr_inactive;
+	unsigned long scan_active;
+	unsigned long scan_inactive;
 	unsigned long nr_to_scan;
 	unsigned long nr_reclaimed = 0;
 
@@ -1020,34 +1022,47 @@ static unsigned long shrink_zone(int pri
 	 * Add one to `nr_to_scan' just to make sure that the kernel will
 	 * slowly sift through the active list.
 	 */
-	zone->nr_scan_active +=
-		(zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
-	nr_active = zone->nr_scan_active;
-	if (nr_active >= sc->swap_cluster_max)
+	nr_active = zone_page_state(zone, NR_INACTIVE);
+	nr_inactive = zone_page_state(zone, NR_ACTIVE);
+
+	scan_inactive = (nr_inactive >> priority) + 1;
+
+	if (nr_active >= 4*(nr_inactive*2 + 1))
+		scan_active = 4*scan_inactive;
+	else {
+		unsigned long long tmp;
+
+		tmp = (unsigned long long)scan_inactive * nr_active;
+		do_div(tmp, nr_inactive*2 + 1);
+		scan_active = (unsigned long)tmp + 1;
+	}
+
+	zone->nr_scan_active += scan_active;
+	scan_active = zone->nr_scan_active;
+	if (scan_active >= sc->swap_cluster_max)
 		zone->nr_scan_active = 0;
 	else
-		nr_active = 0;
+		scan_active = 0;
 
-	zone->nr_scan_inactive +=
-		(zone_page_state(zone, NR_INACTIVE) >> priority) + 1;
-	nr_inactive = zone->nr_scan_inactive;
-	if (nr_inactive >= sc->swap_cluster_max)
+	zone->nr_scan_inactive += scan_inactive;
+	scan_inactive = zone->nr_scan_inactive;
+	if (scan_inactive >= sc->swap_cluster_max)
 		zone->nr_scan_inactive = 0;
 	else
-		nr_inactive = 0;
+		scan_inactive = 0;
 
-	while (nr_active || nr_inactive) {
-		if (nr_active) {
-			nr_to_scan = min(nr_active,
+	while (scan_active || scan_inactive) {
+		if (scan_active) {
+			nr_to_scan = min(scan_active,
 					(unsigned long)sc->swap_cluster_max);
-			nr_active -= nr_to_scan;
+			scan_active -= nr_to_scan;
 			shrink_active_list(nr_to_scan, zone, sc, priority);
 		}
 
-		if (nr_inactive) {
-			nr_to_scan = min(nr_inactive,
+		if (scan_inactive) {
+			nr_to_scan = min(scan_inactive,
 					(unsigned long)sc->swap_cluster_max);
-			nr_inactive -= nr_to_scan;
+			scan_inactive -= nr_to_scan;
 			nr_reclaimed += shrink_inactive_list(nr_to_scan, zone,
 								sc);
 		}

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09                 ` Nick Piggin
@ 2007-07-25  7:48                   ` Eric St-Laurent
  2007-07-25 15:36                     ` Rik van Riel
  2007-07-25 15:33                   ` Rik van Riel
  2007-07-29  7:44                   ` Eric St-Laurent
  2 siblings, 1 reply; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-25  7:48 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:

> 
> A new list could be a possibility. One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another... 

Disk sizes seem to increase more rapidly that the ability to read them
quickly.  Fortunately the processing power increase greatly too.

It may be a good idea to spend more CPU cycles to better decide how the
VM should juggle with this data. We've got to keep those multi-cores cpu
busy.

> but don't
> let me stop you if you want to jump in and try something :)
> 

Well I might try a few things along the way.

But I prefer the thorough approach versus tinkering... 

- Read all research, check competition
- Build test virtual machines, with benchmarks and typical workloads
- Add (or use) some instrumentation to the pagecache
- Code a simulator
- Try all algorithms, tune them

This is way overkill for a part-time hobby.

If we don't see much work on this area it's surely because it's really
not a problem anymore for most workloads. Database have their own cache
management and disk scheduling, file servers just add more ram or
processors, etc.

> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)
> 

Will test.

- Eric

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  5:19             ` Nick Piggin
  2007-07-25  6:18               ` Eric St-Laurent
@ 2007-07-25 15:28               ` Rik van Riel
  1 sibling, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:28 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric St-Laurent, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Nick Piggin wrote:

> What *I* think is supposed to happen is that newly read in pages get
> put on the inactive list, and unless they get accessed againbefore
> being reclaimed, they are allowed to fall off the end of the list
> without disturbing active data too much.
> 
> I think there is a missing piece here, that we used to ease the reclaim
> pressure off the active list when the inactive list grows relatively
> much larger than it (which could indicate a lot of use-once pages in
> the system).
> 
> Andrew got rid of that logic for some reason which I don't know, but I
> can't see that use-once would be terribly effective today (so your
> results don't surprise me too much).

I suspect that reason is that use-once works great for
page cache pages, but anonymous pages share the same
LRUs and need a more balanced aging approach.

The result: an LRU list that works ok for both types
of pages most of the time, but falls over in some
workloads.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09                 ` Nick Piggin
  2007-07-25  7:48                   ` Eric St-Laurent
@ 2007-07-25 15:33                   ` Rik van Riel
  2007-07-29  7:44                   ` Eric St-Laurent
  2 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:33 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric St-Laurent, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Nick Piggin wrote:
> Eric St-Laurent wrote:
>> On Wed, 2007-25-07 at 15:19 +1000, Nick Piggin wrote:
>>
>>
>>> What *I* think is supposed to happen is that newly read in pages get
>>> put on the inactive list, and unless they get accessed againbefore
>>> being reclaimed, they are allowed to fall off the end of the list
>>> without disturbing active data too much.
>>>
>>> I think there is a missing piece here, that we used to ease the reclaim
>>> pressure off the active list when the inactive list grows relatively
>>> much larger than it (which could indicate a lot of use-once pages in
>>> the system).
>>
>>
>> Maybe a new list should be added to put newly read pages in it. If they
>> are not used or used once after a certain period, they can be moved to
>> the inactive list (or whatever).
>>
>> Newly read pages...
>>
>> - ... not used after this period are excessive readahead, we discard
>> immediately.
>> - ... used only once after this period, we discard soon.
>> - ... used many/frequently are moved to active list.
>>
>> Surely the scan rate (do I make sense?) should be different for this
>> newly-read list and the inactive list. 
> 
> A new list could be a possibility. One problem with adding lists is just
> trying to work out how to balance scanning rates between them, another
> problem is CPU overhead of moving pages from one to another... but don't
> let me stop you if you want to jump in and try something :)
> 
> 
>> I also remember your split mapped/unmapped active list patches from a
>> while ago.
>>
>> Can someone point me to a up-to-date documentation about the Linux VM?
>> The books and documents I've seen are outdated.
> 
> If you just want to play with page reclaim algorithms, try reading over
> mm/vmscan.c. If you don't know much about the Linux VM internals before,
> don't worry too much about the fine details and start by getting an idea
> of how pages move between the active and inactive lists.

I've got a nice list of problems with the VM at
http://linux-mm.org/PageoutFailureModes :)

I should post my latest version of the split LRU
lists code.  I'm still working on SEQ replacement
for the anonymous pages...

> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)

I had something like that in mind for the file
pages.  I am a bit nervous about introducing
such a change while file and anonymous pages
share the same LRU, though...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:48                   ` Eric St-Laurent
@ 2007-07-25 15:36                     ` Rik van Riel
  0 siblings, 0 replies; 46+ messages in thread
From: Rik van Riel @ 2007-07-25 15:36 UTC (permalink / raw)
  To: Eric St-Laurent
  Cc: Nick Piggin, Rusty Russell, Fengguang Wu, Dave Jones,
	Peter Zijlstra, linux-kernel, Andrew Morton, Tim Pepper,
	Chris Snook

Eric St-Laurent wrote:
> On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:
> 
>> A new list could be a possibility. One problem with adding lists is just
>> trying to work out how to balance scanning rates between them, another
>> problem is CPU overhead of moving pages from one to another... 
> 
> Disk sizes seem to increase more rapidly that the ability to read them
> quickly.  Fortunately the processing power increase greatly too.
> 
> It may be a good idea to spend more CPU cycles to better decide how the
> VM should juggle with this data. We've got to keep those multi-cores cpu
> busy.

Agreed, up to a level.  You cannot let hundreds of gigabytes
worth of memory sit around with (stale) accessed bits set,
and then have to scan millions of pages all at once when free
memory finally runs low.

The amount of CPU time spent in the pageout code at any point
in time needs to be limited.

> If we don't see much work on this area it's surely because it's really
> not a problem anymore for most workloads. Database have their own cache
> management and disk scheduling, file servers just add more ram or
> processors, etc.

It is a really big problem for some (fairly common) workloads,
though.

One of the main reasons there has been little activity in this
area upstream is that nobody (at least, nobody I talked to)
had real fixes in mind for the problem.

All we had were workarounds for RHEL, not the kind of real
fixes that are interesting for upstream.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 0/3] readahead drop behind and size adjustment
  2007-07-25  7:09                 ` Nick Piggin
  2007-07-25  7:48                   ` Eric St-Laurent
  2007-07-25 15:33                   ` Rik van Riel
@ 2007-07-29  7:44                   ` Eric St-Laurent
  2 siblings, 0 replies; 46+ messages in thread
From: Eric St-Laurent @ 2007-07-29  7:44 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Rusty Russell, Fengguang Wu, Dave Jones, Peter Zijlstra,
	linux-kernel, riel, Andrew Morton, Tim Pepper, Chris Snook

On Wed, 2007-25-07 at 17:09 +1000, Nick Piggin wrote:
> Eric St-Laurent wrote:
> > I test this on my main system, so patches with basic testing and
> > reasonable stability are preferred. I just want to avoid data corruption
> > bugs. FYI, I used to run the -rt tree most of the time.
> 
> OK here is one which just changes the rate that the active and inactive
> lists get scanned. Data corruption bugs should be minimal ;)
> 

Nick,

I have tried your patch with my test case, unfortunately it doesn't
help.

Numbers did vary a little bit more, and it seemed drop_caches was not
working as well as usual (used between the runs).

Also, overall the runs took about .1s more to complete.


Linux 2.6.23-rc1-nick PREEMPT x86_64

Base test:

1st run: 0m9.123s
2nd run: 0m3.565s
3rd run: 0m3.553s
4th run: 0m3.565s

Reading a large file test:

1st run: 0m9.146s
2nd run: 0m3.560s
`/tmp/large_file' -> `/dev/null'
3rd run: 0m19.759s
4th run: 0m3.515s

Copying (using cp) a large file test:

1st run: 0m9.085s
2nd run: 0m3.522s
`/tmp/large_file' -> `/tmp/large_file.copy'
3rd run: 0m9.977s
4th run: 0m3.518s


Anyway, what is the theory behind the patch?


- Eric



^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2007-07-29  7:44 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-21 21:00 [PATCH 0/3] readahead drop behind and size adjustment Peter Zijlstra
2007-07-21 21:00 ` [PATCH 1/3] readahead: drop behind Peter Zijlstra
2007-07-21 20:29   ` Eric St-Laurent
2007-07-21 20:37     ` Peter Zijlstra
2007-07-21 20:59       ` Eric St-Laurent
2007-07-21 21:06         ` Peter Zijlstra
2007-07-25  3:55   ` Eric St-Laurent
2007-07-21 21:00 ` [PATCH 2/3] readahead: fadvise drop behind controls Peter Zijlstra
2007-07-21 21:00 ` [PATCH 3/3] readahead: scale max readahead size depending on memory size Peter Zijlstra
2007-07-22  8:24   ` Jens Axboe
2007-07-22  8:36     ` Peter Zijlstra
2007-07-22  8:50       ` Jens Axboe
2007-07-22  9:17         ` Peter Zijlstra
2007-07-22 16:44           ` Jens Axboe
2007-07-23 10:04             ` Jörn Engel
2007-07-23 10:11               ` Jens Axboe
2007-07-23 22:44               ` Rusty Russell
2007-07-22 23:52         ` Rik van Riel
2007-07-23  5:22           ` Jens Axboe
     [not found]   ` <20070722084526.GB6317@mail.ustc.edu.cn>
2007-07-22  8:45     ` Fengguang Wu
2007-07-22  8:59       ` Peter Zijlstra
     [not found]         ` <20070722095313.GA8136@mail.ustc.edu.cn>
2007-07-22  9:53           ` Fengguang Wu
     [not found] ` <20070722023923.GA6438@mail.ustc.edu.cn>
2007-07-22  2:39   ` [PATCH 0/3] readahead drop behind and size adjustment Fengguang Wu
2007-07-22  2:44   ` Dave Jones
     [not found]     ` <20070722081010.GA6317@mail.ustc.edu.cn>
2007-07-22  8:10       ` Fengguang Wu
2007-07-22  8:24         ` Peter Zijlstra
     [not found]           ` <20070722082923.GA7790@mail.ustc.edu.cn>
2007-07-22  8:29             ` Fengguang Wu
2007-07-22  8:33       ` Rusty Russell
2007-07-22  8:45         ` Peter Zijlstra
2007-07-23  9:00         ` Nick Piggin
     [not found]           ` <20070723142457.GA10130@mail.ustc.edu.cn>
2007-07-23 14:24             ` Fengguang Wu
2007-07-23 19:40               ` Andrew Morton
     [not found]                 ` <20070724004728.GA8026@mail.ustc.edu.cn>
2007-07-24  0:47                   ` Fengguang Wu
2007-07-24  1:17                     ` Andrew Morton
2007-07-24  8:50                       ` Andreas Dilger
2007-07-24  4:30                     ` Nick Piggin
2007-07-25  4:35           ` Eric St-Laurent
2007-07-25  5:19             ` Nick Piggin
2007-07-25  6:18               ` Eric St-Laurent
2007-07-25  7:09                 ` Nick Piggin
2007-07-25  7:48                   ` Eric St-Laurent
2007-07-25 15:36                     ` Rik van Riel
2007-07-25 15:33                   ` Rik van Riel
2007-07-29  7:44                   ` Eric St-Laurent
2007-07-25 15:28               ` Rik van Riel
  -- strict thread matches above, loose matches on Subject: below --
2007-07-22 11:11 Al Boldi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox