public inbox for linux-kernel@vger.kernel.org
* [PATCH] 2.4.20-rmap15a
@ 2002-12-01 20:35 Rik van Riel
  2002-12-03 13:55 ` Miquel van Smoorenburg
  0 siblings, 1 reply; 15+ messages in thread
From: Rik van Riel @ 2002-12-01 20:35 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm

This is a merge of rmap15a with marcelo's 2.4 bitkeeper tree,
which is identical to 2.4.20-rc4 (he didn't push the makefile
update).  The only thing left out of the merge for now is
Andrew Morton's read_latency patch, both because I'm not sure
how needed it is with the elevator updates and because this
part of the merge was too tricky to do at merge time; I'll port
over Andrew Morton's read_latency patch later...


The first maintenance release of the 15th version of the reverse
mapping based VM is now available.
This is an attempt at making a more robust and flexible VM
subsystem, while cleaning up a lot of code at the same time.
The patch is available from:

           http://surriel.com/patches/2.4/2.4.20-rmap15a
and        http://linuxvm.bkbits.net/


My big TODO items for a next release are:
  - backport speedups from 2.5
  - pte-highmem

rmap 15a:
  - more aggressive freeing for higher order allocations  (me)
  - export __find_pagecache_page, find_get_page define    (me, Christoph, Arjan)
  - make memory statistics SMP safe again                 (me)
  - make page aging slow down again when needed           (Andrew Morton)
  - first stab at fine-tuning arjan's O(1) VM             (me)
  - split active list in cache / working set              (me)
  - fix SMP locking in arjan's O(1) VM                    (me)
rmap 15:
  - small code cleanups and spelling fixes for O(1) VM    (me)
  - O(1) page launder, O(1) page aging                    (Arjan van de Ven)
  - resync code with -ac (12 small patches)               (me)
rmap 14c:
  - fold page_over_rsslimit() into page_referenced()      (me)
  - 2.5 backport: get pte_chains from the slab cache      (William Lee Irwin)
  - remove dead code from page_launder_zone()             (me)
  - make OOM detection a bit more aggressive              (me)
rmap 14b:
  - don't unmap pages not in pagecache (ext3 & reiser)    (Andrew Morton, me)
  - clean up mark_page_accessed a bit                     (me)
  - Alpha NUMA fix for Ingo's per-cpu pages               (Flávio Leitner, me)
  - remove explicit low latency schedule zap_page_range   (Robert Love)
  - fix OOM stuff for good, hopefully                     (me)
rmap 14a:
  - Ingo Molnar's per-cpu pages (SMP speedup)             (Christoph Hellwig)
  - fix SMP bug in page_launder_zone (rmap14 only)        (Arjan van de Ven)
  - semicolon day, fix typo in rmap.c w/ DEBUG_RMAP       (Craig Kulesa)
  - remove unneeded pte_chain_unlock/lock pair vmscan.c   (Craig Kulesa)
  - low latency zap_page_range also without preempt       (Arjan van de Ven)
  - do some throughput tuning for kswapd/page_launder     (me)
  - don't allocate swap space for pages we're not writing (me)
rmap 14:
  - get rid of stalls during swapping, hopefully          (me)
  - low latency zap_page_range                            (Robert Love)
rmap 13c:
  - add wmb() to wakeup_memwaiters                        (Arjan van de Ven)
  - remap_pmd_range now calls pte_alloc with full address (Paul Mackerras)
  - #ifdef out pte_chain_lock/unlock on UP machines       (Andrew Morton)
  - un-BUG() truncate_complete_page, the race is expected (Andrew Morton, me)
  - remove NUMA changes from rmap13a                      (Christoph Hellwig)
rmap 13b:
  - prevent PF_MEMALLOC recursion for higher order allocs (Arjan van de Ven, me)
  - fix small SMP race, PG_lru                            (Hugh Dickins)
rmap 13a:
  - NUMA changes for page_address                         (Samuel Ortiz)
  - replace vm.freepages with simpler kswapd_minfree      (Christoph Hellwig)
rmap 13:
  - rename touch_page to mark_page_accessed and uninline  (Christoph Hellwig)
  - NUMA bugfix for __alloc_pages                         (William Irwin)
  - kill __find_page                                      (Christoph Hellwig)
  - make pte_chain_freelist per zone                      (William Irwin)
  - protect pte_chains by per-page lock bit               (William Irwin)
  - minor code cleanups                                   (me)
rmap 12i:
  - slab cleanup                                          (Christoph Hellwig)
  - remove references to compiler.h from mm/*             (me)
  - move rmap to marcelo's bk tree                        (me)
  - minor cleanups                                        (me)
rmap 12h:
  - hopefully fix OOM detection algorithm                 (me)
  - drop pte quicklist in anticipation of pte-highmem     (me)
  - replace andrea's highmem emulation by ingo's one      (me)
  - improve rss limit checking                            (Nick Piggin)
rmap 12g:
  - port to armv architecture                             (David Woodhouse)
  - NUMA fix to zone_table initialisation                 (Samuel Ortiz)
  - remove init_page_count                                (David Miller)
rmap 12f:
  - for_each_pgdat macro                                  (William Lee Irwin)
  - put back EXPORT(__find_get_page) for modular rd       (me)
  - make bdflush and kswapd actually start queued disk IO (me)
rmap 12e:
  - RSS limit fix, the limit can be 0 for some reason     (me)
  - clean up for_each_zone define to not need pgdata_t    (William Lee Irwin)
  - fix i810_dma bug introduced with page->wait removal   (William Lee Irwin)
rmap 12d:
  - fix compiler warning in rmap.c                        (Roger Larsson)
  - read latency improvement   (read-latency2)            (Andrew Morton)
rmap 12c:
  - fix small balancing bug in page_launder_zone          (Nick Piggin)
  - wakeup_kswapd / wakeup_memwaiters code fix            (Arjan van de Ven)
  - improve RSS limit enforcement                         (me)
rmap 12b:
  - highmem emulation (for debugging purposes)            (Andrea Arcangeli)
  - ulimit RSS enforcement when memory gets tight         (me)
  - sparc64 page->virtual quickfix                        (Greg Procunier)
rmap 12a:
  - fix the compile warning in buffer.c                   (me)
  - fix divide-by-zero on highmem initialisation  DOH!    (me)
  - remove the pgd quicklist (suspicious ...)             (DaveM, me)
rmap 12:
  - keep some extra free memory on large machines         (Arjan van de Ven, me)
  - higher-order allocation bugfix                        (Adrian Drzewiecki)
  - nr_free_buffer_pages() returns inactive + free mem    (me)
  - pages from unused objects directly to inactive_clean  (me)
  - use fast pte quicklists on non-pae machines           (Andrea Arcangeli)
  - remove sleep_on from wakeup_kswapd                    (Arjan van de Ven)
  - page waitqueue cleanup                                (Christoph Hellwig)
rmap 11c:
  - oom_kill race locking fix                             (Andres Salomon)
  - elevator improvement                                  (Andrew Morton)
  - dirty buffer writeout speedup (hopefully ;))          (me)
  - small documentation updates                           (me)
  - page_launder() never does synchronous IO, kswapd
    and the processes calling it sleep on higher level    (me)
  - deadlock fix in touch_page()                          (me)
rmap 11b:
  - added low latency reschedule points in vmscan.c       (me)
  - make i810_dma.c include mm_inline.h too               (William Lee Irwin)
  - wake up kswapd sleeper tasks on OOM kill so the
    killed task can continue on its way out               (me)
  - tune page allocation sleep point a little             (me)
rmap 11a:
  - don't let refill_inactive() progress count for OOM    (me)
  - after an OOM kill, wait 5 seconds for the next kill   (me)
  - agpgart_be fix for hashed waitqueues                  (William Lee Irwin)
rmap 11:
  - fix stupid logic inversion bug in wakeup_kswapd()     (Andrew Morton)
  - fix it again in the morning                           (me)
  - add #ifdef BROKEN_PPC_PTE_ALLOC_ONE to rmap.h, it
    seems PPC calls pte_alloc() before mem_map[] init     (me)
  - disable the debugging code in rmap.c ... the code
    is working and people are running benchmarks          (me)
  - let the slab cache shrink functions return a value
    to help prevent early OOM killing                     (Ed Tomlinson)
  - also, don't call the OOM code if we have enough
    free pages                                            (me)
  - move the call to lru_cache_del into __free_pages_ok   (Ben LaHaise)
  - replace the per-page waitqueue with a hashed
    waitqueue, reduces size of struct page from 64
    bytes to 52 bytes (48 bytes on non-highmem machines)  (William Lee Irwin)
rmap 10:
  - fix the livelock for real (yeah right), turned out
    to be a stupid bug in page_launder_zone()             (me)
  - to make sure the VM subsystem doesn't monopolise
    the CPU, let kswapd and some apps sleep a bit under
    heavy stress situations                               (me)
  - let __GFP_HIGH allocations dig a little bit deeper
    into the free page pool, the SCSI layer seems fragile (me)
rmap 9:
  - improve comments all over the place                   (Michael Cohen)
  - don't panic if page_remove_rmap() cannot find the
    rmap in question, it's possible that the memory was
    PG_reserved and belonging to a driver, but the driver
    exited and cleared the PG_reserved bit                (me)
  - fix the VM livelock by replacing > by >= in a few
    critical places in the pageout code                   (me)
  - treat the reclaiming of an inactive_clean page like
    allocating a new page, calling try_to_free_pages()
    and/or fixup_freespace() if required                  (me)
  - when low on memory, don't make things worse by
    doing swapin_readahead                                (me)
rmap 8:
  - add ANY_ZONE to the balancing functions to improve
    kswapd's balancing a bit                              (me)
  - regularize some of the maximum loop bounds in
    vmscan.c for cosmetic purposes                        (William Lee Irwin)
  - move page_address() to architecture-independent
    code, now the removal of page->virtual is portable    (William Lee Irwin)
  - speed up free_area_init_core() by doing a single
    pass over the pages and not using atomic ops          (William Lee Irwin)
  - documented the buddy allocator in page_alloc.c        (William Lee Irwin)
rmap 7:
  - clean up and document vmscan.c                        (me)
  - reduce size of page struct, part one                  (William Lee Irwin)
  - add rmap.h for other archs (untested, not for ARM)    (me)
rmap 6:
  - make the active and inactive_dirty list per zone,
    this is finally possible because we can free pages
    based on their physical address                       (William Lee Irwin)
  - cleaned up William's code a bit                       (me)
  - turn some defines into inlines and move those to
    mm_inline.h (the includes are a mess ...)             (me)
  - improve the VM balancing a bit                        (me)
  - add back inactive_target to /proc/meminfo             (me)
rmap 5:
  - fixed recursive buglet, introduced by directly
    editing the patch for making rmap 4 ;)))              (me)
rmap 4:
  - look at the referenced bits in page tables            (me)
rmap 3:
  - forgot one FASTCALL definition                        (me)
rmap 2:
  - teach try_to_unmap_one() about mremap()               (me)
  - don't assign swap space to pages with buffers         (me)
  - make the rmap.c functions FASTCALL / inline           (me)
rmap 1:
  - fix the swap leak in rmap 0                           (Dave McCracken)
rmap 0:
  - port of reverse mapping VM to 2.4.16                  (me)

Rik
-- 
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/		http://guru.conectiva.com/
Current spamtrap:  october@surriel.com


* Re: [PATCH] 2.4.20-rmap15a
@ 2002-12-01 20:56 Marc-Christian Petersen
  2002-12-01 21:25 ` Rik van Riel
  0 siblings, 1 reply; 15+ messages in thread
From: Marc-Christian Petersen @ 2002-12-01 20:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rik van Riel

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]

Hi Rik, Hi all,

> This is a merge of rmap15a with marcelo's 2.4 bitkeeper tree,
> which is identical to 2.4.20-rc4 (he didn't push the makefile
> update).  The only thing left out of the merge for now is
> Andrew Morton's read_latency patch, both because I'm not sure
> how needed it is with the elevator updates and because this
> part of the merge was too tricky to do at merge time; I'll port
> over Andrew Morton's read_latency patch later...
Well, it is needed. It makes a difference for the I/O pauses noticed in
2.4.19 and 2.4.20. read-latency2 won't make them go away entirely, but those
stops/pauses are noticeably shorter than before.

So, here is my patch proposal, on top of 2.4.20-rmap15a.

ciao, Marc

[-- Attachment #2: read-latency2-2.4.20-rmap15a.patch --]
[-- Type: text/x-diff, Size: 4406 bytes --]

--- linux-akpm/drivers/block/elevator.c~read-latency2	Sun Nov 10 19:53:53 2002
+++ linux-akpm-akpm/drivers/block/elevator.c	Sun Nov 10 19:59:21 2002
@@ -80,25 +80,38 @@ int elevator_linus_merge(request_queue_t
 			 struct buffer_head *bh, int rw,
 			 int max_sectors)
 {
-	struct list_head *entry = &q->queue_head;
-	unsigned int count = bh->b_size >> 9, ret = ELEVATOR_NO_MERGE;
+	struct list_head *entry;
+	unsigned int count = bh->b_size >> 9;
+	unsigned int ret = ELEVATOR_NO_MERGE;
+	int merge_only = 0;
+	const int max_bomb_segments = q->elevator.max_bomb_segments;
 	struct request *__rq;
+	int passed_a_read = 0;
+
+	entry = &q->queue_head;
 
 	while ((entry = entry->prev) != head) {
 		__rq = blkdev_entry_to_request(entry);
 
-		/*
-		 * we can't insert beyond a zero sequence point
-		 */
-		if (__rq->elevator_sequence <= 0)
-			break;
+		if (__rq->elevator_sequence-- <= 0) {
+			/*
+			 * OK, we've exceeded someone's latency limit.
+			 * But we still continue to look for merges,
+			 * because they're so much better than seeks.
+			 */
+			merge_only = 1;
+		}
 
 		if (__rq->waiting)
 			continue;
 		if (__rq->rq_dev != bh->b_rdev)
 			continue;
-		if (!*req && bh_rq_in_between(bh, __rq, &q->queue_head))
+		if (!*req && !merge_only &&
+				bh_rq_in_between(bh, __rq, &q->queue_head)) {
 			*req = __rq;
+		}
+		if (__rq->cmd != WRITE)
+			passed_a_read = 1;
 		if (__rq->cmd != rw)
 			continue;
 		if (__rq->nr_sectors + count > max_sectors)
@@ -129,6 +142,57 @@ int elevator_linus_merge(request_queue_t
 		}
 	}
 
+	/*
+	 * If we failed to merge a read anywhere in the request
+	 * queue, we really don't want to place it at the end
+	 * of the list, behind lots of writes.  So place it near
+	 * the front.
+	 *
+	 * We don't want to place it in front of _all_ writes: that
+	 * would create lots of seeking, and isn't tunable.
+	 * We try to avoid promoting this read in front of existing
+	 * reads.
+	 *
+	 * max_bomb_segments becomes the maximum number of write
+	 * requests which we allow to remain in place in front of
+	 * a newly introduced read.  We weight things a little bit,
+	 * so large writes are more expensive than small ones, but it's
+	 * requests which count, not sectors.
+	 */
+	if (max_bomb_segments && rw == READ && !passed_a_read &&
+				ret == ELEVATOR_NO_MERGE) {
+		int cur_latency = 0;
+		struct request * const cur_request = *req;
+
+		entry = head->next;
+		while (entry != &q->queue_head) {
+			struct request *__rq;
+
+			if (entry == &q->queue_head)
+				BUG();
+			if (entry == q->queue_head.next &&
+					q->head_active && !q->plugged)
+				BUG();
+			__rq = blkdev_entry_to_request(entry);
+
+			if (__rq == cur_request) {
+				/*
+				 * This is where the old algorithm placed it.
+				 * There's no point pushing it further back,
+				 * so leave it here, in sorted order.
+				 */
+				break;
+			}
+			if (__rq->cmd == WRITE) {
+				cur_latency += 1 + __rq->nr_sectors / 64;
+				if (cur_latency >= max_bomb_segments) {
+					*req = __rq;
+					break;
+				}
+			}
+			entry = entry->next;
+		}
+	}
 	return ret;
 }
 
@@ -186,7 +250,7 @@ int blkelvget_ioctl(elevator_t * elevato
 	output.queue_ID			= elevator->queue_ID;
 	output.read_latency		= elevator->read_latency;
 	output.write_latency		= elevator->write_latency;
-	output.max_bomb_segments	= 0;
+	output.max_bomb_segments	= elevator->max_bomb_segments;
 
 	if (copy_to_user(arg, &output, sizeof(blkelv_ioctl_arg_t)))
 		return -EFAULT;
@@ -205,9 +269,12 @@ int blkelvset_ioctl(elevator_t * elevato
 		return -EINVAL;
 	if (input.write_latency < 0)
 		return -EINVAL;
+	if (input.max_bomb_segments < 0)
+		return -EINVAL;
 
 	elevator->read_latency		= input.read_latency;
 	elevator->write_latency		= input.write_latency;
+	elevator->max_bomb_segments	= input.max_bomb_segments;
 	return 0;
 }
 
--- linux-akpm/drivers/block/ll_rw_blk.c~read-latency2	Sun Nov 10 19:53:53 2002
+++ linux-akpm-akpm/drivers/block/ll_rw_blk.c	Sun Nov 10 19:53:53 2002
@@ -432,9 +432,11 @@ static void blk_init_free_list(request_q
 
 	si_meminfo(&si);
 	megs = si.totalram >> (20 - PAGE_SHIFT);
-	nr_requests = 128;
-	if (megs < 32)
-		nr_requests /= 2;
+	nr_requests = (megs * 2) & ~15;	/* One per half-megabyte */
+	if (nr_requests < 32)
+		nr_requests = 32;
+	if (nr_requests > 1024)
+		nr_requests = 1024;
 	blk_grow_request_list(q, nr_requests);
 
 	init_waitqueue_head(&q->wait_for_requests[0]);


Thread overview: 15+ messages
-- links below jump to the message on this page --
2002-12-01 20:35 [PATCH] 2.4.20-rmap15a Rik van Riel
2002-12-03 13:55 ` Miquel van Smoorenburg
  -- strict thread matches above, loose matches on Subject: below --
2002-12-01 20:56 Marc-Christian Petersen
2002-12-01 21:25 ` Rik van Riel
2002-12-01 21:41   ` Marc-Christian Petersen
2002-12-01 21:56     ` Con Kolivas
2002-12-02  0:18     ` Con Kolivas
2002-12-02  8:15   ` Jens Axboe
2002-12-02  8:51     ` Andrew Morton
2002-12-02  8:56       ` Jens Axboe
2002-12-02 12:38         ` Rik van Riel
2002-12-02 20:45           ` Willy Tarreau
2002-12-02 23:10             ` Rik van Riel
2002-12-03  6:21               ` Willy Tarreau
2002-12-02 21:46           ` Bill Davidsen
