* ll_rw_block/submit_bh and request limits
@ 2001-02-22  9:41 Marcelo Tosatti
  2001-02-22 13:56 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Marcelo Tosatti @ 2001-02-22  9:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: lkml


Hi, 

The following piece of code in ll_rw_block() aims to limit the number of
locked buffers by making processes throttle on IO if the number of
in-flight sectors is above a high watermark. IO will only start
again once we're under a low watermark.

                if (atomic_read(&queued_sectors) >= high_queued_sectors) {
                        run_task_queue(&tq_disk);
                        wait_event(blk_buffers_wait,
                        	atomic_read(&queued_sectors) < low_queued_sectors);
                }


However, if submit_bh() is used to queue IO (which is used by ->readpage()
for ext2, for example), no throttling happens.

It looks like ll_rw_block() users (writes, metadata reads) can be starved
by submit_bh() (data reads). 

If I'm not missing something, the watermark check should be moved to
submit_bh(). 
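
As an aside, the same hysteresis pattern is easy to play with outside the
kernel. A minimal userspace analogue (my sketch only, with pthreads standing
in for atomic_t, tq_disk and wait_event()):

	#include <pthread.h>

	#define HIGH_WM	1024	/* sectors: stop submitting here */
	#define LOW_WM	 256	/* sectors: resume only below this */

	static long queued_sectors;
	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;

	void submit(int nsects)		/* analogue of the ll_rw_block() check */
	{
		pthread_mutex_lock(&lock);
		if (queued_sectors >= HIGH_WM)
			while (queued_sectors >= LOW_WM)
				pthread_cond_wait(&drained, &lock);
		queued_sectors += nsects;
		pthread_mutex_unlock(&lock);
	}

	void complete(int nsects)	/* analogue of blk_finished_io() */
	{
		pthread_mutex_lock(&lock);
		queued_sectors -= nsects;
		if (queued_sectors < LOW_WM)
			pthread_cond_broadcast(&drained);
		pthread_mutex_unlock(&lock);
	}

The gap between the two watermarks batches the wakeups, so blocked
submitters don't stampede back the moment a single request completes.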




* Re: ll_rw_block/submit_bh and request limits
  2001-02-22  9:41 ll_rw_block/submit_bh and request limits Marcelo Tosatti
@ 2001-02-22 13:56 ` Jens Axboe
  2001-02-22 18:59   ` Linus Torvalds
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2001-02-22 13:56 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: lkml

On Thu, Feb 22 2001, Marcelo Tosatti wrote:
> The following piece of code in ll_rw_block() aims to limit the number of
> locked buffers by making processes throttle on IO if the number of
> in-flight sectors is above a high watermark. IO will only start
> again once we're under a low watermark.
> 
>                 if (atomic_read(&queued_sectors) >= high_queued_sectors) {
>                         run_task_queue(&tq_disk);
>                         wait_event(blk_buffers_wait,
>                         	atomic_read(&queued_sectors) < low_queued_sectors);
>                 }
> 
> 
> However, if submit_bh() is used to queue IO (which is used by ->readpage()
> for ext2, for example), no throttling happens.
> 
> It looks like ll_rw_block() users (writes, metadata reads) can be starved
> by submit_bh() (data reads). 
> 
> If I'm not missing something, the watermark check should be moved to
> submit_bh(). 

We might as well put it there; the idea was to not lock this one
buffer either, but I doubt this would make any difference in reality :-)

Linus, could you apply?

--- /opt/kernel/linux-2.4.2/drivers/block/ll_rw_blk.c	Thu Feb 22 14:55:22 2001
+++ drivers/block/ll_rw_blk.c	Thu Feb 22 14:53:07 2001
@@ -957,6 +959,20 @@
 	if (!test_bit(BH_Lock, &bh->b_state))
 		BUG();
 
+	/*
+	 * don't lock any more buffers if we are above the high
+	 * water mark. instead start I/O on the queued stuff.
+	 */
+	if (atomic_read(&queued_sectors) >= high_queued_sectors) {
+		run_task_queue(&tq_disk);
+		if (rw == READA) {
+			bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
+			return;
+		}
+		wait_event(blk_buffers_wait,
+			atomic_read(&queued_sectors) < low_queued_sectors);
+	}
+
 	set_bit(BH_Req, &bh->b_state);
 
 	/*
@@ -1057,16 +1073,6 @@
 
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-
-		/*
-		 * don't lock any more buffers if we are above the high
-		 * water mark. instead start I/O on the queued stuff.
-		 */
-		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
-			run_task_queue(&tq_disk);
-			wait_event(blk_buffers_wait,
-			 atomic_read(&queued_sectors) < low_queued_sectors);
-		}
 
 		/* Only one thread can actually submit the I/O. */
 		if (test_and_set_bit(BH_Lock, &bh->b_state))

-- 
Jens Axboe



* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 13:56 ` Jens Axboe
@ 2001-02-22 18:59   ` Linus Torvalds
  2001-02-22 20:32     ` Marcelo Tosatti
  2001-02-22 21:38     ` Andrea Arcangeli
  0 siblings, 2 replies; 12+ messages in thread
From: Linus Torvalds @ 2001-02-22 18:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Marcelo Tosatti, lkml



On Thu, 22 Feb 2001, Jens Axboe wrote:

> On Thu, Feb 22 2001, Marcelo Tosatti wrote:
> > The following piece of code in ll_rw_block() aims to limit the number of
> > locked buffers by making processes throttle on IO if the number of
> > in-flight sectors is above a high watermark. IO will only start
> > again once we're under a low watermark.
> > 
> >                 if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> >                         run_task_queue(&tq_disk);
> >                         wait_event(blk_buffers_wait,
> >                         	atomic_read(&queued_sectors) < low_queued_sectors);
> >                 }
> > 
> > 
> > However, if submit_bh() is used to queue IO (which is used by ->readpage()
> > for ext2, for example), no throttling happens.
> > 
> > It looks like ll_rw_block() users (writes, metadata reads) can be starved
> > by submit_bh() (data reads). 
> > 
> > If I'm not missing something, the watermark check should be moved to
> > submit_bh(). 
> 
> We might as well put it there; the idea was to not lock this one
> buffer either, but I doubt this would make any difference in reality :-)

I'd prefer for this check to be a per-queue one.

Right now a slow device (like a floppy) would artificially throttle a fast
one, if I read the above right. So instead of moving it down the
call-chain, I'd rather remove the check completely as it looks wrong to
me.

Now, if people want throttling, I'd much rather see that done per-queue.
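
Concretely, that means the counter and the wait queue move into the
request_queue_t itself. A rough sketch of the shape (illustrative names
only; Andrea's patch further down this thread is the real implementation):

	struct request_queue {
		int			in_flight_sectors;
		wait_queue_head_t	in_flight_wait;
		/* ... */
	};

	static void throttle(struct request_queue *q, int max_sectors)
	{
		/* a congested floppy queue no longer stalls
		 * submitters headed for an idle fast disk */
		wait_event(q->in_flight_wait,
			   q->in_flight_sectors < max_sectors);
	}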

(There's another level of throttling that might make sense: right now the
swap-out code has this "nr_async_pages" throttling which is very different
from the queue throttling. It might make sense to move that _VM_-level
throttling to writepage too - so that syncing of dirty mmaps will not
cause an overload of pages in flight. This was one of the reasons I
changed the semantics of writepage - so that shared mappings could do
that kind of smoothing too).

		Linus



* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 18:59   ` Linus Torvalds
@ 2001-02-22 20:32     ` Marcelo Tosatti
  2001-02-22 21:38     ` Andrea Arcangeli
  1 sibling, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2001-02-22 20:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, lkml


On Thu, 22 Feb 2001, Linus Torvalds wrote:

> 
> 
> On Thu, 22 Feb 2001, Jens Axboe wrote:
> 
> > On Thu, Feb 22 2001, Marcelo Tosatti wrote:
> > > The following piece of code in ll_rw_block() aims to limit the number of
> > > locked buffers by making processes throttle on IO if the number of
> > > in-flight sectors is above a high watermark. IO will only start
> > > again once we're under a low watermark.
> > > 
> > >                 if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> > >                         run_task_queue(&tq_disk);
> > >                         wait_event(blk_buffers_wait,
> > >                         	atomic_read(&queued_sectors) < low_queued_sectors);
> > >                 }
> > > 
> > > 
> > > However, if submit_bh() is used to queue IO (which is used by ->readpage()
> > > for ext2, for example), no throttling happens.
> > > 
> > > It looks like ll_rw_block() users (writes, metadata reads) can be starved
> > > by submit_bh() (data reads). 
> > > 
> > > If I'm not missing something, the watermark check should be moved to
> > > submit_bh(). 
> > 
> > We might as well put it there; the idea was to not lock this one
> > buffer either, but I doubt this would make any difference in reality :-)
> 
> I'd prefer for this check to be a per-queue one.
> 
> Right now a slow device (like a floppy) would artificially throttle a fast
> one, if I read the above right. So instead of moving it down the
> call-chain, I'd rather remove the check completely as it looks wrong to
> me.
> 
> Now, if people want throttling, I'd much rather see that done per-queue.
> 
> (There's another level of throttling that might make sense: right now the
> swap-out code has this "nr_async_pages" throttling which is very different
> from the queue throttling. It might make sense to move that _VM_-level
> throttling to writepage too - so that syncing of dirty mmaps will not
> cause an overload of pages in flight. This was one of the reasons I
> changed the semantics of writepage - so that shared mappings could do
> that kind of smoothing too).

And what about write() and read() if you do throttling with nr_async_pages?

The current scheme inside the block layer throttles write()s based on the
number of locked buffers in _main memory_.

* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 21:38     ` Andrea Arcangeli
@ 2001-02-22 20:40       ` Marcelo Tosatti
  2001-02-22 22:57         ` Andrea Arcangeli
  2001-02-25 17:34       ` Jens Axboe
  1 sibling, 1 reply; 12+ messages in thread
From: Marcelo Tosatti @ 2001-02-22 20:40 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Jens Axboe, lkml


On Thu, 22 Feb 2001, Andrea Arcangeli wrote:

> On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote:
> > I'd prefer for this check to be a per-queue one.
> 
> I've been running this in my tree for a few weeks, but I never had the
> courage to post it publicly because I haven't benchmarked it carefully yet
> and I'd prefer to finish another thing first.

You want to throttle IO if the amount of in-flight data is higher than
a given percentage of _main memory_.

As far as I can see, your patch prevents each individual queue from growing
bigger than the high watermark (which is a percentage of main
memory). However, it does not prevent multiple queues together from
exceeding the high watermark.
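
To put hypothetical numbers on it: with 512MB of RAM and a per-queue cap of
total_ram / 3 (about 170MB), four busy disks could together lock roughly
680MB of buffers for in-flight I/O, so the aggregate limit exceeds total RAM
and the global protection the old check gave is effectively gone.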





* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 18:59   ` Linus Torvalds
  2001-02-22 20:32     ` Marcelo Tosatti
@ 2001-02-22 21:38     ` Andrea Arcangeli
  2001-02-22 20:40       ` Marcelo Tosatti
  2001-02-25 17:34       ` Jens Axboe
  1 sibling, 2 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2001-02-22 21:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Jens Axboe, Marcelo Tosatti, lkml

On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote:
> I'd prefer for this check to be a per-queue one.

I've been running this in my tree for a few weeks, but I never had the
courage to post it publicly because I haven't benchmarked it carefully yet
and I'd prefer to finish another thing first. It is actually based on the
code I had in my blkdev tree after I last merged the 512K I/O requests and
elevator fixes with Jens. I think it won't generate bad numbers, and it was
running fine on a 32-way SMP (though I didn't stress the I/O subsystem much
there), but please don't include it until somebody benchmarks it carefully
with dbench and tiotest. (It still applies cleanly against 2.4.2.)

diff -urN 2.4.2pre3/drivers/block/DAC960.c x/drivers/block/DAC960.c
--- 2.4.2pre3/drivers/block/DAC960.c	Fri Feb  9 18:35:16 2001
+++ x/drivers/block/DAC960.c	Fri Feb  9 19:00:58 2001
@@ -2827,7 +2827,6 @@
 static inline void DAC960_ProcessCompletedBuffer(BufferHeader_T *BufferHeader,
 						 boolean SuccessfulIO)
 {
-  blk_finished_io(BufferHeader->b_size >> 9);
   BufferHeader->b_end_io(BufferHeader, SuccessfulIO);
 }
 
diff -urN 2.4.2pre3/drivers/block/cciss.c x/drivers/block/cciss.c
--- 2.4.2pre3/drivers/block/cciss.c	Fri Feb  9 18:35:17 2001
+++ x/drivers/block/cciss.c	Fri Feb  9 19:00:59 2001
@@ -1072,7 +1072,6 @@
 	{
 		xbh = bh->b_reqnext; 
 		bh->b_reqnext = NULL; 
-		blk_finished_io(bh->b_size >> 9);
 		bh->b_end_io(bh, status);
 		bh = xbh;
 	}
diff -urN 2.4.2pre3/drivers/block/cpqarray.c x/drivers/block/cpqarray.c
--- 2.4.2pre3/drivers/block/cpqarray.c	Fri Feb  9 18:35:17 2001
+++ x/drivers/block/cpqarray.c	Fri Feb  9 19:00:59 2001
@@ -1031,7 +1031,6 @@
 		xbh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		
-		blk_finished_io(bh->b_size >> 9);
 		bh->b_end_io(bh, ok);
 
 		bh = xbh;
diff -urN 2.4.2pre3/drivers/block/ll_rw_blk.c x/drivers/block/ll_rw_blk.c
--- 2.4.2pre3/drivers/block/ll_rw_blk.c	Fri Feb  9 18:35:17 2001
+++ x/drivers/block/ll_rw_blk.c	Fri Feb  9 19:06:11 2001
@@ -118,17 +118,10 @@
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * queued sectors for all devices, used to make sure we don't fill all
- * of memory with locked buffers
- */
-atomic_t queued_sectors;
-
-/*
  * high and low watermark for above
  */
-static int high_queued_sectors, low_queued_sectors;
+int max_in_flight_sectors;
 static int batch_requests, queue_nr_requests;
-static DECLARE_WAIT_QUEUE_HEAD(blk_buffers_wait);
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -397,11 +390,17 @@
 		rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL);
 		memset(rq, 0, sizeof(struct request));
 		rq->rq_status = RQ_INACTIVE;
-		list_add(&rq->table, &q->request_freelist[i & 1]);
+		list_add(&rq->table, &q->request_freelist[i < (queue_nr_requests >> 1) ? READ : WRITE]);
 	}
 
-	init_waitqueue_head(&q->wait_for_request);
+	init_waitqueue_head(&q->wait_for_request[READ]);
+	init_waitqueue_head(&q->wait_for_request[WRITE]);
+
+	init_waitqueue_head(&q->in_flight_wait);
+	q->in_flight_sectors = 0;
+#if 0
 	spin_lock_init(&q->queue_lock);
+#endif
 }
 
 static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh);
@@ -491,9 +490,9 @@
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue_exclusive(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_request[rw], &wait);
 	for (;;) {
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		set_current_state(TASK_UNINTERRUPTIBLE);
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q, rw);
 		spin_unlock_irq(&io_request_lock);
@@ -502,8 +501,8 @@
 		generic_unplug_device(q);
 		schedule();
 	}
-	remove_wait_queue(&q->wait_for_request, &wait);
-	current->state = TASK_RUNNING;
+	remove_wait_queue(&q->wait_for_request[rw], &wait);
+	__set_current_state(TASK_RUNNING);
 	return rq;
 }
 
@@ -590,12 +589,26 @@
 	list_add(&req->queue, insert_here);
 }
 
-void inline blk_refill_freelist(request_queue_t *q, int rw)
+static inline void blk_refill_freelist(request_queue_t *q, int rw)
 {
-	if (q->pending_free[rw]) {
-		list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]);
-		INIT_LIST_HEAD(&q->pending_freelist[rw]);
-		q->pending_free[rw] = 0;
+	list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]);
+	INIT_LIST_HEAD(&q->pending_freelist[rw]);
+	q->pending_free[rw] = 0;
+}
+
+static inline void blk_finished_io(request_queue_t * q, struct request * req)
+{
+	int nsects  = req->in_flight_sectors;
+
+	req->in_flight_sectors = 0;
+	q->in_flight_sectors -= nsects;
+	if (q->in_flight_sectors < 0) {
+		printk("blkdev: in_flight_sectors < 0\n");
+		q->in_flight_sectors = 0;
+	}
+	if (waitqueue_active(&q->in_flight_wait) &&
+	    q->in_flight_sectors < max_in_flight_sectors) {
+		wake_up(&(q)->in_flight_wait);
 	}
 }
 
@@ -612,26 +625,27 @@
 
 	/*
 	 * Request may not have originated from ll_rw_blk. if not,
-	 * asumme it has free buffers and check waiters
+	 * assume it has free buffers and check waiters
 	 */
 	if (q) {
-		/*
-		 * we've released enough buffers to start I/O again
-		 */
-		if (waitqueue_active(&blk_buffers_wait)
-		    && atomic_read(&queued_sectors) < low_queued_sectors)
-			wake_up(&blk_buffers_wait);
-
-		/*
-		 * Add to pending free list and batch wakeups
-		 */
-		list_add(&req->table, &q->pending_freelist[rw]);
-
-		if (++q->pending_free[rw] >= batch_requests) {
-			int wake_up = q->pending_free[rw];
-			blk_refill_freelist(q, rw);
-			wake_up_nr(&q->wait_for_request, wake_up);
+		if (!list_empty(&q->request_freelist[rw])) {
+			if (q->pending_free[rw])
+				BUG();
+			list_add(&req->table, &q->request_freelist[rw]);
+			wake_up(&q->wait_for_request[rw]);
+		} else 	{
+			/*
+			 * Add to pending free list and batch wakeups
+			 */
+			list_add(&req->table, &q->pending_freelist[rw]);
+			if (++q->pending_free[rw] >= batch_requests) {
+				int wake_up = q->pending_free[rw];
+				blk_refill_freelist(q, rw);
+				wake_up_nr(&q->wait_for_request[rw], wake_up);
+			}
 		}
+
+		blk_finished_io(q, req);
 	}
 }
 
@@ -694,6 +708,12 @@
 	attempt_merge(q, blkdev_entry_to_request(prev), max_sectors, max_segments);
 }
 
+#define blk_started_io(q, req, nsects)		\
+do {						\
+	(q)->in_flight_sectors += (nsects);	\
+	(req)->in_flight_sectors += (nsects);	\
+} while(0)
+
 static int __make_request(request_queue_t * q, int rw,
 				  struct buffer_head * bh)
 {
@@ -753,6 +773,27 @@
 	 */
 	spin_lock_irq(&io_request_lock);
 
+	/*
+	 * don't lock any more buffers if we are above the max
+	 * water mark. instead start I/O on the queued stuff.
+	 */
+	if (q->in_flight_sectors >= max_in_flight_sectors) {
+		DECLARE_WAITQUEUE(wait, current);
+
+		add_wait_queue(&q->in_flight_wait, &wait);
+		for (;;) {
+			__set_current_state(TASK_UNINTERRUPTIBLE);
+			__generic_unplug_device(q);
+			spin_unlock_irq(&io_request_lock);
+			schedule();
+			spin_lock_irq(&io_request_lock);
+			if (q->in_flight_sectors < max_in_flight_sectors)
+				break;
+		}
+		remove_wait_queue(&q->in_flight_wait, &wait);
+		__set_current_state(TASK_RUNNING);
+	}
+
 	insert_here = head->prev;
 	if (list_empty(head)) {
 		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
@@ -771,7 +812,7 @@
 			req->bhtail->b_reqnext = bh;
 			req->bhtail = bh;
 			req->nr_sectors = req->hard_nr_sectors += count;
-			blk_started_io(count);
+			blk_started_io(q, req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			attempt_back_merge(q, req, max_sectors, max_segments);
 			goto out;
@@ -786,7 +827,7 @@
 			req->current_nr_sectors = count;
 			req->sector = req->hard_sector = sector;
 			req->nr_sectors = req->hard_nr_sectors += count;
-			blk_started_io(count);
+			blk_started_io(q, req, count);
 			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
 			attempt_front_merge(q, head, req, max_sectors, max_segments);
 			goto out;
@@ -841,7 +882,7 @@
 	req->bh = bh;
 	req->bhtail = bh;
 	req->rq_dev = bh->b_rdev;
-	blk_started_io(count);
+	blk_started_io(q, req, count);
 	add_request(q, req, insert_here);
 out:
 	if (freereq)
@@ -1059,16 +1100,6 @@
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
 
-		/*
-		 * don't lock any more buffers if we are above the high
-		 * water mark. instead start I/O on the queued stuff.
-		 */
-		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
-			run_task_queue(&tq_disk);
-			wait_event(blk_buffers_wait,
-			 atomic_read(&queued_sectors) < low_queued_sectors);
-		}
-
 		/* Only one thread can actually submit the I/O. */
 		if (test_and_set_bit(BH_Lock, &bh->b_state))
 			continue;
@@ -1143,7 +1174,6 @@
 
 	if ((bh = req->bh) != NULL) {
 		nsect = bh->b_size >> 9;
-		blk_finished_io(nsect);
 		req->bh = bh->b_reqnext;
 		bh->b_reqnext = NULL;
 		bh->b_end_io(bh, uptodate);
@@ -1173,8 +1203,6 @@
 	blkdev_release_request(req);
 }
 
-#define MB(kb)	((kb) << 10)
-
 int __init blk_dev_init(void)
 {
 	struct blk_dev_struct *dev;
@@ -1194,30 +1222,19 @@
 	memset(max_readahead, 0, sizeof(max_readahead));
 	memset(max_sectors, 0, sizeof(max_sectors));
 
-	atomic_set(&queued_sectors, 0);
 	total_ram = nr_free_pages() << (PAGE_SHIFT - 10);
 
-	/*
-	 * Try to keep 128MB max hysteris. If not possible,
-	 * use half of RAM
-	 */
-	high_queued_sectors = (total_ram * 2) / 3;
-	low_queued_sectors = high_queued_sectors / 3;
-	if (high_queued_sectors - low_queued_sectors > MB(128))
-		low_queued_sectors = high_queued_sectors - MB(128);
-
+	max_in_flight_sectors = total_ram / 3;
 
 	/*
 	 * make it sectors (512b)
 	 */
-	high_queued_sectors <<= 1;
-	low_queued_sectors <<= 1;
+	max_in_flight_sectors <<= 1;
 
 	/*
 	 * Scale free request slots per queue too
 	 */
-	total_ram = (total_ram + MB(32) - 1) & ~(MB(32) - 1);
-	if ((queue_nr_requests = total_ram >> 9) > QUEUE_NR_REQUESTS)
+	if ((queue_nr_requests = total_ram >> 5) > QUEUE_NR_REQUESTS)
 		queue_nr_requests = QUEUE_NR_REQUESTS;
 
 	/*
@@ -1225,11 +1242,12 @@
 	 */
 	if ((batch_requests = queue_nr_requests >> 3) > 32)
 		batch_requests = 32;
+	if (!batch_requests)
+		batch_requests = 1;
 
-	printk("block: queued sectors max/low %dkB/%dkB, %d slots per queue\n",
-						high_queued_sectors / 2,
-						low_queued_sectors / 2,
-						queue_nr_requests);
+	printk("block: queued sectors max %dkB, %d slots per queue, %d batch\n",
+						max_in_flight_sectors / 2,
+						queue_nr_requests, batch_requests);
 
 #ifdef CONFIG_AMIGA_Z2RAM
 	z2_init();
@@ -1350,4 +1368,3 @@
 EXPORT_SYMBOL(generic_make_request);
 EXPORT_SYMBOL(blkdev_release_request);
 EXPORT_SYMBOL(generic_unplug_device);
-EXPORT_SYMBOL(queued_sectors);
diff -urN 2.4.2pre3/drivers/scsi/scsi_lib.c x/drivers/scsi/scsi_lib.c
--- 2.4.2pre3/drivers/scsi/scsi_lib.c	Fri Feb  9 18:35:34 2001
+++ x/drivers/scsi/scsi_lib.c	Fri Feb  9 19:00:59 2001
@@ -375,7 +375,6 @@
 	do {
 		if ((bh = req->bh) != NULL) {
 			nsect = bh->b_size >> 9;
-			blk_finished_io(nsect);
 			req->bh = bh->b_reqnext;
 			req->nr_sectors -= nsect;
 			req->sector += nsect;
diff -urN 2.4.2pre3/fs/buffer.c x/fs/buffer.c
--- 2.4.2pre3/fs/buffer.c	Fri Feb  9 18:35:44 2001
+++ x/fs/buffer.c	Fri Feb  9 19:00:59 2001
@@ -628,7 +628,7 @@
    to do in order to release the ramdisk memory is to destroy dirty buffers.
 
    These are two special cases. Normal usage imply the device driver
-   to issue a sync on the device (without waiting I/O completation) and
+   to issue a sync on the device (without waiting I/O completion) and
    then an invalidate_buffers call that doesn't trash dirty buffers. */
 void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers)
 {
@@ -1027,12 +1027,13 @@
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
 	refill_freelist(size);
+	/* FIXME: getblk should fail if there's not enough memory */
 	goto repeat;
 }
 
 /* -1 -> no need to flush
     0 -> async flush
-    1 -> sync flush (wait for I/O completation) */
+    1 -> sync flush (wait for I/O completion) */
 int balance_dirty_state(kdev_t dev)
 {
 	unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;
@@ -1431,6 +1432,7 @@
 {
 	struct buffer_head *bh, *head, *tail;
 
+	/* FIXME: create_buffers should fail if there's not enough memory */
 	head = create_buffers(page, blocksize, 1);
 	if (page->buffers)
 		BUG();
@@ -2367,11 +2369,9 @@
 	spin_lock(&free_list[index].lock);
 	tmp = bh;
 	do {
-		struct buffer_head *p = tmp;
-
-		tmp = tmp->b_this_page;
-		if (buffer_busy(p))
+		if (buffer_busy(tmp))
 			goto busy_buffer_page;
+		tmp = tmp->b_this_page;
 	} while (tmp != bh);
 
 	spin_lock(&unused_list_lock);
diff -urN 2.4.2pre3/include/linux/blkdev.h x/include/linux/blkdev.h
--- 2.4.2pre3/include/linux/blkdev.h	Fri Feb  9 18:34:13 2001
+++ x/include/linux/blkdev.h	Fri Feb  9 21:28:15 2001
@@ -35,7 +35,7 @@
 	int errors;
 	unsigned long sector;
 	unsigned long nr_sectors;
-	unsigned long hard_sector, hard_nr_sectors;
+	unsigned long hard_sector, hard_nr_sectors, in_flight_sectors;
 	unsigned int nr_segments;
 	unsigned int nr_hw_segments;
 	unsigned long current_nr_sectors;
@@ -77,6 +77,7 @@
 	struct list_head	request_freelist[2];
 	struct list_head	pending_freelist[2];
 	int			pending_free[2];
+	int			in_flight_sectors;
 
 	/*
 	 * Together with queue_head for cacheline sharing
@@ -112,16 +113,20 @@
 	 */
 	char			head_active;
 
+	wait_queue_head_t	in_flight_wait;
+
 	/*
-	 * Is meant to protect the queue in the future instead of
-	 * io_request_lock
+	 * Tasks wait here for free request
 	 */
-	spinlock_t		queue_lock;
+	wait_queue_head_t	wait_for_request[2];
 
+#if 0
 	/*
-	 * Tasks wait here for free request
+	 * Is meant to protect the queue in the future instead of
+	 * io_request_lock
 	 */
-	wait_queue_head_t	wait_for_request;
+	spinlock_t		queue_lock;
+#endif
 };
 
 struct blk_dev_struct {
@@ -177,7 +182,7 @@
 
 extern int * max_segments[MAX_BLKDEV];
 
-extern atomic_t queued_sectors;
+extern int max_in_flight_sectors;
 
 #define MAX_SEGMENTS 128
 #define MAX_SECTORS (MAX_SEGMENTS*8)
@@ -205,15 +210,5 @@
 	else
 		return 512;
 }
-
-#define blk_finished_io(nsects)				\
-	atomic_sub(nsects, &queued_sectors);		\
-	if (atomic_read(&queued_sectors) < 0) {		\
-		printk("block: queued_sectors < 0\n");	\
-		atomic_set(&queued_sectors, 0);		\
-	}
-
-#define blk_started_io(nsects)				\
-	atomic_add(nsects, &queued_sectors);
 
 #endif

Andrea


* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 22:57         ` Andrea Arcangeli
@ 2001-02-22 21:44           ` Marcelo Tosatti
  2001-02-22 23:54             ` Andrea Arcangeli
  2001-02-22 23:12           ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Marcelo Tosatti @ 2001-02-22 21:44 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Jens Axboe, lkml


On Thu, 22 Feb 2001, Andrea Arcangeli wrote:

<snip>

> However, if you have hundreds of different queues doing I/O at the same
> time it may make a difference, but probably with tons of hard disks
> you'll also have tons of RAM... In theory we could put a global limit
> on top of the per-queue one. Or we could at least put an upper bound on
> total_ram / 3.

The global limit on top of the per-queue limit sounds good. 

Since you're talking about the "total_ram / 3" hardcoded value... it
should be a /proc tunable, IMO. (Andi Kleen already suggested this.)
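
A 2.4-style sysctl stub would be enough for that. A sketch (the ctl_name
numbers and the dev/block placement are made up here):

	#include <linux/sysctl.h>

	extern int max_in_flight_sectors;

	/* would appear as /proc/sys/dev/block/max_in_flight_sectors */
	static ctl_table blk_table[] = {
		{1, "max_in_flight_sectors", &max_in_flight_sectors,
		 sizeof(int), 0644, NULL, &proc_dointvec},
		{0}
	};
	static ctl_table dev_dir[] = {
		{2, "block", NULL, 0, 0555, blk_table},
		{0}
	};
	static ctl_table dev_root[] = {
		{CTL_DEV, "dev", NULL, 0, 0555, dev_dir},
		{0}
	};

	void __init blk_sysctl_init(void)
	{
		register_sysctl_table(dev_root, 0);
	}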



* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 20:40       ` Marcelo Tosatti
@ 2001-02-22 22:57         ` Andrea Arcangeli
  2001-02-22 21:44           ` Marcelo Tosatti
  2001-02-22 23:12           ` Andrea Arcangeli
  0 siblings, 2 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2001-02-22 22:57 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, Feb 22, 2001 at 06:40:48PM -0200, Marcelo Tosatti wrote:
> You want to throttle IO if the amount of in-flight data is higher than
> a given percentage of _main memory_.
> 
> As far as I can see, your patch prevents each individual queue from growing
> bigger than the high watermark (which is a percentage of main
> memory). However, it does not prevent multiple queues together from
> exceeding the high watermark.

I of course see what you mean and I considered it, but I tend to believe
that's a minor issue, and that most machines will be happier with the
per-queue scheme even without the global limit.

The only reason we added the limit on in-flight I/O is to be allowed a huge
number of requests, so we can do very large reordering and merges in the
elevator with seeking I/O (4k IO requests), _but_ without having to lock
gigabytes of pages in RAM before starting the I/O if the I/O was
contiguous. We absolutely need such a sanity limit; it would be absolutely
insane to wait for kupdate to submit 10G of RAM to a single hard disk before
unplugging on a 30G machine.

It doesn't need to be exact; it's not as if failing to unplug at exactly 1/3
of global RAM locked makes performance suck, crashes the machine, or gets
tasks killed. As Jens noticed, sync_page_buffers will unplug the queue at
some point if we're low on RAM anyway.

The limit just says "unplug after a reasonable point; beyond it, it doesn't
matter anymore to try to delay requests for this hard disk, no matter
whether there are still I/O requests available".

However, if you have hundreds of different queues doing I/O at the same
time it may make a difference, but probably with tons of hard disks
you'll also have tons of RAM... In theory we could put a global limit
on top of the per-queue one. Or we could at least put an upper bound on
total_ram / 3.
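
The combined check itself would be cheap. A sketch (per-queue names as in
the patch above; the global counter is hypothetical):

	static inline int blk_over_limit(request_queue_t *q)
	{
		return q->in_flight_sectors >= max_in_flight_sectors ||
		       atomic_read(&total_in_flight_sectors) >= global_max_sectors;
	}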

Note that 2.4.0 doesn't enforce a global limit on in-flight I/O either
(while in 2.2 the limit is global, as it has a shared pool of I/O requests).

Andrea


* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 22:57         ` Andrea Arcangeli
  2001-02-22 21:44           ` Marcelo Tosatti
@ 2001-02-22 23:12           ` Andrea Arcangeli
  1 sibling, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2001-02-22 23:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, Feb 22, 2001 at 11:57:00PM +0100, Andrea Arcangeli wrote:
> insane to wait for kupdate to submit 10G of RAM to a single hard disk before
> unplugging on a 30G machine.

Actually kupdate will unplug the queue itself, but in theory the queue can
still grow up to such a level after the I/O has started. I think we'd better
add a hard upper limit on the in-flight I/O watermark.
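
In blk_dev_init() terms that is a simple clamp. A sketch (the 256MB ceiling
is an arbitrary example):

	max_in_flight_sectors = total_ram / 3;	/* total_ram is in KB */
	max_in_flight_sectors <<= 1;		/* KB -> 512-byte sectors */
	if (max_in_flight_sectors > (256 << 11))
		max_in_flight_sectors = 256 << 11;	/* cap at 256MB */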

Andrea


* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 21:44           ` Marcelo Tosatti
@ 2001-02-22 23:54             ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2001-02-22 23:54 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, Feb 22, 2001 at 07:44:11PM -0200, Marcelo Tosatti wrote:
> The global limit on top of the per-queue limit sounds good. 

Probably.

> Since you're talking about the "total_ram / 3" hardcoded value... it
> should be a /proc tunable, IMO. (Andi Kleen already suggested this.)

Yes, IIRC Andi also proposed that a few weeks ago.

Andrea


* Re: ll_rw_block/submit_bh and request limits
  2001-02-22 21:38     ` Andrea Arcangeli
  2001-02-22 20:40       ` Marcelo Tosatti
@ 2001-02-25 17:34       ` Jens Axboe
  2001-02-25 19:26         ` Andrea Arcangeli
  1 sibling, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2001-02-25 17:34 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, lkml

On Thu, Feb 22 2001, Andrea Arcangeli wrote:
> On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote:
> > I'd prefer for this check to be a per-queue one.
> 
> I've been running this in my tree for a few weeks, but I never had the
> courage to post it publicly because I haven't benchmarked it carefully yet
> and I'd prefer to finish another thing first. It is actually based on the
> code I had in my blkdev tree after I last merged the 512K I/O requests and
> elevator fixes with Jens. I think it won't generate bad numbers, and it was
> running fine on a 32-way SMP (though I didn't stress the I/O subsystem much
> there), but please don't include it until somebody benchmarks it carefully
> with dbench and tiotest. (It still applies cleanly against 2.4.2.)

Thinking about this a bit, I have to agree with you and Linus. It
is possible to find pathological cases where the per-queue limit suffers
compared to the global one, but in reality I don't think it's worth
it. And the per-queue limit saves us the atomic updates, since it's
done under the io_request_lock (or the queue lock later, still fine), so
that's a win too.

I have had rw wait queues before; they were removed when I did the request
stealing, which is now gone again. I'm not even sure it's worth it
now: Marcelo and I discussed it last week and I did some tests that
showed nothing remarkable. But it's mainly for free, so we might
as well do it.

Any reason why you don't have a lower wake-up limit for the queue?
Do you mind if I do some testing with this patch and fold it in,
possibly?

-- 
Jens Axboe



* Re: ll_rw_block/submit_bh and request limits
  2001-02-25 17:34       ` Jens Axboe
@ 2001-02-25 19:26         ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2001-02-25 19:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linus Torvalds, Marcelo Tosatti, lkml

On Sun, Feb 25, 2001 at 06:34:01PM +0100, Jens Axboe wrote:
> Any reason why you don't have a lower wake-up limit for the queue?

The watermark gap looked too high (it's 128MB in Linus's current tree), but
it's probably a good idea to resurrect it with a max difference of a few
full-sized requests (1-2 MB).
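
In code that amounts to something like (a sketch; the factor of four is an
arbitrary "few"):

	/* hysteresis of a few full-sized 512KB requests */
	#define REQ_SECTORS	(512 << 1)	/* 512KB in 512-byte sectors */

	low_in_flight_sectors = max_in_flight_sectors - 4 * REQ_SECTORS;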

> Do you mind if I do some testing with this patch and fold it in,
> possibly?

Go ahead, thanks,
Andrea
