* ll_rw_block/submit_bh and request limits
@ 2001-02-22 9:41 Marcelo Tosatti
From: Marcelo Tosatti @ 2001-02-22 9:41 UTC (permalink / raw)
To: Jens Axboe; +Cc: lkml
Hi,
The following piece of code in ll_rw_block() aims to limit the number of
locked buffers by making processes throttle on I/O if the number of
in-flight requests is bigger than a high watermark. I/O will only start
again once we're under a low watermark.
	if (atomic_read(&queued_sectors) >= high_queued_sectors) {
		run_task_queue(&tq_disk);
		wait_event(blk_buffers_wait,
			atomic_read(&queued_sectors) < low_queued_sectors);
	}
However, if submit_bh() is used to queue IO (which is used by ->readpage()
for ext2, for example), no throttling happens.
It looks like ll_rw_block() users (writes, metadata reads) can be starved
by submit_bh() (data reads).
If I'm not missing something, the watermark check should be moved to
submit_bh().
* Re: ll_rw_block/submit_bh and request limits
From: Jens Axboe @ 2001-02-22 13:56 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: lkml

On Thu, Feb 22 2001, Marcelo Tosatti wrote:
> The following piece of code in ll_rw_block() aims to limit the number of
> locked buffers by making processes throttle on IO if the number of on
> flight requests is bigger than a high watermaker. IO will only start
> again if we're under a low watermark.
>
> 	if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> 		run_task_queue(&tq_disk);
> 		wait_event(blk_buffers_wait,
> 			atomic_read(&queued_sectors) < low_queued_sectors);
> 	}
>
> However, if submit_bh() is used to queue IO (which is used by ->readpage()
> for ext2, for example), no throttling happens.
>
> It looks like ll_rw_block() users (writes, metadata reads) can be starved
> by submit_bh() (data reads).
>
> If I'm not missing something, the watermark check should be moved to
> submit_bh().

We might as well put it there; the idea was to not lock this one
buffer either, but I doubt this would make any difference in reality :-)

Linus, could you apply?

--- /opt/kernel/linux-2.4.2/drivers/block/ll_rw_blk.c	Thu Feb 22 14:55:22 2001
+++ drivers/block/ll_rw_blk.c	Thu Feb 22 14:53:07 2001
@@ -957,6 +959,20 @@
 	if (!test_bit(BH_Lock, &bh->b_state))
 		BUG();
 
+	/*
+	 * don't lock any more buffers if we are above the high
+	 * water mark. instead start I/O on the queued stuff.
+	 */
+	if (atomic_read(&queued_sectors) >= high_queued_sectors) {
+		run_task_queue(&tq_disk);
+		if (rw == READA) {
+			bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
+			return;
+		}
+		wait_event(blk_buffers_wait,
+			atomic_read(&queued_sectors) < low_queued_sectors);
+	}
+
 	set_bit(BH_Req, &bh->b_state);
 
 	/*
@@ -1057,16 +1073,6 @@
 	for (i = 0; i < nr; i++) {
 		struct buffer_head *bh = bhs[i];
-
-		/*
-		 * don't lock any more buffers if we are above the high
-		 * water mark. instead start I/O on the queued stuff.
-		 */
-		if (atomic_read(&queued_sectors) >= high_queued_sectors) {
-			run_task_queue(&tq_disk);
-			wait_event(blk_buffers_wait,
-				atomic_read(&queued_sectors) < low_queued_sectors);
-		}
 
 		/* Only one thread can actually submit the I/O. */
 		if (test_and_set_bit(BH_Lock, &bh->b_state))

-- 
Jens Axboe
* Re: ll_rw_block/submit_bh and request limits
From: Linus Torvalds @ 2001-02-22 18:59 UTC (permalink / raw)
To: Jens Axboe; +Cc: Marcelo Tosatti, lkml

On Thu, 22 Feb 2001, Jens Axboe wrote:
> On Thu, Feb 22 2001, Marcelo Tosatti wrote:
> > The following piece of code in ll_rw_block() aims to limit the number of
> > locked buffers by making processes throttle on IO if the number of on
> > flight requests is bigger than a high watermaker. IO will only start
> > again if we're under a low watermark.
> >
> > 	if (atomic_read(&queued_sectors) >= high_queued_sectors) {
> > 		run_task_queue(&tq_disk);
> > 		wait_event(blk_buffers_wait,
> > 			atomic_read(&queued_sectors) < low_queued_sectors);
> > 	}
> >
> > However, if submit_bh() is used to queue IO (which is used by ->readpage()
> > for ext2, for example), no throttling happens.
> >
> > It looks like ll_rw_block() users (writes, metadata reads) can be starved
> > by submit_bh() (data reads).
> >
> > If I'm not missing something, the watermark check should be moved to
> > submit_bh().
>
> We might as well put it there, the idea was to not lock this one
> buffer either but I doubt this would make any different in reality :-)

I'd prefer for this check to be a per-queue one.

Right now a slow device (like a floppy) would artificially throttle a fast
one, if I read the above right. So instead of moving it down the
call-chain, I'd rather remove the check completely, as it looks wrong to
me.

Now, if people want throttling, I'd much rather see that done per-queue.

(There's another level of throttling that might make sense: right now the
swap-out code has this "nr_async_pages" throttling, which is very different
from the queue throttling. It might make sense to move that _VM_-level
throttling to writepage too, so that syncing of dirty mmaps will not
cause an overload of pages in flight. This was one of the reasons I
changed the semantics of writepage: so that shared mappings could do
that kind of smoothing too.)

Linus
* Re: ll_rw_block/submit_bh and request limits
From: Marcelo Tosatti @ 2001-02-22 20:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jens Axboe, lkml

On Thu, 22 Feb 2001, Linus Torvalds wrote:

<snip>

> I'd prefer for this check to be a per-queue one.
>
> Right now a slow device (like a floppy) would artifically throttle a fast
> one, if I read the above right. So instead of moving it down the
> call-chain, I'd rather remove the check completely as it looks wrong to
> me.
>
> Now, if people want throttling, I'd much rather see that done per-queue.
>
> (There's another level of throttling that migth make sense: right now the
> swap-out code has this "nr_async_pages" throttling which is very different
> from the queue throttling. It might make sense to move that _VM_-level
> throttling to writepage too - so that syncing of dirty mmap's will not
> cause an overload of pages in flight. This was one of the reasons I
> changed the semantics of write-page - so that shared mappings could do
> that kind of smoothing too).

And what about write() and read() if you do throttling with
nr_async_pages?

The current scheme inside the block layer throttles write()s based on the
number of locked buffers in _main memory_.
* Re: ll_rw_block/submit_bh and request limits
From: Andrea Arcangeli @ 2001-02-22 21:38 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Jens Axboe, Marcelo Tosatti, lkml

On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote:
> I'd prefer for this check to be a per-queue one.

I've been running this in my tree for a few weeks, but I never had the
courage to post it publicly because I haven't benchmarked it carefully
yet and I'd prefer to finish another thing first. This is actually based
on the code I had in my blkdev tree after I last merged the 512K I/O
requests and elevator fixes with Jens. I think it won't generate bad
numbers, and it was running fine on a 32-way SMP (though I didn't stress
the I/O subsystem much there), but please don't include it until somebody
benchmarks it carefully with dbench and tiotest.
(it still applys cleanly against 2.4.2) diff -urN 2.4.2pre3/drivers/block/DAC960.c x/drivers/block/DAC960.c --- 2.4.2pre3/drivers/block/DAC960.c Fri Feb 9 18:35:16 2001 +++ x/drivers/block/DAC960.c Fri Feb 9 19:00:58 2001 @@ -2827,7 +2827,6 @@ static inline void DAC960_ProcessCompletedBuffer(BufferHeader_T *BufferHeader, boolean SuccessfulIO) { - blk_finished_io(BufferHeader->b_size >> 9); BufferHeader->b_end_io(BufferHeader, SuccessfulIO); } diff -urN 2.4.2pre3/drivers/block/cciss.c x/drivers/block/cciss.c --- 2.4.2pre3/drivers/block/cciss.c Fri Feb 9 18:35:17 2001 +++ x/drivers/block/cciss.c Fri Feb 9 19:00:59 2001 @@ -1072,7 +1072,6 @@ { xbh = bh->b_reqnext; bh->b_reqnext = NULL; - blk_finished_io(bh->b_size >> 9); bh->b_end_io(bh, status); bh = xbh; } diff -urN 2.4.2pre3/drivers/block/cpqarray.c x/drivers/block/cpqarray.c --- 2.4.2pre3/drivers/block/cpqarray.c Fri Feb 9 18:35:17 2001 +++ x/drivers/block/cpqarray.c Fri Feb 9 19:00:59 2001 @@ -1031,7 +1031,6 @@ xbh = bh->b_reqnext; bh->b_reqnext = NULL; - blk_finished_io(bh->b_size >> 9); bh->b_end_io(bh, ok); bh = xbh; diff -urN 2.4.2pre3/drivers/block/ll_rw_blk.c x/drivers/block/ll_rw_blk.c --- 2.4.2pre3/drivers/block/ll_rw_blk.c Fri Feb 9 18:35:17 2001 +++ x/drivers/block/ll_rw_blk.c Fri Feb 9 19:06:11 2001 @@ -118,17 +118,10 @@ int * max_sectors[MAX_BLKDEV]; /* - * queued sectors for all devices, used to make sure we don't fill all - * of memory with locked buffers - */ -atomic_t queued_sectors; - -/* * high and low watermark for above */ -static int high_queued_sectors, low_queued_sectors; +int max_in_flight_sectors; static int batch_requests, queue_nr_requests; -static DECLARE_WAIT_QUEUE_HEAD(blk_buffers_wait); static inline int get_max_sectors(kdev_t dev) { @@ -397,11 +390,17 @@ rq = kmem_cache_alloc(request_cachep, SLAB_KERNEL); memset(rq, 0, sizeof(struct request)); rq->rq_status = RQ_INACTIVE; - list_add(&rq->table, &q->request_freelist[i & 1]); + list_add(&rq->table, &q->request_freelist[i < 
(queue_nr_requests >> 1) ? READ : WRITE]); } - init_waitqueue_head(&q->wait_for_request); + init_waitqueue_head(&q->wait_for_request[READ]); + init_waitqueue_head(&q->wait_for_request[WRITE]); + + init_waitqueue_head(&q->in_flight_wait); + q->in_flight_sectors = 0; +#if 0 spin_lock_init(&q->queue_lock); +#endif } static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh); @@ -491,9 +490,9 @@ register struct request *rq; DECLARE_WAITQUEUE(wait, current); - add_wait_queue_exclusive(&q->wait_for_request, &wait); + add_wait_queue_exclusive(&q->wait_for_request[rw], &wait); for (;;) { - __set_current_state(TASK_UNINTERRUPTIBLE); + set_current_state(TASK_UNINTERRUPTIBLE); spin_lock_irq(&io_request_lock); rq = get_request(q, rw); spin_unlock_irq(&io_request_lock); @@ -502,8 +501,8 @@ generic_unplug_device(q); schedule(); } - remove_wait_queue(&q->wait_for_request, &wait); - current->state = TASK_RUNNING; + remove_wait_queue(&q->wait_for_request[rw], &wait); + __set_current_state(TASK_RUNNING); return rq; } @@ -590,12 +589,26 @@ list_add(&req->queue, insert_here); } -void inline blk_refill_freelist(request_queue_t *q, int rw) +static inline void blk_refill_freelist(request_queue_t *q, int rw) { - if (q->pending_free[rw]) { - list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]); - INIT_LIST_HEAD(&q->pending_freelist[rw]); - q->pending_free[rw] = 0; + list_splice(&q->pending_freelist[rw], &q->request_freelist[rw]); + INIT_LIST_HEAD(&q->pending_freelist[rw]); + q->pending_free[rw] = 0; +} + +static inline void blk_finished_io(request_queue_t * q, struct request * req) +{ + int nsects = req->in_flight_sectors; + + req->in_flight_sectors = 0; + q->in_flight_sectors -= nsects; + if (q->in_flight_sectors < 0) { + printk("blkdev: in_flight_sectors < 0\n"); + q->in_flight_sectors = 0; + } + if (waitqueue_active(&q->in_flight_wait) && + q->in_flight_sectors < max_in_flight_sectors) { + wake_up(&(q)->in_flight_wait); } } @@ -612,26 +625,27 @@ /* * 
Request may not have originated from ll_rw_blk. if not, - * asumme it has free buffers and check waiters + * assumme it has free buffers and check waiters */ if (q) { - /* - * we've released enough buffers to start I/O again - */ - if (waitqueue_active(&blk_buffers_wait) - && atomic_read(&queued_sectors) < low_queued_sectors) - wake_up(&blk_buffers_wait); - - /* - * Add to pending free list and batch wakeups - */ - list_add(&req->table, &q->pending_freelist[rw]); - - if (++q->pending_free[rw] >= batch_requests) { - int wake_up = q->pending_free[rw]; - blk_refill_freelist(q, rw); - wake_up_nr(&q->wait_for_request, wake_up); + if (!list_empty(&q->request_freelist[rw])) { + if (q->pending_free[rw]) + BUG(); + list_add(&req->table, &q->request_freelist[rw]); + wake_up(&q->wait_for_request[rw]); + } else { + /* + * Add to pending free list and batch wakeups + */ + list_add(&req->table, &q->pending_freelist[rw]); + if (++q->pending_free[rw] >= batch_requests) { + int wake_up = q->pending_free[rw]; + blk_refill_freelist(q, rw); + wake_up_nr(&q->wait_for_request[rw], wake_up); + } } + + blk_finished_io(q, req); } } @@ -694,6 +708,12 @@ attempt_merge(q, blkdev_entry_to_request(prev), max_sectors, max_segments); } +#define blk_started_io(q, req, nsects) \ +do { \ + (q)->in_flight_sectors += (nsects); \ + (req)->in_flight_sectors += (nsects); \ +} while(0) + static int __make_request(request_queue_t * q, int rw, struct buffer_head * bh) { @@ -753,6 +773,27 @@ */ spin_lock_irq(&io_request_lock); + /* + * don't lock any more buffers if we are above the max + * water mark. instead start I/O on the queued stuff. 
+ */ + if (q->in_flight_sectors >= max_in_flight_sectors) { + DECLARE_WAITQUEUE(wait, current); + + add_wait_queue(&q->in_flight_wait, &wait); + for (;;) { + __set_current_state(TASK_UNINTERRUPTIBLE); + __generic_unplug_device(q); + spin_unlock_irq(&io_request_lock); + schedule(); + spin_lock_irq(&io_request_lock); + if (q->in_flight_sectors < max_in_flight_sectors) + break; + } + remove_wait_queue(&q->in_flight_wait, &wait); + __set_current_state(TASK_RUNNING); + } + insert_here = head->prev; if (list_empty(head)) { q->plug_device_fn(q, bh->b_rdev); /* is atomic */ @@ -771,7 +812,7 @@ req->bhtail->b_reqnext = bh; req->bhtail = bh; req->nr_sectors = req->hard_nr_sectors += count; - blk_started_io(count); + blk_started_io(q, req, count); drive_stat_acct(req->rq_dev, req->cmd, count, 0); attempt_back_merge(q, req, max_sectors, max_segments); goto out; @@ -786,7 +827,7 @@ req->current_nr_sectors = count; req->sector = req->hard_sector = sector; req->nr_sectors = req->hard_nr_sectors += count; - blk_started_io(count); + blk_started_io(q, req, count); drive_stat_acct(req->rq_dev, req->cmd, count, 0); attempt_front_merge(q, head, req, max_sectors, max_segments); goto out; @@ -841,7 +882,7 @@ req->bh = bh; req->bhtail = bh; req->rq_dev = bh->b_rdev; - blk_started_io(count); + blk_started_io(q, req, count); add_request(q, req, insert_here); out: if (freereq) @@ -1059,16 +1100,6 @@ for (i = 0; i < nr; i++) { struct buffer_head *bh = bhs[i]; - /* - * don't lock any more buffers if we are above the high - * water mark. instead start I/O on the queued stuff. - */ - if (atomic_read(&queued_sectors) >= high_queued_sectors) { - run_task_queue(&tq_disk); - wait_event(blk_buffers_wait, - atomic_read(&queued_sectors) < low_queued_sectors); - } - /* Only one thread can actually submit the I/O. 
*/ if (test_and_set_bit(BH_Lock, &bh->b_state)) continue; @@ -1143,7 +1174,6 @@ if ((bh = req->bh) != NULL) { nsect = bh->b_size >> 9; - blk_finished_io(nsect); req->bh = bh->b_reqnext; bh->b_reqnext = NULL; bh->b_end_io(bh, uptodate); @@ -1173,8 +1203,6 @@ blkdev_release_request(req); } -#define MB(kb) ((kb) << 10) - int __init blk_dev_init(void) { struct blk_dev_struct *dev; @@ -1194,30 +1222,19 @@ memset(max_readahead, 0, sizeof(max_readahead)); memset(max_sectors, 0, sizeof(max_sectors)); - atomic_set(&queued_sectors, 0); total_ram = nr_free_pages() << (PAGE_SHIFT - 10); - /* - * Try to keep 128MB max hysteris. If not possible, - * use half of RAM - */ - high_queued_sectors = (total_ram * 2) / 3; - low_queued_sectors = high_queued_sectors / 3; - if (high_queued_sectors - low_queued_sectors > MB(128)) - low_queued_sectors = high_queued_sectors - MB(128); - + max_in_flight_sectors = total_ram / 3; /* * make it sectors (512b) */ - high_queued_sectors <<= 1; - low_queued_sectors <<= 1; + max_in_flight_sectors <<= 1; /* * Scale free request slots per queue too */ - total_ram = (total_ram + MB(32) - 1) & ~(MB(32) - 1); - if ((queue_nr_requests = total_ram >> 9) > QUEUE_NR_REQUESTS) + if ((queue_nr_requests = total_ram >> 5) > QUEUE_NR_REQUESTS) queue_nr_requests = QUEUE_NR_REQUESTS; /* @@ -1225,11 +1242,12 @@ */ if ((batch_requests = queue_nr_requests >> 3) > 32) batch_requests = 32; + if (!batch_requests) + batch_requests = 1; - printk("block: queued sectors max/low %dkB/%dkB, %d slots per queue\n", - high_queued_sectors / 2, - low_queued_sectors / 2, - queue_nr_requests); + printk("block: queued sectors max %dkB, %d slots per queue, %d batch\n", + max_in_flight_sectors / 2, + queue_nr_requests, batch_requests); #ifdef CONFIG_AMIGA_Z2RAM z2_init(); @@ -1350,4 +1368,3 @@ EXPORT_SYMBOL(generic_make_request); EXPORT_SYMBOL(blkdev_release_request); EXPORT_SYMBOL(generic_unplug_device); -EXPORT_SYMBOL(queued_sectors); diff -urN 2.4.2pre3/drivers/scsi/scsi_lib.c 
x/drivers/scsi/scsi_lib.c --- 2.4.2pre3/drivers/scsi/scsi_lib.c Fri Feb 9 18:35:34 2001 +++ x/drivers/scsi/scsi_lib.c Fri Feb 9 19:00:59 2001 @@ -375,7 +375,6 @@ do { if ((bh = req->bh) != NULL) { nsect = bh->b_size >> 9; - blk_finished_io(nsect); req->bh = bh->b_reqnext; req->nr_sectors -= nsect; req->sector += nsect; diff -urN 2.4.2pre3/fs/buffer.c x/fs/buffer.c --- 2.4.2pre3/fs/buffer.c Fri Feb 9 18:35:44 2001 +++ x/fs/buffer.c Fri Feb 9 19:00:59 2001 @@ -628,7 +628,7 @@ to do in order to release the ramdisk memory is to destroy dirty buffers. These are two special cases. Normal usage imply the device driver - to issue a sync on the device (without waiting I/O completation) and + to issue a sync on the device (without waiting I/O completion) and then an invalidate_buffers call that doesn't trash dirty buffers. */ void __invalidate_buffers(kdev_t dev, int destroy_dirty_buffers) { @@ -1027,12 +1027,13 @@ write_unlock(&hash_table_lock); spin_unlock(&lru_list_lock); refill_freelist(size); + /* FIXME: getblk should fail if there's no enough memory */ goto repeat; } /* -1 -> no need to flush 0 -> async flush - 1 -> sync flush (wait for I/O completation) */ + 1 -> sync flush (wait for I/O completion) */ int balance_dirty_state(kdev_t dev) { unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit; @@ -1431,6 +1432,7 @@ { struct buffer_head *bh, *head, *tail; + /* FIXME: create_buffers should fail if there's no enough memory */ head = create_buffers(page, blocksize, 1); if (page->buffers) BUG(); @@ -2367,11 +2369,9 @@ spin_lock(&free_list[index].lock); tmp = bh; do { - struct buffer_head *p = tmp; - - tmp = tmp->b_this_page; - if (buffer_busy(p)) + if (buffer_busy(tmp)) goto busy_buffer_page; + tmp = tmp->b_this_page; } while (tmp != bh); spin_lock(&unused_list_lock); diff -urN 2.4.2pre3/include/linux/blkdev.h x/include/linux/blkdev.h --- 2.4.2pre3/include/linux/blkdev.h Fri Feb 9 18:34:13 2001 +++ x/include/linux/blkdev.h Fri Feb 9 21:28:15 2001 @@ -35,7 +35,7 @@ 
int errors; unsigned long sector; unsigned long nr_sectors; - unsigned long hard_sector, hard_nr_sectors; + unsigned long hard_sector, hard_nr_sectors, in_flight_sectors; unsigned int nr_segments; unsigned int nr_hw_segments; unsigned long current_nr_sectors; @@ -77,6 +77,7 @@ struct list_head request_freelist[2]; struct list_head pending_freelist[2]; int pending_free[2]; + int in_flight_sectors; /* * Together with queue_head for cacheline sharing @@ -112,16 +113,20 @@ */ char head_active; + wait_queue_head_t in_flight_wait; + /* - * Is meant to protect the queue in the future instead of - * io_request_lock + * Tasks wait here for free request */ - spinlock_t queue_lock; + wait_queue_head_t wait_for_request[2]; +#if 0 /* - * Tasks wait here for free request + * Is meant to protect the queue in the future instead of + * io_request_lock */ - wait_queue_head_t wait_for_request; + spinlock_t queue_lock; +#endif }; struct blk_dev_struct { @@ -177,7 +182,7 @@ extern int * max_segments[MAX_BLKDEV]; -extern atomic_t queued_sectors; +extern int max_in_flight_sectors; #define MAX_SEGMENTS 128 #define MAX_SECTORS (MAX_SEGMENTS*8) @@ -205,15 +210,5 @@ else return 512; } - -#define blk_finished_io(nsects) \ - atomic_sub(nsects, &queued_sectors); \ - if (atomic_read(&queued_sectors) < 0) { \ - printk("block: queued_sectors < 0\n"); \ - atomic_set(&queued_sectors, 0); \ - } - -#define blk_started_io(nsects) \ - atomic_add(nsects, &queued_sectors); #endif Andrea ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ll_rw_block/submit_bh and request limits
From: Marcelo Tosatti @ 2001-02-22 20:40 UTC (permalink / raw)
To: Andrea Arcangeli; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, 22 Feb 2001, Andrea Arcangeli wrote:
> On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote:
> > I'd prefer for this check to be a per-queue one.
>
> I'm running this in my tree since a few weeks, however I never had the courage
> to post it publically because I didn't benchmarked it carefully yet and I
> prefer to finish another thing first.

You want to throttle I/O if the amount of in-flight data is higher than
a given percentage of _main memory_.

As far as I can see, your patch prevents each individual queue from
growing bigger than the high watermark (which is a percentage of main
memory). However, it does not prevent multiple queues together from
exceeding the high watermark.
* Re: ll_rw_block/submit_bh and request limits
From: Andrea Arcangeli @ 2001-02-22 22:57 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, Feb 22, 2001 at 06:40:48PM -0200, Marcelo Tosatti wrote:
> You want to throttle IO if the amount of on flight data is higher than
> a given percentage of _main memory_.
>
> As far as I can see, your patch avoids each individual queue from being
> bigger than the high watermark (which is a percentage of main
> memory). However, you do not avoid multiple queues together from being
> bigger than the high watermark.

I of course see what you mean, and I considered it, but I tend to believe
that's a minor issue and that most machines will be happier without the
global unplug, even without the global limit.

The only reason we added the limit on I/O in flight is to be allowed to
have a huge number of requests, so we can do very large reordering and
merges in the elevator with seeking I/O (4k I/O requests), _but_ still
not have to wait to lock gigabytes of pages in RAM before starting the
I/O when the I/O was contiguous. We absolutely need such a sanity limit;
it would be absolutely insane to wait for kupdate to submit 10G of RAM
to a single hard disk before unplugging on a 30G machine. It doesn't
need to be exactly "if we don't unplug exactly after 1/3 of the global
RAM is locked, then performance sucks or the machine crashes or tasks
get killed". As Jens noticed, sync_page_buffers will unplug the queue at
some point if we're low on RAM anyway. The limit just says "unplug after
a reasonable limit; past that point it doesn't matter anymore to try to
delay requests for this hard disk, no matter if there are still I/O
requests available".

However, if you have hundreds of different queues doing I/O at the same
time it may make a difference, but with tons of hard disks you'll
probably also have tons of RAM... In theory we could put a global limit
on top of the per-queue one. Or we could at least put an upper bound on
the total_ram / 3.

Note that 2.4.0 doesn't enforce a global limit on requests in flight
either (while in 2.2 the limit is global, as it has a shared pool of I/O
requests).

Andrea
* Re: ll_rw_block/submit_bh and request limits 2001-02-22 22:57 ` Andrea Arcangeli @ 2001-02-22 21:44 ` Marcelo Tosatti 2001-02-22 23:54 ` Andrea Arcangeli 2001-02-22 23:12 ` Andrea Arcangeli 1 sibling, 1 reply; 12+ messages in thread From: Marcelo Tosatti @ 2001-02-22 21:44 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Jens Axboe, lkml On Thu, 22 Feb 2001, Andrea Arcangeli wrote: <snip> > However if you have houndred of different queues doing I/O at the same > time it may make a difference, but probably with tons of harddisks > you'll also have tons of ram... In theory we could put a global limit > on top of the the per-queue one. Or we could at least upper bound the > total_ram / 3. The global limit on top of the per-queue limit sounds good. Since you're talking about the "total_ram / 3" hardcoded value... it should be /proc tunable IMO. (Andi Kleen already suggested this) ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ll_rw_block/submit_bh and request limits 2001-02-22 21:44 ` Marcelo Tosatti @ 2001-02-22 23:54 ` Andrea Arcangeli 0 siblings, 0 replies; 12+ messages in thread From: Andrea Arcangeli @ 2001-02-22 23:54 UTC (permalink / raw) To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml On Thu, Feb 22, 2001 at 07:44:11PM -0200, Marcelo Tosatti wrote: > The global limit on top of the per-queue limit sounds good. Probably. > Since you're talking about the "total_ram / 3" hardcoded value... it > should be /proc tunable IMO. (Andi Kleen already suggested this) Yes, IIRC Andi also proposed that a few weeks ago. Andrea ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ll_rw_block/submit_bh and request limits
From: Andrea Arcangeli @ 2001-02-22 23:12 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Linus Torvalds, Jens Axboe, lkml

On Thu, Feb 22, 2001 at 11:57:00PM +0100, Andrea Arcangeli wrote:
> unsane to wait kupdate to submit 10G of ram to a single harddisk before
> unplugging on a 30G machine.

Actually, kupdate will itself unplug the queue, but in theory the queue
can still grow up to such a level after the I/O has started. I think we'd
better add a high limit on the in-flight I/O watermark.

Andrea
* Re: ll_rw_block/submit_bh and request limits 2001-02-22 21:38 ` Andrea Arcangeli 2001-02-22 20:40 ` Marcelo Tosatti @ 2001-02-25 17:34 ` Jens Axboe 2001-02-25 19:26 ` Andrea Arcangeli 1 sibling, 1 reply; 12+ messages in thread From: Jens Axboe @ 2001-02-25 17:34 UTC (permalink / raw) To: Andrea Arcangeli; +Cc: Linus Torvalds, Marcelo Tosatti, lkml On Thu, Feb 22 2001, Andrea Arcangeli wrote: > On Thu, Feb 22, 2001 at 10:59:20AM -0800, Linus Torvalds wrote: > > I'd prefer for this check to be a per-queue one. > > I'm running this in my tree since a few weeks, however I never had the courage > to post it publically because I didn't benchmarked it carefully yet and I > prefer to finish another thing first. This is actually based on the code I had > in my blkdev tree after I merged last time with Jens the 512K I/O requests and > elevator fixes. I think it won't generate bad numbers and it was running fine > on a 32way SMP (though I didn't stressed the I/O subsystem much there) but > please don't include until somebody benchmarks it carefully with dbench and > tiotest. (it still applys cleanly against 2.4.2) Thinking about this a bit, I have to agree with you and Linus. It is possible to find pathetic cases where the per-queue limit suffers compared to the global one, but in reality I don't think it's worth it. And the per-queue limits saves us the atomic updates since it's done under the io_request_lock (or queue later, still fine) so that's a win too. I have had rw wait queues before, was removed when I did the request stealing which is now gone again. I'm not even sure it's worth it now, Marcelo and I discussed it last week and I did some tests that showed nothing remarkable. But it's mainly for free, so we might as well do it. Any reason why you don't have a lower wake-up limit for the queue? Do you mind if I do some testing with this patch and fold it in, possibly? -- Jens Axboe ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: ll_rw_block/submit_bh and request limits 2001-02-25 17:34 ` Jens Axboe @ 2001-02-25 19:26 ` Andrea Arcangeli 0 siblings, 0 replies; 12+ messages in thread From: Andrea Arcangeli @ 2001-02-25 19:26 UTC (permalink / raw) To: Jens Axboe; +Cc: Linus Torvalds, Marcelo Tosatti, lkml On Sun, Feb 25, 2001 at 06:34:01PM +0100, Jens Axboe wrote: > Any reason why you don't have a lower wake-up limit for the queue? The watermark diff looked too high (it's 128M in current Linus's tree), but it's probably a good idea to resurrect it with a max difference of a few full sized requests (1/2mbytes). > Do you mind if I do some testing with this patch and fold it in, > possibly? Go ahead, thanks, Andrea ^ permalink raw reply [flat|nested] 12+ messages in thread