[patch, v2] aio: allocate kiocbs in batches

* [patch, v2] aio: allocate kiocbs in batches
@ 2011-09-21 17:16 Jeff Moyer
  2011-09-21 21:39 ` Andrew Morton
From: Jeff Moyer @ 2011-09-21 17:16 UTC
  To: linux-kernel, linux-aio, Andrew Morton

Hi,

In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path.  The
reason is that we take it for every I/O submitted (see __aio_get_req).
Since we know how many I/Os are passed to io_submit, we can preallocate
the kiocbs in batches, reducing the number of times we take and release
the lock.  In my testing, I was able to reduce the amount of time spent
in _raw_spin_lock_irq by .56% (average of 3 runs).  The command I used
to test this was:
   aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>

I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>

---
Changes from rfc -> v2:
- folded in akpm's incremental patch which fixes coding style and
  variable names
- tried to clarify a comment about a starvation case
- restored the handling of that starvation case, which I had broken
- moved from an on-stack array to a list at the suggestion of akpm


diff --git a/fs/aio.c b/fs/aio.c
index e29ec48..8229329 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -440,8 +440,6 @@ void exit_aio(struct mm_struct *mm)
 static struct kiocb *__aio_get_req(struct kioctx *ctx)
 {
 	struct kiocb *req = NULL;
-	struct aio_ring *ring;
-	int okay = 0;
 
 	req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
 	if (unlikely(!req))
@@ -459,39 +457,116 @@ static struct kiocb *__aio_get_req(struct kioctx *ctx)
 	INIT_LIST_HEAD(&req->ki_run_list);
 	req->ki_eventfd = NULL;
 
-	/* Check if the completion queue has enough free space to
-	 * accept an event from this io.
-	 */
+	return req;
+}
+
+/*
+ * struct kiocb's are allocated in batches to reduce the number of
+ * times the ctx lock is acquired and released.
+ */
+#define KIOCB_BATCH_SIZE	32
+struct kiocb_batch {
+	struct list_head head;
+	long total;	/* number of requests passed to sys_io_submit */
+	long allocated;	/* number of requests allocated so far */
+};
+
+static void kiocb_batch_init(struct kiocb_batch *batch, long total)
+{
+	INIT_LIST_HEAD(&batch->head);
+	batch->total = total;
+	batch->allocated = 0;
+}
+
+static void kiocb_batch_free(struct kiocb_batch *batch)
+{
+	struct kiocb *req, *n;
+
+	list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
+		list_del(&req->ki_batch);
+		kmem_cache_free(kiocb_cachep, req);
+	}
+}
+
+/*
+ * Allocate a batch of kiocbs.  This avoids taking and dropping the
+ * context lock a lot during setup.
+ */
+static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
+{
+	int i;
+	int to_alloc, avail;
+	bool called_fput = false;
+	struct kiocb *req, *n;
+	struct aio_ring *ring;
+
+	to_alloc = min(batch->total - batch->allocated, KIOCB_BATCH_SIZE);
+	for (i = 0; i < to_alloc; i++) {
+		req = __aio_get_req(ctx);
+		if (!req)
+			/* allocation failed, go with what we've got */
+			break;
+		list_add(&req->ki_batch, &batch->head);
+	}
+
+	if (i == 0)
+		goto out;
+
+retry:
 	spin_lock_irq(&ctx->ctx_lock);
-	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
-	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
+	ring = kmap_atomic(ctx->ring_info.ring_pages[0]);
+
+	avail = aio_ring_avail(&ctx->ring_info, ring) - ctx->reqs_active;
+	BUG_ON(avail < 0);
+	if (avail == 0 && !called_fput) {
+		/*
+		 * Handle a potential starvation case.  It is possible that
+		 * we hold the last reference on a struct file, causing us
+		 * to delay the final fput to non-irq context.  In this case,
+		 * ctx->reqs_active is artificially high.  Calling the fput
+		 * routine here may free up a slot in the event completion
+		 * ring, allowing this allocation to succeed.
+		 */
+		spin_unlock_irq(&ctx->ctx_lock);
+		kunmap_atomic(ring);
+		aio_fput_routine(NULL);
+		called_fput = true;
+		goto retry;
+	}
+
+	if (avail < i) {
+		/* Trim back the number of requests. */
+		list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
+			list_del(&req->ki_batch);
+			kmem_cache_free(kiocb_cachep, req);
+			if (--i <= avail)
+				break;
+		}
+	}
+
+	batch->allocated += i;
+	list_for_each_entry(req, &batch->head, ki_batch) {
 		list_add(&req->ki_list, &ctx->active_reqs);
 		ctx->reqs_active++;
-		okay = 1;
 	}
-	kunmap_atomic(ring, KM_USER0);
-	spin_unlock_irq(&ctx->ctx_lock);
 
-	if (!okay) {
-		kmem_cache_free(kiocb_cachep, req);
-		req = NULL;
-	}
+	kunmap_atomic(ring);
+	spin_unlock_irq(&ctx->ctx_lock);
 
-	return req;
+out:
+	return i;
 }
 
-static inline struct kiocb *aio_get_req(struct kioctx *ctx)
+static inline struct kiocb *aio_get_req(struct kioctx *ctx,
+					struct kiocb_batch *batch)
 {
 	struct kiocb *req;
-	/* Handle a potential starvation case -- should be exceedingly rare as 
-	 * requests will be stuck on fput_head only if the aio_fput_routine is 
-	 * delayed and the requests were the last user of the struct file.
-	 */
-	req = __aio_get_req(ctx);
-	if (unlikely(NULL == req)) {
-		aio_fput_routine(NULL);
-		req = __aio_get_req(ctx);
-	}
+
+	if (list_empty(&batch->head))
+		if (kiocb_batch_refill(ctx, batch) == 0)
+			return NULL;
+	req = list_first_entry(&batch->head, struct kiocb, ki_batch);
+	list_del(&req->ki_batch);
 	return req;
 }
 
@@ -1515,7 +1590,8 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb, bool compat)
 }
 
 static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
-			 struct iocb *iocb, bool compat)
+			 struct iocb *iocb, struct kiocb_batch *batch,
+			 bool compat)
 {
 	struct kiocb *req;
 	struct file *file;
@@ -1541,7 +1617,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
 	if (unlikely(!file))
 		return -EBADF;
 
-	req = aio_get_req(ctx);		/* returns with 2 references to req */
+	req = aio_get_req(ctx, batch);  /* returns with 2 references to req */
 	if (unlikely(!req)) {
 		fput(file);
 		return -EAGAIN;
@@ -1621,8 +1697,9 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 {
 	struct kioctx *ctx;
 	long ret = 0;
-	int i;
+	int i = 0;
 	struct blk_plug plug;
+	struct kiocb_batch batch;
 
 	if (unlikely(nr < 0))
 		return -EINVAL;
@@ -1639,6 +1716,8 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 		return -EINVAL;
 	}
 
+	kiocb_batch_init(&batch, nr);
+
 	blk_start_plug(&plug);
 
 	/*
@@ -1659,12 +1738,13 @@ long do_io_submit(aio_context_t ctx_id, long nr,
 			break;
 		}
 
-		ret = io_submit_one(ctx, user_iocb, &tmp, compat);
+		ret = io_submit_one(ctx, user_iocb, &tmp, &batch, compat);
 		if (ret)
 			break;
 	}
 	blk_finish_plug(&plug);
 
+	kiocb_batch_free(&batch);
 	put_ioctx(ctx);
 	return i ? i : ret;
 }
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 2dcb72b..2314ad8 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -117,6 +117,7 @@ struct kiocb {
 
 	struct list_head	ki_list;	/* the aio core uses this
 						 * for cancellation */
+	struct list_head	ki_batch;	/* batch allocation */
 
 	/*
 	 * If the aio_resfd field of the userspace iocb is not zero,

* Re: [patch, v2] aio: allocate kiocbs in batches
  2011-09-21 17:16 [patch, v2] aio: allocate kiocbs in batches Jeff Moyer
@ 2011-09-21 21:39 ` Andrew Morton
  2011-09-22 13:24   ` Jeff Moyer
From: Andrew Morton @ 2011-09-21 21:39 UTC
  To: Jeff Moyer; +Cc: linux-kernel, linux-aio

On Wed, 21 Sep 2011 13:16:00 -0400
Jeff Moyer <jmoyer@redhat.com> wrote:

> Hi,
> 
> In testing aio on a fast storage device, I found that the context lock
> takes up a fair amount of cpu time in the I/O submission path.  The
> reason is that we take it for every I/O submitted (see __aio_get_req).
> Since we know how many I/Os are passed to io_submit, we can preallocate
> the kiocbs in batches, reducing the number of times we take and release
> the lock.  In my testing, I was able to reduce the amount of time spent
> in _raw_spin_lock_irq by .56% (average of 3 runs).  The command I used
> to test this was:
>    aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 <dev>
> 
> I also tested the patch with various numbers of events passed to
> io_submit, and I ran the xfstests aio group of tests to ensure I didn't
> break anything.
>
> ...
>
> +/*
> + * struct kiocb's are allocated in batches to reduce the number of
> + * times the ctx lock is acquired and released.
> + */
> +#define KIOCB_BATCH_SIZE	32
> +struct kiocb_batch {
> +	struct list_head head;
> +	long total;	/* number of requests passed to sys_io_submit */
> +	long allocated;	/* number of requests allocated so far */
> +};

I don't see a reason why `total' and `allocated' need to be 64-bit. 
Making them 32-bit results in smaller code, smaller storage, smaller
d-cache footprint, etc.

Also, they should logically be unsigned types.

>
> ...
>
> +static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
> +{
> +	int i;
> +	int to_alloc, avail;
> +	bool called_fput = false;
> +	struct kiocb *req, *n;
> +	struct aio_ring *ring;
> +
> +	to_alloc = min(batch->total - batch->allocated, KIOCB_BATCH_SIZE);

And this generates a compile-time warning due to the long/int mismatch.
Did your compiler not warn here?  (And why did `to_alloc' and `i' get
to be `int'?  The type choices are chaotic in there!)

I'd suggest going with "unsigned" for `total' and `allocated', and make
KIOCB_BATCH_SIZE 32U.  Then have a think about the appropriate types
for the derived locals such as `i', `to_alloc' and `avail'.
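
A minimal sketch of the suggested typing, assuming the counters become
plain unsigned and the batch size becomes 32U. The warning above comes
from handing min() a long and an int; once every operand is unsigned
int the mismatch is gone. The struct batch_counts, min_uint() and
batch_to_alloc() names are hypothetical, and a plain typed helper
stands in for the kernel's min() macro so the sketch builds in
userspace.

```c
#include <assert.h>

#define KIOCB_BATCH_SIZE 32U	/* unsigned, as suggested in the review */

/* Hypothetical reduced form of the batch bookkeeping. */
struct batch_counts {
	unsigned int total;	/* requests passed to io_submit */
	unsigned int allocated;	/* requests allocated so far */
};

/* Both operands share one type, so no implicit conversion takes place. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Number of requests the next refill should try to allocate. */
static unsigned int batch_to_alloc(const struct batch_counts *b)
{
	/* unsigned subtraction would wrap if this invariant were broken */
	assert(b->allocated <= b->total);
	return min_uint(b->total - b->allocated, KIOCB_BATCH_SIZE);
}
```

The assert matters more with unsigned counters than with signed ones:
an underflow wraps to a huge value instead of going negative, which is
one reason to runtime test such a conversion.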

> +	for (i = 0; i < to_alloc; i++) {
> +		req = __aio_get_req(ctx);
> +		if (!req)
> +			/* allocation failed, go with what we've got */
> +			break;
> +		list_add(&req->ki_batch, &batch->head);
> +	}
> +
> +	if (i == 0)
> +		goto out;
> +
> +retry:
>  	spin_lock_irq(&ctx->ctx_lock);
> -	ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0);
> -	if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) {
> +	ring = kmap_atomic(ctx->ring_info.ring_pages[0]);
> +
> +	avail = aio_ring_avail(&ctx->ring_info, ring) - ctx->reqs_active;
> +	BUG_ON(avail < 0);
> +	if (avail == 0 && !called_fput) {
> +		/*
> +		 * Handle a potential starvation case.  It is possible that
> +		 * we hold the last reference on a struct file, causing us
> +		 * to delay the final fput to non-irq context.  In this case,
> +		 * ctx->reqs_active is artificially high.  Calling the fput
> +		 * routine here may free up a slot in the event completion
> +		 * ring, allowing this allocation to succeed.
> +		 */
> +		spin_unlock_irq(&ctx->ctx_lock);
> +		kunmap_atomic(ring);

And there's a bug.  We need to maintain the thread's atomic state
across the kunmap_atomic().  This should have caused a might_sleep()
runtime warning from kunmap_atomic()'s smp_processor_id() (at least). 
That's assuming you tested on a 32-bit highmem box and were able to
exercise this codepath, neither of which seems likely ;)
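
A simplified userspace model of the ordering problem, assuming the two
primitives behave like nested sections that must be left in reverse
order of entry. It does not reproduce the kernel's preempt or pagefault
accounting; atomic_model, model_enter() and model_leave() are
hypothetical names. The v2 starvation path drops the spinlock before
unmapping, while the tail of the function unmaps first and unlocks
second.

```c
#include <assert.h>

/* Hypothetical tags for the two nested sections. */
enum { LVL_SPINLOCK = 1, LVL_KMAP = 2 };

struct atomic_model {
	int stack[4];
	int depth;
};

static void model_enter(struct atomic_model *m, int level)
{
	assert(m->depth < 4);
	m->stack[m->depth++] = level;
}

/* Returns 1 if the release matches the innermost section, 0 otherwise. */
static int model_leave(struct atomic_model *m, int level)
{
	if (m->depth == 0 || m->stack[m->depth - 1] != level)
		return 0;
	m->depth--;
	return 1;
}

/* Order used on the v2 starvation path: unlock, then unmap. */
static int starvation_path_v2(void)
{
	struct atomic_model m = { {0}, 0 };

	model_enter(&m, LVL_SPINLOCK);	/* spin_lock_irq() */
	model_enter(&m, LVL_KMAP);	/* kmap_atomic() */
	return model_leave(&m, LVL_SPINLOCK) && model_leave(&m, LVL_KMAP);
}

/* Reverse order of entry: unmap while the lock is still held. */
static int starvation_path_fixed(void)
{
	struct atomic_model m = { {0}, 0 };

	model_enter(&m, LVL_SPINLOCK);	/* spin_lock_irq() */
	model_enter(&m, LVL_KMAP);	/* kmap_atomic() */
	return model_leave(&m, LVL_KMAP) && model_leave(&m, LVL_SPINLOCK);
}
```

In this model starvation_path_v2() reports an out-of-order release and
starvation_path_fixed() does not.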

> +		aio_fput_routine(NULL);
> +		called_fput = true;
> +		goto retry;
> +	}
> +
> +	if (avail < i) {
> +		/* Trim back the number of requests. */
> +		list_for_each_entry_safe(req, n, &batch->head, ki_batch) {
> +			list_del(&req->ki_batch);
> +			kmem_cache_free(kiocb_cachep, req);
> +			if (--i <= avail)
> +				break;
> +		}
> +	}
> +
> +	batch->allocated += i;
> +	list_for_each_entry(req, &batch->head, ki_batch) {
>  		list_add(&req->ki_list, &ctx->active_reqs);
>  		ctx->reqs_active++;
> -		okay = 1;
>  	}
> -	kunmap_atomic(ring, KM_USER0);
> -	spin_unlock_irq(&ctx->ctx_lock);
>  
> -	if (!okay) {
> -		kmem_cache_free(kiocb_cachep, req);
> -		req = NULL;
> -	}
> +	kunmap_atomic(ring);
> +	spin_unlock_irq(&ctx->ctx_lock);

Like that.

> -	return req;
> +out:
> +	return i;
>  }
>
> ...
>

I wouldn't want to do the long->unsigned conversion without runtime
testing it so can you please do a v3?


* Re: [patch, v2] aio: allocate kiocbs in batches
  2011-09-21 21:39 ` Andrew Morton
@ 2011-09-22 13:24   ` Jeff Moyer
From: Jeff Moyer @ 2011-09-22 13:24 UTC
  To: Andrew Morton; +Cc: linux-kernel, linux-aio

Andrew Morton <akpm@google.com> writes:

> On Wed, 21 Sep 2011 13:16:00 -0400
> Jeff Moyer <jmoyer@redhat.com> wrote:
>
>>
>> +/*
>> + * struct kiocb's are allocated in batches to reduce the number of
>> + * times the ctx lock is acquired and released.
>> + */
>> +#define KIOCB_BATCH_SIZE	32
>> +struct kiocb_batch {
>> +	struct list_head head;
>> +	long total;	/* number of requests passed to sys_io_submit */
>> +	long allocated;	/* number of requests allocated so far */
>> +};
>
> I don't see a reason why `total' and `allocated' need to be 64-bit. 
> Making them 32-bit results in smaller code, smaller storage, smaller
> d-cache footprint, etc.
>
> Also, they should logically be unsigned types.

The number of iocbs passed into sys_io_submit is of type long, and so
the total and the number allocated need to be of the same size.  I
considered unsigned, but seeing as the value would be capped at long, I
didn't see a real compelling reason to switch to unsigned.

Now, I suppose I could do with a single variable there, and just
decrement it as kiocbs are allocated.
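
A minimal userspace sketch of that single-counter idea, assuming one
signed long that starts at the number of requests and is decremented
on each refill. The names kiocb_batch_sketch, batch_init() and
batch_refill() are hypothetical, and the sketch ignores the cases where
a refill hands out fewer requests than asked for (allocation failure or
a full completion ring).

```c
#include <assert.h>

#define KIOCB_BATCH_SIZE 32L	/* long, to match the counter's type */

/* Hypothetical reduced batch state: one counter instead of two. */
struct kiocb_batch_sketch {
	long count;	/* requests still to be allocated */
};

static void batch_init(struct kiocb_batch_sketch *b, long total)
{
	assert(total >= 0);
	b->count = total;
}

/* Returns how many requests this refill covers and consumes them. */
static long batch_refill(struct kiocb_batch_sketch *b)
{
	long to_alloc = b->count < KIOCB_BATCH_SIZE ?
			b->count : KIOCB_BATCH_SIZE;

	b->count -= to_alloc;
	return to_alloc;
}
```

Starting from 70 requests, successive refills cover 32, 32 and 6, after
which the counter is zero and further refills return 0.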

>>
>> +static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch)
>> +{
>> +	int i;
>> +	int to_alloc, avail;
>> +	bool called_fput = false;
>> +	struct kiocb *req, *n;
>> +	struct aio_ring *ring;
>> +
>> +	to_alloc = min(batch->total - batch->allocated, KIOCB_BATCH_SIZE);
>
> And this generates a compile-time warning due to the long/int mismatch.
> Did your compiler not warn here?  (And why did `to_alloc' and `i' get
> to be `int'?  The type choices are chaotic in there!)

Oops, I missed the warning.  to_alloc and i won't be very big, since
they are capped at KIOCB_BATCH_SIZE.  I could use an unsigned short.

> I'd suggest going with "unsigned" for `total' and `allocated', and make
> KIOCB_BATCH_SIZE 32U.  Then have a think about the appropriate types
> for the derived locals such as `i', `to_alloc' and `avail'.

As mentioned above, I'd like to stick with signed types, and I'll use
just a single long for the number of kiocbs left to allocate.

>> +		spin_unlock_irq(&ctx->ctx_lock);
>> +		kunmap_atomic(ring);
>
> And there's a bug.  We need to maintain the thread's atomic state
> across the kunmap_atomic().  This should have caused a might_sleep()
> runtime warning from kunmap_atomic()'s smp_processor_id() (at least). 
> That's assuming you tested on a 32-bit highmem box and were able to
> exercise this codepath, neither of which seems likely ;)

I'll fix it.  You are right, I didn't test 32-bit highmem....

>> ...
>>
>
> I wouldn't want to do the long->unsigned conversion without runtime
> testing it so can you please do a v3?

No problem.

Cheers,
Jeff
