From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751355Ab1IUVkN (ORCPT ); Wed, 21 Sep 2011 17:40:13 -0400 Received: from smtp-out.google.com ([74.125.121.67]:9593 "EHLO smtp-out.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750777Ab1IUVkK (ORCPT ); Wed, 21 Sep 2011 17:40:10 -0400 DomainKey-Signature: a=rsa-sha1; s=beta; d=google.com; c=nofws; q=dns; h=date:from:to:cc:subject:message-id:in-reply-to:references: x-mailer:mime-version:content-type: content-transfer-encoding:x-system-of-record; b=wOMilQwmL/tIc4Dr4Jna1UMqbvMm3Wp14fgC6zwOeM3BcRYQp4EwCMgunRXZH4WBG CWysj+geOQwhqdVKnDQew== Date: Wed, 21 Sep 2011 14:39:58 -0700 From: Andrew Morton To: Jeff Moyer Cc: linux-kernel@vger.kernel.org, linux-aio@kvack.org Subject: Re: [patch, v2] aio: allocate kiocbs in batches Message-Id: <20110921143958.df7105db.akpm@google.com> In-Reply-To: References: X-Mailer: Sylpheed 3.0.2 (GTK+ 2.20.1; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-System-Of-Record: true Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 21 Sep 2011 13:16:00 -0400 Jeff Moyer wrote: > Hi, > > In testing aio on a fast storage device, I found that the context lock > takes up a fair amount of cpu time in the I/O submission path. The > reason is that we take it for every I/O submitted (see __aio_get_req). > Since we know how many I/Os are passed to io_submit, we can preallocate > the kiocbs in batches, reducing the number of times we take and release > the lock. In my testing, I was able to reduce the amount of time spent > in _raw_spin_lock_irq by .56% (average of 3 runs). The command I used > to test this was: > aio-stress -O -o 2 -o 3 -r 8 -d 128 -b 32 -i 32 -s 16384 > > I also tested the patch with various numbers of events passed to > io_submit, and I ran the xfstests aio group of tests to ensure I didn't > break anything. > > ... > > +/* > + * struct kiocb's are allocated in batches to reduce the number of > + * times the ctx lock is acquired and released. > + */ > +#define KIOCB_BATCH_SIZE 32 > +struct kiocb_batch { > + struct list_head head; > + long total; /* number of requests passed to sys_io_submit */ > + long allocated; /* number of requests allocated so far */ > +}; I don't see a reason why `total' and `allocated' need to be 64-bit. Making them 32-bit results in smaller code, smaller storage, smaller d-cache footprint, etc. Also, they should logically be unsigned types. > > ... > > +static int kiocb_batch_refill(struct kioctx *ctx, struct kiocb_batch *batch) > +{ > + int i; > + int to_alloc, avail; > + bool called_fput = false; > + struct kiocb *req, *n; > + struct aio_ring *ring; > + > + to_alloc = min(batch->total - batch->allocated, KIOCB_BATCH_SIZE); And this generates a compile-time warning due to the long/int mismatch. Did your compiler not warn here? (And why did `to_alloc' and `i' get to be `int'? The type choices are chaotic in there!) I'd suggest going with "unsigned" for `total' and `allocated', and make KIOCB_BATCH_SIZE 32U. Then have a think about the appropriate types for the derived locals such as `i', `to_alloc' and `avail'. > + for (i = 0; i < to_alloc; i++) { > + req = __aio_get_req(ctx); > + if (!req) > + /* allocation failed, go with what we've got */ > + break; > + list_add(&req->ki_batch, &batch->head); > + } > + > + if (i == 0) > + goto out; > + > +retry: > spin_lock_irq(&ctx->ctx_lock); > - ring = kmap_atomic(ctx->ring_info.ring_pages[0], KM_USER0); > - if (ctx->reqs_active < aio_ring_avail(&ctx->ring_info, ring)) { > + ring = kmap_atomic(ctx->ring_info.ring_pages[0]); > + > + avail = aio_ring_avail(&ctx->ring_info, ring) - ctx->reqs_active; > + BUG_ON(avail < 0); > + if (avail == 0 && !called_fput) { > + /* > + * Handle a potential starvation case. It is possible that > + * we hold the last reference on a struct file, causing us > + * to delay the final fput to non-irq context. In this case, > + * ctx->reqs_active is artificially high. Calling the fput > + * routine here may free up a slot in the event completion > + * ring, allowing this allocation to succeed. > + */ > + spin_unlock_irq(&ctx->ctx_lock); > + kunmap_atomic(ring); And there's a bug. We need to maintain the thread's atomic state across the kunmap_atomic(). This should have caused a might_sleep() runtime warning from kunmap_atomic()'s smp_processor_id() (at least). That's assuming you tested on a 32-bit highmem box and were able to exercise this codepath, neither of which seems likely ;) > + aio_fput_routine(NULL); > + called_fput = true; > + goto retry; > + } > + > + if (avail < i) { > + /* Trim back the number of requests. */ > + list_for_each_entry_safe(req, n, &batch->head, ki_batch) { > + list_del(&req->ki_batch); > + kmem_cache_free(kiocb_cachep, req); > + if (--i <= avail) > + break; > + } > + } > + > + batch->allocated += i; > + list_for_each_entry(req, &batch->head, ki_batch) { > list_add(&req->ki_list, &ctx->active_reqs); > ctx->reqs_active++; > - okay = 1; > } > - kunmap_atomic(ring, KM_USER0); > - spin_unlock_irq(&ctx->ctx_lock); > > - if (!okay) { > - kmem_cache_free(kiocb_cachep, req); > - req = NULL; > - } > + kunmap_atomic(ring); > + spin_unlock_irq(&ctx->ctx_lock); Like that. > - return req; > +out: > + return i; > } > > ... > I wouldn't want to do the long->unsigned conversion without runtime testing it so can you please do a v3?