From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753642Ab0ICXg0 (ORCPT );
	Fri, 3 Sep 2010 19:36:26 -0400
Received: from e9.ny.us.ibm.com ([32.97.182.139]:59392 "EHLO e9.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753078Ab0ICXgY (ORCPT );
	Fri, 3 Sep 2010 19:36:24 -0400
Date: Fri, 3 Sep 2010 16:36:21 -0700
From: "Paul E. McKenney"
To: Vivek Goyal
Cc: linux kernel mailing list , Jens Axboe ,
	Nauman Rafique , Gui Jianfeng ,
	Divyesh Shah , Heinz Mauelshagen ,
	arighi@develer.com
Subject: Re: [RFC PATCH] Bio Throttling support for block IO controller
Message-ID: <20100903233621.GE2413@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20100901175830.GC22149@redhat.com>
	<20100902183932.GF2349@linux.vnet.ibm.com>
	<20100903015739.GB9450@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20100903015739.GB9450@redhat.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 02, 2010 at 09:57:39PM -0400, Vivek Goyal wrote:
> On Thu, Sep 02, 2010 at 11:39:32AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 01, 2010 at 01:58:30PM -0400, Vivek Goyal wrote:
> > > Hi,
> > >
> > > Currently CFQ provides weight-based proportional division of bandwidth.
> > > People have also been looking at extending the block IO controller to
> > > provide throttling/max bandwidth control.
> > >
> > > I have started to write the support for throttling in the block layer on
> > > the request queue so that it can be used both for higher-level logical
> > > devices as well as leaf nodes. This patch is still work in progress but
> > > I wanted to post it for early feedback.
> > >
> > > Basically, currently I have hooked into the __make_request() function to
> > > check which cgroup a bio belongs to and whether it is exceeding the
> > > specified BW rate. If not, the thread can continue to dispatch the bio
> > > as is; otherwise the bio is queued internally and dispatched later with
> > > the help of a worker thread.
> > >
> > > HOWTO
> > > =====
> > > - Mount blkio controller
> > >	mount -t cgroup -o blkio none /cgroup/blkio
> > >
> > > - Specify a bandwidth rate on a particular device for the root group. The
> > >   format for the policy is "<major>:<minor> <bytes_per_second>".
> > >
> > >	echo "8:16 1048576" > /cgroup/blkio/blkio.read_bps_device
> > >
> > >   The above will put a limit of 1MB/second on reads happening for the root
> > >   group on the device having major/minor number 8:16.
> > >
> > > - Run dd to read a file and see if the rate is throttled to 1MB/s or not.
> > >
> > >	# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 iflag=direct
> > >	1024+0 records in
> > >	1024+0 records out
> > >	4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
> > >
> > >   Limits for writes can be put using the blkio.write_bps_device file.
> > >
> > > Open Issues
> > > ===========
> > > - Do we need to provide additional queue congestion semantics? As we are
> > >   throttling and queuing bios at the request queue, we probably don't want
> > >   a user space application to consume all the memory allocating bios
> > >   and bombarding the request queue with those bios.
> > >
> > > - How to handle the current blkio cgroup stats file and two policies
> > >   in the background. If for some reason both throttling and proportional
> > >   BW policies are operating on a request queue, then stats will be very
> > >   confusing.
> > >
> > >   Maybe we can allow activating either the throttling or the proportional
> > >   BW policy per request queue and create a /sys tunable to list and
> > >   choose between policies (something like choosing the IO scheduler).
> > >   The only downside of this approach is that the user also needs to be
> > >   aware of the storage hierarchy and activate the right policy at each
> > >   node/request queue.
> > >
> > > TODO
> > > ====
> > > - Lots of testing, bug fixes.
> > > - Provide support for enforcing limits in IOPS.
> > > - Extend the throttling support for dm devices also.
> > >
> > > Any feedback is welcome.
> > >
> > > Thanks
> > > Vivek
> > >
> > > o IO throttling support in block layer.
> > >
> > > Signed-off-by: Vivek Goyal
> > > ---
> > >  block/Makefile            |    2
> > >  block/blk-cgroup.c        |  282 +++++++++++--
> > >  block/blk-cgroup.h        |   44 ++
> > >  block/blk-core.c          |   28 +
> > >  block/blk-throttle.c      |  928 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  block/blk.h               |    4
> > >  block/cfq-iosched.c       |    4
> > >  include/linux/blk_types.h |    3
> > >  include/linux/blkdev.h    |   22 +
> > >  9 files changed, 1261 insertions(+), 56 deletions(-)
> >
> > [ . . . ]
> >
> > > +void blk_throtl_exit(struct request_queue *q)
> > > +{
> > > +	struct throtl_data *td = q->td;
> > > +	bool wait = false;
> > > +
> > > +	BUG_ON(!td);
> > > +
> > > +	throtl_shutdown_timer_wq(q);
> > > +
> > > +	spin_lock_irq(q->queue_lock);
> > > +	throtl_release_tgs(td);
> > > +	blkiocg_del_blkio_group(&td->root_tg.blkg);
> > > +
> > > +	/* If there are other groups */
> > > +	if (td->nr_undestroyed_grps >= 1)
> > > +		wait = true;
> > > +
> > > +	spin_unlock_irq(q->queue_lock);
> > > +
> > > +	/*
> > > +	 * Wait for tg->blkg->key accessors to exit their grace periods.
> > > +	 * Do this wait only if there are other undestroyed groups out
> > > +	 * there (other than root group). This can happen if cgroup deletion
> > > +	 * path claimed the responsibility of cleaning up a group before
> > > +	 * queue cleanup code gets to the group.
> > > +	 *
> > > +	 * Do not call synchronize_rcu() unconditionally as there are drivers
> > > +	 * which create/delete request queues hundreds of times during scan/boot
> > > +	 * and synchronize_rcu() can take significant time and slow down boot.
> > > +	 */
> > > +	if (wait)
> > > +		synchronize_rcu();
> >
> > The RCU readers are presumably not accessing the structure referenced
> > by td?  If they can access it, then they will be accessing freed memory
> > after the following function call!!!
>
> Hi Paul,
>
> Thanks for the review.
>
> As per my understanding, if wait == false, then there should not be any
> RCU readers of tg->blkg->key (the key is basically struct throtl_data *td)
> out there, hence it should be safe to free "td" without calling
> synchronize_rcu() or call_rcu().
>
> Following are some details.
>
> - We instantiate some throtl_grp structures as IO happens in a cgroup, and
>   these objects are put in a hash list (td->tg_list). These objects are
>   also put into a cgroup list (blkcg->blkg_list, blk-cgroup.c).
>
>   The root group is the only exception: it is not allocated dynamically but
>   is instead statically allocated as part of the throtl_data structure
>   (struct throtl_grp root_tg).
>
> - There are two group deletion paths. One is if a cgroup is being deleted,
>   then we need to clean up the associated group; the other is if the device
>   is going away, then we need to clean up all groups and td and the request
>   queue etc.
>
> - The only user of the RCU-protected tg->blkg->key is the cgroup deletion
>   path, and that path will be accessing this key only if it got the
>   ownership of a group it wants to delete. Basically, group deletion can
>   race between a cgroup deletion event and the device going away at the
>   same time.
>
>   In this case, both paths will want to clean up a group and some kind of
>   arbitration is needed. The path which is first able to take blkcg->lock
>   and delete the group from blkcg->blkg_list takes the responsibility of
>   cleaning up the group.
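[ Editor's note: the "first path to unlink the group claims cleanup"
  arbitration Vivek describes can be sketched in a simplified userspace
  model. All names below (mock_group, claim_group, destroy_group) are
  illustrative stand-ins, not the kernel's API; in the kernel the
  test-and-unlink happens atomically under blkcg->lock. ]

```c
#include <stdbool.h>

/* Simplified stand-in for a throtl_grp linked on blkcg->blkg_list. */
struct mock_group {
	bool on_list;   /* still linked on the cgroup's group list?    */
	int destroyed;  /* times cleanup ran; must never exceed one    */
};

/*
 * Try to claim ownership of the group by unlinking it from the list.
 * Only the first caller succeeds.  In the kernel this check-and-unlink
 * is done while holding blkcg->lock, which makes it atomic.
 */
static bool claim_group(struct mock_group *g)
{
	if (!g->on_list)
		return false;   /* the other path got here first */
	g->on_list = false;
	return true;
}

/*
 * Either deletion path (cgroup removal or queue exit) calls this;
 * only the path that wins the claim actually destroys the group.
 */
static void destroy_group(struct mock_group *g)
{
	if (claim_group(g))
		g->destroyed++;
}
```

Because both the cgroup-deletion path and the queue-cleanup path funnel
through the same claim step, the group is destroyed exactly once no
matter which path runs first.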
>
>   Now if there are no undestroyed groups (except the root group, which the
>   cgroup path will never try to destroy, as the root cgroup is not deleted),
>   that means the cgroup path will not try to free up any groups, which also
>   means that there will be no other RCU readers of tg->blkg->key, and hence
>   it should be safe to free up "td" without synchronize_rcu() or call_rcu().
>   Am I missing something?

If I understand you correctly, RCU is used only for part of the data
structure, and if you are not freeing up an RCU-traversed portion of the
data structure, then there is no need to wait for a grace period.

							Thanx, Paul

> > If they can access it, I suggest using call_rcu() instead of
> > synchronize_rcu().  One way of doing this would be:
> >
> >	if (!wait) {
> >		call_rcu(&td->rcu, throtl_td_deferred_free);
>
> If !wait, then as per my current understanding there are no RCU readers
> out there, and the above step should not be required. The reason I don't
> want to use call_rcu() is that though it will keep "td" around, the
> request queue will be gone (td->queue), and the RCU reader path takes the
> request queue spin lock, so readers would be trying to take a lock which
> has been freed:
>
>	throtl_unlink_blkio_group() {
>		spin_lock_irqsave(td->queue->queue_lock, flags);
>	}
>
> >	} else {
> >		synchronize_rcu();
> >		throtl_td_free(td);
> >	}
>
> This is the step my code is already doing. If wait == true, then there
> are RCU readers out there, and we wait for them to finish before freeing
> up td.
>
> Thanks
> Vivek
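[ Editor's note: to make the tradeoff under discussion concrete, here is
  a minimal userspace sketch of the exit-path decision: wait for a grace
  period only when non-root groups remain undestroyed (so RCU readers
  may still hold tg->blkg->key), and free immediately otherwise so the
  repeated create/delete during scan/boot stays fast. The RCU primitive
  is stubbed with a counter; this models control flow only, not the
  kernel API, and throtl_exit_decision() is a hypothetical name. ]

```c
#include <stdbool.h>

/* Counts stubbed grace-period waits so the control flow is observable. */
static int grace_periods_waited;

/* Stand-in for synchronize_rcu(); the real call blocks until all
 * pre-existing RCU read-side critical sections have completed. */
static void fake_synchronize_rcu(void)
{
	grace_periods_waited++;
}

/*
 * Models blk_throtl_exit()'s choice: with non-root groups still
 * undestroyed, the cgroup-deletion path may be an active RCU reader of
 * tg->blkg->key, so td must not be freed until a grace period elapses.
 * With none left, no reader can hold td and we may free right away.
 * Returns true if a grace period was needed before freeing td.
 */
static bool throtl_exit_decision(int nr_undestroyed_grps)
{
	bool wait = nr_undestroyed_grps >= 1;

	if (wait)
		fake_synchronize_rcu();  /* readers may exist: wait first */
	/* else: free td immediately -- the common scan/boot case that
	 * the unconditional-synchronize_rcu() version would slow down */
	return wait;
}
```

The nr_undestroyed_grps == 0 branch never pays the grace-period cost,
which is exactly the property the patch relies on for fast repeated
queue teardown during device scan.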