public inbox for linux-kernel@vger.kernel.org
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux kernel mailing list <linux-kernel@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Nauman Rafique <nauman@google.com>,
	Gui Jianfeng <guijianfeng@cn.fujitsu.com>,
	Divyesh Shah <dpshah@google.com>,
	Heinz Mauelshagen <heinzm@redhat.com>,
	arighi@develer.com
Subject: Re: [RFC PATCH] Bio Throttling support for block IO controller
Date: Fri, 3 Sep 2010 16:36:21 -0700	[thread overview]
Message-ID: <20100903233621.GE2413@linux.vnet.ibm.com> (raw)
In-Reply-To: <20100903015739.GB9450@redhat.com>

On Thu, Sep 02, 2010 at 09:57:39PM -0400, Vivek Goyal wrote:
> On Thu, Sep 02, 2010 at 11:39:32AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 01, 2010 at 01:58:30PM -0400, Vivek Goyal wrote:
> > > Hi,
> > > 
> > > Currently, CFQ provides weight-based proportional division of
> > > bandwidth. People have also been looking at extending the block IO
> > > controller to provide throttling/max bandwidth control.
> > > 
> > > I have started writing support for throttling in the block layer at
> > > the request queue level so that it can be used both for higher-level
> > > logical devices and for leaf nodes. This patch is still a work in
> > > progress, but I wanted to post it for early feedback.
> > > 
> > > Basically, currently I have hooked into the __make_request() function
> > > to check which cgroup a bio belongs to and whether it is exceeding the
> > > specified BW rate. If not, the thread can continue to dispatch the bio
> > > as is; otherwise the bio is queued internally and dispatched later with
> > > the help of a worker thread.
> > > 
> > > HOWTO
> > > =====
> > > - Mount blkio controller
> > > 	mount -t cgroup -o blkio none /cgroup/blkio
> > > 
> > > - Specify a bandwidth rate on a particular device for the root group.
> > >   The format for the policy is "<major>:<minor>  <bytes_per_second>".
> > > 
> > > 	echo "8:16  1048576" > /cgroup/blkio/blkio.read_bps_device
> > > 
> > >   The above will put a limit of 1MB/second on reads happening for the
> > >   root group on the device having major/minor number 8:16.
> > > 
> > > - Run dd to read a file and see if rate is throttled to 1MB/s or not.
> > > 
> > > 	# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 iflag=direct
> > > 	1024+0 records in
> > > 	1024+0 records out
> > > 	4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
> > > 
> > >   Limits for writes can be set using the blkio.write_bps_device file.
> > > 
> > > Open Issues
> > > ===========
> > > - Do we need to provide additional queue congestion semantics? As we
> > >   are throttling and queuing bios at the request queue, we probably
> > >   don't want a user space application to consume all memory allocating
> > >   bios and bombarding the request queue with them.
> > > 
> > > - How to handle the current blkio cgroup stats files with two policies
> > >   operating in the background? If for some reason both the throttling
> > >   and proportional BW policies are operating on a request queue, then
> > >   the stats will be very confusing.
> > > 
> > >   Maybe we can allow activating either the throttling or the
> > >   proportional BW policy per request queue, and create a /sys tunable
> > >   to list and choose between policies (something like choosing the IO
> > >   scheduler). The only downside of this approach is that users also
> > >   need to be aware of the storage hierarchy and activate the right
> > >   policy at each node/request queue.
> > > 
> > > TODO
> > > ====
> > > - Lots of testing, bug fixes.
> > > - Provide support for enforcing limits in IOPS.
> > > - Extend the throttling support for dm devices also.
> > > 
> > > Any feedback is welcome.
> > > 
> > > Thanks
> > > Vivek
> > > 
> > > o IO throttling support in block layer.
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  block/Makefile            |    2 
> > >  block/blk-cgroup.c        |  282 +++++++++++--
> > >  block/blk-cgroup.h        |   44 ++
> > >  block/blk-core.c          |   28 +
> > >  block/blk-throttle.c      |  928 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  block/blk.h               |    4 
> > >  block/cfq-iosched.c       |    4 
> > >  include/linux/blk_types.h |    3 
> > >  include/linux/blkdev.h    |   22 +
> > >  9 files changed, 1261 insertions(+), 56 deletions(-)
> > > 
> > 
> > [ . . . ]
> > 
> > > +void blk_throtl_exit(struct request_queue *q)
> > > +{
> > > +	struct throtl_data *td = q->td;
> > > +	bool wait = false;
> > > +
> > > +	BUG_ON(!td);
> > > +
> > > +	throtl_shutdown_timer_wq(q);
> > > +
> > > +	spin_lock_irq(q->queue_lock);
> > > +	throtl_release_tgs(td);
> > > +	blkiocg_del_blkio_group(&td->root_tg.blkg);
> > > +
> > > +	/* If there are other groups */
> > > +	if (td->nr_undestroyed_grps >= 1)
> > > +		wait = true;
> > > +
> > > +	spin_unlock_irq(q->queue_lock);
> > > +
> > > +	/*
> > > +	 * Wait for tg->blkg->key accessors to exit their grace periods.
> > > +	 * Do this wait only if there are other undestroyed groups out
> > > +	 * there (other than the root group). This can happen if the cgroup
> > > +	 * deletion path claimed responsibility for cleaning up a group
> > > +	 * before the queue cleanup code got to it.
> > > +	 *
> > > +	 * Do not call synchronize_rcu() unconditionally, as there are
> > > +	 * drivers which create/delete request queues hundreds of times
> > > +	 * during scan/boot, and synchronize_rcu() can take significant
> > > +	 * time and slow down boot.
> > > +	 */
> > > +	if (wait)
> > > +		synchronize_rcu();
> > 
> > The RCU readers are presumably not accessing the structure referenced
> > by td?  If they can access it, then they will be accessing freed memory
> > after the following function call!!!
> 
> Hi Paul,
> 
> Thanks for the review.
> 
> As per my understanding, if wait == false, then there should not be any
> RCU readers of tg->blkg->key (the key is basically struct throtl_data *td)
> out there, and hence it should be safe to free "td" without calling
> synchronize_rcu() or call_rcu().
> 
> Following are some details.
> 
> - We instantiate some throtl_grp structures as IO happens in a cgroup and
>   put these objects on a hash list (td->tg_list). These objects are also
>   put on another per-cgroup list (blkcg->blkg_list, blk-cgroup.c).
> 
>   The root group is the only exception: it is not allocated dynamically
>   but is instead statically allocated as part of the throtl_data structure
>   (struct throtl_grp root_tg).
> 
> - There are two group deletion paths. One is taken if a cgroup is being
>   deleted, in which case we need to clean up the associated group; the
>   other is taken if the device is going away, in which case we need to
>   clean up all groups, the td, the request queue, etc.
> 
> - The only user of the RCU-protected tg->blkg->key is the cgroup deletion
>   path, and that path will access the key only if it got ownership of the
>   group it wants to delete. Basically, the two deletion paths can race if
>   a cgroup deletion event and the device going away happen at the same
>   time.
> 
>   In this case, both paths will want to clean up the group, and some kind
>   of arbitration is needed. The path which first takes blkcg->lock and
>   deletes the group from blkcg->blkg_list takes responsibility for
>   cleaning up the group.
> 
>   Now if there are no undestroyed groups (except the root group, which the
>   cgroup path will never try to destroy, since the root cgroup is not
>   deleted), that means the cgroup path will not try to free up any groups,
>   which also means there will be no other RCU readers of tg->blkg->key,
>   and hence it should be safe to free up "td" without synchronize_rcu()
>   or call_rcu(). Am I missing something?

If I understand you correctly, RCU is used only for part of the data
structure, and if you are not freeing up an RCU-traversed portion of
the data structure, then there is no need to wait for a grace period.

							Thanx, Paul

> > If they can access it, I suggest using call_rcu() instead of
> > synchronize_rcu().  One way of doing this would be:
> > 
> > 	if (!wait) {
> > 		call_rcu(&td->rcu, throtl_td_deferred_free);
> 
> if !wait, then as per my current understanding there are no RCU readers
> out there, and the above step should not be required. The reason I don't
> want to use call_rcu() is that though it would keep "td" around, the
> request queue (td->queue) would be gone, and the RCU reader path takes
> the request queue spin lock, so it would end up trying to take a lock
> which has been freed.
> 
> throtl_unlink_blkio_group() {
> 	spin_lock_irqsave(td->queue->queue_lock, flags);
> }
> 
> 
> > 	} else {
> > 		synchronize_rcu();
> > 		throtl_td_free(td);
> > 	}
> 
> This is the step my code is already doing. If wait == true, then there
> are RCU readers out there, and we wait for them to finish before freeing
> up td.
> 
> Thanks
> Vivek


Thread overview: 11+ messages
2010-09-01 17:58 [RFC PATCH] Bio Throttling support for block IO controller Vivek Goyal
2010-09-01 20:07 ` Vivek Goyal
2010-09-02 15:18   ` Vivek Goyal
2010-09-02 16:22     ` Nauman Rafique
2010-09-02 17:22       ` Vivek Goyal
2010-09-02 17:32     ` Balbir Singh
2010-09-02 18:39 ` Paul E. McKenney
2010-09-03  1:57   ` Vivek Goyal
2010-09-03 23:36     ` Paul E. McKenney [this message]
2010-09-03  9:50 ` Gui Jianfeng
2010-09-03 12:48   ` Vivek Goyal
