All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Vivek Goyal <vgoyal@redhat.com>
Cc: linux kernel mailing list <linux-kernel@vger.kernel.org>,
	Jens Axboe <axboe@kernel.dk>, Nauman Rafique <nauman@google.com>,
	Gui Jianfeng <guijianfeng@cn.fujitsu.com>,
	Divyesh Shah <dpshah@google.com>,
	Heinz Mauelshagen <heinzm@redhat.com>,
	arighi@develer.com
Subject: Re: [RFC PATCH] Bio Throttling support for block IO controller
Date: Fri, 3 Sep 2010 16:36:21 -0700	[thread overview]
Message-ID: <20100903233621.GE2413@linux.vnet.ibm.com> (raw)
In-Reply-To: <20100903015739.GB9450@redhat.com>

On Thu, Sep 02, 2010 at 09:57:39PM -0400, Vivek Goyal wrote:
> On Thu, Sep 02, 2010 at 11:39:32AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 01, 2010 at 01:58:30PM -0400, Vivek Goyal wrote:
> > > Hi,
> > > 
> > > Currently CFQ provides the weight based proportional division of bandwidth.
> > > People also have been looking at extending block IO controller to provide
> > > throttling/max bandwidth control.
> > > 
> > > I have started to write the support for throttling in block layer on 
> > > request queue so that it can be used both for higher level logical
> > > devices as well as leaf nodes. This patch is still work in progress but
> > > I wanted to post it for early feedback.
> > > 
> > > Basically currently I have hooked into __make_request() function to 
> > > check which cgroup bio belongs to and if it is exceeding the specified
> > > BW rate. If no, thread can continue to dispatch bio as it is otherwise
> > > bio is queued internally and dispatched later with the help of a worker
> > > thread.
> > > 
> > > HOWTO
> > > =====
> > > - Mount blkio controller
> > > 	mount -t cgroup -o blkio none /cgroup/blkio
> > > 
> > > - Specify a bandwidth rate on particular device for root group. The format
> > >   for policy is "<major>:<minor>  <byes_per_second>".
> > > 
> > > 	echo "8:16  1048576" > /cgroup/blkio/blkio.read_bps_device
> > > 
> > >   Above will put a limit of 1MB/second on reads happening for root group
> > >   on device having major/minor number 8:16.
> > > 
> > > - Run dd to read a file and see if rate is throttled to 1MB/s or not.
> > > 
> > > 	# dd if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 iflag=direct
> > > 	1024+0 records in
> > > 	1024+0 records out
> > > 	4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
> > > 
> > >  Limits for writes can be put using blkio.write_bps_device file.
> > > 
> > > Open Issues
> > > ===========
> > > - Do we need to provide additional queue congestion semantics as we are
> > >   throttling and queuing bios at request queue and probably we don't want
> > >   a user space application to consume all the memory allocating bios
> > >   and bombarding request queue with those bios.
> > > 
> > > - How to handle the current blkio cgroup stats file and two policies
> > >   in the background. If for some reason both throttling and proportional
> > >   BW policies are operating on request queue, then stats will be very
> > >   confusing.
> > > 
> > >   May be we can allow activating either throttling or proportional BW
> > >   policy per request queue and we can create a /sys tunable to list and
> > >   chose between policies (something like choosing IO scheduler). The
> > >   only downside of this apporach is that user also need to be aware of
> > >   the storage hierachy and activate right policy at each node/request
> > >   queue.
> > > 
> > > TODO
> > > ====
> > > - Lots of testing, bug fixes.
> > > - Provide support for enforcing limits in IOPS.
> > > - Extend the throttling support for dm devices also.
> > > 
> > > Any feedback is welcome.
> > > 
> > > Thanks
> > > Vivek
> > > 
> > > o IO throttling support in block layer.
> > > 
> > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  block/Makefile            |    2 
> > >  block/blk-cgroup.c        |  282 +++++++++++--
> > >  block/blk-cgroup.h        |   44 ++
> > >  block/blk-core.c          |   28 +
> > >  block/blk-throttle.c      |  928 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  block/blk.h               |    4 
> > >  block/cfq-iosched.c       |    4 
> > >  include/linux/blk_types.h |    3 
> > >  include/linux/blkdev.h    |   22 +
> > >  9 files changed, 1261 insertions(+), 56 deletions(-)
> > > 
> > 
> > [ . . . ]
> > 
> > > +void blk_throtl_exit(struct request_queue *q)
> > > +{
> > > +	struct throtl_data *td = q->td;
> > > +	bool wait = false;
> > > +
> > > +	BUG_ON(!td);
> > > +
> > > +	throtl_shutdown_timer_wq(q);
> > > +
> > > +	spin_lock_irq(q->queue_lock);
> > > +	throtl_release_tgs(td);
> > > +	blkiocg_del_blkio_group(&td->root_tg.blkg);
> > > +
> > > +	/* If there are other groups */
> > > +	if (td->nr_undestroyed_grps >= 1)
> > > +		wait = true;
> > > +
> > > +	spin_unlock_irq(q->queue_lock);
> > > +
> > > +	/*
> > > +	 * Wait for tg->blkg->key accessors to exit their grace periods.
> > > +	 * Do this wait only if there are other undestroyed groups out
> > > +	 * there (other than root group). This can happen if cgroup deletion
> > > +	 * path claimed the responsibility of cleaning up a group before
> > > +	 * queue cleanup code get to the group.
> > > +	 *
> > > +	 * Do not call synchronize_rcu() unconditionally as there are drivers
> > > +	 * which create/delete request queue hundreds of times during scan/boot
> > > +	 * and synchronize_rcu() can take significant time and slow down boot.
> > > +	 */
> > > +	if (wait)
> > > +		synchronize_rcu();
> > 
> > The RCU readers are presumably not accessing the structure referenced
> > by td?  If they can access it, then they will be accessing freed memory
> > after the following function call!!!
> 
> Hi Paul,
> 
> Thanks for the review.
> 
> As per my understanding if wait = false, then there should not be any
> RCU readers of tg->blkg->key (key is basically struct throtl_data *td) out
> there hence it should be safe to to free "td" without calling
> synchronize_rcu() or call_rcu().
> 
> Following are some details.
> 
> - We instanciate some throtl_grp structures as IO happens in a cgropu and
>   these objects are put in a hash list (td->tg_list). These objects are
>   put into another cgroup list (blkcg->blkg_list, blk-cgroup.c).
> 
>   Root group is only exception which is not allocated dynamically instead it
>   is statically allocated as part of throtl_data structure.
>   (struct throtl_grp root_tg);
> 
> - There are two group deletion paths. One is if cgroup is being deleted
>   then we need to cleanup associated group and other is if device is
>   going away then we need to cleanup all groups and td and request queue
>   etc.
> 
> - The only user of RCU protected tg->blkg->key is cgroup deletion path
>   and that path will be accessing this key only if it got the ownership
>   of a group it wants to delete. Basically group deletion path can race
>   between cgroup deletion event and device going away at the same time.
> 
>   In this case, both path will want to clean up a group and some kind of
>   arbitration is needed. The path which is first able to take blkcg->lock
>   and is able to delete group from blkcg->blkg_list, takes the
>   responsibility of cleaning up the group.
> 
>   Now if there are no undestroyed groups (except root group which cgroup
>   path will never try to destroy as root cgroup is not deleted), that
>   means cgroup path will not try to free up any groups, that also means
>   that there will be no other RCU readers of tg->blkg->key and hence
>   it should be safe to free up "td" without synchronize_rcu()
>   or call_rcu(). Am I missing something?

If I understand you correctly, RCU is used only for part of the data
structure, and if you are not freeing up an RCU-traversed portion of
the data structure, then there is no need to wait for a grace period.

							Thanx, Paul

> > If they can access it, I suggest using call_rcu() instead of
> > synchronize_rcu().  One way of doing this would be:
> > 
> > 	if (!wait) {
> > 		call_rcu(&td->rcu, throtl_td_deferred_free);
> 
> if !wait, then as per my current understanding there are no RCU readers
> out there and above step should not be required. The reason I don't want
> to use call_rcu() is that though it will keep "td" around but request
> queue will be gone (td->queue) and RCU reader path take request queue
> spin lock and they will be trying to take lock which has been freed.
> 
> throtl_unlink_blkio_group() {
> 	spin_lock_irqsave(td->queue->queue_lock, flags);
> }
> 
> 
> > 	} else {
> > 		synchronize_rcu();
> > 		throtl_td_free(td);
> > 	}
> 
> This is the step my code is already doing. If wait=true, then there are
> RCU readers out there and wait for them to finish before freeing up
> td.
> 
> Thanks
> Vivek

  reply	other threads:[~2010-09-03 23:36 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-01 17:58 [RFC PATCH] Bio Throttling support for block IO controller Vivek Goyal
2010-09-01 20:07 ` Vivek Goyal
2010-09-02 15:18   ` Vivek Goyal
2010-09-02 16:22     ` Nauman Rafique
2010-09-02 17:22       ` Vivek Goyal
2010-09-02 17:32     ` Balbir Singh
2010-09-02 18:39 ` Paul E. McKenney
2010-09-03  1:57   ` Vivek Goyal
2010-09-03 23:36     ` Paul E. McKenney [this message]
2010-09-03  9:50 ` Gui Jianfeng
2010-09-03 12:48   ` Vivek Goyal

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100903233621.GE2413@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=arighi@develer.com \
    --cc=axboe@kernel.dk \
    --cc=dpshah@google.com \
    --cc=guijianfeng@cn.fujitsu.com \
    --cc=heinzm@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nauman@google.com \
    --cc=vgoyal@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.