From: Stefan Hajnoczi <stefanha@gmail.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: kwolf@redhat.com, wuzhy@linux.vnet.ibm.com,
	"Benoît Canet" <benoit@irqsave.net>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
Date: Wed, 20 Mar 2013 15:28:42 +0100	[thread overview]
Message-ID: <20130320142842.GA2389@stefanha-thinkpad.muc.redhat.com> (raw)
In-Reply-To: <20130320132924.GB14441@stefanha-thinkpad.muc.redhat.com>

On Wed, Mar 20, 2013 at 02:29:24PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 20, 2013 at 10:12:14AM +0100, Benoît Canet wrote:
> > This patch fixes an I/O throttling problem triggered by limiting a raw
> > virtio device to 150 iops and running a load of 50 threads doing random
> > pwrites on it.
> >
> > After a few moments the iops count starts to oscillate steadily between 0
> > and a value higher than the limit.
> >
> > As the load keeps running, the period and the amplitude of the oscillation
> > increase until I/O bursts reach the physical storage's maximum iops count.
> >
> > These bursts are followed by guest I/O starvation.
> >
> > Since the period of this oscillation cycle keeps increasing, the cause must
> > be a computation error that slowly increases the wait time.
> >
> > This patch makes the wait time a bit smaller, and tests confirm that it
> > fixes the oscillating behavior.
> > 
> > Signed-off-by: Benoit Canet <benoit@irqsave.net>
> > ---
> >  block.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/block.c b/block.c
> > index 0a062c9..455d8b0 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> >      }
> >  
> >      /* Calc approx time to dispatch */
> > -    wait_time = (ios_base + 1) / iops_limit;
> > +    wait_time = ios_base / iops_limit;
> >      if (wait_time > elapsed_time) {
> >          wait_time = wait_time - elapsed_time;
> >      } else {
> 
> I tried reproducing without your test case:
> 
> dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct
> 
> I've pasted printfs below which reveal that the wait time increases monotonically!
> In other words, dd is slowed down more and more as it runs:
> 
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1426 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1431 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1437 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> ...
> 
> Killing dd and starting it again resets the accumulated delay (probably because
> we end the slice and state is cleared).
> 
> This suggests workloads that are constantly at the I/O limit will experience
> creeping delay or the oscillations you found.
> 
> After applying your patch I observed the opposite behavior: wait time decreases
> until it resets itself.  Perhaps we're waiting less and less until we just
> finish the slice and all values reset:
> 
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 496 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 489 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 484 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 480 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 474 ms
> ...
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 300 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 299 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 298 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 494 ms
> 
> I'm not confident that I understand the effects of your patch.  Do you have an
> explanation for these results?
> 
> More digging will probably be necessary to solve the underlying problem here.
> 
> diff --git a/block.c b/block.c
> index 0a062c9..7a8c9e6 100644
> --- a/block.c
> +++ b/block.c
> @@ -175,7 +175,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
>      int64_t wait_time = -1;
>  
>      if (!qemu_co_queue_empty(&bs->throttled_reqs)) {
> +        fprintf(stderr, "bs %p co %p waiting for throttled_reqs\n", bs, qemu_coroutine_self());
>          qemu_co_queue_wait(&bs->throttled_reqs);
> +        fprintf(stderr, "bs %p co %p woke up from throttled_reqs\n", bs, qemu_coroutine_self());
>      }
>  
>      /* In fact, we hope to keep each request's timing, in FIFO mode. The next
> @@ -188,7 +190,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
>      while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, &wait_time)) {
>          qemu_mod_timer(bs->block_timer,
>                         wait_time + qemu_get_clock_ns(vm_clock));
> +        fprintf(stderr, "bs %p co %p throttled for %"PRId64" ms\n", bs, qemu_coroutine_self(), wait_time
>          qemu_co_queue_wait_insert_head(&bs->throttled_reqs);
> +        fprintf(stderr, "bs %p co %p woke up from throttled_reqs after sleeping\n", bs, qemu_coroutine_s
>      }
>  
>      qemu_co_queue_next(&bs->throttled_reqs);
> 

There's something that bothers me:

static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
                             double elapsed_time, uint64_t *wait)
{
...

We extend the slice by incrementing bs->slice_end.  This is done to
account for requests that span slice boundaries.  By extending we keep
the io_base[] statistic so that guests cannot cheat by issuing their
requests at the end of the slice.

But I don't understand why bs->slice_time is modified instead of keeping
it constant at 100 ms:

    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
    bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
    if (wait) {
        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
    }

    return true;
}
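
For concreteness, here is a throwaway sketch (not QEMU code) of what that
assignment works out to in the 150 iops case from the commit message.  It
assumes BLOCK_IO_SLICE_TIME is 100 ms expressed in nanoseconds, as in block.c,
and the other values are made up purely for illustration:

#include <stdio.h>

#define BLOCK_IO_SLICE_TIME 100000000.0   /* 100 ms in ns (assumption) */

int main(void)
{
    double iops_limit   = 150.0;
    double ios_base     = 150.0;  /* requests already counted in this slice */
    double elapsed_time = 1.0;    /* seconds since slice_start */

    /* current code: (ios_base + 1) / iops_limit; Benoît's patch drops the +1 */
    double wait_time = (ios_base + 1) / iops_limit;
    if (wait_time > elapsed_time) {
        wait_time -= elapsed_time;
    } else {
        wait_time = 0;
    }

    /* wait_time is in seconds; * BLOCK_IO_SLICE_TIME * 10 converts it to ns */
    double slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;

    printf("wait_time  = %f s\n", wait_time);      /* ~0.00667 s */
    printf("slice_time = %.0f ns\n", slice_time);  /* ~6.7e6 ns, i.e. ~6.7 ms */
    return 0;
}

So under sustained load the 100 ms default gets replaced by a few milliseconds
here, and slice_end is then adjusted from that value.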

We'll use bs->slice_time again when a request falls within the current
slice:

static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
                           bool is_write, int64_t *wait)
{
    int64_t  now, max_wait;
    uint64_t bps_wait = 0, iops_wait = 0;
    double   elapsed_time;
    int      bps_ret, iops_ret;

    now = qemu_get_clock_ns(vm_clock);
    if ((bs->slice_start < now)
        && (bs->slice_end > now)) {
        bs->slice_end = now + bs->slice_time;
    } else {

I decided to try the following without your patch:

diff --git a/block.c b/block.c
index 0a062c9..2af2da2 100644
--- a/block.c
+++ b/block.c
@@ -3746,8 +3750,8 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
         wait_time = 0;
     }
 
-    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
-    bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
+/*    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10; */
+    bs->slice_end += bs->slice_time; /* - 3 * BLOCK_IO_SLICE_TIME; */
     if (wait) {
         *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
     }

Now there is no oscillation and the wait_times do not grow or shrink
under constant load from dd(1).
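
For reference, one way to reproduce this is roughly the following (the image
path is a placeholder; the iops=150 limit and the dd command are the ones
from this thread):

  qemu-system-x86_64 ... -drive file=/path/to/test.raw,if=virtio,iops=150

  # inside the guest
  dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct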

Can you try this patch by itself to see if it fixes the oscillation?

If yes, we should audit the code a bit more to figure out the best
solution for extending slice times.

Stefan

Thread overview: 14+ messages
2013-03-20  9:12 [Qemu-devel] [PATCH] Fix I/O throttling pathologic oscillating behavior Benoît Canet
2013-03-20  9:12 ` [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation Benoît Canet
2013-03-20 10:55   ` Zhi Yong Wu
2013-03-20 13:29   ` Stefan Hajnoczi
2013-03-20 14:28     ` Stefan Hajnoczi [this message]
2013-03-20 14:56       ` Benoît Canet
2013-03-20 15:12         ` Stefan Hajnoczi
2013-03-21  1:18           ` Zhi Yong Wu
2013-03-21  9:17             ` Stefan Hajnoczi
2013-03-21 13:04               ` Zhi Yong Wu
2013-03-21 15:14                 ` Stefan Hajnoczi
2013-03-20 15:27       ` Benoît Canet
2013-03-21 10:34         ` Stefan Hajnoczi
2013-03-21 14:28           ` Benoît Canet
