* [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
@ 2011-09-08  9:13 Takuya Yoshikawa
  2011-09-08 13:49 ` Vivek Goyal
  0 siblings, 1 reply; 5+ messages in thread
From: Takuya Yoshikawa @ 2011-09-08  9:13 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, takuya.yoshikawa, qemu-devel, kvm, vgoyal

This is a report of strange cfq behaviour which seems to be triggered by
QEMU posix aio threads.

Host environment:
  OS: RHEL6.0 KVM/qemu-kvm (with no patch applied)
  IO scheduler: cfq (with the default parameters)

On the host, we were running 3 linux guests to see if I/O from these guests
would be handled fairly by host; each guest did dd write with oflag=direct.

Guest virtual disk:
  We used a host local disk which had 3 partitions, and each guest was
  allocated one of these as dd write target.

So our test was for checking if cfq could keep fairness for the 3 guests
who shared the same disk.

The result (strange starvation):
  Sometimes, one guest dominated cfq for more than 10sec and requests from
  other guests were not handled at all during that time.

Below is the blktrace log, which shows that a request to (8,27) in cfq2068S (*1)
is not handled at all while cfq2095S and cfq2067S, which hold requests to
(8,26), are being handled alternately.

*1) WS 104920578 + 64

Question:
  I guess that cfq_close_cooperator() was being called in an unusual manner.
  If so, do you think that cfq is responsible for keeping fairness for this
  kind of unusual write requests?

Note:
  With RHEL6.1, this problem could not be triggered. But I guess that was due to
  QEMU's block layer updates.

Thanks,
	Takuya

--- blktrace log ---
  8,16   0     2010     0.275081840  2068  A  WS 104920578 + 64 <- (8,27) 0
  8,16   0     2011     0.275082180  2068  Q  WS 104920578 + 64 [qemu-kvm]
  8,16   0        0     0.275091369     0  m   N cfq2068S / alloced
  8,16   0     2012     0.275091909  2068  G  WS 104920578 + 64 [qemu-kvm]
  8,16   0     2013     0.275093352  2068  P   N [qemu-kvm]
  8,16   0     2014     0.275094059  2068  I   W 104920578 + 64 [qemu-kvm]
  8,16   0        0     0.275094887     0  m   N cfq2068S / insert_request
  8,16   0        0     0.275095742     0  m   N cfq2068S / add_to_rr
  8,16   0     2015     0.275097194  2068  U   N [qemu-kvm] 1
  8,16   2     2073     0.275189462  2095  A  WS 83979688 + 64 <- (8,26) 40000
  8,16   2     2074     0.275189989  2095  Q  WS 83979688 + 64 [qemu-kvm]
  8,16   2     2075     0.275192534  2095  G  WS 83979688 + 64 [qemu-kvm]
  8,16   2     2076     0.275193909  2095  I   W 83979688 + 64 [qemu-kvm]
  8,16   2        0     0.275195609     0  m   N cfq2095S / insert_request
  8,16   2        0     0.275196404     0  m   N cfq2095S / add_to_rr
  8,16   2        0     0.275198004     0  m   N cfq2095S / preempt
  8,16   2        0     0.275198688     0  m   N cfq2067S / slice expired t=1
  8,16   2        0     0.275199631     0  m   N cfq2067S / resid=100
  8,16   2        0     0.275200413     0  m   N cfq2067S / sl_used=1
  8,16   2        0     0.275201521     0  m   N / served: vt=1671968768 min_vt=1671966720
  8,16   2        0     0.275202323     0  m   N cfq2067S / del_from_rr
  8,16   2        0     0.275204263     0  m   N cfq2095S / set_active wl_prio:0 wl_type:2
  8,16   2        0     0.275205131     0  m   N cfq2095S / fifo=(null)
  8,16   2        0     0.275205851     0  m   N cfq2095S / dispatch_insert
  8,16   2        0     0.275207121     0  m   N cfq2095S / dispatched a request
  8,16   2        0     0.275207873     0  m   N cfq2095S / activate rq, drv=1
  8,16   2     2077     0.275208198  2095  D   W 83979688 + 64 [qemu-kvm]
  8,16   2     2078     0.275269567  2095  U   N [qemu-kvm] 2
  8,16   4      836     0.275483550     0  C   W 83979688 + 64 [0]
  8,16   4        0     0.275496745     0  m   N cfq2095S / complete rqnoidle 0
  8,16   4        0     0.275497825     0  m   N cfq2095S / set_slice=100
  8,16   4        0     0.275499512     0  m   N cfq2095S / arm_idle: 8
  8,16   4        0     0.275499862     0  m   N cfq schedule dispatch
  8,16   6       85     0.275626195  2067  A  WS 83979752 + 64 <- (8,26) 40064
  8,16   6       86     0.275626598  2067  Q  WS 83979752 + 64 [qemu-kvm]
  8,16   6       87     0.275628580  2067  G  WS 83979752 + 64 [qemu-kvm]
  8,16   6       88     0.275629630  2067  I   W 83979752 + 64 [qemu-kvm]
  8,16   6        0     0.275631047     0  m   N cfq2067S / insert_request
  8,16   6        0     0.275631730     0  m   N cfq2067S / add_to_rr
  8,16   6        0     0.275633567     0  m   N cfq2067S / preempt
  8,16   6        0     0.275634275     0  m   N cfq2095S / slice expired t=1
  8,16   6        0     0.275635285     0  m   N cfq2095S / resid=100
  8,16   6        0     0.275635985     0  m   N cfq2095S / sl_used=1
  8,16   6        0     0.275636882     0  m   N / served: vt=1671970816 min_vt=1671968768
  8,16   6        0     0.275637585     0  m   N cfq2095S / del_from_rr
  8,16   6        0     0.275639382     0  m   N cfq2067S / set_active wl_prio:0 wl_type:2
  8,16   6        0     0.275640222     0  m   N cfq2067S / fifo=(null)
  8,16   6        0     0.275640809     0  m   N cfq2067S / dispatch_insert
  8,16   6        0     0.275641929     0  m   N cfq2067S / dispatched a request
  8,16   6        0     0.275642699     0  m   N cfq2067S / activate rq, drv=1
  8,16   6       89     0.275643047  2067  D   W 83979752 + 64 [qemu-kvm]
  8,16   6       90     0.275702446  2067  U   N [qemu-kvm] 2
  8,16   4      837     0.275864044     0  C   W 83979752 + 64 [0]
  8,16   4        0     0.275869194     0  m   N cfq2067S / complete rqnoidle 0
  8,16   4        0     0.275870399     0  m   N cfq2067S / set_slice=100
  8,16   4        0     0.275872046     0  m   N cfq2067S / arm_idle: 8
  8,16   4        0     0.275872442     0  m   N cfq schedule dispatch
....
... more than 10sec ...
....
  8,16   4        0    13.854114096     0  m   N cfq schedule dispatch
  8,16   4        0    13.854123729     0  m   N cfq2068S / set_active wl_prio:0 wl_type:2
  8,16   4        0    13.854125678     0  m   N cfq2068S / fifo=ffff880bddcec780
  8,16   4        0    13.854126416     0  m   N cfq2068S / dispatch_insert
  8,16   4        0    13.854128441     0  m   N cfq2068S / dispatched a request
  8,16   4        0    13.854129303     0  m   N cfq2068S / activate rq, drv=1
  8,16   4    23836    13.854130246    54  D   W 104920578 + 64 [kblockd/4]
  8,16   4    23837    13.855439985     0  C   W 104920578 + 64 [0]
  8,16   4        0    13.855450434     0  m   N cfq2068S / complete rqnoidle 0
  8,16   4        0    13.855451909     0  m   N cfq2068S / set_slice=100
  8,16   4        0    13.855453604     0  m   N cfq2068S / arm_idle: 8
  8,16   4        0    13.855454099     0  m   N cfq schedule dispatch
  8,16   0    48186    13.855686027  2102  A  WS 104920642 + 64 <- (8,27) 64
  8,16   0    48187    13.855686537  2102  Q  WS 104920642 + 64 [qemu-kvm]
  8,16   0        0    13.855698094     0  m   N cfq2102S / alloced
  8,16   0    48188    13.855698528  2102  G  WS 104920642 + 64 [qemu-kvm]
  8,16   0    48189    13.855700281  2102  I   W 104920642 + 64 [qemu-kvm]
  8,16   0        0    13.855701243     0  m   N cfq2102S / insert_request
  8,16   0        0    13.855701974     0  m   N cfq2102S / add_to_rr
  8,16   0        0    13.855704313     0  m   N cfq2102S / preempt
  8,16   0        0    13.855705068     0  m   N cfq2068S / slice expired t=1
  8,16   0        0    13.855706191     0  m   N cfq2068S / resid=100
  8,16   0        0    13.855706993     0  m   N cfq2068S / sl_used=1
  8,16   0        0    13.855708228     0  m   N / served: vt=1736314880 min_vt=1736312832
  8,16   0        0    13.855709046     0  m   N cfq2068S / del_from_rr


-- 
Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>

* Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
  2011-09-08  9:13 [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests Takuya Yoshikawa
@ 2011-09-08 13:49 ` Vivek Goyal
  2011-09-09  9:00   ` Takuya Yoshikawa
  0 siblings, 1 reply; 5+ messages in thread
From: Vivek Goyal @ 2011-09-08 13:49 UTC (permalink / raw)
  To: Takuya Yoshikawa; +Cc: axboe, takuya.yoshikawa, linux-kernel, kvm, qemu-devel

On Thu, Sep 08, 2011 at 06:13:53PM +0900, Takuya Yoshikawa wrote:
> This is a report of strange cfq behaviour which seems to be triggered by
> QEMU posix aio threads.
> 
> Host environment:
>   OS: RHEL6.0 KVM/qemu-kvm (with no patch applied)
>   IO scheduler: cfq (with the default parameters)

So you are using RHEL 6.0 in both the host and the guest kernel? Can you
reproduce the same issue with upstream kernels? How easily/frequently can
you reproduce this with a RHEL6.0 host?

> 
> On the host, we were running 3 linux guests to see if I/O from these guests
> would be handled fairly by host; each guest did dd write with oflag=direct.
> 
> Guest virtual disk:
>   We used a host local disk which had 3 partitions, and each guest was
>   allocated one of these as dd write target.
> 
> So our test was for checking if cfq could keep fairness for the 3 guests
> who shared the same disk.
> 
> The result (strange starvation):
>   Sometimes, one guest dominated cfq for more than 10sec and requests from
>   other guests were not handled at all during that time.
> 
> Below is the blktrace log, which shows that a request to (8,27) in cfq2068S (*1)
> is not handled at all while cfq2095S and cfq2067S, which hold requests to
> (8,26), are being handled alternately.
> 
> *1) WS 104920578 + 64
> 
> Question:
>   I guess that cfq_close_cooperator() was being called in an unusual manner.
>   If so, do you think that cfq is responsible for keeping fairness for this
>   kind of unusual write requests?

- If two guests are doing IO to separate partitions, they should really
  not be very close (until and unless partitions are really small).

- Even if there are close cooperators, these queues are merged and they
  are treated as single queue from slice point of view. So cooperating
  queues should be merged and get a single slice instead of starving
  other queues in the system.

Can you upload the blktrace logs somewhere which show what happened
during those 10 seconds?

> 
> Note:
>   With RHEL6.1, this problem could not be triggered. But I guess that was due to
>   QEMU's block layer updates.

You can try reproducing this with fio.

Thanks
Vivek

* Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
  2011-09-08 13:49 ` Vivek Goyal
@ 2011-09-09  9:00   ` Takuya Yoshikawa
  2011-09-09 13:48     ` Stefan Hajnoczi
  2011-09-09 14:38     ` Vivek Goyal
  0 siblings, 2 replies; 5+ messages in thread
From: Takuya Yoshikawa @ 2011-09-09  9:00 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: axboe, takuya.yoshikawa, linux-kernel, kvm, qemu-devel

Vivek Goyal <vgoyal@redhat.com> wrote:

> So you are using RHEL 6.0 in both the host and the guest kernel? Can you
> reproduce the same issue with upstream kernels? How easily/frequently can
> you reproduce this with a RHEL6.0 host?

Guests were CentOS6.0.

I have only RHEL6.0 and RHEL6.1 test results now.
I want to try similar tests with upstream kernels if I can get some time.

With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%.

> > On the host, we were running 3 linux guests to see if I/O from these guests
> > would be handled fairly by host; each guest did dd write with oflag=direct.
> > 
> > Guest virtual disk:
> >   We used a host local disk which had 3 partitions, and each guest was
> >   allocated one of these as dd write target.
> > 
> > So our test was for checking if cfq could keep fairness for the 3 guests
> > who shared the same disk.
> > 
> > The result (strange starvation):
> >   Sometimes, one guest dominated cfq for more than 10sec and requests from
> >   other guests were not handled at all during that time.
> > 
> > Below is the blktrace log, which shows that a request to (8,27) in cfq2068S (*1)
> > is not handled at all while cfq2095S and cfq2067S, which hold requests to
> > (8,26), are being handled alternately.
> > 
> > *1) WS 104920578 + 64
> > 
> > Question:
> >   I guess that cfq_close_cooperator() was being called in an unusual manner.
> >   If so, do you think that cfq is responsible for keeping fairness for this
> >   kind of unusual write requests?
> 
> - If two guests are doing IO to separate partitions, they should really
>   not be very close (until and unless partitions are really small).

Sorry for my lack of explanation.

The IO was issued from QEMU and the cooperative threads were both for the same
guest. In other words, QEMU was using two threads for one IO stream from the guest.

As my blktrace log snippet showed, cfq2095S and cfq2067S treated one sequential
IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on.

  These should be from the same guest because the target partition was same,
  which was allocated to that guest.

During the 10sec, this repetition continued without allowing others to interrupt.

I know it is unnatural but sometimes QEMU uses two aio threads for issuing one
IO stream.

> 
> - Even if there are close cooperators, these queues are merged and they
>   are treated as single queue from slice point of view. So cooperating
>   queues should be merged and get a single slice instead of starving
>   other queues in the system.

I understand that close cooperators' queues should be merged, but in our test
case, when the 64KB request was issued from one aio thread, the other thread's
queue was empty; because these queues are for the same stream, next request
could not come until current request got finished.

  But this is complicated because it depends on the qemu block layer aio.

I am not sure if cfq would try to merge the queues in such cases.

> Can you upload the blktrace logs somewhere which show what happened
> during those 10 seconds?

I have some restrictions here, so maybe, but I need to check later.

> > Note:
> >   With RHEL6.1, this problem could not be triggered. But I guess that was due to
> >   QEMU's block layer updates.
> 
> You can try reproducing this with fio.

Thank you, I want to do some tests by myself; the original report was not from
my team.


My feeling is that it may be possible to dominate IO if we create two threads
and issue cooperative IO as our QEMU did; from the host's point of view QEMU is
just a process, and here one QEMU process dominated IO, preventing the other
QEMU processes' IO.

Thanks,
	Takuya

* Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
  2011-09-09  9:00   ` Takuya Yoshikawa
@ 2011-09-09 13:48     ` Stefan Hajnoczi
  2011-09-09 14:38     ` Vivek Goyal
  1 sibling, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2011-09-09 13:48 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: axboe, kvm, qemu-devel, linux-kernel, Vivek Goyal,
	takuya.yoshikawa

On Fri, Sep 9, 2011 at 10:00 AM, Takuya Yoshikawa
<yoshikawa.takuya@oss.ntt.co.jp> wrote:
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
>> So you are using RHEL 6.0 in both the host and the guest kernel? Can you
>> reproduce the same issue with upstream kernels? How easily/frequently can
>> you reproduce this with a RHEL6.0 host?
>
> Guests were CentOS6.0.
>
> I have only RHEL6.0 and RHEL6.1 test results now.
> I want to try similar tests with upstream kernels if I can get some time.
>
> With RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%.
>
>> > On the host, we were running 3 linux guests to see if I/O from these guests
>> > would be handled fairly by host; each guest did dd write with oflag=direct.
>> >
>> > Guest virtual disk:
>> >   We used a host local disk which had 3 partitions, and each guest was
>> >   allocated one of these as dd write target.
>> >
>> > So our test was for checking if cfq could keep fairness for the 3 guests
>> > who shared the same disk.
>> >
>> > The result (strange starvation):
>> >   Sometimes, one guest dominated cfq for more than 10sec and requests from
>> >   other guests were not handled at all during that time.
>> >
>> > Below is the blktrace log, which shows that a request to (8,27) in cfq2068S (*1)
>> > is not handled at all while cfq2095S and cfq2067S, which hold requests to
>> > (8,26), are being handled alternately.
>> >
>> > *1) WS 104920578 + 64
>> >
>> > Question:
>> >   I guess that cfq_close_cooperator() was being called in an unusual manner.
>> >   If so, do you think that cfq is responsible for keeping fairness for this
>> >   kind of unusual write requests?
>>
>> - If two guests are doing IO to separate partitions, they should really
>>   not be very close (until and unless partitions are really small).
>
> Sorry for my lack of explanation.
>
> The IO was issued from QEMU and the cooperative threads were both for the same
> guest. In other words, QEMU was using two threads for one IO stream from the guest.
>
> As my blktrace log snippet showed, cfq2095S and cfq2067S treated one sequential
> IO; cfq2095S did 64KB, then cfq2067S did next 64KB, and so on.
>
>  These should be from the same guest because the target partition was same,
>  which was allocated to that guest.
>
> During the 10sec, this repetition continued without allowing others to interrupt.
>
> I know it is unnatural but sometimes QEMU uses two aio threads for issuing one
> IO stream.
>
>>
>> - Even if there are close cooperators, these queues are merged and they
>>   are treated as single queue from slice point of view. So cooperating
>>   queues should be merged and get a single slice instead of starving
>>   other queues in the system.
>
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, next request
> could not come until current request got finished.
>
>  But this is complicated because it depends on the qemu block layer aio.
>
> I am not sure if cfq would try to merge the queues in such cases.

Looking at posix-aio-compat.c, QEMU's threadpool for asynchronous I/O,
this seems like a fairly generic issue.  Other applications may suffer
from this same I/O scheduler behavior.  It would be nice to create a
test case program which doesn't use QEMU at all.

QEMU has a queue of requests that need to be processed.  There is a
pool of threads that sleep until requests become available with
pthread_cond_timedwait(3).  When a request is added to the queue,
pthread_cond_signal(3) is called in order to wake one sleeping thread.

This bouncing pattern between two threads that you describe is
probably a result of pthread_cond_timedwait(3) waking up each thread
in alternating fashion.  So we get this pattern:

A  B  <-- threads
1     <-- I/O requests
   2
3
   4
5
   6
...
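
In case it helps with the "test case program which doesn't use QEMU at all"
idea above, here is a rough, untested sketch: a single sequential stream of
64KB O_DIRECT writes is handed, one request at a time, to a pool of two
workers woken through a condition variable, so consecutive requests of the
same stream may be issued by either thread (and hence show up in different
cfq queues on the host). This is only an illustration, not QEMU's
posix-aio-compat.c; the file name, request count, and the usleep() stand-in
for waiting for completion are made up.

--- sketch: seqwrite.c (illustration only) ---
/*
 * Build: gcc -O2 -pthread -D_GNU_SOURCE -o seqwrite seqwrite.c
 * Run:   ./seqwrite <partition or test file>   (contents are overwritten!)
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define REQ_SIZE (64 * 1024)   /* 64KB requests, as in the blktrace above */
#define NR_REQS  4096

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int fd, pending, done;
static off_t next_off;         /* next offset of the sequential stream */

static void *worker(void *arg)
{
    void *buf;

    if (posix_memalign(&buf, 4096, REQ_SIZE))   /* O_DIRECT needs alignment */
        return NULL;
    memset(buf, 0, REQ_SIZE);

    for (;;) {
        off_t off;

        pthread_mutex_lock(&lock);
        while (!pending && !done)
            pthread_cond_wait(&cond, &lock);    /* sleep until a request arrives */
        if (!pending) {                         /* done and queue drained */
            pthread_mutex_unlock(&lock);
            break;
        }
        pending--;
        off = next_off;
        next_off += REQ_SIZE;
        pthread_mutex_unlock(&lock);

        /* whichever worker was woken issues the next 64KB of the stream */
        if (pwrite(fd, buf, REQ_SIZE, off) < 0)
            perror("pwrite");
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t[2];
    int i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <block device or file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    pthread_create(&t[0], NULL, worker, NULL);
    pthread_create(&t[1], NULL, worker, NULL);

    for (i = 0; i < NR_REQS; i++) {
        pthread_mutex_lock(&lock);
        pending++;
        pthread_cond_signal(&cond);   /* wake exactly one sleeping worker */
        pthread_mutex_unlock(&lock);
        usleep(500);                  /* crude stand-in for "wait for completion" */
    }
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    close(fd);
    return 0;
}
--- end sketch ---

Running three instances of something like this against the three partitions,
with blktrace on the disk, should show whether the cooperating-queue pattern
alone is enough to starve the other streams.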

Stefan

* Re: [Qemu-devel] CFQ I/O starvation problem triggered by RHEL6.0 KVM guests
  2011-09-09  9:00   ` Takuya Yoshikawa
  2011-09-09 13:48     ` Stefan Hajnoczi
@ 2011-09-09 14:38     ` Vivek Goyal
  1 sibling, 0 replies; 5+ messages in thread
From: Vivek Goyal @ 2011-09-09 14:38 UTC (permalink / raw)
  To: Takuya Yoshikawa
  Cc: axboe, kvm, linux-kernel, qemu-devel, Jeff Moyer,
	takuya.yoshikawa

On Fri, Sep 09, 2011 at 06:00:28PM +0900, Takuya Yoshikawa wrote:

[..]
> 
> > 
> > - Even if there are close cooperators, these queues are merged and they
> >   are treated as single queue from slice point of view. So cooperating
> >   queues should be merged and get a single slice instead of starving
> >   other queues in the system.
> 
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, next request
> could not come until current request got finished.
> 
>   But this is complicated because it depends on the qemu block layer aio.
> 
> I am not sure if cfq would try to merge the queues in such cases.

[CCing Jeff Moyer ]

I think even if these queues are alternating, they should have been merged
(if we considered them close cooperators).

So in cfq_select_queue() we have:

        new_cfqq = cfq_close_cooperator(cfqd, cfqq);
        if (new_cfqq) {
                if (!cfqq->new_cfqq)
                        cfq_setup_merge(cfqq, new_cfqq);
                goto expire;
        }

So if we selected a new queue because it is a close cooperator, we should
have called setup_merge(), and the next time IO happens, one of the queues
should merge into the other.

cfq_set_request() {
                if (cfqq->new_cfqq)
                        cfqq = cfq_merge_cfqqs(cfqd, cic, cfqq);
}

If merging is not happening and we still somehow continue to pick the close
cooperator as the new queue, starving other queues in the system, then there
is a bug.

I think you should try to reproduce this with fio on upstream kernels, put in
some more tracepoints, and see what's happening.
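
For the extra tracing, something like the following (untested, and the exact
hunk depends on the kernel version) on top of the cfq_select_queue() snippet
above would log the close-cooperator decision itself, using the same
cfq_log_cfqq() macro that produces the "cfq2068S / ..." messages in the trace,
so we can see whether setup_merge() was ever attempted during those 10 seconds:

        new_cfqq = cfq_close_cooperator(cfqd, cfqq);
        if (new_cfqq) {
                /* log which queue was picked and whether a merge is pending */
                cfq_log_cfqq(cfqd, cfqq, "close coop -> cfq%d, new_cfqq=%p",
                             new_cfqq->pid, cfqq->new_cfqq);
                if (!cfqq->new_cfqq)
                        cfq_setup_merge(cfqq, new_cfqq);
                goto expire;
        }

The message would show up in blktrace as another "m N cfq..." line next to
the preempt/slice expired messages already in the log.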

Thanks
Vivek
