* questions about the number of pending requests that the host system can detect
From: Yuehai Xu @ 2010-08-12 3:42 UTC
To: xen-devel; +Cc: yuehai.xu, yhxu
Hi all,
I know the default I/O scheduler for DomU is noop. Leaving Xen aside
for a moment, suppose the I/O scheduler for the hard disk is noop and
10 processes run concurrently, each doing stride reads. In this case
the number of pending requests waiting to be dispatched to the hard
disk should stay at around 8~9, assuming the hard disk can handle at
most 2 requests concurrently (each process keeps one read outstanding,
so of the 10 in flight only 1~2 are being serviced and 8~9 remain
queued). This makes sense.
Now suppose there is only one VM and the ten processes run inside the
guest system. The disk mode is tap2:aio, which means a process called
tapdisk2 runs in the host system; it handles all the requests from the
domU and dispatches them to the real hard disk. In that case, from the
host's point of view, the number of pending requests should still be
around 8~9, because tapdisk2 handles requests asynchronously.
However, it turns out that my assumption is wrong. The number of
pending requests, according to the blktrace output, changes like
this: 9 8 7 6 5 4 3 2 1 1 1 2 3 4 5 4 3 2 1 1 1 2 3 4 5 6 7 8
8 8..., just like a curve.
I am puzzled by this result. Can anybody explain what has happened
between domU and dom0 to produce it? Does this result make sense, or
did I do something wrong?
I am using Xen-4.0.0-rc5; the kernel version in the host is 2.6.31.13
and the kernel in the guest system is 2.6.18.
Thanks,
Yuehai
* Re: questions about the number of pending requests that the host system can detect
From: Jeremy Fitzhardinge @ 2010-08-12 18:04 UTC
To: Yuehai Xu; +Cc: yuehai.xu, xen-devel, yhxu
On 08/11/2010 08:42 PM, Yuehai Xu wrote:
> However, it turns out that my assumption is wrong. The number of
> pending requests, according to the blktrace output, changes like
> this: 9 8 7 6 5 4 3 2 1 1 1 2 3 4 5 4 3 2 1 1 1 2 3 4 5 6 7 8
> 8 8..., just like a curve.
>
> I am puzzled by this result. Can anybody explain what has happened
> between domU and dom0 to produce it? Does this result make sense, or
> did I do something wrong?
If you're using a journalled filesystem in the guest, it will need to
drain the IO queue periodically to control the write ordering. You
should also observe barrier writes in the blkfront stream.
J
* Re: questions about the number of pending requests that the host system can detect
From: Yuehai Xu @ 2010-08-12 18:16 UTC
To: Jeremy Fitzhardinge; +Cc: yuehai.xu, xen-devel, yhxu
On Thu, Aug 12, 2010 at 2:04 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> If you're using a journalled filesystem in the guest, it will need to
> drain the IO queue periodically to control the write ordering. You should
> also observe barrier writes in the blkfront stream.
>
> J
>
The file system I use in the guest is ext3, which is a journaled file
system. However, I don't quite understand what you mean by "... control
the write ordering", because the 10 processes running in the guest all
just send requests; there are no write requests. What do you mean by
"barrier writes" here?
Thanks,
Yuehai
* Re: questions about the number of pending requests that the host system can detect
From: Yuehai Xu @ 2010-08-12 18:18 UTC
To: Jeremy Fitzhardinge; +Cc: yuehai.xu, xen-devel, yhxu
On Thu, Aug 12, 2010 at 2:16 PM, Yuehai Xu <yuehaixu@gmail.com> wrote:
> The file system I use in the guest is ext3, which is a journaled file
> system. However, I don't quite understand what you mean by "... control
> the write ordering", because the 10 processes running in the guest all
> just send requests; there are no write requests. What do you mean by
> "barrier writes" here?
>
> Thanks,
> Yuehai
>
Sorry, I left out a word there: the requests sent by the 10 processes
in the guest system are all read requests.
Thanks,
Yuehai
* Re: questions about the number of pending requests that the host system can detect
From: Jeremy Fitzhardinge @ 2010-08-12 18:21 UTC
To: Yuehai Xu; +Cc: yuehai.xu, xen-devel, yhxu
On 08/12/2010 11:18 AM, Yuehai Xu wrote:
> Sorry, I left out a word there: the requests sent by the 10 processes
> in the guest system are all read requests.
Even a pure read-only workload may generate writes for metadata unless
you've turned it off. Is it a read-only mount? Do you have the noatime
mount option? Is the device itself read-only?
Still, it seems odd that it won't/can't keep the queue full of read
requests. Unless it's getting local cache hits?
J
* Re: questions about the number of pending requests that the host system can detect
From: Yuehai Xu @ 2010-08-12 18:36 UTC
To: Jeremy Fitzhardinge; +Cc: yuehai.xu, xen-devel, yhxu
On Thu, Aug 12, 2010 at 2:21 PM, Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> Even a pure read-only workload may generate writes for metadata unless
> you've turned it off. Is it a read-only mount? Do you have the noatime
> mount option? Is the device itself read-only?
>
The definition of my disk is: ['tap2:aio:/PATH/dom.img, hda1, w'], so
I think it is not a read-only mount, and I don't pass any specific
options to mount. The device itself should be read-write.
> Still, it seems odd that it won't/can't keep the queue full of read
> requests. Unless it's getting local cache hits?
>
> J
>
I don't think the local cache would be hit, because every time I ran
the test I dropped the caches in both the guest and the host OS. And
since the access pattern is a stride read, it should not hit the cache
anyway.
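For reference, the cache-drop step can look roughly like the sketch
below. This is an assumed reconstruction using the standard Linux
drop_caches interface, not necessarily the exact commands used for
these tests; it has to run as root, in both the guest and the host.

#!/usr/bin/env python3
# Flush dirty data, then drop the page cache, dentries and inodes via
# the standard /proc/sys/vm/drop_caches interface (value 3 = all three).
import os

os.sync()                                      # write back dirty pages first
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")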
I am not sure whether there are any write requests; even if there are,
the number of write requests should be very small. Would that affect
the I/O queue of the guest or the host? I don't think so. Common sense
says the I/O queue in the host system should be almost full, because
tapdisk2 is asynchronous.
Thanks,
Yuehai
* Re: questions about the number of pending requests that the host system can detect
From: Daniel Stodden @ 2010-08-15 20:12 UTC
To: Yuehai Xu
Cc: Jeremy Fitzhardinge, xen-devel@lists.xensource.com,
yhxu@wayne.edu, yuehai.xu@gmail.com
On Thu, 2010-08-12 at 14:36 -0400, Yuehai Xu wrote:
> The definition of my disk is: ['tap2:aio:/PATH/dom.img, hda1, w'], so
> I think it is not a read-only mount, and I don't pass any specific
> options to mount. The device itself should be read-write.
>
> I don't think the local cache would be hit, because every time I ran
> the test I dropped the caches in both the guest and the host OS. And
> since the access pattern is a stride read, it should not hit the cache
> anyway.
>
> I am not sure whether there are any write requests; even if there are,
> the number of write requests should be very small. Would that affect
> the I/O queue of the guest or the host? I don't think so. Common sense
> says the I/O queue in the host system should be almost full, because
> tapdisk2 is asynchronous.
Most of what is coming to my mind has already been mentioned above.
Maybe try a read-only mount to avoid metadata updates.
What do you mean by stride read? Just reads with some fixed stride? What
stride size? Did you make sure to turn off OS readahead (iirc 128k)?
What's the underlying storage type? If it's a file, was the data fully
preallocated?
If the request offsets qualify for a merge, then blktap will merge them
quite aggressively, so you will see a lot of the I/O complete in larger
discrete chunks rather than incrementally, request by request.
How did you sample the number of pending requests?
Cheers,
Daniel
* Re: questions about the number of pending requests that the host system can detect
From: Yuehai Xu @ 2010-08-16 2:41 UTC
To: Daniel Stodden; +Cc: Jeremy Fitzhardinge, xen-devel, yhxu, yuehai.xu
On Sun, Aug 15, 2010 at 4:12 PM, Daniel Stodden
<daniel.stodden@citrix.com> wrote:
> Most of what is coming to my mind has already been mentioned above.
> Maybe try a read-only mount to avoid metadata updates.
I compiled Linux 2.6.31.13 as the guest kernel instead of the original
2.6.18 and the problem disappears: even when I run 10 processes
stride-reading data in the guest system, from the host level the number
of pending requests stays at around 8~9. This makes sense.
>
> What do you mean by stride read? Just reads with some fixed stride? What
> stride size? Did you make sure to turn off OS readahead (iirc 128k)?
> What's the underlying storage type? If it's a file, was the data fully
> preallocated?
Stride read here means just what you understood; sorry for not
explaining it clearly. The stride size is 8K, so readahead should not
be triggered, and the blktrace output confirms this as well.
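In case it helps, below is a minimal sketch of what one of the 10
reader processes does. The file path, read size and exact stride
pattern are illustrative assumptions (the only stated parameter is the
8K stride), and the caches are assumed to have been dropped beforehand.

#!/usr/bin/env python3
# One stride-reading process: read a small block, then seek ahead so
# that consecutive reads start a fixed stride apart. Ten such processes
# run concurrently in the guest.
import os, sys

PATH = sys.argv[1] if len(sys.argv) > 1 else "/data/testfile"  # illustrative
READ_SIZE = 4096   # bytes per read request (assumed)
STRIDE = 8192      # distance between consecutive read offsets (8K)

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
offset = 0
while offset + READ_SIZE <= size:
    os.pread(fd, READ_SIZE, offset)   # one read request per stride
    offset += STRIDE
os.close(fd)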
>
> If the request offsets qualify for a merge, then blktap will merge them
> quite aggressively, so you will see a lot of the I/O complete in larger
> discrete chunks rather than incrementally, request by request.
>
> How did you sample the number of pending requests?
The number of pending requests can be sampled like this: run blktrace
in the host OS; the 6th column of its output indicates the status of
each request, and from these status codes we can tell when a request
is inserted at the block device layer and when it is dispatched to the
hard disk. From that, the number of pending requests can be derived.
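As an illustration of that counting procedure, the sketch below tracks
the queue depth from blkparse text output by incrementing on insert
('I') events and decrementing on dispatch ('D') events. It assumes the
default blkparse line format, where the action code is the 6th
whitespace-separated field; it is only a reconstruction of the method
described above, not the exact script used.

#!/usr/bin/env python3
# Count pending requests (inserted but not yet dispatched) over time.
# Usage (assumed): blkparse -i <trace> | python3 count_pending.py
import sys

pending = 0
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 6:
        continue                      # skip blank and summary lines
    timestamp, action = fields[3], fields[5]
    if action == "I":                 # request inserted into the queue
        pending += 1
        print(timestamp, pending)
    elif action == "D":               # request issued to the disk
        pending -= 1
        print(timestamp, pending)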
As far as I know, the non-work-conserving I/O schedulers, such as CFQ
and AS, rely on per-process information. However, there is only one
process (tapdisk) in the host system handling all the requests from a
guest system, so it is impossible for the I/O scheduler in the host OS
to recognize a particular process in the guest OS, and the anticipation
mechanism of CFQ (AS has been removed from the latest kernel branch,
since CFQ can also do anticipation) is effectively turned off. As a
result, for some workloads, especially when several processes run
concurrently in a guest OS, the throughput might be lowered, because
CFQ in the host system will never anticipate.
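As a side note for anyone reproducing this, the scheduler a device is
actually using can be checked, and switched for an experiment, through
sysfs. A small sketch follows; the device name "sda" is an assumption.

#!/usr/bin/env python3
# Show the current I/O scheduler (the active one appears in brackets)
# and switch it; writing the file requires root.
DEV = "sda"
path = "/sys/block/%s/queue/scheduler" % DEV

with open(path) as f:
    print("current:", f.read().strip())

with open(path, "w") as f:
    f.write("noop")    # e.g. switch to noop for a test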
Meanwhile, in the guest system the I/O scheduler can never know the
position of the real disk head, because several guest systems share a
single disk head. So what is the most suitable I/O scheduler for a
guest?
What do you think of the problems that the I/O schedulers in both the
guest and the host system face?
Thanks,
Yuehai