* Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
@ 2006-02-02 21:34 Steve Dobbelstein
2006-02-02 21:46 ` Anthony Liguori
0 siblings, 1 reply; 10+ messages in thread
From: Steve Dobbelstein @ 2006-02-02 21:34 UTC (permalink / raw)
To: xen-devel
While running some disk performance tests for VMX domains we noticed that
writes to the backend device for a VMX domain's disk go through the buffer
cache, that is, they are not written immediately to disk. Shouldn't the
I/Os go straight to the backend device, i.e., the device should be opened
with O_DIRECT or some such? From the domain's perspective it expects the
data to be physically on the device, but in reality it is not. There are
things, such as writes to a file system journal, that the OS in the domain
will expect to be on disk. If the whole system crashes before the buffer
cache in dom0 is written to disk, those writes may not be on the disk.
When the domain is started again it may find the file system in an
inconsistent state, due to writes to the journal that didn't make it to
disk, and may not be able to recover.
It seems to me that if a domain expects data to be physically on its
frontend device, then it should be physically on the backend device as
well. Or am I missing something from the bigger picture?
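The distinction raised here can be sketched from user space. The following is a minimal Python illustration, not qemu-dm code: `open_backend`, `write_sector`, and the `write_through` parameter are hypothetical names, and a temporary file stands in for the backend device. Without `O_SYNC`, a successful write only means the data reached the page cache; with `O_SYNC`, each write returns only after the kernel reports the data as written to the underlying storage.

```python
import os
import tempfile


def open_backend(path, write_through):
    """Open a backing file for writing (hypothetical helper).

    write_through=False: writes land in the page cache and are flushed later.
    write_through=True: O_SYNC makes each write wait until the kernel
    reports the data as written to the underlying storage.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if write_through:
        flags |= os.O_SYNC
    return os.open(path, flags, 0o600)


def write_sector(fd, sector_number, data, sector_size=512):
    """Write one sector-sized block at the given sector offset."""
    if len(data) != sector_size:
        raise ValueError("data must be exactly one sector")
    return os.pwrite(fd, data, sector_number * sector_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "disk.img")  # stand-in for the backend device
        fd = open_backend(path, write_through=True)
        try:
            write_sector(fd, 3, b"J" * 512)
        finally:
            os.close(fd)
        print(os.path.getsize(path))
```

Whether the data is truly on the platters still depends on the layers below the kernel, such as a drive's own write cache.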
Steve D.
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-02 21:34 Shouldn't backend devices for VMX domain disks be opened with O_DIRECT? Steve Dobbelstein
@ 2006-02-02 21:46 ` Anthony Liguori
2006-02-02 22:28 ` Steve Dobbelstein
0 siblings, 1 reply; 10+ messages in thread
From: Anthony Liguori @ 2006-02-02 21:46 UTC (permalink / raw)
To: Steve Dobbelstein; +Cc: xen-devel
Steve Dobbelstein wrote:
>While running some disk performance tests for VMX domains we noticed that
>writes to the backend device for a VMX domain's disk go through the buffer
>cache, that is, they are not written immediately to disk. Shouldn't the
>I/Os go straight to the backend device, i.e., the device should be opened
>with O_DIRECT or some such? From the domain's perspective it expects the
>data to be physically on the device, but in reality it is not. There are
>things, such as writes to a file system journal, that the OS in the domain
>will expect to be on disk. If the whole system crashes before the buffer
>cache in dom0 is written to disk, those writes may not be on the disk.
>When the domain is started again it may find the file system in an
>inconsistent state, due to writes to the journal that didn't make it to
>disk, and may not be able to recover.
>
>It seems to me that if a domain expects things to be physically on its
>frontend device that they should be physically on the backend device as
>well. Or am I missing something from the bigger picture?
>
>
I would doubt it. Since it's usually opening a file, and qemu-dm is
emulating a contiguous disk, you probably want the buffer cache to
reorder events.
Are you seeing a performance improvement? Should be easy to check.
Regards,
Anthony Liguori
>Steve D.
>
>
>_______________________________________________
>Xen-devel mailing list
>Xen-devel@lists.xensource.com
>http://lists.xensource.com/xen-devel
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-02 21:46 ` Anthony Liguori
@ 2006-02-02 22:28 ` Steve Dobbelstein
2006-02-02 22:41 ` Philip R. Auld
0 siblings, 1 reply; 10+ messages in thread
From: Steve Dobbelstein @ 2006-02-02 22:28 UTC (permalink / raw)
To: xen-devel
aliguori@us.ltcfwd.linux.ibm.com wrote on 02/02/2006 03:46:11 PM:
> I would doubt it. Since it's usually opening a file, and qemu-dm is
> emulating a contiguous disk, you probably want the buffer cache to
> reorder events.
I guess we're not usual since our backend is an LVM volume. :)
I can appreciate how writing to the buffer cache can speed up the response
to the I/O and make it more efficient in its writing to the backend device
by reordering events. However, I'm still wondering if we have a data
corruption issue should dom0 crash before it writes the data in the buffer
cache to disk, data that the domain expects to be on the disk but won't be
there when the domain is restarted.
> Are you seeing a performance improvement? Should be easy to check.
We just started doing the first runs of disk performance tests when we
noticed this behavior and thought we should bring it up on the list. We
don't have enough data points to compare yet. We'll post problems/issues
if/when we find them.
Steve D.
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-02 22:28 ` Steve Dobbelstein
@ 2006-02-02 22:41 ` Philip R. Auld
2006-02-03 0:09 ` Anthony Liguori
0 siblings, 1 reply; 10+ messages in thread
From: Philip R. Auld @ 2006-02-02 22:41 UTC (permalink / raw)
To: Steve Dobbelstein; +Cc: xen-devel
Hi,
Rumor has it that on Thu, Feb 02, 2006 at 04:28:37PM -0600 Steve Dobbelstein said:
> aliguori@us.ltcfwd.linux.ibm.com wrote on 02/02/2006 03:46:11 PM:
>
> > I would doubt it. Since it's usually opening a file, and qemu-dm is
> > emulating a contiguous disk, you probably want the buffer cache to
> > reorder events.
>
> I guess we're not usual since our backend is an LVM volume. :)
>
> I can appreciate how writing to the buffer cache can speed up the response
> to the I/O and make it more efficient in its writing to the backend device
> by reordering events. However, I'm still wondering if we have a data
> corruption issue should dom0 crash before it writes the data in the buffer
> cache to disk, data that the domain expects to be on the disk but won't be
> there when the domain is restarted.
I agree. It sounds like a correctness problem. It's just like disks
with write caching enabled.
>
> > Are you seeing a performance improvement? Should be easy to check.
>
It's more about correctness and data integrity than performance.
Cheers,
Phil
> We just started doing the first runs of disk performance tests when we
> noticed this behavior and thought we should bring it up on the list. We
> don't have enough data points to compare yet. We'll post problems/issues
> if/when we find them.
>
> Steve D.
--
Philip R. Auld, Ph.D. Egenera, Inc.
Software Architect 165 Forest St.
(508) 858-2628 Marlboro, MA 01752
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-02 22:41 ` Philip R. Auld
@ 2006-02-03 0:09 ` Anthony Liguori
2006-02-03 0:31 ` Luciano Miguel Ferreira Rocha
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Anthony Liguori @ 2006-02-03 0:09 UTC (permalink / raw)
To: Philip R. Auld; +Cc: Steve Dobbelstein, xen-devel
Philip R. Auld wrote:
>Hi,
>
>Rumor has it that on Thu, Feb 02, 2006 at 04:28:37PM -0600 Steve Dobbelstein said:
>
>
>>aliguori@us.ltcfwd.linux.ibm.com wrote on 02/02/2006 03:46:11 PM:
>>
>>
>>
>>>I would doubt it. Since it's usually opening a file, and qemu-dm is
>>>emulating a contiguous disk, you probably want the buffer cache to
>>>reorder events.
>>>
>>>
>>I guess we're not usual since our backend is an LVM volume. :)
>>
>>I can appreciate how writing to the buffer cache can speed up the response
>>to the I/O and make it more efficient in its writing to the backend device
>>by reordering events. However, I'm still wondering if we have a data
>>corruption issue should dom0 crash before it writes the data in the buffer
>>cache to disk, data that the domain expects to be on the disk but won't be
>>there when the domain is restarted.
>>
>>
>
>I agree. It sounds like a correctness problem. It's just like disks
>with write caching enabled.
>
>
Referring to the original question, which has been quoted away,
journaling doesn't require that data be written to disk per se but that
writes occur in a particular order. A journal is always recoverable
given that writes occur in the expected order. A buffer cache will have
no effect on that order so you're no more likely to have corruption than
if you disabled the buffer cache.
You especially want the buffer cache if you have LVM partitions.
Sectors on an LVM disk are not necessarily contiguous and can even span
multiple disks. You definitely want the IO scheduler involved there.
If anything, what you really want (from a performance perspective) is to
disable the buffer cache in the domU and leave it enabled in the dom0
(this is what the paravirtual drivers should be doing IIRC).
Does this address your corruption concerns?
Regards,
Anthony Liguori
>>>Are you seeing a performance improvement? Should be easy to check.
>>>
>>>
>
>It's more about correctness and data integrity than performance.
>
>
>Cheers,
>
>Phil
>
>
>
>
>>We just started doing the first runs of disk performance tests when we
>>noticed this behavior and thought we should bring it up on the list. We
>>don't have enough data points to compare yet. We'll post problems/issues
>>if/when we find them.
>>
>>Steve D.
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-03 0:09 ` Anthony Liguori
@ 2006-02-03 0:31 ` Luciano Miguel Ferreira Rocha
2006-02-03 2:40 ` Rik van Riel
2006-02-03 2:42 ` Stephen Tweedie
2 siblings, 0 replies; 10+ messages in thread
From: Luciano Miguel Ferreira Rocha @ 2006-02-03 0:31 UTC (permalink / raw)
To: xen-devel
On Thu, Feb 02, 2006 at 06:09:11PM -0600, Anthony Liguori wrote:
> >I agree. It sounds like a correctness problem. It's just like disks
> >with write caching enabled.
> >
> >
> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order. A buffer cache will have
> no effect on that order so you're no more likely to have corruption than
> if you disabled the buffer cache.
Corruption here means that the domU thinks data has been committed to disk
when it never was (dom0 crashed before the cache could be flushed).
The correctness of some protocols or procedures depends on being able to
forcefully commit changes to disk (databases, for example).
> If anything, what you really want (from a performance perspective) is to
> disable the buffer cache in the domU and leave it enabled in the dom0
> (this is what the paravirtual drivers should be doing IIRC).
I disagree. domU must be able to sync(). And if domUs are already
caching data, why let them pollute dom0's cache?
--
lfr
0/0
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-03 0:09 ` Anthony Liguori
2006-02-03 0:31 ` Luciano Miguel Ferreira Rocha
@ 2006-02-03 2:40 ` Rik van Riel
2006-02-03 2:42 ` Stephen Tweedie
2 siblings, 0 replies; 10+ messages in thread
From: Rik van Riel @ 2006-02-03 2:40 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Steve Dobbelstein, Philip R. Auld, xen-devel
On Thu, 2 Feb 2006, Anthony Liguori wrote:
> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order.
If I do a database transaction or accept an email (SMTP transaction),
I need to ensure that the data really did make it to disk.
There is a reason that the email RFCs explicitly state that the email
has to be committed to stable storage before returning the "250 Ok"!
--
All Rights Reversed
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-03 0:09 ` Anthony Liguori
2006-02-03 0:31 ` Luciano Miguel Ferreira Rocha
2006-02-03 2:40 ` Rik van Riel
@ 2006-02-03 2:42 ` Stephen Tweedie
2006-02-03 2:50 ` Anthony Liguori
2 siblings, 1 reply; 10+ messages in thread
From: Stephen Tweedie @ 2006-02-03 2:42 UTC (permalink / raw)
To: Anthony Liguori
Cc: Steve Dobbelstein, Philip R. Auld, xen-devel@lists.xensource.com
Hi,
On Thu, 2006-02-02 at 18:09 -0600, Anthony Liguori wrote:
> Referring to the original question, which has been quoted away,
> journaling doesn't require that data be written to disk per se but that
> writes occur in a particular order. A journal is always recoverable
> given that writes occur in the expected order.
Sure... it's *internally* consistent, maybe. But you need more than
that. You need guarantees that things are on disk, else external
consistency guarantees will be broken.
Consider things like sendmail fsync()ing a spool file before telling the
sender that the email has been accepted. After that acknowledgement,
the sender can delete the mail from its queues knowing that the
recipient MTA definitely has the data, and even if it crashes, the mail
won't be lost. Databases frequently have similar consistency
requirements. If a power failure loses writes that you have told the
domU have completed --- even if you maintain write ordering --- then you
*are* putting application correctness at risk, there's no doubt about
it.
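As a rough illustration of the sendmail case, here is a minimal Python sketch of "flush before acknowledging". The names `spool_message` and `accept_mail` are hypothetical and are not taken from sendmail. Inside a domU, the `fsync()` call is only as strong as what dom0 does with the resulting write, which is the concern in this thread.

```python
import os
import tempfile


def spool_message(spool_dir, name, body):
    """Write a message to the spool and return only once it is flushed.

    fsync() on the file asks the kernel to push the file's data to stable
    storage; fsync() on the directory does the same for the new entry.
    """
    path = os.path.join(spool_dir, name)
    with open(path, "xb") as spool_file:
        spool_file.write(body)
        spool_file.flush()              # drain Python's userspace buffer
        os.fsync(spool_file.fileno())   # ask the kernel to flush to storage
    dir_fd = os.open(spool_dir, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


def accept_mail(spool_dir, name, body):
    """Acknowledge the sender only after the message has been flushed."""
    spool_message(spool_dir, name, body)
    return "250 Ok"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spool:
        print(accept_mail(spool, "msg-0001", b"Subject: test\r\n\r\nhello\r\n"))
```

If the layer underneath reports completion while the data is still in a volatile cache, the acknowledgement is sent too early even though the application did everything right.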
> A buffer cache will have
> no effect on that order so you're no more likely to have corruption than
> if you disabled the buffer cache.
Not if it's being used as a write-through cache. If it's write-back, it
will have a major impact on ordering.
> You especially want the buffer cache if you have LVM partitions.
> Sectors on an LVM disk are not necessarily contiguous and can even span
> multiple disks. You definitely want the IO scheduler involved there.
That does not at all imply the use of the buffer cache. All that you
need to satisfy this is AIO (asynchronous *submission* of the IO)
combined with O_DIRECT IO (synchronous *completion*) --- i.e., you can
submit multiple IOs concurrently, but you know for sure when each one
completes. That still lets the elevator get strongly involved in the
scheduling and reordering of the IOs, but lets you know reliably when
things hit disk.
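The split between asynchronous submission and reliable completion can be sketched in user space. This is only an analogy and rests on several assumptions: it uses a thread pool rather than Linux AIO, and `O_DSYNC` write-through rather than `O_DIRECT`, so the data still passes through the page cache. `durable_pwrite` and `submit_writes` are hypothetical names. What it shows is that several writes are in flight at once, yet each one is reported individually once the kernel says it is written.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def durable_pwrite(fd, data, offset):
    """Write data at offset on an O_DSYNC descriptor.

    Returns the offset once the kernel reports the data as written through.
    """
    written = 0
    while written < len(data):
        written += os.pwrite(fd, data[written:], offset + written)
    return offset


def submit_writes(path, requests, workers=4):
    """Submit (offset, data) requests concurrently.

    Returns the offsets in the order in which the writes completed.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    completed = []
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(durable_pwrite, fd, data, offset)
                       for offset, data in requests]
            for future in as_completed(futures):
                completed.append(future.result())  # this write has completed
    finally:
        os.close(fd)
    return completed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        reqs = [(i * 512, bytes([65 + i]) * 512) for i in range(4)]
        print(submit_writes(image, reqs))
```

The completion order may differ from the submission order, which is what leaves room for the I/O scheduler to reorder requests.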
Fortunately, that's just what blkback is doing --- it's using submit_bio
to submit the write IOs without waiting for completion, and is using the
bio's bi_end_io callback to process the IO completion once it is hard on
disk.
--Stephen
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-03 2:42 ` Stephen Tweedie
@ 2006-02-03 2:50 ` Anthony Liguori
2006-02-03 15:42 ` Stephen C. Tweedie
0 siblings, 1 reply; 10+ messages in thread
From: Anthony Liguori @ 2006-02-03 2:50 UTC (permalink / raw)
To: Stephen Tweedie
Cc: Steve Dobbelstein, Philip R. Auld, xen-devel@lists.xensource.com
Stephen Tweedie wrote:
>Hi,
>
>On Thu, 2006-02-02 at 18:09 -0600, Anthony Liguori wrote:
>
>
>
>>Referring to the original question, which has been quoted away,
>>journaling doesn't require that data be written to disk per se but that
>>writes occur in a particular order. A journal is always recoverable
>>given that writes occur in the expected order.
>>
>>
>
>Sure... it's *internally* consistent, maybe. But you need more than
>that. You need guarantees that things are on disk, else external
>consistency guarantees will be broken.
>
>
Ok, this is certainly correct (but not the original point).
>Consider things like sendmail fsync()ing a spool file before telling the
>sender that the email has been accepted. After that acknowledgement,
>the sender can delete the mail from its queues knowing that the
>recipient MTA definitely has the data, and even if it crashes, the mail
>won't be lost. Databases frequently have similar consistency
>requirements. If a power failure loses writes that you have told the
>domU have completed --- even if you maintain write ordering --- then you
>*are* putting application correctness at risk, there's no doubt about
>it.
>
>
Ok, this is a good argument for using O_SYNC.
>Fortunately, that's just what blkback is doing --- it's using submit_bio
>to submit the write IOs without waiting for completion, and is using the
>bio's bi_end_io callback to process the IO completion once it is hard on
>disk.
>
>
Yup, the question here is with the device model which doesn't use the
block frontend/backend. Would O_DIRECT be helpful over O_SYNC?
Regards,
Anthony Liguori
>--Stephen
>
>
>
>
>
* Re: Shouldn't backend devices for VMX domain disks be opened with O_DIRECT?
2006-02-03 2:50 ` Anthony Liguori
@ 2006-02-03 15:42 ` Stephen C. Tweedie
0 siblings, 0 replies; 10+ messages in thread
From: Stephen C. Tweedie @ 2006-02-03 15:42 UTC (permalink / raw)
To: Anthony Liguori
Cc: Philip R. Auld, xen-devel@lists.xensource.com, Steve Dobbelstein
Hi,
On Thu, 2006-02-02 at 20:50 -0600, Anthony Liguori wrote:
> Yup, the question here is with the device model which doesn't use the
> block frontend/backend. Would O_DIRECT be helpful over O_SYNC?
There are really two separate parts to that.
First is whether write-through caching is helpful. My own gut reaction
is that it is not --- it implies an extra copy in the dom0, which is both a CPU
overhead to make the copy and a memory overhead to preserve it. It does
mean that subsequent reads will be faster, but it should be up to the
domU to decide whether that is useful or not, not the dom0 --- a domU
running a database using O_DIRECT **really** doesn't want dom0 doing
any extra caching on its writes.
Second is whether O_DIRECT fits the IO model. O_DIRECT has a lot of
extra constraints on the sort of IO you can do --- it has to be
sector-aligned in memory, in size and on disk, for example. That will
probably fit neatly into the environment that a block device backend is
running in, but something doing filesystem-level IO forwarding won't
always have the same alignment guarantees. (The page cache gives us the
right alignment in a lot of cases, though.)
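The alignment constraints can be made concrete with a small Python sketch. It assumes a 512-byte logical sector (many devices use 4096), and `is_direct_io_ok` and `direct_write` are hypothetical helpers. An anonymous `mmap` is used for the buffer because it is page-aligned, which covers the in-memory requirement. The function returns `False` rather than raising when the platform or filesystem refuses `O_DIRECT`.

```python
import mmap
import os
import tempfile

SECTOR = 512  # assumed logical sector size; real devices may use 4096


def is_direct_io_ok(offset, length, sector=SECTOR):
    """True if the file offset and transfer size are both sector-aligned."""
    return length > 0 and offset % sector == 0 and length % sector == 0


def direct_write(path, offset, payload, sector=SECTOR):
    """Attempt an O_DIRECT write; return True only if it fully succeeded."""
    if not hasattr(os, "O_DIRECT"):
        return False  # platform has no O_DIRECT flag
    if not is_direct_io_ok(offset, len(payload), sector):
        return False  # request violates the alignment constraints
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
    except OSError:
        return False  # filesystem refused O_DIRECT
    buf = mmap.mmap(-1, len(payload))  # anonymous mapping, page-aligned
    try:
        buf.write(payload)
        return os.pwrite(fd, buf, offset) == len(payload)
    except OSError:
        return False  # e.g. the device needs a larger alignment
    finally:
        os.close(fd)
        buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        image = os.path.join(tmp, "disk.img")
        print(direct_write(image, 0, b"x" * 4096))  # aligned request
        print(direct_write(image, 100, b"x" * 512))  # misaligned offset
```

A block device backend naturally produces requests that pass the alignment check, whereas a layer forwarding arbitrary file-level I/O would need to pad or bounce-buffer requests that do not.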
--Stephen