qemu-devel.nongnu.org archive mirror
* Status of DAX for virtio-fs/virtiofsd?
@ 2023-05-17 15:23 Alex Bennée
  2023-05-17 16:26 ` Stefan Hajnoczi
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2023-05-17 15:23 UTC
  To: QEMU Developers; +Cc: Stefan Hajnoczi, virtio-fs, Erik Schilling

Hi,

I remember from the dim and distant past (probably a KVM Forum) that one
of the touted features of virtio-fs was the ability to save memory by
mapping host pages directly into the guest address space.

AFAICT the kernel side was merged a while ago, see 22f3787 (virtiofs:
set up virtio_fs dax_device) and related. However, when investigating
what would be needed to support this for Xen guests using virtio-fs, we
were unsure what else was needed.

There were some patches for the old C daemon:

  Subject: [PATCH v3 00/26] virtiofs dax patches
  Date: Wed, 28 Apr 2021 12:00:34 +0100
  Message-ID: <20210428110100.27757-1-dgilbert@redhat.com>

although they look like they were never merged and the C version of
virtiofsd has since been dropped from tools.

Looking at the supporting rust code (vhost_user/message.rs) there are a
number of additional messages:

    /// Virtio-fs draft: map file content into the window.
    FS_MAP = 6,
    /// Virtio-fs draft: unmap file content from the window.
    FS_UNMAP = 7,
    /// Virtio-fs draft: sync file content.
    FS_SYNC = 8,
    /// Virtio-fs draft: perform a read/write from an fd directly to GPA.
    FS_IO = 9,
    /// Upper bound of valid commands.
    MAX_CMD = 10,

that don't appear in the current canonical vhost-user reference in the
QEMU repository and the QEMU code certainly doesn't have implementations
for any of them. So I have some questions:

 * What VMM/daemon combinations has DAX been tested on?
 * Isn't it time the vhost-user spec is updated?
 * Is anyone picking up Dave's patches for the QEMU side of support?
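
For anyone wanting to experiment, here is a minimal sketch of how the
draft codes quoted above could be modelled and decoded. The numeric
values are the ones from the quoted snippet; the type name
FsDraftRequest and the helper from_raw are hypothetical and are not
taken from the vhost crate or from the vhost-user spec.

```rust
/// Draft virtio-fs request codes, using the values quoted above from
/// vhost_user/message.rs. The enum name is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u32)]
pub enum FsDraftRequest {
    /// Map file content into the window.
    Map = 6,
    /// Unmap file content from the window.
    Unmap = 7,
    /// Sync file content.
    Sync = 8,
    /// Perform a read/write from an fd directly to GPA.
    Io = 9,
}

/// Upper bound of valid commands, as in the quoted snippet.
pub const MAX_CMD: u32 = 10;

impl FsDraftRequest {
    /// Decode a raw request code. Anything other than the four draft
    /// codes yields None, so a receiver can reject it explicitly.
    pub fn from_raw(code: u32) -> Option<Self> {
        match code {
            6 => Some(Self::Map),
            7 => Some(Self::Unmap),
            8 => Some(Self::Sync),
            9 => Some(Self::Io),
            _ => None,
        }
    }
}
```

A frontend that only follows the published spec would take the None
path for all four codes, which is the gap described above.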

Thanks,


-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: Status of DAX for virtio-fs/virtiofsd?
  2023-05-17 15:23 Status of DAX for virtio-fs/virtiofsd? Alex Bennée
@ 2023-05-17 16:26 ` Stefan Hajnoczi
  2023-05-18 19:45   ` Vivek Goyal
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Stefan Hajnoczi @ 2023-05-17 16:26 UTC
  To: Alex Bennée
  Cc: QEMU Developers, Stefan Hajnoczi, virtio-fs, Erik Schilling,
	Vivek Goyal

On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
Hi Alex,
There were two unresolved issues:

1. How to inject SIGBUS when the guest accesses a page that's beyond
the end-of-file.
2. Implementing the vhost-user messages for mapping ranges of files to
the vhost-user frontend.

The harder problem is SIGBUS. An mmap area may be larger than the
length of the file. Or another process could truncate the file while
it's mmapped, causing a previously correctly sized mmap to become
longer than the actual file. When a page beyond the end of file is
accessed, the kernel raises SIGBUS.
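
To make the arithmetic concrete, here is a rough sketch that computes
where in a mapping the first SIGBUS would occur. It relies on the usual
mmap behaviour that the page containing end-of-file is still backed
(its tail reads as zeroes) and only whole pages past it fault. The
function name and parameters are made up for illustration.

```rust
/// Offset, relative to the start of the mapping, of the first byte whose
/// access would raise SIGBUS, or None if every page of the mapping is
/// backed by the file. Hypothetical helper, not from any real codebase.
///
/// `map_offset` is the file offset at which the mapping starts.
/// Overflow for lengths near u64::MAX is not handled in this sketch.
pub fn first_sigbus_offset(
    file_len: u64,
    map_offset: u64,
    map_len: u64,
    page_size: u64,
) -> Option<u64> {
    assert!(page_size.is_power_of_two());
    if map_len == 0 {
        return None;
    }
    // The page holding EOF is still backed, so round the file length
    // up to the next page boundary.
    let backed_end = (file_len + page_size - 1) & !(page_size - 1);
    let map_end = map_offset + map_len;
    if map_end <= backed_end {
        None
    } else {
        // If the mapping starts past the backed range, byte 0 faults.
        Some(backed_end.saturating_sub(map_offset))
    }
}
```

For example, with 4 KiB pages a 16 KiB mapping of a 10000-byte file is
backed up to offset 12288; if another process truncates the file to 100
bytes, everything from offset 4096 onwards faults.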

When this scenario occurs in the DAX Window, kvm.ko gets some type of
vmexit (fault) and the code currently enters an infinite loop because
it expects KVM memory regions to resolve faults. Since there is no
page backing that part of the vma, the fault handling fails and the
code loops trying to do this forever.

There needs to be a way to inject this fault back into the guest.
However, we did not find a way to do that. We considered Machine
Check Exceptions (MCEs), x86 interrupts, and paravirtualized
approaches. None of them looked like a clean and sane way to do this.
The Linux maintainers for MCEs and kvm.ko were not excited about
supporting this.

So in the end, SIGBUS was never solved. It leads to a DoS because the
host kernel will enter an infinite loop. We decided that until there
is progress on SIGBUS, we can't go ahead with DAX Windows in
production.

The easier problem is adding new vhost-user messages. It does lead to
a fundamental change in the vhost-user protocol: the presence of the
DAX Window means there are memory ranges that cannot be accessed via
shared memory. Imagine Device A has a DAX Window and Device B needs to
DMA to/from it. That doesn't work because the mmaps happen inside the
frontend (QEMU), so Device B doesn't have access to the current
mappings. The fundamental change to vhost-user is that virtqueue
descriptor mapping code must now deal with the situation where guest
addresses are absent from the shared memory regions, and must then send
vhost-user protocol messages to read/write to/from bounce buffers.
The rest of the device backend does not require modification.
This is a slow path, but at least it works whereas currently the I/O
would fail because the memory is absent. Other solutions to the
vhost-user DMA problem exist, but this is the one that Dave and I last
discussed.
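
As a rough illustration of that slow path, here is a sketch of a
descriptor read that first tries the shared memory regions and falls
back to a bounce buffer filled by the frontend. Everything in it is
hypothetical: SharedRegion, FrontendIo, read_guest and MockFrontend are
invented names, and the trait merely stands in for whatever vhost-user
message would carry the request. It is not the protocol Dave and I
discussed, only the shape of the idea.

```rust
/// A guest-physical range the backend has mapped from shared memory.
/// The Vec stands in for the mmapped bytes.
pub struct SharedRegion {
    pub guest_base: u64,
    pub data: Vec<u8>,
}

/// Hypothetical slow-path channel to the frontend (not a real API).
pub trait FrontendIo {
    /// Ask the frontend to copy guest memory into `buf`.
    fn read_guest(&mut self, guest_addr: u64, buf: &mut [u8]) -> Result<(), String>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum Access {
    /// Served directly from a shared memory region.
    Direct,
    /// Served through a bounce buffer filled by the frontend.
    Bounced,
}

/// Read `len` bytes at `guest_addr`, preferring shared memory.
pub fn read_descriptor(
    regions: &[SharedRegion],
    frontend: &mut dyn FrontendIo,
    guest_addr: u64,
    len: usize,
) -> Result<(Vec<u8>, Access), String> {
    for r in regions {
        let region_end = r.guest_base + r.data.len() as u64;
        let fits = guest_addr >= r.guest_base
            && guest_addr
                .checked_add(len as u64)
                .map_or(false, |end| end <= region_end);
        if fits {
            let start = (guest_addr - r.guest_base) as usize;
            return Ok((r.data[start..start + len].to_vec(), Access::Direct));
        }
    }
    // Address is absent from shared memory (e.g. inside a DAX Window):
    // fall back to a bounce buffer filled via the frontend.
    let mut bounce = vec![0u8; len];
    frontend.read_guest(guest_addr, &mut bounce)?;
    Ok((bounce, Access::Bounced))
}

/// Test double that pretends the frontend owns a DAX Window.
pub struct MockFrontend {
    pub dax_base: u64,
    pub dax: Vec<u8>,
}

impl FrontendIo for MockFrontend {
    fn read_guest(&mut self, guest_addr: u64, buf: &mut [u8]) -> Result<(), String> {
        let start = guest_addr
            .checked_sub(self.dax_base)
            .ok_or("address below DAX Window")? as usize;
        let end = start
            .checked_add(buf.len())
            .filter(|&e| e <= self.dax.len())
            .ok_or("range outside DAX Window")?;
        buf.copy_from_slice(&self.dax[start..end]);
        Ok(())
    }
}
```

Only the descriptor-mapping layer knows about the fallback; callers
receive bytes either way, which is why the rest of the backend would
not need to change.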

In the end, there is still work to do to make the DAX Window
supportable. There is experimental code out there that kind of works,
but we felt it was incomplete.

To your specific questions:

>  * What VMM/daemon combinations has DAX been tested on?

Only the experimental virtio-fs Kata Containers kernels and QEMU
builds that were available a few years ago. I don't think the code has
been rebased.

>  * Isn't it time the vhost-user spec is updated?

I don't know if Dave ever wrote the spec for or implemented the final
version of the vhost-user protocol messages we discussed.

>  * Is anyone picking up Dave's patches for the QEMU side of support?

Not at the moment. It would be nice to support, but someone needs the
energy/time/focus to deal with the outstanding issues I mentioned.

If you want to work on it, feel free to include me. I can help dig up
old discussions and give input.

Stefan



* Re: Status of DAX for virtio-fs/virtiofsd?
  2023-05-17 16:26 ` Stefan Hajnoczi
@ 2023-05-18 19:45   ` Vivek Goyal
  2023-05-22 12:54   ` Alex Bennée
  2023-09-06 13:07   ` [Virtio-fs] " Hao Xu
  2 siblings, 0 replies; 7+ messages in thread
From: Vivek Goyal @ 2023-05-18 19:45 UTC
  To: Stefan Hajnoczi
  Cc: Alex Bennée, QEMU Developers, Stefan Hajnoczi, virtio-fs,
	Erik Schilling

On Wed, May 17, 2023 at 12:26:18PM -0400, Stefan Hajnoczi wrote:
> On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
> Hi Alex,
> There were two unresolved issues:
> 
> 1. How to inject SIGBUS when the guest accesses a page that's beyond
> the end-of-file.
> 2. Implementing the vhost-user messages for mapping ranges of files to
> the vhost-user frontend.
> 
> The harder problem is SIGBUS. An mmap area may be larger than the
> length of the file. Or another process could truncate the file while
> it's mmapped, causing a previously correctly sized mmap to become
> longer than the actual file. When a page beyond the end of file is
> accessed, the kernel raises SIGBUS.
> 
> When this scenario occurs in the DAX Window, kvm.ko gets some type of
> vmexit (fault) and the code currently enters an infinite loop because
> it expects KVM memory regions to resolve faults. Since there is no
> page backing that part of the vma, the fault handling fails and the
> code loops trying to do this forever.
> 
> There needs to be a way to inject this fault back into the guest.
> However, we did not find a way to do that. We considered Machine
> Check Exceptions (MCEs), x86 interrupts, and paravirtualized
> approaches. None of them looked like a clean and sane way to do this.
> The Linux maintainers for MCEs and kvm.ko were not excited about
> supporting this.
> 
> So in the end, SIGBUS was never solved. It leads to a DoS because the
> host kernel will enter an infinite loop. We decided that until there
> is progress on SIGBUS, we can't go ahead with DAX Windows in
> production.
> 
> The easier problem is adding new vhost-user messages. It does lead to
> a fundamental change in the vhost-user protocol: the presence of the
> DAX Window means there are memory ranges that cannot be accessed via
> shared memory. Imagine Device A has a DAX Window and Device B needs to
> DMA to/from it. That doesn't work because the mmaps happen inside the
> frontend (QEMU), so Device B doesn't have access to the current
> mappings. The fundamental change to vhost-user is that virtqueue
> descriptor mapping code must now deal with the situation where guest
> addresses are absent from the shared memory regions, and must then send
> vhost-user protocol messages to read/write to/from bounce buffers.
> The rest of the device backend does not require modification.
> This is a slow path, but at least it works whereas currently the I/O
> would fail because the memory is absent. Other solutions to the
> vhost-user DMA problem exist, but this is the one that Dave and I last
> discussed.
> 
> In the end, there is still work to do to make the DAX Window
> supportable. There is experimental code out there that kind of works,
> but we felt it was incomplete.

I feel it would be good if someone could solve the vhost-user problem
first and get patches upstream. The C virtiofsd has now been removed
from QEMU, so someone will have to add DAX support to the Rust virtiofsd
(and make the corresponding vhost-user changes in QEMU).

Once that is done, someone can look into the MCE issue.

With the vhost-user problem solved, DAX will be usable in non-shared mode.
That is, the host filesystem is just passed through into the guest and even
the host can't make modifications. That should steer us clear of the
truncation issue.

virtiofs DAX is a good piece of technology and provides a speedup in many
cases. It would be sad to see the patches lost.

People are now posting fixes to the kernel side of DAX and there is no good
way to test them. I will try to make it work with the old DAX branch David
had, so that kernel changes can be tested, but I am sure that at some point
it will stop working, and I don't want the virtiofs kernel DAX code to
become unstable.

It would be good if somebody took up this project and made it happen.

Thanks
Vivek

> 
> To your specific questions:
> 
> >  * What VMM/daemon combinations has DAX been tested on?
> 
> Only the experimental virtio-fs Kata Containers kernels and QEMU
> builds that were available a few years ago. I don't think the code has
> been rebased.
> 
> >  * Isn't it time the vhost-user spec is updated?
> 
> I don't know if Dave ever wrote the spec for or implemented the final
> version of the vhost-user protocol messages we discussed.
> 
> >  * Is anyone picking up Dave's patches for the QEMU side of support?
> 
> Not at the moment. It would be nice to support, but someone needs the
> energy/time/focus to deal with the outstanding issues I mentioned.
> 
> If you want to work on it, feel free to include me. I can help dig up
> old discussions and give input.
> 
> Stefan
> 




* Re: Status of DAX for virtio-fs/virtiofsd?
  2023-05-17 16:26 ` Stefan Hajnoczi
  2023-05-18 19:45   ` Vivek Goyal
@ 2023-05-22 12:54   ` Alex Bennée
  2023-09-06 13:07   ` [Virtio-fs] " Hao Xu
  2 siblings, 0 replies; 7+ messages in thread
From: Alex Bennée @ 2023-05-22 12:54 UTC
  To: Stefan Hajnoczi
  Cc: QEMU Developers, Stefan Hajnoczi, virtio-fs, Erik Schilling,
	Vivek Goyal


Stefan Hajnoczi <stefanha@gmail.com> writes:

> On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
> Hi Alex,
> There were two unresolved issues:
>
> 1. How to inject SIGBUS when the guest accesses a page that's beyond
> the end-of-file.
> 2. Implementing the vhost-user messages for mapping ranges of files to
> the vhost-user frontend.
>
> The harder problem is SIGBUS. An mmap area may be larger than the
> length of the file. Or another process could truncate the file while
> it's mmapped, causing a previously correctly sized mmap to become
> longer than the actual file. When a page beyond the end of file is
> accessed, the kernel raises SIGBUS.
>
> When this scenario occurs in the DAX Window, kvm.ko gets some type of
> vmexit (fault) and the code currently enters an infinite loop because
> it expects KVM memory regions to resolve faults. Since there is no
> page backing that part of the vma, the fault handling fails and the
> code loops trying to do this forever.
>
> There needs to be a way to inject this fault back into the guest.
> However, we did not find a way to do that. We considered Machine
> Check Exceptions (MCEs), x86 interrupts, and paravirtualized
> approaches. None of them looked like a clean and sane way to do this.
> The Linux maintainers for MCEs and kvm.ko were not excited about
> supporting this.
>
> So in the end, SIGBUS was never solved. It leads to a DoS because the
> host kernel will enter an infinite loop. We decided that until there
> is progress on SIGBUS, we can't go ahead with DAX Windows in
> production.

This certainly seems like something we'd need hypervisor-specific
support for as well. In the Xen case pages aren't "owned" by the dom0
kernel (although it does track some of them), so the hypervisor would
need to report the problem via some mechanism.

> The easier problem is adding new vhost-user messages. It does lead to
> a fundamental change in the vhost-user protocol: the presence of the
> DAX Window means there are memory ranges that cannot be accessed via
> shared memory. Imagine Device A has a DAX Window and Device B needs to
> DMA to/from it. That doesn't work because the mmaps happen inside the
> frontend (QEMU), so Device B doesn't have access to the current
> mappings. The fundamental change to vhost-user is that virtqueue
> descriptor mapping code must now deal with the situation where guest
> addresses are absent from the shared memory regions, and must then send
> vhost-user protocol messages to read/write to/from bounce buffers.
> The rest of the device backend does not require modification.
> This is a slow path, but at least it works whereas currently the I/O
> would fail because the memory is absent. Other solutions to the
> vhost-user DMA problem exist, but this is the one that Dave and I last
> discussed.

This doesn't sound too dissimilar to cases we need to handle now in Xen
where access to memory is transitory and controlled by the hypervisor.

>
> In the end, there is still work to do to make the DAX Window
> supportable. There is experimental code out there that kind of works,
> but we felt it was incomplete.
>
> To your specific questions:
>
>>  * What VMM/daemon combinations has DAX been tested on?
>
> Only the experimental virtio-fs Kata Containers kernels and QEMU
> builds that were available a few years ago. I don't think the code has
> been rebased.
>
>>  * Isn't it time the vhost-user spec is updated?
>
> I don't know if Dave ever wrote the spec for or implemented the final
> version of the vhost-user protocol messages we discussed.
>
>>  * Is anyone picking up Dave's patches for the QEMU side of support?
>
> Not at the moment. It would be nice to support, but someone needs the
> energy/time/focus to deal with the outstanding issues I mentioned.
>
> If you want to work on it, feel free to include me. I can help dig up
> old discussions and give input.

I think in the short term we shall just concentrate on getting virtiofsd
working well in our Xen setup. We can certainly consider looking at DAX
again in our optimisation phase. We know it will help performance, so
it's just down to the implementation details ;-)

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [Virtio-fs] Status of DAX for virtio-fs/virtiofsd?
  2023-05-17 16:26 ` Stefan Hajnoczi
  2023-05-18 19:45   ` Vivek Goyal
  2023-05-22 12:54   ` Alex Bennée
@ 2023-09-06 13:07   ` Hao Xu
  2023-09-06 13:57     ` Stefan Hajnoczi
  2 siblings, 1 reply; 7+ messages in thread
From: Hao Xu @ 2023-09-06 13:07 UTC
  To: Stefan Hajnoczi, Alex Bennée
  Cc: virtio-fs, Erik Schilling, QEMU Developers, Stefan Hajnoczi,
	Vivek Goyal


On 5/18/23 00:26, Stefan Hajnoczi wrote:
> On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
> Hi Alex,
> There were two unresolved issues:
>
> 1. How to inject SIGBUS when the guest accesses a page that's beyond
> the end-of-file.

Hi Stefan,
Does this SIGBUS issue exist if the guest kernel can be trusted? In that
case, we could check the offset value in the guest kernel.


Thanks,

Hao




* Re: [Virtio-fs] Status of DAX for virtio-fs/virtiofsd?
  2023-09-06 13:07   ` [Virtio-fs] " Hao Xu
@ 2023-09-06 13:57     ` Stefan Hajnoczi
  2023-09-07  4:19       ` Hao Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2023-09-06 13:57 UTC
  To: Hao Xu
  Cc: Alex Bennée, virtio-fs, Erik Schilling, QEMU Developers,
	Stefan Hajnoczi, Vivek Goyal

On Wed, 6 Sept 2023 at 09:07, Hao Xu <hao.xu@linux.dev> wrote:
> On 5/18/23 00:26, Stefan Hajnoczi wrote:
> > On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
> > Hi Alex,
> > There were two unresolved issues:
> >
> > 1. How to inject SIGBUS when the guest accesses a page that's beyond
> > the end-of-file.
>
> Hi Stefan,
> Does this SIGBUS issue exist if the guest kernel can be trusted? Since in
>
> that case, we can check the offset value in guest kernel.

The scenario is:
1. A guest userspace process has a DAX file mmapped.
2. The host or another guest that is also sharing the directory
truncates the file. The pages mmapped by our guest are no longer
valid.
3. The guest loads from an mmapped page and a vmexit occurs.
4. Now the host must inject a SIGBUS into the guest. There is
currently no way to do this.

I believe this scenario doesn't happen within a single guest, because
the guest kernel will raise SIGBUS itself without a vmexit if another
process inside that same guest truncates the file.

Another scenario is when the guest kernel accesses the DAX pages. A
vmexit can occur here too.

If you trust the host and all guests sharing the directory not to
truncate files that are mmapped, then this issue will not occur.
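
The cases above can be summarised as a small decision table. This is
only a model of the reasoning in this mail, with invented names
(Truncator, Outcome, access_after_truncate); it does not correspond to
any code in KVM or QEMU.

```rust
/// Who truncated the file that our guest has mmapped via DAX.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Truncator {
    /// Another process inside the same guest.
    SameGuest,
    /// The host itself.
    Host,
    /// Another guest sharing the same directory.
    OtherGuest,
}

/// What happens when our guest then touches a page past the new EOF.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The guest kernel knows about the truncation and raises SIGBUS
    /// itself, without a vmexit.
    GuestKernelSigbus,
    /// A vmexit reaches the host, which currently has no way to inject
    /// a SIGBUS into the guest.
    UnresolvableHostFault,
}

/// Model of the scenarios described above (hypothetical helper).
pub fn access_after_truncate(who: Truncator) -> Outcome {
    match who {
        Truncator::SameGuest => Outcome::GuestKernelSigbus,
        Truncator::Host | Truncator::OtherGuest => Outcome::UnresolvableHostFault,
    }
}
```

If neither the host nor any other guest can truncate the shared files,
only the first arm is reachable.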

Stefan



* Re: [Virtio-fs] Status of DAX for virtio-fs/virtiofsd?
  2023-09-06 13:57     ` Stefan Hajnoczi
@ 2023-09-07  4:19       ` Hao Xu
  0 siblings, 0 replies; 7+ messages in thread
From: Hao Xu @ 2023-09-07  4:19 UTC
  To: Stefan Hajnoczi
  Cc: Alex Bennée, virtio-fs, Erik Schilling, QEMU Developers,
	Stefan Hajnoczi, Vivek Goyal


On 9/6/23 21:57, Stefan Hajnoczi wrote:
> On Wed, 6 Sept 2023 at 09:07, Hao Xu <hao.xu@linux.dev> wrote:
>> On 5/18/23 00:26, Stefan Hajnoczi wrote:
>>> On Wed, 17 May 2023 at 11:54, Alex Bennée <alex.bennee@linaro.org> wrote:
>>> Hi Alex,
>>> There were two unresolved issues:
>>>
>>> 1. How to inject SIGBUS when the guest accesses a page that's beyond
>>> the end-of-file.
>> Hi Stefan,
>> Does this SIGBUS issue exist if the guest kernel can be trusted? Since in
>>
>> that case, we can check the offset value in guest kernel.
> The scenario is:
> 1. A guest userspace process has a DAX file mmapped.
> 2. The host or another guest that is also sharing the directory
> truncates the file. The pages mmapped by our guest are no longer
> valid.
> 3. The guest loads from an mmapped page and a vmexit occurs.
> 4. Now the host must inject a SIGBUS into the guest. There is
> currently no way to do this.
>
> I believe this scenario doesn't happen within a single guest, because
> the guest kernel will raise SIGBUS itself without a vmexit if another
> process inside that same guest truncates the file.
>
> Another scenario is when the guest kernel accesses the DAX pages. A
> vmexit can occur here too.
>
> If you trust the host and all guests sharing the directory not to
> truncate files that are mmapped, then this issue will not occur.
>
> Stefan


I see, my use case should be fine since the directory is not shared and
the fs is read-only.

Thanks for the detailed explanation.


Regards,

Hao


