blktap: Sync with XCP, dropping zero-copy.


* blktap: Sync with XCP, dropping zero-copy.
@ 2010-11-12 23:31 Daniel Stodden
  2010-11-13  0:50 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Stodden @ 2010-11-12 23:31 UTC (permalink / raw)
  To: Xen; +Cc: Jeremy Fitzhardinge

Hi all.

This is the better half of what XCP developments and testing brought
for blktap.

It's a fairly big change in how I/O buffers are managed. Prior to this
series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
always had to jump through a couple of extra hoops to do so. The present
state is that we dropped zero-copy, so all tapdev I/O is bounced
to/from a pool of normal pages, essentially replacing the old VMA
management with a couple of insert/zap VM calls.

One issue was that the kernel can't cope with recursive
I/O. Submitting an iovec on a tapdev, passing it to userspace and then
reissuing the same vector via AIO apparently doesn't fit well with the
lock protocol applied to those pages. This is the main reason why
blktap had to deal a lot with grant refs. About as much as blkback
already does before passing requests on. What happens there is that
it's aliasing those granted pages under a different PFN, thereby in a
separate page struct. Not pretty, but it worked, so it's not the
reason why we chose to drop that at some point.

The more prevalent problem was network storage, especially anything
involving TCP. That includes VHD on both NFS and iSCSI. The problem
with those is that retransmits (by the transport) and I/O op
completion (on the application layer) are never synchronized.  With
sufficiently bad timing and a bit of jitter on the network, it's
perfectly common for the kernel to complete an AIO request with a late
ack on the input queue just when the retransmission timer is about to
fire underneath. The completion will unmap the granted frame, crashing
any uncanceled retransmission on an empty page frame. There are
different ways to deal with that. Page destructors might be one way,
but as far as I heard they are not particularly popular upstream.
Issuing the block I/O on dom0 memory is straightforward and avoids the
hassle. One could argue that retransmits after DIO completion are still
a potential privacy problem (I did), but it's not Xen's problem after
all.
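
The race described above can be pictured with a toy model. This is a
minimal Python sketch, not kernel code: Frame, submit_zero_copy,
submit_bounced, complete and retransmit are hypothetical names made up
for illustration, and a RuntimeError stands in for the crash on an
empty page frame.

```python
class Frame:
    """Toy page frame. `mapped` models whether the mapping is still present."""

    def __init__(self, data: bytes):
        self.data = bytes(data)
        self.mapped = True

    def read(self) -> bytes:
        if not self.mapped:
            # Stand-in for the crash on an empty page frame.
            raise RuntimeError("retransmit touched an unmapped frame")
        return self.data


def submit_zero_copy(granted: Frame) -> Frame:
    """Zero-copy: the transport queues the granted frame itself."""
    return granted


def submit_bounced(granted: Frame) -> Frame:
    """Bounce: copy the payload into an ordinary dom0 page and queue that."""
    return Frame(granted.read())


def complete(granted: Frame) -> None:
    """I/O completion hands the granted frame back to the guest."""
    granted.mapped = False


def retransmit(queued: Frame) -> bytes:
    """A retransmission that fires after the request already completed."""
    return queued.read()
```

With submit_zero_copy, a retransmit after complete raises, because the
queued frame is the granted frame. With submit_bounced, the queued copy
stays readable, at the cost of one memcpy per request.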

If zero-copy becomes more attractive again, the plan would be to
rather use grantdev in userspace, such as a filter driver for tapdisk
instead. Until then, there's presumably a notable difference in L2
cache footprint. Then again, there's also a whole number of cycles not
spent in redundant hypercalls now, to deal with the pseudophysical
map.

There are also benefits or non-issues.

 - This blktap is rather xen-independent. Certainly depends on the
   common ring macros, but lacking grant stuff it compiles on bare
   metal Linux with no CONFIG_XEN. Not consummated here, because
   that's going to move the source tree out of drivers/xen. But I'd
   like to post a new branch proposing to do so.

 - Blktap's size in dom0 didn't really change. Frames (now pages) were
   always pooled. We used to balloon memory to claim space for
   redundant grant mappings. Now we reserve, by default, the same
   volume in normal memory.

 - The previous code would run all I/O on a single pool. Typically
   two rings worth of requests. Sufficient for a whole lot of systems,
   especially with single storage backends, but not so nice when I/O
   on a number of otherwise independent filers or volumes collides.

   Pools are refcounted kobjects in sysfs. Toolstacks using the new
   code can thereby choose to eliminate bottlenecks by grouping taps
   on different buffer pools. Pools can also be resized, to accommodate
   greater queue depths. [Note that blkback still has the same issue,
   so guests won't take advantage of that before that's resolved as
   well.]

 - XCP started to make some use of stacking tapdevs. Think pointing
   the image chain of a bunch of "leaf" taps to a shared parent
   node. That works fairly well, but definitely takes independent
   resource pools to avoid deadlock by parent starvation then.
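
To illustrate why stacked taps want independent pools, here is a
minimal Python sketch. It is a toy model under stated assumptions, not
the blktap code: BufferPool, its methods and parent_can_progress are
hypothetical names, and the real pools are sysfs kobjects rather than
Python objects.

```python
class BufferPool:
    """Toy model of a refcounted, resizable page pool (names hypothetical)."""

    def __init__(self, name: str, pages: int):
        self.name = name
        self.pages = pages      # capacity in pages
        self.in_use = 0         # pages currently backing requests
        self.refs = 0           # number of taps attached

    def attach(self) -> "BufferPool":
        self.refs += 1
        return self

    def detach(self) -> None:
        self.refs -= 1

    def alloc(self, n: int) -> bool:
        """Reserve n pages, or refuse if the pool is exhausted."""
        if self.in_use + n > self.pages:
            return False
        self.in_use += n
        return True

    def free(self, n: int) -> None:
        self.in_use -= n

    def resize(self, pages: int) -> None:
        """Grow or shrink the pool; never below what is already in use."""
        if pages < self.in_use:
            raise ValueError("cannot shrink below pages in use")
        self.pages = pages


def parent_can_progress(leaf_pool: BufferPool, parent_pool: BufferPool) -> bool:
    """Leaves fill their pool, then the parent needs one page to serve them."""
    while leaf_pool.alloc(1):
        pass
    return parent_pool.alloc(1)
```

If leaf_pool and parent_pool are the same object, the leaves hold every
page while waiting on the parent, and the parent cannot allocate: that
is the starvation case. With two pools the parent still gets its page.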

Please pull upstream/xen/dom0/backend/blktap2 from
git://xenbits.xensource.com/people/dstodden/linux.git

.. and/or upstream/xen/next for a merge.

I also pulled in the pending warning fix from Teck Choon Giam.

Cheers,
Daniel


* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-12 23:31 Daniel Stodden
@ 2010-11-13  0:50 ` Jeremy Fitzhardinge
  2010-11-13  3:56   ` Daniel Stodden
       [not found]   ` <1289620544.11102.373.camel@agari.van.xensource.com>
  0 siblings, 2 replies; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-13  0:50 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen

On 11/12/2010 03:31 PM, Daniel Stodden wrote:
> It's fairly a big change in how I/O buffers are managed. Prior to this
> series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
> always had to jump through a couple of extra loops to do so. Present
> state of that is that we dropped that, so all tapdev I/O is bounced
> to/from a bunch of normal pages. Essentially replacing the old VMA
> management with a couple insert/zap VM calls.

Do you have any performance results comparing the two approaches?

> One issue was that the kernel can't cope with recursive
> I/O. Submitting an iovec on a tapdev, passing it to userspace and then
> reissuing the same vector via AIO apparently doesn't fit well with the
> lock protocol applied to those pages. This is the main reason why
> blktap had to deal a lot with grant refs. About as much as blkback
> already does before passing requests on. What happens there is that
> it's aliasing those granted pages under a different PFN, thereby in a
> separate page struct. Not pretty, but it worked, so it's not the
> reason why we chose to drop that at some point.
>
> The more prevalent problem was network storage, especially anything
> involving TCP. That includes VHD on both NFS and iSCSI. The problem
> with those is that retransmits (by the transport) and I/O op
> completion (on the application layer) are never synchronized.  With
> sufficiently bad timing and bit of jitter on the network, it's
> perfectly common for the kernel to complete an AIO request with a late
> ack on the input queue just when retransmission timer is about to fire
> underneath. The completion will unmap the granted frame, crashing any
> uncanceled retransmission on an empty page frame. There are different
> ways to deal with that. Page destructors might be one way, but as far
> as I heard they are not particularly popular upstream. Issuing the
> block I/O on dom0 memory is straightforward and avoids the hassle. One
> could go argue that retransmits after DIO completion are still a
> potential privacy problem (I did), but it's not Xen's problem after
> all.

Surely this can be dealt with by replacing the mapped granted page with
a local copy if the refcount is elevated?  Then that can catch any stray
residual references while we can still return the granted page to its
owner.  And obviously, not reuse that pfn for grants until the refcount
is zero...

> If zero-copy becomes more attractive again, the plan would be to
> rather use grantdev in userspace, such as a filter driver for tapdisk
> instead. Until then, there's presumably a notable difference in L2
> cache footprint. Then again, there's also a whole number of cycles not
> spent in redundant hypercalls now, to deal with the pseudophysical
> map.

Frankly, I think the idea of putting blkback+tapdisk entirely in
usermode is all upside with no (obvious) downsides.  It:

   1. avoids having to upstream anything
   2. avoids having to upstream anything
   3. avoids having to upstream anything

   4. gets us back zero-copy (if that's important)
   5. makes the IO path nice and straightforward
   6. seems to address all the other problems you mentioned

The only caveat is the stray unmapping problem, but I think gntdev can
be modified to deal with that pretty easily.

qemu has usermode blkback support already, and an actively improving
block-IO infrastructure, so one approach might be to consider putting
(parts of) tapdisk into qemu, which makes it pretty natural to reuse it
with non-Xen guests via virtio-block, emulated devices, etc.  But I'm
not sold on that; having a standalone tapdisk w/ blkback makes sense to
me as well.

On the other hand, I don't think we're going to be able to get away with
putting netback in usermode, so we still need to deal with that - but I
think an all-copying version will be fine to get started with at least.


> There are also benefits or non-issues.
>
>  - This blktap is rather xen-independent. Certainly depends on the
>    common ring macros, but lacking grant stuff it compiles on bare
>    metal Linux with no CONFIG_XEN. Not consummated here, because
>    that's going to move the source tree out of drivers/xen. But I'd
>    like to post a new branch proposing to do so.
>
>  - Blktaps size in dom0 didn't really change. Frames (now pages) were
>    always pooled. We used to balloon memory to claim space for
>    redundant grant mappings. Now we reserve, by default, the same
>    volume in normal memory.
>
>  - The previous code would run all I/O on a single pool. Typically
>    two rings worth of requests. Sufficient for a whole lot of systems,
>    especially with single storage backends, but not so nice when I/O
>    on a number of otherwise independent filers or volumes collides.
>
>    Pools are refcounted kobjects in sysfs. Toolstacks using the new
>    code can thereby choose to eliminate bottlenecks by grouping taps
>    on different buffer pools. Pools can also be resized, to accommodate
>    greater queue depths. [Note that blkback still has the same issue,
>    so guests won't take advantage of that before that's resolved as
>    well.]
>
>  - XCP started to make some use of stacking tapdevs. Think pointing
>    the image chain of a bunch of "leaf" taps to a shared parent
>    node. That works fairly well, but definitely takes independent
>    resource pools to avoid deadlock by parent starvation then.
>
> Please pull upstream/xen/dom0/backend/blktap2 from
> git://xenbits.xensource.com/people/dstodden/linux.git

OK, I've pulled it, but I haven't had a chance to test it yet.

Thanks,
    J


* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-13  0:50 ` Jeremy Fitzhardinge
@ 2010-11-13  3:56   ` Daniel Stodden
       [not found]   ` <1289620544.11102.373.camel@agari.van.xensource.com>
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-13  3:56 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com

Hey.

On Fri, 2010-11-12 at 19:50 -0500, Jeremy Fitzhardinge wrote:
> On 11/12/2010 03:31 PM, Daniel Stodden wrote:
> > It's fairly a big change in how I/O buffers are managed. Prior to this
> > series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
> > always had to jump through a couple of extra loops to do so. Present
> > state of that is that we dropped that, so all tapdev I/O is bounced
> > to/from a bunch of normal pages. Essentially replacing the old VMA
> > management with a couple insert/zap VM calls.
> 
> Do you have any performance results comparing the two approaches?

No. One could probably go try large ramdisks or an AIO backend in tmpfs.
All the storage I'm concerned with here terminates either on the NIC or
a local spindle. That's hard to thwart in cache bandwidth.

> > One issue was that the kernel can't cope with recursive
> > I/O. Submitting an iovec on a tapdev, passing it to userspace and then
> > reissuing the same vector via AIO apparently doesn't fit well with the
> > lock protocol applied to those pages. This is the main reason why
> > blktap had to deal a lot with grant refs. About as much as blkback
> > already does before passing requests on. What happens there is that
> > it's aliasing those granted pages under a different PFN, thereby in a
> > separate page struct. Not pretty, but it worked, so it's not the
> > reason why we chose to drop that at some point.
> >
> > The more prevalent problem was network storage, especially anything
> > involving TCP. That includes VHD on both NFS and iSCSI. The problem
> > with those is that retransmits (by the transport) and I/O op
> > completion (on the application layer) are never synchronized.  With
> > sufficiently bad timing and bit of jitter on the network, it's
> > perfectly common for the kernel to complete an AIO request with a late
> > ack on the input queue just when retransmission timer is about to fire
> > underneath. The completion will unmap the granted frame, crashing any
> > uncanceled retransmission on an empty page frame. There are different
> > ways to deal with that. Page destructors might be one way, but as far
> > as I heard they are not particularly popular upstream. Issuing the
> > block I/O on dom0 memory is straightforward and avoids the hassle. One
> > could go argue that retransmits after DIO completion are still a
> > potential privacy problem (I did), but it's not Xen's problem after
> > all.
> 
> Surely this can be dealt with by replacing the mapped granted page with
> a local copy if the refcount is elevated?

Yeah. We briefly discussed this when the problem started to pop up
(again).

I had a patch, for blktap1 in XS 5.5 iirc, which would fill the mapping
with a dummy page. You wouldn't need a copy; a R/O zero map easily does
the job. On UP that'd be just a matter of disabling interrupts for a
while.

I dropped it after it became clear that XS was moving to SMP, where one
would end up with a full barrier to orchestrate the TLB flushes
everywhere. Now, the skb paths prone to crashing all run in softirq
context, so I wouldn't exactly expect a huge performance win from
syncing on that kind of thing across all nodes, compared to a local
memcpy. Nor would I want storage stuff to touch locks shared with TCP;
that's just not our business. Correct me if I'm mistaken.

I'd like to see maybe stuff like node affinity on NUMA getting a bit
more work. I think the patch presently just fills the queue node in, but
that didn't see much testing, and one would have to correlate that.

>   Then that can catch any stray
> residual references while we can still return the granted page to its
> owner.  And obviously, not reuse that pfn for grants until the refcount
> is zero...

> > If zero-copy becomes more attractive again, the plan would be to
> > rather use grantdev in userspace, such as a filter driver for tapdisk
> > instead. Until then, there's presumably a notable difference in L2
> > cache footprint. Then again, there's also a whole number of cycles not
> > spent in redundant hypercalls now, to deal with the pseudophysical
> > map.
> 
> Frankly, I think the idea of putting blkback+tapdisk entirely in
> usermode is all upside with no (obvious) downsides.  It:
> 
>    1. avoids having to upstream anything
>    2. avoids having to upstream anything
>    3. avoids having to upstream anything
> 
>    4. gets us back zero-copy (if that's important)

(No, unfortunately. DIO on a granted frame under blktap would be as
vulnerable as DIO on a granted frame under a userland blkback, userland
won't buy us anything as far as the zcopy side of things is concerned).

>    5. makes the IO path nice and straightforward
>    6. seems to address all the other problems you mentioned

I'm not at all against a userland blkback. Just wouldn't go as far as
considering this a silver bullet.

The main thing I'm scared of is ending up hacking cheesy stuff into the
user ABI to take advantage of things immediately available to FSes on
the bio layer, but harder (or at least slower) to get made available to
userland.

DISCARD support is one currently hot example, do you see that in AIO
somewhere? Ok, probably a good thing for everybody anyway, so maybe
patching that is useful work. But it's extra work right now and probably
no more fun to maintain than blktap is.

The second issue I see is the XCP side of things. XenServer got a lot of
benefit out of blktap2, and particularly because of the tapdevs. It
promotes a fairly rigorous split between a blkback VBD, controlled by
the agent, and tapdevs, controlled by XS's storage manager.

That doesn't prevent blkback from going into userspace, but it had
better not share a process with some libblktap, which in turn had better
not be controlled under the same xenstore path.

So for XCP it'd be AIO on tapdevs for the time being, and with that
whatever the syscall interface lets you do.

> The only caveat is the stray unmapping problem, but I think gntdev can
> be modified to deal with that pretty easily.

Not easier than anything else in kernel space, but when dealing only
with the refcounts, that's as good a place as anywhere else, yes.

> qemu has usermode blkback support already, and an actively improving
> block-IO infrastructure, so one approach might be to consider putting
> (parts of) tapdisk into qemu - and makes it pretty natural to reuse it
> with non-Xen guests via virtio-block, emulated devices, etc.  But I'm
> not sold on that; having a standalone tapdisk w/ blkback makes sense to
> me as well.
> 
> On the other hand, I don't think we're going to be able to get away with
> putting netback in usermode, so we still need to deal with that - but I
> think an all-copying version will be fine to get started with at least.

> > Please pull upstream/xen/dom0/backend/blktap2 from
> > git://xenbits.xensource.com/people/dstodden/linux.git
> 
> OK, I've pulled it, but I haven't had a chance to test it yet.

Thanks.

Daniel


* Re: blktap: Sync with XCP, dropping zero-copy.
       [not found]   ` <1289620544.11102.373.camel@agari.van.xensource.com>
@ 2010-11-15 18:27     ` Jeremy Fitzhardinge
  2010-11-16  9:13       ` Daniel Stodden
  0 siblings, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-15 18:27 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>> Surely this can be dealt with by replacing the mapped granted page with
>> a local copy if the refcount is elevated?
> Yeah. We briefly discussed this when the problem started to pop up
> (again).
>
> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> does the job.

Hm, I'd be a bit concerned that that might cause problems if used
generically.  If the page is being used RO, then replacing with a copy
shouldn't be a problem, but getting a consistent snapshot of a mutable
page is obviously going to be a problem.

>  On UP that'd be just a matter of disabling interrupts for
> a while.

Disable for what purpose?  You mean to do an exchange of the mapping, or
something else?

> I dropped it after it became clear that XS was moving to SMP, where one
> would end up with a full barrier to orchestrate the TLB flushes
> everywhere. Now, the skb runs prone to crash all run in softirq context,
> I wouldn't exactly expect a huge performance win from syncing on that
> kind of thing across all nodes, compared to local memcpy. Nor would I
> want storage stuff to touch locks shared with TCP, that's just not our
> business. Correct me if I'm mistaken.

I don't follow what your high-level concern is here.   If we update the
pte to unmap the granted page, then it's up to Xen to arrange for any TLB
flushes to make sure the page is no longer accessible to the domain.  We
don't need to do anything explicit, and it's independent of whether we're
simply unmapping the page or replacing the mapping with another one (ie,
any TLB flushing necessary is already going on).

I was also under the impression that this is a relatively rare event
rather than something that's going to be necessary for every page, so
the overhead should be minimal.

>>> If zero-copy becomes more attractive again, the plan would be to
>>> rather use grantdev in userspace, such as a filter driver for tapdisk
>>> instead. Until then, there's presumably a notable difference in L2
>>> cache footprint. Then again, there's also a whole number of cycles not
>>> spent in redundant hypercalls now, to deal with the pseudophysical
>>> map.
>> Frankly, I think the idea of putting blkback+tapdisk entirely in
>> usermode is all upside with no (obvious) downsides.  It:
>>
>>    1. avoids having to upstream anything
>>    2. avoids having to upstream anything
>>    3. avoids having to upstream anything
>>
>>    4. gets us back zero-copy (if that's important)
> (No, unfortunately. DIO on a granted frame under blktap would be as
> vulnerable as DIO on a granted frame under a userland blkback, userland
> won't buy us anything as far as the zcopy side of things is concerned).

Why's that?  Are you talking about the stray page reference
vulnerability, or something else?  You're right that it doesn't really
help with stray references because it's still kernel code which ends up
doing the dereference rather than usermode.

>>    5. makes the IO path nice and straightforward
>>    6. seems to address all the other problems you mentioned
> I'm not at all against a userland blkback. Just wouldn't go as far as
> considering this a silver bullet.
>
> The main thing I'm scared of is ending up hacking cheesy stuff into the
> user ABI to take advantage of things immediately available to FSes on
> the bio layer, but harder (or at least slower) to get made available to
> userland.

But that hasn't been a problem for tapdisk so far.  Given that it is
using DIO, any completed write has at least been submitted to a
storage device, so then it's just a question of how to make sure that any
buffers are fully flushed, which would be fdatasync() I guess.
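
As a rough userspace illustration of the flush step, here is a minimal
Python sketch using os.pwrite and os.fdatasync from the standard
library. It assumes a Linux host and uses ordinary buffered writes
rather than O_DIRECT, since O_DIRECT needs aligned buffers that would
obscure the point. write_and_flush and demo are hypothetical helper
names, not anything from tapdisk.

```python
import os
import tempfile


def write_and_flush(fd: int, payload: bytes, offset: int) -> int:
    """Write payload at offset, then ask the kernel to flush file data."""
    written = os.pwrite(fd, payload, offset)
    os.fdatasync(fd)  # flush file data (not necessarily all metadata)
    return written


def demo() -> bytes:
    """Round-trip a small write through a temporary file."""
    with tempfile.NamedTemporaryFile() as tmp:
        write_and_flush(tmp.fileno(), b"tapdisk", 0)
        return os.pread(tmp.fileno(), 7, 0)
```

Whether fdatasync() alone is enough for a given backend depends on the
storage stack underneath; the sketch only shows where the call sits
relative to the write.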

Besides, our requirements are hardly unique; if there's some clear need
for a new API, then the course of action is to work with the broader Linux
community to work out what it should be and how it should be
implemented.  There's no need for "hacking cheesy stuff" - there's been
enough of that already.

> DISCARD support is one currently hot example, do you see that in AIO
> somewhere? Ok, probably a good thing for everybody anyway, so maybe
> patching that is useful work. But it's extra work right now and probably
> no more fun to maintain than blktap is.

fallocate() is being extended to allow hole-punching in files.  I don't
know what work is being done to do discard on random block devices, but
that's clearly a generically useful thing to have.

> The second issue I see is the XCP side of things. XenServer got a lot of
> benefit out of blktap2, and particularly because of the tapdevs. It
> promotes a fairly rigorous split between a blkback VBD, controlled by
> the agent, and tapdevs, controlled by XS's storage manager.
>
> That doesn't prevent blkback to go into userspace, but it better won't
> share a process with some libblktap, which in turn would better not be
> controlled under the same xenstore path.

Could you elaborate on this?  What was the benefit?

>> The only caveat is the stray unmapping problem, but I think gntdev can
>> be modified to deal with that pretty easily.
> Not easier than anything else in kernel space, but when dealing only
> with the refcounts, that's as good a place as anywhere else, yes.

I think the refcount test is pretty straightforward - if the refcount is
1, then we're the sole owner of the page and we don't need to worry
about any other users.  If it's > 1, then somebody else has it, and we
need to make sure it no longer refers to a granted page (which is just a
matter of doing a set_pte_atomic() to remap from present to present).

Then we'd have a set of frames whose lifetimes are being determined by
some other subsystem.  We can either maintain a list of them and poll
waiting for them to become free, or just release them and let them be
managed by the normal kernel lifetime rules (which requires that the
memory attached to them be completely normal, of course).
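
The refcount test and the polling variant can be sketched as a toy
model. This is Python for illustration only, not gntdev code: Page,
end_grant and reap are hypothetical names, and setting `backing` stands
in for the set_pte_atomic() remap mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    refcount: int = 1          # 1 means we are the sole owner
    backing: str = "grant"     # "grant", "local", or "none"


def end_grant(page: Page, lingering: list) -> str:
    """Release the grant behind `page`, whatever other users may exist."""
    if page.refcount == 1:
        page.backing = "none"       # plain unmap, nobody else is looking
        return "unmapped"
    page.backing = "local"          # present-to-present remap to local memory
    lingering.append(page)          # lifetime now set by the other holder
    return "remapped"


def reap(lingering: list) -> list:
    """Poll the list; return frames whose stray references have gone away."""
    done = [p for p in lingering if p.refcount == 1]
    lingering[:] = [p for p in lingering if p.refcount > 1]
    return done
```

Either way the granted page goes back to its owner immediately; only
the local replacement frame waits on the stray reference.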

    J


* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-15 18:27     ` Jeremy Fitzhardinge
@ 2010-11-16  9:13       ` Daniel Stodden
  2010-11-16 17:56         ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Stodden @ 2010-11-16  9:13 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
> >> Surely this can be dealt with by replacing the mapped granted page with
> >> a local copy if the refcount is elevated?
> > Yeah. We briefly discussed this when the problem started to pop up
> > (again).
> >
> > I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> > a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> > does the job.
> 
> Hm, I'd be a bit concerned that that might cause problems if used
> generically. 

Yeah. It wasn't a problem because all the network backends are on TCP,
where one can be rather sure that the dups are going to be properly
dropped.

Does this hold everywhere ..? -- As mentioned below, the problem is
rather in AIO/DIO than being Xen-specific, so you can see the same
behavior on bare metal kernels too. A userspace app seeing an AIO
complete and then reusing that buffer elsewhere will occasionally
resend garbage over the network.

How often that would happen in practice probably depends on the
popularity of DIO on NFS for normal apps. Not too many, so yeah, very
generically one would maybe want to memcpy().

>  If the page is being used RO, then replacing with a copy
> shouldn't be a problem, but getting a consistent snapshot of a mutable
> page is obviously going to be a problem.

I think it's safe to assume the problem is limited to writes, so the
buffers aren't really mutable.

> >  On UP that'd be just a matter of disabling interrupts for
> > a while.
> 
> Disable for what purpose?  You mean to do an exchange of the mapping, or
> something else?

[I was referring to the prospect of doing a gref unmap/map sequence in
dom0. Since we do have a swap operation in Xen, just flipping the PTE
and flushing across all vcpus sounds sufficient to me, indeed.]

> > I dropped it after it became clear that XS was moving to SMP, where one
> > would end up with a full barrier to orchestrate the TLB flushes
> > everywhere. Now, the skb runs prone to crash all run in softirq context,
> > I wouldn't exactly expect a huge performance win from syncing on that
> > kind of thing across all nodes, compared to local memcpy. Nor would I
> > want storage stuff to touch locks shared with TCP, that's just not our
> > business. Correct me if I'm mistaken.
> 
> I don't follow what your high-level concern is here.   If we update the
> pte to unmap the granted page, then its up to Xen to arrange for any TLB
> flushes to make sure the page is no longer accessible to the domain.  We
> don't need to do anything explicit, and its independent of whether we're
> simply unmapping the page or replacing the mapping with another one (ie,
> any TLB flushing necessary is already going on).
> 
> I was also under the impression that this is a relatively rare event
> rather than something that's going to be necessary for every page, so
> the overhead should be minimal.
> 
> >>> If zero-copy becomes more attractive again, the plan would be to
> >>> rather use grantdev in userspace, such as a filter driver for tapdisk
> >>> instead. Until then, there's presumably a notable difference in L2
> >>> cache footprint. Then again, there's also a whole number of cycles not
> >>> spent in redundant hypercalls now, to deal with the pseudophysical
> >>> map.
> >> Frankly, I think the idea of putting blkback+tapdisk entirely in
> >> usermode is all upside with no (obvious) downsides.  It:
> >>
> >>    1. avoids having to upstream anything
> >>    2. avoids having to upstream anything
> >>    3. avoids having to upstream anything
> >>
> >>    4. gets us back zero-copy (if that's important)
> > (No, unfortunately. DIO on a granted frame under blktap would be as
> > vulnerable as DIO on a granted frame under a userland blkback, userland
> > > won't buy us anything as far as the zcopy side of things is concerned).
> 
> Why's that?  Are you talking about the stray page reference
> vulnerability, or something else?  You're right that it doesn't really
> help with stray references because its still kernel code which ends up
> doing the dereference rather than usermode.

[Still was talking about the stray reference issues].

> >>    5. makes the IO path nice and straightforward
> >>    6. seems to address all the other problems you mentioned
> > I'm not at all against a userland blkback. Just wouldn't go as far as
> > considering this a silver bullet.
> >
> > The main thing I'm scared of is ending up hacking cheesy stuff into the
> > user ABI to take advantage of things immediately available to FSes on
> > the bio layer, but harder (or at least slower) to get made available to
> > userland.
> 
> But that hasn't been a problem for tapdisk so far.  Given that it is
> using DIO then any completed write has at least been submitted to a
> storage device, so then its just a question of how to make sure that any
> buffers are fully flushed, which would be fdatasync() I guess.

> Besides, our requirements are hardly unique; if there's some clear need
> for a new API, then the course of action is to work with broader Linux
> community to work out what it should be and how it should be
> implemented.  There's no need for "hacking cheesy stuff" - there's been
> enough of that already.

> > DISCARD support is one currently hot example, do you see that in AIO
> > somewhere? Ok, probably a good thing for everybody anyway, so maybe
> > patching that is useful work. But it's extra work right now and probably
> > no more fun to maintain than blktap is.
> 
> fallocate() is being extended to allow hole-punching in files.  I don't
> know what work is being done to do discard on random block devices, but
> that's clearly a generically useful thing to have.

Hole punching for userland being in the works is actually great news.
Just saw the patchset on ext4-devel flying by, great.

> > The second issue I see is the XCP side of things. XenServer got a lot of
> > benefit out of blktap2, and particularly because of the tapdevs. It
> > promotes a fairly rigorous split between a blkback VBD, controlled by
> > the agent, and tapdevs, controlled by XS's storage manager.
> >
> > That doesn't prevent blkback to go into userspace, but it better won't
> > share a process with some libblktap, which in turn would better not be
> > controlled under the same xenstore path.
> 
> 
> Could you elaborate on this?  What was the benefit?

It's been mainly a matter of who controls what. Blktap1 was basically a
VBD, controlled by the agent. Blktap2 is a VDI represented as a block
device. Leaving management of that to XCP's storage manager, which just
hands that device node over to Xapi, simplified many things. Before, the
agent had to understand a lot about the type of storage, then talk to
the right backend accordingly. Worse, in order to have storage
management control a couple of datapath features, you'd basically have
to talk to Xapi, which would talk through xenstore to blktap, which was
a bit tedious. :)

Merging the VDI side of things with VBDs again just didn't sound so
attractive at first sight.

But I've been thinking a little longer about the issues in the meantime.
I think it's actually not so hard to do that without collateral damage.

Let's say we create an extension to tapdisk which speaks blkback's
datapath in userland. We'd basically put one of those tapdisks on every
storage node, independent of the image type, such as a bare LUN or a
VHD. We add a couple additional IPC calls to make it directly
connect/disconnect to/from (ring-ref,event-channel) pairs. 

That means it doesn't even need to talk to xenstore; the control plane
could all be left to some single daemon, which knows how to instruct the
right tapdev (via libblktapctl) by looking at the physical-device node.
I guess getting the control stuff out of the kernel is always a good
idea.

There are some important parts which would go missing. Such as
ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
gntmap 352 pages simultaneously isn't so good, so there still needs to
be some bridge arbitrating them. I'd rather keep that in kernel space,
okay to cram stuff like that into gntdev? It'd be much more
straightforward than IPC.

Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
Can't find it now, what happened? Without, there's presently still no
zero-copy.

Once the issues were solved, it'd be kinda nice. Simplifies stuff like
memshr for blktap, which depends on getting hold of original grefs.

We'd presumably still need the tapdev nodes, for qemu, etc. But those
can stay non-xen aware then.

> >> The only caveat is the stray unmapping problem, but I think gntdev can
> >> be modified to deal with that pretty easily.
> > Not easier than anything else in kernel space, but when dealing only
> > with the refcounts, that's as as good a place as anwhere else, yes.
> 
> I think the refcount test is pretty straightforward - if the refcount is
> 1, then we're the sole owner of the page and we don't need to worry
> about any other users.  If its > 1, then somebody else has it, and we
> need to make sure it no longer refers to a granted page (which is just a
> matter of doing a set_pte_atomic() to remap from present to present).

[set_pte_atomic over grant ptes doesn't work, or does it?]

> Then we'd have a set of frames whose lifetimes are being determined by
> some other subsystem.  We can either maintain a list of them and poll
> waiting for them to become free, or just release them and let them be
> managed by the normal kernel lifetime rules (which requires that the
> memory attached to them be completely normal, of course).

The latter sounds like a good alternative to polling. So an
unmap_and_replace, and giving up ownership thereafter. Next run of the
dispatcher thread can just refill the foreign pfn range via
alloc_empty_pages(), to rebalance.

Daniel


* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-16  9:13       ` Daniel Stodden
@ 2010-11-16 17:56         ` Jeremy Fitzhardinge
  2010-11-16 21:28           ` Daniel Stodden
  0 siblings, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-16 17:56 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/16/2010 01:13 AM, Daniel Stodden wrote:
> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
>> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>>>> Surely this can be dealt with by replacing the mapped granted page with
>>>> a local copy if the refcount is elevated?
>>> Yeah. We briefly discussed this when the problem started to pop up
>>> (again).
>>>
>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
>>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
>>> does the job.
>> Hm, I'd be a bit concerned that that might cause problems if used
>> generically. 
> Yeah. It wasn't a problem because all the network backends are on TCP,
> where one can be rather sure that the dups are going to be properly
> dropped.
>
> Does this hold everywhere ..? -- As mentioned below, the problem is
> rather in AIO/DIO than being Xen-specific, so you can see the same
> behavior on bare metal kernels too. A userspace app seeing an AIO
> complete and then reusing that buffer elsewhere will occassionally
> resend garbage over the network.

Yeah, that sounds like a generic security problem.  I presume the
protocol will just discard the excess retransmit data, but it might mean
a usermode program ends up transmitting secrets it never intended to...

> There are some important parts which would go missing. Such as
> ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
> gntmap 352 pages simultaneously isn't so good, so there still needs to
> be some bridge arbitrating them. I'd rather keep that in kernel space,
> okay to cram stuff like that into gntdev? It'd be much more
> straightforward than IPC.

What's the problem?  If you do nothing then it will appear to the kernel
as a bunch of processes doing memory allocations, and they'll get
blocked/rate-limited accordingly if memory is getting short.  There's
plenty of existing mechanisms to control that sort of thing (cgroups,
etc) without adding anything new to the kernel.  Or are you talking
about something other than simple memory pressure?

And there's plenty of existing IPC mechanisms if you want them to
explicitly coordinate with each other, but I'd tend to think that's
premature unless you have something specific in mind.

> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
> Can't find it now, what happened? Without, there's presently still no
> zero-copy.

gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
new-ish) mmu notifier infrastructure which is intended to allow a device
to sync an external MMU with usermode mappings.  We're not using it in
precisely that way, but it allows us to wrangle grant mappings before
the generic code tries to do normal pte ops on them.

> Once the issues were solved, it'd be kinda nice. Simplifies stuff like
> memshr for blktap, which depends on getting hold of original grefs.
>
> We'd presumably still need the tapdev nodes, for qemu, etc. But those
> can stay non-xen aware then.
>
>>>> The only caveat is the stray unmapping problem, but I think gntdev can
>>>> be modified to deal with that pretty easily.
>>> Not easier than anything else in kernel space, but when dealing only
>>> with the refcounts, that's as as good a place as anwhere else, yes.
>> I think the refcount test is pretty straightforward - if the refcount is
>> 1, then we're the sole owner of the page and we don't need to worry
>> about any other users.  If its > 1, then somebody else has it, and we
>> need to make sure it no longer refers to a granted page (which is just a
>> matter of doing a set_pte_atomic() to remap from present to present).
> [set_pte_atomic over grant ptes doesn't work, or does it?]

No, I forgot about grant ptes' magic properties.  But there is the hypercall.

>> Then we'd have a set of frames whose lifetimes are being determined by
>> some other subsystem.  We can either maintain a list of them and poll
>> waiting for them to become free, or just release them and let them be
>> managed by the normal kernel lifetime rules (which requires that the
>> memory attached to them be completely normal, of course).
> The latter sounds like a good alternative to polling. So an
> unmap_and_replace, and giving up ownership thereafter. Next run of the
> dispatcher thread can can just refill the foreign pfn range via
> alloc_empty_pages(), to rebalance.

Do we actually need a "foreign page range"?  Won't any pfn do?  If we
start with a specific range of foreign pfns and then start freeing those
pfns back to the kernel, we won't have one for long...

    J


* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-16 17:56         ` Jeremy Fitzhardinge
@ 2010-11-16 21:28           ` Daniel Stodden
  2010-11-17 18:00             ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Stodden @ 2010-11-16 21:28 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
> > On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
> >> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
> >>>> Surely this can be dealt with by replacing the mapped granted page with
> >>>> a local copy if the refcount is elevated?
> >>> Yeah. We briefly discussed this when the problem started to pop up
> >>> (again).
> >>>
> >>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
> >>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
> >>> does the job.
> >> Hm, I'd be a bit concerned that that might cause problems if used
> >> generically. 
> > Yeah. It wasn't a problem because all the network backends are on TCP,
> > where one can be rather sure that the dups are going to be properly
> > dropped.
> >
> > Does this hold everywhere ..? -- As mentioned below, the problem is
> > rather in AIO/DIO than being Xen-specific, so you can see the same
> > behavior on bare metal kernels too. A userspace app seeing an AIO
> > complete and then reusing that buffer elsewhere will occassionally
> > resend garbage over the network.
> 
> Yeah, that sounds like a generic security problem.  I presume the
> protocol will just discard the excess retransmit data, but it might mean
> a usermode program ends up transmitting secrets it never intended to...
> 
> > There are some important parts which would go missing. Such as
> > ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
> > gntmap 352 pages simultaneously isn't so good, so there still needs to
> > be some bridge arbitrating them. I'd rather keep that in kernel space,
> > okay to cram stuff like that into gntdev? It'd be much more
> > straightforward than IPC.
> 
> What's the problem?  If you do nothing then it will appear to the kernel
> as a bunch of processes doing memory allocations, and they'll get
> blocked/rate-limited accordingly if memory is getting short.  

The problem is that just letting the page allocator work through
allocations isn't going to scale anywhere.

The worst-case memory requested under load is <number-of-disks> * (32 *
11 pages). As a (conservative) rule of thumb, the number of disks will
be 200 or more.

The number of I/Os actually in flight at any point, in contrast, is
derived from the queue/sg sizes of the physical device. For a simple
disk, that's about a ring or two.
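
To put rough numbers on that comparison, here is a back-of-the-envelope
sketch. It reads "32 * 11" as 32 ring requests of 11 pages each, and it
assumes 4 KiB pages; the function names are made up for illustration.

```python
# Rough sizing only; PAGE_SIZE is an assumption (4 KiB x86 pages).
PAGE_SIZE = 4096
RING_REQUESTS = 32       # requests per ring, from the "32 * 11" above
PAGES_PER_REQUEST = 11   # pages per request, from the "32 * 11" above


def pages_per_ring():
    """Pages needed to map one completely full ring."""
    return RING_REQUESTS * PAGES_PER_REQUEST


def worst_case_pages(num_disks):
    """Pages requested if every disk maps a full ring at the same time."""
    return num_disks * pages_per_ring()


def to_mib(pages):
    """Convert a page count to MiB under the PAGE_SIZE assumption."""
    return pages * PAGE_SIZE / (1024 * 1024)
```

With 200 disks the worst case is 70400 pages, or 275 MiB, while "a ring
or two" of I/O actually in flight is at most 704 pages, or 2.75 MiB.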

> There's
> plenty of existing mechanisms to control that sort of thing (cgroups,
> etc) without adding anything new to the kernel.  Or are you talking
> about something other than simple memory pressure?
> 
> And there's plenty of existing IPC mechanisms if you want them to
> explicitly coordinate with each other, but I'd tend to thing that's
> premature unless you have something specific in mind.
> 
> > Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
> > Can't find it now, what happened? Without, there's presently still no
> > zero-copy.
> 
> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
> new-ish) mmu notifier infrastructure which is intended to allow a device
> to sync an external MMU with usermode mappings.  We're not using it in
> precisely that way, but it allows us to wrangle grant mappings before
> the generic code tries to do normal pte ops on them.

The mmu notifiers were for safe teardown only. They are not sufficient
for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
need to back those VMAs with page structs.  Or bounce again (gulp, just
mentioning it). As with the blktap2 patches, note there is no difference
in the dom0 memory bill, it takes page frames.

This is pretty much exactly the pooling stuff in current drivers/blktap.
The interface could look as follows ([] designates users). 

 * [toolstack]
   Calls some ctls to create/destroy pools of frames.
   (Blktap currently does this in sysfs.)

 * [toolstack]
   Optionally resize them, according to the physical queue 
   depth [estimate] of the underlying storage.

 * [tapdisk]
   A backend instance, when starting up, opens a gntdev, then
   uses a ctl to bind its gntdev handle to a frame pool.

 * [tapdisk]
   The .mmap call now will allocate frames to back the VMA.
   This operation can fail/block under congestion. Neither
   is desirable, so we need a .poll.

 * [tapdisk]
   To integrate grant mappings with a single-threaded event loop,
   use .poll. The handle fires as soon as a request can be mapped.

   Under congestion, the .poll code will queue waiting disks and wake
   them round-robin, once VMAs are released.

(A [tapdisk] doesn't mean to dismiss a potential [qemu].)
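
The congestion behaviour in the list above can be modelled in a few
lines. This is only a toy userspace model, not gntdev or blktap code:
FramePool, try_map and release are hypothetical names, and "waking" a
backend is reduced to returning its name so the caller can retry.

```python
from collections import deque


class FramePool:
    """Toy model of a bounded frame pool shared by several backends."""

    def __init__(self, size):
        self.free = size
        self.waiters = deque()  # backends parked until frames come back

    def try_map(self, backend, nframes):
        """Non-blocking map attempt; parks the backend on congestion."""
        if nframes <= self.free:
            self.free -= nframes
            return True
        if backend not in self.waiters:
            self.waiters.append(backend)
        return False

    def release(self, nframes):
        """Return frames to the pool and pick the next waiter to wake.

        Waiters are woken in arrival order, one per release. A woken
        backend that still cannot map is re-queued at the tail by
        try_map, which gives the round-robin behaviour.
        """
        self.free += nframes
        return self.waiters.popleft() if self.waiters else None
```

In an event loop, the name returned by release() stands in for the
.poll handle firing: the woken backend simply calls try_map() again.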

Still confident we want that? (Seriously asking). A lot of the code to
do so has been written for blktap, it wouldn't be hard to bend into a
gntdev extension.

> > Once the issues were solved, it'd be kinda nice. Simplifies stuff like
> > memshr for blktap, which depends on getting hold of original grefs.
> >
> > We'd presumably still need the tapdev nodes, for qemu, etc. But those
> > can stay non-xen aware then.
> >
> >>>> The only caveat is the stray unmapping problem, but I think gntdev can
> >>>> be modified to deal with that pretty easily.
> >>> Not easier than anything else in kernel space, but when dealing only
> >>> with the refcounts, that's as as good a place as anwhere else, yes.
> >> I think the refcount test is pretty straightforward - if the refcount is
> >> 1, then we're the sole owner of the page and we don't need to worry
> >> about any other users.  If its > 1, then somebody else has it, and we
> >> need to make sure it no longer refers to a granted page (which is just a
> >> matter of doing a set_pte_atomic() to remap from present to present).
> > [set_pte_atomic over grant ptes doesn't work, or does it?]
> 
> No, I forgot about grant ptes magic properties.  But there is the hypercall.

Yup.

> >> Then we'd have a set of frames whose lifetimes are being determined by
> >> some other subsystem.  We can either maintain a list of them and poll
> >> waiting for them to become free, or just release them and let them be
> >> managed by the normal kernel lifetime rules (which requires that the
> >> memory attached to them be completely normal, of course).
> > The latter sounds like a good alternative to polling. So an
> > unmap_and_replace, and giving up ownership thereafter. Next run of the
> > dispatcher thread can can just refill the foreign pfn range via
> > alloc_empty_pages(), to rebalance.
> 
> Do we actually need a "foreign page range"?  Won't any pfn do?  If we
> start with a specific range of foreign pfns and then start freeing those
> pfns back to the kernel, we won't have one for long...

I guess we've been meaning the same thing here, unless I'm
misunderstanding you. Any pfn does, and the balloon pagevec allocations
default to order 0 entries indeed. Sorry, you're right, that's not a
'range'. With a pending re-xmit, the backend can find a couple (or all)
of the request frames have count>1. It can flip and abandon those as
normal memory. But it will need those lost memory slots back, straight
away or next time it's running out of frames. As order-0 allocations.
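
As a sketch of that bookkeeping (a toy model only: Frame,
complete_request and rebalance are made-up names, and the real
unmap/replace step is reduced to a comment):

```python
class Frame:
    """Stand-in for a page frame; refcount 1 means we are the sole owner."""

    def __init__(self, refcount=1):
        self.refcount = refcount


def complete_request(frames):
    """Split a finished request's frames into reusable and abandoned.

    Frames with refcount > 1 are still referenced elsewhere (e.g. by a
    pending retransmit). In the real thing those would be unmapped and
    replaced, then left to normal kernel lifetime rules.
    """
    keep = [f for f in frames if f.refcount == 1]
    abandoned = [f for f in frames if f.refcount > 1]
    return keep, abandoned


def rebalance(pool, target, alloc_frame=Frame):
    """Top the pool back up to target, one order-0 frame at a time."""
    while len(pool) < target:
        pool.append(alloc_frame())
    return pool
```

The point is only that abandoning a frame costs a slot, and that the
slot is won back later by a plain single-frame allocation.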

Foreign memory is deliberately short. Blkback still defaults to 2 rings
worth of address space, iirc, globally. That's what that mempool sysfs
stuff in the later blktap2 patches aimed at -- making the size
configurable where queue length matters, and isolate throughput between
physical backends, where the toolstack wants to care.

Daniel


* Re: blktap: Sync with XCP, dropping zero-copy.
       [not found] <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
@ 2010-11-17 16:36 ` Andres Lagar-Cavilla
  2010-11-17 17:52   ` Jeremy Fitzhardinge
  2010-11-17 23:42   ` Daniel Stodden
  0 siblings, 2 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2010-11-17 16:36 UTC (permalink / raw)
  To: xen-devel; +Cc: Jeremy Fitzhardinge, Daniel Stodden

I'll throw an idea out there and you can educate me on why it's lame.

Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.

Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.

The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm unfolds as follows:

1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.

2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.

3. For blkfront reads, call swap when stuffing the response back into the ring

4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0.

5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle)

6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as often and at the granularity as skbuff's demanded for GNTTABOP_transfer

7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
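
To make the read path in the steps above concrete, here is a toy model of the swap. GNTTABOP_swap does not exist, so everything here is hypothetical: Machine, Domain and gnttabop_swap are illustrative names, and the p2m is just a dict from pfn to mfn.

```python
class Machine:
    """Machine memory, modelled as mfn -> contents."""

    def __init__(self):
        self.frames = {}


class Domain:
    """A domain with its own pfn -> mfn table over shared machine memory."""

    def __init__(self, machine, p2m):
        self.machine = machine
        self.p2m = dict(p2m)

    def read(self, pfn):
        return self.machine.frames[self.p2m[pfn]]

    def write(self, pfn, data):
        self.machine.frames[self.p2m[pfn]] = data


def gnttabop_swap(dom_a, pfn_a, dom_b, pfn_b):
    """Hypothetical op: exchange the mfns behind two pfns, copying no data.

    Pfns (and, in a real kernel, page structs) stay untouched on both
    sides; only the p2m entries change.
    """
    dom_a.p2m[pfn_a], dom_b.p2m[pfn_b] = dom_b.p2m[pfn_b], dom_a.p2m[pfn_a]
```

For a blkfront read, dom0 writes the disk data into its own reserved frame, swaps, and domU then finds that data behind the pfn it offered as a buffer. dom0 ends up holding domU's old frame with undefined contents.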

Problems at first glance:
1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
3. Managing the pool of backend reserved pages may be a problem?

So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach.
Andres 

> Message: 3
> Date: Tue, 16 Nov 2010 13:28:51 -0800
> From: Daniel Stodden <daniel.stodden@citrix.com>
> Subject: [Xen-devel] Re: blktap: Sync with XCP, dropping zero-copy.
> To: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: "Xen-devel@lists.xensource.com" <Xen-devel@lists.xensource.com>
> Message-ID: <1289942932.11102.802.camel@agari.van.xensource.com>
> Content-Type: text/plain; charset="UTF-8"
> 
> On Tue, 2010-11-16 at 12:56 -0500, Jeremy Fitzhardinge wrote:
>> On 11/16/2010 01:13 AM, Daniel Stodden wrote:
>>> On Mon, 2010-11-15 at 13:27 -0500, Jeremy Fitzhardinge wrote:
>>>> On 11/12/2010 07:55 PM, Daniel Stodden wrote:
>>>>>> Surely this can be dealt with by replacing the mapped granted page with
>>>>>> a local copy if the refcount is elevated?
>>>>> Yeah. We briefly discussed this when the problem started to pop up
>>>>> (again).
>>>>> 
>>>>> I had a patch, for blktap1 in XS 5.5 iirc, which would fill mapping with
>>>>> a dummy page mapped in. You wouldn't need a copy, a R/O zero map easily
>>>>> does the job.
>>>> Hm, I'd be a bit concerned that that might cause problems if used
>>>> generically. 
>>> Yeah. It wasn't a problem because all the network backends are on TCP,
>>> where one can be rather sure that the dups are going to be properly
>>> dropped.
>>> 
>>> Does this hold everywhere ..? -- As mentioned below, the problem is
>>> rather in AIO/DIO than being Xen-specific, so you can see the same
>>> behavior on bare metal kernels too. A userspace app seeing an AIO
>>> complete and then reusing that buffer elsewhere will occassionally
>>> resend garbage over the network.
>> 
>> Yeah, that sounds like a generic security problem.  I presume the
>> protocol will just discard the excess retransmit data, but it might mean
>> a usermode program ends up transmitting secrets it never intended to...
>> 
>>> There are some important parts which would go missing. Such as
>>> ratelimiting gntdev accesses -- 200 thundering tapdisks each trying to
>>> gntmap 352 pages simultaneously isn't so good, so there still needs to
>>> be some bridge arbitrating them. I'd rather keep that in kernel space,
>>> okay to cram stuff like that into gntdev? It'd be much more
>>> straightforward than IPC.
>> 
>> What's the problem?  If you do nothing then it will appear to the kernel
>> as a bunch of processes doing memory allocations, and they'll get
>> blocked/rate-limited accordingly if memory is getting short.  
> 
> The problem is that just letting the page allocator work through
> allocations isn't going to scale anywhere.
> 
> The worst case memory requested under load is <number-of-disks> * (32 *
> 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> better.
> 
> The number of I/O actually in-flight at any point, in contrast, is
> derived from the queue/sg sizes of the physical device. For a simple
> disk, that's about a ring or two.
> 
>> There's
>> plenty of existing mechanisms to control that sort of thing (cgroups,
>> etc) without adding anything new to the kernel.  Or are you talking
>> about something other than simple memory pressure?
>> 
>> And there's plenty of existing IPC mechanisms if you want them to
>> explicitly coordinate with each other, but I'd tend to thing that's
>> premature unless you have something specific in mind.
>> 
>>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
>>> Can't find it now, what happened? Without, there's presently still no
>>> zero-copy.
>> 
>> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
>> new-ish) mmu notifier infrastructure which is intended to allow a device
>> to sync an external MMU with usermode mappings.  We're not using it in
>> precisely that way, but it allows us to wrangle grant mappings before
>> the generic code tries to do normal pte ops on them.
> 
> The mmu notifiers were for safe teardown only. They are not sufficient
> for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> need to back those VMAs with page structs.  Or bounce again (gulp, just
> mentioning it). As with the blktap2 patches, note there is no difference
> in the dom0 memory bill, it takes page frames.
> 
> This is pretty much exactly the pooling stuff in current drivers/blktap.
> The interface could look as follows ([] designates users). 
> 
> * [toolstack]
>   Calling some ctls to create/destroy ctls  pools of frames. 
>   (Blktap currently does this in sysfs.)
> 
> * [toolstack]
>   Optionally resize them, according to the physical queue 
>   depth [estimate] of the underlying storage.
> 
> * [tapdisk]
>   A backend instance, when starting up, opens a gntdev, then
>   uses a ctl to bind its gntdev handle to a frame pool.
> 
> * [tapdisk]
>   The .mmap call now will allocate frames to back the VMA.
>   This operation can fail/block under congestion. Neither
>   is desirable, so we need a .poll.
> 
> * [tapdisk]
>   To integrate grant mappings with a single-threaded event loop,
>   use .poll. The handle fires as soon as a request can be mapped.
> 
>   Under congestion, the .poll code will queue waiting disks and wake
>   them round-robin, once VMAs are released.
> 
> (A [tapdisk] doesn't mean to dismiss a potential [qemu].)
> 
> Still confident we want that? (Seriously asking). A lot of the code to
> do so has been written for blktap, it wouldn't be hard to bend into a
> gntdev extension.
> 
>>> Once the issues were solved, it'd be kinda nice. Simplifies stuff like
>>> memshr for blktap, which depends on getting hold of original grefs.
>>> 
>>> We'd presumably still need the tapdev nodes, for qemu, etc. But those
>>> can stay non-xen aware then.
>>> 
>>>>>> The only caveat is the stray unmapping problem, but I think gntdev can
>>>>>> be modified to deal with that pretty easily.
>>>>> Not easier than anything else in kernel space, but when dealing only
>>>>> with the refcounts, that's as as good a place as anwhere else, yes.
>>>> I think the refcount test is pretty straightforward - if the refcount is
>>>> 1, then we're the sole owner of the page and we don't need to worry
>>>> about any other users.  If its > 1, then somebody else has it, and we
>>>> need to make sure it no longer refers to a granted page (which is just a
>>>> matter of doing a set_pte_atomic() to remap from present to present).
>>> [set_pte_atomic over grant ptes doesn't work, or does it?]
>> 
>> No, I forgot about grant ptes magic properties.  But there is the hypercall.
> 
> Yup.
> 
>>>> Then we'd have a set of frames whose lifetimes are being determined by
>>>> some other subsystem.  We can either maintain a list of them and poll
>>>> waiting for them to become free, or just release them and let them be
>>>> managed by the normal kernel lifetime rules (which requires that the
>>>> memory attached to them be completely normal, of course).
>>> The latter sounds like a good alternative to polling. So an
>>> unmap_and_replace, and giving up ownership thereafter. Next run of the
>>> dispatcher thread can can just refill the foreign pfn range via
>>> alloc_empty_pages(), to rebalance.
>> 
>> Do we actually need a "foreign page range"?  Won't any pfn do?  If we
>> start with a specific range of foreign pfns and then start freeing those
>> pfns back to the kernel, we won't have one for long...
> 
> I guess we've been meaning the same thing here, unless I'm
> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> default to order 0 entries indeed. Sorry, you're right, that's not a
> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> of the request frames have count>1. It can flip and abandon those as
> normal memory. But it will need those lost memory slots back, straight
> away or next time it's running out of frames. As order-0 allocations.
> 
> Foreign memory is deliberately short. Blkback still defaults to 2 rings
> worth of address space, iirc, globally. That's what that mempool sysfs
> stuff in the later blktap2 patches aimed at -- making the size
> configurable where queue length matters, and isolate throughput between
> physical backends, where the toolstack wants to care.
> 
> Daniel
> 
> 
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 16 Nov 2010 13:42:54 -0800 (PST)
> From: Boris Derzhavets <bderzhavets@yahoo.com>
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to
> 	handle	kernel paging request
> To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
> 	xen-devel@lists.xensource.com,	Bruce Edge <bruce.edge@gmail.com>
> Message-ID: <923132.8834.qm@web56101.mail.re3.yahoo.com>
> Content-Type: text/plain; charset="us-ascii"
> 
>> So what I think you are saying is that you keep on getting the bug in DomU?
>> Is the stack-trace the same as in rc1?
> 
> Yes.
> When i want to get 1-2 hr of stable work :-
> 
> # service network restart
> # service nfs restart
> 
> at Dom0.
> 
> I also believe that presence of xen-pcifront.fix.patch is making things much more stable
> on F14.
> 
> Boris.
> 
> --- On Tue, 11/16/10, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> 
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to handle kernel paging request
> To: "Boris Derzhavets" <bderzhavets@yahoo.com>
> Cc: "Jeremy Fitzhardinge" <jeremy@goop.org>, xen-devel@lists.xensource.com, "Bruce Edge" <bruce.edge@gmail.com>
> Date: Tuesday, November 16, 2010, 4:15 PM
> 
> On Tue, Nov 16, 2010 at 12:43:28PM -0800, Boris Derzhavets wrote:
>>> Huh. I .. what? I am confused. I thought we established that the issue
>>> was not related to Xen PCI front? You also seem to uncomment the
>>> upstream.core.patches and the xen.pvhvm.patch - why?
>> 
>> I cannot uncomment upstream.core.patches and the xen.pvhvm.patch
>> it gives failed HUNKs
> 
> Uhh.. I am even more confused.
>> 
>>> Ok, they are.. v2.6.37-rc2 which came out today has the fixes
>> 
>> I am pretty sure rc2 doesn't contain everything from xen.next-2.6.37.patch,
>> gntdev's stuff for sure. I've built 2.6.37-rc2 kernel rpms and loaded 
>> kernel-2.6.27-rc2.git0.xendom0.x86_64 under Xen 4.0.1. 
>> Device /dev/xen/gntdev has not been created. I understand that it's
>> unrelated to DomU ( related to Dom0) , but once again with rc2 in DomU i cannot
>> get 3.2 GB copied over to DomU from NFS share at Dom0.
> 
> So what I think you are saying is that you keep on getting the bug in DomU?
> Is the stack-trace the same as in rc1?
> 
> 
> 
> 
> 
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: http://lists.xensource.com/archives/html/xen-devel/attachments/20101116/015048ae/attachment.html
> 
> ------------------------------
> 
> Message: 5
> Date: Tue, 16 Nov 2010 13:49:14 -0800 (PST)
> From: Boris Derzhavets <bderzhavets@yahoo.com>
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to
> 	handle	kernel paging request
> To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>,
> 	xen-devel@lists.xensource.com,	Bruce Edge <bruce.edge@gmail.com>
> Message-ID: <228566.47308.qm@web56106.mail.re3.yahoo.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Yes, here we are
> 
> [  186.975228] ------------[ cut here ]------------
> [  186.975245] kernel BUG at mm/mmap.c:2399!
> [  186.975254] invalid opcode: 0000 [#1] SMP 
> [  186.975269] last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
> [  186.975284] CPU 0 
> [  186.975290] Modules linked in: nfs fscache deflate zlib_deflate ctr camellia cast5 rmd160 crypto_null ccm serpent blowfish twofish_generic twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipv6 uinput xen_netfront microcode xen_blkfront [last unloaded: scsi_wait_scan]
> [  186.975507] 
> [  186.975515] Pid: 1562, comm: ls Not tainted 2.6.37-0.1.rc1.git8.xendom0.fc14.x86_64 #1 /
> [  186.975529] RIP: e030:[<ffffffff8110ada1>]  [<ffffffff8110ada1>] exit_mmap+0x10c/0x119
> [  186.975550] RSP: e02b:ffff8800781bde18  EFLAGS: 00010202
> [  186.975560] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [  186.975573] RDX: 00000000914a9149 RSI: 0000000000000001 RDI: ffffea00000c0280
> [  186.975585] RBP: ffff8800781bde48 R08: ffffea00000c0280 R09: 0000000000000001
> [  186.975598] R10: ffffffff8100750f R11: ffffea0000967778 R12: ffff880076c68b00
> [  186.975610] R13: ffff88007f83f1e0 R14: ffff880076c68b68 R15: 0000000000000001
> [  186.975625] FS:  00007f8e471d97c0(0000) GS:ffff88007f831000(0000) knlGS:0000000000000000
> [  186.975639] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  186.975650] CR2: 00007f8e464a9940 CR3: 0000000001a03000 CR4: 0000000000002660
> [  186.975663] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  186.976012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  186.976012] Process ls (pid: 1562, threadinfo ffff8800781bc000, task ffff8800788223e0)
> [  186.976012] Stack:
> [  186.976012]  000000000000006b ffff88007f83f1e0 ffff8800781bde38 ffff880076c68b00
> [  186.976012]  ffff880076c68c40 ffff8800788229d0 ffff8800781bde68 ffffffff810505fc
> [  186.976012]  ffff8800788223e0 ffff880076c68b00 ffff8800781bdeb8 ffffffff81056747
> [  186.976012] Call Trace:
> [  186.976012]  [<ffffffff810505fc>] mmput+0x65/0xd8
> [  186.976012]  [<ffffffff81056747>] exit_mm+0x13e/0x14b
> [  186.976012]  [<ffffffff81056976>] do_exit+0x222/0x7c6
> [  186.976012]  [<ffffffff8100750f>] ? xen_restore_fl_direct_end+0x0/0x1
> [  186.976012]  [<ffffffff8107ea7c>] ? arch_local_irq_restore+0xb/0xd
> [  186.976012]  [<ffffffff814b3949>] ? lockdep_sys_exit_thunk+0x35/0x67
> [  186.976012]  [<ffffffff810571b0>] do_group_exit+0x88/0xb6
> [  186.976012]  [<ffffffff810571f5>] sys_exit_group+0x17/0x1b
> [  186.976012]  [<ffffffff8100acf2>] system_call_fastpath+0x16/0x1b
> [  186.976012] Code: 8d 7d 18 e8 c3 8a 00 00 41 c7 45 08 00 00 00 00 48 89 df e8 0d e9 ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 98 01 00 00 00 74 02 <0f> 0b 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 54 53 48 
> [  186.976012] RIP  [<ffffffff8110ada1>] exit_mmap+0x10c/0x119
> [  186.976012]  RSP <ffff8800781bde18>
> [  186.976012] ---[ end trace c0f4eff4054a67e4 ]---
> [  186.976012] Fixing recursive fault but reboot is needed!
> 
> --- On Tue, 11/16/10, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> 
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Subject: Re: [Xen-devel] Re: 2.6.37-rc1 mainline domU - BUG: unable to handle kernel paging request
> To: "Boris Derzhavets" <bderzhavets@yahoo.com>
> Cc: "Jeremy Fitzhardinge" <jeremy@goop.org>, xen-devel@lists.xensource.com, "Bruce Edge" <bruce.edge@gmail.com>
> Date: Tuesday, November 16, 2010, 4:15 PM
> 
> On Tue, Nov 16, 2010 at 12:43:28PM -0800, Boris Derzhavets wrote:
>>> Huh. I .. what? I am confused. I thought we established that the issue
>>> was not related to Xen PCI front? You also seem to uncomment the
>>> upstream.core.patches and the xen.pvhvm.patch - why?
>> 
>> I cannot uncomment upstream.core.patches and the xen.pvhvm.patch
>> it gives failed HUNKs
> 
> Uhh.. I am even more confused.
>> 
>>> Ok, they are.. v2.6.37-rc2 which came out today has the fixes
>> 
>> I am pretty sure rc2 doesn't contain everything from xen.next-2.6.37.patch,
>> gntdev's stuff for sure. I've built 2.6.37-rc2 kernel rpms and loaded 
>> kernel-2.6.37-rc2.git0.xendom0.x86_64 under Xen 4.0.1. 
>> Device /dev/xen/gntdev has not been created. I understand that it's
>> unrelated to DomU ( related to Dom0) , but once again with rc2 in DomU i cannot
>> get 3.2 GB copied over to DomU from NFS share at Dom0.
> 
> So what I think you are saying is that you keep on getting the bug in DomU?
> Is the stack-trace the same as in rc1?
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
@ 2010-11-17 17:52   ` Jeremy Fitzhardinge
  2010-11-17 19:47     ` Andres Lagar-Cavilla
  2010-11-17 23:42   ` Daniel Stodden
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 17:52 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: xen-devel, Daniel Stodden

On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
> I'll throw an idea there and you educate me why it's lame.
>
> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>
> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>
> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>
> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>
> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.

Would GNTTABOP_swap also require the domU to have already unmapped the
page from its own pagetables?  Presumably it would fail if it didn't,
otherwise you'd end up with a domU mapping the same mfn as a
dom0-private page.

> 3. For blkfront reads, call swap when stuffing the response back into the ring
>
> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from other domain. It's your page all along, dom0
>
> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle)
>
> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as often and at the granularity as skbuff's demanded for GNTTABOP_transfer
>
> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.

I think that will be the common case - the kernel will always attempt to
write dirty pagecache pages to make clean ones, and it will still want
them around to access.  So it can't really give up the page altogether;
if it hands it over to dom0, it needs to make a local copy first.

> Problems at first glance:
> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
> 3. Managing the pool of backend reserved pages may be a problem?
>
> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach

It's not clear to me that it's any improvement over just directly copying
the data up front.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-16 21:28           ` Daniel Stodden
@ 2010-11-17 18:00             ` Jeremy Fitzhardinge
  2010-11-17 20:21               ` Daniel Stodden
  0 siblings, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 18:00 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/16/2010 01:28 PM, Daniel Stodden wrote:
>> What's the problem?  If you do nothing then it will appear to the kernel
>> as a bunch of processes doing memory allocations, and they'll get
>> blocked/rate-limited accordingly if memory is getting short.  
> The problem is that just letting the page allocator work through
> allocations isn't going to scale anywhere.
>
> The worst case memory requested under load is <number-of-disks> * (32 *
> 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> better.

Under what circumstances would you end up needing to allocate that many
pages?

> The number of I/O actually in-flight at any point, in contrast, is
> derived from the queue/sg sizes of the physical device. For a simple
> disk, that's about a ring or two.

Wouldn't that be the worst case?

>> There's
>> plenty of existing mechanisms to control that sort of thing (cgroups,
>> etc) without adding anything new to the kernel.  Or are you talking
>> about something other than simple memory pressure?
>>
>> And there's plenty of existing IPC mechanisms if you want them to
>> explicitly coordinate with each other, but I'd tend to think that's
>> premature unless you have something specific in mind.
>>
>>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
>>> Can't find it now, what happened? Without, there's presently still no
>>> zero-copy.
>> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
>> new-ish) mmu notifier infrastructure which is intended to allow a device
>> to sync an external MMU with usermode mappings.  We're not using it in
>> precisely that way, but it allows us to wrangle grant mappings before
>> the generic code tries to do normal pte ops on them.
> The mmu notifiers were for safe teardown only. They are not sufficient
> for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> need to back those VMAs with page structs.

The pages will have struct page, because they're normal kernel pages
which happen to be backed by mapped granted pages.  Are you talking
about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? 
Isn't that just there to cope with the nested-IO-on-the-same-page
problem that the current blktap architecture provokes?  If there's only
a single IO on each page - the one initiated by usermode - then it
shouldn't be necessary, right?

>   Or bounce again (gulp, just
> mentioning it). As with the blktap2 patches, note there is no difference
> in the dom0 memory bill, it takes page frames.

(And perhaps actual pages to substitute for the granted pages.)

> I guess we've been meaning the same thing here, unless I'm
> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> default to order 0 entries indeed. Sorry, you're right, that's not a
> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> of the request frames have count>1. It can flip and abandon those as
> normal memory. But it will need those lost memory slots back, straight
> away or next time it's running out of frames. As order-0 allocations.

Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
fail if the system is under extreme memory pressure.  And it has the
nice property that if those allocations block or fail it rate limits IO
ingress from domains rather than being crushed by memory pressure at the
backend (ie, the problem with trying to allocate memory in the writeout
path).

Also the cgroup mechanism looks like an extremely powerful way to
control the allocations for a process or group of processes to stop them
from dominating the whole machine.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 17:52   ` Jeremy Fitzhardinge
@ 2010-11-17 19:47     ` Andres Lagar-Cavilla
  0 siblings, 0 replies; 18+ messages in thread
From: Andres Lagar-Cavilla @ 2010-11-17 19:47 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Daniel Stodden

So, swapping mfns for write requests is a definite no-no. One could still live with copying write buffers and swapping read buffers by the end of the request. That still yields some benefit. 
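
As a rough illustration of "copy writes, swap reads", here is a toy model in Python. Nothing in it is a real Xen interface: GNTTABOP_swap is only a proposal in this thread, and Machine, Domain, swap_frames, backend_read and backend_write are names made up for the sketch. It tracks pfn-to-mfn tables only and ignores kernel mappings, grant checks and multicall batching.

```python
class Machine:
    """Machine memory: maps an mfn to the bytes stored in that frame."""
    def __init__(self):
        self.frames = {}
        self._next_mfn = 0

    def alloc(self, data=b""):
        mfn = self._next_mfn
        self._next_mfn += 1
        self.frames[mfn] = data
        return mfn


class Domain:
    """A domain is modelled as nothing more than a p2m table (pfn -> mfn)."""
    def __init__(self, machine):
        self.machine = machine
        self.p2m = {}

    def map_new(self, pfn, data=b""):
        self.p2m[pfn] = self.machine.alloc(data)

    def read(self, pfn):
        return self.machine.frames[self.p2m[pfn]]

    def write(self, pfn, data):
        self.machine.frames[self.p2m[pfn]] = data


def swap_frames(dom_a, pfn_a, dom_b, pfn_b):
    """Stand-in for the hypothetical GNTTABOP_swap: exchange the mfns
    behind two pfns. The pfns themselves stay where they are."""
    dom_a.p2m[pfn_a], dom_b.p2m[pfn_b] = dom_b.p2m[pfn_b], dom_a.p2m[pfn_a]


def backend_read(dom0, pool_pfn, domU, buf_pfn, disk_data):
    """blkfront read: the I/O lands in a regular dom0 frame, and the
    frames are swapped only when the response is produced."""
    dom0.write(pool_pfn, disk_data)
    swap_frames(dom0, pool_pfn, domU, buf_pfn)


def backend_write(dom0, pool_pfn, domU, buf_pfn):
    """blkfront write: plain copy, so domU keeps its (pagecache) page."""
    dom0.write(pool_pfn, domU.read(buf_pfn))
    return dom0.read(pool_pfn)
```

After backend_read the guest finds the data under its unchanged pfn, backed by what used to be a dom0 frame; after backend_write the guest still owns the frame it started with.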

As for kernel mappings, I thought a solution would be to provide the hypervisor with both pte pointers. After all pte pointers are already provided for mapping grants in user-space. But that's a little too much to handle for the current interface.

Thanks for the feedback
Andres
On Nov 17, 2010, at 12:52 PM, Jeremy Fitzhardinge wrote:

> On 11/17/2010 08:36 AM, Andres Lagar-Cavilla wrote:
>> I'll throw an idea there and you educate me why it's lame.
>> 
>> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
>> 
>> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
>> 
>> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
>> 
>> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
>> 
>> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.
> 
> Would GNTTABOP_swap also require the domU to have already unmapped the
> page from its own pagetables?  Presumably it would fail if it didn't,
> otherwise you'd end up with a domU mapping the same mfn as a
> dom0-private page.
> 
>> 3. For blkfront reads, call swap when stuffing the response back into the ring
>> 
>> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from other domain. It's your page all along, dom0
>> 
>> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle)
>> 
>> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as often and at the granularity as skbuff's demanded for GNTTABOP_transfer
>> 
>> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy. From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
> 
> I think that will be the common case - the kernel will always attempt to
> write dirty pagecache pages to make clean ones, and it will still want
> them around to access.  So it can't really give up the page altogether;
> if it hands it over to dom0, it needs to make a local copy first.
> 
>> Problems at first glance:
>> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
>> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
>> 3. Managing the pool of backend reserved pages may be a problem?
>> 
>> So in the end, perhaps more of an academic exercise than a palatable answer, but nonetheless I'd like to hear other problems people may find with this approach
> 
> It's not clear to me that it's any improvement over just directly copying
> the data up front.
> 
>    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 18:00             ` Jeremy Fitzhardinge
@ 2010-11-17 20:21               ` Daniel Stodden
  2010-11-17 21:02                 ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 20:21 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Wed, 2010-11-17 at 13:00 -0500, Jeremy Fitzhardinge wrote:
> On 11/16/2010 01:28 PM, Daniel Stodden wrote:
> >> What's the problem?  If you do nothing then it will appear to the kernel
> >> as a bunch of processes doing memory allocations, and they'll get
> >> blocked/rate-limited accordingly if memory is getting short.  
> > The problem is that just letting the page allocator work through
> > allocations isn't going to scale anywhere.
> >
> > The worst case memory requested under load is <number-of-disks> * (32 *
> > 11 pages). As a (conservative) rule of thumb, N will be 200 or rather
> > better.
> 
> Under what circumstances would you end up needing to allocate that many
> pages?

I don't. Independently running tapdisks would do, on behalf of guests
queuing I/O.

That's why one wouldn't just let them run and allocate their own memory.
The memory space set aside for I/O should be a shared resource.

> > The number of I/O actually in-flight at any point, in contrast, is
> > derived from the queue/sg sizes of the physical device. For a simple
> > disk, that's about a ring or two.
> 
> Wouldn't that be the worst case?

Yes. It's quite small. 2 or 3 megs per physical backend are usually
sufficient.

> >> There's
> >> plenty of existing mechanisms to control that sort of thing (cgroups,
> >> etc) without adding anything new to the kernel.  Or are you talking
> >> about something other than simple memory pressure?
> >>
> >> And there's plenty of existing IPC mechanisms if you want them to
> >> explicitly coordinate with each other, but I'd tend to think that's
> >> premature unless you have something specific in mind.
> >>
> >>> Also, I was absolutely certain I once saw VM_FOREIGN support in gntdev..
> >>> Can't find it now, what happened? Without, there's presently still no
> >>> zero-copy.
> >> gntdev doesn't need VM_FOREIGN any more - it uses the (relatively
> >> new-ish) mmu notifier infrastructure which is intended to allow a device
> >> to sync an external MMU with usermode mappings.  We're not using it in
> >> precisely that way, but it allows us to wrangle grant mappings before
> >> the generic code tries to do normal pte ops on them.
> > The mmu notifiers were for safe teardown only. They are not sufficient
> > for DIO, which wants gup() to work. If you want zcopy on gntdev, we'll
> > need to back those VMAs with page structs.
> 
> The pages will have struct page, because they're normal kernel pages
> which happen to be backed by mapped granted pages.

And, like all granted frames, not owning them implies they are not
resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
without the VM_FOREIGN hack.

Correct me if I'm mistaken. I used to be quicker looking up stuff on
arch-xen kernels, but I think fundamental constants of the Xen universe
didn't change since last time.

>   Are you talking
> about the #ifdef CONFIG_XEN code in the middle of __get_user_pages()? 
> Isn't that just there to cope with the nested-IO-on-the-same-page
> problem that the current blktap architecture provokes?  If there's only
> a single IO on each page - the one initiated by usermode - then it
> shouldn't be necessary, right?

No. Jake brought the aliasing in specifically to get blktap2 working
with zero-copy.

VM_FOREIGN is much older. Only blktap2 does the recursive thing, because
it's a blkdev above some physical dev. Blktap1 went from the guest ring
straight down to userland. As would be the case with a gntdev-based
blkback.

> >   Or bounce again (gulp, just
> > mentioning it). As with the blktap2 patches, note there is no difference
> > in the dom0 memory bill, it takes page frames.
> 
> (And perhaps actual pages to substitute for the granted pages.)

Well yes, that's right. Still fine as long as there's some relatively small
constant boundary on the worst case. O(n) for large systems would go in
the hundreds of megs. Given that the *reasonable* amount of memory used
simultaneously is pretty small in any case, even going through the
memory allocator can be skipped.
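
To put rough numbers on the worst case versus the in-flight case, a back-of-the-envelope sketch. The 4 KiB page size is an assumption, the 200 disks are the rule-of-thumb figure quoted above, and "32 * 11 pages" is read here as 32 ring requests of up to 11 pages each. The function names are made up for illustration.

```python
# Back-of-the-envelope numbers for tapdisk I/O buffer memory.
PAGE_SIZE = 4096         # assumption: 4 KiB pages
RING_REQUESTS = 32       # from the thread: 32 requests per ring ...
PAGES_PER_REQUEST = 11   # ... of up to 11 pages each
MIB = 1024 * 1024


def pages_per_ring():
    """Pages needed to back one completely full ring."""
    return RING_REQUESTS * PAGES_PER_REQUEST


def worst_case_bytes(num_disks):
    """Every disk holds a full ring's worth of pages at the same time."""
    return num_disks * pages_per_ring() * PAGE_SIZE


def in_flight_bytes(rings):
    """Memory for the I/O a physical device actually has in flight,
    expressed as a number of rings' worth of pages."""
    return rings * pages_per_ring() * PAGE_SIZE


if __name__ == "__main__":
    print("worst case, 200 disks: %.1f MiB" % (worst_case_bytes(200) / MIB))
    print("in flight, 2 rings:    %.2f MiB" % (in_flight_bytes(2) / MIB))
```

Under these assumptions 200 disks come to 275 MiB in the worst case, while two rings' worth in flight is 2.75 MiB, which lines up with the "hundreds of megs" and "2 or 3 megs" figures.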

[
Part of the reason why blktap *never* frees those pages, apart from
being slightly greedy, is the deadlock hazard when writing those nodes in
dom0 through the pagecache, as dom0 might. You need memory pools on the
datapath to guarantee progress under pressure. That got pretty ugly
after 2.6.27, btw.
]

In any case, let's skip trying what happens if a thundering herd of
several hundred userspace disks tries gfp()ing their grant slots out of
dom0 without arbitration.

> > I guess we've been meaning the same thing here, unless I'm
> > misunderstanding you. Any pfn does, and the balloon pagevec allocations
> > default to order 0 entries indeed. Sorry, you're right, that's not a
> > 'range'. With a pending re-xmit, the backend can find a couple (or all)
> > of the request frames have count>1. It can flip and abandon those as
> > normal memory. But it will need those lost memory slots back, straight
> > away or next time it's running out of frames. As order-0 allocations.
> 
> Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
> fail if the system is under extreme memory pressure.  And it has the
> nice property that if those allocations block or fail it rate limits IO
> ingress from domains rather than being crushed by memory pressure at the
> backend (ie, the problem with trying to allocate memory in the writeout
> path).
> 
> Also the cgroup mechanism looks like an extremely powerful way to
> control the allocations for a process or group of processes to stop them
> from dominating the whole machine.

Ah. In case it can be put to work to bind processes allocating pagecache
entries for dirtying to some boundary, I'd be really interested. I think
I came across it once but didn't take the time to read the docs
thoroughly. Can it?

Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 20:21               ` Daniel Stodden
@ 2010-11-17 21:02                 ` Jeremy Fitzhardinge
  2010-11-17 21:57                   ` Daniel Stodden
  0 siblings, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 21:02 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/17/2010 12:21 PM, Daniel Stodden wrote:
> And, like all granted frames, not owning them implies they are not
> resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
> without the VM_FOREIGN hack.

Hm, I see.  Well, I wonder if using _PAGE_SPECIAL would help (it is put
on usermode ptes which don't have a backing struct page).  After all,
there's no fundamental reason why it would need a pfn; the mfn in the
pte is what's actually needed to ultimately generate a DMA descriptor.

> Correct me if I'm mistaken. I used to be quicker looking up stuff on
> arch-xen kernels, but I think fundamental constants of the Xen universe
> didn't change since last time.

No, but Linux has.

> [
> Part of the reason why blktap *never* frees those pages, apart from
> being slightly greedy, are deadlock hazards when writing those nodes in
> dom0 through the pagecache, as dom0 might. You need memory pools on the
> datapath to guarantee progress under pressure. That got pretty ugly
> after 2.6.27, btw.
> ]

That's what mempools are intended to solve.

> In any case, let's skip trying what happens if a thundering herd of
> several hundred userspace disks tries gfp()ing their grant slots out of
> dom0 without arbitration.

I'm not against arbitration, but I don't think that's something that
should be implemented as part of a Xen driver.

>>> I guess we've been meaning the same thing here, unless I'm
>>> misunderstanding you. Any pfn does, and the balloon pagevec allocations
>>> default to order 0 entries indeed. Sorry, you're right, that's not a
>>> 'range'. With a pending re-xmit, the backend can find a couple (or all)
>>> of the request frames have count>1. It can flip and abandon those as
>>> normal memory. But it will need those lost memory slots back, straight
>>> away or next time it's running out of frames. As order-0 allocations.
>> Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
>> fail if the system is under extreme memory pressure.  And it has the
>> nice property that if those allocations block or fail it rate limits IO
>> ingress from domains rather than being crushed by memory pressure at the
>> backend (ie, the problem with trying to allocate memory in the writeout
>> path).
>>
>> Also the cgroup mechanism looks like an extremely powerful way to
>> control the allocations for a process or group of processes to stop them
>> from dominating the whole machine.
> Ah. In case it can be put to work to bind processes allocating pagecache
> entries for dirtying to some boundary, I'd be really interested. I think
> I came across it once but didn't take the time to read the docs
> thoroughly. Can it?

I'm not sure about dirtyness - it seems like something that should be
within its remit, even if it doesn't currently have it.

The cgroup mechanism is extremely powerful, now that I look at it.  You
can do everything from setting block IO priorities and QoS parameters to
CPU limits.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 21:02                 ` Jeremy Fitzhardinge
@ 2010-11-17 21:57                   ` Daniel Stodden
  2010-11-17 22:14                     ` Jeremy Fitzhardinge
  2010-11-17 23:32                     ` Daniel Stodden
  0 siblings, 2 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 21:57 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:
> On 11/17/2010 12:21 PM, Daniel Stodden wrote:
> > And, like all granted frames, not owning them implies they are not
> > resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
> > without the VM_FOREIGN hack.
> 
> Hm, I see.  Well, I wonder if using _PAGE_SPECIAL would help (it is put
> on usermode ptes which don't have a backing struct page).  After all,
> there's no fundamental reason why it would need a pfn; the mfn in the
> pte is what's actually needed to ultimately generate a DMA descriptor.

The kernel needs the page structs at least for locking and refcounting.

There's also some trickier stuff in there. Like redirtying disk-backed
user memory after read completion, in case it's been laundered. (So that
an AIO on unpinned user memory doesn't subsequently get flashed back
when cycling through swap, if I understood that thing correctly.)

Doesn't apply for blktap (it's all reserved pages). All I mean is: I
wouldn't exactly see some innocent little dio hack or so shape up in
there.

Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I'd
agree. E.g. there's a concept of devices DMA-ing into arbitrary I/O
memory space, not host memory, on some bus architectures. PCI would come
to my mind (the old shared medium stuff, unsure about those newfangled
P-t-P topologies). But not in Linux, so I presently don't see anybody
upstream bothering to make block-I/O request addressing more forgiving
than it is.

PAGE_SPECIAL -- to the kernel, that means the opposite: page structs
which aren't backed by 'real' memory, so gup(), for example, is told to
fail (how nasty). In contrast, VM_FOREIGN is non-memory backed by page
structs.

> > Correct me if I'm mistaken. I used to be quicker looking up stuff on
> > arch-xen kernels, but I think fundamental constants of the Xen universe
> > didn't change since last time.
> 
> No, but Linux has.

Not in that respect.

There's certainly a way to get VM_FOREIGN out of the mainline code. It
would involve an unlikely() branch in .pte_val(=xen_pte_val) to fall
back into a private local m2p hash lookup. Assuming that kind of thing
gets nowhere inlined. Not nice, but still more upstreamable than
VM_FOREIGN.

> > [
> > Part of the reason why blktap *never* frees those pages, apart from
> > being slightly greedy, are deadlock hazards when writing those nodes in
> > dom0 through the pagecache, as dom0 might. You need memory pools on the
> > datapath to guarantee progress under pressure. That got pretty ugly
> > after 2.6.27, btw.
> > ]
> 
> That's what mempools are intended to solve.

That's why the blktap frame pool is now a mempool, indeed.
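
For readers who haven't met mempools, a toy model of the idea in Python. This is not the blktap code and not the kernel API; FramePool and BudgetAllocator are made-up names. It only mimics the general behaviour: keep a preallocated reserve, try the normal allocator first, fall back to the reserve under pressure, and refill the reserve before giving anything back.

```python
class BudgetAllocator:
    """Pretend page allocator that can hand out only `budget` frames."""
    def __init__(self, budget):
        self.budget = budget

    def alloc(self):
        if self.budget == 0:
            return None          # models allocation failure under pressure
        self.budget -= 1
        return bytearray(4096)

    def free(self, frame):
        self.budget += 1


class FramePool:
    """Toy mempool: a reserve of min_nr frames guarantees forward progress."""
    def __init__(self, min_nr, alloc_fn, free_fn):
        self.min_nr = min_nr
        self.alloc_fn = alloc_fn
        self.free_fn = free_fn
        self.reserve = []
        for _ in range(min_nr):
            frame = alloc_fn()
            if frame is None:
                raise MemoryError("cannot preallocate the reserve")
            self.reserve.append(frame)

    def alloc(self):
        frame = self.alloc_fn()  # normal path: ask the allocator first
        if frame is not None:
            return frame
        if self.reserve:         # under pressure: dip into the reserve
            return self.reserve.pop()
        return None              # a real pool would typically wait here

    def free(self, frame):
        if len(self.reserve) < self.min_nr:
            self.reserve.append(frame)  # top the reserve up first
        else:
            self.free_fn(frame)
```

The point for the datapath is that up to min_nr requests can always be served, no matter what the rest of dom0 is doing to the allocator.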

> > In any case, let's skip trying what happens if a thundering herd of
> > several hundred userspace disks tries gfp()ing their grant slots out of
> > dom0 without arbitration.
> 
> I'm not against arbitration, but I don't think that's something that
> should be implemented as part of a Xen driver.

Uhm, maybe I'm misunderstanding you, isn't the whole thing a Xen driver?
What do you suggest?

> >>> I guess we've been meaning the same thing here, unless I'm
> >>> misunderstanding you. Any pfn does, and the balloon pagevec allocations
> >>> default to order 0 entries indeed. Sorry, you're right, that's not a
> >>> 'range'. With a pending re-xmit, the backend can find a couple (or all)
> >>> of the request frames have count>1. It can flip and abandon those as
> >>> normal memory. But it will need those lost memory slots back, straight
> >>> away or next time it's running out of frames. As order-0 allocations.
> >> Right.  GFP_KERNEL order 0 allocations are pretty reliable; they only
> >> fail if the system is under extreme memory pressure.  And it has the
> >> nice property that if those allocations block or fail it rate limits IO
> >> ingress from domains rather than being crushed by memory pressure at the
> >> backend (ie, the problem with trying to allocate memory in the writeout
> >> path).
> >>
> >> Also the cgroup mechanism looks like an extremely powerful way to
> >> control the allocations for a process or group of processes to stop them
> >> from dominating the whole machine.
> > Ah. In case it can be put to work to bind processes allocating pagecache
> > entries for dirtying to some boundary, I'd be really interested. I think
> > I came across it once but didn't take the time to read the docs
> > thoroughly. Can it?
> 
> I'm not sure about dirtyness - it seems like something that should be
> within its remit, even if it doesn't currently have it.
> 
> The cgroup mechanism is extremely powerful, now that I look at it.  You
> can do everything from setting block IO priorities and QoS parameters to
> CPU limits.

Thanks. I'll keep it under my pillow then.

Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 21:57                   ` Daniel Stodden
@ 2010-11-17 22:14                     ` Jeremy Fitzhardinge
       [not found]                       ` <1290035201.11102.1577.camel@agari.van.xensource.com>
  2010-11-17 23:32                     ` Daniel Stodden
  1 sibling, 1 reply; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-17 22:14 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/17/2010 01:57 PM, Daniel Stodden wrote:
> On Wed, 2010-11-17 at 16:02 -0500, Jeremy Fitzhardinge wrote:
>> On 11/17/2010 12:21 PM, Daniel Stodden wrote:
>>> And, like all granted frames, not owning them implies they are not
>>> resolvable via mfn_to_pfn, thereby failing in follow_page, thereby gup()
>>> without the VM_FOREIGN hack.
>> Hm, I see.  Well, I wonder if using _PAGE_SPECIAL would help (it is put
>> on usermode ptes which don't have a backing struct page).  After all,
>> there's no fundamental reason why it would need a pfn; the mfn in the
>> pte is what's actually needed to ultimately generate a DMA descriptor.
> The kernel needs the page structs at least for locking and refcounting.

Yeah.

> There's also some trickier stuff in there. Like redirtying disk-backed
> user memory after read completion, in case it's been laundered. (So that
> an AIO on unpinned user memory doesn't subsequently get flashed back
> when cycling through swap, if I understood that thing correctly.)
>
> Doesn't apply for blktap (it's all reserved pages). All I mean is: I
> wouldn't exactly see some innocent little dio hack or so shape up in
> there.
>
> Kernel allowing to DMA into a bare pfnmap -- From the platform POV, I'd
> agree. E.g. there's a concept of devices DMA-ing into arbitrary I/O
> memory space, not host memory, on some bus architectures. PCI would come
> to my mind (the old shared medium stuff, unsure about those newfangled
> P-t-P topologies). But not in Linux, so I presently don't see anybody
> upstream bothering to make block-I/O request addressing more forgiving
> than it is.
>
> PAGE_SPECIAL -- to the kernel, that means the opposite: page structs
> which aren't backed by 'real' memory, so gup(), for example, is told to
> fail (how nasty).

It's pfns with no corresponding struct page - it's the pte level
equivalent of VM_PFNMAP at the VMA level.  But you're right that we
can't do without struct pages.

So we're back to needing a way of mapping from a random mfn to a pfn so
we can find the corresponding struct page.  I would be tempted to put a
layer over m2p to allow local m2p mappings to override the global m2p table.
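
For illustration, a minimal userspace sketch of such a layered lookup. It
assumes a small chained hash for the local overrides and a toy array standing
in for the global m2p table; all names here (local_m2p_add,
mfn_to_pfn_sketch, GLOBAL_MFNS, ...) are hypothetical and not taken from the
pvops code, and a real version would also need locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCAL_M2P_BUCKETS 64u              /* hypothetical, power of two */
#define GLOBAL_MFNS       256u             /* toy size of the global table */
#define INVALID_PFN       (~(uint64_t)0)

/* One local override: a foreign/granted mfn mapped to a local pfn. */
struct local_m2p_entry {
    uint64_t mfn;
    uint64_t pfn;
    struct local_m2p_entry *next;
};

static struct local_m2p_entry *local_m2p[LOCAL_M2P_BUCKETS];
static uint64_t global_m2p[GLOBAL_MFNS];   /* stand-in for the real m2p */

static unsigned int bucket(uint64_t mfn)
{
    return (unsigned int)(mfn & (LOCAL_M2P_BUCKETS - 1));
}

/* Caller supplies the entry, so no allocation happens in here. */
static void local_m2p_add(struct local_m2p_entry *e, uint64_t mfn, uint64_t pfn)
{
    unsigned int b = bucket(mfn);

    assert(e != NULL);
    e->mfn = mfn;
    e->pfn = pfn;
    e->next = local_m2p[b];
    local_m2p[b] = e;
}

/* Returns 1 if an override for mfn was unlinked, 0 if none existed. */
static int local_m2p_remove(uint64_t mfn)
{
    struct local_m2p_entry **pp = &local_m2p[bucket(mfn)];

    for (; *pp; pp = &(*pp)->next) {
        if ((*pp)->mfn == mfn) {
            *pp = (*pp)->next;
            return 1;
        }
    }
    return 0;
}

/* Local overrides win; undefined entries fall through to the global table. */
static uint64_t mfn_to_pfn_sketch(uint64_t mfn)
{
    struct local_m2p_entry *e;

    for (e = local_m2p[bucket(mfn)]; e; e = e->next)
        if (e->mfn == mfn)
            return e->pfn;
    if (mfn < GLOBAL_MFNS)
        return global_m2p[mfn];
    return INVALID_PFN;
}
```

With no override installed, a lookup of an mfn outside the toy global table
yields INVALID_PFN; after local_m2p_add() the same lookup returns the local
pfn, from which the struct page could then be found.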

>  In contrast, VM_FOREIGN is non-memory backed by page
> structs.
Yep.

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 21:57                   ` Daniel Stodden
  2010-11-17 22:14                     ` Jeremy Fitzhardinge
@ 2010-11-17 23:32                     ` Daniel Stodden
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 23:32 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xen-devel@lists.xensource.com

On Wed, 2010-11-17 at 16:57 -0500, Daniel Stodden wrote:

> > > In any case, let's skip trying what happens if a thundering herd of
> > > several hundred userspace disks tries gfp()ing their grant slots out of
> > > dom0 without arbitration.
> > 
> > I'm not against arbitration, but I don't think that's something that
> > should be implemented as part of a Xen driver.
> 
> Uhm, maybe I'm misunderstanding you, isn't the whole thing a Xen driver?
> What do you suggest?

Just misread you, sorry. You mean arbitration via IPC.

Somewhat, but ugly. Just counting pages with cmpxchg in some shm segment
isn't a big deal, agreed. But userspace fixing the counter after
detecting some process crash would already start to complicate that.
Next, someone has to do the ballooning, and you need gntmap to
understand the VMA type anyway. From there on, going the rest of the way
and letting the kernel driver round-robin the frame pool altogether is much
smaller and cleaner.
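
As a rough sketch of the "counting pages with cmpxchg" part only, assuming
C11 atomics and a counter that would sit in a shared memory segment. The
names (frame_budget, budget_take, budget_give) are made up for illustration,
and the crash-recovery problem mentioned above is deliberately left unsolved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared accounting record; would be placed in a shm segment. */
struct frame_budget {
    atomic_uint in_use;      /* frames currently claimed by all processes */
    unsigned int limit;      /* global cap on claimed frames */
};

/* Try to claim n frames; fails without side effects if the cap is hit. */
static bool budget_take(struct frame_budget *b, unsigned int n)
{
    unsigned int cur = atomic_load(&b->in_use);

    do {
        if (n > b->limit - cur)          /* in_use never exceeds limit */
            return false;
        /* On failure, cur is reloaded with the current value and we retry. */
    } while (!atomic_compare_exchange_weak(&b->in_use, &cur, cur + n));

    return true;
}

/* Return n previously claimed frames to the budget. */
static void budget_give(struct frame_budget *b, unsigned int n)
{
    unsigned int old = atomic_fetch_sub(&b->in_use, n);

    assert(old >= n);                    /* catch double release */
    (void)old;
}
```

A process that dies between budget_take() and budget_give() leaks its share
of the counter, which is exactly the cleanup burden described above.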

Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
  2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
  2010-11-17 17:52   ` Jeremy Fitzhardinge
@ 2010-11-17 23:42   ` Daniel Stodden
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Stodden @ 2010-11-17 23:42 UTC (permalink / raw)
  To: Andres Lagar-Cavilla; +Cc: Fitzhardinge, xen-devel@lists.xensource.com, Jeremy

On Wed, 2010-11-17 at 11:36 -0500, Andres Lagar-Cavilla wrote:
> I'll throw an idea there and you educate me why it's lame.
> 
> Going back to the primary issue of dropping zero-copy, you want the block backend (tapdev w/AIO or otherwise) to operate on regular dom0 pages, because you run into all sorts of quirkiness otherwise: magical VM_FOREIGN incantations to back granted mfn's with fake page structs that make get_user_pages happy, quirky grant PTEs, etc.
> 
> Ok, so how about something along the lines of GNTTABOP_swap? Eerily reminiscent of (maligned?) GNTTABOP_transfer, but hear me out.
> 
> The observation is that for a blkfront read, you could do the read all along on a regular dom0 frame, and when stuffing the response into the ring, swap the dom0 frame (mfn) you used with the domU frame provided as a buffer. Then the algorithm folds out:
> 
> 1. Block backend, instead of get_empty_pages_and_pagevec at init time, creates a pool of reserved regular pages via get_free_page(s). These pages have their refcount pumped, no one in dom0 will ever touch them.
> 
> 2. When extracting a blkfront write from the ring, call GNTTABOP_swap immediately. One of the backend-reserved mfn's is swapped with the domU mfn. Pfn's and page struct's on both ends remain untouched.
> 
> 3. For blkfront reads, call swap when stuffing the response back into the ring
> 
> 4. Because of 1, dom0 can a) calmly fix its p2m (and kvaddr) after swap, much like balloon and others do, without fear of races. More importantly, b) you don't have a weirdo granted PTE, or work with a frame from another domain. It's your page all along, dom0
> 
> 5. One assumption for domU is that pages allocated as blkfront buffers won't be touched by anybody, so a) it's safe for them to swap async with another frame with undef contents and b) domU can fix its p2m (and kvaddr) when pulling responses from the ring (the new mfn should be put on the response by dom0 directly or through an opaque handle)
> 
> 6. Scatter-gather vectors in ring requests give you a natural multicall batching for these GNTTABOP_swap's. I.e. all these hypercalls won't happen as often and at the granularity as skbuff's demanded for GNTTABOP_transfer
> 
> 7. Potentially domU may want to use the contents in a blkfront write buffer later for something else. So it's not really zero-copy. But the approach opens a window to async memcpy . From the point of swap when pulling the req to the point of pushing the response, you can do memcpy at any time. Don't know about how practical that is though.
> 
> Problems at first glance:
> 1. To support GNTTABOP_swap you need to add more if(version) to blkfront and blkback.
> 2. The kernel vaddr will need to be managed as well by dom0/U. Much like balloon or others: hypercall, fix p2m, and fix kvaddr all need to be taken care of. domU will probably need to neuter its kvaddr before granting, and then re-establish it when the response arrives. Weren't all these hypercalls ultimately more expensive than memcpy for GNTTABOP_transfer for netback?
> 3. Managing the pool of backend reserved pages may be a problem?
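
To make the proposed frame exchange concrete, here is a toy model of the read
path in steps 1 to 5, assuming each domain is just a small p2m array over a
shared pool of machine frames. GNTTABOP_swap does not exist; gnttab_swap_sketch
and the other names below are hypothetical, and the real thing would also
involve grant checks, kvaddr fixups and TLB flushes.

```c
#include <assert.h>
#include <string.h>

#define FRAME_SIZE 16
#define NR_MFNS    8
#define NR_PFNS    4

/* Toy machine memory: NR_MFNS frames of FRAME_SIZE bytes each. */
static char machine[NR_MFNS][FRAME_SIZE];

/* A domain is modelled only by its pfn -> mfn table. */
struct domain {
    unsigned int p2m[NR_PFNS];
};

/* Resolve a domain's pfn to the backing machine frame. */
static char *dom_page(struct domain *d, unsigned int pfn)
{
    assert(pfn < NR_PFNS);
    return machine[d->p2m[pfn]];
}

/*
 * Exchange the machine frames behind a's pfn_a and b's pfn_b.
 * Pfns (and hence page structs) on both ends stay untouched;
 * only the p2m entries change.
 */
static void gnttab_swap_sketch(struct domain *a, unsigned int pfn_a,
                               struct domain *b, unsigned int pfn_b)
{
    unsigned int tmp;

    assert(pfn_a < NR_PFNS && pfn_b < NR_PFNS);
    tmp = a->p2m[pfn_a];
    a->p2m[pfn_a] = b->p2m[pfn_b];
    b->p2m[pfn_b] = tmp;
}
```

For a read, the backend would fill one of its reserved pages through
dom_page(), then swap it with the guest's buffer pfn when pushing the
response, so the guest finds the data under its original pfn without a copy.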

I guess GNT_transfer for network I/O died because of the double-ended
TLB fallout?

Still liked the general direction, nice shot.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blktap: Sync with XCP, dropping zero-copy.
       [not found]                           ` <1290040898.11102.1709.camel@agari.van.xensource.com>
@ 2010-11-18  2:29                             ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 18+ messages in thread
From: Jeremy Fitzhardinge @ 2010-11-18  2:29 UTC (permalink / raw)
  To: Daniel Stodden; +Cc: Xen-devel@lists.xensource.com

On 11/17/2010 04:41 PM, Daniel Stodden wrote:
> On Wed, 2010-11-17 at 18:49 -0500, Jeremy Fitzhardinge wrote:
>> On 11/17/2010 03:06 PM, Daniel Stodden wrote:
>>>> So we're back to needing a way of mapping from a random mfn to a pfn so
>>>> we can find the corresponding struct page.  I would be tempted to put a
>>>> layer over m2p to allow local m2p mappings to override the global m2p table.
>>> I think a local m2p lookup on a slow path is a superior option, iff you
>>> do think it's doable. Without e.g. risking to bloat some inline stuff, I
>>> mean.
>>>
>>> Where do you think in the call stack down into pvops code would the
>>> lookup go? Asking because I'd expect the kernel to potentially learn
>>> more tricks with that.  
>> I don't think m2p is all that performance critical.  p2m is used way more.
>>
>> p2m is already a radix tree; 
> Yes, but pfns are dense plus holes, aren't they?

Yes.

>> I think m2p could be done somewhat
>> similarly, where undefined entries fall through to the global m2p.  The
>> main problem is probably making sure the memory allocation for new m2p
>> entries can be done in a blocking context, so we don't have to rely on
>> GFP_ATOMIC.
> Whatever the index is going to be, all the backends I'm aware of run
> their rings on a kthread. Sounds to me like GFP_WAIT followed by an
> rwlock is perfectly sufficient. Only the reversal commonly ends up in
> interrupt context.
>
>> That particular m2p lookup would be in xen_pte_val(), but I think that's
>> the callsite for pretty much all m2p lookups.
>>
>>> A mfn->gref mapping would obsolete blkback-pagemap. Well, iff the
>>> kernel-blktap zerocopy stuff wants to be resurrected.
>>>
>>> It would also be a cheap way to implement current blktap to do
>>> virt->gref lookups for tapdisks. Some tapdisk filters want this, present
>>> major example is the memshr patch, and it's sort of nicer than a ring
>>> message hack.
>>>
>>> Wouldn't storing the handle allow unmapping grant ptes on the normal
>>> user PT teardown path? I think we always had this .zap_pte vma-operation
>>> in blktap, iirc. MMU notifier replacement? Maybe not a good one.
>> I think mmu notifiers are fine; this is exactly what they're for after
>> all.  Very few address spaces have granted pages mapped into them, so
>> keeping the normal pte operations as fast as possible and using more
>> expensive notifiers for the afflicted mms seems like the right tradeoff.
>>
>> Hm.  Before Gerd highlighted mmu notifiers as the right way to deal with
>> granted pages in gntdev, I had the idea of allocating a shadow page for
>> each pte page and storing grant refs there, where the shadow is hung off
>> the pte page's struct page, where set_pte could use it to do the special
>> thing if needed.
>>
>> I wonder if we could do something similar here, where we store the pfn
>> for a given pte in the shadow?  But how would it get used?  There's no
>> "read pte" pvop, and pte_val gets the pte value, not its address, so
>> there'd be no way to find the shadow.  So I guess that doesn't work.
> I'm not sure about the radix variant. All the backends do order-0
> allocations, as discussed above. Depending on the driver pair behavior,
> the mfn ranges can get arbitrarily sparse. The real M2P makes completely
> different assumptions in density and size, or not? 

Well, there's two use-cases for the local m2p idea.  One is for granted
pages, which are going to be all over the place, but the other is for
hardware mfns, which are more likely to be densely clustered.

A radix tree for grant mfns is likely to be pretty inefficient - but the
worst case is one radix page per mfn, which isn't too bad, since we're
not expecting that many granted mfns to be around.  But perhaps a hash
or rbtree would be a better fit.  Or we could insist on making the mfns
contiguous.

> Well, maybe one could shadow (cow, actually) just that? Saves the index.
> Likely a dumb idea. :)

I guess Xen won't let us map over the m2p, but maybe we could alias it. 
But that's going to waste lots of address space in a 32b dom0.

> What numbers of grant refs do we run? I remember times when the tables
> were bumped up massively, for stuff like pvfb, a long time ago. I guess
> the rest remained rather conservative.

We only really need to worry about mfns which are actually going to be
mapped into userspace and guped.  We could even propose a (new?) mmu
notifier to prepare for gup so that it can be deferred as late as possible.

> The shadow idea sounds pretty cool, mainly because the vaddr spaces are
> often contiguous. At least for userland pages.
>
> Thing which would bother me is that settling on a single shadow page
> already limits map entries to sizeof(pte), so ideas like additionally
> mapping to grefs/handles/flag already go out of the window. Or grow
> bigger leaf, but on average that's also more space overhead.

At least we can always fit a pointer into sizeof(pte_t), so there's some
scope for having more storage if needed.  But I don't see how it can
help for gup...

    J

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2010-11-18  2:29 UTC | newest]

Thread overview: 18+ messages
     [not found] <20101116215621.59FC2CF782@homiemail-mx7.g.dreamhost.com>
2010-11-17 16:36 ` blktap: Sync with XCP, dropping zero-copy Andres Lagar-Cavilla
2010-11-17 17:52   ` Jeremy Fitzhardinge
2010-11-17 19:47     ` Andres Lagar-Cavilla
2010-11-17 23:42   ` Daniel Stodden
2010-11-12 23:31 Daniel Stodden
2010-11-13  0:50 ` Jeremy Fitzhardinge
2010-11-13  3:56   ` Daniel Stodden
     [not found]   ` <1289620544.11102.373.camel@agari.van.xensource.com>
2010-11-15 18:27     ` Jeremy Fitzhardinge
2010-11-16  9:13       ` Daniel Stodden
2010-11-16 17:56         ` Jeremy Fitzhardinge
2010-11-16 21:28           ` Daniel Stodden
2010-11-17 18:00             ` Jeremy Fitzhardinge
2010-11-17 20:21               ` Daniel Stodden
2010-11-17 21:02                 ` Jeremy Fitzhardinge
2010-11-17 21:57                   ` Daniel Stodden
2010-11-17 22:14                     ` Jeremy Fitzhardinge
     [not found]                       ` <1290035201.11102.1577.camel@agari.van.xensource.com>
     [not found]                         ` <4CE46A03.3010104@goop.org>
     [not found]                           ` <1290040898.11102.1709.camel@agari.van.xensource.com>
2010-11-18  2:29                             ` Jeremy Fitzhardinge
2010-11-17 23:32                     ` Daniel Stodden
