From: Jeremy Fitzhardinge <jeremy@goop.org>
To: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Xen <xen-devel@lists.xensource.com>
Subject: Re: blktap: Sync with XCP, dropping zero-copy.
Date: Fri, 12 Nov 2010 16:50:34 -0800
Message-ID: <4CDDE0DA.2070303@goop.org>
In-Reply-To: <1289604707-13378-1-git-send-email-daniel.stodden@citrix.com>

On 11/12/2010 03:31 PM, Daniel Stodden wrote:
> It's a fairly big change in how I/O buffers are managed. Prior to this
> series, we had zero-copy I/O down to userspace. Unfortunately, blktap2
> always had to jump through a couple of extra hoops to do so. The
> present state is that we dropped that, so all tapdev I/O is bounced
> to/from a bunch of normal pages. Essentially replacing the old VMA
> management with a couple of insert/zap VM calls.

Do you have any performance results comparing the two approaches?

> One issue was that the kernel can't cope with recursive
> I/O. Submitting an iovec on a tapdev, passing it to userspace and then
> reissuing the same vector via AIO apparently doesn't fit well with the
> lock protocol applied to those pages. This is the main reason why
> blktap had to deal a lot with grant refs. About as much as blkback
> already does before passing requests on. What happens there is that
> it's aliasing those granted pages under a different PFN, thereby in a
> separate page struct. Not pretty, but it worked, so it's not the
> reason why we chose to drop that at some point.
>
> The more prevalent problem was network storage, especially anything
> involving TCP. That includes VHD on both NFS and iSCSI. The problem
> with those is that retransmits (by the transport) and I/O op
> completion (on the application layer) are never synchronized.  With
> sufficiently bad timing and a bit of jitter on the network, it's
> perfectly common for the kernel to complete an AIO request with a late
> ack on the input queue just when the retransmission timer is about to fire
> underneath. The completion will unmap the granted frame, crashing any
> uncanceled retransmission on an empty page frame. There are different
> ways to deal with that. Page destructors might be one way, but as far
> as I heard they are not particularly popular upstream. Issuing the
> block I/O on dom0 memory is straightforward and avoids the hassle. One
> could argue that retransmits after DIO completion are still a
> potential privacy problem (I did), but it's not Xen's problem after
> all.

Surely this can be dealt with by replacing the mapped granted page with
a local copy if the refcount is elevated?  Then that can catch any stray
residual references while we can still return the granted page to its
owner.  And obviously, not reuse that pfn for grants until the refcount
is zero...
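
To make the suggestion concrete, here is a toy userspace model of the
bookkeeping it implies. It is only a sketch under stated assumptions:
Page, GrantMapper and every method name are hypothetical and do not
correspond to gntdev or any kernel interface, and the refcount is a
plain integer standing in for the real page reference count.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Toy page frame. refcount includes the reference held by the mapping."""
    data: bytes
    refcount: int = 1
    granted: bool = False


class GrantMapper:
    """Hypothetical model of pfn -> page bookkeeping (not a kernel API)."""

    def __init__(self):
        self.mapped = {}          # pfn -> Page currently backing that pfn
        self.quarantined = set()  # pfns that must not be reused for grants
        self.returned = []        # granted pages handed back to their owner

    def map_grant(self, pfn, data):
        if pfn in self.quarantined:
            raise RuntimeError("pfn still has stray references")
        self.mapped[pfn] = Page(data=data, refcount=1, granted=True)

    def get(self, pfn):
        # Someone else (e.g. a transport retransmit queue) takes a reference.
        self.mapped[pfn].refcount += 1

    def put(self, pfn):
        page = self.mapped[pfn]
        page.refcount -= 1
        if page.refcount == 0:
            # Last reference gone: the pfn may be used for grants again.
            del self.mapped[pfn]
            self.quarantined.discard(pfn)

    def unmap_grant(self, pfn):
        page = self.mapped[pfn]
        # The granted page always goes back to its owner right away.
        self.returned.append(page)
        if page.refcount > 1:
            # Elevated refcount: back the pfn with a local copy so stray
            # users never touch an empty frame, and hold the pfn back.
            self.mapped[pfn] = Page(data=bytes(page.data),
                                    refcount=page.refcount - 1)
            self.quarantined.add(pfn)
        else:
            del self.mapped[pfn]
```

In this model an unmap with an outstanding reference leaves a local,
non-granted copy behind the pfn, and map_grant() refuses that pfn until
the final put() drops the refcount to zero.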

> If zero-copy becomes more attractive again, the plan would be to
> rather use grantdev in userspace, such as a filter driver for tapdisk
> instead. Until then, there's presumably a notable difference in L2
> cache footprint. Then again, there's also a whole number of cycles not
> spent in redundant hypercalls now, to deal with the pseudophysical
> map.

Frankly, I think the idea of putting blkback+tapdisk entirely in
usermode is all upside with no (obvious) downsides.  It:

   1. avoids having to upstream anything
   2. avoids having to upstream anything
   3. avoids having to upstream anything

   4. gets us back zero-copy (if that's important)
   5. makes the IO path nice and straightforward
   6. seems to address all the other problems you mentioned

The only caveat is the stray unmapping problem, but I think gntdev can
be modified to deal with that pretty easily.

qemu has usermode blkback support already, and an actively improving
block-IO infrastructure, so one approach might be to consider putting
(parts of) tapdisk into qemu, which would make it pretty natural to
reuse it with non-Xen guests via virtio-block, emulated devices, etc.
But I'm not sold on that; having a standalone tapdisk w/ blkback makes
sense to me as well.

On the other hand, I don't think we're going to be able to get away with
putting netback in usermode, so we still need to deal with that - but I
think an all-copying version will be fine to get started with at least.


> There are also benefits or non-issues.
>
>  - This blktap is rather xen-independent. Certainly depends on the
>    common ring macros, but lacking grant stuff it compiles on bare
>    metal Linux with no CONFIG_XEN. Not consummated here, because
>    that's going to move the source tree out of drivers/xen. But I'd
>    like to post a new branch proposing to do so.
>
>  - Blktap's size in dom0 didn't really change. Frames (now pages) were
>    always pooled. We used to balloon memory to claim space for
>    redundant grant mappings. Now we reserve, by default, the same
>    volume in normal memory.
>
>  - The previous code would run all I/O on a single pool. Typically
>    two rings worth of requests. Sufficient for a whole lot of systems,
>    especially with single storage backends, but not so nice when I/O
>    on a number of otherwise independent filers or volumes collides.
>
>    Pools are refcounted kobjects in sysfs. Toolstacks using the new
>    code can thereby choose to eliminate bottlenecks by grouping taps
>    on different buffer pools. Pools can also be resized, to accommodate
>    greater queue depths. [Note that blkback still has the same issue,
>    so guests won't take advantage of that before that's resolved as
>    well.]
>
>  - XCP started to make some use of stacking tapdevs. Think pointing
>    the image chain of a bunch of "leaf" taps to a shared parent
>    node. That works fairly well, but definitely takes independent
>    resource pools to avoid deadlock by parent starvation then.
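
The parent-starvation point can be illustrated with a toy model. This is
a sketch under assumptions, not blktap code: BufferPool and
run_stacked_io are hypothetical names, and a "request" is reduced to
holding one buffer from a pool.

```python
class BufferPool:
    """Toy fixed-size buffer pool (hypothetical, not the blktap sysfs pool)."""

    def __init__(self, size):
        self.free = size

    def try_alloc(self):
        if self.free == 0:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


def run_stacked_io(leaf_pool, parent_pool, n_requests):
    """Return how many leaf requests complete.

    Each leaf request holds a leaf buffer while it waits for a request
    on the shared parent node, which needs a buffer of its own.
    """
    held = 0
    for _ in range(n_requests):
        if leaf_pool.try_alloc():
            held += 1

    completed = 0
    for _ in range(held):
        if not parent_pool.try_alloc():
            # Parent is starved, so no leaf can finish and free a buffer.
            break
        parent_pool.release()
        leaf_pool.release()
        completed += 1
    return completed


shared = BufferPool(4)
print(run_stacked_io(shared, shared, 4))                # prints 0
print(run_stacked_io(BufferPool(4), BufferPool(4), 4))  # prints 4
```

With one shared pool the leaves take every buffer and the parent can
never allocate, so nothing completes. With a separate pool for the
parent, every leaf request drains.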
>
> Please pull upstream/xen/dom0/backend/blktap2 from
> git://xenbits.xensource.com/people/dstodden/linux.git

OK, I've pulled it, but I haven't had a chance to test it yet.

Thanks,
    J
