V4V

xen-devel.lists.xenproject.org archive mirror

* V4V
@ 2012-05-24 17:23 Jean Guyader
  2012-05-25  9:48 ` V4V Stefano Stabellini
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Jean Guyader @ 2012-05-24 17:23 UTC (permalink / raw)
  To: xen-devel; +Cc: Ross Philipson, Jean Guyader, James McKenzie

As I'm going through the code to clean up XenClient's inter-VM communication
(V4V), I thought it would be a good idea to start a thread to talk about the
fundamental differences between V4V and libvchan. I believe the two systems
are not clones of each other and that they serve different purposes.

Disclaimer: I'm not an expert in libvchan; most of the assertions I'm making
about libvchan come from my reading of the code. If some of the facts
are wrong, it's only due to my ignorance of the subject.

1. Why V4V?

About the time when we started XenClient (3 years ago) we were looking for a
lightweight inter-VM communication scheme. We started working on a system
based on netchannel2, at the time called V2V (VM to VM). The system
was very similar to what libvchan is today, and we started to hit some
roadblocks:

    - The setup relied on a broker in dom0 to prepare the xenstore node
      permissions when a guest wanted to create a new connection. The code
      to do this setup was a single point of failure. If the
      broker was down you couldn't create any more connections.
    - Symmetric communications were a nightmare. Take the case where A is a
      backend for B and B is a backend for A. If one of the domains crashed,
      the other one couldn't be destroyed because it had some pages mapped
      from the dead domain. This specific issue is probably fixed today.

Some of the downsides to using the shared memory grant method:
    - This method imposes an implicit ordering on domain destruction.
      When this ordering is not honored the grantor domain cannot shut down
      while the grantee still holds references. In the extreme case where
      the grantee domain hangs or crashes without releasing its granted
      pages, both domains can end up hung and unstoppable (the DEADBEEF
      issue).
    - You can't trust any ring structures because the entire set of pages
      that are granted is available to be written by either guest.
    - The PV connect/disconnect state-machine is poorly implemented.
      There's no trivial mechanism to synchronize disconnecting/reconnecting
      and dom0 must also allow the two domains to see parts of xenstore
      belonging to the other domain in the process.
    - Using the grant-ref model and having to map grant pages on each
      transfer causes updates to V->P memory mappings and thus leads to
      TLB misses and flushes (TLB flushes being expensive operations).

After a lot of time spent trying to make the V2V solution work the way we
wanted, we decided that we should look at a new design that wouldn't have
the issues mentioned above. At this point we started to work on V4V (V2V
version 2).

2. What is V4V?

One of the fundamental problems with V2V was that it didn't implement a
connection mechanism. If one end of the ring disappeared you had to hope
that you would receive the xenstore watch that would sort everything out.

V4V is an inter-domain communication mechanism that supports 1-to-many
connections. All communication from one domain (even dom0) to another domain
goes through Xen, and Xen forwards the packets with memory copies.

Here are some of the reasons why we think V4V is a good solution for
inter-domain communication.

Reasons why the V4V method is quite good even though it does memory copies:
    - Memory transfer speeds through the FSB in modern chipsets are quite
      fast. Speeds on the order of 10-12 Gb/s (over say 2 DRAM channels)
      can be realized.
    - Transfers on a single clock cycle using SSE(2)(3) instructions allow
      moving up to 128 bits at a time.
    - Locality of reference arguments with respect to processor caches
      imply even more speed-up due to likely cache hits (this may in fact
      make the most difference in mem copy speed).
    - V4V provides much better domain isolation since one domain's memory
      is never seen by another and the hypervisor (a trusted component)
      brokers all interactions. This also implies that the structure of
      the ring can be trusted.
    - Use of V4V obviates the event channel depletion issue since
      it doesn't consume individual channel bits when using VIRQs.
    - The projected overhead of VMEXITs (that was originally cited as a
      major limiting factor) did not manifest itself as an issue. In
      fact, it can be seen that in the worst case V4V is not causing
      many more VMEXITs than the shared memory grant method and in
      general is at parity with the existing method.
    - The implementation specifics of V4V make its use in both Windows
      and Unix/Linux type OSes very simple and natural (ReadFile/WriteFile
      and sockets respectively). In addition, V4V uses TCP/IP protocol
      semantics, which are widely understood, and it does not introduce an
      entirely new protocol set that must be learned.
    - V4V comes with a userspace library that can be used to interpose
      the standard userspace socket layer. That means that *any* network
      program can be "V4Ved" *without* being recompiled.
      In fact we tried it on many programs such as ssh, midori,
      dbus (TCP-IP), X11.
      This is possible because the underlying V4V protocol implements
      a V4V semantic and supports connections. Such a feature would be
      really, really hard to implement on top of the current
      libvchan implementation.

3. V4V compared to libvchan

I've done some benchmarks on V4V and libvchan and the results were
pretty close between the two if you use the same buffer size in both
cases.

In conclusion, this is not an attempt to demonstrate that V4V is superior to
libvchan. Rather it is an attempt to illustrate that they can coexist in the
Xen ecosystem, helping to solve different sets of problems.

Thanks,
Jean

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: V4V
  2012-05-24 17:23 V4V Jean Guyader
@ 2012-05-25  9:48 ` Stefano Stabellini
  2012-05-25 10:11   ` V4V Jean Guyader
  2012-05-25 10:19 ` V4V Pasi Kärkkäinen
  2012-05-29 22:22 ` V4V Daniel De Graaf
  2 siblings, 1 reply; 10+ messages in thread
From: Stefano Stabellini @ 2012-05-25  9:48 UTC (permalink / raw)
  To: Jean Guyader; +Cc: James McKenzie, Ross Philipson, xen-devel@lists.xen.org

On Thu, 24 May 2012, Jean Guyader wrote:
> Some of the downsides to using the shared memory grant method:
>     - This method imposes an implicit ordering on domain destruction.
>       When this ordering is not honored the grantor domain cannot shutdown
>       while the grantee still holds references. In the extreme case where
>       the grantee domain hangs or crashes without releasing it granted   
>       pages, both domains can end up hung and unstoppable (the DEADBEEF  
>       issue).                                                            

Is it still true? This looks like a serious issue.


>     - You can't trust any ring structures because the entire set of pages
>       that are granted are available to be written by the either guest.  
>     - The PV connect/disconnect state-machine is poorly implemented.     
>       There's no trivial mechanism to synchronize disconnecting/reconnecting
>       and dom0 must also allow the two domains to see parts of xenstore    
>       belonging to the other domain in the process.                        

We are starting to see this problem while trying to set up driver domains
with libxl.


>     - Using the grant-ref model and having to map grant pages on each      
>       transfer cause updates to V->P memory mappings and thus leads to     
>       TLB misses and flushes (TLB flushes being expensive operations).     

[snip]

> I've done some benchmarks on V4V and libchan and the results were
> pretty close between the the two if you use the same buffer size in both cases.

It is strange that you cannot see any performance advantage using V4V. I
was expecting quite a difference, especially on new NUMA machines.


* Re: V4V
  2012-05-25  9:48 ` V4V Stefano Stabellini
@ 2012-05-25 10:11   ` Jean Guyader
  2012-05-25 10:16     ` V4V Stefano Stabellini
  0 siblings, 1 reply; 10+ messages in thread
From: Jean Guyader @ 2012-05-25 10:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: James McKenzie, Ross Philipson, xen-devel@lists.xen.org

On 25 May 2012 10:48, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
>
> On Thu, 24 May 2012, Jean Guyader wrote:
> > Some of the downsides to using the shared memory grant method:
> >     - This method imposes an implicit ordering on domain destruction.
> >       When this ordering is not honored the grantor domain cannot shutdown
> >       while the grantee still holds references. In the extreme case where
> >       the grantee domain hangs or crashes without releasing it granted
> >       pages, both domains can end up hung and unstoppable (the DEADBEEF
> >       issue).
>
> Is it still true? This looks like a serious issue.
>

I tried to repro this issue the other day with libvchan but I
couldn't, so it's probably fixed.
I suspect it has something to do with grant refs and I don't think
libvchan uses grant refs.

>
> >     - You can't trust any ring structures because the entire set of pages
> >       that are granted are available to be written by the either guest.
> >     - The PV connect/disconnect state-machine is poorly implemented.
> >       There's no trivial mechanism to synchronize disconnecting/reconnecting
> >       and dom0 must also allow the two domains to see parts of xenstore
> >       belonging to the other domain in the process.
>
> We are starting to see this problem, trying to setup driver domains with
> libxl.
>
>
> >     - Using the grant-ref model and having to map grant pages on each
> >       transfer cause updates to V->P memory mappings and thus leads to
> >       TLB misses and flushes (TLB flushes being expensive operations).
>
> [snip]
>
> > I've done some benchmarks on V4V and libchan and the results were
> > pretty close between the the two if you use the same buffer size in both cases.
>
> It is strange that you cannot see any performance advantages using v4v. I
> was expecting quite a difference, especially on new numa machines.

The numbers for both systems were in the 50-60 MB/s range from domU to domU.
That was out-of-the-box testing on a desktop-type machine; I didn't try
any tricks or adjustments to make it go faster.

Jean


* Re: V4V
  2012-05-25 10:11   ` V4V Jean Guyader
@ 2012-05-25 10:16     ` Stefano Stabellini
  0 siblings, 0 replies; 10+ messages in thread
From: Stefano Stabellini @ 2012-05-25 10:16 UTC (permalink / raw)
  To: Jean Guyader
  Cc: xen-devel@lists.xen.org, James McKenzie, Ross Philipson,
	Stefano Stabellini

On Fri, 25 May 2012, Jean Guyader wrote:
> On 25 May 2012 10:48, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> >
> > On Thu, 24 May 2012, Jean Guyader wrote:
> > > Some of the downsides to using the shared memory grant method:
> > >     - This method imposes an implicit ordering on domain destruction.
> > >       When this ordering is not honored the grantor domain cannot shutdown
> > >       while the grantee still holds references. In the extreme case where
> > >       the grantee domain hangs or crashes without releasing it granted
> > >       pages, both domains can end up hung and unstoppable (the DEADBEEF
> > >       issue).
> >
> > Is it still true? This looks like a serious issue.
> >
> 
> I have tried to repro this issue the order day with libvchan but I
> couldn't so it's probably fixed.
> I suspect it has something to do with grant-ref and I don't think
> libvchan uses grant refs.
> 
> >
> > >     - You can't trust any ring structures because the entire set of pages
> > >       that are granted are available to be written by the either guest.
> > >     - The PV connect/disconnect state-machine is poorly implemented.
> > >       There's no trivial mechanism to synchronize disconnecting/reconnecting
> > >       and dom0 must also allow the two domains to see parts of xenstore
> > >       belonging to the other domain in the process.
> >
> > We are starting to see this problem, trying to setup driver domains with
> > libxl.
> >
> >
> > >     - Using the grant-ref model and having to map grant pages on each
> > >       transfer cause updates to V->P memory mappings and thus leads to
> > >       TLB misses and flushes (TLB flushes being expensive operations).
> >
> > [snip]
> >
> > > I've done some benchmarks on V4V and libchan and the results were
> > > pretty close between the the two if you use the same buffer size in both cases.
> >
> > It is strange that you cannot see any performance advantages using v4v. I
> > was expecting quite a difference, especially on new numa machines.
> 
> The numbers for both system were in the 50MB/s 60MB/s ranges from domU to domU.
> That was on a out of the box testing on a desktop type machine I
> didn't look at making
> any kind of tricks or adjustment to make it go faster.

I suspect that the same test on a machine with 32 CPUs and 4-8 NUMA
nodes would have given different results.


* Re: V4V
  2012-05-24 17:23 V4V Jean Guyader
  2012-05-25  9:48 ` V4V Stefano Stabellini
@ 2012-05-25 10:19 ` Pasi Kärkkäinen
  2012-05-29 22:22 ` V4V Daniel De Graaf
  2 siblings, 0 replies; 10+ messages in thread
From: Pasi Kärkkäinen @ 2012-05-25 10:19 UTC (permalink / raw)
  To: Jean Guyader; +Cc: James McKenzie, Ross Philipson, xen-devel

On Thu, May 24, 2012 at 06:23:19PM +0100, Jean Guyader wrote:
> 
>    Reasons why the V4V method is quite good even though it does memory
>    copies:
>        - Memory transfer speeds through the FSB in modern chipsets is quite
>          fast. Speeds on the order of 10-12 Gb/s (over say 2 DRAM channels)
>          can be realized.
>

Hello,

I just wanted to ask for clarification..
Did you mean 10-12 Gbit/sec or 10-12 GB (GigaBytes) per second? 
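
For scale, the two readings differ by a factor of eight. A minimal Python
sketch of the conversion, assuming decimal (SI) units; the function name is
only illustrative:

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


# The two ends of the range quoted above, read as Gbit/s.
for rate in (10, 12):
    print(f"{rate} Gbit/s = {gbit_to_gbyte_per_s(rate):.2f} GB/s")
```

So the quoted range is either 1.25-1.5 GB/s or 10-12 GB/s depending on
which unit was meant.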

Thanks,

-- Pasi


* Re: V4V
  2012-05-24 17:23 V4V Jean Guyader
  2012-05-25  9:48 ` V4V Stefano Stabellini
  2012-05-25 10:19 ` V4V Pasi Kärkkäinen
@ 2012-05-29 22:22 ` Daniel De Graaf
  2012-05-30 11:41   ` V4V Stefano Stabellini
  2 siblings, 1 reply; 10+ messages in thread
From: Daniel De Graaf @ 2012-05-29 22:22 UTC (permalink / raw)
  To: Jean Guyader
  Cc: Stefano Stabellini, James McKenzie, Ross Philipson, xen-devel

On 05/24/2012 01:23 PM, Jean Guyader wrote:
> As I'm going through the code to clean-up XenClient's inter VM
> communication
> (V4V), I thought it would be a good idea to start a thread to talk about
> the
> fundamental differences between V4V and libvchan. I believe the two system
> are
> not clones of eachother and they serve different
> purposes.
> 
> 
> Disclaimer: I'm not an expert in libvchan; most of the assertion I'm doing
> about libvchan it coming from my reading of the code. If some of the facts
> are wrong it's only due to my ignorance about the subject.
> 

I'll try to fill in some of these points with my understanding of libvchan;
I have correspondingly less knowledge of V4V, so I may be wrong in assumptions
there.

> 1. Why V4V?
> 
> About the time when we started XenClient (3 year ago) we were looking for a
> lightweight inter VM communication scheme. We started working on a system
> based on netchannel2 at the time called V2V (VM to VM). The system
> was very similar to what libvchan is today, and we started to hit some
> roadblocks:
> 
>     - The setup relied on a broker in dom0 to prepare the xenstore node
>       permissions when a guest wanted to create a new connection. The code
>       to do this setup was a single point of failure. If the
>       broker was down you could create any more connections.

libvchan avoids this by allowing the application to determine the xenstore
path and adjust permissions itself; the path /local/domain/N/data is
suitable for a libvchan server in domain N to create the nodes in question.

>     - Symmetric communications were a nightmare. Take the case where A is a
>       backend for B and B is a backend for A. If one of the domain crash the
>       other one couldn't be destroyed because it has some paged mapped from
>       the dead domain. This specific issue is probably fixed today.

This is mostly taken care of by improvements in the hypervisor's handling of
grant mappings. If one domain holds grant mappings open, the domain whose
grants are held can't be fully destroyed, but if both domains are being
destroyed then cycles of grant mappings won't stop them from going away.
 
> Some of the downsides to using the shared memory grant method:
>     - This method imposes an implicit ordering on domain destruction.
>       When this ordering is not honored the grantor domain cannot shutdown
>       while the grantee still holds references. In the extreme case where
>       the grantee domain hangs or crashes without releasing it granted
>       pages, both domains can end up hung and unstoppable (the DEADBEEF
>       issue).

This is fixed on current hypervisors.

>     - You can't trust any ring structures because the entire set of pages
>       that are granted are available to be written by the either guest.

This is not a problem: the rings are only used to communicate between the
guests, so the worst that a guest can do is corrupt the data that it sends or
cause spurious events. Note that libvchan does copy some important state out
of the shared page (ring sizes) once at startup because unexpected changes to
these values could cause problems.
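
The idea of copying state out of the shared page once can be sketched as
follows. This is only an illustration in Python, not libvchan's actual code:
SharedPage, RingReader, the field names and the 4096-byte limit are all
hypothetical, and a plain object stands in for the granted page.

```python
class SharedPage:
    """Stand-in for a page that both guests can write (hypothetical layout)."""

    def __init__(self, ring_size: int):
        self.ring_size = ring_size        # the peer may change this at any time
        self.prod = 0                     # producer index, written by the peer
        self.data = bytearray(ring_size)  # ring contents


class RingReader:
    """Reader that trusts only its private snapshot of the ring size."""

    def __init__(self, page: SharedPage, max_size: int = 4096):
        size = page.ring_size             # read the shared value exactly once
        if size <= 0 or size > max_size or size > len(page.data):
            raise ValueError("implausible ring size in shared page")
        self._page = page
        self._size = size                 # later changes by the peer are ignored
        self._cons = 0

    def available(self) -> int:
        # Clamp the untrusted producer index against the private size.
        return min(max(self._page.prod - self._cons, 0), self._size)

    def read(self, n: int) -> bytes:
        n = min(n, self.available())
        out = bytes(self._page.data[(self._cons + i) % self._size] for i in range(n))
        self._cons += n
        return out
```

With this shape, a peer that rewrites ring_size after setup can still
corrupt the data it sends, but it cannot push the reader's indexing outside
the buffer.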

>     - The PV connect/disconnect state-machine is poorly implemented.
>       There's no trivial mechanism to synchronize disconnecting/reconnecting
>       and dom0 must also allow the two domains to see parts of xenstore
>       belonging to the other domain in the process.

No interaction from dom0 is required to allow two domUs to communicate using
xenstore (assuming the standard xenstored; more restrictive xenstored
daemons may add such restrictions, intended to be used in conjunction with XSM
policies preventing direct communication via event channels/grants). The
connection state machine is weak; an external mechanism (perhaps the standard
xenbus "state" entry) could be used to coordinate this better in the user of
libvchan.

>     - Using the grant-ref model and having to map grant pages on each
>       transfer cause updates to V->P memory mappings and thus leads to
>       TLB misses and flushes (TLB flushes being expensive operations).

This mapping only happens once at the open of the channel, so this cost becomes
unimportant for a long-running channel. The cost is far higher for HVM domains
that use PCI devices since the grant mapping causes an IOMMU flush.

> 
> After a lot time spent trying to make the V2V solution work the way we
> wanted,  we decided that we should look at a new design that wouldn't have
> the issues mentioned above. At this point we started to work on V4V (V2V
> version 2).
> 
> 2. What is V4V?
> 
> One of the fundamental problem about V2V was that it didn't implement a
> connection mechanism. If one end of the ring disappeared you had to hope
> that you would received the xenstore watch that will sort everything out.
> 
> V4V is a inter-domain communication that supports 1 to many connections.
> All the communications from a domain (even dom0) to another domain goes
> through Xen and Xen forward the packet with a memory copies.
> 
> Here are some of the reasons why we think v4v is a good solution for
> inter-domain communication.
> 
> Reasons why the V4V method is quite good even though it does memory copies:
>     - Memory transfer speeds through the FSB in modern chipsets is quite
>       fast. Speeds on the order of 10-12 Gb/s (over say 2 DRAM channels)
>       can be realized.
>     - Transfers on a single clock cycle using SSE(2)(3) instructions allow
>       moving up to 128 bits at a time.
>     - Locality of reference arguments with respect to processor caches
>       imply even more speed-up due to likely cache hits (this may in fact
>       make the most difference in mem copy speed).

As I understand it, the same copies happen in libvchan: one copy from the
source to a buffer (hypervisor buffer in V4V, shared pages in libvchan) and
one copy out of the buffer to the destination. Both of these copies should
be able to take advantage of processor caches assuming reasonable locality
of data use and send/recv calls.
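
The two-copy path described above, sketched in Python. The Ring class is a
hypothetical stand-in for either buffer (the hypervisor-side buffer in V4V or
the shared pages in libvchan); it only illustrates that the payload is copied
once in and once out, and it is not taken from either code base.

```python
class Ring:
    """Fixed-size byte ring that counts the copies touching the payload."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.size = size
        self.prod = 0    # total bytes written so far
        self.cons = 0    # total bytes read so far
        self.copies = 0  # number of non-empty copy operations

    def send(self, src: bytes) -> int:
        """Copy 1: source buffer -> ring. Returns the number of bytes accepted."""
        free = self.size - (self.prod - self.cons)
        n = min(len(src), free)
        for i in range(n):
            self.buf[(self.prod + i) % self.size] = src[i]
        self.prod += n
        if n:
            self.copies += 1
        return n

    def recv(self, n: int) -> bytes:
        """Copy 2: ring -> destination buffer."""
        n = min(n, self.prod - self.cons)
        dst = bytes(self.buf[(self.cons + i) % self.size] for i in range(n))
        self.cons += n
        if n:
            self.copies += 1
        return dst
```

In this model the only difference between the two schemes is who owns the
ring and who performs the first copy, which matches the point that the cache
behaviour should be similar.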

>     - V4V provides much better domain isolation since one domain's memory
>       is never seen by another and the hypervisor (a trusted component)
>       brokers all interactions. This also implies that the structure of
>       the ring can be trusted.

This is important for multicast, which libvchan does not support; it is not
as important for unicast.

>     - Use of V4V obviates the event channel depletion issue since
>       it doesn't consume individual channel bits when using VIRQs.

This is a significant advantage of V4V if channels are used in place of
local network communications as opposed to device drivers or other
OS-level components.

>     - The projected overhead of VMEXITs (that was originally cited as a
>       majorly limiting factor) did not manifest itself as an issue. In
>       fact, it can be seen that in the worst case V4V is not causing
>       many more VMEXITs than the shared memory grant method and in
>       general is at parity with the existing method.
>     - The implementation specifics of V4V make its use in both a Windows
>       and a Unix/Linux type OS's very simple and natural (ReadFile/WriteFile
>       and sockets respectively). In addition, V4V uses TCP/IP protocol
>       semantics which are widely understood and it does not introduce an
>       entire new protocol set that must be learned.
>     - V4V comes with a userspace library that can be use to interpose
>       the standard userspace socket layer. That mean that *any* network
>       program can be "V4Ved" *without* behing recompiled.
>       In fact we tried it on many program suchs as ssh, midori,
>       dbus (TCP-IP), X11.
>       This is possible because the underlying V4V protocol implement
>       a V4V sementic and supports connection. Suchs feature will be
>       really really hard to implement over the top of the current
>       libvchan implementation.
> 
> 3. V4V compared to libvchan
> 
> I've done some benchmarks on V4V and libchan and the results were
> pretty close between the the two if you use the same buffer size in both
> cases.
> 

[followup from Stefano's replies]
I would not expect much difference even on a NUMA system, assuming each domU
is fully contained within a NUMA node: one of the two copies made by libvchan
will be confined to a single node, while the other copy will be cross-node.
With domUs not properly confined to nodes, the hypervisor might be able to do
better in cases where libvchan would have required two inter-node copies.
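
As a toy model of that argument (an assumption-laden sketch, not a
measurement): list the NUMA node of each memory location the data visits and
count the hops that change node. The function name and node numbers are
illustrative only.

```python
def cross_node_copies(nodes):
    """Count copies that cross a NUMA node boundary.

    `nodes` lists the NUMA node of each memory location the data visits,
    in order: [source, intermediate buffer, destination].
    """
    return sum(1 for a, b in zip(nodes, nodes[1:]) if a != b)


# Each domU confined to one node, buffer in the sender's memory:
# one local copy and one cross-node copy.
print(cross_node_copies([0, 0, 1]))

# Unconfined domUs: source, buffer and destination on three different nodes,
# so both copies cross nodes.
print(cross_node_copies([0, 1, 2]))
```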

> 
> In conclusion, this is not an attempt to demonstrate that V4V is superior to
> libvchan. Rather it is an attempt to illustrate that they can coexist in the
> Xen ecosystem, helping to solve different sets of problems.
> 
> Thanks,
> Jean
> 

-- 
Daniel De Graaf
National Security Agency


* Re: V4V
  2012-05-29 22:22 ` V4V Daniel De Graaf
@ 2012-05-30 11:41   ` Stefano Stabellini
  2012-05-30 14:19     ` V4V Daniel De Graaf
  0 siblings, 1 reply; 10+ messages in thread
From: Stefano Stabellini @ 2012-05-30 11:41 UTC (permalink / raw)
  To: Daniel De Graaf
  Cc: James McKenzie, Stefano Stabellini, Ross Philipson, Jean Guyader,
	xen-devel@lists.xen.org

On Tue, 29 May 2012, Daniel De Graaf wrote:
> On 05/24/2012 01:23 PM, Jean Guyader wrote:
> > As I'm going through the code to clean-up XenClient's inter VM
> > communication
> > (V4V), I thought it would be a good idea to start a thread to talk about
> > the
> > fundamental differences between V4V and libvchan. I believe the two system
> > are
> > not clones of eachother and they serve different
> > purposes.
> > 
> > 
> > Disclaimer: I'm not an expert in libvchan; most of the assertion I'm doing
> > about libvchan it coming from my reading of the code. If some of the facts
> > are wrong it's only due to my ignorance about the subject.
> > 
> 
> I'll try to fill in some of these points with my understanding of libvchan;
> I have correspondingly less knowledge of V4V, so I may be wrong in assumptions
> there.
> 
> > 1. Why V4V?
> > 
> > About the time when we started XenClient (3 year ago) we were looking for a
> > lightweight inter VM communication scheme. We started working on a system
> > based on netchannel2 at the time called V2V (VM to VM). The system
> > was very similar to what libvchan is today, and we started to hit some
> > roadblocks:
> > 
> >     - The setup relied on a broker in dom0 to prepare the xenstore node
> >       permissions when a guest wanted to create a new connection. The code
> >       to do this setup was a single point of failure. If the
> >       broker was down you could create any more connections.
> 
> libvchan avoids this by allowing the application to determine the xenstore
> path and adjusts permissions itself; the path /local/domain/N/data is
> suitable for a libvchan server in domain N to create the nodes in question.

Let's say that the frontend lives in domain A and that the backend lives
in domain N.
Usually the frontend has a node:

/local/domain/A/device/<devicename>/<number>/backend

that points to the backend, in this case:

/local/domain/N/backend/<devicename>/A/<number>

The backend is not allowed to write to the frontend path, so it cannot write
its own path in the backend node. Clearly the frontend doesn't know that
information so it cannot fill it in. So the toolstack (typically in
dom0) helps with the initial setup by writing, under the frontend path,
where the backend is.
How does libvchan solve this issue?


> >     - Symmetric communications were a nightmare. Take the case where A is a
> >       backend for B and B is a backend for A. If one of the domain crash the
> >       other one couldn't be destroyed because it has some paged mapped from
> >       the dead domain. This specific issue is probably fixed today.
> 
> This is mostly taken care of by improvements in the hypervisor's handling of
> grant mappings. If one domain holds grant mappings open, the domain whose
> grants are held can't be fully destroyed, but if both domains are being
> destroyed then cycles of grant mappings won't stop them from going away.

However under normal circumstances the domain holding the mappings (which
I guess would be the domain running the backend, correct?) would
recognize that the other domain is gone and therefore unmap the grants
and close the connection, right?
I hope that if the frontend crashes and dies, it doesn't necessarily
become a zombie because the backend holds some mappings.


> >     - The PV connect/disconnect state-machine is poorly implemented.
> >       There's no trivial mechanism to synchronize disconnecting/reconnecting
> >       and dom0 must also allow the two domains to see parts of xenstore
> >       belonging to the other domain in the process.
> 
> No interaction from dom0 is required to allow two domUs to communicate using
> xenstore (assuming the standard xenstored; more restrictive xenstored
> daemons may add such restrictions, intended to be used in conjunction with XSM
> policies preventing direct communication via event channels/grants). The
> connection state machine is weak; an external mechanism (perhaps the standard
> xenbus "state" entry) could be used to coordinate this better in the user of
> libvchan.

I am curious to know what the "connection state machine" is in libvchan.


> >     - Using the grant-ref model and having to map grant pages on each
> >       transfer cause updates to V->P memory mappings and thus leads to
> >       TLB misses and flushes (TLB flushes being expensive operations).
> 
> This mapping only happens once at the open of the channel, so this cost becomes
> unimportant for a long-running channel. The cost is far higher for HVM domains
> that use PCI devices since the grant mapping causes an IOMMU flush.

So I take it that you are not passing grant refs through the connection,
unlike blkfront and blkback.



> [followup from Stefano's replies]
> I would not expect much difference even on a NUMA system, assuming each domU
> is fully contained within a NUMA node: one of the two copies made by libvchan
> will be confined to a single node, while the other copy will be cross-node.
> With domUs not properly confined to nodes, the hypervisor might be able to do
> better in cases where libvchan would have required two inter-node copies.

Right, I didn't realize that libvchan uses copies rather than grant refs
to transfer the actual data.


* Re: V4V
  2012-05-30 11:41   ` V4V Stefano Stabellini
@ 2012-05-30 14:19     ` Daniel De Graaf
  2012-05-31 17:20       ` V4V Stefano Stabellini
  0 siblings, 1 reply; 10+ messages in thread
From: Daniel De Graaf @ 2012-05-30 14:19 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: James McKenzie, Ross Philipson, Jean Guyader,
	xen-devel@lists.xen.org

On 05/30/2012 07:41 AM, Stefano Stabellini wrote:
> On Tue, 29 May 2012, Daniel De Graaf wrote:
>> On 05/24/2012 01:23 PM, Jean Guyader wrote:
>>> As I'm going through the code to clean-up XenClient's inter VM
>>> communication
>>> (V4V), I thought it would be a good idea to start a thread to talk about
>>> the
>>> fundamental differences between V4V and libvchan. I believe the two system
>>> are
>>> not clones of eachother and they serve different
>>> purposes.
>>>
>>>
>>> Disclaimer: I'm not an expert in libvchan; most of the assertion I'm doing
>>> about libvchan it coming from my reading of the code. If some of the facts
>>> are wrong it's only due to my ignorance about the subject.
>>>
>>
>> I'll try to fill in some of these points with my understanding of libvchan;
>> I have correspondingly less knowledge of V4V, so I may be wrong in assumptions
>> there.
>>
>>> 1. Why V4V?
>>>
>>> About the time when we started XenClient (3 years ago) we were looking for a
>>> lightweight inter-VM communication scheme. We started working on a system
>>> based on netchannel2, at the time called V2V (VM to VM). The system
>>> was very similar to what libvchan is today, and we started to hit some
>>> roadblocks:
>>>
>>>     - The setup relied on a broker in dom0 to prepare the xenstore node
>>>       permissions when a guest wanted to create a new connection. The code
>>>       to do this setup was a single point of failure. If the
>>>       broker was down you couldn't create any more connections.
>>
>> libvchan avoids this by allowing the application to determine the xenstore
>> path and adjusts permissions itself; the path /local/domain/N/data is
>> suitable for a libvchan server in domain N to create the nodes in question.
> 
> Let's say that the frontend lives in domain A and that the backend lives
> in domain N.
> Usually the frontend has a node:
> 
> /local/domain/A/device/<devicename>/<number>/backend
> 
> that points to the backend, in this case:
> 
> /local/domain/N/backend/<devicename>/A/<number>
> 
> The backend is not allowed to write to the frontend path, so it cannot write
> its own path in the backend node. Clearly the frontend doesn't know that
> information so it cannot fill it up. So the toolstack (typically in
> dom0) helps with the initial setup writing down under the frontend path
> where is the backend.
> How does libvchan solve this issue?

Libvchan requires both endpoints to know the domain ID of the peer they are
communicating with - this could be communicated during domain build or through
a name service. The application then defines a path such as
"/local/domain/$server_domid/data/example-app/$client_domid" which is writable
by the server; the server creates nodes here that are readable by the client.
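
As a rough illustration of the path convention described above, here is a
minimal sketch that builds the per-client base path. The function name and the
"example-app" component are placeholders chosen for the example and are not
part of libvchan; the sketch only formats the string and does not talk to
xenstore.

```python
def vchan_xs_path(server_domid: int, client_domid: int,
                  app: str = "example-app") -> str:
    """Build a per-client base path under the server's own data directory.

    Hypothetical helper: the server can write below /local/domain/<N>/data,
    so it can create the nodes here and make them readable by the client.
    """
    if server_domid < 0 or client_domid < 0:
        raise ValueError("domain IDs must be non-negative")
    return f"/local/domain/{server_domid}/data/{app}/{client_domid}"


# Both endpoints must already know each other's domid to compute the same path.
print(vchan_xs_path(5, 7))
```

Because both sides derive the path from the two domain IDs, no third party has
to write a pointer into the other domain's subtree.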

>>>     - Symmetric communications were a nightmare. Take the case where A is a
>>>       backend for B and B is a backend for A. If one of the domains crashed, the
>>>       other one couldn't be destroyed because it had some pages mapped from
>>>       the dead domain. This specific issue is probably fixed today.
>>
>> This is mostly taken care of by improvements in the hypervisor's handling of
>> grant mappings. If one domain holds grant mappings open, the domain whose
>> grants are held can't be fully destroyed, but if both domains are being
>> destroyed then cycles of grant mappings won't stop them from going away.
> 
> However under normal circumstances the domain holding the mappings (that
> I guess it would be the domain running the backend, correct?) would
> recognize that the other domain is gone and therefore unmap the grants
> and close the connection, right?
> I hope that if the frontend crashes and dies, it doesn't necessarily
> become a zombie because the backend holds some mappings.

The mapping between frontend/backend and vchan client/server may be backwards:
the server must be initialized first and provides the pages for the client to
map. It looks like you are considering the frontend to be the server.

The vchan client domain maps grants provided by the server. If the server's
domain crashes, it may become a zombie until the client application notices the
crash. This will happen if the client uses the vchan and gets an error when
sending an event notification (in this case, a well-behaved client will close the
vchan). If the client does not often send data on the vchan, it can use a watch on
the server's xenstore node and close the vchan when the node is deleted.

A client that does not notice the server's destruction will leave a zombie domain.
A system administrator can resolve this by killing the client process.
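
The lifecycle described above can be modelled with a toy sketch. This is not
Xen or libvchan code; the class and method names are invented for
illustration, and the only rule modelled is that a dying domain whose granted
pages are still mapped by a peer stays a zombie until the peer unmaps them.

```python
class Domain:
    """Toy model of a domain that has granted pages to peers."""

    def __init__(self, domid: int):
        self.domid = domid
        self.dying = False
        self.foreign_mappings = 0  # mappings peers hold on our granted pages

    def destroy(self) -> None:
        self.dying = True

    @property
    def state(self) -> str:
        if not self.dying:
            return "running"
        return "zombie" if self.foreign_mappings else "destroyed"


class VchanClient:
    """Toy model of a client process that maps the server's grants."""

    def __init__(self, server: Domain):
        self.server = server
        self.mapped = False

    def connect(self) -> None:
        self.server.foreign_mappings += 1
        self.mapped = True

    def close(self) -> None:
        if self.mapped:
            self.server.foreign_mappings -= 1
            self.mapped = False

    def on_server_node_deleted(self) -> None:
        # Stand-in for a xenstore watch firing on the server's node.
        self.close()


server = Domain(5)
client = VchanClient(server)
client.connect()
server.destroy()
print(server.state)              # still held by the client's mapping
client.on_server_node_deleted()
print(server.state)              # mappings gone, domain can be freed
```

Killing the client process has the same effect as close() in this model,
since the mappings are released when the process goes away.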

> 
>>>     - The PV connect/disconnect state-machine is poorly implemented.
>>>       There's no trivial mechanism to synchronize disconnecting/reconnecting
>>>       and dom0 must also allow the two domains to see parts of xenstore
>>>       belonging to the other domain in the process.
>>
>> No interaction from dom0 is required to allow two domUs to communicate using
>> xenstore (assuming the standard xenstored; more restrictive xenstored
>> daemons may add such restrictions, intended to be used in conjunction with XSM
>> policies preventing direct communication via event channels/grants). The
>> connection state machine is weak; an external mechanism (perhaps the standard
>> xenbus "state" entry) could be used to coordinate this better in the user of
>> libvchan.
> 
> I am curious to know what the "connection state machine" is in libvchan.

There are two bytes in the shared page which are set to \1 when the vchan is
connected and changed to \0 when one side is closed (either by libvchan_close or
by Linux if the process exits or crashes). The server has the option of ignoring
the close and allowing the client to reconnect, which is useful if the client
application is to be restarted. Since the rings remain intact, no data is lost
across a restart (although a crashing client may lose data it has already pulled
off the ring).
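
A small model of those two bytes may make the reconnect case clearer. This is
a sketch under assumptions: the byte offsets, names, and the bytearray used as
a stand-in for ring contents are all hypothetical and do not reflect the real
shared page layout.

```python
CLOSED, CONNECTED = 0, 1
SERVER, CLIENT = 0, 1  # hypothetical indexes, not the real page offsets


class SharedPage:
    """Toy model of the liveness bytes plus some queued ring data."""

    def __init__(self):
        self.live = bytearray([CLOSED, CLOSED])
        self.ring = bytearray()  # stand-in for data sitting in a ring

    def connect(self, side: int) -> None:
        self.live[side] = CONNECTED

    def close(self, side: int) -> None:
        # Called by the endpoint itself, or on its behalf when the process exits.
        self.live[side] = CLOSED

    def is_open(self) -> bool:
        return (self.live[SERVER] == CONNECTED
                and self.live[CLIENT] == CONNECTED)


page = SharedPage()
page.connect(SERVER)
page.connect(CLIENT)
page.ring += b"pending"

page.close(CLIENT)        # client exits; the server chooses to ignore the close
print(page.is_open())     # the channel is not fully connected right now
page.connect(CLIENT)      # restarted client reconnects to the same page
print(bytes(page.ring))   # data queued before the restart is still there
```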

>>>     - Using the grant-ref model and having to map grant pages on each
>>>       transfer cause updates to V->P memory mappings and thus leads to
>>>       TLB misses and flushes (TLB flushes being expensive operations).
>>
>> This mapping only happens once at the open of the channel, so this cost becomes
>> unimportant for a long-running channel. The cost is far higher for HVM domains
>> that use PCI devices since the grant mapping causes an IOMMU flush.
> 
> So I take that you are not passing grant refs through the connection,
> unlike blkfront and blkback.
 
Not directly. All data is passed through the rings, which by default are sized at
1024 and 2048 bytes. Larger multi-page rings are supported (in powers of two), in
which case the initial shared page has a static list of grants provided by the
server which are all mapped by the client. Data transfer speeds are significantly
improved with larger rings, although this levels off when both ends are able to
avoid excessive context switches waiting for a ring to be filled or emptied.
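
To show why power-of-two ring sizes are convenient, here is a minimal
single-producer/single-consumer byte ring. It is a generic sketch, not the
libvchan data structure: the class and field names are made up, and Python's
unbounded integers stand in for the fixed-width indexes a C implementation
would use.

```python
class Ring:
    """Byte ring with free-running producer/consumer indexes."""

    def __init__(self, size: int = 1024):
        if size <= 0 or size & (size - 1):
            raise ValueError("ring size must be a power of two")
        self.buf = bytearray(size)
        self.mask = size - 1   # index & mask replaces a modulo
        self.prod = 0          # total bytes ever written
        self.cons = 0          # total bytes ever read

    def data_ready(self) -> int:
        return self.prod - self.cons

    def free_space(self) -> int:
        return len(self.buf) - self.data_ready()

    def write(self, data: bytes) -> int:
        """Copy in as much of data as fits; return the number of bytes written."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.prod + i) & self.mask] = data[i]
        self.prod += n
        return n

    def read(self, maxlen: int) -> bytes:
        """Copy out up to maxlen bytes that are ready."""
        n = min(maxlen, self.data_ready())
        out = bytes(self.buf[(self.cons + i) & self.mask] for i in range(n))
        self.cons += n
        return out


ring = Ring(8)
ring.write(b"abcdef")
print(ring.read(4))
ring.write(b"ghijkl")     # wraps around the end of the buffer
print(ring.read(8))
```

With a small ring the writer fills it quickly and has to wait for the reader,
which is where the context-switch cost mentioned above comes from.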

> 
>> [followup from Stefano's replies]
>> I would not expect much difference even on a NUMA system, assuming each domU
>> is fully contained within a NUMA node: one of the two copies made by libvchan
>> will be confined to a single node, while the other copy will be cross-node.
>> With domUs not properly confined to nodes, the hypervisor might be able to do
>> better in cases where libvchan would have required two inter-node copies.
> 
> Right, I didn't realize that libvchan uses copies rather than grant refs
> to transfer the actual data.
> 
> 


* Re: V4V
  2012-05-30 14:19     ` V4V Daniel De Graaf
@ 2012-05-31 17:20       ` Stefano Stabellini
  2012-05-31 18:18         ` V4V Daniel De Graaf
  0 siblings, 1 reply; 10+ messages in thread
From: Stefano Stabellini @ 2012-05-31 17:20 UTC (permalink / raw)
  To: Daniel De Graaf
  Cc: James McKenzie, xen-devel@lists.xen.org, Ross Philipson,
	Jean Guyader, Stefano Stabellini

On Wed, 30 May 2012, Daniel De Graaf wrote:
> On 05/30/2012 07:41 AM, Stefano Stabellini wrote:
> > On Tue, 29 May 2012, Daniel De Graaf wrote:
> >> On 05/24/2012 01:23 PM, Jean Guyader wrote:
> >>> As I'm going through the code to clean up XenClient's inter-VM
> >>> communication (V4V), I thought it would be a good idea to start a
> >>> thread to talk about the fundamental differences between V4V and
> >>> libvchan. I believe the two systems are not clones of each other and
> >>> they serve different purposes.
> >>>
> >>>
> >>> Disclaimer: I'm not an expert in libvchan; most of the assertions I'm making
> >>> about libvchan come from my reading of the code. If some of the facts
> >>> are wrong it's only due to my ignorance about the subject.
> >>>
> >>
> >> I'll try to fill in some of these points with my understanding of libvchan;
> >> I have correspondingly less knowledge of V4V, so I may be wrong in assumptions
> >> there.
> >>
> >>> 1. Why V4V?
> >>>
> >>> About the time when we started XenClient (3 years ago) we were looking for a
> >>> lightweight inter-VM communication scheme. We started working on a system
> >>> based on netchannel2, at the time called V2V (VM to VM). The system
> >>> was very similar to what libvchan is today, and we started to hit some
> >>> roadblocks:
> >>>
> >>>     - The setup relied on a broker in dom0 to prepare the xenstore node
> >>>       permissions when a guest wanted to create a new connection. The code
> >>>       to do this setup was a single point of failure. If the
> >>>       broker was down you couldn't create any more connections.
> >>
> >> libvchan avoids this by allowing the application to determine the xenstore
> >> path and adjusts permissions itself; the path /local/domain/N/data is
> >> suitable for a libvchan server in domain N to create the nodes in question.
> > 
> > Let's say that the frontend lives in domain A and that the backend lives
> > in domain N.
> > Usually the frontend has a node:
> > 
> > /local/domain/A/device/<devicename>/<number>/backend
> > 
> > that points to the backend, in this case:
> > 
> > /local/domain/N/backend/<devicename>/A/<number>
> > 
> > The backend is not allowed to write to the frontend path, so it cannot write
> > its own path in the backend node. Clearly the frontend doesn't know that
> > information so it cannot fill it up. So the toolstack (typically in
> > dom0) helps with the initial setup writing down under the frontend path
> > where is the backend.
> > How does libvchan solve this issue?
> 
> Libvchan requires both endpoints to know the domain ID of the peer they are
> communicating with - this could be communicated during domain build or through
> a name service. The application then defines a path such as
> "/local/domain/$server_domid/data/example-app/$client_domid" which is writable
> by the server; the server creates nodes here that are readable by the client.

Is it completely up to the application to choose a xenstore path and
give write permissions to the other end?
It looks like something that could be generalized and moved to a library.

How do you currently tell the server the domid of the client?


> >>>     - Symmetric communications were a nightmare. Take the case where A is a
> >>>       backend for B and B is a backend for A. If one of the domains crashed, the
> >>>       other one couldn't be destroyed because it had some pages mapped from
> >>>       the dead domain. This specific issue is probably fixed today.
> >>
> >> This is mostly taken care of by improvements in the hypervisor's handling of
> >> grant mappings. If one domain holds grant mappings open, the domain whose
> >> grants are held can't be fully destroyed, but if both domains are being
> >> destroyed then cycles of grant mappings won't stop them from going away.
> > 
> > However under normal circumstances the domain holding the mappings (that
> > I guess it would be the domain running the backend, correct?) would
> > recognize that the other domain is gone and therefore unmap the grants
> > and close the connection, right?
> > I hope that if the frontend crashes and dies, it doesn't necessarily
> > become a zombie because the backend holds some mappings.
> 
> The mapping between frontend/backend and vchan client/server may be backwards:
> the server must be initialized first and provides the pages for the client to
> map. It looks like you are considering the frontend to be the server.
> 
> The vchan client domain maps grants provided by the server. If the server's
> domain crashes, it may become a zombie until the client application notices the
> crash. This will happen if the client uses the vchan and gets an error when
> sending an event notification (in this case, a well-behaved client will close the
> vchan). If the client does not often send data on the vchan, it can use a watch on
> the server's xenstore node and close the vchan when the node is deleted.
> 
> A client that does not notice the server's destruction will leave a zombie domain.
> A system administrator can resolve this by killing the client process.

This looks like a serious issue. Considering that libvchan already does
copies to transfer the data, couldn't you switch to grant table copy
operations? That would remove the zombie domain problem I think.


* Re: V4V
  2012-05-31 17:20       ` V4V Stefano Stabellini
@ 2012-05-31 18:18         ` Daniel De Graaf
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel De Graaf @ 2012-05-31 18:18 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: James McKenzie, Ross Philipson, Jean Guyader,
	xen-devel@lists.xen.org

On 05/31/2012 01:20 PM, Stefano Stabellini wrote:
> On Wed, 30 May 2012, Daniel De Graaf wrote:
>> On 05/30/2012 07:41 AM, Stefano Stabellini wrote:
>>> On Tue, 29 May 2012, Daniel De Graaf wrote:
>>>> On 05/24/2012 01:23 PM, Jean Guyader wrote:
>>>>> As I'm going through the code to clean up XenClient's inter-VM
>>>>> communication (V4V), I thought it would be a good idea to start a
>>>>> thread to talk about the fundamental differences between V4V and
>>>>> libvchan. I believe the two systems are not clones of each other and
>>>>> they serve different purposes.
>>>>>
>>>>>
>>>>> Disclaimer: I'm not an expert in libvchan; most of the assertions I'm making
>>>>> about libvchan come from my reading of the code. If some of the facts
>>>>> are wrong it's only due to my ignorance about the subject.
>>>>>
>>>>
>>>> I'll try to fill in some of these points with my understanding of libvchan;
>>>> I have correspondingly less knowledge of V4V, so I may be wrong in assumptions
>>>> there.
>>>>
>>>>> 1. Why V4V?
>>>>>
>>>>> About the time when we started XenClient (3 years ago) we were looking for a
>>>>> lightweight inter-VM communication scheme. We started working on a system
>>>>> based on netchannel2, at the time called V2V (VM to VM). The system
>>>>> was very similar to what libvchan is today, and we started to hit some
>>>>> roadblocks:
>>>>>
>>>>>     - The setup relied on a broker in dom0 to prepare the xenstore node
>>>>>       permissions when a guest wanted to create a new connection. The code
>>>>>       to do this setup was a single point of failure. If the
>>>>>       broker was down you couldn't create any more connections.
>>>>
>>>> libvchan avoids this by allowing the application to determine the xenstore
>>>> path and adjusts permissions itself; the path /local/domain/N/data is
>>>> suitable for a libvchan server in domain N to create the nodes in question.
>>>
>>> Let's say that the frontend lives in domain A and that the backend lives
>>> in domain N.
>>> Usually the frontend has a node:
>>>
>>> /local/domain/A/device/<devicename>/<number>/backend
>>>
>>> that points to the backend, in this case:
>>>
>>> /local/domain/N/backend/<devicename>/A/<number>
>>>
>>> The backend is not allowed to write to the frontend path, so it cannot write
>>> its own path in the backend node. Clearly the frontend doesn't know that
>>> information so it cannot fill it up. So the toolstack (typically in
>>> dom0) helps with the initial setup writing down under the frontend path
>>> where is the backend.
>>> How does libvchan solve this issue?
>>
>> Libvchan requires both endpoints to know the domain ID of the peer they are
>> communicating with - this could be communicated during domain build or through
>> a name service. The application then defines a path such as
>> "/local/domain/$server_domid/data/example-app/$client_domid" which is writable
>> by the server; the server creates nodes here that are readable by the client.
> 
> Is it completely up to the application to choose a xenstore path and
> give write permissions to the other end?
> It looks like something that could be generalized and moved to a library.
> 
> How do you currently tell the server the domid of the client?

This depends on the client. One method would be to watch @introduceDomain in
Xenstore and set up a vchan for each new domain (this assumes that your server
wants to talk to every new domain). You could also use existing communications
channels (network or vchan from dom0) to inform a server of clients, and also to
inform the client of the server's domid.

The nodes used by libvchan could be placed under normal frontend/backend device
paths, but the current xenstore permissions require that this be done by dom0.
In this case, the usual xenbus conventions can be used; the management of this
state could be useful for a library.

Xenstore permissions are handled in libvchan; all it needs is a writable path to
create nodes. The original libvchan was using a hard-coded path similar to my
example, but it was decided that allowing the application to define the path would
be more flexible.
 
>>>>>     - Symmetric communications were a nightmare. Take the case where A is a
>>>>>       backend for B and B is a backend for A. If one of the domains crashed, the
>>>>>       other one couldn't be destroyed because it had some pages mapped from
>>>>>       the dead domain. This specific issue is probably fixed today.
>>>>
>>>> This is mostly taken care of by improvements in the hypervisor's handling of
>>>> grant mappings. If one domain holds grant mappings open, the domain whose
>>>> grants are held can't be fully destroyed, but if both domains are being
>>>> destroyed then cycles of grant mappings won't stop them from going away.
>>>
>>> However under normal circumstances the domain holding the mappings (that
>>> I guess it would be the domain running the backend, correct?) would
>>> recognize that the other domain is gone and therefore unmap the grants
>>> and close the connection, right?
>>> I hope that if the frontend crashes and dies, it doesn't necessarily
>>> become a zombie because the backend holds some mappings.
>>
>> The mapping between frontend/backend and vchan client/server may be backwards:
>> the server must be initialized first and provides the pages for the client to
>> map. It looks like you are considering the frontend to be the server.
>>
>> The vchan client domain maps grants provided by the server. If the server's
>> domain crashes, it may become a zombie until the client application notices the
>> crash. This will happen if the client uses the vchan and gets an error when
>> sending an event notification (in this case, a well-behaved client will close the
>> vchan). If the client does not often send data on the vchan, it can use a watch on
>> the server's xenstore node and close the vchan when the node is deleted.
>>
>> A client that does not notice the server's destruction will leave a zombie domain.
>> A system administrator can resolve this by killing the client process.
> 
> This looks like a serious issue. Considering that libvchan already does
> copies to transfer the data, couldn't you switch to grant table copy
> operations? That would remove the zombie domain problem I think.
> 

The grant table copy operations would work for the actual data, but would be rather
inefficient for updating the shared page (ring indexes and notification bits)
which need to be checked and updated before and after each copy, requiring three
or four copy operations per library call. The layout of the shared page would also
need to be rearranged to make all the fields updated by one domain adjacent and
replace notification bits with a different mechanism.

The Linux gntdev driver does not currently support copy operations; this would
need to be added. You would also lose the ability for the server to detect when the
client application exits using the unmap notify byte (as opposed to the entire
client domain crashing) - however, this functionality may not be very important.

An alternative solution (certainly not for 4.2) would be to fix the zombie domain
problem altogether, since it is not limited to vchan - any frontend/backend driver
that does not respond to domain destruction events can cause zombie domains. The
mapped pages could be reassigned to the domain mapping them until all the mappings
are removed and the pages released back to Xen's heap.


Thread overview: 10+ messages
2012-05-24 17:23 V4V Jean Guyader
2012-05-25  9:48 ` V4V Stefano Stabellini
2012-05-25 10:11   ` V4V Jean Guyader
2012-05-25 10:16     ` V4V Stefano Stabellini
2012-05-25 10:19 ` V4V Pasi Kärkkäinen
2012-05-29 22:22 ` V4V Daniel De Graaf
2012-05-30 11:41   ` V4V Stefano Stabellini
2012-05-30 14:19     ` V4V Daniel De Graaf
2012-05-31 17:20       ` V4V Stefano Stabellini
2012-05-31 18:18         ` V4V Daniel De Graaf
