From: Andrew Cooper <andrew.cooper3@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Vincent Hanquez <Vincent.Hanquez@eu.citrix.com>,
	Ross Philipson <Ross.Philipson@citrix.com>,
	"Xen-devel@lists.xen.org" <Xen-devel@lists.xen.org>
Subject: Re: Inter-domain Communication using Virtual Sockets (high-level design)
Date: Tue, 11 Jun 2013 19:54:19 +0100
Message-ID: <51B7725B.6020108@citrix.com>
In-Reply-To: <51B76754.7040800@citrix.com>

On 11/06/13 19:07, David Vrabel wrote:
> All,
>
> This is a high-level design document for an inter-domain communication
> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
>
> Two low-level transports are discussed: a shared ring based one
> requiring no additional hypervisor support and v4v.
>
> The PDF (including the diagrams) is available here:
>
> http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf
>
> % Inter-domain Communication using Virtual Sockets
> % David Vrabel <<david.vrabel@citrix.com>

Mismatched angles.

> % Draft C
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft C  11 Jun 2013  Minor clarifications.
>
> Draft B  10 Jun 2013  Added a section on the low-level shared ring
>                       transport.
>
>                       Added a section on using v4v as the low-level
>                       transport.
>
> Draft A  28 May 2013  Initial draft.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> In the Windsor architecture for XenServer, dom0 is disaggregated into
> several _service domains_.  Examples of service domains include
> network and storage driver domains, and qemu (stub) domains.
>
> To allow the toolstack to manage service domains there needs to be a
> communication mechanism between the toolstack running in one domain and
> all the service domains.
>
> The principal focus of this new transport is control-plane traffic
> (low latency and low data rates) but consideration is given to future
> uses requiring higher data rates.
>
> Linux 3.9 supports virtual sockets, a new type of socket (the
> new AF_VSOCK address family) for inter-domain communication.  This was
> originally implemented for VMware's VMCI transport but has hooks for
> other transports.  This will be used to provide the interface to
> applications.
>
>
> System Overview
> ---------------
>
> ![\label{fig_overview}System Overview](overview.pdf)
>
>
> Design Map
> ----------
>
> The Linux kernel requires a Xen-specific virtual socket transport and
> front and back drivers.
>
> The connection manager is a new user space daemon running in the
> backend domain.
>
> Toolstacks will require changes to allow them to set the policy used
> by the connection manager.  The design of these changes is out of
> scope of this document.
>
> Definitions and Acronyms
> ------------------------
>
> _AF\_VSOCK_
>   ~ The address family for virtual sockets.
>
> _CID (Context ID)_
>
>   ~ The domain ID portion of the AF_VSOCK address format.
>
> _Port_
>
>   ~ The part of the AF_VSOCK address format identifying a specific
>     service. Similar to the port number used in a TCP connection.
>
> _Virtual Socket_
>
>   ~ A socket using the AF_VSOCK protocol.
>
> References
> ----------
>
> [Windsor Architecture slides from XenSummit
> 2012](http://www.slideshare.net/xen_com_mgr/windsor-domain-0-disaggregation-for-xenserver-and-xcp)
>
>
> Design Considerations
> =====================
>
> Assumptions
> -----------
>
> * There exists a low-level peer-to-peer, datagram based transport
>   mechanism using shared rings (as in libvchan).
>
> Constraints
> -----------
>
> * The AF_VSOCK address format is limited to a 32-bit CID and a 32-bit
>   port number.  This is sufficient as Xen only has 16-bit domain IDs.
>
> Risks and Volatile Areas
> ------------------------
>
> * The transport may be used between untrusted peers.  A domain may be
>   subject to malicious activity or denial of service attacks.
>
> Architecture
> ============
>
> Overview
> --------
>
> ![\label{fig_architecture}Architecture Overview](architecture.pdf)
>
> Linux's virtual sockets are used as the interface to applications.
> Virtual sockets were introduced in Linux 3.9 and provide a hypervisor
> independent[^1] interface to user space applications for inter-domain
> communication.
>
> [^1]: The API and address format are hypervisor independent but the
> address values are not.
>
> An internal API is provided to implement a low-level virtual socket
> transport.  This will be implemented within a pair of front and back
> drivers.  The use of the standard front/back driver method allows the
> toolstack to handle the suspend, resume and migration in a similar way
> to the existing drivers.
>
> The front/back pair provides a point-to-point link between the two
> domains.  This is used to communicate between applications on those
> hosts and between the frontend domain and the _connection manager_
> running on the backend.
>
> The connection manager allows domUs to request direct connections to
> peer domains.  Without the connection manager, peers have no mechanism
> to exchange the information necessary for setting up the direct
> connections. The toolstack sets the policy in the connection manager
> to allow connection requests.  The default policy is to deny
> connection requests.
>
>
> High Level Design
> =================
>
> Virtual Sockets
> ---------------
>
> The AF_VSOCK socket address family in the Linux kernel has a two part
> address format: a uint32_t _context ID_ (_CID_) identifying the domain
> and a uint32_t port for the specific service in that domain.
>
> The CID shall be the domain ID and some CIDs have a specific meaning.
>
> CID                     Purpose
> -------------------     -------
> 0x7FF0 (DOMID_SELF)     The local domain.
> 0x7FF1                  The backend domain (where the connection
>                         manager is).

0x7FF1 is DOMID_IO which has a separate definition as far as Xen is
concerned.

Is it not possible for this information to be in xenstore?

>
> Some port numbers are reserved.
>
> Port    Purpose
> ----    -------
> 0       Reserved
> 1       Connection Manager
> 2-1023  Reserved for well-known services (such as a service
>         discovery service).
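
As a rough illustration of the address and port tables above, the sketch
below classifies a port against the reserved ranges and opens a socket to
the connection manager. Python is used only for brevity. The constant and
function names are hypothetical, 0x7FF1 is the draft's proposed value
rather than anything settled, the "unreserved" label for ports above 1023
is an assumption (the draft says nothing about them), and so is the use of
a stream socket.

```python
import socket

# Values taken from the draft's tables; the names are hypothetical.
CID_BACKEND = 0x7FF1          # proposed CID of the backend domain
PORT_CONNECTION_MANAGER = 1   # well-known port of the connection manager


def classify_port(port):
    """Map a 32-bit AF_VSOCK port onto the draft's reserved ranges."""
    if not 0 <= port <= 0xFFFFFFFF:
        raise ValueError("AF_VSOCK ports are 32-bit unsigned values")
    if port == 0:
        return "reserved"
    if port == PORT_CONNECTION_MANAGER:
        return "connection manager"
    if port <= 1023:
        return "well-known service"
    # Assumption: the draft does not describe ports above 1023.
    return "unreserved"


def connect_to_connection_manager():
    """Open a socket to the connection manager (needs vsock support)."""
    # Python represents an AF_VSOCK address as a (CID, port) tuple.
    # SOCK_STREAM is an assumption; the draft does not fix the socket type.
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.connect((CID_BACKEND, PORT_CONNECTION_MANAGER))
    return sock
```

classify_port() is pure and runs anywhere; connect_to_connection_manager()
only works on a kernel with a vsock transport loaded.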

If you are making use of DOMID_SELF, probably also make use of
DOMID_FIRST_RESERVED, which has the same numeric value.
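
To make the overlap concrete, here is a minimal sketch listing the
relevant reserved domain IDs from Xen's public header
(xen/include/public/xen.h) next to the CID proposed in the draft. The
helper names and CID_BACKEND are hypothetical; only the DOMID_* values
come from Xen.

```python
# Reserved domain IDs from Xen's public header (xen/include/public/xen.h).
DOMID_FIRST_RESERVED = 0x7FF0
DOMID_SELF = 0x7FF0
DOMID_IO = 0x7FF1

# CID the draft proposes for the backend domain (hypothetical name).
CID_BACKEND = 0x7FF1

XEN_RESERVED = {
    "DOMID_FIRST_RESERVED": DOMID_FIRST_RESERVED,
    "DOMID_SELF": DOMID_SELF,
    "DOMID_IO": DOMID_IO,
}


def is_reserved_domid(domid):
    """Domain IDs at or above DOMID_FIRST_RESERVED are not ordinary domains."""
    if not 0 <= domid <= 0xFFFF:
        raise ValueError("Xen domain IDs are 16-bit")
    return domid >= DOMID_FIRST_RESERVED


def clashes_with_xen(cid):
    """Names of the Xen constants listed above that share this value."""
    return sorted(name for name, value in XEN_RESERVED.items() if value == cid)
```

clashes_with_xen(CID_BACKEND) reports DOMID_IO, which is the conflict
noted above.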

>
> Front / Back Drivers
> --------------------
>
> Using a front or back driver to provide the virtual socket transport
> allows the toolstack to only make the inter-domain communication
> facility available to selected domains.
>
> The "standard" xenbus connection state machine shall be used. See
> figures \ref{fig_front-sm} and \ref{fig_back-sm} on pages
> \pageref{fig_front-sm} and \pageref{fig_back-sm}.
>
> ![\label{fig_front-sm}Frontend Connection State Machine](front-sm.pdf)
>
> ![\label{fig_back-sm}Backend Connection State Machine](back-sm.pdf)
>
>
> Connection Manager
> ------------------
>
> The connection manager has two main purposes.
>
> 1. Checking that two domains are permitted to connect.
>
> 2. Providing a mechanism for two domains to exchange the grant
>    references and event channels needed for them to set up a shared
>    ring transport.
>
> Domains communicate with the connection manager over the front-back
> transport link.  The connection manager must be in the same domain as
> the virtual socket backend driver.
>
> The connection manager opens a virtual socket and listens on a well
> defined port (port 1).
>
> The following messages are defined.
>
> Message          Purpose
> -------          -------
> CONNECT_req      Request connection to another peer.
> CONNECT_rsp      Response to a connection request.
> CONNECT_ind      Indicate that a peer is trying to connect.
> CONNECT_ack      Acknowledge a connection request.
>
> ![\label{fig_conn-msc}Connect Message Sequence Chart](conn.pdf)
>
> Before forwarding a connection request to a peer, the connection
> manager checks that the connection is permitted.  The toolstack sets
> these permissions.
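
The permission check described above could be as small as the sketch
below: a default-deny table that the toolstack populates. The class and
method names are hypothetical, and treating a permission as directional
(initiator to responder) is an assumption; the draft does not say whether
permissions are symmetric.

```python
class ConnectionPolicy:
    """Default-deny table of (initiator, responder) domain ID pairs."""

    def __init__(self):
        # Empty set: every connection request is denied until allowed.
        self._allowed = set()

    def allow(self, initiator, responder):
        """Permit one direction, as set by the toolstack."""
        self._allowed.add((initiator, responder))

    def revoke(self, initiator, responder):
        """Withdraw a previously granted permission, if any."""
        self._allowed.discard((initiator, responder))

    def permits(self, initiator, responder):
        """Checked before a CONNECT_req is forwarded as a CONNECT_ind."""
        return (initiator, responder) in self._allowed
```

With this shape, a request from domain 3 to domain 5 is refused until the
toolstack calls allow(3, 5), and the reverse direction stays refused.
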
>
> Disconnecting transport links to an uncooperative (or dead) domain is
> required.  Therefore there are no messages for disconnecting transport
> links (as these may be ignored or delayed). Instead a transport link is
> disconnected by tearing down the local end. The peer will notice the
> remote end going away and then tear down its end.
>
> Low-level transport
> ===================
>
> [ The exact details are yet to be determined but this section should
>   provide a reasonable summary of the mechanisms used. ]
>
> Frontend and backend domains
> ----------------------------
>
> As is typical for frontend and backend drivers, the frontend will
> grant copy-only access to two rings -- one for from-front messages and
> one for to-front messages.  Each ring shall have an event channel for
> notifying when requests and responses are placed on the ring.

The term "grant copy-only" is very confusing to read in context.
However, I can't offhand think of a better way of describing it.

~Andrew

>
> Peer domains
> ------------
>
> The initiator grants copy-only access to a from-initiator (transmit)
> ring and provides an event channel for notifications for this ring.
> This information is included in the CONNECT_req and CONNECT_ind
> messages.
>
> The responder grants copy-only access to a from-responder (transmit)
> ring and provides an event channel for notifications for this ring.
> The information is included in the CONNECT_ack and CONNECT_rsp
> messages.
>
> After the initial connection, the two domains operate as identical
> peers.  Disconnection is signalled by a domain ungranting its transmit
> ring, notifying the peer via the associated event channel.  The event
> channel is then unbound.
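
A minimal sketch of how the ring details travel through the four
messages, assuming the connection manager forwards them unchanged. The
field names (grant_ref, event_channel, initiator, responder) are
hypothetical, as the draft does not define a message layout, and the
policy check is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RingInfo:
    """What a peer needs to reach the sender's transmit ring."""
    grant_ref: int
    event_channel: int


@dataclass(frozen=True)
class Message:
    kind: str        # CONNECT_req, CONNECT_ind, CONNECT_ack or CONNECT_rsp
    initiator: int   # domain ID of the connecting side
    responder: int   # domain ID of the side being connected to
    ring: RingInfo   # transmit ring of whoever originated the message


# A request is forwarded as an indication, an acknowledgement as a response.
FORWARDED_AS = {"CONNECT_req": "CONNECT_ind", "CONNECT_ack": "CONNECT_rsp"}


def relay(msg):
    """Connection manager side: forward a message to the other peer."""
    return Message(FORWARDED_AS[msg.kind], msg.initiator, msg.responder, msg.ring)
```

The initiator's ring therefore reaches the responder in CONNECT_ind, and
the responder's ring reaches the initiator in CONNECT_rsp, after which
each side holds the other's transmit ring.
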
>
> Appendix
> ========
>
> V4V
> ---
>
> An alternative low-level transport (V4V) has been proposed.  The
> hypervisor copies messages from the source domain into a destination
> ring provided by the destination domain.
>
> Because peers are untrusted, in order to prevent them from being able
> to denial-of-service the processing of messages from other peers, each
> receiver must have a per-peer receive ring.  A listening service does
> not know in advance which peers may connect so it cannot create these
> rings in advance.
>
> The connection manager service running in a trusted domain (as in the
> shared ring transport described above) may be used.  The CONNECT_ind
> message is used to trigger the creation of a receive ring for that
> specific sender.
>
> A peer must be able to find the connection manager service both at
> start of day and if the connection manager service is restarted in a
> new domain.  This can be done in two possible ways:
>
> 1. Watch a Xenstore key which contains the connection manager service
>    domain ID.
>
> 2. Use a frontend/backend driver pair.
>
> ### Advantages
>
> * Does not use grant table resource.  If shared rings are used then a
>   busy guest with hundreds of peers will require more grant table
>   entries than the current default.
>
> ### Disadvantages
>
> * Any changes or extensions to the protocol or ring format would
>   require a hypervisor change.  This is more difficult than making
>   changes to guests.
>
> * The connection-less, "shared-bus" model of v4v is unsuitable for
>   untrusted peers.  This requires layering a connection model on top
>   and much of the simplicity of the v4v ABI is lost.
>
> * The mechanism for handling full destination rings will not scale up
>   on busy domains.  The event channel only indicates that some ring
>   may have space -- it does not identify which ring has space.
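
The scaling concern in the last point can be sketched as follows: because
the notification does not name a ring, a sender with many blocked
destinations has to probe all of them each time it fires. This models
only the cost and uses no real v4v interface; the function name and its
arguments are hypothetical.

```python
def find_rings_with_space(pending, has_space):
    """Probe every blocked destination ring after one notification.

    pending   -- destination ring identifiers with queued sends
    has_space -- callable reporting whether a given ring can accept data
    Returns the writable rings and the number of probes performed.
    """
    probes = 0
    writable = []
    for ring in pending:
        probes += 1          # one probe per blocked ring, every time
        if has_space(ring):
            writable.append(ring)
    return writable, probes
```

With 200 blocked rings of which one has drained, a single notification
costs 200 probes to find it.
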
>
