From: Andrew Cooper <andrew.cooper3@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>
Cc: Vincent Hanquez <Vincent.Hanquez@eu.citrix.com>,
	Ross Philipson <Ross.Philipson@citrix.com>,
	"Xen-devel@lists.xen.org" <Xen-devel@lists.xen.org>
Subject: Re: Inter-domain Communication using Virtual Sockets (high-level design)
Date: Tue, 11 Jun 2013 19:54:19 +0100
Message-ID: <51B7725B.6020108@citrix.com>
In-Reply-To: <51B76754.7040800@citrix.com>

On 11/06/13 19:07, David Vrabel wrote:
> All,
>
> This is a high-level design document for an inter-domain communication
> system under the virtual sockets API (AF_VSOCK) recently added to Linux.
>
> Two low-level transports are discussed: a shared ring based one
> requiring no additional hypervisor support and v4v.
>
> The PDF (including the diagrams) is available here:
>
> http://xenbits.xen.org/people/dvrabel/inter-domain-comms-C.pdf
>
> % Inter-domain Communication using Virtual Sockets
> % David Vrabel <<david.vrabel@citrix.com>

Mismatched angles.

> % Draft C
>
> Introduction
> ============
>
> Revision History
> ----------------
>
> --------------------------------------------------------------------
> Version  Date         Changes
> -------  -----------  ----------------------------------------------
> Draft C  11 Jun 2013  Minor clarifications.
>
> Draft B  10 Jun 2013  Added a section on the low-level shared ring
>                       transport.
>
>                       Added a section on using v4v as the low-level
>                       transport.
>
> Draft A  28 May 2013  Initial draft.
> --------------------------------------------------------------------
>
> Purpose
> -------
>
> In the Windsor architecture for XenServer, dom0 is disaggregated into
> several _service domains_.  Examples of service domains include
> network and storage driver domains, and qemu (stub) domains.
>
> To allow the toolstack to manage service domains there needs to be a
> communication mechanism between the toolstack running in one domain and
> all the service domains.
>
> The principal focus of this new transport is control-plane traffic
> (low latency and low data rates) but consideration is given to future
> uses requiring higher data rates.
>
> Linux 3.9 supports virtual sockets, a new type of socket (the
> new AF_VSOCK address family) for inter-domain communication.  This was
> originally implemented for VMware's VMCI transport but has hooks for
> other transports.  This will be used to provide the interface to
> applications.
>
>
> System Overview
> ---------------
>
> ![\label{fig_overview}System Overview](overview.pdf)
>
>
> Design Map
> ----------
>
> The Linux kernel requires a Xen-specific virtual socket transport and
> front and back drivers.
>
> The connection manager is a new user space daemon running in the
> backend domain.
>
> Toolstacks will require changes to allow them to set the policy used
> by the connection manager.  The design of these changes is out of
> scope of this document.
>
> Definitions and Acronyms
> ------------------------
>
> _AF\_VSOCK_
>   ~ The address family for virtual sockets.
>
> _CID (Context ID)_
>
>   ~ The domain ID portion of the AF_VSOCK address format.
>
> _Port_
>
>   ~ The part of the AF_VSOCK address format identifying a specific
>     service. Similar to the port number used in a TCP connection.
>
> _Virtual Socket_
>
>   ~ A socket using the AF_VSOCK protocol.
>
> References
> ----------
>
> [Windsor Architecture slides from XenSummit
> 2012](http://www.slideshare.net/xen_com_mgr/windsor-domain-0-disaggregation-for-xenserver-and-xcp)
>
>
> Design Considerations
> =====================
>
> Assumptions
> -----------
>
> * There exists a low-level peer-to-peer, datagram based transport
>   mechanism using shared rings (as in libvchan).
>
> Constraints
> -----------
>
> * The AF_VSOCK address format is limited to a 32-bit CID and a 32-bit
>   port number.  This is sufficient as Xen only has 16-bit domain IDs.
>
> Risks and Volatile Areas
> ------------------------
>
> * The transport may be used between untrusted peers.  A domain may be
>   subject to malicious activity or denial of service attacks.
>
> Architecture
> ============
>
> Overview
> --------
>
> ![\label{fig_architecture}Architecture Overview](architecture.pdf)
>
> Linux's virtual sockets are used as the interface to applications.
> Virtual sockets were introduced in Linux 3.9 and provide a hypervisor
> independent[^1] interface to user space applications for inter-domain
> communication.
>
> [^1]: The API and address format are hypervisor independent but the
> address values are not.
>
> An internal API is provided to implement a low-level virtual socket
> transport.  This will be implemented within a pair of front and back
> drivers.  The use of the standard front/back driver method allows the
> toolstack to handle the suspend, resume and migration in a similar way
> to the existing drivers.
>
> The front/back pair provides a point-to-point link between the two
> domains.  This is used to communicate between applications on those
> hosts and between the frontend domain and the _connection manager_
> running on the backend.
>
> The connection manager allows domUs to request direct connections to
> peer domains.  Without the connection manager, peers have no mechanism
> to exchange the information necessary for setting up the direct
> connections. The toolstack sets the policy in the connection manager
> to allow connection requests.  The default policy is to deny
> connection requests.
>
>
> High Level Design
> =================
>
> Virtual Sockets
> ---------------
>
> The AF_VSOCK socket address family in the Linux kernel has a two part
> address format: a uint32_t _context ID_ (_CID_) identifying the domain
> and a uint32_t port for the specific service in that domain.
>
> The CID shall be the domain ID and some CIDs have a specific meaning.
>
> CID                     Purpose
> -------------------     -------
> 0x7FF0 (DOMID_SELF)     The local domain.
> 0x7FF1                  The backend domain (where the connection manager
>                         is).

0x7FF1 is DOMID_IO which has a separate definition as far as Xen is
concerned.

Is it not possible for this information to be in xenstore?
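
To make the overlap concrete, here is a minimal Python sketch. The DOMID_* values are the ones defined in Xen's public header (xen.h); CID_BACKEND is a hypothetical name for the draft's 0x7FF1 entry, which the draft leaves unnamed, and both helper functions are illustrative only.

```python
# Reserved domain IDs as defined in Xen's public header (xen.h).
DOMID_FIRST_RESERVED = 0x7FF0
DOMID_SELF = 0x7FF0
DOMID_IO = 0x7FF1

# Hypothetical name for the draft's "backend domain" CID.
CID_BACKEND = 0x7FF1

XEN_RESERVED = {"DOMID_SELF": DOMID_SELF, "DOMID_IO": DOMID_IO}


def clashes(cid):
    """Return the names of Xen reserved domain IDs that share this value."""
    return sorted(name for name, value in XEN_RESERVED.items() if value == cid)


def is_ordinary_domid(cid):
    """True if cid lies below the reserved range and can name a real domain."""
    return 0 <= cid < DOMID_FIRST_RESERVED
```

With these definitions, clashes(CID_BACKEND) reports DOMID_IO, which is the collision described above.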

>
> Some port numbers are reserved.
>
> Port    Purpose
> ----    -------
> 0       Reserved
> 1       Connection Manager
> 2-1023  Reserved for well-known services (such as a service discovery
>         service).
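
The port ranges in the quoted table can be sketched as a small classifier. This is only an illustration: the constant names and the "unreserved" label for ports above 1023 are assumptions, since the draft lists only the reserved ranges.

```python
# Port ranges from the quoted table; constant names are hypothetical.
PORT_RESERVED = 0
PORT_CONNECTION_MANAGER = 1
WELL_KNOWN_MAX = 1023
PORT_MAX = 0xFFFFFFFF  # AF_VSOCK ports are 32-bit


def classify_port(port):
    """Map an AF_VSOCK port number onto the ranges in the draft."""
    if not 0 <= port <= PORT_MAX:
        raise ValueError("port does not fit in 32 bits")
    if port == PORT_RESERVED:
        return "reserved"
    if port == PORT_CONNECTION_MANAGER:
        return "connection manager"
    if port <= WELL_KNOWN_MAX:
        return "well-known"
    # Assumption: the draft does not say how ports above 1023 are used.
    return "unreserved"
```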

If you are making use of DOMID_SELF, probably also make use of
DOMID_FIRST_RESERVED, which has the same numeric value.

>
> Front / Back Drivers
> --------------------
>
> Using a front or back driver to provide the virtual socket transport
> allows the toolstack to only make the inter-domain communication
> facility available to selected domains.
>
> The "standard" xenbus connection state machine shall be used. See
> figures \ref{fig_front-sm} and \ref{fig_back-sm} on pages
> \pageref{fig_front-sm} and \pageref{fig_back-sm}.
>
> ![\label{fig_front-sm}Frontend Connection State Machine](front-sm.pdf)
>
> ![\label{fig_back-sm}Backend Connection State Machine](back-sm.pdf)
>
>
> Connection Manager
> ------------------
>
> The connection manager has two main purposes.
>
> 1. Checking that two domains are permitted to connect.
>
> 2. Providing a mechanism for two domains to exchange the grant
>    references and event channels needed for them to set up a shared
>    ring transport.
>
> Domains communicate with the connection manager over the front-back
> transport link.  The connection manager must be in the same domain as
> the virtual socket backend driver.
>
> The connection manager opens a virtual socket and listens on a well
> defined port (port 1).
>
> The following messages are defined.
>
> Message          Purpose
> -------          -------
> CONNECT_req      Request connection to another peer.
> CONNECT_rsp      Response to a connection request.
> CONNECT_ind      Indicate that a peer is trying to connect.
> CONNECT_ack      Acknowledge a connection request.
>
> ![\label{fig_conn-msc}Connect Message Sequence Chart](conn.pdf)
>
> Before forwarding a connection request to a peer, the connection
> manager checks that the connection is permitted.  The toolstack sets
> these permissions.
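
A minimal Python sketch of the default-deny check and the four-message exchange follows. It is a model under stated assumptions, not the proposed implementation: the class and field names are hypothetical, the policy is assumed to be a set of directed (initiator, responder) pairs, and a refusal is assumed to be reported with a CONNECT_rsp carrying a "denied" status, none of which the draft specifies. The ring details carried in each message follow the "Peer domains" section of the quoted draft.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RingInfo:
    """What a peer hands over: a grant reference and an event channel."""
    grant_ref: int
    evtchn: int


@dataclass
class ConnectionManager:
    # Directed (initiator, responder) pairs allowed by the toolstack.
    # Empty by default, so every request is denied until a pair is added.
    allowed: set = field(default_factory=set)

    def permit(self, initiator, responder):
        """Toolstack hook (hypothetical): allow initiator to reach responder."""
        self.allowed.add((initiator, responder))

    def handle_connect_req(self, initiator, responder, ring):
        """Return the next message as (destination, name, payload)."""
        if (initiator, responder) not in self.allowed:
            # Assumption: a refusal is a CONNECT_rsp with an error status.
            return (initiator, "CONNECT_rsp", {"status": "denied"})
        # Forward the initiator's ring details to the responder.
        return (responder, "CONNECT_ind", {"from": initiator, "ring": ring})

    def handle_connect_ack(self, initiator, responder, ring):
        """Relay the responder's ring details back to the initiator."""
        payload = {"status": "ok", "from": responder, "ring": ring}
        return (initiator, "CONNECT_rsp", payload)
```

A request is only forwarded after permit() has been called for that pair; the reverse direction stays denied unless it is permitted separately.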
>
> Disconnecting transport links to an uncooperative (or dead) domain is
> required.  Therefore there are no messages for disconnecting transport
> links (as these may be ignored or delayed). Instead a transport link is
> disconnected by tearing down the local end. The peer will notice the
> remote end going away and then tear down its end.
>
> Low-level transport
> ===================
>
> [ The exact details are yet to be determined but this section should
>   provide a reasonable summary of the mechanisms used. ]
>
> Frontend and backend domains
> ----------------------------
>
> As is typical for frontend and backend drivers, the frontend will
> grant copy-only access to two rings -- one for from-front messages and
> one for to-front messages.  Each ring shall have an event channel for
> notifying when requests and responses are placed on the ring.

The term "grant copy-only" is very confusing to read in context.
However, I can't offhand think of a better way of describing it.

~Andrew

>
> Peer domains
> ------------
>
> The initiator grants copy-only access to a from-initiator (transmit)
> ring and provides an event channel for notifications for this ring.
> This information is included in the CONNECT_req and CONNECT_ind
> messages.
>
> The responder grants copy-only access to a from-responder (transmit)
> ring and provides an event channel for notifications for this ring.
> The information is included in the CONNECT_ack and CONNECT_rsp
> messages.
>
> After the initial connection, the two domains operate as identical
> peers.  Disconnection is signalled by a domain ungranting its transmit
> ring, notifying the peer via the associated event channel.  The event
> channel is then unbound.
>
> Appendix
> ========
>
> V4V
> ---
>
> An alternative low-level transport (V4V) has been proposed.  The
> hypervisor copies messages from the source domain into a destination
> ring provided by the destination domain.
>
> Because peers are untrusted, in order to prevent them from being able
> to denial-of-service the processing of messages from other peers, each
> receiver must have a per-peer receive ring.  A listening service does
> not know in advance which peers may connect so it cannot create these
> rings in advance.
>
> The connection manager service running in a trusted domain (as in the
> shared ring transport described above) may be used.  The CONNECT_ind
> message is used to trigger the creation of a receive ring for that
> specific sender.
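
The per-peer ring argument can be illustrated with a toy Python model. This is not the v4v ABI: the Receiver class, the RING_SLOTS capacity and the deliver() helper standing in for the hypervisor copy are all hypothetical. It only shows that a sender filling its own ring cannot block delivery from another sender.

```python
from collections import deque

RING_SLOTS = 4  # hypothetical ring capacity, in messages


class Receiver:
    def __init__(self):
        # sender domain ID -> that sender's private receive ring
        self.rings = {}

    def on_connect_ind(self, sender):
        """Create a receive ring when the connection manager announces a sender."""
        self.rings.setdefault(sender, deque())

    def deliver(self, sender, msg):
        """Model the copy into the ring: False if no ring exists or it is full."""
        ring = self.rings.get(sender)
        if ring is None or len(ring) >= RING_SLOTS:
            return False
        ring.append(msg)
        return True
```

In this model a flood from one sender only exhausts that sender's ring, and messages from a sender that was never announced are dropped because no ring exists for it.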
>
> A peer must be able to find the connection manager service both at
> start of day and if the connection manager service is restarted in a
> new domain.  This can be done in two possible ways:
>
> 1. Watch a Xenstore key which contains the connection manager service
>    domain ID.
>
> 2. Use a frontend/backend driver pair.
>
> ### Advantages
>
> * Does not use grant table resources.  If shared rings are used then a
>   busy guest with hundreds of peers will require more grant table
>   entries than the current default.
>
> ### Disadvantages
>
> * Any changes or extensions to the protocol or ring format would
>   require a hypervisor change.  This is more difficult than making
>   changes to guests.
>
> * The connection-less, "shared-bus" model of v4v is unsuitable for
>   untrusted peers.  This requires layering a connection model on top
>   and much of the simplicity of the v4v ABI is lost.
>
> * The mechanism for handling full destination rings will not scale up
>   on busy domains.  The event channel only indicates that some ring
>   may have space -- it does not identify which ring has space.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
