Re: [DRAFT 1] XenSock protocol design document

From: Joao Martins <joao.m.martins@oracle.com>
To: Stefano Stabellini <stefano@aporeto.com>
Cc: jgross@suse.com, lars.kurth@citrix.com, wei.liu2@citrix.com,
	david.vrabel@citrix.com, xen-devel@lists.xenproject.org,
	boris.ostrovsky@oracle.com, roger.pau@citrix.com
Subject: Re: [DRAFT 1] XenSock protocol design document
Date: Mon, 11 Jul 2016 15:51:37 +0100
Message-ID: <5783B279.3050801@oracle.com>
In-Reply-To: <alpine.DEB.2.10.1607071740120.26575@sstabellini-ThinkPad-X260>

On 07/08/2016 12:23 PM, Stefano Stabellini wrote:
> Hi all,
> 
Hey!

[...]

> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. There can only be one XenSock frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.

Would it make sense to export the minimum, default and maximum socket buffer sizes
as xenstore entries? These normally follow a convention that depends on the type of
socket (and the OS), or are otherwise set through socket options.
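
To make the question a bit more concrete, here is a minimal sketch of how a frontend
might use such a triple, loosely modelled on the min/default/max convention of the
Linux tcp_rmem/tcp_wmem sysctls. The struct and function names are hypothetical and
not part of the draft, and how the three values would be laid out in xenstore is
left open.

```c
#include <stdint.h>

/* Hypothetical: limits a backend could advertise (names are not in the draft). */
struct example_sockbuf_limits {
    uint32_t min_bytes;
    uint32_t def_bytes;
    uint32_t max_bytes;
};

/*
 * Resolve the buffer size for a new socket: a request of 0 means "no
 * preference" and yields the default; anything else is clamped to [min, max].
 */
static inline uint32_t
example_resolve_bufsize(const struct example_sockbuf_limits *lim,
                        uint32_t requested)
{
    if (requested == 0)
        return lim->def_bytes;
    if (requested < lim->min_bytes)
        return lim->min_bytes;
    if (requested > lim->max_bytes)
        return lim->max_bytes;
    return requested;
}
```

A frontend would call this once per socket with whatever size the application asked
for, so an oversized request degrades to the advertised maximum instead of failing.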


> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which will be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro
> (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring
> using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     
>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
>     
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };
> 
> The first three fields are common for every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7
> - **sockid** is generated by the frontend and identifies the socket to connect,
>   bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND`
>   commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>   socket.
>   
Interesting - have you considered making setsockopt and getsockopt part of this?
There are some common options (as in POSIX-defined) and then some more exotic flavors
that are Linux or FreeBSD specific: say SO_REUSEPORT, used by nginx, which is good
for load balancing across a set of workers, or Linux's SO_BUSY_POLL for low-latency
sockets. I am not sure how sensible it is to start exposing all of these socket
options, though; perhaps limit it to a specific subset? Or maybe it doesn't make
sense for your case - see my further suggestion regarding the data ring part.
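
As a rough illustration of the "specific subset" idea, a backend could check each
forwarded option against an allow-list before acting on it. This is only a sketch:
the helper name and the particular options listed are my assumptions, and the draft
defines no command for forwarding socket options at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical allow-list; the options chosen here are only an illustration. */
static const int example_allowed_sol_socket_opts[] = {
    SO_REUSEADDR,
    SO_KEEPALIVE,
    SO_RCVBUF,
    SO_SNDBUF,
#ifdef SO_REUSEPORT
    SO_REUSEPORT,   /* not POSIX; included only where the OS defines it */
#endif
};

/* Return true if (level, optname) is in the subset the backend would honour. */
static inline bool example_sockopt_allowed(int level, int optname)
{
    size_t n = sizeof(example_allowed_sol_socket_opts) /
               sizeof(example_allowed_sol_socket_opts[0]);
    size_t i;

    if (level != SOL_SOCKET)
        return false;
    for (i = 0; i < n; i++)
        if (example_allowed_sol_socket_opts[i] == optname)
            return true;
    return false;
}
```

Anything outside the list would simply be rejected with an error in the response,
which keeps the OS-specific options out of the protocol until someone needs them.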

> All three fields are echoed back by the backend.
> 
> As for the other Xen ring based protocols, after writing a request to the ring,
> the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro.
> The format is the following:
> 
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>    
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |  ret  |
>     +-------+-------+-------+-------+-------+
> 
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success or failure
> 
Are these fields taken from a specific OS (I assumed Linux)? The id, cmd and ret
fields could probably be smaller overall, or maybe not, in which case it would be
useful for the spec to say whether it follows a specific OS.
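
On the layout itself, a small compile-time check makes the offsets from the diagram
explicit. The struct below is copied from the draft; the checks and the helper are
my own sketch and assume a C11 compiler. Note that on ABIs where uint64_t is 8-byte
aligned (x86-64, for instance) sizeof() should come out as 24 rather than the 20 the
diagram suggests, because of tail padding, so the spec may want to state the size
explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Response layout as given in the draft. */
struct xen_xensock_response {
    uint32_t id;
    uint32_t cmd;
    uint64_t sockid;
    int32_t ret;
};

/* C11 compile-time checks of the offsets shown in the diagram. */
_Static_assert(offsetof(struct xen_xensock_response, id) == 0, "id at 0");
_Static_assert(offsetof(struct xen_xensock_response, cmd) == 4, "cmd at 4");
_Static_assert(offsetof(struct xen_xensock_response, sockid) == 8, "sockid at 8");
_Static_assert(offsetof(struct xen_xensock_response, ret) == 16, "ret at 16");

/* Hypothetical helper: bytes of padding the compiler adds after `ret`. */
static inline size_t example_response_tail_padding(void)
{
    return sizeof(struct xen_xensock_response) -
           (offsetof(struct xen_xensock_response, ret) + sizeof(int32_t));
}
```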

[...]

> The design is flexible and can support different ring sizes (at compile time).
> The following description is based on order 6 rings, chosen because they provide
> excellent performance.
> 
> - **in** is an array of 65536 bytes, used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array of 131072 bytes, used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.

Could this size be a tunable, intercepting SO_RCVBUF and SO_SNDBUF sockopt
adjustments (these two are POSIX-defined), of course under the assumption that in
this proposal you want to replicate the local and remote socket? In other words,
dynamically allocate how much the socket will use for sending/receiving, which would
translate into the number of grants in use. Even doing this with xenstore entries in
the backend would be better, even though the user may want to adjust the send/receive
buffers to whatever the application needs. Ideally this would be dynamic per socket
instead of compile-time defined, which would allow more sockets on the same VM
without overshooting the grant table limits.
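
To put rough numbers on this, here is a minimal sketch of the arithmetic, assuming
4 KiB pages (the draft only refers to PAGE_SHIFT). All names are hypothetical, and
the grant budget is just a parameter, not an actual Xen limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86. */
#define EXAMPLE_PAGE_SHIFT 12
#define EXAMPLE_PAGE_SIZE  ((size_t)1 << EXAMPLE_PAGE_SHIFT)

/* Grants (one per page) needed to back a buffer of `bytes` bytes, rounded up. */
static inline uint32_t example_grants_for_bytes(size_t bytes)
{
    return (uint32_t)((bytes + EXAMPLE_PAGE_SIZE - 1) >> EXAMPLE_PAGE_SHIFT);
}

/* Grants one socket needs for the given receive and send buffer sizes. */
static inline uint32_t example_grants_per_socket(size_t rcvbuf, size_t sndbuf)
{
    return example_grants_for_bytes(rcvbuf) + example_grants_for_bytes(sndbuf);
}

/* Number of sockets that fit in a given grant budget. */
static inline uint32_t example_max_sockets(uint32_t grant_budget,
                                           size_t rcvbuf, size_t sndbuf)
{
    uint32_t per_socket = example_grants_per_socket(rcvbuf, sndbuf);

    assert(per_socket > 0);   /* at least one buffer must be non-empty */
    return grant_budget / per_socket;
}
```

With the fixed order 6 ring every socket costs 64 grants (one per entry in
ref[XENSOCK_DATARING_PAGES]), whereas a socket that only asks for 16 KiB in each
direction would need 8 under this scheme.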

Joao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
