From: Keir Fraser <keir.xen@gmail.com>
To: David Vrabel <david.vrabel@citrix.com>,
"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Subject: Re: Scalable Event Channel ABI design (draft A)
Date: Mon, 04 Feb 2013 19:59:16 +0000
Message-ID: <CD35C394.4B677%keir.xen@gmail.com>
In-Reply-To: <510FF54E.1070706@citrix.com>
On 04/02/2013 17:52, "David Vrabel" <david.vrabel@citrix.com> wrote:
> Hi,
>
> Here is a design for a scalable event channel ABI for Xen. It has a
> number of benefits over the design currently being proposed by Wei Liu.
>
> * More event channels (>100,000).
> * 16 event priorities.
> * Reduced memory requirements (only 1 additional page per domU).
> * The use of FIFOs for events ensures fairness, whereas it is difficult
> to reason about the fairness with the current bitmap system.
I have some sympathy for this design. Its primary downside compared with
the 3-level alternative is its greater space cost (32*#vcpus). However, as
you say, the fairness and prioritisation features are rather nice. Also,
having the data structures be per-vcpu may well help avoid cacheline
contention on busy multi-vcpu guests.
Interested in what others think, but I may actually prefer a ground-up
redesign like this.
-- Keir
> The PDF version is easier to read and has diagrams and readable maths
> but the original markdown format document is included below (for ease of
> commenting).
>
> http://xenbits.xen.org/people/dvrabel/event-channels-A.pdf
>
> Introduction
> ============
>
> Purpose
> -------
>
> Xen uses event channels to signal events (interrupts) to (fully or
> partially) paravirtualized guests. The current event channel ABI
> provided by Xen only supports up to 1024 (for 32-bit guests) or 4096
> (for 64-bit guests) event channels. This limits scalability as
> support for more VMs, VCPUs and devices is required.
>
> The existing ABI does not easily allow events to have different
> priorities. Current Linux kernels prioritize the timer event by
> special-casing it, but this approach does not generalize to other events.
> Event priorities may be useful for prioritizing MMIO emulation
> requests over bulk data traffic (such as network or disk).
>
> This design replaces the existing event channel ABI with one that:
>
> - is scalable to more than 100,000 event channels, with scope for
> increasing this further with minimal ABI changes.
>
> - allows guests to use up to 16 different event priorities.
>
> System Overview
> ---------------
>
> [FIXME: diagram showing Xen and guest and shared memory block for events?]
>
> Design Map
> ----------
>
> A new event channel ABI requires changes to Xen and the guest kernels.
>
> References
> ----------
>
> [FIXME: link to alternate proposal?]
>
> Design Considerations
> =====================
>
> Assumptions
> -----------
>
> * Atomic read-modify-write of 32-bit words is possible on all
> supported platforms. This can be with a load-linked /
> store-conditional (e.g., ARMv8's ldxr/stxr) or a compare-and-swap
> (e.g., x86's cmpxchg).
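>
> As an illustration only (not part of the proposed ABI), the kind of
> compare-and-swap loop assumed here might look like the following C
> sketch using the GCC `__atomic` builtins; the function name and the
> choice of bit are arbitrary.
>
>     #include <stdint.h>
>
>     /* Atomically set a single bit in a 32-bit event word with a
>      * compare-and-swap loop.  Returns the word's previous value. */
>     static uint32_t event_word_set_bit(uint32_t *word, unsigned int bit)
>     {
>         uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
>         uint32_t new;
>
>         do {
>             new = old | (1u << bit);
>         } while (!__atomic_compare_exchange_n(word, &old, new, 0,
>                                               __ATOMIC_ACQ_REL,
>                                               __ATOMIC_RELAXED));
>         return old;
>     }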
>
> Constraints
> -----------
>
> * The existing ABI must continue to be usable. Compatibility with
> existing guests is mandatory.
>
> Risks and Volatile Areas
> ------------------------
>
> * Should the 3-level proposal be merged into Xen then this design does
> not offer enough improvements to warrant the cost of maintaining
> three different event channel ABIs in Xen and guest kernels.
>
> * The performance of some operations may be decreased. Specifically,
> retriggering an event now always requires a hypercall.
>
> Architecture
> ============
>
> Overview
> --------
>
> The event channel ABI uses a data structure that is shared between Xen
> and the guest. Access to the structure is done with lock-less
> operations (except for some less common operations where the guest
> must use a hypercall). The guest is responsible for allocating this
> structure and registering it with Xen during VCPU bring-up.
>
> Events are reported to a guest's VCPU using a FIFO queue. There is a
> queue for each priority level and each VCPU.
>
> Each event has a _pending_ and a _masked_ bit. The pending bit
> indicates the event has been raised. The masked bit is used by the
> guest to prevent delivery of that specific event.
>
>
> High Level Design
> =================
>
> Shared Event Data Structure
> ---------------------------
>
> The shared event data structure has a per-domain _event array_, and a
> per-VCPU _control block_.
>
> - _event array_: A logical array of event words (one per event
> channel) which contains the pending and mask bits and the link index
> for the next event in the queue.
>
> [Figure: event array word layout]
>
> - _control block_: This contains the head and tail indexes for each
> priority level. Each VCPU has its own control block and this is
> contained in the existing `struct vcpu_info`.
>
> The pages within the event array need not be physically or virtually
> contiguous, but the guest or Xen may make them virtually contiguous for
> ease of implementation, e.g., by using vmap() in Xen or vmalloc() in
> Linux. Pages are added by the guest as required by the number of
> bound event channels.
>
> Only 17 bits are currently defined for the LINK field, allowing 2^17^
> (131,072) events. This limit can be trivially increased without any
> other changes to the ABI. Bits [28:17] are reserved for future
> expansion or for other uses.
>
> Event State Machine
> -------------------
>
> Event channels are bound to a domain using the existing ABI.
>
> A bound event may be in one of three main states.
>
> State     Abbrev.  Meaning
> -----     -------  -------
> BOUND     B        The event is bound but not pending.
> PENDING   P        The event has been raised and not yet acknowledged.
> LINKED    L        The event is on an event queue.
>
> Additionally, events may be UNMASKED or MASKED (M).
>
> [Figure: event state machine]
>
> The state of an event is tracked using 3 bits within the event word:
> the M (masked), P (pending) and L (linked) bits. Only state
> transitions that change a single bit are valid.
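>
> For illustration, the event word layout described above could be
> written as the following C constants. The positions of the P, M and L
> bits are not fixed by the text (the figure is omitted here), so the
> values below are an assumption; only the 17-bit LINK field and the
> reserved range [28:17] come from this draft.
>
>     #include <stdint.h>
>
>     typedef uint32_t event_word_t;
>
>     /* 17-bit LINK field: index of the next event in the queue. */
>     #define EVTCHN_FIFO_LINK_BITS  17
>     #define EVTCHN_FIFO_LINK_MASK  ((1u << EVTCHN_FIFO_LINK_BITS) - 1)
>
>     /* Bits [28:17] are reserved for future expansion or other uses. */
>
>     /* Assumed bit positions for the per-event state bits. */
>     #define EVTCHN_FIFO_LINKED     29  /* L: event is on a queue   */
>     #define EVTCHN_FIFO_MASKED     30  /* M: delivery is masked    */
>     #define EVTCHN_FIFO_PENDING    31  /* P: event has been raised */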
>
> Event Queues
> ------------
>
> The event queues use a singly-linked list of event array words (see
> figure \ref{fig_event-word} and \ref{fig_event-queue}).
>
> [Figure: example event queue]
>
> Each event queue has a head and tail index stored in the control
> block. The head index is the index of the first element in the queue.
> The tail index is the index of the last element in the queue. Every element within
> the queue has the L bit set.
>
> The LINK field in the event word indexes the next event in the queue.
> LINK is zero for the last word in the queue.
>
> The queue is empty when the head index is zero (zero is not a valid
> event channel).
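>
> As a sketch only, the per-VCPU control block and the empty-queue check
> could look like the following; the struct and field names are
> illustrative, not part of this draft.
>
>     #include <stdint.h>
>
>     #define EVTCHN_FIFO_MAX_QUEUES 16      /* one queue per priority */
>
>     /* Per-VCPU control block, embedded in struct vcpu_info.
>      * 16 priorities x 2 indexes x 4 bytes = 128 bytes. */
>     struct evtchn_fifo_control_block {
>         struct {
>             uint32_t head;  /* index of the first event in the queue */
>             uint32_t tail;  /* index of the last event in the queue  */
>         } queue[EVTCHN_FIFO_MAX_QUEUES];
>     };
>
>     /* A queue is empty when its head index is zero (0 is not a valid
>      * event channel). */
>     static inline int queue_empty(const struct evtchn_fifo_control_block *cb,
>                                   unsigned int priority)
>     {
>         return cb->queue[priority].head == 0;
>     }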
>
> Hypercalls
> ----------
>
> Four new EVTCHNOP hypercall sub-operations are added:
>
> * `EVTCHNOP_init`
>
> * `EVTCHNOP_expand`
>
> * `EVTCHNOP_set_priority`
>
> * `EVTCHNOP_set_limit`
>
> ### `EVTCHNOP_init`
>
> This call initializes a single VCPU's event channel data structures,
> adding one page for the event array.
>
> A guest should call this during initial VCPU bring-up (and on resume?).
>
>     struct evtchnop_init {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
>
> Field Purpose
> ----- -------
> `vcpu` [in] The VCPU number.
> `array_pfn` [in] The PFN or GMFN of a page to be used for the first page
> of the event array.
>
> Error code Reason
> ---------- ------
> EINVAL `vcpu` is invalid or already initialized.
> EINVAL `array_pfn` is not a valid frame for the domain.
> ENOMEM Insufficient memory to allocate internal structures.
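>
> A guest might issue this call roughly as follows. This is a sketch
> only: it assumes a Linux-style `HYPERVISOR_event_channel_op(cmd, arg)`
> wrapper, a hypothetical `EVTCHNOP_init` command number and a
> `virt_to_gfn()` helper; none of these are defined by this draft.
>
>     /* Sketch: register the first event array page for a VCPU. */
>     static int evtchn_fifo_init_vcpu(unsigned int vcpu, void *array_page)
>     {
>         struct evtchnop_init init = {
>             .vcpu      = vcpu,
>             .array_pfn = virt_to_gfn(array_page),  /* assumed helper */
>         };
>
>         return HYPERVISOR_event_channel_op(EVTCHNOP_init, &init);
>     }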
>
> ### `EVTCHNOP_expand`
>
> This call expands the event array for a VCPU by appending an additional
> page.
>
> A guest should call this for all VCPUs when a new event channel is
> required and there is insufficient space in the current event array.
>
> It is not possible to shrink the event array once it has been
> expanded.
>
>     struct evtchnop_expand {
>         uint32_t vcpu;
>         uint64_t array_pfn;
>     };
>
> Field Purpose
> ----- -------
> `vcpu` [in] The VCPU number.
> `array_pfn` [in] The PFN or GMFN of a page to be used for the next page
> of the event array.
>
> Error code Reason
> ---------- ------
> EINVAL `vcpu` is invalid or has not been initialized.
> EINVAL `array_pfn` is not a valid frame for the domain.
> EINVAL The event array already has the maximum number of pages.
> ENOMEM Insufficient memory to allocate internal structures.
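>
> As a rough guest-side sketch of when to use this call: expand every
> VCPU's event array once a newly bound port no longer fits in the pages
> registered so far. The helpers and constants below are assumptions,
> not part of the draft.
>
>     #define EVENTS_PER_PAGE 1024
>
>     static int maybe_expand_event_array(unsigned int port,
>                                         unsigned int nr_pages,
>                                         unsigned int nr_vcpus)
>     {
>         struct evtchnop_expand expand;
>         unsigned int vcpu;
>         int ret;
>
>         if (port < nr_pages * EVENTS_PER_PAGE)
>             return 0;                      /* current pages suffice */
>
>         for (vcpu = 0; vcpu < nr_vcpus; vcpu++) {
>             void *page = alloc_shared_page();      /* assumed helper */
>
>             expand.vcpu = vcpu;
>             expand.array_pfn = virt_to_gfn(page);  /* assumed helper */
>             ret = HYPERVISOR_event_channel_op(EVTCHNOP_expand, &expand);
>             if (ret)
>                 return ret;
>         }
>         return 0;
>     }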
>
> ### `EVTCHNOP_set_priority`
>
> This call sets the priority for an event channel. The event must be
> unbound.
>
> A guest may call this prior to binding an event channel. The meaning
> and the use of the priority are up to the guest. Valid priorities are
> 0 - 15 and the default is 7.
>
>     struct evtchnop_set_priority {
>         uint32_t port;
>         uint32_t priority;
>     };
>
> Field Purpose
> ----- -------
> `port` [in] The event channel.
> `priority` [in] The priority for the event channel.
>
> Error code Reason
> ---------- ------
> EINVAL `port` is invalid.
> EINVAL `port` is currently bound.
> EINVAL `priority` is outside the range 0 - 15.
>
> ### `EVTCHNOP_set_limit`
>
> This privileged call sets the maximum number of event channels a
> domain can bind. The default for dom0 is unlimited. Other domains
> default to 1024 events (requiring only a single page for their event
> array).
>
>     struct evtchnop_set_limit {
>         uint32_t domid;
>         uint32_t max_events;
>     };
>
> Field Purpose
> ----- -------
> `domid` [in] The domain ID.
> `max_events` [in] The maximum number of event channels that may be bound
> to the domain.
>
> Error code Reason
> ---------- ------
> EINVAL `domid` is invalid.
> EPERM The calling domain has insufficient privileges.
>
> Memory Usage
> ------------
>
> ### Event Arrays
>
> Xen needs to map every domain's event array into its address space.
> The space reserved for these global mappings is limited to 1 GiB on
> x86-64 (262144 pages) and is shared with other users.
>
> It is non-trivial to calculate the maximum number of VMs that can be
> supported as this depends on the system configuration (how many driver
> domains etc.) and VM configuration. We can make some assumptions and
> derive an approximate limit.
>
> Each page of the event array has space for 1024 events ($E_P$) so a
> regular domU will only require a single page. Since event channels
> typically come in pairs, the upper bound on the total number of pages
> is $2 \times\text{number of VMs}$.
>
> If the guests are further restricted in the number of event channels
> ($E_V$) then this upper bound can be reduced further.
>
> The number of VMs ($V$) with a limit of $P$ total event array pages is:
> \begin{equation*}
> V = P \div \left(1 + \frac{E_V}{E_P}\right)
> \end{equation*}
>
> Using only half the available pages and limiting guests to only 64
> events gives:
> \begin{eqnarray*}
> V & = & (262144 / 2) \div (1 + 64 / 1024) \\
> & = & 123 \times 10^3\text{ VMs}
> \end{eqnarray*}
>
> Alternatively, we can consider a system with $D$ driver domains, each
> of which requires $E_D$ events, and a dom0 using the maximum number of
> pages (128).
>
> \begin{eqnarray*}
> V & = & P - \left(128 + D \times \frac{E_D}{E_P}\right)
> \end{eqnarray*}
>
> With, for example, 16 driver domains each using the maximum number of pages:
> \begin{eqnarray*}
> V & = & (262144/2) - (128 + 16 \times \frac{2^{17}}{1024}) \\
> & = & 129 \times 10^3\text{ VMs}
> \end{eqnarray*}
>
> In summary, there is space to map the event arrays for over 100,000
> VMs. This is more than the limit imposed by the 16-bit domain ID
> (~32,000 VMs).
>
> ### Control Block
>
> With $L$ priority levels and two 32-bit words for the head and tail
> indexes, the amount of space ($S$) required in the `struct vcpu_info`
> for the control block is:
> \begin{eqnarray*}
> S & = & L \times 2 \times 4 \\
> & = & 16 \times 2 \times 4 \\
> & = & 128\text{ bytes}
> \end{eqnarray*}
>
> There is more than enough space within `struct vcpu_info` for the
> control block while still keeping plenty of space for future use.
>
> Low Level Design
> ================
>
> In the pseudocode in this section, all memory accesses are atomic,
> including those to bit-fields within the event word.
>
> These variables have a standard meaning:
>
> Variable  Purpose
> --------  -------
> E         Event array.
> p         A specific event.
> H         The head index for a specific priority.
> T         The tail index for a specific priority.
>
> These memory barriers are required:
>
> Function  Purpose
> --------  -------
> mb()      Full (read/write) memory barrier.
>
> Raising an Event
> ----------------
>
> When Xen raises an event it marks it pending and (if it is not masked)
> adds it to the tail of the event queue.
>
>     E[p].pending = 1
>     if not E[p].linked and not E[p].masked
>         E[p].linked = 1
>         E[p].link = 0
>         mb()
>         if H == 0
>             H = p
>         else
>             E[T].link = p
>         T = p
>
> Concurrent access by Xen to the event queue must be protected by a
> per-event queue spin lock.
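>
> For concreteness, a C rendering of the enqueue above might look like
> the following sketch. The queue structure, `set_bit()`/`test_bit()`
> (standing for atomic bit operations) and `write_link()` (an atomic
> update of the LINK field) are illustrative assumptions.
>
>     struct evtchn_fifo_queue {
>         uint32_t head, tail;   /* H and T for one priority level */
>         spinlock_t lock;       /* per-queue lock, used only by Xen */
>     };
>
>     static void evtchn_fifo_raise(event_word_t *E,
>                                   struct evtchn_fifo_queue *q, uint32_t p)
>     {
>         set_bit(EVTCHN_FIFO_PENDING, &E[p]);
>
>         spin_lock(&q->lock);
>         if ( !test_bit(EVTCHN_FIFO_LINKED, &E[p]) &&
>              !test_bit(EVTCHN_FIFO_MASKED, &E[p]) )
>         {
>             set_bit(EVTCHN_FIFO_LINKED, &E[p]);
>             write_link(&E[p], 0);
>             smp_mb();
>             if ( q->head == 0 )
>                 q->head = p;
>             else
>                 write_link(&E[q->tail], p);
>             q->tail = p;
>         }
>         spin_unlock(&q->lock);
>     }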
>
> Consuming Events
> ----------------
>
> The guest consumes events starting at the head until it reaches the
> tail. Events in the queue that are not pending or are masked are
> consumed but not handled.
>
>     while H != 0
>         p = H
>         H = E[p].link
>         if H == 0
>             mb()
>             H = E[p].link
>         E[p].linked = 0
>         if not E[p].masked
>             handle(p)
>
> handle() clears `E[p].pending` and EOIs level-triggered PIRQs.
>
>> Note: When the event queue contains a single event, removing the
>> event may race with Xen appending another event because the load of
>> `E[p].link` and the store of `H` are not atomic. To avoid this race,
>> the guest must recheck `E[p].link` if the list appeared empty.
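>
> Again as a sketch only, a guest-side consume loop matching the
> pseudocode and the note above could look like this; `read_link()`
> (an atomic read of the LINK field), `clear_bit()`/`test_bit()` and
> the `cb_head` pointer into the control block are assumptions.
>
>     /* handle() clears E[p].pending and EOIs PIRQs, as described below. */
>     void handle(uint32_t port);
>
>     static void evtchn_fifo_consume(event_word_t *E, uint32_t *cb_head)
>     {
>         while ( *cb_head != 0 )
>         {
>             uint32_t p = *cb_head;
>
>             *cb_head = read_link(&E[p]);
>             if ( *cb_head == 0 )
>             {
>                 /* Queue looked empty: re-read LINK after a barrier in
>                  * case Xen appended another event (see note above). */
>                 smp_mb();
>                 *cb_head = read_link(&E[p]);
>             }
>
>             clear_bit(EVTCHN_FIFO_LINKED, &E[p]);
>             if ( !test_bit(EVTCHN_FIFO_MASKED, &E[p]) )
>                 handle(p);
>         }
>     }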
>
> Masking Events
> --------------
>
> Events are masked by setting the masked bit. If the event is pending
> and linked it does not need to be unlinked.
>
>     E[p].masked = 1
>
> Unmasking Events
> ----------------
>
> Events are unmasked by the guest by clearing the masked bit. If the
> event is pending the guest must call the event channel unmask
> hypercall so Xen can link the event into the correct event queue.
>
>     E[p].masked = 0
>     if E[p].pending
>         hypercall(EVTCHN_unmask)
>
> The expectation here is that unmasking a pending event will be rare,
> so the performance hit of the hypercall is minimal.
>
>> Note: After clearing the mask bit, the event may be raised and
>> thus it may already be linked by the time the hypercall is done.
>