* [PATCH v9 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices.
@ 2014-11-03 19:14 Konrad Rzeszutek Wilk
  2014-11-03 19:14 ` [for-xen-4.5 v9 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
  2014-11-03 19:14 ` [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12) Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-03 19:14 UTC (permalink / raw)
  To: andrew.cooper3, JBeulich, xen-devel, keir, ian.jackson,
	ian.campbell, tim


Changelog:
since v8 (http://lists.xen.org/archives/html/xen-devel/2014-10/msg02564.html)
 - Went back-n-forth on remote softirq, fixed styleguide violations.
 - Moved call to pt_pirq_softirq_reset in pt_irq_create_bind
 - Removed 'case default'
 - Streamlined 'pci_clean_dpci_irqs' return usage.

since v7 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg04385.html)
 - Put ref-counting back, added two bits to allow ref-counting from other places.
since v6 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg03208.html)
 - Squashed #1 + #2.
 - Added more comments, redid it based on Jan's feedback.
since v5 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg02868.html)
 - Redid the series based on Jan's feedback
since v4
(http://lists.xen.org/archives/html/xen-devel/2014-09/msg01676.html):
 - Ditch the domain-centric mechanism.
 - Fix issues raised by Jan.


These patches are performance bug-fixes for PCI passthrough on machines
with many sockets. On those machines we have observed awful interrupt
latency and high steal time on idle guests. The root cause
was the tasklet lock, which was shared across all sockets. Every interrupt
that was to be delivered to a guest took the tasklet lock - and with many
guests and many PCI passthrough devices - the performance and latency were
atrocious. These two patches fix the outstanding issues.

 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 282 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  29 +++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 283 insertions(+), 42 deletions(-)

Konrad Rzeszutek Wilk (2):
      dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
      dpci: Replace tasklet with a softirq (v12)


* [for-xen-4.5 v9 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model.
  2014-11-03 19:14 [PATCH v9 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
@ 2014-11-03 19:14 ` Konrad Rzeszutek Wilk
  2014-11-03 19:14 ` [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12) Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-03 19:14 UTC (permalink / raw)
  To: andrew.cooper3, JBeulich, xen-devel, keir, ian.jackson,
	ian.campbell, tim
  Cc: Konrad Rzeszutek Wilk

When an interrupt for a PCI (or PCIe) passthrough device
is to be sent to a guest, we find the appropriate
'hvm_pirq_dpci' structure for the interrupt (PIRQ), set
a bit (masked), and schedule a tasklet.

Then the 'hvm_dirq_assist' tasklet gets called with the 'struct
domain', from which it iterates over the radix-tree of
'hvm_pirq_dpci' entries (from zero up to the number of PIRQs
allocated) that are mapped to the guest and calls
'_hvm_dirq_assist' on each. If the PIRQ has its bit set (masked),
it figures out how to inject the PIRQ into the guest.

This is inefficient and not fair, as:
 - We iterate starting at PIRQ 0 and up every time. That means
   the PCIe devices that have lower PIRQs get called
   first.
 - If we have many PCIe devices passed in with many PIRQs, and
   if most of the time only the highest-numbered PIRQ gets an
   interrupt (as the initial ones are for control), we end up
   iterating over many PIRQs (see the sketch below).
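
To make the ordering problem concrete, here is a rough sketch of the
walk - a simplified illustration modelled on 'pt_pirq_iterate', not
the verbatim Xen code; '__sketch_pirq_assist' is a hypothetical
stand-in for the per-PIRQ injection work:

    static void old_dirq_assist_sketch(struct domain *d)
    {
        unsigned int n, i;
        unsigned long pirq = 0;
        struct pirq *pirqs[8];

        do {
            /* The gang lookup resumes in ascending index order, so
             * PIRQ 0 is always considered before PIRQ 100. */
            n = radix_tree_gang_lookup(&d->pirq_tree, (void **)pirqs,
                                       pirq, ARRAY_SIZE(pirqs));
            for ( i = 0; i < n; i++ )
            {
                pirq = pirqs[i]->pirq;
                /* Only entries with 'masked' set actually inject. */
                __sketch_pirq_assist(d, pirq_dpci(pirqs[i]));
            }
        } while ( ++pirq < d->nr_pirqs && n == ARRAY_SIZE(pirqs) );
    }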

But we could do better - the 'hvm_pirq_dpci' has a field for
'struct domain', which we can use instead of having to pass in
the 'struct domain'.

As such this patch moves the tasklet to the 'struct hvm_pirq_dpci'
and sets the 'dom' field to the domain. We also double-check
that '->dom' is not reset before using it.

We have to be careful with this, as it means we MUST have
'dom' set before pirq_guest_bind() is called. As such
we add 'pirq_dpci->dom = d;' to cover such
cases.
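
In code, the ordering constraint amounts to the following (an
illustrative two-line excerpt of the hunk below, not new logic):

    /* 'dom' must be valid before the IRQ handler can run: as soon as
     * pirq_guest_bind() installs the guest action an interrupt may
     * arrive, schedule the tasklet, and dereference ->dom. */
    pirq_dpci->dom = d;                         /* MUST come first. */
    rc = pirq_guest_bind(d->vcpu[0], info, 0);  /* IRQ possible from here. */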

The mechanism to tear it down is more complex as there
are two ways it can be executed:

 a) pci_clean_dpci_irq. This gets called when the guest is
    being destroyed. We end up calling 'tasklet_kill'.

    The scenarios in which the 'struct pirq' (and subsequently
    the 'hvm_pirq_dpci') gets destroyed are when:

    - guest did not use the pirq at all after setup.
    - guest did use pirq, but decided to mask and left it in that
      state.
    - guest did use pirq, but crashed.

    In all of those scenarios we end up calling 'tasklet_kill'
    which will spin on the tasklet if it is running.

 b) pt_irq_destroy_bind (guest disables the MSI). We double-check
    that the tasklet has run by piggy-backing on the existing
    'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
    We add the extra 'tasklet_kill' call in
    'pt_pirq_cleanup_check'.

    NOTE: Guests that use event channels unbind the
    event channel from PIRQs first, so 'pt_pirq_cleanup_check'
    won't be called as the event is set to zero. In that case
    we clean it up via the a) mechanism. It is OK
    to re-use the tasklet when 'pt_irq_create_bind' is called
    afterwards.

    There is an extra scenario regardless of the event being
    set or not: the guest did 'pt_irq_destroy_bind' while an
    interrupt was triggered and the tasklet was scheduled (but had
    not yet run). It is OK to still run the tasklet as
    hvm_dirq_assist won't do anything (as the flags are
    set to zero). As such we can exit hvm_dirq_assist
    without doing anything.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/io.c  | 75 ++++++++++++++++++++++++++++++-------------
 xen/drivers/passthrough/pci.c |  4 +--
 xen/include/xen/hvm/irq.h     |  2 +-
 3 files changed, 55 insertions(+), 26 deletions(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..dceb17e 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -27,7 +27,7 @@
 #include <xen/hvm/irq.h>
 #include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long _d);
+static void hvm_dirq_assist(unsigned long arg);
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -114,9 +114,6 @@ int pt_irq_create_bind(
             spin_unlock(&d->event_lock);
             return -ENOMEM;
         }
-        softirq_tasklet_init(
-            &hvm_irq_dpci->dirq_tasklet,
-            hvm_dirq_assist, (unsigned long)d);
         for ( i = 0; i < NR_HVM_IRQS; i++ )
             INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
 
@@ -144,6 +141,18 @@ int pt_irq_create_bind(
                                HVM_IRQ_DPCI_GUEST_MSI;
             pirq_dpci->gmsi.gvec = pt_irq_bind->u.msi.gvec;
             pirq_dpci->gmsi.gflags = pt_irq_bind->u.msi.gflags;
+            /*
+             * 'pt_irq_create_bind' can be called after 'pt_irq_destroy_bind'.
+             * The 'pirq_cleanup_check' which would free the structure is only
+             * called if the event channel for the PIRQ is active. However
+             * OS-es that use event channels usually bind PIRQs to event channels
+             * and unbind them before calling 'pt_irq_destroy_bind' - with the
+             * result that we re-use the 'dpci' structure. This can be
+             * reproduced with unloading and loading the driver for a device.
+             *
+             * As such on every 'pt_irq_create_bind' call we MUST set it.
+             */
+            pirq_dpci->dom = d;
             /* bind after hvm_irq_dpci is setup to avoid race with irq handler*/
             rc = pirq_guest_bind(d->vcpu[0], info, 0);
             if ( rc == 0 && pt_irq_bind->u.msi.gtable )
@@ -156,6 +165,7 @@ int pt_irq_create_bind(
             {
                 pirq_dpci->gmsi.gflags = 0;
                 pirq_dpci->gmsi.gvec = 0;
+                pirq_dpci->dom = NULL;
                 pirq_dpci->flags = 0;
                 pirq_cleanup_check(info, d);
                 spin_unlock(&d->event_lock);
@@ -232,6 +242,7 @@ int pt_irq_create_bind(
         {
             unsigned int share;
 
+            /* MUST be set, as the pirq_dpci can be re-used. */
             pirq_dpci->dom = d;
             if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
             {
@@ -415,11 +426,18 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
+    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    return !dpci->flags;
+    if ( !dpci->flags )
+    {
+        tasklet_kill(&dpci->tasklet);
+        dpci->dom = NULL;
+        return 1;
+    }
+    return 0;
 }
 
 int pt_pirq_iterate(struct domain *d,
@@ -459,7 +477,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
         return 0;
 
     pirq_dpci->masked = 1;
-    tasklet_schedule(&dpci->dirq_tasklet);
+    tasklet_schedule(&pirq_dpci->tasklet);
     return 1;
 }
 
@@ -513,9 +531,27 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
-                            void *arg)
+static void hvm_dirq_assist(unsigned long arg)
 {
+    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
+    struct domain *d = pirq_dpci->dom;
+
+    /*
+     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
+     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
+     *
+     * And '->dom' gets cleared later in the destroy path. We exit and clear
+     * 'masked' - which is OK as later in this code we would
+     * do nothing except clear the ->masked field anyhow.
+     */
+    if ( !d )
+    {
+        pirq_dpci->masked = 0;
+        return;
+    }
+    ASSERT(d->arch.hvm_domain.irq.dpci);
+
+    spin_lock(&d->event_lock);
     if ( test_and_clear_bool(pirq_dpci->masked) )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -526,13 +562,17 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
             send_guest_pirq(d, pirq);
 
             if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
-                return 0;
+            {
+                spin_unlock(&d->event_lock);
+                return;
+            }
         }
 
         if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
         {
             vmsi_deliver_pirq(d, pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -545,7 +585,8 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         {
             /* for translated MSI to INTx interrupt, eoi as early as possible */
             __msi_pirq_eoi(pirq_dpci);
-            return 0;
+            spin_unlock(&d->event_lock);
+            return;
         }
 
         /*
@@ -558,18 +599,6 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
         ASSERT(pt_irq_need_timer(pirq_dpci->flags));
         set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
     }
-
-    return 0;
-}
-
-static void hvm_dirq_assist(unsigned long _d)
-{
-    struct domain *d = (struct domain *)_d;
-
-    ASSERT(d->arch.hvm_domain.irq.dpci);
-
-    spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
     spin_unlock(&d->event_lock);
 }
 
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..81e8a3a 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,6 +767,8 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
+    tasklet_kill(&pirq_dpci->tasklet);
+
     return 0;
 }
 
@@ -784,8 +786,6 @@ static void pci_clean_dpci_irqs(struct domain *d)
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
-
         pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
 
         d->arch.hvm_domain.irq.dpci = NULL;
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..94a550a 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
     DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
     /* Record of mapped Links */
     uint8_t link_cnt[NR_LINK];
-    struct tasklet dirq_tasklet;
 };
 
 /* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,7 @@ struct hvm_pirq_dpci {
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
+    struct tasklet tasklet;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
-- 
1.9.3


* [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12)
  2014-11-03 19:14 [PATCH v9 for-xen-4.5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
  2014-11-03 19:14 ` [for-xen-4.5 v9 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
@ 2014-11-03 19:14 ` Konrad Rzeszutek Wilk
  2014-11-04 10:37   ` Jan Beulich
  1 sibling, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-03 19:14 UTC (permalink / raw)
  To: andrew.cooper3, JBeulich, xen-devel, keir, ian.jackson,
	ian.campbell, tim
  Cc: Konrad Rzeszutek Wilk

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.
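
For reference, the contention pattern looks roughly like this - an
illustrative sketch of the shape of common/tasklet.c, not the
verbatim source:

    /* One system-wide lock serializes every schedule and every run. */
    static DEFINE_SPINLOCK(tasklet_lock);
    static DEFINE_PER_CPU(struct list_head, tasklet_list);

    static void tasklet_schedule_sketch(struct tasklet *t)
    {
        unsigned long flags;

        /* Taken on every interrupt that schedules work - on every
         * socket - so this cacheline ping-pongs across the machine. */
        spin_lock_irqsave(&tasklet_lock, flags);
        if ( !t->is_scheduled && !t->is_dead )
        {
            t->is_scheduled = 1;
            list_add_tail(&t->list, &per_cpu(tasklet_list, t->scheduled_on));
        }
        spin_unlock_irqrestore(&tasklet_lock, flags);
        /* ... and do_tasklet_work() takes the same lock to dequeue. */
    }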

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The call chain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for a
'hvm_pirq_dpci' which has already been scheduled.
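
The scheduling side of that (condensed from 'raise_softirq_for' in
the patch below) needs no global lock - only a per-pirq state bit and
a local interrupt-off window:

    unsigned long flags;

    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
        return;                               /* Already queued once. */

    get_knownalive_domain(pirq_dpci->dom);    /* Ref held while queued. */

    local_irq_save(flags);
    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
    local_irq_restore(flags);

    raise_softirq(HVM_DPCI_SOFTIRQ);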

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.
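
Concretely, the rule plays out in the softirq handler like this
(condensed from 'dpci_softirq' in the patch below):

    d = pirq_dpci->dom;
    smp_mb();       /* 'd' MUST be saved before we set/clear the bits. */
    if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
        BUG();
    /* Whoever clears STATE_SCHED owns the reference taken at schedule
     * time, so this put_domain() pairs with get_knownalive_domain(). */
    if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
    {
        hvm_dirq_assist(d, pirq_dpci);
        put_domain(d);
    }
    clear_bit(STATE_RUN, &pirq_dpci->state);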

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(the first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind' the '->dom' is whatever it held last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value into the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex, as there
are three ways it can be executed. To make it simpler,
everything revolves around 'pt_pirq_softirq_active'. If it
returns true, that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed are when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and a softirq was scheduled (but had
   not yet run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
waiting (cpu_relax). We cannot call 'process_pending_softirqs' as it
might result in a dead-lock. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure. In case this softirq
is scheduled on a remote CPU, it will run there, as the
semantics of a softirq are that it executes before that CPU
re-enters the guest.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and a softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we should
let it continue on. We can clear the '->dom' field, as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is then left to case a), b), or again c).

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
v11: IPI the remote CPU (once) if it has the softirq scheduled.
v12: Remove the IPI for the remote CPU as the softirq mechanism does that.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 251 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 254 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..8dcb254 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,120 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never run and that the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * Should only be called from hvm_do_IRQ_dpci. We use the
+ * 'state' as a gate to thwart multiple interrupts being scheduled.
+ * The 'state' is cleared by 'softirq_dpci' when it has
+ * completed executing 'hvm_dirq_assist' or by 'pt_pirq_softirq_reset'
+ * if we want to try to unschedule the softirq before it runs.
+ *
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (state is still set) we return
+ * true. Otherwise we return false.
+ *
+ *  If it is false, it is the caller's responsibility to make sure
+ *  that the softirq (with the event_lock dropped) has run. We need
+ *  to flush out the outstanding 'dpci_softirq' (no more of them
+ *  will be added for this pirq as the IRQ action handler has been
+ *  reset in pt_irq_destroy_bind).
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do it
+ * at the right time, and if it unschedules the softirq before it has
+ * run it also refcounts (which is what the softirq would have done).
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+    case (1 << STATE_SCHED):
+        /*
+         * We are going to try to de-schedule the softirq before it goes in
+         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         */
+        put_domain(d);
+        /* fallthrough. */
+    case (1 << STATE_RUN):
+    case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        /*
+         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom'
+         * in local variable before it sets STATE_RUN - and therefore will not
+         * dereference '->dom' which would crash.
+         */
+        pirq_dpci->dom = NULL;
+        break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +146,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +207,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +234,20 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the softirq_dpci a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill which
+     * would have spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -159,7 +279,16 @@ int pt_irq_create_bind(
             {
                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
                 if ( unlikely(rc) )
+                {
                     pirq_guest_unbind(d, info);
+                    /*
+                     * Between 'pirq_guest_bind' and before 'pirq_guest_unbind'
+                     * an interrupt can be scheduled. No more of them are going
+                     * to be scheduled but we must deal with the one that is in
+                     * the queue.
+                     */
+                    pt_pirq_softirq_reset(pirq_dpci);
+                }
             }
             if ( unlikely(rc) )
             {
@@ -269,6 +398,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +535,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +564,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +612,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +666,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +773,79 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * The one who clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by the virtue of the idle loop
+         * calling the softirq later). In CPU_DEAD case the CPU is deaf and
+         * there are no pending softirqs for us to handle so we can chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..78c6977 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return 0;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return 0;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3


* Re: [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12)
  2014-11-03 19:14 ` [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12) Konrad Rzeszutek Wilk
@ 2014-11-04 10:37   ` Jan Beulich
  2014-11-10 21:02     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2014-11-04 10:37 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 03.11.14 at 20:14, <konrad.wilk@oracle.com> wrote:
> +/*
> + * Should only be called from hvm_do_IRQ_dpci. We use the

This statement together with the comment in pt_pirq_softirq_active()
is at least confusing: If the function is to be called in only one place,
there shouldn't be a second place where its use is being suggested.
Plus, a function with such required limited use would very likely better
not be a separate function at all.

> @@ -159,7 +279,16 @@ int pt_irq_create_bind(
>              {
>                  rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
>                  if ( unlikely(rc) )
> +                {
>                      pirq_guest_unbind(d, info);
> +                    /*
> +                     * Between 'pirq_guest_bind' and before 'pirq_guest_unbind'
> +                     * an interrupt can be scheduled. No more of them are going
> +                     * to be scheduled but we must deal with the one that is in

s/ is / may be /?

> @@ -269,6 +398,10 @@ int pt_irq_create_bind(
>              {
>                  if ( pt_irq_need_timer(pirq_dpci->flags) )
>                      kill_timer(&pirq_dpci->timer);
> +                /*
> +                 * There is no path for __do_IRQ to schedule softirq as
> +                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.

"right away" suggests the alternative handling defers it in any way.
Maybe better "directly"?

Jan


* Re: [for-xen-4.5 v9 2/2] dpci: Replace tasklet with a softirq (v12)
  2014-11-04 10:37   ` Jan Beulich
@ 2014-11-10 21:02     ` Konrad Rzeszutek Wilk
  2014-11-11  8:46       ` Jan Beulich
  0 siblings, 1 reply; 6+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-11-10 21:02 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

On Tue, Nov 04, 2014 at 10:37:29AM +0000, Jan Beulich wrote:
> >>> On 03.11.14 at 20:14, <konrad.wilk@oracle.com> wrote:
> > +/*
> > + * Should only be called from hvm_do_IRQ_dpci. We use the
> 
> This statement together with the comment in pt_pirq_softirq_active()
> is at least confusing: If the function is to be called in only one place,
> there shouldn't be a second place where its use is being suggested.
> Plus, a function with such required limited use would very likely better
> not be a separate function at all.

Could you explain your rationale for it please?

I am of course OK changing it, but when reading this patch, having most of
the functions that deal with the new code in one place makes it easier
to read.
> 
> > @@ -159,7 +279,16 @@ int pt_irq_create_bind(
> >              {
> >                  rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
> >                  if ( unlikely(rc) )
> > +                {
> >                      pirq_guest_unbind(d, info);
> > +                    /*
> > +                     * Between 'pirq_guest_bind' and before 'pirq_guest_unbind'
> > +                     * an interrupt can be scheduled. No more of them are going
> > +                     * to be scheduled but we must deal with the one that is in
> 
> s/ is / may be /?
> 
> > @@ -269,6 +398,10 @@ int pt_irq_create_bind(
> >              {
> >                  if ( pt_irq_need_timer(pirq_dpci->flags) )
> >                      kill_timer(&pirq_dpci->timer);
> > +                /*
> > +                 * There is no path for __do_IRQ to schedule softirq as
> > +                 * IRQ_GUEST is not set. As such we can reset 'dom' right away.
> 
> "right away" suggests the alternative handling defers it in any way.
> Maybe better "directly"?

Done. I've updated the patch and also removed/added some comments. See
attached and inline:

From b4cdb5365d9d254c87b82bc3df794c935c682541 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with a softirq (v13)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is called via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The call chain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to the guest is also hindered.

With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of running only one 'hvm_dirq_assist' per
'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for a
'hvm_pirq_dpci' which has already been scheduled.

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(the first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as it is allocated. On subsequent calls
into 'pt_irq_create_bind' the '->dom' is whatever it held last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value into the MSI field) we MUST set
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex, as there
are three ways it can be executed. To make it simpler,
everything revolves around 'pt_pirq_softirq_active'. If it
returns true, that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed are when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and a softirq was scheduled (but had
   not yet run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_active'
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
waiting (cpu_relax). We cannot call 'process_pending_softirqs' as it
might result in a dead-lock. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure. In case this softirq
is scheduled on a remote CPU, it will run there, as the
semantics of a softirq are that it executes before that CPU
re-enters the guest.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and a softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we should
let it continue on. We can clear the '->dom' field, as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is then left to case a), b), or again c).

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up more, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
v11: IPI the remote CPU (once) if it has the softirq scheduled.
v12: Remove the IPI for the remote CPU as the softirq mechanism does that.
v13: Clarify comments.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 251 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 254 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..b61dbd6 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,116 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never run and that the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * This can be called multiple times, but the softirq is only raised once.
+ * That is until the STATE_SCHED state has been cleared. The state can be
+ * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
+ * or by 'pt_pirq_softirq_reset' (which will try to clear the state before
+ * the softirq had a chance to run).
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with softirq_dpci (STATE_SCHED) we return
+ * true. Otherwise we return false.
+ *
+ * If it is false, it is the caller's responsibility to make sure
+ * that the softirq (with the event_lock dropped) has run.
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we would call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active' we MUST reset the list (otherwise it
+     * might have stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do it
+ * at the right time. If it unschedules the 'hvm_dirq_assist' from running
+ * it also refcounts (which is what the softirq would have done) properly.
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+    case (1 << STATE_SCHED):
+        /*
+         * We are going to try to de-schedule the softirq before it goes in
+         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         */
+        put_domain(d);
+        /* fallthrough. */
+    case (1 << STATE_RUN):
+    case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        /*
+         * The reason it is OK to reset 'dom' when STATE_RUN bit is set is due
+         * to a shortcut the 'dpci_softirq' implements. It stashes the 'dom'
+         * in local variable before it sets STATE_RUN - and therefore will not
+         * dereference '->dom' which would crash.
+         */
+        pirq_dpci->dom = NULL;
+        break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +142,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +203,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +230,20 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the softirq_dpci a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill which
+     * would have spun forever and would do the same thing (wait to flush out
+     * outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -159,7 +275,16 @@ int pt_irq_create_bind(
             {
                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
                 if ( unlikely(rc) )
+                {
                     pirq_guest_unbind(d, info);
+                    /*
+                     * Between 'pirq_guest_bind' and 'pirq_guest_unbind' an
+                     * interrupt can be scheduled. No more of them are going
+                     * to be scheduled, but we must deal with the one that
+                     * may be in the queue.
+                     */
+                    pt_pirq_softirq_reset(pirq_dpci);
+                }
             }
             if ( unlikely(rc) )
             {
@@ -269,6 +394,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule the softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' directly.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +531,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +560,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +608,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +662,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +769,83 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+/*
+ * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
+ * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting.
+ */
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * Whoever clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is dead
+         * and there are no pending softirqs for us to handle, so we can
+         * chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..78c6977 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return 0;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return 0;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3

> 
> Jan
> 

[-- Attachment #2: 0001-dpci-Replace-tasklet-with-an-softirq-v13.patch --]
[-- Type: text/plain, Size: 24277 bytes --]

From b4cdb5365d9d254c87b82bc3df794c935c682541 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 23 Oct 2014 20:41:24 -0400
Subject: [PATCH] dpci: Replace tasklet with an softirq (v13)

The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is invoked via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.

The problem we are facing is that there are quite a lot of
tasklets scheduled. The most common one that is invoked is
the one injecting the VIRQ_TIMER into the guest. Guests
are not insane and don't set the one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).

The problem appears when PCI passthrough devices are used
over many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).

The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:

_asm_vmexit_handler
 -> vmx_vmexit_handler
    ->vmx_do_extint
        -> do_IRQ
            -> __do_IRQ_guest
                -> hvm_do_IRQ_dpci
                   tasklet_schedule(&dpci->dirq_tasklet);
                   [takes lock to put the tasklet on]

[later on the schedule_tail is invoked which is 'vmx_do_resume']

vmx_do_resume
 -> vmx_asm_do_vmentry
        -> call vmx_intr_assist
          -> vmx_process_softirqs
            -> do_softirq
              [executes the tasklet function, takes the
               lock again]

Meanwhile, other CPUs might be sitting in the idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED_state); then to execute it - which is
accounted for in the guest's RUNTIME_state.

The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) being tied up
taking the lock. The latency of the PCI interrupts
delivered to the guests suffers as well.

With this patch the problem disappears completely.
That is, we remove the lock from the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.

The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ, which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure the
'hvm_dirq_assist' needs to run on. That is simply solved by
threading the 'struct hvm_pirq_dpci' through a linked list.
The rule of only running one 'hvm_dirq_assist' for only
one 'hvm_pirq_dpci' is also preserved by having
'raise_softirq_for' ignore any subsequent calls for an
'hvm_pirq_dpci' which has already been scheduled.
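
As an informal aside (not part of the patch), the scheduled-once rule
boils down to an atomic test-and-set on STATE_SCHED. A user-space
analogue - with 'schedule_once' being an illustrative name, and C11
atomics standing in for Xen's test_and_set_bit() - could look like:

#include <stdatomic.h>
#include <stdio.h>

#define STATE_SCHED 0

static atomic_uint state;

/* Returns 1 only for the caller that actually gets to queue the work. */
static int schedule_once(void)
{
    unsigned int old = atomic_fetch_or(&state, 1u << STATE_SCHED);

    return !(old & (1u << STATE_SCHED));
}

int main(void)
{
    printf("%d\n", schedule_once()); /* 1 - we queued; raise the softirq. */
    printf("%d\n", schedule_once()); /* 0 - already pending; do nothing. */
    return 0;
}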

== Code details ==

Most of the code complexity comes from the '->dom' field
in the 'hvm_pirq_dpci' structure. We use it for ref-counting
and as such it MUST be valid as long as the STATE_SCHED bit is
set. Whoever clears the STATE_SCHED bit does the ref-counting
and can also reset the '->dom' field.
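
Again informally (not part of the patch; 'try_deschedule' is an
illustrative name and C11 atomics stand in for Xen's cmpxchg() and
put_domain()), that de-schedule-and-refcount rule is in essence a
compare-and-swap:

#include <stdatomic.h>
#include <stdio.h>

#define STATE_SCHED 0

static atomic_uint state = 1u << STATE_SCHED; /* pretend work is pending */

/*
 * Returns 1 if we removed the pending work before it ran - in which case
 * we inherit the duty to drop the domain reference; returns 0 if the
 * softirq already ran (or is running) and will drop it itself.
 */
static int try_deschedule(void)
{
    unsigned int expected = 1u << STATE_SCHED;

    return atomic_compare_exchange_strong(&state, &expected, 0);
}

int main(void)
{
    printf("%d\n", try_deschedule()); /* 1 - de-scheduled; drop the ref. */
    printf("%d\n", try_deschedule()); /* 0 - nothing pending any more. */
    return 0;
}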

To compound the complexity, there are multiple points where the
'hvm_pirq_dpci' structure is reset or re-used. Initially
(the first time the domain uses the pirq), the 'hvm_pirq_dpci->dom'
field is set to NULL as the structure is allocated. On subsequent
calls into 'pt_irq_create_bind' the '->dom' field holds whatever it
had last time.

As this is the initial call (which QEMU ends up making when the
guest writes a vector value in the MSI field) we MUST set the
'->dom' to the proper structure (otherwise we cannot do
proper ref-counting).

The mechanism to tear it down is more complex, as there
are three paths by which it can happen. To make it simpler
everything revolves around 'pt_pirq_softirq_active'. If it
returns true, that means there is an outstanding softirq
that needs to finish running before we can continue tearing
down. With that in mind:

a) pci_clean_dpci_irq. This gets called when the guest is
   being destroyed. We end up calling 'pt_pirq_softirq_active'
   to see if it is OK to continue the destruction.

   The scenarios in which the 'struct pirq' (and subsequently
   the 'hvm_pirq_dpci') gets destroyed is when:

   - guest did not use the pirq at all after setup.
   - guest did use pirq, but decided to mask and left it in that
     state.
   - guest did use pirq, but crashed.

   In all of those scenarios we end up calling
   'pt_pirq_softirq_active' to check if the softirq is still
   active. Read below on the 'pt_pirq_softirq_active' loop.

b) pt_irq_destroy_bind (guest disables the MSI). We double-check
   that the softirq has run by piggy-backing on the existing
   'pirq_cleanup_check' mechanism which calls 'pt_pirq_cleanup_check'.
   We add the extra call to 'pt_pirq_softirq_active' in
   'pt_pirq_cleanup_check'.

   NOTE: Guests that use event channels unbind first the
   event channel from PIRQs, so the 'pt_pirq_cleanup_check'
   won't be called as 'event' is set to zero. In that case
   we either clean it up via the a) or c) mechanism.

   There is an extra scenario regardless of 'event' being
   set or not: the guest did 'pt_irq_destroy_bind' while an
   interrupt was triggered and the softirq was scheduled (but had
   not been run). It is OK to still run the softirq as
   hvm_dirq_assist won't do anything (as the flags are
   set to zero). However we will try to deschedule the
   softirq if we can (by clearing the STATE_SCHED bit and
   doing the ref-counting ourselves).

c) pt_irq_create_bind (not a typo). The scenarios are:

     - guest disables the MSI and then enables it
       (rmmod and modprobe in a loop). We call 'pt_pirq_softirq_reset',
       which checks to see if the softirq has been scheduled.
       Imagine the 'b)' with interrupts in flight and c) getting
       called in a loop.

We will spin on 'pt_pirq_softirq_active' (at the start of
'pt_irq_create_bind') with the event_lock spinlock dropped,
waiting (cpu_relax). We cannot call 'process_pending_softirqs' as it
might result in a dead-lock. hvm_dirq_assist will be executed
and then the softirq will clear 'state', which signals that we
can re-use the 'hvm_pirq_dpci' structure. In case this softirq
is scheduled on a remote CPU, the softirq will run on it, as the
semantics of a softirq are that it executes as part of servicing
the guest's interruption.

     - we hit one of the error paths in 'pt_irq_create_bind' while
       an interrupt was triggered and the softirq was scheduled.

If the softirq is in STATE_RUN, that means it is executing and we should
let it continue on. We can clear the '->dom' field as the softirq
has stashed it beforehand. If the softirq is in STATE_SCHED and
we are successful in clearing it, we do the ref-counting and
clear the '->dom' field. Otherwise we let the softirq continue
on and the '->dom' field is left intact. The clearing of
'->dom' is then left to case a), b), or again c).

Note that in both cases the 'flags' variable is cleared so
hvm_dirq_assist won't actually do anything.
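
To summarize informally (this recap is not part of the patch), the
state/ref-count lifecycle implemented below is:

  raise_softirq_for:     0 -> SCHED                     (takes domain ref)
  dpci_softirq:          SCHED -> SCHED|RUN -> RUN -> 0 (drops domain ref)
  pt_pirq_softirq_reset: SCHED -> 0                     (drops ref, '->dom' = NULL)
                         RUN or RUN|SCHED: state untouched, only '->dom' = NULL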

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>

---
v2: On top of ref-cnts also have wait loop for the outstanding
    'struct domain' that need to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean it up mode, redo per_cpu, this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
v6: Ditch the 'state' bit, expand description, simplify
    softirq and teardown sequence.
v7: Flesh out the comments. Drop the use of domain refcounts
v8: Add two bits (STATE_[SCHED|RUN]) to allow refcounts.
v9: Use cmpxchg, ASSERT, fix up comments per Jan.
v10: Fix up issues spotted by Jan.
v11: IPI the remote CPU (once) if it has the softirq scheduled.
v12: Remove the IPI for the remote CPU as the softirq mechanism does that.
v13: Clarify comments.
---
 xen/arch/x86/domain.c         |   4 +-
 xen/drivers/passthrough/io.c  | 251 +++++++++++++++++++++++++++++++++++++-----
 xen/drivers/passthrough/pci.c |  31 ++++--
 xen/include/asm-x86/softirq.h |   3 +-
 xen/include/xen/hvm/irq.h     |   5 +-
 xen/include/xen/pci.h         |   2 +-
 6 files changed, 254 insertions(+), 42 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index ae0a344..73d01bb 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1965,7 +1965,9 @@ int domain_relinquish_resources(struct domain *d)
     switch ( d->arch.relmem )
     {
     case RELMEM_not_started:
-        pci_release_devices(d);
+        ret = pci_release_devices(d);
+        if ( ret )
+            return ret;
 
         /* Tear down paging-assistance stuff. */
         ret = paging_teardown(d);
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index dceb17e..b61dbd6 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,116 @@
 
 #include <xen/event.h>
 #include <xen/iommu.h>
+#include <xen/cpu.h>
 #include <xen/irq.h>
 #include <asm/hvm/irq.h>
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
 
-static void hvm_dirq_assist(unsigned long arg);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These two bit states help to safely schedule, deschedule, and wait until
+ * the softirq has finished.
+ *
+ * The semantics behind these two bits is as follow:
+ *  - STATE_SCHED - whoever modifies it has to ref-count the domain (->dom).
+ *  - STATE_RUN - only softirq is allowed to set and clear it. If it has
+ *      been set hvm_dirq_assist will RUN with a saved value of the
+ *      'struct domain' copied from 'pirq_dpci->dom' before STATE_RUN was set.
+ *
+ * The usual states are: STATE_SCHED(set) -> STATE_RUN(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(unset).
+ *
+ * However the states can also diverge such as: STATE_SCHED(set) ->
+ * STATE_SCHED(unset) -> STATE_RUN(set) -> STATE_RUN(unset). That means
+ * the 'hvm_dirq_assist' never ran and that the softirq did not do any
+ * ref-counting.
+ */
+
+enum {
+    STATE_SCHED,
+    STATE_RUN
+};
+
+/*
+ * This can be called multiple times, but the softirq is only raised once.
+ * That is until the STATE_SCHED state has been cleared. The state can be
+ * cleared by: the 'dpci_softirq' (when it has executed 'hvm_dirq_assist'),
+ * or by 'pt_pirq_softirq_reset' (which will try to clear the state before
+ * the softirq has had a chance to run).
+ */
+static void raise_softirq_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+    unsigned long flags;
+
+    if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+        return;
+
+    get_knownalive_domain(pirq_dpci->dom);
+
+    local_irq_save(flags);
+    list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+    local_irq_restore(flags);
+
+    raise_softirq(HVM_DPCI_SOFTIRQ);
+}
+
+/*
+ * If we are racing with dpci_softirq (STATE_SCHED) we return
+ * true. Otherwise we return false.
+ *
+ * If it is false, it is the caller's responsibility to make sure
+ * that the softirq (with the event_lock dropped) has run.
+ */
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *pirq_dpci)
+{
+    if ( pirq_dpci->state & ((1 << STATE_RUN) | (1 << STATE_SCHED)) )
+        return 1;
+
+    /*
+     * If in the future we were to call 'raise_softirq_for' right away
+     * after 'pt_pirq_softirq_active', we MUST reset the list first
+     * (otherwise it might hold stale data).
+     */
+    return 0;
+}
+
+/*
+ * Reset the pirq_dpci->dom parameter to NULL.
+ *
+ * This function checks the different states to make sure it can do it
+ * at the right time. If it unschedules the 'hvm_dirq_assist' from running
+ * it also refcounts (which is what the softirq would have done) properly.
+ */
+static void pt_pirq_softirq_reset(struct hvm_pirq_dpci *pirq_dpci)
+{
+    struct domain *d = pirq_dpci->dom;
+
+    ASSERT(spin_is_locked(&d->event_lock));
+
+    switch ( cmpxchg(&pirq_dpci->state, 1 << STATE_SCHED, 0) )
+    {
+    case (1 << STATE_SCHED):
+        /*
+         * We are going to try to de-schedule the softirq before it goes into
+         * STATE_RUN. Whoever clears STATE_SCHED MUST refcount the 'dom'.
+         */
+        put_domain(d);
+        /* fallthrough. */
+    case (1 << STATE_RUN):
+    case (1 << STATE_RUN) | (1 << STATE_SCHED):
+        /*
+         * The reason it is OK to reset 'dom' when the STATE_RUN bit is set
+         * is due to a shortcut the 'dpci_softirq' implements. It stashes
+         * the 'dom' in a local variable before it sets STATE_RUN - and
+         * therefore will not dereference '->dom', which would crash.
+         */
+        pirq_dpci->dom = NULL;
+        break;
+    }
+}
 
 bool_t pt_irq_need_timer(uint32_t flags)
 {
@@ -40,7 +142,7 @@ static int pt_irq_guest_eoi(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
     if ( __test_and_clear_bit(_HVM_IRQ_DPCI_EOI_LATCH_SHIFT,
                               &pirq_dpci->flags) )
     {
-        pirq_dpci->masked = 0;
+        pirq_dpci->state = 0;
         pirq_dpci->pending = 0;
         pirq_guest_eoi(dpci_pirq(pirq_dpci));
     }
@@ -101,6 +203,7 @@ int pt_irq_create_bind(
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
 
+ restart:
     spin_lock(&d->event_lock);
 
     hvm_irq_dpci = domain_get_irq_dpci(d);
@@ -127,7 +230,20 @@ int pt_irq_create_bind(
         return -ENOMEM;
     }
     pirq_dpci = pirq_dpci(info);
-
+    /*
+     * A crude 'while' loop with us dropping the spinlock and giving
+     * the dpci_softirq a chance to run.
+     * We MUST check for this condition as the softirq could be scheduled
+     * and hasn't run yet. Note that this code replaced tasklet_kill, which
+     * would have spun forever and would do the same thing (wait to flush
+     * out outstanding hvm_dirq_assist calls).
+     */
+    if ( pt_pirq_softirq_active(pirq_dpci) )
+    {
+        spin_unlock(&d->event_lock);
+        cpu_relax();
+        goto restart;
+    }
     switch ( pt_irq_bind->irq_type )
     {
     case PT_IRQ_TYPE_MSI:
@@ -159,7 +275,16 @@ int pt_irq_create_bind(
             {
                 rc = msixtbl_pt_register(d, info, pt_irq_bind->u.msi.gtable);
                 if ( unlikely(rc) )
+                {
                     pirq_guest_unbind(d, info);
+                    /*
+                     * Between 'pirq_guest_bind' and 'pirq_guest_unbind' an
+                     * interrupt can be scheduled. No more of them are going
+                     * to be scheduled, but we must deal with the one that
+                     * may be in the queue.
+                     */
+                    pt_pirq_softirq_reset(pirq_dpci);
+                }
             }
             if ( unlikely(rc) )
             {
@@ -269,6 +394,10 @@ int pt_irq_create_bind(
             {
                 if ( pt_irq_need_timer(pirq_dpci->flags) )
                     kill_timer(&pirq_dpci->timer);
+                /*
+                 * There is no path for __do_IRQ to schedule the softirq as
+                 * IRQ_GUEST is not set. As such we can reset 'dom' directly.
+                 */
                 pirq_dpci->dom = NULL;
                 list_del(&girq->list);
                 list_del(&digl->list);
@@ -402,8 +531,13 @@ int pt_irq_destroy_bind(
         msixtbl_pt_unregister(d, pirq);
         if ( pt_irq_need_timer(pirq_dpci->flags) )
             kill_timer(&pirq_dpci->timer);
-        pirq_dpci->dom   = NULL;
         pirq_dpci->flags = 0;
+        /*
+         * See comment in pt_irq_create_bind's PT_IRQ_TYPE_MSI before the
+         * call to pt_pirq_softirq_reset.
+         */
+        pt_pirq_softirq_reset(pirq_dpci);
+
         pirq_cleanup_check(pirq, d);
     }
 
@@ -426,14 +560,12 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
 {
     INIT_LIST_HEAD(&dpci->digl_list);
     dpci->gmsi.dest_vcpu_id = -1;
-    softirq_tasklet_init(&dpci->tasklet, hvm_dirq_assist, (unsigned long)dpci);
 }
 
 bool_t pt_pirq_cleanup_check(struct hvm_pirq_dpci *dpci)
 {
-    if ( !dpci->flags )
+    if ( !dpci->flags && !pt_pirq_softirq_active(dpci) )
     {
-        tasklet_kill(&dpci->tasklet);
         dpci->dom = NULL;
         return 1;
     }
@@ -476,8 +608,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
          !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
         return 0;
 
-    pirq_dpci->masked = 1;
-    tasklet_schedule(&pirq_dpci->tasklet);
+    raise_softirq_for(pirq_dpci);
     return 1;
 }
 
@@ -531,28 +662,12 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
     spin_unlock(&d->event_lock);
 }
 
-static void hvm_dirq_assist(unsigned long arg)
+static void hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
-    struct hvm_pirq_dpci *pirq_dpci = (struct hvm_pirq_dpci *)arg;
-    struct domain *d = pirq_dpci->dom;
-
-    /*
-     * We can be racing with 'pt_irq_destroy_bind' - with us being scheduled
-     * right before 'pirq_guest_unbind' gets called - but us not yet executed.
-     *
-     * And '->dom' gets cleared later in the destroy path. We exit and clear
-     * 'masked' - which is OK as later in this code we would
-     * do nothing except clear the ->masked field anyhow.
-     */
-    if ( !d )
-    {
-        pirq_dpci->masked = 0;
-        return;
-    }
     ASSERT(d->arch.hvm_domain.irq.dpci);
 
     spin_lock(&d->event_lock);
-    if ( test_and_clear_bool(pirq_dpci->masked) )
+    if ( pirq_dpci->state )
     {
         struct pirq *pirq = dpci_pirq(pirq_dpci);
         const struct dev_intx_gsi_link *digl;
@@ -654,3 +769,83 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
 unlock:
     spin_unlock(&d->event_lock);
 }
+
+/*
+ * Note: 'pt_pirq_softirq_reset' can clear the STATE_SCHED before we get to
+ * doing it. If that is the case we let 'pt_pirq_softirq_reset' do ref-counting.
+ */
+static void dpci_softirq(void)
+{
+    unsigned int cpu = smp_processor_id();
+    LIST_HEAD(our_list);
+
+    local_irq_disable();
+    list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+    local_irq_enable();
+
+    while ( !list_empty(&our_list) )
+    {
+        struct hvm_pirq_dpci *pirq_dpci;
+        struct domain *d;
+
+        pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+        list_del(&pirq_dpci->softirq_list);
+
+        d = pirq_dpci->dom;
+        smp_mb(); /* 'd' MUST be saved before we set/clear the bits. */
+        if ( test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+            BUG();
+        /*
+         * Whoever clears STATE_SCHED MUST refcount the domain.
+         */
+        if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+        {
+            hvm_dirq_assist(d, pirq_dpci);
+            put_domain(d);
+        }
+        clear_bit(STATE_RUN, &pirq_dpci->state);
+    }
+}
+
+static int cpu_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    unsigned int cpu = (unsigned long)hcpu;
+
+    switch ( action )
+    {
+    case CPU_UP_PREPARE:
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+        break;
+    case CPU_UP_CANCELED:
+    case CPU_DEAD:
+        /*
+         * On CPU_DYING this callback is called (on the CPU that is dying)
+         * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+         * clear out any outstanding domains (by virtue of the idle loop
+         * calling the softirq later). In the CPU_DEAD case the CPU is dead
+         * and there are no pending softirqs for us to handle, so we can
+         * chill.
+         */
+        ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+    .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+    unsigned int cpu;
+
+    for_each_online_cpu(cpu)
+        INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+    open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+    register_cpu_notifier(&cpu_nfb);
+    return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 81e8a3a..78c6977 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,51 @@ static int pci_clean_dpci_irq(struct domain *d,
         xfree(digl);
     }
 
-    tasklet_kill(&pirq_dpci->tasklet);
-
-    return 0;
+    return pt_pirq_softirq_active(pirq_dpci) ? -ERESTART : 0;
 }
 
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
 {
     struct hvm_irq_dpci *hvm_irq_dpci = NULL;
 
     if ( !iommu_enabled )
-        return;
+        return 0;
 
     if ( !is_hvm_domain(d) )
-        return;
+        return 0;
 
     spin_lock(&d->event_lock);
     hvm_irq_dpci = domain_get_irq_dpci(d);
     if ( hvm_irq_dpci != NULL )
     {
-        pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+        int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+
+        if ( ret )
+        {
+            spin_unlock(&d->event_lock);
+            return ret;
+        }
 
         d->arch.hvm_domain.irq.dpci = NULL;
         free_hvm_irq_dpci(hvm_irq_dpci);
     }
     spin_unlock(&d->event_lock);
+    return 0;
 }
 
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
 {
     struct pci_dev *pdev;
     u8 bus, devfn;
+    int ret;
 
     spin_lock(&pcidevs_lock);
-    pci_clean_dpci_irqs(d);
+    ret = pci_clean_dpci_irqs(d);
+    if ( ret )
+    {
+        spin_unlock(&pcidevs_lock);
+        return ret;
+    }
     while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
     {
         bus = pdev->bus;
@@ -811,6 +822,8 @@ void pci_release_devices(struct domain *d)
                    PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     spin_unlock(&pcidevs_lock);
+
+    return 0;
 }
 
 #define PCI_CLASS_BRIDGE_HOST    0x0600
diff --git a/xen/include/asm-x86/softirq.h b/xen/include/asm-x86/softirq.h
index 7225dea..ec787d6 100644
--- a/xen/include/asm-x86/softirq.h
+++ b/xen/include/asm-x86/softirq.h
@@ -7,7 +7,8 @@
 
 #define MACHINE_CHECK_SOFTIRQ  (NR_COMMON_SOFTIRQS + 3)
 #define PCI_SERR_SOFTIRQ       (NR_COMMON_SOFTIRQS + 4)
-#define NR_ARCH_SOFTIRQS       5
+#define HVM_DPCI_SOFTIRQ       (NR_COMMON_SOFTIRQS + 5)
+#define NR_ARCH_SOFTIRQS       6
 
 bool_t arch_skip_send_event_check(unsigned int cpu);
 
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index 94a550a..9709397 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -93,13 +93,13 @@ struct hvm_irq_dpci {
 /* Machine IRQ to guest device/intx mapping. */
 struct hvm_pirq_dpci {
     uint32_t flags;
-    bool_t masked;
+    unsigned int state;
     uint16_t pending;
     struct list_head digl_list;
     struct domain *dom;
     struct hvm_gmsi_info gmsi;
     struct timer timer;
-    struct tasklet tasklet;
+    struct list_head softirq_list;
 };
 
 void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -109,6 +109,7 @@ int pt_pirq_iterate(struct domain *d,
                               struct hvm_pirq_dpci *, void *arg),
                     void *arg);
 
+bool_t pt_pirq_softirq_active(struct hvm_pirq_dpci *);
 /* Modify state of a PCI INTx wire. */
 void hvm_pci_intx_assert(
     struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
 
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
-- 
1.9.3


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [for-xen-4.5 v9 2/2] dpci: Replace tasklet with an softirq (v12)
  2014-11-10 21:02     ` Konrad Rzeszutek Wilk
@ 2014-11-11  8:46       ` Jan Beulich
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Beulich @ 2014-11-11  8:46 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: keir, ian.campbell, andrew.cooper3, tim, xen-devel, ian.jackson

>>> On 10.11.14 at 22:02, <konrad.wilk@oracle.com> wrote:
> On Tue, Nov 04, 2014 at 10:37:29AM +0000, Jan Beulich wrote:
>> >>> On 03.11.14 at 20:14, <konrad.wilk@oracle.com> wrote:
>> > +/*
>> > + * Should only be called from hvm_do_IRQ_dpci. We use the
>> 
>> This statement together with the comment in pt_pirq_softirq_active()
>> is at least confusing: If the function is to be called in only one place,
>> there shouldn't be a second place where its use is being suggested.
>> Plus, a function with such required limited use would very likely better
>> not be a separate function at all.
> 
> Could you explain your rationale for it please?

I explained the why in what is still being quoted above.

> I am of course OK changing it, but when reading this patch and having most 
> of the functions that deal with the new code in one place - makes it easier
> to read.

Some pieces of code are necessarily scattered around, so one more
wouldn't be a big deal. But I see you re-worded that comment now,
and with that I don't mind the function remaining where it is.

Hence
Reviewed-by: Jan Beulich <jbeulich@suse.com>

But please re-send properly paired with patch 1/2, to avoid mixing
up versions when committing (it's already confusing that this one
says v9 and v12 in the subject, while the patch history is at v13).

Jan

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread

Thread overview: 6+ messages
2014-11-03 19:14 [PATCH v9 for-xen-4.5] Fix interrupt latency of HVM PCIpassthrough devices Konrad Rzeszutek Wilk
2014-11-03 19:14 ` [for-xen-4.5 v9 1/2] dpci: Move from an hvm_irq_dpci (and struct domain) to an hvm_dirq_dpci model Konrad Rzeszutek Wilk
2014-11-03 19:14 ` [for-xen-4.5 v9 2/2] dpci: Replace tasklet with an softirq (v12) Konrad Rzeszutek Wilk
2014-11-04 10:37   ` Jan Beulich
2014-11-10 21:02     ` Konrad Rzeszutek Wilk
2014-11-11  8:46       ` Jan Beulich
