* [PATCH v5] Fix interrupt latency of HVM PCI passthrough devices.
@ 2014-09-19 18:51 Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5) Konrad Rzeszutek Wilk
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-19 18:51 UTC (permalink / raw)
To: xen-devel, ian.campbell, ian.jackson, jbeulich, keir, tim
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: [PATCH v5] Fix interrupt latency of HVM PCI passthrough devices.
Hey Jan, and the other maintainers whom scripts/get_maintainer.pl spat out,
Changelog since v4 (http://lists.xen.org/archives/html/xen-devel/2014-09/msg01676.html):
- Ditch the domain-centric mechanism.
- Fix issues raised by Jan.
The big change is ditching the domain-centric view and having it
all revolve around the 'struct hvm_pirq_dpci' structure. That all
works - except that we must be very careful with the 'dom' field.
I haven't been able to trigger an assert, but there is one edge
case (see the second patch) which required a bounded loop in case
of a race. I would welcome feedback on the preferred way of
resolving it.
If this mechanism seems a bit cumbersome I can switch back to
the v4 design - which was based on the struct domain having the
state and list.
Please review!
xen/arch/x86/domain.c | 4 +-
xen/arch/x86/irq.c | 6 ++
xen/drivers/passthrough/io.c | 230 +++++++++++++++++++++++++++++++++++++-----
xen/drivers/passthrough/pci.c | 27 +++--
xen/include/xen/hvm/irq.h | 5 +-
xen/include/xen/pci.h | 2 +-
xen/include/xen/softirq.h | 3 +
7 files changed, 240 insertions(+), 37 deletions(-)
Konrad Rzeszutek Wilk (3):
dpci: Replace tasklet with a softirq (v5)
dpci: Safeguard against race with hvm_dirq_assist crashing and pt_irq_[create|destroy]_bind
dpci: In hvm_dirq_assist stop using pt_pirq_iterate
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5)
2014-09-19 18:51 [PATCH v5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
@ 2014-09-19 18:51 ` Konrad Rzeszutek Wilk
2014-09-22 11:58 ` Jan Beulich
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 2/3] dpci: Safeguard against race with hvm_dirq_assist crashing and pt_irq_[create|destroy]_bind Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 3/3] dpci: In hvm_dirq_assist stop using pt_pirq_iterate Konrad Rzeszutek Wilk
2 siblings, 1 reply; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-19 18:51 UTC (permalink / raw)
To: xen-devel, ian.campbell, ian.jackson, jbeulich, keir, tim
Cc: Konrad Rzeszutek Wilk
The existing tasklet mechanism has a single global
spinlock that is taken every time the global list
is touched. And we use this lock quite a lot - when
we call do_tasklet_work, which is invoked via a softirq
and from the idle loop. We take the lock on any
operation on the tasklet_list.
The problem we are facing is that there are quite a lot of
tasklets scheduled. The most commonly invoked one is
the tasklet injecting VIRQ_TIMER into the guest. Guests
are not insane and don't set their one-shot or periodic
clocks to sub-1ms intervals (which would cause said tasklet
to be scheduled at such small intervals).
The problem appears when PCI passthrough devices are used
across many sockets and we have a mix of heavy-interrupt
guests and idle guests. The idle guests end up seeing
1/10 of their RUNNING timeslice eaten by the hypervisor
(and 40% steal time).
The mechanism by which we inject PCI interrupts is
hvm_do_IRQ_dpci, which schedules the hvm_dirq_assist
tasklet every time an interrupt is received.
The callchain is:
_asm_vmexit_handler
  -> vmx_vmexit_handler
    -> vmx_do_extint
      -> do_IRQ
        -> __do_IRQ_guest
          -> hvm_do_IRQ_dpci
             tasklet_schedule(&dpci->dirq_tasklet);
             [takes lock to put the tasklet on the list]
[later on the schedule_tail is invoked, which is 'vmx_do_resume']
vmx_do_resume
  -> vmx_asm_do_vmentry
    -> call vmx_intr_assist
      -> vmx_process_softirqs
        -> do_softirq
           [executes the tasklet function, takes the lock again]
Meanwhile, other CPUs might be sitting in an idle loop
and be invoked to deliver a VIRQ_TIMER, which also ends
up taking the lock twice: first to schedule the
v->arch.hvm_vcpu.assert_evtchn_irq_tasklet (accounted to
the guest's BLOCKED state); then to execute it - which is
accounted to the guest's RUNNING state.
The end result is that on an 8-socket machine with
PCI passthrough, where four sockets are busy with interrupts
and the other sockets have idle guests, we end up with
the idle guests having around 40% steal time and 1/10
of their timeslice (3ms out of 30ms) tied up
taking the lock. The latency of the PCI interrupts delivered
to guests is also hindered.
With this patch the problem disappears completely.
That is, we remove the lock for the PCI passthrough use-case
(the 'hvm_dirq_assist' case) by not using tasklets at all.
The patch is simple - instead of scheduling a tasklet
we schedule our own softirq - HVM_DPCI_SOFTIRQ - which will
take care of running 'hvm_dirq_assist'. The information we need
on each CPU is which 'struct hvm_pirq_dpci' structure 'hvm_dirq_assist'
needs to run on. That is simply solved by threading the
'struct hvm_pirq_dpci' through a linked list. The rule of running
'hvm_dirq_assist' only once per 'hvm_pirq_dpci' is also preserved
by having 'schedule_dpci_for' ignore any subsequent calls for
an 'hvm_pirq_dpci' which has already been scheduled.
Since we are using 'hvm_pirq_dpci', it is set up in the
'pt_irq_create_bind' function and torn down in 'pt_irq_destroy_bind'.
If the OS (or QEMU) forgot, we still clean it up in
pci_clean_dpci_irq. We can ignore the return value of 'dpci_kill'
in the error paths of 'pt_irq_create_bind' as 'flags' is reset -
which means that 'pt_pirq_iterate' will ignore the structure and
never call 'hvm_dirq_assist' on this particular 'hvm_pirq_dpci'
structure.
We also add 'dpci_kill' in the cleanup of 'pirq_cleanup_check'.
If 'dpci_kill' fails we need to inhibit the deletion
of the structure from the 'pirq_tree' - otherwise the
catch-it-all 'pci_clean_dpci_irq' won't be able to properly
clean up using 'dpci_kill'.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Suggested-by: Jan Beulich <JBeulich@suse.com>
---
v2: On top of ref-counts, also have a wait loop for the outstanding
'struct domain' that needs to be processed.
v3: Add -ERETRY, fix up StyleGuide issues
v4: Clean up the code more, redo the per_cpu/this_cpu logic
v5: Instead of threading struct domain, use hvm_pirq_dpci.
---
xen/arch/x86/domain.c | 4 +-
xen/arch/x86/irq.c | 6 ++
xen/drivers/passthrough/io.c | 175 +++++++++++++++++++++++++++++++++++++++---
xen/drivers/passthrough/pci.c | 27 +++++--
xen/include/xen/hvm/irq.h | 5 +-
xen/include/xen/pci.h | 2 +-
xen/include/xen/softirq.h | 3 +
7 files changed, 200 insertions(+), 22 deletions(-)
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 7b1dfe6..12bf241 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1924,7 +1924,9 @@ int domain_relinquish_resources(struct domain *d)
switch ( d->arch.relmem )
{
case RELMEM_not_started:
- pci_release_devices(d);
+ ret = pci_release_devices(d);
+ if ( ret )
+ return ret;
/* Tear down paging-assistance stuff. */
paging_teardown(d);
diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c
index f214072..a4c7053 100644
--- a/xen/arch/x86/irq.c
+++ b/xen/arch/x86/irq.c
@@ -1337,6 +1337,12 @@ void (pirq_cleanup_check)(struct pirq *pirq, struct domain *d)
return;
if ( !pt_pirq_cleanup_check(&pirq->arch.hvm.dpci) )
return;
+ /*
+ * Inhibit deletion from 'pirq_tree' so that 'pci_clean_dpci_irq'
+ * can still properly call 'dpci_kill'.
+ */
+ if ( dpci_kill(&pirq->arch.hvm.dpci) )
+ return;
}
if ( radix_tree_delete(&d->pirq_tree, pirq->pirq) != pirq )
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 4cd32b5..5f82aba 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -20,14 +20,79 @@
#include <xen/event.h>
#include <xen/iommu.h>
+#include <xen/cpu.h>
#include <xen/irq.h>
#include <asm/hvm/irq.h>
#include <asm/hvm/iommu.h>
#include <asm/hvm/support.h>
#include <xen/hvm/irq.h>
-#include <xen/tasklet.h>
-static void hvm_dirq_assist(unsigned long _d);
+static DEFINE_PER_CPU(struct list_head, dpci_list);
+
+/*
+ * These fancy bit manipulations (bit 0 and bit 1) along with using a lock
+ * operation allow us to have four stages in dpci life-time.
+ * a) 0x0: Completely empty (not scheduled nor running).
+ * b) 0x1: Scheduled but not running. Used to guard in 'schedule_dpci_for'
+ * such that we will only schedule one. If it is scheduled and had never
+ * run (hence never clearing STATE_SCHED bit), dpci_kill will spin
+ * forever on said hvm_pirq_dpci. However 'schedule_dpci_for' raises the
+ * softirq so it will run, albeit there might be a race (dpci_kill
+ * spinning until the softirq handler runs).
+ * c) 0x2: it is running (only on one CPU) and can be scheduled on any
+ * CPU. The bit 0 - scheduled is cleared at this stage allowing
+ * 'schedule_dpci_for' to successfully schedule.
+ * d) 0x3: scheduled and running - only possible if the running hvm_dirq_assist
+ * calls 'schedule_dpci_for'. It does not do that, but it could in the
+ * future.
+ *
+ * The two bits play a vital role in assuring that 'hvm_dirq_assist' is
+ * scheduled once and runs only once for a hvm_pirq_dpci. The steps are:
+ *
+ * 1) schedule_dpci_for: STATE_SCHED bit set (0x1), added on the per cpu list.
+ * 2) dpci_softirq picks one hvm_pirq_dpci from the list. Schedules
+ * itself later if there are more hvm_pirq_dpci's on it. Tries to set STATE_RUN bit
+ * (0x3) - if it fails adds hvm_pirq_dpci back to the per-cpu list. If it succeeds
+ * clears the STATE_SCHED bit (0x2). Once hvm_dirq_assist for an hvm_pirq_dpci
+ * completes, it unsets STATE_RUN (0x0 or 0x1 if 'schedule_dpci_for' is called
+ * from 'hvm_dirq_assist').
+ */
+enum {
+ STATE_SCHED, /* Bit 0 */
+ STATE_RUN
+};
+
+static void schedule_dpci_for(struct hvm_pirq_dpci *pirq_dpci)
+{
+ if ( !test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+ {
+ unsigned long flags;
+
+ get_knownalive_domain(pirq_dpci->dom); /* To be put by 'dpci_softirq' */
+
+ local_irq_save(flags);
+ list_add_tail(&pirq_dpci->softirq_list, &this_cpu(dpci_list));
+ local_irq_restore(flags);
+
+ raise_softirq(HVM_DPCI_SOFTIRQ);
+ }
+}
+
+int dpci_kill(struct hvm_pirq_dpci *pirq_dpci)
+{
+ if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
+ return -EAGAIN;
+
+ if ( test_bit(STATE_RUN, &pirq_dpci->state) )
+ return -EAGAIN;
+
+ clear_bit(STATE_SCHED, &pirq_dpci->state);
+
+ /* In case we turn around and call 'schedule_dpci_for' right away. */
+ INIT_LIST_HEAD(&pirq_dpci->softirq_list);
+ return 0;
+}
+
bool_t pt_irq_need_timer(uint32_t flags)
{
@@ -114,9 +179,7 @@ int pt_irq_create_bind(
spin_unlock(&d->event_lock);
return -ENOMEM;
}
- softirq_tasklet_init(
- &hvm_irq_dpci->dirq_tasklet,
- hvm_dirq_assist, (unsigned long)d);
+
for ( i = 0; i < NR_HVM_IRQS; i++ )
INIT_LIST_HEAD(&hvm_irq_dpci->girq[i]);
@@ -131,6 +194,16 @@ int pt_irq_create_bind(
}
pirq_dpci = pirq_dpci(info);
+ /* Something is horribly wrong if it has been scheduled or is running. */
+ ASSERT(pirq_dpci->state == 0);
+
+ /* Guest unbinds (pt_irq_destroy_bind) and rebinds it back. */
+ if ( !pirq_dpci->dom )
+ pirq_dpci->dom = d;
+
+ /* Otherwise 'schedule_dpci_for' will crash. */
+ ASSERT(pirq_dpci->dom == d);
+
switch ( pt_irq_bind->irq_type )
{
case PT_IRQ_TYPE_MSI:
@@ -156,7 +229,9 @@ int pt_irq_create_bind(
{
pirq_dpci->gmsi.gflags = 0;
pirq_dpci->gmsi.gvec = 0;
+ pirq_dpci->dom = NULL;
pirq_dpci->flags = 0;
+ dpci_kill(pirq_dpci);
pirq_cleanup_check(info, d);
spin_unlock(&d->event_lock);
return rc;
@@ -232,7 +307,6 @@ int pt_irq_create_bind(
{
unsigned int share;
- pirq_dpci->dom = d;
if ( pt_irq_bind->irq_type == PT_IRQ_TYPE_MSI_TRANSLATE )
{
pirq_dpci->flags = HVM_IRQ_DPCI_MAPPED |
@@ -263,6 +337,7 @@ int pt_irq_create_bind(
list_del(&digl->list);
hvm_irq_dpci->link_cnt[link]--;
pirq_dpci->flags = 0;
+ dpci_kill(pirq_dpci);
pirq_cleanup_check(info, d);
spin_unlock(&d->event_lock);
xfree(girq);
@@ -393,6 +468,7 @@ int pt_irq_destroy_bind(
kill_timer(&pirq_dpci->timer);
pirq_dpci->dom = NULL;
pirq_dpci->flags = 0;
+ dpci_kill(pirq_dpci);
pirq_cleanup_check(pirq, d);
}
@@ -414,6 +490,8 @@ int pt_irq_destroy_bind(
void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
{
INIT_LIST_HEAD(&dpci->digl_list);
+ INIT_LIST_HEAD(&dpci->softirq_list);
+ dpci->dom = d;
dpci->gmsi.dest_vcpu_id = -1;
}
@@ -459,7 +537,7 @@ int hvm_do_IRQ_dpci(struct domain *d, struct pirq *pirq)
return 0;
pirq_dpci->masked = 1;
- tasklet_schedule(&dpci->dirq_tasklet);
+ schedule_dpci_for(pirq_dpci);
return 1;
}
@@ -562,11 +640,11 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
return 0;
}
-static void hvm_dirq_assist(unsigned long _d)
+static void hvm_dirq_assist(struct hvm_pirq_dpci *pirq_dpci)
{
- struct domain *d = (struct domain *)_d;
+ struct domain *d = pirq_dpci->dom;
- ASSERT(d->arch.hvm_domain.irq.dpci);
+ ASSERT(d && d->arch.hvm_domain.irq.dpci);
spin_lock(&d->event_lock);
pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
@@ -625,3 +703,80 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
unlock:
spin_unlock(&d->event_lock);
}
+
+static void dpci_softirq(void)
+{
+ struct hvm_pirq_dpci *pirq_dpci;
+ unsigned int cpu = smp_processor_id();
+ LIST_HEAD(our_list);
+
+ local_irq_disable();
+ list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
+ local_irq_enable();
+
+ while ( !list_empty(&our_list) )
+ {
+ pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
+ list_del(&pirq_dpci->softirq_list);
+
+ if ( !test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
+ {
+ if ( !test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
+ BUG();
+ hvm_dirq_assist(pirq_dpci);
+ put_domain(pirq_dpci->dom);
+ clear_bit(STATE_RUN, &pirq_dpci->state);
+ continue;
+ }
+
+ local_irq_disable();
+ list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
+ local_irq_enable();
+
+ raise_softirq(HVM_DPCI_SOFTIRQ);
+ }
+}
+static int cpu_callback(
+ struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (unsigned long)hcpu;
+
+ switch ( action )
+ {
+ case CPU_UP_PREPARE:
+ INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DEAD:
+ /*
+ * On CPU_DYING this callback is called (on the CPU that is dying)
+ * with a possible HVM_DPCI_SOFTIRQ pending - at which point we can
+ * clear out any outstanding hvm_pirq_dpci's (by virtue of the idle loop
+ * calling the softirq later). In the CPU_DEAD case the CPU is deaf and
+ * there are no pending softirqs for us to handle so we can chill.
+ */
+ ASSERT(list_empty(&per_cpu(dpci_list, cpu)));
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_nfb = {
+ .notifier_call = cpu_callback,
+};
+
+static int __init setup_dpci_softirq(void)
+{
+ int cpu;
+
+ for_each_online_cpu(cpu)
+ INIT_LIST_HEAD(&per_cpu(dpci_list, cpu));
+
+ open_softirq(HVM_DPCI_SOFTIRQ, dpci_softirq);
+ register_cpu_notifier(&cpu_nfb);
+ return 0;
+}
+__initcall(setup_dpci_softirq);
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 1eba833..9be3405 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -767,40 +767,49 @@ static int pci_clean_dpci_irq(struct domain *d,
xfree(digl);
}
- return 0;
+ return dpci_kill(pirq_dpci);
}
-static void pci_clean_dpci_irqs(struct domain *d)
+static int pci_clean_dpci_irqs(struct domain *d)
{
struct hvm_irq_dpci *hvm_irq_dpci = NULL;
if ( !iommu_enabled )
- return;
+ return -ESRCH;
if ( !is_hvm_domain(d) )
- return;
+ return -EINVAL;
spin_lock(&d->event_lock);
hvm_irq_dpci = domain_get_irq_dpci(d);
if ( hvm_irq_dpci != NULL )
{
- tasklet_kill(&hvm_irq_dpci->dirq_tasklet);
+ int ret = pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
- pt_pirq_iterate(d, pci_clean_dpci_irq, NULL);
+ if ( ret )
+ {
+ spin_unlock(&d->event_lock);
+ return ret;
+ }
d->arch.hvm_domain.irq.dpci = NULL;
free_hvm_irq_dpci(hvm_irq_dpci);
}
spin_unlock(&d->event_lock);
+ return 0;
}
-void pci_release_devices(struct domain *d)
+int pci_release_devices(struct domain *d)
{
struct pci_dev *pdev;
u8 bus, devfn;
+ int ret;
spin_lock(&pcidevs_lock);
- pci_clean_dpci_irqs(d);
+ ret = pci_clean_dpci_irqs(d);
+ if ( ret == -EAGAIN )
+ return ret;
+
while ( (pdev = pci_get_pdev_by_domain(d, -1, -1, -1)) )
{
bus = pdev->bus;
@@ -811,6 +820,8 @@ void pci_release_devices(struct domain *d)
PCI_SLOT(devfn), PCI_FUNC(devfn));
}
spin_unlock(&pcidevs_lock);
+
+ return 0;
}
#define PCI_CLASS_BRIDGE_HOST 0x0600
diff --git a/xen/include/xen/hvm/irq.h b/xen/include/xen/hvm/irq.h
index c89f4b1..de8ea57 100644
--- a/xen/include/xen/hvm/irq.h
+++ b/xen/include/xen/hvm/irq.h
@@ -88,7 +88,6 @@ struct hvm_irq_dpci {
DECLARE_BITMAP(isairq_map, NR_ISAIRQS);
/* Record of mapped Links */
uint8_t link_cnt[NR_LINK];
- struct tasklet dirq_tasklet;
};
/* Machine IRQ to guest device/intx mapping. */
@@ -100,6 +99,8 @@ struct hvm_pirq_dpci {
struct domain *dom;
struct hvm_gmsi_info gmsi;
struct timer timer;
+ struct list_head softirq_list;
+ unsigned long state;
};
void pt_pirq_init(struct domain *, struct hvm_pirq_dpci *);
@@ -108,7 +109,7 @@ int pt_pirq_iterate(struct domain *d,
int (*cb)(struct domain *,
struct hvm_pirq_dpci *, void *arg),
void *arg);
-
+int dpci_kill(struct hvm_pirq_dpci *);
/* Modify state of a PCI INTx wire. */
void hvm_pci_intx_assert(
struct domain *d, unsigned int device, unsigned int intx);
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 91520bc..5f295f3 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -99,7 +99,7 @@ struct pci_dev *pci_lock_domain_pdev(
void setup_hwdom_pci_devices(struct domain *,
int (*)(u8 devfn, struct pci_dev *));
-void pci_release_devices(struct domain *d);
+int pci_release_devices(struct domain *d);
int pci_add_segment(u16 seg);
const unsigned long *pci_get_ro_map(u16 seg);
int pci_add_device(u16 seg, u8 bus, u8 devfn, const struct pci_dev_info *);
diff --git a/xen/include/xen/softirq.h b/xen/include/xen/softirq.h
index 0895a16..f3da5df 100644
--- a/xen/include/xen/softirq.h
+++ b/xen/include/xen/softirq.h
@@ -8,6 +8,9 @@ enum {
NEW_TLBFLUSH_CLOCK_PERIOD_SOFTIRQ,
RCU_SOFTIRQ,
TASKLET_SOFTIRQ,
+#ifdef HAS_PASSTHROUGH
+ HVM_DPCI_SOFTIRQ,
+#endif
NR_COMMON_SOFTIRQS
};
--
1.9.3
* [PATCH v5 for-xen-4.5 2/3] dpci: Safeguard against race with hvm_dirq_assist crashing and pt_irq_[create|destroy]_bind
2014-09-19 18:51 [PATCH v5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5) Konrad Rzeszutek Wilk
@ 2014-09-19 18:51 ` Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 3/3] dpci: In hvm_dirq_assist stop using pt_pirq_iterate Konrad Rzeszutek Wilk
2 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-19 18:51 UTC (permalink / raw)
To: xen-devel, ian.campbell, ian.jackson, jbeulich, keir, tim
Cc: Konrad Rzeszutek Wilk
There is a potential race where pt_irq_destroy_bind is called after
the interrupt for the PIRQ has been taken - and hvm_dirq_assist has
been scheduled for execution.
The timing window is short, but the issue we can encounter is that
hvm_dirq_assist could take a page fault while trying to access the
'struct domain' which was extracted from the 'struct hvm_pirq_dpci->dom'
field - while pt_irq_destroy_bind resets 'struct hvm_pirq_dpci->dom'
to NULL. Imagine this example:
CPU0                            CPU1

do_IRQ                          pt_irq_destroy_bind
 -> __do_IRQ_guest
  -> schedule_dpci_for
...                              pirq_guest_unbind
                                 [action unbound]
                                 pirq_dpci->dom = NULL;
...
dpci_softirq
 hvm_dirq_assist
  pirq_dpci->dom dereference
  [BOOM!]
The way to fix this is by:
a) Not resetting the ->dom field to NULL at all. Only allow
'dpci_kill' to reset the field once it has established that
hvm_dirq_assist is done.
b) At the start of 'pt_irq_create_bind', checking whether the
softirq is still running (by using dpci_kill) - and if it is,
spinning - but only for less than a second. If more than a
second has passed, return -EAGAIN and let QEMU handle it
(it reports to the OS that the MSI was not set).
If it was less than a second, restart the creation/re-use process.
b) could be done in 'pt_irq_destroy_bind' as well if that
would be better. Either option works.
The return of -EAGAIN could be removed, as QEMU is not smart
enough to handle the error.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
xen/drivers/passthrough/io.c | 54 ++++++++++++++++++++++++++++++++++----------
1 file changed, 42 insertions(+), 12 deletions(-)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 5f82aba..13433a5 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -163,6 +163,7 @@ int pt_irq_create_bind(
struct pirq *info;
int rc, pirq = pt_irq_bind->machine_irq;
+ restart:
if ( pirq < 0 || pirq >= d->nr_pirqs )
return -EINVAL;
@@ -194,15 +195,34 @@ int pt_irq_create_bind(
}
pirq_dpci = pirq_dpci(info);
- /* Something is horribly wrong if it has been scheduled or is running. */
- ASSERT(pirq_dpci->state == 0);
+ /*
+ * We can hit this scenario if the dpci_softirq is running while we are
+ * doing pt_irq_destroy_bind followed by this call on another CPU.
+ * We have to wait until the dpci_softirq is done.
+ */
+ if ( pirq_dpci->state )
+ {
+ s_time_t start;
- /* Guest unbinds (pt_irq_destroy_bind) and rebinds it back. */
- if ( !pirq_dpci->dom )
- pirq_dpci->dom = d;
+ /* Need to drop the lock so hvm_dirq_assist can progress. */
+ spin_unlock(&d->event_lock);
+ start = NOW();
+ while ( dpci_kill(pirq_dpci) )
+ {
+ process_pending_softirqs();
+ if ( (NOW() - start) >> 30 )
+ {
+ return -EAGAIN;
+ }
+ }
+ /* And since we dropped the lock, another pt_irq_[destroy|create]_bind
+ * could be called. Let's start from scratch.
+ */
+ goto restart;
+ }
/* Otherwise 'schedule_dpci_for' will crash. */
- ASSERT(pirq_dpci->dom == d);
+ pirq_dpci->dom = d;
switch ( pt_irq_bind->irq_type )
{
@@ -229,9 +249,15 @@ int pt_irq_create_bind(
{
pirq_dpci->gmsi.gflags = 0;
pirq_dpci->gmsi.gvec = 0;
- pirq_dpci->dom = NULL;
pirq_dpci->flags = 0;
- dpci_kill(pirq_dpci);
+ /*
+ * If we fail, we have flags reset and hvm_dirq_assist will
+ * just exit. We cannot touch pirq_dpci->dom unless
+ * the softirq is truly dead - and we reap it at the start or
+ * in pci_clean_dpci_irq.
+ */
+ if ( !dpci_kill(pirq_dpci) )
+ pirq_dpci->dom = NULL;
pirq_cleanup_check(info, d);
spin_unlock(&d->event_lock);
return rc;
@@ -332,12 +358,18 @@ int pt_irq_create_bind(
{
if ( pt_irq_need_timer(pirq_dpci->flags) )
kill_timer(&pirq_dpci->timer);
- pirq_dpci->dom = NULL;
list_del(&girq->list);
list_del(&digl->list);
hvm_irq_dpci->link_cnt[link]--;
pirq_dpci->flags = 0;
- dpci_kill(pirq_dpci);
+ /*
+ * If we fail, we have flags reset and hvm_dirq_assist will
+ * just exit. We cannot touch pirq_dpci->dom unless
+ * the softirq is truly dead - and we reap it at the start or
+ * in pci_clean_dpci_irq.
+ */
+ if ( !dpci_kill(pirq_dpci) )
+ pirq_dpci->dom = NULL;
pirq_cleanup_check(info, d);
spin_unlock(&d->event_lock);
xfree(girq);
@@ -466,7 +498,6 @@ int pt_irq_destroy_bind(
msixtbl_pt_unregister(d, pirq);
if ( pt_irq_need_timer(pirq_dpci->flags) )
kill_timer(&pirq_dpci->timer);
- pirq_dpci->dom = NULL;
pirq_dpci->flags = 0;
dpci_kill(pirq_dpci);
pirq_cleanup_check(pirq, d);
@@ -491,7 +522,6 @@ void pt_pirq_init(struct domain *d, struct hvm_pirq_dpci *dpci)
{
INIT_LIST_HEAD(&dpci->digl_list);
INIT_LIST_HEAD(&dpci->softirq_list);
- dpci->dom = d;
dpci->gmsi.dest_vcpu_id = -1;
}
--
1.9.3
* [PATCH v5 for-xen-4.5 3/3] dpci: In hvm_dirq_assist stop using pt_pirq_iterate
2014-09-19 18:51 [PATCH v5] Fix interrupt latency of HVM PCI passthrough devices Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5) Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 2/3] dpci: Safeguard against race with hvm_dirq_assist crashing and pt_irq_[create|destroy]_bind Konrad Rzeszutek Wilk
@ 2014-09-19 18:51 ` Konrad Rzeszutek Wilk
2 siblings, 0 replies; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-19 18:51 UTC (permalink / raw)
To: xen-devel, ian.campbell, ian.jackson, jbeulich, keir, tim
Cc: Konrad Rzeszutek Wilk
As we are now called from dpci_softirq with the outstanding
'struct hvm_pirq_dpci', there is no need to call pt_pirq_iterate,
which iterates over all of the PIRQs and calls us for every one
that is mapped - with _hvm_dirq_assist then figuring out which
one to execute.
We know which one to execute, as 'schedule_dpci_for' is called
for the specific 'struct hvm_pirq_dpci' that has an outstanding
interrupt.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
xen/drivers/passthrough/io.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 13433a5..62fdca9 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -621,9 +621,13 @@ void hvm_dpci_msi_eoi(struct domain *d, int vector)
spin_unlock(&d->event_lock);
}
-static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
- void *arg)
+static int hvm_dirq_assist(struct hvm_pirq_dpci *pirq_dpci)
{
+ struct domain *d = pirq_dpci->dom;
+
+ ASSERT(d && d->arch.hvm_domain.irq.dpci);
+
+ spin_lock(&d->event_lock);
if ( test_and_clear_bool(pirq_dpci->masked) )
{
struct pirq *pirq = dpci_pirq(pirq_dpci);
@@ -634,13 +638,13 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
send_guest_pirq(d, pirq);
if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
- return 0;
+ goto out;
}
if ( pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI )
{
vmsi_deliver_pirq(d, pirq_dpci);
- return 0;
+ goto out;
}
list_for_each_entry ( digl, &pirq_dpci->digl_list, list )
@@ -653,7 +657,7 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
{
/* for translated MSI to INTx interrupt, eoi as early as possible */
__msi_pirq_eoi(pirq_dpci);
- return 0;
+ goto out;
}
/*
@@ -666,21 +670,12 @@ static int _hvm_dirq_assist(struct domain *d, struct hvm_pirq_dpci *pirq_dpci,
ASSERT(pt_irq_need_timer(pirq_dpci->flags));
set_timer(&pirq_dpci->timer, NOW() + PT_IRQ_TIME_OUT);
}
+ out:
+ spin_unlock(&d->event_lock);
return 0;
}
-static void hvm_dirq_assist(struct hvm_pirq_dpci *pirq_dpci)
-{
- struct domain *d = pirq_dpci->dom;
-
- ASSERT(d && d->arch.hvm_domain.irq.dpci);
-
- spin_lock(&d->event_lock);
- pt_pirq_iterate(d, _hvm_dirq_assist, NULL);
- spin_unlock(&d->event_lock);
-}
-
static void __hvm_dpci_eoi(struct domain *d,
const struct hvm_girq_dpci_mapping *girq,
const union vioapic_redir_entry *ent)
--
1.9.3
* Re: [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5)
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5) Konrad Rzeszutek Wilk
@ 2014-09-22 11:58 ` Jan Beulich
2014-09-22 14:31 ` Konrad Rzeszutek Wilk
0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2014-09-22 11:58 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir, ian.jackson, ian.campbell, tim
>>> On 19.09.14 at 20:51, <konrad.wilk@oracle.com> wrote:
> --- a/xen/arch/x86/irq.c
> +++ b/xen/arch/x86/irq.c
> @@ -1337,6 +1337,12 @@ void (pirq_cleanup_check)(struct pirq *pirq, struct domain *d)
> return;
> if ( !pt_pirq_cleanup_check(&pirq->arch.hvm.dpci) )
> return;
> + /*
> + * Inhibit deletion from 'pirq_tree' so that 'pci_clean_dpci_irq'
> + * can still properly call 'dpci_kill'.
> + */
> + if ( dpci_kill(&pirq->arch.hvm.dpci) )
> + return;
If this is really necessary / The Right Thing To Do, it would clearly
belong into pt_pirq_cleanup_check() rather than here, since it's the
purpose of that function to deal with all pass-through related
aspects of pIRQ cleanup.
> +int dpci_kill(struct hvm_pirq_dpci *pirq_dpci)
> +{
> + if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> + return -EAGAIN;
> +
> + if ( test_bit(STATE_RUN, &pirq_dpci->state) )
> + return -EAGAIN;
> +
> + clear_bit(STATE_SCHED, &pirq_dpci->state);
So why do you first set and the immediately clear STATE_SCHED?
> +
> + /* In case we turn around and call 'schedule_dpci_for' right away. */
> + INIT_LIST_HEAD(&pirq_dpci->softirq_list);
You needing this seems to hint at a problem elsewhere.
> @@ -131,6 +194,16 @@ int pt_irq_create_bind(
> }
> pirq_dpci = pirq_dpci(info);
>
> > + /* Something is horribly wrong if it has been scheduled or is running. */
> + ASSERT(pirq_dpci->state == 0);
> +
> + /* Guest unbinds (pt_irq_destroy_bind) and rebinds it back. */
> + if ( !pirq_dpci->dom )
> + pirq_dpci->dom = d;
And this looks more like a hack too. I think there should be exactly
one place setting this field, and one or two (the second possibly in
some error path following the setting of it) clearing it.
> @@ -156,7 +229,9 @@ int pt_irq_create_bind(
> {
> pirq_dpci->gmsi.gflags = 0;
> pirq_dpci->gmsi.gvec = 0;
> + pirq_dpci->dom = NULL;
> pirq_dpci->flags = 0;
> + dpci_kill(pirq_dpci);
> pirq_cleanup_check(info, d);
Now these dpci_kill() calls you insert (not just here) right
before the pirq_cleanup_check() ones put an even bigger question
mark on your change to pirq_cleanup_check().
> @@ -625,3 +703,80 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
> unlock:
> spin_unlock(&d->event_lock);
> }
> +
> +static void dpci_softirq(void)
> +{
> + struct hvm_pirq_dpci *pirq_dpci;
> + unsigned int cpu = smp_processor_id();
> + LIST_HEAD(our_list);
> +
> + local_irq_disable();
> + list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
> + local_irq_enable();
> +
> + while ( !list_empty(&our_list) )
> + {
> + pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
> + list_del(&pirq_dpci->softirq_list);
> +
> + if ( !test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> + {
> + if ( !test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> + BUG();
> + hvm_dirq_assist(pirq_dpci);
> + put_domain(pirq_dpci->dom);
> + clear_bit(STATE_RUN, &pirq_dpci->state);
> + continue;
> + }
> +
> + local_irq_disable();
> + list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
> + local_irq_enable();
Can't all this be simplified quite a bit now that the code is no
longer domain-centric? I can't really see what the above setting
of STATE_RUN can race with, for example. And should that no
longer be needed, it would seem that re-inserting the list entry
wouldn't be either.
> +
> + raise_softirq(HVM_DPCI_SOFTIRQ);
> + }
> +}
> +static int cpu_callback(
Blank line missing.
> +static int __init setup_dpci_softirq(void)
> +{
> + int cpu;
unsigned int
> @@ -108,7 +109,7 @@ int pt_pirq_iterate(struct domain *d,
> int (*cb)(struct domain *,
> struct hvm_pirq_dpci *, void *arg),
> void *arg);
> -
> +int dpci_kill(struct hvm_pirq_dpci *);
> /* Modify state of a PCI INTx wire. */
> void hvm_pci_intx_assert(
> struct domain *d, unsigned int device, unsigned int intx);
Please don't drop useful blank lines like this.
Jan
* Re: [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with a softirq (v5)
2014-09-22 11:58 ` Jan Beulich
@ 2014-09-22 14:31 ` Konrad Rzeszutek Wilk
2014-09-22 15:31 ` Jan Beulich
0 siblings, 1 reply; 7+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-09-22 14:31 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, keir, ian.jackson, ian.campbell, tim
On Mon, Sep 22, 2014 at 12:58:57PM +0100, Jan Beulich wrote:
> >>> On 19.09.14 at 20:51, <konrad.wilk@oracle.com> wrote:
> > --- a/xen/arch/x86/irq.c
> > +++ b/xen/arch/x86/irq.c
> > @@ -1337,6 +1337,12 @@ void (pirq_cleanup_check)(struct pirq *pirq, struct domain *d)
> > return;
> > if ( !pt_pirq_cleanup_check(&pirq->arch.hvm.dpci) )
> > return;
> > + /*
> > + * Inhibit deletion from 'pirq_tree' so that 'pci_clean_dpci_irq'
> > + * can still properly call 'dpci_kill'.
> > + */
> > + if ( dpci_kill(&pirq->arch.hvm.dpci) )
> > + return;
>
> If this is really necessary / The Right Thing To Do, it would clearly
> belong into pt_pirq_cleanup_check() rather than here, since it's the
> purpose of that function to deal with all pass-through related
> aspects of pIRQ cleanup.
OK. Will do.
>
> > +int dpci_kill(struct hvm_pirq_dpci *pirq_dpci)
> > +{
> > + if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
> > + return -EAGAIN;
> > +
> > + if ( test_bit(STATE_RUN, &pirq_dpci->state) )
> > + return -EAGAIN;
> > +
> > + clear_bit(STATE_SCHED, &pirq_dpci->state);
>
> So why do you first set and then immediately clear STATE_SCHED?
To make sure that we do not race with another
'schedule_for_dpci'/'dpci_softirq'.
Granted this is all 'if somebody did extend this in the future'
type code. That is not happening right now with all of this
and we could as well just depend on 'dpci->masked' being set.
>
> > +
> > + /* In case we turn around and call 'schedule_dpci_for' right away. */
> > + INIT_LIST_HEAD(&pirq_dpci->softirq_list);
>
> You needing this seems to hint at a problem elsewhere.
That actually is not happening at all. I added that code in case
somebody _did_ do this in the future. Perhaps a better approach
would be to mention this in the function header?
>
> > @@ -131,6 +194,16 @@ int pt_irq_create_bind(
> > }
> > pirq_dpci = pirq_dpci(info);
> >
> > + /* Something is horribly wrong if it has been scheduled or is running. */
> > + ASSERT(pirq_dpci->state == 0);
> > +
> > + /* Guest unbinds (pt_irq_destroy_bind) and rebinds it back. */
> > + if ( !pirq_dpci->dom )
> > + pirq_dpci->dom = d;
>
> And this looks more like a hack too. I think there should be exactly
> one place setting this field, and one or two (the second possibly in
> some error path following the setting of it) clearing it.
Right. The second patch (which I was thinking to squash in
but figured as a separate patch might make it easier to understand)
does away with this and just does:
pirq_dpci->dom = d;
here.
>
> > @@ -156,7 +229,9 @@ int pt_irq_create_bind(
> > {
> > pirq_dpci->gmsi.gflags = 0;
> > pirq_dpci->gmsi.gvec = 0;
> > + pirq_dpci->dom = NULL;
> > pirq_dpci->flags = 0;
> > + dpci_kill(pirq_dpci);
> > pirq_cleanup_check(info, d);
>
> Now these dpci_kill() calls you insert not just here right before the
> pirq_cleanup_check() ones put an even bigger question mark on
> your change to pirq_cleanup_check().
The 'pirq_cleanup_check' macro calls the underlying 'pirq_cleanup_check'
function only if the event channel value is set. At this stage the event
is not set (it is
called later by the OS to bind the pirq to the event channel -
map_domain_pirq) so it would never call 'dpci_kill' at this juncture.
Unless somebody setup an event channel first and then called this?
Inserting 'dpci_kill' was the best way to make sure we would still call it
and set the fields in the right state. However as patch #2 demonstrates
there is a potential race.
Perhaps the best way to do this is to put 'dpci_kill' in:
1) At the start of this function (as patch #2 does), or
'pt_irq_destroy_bind'. Your call in which function it should go.
2) When the domain is being destroyed (in case we setup an MSI but
never use them). Already implemented in this patch.
3) In 'pt_pirq_cleanup_check' if the flags field is set to zero.
(And we need to do that because if we remove the 'struct pirq'
from the 'pirq_tree' radix tree, we will never be able to
call 'dpci_kill' on it from 'pci_clean_dpci_irqs' as the
field is gone).
And remove the various 'dpci_kill's in the error paths.
>
> > @@ -625,3 +703,80 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
> > unlock:
> > spin_unlock(&d->event_lock);
> > }
> > +
> > +static void dpci_softirq(void)
> > +{
> > + struct hvm_pirq_dpci *pirq_dpci;
> > + unsigned int cpu = smp_processor_id();
> > + LIST_HEAD(our_list);
> > +
> > + local_irq_disable();
> > + list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
> > + local_irq_enable();
> > +
> > + while ( !list_empty(&our_list) )
> > + {
> > + pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
> > + list_del(&pirq_dpci->softirq_list);
> > +
> > + if ( !test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
> > + {
> > + if ( !test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
> > + BUG();
> > + hvm_dirq_assist(pirq_dpci);
> > + put_domain(pirq_dpci->dom);
> > + clear_bit(STATE_RUN, &pirq_dpci->state);
> > + continue;
> > + }
> > +
> > + local_irq_disable();
> > + list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
> > + local_irq_enable();
>
> Can't all this be simplified quite a bit now that the code is no longer
> domain-centric? I can't really see who the above setting of
> STATE_RUN can race with for example. And should that no longer
> be needed it would seem that re-inserting the list entry wouldn't
> then be either.
It can race with 'dpci_kill'.
But I can simplify this by just using 'mapping' and having just one
state - set or unset. Though that will require fiddling with
'hvm_dirq_assist'. I think to preserve git bisection no-break rules
that needs to be done as a separate patch (after the #3 patch which
squashes __hvm_dirq_assist into hvm_dirq_assist).
>
> > +
> > + raise_softirq(HVM_DPCI_SOFTIRQ);
> > + }
> > +}
> > +static int cpu_callback(
>
> Blank line missing.
Aye!
>
> > +static int __init setup_dpci_softirq(void)
> > +{
> > + int cpu;
>
> unsigned int
Shame on me. You mentioned that in the previous review and I lost that
change!
>
>
> > @@ -108,7 +109,7 @@ int pt_pirq_iterate(struct domain *d,
> > int (*cb)(struct domain *,
> > struct hvm_pirq_dpci *, void *arg),
> > void *arg);
> > -
> > +int dpci_kill(struct hvm_pirq_dpci *);
> > /* Modify state of a PCI INTx wire. */
> > void hvm_pci_intx_assert(
> > struct domain *d, unsigned int device, unsigned int intx);
>
> Please don't drop useful blank lines like this.
Yes. That was an oversight.
Will of course fix it.
>
> Jan
* Re: [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with an softirq (v5)
2014-09-22 14:31 ` Konrad Rzeszutek Wilk
@ 2014-09-22 15:31 ` Jan Beulich
0 siblings, 0 replies; 7+ messages in thread
From: Jan Beulich @ 2014-09-22 15:31 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: xen-devel, keir, ian.jackson, ian.campbell, tim
>>> On 22.09.14 at 16:31, <konrad.wilk@oracle.com> wrote:
> On Mon, Sep 22, 2014 at 12:58:57PM +0100, Jan Beulich wrote:
>> >>> On 19.09.14 at 20:51, <konrad.wilk@oracle.com> wrote:
>> > +int dpci_kill(struct hvm_pirq_dpci *pirq_dpci)
>> > +{
>> > + if ( test_and_set_bit(STATE_SCHED, &pirq_dpci->state) )
>> > + return -EAGAIN;
>> > +
>> > + if ( test_bit(STATE_RUN, &pirq_dpci->state) )
>> > + return -EAGAIN;
>> > +
>> > + clear_bit(STATE_SCHED, &pirq_dpci->state);
>>
>> So why do you first set and then immediately clear STATE_SCHED?
>
> To make sure that we do not race with another
> 'schedule_for_dpci'/'dpci_softirq'.
>
> Granted this is all 'if somebody did extend this in the future'
> type code. That is not happening right now with all of this
> and we could as well just depend on 'dpci->masked' being set.
>
>>
>> > +
>> > + /* In case we turn around and call 'schedule_dpci_for' right away. */
>> > + INIT_LIST_HEAD(&pirq_dpci->softirq_list);
>>
>> You needing this seems to hint at a problem elsewhere.
>
> That actually is not happening at all. I added that code in case
> somebody _did_ do this in the future. Perhaps a better approach
> would be to mention this in the function header?
Yeah - if you could make the patch no more difficult to understand
than it needs to be by leaving out pieces helpful only to an abstract
future usage model, that would help (hinting at eventual needs in
comments is certainly fine).
>> > @@ -156,7 +229,9 @@ int pt_irq_create_bind(
>> > {
>> > pirq_dpci->gmsi.gflags = 0;
>> > pirq_dpci->gmsi.gvec = 0;
>> > + pirq_dpci->dom = NULL;
>> > pirq_dpci->flags = 0;
>> > + dpci_kill(pirq_dpci);
>> > pirq_cleanup_check(info, d);
>>
>> Now these dpci_kill() calls you insert not just here right before the
>> pirq_cleanup_check() ones put an even bigger question mark on
>> your change to pirq_cleanup_check().
>
> The 'pirq_cleanup_check' macro calls the underlying 'pirq_cleanup_check'
> function only if the event channel value is set. At this stage the event
> is not set (it is
> called later by the OS to bind the pirq to the event channel -
> map_domain_pirq) so it would never call 'dpci_kill' at this juncture.
> Unless somebody setup an event channel first and then called this?
>
> Inserting 'dpci_kill' was the best way to make sure we would still call it
> and set the fields in the right state. However as patch #2 demonstrates
> there is a potential race.
>
> Perhaps the best way to do this is to put 'dpci_kill' in:
>
> 1) At the start of this function (as patch #2 does), or
> 'pt_irq_destroy_bind'. Your call in which function it should go.
I don't really mind either way - all that confuses me is that the way
it's done now makes it at least look like the function can get called
twice in a row, which surely isn't what we want/need.
> 2) When the domain is being destroyed (in case we setup an MSI but
> never use them). Already implemented in this patch.
>
> 3) In 'pt_pirq_cleanup_check' if the flags field is set to zero.
>
> (And we need to do that because if we remove the 'struct pirq'
> from the 'pirq_tree' radix tree, we will never be able to
> call 'dpci_kill' on it from 'pci_clean_dpci_irqs' as the
> field is gone).
>
> And remove the various 'dpci_kill's in the error paths.
Sounds like a reasonable alternative.
>> > @@ -625,3 +703,80 @@ void hvm_dpci_eoi(struct domain *d, unsigned int guest_gsi,
>> > unlock:
>> > spin_unlock(&d->event_lock);
>> > }
>> > +
>> > +static void dpci_softirq(void)
>> > +{
>> > + struct hvm_pirq_dpci *pirq_dpci;
>> > + unsigned int cpu = smp_processor_id();
>> > + LIST_HEAD(our_list);
>> > +
>> > + local_irq_disable();
>> > + list_splice_init(&per_cpu(dpci_list, cpu), &our_list);
>> > + local_irq_enable();
>> > +
>> > + while ( !list_empty(&our_list) )
>> > + {
>> > + pirq_dpci = list_entry(our_list.next, struct hvm_pirq_dpci, softirq_list);
>> > + list_del(&pirq_dpci->softirq_list);
>> > +
>> > + if ( !test_and_set_bit(STATE_RUN, &pirq_dpci->state) )
>> > + {
>> > + if ( !test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) )
>> > + BUG();
>> > + hvm_dirq_assist(pirq_dpci);
>> > + put_domain(pirq_dpci->dom);
>> > + clear_bit(STATE_RUN, &pirq_dpci->state);
>> > + continue;
>> > + }
>> > +
>> > + local_irq_disable();
>> > + list_add_tail(&pirq_dpci->softirq_list, &per_cpu(dpci_list, cpu));
>> > + local_irq_enable();
>>
>> Can't all this be simplified quite a bit now that the code is no longer
>> domain-centric? I can't really see who the above setting of
>> STATE_RUN can race with for example. And should that no longer
>> be needed it would seem that re-inserting the list entry wouldn't
>> then be either.
>
> It can race with 'dpci_kill'.
>
> But I can simplify this by just using 'mapping' and having just one
> state - set or unset. Though that will require fiddling with
> 'hvm_dirq_assist'. I think to preserve git bisection no-break rules
> that needs to be done as a separate patch (after the #3 patch which
> squashes __hvm_dirq_assist into hvm_dirq_assist).
I was actually more thinking that perhaps patch 3 could be moved
first (i.e. together with the elimination of the domain centric
implementation, but without the tasklet->softirq conversion).
Jan
Thread overview: 7+ messages:
2014-09-19 18:51 [PATCH v5] Fix interrupt latency of HVM PCI passthroughdevices Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 1/3] dpci: Replace tasklet with an softirq (v5) Konrad Rzeszutek Wilk
2014-09-22 11:58 ` Jan Beulich
2014-09-22 14:31 ` Konrad Rzeszutek Wilk
2014-09-22 15:31 ` Jan Beulich
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 2/3] dpci: Safeguard against race with hvm_dirq_assist crashing and pt_irq_[create|destroy]_bind Konrad Rzeszutek Wilk
2014-09-19 18:51 ` [PATCH v5 for-xen-4.5 3/3] dpci: In hvm_dirq_assist stop using pt_pirq_iterate Konrad Rzeszutek Wilk