From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
xen-devel <xen-devel@lists.xenproject.org>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
Tim Deegan <tim@xen.org>, Keir Fraser <keir@xen.org>,
xiantao.zhang@intel.com
Subject: Re: [PATCH v2 1/5] IOMMU: make page table population preemptible
Date: Fri, 13 Dec 2013 15:09:02 +0000 [thread overview]
Message-ID: <52AB230E.90607@citrix.com> (raw)
In-Reply-To: <52AB20DB020000780010D060@nat28.tlf.novell.com>
[-- Attachment #1.1: Type: text/plain, Size: 7207 bytes --]
On 13/12/2013 13:59, Jan Beulich wrote:
> Since this can take an arbitrary amount of time, the rooting domctl as
> well as all involved code must become aware of this requiring a
> continuation.
>
> The subject domain's rel_mem_list is being (ab)used for this, in a way
> similar to and compatible with broken page offlining.
>
> Further, operations get slightly re-ordered in assign_device(): IOMMU
> page tables now get set up _before_ the first device gets assigned, at
> once closing a small timing window in which the guest may already see
> the device but wouldn't be able to access it.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> v2: Extend comment on struct domain's need_iommu field.
>
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1924,6 +1924,12 @@ int domain_relinquish_resources(struct d
> }
>
> d->arch.relmem = RELMEM_xen;
> +
> + spin_lock(&d->page_alloc_lock);
> + page_list_splice(&d->arch.relmem_list, &d->page_list);
> + INIT_PAGE_LIST_HEAD(&d->arch.relmem_list);
> + spin_unlock(&d->page_alloc_lock);
> +
> /* Fallthrough. Relinquish every page of memory. */
> case RELMEM_xen:
> ret = relinquish_memory(d, &d->xenpage_list, ~0UL);
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -459,7 +459,8 @@ p2m_pod_offline_or_broken_hit(struct pag
>
> pod_hit:
> lock_page_alloc(p2m);
> - page_list_add_tail(p, &d->arch.relmem_list);
> + /* Insertion must be at list head (see iommu_populate_page_table()). */
> + page_list_add(p, &d->arch.relmem_list);
> unlock_page_alloc(p2m);
> pod_unlock(p2m);
> return 1;
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -18,6 +18,7 @@
> #include <asm/hvm/iommu.h>
> #include <xen/paging.h>
> #include <xen/guest_access.h>
> +#include <xen/event.h>
> #include <xen/softirq.h>
> #include <xen/keyhandler.h>
> #include <xsm/xsm.h>
> @@ -265,7 +266,23 @@ static int assign_device(struct domain *
> d->mem_event->paging.ring_page)) )
> return -EXDEV;
>
> - spin_lock(&pcidevs_lock);
> + if ( !spin_trylock(&pcidevs_lock) )
> + return -ERESTART;
> +
> + if ( need_iommu(d) <= 0 )
> + {
> + if ( !iommu_use_hap_pt(d) )
> + {
> + rc = iommu_populate_page_table(d);
> + if ( rc )
> + {
> + spin_unlock(&pcidevs_lock);
> + return rc;
> + }
> + }
> + d->need_iommu = 1;
> + }
> +
> pdev = pci_get_pdev_by_domain(dom0, seg, bus, devfn);
> if ( !pdev )
> {
> @@ -290,15 +307,14 @@ static int assign_device(struct domain *
> rc);
> }
>
> - if ( has_arch_pdevs(d) && !need_iommu(d) )
> + done:
> + if ( !has_arch_pdevs(d) && need_iommu(d) )
We now have a case where, for the first device, we could set up
pagetables for a large domain, get an error with assignment, then tear
them all back down. (-EBUSY from pci_get_pdev() looks like a good
non-fatal candidate for causing this behaviour)
I am wondering whether this is better for worse than the race condition
where a guest couldn't use the device. A guest could not reasonably
expect to use a device before the toolstack is done setting it up. A
buggy toolstack could quite easily tie up a lot of Xen time creating and
destroying complete iommu pagetable sets.
~Andrew
> {
> - d->need_iommu = 1;
> - if ( !iommu_use_hap_pt(d) )
> - rc = iommu_populate_page_table(d);
> - goto done;
> + d->need_iommu = 0;
> + hd->platform_ops->teardown(d);
> }
> -done:
> spin_unlock(&pcidevs_lock);
> +
> return rc;
> }
>
> @@ -306,12 +322,17 @@ static int iommu_populate_page_table(str
> {
> struct hvm_iommu *hd = domain_hvm_iommu(d);
> struct page_info *page;
> - int rc = 0;
> + int rc = 0, n = 0;
> +
> + d->need_iommu = -1;
>
> this_cpu(iommu_dont_flush_iotlb) = 1;
> spin_lock(&d->page_alloc_lock);
>
> - page_list_for_each ( page, &d->page_list )
> + if ( unlikely(d->is_dying) )
> + rc = -ESRCH;
> +
> + while ( !rc && (page = page_list_remove_head(&d->page_list)) )
> {
> if ( is_hvm_domain(d) ||
> (page->u.inuse.type_info & PGT_type_mask) == PGT_writable_page )
> @@ -321,7 +342,32 @@ static int iommu_populate_page_table(str
> d, mfn_to_gmfn(d, page_to_mfn(page)), page_to_mfn(page),
> IOMMUF_readable|IOMMUF_writable);
> if ( rc )
> + {
> + page_list_add(page, &d->page_list);
> break;
> + }
> + }
> + page_list_add_tail(page, &d->arch.relmem_list);
> + if ( !(++n & 0xff) && !page_list_empty(&d->page_list) &&
> + hypercall_preempt_check() )
> + rc = -ERESTART;
> + }
> +
> + if ( !rc )
> + {
> + /*
> + * The expectation here is that generally there are many normal pages
> + * on relmem_list (the ones we put there) and only few being in an
> + * offline/broken state. The latter ones are always at the head of the
> + * list. Hence we first move the whole list, and then move back the
> + * first few entries.
> + */
> + page_list_move(&d->page_list, &d->arch.relmem_list);
> + while ( (page = page_list_first(&d->page_list)) != NULL &&
> + (page->count_info & (PGC_state|PGC_broken)) )
> + {
> + page_list_del(page, &d->page_list);
> + page_list_add_tail(page, &d->arch.relmem_list);
> }
> }
>
> @@ -330,8 +376,11 @@ static int iommu_populate_page_table(str
>
> if ( !rc )
> iommu_iotlb_flush_all(d);
> - else
> + else if ( rc != -ERESTART )
> + {
> + d->need_iommu = 0;
> hd->platform_ops->teardown(d);
> + }
>
> return rc;
> }
> @@ -688,7 +737,10 @@ int iommu_do_domctl(
>
> ret = device_assigned(seg, bus, devfn) ?:
> assign_device(d, seg, bus, devfn);
> - if ( ret )
> + if ( ret == -ERESTART )
> + ret = hypercall_create_continuation(__HYPERVISOR_domctl,
> + "h", u_domctl);
> + else if ( ret )
> printk(XENLOG_G_ERR "XEN_DOMCTL_assign_device: "
> "assign %04x:%02x:%02x.%u to dom%d failed (%d)\n",
> seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn),
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -322,8 +322,8 @@ struct domain
> enum guest_type guest_type;
>
> #ifdef HAS_PASSTHROUGH
> - /* Does this guest need iommu mappings? */
> - bool_t need_iommu;
> + /* Does this guest need iommu mappings (-1 meaning "being set up")? */
> + s8 need_iommu;
> #endif
> /* is node-affinity automatically computed? */
> bool_t auto_node_affinity;
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
[-- Attachment #1.2: Type: text/html, Size: 8050 bytes --]
[-- Attachment #2: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
next prev parent reply other threads:[~2013-12-13 15:09 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-12-10 15:43 [PATCH 0/5] XSA-77 follow-ups Jan Beulich
2013-12-10 15:45 ` [PATCH 1/5] IOMMU: make page table population preemptible Jan Beulich
2013-12-11 18:40 ` Andrew Cooper
2013-12-13 9:51 ` Jan Beulich
2013-12-13 12:18 ` Andrew Cooper
2013-12-13 12:34 ` Jan Beulich
2013-12-13 13:57 ` Andrew Cooper
2013-12-13 13:59 ` [PATCH v2 " Jan Beulich
2013-12-13 14:16 ` Tim Deegan
2013-12-13 15:09 ` Andrew Cooper [this message]
2013-12-13 15:41 ` Jan Beulich
2013-12-13 15:44 ` Andrew Cooper
2013-12-30 13:43 ` Zhang, Xiantao
2014-01-07 13:23 ` Jan Beulich
2013-12-20 13:06 ` [PATCH " Jan Beulich
2013-12-10 15:46 ` [PATCH 2/5] IOMMU: make page table deallocation preemptible Jan Beulich
2013-12-11 19:01 ` Andrew Cooper
2013-12-13 9:55 ` Jan Beulich
2013-12-13 14:00 ` [PATCH v2 " Jan Beulich
2013-12-13 15:26 ` Andrew Cooper
2014-01-07 14:51 ` Zhang, Xiantao
2013-12-10 15:47 ` [PATCH 3/5] x86/p2m: restrict auditing to debug builds Jan Beulich
2013-12-10 15:58 ` Andrew Cooper
2013-12-10 17:31 ` George Dunlap
2013-12-13 13:46 ` Tim Deegan
2013-12-10 15:48 ` [PATCH 4/5] HVM: prevent leaking heap data from hvm_save_one() Jan Beulich
2013-12-10 16:01 ` Andrew Cooper
2013-12-10 17:32 ` George Dunlap
2013-12-17 9:16 ` Ping: " Jan Beulich
2013-12-17 10:45 ` Andrew Cooper
2013-12-17 11:02 ` Jan Beulich
2013-12-10 15:48 ` [PATCH 5/5] x86/PV: don't commit debug register values early in arch_set_info_guest() Jan Beulich
2013-12-10 17:23 ` Andrew Cooper
2013-12-10 17:33 ` George Dunlap
2013-12-10 18:17 ` Keir Fraser
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=52AB230E.90607@citrix.com \
--to=andrew.cooper3@citrix.com \
--cc=George.Dunlap@eu.citrix.com \
--cc=JBeulich@suse.com \
--cc=keir@xen.org \
--cc=tim@xen.org \
--cc=xen-devel@lists.xenproject.org \
--cc=xiantao.zhang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).