* [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin
@ 2013-09-09 16:04 Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
` (3 more replies)
0 siblings, 4 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-09 16:04 UTC (permalink / raw)
To: xen-devel; +Cc: Tim Deegan, Ian Campbell, Stefano Stabellini
Hi all,
this patch series introduces two new hypercalls that allow autotranslated
guests to allocate a buffer that is contiguous in machine addresses.
XENMEM_exchange_and_pin returns the MFNs of the new pages and pins them,
so that the hypervisor won't change their p2m mappings while the buffer
is in use.
XENMEM_unpin simply unpins the pages again.
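To give an idea of the intended usage, a guest could call the new
hypercall roughly as follows. This is only a sketch, not code from this
series: get_pinned_dma_buffer is a hypothetical wrapper,
HYPERVISOR_memory_op is the usual Linux-style hypercall wrapper, the
address_bits field follows the note added to the public header in patch
4/4, and nr_exchanged/error handling is omitted:

    #include <xen/interface/memory.h>

    /* Exchange (1 << order) existing guest pages, whose GFNs are listed
     * in in_gfns[], for one machine-contiguous, pinned extent of the
     * same size appearing at *out_gfn. */
    static int get_pinned_dma_buffer(xen_pfn_t *in_gfns, xen_pfn_t *out_gfn,
                                     unsigned int order)
    {
        struct xen_memory_exchange exch = {
            .in = {
                .nr_extents   = 1UL << order,
                .extent_order = 0,
                .domid        = DOMID_SELF,
            },
            .out = {
                .nr_extents   = 1,
                .extent_order = order,
                .address_bits = 32,   /* e.g. a 32-bit DMA-capable device */
                .domid        = DOMID_SELF,
            },
        };

        set_xen_guest_handle(exch.in.extent_start, in_gfns);
        set_xen_guest_handle(exch.out.extent_start, out_gfn);

        /* On success *out_gfn is overwritten with the machine frame
         * number of the new, pinned, machine-contiguous buffer. */
        return HYPERVISOR_memory_op(XENMEM_exchange_and_pin, &exch);
    }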
Cheers,
Stefano
Changes in v5:
- move donate_page to common code;
- update commit message;
- align tests;
- comment p2m_walker;
- fix return codes in p2m_walker;
- handle superpages in p2m_walker;
- rename _p2m_lookup to p2m_lookup_f;
- return -EBUSY when the P2M_DMA_PIN check fails;
- rename _guest_physmap_pin_range to pin_one_pte;
- rename _guest_physmap_unpin_range to unpin_one_pte;
- memory_exchange: handle guest_physmap_pin_range failures;
- make i an unsigned long in unpinned;
- add nr_unpinned to xen_unpin to report partial success.
Changes in v4:
- move steal_page to common code;
- introduce a generic p2m walker and use it in p2m_lookup,
guest_physmap_pin_range and guest_physmap_unpin_range;
- return -EINVAL when the P2M_DMA_PIN check fails;
- change the printk into a gdprintk;
- add a comment on what type of page can be pinned;
- rename XENMEM_get_dma_buf to XENMEM_exchange_and_pin;
- rename XENMEM_get_dma_buf to XENMEM_unpin;
- move the pinning before we copy back the mfn to the guest;
- propagate errors returned by guest_physmap_pin_range;
- use xen_memory_exchange_t as parameter for XENMEM_exchange_and_pin;
- use an unsigned iterator in unpin;
- improve the documentation of the new hypercalls;
- add a note about out.address_bits for XENMEM_exchange.
Changes in v3:
- implement guest_physmap_pin_range and guest_physmap_unpin_range on
ARM;
- implement XENMEM_put_dma_buf.
Changes in v2:
- actually don't print the warning more than once.
Stefano Stabellini (4):
xen: move steal_page and donate_page to common code
xen/arm: introduce a generic p2m walker and use it in p2m_lookup
xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
xen/arch/arm/mm.c | 12 ---
xen/arch/arm/p2m.c | 177 +++++++++++++++++++++++++++++-----
xen/arch/x86/mm.c | 85 ----------------
xen/common/memory.c | 226 +++++++++++++++++++++++++++++++++++++------
xen/include/asm-arm/mm.h | 9 +-
xen/include/asm-arm/page.h | 7 +-
xen/include/asm-x86/mm.h | 5 -
xen/include/asm-x86/p2m.h | 12 +++
xen/include/public/memory.h | 47 +++++++++
xen/include/xen/mm.h | 3 +
10 files changed, 418 insertions(+), 165 deletions(-)
* [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-09 16:04 [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
@ 2013-09-09 16:06 ` Stefano Stabellini
2013-09-10 9:08 ` Ian Campbell
2013-09-10 10:39 ` Jan Beulich
2013-09-09 16:06 ` [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup Stefano Stabellini
` (2 subsequent siblings)
3 siblings, 2 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-09 16:06 UTC (permalink / raw)
To: xen-devel; +Cc: keir, tim, Ian.Campbell, JBeulich, Stefano Stabellini
Only code movement, except for a small change to the warning printed in
case of an error in donate_page.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
CC: keir@xen.org
CC: JBeulich@suse.com
Changes in v5:
- move donate_page to common code;
- update commit message.
Changes in v4:
- move steal_page to common code.
---
xen/arch/arm/mm.c | 12 ------
xen/arch/x86/mm.c | 85 ----------------------------------------------
xen/common/memory.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++
xen/include/asm-arm/mm.h | 5 ---
xen/include/asm-x86/mm.h | 5 ---
xen/include/xen/mm.h | 3 ++
6 files changed, 88 insertions(+), 107 deletions(-)
diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index f301e65..89b9d27 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -742,18 +742,6 @@ void arch_dump_shared_mem_info(void)
{
}
-int donate_page(struct domain *d, struct page_info *page, unsigned int memflags)
-{
- ASSERT(0);
- return -ENOSYS;
-}
-
-int steal_page(
- struct domain *d, struct page_info *page, unsigned int memflags)
-{
- return -1;
-}
-
int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
{
ASSERT(0);
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7f0e13..a512046 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4124,91 +4124,6 @@ int replace_grant_host_mapping(
return rc;
}
-int donate_page(
- struct domain *d, struct page_info *page, unsigned int memflags)
-{
- spin_lock(&d->page_alloc_lock);
-
- if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
- goto fail;
-
- if ( d->is_dying )
- goto fail;
-
- if ( page->count_info & ~(PGC_allocated | 1) )
- goto fail;
-
- if ( !(memflags & MEMF_no_refcount) )
- {
- if ( d->tot_pages >= d->max_pages )
- goto fail;
- domain_adjust_tot_pages(d, 1);
- }
-
- page->count_info = PGC_allocated | 1;
- page_set_owner(page, d);
- page_list_add_tail(page,&d->page_list);
-
- spin_unlock(&d->page_alloc_lock);
- return 0;
-
- fail:
- spin_unlock(&d->page_alloc_lock);
- MEM_LOG("Bad donate %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%" PRtype_info,
- (void *)page_to_mfn(page), d, d->domain_id,
- page_get_owner(page), page->count_info, page->u.inuse.type_info);
- return -1;
-}
-
-int steal_page(
- struct domain *d, struct page_info *page, unsigned int memflags)
-{
- unsigned long x, y;
- bool_t drop_dom_ref = 0;
-
- spin_lock(&d->page_alloc_lock);
-
- if ( is_xen_heap_page(page) || (page_get_owner(page) != d) )
- goto fail;
-
- /*
- * We require there is just one reference (PGC_allocated). We temporarily
- * drop this reference now so that we can safely swizzle the owner.
- */
- y = page->count_info;
- do {
- x = y;
- if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
- goto fail;
- y = cmpxchg(&page->count_info, x, x & ~PGC_count_mask);
- } while ( y != x );
-
- /* Swizzle the owner then reinstate the PGC_allocated reference. */
- page_set_owner(page, NULL);
- y = page->count_info;
- do {
- x = y;
- BUG_ON((x & (PGC_count_mask|PGC_allocated)) != PGC_allocated);
- } while ( (y = cmpxchg(&page->count_info, x, x | 1)) != x );
-
- /* Unlink from original owner. */
- if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
- drop_dom_ref = 1;
- page_list_del(page, &d->page_list);
-
- spin_unlock(&d->page_alloc_lock);
- if ( unlikely(drop_dom_ref) )
- put_domain(d);
- return 0;
-
- fail:
- spin_unlock(&d->page_alloc_lock);
- MEM_LOG("Bad page %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%" PRtype_info,
- (void *)page_to_mfn(page), d, d->domain_id,
- page_get_owner(page), page->count_info, page->u.inuse.type_info);
- return -1;
-}
-
static int __do_update_va_mapping(
unsigned long va, u64 val64, unsigned long flags, struct domain *pg_owner)
{
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 50b740f..4b2f311 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
return 1;
}
+int donate_page(
+ struct domain *d, struct page_info *page, unsigned int memflags)
+{
+ spin_lock(&d->page_alloc_lock);
+
+ if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
+ goto fail;
+
+ if ( d->is_dying )
+ goto fail;
+
+ if ( page->count_info & ~(PGC_allocated | 1) )
+ goto fail;
+
+ if ( !(memflags & MEMF_no_refcount) )
+ {
+ if ( d->tot_pages >= d->max_pages )
+ goto fail;
+ domain_adjust_tot_pages(d, 1);
+ }
+
+ page->count_info = PGC_allocated | 1;
+ page_set_owner(page, d);
+ page_list_add_tail(page,&d->page_list);
+
+ spin_unlock(&d->page_alloc_lock);
+ return 0;
+
+ fail:
+ spin_unlock(&d->page_alloc_lock);
+ gdprintk(XENLOG_WARNING, "Bad donate %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%016lx",
+ (void *)page_to_mfn(page), d, d->domain_id,
+ page_get_owner(page), page->count_info, page->u.inuse.type_info);
+ return -1;
+}
+
+int steal_page(
+ struct domain *d, struct page_info *page, unsigned int memflags)
+{
+ unsigned long x, y;
+ bool_t drop_dom_ref = 0;
+
+ spin_lock(&d->page_alloc_lock);
+
+ if ( is_xen_heap_page(page) || (page_get_owner(page) != d) )
+ goto fail;
+
+ /*
+ * We require there is just one reference (PGC_allocated). We temporarily
+ * drop this reference now so that we can safely swizzle the owner.
+ */
+ y = page->count_info;
+ do {
+ x = y;
+ if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
+ goto fail;
+ y = cmpxchg(&page->count_info, x, x & ~PGC_count_mask);
+ } while ( y != x );
+
+ /* Swizzle the owner then reinstate the PGC_allocated reference. */
+ page_set_owner(page, NULL);
+ y = page->count_info;
+ do {
+ x = y;
+ BUG_ON((x & (PGC_count_mask|PGC_allocated)) != PGC_allocated);
+ } while ( (y = cmpxchg(&page->count_info, x, x | 1)) != x );
+
+ /* Unlink from original owner. */
+ if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
+ drop_dom_ref = 1;
+ page_list_del(page, &d->page_list);
+
+ spin_unlock(&d->page_alloc_lock);
+ if ( unlikely(drop_dom_ref) )
+ put_domain(d);
+ return 0;
+
+ fail:
+ spin_unlock(&d->page_alloc_lock);
+ printk("Bad page %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%lx\n",
+ (void *)page_to_mfn(page), d, d->domain_id,
+ page_get_owner(page), page->count_info, page->u.inuse.type_info);
+ return -1;
+}
+
static void decrease_reservation(struct memop_args *a)
{
unsigned long i, j;
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 5e7c5a3..0b0a457 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -297,11 +297,6 @@ static inline int relinquish_shared_pages(struct domain *d)
/* Arch-specific portion of memory_op hypercall. */
long arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg);
-int steal_page(
- struct domain *d, struct page_info *page, unsigned int memflags);
-int donate_page(
- struct domain *d, struct page_info *page, unsigned int memflags);
-
#define domain_set_alloc_bitsize(d) ((void)0)
#define domain_clamp_alloc_bitsize(d, b) (b)
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 213fc9c..e5254ee 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -566,11 +566,6 @@ long subarch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void) arg);
int compat_arch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void));
int compat_subarch_memory_op(int op, XEN_GUEST_HANDLE_PARAM(void));
-int steal_page(
- struct domain *d, struct page_info *page, unsigned int memflags);
-int donate_page(
- struct domain *d, struct page_info *page, unsigned int memflags);
-
int map_ldt_shadow_page(unsigned int);
#define NIL(type) ((type *)-sizeof(type))
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 4f5795c..1dc859b 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -60,6 +60,9 @@ void destroy_xen_mappings(unsigned long v, unsigned long e);
unsigned long domain_adjust_tot_pages(struct domain *d, long pages);
int domain_set_outstanding_pages(struct domain *d, unsigned long pages);
void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages);
+int steal_page(struct domain *d, struct page_info *page, unsigned int memflags);
+int donate_page(struct domain *d, struct page_info *page, unsigned int memflags);
+
/* Domain suballocator. These functions are *not* interrupt-safe.*/
void init_domheap_pages(paddr_t ps, paddr_t pe);
--
1.7.2.5
* [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup
2013-09-09 16:04 [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
@ 2013-09-09 16:06 ` Stefano Stabellini
2013-09-10 9:26 ` Ian Campbell
2013-09-09 16:06 ` [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
3 siblings, 1 reply; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-09 16:06 UTC (permalink / raw)
To: xen-devel; +Cc: tim, Ian.Campbell, Stefano Stabellini
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Changes in v5:
- align tests;
- comment p2m_walker;
- fix return codes in p2m_walker;
- handle superpages in p2m_walker;
- rename _p2m_lookup to p2m_lookup_f.
---
xen/arch/arm/p2m.c | 131 +++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 105 insertions(+), 26 deletions(-)
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index 307c6d4..a9ceacf 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -31,48 +31,127 @@ void p2m_load_VTTBR(struct domain *d)
}
/*
- * Lookup the MFN corresponding to a domain's PFN.
+ * d: domain p2m to walk
+ * paddr: the guest start physical address
+ * order: page order
+ * func: function to call for each 2-stage lpae_t entry found
+ * arg: opaque pointer to pass to func
*
- * There are no processor functions to do a stage 2 only lookup therefore we
- * do a a software walk.
*/
-paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
+static int p2m_walker(struct domain *d, paddr_t paddr, unsigned int order,
+ int (*func)(lpae_t *pte, void *arg, int level), void* arg)
{
+ lpae_t *first = NULL, *second = NULL, *third = NULL;
struct p2m_domain *p2m = &d->arch.p2m;
- lpae_t pte, *first = NULL, *second = NULL, *third = NULL;
- paddr_t maddr = INVALID_PADDR;
+ int rc = -EFAULT, level = 1;
+ unsigned long cur_first_offset = ~0, cur_second_offset = ~0;
+ paddr_t pend = paddr + (1UL << order);
spin_lock(&p2m->lock);
first = __map_domain_page(p2m->first_level);
- pte = first[first_table_offset(paddr)];
- if ( !pte.p2m.valid || !pte.p2m.table )
- goto done;
+ if ( !first ||
+ !first[first_table_offset(paddr)].p2m.valid )
+ goto err;
- second = map_domain_page(pte.p2m.base);
- pte = second[second_table_offset(paddr)];
- if ( !pte.p2m.valid || !pte.p2m.table )
- goto done;
+ if ( !first[first_table_offset(paddr)].p2m.table )
+ {
+ rc = func(&first[first_table_offset(paddr)], arg, level);
+ if ( rc != 0 )
+ goto err;
+ paddr += FIRST_SIZE;
+ }
- third = map_domain_page(pte.p2m.base);
- pte = third[third_table_offset(paddr)];
+ while ( paddr < pend )
+ {
+ rc = -EFAULT;
+ level = 1;
- /* This bit must be one in the level 3 entry */
- if ( !pte.p2m.table )
- pte.bits = 0;
+ if ( cur_first_offset != first_table_offset(paddr) )
+ {
+ if (second) unmap_domain_page(second);
+ second = map_domain_page(first[first_table_offset(paddr)].p2m.base);
+ cur_first_offset = first_table_offset(paddr);
+ }
+ level++;
+ if ( !second ||
+ !second[second_table_offset(paddr)].p2m.valid )
+ goto err;
+ if ( !second[second_table_offset(paddr)].p2m.table )
+ {
+ rc = func(&first[first_table_offset(paddr)], arg, level);
+ if ( rc != 0 )
+ goto err;
+ paddr += SECOND_SIZE;
+ }
-done:
- if ( pte.p2m.valid )
- maddr = (pte.bits & PADDR_MASK & PAGE_MASK) | (paddr & ~PAGE_MASK);
+ if ( cur_second_offset != second_table_offset(paddr) )
+ {
+ if (third) unmap_domain_page(third);
+ third = map_domain_page(second[second_table_offset(paddr)].p2m.base);
+ cur_second_offset = second_table_offset(paddr);
+ }
+ level++;
+ if ( !third ||
+ !third[third_table_offset(paddr)].p2m.valid )
+ goto err;
- if (third) unmap_domain_page(third);
- if (second) unmap_domain_page(second);
- if (first) unmap_domain_page(first);
+ rc = func(&third[third_table_offset(paddr)], arg, level);
+ if ( rc != 0 )
+ goto err;
+
+ paddr += PAGE_SIZE;
+ }
+
+ rc = 0;
+
+err:
+ if ( third ) unmap_domain_page(third);
+ if ( second ) unmap_domain_page(second);
+ if ( first ) unmap_domain_page(first);
spin_unlock(&p2m->lock);
- return maddr;
+ return rc;
+}
+
+struct p2m_lookup_t {
+ paddr_t paddr;
+ paddr_t maddr;
+};
+
+static int p2m_lookup_f(lpae_t *ptep, void *arg, int level)
+{
+ lpae_t pte;
+ struct p2m_lookup_t *p2m = (struct p2m_lookup_t *)arg;
+ ASSERT(level == 3);
+
+ pte = *ptep;
+
+ /* This bit must be one in the level 3 entry */
+ if ( !pte.p2m.table || !pte.p2m.valid )
+ return -EFAULT;
+
+ p2m->maddr = (pte.bits & PADDR_MASK & PAGE_MASK) |
+ (p2m->paddr & ~PAGE_MASK);
+ return 0;
+}
+/*
+ * Lookup the MFN corresponding to a domain's PFN.
+ *
+ * There are no processor functions to do a stage 2 only lookup therefore we
+ * do a a software walk.
+ */
+paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
+{
+ struct p2m_lookup_t p2m;
+ p2m.paddr = paddr;
+ p2m.maddr = INVALID_PADDR;
+
+ p2m_walker(d, paddr, 0, p2m_lookup_f, &p2m);
+
+ return p2m.maddr;
}
int guest_physmap_mark_populate_on_demand(struct domain *d,
--
1.7.2.5
* [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-09 16:04 [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup Stefano Stabellini
@ 2013-09-09 16:06 ` Stefano Stabellini
2013-09-10 9:30 ` Ian Campbell
2013-09-10 9:33 ` Ian Campbell
2013-09-09 16:06 ` [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
3 siblings, 2 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-09 16:06 UTC (permalink / raw)
To: xen-devel; +Cc: tim, Ian.Campbell, Stefano Stabellini
guest_physmap_pin_range pins a range of guest pages so that their p2m
mappings won't be changed.
guest_physmap_unpin_range unpins the previously pinned pages.
The pinning is done using one of the spare bits in the p2m ptes.
Use the newly introduced p2m_walker to implement the two functions on
ARM.
Provide empty stubs for x86.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Changes in v5:
- return -EBUSY when the P2M_DMA_PIN check fails;
- rename _guest_physmap_pin_range to pin_one_pte;
- rename _guest_physmap_unpin_range to unpin_one_pte.
Changes in v4:
- use p2m_walker to implement guest_physmap_pin_range and
guest_physmap_unpin_range;
- return -EINVAL when the P2M_DMA_PIN check fails;
- change the printk into a gdprintk;
- add a comment on what type of page can be pinned.
---
xen/arch/arm/p2m.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
xen/include/asm-arm/mm.h | 4 +++
xen/include/asm-arm/page.h | 7 +++++-
xen/include/asm-x86/p2m.h | 12 +++++++++++
4 files changed, 70 insertions(+), 1 deletions(-)
diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
index a9ceacf..bac6c7e 100644
--- a/xen/arch/arm/p2m.c
+++ b/xen/arch/arm/p2m.c
@@ -154,6 +154,46 @@ paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
return p2m.maddr;
}
+static int pin_one_pte(lpae_t *ptep, void *arg, int level)
+{
+ lpae_t pte = *ptep;
+ ASSERT(level == 3);
+
+ if ( pte.p2m.avail & P2M_DMA_PIN )
+ return -EBUSY;
+ pte.p2m.avail |= P2M_DMA_PIN;
+ write_pte(ptep, pte);
+ return 0;
+}
+
+int guest_physmap_pin_range(struct domain *d,
+ xen_pfn_t gpfn,
+ unsigned int order)
+{
+ return p2m_walker(d, gpfn << PAGE_SHIFT, order,
+ pin_one_pte, NULL);
+}
+
+static int unpin_one_pte(lpae_t *ptep, void *arg, int level)
+{
+ lpae_t pte = *ptep;
+ ASSERT(level == 3);
+
+ if ( !pte.p2m.avail & P2M_DMA_PIN )
+ return -EBUSY;
+ pte.p2m.avail &= ~P2M_DMA_PIN;
+ write_pte(ptep, pte);
+ return 0;
+}
+
+int guest_physmap_unpin_range(struct domain *d,
+ xen_pfn_t gpfn,
+ unsigned int order)
+{
+ return p2m_walker(d, gpfn << PAGE_SHIFT, order,
+ unpin_one_pte, NULL);
+}
+
int guest_physmap_mark_populate_on_demand(struct domain *d,
unsigned long gfn,
unsigned int order)
@@ -263,6 +303,14 @@ static int create_p2m_entries(struct domain *d,
cur_second_offset = second_table_offset(addr);
}
+ if ( third[third_table_offset(addr)].p2m.avail & P2M_DMA_PIN )
+ {
+ rc = -EINVAL;
+ gdprintk(XENLOG_WARNING, "cannot change p2m mapping for paddr=%"PRIpaddr
+ " domid=%d, the page is pinned\n", addr, d->domain_id);
+ goto out;
+ }
+
flush = third[third_table_offset(addr)].p2m.valid;
/* Allocate a new RAM page and attach */
diff --git a/xen/include/asm-arm/mm.h b/xen/include/asm-arm/mm.h
index 0b0a457..d6ae3ad 100644
--- a/xen/include/asm-arm/mm.h
+++ b/xen/include/asm-arm/mm.h
@@ -314,6 +314,10 @@ void free_init_memory(void);
int guest_physmap_mark_populate_on_demand(struct domain *d, unsigned long gfn,
unsigned int order);
+int guest_physmap_pin_range(struct domain *d, xen_pfn_t gpfn,
+ unsigned int order);
+int guest_physmap_unpin_range(struct domain *d, xen_pfn_t gpfn,
+ unsigned int order);
extern void put_page_type(struct page_info *page);
static inline void put_page_and_type(struct page_info *page)
diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
index 41e9eff..c08cfca 100644
--- a/xen/include/asm-arm/page.h
+++ b/xen/include/asm-arm/page.h
@@ -153,11 +153,16 @@ typedef struct {
unsigned long hint:1; /* In a block of 16 contiguous entries */
unsigned long sbz2:1;
unsigned long xn:1; /* eXecute-Never */
- unsigned long avail:4; /* Ignored by hardware */
+ unsigned long avail:4; /* Ignored by hardware, see below */
unsigned long sbz1:5;
} __attribute__((__packed__)) lpae_p2m_t;
+/* Xen "avail" bits allocation in third level entries */
+#define P2M_DMA_PIN (1<<0) /* The page has been "pinned": the hypervisor
+ promises not to change the p2m mapping.
+ Only normal r/w guest RAM can be pinned. */
+
/*
* Walk is the common bits of p2m and pt entries which are needed to
* simply walk the table (e.g. for debug).
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index 43583b2..b08a722 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -492,6 +492,18 @@ void guest_physmap_remove_page(struct domain *d,
/* Set a p2m range as populate-on-demand */
int guest_physmap_mark_populate_on_demand(struct domain *d, unsigned long gfn,
unsigned int order);
+static inline int guest_physmap_pin_range(struct domain *d,
+ xen_pfn_t gpfn,
+ unsigned int order)
+{
+ return -ENOSYS;
+}
+static inline int guest_physmap_unpin_range(struct domain *d,
+ xen_pfn_t gpfn,
+ unsigned int order)
+{
+ return -ENOSYS;
+}
/* Change types across all p2m entries in a domain */
void p2m_change_entry_type_global(struct domain *d,
--
1.7.2.5
* [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-09 16:04 [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
` (2 preceding siblings ...)
2013-09-09 16:06 ` [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range Stefano Stabellini
@ 2013-09-09 16:06 ` Stefano Stabellini
2013-09-10 9:47 ` Ian Campbell
2013-09-10 12:15 ` Jan Beulich
3 siblings, 2 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-09 16:06 UTC (permalink / raw)
To: xen-devel; +Cc: tim, Ian.Campbell, Stefano Stabellini
Introduce two new hypercalls XENMEM_exchange_and_pin and XENMEM_unpin.
XENMEM_exchange_and_pin is like XENMEM_exchange, but it also pins the
new pages: their p2m mappings are guaranteed not to change until
XENMEM_unpin is called. XENMEM_exchange_and_pin returns the DMA frame
numbers of the new pages to the caller, even if the caller is an
autotranslated guest.
The only effect of XENMEM_unpin is to "unpin" the previously
pinned pages. Afterwards the p2m mappings can be transparently changed by
the hypervisor as normal. The memory remains accessible from the guest.
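As an illustration of the partial-success reporting, a guest could tear
such a buffer down along these lines (again only a sketch;
unpin_dma_buffer is a hypothetical wrapper, while xen_unpin and
nr_unpinned are the structures introduced by this patch):

    static int unpin_dma_buffer(xen_pfn_t *gfns, unsigned int nr)
    {
        struct xen_unpin unpin = {
            .in = {
                .nr_extents   = nr,
                .extent_order = 0,
                .domid        = DOMID_SELF,
            },
        };
        int rc;

        set_xen_guest_handle(unpin.in.extent_start, gfns);

        rc = HYPERVISOR_memory_op(XENMEM_unpin, &unpin);
        if ( rc )
            /* Only the first unpin.nr_unpinned extents were unpinned;
             * the remaining ones are still pinned and must be retried. */
            printk("unpin: only %lu of %u extents unpinned (rc=%d)\n",
                   (unsigned long)unpin.nr_unpinned, nr, rc);
        return rc;
    }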
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Changes in v5:
- memory_exchange: handle guest_physmap_pin_range failures;
- make i an unsigned long in unpinned;
- add nr_unpinned to xen_unpin to report partial success.
Changes in v4:
- rename XENMEM_get_dma_buf to XENMEM_exchange_and_pin;
- rename XENMEM_get_dma_buf to XENMEM_unpin;
- move the pinning before we copy back the mfn to the guest;
- propagate errors returned by guest_physmap_pin_range;
- use xen_memory_exchange_t as parameter for XENMEM_exchange_and_pin;
- use an unsigned iterator in unpin;
- improve the documentation of the new hypercalls;
- add a note about out.address_bits for XENMEM_exchange.
---
xen/common/memory.c | 141 +++++++++++++++++++++++++++++++++----------
xen/include/public/memory.h | 47 ++++++++++++++
2 files changed, 156 insertions(+), 32 deletions(-)
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 4b2f311..34169c1 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -364,14 +364,14 @@ static void decrease_reservation(struct memop_args *a)
a->nr_done = i;
}
-static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
+static long memory_exchange(int op, XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
{
struct xen_memory_exchange exch;
PAGE_LIST_HEAD(in_chunk_list);
PAGE_LIST_HEAD(out_chunk_list);
unsigned long in_chunk_order, out_chunk_order;
xen_pfn_t gpfn, gmfn, mfn;
- unsigned long i, j, k = 0; /* gcc ... */
+ unsigned long i = 0, j = 0, k = 0; /* gcc ... */
unsigned int memflags = 0;
long rc = 0;
struct domain *d;
@@ -542,54 +542,72 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
/* Assign each output page to the domain. */
for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
{
- if ( assign_pages(d, page, exch.out.extent_order,
- MEMF_no_refcount) )
- {
- unsigned long dec_count;
- bool_t drop_dom_ref;
-
- /*
- * Pages in in_chunk_list is stolen without
- * decreasing the tot_pages. If the domain is dying when
- * assign pages, we need decrease the count. For those pages
- * that has been assigned, it should be covered by
- * domain_relinquish_resources().
- */
- dec_count = (((1UL << exch.in.extent_order) *
- (1UL << in_chunk_order)) -
- (j * (1UL << exch.out.extent_order)));
-
- spin_lock(&d->page_alloc_lock);
- domain_adjust_tot_pages(d, -dec_count);
- drop_dom_ref = (dec_count && !d->tot_pages);
- spin_unlock(&d->page_alloc_lock);
-
- if ( drop_dom_ref )
- put_domain(d);
-
- free_domheap_pages(page, exch.out.extent_order);
- goto dying;
- }
+ unsigned long dec_count;
+ bool_t drop_dom_ref;
if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
(i << out_chunk_order) + j, 1) )
{
rc = -EFAULT;
- continue;
+ goto extent_error;
}
mfn = page_to_mfn(page);
guest_physmap_add_page(d, gpfn, mfn, exch.out.extent_order);
+ if ( op == XENMEM_exchange_and_pin )
+ {
+ if ( guest_physmap_pin_range(d, gpfn, exch.out.extent_order) )
+ {
+ rc = -EFAULT;
+ goto extent_error_physmap;
+ }
+ }
+
if ( !paging_mode_translate(d) )
{
for ( k = 0; k < (1UL << exch.out.extent_order); k++ )
set_gpfn_from_mfn(mfn + k, gpfn + k);
+ }
+
+ rc = assign_pages(d, page, exch.out.extent_order, MEMF_no_refcount);
+ if ( rc )
+ goto extent_error_physmap;
+
+ if ( op == XENMEM_exchange_and_pin || !paging_mode_translate(d) )
+ {
if ( __copy_to_guest_offset(exch.out.extent_start,
(i << out_chunk_order) + j,
&mfn, 1) )
rc = -EFAULT;
}
+
+ continue;
+
+extent_error_physmap:
+ guest_physmap_remove_page(d, gpfn, mfn, exch.out.extent_order);
+extent_error:
+ /*
+ * Pages in in_chunk_list is stolen without
+ * decreasing the tot_pages. If the domain is dying when
+ * assign pages, we need decrease the count. For those pages
+ * that has been assigned, it should be covered by
+ * domain_relinquish_resources().
+ */
+ dec_count = (((1UL << exch.in.extent_order) *
+ (1UL << in_chunk_order)) -
+ (j * (1UL << exch.out.extent_order)));
+
+ spin_lock(&d->page_alloc_lock);
+ domain_adjust_tot_pages(d, -dec_count);
+ drop_dom_ref = (dec_count && !d->tot_pages);
+ spin_unlock(&d->page_alloc_lock);
+
+ if ( drop_dom_ref )
+ put_domain(d);
+
+ free_domheap_pages(page, exch.out.extent_order);
+ goto dying;
}
BUG_ON( !(d->is_dying) && (j != (1UL << out_chunk_order)) );
}
@@ -627,6 +645,60 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
return rc;
}
+static long unpin(XEN_GUEST_HANDLE_PARAM(xen_unpin_t) arg)
+{
+ int rc;
+ unsigned long i;
+ struct xen_unpin unpin;
+ xen_pfn_t gpfn;
+ struct domain *d;
+
+ if ( copy_from_guest(&unpin, arg, 1) )
+ return -EFAULT;
+
+ /* Various sanity checks. */
+ if ( /* Extent orders are sensible? */
+ (unpin.in.extent_order > MAX_ORDER) ||
+ /* Sizes of input list do not overflow a long? */
+ ((~0UL >> unpin.in.extent_order) < unpin.in.nr_extents) )
+ return -EFAULT;
+
+ if ( !guest_handle_okay(unpin.in.extent_start, unpin.in.nr_extents) )
+ return -EFAULT;
+
+ d = rcu_lock_domain_by_any_id(unpin.in.domid);
+ if ( d == NULL )
+ {
+ rc = -ESRCH;
+ goto fail;
+ }
+
+ for ( i = 0; i < unpin.in.nr_extents; i++ )
+ {
+ if ( unlikely(__copy_from_guest_offset(
+ &gpfn, unpin.in.extent_start, i, 1)) )
+ {
+ rc = -EFAULT;
+ goto partial;
+ }
+
+ rc = guest_physmap_unpin_range(d, gpfn, unpin.in.extent_order);
+ if ( rc )
+ goto partial;
+ }
+
+ rc = 0;
+
+partial:
+ unpin.nr_unpinned = i;
+ if ( __copy_field_to_guest(arg, &unpin, nr_unpinned) )
+ rc = -EFAULT;
+
+ fail:
+ rcu_unlock_domain(d);
+ return rc;
+}
+
long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
{
struct domain *d;
@@ -715,8 +787,13 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
break;
+ case XENMEM_exchange_and_pin:
case XENMEM_exchange:
- rc = memory_exchange(guest_handle_cast(arg, xen_memory_exchange_t));
+ rc = memory_exchange(op, guest_handle_cast(arg, xen_memory_exchange_t));
+ break;
+
+ case XENMEM_unpin:
+ rc = unpin(guest_handle_cast(arg, xen_unpin_t));
break;
case XENMEM_maximum_ram_page:
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 7a26dee..73fdc31 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -105,6 +105,7 @@ struct xen_memory_exchange {
/*
* [IN] Details of memory extents to be exchanged (GMFN bases).
* Note that @in.address_bits is ignored and unused.
+ * @out.address_bits should contain the address mask for the new pages.
*/
struct xen_memory_reservation in;
@@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
* The zero value is appropiate.
*/
+#define XENMEM_exchange_and_pin 26
+/*
+ * This hypercall is similar to XENMEM_exchange: it takes the same
+ * struct as an argument and it exchanges the pages passed in with a new
+ * set of pages. The new pages are going to be "pinned": it's guaranteed
+ * that their p2m mapping won't be changed until explicitly "unpinned".
+ * The content of the exchanged pages is lost.
+ * Only normal guest r/w memory can be pinned: no granted pages or
+ * ballooned pages.
+ * If return code is zero then @out.extent_list provides the DMA frame
+ * numbers of the newly-allocated memory.
+ * Returns zero on complete success, otherwise a negative error code:
+ * -ENOSYS if not implemented
+ * -EINVAL if the page is already pinned
+ * -EFAULT if an internal error occurs
+ * On complete success then always @nr_exchanged == @in.nr_extents. On
+ * partial success @nr_exchanged indicates how much work was done and a
+ * negative error code is returned.
+ */
+
+#define XENMEM_unpin 27
+/*
+ * XENMEM_unpin unpins a set of pages, previously pinned by
+ * XENMEM_exchange_and_pin. After this call the p2m mapping of the pages can
+ * be transparently changed by the hypervisor, as usual. The pages are
+ * still accessible from the guest.
+ */
+struct xen_unpin {
+ /*
+ * [IN] Details of memory extents to be unpinned (GMFN bases).
+ * Note that @in.address_bits is ignored and unused.
+ */
+ struct xen_memory_reservation in;
+ /*
+ * [OUT] Number of input extents that were successfully unpinned.
+ * 1. The first @nr_unpinned input extents were successfully
+ * unpinned.
+ * 2. All other input extents are untouched.
+ * 3. If not all input exents are unpinned then the return code of this
+ * command will be non-zero.
+ */
+ xen_ulong_t nr_unpinned;
+};
+typedef struct xen_unpin xen_unpin_t;
+DEFINE_XEN_GUEST_HANDLE(xen_unpin_t);
+
#endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
#endif /* __XEN_PUBLIC_MEMORY_H__ */
--
1.7.2.5
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
@ 2013-09-10 9:08 ` Ian Campbell
2013-09-10 10:39 ` Jan Beulich
1 sibling, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 9:08 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: keir, xen-devel, tim, JBeulich
On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> Only code movement, except for a small change to the warning printed in
> case of an error in donate_page.
>
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> CC: keir@xen.org
> CC: JBeulich@suse.com
Not much to it on the ARM side but:
Acked-by: Ian Campbell <ian.campbell@citrix.com>
* Re: [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup
2013-09-09 16:06 ` [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup Stefano Stabellini
@ 2013-09-10 9:26 ` Ian Campbell
2013-09-10 18:00 ` Stefano Stabellini
0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 9:26 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>
>
> Changes in v5:
> - align tests;
> - comment p2m_walker;
> - fix return codes in p2m_walker;
> - handle superpages in p2m_walker;
> - rename _p2m_lookup to p2m_lookup_f.
> ---
> xen/arch/arm/p2m.c | 131 +++++++++++++++++++++++++++++++++++++++++----------
> 1 files changed, 105 insertions(+), 26 deletions(-)
>
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index 307c6d4..a9ceacf 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -31,48 +31,127 @@ void p2m_load_VTTBR(struct domain *d)
> }
>
> /*
> - * Lookup the MFN corresponding to a domain's PFN.
> + * d: domain p2m to walk
> + * paddr: the guest start physical address
> + * order: page order
> + * func: function to call for each 2-stage lpae_t entry found
s/2-stage/stage-2/
Also it is called for each *leaf* entry, not for all of them. It could
be argued that either was a useful interface in general.
This function is not actually all that p2m specific in the end: by using
lpae_t.walk instead of lpae_t.pt (see dump_pt_walk) you could make it
useful for hyp pagetable walks too (e.g. dump_hyp_walk). In theory it
could also be used for LPAE guest walks, but we'd need a separate walker
for non-LPAE guests.
If we wanted to replace dump_hyp_walk with this then calling the
callback for each level would be required.
> + * arg: opaque pointer to pass to func
> *
> - * There are no processor functions to do a stage 2 only lookup therefore we
> - * do a a software walk.
> */
> -paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> +static int p2m_walker(struct domain *d, paddr_t paddr, unsigned int order,
> + int (*func)(lpae_t *pte, void *arg, int level), void* arg)
> {
> + lpae_t *first = NULL, *second = NULL, *third = NULL;
> struct p2m_domain *p2m = &d->arch.p2m;
> - lpae_t pte, *first = NULL, *second = NULL, *third = NULL;
> - paddr_t maddr = INVALID_PADDR;
> + int rc = -EFAULT, level = 1;
> + unsigned long cur_first_offset = ~0, cur_second_offset = ~0;
> + paddr_t pend = paddr + (1UL << order);
1UL is 32 bits on arm32, but paddr_t is 64 bits, so this might fail for
large enough order?
I think you need something like: (((paddr_t)1UL)<<order). Or maybe a
helpful macro.
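For example (the helper name is purely illustrative):

    #define order_to_size(order) ((paddr_t)1 << (order))

    paddr_t pend = paddr + order_to_size(order);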
>
> spin_lock(&p2m->lock);
>
> first = __map_domain_page(p2m->first_level);
>
> - pte = first[first_table_offset(paddr)];
> - if ( !pte.p2m.valid || !pte.p2m.table )
> - goto done;
> + if ( !first ||
> + !first[first_table_offset(paddr)].p2m.valid )
> + goto err;
Why is the first level handled specially outside the loop? What happens
if order is such that we span multiple first level entries?
> - second = map_domain_page(pte.p2m.base);
> - pte = second[second_table_offset(paddr)];
> - if ( !pte.p2m.valid || !pte.p2m.table )
> - goto done;
> + if ( !first[first_table_offset(paddr)].p2m.table )
> + {
> + rc = func(&first[first_table_offset(paddr)], arg, level);
> + if ( rc != 0 )
> + goto err;
> + paddr += FIRST_SIZE;
> + }
>
> - third = map_domain_page(pte.p2m.base);
> - pte = third[third_table_offset(paddr)];
> + while ( paddr < pend )
> + {
> + rc = -EFAULT;
> + level = 1;
I think this and all the level++ stuff could be replaced by literal 1,
2, and 3 at the appropriate callback sites. The level is statically
implied by the code.
>
> - /* This bit must be one in the level 3 entry */
> - if ( !pte.p2m.table )
> - pte.bits = 0;
> + if ( cur_first_offset != first_table_offset(paddr) )
> + {
> + if (second) unmap_domain_page(second);
> + second = map_domain_page(first[first_table_offset(paddr)].p2m.base);
> + cur_first_offset = first_table_offset(paddr);
> + }
> + level++;
> + if ( !second ||
ASSERT(second) seems reasonable here, I think, since it would indicate
we had screwed up the p2m pretty badly. That would...
> + !second[second_table_offset(paddr)].p2m.valid )
> + goto err;
... leave the -EFAULT resulting from this for a failure of the actual
walk itself.
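i.e. roughly:

    ASSERT(second);                               /* p2m corruption otherwise */
    if ( !second[second_table_offset(paddr)].p2m.valid )
        goto err;                                 /* genuine walk failure -> -EFAULT */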
> + if ( !second[second_table_offset(paddr)].p2m.table )
> + {
> + rc = func(&first[first_table_offset(paddr)], arg, level);
> + if ( rc != 0 )
> + goto err;
> + paddr += SECOND_SIZE;
You need to continue here, don't you? Otherwise you will look into the
3rd level for the paddr + SECOND_SIZE address before you should have
done so.
Likewise for the 1st level stuff once it is properly within the loop (I
wonder if this is what caused you to pull that out?)
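A sketch of the shape being suggested for the second-level block case
(note it also passes the second-level entry to the callback, rather than
the first-level one as in the posted hunk, and uses the literal level as
mentioned above):

    if ( !second[second_table_offset(paddr)].p2m.table )
    {
        /* second-level block mapping: call back on this entry itself */
        rc = func(&second[second_table_offset(paddr)], arg, 2);
        if ( rc != 0 )
            goto err;
        paddr += SECOND_SIZE;
        continue;   /* don't also descend into a (non-existent) 3rd level */
    }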
> + }
>
> -done:
> - if ( pte.p2m.valid )
> - maddr = (pte.bits & PADDR_MASK & PAGE_MASK) | (paddr & ~PAGE_MASK);
> + if ( cur_second_offset != second_table_offset(paddr) )
> + {
> + if (third) unmap_domain_page(third);
> + third = map_domain_page(second[second_table_offset(paddr)].p2m.base);
> + cur_second_offset = second_table_offset(paddr);
> + }
> + level++;
> + if ( !third ||
> + !third[third_table_offset(paddr)].p2m.valid )
> + goto err;
If third...table is 0 then that is an error too, iff valid == 1.
>
> - if (third) unmap_domain_page(third);
> - if (second) unmap_domain_page(second);
> - if (first) unmap_domain_page(first);
> + rc = func(&third[third_table_offset(paddr)], arg, level);
> + if ( rc != 0 )
> + goto err;
> +
> + paddr += PAGE_SIZE;
> + }
> +
> + rc = 0;
> +
> +err:
> + if ( third ) unmap_domain_page(third);
> + if ( second ) unmap_domain_page(second);
> + if ( first ) unmap_domain_page(first);
>
> spin_unlock(&p2m->lock);
>
> - return maddr;
> + return rc;
> +}
> +
> +struct p2m_lookup_t {
> + paddr_t paddr;
> + paddr_t maddr;
> +};
> +
> +static int p2m_lookup_f(lpae_t *ptep, void *arg, int level)
> +{
> + lpae_t pte;
> + struct p2m_lookup_t *p2m = (struct p2m_lookup_t *)arg;
> + ASSERT(level == 3);
> +
> + pte = *ptep;
> +
> + /* This bit must be one in the level 3 entry */
> + if ( !pte.p2m.table || !pte.p2m.valid )
> + return -EFAULT;
Ah, here's that check I was talking about earlier -- you could make this
generic I think.
> +
> + p2m->maddr = (pte.bits & PADDR_MASK & PAGE_MASK) |
> + (p2m->paddr & ~PAGE_MASK);
> + return 0;
> +}
> +/*
> + * Lookup the MFN corresponding to a domain's PFN.
> + *
> + * There are no processor functions to do a stage 2 only lookup therefore we
> + * do a a software walk.
> + */
> +paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> +{
> + struct p2m_lookup_t p2m;
> + p2m.paddr = paddr;
> + p2m.maddr = INVALID_PADDR;
> +
> + p2m_walker(d, paddr, 0, p2m_lookup_f, &p2m);
> +
> + return p2m.maddr;
> }
>
> int guest_physmap_mark_populate_on_demand(struct domain *d,
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-09 16:06 ` [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range Stefano Stabellini
@ 2013-09-10 9:30 ` Ian Campbell
2013-09-10 19:11 ` Stefano Stabellini
2013-09-10 9:33 ` Ian Campbell
1 sibling, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 9:30 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> guest_physmap_pin_range pins a range of guest pages so that their p2m
> mappings won't be changed.
> guest_physmap_unpin_range unpins the previously pinned pages.
> The pinning is done using one of the spare bits in the p2m ptes.
>
> Use the newly introduce p2m_walker to implement the two functions on
> ARM.
> Provide empty stubs for x86.
>
>
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
>
>
> Changes in v5:
> - return -EBUSY when the P2M_DMA_PIN check fails;
> - rename _guest_physmap_pin_range to pin_one_pte;
> - rename _guest_physmap_unpin_range to unpin_one_pte.
>
> Changes in v4:
> - use p2m_walker to implement guest_physmap_pin_range and
> guest_physmap_unpin_range;
> - return -EINVAL when the P2M_DMA_PIN check fails;
> - change the printk into a gdprintk;
> - add a comment on what type of page can be pinned.
> ---
> xen/arch/arm/p2m.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> xen/include/asm-arm/mm.h | 4 +++
> xen/include/asm-arm/page.h | 7 +++++-
> xen/include/asm-x86/p2m.h | 12 +++++++++++
> 4 files changed, 70 insertions(+), 1 deletions(-)
>
> diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> index a9ceacf..bac6c7e 100644
> --- a/xen/arch/arm/p2m.c
> +++ b/xen/arch/arm/p2m.c
> @@ -154,6 +154,46 @@ paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> return p2m.maddr;
> }
>
> +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> +{
> + lpae_t pte = *ptep;
> + ASSERT(level == 3);
> +
> + if ( pte.p2m.avail & P2M_DMA_PIN )
> + return -EBUSY;
> + pte.p2m.avail |= P2M_DMA_PIN;
> + write_pte(ptep, pte);
> + return 0;
> +}
> +
> +int guest_physmap_pin_range(struct domain *d,
> + xen_pfn_t gpfn,
> + unsigned int order)
> +{
> + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> + pin_one_pte, NULL);
If this fails then you will have left some subset of the pages
successfully pinned. You need to clean it up I think, or be fatal to the
guest or something.
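One possible shape for that cleanup (purely a sketch, not part of the
series; it assumes the unpin callback tolerates entries that were never
pinned, which unpin_one_pte as posted does not):

    int guest_physmap_pin_range(struct domain *d, xen_pfn_t gpfn,
                                unsigned int order)
    {
        int rc = p2m_walker(d, gpfn << PAGE_SHIFT, order, pin_one_pte, NULL);

        if ( rc )
            /* best-effort unwind of whatever was pinned before the failure */
            p2m_walker(d, gpfn << PAGE_SHIFT, order, unpin_one_pte, NULL);

        return rc;
    }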
> +}
> +
> +static int unpin_one_pte(lpae_t *ptep, void *arg, int level)
> +{
> + lpae_t pte = *ptep;
> + ASSERT(level == 3);
> +
> + if ( !pte.p2m.avail & P2M_DMA_PIN )
> + return -EBUSY;
The error here is that it isn't busy. EINVAL perhaps?
> + pte.p2m.avail &= ~P2M_DMA_PIN;
> + write_pte(ptep, pte);
> + return 0;
> +}
> +
> +int guest_physmap_unpin_range(struct domain *d,
> + xen_pfn_t gpfn,
> + unsigned int order)
> +{
> + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> + unpin_one_pte, NULL);
> +}
> +
> int guest_physmap_mark_populate_on_demand(struct domain *d,
> unsigned long gfn,
> unsigned int order)
> @@ -263,6 +303,14 @@ static int create_p2m_entries(struct domain *d,
> cur_second_offset = second_table_offset(addr);
> }
>
> + if ( third[third_table_offset(addr)].p2m.avail & P2M_DMA_PIN )
> + {
> + rc = -EINVAL;
> + gdprintk(XENLOG_WARNING, "cannot change p2m mapping for paddr=%"PRIpaddr
> + " domid=%d, the page is pinned\n", addr, d->domain_id);
> + goto out;
> + }
> +
> flush = third[third_table_offset(addr)].p2m.valid;
>
> /* Allocate a new RAM page and attach */
> diff --git a/xen/include/asm-arm/page.h b/xen/include/asm-arm/page.h
> index 41e9eff..c08cfca 100644
> --- a/xen/include/asm-arm/page.h
> +++ b/xen/include/asm-arm/page.h
> @@ -153,11 +153,16 @@ typedef struct {
> unsigned long hint:1; /* In a block of 16 contiguous entries */
> unsigned long sbz2:1;
> unsigned long xn:1; /* eXecute-Never */
> - unsigned long avail:4; /* Ignored by hardware */
> + unsigned long avail:4; /* Ignored by hardware, see below */
>
> unsigned long sbz1:5;
> } __attribute__((__packed__)) lpae_p2m_t;
>
> +/* Xen "avail" bits allocation in third level entries */
> +#define P2M_DMA_PIN (1<<0) /* The page has been "pinned": the hypervisor
> + promises not to change the p2m mapping.
> + Only normal r/w guest RAM can be pinned. */
Where is that last requirement enforced?
> +
> /*
> * Walk is the common bits of p2m and pt entries which are needed to
> * simply walk the table (e.g. for debug).
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-09 16:06 ` [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range Stefano Stabellini
2013-09-10 9:30 ` Ian Campbell
@ 2013-09-10 9:33 ` Ian Campbell
2013-09-27 15:23 ` Stefano Stabellini
1 sibling, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 9:33 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
>
> +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> +{
> + lpae_t pte = *ptep;
> + ASSERT(level == 3);
> +
> + if ( pte.p2m.avail & P2M_DMA_PIN )
> + return -EBUSY;
> + pte.p2m.avail |= P2M_DMA_PIN;
> + write_pte(ptep, pte);
> + return 0;
> +}
> +
> +int guest_physmap_pin_range(struct domain *d,
> + xen_pfn_t gpfn,
> + unsigned int order)
> +{
> + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> + pin_one_pte, NULL);
Did we not also discuss accounting and limits on the amount of memory a
guest can lock down?
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-09 16:06 ` [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
@ 2013-09-10 9:47 ` Ian Campbell
2013-09-10 12:15 ` Jan Beulich
1 sibling, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 9:47 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Keir Fraser, xen-devel, tim, Jan Beulich
On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> Introduce two new hypercalls XENMEM_exchange_and_pin and XENMEM_unpin.
>
> XENMEM_exchange_and_pin, it's like XENMEM_exchange but it also pins the
> new pages: their p2m mapping are guaranteed not to be changed, until
> XENMEM_unpin is called. XENMEM_exchange_and_pin returns the DMA frame
> numbers of the new pages to the caller, even if it's an autotranslate
> guest.
>
> The only effect of XENMEM_unpin is to "unpin" the previously
> pinned pages. Afterwards the p2m mappings can be transparently changed by
> the hypervisor as normal. The memory remains accessible from the guest.
>
> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
This needs input from the x86 and common code maintainers, who you've
not ccd. I've added them.
>
>
> Changes in v5:
> - memory_exchange: handle guest_physmap_pin_range failures;
> - make i an unsigned long in unpinned;
> - add nr_unpinned to xen_unpin to report partial success.
>
> Changes in v4:
> - rename XENMEM_get_dma_buf to XENMEM_exchange_and_pin;
> - rename XENMEM_get_dma_buf to XENMEM_unpin;
> - move the pinning before we copy back the mfn to the guest;
> - propagate errors returned by guest_physmap_pin_range;
> - use xen_memory_exchange_t as parameter for XENMEM_exchange_and_pin;
> - use an unsigned iterator in unpin;
> - improve the documentation of the new hypercalls;
> - add a note about out.address_bits for XENMEM_exchange.
> ---
> xen/common/memory.c | 141 +++++++++++++++++++++++++++++++++----------
> xen/include/public/memory.h | 47 ++++++++++++++
> 2 files changed, 156 insertions(+), 32 deletions(-)
>
> diff --git a/xen/common/memory.c b/xen/common/memory.c
> index 4b2f311..34169c1 100644
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -364,14 +364,14 @@ static void decrease_reservation(struct memop_args *a)
> a->nr_done = i;
> }
>
> -static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> +static long memory_exchange(int op, XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> {
> struct xen_memory_exchange exch;
> PAGE_LIST_HEAD(in_chunk_list);
> PAGE_LIST_HEAD(out_chunk_list);
> unsigned long in_chunk_order, out_chunk_order;
> xen_pfn_t gpfn, gmfn, mfn;
> - unsigned long i, j, k = 0; /* gcc ... */
> + unsigned long i = 0, j = 0, k = 0; /* gcc ... */
> unsigned int memflags = 0;
> long rc = 0;
> struct domain *d;
> @@ -542,54 +542,72 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> /* Assign each output page to the domain. */
> for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
> {
> - if ( assign_pages(d, page, exch.out.extent_order,
> - MEMF_no_refcount) )
> - {
> - unsigned long dec_count;
> - bool_t drop_dom_ref;
> -
> - /*
> - * Pages in in_chunk_list is stolen without
> - * decreasing the tot_pages. If the domain is dying when
> - * assign pages, we need decrease the count. For those pages
> - * that has been assigned, it should be covered by
> - * domain_relinquish_resources().
> - */
> - dec_count = (((1UL << exch.in.extent_order) *
> - (1UL << in_chunk_order)) -
> - (j * (1UL << exch.out.extent_order)));
> -
> - spin_lock(&d->page_alloc_lock);
> - domain_adjust_tot_pages(d, -dec_count);
> - drop_dom_ref = (dec_count && !d->tot_pages);
> - spin_unlock(&d->page_alloc_lock);
> -
> - if ( drop_dom_ref )
> - put_domain(d);
> -
> - free_domheap_pages(page, exch.out.extent_order);
> - goto dying;
> - }
> + unsigned long dec_count;
> + bool_t drop_dom_ref;
>
> if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
> (i << out_chunk_order) + j, 1) )
> {
> rc = -EFAULT;
> - continue;
> + goto extent_error;
> }
>
> mfn = page_to_mfn(page);
> guest_physmap_add_page(d, gpfn, mfn, exch.out.extent_order);
>
> + if ( op == XENMEM_exchange_and_pin )
> + {
> + if ( guest_physmap_pin_range(d, gpfn, exch.out.extent_order) )
> + {
> + rc = -EFAULT;
> + goto extent_error_physmap;
> + }
> + }
> +
> if ( !paging_mode_translate(d) )
> {
> for ( k = 0; k < (1UL << exch.out.extent_order); k++ )
> set_gpfn_from_mfn(mfn + k, gpfn + k);
> + }
> +
> + rc = assign_pages(d, page, exch.out.extent_order, MEMF_no_refcount);
This has moved after set_gpfn_from_mfn in the exchange case? Why?
Aren't we now pinning and setting the gpfn for pages which don't yet
belong to the guest?
I think any refactoring of error paths etc would be best done as a
separate patch to adding the new hypercall.
> + if ( rc )
> + goto extent_error_physmap;
> +
> + if ( op == XENMEM_exchange_and_pin || !paging_mode_translate(d) )
> + {
> if ( __copy_to_guest_offset(exch.out.extent_start,
> (i << out_chunk_order) + j,
> &mfn, 1) )
> rc = -EFAULT;
After this we continue rather than erroring?
> }
> +
> + continue;
> +
> +extent_error_physmap:
> + guest_physmap_remove_page(d, gpfn, mfn, exch.out.extent_order);
> +extent_error:
> + /*
> + * Pages in in_chunk_list is stolen without
> + * decreasing the tot_pages. If the domain is dying when
> + * assign pages, we need decrease the count. For those pages
> + * that has been assigned, it should be covered by
> + * domain_relinquish_resources().
> + */
> + dec_count = (((1UL << exch.in.extent_order) *
> + (1UL << in_chunk_order)) -
> + (j * (1UL << exch.out.extent_order)));
> +
> + spin_lock(&d->page_alloc_lock);
> + domain_adjust_tot_pages(d, -dec_count);
> + drop_dom_ref = (dec_count && !d->tot_pages);
> + spin_unlock(&d->page_alloc_lock);
> +
> + if ( drop_dom_ref )
> + put_domain(d);
> +
> + free_domheap_pages(page, exch.out.extent_order);
> + goto dying;
> }
> BUG_ON( !(d->is_dying) && (j != (1UL << out_chunk_order)) );
> }
> @@ -627,6 +645,60 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> return rc;
> }
>
> +static long unpin(XEN_GUEST_HANDLE_PARAM(xen_unpin_t) arg)
> +{
> + int rc;
> + unsigned long i;
> + struct xen_unpin unpin;
> + xen_pfn_t gpfn;
> + struct domain *d;
> +
> + if ( copy_from_guest(&unpin, arg, 1) )
> + return -EFAULT;
> +
> + /* Various sanity checks. */
> + if ( /* Extent orders are sensible? */
> + (unpin.in.extent_order > MAX_ORDER) ||
> + /* Sizes of input list do not overflow a long? */
> + ((~0UL >> unpin.in.extent_order) < unpin.in.nr_extents) )
> + return -EFAULT;
> +
> + if ( !guest_handle_okay(unpin.in.extent_start, unpin.in.nr_extents) )
> + return -EFAULT;
> +
> + d = rcu_lock_domain_by_any_id(unpin.in.domid);
> + if ( d == NULL )
> + {
> + rc = -ESRCH;
> + goto fail;
> + }
> +
> + for ( i = 0; i < unpin.in.nr_extents; i++ )
> + {
> + if ( unlikely(__copy_from_guest_offset(
> + &gpfn, unpin.in.extent_start, i, 1)) )
> + {
> + rc = -EFAULT;
> + goto partial;
> + }
> +
> + rc = guest_physmap_unpin_range(d, gpfn, unpin.in.extent_order);
> + if ( rc )
> + goto partial;
> + }
> +
> + rc = 0;
> +
> +partial:
> + unpin.nr_unpinned = i;
> + if ( __copy_field_to_guest(arg, &unpin, nr_unpinned) )
> + rc = -EFAULT;
> +
> + fail:
> + rcu_unlock_domain(d);
> + return rc;
> +}
> +
> long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> {
> struct domain *d;
> @@ -715,8 +787,13 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
>
> break;
>
> + case XENMEM_exchange_and_pin:
> case XENMEM_exchange:
> - rc = memory_exchange(guest_handle_cast(arg, xen_memory_exchange_t));
> + rc = memory_exchange(op, guest_handle_cast(arg, xen_memory_exchange_t));
> + break;
> +
> + case XENMEM_unpin:
> + rc = unpin(guest_handle_cast(arg, xen_unpin_t));
> break;
>
> case XENMEM_maximum_ram_page:
> diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
> index 7a26dee..73fdc31 100644
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -105,6 +105,7 @@ struct xen_memory_exchange {
> /*
> * [IN] Details of memory extents to be exchanged (GMFN bases).
> * Note that @in.address_bits is ignored and unused.
> + * @out.address_bits should contain the address mask for the new pages.
> */
> struct xen_memory_reservation in;
>
> @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> * The zero value is appropiate.
> */
>
> +#define XENMEM_exchange_and_pin 26
All the other doc comments in this file precede their #define.
> +/*
> + * This hypercall is similar to XENMEM_exchange: it takes the same
> + * struct as an argument and it exchanges the pages passed in with a new
> + * set of pages. The new pages are going to be "pinned": it's guaranteed
> + * that their p2m mapping won't be changed until explicitly "unpinned".
> + * The content of the exchanged pages is lost.
> + * Only normal guest r/w memory can be pinned: no granted pages or
> + * ballooned pages.
> + * If return code is zero then @out.extent_list provides the DMA frame
> + * numbers of the newly-allocated memory.
> + * Returns zero on complete success, otherwise a negative error code:
> + * -ENOSYS if not implemented
> + * -EINVAL if the page is already pinned
Not -EBUSY?
> + * -EFAULT if an internal error occurs
> + * On complete success then always @nr_exchanged == @in.nr_extents. On
> + * partial success @nr_exchanged indicates how much work was done and a
> + * negative error code is returned.
> + */
> +
> +#define XENMEM_unpin 27
> +/*
> + * XENMEM_unpin unpins a set of pages, previously pinned by
> + * XENMEM_exchange_and_pin. After this call the p2m mapping of the pages can
> + * be transparently changed by the hypervisor, as usual. The pages are
> + * still accessible from the guest.
> + */
> +struct xen_unpin {
> + /*
> + * [IN] Details of memory extents to be unpinned (GMFN bases).
> + * Note that @in.address_bits is ignored and unused.
> + */
> + struct xen_memory_reservation in;
> + /*
> + * [OUT] Number of input extents that were successfully unpinned.
> + * 1. The first @nr_unpinned input extents were successfully
> + * unpinned.
Except if the code fails before doing any pinning (e.g. the domain
lookup fails), then it is arbitrarily whatever it was set to by the
guest or uninitialised (from the fail label in unpin()). There are error
codes (e.g. EFAULT) which are common to both cases so they are
indistinguishable.
> + * 2. All other input extents are untouched.
> + * 3. If not all input exents are unpinned then the return code of this
"extents"
> + * command will be non-zero.
> + */
> + xen_ulong_t nr_unpinned;
> +};
> +typedef struct xen_unpin xen_unpin_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_unpin_t);
> +
> #endif /* defined(__XEN__) || defined(__XEN_TOOLS__) */
>
> #endif /* __XEN_PUBLIC_MEMORY_H__ */
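(For illustration only: a minimal sketch of how a guest might drive the
two proposed hypercalls, assuming the HYPERVISOR_memory_op() and
set_xen_guest_handle() wrappers normally provided by guest OS headers,
and a single order-0 extent. None of this is code from the series, and
the .address_bits usage simply follows the comment under discussion.)

    xen_pfn_t gpfn = 0x1234;          /* hypothetical guest frame */
    xen_pfn_t mfn;                    /* filled in by the hypercall */
    struct xen_memory_exchange exch = {
        .in = {
            .nr_extents   = 1,
            .extent_order = 0,
            .domid        = DOMID_SELF,
        },
        .out = {
            .nr_extents   = 1,
            .extent_order = 0,
            .address_bits = 32,       /* e.g. a 32-bit DMA-limited device */
            .domid        = DOMID_SELF,
        },
    };
    struct xen_unpin unpin = {
        .in = {
            .nr_extents   = 1,
            .extent_order = 0,
            .domid        = DOMID_SELF,
        },
    };

    set_xen_guest_handle(exch.in.extent_start, &gpfn);
    set_xen_guest_handle(exch.out.extent_start, &mfn);
    if ( HYPERVISOR_memory_op(XENMEM_exchange_and_pin, &exch) == 0 )
    {
        /* mfn can now be handed to a device; gpfn still maps the buffer. */

        set_xen_guest_handle(unpin.in.extent_start, &gpfn);
        HYPERVISOR_memory_op(XENMEM_unpin, &unpin);
    }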
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
2013-09-10 9:08 ` Ian Campbell
@ 2013-09-10 10:39 ` Jan Beulich
2013-09-10 17:20 ` Stefano Stabellini
1 sibling, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2013-09-10 10:39 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, keir, Ian.Campbell, tim
>>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> --- a/xen/common/memory.c
> +++ b/xen/common/memory.c
> @@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
> return 1;
> }
>
> +int donate_page(
> + struct domain *d, struct page_info *page, unsigned int memflags)
> +{
> + spin_lock(&d->page_alloc_lock);
> +
> + if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
> + goto fail;
> +
> + if ( d->is_dying )
> + goto fail;
> +
> + if ( page->count_info & ~(PGC_allocated | 1) )
While the literal 1 was acceptable in arch specific code, I'm not sure
it really is in generic code. Nothing enforces that an arch has the
count starting at bit 0. Even if it reads a little odd, using
(-PGC_count_mask & PGC_count_mask) would probably be an
acceptable replacement here and further down.
> + goto fail;
> +
> + if ( !(memflags & MEMF_no_refcount) )
> + {
> + if ( d->tot_pages >= d->max_pages )
> + goto fail;
> + domain_adjust_tot_pages(d, 1);
> + }
> +
> + page->count_info = PGC_allocated | 1;
> + page_set_owner(page, d);
> + page_list_add_tail(page,&d->page_list);
Please use the opportunity to insert a blank after the comma,
especially since you're not doing a literal copy of the function
anyway.
> +
> + spin_unlock(&d->page_alloc_lock);
> + return 0;
> +
> + fail:
> + spin_unlock(&d->page_alloc_lock);
> + gdprintk(XENLOG_WARNING, "Bad donate %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%016lx",
> + (void *)page_to_mfn(page), d, d->domain_id,
> + page_get_owner(page), page->count_info, page->u.inuse.type_info);
One of the (unfortunately many) cases where gdprintk() is wrong:
You don't care about the current domain here.
Also casting page_to_mfn() to void * in common code seems fragile.
I wonder why this cast was there in the original code: Using %lx
should be quite fine here.
> + return -1;
> +}
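(A hedged sketch of what the failure message might become with both
points addressed, i.e. a guest-visible log level via plain printk()
instead of gdprintk(), and %lx instead of the void * cast. This is
only an illustration of the suggestion, not the eventual patch:)

    printk(XENLOG_G_WARNING
           "Bad donate mfn %lx: ed=d%d sd=%p caf=%08lx taf=%016lx\n",
           page_to_mfn(page), d->domain_id, page_get_owner(page),
           page->count_info, page->u.inuse.type_info);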
> +
> +int steal_page(
> + struct domain *d, struct page_info *page, unsigned int memflags)
> +{
> + unsigned long x, y;
> + bool_t drop_dom_ref = 0;
> +
> + spin_lock(&d->page_alloc_lock);
> +
> + if ( is_xen_heap_page(page) || (page_get_owner(page) != d) )
> + goto fail;
> +
> + /*
> + * We require there is just one reference (PGC_allocated). We temporarily
> + * drop this reference now so that we can safely swizzle the owner.
> + */
> + y = page->count_info;
> + do {
> + x = y;
> + if ( (x & (PGC_count_mask|PGC_allocated)) != (1 | PGC_allocated) )
> + goto fail;
> + y = cmpxchg(&page->count_info, x, x & ~PGC_count_mask);
> + } while ( y != x );
> +
> + /* Swizzle the owner then reinstate the PGC_allocated reference. */
> + page_set_owner(page, NULL);
> + y = page->count_info;
> + do {
> + x = y;
> + BUG_ON((x & (PGC_count_mask|PGC_allocated)) != PGC_allocated);
> + } while ( (y = cmpxchg(&page->count_info, x, x | 1)) != x );
> +
> + /* Unlink from original owner. */
> + if ( !(memflags & MEMF_no_refcount) && !domain_adjust_tot_pages(d, -1) )
> + drop_dom_ref = 1;
> + page_list_del(page, &d->page_list);
> +
> + spin_unlock(&d->page_alloc_lock);
> + if ( unlikely(drop_dom_ref) )
> + put_domain(d);
> + return 0;
> +
> + fail:
> + spin_unlock(&d->page_alloc_lock);
> + printk("Bad page %p: ed=%p(%u), sd=%p, caf=%08lx, taf=%lx\n",
> + (void *)page_to_mfn(page), d, d->domain_id,
> + page_get_owner(page), page->count_info, page->u.inuse.type_info);
This clearly lacks a XENLOG_G_<something>. It would also be
desirable to have the word "steal" in the message somewhere, to
make it less generic.
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-09 16:06 ` [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
2013-09-10 9:47 ` Ian Campbell
@ 2013-09-10 12:15 ` Jan Beulich
2013-09-10 12:50 ` Ian Campbell
2013-09-27 14:49 ` Stefano Stabellini
1 sibling, 2 replies; 32+ messages in thread
From: Jan Beulich @ 2013-09-10 12:15 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim, Ian.Campbell
>>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> +static long memory_exchange(int op, XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> {
> struct xen_memory_exchange exch;
> PAGE_LIST_HEAD(in_chunk_list);
> PAGE_LIST_HEAD(out_chunk_list);
> unsigned long in_chunk_order, out_chunk_order;
> xen_pfn_t gpfn, gmfn, mfn;
> - unsigned long i, j, k = 0; /* gcc ... */
> + unsigned long i = 0, j = 0, k = 0; /* gcc ... */
This is unnecessary afaict.
> @@ -542,54 +542,72 @@ static long memory_exchange(XEN_GUEST_HANDLE_PARAM(xen_memory_exchange_t) arg)
> /* Assign each output page to the domain. */
> for ( j = 0; (page = page_list_remove_head(&out_chunk_list)); ++j )
> {
> - if ( assign_pages(d, page, exch.out.extent_order,
> - MEMF_no_refcount) )
> - {
> - unsigned long dec_count;
> - bool_t drop_dom_ref;
> -
> - /*
> - * Pages in in_chunk_list is stolen without
> - * decreasing the tot_pages. If the domain is dying when
> - * assign pages, we need decrease the count. For those pages
> - * that has been assigned, it should be covered by
> - * domain_relinquish_resources().
> - */
> - dec_count = (((1UL << exch.in.extent_order) *
> - (1UL << in_chunk_order)) -
> - (j * (1UL << exch.out.extent_order)));
> -
> - spin_lock(&d->page_alloc_lock);
> - domain_adjust_tot_pages(d, -dec_count);
> - drop_dom_ref = (dec_count && !d->tot_pages);
> - spin_unlock(&d->page_alloc_lock);
> -
> - if ( drop_dom_ref )
> - put_domain(d);
> -
> - free_domheap_pages(page, exch.out.extent_order);
> - goto dying;
> - }
> + unsigned long dec_count;
> + bool_t drop_dom_ref;
>
> if ( __copy_from_guest_offset(&gpfn, exch.out.extent_start,
> (i << out_chunk_order) + j, 1) )
> {
> rc = -EFAULT;
> - continue;
> + goto extent_error;
No, nothing was undone in this case before. And _if_ the
re-ordering is provably correct (which I doubt at this point), you'd
need to continue to assign_pages() here to retain previous
behavior. Or if you think current behavior is wrong, please send
a (perhaps prerequisite) separate patch to address that, which
would at once make sure we get a proper explanation of what
you think is wrong.
> }
>
> mfn = page_to_mfn(page);
> guest_physmap_add_page(d, gpfn, mfn, exch.out.extent_order);
As Ian already pointed out, there's a window now where the
guest has the new page(s) in its physmap, but not assigned to it
yet.
>
> + if ( op == XENMEM_exchange_and_pin )
> + {
> + if ( guest_physmap_pin_range(d, gpfn, exch.out.extent_order) )
> + {
> + rc = -EFAULT;
> + goto extent_error_physmap;
> + }
> + }
> +
> if ( !paging_mode_translate(d) )
> {
> for ( k = 0; k < (1UL << exch.out.extent_order); k++ )
> set_gpfn_from_mfn(mfn + k, gpfn + k);
Again, _if_ the error path changes are provably correct, wouldn't
this need undoing on the error path too? That wasn't necessary
before your change because we ran the handling of the current
extent to completion.
> + }
> +
> + rc = assign_pages(d, page, exch.out.extent_order, MEMF_no_refcount);
> + if ( rc )
> + goto extent_error_physmap;
> +
> + if ( op == XENMEM_exchange_and_pin || !paging_mode_translate(d)
> )
> + {
> if ( __copy_to_guest_offset(exch.out.extent_start,
> (i << out_chunk_order) + j,
> &mfn, 1) )
> rc = -EFAULT;
> }
> +
> + continue;
> +
> +extent_error_physmap:
> + guest_physmap_remove_page(d, gpfn, mfn, exch.out.extent_order);
> +extent_error:
> + /*
> + * Pages in in_chunk_list is stolen without
> + * decreasing the tot_pages. If the domain is dying when
> + * assign pages, we need decrease the count. For those pages
> + * that has been assigned, it should be covered by
> + * domain_relinquish_resources().
> + */
> + dec_count = (((1UL << exch.in.extent_order) *
> + (1UL << in_chunk_order)) -
> + (j * (1UL << exch.out.extent_order)));
> +
> + spin_lock(&d->page_alloc_lock);
> + domain_adjust_tot_pages(d, -dec_count);
> + drop_dom_ref = (dec_count && !d->tot_pages);
> + spin_unlock(&d->page_alloc_lock);
> +
> + if ( drop_dom_ref )
> + put_domain(d);
> +
> + free_domheap_pages(page, exch.out.extent_order);
> + goto dying;
I don't think moving the error handling here helps readability or
such.
> +static long unpin(XEN_GUEST_HANDLE_PARAM(xen_unpin_t) arg)
> +{
> + int rc;
Please keep function return type and return code variable type
in sync - "int" would be sufficient for both here.
> + unsigned long i;
> + struct xen_unpin unpin;
> + xen_pfn_t gpfn;
> + struct domain *d;
> +
> + if ( copy_from_guest(&unpin, arg, 1) )
> + return -EFAULT;
> +
> + /* Various sanity checks. */
> + if ( /* Extent orders are sensible? */
> + (unpin.in.extent_order > MAX_ORDER) ||
> + /* Sizes of input list do not overflow a long? */
> + ((~0UL >> unpin.in.extent_order) < unpin.in.nr_extents) )
> + return -EFAULT;
> +
> + if ( !guest_handle_okay(unpin.in.extent_start, unpin.in.nr_extents) )
> + return -EFAULT;
> +
> + d = rcu_lock_domain_by_any_id(unpin.in.domid);
> + if ( d == NULL )
> + {
> + rc = -ESRCH;
> + goto fail;
> + }
> +
> + for ( i = 0; i < unpin.in.nr_extents; i++ )
> + {
> + if ( unlikely(__copy_from_guest_offset(
> + &gpfn, unpin.in.extent_start, i, 1)) )
> + {
> + rc = -EFAULT;
> + goto partial;
> + }
> +
Want/need to make sure gpfn is suitably aligned for the passed-in
extent order? Depends on whether guest_physmap_unpin_range()
is expected to always cope with a mis-aligned one.
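(If the alignment were to be checked in unpin() itself rather than
handled by guest_physmap_unpin_range(), a minimal sketch in the style
of the loop above might look like this; the choice of -EINVAL is an
assumption:)

    /* Sketch only: reject a GPFN not aligned to the extent order. */
    if ( gpfn & ((1UL << unpin.in.extent_order) - 1) )
    {
        rc = -EINVAL;
        goto partial;
    }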
> --- a/xen/include/public/memory.h
> +++ b/xen/include/public/memory.h
> @@ -105,6 +105,7 @@ struct xen_memory_exchange {
> /*
> * [IN] Details of memory extents to be exchanged (GMFN bases).
> * Note that @in.address_bits is ignored and unused.
> + * @out.address_bits should contain the address mask for the new pages.
"mask"? No comment is likely better than a confusing one.
> @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> * The zero value is appropiate.
> */
>
> +#define XENMEM_exchange_and_pin 26
> +/*
> + * This hypercall is similar to XENMEM_exchange: it takes the same
> + * struct as an argument and it exchanges the pages passed in with a new
> + * set of pages. The new pages are going to be "pinned": it's guaranteed
> + * that their p2m mapping won't be changed until explicitly "unpinned".
> + * The content of the exchanged pages is lost.
> + * Only normal guest r/w memory can be pinned: no granted pages or
> + * ballooned pages.
> + * If return code is zero then @out.extent_list provides the DMA frame
> + * numbers of the newly-allocated memory.
"DMA"? I don't think that term is universally true across all possible
architectures (but we're in an architecture independent header
here). "Machine" would probably be better (as it implies CPU
perspective, whereas DMA hints at device perspective).
> + * Returns zero on complete success, otherwise a negative error code:
> + * -ENOSYS if not implemented
> + * -EINVAL if the page is already pinned
> + * -EFAULT if an internal error occurs
I'm not sure enumerating error codes here is useful; certainly not
giving the impression that the list might be exhaustive.
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 12:15 ` Jan Beulich
@ 2013-09-10 12:50 ` Ian Campbell
2013-09-10 13:28 ` Jan Beulich
2013-09-27 14:49 ` Stefano Stabellini
1 sibling, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 12:50 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> > * The zero value is appropiate.
> > */
> >
> > +#define XENMEM_exchange_and_pin 26
> > +/*
> > + * This hypercall is similar to XENMEM_exchange: it takes the same
> > + * struct as an argument and it exchanges the pages passed in with a new
> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
> > + * that their p2m mapping won't be changed until explicitly "unpinned".
> > + * The content of the exchanged pages is lost.
> > + * Only normal guest r/w memory can be pinned: no granted pages or
> > + * ballooned pages.
> > + * If return code is zero then @out.extent_list provides the DMA frame
> > + * numbers of the newly-allocated memory.
>
> "DMA"? I don't think that term is universally true across all possible
> architectures (but we're in an architecture independent header
> here). "Machine" would probably be better (as it implies CPU
> perspective, whereas DMA hints at device perspective).
I think DMA here is correct. The purpose of exchange and pin is so that
the page can be safely handed to a device for DMA.
On an architecture where DMA address != Machine address then this should
indeed return the DMA addresses.
Ian.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 12:50 ` Ian Campbell
@ 2013-09-10 13:28 ` Jan Beulich
2013-09-10 18:32 ` Stefano Stabellini
0 siblings, 1 reply; 32+ messages in thread
From: Jan Beulich @ 2013-09-10 13:28 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
>>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
>
>> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>> > * The zero value is appropiate.
>> > */
>> >
>> > +#define XENMEM_exchange_and_pin 26
>> > +/*
>> > + * This hypercall is similar to XENMEM_exchange: it takes the same
>> > + * struct as an argument and it exchanges the pages passed in with a new
>> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
>> > + * that their p2m mapping won't be changed until explicitly "unpinned".
>> > + * The content of the exchanged pages is lost.
>> > + * Only normal guest r/w memory can be pinned: no granted pages or
>> > + * ballooned pages.
>> > + * If return code is zero then @out.extent_list provides the DMA frame
>> > + * numbers of the newly-allocated memory.
>>
>> "DMA"? I don't think that term is universally true across all possible
>> architectures (but we're in an architecture independent header
>> here). "Machine" would probably be better (as it implies CPU
>> perspective, whereas DMA hints at device perspective).
>
> I think DMA here is correct. The purpose of exchange and pin is so that
> the page can be safely handed to a device for DMA.
>
> On an architecture where DMA address != Machine address then this should
> indeed return the DMA addresses.
One problem is that I think there are architectures where there's no
single canonical DMA address; such an address may depend on the
placement of a device in the system's topology. Hence I don't think
it would even be possible to return "the" DMA address here. It ought
to be the machine address (CPU view), and the consumer ought to
know how to translate this to a DMA address for a particular device.
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-10 10:39 ` Jan Beulich
@ 2013-09-10 17:20 ` Stefano Stabellini
2013-09-10 18:06 ` Tim Deegan
2013-09-11 6:31 ` Jan Beulich
0 siblings, 2 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-10 17:20 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, tim, keir, Ian.Campbell, Stefano Stabellini
On Tue, 10 Sep 2013, Jan Beulich wrote:
> >>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > --- a/xen/common/memory.c
> > +++ b/xen/common/memory.c
> > @@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
> > return 1;
> > }
> >
> > +int donate_page(
> > + struct domain *d, struct page_info *page, unsigned int memflags)
> > +{
> > + spin_lock(&d->page_alloc_lock);
> > +
> > + if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
> > + goto fail;
> > +
> > + if ( d->is_dying )
> > + goto fail;
> > +
> > + if ( page->count_info & ~(PGC_allocated | 1) )
>
> While the literal 1 was acceptable in arch specific code, I'm not sure
> it really is in generic code. Nothing enforces that an arch has the
> count starting at bit 0. Even if it reads a little odd, using
> (-PGC_count_mask & PGC_count_mask) would probably be an
> acceptable replacement here and further down.
(-PGC_count_mask & PGC_count_mask) is always 0.
I guess you mean:
if ( page->count_info & ~(PGC_count_mask | PGC_allocated) )
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup
2013-09-10 9:26 ` Ian Campbell
@ 2013-09-10 18:00 ` Stefano Stabellini
2013-09-10 18:38 ` Ian Campbell
0 siblings, 1 reply; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-10 18:00 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 10 Sep 2013, Ian Campbell wrote:
> This function is not actually all that p2m-specific in the end: by using
> lpae_t.walk instead of lpae_t.pt (see dump_pt_walk) you could make it
> useful for e.g. hyp pagetable walks too (e.g. dump_hyp_walk). In theory
> it could also be used for LPAE guest walks, but we'd need a separate
> walker for non-LPAE guests.
I could move the function somewhere more generic but..
> If we wanted to replace dump_hyp_walk with this then calling the
> callback for each level would be required.
.. even though the difference is not likely to be big, it would still
slow down p2m_walker, which could be called as often as twice per DMA
request.
Maybe it's best to keep it as it is?
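(Purely to illustrate the per-level callback shape being discussed; the
typedef and the dump helper below use hypothetical names and are not
code from the series:)

    /* Hypothetical walker callback: invoked once per level, so a
     * dump-style user can print intermediate entries while a
     * p2m_lookup-style user simply ignores everything but level 3. */
    typedef int (*pt_walk_cb)(lpae_t *ptep, void *arg, int level);

    static int dump_one_entry(lpae_t *ptep, void *arg, int level)
    {
        printk("  level %d: pte %#"PRIx64"\n", level, ptep->bits);
        return 0;   /* keep walking */
    }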
> >
> > spin_lock(&p2m->lock);
> >
> > first = __map_domain_page(p2m->first_level);
> >
> > - pte = first[first_table_offset(paddr)];
> > - if ( !pte.p2m.valid || !pte.p2m.table )
> > - goto done;
> > + if ( !first ||
> > + !first[first_table_offset(paddr)].p2m.valid )
> > + goto err;
>
> Why is the first level handled specially outside the loop? What happens
> if order is such that we span multiple first level entries?
I think that the assumption in the original code was that the size
couldn't be larger than 1GB. That isn't the case anymore; I'll fix this.
> > - /* This bit must be one in the level 3 entry */
> > - if ( !pte.p2m.table )
> > - pte.bits = 0;
> > + if ( cur_first_offset != first_table_offset(paddr) )
> > + {
> > + if (second) unmap_domain_page(second);
> > + second = map_domain_page(first[first_table_offset(paddr)].p2m.base);
> > + cur_first_offset = first_table_offset(paddr);
> > + }
> > + level++;
> > + if ( !second ||
>
> ASSERT(second) seems reasonable here, I think, since it would indicate
> we had screwed up the p2m pretty badly.
I think it's possible: what if a memory range is simply missing from the
p2m? The second level pagetable pages could be missing too.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-10 17:20 ` Stefano Stabellini
@ 2013-09-10 18:06 ` Tim Deegan
2013-09-10 18:14 ` Stefano Stabellini
2013-09-11 6:31 ` Jan Beulich
1 sibling, 1 reply; 32+ messages in thread
From: Tim Deegan @ 2013-09-10 18:06 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, keir, Ian.Campbell, Jan Beulich
At 18:20 +0100 on 10 Sep (1378837226), Stefano Stabellini wrote:
> On Tue, 10 Sep 2013, Jan Beulich wrote:
> > >>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > > --- a/xen/common/memory.c
> > > +++ b/xen/common/memory.c
> > > @@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
> > > return 1;
> > > }
> > >
> > > +int donate_page(
> > > + struct domain *d, struct page_info *page, unsigned int memflags)
> > > +{
> > > + spin_lock(&d->page_alloc_lock);
> > > +
> > > + if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
> > > + goto fail;
> > > +
> > > + if ( d->is_dying )
> > > + goto fail;
> > > +
> > > + if ( page->count_info & ~(PGC_allocated | 1) )
> >
> > While the literal 1 was acceptable in arch specific code, I'm not sure
> > it really is in generic code. Nothing enforces that an arch has the
> > count starting at bit 0. Even if it reads a little odd, using
> > (-PGC_count_mask & PGC_count_mask) would probably be an
> > acceptable replacement here and further down.
>
> (-PGC_count_mask & PGC_count_mask) is always 0.
'x & -x' always gives you the lowest set bit in x, though as Jan points
out it's a bit opaque. Maybe it needs a friendly macro name?
Tim.
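(A hedged sketch of the kind of helper Tim is suggesting; the macro
name is made up:)

    /* Hypothetical name for "a reference count of exactly one", i.e.
     * the lowest set bit of PGC_count_mask, wherever the architecture
     * happens to put the count. */
    #define PGC_count_one (PGC_count_mask & -PGC_count_mask)

    if ( page->count_info & ~(PGC_allocated | PGC_count_one) )
        goto fail;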
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-10 18:06 ` Tim Deegan
@ 2013-09-10 18:14 ` Stefano Stabellini
0 siblings, 0 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-10 18:14 UTC (permalink / raw)
To: Tim Deegan; +Cc: xen-devel, keir, Ian.Campbell, Jan Beulich, Stefano Stabellini
On Tue, 10 Sep 2013, Tim Deegan wrote:
> At 18:20 +0100 on 10 Sep (1378837226), Stefano Stabellini wrote:
> > On Tue, 10 Sep 2013, Jan Beulich wrote:
> > > >>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com> wrote:
> > > > --- a/xen/common/memory.c
> > > > +++ b/xen/common/memory.c
> > > > @@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long gmfn)
> > > > return 1;
> > > > }
> > > >
> > > > +int donate_page(
> > > > + struct domain *d, struct page_info *page, unsigned int memflags)
> > > > +{
> > > > + spin_lock(&d->page_alloc_lock);
> > > > +
> > > > + if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
> > > > + goto fail;
> > > > +
> > > > + if ( d->is_dying )
> > > > + goto fail;
> > > > +
> > > > + if ( page->count_info & ~(PGC_allocated | 1) )
> > >
> > > While the literal 1 was acceptable in arch specific code, I'm not sure
> > > it really is in generic code. Nothing enforces that an arch has the
> > > count starting at bit 0. Even if it reads a little odd, using
> > > (-PGC_count_mask & PGC_count_mask) would probably be an
> > > acceptable replacement here and further down.
> >
> > (-PGC_count_mask & PGC_count_mask) is always 0.
>
> 'x & -x' always gives you the lowest set bit in x, though as Jan points
> out it's a bit opaque. Maybe it needs a friendly macro name?
No, that's fine. I read it as ~PGC_count_mask.
It proves that I need new glasses or a larger font size :P
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 13:28 ` Jan Beulich
@ 2013-09-10 18:32 ` Stefano Stabellini
2013-09-10 18:37 ` Tim Deegan
2013-09-11 6:33 ` Jan Beulich
0 siblings, 2 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-10 18:32 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, tim, Ian Campbell, Stefano Stabellini
On Tue, 10 Sep 2013, Jan Beulich wrote:
> >>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
> >
> >> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> >> > * The zero value is appropiate.
> >> > */
> >> >
> >> > +#define XENMEM_exchange_and_pin 26
> >> > +/*
> >> > + * This hypercall is similar to XENMEM_exchange: it takes the same
> >> > + * struct as an argument and it exchanges the pages passed in with a new
> >> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
> >> > + * that their p2m mapping won't be changed until explicitly "unpinned".
> >> > + * The content of the exchanged pages is lost.
> >> > + * Only normal guest r/w memory can be pinned: no granted pages or
> >> > + * ballooned pages.
> >> > + * If return code is zero then @out.extent_list provides the DMA frame
> >> > + * numbers of the newly-allocated memory.
> >>
> >> "DMA"? I don't think that term is universally true across all possible
> >> architectures (but we're in an architecture independent header
> >> here). "Machine" would probably be better (as it implies CPU
> >> perspective, whereas DMA hints at device perspective).
> >
> > I think DMA here is correct. The purpose of exchange and pin is so that
> > the page can be safely handed to a device for DMA.
> >
> > On an architecture where DMA address != Machine address then this should
> > indeed return the DMA addresses.
>
> One problem is that I think there are architectures where there's no
> single canonical DMA address; such an address may depend on the
> placement of a device in the system's topology. Hence I don't think
> it would even be possible to return "the" DMA address here. It ought
> to be the machine address (CPU view), and the consumer ought to
> know how to translate this to a DMA address for a particular device.
We could leave it up to each architecture to specify whether the
hypercall returns a DMA address or a Machine address (according to
their definition of DMA address and Machine address).
Even if this is a common header.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 18:32 ` Stefano Stabellini
@ 2013-09-10 18:37 ` Tim Deegan
2013-09-10 18:40 ` Ian Campbell
2013-09-11 6:33 ` Jan Beulich
1 sibling, 1 reply; 32+ messages in thread
From: Tim Deegan @ 2013-09-10 18:37 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, Ian Campbell, Jan Beulich
At 19:32 +0100 on 10 Sep (1378841575), Stefano Stabellini wrote:
> On Tue, 10 Sep 2013, Jan Beulich wrote:
> > >>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > > On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
> > >
> > >> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> > >> > * The zero value is appropiate.
> > >> > */
> > >> >
> > >> > +#define XENMEM_exchange_and_pin 26
> > >> > +/*
> > >> > + * This hypercall is similar to XENMEM_exchange: it takes the same
> > >> > + * struct as an argument and it exchanges the pages passed in with a new
> > >> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
> > >> > + * that their p2m mapping won't be changed until explicitly "unpinned".
> > >> > + * The content of the exchanged pages is lost.
> > >> > + * Only normal guest r/w memory can be pinned: no granted pages or
> > >> > + * ballooned pages.
> > >> > + * If return code is zero then @out.extent_list provides the DMA frame
> > >> > + * numbers of the newly-allocated memory.
> > >>
> > >> "DMA"? I don't think that term is universally true across all possible
> > >> architectures (but we're in an architecture independent header
> > >> here). "Machine" would probably be better (as it implies CPU
> > >> perspective, whereas DMA hints at device perspective).
> > >
> > > I think DMA here is correct. The purpose of exchange and pin is so that
> > > the page can be safely handed to a device for DMA.
> > >
> > > On an architecture where DMA address != Machine address then this should
> > > indeed return the DMA addresses.
> >
> > One problem is that I think there are architectures where there's no
> > single canonical DMA address; such an address may depend on the
> > placement of a device in the system's topology. Hence I don't think
> > it would even be possible to return "the" DMA address here. It ought
> > to be the machine address (CPU view), and the consumer ought to
> > know how to translate this to a DMA address for a particular device.
>
> We could leave it up to each architecture to specify whether the
> hypercall returns a DMA address or a Machine address (according to
> their definition of DMA address and Machine address).
> Even if this is a common header.
Is this going to be needed on architectures other than arm? It's not
useful for x86, AFAICT.
Tim.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup
2013-09-10 18:00 ` Stefano Stabellini
@ 2013-09-10 18:38 ` Ian Campbell
2013-09-11 16:55 ` Stefano Stabellini
0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 18:38 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Tue, 2013-09-10 at 19:00 +0100, Stefano Stabellini wrote:
> > > - /* This bit must be one in the level 3 entry */
> > > - if ( !pte.p2m.table )
> > > - pte.bits = 0;
> > > + if ( cur_first_offset != first_table_offset(paddr) )
> > > + {
> > > + if (second) unmap_domain_page(second);
> > > + second = map_domain_page(first[first_table_offset(paddr)].p2m.base);
> > > + cur_first_offset = first_table_offset(paddr);
> > > + }
> > > + level++;
> > > + if ( !second ||
> >
> > ASSERT(second) seems reasonable here, I think, since it would indicate
> > we had screwed up the p2m pretty badly.
>
> I think it's possible: what if a memory range is simply missing from the
> p2m? The second level pagetable pages could be missing too.
Then the level above it shouldn't have the valid bit set.
Ian.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 18:37 ` Tim Deegan
@ 2013-09-10 18:40 ` Ian Campbell
2013-09-10 18:50 ` Tim Deegan
0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 18:40 UTC (permalink / raw)
To: Tim Deegan; +Cc: xen-devel, Jan Beulich, Stefano Stabellini
On Tue, 2013-09-10 at 19:37 +0100, Tim Deegan wrote:
> At 19:32 +0100 on 10 Sep (1378841575), Stefano Stabellini wrote:
> > On Tue, 10 Sep 2013, Jan Beulich wrote:
> > > >>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > > > On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
> > > >
> > > >> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> > > >> > * The zero value is appropiate.
> > > >> > */
> > > >> >
> > > >> > +#define XENMEM_exchange_and_pin 26
> > > >> > +/*
> > > >> > + * This hypercall is similar to XENMEM_exchange: it takes the same
> > > >> > + * struct as an argument and it exchanges the pages passed in with a new
> > > >> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
> > > >> > + * that their p2m mapping won't be changed until explicitly "unpinned".
> > > >> > + * The content of the exchanged pages is lost.
> > > >> > + * Only normal guest r/w memory can be pinned: no granted pages or
> > > >> > + * ballooned pages.
> > > >> > + * If return code is zero then @out.extent_list provides the DMA frame
> > > >> > + * numbers of the newly-allocated memory.
> > > >>
> > > >> "DMA"? I don't think that term is universally true across all possible
> > > >> architectures (but we're in an architecture independent header
> > > >> here). "Machine" would probably be better (as it implies CPU
> > > >> perspective, whereas DMA hints at device perspective).
> > > >
> > > > I think DMA here is correct. The purpose of exchange and pin is so that
> > > > the page can be safely handed to a device for DMA.
> > > >
> > > > On an architecture where DMA address != Machine address then this should
> > > > indeed return the DMA addresses.
> > >
> > > One problem is that I think there are architectures where there's no
> > > single canonical DMA address; such an address may depend on the
> > > placement of a device in the system's topology. Hence I don't think
> > > it would even be possible to return "the" DMA address here. It ought
> > > to be the machine address (CPU view), and the consumer ought to
> > > know how to translate this to a DMA address for a particular device.
> >
> > We could leave it up to each architecture to specify whether the
> > hypercall returns a DMA address or a Machine address (according to
> > their definition of DMA address and Machine address).
> > Even if this is a common header.
>
> Is this going to be needed on architectures other than arm? It's not
> useful for x86, AFAICT.
It might be needed for PVH with passthrough? Maybe that insists on an
IOMMU, in which case maybe it is arm only.
Until someone does e.g. a ppc port ;-)
Ian.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 18:40 ` Ian Campbell
@ 2013-09-10 18:50 ` Tim Deegan
0 siblings, 0 replies; 32+ messages in thread
From: Tim Deegan @ 2013-09-10 18:50 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, Jan Beulich, Stefano Stabellini
At 19:40 +0100 on 10 Sep (1378842048), Ian Campbell wrote:
> On Tue, 2013-09-10 at 19:37 +0100, Tim Deegan wrote:
> > At 19:32 +0100 on 10 Sep (1378841575), Stefano Stabellini wrote:
> > > On Tue, 10 Sep 2013, Jan Beulich wrote:
> > > > >>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > > > > On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
> > > > >
> > > > >> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
> > > > >> > * The zero value is appropiate.
> > > > >> > */
> > > > >> >
> > > > >> > +#define XENMEM_exchange_and_pin 26
> > > > >> > +/*
> > > > >> > + * This hypercall is similar to XENMEM_exchange: it takes the same
> > > > >> > + * struct as an argument and it exchanges the pages passed in with a new
> > > > >> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
> > > > >> > + * that their p2m mapping won't be changed until explicitly "unpinned".
> > > > >> > + * The content of the exchanged pages is lost.
> > > > >> > + * Only normal guest r/w memory can be pinned: no granted pages or
> > > > >> > + * ballooned pages.
> > > > >> > + * If return code is zero then @out.extent_list provides the DMA frame
> > > > >> > + * numbers of the newly-allocated memory.
> > > > >>
> > > > >> "DMA"? I don't think that term is universally true across all possible
> > > > >> architectures (but we're in an architecture independent header
> > > > >> here). "Machine" would probably be better (as it implies CPU
> > > > >> perspective, whereas DMA hints at device perspective).
> > > > >
> > > > > I think DMA here is correct. The purpose of exchange and pin is so that
> > > > > the page can be safely handed to a device for DMA.
> > > > >
> > > > > On an architecture where DMA address != Machine address then this should
> > > > > indeed return the DMA addresses.
> > > >
> > > > One problem is that I think there are architectures where there's no
> > > > single canonical DMA address; such an address may depend on the
> > > > placement of a device in the system's topology. Hence I don't think
> > > > it would even be possible to return "the" DMA address here. It ought
> > > > to be the machine address (CPU view), and the consumer ought to
> > > > know how to translate this to a DMA address for a particular device.
> > >
> > > We could leave it up to each architecture to specify whether the
> > > hypercall returns a DMA address or a Machine address (according to
> > > their definition of DMA address and Machine address).
> > > Even if this is a common header.
> >
> > Is this going to be needed on architectures other than arm? It's not
> > useful for x86, AFAICT.
>
> It might be needed for PVH with passthrough? Maybe that insists on an
> IOMMU, in which case maybe it is arm only.
Yes, I think it would be fine to require a working IOMMU on x86 for
PVH+passthrough.
> Until someone does e.g. a ppc port ;-)
m68k!
Tim.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-10 9:30 ` Ian Campbell
@ 2013-09-10 19:11 ` Stefano Stabellini
2013-09-10 19:42 ` Ian Campbell
0 siblings, 1 reply; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-10 19:11 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 10 Sep 2013, Ian Campbell wrote:
> On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> > guest_physmap_pin_range pins a range of guest pages so that their p2m
> > mappings won't be changed.
> > guest_physmap_unpin_range unpins the previously pinned pages.
> > The pinning is done using one of the spare bits in the p2m ptes.
> >
> > Use the newly introduce p2m_walker to implement the two functions on
> > ARM.
> > Provide empty stubs for x86.
> >
> >
> > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> >
> >
> > Changes in v5:
> > - return -EBUSY when the P2M_DMA_PIN check fails;
> > - rename _guest_physmap_pin_range to pin_one_pte;
> > - rename _guest_physmap_unpin_range to unpin_one_pte.
> >
> > Changes in v4:
> > - use p2m_walker to implement guest_physmap_pin_range and
> > guest_physmap_unpin_range;
> > - return -EINVAL when the P2M_DMA_PIN check fails;
> > - change the printk into a gdprintk;
> > - add a comment on what type of page can be pinned.
> > ---
> > xen/arch/arm/p2m.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > xen/include/asm-arm/mm.h | 4 +++
> > xen/include/asm-arm/page.h | 7 +++++-
> > xen/include/asm-x86/p2m.h | 12 +++++++++++
> > 4 files changed, 70 insertions(+), 1 deletions(-)
> >
> > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > index a9ceacf..bac6c7e 100644
> > --- a/xen/arch/arm/p2m.c
> > +++ b/xen/arch/arm/p2m.c
> > @@ -154,6 +154,46 @@ paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> > return p2m.maddr;
> > }
> >
> > +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> > +{
> > + lpae_t pte = *ptep;
> > + ASSERT(level == 3);
> > +
> > + if ( pte.p2m.avail & P2M_DMA_PIN )
> > + return -EBUSY;
> > + pte.p2m.avail |= P2M_DMA_PIN;
> > + write_pte(ptep, pte);
> > + return 0;
> > +}
> > +
> > +int guest_physmap_pin_range(struct domain *d,
> > + xen_pfn_t gpfn,
> > + unsigned int order)
> > +{
> > + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> > + pin_one_pte, NULL);
>
> If this fails then you will have left some subset of the pages
> successfully pinned. You need to clean it up I think, or be fatal to the
> guest or something.
Making the failure of guest_physmap_pin_range fatal to the guest would
also solve the non-trivial problem of error handling in memory_exchange.
Jan, do you agree on this approach?
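(A rough sketch of what the "fatal to the guest" option could look like
at the pinning step in memory_exchange(); placement and error code are
assumptions, this is not code from the series:)

    if ( op == XENMEM_exchange_and_pin &&
         guest_physmap_pin_range(d, gpfn, exch.out.extent_order) )
    {
        printk(XENLOG_G_ERR
               "Could not pin gpfn %#lx order %u, crashing d%d\n",
               (unsigned long)gpfn, exch.out.extent_order, d->domain_id);
        domain_crash(d);
        rc = -EFAULT;
        goto dying;
    }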
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-10 19:11 ` Stefano Stabellini
@ 2013-09-10 19:42 ` Ian Campbell
2013-09-27 14:46 ` Stefano Stabellini
0 siblings, 1 reply; 32+ messages in thread
From: Ian Campbell @ 2013-09-10 19:42 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Tue, 2013-09-10 at 20:11 +0100, Stefano Stabellini wrote:
> On Tue, 10 Sep 2013, Ian Campbell wrote:
> > On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> > > guest_physmap_pin_range pins a range of guest pages so that their p2m
> > > mappings won't be changed.
> > > guest_physmap_unpin_range unpins the previously pinned pages.
> > > The pinning is done using one of the spare bits in the p2m ptes.
> > >
> > > Use the newly introduce p2m_walker to implement the two functions on
> > > ARM.
> > > Provide empty stubs for x86.
> > >
> > >
> > > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > >
> > >
> > > Changes in v5:
> > > - return -EBUSY when the P2M_DMA_PIN check fails;
> > > - rename _guest_physmap_pin_range to pin_one_pte;
> > > - rename _guest_physmap_unpin_range to unpin_one_pte.
> > >
> > > Changes in v4:
> > > - use p2m_walker to implement guest_physmap_pin_range and
> > > guest_physmap_unpin_range;
> > > - return -EINVAL when the P2M_DMA_PIN check fails;
> > > - change the printk into a gdprintk;
> > > - add a comment on what type of page can be pinned.
> > > ---
> > > xen/arch/arm/p2m.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > > xen/include/asm-arm/mm.h | 4 +++
> > > xen/include/asm-arm/page.h | 7 +++++-
> > > xen/include/asm-x86/p2m.h | 12 +++++++++++
> > > 4 files changed, 70 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > index a9ceacf..bac6c7e 100644
> > > --- a/xen/arch/arm/p2m.c
> > > +++ b/xen/arch/arm/p2m.c
> > > @@ -154,6 +154,46 @@ paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> > > return p2m.maddr;
> > > }
> > >
> > > +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> > > +{
> > > + lpae_t pte = *ptep;
> > > + ASSERT(level == 3);
> > > +
> > > + if ( pte.p2m.avail & P2M_DMA_PIN )
> > > + return -EBUSY;
> > > + pte.p2m.avail |= P2M_DMA_PIN;
> > > + write_pte(ptep, pte);
> > > + return 0;
> > > +}
> > > +
> > > +int guest_physmap_pin_range(struct domain *d,
> > > + xen_pfn_t gpfn,
> > > + unsigned int order)
> > > +{
> > > + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> > > + pin_one_pte, NULL);
> >
> > If this fails then you will have left some subset of the pages
> > successfully pinned. You need to clean it up I think, or be fatal to the
> > guest or something.
>
> Making the failure of guest_physmap_pin_range fatal to the guest would
> also solve the non-trivial problem of error handling in memory_exchange.
Actually, I don't think this will work.
Consider two concurrent I/O operations dispatched by two threads in the
same process, to/from non-overlapping regions of the same page.
Hrm, we may actually need a reference count on the pin :-(
We might need to rethink here -- i.e. use a type count on the underlying
machine page rather than a pin in the p2m entry? That kind of makes
sense since it will be the underlying page which is exposed to the
hardware...
Ian.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 1/4] xen: move steal_page and donate_page to common code
2013-09-10 17:20 ` Stefano Stabellini
2013-09-10 18:06 ` Tim Deegan
@ 2013-09-11 6:31 ` Jan Beulich
1 sibling, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2013-09-11 6:31 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, keir, Ian.Campbell, tim
>>> On 10.09.13 at 19:20, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
wrote:
> On Tue, 10 Sep 2013, Jan Beulich wrote:
>> >>> On 09.09.13 at 18:06, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> wrote:
>> > --- a/xen/common/memory.c
>> > +++ b/xen/common/memory.c
>> > @@ -230,6 +230,91 @@ int guest_remove_page(struct domain *d, unsigned long
> gmfn)
>> > return 1;
>> > }
>> >
>> > +int donate_page(
>> > + struct domain *d, struct page_info *page, unsigned int memflags)
>> > +{
>> > + spin_lock(&d->page_alloc_lock);
>> > +
>> > + if ( is_xen_heap_page(page) || (page_get_owner(page) != NULL) )
>> > + goto fail;
>> > +
>> > + if ( d->is_dying )
>> > + goto fail;
>> > +
>> > + if ( page->count_info & ~(PGC_allocated | 1) )
>>
>> While the literal 1 was acceptable in arch specific code, I'm not sure
>> it really is in generic code. Nothing enforces that an arch has the
>> count starting at bit 0. Even if it reads a little odd, using
>> (-PGC_count_mask & PGC_count_mask) would probably be an
>> acceptable replacement here and further down.
>
> (-PGC_count_mask & PGC_count_mask) is always 0.
>
> I guess you mean:
>
> if ( page->count_info & ~(PGC_count_mask | PGC_allocated) )
No, I really meant what I wrote, resulting in
if ( page->count_info & ~(PGC_allocated | (-PGC_count_mask & PGC_count_mask)) )
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 18:32 ` Stefano Stabellini
2013-09-10 18:37 ` Tim Deegan
@ 2013-09-11 6:33 ` Jan Beulich
1 sibling, 0 replies; 32+ messages in thread
From: Jan Beulich @ 2013-09-11 6:33 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim, Ian Campbell
>>> On 10.09.13 at 20:32, Stefano Stabellini <stefano.stabellini@eu.citrix.com>
wrote:
> On Tue, 10 Sep 2013, Jan Beulich wrote:
>> >>> On 10.09.13 at 14:50, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>> > On Tue, 2013-09-10 at 13:15 +0100, Jan Beulich wrote:
>> >
>> >> > @@ -459,6 +460,52 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t);
>> >> > * The zero value is appropiate.
>> >> > */
>> >> >
>> >> > +#define XENMEM_exchange_and_pin 26
>> >> > +/*
>> >> > + * This hypercall is similar to XENMEM_exchange: it takes the same
>> >> > + * struct as an argument and it exchanges the pages passed in with a new
>> >> > + * set of pages. The new pages are going to be "pinned": it's guaranteed
>> >> > + * that their p2m mapping won't be changed until explicitly "unpinned".
>> >> > + * The content of the exchanged pages is lost.
>> >> > + * Only normal guest r/w memory can be pinned: no granted pages or
>> >> > + * ballooned pages.
>> >> > + * If return code is zero then @out.extent_list provides the DMA frame
>> >> > + * numbers of the newly-allocated memory.
>> >>
>> >> "DMA"? I don't think that term is universally true across all possible
>> >> architectures (but we're in an architecture independent header
>> >> here). "Machine" would probably be better (as it implies CPU
>> >> perspective, whereas DMA hints at device perspective).
>> >
>> > I think DMA here is correct. The purpose of exchange and pin is so that
>> > the page can be safely handed to a device for DMA.
>> >
>> > On an architecture where DMA address != Machine address then this should
>> > indeed return the DMA addresses.
>>
>> One problem is that I think there are architectures where there's no
>> single canonical DMA address; such an address may depend on the
>> placement of a device in the system's topology. Hence I don't think
>> it would even be possible to return "the" DMA address here. It ought
>> to be the machine address (CPU view), and the consumer ought to
>> know how to translate this to a DMA address for a particular device.
>
> We could leave it up to each architecture to specify whether the
> hypercall returns a DMA address or a Machine address (according to
> their definition of DMA address and Machine address).
> Even if this is a common header.
That would likely result in ugly conditionals in the kernel-side code
consuming this.
But yes, considering that we're not currently aware of ports to
such architectures (I have no idea how MIPS works in this regard;
that's the only architecture I'm aware people have shown some
interest in), we might as well do it that way. But then please
change the comment to say so.
Jan
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup
2013-09-10 18:38 ` Ian Campbell
@ 2013-09-11 16:55 ` Stefano Stabellini
0 siblings, 0 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-11 16:55 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 10 Sep 2013, Ian Campbell wrote:
> On Tue, 2013-09-10 at 19:00 +0100, Stefano Stabellini wrote:
> > > > - /* This bit must be one in the level 3 entry */
> > > > - if ( !pte.p2m.table )
> > > > - pte.bits = 0;
> > > > + if ( cur_first_offset != first_table_offset(paddr) )
> > > > + {
> > > > + if (second) unmap_domain_page(second);
> > > > + second = map_domain_page(first[first_table_offset(paddr)].p2m.base);
> > > > + cur_first_offset = first_table_offset(paddr);
> > > > + }
> > > > + level++;
> > > > + if ( !second ||
> > >
> > > ASSERT(second) seems reasonable here, I think, since it would indicate
> > > we had screwed up the p2m pretty badly.
> >
> > I think it's possible: what if a memory range is simply missing from the
> > p2m? The second level pagetable pages could be missing too.
>
> Then the level above it shouldn't have the valid bit set.
Oh, right!
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-10 19:42 ` Ian Campbell
@ 2013-09-27 14:46 ` Stefano Stabellini
0 siblings, 0 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-27 14:46 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 10 Sep 2013, Ian Campbell wrote:
> On Tue, 2013-09-10 at 20:11 +0100, Stefano Stabellini wrote:
> > On Tue, 10 Sep 2013, Ian Campbell wrote:
> > > On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> > > > guest_physmap_pin_range pins a range of guest pages so that their p2m
> > > > mappings won't be changed.
> > > > guest_physmap_unpin_range unpins the previously pinned pages.
> > > > The pinning is done using one of the spare bits in the p2m ptes.
> > > >
> > > > Use the newly introduce p2m_walker to implement the two functions on
> > > > ARM.
> > > > Provide empty stubs for x86.
> > > >
> > > >
> > > > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
> > > >
> > > >
> > > > Changes in v5:
> > > > - return -EBUSY when the P2M_DMA_PIN check fails;
> > > > - rename _guest_physmap_pin_range to pin_one_pte;
> > > > - rename _guest_physmap_unpin_range to unpin_one_pte.
> > > >
> > > > Changes in v4:
> > > > - use p2m_walker to implement guest_physmap_pin_range and
> > > > guest_physmap_unpin_range;
> > > > - return -EINVAL when the P2M_DMA_PIN check fails;
> > > > - change the printk into a gdprintk;
> > > > - add a comment on what type of page can be pinned.
> > > > ---
> > > > xen/arch/arm/p2m.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > > > xen/include/asm-arm/mm.h | 4 +++
> > > > xen/include/asm-arm/page.h | 7 +++++-
> > > > xen/include/asm-x86/p2m.h | 12 +++++++++++
> > > > 4 files changed, 70 insertions(+), 1 deletions(-)
> > > >
> > > > diff --git a/xen/arch/arm/p2m.c b/xen/arch/arm/p2m.c
> > > > index a9ceacf..bac6c7e 100644
> > > > --- a/xen/arch/arm/p2m.c
> > > > +++ b/xen/arch/arm/p2m.c
> > > > @@ -154,6 +154,46 @@ paddr_t p2m_lookup(struct domain *d, paddr_t paddr)
> > > > return p2m.maddr;
> > > > }
> > > >
> > > > +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> > > > +{
> > > > + lpae_t pte = *ptep;
> > > > + ASSERT(level == 3);
> > > > +
> > > > + if ( pte.p2m.avail & P2M_DMA_PIN )
> > > > + return -EBUSY;
> > > > + pte.p2m.avail |= P2M_DMA_PIN;
> > > > + write_pte(ptep, pte);
> > > > + return 0;
> > > > +}
> > > > +
> > > > +int guest_physmap_pin_range(struct domain *d,
> > > > + xen_pfn_t gpfn,
> > > > + unsigned int order)
> > > > +{
> > > > + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> > > > + pin_one_pte, NULL);
> > >
> > > If this fails then you will have left some subset of the pages
> > > successfully pinned. You need to clean it up I think, or be fatal to the
> > > guest or something.
> >
> > Making the failure of guest_physmap_pin_range fatal to the guest would
> > also solve the non-trivial problem of error handling in memory_exchange.
>
> Actually, I don't think this will work.
>
> Consider two concurrent I/O operations dispatched by two threads in the
> same process, to/from non-overlapping regions of the same page.
>
> Hrm, we may actually need a reference count on the pin :-(
>
> We might need to rethink here -- i.e. use a type count on the underlying
> machine page rather than a pin in the p2m entry? That kind of makes
> sense since it will be the underlying page which is exposed to the
> hardware...
I think that we should only allow pinning at page granularity.
Also, I think the guest should do the reference counting: it needs to
keep track of the mfn_to_pfn mappings anyway, so it might as well keep
track of the references too, and avoid issuing a hypercall to pin a
page that is already pinned.
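(A hedged sketch of guest-side reference counting along those lines;
every name is hypothetical, locking is omitted, and a flat array is
used as the per-pfn store purely for brevity:)

    /* Hypothetical guest-side helpers: issue the pin hypercall only on
     * the 0 -> 1 transition and the unpin hypercall only on 1 -> 0. */
    static unsigned int pin_refcnt[MAX_TRACKED_PFNS];

    static int pin_page(xen_pfn_t pfn)
    {
        int rc = 0;

        if ( pin_refcnt[pfn] == 0 )
            rc = xen_exchange_and_pin_one(pfn);   /* hypothetical wrapper */
        if ( rc == 0 )
            pin_refcnt[pfn]++;
        return rc;
    }

    static int unpin_page(xen_pfn_t pfn)
    {
        int rc = 0;

        if ( pin_refcnt[pfn] == 0 )
            return -EINVAL;                       /* never pinned */
        if ( pin_refcnt[pfn] == 1 )
            rc = xen_unpin_one(pfn);              /* hypothetical wrapper */
        if ( rc == 0 )
            pin_refcnt[pfn]--;
        return rc;
    }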
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin
2013-09-10 12:15 ` Jan Beulich
2013-09-10 12:50 ` Ian Campbell
@ 2013-09-27 14:49 ` Stefano Stabellini
1 sibling, 0 replies; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-27 14:49 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel, tim, Ian.Campbell, Stefano Stabellini
On Tue, 10 Sep 2013, Jan Beulich wrote:
> > + unsigned long i;
> > + struct xen_unpin unpin;
> > + xen_pfn_t gpfn;
> > + struct domain *d;
> > +
> > + if ( copy_from_guest(&unpin, arg, 1) )
> > + return -EFAULT;
> > +
> > + /* Various sanity checks. */
> > + if ( /* Extent orders are sensible? */
> > + (unpin.in.extent_order > MAX_ORDER) ||
> > + /* Sizes of input list do not overflow a long? */
> > + ((~0UL >> unpin.in.extent_order) < unpin.in.nr_extents) )
> > + return -EFAULT;
> > +
> > + if ( !guest_handle_okay(unpin.in.extent_start, unpin.in.nr_extents) )
> > + return -EFAULT;
> > +
> > + d = rcu_lock_domain_by_any_id(unpin.in.domid);
> > + if ( d == NULL )
> > + {
> > + rc = -ESRCH;
> > + goto fail;
> > + }
> > +
> > + for ( i = 0; i < unpin.in.nr_extents; i++ )
> > + {
> > + if ( unlikely(__copy_from_guest_offset(
> > + &gpfn, unpin.in.extent_start, i, 1)) )
> > + {
> > + rc = -EFAULT;
> > + goto partial;
> > + }
> > +
>
> Want/need to make sure gpfn is suitable aligned for the passed in
> extent order? Depends on whether guest_physmap_unpin_range()
> is expected to always cope with a mis-aligned one.
I think it should be able to cope with non-order-aligned requests.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-10 9:33 ` Ian Campbell
@ 2013-09-27 15:23 ` Stefano Stabellini
2013-09-27 15:27 ` Ian Campbell
0 siblings, 1 reply; 32+ messages in thread
From: Stefano Stabellini @ 2013-09-27 15:23 UTC (permalink / raw)
To: Ian Campbell; +Cc: xen-devel, tim, Stefano Stabellini
On Tue, 10 Sep 2013, Ian Campbell wrote:
> On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> >
> > +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> > +{
> > + lpae_t pte = *ptep;
> > + ASSERT(level == 3);
> > +
> > + if ( pte.p2m.avail & P2M_DMA_PIN )
> > + return -EBUSY;
> > + pte.p2m.avail |= P2M_DMA_PIN;
> > + write_pte(ptep, pte);
> > + return 0;
> > +}
> > +
> > +int guest_physmap_pin_range(struct domain *d,
> > + xen_pfn_t gpfn,
> > + unsigned int order)
> > +{
> > + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> > + pin_one_pte, NULL);
>
> Did we not also discuss accounting and limits on the amount of memory a
> guest can lock down?
I am thinking of allowing hardware domains (is_hardware_domain(d)) to pin
any number of pages, while preventing any other domain from doing it.
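(That could be as small as the following sketch at the top of
guest_physmap_pin_range(); the -EPERM choice is an assumption:)

    /* Sketch: blanket policy, only the hardware domain may pin. */
    if ( !is_hardware_domain(d) )
        return -EPERM;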
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range
2013-09-27 15:23 ` Stefano Stabellini
@ 2013-09-27 15:27 ` Ian Campbell
0 siblings, 0 replies; 32+ messages in thread
From: Ian Campbell @ 2013-09-27 15:27 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: xen-devel, tim
On Fri, 2013-09-27 at 16:23 +0100, Stefano Stabellini wrote:
> On Tue, 10 Sep 2013, Ian Campbell wrote:
> > On Mon, 2013-09-09 at 17:06 +0100, Stefano Stabellini wrote:
> > >
> > > +static int pin_one_pte(lpae_t *ptep, void *arg, int level)
> > > +{
> > > + lpae_t pte = *ptep;
> > > + ASSERT(level == 3);
> > > +
> > > + if ( pte.p2m.avail & P2M_DMA_PIN )
> > > + return -EBUSY;
> > > + pte.p2m.avail |= P2M_DMA_PIN;
> > > + write_pte(ptep, pte);
> > > + return 0;
> > > +}
> > > +
> > > +int guest_physmap_pin_range(struct domain *d,
> > > + xen_pfn_t gpfn,
> > > + unsigned int order)
> > > +{
> > > + return p2m_walker(d, gpfn << PAGE_SHIFT, order,
> > > + pin_one_pte, NULL);
> >
> > Did we not also discuss accounting and limits on the amount of memory a
> > guest can lock down?
>
> I am thinking of allowing hardware domains (is_hardware_domain(d)) to pin
> any number of pages, while preventing any other domain from doing it.
That would work.
FWIW XENMEM_exchange uses multipage_allocation_permitted which allows
any guest to allocate 2MB contiguous chunks and domains with a rangeset
(i.e. dom0 or passthrough) to allocate whichever chunk they want (or at
least lets them try).
I'm not sure that pinning and exchanging are analogous though; your
blanket privileged-only rule will suffice until we implement
passthrough.
Ian.
^ permalink raw reply [flat|nested] 32+ messages in thread
Thread overview: 32+ messages
2013-09-09 16:04 [PATCH v5 0/4] introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 1/4] xen: move steal_page and donate_page to common code Stefano Stabellini
2013-09-10 9:08 ` Ian Campbell
2013-09-10 10:39 ` Jan Beulich
2013-09-10 17:20 ` Stefano Stabellini
2013-09-10 18:06 ` Tim Deegan
2013-09-10 18:14 ` Stefano Stabellini
2013-09-11 6:31 ` Jan Beulich
2013-09-09 16:06 ` [PATCH v5 2/4] xen/arm: introduce a generic p2m walker and use it in p2m_lookup Stefano Stabellini
2013-09-10 9:26 ` Ian Campbell
2013-09-10 18:00 ` Stefano Stabellini
2013-09-10 18:38 ` Ian Campbell
2013-09-11 16:55 ` Stefano Stabellini
2013-09-09 16:06 ` [PATCH v5 3/4] xen: implement guest_physmap_pin_range and guest_physmap_unpin_range Stefano Stabellini
2013-09-10 9:30 ` Ian Campbell
2013-09-10 19:11 ` Stefano Stabellini
2013-09-10 19:42 ` Ian Campbell
2013-09-27 14:46 ` Stefano Stabellini
2013-09-10 9:33 ` Ian Campbell
2013-09-27 15:23 ` Stefano Stabellini
2013-09-27 15:27 ` Ian Campbell
2013-09-09 16:06 ` [PATCH v5 4/4] xen: introduce XENMEM_exchange_and_pin and XENMEM_unpin Stefano Stabellini
2013-09-10 9:47 ` Ian Campbell
2013-09-10 12:15 ` Jan Beulich
2013-09-10 12:50 ` Ian Campbell
2013-09-10 13:28 ` Jan Beulich
2013-09-10 18:32 ` Stefano Stabellini
2013-09-10 18:37 ` Tim Deegan
2013-09-10 18:40 ` Ian Campbell
2013-09-10 18:50 ` Tim Deegan
2013-09-11 6:33 ` Jan Beulich
2013-09-27 14:49 ` Stefano Stabellini