* [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
[not found] <1373936045-22653-1-git-send-email-aik@ozlabs.ru>
@ 2013-07-16 0:53 ` Alexey Kardashevskiy
2013-07-23 2:22 ` Alexey Kardashevskiy
0 siblings, 1 reply; 7+ messages in thread
From: Alexey Kardashevskiy @ 2013-07-16 0:53 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, David Gibson, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Graf, Alex Williamson, kvm,
linux-kernel, kvm-ppc, linux-mm
The current VFIO-on-POWER implementation supports only user-mode-driven
mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is really slow, so we want to move it into KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement it as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment, as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to increment/decrement the page counter, as the
get_user_pages API used for user-mode mapping does not work
in real mode.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
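As a rough illustration of the intended caller, a real-mode H_PUT_TCE
handler might use this API along the following lines. This is a minimal
sketch only: the handler name, the TCE decoding, and the use of
H_TOO_HARD to bounce the hcall to virtual mode are assumptions for
illustration, not part of this patch.

	/* Hypothetical real-mode H_PUT_TCE path using the new API */
	static long h_put_tce_realmode(unsigned long ioba, unsigned long tce)
	{
		unsigned long pfn = tce >> PAGE_SHIFT;	/* assumed encoding */
		struct page *page;

		/* Translate pfn to struct page without touching virtual mappings */
		page = realmode_pfn_to_page(pfn);
		if (!page)
			return H_TOO_HARD;	/* retry the hcall with the MMU on */

		/* Take a reference; fails for compound pages or a zero refcount */
		if (realmode_get_page(page))
			return H_TOO_HARD;

		/* ... write the TCE into the IOMMU table at ioba ... */

		return H_SUCCESS;
	}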
Cc: linux-mm@kvack.org
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 4 ++
arch/powerpc/mm/init_64.c | 76 +++++++++++++++++++++++++++++++-
include/linux/page-flags.h | 4 +-
3 files changed, 82 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..aa7b169 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,10 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}
+struct page *realmode_pfn_to_page(unsigned long pfn);
+int realmode_get_page(struct page *page);
+int realmode_put_page(struct page *page);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..dcbb806 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,79 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * Any of the realmode_XXXX functions can fail because:
+ * 1) real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * so the requested page struct can be split between blocks and
+ * get_page/put_page may fail;
+ * 2) when huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
+
+#if defined(CONFIG_SPARSEMEM_VMEMMAP) || defined(CONFIG_FLATMEM)
+int realmode_get_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_inc_not_zero(&page->_count))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_get_page);
+
+int realmode_put_page(struct page *page)
+{
+ if (PageCompound(page))
+ return -EAGAIN;
+
+ if (!atomic_add_unless(&page->_count, -1, 1))
+ return -EAGAIN;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(realmode_put_page);
+#endif
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2
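The changelog's note that a failed call "will be passed from real to
virtual mode" deserves a concrete shape. A sketch of that fallback on
the KVM side follows; the wrapper name is hypothetical, and H_TOO_HARD
is the conventional "bounce to virtual mode" return value, assumed here
rather than shown by this patch.

	/* Sketch: real-mode attempt, deferring hard cases to virtual mode */
	static long kvmppc_rm_ref_page(struct page *page)
	{
		long ret = realmode_get_page(page);

		if (ret == -EAGAIN) {
			/*
			 * Compound page or zero refcount: too hard for real
			 * mode. H_TOO_HARD makes KVM exit to the host and
			 * redo the hypercall with the MMU on, where plain
			 * get_page()/get_user_pages() are safe.
			 */
			return H_TOO_HARD;
		}
		return ret;	/* 0 on success */
	}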
* Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
2013-07-16 0:53 ` [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap Alexey Kardashevskiy
@ 2013-07-23 2:22 ` Alexey Kardashevskiy
2013-07-24 22:43 ` Andrew Morton
0 siblings, 1 reply; 7+ messages in thread
From: Alexey Kardashevskiy @ 2013-07-23 2:22 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Graf, Alex Williamson, kvm,
linux-kernel, kvm-ppc, linux-mm, Andrew Morton, Yasuaki Ishimatsu,
Martin Schwidefsky, Andrea Arcangeli, Mel Gorman
Ping, anyone, please?
Ben needs an ack from one of the MM people before proceeding with this patch. Thanks!
On 07/16/2013 10:53 AM, Alexey Kardashevskiy wrote:
> The current VFIO-on-POWER implementation supports only user-mode-driven
> mapping, i.e. QEMU sends requests to map/unmap pages.
> However, this approach is really slow, so we want to move it into KVM.
> [rest of the patch quoted verbatim; identical to the posting above, snipped]
--
Alexey
* Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
2013-07-23 2:22 ` Alexey Kardashevskiy
@ 2013-07-24 22:43 ` Andrew Morton
2013-07-24 23:13 ` Benjamin Herrenschmidt
0 siblings, 1 reply; 7+ messages in thread
From: Andrew Morton @ 2013-07-24 22:43 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, David Gibson, Benjamin Herrenschmidt,
Paul Mackerras, Alexander Graf, Alex Williamson, kvm,
linux-kernel, kvm-ppc, linux-mm, Yasuaki Ishimatsu,
Martin Schwidefsky, Andrea Arcangeli, Mel Gorman
On Tue, 23 Jul 2013 12:22:59 +1000 Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> Ping, anyone, please?
ew, you top-posted.
> Ben needs an ack from one of the MM people before proceeding with this patch. Thanks!
For what? The three lines of comment in page-flags.h? ack :)
Manipulating page->_count directly is considered poor form. Don't
blame us if we break your code ;)
Actually, the manipulation in realmode_get_page() duplicates the
existing get_page_unless_zero() and the one in realmode_put_page()
could perhaps be placed in mm.h with a suitable name and some
documentation. That would improve your form and might protect the code
from getting broken later on.
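The respin later in this thread implements exactly this suggestion; the
mm.h helper ends up as below (reproduced from the later patch in this
thread, not new code):

	/*
	 * Drop a ref unless the page has a refcount of one; return false in
	 * that case so the caller can defer the final put to virtual mode.
	 */
	static inline int put_page_unless_one(struct page *page)
	{
		return atomic_add_unless(&page->_count, -1, 1);
	}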
* Re: [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
2013-07-24 22:43 ` Andrew Morton
@ 2013-07-24 23:13 ` Benjamin Herrenschmidt
2013-07-25 10:26 ` [PATCH] " Alexey Kardashevskiy
0 siblings, 1 reply; 7+ messages in thread
From: Benjamin Herrenschmidt @ 2013-07-24 23:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Alexey Kardashevskiy, linuxppc-dev, David Gibson, Paul Mackerras,
Alexander Graf, Alex Williamson, kvm, linux-kernel, kvm-ppc,
linux-mm, Yasuaki Ishimatsu, Martin Schwidefsky, Andrea Arcangeli,
Mel Gorman
On Wed, 2013-07-24 at 15:43 -0700, Andrew Morton wrote:
> For what? The three lines of comment in page-flags.h? ack :)
>
> Manipulating page->_count directly is considered poor form. Don't
> blame us if we break your code ;)
>
> Actually, the manipulation in realmode_get_page() duplicates the
> existing get_page_unless_zero() and the one in realmode_put_page()
> could perhaps be placed in mm.h with a suitable name and some
> documentation. That would improve your form and might protect the code
> from getting broken later on.
Yes, this stuff makes me really nervous :-) If it didn't provide an order
of magnitude performance improvement in KVM I would avoid it, but heh...
Alexey, I like having that stuff in generic code.
However, the meaning of the words "real mode" can be ambiguous across
architectures, so it might be best to name it "mmu_off_put_page" to
make things a bit clearer, along with a comment explaining that this is
called in a context where none of the virtual mappings are accessible
(vmalloc, vmemmap, IOs, ...), and that in the case of sparsemem vmemmap
the caller must have taken care of getting the physical address of the
struct page and of ensuring it isn't split across two vmemmap blocks.
Cheers,
Ben.
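Sketched out, Ben's proposal would look something like the following.
The name mmu_off_put_page is his hypothetical suggestion; the thread
ultimately settles on put_page_unless_one() instead, with the same body.

	/*
	 * mmu_off_put_page: may be called in a context where none of the
	 * virtual mappings (vmalloc, vmemmap, IOs, ...) are accessible.
	 * With sparsemem vmemmap, the caller must already hold the physical
	 * address of the struct page and must have ensured that it is not
	 * split across two vmemmap blocks.
	 */
	static inline int mmu_off_put_page(struct page *page)
	{
		return atomic_add_unless(&page->_count, -1, 1);
	}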
* [PATCH] powerpc: Prepare to support kernel handling of IOMMU map/unmap
2013-07-24 23:13 ` Benjamin Herrenschmidt
@ 2013-07-25 10:26 ` Alexey Kardashevskiy
2013-07-25 10:33 ` Alexey Kardashevskiy
0 siblings, 1 reply; 7+ messages in thread
From: Alexey Kardashevskiy @ 2013-07-25 10:26 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
Andrew Morton, linux-kernel, linux-mm
The current VFIO-on-POWER implementation supports only user-mode-driven
mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is really slow, so we want to move it into KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement it as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment, as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API to get the struct page when the MMU is off.
This adds to MM a new function put_page_unless_one() which drops a
reference only if the page's refcount is greater than one. It is going
to be used when the MMU is off (real mode on PPC64 is the first user)
and we want to make sure that the final page release will not happen
in real mode, as that would crash the kernel in a horrible way.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
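Put together, the intended real-mode usage pattern is roughly the
following. This is a sketch: the wrapper name is made up, and the
H_TOO_HARD/H_SUCCESS return values are the conventional PAPR hcall
codes, assumed rather than shown by this patch.

	/* Sketch: real-mode unmap deferring the final put to virtual mode */
	static long realmode_unref_pfn(unsigned long pfn)
	{
		struct page *page = realmode_pfn_to_page(pfn);

		if (!page)
			return H_TOO_HARD;	/* split or compound page */

		/* Never drop the last reference with the MMU off */
		if (!put_page_unless_one(page))
			return H_TOO_HARD;	/* redo with put_page() in virtual mode */

		return H_SUCCESS;
	}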
Cc: linux-mm@kvack.org
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
2013/07/25:
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to match the already existing get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() will be used
* realmode_pfn_to_page fixed to return NULL for compound pages
2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
arch/powerpc/mm/init_64.c | 54 +++++++++++++++++++++++++++++++-
include/linux/mm.h | 14 +++++++++
include/linux/page-flags.h | 4 ++-
4 files changed, 72 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..d2bb97c 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,57 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page() can fail because:
+ * 1) real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * so the requested page struct can be split between blocks and
+ * get_page/put_page may fail;
+ * 2) when huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+
+ if (PageCompound(page))
+ return NULL;
+
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..ef07aea 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,12 +290,26 @@ static inline int put_page_testzero(struct page *page)
/*
* Try to grab a ref unless the page has a refcount of zero, return false if
* that is the case.
+ * This can be called when the MMU is off, so it must not access
+ * any of the virtual mappings.
*/
static inline int get_page_unless_zero(struct page *page)
{
return atomic_inc_not_zero(&page->_count);
}
+/*
+ * Try to drop a ref unless the page has a refcount of one, return false if
+ * that is the case.
+ * This is to make sure that the refcount won't become zero after this drop.
+ * This can be called when the MMU is off, so it must not access
+ * any of the virtual mappings.
+ */
+static inline int put_page_unless_one(struct page *page)
+{
+ return atomic_add_unless(&page->_count, -1, 1);
+}
+
extern int page_is_ram(unsigned long pfn);
/* Support for virtually mapped pages */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2
* Re: [PATCH] powerpc: Prepare to support kernel handling of IOMMU map/unmap
2013-07-25 10:26 ` [PATCH] " Alexey Kardashevskiy
@ 2013-07-25 10:33 ` Alexey Kardashevskiy
0 siblings, 0 replies; 7+ messages in thread
From: Alexey Kardashevskiy @ 2013-07-25 10:33 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: linuxppc-dev, Benjamin Herrenschmidt, Paul Mackerras,
Andrew Morton, linux-kernel, linux-mm
On 07/25/2013 08:26 PM, Alexey Kardashevskiy wrote:
> [commit message quoted; first paragraphs snipped]
> This adds to MM a new function put_page_unless_one() which drops a
> reference only if the page's refcount is greater than one. It is going
> to be used when the MMU is off (real mode on PPC64 is the first user)
> and we want to make sure that the final page release will not happen
> in real mode, as that would crash the kernel in a horrible way.
Yes, my English needs to be polished, I even know where :)
--
Alexey
* [PATCH 04/10] powerpc: Prepare to support kernel handling of IOMMU map/unmap
[not found] <1375332272-22176-1-git-send-email-aik@ozlabs.ru>
@ 2013-08-01 4:44 ` Alexey Kardashevskiy
0 siblings, 0 replies; 7+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-01 4:44 UTC (permalink / raw)
To: linuxppc-dev
Cc: Alexey Kardashevskiy, Benjamin Herrenschmidt, Paul Mackerras,
Alexander Graf, kvm, linux-doc, linux-kernel, kvm-ppc, linux-mm,
Andrew Morton
The current VFIO-on-POWER implementation supports only user-mode-driven
mapping, i.e. QEMU sends requests to map/unmap pages.
However, this approach is really slow, so we want to move it into KVM.
Since H_PUT_TCE can be extremely performance sensitive (especially with
network adapters where each packet needs to be mapped/unmapped) we chose
to implement it as a "fast" hypercall directly in "real
mode" (processor still in the guest context but MMU off).
To be able to do that, we need to provide some facilities to
access the struct page count within that real mode environment, as things
like the sparsemem vmemmap mappings aren't accessible.
This adds an API function realmode_pfn_to_page() to get the struct page
when the MMU is off.
This adds to MM a new function put_page_unless_one() which drops a
reference only if the page's refcount is greater than one. It is going
to be used when the MMU is off (for example, real mode on PPC64) and we
want to make sure that the final page release will not happen in real
mode, as that would crash the kernel in a horrible way.
CONFIG_SPARSEMEM_VMEMMAP and CONFIG_FLATMEM are supported.
Cc: linux-mm@kvack.org
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
2013/07/25 (v7):
* removed realmode_put_page and added put_page_unless_one() instead.
The name has been chosen to match the already existing
get_page_unless_zero().
* removed realmode_get_page. Instead, get_page_unless_zero() should be used
2013/07/10:
* adjusted comment (removed sentence about virtual mode)
* get_page_unless_zero replaced with atomic_inc_not_zero to minimize
effect of a possible get_page_unless_zero() rework (if it ever happens).
2013/06/27:
* realmode_get_page() fixed to use get_page_unless_zero(). If failed,
the call will be passed from real to virtual mode and safely handled.
* added comment to PageCompound() in include/linux/page-flags.h.
2013/05/20:
* PageTail() is replaced by PageCompound() in order to have the same checks
for whether the page is huge in realmode_get_page() and realmode_put_page()
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
arch/powerpc/mm/init_64.c | 50 +++++++++++++++++++++++++++++++-
include/linux/mm.h | 14 +++++++++
include/linux/page-flags.h | 4 ++-
4 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 46db094..4a191c4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -394,6 +394,8 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
hpte_slot_array[index] = hidx << 4 | 0x1 << 3;
}
+struct page *realmode_pfn_to_page(unsigned long pfn);
+
static inline char *get_hpte_slot_array(pmd_t *pmdp)
{
/*
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index d0cd9e4..8cf345a 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -300,5 +300,53 @@ void vmemmap_free(unsigned long start, unsigned long end)
{
}
-#endif /* CONFIG_SPARSEMEM_VMEMMAP */
+/*
+ * We do not have access to the sparsemem vmemmap, so we fall back to
+ * walking the list of sparsemem blocks which we already maintain for
+ * the sake of crashdump. In the long run, we might want to maintain
+ * a tree if performance of that linear walk becomes a problem.
+ *
+ * realmode_pfn_to_page() can fail because:
+ * 1) real sparsemem blocks do not lie in RAM contiguously (they
+ * are in virtual address space which is not available in real mode),
+ * so the requested page struct can be split between blocks and
+ * get_page/put_page may fail;
+ * 2) when huge pages are used, the get_page/put_page API will fail
+ * in real mode as the linked addresses in the page struct are virtual
+ * too.
+ */
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct vmemmap_backing *vmem_back;
+ struct page *page;
+ unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
+ unsigned long pg_va = (unsigned long) pfn_to_page(pfn);
+
+ for (vmem_back = vmemmap_list; vmem_back; vmem_back = vmem_back->list) {
+ if (pg_va < vmem_back->virt_addr)
+ continue;
+
+ /* Check that page struct is not split between real pages */
+ if ((pg_va + sizeof(struct page)) >
+ (vmem_back->virt_addr + page_size))
+ return NULL;
+
+ page = (struct page *) (vmem_back->phys + pg_va -
+ vmem_back->virt_addr);
+ return page;
+ }
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#elif defined(CONFIG_FLATMEM)
+
+struct page *realmode_pfn_to_page(unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ return page;
+}
+EXPORT_SYMBOL_GPL(realmode_pfn_to_page);
+
+#endif /* CONFIG_SPARSEMEM_VMEMMAP/CONFIG_FLATMEM */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..dcc99b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -290,12 +290,26 @@ static inline int put_page_testzero(struct page *page)
/*
* Try to grab a ref unless the page has a refcount of zero, return false if
* that is the case.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
*/
static inline int get_page_unless_zero(struct page *page)
{
return atomic_inc_not_zero(&page->_count);
}
+/*
+ * Try to drop a ref unless the page has a refcount of one, return false if
+ * that is the case.
+ * This is to make sure that the refcount won't become zero after this drop.
+ * This can be called when MMU is off so it must not access
+ * any of the virtual mappings.
+ */
+static inline int put_page_unless_one(struct page *page)
+{
+ return atomic_add_unless(&page->_count, -1, 1);
+}
+
extern int page_is_ram(unsigned long pfn);
/* Support for virtually mapped pages */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..98ada58 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -329,7 +329,9 @@ static inline void set_page_writeback(struct page *page)
* System with lots of page flags available. This allows separate
* flags for PageHead() and PageTail() checks of compound pages so that bit
* tests can be used in performance sensitive paths. PageCompound is
- * generally not used in hot code paths.
+ * generally not used in hot code paths except arch/powerpc/mm/init_64.c
+ * and arch/powerpc/kvm/book3s_64_vio_hv.c which use it to detect huge pages
+ * and avoid handling those in real mode.
*/
__PAGEFLAG(Head, head) CLEARPAGEFLAG(Head, head)
__PAGEFLAG(Tail, tail)
--
1.8.3.2
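As a worked example of the "split between real pages" check in
realmode_pfn_to_page(), the following standalone program walks through
the arithmetic. The 16MB vmemmap page size and 64-byte struct page are
typical values assumed for illustration, not guarantees.

	/* Illustration of the struct-page-split check (userspace, standalone) */
	#include <stdio.h>

	int main(void)
	{
		unsigned long page_size = 1UL << 24;	/* assumed 16MB vmemmap pages */
		unsigned long sizeof_page = 64;		/* assumed sizeof(struct page) */
		unsigned long virt_addr = 0xc000000000000000UL;	/* example block start */

		/* A struct page starting 40 bytes before the end of this block
		 * spills 24 bytes into the next block, whose physical address
		 * is unknown from this block's list entry alone. */
		unsigned long pg_va = virt_addr + page_size - 40;

		if (pg_va + sizeof_page > virt_addr + page_size)
			printf("split across vmemmap blocks: lookup returns NULL\n");

		return 0;
	}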