* [RFCv3][PATCH 1/3] create slow_virt_to_phys()
@ 2013-01-09 18:59 Dave Hansen
  2013-01-09 18:59 ` [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas Dave Hansen
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Dave Hansen @ 2013-01-09 18:59 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar, H. Peter Anvin,
	x86, Marcelo Tosatti, Dave Hansen


Broadening the cc list here a bit...  This bug is still present,
and I still need these patches to boot 32-bit NUMA kernels.  They
might be obscure, but if we don't care about them any more, perhaps
we should go remove the NUMA remapping code instead of this.

--

This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit
NUMA systems.  We have some functions to do conversions _like_
this in the vmalloc() code (like vmalloc_to_page()), but they
do not work on sizes other than 4k pages.  We would potentially
need to be able to handle all the page sizes that we use for
the kernel linear mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck
in the alloc_remap() area.  Any __pa() call on them will break
and basically return garbage.
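
For background: outside the debugging config, a 32-bit __pa() is
nothing more than linear-map arithmetic, so the failure is easy to
sketch (assuming the stock x86 definitions):

	/* what __pa() boils down to on 32-bit (see __phys_addr_nodebug()) */
	phys = (unsigned long)vaddr - PAGE_OFFSET;
	/*
	 * Only valid for addresses in the kernel linear mapping; for
	 * an alloc_remap() or vmalloc() address, the subtraction
	 * yields some unrelated physical address.
	 */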

This patch introduces a new function slow_virt_to_phys(), which
walks the kernel page tables on x86 and should do precisely
the same logical thing as __pa(), but actually work on a wider
range of memory.  It should work on the normal linear mapping,
vmalloc(), kmap(), etc...
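
A hedged usage sketch (not part of the patch; a vmalloc() pointer
stands in for any memory outside the linear mapping):

	void *buf = vmalloc(PAGE_SIZE);
	phys_addr_t pa = slow_virt_to_phys(buf); /* page-table walk: OK */
	/* __pa(buf) here would compute garbage, as described above */
	vfree(buf);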


Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h |    1 
 linux-2.6.git-dave/arch/x86/mm/pageattr.c               |   47 ++++++++++++++++
 2 files changed, 48 insertions(+)

diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys	2013-01-09 13:55:39.370629437 -0500
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h	2013-01-09 13:55:39.386629577 -0500
@@ -352,6 +352,7 @@ static inline void update_page_count(int
  * as a pte too.
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
 
 #endif	/* !__ASSEMBLY__ */
 
diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys	2013-01-09 13:55:39.370629437 -0500
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c	2013-01-09 13:55:39.386629577 -0500
@@ -364,6 +364,53 @@ pte_t *lookup_address(unsigned long addr
 EXPORT_SYMBOL_GPL(lookup_address);
 
 /*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems.  The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+	unsigned long virt_addr = (unsigned long)__virt_addr;
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	unsigned int level = -1;
+	unsigned long psize = 0;
+	unsigned long pmask = 0;
+	pte_t *pte;
+
+	pte = lookup_address(virt_addr, &level);
+	BUG_ON(!pte);
+	switch (level) {
+	case PG_LEVEL_4K:
+		psize = PAGE_SIZE;
+		pmask = PAGE_MASK;
+		break;
+	case PG_LEVEL_2M:
+		psize = PMD_PAGE_SIZE;
+		pmask = PMD_PAGE_MASK;
+		break;
+#ifdef CONFIG_X86_64
+	case PG_LEVEL_1G:
+		psize = PUD_PAGE_SIZE;
+		pmask = PUD_PAGE_MASK;
+		break;
+#endif
+	default:
+		BUG();
+	}
+	offset = virt_addr & ~pmask;
+	phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+	return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
  * Set the new pmd in all the pgds we know about:
  */
 static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
_


* [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas
  2013-01-09 18:59 [RFCv3][PATCH 1/3] create slow_virt_to_phys() Dave Hansen
@ 2013-01-09 18:59 ` Dave Hansen
  2013-01-15 18:38   ` Rik van Riel
  2013-01-09 18:59 ` [RFCv3][PATCH 3/3] make DEBUG_VIRTUAL work earlier in boot Dave Hansen
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2013-01-09 18:59 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar, H. Peter Anvin,
	x86, Marcelo Tosatti, Dave Hansen


In short, it is illegal to call __pa() on an address holding
a percpu variable.  The times when this actually matters are
pretty obscure (certain 32-bit NUMA systems), but it _does_
happen.  It is important to keep KVM guests working on these
systems because the real hardware is getting harder and
harder to find.

This bug first manifested as a plain hang at boot, just after
this message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

[    0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.
This can be worked around by disabling that code either at
compile-time, or on the kernel command-line.

The kvm async pagefault code was injecting page faults into
the guest, which the guest misinterpreted because its
"reason" was not being properly sent from the host.

The guest passes the physical address of a per-cpu async page
fault structure via an MSR to the host.  Since __pa() is
broken on percpu data, the physical address it sent was
basically bogus and the host went scribbling on random data.
The guest never saw the real reason for the page fault (it
was injected by the host), assumed that the kernel had taken
a _real_ page fault, and panic()'d.
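
To make the handshake concrete, here is a sketch of the guest-side
registration being fixed (names taken from the existing kvm.c code;
the actual change is in the diff below):

	/* hand the host the physical address of the per-cpu
	 * apf_reason structure via an MSR */
	u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); /* was __pa() */
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);
	/* with a bogus pa, the host scribbles the fault "reason"
	 * into the wrong page, the guest reads reason == 0 and
	 * treats the injected fault as a real one */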

Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---

 linux-2.6.git-dave/arch/x86/kernel/kvm.c      |    9 +++++----
 linux-2.6.git-dave/arch/x86/kernel/kvmclock.c |    4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c
--- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas	2013-01-09 13:55:44.218672043 -0500
+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c	2013-01-09 13:55:44.222672079 -0500
@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void
 
 	memset(st, 0, sizeof(*st));
 
-	wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+	wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
 	printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
-		cpu, __pa(st));
+		cpu, slow_virt_to_phys(st));
 }
 
 static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void)
 		return;
 
 	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
-		u64 pa = __pa(&__get_cpu_var(apf_reason));
+		u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
 
 #ifdef CONFIG_PREEMPT
 		pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void)
 		/* Size alignment is implied but just to make it explicit. */
 		BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
 		__get_cpu_var(kvm_apic_eoi) = 0;
-		pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+		pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+			| KVM_MSR_ENABLED;
 		wrmsrl(MSR_KVM_PV_EOI_EN, pa);
 	}
 
diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c
--- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas	2013-01-09 13:55:44.218672043 -0500
+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c	2013-01-09 13:55:44.222672079 -0500
@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
 	int low, high, ret;
 	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
 
-	low = (int)__pa(src) | 1;
-	high = ((u64)__pa(src) >> 32);
+	low = (int)slow_virt_to_phys(src) | 1;
+	high = ((u64)slow_virt_to_phys(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
_


* [RFCv3][PATCH 3/3] make DEBUG_VIRTUAL work earlier in boot
  2013-01-09 18:59 [RFCv3][PATCH 1/3] create slow_virt_to_phys() Dave Hansen
  2013-01-09 18:59 ` [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas Dave Hansen
@ 2013-01-09 18:59 ` Dave Hansen
  2013-01-15 17:04 ` [RFCv3][PATCH 1/3] create slow_virt_to_phys() Rik van Riel
  2013-01-15 19:46 ` H. Peter Anvin
  3 siblings, 0 replies; 8+ messages in thread
From: Dave Hansen @ 2013-01-09 18:59 UTC
  To: linux-kernel
  Cc: linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar, H. Peter Anvin,
	x86, Marcelo Tosatti, Dave Hansen


The KVM code has some repeated bugs in it around use of __pa() on
per-cpu data.  Those data are not in an area on which __pa() is
valid.  However, the calls also happen early enough in boot that
__vmalloc_start_set is not set, and thus the CONFIG_DEBUG_VIRTUAL
debugging does not catch them.

This adds a check to also verify the address against max_low_pfn
(the highest pfn in the kernel's direct mapping), which we can use
earlier in boot than is_vmalloc_addr().  However, if we are
super-early in boot, max_low_pfn=0 and this will trip on every
call, so also make sure that max_low_pfn is set.

With this patch applied, CONFIG_DEBUG_VIRTUAL will actually
catch the bug I was chasing.

I'd love to find a generic way so that any __pa() call on percpu
areas could do a BUG_ON(), but there don't appear to be any nice
and easy ways to check if an address is a percpu one.  Anybody
have ideas on a way to do this?

---

 linux-2.6.git-dave/arch/x86/mm/numa.c     |    2 +-
 linux-2.6.git-dave/arch/x86/mm/pat.c      |    4 ++--
 linux-2.6.git-dave/arch/x86/mm/physaddr.c |    9 ++++++++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff -puN arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/numa.c
--- linux-2.6.git/arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-09 13:55:44.718676439 -0500
+++ linux-2.6.git-dave/arch/x86/mm/numa.c	2013-01-09 13:55:44.726676509 -0500
@@ -219,7 +219,7 @@ static void __init setup_node_data(int n
 	 */
 	nd = alloc_remap(nid, nd_size);
 	if (nd) {
-		nd_pa = __pa(nd);
+		nd_pa = __phys_addr_nodebug(nd);
 		remapped = true;
 	} else {
 		nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
diff -puN arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/pat.c
--- linux-2.6.git/arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-09 13:55:44.722676474 -0500
+++ linux-2.6.git-dave/arch/x86/mm/pat.c	2013-01-09 13:55:44.726676509 -0500
@@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, un
 {
 	unsigned long id_sz;
 
-	if (base >= __pa(high_memory))
+	if (base > __pa(high_memory-1))
 		return 0;
 
-	id_sz = (__pa(high_memory) < base + size) ?
+	id_sz = (__pa(high_memory-1) <= base + size) ?
 				__pa(high_memory) - base :
 				size;
 
diff -puN arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/physaddr.c
--- linux-2.6.git/arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot	2013-01-09 13:55:44.722676474 -0500
+++ linux-2.6.git-dave/arch/x86/mm/physaddr.c	2013-01-09 13:55:44.726676509 -0500
@@ -1,3 +1,4 @@
+#include <linux/bootmem.h>
 #include <linux/mmdebug.h>
 #include <linux/module.h>
 #include <linux/mm.h>
@@ -47,10 +48,16 @@ EXPORT_SYMBOL(__virt_addr_valid);
 #ifdef CONFIG_DEBUG_VIRTUAL
 unsigned long __phys_addr(unsigned long x)
 {
+	unsigned long phys_addr = x - PAGE_OFFSET;
 	/* VMALLOC_* aren't constants  */
 	VIRTUAL_BUG_ON(x < PAGE_OFFSET);
 	VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x));
-	return x - PAGE_OFFSET;
+	/* max_low_pfn is set early, but not _that_ early */
+	if (max_low_pfn) {
+		VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
+		BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
+	}
+	return phys_addr;
 }
 EXPORT_SYMBOL(__phys_addr);
 #endif
_


* Re: [RFCv3][PATCH 1/3] create slow_virt_to_phys()
  2013-01-09 18:59 [RFCv3][PATCH 1/3] create slow_virt_to_phys() Dave Hansen
  2013-01-09 18:59 ` [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas Dave Hansen
  2013-01-09 18:59 ` [RFCv3][PATCH 3/3] make DEBUG_VIRTUAL work earlier in boot Dave Hansen
@ 2013-01-15 17:04 ` Rik van Riel
  2013-01-15 19:46 ` H. Peter Anvin
  3 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2013-01-15 17:04 UTC
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, Ingo Molnar, H. Peter Anvin,
	x86, Marcelo Tosatti

On 01/09/2013 01:59 PM, Dave Hansen wrote:
> Broadening the cc list here a bit...  This bug is still present,
> and I still need these patches to boot 32-bit NUMA kernels.  They
> might be obscure, but if we don't care about them any more, perhaps
> we should go remove the NUMA remapping code instead of this.
>
> --
>
> This is necessary because __pa() does not work on some kinds of
> memory, like vmalloc() or the alloc_remap() areas on 32-bit
> NUMA systems.  We have some functions to do conversions _like_
> this in the vmalloc() code (like vmalloc_to_page()), but they
> do not work on sizes other than 4k pages.  We would potentially
> need to be able to handle all the page sizes that we use for
> the kernel linear mapping (4k, 2M, 1G).

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas
  2013-01-09 18:59 ` [RFCv3][PATCH 2/3] fix kvm's use of __pa() on percpu areas Dave Hansen
@ 2013-01-15 18:38   ` Rik van Riel
  0 siblings, 0 replies; 8+ messages in thread
From: Rik van Riel @ 2013-01-15 18:38 UTC
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, Ingo Molnar, H. Peter Anvin,
	x86, Marcelo Tosatti

On 01/09/2013 01:59 PM, Dave Hansen wrote:
> In short, it is illegal to call __pa() on an address holding
> a percpu variable.  The times when this actually matters are
> pretty obscure (certain 32-bit NUMA systems), but it _does_
> happen.  It is important to keep KVM guests working on these
> systems because the real hardware is getting harder and
> harder to find.
>
> This bug first manifested as a plain hang at boot, just after
> this message:
>
> 	CPU 0 irqstacks, hard=f3018000 soft=f301a000
>
> or, sometimes, it would actually make it out to the console:
>
> [    0.000000] BUG: unable to handle kernel paging request at ffffffff
>
> I eventually traced it down to the KVM async pagefault code.
> This can be worked around by disabling that code either at
> compile-time, or on the kernel command-line.

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [RFCv3][PATCH 1/3] create slow_virt_to_phys()
  2013-01-09 18:59 [RFCv3][PATCH 1/3] create slow_virt_to_phys() Dave Hansen
                   ` (2 preceding siblings ...)
  2013-01-15 17:04 ` [RFCv3][PATCH 1/3] create slow_virt_to_phys() Rik van Riel
@ 2013-01-15 19:46 ` H. Peter Anvin
  2013-01-15 22:50   ` Dave Hansen
  3 siblings, 1 reply; 8+ messages in thread
From: H. Peter Anvin @ 2013-01-15 19:46 UTC
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar,
	x86, Marcelo Tosatti

On 01/09/2013 10:59 AM, Dave Hansen wrote:
> +	switch (level) {
> +	case PG_LEVEL_4K:
> +		psize = PAGE_SIZE;
> +		pmask = PAGE_MASK;
> +		break;
> +	case PG_LEVEL_2M:
> +		psize = PMD_PAGE_SIZE;
> +		pmask = PMD_PAGE_MASK;
> +		break;
> +#ifdef CONFIG_X86_64
> +	case PG_LEVEL_1G:
> +		psize = PUD_PAGE_SIZE;
> +		pmask = PUD_PAGE_MASK;
> +		break;
> +#endif
> +	default:
> +		BUG();
> +	}

I object to this switch statement.  If we are going to create new 
primitives, let's create a primitive that embodies this and put it in 
pgtable_types.h, especially since it is simply an algorithmic operation:

static inline unsigned long page_level_size(int level)
{
	return (PAGE_SIZE/PGDIR_SIZE) << (PGDIR_SHIFT*level);
}
static inline unsigned long page_level_shift(int level)
{
	return (PAGE_SHIFT-PGDIR_SHIFT) + (PGDIR_SHIFT*level);
}

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


* Re: [RFCv3][PATCH 1/3] create slow_virt_to_phys()
  2013-01-15 19:46 ` H. Peter Anvin
@ 2013-01-15 22:50   ` Dave Hansen
  2013-01-15 23:46     ` H. Peter Anvin
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Hansen @ 2013-01-15 22:50 UTC
  To: H. Peter Anvin
  Cc: linux-kernel, linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar,
	x86, Marcelo Tosatti

On 01/15/2013 11:46 AM, H. Peter Anvin wrote:
> I object to this switch statement.  If we are going to create new
> primitives, let's create a primitive that embodies this and put it in
> pgtable_types.h, especially since it is simply an algorithmic operation:

Yeah, that's a good point.  I did at least copy part of the switch from
elsewhere in the file, so there's certainly room for consolidating some
things.

> static inline unsigned long page_level_size(int level)
> {
>     return (PAGE_SIZE/PGDIR_SIZE) << (PGDIR_SHIFT*level);
> }
> static inline unsigned long page_level_shift(int level)
> {
>     return (PAGE_SHIFT-PGDIR_SHIFT) + (PGDIR_SHIFT*level);
> }

(PAGE_SHIFT-PGDIR_SHIFT) == (12 - 39) == -27 on x86-64, so this can't
possibly work, right?

How about something like this?

/*
 * Note: this only holds true for pagetable levels where PTEs can be
 * present.  It would break if you used it on the PGD level where PAE
 * is in use.  It basically assumes that the shift between _all_
 * adjacent levels of the pagetables are the same as the lowest-level
 * shift.
 */
#define PG_SHIFT_PER_LEVEL (PMD_SHIFT-PAGE_SHIFT)

static inline unsigned long page_level_shift(int level)
{
	return PAGE_SHIFT + (level - PG_LEVEL_4K) * PG_SHIFT_PER_LEVEL;
}
static inline unsigned long page_level_size(int level)
{
	return 1 << page_level_shift(level);
}
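
A quick sanity check of the arithmetic, assuming x86-64 values
(PAGE_SHIFT == 12, PMD_SHIFT == 21, so PG_SHIFT_PER_LEVEL == 9) and
the existing enum pg_level numbering (PG_LEVEL_4K == 1):

	page_level_shift(PG_LEVEL_4K);	/* 12 + 0*9 == 12 -> 4k */
	page_level_shift(PG_LEVEL_2M);	/* 12 + 1*9 == 21 -> 2M */
	page_level_shift(PG_LEVEL_1G);	/* 12 + 2*9 == 30 -> 1G */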

The generated code for page_level_size() looks pretty good, despite it
depending on page_level_shift(), so we might as well leave it defined
this way for simplicity:

0000000000400610 <plsize>:
  400610:       8d 7c bf fb             lea    -0x5(%rdi,%rdi,4),%edi
  400614:       b8 01 00 00 00          mov    $0x1,%eax
  400619:       8d 4c 3f 0c             lea    0xc(%rdi,%rdi,1),%ecx
  40061d:       d3 e0                   shl    %cl,%eax
  40061f:       c3                      retq

I'll send out another series doing this.


* Re: [RFCv3][PATCH 1/3] create slow_virt_to_phys()
  2013-01-15 22:50   ` Dave Hansen
@ 2013-01-15 23:46     ` H. Peter Anvin
  0 siblings, 0 replies; 8+ messages in thread
From: H. Peter Anvin @ 2013-01-15 23:46 UTC
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Gleb Natapov, Avi Kivity, Ingo Molnar,
	x86, Marcelo Tosatti

On 01/15/2013 02:50 PM, Dave Hansen wrote:
>
>> static inline unsigned long page_level_size(int level)
>> {
>>      return (PAGE_SIZE/PGDIR_SIZE) << (PGDIR_SHIFT*level);
>> }
>> static inline unsigned long page_level_shift(int level)
>> {
>>      return (PAGE_SHIFT-PGDIR_SHIFT) + (PGDIR_SHIFT*level);
>> }
>
> (PAGE_SHIFT-PGDIR_SHIFT) == (12 - 39) == -27 on x86-64, so this can't
> possibly work, right?
>

Ah right... sorry, I got mixed up in my head about what that constant means.

> How about something like this?
>
> /*
>   * Note: this only holds true for pagetable levels where PTEs can be
>   * present.  It would break if you used it on the PGD level where PAE
>   * is in use.  It basically assumes that the shift between _all_
>   * adjacent levels of the pagetables is the same as the lowest-level
>   * shift.
>   */

This comment is totally misleading.  What it refers to is the separation 
between various levels of the page hierarchy; on x86 it is always the same.

Perhaps a cleaner way to do this is:

#define PTRS_PER_PTE_SHIFT	ilog2(PTRS_PER_PTE)
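
(ilog2() folds to a compile-time constant for constant arguments, so
this costs nothing at runtime.)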

> #define PG_SHIFT_PER_LEVEL (PMD_SHIFT-PAGE_SHIFT)
>
> static inline unsigned long page_level_shift(int level)
> {
> 	return PAGE_SHIFT + (level - PG_LEVEL_4K) * PG_SHIFT_PER_LEVEL;
> }
> static inline unsigned long page_level_size(int level)
> {
> 	return 1 << page_level_shift(level);
> }
>
> The generated code for page_level_size() looks pretty good, despite it
> depending on page_level_shift(), so we might as well leave it defined
> this way for simplicity:
>

Make sure to make that 1UL instead of 1; page_level_shift() should 
return int.  See below.

> 0000000000400610 <plsize>:
>    400610:       8d 7c bf fb             lea    -0x5(%rdi,%rdi,4),%edi
>    400614:       b8 01 00 00 00          mov    $0x1,%eax
>    400619:       8d 4c 3f 0c             lea    0xc(%rdi,%rdi,1),%ecx
>    40061d:       d3 e0                   shl    %cl,%eax
>    40061f:       c3                      retq

We get better code with:

static inline int page_level_shift(int level)
{
	return (PAGE_SHIFT - PTRS_PER_PTE_SHIFT) +
		level * PTRS_PER_PTE_SHIFT;
}
static inline unsigned long page_level_size(int level)
{
	return 1UL << page_level_shift(level);
}

... the resulting code has one lea instead of two:

0000000000000000 <plsize>:
    0:   8d 4c ff 03             lea    0x3(%rdi,%rdi,8),%ecx
    4:   b8 01 00 00 00          mov    $0x1,%eax
    9:   48 d3 e0                shl    %cl,%rax
    c:   c3                      retq
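
The arithmetic still checks out: with PTRS_PER_PTE_SHIFT == 9 on
x86-64, the base term is (12 - 9) == 3, giving 3 + 1*9 == 12,
3 + 2*9 == 21 and 3 + 3*9 == 30 for PG_LEVEL_4K/2M/1G.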

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

