* [RFCv2][PATCH 1/3] create slow_virt_to_phys()
@ 2012-12-07 21:30 Dave Hansen
From: Dave Hansen @ 2012-12-07 21:30 UTC
To: linux-kernel; +Cc: linux-mm, Gleb Natapov, Avi Kivity, Dave Hansen
This is necessary because __pa() does not work on some kinds of
memory, like vmalloc() or the alloc_remap() areas on 32-bit NUMA
systems.  We have some functions to do conversions _like_ this in
the vmalloc() code (like vmalloc_to_page()), but they do not work
on sizes other than 4k pages.  We would potentially need to be able
to handle all the page sizes that we use for the kernel linear
mapping (4k, 2M, 1G).

In practice, on 32-bit NUMA systems, the percpu areas get stuck in
the alloc_remap() area.  Because __pa() assumes its argument sits
in the kernel's direct linear mapping, any __pa() call on them
breaks and basically returns garbage.

This patch introduces a new function, slow_virt_to_phys(), which
walks the kernel page tables on x86.  It does precisely the same
logical thing as __pa(), but actually works on a wider range of
memory: the normal linear mapping, vmalloc(), kmap(), etc...
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1
linux-2.6.git-dave/arch/x86/mm/pageattr.c | 47 ++++++++++++++++
2 files changed, 48 insertions(+)
diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
--- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2012-12-07 16:25:16.317592189 -0500
+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2012-12-07 16:25:16.321592224 -0500
@@ -332,6 +332,7 @@ static inline void update_page_count(int
* as a pte too.
*/
extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern phys_addr_t slow_virt_to_phys(void *__address);
#endif /* !__ASSEMBLY__ */
diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
--- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2012-12-07 16:25:16.317592189 -0500
+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2012-12-07 16:28:20.675189758 -0500
@@ -364,6 +364,53 @@ pte_t *lookup_address(unsigned long addr
EXPORT_SYMBOL_GPL(lookup_address);
/*
+ * This is necessary because __pa() does not work on some
+ * kinds of memory, like vmalloc() or the alloc_remap()
+ * areas on 32-bit NUMA systems. The percpu areas can
+ * end up in this kind of memory, for instance.
+ *
+ * This could be optimized, but it is only intended to be
+ * used at initialization time, and keeping it
+ * unoptimized should increase the testing coverage for
+ * the more obscure platforms.
+ */
+phys_addr_t slow_virt_to_phys(void *__virt_addr)
+{
+	unsigned long virt_addr = (unsigned long)__virt_addr;
+	phys_addr_t phys_addr;
+	unsigned long offset;
+	unsigned int level = -1;
+	unsigned long psize = 0;
+	unsigned long pmask = 0;
+	pte_t *pte;
+
+	pte = lookup_address(virt_addr, &level);
+	BUG_ON(!pte);
+	switch (level) {
+	case PG_LEVEL_4K:
+		psize = PAGE_SIZE;
+		pmask = PAGE_MASK;
+		break;
+	case PG_LEVEL_2M:
+		psize = PMD_PAGE_SIZE;
+		pmask = PMD_PAGE_MASK;
+		break;
+#ifdef CONFIG_X86_64
+	case PG_LEVEL_1G:
+		psize = PUD_PAGE_SIZE;
+		pmask = PUD_PAGE_MASK;
+		break;
+#endif
+	default:
+		BUG();
+	}
+	offset = virt_addr & ~pmask;
+	phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
+	return (phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(slow_virt_to_phys);
+
+/*
* Set the new pmd in all the pgds we know about:
*/
static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
_
* [RFCv2][PATCH 2/3] fix kvm's use of __pa() on percpu areas
From: Dave Hansen @ 2012-12-07 21:30 UTC
To: linux-kernel; +Cc: linux-mm, Gleb Natapov, Avi Kivity, Dave Hansen
In short, it is illegal to call __pa() on an address holding a
percpu variable.  The times when this actually matters are pretty
obscure (certain 32-bit NUMA systems), but it _does_ happen.  It is
important to keep KVM guests working on these systems because the
real hardware is getting harder and harder to find.

This bug first manifested as a plain hang at boot after this
message:

	CPU 0 irqstacks, hard=f3018000 soft=f301a000

or, sometimes, it would actually make it out to the console:

	[ 0.000000] BUG: unable to handle kernel paging request at ffffffff

I eventually traced it down to the KVM async pagefault code.  This
can be worked around by disabling that code either at compile-time
or on the kernel command-line.

The KVM async pagefault code was injecting page faults into the
guest, which the guest misinterpreted because its "reason" was not
being properly delivered by the host.

The guest passes the physical address of a per-cpu async page fault
structure to the host via an MSR.  Since __pa() is broken on percpu
data, the physical address it sent was basically bogus and the host
went scribbling on random data.  The guest never saw the real
reason for the page fault (it was injected by the host), assumed
that the kernel had taken a _real_ page fault, and panic()'d.
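A condensed, illustrative sketch of the guest-side registration
this patch fixes (not the actual hunks below, which touch
kvm_register_steal_time(), kvm_guest_cpu_init() and
kvm_register_clock()):

	/* guest tells the host where its per-cpu apf_reason lives */
	u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED);

With __pa() in place of slow_virt_to_phys(), 'pa' points at the
wrong physical page whenever the percpu area sits in remapped
memory, and the host ends up writing the fault reason somewhere
random.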
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
---
linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++----
linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++--
2 files changed, 7 insertions(+), 6 deletions(-)
diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c
--- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2012-11-30 16:09:03.562055643 -0500
+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2012-11-30 16:09:03.586055841 -0500
@@ -284,9 +284,9 @@ static void kvm_register_steal_time(void
memset(st, 0, sizeof(*st));
- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
- cpu, __pa(st));
+ cpu, slow_virt_to_phys(st));
}
static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
@@ -311,7 +311,7 @@ void __cpuinit kvm_guest_cpu_init(void)
return;
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
- u64 pa = __pa(&__get_cpu_var(apf_reason));
+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
#ifdef CONFIG_PREEMPT
pa |= KVM_ASYNC_PF_SEND_ALWAYS;
@@ -327,7 +327,8 @@ void __cpuinit kvm_guest_cpu_init(void)
/* Size alignment is implied but just to make it explicit. */
BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
__get_cpu_var(kvm_apic_eoi) = 0;
- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
+ | KVM_MSR_ENABLED;
wrmsrl(MSR_KVM_PV_EOI_EN, pa);
}
diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c
--- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2012-11-30 16:09:03.562055643 -0500
+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2012-11-30 16:09:03.586055841 -0500
@@ -142,8 +142,8 @@ int kvm_register_clock(char *txt)
int cpu = smp_processor_id();
int low, high, ret;
- low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
- high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+ low = (int)slow_virt_to_phys(&per_cpu(hv_clock, cpu)) | 1;
+ high = ((u64)slow_virt_to_phys(&per_cpu(hv_clock, cpu)) >> 32);
ret = native_write_msr_safe(msr_kvm_system_time, low, high);
printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
cpu, high, low, txt);
diff -puN include/linux/percpu-defs.h~fix-kvm-__pa-use-on-percpu-areas include/linux/percpu-defs.h
diff -puN mm/percpu-vm.c~fix-kvm-__pa-use-on-percpu-areas mm/percpu-vm.c
diff -puN mm/percpu.c~fix-kvm-__pa-use-on-percpu-areas mm/percpu.c
diff -puN include/linux/percpu.h~fix-kvm-__pa-use-on-percpu-areas include/linux/percpu.h
diff -puN include/asm-generic/percpu.h~fix-kvm-__pa-use-on-percpu-areas include/asm-generic/percpu.h
diff -puN arch/x86/include/asm/segment.h~fix-kvm-__pa-use-on-percpu-areas arch/x86/include/asm/segment.h
diff -puN arch/x86/kernel/entry_32.S~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/entry_32.S
diff -puN drivers/base/cpu.c~fix-kvm-__pa-use-on-percpu-areas drivers/base/cpu.c
_
* [RFCv2][PATCH 3/3] make DEBUG_VIRTUAL work earlier in boot
From: Dave Hansen @ 2012-12-07 21:30 UTC
To: linux-kernel; +Cc: linux-mm, Gleb Natapov, Avi Kivity, Dave Hansen
The KVM code has some repeated bugs in it around use of __pa() on
per-cpu data.  That data is not in an area on which __pa() is
valid.  However, the calls also happen early enough in boot that
__vmalloc_start_set is not yet set, and thus the
CONFIG_DEBUG_VIRTUAL debugging does not catch them.

This adds a check to also verify addresses against max_low_pfn,
which we can use earlier in boot than is_vmalloc_addr().  However,
if we are super-early in boot, max_low_pfn=0 and this would trip on
every call, so also make sure that max_low_pfn has actually been
set.

With this patch applied, CONFIG_DEBUG_VIRTUAL actually catches the
bug I was chasing.

I'd love to find a generic way so that any __pa() call on percpu
areas could do a BUG_ON(), but there don't appear to be any nice
and easy ways to check whether an address is a percpu one.  Anybody
have ideas on a way to do this?
---
linux-2.6.git-dave/arch/x86/mm/numa.c | 2 +-
linux-2.6.git-dave/arch/x86/mm/pat.c | 4 ++--
linux-2.6.git-dave/arch/x86/mm/physaddr.c | 10 ++++++++--
3 files changed, 11 insertions(+), 5 deletions(-)
diff -puN arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/physaddr.c
--- linux-2.6.git/arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2012-11-30 16:18:44.522847232 -0500
+++ linux-2.6.git-dave/arch/x86/mm/physaddr.c 2012-11-30 16:18:44.530847298 -0500
@@ -1,3 +1,4 @@
+#include <linux/bootmem.h>
#include <linux/mmdebug.h>
#include <linux/module.h>
#include <linux/mm.h>
@@ -41,16 +42,21 @@ bool __virt_addr_valid(unsigned long x)
return pfn_valid(x >> PAGE_SHIFT);
}
EXPORT_SYMBOL(__virt_addr_valid);
-
#else
#ifdef CONFIG_DEBUG_VIRTUAL
unsigned long __phys_addr(unsigned long x)
{
+ unsigned long phys_addr = x - PAGE_OFFSET;
/* VMALLOC_* aren't constants */
VIRTUAL_BUG_ON(x < PAGE_OFFSET);
VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x));
- return x - PAGE_OFFSET;
+ /* max_low_pfn is set early, but not _that_ early */
+ if (max_low_pfn) {
+ VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
+ BUG_ON(slow_virt_to_phys((void *)x) != phys_addr);
+ }
+ return phys_addr;
}
EXPORT_SYMBOL(__phys_addr);
#endif
diff -puN arch/x86/kernel/kvmclock.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/kernel/kvmclock.c
diff -L sr -puN /dev/null /dev/null
diff -puN arch/x86/include/asm/page_32.h~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/include/asm/page_32.h
diff -puN arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/numa.c
--- linux-2.6.git/arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2012-11-30 16:18:44.526847265 -0500
+++ linux-2.6.git-dave/arch/x86/mm/numa.c 2012-11-30 16:18:44.534847331 -0500
@@ -219,7 +219,7 @@ static void __init setup_node_data(int n
*/
nd = alloc_remap(nid, nd_size);
if (nd) {
- nd_pa = __pa(nd);
+ nd_pa = __phys_addr_nodebug(nd);
remapped = true;
} else {
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
diff -puN arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/pat.c
--- linux-2.6.git/arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2012-11-30 16:19:34.371258739 -0500
+++ linux-2.6.git-dave/arch/x86/mm/pat.c 2012-11-30 16:22:38.528778740 -0500
@@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, un
{
unsigned long id_sz;
- if (base >= __pa(high_memory))
+ if (base > __pa(high_memory-1))
return 0;
- id_sz = (__pa(high_memory) < base + size) ?
+ id_sz = (__pa(high_memory-1) <= base + size) ?
__pa(high_memory) - base :
size;
_
* Re: [RFCv2][PATCH 1/3] create slow_virt_to_phys()
From: Gleb Natapov @ 2012-12-09 14:06 UTC
To: Dave Hansen
Cc: linux-kernel, linux-mm, Andrew Morton, Rik van Riel,
Andrea Arcangeli, mtosatti
Copying more people.  Is this approach good?  The alternative would
be to allocate NR_CPUS-sized arrays in KVM.
On Fri, Dec 07, 2012 at 04:30:23PM -0500, Dave Hansen wrote:
>
> This is necessary because __pa() does not work on some kinds of
> memory, like vmalloc() or the alloc_remap() areas on 32-bit
> NUMA systems. We have some functions to do conversions _like_
> this in the vmalloc() code (like vmalloc_to_page()), but they
> do not work on sizes other than 4k pages. We would potentially
> need to be able to handle all the page sizes that we use for
> the kernel linear mapping (4k, 2M, 1G).
>
> In practice, on 32-bit NUMA systems, the percpu areas get stuck
> in the alloc_remap() area. Any __pa() call on them will break
> and basically return garbage.
>
> This patch introduces a new function slow_virt_to_phys(), which
> walks the kernel page tables on x86 and should do precisely
> the same logical thing as __pa(), but actually work on a wider
> range of memory. It should work on the normal linear mapping,
> vmalloc(), kmap(), etc...
>
>
> Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
> ---
>
> linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1
> linux-2.6.git-dave/arch/x86/mm/pageattr.c | 47 ++++++++++++++++
> 2 files changed, 48 insertions(+)
>
> diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h
> --- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2012-12-07 16:25:16.317592189 -0500
> +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2012-12-07 16:25:16.321592224 -0500
> @@ -332,6 +332,7 @@ static inline void update_page_count(int
> * as a pte too.
> */
> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern phys_addr_t slow_virt_to_phys(void *__address);
>
> #endif /* !__ASSEMBLY__ */
>
> diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c
> --- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2012-12-07 16:25:16.317592189 -0500
> +++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2012-12-07 16:28:20.675189758 -0500
> @@ -364,6 +364,53 @@ pte_t *lookup_address(unsigned long addr
> EXPORT_SYMBOL_GPL(lookup_address);
>
> /*
> + * This is necessary because __pa() does not work on some
> + * kinds of memory, like vmalloc() or the alloc_remap()
> + * areas on 32-bit NUMA systems. The percpu areas can
> + * end up in this kind of memory, for instance.
> + *
> + * This could be optimized, but it is only intended to be
> + * used at inititalization time, and keeping it
> + * unoptimized should increase the testing coverage for
> + * the more obscure platforms.
> + */
> +phys_addr_t slow_virt_to_phys(void *__virt_addr)
> +{
> + unsigned long virt_addr = (unsigned long)__virt_addr;
> + phys_addr_t phys_addr;
> + unsigned long offset;
> + unsigned int level = -1;
> + unsigned long psize = 0;
> + unsigned long pmask = 0;
> + pte_t *pte;
> +
> + pte = lookup_address(virt_addr, &level);
> + BUG_ON(!pte);
> + switch (level) {
> + case PG_LEVEL_4K:
> + psize = PAGE_SIZE;
> + pmask = PAGE_MASK;
> + break;
> + case PG_LEVEL_2M:
> + psize = PMD_PAGE_SIZE;
> + pmask = PMD_PAGE_MASK;
> + break;
> +#ifdef CONFIG_X86_64
> + case PG_LEVEL_1G:
> + psize = PUD_PAGE_SIZE;
> + pmask = PUD_PAGE_MASK;
> + break;
> +#endif
> + default:
> + BUG();
> + }
> + offset = virt_addr & ~pmask;
> + phys_addr = pte_pfn(*pte) << PAGE_SHIFT;
> + return (phys_addr | offset);
> +}
> +EXPORT_SYMBOL_GPL(slow_virt_to_phys);
> +
> +/*
> * Set the new pmd in all the pgds we know about:
> */
> static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
> _
--
Gleb.