From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx107.postini.com [74.125.245.107]) by kanga.kvack.org (Postfix) with SMTP id 23DAA6B0006 for ; Mon, 21 Jan 2013 12:52:51 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 21 Jan 2013 12:52:49 -0500 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp03.pok.ibm.com (Postfix) with ESMTP id 8B315C90058 for ; Mon, 21 Jan 2013 12:52:46 -0500 (EST) Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqktN341218 for ; Mon, 21 Jan 2013 12:52:46 -0500 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqkns030222 for ; Mon, 21 Jan 2013 12:52:46 -0500 Subject: [PATCH 0/5] fix illegal use of __pa() in KVM code From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:44 -0800 Message-Id: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen This series fixes a hard-to-debug early boot hang on 32-bit NUMA systems. It adds coverage to the debugging code, adds some helpers, and eventually fixes the original bug I was hitting. [v2] * Moved DEBUG_VIRTUAL patch earlier in the series (it has no dependencies on anything else and stands on its own). * Created page_level_*() helpers to replace a nasty switch() at hpa's suggestion -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx161.postini.com [74.125.245.161]) by kanga.kvack.org (Postfix) with SMTP id 9155D6B0009 for ; Mon, 21 Jan 2013 12:52:54 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 21 Jan 2013 12:52:53 -0500 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp03.pok.ibm.com (Postfix) with ESMTP id B95FBC9001C for ; Mon, 21 Jan 2013 12:52:49 -0500 (EST) Received: from d03av02.boulder.ibm.com (d03av02.boulder.ibm.com [9.17.195.168]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqmkZ193418 for ; Mon, 21 Jan 2013 12:52:49 -0500 Received: from d03av02.boulder.ibm.com (loopback [127.0.0.1]) by d03av02.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqlCR007850 for ; Mon, 21 Jan 2013 10:52:47 -0700 Subject: [PATCH 2/5] pagetable level size/shift/mask helpers From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:46 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175246.6B215415@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen I plan to use lookup_address() to walk the kernel pagetables in a later patch. It returns a "pte" and the level in the pagetables where the "pte" was found. The level is just an enum and needs to be converted to a useful value in order to do address calculations with it. These helpers will be used in at least two places. This also gives the anonymous enum a real name so that no one gets confused about what they should be passing in to these helpers. 
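As an illustration of the level-to-value conversion described here, a minimal userspace sketch of the arithmetic follows. It assumes 4k base pages and 512-entry page tables (x86-64 or 32-bit PAE). The hard-coded constants, the PG_LEVEL_1G enumerator placement, and the use of unsigned long long are assumptions of the sketch, not the kernel's own definitions.

```c
#include <assert.h>

/* Assumed constants for the sketch: 4k pages, 512 entries per table. */
#define PAGE_SHIFT   12
#define PTRS_PER_PTE 512
#define PTE_SHIFT    9	/* stands in for ilog2(PTRS_PER_PTE) */

static_assert((1 << PTE_SHIFT) == PTRS_PER_PTE,
	      "PTE_SHIFT must be log2 of PTRS_PER_PTE");

enum pg_level {
	PG_LEVEL_NONE,
	PG_LEVEL_4K,
	PG_LEVEL_2M,
	PG_LEVEL_1G,
};

static inline int page_level_shift(enum pg_level level)
{
	/*
	 * PG_LEVEL_4K is 1, not 0, so one PTE_SHIFT is subtracted up
	 * front to make the 4k level land exactly on PAGE_SHIFT.
	 */
	return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT;
}

static inline unsigned long long page_level_size(enum pg_level level)
{
	/* Bytes covered by one entry at this level. */
	return 1ULL << page_level_shift(level);
}

static inline unsigned long long page_level_mask(enum pg_level level)
{
	/* Clears the in-page offset bits for this level. */
	return ~(page_level_size(level) - 1);
}
```

With these assumed constants, PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G yield shifts of 12, 21 and 30.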
"PTE_SHIFT" was chosen for naming consistency with the other pagetable levels (PGD/PUD/PMD_SHIFT). Cc: H. Peter Anvin Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/include/asm/pgtable.h | 14 ++++++++++++++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 2 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff -puN arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable.h --- linux-2.6.git/arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers 2013-01-17 10:22:25.958428542 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable.h 2013-01-17 10:22:25.962428578 -0800 @@ -390,6 +390,7 @@ pte_t *populate_extra_pte(unsigned long #ifndef __ASSEMBLY__ #include +#include static inline int pte_none(pte_t pte) { @@ -781,6 +782,19 @@ static inline void clone_pgd_range(pgd_t memcpy(dst, src, count * sizeof(pgd_t)); } +#define PTE_SHIFT ilog2(PTRS_PER_PTE) +static inline int page_level_shift(enum pg_level level) +{ + return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT; +} +static inline unsigned long page_level_size(enum pg_level level) +{ + return 1UL << page_level_shift(level); +} +static inline unsigned long page_level_mask(enum pg_level level) +{ + return ~(page_level_size(level) - 1); +} #include #endif /* __ASSEMBLY__ */ diff -puN arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable_types.h --- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers 2013-01-17 10:22:25.958428542 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 10:22:25.966428612 -0800 @@ -331,7 +331,7 @@ extern void native_pagetable_init(void); struct seq_file; extern void arch_report_meminfo(struct seq_file *m); -enum { +enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id 1E1BB6B000C for ; Mon, 21 Jan 2013 12:52:58 -0500 (EST) Received: from /spool/local by e36.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 21 Jan 2013 10:52:56 -0700 Received: from d03relay05.boulder.ibm.com (d03relay05.boulder.ibm.com [9.17.195.107]) by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 3D6B71FF003F for ; Mon, 21 Jan 2013 10:52:40 -0700 (MST) Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay05.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqotC131996 for ; Mon, 21 Jan 2013 10:52:50 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqno3025493 for ; Mon, 21 Jan 2013 10:52:49 -0700 Subject: [PATCH 4/5] create slow_virt_to_phys() From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:49 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen This is necessary because __pa() does not work on some kinds of memory, like vmalloc() or the alloc_remap() areas on 32-bit NUMA systems. We have some functions to do conversions _like_ this in the vmalloc() code (like vmalloc_to_page()), but they do not work on sizes other than 4k pages. We would potentially need to be able to handle all the page sizes that we use for the kernel linear mapping (4k, 2M, 1G). In practice, on 32-bit NUMA systems, the percpu areas get stuck in the alloc_remap() area. 
Any __pa() call on them will break and basically return garbage. This patch introduces a new function slow_virt_to_phys(), which walks the kernel page tables on x86 and should do precisely the same logical thing as __pa(), but actually work on a wider range of memory. It should work on the normal linear mapping, vmalloc(), kmap(), etc... Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1 linux-2.6.git-dave/arch/x86/mm/pageattr.c | 31 ++++++++++++++++ 2 files changed, 32 insertions(+) diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h --- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2013-01-17 10:22:26.590434129 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 10:22:26.598434199 -0800 @@ -352,6 +352,7 @@ static inline void update_page_count(int * as a pte too. */ extern pte_t *lookup_address(unsigned long address, unsigned int *level); +extern phys_addr_t slow_virt_to_phys(void *__address); #endif /* !__ASSEMBLY__ */ diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c --- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2013-01-17 10:22:26.594434163 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 10:22:26.598434199 -0800 @@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr EXPORT_SYMBOL_GPL(lookup_address); /* + * This is necessary because __pa() does not work on some + * kinds of memory, like vmalloc() or the alloc_remap() + * areas on 32-bit NUMA systems. The percpu areas can + * end up in this kind of memory, for instance. + * + * This could be optimized, but it is only intended to be + * used at initialization time, and keeping it + * unoptimized should increase the testing coverage for + * the more obscure platforms.
+ */ +phys_addr_t slow_virt_to_phys(void *__virt_addr) +{ + unsigned long virt_addr = (unsigned long)__virt_addr; + phys_addr_t phys_addr; + unsigned long offset; + unsigned int level = -1; + unsigned long psize = 0; + unsigned long pmask = 0; + pte_t *pte; + + pte = lookup_address(virt_addr, &level); + BUG_ON(!pte); + psize = page_level_size(level); + pmask = page_level_mask(level); + offset = virt_addr & ~pmask; + phys_addr = pte_pfn(*pte) << PAGE_SHIFT; + return (phys_addr | offset); +} +EXPORT_SYMBOL_GPL(slow_virt_to_phys); + +/* * Set the new pmd in all the pgds we know about: */ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx130.postini.com [74.125.245.130]) by kanga.kvack.org (Postfix) with SMTP id BE0B86B000E for ; Mon, 21 Jan 2013 12:52:59 -0500 (EST) Received: from /spool/local by e35.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Mon, 21 Jan 2013 10:52:58 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by d03dlp03.boulder.ibm.com (Postfix) with ESMTP id 3207619D8045 for ; Mon, 21 Jan 2013 10:52:55 -0700 (MST) Received: from d03av03.boulder.ibm.com (d03av03.boulder.ibm.com [9.17.195.169]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqqOL262946 for ; Mon, 21 Jan 2013 10:52:54 -0700 Received: from d03av03.boulder.ibm.com (loopback [127.0.0.1]) by d03av03.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqpgE012014 for ; Mon, 21 Jan 2013 10:52:52 -0700 Subject: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:50 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175250.1AAC7981@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen In short, it is illegal to call __pa() on an address holding a percpu variable. The times when this actually matters are pretty obscure (certain 32-bit NUMA systems), but it _does_ happen. It is important to keep KVM guests working on these systems because the real hardware is getting harder and harder to find. This bug manifested first by me seeing a plain hang at boot after this message: CPU 0 irqstacks, hard=f3018000 soft=f301a000 or, sometimes, it would actually make it out to the console: [ 0.000000] BUG: unable to handle kernel paging request at ffffffff I eventually traced it down to the KVM async pagefault code. This can be worked around by disabling that code either at compile-time, or on the kernel command-line. 
The kvm async pagefault code was injecting page faults into the guest which the guest misinterpreted because its "reason" was not being properly sent from the host. The guest passes a physical address of a per-cpu async page fault structure via an MSR to the host. Since __pa() is broken on percpu data, the physical address it sent was basically bogus and the host went scribbling on random data. The guest never saw the real reason for the page fault (it was injected by the host), assumed that the kernel had taken a _real_ page fault, and panic()'d. The behavior varied, though, depending on what got corrupted by the bad write. Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 10:22:26.914436992 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 10:22:26.922437062 -0800 @@ -289,9 +289,9 @@ static void kvm_register_steal_time(void memset(st, 0, sizeof(*st)); - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED)); printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", - cpu, __pa(st)); + cpu, slow_virt_to_phys(st)); } static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED; @@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) return; if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { - u64 pa = __pa(&__get_cpu_var(apf_reason)); + u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); #ifdef CONFIG_PREEMPT pa |= KVM_ASYNC_PF_SEND_ALWAYS; @@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) /* Size alignment is implied but just to make it explicit.
*/ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); __get_cpu_var(kvm_apic_eoi) = 0; - pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; + pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) + | KVM_MSR_ENABLED; wrmsrl(MSR_KVM_PV_EOI_EN, pa); } diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 10:22:26.918437028 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 10:22:26.922437062 -0800 @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) int low, high, ret; struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; - low = (int)__pa(src) | 1; - high = ((u64)__pa(src) >> 32); + low = (int)slow_virt_to_phys(src) | 1; + high = ((u64)slow_virt_to_phys(src) >> 32); ret = native_write_msr_safe(msr_kvm_system_time, low, high); printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx204.postini.com [74.125.245.204]) by kanga.kvack.org (Postfix) with SMTP id 63CE86B0010 for ; Mon, 21 Jan 2013 12:53:00 -0500 (EST) Received: from /spool/local by e7.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Mon, 21 Jan 2013 12:52:59 -0500 Received: from d01relay03.pok.ibm.com (d01relay03.pok.ibm.com [9.56.227.235]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id E106638C801C for ; Mon, 21 Jan 2013 12:52:48 -0500 (EST) Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay03.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqm9w310888 for ; Mon, 21 Jan 2013 12:52:48 -0500 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqm77011768 for ; Mon, 21 Jan 2013 15:52:48 -0200 Subject: [PATCH 3/5] use new pagetable helpers in try_preserve_large_page() From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:47 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175247.76641034@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen try_preserve_large_page() can be slightly simplified by using the new page_level_*() helpers. 
Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/mm/pageattr.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff -puN arch/x86/mm/pageattr.c~use-new-pagetable-helpers arch/x86/mm/pageattr.c --- linux-2.6.git/arch/x86/mm/pageattr.c~use-new-pagetable-helpers 2013-01-17 10:22:26.282431407 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 10:22:26.286431442 -0800 @@ -412,15 +412,12 @@ try_preserve_large_page(pte_t *kpte, uns switch (level) { case PG_LEVEL_2M: - psize = PMD_PAGE_SIZE; - pmask = PMD_PAGE_MASK; - break; #ifdef CONFIG_X86_64 case PG_LEVEL_1G: - psize = PUD_PAGE_SIZE; - pmask = PUD_PAGE_MASK; - break; #endif + psize = page_level_size(level); + pmask = page_level_mask(level); + break; default: do_split = -EINVAL; goto out_unlock; _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx164.postini.com [74.125.245.164]) by kanga.kvack.org (Postfix) with SMTP id EFACF6B0012 for ; Mon, 21 Jan 2013 12:53:00 -0500 (EST) Received: from /spool/local by e36.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Mon, 21 Jan 2013 10:53:00 -0700 Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by d03dlp03.boulder.ibm.com (Postfix) with ESMTP id 34D7719D8048 for ; Mon, 21 Jan 2013 10:52:55 -0700 (MST) Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LHqp0x167752 for ; Mon, 21 Jan 2013 10:52:52 -0700 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LHqkAU025212 for ; Mon, 21 Jan 2013 10:52:47 -0700 Subject: [PATCH 1/5] make DEBUG_VIRTUAL work earlier in boot From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:45 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175245.3081B2B1@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen The KVM code has some repeated bugs in it around use of __pa() on per-cpu data. Those data are not in an area on which using __pa() is valid. However, they are also called early enough in boot that __vmalloc_start_set is not set, and thus the CONFIG_DEBUG_VIRTUAL debugging does not catch them. This adds a check to also verify __pa() calls against max_low_pfn, which we can use earlier in boot than is_vmalloc_addr(). However, if we are super-early in boot, max_low_pfn=0 and this will trip on every call, so also make sure that max_low_pfn is set before we try to use it. With this patch applied, CONFIG_DEBUG_VIRTUAL will actually catch the bug I was chasing (and fix later in this series).
I'd love to find a generic way so that any __pa() call on percpu areas could do a BUG_ON(), but there don't appear to be any nice and easy ways to check if an address is a percpu one. Anybody have ideas on a way to do this? Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/mm/numa.c | 2 +- linux-2.6.git-dave/arch/x86/mm/pat.c | 4 ++-- linux-2.6.git-dave/arch/x86/mm/physaddr.c | 9 ++++++++- 3 files changed, 11 insertions(+), 4 deletions(-) diff -puN arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/numa.c --- linux-2.6.git/arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.614425502 -0800 +++ linux-2.6.git-dave/arch/x86/mm/numa.c 2013-01-17 10:22:25.622425572 -0800 @@ -219,7 +219,7 @@ static void __init setup_node_data(int n */ nd = alloc_remap(nid, nd_size); if (nd) { - nd_pa = __pa(nd); + nd_pa = __phys_addr_nodebug(nd); remapped = true; } else { nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid); diff -puN arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/pat.c --- linux-2.6.git/arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.614425502 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pat.c 2013-01-17 10:22:25.622425572 -0800 @@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, un { unsigned long id_sz; - if (base >= __pa(high_memory)) + if (base > __pa(high_memory-1)) return 0; - id_sz = (__pa(high_memory) < base + size) ? + id_sz = (__pa(high_memory-1) <= base + size) ? 
__pa(high_memory) - base : size; diff -puN arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/physaddr.c --- linux-2.6.git/arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.618425536 -0800 +++ linux-2.6.git-dave/arch/x86/mm/physaddr.c 2013-01-17 10:22:25.622425572 -0800 @@ -1,3 +1,4 @@ +#include #include #include #include @@ -47,10 +48,16 @@ EXPORT_SYMBOL(__virt_addr_valid); #ifdef CONFIG_DEBUG_VIRTUAL unsigned long __phys_addr(unsigned long x) { + unsigned long phys_addr = x - PAGE_OFFSET; /* VMALLOC_* aren't constants */ VIRTUAL_BUG_ON(x < PAGE_OFFSET); VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x)); - return x - PAGE_OFFSET; + /* max_low_pfn is set early, but not _that_ early */ + if (max_low_pfn) { + VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn); + BUG_ON(slow_virt_to_phys((void *)x) != phys_addr); + } + return phys_addr; } EXPORT_SYMBOL(__phys_addr); #endif _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx201.postini.com [74.125.245.201]) by kanga.kvack.org (Postfix) with SMTP id 0AE976B0006 for ; Mon, 21 Jan 2013 13:08:33 -0500 (EST) In-Reply-To: <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 4/5] create slow_virt_to_phys() From: "H. 
Peter Anvin" Date: Mon, 21 Jan 2013 12:08:18 -0600 Message-ID: <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Why are you initializing psize/pmask? Dave Hansen wrote: > >This is necessary because __pa() does not work on some kinds of >memory, like vmalloc() or the alloc_remap() areas on 32-bit >NUMA systems. We have some functions to do conversions _like_ >this in the vmalloc() code (like vmalloc_to_page()), but they >do not work on sizes other than 4k pages. We would potentially >need to be able to handle all the page sizes that we use for >the kernel linear mapping (4k, 2M, 1G). > >In practice, on 32-bit NUMA systems, the percpu areas get stuck >in the alloc_remap() area. Any __pa() call on them will break >and basically return garbage. > >This patch introduces a new function slow_virt_to_phys(), which >walks the kernel page tables on x86 and should do precisely >the same logical thing as __pa(), but actually work on a wider >range of memory. It should work on the normal linear mapping, >vmalloc(), kmap(), etc... > >Signed-off-by: Dave Hansen >Acked-by: Rik van Riel >--- > > linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1 >linux-2.6.git-dave/arch/x86/mm/pageattr.c | 31 >++++++++++++++++ > 2 files changed, 32 insertions(+) > >diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys >arch/x86/include/asm/pgtable_types.h >--- >linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2013-01-17 >10:22:26.590434129 -0800 >+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 >10:22:26.598434199 -0800 >@@ -352,6 +352,7 @@ static inline void update_page_count(int > * as a pte too. 
> */ >extern pte_t *lookup_address(unsigned long address, unsigned int >*level); >+extern phys_addr_t slow_virt_to_phys(void *__address); > > #endif /* !__ASSEMBLY__ */ > >diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys >arch/x86/mm/pageattr.c >--- >linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2013-01-17 >10:22:26.594434163 -0800 >+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 >10:22:26.598434199 -0800 >@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr > EXPORT_SYMBOL_GPL(lookup_address); > > /* >+ * This is necessary because __pa() does not work on some >+ * kinds of memory, like vmalloc() or the alloc_remap() >+ * areas on 32-bit NUMA systems. The percpu areas can >+ * end up in this kind of memory, for instance. >+ * >+ * This could be optimized, but it is only intended to be >+ * used at inititalization time, and keeping it >+ * unoptimized should increase the testing coverage for >+ * the more obscure platforms. >+ */ >+phys_addr_t slow_virt_to_phys(void *__virt_addr) >+{ >+ unsigned long virt_addr = (unsigned long)__virt_addr; >+ phys_addr_t phys_addr; >+ unsigned long offset; >+ unsigned int level = -1; >+ unsigned long psize = 0; >+ unsigned long pmask = 0; >+ pte_t *pte; >+ >+ pte = lookup_address(virt_addr, &level); >+ BUG_ON(!pte); >+ psize = page_level_size(level); >+ pmask = page_level_mask(level); >+ offset = virt_addr & ~pmask; >+ phys_addr = pte_pfn(*pte) << PAGE_SHIFT; >+ return (phys_addr | offset); >+} >+EXPORT_SYMBOL_GPL(slow_virt_to_phys); >+ >+/* > * Set the new pmd in all the pgds we know about: > */ >static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t >pte) >_ -- Sent from my mobile phone. Please excuse brevity and lack of formatting. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx105.postini.com [74.125.245.105]) by kanga.kvack.org (Postfix) with SMTP id C23EA6B0004 for ; Mon, 21 Jan 2013 13:21:02 -0500 (EST) Received: from /spool/local by e8.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Mon, 21 Jan 2013 13:21:01 -0500 Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by d01dlp03.pok.ibm.com (Postfix) with ESMTP id C6DC3C90048 for ; Mon, 21 Jan 2013 13:18:48 -0500 (EST) Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LIIkOj236350 for ; Mon, 21 Jan 2013 13:18:46 -0500 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LIIYPR018220 for ; Mon, 21 Jan 2013 11:18:36 -0700 Message-ID: <50FD8676.9090203@linux.vnet.ibm.com> Date: Mon, 21 Jan 2013 10:18:30 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 4/5] create slow_virt_to_phys() References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> In-Reply-To: <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "H. Peter Anvin" Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel On 01/21/2013 10:08 AM, H. Peter Anvin wrote: > Why are you initializing psize/pmask? It's an artifact from the switch() that was there before. I'll clean it up. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
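For readers following the psize/pmask exchange above: once the leftover initializers are gone, the address arithmetic in slow_virt_to_phys() comes down to one mask and one OR. The following is a userspace sketch of that arithmetic only, not the cleaned-up patch. `struct mapping` and `model_virt_to_phys()` are hypothetical names standing in for what lookup_address() and pte_pfn() provide in the kernel, and the constants assume 4k pages with 512-entry tables. The sketch also does the PAGE_SHIFT shift on a 64-bit value, which is a choice made here and may not match the widths of the kernel types involved.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants for the sketch: 4k pages, 512 entries per table. */
#define PAGE_SHIFT 12
#define PTE_SHIFT  9

enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

/*
 * Hypothetical stand-in for a page table walk result: the frame number
 * stored in the entry and the level at which the entry was found.
 */
struct mapping {
	uint64_t pfn;
	enum pg_level level;
};

static uint64_t page_level_mask(enum pg_level level)
{
	int shift = (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT;

	return ~((UINT64_C(1) << shift) - 1);
}

static uint64_t model_virt_to_phys(uint64_t virt_addr, struct mapping m)
{
	uint64_t pmask, offset, phys_addr;

	/* Mirrors the BUG_ON(!pte): an unmapped address is a caller bug. */
	assert(m.level != PG_LEVEL_NONE);

	pmask = page_level_mask(m.level);
	offset = virt_addr & ~pmask;		/* bits inside the page */
	phys_addr = m.pfn << PAGE_SHIFT;	/* start of the page */
	return phys_addr | offset;
}
```

For example, a virtual address ending in 0x234567 that is mapped by a 2M page starting at frame 0x40000 keeps the low 21 bits (0x34567) as its offset, giving physical address 0x40034567.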
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx203.postini.com [74.125.245.203]) by kanga.kvack.org (Postfix) with SMTP id 72BD06B0002 for ; Mon, 21 Jan 2013 13:54:34 -0500 (EST) In-Reply-To: <20130121175250.1AAC7981@kernel.stglabs.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: "H. Peter Anvin" Date: Mon, 21 Jan 2013 12:38:06 -0600 Message-ID: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen , linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Final question: are any of these done in frequent paths? (I believe no, but...) Dave Hansen wrote: > >In short, it is illegal to call __pa() on an address holding >a percpu variable. The times when this actually matters are >pretty obscure (certain 32-bit NUMA systems), but it _does_ >happen. It is important to keep KVM guests working on these >systems because the real hardware is getting harder and >harder to find. > >This bug manifested first by me seeing a plain hang at boot >after this message: > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > >or, sometimes, it would actually make it out to the console: > >[ 0.000000] BUG: unable to handle kernel paging request at ffffffff > >I eventually traced it down to the KVM async pagefault code. >This can be worked around by disabling that code either at >compile-time, or on the kernel command-line. > >The kvm async pagefault code was injecting page faults in >to the guest which the guest misinterpreted because its >"reason" was not being properly sent from the host. 
> >The guest passes a physical address of an per-cpu async page >fault structure via an MSR to the host. Since __pa() is >broken on percpu data, the physical address it sent was >bascially bogus and the host went scribbling on random data. >The guest never saw the real reason for the page fault (it >was injected by the host), assumed that the kernel had taken >a _real_ page fault, and panic()'d. The behavior varied, >though, depending on what got corrupted by the bad write. > >Signed-off-by: Dave Hansen >Acked-by: Rik van Riel >--- > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > 2 files changed, 7 insertions(+), 6 deletions(-) > >diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas >arch/x86/kernel/kvm.c >--- >linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 >10:22:26.914436992 -0800 >+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 >10:22:26.922437062 -0800 >@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void > > memset(st, 0, sizeof(*st)); > >- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); >+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | >KVM_MSR_ENABLED)); > printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", >- cpu, __pa(st)); >+ cpu, slow_virt_to_phys(st)); > } > >static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = >KVM_PV_EOI_DISABLED; >@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) > return; > > if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { >- u64 pa = __pa(&__get_cpu_var(apf_reason)); >+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); > > #ifdef CONFIG_PREEMPT > pa |= KVM_ASYNC_PF_SEND_ALWAYS; >@@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) > /* Size alignment is implied but just to make it explicit. 
*/ > BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); > __get_cpu_var(kvm_apic_eoi) = 0; >- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; >+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) >+ | KVM_MSR_ENABLED; > wrmsrl(MSR_KVM_PV_EOI_EN, pa); > } > >diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas >arch/x86/kernel/kvmclock.c >--- >linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 >10:22:26.918437028 -0800 >+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 >10:22:26.922437062 -0800 >@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) > int low, high, ret; > struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; > >- low = (int)__pa(src) | 1; >- high = ((u64)__pa(src) >> 32); >+ low = (int)slow_virt_to_phys(src) | 1; >+ high = ((u64)slow_virt_to_phys(src) >> 32); > ret = native_write_msr_safe(msr_kvm_system_time, low, high); > printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", > cpu, high, low, txt); >_ -- Sent from my mobile phone. Please excuse brevity and lack of formatting. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx181.postini.com [74.125.245.181]) by kanga.kvack.org (Postfix) with SMTP id 176226B0002 for ; Mon, 21 Jan 2013 13:59:51 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! 
Violators will be prosecuted for from ; Mon, 21 Jan 2013 13:59:50 -0500 Received: from d01relay01.pok.ibm.com (d01relay01.pok.ibm.com [9.56.227.233]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id E3FED38C8045 for ; Mon, 21 Jan 2013 13:59:47 -0500 (EST) Received: from d01av02.pok.ibm.com (d01av02.pok.ibm.com [9.56.224.216]) by d01relay01.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0LIxlMn271840 for ; Mon, 21 Jan 2013 13:59:47 -0500 Received: from d01av02.pok.ibm.com (loopback [127.0.0.1]) by d01av02.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0LIxkot026589 for ; Mon, 21 Jan 2013 16:59:47 -0200 Message-ID: <50FD901C.8000002@linux.vnet.ibm.com> Date: Mon, 21 Jan 2013 10:59:40 -0800 From: Dave Hansen MIME-Version: 1.0 Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> In-Reply-To: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: "H. Peter Anvin" Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel On 01/21/2013 10:38 AM, H. Peter Anvin wrote: > Final question: are any of these done in frequent paths? (I believe no, but...) Nope. All of the places that it gets used here are in initialization-time paths. The two we have here are when kvm and the host are setting up a new vcpu and when the kvmclock clocksource is being registered. A CPU getting hotplugged is the only thing that might even have these get called more than at boot. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx152.postini.com [74.125.245.152]) by kanga.kvack.org (Postfix) with SMTP id EF6026B0004 for ; Mon, 21 Jan 2013 14:02:10 -0500 (EST) Date: Mon, 21 Jan 2013 21:02:07 +0200 From: Gleb Natapov Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas Message-ID: <20130121190207.GB25818@redhat.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: "H. Peter Anvin" Cc: Dave Hansen , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org, Marcelo Tosatti , Rik van Riel On Mon, Jan 21, 2013 at 12:38:06PM -0600, H. Peter Anvin wrote: > Final question: are any of these done in frequent paths? (I believe no, but...) > No, only during guest boot. > Dave Hansen wrote: > > > > >In short, it is illegal to call __pa() on an address holding > >a percpu variable. The times when this actually matters are > >pretty obscure (certain 32-bit NUMA systems), but it _does_ > >happen. It is important to keep KVM guests working on these > >systems because the real hardware is getting harder and > >harder to find. > > > >This bug manifested first by me seeing a plain hang at boot > >after this message: > > > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > > > >or, sometimes, it would actually make it out to the console: > > > >[ 0.000000] BUG: unable to handle kernel paging request at ffffffff > > > >I eventually traced it down to the KVM async pagefault code. > >This can be worked around by disabling that code either at > >compile-time, or on the kernel command-line. 
> > > >The kvm async pagefault code was injecting page faults in > >to the guest which the guest misinterpreted because its > >"reason" was not being properly sent from the host. > > > >The guest passes a physical address of a per-cpu async page > >fault structure via an MSR to the host. Since __pa() is > >broken on percpu data, the physical address it sent was > >basically bogus and the host went scribbling on random data. > >The guest never saw the real reason for the page fault (it > >was injected by the host), assumed that the kernel had taken > >a _real_ page fault, and panic()'d. The behavior varied, > >though, depending on what got corrupted by the bad write. > > > >Signed-off-by: Dave Hansen > >Acked-by: Rik van Riel > >--- > > > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > > 2 files changed, 7 insertions(+), 6 deletions(-) > > > >diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas > >arch/x86/kernel/kvm.c > >--- > >linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 > >10:22:26.914436992 -0800 > >+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 > >10:22:26.922437062 -0800 > >@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void > > > > memset(st, 0, sizeof(*st)); > > > >- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); > >+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | > >KVM_MSR_ENABLED)); > > printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", > >- cpu, __pa(st)); > >+ cpu, slow_virt_to_phys(st)); > > } > > > >static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = > >KVM_PV_EOI_DISABLED; > >@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) > > return; > > > > if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { > >- u64 pa = __pa(&__get_cpu_var(apf_reason)); > >+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); > > > > #ifdef CONFIG_PREEMPT > > pa |= KVM_ASYNC_PF_SEND_ALWAYS; > >@@ -332,7
+332,8 @@ void __cpuinit kvm_guest_cpu_init(void) > > /* Size alignment is implied but just to make it explicit. */ > > BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); > > __get_cpu_var(kvm_apic_eoi) = 0; > >- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; > >+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) > >+ | KVM_MSR_ENABLED; > > wrmsrl(MSR_KVM_PV_EOI_EN, pa); > > } > > > >diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas > >arch/x86/kernel/kvmclock.c > >--- > >linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 > >10:22:26.918437028 -0800 > >+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 > >10:22:26.922437062 -0800 > >@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) > > int low, high, ret; > > struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; > > > >- low = (int)__pa(src) | 1; > >- high = ((u64)__pa(src) >> 32); > >+ low = (int)slow_virt_to_phys(src) | 1; > >+ high = ((u64)slow_virt_to_phys(src) >> 32); > > ret = native_write_msr_safe(msr_kvm_system_time, low, high); > > printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", > > cpu, high, low, txt); > >_ > > -- > Sent from my mobile phone. Please excuse brevity and lack of formatting. -- Gleb. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx172.postini.com [74.125.245.172]) by kanga.kvack.org (Postfix) with SMTP id F12D16B0004 for ; Mon, 21 Jan 2013 14:24:59 -0500 (EST) In-Reply-To: <50FD901C.8000002@linux.vnet.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> <50FD901C.8000002@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: "H. Peter Anvin" Date: Mon, 21 Jan 2013 13:22:50 -0600 Message-ID: <6a43e949-61b2-4d96-8e85-46de3da8c3d0@email.android.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Cool, just checking. Dave Hansen wrote: >On 01/21/2013 10:38 AM, H. Peter Anvin wrote: >> Final question: are any of these done in frequent paths? (I believe >no, but...) > >Nope. All of the places that it gets used here are in >initialization-time paths. The two we have here are when kvm and the >host are setting up a new vcpu and when the kvmclock clocksource is >being registered. A CPU getting hotplugged is the only thing that >might >even have these get called more than at boot. -- Sent from my mobile phone. Please excuse brevity and lack of formatting. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx123.postini.com [74.125.245.123]) by kanga.kvack.org (Postfix) with SMTP id 9C76E6B0002 for ; Tue, 22 Jan 2013 16:24:41 -0500 (EST) Received: from /spool/local by e9.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 22 Jan 2013 16:24:38 -0500 Received: from d01relay06.pok.ibm.com (d01relay06.pok.ibm.com [9.56.227.116]) by d01dlp01.pok.ibm.com (Postfix) with ESMTP id E24B238C8039 for ; Tue, 22 Jan 2013 16:24:36 -0500 (EST) Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay06.pok.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r0MLOa7u22085794 for ; Tue, 22 Jan 2013 16:24:36 -0500 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r0MLOZje000674 for ; Tue, 22 Jan 2013 16:24:36 -0500 Subject: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: Dave Hansen Date: Tue, 22 Jan 2013 13:24:35 -0800 References: <20130122212428.8DF70119@kernel.stglabs.ibm.com> In-Reply-To: <20130122212428.8DF70119@kernel.stglabs.ibm.com> Message-Id: <20130122212435.4905663F@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen In short, it is illegal to call __pa() on an address holding a percpu variable. This replaces those __pa() calls with slow_virt_to_phys(). All of the cases in this patch are in boot time (or CPU hotplug time at worst) code, so the slow pagetable walking in slow_virt_to_phys() is not expected to have a performance impact. The times when this actually matters are pretty obscure (certain 32-bit NUMA systems), but it _does_ happen. 
It is important to keep KVM guests working on these systems because the real hardware is getting harder and harder to find. This bug manifested first by me seeing a plain hang at boot after this message: CPU 0 irqstacks, hard=f3018000 soft=f301a000 or, sometimes, it would actually make it out to the console: [ 0.000000] BUG: unable to handle kernel paging request at ffffffff I eventually traced it down to the KVM async pagefault code. This can be worked around by disabling that code either at compile-time, or on the kernel command-line. The kvm async pagefault code was injecting page faults in to the guest which the guest misinterpreted because its "reason" was not being properly sent from the host. The guest passes a physical address of a per-cpu async page fault structure via an MSR to the host. Since __pa() is broken on percpu data, the physical address it sent was basically bogus and the host went scribbling on random data. The guest never saw the real reason for the page fault (it was injected by the host), assumed that the kernel had taken a _real_ page fault, and panic()'d. The behavior varied, though, depending on what got corrupted by the bad write.
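A rough userspace illustration of the failure mode described above, assuming a 3G/1G split and 4k pages. This is not kernel code: toy_pa(), toy_walk(), struct toy_mapping and toy_table are hypothetical names made up for this sketch, and the frame numbers are invented. It only shows why a plain subtraction and a table lookup can disagree for an address that is not in the linear mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xC0000000u  /* assumed 3G/1G split, illustration only */
#define TOY_PAGE_SHIFT  12
#define TOY_PAGE_MASK   (~((1u << TOY_PAGE_SHIFT) - 1))

/* One entry of a toy "page table": a virtual page and the frame backing it. */
struct toy_mapping {
    uint32_t virt_page;  /* page-aligned virtual address */
    uint32_t pfn;        /* physical frame number (hypothetical values) */
};

/* What a linear-map-only translation does: plain subtraction. */
static uint32_t toy_pa(uint32_t virt)
{
    return virt - TOY_PAGE_OFFSET;
}

/* What a table walk does: find the mapping, then add the in-page offset. */
static uint32_t toy_walk(const struct toy_mapping *tbl, size_t n, uint32_t virt)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].virt_page == (virt & TOY_PAGE_MASK))
            return (tbl[i].pfn << TOY_PAGE_SHIFT) | (virt & ~TOY_PAGE_MASK);
    }
    assert(!"address not mapped");  /* toy stand-in for a BUG_ON() */
    return 0;
}

static const struct toy_mapping toy_table[] = {
    { 0xC0001000u, 0x00001u },  /* linear: virt - TOY_PAGE_OFFSET == phys */
    { 0xF3018000u, 0x7FF18u },  /* remapped: backed by an unrelated frame */
};
```

For the linear entry both translations agree; for the remapped entry toy_pa() returns an address that has nothing to do with the frame actually backing the page, which is the kind of bogus value the guest ended up handing to the host.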
Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-22 13:17:16.424317475 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-22 13:17:16.432317541 -0800 @@ -289,9 +289,9 @@ static void kvm_register_steal_time(void memset(st, 0, sizeof(*st)); - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED)); printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", - cpu, __pa(st)); + cpu, slow_virt_to_phys(st)); } static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED; @@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) return; if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { - u64 pa = __pa(&__get_cpu_var(apf_reason)); + u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); #ifdef CONFIG_PREEMPT pa |= KVM_ASYNC_PF_SEND_ALWAYS; @@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) /* Size alignment is implied but just to make it explicit. 
*/ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); __get_cpu_var(kvm_apic_eoi) = 0; - pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; + pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) + | KVM_MSR_ENABLED; wrmsrl(MSR_KVM_PV_EOI_EN, pa); } diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-22 13:17:16.428317508 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-22 13:17:16.432317541 -0800 @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) int low, high, ret; struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; - low = (int)__pa(src) | 1; - high = ((u64)__pa(src) >> 32); + low = (int)slow_virt_to_phys(src) | 1; + high = ((u64)slow_virt_to_phys(src) >> 32); ret = native_write_msr_safe(msr_kvm_system_time, low, high); printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); _ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx126.postini.com [74.125.245.126]) by kanga.kvack.org (Postfix) with SMTP id C220A6B0010 for ; Tue, 22 Jan 2013 19:21:54 -0500 (EST) Date: Tue, 22 Jan 2013 22:08:20 -0200 From: Marcelo Tosatti Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas Message-ID: <20130123000820.GA27204@amt.cnet> References: <20130122212428.8DF70119@kernel.stglabs.ibm.com> <20130122212435.4905663F@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130122212435.4905663F@kernel.stglabs.ibm.com> Sender: owner-linux-mm@kvack.org List-ID: To: Dave Hansen Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Rik van Riel On Tue, Jan 22, 2013 at 01:24:35PM -0800, Dave Hansen wrote: > > In short, it is illegal to call __pa() on an address holding > a percpu variable. This replaces those __pa() calls with > slow_virt_to_phys(). All of the cases in this patch are > in boot time (or CPU hotplug time at worst) code, so the > slow pagetable walking in slow_virt_to_phys() is not expected > to have a performance impact. > > The times when this actually matters are pretty obscure > (certain 32-bit NUMA systems), but it _does_ happen. It is > important to keep KVM guests working on these systems because > the real hardware is getting harder and harder to find. > > This bug manifested first by me seeing a plain hang at boot > after this message: > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > > or, sometimes, it would actually make it out to the console: > > [ 0.000000] BUG: unable to handle kernel paging request at ffffffff > > I eventually traced it down to the KVM async pagefault code. > This can be worked around by disabling that code either at > compile-time, or on the kernel command-line. > > The kvm async pagefault code was injecting page faults in > to the guest which the guest misinterpreted because its > "reason" was not being properly sent from the host. > > The guest passes a physical address of a per-cpu async page > fault structure via an MSR to the host. Since __pa() is > broken on percpu data, the physical address it sent was > basically bogus and the host went scribbling on random data. > The guest never saw the real reason for the page fault (it > was injected by the host), assumed that the kernel had taken > a _real_ page fault, and panic()'d. The behavior varied, > though, depending on what got corrupted by the bad write.
> > Signed-off-by: Dave Hansen > Acked-by: Rik van Riel > --- > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > 2 files changed, 7 insertions(+), 6 deletions(-) Reviewed-by: Marcelo Tosatti -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754850Ab3AURw5 (ORCPT ); Mon, 21 Jan 2013 12:52:57 -0500 Received: from e9.ny.us.ibm.com ([32.97.182.139]:57537 "EHLO e9.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751635Ab3AURw4 (ORCPT ); Mon, 21 Jan 2013 12:52:56 -0500 Subject: [PATCH 4/5] create slow_virt_to_phys() To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:49 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-7182-0000-0000-00000496D55A Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This is necessary because __pa() does not work on some kinds of memory, like vmalloc() or the alloc_remap() areas on 32-bit NUMA systems. We have some functions to do conversions _like_ this in the vmalloc() code (like vmalloc_to_page()), but they do not work on sizes other than 4k pages. We would potentially need to be able to handle all the page sizes that we use for the kernel linear mapping (4k, 2M, 1G). In practice, on 32-bit NUMA systems, the percpu areas get stuck in the alloc_remap() area. 
Any __pa() call on them will break and basically return garbage. This patch introduces a new function slow_virt_to_phys(), which walks the kernel page tables on x86 and should do precisely the same logical thing as __pa(), but actually work on a wider range of memory. It should work on the normal linear mapping, vmalloc(), kmap(), etc... Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1 linux-2.6.git-dave/arch/x86/mm/pageattr.c | 31 ++++++++++++++++ 2 files changed, 32 insertions(+) diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys arch/x86/include/asm/pgtable_types.h --- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2013-01-17 10:22:26.590434129 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 10:22:26.598434199 -0800 @@ -352,6 +352,7 @@ static inline void update_page_count(int * as a pte too. */ extern pte_t *lookup_address(unsigned long address, unsigned int *level); +extern phys_addr_t slow_virt_to_phys(void *__address); #endif /* !__ASSEMBLY__ */ diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys arch/x86/mm/pageattr.c --- linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2013-01-17 10:22:26.594434163 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 10:22:26.598434199 -0800 @@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr EXPORT_SYMBOL_GPL(lookup_address); /* + * This is necessary because __pa() does not work on some + * kinds of memory, like vmalloc() or the alloc_remap() + * areas on 32-bit NUMA systems. The percpu areas can + * end up in this kind of memory, for instance. + * + * This could be optimized, but it is only intended to be + * used at initialization time, and keeping it + * unoptimized should increase the testing coverage for + * the more obscure platforms.
+ */ +phys_addr_t slow_virt_to_phys(void *__virt_addr) +{ + unsigned long virt_addr = (unsigned long)__virt_addr; + phys_addr_t phys_addr; + unsigned long offset; + unsigned int level = -1; + unsigned long psize = 0; + unsigned long pmask = 0; + pte_t *pte; + + pte = lookup_address(virt_addr, &level); + BUG_ON(!pte); + psize = page_level_size(level); + pmask = page_level_mask(level); + offset = virt_addr & ~pmask; + phys_addr = pte_pfn(*pte) << PAGE_SHIFT; + return (phys_addr | offset); +} +EXPORT_SYMBOL_GPL(slow_virt_to_phys); + +/* * Set the new pmd in all the pgds we know about: */ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte) _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755690Ab3AURxG (ORCPT ); Mon, 21 Jan 2013 12:53:06 -0500 Received: from e7.ny.us.ibm.com ([32.97.182.137]:36654 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755265Ab3AURxC (ORCPT ); Mon, 21 Jan 2013 12:53:02 -0500 Subject: [PATCH 2/5] pagetable level size/shift/mask helpers To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:46 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175246.6B215415@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-5806-0000-0000-00001E8AC64D Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org I plan to use lookup_address() to walk the kernel pagetables in a later patch. It returns a "pte" and the level in the pagetables where the "pte" was found. The level is just an enum and needs to be converted to a useful value in order to do address calculations with it. 
These helpers will be used in at least two places. This also gives the anonymous enum a real name so that no one gets confused about what they should be passing in to these helpers. "PTE_SHIFT" was chosen for naming consistency with the other pagetable levels (PGD/PUD/PMD_SHIFT). Cc: H. Peter Anvin Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/include/asm/pgtable.h | 14 ++++++++++++++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 2 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff -puN arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable.h --- linux-2.6.git/arch/x86/include/asm/pgtable.h~pagetable-level-size-helpers 2013-01-17 10:22:25.958428542 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable.h 2013-01-17 10:22:25.962428578 -0800 @@ -390,6 +390,7 @@ pte_t *populate_extra_pte(unsigned long #ifndef __ASSEMBLY__ #include +#include static inline int pte_none(pte_t pte) { @@ -781,6 +782,19 @@ static inline void clone_pgd_range(pgd_t memcpy(dst, src, count * sizeof(pgd_t)); } +#define PTE_SHIFT ilog2(PTRS_PER_PTE) +static inline int page_level_shift(enum pg_level level) +{ + return (PAGE_SHIFT - PTE_SHIFT) + level * PTE_SHIFT; +} +static inline unsigned long page_level_size(enum pg_level level) +{ + return 1UL << page_level_shift(level); +} +static inline unsigned long page_level_mask(enum pg_level level) +{ + return ~(page_level_size(level) - 1); +} #include #endif /* __ASSEMBLY__ */ diff -puN arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers arch/x86/include/asm/pgtable_types.h --- linux-2.6.git/arch/x86/include/asm/pgtable_types.h~pagetable-level-size-helpers 2013-01-17 10:22:25.958428542 -0800 +++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 10:22:25.966428612 -0800 @@ -331,7 +331,7 @@ extern void native_pagetable_init(void); struct seq_file; extern void arch_report_meminfo(struct seq_file *m); -enum { +enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, 
PG_LEVEL_2M, _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756071Ab3AURxa (ORCPT ); Mon, 21 Jan 2013 12:53:30 -0500 Received: from e36.co.us.ibm.com ([32.97.110.154]:52467 "EHLO e36.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755726Ab3AURx1 (ORCPT ); Mon, 21 Jan 2013 12:53:27 -0500 Subject: [PATCH 5/5] fix kvm's use of __pa() on percpu areas To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:50 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175250.1AAC7981@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-7606-0000-0000-0000079FD649 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In short, it is illegal to call __pa() on an address holding a percpu variable. The times when this actually matters are pretty obscure (certain 32-bit NUMA systems), but it _does_ happen. It is important to keep KVM guests working on these systems because the real hardware is getting harder and harder to find. This bug manifested first by me seeing a plain hang at boot after this message: CPU 0 irqstacks, hard=f3018000 soft=f301a000 or, sometimes, it would actually make it out to the console: [ 0.000000] BUG: unable to handle kernel paging request at ffffffff I eventually traced it down to the KVM async pagefault code. This can be worked around by disabling that code either at compile-time, or on the kernel command-line. The kvm async pagefault code was injecting page faults in to the guest which the guest misinterpreted because its "reason" was not being properly sent from the host. 
The guest passes a physical address of a per-cpu async page fault structure via an MSR to the host. Since __pa() is broken on percpu data, the physical address it sent was basically bogus and the host went scribbling on random data. The guest never saw the real reason for the page fault (it was injected by the host), assumed that the kernel had taken a _real_ page fault, and panic()'d. The behavior varied, though, depending on what got corrupted by the bad write. Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 10:22:26.914436992 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 10:22:26.922437062 -0800 @@ -289,9 +289,9 @@ static void kvm_register_steal_time(void memset(st, 0, sizeof(*st)); - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED)); printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", - cpu, __pa(st)); + cpu, slow_virt_to_phys(st)); } static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED; @@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) return; if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { - u64 pa = __pa(&__get_cpu_var(apf_reason)); + u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); #ifdef CONFIG_PREEMPT pa |= KVM_ASYNC_PF_SEND_ALWAYS; @@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) /* Size alignment is implied but just to make it explicit.
*/ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); __get_cpu_var(kvm_apic_eoi) = 0; - pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; + pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) + | KVM_MSR_ENABLED; wrmsrl(MSR_KVM_PV_EOI_EN, pa); } diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 10:22:26.918437028 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 10:22:26.922437062 -0800 @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) int low, high, ret; struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; - low = (int)__pa(src) | 1; - high = ((u64)__pa(src) >> 32); + low = (int)slow_virt_to_phys(src) | 1; + high = ((u64)slow_virt_to_phys(src) >> 32); ret = native_write_msr_safe(msr_kvm_system_time, low, high); printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756403Ab3AURy4 (ORCPT ); Mon, 21 Jan 2013 12:54:56 -0500 Received: from e8.ny.us.ibm.com ([32.97.182.138]:56278 "EHLO e8.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756199Ab3AURyf (ORCPT ); Mon, 21 Jan 2013 12:54:35 -0500 Subject: [PATCH 1/5] make DEBUG_VIRTUAL work earlier in boot To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:45 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175245.3081B2B1@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-9360-0000-0000-00000F8012D5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The KVM code has some repeated bugs in it around use of __pa() on per-cpu data. Those data are not in an area on which using __pa() is valid. However, they are also called early enough in boot that __vmalloc_start_set is not set, and thus the CONFIG_DEBUG_VIRTUAL debugging does not catch them. This adds a check to also verify __pa() calls against max_low_pfn, which we can use earlier in boot than is_vmalloc_addr(). However, if we are super-early in boot, max_low_pfn=0 and this will trip on every call, so also make sure that max_low_pfn is set before we try to use it. With this patch applied, CONFIG_DEBUG_VIRTUAL will actually catch the bug I was chasing (and fix later in this series). I'd love to find a generic way so that any __pa() call on percpu areas could do a BUG_ON(), but there don't appear to be any nice and easy ways to check if an address is a percpu one. Anybody have ideas on a way to do this?
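A minimal userspace sketch of the guard described above, assuming a 3G/1G split and 4k pages. toy_phys_addr_ok() and the TOY_* constants are hypothetical names for illustration only, not the kernel's __phys_addr(); the sketch just models "skip the pfn range check while max_low_pfn is still zero".

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xC0000000u  /* assumed 3G/1G split, illustration only */
#define TOY_PAGE_SHIFT  12

/*
 * Returns true when a linear-map translation of virt looks sane.
 * max_low_pfn == 0 models "too early in boot to know", in which
 * case the pfn range check is skipped instead of tripping on
 * every call.
 */
static bool toy_phys_addr_ok(uint32_t virt, uint32_t max_low_pfn)
{
    uint32_t phys;

    if (virt < TOY_PAGE_OFFSET)
        return false;
    phys = virt - TOY_PAGE_OFFSET;
    if (max_low_pfn && (phys >> TOY_PAGE_SHIFT) > max_low_pfn)
        return false;
    return true;
}
```

With a hypothetical max_low_pfn of 0x38000 (896MB of lowmem), 0xF9000000 is rejected because its would-be pfn lies above the limit, while the same address passes when max_low_pfn is still 0.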
Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/mm/numa.c | 2 +- linux-2.6.git-dave/arch/x86/mm/pat.c | 4 ++-- linux-2.6.git-dave/arch/x86/mm/physaddr.c | 9 ++++++++- 3 files changed, 11 insertions(+), 4 deletions(-) diff -puN arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/numa.c --- linux-2.6.git/arch/x86/mm/numa.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.614425502 -0800 +++ linux-2.6.git-dave/arch/x86/mm/numa.c 2013-01-17 10:22:25.622425572 -0800 @@ -219,7 +219,7 @@ static void __init setup_node_data(int n */ nd = alloc_remap(nid, nd_size); if (nd) { - nd_pa = __pa(nd); + nd_pa = __phys_addr_nodebug(nd); remapped = true; } else { nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid); diff -puN arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/pat.c --- linux-2.6.git/arch/x86/mm/pat.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.614425502 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pat.c 2013-01-17 10:22:25.622425572 -0800 @@ -560,10 +560,10 @@ int kernel_map_sync_memtype(u64 base, un { unsigned long id_sz; - if (base >= __pa(high_memory)) + if (base > __pa(high_memory-1)) return 0; - id_sz = (__pa(high_memory) < base + size) ? + id_sz = (__pa(high_memory-1) <= base + size) ? 
__pa(high_memory) - base : size; diff -puN arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot arch/x86/mm/physaddr.c --- linux-2.6.git/arch/x86/mm/physaddr.c~make-DEBUG_VIRTUAL-work-earlier-in-boot 2013-01-17 10:22:25.618425536 -0800 +++ linux-2.6.git-dave/arch/x86/mm/physaddr.c 2013-01-17 10:22:25.622425572 -0800 @@ -1,3 +1,4 @@ +#include #include #include #include @@ -47,10 +48,16 @@ EXPORT_SYMBOL(__virt_addr_valid); #ifdef CONFIG_DEBUG_VIRTUAL unsigned long __phys_addr(unsigned long x) { + unsigned long phys_addr = x - PAGE_OFFSET; /* VMALLOC_* aren't constants */ VIRTUAL_BUG_ON(x < PAGE_OFFSET); VIRTUAL_BUG_ON(__vmalloc_start_set && is_vmalloc_addr((void *) x)); - return x - PAGE_OFFSET; + /* max_low_pfn is set early, but not _that_ early */ + if (max_low_pfn) { + VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn); + BUG_ON(slow_virt_to_phys((void *)x) != phys_addr); + } + return phys_addr; } EXPORT_SYMBOL(__phys_addr); #endif _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756447Ab3AURzv (ORCPT ); Mon, 21 Jan 2013 12:55:51 -0500 Received: from e8.ny.us.ibm.com ([32.97.182.138]:56231 "EHLO e8.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753779Ab3AURye (ORCPT ); Mon, 21 Jan 2013 12:54:34 -0500 Subject: [PATCH 3/5] use new pagetable helpers in try_preserve_large_page() To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:47 -0800 References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> In-Reply-To: <20130121175244.E5839E06@kernel.stglabs.ibm.com> Message-Id: <20130121175247.76641034@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-9360-0000-0000-00000F8012A5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org try_preserve_large_page() can be slightly simplified by using the new page_level_*() helpers. Signed-off-by: Dave Hansen --- linux-2.6.git-dave/arch/x86/mm/pageattr.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff -puN arch/x86/mm/pageattr.c~use-new-pagetable-helpers arch/x86/mm/pageattr.c --- linux-2.6.git/arch/x86/mm/pageattr.c~use-new-pagetable-helpers 2013-01-17 10:22:26.282431407 -0800 +++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 10:22:26.286431442 -0800 @@ -412,15 +412,12 @@ try_preserve_large_page(pte_t *kpte, uns switch (level) { case PG_LEVEL_2M: - psize = PMD_PAGE_SIZE; - pmask = PMD_PAGE_MASK; - break; #ifdef CONFIG_X86_64 case PG_LEVEL_1G: - psize = PUD_PAGE_SIZE; - pmask = PUD_PAGE_MASK; - break; #endif + psize = page_level_size(level); + pmask = page_level_mask(level); + break; default: do_split = -EINVAL; goto out_unlock; _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756439Ab3AUR4u (ORCPT ); Mon, 21 Jan 2013 12:56:50 -0500 Received: from e8.ny.us.ibm.com ([32.97.182.138]:56199 "EHLO e8.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754346Ab3AURyb (ORCPT ); Mon, 21 Jan 2013 12:54:31 -0500 Subject: [PATCH 0/5] fix illegal use of __pa() in KVM code To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Mon, 21 Jan 2013 09:52:44 -0800 Message-Id: <20130121175244.E5839E06@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012117-9360-0000-0000-00000F80127A Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org This series fixes a hard-to-debug early boot hang on 32-bit NUMA systems. It adds coverage to the debugging code, adds some helpers, and eventually fixes the original bug I was hitting. [v2] * Moved DEBUG_VIRTUAL patch earlier in the series (it has no dependencies on anything else and stands on its own). * Created page_level_*() helpers to replace a nasty switch() at hpa's suggestion From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754225Ab3AUSIm (ORCPT ); Mon, 21 Jan 2013 13:08:42 -0500 Received: from terminus.zytor.com ([198.137.202.10]:53521 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751635Ab3AUSIl (ORCPT ); Mon, 21 Jan 2013 13:08:41 -0500 User-Agent: K-9 Mail for Android In-Reply-To: <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 4/5] create slow_virt_to_phys() From: "H. Peter Anvin" Date: Mon, 21 Jan 2013 12:08:18 -0600 To: Dave Hansen , linux-kernel@vger.kernel.org CC: linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Message-ID: <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Why are you initializing psize/pmask? 
Dave Hansen wrote: > >This is necessary because __pa() does not work on some kinds of >memory, like vmalloc() or the alloc_remap() areas on 32-bit >NUMA systems. We have some functions to do conversions _like_ >this in the vmalloc() code (like vmalloc_to_page()), but they >do not work on sizes other than 4k pages. We would potentially >need to be able to handle all the page sizes that we use for >the kernel linear mapping (4k, 2M, 1G). > >In practice, on 32-bit NUMA systems, the percpu areas get stuck >in the alloc_remap() area. Any __pa() call on them will break >and basically return garbage. > >This patch introduces a new function slow_virt_to_phys(), which >walks the kernel page tables on x86 and should do precisely >the same logical thing as __pa(), but actually work on a wider >range of memory. It should work on the normal linear mapping, >vmalloc(), kmap(), etc... > >Signed-off-by: Dave Hansen >Acked-by: Rik van Riel >--- > > linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h | 1 >linux-2.6.git-dave/arch/x86/mm/pageattr.c | 31 >++++++++++++++++ > 2 files changed, 32 insertions(+) > >diff -puN arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys >arch/x86/include/asm/pgtable_types.h >--- >linux-2.6.git/arch/x86/include/asm/pgtable_types.h~create-slow_virt_to_phys 2013-01-17 >10:22:26.590434129 -0800 >+++ linux-2.6.git-dave/arch/x86/include/asm/pgtable_types.h 2013-01-17 >10:22:26.598434199 -0800 >@@ -352,6 +352,7 @@ static inline void update_page_count(int > * as a pte too. 
> */ >extern pte_t *lookup_address(unsigned long address, unsigned int >*level); >+extern phys_addr_t slow_virt_to_phys(void *__address); > > #endif /* !__ASSEMBLY__ */ > >diff -puN arch/x86/mm/pageattr.c~create-slow_virt_to_phys >arch/x86/mm/pageattr.c >--- >linux-2.6.git/arch/x86/mm/pageattr.c~create-slow_virt_to_phys 2013-01-17 >10:22:26.594434163 -0800 >+++ linux-2.6.git-dave/arch/x86/mm/pageattr.c 2013-01-17 >10:22:26.598434199 -0800 >@@ -364,6 +364,37 @@ pte_t *lookup_address(unsigned long addr > EXPORT_SYMBOL_GPL(lookup_address); > > /* >+ * This is necessary because __pa() does not work on some >+ * kinds of memory, like vmalloc() or the alloc_remap() >+ * areas on 32-bit NUMA systems. The percpu areas can >+ * end up in this kind of memory, for instance. >+ * >+ * This could be optimized, but it is only intended to be >+ * used at initialization time, and keeping it >+ * unoptimized should increase the testing coverage for >+ * the more obscure platforms. >+ */ >+phys_addr_t slow_virt_to_phys(void *__virt_addr) >+{ >+ unsigned long virt_addr = (unsigned long)__virt_addr; >+ phys_addr_t phys_addr; >+ unsigned long offset; >+ unsigned int level = -1; >+ unsigned long psize = 0; >+ unsigned long pmask = 0; >+ pte_t *pte; >+ >+ pte = lookup_address(virt_addr, &level); >+ BUG_ON(!pte); >+ psize = page_level_size(level); >+ pmask = page_level_mask(level); >+ offset = virt_addr & ~pmask; >+ phys_addr = pte_pfn(*pte) << PAGE_SHIFT; >+ return (phys_addr | offset); >+} >+EXPORT_SYMBOL_GPL(slow_virt_to_phys); >+ >+/* > * Set the new pmd in all the pgds we know about: > */ >static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t >pte) >_ -- Sent from my mobile phone. Please excuse brevity and lack of formatting. 
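
The arithmetic at the end of slow_virt_to_phys() in the quoted patch can be sketched in user space as below. This assumes the x86-64 layout (4K base pages, 512 entries per table). The names level_shift(), level_size(), level_mask() and combine_phys() are hypothetical stand-ins for the page_level_*() helpers from patch 2/5, whose real definitions are not shown in this part of the thread.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PTE_SHIFT  9	/* assumption: 512 entries per table, as on x86-64 */

enum pg_level { PG_LEVEL_NONE, PG_LEVEL_4K, PG_LEVEL_2M, PG_LEVEL_1G };

/* Number of address bits covered by one mapping at this level. */
static int level_shift(enum pg_level level)
{
	return (PAGE_SHIFT - PTE_SHIFT) + (int)level * PTE_SHIFT;
}

/* Size in bytes of one mapping at this level: 4K, 2M or 1G. */
static uint64_t level_size(enum pg_level level)
{
	return 1ULL << level_shift(level);
}

/* Mask that keeps the bits above the in-page offset. */
static uint64_t level_mask(enum pg_level level)
{
	return ~(level_size(level) - 1);
}

/*
 * Combine the page frame number found in the page-table entry with the
 * offset of 'virt' inside the (possibly large) page mapped at 'level'.
 */
static uint64_t combine_phys(uint64_t pfn, uint64_t virt, enum pg_level level)
{
	uint64_t offset = virt & ~level_mask(level);

	return (pfn << PAGE_SHIFT) | offset;
}
```

For a 2M mapping the offset keeps the low 21 bits of the virtual address, so one entry covers the whole large page. The sketch holds the frame number in a 64-bit type so the shift cannot truncate.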
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755239Ab3AUSSz (ORCPT ); Mon, 21 Jan 2013 13:18:55 -0500 Received: from e37.co.us.ibm.com ([32.97.110.158]:39813 "EHLO e37.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752782Ab3AUSSx (ORCPT ); Mon, 21 Jan 2013 13:18:53 -0500 Message-ID: <50FD8676.9090203@linux.vnet.ibm.com> Date: Mon, 21 Jan 2013 10:18:30 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: "H. Peter Anvin" CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Subject: Re: [PATCH 4/5] create slow_virt_to_phys() References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175249.AFE9EAD7@kernel.stglabs.ibm.com> <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> In-Reply-To: <2ad09c09-98c3-4b2d-9b3f-f16fbcce4edf@email.android.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012118-7408-0000-0000-00000C1491B3 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/21/2013 10:08 AM, H. Peter Anvin wrote: > Why are you initializing psize/pmask? It's an artifact from the switch() that was there before. I'll clean it up. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756065Ab3AUSyk (ORCPT ); Mon, 21 Jan 2013 13:54:40 -0500 Received: from terminus.zytor.com ([198.137.202.10]:53845 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755711Ab3AUSyi (ORCPT ); Mon, 21 Jan 2013 13:54:38 -0500 User-Agent: K-9 Mail for Android In-Reply-To: <20130121175250.1AAC7981@kernel.stglabs.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: "H. Peter Anvin" Date: Mon, 21 Jan 2013 12:38:06 -0600 To: Dave Hansen , linux-kernel@vger.kernel.org CC: linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Message-ID: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Final question: are any of these done in frequent paths? (I believe no, but...) Dave Hansen wrote: > >In short, it is illegal to call __pa() on an address holding >a percpu variable. The times when this actually matters are >pretty obscure (certain 32-bit NUMA systems), but it _does_ >happen. It is important to keep KVM guests working on these >systems because the real hardware is getting harder and >harder to find. > >This bug manifested first by me seeing a plain hang at boot >after this message: > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > >or, sometimes, it would actually make it out to the console: > >[ 0.000000] BUG: unable to handle kernel paging request at ffffffff > >I eventually traced it down to the KVM async pagefault code. >This can be worked around by disabling that code either at >compile-time, or on the kernel command-line. 
> >The kvm async pagefault code was injecting page faults in >to the guest which the guest misinterpreted because its >"reason" was not being properly sent from the host. > >The guest passes a physical address of a per-cpu async page >fault structure via an MSR to the host. Since __pa() is >broken on percpu data, the physical address it sent was >basically bogus and the host went scribbling on random data. >The guest never saw the real reason for the page fault (it >was injected by the host), assumed that the kernel had taken >a _real_ page fault, and panic()'d. The behavior varied, >though, depending on what got corrupted by the bad write. > >Signed-off-by: Dave Hansen >Acked-by: Rik van Riel >--- > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > 2 files changed, 7 insertions(+), 6 deletions(-) > >diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas >arch/x86/kernel/kvm.c >--- >linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 >10:22:26.914436992 -0800 >+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 >10:22:26.922437062 -0800 >@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void > > memset(st, 0, sizeof(*st)); > >- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); >+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | >KVM_MSR_ENABLED)); > printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", >- cpu, __pa(st)); >+ cpu, slow_virt_to_phys(st)); > } > >static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = >KVM_PV_EOI_DISABLED; >@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) > return; > > if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { >- u64 pa = __pa(&__get_cpu_var(apf_reason)); >+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); > > #ifdef CONFIG_PREEMPT > pa |= KVM_ASYNC_PF_SEND_ALWAYS; >@@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) > /* Size alignment is implied but just to make it 
explicit. */ > BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); > __get_cpu_var(kvm_apic_eoi) = 0; >- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; >+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) >+ | KVM_MSR_ENABLED; > wrmsrl(MSR_KVM_PV_EOI_EN, pa); > } > >diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas >arch/x86/kernel/kvmclock.c >--- >linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 >10:22:26.918437028 -0800 >+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 >10:22:26.922437062 -0800 >@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) > int low, high, ret; > struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; > >- low = (int)__pa(src) | 1; >- high = ((u64)__pa(src) >> 32); >+ low = (int)slow_virt_to_phys(src) | 1; >+ high = ((u64)slow_virt_to_phys(src) >> 32); > ret = native_write_msr_safe(msr_kvm_system_time, low, high); > printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", > cpu, high, low, txt); >_ -- Sent from my mobile phone. Please excuse brevity and lack of formatting. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756113Ab3AUS7w (ORCPT ); Mon, 21 Jan 2013 13:59:52 -0500 Received: from e7.ny.us.ibm.com ([32.97.182.137]:39620 "EHLO e7.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753657Ab3AUS7v (ORCPT ); Mon, 21 Jan 2013 13:59:51 -0500 Message-ID: <50FD901C.8000002@linux.vnet.ibm.com> Date: Mon, 21 Jan 2013 10:59:40 -0800 From: Dave Hansen User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/17.0 Thunderbird/17.0 MIME-Version: 1.0 To: "H. 
Peter Anvin" CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> In-Reply-To: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012118-5806-0000-0000-00001E8B7448 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 01/21/2013 10:38 AM, H. Peter Anvin wrote: > Final question: are any of these done in frequent paths? (I believe no, but...) Nope. All of the places that it gets used here are in initialization-time paths. The two we have here are when kvm and the host are setting up a new vcpu and when the kvmclock clocksource is being registered. A CPU getting hotplugged is the only thing that might even have these get called more than at boot. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756199Ab3AUTCQ (ORCPT ); Mon, 21 Jan 2013 14:02:16 -0500 Received: from mx1.redhat.com ([209.132.183.28]:15634 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751635Ab3AUTCP (ORCPT ); Mon, 21 Jan 2013 14:02:15 -0500 Date: Mon, 21 Jan 2013 21:02:07 +0200 From: Gleb Natapov To: "H. 
Peter Anvin" Cc: Dave Hansen , linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org, Marcelo Tosatti , Rik van Riel Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas Message-ID: <20130121190207.GB25818@redhat.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jan 21, 2013 at 12:38:06PM -0600, H. Peter Anvin wrote: > Final question: are any of these done in frequent paths? (I believe no, but...) > No, only during guest boot. > Dave Hansen wrote: > > > > >In short, it is illegal to call __pa() on an address holding > >a percpu variable. The times when this actually matters are > >pretty obscure (certain 32-bit NUMA systems), but it _does_ > >happen. It is important to keep KVM guests working on these > >systems because the real hardware is getting harder and > >harder to find. > > > >This bug manifested first by me seeing a plain hang at boot > >after this message: > > > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > > > >or, sometimes, it would actually make it out to the console: > > > >[ 0.000000] BUG: unable to handle kernel paging request at ffffffff > > > >I eventually traced it down to the KVM async pagefault code. > >This can be worked around by disabling that code either at > >compile-time, or on the kernel command-line. > > > >The kvm async pagefault code was injecting page faults in > >to the guest which the guest misinterpreted because its > >"reason" was not being properly sent from the host. > > > >The guest passes a physical address of an per-cpu async page > >fault structure via an MSR to the host. 
Since __pa() is > >broken on percpu data, the physical address it sent was > >basically bogus and the host went scribbling on random data. > >The guest never saw the real reason for the page fault (it > >was injected by the host), assumed that the kernel had taken > >a _real_ page fault, and panic()'d. The behavior varied, > >though, depending on what got corrupted by the bad write. > > > >Signed-off-by: Dave Hansen > >Acked-by: Rik van Riel > >--- > > > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > > 2 files changed, 7 insertions(+), 6 deletions(-) > > > >diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas > >arch/x86/kernel/kvm.c > >--- > >linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 > >10:22:26.914436992 -0800 > >+++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-17 > >10:22:26.922437062 -0800 > >@@ -289,9 +289,9 @@ static void kvm_register_steal_time(void > > > > memset(st, 0, sizeof(*st)); > > > >- wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); > >+ wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | > >KVM_MSR_ENABLED)); > > printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", > >- cpu, __pa(st)); > >+ cpu, slow_virt_to_phys(st)); > > } > > > >static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = > >KVM_PV_EOI_DISABLED; > >@@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) > > return; > > > > if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { > >- u64 pa = __pa(&__get_cpu_var(apf_reason)); > >+ u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); > > > > #ifdef CONFIG_PREEMPT > > pa |= KVM_ASYNC_PF_SEND_ALWAYS; > >@@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) > > /* Size alignment is implied but just to make it explicit. 
*/ > > BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); > > __get_cpu_var(kvm_apic_eoi) = 0; > >- pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; > >+ pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) > >+ | KVM_MSR_ENABLED; > > wrmsrl(MSR_KVM_PV_EOI_EN, pa); > > } > > > >diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas > >arch/x86/kernel/kvmclock.c > >--- > >linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-17 > >10:22:26.918437028 -0800 > >+++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-17 > >10:22:26.922437062 -0800 > >@@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) > > int low, high, ret; > > struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; > > > >- low = (int)__pa(src) | 1; > >- high = ((u64)__pa(src) >> 32); > >+ low = (int)slow_virt_to_phys(src) | 1; > >+ high = ((u64)slow_virt_to_phys(src) >> 32); > > ret = native_write_msr_safe(msr_kvm_system_time, low, high); > > printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", > > cpu, high, low, txt); > >_ > > -- > Sent from my mobile phone. Please excuse brevity and lack of formatting. -- Gleb. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756519Ab3AUTZO (ORCPT ); Mon, 21 Jan 2013 14:25:14 -0500 Received: from terminus.zytor.com ([198.137.202.10]:54096 "EHLO mail.zytor.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756419Ab3AUTZK (ORCPT ); Mon, 21 Jan 2013 14:25:10 -0500 User-Agent: K-9 Mail for Android In-Reply-To: <50FD901C.8000002@linux.vnet.ibm.com> References: <20130121175244.E5839E06@kernel.stglabs.ibm.com> <20130121175250.1AAC7981@kernel.stglabs.ibm.com> <08cba1bf-6476-4fad-8d29-e380ec7127ba@email.android.com> <50FD901C.8000002@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas From: "H. 
Peter Anvin" Date: Mon, 21 Jan 2013 13:22:50 -0600 To: Dave Hansen CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , x86@kernel.org, Marcelo Tosatti , Rik van Riel Message-ID: <6a43e949-61b2-4d96-8e85-46de3da8c3d0@email.android.com> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Cool, just checking. Dave Hansen wrote: >On 01/21/2013 10:38 AM, H. Peter Anvin wrote: >> Final question: are any of these done in frequent paths? (I believe >no, but...) > >Nope. All of the places that it gets used here are in >initialization-time paths. The two we have here are when kvm and the >host are setting up a new vcpu and when the kvmclock clocksource is >being registered. A CPU getting hotplugged is the only thing that >might >even have these get called more than at boot. -- Sent from my mobile phone. Please excuse brevity and lack of formatting. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756118Ab3AVVYn (ORCPT ); Tue, 22 Jan 2013 16:24:43 -0500 Received: from e9.ny.us.ibm.com ([32.97.182.139]:48070 "EHLO e9.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752181Ab3AVVYl (ORCPT ); Tue, 22 Jan 2013 16:24:41 -0500 Subject: [PATCH 5/5] fix kvm's use of __pa() on percpu areas To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Marcelo Tosatti , Rik van Riel , Dave Hansen From: Dave Hansen Date: Tue, 22 Jan 2013 13:24:35 -0800 References: <20130122212428.8DF70119@kernel.stglabs.ibm.com> In-Reply-To: <20130122212428.8DF70119@kernel.stglabs.ibm.com> Message-Id: <20130122212435.4905663F@kernel.stglabs.ibm.com> X-Content-Scanned: Fidelis XPS MAILER x-cbid: 13012221-7182-0000-0000-000004A19446 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org In short, it is illegal to call __pa() on an address holding a percpu variable. This replaces those __pa() calls with slow_virt_to_phys(). All of the cases in this patch are in boot time (or CPU hotplug time at worst) code, so the slow pagetable walking in slow_virt_to_phys() is not expected to have a performance impact. The times when this actually matters are pretty obscure (certain 32-bit NUMA systems), but it _does_ happen. It is important to keep KVM guests working on these systems because the real hardware is getting harder and harder to find. This bug manifested first by me seeing a plain hang at boot after this message: CPU 0 irqstacks, hard=f3018000 soft=f301a000 or, sometimes, it would actually make it out to the console: [ 0.000000] BUG: unable to handle kernel paging request at ffffffff I eventually traced it down to the KVM async pagefault code. This can be worked around by disabling that code either at compile-time, or on the kernel command-line. The kvm async pagefault code was injecting page faults into the guest which the guest misinterpreted because its "reason" was not being properly sent from the host. The guest passes a physical address of a per-cpu async page fault structure via an MSR to the host. Since __pa() is broken on percpu data, the physical address it sent was basically bogus and the host went scribbling on random data. 
The guest never saw the real reason for the page fault (it was injected by the host), assumed that the kernel had taken a _real_ page fault, and panic()'d. The behavior varied, though, depending on what got corrupted by the bad write. Signed-off-by: Dave Hansen Acked-by: Rik van Riel --- linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- 2 files changed, 7 insertions(+), 6 deletions(-) diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvm.c --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-22 13:17:16.424317475 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c 2013-01-22 13:17:16.432317541 -0800 @@ -289,9 +289,9 @@ static void kvm_register_steal_time(void memset(st, 0, sizeof(*st)); - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED)); + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED)); printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n", - cpu, __pa(st)); + cpu, slow_virt_to_phys(st)); } static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED; @@ -316,7 +316,7 @@ void __cpuinit kvm_guest_cpu_init(void) return; if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) { - u64 pa = __pa(&__get_cpu_var(apf_reason)); + u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason)); #ifdef CONFIG_PREEMPT pa |= KVM_ASYNC_PF_SEND_ALWAYS; @@ -332,7 +332,8 @@ void __cpuinit kvm_guest_cpu_init(void) /* Size alignment is implied but just to make it explicit. 
*/ BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4); __get_cpu_var(kvm_apic_eoi) = 0; - pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED; + pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi)) + | KVM_MSR_ENABLED; wrmsrl(MSR_KVM_PV_EOI_EN, pa); } diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas arch/x86/kernel/kvmclock.c --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 2013-01-22 13:17:16.428317508 -0800 +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c 2013-01-22 13:17:16.432317541 -0800 @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt) int low, high, ret; struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti; - low = (int)__pa(src) | 1; - high = ((u64)__pa(src) >> 32); + low = (int)slow_virt_to_phys(src) | 1; + high = ((u64)slow_virt_to_phys(src) >> 32); ret = native_write_msr_safe(msr_kvm_system_time, low, high); printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n", cpu, high, low, txt); _ From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756401Ab3AWAXG (ORCPT ); Tue, 22 Jan 2013 19:23:06 -0500 Received: from mx1.redhat.com ([209.132.183.28]:30924 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752607Ab3AWAXC (ORCPT ); Tue, 22 Jan 2013 19:23:02 -0500 Date: Tue, 22 Jan 2013 22:08:20 -0200 From: Marcelo Tosatti To: Dave Hansen Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Gleb Natapov , "H. 
Peter Anvin" , x86@kernel.org, Rik van Riel Subject: Re: [PATCH 5/5] fix kvm's use of __pa() on percpu areas Message-ID: <20130123000820.GA27204@amt.cnet> References: <20130122212428.8DF70119@kernel.stglabs.ibm.com> <20130122212435.4905663F@kernel.stglabs.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130122212435.4905663F@kernel.stglabs.ibm.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jan 22, 2013 at 01:24:35PM -0800, Dave Hansen wrote: > > In short, it is illegal to call __pa() on an address holding > a percpu variable. This replaces those __pa() calls with > slow_virt_to_phys(). All of the cases in this patch are > in boot time (or CPU hotplug time at worst) code, so the > slow pagetable walking in slow_virt_to_phys() is not expected > to have a performance impact. > > The times when this actually matters are pretty obscure > (certain 32-bit NUMA systems), but it _does_ happen. It is > important to keep KVM guests working on these systems because > the real hardware is getting harder and harder to find. > > This bug manifested first by me seeing a plain hang at boot > after this message: > > CPU 0 irqstacks, hard=f3018000 soft=f301a000 > > or, sometimes, it would actually make it out to the console: > > [ 0.000000] BUG: unable to handle kernel paging request at ffffffff > > I eventually traced it down to the KVM async pagefault code. > This can be worked around by disabling that code either at > compile-time, or on the kernel command-line. > > The kvm async pagefault code was injecting page faults in > to the guest which the guest misinterpreted because its > "reason" was not being properly sent from the host. > > The guest passes a physical address of an per-cpu async page > fault structure via an MSR to the host. 
Since __pa() is > broken on percpu data, the physical address it sent was > basically bogus and the host went scribbling on random data. > The guest never saw the real reason for the page fault (it > was injected by the host), assumed that the kernel had taken > a _real_ page fault, and panic()'d. The behavior varied, > though, depending on what got corrupted by the bad write. > > Signed-off-by: Dave Hansen > Acked-by: Rik van Riel > --- > > linux-2.6.git-dave/arch/x86/kernel/kvm.c | 9 +++++---- > linux-2.6.git-dave/arch/x86/kernel/kvmclock.c | 4 ++-- > 2 files changed, 7 insertions(+), 6 deletions(-) Reviewed-by: Marcelo Tosatti
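
The MSR values built in this patch combine a physical address with flag bits in its low bits, and kvm_register_clock() splits the result into two 32-bit halves. A minimal user-space sketch of that packing follows. The names msr_halves, pack_msr(), unpack_addr() and ENABLE_BIT are hypothetical, and the sketch assumes the structure is at least 4-byte aligned so that the low bits of its address are free to carry flags.

```c
#include <stdint.h>

#define ENABLE_BIT 1ULL	/* hypothetical flag stored in bit 0 of the value */

/* The two 32-bit halves handed to a wrmsr-style interface. */
struct msr_halves {
	uint32_t low;
	uint32_t high;
};

/* OR the enable flag into the address, then split it into halves. */
static struct msr_halves pack_msr(uint64_t phys_addr)
{
	struct msr_halves h;
	uint64_t val = phys_addr | ENABLE_BIT;

	h.low = (uint32_t)val;
	h.high = (uint32_t)(val >> 32);
	return h;
}

/* What the receiving side would do: rejoin the halves and drop the flag. */
static uint64_t unpack_addr(struct msr_halves h)
{
	uint64_t val = ((uint64_t)h.high << 32) | h.low;

	return val & ~ENABLE_BIT;
}
```

The receiver can only recover whatever address was packed, so a wrong physical address on the sending side goes through unnoticed. That is consistent with the report above, where the host ended up writing to unrelated memory.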