* another pagetable initialization crash on xen
@ 2011-03-28 17:26 Stefano Stabellini
2011-03-28 19:04 ` Stefano Stabellini
0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2011-03-28 17:26 UTC (permalink / raw)
To: Yinghai Lu; +Cc: linux-kernel
Hi Yinghai,
unfortunately I found another pagetable initialization bug on xen
affecting linux 2.6.39-rc0.
The problem is that on xen we need to make sure that all the pagetable pages
are mapped read-only; in fact, in xen_set_pte we have this check:
if (pfn >= pgt_buf_start && pfn < pgt_buf_end)
/* make the pte read-only */
however pgt_buf_end is where the kernel pagetable *currently* ends, so
some kernel pagetable pages allocated after pgt_buf_end might be marked
read-write by mistake. A simple way to fix the issue would be to use
pgt_buf_top instead:
if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
/* make the pte read-only */
however after building the kernel pagetable in init_memory_mapping we
only reserve memory between pgt_buf_start and pgt_buf_end:
if (!after_bootmem && pgt_buf_end > pgt_buf_start)
memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
pgt_buf_end << PAGE_SHIFT, "PGTABLE");
so future allocations might use memory between the final value of
pgt_buf_end and pgt_buf_top that has been marked read-only in the xen
specific code, causing a crash.
The only way I could find to fix the crash is also to reserve the memory
region between pgt_buf_start and pgt_buf_top on xen, but that would
require an ugly if (xen_domain()) at the end of init_memory_mapping or
the introduction of a new pvop function to reserve the pagetable memory.
I don't like the idea, but I couldn't find anything better.
Yinghai, do you have any better suggestions?
Thanks,
Stefano
^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: another pagetable initialization crash on xen
  2011-03-28 17:26 another pagetable initialization crash on xen Stefano Stabellini
@ 2011-03-28 19:04 ` Stefano Stabellini
  2011-03-29 17:25   ` Stefano Stabellini
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2011-03-28 19:04 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Yinghai Lu, linux-kernel@vger.kernel.org

On Mon, 28 Mar 2011, Stefano Stabellini wrote:
> Hi Yinghai,
> unfortunately I found another pagetable initialization bug on xen
> affecting linux 2.6.39-rc0.
> The problem is that on xen we need to make sure that all the pagetable pages
> are mapped read-only; in fact, in xen_set_pte we have this check:
>
> 	if (pfn >= pgt_buf_start && pfn < pgt_buf_end)
> 		/* make the pte read-only */
>
> however pgt_buf_end is where the kernel pagetable *currently* ends, so
> some kernel pagetable pages allocated after pgt_buf_end might be marked
> read-write by mistake. A simple way to fix the issue would be to use
> pgt_buf_top instead:
>
> 	if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
> 		/* make the pte read-only */
>
> however after building the kernel pagetable in init_memory_mapping we
> only reserve memory between pgt_buf_start and pgt_buf_end:
>
> 	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
> 		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
> 				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
>
> so future allocations might use memory between the final value of
> pgt_buf_end and pgt_buf_top that has been marked read-only in the xen
> specific code, causing a crash.
> The only way I could find to fix the crash is also to reserve the memory
> region between pgt_buf_start and pgt_buf_top on xen, but that would
> require an ugly if (xen_domain()) at the end of init_memory_mapping or
> the introduction of a new pvop function to reserve the pagetable memory.
> I don't like the idea, but I couldn't find anything better.
> Yinghai, do you have any better suggestions?

Just to make it clearer, this is the not-so-pretty patch I am talking
about:

---

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7db7723..49e7f36 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -299,6 +299,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 
 /* Install a pte for a particular vaddr in kernel space. */
 void set_pte_vaddr(unsigned long vaddr, pte_t pte);
+void native_kernel_pagetable_reserve(void);
 #ifdef CONFIG_X86_32
 extern void native_pagetable_setup_start(pgd_t *base);
 extern void native_pagetable_setup_done(pgd_t *base);
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 643ebf2..214ee51 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -73,6 +73,7 @@ struct x86_init_oem {
  * @pagetable_setup_done: platform specific post paging_init() call
  */
 struct x86_init_paging {
+	void (*kernel_pagetable_reserve)(void);
 	void (*pagetable_setup_start)(pgd_t *base);
 	void (*pagetable_setup_done)(pgd_t *base);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index c11514e..a0e920c 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -62,6 +62,7 @@ struct x86_init_ops x86_init __initdata = {
 	},
 
 	.paging = {
+		.kernel_pagetable_reserve = native_kernel_pagetable_reserve,
 		.pagetable_setup_start = native_pagetable_setup_start,
 		.pagetable_setup_done = native_pagetable_setup_done,
 	},
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 286d289..07a9aa6 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -109,6 +109,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
 	return nr_range;
 }
 
+void __init native_kernel_pagetable_reserve(void)
+{
+	if (pgt_buf_end > pgt_buf_start)
+		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
+				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
+}
+
 /*
  * Setup the direct mapping of the physical memory at PAGE_OFFSET.
  * This runs before bootmem is initialized and gets pages directly from
@@ -272,10 +279,9 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 	__flush_tlb_all();
 
-	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
-		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
-				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
-
+	if (!after_bootmem)
+		x86_init.paging.kernel_pagetable_reserve();
+
 	if (!after_bootmem)
 		early_memtest(start, end);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c82df6c..30b768f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1275,6 +1275,13 @@ static __init void xen_pagetable_setup_start(pgd_t *base)
 {
 }
 
+static __init void xen_kernel_pagetable_reserve(void)
+{
+	if (pgt_buf_end > pgt_buf_start)
+		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
+				pgt_buf_top << PAGE_SHIFT, "PGTABLE");
+}
+
 static void xen_post_allocator_init(void);
 
 static __init void xen_pagetable_setup_done(pgd_t *base)
@@ -2100,6 +2107,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initdata = {
 
 void __init xen_init_mmu_ops(void)
 {
+	x86_init.paging.kernel_pagetable_reserve = xen_kernel_pagetable_reserve;
 	x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
 	x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
 	pv_mmu_ops = xen_mmu_ops;

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen
  2011-03-28 19:04 ` Stefano Stabellini
@ 2011-03-29 17:25   ` Stefano Stabellini
  2011-03-29 18:15     ` Yinghai Lu
  2011-03-30 16:05     ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 9+ messages in thread
From: Stefano Stabellini @ 2011-03-29 17:25 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Yinghai Lu, linux-kernel@vger.kernel.org

On Mon, 28 Mar 2011, Stefano Stabellini wrote:
> On Mon, 28 Mar 2011, Stefano Stabellini wrote:
> > Hi Yinghai,
> > unfortunately I found another pagetable initialization bug on xen
> > affecting linux 2.6.39-rc0.
> > The problem is that on xen we need to make sure that all the pagetable pages
> > are mapped read-only; in fact, in xen_set_pte we have this check:
> >
> > 	if (pfn >= pgt_buf_start && pfn < pgt_buf_end)
> > 		/* make the pte read-only */
> >
> > however pgt_buf_end is where the kernel pagetable *currently* ends, so
> > some kernel pagetable pages allocated after pgt_buf_end might be marked
> > read-write by mistake. A simple way to fix the issue would be to use
> > pgt_buf_top instead:
> >
> > 	if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
> > 		/* make the pte read-only */
> >
> > however after building the kernel pagetable in init_memory_mapping we
> > only reserve memory between pgt_buf_start and pgt_buf_end:
> >
> > 	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
> > 		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
> > 				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
> >
> > so future allocations might use memory between the final value of
> > pgt_buf_end and pgt_buf_top that has been marked read-only in the xen
> > specific code, causing a crash.
> > The only way I could find to fix the crash is also to reserve the memory
> > region between pgt_buf_start and pgt_buf_top on xen, but that would
> > require an ugly if (xen_domain()) at the end of init_memory_mapping or
> > the introduction of a new pvop function to reserve the pagetable memory.
> > I don't like the idea, but I couldn't find anything better.
> > Yinghai, do you have any better suggestions?
>
> Just to make it clearer, this is the not-so-pretty patch I am talking
> about:

I kept working on refactoring this code, and I think I was able to come
up with a slightly better patch.
Do you think this is acceptable?

commit c576ff205060e92b36224e80a989492b245e806d
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Tue Mar 29 10:59:02 2011 +0000

    x86: introduce kernel_pagetable hooks

    Introduce two new pvop hooks:

    - kernel_pagetable_alloc is used to allocate the initial kernel
      pagetable pages;

    - kernel_pagetable_reserve is used to reserve the kernel pagetable
      memory range;

    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 7db7723..a3ab967 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -299,6 +299,12 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 /* Install a pte for a particular vaddr in kernel space. */
 void set_pte_vaddr(unsigned long vaddr, pte_t pte);
 
+struct map_range;
+void native_kernel_pagetable_alloc(struct map_range *mr,
+			unsigned long *pgt_start,
+			unsigned long *pgt_top);
+void native_kernel_pagetable_reserve(unsigned long pgt_start,
+			unsigned long pgt_end);
 #ifdef CONFIG_X86_32
 extern void native_pagetable_setup_start(pgd_t *base);
 extern void native_pagetable_setup_done(pgd_t *base);
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 643ebf2..984ce6f 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -69,10 +69,26 @@ struct x86_init_oem {
 
 /**
  * struct x86_init_paging - platform specific paging functions
+ * @kernel_pagetable_alloc: platform specific kernel pagetable
+ *				allocation function
+ * @kernel_pagetable_reserve: platform specific kernel pagetable
+ *				reserve memory function
  * @pagetable_setup_start: platform specific pre paging_init() call
  * @pagetable_setup_done: platform specific post paging_init() call
  */
 struct x86_init_paging {
+	/* @mr: (in) memory range to map
+	 * @pgt_start - @pgt_top: (out) memory allocated for the kernel
+	 * pagetable pages
+	 */
+	void (*kernel_pagetable_alloc)(struct map_range *mr,
+			unsigned long *pgt_start,
+			unsigned long *pgt_top);
+	/* @pgt_start - @pgt_end: memory region used for the kernel
+	 * pagetable pages
+	 */
+	void (*kernel_pagetable_reserve)(unsigned long pgt_start,
+			unsigned long pgt_end);
 	void (*pagetable_setup_start)(pgd_t *base);
 	void (*pagetable_setup_done)(pgd_t *base);
 };
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index c11514e..1048a3d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -62,6 +62,8 @@ struct x86_init_ops x86_init __initdata = {
 	},
 
 	.paging = {
+		.kernel_pagetable_alloc = native_kernel_pagetable_alloc,
+		.kernel_pagetable_reserve = native_kernel_pagetable_reserve,
 		.pagetable_setup_start = native_pagetable_setup_start,
 		.pagetable_setup_done = native_pagetable_setup_done,
 	},
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 286d289..67d0792 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -30,11 +30,21 @@ int direct_gbpages
 #endif
 ;
 
-static void __init find_early_table_space(unsigned long end, int use_pse,
-					  int use_gbpages)
+struct map_range {
+	unsigned long start;
+	unsigned long end;
+	unsigned page_size_mask;
+};
+
+void __init native_kernel_pagetable_alloc(struct map_range *mr,
+		unsigned long *pgt_start,
+		unsigned long *pgt_top)
 {
-	unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+	unsigned long puds, pmds, ptes, tables, start = 0,
+		      end = mr->end, good_end = end;
 	phys_addr_t base;
+	int use_pse = mr->page_size_mask & PG_LEVEL_2M;
+	int use_gbpages = mr->page_size_mask & PG_LEVEL_1G;
 
 	puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
 	tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
@@ -73,19 +83,20 @@ static void __init find_early_table_space(unsigned long end, int use_pse,
 	if (base == MEMBLOCK_ERROR)
 		panic("Cannot find space for the kernel page tables");
 
-	pgt_buf_start = base >> PAGE_SHIFT;
-	pgt_buf_end = pgt_buf_start;
-	pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+	*pgt_start = base >> PAGE_SHIFT;
+	*pgt_top = *pgt_start + (tables >> PAGE_SHIFT);
 
 	printk(KERN_DEBUG "kernel direct mapping tables up to %lx @ %lx-%lx\n",
-		end, pgt_buf_start << PAGE_SHIFT, pgt_buf_top << PAGE_SHIFT);
+		end, *pgt_start << PAGE_SHIFT, *pgt_top << PAGE_SHIFT);
 }
 
-struct map_range {
-	unsigned long start;
-	unsigned long end;
-	unsigned page_size_mask;
-};
+void __init native_kernel_pagetable_reserve(unsigned long pgt_start,
+		unsigned long pgt_end)
+{
+	if (pgt_end > pgt_start)
+		memblock_x86_reserve_range(pgt_start << PAGE_SHIFT,
+				pgt_end << PAGE_SHIFT, "PGTABLE");
+}
 
 #ifdef CONFIG_X86_32
 #define NR_RANGE_MR 3
@@ -257,9 +268,15 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 	 * memory mapped. Unfortunately this is done currently before the
 	 * nodes are discovered.
 	 */
-	if (!after_bootmem)
-		find_early_table_space(end, use_pse, use_gbpages);
-
+	if (!after_bootmem) {
+		struct map_range map;
+		map.start = start;
+		map.end = end;
+		map.page_size_mask = page_size_mask;
+		x86_init.paging.kernel_pagetable_alloc(&map, &pgt_buf_start, &pgt_buf_top);
+		pgt_buf_end = pgt_buf_start;
+	}
+
 	for (i = 0; i < nr_range; i++)
 		ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
 						   mr[i].page_size_mask);
@@ -272,9 +289,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 	__flush_tlb_all();
 
-	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
-		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
-				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
+	if (!after_bootmem)
+		x86_init.paging.kernel_pagetable_reserve(pgt_buf_start, pgt_buf_end);
 
 	if (!after_bootmem)
 		early_memtest(start, end);

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen 2011-03-29 17:25 ` Stefano Stabellini @ 2011-03-29 18:15 ` Yinghai Lu 2011-03-30 16:26 ` Stefano Stabellini 2011-03-30 16:05 ` Konrad Rzeszutek Wilk 1 sibling, 1 reply; 9+ messages in thread From: Yinghai Lu @ 2011-03-29 18:15 UTC (permalink / raw) To: Stefano Stabellini; +Cc: linux-kernel@vger.kernel.org On 03/29/2011 10:25 AM, Stefano Stabellini wrote: > I kept working on refactoring this code, and I think I was able to come > up with a slightly better patch. > Do you think this is acceptable? > > > > commit c576ff205060e92b36224e80a989492b245e806d > Author: Stefano Stabellini<stefano.stabellini@eu.citrix.com> > Date: Tue Mar 29 10:59:02 2011 +0000 > > x86: introduce kernel_pagetable hooks > > Introduce two new pvop hooks: > > - kernel_pagetable_alloc is used to allocate the initial kernel > pagetable pages; > > - kernel_pagetable_reserve is used to reserved the kernel pagetable > memory range; > > Signed-off-by: Stefano Stabellini<stefano.stabellini@eu.citrix.com> > > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 7db7723..a3ab967 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -299,6 +299,12 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, > /* Install a pte for a particular vaddr in kernel space. 
*/ > void set_pte_vaddr(unsigned long vaddr, pte_t pte); > > +struct map_range; > +void native_kernel_pagetable_alloc(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top); > +void native_kernel_pagetable_reserve(unsigned long pgt_start, > + unsigned long pgt_end); > #ifdef CONFIG_X86_32 > extern void native_pagetable_setup_start(pgd_t *base); > extern void native_pagetable_setup_done(pgd_t *base); > diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h > index 643ebf2..984ce6f 100644 > --- a/arch/x86/include/asm/x86_init.h > +++ b/arch/x86/include/asm/x86_init.h > @@ -69,10 +69,26 @@ struct x86_init_oem { > > /** > * struct x86_init_paging - platform specific paging functions > + * @kernel_pagetable_alloc: platform specific kernel pagetable > + * allocation function > + * @kernel_pagetable_reserve: platform specific kernel pagetable > + * reserve memory function > * @pagetable_setup_start: platform specific pre paging_init() call > * @pagetable_setup_done: platform specific post paging_init() call > */ > struct x86_init_paging { > + /* @mr: (in) memory range to map > + * @pgt_start - @pgt_top: (out) memory allocated for the kernel > + * pagetable pages > + */ > + void (*kernel_pagetable_alloc)(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top); > + /* @pgt_start - @pgt_end: memory region used for the kernel > + * pagetable pages > + */ > + void (*kernel_pagetable_reserve)(unsigned long pgt_start, > + unsigned long pgt_end); > void (*pagetable_setup_start)(pgd_t *base); > void (*pagetable_setup_done)(pgd_t *base); > }; > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c > index c11514e..1048a3d 100644 > --- a/arch/x86/kernel/x86_init.c > +++ b/arch/x86/kernel/x86_init.c > @@ -62,6 +62,8 @@ struct x86_init_ops x86_init __initdata = { > }, > > .paging = { > + .kernel_pagetable_alloc = native_kernel_pagetable_alloc, > + .kernel_pagetable_reserve = 
native_kernel_pagetable_reserve, > .pagetable_setup_start = native_pagetable_setup_start, > .pagetable_setup_done = native_pagetable_setup_done, > }, > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c > index 286d289..67d0792 100644 > --- a/arch/x86/mm/init.c > +++ b/arch/x86/mm/init.c > @@ -30,11 +30,21 @@ int direct_gbpages > #endif > ; > > -static void __init find_early_table_space(unsigned long end, int use_pse, > - int use_gbpages) > +struct map_range { > + unsigned long start; > + unsigned long end; > + unsigned page_size_mask; > +}; > + > +void __init native_kernel_pagetable_alloc(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top) > { > - unsigned long puds, pmds, ptes, tables, start = 0, good_end = end; > + unsigned long puds, pmds, ptes, tables, start = 0, > + end = mr->end, good_end = end; > phys_addr_t base; > + int use_pse = mr->page_size_mask& PG_LEVEL_2M; > + int use_gbpages = mr->page_size_mask& PG_LEVEL_1G; > > puds = (end + PUD_SIZE - 1)>> PUD_SHIFT; > tables = roundup(puds * sizeof(pud_t), PAGE_SIZE); > @@ -73,19 +83,20 @@ static void __init find_early_table_space(unsigned long end, int use_pse, > if (base == MEMBLOCK_ERROR) > panic("Cannot find space for the kernel page tables"); > > - pgt_buf_start = base>> PAGE_SHIFT; > - pgt_buf_end = pgt_buf_start; > - pgt_buf_top = pgt_buf_start + (tables>> PAGE_SHIFT); > + *pgt_start = base>> PAGE_SHIFT; > + *pgt_top = *pgt_start + (tables>> PAGE_SHIFT); > > printk(KERN_DEBUG "kernel direct mapping tables up to %lx @ %lx-%lx\n", > - end, pgt_buf_start<< PAGE_SHIFT, pgt_buf_top<< PAGE_SHIFT); > + end, *pgt_start<< PAGE_SHIFT, *pgt_top<< PAGE_SHIFT); > } > > -struct map_range { > - unsigned long start; > - unsigned long end; > - unsigned page_size_mask; > -}; > +void __init native_kernel_pagetable_reserve(unsigned long pgt_start, > + unsigned long pgt_end) > +{ > + if (pgt_start> pgt_end) > + memblock_x86_reserve_range(pgt_start<< PAGE_SHIFT, > + pgt_end<< PAGE_SHIFT, 
"PGTABLE"); > +} > > #ifdef CONFIG_X86_32 > #define NR_RANGE_MR 3 > @@ -257,9 +268,15 @@ unsigned long __init_refok init_memory_mapping(unsigned long start, > * memory mapped. Unfortunately this is done currently before the > * nodes are discovered. > */ > - if (!after_bootmem) > - find_early_table_space(end, use_pse, use_gbpages); > - > + if (!after_bootmem) { > + struct map_range map; > + map.start = start; > + map.end = end; > + map.page_size_mask = page_size_mask; > + x86_init.paging.kernel_pagetable_alloc(&map,&pgt_buf_start,&pgt_buf_top); > + pgt_buf_end = pgt_buf_start; > + } > + > for (i = 0; i< nr_range; i++) > ret = kernel_physical_mapping_init(mr[i].start, mr[i].end, > mr[i].page_size_mask); > @@ -272,9 +289,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start, > > __flush_tlb_all(); > > - if (!after_bootmem&& pgt_buf_end> pgt_buf_start) > - memblock_x86_reserve_range(pgt_buf_start<< PAGE_SHIFT, > - pgt_buf_end<< PAGE_SHIFT, "PGTABLE"); > + if (!after_bootmem) > + x86_init.paging.kernel_pagetable_reserve(pgt_buf_start, pgt_buf_end); > > if (!after_bootmem) > early_memtest(start, end); 1. x86_init.paging is the right place for those two functions? 2. native_kernel_pagetable_alloc... should be function that do FIND + RESERVE. or We just have one function : free_not_used with (pgt_buf_end, pgt_buf_top) ? Thanks Yinghai ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen
  2011-03-29 18:15 ` Yinghai Lu
@ 2011-03-30 16:26   ` Stefano Stabellini
  2011-03-30 16:59     ` Stefano Stabellini
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2011-03-30 16:26 UTC (permalink / raw)
To: Yinghai Lu; +Cc: Stefano Stabellini, linux-kernel@vger.kernel.org

On Tue, 29 Mar 2011, Yinghai Lu wrote:
> 1. x86_init.paging is the right place for those two functions?
> 2. native_kernel_pagetable_alloc... should be function that do FIND + RESERVE.
>    or We just have one function : free_not_used with (pgt_buf_end, pgt_buf_top) ?

I like this suggestion very much!
I just added a single hook called x86_init.mapping.pagetable_free that
is a noop on native but on Xen sets the memory from RO to RW.
How does it look now?

I have another unrelated question: init_memory_mapping is called on the
range 0 - max_low_pfn, but that range usually includes a reserved region
below the first MB. On one machine of mine the IOAPIC mmio region falls
in that memory range, therefore we are mapping the IOAPIC mmio region in
init_memory_mapping without going through the fixmap as we should.
This is causing problems on Xen, but I guess it could theoretically
cause problems on other platforms as well. Should we exclude reserved
memory regions below the first MB from the initial memory mappings?

---

commit fca74103af73ef871174fcd627c2317991f28911
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Wed Mar 30 16:17:33 2011 +0000

    x86,xen: introduce x86_init.mapping.pagetable_free

    Introduce a new x86_init hook called pagetable_free that is a noop on
    native but on xen makes sure that the spare memory previously allocated
    for kernel pagetable pages can be used for other purposes (set the
    corresponding pagetable entries from RO to RW).

    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 643ebf2..38df202 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -68,6 +68,14 @@ struct x86_init_oem {
 };
 
 /**
+ * struct x86_init_mapping - platform specific initial kernel mappings setup
+ * @pagetable_free: free a range of unused kernel pagetable memory
+ */
+struct x86_init_mapping {
+	void (*pagetable_free)(unsigned long start_pfn, unsigned long end_pfn);
+};
+
+/**
  * struct x86_init_paging - platform specific paging functions
  * @pagetable_setup_start: platform specific pre paging_init() call
  * @pagetable_setup_done: platform specific post paging_init() call
@@ -123,6 +131,7 @@ struct x86_init_ops {
 	struct x86_init_mpparse mpparse;
 	struct x86_init_irqs irqs;
 	struct x86_init_oem oem;
+	struct x86_init_mapping mapping;
 	struct x86_init_paging paging;
 	struct x86_init_timers timers;
 	struct x86_init_iommu iommu;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index c11514e..07e4fdc 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -24,6 +24,7 @@
 void __cpuinit x86_init_noop(void) { }
 void __init x86_init_uint_noop(unsigned int unused) { }
+void __init x86_init_ul_ul_noop(unsigned long u1, unsigned long u2) { }
 void __init x86_init_pgd_noop(pgd_t *unused) { }
 int __init iommu_init_noop(void) { return 0; }
 void iommu_shutdown_noop(void) { }
@@ -61,6 +62,10 @@ struct x86_init_ops x86_init __initdata = {
 		.banner = default_banner,
 	},
 
+	.mapping = {
+		.pagetable_free = x86_init_ul_ul_noop,
+	},
+
 	.paging = {
 		.pagetable_setup_start = native_pagetable_setup_start,
 		.pagetable_setup_done = native_pagetable_setup_done,
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 286d289..afea949 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -272,10 +272,13 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
 
 	__flush_tlb_all();
 
-	if (!after_bootmem && pgt_buf_end > pgt_buf_start)
+	if (!after_bootmem && pgt_buf_end > pgt_buf_start) {
 		memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT,
 				pgt_buf_end << PAGE_SHIFT, "PGTABLE");
+		x86_init.mapping.pagetable_free(pgt_buf_end, pgt_buf_top);
+	}
+
 	if (!after_bootmem)
 		early_memtest(start, end);
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index c82df6c..14ce8e5 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1275,6 +1275,18 @@ static __init void xen_pagetable_setup_start(pgd_t *base)
 {
 }
 
+static __init void xen_mapping_pagetable_free(unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	unsigned long pfn = start_pfn;
+	printk(KERN_DEBUG "xen: setting RW the range %lx - %lx\n",
+			start_pfn << PAGE_SHIFT, end_pfn << PAGE_SHIFT);
+	while (pfn < end_pfn) {
+		make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+		pfn++;
+	}
+}
+
 static void xen_post_allocator_init(void);
 
 static __init void xen_pagetable_setup_done(pgd_t *base)
@@ -2100,6 +2112,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initdata = {
 
 void __init xen_init_mmu_ops(void)
 {
+	x86_init.mapping.pagetable_free = xen_mapping_pagetable_free;
 	x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
 	x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
 	pv_mmu_ops = xen_mmu_ops;

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen
  2011-03-30 16:26 ` Stefano Stabellini
@ 2011-03-30 16:59   ` Stefano Stabellini
  2011-03-30 17:58     ` Yinghai Lu
  0 siblings, 1 reply; 9+ messages in thread
From: Stefano Stabellini @ 2011-03-30 16:59 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: Yinghai Lu, linux-kernel@vger.kernel.org

On Wed, 30 Mar 2011, Stefano Stabellini wrote:
> I have another unrelated question: init_memory_mapping is called on the
> range 0 - max_low_pfn, but that range usually includes a reserved region
> below the first MB. On one machine of mine the IOAPIC mmio region falls
> in that memory range, therefore we are mapping the IOAPIC mmio region in
> init_memory_mapping without going through the fixmap as we should.
> This is causing problems on Xen, but I guess it could theoretically
> cause problems on other platforms as well. Should we exclude reserved
> memory regions below the first MB from the initial memory mappings?

Sorry, I mixed up frame numbers with physical addresses: the IOAPIC
mmio region is actually at 0xfec00000, where it should be, but it gets
mapped during the initial memory mapping (range 0 - 0x100000000).
Is that supposed to happen? Shouldn't it go through the fixmap?

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen
  2011-03-30 16:59 ` Stefano Stabellini
@ 2011-03-30 17:58   ` Yinghai Lu
  2011-04-05 13:47     ` Stefano Stabellini
  0 siblings, 1 reply; 9+ messages in thread
From: Yinghai Lu @ 2011-03-30 17:58 UTC (permalink / raw)
To: Stefano Stabellini; +Cc: linux-kernel@vger.kernel.org

On 03/30/2011 09:59 AM, Stefano Stabellini wrote:
> On Wed, 30 Mar 2011, Stefano Stabellini wrote:
>> I have another unrelated question: init_memory_mapping is called on the
>> range 0 - max_low_pfn, but that range usually includes a reserved region
>> below the first MB. On one machine of mine the IOAPIC mmio region falls
>> in that memory range therefore we are mapping the IOAPIC mmio region in
>> init_memory_mapping without going through the fixmap as we should.
>> This is causing problems on Xen, but I guess it could theoretically
>> cause problems on other platforms as well. Should we avoid reserved
>> memory regions below the first MB from the initial memory mappings?
>
> Sorry I mixed up frame numbers with physical addresses, so the IOAPIC
> mmio region is actually at 0xfec00000 where it should be but it gets
> mapped during the initial memory mapping (range 0 - 0x100000000).
> Is that supposed to happen? Shouldn't it go through the fixmap?

io apic addr is going through fixmap.

initial_memory_mapping will map [0, max_mem_under_4g);
max_mem_under_4g comes from searching the E820 table.

Later it will map [4g, max_mem_above_4g).

We do not map the mmio gap between [max_mem_under_4g, 4g).

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen
  2011-03-30 17:58 ` Yinghai Lu
@ 2011-04-05 13:47   ` Stefano Stabellini
  0 siblings, 0 replies; 9+ messages in thread
From: Stefano Stabellini @ 2011-04-05 13:47 UTC (permalink / raw)
To: Yinghai Lu; +Cc: Stefano Stabellini, linux-kernel@vger.kernel.org

On Wed, 30 Mar 2011, Yinghai Lu wrote:
> On 03/30/2011 09:59 AM, Stefano Stabellini wrote:
> > On Wed, 30 Mar 2011, Stefano Stabellini wrote:
> >> I have another unrelated question: init_memory_mapping is called on the
> >> range 0 - max_low_pfn, but that range usually includes a reserved region
> >> below the first MB. On one machine of mine the IOAPIC mmio region falls
> >> in that memory range therefore we are mapping the IOAPIC mmio region in
> >> init_memory_mapping without going through the fixmap as we should.
> >> This is causing problems on Xen, but I guess it could theoretically
> >> cause problems on other platforms as well. Should we avoid reserved
> >> memory regions below the first MB from the initial memory mappings?
> >
> > Sorry I mixed up frame numbers with physical addresses, so the IOAPIC
> > mmio region is actually at 0xfec00000 where it should be but it gets
> > mapped during the initial memory mapping (range 0 - 0x100000000).
> > Is that supposed to happen? Shouldn't it go through the fixmap?
>
> io apic addr is going through fixmap.
>
> initial_memory_mapping will map to [0, max_mem_under_4g).
> max_mem_under_4g is from E820 table searching.
>
> later it will map [4g, max_mem_above_4g).
>
> We do not map mmio gap between [max_mem_under_4g,4g)

Thanks for the explanation, I figured out what the problem is: on xen we
add an extra memory region at the end of the e820, and on this
particular machine this extra memory region would start below 4g and
cross over the 4g boundary: [0xfee01000-0x192655000).
Unfortunately e820_end_of_low_ram_pfn does not expect an e820 layout
like that, so it returns 4g; therefore init_memory_mapping will map
[0 - 0x100000000), that is a memory range that includes some reserved
memory regions.
I will submit a fix that makes sure that the extra memory region is
always added above 4g and not below.

^ permalink raw reply	[flat|nested] 9+ messages in thread
* Re: another pagetable initialization crash on xen 2011-03-29 17:25 ` Stefano Stabellini 2011-03-29 18:15 ` Yinghai Lu @ 2011-03-30 16:05 ` Konrad Rzeszutek Wilk 1 sibling, 0 replies; 9+ messages in thread From: Konrad Rzeszutek Wilk @ 2011-03-30 16:05 UTC (permalink / raw) To: Stefano Stabellini; +Cc: Yinghai Lu, linux-kernel@vger.kernel.org On Tue, Mar 29, 2011 at 06:25:56PM +0100, Stefano Stabellini wrote: > On Mon, 28 Mar 2011, Stefano Stabellini wrote: > > On Mon, 28 Mar 2011, Stefano Stabellini wrote: > > > Hi Yinghai, > > > unfortunately I found another pagetable initialization bug on xen > > > affecting linux 2.6.39-rc0. > > > The problem is that on xen we need to make sure that all the pagetable pages > > > are mapped read-only, in fact in xen_set_pte we have this check: > > > > > > if (pfn >= pgt_buf_start && pfn < pgt_buf_end) > > > /* make the pte read-only */ > > > > > > however pgt_buf_end is where the kernel pagetable *currently* ends, so > > > some kernel pagetable pages allocated after pgt_buf_end might be marked > > > read-write by mistake. A simple way to fix the issue would be to use > > > pgt_buf_top instead: > > > > > > if (pfn >= pgt_buf_start && pfn < pgt_buf_top) > > > /* make the pte read-only */ > > > > > > however after building the kernel pagetable in init_memory_mapping we > > > only reserve memory between pgt_buf_start and pgt_buf_end: > > > > > > if (!after_bootmem && pgt_buf_end > pgt_buf_start) > > > memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT, > > > pgt_buf_end << PAGE_SHIFT, "PGTABLE"); > > > > > > so future allocations might use memory between the final value of > > > pgt_buf_end and pgt_buf_top that has been marked read-only in the xen > > > specific code, causing a crash.
> > > The only way I could find to fix the crash is to reserve also the memory > > > region between pgt_buf_start and pgt_buf_top on xen, but that would > > > require an ugly if(xen_domain()) at the end of init_memory_mapping or > > > the introduction of a new pvop function to reserve the pagetable memory. > > > I don't like the idea, but I couldn't find anything better. > > > Yinghai, do you have any better suggestions? > > > > Just to make it clearer, this is the not-so-pretty patch I am talking > > about: > > > I kept working on refactoring this code, and I think I was able to come > up with a slightly better patch. > Do you think this is acceptable? > > > > commit c576ff205060e92b36224e80a989492b245e806d > Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > Date: Tue Mar 29 10:59:02 2011 +0000 > > x86: introduce kernel_pagetable hooks > > Introduce two new pvop hooks: > > - kernel_pagetable_alloc is used to allocate the initial kernel > pagetable pages; > > - kernel_pagetable_reserve is used to reserve the kernel pagetable > memory range; > > Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> > > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 7db7723..a3ab967 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -299,6 +299,12 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn, > /* Install a pte for a particular vaddr in kernel space.
*/ > void set_pte_vaddr(unsigned long vaddr, pte_t pte); > > +struct map_range; > +void native_kernel_pagetable_alloc(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top); > +void native_kernel_pagetable_reserve(unsigned long pgt_start, > + unsigned long pgt_end); > #ifdef CONFIG_X86_32 > extern void native_pagetable_setup_start(pgd_t *base); > extern void native_pagetable_setup_done(pgd_t *base); > diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h > index 643ebf2..984ce6f 100644 > --- a/arch/x86/include/asm/x86_init.h > +++ b/arch/x86/include/asm/x86_init.h > @@ -69,10 +69,26 @@ struct x86_init_oem { > > /** > * struct x86_init_paging - platform specific paging functions > + * @kernel_pagetable_alloc: platform specific kernel pagetable > + * allocation function > + * @kernel_pagetable_reserve: platform specific kernel pagetable > + * reserve memory function > * @pagetable_setup_start: platform specific pre paging_init() call > * @pagetable_setup_done: platform specific post paging_init() call > */ > struct x86_init_paging { > + /* @mr: (in) memory range to map > + * @pgt_start - @pgt_top: (out) memory allocated for the kernel > + * pagetable pages > + */ > + void (*kernel_pagetable_alloc)(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top); > + /* @pgt_start - @pgt_end: memory region used for the kernel > + * pagetable pages > + */ > + void (*kernel_pagetable_reserve)(unsigned long pgt_start, > + unsigned long pgt_end); > void (*pagetable_setup_start)(pgd_t *base); > void (*pagetable_setup_done)(pgd_t *base); > }; > diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c > index c11514e..1048a3d 100644 > --- a/arch/x86/kernel/x86_init.c > +++ b/arch/x86/kernel/x86_init.c > @@ -62,6 +62,8 @@ struct x86_init_ops x86_init __initdata = { > }, > > .paging = { > + .kernel_pagetable_alloc = native_kernel_pagetable_alloc, > + .kernel_pagetable_reserve = 
native_kernel_pagetable_reserve, > .pagetable_setup_start = native_pagetable_setup_start, > .pagetable_setup_done = native_pagetable_setup_done, > }, > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c > index 286d289..67d0792 100644 > --- a/arch/x86/mm/init.c > +++ b/arch/x86/mm/init.c > @@ -30,11 +30,21 @@ int direct_gbpages > #endif > ; > > -static void __init find_early_table_space(unsigned long end, int use_pse, > - int use_gbpages) > +struct map_range { > + unsigned long start; > + unsigned long end; > + unsigned page_size_mask; > +}; > + > +void __init native_kernel_pagetable_alloc(struct map_range *mr, > + unsigned long *pgt_start, > + unsigned long *pgt_top) > { > - unsigned long puds, pmds, ptes, tables, start = 0, good_end = end; > + unsigned long puds, pmds, ptes, tables, start = 0, > + end = mr->end, good_end = end; end = good_end = mr->end; > phys_addr_t base; > + int use_pse = mr->page_size_mask & PG_LEVEL_2M; > + int use_gbpages = mr->page_size_mask & PG_LEVEL_1G; bool. 
> > puds = (end + PUD_SIZE - 1) >> PUD_SHIFT; > tables = roundup(puds * sizeof(pud_t), PAGE_SIZE); > @@ -73,19 +83,20 @@ static void __init find_early_table_space(unsigned long end, int use_pse, > if (base == MEMBLOCK_ERROR) > panic("Cannot find space for the kernel page tables"); > > - pgt_buf_start = base >> PAGE_SHIFT; > - pgt_buf_end = pgt_buf_start; > - pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT); > + *pgt_start = base >> PAGE_SHIFT; > + *pgt_top = *pgt_start + (tables >> PAGE_SHIFT); > > printk(KERN_DEBUG "kernel direct mapping tables up to %lx @ %lx-%lx\n", > - end, pgt_buf_start << PAGE_SHIFT, pgt_buf_top << PAGE_SHIFT); > + end, *pgt_start << PAGE_SHIFT, *pgt_top << PAGE_SHIFT); > } > > -struct map_range { > - unsigned long start; > - unsigned long end; > - unsigned page_size_mask; > -}; > +void __init native_kernel_pagetable_reserve(unsigned long pgt_start, > + unsigned long pgt_end) > +{ > + if (pgt_end > pgt_start) > + memblock_x86_reserve_range(pgt_start << PAGE_SHIFT, > + pgt_end << PAGE_SHIFT, "PGTABLE"); > +} > > #ifdef CONFIG_X86_32 > #define NR_RANGE_MR 3 > @@ -257,9 +268,15 @@ unsigned long __init_refok init_memory_mapping(unsigned long start, > * memory mapped. Unfortunately this is done currently before the > * nodes are discovered.
> */ > - if (!after_bootmem) > - find_early_table_space(end, use_pse, use_gbpages); > - > + if (!after_bootmem) { > + struct map_range map; > + map.start = start; > + map.end = end; > + map.page_size_mask = page_size_mask; > + x86_init.paging.kernel_pagetable_alloc(&map, &pgt_buf_start, &pgt_buf_top); > + pgt_buf_end = pgt_buf_start; > + } > + > for (i = 0; i < nr_range; i++) > ret = kernel_physical_mapping_init(mr[i].start, mr[i].end, > mr[i].page_size_mask); > @@ -272,9 +289,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start, > > __flush_tlb_all(); > > - if (!after_bootmem && pgt_buf_end > pgt_buf_start) > - memblock_x86_reserve_range(pgt_buf_start << PAGE_SHIFT, > - pgt_buf_end << PAGE_SHIFT, "PGTABLE"); > + if (!after_bootmem) > + x86_init.paging.kernel_pagetable_reserve(pgt_buf_start, pgt_buf_end); > > if (!after_bootmem) > early_memtest(start, end); > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/
end of thread, other threads:[~2011-04-05 13:46 UTC | newest] Thread overview: 9+ messages -- 2011-03-28 17:26 another pagetable initialization crash on xen Stefano Stabellini 2011-03-28 19:04 ` Stefano Stabellini 2011-03-29 17:25 ` Stefano Stabellini 2011-03-29 18:15 ` Yinghai Lu 2011-03-30 16:26 ` Stefano Stabellini 2011-03-30 16:59 ` Stefano Stabellini 2011-03-30 17:58 ` Yinghai Lu 2011-04-05 13:47 ` Stefano Stabellini 2011-03-30 16:05 ` Konrad Rzeszutek Wilk