All of lore.kernel.org
 help / color / mirror / Atom feed
From: David Vrabel <david.vrabel@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>,
	Stefan Bader <stefan.bader@canonical.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Kees Cook <keescook@chromium.org>,
	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: [Xen-devel] [PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
Date: Tue, 2 Sep 2014 12:01:59 +0100	[thread overview]
Message-ID: <5405A3A7.2050406@citrix.com> (raw)
In-Reply-To: <5404AE0F.1010207@citrix.com>

On 01/09/14 18:34, David Vrabel wrote:
> On 29/08/14 16:17, Stefan Bader wrote:
>>
>> This change might not be the fully correct approach as it basically
>> removes the pre-set page table entry for the fixmap that is compile
>> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the
>> level1 page table is not yet declared in C headers (that might be
>> fixed). But also with the current bug, it was removed, too. Since
>> the Xen mappings for level2_kernel_pgt only covered kernel + initrd
>> and some Xen data this did never reach that far. And still, something
>> does create entries at level2_fixmap_pgt[506..507]. So it should be
>> ok. At least I was able to successfully boot a kernel with 1G kernel
>> image size without any vmalloc whinings.
> [...]
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>>  		/* L3_i[0] -> level2_ident_pgt */
>>  		convert_pfn_mfn(level3_ident_pgt);
>>  		/* L3_k[510] -> level2_kernel_pgt
>> -		 * L3_i[511] -> level2_fixmap_pgt */
>> +		 * L3_k[511] -> level2_fixmap_pgt */
>>  		convert_pfn_mfn(level3_kernel_pgt);
>> +
>> +		/* level2_fixmap_pgt contains a single entry for the
>> +		 * fixmap area at offset 506. The correct way would
>> +		 * be to convert level2_fixmap_pgt to mfn and set the
>> +		 * level1_fixmap_pgt (which is completely empty) to RO,
>> +		 * too. But currently this page table is not declared,
>> +		 * so it would be a bit of voodoo to get its address.
>> +		 * And also the fixmap entry was never set due to using
>> +		 * the wrong l2 when getting Xen's tables. So let's just
>> +		 * just nuke it.
>> +		 * This orphans level1_fixmap_pgt, but that was basically
>> +		 * done before the change as well.
>> +		 */
>> +		memset(level2_fixmap_pgt, 0, 512*sizeof(long));
> 
> level2_fixmap_pgt etc. are defined for the benefit of Xen only so I
> think you should add an extern for level1_fixmap_pgt and fix this up
> properly.
> 
> It might not matter now, but it might in the future...

I found some time to look into this and test it.  Including without
enabling KSLAR, but reproing the vmalloc failure with a large (800 MiB
module).

I've clarified the change description, can you review my edits?

Thanks for investigating and fixing this!

David

8<------------------------------
x86/xen: don't copy junk into kernel page tables

When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying junk into the
page tables (to level2_fixmap_pgt), overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/include/asm/pgtable_64.h |    1 +
 arch/x86/xen/mmu.c                |   27 ++++++++++++---------------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 5be9063..3874693 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
+extern pte_t level1_fixmap_pgt[512];
 extern pgd_t init_level4_pgt[];
 
 #define swapper_pg_dir init_level4_pgt
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8a1201..16fb009 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1866,12 +1866,11 @@ static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
  *
  * We can construct this by grafting the Xen provided pagetable into
  * head_64.S's preconstructed pagetables.  We copy the Xen L2's into
- * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt.  This
- * means that only the kernel has a physical mapping to start with -
- * but that's enough to get __va working.  We need to fill in the rest
- * of the physical mapping once some sort of allocator has been set
- * up.
- * NOTE: for PVH, the page tables are native.
+ * level2_ident_pgt, and level2_kernel_pgt.  This means that only the
+ * kernel has a physical mapping to start with - but that's enough to
+ * get __va working.  We need to fill in the rest of the physical
+ * mapping once some sort of allocator has been set up.  NOTE: for
+ * PVH, the page tables are native.
  */
 void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
@@ -1902,8 +1901,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
 		/* L3_k[510] -> level2_kernel_pgt
-		 * L3_i[511] -> level2_fixmap_pgt */
+		 * L3_k[511] -> level2_fixmap_pgt */
 		convert_pfn_mfn(level3_kernel_pgt);
+
+		/* L3_k[511][506] -> level1_fixmap_pgt */
+		convert_pfn_mfn(level2_fixmap_pgt);
 	}
 	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
@@ -1913,21 +1915,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
 	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][511] have entries that point to the same
+	 * Both L4[272][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
 	copy_page(level2_ident_pgt, l2);
-	/* Graft it onto L4[511][511] */
+	/* Graft it onto L4[511][510] */
 	copy_page(level2_kernel_pgt, l2);
 
-	/* Get [511][510] and graft that in level2_fixmap_pgt */
-	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
-	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	copy_page(level2_fixmap_pgt, l2);
-	/* Note that we don't do anything with level1_fixmap_pgt which
-	 * we don't need. */
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1937,6 +1933,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
+		set_page_prot(level1_fixmap_pgt, PAGE_KERNEL_RO);
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-- 
1.7.10.4


WARNING: multiple messages have this Message-ID (diff)
From: David Vrabel <david.vrabel@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>,
	Stefan Bader <stefan.bader@canonical.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
	Kees Cook <keescook@chromium.org>
Subject: Re: [PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests
Date: Tue, 2 Sep 2014 12:01:59 +0100	[thread overview]
Message-ID: <5405A3A7.2050406@citrix.com> (raw)
In-Reply-To: <5404AE0F.1010207@citrix.com>

On 01/09/14 18:34, David Vrabel wrote:
> On 29/08/14 16:17, Stefan Bader wrote:
>>
>> This change might not be the fully correct approach as it basically
>> removes the pre-set page table entry for the fixmap that is compile
>> time set (level2_fixmap_pgt[506]->level1_fixmap_pgt). For one the
>> level1 page table is not yet declared in C headers (that might be
>> fixed). But also with the current bug, it was removed, too. Since
>> the Xen mappings for level2_kernel_pgt only covered kernel + initrd
>> and some Xen data this did never reach that far. And still, something
>> does create entries at level2_fixmap_pgt[506..507]. So it should be
>> ok. At least I was able to successfully boot a kernel with 1G kernel
>> image size without any vmalloc whinings.
> [...]
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -1902,8 +1902,22 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
>>  		/* L3_i[0] -> level2_ident_pgt */
>>  		convert_pfn_mfn(level3_ident_pgt);
>>  		/* L3_k[510] -> level2_kernel_pgt
>> -		 * L3_i[511] -> level2_fixmap_pgt */
>> +		 * L3_k[511] -> level2_fixmap_pgt */
>>  		convert_pfn_mfn(level3_kernel_pgt);
>> +
>> +		/* level2_fixmap_pgt contains a single entry for the
>> +		 * fixmap area at offset 506. The correct way would
>> +		 * be to convert level2_fixmap_pgt to mfn and set the
>> +		 * level1_fixmap_pgt (which is completely empty) to RO,
>> +		 * too. But currently this page table is not declared,
>> +		 * so it would be a bit of voodoo to get its address.
>> +		 * And also the fixmap entry was never set due to using
>> +		 * the wrong l2 when getting Xen's tables. So let's just
>> +		 * just nuke it.
>> +		 * This orphans level1_fixmap_pgt, but that was basically
>> +		 * done before the change as well.
>> +		 */
>> +		memset(level2_fixmap_pgt, 0, 512*sizeof(long));
> 
> level2_fixmap_pgt etc. are defined for the benefit of Xen only so I
> think you should add an extern for level1_fixmap_pgt and fix this up
> properly.
> 
> It might not matter now, but it might in the future...

I found some time to look into this and test it.  Including without
enabling KSLAR, but reproing the vmalloc failure with a large (800 MiB
module).

I've clarified the change description, can you review my edits?

Thanks for investigating and fixing this!

David

8<------------------------------
x86/xen: don't copy junk into kernel page tables

When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
modules exceeds 512 MiB, then loading modules fails with a warning
(and hence a vmalloc allocation failure) because the PTEs for the
newly-allocated vmalloc address space are not zero.

  WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
           vmap_page_range_noflush+0x2a1/0x360()

This is caused by xen_setup_kernel_pagetables() copying junk into the
page tables (to level2_fixmap_pgt), overwriting many non-present
entries.

Without KASLR, the normal kernel image size only covers the first half
of level2_kernel_pgt and module space starts after that.

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                  [256..511]->module
                          [511]->level2_fixmap_pgt[  0..505]->module

This allows 512 MiB of of module vmalloc space to be used before
having to use the corrupted level2_fixmap_pgt entries.

With KASLR enabled, the kernel image uses the full PUD range of 1G and
module space starts in the level2_fixmap_pgt. So basically:

L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                          [511]->level2_fixmap_pgt[0..505]->module

And now no module vmalloc space can be used without using the corrupt
level2_fixmap_pgt entries.

Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
and setting level1_fixmap_pgt as read-only.

Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 arch/x86/include/asm/pgtable_64.h |    1 +
 arch/x86/xen/mmu.c                |   27 ++++++++++++---------------
 2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 5be9063..3874693 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512];
 extern pmd_t level2_kernel_pgt[512];
 extern pmd_t level2_fixmap_pgt[512];
 extern pmd_t level2_ident_pgt[512];
+extern pte_t level1_fixmap_pgt[512];
 extern pgd_t init_level4_pgt[];
 
 #define swapper_pg_dir init_level4_pgt
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index e8a1201..16fb009 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1866,12 +1866,11 @@ static void __init check_pt_base(unsigned long *pt_base, unsigned long *pt_end,
  *
  * We can construct this by grafting the Xen provided pagetable into
  * head_64.S's preconstructed pagetables.  We copy the Xen L2's into
- * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt.  This
- * means that only the kernel has a physical mapping to start with -
- * but that's enough to get __va working.  We need to fill in the rest
- * of the physical mapping once some sort of allocator has been set
- * up.
- * NOTE: for PVH, the page tables are native.
+ * level2_ident_pgt, and level2_kernel_pgt.  This means that only the
+ * kernel has a physical mapping to start with - but that's enough to
+ * get __va working.  We need to fill in the rest of the physical
+ * mapping once some sort of allocator has been set up.  NOTE: for
+ * PVH, the page tables are native.
  */
 void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 {
@@ -1902,8 +1901,11 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		/* L3_i[0] -> level2_ident_pgt */
 		convert_pfn_mfn(level3_ident_pgt);
 		/* L3_k[510] -> level2_kernel_pgt
-		 * L3_i[511] -> level2_fixmap_pgt */
+		 * L3_k[511] -> level2_fixmap_pgt */
 		convert_pfn_mfn(level3_kernel_pgt);
+
+		/* L3_k[511][506] -> level1_fixmap_pgt */
+		convert_pfn_mfn(level2_fixmap_pgt);
 	}
 	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
 	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
@@ -1913,21 +1915,15 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 	addr[1] = (unsigned long)l3;
 	addr[2] = (unsigned long)l2;
 	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
-	 * Both L4[272][0] and L4[511][511] have entries that point to the same
+	 * Both L4[272][0] and L4[511][510] have entries that point to the same
 	 * L2 (PMD) tables. Meaning that if you modify it in __va space
 	 * it will be also modified in the __ka space! (But if you just
 	 * modify the PMD table to point to other PTE's or none, then you
 	 * are OK - which is what cleanup_highmap does) */
 	copy_page(level2_ident_pgt, l2);
-	/* Graft it onto L4[511][511] */
+	/* Graft it onto L4[511][510] */
 	copy_page(level2_kernel_pgt, l2);
 
-	/* Get [511][510] and graft that in level2_fixmap_pgt */
-	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
-	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
-	copy_page(level2_fixmap_pgt, l2);
-	/* Note that we don't do anything with level1_fixmap_pgt which
-	 * we don't need. */
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
 		/* Make pagetable pieces RO */
 		set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
@@ -1937,6 +1933,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 		set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
 		set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
+		set_page_prot(level1_fixmap_pgt, PAGE_KERNEL_RO);
 
 		/* Pin down new L4 */
 		pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
-- 
1.7.10.4

  reply	other threads:[~2014-09-02 11:02 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-08-29 15:17 [PATCH] x86/xen: Fix 64bit kernel pagetable setup of PV guests Stefan Bader
2014-09-01 17:34 ` [Xen-devel] " David Vrabel
2014-09-02 11:01   ` David Vrabel [this message]
2014-09-02 11:01     ` David Vrabel
2014-09-02 14:34     ` [Xen-devel] " Andrew Cooper
2014-09-06 15:42     ` Stefan Bader

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5405A3A7.2050406@citrix.com \
    --to=david.vrabel@citrix.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=keescook@chromium.org \
    --cc=konrad.wilk@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=stefan.bader@canonical.com \
    --cc=xen-devel@lists.xensource.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.