* [PATCH] i386: For debugging, make the initial page table setup less forgiving. @ 2007-04-13 21:49 H. Peter Anvin 2007-04-13 22:18 ` Zachary Amsden 2007-04-25 11:48 ` Andrew Morton 0 siblings, 2 replies; 20+ messages in thread From: H. Peter Anvin @ 2007-04-13 21:49 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: Andi Kleen, Andrew Morton, Eric W. Biederman, H. Peter Anvin, Zachary Amsden, Linux Kernel Mailing List We just discovered that the accounting for initial memory usage (head.S: INIT_MAP_BEYOND_END) has been way, way off for a very long time. This patch makes the initial page table not round up to the nearest 4M boundary, but instead stop dead (and zero the rest of the final page table) as soon as it hits the configured limit. This patch is intended as a debugging aid. If it goes into the kernel, it should go in at the very beginning of a review cycle, as it may very well expose real failures (without Jeremy's patch to fix the accounting, it *will* crash.) Signed-off-by: H. Peter Anvin <hpa@zytor.com> diff --git a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S index 3fa7f93..db6735a 100644 --- a/arch/i386/kernel/head.S +++ b/arch/i386/kernel/head.S @@ -131,15 +131,27 @@ page_pde_offset = (__PAGE_OFFSET >> 20); movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */ addl $4,%edx movl $1024, %ecx + /* + * End condition: we must map up to and including + * INIT_MAP_BEYOND_END bytes beyond the end of our + * own page tables; 0x1000 is the size of the page + * table were about to write, and +0x007 is the + * attribute bits. + */ + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp 11: stosl addl $0x1000,%eax - loop 11b - /* End condition: we must map up to and including INIT_MAP_BEYOND_END */ - /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */ - leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp cmpl %ebp,%eax - jb 10b + jae 12f + loop 11b + jmp 10b +12: + /* Clear the remainder of the last page table */ + decl %ecx + xorl %eax,%eax + rep; stosl + movl %edi,(init_pg_tables_end - __PAGE_OFFSET) xorl %ebx,%ebx /* This is the boot CPU (BSP) */ ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-13 21:49 [PATCH] i386: For debugging, make the initial page table setup less forgiving H. Peter Anvin @ 2007-04-13 22:18 ` Zachary Amsden 2007-04-13 22:26 ` H. Peter Anvin 2007-04-13 22:26 ` Jeremy Fitzhardinge 2007-04-25 11:48 ` Andrew Morton 1 sibling, 2 replies; 20+ messages in thread From: Zachary Amsden @ 2007-04-13 22:18 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List H. Peter Anvin wrote: > + /* > + * End condition: we must map up to and including > + * INIT_MAP_BEYOND_END bytes beyond the end of our > + * own page tables; 0x1000 is the size of the page > + * table were about to write, and +0x007 is the > + * attribute bits. > + */ > + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp > hrmm? Shouldn't that still be INIT_MAP_BEYOND_END+0x0007? Seems you are mapping 4M more than you need. Zach ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-13 22:18 ` Zachary Amsden @ 2007-04-13 22:26 ` H. Peter Anvin 2007-04-13 22:40 ` Zachary Amsden 2007-04-13 22:26 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 20+ messages in thread From: H. Peter Anvin @ 2007-04-13 22:26 UTC (permalink / raw) To: Zachary Amsden Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List Zachary Amsden wrote: > H. Peter Anvin wrote: >> + /* >> + * End condition: we must map up to and including >> + * INIT_MAP_BEYOND_END bytes beyond the end of our >> + * own page tables; 0x1000 is the size of the page >> + * table were about to write, and +0x007 is the >> + * attribute bits. >> + */ >> + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp >> > > hrmm? Shouldn't that still be INIT_MAP_BEYOND_END+0x0007? Seems you > are mapping 4M more than you need. > 4K, not 4M. This is an actual address, not an indirection. However, the expression is correct, because it needs to refer to the termination address *after* the current page table is written -- you can think of it as having already allocated 4K for a page table that it is about to be generated. -hpa ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-13 22:26 ` H. Peter Anvin @ 2007-04-13 22:40 ` Zachary Amsden 0 siblings, 0 replies; 20+ messages in thread From: Zachary Amsden @ 2007-04-13 22:40 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Andi Kleen, Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List H. Peter Anvin wrote: > Zachary Amsden wrote: >> H. Peter Anvin wrote: >>> + /* >>> + * End condition: we must map up to and including >>> + * INIT_MAP_BEYOND_END bytes beyond the end of our >>> + * own page tables; 0x1000 is the size of the page >>> + * table were about to write, and +0x007 is the >>> + * attribute bits. >>> + */ >>> + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp >>> >> >> hrmm? Shouldn't that still be INIT_MAP_BEYOND_END+0x0007? Seems you >> are mapping 4M more than you need. >> > > 4K, not 4M. This is an actual address, not an indirection. Ok, the old code was writing the entire page table, so it was mapping up 4M more than it should have. It wasn't clear with 10b out of scope in the patch. > > However, the expression is correct, because it needs to refer to the > termination address *after* the current page table is written -- you > can think of it as having already allocated 4K for a page table that > it is about to be generated. Correct. Acked-by: Zachary Amsden <zach@vmware.com> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-13 22:18 ` Zachary Amsden 2007-04-13 22:26 ` H. Peter Anvin @ 2007-04-13 22:26 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-13 22:26 UTC (permalink / raw) To: Zachary Amsden Cc: H. Peter Anvin, Andi Kleen, Andrew Morton, Eric W. Biederman, Linux Kernel Mailing List Zachary Amsden wrote: > H. Peter Anvin wrote: >> + /* >> + * End condition: we must map up to and including >> + * INIT_MAP_BEYOND_END bytes beyond the end of our >> + * own page tables; 0x1000 is the size of the page >> + * table were about to write, and +0x007 is the >> + * attribute bits. >> + */ >> + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp >> > > hrmm? Shouldn't that still be INIT_MAP_BEYOND_END+0x0007? Seems you > are mapping 4M more than you need. No, that's an actual byte address of the end of the mapped area, not a pfn. J ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-13 21:49 [PATCH] i386: For debugging, make the initial page table setup less forgiving H. Peter Anvin 2007-04-13 22:18 ` Zachary Amsden @ 2007-04-25 11:48 ` Andrew Morton 2007-04-25 15:28 ` H. Peter Anvin 1 sibling, 1 reply; 20+ messages in thread From: Andrew Morton @ 2007-04-25 11:48 UTC (permalink / raw) To: H. Peter Anvin Cc: Jeremy Fitzhardinge, Andi Kleen, Eric W. Biederman, Zachary Amsden, Linux Kernel Mailing List On Fri, 13 Apr 2007 14:49:57 -0700 "H. Peter Anvin" <hpa@zytor.com> wrote: > We just discovered that the accounting for initial memory usage > (head.S: INIT_MAP_BEYOND_END) has been way, way off for a very long > time. This patch makes the initial page table not round up to the > nearest 4M boundary, but instead stop dead (and zero the rest of the > final page table) as soon as it hits the configured limit. > > This patch is intended as a debugging aid. If it goes into the > kernel, it should go in at the very beginning of a review cycle, as it > may very well expose real failures (without Jeremy's patch to fix the > accounting, it *will* crash.) > > Signed-off-by: H. Peter Anvin <hpa@zytor.com> > > diff --git a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S > index 3fa7f93..db6735a 100644 > --- a/arch/i386/kernel/head.S > +++ b/arch/i386/kernel/head.S > @@ -131,15 +131,27 @@ page_pde_offset = (__PAGE_OFFSET >> 20); > movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */ > addl $4,%edx > movl $1024, %ecx > + /* > + * End condition: we must map up to and including > + * INIT_MAP_BEYOND_END bytes beyond the end of our > + * own page tables; 0x1000 is the size of the page > + * table were about to write, and +0x007 is the > + * attribute bits. > + */ > + leal (INIT_MAP_BEYOND_END+0x1000+0x007)(%edi),%ebp > 11: > stosl > addl $0x1000,%eax > - loop 11b > - /* End condition: we must map up to and including INIT_MAP_BEYOND_END */ > - /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */ > - leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp > cmpl %ebp,%eax > - jb 10b > + jae 12f > + loop 11b > + jmp 10b > +12: > + /* Clear the remainder of the last page table */ > + decl %ecx > + xorl %eax,%eax > + rep; stosl > + > movl %edi,(init_pg_tables_end - __PAGE_OFFSET) > > xorl %ebx,%ebx /* This is the boot CPU (BSP) */ This patch causes oopses after a minute or so running LTP's ./testcases/bin/growfiles -W gf16 -b -e 1 -i 0 -L 120 -u -g 4090 -T 100 -t 408990 -l -C 10 -c 1000 -S 10 -f Lgf02_ on everyone's favoutite Vaio, configured with http://userweb.kernel.org/~akpm/config-sony.txt BUG: unable to handle kernel paging request at virtual address c084fa8c printing eip: c0174c46 *pde = 0042a027 *pte = 00000000 Oops: 0002 [#1] Modules linked in: ipw2200 sonypi ipv6 autofs4 hidp l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables cpufreq_ondemand video sbs button battery asus_acpi ac nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss sr_mod snd_pcm cdrom ieee80211 ieee80211_crypt snd_timer piix snd soundcore snd_page_alloc i2c_i801 generic i2c_core pcspkr ext3 jbd ide_disk ide_core CPU: 0 EIP: 0060:[<c0174c46>] Not tainted VLI EFLAGS: 00010212 (2.6.21-rc7 #13) EIP is at __block_prepare_write+0x246/0x445 eax: 00000000 ebx: c084f000 ecx: 0000015d edx: 00000574 esi: 00000000 edi: c084fa8c ebp: 00000000 esp: f51e3bc8 ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0068 Process growfiles (pid: 2913, ti=f51e2000 task=f5f31340 task.ti=f51e2000) Stack: c08317b8 00000001 0000000a 00000000 c08317f0 f6c0f9b0 00000980 00000574 f58953d8 00000000 c10109e0 f7dbb0d8 00005e1a 00000000 00001000 c08317b8 f51e3c24 fffff000 00000000 01000286 00001000 00001000 0000000a f58953d8 Call Trace: [<c0174e67>] block_prepare_write+0x22/0x30 [<f8b853b8>] ext3_get_block+0x0/0xd0 [ext3] [<f8b866e2>] ext3_prepare_write+0x0/0x151 [ext3] [<f8b86778>] ext3_prepare_write+0x96/0x151 [ext3] [<f8b853b8>] ext3_get_block+0x0/0xd0 [ext3] [<f8b866e2>] ext3_prepare_write+0x0/0x151 [ext3] [<c0142832>] generic_file_buffered_write+0x26d/0x5eb [<c024d691>] ata_altstatus+0x1d/0x24 [<c011e314>] current_fs_time+0x37/0x3c [<c0143096>] __generic_file_aio_write_nolock+0x4e6/0x562 [<f8b91728>] ext3_permission+0x0/0xa [ext3] [<c02e3322>] __mutex_lock_slowpath+0x1ac/0x1b4 [<c0143173>] generic_file_aio_write+0x61/0xc1 [<f8b82dfc>] ext3_file_write+0x24/0x90 [ext3] [<c015a444>] do_sync_write+0xc7/0x10a [<c012907c>] autoremove_wake_function+0x0/0x35 [<c015a37d>] do_sync_write+0x0/0x10a [<c015ac89>] vfs_write+0xa8/0x155 [<c015b1c5>] sys_write+0x41/0x67 [<c0163e2a>] sys_fcntl64+0x7d/0x87 [<c0103b68>] syscall_call+0x7/0xb ======================= Code: fa ff 80 7c 24 4f 00 89 c3 74 2f 8b 54 24 50 2b 94 24 80 00 00 00 8b 84 24 80 00 00 00 89 d1 89 54 24 1c c1 e9 02 8d 3c 03 89 f0 <f3> ab f6 c2 02 74 02 66 ab f6 c2 01 74 01 aa 80 7c 24 4e 00 74 EIP: [<c0174c46>] __block_prepare_write+0x246/0x445 SS:ESP 0068:f51e3bc8 note: growfiles[2913] exited with preempt_count 1 BUG: unable to handle kernel paging request at virtual address c0850048 Dropped, with alacrity, at 4:47AM. More care and better testing and reviewing, please. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 11:48 ` Andrew Morton @ 2007-04-25 15:28 ` H. Peter Anvin 2007-04-25 17:50 ` Eric W. Biederman 0 siblings, 1 reply; 20+ messages in thread From: H. Peter Anvin @ 2007-04-25 15:28 UTC (permalink / raw) To: Andrew Morton Cc: Jeremy Fitzhardinge, Andi Kleen, Eric W. Biederman, Zachary Amsden, Linux Kernel Mailing List Andrew Morton wrote: > > This patch causes oopses after a minute or so running LTP's > > ./testcases/bin/growfiles -W gf16 -b -e 1 -i 0 -L 120 -u -g 4090 -T 100 -t 408990 -l -C 10 -c 1000 -S 10 -f Lgf02_ > > on everyone's favoutite Vaio, configured with > http://userweb.kernel.org/~akpm/config-sony.txt > *BLINK* This patch only affects the initial page tables, which should have been thrown out *way* long ago at this point. Yet they seem to have stuck around. This is a very bad thing on many levels, especially since we should have switched the kernel 1:1 area over to PSE pages a long time ago. > BUG: unable to handle kernel paging request at virtual address c084fa8c > printing eip: > c0174c46 > *pde = 0042a027 > *pte = 00000000 Touching a non-PSE page which is zero, and quite consistent with being a remnant from the original page tables. Methinks this has smoked out a bug in the initial page table setup which probably has been a performance roadblock for quite some time. -hpa ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 15:28 ` H. Peter Anvin @ 2007-04-25 17:50 ` Eric W. Biederman 2007-04-25 17:56 ` H. Peter Anvin 2007-04-25 18:18 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 20+ messages in thread From: Eric W. Biederman @ 2007-04-25 17:50 UTC (permalink / raw) To: H. Peter Anvin Cc: Andrew Morton, Jeremy Fitzhardinge, Andi Kleen, Eric W. Biederman, Zachary Amsden, Linux Kernel Mailing List "H. Peter Anvin" <hpa@zytor.com> writes: > Andrew Morton wrote: > >> >> This patch causes oopses after a minute or so running LTP's >> >> ./testcases/bin/growfiles -W gf16 -b -e 1 -i 0 -L 120 -u -g 4090 -T 100 -t > 408990 -l -C 10 -c 1000 -S 10 -f Lgf02_ >> >> on everyone's favoutite Vaio, configured with >> http://userweb.kernel.org/~akpm/config-sony.txt >> > > *BLINK* > > This patch only affects the initial page tables, which should have been > thrown out *way* long ago at this point. Yes. I noticed this was happening a few days ago. I must not have mentioned it loudly enough. > Yet they seem to have stuck around. This is a very bad thing on many > levels, especially since we should have switched the kernel 1:1 area > over to PSE pages a long time ago. Yep. We don't continue to use the early page table pages if you enable PAE, but otherwise we do. Which also gives us potential permission issues with the initial page tables. >> BUG: unable to handle kernel paging request at virtual address c084fa8c >> printing eip: >> c0174c46 >> *pde = 0042a027 >> *pte = 00000000 > > Touching a non-PSE page which is zero, and quite consistent with being a > remnant from the original page tables. > > Methinks this has smoked out a bug in the initial page table setup which > probably has been a performance roadblock for quite some time. Yes our initial page table setup on arch/i386 is full of issues. I'm halfway through putting together a patchset to address a bunch of these. I haven't yet resolved how I want to allocate the pages for the identity mapping of the page table yet. I can't use the bootmem allocate as it exists because that assumes the page is mapped into the address space already. Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 17:50 ` Eric W. Biederman @ 2007-04-25 17:56 ` H. Peter Anvin 2007-04-25 18:23 ` Eric W. Biederman 2007-04-25 18:18 ` Jeremy Fitzhardinge 1 sibling, 1 reply; 20+ messages in thread From: H. Peter Anvin @ 2007-04-25 17:56 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Jeremy Fitzhardinge, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: >> This patch only affects the initial page tables, which should have been >> thrown out *way* long ago at this point. > > Yes. I noticed this was happening a few days ago. > I must not have mentioned it loudly enough. You mentioned the continued use of init_mm. This is *very* different. What we're seeing here is that ON PSE-CAPABLE HARDWARE, we continue to not just use the init_mm page directory, but the actual page *tables*, which should all have been replaced with PSE large pages to begin with. Reusing the initial page tables on non-PSE-capable hardware *sort of* makes sense, but his hardware should not fall in that category, I don't think? (Unless it's one of these machines that fall over if you map the bottom 4 MB with PSE pages?) -hpa ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 17:56 ` H. Peter Anvin @ 2007-04-25 18:23 ` Eric W. Biederman 0 siblings, 0 replies; 20+ messages in thread From: Eric W. Biederman @ 2007-04-25 18:23 UTC (permalink / raw) To: H. Peter Anvin Cc: Andrew Morton, Jeremy Fitzhardinge, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List "H. Peter Anvin" <hpa@zytor.com> writes: > Eric W. Biederman wrote: >>> This patch only affects the initial page tables, which should have been >>> thrown out *way* long ago at this point. >> >> Yes. I noticed this was happening a few days ago. >> I must not have mentioned it loudly enough. > > You mentioned the continued use of init_mm. This is *very* different. > > What we're seeing here is that ON PSE-CAPABLE HARDWARE, we continue to > not just use the init_mm page directory, but the actual page *tables*, > which should all have been replaced with PSE large pages to begin with. That is what I meant if not what I communicated. If you read the code that is exactly what it is trying to do. > Reusing the initial page tables on non-PSE-capable hardware *sort of* > makes sense, but his hardware should not fall in that category, I don't > think? (Unless it's one of these machines that fall over if you map the > bottom 4 MB with PSE pages?) I agree that I don't think the current behaviour makes sense. I think the code has accumulated so many small modifications that is very far from making sense in the corner cases. arch/i386 either needs to be frozen as a legacy only thing or it needs to be cleaned up so we can continue to enhance the code. Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 17:50 ` Eric W. Biederman 2007-04-25 17:56 ` H. Peter Anvin @ 2007-04-25 18:18 ` Jeremy Fitzhardinge 2007-04-25 19:01 ` Eric W. Biederman 1 sibling, 1 reply; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-25 18:18 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: > "H. Peter Anvin" <hpa@zytor.com> writes: > > >> Andrew Morton wrote: >> >> >>> This patch causes oopses after a minute or so running LTP's >>> >>> ./testcases/bin/growfiles -W gf16 -b -e 1 -i 0 -L 120 -u -g 4090 -T 100 -t >>> >> 408990 -l -C 10 -c 1000 -S 10 -f Lgf02_ >> >>> on everyone's favoutite Vaio, configured with >>> http://userweb.kernel.org/~akpm/config-sony.txt >>> >>> >> *BLINK* >> >> This patch only affects the initial page tables, which should have been >> thrown out *way* long ago at this point. >> > > Yes. I noticed this was happening a few days ago. > I must not have mentioned it loudly enough. > I might have introduced it as part of the paravirt_ops patches. When setting up pagetables under Xen, we need to make sure we preserve the initial mappings Xen put in place (it starts the VM with paging enabled, and a sane initial pagetable). This may have leaked over into native if it doesn't replace an existing entry. Hm, but it should be overwriting small mappings with large ones. Maybe I overlooked that. > I'm halfway through putting together a patchset to address a > bunch of these. > > I haven't yet resolved how I want to allocate the pages for the > identity mapping of the page table yet. I can't use the bootmem > allocate as it exists because that assumes the page is mapped > into the address space already. > Are you going to clash horribly with the paravirt pagetable setup? We should probably coordinate if so. J ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 18:18 ` Jeremy Fitzhardinge @ 2007-04-25 19:01 ` Eric W. Biederman 2007-04-25 19:19 ` Jeremy Fitzhardinge 2007-04-25 22:08 ` Jeremy Fitzhardinge 0 siblings, 2 replies; 20+ messages in thread From: Eric W. Biederman @ 2007-04-25 19:01 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Jeremy Fitzhardinge <jeremy@goop.org> writes: > I might have introduced it as part of the paravirt_ops patches. When > setting up pagetables under Xen, we need to make sure we preserve the > initial mappings Xen put in place (it starts the VM with paging enabled, > and a sane initial pagetable). This may have leaked over into native if > it doesn't replace an existing entry. Yes in paravirt_ops-hooks-to-set-up-initial-pagetable Grr. That is what I get for trying to avoid conflicts by working against the -mm tree. I thought that code had been like that for a while. Since it hasn't been merged upstream it looks to me like that patch needs to be fixed. And we can keep HPA debugging patch. > Hm, but it should be overwriting small mappings with large ones. Maybe > I overlooked that. Nope. It's not overwriting anything. I'm not at all convinced we should preserve the Xen sillyness. I'm not exactly against it, but it adds complexity to already complex code. >> I'm halfway through putting together a patchset to address a >> bunch of these. >> >> I haven't yet resolved how I want to allocate the pages for the >> identity mapping of the page table yet. I can't use the bootmem >> allocate as it exists because that assumes the page is mapped >> into the address space already. >> > > Are you going to clash horribly with the paravirt pagetable setup? We > should probably coordinate if so. We will have to see. Currently the paravirt pagetable setup clashes horrible with actually running a kernel on real hardware. Color me not impressed by the paravirt mess. Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 19:01 ` Eric W. Biederman @ 2007-04-25 19:19 ` Jeremy Fitzhardinge 2007-04-25 22:08 ` Jeremy Fitzhardinge 1 sibling, 0 replies; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-25 19:19 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: >> Hm, but it should be overwriting small mappings with large ones. Maybe >> I overlooked that. >> > > Nope. It's not overwriting anything. It will work if we just overwrite the pmd's with large pages in the PSE case, since Xen doesn't support PSE. Or we could only overwrite if its !present or !huge. > I'm not at all convinced we > should preserve the Xen sillyness. I'm not exactly against it, > but it adds complexity to already complex code. > Yes. I would love to see that code get simplified. I've been trying to walk a line of making my changes small and incremental, while still trying to get a reasonable result. In this case, its a file which has undergone many such changes, and the entropy is really building up. The Xen case is tricky because the hypervisor is very strict about what can be in the pagetable, and so we can't go through any transitional bad states on the way to constructing the final result. The easiest way to do that is use the existing pagetable where possible and just add new things to it. > We will have to see. Currently the paravirt pagetable setup clashes > horrible with actually running a kernel on real hardware. > I'll fix it up. I've been running these kernels on my normal work machine without obvious problems, which is why it escaped notice until HPA's debug patch. > Color me not impressed by the paravirt mess. Well, I'm working on it :) J ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 19:01 ` Eric W. Biederman 2007-04-25 19:19 ` Jeremy Fitzhardinge @ 2007-04-25 22:08 ` Jeremy Fitzhardinge 2007-04-25 22:27 ` Eric W. Biederman 1 sibling, 1 reply; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-25 22:08 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: > Nope. It's not overwriting anything. This should fix it. Subject: x86: fix PSE pagetable construction When constructing the initial pagetable in pagetable_init, make sure that non-PSE pmds are updated to PSE ones. This moves the definition of pmd_huge() out of the hugetlbfs files into pgtable.h. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Eric W. Biederman <ebiederm@xmission.com> --- arch/i386/mm/hugetlbpage.c | 6 +----- arch/i386/mm/init.c | 2 +- include/asm-i386/pgtable.h | 2 +- include/asm-x86_64/pgtable.h | 1 + include/linux/hugetlb.h | 2 -- 5 files changed, 4 insertions(+), 9 deletions(-) =================================================================== --- a/arch/i386/mm/hugetlbpage.c +++ b/arch/i386/mm/hugetlbpage.c @@ -183,6 +183,7 @@ follow_huge_addr(struct mm_struct *mm, u return page; } +#undef pmd_huge int pmd_huge(pmd_t pmd) { return 0; @@ -201,11 +202,6 @@ follow_huge_addr(struct mm_struct *mm, u follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) { return ERR_PTR(-EINVAL); -} - -int pmd_huge(pmd_t pmd) -{ - return !!(pmd_val(pmd) & _PAGE_PSE); } struct page * =================================================================== --- a/arch/i386/mm/init.c +++ b/arch/i386/mm/init.c @@ -172,7 +172,7 @@ static void __init kernel_physical_mappi /* Map with big pages if possible, otherwise create normal page tables. */ if (cpu_has_pse) { unsigned int address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1; - if (!pmd_present(*pmd)) { + if (!pmd_present(*pmd) || !pmd_huge(*pmd)) { if (is_kernel_text(address) || is_kernel_text(address2)) set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE_EXEC)); else =================================================================== --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -211,7 +211,7 @@ extern unsigned long pg0[]; #define pmd_none(x) (!(unsigned long)pmd_val(x)) #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT) #define pmd_bad(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE) - +#define pmd_huge(x) ((pmd_val(x) & _PAGE_PSE) != 0) #define pages_to_mb(x) ((x) >> (20-PAGE_SHIFT)) =================================================================== --- a/include/asm-x86_64/pgtable.h +++ b/include/asm-x86_64/pgtable.h @@ -352,6 +352,7 @@ static inline int pmd_large(pmd_t pte) { pmd_index(address)) #define pmd_none(x) (!pmd_val(x)) #define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT) +#define pmd_huge(x) ((pmd_val(x) & _PAGE_PSE) != 0) #define pmd_clear(xp) do { set_pmd(xp, __pmd(0)); } while (0) #define pfn_pmd(nr,prot) (__pmd(((nr) << PAGE_SHIFT) | pgprot_val(prot))) #define pmd_pfn(x) ((pmd_val(x) & __PHYSICAL_MASK) >> PAGE_SHIFT) =================================================================== --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -41,7 +41,6 @@ struct page *follow_huge_addr(struct mm_ int write); struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address, pmd_t *pmd, int write); -int pmd_huge(pmd_t pmd); void hugetlb_change_protection(struct vm_area_struct *vma, unsigned long address, unsigned long end, pgprot_t newprot); @@ -114,7 +113,6 @@ static inline unsigned long hugetlb_tota #define hugetlb_report_node_meminfo(n, buf) 0 #define follow_huge_pmd(mm, addr, pmd, write) NULL #define prepare_hugepage_range(addr,len,pgoff) (-EINVAL) -#define pmd_huge(x) 0 #define is_hugepage_only_range(mm, addr, len) 0 #define hugetlb_free_pgd_range(tlb, addr, end, floor, ceiling) ({BUG(); 0; }) #define hugetlb_fault(mm, vma, addr, write) ({ BUG(); 0; }) ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 22:08 ` Jeremy Fitzhardinge @ 2007-04-25 22:27 ` Eric W. Biederman 2007-04-25 23:08 ` Jeremy Fitzhardinge 0 siblings, 1 reply; 20+ messages in thread From: Eric W. Biederman @ 2007-04-25 22:27 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Jeremy Fitzhardinge <jeremy@goop.org> writes: > Eric W. Biederman wrote: >> Nope. It's not overwriting anything. > > This should fix it. > > Subject: x86: fix PSE pagetable construction > > When constructing the initial pagetable in pagetable_init, make sure > that non-PSE pmds are updated to PSE ones. > > This moves the definition of pmd_huge() out of the hugetlbfs files > into pgtable.h. No. Please just remove the conditionals on the leaf pages. We know exactly what we require them to be, there is minimal cost and no downside to just setting the pte entries to what we want them to be for the identity mapping. It doesn't make sense for paravirtualization or anything else to influence that. This may be redoing work that has been done before but it is doing it all one common place. It does make sense not to overwrite the intermediate pages. Not overwriting intermediate pages is reasonable, and allows other things to be going on and to be preserved. Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 22:27 ` Eric W. Biederman @ 2007-04-25 23:08 ` Jeremy Fitzhardinge 2007-04-25 23:45 ` Eric W. Biederman 0 siblings, 1 reply; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-25 23:08 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: > No. Please just remove the conditionals on the leaf pages. > So, to be specific, you mean make updating the pte_t entries (and pmd_t entries which refer to hugepages) entries unconditional? > We know exactly what we require them to be, there is minimal > cost and no downside to just setting the pte entries to > what we want them to be for the identity mapping. > > It doesn't make sense for paravirtualization or anything else to > influence that. > > This may be redoing work that has been done before but it is > doing it all one common place. > The issue is not a matter of avoiding duplicate work, but making sure all the pagetables are consistent from Xen's perspective. Specifically, you may not ever, at any time, create a writable mapping of a page which is currently part of an active pagetable. This means that when we're creating mappings of physical memory, the pages which are part of the current pagetable must be mapped RO. The easiest way I found to guarantee that is to copy the Xen-provided pagetable as a template, and only update pages which are missing. The other way I could do this is to have special-purpose init-time version of xen_set_pte which checks to see if it's making a RO mapping RW, and refuse to do it. That would minimize the changes to mm/init.c, but give init-time set_pte rather unexpected hidden semantics. J ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 23:08 ` Jeremy Fitzhardinge @ 2007-04-25 23:45 ` Eric W. Biederman 2007-04-26 0:14 ` Jeremy Fitzhardinge 2007-04-26 3:27 ` Zachary Amsden 0 siblings, 2 replies; 20+ messages in thread From: Eric W. Biederman @ 2007-04-25 23:45 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Jeremy Fitzhardinge <jeremy@goop.org> writes: > Eric W. Biederman wrote: >> No. Please just remove the conditionals on the leaf pages. >> > > So, to be specific, you mean make updating the pte_t entries (and pmd_t > entries which refer to hugepages) entries unconditional? I mean make updating pte_t and pmd_t entries that refer to identity mapped physical pages unconditional. >> We know exactly what we require them to be, there is minimal >> cost and no downside to just setting the pte entries to >> what we want them to be for the identity mapping. >> >> It doesn't make sense for paravirtualization or anything else to >> influence that. >> >> This may be redoing work that has been done before but it is >> doing it all one common place. >> > > The issue is not a matter of avoiding duplicate work, but making sure > all the pagetables are consistent from Xen's perspective. > > Specifically, you may not ever, at any time, create a writable mapping > of a page which is currently part of an active pagetable. This means > that when we're creating mappings of physical memory, the pages which > are part of the current pagetable must be mapped RO. The easiest way I > found to guarantee that is to copy the Xen-provided pagetable as a > template, and only update pages which are missing. Hmm. I now see your problem. > The other way I could do this is to have special-purpose init-time > version of xen_set_pte which checks to see if it's making a RO mapping > RW, and refuse to do it. That would minimize the changes to mm/init.c, > but give init-time set_pte rather unexpected hidden semantics. Yes. However how do we handle attempting to create this kind of mapping when mmap /dev/mem? or /dev/kmem? I'm pretty certain there are other paths through the kernel where we can get page table mapping. Right now by leaving things read-only you are hiding from the kernel what you are really trying to do. That makes me distinctly uncomfortable. In general when things get swept under the rug we can never handle the properly. Although this issue may be small enough it doesn't matter. I suspect what we want to do is come up with a function to call to test to see if a page should be read-only and map such pages _PAGE_KERNEL_RO, or _PAGE_KERNEL_RO_EXEC if it's code. Speaking of things what are paravirt_alloc_pd and parafirt_alloc_pd supposed to do? Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 23:45 ` Eric W. Biederman @ 2007-04-26 0:14 ` Jeremy Fitzhardinge 2007-04-27 5:02 ` Eric W. Biederman 2007-04-26 3:27 ` Zachary Amsden 1 sibling, 1 reply; 20+ messages in thread From: Jeremy Fitzhardinge @ 2007-04-26 0:14 UTC (permalink / raw) To: Eric W. Biederman Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Eric W. Biederman wrote: >> The issue is not a matter of avoiding duplicate work, but making sure >> all the pagetables are consistent from Xen's perspective. >> >> Specifically, you may not ever, at any time, create a writable mapping >> of a page which is currently part of an active pagetable. This means >> that when we're creating mappings of physical memory, the pages which >> are part of the current pagetable must be mapped RO. The easiest way I >> found to guarantee that is to copy the Xen-provided pagetable as a >> template, and only update pages which are missing. >> > > Hmm. I now see your problem. > > >> The other way I could do this is to have special-purpose init-time >> version of xen_set_pte which checks to see if it's making a RO mapping >> RW, and refuse to do it. That would minimize the changes to mm/init.c, >> but give init-time set_pte rather unexpected hidden semantics. >> > > Yes. However how do we handle attempting to create this kind > of mapping when mmap /dev/mem? or /dev/kmem? > Hm, I hadn't thought about that. I'm not sure that /dev/k?mem is very useful in an unprivileged guest, but I guess its useful for debugging or stats or something. It's tricky to tell whether an arbitrary pfn is part of a pagetable or not; there's a PG_PINNED page flag to tell you if its active, but iff you've already determined its a pagetable page. > I'm pretty certain there are other paths through the kernel where > we can get page table mapping. > > Right now by leaving things read-only you are hiding from the kernel > what you are really trying to do. That makes me distinctly > uncomfortable. In general when things get swept under the rug > we can never handle the properly. Although this issue may be small > enough it doesn't matter. > Well, the general idea is that in a paravirtualized environment pagetable pages need special handling. Different hypervisors need different handling, but they all need something special. The paravirt hooks are intended to capture all the interesting events, without over-constraining what special thing the hypervisor wants to do at that point. That's why I went for the "allow the hypervisor to provide a prototype pagetable, and avoid the bits it has already set up"; it allows it to do whatever it wants, without getting too specific about what that is, and retains a fairly straightforward interface. > I suspect what we want to do is come up with a function to call > to test to see if a page should be read-only and map such pages > _PAGE_KERNEL_RO, or _PAGE_KERNEL_RO_EXEC if it's code. > Hm, I think that's a hard function to write in general. For the special case of pagetable_init it wouldn't be too hard, but it doesn't seem like a big improvement over the current state of affairs. > Speaking of things what are paravirt_alloc_pd and parafirt_alloc_pd > supposed to do? > (alloc_pd and alloc_pt) Broadly speaking, they tell the hypevisor that there's a new page about to be attached to the pagetable. Xen uses it as the hook to map those pages RO if the pagetable is active. VMI (and lguest?) use it to tell the hypervisor's shadow pagetable machinery that there's something new to track. J ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-26 0:14 ` Jeremy Fitzhardinge @ 2007-04-27 5:02 ` Eric W. Biederman 0 siblings, 0 replies; 20+ messages in thread From: Eric W. Biederman @ 2007-04-27 5:02 UTC (permalink / raw) To: Jeremy Fitzhardinge Cc: H. Peter Anvin, Andrew Morton, Andi Kleen, Zachary Amsden, Linux Kernel Mailing List Jeremy Fitzhardinge <jeremy@goop.org> writes: >> I suspect what we want to do is come up with a function to call >> to test to see if a page should be read-only and map such pages >> _PAGE_KERNEL_RO, or _PAGE_KERNEL_RO_EXEC if it's code. >> > > Hm, I think that's a hard function to write in general. For the special > case of pagetable_init it wouldn't be too hard, but it doesn't seem like > a big improvement over the current state of affairs. There is some difficulty there, and I need to look at the issue some more but it just occurred to me that this problem of tracking special permissions on pages is not confined to Xen. Currently for pages we need to have a consistent global view if a pages is uncached, write-combining or cached. We need a way to check this so that we don't get inconsistencies in how we are caching pages, when we start controlling this on a per page basis. Currently we are not doing this and it is blocking merging of PAT write-combining support in the kernel because of the bad things (silent data corruption and general cpu undefined behavior) that can happen if we don't handling things consistently. I don't know if we can solve the two problems with the same mechanism but it is worth looking into. Especially if what we do is just add support for marking some pages as read-only. We could use that ensuring there are no writable mappings of the kernel's text segment for example. Eric ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] i386: For debugging, make the initial page table setup less forgiving. 2007-04-25 23:45 ` Eric W. Biederman 2007-04-26 0:14 ` Jeremy Fitzhardinge @ 2007-04-26 3:27 ` Zachary Amsden 1 sibling, 0 replies; 20+ messages in thread From: Zachary Amsden @ 2007-04-26 3:27 UTC (permalink / raw) To: Eric W. Biederman Cc: Jeremy Fitzhardinge, H. Peter Anvin, Andrew Morton, Andi Kleen, Linux Kernel Mailing List Eric W. Biederman wrote: > I suspect what we want to do is come up with a function to call > to test to see if a page should be read-only and map such pages > _PAGE_KERNEL_RO, or _PAGE_KERNEL_RO_EXEC if it's code. > > Speaking of things what are paravirt_alloc_pd and parafirt_alloc_pd > supposed to do? > For hypervisors which shadow kernel page tables, none of these concerns with keeping page tables read-only arise. However, another set of concerns does arise with maintaining shadow synchronization. One of those problems is keeping the hypervisor aware of when pages are being used as page tables. However, it turns out both direct page table and shadow page table implementations can be made to use one page table allocation function; in the direct page table case (as for Xen), this is the point where page tables can be recognized and made read-only. So this is the dual purpose of the paravirt_alloc_p[dt] functions. Zach ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2007-04-27 5:04 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-04-13 21:49 [PATCH] i386: For debugging, make the initial page table setup less forgiving H. Peter Anvin 2007-04-13 22:18 ` Zachary Amsden 2007-04-13 22:26 ` H. Peter Anvin 2007-04-13 22:40 ` Zachary Amsden 2007-04-13 22:26 ` Jeremy Fitzhardinge 2007-04-25 11:48 ` Andrew Morton 2007-04-25 15:28 ` H. Peter Anvin 2007-04-25 17:50 ` Eric W. Biederman 2007-04-25 17:56 ` H. Peter Anvin 2007-04-25 18:23 ` Eric W. Biederman 2007-04-25 18:18 ` Jeremy Fitzhardinge 2007-04-25 19:01 ` Eric W. Biederman 2007-04-25 19:19 ` Jeremy Fitzhardinge 2007-04-25 22:08 ` Jeremy Fitzhardinge 2007-04-25 22:27 ` Eric W. Biederman 2007-04-25 23:08 ` Jeremy Fitzhardinge 2007-04-25 23:45 ` Eric W. Biederman 2007-04-26 0:14 ` Jeremy Fitzhardinge 2007-04-27 5:02 ` Eric W. Biederman 2007-04-26 3:27 ` Zachary Amsden
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox