* Re: [git pull] cpus4096 tree, part 3
[not found] ` <20090105011416.GG32239@wotan.suse.de>
@ 2009-01-05 1:16 ` Nick Piggin
2009-01-26 19:00 ` Andrew Morton
From: Nick Piggin @ 2009-01-05 1:16 UTC (permalink / raw)
To: Ingo Molnar, linux-arch
Cc: Linus Torvalds, Rusty Russell, Mike Travis, linux-kernel,
Pallipadi, Venkatesh, Suresh Siddha, Arjan van de Ven,
H. Peter Anvin, Thomas Gleixner
Really cc linux-arch this time
On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote:
> On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
> >
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > > ago? I complained about some of the issues in his first version and
> > > asked for some further cleanups, but I think that whole discussion ended
> > > with him saying "I am going to add those changes that you suggested (in
> > > fact, I already have)".
> > >
> > > And then I didn't see anything further. Maybe I just missed the end
> > > result. Or maybe we have it in some -mm branch or something?
> >
> > they would have been in tip/x86/mm and would be upstream now had Nick
> > re-sent a v2 series but that never happened. I think they might have
> > fallen victim to a serious attention deficit caused by the SLQB patch ;-)
>
> Well, I already added Linus's suggestions but didn't submit it because
> there was a bit of work going on in that file as far as I could see, both
> in the x86 tree and in -mm:
>
> (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)
>
> It isn't a big deal to resolve either way, but I don't want to make Andrew's
> life harder.
>
> [Yes OK now I'm the guilty one of pushing in an x86 patch not via the
> x86 tree ;) This one is easy to break in pieces, but I didn't want
> to create a dependency between the trees]
>
> I didn't really consider it to be urgent, so I was waiting for that patch
> to go in, but I was still hoping to get this into 2.6.29... This is what
> it looks like now with your suggestions, and just merged it to your current
> tree (untested).
>
> I'll cc the linux-arch list here too, because it might be nice to keep these
> things as structurally similar as possible (and they'll all want to look at
> the -mm patch above, although I'll probably end up having to write the
> patches!).
---
Optimise x86's do_page_fault (C entry point for the page fault path).
gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines.
Also, ensure the order of arguments to functions is always the same: regs,
addr, error_code. This can reduce code size a tiny bit, and just looks neater
too.
Add a couple of branch annotations.
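The split described above can be sketched in miniature. This is a hypothetical
userspace model of the pattern (illustrative names, not the kernel code): the
rarely-taken path moves into a noinline function, so the hot path keeps a small
stack frame and gcc doesn't have to allocate spill slots for the slow path.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel annotations. */
#define noinline __attribute__((noinline))
#define unlikely(x) __builtin_expect(!!(x), 0)

static int last_error;

/* The bulky, rarely-executed handling lives out of line. */
static noinline void bad_area_sketch(int error_code)
{
	last_error = error_code;
}

/* The hot path stays short: a predicted-not-taken test, a call, done. */
static void fault_fast_path(int error_code, int vma_ok)
{
	if (unlikely(!vma_ok)) {
		bad_area_sketch(error_code);
		return;
	}
	last_error = 0;	/* common case */
}
```

Because bad_area_sketch() is a separate (noinline) function, its locals and
spills are accounted to its own frame, which is the effect visible in the
subq numbers below.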
One real behavioural difference this makes is that the OOM-init-task case
will no longer loop around the page fault handler, but will return to
userspace and presumably retry the fault. Effectively the same macro-behaviour,
but it is a notable difference. This behavioural difference should disappear after
the "call oom killer from page fault" patch.
Before:
do_page_fault:
subq $360, %rsp #,
After:
do_page_fault:
subq $56, %rsp #,
bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680
Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd its original size.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.
---
arch/x86/mm/fault.c | 458 ++++++++++++++++++++++++++++++----------------------
1 file changed, 265 insertions(+), 193 deletions(-)
Index: linux-2.6/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/fault.c
+++ linux-2.6/arch/x86/mm/fault.c
@@ -91,8 +91,8 @@ static inline int notify_page_fault(stru
*
* Opcode checker based on code by Richard Brunner
*/
-static int is_prefetch(struct pt_regs *regs, unsigned long addr,
- unsigned long error_code)
+static int is_prefetch(struct pt_regs *regs, unsigned long error_code,
+ unsigned long addr)
{
unsigned char *instr;
int scan_more = 1;
@@ -409,15 +409,15 @@ static void show_fault_oops(struct pt_re
}
#ifdef CONFIG_X86_64
-static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs,
- unsigned long error_code)
+static noinline void pgtable_bad(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
{
unsigned long flags = oops_begin();
int sig = SIGKILL;
- struct task_struct *tsk;
+ struct task_struct *tsk = current;
printk(KERN_ALERT "%s: Corrupted page table at address %lx\n",
- current->comm, address);
+ tsk->comm, address);
dump_pagetable(address);
tsk = current;
tsk->thread.cr2 = address;
@@ -429,6 +429,200 @@ static noinline void pgtable_bad(unsigne
}
#endif
+static noinline void no_context(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+#ifdef CONFIG_X86_64
+ unsigned long flags;
+ int sig;
+#endif
+
+ /* Are we prepared to handle this kernel fault? */
+ if (fixup_exception(regs))
+ return;
+
+ /*
+ * X86_32
+ * Valid to do another page fault here, because if this fault
+ * had been triggered by is_prefetch fixup_exception would have
+ * handled it.
+ *
+ * X86_64
+ * Hall of shame of CPU/BIOS bugs.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata93(regs, address))
+ return;
+
+ /*
+ * Oops. The kernel tried to access some bad page. We'll have to
+ * terminate things with extreme prejudice.
+ */
+#ifdef CONFIG_X86_32
+ bust_spinlocks(1);
+#else
+ flags = oops_begin();
+#endif
+
+ show_fault_oops(regs, error_code, address);
+
+ tsk->thread.cr2 = address;
+ tsk->thread.trap_no = 14;
+ tsk->thread.error_code = error_code;
+
+#ifdef CONFIG_X86_32
+ die("Oops", regs, error_code);
+ bust_spinlocks(0);
+ do_exit(SIGKILL);
+#else
+ sig = SIGKILL;
+ if (__die("Oops", regs, error_code))
+ sig = 0;
+ /* Executive summary in case the body of the oops scrolled away */
+ printk(KERN_EMERG "CR2: %016lx\n", address);
+ oops_end(flags, regs, sig);
+#endif
+}
+
+static void __bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct task_struct *tsk = current;
+
+ /* User mode accesses just cause a SIGSEGV */
+ if (error_code & PF_USER) {
+ /*
+ * It's possible to have interrupts off here.
+ */
+ local_irq_enable();
+
+ /*
+ * Valid to do another page fault here because this one came
+ * from user space.
+ */
+ if (is_prefetch(regs, error_code, address))
+ return;
+
+ if (is_errata100(regs, address))
+ return;
+
+ if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+ printk_ratelimit()) {
+ printk(
+ "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
+ task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+ tsk->comm, task_pid_nr(tsk), address,
+ (void *) regs->ip, (void *) regs->sp, error_code);
+ print_vma_addr(" in ", regs->ip);
+ printk("\n");
+ }
+
+ tsk->thread.cr2 = address;
+ /* Kernel addresses are always protection faults */
+ tsk->thread.error_code = error_code | (address >= TASK_SIZE);
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGSEGV, si_code, address, tsk);
+ return;
+ }
+
+ if (is_f00f_bug(regs, address))
+ return;
+
+ no_context(regs, error_code, address);
+}
+
+static noinline void bad_area_nosemaphore(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR);
+}
+
+static void __bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address,
+ int si_code)
+{
+ struct mm_struct *mm = current->mm;
+
+ /*
+ * Something tried to access memory that isn't in our memory map..
+ * Fix it, but check if it's kernel or user first..
+ */
+ up_read(&mm->mmap_sem);
+
+ __bad_area_nosemaphore(regs, error_code, address, si_code);
+}
+
+static noinline void bad_area(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_MAPERR);
+}
+
+static noinline void bad_area_access_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ __bad_area(regs, error_code, address, SEGV_ACCERR);
+}
+
+/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */
+static void out_of_memory(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+ /*
+ * We ran out of memory, or some other thing happened to us that made
+ * us unable to handle the page fault gracefully.
+ */
+ up_read(&mm->mmap_sem);
+ if (is_global_init(tsk)) {
+ yield();
+ return;
+ }
+
+ printk("VM: killing process %s\n", tsk->comm);
+ if (error_code & PF_USER)
+ do_group_exit(SIGKILL);
+ no_context(regs, error_code, address);
+}
+
+static void do_sigbus(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *mm = tsk->mm;
+
+ up_read(&mm->mmap_sem);
+
+ /* Kernel mode? Handle exceptions or die */
+ if (!(error_code & PF_USER))
+ no_context(regs, error_code, address);
+#ifdef CONFIG_X86_32
+ /* User space => ok to do another page fault */
+ if (is_prefetch(regs, error_code, address))
+ return;
+#endif
+ tsk->thread.cr2 = address;
+ tsk->thread.error_code = error_code;
+ tsk->thread.trap_no = 14;
+ force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
+}
+
+static noinline void mm_fault_error(struct pt_regs *regs,
+ unsigned long error_code, unsigned long address, unsigned int fault)
+{
+ if (fault & VM_FAULT_OOM)
+ out_of_memory(regs, error_code, address);
+ else if (fault & VM_FAULT_SIGBUS)
+ do_sigbus(regs, error_code, address);
+ else
+ BUG();
+}
+
static int spurious_fault_check(unsigned long error_code, pte_t *pte)
{
if ((error_code & PF_WRITE) && !pte_write(*pte))
@@ -448,8 +642,8 @@ static int spurious_fault_check(unsigned
* There are no security implications to leaving a stale TLB when
* increasing the permissions on a page.
*/
-static int spurious_fault(unsigned long address,
- unsigned long error_code)
+static noinline int spurious_fault(unsigned long error_code,
+ unsigned long address)
{
pgd_t *pgd;
pud_t *pud;
@@ -494,7 +688,7 @@ static int spurious_fault(unsigned long
*
* This assumes no large pages in there.
*/
-static int vmalloc_fault(unsigned long address)
+static noinline int vmalloc_fault(unsigned long address)
{
#ifdef CONFIG_X86_32
unsigned long pgd_paddr;
@@ -573,6 +767,25 @@ static int vmalloc_fault(unsigned long a
int show_unhandled_signals = 1;
+static inline int access_error(unsigned long error_code, int write,
+ struct vm_area_struct *vma)
+{
+ if (write) {
+ /* write, present and write, not present */
+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ return 1;
+ } else if (unlikely(error_code & PF_PROT)) {
+ /* read, present */
+ return 1;
+ } else {
+ /* read, not present */
+ if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))))
+ return 1;
+ }
+
+ return 0;
+}
+
/*
* This routine handles page faults. It determines the address,
* and the problem, and then passes it off to one of the appropriate
@@ -583,16 +796,12 @@ asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
+ unsigned long address;
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
- unsigned long address;
- int write, si_code;
+ int write;
int fault;
-#ifdef CONFIG_X86_64
- unsigned long flags;
- int sig;
-#endif
tsk = current;
mm = tsk->mm;
@@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r
/* get the address */
address = read_cr2();
- si_code = SEGV_MAPERR;
-
- if (notify_page_fault(regs))
+ if (unlikely(notify_page_fault(regs)))
return;
if (unlikely(kmmio_fault(regs, address)))
return;
@@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r
* Don't take the mm semaphore here. If we fixup a prefetch
* fault we could otherwise deadlock.
*/
- goto bad_area_nosemaphore;
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
}
-
/*
* It's safe to allow irq's after cr2 has been saved and the
* vmalloc fault has been handled.
@@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r
#ifdef CONFIG_X86_64
if (unlikely(error_code & PF_RSVD))
- pgtable_bad(address, regs, error_code);
+ pgtable_bad(regs, error_code, address);
#endif
/*
* If we're in an interrupt, have no user context or are running in an
* atomic region then we must not take the fault.
*/
- if (unlikely(in_atomic() || !mm))
- goto bad_area_nosemaphore;
+ if (unlikely(in_atomic() || !mm)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }
-again:
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
@@ -684,20 +892,26 @@ again:
* source. If this is invalid we can skip the address space check,
* thus avoiding the deadlock.
*/
- if (!down_read_trylock(&mm->mmap_sem)) {
+ if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
if ((error_code & PF_USER) == 0 &&
- !search_exception_tables(regs->ip))
- goto bad_area_nosemaphore;
+ !search_exception_tables(regs->ip)) {
+ bad_area_nosemaphore(regs, error_code, address);
+ return;
+ }
down_read(&mm->mmap_sem);
}
vma = find_vma(mm, address);
- if (!vma)
- goto bad_area;
- if (vma->vm_start <= address)
+ if (unlikely(!vma)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+ if (likely(vma->vm_start <= address))
goto good_area;
- if (!(vma->vm_flags & VM_GROWSDOWN))
- goto bad_area;
+ if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
if (error_code & PF_USER) {
/*
* Accessing the stack below %sp is always a bug.
@@ -705,31 +919,25 @@ again:
* and pusha to work. ("enter $65535,$31" pushes
* 32 pointers and then decrements %sp by 65535.)
*/
- if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
- goto bad_area;
+ if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
+ bad_area(regs, error_code, address);
+ return;
+ }
}
- if (expand_stack(vma, address))
- goto bad_area;
-/*
- * Ok, we have a good vm_area for this memory access, so
- * we can handle it..
- */
+ if (unlikely(expand_stack(vma, address))) {
+ bad_area(regs, error_code, address);
+ return;
+ }
+
+ /*
+ * Ok, we have a good vm_area for this memory access, so
+ * we can handle it..
+ */
good_area:
- si_code = SEGV_ACCERR;
- write = 0;
- switch (error_code & (PF_PROT|PF_WRITE)) {
- default: /* 3: write, present */
- /* fall through */
- case PF_WRITE: /* write, not present */
- if (!(vma->vm_flags & VM_WRITE))
- goto bad_area;
- write++;
- break;
- case PF_PROT: /* read, present */
- goto bad_area;
- case 0: /* read, not present */
- if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
- goto bad_area;
+ write = error_code & PF_WRITE;
+ if (unlikely(access_error(error_code, write, vma))) {
+ bad_area_access_error(regs, error_code, address);
+ return;
}
/*
@@ -739,11 +947,8 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, write);
if (unlikely(fault & VM_FAULT_ERROR)) {
- if (fault & VM_FAULT_OOM)
- goto out_of_memory;
- else if (fault & VM_FAULT_SIGBUS)
- goto do_sigbus;
- BUG();
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
@@ -761,139 +966,6 @@ good_area:
}
#endif
up_read(&mm->mmap_sem);
- return;
-
-/*
- * Something tried to access memory that isn't in our memory map..
- * Fix it, but check if it's kernel or user first..
- */
-bad_area:
- up_read(&mm->mmap_sem);
-
-bad_area_nosemaphore:
- /* User mode accesses just cause a SIGSEGV */
- if (error_code & PF_USER) {
- /*
- * It's possible to have interrupts off here.
- */
- local_irq_enable();
-
- /*
- * Valid to do another page fault here because this one came
- * from user space.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata100(regs, address))
- return;
-
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
- printk_ratelimit()) {
- printk(
- "%s%s[%d]: segfault at %lx ip %p sp %p error %lx",
- task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
- tsk->comm, task_pid_nr(tsk), address,
- (void *) regs->ip, (void *) regs->sp, error_code);
- print_vma_addr(" in ", regs->ip);
- printk("\n");
- }
-
- tsk->thread.cr2 = address;
- /* Kernel addresses are always protection faults */
- tsk->thread.error_code = error_code | (address >= TASK_SIZE);
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGSEGV, si_code, address, tsk);
- return;
- }
-
- if (is_f00f_bug(regs, address))
- return;
-
-no_context:
- /* Are we prepared to handle this kernel fault? */
- if (fixup_exception(regs))
- return;
-
- /*
- * X86_32
- * Valid to do another page fault here, because if this fault
- * had been triggered by is_prefetch fixup_exception would have
- * handled it.
- *
- * X86_64
- * Hall of shame of CPU/BIOS bugs.
- */
- if (is_prefetch(regs, address, error_code))
- return;
-
- if (is_errata93(regs, address))
- return;
-
-/*
- * Oops. The kernel tried to access some bad page. We'll have to
- * terminate things with extreme prejudice.
- */
-#ifdef CONFIG_X86_32
- bust_spinlocks(1);
-#else
- flags = oops_begin();
-#endif
-
- show_fault_oops(regs, error_code, address);
-
- tsk->thread.cr2 = address;
- tsk->thread.trap_no = 14;
- tsk->thread.error_code = error_code;
-
-#ifdef CONFIG_X86_32
- die("Oops", regs, error_code);
- bust_spinlocks(0);
- do_exit(SIGKILL);
-#else
- sig = SIGKILL;
- if (__die("Oops", regs, error_code))
- sig = 0;
- /* Executive summary in case the body of the oops scrolled away */
- printk(KERN_EMERG "CR2: %016lx\n", address);
- oops_end(flags, regs, sig);
-#endif
-
-/*
- * We ran out of memory, or some other thing happened to us that made
- * us unable to handle the page fault gracefully.
- */
-out_of_memory:
- up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- /*
- * Re-lookup the vma - in theory the vma tree might
- * have changed:
- */
- goto again;
- }
-
- printk("VM: killing process %s\n", tsk->comm);
- if (error_code & PF_USER)
- do_group_exit(SIGKILL);
- goto no_context;
-
-do_sigbus:
- up_read(&mm->mmap_sem);
-
- /* Kernel mode? Handle exceptions or die */
- if (!(error_code & PF_USER))
- goto no_context;
-#ifdef CONFIG_X86_32
- /* User space => ok to do another page fault */
- if (is_prefetch(regs, address, error_code))
- return;
-#endif
- tsk->thread.cr2 = address;
- tsk->thread.error_code = error_code;
- tsk->thread.trap_no = 14;
- force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}
DEFINE_SPINLOCK(pgd_lock);
* Re: [git pull] cpus4096 tree, part 3
2009-01-05 1:16 ` [git pull] cpus4096 tree, part 3 Nick Piggin
@ 2009-01-26 19:00 ` Andrew Morton
2009-01-26 19:09 ` Linus Torvalds
2009-01-26 20:09 ` Ingo Molnar
From: Andrew Morton @ 2009-01-26 19:00 UTC (permalink / raw)
To: Nick Piggin
Cc: mingo, linux-arch, torvalds, rusty, travis, linux-kernel,
venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx
On Mon, 5 Jan 2009 02:16:30 +0100
Nick Piggin <npiggin@suse.de> wrote:
> Really cc linux-arch this time
>
> On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote:
> > On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote:
> > >
> > > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > > What happened to Nick's cleanup patch to do_page_fault (a month or two
> > > > ago? I complained about some of the issues in his first version and
> > > > asked for some further cleanups, but I think that whole discussion ended
> > > > with him saying "I am going to add those changes that you suggested (in
> > > > fact, I already have)".
> > > >
> > > > And then I didn't see anything further. Maybe I just missed the end
> > > > result. Or maybe we have it in some -mm branch or something?
> > >
> > > they would have been in tip/x86/mm and would be upstream now had Nick
> > > re-sent a v2 series but that never happened. I think they might have
> > > fallen victim to a serious attention deficit caused by the SLQB patch ;-)
> >
> > Well, I already added Linus's suggestions but didn't submit it because
> > there was a bit of work going on in that file as far as I could see, both
> > in the x86 tree and in -mm:
> >
> > (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch)
> >
> > It isn't a big deal to resolve either way, but I don't want to make Andrew's
> > life harder.
> >
> > [Yes OK now I'm the guilty one of pushing in an x86 patch not via the
> > x86 tree ;) This one is easy to break in pieces, but I didn't want
> > to create a dependency between the trees]
> >
> > I didn't really consider it to be urgent, so I was waiting for that patch
> > to go in, but I was still hoping to get this into 2.6.29... This is what
> > it looks like now with your suggestions, and just merged it to your current
> > tree (untested).
> >
> > I'll cc the linux-arch list here too, because it might be nice to keep these
> > things as structurally similar as possible (and they'll all want to look at
> > the -mm patch above, although I'll probably end up having to write the
> > patches!).
>
> ---
> Optimise x86's do_page_fault (C entry point for the page fault path).
It took rather a lot of hunting to find this email. Please do try to
make the email subject match the final patch's title?
> * This routine handles page faults. It determines the address,
> * and the problem, and then passes it off to one of the appropriate
> @@ -583,16 +796,12 @@ asmlinkage
> #endif
> void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
> {
> + unsigned long address;
> struct task_struct *tsk;
> struct mm_struct *mm;
> struct vm_area_struct *vma;
> - unsigned long address;
> - int write, si_code;
> + int write;
> int fault;
> -#ifdef CONFIG_X86_64
> - unsigned long flags;
> - int sig;
> -#endif
>
> tsk = current;
> mm = tsk->mm;
> @@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r
> /* get the address */
> address = read_cr2();
>
> - si_code = SEGV_MAPERR;
> -
> - if (notify_page_fault(regs))
> + if (unlikely(notify_page_fault(regs)))
> return;
> if (unlikely(kmmio_fault(regs, address)))
> return;
> @@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r
> * Don't take the mm semaphore here. If we fixup a prefetch
> * fault we could otherwise deadlock.
> */
> - goto bad_area_nosemaphore;
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> }
>
> -
> /*
> * It's safe to allow irq's after cr2 has been saved and the
> * vmalloc fault has been handled.
> @@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r
>
> #ifdef CONFIG_X86_64
> if (unlikely(error_code & PF_RSVD))
> - pgtable_bad(address, regs, error_code);
> + pgtable_bad(regs, error_code, address);
> #endif
>
> /*
> * If we're in an interrupt, have no user context or are running in an
> * atomic region then we must not take the fault.
> */
> - if (unlikely(in_atomic() || !mm))
> - goto bad_area_nosemaphore;
> + if (unlikely(in_atomic() || !mm)) {
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> + }
>
> -again:
> /*
> * When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> @@ -684,20 +892,26 @@ again:
> * source. If this is invalid we can skip the address space check,
> * thus avoiding the deadlock.
> */
> - if (!down_read_trylock(&mm->mmap_sem)) {
> + if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> if ((error_code & PF_USER) == 0 &&
> - !search_exception_tables(regs->ip))
> - goto bad_area_nosemaphore;
> + !search_exception_tables(regs->ip)) {
> + bad_area_nosemaphore(regs, error_code, address);
> + return;
> + }
> down_read(&mm->mmap_sem);
> }
>
> vma = find_vma(mm, address);
> - if (!vma)
> - goto bad_area;
> - if (vma->vm_start <= address)
> + if (unlikely(!vma)) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> + if (likely(vma->vm_start <= address))
> goto good_area;
> - if (!(vma->vm_flags & VM_GROWSDOWN))
> - goto bad_area;
> + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> if (error_code & PF_USER) {
> /*
> * Accessing the stack below %sp is always a bug.
> @@ -705,31 +919,25 @@ again:
> * and pusha to work. ("enter $65535,$31" pushes
> * 32 pointers and then decrements %sp by 65535.)
> */
> - if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp)
> - goto bad_area;
> + if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> }
> - if (expand_stack(vma, address))
> - goto bad_area;
> -/*
> - * Ok, we have a good vm_area for this memory access, so
> - * we can handle it..
> - */
> + if (unlikely(expand_stack(vma, address))) {
> + bad_area(regs, error_code, address);
> + return;
> + }
> +
> + /*
> + * Ok, we have a good vm_area for this memory access, so
> + * we can handle it..
> + */
> good_area:
> - si_code = SEGV_ACCERR;
> - write = 0;
> - switch (error_code & (PF_PROT|PF_WRITE)) {
> - default: /* 3: write, present */
> - /* fall through */
> - case PF_WRITE: /* write, not present */
> - if (!(vma->vm_flags & VM_WRITE))
> - goto bad_area;
> - write++;
> - break;
> - case PF_PROT: /* read, present */
> - goto bad_area;
> - case 0: /* read, not present */
> - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
> - goto bad_area;
> + write = error_code & PF_WRITE;
What's going on here? We set `error_code' to PF_WRITE, which is some
x86-specific thing.
> + if (unlikely(access_error(error_code, write, vma))) {
> + bad_area_access_error(regs, error_code, address);
> + return;
> }
>
> /*
> @@ -739,11 +947,8 @@ good_area:
> */
> fault = handle_mm_fault(mm, vma, address, write);
and then pass it into handle_mm_fault(), which is expecting a bunch of
flags in the FAULT_FLAG_foo domain.
IOW, the code will accidentally set FAULT_FLAG_NONLINEAR!
Methinks we want something like this,
--- a/arch/x86/mm/fault.c~fix-x86-optimise-x86s-do_page_fault-c-entry-point-for-the-page-fault-path
+++ a/arch/x86/mm/fault.c
@@ -942,7 +942,7 @@ void __kprobes do_page_fault(struct pt_r
* we can handle it..
*/
good_area:
- write = error_code & PF_WRITE;
+ write = (error_code & PF_WRITE) ? FAULT_FLAG_WRITE : 0;
if (unlikely(access_error(error_code, write, vma))) {
bad_area_access_error(regs, error_code, address);
return;
_
but why did the current code pass testing at all??
* Re: [git pull] cpus4096 tree, part 3
2009-01-26 19:00 ` Andrew Morton
@ 2009-01-26 19:09 ` Linus Torvalds
2009-01-26 19:30 ` Andrew Morton
2009-01-26 20:09 ` Ingo Molnar
From: Linus Torvalds @ 2009-01-26 19:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, mingo, linux-arch, rusty, travis, linux-kernel,
venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx
On Mon, 26 Jan 2009, Andrew Morton wrote:
> > + write = error_code & PF_WRITE;
>
> What's going on here? We set `error_code' to PF_WRITE, which is some
> x86-specific thing.
No. We set "write" to non-zero if it was a write fault.
> > fault = handle_mm_fault(mm, vma, address, write);
>
> and then pass it into handle_mm_fault(), which is expecting a bunch of
> flags in the FAULT_FLAG_foo domain.
No. "handle_mm_fault()" takes an integer that is non-zero if it's a write,
zero if it's a read. That's how it has _always_ worked.
I don't see where you find that FAULT_FLAG_foo thing. That's much deeper
down, when people do things like
unsigned int flags = FAULT_FLAG_NONLINEAR |
(write_access ? FAULT_FLAG_WRITE : 0);
based on that whole "write_access" flag.
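That convention can be shown in a few lines. A userspace sketch of the idea
(the flag values and the helper name here are illustrative, not the real mm
code): the arch passes a plain zero/non-zero write_access, and only the deeper
generic layer builds the FAULT_FLAG_* mask from it.

```c
#include <assert.h>

/* Illustrative flag values; the real kernel definitions may differ. */
#define FAULT_FLAG_WRITE	0x01
#define FAULT_FLAG_NONLINEAR	0x02

/* Deeper layers derive the flag mask from the zero/non-zero write_access. */
static unsigned int pte_fault_flags(int write_access, int nonlinear)
{
	unsigned int flags = write_access ? FAULT_FLAG_WRITE : 0;

	if (nonlinear)
		flags |= FAULT_FLAG_NONLINEAR;
	return flags;
}
```

So any non-zero value (such as error_code & PF_WRITE) is a valid
write_access argument at the do_page_fault() level; it is never interpreted
as a FAULT_FLAG_* mask there.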
Linus
* Re: [git pull] cpus4096 tree, part 3
2009-01-26 19:09 ` Linus Torvalds
@ 2009-01-26 19:30 ` Andrew Morton
From: Andrew Morton @ 2009-01-26 19:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: npiggin, mingo, linux-arch, rusty, travis, linux-kernel,
venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, Ying Han,
Mike Waychison
On Mon, 26 Jan 2009 11:09:59 -0800 (PST)
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Mon, 26 Jan 2009, Andrew Morton wrote:
> > > + write = error_code & PF_WRITE;
> >
> > What's going on here? We set `error_code' to PF_WRITE, which is some
> > x86-specific thing.
>
> No. We set "write" to non-zero if it was a write fault.
>
> > > fault = handle_mm_fault(mm, vma, address, write);
> >
> > and then pass it into handle_mm_fault(), which is expecting a bunch of
> > flags in the FAULT_FLAG_foo domain.
>
> No. "handle_mm_fault()" takes an integer that is non-zero if it's a write,
> zero if it's a read. That's how it has _always_ worked.
>
> I don't see where you find that FAULT_FLAG_foo thing. That's much deeper
> down, when people do things like
>
> unsigned int flags = FAULT_FLAG_NONLINEAR |
> (write_access ? FAULT_FLAG_WRITE : 0);
>
> based on that whole "write_access" flag.
>
OK, thanks. It's actually page_fault-retry-with-nopage_retry.patch
which got those things confused, and then confused me. I'll go address
that in the other thread..
* Re: [git pull] cpus4096 tree, part 3
2009-01-26 19:00 ` Andrew Morton
2009-01-26 19:09 ` Linus Torvalds
@ 2009-01-26 20:09 ` Ingo Molnar
2009-01-26 20:44 ` Andrew Morton
From: Ingo Molnar @ 2009-01-26 20:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Nick Piggin, linux-arch, torvalds, rusty, travis, linux-kernel,
venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx
* Andrew Morton <akpm@linux-foundation.org> wrote:
> but why did the current code pass testing at all??
i queued it up a week ago and beyond a same-day breakage i reported to
Nick (and which he fixed) this commit was problem-free and passed all
testing here.
Does it cause problems for you? If yes then please describe the kind of
problems.
Note: i see that -mm modifies a few other details of the x86 pagefault
handling path (there a pagefault-retry patch in there) - so there might be
contextual interactions there. But this particular cleanup/improvement
from Nick is working fine on a wide range of systems here.
Btw., regarding pagefault retry. The bits that are in -mm currently i
find a bit ugly:
> +++ a/arch/x86/mm/fault.c
> @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r
> struct vm_area_struct *vma;
> int write;
> int fault;
> - unsigned int retry_flag = FAULT_FLAG_RETRY;
> + int retry_flag = 1;
>
> tsk = current;
> mm = tsk->mm;
> @@ -951,6 +951,7 @@ good_area:
> }
>
> write |= retry_flag;
> +
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -969,8 +970,8 @@ good_area:
> * be removed or changed after the retry.
> */
> if (fault & VM_FAULT_RETRY) {
> - if (write & FAULT_FLAG_RETRY) {
> - retry_flag &= ~FAULT_FLAG_RETRY;
> + if (retry_flag) {
> + retry_flag = 0;
> goto retry;
> }
> BUG();
as this complicates every architecture with a 'can the fault be retried'
logic and open-coded retry loop.
But that logic is rather repetitive and once an architecture filters out
all its special in-kernel sources of faults and the hw quirks it has, the
handling of pte faults is rather generic and largely offloaded into
handle_pte_fault() already.
So when this patch was submitted a few weeks ago i suggested that retry
should be done purely in mm/memory.c instead, and the low level code
should at most be refactored to suit this method, but not complicated any
further.
Any deep reasons for why such a more generic approach is not desirable?
Ingo
* Re: [git pull] cpus4096 tree, part 3
2009-01-26 20:09 ` Ingo Molnar
@ 2009-01-26 20:44 ` Andrew Morton
[not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com>
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2009-01-26 20:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: npiggin, linux-arch, torvalds, rusty, travis, linux-kernel,
venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, Ying Han,
Mike Waychison
On Mon, 26 Jan 2009 21:09:57 +0100
Ingo Molnar <mingo@elte.hu> wrote:
> ...
>
> Btw., regarding pagefault retry. The bits that are in -mm currently i
> find a bit ugly:
>
> > +++ a/arch/x86/mm/fault.c
> > @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r
> > struct vm_area_struct *vma;
> > int write;
> > int fault;
> > - unsigned int retry_flag = FAULT_FLAG_RETRY;
> > + int retry_flag = 1;
> >
> > tsk = current;
> > mm = tsk->mm;
> > @@ -951,6 +951,7 @@ good_area:
> > }
> >
> > write |= retry_flag;
> > +
> > /*
> > * If for any reason at all we couldn't handle the fault,
> > * make sure we exit gracefully rather than endlessly redo
> > @@ -969,8 +970,8 @@ good_area:
> > * be removed or changed after the retry.
> > */
> > if (fault & VM_FAULT_RETRY) {
> > - if (write & FAULT_FLAG_RETRY) {
> > - retry_flag &= ~FAULT_FLAG_RETRY;
> > + if (retry_flag) {
> > + retry_flag = 0;
> > goto retry;
> > }
> > BUG();
>
> as this complicates every architecture with a 'can the fault be retried'
> logic and open-coded retry loop.
>
> But that logic is rather repetitive and once an architecture filters out
> all its special in-kernel sources of faults and the hw quirks it has, the
> handling of pte faults is rather generic and largely offloaded into
> handle_pte_fault() already.
>
> So when this patch was submitted a few weeks ago i suggested that retry
> should be done purely in mm/memory.c instead, and the low level code
> should at most be refactored to suit this method, but not complicated any
> further.
>
> Any deep reasons for why such a more generic approach is not desirable?
>
Let's cc the people who wrote it.
* Re: [git pull] cpus4096 tree, part 3
[not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com>
@ 2009-01-26 23:21 ` Ingo Molnar
2009-01-26 23:44 ` Andrew Morton
0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2009-01-26 23:21 UTC (permalink / raw)
To: Ying Han
Cc: Andrew Morton, npiggin, linux-arch, torvalds, rusty, travis,
linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa,
tglx, Mike Waychison
* Ying Han <yinghan@google.com> wrote:
> Thank you Ingo and Andrew for the comments. I will take a look into it
> ASAP and update it here.
Note, my objection wasn't a hard NAK - just an observation. If all things
considered Andrew still favors the VM_FAULT_RETRY approach then that's
fine too i guess.
It's just that a quick look gave me the feeling of a retry flag tacked on
to an existing codepath [and all the micro-overhead and complexity that
this brings], instead of a clean refactoring of pagefault handling
functionality into a higher MM level retry loop.
So the alternative has to be looked at and rejected because it's
technically inferior - not because it's more difficult to implement.
(which it certainly is)
Ingo
* Re: [git pull] cpus4096 tree, part 3
2009-01-26 23:21 ` Ingo Molnar
@ 2009-01-26 23:44 ` Andrew Morton
0 siblings, 0 replies; 8+ messages in thread
From: Andrew Morton @ 2009-01-26 23:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: yinghan, npiggin, linux-arch, torvalds, rusty, travis,
linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa,
tglx, mikew
On Tue, 27 Jan 2009 00:21:39 +0100
Ingo Molnar <mingo@elte.hu> wrote:
>
> * Ying Han <yinghan@google.com> wrote:
>
> > Thank you Ingo and Andrew for the comments. I will take a look into it
> > ASAP and update it here.
>
> Note, my objection wasn't a hard NAK - just an observation. If all things
> considered Andrew still favors the VM_FAULT_RETRY approach then that's
> fine too i guess.
>
> It's just that a quick look gave me the feeling of a retry flag tacked on
> to an existing codepath [and all the micro-overhead and complexity that
> this brings], instead of a clean refactoring of pagefault handling
> functionality into a higher MM level retry loop.
>
> So the alternative has to be looked at and rejected because it's
> technically inferior - not because it's more difficult to implement.
> (which it certainly is)
>
I have wobbly feelings about this patch. There are your issues, and a
long string of problems and fixes. And my recent half-assed
linux-next-related fix which I didn't really think about.
It all needs a revisit/rereview/reunderstand cycle.
end of thread, other threads:[~2009-01-26 23:45 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <alpine.LFD.2.00.0901021149020.5086@localhost.localdomain>
[not found] ` <20090102203839.GA26850@elte.hu>
[not found] ` <alpine.LFD.2.00.0901021531140.3179@localhost.localdomain>
[not found] ` <20090103193859.GB9805@elte.hu>
[not found] ` <alpine.LFD.2.00.0901031225020.3179@localhost.localdomain>
[not found] ` <20090103203621.GA2491@elte.hu>
[not found] ` <20090103213856.GA24138@elte.hu>
[not found] ` <alpine.LFD.2.00.0901031347570.3179@localhost.localdomain>
[not found] ` <20090103223723.GA17047@elte.hu>
[not found] ` <20090105011416.GG32239@wotan.suse.de>
2009-01-05 1:16 ` [git pull] cpus4096 tree, part 3 Nick Piggin
2009-01-26 19:00 ` Andrew Morton
2009-01-26 19:09 ` Linus Torvalds
2009-01-26 19:30 ` Andrew Morton
2009-01-26 20:09 ` Ingo Molnar
2009-01-26 20:44 ` Andrew Morton
[not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com>
2009-01-26 23:21 ` Ingo Molnar
2009-01-26 23:44 ` Andrew Morton