* Re: [git pull] cpus4096 tree, part 3 [not found] ` <20090105011416.GG32239@wotan.suse.de> @ 2009-01-05 1:16 ` Nick Piggin 2009-01-26 19:00 ` Andrew Morton 0 siblings, 1 reply; 8+ messages in thread From: Nick Piggin @ 2009-01-05 1:16 UTC (permalink / raw) To: Ingo Molnar, linux-arch Cc: Linus Torvalds, Rusty Russell, Mike Travis, linux-kernel, Pallipadi, Venkatesh, Suresh Siddha, Arjan van de Ven, H. Peter Anvin, Thomas Gleixner Really cc linux-arch this time On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote: > On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote: > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > What happened to Nick's cleanup patch to do_page_fault (a month or two > > > ago? I complained about some of the issues in his first version and > > > asked for some further cleanups, but I think that whole discussion ended > > > with him saying "I am going to add those changes that you suggested (in > > > fact, I already have)". > > > > > > And then I didn't see anything further. Maybe I just missed the end > > > result. Or maybe we have it in some -mm branch or something? > > > > they would have been in tip/x86/mm and would be upstream now had Nick > > re-sent a v2 series but that never happened. I think they might have > > fallen victim to a serious attention deficit caused by the SLQB patch ;-) > > Well, I already added Linus's suggestions but didn't submit it because > there was a bit of work going on in that file as far as I could see, both > in the x86 tree and in -mm: > > (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch) > > It isn't a big deal to resolve either way, but I don't want to make Andrew's > life harder. > > [Yes OK now I'm the guilty one of pushing in an x86 patch not via the > x86 tree ;) This one is easy to break in pieces, but I didn't want > to create a dependency between the trees] > > I didn't really consider it to be urgent, so I was waiting for that patch > to go in, but I was still hoping to get this into 2.6.29... This is what > it looks like now with your suggestions, and just merged it to your current > tree (untested). > > I'll cc the linux-arch list here too, because it might be nice to keep these > things as structurally similar as possible (and they'll all want to look at > the -mm patch above, although I'll probably end up having to write the > patches!). --- Optimise x86's do_page_fault (C entry point for the page fault path). gcc isn't _all_ that smart about spilling registers to stack or reusing stack slots, even with branch annotations. do_page_fault contained a lot of functionality, so split unlikely paths into their own functions, and mark them as noinline just to be sure. I consider this actually to be somewhat of a cleanup too: the main function now contains about half the number of lines. Also, ensure the order of arguments to functions is always the same: regs, addr, error_code. This can reduce code size a tiny bit, and just looks neater too. Add a couple of branch annotations. One real behavioural difference this makes is that the OOM-init-task case will no longer loop around the page fault handler, but will return to userspace and presumably retry the fault. Effectively the same macro-behaviour, but it is a notable difference. Such change in behaviour should disappear after the "call oom killer from page fault" patch. Before: do_page_fault: subq $360, %rsp #, After: do_page_fault: subq $56, %rsp #, bloat-o-meter: add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542) function old new delta __bad_area_nosemaphore - 506 +506 no_context - 474 +474 vmalloc_fault - 424 +424 spurious_fault - 358 +358 mm_fault_error - 272 +272 bad_area_access_error - 89 +89 bad_area - 89 +89 bad_area_nosemaphore - 10 +10 do_page_fault 2464 784 -1680 Yes, the total size increases by 542 bytes, due to the extra function calls. But these will very rarely be called (except for vmalloc_fault) in a normal workload. Importantly, do_page_fault is less than 1/3rd it's original size. Existing gotos and branch hints did move a lot of the infrequently used text out of the fastpath, but that's even further improved after this patch. --- arch/x86/mm/fault.c | 458 ++++++++++++++++++++++++++++++---------------------- 1 file changed, 265 insertions(+), 193 deletions(-) Index: linux-2.6/arch/x86/mm/fault.c =================================================================== --- linux-2.6.orig/arch/x86/mm/fault.c +++ linux-2.6/arch/x86/mm/fault.c @@ -91,8 +91,8 @@ static inline int notify_page_fault(stru * * Opcode checker based on code by Richard Brunner */ -static int is_prefetch(struct pt_regs *regs, unsigned long addr, - unsigned long error_code) +static int is_prefetch(struct pt_regs *regs, unsigned long error_code, + unsigned long addr) { unsigned char *instr; int scan_more = 1; @@ -409,15 +409,15 @@ static void show_fault_oops(struct pt_re } #ifdef CONFIG_X86_64 -static noinline void pgtable_bad(unsigned long address, struct pt_regs *regs, - unsigned long error_code) +static noinline void pgtable_bad(struct pt_regs *regs, + unsigned long error_code, unsigned long address) { unsigned long flags = oops_begin(); int sig = SIGKILL; - struct task_struct *tsk; + struct task_struct *tsk = current; printk(KERN_ALERT "%s: Corrupted page table at address %lx\n", - current->comm, address); + tsk->comm, address); dump_pagetable(address); tsk = current; tsk->thread.cr2 = address; @@ -429,6 +429,200 @@ static noinline void pgtable_bad(unsigne } #endif +static noinline void no_context(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + struct task_struct *tsk = current; +#ifdef CONFIG_X86_64 + unsigned long flags; + int sig; +#endif + + /* Are we prepared to handle this kernel fault? */ + if (fixup_exception(regs)) + return; + + /* + * X86_32 + * Valid to do another page fault here, because if this fault + * had been triggered by is_prefetch fixup_exception would have + * handled it. + * + * X86_64 + * Hall of shame of CPU/BIOS bugs. + */ + if (is_prefetch(regs, error_code, address)) + return; + + if (is_errata93(regs, address)) + return; + + /* + * Oops. The kernel tried to access some bad page. We'll have to + * terminate things with extreme prejudice. + */ +#ifdef CONFIG_X86_32 + bust_spinlocks(1); +#else + flags = oops_begin(); +#endif + + show_fault_oops(regs, error_code, address); + + tsk->thread.cr2 = address; + tsk->thread.trap_no = 14; + tsk->thread.error_code = error_code; + +#ifdef CONFIG_X86_32 + die("Oops", regs, error_code); + bust_spinlocks(0); + do_exit(SIGKILL); +#else + sig = SIGKILL; + if (__die("Oops", regs, error_code)) + sig = 0; + /* Executive summary in case the body of the oops scrolled away */ + printk(KERN_EMERG "CR2: %016lx\n", address); + oops_end(flags, regs, sig); +#endif +} + +static void __bad_area_nosemaphore(struct pt_regs *regs, + unsigned long error_code, unsigned long address, + int si_code) +{ + struct task_struct *tsk = current; + + /* User mode accesses just cause a SIGSEGV */ + if (error_code & PF_USER) { + /* + * It's possible to have interrupts off here. + */ + local_irq_enable(); + + /* + * Valid to do another page fault here because this one came + * from user space. + */ + if (is_prefetch(regs, error_code, address)) + return; + + if (is_errata100(regs, address)) + return; + + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) && + printk_ratelimit()) { + printk( + "%s%s[%d]: segfault at %lx ip %p sp %p error %lx", + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, + tsk->comm, task_pid_nr(tsk), address, + (void *) regs->ip, (void *) regs->sp, error_code); + print_vma_addr(" in ", regs->ip); + printk("\n"); + } + + tsk->thread.cr2 = address; + /* Kernel addresses are always protection faults */ + tsk->thread.error_code = error_code | (address >= TASK_SIZE); + tsk->thread.trap_no = 14; + force_sig_info_fault(SIGSEGV, si_code, address, tsk); + return; + } + + if (is_f00f_bug(regs, address)) + return; + + no_context(regs, error_code, address); +} + +static noinline void bad_area_nosemaphore(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + __bad_area_nosemaphore(regs, error_code, address, SEGV_MAPERR); +} + +static void __bad_area(struct pt_regs *regs, + unsigned long error_code, unsigned long address, + int si_code) +{ + struct mm_struct *mm = current->mm; + + /* + * Something tried to access memory that isn't in our memory map.. + * Fix it, but check if it's kernel or user first.. + */ + up_read(&mm->mmap_sem); + + __bad_area_nosemaphore(regs, error_code, address, si_code); +} + +static noinline void bad_area(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + __bad_area(regs, error_code, address, SEGV_MAPERR); +} + +static noinline void bad_area_access_error(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + __bad_area(regs, error_code, address, SEGV_ACCERR); +} + +/* TODO: fixup for "mm-invoke-oom-killer-from-page-fault.patch" */ +static void out_of_memory(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + struct task_struct *tsk = current; + struct mm_struct *mm = tsk->mm; + /* + * We ran out of memory, or some other thing happened to us that made + * us unable to handle the page fault gracefully. + */ + up_read(&mm->mmap_sem); + if (is_global_init(tsk)) { + yield(); + return; + } + + printk("VM: killing process %s\n", tsk->comm); + if (error_code & PF_USER) + do_group_exit(SIGKILL); + no_context(regs, error_code, address); +} + +static void do_sigbus(struct pt_regs *regs, + unsigned long error_code, unsigned long address) +{ + struct task_struct *tsk = current; + struct mm_struct *mm = tsk->mm; + + up_read(&mm->mmap_sem); + + /* Kernel mode? Handle exceptions or die */ + if (!(error_code & PF_USER)) + no_context(regs, error_code, address); +#ifdef CONFIG_X86_32 + /* User space => ok to do another page fault */ + if (is_prefetch(regs, error_code, address)) + return; +#endif + tsk->thread.cr2 = address; + tsk->thread.error_code = error_code; + tsk->thread.trap_no = 14; + force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); +} + +static noinline void mm_fault_error(struct pt_regs *regs, + unsigned long error_code, unsigned long address, unsigned int fault) +{ + if (fault & VM_FAULT_OOM) + out_of_memory(regs, error_code, address); + else if (fault & VM_FAULT_SIGBUS) + do_sigbus(regs, error_code, address); + else + BUG(); +} + static int spurious_fault_check(unsigned long error_code, pte_t *pte) { if ((error_code & PF_WRITE) && !pte_write(*pte)) @@ -448,8 +642,8 @@ static int spurious_fault_check(unsigned * There are no security implications to leaving a stale TLB when * increasing the permissions on a page. */ -static int spurious_fault(unsigned long address, - unsigned long error_code) +static noinline int spurious_fault(unsigned long error_code, + unsigned long address) { pgd_t *pgd; pud_t *pud; @@ -494,7 +688,7 @@ static int spurious_fault(unsigned long * * This assumes no large pages in there. */ -static int vmalloc_fault(unsigned long address) +static noinline int vmalloc_fault(unsigned long address) { #ifdef CONFIG_X86_32 unsigned long pgd_paddr; @@ -573,6 +767,25 @@ static int vmalloc_fault(unsigned long a int show_unhandled_signals = 1; +static inline int access_error(unsigned long error_code, int write, + struct vm_area_struct *vma) +{ + if (write) { + /* write, present and write, not present */ + if (unlikely(!(vma->vm_flags & VM_WRITE))) + return 1; + } else if (unlikely(error_code & PF_PROT)) { + /* read, present */ + return 1; + } else { + /* read, not present */ + if (unlikely(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))) + return 1; + } + + return 0; +} + /* * This routine handles page faults. It determines the address, * and the problem, and then passes it off to one of the appropriate @@ -583,16 +796,12 @@ asmlinkage #endif void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) { + unsigned long address; struct task_struct *tsk; struct mm_struct *mm; struct vm_area_struct *vma; - unsigned long address; - int write, si_code; + int write; int fault; -#ifdef CONFIG_X86_64 - unsigned long flags; - int sig; -#endif tsk = current; mm = tsk->mm; @@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r /* get the address */ address = read_cr2(); - si_code = SEGV_MAPERR; - - if (notify_page_fault(regs)) + if (unlikely(notify_page_fault(regs))) return; if (unlikely(kmmio_fault(regs, address))) return; @@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r * Don't take the mm semaphore here. If we fixup a prefetch * fault we could otherwise deadlock. */ - goto bad_area_nosemaphore; + bad_area_nosemaphore(regs, error_code, address); + return; } - /* * It's safe to allow irq's after cr2 has been saved and the * vmalloc fault has been handled. @@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r #ifdef CONFIG_X86_64 if (unlikely(error_code & PF_RSVD)) - pgtable_bad(address, regs, error_code); + pgtable_bad(regs, error_code, address); #endif /* * If we're in an interrupt, have no user context or are running in an * atomic region then we must not take the fault. */ - if (unlikely(in_atomic() || !mm)) - goto bad_area_nosemaphore; + if (unlikely(in_atomic() || !mm)) { + bad_area_nosemaphore(regs, error_code, address); + return; + } -again: /* * When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the @@ -684,20 +892,26 @@ again: * source. If this is invalid we can skip the address space check, * thus avoiding the deadlock. */ - if (!down_read_trylock(&mm->mmap_sem)) { + if (unlikely(!down_read_trylock(&mm->mmap_sem))) { if ((error_code & PF_USER) == 0 && - !search_exception_tables(regs->ip)) - goto bad_area_nosemaphore; + !search_exception_tables(regs->ip)) { + bad_area_nosemaphore(regs, error_code, address); + return; + } down_read(&mm->mmap_sem); } vma = find_vma(mm, address); - if (!vma) - goto bad_area; - if (vma->vm_start <= address) + if (unlikely(!vma)) { + bad_area(regs, error_code, address); + return; + } + if (likely(vma->vm_start <= address)) goto good_area; - if (!(vma->vm_flags & VM_GROWSDOWN)) - goto bad_area; + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { + bad_area(regs, error_code, address); + return; + } if (error_code & PF_USER) { /* * Accessing the stack below %sp is always a bug. @@ -705,31 +919,25 @@ again: * and pusha to work. ("enter $65535,$31" pushes * 32 pointers and then decrements %sp by 65535.) */ - if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp) - goto bad_area; + if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) { + bad_area(regs, error_code, address); + return; + } } - if (expand_stack(vma, address)) - goto bad_area; -/* - * Ok, we have a good vm_area for this memory access, so - * we can handle it.. - */ + if (unlikely(expand_stack(vma, address))) { + bad_area(regs, error_code, address); + return; + } + + /* + * Ok, we have a good vm_area for this memory access, so + * we can handle it.. + */ good_area: - si_code = SEGV_ACCERR; - write = 0; - switch (error_code & (PF_PROT|PF_WRITE)) { - default: /* 3: write, present */ - /* fall through */ - case PF_WRITE: /* write, not present */ - if (!(vma->vm_flags & VM_WRITE)) - goto bad_area; - write++; - break; - case PF_PROT: /* read, present */ - goto bad_area; - case 0: /* read, not present */ - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) - goto bad_area; + write = error_code & PF_WRITE; + if (unlikely(access_error(error_code, write, vma))) { + bad_area_access_error(regs, error_code, address); + return; } /* @@ -739,11 +947,8 @@ good_area: */ fault = handle_mm_fault(mm, vma, address, write); if (unlikely(fault & VM_FAULT_ERROR)) { - if (fault & VM_FAULT_OOM) - goto out_of_memory; - else if (fault & VM_FAULT_SIGBUS) - goto do_sigbus; - BUG(); + mm_fault_error(regs, error_code, address, fault); + return; } if (fault & VM_FAULT_MAJOR) tsk->maj_flt++; @@ -761,139 +966,6 @@ good_area: } #endif up_read(&mm->mmap_sem); - return; - -/* - * Something tried to access memory that isn't in our memory map.. - * Fix it, but check if it's kernel or user first.. - */ -bad_area: - up_read(&mm->mmap_sem); - -bad_area_nosemaphore: - /* User mode accesses just cause a SIGSEGV */ - if (error_code & PF_USER) { - /* - * It's possible to have interrupts off here. - */ - local_irq_enable(); - - /* - * Valid to do another page fault here because this one came - * from user space. - */ - if (is_prefetch(regs, address, error_code)) - return; - - if (is_errata100(regs, address)) - return; - - if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) && - printk_ratelimit()) { - printk( - "%s%s[%d]: segfault at %lx ip %p sp %p error %lx", - task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, - tsk->comm, task_pid_nr(tsk), address, - (void *) regs->ip, (void *) regs->sp, error_code); - print_vma_addr(" in ", regs->ip); - printk("\n"); - } - - tsk->thread.cr2 = address; - /* Kernel addresses are always protection faults */ - tsk->thread.error_code = error_code | (address >= TASK_SIZE); - tsk->thread.trap_no = 14; - force_sig_info_fault(SIGSEGV, si_code, address, tsk); - return; - } - - if (is_f00f_bug(regs, address)) - return; - -no_context: - /* Are we prepared to handle this kernel fault? */ - if (fixup_exception(regs)) - return; - - /* - * X86_32 - * Valid to do another page fault here, because if this fault - * had been triggered by is_prefetch fixup_exception would have - * handled it. - * - * X86_64 - * Hall of shame of CPU/BIOS bugs. - */ - if (is_prefetch(regs, address, error_code)) - return; - - if (is_errata93(regs, address)) - return; - -/* - * Oops. The kernel tried to access some bad page. We'll have to - * terminate things with extreme prejudice. - */ -#ifdef CONFIG_X86_32 - bust_spinlocks(1); -#else - flags = oops_begin(); -#endif - - show_fault_oops(regs, error_code, address); - - tsk->thread.cr2 = address; - tsk->thread.trap_no = 14; - tsk->thread.error_code = error_code; - -#ifdef CONFIG_X86_32 - die("Oops", regs, error_code); - bust_spinlocks(0); - do_exit(SIGKILL); -#else - sig = SIGKILL; - if (__die("Oops", regs, error_code)) - sig = 0; - /* Executive summary in case the body of the oops scrolled away */ - printk(KERN_EMERG "CR2: %016lx\n", address); - oops_end(flags, regs, sig); -#endif - -/* - * We ran out of memory, or some other thing happened to us that made - * us unable to handle the page fault gracefully. - */ -out_of_memory: - up_read(&mm->mmap_sem); - if (is_global_init(tsk)) { - yield(); - /* - * Re-lookup the vma - in theory the vma tree might - * have changed: - */ - goto again; - } - - printk("VM: killing process %s\n", tsk->comm); - if (error_code & PF_USER) - do_group_exit(SIGKILL); - goto no_context; - -do_sigbus: - up_read(&mm->mmap_sem); - - /* Kernel mode? Handle exceptions or die */ - if (!(error_code & PF_USER)) - goto no_context; -#ifdef CONFIG_X86_32 - /* User space => ok to do another page fault */ - if (is_prefetch(regs, address, error_code)) - return; -#endif - tsk->thread.cr2 = address; - tsk->thread.error_code = error_code; - tsk->thread.trap_no = 14; - force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk); } DEFINE_SPINLOCK(pgd_lock); ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-05 1:16 ` [git pull] cpus4096 tree, part 3 Nick Piggin @ 2009-01-26 19:00 ` Andrew Morton 2009-01-26 19:09 ` Linus Torvalds 2009-01-26 20:09 ` Ingo Molnar 0 siblings, 2 replies; 8+ messages in thread From: Andrew Morton @ 2009-01-26 19:00 UTC (permalink / raw) To: Nick Piggin Cc: mingo, linux-arch, torvalds, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx On Mon, 5 Jan 2009 02:16:30 +0100 Nick Piggin <npiggin@suse.de> wrote: > Really cc linux-arch this time > > On Mon, Jan 05, 2009 at 02:14:16AM +0100, Nick Piggin wrote: > > On Sat, Jan 03, 2009 at 11:37:23PM +0100, Ingo Molnar wrote: > > > > > > * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > > > > > What happened to Nick's cleanup patch to do_page_fault (a month or two > > > > ago? I complained about some of the issues in his first version and > > > > asked for some further cleanups, but I think that whole discussion ended > > > > with him saying "I am going to add those changes that you suggested (in > > > > fact, I already have)". > > > > > > > > And then I didn't see anything further. Maybe I just missed the end > > > > result. Or maybe we have it in some -mm branch or something? > > > > > > they would have been in tip/x86/mm and would be upstream now had Nick > > > re-sent a v2 series but that never happened. I think they might have > > > fallen victim to a serious attention deficit caused by the SLQB patch ;-) > > > > Well, I already added Linus's suggestions but didn't submit it because > > there was a bit of work going on in that file as far as I could see, both > > in the x86 tree and in -mm: > > > > (http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/broken-out/mm-invoke-oom-killer-from-page-fault.patch) > > > > It isn't a big deal to resolve either way, but I don't want to make Andrew's > > life harder. > > > > [Yes OK now I'm the guilty one of pushing in an x86 patch not via the > > x86 tree ;) This one is easy to break in pieces, but I didn't want > > to create a dependency between the trees] > > > > I didn't really consider it to be urgent, so I was waiting for that patch > > to go in, but I was still hoping to get this into 2.6.29... This is what > > it looks like now with your suggestions, and just merged it to your current > > tree (untested). > > > > I'll cc the linux-arch list here too, because it might be nice to keep these > > things as structurally similar as possible (and they'll all want to look at > > the -mm patch above, although I'll probably end up having to write the > > patches!). > > --- > Optimise x86's do_page_fault (C entry point for the page fault path). It took rather a lot of hunting to find this email. Please do try to make the email subject match the final patch's title? > * This routine handles page faults. It determines the address, > * and the problem, and then passes it off to one of the appropriate > @@ -583,16 +796,12 @@ asmlinkage > #endif > void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code) > { > + unsigned long address; > struct task_struct *tsk; > struct mm_struct *mm; > struct vm_area_struct *vma; > - unsigned long address; > - int write, si_code; > + int write; > int fault; > -#ifdef CONFIG_X86_64 > - unsigned long flags; > - int sig; > -#endif > > tsk = current; > mm = tsk->mm; > @@ -601,9 +810,7 @@ void __kprobes do_page_fault(struct pt_r > /* get the address */ > address = read_cr2(); > > - si_code = SEGV_MAPERR; > - > - if (notify_page_fault(regs)) > + if (unlikely(notify_page_fault(regs))) > return; > if (unlikely(kmmio_fault(regs, address))) > return; > @@ -638,10 +845,10 @@ void __kprobes do_page_fault(struct pt_r > * Don't take the mm semaphore here. If we fixup a prefetch > * fault we could otherwise deadlock. > */ > - goto bad_area_nosemaphore; > + bad_area_nosemaphore(regs, error_code, address); > + return; > } > > - > /* > * It's safe to allow irq's after cr2 has been saved and the > * vmalloc fault has been handled. > @@ -657,17 +864,18 @@ void __kprobes do_page_fault(struct pt_r > > #ifdef CONFIG_X86_64 > if (unlikely(error_code & PF_RSVD)) > - pgtable_bad(address, regs, error_code); > + pgtable_bad(regs, error_code, address); > #endif > > /* > * If we're in an interrupt, have no user context or are running in an > * atomic region then we must not take the fault. > */ > - if (unlikely(in_atomic() || !mm)) > - goto bad_area_nosemaphore; > + if (unlikely(in_atomic() || !mm)) { > + bad_area_nosemaphore(regs, error_code, address); > + return; > + } > > -again: > /* > * When running in the kernel we expect faults to occur only to > * addresses in user space. All other faults represent errors in the > @@ -684,20 +892,26 @@ again: > * source. If this is invalid we can skip the address space check, > * thus avoiding the deadlock. > */ > - if (!down_read_trylock(&mm->mmap_sem)) { > + if (unlikely(!down_read_trylock(&mm->mmap_sem))) { > if ((error_code & PF_USER) == 0 && > - !search_exception_tables(regs->ip)) > - goto bad_area_nosemaphore; > + !search_exception_tables(regs->ip)) { > + bad_area_nosemaphore(regs, error_code, address); > + return; > + } > down_read(&mm->mmap_sem); > } > > vma = find_vma(mm, address); > - if (!vma) > - goto bad_area; > - if (vma->vm_start <= address) > + if (unlikely(!vma)) { > + bad_area(regs, error_code, address); > + return; > + } > + if (likely(vma->vm_start <= address)) > goto good_area; > - if (!(vma->vm_flags & VM_GROWSDOWN)) > - goto bad_area; > + if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) { > + bad_area(regs, error_code, address); > + return; > + } > if (error_code & PF_USER) { > /* > * Accessing the stack below %sp is always a bug. > @@ -705,31 +919,25 @@ again: > * and pusha to work. ("enter $65535,$31" pushes > * 32 pointers and then decrements %sp by 65535.) > */ > - if (address + 65536 + 32 * sizeof(unsigned long) < regs->sp) > - goto bad_area; > + if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) { > + bad_area(regs, error_code, address); > + return; > + } > } > - if (expand_stack(vma, address)) > - goto bad_area; > -/* > - * Ok, we have a good vm_area for this memory access, so > - * we can handle it.. > - */ > + if (unlikely(expand_stack(vma, address))) { > + bad_area(regs, error_code, address); > + return; > + } > + > + /* > + * Ok, we have a good vm_area for this memory access, so > + * we can handle it.. > + */ > good_area: > - si_code = SEGV_ACCERR; > - write = 0; > - switch (error_code & (PF_PROT|PF_WRITE)) { > - default: /* 3: write, present */ > - /* fall through */ > - case PF_WRITE: /* write, not present */ > - if (!(vma->vm_flags & VM_WRITE)) > - goto bad_area; > - write++; > - break; > - case PF_PROT: /* read, present */ > - goto bad_area; > - case 0: /* read, not present */ > - if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE))) > - goto bad_area; > + write = error_code & PF_WRITE; What's going on here? We set `error_code' to PF_WRITE, which is some x86-specific thing. > + if (unlikely(access_error(error_code, write, vma))) { > + bad_area_access_error(regs, error_code, address); > + return; > } > > /* > @@ -739,11 +947,8 @@ good_area: > */ > fault = handle_mm_fault(mm, vma, address, write); and then pass it into handle_mm_fault(), which is expecting a bunch of flags in the FAULT_FLAG_foo domain. IOW, the code will accidentally set FAULT_FLAG_NONLINEAR!. Methinks we want something like this, --- a/arch/x86/mm/fault.c~fix-x86-optimise-x86s-do_page_fault-c-entry-point-for-the-page-fault-path +++ a/arch/x86/mm/fault.c @@ -942,7 +942,7 @@ void __kprobes do_page_fault(struct pt_r * we can handle it.. */ good_area: - write = error_code & PF_WRITE; + write = (error_code & PF_WRITE) ? FAULT_FLAG_WRITE : 0; if (unlikely(access_error(error_code, write, vma))) { bad_area_access_error(regs, error_code, address); return; _ but why did the current code pass testing at all?? ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-26 19:00 ` Andrew Morton @ 2009-01-26 19:09 ` Linus Torvalds 2009-01-26 19:30 ` Andrew Morton 2009-01-26 20:09 ` Ingo Molnar 1 sibling, 1 reply; 8+ messages in thread From: Linus Torvalds @ 2009-01-26 19:09 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, mingo, linux-arch, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx On Mon, 26 Jan 2009, Andrew Morton wrote: > > + write = error_code & PF_WRITE; > > What's going on here? We set `error_code' to PF_WRITE, which is some > x86-specific thing. No. We set "write" to non-zero if it was a write fault. > > fault = handle_mm_fault(mm, vma, address, write); > > and then pass it into handle_mm_fault(), which is expecting a bunch of > flags in the FAULT_FLAG_foo domain. No. "handle_mm_fault()" takes an integer that is non-zero if it's a write, zero if it's a read. That's how it has _always_ worked. I don't see where you find that FAULT_FLAG_foo thing. That's much deeper down, when people do things like unsigned int flags = FAULT_FLAG_NONLINEAR | (write_access ? FAULT_FLAG_WRITE : 0); based on that whole "write_access" flag. Linus ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-26 19:09 ` Linus Torvalds @ 2009-01-26 19:30 ` Andrew Morton 0 siblings, 0 replies; 8+ messages in thread From: Andrew Morton @ 2009-01-26 19:30 UTC (permalink / raw) To: Linus Torvalds Cc: npiggin, mingo, linux-arch, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, Ying Han, Mike Waychison On Mon, 26 Jan 2009 11:09:59 -0800 (PST) Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Mon, 26 Jan 2009, Andrew Morton wrote: > > > + write = error_code & PF_WRITE; > > > > What's going on here? We set `error_code' to PF_WRITE, which is some > > x86-specific thing. > > No. We set "write" to non-zero if it was a write fault. > > > > fault = handle_mm_fault(mm, vma, address, write); > > > > and then pass it into handle_mm_fault(), which is expecting a bunch of > > flags in the FAULT_FLAG_foo domain. > > No. "handle_mm_fault()" takes an integer that is non-zero if it's a write, > zero if it's a read. That's how it has _always_ worked. > > I don't see where you find that FAULT_FLAG_foo thing. That's much deeper > down, when people do things like > > unsigned int flags = FAULT_FLAG_NONLINEAR | > (write_access ? FAULT_FLAG_WRITE : 0); > > based on that whole "write_access" flag. > OK, thanks. It's actually page_fault-retry-with-nopage_retry.patch which got those things confused, and then confused me. I'll go address that in the other thread.. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-26 19:00 ` Andrew Morton 2009-01-26 19:09 ` Linus Torvalds @ 2009-01-26 20:09 ` Ingo Molnar 2009-01-26 20:44 ` Andrew Morton 1 sibling, 1 reply; 8+ messages in thread From: Ingo Molnar @ 2009-01-26 20:09 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, linux-arch, torvalds, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx * Andrew Morton <akpm@linux-foundation.org> wrote: > but why did the current code pass testing at all?? i queued it up a week ago and beyond a same-day breakage i reported to Nick (and which he fixed) this commit was problem-free and passed all testing here. Does it cause problems for you? If yes then please describe the kind of problems. Note: i see that -mm modifies a few other details of the x86 pagefault handling path (there a pagefault-retry patch in there) - so there might be contextual interactions there. But this particular cleanup/improvement from Nick is working fine on a wide range of systems here. Btw., regarding pagefault retry. The bits that are in -mm currently i find a bit ugly: > +++ a/arch/x86/mm/fault.c > @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r > struct vm_area_struct *vma; > int write; > int fault; > - unsigned int retry_flag = FAULT_FLAG_RETRY; > + int retry_flag = 1; > > tsk = current; > mm = tsk->mm; > @@ -951,6 +951,7 @@ good_area: > } > > write |= retry_flag; > + > /* > * If for any reason at all we couldn't handle the fault, > * make sure we exit gracefully rather than endlessly redo > @@ -969,8 +970,8 @@ good_area: > * be removed or changed after the retry. > */ > if (fault & VM_FAULT_RETRY) { > - if (write & FAULT_FLAG_RETRY) { > - retry_flag &= ~FAULT_FLAG_RETRY; > + if (retry_flag) { > + retry_flag = 0; > goto retry; > } > BUG(); as this complicates every architecture with a 'can the fault be retried' logic and open-coded retry loop. But that logic is rather repetitive and once an architecture filters out all its special in-kernel sources of faults and the hw quirks it has, the handling of pte faults is rather generic and largely offloaded into handle_pte_fault() already. So when this patch was submitted a few weeks ago i suggested that retry should be done purely in mm/memory.c instead, and the low level code should at most be refactored to suit this method, but not complicated any further. Any deep reasons for why such a more generic approach is not desirable? Ingo ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-26 20:09 ` Ingo Molnar @ 2009-01-26 20:44 ` Andrew Morton [not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com> 0 siblings, 1 reply; 8+ messages in thread From: Andrew Morton @ 2009-01-26 20:44 UTC (permalink / raw) To: Ingo Molnar Cc: npiggin, linux-arch, torvalds, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, Ying Han, Mike Waychison On Mon, 26 Jan 2009 21:09:57 +0100 Ingo Molnar <mingo@elte.hu> wrote: > ... > > Btw., regarding pagefault retry. The bits that are in -mm currently i > find a bit ugly: > > > +++ a/arch/x86/mm/fault.c > > @@ -799,7 +799,7 @@ void __kprobes do_page_fault(struct pt_r > > struct vm_area_struct *vma; > > int write; > > int fault; > > - unsigned int retry_flag = FAULT_FLAG_RETRY; > > + int retry_flag = 1; > > > > tsk = current; > > mm = tsk->mm; > > @@ -951,6 +951,7 @@ good_area: > > } > > > > write |= retry_flag; > > + > > /* > > * If for any reason at all we couldn't handle the fault, > > * make sure we exit gracefully rather than endlessly redo > > @@ -969,8 +970,8 @@ good_area: > > * be removed or changed after the retry. > > */ > > if (fault & VM_FAULT_RETRY) { > > - if (write & FAULT_FLAG_RETRY) { > > - retry_flag &= ~FAULT_FLAG_RETRY; > > + if (retry_flag) { > > + retry_flag = 0; > > goto retry; > > } > > BUG(); > > as this complicates every architecture with a 'can the fault be retried' > logic and open-coded retry loop. > > But that logic is rather repetitive and once an architecture filters out > all its special in-kernel sources of faults and the hw quirks it has, the > handling of pte faults is rather generic and largely offloaded into > handle_pte_fault() already. > > So when this patch was submitted a few weeks ago i suggested that retry > should be done purely in mm/memory.c instead, and the low level code > should at most be refactored to suit this method, but not complicated any > further. > > Any deep reasons for why such a more generic approach is not desirable? > Let's cc the people who wrote it. ^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com>]
* Re: [git pull] cpus4096 tree, part 3 [not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com> @ 2009-01-26 23:21 ` Ingo Molnar 2009-01-26 23:44 ` Andrew Morton 0 siblings, 1 reply; 8+ messages in thread From: Ingo Molnar @ 2009-01-26 23:21 UTC (permalink / raw) To: Ying Han Cc: Andrew Morton, npiggin, linux-arch, torvalds, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, Mike Waychison * Ying Han <yinghan@google.com> wrote: > Thank you Ingo and Andrew for the comments. I will take a look into it > ASAP and updates it here. Note, my objection wasnt a hard NAK - just an observation. If all things considered Andrew still favors the VM_FAULT_RETRY approach then that's fine too i guess. It's just that a quick look gave me the feeling of a retry flag tacked on to an existing codepath [and all the micro-overhead and complexity that this brings], instead of a clean refactoring of pagefault handling functionality into a higher MM level retry loop. So the alternative has to be looked at and rejected because it's technically inferior - not because it's more difficult to implement. (which it certainly is) Ingo ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [git pull] cpus4096 tree, part 3 2009-01-26 23:21 ` Ingo Molnar @ 2009-01-26 23:44 ` Andrew Morton 0 siblings, 0 replies; 8+ messages in thread From: Andrew Morton @ 2009-01-26 23:44 UTC (permalink / raw) To: Ingo Molnar Cc: yinghan, npiggin, linux-arch, torvalds, rusty, travis, linux-kernel, venkatesh.pallipadi, suresh.b.siddha, arjan, hpa, tglx, mikew On Tue, 27 Jan 2009 00:21:39 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > * Ying Han <yinghan@google.com> wrote: > > > Thank you Ingo and Andrew for the comments. I will take a look into it > > ASAP and updates it here. > > Note, my objection wasnt a hard NAK - just an observation. If all things > considered Andrew still favors the VM_FAULT_RETRY approach then that's > fine too i guess. > > It's just that a quick look gave me the feeling of a retry flag tacked on > to an existing codepath [and all the micro-overhead and complexity that > this brings], instead of a clean refactoring of pagefault handling > functionality into a higher MM level retry loop. > > So the alternative has to be looked at and rejected because it's > technically inferior - not because it's more difficult to implement. > (which it certainly is) > I have wobbly feelings about this patch. There are your issues, and a long string of problems and fixes. And my recent half-assed linux-next-related fix which I didn't really think about. It all needs a revisit/rereview/reunderstand cycle. ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2009-01-26 23:45 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <alpine.LFD.2.00.0901021149020.5086@localhost.localdomain>
[not found] ` <20090102203839.GA26850@elte.hu>
[not found] ` <alpine.LFD.2.00.0901021531140.3179@localhost.localdomain>
[not found] ` <20090103193859.GB9805@elte.hu>
[not found] ` <alpine.LFD.2.00.0901031225020.3179@localhost.localdomain>
[not found] ` <20090103203621.GA2491@elte.hu>
[not found] ` <20090103213856.GA24138@elte.hu>
[not found] ` <alpine.LFD.2.00.0901031347570.3179@localhost.localdomain>
[not found] ` <20090103223723.GA17047@elte.hu>
[not found] ` <20090105011416.GG32239@wotan.suse.de>
2009-01-05 1:16 ` [git pull] cpus4096 tree, part 3 Nick Piggin
2009-01-26 19:00 ` Andrew Morton
2009-01-26 19:09 ` Linus Torvalds
2009-01-26 19:30 ` Andrew Morton
2009-01-26 20:09 ` Ingo Molnar
2009-01-26 20:44 ` Andrew Morton
[not found] ` <604427e00901261312w23a1f0f5y61fc5c6cc70297fb@mail.gmail.com>
2009-01-26 23:21 ` Ingo Molnar
2009-01-26 23:44 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox