* [RFC] de-asmify the x86-64 system call slowpath
@ 2014-01-26 22:28 Linus Torvalds
  2014-01-27  0:22 ` Al Viro
  ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-26 22:28 UTC (permalink / raw)
To: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
Cc: the arch/x86 maintainers, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 2092 bytes --]

The x86-64 (and 32-bit, for that matter) system call slowpaths are all
in C, but the *selection* of which slow-path to take is a mixture of
complicated assembler ("sysret_check -> sysret_careful -> sysret_signal
-> sysret_audit -> int_check_syscall_exit_work" etc), and oddly named
and placed C code ("schedule_user" vs "__audit_syscall_exit" vs
"do_notify_resume").

The attached patch tries to take the "do_notify_resume()" approach,
renaming it to something sane ("syscall_exit_slowpath") and calling out
to *all* the different slow cases from that one place, instead of
having some cases hardcoded in asm, and some in C. And instead of
hardcoding which cases result in an "iretq" and which cases result in
the faster sysret case, it's now simply a return value from that
syscall_exit_slowpath() function, so it's very natural and easy to say
"taking a signal will force us to do the slow iretq case, but we can
do the task exit work and still do the sysret".

I've marked this as an RFC, because I didn't bother trying to clean up
the 32-bit code similarly (no test-cases, and trust me, if you get this
wrong, it will fail spectacularly but in very subtle and hard-to-debug
ways), and I also didn't bother with the slow cases in the "iretq"
path, so that path still has the odd asm cases and calls the old (now
legacy) do_notify_resume() path.

But this is actually tested, and seems to work (including limited
testing with strace, gdb etc), and while it adds a few more lines than
it removes, the removed lines are mostly asm, and the added lines are C
(and part of them are the temporary still-extant do_notify_resume()
wrapper). In particular, it should be fairly straightforward to take
this as a starting point, removing the extant
do_notify_resume/schedule/etc cases one by one, and get rid of more asm
code and finally the wrapper.

Comments? This was obviously brought on by my frustration with the
currently nasty do_notify_resume() always returning to iret for the
task_work case, and PeterZ's patch that fixed that, but made the asm
mess even *worse*.
                Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 4573 bytes --]

 arch/x86/kernel/entry_64.S | 51 ++++++++++++++-------------------------------
 arch/x86/kernel/signal.c   | 37 ++++++++++++++++++++++++++++-----
 2 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..15b5953da958 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -658,30 +658,24 @@ sysret_check:
 	/* Handle reschedules */
 	/* edx: work, edi: workmask */
sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
+	SAVE_REST
+	FIXUP_TOP_OF_STACK %r11
+	movq %rsp,%rdi
+	movl %edx,%esi
+	call syscall_exit_slowpath
+	movl $_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT,%edi
+	testl %eax,%eax
+	jne sysret_using_iret
+	addq $REST_SKIP,%rsp
+	jmp sysret_check;

-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
-	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
+sysret_using_iret:
+	RESTORE_REST
+	DISABLE_INTERRUPTS(CLBR_NONE)
+	TRACE_IRQS_OFF
+	jmp int_with_check

 badsys:
 	movq $-ENOSYS,RAX-ARGOFFSET(%rsp)
@@ -703,20 +697,6 @@ auditsys:
 	call __audit_syscall_entry
 	LOAD_ARGS 0		/* reload call-clobbered registers */
 	jmp system_call_fastpath
-
-	/*
-	 * Return fast path for syscall audit. Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi		/* is it < -MAX_ERRNO? */
-	setbe %al			/* 1 if so, 0 if not */
-	movzbl %al,%edi			/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
 #endif	/* CONFIG_AUDITSYSCALL */

 	/* Do syscall tracing */
@@ -786,7 +766,6 @@ int_careful:
 int_very_careful:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
-int_check_syscall_exit_work:
 	SAVE_REST
 	/* Check for syscall exit trace */
 	testl $_TIF_WORK_SYSCALL_EXIT,%edx
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 9e5de6813e1f..25d02f6a65f1 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -725,13 +725,30 @@ static void do_signal(struct pt_regs *regs)
 }

 /*
- * notification of userspace execution resumption
- * - triggered by the TIF_WORK_MASK flags
+ * This is the system call exit slowpath, called if any of
+ * the thread info flags are set that cause us to not just
+ * return using a 'sysret' instruction.
+ *
+ * Return zero if the user register state hasn't changed,
+ * and a 'sysret' is still ok. Return nonzero if we need
+ * to go to the slow 'iret' path due to possible register
+ * changes (signals, ptrace, whatever).
  */
-__visible void
-do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
+__visible int syscall_exit_slowpath(struct pt_regs *regs, u32 thread_info_flags)
 {
+	int retval = 0;
+
 	user_exit();

+	if (thread_info_flags & _TIF_NEED_RESCHED)
+		schedule();
+
+	if (thread_info_flags & _TIF_WORK_SYSCALL_EXIT)
+		syscall_trace_leave(regs);
+
+#ifdef CONFIG_AUDITSYSCALL
+	if (thread_info_flags & _TIF_SYSCALL_AUDIT)
+		__audit_syscall_exit(regs->ax, regs->ax < -MAX_ERRNO);
+#endif

 #ifdef CONFIG_X86_MCE
 	/* notify userspace of pending MCEs */
@@ -743,8 +760,10 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		uprobe_notify_resume(regs);

 	/* deal with pending signal delivery */
-	if (thread_info_flags & _TIF_SIGPENDING)
+	if (thread_info_flags & _TIF_SIGPENDING) {
 		do_signal(regs);
+		retval = 1;
+	}

 	if (thread_info_flags & _TIF_NOTIFY_RESUME) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
@@ -754,8 +773,16 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 	fire_user_return_notifiers();

 	user_enter();
+	return retval;
 }

+/* Get rid of this old interface some day.. */
+__visible void do_notify_resume(struct pt_regs *regs, void *unused, u32 thread_info_flags)
+{
+	syscall_exit_slowpath(regs, thread_info_flags & ~(_TIF_NEED_RESCHED|_TIF_SYSCALL_AUDIT));
+}
+
+
 void signal_fault(struct pt_regs *regs, void __user *frame, char *where)
 {
 	struct task_struct *me = current;

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
@ 2014-01-27  0:22 ` Al Viro
  2014-01-27  4:32   ` Linus Torvalds
  2014-01-27 10:27   ` Peter Zijlstra
  2014-02-06  0:32   ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 0:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 02:28:15PM -0800, Linus Torvalds wrote:
> The x86-64 (and 32-bit, for that matter) system call slowpaths are all
> in C, but the *selection* of which slow-path to take is a mixture of
> complicated assembler ("sysret_check -> sysret_careful -> sysret_signal
> -> sysret_audit -> int_check_syscall_exit_work" etc), and oddly named
> and placed C code ("schedule_user" vs "__audit_syscall_exit" vs
> "do_notify_resume").
>
> The attached patch tries to take the "do_notify_resume()" approach,
> renaming it to something sane ("syscall_exit_slowpath") and calling out
> to *all* the different slow cases from that one place, instead of
> having some cases hardcoded in asm, and some in C. And instead of
> hardcoding which cases result in an "iretq" and which cases result in
> the faster sysret case, it's now simply a return value from that
> syscall_exit_slowpath() function, so it's very natural and easy to say
> "taking a signal will force us to do the slow iretq case, but we can
> do the task exit work and still do the sysret".
>
> I've marked this as an RFC, because I didn't bother trying to clean up
> the 32-bit code similarly (no test-cases, and trust me, if you get
> this wrong, it will fail spectacularly but in very subtle and
> hard-to-debug ways), and I also didn't bother with the slow cases in
> the "iretq" path, so that path still has the odd asm cases and calls
> the old (now legacy) do_notify_resume() path.

Umm...  Can't uprobe_notify_resume() modify regs as well?  While we
are at it, when we start using the same thing on 32bit kernels, we'll
need to watch out for execve() - the reason why start_thread() sets
TIF_NOTIFY_RESUME is to force us away from sysexit path.  IIRC, vm86
is another thing to watch out for (same reasons).

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  0:22 ` Al Viro
@ 2014-01-27  4:32 ` Linus Torvalds
  2014-01-27  4:48   ` H. Peter Anvin
  2014-01-27  7:42   ` Al Viro
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27 4:32 UTC (permalink / raw)
To: Al Viro
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm...  Can't uprobe_notify_resume() modify regs as well?

Probably.

.. and on the other hand, we should actually be able to use 'sysret'
for signal handling on x86-64, because while sysret destroys %rcx and
doesn't allow for returning to odd modes, for calling a signal handler
I don't think we really care..

> While we are at it, when we start using the same thing on 32bit
> kernels, we'll need to watch out for execve() - the reason why
> start_thread() sets TIF_NOTIFY_RESUME is to force us away from sysexit
> path.  IIRC, vm86 is another thing to watch out for (same reasons).

Yes, the 32-bit code I didn't want to touch, partly because I no
longer have a test-case. And it does end up having some more
interesting cases.

               Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  4:32 ` Linus Torvalds
@ 2014-01-27  4:48 ` H. Peter Anvin
  0 siblings, 0 replies; 32+ messages in thread
From: H. Peter Anvin @ 2014-01-27 4:48 UTC (permalink / raw)
To: Linus Torvalds, Al Viro
Cc: Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On 01/26/2014 08:32 PM, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Umm...  Can't uprobe_notify_resume() modify regs as well?
>
> Probably.
>
> .. and on the other hand, we should actually be able to use 'sysret'
> for signal handling on x86-64, because while sysret destroys %rcx and
> doesn't allow for returning to odd modes, for calling a signal handler
> I don't think we really care..
>

Yes, it is the fourth argument register, but we only have three
arguments to a signal handler.  I had to think about that one.

>> While we are at it, when we start using the same thing on 32bit
>> kernels, we'll need to watch out for execve() - the reason why
>> start_thread() sets TIF_NOTIFY_RESUME is to force us away from sysexit
>> path.  IIRC, vm86 is another thing to watch out for (same reasons).
>
> Yes, the 32-bit code I didn't want to touch, partly because I no
> longer have a test-case. And it does end up having some more
> interesting cases.

That is one way to put it.  However, this code is incredibly ugly and
getting it cleaned up would really, really help, of course.

	-hpa

^ permalink raw reply	[flat|nested] 32+ messages in thread
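To make hpa's observation concrete: the SysV AMD64 ABI passes the first
integer arguments in %rdi, %rsi and %rdx before it ever reaches %rcx,
and a signal handler takes at most three arguments. A minimal userspace
sketch (not from the thread; illustrative only):

#include <signal.h>

/* An SA_SIGINFO handler consumes only the first three integer-argument
 * registers, so whatever garbage 'sysret' leaves in %rcx (the fourth
 * argument slot) is never observed by the handler. */
static void handler(int signo,        /* arrives in %rdi */
		    siginfo_t *info,  /* arrives in %rsi */
		    void *ucontext)   /* arrives in %rdx; %rcx is unused */
{
	(void)signo; (void)info; (void)ucontext;
}

int main(void)
{
	struct sigaction sa = { .sa_sigaction = handler,
				.sa_flags = SA_SIGINFO };
	return sigaction(SIGUSR1, &sa, NULL);
}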
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  4:32 ` Linus Torvalds
  2014-01-27  4:48   ` H. Peter Anvin
@ 2014-01-27  7:42 ` Al Viro
  2014-01-27 22:06   ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 7:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 08:32:09PM -0800, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> Umm...  Can't uprobe_notify_resume() modify regs as well?
>
> Probably.
>
> .. and on the other hand, we should actually be able to use 'sysret'
> for signal handling on x86-64, because while sysret destroys %rcx and
> doesn't allow for returning to odd modes, for calling a signal handler
> I don't think we really care..

I'm afraid we might:

 * When user can change the frames always force IRET. That is because
 * it deals with uncanonical addresses better. SYSRET has trouble
 * with them due to bugs in both AMD and Intel CPUs.

IIRC, that was about SYSRET with something unpleasant left in RCX, which
comes from regs->ip, which is set to sa_handler by __setup_rt_frame().
And we do not normalize or validate that - not in __setup_rt_frame() and
not at sigaction(2) time.  Something about GPF triggered and buggering
attacker-chosen memory area?  I don't remember details, but IIRC the
conclusion had been "just don't go there"...

Note that we can manipulate regs->ip and regs->sp, regardless of validation
at sigaction(2) or __setup_rt_frame() - just have the sucker ptraced, send
it a signal and it'll stop on delivery.  Then tracer can use ptrace to modify
registers and issue PTRACE_CONT with zero signal.  Voila - regs->[is]p
set to arbitrary values, no signal handlers triggered...

^ permalink raw reply	[flat|nested] 32+ messages in thread
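The tracer-side sequence Al describes needs nothing beyond the standard
ptrace register calls. A hedged userspace sketch (the function name is
made up; assumes an x86-64 tracee already stopped in signal-delivery
stop):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>

/* Overwrite the stopped tracee's ip/sp with arbitrary (possibly
 * non-canonical) values, then resume with "signal 0" so no handler
 * frame is built and no sigaction-time validation can intervene. */
static void force_ip_sp(pid_t pid, unsigned long ip, unsigned long sp)
{
	struct user_regs_struct regs;

	ptrace(PTRACE_GETREGS, pid, 0, &regs);
	regs.rip = ip;
	regs.rsp = sp;
	ptrace(PTRACE_SETREGS, pid, 0, &regs);
	ptrace(PTRACE_CONT, pid, 0, 0);	/* zero signal: nothing delivered */
}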
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27  7:42 ` Al Viro
@ 2014-01-27 22:06 ` Andy Lutomirski
  2014-01-27 22:17   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-27 22:06 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On 01/26/2014 11:42 PM, Al Viro wrote:
> On Sun, Jan 26, 2014 at 08:32:09PM -0800, Linus Torvalds wrote:
>> On Sun, Jan 26, 2014 at 4:22 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> Umm...  Can't uprobe_notify_resume() modify regs as well?
>>
>> Probably.
>>
>> .. and on the other hand, we should actually be able to use 'sysret'
>> for signal handling on x86-64, because while sysret destroys %rcx and
>> doesn't allow for returning to odd modes, for calling a signal handler
>> I don't think we really care..
>
> I'm afraid we might:
>
> * When user can change the frames always force IRET. That is because
> * it deals with uncanonical addresses better. SYSRET has trouble
> * with them due to bugs in both AMD and Intel CPUs.
>
> IIRC, that was about SYSRET with something unpleasant left in RCX, which
> comes from regs->ip, which is set to sa_handler by __setup_rt_frame().
> And we do not normalize or validate that - not in __setup_rt_frame() and
> not at sigaction(2) time.  Something about GPF triggered and buggering
> attacker-chosen memory area?  I don't remember details, but IIRC the
> conclusion had been "just don't go there"...
>
> Note that we can manipulate regs->ip and regs->sp, regardless of validation
> at sigaction(2) or __setup_rt_frame() - just have the sucker ptraced, send
> it a signal and it'll stop on delivery.  Then tracer can use ptrace to modify
> registers and issue PTRACE_CONT with zero signal.  Voila - regs->[is]p
> set to arbitrary values, no signal handlers triggered...
>

It's not just ip and sp -- cs matters here, too, I think.  (I may be
the only one to have ever tried it, but it's possible to far-call from
64-bit to 32-bit cs, and it works.  I've never tried switching cs using
ptrace, but someone may want that to work, too.)

That being said, the last time I benchmarked it, sysret was *way*
faster than iret.  So maybe the thing to do is to validate the
registers on the way out and, if they're appropriate for sysret, do the
sysret.

I'm not quite sure how to express "I don't care about rcx" in pt_regs.
Maybe use the actual value that the CPU will stick in there (assuming
that anyone knows what this is).

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:06 ` Andy Lutomirski
@ 2014-01-27 22:17 ` Linus Torvalds
  2014-01-27 22:32   ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27 22:17 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Al Viro, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:06 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> It's not just ip and sp -- cs matters here, too, I think.

For signal *delivery*, CS will always be __USER_CS, and %rcx can be
crap, so sysret should be fine. We could easily check that %rip is
valid in the whole slow-path instead of saying "return 1 if we did
do_signal()".

Now, it's a different thing wrt signal handler *return*, because at
that point we really cannot return with some random value in %rcx. We
absolutely do need to use 'iretq' in that whole [rt_]sigreturn() path,
but on x86-64 that is all handled by the system call itself (see the
stub_*_sigreturn stuff in entry_64.S) and it very much uses iret
explicitly (the 32-bit case also does that, by forcing the sigreturn
to be done with an "int 0x80" instruction - we could change that to
use syscall+iret, but frankly, I'm not all that inclined to care,
although it might be worth trying to do just to unify the models a
bit).

              Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:17 ` Linus Torvalds
@ 2014-01-27 22:32 ` Al Viro
  2014-01-27 22:43   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 22:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
    Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 02:17:23PM -0800, Linus Torvalds wrote:
> On Mon, Jan 27, 2014 at 2:06 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > It's not just ip and sp -- cs matters here, too, I think.
>
> For signal *delivery*, CS will always be __USER_CS, and %rcx can be
> crap, so sysret should be fine. We could easily check that %rip is
> valid in the whole slow-path instead of saying "return 1 if we did
> do_signal()".

do_signal() is also a place where arbitrary changes to regs might've
been done by tracer, so regs->cs might need to be checked in the same
place where we validate regs->rip ;-/

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:32 ` Al Viro
@ 2014-01-27 22:43 ` Linus Torvalds
  2014-01-27 22:46   ` Andy Lutomirski
  2014-01-27 23:07   ` Al Viro
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-01-27 22:43 UTC (permalink / raw)
To: Al Viro
Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
    Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> do_signal() is also a place where arbitrary changes to regs might've
> been done by tracer, so regs->cs might need to be checked in the same
> place where we validate regs->rip ;-/

Fair enough. But it would still be really easy, and make the common
case signal delivery a bit faster.

Now, sadly, most signal delivery is then followed by sigreturn (the
exceptions being dying or doing a longjmp), so we'd still get the
iretq then. But it would cut the iretq's related to signals in half.

We *could* try to do sigreturn with sysret and a small trampoline too,
of course. But I'm not sure how far I'd want to take it.

              Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
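Pulling the checks from this subthread together, the validation would
look something like the following (a sketch only, with a hypothetical
helper name - not actual kernel code; it deliberately ignores %rcx,
since signal delivery does not care about it):

/* After the C slowpath has run (and a tracer may have rewritten
 * pt_regs), decide whether a 'sysret' exit is still safe. */
static int sysret_still_ok(struct pt_regs *regs)
{
	/* %rip must be canonical (bits 63..48 the sign-extension of
	 * bit 47), or sysret can fault in ring 0 on some CPUs */
	if ((long)(regs->ip << 16) >> 16 != (long)regs->ip)
		return 0;
	/* ptrace (or a far call) may have switched segments */
	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
		return 0;
	return 1;
}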
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:43 ` Linus Torvalds
@ 2014-01-27 22:46 ` Andy Lutomirski
  2014-01-28  0:22   ` H. Peter Anvin
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-27 22:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 2:43 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> do_signal() is also a place where arbitrary changes to regs might've
>> been done by tracer, so regs->cs might need to be checked in the same
>> place where we validate regs->rip ;-/
>
> Fair enough. But it would still be really easy, and make the common
> case signal delivery a bit faster.
>
> Now, sadly, most signal delivery is then followed by sigreturn (the
> exceptions being dying or doing a longjmp), so we'd still get the
> iretq then. But it would cut the iretq's related to signals in half.
>
> We *could* try to do sigreturn with sysret and a small trampoline too,
> of course. But I'm not sure how far I'd want to take it.

I once spent a while thinking about how to do this. The best I could
come up with was to use something like 'ret 128' for the trampoline.
(The issue here is that there's no good place to shove a global
variable with the missing register values, fs and gs aren't really
available for these games, and the red zone is in the way.)

I think that sysret for sigreturn is probably not very interesting.
On the other hand, sysret for #PF might be a huge win, despite being
even scarier.

(Or someone could politely ask Intel for a couple of non-serializing
msrs that set the values of rcx and whatever other registers get
clobbered by sysret.)

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:46 ` Andy Lutomirski
@ 2014-01-28  0:22 ` H. Peter Anvin
  2014-01-28  0:44   ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: H. Peter Anvin @ 2014-01-28 0:22 UTC (permalink / raw)
To: Andy Lutomirski, Linus Torvalds
Cc: Al Viro, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List

On 01/27/2014 02:46 PM, Andy Lutomirski wrote:
>
> I think that sysret for sigreturn is probably not very interesting.
> On the other hand, sysret for #PF might be a huge win, despite being
> even scarier.
>

SYSRET for #PF or other exceptions is a nonstarter; register state is
live at that point.

	-hpa

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28  0:22 ` H. Peter Anvin
@ 2014-01-28  0:44 ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2014-01-28 0:44 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Al Viro, Ingo Molnar, Thomas Gleixner,
    Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 4:22 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/27/2014 02:46 PM, Andy Lutomirski wrote:
>>
>> I think that sysret for sigreturn is probably not very interesting.
>> On the other hand, sysret for #PF might be a huge win, despite being
>> even scarier.
>>
>
> SYSRET for #PF or other exceptions is a nonstarter; register state is
> live at that point.

I mean sysret-via-trampoline for #PF. It's scary, it probably has
issues with ptrace and interrupts that hit while the trampoline is
still running, and it could break anything that writes past the red
zone, but I think it could work. No, I don't particularly want to
implement (and debug) such a beast.

(I will continue cursing Intel -- why can't we have a fast way to
return to 64-bit userspace with complete control of all non-segment
registers? sysret is *almost* the right thing.)

--Andy

>
> -hpa
>
>

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 22:43 ` Linus Torvalds
  2014-01-27 22:46   ` Andy Lutomirski
@ 2014-01-27 23:07 ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: Al Viro @ 2014-01-27 23:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Andy Lutomirski, Peter Anvin, Ingo Molnar, Thomas Gleixner,
    Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 02:43:14PM -0800, Linus Torvalds wrote:
> On Mon, Jan 27, 2014 at 2:32 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > do_signal() is also a place where arbitrary changes to regs might've
> > been done by tracer, so regs->cs might need to be checked in the same
> > place where we validate regs->rip ;-/
>
> Fair enough. But it would still be really easy, and make the common
> case signal delivery a bit faster.
>
> Now, sadly, most signal delivery is then followed by sigreturn (the
> exceptions being dying or doing a longjmp), so we'd still get the
> iretq then. But it would cut the iretq's related to signals in half.
>
> We *could* try to do sigreturn with sysret and a small trampoline too,
> of course. But I'm not sure how far I'd want to take it.

The problem with validation is that we'll suddenly become sensitive to
a hard-to-estimate pile of hardware bugs ;-/  E.g. which AMD-specific
erratum is that comment in entry_64.S about?  The one I kinda-sorta
remember is Intel-specific, and that was about uncanonical RIP; looking
for the AMD one has turned up #353 (with suggested workaround being
"have bit 16 set in whatever you put into R11"), but I've no idea
whether that's the only potential issue there...

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
  2014-01-27  0:22 ` Al Viro
@ 2014-01-27 10:27 ` Peter Zijlstra
  2014-01-27 11:36   ` Al Viro
  2014-02-06  0:32 ` Linus Torvalds
  2 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2014-01-27 10:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 02:28:15PM -0800, Linus Torvalds wrote:
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..15b5953da958 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -658,30 +658,24 @@ sysret_check:
>  	/* Handle reschedules */
>  	/* edx: work, edi: workmask */
> sysret_careful:
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
> +	SAVE_REST
> +	FIXUP_TOP_OF_STACK %r11
> +	movq %rsp,%rdi
> +	movl %edx,%esi
> +	call syscall_exit_slowpath
> +	movl $_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT,%edi
> +	testl %eax,%eax
> +	jne sysret_using_iret
> +	addq $REST_SKIP,%rsp
> +	jmp sysret_check;

Obviously I don't particularly like the SAVE_REST/FIXUP_TOP_OF_STACK
being added to the reschedule path.

Can't we do as Al suggested earlier and have 2 slowpath calls, one
without PT_REGS and one with?

That said, yes it's a nice cleanup, entry.S always hurts my brain.

^ permalink raw reply	[flat|nested] 32+ messages in thread
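The split PeterZ is asking for would keep the pt_regs setup off the
common reschedule path; in rough outline (hypothetical names, a sketch
only - the full variant is the syscall_exit_slowpath() from the patch):

/* Light call for the reschedule-only case: needs no pt_regs, so the
 * asm caller can skip SAVE_REST/FIXUP_TOP_OF_STACK entirely. */
__visible void syscall_exit_reschedule(void)
{
	schedule();
}

/* Full variant, called only when work bits other than
 * _TIF_NEED_RESCHED are set, with the full register frame saved. */
__visible int syscall_exit_slowpath(struct pt_regs *regs, u32 thread_info_flags);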
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 10:27 ` Peter Zijlstra
@ 2014-01-27 11:36 ` Al Viro
  2014-01-27 17:39   ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-27 11:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner,
    the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 11:27:59AM +0100, Peter Zijlstra wrote:
> Obviously I don't particularly like the SAVE_REST/FIXUP_TOP_OF_STACK
> being added to the reschedule path.
>
> Can't we do as Al suggested earlier and have 2 slowpath calls, one
> without PT_REGS and one with?
>
> That said, yes it's a nice cleanup, entry.S always hurts my brain.

BTW, there's an additional pile of obfuscation:

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK							\
	(0x0000FFFF &							\
	 ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|			\
	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))

/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK						\
	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
	_TIF_NOHZ)

These guys are

	_TIF_NOTIFY_RESUME | _TIF_SIGPENDING | _TIF_MCE_NOTIFY |
	_TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | _TIF_NEED_RESCHED | 0xe200
and
	_TIF_SYSCALL_TRACE | _TIF_NOTIFY_RESUME | _TIF_SIGPENDING |
	_TIF_NEED_RESCHED | _TIF_SINGLESTEP | _TIF_SYSCALL_EMU |
	_TIF_SYSCALL_AUDIT | _TIF_MCE_NOTIFY | _TIF_SYSCALL_TRACEPOINT |
	_TIF_NOHZ | _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE | 0xe200

resp., or

	_TIF_DO_NOTIFY_MASK | _TIF_UPROBE | _TIF_NEED_RESCHED | 0xe200
and
	_TIF_DO_NOTIFY_MASK | _TIF_WORK_SYSCALL_EXIT | _TIF_NEED_RESCHED |
	_TIF_SYSCALL_EMU | _TIF_UPROBE | 0xe200

0xe200 (aka bits 15,14,13,9) consists of the bits that are never set by
anybody, so short of really deep magic it can be discarded.

The rest is also interesting, to put it politely. Why is _TIF_UPROBE
*not* a part of _TIF_DO_NOTIFY_MASK, for example? Note that
do_notify_resume() checks and clears it, but on syscall (and interrupt)
exit paths we only call it with something in _TIF_DO_NOTIFY_MASK. If
UPROBE is set, but nothing else in that set is, we'll be looping
forever, right? There's pending work (according to _TIF_WORK_MASK), so
we won't just leave. And we won't be calling do_notify_resume(), so
there's nothing to clear that bit. Only it gets even nastier - on the
paranoid_userspace path we call do_notify_resume() if anything in
_TIF_WORK_MASK besides NEED_RESCHED happens to be set. So _there_
getting solitary UPROBE is legitimate.

_TIF_SYSCALL_EMU is also an interesting story - on the way out it
	* forces us onto the iret path
	* does *not* trigger syscall_trace_leave() on its own
	  (syscall_trace_leave() is aware of that sucker, though, with
	  a rather confusing comment)
	* hits do_notify_resume() (for no good reason -
	  do_notify_resume() silently ignores it)
	* gets cleared from the workmask (i.e. %edi), so on the next
	  iteration through the loop it gets completely ignored.
AFAICS, all of that is pointless, since SYSCALL_EMU wants to avoid
SYSRET only if we had entered with it and in that case we would've gone
through tracesys and stayed the fsck away from the SYSRET path anyway
(similar on 32bit - if we hit syscall_trace_enter(), we do not rejoin
the sysenter path). IOW, no reason for it to be in _TIF_ALLWORK_MASK...

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 11:36 ` Al Viro
@ 2014-01-27 17:39 ` Oleg Nesterov
  2014-01-28  1:18   ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-27 17:39 UTC (permalink / raw)
To: Al Viro
Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
    Thomas Gleixner, the arch/x86 maintainers, Linux Kernel Mailing List

On 01/27, Al Viro wrote:
>
> BTW, there's an additional pile of obfuscation:
> /* work to do on interrupt/exception return */
> #define _TIF_WORK_MASK						\
> 	(0x0000FFFF &						\
> 	 ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|		\
> 	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
>
> /* work to do on any return to user space */
> #define _TIF_ALLWORK_MASK					\
> 	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
> 	_TIF_NOHZ)

Heh, yes ;)

> Why is _TIF_UPROBE *not* a part
> of _TIF_DO_NOTIFY_MASK, for example?

Yes, please see another email. That is why uprobe_deny_signal()
sets TIF_NOTIFY_RESUME along with TIF_UPROBE.

Oleg.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-27 17:39 ` Oleg Nesterov
@ 2014-01-28  1:18 ` Al Viro
  2014-01-28 16:38   ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-28 1:18 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
    Thomas Gleixner, the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> On 01/27, Al Viro wrote:
> >
> > BTW, there's an additional pile of obfuscation:
> > /* work to do on interrupt/exception return */
> > #define _TIF_WORK_MASK					\
> > 	(0x0000FFFF &					\
> > 	 ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|	\
> > 	   _TIF_SINGLESTEP|_TIF_SECCOMP|_TIF_SYSCALL_EMU))
> >
> > /* work to do on any return to user space */
> > #define _TIF_ALLWORK_MASK				\
> > 	((0x0000FFFF & ~_TIF_SECCOMP) | _TIF_SYSCALL_TRACEPOINT |	\
> > 	_TIF_NOHZ)
>
> Heh, yes ;)
>
> > Why is _TIF_UPROBE *not* a part
> > of _TIF_DO_NOTIFY_MASK, for example?
>
> Yes, please see another email. That is why uprobe_deny_signal()
> sets TIF_NOTIFY_RESUME along with TIF_UPROBE.

*grumble*  Can it end up modifying *regs?  From very cursory reading of
kernel/events/uprobes.c it seems to do so, so we probably want to leave
via iretq if that has hit, right?

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28  1:18 ` Al Viro
@ 2014-01-28 16:38 ` Oleg Nesterov
  2014-01-28 16:48   ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-28 16:38 UTC (permalink / raw)
To: Al Viro
Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
    Thomas Gleixner, the arch/x86 maintainers,
    Linux Kernel Mailing List, Srikar Dronamraju

On 01/28, Al Viro wrote:
>
> On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > On 01/27, Al Viro wrote:
> > >
> > > Why is _TIF_UPROBE *not* a part
> > > of _TIF_DO_NOTIFY_MASK, for example?
> >
> > Yes, please see another email. That is why uprobe_deny_signal()
> > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
>
> *grumble*  Can it end up modifying *regs?  From very cursory reading of
> kernel/events/uprobes.c it seems to do so, so we probably want to leave
> via iretq if that has hit, right?

But we do this anyway, restore_args path does iretq?

I mean, uprobe_notify_resume() is called from do_notify_resume(), it
should be fine to modify *regs there?

Oleg.

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28 16:38 ` Oleg Nesterov
@ 2014-01-28 16:48 ` Al Viro
  2014-01-28 17:19   ` Oleg Nesterov
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2014-01-28 16:48 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
    Thomas Gleixner, the arch/x86 maintainers,
    Linux Kernel Mailing List, Srikar Dronamraju

On Tue, Jan 28, 2014 at 05:38:08PM +0100, Oleg Nesterov wrote:
> On 01/28, Al Viro wrote:
> >
> > On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > > On 01/27, Al Viro wrote:
> > > >
> > > > Why is _TIF_UPROBE *not* a part
> > > > of _TIF_DO_NOTIFY_MASK, for example?
> > >
> > > Yes, please see another email. That is why uprobe_deny_signal()
> > > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
> >
> > *grumble*  Can it end up modifying *regs?  From very cursory reading of
> > kernel/events/uprobes.c it seems to do so, so we probably want to leave
> > via iretq if that has hit, right?
>
> But we do this anyway, restore_args path does iretq?
>
> I mean, uprobe_notify_resume() is called from do_notify_resume(), it
> should be fine to modify *regs there?

See Linus' patch trying to avoid the iretq path; it's really costly.
Looks like that patch will have to treat _TIF_UPROBE the same way it
treats _TIF_SIGPENDING...

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-28 16:48 ` Al Viro
@ 2014-01-28 17:19 ` Oleg Nesterov
  0 siblings, 0 replies; 32+ messages in thread
From: Oleg Nesterov @ 2014-01-28 17:19 UTC (permalink / raw)
To: Al Viro
Cc: Peter Zijlstra, Linus Torvalds, Peter Anvin, Ingo Molnar,
    Thomas Gleixner, the arch/x86 maintainers,
    Linux Kernel Mailing List, Srikar Dronamraju

On 01/28, Al Viro wrote:
>
> On Tue, Jan 28, 2014 at 05:38:08PM +0100, Oleg Nesterov wrote:
> > On 01/28, Al Viro wrote:
> > >
> > > On Mon, Jan 27, 2014 at 06:39:31PM +0100, Oleg Nesterov wrote:
> > > > On 01/27, Al Viro wrote:
> > > > >
> > > > > Why is _TIF_UPROBE *not* a part
> > > > > of _TIF_DO_NOTIFY_MASK, for example?
> > > >
> > > > Yes, please see another email. That is why uprobe_deny_signal()
> > > > sets TIF_NOTIFY_RESUME along with TIF_UPROBE.
> > >
> > > *grumble*  Can it end up modifying *regs?  From very cursory reading of
> > > kernel/events/uprobes.c it seems to do so, so we probably want to leave
> > > via iretq if that has hit, right?
> >
> > But we do this anyway, restore_args path does iretq?
> >
> > I mean, uprobe_notify_resume() is called from do_notify_resume(), it
> > should be fine to modify *regs there?
>
> See Linus' patch trying to avoid the iretq path; it's really costly.
> Looks like that patch will have to treat _TIF_UPROBE the same way it
> treats _TIF_SIGPENDING...

Ah, this one I guess:

	http://marc.info/?l=linux-kernel&m=139077532507926

I think this should be fine wrt uprobes; unless I misread this patch,
syscall_exit_slowpath() is actually only called by the ret_from_sys_call
path, and in this case TIF_UPROBE should not be set. But perhaps
"retval = 1" after uprobe_notify_resume() makes sense anyway.

And while I am almost sure I missed something, can't we (with or
without that patch) simply add TIF_UPROBE into _TIF_DO_NOTIFY_MASK and
remove set_tsk_thread_flag(t, TIF_NOTIFY_RESUME) from
uprobe_deny_signal()?

Oleg.

^ permalink raw reply	[flat|nested] 32+ messages in thread
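Oleg's closing suggestion would amount to roughly this (a hedged sketch
against the thread_info.h of that era, approximating the fix that
eventually went upstream):

/* arch/x86/include/asm/thread_info.h: let the exit paths see
 * TIF_UPROBE directly, so uprobe_deny_signal() no longer needs to set
 * TIF_NOTIFY_RESUME alongside it. */
#define _TIF_DO_NOTIFY_MASK						\
	(_TIF_SIGPENDING | _TIF_MCE_NOTIFY | _TIF_NOTIFY_RESUME |	\
	 _TIF_USER_RETURN_NOTIFY | _TIF_UPROBE)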
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
  2014-01-27  0:22 ` Al Viro
  2014-01-27 10:27 ` Peter Zijlstra
@ 2014-02-06  0:32 ` Linus Torvalds
  2014-02-06  0:55   ` Kirill A. Shutemov
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06 0:32 UTC (permalink / raw)
To: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra
Cc: the arch/x86 maintainers, Linux Kernel Mailing List

On Sun, Jan 26, 2014 at 2:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Comments? This was obviously brought on by my frustration with the
> currently nasty do_notify_resume() always returning to iret for the
> task_work case, and PeterZ's patch that fixed that, but made the asm
> mess even *worse*.

Actually, I should have taken a closer look.

Yes, do_notify_resume() is a real issue, and my stupid open/close
test-case showed that part of the profile.

But the "iretq" that dominates on the kernel build is actually the
page fault one.

I noticed this when I compared "-e cycles:pp" with "-e cycles:p". The
single-p version shows largely the same profile for the kernel, except
that instead of showing "iretq" as the big cost, it shows the first
instruction in "page_fault".

In fact, even when *not* zoomed into the kernel DSO, "page_fault"
actually takes 5% of CPU time according to perf report. That's really
quite impressive.

I suspect the Haswell architecture has made everything else cheaper,
and the exception overhead hasn't kept up. I'm wondering if there is
anything we could do to speed this up - like doing gang lookup in the
page cache and pre-populating the page tables opportunistically.

We're using an interrupt gate for the page fault handling, and I don't
think we can avoid that. For all I know, a trap gate might be slightly
faster (but likely not really noticeable - the microcode is surely
expensive, but the pipeline unwinding is probably the biggest cost of
the page fault), but we have the issue of interrupts causing page
faults for vmalloc pages..

And obviously we can't avoid the iretq for the return path. So as far
as I can see, there's no sane way to make the page fault itself
cheaper. Looking at opportunistically prepopulating page tables when
it's cheap and easy might be the best we can do..

                 Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  0:32 ` Linus Torvalds
@ 2014-02-06  0:55 ` Kirill A. Shutemov
  2014-02-06  2:32   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-06 0:55 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 05, 2014 at 04:32:55PM -0800, Linus Torvalds wrote:
> On Sun, Jan 26, 2014 at 2:28 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Comments? This was obviously brought on by my frustration with the
> > currently nasty do_notify_resume() always returning to iret for the
> > task_work case, and PeterZ's patch that fixed that, but made the asm
> > mess even *worse*.
>
> Actually, I should have taken a closer look.
>
> Yes, do_notify_resume() is a real issue, and my stupid open/close
> test-case showed that part of the profile.
>
> But the "iretq" that dominates on the kernel build is actually the
> page fault one.
>
> I noticed this when I compared "-e cycles:pp" with "-e cycles:p". The
> single-p version shows largely the same profile for the kernel, except
> that instead of showing "iretq" as the big cost, it shows the first
> instruction in "page_fault".
>
> In fact, even when *not* zoomed into the kernel DSO, "page_fault"
> actually takes 5% of CPU time according to perf report. That's really
> quite impressive.
>
> I suspect the Haswell architecture has made everything else cheaper,
> and the exception overhead hasn't kept up. I'm wondering if there is
> anything we could do to speed this up - like doing gang lookup in the
> page cache and pre-populating the page tables opportunistically.

One thing that could help is THP for file-backed pages. And there's a
prototype with basic infrastructure and support for ramfs and
shmem/tmpfs (by Ning Qu). Work in progress.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  0:55 ` Kirill A. Shutemov
@ 2014-02-06  2:32 ` Linus Torvalds
  2014-02-06  4:33   ` Linus Torvalds
  2014-02-06  5:42   ` [RFC] de-asmify the x86-64 system call slowpath Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06 2:32 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 5, 2014 at 4:55 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> One thing that could help is THP for file-backed pages. And there's a
> prototype with basic infrastructure and support for ramfs and
> shmem/tmpfs (by Ning Qu). Work in progress.

THP isn't really realistic for any general-purpose loads. 2MB pages
are just too big of a granularity. On a machine with just a few gigs
(ie anything but big servers), you count the number of large pages in
a few thousands. That makes fragmentation a big issue.

Also, ELF sections aren't actually 2MB-aligned, nor do you really want
to throw away 9 bits of virtual address space randomization. Plus
common binaries aren't even big enough. Look at "bash", which is a pig
- it's still less than a megabyte as just the binary, with the code
segment being 900kB or so for me. For something like "perl", it's even
smaller, with a number of shared object files that get loaded
dynamically etc.

IOW, THP is completely useless for any kind of scripting. It can be
good for individual large binaries that have long lifetimes (DB
processes, HPC, stuff like that), but no, it is *not* the answer to
complex loads. And almost certainly never will be. It's possible that
some special case (glibc for example) could be reasonably hacked to
use a THP code segment, but that sounds unlikely. And the virtual
address randomization really gets hurt.

No, I was thinking "try to optimistically map 8 adjacent aligned pages
at a time" - that would be the same cacheline in the page tables, so
it would be fairly cheap if we couple it with a gang-lookup of the
pages in the page cache (or, for anonymous pages, by just
optimistically trying to do an order-3 page allocation, and if that
works, just map the 32kB allocation you got as eight individual
pages).

I know it's been discussed at some point, and I even have a dim memory
of having seen some really ugly patches.

                Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
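The anonymous side of that idea (which the patch below does not
attempt) could be sketched as follows - alloc_pages() and split_page()
are the real primitives, but the helper and the policy around it are
guesswork:

/* Opportunistically grab an order-3 block for an anonymous fault and
 * split it into 8 independently refcounted 4k pages that can then be
 * mapped as individual ptes.  __GFP_NORETRY keeps the attempt cheap
 * when memory is fragmented. */
static struct page *alloc_anon_cluster(gfp_t gfp)
{
	struct page *page = alloc_pages(gfp | __GFP_NORETRY | __GFP_NOWARN, 3);

	if (!page)
		return alloc_page(gfp);	/* fall back to a single page */

	split_page(page, 3);
	return page;
}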
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  2:32 ` Linus Torvalds
@ 2014-02-06  4:33 ` Linus Torvalds
  2014-02-06 21:29   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06 4:33 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

On Wed, Feb 5, 2014 at 6:32 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> No, I was thinking "try to optimistically map 8 adjacent aligned pages
> at a time" - that would be the same cacheline in the page tables, so
> it would be fairly cheap if we couple it with a gang-lookup of the
> pages in the page cache

Doing the gang-lookup is hard, since it's all abstracted away, but the
attached patch kind of tries to do what I described.

This patch probably doesn't work, but something *like* this might be
worth playing with. Except I suspect the page-backed faults are
actually the minority; judging by how high clear_page_c_e is in the
profile, it's probably mostly anonymous memory. I have no idea why I
started with the (more complex) case of file-backed prefaulting.

Oh well.

             Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 1753 bytes --]

 mm/memory.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index be6a0c0d4ae0..d52ec6a344dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3487,15 +3487,55 @@ uncharge_out:
 	return ret;
 }

+#define no_fault_around(x) (x & (VM_FAULT_ERROR | VM_FAULT_MAJOR | VM_FAULT_RETRY))
+#define FAULT_AROUND_SHIFT (3)
+
+static void fault_around(struct mm_struct *mm, struct vm_area_struct *vma,
+	unsigned long address, pmd_t *pmd)
+{
+	int nr = 1 << FAULT_AROUND_SHIFT;
+
+	address &= PAGE_MASK << FAULT_AROUND_SHIFT;
+	if (address < vma->vm_start)
+		return;
+
+	do {
+		pte_t *pte;
+		pte_t entry;
+		pgoff_t pgoff;
+
+		pte = pte_offset_map(pmd, address);
+		entry = *pte;
+
+		pte_unmap(pte);
+		if (!pte_none(entry))
+			continue;
+		pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
+		pgoff += vma->vm_pgoff;
+		if (no_fault_around(__do_fault(mm, vma, address, pmd, pgoff, 0, entry)))
+			break;
+	} while (address += PAGE_SIZE, address < vma->vm_end && --nr > 0);
+}
+
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
 		unsigned int flags, pte_t orig_pte)
 {
+	int ret;
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

 	pte_unmap(page_table);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	ret = __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+
+	/*
+	 * If the page we were looking for succeeded with no retries,
+	 * see if we can fault around it too..
+	 */
+	if (!no_fault_around(ret) && (flags & FAULT_FLAG_ALLOW_RETRY))
+		fault_around(mm, vma, address, pmd);
+
+	return ret;
 }

 /*

^ permalink raw reply related	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06  4:33 ` Linus Torvalds
@ 2014-02-06 21:29 ` Linus Torvalds
  2014-02-06 22:24   ` Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-06 21:29 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Wed, Feb 5, 2014 at 8:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Doing the gang-lookup is hard, since it's all abstracted away, but the
> attached patch kind of tries to do what I described.
>
> This patch probably doesn't work, but something *like* this might be
> worth playing with.

Interesting. Here are some pte fault statistics with and without the patch.

I added a few new count_vm_event() counters: PTEFAULT, PTEFILE,
PTEANON, PTEWP, PTESPECULATIVE for the handle_pte_fault,
do_linear_fault, do_anonymous_page, do_wp_page and the "let's
speculatively fill the page tables" case.

This is what the statistics look like for me doing a "make -j" of a
fully built allmodconfig build:

  5007450 ptefault
  3272038 ptefile
  1470242 pteanon
   265121 ptewp
        0 ptespeculative

where obviously the ptespeculative count is zero, and I was wrong
about anon faults being most common - the file mapping faults really
are the most common for this load (it's fairly fork/exec heavy, I
guess).

This is what happens with that patch I posted:

  2962090 ptefault
  1195130 ptefile
  1490460 pteanon
   276479 ptewp
  5690724 ptespeculative

about 200k page faults went away, and the numbers make sense (ie they
got removed from the ptefile column - the other number changes are
just noise).

Now, we filled 600k extra page table entries to do that (that
ptespeculative number), so the "hitrate" for the speculative filling
was basically about 33%. Which doesn't sound crazy - the code
basically populates the 8 aligned pages around the faulting page.

Now, because I didn't make this easily dynamically configurable I have
no good way to really test timing, but the numbers says at least the
concept works.

Whether the reduced number of page faults and presumably better
locality for the speculative prefilling makes up for the fact that 66%
of the prefilled entries never get used is very debatable. But I think
it's a somewhat interesting experiment, and the patch was certainly
not hugely complicated.

I should add a switch to turn this on/off and then do many builds in
sequence to get some kind of idea of whether it actually changes
performance. But if 5% of the overall time was literally spent on the
*exception* part of the page fault (ie not counting all the work we do
in the kernel), I think it's worth looking at this.

              Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
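For context, counters like these are wired up in two places (a sketch;
the PTE* names are Linus' ad-hoc debug counters, not upstream names):

/* include/linux/vm_event_item.h: new items go in the vm_event_item
 * enum, with matching strings in vmstat_text[] in mm/vmstat.c so they
 * show up in /proc/vmstat... */
enum vm_event_item {
	/* ... existing events ... */
	PTEFAULT, PTEFILE, PTEANON, PTEWP, PTESPECULATIVE,
	NR_VM_EVENT_ITEMS
};

/* ...and each site of interest then just bumps one, e.g.
 *	count_vm_event(PTEFAULT);	in handle_pte_fault()
 *	count_vm_event(PTESPECULATIVE);	in the speculative fill loop
 */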
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06 21:29 ` Linus Torvalds
@ 2014-02-06 22:24 ` Kirill A. Shutemov
  2014-02-07  1:31   ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-06 22:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Thu, Feb 06, 2014 at 01:29:13PM -0800, Linus Torvalds wrote:
> On Wed, Feb 5, 2014 at 8:33 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Doing the gang-lookup is hard, since it's all abstracted away, but the
> > attached patch kind of tries to do what I described.
> >
> > This patch probably doesn't work, but something *like* this might be
> > worth playing with.
>
> Interesting. Here are some pte fault statistics with and without the patch.
>
> I added a few new count_vm_event() counters: PTEFAULT, PTEFILE,
> PTEANON, PTEWP, PTESPECULATIVE for the handle_pte_fault,
> do_linear_fault, do_anonymous_page, do_wp_page and the "let's
> speculatively fill the page tables" case.
>
> This is what the statistics look like for me doing a "make -j" of a
> fully built allmodconfig build:
>
>   5007450 ptefault
>   3272038 ptefile
>   1470242 pteanon
>    265121 ptewp
>         0 ptespeculative
>
> where obviously the ptespeculative count is zero, and I was wrong
> about anon faults being most common - the file mapping faults really
> are the most common for this load (it's fairly fork/exec heavy, I
> guess).
>
> This is what happens with that patch I posted:
>
>   2962090 ptefault
>   1195130 ptefile
>   1490460 pteanon
>    276479 ptewp
>   5690724 ptespeculative
>
> about 200k page faults went away, and the numbers make sense (ie they
> got removed from the ptefile column - the other number changes are
> just noise).
>
> Now, we filled 600k extra page table entries to do that (that
> ptespeculative number), so the "hitrate" for the speculative filling
> was basically about 33%. Which doesn't sound crazy - the code
> basically populates the 8 aligned pages around the faulting page.
>
> Now, because I didn't make this easily dynamically configurable I have
> no good way to really test timing, but the numbers says at least the
> concept works.
>
> Whether the reduced number of page faults and presumably better
> locality for the speculative prefilling makes up for the fact that 66%
> of the prefilled entries never get used is very debatable. But I think
> it's a somewhat interesting experiment, and the patch was certainly
> not hugely complicated.
>
> I should add a switch to turn this on/off and then do many builds in
> sequence to get some kind of idea of whether it actually changes
> performance. But if 5% of the overall time was literally spent on the
> *exception* part of the page fault (ie not counting all the work we do
> in the kernel), I think it's worth looking at this.

If we want to reduce the number of page faults with less overhead we
should probably concentrate on minor page faults -- populate ptes
around the fault address for pages that are already in the page cache.
It should cover the scripting use-case pretty well.

We also can do reasonable batching in this case: make the changes under
the same page table lock and group the radix tree lookups. It may help
with scalability.

I'm working on a prototype of this. It's more invasive and too ugly at
the moment. We'll see how it goes. I'll try to show something tomorrow.

Other idea: we could introduce a less strict version of MAP_POPULATE to
populate only what we already have in the page cache.

Heh! `man 3 mmap' actually suggests MAP_POPULATE | MAP_NONBLOCK.
What's the story behind MAP_NONBLOCK?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath
  2014-02-06 22:24 ` Kirill A. Shutemov
@ 2014-02-07  1:31 ` Linus Torvalds
  2014-02-07 15:42   ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
  0 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 2014-02-07 1:31 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu

On Thu, Feb 6, 2014 at 2:24 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> If we want to reduce the number of page faults with less overhead we
> should probably concentrate on minor page faults -- populate ptes
> around the fault address for pages that are already in the page cache.
> It should cover the scripting use-case pretty well.

That's what my patch largely does. Except I screwed up and didn't use
FAULT_FLAG_ALLOW_RETRY in fault_around().

Anyway, my patch kind of works, but I'm starting to hate it. I think I
want to try to extend the "->fault()" interface to allow
filemap_fault() to just fill in multiple pages.

We already have that "vmf->page" thing, we could make it a small array
easily. That would allow proper gang lookup, and much more efficient
"fill in multiple entries in one go" in mm/memory.c.

> Heh! `man 3 mmap' actually suggests MAP_POPULATE | MAP_NONBLOCK.
> What's the story behind MAP_NONBLOCK?

It does nothing, afaik.

             Linus

^ permalink raw reply	[flat|nested] 32+ messages in thread
* [RFC, PATCH] mm: map few pages around fault address if they are in page cache
  2014-02-07  1:31 ` Linus Torvalds
@ 2014-02-07 15:42 ` Kirill A. Shutemov
  2014-02-07 17:32   ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Kirill A. Shutemov @ 2014-02-07 15:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu,
    Dave Hansen, Matthew Wilcox, Andi Kleen

On Thu, Feb 06, 2014 at 05:31:55PM -0800, Linus Torvalds wrote:
> On Thu, Feb 6, 2014 at 2:24 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > If we want to reduce the number of page faults with less overhead we
> > should probably concentrate on minor page faults -- populate ptes
> > around the fault address for pages that are already in the page cache.
> > It should cover the scripting use-case pretty well.
>
> That's what my patch largely does. Except I screwed up and didn't use
> FAULT_FLAG_ALLOW_RETRY in fault_around().

I fail to see how you avoid touching backing storage.

> Anyway, my patch kind of works, but I'm starting to hate it. I think I
> want to try to extend the "->fault()" interface to allow
> filemap_fault() to just fill in multiple pages.
>
> We already have that "vmf->page" thing, we could make it a small array
> easily. That would allow proper gang lookup, and much more efficient
> "fill in multiple entries in one go" in mm/memory.c.

See the patch below. I extended the ->fault() interface: added a pointer
to an array of pages, plus nr_pages. If a ->fault() implementation knows
about the new interface and uses it, it fills the array and sets the
VM_FAULT_AROUND bit in the return code. Only filemap_fault() is
converted at the moment.

I haven't tested it much, but my kvm boots. There are a few places where
the code should be fixed. __do_fault() and filemap_fault() are too ugly
and need to be cleaned up.

I don't have any performance data yet.

Any thoughts?

---
 include/linux/mm.h |  11 +++++
 mm/filemap.c       | 142 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/memory.c        |  58 ++++++++++++++++++++--
 3 files changed, 204 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 35527173cf50..deb65c0850a9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -181,6 +181,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_KILLABLE	0x20	/* The fault task is in SIGKILL killable region */
 #define FAULT_FLAG_TRIED	0x40	/* second try */
 #define FAULT_FLAG_USER		0x80	/* The fault originated in userspace */
+#define FAULT_FLAG_AROUND	0x100	/* Get ->nr_pages pages around fault
+					 * address
+					 */

 /*
  * vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -200,6 +203,13 @@ struct vm_fault {
 					 * is set (which is also implied by
 					 * VM_FAULT_ERROR).
 					 */
+
+	int nr_pages;			/* Number of pages to faultin, naturally
+					 * aligned around virtual_address
+					 */
+	struct page **pages;		/* Pointer to array to store nr_pages
+					 * pages.
+					 */
 };

 /*
@@ -962,6 +972,7 @@ static inline int page_mapped(struct page *page)
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
+#define VM_FAULT_AROUND	0x1000

 #define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */

diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a92021c..1cabb15a5847 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -884,6 +884,64 @@ repeat:
 	return ret;
 }

+void find_get_pages_or_null(struct address_space *mapping, pgoff_t start,
+			unsigned int nr_pages, struct page **pages)
+{
+	struct radix_tree_iter iter;
+	void **slot;
+
+	if (unlikely(!nr_pages))
+		return;
+
+	memset(pages, 0, sizeof(struct page *) * nr_pages);
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				WARN_ON(iter.index);
+				goto restart;
+			}
+			/*
+			 * Otherwise, shmem/tmpfs must be storing a swap entry
+			 * here as an exceptional entry: so skip over it -
+			 * we only reach this from invalidate_mapping_pages().
+			 */
+			continue;
+		}
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		if (page->index - start >= nr_pages) {
+			page_cache_release(page);
+			break;
+		}
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[page->index - start] = page;
+	}
+
+	rcu_read_unlock();
+}
+
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
  * @mapping:	The address_space to search
@@ -1595,6 +1653,67 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
 					   page, offset, ra->ra_pages);
 }

+static void lock_secondary_pages(struct vm_area_struct *vma,
+		struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
+	pgoff_t size;
+	int i;
+
+	for (i = 0; i < vmf->nr_pages; i++) {
+		unsigned long addr = address + PAGE_SIZE * (i - primary_idx);
+
+		if (i == primary_idx || !vmf->pages[i])
+			continue;
+
+		if (addr < vma->vm_start || addr >= vma->vm_end)
+			goto put;
+		if (!trylock_page(vmf->pages[i]))
+			goto put;
+		/* Truncated? */
+		if (unlikely(vmf->pages[i]->mapping != mapping))
+			goto unlock;
+
+		if (unlikely(!PageUptodate(vmf->pages[i])))
+			goto unlock;
+
+		/*
+		 * We have a locked page in the page cache, now we need to
+		 * check that it's up-to-date. If not, it is going to be due to
+		 * an error.
+		 */
+		size = (i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1)
+			>> PAGE_CACHE_SHIFT;
+		if (unlikely(vmf->pages[i]->index >= size))
+			goto unlock;
+
+		continue;
+unlock:
+		unlock_page(vmf->pages[i]);
+put:
+		put_page(vmf->pages[i]);
+		vmf->pages[i] = NULL;
+	}
+}
+
+static void unlock_and_put_secondary_pages(struct vm_fault *vmf)
+{
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
+	int i;
+
+	for (i = 0; i < vmf->nr_pages; i++) {
+		if (i == primary_idx || !vmf->pages[i])
+			continue;
+		unlock_page(vmf->pages[i]);
+		page_cache_release(vmf->pages[i]);
+		vmf->pages[i] = NULL;
+	}
+}
+
 /**
  * filemap_fault - read in file data for page fault handling
  * @vma:	vma in which the fault was taken
@@ -1617,6 +1736,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	pgoff_t offset = vmf->pgoff;
 	struct page *page;
 	pgoff_t size;
+	unsigned long address = (unsigned long)vmf->virtual_address;
+	int primary_idx = (address >> PAGE_SHIFT) & (vmf->nr_pages - 1);
 	int ret = 0;

 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
@@ -1626,7 +1747,15 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	/*
 	 * Do we have something in the page cache already?
 	 */
-	page = find_get_page(mapping, offset);
+	if (vmf->flags & FAULT_FLAG_AROUND) {
+		find_get_pages_or_null(mapping, offset - primary_idx,
+				vmf->nr_pages, vmf->pages);
+		page = vmf->pages[primary_idx];
+		lock_secondary_pages(vma, vmf);
+		ret |= VM_FAULT_AROUND;
+	} else
+		page = find_get_page(mapping, offset);
+
 	if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
 		/*
 		 * We found the page, so try async readahead before
@@ -1638,14 +1767,18 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 		do_sync_mmap_readahead(vma, ra, file, offset);
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
-		ret = VM_FAULT_MAJOR;
+		ret |= VM_FAULT_MAJOR;
 retry_find:
 		page = find_get_page(mapping, offset);
 		if (!page)
 			goto no_cached_page;
+		if (vmf->flags & FAULT_FLAG_AROUND)
+			vmf->pages[primary_idx] = page;
 	}

 	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+		unlock_and_put_secondary_pages(vmf);
+		ret &= ~VM_FAULT_AROUND;
 		page_cache_release(page);
 		return ret | VM_FAULT_RETRY;
 	}
@@ -1671,6 +1804,7 @@ retry_find:
 	 */
 	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	if (unlikely(offset >= size)) {
+		unlock_and_put_secondary_pages(vmf);
 		unlock_page(page);
 		page_cache_release(page);
 		return VM_FAULT_SIGBUS;
@@ -1694,6 +1828,8 @@ no_cached_page:
 	if (error >= 0)
 		goto retry_find;

+	unlock_and_put_secondary_pages(vmf);
+
 	/*
 	 * An error return from page_cache_read can result if the
 	 * system is low on memory, or a problem occurs while trying
@@ -1722,6 +1858,8 @@ page_not_uptodate:
 	if (!error || error == AOP_TRUNCATED_PAGE)
 		goto retry_find;

+	unlock_and_put_secondary_pages(vmf);
+
 	/* Things didn't work out. Return zero to tell the mm layer so. */
 	shrink_readahead_size_eio(file, ra);
 	return VM_FAULT_SIGBUS;
diff --git a/mm/memory.c b/mm/memory.c
index 6768ce9e57d2..e94d86ac7d5a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3283,6 +3283,9 @@ oom:
 	return VM_FAULT_OOM;
 }

+#define FAULT_AROUND_PAGES 32
+#define FAULT_AROUND_MASK (FAULT_AROUND_PAGES - 1)
+
 /*
  * __do_fault() tries to create a new page mapping.
It aggressively * tries to share with existing pages, but makes a separate copy if @@ -3303,12 +3306,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, pte_t *page_table; spinlock_t *ptl; struct page *page; + struct page *pages[FAULT_AROUND_PAGES]; struct page *cow_page; pte_t entry; int anon = 0; struct page *dirty_page = NULL; struct vm_fault vmf; - int ret; + int i, ret; int page_mkwrite = 0; /* @@ -3336,14 +3340,35 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, vmf.flags = flags; vmf.page = NULL; + /* Do fault around only for read faults on linear mapping */ + if (flags & (FAULT_FLAG_WRITE | FAULT_FLAG_NONLINEAR)) { + vmf.nr_pages = 0; + vmf.pages = NULL; + } else { + vmf.nr_pages = FAULT_AROUND_PAGES; + vmf.pages = pages; + vmf.flags |= FAULT_FLAG_AROUND; + } ret = vma->vm_ops->fault(vma, &vmf); + + /* ->fault doesn't know about FAULT_FLAG_AROUND */ + if ((vmf.flags & FAULT_FLAG_AROUND) && !(ret & VM_FAULT_AROUND)) { + vmf.flags &= ~FAULT_FLAG_AROUND; + } if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY))) goto uncharge_out; if (unlikely(PageHWPoison(vmf.page))) { - if (ret & VM_FAULT_LOCKED) - unlock_page(vmf.page); + if (ret & VM_FAULT_LOCKED) { + if (vmf.flags & FAULT_FLAG_AROUND) { + for (i = 0; i < FAULT_AROUND_PAGES; i++) { + if (pages[i]) + unlock_page(pages[i]); + } + } else + unlock_page(vmf.page); + } ret = VM_FAULT_HWPOISON; goto uncharge_out; } @@ -3352,9 +3377,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, * For consistency in subsequent calls, make the faulted page always * locked. */ - if (unlikely(!(ret & VM_FAULT_LOCKED))) + if (unlikely(!(ret & VM_FAULT_LOCKED))) { + BUG_ON(ret & VM_FAULT_AROUND); // FIXME lock_page(vmf.page); - else + } else VM_BUG_ON(!PageLocked(vmf.page)); /* @@ -3401,6 +3427,28 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, page_table = pte_offset_map_lock(mm, pmd, address, &ptl); + for (i = 0; (vmf.flags & FAULT_FLAG_AROUND) && i < FAULT_AROUND_PAGES; + i++) { + int primary_idx = (address >> PAGE_SHIFT) & FAULT_AROUND_MASK; + pte_t *pte = page_table - primary_idx + i; + unsigned long addr = address + PAGE_SIZE * (i - primary_idx); + + if (!pages[i]) + continue; + if (i == primary_idx || !pte_none(*pte)) + goto skip; + if (PageHWPoison(vmf.page)) + goto skip; + flush_icache_page(vma, pages[i]); + entry = mk_pte(pages[i], vma->vm_page_prot); + inc_mm_counter_fast(mm, MM_FILEPAGES); + page_add_file_rmap(pages[i]); + set_pte_at(mm, addr, pte, entry); + update_mmu_cache(vma, addr, pte); +skip: + unlock_page(pages[i]); + } + /* * This silly early PAGE_DIRTY setting removes a race * due to the bad i386 page protection. But it's valid -- Kirill A. Shutemov ^ permalink raw reply related [flat|nested] 32+ messages in thread
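The index arithmetic the patch repeats in lock_secondary_pages(), filemap_fault() and __do_fault() is easier to see in isolation. A standalone sketch (illustrative only; constants assumed to match the patch, i.e. 4K pages and FAULT_AROUND_PAGES = 32) of how the naturally aligned window and the primary slot fall out of the faulting address:

#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define FAULT_AROUND_PAGES	32
#define FAULT_AROUND_MASK	(FAULT_AROUND_PAGES - 1)

int main(void)
{
	/* Example faulting address; any user address works. */
	unsigned long address = 0x7f1234567000UL;
	unsigned long pgoff = address >> PAGE_SHIFT;

	/* Slot of the faulting page inside the 32-page window. */
	int primary_idx = pgoff & FAULT_AROUND_MASK;

	/* The window is the 32-page-aligned region containing the fault,
	 * not a window centered on it. */
	unsigned long start = (pgoff - primary_idx) << PAGE_SHIFT;
	unsigned long end = start + FAULT_AROUND_PAGES * PAGE_SIZE;

	printf("fault %#lx -> window [%#lx, %#lx), primary slot %d\n",
	       address, start, end, primary_idx);
	return 0;
}

Because the window is aligned rather than centered, all 32 ptes live in the same page table, in a handful of adjacent cachelines; the window can still cross a VMA boundary, which is why lock_secondary_pages() drops any page whose address falls outside vm_start..vm_end.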
* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache 2014-02-07 15:42 ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov @ 2014-02-07 17:32 ` Andi Kleen 2014-02-07 17:56 ` Kirill A. Shutemov 0 siblings, 1 reply; 32+ messages in thread From: Andi Kleen @ 2014-02-07 17:32 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox > I haven't tested it much, but my kvm boots. There are a few places where the code > should be fixed. __do_fault() and filemap_fault() are too ugly and need to > be cleaned up. > > I don't have any performance data yet. > > Any thoughts? It seems very drastic to do it unconditionally. How about at least a simple stream detection heuristic and perhaps also madvise? There are some extreme cases where workloads could use a lot more memory than before, if they access their memory sparsely in the right pattern. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
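For reference, the madvise() hook Andi mentions already exists as a per-mapping access-pattern hint. A minimal sketch (illustrative -- the posted patch does not actually consult these hints; the tie-in is hypothetical) of how a workload could declare its pattern:

#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch: declare an access pattern for a mapping. A fault-around
 * heuristic could honour MADV_RANDOM by not mapping extra pages,
 * and MADV_SEQUENTIAL by mapping (or reading ahead) more of them.
 */
void declare_sparse(void *addr, size_t len)
{
	madvise(addr, len, MADV_RANDOM);	/* expect scattered access */
}

void declare_streaming(void *addr, size_t len)
{
	madvise(addr, len, MADV_SEQUENTIAL);	/* expect linear scans */
}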
* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache 2014-02-07 17:32 ` Andi Kleen @ 2014-02-07 17:56 ` Kirill A. Shutemov 2014-02-07 18:11 ` Andi Kleen 0 siblings, 1 reply; 32+ messages in thread From: Kirill A. Shutemov @ 2014-02-07 17:56 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox On Fri, Feb 07, 2014 at 09:32:00AM -0800, Andi Kleen wrote: > > I haven't tested it much, but my kvm boots. There are a few places where the code > > should be fixed. __do_fault() and filemap_fault() are too ugly and need to > > be cleaned up. > > > > I don't have any performance data yet. > > > > Any thoughts? > > It seems very drastic to do it unconditionally. How about at least a simple > stream detection heuristic and perhaps also madvise? We already have the readahead code here; it could be reused. But see below. > There are some extreme cases where workloads could use a lot more memory > than before, if they access their memory sparsely in the right pattern. Have you noticed that we don't actually allocate any memory: we only reuse what's already there. Sure, it will increase VmRSS, but do we care? -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC, PATCH] mm: map few pages around fault address if they are in page cache 2014-02-07 17:56 ` Kirill A. Shutemov @ 2014-02-07 18:11 ` Andi Kleen 0 siblings, 0 replies; 32+ messages in thread From: Andi Kleen @ 2014-02-07 18:11 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Linus Torvalds, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu, Dave Hansen, Matthew Wilcox > > There are some extreme cases where workloads could use a lot more memory > > than before, if they access their memory sparsely in the right pattern. > > Have you noticed that we don't actually allocate any memory: we only reuse > what's already there. Sure, it will increase VmRSS, but do we care? Good point. With that, readahead is probably overkill, agreed. We would need it, though, if we ever do this for anonymous memory. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [RFC] de-asmify the x86-64 system call slowpath 2014-02-06 2:32 ` Linus Torvalds 2014-02-06 4:33 ` Linus Torvalds @ 2014-02-06 5:42 ` Ingo Molnar 1 sibling, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2014-02-06 5:42 UTC (permalink / raw) To: Linus Torvalds Cc: Kirill A. Shutemov, Peter Anvin, Ingo Molnar, Thomas Gleixner, Peter Zijlstra, the arch/x86 maintainers, Linux Kernel Mailing List, Ning Qu * Linus Torvalds <torvalds@linux-foundation.org> wrote: > [...] > > No, I was thinking "try to optimistically map 8 adjacent aligned > pages at a time" - that would be the same cacheline in the page > tables, so it would be fairly cheap if we couple it with a > gang-lookup of the pages in the page cache (or, for anonymous pages, > by just optimistically trying to do an order-3 page allocation, and > if that works, just map the 32kB allocation you got as eight > individual pages). > > I know it's been discussed at some point, and I even have a dim > memory of having seen some really ugly patches. I have a dim memory of having written such group-prefaulting patches myself a decade ago or so - IIRC the main problem was that at that time we never found a common load where it really mattered, and it was easy to spend more time doing all this extra work and not see the prefaulted pages used. But the cost/benefit balance has indeed changed so IMO it's worth a try. Thanks, Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
Thread overview: 32+ messages
2014-01-26 22:28 [RFC] de-asmify the x86-64 system call slowpath Linus Torvalds
2014-01-27  0:22 ` Al Viro
2014-01-27  4:32 ` Linus Torvalds
2014-01-27  4:48 ` H. Peter Anvin
2014-01-27  7:42 ` Al Viro
2014-01-27 22:06 ` Andy Lutomirski
2014-01-27 22:17 ` Linus Torvalds
2014-01-27 22:32 ` Al Viro
2014-01-27 22:43 ` Linus Torvalds
2014-01-27 22:46 ` Andy Lutomirski
2014-01-28  0:22 ` H. Peter Anvin
2014-01-28  0:44 ` Andy Lutomirski
2014-01-27 23:07 ` Al Viro
2014-01-27 10:27 ` Peter Zijlstra
2014-01-27 11:36 ` Al Viro
2014-01-27 17:39 ` Oleg Nesterov
2014-01-28  1:18 ` Al Viro
2014-01-28 16:38 ` Oleg Nesterov
2014-01-28 16:48 ` Al Viro
2014-01-28 17:19 ` Oleg Nesterov
2014-02-06  0:32 ` Linus Torvalds
2014-02-06  0:55 ` Kirill A. Shutemov
2014-02-06  2:32 ` Linus Torvalds
2014-02-06  4:33 ` Linus Torvalds
2014-02-06 21:29 ` Linus Torvalds
2014-02-06 22:24 ` Kirill A. Shutemov
2014-02-07  1:31 ` Linus Torvalds
2014-02-07 15:42 ` [RFC, PATCH] mm: map few pages around fault address if they are in page cache Kirill A. Shutemov
2014-02-07 17:32 ` Andi Kleen
2014-02-07 17:56 ` Kirill A. Shutemov
2014-02-07 18:11 ` Andi Kleen
2014-02-06  5:42 ` [RFC] de-asmify the x86-64 system call slowpath Ingo Molnar