Date: Sat, 28 Mar 2015 10:30:36 +0100
From: Ingo Molnar
To: Linus Torvalds
Cc: Brian Gerst, Andy Lutomirski, Denys Vlasenko, Borislav Petkov,
    "linux-kernel@vger.kernel.org", X86 ML
Subject: Re: ia32_sysenter_target does not preserve EFLAGS
Message-ID: <20150328093036.GA9453@gmail.com>
References: <5515686B.3080204@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

* Linus Torvalds wrote:

> On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst wrote:
> >>     <-- IRQ. Boom
> >
> > The sti will delay interrupts for one instruction, and that should
> > include NMIs.
>
> Nope. Intel explicitly documents the NMI case only for mov->ss and popss.

Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs.

Try something like this as root on a system with Intel CPUs (running
recent tools/perf), with high-freq NMI sampling:

   perf top -F 10000

Then execute a tight syscall loop on all CPUs (a getppid() loop for
example), and you'll see something like this in the profile:

   Samples: 1M of event 'cycles', Event count (approx.): 377899840545
   Overhead  Shared Object  Symbol                           ◆
     27.67%  libc-2.19.so   [.] __GI___getppid               ▒
     21.34%  [kernel]       [k] system_call                  ▒
     17.42%  [kernel]       [k] system_call_after_swapgs     ▒
     12.00%  [kernel]       [k] pid_vnr                      ▒
      7.49%  [kernel]       [k] sys_getppid                  ▒
      5.49%  [kernel]       [k] sysret_check                 ▒
      5.34%  loop-getppid   [.] main                         ▒
      1.56%  [kernel]       [k] system_call_fastpath         ▒
      0.36%  loop-getppid   [.] getppid@plt                  ▒

Note the very high sample count (due to sampling at 10 kHz).

Now if you hit '<Enter>' twice to annotate system_call_after_swapgs you
should see something like this (the live kernel image disassembly,
annotated):

   system_call_after_swapgs  /proc/kcore
          │
          │
          │
          │    Disassembly of section load0:
          │
          │    ffffffff8178b3f3 :
     9.72 │ffffffff8178b3f3:   mov    %rsp,%gs:0xb040
    44.24 │ffffffff8178b3fc:   mov    %gs:0xb888,%rsp
     0.02 │ffffffff8178b405:   sti
          │ffffffff8178b406:   nopl   0x0(%rax)
    16.04 │ffffffff8178b40d:   sub    $0x50,%rsp
          │ffffffff8178b411:   mov    %rdi,0x40(%rsp)
     6.51 │ffffffff8178b416:   mov    %rsi,0x38(%rsp)
     5.81 │ffffffff8178b41b:   mov    %rdx,0x30(%rsp)
     2.22 │ffffffff8178b420:   mov    %rax,0x20(%rsp)
     2.16 │ffffffff8178b425:   mov    %r8,0x18(%rsp)
     0.93 │ffffffff8178b42a:   mov    %r9,0x10(%rsp)
     1.57 │ffffffff8178b42f:   mov    %r10,0x8(%rsp)
     3.70 │ffffffff8178b434:   mov    %r11,(%rsp)
     2.27 │ffffffff8178b438:   mov    %rax,0x48(%rsp)

Note how the 7-byte NOP after the STI did not get a single profiler hit.

This is with the default '-e cycles', not '-e cycles:pp', so what we see
as profiler hits should be the raw NMI entry RIPs.

Arguably this could be just the decoder hiding the NOP efficiently; I'll
try to run some more experiments ...

Thanks,

	Ingo