* RCU question @ 2004-12-09 23:59 George Anzinger 2004-12-10 4:31 ` Dipankar Sarma 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-09 23:59 UTC (permalink / raw) To: Dipankar Sarma, Manfred Spraul, lkml I am working on VST code. This code is called from the idle loop to check for future timers. It then sets up a timer to interrupt in time to handle the nearest timer and turns off the time base interrupt source. As part of qualifying the entry to this state I want to make sure there is no pending work so, from the idle task I have this: if (local_softirq_pending()) do_softirq(); BUG_ON(local_softirq_pending()); I did not really expect to find any pending softirqs, but, not only are there some, they don't go away and the system BUGs. The offender is the RCU task. The question is: is this normal or is there something wrong? -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-09 23:59 RCU question George Anzinger @ 2004-12-10 4:31 ` Dipankar Sarma 2004-12-10 19:42 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Dipankar Sarma @ 2004-12-10 4:31 UTC (permalink / raw) To: ganzinger; +Cc: Manfred Spraul, lkml On Thu, Dec 09, 2004 at 03:59:45PM -0800, George Anzinger wrote: > I am working on VST code. This code is called from the idle loop to check > for future timers. It then sets up a timer to interrupt in time to handle > the nearest timer and turns off the time base interrupt source. As part of > qualifying the entry to this state I want to make sure there is no pending > work so, from the idle task I have this: > > if (local_softirq_pending()) > do_softirq(); > > BUG_ON(local_softirq_pending()); > > I did not really expect to find any pending softirqs, but, not only are > there some, they don't go away and the system BUGs. The offender is the > RCU task. The question is: is this normal or is there something wrong? Why do you think there would not be any softirq pending after do_softirq() ? What if the cpu gets a network interrupt which raises a softirq ? And yes, RCU processing in softirq context can re-raise the softirq. AFAICS, it is perfectly normal. Thanks Dipankar ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 4:31 ` Dipankar Sarma @ 2004-12-10 19:42 ` George Anzinger 2004-12-10 20:40 ` Dipankar Sarma 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-10 19:42 UTC (permalink / raw) To: dipankar; +Cc: ganzinger, Manfred Spraul, lkml Dipankar Sarma wrote: > On Thu, Dec 09, 2004 at 03:59:45PM -0800, George Anzinger wrote: > >>I am working on VST code. This code is called from the idle loop to check >>for future timers. It then sets up a timer to interrupt in time to handle >>the nearest timer and turns off the time base interrupt source. As part of >>qualifying the entry to this state I want to make sure there is no pending >>work so, from the idle task I have this: >> >> if (local_softirq_pending()) >> do_softirq(); >> >> BUG_ON(local_softirq_pending()); >> >>I did not really expect to find any pending softirqs, but, not only are >>there some, they don't go away and the system BUGs. The offender is the >>RCU task. The question is: is this normal or is there something wrong? > > > Why do you think there would not be any softirq pending after do_softirq() ? > What if the cpu gets a network interrupt which raises a softirq ? Yes, but it is serviced on interrupt exit and the task level code would never see it. > And yes, RCU processing in softirq context can re-raise the softirq. > AFAICS, it is perfectly normal. My assumption was that, this being the idle task, RCU would be more than happy to finish all its pending tasks. It may be necessary for me to rethink the conditions required to go into the VST state. I had assumed that it required NO softirq pending as a pre condition. From this point on we would have the interrupt system off until the hardware sleep instruction (hlt in the x86 case). -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 19:42 ` George Anzinger @ 2004-12-10 20:40 ` Dipankar Sarma 2004-12-10 20:45 ` Lee Revell 0 siblings, 1 reply; 44+ messages in thread From: Dipankar Sarma @ 2004-12-10 20:40 UTC (permalink / raw) To: George Anzinger; +Cc: ganzinger, Manfred Spraul, lkml On Fri, Dec 10, 2004 at 11:42:55AM -0800, George Anzinger wrote: > Dipankar Sarma wrote: > >And yes, RCU processing in softirq context can re-raise the softirq. > >AFAICS, it is perfectly normal. > > My assumption was that, this being the idle task, RCU would be more than > happy to finish all its pending tasks. We try to avoid really long running softirqs (RCU tasklet in this case) for better scheduling latency. A long running rcu tasklet during an idle cpu may delay running of an RT process that becomes runnable during the rcu tasklet. > > It may be necessary for me to rethink the conditions required to go into > the VST state. I had assumed that it required NO softirq pending as a pre > condition. From this point on we would have the interrupt system off until > the hardware sleep instruction (hlt in the x86 case). Unfortunately, we aren't there yet. But it is in my TODO list for a generic nohz system. Thanks Dipankar ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 20:40 ` Dipankar Sarma @ 2004-12-10 20:45 ` Lee Revell 2004-12-10 21:02 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Lee Revell @ 2004-12-10 20:45 UTC (permalink / raw) To: dipankar; +Cc: George Anzinger, ganzinger, Manfred Spraul, lkml On Sat, 2004-12-11 at 02:10 +0530, Dipankar Sarma wrote: > On Fri, Dec 10, 2004 at 11:42:55AM -0800, George Anzinger wrote: > > Dipankar Sarma wrote: > > >And yes, RCU processing in softirq context can re-raise the softirq. > > >AFAICS, it is perfectly normal. > > > > My assumption was that, this being the idle task, RCU would be more than > > happy to finish all its pending tasks. > > We try to avoid really long running softirqs (RCU tasklet in this case) > for better scheduling latency. A long running rcu tasklet during > an idle cpu may delay running of an RT process that becomes runnable > during the rcu tasklet. > Well, softirqs should really be preemptible if you care about RT task latency. Ingo's patches have had this for months. Works great. Maybe it's time to push it upstream. Lee ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 20:45 ` Lee Revell @ 2004-12-10 21:02 ` George Anzinger 2004-12-10 22:58 ` Zwane Mwaikambo 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-10 21:02 UTC (permalink / raw) To: Lee Revell; +Cc: dipankar, ganzinger, Manfred Spraul, lkml Lee Revell wrote: > On Sat, 2004-12-11 at 02:10 +0530, Dipankar Sarma wrote: > >>On Fri, Dec 10, 2004 at 11:42:55AM -0800, George Anzinger wrote: >> >>>Dipankar Sarma wrote: >>> >>>>And yes, RCU processing in softirq context can re-raise the softirq. >>>>AFAICS, it is perfectly normal. >>> >>>My assumption was that, this being the idle task, RCU would be more than >>>happy to finish all its pending tasks. >> >>We try to avoid really long running softirqs (RCU tasklet in this case) >>for better scheduling latency. A long running rcu tasklet during >>an idle cpu may delay running of an RT process that becomes runnable >>during the rcu tasklet. >> > > > Well, softirqs should really be preemptible if you care about RT task > latency. Ingo's patches have had this for months. Works great. Maybe > it's time to push it upstream. Yes, I understand, and soft_irq() does turn on interrupts... I was thinking of something like: while(softirq_pending()) { local_irq_enable(); do_softirq(); local_irq_disable(); } <proceed to idle hlt...> -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 21:02 ` George Anzinger @ 2004-12-10 22:58 ` Zwane Mwaikambo 2004-12-11 2:22 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-10 22:58 UTC (permalink / raw) To: George Anzinger; +Cc: Lee Revell, dipankar, ganzinger, Manfred Spraul, lkml On Fri, 10 Dec 2004, George Anzinger wrote: > > Well, softirqs should really be preemptible if you care about RT task > > latency. Ingo's patches have had this for months. Works great. Maybe > > it's time to push it upstream. > > Yes, I understand, and soft_irq() does turn on interrupts... > I was thinking of something like: > > while(softirq_pending()) { > local_irq_enable(); > do_softirq(); > local_irq_disable(); > } > <proceed to idle hlt...> But that's a deadlock and if you enable interrupts you race. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-10 22:58 ` Zwane Mwaikambo @ 2004-12-11 2:22 ` George Anzinger 2004-12-11 2:45 ` Zwane Mwaikambo 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-11 2:22 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: Lee Revell, dipankar, ganzinger, Manfred Spraul, lkml Zwane Mwaikambo wrote: > On Fri, 10 Dec 2004, George Anzinger wrote: > > >>>Well, softirqs should really be preemptible if you care about RT task >>>latency. Ingo's patches have had this for months. Works great. Maybe >>>it's time to push it upstream. >> >>Yes, I understand, and soft_irq() does turn on interrupts... >>I was thinking of something like: >> >> while(softirq_pending()) { >> local_irq_enable(); >> do_softirq(); >> local_irq_disable(); >> } >> <proceed to idle hlt...> > > > But that's a deadlock and if you enable interrupts you race. Again, I remind you we are in the idle task. Nothing more important to do. Or do you mean that softirq_pending() will NEVER return false? The other question is: "Is useful work being done?" -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 2:22 ` George Anzinger @ 2004-12-11 2:45 ` Zwane Mwaikambo 2004-12-11 3:29 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-11 2:45 UTC (permalink / raw) To: George Anzinger; +Cc: Lee Revell, dipankar, ganzinger, Manfred Spraul, lkml On Fri, 10 Dec 2004, George Anzinger wrote: > > But that's a deadlock and if you enable interrupts you race. > > Again, I remind you we are in the idle task. Nothing more important to do. > Or do you mean that softirq_pending() will NEVER return false? > > The other question is: "Is useful work being done?" We're in the idle task but obviously interrupts (such as network) are still coming in. So you may take an interrupt after your while (softirq_pending()) loop has exited. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 2:45 ` Zwane Mwaikambo @ 2004-12-11 3:29 ` George Anzinger 2004-12-11 14:52 ` Zwane Mwaikambo 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-11 3:29 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: Lee Revell, dipankar, ganzinger, Manfred Spraul, lkml Zwane Mwaikambo wrote: > On Fri, 10 Dec 2004, George Anzinger wrote: > > >>>But that's a deadlock and if you enable interrupts you race. >> >>Again, I remind you we are in the idle task. Nothing more important to do. >>Or do you mean that softirq_pending() will NEVER return false? >> >>The other question is: "Is useful work being done?" > > > We're in the idle task but obviously interrupts (such as network) are > still coming in. So you may take an interrupt after your while > (softirq_pending()) loop has exited. That is ok. Either we have interrupts off and no softirqs are pending and we proceed to the "hlt" (where the interrupt will be taken), or softirqs are pending, we turn interrupts on, do the softirq, turn interrupts off and try again. Unless some tasklet (RCU?) never "gives up" or we will exit the while with interrupts off and move on to the "hlt". Or did I miss something? - George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 3:29 ` George Anzinger @ 2004-12-11 14:52 ` Zwane Mwaikambo 2004-12-11 16:32 ` Manfred Spraul 0 siblings, 1 reply; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-11 14:52 UTC (permalink / raw) To: George Anzinger; +Cc: Lee Revell, dipankar, ganzinger, Manfred Spraul, lkml On Fri, 10 Dec 2004, George Anzinger wrote: > That is ok. Either we have interrupts off and no softirqs are pending and we > proceed to the "hlt" (where the interrupt will be taken), or softirqs are > pending, we turn interrupts on, do the softirq, turn interrupts off and try > again. Unless some tasklet (RCU?) never "gives up" or we will exit the while > with interrupts off and move on to the "hlt". Or did I miss something? But the point is that you cannot execute hlt with interrupts disabled. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 14:52 ` Zwane Mwaikambo @ 2004-12-11 16:32 ` Manfred Spraul 2004-12-11 16:52 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Manfred Spraul @ 2004-12-11 16:32 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: George Anzinger, Lee Revell, dipankar, ganzinger, lkml Zwane Mwaikambo wrote: >On Fri, 10 Dec 2004, George Anzinger wrote: > > > >>That is ok. Either we have interrupts off and no softirqs are pending and we >>proceed to the "hlt" (where the interrupt will be taken), or softirqs are >>pending, we turn interrupts on, do the softirq, turn interrupts off and try >>again. Unless some tasklet (RCU?) never "gives up" or we will exit the while >>with interrupts off and move on to the "hlt". Or did I miss something? >> >> > >But the point is that you cannot execute hlt with interrupts disabled. > > The trick is the sti instruction: It enables interrupt processing after the following instruction. Thus sti hlt cannot race - it atomically enables interrupts and waits. -- Manfred ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 16:32 ` Manfred Spraul @ 2004-12-11 16:52 ` George Anzinger 2004-12-12 2:53 ` Zwane Mwaikambo 0 siblings, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-11 16:52 UTC (permalink / raw) To: Manfred Spraul; +Cc: Zwane Mwaikambo, Lee Revell, dipankar, ganzinger, lkml Manfred Spraul wrote: > Zwane Mwaikambo wrote: > >> On Fri, 10 Dec 2004, George Anzinger wrote: >> >> >> >>> That is ok. Either we have interrupts off and no softirqs are >>> pending and we >>> proceed to the "hlt" (where the interrupt will be taken), or softirqs >>> are >>> pending, we turn interrupts on, do the softirq, turn interrupts off >>> and try >>> again. Unless some tasklet (RCU?) never "gives up" or we will exit >>> the while >>> with interrupts off and move on to the "hlt". Or did I miss something? >>> >> >> >> But the point is that you cannot execute hlt with interrupts disabled. >> >> > The trick is the sti instruction: It enables interrupt processing after > the following instruction. > > Thus > sti > hlt > > cannot race - it atomically enables interrupts and waits. Exactly :) -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-11 16:52 ` George Anzinger @ 2004-12-12 2:53 ` Zwane Mwaikambo 2004-12-12 8:59 ` Manfred Spraul 0 siblings, 1 reply; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-12 2:53 UTC (permalink / raw) To: George Anzinger; +Cc: Manfred Spraul, Lee Revell, dipankar, ganzinger, lkml On Sat, 11 Dec 2004, George Anzinger wrote: > Manfred Spraul wrote: > > > > > The trick is the sti instruction: It enables interrupt processing after the > > following instruction. > > > > Thus > > sti > > hlt > > > > cannot race - it atomically enables interrupts and waits. > > Exactly :) Ok i wasn't aware that it was safe_halt() that he was referring too, my poor assumption. But regardless, this seems highly fragile and relying on behaviour which may change across processor models/vendors. I also found the following excerpt from (http://sandpile.org/ia32/inter.htm) which you may find interesting; "Intel processors don't suppress SMI or NMI after an STI instruction. Since the INTR suppresion is not preserved across an SMI or NMI handler, this may result in an INTR being serviced after the STI, which constitutes a violation of the INTR suppresion. Therefore, ideally the STI instruction also suppresses SMI and NMI." George thanks for persisting and explaining your point, i can be very slow =) Zwane ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 2:53 ` Zwane Mwaikambo @ 2004-12-12 8:59 ` Manfred Spraul 2004-12-12 9:37 ` Andrea Arcangeli 2004-12-12 16:26 ` Zwane Mwaikambo 0 siblings, 2 replies; 44+ messages in thread From: Manfred Spraul @ 2004-12-12 8:59 UTC (permalink / raw) To: Zwane Mwaikambo; +Cc: George Anzinger, Lee Revell, dipankar, ganzinger, lkml Zwane Mwaikambo wrote: >"Intel processors don't suppress SMI or NMI after an STI instruction. >Since the INTR suppresion is not preserved across an SMI or NMI handler, >this may result in an INTR being serviced after the STI, which constitutes >a violation of the INTR suppresion. > > Interesting find. It means that our NMI irq return path should check if it points to a hlt instruction and if yes, then increase the saved EIP by one before doing the iretd, right? -- Manfred ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 8:59 ` Manfred Spraul @ 2004-12-12 9:37 ` Andrea Arcangeli 2004-12-12 10:22 ` Manfred Spraul 2004-12-12 16:26 ` Zwane Mwaikambo 1 sibling, 1 reply; 44+ messages in thread From: Andrea Arcangeli @ 2004-12-12 9:37 UTC (permalink / raw) To: Manfred Spraul Cc: Zwane Mwaikambo, George Anzinger, Lee Revell, dipankar, ganzinger, lkml On Sun, Dec 12, 2004 at 09:59:00AM +0100, Manfred Spraul wrote: > It means that our NMI irq return path should check if it points to a hlt > instruction and if yes, then increase the saved EIP by one before doing > the iretd, right? I don't think we'll ever post any event through nmi, so it doesn't matter. We only care to be waken by real irqs, not nmi/smi. Idle loop is fine to ignore the actions of the nmi handlers and to hang into the "hlt". ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 9:37 ` Andrea Arcangeli @ 2004-12-12 10:22 ` Manfred Spraul 2004-12-12 12:15 ` Andrea Arcangeli 2004-12-12 16:51 ` RCU question George Anzinger 0 siblings, 2 replies; 44+ messages in thread From: Manfred Spraul @ 2004-12-12 10:22 UTC (permalink / raw) To: Andrea Arcangeli Cc: Zwane Mwaikambo, George Anzinger, Lee Revell, dipankar, ganzinger, lkml Andrea Arcangeli wrote: >On Sun, Dec 12, 2004 at 09:59:00AM +0100, Manfred Spraul wrote: > > >>It means that our NMI irq return path should check if it points to a hlt >>instruction and if yes, then increase the saved EIP by one before doing >>the iretd, right? >> >> > >I don't think we'll ever post any event through nmi, so it doesn't >matter. We only care to be waken by real irqs, not nmi/smi. Idle loop is >fine to ignore the actions of the nmi handlers and to hang into the >"hlt". > > No, You misunderstood the problem: sti ** NMI handler ** normal interrupt arrives, is queued by the cpu ** irqd from NMI handler ** cpu notices the normal interrupt, handles it. ** normal interrupt does a wakeup, schedules a tasklet, whatever ** irqd from normal interupt hlt << cpu sleeps. Thus: lost wakeup. -- Manfred ** ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 10:22 ` Manfred Spraul @ 2004-12-12 12:15 ` Andrea Arcangeli 2004-12-14 21:40 ` Lee Revell 2004-12-12 16:51 ` RCU question George Anzinger 1 sibling, 1 reply; 44+ messages in thread From: Andrea Arcangeli @ 2004-12-12 12:15 UTC (permalink / raw) To: Manfred Spraul Cc: Zwane Mwaikambo, George Anzinger, Lee Revell, dipankar, ganzinger, lkml On Sun, Dec 12, 2004 at 11:22:49AM +0100, Manfred Spraul wrote: > Andrea Arcangeli wrote: > > >On Sun, Dec 12, 2004 at 09:59:00AM +0100, Manfred Spraul wrote: > > > > > >>It means that our NMI irq return path should check if it points to a hlt > >>instruction and if yes, then increase the saved EIP by one before doing > >>the iretd, right? > >> > >> > > > >I don't think we'll ever post any event through nmi, so it doesn't > >matter. We only care to be waken by real irqs, not nmi/smi. Idle loop is > >fine to ignore the actions of the nmi handlers and to hang into the > >"hlt". > > > > > No, You misunderstood the problem: > > sti > ** NMI handler > ** normal interrupt arrives, is queued by the cpu > ** irqd from NMI handler > ** cpu notices the normal interrupt, handles it. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ok. The above just wasn't obvious to me because iret of an nmi is doing the same thing that sti does (and the nmi itself is like a cli). Shouldn't iret wait 1 instruction too or is there a special case about iret? The specs only tells sti waits 1 instruction, but they don't tell anything about iret (nor that it waits nor that it doesn't wait). I realized now the link posted here assumes iret isn't going to wait 1 instruction before processing pending irqs which is reasonable given the specs don't tell anything about iret, but I didn't imagine there was a difference between sti and iret (I mean only when iret is going to change the interrupt enable flag from 0 to 1 just like sti does). Overall this is a very minor issue (unless HZ is 0), it would only introduce a 1/HZ latency to the irq that get posted while the nmi handler is running, and the nmi handlers never runs in production. Forcing idle=poll when the nmi watchdog is enabled is probably a reasonable fix. As for the SMI, I wonder how you plan to fix it. To me it sounds like a minor mistake that iret isn't equivalent to sti when it toggles the irq enable bitflag (infact I don't see a way to fix it for smi, though I know very little about smi). thanks. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 12:15 ` Andrea Arcangeli @ 2004-12-14 21:40 ` Lee Revell 2004-12-14 22:23 ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar 0 siblings, 1 reply; 44+ messages in thread From: Lee Revell @ 2004-12-14 21:40 UTC (permalink / raw) To: Andrea Arcangeli Cc: Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Ingo Molnar On Sun, 2004-12-12 at 13:15 +0100, Andrea Arcangeli wrote: > Overall this is a very minor issue (unless HZ is 0), it would only > introduce a 1/HZ latency to the irq that get posted while the nmi > handler is running, and the nmi handlers never runs in production. Ingo, couldn't this account for some of the inexplicable outliers some people were seeing in latency tests? Lee ^ permalink raw reply [flat|nested] 44+ messages in thread
* [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 21:40 ` Lee Revell @ 2004-12-14 22:23 ` Ingo Molnar 2004-12-14 22:47 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Ingo Molnar @ 2004-12-14 22:23 UTC (permalink / raw) To: Lee Revell Cc: Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds, Andi Kleen * Lee Revell <rlrevell@joe-job.com> wrote: > On Sun, 2004-12-12 at 13:15 +0100, Andrea Arcangeli wrote: > > Overall this is a very minor issue (unless HZ is 0), it would only > > introduce a 1/HZ latency to the irq that get posted while the nmi > > handler is running, and the nmi handlers never runs in production. > > Ingo, couldn't this account for some of the inexplicable outliers some > people were seeing in latency tests? indeed, there could be a connection, and it's certainly a fun race. The proper fix is Manfred's suggestion: check whether the EIP is a kernel text address, and if yes, whether it's a HLT instruction - and if yes then increase EIP by 1. I've included the fix in the -33-02 -RT patch. Andrew, Linus: upstream fix is below - i think it's post-2.6.10 stuff. Tested it on SMP and UP x86, using both the IO-APIC and the local-APIC based NMI watchdog. i think x64 needs a similar fix as well. Ingo --- linux/arch/i386/kernel/traps.c.orig +++ linux/arch/i386/kernel/traps.c @@ -670,6 +670,17 @@ fastcall void do_nmi(struct pt_regs * re cpu = smp_processor_id(); + /* + * Fix up obscure CPU behavior: if we interrupt safe_hlt() via + * the NMI then we might miss a reschedule if an interrupt is + * posted to the CPU and executes before the HLT instruction. + * + * We check whether the EIP is kernelspace, and if yes, whether + * the instruction is HLT: + */ + if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4) + regs->eip++; + #ifdef CONFIG_HOTPLUG_CPU if (!cpu_online(cpu)) { nmi_exit(); ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 22:23 ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar @ 2004-12-14 22:47 ` Ingo Molnar 2004-12-14 23:09 ` Linus Torvalds 2004-12-14 23:41 ` Andrea Arcangeli 2004-12-14 23:00 ` Linus Torvalds 2004-12-15 6:27 ` Avi Kivity 2 siblings, 2 replies; 44+ messages in thread From: Ingo Molnar @ 2004-12-14 22:47 UTC (permalink / raw) To: Lee Revell Cc: Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds, Andi Kleen * Ingo Molnar <mingo@elte.hu> wrote: > indeed, there could be a connection, and it's certainly a fun race. > The proper fix is Manfred's suggestion: check whether the EIP is a > kernel text address, and if yes, whether it's a HLT instruction - and > if yes then increase EIP by 1. I've included the fix in the -33-02 -RT > patch. Andrew, Linus: upstream fix is below - i think it's post-2.6.10 > stuff. Tested it on SMP and UP x86, using both the IO-APIC and the > local-APIC based NMI watchdog. > > i think x64 needs a similar fix as well. find the correct patch below. I've tested it with an NMI watchdog frequency artificially increased to 10 KHz, and i've instrumented the new branch in the NMI handler, but even under heavy IRQ load i was not able to trigger the branch. Maybe newer CPUs handle this case somehow and make sti;hlt truly atomic? I tried this on an old Celeron (Mendocino) and on an Athlon64. Ingo Signed-off-by: Ingo Molnar <mingo@elte.hu> --- linux/arch/i386/kernel/traps.c.orig +++ linux/arch/i386/kernel/traps.c @@ -670,6 +670,18 @@ fastcall void do_nmi(struct pt_regs * re cpu = smp_processor_id(); + /* + * Fix up obscure CPU behavior: if we interrupt safe_hlt() via + * the NMI then we might miss a reschedule if an interrupt is + * posted to the CPU and executes before the HLT instruction. + * + * We check whether the EIP is kernelspace, and if yes, whether + * the instruction is HLT: + */ + if (__kernel_text_address(regs->eip) && + *(unsigned char *)regs->eip == 0xf4) + regs->eip++; + #ifdef CONFIG_HOTPLUG_CPU if (!cpu_online(cpu)) { nmi_exit(); ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 22:47 ` Ingo Molnar @ 2004-12-14 23:09 ` Linus Torvalds 2004-12-15 8:52 ` Ingo Molnar 2004-12-16 0:37 ` Alan Cox 2004-12-14 23:41 ` Andrea Arcangeli 1 sibling, 2 replies; 44+ messages in thread From: Linus Torvalds @ 2004-12-14 23:09 UTC (permalink / raw) To: Ingo Molnar Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Tue, 14 Dec 2004, Ingo Molnar wrote: > > find the correct patch below. I've tested it with an NMI watchdog > frequency artificially increased to 10 KHz, and i've instrumented the > new branch in the NMI handler, but even under heavy IRQ load i was not > able to trigger the branch. Maybe newer CPUs handle this case somehow > and make sti;hlt truly atomic? Now that you mention it, I have this dim memory of the one-instruction "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU literally doesn't test for async events following "sti". Or maybe that was "mov->ss". That one also has that strange "black hole" for one instruction. Hmm.. You could be evil and try to fill up 64kB worth of memory with a "mov %ax,%ss", and jump to it in vm86 mode and see what happens. The eip will just keep wrapping around... Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 23:09 ` Linus Torvalds @ 2004-12-15 8:52 ` Ingo Molnar 2004-12-15 15:44 ` Linus Torvalds 2004-12-16 0:37 ` Alan Cox 1 sibling, 1 reply; 44+ messages in thread From: Ingo Molnar @ 2004-12-15 8:52 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen * Linus Torvalds <torvalds@osdl.org> wrote: > > find the correct patch below. I've tested it with an NMI watchdog > > frequency artificially increased to 10 KHz, and i've instrumented the > > new branch in the NMI handler, but even under heavy IRQ load i was not > > able to trigger the branch. Maybe newer CPUs handle this case somehow > > and make sti;hlt truly atomic? > > Now that you mention it, I have this dim memory of the one-instruction > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU > literally doesn't test for async events following "sti". i ran the stresstest overnight with the 10 KHz NMI, and not a single time did the new branch trigger, out of hundreds of millions of IRQs and NMIs. I think this suggests that the race doesnt exist in current CPUs. Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-15 8:52 ` Ingo Molnar @ 2004-12-15 15:44 ` Linus Torvalds 2004-12-15 16:35 ` Ingo Molnar 0 siblings, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2004-12-15 15:44 UTC (permalink / raw) To: Ingo Molnar Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Wed, 15 Dec 2004, Ingo Molnar wrote: > > i ran the stresstest overnight with the 10 KHz NMI, and not a single > time did the new branch trigger, out of hundreds of millions of IRQs and > NMIs. I think this suggests that the race doesnt exist in current CPUs. That may well be true, but I'm not convinced your test is meaningful or shows anything. The thing is, either the CPU is busy, or it's idle. If it's busy, you'll never see this. And if it's idle, it will always be _in_ the "halt" instruction. The only way to see the case is in the borderline cases, and if/when there are multiple different interrupts (first non-NMI interrupt takes it out of the hlt, and then the NMI happens to catch the sti). And quite frankly, I don't see how you would stress-test it. A 1kHz timer interrupt with a 10kHz NMI interrupt is still very infrequent interrupts... Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-15 15:44 ` Linus Torvalds @ 2004-12-15 16:35 ` Ingo Molnar 0 siblings, 0 replies; 44+ messages in thread From: Ingo Molnar @ 2004-12-15 16:35 UTC (permalink / raw) To: Linus Torvalds Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen * Linus Torvalds <torvalds@osdl.org> wrote: > > > On Wed, 15 Dec 2004, Ingo Molnar wrote: > > > > i ran the stresstest overnight with the 10 KHz NMI, and not a single > > time did the new branch trigger, out of hundreds of millions of IRQs and > > NMIs. I think this suggests that the race doesnt exist in current CPUs. > > That may well be true, but I'm not convinced your test is meaningful > or shows anything. > > The thing is, either the CPU is busy, or it's idle. If it's busy, > you'll never see this. And if it's idle, it will always be _in_ the > "halt" instruction. i deliberately started a test where there was roughly 50% idle time. > The only way to see the case is in the borderline cases, and if/when > there are multiple different interrupts (first non-NMI interrupt takes > it out of the hlt, and then the NMI happens to catch the sti). And > quite frankly, I don't see how you would stress-test it. A 1kHz timer > interrupt with a 10kHz NMI interrupt is still very infrequent > interrupts... i started an infinite loop that generated disk IRQs, and started a network test that generated network IRQs. The IRQ rate was roughly 10K/sec - this combined with the 10K/sec NMI rate should be an adequate mix. (I also made sure that it's really default_idle that is used.) Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 23:09 ` Linus Torvalds 2004-12-15 8:52 ` Ingo Molnar @ 2004-12-16 0:37 ` Alan Cox 2004-12-16 1:58 ` Linus Torvalds 2004-12-16 2:10 ` Zwane Mwaikambo 1 sibling, 2 replies; 44+ messages in thread From: Alan Cox @ 2004-12-16 0:37 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote: > On Tue, 14 Dec 2004, Ingo Molnar wrote: > Now that you mention it, I have this dim memory of the one-instruction > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU > literally doesn't test for async events following "sti". > > Or maybe that was "mov->ss". That one also has that strange "black hole" > for one instruction. The mov to ss one is a bit more magic than that however. If you write 3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and slow it plus mmap repeatedly) and run it you don't get a vastly long irq delay at least on intel, not tried the others. Alan ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 0:37 ` Alan Cox @ 2004-12-16 1:58 ` Linus Torvalds 2004-12-16 14:51 ` Ingo Molnar 2004-12-16 2:10 ` Zwane Mwaikambo 1 sibling, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2004-12-16 1:58 UTC (permalink / raw) To: Alan Cox Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Thu, 16 Dec 2004, Alan Cox wrote: > On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote: > > On Tue, 14 Dec 2004, Ingo Molnar wrote: > > Now that you mention it, I have this dim memory of the one-instruction > > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU > > literally doesn't test for async events following "sti". > > > > Or maybe that was "mov->ss". That one also has that strange "black hole" > > for one instruction. > > The mov to ss one is a bit more magic than that however. If you write > 3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and > slow it plus mmap repeatedly) and run it you don't get a vastly long irq > delay at least on intel, not tried the others. The irq window should actually be open every alternate instruction, I think. Although it's not actually architected, and I thought that there was some errata for some CPU about this.. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 1:58 ` Linus Torvalds @ 2004-12-16 14:51 ` Ingo Molnar 2004-12-16 15:08 ` Maciej W. Rozycki 2004-12-16 15:54 ` Linus Torvalds 0 siblings, 2 replies; 44+ messages in thread From: Ingo Molnar @ 2004-12-16 14:51 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen * Linus Torvalds <torvalds@osdl.org> wrote: > The irq window should actually be open every alternate instruction, I > think. Although it's not actually architected, and I thought that > there was some errata for some CPU about this.. i have generated an instruction-granularity profile of kernel code executing the following sequence, driven by the NMI watchdog interrupt: asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; "); asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; "); asm ("cli; cli; sti; cli; sti; cli; sti; cli; sti; cli; sti; "); the first CLI is done twice, to prove that the NMI profiling works and that the kernel can be interrupted in those places. Then i called this kernel code in a loop. Here's the result: c0125ee9: 1529 fa cli ^---------------------------------- # of profiler hits c0125eea: 507 fb sti c0125eeb: 0 fa cli c0125eec: 3719 fb sti c0125eed: 0 fa cli c0125eee: 1579 fb sti c0125eef: 0 fa cli c0125ef0: 3317 fb sti c0125ef1: 0 fa cli c0125ef2: 3030 fb sti c0125ef3: 0 fa cli c0125ef4: 2497 fa cli c0125ef5: 1055 fb sti c0125ef6: 0 fa cli c0125ef7: 4674 fb sti c0125ef8: 0 fa cli c0125ef9: 3827 fb sti c0125efa: 0 fa cli c0125efb: 1622 fb sti c0125efc: 0 fa cli c0125efd: 3155 fb sti c0125efe: 0 fa cli c0125eff: 1273 fa cli c0125f00: 512 fb sti c0125f01: 0 fa cli c0125f02: 1312 fb sti c0125f03: 0 fa cli c0125f04: 1426 fb sti c0125f05: 0 fa cli c0125f06: 1507 fb sti c0125f07: 0 fa cli c0125f08: 2720 fb sti c0125f09: 0 fa cli c0125f0a: 2469 fa cli c0125f0b: 787 fb sti c0125f0c: 0 fa cli c0125f0d: 2085 fb sti c0125f0e: 0 fa cli the 'cli' is always a 'black hole' to the NMI, while the second of two consecutive cli's are not. i also played a bit with the %ss instructions, and combined them with the cli/sti instructions and other instructions in various ways, and with a bit of experimenting found the following, somewhat surprising results: c0125f33: 1016 66 8c d0 mov %ss,%ax c0125f36: 6626 8e d0 mov %eax,%ss c0125f38: 34715 8e d0 mov %eax,%ss c0125f3a: 14682 8e d0 mov %eax,%ss c0125f3c: 4521 8e d0 mov %eax,%ss c0125f3e: 7564 8e d0 mov %eax,%ss c0125f40: 3861 66 8e d0 mov %ax,%ss c0125f43: 0 66 8c d1 mov %ss,%cx c0125f46: 1061 66 8c da mov %ds,%dx c0125f49: 7660 8e d1 mov %ecx,%ss c0125f4b: 11322 17 pop %ss c0125f4c: 0 fb sti c0125f4d: 8935 8e d1 mov %ecx,%ss c0125f4f: 0 fa cli c0125f50: 2198 66 8c d1 mov %ss,%cx c0125f53: 735 66 8c da mov %ds,%dx c0125f56: 0 8e da mov %edx,%ds c0125f58: 6400 8e d0 mov %eax,%ss c0125f5a: 3062 8e d0 mov %eax,%ss c0125f5c: 3552 8e d0 mov %eax,%ss c0125f5e: 4818 8e d0 mov %eax,%ss c0125f60: 0 fb sti c0125f61: 0 66 8c da mov %ds,%dx c0125f64: 17788 8e d0 mov %eax,%ss c0125f66: 64694 8e d0 mov %eax,%ss c0125f68: 12837 8e d0 mov %eax,%ss c0125f6a: 9859 8e d0 mov %eax,%ss c0125f6c: 0 fb sti c0125f6d: 74506 8e d0 mov %eax,%ss c0125f6f: 0 fb sti c0125f70: 8589 fa cli c0125f71: 10248 8e d0 mov %eax,%ss c0125f73: 3825 8e d0 mov %eax,%ss c0125f75: 4903 8e d0 mov %eax,%ss c0125f77: 71134 8e d0 mov %eax,%ss c0125f79: 0 fb sti c0125f7a: 0 fa cli c0125f7b: 7461 8e d0 mov %eax,%ss c0125f7d: 0 66 8c d0 mov %ss,%ax c0125f80: 39387 8e d0 mov %eax,%ss c0125f82: 0 fa cli c0125f83: 41484 8e d0 mov %eax,%ss c0125f85: 0 fa cli c0125f86: 4490 8e d0 mov %eax,%ss c0125f88: 0 fa cli c0125f89: 6024 8e d0 mov %eax,%ss c0125f8b: 15454 8e d0 mov %eax,%ss c0125f8d: 0 fb sti c0125f8e: 0 fb sti c0125f8f: 115104 fb sti c0125f90: 39061 fb sti it shows a number of interesting effects: - "mov %eax,%ss" followed by the _same_ instruction cancels the black-hole. This i suspect is done to prevent the lockup in vm86 mode. - an %ss black-hole instruction followed by 'sti' cancels sti's black-hole. This is unlikely to occur in real kernel code, but we might want to add a 'nop' in front of safe_halt()'s sti, to make sure the black-hole takes effect. - in one case a two-instruction blackhole was created - but this might be some prefetch effect. i played around with the instructions a bit to manufacture combinations that enlengthen the black-hole but failed :) This was on an Athlon64. Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 14:51 ` Ingo Molnar @ 2004-12-16 15:08 ` Maciej W. Rozycki 2004-12-16 15:11 ` Ingo Molnar 2004-12-16 15:54 ` Linus Torvalds 1 sibling, 1 reply; 44+ messages in thread From: Maciej W. Rozycki @ 2004-12-16 15:08 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Thu, 16 Dec 2004, Ingo Molnar wrote: > c0125ee9: 1529 fa cli > ^---------------------------------- # of profiler hits > c0125eea: 507 fb sti > c0125eeb: 0 fa cli > c0125eec: 3719 fb sti > c0125eed: 0 fa cli > c0125eee: 1579 fb sti > c0125eef: 0 fa cli > c0125ef0: 3317 fb sti > c0125ef1: 0 fa cli > c0125ef2: 3030 fb sti > c0125ef3: 0 fa cli > c0125ef4: 2497 fa cli > c0125ef5: 1055 fb sti > c0125ef6: 0 fa cli [...] > the 'cli' is always a 'black hole' to the NMI, while the second of two > consecutive cli's are not. It looks like the 'sti' is actually the black hole -- remember interrupts are traps, that is they are probed for and taken after instruction execution. Maciej ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 15:08 ` Maciej W. Rozycki @ 2004-12-16 15:11 ` Ingo Molnar 2004-12-16 15:42 ` Maciej W. Rozycki 0 siblings, 1 reply; 44+ messages in thread From: Ingo Molnar @ 2004-12-16 15:11 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen * Maciej W. Rozycki <macro@linux-mips.org> wrote: > On Thu, 16 Dec 2004, Ingo Molnar wrote: > > > c0125ee9: 1529 fa cli > > ^---------------------------------- # of profiler hits > > c0125eea: 507 fb sti > > c0125eeb: 0 fa cli > > c0125eec: 3719 fb sti > > c0125eed: 0 fa cli > > c0125eee: 1579 fb sti > > c0125eef: 0 fa cli > > c0125ef0: 3317 fb sti > > c0125ef1: 0 fa cli > > c0125ef2: 3030 fb sti > > c0125ef3: 0 fa cli > > c0125ef4: 2497 fa cli > > c0125ef5: 1055 fb sti > > c0125ef6: 0 fa cli > [...] > > the 'cli' is always a 'black hole' to the NMI, while the second of two > > consecutive cli's are not. > > It looks like the 'sti' is actually the black hole -- remember > interrupts are traps, that is they are probed for and taken after > instruction execution. The 'sti' "shadows" the cli, i.e. we'll never get an interrupt that gets inbetween 'sti;cli'. I.e. sti is the black-hole generator, and 'cli' is in the black hole. In that sense the 'cli' is in a black hole to the NMI: the NMI will never see cli as the 'next to be executed' instruction. Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 15:11 ` Ingo Molnar @ 2004-12-16 15:42 ` Maciej W. Rozycki 0 siblings, 0 replies; 44+ messages in thread From: Maciej W. Rozycki @ 2004-12-16 15:42 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Thu, 16 Dec 2004, Ingo Molnar wrote: > The 'sti' "shadows" the cli, i.e. we'll never get an interrupt that gets > inbetween 'sti;cli'. I.e. sti is the black-hole generator, and 'cli' is > in the black hole. In that sense the 'cli' is in a black hole to the > NMI: the NMI will never see cli as the 'next to be executed' > instruction. That's what I meant indeed, but I'd like to emphasise, for readers to be aware, the black hole is not tied to the 'cli' instruction itself in any way. The black-holed instruction needs not be a 'cli' -- it can be an arbitrary one except from ones creating black holes as you've observed with your test. Maciej ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 14:51 ` Ingo Molnar 2004-12-16 15:08 ` Maciej W. Rozycki @ 2004-12-16 15:54 ` Linus Torvalds 1 sibling, 0 replies; 44+ messages in thread From: Linus Torvalds @ 2004-12-16 15:54 UTC (permalink / raw) To: Ingo Molnar Cc: Alan Cox, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Thu, 16 Dec 2004, Ingo Molnar wrote: > > i also played a bit with the %ss instructions, and combined them with > the cli/sti instructions and other instructions in various ways, and > with a bit of experimenting found the following, somewhat surprising > results: > > [ snip ] > > it shows a number of interesting effects: > > - "mov %eax,%ss" followed by the _same_ instruction cancels the > black-hole. This i suspect is done to prevent the lockup in vm86 > mode. I don't think it's the "same instruction". Looking at the pattern, I think that a "mov->ss" always checks interrupts _before_ it executes, and never checks interrupts _after_ it executes. So I think the pattern is (for your athlon64): - regular instructions check interrupts before they execute, _except_ if the "dontcheck" flag was set. They clear "dontcheck" after execution. - "mov->ss" always checks interrupts before it executes, regardless of "dontcheck". It always sets "dontcheck". - "sti" sets "dontcheck" if interrupts were disabled before. So you can get two-instruction holes by doing the sequence /* interrupts disabled */ mov->ss sti /* any instruction except cli/mov->ss */ but no other combination (series of "mov->ss" will always check _before_ each "mov->ss", and series of "sti" will obviously only have interrupts disabled for the _first_ sti). And I suspect this is very much micro-architecture-dependent, although the Athlon64 rules seem very simple and straightforward. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 0:37 ` Alan Cox 2004-12-16 1:58 ` Linus Torvalds @ 2004-12-16 2:10 ` Zwane Mwaikambo 2004-12-16 13:26 ` Alan Cox 1 sibling, 1 reply; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-16 2:10 UTC (permalink / raw) To: Alan Cox Cc: Linus Torvalds, Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Thu, 16 Dec 2004, Alan Cox wrote: > On Maw, 2004-12-14 at 23:09, Linus Torvalds wrote: > > On Tue, 14 Dec 2004, Ingo Molnar wrote: > > Now that you mention it, I have this dim memory of the one-instruction > > "sti-shadow" actually disabling NMI's (and debug traps) too. The CPU > > literally doesn't test for async events following "sti". > > > > Or maybe that was "mov->ss". That one also has that strange "black hole" > > for one instruction. > > The mov to ss one is a bit more magic than that however. If you write > 3Gb of mov->ss into memory (ie about 64 pages to thrash the cache and > slow it plus mmap repeatedly) and run it you don't get a vastly long irq > delay at least on intel, not tried the others. Might this be because you can't rely on interrupt suppression for back to back suppressing instructions? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-16 2:10 ` Zwane Mwaikambo @ 2004-12-16 13:26 ` Alan Cox 0 siblings, 0 replies; 44+ messages in thread From: Alan Cox @ 2004-12-16 13:26 UTC (permalink / raw) To: Zwane Mwaikambo Cc: Linus Torvalds, Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Iau, 2004-12-16 at 02:10, Zwane Mwaikambo wrote: > Might this be because you can't rely on interrupt suppression for back to > back suppressing instructions? The documentation seems to have little to say on this. I've also not tried things like interleaved mov->ss, sti to see how the interlocking is done. It would make sense given the original 8086 reason was to allow ss/sp to be loaded cleanly. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 22:47 ` Ingo Molnar 2004-12-14 23:09 ` Linus Torvalds @ 2004-12-14 23:41 ` Andrea Arcangeli 1 sibling, 0 replies; 44+ messages in thread From: Andrea Arcangeli @ 2004-12-14 23:41 UTC (permalink / raw) To: Ingo Molnar Cc: Lee Revell, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds, Andi Kleen On Tue, Dec 14, 2004 at 11:47:06PM +0100, Ingo Molnar wrote: > find the correct patch below. I've tested it with an NMI watchdog > frequency artificially increased to 10 KHz, and i've instrumented the Nice test, it'd be nice to trigger it in real life. on the lines of the 64k movl ss, I wonder if we could create an huge piece of memory like this: new_htl: cli sti htl cli sti htl [..] jmp original_hlt and to call new_htl from original_hlt instead of sti;hlt. A dozen megs of the above should boost the probability of getting interrupted in "hlt" quite a bit. However even if the nmi can execute on top of the "hlt" instruction, it doesn't necessairly mean the next pending irq will execute before executing 'hlt' too, so it'd need a bit more of instrumentation to as well track down the race as happening (it's not enough to see the branch in the nmi handler to be taken). The additional instrumentation should be quite easy though, just copying the same nmi code to the irq handler should do the trick. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 22:23 ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar 2004-12-14 22:47 ` Ingo Molnar @ 2004-12-14 23:00 ` Linus Torvalds 2004-12-15 5:04 ` Andi Kleen 2004-12-15 6:27 ` Avi Kivity 2 siblings, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2004-12-14 23:00 UTC (permalink / raw) To: Ingo Molnar Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen On Tue, 14 Dec 2004, Ingo Molnar wrote: > > indeed, there could be a connection, and it's certainly a fun race. The > proper fix is Manfred's suggestion: check whether the EIP is a kernel > text address, and if yes, whether it's a HLT instruction - and if yes > then increase EIP by 1. You do it the wrong way, though. This is not safe: if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4) does _entirely_ the wrong thing if CS is not the kernel CS. It can trigger with a regular use CS if you were to run the 4G:4G patches, but more realistically, I think you can make ii trigger even with a standard kernel by creating a local code segment in your LDT, and then trying to confuse the kernel that way. Now, as long as the _only_ thing it does is increment the eip, the worst that can happen is that it screws over the user program that must have worked at this a bit, but the basic point is that you shouldn't do this. In _theory_ you could confuse a real program that wasn't doing anything bad. Checking for kernel CS also requires checking that it's not vm86 mode, btw. So that's not just a "regs->xcs & 0xffff == __KERNEL_CS" either. But something like static inline int kernel_mode(struct pt_regs *regs) { return !((regs->eflags & VM_MASK) | (regs->xcs & 3)); } should DTRT. Can you pls double-check my thinking, and test? Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 23:00 ` Linus Torvalds @ 2004-12-15 5:04 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2004-12-15 5:04 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Andi Kleen > But something like > > static inline int kernel_mode(struct pt_regs *regs) > { > return !((regs->eflags & VM_MASK) | (regs->xcs & 3)); > } > > should DTRT. > > Can you pls double-check my thinking, and test? Reasoning looks correct to me. -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-14 22:23 ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar 2004-12-14 22:47 ` Ingo Molnar 2004-12-14 23:00 ` Linus Torvalds @ 2004-12-15 6:27 ` Avi Kivity 2004-12-15 8:51 ` Ingo Molnar 2 siblings, 1 reply; 44+ messages in thread From: Avi Kivity @ 2004-12-15 6:27 UTC (permalink / raw) To: Ingo Molnar Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds, Andi Kleen Ingo Molnar wrote: >+ if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4) > > shouldn't that cast be (unsigned char *), otherwise the test will always fail? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs 2004-12-15 6:27 ` Avi Kivity @ 2004-12-15 8:51 ` Ingo Molnar 0 siblings, 0 replies; 44+ messages in thread From: Ingo Molnar @ 2004-12-15 8:51 UTC (permalink / raw) To: Avi Kivity Cc: Lee Revell, Andrea Arcangeli, Manfred Spraul, Zwane Mwaikambo, George Anzinger, dipankar, ganzinger, lkml, Andrew Morton, Linus Torvalds, Andi Kleen * Avi Kivity <avi@argo.co.il> wrote: > Ingo Molnar wrote: > > >+ if (__kernel_text_address(regs->eip) && *(char *)regs->eip == 0xf4) > > > > > shouldn't that cast be (unsigned char *), otherwise the test will > always fail? yes, i fixed this in the second patch. (the compiler warned about it too) Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 10:22 ` Manfred Spraul 2004-12-12 12:15 ` Andrea Arcangeli @ 2004-12-12 16:51 ` George Anzinger 2004-12-12 22:40 ` Manfred Spraul 1 sibling, 1 reply; 44+ messages in thread From: George Anzinger @ 2004-12-12 16:51 UTC (permalink / raw) To: Manfred Spraul Cc: Andrea Arcangeli, Zwane Mwaikambo, Lee Revell, dipankar, ganzinger, lkml Manfred Spraul wrote: > Andrea Arcangeli wrote: > >> On Sun, Dec 12, 2004 at 09:59:00AM +0100, Manfred Spraul wrote: >> >> >>> It means that our NMI irq return path should check if it points to a >>> hlt instruction and if yes, then increase the saved EIP by one before >>> doing the iretd, right? >>> >> >> >> I don't think we'll ever post any event through nmi, so it doesn't >> matter. We only care to be waken by real irqs, not nmi/smi. Idle loop is >> fine to ignore the actions of the nmi handlers and to hang into the >> "hlt". >> >> > No, You misunderstood the problem: > > sti > ** NMI handler > ** normal interrupt arrives, is queued by the cpu > ** irqd from NMI handler > ** cpu notices the normal interrupt, handles it. > ** normal interrupt does a wakeup, schedules a tasklet, whatever I think you are forgetting that the system does the full context switch from the interrupt handler (well, actually from entry.S) and does not do the irqd until it is time to go back to the idle thread (i.e. there is nothing left to do), so.. > ** irqd from normal interupt > hlt << cpu sleeps. What we loose here is that idle does not go around its little loop again. If an interrupt becomes pending on the way to the hlt, i.e. while entry.S has interrupts masked and is doing the irqd, it will be handled prior to the hlt so we could loose several of these idle loop spins, until no interrupt is pending allowing the hlt to be executed. On the next interrupt/irqd the hlt will exit. So what is lost is one or more spins round the idle loop. The "normal" idle loop just looks at the need_resched flag and goes right back to the hlt, however, idle, it self, never sets this flag, only interrupt code can set it at this point, and the interrupt exit takes action to clear it so I don't see it every being found set in the idle loop (I suppose one could do a test to see if it is ever found set here), so, in theory, the net effect should be nill. Did I miss something? -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 16:51 ` RCU question George Anzinger @ 2004-12-12 22:40 ` Manfred Spraul 2004-12-13 5:22 ` George Anzinger 0 siblings, 1 reply; 44+ messages in thread From: Manfred Spraul @ 2004-12-12 22:40 UTC (permalink / raw) To: george Cc: Andrea Arcangeli, Zwane Mwaikambo, Lee Revell, dipankar, ganzinger, lkml George Anzinger wrote: > > The "normal" idle loop just looks at the need_resched flag and goes > right back to the hlt, That's the problem: If a the tasklet does a wakeup then the reschedule is delayed until the next interrupt. Testing need_resched and executing hlt must be atomic, but it isn't - NMIs break the atomicity. Not a big deal, except if someone implements a tickless kernel. I think we can ignore it for now [or was the thread started by someone who want's to disable the hardware timer when the system is really idle?] -- Manfred ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 22:40 ` Manfred Spraul @ 2004-12-13 5:22 ` George Anzinger 0 siblings, 0 replies; 44+ messages in thread From: George Anzinger @ 2004-12-13 5:22 UTC (permalink / raw) To: Manfred Spraul Cc: Andrea Arcangeli, Zwane Mwaikambo, Lee Revell, dipankar, ganzinger, lkml Manfred Spraul wrote: > George Anzinger wrote: > >> >> The "normal" idle loop just looks at the need_resched flag and goes >> right back to the hlt, > > > That's the problem: If a the tasklet does a wakeup then the reschedule > is delayed until the next interrupt. Not so. On the interrupt that runs the tasklet, on the way out via entry.S, the need_resched flag is checked and acted on. Thus the task switch is done prio to getting back to the hlt. > Testing need_resched and executing > hlt must be atomic, but it isn't - NMIs break the atomicity. Actually this is not required, especially if preemption is turned on. > Not a big deal, except if someone implements a tickless kernel. Well, it is not tickless, but VST that I am working on :). The notion is to turn off the ticks when in idle and there are not time events in the list. I think > we can ignore it for now [or was the thread started by someone who > want's to disable the hardware timer when the system is really idle?] Yep, me! But still, I keep a timer around to exit, it is just way more than a tick later (depending on what the next entry in the time list needs). > -- George Anzinger george@mvista.com High-res-timers: http://sourceforge.net/projects/high-res-timers/ ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: RCU question 2004-12-12 8:59 ` Manfred Spraul 2004-12-12 9:37 ` Andrea Arcangeli @ 2004-12-12 16:26 ` Zwane Mwaikambo 1 sibling, 0 replies; 44+ messages in thread From: Zwane Mwaikambo @ 2004-12-12 16:26 UTC (permalink / raw) To: Manfred Spraul; +Cc: George Anzinger, Lee Revell, dipankar, ganzinger, lkml On Sun, 12 Dec 2004, Manfred Spraul wrote: > Zwane Mwaikambo wrote: > > > "Intel processors don't suppress SMI or NMI after an STI instruction. Since > > the INTR suppresion is not preserved across an SMI or NMI handler, this may > > result in an INTR being serviced after the STI, which constitutes a > > violation of the INTR suppresion. > > > Interesting find. > It means that our NMI irq return path should check if it points to a hlt > instruction and if yes, then increase the saved EIP by one before doing the > iretd, right? Yeah that should do it, but then we also have to worry about SMIs, perhaps we could add similar logic to interrupt return path instead? ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch, 2.6.10-rc3] safe_hlt() & NMIs
@ 2004-12-17 23:35 Chuck Ebbert
0 siblings, 0 replies; 44+ messages in thread
From: Chuck Ebbert @ 2004-12-17 23:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel, Andi Kleen, Andrew Morton
On Tue, 14 Dec 2004 at 15:00:56 -0800 (PST) Linus Torvalds wrote:
> Checking for kernel CS also requires checking that it's not vm86 mode,
> btw. So that's not just a "regs->xcs & 0xffff == __KERNEL_CS" either.
>
> But something like
>
> static inline int kernel_mode(struct pt_regs *regs)
> {
> return !((regs->eflags & VM_MASK) | (regs->xcs & 3));
> }
>
> should DTRT.
>
> Can you pls double-check my thinking, and test?
There is already a user_mode() macro in asm-i386/ptrace.h but it's slow.
x86_64's macro is ugly but at least there's no logical-or in it.
Your i386 code is better, so how about applying this patch (boots/runs/is faster
on my tests):
Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
--- linux-2.6.9.1/include/asm-i386/ptrace.h 2004-10-19 15:28:18.000000000 -0400
+++ linux-2.6.9.2/include/asm-i386/ptrace.h 2004-12-17 16:59:39.956099664 -0500
@@ -55,7 +55,11 @@ struct pt_regs {
#define PTRACE_SET_THREAD_AREA 26
#ifdef __KERNEL__
-#define user_mode(regs) ((VM_MASK & (regs)->eflags) || (3 & (regs)->xcs))
+static inline int kernel_mode(struct pt_regs *regs)
+{
+ return !((3 & (regs)->xcs) | (VM_MASK & (regs)->eflags));
+}
+#define user_mode(regs) (!kernel_mode(regs))
#define instruction_pointer(regs) ((regs)->eip)
#if defined(CONFIG_SMP) && defined(CONFIG_FRAME_POINTER)
extern unsigned long profile_pc(struct pt_regs *regs);
--
Please take it as a sign of my infinite respect for you,
that I insist on you doing all the work.
-- Rusty Russell
^ permalink raw reply [flat|nested] 44+ messages in threadend of thread, other threads:[~2004-12-17 23:38 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2004-12-09 23:59 RCU question George Anzinger 2004-12-10 4:31 ` Dipankar Sarma 2004-12-10 19:42 ` George Anzinger 2004-12-10 20:40 ` Dipankar Sarma 2004-12-10 20:45 ` Lee Revell 2004-12-10 21:02 ` George Anzinger 2004-12-10 22:58 ` Zwane Mwaikambo 2004-12-11 2:22 ` George Anzinger 2004-12-11 2:45 ` Zwane Mwaikambo 2004-12-11 3:29 ` George Anzinger 2004-12-11 14:52 ` Zwane Mwaikambo 2004-12-11 16:32 ` Manfred Spraul 2004-12-11 16:52 ` George Anzinger 2004-12-12 2:53 ` Zwane Mwaikambo 2004-12-12 8:59 ` Manfred Spraul 2004-12-12 9:37 ` Andrea Arcangeli 2004-12-12 10:22 ` Manfred Spraul 2004-12-12 12:15 ` Andrea Arcangeli 2004-12-14 21:40 ` Lee Revell 2004-12-14 22:23 ` [patch, 2.6.10-rc3] safe_hlt() & NMIs Ingo Molnar 2004-12-14 22:47 ` Ingo Molnar 2004-12-14 23:09 ` Linus Torvalds 2004-12-15 8:52 ` Ingo Molnar 2004-12-15 15:44 ` Linus Torvalds 2004-12-15 16:35 ` Ingo Molnar 2004-12-16 0:37 ` Alan Cox 2004-12-16 1:58 ` Linus Torvalds 2004-12-16 14:51 ` Ingo Molnar 2004-12-16 15:08 ` Maciej W. Rozycki 2004-12-16 15:11 ` Ingo Molnar 2004-12-16 15:42 ` Maciej W. Rozycki 2004-12-16 15:54 ` Linus Torvalds 2004-12-16 2:10 ` Zwane Mwaikambo 2004-12-16 13:26 ` Alan Cox 2004-12-14 23:41 ` Andrea Arcangeli 2004-12-14 23:00 ` Linus Torvalds 2004-12-15 5:04 ` Andi Kleen 2004-12-15 6:27 ` Avi Kivity 2004-12-15 8:51 ` Ingo Molnar 2004-12-12 16:51 ` RCU question George Anzinger 2004-12-12 22:40 ` Manfred Spraul 2004-12-13 5:22 ` George Anzinger 2004-12-12 16:26 ` Zwane Mwaikambo -- strict thread matches above, loose matches on Subject: below -- 2004-12-17 23:35 [patch, 2.6.10-rc3] safe_hlt() & NMIs Chuck Ebbert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox