* Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking
@ 2015-07-17  1:53 Andy Lutomirski
  2015-07-17  4:29 ` Paul E. McKenney
  2015-07-18 13:00 ` Frederic Weisbecker
  0 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2015-07-17  1:53 UTC (permalink / raw)
  To: Sasha Levin, Frédéric Weisbecker, Paul McKenney, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel

For reasons that mystify me a bit, we currently track context tracking state separately from RCU's watching state. This results in strange artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we can nest exceptions inside the IRQ handler (an example would be wrmsr_safe failing), and, in -next, we splat a warning:

https://gist.github.com/sashalevin/a006a44989312f6835e7

I'm trying to make context tracking more exact, which will fix this issue (the particular splat that Sasha hit shouldn't be possible when I'm done), but I think it would be nice to unify all of this stuff. Would it be plausible for us to guarantee that RCU state is always in sync with context tracking state? If so, we could maybe simplify things and have fewer state variables.

Doing this for NMIs might be weird. Would it make sense to have a CONTEXT_NMI that's somehow valid even if the NMI happened while changing context tracking state?

Thoughts? As it stands, I think we might already be broken for real:

Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does copy_from_user_nmi, which can fault, causing do_page_fault to get called, which calls exception_enter(), which can't be a good thing.

RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.

Thoughts? As it stands, I need to do something because -tip and thus -next spew occasional warnings.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 15+ messages in thread
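The "Perf NMI hits during user_exit" scenario can be walked through with a small toy model. This is only a sketch: every `toy_*` name is hypothetical, and it assumes (rather than quotes) that exception_enter() saves the tracked state and exception_exit() restores it. It is not the kernel's context tracking code.

```c
/* Toy model only: hypothetical stand-ins for the context tracking state. */
enum toy_ctx { TOY_KERNEL, TOY_USER };

struct toy_cpu {
	enum toy_ctx tracked;   /* what context tracking currently believes */
	int user_exit_work;     /* times the user->kernel bookkeeping ran */
	int user_enter_work;    /* times the kernel->user bookkeeping ran */
};

/* Assumed behavior: remember the tracked state; leave "user" if needed. */
static enum toy_ctx toy_exception_enter(struct toy_cpu *cpu)
{
	enum toy_ctx prev = cpu->tracked;

	if (prev != TOY_KERNEL) {
		cpu->user_exit_work++;
		cpu->tracked = TOY_KERNEL;
	}
	return prev;
}

/* Assumed behavior: put back whatever state toy_exception_enter() saw. */
static void toy_exception_exit(struct toy_cpu *cpu, enum toy_ctx prev)
{
	if (prev != TOY_KERNEL) {
		cpu->user_enter_work++;
		cpu->tracked = prev;
	}
}

/* A fault taken from NMI context before user_exit has updated the state. */
static void toy_nmi_fault_during_user_exit(struct toy_cpu *cpu)
{
	enum toy_ctx prev = toy_exception_enter(cpu);

	/* ... the page fault would be handled here ... */
	toy_exception_exit(cpu, prev);
}

/* The interrupted user_exit finally completes. */
static void toy_user_exit_finish(struct toy_cpu *cpu)
{
	if (cpu->tracked == TOY_USER) {
		cpu->user_exit_work++;
		cpu->tracked = TOY_KERNEL;
	}
}
```

Starting from `tracked == TOY_USER` and calling `toy_nmi_fault_during_user_exit()` followed by `toy_user_exit_finish()`, the model runs the user->kernel bookkeeping twice and the kernel->user bookkeeping once, all for a single syscall entry and while the CPU never left kernel mode. That is the kind of fragility the message is pointing at.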
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 1:53 Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking Andy Lutomirski @ 2015-07-17 4:29 ` Paul E. McKenney 2015-07-17 4:49 ` Paul E. McKenney 2015-07-18 13:00 ` Frederic Weisbecker 1 sibling, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 4:29 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > For reasons that mystify me a bit, we currently track context tracking > state separately from rcu's watching state. This results in strange > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we > can nest exceptions inside the IRQ handler (an example would be > wrmsr_safe failing), and, in -next, we splat a warning: > > https://gist.github.com/sashalevin/a006a44989312f6835e7 > > I'm trying to make context tracking more exact, which will fix this > issue (the particular splat that Sasha hit shouldn't be possible when > I'm done), but I think it would be nice to unify all of this stuff. > Would it be plausible for us to guarantee that RCU state is always in > sync with context tracking state? If so, we could maybe simplify > things and have fewer state variables. A noble goal. Might even be possible, and maybe even advantageous. But it is usually easier to say than to do. RCU really does need to make some adjustments when the state changes, as do the other subsystems. It might or might not be possible to do the transitions atomically. And if the transitions are not atomic, there will still be weird code paths where (say) the processor is considered non-idle, but RCU doesn't realize it yet. Such a code path could not safely use rcu_read_lock(), so you still need RCU to be able to scream if someone tries it. 
Contrariwise, if there is a code path where the processor is considered idle, but RCU thinks it is non-idle, that code path can stall grace periods. (Yes, not a problem if the code path is short enough. At least if the underlying VCPU is making progress...)

Still, I cannot prove that it is impossible, and if it is possible, then as you say, there might well be benefits.

> Doing this for NMIs might be weird. Would it make sense to have a
> CONTEXT_NMI that's somehow valid even if the NMI happened while
> changing context tracking state?

Face it, NMIs are weird. ;-)

> Thoughts? As it stands, I think we might already be broken for real:
>
> Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does
> copy_from_user_nmi, which can fault, causing do_page_fault to get
> called, which calls exception_enter(), which can't be a good thing.
>
> RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile.

Actually, I see more cases where people forget irq_enter() than rcu_nmi_enter(). "We will just nip in quickly and do something without actually letting the irq system know. Oh, and we want some event tracing in that code path." Boom!

> Thoughts? As it stands, I need to do something because -tip and thus
> -next spew occasional warnings.

Tell me more?

							Thanx, Paul
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 4:29 ` Paul E. McKenney @ 2015-07-17 4:49 ` Paul E. McKenney 2015-07-17 18:59 ` Andy Lutomirski 0 siblings, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 4:49 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: > On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > > For reasons that mystify me a bit, we currently track context tracking > > state separately from rcu's watching state. This results in strange > > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we > > can nest exceptions inside the IRQ handler (an example would be > > wrmsr_safe failing), and, in -next, we splat a warning: > > > > https://gist.github.com/sashalevin/a006a44989312f6835e7 > > > > I'm trying to make context tracking more exact, which will fix this > > issue (the particular splat that Sasha hit shouldn't be possible when > > I'm done), but I think it would be nice to unify all of this stuff. > > Would it be plausible for us to guarantee that RCU state is always in > > sync with context tracking state? If so, we could maybe simplify > > things and have fewer state variables. > > A noble goal. Might even be possible, and maybe even advantageous. > > But it is usually easier to say than to do. RCU really does need to make > some adjustments when the state changes, as do the other subsystems. > It might or might not be possible to do the transitions atomically. > And if the transitions are not atomic, there will still be weird code > paths where (say) the processor is considered non-idle, but RCU doesn't > realize it yet. Such a code path could not safely use rcu_read_lock(), > so you still need RCU to be able to scream if someone tries it. 
> Contrariwise, if there is a code path where the processor is considered > idle, but RCU thinks it is non-idle, that code path can stall > grace periods. (Yes, not a problem if the code path is short enough. > At least if the underlying VCPU is making progres...) > > Still, I cannot prove that it is impossible, and if it is possible, > then as you say, there might well be benefits. > > > Doing this for NMIs might be weird. Would it make sense to have a > > CONTEXT_NMI that's somehow valid even if the NMI happened while > > changing context tracking state. > > Face it, NMIs are weird. ;-) > > > Thoughts? As it stands, I think we might already be broken for real: > > > > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does > > copy_from_user_nmi, which can fault, causing do_page_fault to get > > called, which calls exception_enter(), which can't be a good thing. > > > > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. > > Actually, I see more cases where people forget irq_enter() than > rcu_nmi_enter(). "We will just nip in quickly and do something without > actually letting the irq system know. Oh, and we want some event tracing > in that code path." Boom! > > > Thoughts? As it stands, I need to do something because -tip and thus > > -next spews occasional warnings. > > Tell me more? And for completeness, RCU also has the following requirements on the state-transition mechanism: 1. It must be possible to reliably sample some other CPU's state. This is an energy-efficiency requirement, as RCU is not normally permitted to wake up idle CPUs. Nor nohz CPUs, for that matter. 2. RCU must be able to track passage through idle and nohz states. In other words, if RCU samples at t=0 and finds that the CPU is executing (say) in kernel mode, and RCU samples again at t=10 and again finds that the CPU is executing in kernel mode, RCU needs to be able to determine whether or not that CPU passed through idle or nohz betweentimes. 3. 
In some configurations, RCU needs to be able to block entry into nohz state, both for idle and userspace.

Probably others as well...

							Thanx, Paul
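Requirements 1 and 2 above (sample another CPU without waking it, and tell whether it passed through idle between two samples) can be illustrated with a per-CPU transition counter. The sketch below is an assumption-laden toy, not RCU's actual implementation: the `cpu_watch` structure and all function names are hypothetical, and the convention "even means idle, odd means non-idle" is chosen here purely for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU record: the counter is bumped on every idle
 * entry and exit, so an even value means "idle", odd means "non-idle". */
struct cpu_watch {
	atomic_uint transitions;
};

/* Start out non-idle (odd). */
static void cpu_watch_init(struct cpu_watch *cw)
{
	atomic_init(&cw->transitions, 1);
}

/* Called by the CPU itself: odd -> even. */
static void cpu_enter_idle(struct cpu_watch *cw)
{
	atomic_fetch_add(&cw->transitions, 1);
}

/* Called by the CPU itself: even -> odd. */
static void cpu_exit_idle(struct cpu_watch *cw)
{
	atomic_fetch_add(&cw->transitions, 1);
}

/* Called from some other CPU: a plain load, so nobody is woken up. */
static unsigned int remote_snapshot(struct cpu_watch *cw)
{
	return atomic_load(&cw->transitions);
}

static bool snapshot_is_idle(unsigned int snap)
{
	return (snap & 1u) == 0;
}

/* True if the CPU was idle at the snapshot or has been idle since. */
static bool passed_through_idle(struct cpu_watch *cw, unsigned int snap)
{
	return snapshot_is_idle(snap) || atomic_load(&cw->transitions) != snap;
}
```

The point of the counter, as opposed to a plain "idle/non-idle" flag, is the t=0 / t=10 case from requirement 2: two samples that both say "kernel mode" still differ in value if the CPU went idle and came back in between.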
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 4:49 ` Paul E. McKenney @ 2015-07-17 18:59 ` Andy Lutomirski 2015-07-17 20:12 ` Paul E. McKenney 2015-07-18 13:12 ` Frederic Weisbecker 0 siblings, 2 replies; 15+ messages in thread From: Andy Lutomirski @ 2015-07-17 18:59 UTC (permalink / raw) To: Paul McKenney Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: >> > For reasons that mystify me a bit, we currently track context tracking >> > state separately from rcu's watching state. This results in strange >> > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we >> > can nest exceptions inside the IRQ handler (an example would be >> > wrmsr_safe failing), and, in -next, we splat a warning: >> > >> > https://gist.github.com/sashalevin/a006a44989312f6835e7 >> > >> > I'm trying to make context tracking more exact, which will fix this >> > issue (the particular splat that Sasha hit shouldn't be possible when >> > I'm done), but I think it would be nice to unify all of this stuff. >> > Would it be plausible for us to guarantee that RCU state is always in >> > sync with context tracking state? If so, we could maybe simplify >> > things and have fewer state variables. >> >> A noble goal. Might even be possible, and maybe even advantageous. >> >> But it is usually easier to say than to do. RCU really does need to make >> some adjustments when the state changes, as do the other subsystems. >> It might or might not be possible to do the transitions atomically. >> And if the transitions are not atomic, there will still be weird code >> paths where (say) the processor is considered non-idle, but RCU doesn't >> realize it yet. 
Such a code path could not safely use rcu_read_lock(), >> so you still need RCU to be able to scream if someone tries it. >> Contrariwise, if there is a code path where the processor is considered >> idle, but RCU thinks it is non-idle, that code path can stall >> grace periods. (Yes, not a problem if the code path is short enough. >> At least if the underlying VCPU is making progres...) >> >> Still, I cannot prove that it is impossible, and if it is possible, >> then as you say, there might well be benefits. >> >> > Doing this for NMIs might be weird. Would it make sense to have a >> > CONTEXT_NMI that's somehow valid even if the NMI happened while >> > changing context tracking state. >> >> Face it, NMIs are weird. ;-) >> >> > Thoughts? As it stands, I think we might already be broken for real: >> > >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does >> > copy_from_user_nmi, which can fault, causing do_page_fault to get >> > called, which calls exception_enter(), which can't be a good thing. >> > >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. >> >> Actually, I see more cases where people forget irq_enter() than >> rcu_nmi_enter(). "We will just nip in quickly and do something without >> actually letting the irq system know. Oh, and we want some event tracing >> in that code path." Boom! >> >> > Thoughts? As it stands, I need to do something because -tip and thus >> > -next spews occasional warnings. >> >> Tell me more? > > And for completeness, RCU also has the following requirements on the > state-transition mechanism: > > 1. It must be possible to reliably sample some other CPU's state. > This is an energy-efficiency requirement, as RCU is not normally > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter. NOHZ needs this for vtime accounting, too. I think Rik might be thinking about this. Maybe the underlying state could be shared? > > 2. 
RCU must be able to track passage through idle and nohz states.
> In other words, if RCU samples at t=0 and finds that the CPU
> is executing (say) in kernel mode, and RCU samples again at
> t=10 and again finds that the CPU is executing in kernel mode,
> RCU needs to be able to determine whether or not that CPU passed
> through idle or nohz betweentimes.

And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the context tracking stuff notifies RCU. The thing I'm less than happy with is that we can currently be CONTEXT_USER but still rcu-awake. This is manageable, but it seems messy.

>
> 3. In some configurations, RCU needs to be able to block entry into
> nohz state, both for idle and userspace.
>

Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, although the tick would have to stay on.

Grumble.

--Andy
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 18:59 ` Andy Lutomirski @ 2015-07-17 20:12 ` Paul E. McKenney 2015-07-17 20:32 ` Andy Lutomirski 2015-07-18 13:12 ` Frederic Weisbecker 1 sibling, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 20:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote: > On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: > >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > >> > For reasons that mystify me a bit, we currently track context tracking > >> > state separately from rcu's watching state. This results in strange > >> > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we > >> > can nest exceptions inside the IRQ handler (an example would be > >> > wrmsr_safe failing), and, in -next, we splat a warning: > >> > > >> > https://gist.github.com/sashalevin/a006a44989312f6835e7 > >> > > >> > I'm trying to make context tracking more exact, which will fix this > >> > issue (the particular splat that Sasha hit shouldn't be possible when > >> > I'm done), but I think it would be nice to unify all of this stuff. > >> > Would it be plausible for us to guarantee that RCU state is always in > >> > sync with context tracking state? If so, we could maybe simplify > >> > things and have fewer state variables. > >> > >> A noble goal. Might even be possible, and maybe even advantageous. > >> > >> But it is usually easier to say than to do. RCU really does need to make > >> some adjustments when the state changes, as do the other subsystems. > >> It might or might not be possible to do the transitions atomically. 
> >> And if the transitions are not atomic, there will still be weird code > >> paths where (say) the processor is considered non-idle, but RCU doesn't > >> realize it yet. Such a code path could not safely use rcu_read_lock(), > >> so you still need RCU to be able to scream if someone tries it. > >> Contrariwise, if there is a code path where the processor is considered > >> idle, but RCU thinks it is non-idle, that code path can stall > >> grace periods. (Yes, not a problem if the code path is short enough. > >> At least if the underlying VCPU is making progres...) > >> > >> Still, I cannot prove that it is impossible, and if it is possible, > >> then as you say, there might well be benefits. > >> > >> > Doing this for NMIs might be weird. Would it make sense to have a > >> > CONTEXT_NMI that's somehow valid even if the NMI happened while > >> > changing context tracking state. > >> > >> Face it, NMIs are weird. ;-) > >> > >> > Thoughts? As it stands, I think we might already be broken for real: > >> > > >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does > >> > copy_from_user_nmi, which can fault, causing do_page_fault to get > >> > called, which calls exception_enter(), which can't be a good thing. > >> > > >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. > >> > >> Actually, I see more cases where people forget irq_enter() than > >> rcu_nmi_enter(). "We will just nip in quickly and do something without > >> actually letting the irq system know. Oh, and we want some event tracing > >> in that code path." Boom! > >> > >> > Thoughts? As it stands, I need to do something because -tip and thus > >> > -next spews occasional warnings. > >> > >> Tell me more? > > > > And for completeness, RCU also has the following requirements on the > > state-transition mechanism: > > > > 1. It must be possible to reliably sample some other CPU's state. 
> > This is an energy-efficiency requirement, as RCU is not normally
> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter.
>
> NOHZ needs this for vtime accounting, too. I think Rik might be
> thinking about this. Maybe the underlying state could be shared?

From what I understand, what Rik is looking at is accounting information, which is a different type of state. And a type of state where some approximation is just fine. Try that with RCU, and you will approximate yourself into a segfault.

> > 2. RCU must be able to track passage through idle and nohz states.
> > In other words, if RCU samples at t=0 and finds that the CPU
> > is executing (say) in kernel mode, and RCU samples again at
> > t=10 and again finds that the CPU is executing in kernel mode,
> > RCU needs to be able to determine whether or not that CPU passed
> > through idle or nohz betweentimes.
>
> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the
> context tracking stuff notifies RCU. The thing I'm less than happy
> with is that we can currently be CONTEXT_USER but still rcu-awake.
> This is manageable, but it seems messy.

Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making context tracking unconditional? (The tinification guys might have some opinions on this.)

> > 3. In some configurations, RCU needs to be able to block entry into
> > nohz state, both for idle and userspace.
>
> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake,
> although the tick would have to stay on.

Right, there are situations where RCU needs a given CPU to keep the tick going, for example, when there are RCU callbacks queued on that CPU. Failing to keep the tick going could result in a system hang, because that callback might never be invoked. Of course, something or another will normally eventually disturb the CPU, but the resulting huge delay would not be good.
And on deep embedded systems, it is quite possible that the CPU would go for a good long time without being disturbed. (This is not just a theoretical possibility, and I have the scars to prove it.)

And there is this one as well:

4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace context differently than idle context, and still needs to be able to take two samples and determine if the CPU ever went idle (and only idle, not userspace) betweentimes.

> Grumble.

Welcome to my world! ;-)

							Thanx, Paul
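The keep-the-tick rule from requirement 3 (a CPU with queued RCU callbacks must not stop its tick) can be written down as a simple gate. This is a minimal sketch under stated assumptions: `toy_tick_cpu`, `toy_rcu_needs_tick()` and `toy_try_stop_tick()` are hypothetical names invented for the illustration, and the real decision involves more state than a single callback count.

```c
#include <stdbool.h>

/* Hypothetical per-CPU view, reduced to what the rule needs. */
struct toy_tick_cpu {
	int queued_callbacks;   /* RCU callbacks waiting on this CPU */
	bool tick_running;      /* is the periodic tick still enabled? */
};

/* RCU's veto: as long as callbacks are queued, something must keep
 * poking this CPU so that they eventually get invoked. */
static bool toy_rcu_needs_tick(const struct toy_tick_cpu *cpu)
{
	return cpu->queued_callbacks > 0;
}

/* Called on the way into nohz idle or nohz userspace.
 * Returns true only if the tick was actually stopped. */
static bool toy_try_stop_tick(struct toy_tick_cpu *cpu)
{
	if (toy_rcu_needs_tick(cpu))
		return false;   /* stay CONTEXT_USER/idle, but keep ticking */

	cpu->tick_running = false;
	return true;
}
```

In this model a CPU can be in userspace with RCU still awake, which is exactly the "CONTEXT_USER but tick stays on" combination discussed above.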
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 20:12 ` Paul E. McKenney @ 2015-07-17 20:32 ` Andy Lutomirski 2015-07-17 21:19 ` Paul E. McKenney 0 siblings, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-07-17 20:32 UTC (permalink / raw) To: Paul McKenney Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 1:12 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote: >> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: >> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: >> >> > For reasons that mystify me a bit, we currently track context tracking >> >> > state separately from rcu's watching state. This results in strange >> >> > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we >> >> > can nest exceptions inside the IRQ handler (an example would be >> >> > wrmsr_safe failing), and, in -next, we splat a warning: >> >> > >> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7 >> >> > >> >> > I'm trying to make context tracking more exact, which will fix this >> >> > issue (the particular splat that Sasha hit shouldn't be possible when >> >> > I'm done), but I think it would be nice to unify all of this stuff. >> >> > Would it be plausible for us to guarantee that RCU state is always in >> >> > sync with context tracking state? If so, we could maybe simplify >> >> > things and have fewer state variables. >> >> >> >> A noble goal. Might even be possible, and maybe even advantageous. >> >> >> >> But it is usually easier to say than to do. RCU really does need to make >> >> some adjustments when the state changes, as do the other subsystems. >> >> It might or might not be possible to do the transitions atomically. 
>> >> And if the transitions are not atomic, there will still be weird code >> >> paths where (say) the processor is considered non-idle, but RCU doesn't >> >> realize it yet. Such a code path could not safely use rcu_read_lock(), >> >> so you still need RCU to be able to scream if someone tries it. >> >> Contrariwise, if there is a code path where the processor is considered >> >> idle, but RCU thinks it is non-idle, that code path can stall >> >> grace periods. (Yes, not a problem if the code path is short enough. >> >> At least if the underlying VCPU is making progres...) >> >> >> >> Still, I cannot prove that it is impossible, and if it is possible, >> >> then as you say, there might well be benefits. >> >> >> >> > Doing this for NMIs might be weird. Would it make sense to have a >> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while >> >> > changing context tracking state. >> >> >> >> Face it, NMIs are weird. ;-) >> >> >> >> > Thoughts? As it stands, I think we might already be broken for real: >> >> > >> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does >> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get >> >> > called, which calls exception_enter(), which can't be a good thing. >> >> > >> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. >> >> >> >> Actually, I see more cases where people forget irq_enter() than >> >> rcu_nmi_enter(). "We will just nip in quickly and do something without >> >> actually letting the irq system know. Oh, and we want some event tracing >> >> in that code path." Boom! >> >> >> >> > Thoughts? As it stands, I need to do something because -tip and thus >> >> > -next spews occasional warnings. >> >> >> >> Tell me more? >> > >> > And for completeness, RCU also has the following requirements on the >> > state-transition mechanism: >> > >> > 1. It must be possible to reliably sample some other CPU's state. 
>> > This is an energy-efficiency requirement, as RCU is not normally >> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter. >> >> NOHZ needs this for vtime accounting, too. I think Rik might be >> thinking about this. Maybe the underlying state could be shared? > > From what I understand, what Rik is looking at is accounting information, > which is a different type of state. And a type of state where some > approximation is just fine. Try that with RCU, and you will approximate > yourself into a segfault. True. But context tracking wouldn't object to being exact. And I think we need context tracking to treat user mode as quiescent, so they're at least related. > >> > 2. RCU must be able to track passage through idle and nohz states. >> > In other words, if RCU samples at t=0 and finds that the CPU >> > is executing (say) in kernel mode, and RCU samples again at >> > t=10 and again finds that the CPU is executing in kernel mode, >> > RCU needs to be able to determine whether or not that CPU passed >> > through idle or nohz betweentimes. >> >> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the >> context tracking stuff notifies RCU. The think I'm less than happy >> with is that we can currently be CONTEXT_USER but still rcu-awake. >> This is manageable, but it seems messy. > > Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context > tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making > context tracking unconditional? (The tinification guys might have some > opinions on this.) Without context tracking, user mode is not RCU idle, right? Instead we have timer ticks. We could get away with a very minimal context tracking implementation that just tracked CONTEXT_IDLE and CONTEXT_KERNEL. (Hmm, there is no CONTEXT_IDLE right now. Further grumbling.) > >> > 3. In some configurations, RCU needs to be able to block entry into >> > nohz state, both for idle and userspace. >> >> Hmm. 
I suppose we could be CONTEXT_USER but still have RCU awake, >> although the tick would have to stay on. > > Right, there are situations where RCU needs a given CPU to keep the tick > going, for example, when there are RCU callbacks queued on that CPU. > Failing to keep the tick going could result in a system hang, because > that callback might never be invoked. Can't we just fire the callbacks right away? We should only go RCU-idle from reasonable contexts. In fact, the nohz crowd likes the idea of nohz userspace being absolutely no hz. NMIs can't queue RCU callbacks, right? (I hope!) As of a couple releases ago, on x86, we *always* have a clean state when transitioning from any non-NMI kernel context to user mode on x86. > Of course, something or another > will normally eventually disturb the CPU, but the resulting huge delay > would not be good. And on deep embedded systems, it is quite possible > that the CPU would go for a good long time without being disturbed. > (This is not just a theoretical possibility, and I have the scars to > prove it.) > > And there is this one as well: > > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace > context differently than idle context, and still needs to be > able to take two samples and determine if the CPU ever went idle > (and only idle, not userspace) betweentimes. If the context tracking code, or whatever the hook is, tracked the number of transitions out of user mode, that would do it, right? We're talking literally a single per-cpu increment on user entry and/or exit, I think. --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
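The "single per-cpu increment" idea for requirement 4 might look roughly like the following. It is a sketch, assuming separate counters for leaving userspace and leaving idle so that the two can be told apart; the structure and function names are hypothetical, and the code is single-threaded, so it ignores the memory ordering a real remote sampler would need.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU counters, one per kind of transition. */
struct toy_ctx_counts {
	uint64_t user_exits;   /* transitions out of user mode */
	uint64_t idle_exits;   /* transitions out of idle */
};

/* One increment on each transition, done by the CPU itself. */
static void toy_note_user_exit(struct toy_ctx_counts *c)
{
	c->user_exits++;
}

static void toy_note_idle_exit(struct toy_ctx_counts *c)
{
	c->idle_exits++;
}

/* A sample is just a copy of both counters. */
static struct toy_ctx_counts toy_sample(const struct toy_ctx_counts *c)
{
	return *c;
}

/* Did the CPU leave idle (and only idle) between the two samples? */
static bool toy_went_idle_between(struct toy_ctx_counts a,
				  struct toy_ctx_counts b)
{
	return a.idle_exits != b.idle_exits;
}

/* Did the CPU leave user mode between the two samples? */
static bool toy_left_user_between(struct toy_ctx_counts a,
				  struct toy_ctx_counts b)
{
	return a.user_exits != b.user_exits;
}
```

One limitation of counting only exits: a CPU that is still sitting in idle at the second sample has not bumped `idle_exits` yet, so a complete scheme would also need a "currently idle" indication, which this toy does not model.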
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 20:32 ` Andy Lutomirski @ 2015-07-17 21:19 ` Paul E. McKenney 2015-07-17 21:22 ` Paul E. McKenney 2015-07-17 22:13 ` Andy Lutomirski 0 siblings, 2 replies; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 21:19 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote: > On Fri, Jul 17, 2015 at 1:12 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote: > >> On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: > >> >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > >> >> > For reasons that mystify me a bit, we currently track context tracking > >> >> > state separately from rcu's watching state. This results in strange > >> >> > artifacts: nothing generic cause IRQs to enter CONTEXT_KERNEL, and we > >> >> > can nest exceptions inside the IRQ handler (an example would be > >> >> > wrmsr_safe failing), and, in -next, we splat a warning: > >> >> > > >> >> > https://gist.github.com/sashalevin/a006a44989312f6835e7 > >> >> > > >> >> > I'm trying to make context tracking more exact, which will fix this > >> >> > issue (the particular splat that Sasha hit shouldn't be possible when > >> >> > I'm done), but I think it would be nice to unify all of this stuff. > >> >> > Would it be plausible for us to guarantee that RCU state is always in > >> >> > sync with context tracking state? If so, we could maybe simplify > >> >> > things and have fewer state variables. > >> >> > >> >> A noble goal. Might even be possible, and maybe even advantageous. > >> >> > >> >> But it is usually easier to say than to do. 
RCU really does need to make > >> >> some adjustments when the state changes, as do the other subsystems. > >> >> It might or might not be possible to do the transitions atomically. > >> >> And if the transitions are not atomic, there will still be weird code > >> >> paths where (say) the processor is considered non-idle, but RCU doesn't > >> >> realize it yet. Such a code path could not safely use rcu_read_lock(), > >> >> so you still need RCU to be able to scream if someone tries it. > >> >> Contrariwise, if there is a code path where the processor is considered > >> >> idle, but RCU thinks it is non-idle, that code path can stall > >> >> grace periods. (Yes, not a problem if the code path is short enough. > >> >> At least if the underlying VCPU is making progres...) > >> >> > >> >> Still, I cannot prove that it is impossible, and if it is possible, > >> >> then as you say, there might well be benefits. > >> >> > >> >> > Doing this for NMIs might be weird. Would it make sense to have a > >> >> > CONTEXT_NMI that's somehow valid even if the NMI happened while > >> >> > changing context tracking state. > >> >> > >> >> Face it, NMIs are weird. ;-) > >> >> > >> >> > Thoughts? As it stands, I think we might already be broken for real: > >> >> > > >> >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does > >> >> > copy_from_user_nmi, which can fault, causing do_page_fault to get > >> >> > called, which calls exception_enter(), which can't be a good thing. > >> >> > > >> >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. > >> >> > >> >> Actually, I see more cases where people forget irq_enter() than > >> >> rcu_nmi_enter(). "We will just nip in quickly and do something without > >> >> actually letting the irq system know. Oh, and we want some event tracing > >> >> in that code path." Boom! > >> >> > >> >> > Thoughts? As it stands, I need to do something because -tip and thus > >> >> > -next spews occasional warnings. 
> >> >> > >> >> Tell me more? > >> > > >> > And for completeness, RCU also has the following requirements on the > >> > state-transition mechanism: > >> > > >> > 1. It must be possible to reliably sample some other CPU's state. > >> > This is an energy-efficiency requirement, as RCU is not normally > >> > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter. > >> > >> NOHZ needs this for vtime accounting, too. I think Rik might be > >> thinking about this. Maybe the underlying state could be shared? > > > > From what I understand, what Rik is looking at is accounting information, > > which is a different type of state. And a type of state where some > > approximation is just fine. Try that with RCU, and you will approximate > > yourself into a segfault. > > True. But context tracking wouldn't object to being exact. And I > think we need context tracking to treat user mode as quiescent, so > they're at least related. And RCU would be happy to be able to always detect usermode execution. But there are configurations and architectures that exclude context tracking, which means that RCU has to roll its own in those cases. > >> > 2. RCU must be able to track passage through idle and nohz states. > >> > In other words, if RCU samples at t=0 and finds that the CPU > >> > is executing (say) in kernel mode, and RCU samples again at > >> > t=10 and again finds that the CPU is executing in kernel mode, > >> > RCU needs to be able to determine whether or not that CPU passed > >> > through idle or nohz betweentimes. > >> > >> And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the > >> context tracking stuff notifies RCU. The thing I'm less than happy > >> with is that we can currently be CONTEXT_USER but still rcu-awake. > >> This is manageable, but it seems messy. > > > > Well, if you don't have CONFIG_NO_HZ_FULL, there normally isn't context > > tracking, so RCU cannot see CONTEXT_USER. Or are you thinking of making > > context tracking unconditional? 
(The tinification guys might have some > > opinions on this.) > > Without context tracking, user mode is not RCU idle, right? Instead > we have timer ticks. We could get away with a very minimal context > tracking implementation that just tracked CONTEXT_IDLE and > CONTEXT_KERNEL. (Hmm, there is no CONTEXT_IDLE right now. Further > grumbling.) Without context tracking, usermode is still RCU idle. However, in that case, RCU detects idle using a hook in the timer tick handler. > >> > 3. In some configurations, RCU needs to be able to block entry into > >> > nohz state, both for idle and userspace. > >> > >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, > >> although the tick would have to stay on. > > > > Right, there are situations where RCU needs a given CPU to keep the tick > > going, for example, when there are RCU callbacks queued on that CPU. > > Failing to keep the tick going could result in a system hang, because > > that callback might never be invoked. > > Can't we just fire the callbacks right away? Absolutely not!!! At least not on multi-CPU systems. There might be an RCU read-side critical section on some other CPU that we still have to wait for. > We should only go > RCU-idle from reasonable contexts. In fact, the nohz crowd likes the > idea of nohz userspace being absolutely no hz. Guilty to charges as read, and this is why RCU needs to be informed about userspace execution for nohz userspace. > NMIs can't queue RCU callbacks, right? (I hope!) As of a couple > releases ago, on x86, we *always* have a clean state when > transitioning from any non-NMI kernel context to user mode on x86. You are correct, NMIs cannot queue RCU callbacks. That would be possible, but it also would be a bit painful and not so good for energy efficiency. So if someone wants that, they need to have an extremely good reason. 
;-) > > Of course, something or another > > will normally eventually disturb the CPU, but the resulting huge delay > > would not be good. And on deep embedded systems, it is quite possible > > that the CPU would go for a good long time without being disturbed. > > (This is not just a theoretical possibility, and I have the scars to > > prove it.) > > > > And there is this one as well: > > > > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace > > context differently than idle context, and still needs to be > > able to take two samples and determine if the CPU ever went idle > > (and only idle, not userspace) betweentimes. > > If the context tracking code, or whatever the hook is, tracked the > number of transitions out of user mode, that would do it, right? > We're talking literally a single per-cpu increment on user entry > and/or exit, I think. With memory barriers, because RCU has to accurately sample the counters remotely. I currently use full-up atomic operations, possibly only due to paranoia, but I need to further intensify testing to trust moving away from full-up atomic operations. In addition, RCU currently relies on a single counter counting idle-to-RCU transitions, both userspace and idle. It might be possible to wean RCU of this habit, or maybe have the two counters be combined into a single word. The checks would of course be more complex in that case, but should be doable. In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to rcu_prepare_for_idle() just before incrementing the counter when transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU kernels, RCU needs a call to the following in the same place: for_each_rcu_flavor(rsp) { rdp = this_cpu_ptr(rsp->rda); do_nocb_deferred_wakeup(rdp); } On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on transition from an idle-like state. 
There is also some debug that complains if something transitions to/from an idle-like state that shouldn't be doing so, but that could be pulled into context tracking. (Might already be there, for all I know.) And there is event tracing, which might be subsumed into context tracking. See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c for the full story. This could all be handled by an RCU hook being invoked just before incrementing the counter on entry to an idle-like state and just after incrementing the counter on exit from an idle-like state. Ah, yes, and interrupts to/from idle. Some architectures have half-interrupts that never return. RCU uses a compound counter that is zeroed upon process-level entry to an idle-like state to deal with this. See kernel/rcu/rcu.h, the definitions starting with DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus the associated comment block. But maybe context tracking has some other way of handling these beasts? And transitions to idle-like states are not atomic. In some cases in some configurations, rcu_needs_cpu() says "OK" when asked about stopping the tick, but by the time we get to rcu_prepare_for_idle(), it is no longer OK. RCU raises softirq to force a replay in these sorts of cases. Hey, you asked!!! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 15+ messages in thread
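The counter-sampling requirements above (remote sampling without waking the CPU, and detecting passage through an idle-like state between two samples) can be illustrated with a minimal user-space sketch. This is not the kernel's code: the names eqs_counter, eqs_init(), eqs_enter(), eqs_exit(), eqs_snapshot() and eqs_passed_through() are hypothetical, C11 sequentially consistent atomics stand in for the "full-up atomic operations" mentioned above, and the even/odd convention (even means idle-like, odd means RCU is watching) is an assumption of the sketch. Counter wraparound is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU counter: even = idle-like state, odd = RCU watching. */
struct eqs_counter {
    atomic_uint count;
};

/* Start out odd, as a CPU running ordinary kernel code would be. */
static void eqs_init(struct eqs_counter *c)
{
    atomic_init(&c->count, 1u);
}

/* Local CPU: enter an idle-like state (odd -> even), fully ordered. */
static void eqs_enter(struct eqs_counter *c)
{
    unsigned int old = atomic_fetch_add(&c->count, 1u);

    assert(old & 1u);   /* must have been watching */
    (void)old;
}

/* Local CPU: leave an idle-like state (even -> odd), fully ordered. */
static void eqs_exit(struct eqs_counter *c)
{
    unsigned int old = atomic_fetch_add(&c->count, 1u);

    assert(!(old & 1u)); /* must have been idle-like */
    (void)old;
}

/* Remote CPU: sample the counter without disturbing the target CPU. */
static unsigned int eqs_snapshot(struct eqs_counter *c)
{
    return atomic_load(&c->count);
}

/*
 * Remote CPU: true if the target was idle-like when snap was taken, or
 * has entered an idle-like state at least once since then. Any change
 * from an odd snapshot implies an odd -> even transition happened.
 */
static bool eqs_passed_through(struct eqs_counter *c, unsigned int snap)
{
    return !(snap & 1u) || eqs_snapshot(c) != snap;
}
```

With this scheme, two samples that both show "kernel mode" still differ in value if the CPU went idle and came back in between, which is the property requirement 2 asks for.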
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 21:19 ` Paul E. McKenney @ 2015-07-17 21:22 ` Paul E. McKenney 2015-07-17 22:45 ` Andy Lutomirski 2015-07-17 22:13 ` Andy Lutomirski 1 sibling, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 21:22 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel [-- Attachment #1: Type: text/plain, Size: 182 bytes --] And please see attached for an in-process LWN article on RCU's requirements. If you get a chance to look it over, I would value any feedback that you might have. Thanx, Paul [-- Attachment #2: Requirements.2015.07.17a.tgz --] [-- Type: application/x-gtar-compressed, Size: 167814 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 21:22 ` Paul E. McKenney @ 2015-07-17 22:45 ` Andy Lutomirski 0 siblings, 0 replies; 15+ messages in thread From: Andy Lutomirski @ 2015-07-17 22:45 UTC (permalink / raw) To: Paul McKenney Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 2:22 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > And please see attached for an in-process LWN article on RCU's requirements. > If you get a chance to look it over, I would value any feedback that you > might have. > Sure. Nice article! I found the add_gp_buggy thing a bit confusing. What's the rcu reader doing? What's the rcu_access_pointer for? You have a spinlock for the updater. You reference remove_gp_synchronous before you define it. You say: Quick Quiz 4: Without the rcu_dereference() or the rcu_access_pointer(), what destructive optimizations might the compiler make use of? Answer: It could reuse a value formerly fetched from this same pointer. It could also fetch the pointer from gp in a byte-at-a-time manner, resulting in load tearing, in turn resulting in a bytewise mash-up of two distinct pointer values. It might even use value-speculation optimizations, where it makes a wrong guess, but by the time it gets around to checking the value, an update has changed the pointer to match the wrong guess. Too bad about any dereferences that returned pre-initialization garbage in the meantime! Doesn't the spinlock protect against that? Requirement #2: you say "Each CPU that has an RCU read-side critical section that ends after synchronize_rcu() returns is guaranteed to execute a full memory barrier between the time that synchronize_rcu() begins and the time that the RCU read-side critical section begins. " Don't you mean an RCU read-side critical section that *starts* after synchronize_rcu() returns? 
Having read it: on x86 right now with nohz_full, a cpu can go into user mode and come back without executing a full barrier (I think). Certainly there's no such barrier in the entry code. I don't know what user_enter and user_exit do. Is that okay? The issue here is that a SYSRET/SYSCALL pair doesn't serialize or enforce any ordering whatsoever. I got Intel to promise that SYSCALL will always force a TSX abort [1], but that's about it. For timer-driven idle/user detection, we're fine: IRET is serializing. However, there are a couple of kernels in which we don't promise to IRET after an IRQ. This also reminds me: we really really need task switches to be full barriers in general. Otherwise store forwarding can bite otherwise-correct code on x86. s/Guaranteed Unconditional/Guaranteed Unconditionally/ [1] Because of an amusing possible attack. If you attempt a TSX transaction and it fails, you get to leak at least 8 bits out of the aborted transaction. If you manage to read /dev/urandom in a transaction, you can abort the read and still know the random number. Whoops! --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 21:19 ` Paul E. McKenney 2015-07-17 21:22 ` Paul E. McKenney @ 2015-07-17 22:13 ` Andy Lutomirski 2015-07-17 22:55 ` Paul E. McKenney 1 sibling, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-07-17 22:13 UTC (permalink / raw) To: Paul McKenney Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote: >> True. But context tracking wouldn't object to being exact. And I >> think we need context tracking to treat user mode as quiescent, so >> they're at least related. > > And RCU would be happy to be able to always detect usermode execution. > But there are configurations and architectures that exclude context > tracking, which means that RCU has to roll its own in those cases. > We could slowly fix them, perhaps. I suspect that I'm half-way done with accidentally enabling it for x86_32 :) > >> >> > 3. In some configurations, RCU needs to be able to block entry into >> >> > nohz state, both for idle and userspace. >> >> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, >> >> although the tick would have to stay on. >> > >> > Right, there are situations where RCU needs a given CPU to keep the tick >> > going, for example, when there are RCU callbacks queued on that CPU. >> > Failing to keep the tick going could result in a system hang, because >> > that callback might never be invoked. >> >> Can't we just fire the callbacks right away? > > Absolutely not!!! > > At least not on multi-CPU systems. There might be an RCU read-side > critical section on some other CPU that we still have to wait for. Oh, right, obviously. Could you kick them over to a different CPU, though? > >> We should only go >> RCU-idle from reasonable contexts. 
In fact, the nohz crowd likes the >> idea of nohz userspace being absolutely no hz. > > Guilty to charges as read, and this is why RCU needs to be informed > about userspace execution for nohz userspace. > >> NMIs can't queue RCU callbacks, right? (I hope!) As of a couple >> releases ago, on x86, we *always* have a clean state when >> transitioning from any non-NMI kernel context to user mode on x86. > > You are correct, NMIs cannot queue RCU callbacks. That would be possible, > but it also would be a bit painful and not so good for energy efficiency. > So if someone wants that, they need to have an extremely good reason. ;-) Eww, please no. NMI is already a terrifying scary disaster, and keeping it simple would be for the best. > >> > Of course, something or another >> > will normally eventually disturb the CPU, but the resulting huge delay >> > would not be good. And on deep embedded systems, it is quite possible >> > that the CPU would go for a good long time without being disturbed. >> > (This is not just a theoretical possibility, and I have the scars to >> > prove it.) >> > >> > And there is this one as well: >> > >> > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace >> > context differently than idle context, and still needs to be >> > able to take two samples and determine if the CPU ever went idle >> > (and only idle, not userspace) betweentimes. >> >> If the context tracking code, or whatever the hook is, tracked the >> number of transitions out of user mode, that would do it, right? >> We're talking literally a single per-cpu increment on user entry >> and/or exit, I think. > > With memory barriers, because RCU has to accurately sample the counters > remotely. I currently use full-up atomic operations, possibly only due > to paranoia, but I need to further intensify testing to trust moving away > from full-up atomic operations. 
In addition, RCU currently relies on a > single counter counting idle-to-RCU transitions, both userspace and idle. > It might be possible to wean RCU of this habit, or maybe have the two > counters be combined into a single word. The checks would of course be > more complex in that case, but should be doable. Why is idle-to-RCU different from user-to-RCU? I feel like RCU and context tracking are implementing more or less the same thing, and the fact that they're not shared makes life complicated. From my perspective, I want to be able to say "I'm transitioning to user mode right now" and "I'm transitioning out of user mode right now" and have it Just Work. In current -tip, on x86_64, we do that for literally every non-NMI entry with one stupid racy exception, and that racy exception is very much fixable. I'd prefer not to think about whether I'm informing RCU about exiting user mode, informing context tracking about exiting user mode, or both. > > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to > rcu_prepare_for_idle() just before incrementing the counter when > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU > kernels, RCU needs a call to the following in the same place: > > for_each_rcu_flavor(rsp) { > rdp = this_cpu_ptr(rsp->rda); > do_nocb_deferred_wakeup(rdp); > } > > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on > transition from an idle-like state. If my hypothetical "I'm going to userspace now" function did that, great! I call it from a context where percpu variables work and IRQs are off. I further promise not to run any RCU code or enable IRQs between calling that function and actually entering user mode. > > There is also some debug that complains if something transitions > to/from an idle-like state that shouldn't be doing so, but that could be > pulled into context tracking. 
(Might already be there, for all I know.) > And there is event tracing, which might be subsumed into context tracking. > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c > for the full story. > > This could all be handled by an RCU hook being invoked just before > incrementing the counter on entry to an idle-like state and just after > incrementing the counter on exit from an idle-like state. > > Ah, yes, and interrupts to/from idle. Some architectures have > half-interrupts that never return. WTF? Some architectures are clearly nuts. x86 gets this right, at least in the intel_idle and acpi_idle cases. There are no interrupts from RCU idle unless I badly misread the code. > RCU uses a compound counter > that is zeroed upon process-level entry to an idle-like state to > deal with this. See kernel/rcu/rcu.h, the definitions starting with > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus > the associated comment block. But maybe context tracking has some > other way of handling these beasts? > > And transitions to idle-like states are not atomic. In some cases > in some configurations, rcu_needs_cpu() says "OK" when asked about > stopping the tick, but by the time we get to rcu_prepare_for_idle(), > it is no longer OK. RCU raises softirq to force a replay in these > sorts of cases. Hmm. If pending callbacks got kicked to another CPU, would that help? If that's impossible, when RCU finally detects that the grace period is over, could it send an IPI rather than relying on the timer tick? --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
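The "compound counter that is zeroed upon process-level entry to an idle-like state" quoted above can be illustrated with a much-simplified user-space sketch. It does not reproduce the kernel's DYNTICK_TASK_* encoding: the struct cpu_nesting type, the TASK_NOT_IDLE value and all function names below are hypothetical. The point it tries to show is only why zeroing, rather than decrementing, tolerates an interrupt entry that never gets its matching exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical base value meaning "running at process level, not idle". */
#define TASK_NOT_IDLE 1000L

/* Hypothetical per-CPU nesting state: 0 means idle-like, RCU not watching. */
struct cpu_nesting {
    long nesting;
};

static bool nesting_rcu_watching(const struct cpu_nesting *n)
{
    return n->nesting != 0;
}

/* Process level leaves idle: establish the process-level base value. */
static void nesting_task_exit_idle(struct cpu_nesting *n)
{
    n->nesting = TASK_NOT_IDLE;
}

/*
 * Process level enters idle: zero the count outright instead of
 * subtracting. Any surplus left behind by interrupt entries that never
 * returned ("half-interrupts") is discarded here.
 */
static void nesting_task_enter_idle(struct cpu_nesting *n)
{
    n->nesting = 0;
}

/* Interrupt entry nests on top of whatever level is current. */
static void nesting_irq_enter(struct cpu_nesting *n)
{
    n->nesting++;
}

/* Interrupt exit undoes one level of interrupt nesting. */
static void nesting_irq_exit(struct cpu_nesting *n)
{
    assert(n->nesting > 0);
    n->nesting--;
}
```

Had nesting_task_enter_idle() subtracted TASK_NOT_IDLE instead, a single unmatched nesting_irq_enter() would leave the count at 1 and the CPU would look non-idle to RCU indefinitely, which is the grace-period stall discussed earlier in the thread.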
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 22:13 ` Andy Lutomirski @ 2015-07-17 22:55 ` Paul E. McKenney 2015-07-17 23:20 ` Andy Lutomirski 0 siblings, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2015-07-17 22:55 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote: > On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote: > >> True. But context tracking wouldn't object to being exact. And I > >> think we need context tracking to treat user mode as quiescent, so > >> they're at least related. > > > > And RCU would be happy to be able to always detect usermode execution. > > But there are configurations and architectures that exclude context > > tracking, which means that RCU has to roll its own in those cases. > > We could slowly fix them, perhaps. I suspect that I'm half-way done > with accidentally enabling it for x86_32 :) If there was an appropriate Kconfig variable, I could do things one way or the other, depending on what the architecture was doing. So you are -unconditionally- enabling context tracking for x86_32? Doesn't that increase kernel-user transition overhead? > >> >> > 3. In some configurations, RCU needs to be able to block entry into > >> >> > nohz state, both for idle and userspace. > >> >> > >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, > >> >> although the tick would have to stay on. > >> > > >> > Right, there are situations where RCU needs a given CPU to keep the tick > >> > going, for example, when there are RCU callbacks queued on that CPU. > >> > Failing to keep the tick going could result in a system hang, because > >> > that callback might never be invoked. 
> >> > >> Can't we just fire the callbacks right away? > > > > Absolutely not!!! > > > > At least not on multi-CPU systems. There might be an RCU read-side > > critical section on some other CPU that we still have to wait for. > > Oh, right, obviously. Could you kick them over to a different CPU, though? For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on whatever CPU the scheduler or the sysadm chooses, as the case may be. Otherwise, it gets really hard to make sure that a given CPU's callbacks execute in order, which is required for rcu_barrier() to work properly. There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only way that RCU callbacks were invoked, but the overhead is higher and turns out to be all too noticeable on some workloads. :-( > >> We should only go > >> RCU-idle from reasonable contexts. In fact, the nohz crowd likes the > >> idea of nohz userspace being absolutely no hz. > > > > Guilty to charges as read, and this is why RCU needs to be informed > > about userspace execution for nohz userspace. > > > >> NMIs can't queue RCU callbacks, right? (I hope!) As of a couple > >> releases ago, on x86, we *always* have a clean state when > >> transitioning from any non-NMI kernel context to user mode on x86. > > > > You are correct, NMIs cannot queue RCU callbacks. That would be possible, > > but it also would be a bit painful and not so good for energy efficiency. > > So if someone wants that, they need to have an extremely good reason. ;-) > > Eww, please no. NMI is already a terrifying scary disaster, and > keeping it simple would be for the best. ;-) ;-) ;-) > >> > Of course, something or another > >> > will normally eventually disturb the CPU, but the resulting huge delay > >> > would not be good. And on deep embedded systems, it is quite possible > >> > that the CPU would go for a good long time without being disturbed. > >> > (This is not just a theoretical possibility, and I have the scars to > >> > prove it.) 
> >> > > >> > And there is this one as well: > >> > > >> > 4. In CONFIG_NO_HZ_FULL_SYSIDLE=y kernels, RCU has to treat userspace > >> > context differently than idle context, and still needs to be > >> > able to take two samples and determine if the CPU ever went idle > >> > (and only idle, not userspace) betweentimes. > >> > >> If the context tracking code, or whatever the hook is, tracked the > >> number of transitions out of user mode, that would do it, right? > >> We're talking literally a single per-cpu increment on user entry > >> and/or exit, I think. > > > > With memory barriers, because RCU has to accurately sample the counters > > remotely. I currently use full-up atomic operations, possibly only due > > to paranoia, but I need to further intensify testing to trust moving away > > from full-up atomic operations. In addition, RCU currently relies on a > > single counter counting idle-to-RCU transitions, both userspace and idle. > > It might be possible to wean RCU of this habit, or maybe have the two > > counters be combined into a single word. The checks would of course be > > more complex in that case, but should be doable. > > Why is idle-to-RCU different from user-to-RCU? Full-system-idle checking that is supposed to someday allow CPU 0's scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL systems. In this case, RCU treats user as non-idle for the purpose of determining whether or not CPU 0's scheduling-clock interrupt can be stopped. But idle is still idle. And for purposes of determining whether the grace period has ended, idle and user are both extended quiescent states, regardless. > I feel like RCU and context tracking are implementing more or less the > same thing, and the fact that they're not shared makes life > complicated. > > From my perspective, I want to be able to say "I'm transitioning to > user mode right now" and "I'm transitioning out of user mode right > now" and have it Just Work. 
In current -tip, on x86_64, we do that > for literally every non-NMI entry with one stupid racy exception, and > that racy exception is very much fixable. I'd prefer not to think > about whether I'm informing RCU about exiting user mode, informing > context tracking about exiting user mode, or both. RCU's tracking was in place for many years before context tracking appeared. If we can converge them, well and good, but it really does have to fully work. > > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to > > rcu_prepare_for_idle() just before incrementing the counter when > > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU > > kernels, RCU needs a call to the following in the same place: > > > > for_each_rcu_flavor(rsp) { > > rdp = this_cpu_ptr(rsp->rda); > > do_nocb_deferred_wakeup(rdp); > > } > > > > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked > > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on > > transition from an idle-like state. > > If my hypothetical "I'm going to userspace now" function did that, > great! I call it from a context where percpu variables work and IRQs > are off. I further promise not to run any RCU code or enable IRQs > between calling that function and actually entering user mode. Probably need a pair of nonspecific RCU hook functions that do whatever RCU eventually needs done, but that could hopefully work. Of course, just having a pair of hook functions assumes that RCU can be convinced to not care about the difference between irq and process transitions, and the difference between idle and user (at transition, sysidle will continue to care about the difference between idle and user when remotely checking a given CPU's state). > > There is also some debug that complains if something transitions > > to/from an idle-like state that shouldn't be doing so, but that could be > > pulled into context tracking. 
(Might already be there, for all I know.) > > And there is event tracing, which might be subsumed into context tracking. > > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c > > for the full story. > > > > This could all be handled by an RCU hook being invoked just before > > incrementing the counter on entry to an idle-like state and just after > > incrementing the counter on exit from an idle-like state. > > > > Ah, yes, and interrupts to/from idle. Some architectures have > > half-interrupts that never return. > > WTF? Some architectures are clearly nuts. Heh. That was -exactly- my reaction when I first ran into it. > x86 gets this right, at least in the intel_idle and acpi_idle cases. > There are no interrupts from RCU idle unless I badly misread the code. These half-interrupts happen when running non-idle in kernel code. Things like simulating exceptions and system calls from within the kernel. I would not be sad to see it go, but while it is here, RCU must handle it correctly. If it really truly cannot happen for x86, then x86 arch code of course need not worry about it. > > RCU uses a compound counter > > that is zeroed upon process-level entry to an idle-like state to > > deal with this. See kernel/rcu/rcu.h, the definitions starting with > > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus > > the associated comment block. But maybe context tracking has some > > other way of handling these beasts? > > > > And transitions to idle-like states are not atomic. In some cases > > in some configurations, rcu_needs_cpu() says "OK" when asked about > > stopping the tick, but by the time we get to rcu_prepare_for_idle(), > > it is no longer OK. RCU raises softirq to force a replay in these > > sorts of cases. > > Hmm. If pending callbacks got kicked to another CPU, would that help? 
Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally says "OK", so for workloads where that is a reasonable strategy, it works quite well. Otherwise, kicking callbacks to other CPUs is extremely painful at best, and perhaps even impossible, at least if rcu_barrier() is to work correctly. > If that's impossible, when RCU finally detects that the grace period > is over, could it send an IPI rather than relying on the timer tick? In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets a timer to catch end of the grace period in the common case. This is the point of the time passed back from rcu_needs_cpu(). Making the end-of-grace-period code send IPIs to CPUs that might (or might not) be in this mode involves too much rummaging through other CPU's states. Plus people are not complaining that the grace-period kthread is using too little CPU. Besides, you have to tolerate a CPU catching an interrupt that does a wakeup just as that CPU was trying to go idle, so the current approach is not introducing any additional pain from what I can see. Thanx, Paul ^ permalink raw reply [flat|nested] 15+ messages in thread
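The hook ordering described in this message (an RCU hook invoked just before incrementing the counter on entry to an idle-like state, and just after incrementing it on exit) can be sketched in user-space C as follows. This is an illustration under assumptions, not kernel code: struct eqs_tracker, tracker_enter_eqs(), tracker_exit_eqs() and demo_hook() are hypothetical names, the hooks merely stand in for work such as rcu_prepare_for_idle() and rcu_cleanup_after_idle(), and the even/odd counter convention (odd means RCU is watching) is assumed.

```c
#include <stdatomic.h>
#include <stddef.h>

struct eqs_tracker;
typedef void (*eqs_hook_fn)(struct eqs_tracker *);

/* Hypothetical per-CPU tracker: counter plus a pair of RCU hooks. */
struct eqs_tracker {
    atomic_uint count;        /* even = idle-like, odd = RCU watching */
    eqs_hook_fn before_enter; /* runs just before entering idle-like state */
    eqs_hook_fn after_exit;   /* runs just after leaving idle-like state */
};

/* Entry: hook first, then the increment, so the hook runs while watched. */
static void tracker_enter_eqs(struct eqs_tracker *t)
{
    if (t->before_enter != NULL)
        t->before_enter(t);
    atomic_fetch_add(&t->count, 1u);  /* odd -> even */
}

/* Exit: increment first, then the hook, so the hook again runs while watched. */
static void tracker_exit_eqs(struct eqs_tracker *t)
{
    atomic_fetch_add(&t->count, 1u);  /* even -> odd */
    if (t->after_exit != NULL)
        t->after_exit(t);
}

/* Demo hook: counts how many hook calls ran while RCU was watching. */
static int hooks_seen_watching;

static void demo_hook(struct eqs_tracker *t)
{
    if (atomic_load(&t->count) & 1u)
        hooks_seen_watching++;
}
```

The ordering matters because the hooks may themselves need RCU to be watching. Placing them on the "watching" side of each increment means they never execute inside the idle-like window.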
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 22:55 ` Paul E. McKenney @ 2015-07-17 23:20 ` Andy Lutomirski 2015-07-18 0:04 ` Paul E. McKenney 0 siblings, 1 reply; 15+ messages in thread From: Andy Lutomirski @ 2015-07-17 23:20 UTC (permalink / raw) To: Paul McKenney Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 3:55 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote: >> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote: >> >> True. But context tracking wouldn't object to being exact. And I >> >> think we need context tracking to treat user mode as quiescent, so >> >> they're at least related. >> > >> > And RCU would be happy to be able to always detect usermode execution. >> > But there are configurations and architectures that exclude context >> > tracking, which means that RCU has to roll its own in those cases. >> >> We could slowly fix them, perhaps. I suspect that I'm half-way done >> with accidentally enabling it for x86_32 :) > > If there was an appropriate Kconfig variable, I could do things one way > or the other, depending on what the architecture was doing. There's CONFIG_CONTEXT_TRACKING. IMO it would be nice if there was a sort of clear spec for what promises an arch needs to make for proper RCU operation and what additional promises it needs to make for RCU idle during user mode. > > So you are -unconditionally- enabling context tracking for x86_32? > Doesn't that increase kernel-user transition overhead? No, I'm just adding the user_enter and user_exit calls. This might let us enable HAVE_CONTEXT_TRACKING. Someone still needs to flip it on. > >> >> >> > 3. 
In some configurations, RCU needs to be able to block entry into >> >> >> > nohz state, both for idle and userspace. >> >> >> >> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, >> >> >> although the tick would have to stay on. >> >> > >> >> > Right, there are situations where RCU needs a given CPU to keep the tick >> >> > going, for example, when there are RCU callbacks queued on that CPU. >> >> > Failing to keep the tick going could result in a system hang, because >> >> > that callback might never be invoked. >> >> >> >> Can't we just fire the callbacks right away? >> > >> > Absolutely not!!! >> > >> > At least not on multi-CPU systems. There might be an RCU read-side >> > critical section on some other CPU that we still have to wait for. >> >> Oh, right, obviously. Could you kick them over to a different CPU, though? > > For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on > whatever CPU the scheduler or the sysadm chooses, as the case may be. > Otherwise, it gets really hard to make sure that a given CPU's callbacks > execute in order, which is required for rcu_barrier() to work properly. > > There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only > way that RCU callbacks were invoked, but the overhead is higher and > turns out to be all too noticeable on some workloads. :-( > Yuck. What if we make CONFIG_RCU_NOCB_CPUS=y mandatory for NOHZ_FULL? In any case, the people who want their systems to be really truly quiescent in user space will have CONFIG_RCU_NOCB_CPUS=y. >> >> Why is idle-to-RCU different from user-to-RCU? > > Full-system-idle checking that is supposed to someday allow CPU 0's > scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL > systems. In this case, RCU treats user as non-idle for the purpose > of determining whether or not CPU 0's scheduling-clock interrupt can > be stopped. I'm not quite sure I understand this. We can turn off the tick if truly idle but not if in user mode? 
Why? > >> I feel like RCU and context tracking are implementing more or less the >> same thing, and the fact that they're not shared makes life >> complicated. >> >> From my perspective, I want to be able to say "I'm transitioning to >> user mode right now" and "I'm transitioning out of user mode right >> now" and have it Just Work. In current -tip, on x86_64, we do that >> for literally every non-NMI entry with one stupid racy exception, and >> that racy exception is very much fixable. I'd prefer not to think >> about whether I'm informing RCU about exiting user mode, informing >> context tracking about exiting user mode, or both. > > RCU's tracking was in place for many years before context tracking > appeared. If we can converge them, well and good, but it really does > have to fully work. True. One of my goals is to perfect coverage of the entering-usermode and exiting-usermode callbacks on x86. What those callbacks are called and what they do is up for debate, but I intend to call them. In fact, I intend to promise that we will never execute non-NMI kernel code with IRQs on at *any* point where I haven't called the appropriate callback to tell the kernel that I'm executing kernel code. The splat that Sasha got was an assertion I added to help validate this promise, but I'm asserting sort-of the wrong thing. Currently there's a narrow window in which RCU knows we're non-idle but context tracking still thinks we're in user mode, and Sasha got it to take an interrupt there. I can change the assertion (it's currently harmless), or I can fix it (needs a bit more work, and Ingo is currently swamped so it'll be a couple weeks). > >> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to >> > rcu_prepare_for_idle() just before incrementing the counter when >> > transitioning to an idle-like state. 
Similarly, in CONFIG_RCU_NOCB_CPU >> > kernels, RCU needs a call to the following in the same place: >> > >> > for_each_rcu_flavor(rsp) { >> > rdp = this_cpu_ptr(rsp->rda); >> > do_nocb_deferred_wakeup(rdp); >> > } >> > >> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked >> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on >> > transition from an idle-like state. >> >> If my hypothetical "I'm going to userspace now" function did that, >> great! I call it from a context where percpu variables work and IRQs >> are off. I further promise not to run any RCU code or enable IRQs >> between calling that function and actually entering user mode. > > Probably need a pair of nonspecific RCU hook functions that do whatever > RCU eventually needs done, but that could hopefully work. > > Of course, just having a pair of hook functions assumes that RCU can > be convinced to not care about the difference between irq and process > transitions, and the difference between idle and user (at transition, > sysidle will continue to care about the difference between idle and > user when remotely checking a given CPU's state). It can be more complex than just a pair of functions. I could call idle_to_kernel() and user_to_kernel(), both with IRQs off and exactly at the right times. There may be architectures that deliver IRQs directly from idle, which would mean that IRQ handlers would need care to choose the right hook, but x86 is not one of those architectures. > >> > There is also some debug that complains if something transitions >> > to/from an idle-like state that shouldn't be doing so, but that could be >> > pulled into context tracking. (Might already be there, for all I know.) >> > And there is event tracing, which might be subsumed into context tracking. >> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c >> > for the full story. 
>> > >> > This could all be handled by an RCU hook being invoked just before >> > incrementing the counter on entry to an idle-like state and just after >> > incrementing the counter on exit from an idle-like state. >> > >> > Ah, yes, and interrupts to/from idle. Some architectures have >> > half-interrupts that never return. >> >> WTF? Some architectures are clearly nuts. > > Heh. That was -exactly- my reaction when I first ran into it. > >> x86 gets this right, at least in the intel_idle and acpi_idle cases. >> There are no interrupts from RCU idle unless I badly misread the code. > > These half-interrupts happen when running non-idle in kernel code. > Things like simulating exceptions and system calls from within the kernel. > I would not be sad to see it go, but while it is here, RCU must handle > it correctly. If it really truly cannot happen for x86, then x86 > arch code of course need not worry about it. On x86, you can simulate a syscall by calling the syscall body. I clearly don't understand this weirdness... > >> > RCU uses a compound counter >> > that is zeroed upon process-level entry to an idle-like state to >> > deal with this. See kernel/rcu/rcu.h, the definitions starting with >> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus >> > the associated comment block. But maybe context tracking has some >> > other way of handling these beasts? >> > >> > And transitions to idle-like states are not atomic. In some cases >> > in some configurations, rcu_needs_cpu() says "OK" when asked about >> > stopping the tick, but by the time we get to rcu_prepare_for_idle(), >> > it is no longer OK. RCU raises softirq to force a replay in these >> > sorts of cases. >> >> Hmm. If pending callbacks got kicked to another CPU, would that help? > > Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally > says "OK", so for workloads where that is a reasonable strategy, it works > quite well. 
Otherwise, kicking callbacks to other CPUs is extremely > painful at best, and perhaps even impossible, at least if rcu_barrier() > is to work correctly. > >> If that's impossible, when RCU finally detects that the grace period >> is over, could it send an IPI rather than relying on the timer tick? > > In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets > a timer to catch end of the grace period in the common case. This is > the point of the time passed back from rcu_needs_cpu(). Making the > end-of-grace-period code send IPIs to CPUs that might (or might not) > be in this mode involves too much rummaging through other CPU's states. > Plus people are not complaining that the grace-period kthread is using > too little CPU. > > Besides, you have to tolerate a CPU catching an interrupt that does a > wakeup just as that CPU was trying to go idle, so the current approach > is not introducing any additional pain from what I can see. Except that, in that case, we defer idle exactly long enough to handle the interrupt and then either schedule or go idle for real (or have another interrupt). There's no weird state where we have no work to do right now but still can't become idle. --Andy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 23:20 ` Andy Lutomirski @ 2015-07-18 0:04 ` Paul E. McKenney 0 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2015-07-18 0:04 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Frédéric Weisbecker, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 04:20:57PM -0700, Andy Lutomirski wrote: > On Fri, Jul 17, 2015 at 3:55 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Fri, Jul 17, 2015 at 03:13:36PM -0700, Andy Lutomirski wrote: > >> On Fri, Jul 17, 2015 at 2:19 PM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > On Fri, Jul 17, 2015 at 01:32:27PM -0700, Andy Lutomirski wrote: > >> >> True. But context tracking wouldn't object to being exact. And I > >> >> think we need context tracking to treat user mode as quiescent, so > >> >> they're at least related. > >> > > >> > And RCU would be happy to be able to always detect usermode execution. > >> > But there are configurations and architectures that exclude context > >> > tracking, which means that RCU has to roll its own in those cases. > >> > >> We could slowly fix them, perhaps. I suspect that I'm half-way done > >> with accidentally enabling it for x86_32 :) > > > > If there was an appropriate Kconfig variable, I could do things one way > > or the other, depending on what the architecture was doing. > > There's CONFIG_CONTEXT_TRACKING. > > IMO it would be nice if there was a sort of clear spec for what > promises an arch needs to make for proper RCU operation and what > additional promises it needs to make for RCU idle during user mode. Well, if RCU is going to delegate that responsibility to the architectures, as you are suggesting, something will indeed be required. ;-) If we come up with something workable, I will document it. Self-defense and all that. > > So you are -unconditionally- enabling context tracking for x86_32? 
> > Doesn't that increase kernel-user transition overhead? > > No, I'm just adding the user_enter and user_exit calls. This might > let us enable HAVE_CONTEXT_TRACKING. Someone still needs to flip it > on. Whew! ;-) On the other hand, RCU will need to be set up to work either way. Shouldn't be too hard, though. > >> >> >> > 3. In some configurations, RCU needs to be able to block entry into > >> >> >> > nohz state, both for idle and userspace. > >> >> >> > >> >> >> Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, > >> >> >> although the tick would have to stay on. > >> >> > > >> >> > Right, there are situations where RCU needs a given CPU to keep the tick > >> >> > going, for example, when there are RCU callbacks queued on that CPU. > >> >> > Failing to keep the tick going could result in a system hang, because > >> >> > that callback might never be invoked. > >> >> > >> >> Can't we just fire the callbacks right away? > >> > > >> > Absolutely not!!! > >> > > >> > At least not on multi-CPU systems. There might be an RCU read-side > >> > critical section on some other CPU that we still have to wait for. > >> > >> Oh, right, obviously. Could you kick them over to a different CPU, though? > > > > For CONFIG_RCU_NOCB_CPUS=y kernels, no problem, they will execute on > > whatever CPU the scheduler or the sysadm chooses, as the case may be. > > Otherwise, it gets really hard to make sure that a given CPU's callbacks > > execute in order, which is required for rcu_barrier() to work properly. > > > > There was some thought of making CONFIG_RCU_NOCB_CPUS=y be the only > > way that RCU callbacks were invoked, but the overhead is higher and > > turns out to be all too noticeable on some workloads. :-( > > Yuck. What if we make CONFIG_RCU_NOCB_CPUS=y mandatory for NOHZ_FULL? > In any case, the people who want their systems to be really truly > quiescent in user space will have CONFIG_RCU_NOCB_CPUS=y. It is already mandatory, NO_HZ_FULL selects RCU_NOCB_CPUS. 
However, the choice of which CPU will really do nohz (and thus nocbs as well) happens at boot time. So a given NO_HZ_FULL system will typically have at least one non-RCU_NOCB_CPUS CPU, namely the boot CPU, which cannot be a nohz CPU. Thankfully, the choices are automated at Kconfig and boot time, so the right thing should happen without any undue user consternation. > >> Why is idle-to-RCU different from user-to-RCU? > > > > Full-system-idle checking that is supposed to someday allow CPU 0's > > scheduling-clock interrupt to be turned off on CONFIG_NO_HZ_FULL > > systems. In this case, RCU treats user as non-idle for the purpose > > of determining whether or not CPU 0's scheduling-clock interrupt can > > be stopped. > > I'm not quite sure I understand this. We can turn off the tick if > truly idle but not if in user mode? Why? Because of the need to maintain time synchronization at all times that at least one CPU is non-idle. User-mode code might access fine-grained time, and might care about time synchronization. The scheduling-clock interrupt handler takes care of this when needed. > >> I feel like RCU and context tracking are implementing more or less the > >> same thing, and the fact that they're not shared makes life > >> complicated. > >> > >> From my perspective, I want to be able to say "I'm transitioning to > >> user mode right now" and "I'm transitioning out of user mode right > >> now" and have it Just Work. In current -tip, on x86_64, we do that > >> for literally every non-NMI entry with one stupid racy exception, and > >> that racy exception is very much fixable. I'd prefer not to think > >> about whether I'm informing RCU about exiting user mode, informing > >> context tracking about exiting user mode, or both. > > > > RCU's tracking was in place for many years before context tracking > > appeared. If we can converge them, well and good, but it really does > > have to fully work. > > True. 
> > One of my goals is to perfect coverage of the entering-usermode and > exiting-usermode callbacks on x86. What those callbacks are called > and what they do is up for debate, but I intend to call them. In > fact, I intend to promise that we will never execute non-NMI kernel > code with IRQs on at *any* point where I haven't called the > appropriate callback to tell the kernel that I'm executing kernel > code. > > The splat that Sasha got was an assertion I added to help validate > this promise, but I'm asserting sort-of the wrong thing. Currently > there's a narrow window in which RCU knows we're non-idle but context > tracking still thinks we're in user mode, and Sasha got it to take an > interrupt there. I can change the assertion (it's currently > harmless), or I can fix it (needs a bit more work, and Ingo is > currently swamped so it'll be a couple weeks). Agreed, false-positive splats are no fun either. > >> > In addition, in CONFIG_RCU_FAST_NO_HZ kernels, RCU needs a call to > >> > rcu_prepare_for_idle() just before incrementing the counter when > >> > transitioning to an idle-like state. Similarly, in CONFIG_RCU_NOCB_CPU > >> > kernels, RCU needs a call to the following in the same place: > >> > > >> > for_each_rcu_flavor(rsp) { > >> > rdp = this_cpu_ptr(rsp->rda); > >> > do_nocb_deferred_wakeup(rdp); > >> > } > >> > > >> > On the from-idle-like-state side, rcu_cleanup_after_idle() must be invoked > >> > in CONFIG_RCU_FAST_NO_HZ kernels just after incrementing the counter on > >> > transition from an idle-like state. > >> > >> If my hypothetical "I'm going to userspace now" function did that, > >> great! I call it from a context where percpu variables work and IRQs > >> are off. I further promise not to run any RCU code or enable IRQs > >> between calling that function and actually entering user mode. > > > > Probably need a pair of nonspecific RCU hook functions that do whatever > > RCU eventually needs done, but that could hopefully work. 
> > > > Of course, just having a pair of hook functions assumes that RCU can > > be convinced to not care about the difference between irq and process > > transitions, and the difference between idle and user (at transition, > > sysidle will continue to care about the difference between idle and > > user when remotely checking a given CPU's state). > > It can be more complex than just a pair of functions. I could call > idle_to_kernel() and user_to_kernel(), both with IRQs off and exactly > at the right times. OK, that approach sounds reasonable. So rcu_idle_to_kernel(), rcu_kernel_to_idle(), rcu_user_to_kernel(), and rcu_kernel_to_user()? I expect that I would abstract them out of the current rcu_user_enter() and friends, in the name of maintainability. I would guess that the NMIs remain as they are, but one way or another RCU needs to know about the NMIs. > There may be architectures that deliver IRQs directly from idle, which > would mean that IRQ handlers would need care to choose the right hook, > but x86 is not one of those architectures. There are indeed such architectures. > >> > There is also some debug that complains if something transitions > >> > to/from an idle-like state that shouldn't be doing so, but that could be > >> > pulled into context tracking. (Might already be there, for all I know.) > >> > And there is event tracing, which might be subsumed into context tracking. > >> > See rcu_eqs_enter_common() and rcu_eqs_exit_common() in kernel/rcu/tree.c > >> > for the full story. > >> > > >> > This could all be handled by an RCU hook being invoked just before > >> > incrementing the counter on entry to an idle-like state and just after > >> > incrementing the counter on exit from an idle-like state. > >> > > >> > Ah, yes, and interrupts to/from idle. Some architectures have > >> > half-interrupts that never return. > >> > >> WTF? Some architectures are clearly nuts. > > > > Heh. That was -exactly- my reaction when I first ran into it. 
> > > >> x86 gets this right, at least in the intel_idle and acpi_idle cases. > >> There are no interrupts from RCU idle unless I badly misread the code. > > > > These half-interrupts happen when running non-idle in kernel code. > > Things like simulating exceptions and system calls from within the kernel. > > I would not be sad to see it go, but while it is here, RCU must handle > > it correctly. If it really truly cannot happen for x86, then x86 > > arch code of course need not worry about it. > > On x86, you can simulate a syscall by calling the syscall body. I > clearly don't understand this weirdness... If it could go away, that would be good. > >> > RCU uses a compound counter > >> > that is zeroed upon process-level entry to an idle-like state to > >> > deal with this. See kernel/rcu/rcu.h, the definitions starting with > >> > DYNTICK_TASK_NEST_WIDTH and ending with DYNTICK_TASK_EXIT_IDLE, plus > >> > the associated comment block. But maybe context tracking has some > >> > other way of handling these beasts? > >> > > >> > And transitions to idle-like states are not atomic. In some cases > >> > in some configurations, rcu_needs_cpu() says "OK" when asked about > >> > stopping the tick, but by the time we get to rcu_prepare_for_idle(), > >> > it is no longer OK. RCU raises softirq to force a replay in these > >> > sorts of cases. > >> > >> Hmm. If pending callbacks got kicked to another CPU, would that help? > > > > Well, for CONFIG_RCU_NOCB_CPUS_ALL kernels, rcu_needs_cpu() unconditionally > > says "OK", so for workloads where that is a reasonable strategy, it works > > quite well. Otherwise, kicking callbacks to other CPUs is extremely > > painful at best, and perhaps even impossible, at least if rcu_barrier() > > is to work correctly. > > > >> If that's impossible, when RCU finally detects that the grace period > >> is over, could it send an IPI rather than relying on the timer tick? 
> > > > In CONFIG_RCU_FAST_NO_HZ kernels, when the CPU is going idle, it sets > > a timer to catch end of the grace period in the common case. This is > > the point of the time passed back from rcu_needs_cpu(). Making the > > end-of-grace-period code send IPIs to CPUs that might (or might not) > > be in this mode involves too much rummaging through other CPU's states. > > Plus people are not complaining that the grace-period kthread is using > > too little CPU. > > > > Besides, you have to tolerate a CPU catching an interrupt that does a > > wakeup just as that CPU was trying to go idle, so the current approach > > is not introducing any additional pain from what I can see. > > Except that, in that case, we defer idle exactly long enough to handle > the interrupt and then either schedule or go idle for real (or have > another interrupt). There's no weird state where we have no work to > do right now but still can't become idle. So I have to actually wake up some kthread on that CPU to force reconsideration of the type of transition to idle? Ouch... Or will raising softirq have the same effect? The reason I care is that the interrupt might have invoked call_rcu(), which in some configurations means that RCU cannot allow the scheduling clock interrupt to be turned off. The same thing can happen in a softirq handler. Thanx, Paul ^ permalink raw reply [flat|nested] 15+ messages in thread
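Editorial sketch of the four transition hooks proposed in the message above. This is a userspace model only, not kernel code: the `model_` prefix, the `cpu_model` structure, and the hook counters are all assumptions made for illustration, and neither IRQs taken from user mode nor NMIs are modeled. It only shows the ordering Paul describes (RCU's hook runs just before RCU stops watching and just after it starts watching again) and the invariant Andy is after (context-tracking state and RCU state never disagree).

```c
/* Userspace model only; none of this is kernel code. */
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CTX_KERNEL, CTX_USER, CTX_IDLE };

struct cpu_model {
	enum ctx_state ctx;  /* what context tracking believes */
	bool rcu_watching;   /* false while in an extended quiescent state */
	bool irqs_off;       /* transitions are only legal with IRQs off */
	int pre_eqs_hooks;   /* stand-in for rcu_prepare_for_idle()-style work */
	int post_eqs_hooks;  /* stand-in for rcu_cleanup_after_idle()-style work */
};

void model_init(struct cpu_model *c)
{
	c->ctx = CTX_KERNEL;
	c->rcu_watching = true;
	c->irqs_off = true;
	c->pre_eqs_hooks = 0;
	c->post_eqs_hooks = 0;
}

/* Common path for kernel -> user and kernel -> idle. */
void model_leave_kernel(struct cpu_model *c, enum ctx_state to)
{
	assert(c->irqs_off && c->ctx == CTX_KERNEL && c->rcu_watching);
	c->pre_eqs_hooks++;      /* hook runs just before RCU stops watching */
	c->rcu_watching = false;
	c->ctx = to;
}

/* Common path for user -> kernel and idle -> kernel. */
void model_enter_kernel(struct cpu_model *c, enum ctx_state from)
{
	assert(c->irqs_off && c->ctx == from && !c->rcu_watching);
	c->rcu_watching = true;
	c->post_eqs_hooks++;     /* hook runs just after RCU starts watching */
	c->ctx = CTX_KERNEL;
}

/* Names follow the proposal in the thread; they are not an existing API. */
void model_rcu_kernel_to_user(struct cpu_model *c) { model_leave_kernel(c, CTX_USER); }
void model_rcu_user_to_kernel(struct cpu_model *c) { model_enter_kernel(c, CTX_USER); }
void model_rcu_kernel_to_idle(struct cpu_model *c) { model_leave_kernel(c, CTX_IDLE); }
void model_rcu_idle_to_kernel(struct cpu_model *c) { model_enter_kernel(c, CTX_IDLE); }

/* The invariant under discussion: RCU watches exactly when in the kernel. */
bool model_in_sync(const struct cpu_model *c)
{
	return c->rcu_watching == (c->ctx == CTX_KERNEL);
}
```

Because both views change inside one function called with IRQs off, no interrupt handler in this model can observe the window where they disagree. A real implementation would still need the NMI handling discussed earlier in the thread.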
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 18:59 ` Andy Lutomirski 2015-07-17 20:12 ` Paul E. McKenney @ 2015-07-18 13:12 ` Frederic Weisbecker 1 sibling, 0 replies; 15+ messages in thread From: Frederic Weisbecker @ 2015-07-18 13:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Paul McKenney, Sasha Levin, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Fri, Jul 17, 2015 at 11:59:18AM -0700, Andy Lutomirski wrote: > On Thu, Jul 16, 2015 at 9:49 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Jul 16, 2015 at 09:29:07PM -0700, Paul E. McKenney wrote: > >> On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > >> > For reasons that mystify me a bit, we currently track context tracking > >> > state separately from rcu's watching state. This results in strange > >> > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we > >> > can nest exceptions inside the IRQ handler (an example would be > >> > wrmsr_safe failing), and, in -next, we splat a warning: > >> > > >> > https://gist.github.com/sashalevin/a006a44989312f6835e7 > >> > > >> > I'm trying to make context tracking more exact, which will fix this > >> > issue (the particular splat that Sasha hit shouldn't be possible when > >> > I'm done), but I think it would be nice to unify all of this stuff. > >> > Would it be plausible for us to guarantee that RCU state is always in > >> > sync with context tracking state? If so, we could maybe simplify > >> > things and have fewer state variables. > >> > >> A noble goal. Might even be possible, and maybe even advantageous. > >> > >> But it is usually easier to say than to do. RCU really does need to make > >> some adjustments when the state changes, as do the other subsystems. > >> It might or might not be possible to do the transitions atomically. 
> >> And if the transitions are not atomic, there will still be weird code > >> paths where (say) the processor is considered non-idle, but RCU doesn't > >> realize it yet. Such a code path could not safely use rcu_read_lock(), > >> so you still need RCU to be able to scream if someone tries it. > >> Contrariwise, if there is a code path where the processor is considered > >> idle, but RCU thinks it is non-idle, that code path can stall > >> grace periods. (Yes, not a problem if the code path is short enough. > >> At least if the underlying VCPU is making progress...) > >> > >> Still, I cannot prove that it is impossible, and if it is possible, > >> then as you say, there might well be benefits. > >> > >> > Doing this for NMIs might be weird. Would it make sense to have a > >> > CONTEXT_NMI that's somehow valid even if the NMI happened while > >> > changing context tracking state. > >> > >> Face it, NMIs are weird. ;-) > >> > >> > Thoughts? As it stands, I think we might already be broken for real: > >> > > >> > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does > >> > copy_from_user_nmi, which can fault, causing do_page_fault to get > >> > called, which calls exception_enter(), which can't be a good thing. > >> > > >> > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. > >> > >> Actually, I see more cases where people forget irq_enter() than > >> rcu_nmi_enter(). "We will just nip in quickly and do something without > >> actually letting the irq system know. Oh, and we want some event tracing > >> in that code path." Boom! > >> > >> > Thoughts? As it stands, I need to do something because -tip and thus > >> > -next spews occasional warnings. > >> > >> Tell me more? > > > > And for completeness, RCU also has the following requirements on the > > state-transition mechanism: > > > > 1. It must be possible to reliably sample some other CPU's state. 
> > This is an energy-efficiency requirement, as RCU is not normally > > permitted to wake up idle CPUs. Nor nohz CPUs, for that matter. > > NOHZ needs this for vtime accounting, too. I think Rik might be > thinking about this. Maybe the underlying state could be shared? > > > > > 2. RCU must be able to track passage through idle and nohz states. > > In other words, if RCU samples at t=0 and finds that the CPU > > is executing (say) in kernel mode, and RCU samples again at > > t=10 and again finds that the CPU is executing in kernel mode, > > RCU needs to be able to determine whether or not that CPU passed > > through idle or nohz betweentimes. > > And RCU can do this for CONTEXT_KERNEL vs CONTEXT_USER because the > context tracking stuff notifies RCU. The thing I'm less than happy > with is that we can currently be CONTEXT_USER but still rcu-awake. > This is manageable, but it seems messy. When we interrupt userspace, right? I don't see that much as a problem, until we use a unified context tracking for both RCU and context tracking. > > > > > 3. In some configurations, RCU needs to be able to block entry into > > nohz state, both for idle and userspace. > > > > Hmm. I suppose we could be CONTEXT_USER but still have RCU awake, > although the tick would have to stay on. Well 3) is handled by the tick nohz code so it's still external. ^ permalink raw reply [flat|nested] 15+ messages in thread
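Editorial sketch of requirements 1 and 2 quoted above: a per-CPU counter that another CPU can sample without waking the target, and that reveals whether the target passed through an idle or nohz state between two samples. This is a simplified userspace model using C11 atomics. The type and function names are assumptions made for illustration, counter wraparound is ignored, and the memory-ordering details a real implementation needs are left out.

```c
/* Userspace model only; names are illustrative, not the kernel's. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Odd value:  RCU is watching this CPU (kernel mode, non-idle).
 * Even value: the CPU is in an extended quiescent state (idle or nohz user).
 */
struct cpu_dynticks {
	atomic_uint seq;
};

void dynticks_init(struct cpu_dynticks *d)
{
	atomic_init(&d->seq, 1u); /* CPUs start out watched */
}

/* Called by the CPU itself when entering idle or nohz user mode. */
void eqs_enter(struct cpu_dynticks *d)
{
	unsigned int old = atomic_fetch_add(&d->seq, 1u);
	assert(old & 1u); /* must have been watched before entering */
}

/* Called by the CPU itself when leaving idle or nohz user mode. */
void eqs_exit(struct cpu_dynticks *d)
{
	unsigned int old = atomic_fetch_add(&d->seq, 1u);
	assert(!(old & 1u)); /* must have been in a quiescent state */
}

/* Requirement 1: a remote CPU samples the state without waking the target. */
unsigned int dynticks_snapshot(struct cpu_dynticks *d)
{
	return atomic_load(&d->seq);
}

/*
 * Requirement 2: true if the CPU was quiescent at snapshot time, or has
 * entered (and possibly left) a quiescent state since the snapshot.
 */
bool passed_through_eqs(struct cpu_dynticks *d, unsigned int snap)
{
	unsigned int cur = atomic_load(&d->seq);
	return !(snap & 1u) || cur != snap;
}
```

Two samples that both find the CPU "in kernel mode" still differ in value if the CPU went idle in between, which is what lets the sampler tell the two cases apart.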
* Re: Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking 2015-07-17 1:53 Reconciling rcu_irq_enter()/rcu_nmi_enter() with context tracking Andy Lutomirski 2015-07-17 4:29 ` Paul E. McKenney @ 2015-07-18 13:00 ` Frederic Weisbecker 1 sibling, 0 replies; 15+ messages in thread From: Frederic Weisbecker @ 2015-07-18 13:00 UTC (permalink / raw) To: Andy Lutomirski Cc: Sasha Levin, Paul McKenney, linux-kernel@vger.kernel.org, Peter Zijlstra, X86 ML, Rik van Riel On Thu, Jul 16, 2015 at 06:53:15PM -0700, Andy Lutomirski wrote: > For reasons that mystify me a bit, we currently track context tracking > state separately from rcu's watching state. This results in strange > artifacts: nothing generic causes IRQs to enter CONTEXT_KERNEL, and we > can nest exceptions inside the IRQ handler (an example would be > wrmsr_safe failing), and, in -next, we splat a warning: > > https://gist.github.com/sashalevin/a006a44989312f6835e7 I don't know how it happened. But the context tracking code should be able to handle exceptions on irqs. They are supposed to be simply ignored with the in_interrupt() check on context_tracking_enter/exit(). > > I'm trying to make context tracking more exact, which will fix this > issue (the particular splat that Sasha hit shouldn't be possible when > I'm done), but I think it would be nice to unify all of this stuff. > Would it be plausible for us to guarantee that RCU state is always in > sync with context tracking state? If so, we could maybe simplify > things and have fewer state variables. RCU uses the same variables for idle and user tracking whereas context tracking only tracks user. So they are at least decoupled there. And we probably don't want RCU to use a different variable due to the overhead it brings on readers. But it could be a shifted count on the same variable. > > Doing this for NMIs might be weird. 
Would it make sense to have a > CONTEXT_NMI that's somehow valid even if the NMI happened while > changing context tracking state. > > Thoughts? As it stands, I think we might already be broken for real: > > Syscall -> user_exit. Perf NMI hits *during* user_exit. Perf does > copy_from_user_nmi, which can fault, causing do_page_fault to get > called, which calls exception_enter(), which can't be a good thing. I think the in_interrupt() handles that. Besides NMI has its own counter. > > RCU is okay (sort of) because of rcu_nmi_enter, but this seems very fragile. > > Thoughts? As it stands, I need to do something because -tip and thus > -next spews occasional warnings. But yeah if we can, it would be nice to use context tracking as the sole tracker that RCU can safely use. ^ permalink raw reply [flat|nested] 15+ messages in thread
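Editorial sketch of the in_interrupt() check Frederic refers to in the message above. This is a simplified userspace model: the `ct_model` structure, the `model_` function names, and the use of an `irq_depth` field in place of in_interrupt() are assumptions made for illustration. It shows why an exception nested inside an IRQ taken from user mode leaves the tracked state untouched, and therefore also why nothing generic moves an IRQ from user mode into CONTEXT_KERNEL.

```c
/* Userspace model only; not the kernel's implementation. */
#include <assert.h>
#include <stdbool.h>

enum ct_state { CT_KERNEL, CT_USER };

struct ct_model {
	enum ct_state state; /* context tracking's view of this CPU */
	int irq_depth;       /* > 0 stands in for in_interrupt() */
};

/* Stand-in for in_interrupt(); the real macro inspects preempt_count. */
bool model_in_interrupt(const struct ct_model *c)
{
	assert(c->irq_depth >= 0);
	return c->irq_depth > 0;
}

/* user -> kernel; silently ignored when called from interrupt context. */
void model_context_tracking_exit(struct ct_model *c)
{
	if (model_in_interrupt(c))
		return;
	c->state = CT_KERNEL;
}

/* kernel -> user; silently ignored when called from interrupt context. */
void model_context_tracking_enter(struct ct_model *c)
{
	if (model_in_interrupt(c))
		return;
	c->state = CT_USER;
}

/* Exception entry: leave user state if needed and remember what it was. */
enum ct_state model_exception_enter(struct ct_model *c)
{
	enum ct_state prev = c->state;

	if (prev != CT_KERNEL)
		model_context_tracking_exit(c);
	return prev;
}

/* Exception exit: restore the state saved at entry. */
void model_exception_exit(struct ct_model *c, enum ct_state prev)
{
	if (prev != CT_KERNEL)
		model_context_tracking_enter(c);
}
```

In this model, an exception taken directly from user mode flips the state to kernel and back, while the same exception taken inside an IRQ handler changes nothing. The state stays at "user" for the whole IRQ, which is the mismatch with RCU's view that Andy describes.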