* [BUG] 2.6.37-rc3 massive interactivity regression on ARM @ 2010-11-27 15:16 Mikael Pettersson 2010-12-05 12:32 ` Mikael Pettersson 0 siblings, 1 reply; 51+ messages in thread From: Mikael Pettersson @ 2010-11-27 15:16 UTC (permalink / raw) To: linux-arm-kernel The scenario is that I do a remote login to an ARM build server, use screen to start a sub-shell, in that shell start a largish compile job, detach from that screen, and from the original login shell I occasionally monitor the compile job with top or ps or by attaching to the screen. With kernels 2.6.37-rc2 and -rc3 this causes the machine to become very sluggish: top takes forever to start, once started it shows no activity from the compile job (it's as if it's sleeping on a lock), and ps also takes forever and shows no activity from the compile job. Rebooting into 2.6.36 eliminates these issues. I do pretty much the same thing (remote login -> screen -> compile job) on other archs, but so far I've only seen the 2.6.37-rc misbehaviour on ARM EABI, specifically on an IOP n2100. (I have access to other ARM sub-archs, but haven't had time to test 2.6.37-rc on them yet.) Has anyone else seen this? Any ideas about the cause? /Mikael ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-11-27 15:16 [BUG] 2.6.37-rc3 massive interactivity regression on ARM Mikael Pettersson @ 2010-12-05 12:32 ` Mikael Pettersson 2010-12-05 13:17 ` Russell King - ARM Linux 0 siblings, 1 reply; 51+ messages in thread From: Mikael Pettersson @ 2010-12-05 12:32 UTC (permalink / raw) To: linux-arm-kernel Mikael Pettersson writes: > The scenario is that I do a remote login to an ARM build server, > use screen to start a sub-shell, in that shell start a largish > compile job, detach from that screen, and from the original login > shell I occasionally monitor the compile job with top or ps or > by attaching to the screen. > > With kernels 2.6.37-rc2 and -rc3 this causes the machine to become > very sluggish: top takes forever to start, once started it shows no > activity from the compile job (it's as if it's sleeping on a lock), > and ps also takes forever and shows no activity from the compile job. > > Rebooting into 2.6.36 eliminates these issues. > > I do pretty much the same thing (remote login -> screen -> compile job) > on other archs, but so far I've only seen the 2.6.37-rc misbehaviour > on ARM EABI, specifically on an IOP n2100. (I have access to other ARM > sub-archs, but haven't had time to test 2.6.37-rc on them yet.) > > Has anyone else seen this? Any ideas about the cause? (Re-followup since I just realised my previous followups were to Rafael's regressions mailbot rather than the original thread.) > The bug is still present in 2.6.37-rc4. I'm currently trying to bisect it. git bisect identified [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some hackery due to subsequent changes in the same area) restores sane behaviour. The original patch submission talks about irq-heavy scenarios. My case is the exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU bound in userspace but expected to schedule quickly when needed (e.g. running top or ps or just hitting CR in one shell while another runs a compile job). I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave. So it looks like an ARM-only issue, possibly depending on platform specifics. One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is much higher on Kirkwood, even when the machine is idle. /Mikael ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 12:32 ` Mikael Pettersson @ 2010-12-05 13:17 ` Russell King - ARM Linux 2010-12-05 14:19 ` Russell King - ARM Linux 0 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-05 13:17 UTC (permalink / raw) To: linux-arm-kernel On Sun, Dec 05, 2010 at 01:32:37PM +0100, Mikael Pettersson wrote: > Mikael Pettersson writes: > > The scenario is that I do a remote login to an ARM build server, > > use screen to start a sub-shell, in that shell start a largish > > compile job, detach from that screen, and from the original login > > shell I occasionally monitor the compile job with top or ps or > > by attaching to the screen. > > > > With kernels 2.6.37-rc2 and -rc3 this causes the machine to become > > very sluggish: top takes forever to start, once started it shows no > > activity from the compile job (it's as if it's sleeping on a lock), > > and ps also takes forever and shows no activity from the compile job. > > > > Rebooting into 2.6.36 eliminates these issues. > > > > I do pretty much the same thing (remote login -> screen -> compile job) > > on other archs, but so far I've only seen the 2.6.37-rc misbehaviour > > on ARM EABI, specifically on an IOP n2100. (I have access to other ARM > > sub-archs, but haven't had time to test 2.6.37-rc on them yet.) > > > > Has anyone else seen this? Any ideas about the cause? > > (Re-followup since I just realised my previous followups were to Rafael's > regressions mailbot rather than the original thread.) > > > The bug is still present in 2.6.37-rc4. I'm currently trying to bisect it. > > git bisect identified > > [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task > > as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some > hackery due to subsequent changes in the same area) restores sane behaviour. > > The original patch submission talks about irq-heavy scenarios. My case is the > exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU > bound in userspace but expected to schedule quickly when needed (e.g. running > top or ps or just hitting CR in one shell while another runs a compile job). > > I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and > ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs > (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave. > > So it looks like an ARM-only issue, possibly depending on platform specifics. > > One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x > machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is > much higher on Kirkwood, even when the machine is idle. The above patch you point out is fundamentally broken. + rq->clock = sched_clock_cpu(cpu); + irq_time = irq_time_cpu(cpu); + if (rq->clock - irq_time > rq->clock_task) + rq->clock_task = rq->clock - irq_time; This means that we will only update rq->clock_task if it is smaller than rq->clock. So, eventually over time, rq->clock_task becomes the maximum value that rq->clock can ever be. Or in other words, the maximum value of sched_clock_cpu(). Once that has been reached, although rq->clock will wrap back to zero, rq->clock_task will not, and so (I think) task execution time accounting effectively stops dead. I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock, and so need to wait a long time for this to be noticed. 
However, on ARM where we tend to have 32-bit counters feeding sched_clock(), this value will wrap far sooner. ^ permalink raw reply [flat|nested] 51+ messages in thread
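To make that failure mode concrete, here is a minimal user-space model of the conditional update (illustrative only; WRAP_NS is an assumed stand-in for the point at which a narrow sched_clock() wraps, and irq_time stays zero just as irq_time_cpu() does without IRQ_TIME_ACCOUNTING):

#include <stdio.h>
#include <stdint.h>

#define WRAP_NS (1ULL << 32)	/* pretend the clock wraps here, far below 2^64 */

int main(void)
{
	uint64_t clock = 0, clock_task = 0, irq_time = 0;
	uint64_t now;

	for (now = 0; now < 3 * WRAP_NS; now += WRAP_NS / 4) {
		clock = now % WRAP_NS;			/* the narrow clock wraps back to 0 */
		if (clock - irq_time > clock_task)	/* the guard quoted above */
			clock_task = clock - irq_time;
		printf("clock=%llu clock_task=%llu\n",
		       (unsigned long long)clock,
		       (unsigned long long)clock_task);
	}
	/* once the first wrap happens, clock_task stays pinned at its maximum */
	return 0;
}

After the first wrap the printed clock_task never advances again, which is exactly the "stops dead" behaviour described above.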
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 13:17 ` Russell King - ARM Linux @ 2010-12-05 14:19 ` Russell King - ARM Linux 2010-12-05 16:07 ` Mikael Pettersson 2010-12-06 21:29 ` Venkatesh Pallipadi 0 siblings, 2 replies; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-05 14:19 UTC (permalink / raw) To: linux-arm-kernel On Sun, Dec 05, 2010 at 01:17:02PM +0000, Russell King - ARM Linux wrote: > On Sun, Dec 05, 2010 at 01:32:37PM +0100, Mikael Pettersson wrote: > > Mikael Pettersson writes: > > > The scenario is that I do a remote login to an ARM build server, > > > use screen to start a sub-shell, in that shell start a largish > > > compile job, detach from that screen, and from the original login > > > shell I occasionally monitor the compile job with top or ps or > > > by attaching to the screen. > > > > > > With kernels 2.6.37-rc2 and -rc3 this causes the machine to become > > > very sluggish: top takes forever to start, once started it shows no > > > activity from the compile job (it's as if it's sleeping on a lock), > > > and ps also takes forever and shows no activity from the compile job. > > > > > > Rebooting into 2.6.36 eliminates these issues. > > > > > > I do pretty much the same thing (remote login -> screen -> compile job) > > > on other archs, but so far I've only seen the 2.6.37-rc misbehaviour > > > on ARM EABI, specifically on an IOP n2100. (I have access to other ARM > > > sub-archs, but haven't had time to test 2.6.37-rc on them yet.) > > > > > > Has anyone else seen this? Any ideas about the cause? > > > > (Re-followup since I just realised my previous followups were to Rafael's > > regressions mailbot rather than the original thread.) > > > > > The bug is still present in 2.6.37-rc4. I'm currently trying to bisect it. > > > > git bisect identified > > > > [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task > > > > as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some > > hackery due to subsequent changes in the same area) restores sane behaviour. > > > > The original patch submission talks about irq-heavy scenarios. My case is the > > exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU > > bound in userspace but expected to schedule quickly when needed (e.g. running > > top or ps or just hitting CR in one shell while another runs a compile job). > > > > I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and > > ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs > > (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave. > > > > So it looks like an ARM-only issue, possibly depending on platform specifics. > > > > One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x > > machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is > > much higher on Kirkwood, even when the machine is idle. > > The above patch you point out is fundamentally broken. > > + rq->clock = sched_clock_cpu(cpu); > + irq_time = irq_time_cpu(cpu); > + if (rq->clock - irq_time > rq->clock_task) > + rq->clock_task = rq->clock - irq_time; > > This means that we will only update rq->clock_task if it is smaller than > rq->clock. So, eventually over time, rq->clock_task becomes the maximum > value that rq->clock can ever be. Or in other words, the maximum value > of sched_clock_cpu(). 
> > Once that has been reached, although rq->clock will wrap back to zero, > rq->clock_task will not, and so (I think) task execution time accounting > effectively stops dead. > > I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock, > and so need to wait a long time for this to be noticed. However, on ARM > where we tend to have 32-bit counters feeding sched_clock(), this value > will wrap far sooner. I'm not so sure about this - certainly that if() statement looks very suspicious above. As irq_time_cpu() will always be zero, can you try removing the conditional? In any case, sched_clock_cpu() should be resilient against sched_clock() wrapping. However, your comments about it being iop32x and ixp4xx (both of which are 32-bit-counter-to-ns based implementations) and kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation does make me wonder... ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 14:19 ` Russell King - ARM Linux @ 2010-12-05 16:07 ` Mikael Pettersson 2010-12-05 16:21 ` Russell King - ARM Linux 2010-12-06 21:29 ` Venkatesh Pallipadi 1 sibling, 1 reply; 51+ messages in thread From: Mikael Pettersson @ 2010-12-05 16:07 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux writes: > On Sun, Dec 05, 2010 at 01:17:02PM +0000, Russell King - ARM Linux wrote: > > On Sun, Dec 05, 2010 at 01:32:37PM +0100, Mikael Pettersson wrote: > > > Mikael Pettersson writes: > > > > The scenario is that I do a remote login to an ARM build server, > > > > use screen to start a sub-shell, in that shell start a largish > > > > compile job, detach from that screen, and from the original login > > > > shell I occasionally monitor the compile job with top or ps or > > > > by attaching to the screen. > > > > > > > > With kernels 2.6.37-rc2 and -rc3 this causes the machine to become > > > > very sluggish: top takes forever to start, once started it shows no > > > > activity from the compile job (it's as if it's sleeping on a lock), > > > > and ps also takes forever and shows no activity from the compile job. > > > > > > > > Rebooting into 2.6.36 eliminates these issues. > > > > > > > > I do pretty much the same thing (remote login -> screen -> compile job) > > > > on other archs, but so far I've only seen the 2.6.37-rc misbehaviour > > > > on ARM EABI, specifically on an IOP n2100. (I have access to other ARM > > > > sub-archs, but haven't had time to test 2.6.37-rc on them yet.) > > > > > > > > Has anyone else seen this? Any ideas about the cause? > > > > > > (Re-followup since I just realised my previous followups were to Rafael's > > > regressions mailbot rather than the original thread.) > > > > > > > The bug is still present in 2.6.37-rc4. I'm currently trying to bisect it. > > > > > > git bisect identified > > > > > > [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task > > > > > > as the cause of this regression. Reverting it from 2.6.37-rc4 (requires some > > > hackery due to subsequent changes in the same area) restores sane behaviour. > > > > > > The original patch submission talks about irq-heavy scenarios. My case is the > > > exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU > > > bound in userspace but expected to schedule quickly when needed (e.g. running > > > top or ps or just hitting CR in one shell while another runs a compile job). > > > > > > I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and > > > ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs > > > (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave. > > > > > > So it looks like an ARM-only issue, possibly depending on platform specifics. > > > > > > One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x > > > machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is > > > much higher on Kirkwood, even when the machine is idle. > > > > The above patch you point out is fundamentally broken. > > > > + rq->clock = sched_clock_cpu(cpu); > > + irq_time = irq_time_cpu(cpu); > > + if (rq->clock - irq_time > rq->clock_task) > > + rq->clock_task = rq->clock - irq_time; > > > > This means that we will only update rq->clock_task if it is smaller than > > rq->clock. So, eventually over time, rq->clock_task becomes the maximum > > value that rq->clock can ever be. 
Or in other words, the maximum value > > of sched_clock_cpu(). > > > > Once that has been reached, although rq->clock will wrap back to zero, > > rq->clock_task will not, and so (I think) task execution time accounting > > effectively stops dead. > > > > I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock, > > and so need to wait a long time for this to be noticed. However, on ARM > > where we tend to have 32-bit counters feeding sched_clock(), this value > > will wrap far sooner. > > I'm not so sure about this - certainly that if() statement looks very > suspicious above. As irq_time_cpu() will always be zero, can you try > removing the conditional? > > In any case, sched_clock_cpu() should be resilient against sched_clock() > wrapping. However, your comments about it being iop32x and ixp4xx > (both of which are 32-bit-counter-to-ns based implementations) and > kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation > does make me wonder... I ran two tests on my iop32x machine: 1. Made the above-mentioned assignment to rq->clock_task unconditional. That cured the interactivity regressions. 2. Restored the conditional assignment to rq->clock_task and disabled the platform-specific sched_clock() so the kernel used the generic one. That too cured the interactivity regressions. I then repeated these tests on my ixp4xx machine, with the same results. I'll try to come up with a fix for the ixp4xx and plat-iop 32-bit sched_clock()s. ^ permalink raw reply [flat|nested] 51+ messages in thread
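For reference, the style of sched_clock() at issue here boils down to scaling a free-running 32-bit counter straight to nanoseconds, roughly like the following (a simplified sketch only; timer_read(), mult and shift stand in for the platform-specific counter register and scaling constants, they are not the actual iop32x/ixp4xx symbols):

unsigned long long sched_clock(void)
{
	unsigned long long cyc = timer_read();	/* free-running 32-bit counter */

	/* repeats whenever the 32-bit counter wraps; the mult/shift
	   scaling can shrink the usable range even further */
	return (cyc * mult) >> shift;
}

With nothing above this to extend the counter, the returned value repeats after a handful of seconds, which is the window in which rq->clock_task gets stuck.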
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 16:07 ` Mikael Pettersson @ 2010-12-05 16:21 ` Russell King - ARM Linux 2010-12-08 12:40 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-05 16:21 UTC (permalink / raw) To: linux-arm-kernel On Sun, Dec 05, 2010 at 05:07:36PM +0100, Mikael Pettersson wrote: > I ran two tests on my iop32x machine: > > 1. Made the above-mentioned assignment to rq->clock_task unconditional. > That cured the interactivity regressions. > > 2. Restored the conditional assignment to rq->clock_task and disabled the > platform-specific sched_clock() so the kernel used the generic one. > That too cured the interactivity regressions. > > I then repeated these tests on my ixp4xx machine, with the same results. > > I'll try to come up with a fix for the ixp4xx and plat-iop 32-bit > sched_clock()s. I'm not sure that's the correct fix - it looks like sched_clock_cpu() should already be preventing scheduler clock time going backwards. Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means it wraps once every 21s. However, we have that converted to ns by an unknown multiplier and shift. It seems that those are chosen to guarantee that we will cover only 4s without wrapping in the clocksource conversion. Maybe that's not sufficient? Could you try looking into sched_clock_cpu(), sched_clock_local() and sched_clock() to see whether anything odd stands out? ^ permalink raw reply [flat|nested] 51+ messages in thread
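Working those numbers through (taking the 200MHz figure at face value): 2^32 / 200,000,000 Hz ~= 21.5s between raw counter wraps, i.e. 5ns per counter tick; how much of that 21.5s survives the conversion to nanoseconds then depends entirely on the chosen multiplier and shift, which is presumably where the ~4s figure comes from.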
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 16:21 ` Russell King - ARM Linux @ 2010-12-08 12:40 ` Peter Zijlstra 2010-12-08 12:55 ` Russell King - ARM Linux 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-08 12:40 UTC (permalink / raw) To: linux-arm-kernel On Sun, 2010-12-05 at 16:21 +0000, Russell King - ARM Linux wrote: > I'm not sure that's the correct fix - it looks like sched_clock_cpu() > should already be preventing scheduler clock time going backwards. > > Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means > it wraps once every 21s. However, we have that converted to ns by an > unknown multiplier and shift. It seems that those are chosen to > guarantee that we will cover only 4s without wrapping in the clocksource > conversion. Maybe that's not sufficient? > > Could you try looking into sched_clock_cpu(), sched_clock_local() and > sched_clock() to see whether anything odd stands out? # git grep HAVE_UNSTABLE_SCHED_CLOCK arch/arm | wc -l 0 That code won't help if you don't enable it ;-) John Stultz was also looking into making the kernel/sched_clock.c code deal with short clocks. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 12:40 ` Peter Zijlstra @ 2010-12-08 12:55 ` Russell King - ARM Linux 2010-12-08 14:04 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-08 12:55 UTC (permalink / raw) To: linux-arm-kernel On Wed, Dec 08, 2010 at 01:40:15PM +0100, Peter Zijlstra wrote: > On Sun, 2010-12-05 at 16:21 +0000, Russell King - ARM Linux wrote: > > > I'm not sure that's the correct fix - it looks like sched_clock_cpu() > > should already be preventing scheduler clock time going backwards. > > > > Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means > > it wraps once every 21s. However, we have that converted to ns by an > > unknown multiplier and shift. It seems that those are chosen to > > guarantee that we will cover only 4s without wrapping in the clocksource > > conversion. Maybe that's not sufficient? > > > > Could you try looking into sched_clock_cpu(), sched_clock_local() and > > sched_clock() to see whether anything odd stands out? > > # git grep HAVE_UNSTABLE_SCHED_CLOCK arch/arm | wc -l > 0 Hmm, you're right. In which case it's purely down to sched_clock() only being able to cover 4s - which seems to be far too small a gap. I'm not sure that the unstable sched clock stuff makes much sense to be enabled - we don't have an unstable clock, we just don't have the required number of bits for the scheduler to work correctly. Nevertheless, this provides a good way to find this kind of wrap bug. Even with cnt_32_to_63, we still don't get a 64-bit sched_clock(), so this bug will still be there. Even with a 64-bit clock, the bug will still be there. It's basically crap code. Maybe it's better that on ARM, we just don't implement sched_clock() at all? ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 12:55 ` Russell King - ARM Linux @ 2010-12-08 14:04 ` Peter Zijlstra 2010-12-08 14:28 ` Russell King - ARM Linux 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-08 14:04 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2010-12-08 at 12:55 +0000, Russell King - ARM Linux wrote: > On Wed, Dec 08, 2010 at 01:40:15PM +0100, Peter Zijlstra wrote: > > On Sun, 2010-12-05 at 16:21 +0000, Russell King - ARM Linux wrote: > > > > > I'm not sure that's the correct fix - it looks like sched_clock_cpu() > > > should already be preventing scheduler clock time going backwards. > > > > > > Hmm. IOP32x seems to have a 32-bit timer clocked at 200MHz. That means > > > it wraps once every 21s. However, we have that converted to ns by an > > > unknown multiplier and shift. It seems that those are chosen to > > > guarantee that we will cover only 4s without wrapping in the clocksource > > > conversion. Maybe that's not sufficient? > > > > > > Could you try looking into sched_clock_cpu(), sched_clock_local() and > > > sched_clock() to see whether anything odd stands out? > > > > # git grep HAVE_UNSTABLE_SCHED_CLOCK arch/arm | wc -l > > 0 > > Hmm, you're right. In which case it's purely down to sched_clock() > only being able to cover 4s - which seems to be far too small a gap. > > I'm not sure that the unstable sched clock stuff makes much sense to > be enabled - we don't have an unstable clock, we just don't have the > required number of bits for the scheduler to work correctly. We can perhaps make part of the HAVE_UNSTABLE_SCHED_CLOCK depend on SMP and only deal with the short wraps (and maybe monotonicity) on UP. > Nevertheless, this provides a good way to find this kind of wrap bug. > Even with cnt_32_to_63, we still don't get a 64-bit sched_clock(), so > this bug will still be there. Even with a 64-bit clock, the bug will > still be there. It's basically crap code. You're referring to the clock_task bit from Venki? Yes that needs fixing. > Maybe it's better that on ARM, we just don't implement sched_clock() > at all? If you have a high res clock source that's cheap to read it would be better if we can simply fix the infrastructure such that we can make use of it. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:04 ` Peter Zijlstra @ 2010-12-08 14:28 ` Russell King - ARM Linux 2010-12-08 14:44 ` Peter Zijlstra 2010-12-08 23:31 ` Venkatesh Pallipadi 0 siblings, 2 replies; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-08 14:28 UTC (permalink / raw) To: linux-arm-kernel On Wed, Dec 08, 2010 at 03:04:36PM +0100, Peter Zijlstra wrote: > On Wed, 2010-12-08 at 12:55 +0000, Russell King - ARM Linux wrote: > > Hmm, you're right. In which case it's purely down to sched_clock() > > only being able to cover 4s - which seems to be far too small a gap. > > > > I'm not sure that the unstable sched clock stuff makes much sense to > > be enabled - we don't have an unstable clock, we just don't have the > > required number of bits for the scheduler to work correctly. > > We can perhaps make part of the HAVE_UNSTABLE_SCHED_CLOCK depend on SMP > and only deal with the short wraps (and maybe monotonicity) on UP. > > > Nevertheless, this provides a good way to find this kind of wrap bug. > > Even with cnt_32_to_63, we still don't get a 64-bit sched_clock(), so > > this bug will still be there. Even with a 64-bit clock, the bug will > > still be there. It's basically crap code. > > You're referring to the clock_task bit from Venki? Yes that needs > fixing. Indeed. > > Maybe it's better that on ARM, we just don't implement sched_clock() > > at all? > > If you have a high res clock source that's cheap to read it would be > better if we can simply fix the infrastructure such that we can make use > of it. This is the point. We don't have high res clock sources that run for millenium before wrapping. What we have depends on the platform, and as this example has found, some platforms have high-res clock sources but they wrap in as little as 4 seconds. Other platforms will have slower clock sources which'll run for longer before wrapping. Essentially, clock sources on ARM so far have been a 32-bit counter (if you're lucky) clocked at some rate. So, what I'm saying is that if wrapping in 4 seconds is a problem, then maybe we shouldn't be providing sched_clock() at all. Also, if wrapping below 64-bits is also a problem, again, maybe we shouldn't be providing it there either. Eg: #define TCR2NS_SCALE_FACTOR 10 unsigned long long sched_clock(void) { unsigned long long v = cnt32_to_63(timer_read()); return (v * tcr2ns_scale) >> TCR2NS_SCALE_FACTOR; } has a maximum of 54 bits - and this seems to be the most we can sanely get from a 32-bit counter. ^ permalink raw reply [flat|nested] 51+ messages in thread
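To spell out where the 54 bits come from: the multiply is done in 64-bit arithmetic, so the product v * tcr2ns_scale can never exceed 2^64, and after the >> TCR2NS_SCALE_FACTOR shift at most 64 - 10 = 54 significant bits of nanoseconds remain:

	max sched_clock() value < 2^64 >> 10 = 2^54 ns ~= 1.8 * 10^16 ns ~= 208 days

so the limit comes from the scaling arithmetic rather than from how many bits cnt32_to_63() provides.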
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:28 ` Russell King - ARM Linux @ 2010-12-08 14:44 ` Peter Zijlstra 2010-12-08 15:05 ` Russell King - ARM Linux ` (2 more replies) 2010-12-08 23:31 ` Venkatesh Pallipadi 1 sibling, 3 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-08 14:44 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2010-12-08 at 14:28 +0000, Russell King - ARM Linux wrote: > So, what I'm saying is that if wrapping in 4 seconds is a problem, > then maybe we shouldn't be providing sched_clock() at all. 4 seconds should be perfectly fine, we call it at least every scheduler tick (HZ) and NO_HZ will either have to limit the max sleep period or provide its own sleep duration (using a more expensive clock) so we can recover (much like GTOD already requires). > Also, if wrapping below 64-bits is also a problem, again, maybe we > shouldn't be providing it there either. Eg: That's mostly due to hysterical raisins and we should just fix that, the simplest way is to use something like kernel/sched_clock.c to use sched_clock() deltas to make a running u64 value. Like said, John Stultz was already looking at doing something like that because there's a number of architectures suffering this same problem and they're all already using part of the clocksource infrastructure to implement the sched_clock() interface simply because they only have a single hardware resource. One of the problems is I think the cycles2ns multiplication of the raw clock, that makes dealing with wrap-around lots harder, so I guess we should deal with the wrap on the raw clock values and then apply cycles2ns on the delta or somesuch. But I expect the clocksource infrastructure already has something like that, John? ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:44 ` Peter Zijlstra @ 2010-12-08 15:05 ` Russell King - ARM Linux 2010-12-08 15:43 ` Linus Walleij 2010-12-08 20:42 ` john stultz 2 siblings, 0 replies; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-08 15:05 UTC (permalink / raw) To: linux-arm-kernel On Wed, Dec 08, 2010 at 03:44:19PM +0100, Peter Zijlstra wrote: > One of the problems is I think the cycles2ns multiplication of the raw > clock, that makes dealing with wrap-around lots harder, so I guess we > should deal with the wrap on the raw clock values and then apply > cycles2ns on the delta or somesuch. But I expect the clocksource > infrastructure already has something like that, John? I've thought about that, but it becomes slightly problematical, as was shown in one of the examples I provided. If you do scale by doing a 64-bit multiply and shift, you're always going to end up with less than a 64-bit result. I think your idea makes sense though, but I think for it to be able to cover the full 64-bit range, we need to do the wraparound handling after scaling. So maybe something like the following: static unsigned long long last_ns, cur_ns; static unsigned long long max = (max_read_clock() * mult) >> shift; unsigned long long sched_clock(void) { unsigned long long cyc = read_clock(); unsigned long long ns = (cyc * mult) >> shift; unsigned long long delta; spin_lock(&sched_clock_lock); delta = last_ns - ns; if (ns < last_ns) delta += max; last_ns = ns; ns = cur_ns += delta; spin_unlock(&sched_clock_lock); return ns; } ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:44 ` Peter Zijlstra 2010-12-08 15:05 ` Russell King - ARM Linux @ 2010-12-08 15:43 ` Linus Walleij 2010-12-08 20:42 ` john stultz 2 siblings, 0 replies; 51+ messages in thread From: Linus Walleij @ 2010-12-08 15:43 UTC (permalink / raw) To: linux-arm-kernel 2010/12/8 Peter Zijlstra <a.p.zijlstra@chello.nl>: > Like said, John Stultz was already looking at doing something like that > because there's a number of architectures suffering this same problem > and they're all already using part of the clocksource infrastructure to > implement the sched_clock() interface simply because they only have a > single hardware resource. I was in on that discussion and for the Ux500 this patch was the outcome: http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6488/1 It seems to work, mostly, it's based off Nico's orion code and just tries to make everything as explicit as possible. Uwe has also been onto it I think. Yours, Linus Walleij ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:44 ` Peter Zijlstra 2010-12-08 15:05 ` Russell King - ARM Linux 2010-12-08 15:43 ` Linus Walleij @ 2010-12-08 20:42 ` john stultz 2 siblings, 0 replies; 51+ messages in thread From: john stultz @ 2010-12-08 20:42 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2010-12-08 at 15:44 +0100, Peter Zijlstra wrote: > On Wed, 2010-12-08 at 14:28 +0000, Russell King - ARM Linux wrote: > > So, what I'm saying is that if wrapping in 4 seconds is a problem, > > then maybe we shouldn't be providing sched_clock() at all. > > 4 seconds should be perfectly fine, we call it at least every scheduler > tick (HZ) and NO_HZ will either have to limit the max sleep period or > provide its own sleep duration (using a more expensive clock) so we can > recover (much like GTOD already requires). > > > Also, if wrapping below 64-bits is also a problem, again, maybe we > > shouldn't be providing it there either. Eg: > > That's mostly due to hysterical raisins and we should just fix that, the > simplest way is to use something like kernel/sched_clock.c to use > sched_clock() deltas to make a running u64 value. > > Like said, John Stultz was already looking at doing something like that > because there's a number of architectures suffering this same problem > and they're all already using part of the clocksource infrastructure to > implement the sched_clock() interface simply because they only have a > single hardware resource. I'm not actively working on it right now, but trying to rework the sched_clock code so its more like the generic timekeeping code is on my list (I'm Looking to see if I can bump it up to the front in the near future). thanks -john ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 14:28 ` Russell King - ARM Linux 2010-12-08 14:44 ` Peter Zijlstra @ 2010-12-08 23:31 ` Venkatesh Pallipadi 2010-12-09 12:52 ` Peter Zijlstra 1 sibling, 1 reply; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-08 23:31 UTC (permalink / raw) To: linux-arm-kernel As I mentioned earlier in this thread, there are other places in the scheduler, where we do curr_sched_clock - prev_sched_clock kind of math in u64 and this 32 bit wraparound will cause problems there as well. Example- accounting a big positive number as exec_time for some task. So, we need a generic fix for sched_clock. This particular code happened to cause an easy to notice failure. Anyways, here's the patch that should resolve the wraparound problem with this particular piece of code. I haven't been able to test this with latest git as my test systems fail to boot with divide error in select_task_rq_fair+0x5dd/0x70a with vanilla git. Thats an unrelated bug that has been reported earlier. I have only tested with older known to work kernel with this sched irq changes. Peter: Also, this change will cause some conflict with pending IRQ time reporting changeset. Let me know how you want to deal with this patch and that pending patchset. Thanks, Venki [PATCH] update_rq_clock() with irq_time should handle 64 bit wraparound update_rq_clock with IRQ_TIME_ACCOUNTING assumed 64 bit sched_clock that practically will not wraparound. Change to code to handle wraparounds, still preventing rq->clock_task from going backwards. Signed-off-by: Venkatesh Pallipadi <venki@google.com> --- kernel/sched.c | 49 ++++++++++++++++++++++++++++++------------------- 1 files changed, 30 insertions(+), 19 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index dc91a4d..29ce549 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -636,21 +636,13 @@ static inline struct task_group *task_group(struct task_struct *p) #endif /* CONFIG_CGROUP_SCHED */ -static u64 irq_time_cpu(int cpu); -static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time); +static void update_rq_clock_irqtime(struct rq *rq, int cpu); inline void update_rq_clock(struct rq *rq) { if (!rq->skip_clock_update) { int cpu = cpu_of(rq); - u64 irq_time; - - rq->clock = sched_clock_cpu(cpu); - irq_time = irq_time_cpu(cpu); - if (rq->clock - irq_time > rq->clock_task) - rq->clock_task = rq->clock - irq_time; - - sched_irq_time_avg_update(rq, irq_time); + update_rq_clock_irqtime(rq, cpu); } } @@ -1983,24 +1975,43 @@ void account_system_vtime(struct task_struct *curr) } EXPORT_SYMBOL_GPL(account_system_vtime); -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) +/* + * Update rq->clock_task making sure that it does not go backwards due to + * irq_time subtraction, allowing for possible wraparound. 
+ */ +static void update_rq_clock_irqtime(struct rq *rq, int cpu) { - if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) { - u64 delta_irq = curr_irq_time - rq->prev_irq_time; - rq->prev_irq_time = curr_irq_time; - sched_rt_avg_update(rq, delta_irq); + u64 curr_rq_clock = sched_clock_cpu(cpu); + u64 curr_irq_time = irq_time_cpu(cpu); + + if (!sched_clock_irqtime) { + rq->clock = sched_clock_cpu(cpu); + rq->clock_task = rq->clock; + return; } + + if (unlikely(curr_irq_time - rq->prev_irq_time > + curr_rq_clock - rq->clock)) { + curr_irq_time = rq->prev_irq_time + curr_rq_clock - rq->clock; + } + + rq->clock = curr_rq_clock; + rq->clock_task = rq->clock - curr_irq_time; + + if (sched_feat(NONIRQ_POWER)) + sched_rt_avg_update(rq, curr_irq_time - rq->prev_irq_time); + + rq->prev_irq_time = curr_irq_time; } #else -static u64 irq_time_cpu(int cpu) +static void update_rq_clock_irqtime(struct rq *rq, int cpu) { - return 0; + rq->clock = sched_clock_cpu(cpu); + rq->clock_task = rq->clock; } -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { } - #endif #include "sched_idletask.c" -- 1.7.3.1 ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-08 23:31 ` Venkatesh Pallipadi @ 2010-12-09 12:52 ` Peter Zijlstra 2010-12-09 17:43 ` Venkatesh Pallipadi 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-09 12:52 UTC (permalink / raw) To: linux-arm-kernel On Wed, 2010-12-08 at 15:31 -0800, Venkatesh Pallipadi wrote: > [PATCH] update_rq_clock() with irq_time should handle 64 bit wraparound > > update_rq_clock with IRQ_TIME_ACCOUNTING assumed 64 bit sched_clock that > practically will not wraparound. Change to code to handle wraparounds, > still preventing rq->clock_task from going backwards. All we should do is deal with the 64wrap, not filter backward motion, I came up with something like the below. I think for now the best solution to the early wrap problem for ARM is for them to select HAVE_UNSTABLE_SCHED_CLOCK, it will mostly deal with the short wrap by filtering out the occasional negative value. Then later we can look at cleaning/breaking-up the kernel/sched_clock.c code to provide smaller bits of functionality and possibly merging it with some of the clocksource code. --- kernel/sched.c | 54 +++++++++++++++++++++++++----------------------------- 1 files changed, 25 insertions(+), 29 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 8e885c1..0fb7de8 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -642,23 +642,19 @@ static inline struct task_group *task_group(struct task_struct *p) #endif /* CONFIG_CGROUP_SCHED */ -static u64 irq_time_cpu(int cpu); -static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time); +static void update_rq_clock_task(struct rq *rq, s64 delta); inline void update_rq_clock(struct rq *rq) { - int cpu = cpu_of(rq); - u64 irq_time; + s64 delta; if (rq->skip_clock_update) return; - rq->clock = sched_clock_cpu(cpu); - irq_time = irq_time_cpu(cpu); - if (rq->clock - irq_time > rq->clock_task) - rq->clock_task = rq->clock - irq_time; + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; + rq->clock += delta; + update_rq_clock_task(rq, delta); - sched_irq_time_avg_update(rq, irq_time); } /* @@ -1817,14 +1813,6 @@ void disable_sched_clock_irqtime(void) sched_clock_irqtime = 0; } -static u64 irq_time_cpu(int cpu) -{ - if (!sched_clock_irqtime) - return 0; - - return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); -} - void account_system_vtime(struct task_struct *curr) { unsigned long flags; @@ -1855,25 +1843,33 @@ void account_system_vtime(struct task_struct *curr) } EXPORT_SYMBOL_GPL(account_system_vtime); -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) +static inline u64 irq_time_cpu(int cpu) { - if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) { - u64 delta_irq = curr_irq_time - rq->prev_irq_time; - rq->prev_irq_time = curr_irq_time; - sched_rt_avg_update(rq, delta_irq); - } + return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } -#else - -static u64 irq_time_cpu(int cpu) +static void update_rq_clock_task(struct rq *rq, s64 delta) { - return 0; + s64 irq_delta; + + irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time; + rq->prev_irq_time += irq_delta; + + delta -= irq_delta; + rq->clock_task += delta; + + if (irq_delta && sched_feat(NONIRQ_POWER)) + sched_rt_avg_update(rq, irq_delta); } -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { } +#else /* CONFIG_IRQ_TIME_ACCOUNTING */ -#endif +static inline void update_rq_clock_task(struct rq *rq, s64 delta) +{ + rq->clock_task += delta; +} + +#endif /* 
CONFIG_IRQ_TIME_ACCOUNTING */ #include "sched_idletask.c" #include "sched_fair.c" ^ permalink raw reply related [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 12:52 ` Peter Zijlstra @ 2010-12-09 17:43 ` Venkatesh Pallipadi 2010-12-09 17:55 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-09 17:43 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 9, 2010 at 4:52 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Wed, 2010-12-08 at 15:31 -0800, Venkatesh Pallipadi wrote: > >> [PATCH] update_rq_clock() with irq_time should handle 64 bit wraparound >> >> update_rq_clock with IRQ_TIME_ACCOUNTING assumed 64 bit sched_clock that >> practically will not wraparound. Change to code to handle wraparounds, >> still preventing rq->clock_task from going backwards. > > All we should do is deal with the 64wrap, not filter backward motion, I > came up with something like the below. The original intent of this check - if (rq->clock - irq_time > rq->clock_task) was to filter the potential backward motion. As otherwise, downstream users of clock_task can see huge positive numbers from curr - prev calculations. The same problem will be there with below code, with irq_delta > delta, clock_task can go backwards which is not good. + delta -= irq_delta; + rq->clock_task += delta; The reason for this is rq->clock and irqtime updates kind of happen independently and specifically, if a rq->clock update happens while we are in a softirq, we may have this case of going backwards on the next update. > > I think for now the best solution to the early wrap problem for ARM is > for them to select HAVE_UNSTABLE_SCHED_CLOCK, it will mostly deal with > the short wrap by filtering out the occasional negative value. > > Then later we can look at cleaning/breaking-up the kernel/sched_clock.c > code to provide smaller bits of functionality and possibly merging it > with some of the clocksource code. > Yes. That should resolve the early wrap. Thanks, Venki
> --- > kernel/sched.c | 54 +++++++++++++++++++++++++----------------------------- > 1 files changed, 25 insertions(+), 29 deletions(-) > > diff --git a/kernel/sched.c b/kernel/sched.c > index 8e885c1..0fb7de8 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -642,23 +642,19 @@ static inline struct task_group *task_group(struct task_struct *p) > > #endif /* CONFIG_CGROUP_SCHED */ > > -static u64 irq_time_cpu(int cpu); > -static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time); > +static void update_rq_clock_task(struct rq *rq, s64 delta); > > inline void update_rq_clock(struct rq *rq) > { > - int cpu = cpu_of(rq); > - u64 irq_time; > + s64 delta; > > if (rq->skip_clock_update) > return; > > - rq->clock = sched_clock_cpu(cpu); > - irq_time = irq_time_cpu(cpu); > - if (rq->clock - irq_time > rq->clock_task) > - rq->clock_task = rq->clock - irq_time; > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; > + rq->clock += delta; > + update_rq_clock_task(rq, delta); > > - sched_irq_time_avg_update(rq, irq_time); > } > > /* > @@ -1817,14 +1813,6 @@ void disable_sched_clock_irqtime(void) > sched_clock_irqtime = 0; > } > > -static u64 irq_time_cpu(int cpu) > -{ > - if (!sched_clock_irqtime) > - return 0; > - > - return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); > -} > - > void account_system_vtime(struct task_struct *curr) > { > unsigned long flags; > @@ -1855,25 +1843,33 @@ void account_system_vtime(struct task_struct *curr) > } > EXPORT_SYMBOL_GPL(account_system_vtime); > > -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) > +static inline u64 irq_time_cpu(int cpu) > { > - if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) { > - u64 delta_irq = curr_irq_time - rq->prev_irq_time; > - rq->prev_irq_time = curr_irq_time; > - sched_rt_avg_update(rq, delta_irq); > - } > + return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); > } > > -#else > - > -static u64 irq_time_cpu(int cpu) > +static void update_rq_clock_task(struct rq *rq, s64 delta) > { > - return 0; > + s64 irq_delta; > + > + irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time; > + rq->prev_irq_time += irq_delta; > + > + delta -= irq_delta; > + rq->clock_task += delta; > + > + if (irq_delta && sched_feat(NONIRQ_POWER)) > + sched_rt_avg_update(rq, irq_delta); > } > > -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { } > +#else /* CONFIG_IRQ_TIME_ACCOUNTING */ > > -#endif > +static inline void update_rq_clock_task(struct rq *rq, s64 delta) > +{ > + rq->clock_task += delta; > +} > + > +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ > > #include "sched_idletask.c" > #include "sched_fair.c" > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 17:43 ` Venkatesh Pallipadi @ 2010-12-09 17:55 ` Peter Zijlstra 2010-12-09 18:11 ` Venkatesh Pallipadi 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-09 17:55 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote: > > The same problem will be there with below code, with irq_delta > > delta, clock_task can go backwards which is not good. > + delta -= irq_delta; > + rq->clock_task += delta; > > The reason for this is rq->clock and irqtime updates kind of happen > independently and specifically, if a rq->clock update happens while we > are in a softirq, we may have this case of going backwards on the next > update. But how can irq_delta > delta?, we measure it using the same clock. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 17:55 ` Peter Zijlstra @ 2010-12-09 18:11 ` Venkatesh Pallipadi 2010-12-09 18:55 ` Peter Zijlstra 2010-12-13 14:33 ` Jack Daniel 0 siblings, 2 replies; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-09 18:11 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote: >> >> The same problem will be there with below code, with irq_delta > >> delta, clock_task can go backwards which is not good. >> + ? ? ? delta -= irq_delta; >> + ? ? ? rq->clock_task += delta; >> >> The reason for this is rq->clock and irqtime updates kind of happen >> independently and specifically, if a rq->clock update happens while we >> are in a softirq, we may have this case of going backwards on the next >> update. > > But how can irq_delta > delta?, we measure it using the same clock. > This would be mostly a corner case like: - softirq start time t1 - rq->clock updated at t2 and rq->clock_task updated at t2 without accounting for current softirq - softirq end time t3 - cpu spends most time here in softirq or hardirq - next rq->clock update at t4 and rq->clock_task update, with delta = t4-t2 and irq_delta ~= t4 - t1 ^ permalink raw reply [flat|nested] 51+ messages in thread
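Putting illustrative numbers on that sequence (microseconds, invented purely for the example): t1 = 0, t2 = 10, t3 = 400, t4 = 500, with the t3..t4 gap also spent mostly in irq context. Then delta = t4 - t2 = 490 while irq_delta ~= t4 - t1 = 500, so subtracting irq_delta from delta would move clock_task backwards by roughly the t2 - t1 = 10 that had already elapsed inside the softirq before the rq->clock update at t2.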
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 18:11 ` Venkatesh Pallipadi @ 2010-12-09 18:55 ` Peter Zijlstra 2010-12-09 22:21 ` Venkatesh Pallipadi 2010-12-13 14:33 ` Jack Daniel 1 sibling, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-09 18:55 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-09 at 10:11 -0800, Venkatesh Pallipadi wrote: > On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote: > >> > >> The same problem will be there with below code, with irq_delta > > >> delta, clock_task can go backwards which is not good. > >> + delta -= irq_delta; > >> + rq->clock_task += delta; > >> > >> The reason for this is rq->clock and irqtime updates kind of happen > >> independently and specifically, if a rq->clock update happens while we > >> are in a softirq, we may have this case of going backwards on the next > >> update. > > > > But how can irq_delta > delta?, we measure it using the same clock. > > > > This would be mostly a corner case like: > - softirq start time t1 > - rq->clock updated at t2 and rq->clock_task updated at t2 without > accounting for current softirq > - softirq end time t3 > - cpu spends most time here in softirq or hardirq > - next rq->clock update at t4 and rq->clock_task update, with delta = > t4-t2 and irq_delta ~= t4 - t1 Ah, something like that would happen when we do a wakeup from soft/hard-irq context, not an altogether uncommon occurrence. Wouldn't that be cured by updating the irq-time when asking for it, something like the below? (on top of my earlier patch) --- Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1843,8 +1843,9 @@ void account_system_vtime(struct task_st } EXPORT_SYMBOL_GPL(account_system_vtime); -static inline u64 irq_time_cpu(int cpu) +static u64 irq_time_cpu(int cpu) { + account_system_vtime(current); return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 18:55 ` Peter Zijlstra @ 2010-12-09 22:21 ` Venkatesh Pallipadi 2010-12-09 23:16 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-09 22:21 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 9, 2010 at 10:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-12-09 at 10:11 -0800, Venkatesh Pallipadi wrote: >> On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> > On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote: >> >> >> >> The same problem will be there with below code, with irq_delta > >> >> delta, clock_task can go backwards which is not good. >> >> + delta -= irq_delta; >> >> + rq->clock_task += delta; >> >> >> >> The reason for this is rq->clock and irqtime updates kind of happen >> >> independently and specifically, if a rq->clock update happens while we >> >> are in a softirq, we may have this case of going backwards on the next >> >> update. >> > >> > But how can irq_delta > delta?, we measure it using the same clock. >> > >> >> This would be mostly a corner case like: >> - softirq start time t1 >> - rq->clock updated at t2 and rq->clock_task updated at t2 without >> accounting for current softirq >> - softirq end time t3 >> - cpu spends most time here in softirq or hardirq >> - next rq->clock update at t4 and rq->clock_task update, with delta = >> t4-t2 and irq_delta ~= t4 - t1 > > Ah, something like that would happen when we do a wakeup from > soft/hard-irq context, not an altogether uncommon occurrence. > > Wouldn't that be cured by updating the irq-time when asking for it, > something like the below? (on top of my earlier patch) This should mostly work. There would be a small window there on rq->clock=sched_clock_cpu() and system_vtime doing sched_clock_cpu() and also the overhead of doing this everytime when this going backwards may happen rarely. Thanks, Venki
> > --- > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -1843,8 +1843,9 @@ void account_system_vtime(struct task_st > } > EXPORT_SYMBOL_GPL(account_system_vtime); > > -static inline u64 irq_time_cpu(int cpu) > +static u64 irq_time_cpu(int cpu) > { > + account_system_vtime(current); > return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); > } > > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 22:21 ` Venkatesh Pallipadi @ 2010-12-09 23:16 ` Peter Zijlstra 2010-12-09 23:35 ` Venkatesh Pallipadi 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-09 23:16 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-09 at 14:21 -0800, Venkatesh Pallipadi wrote: > This should mostly work. There would be a small window there on > rq->clock=sched_clock_cpu() and system_vtime doing sched_clock_cpu() > and also the overhead of doing this everytime when this going > backwards may happen rarely. Right, so something like the below removes that tiny race by re-using rq->clock as now. It also removes some overhead by removing the IRQ fudging (we already know IRQs are disabled). It does have a few extra branches (2 afaict) than your monotonicity path had, but it seems to me this approach is slightly more accurate. --- Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1813,19 +1813,10 @@ void disable_sched_clock_irqtime(void) sched_clock_irqtime = 0; } -void account_system_vtime(struct task_struct *curr) +static void __account_system_vtime(int cpu, u64 now) { - unsigned long flags; - int cpu; - u64 now, delta; + s64 delta; - if (!sched_clock_irqtime) - return; - - local_irq_save(flags); - - cpu = smp_processor_id(); - now = sched_clock_cpu(cpu); delta = now - per_cpu(irq_start_time, cpu); per_cpu(irq_start_time, cpu) = now; /* @@ -1836,16 +1827,36 @@ void account_system_vtime(struct task_st */ if (hardirq_count()) per_cpu(cpu_hardirq_time, cpu) += delta; - else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) + else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD)) per_cpu(cpu_softirq_time, cpu) += delta; +} + +void account_system_vtime(struct task_struct *curr) +{ + unsigned long flags; + u64 now; + int cpu; + + if (!sched_clock_irqtime) + return; + + local_irq_save(flags); + + cpu = smp_processor_id(); + now = sched_clock_cpu(cpu); + __account_system_vtime(cpu, now); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(account_system_vtime); -static u64 irq_time_cpu(int cpu) +static u64 irq_time_cpu(struct rq *rq) { - account_system_vtime(current); + int cpu = cpu_of(rq); + + if (sched_clock_irqtime) + __account_system_vtime(cpu, rq->clock); + return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } @@ -1853,7 +1864,7 @@ static void update_rq_clock_task(struct { s64 irq_delta; - irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time; + irq_delta = irq_time_cpu(rq) - rq->prev_irq_time; rq->prev_irq_time += irq_delta; delta -= irq_delta; ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 23:16 ` Peter Zijlstra @ 2010-12-09 23:35 ` Venkatesh Pallipadi 2010-12-10 10:08 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-09 23:35 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 9, 2010 at 3:16 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, 2010-12-09 at 14:21 -0800, Venkatesh Pallipadi wrote: >> This should mostly work. There would be a small window there on >> rq->clock=sched_clock_cpu() and system_vtime doing sched_clock_cpu() >> and also the overhead of doing this everytime when this going >> backwards may happen rarely. > > Right, so something like the below removes that tiny race by re-using > rq->clock as now. It also removes some overhead by removing the IRQ > fudging (we already know IRQs are disabled). > > It does have a few extra branches (2 afaict) than your monotonicity path > had, but it seems to me this approach is slightly more accurate. > Looks fine. Just to make sure, update_rq_clock() always gets called on current CPU. Right? The pending patches I have optimizes account_system_vtime() to use this_cpu_write and friends. Want to make sure this change will still keep that optimization relevant.
> --- > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -1813,19 +1813,10 @@ void disable_sched_clock_irqtime(void) > sched_clock_irqtime = 0; > } > > -void account_system_vtime(struct task_struct *curr) > +static void __account_system_vtime(int cpu, u64 now) > { > - unsigned long flags; > - int cpu; > - u64 now, delta; > + s64 delta; > > - if (!sched_clock_irqtime) > - return; > - > - local_irq_save(flags); > - > - cpu = smp_processor_id(); > - now = sched_clock_cpu(cpu); > delta = now - per_cpu(irq_start_time, cpu); > per_cpu(irq_start_time, cpu) = now; > /* > @@ -1836,16 +1827,36 @@ void account_system_vtime(struct task_st > */ > if (hardirq_count()) > per_cpu(cpu_hardirq_time, cpu) += delta; > - else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) > + else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD)) > per_cpu(cpu_softirq_time, cpu) += delta; > +} > + > +void account_system_vtime(struct task_struct *curr) > +{ > + unsigned long flags; > + u64 now; > + int cpu; > + > + if (!sched_clock_irqtime) > + return; > + > + local_irq_save(flags); > + > + cpu = smp_processor_id(); > + now = sched_clock_cpu(cpu); > + __account_system_vtime(cpu, now); > > local_irq_restore(flags); > } > EXPORT_SYMBOL_GPL(account_system_vtime); > > -static u64 irq_time_cpu(int cpu) > +static u64 irq_time_cpu(struct rq *rq) > { > - account_system_vtime(current); > + int cpu = cpu_of(rq); > + > + if (sched_clock_irqtime) > + __account_system_vtime(cpu, rq->clock); > + > return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); > } > > @@ -1853,7 +1864,7 @@ static void update_rq_clock_task(struct > { > s64 irq_delta; > > - irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time; > + irq_delta = irq_time_cpu(rq) - rq->prev_irq_time; > rq->prev_irq_time += irq_delta; > > delta -= irq_delta; > > ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 23:35 ` Venkatesh Pallipadi @ 2010-12-10 10:08 ` Peter Zijlstra 2010-12-10 13:17 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 10:08 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2010-12-09 at 15:35 -0800, Venkatesh Pallipadi wrote: > > Just to make sure, update_rq_clock() always gets called on current > CPU. Right? No, specifically not. If that were the case we wouldn't need the cross-cpu synced timestamp. Things like load-balancing and remote-wakeups need to update a remote CPUs clock. > The pending patches I have optimizes > account_system_vtime() to use this_cpu_write and friends. Want to make > sure this change will still keep that optimization relevant. Ah, good point, remote CPUs updating that will mess with the consistency of the per-cpu timestamps due to non atomic updates :/ Bugger.. making them atomics will make it even more expensive. /me goes ponder. ^ permalink raw reply [flat|nested] 51+ messages in thread
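The 32-bit problem alluded to above is that a u64 store is performed as two 32-bit stores there, so a reader that is not synchronized against the writer can observe a half-updated value. A minimal user-space sketch of that hazard, purely illustrative (made-up values, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t lo = 0xffffff00u, hi = 0;	/* old 64-bit value, just below a carry */
	uint64_t new_val = 0x100000100ull;	/* new value: the carry reaches the high word */

	/* writer updates the low word first ... */
	lo = (uint32_t)new_val;
	/* ... a reader that samples here sees a torn value ... */
	uint64_t torn = ((uint64_t)hi << 32) | lo;
	/* ... and only then does the writer update the high word */
	hi = (uint32_t)(new_val >> 32);

	printf("expected %#llx, torn read sees %#llx\n",
	       (unsigned long long)new_val, (unsigned long long)torn);
	return 0;
}

With nanosecond timestamps such a torn value is off by about 2^32 ns, i.e. roughly four seconds, which is plenty to upset the irq-time accounting.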
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 10:08 ` Peter Zijlstra @ 2010-12-10 13:17 ` Peter Zijlstra 2010-12-10 13:27 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 13:17 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 11:08 +0100, Peter Zijlstra wrote: > > Just to make sure, update_rq_clock() always gets called on current > > CPU. Right? > > No, specifically not. If that were the case we wouldn't need the > cross-cpu synced timestamp. Things like load-balancing and > remote-wakeups need to update a remote CPUs clock. > > > The pending patches I have optimizes > > account_system_vtime() to use this_cpu_write and friends. Want to make > > sure this change will still keep that optimization relevant. > > Ah, good point, remote CPUs updating that will mess with the consistency > of the per-cpu timestamps due to non atomic updates :/ > > Bugger.. making them atomics will make it even more expensive. /me goes > ponder. OK, so I ended up doing the same you did.. Still staring at that, 32bit will go very funny in the head once every so often. One possible solution would be to ignore the occasional abs(irq_delta) > 2 * delta. That would however result in an accounting discrepancy such that: clock_task + irq_time != clock Thoughts? --- Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1813,11 +1813,36 @@ void disable_sched_clock_irqtime(void) sched_clock_irqtime = 0; } +static void __account_system_vtime(int cpu, u64 now) +{ + s64 delta; + + delta = now - per_cpu(irq_start_time, cpu); + per_cpu(irq_start_time, cpu) = now; + + if (hardirq_count()) + per_cpu(cpu_hardirq_time, cpu) += delta; + /* + * We do not account for softirq time from ksoftirqd here. We want to + * continue accounting softirq time to ksoftirqd thread in that case, + * so as not to confuse scheduler with a special task that do not + * consume any time, but still wants to run. + */ + else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD)) + per_cpu(cpu_softirq_time, cpu) += delta; +} + +/* + * Called before incrementing preempt_count on {soft,}irq_enter + * and before decrementing preempt_count on {soft,}irq_exit. + * + * @curr should at all times be current. + */ void account_system_vtime(struct task_struct *curr) { unsigned long flags; + u64 now; int cpu; - u64 now, delta; if (!sched_clock_irqtime) return; @@ -1826,26 +1851,21 @@ void account_system_vtime(struct task_st cpu = smp_processor_id(); now = sched_clock_cpu(cpu); - delta = now - per_cpu(irq_start_time, cpu); - per_cpu(irq_start_time, cpu) = now; - /* - * We do not account for softirq time from ksoftirqd here. - * We want to continue accounting softirq time to ksoftirqd thread - * in that case, so as not to confuse scheduler with a special task - * that do not consume any time, but still wants to run. - */ - if (hardirq_count()) - per_cpu(cpu_hardirq_time, cpu) += delta; - else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) - per_cpu(cpu_softirq_time, cpu) += delta; + __account_system_vtime(cpu, now); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(account_system_vtime); -static u64 irq_time_cpu(int cpu) +static u64 irq_time_cpu(struct rq *rq) { - account_system_vtime(current); + int cpu = cpu_of(rq); + /* + * See the comment in update_rq_clock_task(), ideally we'd update + * the *irq_time values using rq->clock here. 
+ * + * As it stands, reading this from a remote cpu is buggy on 32bit. + */ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } @@ -1853,9 +1873,27 @@ static void update_rq_clock_task(struct { s64 irq_delta; - irq_delta = irq_time_cpu(cpu_of(rq)) - rq->prev_irq_time; - rq->prev_irq_time += irq_delta; + irq_delta = irq_time_cpu(rq) - rq->prev_irq_time; + + /* + * Since irq_time is only updated on {soft,}irq_exit, we might run into + * this case when a previous update_rq_clock() happened inside a + * {soft,}irq region. + * + * When this happens, we stop ->clock_task and only update the + * prev_irq_time stamp to account for the part that fit, so that a next + * update will consume the rest. This ensures ->clock_task is + * monotonic. + * + * It does however cause some slight miss-attribution of {soft,}irq + * time, a more accurate solution would be to update the irq_time using + * the current rq->clock timestamp, except that would require using + * atomic ops. + */ + if (irq_delta > delta) + irq_delta = delta; + rq->prev_irq_time += irq_delta; delta -= irq_delta; rq->clock_task += delta; ^ permalink raw reply [flat|nested] 51+ messages in thread
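Pulled out of the diff, the clamping idea amounts to the following (a user-space restatement with made-up numbers, not the kernel code itself): clock_task never moves backwards, and irq time that arrives "ahead" of the measured delta is left in prev_irq_time to be consumed by the next update.

#include <stdio.h>
#include <stdint.h>

static uint64_t clock_task, prev_irq_time;

static void update_clock_task(int64_t delta, uint64_t irq_time_so_far)
{
	int64_t irq_delta = irq_time_so_far - prev_irq_time;

	if (irq_delta > delta)		/* irq accounting ran ahead of this update */
		irq_delta = delta;	/* clamp so clock_task cannot go backwards */

	prev_irq_time += irq_delta;
	clock_task += delta - irq_delta;
}

int main(void)
{
	update_clock_task(100, 150);	/* 150ns of irq observed, only 100ns elapsed here */
	printf("clock_task=%llu prev_irq_time=%llu\n",
	       (unsigned long long)clock_task, (unsigned long long)prev_irq_time);

	update_clock_task(100, 150);	/* the remaining 50ns gets consumed now */
	printf("clock_task=%llu prev_irq_time=%llu\n",
	       (unsigned long long)clock_task, (unsigned long long)prev_irq_time);
	return 0;
}

The first call accounts the whole 100ns delta as irq time (clock_task stays put); the second call consumes the 50ns remainder and advances clock_task by the other 50ns, which is exactly the slight miss-attribution trade-off described in the comment above.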
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 13:17 ` Peter Zijlstra @ 2010-12-10 13:27 ` Peter Zijlstra 2010-12-10 13:47 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 13:27 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 14:17 +0100, Peter Zijlstra wrote: > > OK, so I ended up doing the same you did.. Still staring at that, 32bit > will go very funny in the head once every so often. One possible > solution would be to ignore the occasional abs(irq_delta) > 2 * delta. > > That would however result in an accounting discrepancy such that: > clock_task + irq_time != clock > > Thoughts? The brute force solution is a seqcount.. something like so: --- Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1786,21 +1786,63 @@ static void deactivate_task(struct rq *r #ifdef CONFIG_IRQ_TIME_ACCOUNTING /* - * There are no locks covering percpu hardirq/softirq time. - * They are only modified in account_system_vtime, on corresponding CPU - * with interrupts disabled. So, writes are safe. + * There are no locks covering percpu hardirq/softirq time. They are only + * modified in account_system_vtime, on corresponding CPU with interrupts + * disabled. So, writes are safe. + * * They are read and saved off onto struct rq in update_rq_clock(). - * This may result in other CPU reading this CPU's irq time and can - * race with irq/account_system_vtime on this CPU. We would either get old - * or new value (or semi updated value on 32 bit) with a side effect of - * accounting a slice of irq time to wrong task when irq is in progress - * while we read rq->clock. That is a worthy compromise in place of having - * locks on each irq in account_system_time. + * + * This may result in other CPU reading this CPU's irq time and can race with + * irq/account_system_vtime on this CPU. We would either get old or new value + * with a side effect of accounting a slice of irq time to wrong task when irq + * is in progress while we read rq->clock. That is a worthy compromise in place + * of having locks on each irq in account_system_time. 
*/ static DEFINE_PER_CPU(u64, cpu_hardirq_time); static DEFINE_PER_CPU(u64, cpu_softirq_time); - static DEFINE_PER_CPU(u64, irq_start_time); + +#ifndef CONFIG_64BIT +static DEFINE_PER_CPU(seqcount_t, irq_time_seq); + +static inline void irq_time_write_begin(int cpu) +{ + write_seqcount_begin(&per_cpu(irq_time_seq, cpu)); +} + +static inline void irq_time_write_end(int cpu) +{ + write_seqcount_end(&per_cpu(irq_time_seq, cpu)); +} + +static inline u64 irq_time_read(int cpu) +{ + u64 irq_time; + unsigned seq; + + do { + seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); + irq_time = per_cpu(cpu_softirq_time, cpu) + + per_cpu(cpu_hardirq_time, cpu); + } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); + + return irq_time; +} +#else /* CONFIG_64BIT */ +static inline void irq_time_write_begin(int cpu) +{ +} + +static inline void irq_time_write_end(int cpu) +{ +} + +static inline u64 irq_time_read(int cpu) +{ + return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); +} +#endif /* CONFIG_64BIT */ + static int sched_clock_irqtime; void enable_sched_clock_irqtime(void) @@ -1820,6 +1862,7 @@ static void __account_system_vtime(int c delta = now - per_cpu(irq_start_time, cpu); per_cpu(irq_start_time, cpu) = now; + irq_time_write_begin(cpu); if (hardirq_count()) per_cpu(cpu_hardirq_time, cpu) += delta; /* @@ -1830,6 +1873,7 @@ static void __account_system_vtime(int c */ else if (in_serving_softirq() && !(current->flags & PF_KSOFTIRQD)) per_cpu(cpu_softirq_time, cpu) += delta; + irq_time_write_end(cpu); } /* @@ -1859,14 +1903,11 @@ EXPORT_SYMBOL_GPL(account_system_vtime); static u64 irq_time_cpu(struct rq *rq) { - int cpu = cpu_of(rq); /* * See the comment in update_rq_clock_task(), ideally we'd update * the *irq_time values using rq->clock here. - * - * As it stands, reading this from a remote cpu is buggy on 32bit. */ - return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); + return irq_time_read(cpu_of(rq)); } static void update_rq_clock_task(struct rq *rq, s64 delta) ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 13:27 ` Peter Zijlstra @ 2010-12-10 13:47 ` Peter Zijlstra 2010-12-10 16:50 ` Russell King - ARM Linux ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 13:47 UTC (permalink / raw) To: linux-arm-kernel Full patch.. --- Subject: sched: Fix the irqtime code to deal with u64 wraps From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Thu Dec 09 14:15:34 CET 2010 ARM systems have a 32bit sched_clock() [ which needs to be fixed ], but this exposed a bug in the irq_time code as well, it doesn't deal with wraps at all. Fix the irq_time code to deal with u64 wraps by re-writing the code to only use delta increments, which avoids the whole issue. Furthermore, solve the problem of 32bit arches reading partial updates of the u64 time values. Cc: Venkatesh Pallipadi <venki@google.com> Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- kernel/sched.c | 166 +++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 116 insertions(+), 50 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -642,23 +642,19 @@ static inline struct task_group *task_gr #endif /* CONFIG_CGROUP_SCHED */ -static u64 irq_time_cpu(int cpu); -static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time); +static void update_rq_clock_task(struct rq *rq, s64 delta); inline void update_rq_clock(struct rq *rq) { - int cpu = cpu_of(rq); - u64 irq_time; + s64 delta; if (rq->skip_clock_update) return; - rq->clock = sched_clock_cpu(cpu); - irq_time = irq_time_cpu(cpu); - if (rq->clock - irq_time > rq->clock_task) - rq->clock_task = rq->clock - irq_time; + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; + rq->clock += delta; + update_rq_clock_task(rq, delta); - sched_irq_time_avg_update(rq, irq_time); } /* @@ -1790,90 +1786,160 @@ static void deactivate_task(struct rq *r #ifdef CONFIG_IRQ_TIME_ACCOUNTING /* - * There are no locks covering percpu hardirq/softirq time. - * They are only modified in account_system_vtime, on corresponding CPU - * with interrupts disabled. So, writes are safe. + * There are no locks covering percpu hardirq/softirq time. They are only + * modified in account_system_vtime, on corresponding CPU with interrupts + * disabled. So, writes are safe. + * * They are read and saved off onto struct rq in update_rq_clock(). - * This may result in other CPU reading this CPU's irq time and can - * race with irq/account_system_vtime on this CPU. We would either get old - * or new value (or semi updated value on 32 bit) with a side effect of - * accounting a slice of irq time to wrong task when irq is in progress - * while we read rq->clock. That is a worthy compromise in place of having - * locks on each irq in account_system_time. + * + * This may result in other CPU reading this CPU's irq time and can race with + * irq/account_system_vtime on this CPU. We would either get old or new value + * with a side effect of accounting a slice of irq time to wrong task when irq + * is in progress while we read rq->clock. That is a worthy compromise in place + * of having locks on each irq in account_system_time. 
*/ static DEFINE_PER_CPU(u64, cpu_hardirq_time); static DEFINE_PER_CPU(u64, cpu_softirq_time); - static DEFINE_PER_CPU(u64, irq_start_time); -static int sched_clock_irqtime; -void enable_sched_clock_irqtime(void) +#ifndef CONFIG_64BIT +static DEFINE_PER_CPU(seqcount_t, irq_time_seq); + +static inline void irq_time_write_begin(int cpu) { - sched_clock_irqtime = 1; + write_seqcount_begin(&per_cpu(irq_time_seq, cpu)); } -void disable_sched_clock_irqtime(void) +static inline void irq_time_write_end(int cpu) { - sched_clock_irqtime = 0; + write_seqcount_end(&per_cpu(irq_time_seq, cpu)); } -static u64 irq_time_cpu(int cpu) +static inline u64 irq_time_read(int cpu) { - if (!sched_clock_irqtime) - return 0; + u64 irq_time; + unsigned seq; + + do { + seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); + irq_time = per_cpu(cpu_softirq_time, cpu) + + per_cpu(cpu_hardirq_time, cpu); + } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); + + return irq_time; +} +#else /* CONFIG_64BIT */ +static inline void irq_time_write_begin(int cpu) +{ +} + +static inline void irq_time_write_end(int cpu) +{ +} +static inline u64 irq_time_read(int cpu) +{ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } +#endif /* CONFIG_64BIT */ +static int sched_clock_irqtime; + +void enable_sched_clock_irqtime(void) +{ + sched_clock_irqtime = 1; +} + +void disable_sched_clock_irqtime(void) +{ + sched_clock_irqtime = 0; +} + +/* + * Called before incrementing preempt_count on {soft,}irq_enter + * and before decrementing preempt_count on {soft,}irq_exit. + */ void account_system_vtime(struct task_struct *curr) { unsigned long flags; + s64 delta; int cpu; - u64 now, delta; if (!sched_clock_irqtime) return; local_irq_save(flags); - cpu = smp_processor_id(); - now = sched_clock_cpu(cpu); - delta = now - per_cpu(irq_start_time, cpu); - per_cpu(irq_start_time, cpu) = now; - /* - * We do not account for softirq time from ksoftirqd here. - * We want to continue accounting softirq time to ksoftirqd thread - * in that case, so as not to confuse scheduler with a special task - * that do not consume any time, but still wants to run. - */ + delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu); + per_cpu(irq_start_time, cpu) += delta; + + irq_time_write_begin(cpu); + if (hardirq_count()) per_cpu(cpu_hardirq_time, cpu) += delta; + /* + * We do not account for softirq time from ksoftirqd here. We want to + * continue accounting softirq time to ksoftirqd thread in that case, + * so as not to confuse scheduler with a special task that do not + * consume any time, but still wants to run. + */ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) per_cpu(cpu_softirq_time, cpu) += delta; + irq_time_write_end(cpu); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(account_system_vtime); -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) +static u64 irq_time_cpu(struct rq *rq) { - if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) { - u64 delta_irq = curr_irq_time - rq->prev_irq_time; - rq->prev_irq_time = curr_irq_time; - sched_rt_avg_update(rq, delta_irq); - } + /* + * See the comment in update_rq_clock_task(), ideally we'd update + * the *irq_time values using rq->clock here. 
+ */ + return irq_time_read(cpu_of(rq)); } -#else - -static u64 irq_time_cpu(int cpu) +static void update_rq_clock_task(struct rq *rq, s64 delta) { - return 0; + s64 irq_delta; + + irq_delta = irq_time_cpu(rq) - rq->prev_irq_time; + + /* + * Since irq_time is only updated on {soft,}irq_exit, we might run into + * this case when a previous update_rq_clock() happened inside a + * {soft,}irq region. + * + * When this happens, we stop ->clock_task and only update the + * prev_irq_time stamp to account for the part that fit, so that a next + * update will consume the rest. This ensures ->clock_task is + * monotonic. + * + * It does however cause some slight miss-attribution of {soft,}irq + * time, a more accurate solution would be to update the irq_time using + * the current rq->clock timestamp, except that would require using + * atomic ops. + */ + if (irq_delta > delta) + irq_delta = delta; + + rq->prev_irq_time += irq_delta; + delta -= irq_delta; + rq->clock_task += delta; + + if (irq_delta && sched_feat(NONIRQ_POWER)) + sched_rt_avg_update(rq, irq_delta); } -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { } +#else /* CONFIG_IRQ_TIME_ACCOUNTING */ -#endif +static inline void update_rq_clock_task(struct rq *rq, s64 delta) +{ + rq->clock_task += delta; +} + +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ #include "sched_idletask.c" #include "sched_fair.c" ^ permalink raw reply [flat|nested] 51+ messages in thread
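The wrap argument in the changelog can be checked in isolation: with delta-only updates, unsigned subtraction yields the correct increment even across a full 64-bit wrap, whereas a clock that wraps at 32 bits but is carried in a u64 (the ARM sched_clock() situation mentioned above) gets no such protection and still needs fixing separately. An illustrative user-space sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* full 64-bit counter: wrap is harmless for delta arithmetic */
	uint64_t prev = UINT64_MAX - 10;	/* just before the 2^64 wrap */
	uint64_t now  = 5;			/* 16 ticks later, after the wrap */
	printf("u64 delta across a 64-bit wrap: %llu\n",
	       (unsigned long long)(now - prev));	/* prints 16 */

	/* same counter, but wrapping at 32 bits while stored in a u64 */
	uint64_t prev32 = 0xfffffff0ull;	/* near the 32-bit wrap */
	uint64_t now32  = 0x00000005ull;	/* 21 ticks later, after the 32-bit wrap */
	printf("32-bit clock carried in a u64:  %llu\n",
	       (unsigned long long)(now32 - prev32));	/* enormous bogus delta */
	return 0;
}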
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 13:47 ` Peter Zijlstra @ 2010-12-10 16:50 ` Russell King - ARM Linux 2010-12-10 16:54 ` Peter Zijlstra 2010-12-10 17:18 ` Eric Dumazet 2010-12-10 17:56 ` Russell King - ARM Linux 2 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-10 16:50 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote: > > Full patch.. > > --- > Subject: sched: Fix the irqtime code to deal with u64 wraps > From: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Thu Dec 09 14:15:34 CET 2010 > > ARM systems have a 32bit sched_clock() [ which needs to be fixed ], > but this exposed a bug in the irq_time code as well, it doesn't deal > with wraps at all. > > Fix the irq_time code to deal with u64 wraps by re-writing the code to > only use delta increments, which avoids the whole issue. > > Furthermore, solve the problem of 32bit arches reading partial updates > of the u64 time values. > > Cc: Venkatesh Pallipadi <venki@google.com> > Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> I think credit should go to Mikael Pettersson, who identified the interactivity regression and problematical commit. I only pointed out the dubious nature of the code. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 16:50 ` Russell King - ARM Linux @ 2010-12-10 16:54 ` Peter Zijlstra 0 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 16:54 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 16:50 +0000, Russell King - ARM Linux wrote: > > Reported-by: Russell King - ARM Linux <linux@arm.linux.org.uk> > > I think credit should go to Mikael Pettersson, who identified the > interactivity regression and problematical commit. I only pointed > out the dubious nature of the code. Ah, thanks, missed that. Updated the patch. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 13:47 ` Peter Zijlstra 2010-12-10 16:50 ` Russell King - ARM Linux @ 2010-12-10 17:18 ` Eric Dumazet 2010-12-10 17:49 ` Peter Zijlstra 2010-12-10 17:56 ` Russell King - ARM Linux 2 siblings, 1 reply; 51+ messages in thread From: Eric Dumazet @ 2010-12-10 17:18 UTC (permalink / raw) To: linux-arm-kernel Le vendredi 10 d?cembre 2010 ? 14:47 +0100, Peter Zijlstra a ?crit : > Full patch.. > unt_system_vtime(struct task_struct *curr) > { > unsigned long flags; > + s64 delta; > int cpu; > - u64 now, delta; > > if (!sched_clock_irqtime) > return; > > local_irq_save(flags); > - > cpu = smp_processor_id(); > - now = sched_clock_cpu(cpu); > - delta = now - per_cpu(irq_start_time, cpu); > - per_cpu(irq_start_time, cpu) = now; > - /* > - * We do not account for softirq time from ksoftirqd here. > - * We want to continue accounting softirq time to ksoftirqd thread > - * in that case, so as not to confuse scheduler with a special task > - * that do not consume any time, but still wants to run. > - */ > + delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu); > + per_cpu(irq_start_time, cpu) += delta; We should add some checkpatch warning to any new per_cpu() use. delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time); __this_cpu_add(irq_start_time, delta); Also irq_time_write_begin() and irq_time_write_end() could be faster (called for current cpu) static inline void irq_time_write_begin(void) { __this_cpu_inc(irq_time_seq.sequence); smp_wmb(); } static inline void irq_time_write_end(void) { smp_wmb(); __this_cpu_inc(irq_time_seq.sequence); } ^ permalink raw reply [flat|nested] 51+ messages in thread
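Folding both of Eric's suggestions into the hot path would look roughly like the sketch below. It is untested and only illustrative, written against the patch earlier in the thread; irq_time_write_begin()/irq_time_write_end() are assumed to be the argument-less __this_cpu_inc + smp_wmb variants proposed above.

void account_system_vtime(struct task_struct *curr)
{
	unsigned long flags;
	s64 delta;

	if (!sched_clock_irqtime)
		return;

	local_irq_save(flags);

	/* everything below touches only this CPU's data */
	delta = sched_clock_cpu(smp_processor_id()) - __this_cpu_read(irq_start_time);
	__this_cpu_add(irq_start_time, delta);

	irq_time_write_begin();
	if (hardirq_count())
		__this_cpu_add(cpu_hardirq_time, delta);
	else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD))
		__this_cpu_add(cpu_softirq_time, delta);
	irq_time_write_end();

	local_irq_restore(flags);
}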
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 17:18 ` Eric Dumazet @ 2010-12-10 17:49 ` Peter Zijlstra 2010-12-10 18:14 ` Eric Dumazet 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 17:49 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 18:18 +0100, Eric Dumazet wrote: > On Fri, 2010-12-10 at 14:47 +0100, Peter Zijlstra wrote: > Also irq_time_write_begin() and irq_time_write_end() could be faster > (called for current cpu) > > static inline void irq_time_write_begin(void) > { > __this_cpu_inc(irq_time_seq.sequence); > smp_wmb(); > } > > static inline void irq_time_write_end(void) > { > smp_wmb(); > __this_cpu_inc(irq_time_seq.sequence); > } Yeah, but that kinda defeats the purpose of having it implemented in seqlock.h. Ideally we'd teach gcc about these long pointers and have something like: write_seqcount_begin(&this_cpu_read(irq_time_seq)); do the right thing. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 17:49 ` Peter Zijlstra @ 2010-12-10 18:14 ` Eric Dumazet 2010-12-10 18:39 ` Christoph Lameter 0 siblings, 1 reply; 51+ messages in thread From: Eric Dumazet @ 2010-12-10 18:14 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 18:49 +0100, Peter Zijlstra wrote: > On Fri, 2010-12-10 at 18:18 +0100, Eric Dumazet wrote: > > On Fri, 2010-12-10 at 14:47 +0100, Peter Zijlstra wrote: > > > Also irq_time_write_begin() and irq_time_write_end() could be faster > > (called for current cpu) > > > > static inline void irq_time_write_begin(void) > > { > > __this_cpu_inc(irq_time_seq.sequence); > > smp_wmb(); > > } > > > > static inline void irq_time_write_end(void) > > { > > smp_wmb(); > > __this_cpu_inc(irq_time_seq.sequence); > > } > > Yeah, but that kinda defeats the purpose of having it implemented in > seqlock.h. Ideally we'd teach gcc about these long pointers and have > something like: > > write_seqcount_begin(&this_cpu_read(irq_time_seq)); > > do the right thing. gcc won't be able to do this yet (%fs/%gs selectors) But we can provide this_cpu_write_seqcount_{begin|end}() static inline void this_cpu_write_seqcount_begin(seqcount_t *s) { __this_cpu_inc(s->sequence); smp_wmb(); } ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:14 ` Eric Dumazet @ 2010-12-10 18:39 ` Christoph Lameter 2010-12-10 18:46 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Christoph Lameter @ 2010-12-10 18:39 UTC (permalink / raw) To: linux-arm-kernel On Fri, 10 Dec 2010, Eric Dumazet wrote: > > Yeah, but that kinda defeats the purpose of having it implemented in > > seqlock.h. Ideally we'd teach gcc about these long pointers and have > > something like: > > > > write_seqcount_begin(&this_cpu_read(irq_time_seq)); > > > > do the right thing. > > gcc wont be able to do this yet (%fs/%gs selectors) The kernel can do that using the __percpu annotation. > But we can provide this_cpu_write_seqcount_{begin|end}() No we cannot do that. this_cpu ops are for per cpu data and not for locking values shared between processors. We have a mechanism for passing per cpu pointers with a corresponding annotation. > static inline void this_cpu_write_seqcount_begin(seqcount_t *s) ^^^ Would have to be seqcount_t __percpu *s > { > __this_cpu_inc(s->sequence); > smp_wmb(); > } ^ permalink raw reply [flat|nested] 51+ messages in thread
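The annotation Christoph refers to is checked only by sparse (it expands to nothing for the compiler), but it does let a per-cpu pointer be passed around and dereferenced through the this_cpu accessors. A rough sketch with illustrative names:

static DEFINE_PER_CPU(u64, my_count);

static void bump(u64 __percpu *p)
{
	__this_cpu_inc(*p);	/* operates on the calling CPU's instance */
}

static void bar(void)
{
	bump(&my_count);	/* &my_count carries the __percpu address space */
}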
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:39 ` Christoph Lameter @ 2010-12-10 18:46 ` Peter Zijlstra 2010-12-10 19:51 ` Christoph Lameter 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 18:46 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 12:39 -0600, Christoph Lameter wrote: > On Fri, 10 Dec 2010, Eric Dumazet wrote: > > > > Yeah, but that kinda defeats the purpose of having it implemented in > > > seqlock.h. Ideally we'd teach gcc about these long pointers and have > > > something like: > > > > > > write_seqcount_begin(&this_cpu_read(irq_time_seq)); > > > > > > do the right thing. > > > > gcc wont be able to do this yet (%fs/%gs selectors) > > The kernel can do that using the __percpu annotation. That's not true: # define __percpu It's a complete NOP. > > But we can provide this_cpu_write_seqcount_{begin|end}() > > No we cannot do hat. this_cpu ops are for per cpu data and not for locking > values shared between processors. We have a mechanism for passing per cpu > pointers with a corresponding annotation. -enoparse, it's not locking anything, it's a per-cpu sequence count. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:46 ` Peter Zijlstra @ 2010-12-10 19:51 ` Christoph Lameter 2010-12-10 20:07 ` Peter Zijlstra 0 siblings, 1 reply; 51+ messages in thread From: Christoph Lameter @ 2010-12-10 19:51 UTC (permalink / raw) To: linux-arm-kernel On Fri, 10 Dec 2010, Peter Zijlstra wrote: > > > gcc wont be able to do this yet (%fs/%gs selectors) > > > > The kernel can do that using the __percpu annotation. > > That's not true: > > # define __percpu > > Its a complete NOP. The annotation serves for sparse checking. .... If you do not care about those checks then you can simply pass a percpu pointer in the same form as a regular pointer. > > > But we can provide this_cpu_write_seqcount_{begin|end}() > > > > No we cannot do hat. this_cpu ops are for per cpu data and not for locking > > values shared between processors. We have a mechanism for passing per cpu > > pointers with a corresponding annotation. > > -enoparse, its not locking anything, is a per-cpu sequence count. seqlocks are for synchronization of objects on different processors. Seems that you do not have that use case in mind. So a seqlock restricted to a single processor? If so then you wont need any of those smp write barriers mentioned earlier. A simple compiler barrier() is sufficient. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 19:51 ` Christoph Lameter @ 2010-12-10 20:07 ` Peter Zijlstra 2010-12-10 20:23 ` Christoph Lameter 0 siblings, 1 reply; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 20:07 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 13:51 -0600, Christoph Lameter wrote: > On Fri, 10 Dec 2010, Peter Zijlstra wrote: > > > > > gcc wont be able to do this yet (%fs/%gs selectors) > > > > > > The kernel can do that using the __percpu annotation. > > > > That's not true: > > > > # define __percpu > > > > Its a complete NOP. > > The annotation serves for sparse checking. .... If you do not care about > those checks then you can simply pass a percpu pointer in the same form as > a regular pointer. Its not about passing per-cpu pointers, its about passing long pointers. When I write: void foo(u64 *bla) { *bla++; } DEFINE_PER_CPU(u64, plop); void bar(void) { foo(__this_cpu_ptr(plop)); } I want gcc to emit the equivalent to: __this_cpu_inc(plop); /* incq %fs:(%0) */ Now I guess the C type system will get in the way of this ever working, since a long pointer would have a distinct type from a regular pointer :/ The idea is to use 'regular' functions with the per-cpu data in a transparent manner so as not to have to replicate all logic. > > > > But we can provide this_cpu_write_seqcount_{begin|end}() > > > > > > No we cannot do hat. this_cpu ops are for per cpu data and not for locking > > > values shared between processors. We have a mechanism for passing per cpu > > > pointers with a corresponding annotation. > > > > -enoparse, its not locking anything, is a per-cpu sequence count. > > seqlocks are for synchronization of objects on different processors. > > Seems that you do not have that use case in mind. So a seqlock restricted > to a single processor? If so then you wont need any of those smp write > barriers mentioned earlier. A simple compiler barrier() is sufficient. The seqcount is sometimes read by different CPUs, but I don't see why we couldn't do what Eric suggested. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 20:07 ` Peter Zijlstra @ 2010-12-10 20:23 ` Christoph Lameter 2010-12-10 20:32 ` Peter Zijlstra 2010-12-10 20:39 ` Eric Dumazet 0 siblings, 2 replies; 51+ messages in thread From: Christoph Lameter @ 2010-12-10 20:23 UTC (permalink / raw) To: linux-arm-kernel On Fri, 10 Dec 2010, Peter Zijlstra wrote: > Its not about passing per-cpu pointers, its about passing long pointers. > > When I write: > > void foo(u64 *bla) > { > *bla++; > } > > DEFINE_PER_CPU(u64, plop); > > void bar(void) > { > foo(__this_cpu_ptr(plop)); > } > > I want gcc to emit the equivalent to: > > __this_cpu_inc(plop); /* incq %fs:(%0) */ > > Now I guess the C type system will get in the way of this ever working, > since a long pointer would have a distinct type from a regular > pointer :/ > > The idea is to use 'regular' functions with the per-cpu data in a > transparent manner so as not to have to replicate all logic. That would mean you would have to pass information in the pointer at runtime indicating that this particular pointer is a per cpu pointer. Code for the Itanium arch can do that because it has per cpu virtual mappings. So you define a virtual area for per cpu data and then map it differently for each processor. If we would have a different page table for each processor then we could avoid using segment register and do the same on x86. > > Seems that you do not have that use case in mind. So a seqlock restricted > > to a single processor? If so then you wont need any of those smp write > > barriers mentioned earlier. A simple compiler barrier() is sufficient. > > The seqcount is sometimes read by different CPUs, but I don't see why we > couldn't do what Eric suggested. But you would have to define a per cpu seqlock. Each cpu would have its own seqlock. Then you could have this_cpu_read_seqcount_begin and friends: DEFINE_PER_CPU(seqcount, bla); /* Start of read using pointer to a sequence counter only. */ static inline unsigned this_cpu_read_seqcount_begin(const seqcount_t __percpu *s) { /* No other processor can be using this lock since it is per cpu*/ ret = this_cpu_read(s->sequence); barrier(); return ret; } /* * Test if reader processed invalid data because sequence number has changed. */ static inline int this_cpu_read_seqcount_retry(const seqcount_t __percpu *s, unsigned start) { barrier(); return this_cpu_read(s->sequence) != start; } /* * Sequence counter only version assumes that callers are using their * own mutexing. */ static inline void this_cpu_write_seqcount_begin(seqcount_t __percpu *s) { __this_cpu_inc(s->sequence); barrier(); } static inline void this_cpuwrite_seqcount_end(seqcount_t __percpu *s) { __this_cpu_dec(s->sequence); barrier(); } Then you can do this_cpu_read_seqcount_begin(&bla) ... But then this seemed to be a discussion related to ARM. ARM does not have optimized per cpu accesses. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 20:23 ` Christoph Lameter @ 2010-12-10 20:32 ` Peter Zijlstra 2010-12-10 20:39 ` Eric Dumazet 1 sibling, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 20:32 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 14:23 -0600, Christoph Lameter wrote: > On Fri, 10 Dec 2010, Peter Zijlstra wrote: > > > Its not about passing per-cpu pointers, its about passing long pointers. > > > > When I write: > > > > void foo(u64 *bla) > > { > > *bla++; > > } > > > > DEFINE_PER_CPU(u64, plop); > > > > void bar(void) > > { > > foo(__this_cpu_ptr(plop)); > > } > > > > I want gcc to emit the equivalent to: > > > > __this_cpu_inc(plop); /* incq %fs:(%0) */ > > > > Now I guess the C type system will get in the way of this ever working, > > since a long pointer would have a distinct type from a regular > > pointer :/ > > > > The idea is to use 'regular' functions with the per-cpu data in a > > transparent manner so as not to have to replicate all logic. > > That would mean you would have to pass information in the pointer at > runtime indicating that this particular pointer is a per cpu pointer. > > Code for the Itanium arch can do that because it has per cpu virtual > mappings. So you define a virtual area for per cpu data and then map it > differently for each processor. If we would have a different page table > for each processor then we could avoid using segment register and do the > same on x86. I don't think its a runtime issue, its a compile time issue. At compile time the compiler can see the argument is a long pointer: %fs:(addr,idx,size), and could propagate this into the caller. The above example will compute the effective address by doing something like: lea %fs:(addr,idx,size),%ebx and will then do something like inc (%ebx) Where it could easily have optimized this into: inc %fs:(addr,idx,size) esp when foo would be inlined. If its an actual call-site you need function overloading because a long pointer has a different signature from a regular pointer, and that is something C doesn't do. > > > Seems that you do not have that use case in mind. So a seqlock restricted > > > to a single processor? If so then you wont need any of those smp write > > > barriers mentioned earlier. A simple compiler barrier() is sufficient. > > > > The seqcount is sometimes read by different CPUs, but I don't see why we > > couldn't do what Eric suggested. > > But you would have to define a per cpu seqlock. Each cpu would have > its own seqlock. Then you could have this_cpu_read_seqcount_begin and > friends: > > Then you can do > > this_cpu_read_seqcount_begin(&bla) > Which to me seems to be exactly what Eric proposed.. > But then this seemed to be a discussion related to ARM. ARM does not have > optimized per cpu accesses. Nah, there's multiple issues all nicely mangled into one thread ;-) ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 20:23 ` Christoph Lameter 2010-12-10 20:32 ` Peter Zijlstra @ 2010-12-10 20:39 ` Eric Dumazet 2010-12-10 20:49 ` Eric Dumazet 1 sibling, 1 reply; 51+ messages in thread From: Eric Dumazet @ 2010-12-10 20:39 UTC (permalink / raw) To: linux-arm-kernel Le vendredi 10 d?cembre 2010 ? 14:23 -0600, Christoph Lameter a ?crit : > On Fri, 10 Dec 2010, Peter Zijlstra wrote: > > > Its not about passing per-cpu pointers, its about passing long pointers. > > > > When I write: > > > > void foo(u64 *bla) > > { > > *bla++; > > } > > > > DEFINE_PER_CPU(u64, plop); > > > > void bar(void) > > { > > foo(__this_cpu_ptr(plop)); > > } > > > > I want gcc to emit the equivalent to: > > > > __this_cpu_inc(plop); /* incq %fs:(%0) */ > > > > Now I guess the C type system will get in the way of this ever working, > > since a long pointer would have a distinct type from a regular > > pointer :/ > > > > The idea is to use 'regular' functions with the per-cpu data in a > > transparent manner so as not to have to replicate all logic. > > That would mean you would have to pass information in the pointer at > runtime indicating that this particular pointer is a per cpu pointer. > > Code for the Itanium arch can do that because it has per cpu virtual > mappings. So you define a virtual area for per cpu data and then map it > differently for each processor. If we would have a different page table > for each processor then we could avoid using segment register and do the > same on x86. > > > > Seems that you do not have that use case in mind. So a seqlock restricted > > > to a single processor? If so then you wont need any of those smp write > > > barriers mentioned earlier. A simple compiler barrier() is sufficient. > > > > The seqcount is sometimes read by different CPUs, but I don't see why we > > couldn't do what Eric suggested. > > But you would have to define a per cpu seqlock. Each cpu would have > its own seqlock. Then you could have this_cpu_read_seqcount_begin and > friends: > > Yes. It was the idea. > DEFINE_PER_CPU(seqcount, bla); > > This is in Peter patch :) > > > /* Start of read using pointer to a sequence counter only. */ > static inline unsigned this_cpu_read_seqcount_begin(const seqcount_t __percpu *s) > { > /* No other processor can be using this lock since it is per cpu*/ > ret = this_cpu_read(s->sequence); > barrier(); > return ret; > } > > /* > * Test if reader processed invalid data because sequence number has changed. > */ > static inline int this_cpu_read_seqcount_retry(const seqcount_t __percpu *s, unsigned start) > { > barrier(); > return this_cpu_read(s->sequence) != start; > } > > > /* > * Sequence counter only version assumes that callers are using their > * own mutexing. > */ > static inline void this_cpu_write_seqcount_begin(seqcount_t __percpu *s) > { > __this_cpu_inc(s->sequence); > barrier(); > } > > static inline void this_cpuwrite_seqcount_end(seqcount_t __percpu *s) > { > __this_cpu_dec(s->sequence); > barrier(); > } > > > Then you can do > > this_cpu_read_seqcount_begin(&bla) > > ... This was exactly my suggestion Christoph. I am glad you understand it now. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 20:39 ` Eric Dumazet @ 2010-12-10 20:49 ` Eric Dumazet 2010-12-10 21:09 ` Christoph Lameter 0 siblings, 1 reply; 51+ messages in thread From: Eric Dumazet @ 2010-12-10 20:49 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 21:39 +0100, Eric Dumazet wrote: > This was exactly my suggestion Christoph. > > I am glad you understand it now. > > By the way, we need smp_wmb(), not barrier(), even if only the "owner cpu" can write into its 'percpu' seqcount. There is nothing special about a seqcount being percpu or a 'global' one. We must have the same memory barrier semantics. this_cpu_write_seqcount_begin(&myseqcount); this_cpu_add(mydata1, add1); this_cpu_add(mydata2, add2); this_cpu_inc(mydata3); this_cpu_write_seqcount_end(&myseqcount); We protect the data[1,2,3] set with a seqcount, so we need smp_wmb() in both _begin() and _end() ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 20:49 ` Eric Dumazet @ 2010-12-10 21:09 ` Christoph Lameter 2010-12-10 21:22 ` Eric Dumazet 0 siblings, 1 reply; 51+ messages in thread From: Christoph Lameter @ 2010-12-10 21:09 UTC (permalink / raw) To: linux-arm-kernel On Fri, 10 Dec 2010, Eric Dumazet wrote: > > By the way, we need smp_wmb(), not barrier(), even only the "owner cpu" > can write into its 'percpu' seqcount. > > There is nothing special about a seqcount being percpu or a 'global' > one. We must have same memory barrier semantics. There is certainly a major difference in that execution of a stream of instructions on the same cpu is guaranteed to have a coherent view of the data. That is not affected by interrupts etc. > > this_cpu_write_seqcount_begin(&myseqcount); > this_cpu_add(mydata1, add1); > this_cpu_add(mydata2, add2); > this_cpu_inc(mydata3); > this_cpu_write_seqcount_end(&myseqcount); > > We protect the data[1,2,3] set with a seqcount, so need smp_wmb() in > both _begin() and _end() There is nothing to protect there since processing is on the same cpu. The data coherency guarantees of the processor will not allow anything out of sequence to affect execution. An interrupt f.e. will not cause updates to mydata1 to get lost. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 21:09 ` Christoph Lameter @ 2010-12-10 21:22 ` Eric Dumazet 2010-12-10 21:45 ` Christoph Lameter 0 siblings, 1 reply; 51+ messages in thread From: Eric Dumazet @ 2010-12-10 21:22 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 15:09 -0600, Christoph Lameter wrote: > On Fri, 10 Dec 2010, Eric Dumazet wrote: > > > > > By the way, we need smp_wmb(), not barrier(), even only the "owner cpu" > > can write into its 'percpu' seqcount. > > > > There is nothing special about a seqcount being percpu or a 'global' > > one. We must have same memory barrier semantics. > > There is certainly a major difference in that execution of a stream of > instructions on the same cpu is guaranteed to have a coherent view of > the data. That is not affected by interrupts etc. > We don't care about interrupts. We care about doing a transaction over a complex set of data, one that cannot be done using an atomic op; otherwise we would need a spinlock/mutex/rwlock, which we should avoid for performance reasons. > > > > this_cpu_write_seqcount_begin(&myseqcount); > > this_cpu_add(mydata1, add1); > > this_cpu_add(mydata2, add2); > > this_cpu_inc(mydata3); > > this_cpu_write_seqcount_end(&myseqcount); > > > > We protect the data[1,2,3] set with a seqcount, so need smp_wmb() in > > both _begin() and _end() > > There is nothing to protect there since processing is on the same cpu. The > data coherency guarantees of the processor will not allow anything out of > sequence to affect execution. An interrupt f.e. will not cause updates to > mydata1 to get lost. > Please take a look at include/linux/u64_stats_sync.h, maybe you'll understand the concern about using a seqcount to protect a set of data, for example a 256-bit counter increment. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 21:22 ` Eric Dumazet @ 2010-12-10 21:45 ` Christoph Lameter 0 siblings, 0 replies; 51+ messages in thread From: Christoph Lameter @ 2010-12-10 21:45 UTC (permalink / raw) To: linux-arm-kernel On Fri, 10 Dec 2010, Eric Dumazet wrote: > > There is certainly a major difference in that execution of a stream of > > instructions on the same cpu is guaranteed to have a coherent view of > > the data. That is not affected by interrupts etc. > > > > We dont care of interrupts. We care of doing a transaction over a > complex set of data, that cannot be done using an atomic op (or we need > a spinlock/mutex/rwlock), and should not because of performance. So what? The cpu gives you incoherent view of data somewhere when only processing data from a single cpu? If you have remote data accesses (loop summing the data?) and you have to be concerned about data coherence then you CANNOT use this_cpu_ops since they are not guaranteed to be coherent to other processors. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 13:47 ` Peter Zijlstra 2010-12-10 16:50 ` Russell King - ARM Linux 2010-12-10 17:18 ` Eric Dumazet @ 2010-12-10 17:56 ` Russell King - ARM Linux 2010-12-10 18:10 ` Peter Zijlstra 2 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-10 17:56 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote: > inline void update_rq_clock(struct rq *rq) > { > - int cpu = cpu_of(rq); > - u64 irq_time; > + s64 delta; > > if (rq->skip_clock_update) > return; > > - rq->clock = sched_clock_cpu(cpu); > - irq_time = irq_time_cpu(cpu); > - if (rq->clock - irq_time > rq->clock_task) > - rq->clock_task = rq->clock - irq_time; > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; > + rq->clock += delta; Hmm. Can you tell me how this is different to: new_clock = sched_clock_cpu(cpu_of(rq)); delta = new_clock - rq->clock; rq->clock = new_clock; which I think may be simpler in terms of 64-bit math for 32-bit compilers to deal with? In terms of the wrap-around, I don't see this as any different from the above, as: rq->clock += sched_clock_cpu(cpu_of(rq)) - rq_clock; rq->clock = rq->clock + sched_clock_cpu(cpu_of(rq)) - rq_clock; rq->clock = sched_clock_cpu(cpu_of(rq)); ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 17:56 ` Russell King - ARM Linux @ 2010-12-10 18:10 ` Peter Zijlstra 2010-12-10 18:43 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 18:10 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 17:56 +0000, Russell King - ARM Linux wrote: > On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote: > > inline void update_rq_clock(struct rq *rq) > > { > > - int cpu = cpu_of(rq); > > - u64 irq_time; > > + s64 delta; > > > > if (rq->skip_clock_update) > > return; > > > > - rq->clock = sched_clock_cpu(cpu); > > - irq_time = irq_time_cpu(cpu); > > - if (rq->clock - irq_time > rq->clock_task) > > - rq->clock_task = rq->clock - irq_time; > > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; > > + rq->clock += delta; > > Hmm. Can you tell me how this is different to: > > new_clock = sched_clock_cpu(cpu_of(rq)); > delta = new_clock - rq->clock; > rq->clock = new_clock; > > which I think may be simpler in terms of 64-bit math for 32-bit compilers > to deal with? Its not, I could write it like that, the only reason I didn't is because it uses an extra variable. If gcc on 32bit targets really generates hideous code for it I'll happily change it. > In terms of the wrap-around, I don't see this as any different from the > above, as: > > rq->clock += sched_clock_cpu(cpu_of(rq)) - rq_clock; > rq->clock = rq->clock + sched_clock_cpu(cpu_of(rq)) - rq_clock; > rq->clock = sched_clock_cpu(cpu_of(rq)); Correct, its not different. Nor was it meant to be. The only problem it solves is the u64 wrap failure in: if (rq->clock - irq_time > rq->clock_task) There are lots of places in the scheduler that rely on u64 wrap, for now the easiest thing for ARM would be to select HAVE_UNSTABLE_SCHED_CLOCK for those platforms that implement a short sched_clock(). While that isn't ideal it is something that makes it work, we can work on something more suitable for future kernels. ^ permalink raw reply [flat|nested] 51+ messages in thread
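To make that failure concrete: the old expression is only well-behaved while rq->clock is numerically greater than the accumulated irq time. Once a short sched_clock() wraps and rq->clock comes back around small, the unsigned subtraction underflows and clock_task jumps forward by an enormous amount. A sketch with illustrative numbers (user-space, mirroring only the arithmetic of the old code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t clock_task = 4000000000ull;	/* accumulated task time so far       */
	uint64_t irq_time   =  500000000ull;	/* accumulated hard+soft irq time     */
	uint64_t clock      =     123456ull;	/* clock gone "small" again post-wrap */

	if (clock - irq_time > clock_task)	/* underflows to a huge value */
		clock_task = clock - irq_time;

	printf("clock_task is now %llu\n", (unsigned long long)clock_task);
	return 0;
}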
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:10 ` Peter Zijlstra @ 2010-12-10 18:43 ` Peter Zijlstra 2010-12-10 19:17 ` Russell King - ARM Linux 2010-12-10 19:25 ` Peter Zijlstra 2 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 18:43 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 19:10 +0100, Peter Zijlstra wrote: > There are lots of places in the scheduler that rely on u64 wrap, for now > the easiest thing for ARM would be to select HAVE_UNSTABLE_SCHED_CLOCK > for those platforms that implement a short sched_clock(). > > While that isn't ideal it is something that makes it work, we can work > on something more suitable for future kernels. Either that, or the thing you proposed a while back in this thread. Since ARM doesn't have NMIs something like that should work just fine. ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:10 ` Peter Zijlstra 2010-12-10 18:43 ` Peter Zijlstra @ 2010-12-10 19:17 ` Russell King - ARM Linux 2010-12-10 19:37 ` Peter Zijlstra 2010-12-10 19:25 ` Peter Zijlstra 2 siblings, 1 reply; 51+ messages in thread From: Russell King - ARM Linux @ 2010-12-10 19:17 UTC (permalink / raw) To: linux-arm-kernel On Fri, Dec 10, 2010 at 07:10:54PM +0100, Peter Zijlstra wrote: > On Fri, 2010-12-10 at 17:56 +0000, Russell King - ARM Linux wrote: > > On Fri, Dec 10, 2010 at 02:47:46PM +0100, Peter Zijlstra wrote: > > > inline void update_rq_clock(struct rq *rq) > > > { > > > - int cpu = cpu_of(rq); > > > - u64 irq_time; > > > + s64 delta; > > > > > > if (rq->skip_clock_update) > > > return; > > > > > > - rq->clock = sched_clock_cpu(cpu); > > > - irq_time = irq_time_cpu(cpu); > > > - if (rq->clock - irq_time > rq->clock_task) > > > - rq->clock_task = rq->clock - irq_time; > > > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; > > > + rq->clock += delta; > > > > Hmm. Can you tell me how this is different to: > > > > new_clock = sched_clock_cpu(cpu_of(rq)); > > delta = new_clock - rq->clock; > > rq->clock = new_clock; > > > > which I think may be simpler in terms of 64-bit math for 32-bit compilers > > to deal with? > > Its not, I could write it like that, the only reason I didn't is because > it uses an extra variable. If gcc on 32bit targets really generates > hideous code for it I'll happily change it. Well, I can't tell you what kind of code this produces on ARM, as it doesn't appear to apply to any kernel I've tried. So, I assume it's against some scheduler development tree rather than Linus' tree? ^ permalink raw reply [flat|nested] 51+ messages in thread
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 19:17 ` Russell King - ARM Linux @ 2010-12-10 19:37 ` Peter Zijlstra 0 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 19:37 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 19:17 +0000, Russell King - ARM Linux wrote: > > > Well, I can't tell you what kind of code this produces on ARM, as it > doesn't appear to apply to any kernel I've tried. So, I assume it's > against some scheduler development tree rather than Linus' tree? Ah yes, my bad, there's some change that got in the way. --- Subject: sched: Fix the irqtime code to deal with u64 wraps From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Thu Dec 09 14:15:34 CET 2010 ARM systems have a 32bit sched_clock() [ which needs to be fixed ], but this exposed a bug in the irq_time code as well, it doesn't deal with wraps at all. Fix the irq_time code to deal with u64 wraps by re-writing the code to only use delta increments, which avoids the whole issue. Furthermore, solve the problem of 32bit arches reading partial updates of the u64 time values. Cc: Venkatesh Pallipadi <venki@google.com> Reported-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <new-submission> --- kernel/sched.c | 172 +++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 119 insertions(+), 53 deletions(-) Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -636,22 +636,18 @@ static inline struct task_group *task_gr #endif /* CONFIG_CGROUP_SCHED */ -static u64 irq_time_cpu(int cpu); -static void sched_irq_time_avg_update(struct rq *rq, u64 irq_time); +static void update_rq_clock_task(struct rq *rq, s64 delta); -inline void update_rq_clock(struct rq *rq) +static void update_rq_clock(struct rq *rq) { - if (!rq->skip_clock_update) { - int cpu = cpu_of(rq); - u64 irq_time; + s64 delta; - rq->clock = sched_clock_cpu(cpu); - irq_time = irq_time_cpu(cpu); - if (rq->clock - irq_time > rq->clock_task) - rq->clock_task = rq->clock - irq_time; + if (rq->skip_clock_update) + return; - sched_irq_time_avg_update(rq, irq_time); - } + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; + rq->clock += delta; + update_rq_clock_task(rq, delta); } /* @@ -1918,90 +1914,160 @@ static void deactivate_task(struct rq *r #ifdef CONFIG_IRQ_TIME_ACCOUNTING /* - * There are no locks covering percpu hardirq/softirq time. - * They are only modified in account_system_vtime, on corresponding CPU - * with interrupts disabled. So, writes are safe. + * There are no locks covering percpu hardirq/softirq time. They are only + * modified in account_system_vtime, on corresponding CPU with interrupts + * disabled. So, writes are safe. + * * They are read and saved off onto struct rq in update_rq_clock(). - * This may result in other CPU reading this CPU's irq time and can - * race with irq/account_system_vtime on this CPU. We would either get old - * or new value (or semi updated value on 32 bit) with a side effect of - * accounting a slice of irq time to wrong task when irq is in progress - * while we read rq->clock. That is a worthy compromise in place of having - * locks on each irq in account_system_time. + * + * This may result in other CPU reading this CPU's irq time and can race with + * irq/account_system_vtime on this CPU. 
We would either get old or new value + * with a side effect of accounting a slice of irq time to wrong task when irq + * is in progress while we read rq->clock. That is a worthy compromise in place + * of having locks on each irq in account_system_time. */ static DEFINE_PER_CPU(u64, cpu_hardirq_time); static DEFINE_PER_CPU(u64, cpu_softirq_time); - static DEFINE_PER_CPU(u64, irq_start_time); -static int sched_clock_irqtime; -void enable_sched_clock_irqtime(void) +#ifndef CONFIG_64BIT +static DEFINE_PER_CPU(seqcount_t, irq_time_seq); + +static inline void irq_time_write_begin(int cpu) { - sched_clock_irqtime = 1; + write_seqcount_begin(&per_cpu(irq_time_seq, cpu)); } -void disable_sched_clock_irqtime(void) +static inline void irq_time_write_end(int cpu) { - sched_clock_irqtime = 0; + write_seqcount_end(&per_cpu(irq_time_seq, cpu)); } -static u64 irq_time_cpu(int cpu) +static inline u64 irq_time_read(int cpu) { - if (!sched_clock_irqtime) - return 0; + u64 irq_time; + unsigned seq; + + do { + seq = read_seqcount_begin(&per_cpu(irq_time_seq, cpu)); + irq_time = per_cpu(cpu_softirq_time, cpu) + + per_cpu(cpu_hardirq_time, cpu); + } while (read_seqcount_retry(&per_cpu(irq_time_seq, cpu), seq)); + + return irq_time; +} +#else /* CONFIG_64BIT */ +static inline void irq_time_write_begin(int cpu) +{ +} + +static inline void irq_time_write_end(int cpu) +{ +} +static inline u64 irq_time_read(int cpu) +{ return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu); } +#endif /* CONFIG_64BIT */ +static int sched_clock_irqtime; + +void enable_sched_clock_irqtime(void) +{ + sched_clock_irqtime = 1; +} + +void disable_sched_clock_irqtime(void) +{ + sched_clock_irqtime = 0; +} + +/* + * Called before incrementing preempt_count on {soft,}irq_enter + * and before decrementing preempt_count on {soft,}irq_exit. + */ void account_system_vtime(struct task_struct *curr) { unsigned long flags; + s64 delta; int cpu; - u64 now, delta; if (!sched_clock_irqtime) return; local_irq_save(flags); - cpu = smp_processor_id(); - now = sched_clock_cpu(cpu); - delta = now - per_cpu(irq_start_time, cpu); - per_cpu(irq_start_time, cpu) = now; - /* - * We do not account for softirq time from ksoftirqd here. - * We want to continue accounting softirq time to ksoftirqd thread - * in that case, so as not to confuse scheduler with a special task - * that do not consume any time, but still wants to run. - */ + delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu); + per_cpu(irq_start_time, cpu) += delta; + + irq_time_write_begin(cpu); + if (hardirq_count()) per_cpu(cpu_hardirq_time, cpu) += delta; + /* + * We do not account for softirq time from ksoftirqd here. We want to + * continue accounting softirq time to ksoftirqd thread in that case, + * so as not to confuse scheduler with a special task that do not + * consume any time, but still wants to run. + */ else if (in_serving_softirq() && !(curr->flags & PF_KSOFTIRQD)) per_cpu(cpu_softirq_time, cpu) += delta; + irq_time_write_end(cpu); local_irq_restore(flags); } EXPORT_SYMBOL_GPL(account_system_vtime); -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) +static u64 irq_time_cpu(struct rq *rq) { - if (sched_clock_irqtime && sched_feat(NONIRQ_POWER)) { - u64 delta_irq = curr_irq_time - rq->prev_irq_time; - rq->prev_irq_time = curr_irq_time; - sched_rt_avg_update(rq, delta_irq); - } + /* + * See the comment in update_rq_clock_task(), ideally we'd update + * the *irq_time values using rq->clock here. 
+ */ + return irq_time_read(cpu_of(rq)); } -#else - -static u64 irq_time_cpu(int cpu) +static void update_rq_clock_task(struct rq *rq, s64 delta) { - return 0; + s64 irq_delta; + + irq_delta = irq_time_cpu(rq) - rq->prev_irq_time; + + /* + * Since irq_time is only updated on {soft,}irq_exit, we might run into + * this case when a previous update_rq_clock() happened inside a + * {soft,}irq region. + * + * When this happens, we stop ->clock_task and only update the + * prev_irq_time stamp to account for the part that fit, so that a next + * update will consume the rest. This ensures ->clock_task is + * monotonic. + * + * It does however cause some slight miss-attribution of {soft,}irq + * time, a more accurate solution would be to update the irq_time using + * the current rq->clock timestamp, except that would require using + * atomic ops. + */ + if (irq_delta > delta) + irq_delta = delta; + + rq->prev_irq_time += irq_delta; + delta -= irq_delta; + rq->clock_task += delta; + + if (irq_delta && sched_feat(NONIRQ_POWER)) + sched_rt_avg_update(rq, irq_delta); } -static void sched_irq_time_avg_update(struct rq *rq, u64 curr_irq_time) { } +#else /* CONFIG_IRQ_TIME_ACCOUNTING */ -#endif +static inline void update_rq_clock_task(struct rq *rq, s64 delta) +{ + rq->clock_task += delta; +} + +#endif /* CONFIG_IRQ_TIME_ACCOUNTING */ #include "sched_idletask.c" #include "sched_fair.c" ^ permalink raw reply [flat|nested] 51+ messages in thread
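The !CONFIG_64BIT half of the patch above exists because a u64 load is not atomic on a 32-bit machine, so a reader could observe a torn, half-updated irq time; the per-CPU seqcount makes the reader retry until it sees a stable value. The following stand-alone C sketch shows that read-retry idea outside the kernel; the struct and function names are invented for illustration, and the barriers are simplified compared to the real seqcount primitives behind irq_time_read()/irq_time_write_begin().

/*
 * Minimal user-space sketch of the seqcount pattern used by the patch on
 * 32-bit kernels.  Illustrative names only, not kernel code: the writer
 * bumps the sequence around its update, the reader retries until it sees
 * an even, unchanged sequence, so it never uses a torn 64-bit value.
 */
#include <stdint.h>
#include <stdio.h>

struct irqtime_sample {
	volatile uint32_t seq;          /* even = stable, odd = write in progress */
	volatile uint64_t hardirq_time;
	volatile uint64_t softirq_time;
};

static void sample_write(struct irqtime_sample *s, uint64_t hard, uint64_t soft)
{
	s->seq++;                       /* odd: readers will retry                */
	__sync_synchronize();           /* order the bump before the data writes  */
	s->hardirq_time = hard;
	s->softirq_time = soft;
	__sync_synchronize();           /* order the data writes before the close */
	s->seq++;                       /* even again: data is consistent         */
}

static uint64_t sample_read(const struct irqtime_sample *s)
{
	uint32_t seq;
	uint64_t total;

	do {
		seq = s->seq;
		__sync_synchronize();
		total = s->hardirq_time + s->softirq_time;
		__sync_synchronize();
	} while ((seq & 1) || seq != s->seq);   /* retry on odd or changed seq */

	return total;
}

int main(void)
{
	struct irqtime_sample s = { 0 };

	sample_write(&s, 1000, 250);
	printf("irq time: %llu\n", (unsigned long long)sample_read(&s));
	return 0;
}

The writer never blocks and the reader never takes a lock, which matches the constraint stated in the patch's comment block: the per-CPU irq time fields are only written on their own CPU with interrupts disabled.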
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-10 18:10 ` Peter Zijlstra 2010-12-10 18:43 ` Peter Zijlstra 2010-12-10 19:17 ` Russell King - ARM Linux @ 2010-12-10 19:25 ` Peter Zijlstra 2 siblings, 0 replies; 51+ messages in thread From: Peter Zijlstra @ 2010-12-10 19:25 UTC (permalink / raw) To: linux-arm-kernel On Fri, 2010-12-10 at 19:10 +0100, Peter Zijlstra wrote: > > > + delta = sched_clock_cpu(cpu_of(rq)) - rq->clock; > > > + rq->clock += delta; > > > > Hmm. Can you tell me how this is different to: > > > > new_clock = sched_clock_cpu(cpu_of(rq)); > > delta = new_clock - rq->clock; > > rq->clock = new_clock; > > > > which I think may be simpler in terms of 64-bit math for 32-bit compilers > > to deal with? > > Its not, I could write it like that, the only reason I didn't is because > it uses an extra variable. If gcc on 32bit targets really generates > hideous code for it I'll happily change it. Index: linux-2.6/kernel/sched.c =================================================================== --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -1863,6 +1863,7 @@ void account_system_vtime(struct task_st { unsigned long flags; s64 delta; + u64 now; int cpu; if (!sched_clock_irqtime) @@ -1870,8 +1871,9 @@ void account_system_vtime(struct task_st local_irq_save(flags); cpu = smp_processor_id(); - delta = sched_clock_cpu(cpu) - per_cpu(irq_start_time, cpu); - per_cpu(irq_start_time, cpu) += delta; + now = sched_clock_cpu(cpu); + delta = now - per_cpu(irq_start_time, cpu); + per_cpu(irq_start_time, cpu) = now; irq_time_write_begin(cpu); On i386 (gcc (Ubuntu/Linaro 4.4.4-14ubuntu5) 4.4.5): Before: account_system_vtime: 160 bytes After: account_system_vtime: 214 bytes ^ permalink raw reply [flat|nested] 51+ messages in thread
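For anyone curious how such a size comparison can be reproduced, the two alternatives under discussion can be isolated into a trivial translation unit and compiled for a 32-bit target. Everything below (the struct, the stub, the function names) is invented for illustration, and the resulting sizes will depend on compiler and flags rather than matching the 160/214-byte figures quoted above.

/*
 * Illustrative only: the two ways of advancing a 64-bit rq clock that are
 * being compared above.  Build with e.g. "gcc -m32 -O2 -c clock_delta.c"
 * and inspect the two functions with "objdump -d" or "size".
 */
#include <stdint.h>

struct rq_like {
	uint64_t clock;
};

/* Stand-in for sched_clock_cpu(); extern so the compiler cannot fold it away. */
extern uint64_t sched_clock_cpu_stub(int cpu);

/* Variant 1: compute the delta in place, no extra 64-bit temporary. */
int64_t update_clock_in_place(struct rq_like *rq, int cpu)
{
	int64_t delta = sched_clock_cpu_stub(cpu) - rq->clock;

	rq->clock += delta;
	return delta;   /* the real code hands delta on to update_rq_clock_task() */
}

/* Variant 2: read the new value into a temporary first, as suggested above. */
int64_t update_clock_with_temp(struct rq_like *rq, int cpu)
{
	uint64_t new_clock = sched_clock_cpu_stub(cpu);
	int64_t delta = new_clock - rq->clock;

	rq->clock = new_clock;
	return delta;
}

The point of interest is simply whether keeping the full 64-bit new_clock value live across the subtraction costs extra register shuffling on a 32-bit target.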
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-09 18:11 ` Venkatesh Pallipadi 2010-12-09 18:55 ` Peter Zijlstra @ 2010-12-13 14:33 ` Jack Daniel 1 sibling, 0 replies; 51+ messages in thread From: Jack Daniel @ 2010-12-13 14:33 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 9, 2010 at 11:41 PM, Venkatesh Pallipadi <venki@google.com> wrote: > On Thu, Dec 9, 2010 at 9:55 AM, Peter Zijlstra <peterz@infradead.org> wrote: >> On Thu, 2010-12-09 at 09:43 -0800, Venkatesh Pallipadi wrote: >>> >>> The same problem will be there with the code below: with irq_delta > >>> delta, clock_task can go backwards, which is not good. >>> +       delta -= irq_delta; >>> +       rq->clock_task += delta; >>> >>> The reason for this is that rq->clock and irqtime updates kind of happen >>> independently; specifically, if an rq->clock update happens while we >>> are in a softirq, we may have this case of going backwards on the next >>> update. >> >> But how can irq_delta > delta? We measure it using the same clock. >> > > This would be mostly a corner case like: > - softirq start time t1 > - rq->clock updated at t2 and rq->clock_task updated at t2 without > accounting for current softirq > - softirq end time t3 > - cpu spends most time here in softirq or hardirq > - next rq->clock update at t4 and rq->clock_task update, with delta = > t4-t2 and irq_delta ~= t4 - t1 ^^^ I was curious about this line. Shouldn't this be irq_delta ~= t3 - t1? If so, the time going backwards shouldn't happen. If it does happen, then wouldn't it be better to have irq_time_cpu(cpu_of(rq)) return t3 instead of t4? Thanks, Jack ^ permalink raw reply [flat|nested] 51+ messages in thread
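Reading Venkatesh's corner case with concrete numbers may help: irq time is only folded in at {soft,hard}irq exit, so by the time of the t4 update the accumulated irq time can cover roughly t1..t4 (the first softirq plus the later irqs), while rq->clock_task last advanced at t2, giving delta = t4 - t2. Since t1 < t2, irq_delta can exceed delta. The sketch below plugs in invented timestamps and applies the same clamp as update_rq_clock_task() in Peter's patch; it is one reading of the thread, not kernel code.

/*
 * Invented numbers for the corner case discussed above; only meant to show
 * why irq_delta can exceed delta and what the clamp protects against.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t t1 = 100;              /* softirq entry                                  */
	uint64_t t2 = 110;              /* update_rq_clock() runs inside that softirq     */
	uint64_t t3 = 400;              /* softirq exits, irq time += t3 - t1             */
	uint64_t t4 = 500;              /* next update_rq_clock(); CPU mostly in irqs     */
	uint64_t irq_in_t3_t4 = 95;     /* irq time accumulated between t3 and t4         */

	int64_t delta = t4 - t2;                          /* 390 */
	int64_t irq_delta = (t3 - t1) + irq_in_t3_t4;     /* 395 */

	printf("delta=%lld irq_delta=%lld\n", (long long)delta, (long long)irq_delta);

	/* Without this clamp, clock_task would move backwards by 5 here. */
	if (irq_delta > delta)
		irq_delta = delta;

	printf("clock_task advances by %lld\n", (long long)(delta - irq_delta));
	return 0;
}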
* [BUG] 2.6.37-rc3 massive interactivity regression on ARM 2010-12-05 14:19 ` Russell King - ARM Linux 2010-12-05 16:07 ` Mikael Pettersson @ 2010-12-06 21:29 ` Venkatesh Pallipadi 1 sibling, 0 replies; 51+ messages in thread From: Venkatesh Pallipadi @ 2010-12-06 21:29 UTC (permalink / raw) To: linux-arm-kernel On Sun, Dec 5, 2010 at 6:19 AM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Sun, Dec 05, 2010 at 01:17:02PM +0000, Russell King - ARM Linux wrote: >> On Sun, Dec 05, 2010 at 01:32:37PM +0100, Mikael Pettersson wrote: >> > Mikael Pettersson writes: >> > ?> The scenario is that I do a remote login to an ARM build server, >> > ?> use screen to start a sub-shell, in that shell start a largish >> > ?> compile job, detach from that screen, and from the original login >> > ?> shell I occasionally monitor the compile job with top or ps or >> > ?> by attaching to the screen. >> > ?> >> > ?> With kernels 2.6.37-rc2 and -rc3 this causes the machine to become >> > ?> very sluggish: top takes forever to start, once started it shows no >> > ?> activity from the compile job (it's as if it's sleeping on a lock), >> > ?> and ps also takes forever and shows no activity from the compile job. >> > ?> >> > ?> Rebooting into 2.6.36 eliminates these issues. >> > ?> >> > ?> I do pretty much the same thing (remote login -> screen -> compile job) >> > ?> on other archs, but so far I've only seen the 2.6.37-rc misbehaviour >> > ?> on ARM EABI, specifically on an IOP n2100. (I have access to other ARM >> > ?> sub-archs, but haven't had time to test 2.6.37-rc on them yet.) >> > ?> >> > ?> Has anyone else seen this? Any ideas about the cause? >> > >> > (Re-followup since I just realised my previous followups were to Rafael's >> > regressions mailbot rather than the original thread.) >> > >> > > The bug is still present in 2.6.37-rc4. ?I'm currently trying to bisect it. >> > >> > git bisect identified >> > >> > [305e6835e05513406fa12820e40e4a8ecb63743c] sched: Do not account irq time to current task >> > >> > as the cause of this regression. ?Reverting it from 2.6.37-rc4 (requires some >> > hackery due to subsequent changes in the same area) restores sane behaviour. >> > >> > The original patch submission talks about irq-heavy scenarios. ?My case is the >> > exact opposite: UP, !PREEMPT, NO_HZ, very low irq rate, essentially 100% CPU >> > bound in userspace but expected to schedule quickly when needed (e.g. running >> > top or ps or just hitting CR in one shell while another runs a compile job). >> > >> > I've reproduced the misbehaviour with 2.6.37-rc4 on ARM/mach-iop32x and >> > ARM/mach-ixp4xx, but ARM/mach-kirkwood does not misbehave, and other archs >> > (x86 SMP, SPARC64 UP and SMP, PowerPC32 UP, Alpha UP) also do not misbehave. >> > >> > So it looks like an ARM-only issue, possibly depending on platform specifics. >> > >> > One difference I noticed between my Kirkwood machine and my ixp4xx and iop32x >> > machines is that even though all have CONFIG_NO_HZ=y, the timer irq rate is >> > much higher on Kirkwood, even when the machine is idle. >> >> The above patch you point out is fundamentally broken. >> >> + ? ? ? ? ? ? ? rq->clock = sched_clock_cpu(cpu); >> + ? ? ? ? ? ? ? irq_time = irq_time_cpu(cpu); >> + ? ? ? ? ? ? ? if (rq->clock - irq_time > rq->clock_task) >> + ? ? ? ? ? ? ? ? ? ? ? rq->clock_task = rq->clock - irq_time; >> >> This means that we will only update rq->clock_task if it is smaller than >> rq->clock. 
So, eventually over time, rq->clock_task becomes the maximum >> value that rq->clock can ever be. Or in other words, the maximum value >> of sched_clock_cpu(). >> >> Once that has been reached, although rq->clock will wrap back to zero, >> rq->clock_task will not, and so (I think) task execution time accounting >> effectively stops dead. >> >> I guess this hasn't been noticed on x86 as they have a 64-bit sched_clock, >> and so need to wait a long time for this to be noticed. However, on ARM >> where we tend to have 32-bit counters feeding sched_clock(), this value >> will wrap far sooner. > > I'm not so sure about this - certainly that if() statement looks very > suspicious above. As irq_time_cpu() will always be zero, can you try > removing the conditional? > > In any case, sched_clock_cpu() should be resilient against sched_clock() > wrapping. However, your comments about it being iop32x and ixp4xx > (both of which are 32-bit-counter-to-ns based implementations) and > kirkwood being a 32-bit-extended-to-63-bit-counter-to-ns implementation > does make me wonder... > That conditional is based on the assumption that sched_clock_cpu() is u64. If that is not true and sched_clock_cpu() is 32-bit and wraps around, then there are other places in the scheduler which may have problems as well, where we do curr_time - prev_time style calculations in u64. For example, update_curr() has: delta_exec = (unsigned long)(now - curr->exec_start); which is based on rq->clock and can end up as a high positive number in the case of 32-bit wraparound. Having said that, this conditional can be cleaned up to handle the potential 64-bit overflow (even after a long, long time) cleanly. But it will be good to know what exactly is going wrong here. Thanks, Venki ^ permalink raw reply [flat|nested] 51+ messages in thread
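Part of the disagreement above is whether curr_time - prev_time arithmetic is safe when the underlying counter is only 32 bits wide. It is, provided both samples come from the same wrapping counter, the subtraction is done in the counter's own unsigned width, and the clock is sampled at least once per wrap period; accumulating those deltas into a u64 then yields a monotonic extended clock, which is the property the delta-based rewrite of update_rq_clock() relies on. A small stand-alone sketch with invented counter values:

/*
 * Sketch of wrap-safe delta accumulation from a 32-bit free-running counter.
 * The counter samples and the (omitted) tick-to-ns scaling are invented; the
 * point is only that (uint32_t)(now - prev) is correct across a 32-bit wrap,
 * so the 64-bit accumulated clock never jumps or goes backwards.  This
 * assumes the clock is read at least once per counter wrap period.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t clock_ns;       /* extended, monotonic clock          */
static uint32_t last_count;     /* last raw 32-bit counter sample     */

static void clock_update(uint32_t counter_now)
{
	uint32_t delta = counter_now - last_count;  /* wraps correctly in 32 bits */

	last_count = counter_now;
	clock_ns += delta;      /* a real implementation would scale ticks to ns */
}

int main(void)
{
	/* Counter samples straddling a 32-bit wrap. */
	uint32_t samples[] = { 0xfffffff0u, 0xfffffffeu, 0x00000005u, 0x00000100u };

	last_count = samples[0];
	for (unsigned i = 1; i < sizeof(samples) / sizeof(samples[0]); i++) {
		clock_update(samples[i]);
		printf("sample %u: clock_ns=%llu\n", i, (unsigned long long)clock_ns);
	}
	return 0;
}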
end of thread, other threads:[~2010-12-13 14:33 UTC | newest] Thread overview: 51+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-11-27 15:16 [BUG] 2.6.37-rc3 massive interactivity regression on ARM Mikael Pettersson 2010-12-05 12:32 ` Mikael Pettersson 2010-12-05 13:17 ` Russell King - ARM Linux 2010-12-05 14:19 ` Russell King - ARM Linux 2010-12-05 16:07 ` Mikael Pettersson 2010-12-05 16:21 ` Russell King - ARM Linux 2010-12-08 12:40 ` Peter Zijlstra 2010-12-08 12:55 ` Russell King - ARM Linux 2010-12-08 14:04 ` Peter Zijlstra 2010-12-08 14:28 ` Russell King - ARM Linux 2010-12-08 14:44 ` Peter Zijlstra 2010-12-08 15:05 ` Russell King - ARM Linux 2010-12-08 15:43 ` Linus Walleij 2010-12-08 20:42 ` john stultz 2010-12-08 23:31 ` Venkatesh Pallipadi 2010-12-09 12:52 ` Peter Zijlstra 2010-12-09 17:43 ` Venkatesh Pallipadi 2010-12-09 17:55 ` Peter Zijlstra 2010-12-09 18:11 ` Venkatesh Pallipadi 2010-12-09 18:55 ` Peter Zijlstra 2010-12-09 22:21 ` Venkatesh Pallipadi 2010-12-09 23:16 ` Peter Zijlstra 2010-12-09 23:35 ` Venkatesh Pallipadi 2010-12-10 10:08 ` Peter Zijlstra 2010-12-10 13:17 ` Peter Zijlstra 2010-12-10 13:27 ` Peter Zijlstra 2010-12-10 13:47 ` Peter Zijlstra 2010-12-10 16:50 ` Russell King - ARM Linux 2010-12-10 16:54 ` Peter Zijlstra 2010-12-10 17:18 ` Eric Dumazet 2010-12-10 17:49 ` Peter Zijlstra 2010-12-10 18:14 ` Eric Dumazet 2010-12-10 18:39 ` Christoph Lameter 2010-12-10 18:46 ` Peter Zijlstra 2010-12-10 19:51 ` Christoph Lameter 2010-12-10 20:07 ` Peter Zijlstra 2010-12-10 20:23 ` Christoph Lameter 2010-12-10 20:32 ` Peter Zijlstra 2010-12-10 20:39 ` Eric Dumazet 2010-12-10 20:49 ` Eric Dumazet 2010-12-10 21:09 ` Christoph Lameter 2010-12-10 21:22 ` Eric Dumazet 2010-12-10 21:45 ` Christoph Lameter 2010-12-10 17:56 ` Russell King - ARM Linux 2010-12-10 18:10 ` Peter Zijlstra 2010-12-10 18:43 ` Peter Zijlstra 2010-12-10 19:17 ` Russell King - ARM Linux 2010-12-10 19:37 ` Peter Zijlstra 2010-12-10 19:25 ` Peter Zijlstra 2010-12-13 14:33 ` Jack Daniel 2010-12-06 21:29 ` Venkatesh Pallipadi