* [BUG] race of RCU vs NOHU
From: Martin Schwidefsky @ 2009-08-07 13:15 UTC
To: Paul E. McKenney, linux-kernel
Cc: Ingo Molnar, Thomas Gleixner, Gerald Schaefer

Hi Paul,
I analysed a dump of a hanging 2.6.30 system and found what I think is
a bug of RCU vs NOHZ. There are a number of patches on top of that
kernel but they should be independent of the bug.

The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
rcu batch. The timer wheel for cpu #1 is empty, it will continue to
sleep for NEXT_TIMER_MAX_DELTA ticks.

Now if I look at the RCU data structures I find this:

rcu_ctrlblk
>> px *(struct rcu_ctrlblk *) 0x810000
struct rcu_ctrlblk {
        cur = 0xffffffffffffff99
        completed = 0xffffffffffffff98
        pending = 0xffffffffffffff99
        signaled = 0x0
        lock = spinlock_t {
                raw_lock = raw_spinlock_t {
                        owner_cpu = 0x0
                }
                break_lock = 0x0
                magic = 0xdead4ead
                owner_cpu = 0xffffffff
                owner = 0xffffffffffffffff
                dep_map = struct lockdep_map {
                        key = 0x810118
                        class_cache = 0xcbcff0
                        name = 0x63e944
                        cpu = 0x0
                        ip = 0x1a7f64
                }
        }
        cpumask = {
                [0] 0x2
        }
}

rcu_data cpu #0
>> px *(struct rcu_data *) 0x872f8430
struct rcu_data {
        quiescbatch = 0xffffffffffffff99
        passed_quiesc = 0x1
        qs_pending = 0x0
        batch = 0xffffffffffffff97
        nxtlist = (nil)
        nxttail = {
                [0] 0x872f8448
                [1] 0x872f8448
                [2] 0x872f8448
        }
        qlen = 0x0
        donelist = (nil)
        donetail = 0x872f8470
        blimit = 0xa
        cpu = 0x0
        barrier = struct rcu_head {
                next = (nil)
                func = 0x0
        }
}

rcu_data cpu #1
>> px *(struct rcu_data *) 0x874be430
struct rcu_data {
        quiescbatch = 0xffffffffffffff98
        passed_quiesc = 0x1
        qs_pending = 0x0
        batch = 0xffffffffffffff97
        nxtlist = (nil)
        nxttail = {
                [0] 0x874be448
                [1] 0x874be448
                [2] 0x874be448
        }
        qlen = 0x0
        donelist = (nil)
        donetail = 0x874be470
        blimit = 0xa
        cpu = 0x1
        barrier = struct rcu_head {
                next = (nil)
                func = 0x0
        }
}

rcu_data cpu #2
>> px *(struct rcu_data *) 0x87684430
struct rcu_data {
        quiescbatch = 0xffffffffffffff99
        passed_quiesc = 0x1
        qs_pending = 0x0
        batch = 0xffffffffffffff99
        nxtlist = 0xffc1fc18
        nxttail = {
                [0] 0x87684448
                [1] 0x87684448
                [2] 0xffc1fc18
        }
        qlen = 0x1
        donelist = (nil)
        donetail = 0x87684470
        blimit = 0xa
        cpu = 0x2
        barrier = struct rcu_head {
                next = (nil)
                func = 0x0
        }
}

rcu_data cpu #3
>> px *(struct rcu_data *) 0x8784a430
struct rcu_data {
        quiescbatch = 0xffffffffffffff99
        passed_quiesc = 0x1
        qs_pending = 0x0
        batch = 0xffffffffffffff63
        nxtlist = (nil)
        nxttail = {
                [0] 0x8784a448
                [1] 0x8784a448
                [2] 0x8784a448
        }
        qlen = 0x0
        donelist = (nil)
        donetail = 0x8784a470
        blimit = 0xa
        cpu = 0x3
        barrier = struct rcu_head {
                next = (nil)
                func = 0x0
        }
}

At the time cpu #1 went to sleep rcu_needs_cpu must have answered false,
otherwise a 1 tick delay would have been programmed. rcu_pending compares
rcu_ctrlblk.cur with rcu_data.quiescbatch for cpu #1. So these two must
have been equal, otherwise rcu_needs_cpu would have answered true.
That means that the rcu_needs_cpu check has been completed before
rcu_start_batch for batch 0xffffffffffffff99. The bit for cpu #1 is
still set in the rcu_ctrlblk.cpumask, therefore the bit for cpu #1
in nohz_cpu_mask cannot have been set at the time rcu_start_batch has
completed. That gives the following race (cpu 0 is starting the batch,
cpu 1 is going to sleep):

cpu 1: tick_nohz_stop_sched_tick: rcu_needs_cpu();
cpu 0: rcu_start_batch: rcp->cur++;
cpu 0: rcu_start_batch: cpumask_andnot(to_cpumask(rcp->cpumask),
                                       cpu_online_mask, nohz_cpu_mask);
cpu 1: tick_nohz_stop_sched_tick: cpumask_set_cpu(1, nohz_cpu_mask);

The order of i) setting the bit in nohz_cpu_mask and ii) the rcu_needs_cpu()
check in tick_nohz_stop_sched_tick is wrong, no? Or did I miss some subtle
check that comes afterwards?

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU
From: Paul E. McKenney @ 2009-08-07 14:29 UTC
To: Martin Schwidefsky
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred

On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> Hi Paul,
> I analysed a dump of a hanging 2.6.30 system and found what I think is
> a bug of RCU vs NOHZ. There are a number of patches on top of that
> kernel but they should be independent of the bug.
>
> The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> sleep for NEXT_TIMER_MAX_DELTA ticks.

Congratulations, Martin! You have exercised what to date has been a
theoretical bug identified last year by Manfred Spraul. The fix is to
switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
2.6.29.

Of course, if you need to work with an old kernel version, you might
still need a patch, perhaps for the various -stable versions. If so,
please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
rather than backwards on CONFIG_CLASSIC_RCU.

Thanx, Paul

> Now if I look at the RCU data structures I find this:
>
> rcu_ctrlblk
> >> px *(struct rcu_ctrlblk *) 0x810000
> struct rcu_ctrlblk {
>         cur = 0xffffffffffffff99
>         completed = 0xffffffffffffff98
>         pending = 0xffffffffffffff99
>         signaled = 0x0
>         lock = spinlock_t {
>                 raw_lock = raw_spinlock_t {
>                         owner_cpu = 0x0
>                 }
>                 break_lock = 0x0
>                 magic = 0xdead4ead
>                 owner_cpu = 0xffffffff
>                 owner = 0xffffffffffffffff
>                 dep_map = struct lockdep_map {
>                         key = 0x810118
>                         class_cache = 0xcbcff0
>                         name = 0x63e944
>                         cpu = 0x0
>                         ip = 0x1a7f64
>                 }
>         }
>         cpumask = {
>                 [0] 0x2
>         }
> }
>
> rcu_data cpu #0
> >> px *(struct rcu_data *) 0x872f8430
> struct rcu_data {
>         quiescbatch = 0xffffffffffffff99
>         passed_quiesc = 0x1
>         qs_pending = 0x0
>         batch = 0xffffffffffffff97
>         nxtlist = (nil)
>         nxttail = {
>                 [0] 0x872f8448
>                 [1] 0x872f8448
>                 [2] 0x872f8448
>         }
>         qlen = 0x0
>         donelist = (nil)
>         donetail = 0x872f8470
>         blimit = 0xa
>         cpu = 0x0
>         barrier = struct rcu_head {
>                 next = (nil)
>                 func = 0x0
>         }
> }
>
> rcu_data cpu #1
> >> px *(struct rcu_data *) 0x874be430
> struct rcu_data {
>         quiescbatch = 0xffffffffffffff98
>         passed_quiesc = 0x1
>         qs_pending = 0x0
>         batch = 0xffffffffffffff97
>         nxtlist = (nil)
>         nxttail = {
>                 [0] 0x874be448
>                 [1] 0x874be448
>                 [2] 0x874be448
>         }
>         qlen = 0x0
>         donelist = (nil)
>         donetail = 0x874be470
>         blimit = 0xa
>         cpu = 0x1
>         barrier = struct rcu_head {
>                 next = (nil)
>                 func = 0x0
>         }
> }
>
> rcu_data cpu #2
> >> px *(struct rcu_data *) 0x87684430
> struct rcu_data {
>         quiescbatch = 0xffffffffffffff99
>         passed_quiesc = 0x1
>         qs_pending = 0x0
>         batch = 0xffffffffffffff99
>         nxtlist = 0xffc1fc18
>         nxttail = {
>                 [0] 0x87684448
>                 [1] 0x87684448
>                 [2] 0xffc1fc18
>         }
>         qlen = 0x1
>         donelist = (nil)
>         donetail = 0x87684470
>         blimit = 0xa
>         cpu = 0x2
>         barrier = struct rcu_head {
>                 next = (nil)
>                 func = 0x0
>         }
> }
>
> rcu_data cpu #3
> >> px *(struct rcu_data *) 0x8784a430
> struct rcu_data {
>         quiescbatch = 0xffffffffffffff99
>         passed_quiesc = 0x1
>         qs_pending = 0x0
>         batch = 0xffffffffffffff63
>         nxtlist = (nil)
>         nxttail = {
>                 [0] 0x8784a448
>                 [1] 0x8784a448
>                 [2] 0x8784a448
>         }
>         qlen = 0x0
>         donelist = (nil)
>         donetail = 0x8784a470
>         blimit = 0xa
>         cpu = 0x3
>         barrier = struct rcu_head {
>                 next = (nil)
>                 func = 0x0
>         }
> }
>
> At the time cpu #1 went to sleep rcu_needs_cpu must have answered false,
> otherwise a 1 tick delay would have been programmed. rcu_pending compares
> rcu_ctrlblk.cur with rcu_data.quiescbatch for cpu #1. So these two must
> have been equal, otherwise rcu_needs_cpu would have answered true.
> That means that the rcu_needs_cpu check has been completed before
> rcu_start_batch for batch 0xffffffffffffff99. The bit for cpu #1 is
> still set in the rcu_ctrlblk.cpumask, therefore the bit for cpu #1
> in nohz_cpu_mask cannot have been set at the time rcu_start_batch has
> completed. That gives the following race (cpu 0 is starting the batch,
> cpu 1 is going to sleep):
>
> cpu 1: tick_nohz_stop_sched_tick: rcu_needs_cpu();
> cpu 0: rcu_start_batch: rcp->cur++;
> cpu 0: rcu_start_batch: cpumask_andnot(to_cpumask(rcp->cpumask),
>                                        cpu_online_mask, nohz_cpu_mask);
> cpu 1: tick_nohz_stop_sched_tick: cpumask_set_cpu(1, nohz_cpu_mask);
>
> The order of i) setting the bit in nohz_cpu_mask and ii) the rcu_needs_cpu()
> check in tick_nohz_stop_sched_tick is wrong, no? Or did I miss some subtle
> check that comes afterwards?
>
> --
> blue skies,
> Martin.
>
> "Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU
From: Martin Schwidefsky @ 2009-08-10 12:25 UTC
To: paulmck
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred

On Fri, 7 Aug 2009 07:29:57 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > Hi Paul,
> > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > kernel but they should be independent of the bug.
> >
> > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > sleep for NEXT_TIMER_MAX_DELTA ticks.
>
> Congratulations, Martin! You have exercised what to date has been a
> theoretical bug identified last year by Manfred Spraul. The fix is to
> switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> 2.6.29.
>
> Of course, if you need to work with an old kernel version, you might
> still need a patch, perhaps for the various -stable versions. If so,
> please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> rather than backwards on CONFIG_CLASSIC_RCU.

SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
present there and I think it needs to be fixed :-/

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU
From: Paul E. McKenney @ 2009-08-10 15:08 UTC
To: Martin Schwidefsky
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred

On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> On Fri, 7 Aug 2009 07:29:57 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > Hi Paul,
> > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > kernel but they should be independent of the bug.
> > >
> > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> >
> > Congratulations, Martin! You have exercised what to date has been a
> > theoretical bug identified last year by Manfred Spraul. The fix is to
> > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > 2.6.29.
> >
> > Of course, if you need to work with an old kernel version, you might
> > still need a patch, perhaps for the various -stable versions. If so,
> > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > rather than backwards on CONFIG_CLASSIC_RCU.
>
> SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> present there and I think it needs to be fixed :-/

I was afraid of that. ;-)

Given that there are some other theoretical bugs in Classic RCU involving
interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
sense than playing whack-a-mole on Classic RCU bugs?

Thanx, Paul
* Re: [BUG] race of RCU vs NOHU
From: Martin Schwidefsky @ 2009-08-11 10:56 UTC
To: paulmck
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH

On Mon, 10 Aug 2009 08:08:07 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> > On Fri, 7 Aug 2009 07:29:57 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >
> > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > > Hi Paul,
> > > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > > kernel but they should be independent of the bug.
> > > >
> > > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> > >
> > > Congratulations, Martin! You have exercised what to date has been a
> > > theoretical bug identified last year by Manfred Spraul. The fix is to
> > > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > > 2.6.29.
> > >
> > > Of course, if you need to work with an old kernel version, you might
> > > still need a patch, perhaps for the various -stable versions. If so,
> > > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > > rather than backwards on CONFIG_CLASSIC_RCU.
> >
> > SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> > present there and I think it needs to be fixed :-/
>
> I was afraid of that. ;-)
>
> Given that there are some other theoretical bugs in Classic RCU involving
> interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
> sense than playing whack-a-mole on Classic RCU bugs?

Fine with me but I don't know if SuSE/Novell is willing to accept such a
big change for an existing distribution. I've put Ihno and Greg on Cc.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU
From: Paul E. McKenney @ 2009-08-11 14:52 UTC
To: Martin Schwidefsky
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH

On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote:
> On Mon, 10 Aug 2009 08:08:07 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> > > On Fri, 7 Aug 2009 07:29:57 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > > > Hi Paul,
> > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > > > kernel but they should be independent of the bug.
> > > > >
> > > > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> > > >
> > > > Congratulations, Martin! You have exercised what to date has been a
> > > > theoretical bug identified last year by Manfred Spraul. The fix is to
> > > > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > > > 2.6.29.
> > > >
> > > > Of course, if you need to work with an old kernel version, you might
> > > > still need a patch, perhaps for the various -stable versions. If so,
> > > > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > > > rather than backwards on CONFIG_CLASSIC_RCU.
> > >
> > > SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> > > present there and I think it needs to be fixed :-/
> >
> > I was afraid of that. ;-)
> >
> > Given that there are some other theoretical bugs in Classic RCU involving
> > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
> > sense than playing whack-a-mole on Classic RCU bugs?
>
> Fine with me but I don't know if SuSE/Novell is willing to accept such a
> big change for an existing distribution. I've put Ihno and Greg on Cc.

Good point! While they are thinking about the tradeoff between
whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I were
to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel
version(s) should I backport it?

Thanx, Paul
* Re: [BUG] race of RCU vs NOHU
From: Martin Schwidefsky @ 2009-08-11 15:17 UTC
To: paulmck
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH

On Tue, 11 Aug 2009 07:52:22 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote:
> > On Mon, 10 Aug 2009 08:08:07 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >
> > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> > > > On Fri, 7 Aug 2009 07:29:57 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > > > > Hi Paul,
> > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > > > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > > > > kernel but they should be independent of the bug.
> > > > > >
> > > > > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> > > > >
> > > > > Congratulations, Martin! You have exercised what to date has been a
> > > > > theoretical bug identified last year by Manfred Spraul. The fix is to
> > > > > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > > > > 2.6.29.
> > > > >
> > > > > Of course, if you need to work with an old kernel version, you might
> > > > > still need a patch, perhaps for the various -stable versions. If so,
> > > > > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > > > > rather than backwards on CONFIG_CLASSIC_RCU.
> > > >
> > > > SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> > > > present there and I think it needs to be fixed :-/
> > >
> > > I was afraid of that. ;-)
> > >
> > > Given that there are some other theoretical bugs in Classic RCU involving
> > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
> > > sense than playing whack-a-mole on Classic RCU bugs?
> >
> > Fine with me but I don't know if SuSE/Novell is willing to accept such a
> > big change for an existing distribution. I've put Ihno and Greg on Cc.
>
> Good point! While they are thinking about the tradeoff between
> whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I were
> to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel
> version(s) should I backport it?

We found the bug with kernel version 2.6.30 - the kernel on our test systems
still uses classic RCU. For us it is easy to switch to tree-RCU, no patch
required.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU
From: Paul E. McKenney @ 2009-08-11 18:04 UTC
To: Martin Schwidefsky
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH

On Tue, Aug 11, 2009 at 05:17:51PM +0200, Martin Schwidefsky wrote:
> On Tue, 11 Aug 2009 07:52:22 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>
> > On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote:
> > > On Mon, 10 Aug 2009 08:08:07 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> > > > > On Fri, 7 Aug 2009 07:29:57 -0700
> > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > >
> > > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > > > > > Hi Paul,
> > > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > > > > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > > > > > kernel but they should be independent of the bug.
> > > > > > >
> > > > > > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> > > > > >
> > > > > > Congratulations, Martin! You have exercised what to date has been a
> > > > > > theoretical bug identified last year by Manfred Spraul. The fix is to
> > > > > > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > > > > > 2.6.29.
> > > > > >
> > > > > > Of course, if you need to work with an old kernel version, you might
> > > > > > still need a patch, perhaps for the various -stable versions. If so,
> > > > > > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > > > > > rather than backwards on CONFIG_CLASSIC_RCU.
> > > > >
> > > > > SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> > > > > present there and I think it needs to be fixed :-/
> > > >
> > > > I was afraid of that. ;-)
> > > >
> > > > Given that there are some other theoretical bugs in Classic RCU involving
> > > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
> > > > sense than playing whack-a-mole on Classic RCU bugs?
> > >
> > > Fine with me but I don't know if SuSE/Novell is willing to accept such a
> > > big change for an existing distribution. I've put Ihno and Greg on Cc.
> >
> > Good point! While they are thinking about the tradeoff between
> > whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I were
> > to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel
> > version(s) should I backport it?
>
> We found the bug with kernel version 2.6.30 - the kernel on our test systems
> still uses classic RCU. For us it is easy to switch to tree-RCU, no patch
> required.

Ah! Could you please send me the test you use? My tests were
insufficient to force this problem to happen.

Thanx, Paul
* Re: [BUG] race of RCU vs NOHU
From: Martin Schwidefsky @ 2009-08-12 7:32 UTC
To: paulmck
Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH

On Tue, 11 Aug 2009 11:04:07 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Tue, Aug 11, 2009 at 05:17:51PM +0200, Martin Schwidefsky wrote:
> > On Tue, 11 Aug 2009 07:52:22 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >
> > > On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote:
> > > > On Mon, 10 Aug 2009 08:08:07 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > >
> > > > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote:
> > > > > > On Fri, 7 Aug 2009 07:29:57 -0700
> > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > >
> > > > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote:
> > > > > > > > Hi Paul,
> > > > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is
> > > > > > > > a bug of RCU vs NOHZ. There are a number of patches on top of that
> > > > > > > > kernel but they should be independent of the bug.
> > > > > > > >
> > > > > > > > The system has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up
> > > > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending
> > > > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to
> > > > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks.
> > > > > > >
> > > > > > > Congratulations, Martin! You have exercised what to date has been a
> > > > > > > theoretical bug identified last year by Manfred Spraul. The fix is to
> > > > > > > switch from CONFIG_CLASSIC_RCU to CONFIG_TREE_RCU, which was added in
> > > > > > > 2.6.29.
> > > > > > >
> > > > > > > Of course, if you need to work with an old kernel version, you might
> > > > > > > still need a patch, perhaps for the various -stable versions. If so,
> > > > > > > please let me know -- otherwise, I will focus forward on CONFIG_TREE_RCU
> > > > > > > rather than backwards on CONFIG_CLASSIC_RCU.
> > > > > >
> > > > > > SLES11 is 2.6.27 and uses classic RCU. The not-so-theoretical bug is
> > > > > > present there and I think it needs to be fixed :-/
> > > > >
> > > > > I was afraid of that. ;-)
> > > > >
> > > > > Given that there are some other theoretical bugs in Classic RCU involving
> > > > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more
> > > > > sense than playing whack-a-mole on Classic RCU bugs?
> > > >
> > > > Fine with me but I don't know if SuSE/Novell is willing to accept such a
> > > > big change for an existing distribution. I've put Ihno and Greg on Cc.
> > >
> > > Good point! While they are thinking about the tradeoff between
> > > whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I were
> > > to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel
> > > version(s) should I backport it?
> >
> > We found the bug with kernel version 2.6.30 - the kernel on our test systems
> > still uses classic RCU. For us it is easy to switch to tree-RCU, no patch
> > required.
>
> Ah! Could you please send me the test you use? My tests were
> insufficient to force this problem to happen.

There is no specific test, just a regular system boot. The boot did not
finish and our tester took a dump. This boot failure seems to happen from
time to time.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU 2009-08-12 7:32 ` Martin Schwidefsky @ 2009-08-21 15:54 ` Paul E. McKenney 2009-08-31 8:47 ` Martin Schwidefsky 0 siblings, 1 reply; 15+ messages in thread From: Paul E. McKenney @ 2009-08-21 15:54 UTC (permalink / raw) To: Martin Schwidefsky Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH On Wed, Aug 12, 2009 at 09:32:33AM +0200, Martin Schwidefsky wrote: > On Tue, 11 Aug 2009 11:04:07 -0700 > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Tue, Aug 11, 2009 at 05:17:51PM +0200, Martin Schwidefsky wrote: > > > On Tue, 11 Aug 2009 07:52:22 -0700 > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote: > > > > > On Mon, 10 Aug 2009 08:08:07 -0700 > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote: > > > > > > > On Fri, 7 Aug 2009 07:29:57 -0700 > > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote: > > > > > > > > > Hi Paul, > > > > > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is > > > > > > > > > a bug of RCU vs NOHZ. There are a number of patches ontop of that > > > > > > > > > kernel but they should be independent of the bug. > > > > > > > > > > > > > > > > > > The systems has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up > > > > > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending > > > > > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to > > > > > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks. > > > > > > > > > > > > > > > > Congratulations, Martin! You have exercised what to date has been a > > > > > > > > theoretical bug identified last year by Manfred Spraul. 
The fix is to > > > > > > > > switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in > > > > > > > > 2.6.29. > > > > > > > > > > > > > > > > Of course, if you need to work with an old kernel version, you might > > > > > > > > still need a patch, perhaps for the various -stable versions. If so, > > > > > > > > please let me know -- otherwise, I will focus forward on CONFIG_RCU_TREE > > > > > > > > rather than backwards on CONFIG_RCU_CLASSIC. > > > > > > > > > > > > > > SLES11 is 2.6.27 and uses classic RCU. The not-so theoretical bug is > > > > > > > present there and I think it needs to be fixed :-/ > > > > > > > > > > > > I was afraid of that. ;-) > > > > > > > > > > > > Given that there are some other theoretical bugs in Classic RCU involving > > > > > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more > > > > > > sense than playing whack-a-mole on Classic RCU bugs? > > > > > > > > > > Fine with me but I don't know if SuSE/Novell is willing to accept such a > > > > > big change for an existing distribution. I've put Ihno and Greg on Cc. > > > > > > > > Good point! While they are thinking about the tradeoff between > > > > whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I was > > > > to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel > > > > version(s) should I backport it to? > > > > > > We found the bug with kernel version 2.6.30 - the kernel on our test systems > > > still use classic RCU. For us it is easy to switch to tree-RCU, no patch > > > required. > > > > Ah! Could you please send me the test you use? My tests were > > insufficient to force this problem to happen. > > There is no specific test, just a regular system boot. The boot did not > finish and our tester took a dump. This boot failure seems to happen from > time to time. OK. Has CONFIG_TREE_RCU been working for you? If so, which variant of 2.6.27 do you need a backport to? 
Thanx, Paul
* Re: [BUG] race of RCU vs NOHU 2009-08-21 15:54 ` Paul E. McKenney @ 2009-08-31 8:47 ` Martin Schwidefsky 2009-08-31 14:30 ` Paul E. McKenney 0 siblings, 1 reply; 15+ messages in thread From: Martin Schwidefsky @ 2009-08-31 8:47 UTC (permalink / raw) To: paulmck Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH On Fri, 21 Aug 2009 08:54:18 -0700 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > On Wed, Aug 12, 2009 at 09:32:33AM +0200, Martin Schwidefsky wrote: > > On Tue, 11 Aug 2009 11:04:07 -0700 > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > On Tue, Aug 11, 2009 at 05:17:51PM +0200, Martin Schwidefsky wrote: > > > > On Tue, 11 Aug 2009 07:52:22 -0700 > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote: > > > > > > On Mon, 10 Aug 2009 08:08:07 -0700 > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote: > > > > > > > > On Fri, 7 Aug 2009 07:29:57 -0700 > > > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > > > > > > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote: > > > > > > > > > > Hi Paul, > > > > > > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is > > > > > > > > > > a bug of RCU vs NOHZ. There are a number of patches ontop of that > > > > > > > > > > kernel but they should be independent of the bug. > > > > > > > > > > > > > > > > > > > > The systems has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up > > > > > > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending > > > > > > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to > > > > > > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks. 
> > > > > > > > > > > > > > > > > > Congratulations, Martin! You have exercised what to date has been a > > > > > > > > > theoretical bug identified last year by Manfred Spraul. The fix is to > > > > > > > > > switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in > > > > > > > > > 2.6.29. > > > > > > > > > > > > > > > > > > Of course, if you need to work with an old kernel version, you might > > > > > > > > > still need a patch, perhaps for the various -stable versions. If so, > > > > > > > > > please let me know -- otherwise, I will focus forward on CONFIG_RCU_TREE > > > > > > > > > rather than backwards on CONFIG_RCU_CLASSIC. > > > > > > > > > > > > > > > > SLES11 is 2.6.27 and uses classic RCU. The not-so theoretical bug is > > > > > > > > present there and I think it needs to be fixed :-/ > > > > > > > > > > > > > > I was afraid of that. ;-) > > > > > > > > > > > > > > Given that there are some other theoretical bugs in Classic RCU involving > > > > > > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more > > > > > > > sense than playing whack-a-mole on Classic RCU bugs? > > > > > > > > > > > > Fine with me but I don't know if SuSE/Novell is willing to accept such a > > > > > > big change for an existing distribution. I've put Ihno and Greg on Cc. > > > > > > > > > > Good point! While they are thinking about the tradeoff between > > > > > whack-a-mole on Classic RCU and backporting CONFIG_TREE_RCU, if I was > > > > > to send you a patch backporting CONFIG_TREE_RCU, to exactly which kernel > > > > > version(s) should I backport it to? > > > > > > > > We found the bug with kernel version 2.6.30 - the kernel on our test systems > > > > still use classic RCU. For us it is easy to switch to tree-RCU, no patch > > > > required. > > > > > > Ah! Could you please send me the test you use? My tests were > > > insufficient to force this problem to happen. > > > > There is no specific test, just a regular system boot. 
The boot did not > > finish and our tester took a dump. This boot failure seems to happen from > > time to time. > > OK. Has CONFIG_TREE_RCU been working for you? If so, which variant > of 2.6.27 do you need a backport to? We changed the configuration of our test kernels to CONFIG_TREE_RCU. So far the problem has not shown up again. As we are dealing with a rare race here this has to be taken with a grain of salt. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin.
* Re: [BUG] race of RCU vs NOHU 2009-08-31 8:47 ` Martin Schwidefsky @ 2009-08-31 14:30 ` Paul E. McKenney 0 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2009-08-31 14:30 UTC (permalink / raw) To: Martin Schwidefsky Cc: linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich, Greg KH On Mon, Aug 31, 2009 at 10:47:28AM +0200, Martin Schwidefsky wrote: > On Fri, 21 Aug 2009 08:54:18 -0700 > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > On Wed, Aug 12, 2009 at 09:32:33AM +0200, Martin Schwidefsky wrote: > > > On Tue, 11 Aug 2009 11:04:07 -0700 > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > On Tue, Aug 11, 2009 at 05:17:51PM +0200, Martin Schwidefsky wrote: > > > > > On Tue, 11 Aug 2009 07:52:22 -0700 [ . . . ] > > > > > We found the bug with kernel version 2.6.30 - the kernel on our test systems > > > > > still use classic RCU. For us it is easy to switch to tree-RCU, no patch > > > > > required. > > > > > > > > Ah! Could you please send me the test you use? My tests were > > > > insufficient to force this problem to happen. > > > > > > There is no specific test, just a regular system boot. The boot did not > > > finish and our tester took a dump. This boot failure seems to happen from > > > time to time. > > > > OK. Has CONFIG_TREE_RCU been working for you? If so, which variant > > of 2.6.27 do you need a backport to? > > We changed the configuration of our test kernels to CONFIG_TREE_RCU. So > far the problem has not shown up again. As we are dealing with a rare race > here this has to be taken with a grain of salt. Thank you for trying it out! Did you by any chance record the success and failure statistics? Perhaps something like the number of failures per unit time, the time to first failure, the number of successful vs. failed reboots, or whatever? This would allow calculation of confidence statistics. Thanx, Paul
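The kind of confidence calculation Paul asks about can be sketched in a few lines of C. This is only an illustration under stated assumptions: the thread reports no failure counts, so any hang rate fed into it is hypothetical; the function names are invented for this example; and each boot is assumed to hang independently with the same probability. The sketch computes how likely a run of clean boots would be if the race were still present, and how many clean boots are needed before such a run is unlikely to be luck.

```c
#include <assert.h>

/*
 * Probability of observing `clean_boots` consecutive successful boots if
 * the hang still occurred, independently, with probability `p_fail` per
 * boot. A small result means the clean run is unlikely to be luck.
 * (Hypothetical helper; p_fail is an assumed, not a measured, rate.)
 */
double chance_of_clean_run(double p_fail, unsigned int clean_boots)
{
    double prob = 1.0;
    unsigned int i;

    assert(p_fail > 0.0 && p_fail < 1.0);
    for (i = 0; i < clean_boots; i++)
        prob *= 1.0 - p_fail;
    return prob;
}

/*
 * Smallest number of consecutive clean boots after which the chance of a
 * lucky clean run is `alpha` or less (alpha = 0.05 corresponds to 95%
 * confidence that the failure rate has really dropped).
 */
unsigned int boots_needed(double p_fail, double alpha)
{
    double prob = 1.0;
    unsigned int n = 0;

    assert(p_fail > 0.0 && p_fail < 1.0);
    assert(alpha > 0.0 && alpha < 1.0);
    while (prob > alpha) {
        prob *= 1.0 - p_fail;
        n++;
    }
    return n;
}
```

For example, if the Classic RCU kernel had hung on roughly 1 boot in 100 (a made-up figure), boots_needed(0.01, 0.05) yields 299, so about 300 clean boots with CONFIG_TREE_RCU would be needed before the absence of hangs carries 95% confidence. Without a recorded failure rate, no such figure can be derived, which is why the statistics matter.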
* Re: [BUG] race of RCU vs NOHU 2009-08-11 10:56 ` Martin Schwidefsky 2009-08-11 14:52 ` Paul E. McKenney @ 2009-08-11 16:58 ` Greg KH 1 sibling, 0 replies; 15+ messages in thread From: Greg KH @ 2009-08-11 16:58 UTC (permalink / raw) To: Martin Schwidefsky Cc: paulmck, linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred, Ihno Krumreich On Tue, Aug 11, 2009 at 12:56:53PM +0200, Martin Schwidefsky wrote: > On Mon, 10 Aug 2009 08:08:07 -0700 > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > On Mon, Aug 10, 2009 at 02:25:35PM +0200, Martin Schwidefsky wrote: > > > On Fri, 7 Aug 2009 07:29:57 -0700 > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > On Fri, Aug 07, 2009 at 03:15:29PM +0200, Martin Schwidefsky wrote: > > > > > Hi Paul, > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is > > > > > a bug of RCU vs NOHZ. There are a number of patches ontop of that > > > > > kernel but they should be independent of the bug. > > > > > > > > > > The systems has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up > > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending > > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to > > > > > sleep for NEXT_TIMER_MAX_DELTA ticks. > > > > > > > > Congratulations, Martin! You have exercised what to date has been a > > > > theoretical bug identified last year by Manfred Spraul. The fix is to > > > > switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in > > > > 2.6.29. > > > > > > > > Of course, if you need to work with an old kernel version, you might > > > > still need a patch, perhaps for the various -stable versions. If so, > > > > please let me know -- otherwise, I will focus forward on CONFIG_RCU_TREE > > > > rather than backwards on CONFIG_RCU_CLASSIC. > > > > > > SLES11 is 2.6.27 and uses classic RCU. 
The not-so theoretical bug is > > > present there and I think it needs to be fixed :-/ > > > > I was afraid of that. ;-) > > > > Given that there are some other theoretical bugs in Classic RCU involving > > interrupts and CONFIG_NO_HZ, would backporting CONFIG_TREE_RCU make more > > sense than playing whack-a-mole on Classic RCU bugs? > > Fine with me but I don't know if SuSE/Novell is willing to accept such a > big change for an existing distribution. I've put Ihno and Greg on Cc. File a bug at bugzilla.novell.com describing the problem, and the proper people at Novell will evaluate it there. thanks, greg k-h
* Re: [BUG] race of RCU vs NOHU 2009-08-10 12:25 ` Martin Schwidefsky 2009-08-10 15:08 ` Paul E. McKenney @ 2009-08-10 16:10 ` Pavel Machek 2009-08-11 21:23 ` Paul E. McKenney 1 sibling, 1 reply; 15+ messages in thread From: Pavel Machek @ 2009-08-10 16:10 UTC (permalink / raw) To: Martin Schwidefsky Cc: paulmck, linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred Hi! > > > I analysed a dump of a hanging 2.6.30 system and found what I think is > > > a bug of RCU vs NOHZ. There are a number of patches ontop of that > > > kernel but they should be independent of the bug. > > > > > > The systems has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to > > > sleep for NEXT_TIMER_MAX_DELTA ticks. > > > > Congratulations, Martin! You have exercised what to date has been a > > theoretical bug identified last year by Manfred Spraul. The fix is to > > switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in > > 2.6.29. > > > > Of course, if you need to work with an old kernel version, you might > > still need a patch, perhaps for the various -stable versions. If so, > > please let me know -- otherwise, I will focus forward on CONFIG_RCU_TREE > > rather than backwards on CONFIG_RCU_CLASSIC. > > SLES11 is 2.6.27 and uses classic RCU. The not-so theoretical bug is > present there and I think it needs to be fixed :-/ Plus... if config_rcu_classic is known to be buggy (and you are not willing to fix it), it should be disabled (or made to depend on !CONFIG_NOHZ, or maybe made to depend on CONFIG_BROKEN). Ugh. rcu_classic does not seem to have help text, and I'm running known bad code here :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: [BUG] race of RCU vs NOHU 2009-08-10 16:10 ` Pavel Machek @ 2009-08-11 21:23 ` Paul E. McKenney 0 siblings, 0 replies; 15+ messages in thread From: Paul E. McKenney @ 2009-08-11 21:23 UTC (permalink / raw) To: Pavel Machek Cc: Martin Schwidefsky, linux-kernel, Ingo Molnar, Thomas Gleixner, Gerald Schaefer, manfred On Mon, Aug 10, 2009 at 06:10:07PM +0200, Pavel Machek wrote: > Hi! > > > > > I analysed a dump of a hanging 2.6.30 system and found what I think is > > > > a bug of RCU vs NOHZ. There are a number of patches ontop of that > > > > kernel but they should be independent of the bug. > > > > > > > > The systems has 4 cpus and uses classic RCU. cpus #0, #2 and #3 woke up > > > > recently, cpu #1 has been sleeping for 5 minutes, but there is a pending > > > > rcu batch. The timer wheel for cpu #1 is empty, it will continue to > > > > sleep for NEXT_TIMER_MAX_DELTA ticks. > > > > > > Congratulations, Martin! You have exercised what to date has been a > > > theoretical bug identified last year by Manfred Spraul. The fix is to > > > switch from CONFIG_RCU_CLASSIC to CONFIG_RCU_TREE, which was added in > > > 2.6.29. > > > > > > Of course, if you need to work with an old kernel version, you might > > > still need a patch, perhaps for the various -stable versions. If so, > > > please let me know -- otherwise, I will focus forward on CONFIG_RCU_TREE > > > rather than backwards on CONFIG_RCU_CLASSIC. > > > > SLES11 is 2.6.27 and uses classic RCU. The not-so theoretical bug is > > present there and I think it needs to be fixed :-/ > > Plus... if config_rcu_classic is known buggy (and you are not willing > to fix it), it should be disabled (or made depend on !CONFIG_NOHZ, or > maybe made depend on CONFIG_BROKEN). Already done in -tip. ;-) paulmck@paulmck-laptop:/home/git/linux-2.6-tip$ ls kernel/rcuclassic.c ls: kernel/rcuclassic.c: No such file or directory Thanx, Paul > Ugh. 
rcu_classic does not seem to have help text, and I'm running > known bad code here :-(.
Thread overview: 15+ messages
2009-08-07 13:15 [BUG] race of RCU vs NOHU Martin Schwidefsky
2009-08-07 14:29 ` Paul E. McKenney
2009-08-10 12:25   ` Martin Schwidefsky
2009-08-10 15:08     ` Paul E. McKenney
2009-08-11 10:56       ` Martin Schwidefsky
2009-08-11 14:52         ` Paul E. McKenney
2009-08-11 15:17           ` Martin Schwidefsky
2009-08-11 18:04             ` Paul E. McKenney
2009-08-12  7:32               ` Martin Schwidefsky
2009-08-21 15:54                 ` Paul E. McKenney
2009-08-31  8:47                   ` Martin Schwidefsky
2009-08-31 14:30                     ` Paul E. McKenney
2009-08-11 16:58         ` Greg KH
2009-08-10 16:10     ` Pavel Machek
2009-08-11 21:23       ` Paul E. McKenney