* Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
@ 2011-02-17  3:39 Cypher Wu
  2011-02-17  4:49 ` Américo Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Cypher Wu @ 2011-02-17  3:39 UTC (permalink / raw)
  To: linux-kernel; +Cc: Chris Metcalf, Américo Wang, Eric Dumazet, netdev

---------- Forwarded message ----------
From: Cypher Wu <cypher.w@gmail.com>
Date: Wed, Feb 16, 2011 at 5:58 PM
Subject: GMP and rwlock: Dead ocurred again on TILEPro
To: linux-kernel@vger.kernel.org


The rwlock and spinlock implementations on the TILEPro platform use the
TNS instruction to test the lock value, but if interrupts are not masked,
read_lock() can still deadlock when read_lock() is also called from the
bottom half of an interrupt.

The original code:

void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
	u32 iterations = 0;
	do {
		if (!(val & 1))
			rwlock->lock = val;
		delay_backoff(iterations++);
		val = __insn_tns((int *)&rwlock->lock);
	} while ((val << RD_COUNT_WIDTH) != 0);
	rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

I've modified it to get some information:

void __raw_read_lock_slow(raw_rwlock_t *rwlock, u32 val)
{
	u32 iterations = 0;
	do {
		if (!(val & 1)) {
			rwlock->lock = val;
			iterations = 0;
		}
		delay_backoff(iterations++);
		if (iterations > 0x1000000) {
			dump_stack();
			iterations = 0;
		}
		val = __insn_tns((int *)&rwlock->lock);
	} while ((val << RD_COUNT_WIDTH) != 0);
	rwlock->lock = val + (1 << RD_COUNT_SHIFT);
}

And this is the stack info:

Starting stack dump of tid 837, pid 837 (ff0) on cpu 55 at cycle 10180633928773
 frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
 frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
 frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
 frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
 frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
 frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
 frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
 frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
 frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
 frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
 <interrupt 25 while in kernel mode>
 frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
 frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
 frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
 frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
 frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
 frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
 <syscall while in user mode>
 frame 16: 0x3342a0 (sp 0xbfddfb00)
 frame 17: 0x16130 (sp 0xbfddfb08)
 frame 18: 0x16640 (sp 0xbfddfb38)
 frame 19: 0x16ee8 (sp 0xbfddfc58)
 frame 20: 0x345a08 (sp 0xbfddfc90)
 frame 21: 0x10218 (sp 0xbfddfe48)
Stack dump complete

I don't know the precise definition of rwlock & spinlock in Linux, but the
implementations on other platforms like x86, PowerPC and ARM don't have
this issue. The use of TNS creates a race condition between system call
and interrupt context.

Along the call tree for packet sending some other rwlocks are taken as
well, for example read_lock(&fib_hash_lock) in fn_hash_lookup(), which is
called from ip_route_output_slow(). I've seen a deadlock on fib_hash_lock
too, but haven't reproduced it with this debug information yet. So IGMP
may not be the only case: the TCP retransmit timer will also end up
calling read_lock(&fib_hash_lock).

--
Cyberman Wu


--
Cyberman Wu

^ permalink raw reply [flat|nested] 20+ messages in thread
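
To make the race window in the report above concrete, here is a minimal
userspace model of what happens when a timer interrupt lands between the
TNS and the store-back on the same cpu. The lock-word layout, the tns()
helper and the constant are simplified stand-ins for illustration only;
this is not the actual arch/tile code.

/*
 * Minimal model of the TNS race window (illustration only).
 */
#include <stdint.h>
#include <stdio.h>

#define RD_COUNT_SHIFT 24

static uint32_t lock_word;           /* the single rwlock word */

static uint32_t tns(uint32_t *p)     /* models the test-and-set insn */
{
	uint32_t old = *p;
	*p = 1;                      /* word now reads as "tns in flight" */
	return old;
}

int main(void)
{
	/* Process context: read_lock() fast path starts. */
	uint32_t val = tns(&lock_word);          /* lock_word == 1 now */

	/* --- race window: timer interrupt fires before the store-back --- */

	/* Softirq on the same cpu calls read_lock() on the same lock.  It
	 * keeps seeing the low bit set, so it backs off and retries forever;
	 * the store-back below can never run because the softirq never
	 * returns -> deadlock. */
	if (tns(&lock_word) & 1)
		printf("softirq sees tns bit set and spins: deadlock\n");

	/* The store-back that would have re-published the lock word. */
	lock_word = val + (1 << RD_COUNT_SHIFT);
	return 0;
}

In the trace above, the interrupted store-back belongs to ip_mc_del_src()
in the syscall path, and the spinning caller is igmp_ifc_timer_expire()
in the softirq.
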
* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 3:39 Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu @ 2011-02-17 4:49 ` Américo Wang 2011-02-17 5:04 ` Cypher Wu 0 siblings, 1 reply; 20+ messages in thread From: Américo Wang @ 2011-02-17 4:49 UTC (permalink / raw) To: Cypher Wu Cc: linux-kernel, Chris Metcalf, Américo Wang, Eric Dumazet, netdev On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote: >---------- Forwarded message ---------- >From: Cypher Wu <cypher.w@gmail.com> >Date: Wed, Feb 16, 2011 at 5:58 PM >Subject: GMP and rwlock: Dead ocurred again on TILEPro >To: linux-kernel@vger.kernel.org > > >The rwlock and spinlock of TILEPro platform use TNS instruction to >test the value of lock, but if interrupt is not masked, read_lock() >have another chance to deadlock while read_lock() called in bh of >interrupt. > In this case, you should call read_lock_bh() instead of read_lock(). > frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8) > frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8) > frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0) > frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20) > frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30) > frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70) > frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98) > frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8) > frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0) > frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0) > <interrupt 25 while in kernel mode> > frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38) > frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40) > frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70) > frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90) > frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98) > frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0) > <syscall while in user mode> > frame 16: 0x3342a0 (sp 0xbfddfb00) > frame 17: 0x16130 (sp 0xbfddfb08) > frame 18: 0x16640 (sp 0xbfddfb38) > frame 19: 0x16ee8 (sp 0xbfddfc58) > frame 20: 0x345a08 (sp 0xbfddfc90) > frame 21: 0x10218 (sp 0xbfddfe48) >Stack dump complete > >I don't know the clear definition of rwlock & spinlock in Linux, but >the implementation of other platforms >like x86, PowerPC, ARM don't have that issue. The use of TNS cause a >race condition between system >call and interrupt. > Have you turned CONFIG_LOCKDEP on? I think Eric already converted that rwlock into RCU lock, thus this problem should disappear. Could you try a new kernel? Thanks. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
  2011-02-17  4:49 ` Américo Wang
@ 2011-02-17  5:04   ` Cypher Wu
  2011-02-17  5:42     ` Américo Wang
  0 siblings, 1 reply; 20+ messages in thread
From: Cypher Wu @ 2011-02-17  5:04 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 12:49 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 11:39:22AM +0800, Cypher Wu wrote:
>>---------- Forwarded message ----------
>>From: Cypher Wu <cypher.w@gmail.com>
>>Date: Wed, Feb 16, 2011 at 5:58 PM
>>Subject: GMP and rwlock: Dead ocurred again on TILEPro
>>To: linux-kernel@vger.kernel.org
>>
>>
>>The rwlock and spinlock of TILEPro platform use TNS instruction to
>>test the value of lock, but if interrupt is not masked, read_lock()
>>have another chance to deadlock while read_lock() called in bh of
>>interrupt.
>>
>
>
> In this case, you should call read_lock_bh() instead of read_lock().
>
>
>> frame 0: 0xfd3bfbe0 dump_stack+0x0/0x20 (sp 0xe4b5f9d8)
>> frame 1: 0xfd3c0b50 __raw_read_lock_slow.cold+0x50/0x90 (sp 0xe4b5f9d8)
>> frame 2: 0xfd184a58 igmpv3_send_cr+0x60/0x440 (sp 0xe4b5f9f0)
>> frame 3: 0xfd3bd928 igmp_ifc_timer_expire+0x30/0x90 (sp 0xe4b5fa20)
>> frame 4: 0xfd047698 run_timer_softirq+0x258/0x3c8 (sp 0xe4b5fa30)
>> frame 5: 0xfd0563f8 __do_softirq+0x138/0x220 (sp 0xe4b5fa70)
>> frame 6: 0xfd097d48 do_softirq+0x88/0x110 (sp 0xe4b5fa98)
>> frame 7: 0xfd1871f8 irq_exit+0xf8/0x120 (sp 0xe4b5faa8)
>> frame 8: 0xfd1afda0 do_timer_interrupt+0xa0/0xf8 (sp 0xe4b5fab0)
>> frame 9: 0xfd187b98 handle_interrupt+0x2d8/0x2e0 (sp 0xe4b5fac0)
>> <interrupt 25 while in kernel mode>
>> frame 10: 0xfd0241c8 _read_lock+0x8/0x40 (sp 0xe4b5fc38)
>> frame 11: 0xfd1bb008 ip_mc_del_src+0xc8/0x378 (sp 0xe4b5fc40)
>> frame 12: 0xfd2681e8 ip_mc_leave_group+0xf8/0x1e0 (sp 0xe4b5fc70)
>> frame 13: 0xfd0a4d70 do_ip_setsockopt+0xe48/0x1560 (sp 0xe4b5fc90)
>> frame 14: 0xfd2b4168 sys_setsockopt+0x150/0x170 (sp 0xe4b5fe98)
>> frame 15: 0xfd14e550 handle_syscall+0x2d0/0x320 (sp 0xe4b5fec0)
>> <syscall while in user mode>
>> frame 16: 0x3342a0 (sp 0xbfddfb00)
>> frame 17: 0x16130 (sp 0xbfddfb08)
>> frame 18: 0x16640 (sp 0xbfddfb38)
>> frame 19: 0x16ee8 (sp 0xbfddfc58)
>> frame 20: 0x345a08 (sp 0xbfddfc90)
>> frame 21: 0x10218 (sp 0xbfddfe48)
>>Stack dump complete
>>
>>I don't know the clear definition of rwlock & spinlock in Linux, but
>>the implementation of other platforms
>>like x86, PowerPC, ARM don't have that issue. The use of TNS cause a
>>race condition between system
>>call and interrupt.
>>
>
> Have you turned CONFIG_LOCKDEP on?
>
> I think Eric already converted that rwlock into RCU lock, thus
> this problem should disappear. Could you try a new kernel?
>
> Thanks.
>

I haven't turned CONFIG_LOCKDEP on for this test since it didn't give us
much information when we tried to figure out the former deadlock.

IGMP uses read_lock() instead of read_lock_bh() since read_lock() can
normally be called recursively, and today I've read the MIPS
implementation; it should also work fine in that situation. The TILEPro
implementation causes the problem because after it uses TNS to set the
lock value to 1, holding the original value, and before it writes a new
value back, there is a race condition window.

It's not practical for us to upgrade the kernel.

Thanks.

--
Cyberman Wu

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
  2011-02-17  5:04   ` Cypher Wu
@ 2011-02-17  5:42     ` Américo Wang
  2011-02-17  5:46       ` David Miller
  2011-02-17  6:42       ` Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu
  0 siblings, 2 replies; 20+ messages in thread
From: Américo Wang @ 2011-02-17  5:42 UTC (permalink / raw)
  To: Cypher Wu
  Cc: Américo Wang, linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>
>> Have you turned CONFIG_LOCKDEP on?
>>
>> I think Eric already converted that rwlock into RCU lock, thus
>> this problem should disappear. Could you try a new kernel?
>>
>> Thanks.
>>
>
>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too
>much information when we tried to figured out the former deadlock.
>
>IGMP used read_lock() instead of read_lock_bh() since usually
>read_lock() can be called recursively, and today I've read the
>implementation of MIPS, it's should also works fine in that situation.
>The implementation of TILEPro cause problem since after it use TNS set
>the lock-val to 1 and hold the original value and before it re-set
>lock-val a new value, it a race condition window.
>

I see no reason why you can't call read_lock_bh() recursively;
read_lock_bh() is roughly equivalent to local_bh_disable() + read_lock(),
and both can be recursive.

But I may miss something here. :-/

^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 5:42 ` Américo Wang @ 2011-02-17 5:46 ` David Miller 2011-02-17 6:39 ` Eric Dumazet 2011-02-17 22:49 ` Chris Metcalf 2011-02-17 6:42 ` Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu 1 sibling, 2 replies; 20+ messages in thread From: David Miller @ 2011-02-17 5:46 UTC (permalink / raw) To: xiyou.wangcong; +Cc: cypher.w, linux-kernel, cmetcalf, eric.dumazet, netdev From: Américo Wang <xiyou.wangcong@gmail.com> Date: Thu, 17 Feb 2011 13:42:37 +0800 > On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote: >>> >>> Have you turned CONFIG_LOCKDEP on? >>> >>> I think Eric already converted that rwlock into RCU lock, thus >>> this problem should disappear. Could you try a new kernel? >>> >>> Thanks. >>> >> >>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too >>much information when we tried to figured out the former deadlock. >> >>IGMP used read_lock() instead of read_lock_bh() since usually >>read_lock() can be called recursively, and today I've read the >>implementation of MIPS, it's should also works fine in that situation. >>The implementation of TILEPro cause problem since after it use TNS set >>the lock-val to 1 and hold the original value and before it re-set >>lock-val a new value, it a race condition window. >> > > I see no reason why you can't call read_lock_bh() recursively, > read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(), > both can be recursive. > > But I may miss something here. :-/ IGMP is doing this so that taking the read lock does not stop packet processing. TILEPro's rwlock implementation is simply buggy and needs to be fixed. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 5:46 ` David Miller @ 2011-02-17 6:39 ` Eric Dumazet 2011-02-17 22:49 ` Chris Metcalf 1 sibling, 0 replies; 20+ messages in thread From: Eric Dumazet @ 2011-02-17 6:39 UTC (permalink / raw) To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, cmetcalf, netdev Le mercredi 16 février 2011 à 21:46 -0800, David Miller a écrit : > From: Américo Wang <xiyou.wangcong@gmail.com> > Date: Thu, 17 Feb 2011 13:42:37 +0800 > > > On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote: > >>> > >>> Have you turned CONFIG_LOCKDEP on? > >>> > >>> I think Eric already converted that rwlock into RCU lock, thus > >>> this problem should disappear. Could you try a new kernel? > >>> > >>> Thanks. > >>> > >> > >>I haven't turned CONFIG_LOCKDEP on for test since I didn't get too > >>much information when we tried to figured out the former deadlock. > >> > >>IGMP used read_lock() instead of read_lock_bh() since usually > >>read_lock() can be called recursively, and today I've read the > >>implementation of MIPS, it's should also works fine in that situation. > >>The implementation of TILEPro cause problem since after it use TNS set > >>the lock-val to 1 and hold the original value and before it re-set > >>lock-val a new value, it a race condition window. > >> > > > > I see no reason why you can't call read_lock_bh() recursively, > > read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(), > > both can be recursive. > > > > But I may miss something here. :-/ > > IGMP is doing this so that taking the read lock does not stop packet > processing. > > TILEPro's rwlock implementation is simply buggy and needs to be fixed. Yep. Finding all recursive readlocks in kernel and convert them to another locking model is probably more expensive than fixing TILEPro rwlock implementation. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 5:46 ` David Miller 2011-02-17 6:39 ` Eric Dumazet @ 2011-02-17 22:49 ` Chris Metcalf 2011-02-17 22:53 ` David Miller 1 sibling, 1 reply; 20+ messages in thread From: Chris Metcalf @ 2011-02-17 22:49 UTC (permalink / raw) To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev On 2/17/2011 12:46 AM, David Miller wrote: > From: Américo Wang <xiyou.wangcong@gmail.com> > Date: Thu, 17 Feb 2011 13:42:37 +0800 > >> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote: >>>> Have you turned CONFIG_LOCKDEP on? >>>> >>>> I think Eric already converted that rwlock into RCU lock, thus >>>> this problem should disappear. Could you try a new kernel? >>>> >>>> Thanks. >>>> >>> I haven't turned CONFIG_LOCKDEP on for test since I didn't get too >>> much information when we tried to figured out the former deadlock. >>> >>> IGMP used read_lock() instead of read_lock_bh() since usually >>> read_lock() can be called recursively, and today I've read the >>> implementation of MIPS, it's should also works fine in that situation. >>> The implementation of TILEPro cause problem since after it use TNS set >>> the lock-val to 1 and hold the original value and before it re-set >>> lock-val a new value, it a race condition window. >>> >> I see no reason why you can't call read_lock_bh() recursively, >> read_lock_bh() is roughly equalent to local_bh_disable() + read_lock(), >> both can be recursive. >> >> But I may miss something here. :-/ > IGMP is doing this so that taking the read lock does not stop packet > processing. > > TILEPro's rwlock implementation is simply buggy and needs to be fixed. Cypher, thanks for tracking this down with a good bug report. The fix is to disable interrupts for the arch_read_lock family of methods. In my fix I'm using the "hard" disable that locks out NMIs as well, so that in the event the NMI handler needs to share an rwlock with regular code it would be possible (plus, it's more efficient). I believe it's not necessary to worry about similar protection for the arch_write_lock methods, since they aren't guaranteed to be re-entrant anyway (you'd have to use write_lock_irqsave or equivalent). I'll send the patch to LKML after letting it bake internally for a little while. Thanks again! -- Chris Metcalf, Tilera Corp. http://www.tilera.com ^ permalink raw reply [flat|nested] 20+ messages in thread
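
The distinction Chris draws between the read and write paths corresponds
to the usual rwlock usage rules. A minimal sketch of those rules with the
generic kernel API follows; the lock name and the functions around it are
placeholders for illustration, not code taken from IGMP or the Tile tree.

#include <linux/spinlock.h>

static DEFINE_RWLOCK(my_lock);      /* placeholder lock */

/* Process context: may be interrupted while holding the read lock. */
static void reader_syscall_path(void)
{
	read_lock(&my_lock);
	/* ... read shared state ... */
	read_unlock(&my_lock);
}

/* Softirq/timer context: taking the same read lock again is legal,
 * which is exactly why arch_read_lock must tolerate this nesting. */
static void reader_softirq_path(void)
{
	read_lock(&my_lock);
	/* ... read shared state ... */
	read_unlock(&my_lock);
}

/* A writer shared with interrupt-context users must mask interrupts
 * itself; arch_write_lock is not expected to be re-entrant. */
static void writer_path(void)
{
	unsigned long flags;

	write_lock_irqsave(&my_lock, flags);
	/* ... modify shared state ... */
	write_unlock_irqrestore(&my_lock, flags);
}

The point is that the first two paths may nest via an interrupt on the
same cpu, so the arch-level read lock has to tolerate that, while a
writer is expected to go through write_lock_irqsave() or equivalent.
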
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 22:49 ` Chris Metcalf @ 2011-02-17 22:53 ` David Miller 2011-02-17 23:04 ` Chris Metcalf 0 siblings, 1 reply; 20+ messages in thread From: David Miller @ 2011-02-17 22:53 UTC (permalink / raw) To: cmetcalf; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev From: Chris Metcalf <cmetcalf@tilera.com> Date: Thu, 17 Feb 2011 17:49:46 -0500 > The fix is to disable interrupts for the arch_read_lock family of methods. How does that help handle the race when it happens between different cpus, instead of between IRQ and non-IRQ context on the same CPU? Why don't you just use the generic spinlock based rwlock code on Tile, since that is all that your atomic instructions can handle sufficiently? ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 22:53 ` David Miller @ 2011-02-17 23:04 ` Chris Metcalf 2011-02-17 23:11 ` David Miller 0 siblings, 1 reply; 20+ messages in thread From: Chris Metcalf @ 2011-02-17 23:04 UTC (permalink / raw) To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev On 2/17/2011 5:53 PM, David Miller wrote: > From: Chris Metcalf <cmetcalf@tilera.com> > Date: Thu, 17 Feb 2011 17:49:46 -0500 > >> The fix is to disable interrupts for the arch_read_lock family of methods. > How does that help handle the race when it happens between different > cpus, instead of between IRQ and non-IRQ context on the same CPU? There's no race in that case, since the lock code properly backs off and retries until the other cpu frees it. The distinction here is that the non-IRQ context is "wedged" by the IRQ context. > Why don't you just use the generic spinlock based rwlock code on Tile, > since that is all that your atomic instructions can handle > sufficiently? The tile-specific code encodes reader/writer information in the same 32-bit word that the test-and-set instruction manipulates, so it's more efficient both in space and time. This may not really matter for rwlocks, since no one cares much about them any more, but that was the motivation. -- Chris Metcalf, Tilera Corp. http://www.tilera.com ^ permalink raw reply [flat|nested] 20+ messages in thread
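
For readers following along, here is a small standalone illustration of
how a reader count and a tns marker share a single 32-bit word, using the
RD_COUNT_SHIFT/RD_COUNT_WIDTH values that appear in the code quoted
elsewhere in this thread; the writer-ticket bits are left out because
their exact positions are not shown here.

#include <stdint.h>
#include <stdio.h>

#define RD_COUNT_SHIFT 24
#define RD_COUNT_WIDTH 8

static uint32_t readers(uint32_t lockval)
{
	return lockval >> RD_COUNT_SHIFT;     /* reader count: high 8 bits */
}

static int tns_in_progress(uint32_t lockval)
{
	return lockval & 1;                   /* low bit is set by tns */
}

int main(void)
{
	uint32_t lockval = 0;

	lockval += 1u << RD_COUNT_SHIFT;      /* one reader takes the lock */
	printf("readers=%u tns=%d\n", readers(lockval), tns_in_progress(lockval));

	lockval = 1;                          /* mid tns/store-back window */
	printf("readers=%u tns=%d\n", readers(lockval), tns_in_progress(lockval));
	return 0;
}

While a tns/store-back pair is in flight the whole word reads as 1, so
any other locker (including the same cpu, if interrupted) sees the low
bit set and must retry, which is where the deadlock lives.
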
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 23:04 ` Chris Metcalf @ 2011-02-17 23:11 ` David Miller 2011-02-17 23:18 ` Chris Metcalf 0 siblings, 1 reply; 20+ messages in thread From: David Miller @ 2011-02-17 23:11 UTC (permalink / raw) To: cmetcalf; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev From: Chris Metcalf <cmetcalf@tilera.com> Date: Thu, 17 Feb 2011 18:04:13 -0500 > On 2/17/2011 5:53 PM, David Miller wrote: >> From: Chris Metcalf <cmetcalf@tilera.com> >> Date: Thu, 17 Feb 2011 17:49:46 -0500 >> >>> The fix is to disable interrupts for the arch_read_lock family of methods. >> How does that help handle the race when it happens between different >> cpus, instead of between IRQ and non-IRQ context on the same CPU? > > There's no race in that case, since the lock code properly backs off and > retries until the other cpu frees it. The distinction here is that the > non-IRQ context is "wedged" by the IRQ context. > >> Why don't you just use the generic spinlock based rwlock code on Tile, >> since that is all that your atomic instructions can handle >> sufficiently? > > The tile-specific code encodes reader/writer information in the same 32-bit > word that the test-and-set instruction manipulates, so it's more efficient > both in space and time. This may not really matter for rwlocks, since no > one cares much about them any more, but that was the motivation. Ok, but IRQ disabling is going to be very expensive. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 23:11 ` David Miller @ 2011-02-17 23:18 ` Chris Metcalf 2011-02-18 3:16 ` Cypher Wu ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Chris Metcalf @ 2011-02-17 23:18 UTC (permalink / raw) To: David Miller; +Cc: xiyou.wangcong, cypher.w, linux-kernel, eric.dumazet, netdev On 2/17/2011 6:11 PM, David Miller wrote: > From: Chris Metcalf <cmetcalf@tilera.com> > Date: Thu, 17 Feb 2011 18:04:13 -0500 > >> On 2/17/2011 5:53 PM, David Miller wrote: >>> From: Chris Metcalf <cmetcalf@tilera.com> >>> Date: Thu, 17 Feb 2011 17:49:46 -0500 >>> >>>> The fix is to disable interrupts for the arch_read_lock family of methods. >>> How does that help handle the race when it happens between different >>> cpus, instead of between IRQ and non-IRQ context on the same CPU? >> There's no race in that case, since the lock code properly backs off and >> retries until the other cpu frees it. The distinction here is that the >> non-IRQ context is "wedged" by the IRQ context. >> >>> Why don't you just use the generic spinlock based rwlock code on Tile, >>> since that is all that your atomic instructions can handle >>> sufficiently? >> The tile-specific code encodes reader/writer information in the same 32-bit >> word that the test-and-set instruction manipulates, so it's more efficient >> both in space and time. This may not really matter for rwlocks, since no >> one cares much about them any more, but that was the motivation. > Ok, but IRQ disabling is going to be very expensive. The interrupt architecture on Tile allows a write to a special-purpose register to put you into a "critical section" where no interrupts or faults are delivered. So we just need to bracket the read_lock operations with two SPR writes; each takes six machine cycles, so we're only adding 12 cycles to the total cost of taking or releasing a read lock on an rwlock. -- Chris Metcalf, Tilera Corp. http://www.tilera.com ^ permalink raw reply [flat|nested] 20+ messages in thread
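
In code terms, the bracketing Chris describes looks roughly like the
read-lock path below. The intrinsics and constants follow the code
already quoted in this thread, and the authoritative version is the patch
posted later in the thread; this is only a sketch of the shape of the fix.

static void arch_read_lock_sketch(arch_rwlock_t *rwlock)
{
	u32 val, iterations = 0;

	for (;;) {
		/* enter the interrupt critical section: one SPR write (~6 cycles) */
		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1);
		val = __insn_tns((int *)&rwlock->lock);
		if (likely((val << RD_COUNT_WIDTH) == 0)) {
			/* no writer or in-flight tns: publish the bumped
			 * reader count and leave the critical section */
			rwlock->lock = val + (1 << RD_COUNT_SHIFT);
			__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
			return;
		}
		if (!(val & 1))
			rwlock->lock = val;	/* undo our tns */
		/* leave the critical section: second SPR write (~6 cycles) */
		__insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0);
		/* back off outside the critical section, then retry */
		delay_backoff(iterations++);
	}
}
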
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 23:18 ` Chris Metcalf @ 2011-02-18 3:16 ` Cypher Wu 2011-02-18 3:19 ` Cypher Wu 2011-02-18 7:08 ` Cypher Wu 2 siblings, 0 replies; 20+ messages in thread From: Cypher Wu @ 2011-02-18 3:16 UTC (permalink / raw) To: Chris Metcalf Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: > On 2/17/2011 6:11 PM, David Miller wrote: >> From: Chris Metcalf <cmetcalf@tilera.com> >> Date: Thu, 17 Feb 2011 18:04:13 -0500 >> >>> On 2/17/2011 5:53 PM, David Miller wrote: >>>> From: Chris Metcalf <cmetcalf@tilera.com> >>>> Date: Thu, 17 Feb 2011 17:49:46 -0500 >>>> >>>>> The fix is to disable interrupts for the arch_read_lock family of methods. >>>> How does that help handle the race when it happens between different >>>> cpus, instead of between IRQ and non-IRQ context on the same CPU? >>> There's no race in that case, since the lock code properly backs off and >>> retries until the other cpu frees it. The distinction here is that the >>> non-IRQ context is "wedged" by the IRQ context. >>> >>>> Why don't you just use the generic spinlock based rwlock code on Tile, >>>> since that is all that your atomic instructions can handle >>>> sufficiently? >>> The tile-specific code encodes reader/writer information in the same 32-bit >>> word that the test-and-set instruction manipulates, so it's more efficient >>> both in space and time. This may not really matter for rwlocks, since no >>> one cares much about them any more, but that was the motivation. >> Ok, but IRQ disabling is going to be very expensive. > > The interrupt architecture on Tile allows a write to a special-purpose > register to put you into a "critical section" where no interrupts or faults > are delivered. So we just need to bracket the read_lock operations with > two SPR writes; each takes six machine cycles, so we're only adding 12 > cycles to the total cost of taking or releasing a read lock on an rwlock. > > -- > Chris Metcalf, Tilera Corp. > http://www.tilera.com > > I agree that just lock interrupt for read operations should be enough, but read_unlock() is also the place we should lock interrupt, right? If interrupt occurred when it hold lock-val after TNS deadlock still can occur. When will you release out that patch? Since time is tight, so maybe I've to fix-up it myself. Though the problem is clearly now, I still have two questions to confirm: 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the interrupt which claimed 'CM', is that right? Should we have to same its original value and restore it later? There is some code in Linux: int __tns_atomic_acquire(atomic_t *lock) { int ret; u32 iterations = 0; BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION)); __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); while((ret = __insn_tns(&lock->counter)) == 1) delay_backoff(iterations++); return ret; } It just use BUG_ON to check SPR_INTERRUPT_CRITICAL_SECTION have to be 0, is that means that SPR is only used in special situations like that? 2. Should we lock interrupt for the whole operation of read_lock()/read_unlock(), or we should leave interrupt critical section if it run into __raw_read_lock_slow() and before have to delay_backoff() some time, and re-enter interrupt critical section again before TNS? -- Cyberman Wu ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 23:18 ` Chris Metcalf 2011-02-18 3:16 ` Cypher Wu @ 2011-02-18 3:19 ` Cypher Wu 2011-02-18 7:08 ` Cypher Wu 2 siblings, 0 replies; 20+ messages in thread From: Cypher Wu @ 2011-02-18 3:19 UTC (permalink / raw) To: Chris Metcalf Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: > On 2/17/2011 6:11 PM, David Miller wrote: >> From: Chris Metcalf <cmetcalf@tilera.com> >> Date: Thu, 17 Feb 2011 18:04:13 -0500 >> >>> On 2/17/2011 5:53 PM, David Miller wrote: >>>> From: Chris Metcalf <cmetcalf@tilera.com> >>>> Date: Thu, 17 Feb 2011 17:49:46 -0500 >>>> >>>>> The fix is to disable interrupts for the arch_read_lock family of methods. >>>> How does that help handle the race when it happens between different >>>> cpus, instead of between IRQ and non-IRQ context on the same CPU? >>> There's no race in that case, since the lock code properly backs off and >>> retries until the other cpu frees it. The distinction here is that the >>> non-IRQ context is "wedged" by the IRQ context. >>> >>>> Why don't you just use the generic spinlock based rwlock code on Tile, >>>> since that is all that your atomic instructions can handle >>>> sufficiently? >>> The tile-specific code encodes reader/writer information in the same 32-bit >>> word that the test-and-set instruction manipulates, so it's more efficient >>> both in space and time. This may not really matter for rwlocks, since no >>> one cares much about them any more, but that was the motivation. >> Ok, but IRQ disabling is going to be very expensive. > > The interrupt architecture on Tile allows a write to a special-purpose > register to put you into a "critical section" where no interrupts or faults > are delivered. So we just need to bracket the read_lock operations with > two SPR writes; each takes six machine cycles, so we're only adding 12 > cycles to the total cost of taking or releasing a read lock on an rwlock. > > -- > Chris Metcalf, Tilera Corp. > http://www.tilera.com > > Bye the way, other RISC platforms, say ARM and MIPS, use store conditional rather that TNS a temp value for lock-val, does Fx have similar instructions? -- Cyberman Wu ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-17 23:18 ` Chris Metcalf 2011-02-18 3:16 ` Cypher Wu 2011-02-18 3:19 ` Cypher Wu @ 2011-02-18 7:08 ` Cypher Wu 2011-02-18 21:51 ` Chris Metcalf 2011-03-01 18:30 ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf 2 siblings, 2 replies; 20+ messages in thread From: Cypher Wu @ 2011-02-18 7:08 UTC (permalink / raw) To: Chris Metcalf Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: > On 2/17/2011 6:11 PM, David Miller wrote: >> From: Chris Metcalf <cmetcalf@tilera.com> >> Date: Thu, 17 Feb 2011 18:04:13 -0500 >> >>> On 2/17/2011 5:53 PM, David Miller wrote: >>>> From: Chris Metcalf <cmetcalf@tilera.com> >>>> Date: Thu, 17 Feb 2011 17:49:46 -0500 >>>> >>>>> The fix is to disable interrupts for the arch_read_lock family of methods. >>>> How does that help handle the race when it happens between different >>>> cpus, instead of between IRQ and non-IRQ context on the same CPU? >>> There's no race in that case, since the lock code properly backs off and >>> retries until the other cpu frees it. The distinction here is that the >>> non-IRQ context is "wedged" by the IRQ context. >>> >>>> Why don't you just use the generic spinlock based rwlock code on Tile, >>>> since that is all that your atomic instructions can handle >>>> sufficiently? >>> The tile-specific code encodes reader/writer information in the same 32-bit >>> word that the test-and-set instruction manipulates, so it's more efficient >>> both in space and time. This may not really matter for rwlocks, since no >>> one cares much about them any more, but that was the motivation. >> Ok, but IRQ disabling is going to be very expensive. > > The interrupt architecture on Tile allows a write to a special-purpose > register to put you into a "critical section" where no interrupts or faults > are delivered. So we just need to bracket the read_lock operations with > two SPR writes; each takes six machine cycles, so we're only adding 12 > cycles to the total cost of taking or releasing a read lock on an rwlock. > > -- > Chris Metcalf, Tilera Corp. > http://www.tilera.com > > Adding that to SPR writes should be fine, but it may cause interrupt delay a little more that other platform's read_lock()? Another question: What NMI in the former mail means? Looking forward to your patch. Regards -- Cyberman Wu ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-18 7:08 ` Cypher Wu @ 2011-02-18 21:51 ` Chris Metcalf 2011-02-19 4:07 ` Cypher Wu 2011-03-01 18:30 ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf 1 sibling, 1 reply; 20+ messages in thread From: Chris Metcalf @ 2011-02-18 21:51 UTC (permalink / raw) To: Cypher Wu Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On 2/17/2011 10:16 PM, Cypher Wu wrote: > On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: >> The interrupt architecture on Tile allows a write to a special-purpose >> register to put you into a "critical section" where no interrupts or faults >> are delivered. So we just need to bracket the read_lock operations with >> two SPR writes; each takes six machine cycles, so we're only adding 12 >> cycles to the total cost of taking or releasing a read lock on an rwlock > I agree that just lock interrupt for read operations should be enough, > but read_unlock() is also the place we should lock interrupt, right? > If interrupt occurred when it hold lock-val after TNS deadlock still > can occur. Correct; that's what I meant by "read_lock operations". This include lock, trylock, and unlock. > When will you release out that patch? Since time is tight, so maybe > I've to fix-up it myself. I heard from one of our support folks that you were asking through that channel, so I asked him to go ahead and give you the spinlock sources directly. I will be spending time next week syncing up our internal tree with the public git repository so you'll see it on LKML at that time. > 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the > interrupt which claimed 'CM', is that right? Should we have to same > its original value and restore it later? We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is almost always zero except in very specific situations. > 2. Should we lock interrupt for the whole operation of > read_lock()/read_unlock(), or we should leave interrupt critical > section if it run into __raw_read_lock_slow() and before have to > delay_backoff() some time, and re-enter interrupt critical section > again before TNS? Correct, the fix only holds the critical section around the tns and the write-back, not during the delay_backoff(). > Bye the way, other RISC platforms, say ARM and MIPS, use store > conditional rather that TNS a temp value for lock-val, does Fx have > similar instructions? TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit processor) has a full array of atomic instructions. > Adding that to SPR writes should be fine, but it may cause interrupt > delay a little more that other platform's read_lock()? A little, but I think it's in the noise relative to the basic cost of read_lock in the absence of full-fledged atomic instructions. > Another question: What NMI in the former mail means? Non-maskable interrupt, such as performance counter interrupts. -- Chris Metcalf, Tilera Corp. http://www.tilera.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-18 21:51 ` Chris Metcalf @ 2011-02-19 4:07 ` Cypher Wu 2011-02-20 13:33 ` Chris Metcalf 0 siblings, 1 reply; 20+ messages in thread From: Cypher Wu @ 2011-02-19 4:07 UTC (permalink / raw) To: Chris Metcalf Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On Sat, Feb 19, 2011 at 5:51 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: > On 2/17/2011 10:16 PM, Cypher Wu wrote: >> On Fri, Feb 18, 2011 at 7:18 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: >>> The interrupt architecture on Tile allows a write to a special-purpose >>> register to put you into a "critical section" where no interrupts or faults >>> are delivered. So we just need to bracket the read_lock operations with >>> two SPR writes; each takes six machine cycles, so we're only adding 12 >>> cycles to the total cost of taking or releasing a read lock on an rwlock >> I agree that just lock interrupt for read operations should be enough, >> but read_unlock() is also the place we should lock interrupt, right? >> If interrupt occurred when it hold lock-val after TNS deadlock still >> can occur. > > Correct; that's what I meant by "read_lock operations". This include lock, > trylock, and unlock. > >> When will you release out that patch? Since time is tight, so maybe >> I've to fix-up it myself. > > I heard from one of our support folks that you were asking through that > channel, so I asked him to go ahead and give you the spinlock sources > directly. I will be spending time next week syncing up our internal tree > with the public git repository so you'll see it on LKML at that time. > >> 1. If we use SPR_INTERRUPT_CRITICAL_SECTION it will disable all the >> interrupt which claimed 'CM', is that right? Should we have to same >> its original value and restore it later? > > We don't need to save and restore, since INTERRUPT_CRITICAL_SECTION is > almost always zero except in very specific situations. > >> 2. Should we lock interrupt for the whole operation of >> read_lock()/read_unlock(), or we should leave interrupt critical >> section if it run into __raw_read_lock_slow() and before have to >> delay_backoff() some time, and re-enter interrupt critical section >> again before TNS? > > Correct, the fix only holds the critical section around the tns and the > write-back, not during the delay_backoff(). > >> Bye the way, other RISC platforms, say ARM and MIPS, use store >> conditional rather that TNS a temp value for lock-val, does Fx have >> similar instructions? > > TILEPro does not have anything more than test-and-set; TILE-Gx (the 64-bit > processor) has a full array of atomic instructions. > >> Adding that to SPR writes should be fine, but it may cause interrupt >> delay a little more that other platform's read_lock()? > > A little, but I think it's in the noise relative to the basic cost of > read_lock in the absence of full-fledged atomic instructions. > >> Another question: What NMI in the former mail means? > > Non-maskable interrupt, such as performance counter interrupts. > > -- > Chris Metcalf, Tilera Corp. > http://www.tilera.com > > I've got your source code, thank you very much. There is still two more question: 1. Why we merge the inlined code and the *_slow into none inlined functions? 2. I've seen the use of 'mb()' in unlock operation, but we don't use that in the lock operation. I've released a temporary version with that modification under our customer' demand, since they want to do a long time test though this weekend. 
I'll appreciate that if you gave some comment on my modifications: *** /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/include/asm-tile/spinlock_32.h 2010-04-02 11:07:47.000000000 +0800 --- include/asm-tile/spinlock_32.h 2011-02-18 17:09:40.000000000 +0800 *************** *** 12,17 **** --- 12,27 ---- * more details. * * 32-bit SMP spinlocks. + * + * + * The use of TNS instruction cause race condition for system call and + * interrupt, so we have to lock interrupt when we trying lock-value. + * However, since write_lock() is exclusive so if we really need to + * operate it in interrupt then system call have to use write_lock_irqsave(), + * So it don't need to lock interrupt here. + * Spinlock is also exclusive so we don't take care about it. + * + * Modified by Cyberman Wu on Feb 18th, 2011. */ #ifndef _ASM_TILE_SPINLOCK_32_H *************** void __raw_read_unlock_slow(raw_rwlock_t *** 86,91 **** --- 96,114 ---- void __raw_write_lock_slow(raw_rwlock_t *, u32); void __raw_write_unlock_slow(raw_rwlock_t *, u32); + + static inline void __raw_read_lock_enter_critical(void) + { + BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION)); + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); + } + + static inline void __raw_read_lock_leave_critical(void) + { + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + } + + /** * __raw_read_can_lock() - would read_trylock() succeed? */ *************** static inline int __raw_write_can_lock(r *** 107,121 **** */ static inline void __raw_read_lock(raw_rwlock_t *rwlock) { ! u32 val = __insn_tns((int *)&rwlock->lock); if (unlikely(val << _RD_COUNT_WIDTH)) { #ifdef __TILECC__ #pragma frequency_hint NEVER #endif __raw_read_lock_slow(rwlock, val); return; } rwlock->lock = val + (1 << _RD_COUNT_SHIFT); } /** --- 130,148 ---- */ static inline void __raw_read_lock(raw_rwlock_t *rwlock) { ! u32 val; ! __raw_read_lock_enter_critical(); ! /*u32 */val = __insn_tns((int *)&rwlock->lock); if (unlikely(val << _RD_COUNT_WIDTH)) { #ifdef __TILECC__ #pragma frequency_hint NEVER #endif __raw_read_lock_slow(rwlock, val); + __raw_read_lock_leave_critical(); return; } rwlock->lock = val + (1 << _RD_COUNT_SHIFT); + __raw_read_lock_leave_critical(); } /** *************** static inline void __raw_write_lock(raw_ *** 140,154 **** static inline int __raw_read_trylock(raw_rwlock_t *rwlock) { int locked; ! u32 val = __insn_tns((int *)&rwlock->lock); if (unlikely(val & 1)) { #ifdef __TILECC__ #pragma frequency_hint NEVER #endif ! return __raw_read_trylock_slow(rwlock); } locked = (val << _RD_COUNT_WIDTH) == 0; rwlock->lock = val + (locked << _RD_COUNT_SHIFT); return locked; } --- 167,187 ---- static inline int __raw_read_trylock(raw_rwlock_t *rwlock) { int locked; ! u32 val; ! __raw_read_lock_enter_critical(); ! /*u32 */val = __insn_tns((int *)&rwlock->lock); if (unlikely(val & 1)) { #ifdef __TILECC__ #pragma frequency_hint NEVER #endif ! // return __raw_read_trylock_slow(rwlock); ! locked =__raw_read_trylock_slow(rwlock); ! __raw_read_lock_leave_critical(); ! 
return locked; } locked = (val << _RD_COUNT_WIDTH) == 0; rwlock->lock = val + (locked << _RD_COUNT_SHIFT); + __raw_read_lock_leave_critical(); return locked; } *************** static inline void __raw_read_unlock(raw *** 184,198 **** --- 217,234 ---- { u32 val; mb(); + __raw_read_lock_enter_critical(); val = __insn_tns((int *)&rwlock->lock); if (unlikely(val & 1)) { #ifdef __TILECC__ #pragma frequency_hint NEVER #endif __raw_read_unlock_slow(rwlock); + __raw_read_lock_leave_critical(); return; } rwlock->lock = val - (1 << _RD_COUNT_SHIFT); + __raw_read_lock_leave_critical(); } --- /opt/TileraMDE-2.1.0.98943/tilepro/src/sys/linux/arch/tile/lib/spinlock_32.c 2010-04-02 11:08:02.000000000 +0800 +++ arch/tile/lib/spinlock_32.c 2011-02-18 16:05:31.000000000 +0800 @@ -98,7 +98,18 @@ static inline u32 get_rwlock(raw_rwlock_ #ifdef __TILECC__ #pragma frequency_hint NEVER #endif + /* + * get_rwlock() now have to be called in Interrupt + * Critical Section, so it can't be called in the + * these __raw_write_xxx() anymore!!!!! + * + * We leave Interrupt Critical Section for making + * interrupt delay minimal. + * Is that really needed??? + */ + __raw_read_lock_leave_critical(); delay_backoff(iterations++); + __raw_read_lock_enter_critical(); continue; } return val; @@ -152,7 +163,14 @@ void __raw_read_lock_slow(raw_rwlock_t * do { if (!(val & 1)) rwlock->lock = val; + /* + * We leave Interrupt Critical Section for making + * interrupt delay minimal. + * Is that really needed??? + */ + __raw_read_lock_leave_critical(); delay_backoff(iterations++); + __raw_read_lock_enter_critical(); val = __insn_tns((int *)&rwlock->lock); } while ((val << RD_COUNT_WIDTH) != 0); rwlock->lock = val + (1 << RD_COUNT_SHIFT); @@ -166,23 +184,30 @@ void __raw_write_lock_slow(raw_rwlock_t * when we compare them. */ u32 my_ticket_; + u32 iterations = 0; - /* Take out the next ticket; this will also stop would-be readers. */ - if (val & 1) - val = get_rwlock(rwlock); - rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT); + /* + * Wait until there are no readers, then bump up the next + * field and capture the ticket value. + */ + for (;;) { + if (!(val & 1)) { + if ((val >> RD_COUNT_SHIFT) == 0) + break; + rwlock->lock = val; + } + delay_backoff(iterations++); + val = __insn_tns((int *)&rwlock->lock); + } - /* Extract my ticket value from the original word. */ + /* Take out the next ticket and extract my ticket value. */ + rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT); my_ticket_ = val >> WR_NEXT_SHIFT; - /* - * Wait until the "current" field matches our ticket, and - * there are no remaining readers. - */ + /* Wait until the "current" field matches our ticket. */ for (;;) { u32 curr_ = val >> WR_CURR_SHIFT; - u32 readers = val >> RD_COUNT_SHIFT; - u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers; + u32 delta = ((my_ticket_ - curr_) & WR_MASK); if (likely(delta == 0)) break; -- Cyberman Wu ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: IGMP and rwlock: Dead ocurred again on TILEPro 2011-02-19 4:07 ` Cypher Wu @ 2011-02-20 13:33 ` Chris Metcalf 0 siblings, 0 replies; 20+ messages in thread From: Chris Metcalf @ 2011-02-20 13:33 UTC (permalink / raw) To: Cypher Wu Cc: David Miller, xiyou.wangcong, linux-kernel, eric.dumazet, netdev On 2/18/2011 11:07 PM, Cypher Wu wrote: > On Sat, Feb 19, 2011 at 5:51 AM, Chris Metcalf <cmetcalf@tilera.com> wrote: >> I heard from one of our support folks that you were asking through that >> channel, so I asked him to go ahead and give you the spinlock sources >> directly. I will be spending time next week syncing up our internal tree >> with the public git repository so you'll see it on LKML at that time. > I've got your source code, thank you very much. > > There is still two more question: > 1. Why we merge the inlined code and the *_slow into none inlined functions? Those functions were always borderline in terms of being sensible inlined functions. In my opinion, adding the SPR writes as well pushed them over the edge, so I made them just straight function calls instead, for code density reasons. It also makes the code simpler, which is a plus. And since I was changing the read_lock versions I changed the write_lock versions as well for consistency. > 2. I've seen the use of 'mb()' in unlock operation, but we don't use > that in the lock operation. You don't need a memory barrier when acquiring a lock. (Well, some architectures require a read barrier, but Tile doesn't speculate loads past control dependencies at the moment.) > I've released a temporary version with that modification under our > customer' demand, since they want to do a long time test though this > weekend. I'll appreciate that if you gave some comment on my > modifications: It seems OK functionally, and has the advantage of addressing the deadlock without changing the module API, so it's appropriate if you're trying to maintain binary compatibility. -- Chris Metcalf, Tilera Corp. http://www.tilera.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH] arch/tile: fix deadlock bugs in rwlock implementation 2011-02-18 7:08 ` Cypher Wu 2011-02-18 21:51 ` Chris Metcalf @ 2011-03-01 18:30 ` Chris Metcalf 1 sibling, 0 replies; 20+ messages in thread From: Chris Metcalf @ 2011-03-01 18:30 UTC (permalink / raw) To: linux-kernel; +Cc: David Miller, xiyou.wangcong, eric.dumazet, netdev The first issue fixed in this patch is that pending rwlock write locks could lock out new readers; this could cause a deadlock if a read lock was held on cpu 1, a write lock was then attempted on cpu 2 and was pending, and cpu 1 was interrupted and attempted to re-acquire a read lock. The write lock code was modified to not lock out new readers. The second issue fixed is that there was a narrow race window where a tns instruction had been issued (setting the lock value to "1") and the store instruction to reset the lock value correctly had not yet been issued. In this case, if an interrupt occurred and the same cpu then tried to manipulate the lock, it would find the lock value set to "1" and spin forever, assuming some other cpu was partway through updating it. The fix is to enforce an interrupt critical section around the tns/store pair. Since these changes make the rwlock "fast path" code heavier weight, I decided to move all the rwlock code all out of line, leaving only the conventional spinlock code with fastpath inlines. As part of this change I also eliminate support for the now-obsolete tns_atomic mode. Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> --- arch/tile/include/asm/spinlock_32.h | 84 ++---------------- arch/tile/lib/spinlock_32.c | 163 +++++++++++++++++++++-------------- 2 files changed, 108 insertions(+), 139 deletions(-) diff --git a/arch/tile/include/asm/spinlock_32.h b/arch/tile/include/asm/spinlock_32.h index 88efdde..32ba1fe 100644 --- a/arch/tile/include/asm/spinlock_32.h +++ b/arch/tile/include/asm/spinlock_32.h @@ -21,6 +21,7 @@ #include <asm/page.h> #include <asm/system.h> #include <linux/compiler.h> +#include <arch/spr_def.h> /* * We only use even ticket numbers so the '1' inserted by a tns is @@ -78,13 +79,6 @@ void arch_spin_unlock_wait(arch_spinlock_t *lock); #define _RD_COUNT_SHIFT 24 #define _RD_COUNT_WIDTH 8 -/* Internal functions; do not use. */ -void arch_read_lock_slow(arch_rwlock_t *, u32); -int arch_read_trylock_slow(arch_rwlock_t *); -void arch_read_unlock_slow(arch_rwlock_t *); -void arch_write_lock_slow(arch_rwlock_t *, u32); -void arch_write_unlock_slow(arch_rwlock_t *, u32); - /** * arch_read_can_lock() - would read_trylock() succeed? */ @@ -104,94 +98,32 @@ static inline int arch_write_can_lock(arch_rwlock_t *rwlock) /** * arch_read_lock() - acquire a read lock. */ -static inline void arch_read_lock(arch_rwlock_t *rwlock) -{ - u32 val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val << _RD_COUNT_WIDTH)) { - arch_read_lock_slow(rwlock, val); - return; - } - rwlock->lock = val + (1 << _RD_COUNT_SHIFT); -} +void arch_read_lock(arch_rwlock_t *rwlock); /** - * arch_read_lock() - acquire a write lock. + * arch_write_lock() - acquire a write lock. */ -static inline void arch_write_lock(arch_rwlock_t *rwlock) -{ - u32 val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val != 0)) { - arch_write_lock_slow(rwlock, val); - return; - } - rwlock->lock = 1 << _WR_NEXT_SHIFT; -} +void arch_write_lock(arch_rwlock_t *rwlock); /** * arch_read_trylock() - try to acquire a read lock. 
*/ -static inline int arch_read_trylock(arch_rwlock_t *rwlock) -{ - int locked; - u32 val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val & 1)) - return arch_read_trylock_slow(rwlock); - locked = (val << _RD_COUNT_WIDTH) == 0; - rwlock->lock = val + (locked << _RD_COUNT_SHIFT); - return locked; -} +int arch_read_trylock(arch_rwlock_t *rwlock); /** * arch_write_trylock() - try to acquire a write lock. */ -static inline int arch_write_trylock(arch_rwlock_t *rwlock) -{ - u32 val = __insn_tns((int *)&rwlock->lock); - - /* - * If a tns is in progress, or there's a waiting or active locker, - * or active readers, we can't take the lock, so give up. - */ - if (unlikely(val != 0)) { - if (!(val & 1)) - rwlock->lock = val; - return 0; - } - - /* Set the "next" field to mark it locked. */ - rwlock->lock = 1 << _WR_NEXT_SHIFT; - return 1; -} +int arch_write_trylock(arch_rwlock_t *rwlock); /** * arch_read_unlock() - release a read lock. */ -static inline void arch_read_unlock(arch_rwlock_t *rwlock) -{ - u32 val; - mb(); /* guarantee anything modified under the lock is visible */ - val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val & 1)) { - arch_read_unlock_slow(rwlock); - return; - } - rwlock->lock = val - (1 << _RD_COUNT_SHIFT); -} +void arch_read_unlock(arch_rwlock_t *rwlock); /** * arch_write_unlock() - release a write lock. */ -static inline void arch_write_unlock(arch_rwlock_t *rwlock) -{ - u32 val; - mb(); /* guarantee anything modified under the lock is visible */ - val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val != (1 << _WR_NEXT_SHIFT))) { - arch_write_unlock_slow(rwlock, val); - return; - } - rwlock->lock = 0; -} +void arch_write_unlock(arch_rwlock_t *rwlock); #define arch_read_lock_flags(lock, flags) arch_read_lock(lock) #define arch_write_lock_flags(lock, flags) arch_write_lock(lock) diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c index 5cd1c40..723e789 100644 --- a/arch/tile/lib/spinlock_32.c +++ b/arch/tile/lib/spinlock_32.c @@ -91,75 +91,82 @@ EXPORT_SYMBOL(arch_spin_unlock_wait); #define RD_COUNT_MASK ((1 << RD_COUNT_WIDTH) - 1) -/* Lock the word, spinning until there are no tns-ers. */ -static inline u32 get_rwlock(arch_rwlock_t *rwlock) +/* + * We spin until everything but the reader bits (which are in the high + * part of the word) are zero, i.e. no active or waiting writers, no tns. + * + * We guard the tns/store-back with an interrupt critical section to + * preserve the semantic that the same read lock can be acquired in an + * interrupt context. + * + * ISSUE: This approach can permanently starve readers. A reader who sees + * a writer could instead take a ticket lock (just like a writer would), + * and atomically enter read mode (with 1 reader) when it gets the ticket. + * This way both readers and writers will always make forward progress + * in a finite time. 
+ */ +void arch_read_lock(arch_rwlock_t *rwlock) { - u32 iterations = 0; + u32 val, iterations = 0; + for (;;) { - u32 val = __insn_tns((int *)&rwlock->lock); - if (unlikely(val & 1)) { - delay_backoff(iterations++); - continue; + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); + val = __insn_tns((int *)&rwlock->lock); + if (likely((val << _RD_COUNT_WIDTH) == 0)) { + rwlock->lock = val + (1 << RD_COUNT_SHIFT); + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + break; } - return val; + if ((val & 1) == 0) + rwlock->lock = val; + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + delay_backoff(iterations++); } } +EXPORT_SYMBOL(arch_read_lock); -int arch_read_trylock_slow(arch_rwlock_t *rwlock) +int arch_read_trylock(arch_rwlock_t *rwlock) { - u32 val = get_rwlock(rwlock); - int locked = (val << RD_COUNT_WIDTH) == 0; - rwlock->lock = val + (locked << RD_COUNT_SHIFT); + u32 val; + int locked; + + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); + val = __insn_tns((int *)&rwlock->lock); + locked = ((val << _RD_COUNT_WIDTH) == 0); + if (likely(locked)) + rwlock->lock = val + (1 << RD_COUNT_SHIFT); + else if ((val & 1) == 0) + rwlock->lock = val; + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); return locked; } -EXPORT_SYMBOL(arch_read_trylock_slow); +EXPORT_SYMBOL(arch_read_trylock); -void arch_read_unlock_slow(arch_rwlock_t *rwlock) +void arch_read_unlock(arch_rwlock_t *rwlock) { - u32 val = get_rwlock(rwlock); - rwlock->lock = val - (1 << RD_COUNT_SHIFT); -} -EXPORT_SYMBOL(arch_read_unlock_slow); + u32 val, iterations = 0; -void arch_write_unlock_slow(arch_rwlock_t *rwlock, u32 val) -{ - u32 eq, mask = 1 << WR_CURR_SHIFT; - while (unlikely(val & 1)) { - /* Limited backoff since we are the highest-priority task. */ - relax(4); + mb(); /* guarantee anything modified under the lock is visible */ + for (;;) { + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); val = __insn_tns((int *)&rwlock->lock); + if (likely(val & 1) == 0) { + rwlock->lock = val - (1 << _RD_COUNT_SHIFT); + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + break; + } + __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + delay_backoff(iterations++); } - val = __insn_addb(val, mask); - eq = __insn_seqb(val, val << (WR_CURR_SHIFT - WR_NEXT_SHIFT)); - val = __insn_mz(eq & mask, val); - rwlock->lock = val; } -EXPORT_SYMBOL(arch_write_unlock_slow); +EXPORT_SYMBOL(arch_read_unlock); /* - * We spin until everything but the reader bits (which are in the high - * part of the word) are zero, i.e. no active or waiting writers, no tns. - * - * ISSUE: This approach can permanently starve readers. A reader who sees - * a writer could instead take a ticket lock (just like a writer would), - * and atomically enter read mode (with 1 reader) when it gets the ticket. - * This way both readers and writers will always make forward progress - * in a finite time. + * We don't need an interrupt critical section here (unlike for + * arch_read_lock) since we should never use a bare write lock where + * it could be interrupted by code that could try to re-acquire it. 
*/ -void arch_read_lock_slow(arch_rwlock_t *rwlock, u32 val) -{ - u32 iterations = 0; - do { - if (!(val & 1)) - rwlock->lock = val; - delay_backoff(iterations++); - val = __insn_tns((int *)&rwlock->lock); - } while ((val << RD_COUNT_WIDTH) != 0); - rwlock->lock = val + (1 << RD_COUNT_SHIFT); -} -EXPORT_SYMBOL(arch_read_lock_slow); - -void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val) +void arch_write_lock(arch_rwlock_t *rwlock) { /* * The trailing underscore on this variable (and curr_ below) @@ -168,6 +175,12 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val) */ u32 my_ticket_; u32 iterations = 0; + u32 val = __insn_tns((int *)&rwlock->lock); + + if (likely(val == 0)) { + rwlock->lock = 1 << _WR_NEXT_SHIFT; + return; + } /* * Wait until there are no readers, then bump up the next @@ -206,23 +219,47 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val) relax(4); } } -EXPORT_SYMBOL(arch_write_lock_slow); +EXPORT_SYMBOL(arch_write_lock); -int __tns_atomic_acquire(atomic_t *lock) +int arch_write_trylock(arch_rwlock_t *rwlock) { - int ret; - u32 iterations = 0; + u32 val = __insn_tns((int *)&rwlock->lock); - BUG_ON(__insn_mfspr(SPR_INTERRUPT_CRITICAL_SECTION)); - __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 1); + /* + * If a tns is in progress, or there's a waiting or active locker, + * or active readers, we can't take the lock, so give up. + */ + if (unlikely(val != 0)) { + if (!(val & 1)) + rwlock->lock = val; + return 0; + } - while ((ret = __insn_tns((void *)&lock->counter)) == 1) - delay_backoff(iterations++); - return ret; + /* Set the "next" field to mark it locked. */ + rwlock->lock = 1 << _WR_NEXT_SHIFT; + return 1; } +EXPORT_SYMBOL(arch_write_trylock); -void __tns_atomic_release(atomic_t *p, int v) +void arch_write_unlock(arch_rwlock_t *rwlock) { - p->counter = v; - __insn_mtspr(SPR_INTERRUPT_CRITICAL_SECTION, 0); + u32 val, eq, mask; + + mb(); /* guarantee anything modified under the lock is visible */ + val = __insn_tns((int *)&rwlock->lock); + if (likely(val == (1 << _WR_NEXT_SHIFT))) { + rwlock->lock = 0; + return; + } + while (unlikely(val & 1)) { + /* Limited backoff since we are the highest-priority task. */ + relax(4); + val = __insn_tns((int *)&rwlock->lock); + } + mask = 1 << WR_CURR_SHIFT; + val = __insn_addb(val, mask); + eq = __insn_seqb(val, val << (WR_CURR_SHIFT - WR_NEXT_SHIFT)); + val = __insn_mz(eq & mask, val); + rwlock->lock = val; } +EXPORT_SYMBOL(arch_write_unlock); -- 1.6.5.2 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: Fwd: IGMP and rwlock: Dead ocurred again on TILEPro
  2011-02-17  5:42 ` Américo Wang
  2011-02-17  5:46 ` David Miller
@ 2011-02-17  6:42 ` Cypher Wu
  1 sibling, 0 replies; 20+ messages in thread
From: Cypher Wu @ 2011-02-17 6:42 UTC (permalink / raw)
  To: Américo Wang; +Cc: linux-kernel, Chris Metcalf, Eric Dumazet, netdev

On Thu, Feb 17, 2011 at 1:42 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 17, 2011 at 01:04:14PM +0800, Cypher Wu wrote:
>>>
>>> Have you turned CONFIG_LOCKDEP on?
>>>
>>> I think Eric already converted that rwlock into RCU lock, thus
>>> this problem should disappear. Could you try a new kernel?
>>>
>>> Thanks.
>>>
>>
>> I haven't turned CONFIG_LOCKDEP on for this test, since I didn't get
>> much information from it when we tried to figure out the former deadlock.
>>
>> IGMP uses read_lock() rather than read_lock_bh() since read_lock() can
>> usually be called recursively, and having read the MIPS implementation
>> today, it should also work fine in that situation. The TILEPro
>> implementation causes the problem because there is a race window after
>> it uses TNS to set the lock value to 1 and fetch the original value,
>> but before it stores a new value back.
>>
>
> I see no reason why you can't call read_lock_bh() recursively,
> read_lock_bh() is roughly equivalent to local_bh_disable() + read_lock(),
> both can be recursive.
>
> But I may miss something here. :-/
>

Of course read_lock_bh() can be called recursively, but read_lock() is
already enough for IGMP; the only reason for that deadlock is that the
TNS instruction setting the lock value to 1 opens another race window.

-- 
Cyberman Wu

^ permalink raw reply	[flat|nested] 20+ messages in thread
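To spell out the window being described: tns atomically stores 1 into the
lock word and returns the previous value, so the word reads as busy until
the winner stores a real value back. The following user-space model is an
illustration only; the atomic_exchange() stand-in for tns and the shift
value of 24 used for RD_COUNT_SHIFT are assumptions made for the sketch,
not taken from the TILEPro sources:

/* User-space model of the pre-patch race -- illustration only. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint lock_word;	/* models rwlock->lock */

/* tns: atomically store 1 and return the previous value. */
static unsigned int tns(atomic_uint *p)
{
	return atomic_exchange(p, 1);
}

int main(void)
{
	/* Process context begins read_lock(): */
	unsigned int val = tns(&lock_word);	/* lock word is now 1 */

	/*
	 * If a timer interrupt fires here and its handler calls
	 * read_lock() on the same lock, every tns it issues returns 1
	 * ("tns in progress"), so it spins in delay_backoff() forever.
	 * Being on the same CPU, the interrupted context never resumes
	 * to execute the store-back below: that is the deadlock.
	 */

	/* Store-back: publish the incremented reader count. */
	atomic_store(&lock_word, val + (1u << 24));	/* 24 ~ RD_COUNT_SHIFT, assumed */
	printf("lock word after read_lock: 0x%x\n", atomic_load(&lock_word));
	return 0;
}

The patch closes this window by holding SPR_INTERRUPT_CRITICAL_SECTION
across the tns and the store-back, so an interrupt can never observe the
transient 1 left by a tns on its own CPU.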
end of thread, other threads:[~2011-03-01 18:30 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-02-17  3:39 Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu
2011-02-17  4:49 ` Américo Wang
2011-02-17  5:04 ` Cypher Wu
2011-02-17  5:42 ` Américo Wang
2011-02-17  5:46 ` David Miller
2011-02-17  6:39 ` Eric Dumazet
2011-02-17 22:49 ` Chris Metcalf
2011-02-17 22:53 ` David Miller
2011-02-17 23:04 ` Chris Metcalf
2011-02-17 23:11 ` David Miller
2011-02-17 23:18 ` Chris Metcalf
2011-02-18  3:16 ` Cypher Wu
2011-02-18  3:19 ` Cypher Wu
2011-02-18  7:08 ` Cypher Wu
2011-02-18 21:51 ` Chris Metcalf
2011-02-19  4:07 ` Cypher Wu
2011-02-20 13:33 ` Chris Metcalf
2011-03-01 18:30 ` [PATCH] arch/tile: fix deadlock bugs in rwlock implementation Chris Metcalf
2011-02-17  6:42 ` Fwd: IGMP and rwlock: Dead ocurred again on TILEPro Cypher Wu

-- strict thread matches above, loose matches on Subject: below --
2011-02-17  2:17 Cypher Wu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).