netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net] net: tipc: fix possible CPU stall while removing module
@ 2013-12-10 17:54 Pablo Neira Ayuso
  2013-12-10 20:20 ` Jon Maloy
  0 siblings, 1 reply; 2+ messages in thread
From: Pablo Neira Ayuso @ 2013-12-10 17:54 UTC (permalink / raw)
  To: netdev; +Cc: davem, jon.maloy, allan.stephens

The following stall is possible while removing the tipc module:

[ 4244.091196] INFO: rcu_sched self-detected stall on CPU { 41}  (t=30001 jiffies)
[ 4244.091414] Pid: 5311, comm: rmmod Tainted: G           O 3.4.51 #1
[ 4244.091524] Call Trace:
[ 4244.091618]  <IRQ>  [<ffffffff810b3741>] ? __rcu_pending+0x1a1/0x4d0
[ 4244.091741]  [<ffffffff81085030>] ? tick_nohz_handler+0xe0/0xe0
[ 4244.091848]  [<ffffffff810b3b18>] ? rcu_check_callbacks+0xa8/0x150
[ 4244.091957]  [<ffffffff81046f3f>] ? update_process_times+0x3f/0x80
[ 4244.092065]  [<ffffffff8108508b>] ? tick_sched_timer+0x5b/0xb0
[ 4244.092172]  [<ffffffff8105d967>] ? __run_hrtimer+0x77/0x1c0
[ 4244.092278]  [<ffffffff8105dd1f>] ? hrtimer_interrupt+0xef/0x270
[ 4244.092386]  [<ffffffff8106714d>] ? ttwu_do_wakeup+0x3d/0x100
[ 4244.092494]  [<ffffffff81020d43>] ? smp_apic_timer_interrupt+0x63/0xa0
[ 4244.092605]  [<ffffffff8167fc4a>] ? apic_timer_interrupt+0x6a/0x70
[ 4244.092714]  [<ffffffff81677985>] ? _raw_spin_lock_bh+0x25/0x30
[ 4244.092820]  [<ffffffff81677969>] ? _raw_spin_lock_bh+0x9/0x30
[ 4244.092936]  [<ffffffffa0333c45>] ? tipc_nodesub_unsubscribe+0x15/0x50 [tipc]
[ 4244.093049]  [<ffffffffa03303ea>] ? named_purge_publ+0x3a/0x90 [tipc]
[ 4244.093158]  [<ffffffff8103ee15>] ? __do_softirq+0xd5/0x1e0
[ 4244.093266]  [<ffffffffa032a25b>] ? process_signal_queue+0x7b/0xc0 [tipc]
[ 4244.093376]  [<ffffffff8103f41b>] ? tasklet_action+0xbb/0xd0
[ 4244.093482]  [<ffffffff8103edf1>] ? __do_softirq+0xb1/0x1e0
[ 4244.093589]  [<ffffffff8168059c>] ? call_softirq+0x1c/0x30
[ 4244.093691]  <EOI>  [<ffffffff810041e5>] ? do_softirq+0x65/0xa0
[ 4244.095460]  [<ffffffff8103f834>] ? local_bh_enable_ip+0x94/0xa0
[ 4244.095570]  [<ffffffffa0332c93>] ? tipc_net_stop+0x73/0x90 [tipc]
[ 4244.095679]  [<ffffffffa0339d61>] ? tipc_exit+0x9/0x29 [tipc]
[ 4244.095786]  [<ffffffff8108e8bf>] ? sys_delete_module+0x1af/0x2b0
[ 4244.095894]  [<ffffffff81677d75>] ? page_fault+0x25/0x30
[ 4244.095999]  [<ffffffff8167f1b9>] ? system_call_fastpath+0x16/0x1b

The two things that trigger this oops are related to tipc_net_stop(), they
are:

* tipc_net_stop() schedules the removal of the bearers via workqueue, which
  includes the removal of the packet handler for the TIPC protocol family. So,
  we have no time guarantees when the packet handler is removed.

* tipc_net_stop() cleans up the the tipc_node_list, so it releases all
  tipc_node structures. However, the tipc_node_subscr structure still holds
  a reference to the tipc_node structures, which is now invalid.

After leaving tipc_net_stop, BH are enabled again. If we have a TIPC
message that is pending to be handled (in the softirq path), the packet
handler will likely be still there, so it passes the packet to tipc_recv_msg().
In that path, if the TIPC message announces a new service publication,
named_purge_publ() is invoked, then tipc_nodesub_unsubscribe() to remove
the node subscription happens. This function tries to get the node lock, but
that structure was already released by tipc_net_stop(), so it stalls.

The proposed fix removes the bearers first so we make sure we get no more
TIPC packets running through the input path, accessing the name-service
base in inconsistent state (as tipc_node structures are not there anymore).

Signed-off-by: Pablo Neira Ayuso <pablo@gnumonks.org>
---
 net/tipc/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tipc/core.c b/net/tipc/core.c
index fd4eeea..7373fd9 100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -81,9 +81,9 @@ struct sk_buff *tipc_buf_acquire(u32 size)
  */
 static void tipc_core_stop_net(void)
 {
-	tipc_net_stop();
 	tipc_eth_media_stop();
 	tipc_ib_media_stop();
+	tipc_net_stop();
 }
 
 /**
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 2+ messages in thread

* Re: [PATCH net] net: tipc: fix possible CPU stall while removing module
  2013-12-10 17:54 [PATCH net] net: tipc: fix possible CPU stall while removing module Pablo Neira Ayuso
@ 2013-12-10 20:20 ` Jon Maloy
  0 siblings, 0 replies; 2+ messages in thread
From: Jon Maloy @ 2013-12-10 20:20 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netdev; +Cc: davem, allan.stephens

There is already a patch in the queue fixing this issue.

http://marc.info/?l=linux-netdev&m=138665890819879&w=2

Regards
///jon


On 12/10/2013 09:54 AM, Pablo Neira Ayuso wrote:
> The following stall is possible while removing the tipc module:
> 
> [ 4244.091196] INFO: rcu_sched self-detected stall on CPU { 41}  (t=30001 jiffies)
> [ 4244.091414] Pid: 5311, comm: rmmod Tainted: G           O 3.4.51 #1
> [ 4244.091524] Call Trace:
> [ 4244.091618]  <IRQ>  [<ffffffff810b3741>] ? __rcu_pending+0x1a1/0x4d0
> [ 4244.091741]  [<ffffffff81085030>] ? tick_nohz_handler+0xe0/0xe0
> [ 4244.091848]  [<ffffffff810b3b18>] ? rcu_check_callbacks+0xa8/0x150
> [ 4244.091957]  [<ffffffff81046f3f>] ? update_process_times+0x3f/0x80
> [ 4244.092065]  [<ffffffff8108508b>] ? tick_sched_timer+0x5b/0xb0
> [ 4244.092172]  [<ffffffff8105d967>] ? __run_hrtimer+0x77/0x1c0
> [ 4244.092278]  [<ffffffff8105dd1f>] ? hrtimer_interrupt+0xef/0x270
> [ 4244.092386]  [<ffffffff8106714d>] ? ttwu_do_wakeup+0x3d/0x100
> [ 4244.092494]  [<ffffffff81020d43>] ? smp_apic_timer_interrupt+0x63/0xa0
> [ 4244.092605]  [<ffffffff8167fc4a>] ? apic_timer_interrupt+0x6a/0x70
> [ 4244.092714]  [<ffffffff81677985>] ? _raw_spin_lock_bh+0x25/0x30
> [ 4244.092820]  [<ffffffff81677969>] ? _raw_spin_lock_bh+0x9/0x30
> [ 4244.092936]  [<ffffffffa0333c45>] ? tipc_nodesub_unsubscribe+0x15/0x50 [tipc]
> [ 4244.093049]  [<ffffffffa03303ea>] ? named_purge_publ+0x3a/0x90 [tipc]
> [ 4244.093158]  [<ffffffff8103ee15>] ? __do_softirq+0xd5/0x1e0
> [ 4244.093266]  [<ffffffffa032a25b>] ? process_signal_queue+0x7b/0xc0 [tipc]
> [ 4244.093376]  [<ffffffff8103f41b>] ? tasklet_action+0xbb/0xd0
> [ 4244.093482]  [<ffffffff8103edf1>] ? __do_softirq+0xb1/0x1e0
> [ 4244.093589]  [<ffffffff8168059c>] ? call_softirq+0x1c/0x30
> [ 4244.093691]  <EOI>  [<ffffffff810041e5>] ? do_softirq+0x65/0xa0
> [ 4244.095460]  [<ffffffff8103f834>] ? local_bh_enable_ip+0x94/0xa0
> [ 4244.095570]  [<ffffffffa0332c93>] ? tipc_net_stop+0x73/0x90 [tipc]
> [ 4244.095679]  [<ffffffffa0339d61>] ? tipc_exit+0x9/0x29 [tipc]
> [ 4244.095786]  [<ffffffff8108e8bf>] ? sys_delete_module+0x1af/0x2b0
> [ 4244.095894]  [<ffffffff81677d75>] ? page_fault+0x25/0x30
> [ 4244.095999]  [<ffffffff8167f1b9>] ? system_call_fastpath+0x16/0x1b
> 
> The two things that trigger this oops are related to tipc_net_stop(), they
> are:
> 
> * tipc_net_stop() schedules the removal of the bearers via workqueue, which
>   includes the removal of the packet handler for the TIPC protocol family. So,
>   we have no time guarantees when the packet handler is removed.
> 
> * tipc_net_stop() cleans up the the tipc_node_list, so it releases all
>   tipc_node structures. However, the tipc_node_subscr structure still holds
>   a reference to the tipc_node structures, which is now invalid.
> 
> After leaving tipc_net_stop, BH are enabled again. If we have a TIPC
> message that is pending to be handled (in the softirq path), the packet
> handler will likely be still there, so it passes the packet to tipc_recv_msg().
> In that path, if the TIPC message announces a new service publication,
> named_purge_publ() is invoked, then tipc_nodesub_unsubscribe() to remove
> the node subscription happens. This function tries to get the node lock, but
> that structure was already released by tipc_net_stop(), so it stalls.
> 
> The proposed fix removes the bearers first so we make sure we get no more
> TIPC packets running through the input path, accessing the name-service
> base in inconsistent state (as tipc_node structures are not there anymore).
> 
> Signed-off-by: Pablo Neira Ayuso <pablo@gnumonks.org>
> ---
>  net/tipc/core.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/tipc/core.c b/net/tipc/core.c
> index fd4eeea..7373fd9 100644
> --- a/net/tipc/core.c
> +++ b/net/tipc/core.c
> @@ -81,9 +81,9 @@ struct sk_buff *tipc_buf_acquire(u32 size)
>   */
>  static void tipc_core_stop_net(void)
>  {
> -	tipc_net_stop();
>  	tipc_eth_media_stop();
>  	tipc_ib_media_stop();
> +	tipc_net_stop();
>  }
>  
>  /**
> 

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2013-12-10 20:20 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-10 17:54 [PATCH net] net: tipc: fix possible CPU stall while removing module Pablo Neira Ayuso
2013-12-10 20:20 ` Jon Maloy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).