All of lore.kernel.org
 help / color / mirror / Atom feed
From: Flavio Leitner <fbl@sysclose.org>
To: Cong Wang <amwang@redhat.com>
Cc: Jay Vosburgh <fubar@us.ibm.com>,
	Neil Horman <nhorman@tuxdriver.com>,
	netdev@vger.kernel.org, Matt Mackall <mpm@selenic.com>,
	bridge@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	David Miller <davem@davemloft.net>,
	Jeff Moyer <jmoyer@redhat.com>,
	Andy Gospodarek <gospo@redhat.com>,
	bonding-devel@lists.sourceforge.net
Subject: Re: [Bridge] [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
Date: Fri, 28 May 2010 16:40:41 -0300	[thread overview]
Message-ID: <20100528194041.GC2345@sysclose.org> (raw)
In-Reply-To: <4BFF2EA5.9090008@redhat.com>

On Fri, May 28, 2010 at 10:47:01AM +0800, Cong Wang wrote:
> On 05/28/10 02:05, Flavio Leitner wrote:
> >
> >Hi guys!
> >
> >I finally could test this to see if an old problem reported on bugzilla[1] was
> >fixed now, but unfortunately it is still there.
> >
> >The ticket is private I guess, but basically the problem happens when bonding
> >driver tries to print something after it had taken the write_lock (monitor
> >functions, enslave/de-enslave), so the printk() will pass through netpoll, then
> >on bonding again which no matter what mode you use, it will try to read_lock()
> >the lock again. The result is a deadlock and the entire system hangs.
> 
> 
> This is true, I already fixed some similar issues.
> 
> >
> >I manage to get a fresh backtrace with mode 1, see below:
> >
> >
> >[   93.167079] Call Trace:
> >[   93.167079]  [<ffffffff81034cf9>] warn_slowpath_common+0x77/0x8f
> >[   93.167079]  [<ffffffff81034d5e>] warn_slowpath_fmt+0x3c/0x3e
> >[   93.167079]  [<ffffffff81366aef>] ? _raw_read_trylock+0x11/0x4b
> >[   93.167079]  [<ffffffffa02a2c42>] ? bond_start_xmit+0x12b/0x401 [bonding]
> >->  read_lock fails
> >[   93.167079]  [<ffffffffa02a2c9f>] bond_start_xmit+0x188/0x401 [bonding]
> >[   93.167079]  [<ffffffff81055b37>] ? trace_hardirqs_off+0xd/0xf
> >[   93.167079]  [<ffffffff812dfdb9>] netpoll_send_skb+0xbd/0x1f3
> >[   93.167079]  [<ffffffff812e00ed>] netpoll_send_udp+0x1fe/0x20d
> >[   93.167079]  [<ffffffffa02c017c>] write_msg+0x89/0xcd [netconsole]
> >[   93.167079]  [<ffffffff81034e65>] __call_console_drivers+0x67/0x79
> >[   93.167079]  [<ffffffff81034ed0>] _call_console_drivers+0x59/0x5d
> >[   93.167079]  [<ffffffff810352d3>] release_console_sem+0x121/0x1d7
> >[   93.167079]  [<ffffffff8103590a>] vprintk+0x35d/0x393
> >[   93.167079]  [<ffffffff8103f947>] ? add_timer+0x17/0x19
> >[   93.167079]  [<ffffffff81046ddf>] ? queue_delayed_work_on+0xa2/0xa9
> >[   93.167079]  [<ffffffff81363bb8>] printk+0x3c/0x44
> >[   93.167079]  [<ffffffffa02a3b17>] bond_select_active_slave+0x105/0x109 [bonding]
> >->  write_locked
> >[   93.167079]  [<ffffffffa02a4798>] bond_mii_monitor+0x479/0x4ed [bonding]
> >[   93.167079]  [<ffffffff81046009>] worker_thread+0x1ef/0x2e2
> >
> >In this case, the message should be
> >     "bonding: bond0: making interface eth0 the new active one"
> 
> 
> Hmm, you triggered a warning here, let me check the source code
> and try to reproduce it here.

Okay, just pull the cable from one slave or both slaves.



> >I did the following patch to discard the packet if it was IN_NETPOLL
> >and the read_lock() fails, so I could go ahead testing it:
> >
> >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> >index 5e12462..a3b8bad 100644
> >--- a/drivers/net/bonding/bond_main.c
> >+++ b/drivers/net/bonding/bond_main.c
> >@@ -4258,8 +4258,19 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
> >  	struct bonding *bond = netdev_priv(bond_dev);
> >  	int res = 1;
> >
> >-	read_lock(&bond->lock);
> >-	read_lock(&bond->curr_slave_lock);
> >+	if (read_trylock(&bond->lock) == 0&&
> >+		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> >+			dev_kfree_skb(skb);
> >+			return NETDEV_TX_OK;
> >+	}
> >+
> >+	if (read_trylock(&bond->curr_slave_lock) == 0&&
> >+		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> >+			read_unlock(&bond->lock);
> >+			dev_kfree_skb(skb);
> >+			return NETDEV_TX_OK;
> >+	}
> >+			
> >
> >  	if (!BOND_IS_OK(bond))
> >  		goto out;
> >
> 
> 
> This looks like a workaround, not a fix. :)

No, it's a debugging patch, otherwise I couldn't see anything on
serial console when it deadlocks or to keep testing it. :)


> >and I found another problem.  The function netpoll_send_skb() checks
> >if the npinfo's queue length is zero and if it's not, it will queue
> >the packet to make sure it's in order and then schedule the thread
> >to run. Later, the thread wakes up running queue_process() which disables
> >interrupts before calling ndo_start_xmit().  However, dev_queue_xmit()
> >uses rcu_*_bh() and before return, it will enable the interrupts again,
> >spitting this:
> >
> >------------[ cut here ]------------
> >WARNING: at kernel/softirq.c:143 local_bh_enable+0x3c/0x86()
> >Hardware name: Precision WorkStation 490
> >Modules linked in: netconsole bonding sunrpc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6]
> >Pid: 17, comm: events/2 Not tainted 2.6.34-04700-gd938a70 #21
> >Call Trace:
> >  [<ffffffff810381d6>] warn_slowpath_common+0x77/0x8f
> >  [<ffffffff810381fd>] warn_slowpath_null+0xf/0x11
> >  [<ffffffff8103d691>] local_bh_enable+0x3c/0x86
> >  [<ffffffff812e4d85>] dev_queue_xmit+0x462/0x493
> >  [<ffffffffa018805f>] bond_dev_queue_xmit+0x1bd/0x1e3 [bonding]
> >  [<ffffffffa01881dd>] bond_start_xmit+0x158/0x37b [bonding]
> >->  interrupts disabled
> >  [<ffffffff812f3fca>] queue_process+0x9d/0xf9
> >  [<ffffffff8104d022>] worker_thread+0x19d/0x224
> >  [<ffffffff812f3f2d>] ? queue_process+0x0/0xf9
> >  [<ffffffff81050819>] ? autoremove_wake_function+0x0/0x34
> >  [<ffffffff8104ce85>] ? worker_thread+0x0/0x224
> >  [<ffffffff8105040b>] kthread+0x7a/0x82
> >  [<ffffffff810036d4>] kernel_thread_helper+0x4/0x10
> >  [<ffffffff81050391>] ? kthread+0x0/0x82
> >  [<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
> >---[ end trace 74e3904503fdb632 ]---
> >
> >kernel/softirq.c:
> >141 static inline void _local_bh_enable_ip(unsigned long ip)
> >142 {
> >143         WARN_ON_ONCE(in_irq() || irqs_disabled());
> >144 #ifdef CONFIG_TRACE_IRQFLAGS
> >145         local_irq_disable();
> >146 #endif
> >147         /*
> >148          * Are softirqs going to be turned on now:
> >149          */
> >
> >
> 
> I am wondering if this was caused by the previous issue.

Yeah, the first problem can help to trigger this.
The spinlock target_list_lock in write_msg() seems to serialize
this, but I'm not sure yet if the queue will always be empty by
the time another call to write_msg() is made because the 
netpoll_send_skb() will queue the packet if it fails to send.


> >The git is updated up to:
> >   d938a70 be2net: increase POST timeout for EEH recovery
> >
> >Two slave interfaces, bonding mode 1, netconsole over bond0.
> >
> >[1] https://bugzilla.redhat.com/show_bug.cgi?id=248374#c5
> 
> How did you reproduce this?
> I will check that BZ to see if I can find how to reproduce this.
> 
> Thanks.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Flavio

WARNING: multiple messages have this Message-ID (diff)
From: Flavio Leitner <fbl@sysclose.org>
To: Cong Wang <amwang@redhat.com>
Cc: linux-kernel@vger.kernel.org, Matt Mackall <mpm@selenic.com>,
	netdev@vger.kernel.org, bridge@lists.linux-foundation.org,
	Andy Gospodarek <gospo@redhat.com>,
	Neil Horman <nhorman@tuxdriver.com>,
	Jeff Moyer <jmoyer@redhat.com>,
	Stephen Hemminger <shemminger@linux-foundation.org>,
	bonding-devel@lists.sourceforge.net,
	Jay Vosburgh <fubar@us.ibm.com>,
	David Miller <davem@davemloft.net>
Subject: Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
Date: Fri, 28 May 2010 16:40:41 -0300	[thread overview]
Message-ID: <20100528194041.GC2345@sysclose.org> (raw)
In-Reply-To: <4BFF2EA5.9090008@redhat.com>

On Fri, May 28, 2010 at 10:47:01AM +0800, Cong Wang wrote:
> On 05/28/10 02:05, Flavio Leitner wrote:
> >
> >Hi guys!
> >
> >I finally could test this to see if an old problem reported on bugzilla[1] was
> >fixed now, but unfortunately it is still there.
> >
> >The ticket is private I guess, but basically the problem happens when bonding
> >driver tries to print something after it had taken the write_lock (monitor
> >functions, enslave/de-enslave), so the printk() will pass through netpoll, then
> >on bonding again which no matter what mode you use, it will try to read_lock()
> >the lock again. The result is a deadlock and the entire system hangs.
> 
> 
> This is true, I already fixed some similar issues.
> 
> >
> >I manage to get a fresh backtrace with mode 1, see below:
> >
> >
> >[   93.167079] Call Trace:
> >[   93.167079]  [<ffffffff81034cf9>] warn_slowpath_common+0x77/0x8f
> >[   93.167079]  [<ffffffff81034d5e>] warn_slowpath_fmt+0x3c/0x3e
> >[   93.167079]  [<ffffffff81366aef>] ? _raw_read_trylock+0x11/0x4b
> >[   93.167079]  [<ffffffffa02a2c42>] ? bond_start_xmit+0x12b/0x401 [bonding]
> >->  read_lock fails
> >[   93.167079]  [<ffffffffa02a2c9f>] bond_start_xmit+0x188/0x401 [bonding]
> >[   93.167079]  [<ffffffff81055b37>] ? trace_hardirqs_off+0xd/0xf
> >[   93.167079]  [<ffffffff812dfdb9>] netpoll_send_skb+0xbd/0x1f3
> >[   93.167079]  [<ffffffff812e00ed>] netpoll_send_udp+0x1fe/0x20d
> >[   93.167079]  [<ffffffffa02c017c>] write_msg+0x89/0xcd [netconsole]
> >[   93.167079]  [<ffffffff81034e65>] __call_console_drivers+0x67/0x79
> >[   93.167079]  [<ffffffff81034ed0>] _call_console_drivers+0x59/0x5d
> >[   93.167079]  [<ffffffff810352d3>] release_console_sem+0x121/0x1d7
> >[   93.167079]  [<ffffffff8103590a>] vprintk+0x35d/0x393
> >[   93.167079]  [<ffffffff8103f947>] ? add_timer+0x17/0x19
> >[   93.167079]  [<ffffffff81046ddf>] ? queue_delayed_work_on+0xa2/0xa9
> >[   93.167079]  [<ffffffff81363bb8>] printk+0x3c/0x44
> >[   93.167079]  [<ffffffffa02a3b17>] bond_select_active_slave+0x105/0x109 [bonding]
> >->  write_locked
> >[   93.167079]  [<ffffffffa02a4798>] bond_mii_monitor+0x479/0x4ed [bonding]
> >[   93.167079]  [<ffffffff81046009>] worker_thread+0x1ef/0x2e2
> >
> >In this case, the message should be
> >     "bonding: bond0: making interface eth0 the new active one"
> 
> 
> Hmm, you triggered a warning here, let me check the source code
> and try to reproduce it here.

Okay, just pull the cable from one slave or both slaves.



> >I did the following patch to discard the packet if it was IN_NETPOLL
> >and the read_lock() fails, so I could go ahead testing it:
> >
> >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> >index 5e12462..a3b8bad 100644
> >--- a/drivers/net/bonding/bond_main.c
> >+++ b/drivers/net/bonding/bond_main.c
> >@@ -4258,8 +4258,19 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
> >  	struct bonding *bond = netdev_priv(bond_dev);
> >  	int res = 1;
> >
> >-	read_lock(&bond->lock);
> >-	read_lock(&bond->curr_slave_lock);
> >+	if (read_trylock(&bond->lock) == 0&&
> >+		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> >+			dev_kfree_skb(skb);
> >+			return NETDEV_TX_OK;
> >+	}
> >+
> >+	if (read_trylock(&bond->curr_slave_lock) == 0&&
> >+		(bond_dev->flags&  IFF_IN_NETPOLL)) {
> >+			read_unlock(&bond->lock);
> >+			dev_kfree_skb(skb);
> >+			return NETDEV_TX_OK;
> >+	}
> >+			
> >
> >  	if (!BOND_IS_OK(bond))
> >  		goto out;
> >
> 
> 
> This looks like a workaround, not a fix. :)

No, it's a debugging patch, otherwise I couldn't see anything on
serial console when it deadlocks or to keep testing it. :)


> >and I found another problem.  The function netpoll_send_skb() checks
> >if the npinfo's queue length is zero and if it's not, it will queue
> >the packet to make sure it's in order and then schedule the thread
> >to run. Later, the thread wakes up running queue_process() which disables
> >interrupts before calling ndo_start_xmit().  However, dev_queue_xmit()
> >uses rcu_*_bh() and before return, it will enable the interrupts again,
> >spitting this:
> >
> >------------[ cut here ]------------
> >WARNING: at kernel/softirq.c:143 local_bh_enable+0x3c/0x86()
> >Hardware name: Precision WorkStation 490
> >Modules linked in: netconsole bonding sunrpc ip6t_REJECT xt_tcpudp nf_conntrack_ipv6]
> >Pid: 17, comm: events/2 Not tainted 2.6.34-04700-gd938a70 #21
> >Call Trace:
> >  [<ffffffff810381d6>] warn_slowpath_common+0x77/0x8f
> >  [<ffffffff810381fd>] warn_slowpath_null+0xf/0x11
> >  [<ffffffff8103d691>] local_bh_enable+0x3c/0x86
> >  [<ffffffff812e4d85>] dev_queue_xmit+0x462/0x493
> >  [<ffffffffa018805f>] bond_dev_queue_xmit+0x1bd/0x1e3 [bonding]
> >  [<ffffffffa01881dd>] bond_start_xmit+0x158/0x37b [bonding]
> >->  interrupts disabled
> >  [<ffffffff812f3fca>] queue_process+0x9d/0xf9
> >  [<ffffffff8104d022>] worker_thread+0x19d/0x224
> >  [<ffffffff812f3f2d>] ? queue_process+0x0/0xf9
> >  [<ffffffff81050819>] ? autoremove_wake_function+0x0/0x34
> >  [<ffffffff8104ce85>] ? worker_thread+0x0/0x224
> >  [<ffffffff8105040b>] kthread+0x7a/0x82
> >  [<ffffffff810036d4>] kernel_thread_helper+0x4/0x10
> >  [<ffffffff81050391>] ? kthread+0x0/0x82
> >  [<ffffffff810036d0>] ? kernel_thread_helper+0x0/0x10
> >---[ end trace 74e3904503fdb632 ]---
> >
> >kernel/softirq.c:
> >141 static inline void _local_bh_enable_ip(unsigned long ip)
> >142 {
> >143         WARN_ON_ONCE(in_irq() || irqs_disabled());
> >144 #ifdef CONFIG_TRACE_IRQFLAGS
> >145         local_irq_disable();
> >146 #endif
> >147         /*
> >148          * Are softirqs going to be turned on now:
> >149          */
> >
> >
> 
> I am wondering if this was caused by the previous issue.

Yeah, the first problem can help to trigger this.
The spinlock target_list_lock in write_msg() seems to serialize
this, but I'm not sure yet if the queue will always be empty by
the time another call to write_msg() is made because the 
netpoll_send_skb() will queue the packet if it fails to send.


> >The git is updated up to:
> >   d938a70 be2net: increase POST timeout for EEH recovery
> >
> >Two slave interfaces, bonding mode 1, netconsole over bond0.
> >
> >[1] https://bugzilla.redhat.com/show_bug.cgi?id=248374#c5
> 
> How did you reproduce this?
> I will check that BZ to see if I can find how to reproduce this.
> 
> Thanks.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Flavio

  reply	other threads:[~2010-05-28 19:40 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-05-05  8:11 [Bridge] [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices Amerigo Wang
2010-05-05  8:11 ` Amerigo Wang
2010-05-05  8:11 ` Amerigo Wang
2010-05-05  8:11 ` [Bridge] [v5 Patch 2/3] bridge: make bridge support netpoll Amerigo Wang
2010-05-05  8:11   ` Amerigo Wang
2010-05-05  8:11   ` Amerigo Wang
2010-05-05  8:11 ` [Bridge] [v5 Patch 3/3] bonding: make bonding " Amerigo Wang
2010-05-05  8:11   ` Amerigo Wang
2010-05-05  8:11   ` Amerigo Wang
2010-05-06  2:05 ` [Bridge] [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices Matt Mackall
2010-05-06  2:05   ` Matt Mackall
2010-05-06  7:44   ` [Bridge] " David Miller
2010-05-06  7:44     ` David Miller
2010-05-07  3:24     ` [Bridge] " Cong Wang
2010-05-07  3:24       ` Cong Wang
2010-05-27 18:05 ` [Bridge] " Flavio Leitner
2010-05-27 18:05   ` Flavio Leitner
2010-05-27 20:35   ` [Bridge] " David Miller
2010-05-27 20:35     ` David Miller
2010-05-27 21:25     ` [Bridge] " Flavio Leitner
2010-05-27 21:25       ` Flavio Leitner
2010-05-28  2:47   ` [Bridge] " Cong Wang
2010-05-28  2:47     ` Cong Wang
2010-05-28 19:40     ` Flavio Leitner [this message]
2010-05-28 19:40       ` Flavio Leitner
2010-05-31  5:56       ` [Bridge] " Cong Wang
2010-05-31  5:56         ` Cong Wang
2010-05-31 19:08         ` [Bridge] " Flavio Leitner
2010-05-31 19:08           ` Flavio Leitner
2010-06-01  9:57           ` [Bridge] " Cong Wang
2010-06-01  9:57             ` Cong Wang
2010-06-01 18:42             ` [Bridge] " Jay Vosburgh
2010-06-01 18:42               ` Jay Vosburgh
2010-06-02 10:04               ` [Bridge] " Cong Wang
2010-06-02 10:04                 ` Cong Wang
2010-06-04 19:18                 ` [Bridge] " Andy Gospodarek
2010-06-04 19:18                   ` Andy Gospodarek
2010-06-07  9:57                   ` [Bridge] " Cong Wang
2010-06-07  9:57                     ` Cong Wang
2010-06-07 10:01                     ` [Bridge] " David Miller
2010-06-07 10:01                       ` David Miller
2010-06-08  8:36                       ` [Bridge] " Cong Wang
2010-06-08  8:36                         ` Cong Wang
2010-06-07 13:03                     ` [Bridge] " Andy Gospodarek
2010-06-07 13:03                       ` Andy Gospodarek
2010-06-08  8:38                       ` [Bridge] " Cong Wang
2010-06-08  8:38                         ` Cong Wang
2010-06-07 19:24               ` [Bridge] [PATCH] netconsole: queue console messages to send later Flavio Leitner
2010-06-07 19:24                 ` Flavio Leitner
2010-06-07 19:50                 ` [Bridge] " Matt Mackall
2010-06-07 19:50                   ` Matt Mackall
2010-06-07 20:00                   ` [Bridge] " Stephen Hemminger
2010-06-07 20:00                     ` Stephen Hemminger
2010-06-07 20:21                     ` [Bridge] " Matt Mackall
2010-06-07 20:21                       ` Matt Mackall
2010-06-07 23:52                       ` [Bridge] " David Miller
2010-06-07 23:52                         ` David Miller
2010-06-07 23:50                 ` [Bridge] " David Miller
2010-06-07 23:50                   ` David Miller
2010-06-08  0:37                   ` [Bridge] " Flavio Leitner
2010-06-08  0:37                     ` Flavio Leitner
2010-06-08  8:59                     ` [Bridge] " Cong Wang
2010-06-08  8:59                       ` Cong Wang
2010-05-28  8:16   ` [Bridge] [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices Cong Wang
2010-05-28  8:16     ` Cong Wang
2010-05-28 20:42     ` [Bridge] " Flavio Leitner
2010-05-28 20:42       ` Flavio Leitner
2010-05-28 21:03       ` [Bridge] " Jay Vosburgh
2010-05-28 21:03         ` Jay Vosburgh
2010-05-31  5:29         ` [Bridge] " Cong Wang
2010-05-31  5:29           ` Cong Wang
2010-05-31  5:37           ` [Bridge] " Cong Wang
2010-05-31  5:37             ` Cong Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100528194041.GC2345@sysclose.org \
    --to=fbl@sysclose.org \
    --cc=amwang@redhat.com \
    --cc=bonding-devel@lists.sourceforge.net \
    --cc=bridge@lists.linux-foundation.org \
    --cc=davem@davemloft.net \
    --cc=fubar@us.ibm.com \
    --cc=gospo@redhat.com \
    --cc=jmoyer@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mpm@selenic.com \
    --cc=netdev@vger.kernel.org \
    --cc=nhorman@tuxdriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.