From: Cong Wang <amwang@redhat.com>
To: Jay Vosburgh <fubar@us.ibm.com>
Cc: Neil Horman <nhorman@tuxdriver.com>,
	netdev@vger.kernel.org, Matt Mackall <mpm@selenic.com>,
	bridge@lists.linux-foundation.org, linux-kernel@vger.kernel.org,
	David Miller <davem@davemloft.net>,
	Flavio Leitner <fbl@sysclose.org>, Jeff Moyer <jmoyer@redhat.com>,
	Andy Gospodarek <gospo@redhat.com>,
	bonding-devel@lists.sourceforge.net
Subject: Re: [Bridge] [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
Date: Wed, 02 Jun 2010 18:04:45 +0800
Message-ID: <4C062CBD.7090906@redhat.com>
In-Reply-To: <24059.1275417767@death.nxdomain.ibm.com>

On 06/02/10 02:42, Jay Vosburgh wrote:
> Cong Wang <amwang@redhat.com> wrote:
>
>> On 06/01/10 03:08, Flavio Leitner wrote:
>>> On Mon, May 31, 2010 at 01:56:52PM +0800, Cong Wang wrote:
>>>> Hi, Flavio,
>>>>
>>>> Please use the attached patch instead, try to see if it solves
>>>> all your problems.
>>>
>>> I tried and it hangs. No backtraces this time.
>>> The bond_change_active_slave() prints before NETDEV_BONDING_FAILOVER
>>> notification, so I think it won't work.
>>
>> Ah, I thought the same.
>>
>>>
>>> Please, correct if I'm wrong, but when a failover happens with your
>>> patch applied, the netconsole would be disabled forever even with
>>> another healthy slave, right?
>>>
>>
>> Yes, this is an easy solution, because bonding has several modes,
>> it is complex to make netpoll work in different modes.
>
> 	If I understand correctly, the root cause of the problem with
> netconsole and bonding is that bonding is, ultimately, performing
> printks with a write lock held, and when netconsole recursively calls
> into bonding to send the printk over the netconsole, there is a deadlock
> (when the bonding xmit function attempts to acquire the same lock for
> read).


Yes.
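
To make the recursion concrete, here is a minimal userspace sketch of it.
This is not bonding or netpoll code: every toy_* name is made up for
illustration, the lock is a single-threaded stand-in for the kernel
rwlock, and the read side is a trylock so that the sketch counts the
would-be deadlock instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy, single-threaded stand-in for a kernel rwlock (hypothetical). */
struct toy_rwlock {
	bool writer;	/* true while the write side is held */
	int readers;	/* number of read-side holders */
};

static void toy_write_lock(struct toy_rwlock *l)
{
	assert(!l->writer && l->readers == 0);
	l->writer = true;
}

static void toy_write_unlock(struct toy_rwlock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* A blocking read lock would spin here forever; the toy reports failure. */
static bool toy_read_trylock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

struct toy_bond {
	struct toy_rwlock lock;
	int sent;		/* messages that made it out */
	int would_deadlock;	/* xmit attempts that hit the held write lock */
};

/* Stand-in for the bonding xmit path, which takes the lock for read. */
static void toy_bond_xmit(struct toy_bond *b, const char *msg)
{
	(void)msg;	/* payload is irrelevant to the locking problem */
	if (!toy_read_trylock(&b->lock)) {
		b->would_deadlock++;
		return;
	}
	b->sent++;
	toy_read_unlock(&b->lock);
}

/* Stand-in for printk() with netconsole running on top of the bond. */
static void toy_printk(struct toy_bond *b, const char *msg)
{
	toy_bond_xmit(b, msg);
}

/* Stand-in for a failover that logs while holding the write lock. */
static void toy_failover(struct toy_bond *b)
{
	toy_write_lock(&b->lock);
	toy_printk(b, "new active slave selected");
	toy_write_unlock(&b->lock);
}
```

Calling toy_printk() on an idle toy_bond increments sent, while
toy_failover() increments would_deadlock instead. That second case is the
point where the real xmit path would wait on a lock its own caller holds.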

>
> 	You're trying to avoid the deadlock by shutting off netconsole
> (permanently, it looks like) for one problem case: a failover, which
> does some printks with a write lock held.
>
> 	This doesn't look to me like a complete solution, there are
> other cases in bonding that will do printk with write locks held.  I
> suspect those will also hang netconsole as things exist today, and won't
> be affected by your patch below.


I expected that; the bonding modes are complex.

>
> 	For example:
>
> 	The sysfs functions to set the primary (bonding_store_primary)
> or active (bonding_store_active_slave) options: a pr_info is called to
> provide a log message of the results.  These could be tested by setting
> the primary or active options via sysfs, e.g.,
>
> echo eth0 > /sys/class/net/bond0/bonding/primary
> echo eth0 > /sys/class/net/bond0/bonding/active_slave
>
> 	If the kernel is built with DEBUG defined, there are a few pr_debug
> calls within write_locks (bond_del_vlan, for example).
>
> 	If the slave's underlying device driver's ndo_vlan_rx_register
> or ndo_vlan_rx_kill_vid functions call printk (and it looks like some do
> for error cases, e.g., igbvf, ehea, enic), those would also presumably
> deadlock (because bonding holds its write_lock when calling the ndo_
> vlan functions).
>
> 	It also appears that (with the patch below) some nominally
> normal usage patterns will immediately disable netconsole.  The one that
> comes to mind is if the primary= option is set (to "eth1" for this
> example), but that slave is not enslaved first (the slaves are added, say,
> eth0 then eth1).  In that situation, when the primary slave (eth1 here)
> is added, the first thing that will happen is a failover, and that will
> disable netconsole.
>

Thanks for your detailed explanation!

This is why I said bonding is complex. I guess we would have to adjust
the netpoll code for the different bonding cases; no single solution
seems to fix them all. I am not sure how much work that would be, since
I am not familiar with the bonding code. Maybe Andy can help?

The previous patch can at least make Flavio happy. :)

Thanks!
