All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jay Vosburgh <jay.vosburgh@canonical.com>
To: Aleksei Zakharov <zaharov@selectel.ru>
Cc: netdev@vger.kernel.org, "zhangsha (A)" <zhangsha.zhang@huawei.com>
Subject: Re: Fwd: [PATCH] bonding/802.3ad: fix slave initialization states race
Date: Thu, 26 Sep 2019 13:01:54 -0700	[thread overview]
Message-ID: <23353.1569528114@nyx> (raw)
In-Reply-To: <CAJYOGF-84BfK8DvAnam9+tgfo4=oBs04zF-ETWRfhz7CE_9oBA@mail.gmail.com>

Aleksei Zakharov <zaharov@selectel.ru> wrote:

>чт, 26 сент. 2019 г. в 07:38, Jay Vosburgh <jay.vosburgh@canonical.com>:
>>
>> Aleksei Zakharov <zaharov@selectel.ru> wrote:
>>
>> >ср, 25 сент. 2019 г. в 03:31, Jay Vosburgh <jay.vosburgh@canonical.com>:
>> >>
>> >> Алексей Захаров wrote:
>> >> [...]
>> >> >Right after reboot one of the slaves hangs with actor port state 71
>> >> >and partner port state 1.
>> >> >It doesn't send lacpdu and seems to be broken.
>> >> >Setting link down and up again fixes slave state.
>> >> [...]
>> >>
>> >>         I think I see what failed in the first patch, could you test the
>> >> following patch?  This one is for net-next, so you'd need to again swap
>> >> slave_err / netdev_err for the Ubuntu 4.15 kernel.
>> >>
>> >I've tested new patch. It seems to work. I can't reproduce the bug
>> >with this patch.
>> >There are two types of messages when link becomes up:
>> >First:
>> >bond-san: EVENT 1 llu 4294895911 slave eth2
>> >8021q: adding VLAN 0 to HW filter on device eth2
>> >bond-san: link status definitely down for interface eth2, disabling it
>> >mlx4_en: eth2: Link Up
>> >bond-san: EVENT 4 llu 4294895911 slave eth2
>> >bond-san: link status up for interface eth2, enabling it in 500 ms
>> >bond-san: invalid new link 3 on slave eth2
>> >bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
>> >Second:
>> >bond-san: EVENT 1 llu 4295147594 slave eth2
>> >8021q: adding VLAN 0 to HW filter on device eth2
>> >mlx4_en: eth2: Link Up
>> >bond-san: EVENT 4 llu 4295147594 slave eth2
>> >bond-san: link status up again after 0 ms for interface eth2
>> >bond-san: link status definitely up for interface eth2, 10000 Mbps full duplex
>> > [...]
>>
>>         The "invalid new link" is appearing because bond_miimon_commit
>> is being asked to commit a new state that isn't UP or DOWN (3 is
>> BOND_LINK_BACK).  I looked through the patched code today, and I don't
>> see a way to get to that message with the new link set to 3, so I'll add
>> some instrumentation and send out another patch to figure out what's
>> going on, as that shouldn't happen.
>>
>>         I don't see the "invalid" message testing locally, I think
>> because my network device doesn't transition to carrier up as quickly as
>> yours.  I thought you were getting BOND_LINK_BACK passed through from
>> bond_enslave (which calls bond_set_slave_link_state, which will set
>> link_new_link to BOND_LINK_BACK and leave it there), but the
>> link_new_link is reset first thing in bond_miimon_inspect, so I'm not
>> sure how it gets into bond_miimon_commit (I'm thinking perhaps a
>> concurrent commit triggered by another slave, which then picks up this
>> proposed link state change by happenstance).
>I assume that "invalid new link" happens in this way:
>Interface goes up
>NETDEV_CHANGE event occurs
>bond_update_speed_duplex fails
>and slave->last_link_up returns true
>slave->link becomes BOND_LINK_FAIL
>bond_check_dev_link returns 0
>miimon proposes slave->link_new_state BOND_LINK_DOWN
>NETDEV_UP event occurs
>miimon sets commit++
>miimon proposes slave->link_new_state BOND_LINK_BACK
>miimon sets slave->link to BOND_LINK_BACK

	I removed the "proposes link_new_state BOND_LINK_BACK" from the
second test patch and replaced it with the slave->link = BOND_LINK_BACK.
This particular place in the code also does not do commit++.  If you
have both of those in the code you're running, then perhaps you have a
merge error or some such.

	In the second test patch, the only place that could set
link_new_state to BOND_LINK_BACK is in bond_enslave, which calls
bond_set_slave_link_state if the slave is carrier up and updelay is
configured.  If that were to happen, there should be a "BOND_LINK_BACK
initial state" debug message, and the link_new_state should be replaced
with NOCHANGE at the first pass through bond_miimon_inspect.

	So, I'm unclear how the link_new_state can be BOND_LINK_BACK
from the message log you provided based on the second test patch code.

>we have updelay configured, so it doesn't set BOND_LINK_UP in the next
>case section
>miimon says "Invalid new link" and sets link state UP during next
>inspection(after updelay, i suppose)
>
>For the second type of messages it looks like this:
>Interface goes up
>NETDEV_CHANGE event occurs
>bond_update_speed_duplex fails
>and slave->last_link_up returns true
>slave->link becomes BOND_LINK_FAIL
>NETDEV_UP event occurs
>bond_check_dev_link returns 1
>miimon proposes slave->link_new_state BOND_LINK_UP and says "link
>status up again"
>
>My first patch changed slave->last_link_up check to (slave->link ==
>BOND_LINK_UP).
>This check looks more consistent for me, but I might be wrong here.
>As a result if link was in BOND_LINK_FAIL or BOND_LINK_BACK when
>CHANGE or UP event,
>it became BOND_LINK_DOWN.
>But if it was initially UP and bond_update_speed_duplex was unable to
>get speed/duplex,
>link became BOND_LINK_FAIL.
>
>I don't understand a few things here:
>How could a link be in a different state from time to time during the
>first NETDEV_* event?

	I'm not sure; possibly a race between the events in the kernel
and how long it takes for the hardware to establish Ethernet link up.

>And why slave->last_link_up is set when the first NETDEV event occurs?

	slave->last_link_up can be set at enslave time if the carrier
state of the slave (and thus the initial slave->link) is in a not-down
state.  There are some paths as well for modes that have an "active"
slave, but 802.3ad is not one of those.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

  reply	other threads:[~2019-09-26 20:02 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-18 13:05 [PATCH] bonding/802.3ad: fix slave initialization states race Aleksei Zakharov
2019-09-18 14:34 ` Jay Vosburgh
     [not found]   ` <CAJYOGF9KZdouvmTxQcTOQgsi-uBxbvW50K3ufW1=8neeW98QVA@mail.gmail.com>
2019-09-18 18:27     ` Fwd: " Алексей Захаров
2019-09-19  8:00       ` Jay Vosburgh
2019-09-19  9:53         ` Алексей Захаров
2019-09-19 15:27           ` Jay Vosburgh
2019-09-20 13:52             ` Jay Vosburgh
2019-09-20 16:00               ` Алексей Захаров
2019-09-21  7:06                 ` Jay Vosburgh
2019-09-21 11:17                   ` Алексей Захаров
2019-09-25  0:31                     ` Jay Vosburgh
2019-09-25 11:01                       ` Aleksei Zakharov
2019-09-26  4:38                         ` Jay Vosburgh
2019-09-26 14:25                           ` Aleksei Zakharov
2019-09-26 20:01                             ` Jay Vosburgh [this message]
2019-09-27 11:14                               ` Aleksei Zakharov
2019-09-27  9:43                         ` zhangsha (A)
2019-10-22 12:05                       ` Aleksei Zakharov
2019-09-24 13:52 ` David Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=23353.1569528114@nyx \
    --to=jay.vosburgh@canonical.com \
    --cc=netdev@vger.kernel.org \
    --cc=zaharov@selectel.ru \
    --cc=zhangsha.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.