* Re: A question about FAILOVER event in RoCE LAG
[not found] <26b0d23202814f60b994ce123830353d@hisilicon.com>
@ 2023-02-26 12:53 ` Leon Romanovsky
2023-02-26 13:57 ` Mark Bloch
0 siblings, 1 reply; 4+ messages in thread
From: Leon Romanovsky @ 2023-02-26 12:53 UTC (permalink / raw)
To: huangjunxian (C), Mark Bloch
Cc: jgg@nvidia.com, linux-rdma@vger.kernel.org, Linuxarm
+Mark
On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
> Hi folks!
>
> We've been working on LAG in hns RoCE driver, and we notice that when a FAILOVER event
> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
> are added, triggered by the event handler listed below.
>
> So, when a FAILOVER event occurs on a RDMA bond device with running traffic, does it make
> sense that the traffic is terminated since its GIDs are deleted?
>
> The FAILOVER event handler mentioned above:
> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
> {
> ......
> static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
> .cb = del_netdev_upper_ips, .filter = upper_device_filter};
> ......
> switch (event) {
> ......
> case NETDEV_BONDING_FAILOVER:
> cmds[0] = bonding_event_ips_del_cmd;
> /* Add default GIDs of the bond device */
> cmds[1] = bonding_default_add_cmd;
> /* Add IP based GIDs of the bond device */
> cmds[2] = add_cmd_upper_ips;
> break;
> ......
> }
> ......
> }
>
> Thanks,
> Junxian
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: A question about FAILOVER event in RoCE LAG
2023-02-26 12:53 ` A question about FAILOVER event in RoCE LAG Leon Romanovsky
@ 2023-02-26 13:57 ` Mark Bloch
2023-03-03 2:36 ` Junxian Huang
0 siblings, 1 reply; 4+ messages in thread
From: Mark Bloch @ 2023-02-26 13:57 UTC (permalink / raw)
To: Leon Romanovsky, huangjunxian (C), Mark Bloch
Cc: jgg@nvidia.com, linux-rdma@vger.kernel.org, Linuxarm
On 26/02/2023 14:53, Leon Romanovsky wrote:
> +Mark
>
> On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
>> Hi folks!
>>
>> We've been working on LAG in hns RoCE driver, and we notice that when a FAILOVER event
>> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
>> are added, triggered by the event handler listed below.
>>
>> So, when a FAILOVER event occurs on a RDMA bond device with running traffic, does it make
>> sense that the traffic is terminated since its GIDs are deleted?
Yep, please read the original commit message:
commit 238fdf48f2b54a01cedb5774c3a1e81c94e1a3a0
Author: Matan Barak <matanb@mellanox.com>
Date: Thu Jul 30 18:33:27 2015 +0300
IB/core: Add RoCE table bonding support
Handling bonding and other devices require us to all all GIDs of the
net-devices which are upper-devices of the RoCE port related
net-device.
Active-backup configurations imposes even more challenges as the
default GID should only be set on the active devices (this is
necessary as otherwise the same MAC could be used for several
slaves and thus several slaves will have identical GIDs).
Managing these configurations are done by listening to:
(a) NETDEV_CHANGEUPPER event
(1) if a related net-device is linked, delete all inactive
slaves default GIDs and add the upper device GIDs.
(2) if a related net-device is unlinked, delete all upper GIDs
and add the default GIDs.
(b) NETDEV_BONDING_FAILOVER:
(1) delete the bond GIDs from inactive slaves
(2) delete the inactive slave's default GIDs
(3) Add the bond GIDs to the active slave.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
and please read: https://wiki.linuxfoundation.org/networking/bonding
especially the section that explains some of the restrictions of
active-backup mode.
Mark
>>
>> The FAILOVER event handler mentioned above:
>> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
>> {
>> ......
>> static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>> .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>> ......
>> switch (event) {
>> ......
>> case NETDEV_BONDING_FAILOVER:
>> cmds[0] = bonding_event_ips_del_cmd;
>> /* Add default GIDs of the bond device */
>> cmds[1] = bonding_default_add_cmd;
>> /* Add IP based GIDs of the bond device */
>> cmds[2] = add_cmd_upper_ips;
>> break;
>> ......
>> }
>> ......
>> }
>>
>> Thanks,
>> Junxian
* Re: A question about FAILOVER event in RoCE LAG
2023-02-26 13:57 ` Mark Bloch
@ 2023-03-03 2:36 ` Junxian Huang
2023-03-03 5:34 ` Mark Bloch
0 siblings, 1 reply; 4+ messages in thread
From: Junxian Huang @ 2023-03-03 2:36 UTC (permalink / raw)
To: Mark Bloch, Leon Romanovsky
Cc: jgg@nvidia.com, linux-rdma@vger.kernel.org, Linuxarm,
huangjunxian (C)
On 2023/2/26 21:57, Mark Bloch wrote:
>
>
> On 26/02/2023 14:53, Leon Romanovsky wrote:
>> +Mark
>>
>> On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
>>> Hi folks!
>>>
>>> We've been working on LAG in hns RoCE driver, and we notice that when a FAILOVER event
>>> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
>>> are added, triggered by the event handler listed below.
>>>
>>> So, when a FAILOVER event occurs on a RDMA bond device with running traffic, does it make
>>> sense that the traffic is terminated since its GIDs are deleted?
>
> Yep, please read the original commit message:
>
> commit 238fdf48f2b54a01cedb5774c3a1e81c94e1a3a0
> Author: Matan Barak <matanb@mellanox.com>
> Date: Thu Jul 30 18:33:27 2015 +0300
>
> IB/core: Add RoCE table bonding support
>
> Handling bonding and other devices require us to all all GIDs of the
> net-devices which are upper-devices of the RoCE port related
> net-device.
>
> Active-backup configurations imposes even more challenges as the
> default GID should only be set on the active devices (this is
> necessary as otherwise the same MAC could be used for several
> slaves and thus several slaves will have identical GIDs).
>
> Managing these configurations are done by listening to:
> (a) NETDEV_CHANGEUPPER event
> (1) if a related net-device is linked, delete all inactive
> slaves default GIDs and add the upper device GIDs.
> (2) if a related net-device is unlinked, delete all upper GIDs
> and add the default GIDs.
> (b) NETDEV_BONDING_FAILOVER:
> (1) delete the bond GIDs from inactive slaves
> (2) delete the inactive slave's default GIDs
> (3) Add the bond GIDs to the active slave.
>
> Signed-off-by: Matan Barak <matanb@mellanox.com>
> Signed-off-by: Doug Ledford <dledford@redhat.com>
>
> and please read: https://wiki.linuxfoundation.org/networking/bonding
> especially the section that explains some of the restrictions of
> active-backup mode.
>
> Mark
>
>>>
>>> The FAILOVER event handler mentioned above:
>>> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
>>> {
>>> ......
>>> static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>>> .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>>> ......
>>> switch (event) {
>>> ......
>>> case NETDEV_BONDING_FAILOVER:
>>> cmds[0] = bonding_event_ips_del_cmd;
>>> /* Add default GIDs of the bond device */
>>> cmds[1] = bonding_default_add_cmd;
>>> /* Add IP based GIDs of the bond device */
>>> cmds[2] = add_cmd_upper_ips;
>>> break;
>>> ......
>>> }
>>> ......
>>> }
>>>
>>> Thanks,
>>> Junxian
Thanks for replying. I'd like to explain my question more specifically.
When a GID is deleted, the GID index is not reusable until all references
are dropped, according to this commit message:
commit be5914c124bc3179536e5c4598f59aeb4b880517
Author: Parav Pandit <parav@mellanox.com>
Date: Tue Dec 18 14:16:00 2018 +0200
RDMA/core: Delete RoCE GID in hw when corresponding IP is deleted
Currently a RoCE GID entry is removed from the hardware when all
references to the GID entry drop to zero. This is a change in behavior
from before the fixed patch. The GID entry should be removed from the
hardware when GID entry deletion is requested. This allows the driver
terminate ongoing traffic through the RoCE GID.
While a GID is deleted from the hardware, GID slot in the software GID
cache is not freed. GID slot is freed once all references of such GID are
dropped. This continue to ensure that such GID slot of hardware is not
allocated to new GID entry allocation request. It is allocated once all
references to GID entry drop.
This approach allows drivers to put a tombestone of some kind on the HW
GID index to block the traffic.
Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Assume that a RoCE bond device is running RoCE traffic with a GID at index 3.
When a FAILOVER event occurs, the original GIDs at indices 0-3 are deleted
and new GIDs are added. As the QP is still holding a reference to index 3,
the new GIDs are not added at indices 0-3 but at indices 0-2 and index 4.
Due to the deletion of the original GID at index 3, the RoCE traffic is
terminated. Although a backup slave becomes active, the traffic will
never recover, as no new GID is added at index 3 and there is no
valid GID there. This means the active-backup mode in RoCE LAG cannot
maintain ongoing RoCE traffic when a FAILOVER event happens.
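The scenario above can be sketched as a toy model of the reference-counted GID cache. This is plain Python, not kernel code; the class and variable names are illustrative only, but the allocation rule follows the be5914c124bc commit message: a deleted entry's slot stays reserved (tombstoned) while a QP still holds a reference, so re-added GIDs skip that index.

```python
# Toy model (not kernel code) of the RoCE GID cache behavior described in
# commit be5914c124bc: a deleted entry is removed from "hardware" at once,
# but its slot is only freed when the last reference is dropped.

class GidTable:
    def __init__(self, size):
        self.slots = [None] * size          # GID value per slot, or None
        self.refs = [0] * size              # outstanding references per slot
        self.tombstone = [False] * size     # deleted but still referenced

    def add(self, gid):
        for i, slot in enumerate(self.slots):
            # A slot is allocatable only if it is empty and not tombstoned.
            if slot is None and not self.tombstone[i]:
                self.slots[i] = gid
                return i
        raise RuntimeError("GID table full")

    def get_ref(self, index):
        self.refs[index] += 1

    def put_ref(self, index):
        self.refs[index] -= 1
        if self.refs[index] == 0 and self.tombstone[index]:
            self.tombstone[index] = False   # slot finally becomes reusable

    def delete(self, index):
        # GID leaves the "hardware" immediately (traffic stops), but the
        # slot is only freed once all references drop.
        self.slots[index] = None
        if self.refs[index] > 0:
            self.tombstone[index] = True

# Failover scenario from this mail: GIDs at indices 0-3, a QP holds index 3.
table = GidTable(8)
for n in range(4):
    table.add(f"gid{n}")
table.get_ref(3)                            # QP references index 3

for n in range(4):                          # FAILOVER: delete all old GIDs
    table.delete(n)
new_indices = [table.add(f"new_gid{n}") for n in range(4)]
print(new_indices)                          # -> [0, 1, 2, 4]; index 3 is skipped
```

Once the QP drops its reference, the tombstone is cleared and index 3 can be allocated again.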
Are the observation and conclusion here correct? This seems odd to me because,
as far as I know, there is no such restriction on active-backup mode in
NIC bonding.
Junxian
* Re: A question about FAILOVER event in RoCE LAG
2023-03-03 2:36 ` Junxian Huang
@ 2023-03-03 5:34 ` Mark Bloch
0 siblings, 0 replies; 4+ messages in thread
From: Mark Bloch @ 2023-03-03 5:34 UTC (permalink / raw)
To: Junxian Huang, Leon Romanovsky
Cc: jgg@nvidia.com, linux-rdma@vger.kernel.org, Linuxarm
On 03/03/2023 4:36, Junxian Huang wrote:
>
>
> On 2023/2/26 21:57, Mark Bloch wrote:
>>
>>
>> On 26/02/2023 14:53, Leon Romanovsky wrote:
>>> +Mark
>>>
>>> On Fri, Feb 24, 2023 at 11:14:47AM +0000, huangjunxian (C) wrote:
>>>> Hi folks!
>>>>
>>>> We've been working on LAG in hns RoCE driver, and we notice that when a FAILOVER event
>>>> occurs in active-backup mode, all GIDs of the RDMA bond device are deleted and new GIDs
>>>> are added, triggered by the event handler listed below.
>>>>
>>>> So, when a FAILOVER event occurs on a RDMA bond device with running traffic, does it make
>>>> sense that the traffic is terminated since its GIDs are deleted?
>>
>> Yep, please read the original commit message:
>>
>> commit 238fdf48f2b54a01cedb5774c3a1e81c94e1a3a0
>> Author: Matan Barak <matanb@mellanox.com>
>> Date: Thu Jul 30 18:33:27 2015 +0300
>>
>> IB/core: Add RoCE table bonding support
>>
>> Handling bonding and other devices require us to all all GIDs of the
>> net-devices which are upper-devices of the RoCE port related
>> net-device.
>>
>> Active-backup configurations imposes even more challenges as the
>> default GID should only be set on the active devices (this is
>> necessary as otherwise the same MAC could be used for several
>> slaves and thus several slaves will have identical GIDs).
>>
>> Managing these configurations are done by listening to:
>> (a) NETDEV_CHANGEUPPER event
>> (1) if a related net-device is linked, delete all inactive
>> slaves default GIDs and add the upper device GIDs.
>> (2) if a related net-device is unlinked, delete all upper GIDs
>> and add the default GIDs.
>> (b) NETDEV_BONDING_FAILOVER:
>> (1) delete the bond GIDs from inactive slaves
>> (2) delete the inactive slave's default GIDs
>> (3) Add the bond GIDs to the active slave.
>>
>> Signed-off-by: Matan Barak <matanb@mellanox.com>
>> Signed-off-by: Doug Ledford <dledford@redhat.com>
>>
>> and please read: https://wiki.linuxfoundation.org/networking/bonding
>> especially the section that explains some of the restrictions of
>> active-backup mode.
>>
>> Mark
>>
>>>>
>>>> The FAILOVER event handler mentioned above:
>>>> static int netdevice_event(struct notifier_block *this, unsigned long event, void *ptr)
>>>> {
>>>> ......
>>>> static const struct netdev_event_work_cmd bonding_event_ips_del_cmd = {
>>>> .cb = del_netdev_upper_ips, .filter = upper_device_filter};
>>>> ......
>>>> switch (event) {
>>>> ......
>>>> case NETDEV_BONDING_FAILOVER:
>>>> cmds[0] = bonding_event_ips_del_cmd;
>>>> /* Add default GIDs of the bond device */
>>>> cmds[1] = bonding_default_add_cmd;
>>>> /* Add IP based GIDs of the bond device */
>>>> cmds[2] = add_cmd_upper_ips;
>>>> break;
>>>> ......
>>>> }
>>>> ......
>>>> }
>>>>
>>>> Thanks,
>>>> Junxian
>
> Thanks for replying. I'd like to explain my question more specifically.
>
> When a GID is deleted, the GID index is not reusable until all references
> are dropped, according to this commit message:
>
> commit be5914c124bc3179536e5c4598f59aeb4b880517
> Author: Parav Pandit <parav@mellanox.com>
> Date: Tue Dec 18 14:16:00 2018 +0200
>
> RDMA/core: Delete RoCE GID in hw when corresponding IP is deleted
>
> Currently a RoCE GID entry is removed from the hardware when all
> references to the GID entry drop to zero. This is a change in behavior
> from before the fixed patch. The GID entry should be removed from the
> hardware when GID entry deletion is requested. This allows the driver
> terminate ongoing traffic through the RoCE GID.
>
> While a GID is deleted from the hardware, GID slot in the software GID
> cache is not freed. GID slot is freed once all references of such GID are
> dropped. This continue to ensure that such GID slot of hardware is not
> allocated to new GID entry allocation request. It is allocated once all
> references to GID entry drop.
>
> This approach allows drivers to put a tombestone of some kind on the HW
> GID index to block the traffic.
>
> Fixes: b150c3862d21 ("IB/core: Introduce GID entry reference counts")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> Reviewed-by: Mark Bloch <markb@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>
> Assume that a RoCE bond device is running RoCE traffic with a GID in index 3.
>
> When a FAILOVER event occurs, the original GIDs in index 0-3 are deleted
> and new GIDs are added. As the QP is still holding a reference to index 3,
> the new GIDs are not added to index 0-3 but to index 0-2 and index 4.
>
> Due to the deletion of the original GID in index 3, the RoCE traffic is
> terminated. Although a backup slave becomes active, the traffic will
> not recover anymore as the new GID is not added to index 3 and there is no
> valid GID in index 3. This means the active-backup mode in RoCE LAG cannot
> maintain ongoing RoCE traffic when a FAILOVER event happens.
If you want fault tolerance and to keep the traffic going, you should use LACP.
>
> Are the phenomenon and conclusion here correct? This seems odd to me because
> as far as I know, there is no such restriction on the active-backup mode in
> NIC Bonding.
You are talking about netdev bonding, I assume. With netdev traffic you can
control which netdev is used, so traffic won't egress from multiple physical
ports (this is what the bond is for). With RDMA we can't control that, as the
QP is used by a user (kernel or userspace), so if the traffic weren't stopped,
the switch would see the same source MAC address from two different ports.
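As a toy illustration of that failure mode (plain Python, not a real switch; the class and MAC address are made up): a learning switch that sees one source MAC alternating between two ingress ports keeps moving its forwarding entry back and forth.

```python
# Toy sketch of a learning switch's MAC table. If bond slaves share one MAC
# and the inactive slave is not suppressed, the same source MAC arrives on
# two switch ports and the learned entry flaps on every alternation.

class LearningSwitch:
    def __init__(self):
        self.mac_table = {}                 # MAC -> last port it was seen on
        self.moves = 0                      # how often an entry flapped

    def ingress(self, src_mac, port):
        prev = self.mac_table.get(src_mac)
        if prev is not None and prev != port:
            self.moves += 1                 # station "moved": table flap
        self.mac_table[src_mac] = port

sw = LearningSwitch()
for _ in range(5):
    sw.ingress("aa:bb:cc:dd:ee:ff", port=1)  # frame from active slave
    sw.ingress("aa:bb:cc:dd:ee:ff", port=2)  # frame from backup slave
print(sw.moves)                              # -> 9
```

Every flap means frames destined to that MAC may be forwarded to the wrong port until the entry is relearned.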
Of course there can be other solutions; for example, you can make sure only
one physical port is used for egress (with steering, or by blocking traffic
going via the slave port). If your hardware can do something like that, it
should be possible to support it.
Mark
>
> Junxian
Thread overview: 4+ messages
[not found] <26b0d23202814f60b994ce123830353d@hisilicon.com>
2023-02-26 12:53 ` A question about FAILOVER event in RoCE LAG Leon Romanovsky
2023-02-26 13:57 ` Mark Bloch
2023-03-03 2:36 ` Junxian Huang
2023-03-03 5:34 ` Mark Bloch