* Is anyone working on implementing LAG in ib core?
  From: Weihang Li
  Date: 2020-02-22  3:48 UTC
  To: dledford, jgg, leon
  Cc: linux-rdma@vger.kernel.org, linuxarm

Hi all,

We plan to implement LAG in the hns driver soon, and as we know, there is
already a mature and stable solution in the mlx5 driver. Considering that
the net subsystem in the kernel has the framework implement bonding, we
think it is reasonable to add a LAG feature to the ib core based on mlx5's
implementation, so that all drivers, including hns and mlx5, can benefit
from it.

In previous discussions with Leon about reporting ib port link events in
the ib core, Leon mentioned that there might be someone already trying to
do this.

So may I ask whether anyone is working on LAG in the ib core, or planning
to implement it in the near future? I would appreciate it if you could
share your progress with me, and maybe we can finish it together.

If nobody is working on this, our team may try to implement LAG in the
core. Any comments and suggestions are welcome.

Thanks,
Weihang
* Re: Is anyone working on implementing LAG in ib core?
  From: Jason Gunthorpe
  Date: 2020-02-22 23:40 UTC
  To: Weihang Li
  Cc: dledford, leon, linux-rdma@vger.kernel.org, linuxarm, Shiraz Saleem

On Sat, Feb 22, 2020 at 11:48:04AM +0800, Weihang Li wrote:
> So may I ask whether anyone is working on LAG in the ib core, or planning
> to implement it in the near future? I would appreciate it if you could
> share your progress with me, and maybe we can finish it together.
>
> If nobody is working on this, our team may try to implement LAG in the
> core. Any comments and suggestions are welcome.

This is something that needs to be done. I understand several of the
other drivers are going to want to use LAG, and we certainly can't have
everything copied into each driver.

Jason
* Re: Is anyone working on implementing LAG in ib core?
  From: Parav Pandit
  Date: 2020-02-23  0:44 UTC
  To: Jason Gunthorpe, Weihang Li
  Cc: dledford@redhat.com, leon@kernel.org, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

Hi Jason, Weihang,

On 2/22/2020 5:40 PM, Jason Gunthorpe wrote:
> This is something that needs to be done. I understand several of the
> other drivers are going to want to use LAG, and we certainly can't have
> everything copied into each driver.

I am not sure mlx5 is the right model for the new rdma bond device
support, which I tried to highlight in Q&A 1 below.

I have the following not-so-refined proposal for an rdma bond device.

- Create an rdma bond device named rbond0 from two slave rdma devices,
  mlx5_0 and mlx5_1, connected to netdevice bond1 and using the underlying
  dma device of the mlx5_0 rdma device:

  $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave mlx5_1 dmadevice mlx5_0

  $ rdma dev show
  0: mlx5_0: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4660 sys_image_guid 248a:0703:0055:4660
  1: mlx5_1: node_type ca fw 12.25.1020 node_guid 248a:0703:0055:4661 sys_image_guid 248a:0703:0055:4660
  2: rbond0: node_type ca node_guid 248a:0703:0055:4660 sys_image_guid 248a:0703:0055:4660

- This should be done via an rdma bond driver in
  drivers/infiniband/ulp/rdma_bond.

A few obvious questions arise from the above proposal:

1. Why can't I do the trick of removing two or more rdma devices and
   creating one device when the bond0 netdevice is created?
Ans:
(a) Because it leads to complex code in the vendor driver to handle netdev
    events under the rtnl lock. Given that the GID table needs to hold the
    rtnl lock for a short duration in ib_register_device(), the work must
    be deferred to a work queue and synchronized.
(b) The user cannot predict when this new rdma bond device will be created
    automatically, or after how long.
(c) If some failure occurs, should the user parse /var/log/messages to
    figure out the error? What are the steps to roll back and retry?
(d) What if the driver internally attempts a retry?
...and some more.

2. Why do we need to give a netdevice in the above proposal?
Ans: Because for a RoCE device you want to build the right GID table for
its matching netdevice. No guesswork.

3. But with that there will be multiple devices, rbond0 and mlx5_0, with
   the same GID table entries, and that will confuse the user. What do we
   do about it?
Ans: That won't happen, because this bond driver accepts slave rdma
devices, and it will request the IB core to disable the GID tables of the
slave rdma devices. Alternatively, we can have commands to disable/enable
specific GID types of the slave rdma devices, which the user can invoke
before creating the rdma bond device, such as:

  $ rdma link mlx5_0 disable rocev1
  $ rdma link mlx5_0 enable rocev2

This is easy to compose and also addresses the wider use case where RoCEv1
GID table entries can be disabled to make efficient use of the GID table.

4. But if we are going to disable the GID table, why do we even need those
   RDMA devices?
Ans: Because when you delete the bond rdma device, you can revert back to
those mlx5_0/1 devices. Delete follows the mirror of add.

5. Why do we need to give the DMA device?
Ans: Because we want to avoid guesswork in the rdma_bond driver about
which DMA device to use. The user knows, based on the system (irqs etc.),
how he wants to use it.

6. What happens if slave pci devices are hot removed?
Ans: If a slave dma device is removed, it disassociates the ucontext and
rbond0 becomes unusable for applications.

7. How is the failover done?
Ans: Since the failover netdevice is provided, its failover settings are
inherited by the rbond0 rdma device and passed on to its slave devices.
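A minimal C sketch of the shape such a drivers/infiniband/ulp/rdma_bond
module could take is shown below. It is only an illustration of the
proposal above: the rdma_bond_* names and the netlink entry point are
assumptions, and the exact ib_register_device()/ib_client signatures vary
between kernel versions.

/*
 * Sketch of the proposed common rdma_bond ULP.  Invented names; only the
 * ib_client interface and ib_alloc_device()/ib_register_device() are
 * existing IB core facilities.
 */
#include <linux/netdevice.h>
#include <rdma/ib_verbs.h>

struct rdma_bond_dev {
        struct ib_device  ibdev;      /* the rbond0 device */
        struct net_device *bond_ndev; /* e.g. bond1 */
        struct ib_device  *slaves[2]; /* e.g. mlx5_0, mlx5_1 */
        struct ib_device  *active;    /* slave currently carrying traffic */
};

/* Track candidate slave rdma devices like any other IB core client. */
static void rdma_bond_client_add(struct ib_device *ibdev)
{
        /* remember ibdev so a later "rdma dev add type bond ..." can use it */
}

static void rdma_bond_client_remove(struct ib_device *ibdev, void *data)
{
        /* if ibdev is a slave of an existing bond, tear that bond down */
}

static struct ib_client rdma_bond_client = {
        .name   = "rdma_bond",
        .add    = rdma_bond_client_add,
        .remove = rdma_bond_client_remove,
};

/*
 * Hypothetical handler for "rdma dev add type bond ...", reached via the
 * rdma netlink socket rather than a netdev notifier, so it does not run
 * under the rtnl lock and can return an error code to the user directly.
 */
static int rdma_bond_newdev(const char *name, struct net_device *bond_ndev,
                            struct ib_device *slave0, struct ib_device *slave1,
                            struct device *dma_dev)
{
        struct rdma_bond_dev *bdev;

        bdev = ib_alloc_device(rdma_bond_dev, ibdev);
        if (!bdev)
                return -ENOMEM;

        bdev->bond_ndev = bond_ndev;
        bdev->slaves[0] = slave0;
        bdev->slaves[1] = slave1;
        bdev->active = slave0;

        /* Fill bdev->ibdev.ops with wrappers that forward to the slaves and
         * wire up dma_dev (mlx5_0's parent in the example) as the DMA/irq
         * provider, then register the new device. */
        return ib_register_device(&bdev->ibdev, name);
}

Because creation is an explicit command, errors are reported synchronously
to the rdma tool instead of via /var/log/messages, which is the point of
Q&A 1(b) and 1(c) above.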
* Re: Is anyone working on implementing LAG in ib core?
  From: Leon Romanovsky
  Date: 2020-02-23  9:49 UTC
  To: Parav Pandit
  Cc: Jason Gunthorpe, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On Sun, Feb 23, 2020 at 12:44:12AM +0000, Parav Pandit wrote:
> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave mlx5_1 dmadevice mlx5_0
>
> - This should be done via an rdma bond driver in
>   drivers/infiniband/ulp/rdma_bond.

An extra question: why do we need an RDMA bond ULP which combines ib
devices, rather than creating a netdev bond device and creating one ib
device on top of it?

Thanks
* Re: Is anyone working on implementing LAG in ib core?
  From: Parav Pandit
  Date: 2020-02-24  7:10 UTC
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

Hi Leon,

On 2/23/2020 3:49 AM, Leon Romanovsky wrote:
> An extra question: why do we need an RDMA bond ULP which combines ib
> devices, rather than creating a netdev bond device and creating one ib
> device on top of it?

I read your question a few times, but I likely don't understand it.

Are you asking why bonding should be implemented as a dedicated
ulp/driver, and not as an extension in the vendor driver?

In an alternative approach, a given hw driver implements a newlink()
instead of there being a dedicated bond driver. However, that duplicates
the code in several drivers. Hence the idea is to do it in a common driver
and implement only the necessary hooks in the hw driver.
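To make the "common driver plus per-vendor hooks" split concrete, a hedged
sketch of what such a hook interface might look like follows; the
rdma_bond_hw_ops structure and its callbacks are invented for illustration
and do not exist in the tree.

#include <rdma/ib_verbs.h>

/*
 * Hypothetical operations a hw driver would implement for a common
 * rdma_bond driver: only the pieces the hardware alone can do.
 */
struct rdma_bond_hw_ops {
        /* program hw steering so RoCE traffic may use either physical port */
        int (*activate_lag)(struct ib_device *slave0, struct ib_device *slave1);
        /* undo activate_lag() when the bond rdma device is deleted */
        void (*deactivate_lag)(struct ib_device *slave0, struct ib_device *slave1);
        /* failover: route traffic through the newly active slave */
        int (*set_active_slave)(struct ib_device *slave);
};

Everything else (GID table handling, bond netdev tracking, device
registration and teardown) would live once in the common driver.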
* Re: Is anyone working on implementing LAG in ib core?
  From: Leon Romanovsky
  Date: 2020-02-24 10:52 UTC
  To: Parav Pandit
  Cc: Jason Gunthorpe, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On Mon, Feb 24, 2020 at 07:10:13AM +0000, Parav Pandit wrote:
> Are you asking why bonding should be implemented as a dedicated
> ulp/driver, and not as an extension in the vendor driver?

No, I meant something different. You are proposing to combine IB devices
while keeping the netdev devices separate. I'm asking whether it is
possible to combine the netdev devices with the already existing bond
driver and simply create a new ib device with the bond netdev as the
underlying provider.

I'm not suggesting implementing anything in vendor drivers.

> In an alternative approach, a given hw driver implements a newlink()
> instead of there being a dedicated bond driver. However, that duplicates
> the code in several drivers. Hence the idea is to do it in a common
> driver and implement only the necessary hooks in the hw driver.

I'm not sure about that.
* Re: Is anyone working on implementing LAG in ib core?
  From: Parav Pandit
  Date: 2020-02-24 18:12 UTC
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On 2/24/2020 4:52 AM, Leon Romanovsky wrote:
> No, I meant something different. You are proposing to combine IB devices
> while keeping the netdev devices separate. I'm asking whether it is
> possible to combine the netdev devices with the already existing bond
> driver and simply create a new ib device with the bond netdev as the
> underlying provider.

Ah, I understand now. Yes, if bond_newlink() can be extended to accept
rdma-device-specific parameters, it should be possible. For example, we
already have two classes of users: one wants an rdma bond device created
for a given bond netdev, and one does not.

You may already be aware that newlink() is called under the rtnl lock, and
that ib_register_device() also needs to acquire the rtnl lock. But this is
an implementation detail that can be discussed later, once a bond driver
extension for rdma makes sense. :-)

If one wants to create a vlan netdevice on top of a bond device, the user
usually creates it after the bond netdevice is created. Similarly, I see
the bond rdma device as composed of the underlying bond netdev and an
underlying rdma device, or at least its parent 'struct device', for the
purpose of irq routing, pci access etc. Compared to overloading
bond_newlink() with rdma parameters, independent rdma bond device creation
linked to the bond netdev appears to be the more elegant and modular
approach.
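To illustrate the locking constraint mentioned above: if bond rdma device
creation were driven from the netdev side instead of from an explicit rdma
command, the driver would learn about bonding through netdev notifier
events, which run under the rtnl lock, and would have to defer the actual
ib_device creation to a work queue roughly as sketched below (the rbond_*
names are illustrative; only the notifier and workqueue APIs are real).

#include <linux/netdevice.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct rbond_work {
        struct work_struct work;
        struct net_device *bond_ndev;
};

static void rbond_create_ibdev_work(struct work_struct *work)
{
        struct rbond_work *rw = container_of(work, struct rbond_work, work);

        /* Only here, outside rtnl, is it safe to allocate and register the
         * bond ib_device, since ib_register_device() briefly takes rtnl for
         * GID table setup.  Failures end up in the kernel log, not with the
         * user who created the bond. */

        dev_put(rw->bond_ndev);
        kfree(rw);
}

static int rbond_netdev_event(struct notifier_block *nb,
                              unsigned long event, void *ptr)
{
        struct netdev_notifier_changeupper_info *info = ptr;
        struct rbond_work *rw;

        /* Called under rtnl: record the event and defer everything else. */
        if (event != NETDEV_CHANGEUPPER ||
            !netif_is_bond_master(info->upper_dev))
                return NOTIFY_DONE;

        rw = kzalloc(sizeof(*rw), GFP_ATOMIC);
        if (!rw)
                return NOTIFY_DONE;
        dev_hold(info->upper_dev);
        rw->bond_ndev = info->upper_dev;
        INIT_WORK(&rw->work, rbond_create_ibdev_work);
        schedule_work(&rw->work);
        return NOTIFY_OK;
}

static struct notifier_block rbond_nb = {
        .notifier_call = rbond_netdev_event,
};

Driving creation from an explicit "rdma dev add" netlink command avoids
this deferral entirely and gives the user a synchronous return code, which
is the asymmetry being argued about in this thread.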
* Re: Is anyone working on implementing LAG in ib core?
  From: Jason Gunthorpe
  Date: 2020-02-24 18:29 UTC
  To: Leon Romanovsky
  Cc: Parav Pandit, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On Mon, Feb 24, 2020 at 12:52:06PM +0200, Leon Romanovsky wrote:
> No, I meant something different. You are proposing to combine IB devices
> while keeping the netdev devices separate. I'm asking whether it is
> possible to combine the netdev devices with the already existing bond
> driver and simply create a new ib device with the bond netdev as the
> underlying provider.

Isn't that basically what we do now in mlx5?

Logically the ib_device is attached to the bond; it uses the bond for IP
addressing, etc.

I'm not sure trying to have three ib_devices, like netdev does, is sane;
that is very, very complicated to get everything to work. The ib stuff
just isn't designed to be stacked like that.

Jason
* Re: Is anyone working on implementing LAG in ib core?
  From: Parav Pandit
  Date: 2020-02-24 19:01 UTC
  To: Jason Gunthorpe, Leon Romanovsky
  Cc: Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On 2/24/2020 12:29 PM, Jason Gunthorpe wrote:
> Isn't that basically what we do now in mlx5?

And it is broken in a few aspects that I described in Q&A question 1
earlier in this thread.

On top of that, the user has no ability to disable rdma bonding. Users
have asked us for exactly that: they want to disable it in some cases (not
on the mailing list), so non-upstream hacks exist, which are not
applicable to this discussion.

> Logically the ib_device is attached to the bond; it uses the bond for IP
> addressing, etc.
>
> I'm not sure trying to have three ib_devices, like netdev does, is sane;
> that is very, very complicated to get everything to work. The ib stuff
> just isn't designed to be stacked like that.

I am not sure I understand all the complications you have thought through.
I thought of a few and put them forward in the Q&A in this thread, and we
can improve the design as we go forward. The idea is to stack the rdma
bond device on top of the existing rdma devices as an ib_client, so that
the rdma bond device is exactly aware of what is going on with the slaves
and the bond netdev.

On top of that, I have been enjoying the list-delete corruption call trace
below for a while now, coming from this lag code, in a trivial devel test
that just loads and unloads the driver 30+ times (not even running
traffic). :-)

RIP: 0010:__list_del_entry_valid+0x7c/0xa0
 unregister_netdevice_notifier_dev_net+0x1f/0x70
 mlx5_lag_remove+0x61/0xf0 [mlx5_core]
 mlx5e_detach_netdev+0x24/0x50 [mlx5_core]
 mlx5e_detach+0x36/0x40 [mlx5_core]
 mlx5e_remove+0x48/0x60 [mlx5_core]
 mlx5_remove_device+0xb0/0xc0 [mlx5_core]
 mlx5_unregister_interface+0x39/0x90 [mlx5_core]
 cleanup+0x5/0xdd1 [mlx5_core]

And there is LAG code in a hw driver that reschedules its work if it
cannot acquire some mutex lock...
* Re: Is anyone working on implementing LAG in ib core?
  From: Jason Gunthorpe
  Date: 2020-02-24 19:23 UTC
  To: Parav Pandit
  Cc: Leon Romanovsky, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On Mon, Feb 24, 2020 at 07:01:43PM +0000, Parav Pandit wrote:
> And it is broken in a few aspects that I described in Q&A question 1
> earlier in this thread.
>
> On top of that, the user has no ability to disable rdma bonding.

And what does that mean? The real netdevs have no IP addresses, so what
exactly does a non-bonded RoCEv2 RDMA device do?

> Users have asked us for exactly that: they want to disable it in some
> cases (not on the mailing list), so non-upstream hacks exist, which are
> not applicable to this discussion.

Bah, I'm aware of that; that request is a hack solution to something else
as well.

> The idea is to stack the rdma bond device on top of the existing rdma
> devices as an ib_client, so that the rdma bond device is exactly aware
> of what is going on with the slaves and the bond netdev.

How do you safely proxy every single op from the bond to the slaves?

How do you force the slaves to allow PDs to be shared across them?

What provider lives in user space for the bond driver? What happens to the
udata, etc.?

And it doesn't solve the main problem you raised: creating an IB device
while holding RTNL simply should not ever be done. Moving this code into
the core layer fixed things up significantly for the similar rxe/siw
cases; I expect the same is possible for the LAG situation too.

Jason
* Re: Is anyone working on implementing LAG in ib core?
  From: Parav Pandit
  Date: 2020-02-24 19:41 UTC
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On 2/24/2020 1:23 PM, Jason Gunthorpe wrote:
> And what does that mean? The real netdevs have no IP addresses, so what
> exactly does a non-bonded RoCEv2 RDMA device do?

Not sure. There is some default GID on it.

> How do you safely proxy every single op from the bond to the slaves?

The bond configuration should say which slave to use, instead of the
current implicit choice.

> How do you force the slaves to allow PDs to be shared across them?

For the slave it doesn't matter whether the caller is the bond or a direct
user.

> What provider lives in user space for the bond driver? What happens to
> the udata, etc.?

The same one as for the primary slave used for pci and irq access, whose
info is provided through the new netlink discovery path.

> And it doesn't solve the main problem you raised: creating an IB device
> while holding RTNL simply should not ever be done.

The "rdma dev add" command won't hold the rtnl lock when the new rdma bond
device is added through the rdma netlink socket.
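To make the proxying question above concrete, here is a hedged sketch of
what forwarding verbs from a bond ib_device to the chosen slave could look
like; the rbond_* names are invented, and the sketch deliberately stops at
the hard parts raised in this exchange (slave-specific udata layout,
objects owned by one slave's device, and which userspace provider would
bind to rbond0).

#include <rdma/ib_verbs.h>

struct rbond_dev {
        struct ib_device ibdev;     /* rbond0 */
        struct ib_device *active;   /* slave selected by the bond config */
};

/* Read-only verbs are easy to forward to the chosen slave... */
static int rbond_query_port(struct ib_device *ibdev, u8 port,
                            struct ib_port_attr *attr)
{
        struct rbond_dev *bdev = container_of(ibdev, struct rbond_dev, ibdev);

        return ib_query_port(bdev->active, port, attr);
}

/*
 * ...but object-creating verbs are not: a PD, QP or MR created through
 * the bond really lives on bdev->active, its driver-specific udata format
 * is the slave's, and if the active slave changes the object does not
 * move with it.  Those are exactly the open questions in this thread.
 */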
* Re: Is anyone working on implementing LAG in ib core?
  From: Leon Romanovsky
  Date: 2020-02-25  7:58 UTC
  To: Jason Gunthorpe
  Cc: Parav Pandit, Weihang Li, dledford@redhat.com, linux-rdma@vger.kernel.org, linuxarm@huawei.com, Shiraz Saleem

On Mon, Feb 24, 2020 at 02:29:02PM -0400, Jason Gunthorpe wrote:
> Isn't that basically what we do now in mlx5?

Yes, logically the same; the implementation, of course, should be
different, with the IB core more involved than it is now.

> Logically the ib_device is attached to the bond; it uses the bond for IP
> addressing, etc.

Yes.

Thanks
* Re: Is anyone working on implementing LAG in ib core?
  From: liweihang
  Date: 2020-03-02  7:22 UTC
  To: Parav Pandit, Jason Gunthorpe
  Cc: dledford@redhat.com, leon@kernel.org, linux-rdma@vger.kernel.org, Linuxarm, Shiraz Saleem

On 2020/2/23 8:44, Parav Pandit wrote:
> I am not sure mlx5 is the right model for the new rdma bond device
> support, which I tried to highlight in Q&A 1 below.
>
> I have the following not-so-refined proposal for an rdma bond device.
>
> $ rdma dev add type bond name rbond0 netdev bond1 slave mlx5_0 slave mlx5_1 dmadevice mlx5_0
>
> - This should be done via an rdma bond driver in
>   drivers/infiniband/ulp/rdma_bond.

Hi Parav,

Thanks for your discussion with Jason and Leon; it raised a lot of things
that I had never considered. And sorry for the late response, it took me
some time to understand the current LAG driver of mlx5. It seems that an
mlx5_lag structure combines all the slave net devices and ib devices, and
that it has no direct link to the bonding net device.

It seems you have been thinking a lot about how LAG can be achieved in the
core. Do you have plans to implement it, or have you already finished some
part of it? I'm curious about that because we want to use LAG in HIPXX as
I said, and we can help to test and improve the existing design if some
work is already done.

Thank you,
Weihang