* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver [not found] <20200915171022.10561-1-oded.gabbay@gmail.com> @ 2020-09-15 20:35 ` Jakub Kicinski 2020-09-15 20:46 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jakub Kicinski @ 2020-09-15 20:35 UTC (permalink / raw) To: Oded Gabbay Cc: linux-kernel, netdev, SW_Drivers, gregkh, davem, andrew, f.fainelli, linux-rdma On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote: > Hello, > > This is the second version of the patch-set to upstream the GAUDI NIC code > into the habanalabs driver. > > The only modification from v2 is in the ethtool patch (patch 12). Details > are in that patch's commit message. You keep reposting this, yet this SDK shim^W^W driver is still living in drivers/misc. If you want to make it look like a NIC, the code belongs where NIC drivers are. Then again, is it a NIC? Why do you have those custom IOCTLs? That's far from normal. Please make sure to CC linux-rdma. You clearly stated that the device does RDMA-like transfers. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 20:35 ` [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver Jakub Kicinski @ 2020-09-15 20:46 ` Oded Gabbay 2020-09-15 21:04 ` Jakub Kicinski 2020-09-17 17:18 ` Jason Gunthorpe 0 siblings, 2 replies; 56+ messages in thread From: Oded Gabbay @ 2020-09-15 20:46 UTC (permalink / raw) To: Jakub Kicinski Cc: Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Tue, Sep 15, 2020 at 11:35 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote: > > Hello, > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > into the habanalabs driver. > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > are in that patch's commit message. > > You keep reposting this, yet this SDK shim^W^W driver is still living in > drivers/misc. If you want to make it look like a NIC, the code belongs > where NIC drivers are. > > Then again, is it a NIC? Why do you have those custom IOCTLs? That's far > from normal. Hi Jakub, I'm sorry but from your question it seems as if you didn't read my cover letter at all, as I took great lengths in explaining exactly what our device is and why we use custom IOCTLs. TL;DR We have an accelerator for deep learning (GAUDI) which uses RDMA as infrastructure for communication between multiple accelerators. Same as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. The RDMA implementation we did does NOT support some basic RDMA IBverbs (such as MR and PD) and therefore, we can't use the rdma-core library or to connect to the rdma infrastructure in the kernel. We wanted to do it but when we analyzed it, we saw we wouldn't be able to support basic stuff and therefore we had to revert to our IOCTLs. To sum it up, because our NIC is used for intra-communication, we don't expose nor intend users to use it as a NIC per-se. However, to be able to get statistics and manage them in a standard way, and support control plane over Ethernet, we do register each port to the net subsystem (i.e. create netdev per port). I hope this short summary explains this better. As per your request that this code lives in the net subsystem, I think that will make it only more complicated and hard to upstream and maintain. I see there are other examples (e.g. sgi-xp) that contain networking driver code in misc so I don't understand this objection. > > Please make sure to CC linux-rdma. You clearly stated that the device > does RDMA-like transfers. We don't use the RDMA infrastructure in the kernel and we can't connect to it due to the lack of H/W support we have so I don't see why we need to CC linux-rdma. Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 20:46 ` Oded Gabbay @ 2020-09-15 21:04 ` Jakub Kicinski 2020-09-15 21:20 ` Oded Gabbay 2020-09-17 17:18 ` Jason Gunthorpe 1 sibling, 1 reply; 56+ messages in thread From: Jakub Kicinski @ 2020-09-15 21:04 UTC (permalink / raw) To: Oded Gabbay Cc: Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Tue, 15 Sep 2020 23:46:58 +0300 Oded Gabbay wrote: > On Tue, Sep 15, 2020 at 11:35 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote: > > > Hello, > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > into the habanalabs driver. > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > are in that patch's commit message. > > > > You keep reposting this, yet this SDK shim^W^W driver is still living in > > drivers/misc. If you want to make it look like a NIC, the code belongs > > where NIC drivers are. > > > > Then again, is it a NIC? Why do you have those custom IOCTLs? That's far > > from normal. > > I'm sorry but from your question it seems as if you didn't read my > cover letter at all, as I took great lengths in explaining exactly > what our device is and why we use custom IOCTLs. > TL;DR > We have an accelerator for deep learning (GAUDI) which uses RDMA as > infrastructure for communication between multiple accelerators. Same > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > The RDMA implementation we did does NOT support some basic RDMA > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > library or to connect to the rdma infrastructure in the kernel. We > wanted to do it but when we analyzed it, we saw we wouldn't be able to > support basic stuff and therefore we had to revert to our IOCTLs. > To sum it up, because our NIC is used for intra-communication, we > don't expose nor intend users to use it as a NIC per-se. However, to > be able to get statistics and manage them in a standard way, and > support control plane over Ethernet, we do register each port to the > net subsystem (i.e. create netdev per port). > > I hope this short summary explains this better. I read your cover letter. Networking drivers don't get to define random IOCTLs as they please. You have to take that part out of the "NIC" driver. > As per your request that this code lives in the net subsystem, I think > that will make it only more complicated and hard to upstream and > maintain. > I see there are other examples (e.g. sgi-xp) that contain networking > driver code in misc so I don't understand this objection. The maintenance structure and CI systems for the kernel depend on the directory layout. If you don't understand that I don't know how to help you. > > Please make sure to CC linux-rdma. You clearly stated that the device > > does RDMA-like transfers. > > We don't use the RDMA infrastructure in the kernel and we can't > connect to it due to the lack of H/W support we have so I don't see > why we need to CC linux-rdma. You have it backward. You don't get to pick and choose which parts of the infrastructure you use, and therefore who reviews your drivers. The device uses RDMA under the hood so Linux RDMA experts must very much be okay with it getting merged. That's how we ensure Linux interfaces are consistent and good quality. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:04 ` Jakub Kicinski @ 2020-09-15 21:20 ` Oded Gabbay 2020-09-15 21:37 ` Andrew Lunn 2020-09-15 22:34 ` David Miller 0 siblings, 2 replies; 56+ messages in thread From: Oded Gabbay @ 2020-09-15 21:20 UTC (permalink / raw) To: Jakub Kicinski Cc: Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Wed, Sep 16, 2020 at 12:04 AM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 15 Sep 2020 23:46:58 +0300 Oded Gabbay wrote: > > On Tue, Sep 15, 2020 at 11:35 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > On Tue, 15 Sep 2020 20:10:08 +0300 Oded Gabbay wrote: > > > > Hello, > > > > > > > > This is the second version of the patch-set to upstream the GAUDI NIC code > > > > into the habanalabs driver. > > > > > > > > The only modification from v2 is in the ethtool patch (patch 12). Details > > > > are in that patch's commit message. > > > > > > You keep reposting this, yet this SDK shim^W^W driver is still living in > > > drivers/misc. If you want to make it look like a NIC, the code belongs > > > where NIC drivers are. > > > > > > Then again, is it a NIC? Why do you have those custom IOCTLs? That's far > > > from normal. > > > > I'm sorry but from your question it seems as if you didn't read my > > cover letter at all, as I took great lengths in explaining exactly > > what our device is and why we use custom IOCTLs. > > TL;DR > > We have an accelerator for deep learning (GAUDI) which uses RDMA as > > infrastructure for communication between multiple accelerators. Same > > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > The RDMA implementation we did does NOT support some basic RDMA > > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > library or to connect to the rdma infrastructure in the kernel. We > > wanted to do it but when we analyzed it, we saw we wouldn't be able to > > support basic stuff and therefore we had to revert to our IOCTLs. > > To sum it up, because our NIC is used for intra-communication, we > > don't expose nor intend users to use it as a NIC per-se. However, to > > be able to get statistics and manage them in a standard way, and > > support control plane over Ethernet, we do register each port to the > > net subsystem (i.e. create netdev per port). > > > > I hope this short summary explains this better. > > I read your cover letter. Networking drivers don't get to define random > IOCTLs as they please. You have to take that part out of the "NIC" > driver. The IOCTLs are not for the Ethernet part. They are strictly for the RDMA operations. RDMA drivers also have IOCTLs as interfaces in the drivers/infiniband area, so I don't think I'm doing something different here. And my driver is not networking. It is an accelerator which has some network ports. btw, this is only a single new IOCTL call. The rest of the IOCTLs are already upstreamed and are for the rest of the ASIC's compute functionality. What I'm trying to say is that it's very common to define IOCTLs for accelerators. > > > As per your request that this code lives in the net subsystem, I think > > that will make it only more complicated and hard to upstream and > > maintain. > > I see there are other examples (e.g. sgi-xp) that contain networking > > driver code in misc so I don't understand this objection. > > The maintenance structure and CI systems for the kernel depend on the > directory layout. If you don't understand that I don't know how to help > you. I completely understand but you didn't answer my question. How come there are drivers which create netdev objects, and specifically sgi-xp in misc (but I also saw it in usb drivers) that live outside drivers/net ? Why doesn't your request apply to them as well ? When we wrote the code, we saw those examples and therefore assumed it was fine. > > > > Please make sure to CC linux-rdma. You clearly stated that the device > > > does RDMA-like transfers. > > > > We don't use the RDMA infrastructure in the kernel and we can't > > connect to it due to the lack of H/W support we have so I don't see > > why we need to CC linux-rdma. > > You have it backward. You don't get to pick and choose which parts of > the infrastructure you use, and therefore who reviews your drivers. > The device uses RDMA under the hood so Linux RDMA experts must very > much be okay with it getting merged. That's how we ensure Linux > interfaces are consistent and good quality. I understand your point of view but If my H/W doesn't support the basic requirements of the RDMA infrastructure and interfaces, then really there is nothing I can do about it. I can't use them. I wish I was able to use that infrastructure but I can't. That's why we wrote the IOCTLs in our accelerator driver. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:20 ` Oded Gabbay @ 2020-09-15 21:37 ` Andrew Lunn 2020-09-15 21:43 ` Oded Gabbay 2020-09-15 22:36 ` David Miller 2020-09-15 22:34 ` David Miller 1 sibling, 2 replies; 56+ messages in thread From: Andrew Lunn @ 2020-09-15 21:37 UTC (permalink / raw) To: Oded Gabbay Cc: Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Florian Fainelli, linux-rdma > I completely understand but you didn't answer my question. How come > there are drivers which create netdev objects, and specifically sgi-xp > in misc (but I also saw it in usb drivers) that live outside > drivers/net ? Why doesn't your request apply to them as well ? > When we wrote the code, we saw those examples and therefore assumed it was fine. commit 45d9ca492e4bd1522d1b5bd125c2908f1cee3d4a Author: Dean Nelson <dcn@sgi.com> Date: Tue Apr 22 14:46:56 2008 -0500 [IA64] move XP and XPC to drivers/misc/sgi-xp Move XPC and XPNET from arch/ia64/sn/kernel to drivers/misc/sgi-xp. Signed-off-by: Dean Nelson <dcn@sgi.com> Signed-off-by: Tony Luck <tony.luck@intel.com> It has been there a long time, and no networking person was involved in its move. drivers/usb/gadget/function/f_ncm.c commit 00a2430ff07d4e0e0e7e24e02fd8adede333b797 Author: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Date: Tue Jul 15 13:09:46 2014 +0200 usb: gadget: Gadget directory cleanup - group usb functions The drivers/usb/gadget directory contains many files. Files which are related can be distributed into separate directories. This patch moves the USB functions implementations into a separate directory. Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Signed-off-by: Felipe Balbi <balbi@ti.com> Again, old. Can you find an example of a network driver added in the last couple of years outside of drivers/met? > > > > Please make sure to CC linux-rdma. You clearly stated that the device > > > > does RDMA-like transfers. > > > > > > We don't use the RDMA infrastructure in the kernel and we can't > > > connect to it due to the lack of H/W support we have so I don't see > > > why we need to CC linux-rdma. > > > > You have it backward. You don't get to pick and choose which parts of > > the infrastructure you use, and therefore who reviews your drivers. > > The device uses RDMA under the hood so Linux RDMA experts must very > > much be okay with it getting merged. That's how we ensure Linux > > interfaces are consistent and good quality. > > I understand your point of view but If my H/W doesn't support the > basic requirements of the RDMA infrastructure and interfaces, then > really there is nothing I can do about it. I can't use them. It is up to the RDMA people to say that. They might see how the RDMA core can be made to work for your hardware. Andrew ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:37 ` Andrew Lunn @ 2020-09-15 21:43 ` Oded Gabbay 2020-09-15 22:35 ` David Miller 2020-09-15 22:36 ` David Miller 1 sibling, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-15 21:43 UTC (permalink / raw) To: Andrew Lunn Cc: Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Florian Fainelli, linux-rdma On Wed, Sep 16, 2020 at 12:37 AM Andrew Lunn <andrew@lunn.ch> wrote: > > > I completely understand but you didn't answer my question. How come > > there are drivers which create netdev objects, and specifically sgi-xp > > in misc (but I also saw it in usb drivers) that live outside > > drivers/net ? Why doesn't your request apply to them as well ? > > When we wrote the code, we saw those examples and therefore assumed it was fine. > > commit 45d9ca492e4bd1522d1b5bd125c2908f1cee3d4a > Author: Dean Nelson <dcn@sgi.com> > Date: Tue Apr 22 14:46:56 2008 -0500 > > [IA64] move XP and XPC to drivers/misc/sgi-xp > > Move XPC and XPNET from arch/ia64/sn/kernel to drivers/misc/sgi-xp. > > Signed-off-by: Dean Nelson <dcn@sgi.com> > Signed-off-by: Tony Luck <tony.luck@intel.com> > > It has been there a long time, and no networking person was involved > in its move. > > drivers/usb/gadget/function/f_ncm.c > commit 00a2430ff07d4e0e0e7e24e02fd8adede333b797 > Author: Andrzej Pietrasiewicz <andrzej.p@samsung.com> > Date: Tue Jul 15 13:09:46 2014 +0200 > > usb: gadget: Gadget directory cleanup - group usb functions > > The drivers/usb/gadget directory contains many files. > Files which are related can be distributed into separate directories. > This patch moves the USB functions implementations into a separate directory. > > Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@samsung.com> > Signed-off-by: Felipe Balbi <balbi@ti.com> > > Again, old. > > Can you find an example of a network driver added in the last couple > of years outside of drivers/met? I honestly don't know and I admit we didn't look at the dates of when these drivers were introduced. Oded > > > > > > Please make sure to CC linux-rdma. You clearly stated that the device > > > > > does RDMA-like transfers. > > > > > > > > We don't use the RDMA infrastructure in the kernel and we can't > > > > connect to it due to the lack of H/W support we have so I don't see > > > > why we need to CC linux-rdma. > > > > > > You have it backward. You don't get to pick and choose which parts of > > > the infrastructure you use, and therefore who reviews your drivers. > > > The device uses RDMA under the hood so Linux RDMA experts must very > > > much be okay with it getting merged. That's how we ensure Linux > > > interfaces are consistent and good quality. > > > > I understand your point of view but If my H/W doesn't support the > > basic requirements of the RDMA infrastructure and interfaces, then > > really there is nothing I can do about it. I can't use them. > > It is up to the RDMA people to say that. They might see how the RDMA > core can be made to work for your hardware. > > Andrew ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:43 ` Oded Gabbay @ 2020-09-15 22:35 ` David Miller 0 siblings, 0 replies; 56+ messages in thread From: David Miller @ 2020-09-15 22:35 UTC (permalink / raw) To: oded.gabbay Cc: andrew, kuba, linux-kernel, netdev, SW_Drivers, gregkh, f.fainelli, linux-rdma From: Oded Gabbay <oded.gabbay@gmail.com> Date: Wed, 16 Sep 2020 00:43:00 +0300 > I honestly don't know and I admit we didn't look at the dates of when > these drivers were introduced. Please do research when you make claims in the future, thank you. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:37 ` Andrew Lunn 2020-09-15 21:43 ` Oded Gabbay @ 2020-09-15 22:36 ` David Miller 1 sibling, 0 replies; 56+ messages in thread From: David Miller @ 2020-09-15 22:36 UTC (permalink / raw) To: andrew Cc: oded.gabbay, kuba, linux-kernel, netdev, SW_Drivers, gregkh, f.fainelli, linux-rdma From: Andrew Lunn <andrew@lunn.ch> Date: Tue, 15 Sep 2020 23:37:35 +0200 >> I understand your point of view but If my H/W doesn't support the >> basic requirements of the RDMA infrastructure and interfaces, then >> really there is nothing I can do about it. I can't use them. > > It is up to the RDMA people to say that. They might see how the RDMA > core can be made to work for your hardware. +1 ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 21:20 ` Oded Gabbay 2020-09-15 21:37 ` Andrew Lunn @ 2020-09-15 22:34 ` David Miller 2020-09-16 4:26 ` Oded Gabbay 1 sibling, 1 reply; 56+ messages in thread From: David Miller @ 2020-09-15 22:34 UTC (permalink / raw) To: oded.gabbay Cc: kuba, linux-kernel, netdev, SW_Drivers, gregkh, andrew, f.fainelli, linux-rdma From: Oded Gabbay <oded.gabbay@gmail.com> Date: Wed, 16 Sep 2020 00:20:12 +0300 > I completely understand but you didn't answer my question. How come > there are drivers which create netdev objects, and specifically sgi-xp > in misc (but I also saw it in usb drivers) that live outside > drivers/net ? Why doesn't your request apply to them as well ? Don't use examples of drivers doing the wrong thing as an excuse for you to repeat the mistake. Ok? That kind of argument doesn't work here. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 22:34 ` David Miller @ 2020-09-16 4:26 ` Oded Gabbay 0 siblings, 0 replies; 56+ messages in thread From: Oded Gabbay @ 2020-09-16 4:26 UTC (permalink / raw) To: David Miller Cc: Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, Andrew Lunn, Florian Fainelli, linux-rdma On Wed, Sep 16, 2020 at 1:34 AM David Miller <davem@davemloft.net> wrote: > > From: Oded Gabbay <oded.gabbay@gmail.com> > Date: Wed, 16 Sep 2020 00:20:12 +0300 > > > I completely understand but you didn't answer my question. How come > > there are drivers which create netdev objects, and specifically sgi-xp > > in misc (but I also saw it in usb drivers) that live outside > > drivers/net ? Why doesn't your request apply to them as well ? > > Don't use examples of drivers doing the wrong thing as an excuse for > you to repeat the mistake. > > Ok? Well, it's not like there is a big red warning near those drivers saying "this is wrong"... How could I have known that in advance ? > > That kind of argument doesn't work here. I know that, I just didn't know those drivers did "the wrong thing" Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-15 20:46 ` Oded Gabbay 2020-09-15 21:04 ` Jakub Kicinski @ 2020-09-17 17:18 ` Jason Gunthorpe 2020-09-18 11:36 ` Gal Pressman 1 sibling, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-17 17:18 UTC (permalink / raw) To: Oded Gabbay Cc: Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > infrastructure for communication between multiple accelerators. Same > as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > The RDMA implementation we did does NOT support some basic RDMA > IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > library or to connect to the rdma infrastructure in the kernel. You can't create a parallel RDMA subsystem in netdev, or in misc, and you can't add random device offloads as IOCTL to nedevs. RDMA is the proper home for all the networking offloads that don't fit into netdev. EFA was able to fit into rdma-core/etc and it isn't even RoCE at all. I'm sure this can too. > wanted to do it but when we analyzed it, we saw we wouldn't be able to > support basic stuff and therefore we had to revert to our IOCTLs. Try again. Ask for help. Your patches add CQs, WQ, and other RDMA objects. This is very clearly not an appropriate functionality for netdev. > To sum it up, because our NIC is used for intra-communication, we > don't expose nor intend users to use it as a NIC per-se. However, to > be able to get statistics and manage them in a standard way, and > support control plane over Ethernet, we do register each port to the > net subsystem (i.e. create netdev per port). Sure, the basic ethernet side is conceptually fine. > > Please make sure to CC linux-rdma. You clearly stated that the device > > does RDMA-like transfers. > > We don't use the RDMA infrastructure in the kernel and we can't > connect to it due to the lack of H/W support we have so I don't see > why we need to CC linux-rdma. Because you can't put RDMA like concepts under net. Jakub, NAK from me on this series. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-17 17:18 ` Jason Gunthorpe @ 2020-09-18 11:36 ` Gal Pressman 2020-09-18 11:52 ` Leon Romanovsky ` (2 more replies) 0 siblings, 3 replies; 56+ messages in thread From: Gal Pressman @ 2020-09-18 11:36 UTC (permalink / raw) To: Jason Gunthorpe, Oded Gabbay Cc: Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On 17/09/2020 20:18, Jason Gunthorpe wrote: > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: >> infrastructure for communication between multiple accelerators. Same >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. >> The RDMA implementation we did does NOT support some basic RDMA >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core >> library or to connect to the rdma infrastructure in the kernel. > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > you can't add random device offloads as IOCTL to nedevs. > > RDMA is the proper home for all the networking offloads that don't fit > into netdev. > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > all. I'm sure this can too. Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it was suggested to go through the vfio subsystem instead. I think this comes back to the discussion we had when EFA was upstreamed, which is what's the bar to get accepted to the RDMA subsystem. IIRC, what we eventually agreed on is having a userspace rdma-core provider and ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). Does GAUDI fit these requirements? If not, should it be in a different subsystem or should we open the "what qualifies as an RDMA device" question again? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:36 ` Gal Pressman @ 2020-09-18 11:52 ` Leon Romanovsky 2020-09-18 11:56 ` Oded Gabbay 2020-09-18 11:56 ` Jason Gunthorpe 2020-09-18 12:10 ` Oded Gabbay 2 siblings, 1 reply; 56+ messages in thread From: Leon Romanovsky @ 2020-09-18 11:52 UTC (permalink / raw) To: Gal Pressman Cc: Jason Gunthorpe, Oded Gabbay, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > >> infrastructure for communication between multiple accelerators. Same > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > >> The RDMA implementation we did does NOT support some basic RDMA > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > >> library or to connect to the rdma infrastructure in the kernel. > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > you can't add random device offloads as IOCTL to nedevs. > > > > RDMA is the proper home for all the networking offloads that don't fit > > into netdev. > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > all. I'm sure this can too. > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > was suggested to go through the vfio subsystem instead. > > I think this comes back to the discussion we had when EFA was upstreamed, which > is what's the bar to get accepted to the RDMA subsystem. > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > or should we open the "what qualifies as an RDMA device" question again? I want to remind you that rdma-core requirement came to make sure that anything exposed from the RDMA to the userspace is strict with proper UAPI header hygiene. I doubt that Havana's ioctls are backed by anything like this. Thanks ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:52 ` Leon Romanovsky @ 2020-09-18 11:56 ` Oded Gabbay 2020-09-18 12:03 ` Leon Romanovsky 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 11:56 UTC (permalink / raw) To: Leon Romanovsky Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > or should we open the "what qualifies as an RDMA device" question again? > > I want to remind you that rdma-core requirement came to make sure that > anything exposed from the RDMA to the userspace is strict with proper > UAPI header hygiene. > > I doubt that Havana's ioctls are backed by anything like this. > > Thanks Why do you doubt that ? Have you looked at our code ? Our uapi and IOCTLs interface is based on drm subsystem uapi interface and it is very safe and protected. Otherwise Greg would have never allowed me to go upstream in the first place. We have a single function which is the entry point for all the IOCTLs of our drivers (only one IOCTL is RDMA related, all the others are compute related). That function is almost 1:1 copy of the function in drm. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:56 ` Oded Gabbay @ 2020-09-18 12:03 ` Leon Romanovsky 2020-09-18 12:07 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Leon Romanovsky @ 2020-09-18 12:03 UTC (permalink / raw) To: Oded Gabbay Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > >> infrastructure for communication between multiple accelerators. Same > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > into netdev. > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > all. I'm sure this can too. > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > was suggested to go through the vfio subsystem instead. > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > is what's the bar to get accepted to the RDMA subsystem. > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > or should we open the "what qualifies as an RDMA device" question again? > > > > I want to remind you that rdma-core requirement came to make sure that > > anything exposed from the RDMA to the userspace is strict with proper > > UAPI header hygiene. > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > Thanks > > Why do you doubt that ? Have you looked at our code ? > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > and it is very safe and protected. Yes, I looked and didn't find open-source users of your UAPI headers. It is not related to being safe or protected by to the common request to present userspace that relies on those exported interfaces. > Otherwise Greg would have never allowed me to go upstream in the first place. Nice, can we get a link? > > We have a single function which is the entry point for all the IOCTLs > of our drivers (only one IOCTL is RDMA related, all the others are > compute related). > That function is almost 1:1 copy of the function in drm. DRM has same rules as RDMA, no kernel code will be merged without seeing open-source userspace. Thanks > > Thanks, > Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:03 ` Leon Romanovsky @ 2020-09-18 12:07 ` Oded Gabbay 2020-09-18 12:19 ` Leon Romanovsky 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 12:07 UTC (permalink / raw) To: Leon Romanovsky Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > >> infrastructure for communication between multiple accelerators. Same > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > into netdev. > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > all. I'm sure this can too. > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > is what's the bar to get accepted to the RDMA subsystem. > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > anything exposed from the RDMA to the userspace is strict with proper > > > UAPI header hygiene. > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > Thanks > > > > Why do you doubt that ? Have you looked at our code ? > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > and it is very safe and protected. > > Yes, I looked and didn't find open-source users of your UAPI headers. > It is not related to being safe or protected by to the common request > to present userspace that relies on those exported interfaces. > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > Nice, can we get a link? > > > > > We have a single function which is the entry point for all the IOCTLs > > of our drivers (only one IOCTL is RDMA related, all the others are > > compute related). > > That function is almost 1:1 copy of the function in drm. > > DRM has same rules as RDMA, no kernel code will be merged without seeing > open-source userspace. > > Thanks > > > > > Thanks, > > Oded So we do have an open-source library called hl-thunk, which uses our driver and indeed that was part of the requirement. It is similar to libdrm. Here is the link: https://github.com/HabanaAI/hl-thunk That library also comes with a comprehensive suite of tests which shows how to use the accelerator and we have many NIC tests which show how to use the NIC. All the rest of the user-space code in Habana is going through that library. Currently, you won't find the NIC code there because we didn't upstream it as the driver code wasn't ready, but I'll push it there in a private branch if you want to take a look. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:07 ` Oded Gabbay @ 2020-09-18 12:19 ` Leon Romanovsky 2020-09-18 12:31 ` Oded Gabbay 2020-09-19 6:40 ` Greg Kroah-Hartman 0 siblings, 2 replies; 56+ messages in thread From: Leon Romanovsky @ 2020-09-18 12:19 UTC (permalink / raw) To: Oded Gabbay Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > > >> infrastructure for communication between multiple accelerators. Same > > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > > into netdev. > > > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > > all. I'm sure this can too. > > > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > > is what's the bar to get accepted to the RDMA subsystem. > > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > > anything exposed from the RDMA to the userspace is strict with proper > > > > UAPI header hygiene. > > > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > > > Thanks > > > > > > Why do you doubt that ? Have you looked at our code ? > > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > > and it is very safe and protected. > > > > Yes, I looked and didn't find open-source users of your UAPI headers. > > It is not related to being safe or protected by to the common request > > to present userspace that relies on those exported interfaces. > > > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > > > Nice, can we get a link? > > > > > > > > We have a single function which is the entry point for all the IOCTLs > > > of our drivers (only one IOCTL is RDMA related, all the others are > > > compute related). > > > That function is almost 1:1 copy of the function in drm. > > > > DRM has same rules as RDMA, no kernel code will be merged without seeing > > open-source userspace. > > > > Thanks > > > > > > > > Thanks, > > > Oded > > So we do have an open-source library called hl-thunk, which uses our > driver and indeed that was part of the requirement. > It is similar to libdrm. > Here is the link: > https://github.com/HabanaAI/hl-thunk Are you kidding? This is mirror of some internal repository that looks like dumpster with ChangeId, internal bug tracker numbers, not part of major OS distributions. It is not open-source library and shows very clear why you chose to upstream your driver through driver/misc/ tree. Thanks ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:19 ` Leon Romanovsky @ 2020-09-18 12:31 ` Oded Gabbay 2020-09-18 13:09 ` Leon Romanovsky 2020-09-19 6:40 ` Greg Kroah-Hartman 1 sibling, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 12:31 UTC (permalink / raw) To: Leon Romanovsky Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, izur, Olof Johansson On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote: > > On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > > > >> infrastructure for communication between multiple accelerators. Same > > > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > > > into netdev. > > > > > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > > > all. I'm sure this can too. > > > > > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > > > is what's the bar to get accepted to the RDMA subsystem. > > > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > > > anything exposed from the RDMA to the userspace is strict with proper > > > > > UAPI header hygiene. > > > > > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > > > > > Thanks > > > > > > > > Why do you doubt that ? Have you looked at our code ? > > > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > > > and it is very safe and protected. > > > > > > Yes, I looked and didn't find open-source users of your UAPI headers. > > > It is not related to being safe or protected by to the common request > > > to present userspace that relies on those exported interfaces. > > > > > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > > > > > Nice, can we get a link? > > > > > > > > > > > We have a single function which is the entry point for all the IOCTLs > > > > of our drivers (only one IOCTL is RDMA related, all the others are > > > > compute related). > > > > That function is almost 1:1 copy of the function in drm. > > > > > > DRM has same rules as RDMA, no kernel code will be merged without seeing > > > open-source userspace. > > > > > > Thanks > > > > > > > > > > > Thanks, > > > > Oded > > > > So we do have an open-source library called hl-thunk, which uses our > > driver and indeed that was part of the requirement. > > It is similar to libdrm. > > Here is the link: > > https://github.com/HabanaAI/hl-thunk > > Are you kidding? > > This is mirror of some internal repository that looks like dumpster > with ChangeId, internal bug tracker numbers, not part of major OS > distributions. > > It is not open-source library and shows very clear why you chose > to upstream your driver through driver/misc/ tree. > > Thanks Adding Olof here. No, usually not. But are you kidding ? What did you exactly expect to find ? Is there an open-source project somewhere that encapsulates Deep-learning accelerators which I could connect to ? AFAIK, the only thing remotely relevant is CUDA and that is closed-source (strange to hear lectures about open-source from NVIDIA people here...) So we are trying to give to the community such an open source library, or at least an example. Hopefully one day, when more companies upstream their drivers for deep-learning accelerators we could do something like libdrm or rdma-core, but for now, it's just our driver. I have been in this community since 2013 with AMD and then RedHat, and I come with good intentions and a desire to open source and upstream as much as I can. I don't think I deserve this kind of response. The bottom line is that we had this discussion with Greg and Olof and DRM people almost 2 years ago and if there was some open-source project in user-space or some subsystem in the kernel we could connect to, we would have done that instead of what we did, but the fact of the matter there isn't such thing. Olof tried and is trying to create a h/w accelerator subsystem but it still hasn't got up from the ground yet. Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:31 ` Oded Gabbay @ 2020-09-18 13:09 ` Leon Romanovsky 0 siblings, 0 replies; 56+ messages in thread From: Leon Romanovsky @ 2020-09-18 13:09 UTC (permalink / raw) To: Oded Gabbay Cc: Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, izur, Olof Johansson On Fri, Sep 18, 2020 at 03:31:51PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 3:19 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > On Fri, Sep 18, 2020 at 03:07:19PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 3:03 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 02:56:09PM +0300, Oded Gabbay wrote: > > > > > On Fri, Sep 18, 2020 at 2:52 PM Leon Romanovsky <leon@kernel.org> wrote: > > > > > > > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > > > > >> infrastructure for communication between multiple accelerators. Same > > > > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > > > > into netdev. > > > > > > > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > > > > all. I'm sure this can too. > > > > > > > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > > > > is what's the bar to get accepted to the RDMA subsystem. > > > > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > > > > > > > > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > > > > > > > or should we open the "what qualifies as an RDMA device" question again? > > > > > > > > > > > > I want to remind you that rdma-core requirement came to make sure that > > > > > > anything exposed from the RDMA to the userspace is strict with proper > > > > > > UAPI header hygiene. > > > > > > > > > > > > I doubt that Havana's ioctls are backed by anything like this. > > > > > > > > > > > > Thanks > > > > > > > > > > Why do you doubt that ? Have you looked at our code ? > > > > > Our uapi and IOCTLs interface is based on drm subsystem uapi interface > > > > > and it is very safe and protected. > > > > > > > > Yes, I looked and didn't find open-source users of your UAPI headers. > > > > It is not related to being safe or protected by to the common request > > > > to present userspace that relies on those exported interfaces. > > > > > > > > > Otherwise Greg would have never allowed me to go upstream in the first place. > > > > > > > > Nice, can we get a link? > > > > > > > > > > > > > > We have a single function which is the entry point for all the IOCTLs > > > > > of our drivers (only one IOCTL is RDMA related, all the others are > > > > > compute related). > > > > > That function is almost 1:1 copy of the function in drm. > > > > > > > > DRM has same rules as RDMA, no kernel code will be merged without seeing > > > > open-source userspace. > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks, > > > > > Oded > > > > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. > > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > > > Thanks > > Adding Olof here. > > No, usually not. > But are you kidding ? > What did you exactly expect to find ? Is there an open-source project > somewhere that encapsulates Deep-learning accelerators which I could > connect to ? I would expect certain level of code quality, collaboration and review that distros require for inclusion. It is not the case for the github repo you presented. > AFAIK, the only thing remotely relevant is CUDA and that is > closed-source (strange to hear lectures about open-source from NVIDIA > people here...) Please check git log statistics to estimate Nvidia/Mellanox/Cumulus contributions to the Linux kernel and the open-source. You will be surprised. > > So we are trying to give to the community such an open source library, > or at least an example. Hopefully one day, when more companies > upstream their drivers for deep-learning accelerators we could do > something like libdrm or rdma-core, but for now, it's just our driver. AFAIR, your driver is not unique, HiSilicon tried to submit something similar years ago (warpdrive) and they are not alone. > > I have been in this community since 2013 with AMD and then RedHat, and > I come with good intentions and a desire to open source and upstream > as much as I can. I don't think I deserve this kind of response. There is no need to take it personal. It was you who posted a link to the github repo. What did you expect? > > The bottom line is that we had this discussion with Greg and Olof and > DRM people almost 2 years ago and if there was some open-source > project in user-space or some subsystem in the kernel we could connect > to, we would have done that instead of what we did, but the fact of > the matter there isn't such thing. Olof tried and is trying to create > a h/w accelerator subsystem but it still hasn't got up from the ground > yet. Maybe it is a time to do it right. > > Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:19 ` Leon Romanovsky 2020-09-18 12:31 ` Oded Gabbay @ 2020-09-19 6:40 ` Greg Kroah-Hartman 2020-09-19 8:20 ` Leon Romanovsky 1 sibling, 1 reply; 56+ messages in thread From: Greg Kroah-Hartman @ 2020-09-19 6:40 UTC (permalink / raw) To: Leon Romanovsky Cc: Oded Gabbay, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > So we do have an open-source library called hl-thunk, which uses our > > driver and indeed that was part of the requirement. > > It is similar to libdrm. > > Here is the link: > > https://github.com/HabanaAI/hl-thunk > > Are you kidding? > > This is mirror of some internal repository that looks like dumpster > with ChangeId, internal bug tracker numbers, not part of major OS > distributions. > > It is not open-source library and shows very clear why you chose > to upstream your driver through driver/misc/ tree. It is an open source library, as per the license and the code availability. What more is expected here? No distro has to pick it up, that's not a requirement for kernel code, we have many kernel helper programs that are not in distros. Heck, udev took a long time to get into distros, does that mean the kernel side of that interface should never have been merged? I don't understand your complaint here, it's not our place to judge the code quality of userspace libraries, otherwise we would never get any real-work done :) thanks, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 6:40 ` Greg Kroah-Hartman @ 2020-09-19 8:20 ` Leon Romanovsky 2020-09-19 8:30 ` Greg Kroah-Hartman 0 siblings, 1 reply; 56+ messages in thread From: Leon Romanovsky @ 2020-09-19 8:20 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Oded Gabbay, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > So we do have an open-source library called hl-thunk, which uses our > > > driver and indeed that was part of the requirement. > > > It is similar to libdrm. > > > Here is the link: > > > https://github.com/HabanaAI/hl-thunk > > > > Are you kidding? > > > > This is mirror of some internal repository that looks like dumpster > > with ChangeId, internal bug tracker numbers, not part of major OS > > distributions. > > > > It is not open-source library and shows very clear why you chose > > to upstream your driver through driver/misc/ tree. > > It is an open source library, as per the license and the code > availability. What more is expected here? So can I fork iproute2, add bunch of new custom netlink UAPIs and expect Dave to merge it after I throw it on github? > > No distro has to pick it up, that's not a requirement for kernel code, > we have many kernel helper programs that are not in distros. Heck, udev > took a long time to get into distros, does that mean the kernel side of > that interface should never have been merged? > > I don't understand your complaint here, it's not our place to judge the > code quality of userspace libraries, otherwise we would never get any > real-work done :) My main complaint is that you can't imagine merging code into large subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source citizen. It means use of existing user-space libraries/tools and/or providing new ones that will be usable for everyone. In this case, we have some custom char device with library that is not usable for anyone else and this is why drivers/misc/ is right place. While we are talking about real-work, it is our benefit to push companies to make investment into ecosystem and not letting them to find an excuse for not doing it. Thanks > > thanks, > > greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 8:20 ` Leon Romanovsky @ 2020-09-19 8:30 ` Greg Kroah-Hartman 2020-09-19 8:58 ` Leon Romanovsky 2020-09-19 16:43 ` Oded Gabbay 0 siblings, 2 replies; 56+ messages in thread From: Greg Kroah-Hartman @ 2020-09-19 8:30 UTC (permalink / raw) To: Leon Romanovsky Cc: Oded Gabbay, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > So we do have an open-source library called hl-thunk, which uses our > > > > driver and indeed that was part of the requirement. > > > > It is similar to libdrm. > > > > Here is the link: > > > > https://github.com/HabanaAI/hl-thunk > > > > > > Are you kidding? > > > > > > This is mirror of some internal repository that looks like dumpster > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > distributions. > > > > > > It is not open-source library and shows very clear why you chose > > > to upstream your driver through driver/misc/ tree. > > > > It is an open source library, as per the license and the code > > availability. What more is expected here? > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > Dave to merge it after I throw it on github? Don't be silly, that's not the case here at all and you know that. > > No distro has to pick it up, that's not a requirement for kernel code, > > we have many kernel helper programs that are not in distros. Heck, udev > > took a long time to get into distros, does that mean the kernel side of > > that interface should never have been merged? > > > > I don't understand your complaint here, it's not our place to judge the > > code quality of userspace libraries, otherwise we would never get any > > real-work done :) > > My main complaint is that you can't imagine merging code into large > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > citizen. It means use of existing user-space libraries/tools and/or > providing new ones that will be usable for everyone. Agreed. > In this case, we have some custom char device with library that is not > usable for anyone else and this is why drivers/misc/ is right place. Also agreed. > While we are talking about real-work, it is our benefit to push companies > to make investment into ecosystem and not letting them to find an excuse > for not doing it. So why are you complaining about a stand-alone driver that does not have any shared subsystems's userspace code to control that driver? Yes, when integrating into other subsystems (i.e. networking and rdma), they should use those common subsystems interfaces, no one is arguing that at all. totally lost, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 8:30 ` Greg Kroah-Hartman @ 2020-09-19 8:58 ` Leon Romanovsky 2020-09-19 16:43 ` Oded Gabbay 1 sibling, 0 replies; 56+ messages in thread From: Leon Romanovsky @ 2020-09-19 8:58 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Oded Gabbay, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 10:30:12AM +0200, Greg Kroah-Hartman wrote: > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > > So we do have an open-source library called hl-thunk, which uses our > > > > > driver and indeed that was part of the requirement. > > > > > It is similar to libdrm. > > > > > Here is the link: > > > > > https://github.com/HabanaAI/hl-thunk > > > > > > > > Are you kidding? > > > > > > > > This is mirror of some internal repository that looks like dumpster > > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > > distributions. > > > > > > > > It is not open-source library and shows very clear why you chose > > > > to upstream your driver through driver/misc/ tree. > > > > > > It is an open source library, as per the license and the code > > > availability. What more is expected here? > > > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > > Dave to merge it after I throw it on github? > > Don't be silly, that's not the case here at all and you know that. It was far-fetched example. > > > > No distro has to pick it up, that's not a requirement for kernel code, > > > we have many kernel helper programs that are not in distros. Heck, udev > > > took a long time to get into distros, does that mean the kernel side of > > > that interface should never have been merged? > > > > > > I don't understand your complaint here, it's not our place to judge the > > > code quality of userspace libraries, otherwise we would never get any > > > real-work done :) > > > > My main complaint is that you can't imagine merging code into large > > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > > citizen. It means use of existing user-space libraries/tools and/or > > providing new ones that will be usable for everyone. > > Agreed. > > > In this case, we have some custom char device with library that is not > > usable for anyone else and this is why drivers/misc/ is right place. > > Also agreed. > > > While we are talking about real-work, it is our benefit to push companies > > to make investment into ecosystem and not letting them to find an excuse > > for not doing it. > > So why are you complaining about a stand-alone driver that does not have > any shared subsystems's userspace code to control that driver? I didn't, everything started when I explained to Gal why RDMA subsystem requires rdma-core counterpart for any UAPI code. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m17d52d61adadf54c12bfecf1af5db40f5d829ac3 And expressed my view on the quality of the library that was presented as open-source example. https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#m9059c5a9405ba932d9ffb731195a43b27443d265 > > Yes, when integrating into other subsystems (i.e. networking and rdma), > they should use those common subsystems interfaces, no one is arguing > that at all. > > totally lost, And here comes my request to do it right https://lore.kernel.org/linux-rdma/CAFCwf12B4vCCwmfA7+VTUYUgJ9EHAtvg6F0bMYnsSCUBST+aWA@mail.gmail.com/T/#ma1fa6fe63666f630674eb668f1c00e6a672db85b All that I asked from Oded is to do UAPI/libraries right, while all the responses can be summarized to one sentence - "it is too hard, we don't want to do it." Thanks > > greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 8:30 ` Greg Kroah-Hartman 2020-09-19 8:58 ` Leon Romanovsky @ 2020-09-19 16:43 ` Oded Gabbay 2020-09-19 17:27 ` Greg Kroah-Hartman 2020-09-19 18:49 ` Andrew Lunn 1 sibling, 2 replies; 56+ messages in thread From: Oded Gabbay @ 2020-09-19 16:43 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Leon Romanovsky, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > > So we do have an open-source library called hl-thunk, which uses our > > > > > driver and indeed that was part of the requirement. > > > > > It is similar to libdrm. > > > > > Here is the link: > > > > > https://github.com/HabanaAI/hl-thunk > > > > > > > > Are you kidding? > > > > > > > > This is mirror of some internal repository that looks like dumpster > > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > > distributions. > > > > > > > > It is not open-source library and shows very clear why you chose > > > > to upstream your driver through driver/misc/ tree. > > > > > > It is an open source library, as per the license and the code > > > availability. What more is expected here? > > > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > > Dave to merge it after I throw it on github? > > Don't be silly, that's not the case here at all and you know that. > > > > No distro has to pick it up, that's not a requirement for kernel code, > > > we have many kernel helper programs that are not in distros. Heck, udev > > > took a long time to get into distros, does that mean the kernel side of > > > that interface should never have been merged? > > > > > > I don't understand your complaint here, it's not our place to judge the > > > code quality of userspace libraries, otherwise we would never get any > > > real-work done :) > > > > My main complaint is that you can't imagine merging code into large > > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > > citizen. It means use of existing user-space libraries/tools and/or > > providing new ones that will be usable for everyone. > > Agreed. > > > In this case, we have some custom char device with library that is not > > usable for anyone else and this is why drivers/misc/ is right place. > > Also agreed. > > > While we are talking about real-work, it is our benefit to push companies > > to make investment into ecosystem and not letting them to find an excuse > > for not doing it. > > So why are you complaining about a stand-alone driver that does not have > any shared subsystems's userspace code to control that driver? > > Yes, when integrating into other subsystems (i.e. networking and rdma), > they should use those common subsystems interfaces, no one is arguing > that at all. Hi Greg, It's probably heresy, but why do I need to integrate into the RDMA subsystem ? I understand your reasoning about networking (Ethernet) as the driver connects to the kernel networking stack (netdev), but with RDMA the driver doesn't use or connect to anything in that stack. If I were to support IBverbs and declare that I support it, then of course I would need to integrate to the RDMA subsystem and add my backend to rdma-core. But we don't do that so why am I being forced to support IBverbs ? Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies with a sledgehammer. I do hope that in future devices we will support it natively and of course then we will integrate as requested, but for GAUDI it is just a huge overkill IMHO. Thanks, Oded > > totally lost, > > greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 16:43 ` Oded Gabbay @ 2020-09-19 17:27 ` Greg Kroah-Hartman 2020-09-19 19:22 ` Jason Gunthorpe 2020-09-19 18:49 ` Andrew Lunn 1 sibling, 1 reply; 56+ messages in thread From: Greg Kroah-Hartman @ 2020-09-19 17:27 UTC (permalink / raw) To: Oded Gabbay Cc: Leon Romanovsky, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote: > On Sat, Sep 19, 2020 at 11:30 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Sat, Sep 19, 2020 at 11:20:03AM +0300, Leon Romanovsky wrote: > > > On Sat, Sep 19, 2020 at 08:40:20AM +0200, Greg Kroah-Hartman wrote: > > > > On Fri, Sep 18, 2020 at 03:19:05PM +0300, Leon Romanovsky wrote: > > > > > > So we do have an open-source library called hl-thunk, which uses our > > > > > > driver and indeed that was part of the requirement. > > > > > > It is similar to libdrm. > > > > > > Here is the link: > > > > > > https://github.com/HabanaAI/hl-thunk > > > > > > > > > > Are you kidding? > > > > > > > > > > This is mirror of some internal repository that looks like dumpster > > > > > with ChangeId, internal bug tracker numbers, not part of major OS > > > > > distributions. > > > > > > > > > > It is not open-source library and shows very clear why you chose > > > > > to upstream your driver through driver/misc/ tree. > > > > > > > > It is an open source library, as per the license and the code > > > > availability. What more is expected here? > > > > > > So can I fork iproute2, add bunch of new custom netlink UAPIs and expect > > > Dave to merge it after I throw it on github? > > > > Don't be silly, that's not the case here at all and you know that. > > > > > > No distro has to pick it up, that's not a requirement for kernel code, > > > > we have many kernel helper programs that are not in distros. Heck, udev > > > > took a long time to get into distros, does that mean the kernel side of > > > > that interface should never have been merged? > > > > > > > > I don't understand your complaint here, it's not our place to judge the > > > > code quality of userspace libraries, otherwise we would never get any > > > > real-work done :) > > > > > > My main complaint is that you can't imagine merging code into large > > > subsystems (netdev, RDMA, DRM? e.t.c) without being civil open-source > > > citizen. It means use of existing user-space libraries/tools and/or > > > providing new ones that will be usable for everyone. > > > > Agreed. > > > > > In this case, we have some custom char device with library that is not > > > usable for anyone else and this is why drivers/misc/ is right place. > > > > Also agreed. > > > > > While we are talking about real-work, it is our benefit to push companies > > > to make investment into ecosystem and not letting them to find an excuse > > > for not doing it. > > > > So why are you complaining about a stand-alone driver that does not have > > any shared subsystems's userspace code to control that driver? > > > > Yes, when integrating into other subsystems (i.e. networking and rdma), > > they should use those common subsystems interfaces, no one is arguing > > that at all. > Hi Greg, > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? > I understand your reasoning about networking (Ethernet) as the driver > connects to the kernel networking stack (netdev), but with RDMA the > driver doesn't use or connect to anything in that stack. If I were to > support IBverbs and declare that I support it, then of course I would > need to integrate to the RDMA subsystem and add my backend to > rdma-core. IBverbs are horrid and I would not wish them on anyone. Seriously. > But we don't do that so why am I being forced to support IBverbs ? You shouldn't. > Forcing GAUDI to use the RDMA stack and IBverbs is like swatting flies > with a sledgehammer. > I do hope that in future devices we will support it natively and of > course then we will integrate as requested, but for GAUDI it is just a > huge overkill IMHO. I think the general rdma apis are the key here, not the userspace api. Note, I do not know exactly what they are, but no, IBverbs are not ok. Ick. greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 17:27 ` Greg Kroah-Hartman @ 2020-09-19 19:22 ` Jason Gunthorpe 2020-09-20 8:47 ` Greg Kroah-Hartman 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-19 19:22 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Oded Gabbay, Leon Romanovsky, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote: > > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? > > I understand your reasoning about networking (Ethernet) as the driver > > connects to the kernel networking stack (netdev), but with RDMA the > > driver doesn't use or connect to anything in that stack. If I were to > > support IBverbs and declare that I support it, then of course I would > > need to integrate to the RDMA subsystem and add my backend to > > rdma-core. > > IBverbs are horrid and I would not wish them on anyone. Seriously. I'm curious what drives this opinion? Did you have it since you reviewed the initial submission all those years ago? > I think the general rdma apis are the key here, not the userspace api. Are you proposing that habana should have uAPI in drivers/misc and present a standard rdma-core userspace for it? This is the only userspace programming interface for RoCE HW. I think that would be much more work. If not, what open source userspace are you going to ask them to present to merge the kernel side into misc? > Note, I do not know exactly what they are, but no, IBverbs are not ok. Should we stop merging new drivers and abandon the RDMA subsystem? Is there something you'd like to see fixed? Don't really understand your position, sorry. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 19:22 ` Jason Gunthorpe @ 2020-09-20 8:47 ` Greg Kroah-Hartman 2020-09-20 19:05 ` Oded Gabbay 2020-09-21 11:52 ` Jason Gunthorpe 0 siblings, 2 replies; 56+ messages in thread From: Greg Kroah-Hartman @ 2020-09-20 8:47 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oded Gabbay, Leon Romanovsky, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote: > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote: > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? > > > I understand your reasoning about networking (Ethernet) as the driver > > > connects to the kernel networking stack (netdev), but with RDMA the > > > driver doesn't use or connect to anything in that stack. If I were to > > > support IBverbs and declare that I support it, then of course I would > > > need to integrate to the RDMA subsystem and add my backend to > > > rdma-core. > > > > IBverbs are horrid and I would not wish them on anyone. Seriously. > > I'm curious what drives this opinion? Did you have it since you > reviewed the initial submission all those years ago? As I learned more about that interface, yes, I like it less and less :) But that's the userspace api you all are stuck with, for various reasons, my opinion doesn't matter here. > > I think the general rdma apis are the key here, not the userspace api. > > Are you proposing that habana should have uAPI in drivers/misc and > present a standard rdma-core userspace for it? This is the only > userspace programming interface for RoCE HW. I think that would be > much more work. > > If not, what open source userspace are you going to ask them to > present to merge the kernel side into misc? I don't think that they have a userspace api to their rdma feature from what I understand, but I could be totally wrong as I do not know their hardware at all, so I'll let them answer this question. > > Note, I do not know exactly what they are, but no, IBverbs are not ok. > > Should we stop merging new drivers and abandon the RDMA subsystem? Is > there something you'd like to see fixed? > > Don't really understand your position, sorry. For anything that _has_ to have a userspace RMDA interface, sure ibverbs are the one we are stuck with, but I didn't think that was the issue here at all, which is why I wrote the above comments. thanks, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-20 8:47 ` Greg Kroah-Hartman @ 2020-09-20 19:05 ` Oded Gabbay 2020-09-21 10:39 ` Leon Romanovsky 2020-09-21 11:52 ` Jason Gunthorpe 1 sibling, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-20 19:05 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Jason Gunthorpe, Leon Romanovsky, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, izur On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > > On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote: > > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote: > > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? > > > > I understand your reasoning about networking (Ethernet) as the driver > > > > connects to the kernel networking stack (netdev), but with RDMA the > > > > driver doesn't use or connect to anything in that stack. If I were to > > > > support IBverbs and declare that I support it, then of course I would > > > > need to integrate to the RDMA subsystem and add my backend to > > > > rdma-core. > > > > > > IBverbs are horrid and I would not wish them on anyone. Seriously. > > > > I'm curious what drives this opinion? Did you have it since you > > reviewed the initial submission all those years ago? > > As I learned more about that interface, yes, I like it less and less :) > > But that's the userspace api you all are stuck with, for various > reasons, my opinion doesn't matter here. > > > > I think the general rdma apis are the key here, not the userspace api. > > > > Are you proposing that habana should have uAPI in drivers/misc and > > present a standard rdma-core userspace for it? This is the only > > userspace programming interface for RoCE HW. I think that would be > > much more work. > > > > If not, what open source userspace are you going to ask them to > > present to merge the kernel side into misc? > > I don't think that they have a userspace api to their rdma feature from > what I understand, but I could be totally wrong as I do not know their > hardware at all, so I'll let them answer this question. Hi Greg, We do expose a new IOCTL to enable the user to configure connections between multiple GAUDI devices. Having said that, we restrict this IOCTL to be used only by the same user who is doing the compute on our device, as opposed to a real RDMA device where any number of applications can send and receive. In addition, this IOCTL limits the user to connect ONLY to another GAUDI device and not to a 3rd party RDMA device. It is true that GAUDI supports RDMA data movement but the data movement is NOT done by the user. It is done by our compute engines. i.e. the compute engines performs "send" and "receive" without going to the host (aka no support for ibv_postsend, ibv_postreceive). The only thing that is controlled by the user is to say which GAUDI is connected to which. After that, the command submission the user performs to operate our compute engines will cause them to send and receive RDMA packets. Moreover, as opposed to smart NICs where the Networking is the main focus and the compute is only secondary, in our device the compute is our major focus and the networking is a slave for it. The hl-thunk userspace library will have wrappers around this single IOCTL (like all our driver's IOCTLs) and also contain demos to show how to use it. > > > > Note, I do not know exactly what they are, but no, IBverbs are not ok. > > > > Should we stop merging new drivers and abandon the RDMA subsystem? Is > > there something you'd like to see fixed? > > > > Don't really understand your position, sorry. > > For anything that _has_ to have a userspace RMDA interface, sure ibverbs > are the one we are stuck with, but I didn't think that was the issue > here at all, which is why I wrote the above comments. To emphasize again, we don't want to expose a userspace RDMA interface. We just want to allow our single compute user to configure a connection to another GAUDI. Thanks, Oded > > thanks, > > greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-20 19:05 ` Oded Gabbay @ 2020-09-21 10:39 ` Leon Romanovsky 0 siblings, 0 replies; 56+ messages in thread From: Leon Romanovsky @ 2020-09-21 10:39 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, Jason Gunthorpe, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, izur On Sun, Sep 20, 2020 at 10:05:39PM +0300, Oded Gabbay wrote: > On Sun, Sep 20, 2020 at 11:47 AM Greg Kroah-Hartman > <gregkh@linuxfoundation.org> wrote: > > > > On Sat, Sep 19, 2020 at 04:22:35PM -0300, Jason Gunthorpe wrote: > > > On Sat, Sep 19, 2020 at 07:27:30PM +0200, Greg Kroah-Hartman wrote: > > > > > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? > > > > > I understand your reasoning about networking (Ethernet) as the driver > > > > > connects to the kernel networking stack (netdev), but with RDMA the > > > > > driver doesn't use or connect to anything in that stack. If I were to > > > > > support IBverbs and declare that I support it, then of course I would > > > > > need to integrate to the RDMA subsystem and add my backend to > > > > > rdma-core. > > > > > > > > IBverbs are horrid and I would not wish them on anyone. Seriously. > > > > > > I'm curious what drives this opinion? Did you have it since you > > > reviewed the initial submission all those years ago? > > > > As I learned more about that interface, yes, I like it less and less :) > > > > But that's the userspace api you all are stuck with, for various > > reasons, my opinion doesn't matter here. > > > > > > I think the general rdma apis are the key here, not the userspace api. > > > > > > Are you proposing that habana should have uAPI in drivers/misc and > > > present a standard rdma-core userspace for it? This is the only > > > userspace programming interface for RoCE HW. I think that would be > > > much more work. > > > > > > If not, what open source userspace are you going to ask them to > > > present to merge the kernel side into misc? > > > > I don't think that they have a userspace api to their rdma feature from > > what I understand, but I could be totally wrong as I do not know their > > hardware at all, so I'll let them answer this question. > > Hi Greg, > We do expose a new IOCTL to enable the user to configure connections > between multiple GAUDI devices. How is it different from RDMA QP configuration? > > Having said that, we restrict this IOCTL to be used only by the same > user who is doing the compute on our device, as opposed to a real RDMA > device where any number of applications can send and receive. The ability to support multiple applications is not RDMA-requirement, but the implementation. For example MPI jobs are single user of RDMA device. > In addition, this IOCTL limits the user to connect ONLY to another > GAUDI device and not to a 3rd party RDMA device. I don't see how it is different from EFA with their SQD QP type or mlx5 devices with DC QPs that you can connect only to similar devices (no interoperability). Thanks ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-20 8:47 ` Greg Kroah-Hartman 2020-09-20 19:05 ` Oded Gabbay @ 2020-09-21 11:52 ` Jason Gunthorpe 2020-09-21 21:20 ` Jakub Kicinski 1 sibling, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-21 11:52 UTC (permalink / raw) To: Greg Kroah-Hartman Cc: Oded Gabbay, Leon Romanovsky, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Sun, Sep 20, 2020 at 10:47:02AM +0200, Greg Kroah-Hartman wrote: > > If not, what open source userspace are you going to ask them to > > present to merge the kernel side into misc? > > I don't think that they have a userspace api to their rdma feature from > what I understand, but I could be totally wrong as I do not know their > hardware at all, so I'll let them answer this question. I thought Oded was pretty clear, the goal of this series is to expose their RDMA HW to userspace. This problem space requires co-mingling networking and compute at extremely high speed/low overhead. This is all done in userspace. We are specifically talking about this in include/uapi/misc/habanalabs.h: /* * NIC * * This IOCTL allows the user to manage and configure the device's NIC ports. * The following operations are available: * - Create a completion queue * - Destroy a completion queue * - Wait on completion queue * - Poll a completion queue * - Update consumed completion queue entries * - Set a work queue * - Unset a work queue * * For all operations, the user should provide a pointer to an input structure * with the context parameters. Some of the operations also require a pointer to * driver regarding how many of the available CQEs were actually * processed/consumed. Only then the driver will override them with newer * entries. * The set WQ operation should provide the device virtual address of the WQ with * a matching size for the number of WQs and entries per WQ. * */ #define HL_IOCTL_NIC _IOWR('H', 0x07, struct hl_nic_args) Which is ibv_create_qp, ibv_create_cq, ibv_poll_cq, etc, etc Habana has repeatedly described their HW as having multiple 100G RoCE ports. RoCE is one of the common industry standards that ibverbs unambiguously is responsible for. I would be much less annoyed if they were not actively marketing their product as RoCE RDMA. Sure there is some argument that their RoCE isn't spec compliant, but I don't think it excuses the basic principle of our subsystem: RDMA HW needs to demonstrate some basic functionality using the standard open source userspace software stack. I don't like this idea of backdooring a bunch of proprietary closed source RDMA userspace through drivers/misc, and if you don't have a clear idea how to get something equal for drivers/misc you should not accept the H_IOCTL_NIC. Plus RoCE is complicated, there is a bunch of interaction with netdev and rules related to that that really needs to be respected. > For anything that _has_ to have a userspace RMDA interface, sure ibverbs > are the one we are stuck with, but I didn't think that was the issue > here at all, which is why I wrote the above comments. I think you should look at the patches #8 through 11: https://lore.kernel.org/lkml/20200915171022.10561-9-oded.gabbay@gmail.com/ Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-21 11:52 ` Jason Gunthorpe @ 2020-09-21 21:20 ` Jakub Kicinski 2020-09-22 11:49 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Jakub Kicinski @ 2020-09-21 21:20 UTC (permalink / raw) To: Oded Gabbay Cc: Jason Gunthorpe, Greg Kroah-Hartman, Leon Romanovsky, Gal Pressman, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Mon, 21 Sep 2020 08:52:39 -0300 Jason Gunthorpe wrote: > I don't like this idea of backdooring a bunch of proprietary closed > source RDMA userspace through drivers/misc, and if you don't have a > clear idea how to get something equal for drivers/misc you should not > accept the H_IOCTL_NIC. > > Plus RoCE is complicated, there is a bunch of interaction with netdev > and rules related to that that really needs to be respected. +1 To me this code quite clearly fits the description of vendor SDK which runs proprietary stuff on top. It's such an vendor SDK thing to do to pick the parts of our infrastructure they like and "simplify the rest" with its own re-implementation. I'd wager the only reason you expose the netdevs at all is for link settings, stats, packet capture and debug. You'd never run TCP traffic over those links. And you're fighting against using Linux APIs for the only real traffic that runs on those links - RDMA(ish) traffic. Greg - I'm probably the least experience of the folks involved in this conversation - could you ELI5 what's the benefit to the community from merging this code? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-21 21:20 ` Jakub Kicinski @ 2020-09-22 11:49 ` Jason Gunthorpe 0 siblings, 0 replies; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-22 11:49 UTC (permalink / raw) To: Jakub Kicinski Cc: Oded Gabbay, Greg Kroah-Hartman, Leon Romanovsky, Gal Pressman, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Mon, Sep 21, 2020 at 02:20:53PM -0700, Jakub Kicinski wrote: > I'd wager the only reason you expose the netdevs at all is for link > settings, stats, packet capture and debug. You'd never run TCP traffic > over those links. And you're fighting against using Linux APIs for the > only real traffic that runs on those links - RDMA(ish) traffic. The usual working flow is to use something like TCP to exchange connection information then pivot to RDMA for the actual data flow. This is why a driver like this could get away with such a low performance implementation for a 100G NIC, it is just application boot metadata being exchanged. Sniffing probably won't work as typically the HW will capture the RoCE traffic before reaching Linux - and the Linux driver couldn't handle a 100G flow anyhow. Stats might not work either. As far as the "usual rules" we do require that accelerator devices sharing a netdev are secure in the concept of netdev userspace security. They can access the assigned RoCEv2 UDP port but cannot do things like forge src IP/MAC addresses, violate VLANs, reach outside net namespaces, capature arbitary traffic, etc. This stuff is tricky and generally requires HW support. Someone has to audit all of this and ensure it meets the netdev security requirements too, otherwise it will need CAP_NET_RAW to function. Obviously this requires seeing enough of a userspace implementation to understand how the design approaches verbs 'Address Handles' and so forth. RDMA HW has had errors before and when discovered it was blocked with CAP_NET_RAW until new chip revs came out, this is something I take very seriously. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-19 16:43 ` Oded Gabbay 2020-09-19 17:27 ` Greg Kroah-Hartman @ 2020-09-19 18:49 ` Andrew Lunn 1 sibling, 0 replies; 56+ messages in thread From: Andrew Lunn @ 2020-09-19 18:49 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, Leon Romanovsky, Gal Pressman, Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Florian Fainelli, linux-rdma On Sat, Sep 19, 2020 at 07:43:28PM +0300, Oded Gabbay wrote: > It's probably heresy, but why do I need to integrate into the RDMA subsystem ? Hi Oded I don't know the RDMA subsystem at all. So i will give a more generic answer. Are you reinventing things which a subsystem core already has? The subsystem core will be well tested, since lots of devices use it. Because of this, subsystem cores generally have a lower bug count per line of code than driver code. Using core code means drivers are smaller, and smaller code has less bugs by definition. We as maintainers have to assume you are going to abandon the driver at some point, while the hardware still exists, and leave the community to maintain it. So a smaller driver, which makes heavy use of the core is much easier to maintain. By making use of core code, you also get freebies. Somebody adds new functionality to the core, your driver automatically gets it. Look at this from the opposite perspective. Say every driver implemented their own TCP/IP stack? Or DMA engine? SPI infrastructure? How big a nightmare would it be to maintain? In your case, some parts of you hardware looks a bit like RDMA? So you ideally want to use the core code from the RDMA subsystem. Maybe you just need some of the lower layers? Maybe you need to refactor some of the RDMA core to make it a library you can pick and choice the bits useful to you? What you really want to avoid is re-implementing stuff in your driver which is already in the core. Andrew ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:36 ` Gal Pressman 2020-09-18 11:52 ` Leon Romanovsky @ 2020-09-18 11:56 ` Jason Gunthorpe 2020-09-18 11:59 ` Oded Gabbay 2020-09-18 12:10 ` Oded Gabbay 2 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 11:56 UTC (permalink / raw) To: Gal Pressman Cc: Oded Gabbay, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > >> infrastructure for communication between multiple accelerators. Same > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > >> The RDMA implementation we did does NOT support some basic RDMA > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > >> library or to connect to the rdma infrastructure in the kernel. > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > you can't add random device offloads as IOCTL to nedevs. > > > > RDMA is the proper home for all the networking offloads that don't fit > > into netdev. > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > all. I'm sure this can too. > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > was suggested to go through the vfio subsystem instead. > > I think this comes back to the discussion we had when EFA was upstreamed, which > is what's the bar to get accepted to the RDMA subsystem. > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). That is more or less where we ended up, yes. I'm most worried about this lack of PD and MR. Kernel must provide security for apps doing user DMA, PD and MR do this. If the device doesn't have PD/MR then it is hard to see how a WQ could ever be exposed directly to userspace, regardless of subsystem. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:56 ` Jason Gunthorpe @ 2020-09-18 11:59 ` Oded Gabbay 2020-09-18 12:16 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 11:59 UTC (permalink / raw) To: Jason Gunthorpe Cc: Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > >> infrastructure for communication between multiple accelerators. Same > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > >> The RDMA implementation we did does NOT support some basic RDMA > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > into netdev. > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > all. I'm sure this can too. > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > was suggested to go through the vfio subsystem instead. > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > is what's the bar to get accepted to the RDMA subsystem. > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > That is more or less where we ended up, yes. > > I'm most worried about this lack of PD and MR. > > Kernel must provide security for apps doing user DMA, PD and MR do > this. If the device doesn't have PD/MR then it is hard to see how a WQ > could ever be exposed directly to userspace, regardless of subsystem. > > Jason Hi Jason, What you say here is very true and we handle that with different mechanisms. I will start working on a dedicated patch-set of the RDMA code in the next few weeks with MUCH MORE details in the commit messages. That will explain exactly how we expose stuff and protect. For example, regarding isolating between applications, we only support a single application opening our file descriptor. Another example is that the submission of WQ is done through our QMAN mechanism and is NOT mapped to userspace (due to the restrictions you mentioned above and other restrictions). But again, I want to send something organized and with proper explanations. I hope to have something in a couple of weeks. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:59 ` Oded Gabbay @ 2020-09-18 12:16 ` Jason Gunthorpe 2020-09-18 12:34 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 12:16 UTC (permalink / raw) To: Oded Gabbay Cc: Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 02:59:28PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > >> infrastructure for communication between multiple accelerators. Same > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > into netdev. > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > all. I'm sure this can too. > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > was suggested to go through the vfio subsystem instead. > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > is what's the bar to get accepted to the RDMA subsystem. > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > That is more or less where we ended up, yes. > > > > I'm most worried about this lack of PD and MR. > > > > Kernel must provide security for apps doing user DMA, PD and MR do > > this. If the device doesn't have PD/MR then it is hard to see how a WQ > > could ever be exposed directly to userspace, regardless of subsystem. > > Hi Jason, > What you say here is very true and we handle that with different > mechanisms. I will start working on a dedicated patch-set of the RDMA > code in the next few weeks with MUCH MORE details in the commit > messages. That will explain exactly how we expose stuff and protect. > > For example, regarding isolating between applications, we only support > a single application opening our file descriptor. Then the driver has a special PD create that requires the misc file descriptor to authorize RDMA access to the resources in that security context. > Another example is that the submission of WQ is done through our QMAN > mechanism and is NOT mapped to userspace (due to the restrictions you > mentioned above and other restrictions). Sure, other RDMA drivers also require a kernel ioctl for command execution. In this model the MR can be a software construct, again representing a security authorization: - A 'full process' MR, in which case the kernel command excution handles dma map and pinning at command execution time - A 'normal' MR, in which case the DMA list is pre-created and the command execution just re-uses this data The general requirement for RDMA is the same as DRM, you must provide enough code in rdma-core to show how the device works, and minimally test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC. So you'll want to arrange something where the default MR and PD mechanisms do something workable on this device, like auto-open the misc FD when building the PD, and support the 'normal' MR flow for command execution. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:16 ` Jason Gunthorpe @ 2020-09-18 12:34 ` Oded Gabbay 2020-09-18 12:50 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 12:34 UTC (permalink / raw) To: Jason Gunthorpe, izur Cc: Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 3:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 02:59:28PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 2:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 02:36:10PM +0300, Gal Pressman wrote: > > > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > > > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > > > > >> infrastructure for communication between multiple accelerators. Same > > > > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > > > > >> The RDMA implementation we did does NOT support some basic RDMA > > > > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > > > > >> library or to connect to the rdma infrastructure in the kernel. > > > > > > > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > > > > you can't add random device offloads as IOCTL to nedevs. > > > > > > > > > > RDMA is the proper home for all the networking offloads that don't fit > > > > > into netdev. > > > > > > > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > > > > all. I'm sure this can too. > > > > > > > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > > > > was suggested to go through the vfio subsystem instead. > > > > > > > > I think this comes back to the discussion we had when EFA was upstreamed, which > > > > is what's the bar to get accepted to the RDMA subsystem. > > > > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > > > > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > > > > > That is more or less where we ended up, yes. > > > > > > I'm most worried about this lack of PD and MR. > > > > > > Kernel must provide security for apps doing user DMA, PD and MR do > > > this. If the device doesn't have PD/MR then it is hard to see how a WQ > > > could ever be exposed directly to userspace, regardless of subsystem. > > > > Hi Jason, > > What you say here is very true and we handle that with different > > mechanisms. I will start working on a dedicated patch-set of the RDMA > > code in the next few weeks with MUCH MORE details in the commit > > messages. That will explain exactly how we expose stuff and protect. > > > > For example, regarding isolating between applications, we only support > > a single application opening our file descriptor. > > Then the driver has a special PD create that requires the misc file > descriptor to authorize RDMA access to the resources in that security > context. > > > Another example is that the submission of WQ is done through our QMAN > > mechanism and is NOT mapped to userspace (due to the restrictions you > > mentioned above and other restrictions). > > Sure, other RDMA drivers also require a kernel ioctl for command > execution. > > In this model the MR can be a software construct, again representing a > security authorization: > > - A 'full process' MR, in which case the kernel command excution > handles dma map and pinning at command execution time > - A 'normal' MR, in which case the DMA list is pre-created and the > command execution just re-uses this data > > The general requirement for RDMA is the same as DRM, you must provide > enough code in rdma-core to show how the device works, and minimally > test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC. > > So you'll want to arrange something where the default MR and PD > mechanisms do something workable on this device, like auto-open the > misc FD when building the PD, and support the 'normal' MR flow for > command execution. > > Jason I don't know how we can support MR because we can't support any virtual address on the host. Our internal MMU doesn't support 64-bits. We investigated in the past, very much wanted to use IBverbs but didn't figure out how to make it work. I'm adding Itay here and he can also shed more details on that. Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:34 ` Oded Gabbay @ 2020-09-18 12:50 ` Jason Gunthorpe 2020-09-18 13:02 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 12:50 UTC (permalink / raw) To: Oded Gabbay Cc: izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 03:34:54PM +0300, Oded Gabbay wrote: > > > Another example is that the submission of WQ is done through our QMAN > > > mechanism and is NOT mapped to userspace (due to the restrictions you > > > mentioned above and other restrictions). > > > > Sure, other RDMA drivers also require a kernel ioctl for command > > execution. > > > > In this model the MR can be a software construct, again representing a > > security authorization: > > > > - A 'full process' MR, in which case the kernel command excution > > handles dma map and pinning at command execution time > > - A 'normal' MR, in which case the DMA list is pre-created and the > > command execution just re-uses this data > > > > The general requirement for RDMA is the same as DRM, you must provide > > enough code in rdma-core to show how the device works, and minimally > > test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC. > > > > So you'll want to arrange something where the default MR and PD > > mechanisms do something workable on this device, like auto-open the > > misc FD when building the PD, and support the 'normal' MR flow for > > command execution. > > I don't know how we can support MR because we can't support any > virtual address on the host. Our internal MMU doesn't support 64-bits. > We investigated in the past, very much wanted to use IBverbs but > didn't figure out how to make it work. > I'm adding Itay here and he can also shed more details on that. I'm not sure what that means, if the driver intends to DMA from process memory then it certainly has a MR concept. MRs can control the IOVA directly so if you say the HW needs a MR IOVA < 2**32 then that is still OK. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 12:50 ` Jason Gunthorpe @ 2020-09-18 13:02 ` Oded Gabbay 2020-09-18 13:26 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 13:02 UTC (permalink / raw) To: Jason Gunthorpe Cc: izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 3:50 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 03:34:54PM +0300, Oded Gabbay wrote: > > > > Another example is that the submission of WQ is done through our QMAN > > > > mechanism and is NOT mapped to userspace (due to the restrictions you > > > > mentioned above and other restrictions). > > > > > > Sure, other RDMA drivers also require a kernel ioctl for command > > > execution. > > > > > > In this model the MR can be a software construct, again representing a > > > security authorization: > > > > > > - A 'full process' MR, in which case the kernel command excution > > > handles dma map and pinning at command execution time > > > - A 'normal' MR, in which case the DMA list is pre-created and the > > > command execution just re-uses this data > > > > > > The general requirement for RDMA is the same as DRM, you must provide > > > enough code in rdma-core to show how the device works, and minimally > > > test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC. > > > > > > So you'll want to arrange something where the default MR and PD > > > mechanisms do something workable on this device, like auto-open the > > > misc FD when building the PD, and support the 'normal' MR flow for > > > command execution. > > > > I don't know how we can support MR because we can't support any > > virtual address on the host. Our internal MMU doesn't support 64-bits. > > We investigated in the past, very much wanted to use IBverbs but > > didn't figure out how to make it work. > > I'm adding Itay here and he can also shed more details on that. > > I'm not sure what that means, if the driver intends to DMA from > process memory then it certainly has a MR concept. > > MRs can control the IOVA directly so if you say the HW needs a MR IOVA > < 2**32 then that is still OK. > > Jason Hi Jason, I'll try to explain but please bear with me because it requires some understanding of our H/W architecture. Our ASIC has 32 GB of HBM memory (similar to GPUs). The problem is that HBM memory is accessed by our ASIC's engines (DMA, NIC, etc.) with physical addressing, which is mapped inside our device between 0x0 to 0x8_0000_0000. Now, if a user performs malloc and then maps that memory to our device (using our memory MAP ioctl, similar to how GPU works), it will get a new virtual address, which is in the range of 0x80_0000_0000 - (2^50 -1). Then, he can use that new VA in our device with different engines (DMA, NIC, compute). That way, addresses that represent the host memory do not overlap addresses that represent HBM memory. The problem with MR is that the API doesn't let us return a new VA. It forces us to use the original VA that the Host OS allocated. What will we do if that VA is in the range of our HBM addresses ? The device won't be able to distinguish between them. The transaction that is generated by an engine inside our device will go to the HBM instead of going to the PCI controller and then to the host. That's the crust of the problem and why we didn't use MR. If that's not clear, I'll be happy to explain more. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 13:02 ` Oded Gabbay @ 2020-09-18 13:26 ` Jason Gunthorpe 2020-09-18 13:49 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 13:26 UTC (permalink / raw) To: Oded Gabbay Cc: izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > The problem with MR is that the API doesn't let us return a new VA. It > forces us to use the original VA that the Host OS allocated. If using the common MR API you'd have to assign a unique linear range in the single device address map and record both the IOVA and the MMU VA in the kernel struct. Then when submitting work using that MR lkey the kernel will adjust the work VA using the equation (WORK_VA - IOVA) + MMU_VA before forwarding to HW. EFA doesn't support rkeys, so they are not required to be emulated. It would have to create rkeys using some guadidv_reg_mr_rkey() It is important to understand that the usual way we support these non-RDMA devices is to insist that they use SW to construct a minimal standards based RDMA API, and then allow the device to have a 'dv' API to access a faster, highly device specific, SW bypass path. So for instance you might have some guadidv_post_work(qp) that doesn't use lkeys and works directly on the MMU_VA. A guadidv_get_mmu_va(mr) would return the required HW VA from the kernel. Usually the higher level communication library (UCX, MPI, etc) forms the dv primitives into something application usable. > we do if that VA is in the range of our HBM addresses ? The device > won't be able to distinguish between them. The transaction that is > generated by an engine inside our device will go to the HBM instead of > going to the PCI controller and then to the host. > > That's the crust of the problem and why we didn't use MR. No, the problem with the device is that it doesn't have a lkey/rkey, so it is stuck with a single translation domain. RoCE compliant devices are required to have multiple translation domains - each lkey/rkey specifies a unique translation. The MR concept is a region of process VA mapped into the device for device access, and this device *clearly* has that. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 13:26 ` Jason Gunthorpe @ 2020-09-18 13:49 ` Oded Gabbay 2020-09-18 13:59 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 13:49 UTC (permalink / raw) To: Jason Gunthorpe, Greg Kroah-Hartman Cc: izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > The problem with MR is that the API doesn't let us return a new VA. It > > forces us to use the original VA that the Host OS allocated. > > If using the common MR API you'd have to assign a unique linear range > in the single device address map and record both the IOVA and the MMU > VA in the kernel struct. > > Then when submitting work using that MR lkey the kernel will adjust > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > forwarding to HW. > We can't do that. That will kill the performance. If for every submission I need to modify the packet's contents, the throughput will go downhill. Also, submissions to our RDMA qmans are coupled with submissions to our DMA/Compute QMANs. We can't separate those to different API calls. That will also kill performance and in addition, will prevent us from synchronizing all the engines. I also have to say, it troubles me that you keep referring to our device as an RDMA device. It is not an RDMA device. It is a deep-learning accelerator which uses RDMA as a way to interconnect multiple devices. We don't intend to replace General-Purpose RDMA devices. We know we don't support that. Therefore, I still fail to see why we need to support all the above... Our work submission is not to just "send/receive packets". Sending packets is part of a general recipe to do DMA, perform compute on data and send/receive data. All together, in a synchronized fashion. The way you try to force me to go is to separate that into different functionality, as if I have different ASICs, which is very counter-productive in terms of performance and simplicity. i.e. have one method of submitting work to DMA/compute and another way to RDMA ports. I know this is how the kernel is structured now - subsystems for devices that belong to a single domain (graphics, net, storage). But I fear that you will soon see this paradigm doesn't work with new devices in AI, which combine multiple domains into a single ASIC. Greg, I would love to hear your opinion here. Am I totally wrong ? Is treating a single ASIC that belongs to multiple domains as if it were multiple ASICs a good thing ? Don't you think it will hurt the performance ? Oded > EFA doesn't support rkeys, so they are not required to be emulated. It > would have to create rkeys using some guadidv_reg_mr_rkey() > > It is important to understand that the usual way we support these > non-RDMA devices is to insist that they use SW to construct a minimal > standards based RDMA API, and then allow the device to have a 'dv' API > to access a faster, highly device specific, SW bypass path. > > So for instance you might have some guadidv_post_work(qp) that doesn't > use lkeys and works directly on the MMU_VA. A guadidv_get_mmu_va(mr) > would return the required HW VA from the kernel. > > Usually the higher level communication library (UCX, MPI, etc) forms > the dv primitives into something application usable. > > > we do if that VA is in the range of our HBM addresses ? The device > > won't be able to distinguish between them. The transaction that is > > generated by an engine inside our device will go to the HBM instead of > > going to the PCI controller and then to the host. > > > > That's the crust of the problem and why we didn't use MR. > > No, the problem with the device is that it doesn't have a lkey/rkey, > so it is stuck with a single translation domain. RoCE compliant > devices are required to have multiple translation domains - each > lkey/rkey specifies a unique translation. > > The MR concept is a region of process VA mapped into the device for > device access, and this device *clearly* has that. > > Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 13:49 ` Oded Gabbay @ 2020-09-18 13:59 ` Jason Gunthorpe 2020-09-18 14:12 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 13:59 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > forces us to use the original VA that the Host OS allocated. > > > > If using the common MR API you'd have to assign a unique linear range > > in the single device address map and record both the IOVA and the MMU > > VA in the kernel struct. > > > > Then when submitting work using that MR lkey the kernel will adjust > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > forwarding to HW. > > > We can't do that. That will kill the performance. If for every > submission I need to modify the packet's contents, the throughput will > go downhill. You clearly didn't read where I explained there is a fast path and slow path expectation. > Also, submissions to our RDMA qmans are coupled with submissions to > our DMA/Compute QMANs. We can't separate those to different API calls. > That will also kill performance and in addition, will prevent us from > synchronizing all the engines. Not sure I see why this is a problem. I already explained the fast device specific path. As long as the kernel maintains proper security when it processes submissions the driver can allow objects to cross between the two domains. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 13:59 ` Jason Gunthorpe @ 2020-09-18 14:12 ` Oded Gabbay 2020-09-18 14:19 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 14:12 UTC (permalink / raw) To: Jason Gunthorpe Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > forces us to use the original VA that the Host OS allocated. > > > > > > If using the common MR API you'd have to assign a unique linear range > > > in the single device address map and record both the IOVA and the MMU > > > VA in the kernel struct. > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > forwarding to HW. > > > > > We can't do that. That will kill the performance. If for every > > submission I need to modify the packet's contents, the throughput will > > go downhill. > > You clearly didn't read where I explained there is a fast path and > slow path expectation. > > > Also, submissions to our RDMA qmans are coupled with submissions to > > our DMA/Compute QMANs. We can't separate those to different API calls. > > That will also kill performance and in addition, will prevent us from > > synchronizing all the engines. > > Not sure I see why this is a problem. I already explained the fast > device specific path. > > As long as the kernel maintains proper security when it processes > submissions the driver can allow objects to cross between the two > domains. Can you please explain what you mean by "two domains" ? You mean the RDMA and compute domains ? Or something else ? What I was trying to say is that I don't want the application to split its submissions to different system calls. Currently we perform submissions through the CS_IOCTL that is defined in our driver. It is a single IOCTL which allows the user to submit work to all queues, without regard to the underlying engine of each queue. If I need to split that to different system calls it will have major implications. I don't even want to start thinking about all the synchronization at the host (userspace) level that I will need to do. That's what I meant by saying that you force me to treat my device as if it were multiple devices. The whole point of our ASIC is to combine multiple IPs on the same ASIC. What will happen when we will add a third domain to our device (e.g. storage, video decoding, encryption engine, whatever). Will I then need to separate submissions to 3 different system calls ? In 3 different subsystems ? This doesn't scale. And I strongly say that it will kill the performance of the device. Not because of the driver. Because of the complications to the user-space. Oded > > Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 14:12 ` Oded Gabbay @ 2020-09-18 14:19 ` Jason Gunthorpe 2020-09-18 14:45 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 14:19 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > in the single device address map and record both the IOVA and the MMU > > > > VA in the kernel struct. > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > forwarding to HW. > > > > > > > We can't do that. That will kill the performance. If for every > > > submission I need to modify the packet's contents, the throughput will > > > go downhill. > > > > You clearly didn't read where I explained there is a fast path and > > slow path expectation. > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > That will also kill performance and in addition, will prevent us from > > > synchronizing all the engines. > > > > Not sure I see why this is a problem. I already explained the fast > > device specific path. > > > > As long as the kernel maintains proper security when it processes > > submissions the driver can allow objects to cross between the two > > domains. > Can you please explain what you mean by "two domains" ? > You mean the RDMA and compute domains ? Or something else ? Yes > What I was trying to say is that I don't want the application to split > its submissions to different system calls. If you can manage the security then you can cross them. Eg since The RDMA PD would be created on top of the /dev/misc char dev then it is fine for the /dev/misc char dev to access the RDMA objects as a 'dv fast path'. But now that you say everything is interconnected, I'm wondering, without HW security how do you keep netdev isolated from userspace? Can I issue commands to /dev/misc and write to kernel memory (does the kernel put any pages into the single MMU?) or corrupt the netdev driver operations in any way? Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 14:19 ` Jason Gunthorpe @ 2020-09-18 14:45 ` Oded Gabbay 2020-09-18 15:07 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 14:45 UTC (permalink / raw) To: Jason Gunthorpe Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 5:19 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote: > > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote: > > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > > > > > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote: > > > > > > > > > > > The problem with MR is that the API doesn't let us return a new VA. It > > > > > > forces us to use the original VA that the Host OS allocated. > > > > > > > > > > If using the common MR API you'd have to assign a unique linear range > > > > > in the single device address map and record both the IOVA and the MMU > > > > > VA in the kernel struct. > > > > > > > > > > Then when submitting work using that MR lkey the kernel will adjust > > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before > > > > > forwarding to HW. > > > > > > > > > We can't do that. That will kill the performance. If for every > > > > submission I need to modify the packet's contents, the throughput will > > > > go downhill. > > > > > > You clearly didn't read where I explained there is a fast path and > > > slow path expectation. > > > > > > > Also, submissions to our RDMA qmans are coupled with submissions to > > > > our DMA/Compute QMANs. We can't separate those to different API calls. > > > > That will also kill performance and in addition, will prevent us from > > > > synchronizing all the engines. > > > > > > Not sure I see why this is a problem. I already explained the fast > > > device specific path. > > > > > > As long as the kernel maintains proper security when it processes > > > submissions the driver can allow objects to cross between the two > > > domains. > > Can you please explain what you mean by "two domains" ? > > You mean the RDMA and compute domains ? Or something else ? > > Yes > > > What I was trying to say is that I don't want the application to split > > its submissions to different system calls. > > If you can manage the security then you can cross them. Eg since The > RDMA PD would be created on top of the /dev/misc char dev then it is > fine for the /dev/misc char dev to access the RDMA objects as a 'dv > fast path'. > > But now that you say everything is interconnected, I'm wondering, > without HW security how do you keep netdev isolated from userspace? > > Can I issue commands to /dev/misc and write to kernel memory (does the > kernel put any pages into the single MMU?) or corrupt the netdev > driver operations in any way? > > Jason No, no, no. Please give me more credit :) btw, our kernel interface was scrutinized when we upstreamed the driver and it was under review by the Intel security team. To explain our security mechanism will require some time. It is detailed in the driver, but it is hard to understand without some background. I wonder where to start... First of all, we support open, close, mmap and IOCTLs to /dev/misc/hlX. We don't support read/write system calls. A user never gets direct access to kernel memory. Only through standard mmap. The only thing we allow to mmap is a command buffer (which is used to submit work to certain DMA queues on our device) and to a memory region we use for "CQ" for the RDMA. That's it. Any access by the device's engines to the host memory is done via our device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The kernel driver is assigned ASID 0, while the user is assigned ASID 1. We can support up to 1024 ASIDs, but because we limit the user to have a single application, we only use ASID 0 and 1. The above means a user can't program an engine (DMA, NIC, compute) to access memory he didn't first mapped into our device's MMU. The mapping is done via one of our IOCTLs and the kernel driver makes sure (using standard kernel internal APIs) the host memory truly belongs to the user process. All those mappings are done using ASID 1. If the driver needs to map kernel pages into the device's MMU, then this is done using ASID 0. This is how we take care of separation between kernel memory and user memory. Each transaction our engines create and is going to the host first passes through our MMU. The transaction comes with its ASID value. According to that, the MMU knows which page tables to do the walk on. Specifically regarding RDMA, the user prepares a WQE on the host memory in an area which is mapped into our MMU using ASID 1. The user uses the NIC control IOCTL to give the kernel driver the virtual base address of the WQ and the driver programs it to the H/W. Then, the user can submit the WQE by submitting a command buffer to the NIC QMAN. The command buffer contains a message to the QMAN that tells it to ring the doorbell of the relevant NIC port. The user can't do it from userspace. For regular Ethernet traffice, we don't have any IOCTLs of course. All Ethernet operations are done via the standard networking subsystem (sockets, etc.). There are more details of course. I don't know how much you want me to go deeper. If you have specific questions I'll be happy to answer. Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 14:45 ` Oded Gabbay @ 2020-09-18 15:07 ` Jason Gunthorpe 2020-09-18 15:15 ` Oded Gabbay 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 15:07 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > Any access by the device's engines to the host memory is done via our > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > We can support up to 1024 ASIDs, but because we limit the user to have > a single application, we only use ASID 0 and 1. If the QP/WQ/etc is HW bound to an ASID then that binding is called a PD and the ASID is acting in the PD role. If the ASID is translating from on the wire IOVA to DMA PA, then it is acting in the MR role as well. Bundling those two things together is not as flexible as standards based RDMA, but it is not as far away as you are making things out to be. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 15:07 ` Jason Gunthorpe @ 2020-09-18 15:15 ` Oded Gabbay 2020-09-18 15:28 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 15:15 UTC (permalink / raw) To: Jason Gunthorpe Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 6:07 PM Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Fri, Sep 18, 2020 at 05:45:21PM +0300, Oded Gabbay wrote: > > > Any access by the device's engines to the host memory is done via our > > device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The > > kernel driver is assigned ASID 0, while the user is assigned ASID 1. > > We can support up to 1024 ASIDs, but because we limit the user to have > > a single application, we only use ASID 0 and 1. > > If the QP/WQ/etc is HW bound to an ASID then that binding is called a > PD and the ASID is acting in the PD role. > > If the ASID is translating from on the wire IOVA to DMA PA, then it is > acting in the MR role as well. > > Bundling those two things together is not as flexible as standards > based RDMA, but it is not as far away as you are making things out to > be. > > Jason But Jason, why do I need to use RDMA definitions in my common code ? RDMA is such a small part of our ASIC. We also have an ASIC called GOYA for inference, which is handled by the same driver, but doesn't have RDMA ports at all. Why would I need to use RDMA definitions for that ? I'm sorry, but you won't be able to convince me here that I need to "enslave" my entire code to RDMA, just because my ASIC "also" has some RDMA ports. On the same weight, the GPU people tried and failed to say that my device is a GPU. And I think the reasoning that we applied back then, and Greg and Olof agreed with it, applies here as well. I want to play along, but it has to be something that won't make my entire device's driver into an RDMA driver. And it has to be something that doesn't hurt performance. All other things can and will be changed according to your inputs. Thanks, Oded Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 15:15 ` Oded Gabbay @ 2020-09-18 15:28 ` Jason Gunthorpe 2020-09-21 11:22 ` Gal Pressman 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-18 15:28 UTC (permalink / raw) To: Oded Gabbay Cc: Greg Kroah-Hartman, izur, Gal Pressman, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote: > I'm sorry, but you won't be able to convince me here that I need to > "enslave" my entire code to RDMA, just because my ASIC "also" has some > RDMA ports. You can't recreate common shared subsystems in a driver just because you don't want to work with the subsystem. I don't care what else the ASIC has. In Linux the netdev part is exposed through netdev, the RDMA part through RDMA, the totally-not-a-GPU part through drivers/misc. It is always been this way. Chelsio didn't get to rebuild the SCSI stack in their driver just because "storage is a small part of their device" Drivers are not allowed to re-implement I2C/SPI/etc without re-using the comon code for that just because "I2C is a small part of their device" Exposing to userspace the creation of RoCE QPs and their related objects are unambiguously a RDMA subsystem task. I don't even know how you think you can argue it is not. It is your company proudly claiming the device has 100G RoCE ports in all the marketing literature, after all. It is too bad the device has a non-standards compliant implementation of RoCE so this will be a bit hard for you. Oh well. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 15:28 ` Jason Gunthorpe @ 2020-09-21 11:22 ` Gal Pressman 2020-09-21 11:49 ` Leon Romanovsky 2020-09-22 11:41 ` Jason Gunthorpe 0 siblings, 2 replies; 56+ messages in thread From: Gal Pressman @ 2020-09-21 11:22 UTC (permalink / raw) To: Jason Gunthorpe, Oded Gabbay Cc: Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On 18/09/2020 18:28, Jason Gunthorpe wrote: > On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote: > >> I'm sorry, but you won't be able to convince me here that I need to >> "enslave" my entire code to RDMA, just because my ASIC "also" has some >> RDMA ports. > > You can't recreate common shared subsystems in a driver just because > you don't want to work with the subsystem. > > I don't care what else the ASIC has. In Linux the netdev part is > exposed through netdev, the RDMA part through RDMA, the > totally-not-a-GPU part through drivers/misc. > > It is always been this way. Chelsio didn't get to rebuild the SCSI > stack in their driver just because "storage is a small part of their > device" > > Drivers are not allowed to re-implement I2C/SPI/etc without re-using > the comon code for that just because "I2C is a small part of their > device" > > Exposing to userspace the creation of RoCE QPs and their related > objects are unambiguously a RDMA subsystem task. I don't even know how > you think you can argue it is not. It is your company proudly claiming > the device has 100G RoCE ports in all the marketing literature, after > all. > > It is too bad the device has a non-standards compliant implementation > of RoCE so this will be a bit hard for you. Oh well. What is considered a RoCE port in this case if it's not compliant with RoCE? Sounds like it's an implementation of RDMA over ethernet, not RoCE. Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol? (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation exposed? Or is it closed source?) Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by habanalabs driver)? So neither any userspace verbs apps could use it nor kernel ULPs? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-21 11:22 ` Gal Pressman @ 2020-09-21 11:49 ` Leon Romanovsky 2020-09-22 11:41 ` Jason Gunthorpe 1 sibling, 0 replies; 56+ messages in thread From: Leon Romanovsky @ 2020-09-21 11:49 UTC (permalink / raw) To: Gal Pressman Cc: Jason Gunthorpe, Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote: > On 18/09/2020 18:28, Jason Gunthorpe wrote: > > On Fri, Sep 18, 2020 at 06:15:52PM +0300, Oded Gabbay wrote: > > > >> I'm sorry, but you won't be able to convince me here that I need to > >> "enslave" my entire code to RDMA, just because my ASIC "also" has some > >> RDMA ports. > > > > You can't recreate common shared subsystems in a driver just because > > you don't want to work with the subsystem. > > > > I don't care what else the ASIC has. In Linux the netdev part is > > exposed through netdev, the RDMA part through RDMA, the > > totally-not-a-GPU part through drivers/misc. > > > > It is always been this way. Chelsio didn't get to rebuild the SCSI > > stack in their driver just because "storage is a small part of their > > device" > > > > Drivers are not allowed to re-implement I2C/SPI/etc without re-using > > the comon code for that just because "I2C is a small part of their > > device" > > > > Exposing to userspace the creation of RoCE QPs and their related > > objects are unambiguously a RDMA subsystem task. I don't even know how > > you think you can argue it is not. It is your company proudly claiming > > the device has 100G RoCE ports in all the marketing literature, after > > all. > > > > It is too bad the device has a non-standards compliant implementation > > of RoCE so this will be a bit hard for you. Oh well. > > What is considered a RoCE port in this case if it's not compliant with RoCE? They claim that it is RoCE v2. https://www.hotchips.org/hc31/HC31_1.14_HabanaLabs.Eitan_Medina.v9.pdf Thanks ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-21 11:22 ` Gal Pressman 2020-09-21 11:49 ` Leon Romanovsky @ 2020-09-22 11:41 ` Jason Gunthorpe 2020-09-22 12:46 ` Gal Pressman 1 sibling, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-22 11:41 UTC (permalink / raw) To: Gal Pressman Cc: Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote: > What is considered a RoCE port in this case if it's not compliant with RoCE? > Sounds like it's an implementation of RDMA over ethernet, not RoCE. > Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol? > (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation > exposed? Or is it closed source?) I think Oded was drawing a parallel to how nvlink is integral with the compute element. From Oded's descriptions I don't think it is much like nvlink at all. > Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path > verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by > habanalabs driver)? > So neither any userspace verbs apps could use it nor kernel ULPs? Based on what Oded described it seems like a reasonable RDMA device with some limitations around MR IOVA. Looks like the desire is to create a RDMA WR and CQ ring in userspace, and then co-mingle that with the compute side of the device. So instead of doing the special IOCTL and mmap against the compute FD it would create a RDMA QP and RDMA CQ, use dv to access the raw internals, and the propritary stack would have exactly the same stuff it would have had with the misc ioctl. But, completely separately, they'd also have to implement some of verbs which serves as the open source userspace showing how this HW works. What that is depends largely on what their HW can do, and if they want to connect to UCX/mpi/libfabric/etc A bunch of ioctl stubs or a few tests is far below our standard in RDMA. There may have been some argument that the compute side of this device has no industry standards so should be a drivers/misc, but HPC networking *does* have extensive standards and extensive open source software stacks. It is very hard for me to see how a device in this market could be competitive without integrating with that stuff. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-22 11:41 ` Jason Gunthorpe @ 2020-09-22 12:46 ` Gal Pressman 2020-09-22 16:14 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Gal Pressman @ 2020-09-22 12:46 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On 22/09/2020 14:41, Jason Gunthorpe wrote: > On Mon, Sep 21, 2020 at 02:22:02PM +0300, Gal Pressman wrote: > >> What is considered a RoCE port in this case if it's not compliant with RoCE? >> Sounds like it's an implementation of RDMA over ethernet, not RoCE. >> Does GAUDI support UD/RC/.. QPs? Is it using a proprietary wire protocol? >> (BTW, Oded claims it's similar to nvlink, how is nvlink's implementation >> exposed? Or is it closed source?) > > I think Oded was drawing a parallel to how nvlink is integral with the > compute element. From Oded's descriptions I don't think it is much > like nvlink at all. > >> Jason, how do you imagine GAUDI in the RDMA subsystem? Userspace control path >> verbs (used by hl-thunk?) and all data path verbs exposed as kverbs (used by >> habanalabs driver)? >> So neither any userspace verbs apps could use it nor kernel ULPs? > > Based on what Oded described it seems like a reasonable RDMA device > with some limitations around MR IOVA. > > Looks like the desire is to create a RDMA WR and CQ ring in userspace, > and then co-mingle that with the compute side of the device. > > So instead of doing the special IOCTL and mmap against the compute FD > it would create a RDMA QP and RDMA CQ, use dv to access the raw > internals, and the propritary stack would have exactly the same stuff > it would have had with the misc ioctl. > > But, completely separately, they'd also have to implement some of > verbs which serves as the open source userspace showing how this HW > works. What that is depends largely on what their HW can do, and if > they want to connect to UCX/mpi/libfabric/etc > > A bunch of ioctl stubs or a few tests is far below our standard in > RDMA. > > There may have been some argument that the compute side of this device > has no industry standards so should be a drivers/misc, but HPC > networking *does* have extensive standards and extensive open source > software stacks. It is very hard for me to see how a device in this > market could be competitive without integrating with that stuff. I agree, that makes sense. But assuming Oded actually goes and implements all the needed verbs to get a basic functional libibverbs provider (assuming their HW can do it somehow), is it really useful if no one is going to use it? It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter, and I'm assuming the only real world use case is going to be using the hl stack, which means we're left with a lot of dead code that's not used/tested by anyone. Genuine question, wouldn't it be better if they only implement what's actually going to be used and tested by their customers? ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-22 12:46 ` Gal Pressman @ 2020-09-22 16:14 ` Jason Gunthorpe 2020-09-22 16:30 ` Gal Pressman 0 siblings, 1 reply; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-22 16:14 UTC (permalink / raw) To: Gal Pressman Cc: Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote: > I agree, that makes sense. > But assuming Oded actually goes and implements all the needed verbs to get a > basic functional libibverbs provider (assuming their HW can do it somehow), is > it really useful if no one is going to use it? > It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter, > and I'm assuming the only real world use case is going to be using the hl stack, > which means we're left with a lot of dead code that's not used/tested by anyone. > > Genuine question, wouldn't it be better if they only implement what's actually > going to be used and tested by their customers? The general standard for this 'accel' hardware, both in DRM and RDMA is to present an open source userspace. Companies are encouraged to use that as their main interface but I suppose are free to carry the cost of dual APIs, and the community's wrath if they want. At least for RDMA this is guided by the founding event of Linux RDMA where all customers demanded the madness of every supplier having a unique software stack from the kernel down stop. Since then the low level stack has been cross vendor and uniform. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-22 16:14 ` Jason Gunthorpe @ 2020-09-22 16:30 ` Gal Pressman 2020-09-22 16:52 ` Jason Gunthorpe 0 siblings, 1 reply; 56+ messages in thread From: Gal Pressman @ 2020-09-22 16:30 UTC (permalink / raw) To: Jason Gunthorpe Cc: Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On 22/09/2020 19:14, Jason Gunthorpe wrote: > On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote: > >> I agree, that makes sense. >> But assuming Oded actually goes and implements all the needed verbs to get a >> basic functional libibverbs provider (assuming their HW can do it somehow), is >> it really useful if no one is going to use it? >> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter, >> and I'm assuming the only real world use case is going to be using the hl stack, >> which means we're left with a lot of dead code that's not used/tested by anyone. >> >> Genuine question, wouldn't it be better if they only implement what's actually >> going to be used and tested by their customers? > > The general standard for this 'accel' hardware, both in DRM and RDMA > is to present an open source userspace. Companies are encouraged to > use that as their main interface but I suppose are free to carry the > cost of dual APIs, and the community's wrath if they want. I didn't mean they should maintain two interfaces. The question is whether they should implement libibverbs support that covers the cases used by their stack, or should they implement all "mandatory" verbs so they could be able to run libibverbs' examples/perftest/pyverbs as well, even though these will likely be the only apps covering these verbs. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-22 16:30 ` Gal Pressman @ 2020-09-22 16:52 ` Jason Gunthorpe 0 siblings, 0 replies; 56+ messages in thread From: Jason Gunthorpe @ 2020-09-22 16:52 UTC (permalink / raw) To: Gal Pressman Cc: Oded Gabbay, Greg Kroah-Hartman, izur, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Olof Johansson On Tue, Sep 22, 2020 at 07:30:32PM +0300, Gal Pressman wrote: > On 22/09/2020 19:14, Jason Gunthorpe wrote: > > On Tue, Sep 22, 2020 at 03:46:29PM +0300, Gal Pressman wrote: > > > >> I agree, that makes sense. > >> But assuming Oded actually goes and implements all the needed verbs to get a > >> basic functional libibverbs provider (assuming their HW can do it somehow), is > >> it really useful if no one is going to use it? > >> It doesn't sound like habanalabs want people to use GAUDI as an RDMA adapter, > >> and I'm assuming the only real world use case is going to be using the hl stack, > >> which means we're left with a lot of dead code that's not used/tested by anyone. > >> > >> Genuine question, wouldn't it be better if they only implement what's actually > >> going to be used and tested by their customers? > > > > The general standard for this 'accel' hardware, both in DRM and RDMA > > is to present an open source userspace. Companies are encouraged to > > use that as their main interface but I suppose are free to carry the > > cost of dual APIs, and the community's wrath if they want. > > I didn't mean they should maintain two interfaces. > The question is whether they should implement libibverbs support that covers the > cases used by their stack, or should they implement all "mandatory" verbs so > they could be able to run libibverbs' examples/perftest/pyverbs as well, even > though these will likely be the only apps covering these verbs. As I said, the minimum standard is an open source user space that will operate the NIC. For EFA we decided that was ibv_ud_pingpong, and now parts of pyverbs. A similar decision would be needed here too. It is a conversation that should start with a propsal from Oded. The *point* is to have the open userspace, so I really don't care what their proprietary universe does, and shrinking the opensource side becuase it is "redundant" is completely backwards to what we want to see. Jason ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver 2020-09-18 11:36 ` Gal Pressman 2020-09-18 11:52 ` Leon Romanovsky 2020-09-18 11:56 ` Jason Gunthorpe @ 2020-09-18 12:10 ` Oded Gabbay 2 siblings, 0 replies; 56+ messages in thread From: Oded Gabbay @ 2020-09-18 12:10 UTC (permalink / raw) To: izur Cc: Jason Gunthorpe, Jakub Kicinski, Linux-Kernel@Vger. Kernel. Org, netdev, SW_Drivers, Greg Kroah-Hartman, David S. Miller, Andrew Lunn, Florian Fainelli, linux-rdma, Gal Pressman, Leon Romanovsky On Fri, Sep 18, 2020 at 2:36 PM Gal Pressman <galpress@amazon.com> wrote: > > On 17/09/2020 20:18, Jason Gunthorpe wrote: > > On Tue, Sep 15, 2020 at 11:46:58PM +0300, Oded Gabbay wrote: > >> infrastructure for communication between multiple accelerators. Same > >> as Nvidia uses NVlink, we use RDMA that we have inside our ASIC. > >> The RDMA implementation we did does NOT support some basic RDMA > >> IBverbs (such as MR and PD) and therefore, we can't use the rdma-core > >> library or to connect to the rdma infrastructure in the kernel. > > > > You can't create a parallel RDMA subsystem in netdev, or in misc, and > > you can't add random device offloads as IOCTL to nedevs. > > > > RDMA is the proper home for all the networking offloads that don't fit > > into netdev. > > > > EFA was able to fit into rdma-core/etc and it isn't even RoCE at > > all. I'm sure this can too. > > Well, EFA wasn't welcomed to the RDMA subsystem with open arms ;), initially it > was suggested to go through the vfio subsystem instead. > > I think this comes back to the discussion we had when EFA was upstreamed, which > is what's the bar to get accepted to the RDMA subsystem. > IIRC, what we eventually agreed on is having a userspace rdma-core provider and > ibv_{ud,rc}_pingpong working (or just supporting one of the IB spec's QP types?). > > Does GAUDI fit these requirements? If not, should it be in a different subsystem > or should we open the "what qualifies as an RDMA device" question again? Hi Itay, Please see the above comments/questions. Thanks, Oded ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2020-09-22 16:52 UTC | newest]
Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20200915171022.10561-1-oded.gabbay@gmail.com>
2020-09-15 20:35 ` [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver Jakub Kicinski
2020-09-15 20:46 ` Oded Gabbay
2020-09-15 21:04 ` Jakub Kicinski
2020-09-15 21:20 ` Oded Gabbay
2020-09-15 21:37 ` Andrew Lunn
2020-09-15 21:43 ` Oded Gabbay
2020-09-15 22:35 ` David Miller
2020-09-15 22:36 ` David Miller
2020-09-15 22:34 ` David Miller
2020-09-16 4:26 ` Oded Gabbay
2020-09-17 17:18 ` Jason Gunthorpe
2020-09-18 11:36 ` Gal Pressman
2020-09-18 11:52 ` Leon Romanovsky
2020-09-18 11:56 ` Oded Gabbay
2020-09-18 12:03 ` Leon Romanovsky
2020-09-18 12:07 ` Oded Gabbay
2020-09-18 12:19 ` Leon Romanovsky
2020-09-18 12:31 ` Oded Gabbay
2020-09-18 13:09 ` Leon Romanovsky
2020-09-19 6:40 ` Greg Kroah-Hartman
2020-09-19 8:20 ` Leon Romanovsky
2020-09-19 8:30 ` Greg Kroah-Hartman
2020-09-19 8:58 ` Leon Romanovsky
2020-09-19 16:43 ` Oded Gabbay
2020-09-19 17:27 ` Greg Kroah-Hartman
2020-09-19 19:22 ` Jason Gunthorpe
2020-09-20 8:47 ` Greg Kroah-Hartman
2020-09-20 19:05 ` Oded Gabbay
2020-09-21 10:39 ` Leon Romanovsky
2020-09-21 11:52 ` Jason Gunthorpe
2020-09-21 21:20 ` Jakub Kicinski
2020-09-22 11:49 ` Jason Gunthorpe
2020-09-19 18:49 ` Andrew Lunn
2020-09-18 11:56 ` Jason Gunthorpe
2020-09-18 11:59 ` Oded Gabbay
2020-09-18 12:16 ` Jason Gunthorpe
2020-09-18 12:34 ` Oded Gabbay
2020-09-18 12:50 ` Jason Gunthorpe
2020-09-18 13:02 ` Oded Gabbay
2020-09-18 13:26 ` Jason Gunthorpe
2020-09-18 13:49 ` Oded Gabbay
2020-09-18 13:59 ` Jason Gunthorpe
2020-09-18 14:12 ` Oded Gabbay
2020-09-18 14:19 ` Jason Gunthorpe
2020-09-18 14:45 ` Oded Gabbay
2020-09-18 15:07 ` Jason Gunthorpe
2020-09-18 15:15 ` Oded Gabbay
2020-09-18 15:28 ` Jason Gunthorpe
2020-09-21 11:22 ` Gal Pressman
2020-09-21 11:49 ` Leon Romanovsky
2020-09-22 11:41 ` Jason Gunthorpe
2020-09-22 12:46 ` Gal Pressman
2020-09-22 16:14 ` Jason Gunthorpe
2020-09-22 16:30 ` Gal Pressman
2020-09-22 16:52 ` Jason Gunthorpe
2020-09-18 12:10 ` Oded Gabbay
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox