Linux RDMA and InfiniBand development


* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-30 18:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161130162353.GA24639-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>



On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>> Two cases I can think of are RDMA access to an NVMe device's controller
>> memory buffer,
> 
> I'm not sure on the use model there..

The NVMe fabrics stuff could probably make use of this. It's an
in-kernel system to allow remote access to an NVMe device over RDMA. So
they ought to be able to optimize their transfers by DMAing directly to
the NVMe's CMB -- no userspace interface would be required but there
would need to be some kernel infrastructure.

Logan

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-30 17:32 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Liran Liss, Tom Talpey, Doug Ledford, Steve Wise,
	'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', Marciniszyn, Mike, Dalessandro, Dennis,
	'Lijun Ou', 'Wei Hu(Xavier)', Latif, Faisal,
	Yishai Hadas, 'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373AB0BA190-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Wed, Nov 30, 2016 at 05:25:18PM +0000, Hefty, Sean wrote:
> > > - this doesn't change anything.  Also, AHs are already
> > > port-specific. So, I don't see any issue in this regard.
> > 
> > The current scheme infers the protocol of the AH from the current
> > configuration of the port, which is a crazy API when the port's
> > protocol can change on the fly.
> 
> Maybe the solution is to make the protocol selection explicit
> throughout the APIs and associate it with a QP, rather than
> attempting to list all transport protocols that a port can support.

AHs are linked to a PD, not a QP.

If we had to do it again, a PD centric approach would be more
sensible:

  // Create a PD on 'port' using ah format 'protocol'
  pd = ibv_pd_create(port, enum ah_type protocol);

  // Enable APM or resource sharing on the PD across two ports
  ibv_pd_add_port(pd, alt_port);

And get rid of the multi-port ib_device concept entirely.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply
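
Jason's PD-centric idea above can be modeled in a few lines of plain C. This is only a toy sketch under stated assumptions: `mock_pd_create`, `mock_pd_add_port`, `enum mock_ah_type`, the per-port bitmask, and the four-port limit are hypothetical names made up for illustration, and none of them exist in libibverbs or the kernel. The rule that a port may only join a PD whose AH format it supports is likewise an assumption, not something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical AH formats; not real libibverbs identifiers. */
enum mock_ah_type { MOCK_AH_IB, MOCK_AH_ROCE_V1, MOCK_AH_ROCE_V2, MOCK_AH_IWARP };

#define MOCK_PD_MAX_PORTS 4

/* A mock port advertises the AH formats it can carry as a bitmask. */
struct mock_port {
	int id;
	unsigned int ah_types; /* bit (1u << enum mock_ah_type) set if supported */
};

/* The PD owns the AH format; ports are merely attached to it. */
struct mock_pd {
	enum mock_ah_type ah_type;
	const struct mock_port *ports[MOCK_PD_MAX_PORTS];
	size_t nports;
};

static bool mock_port_supports(const struct mock_port *port, enum mock_ah_type t)
{
	return (port->ah_types & (1u << t)) != 0;
}

/* Attach a port; assumed rule: it must support the PD's AH format. */
static bool mock_pd_add_port(struct mock_pd *pd, const struct mock_port *port)
{
	assert(pd != NULL && port != NULL);
	if (pd->nports >= MOCK_PD_MAX_PORTS || !mock_port_supports(port, pd->ah_type))
		return false;
	pd->ports[pd->nports++] = port;
	return true;
}

/* Create a PD on 'port' with an explicit AH format; false on mismatch. */
static bool mock_pd_create(struct mock_pd *pd, const struct mock_port *port,
			   enum mock_ah_type ah_type)
{
	assert(pd != NULL && port != NULL);
	pd->ah_type = ah_type;
	pd->nports = 0;
	return mock_pd_add_port(pd, port);
}
```

In this model the AH format is chosen once, when the PD is created, so nothing has to be inferred from the port's current configuration. A RoCEv2 PD created on a port that speaks both RoCEv2 and iWARP could later take a second RoCEv2 port, while an IB-only port would be refused.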

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Hefty, Sean @ 2016-11-30 17:30 UTC (permalink / raw)
  To: Steve Wise, 'Jason Gunthorpe', 'Liran Liss'
  Cc: 'Tom Talpey', 'Doug Ledford',
	'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', Marciniszyn, Mike, Dalessandro, Dennis,
	'Lijun Ou', 'Wei Hu(Xavier)', Latif, Faisal,
	'Yishai Hadas', 'Selvin Xavier',
	'Devesh Sharma', 'Mitesh Ahuja',
	'Christian Benvenuti', 'Dave Goodell',
	'Moni Shoua', 'Or Gerlitz'
In-Reply-To: <01e501d24b2e$ff2a3260$fd7e9720$@opengridcomputing.com>

> > > > - this doesn't change anything.  Also, AHs are already
> > > > port-specific. So, I don't see any issue in this regard.
> > >
> > > The current scheme infers the protocol of the AH from the current
> > > configuration of the port, which is a crazy API when the port's
> > > protocol can change on the fly.
> >
> > Maybe the solution is to make the protocol selection explicit
> > throughout the APIs and associate it with a QP, rather than
> > attempting to list all transport protocols that a port can support.
> 
> Do you mean requiring the application to pick the protocol?

Yes - it seems necessary to support devices with RoCE and iWARP running on the same port.

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-30 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161130162353.GA24639-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>> Or do we need to extend the OOM killer to manage GPU pages?
> I don't know..
We could use send_sig_info() to send a signal from the kernel to user
space, so in theory the GPU driver could send SIGKILL to a process.

> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
>> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
>> of MMIO pfns, and ZONE_DEVICE allows that.
I do not think that using the DMA-API as it is (at least in its current
form) is the best solution:

- It deals with handles/fds for the whole allocation, but a client
  could (and will) use sub-allocations, and it is theoretically possible
  to "merge" several allocations into one from the GPU's perspective.
- It requires knowing what to export, but because "sharing" is
  controlled from user space, we would have to "export" every
  allocation by default.
- It deals with fds/handles, but a user application may work with
  addresses/pointers.

Also, the current DMA-API forces all of the DMA table programming to be
redone each time, regardless of whether the location has changed. With a
VMA / MMU we are able to install a notifier to intercept changes in
location and update the translation tables only as needed (we do not
need to keep the get_user_pages() lock).

^ permalink raw reply

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Steve Wise @ 2016-11-30 17:27 UTC (permalink / raw)
  To: 'Hefty, Sean', 'Jason Gunthorpe',
	'Liran Liss'
  Cc: 'Tom Talpey', 'Doug Ledford',
	'Leon Romanovsky', linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	'Steve Wise', 'Marciniszyn, Mike',
	'Dalessandro, Dennis', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Latif, Faisal',
	'Yishai Hadas', 'Selvin Xavier',
	'Devesh Sharma', 'Mitesh Ahuja',
	'Christian Benvenuti', 'Dave Goodell',
	'Moni Shoua', 'Or Gerlitz'
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373AB0BA190-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>

> 
> > > - this doesn't change anything.  Also, AHs are already
> > > port-specific. So, I don't see any issue in this regard.
> >
> > The current scheme infers the protocol of the AH from the current
> > configuration of the port, which is a crazy API when the port's
> > protocol can change on the fly.
> 
> Maybe the solution is to make the protocol selection explicit throughout the
> APIs and associate it with a QP, rather than attempting to list all transport
> protocols that a port can support.

Do you mean requiring the application to pick the protocol?


^ permalink raw reply

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Hefty, Sean @ 2016-11-30 17:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Liran Liss
  Cc: Tom Talpey, Doug Ledford, Steve Wise, 'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', Marciniszyn, Mike, Dalessandro, Dennis,
	'Lijun Ou', 'Wei Hu(Xavier)', Latif, Faisal,
	Yishai Hadas, 'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua, Or
In-Reply-To: <20161130170830.GA17512-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

> > - this doesn't change anything.  Also, AHs are already
> > port-specific. So, I don't see any issue in this regard.
> 
> The current scheme infers the protocol of the AH from the current
> configuration of the port, which is a crazy API when the port's
> protocol can change on the fly.

Maybe the solution is to make the protocol selection explicit throughout the APIs and associate it with a QP, rather than attempting to list all transport protocols that a port can support.

^ permalink raw reply

* RE: Enabling peer to peer device transactions for PCIe devices
From: Deucher, Alexander @ 2016-11-30 17:10 UTC (permalink / raw)
  To: 'Haggai Eran', Jason Gunthorpe
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Kuehling, Felix, Sagalovitch, Serguei,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Max Gurtovoy,
	Koenig, Christian,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <c0ddccf3-52ce-d883-a57a-70d8a1febf85-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> -----Original Message-----
> From: Haggai Eran [mailto:haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org]
> Sent: Wednesday, November 30, 2016 5:46 AM
> To: Jason Gunthorpe
> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-
> nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org; Koenig, Christian; Suthikulpanit, Suravee; Bridgman,
> John; Deucher, Alexander; Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org; logang-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org; dri-
> devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org; Max Gurtovoy; linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> Sagalovitch, Serguei; Blinzer, Paul; Kuehling, Felix; Sander, Ben
> Subject: Re: Enabling peer to peer device transactions for PCIe devices
> 
> On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> >>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> >>>> user-space and the GPU not to migrate it. If they do, the MR gets
> >>>> destroyed immediately.
> >>> That sounds horrible. How can that possibly work? What if the MR is
> >>> being used when the GPU decides to migrate?
> >> Naturally this doesn't support migration. The GPU is expected to pin
> >> these pages as long as the MR lives. The MR invalidation is done only as
> >> a last resort to keep system correctness.
> >
> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.
> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

Christian sent out an RFC patch a while back that extended the OOM killer to cover memory allocated for the GPU:
https://lists.freedesktop.org/archives/dri-devel/2015-September/089778.html

Alex

> 
> >
> >> I think it is similar to how non-ODP MRs rely on user-space today to
> >> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> >> non-ODP MR's pages, you can still get yourself into a data corruption
> >> situation (HCA sees one page and the process sees another for the same
> >> virtual address). The pinning that we use only guarantees the HCA's page
> >> won't be reused.
> >
> > That is not really data corruption - the data still goes where it was
> > originally destined. That is an application violating the
> > requirements of a MR.
> I guess it is a matter of terminology. If you compare it to the ODP case
> or the CPU case then you usually expect a single virtual address to map to
> a single physical page. Violating this causes some of your writes to be
> dropped, which is data corruption in my book, even if the application
> caused it.
> 
> > An application cannot munmap/mremap a VMA
> > while a non ODP MR points to it and then keep using the MR.
> Right. And it is perfectly fine to have some similar requirements from the
> application when doing peer to peer with a non-ODP MR.
> 
> > That is totally different from a GPU driver wanting to mess with
> > translation to physical pages.
> >
> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> >
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> >
> > What use case do you see for in kernel?
> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer, and O_DIRECT operations that access GPU memory.
> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if I
> understand correctly.
> 
> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
> support for peer to peer. I'm not sure we need vmap. We need a way to have a
> scatterlist of MMIO pfns, and ZONE_DEVICE allows that.
> 
> Haggai

^ permalink raw reply
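
The madvise(MADV_DONTNEED) case Haggai describes in the quoted text above can be seen from the process side with a small user-space program. This is a minimal sketch that assumes Linux and a private anonymous mapping; no RDMA or page pinning is involved, and the helper name `byte_after_dontneed` is made up for illustration. It only shows that after the call the process sees a fresh zero-filled page at the same virtual address, which is the half of the mismatch that does not depend on the HCA.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Fill one private anonymous page with 0xAB, drop it with MADV_DONTNEED,
 * and return the first byte the process sees afterwards (or -1 on error).
 * On Linux the next access faults in a new zero-filled page, so the old
 * contents are no longer visible through this virtual address.
 */
static int byte_after_dontneed(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int seen;

	assert(page_size > 0);
	buf = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Stand-in for data that a registered MR would have covered. */
	memset(buf, 0xAB, (size_t)page_size);

	if (madvise(buf, (size_t)page_size, MADV_DONTNEED) != 0) {
		munmap(buf, (size_t)page_size);
		return -1;
	}

	seen = buf[0]; /* read after the page was dropped and refaulted */
	munmap(buf, (size_t)page_size);
	return seen;
}
```

If a non-ODP MR had pinned the original page, the HCA would presumably keep using that old page while the process reads and writes the new one, which is the divergence being argued about.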

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-30 17:08 UTC (permalink / raw)
  To: Liran Liss
  Cc: Tom Talpey, Doug Ledford, Steve Wise, 'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', 'Mike Marciniszyn',
	'Dennis Dalessandro', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Faisal Latif', Yishai Hadas,
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua, Or Gerlitz <og>
In-Reply-To: <HE1PR0501MB28128CDB112558C5980CB638B18C0-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>

On Wed, Nov 30, 2016 at 05:01:32PM +0000, Liran Liss wrote:
> > From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]
> 
> > 
> > > Exactly. If/when such devices appear, we would need to extend
> > > connection management to specify the protocol, rather than infer it
> > > from the port space.
> > 
> > We support that perfectly today as long as the port creates two 'struct
> > ib_devices'. Anything else will require some kind of changes to libibverb's API to
> > specify the AH style.
> > 
> 
> rdmacm would still have to choose between these ib_devices somehow

Each ib_device is either iWARP or RoCE; the RDMA CM would route iWARP
stuff to the iWARP one and RoCE stuff to the RoCE one. Not really a
problem with today's architecture.

> - this doesn't change anything.  Also, AHs are already
> port-specific. So, I don't see any issue in this regard.

The current scheme infers the protocol of the AH from the current
configuration of the port, which is a crazy API when the port's
protocol can change on the fly.

> In any case, we have millions of multi-port devices that can use
> different link types deployed.  This is the specification, and more
> such devices could appear in the future.  We cannot change the
> device model.

Of course we can change how they are modeled in Linux, it is just
software.

> No doubt that the new ABI would be a lot more flexible and
> self-describing.  But it would take a while until we port everything
> to use it.  So, generally, I don't see any problem using the current
> extensibility capabilities to support useful semantics.

Perhaps a moratorium on some changes to the current uAPI will
encourage the new one to get finished :P

Jason

^ permalink raw reply

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Liran Liss @ 2016-11-30 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tom Talpey, Doug Ledford, Steve Wise, 'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', 'Mike Marciniszyn',
	'Dennis Dalessandro', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Faisal Latif', Yishai Hadas,
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua, Or
In-Reply-To: <20161130163949.GC24639-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

> From: Jason Gunthorpe [mailto:jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org]

> 
> > Exactly. If/when such devices appear, we would need to extend
> > connection management to specify the protocol, rather than infer it
> > from the port space.
> 
> We support that perfectly today as long as the port creates two 'struct
> ib_devices'. Anything else will require some kind of changes to libibverb's API to
> specify the AH style.
> 

rdmacm would still have to choose between these ib_devices somehow - this doesn't change anything.
Also, AHs are already port-specific. So, I don't see any issue in this regard.

In any case, we have millions of multi-port devices that can use different link types deployed.
This is the specification, and more such devices could appear in the future.
We cannot change the device model.

> > Rethinking about the uAPI, maybe we should report a protocol bit-mask
> > similar to the kernel's, instead of QP types?  This would provide all
> > the required information (e.g., any combination of RoCEv1/v2, iWARP,
> > and Raw Ethernet for Ethernet links) for today's use-cases as well as
> > tomorrow's combined RoCE/iWARP devices.
> 
> Maybe we should dump this uapi stuff until Matan's patches are done. The
> introspection possible with Matan's work is flexible enough to cope with more
> cases..

No doubt that the new ABI would be a lot more flexible and self-describing.
But it would take a while until we port everything to use it.
So, generally, I don't see any problem using the current extensibility capabilities to support useful semantics.

> 
> Jason

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Or Gerlitz @ 2016-11-30 16:59 UTC (permalink / raw)
  To: Liran Liss, Doug Ledford
  Cc: Jason Gunthorpe, Tom Talpey, Steve Wise,
	'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', 'Mike Marciniszyn',
	'Dennis Dalessandro', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Faisal Latif', Yishai Hadas,
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua
In-Reply-To: <20161130163949.GC24639-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On 11/30/2016 6:39 PM, Jason Gunthorpe wrote:
> Maybe we should dump this uapi stuff until Matan's patches are
> done. The introspection possible with Matan's work is flexible enough
> to cope with more cases..

Basically I am OK with that approach too.

Doug, if you are willing to take the mlx5 patches that enable an mlx5
device over an Ethernet port that doesn't support RoCE (8, 9, 10 - I will
have to do some rebasing) and to discuss the query once the new ABI code
is closer to being upstream, that's fine too.

Or.


^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-30 16:39 UTC (permalink / raw)
  To: Liran Liss
  Cc: Tom Talpey, Doug Ledford, Steve Wise, 'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', 'Mike Marciniszyn',
	'Dennis Dalessandro', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Faisal Latif', Yishai Hadas,
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua, Or Gerlitz <og>
In-Reply-To: <HE1PR0501MB28124286F8D902C49596EF85B18C0-692Kmc8YnlIVrnpjwTCbp8DSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>

On Wed, Nov 30, 2016 at 04:30:09PM +0000, Liran Liss wrote:
> > I'd love to see any such device support protocol choice per connection, not just
> > per port. That of course would have implications on the rdma connection
> > manager api.
 
> Exactly. If/when such devices appear, we would need to extend
> connection management to specify the protocol, rather than infer it
> from the port space.

We support that perfectly today as long as the port creates two 'struct
ib_devices'. Anything else will require some kind of changes to
libibverbs' API to specify the AH style.

> Rethinking about the uAPI, maybe we should report a protocol
> bit-mask similar to the kernel's, instead of QP types?  This would
> provide all the required information (e.g., any combination of
> RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's
> use-cases as well as tomorrow's combined RoCE/iWARP devices.

Maybe we should dump this uapi stuff until Matan's patches are
done. The introspection possible with Matan's work is flexible enough
to cope with more cases..

Jason

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-30 16:36 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Steve Wise, 'Leon Romanovsky',
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Steve Wise',
	'Mike Marciniszyn', 'Dennis Dalessandro',
	'Lijun Ou', 'Wei Hu(Xavier)',
	'Faisal Latif', 'Yishai Hadas',
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', 'Moni Shoua',
	'Or Gerlitz'
In-Reply-To: <cfdf28c6-4715-28d4-7da6-453fb6794c29-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Tue, Nov 29, 2016 at 09:07:52PM -0500, Doug Ledford wrote:
> On 11/28/2016 12:08 PM, Steve Wise wrote:
> >>
> >> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> >>
> >>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device,
> >>> +					      u8 port_num)
> >>> +{
> >>> +	return device->port_immutable[port_num].core_cap_flags &
> >>> +		RDMA_CORE_CAP_PROT_RAW_PACKET;
> >>> +}
> >>
> >> Do the mlx drivers really register ports with different capabilities
> >> as the same ib_device? I'm not sure that should be allowed.
> >>
> >> I keep talking about how we need to get rid of the port_num in these
> >> sorts of places because it makes no sense...
> >>
> > 
> > I agree.   Requiring the port number has implications that ripple up into the
> > rdma-rw api as well...
> >  
> > 
> 
> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.

There actually is a requirement. The RDMA CM hard requires all ports
be iWARP or !iWARP at least. I'm sure there are other subtle things
floating around.

There are also things that become very confusing for user space, and
we don't have the infrastructure to support, if ports can switch
configurations on the fly.

The simplest approach, and the one most in line with how verbs was
designed, is to require each ib_device to have a single kind of AH.

> The idea that the parent device defined the supported protocols for
> all ports of a device became wrong with the first mlx4 device that

Arguably it was sort of OK for RoCEv1, is less OK for v2, but
shouldn't have been done anyhow.

The uapi question here is do we want to double down and try and make
this work (and what does that even *mean*) or admit mlx4 was an error
and stop doing that going forward..

Or do something else? eg Specifying the AH type when creating the PD
could potentially solve some of the problems...

> could do both IB and Ethernet.  And I think I've heard rumblings of
> a combined RoCE/iWARP device possibly in the future from someone
> else.

Two struct ib_devices for the same port then... Certainly ugly.

Jason

^ permalink raw reply
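
The two constraints Jason mentions above (every port of an ib_device having the same kind of AH, and the RDMA CM needing all ports to be iWARP or all !iWARP) can be expressed as simple checks over per-port capability bitmasks. The following is a user-space sketch only: `struct mock_device`, the `MOCK_PROT_*` bits and their values, and the helper names are hypothetical and do not match the kernel's `port_immutable` layout or its `RDMA_CORE_CAP_PROT_*` values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mock protocol bits; names and values are illustrative, not the kernel's. */
enum {
	MOCK_PROT_IB         = 1u << 0,
	MOCK_PROT_ROCE       = 1u << 1,
	MOCK_PROT_IWARP      = 1u << 2,
	MOCK_PROT_RAW_PACKET = 1u << 3,
};

struct mock_device {
	size_t nports;
	const unsigned int *port_caps; /* one protocol bitmask per port */
};

/* Per-port test in the style of rdma_protocol_raw_packet() from the patch. */
static bool mock_port_has(const struct mock_device *dev, size_t port,
			  unsigned int prot)
{
	assert(port < dev->nports);
	return (dev->port_caps[port] & prot) != 0;
}

/* True if every port advertises exactly the same protocol set as port 0. */
static bool mock_device_is_uniform(const struct mock_device *dev)
{
	for (size_t i = 1; i < dev->nports; i++)
		if (dev->port_caps[i] != dev->port_caps[0])
			return false;
	return true;
}

/* Weaker check: all ports are iWARP, or none of them are. */
static bool mock_device_iwarp_consistent(const struct mock_device *dev)
{
	for (size_t i = 1; i < dev->nports; i++)
		if (mock_port_has(dev, i, MOCK_PROT_IWARP) !=
		    mock_port_has(dev, 0, MOCK_PROT_IWARP))
			return false;
	return true;
}
```

A dual-port device with one IB port and one RoCE port would pass the iWARP check but fail the uniformity check, which is roughly the mlx4 situation being debated. A device that mixed a RoCE port with an iWARP port would fail both.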

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Liran Liss @ 2016-11-30 16:30 UTC (permalink / raw)
  To: Tom Talpey, Doug Ledford, Steve Wise, 'Jason Gunthorpe',
	'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', 'Mike Marciniszyn',
	'Dennis Dalessandro', 'Lijun Ou',
	'Wei Hu(Xavier)', 'Faisal Latif', Yishai Hadas,
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', Moni Shoua, Or Gerlitz
In-Reply-To: <5927e04b-42ec-52c1-88a3-456cc4409334-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>

> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Tom Talpey

> >
> > In all fairness, there is no requirement that any two ports on the
> > same device be the same link layer, or if the link layer is Ethernet,
> > there is no requirement that they can't support both iWARP and RoCE.
> > The idea that the parent device defined the supported protocols for
> > all ports of a device became wrong with the first mlx4 device that
> > could do both IB and Ethernet.  And I think I've heard rumblings of a
> > combined RoCE/iWARP device possibly in the future from someone else.
> 
> This one for instance?
> 
> http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf
> 
> I'd love to see any such device support protocol choice per connection, not just
> per port. That of course would have implications on the rdma connection
> manager api.
> 

Exactly. If/when such devices appear, we would need to extend connection management to specify the protocol, rather than infer it from the port space.
It would be perfectly sensible to use both RoCE and iWARP over the same physical Ethernet port and the same source IP address.

Rethinking about the uAPI, maybe we should report a protocol bit-mask similar to the kernel's, instead of QP types?
This would provide all the required information (e.g., any combination of RoCEv1/v2, iWARP, and Raw Ethernet for Ethernet links) for today's use-cases as well as tomorrow's combined RoCE/iWARP devices.
--Liran


^ permalink raw reply

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Hefty, Sean @ 2016-11-30 16:29 UTC (permalink / raw)
  To: Doug Ledford, Steve Wise, 'Jason Gunthorpe',
	'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	'Steve Wise', Marciniszyn, Mike, Dalessandro, Dennis,
	'Lijun Ou', 'Wei Hu(Xavier)', Latif, Faisal,
	'Yishai Hadas', 'Selvin Xavier',
	'Devesh Sharma', 'Mitesh Ahuja',
	'Christian Benvenuti', 'Dave Goodell',
	'Moni Shoua', 'Or Gerlitz'
In-Reply-To: <cfdf28c6-4715-28d4-7da6-453fb6794c29-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.  The idea
> that the parent device defined the supported protocols for all ports of
> a device became wrong with the first mlx4 device that could do both IB
> and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
> device possibly in the future from someone else.

It would help if the community didn't continually redefine terms based on the latest set of patches or whims or random hardware feature.  At one time an ib_device meant an actual IB device - go figure.  Now it's not even a device, but some abstract, weird collection of ports that all support the same transport, or was it link layer, or ... I really have no idea now.  The RDMA subsystem really needs to figure out what it wants to be, because even the term RDMA doesn't apply to all of the devices that it supports.  And now we're at the point of arguing over where drivers should go because no one even knows that anymore.

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-30 16:23 UTC (permalink / raw)
  To: Haggai Eran
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <c0ddccf3-52ce-d883-a57a-70d8a1febf85-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:

> > That just forces applications to handle horrible unexpected
> > failures. If this sort of thing is needed for correctness then OOM
> > kill the offending process, don't corrupt its operation.

> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
> Or do we need to extend the OOM killer to manage GPU pages?

I don't know..

> >>> From what I understand we are not really talking about kernel p2p,
> >>> everything proposed so far is being mediated by a userspace VMA, so
> >>> I'd focus on making that work.
> > 
> >> Fair enough, although we will need both eventually, and I hope the
> >> infrastructure can be shared to some degree.
> > 
> > What use case do you see for in kernel?

> Two cases I can think of are RDMA access to an NVMe device's controller
> memory buffer,

I'm not sure on the use model there..

> and O_DIRECT operations that access GPU memory.

This goes through user space so there is still a VMA..

> Also, HMM's migration between two GPUs could use peer to peer in the
> kernel, although that is intended to be handled by the GPU driver if
> I understand correctly.

Hum, presumably these migrations are VMA backed as well...

> > Presumably in-kernel could use a vmap or something and the same basic
> > flow?
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
> for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
> of MMIO pfns, and ZONE_DEVICE allows that.

Well, if there is no virtual map then we are back to how do you do
migrations and other things people seem to want to do on these
pages. Maybe the loose 'struct page' flow is not for those users.

But I think if you want kGPU or similar then you probably need vmaps
or something similar to represent the GPU pages in kernel memory.

Jason

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Doug Ledford @ 2016-11-30 16:18 UTC (permalink / raw)
  To: Tom Talpey, Steve Wise, 'Jason Gunthorpe',
	'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Steve Wise',
	'Mike Marciniszyn', 'Dennis Dalessandro',
	'Lijun Ou', 'Wei Hu(Xavier)',
	'Faisal Latif', 'Yishai Hadas',
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', 'Moni Shoua',
	'Or Gerlitz'
In-Reply-To: <5927e04b-42ec-52c1-88a3-456cc4409334-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>



On 11/29/2016 9:33 PM, Tom Talpey wrote:
> On 11/29/2016 9:07 PM, Doug Ledford wrote:
>> On 11/28/2016 12:08 PM, Steve Wise wrote:
>>>>
>>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>>>
>>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
>>>>> +{
>>>>> +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
>>>>> +}
>>>>
>>>> Do the mlx drivers really register ports with different capabilities
>>>> as the same ib_device? I'm not sure that should be allowed.
>>>>
>>>> I keep talking about how we need to get rid of the port_num in these
>>>> sorts of places because it makes no sense...
>>>>
>>>
>>> I agree.   Requiring the port number has implications that ripple up into the
>>> rdma-rw api as well...
>>>
>>>
>>
>> In all fairness, there is no requirement that any two ports on the same
>> device be the same link layer, or if the link layer is Ethernet, there
>> is no requirement that they can't support both iWARP and RoCE.  The idea
>> that the parent device defined the supported protocols for all ports of
>> a device became wrong with the first mlx4 device that could do both IB
>> and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
>> device possibly in the future from someone else.
> 
> This one for instance?
> 
> http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf
> 
> 
> I'd love to see any such device support protocol choice per
> connection, not just per port. That of course would have
> implications on the rdma connection manager api.
> 

That's certainly a prime example, thanks Tom ;-)

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD



^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Haggai Eran @ 2016-11-30 10:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161128190244.GA21975-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
>>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>>> destroyed immediately.
>>> That sounds horrible. How can that possibly work? What if the MR is
>>> being used when the GPU decides to migrate? 
>> Naturally this doesn't support migration. The GPU is expected to pin
>> these pages as long as the MR lives. The MR invalidation is done only as
>> a last resort to keep system correctness.
> 
> That just forces applications to handle horrible unexpected
> failures. If this sort of thing is needed for correctness then OOM
> kill the offending process, don't corrupt its operation.
Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

> 
>> I think it is similar to how non-ODP MRs rely on user-space today to
>> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
>> non-ODP MR's pages, you can still get yourself into a data corruption
>> situation (HCA sees one page and the process sees another for the same
>> virtual address). The pinning that we use only guarantees the HCA's page
>> won't be reused.
> 
> That is not really data corruption - the data still goes where it was
> originally destined. That is an application violating the
> requirements of a MR. 
I guess it is a matter of terminology. If you compare it to the ODP case 
or the CPU case then you usually expect a single virtual address to map to
a single physical page. Violating this causes some of your writes to be dropped,
which is data corruption in my book, even if the application caused it.
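
The process-side half of that divergence can be reproduced from user space. The following is a minimal sketch, assuming Linux semantics for a private anonymous mapping; the function name is hypothetical and no RDMA device is involved. After MADV_DONTNEED the process reads a fresh zero page at the same virtual address, while a page pinned earlier for a non-ODP MR would still hold the old contents.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for MAP_ANONYMOUS and madvise() under strict -std modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill one anonymous page with 0xAB, drop it with
 * MADV_DONTNEED, and return the first byte the process sees afterwards.
 * Returns -1 on any error.
 */
static int demo_byte_after_dontneed(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int seen;

	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* This is the data a device would see if it had pinned the page now. */
	memset(p, 0xAB, (size_t)page);

	/* The kernel may discard the page; the next touch faults in a new one. */
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	seen = p[0];	/* read through the same virtual address */
	munmap(p, (size_t)page);
	return seen;
}
```

On Linux the returned byte is zero rather than 0xAB, which is the "process sees another page" half of the situation described above.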

> An application cannot munmap/mremap a VMA
> while a non ODP MR points to it and then keep using the MR.
Right. And it is perfectly fine to have some similar requirements from the application
when doing peer to peer with a non-ODP MR. 

> That is totally different from a GPU driver wanting to mess with
> translation to physical pages.
> 
>>> From what I understand we are not really talking about kernel p2p,
>>> everything proposed so far is being mediated by a userspace VMA, so
>>> I'd focus on making that work.
> 
>> Fair enough, although we will need both eventually, and I hope the
>> infrastructure can be shared to some degree.
> 
> What use case do you see for in kernel?
Two cases I can think of are RDMA access to an NVMe device's controller 
memory buffer, and O_DIRECT operations that access GPU memory.
Also, HMM's migration between two GPUs could use peer to peer in the kernel,
although that is intended to be handled by the GPU driver if I understand
correctly.

> Presumably in-kernel could use a vmap or something and the same basic
> flow?
I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.

Haggai

^ permalink raw reply

* Re: [PATCH v3] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Tariq Toukan @ 2016-11-30  8:44 UTC (permalink / raw)
  To: Souptick Joarder, sergei.shtylyov, yishaih
  Cc: netdev, linux-rdma, sahu.rameshwar73
In-Reply-To: <20161129194611.GA4088@jordon-HP-15-Notebook-PC>

Hi Souptick,

Thanks for your patch.

On 29/11/2016 9:46 PM, Souptick Joarder wrote:
> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
>
> Signed-off-by: Souptick joarder <jrdr.linux@gmail.com>
> ---
> v3:
>    - Fixed alignment issues
As mentioned already, you mean 'Remove empty line'.
>
> v2:
>    - Address comment from sergei
>      Alignment was not proper
>
>   drivers/net/ethernet/mellanox/mlx4/cmd.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index e36bebc..a49072b4 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -2679,15 +2679,13 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
>   	if (!mailbox)
>   		return ERR_PTR(-ENOMEM);
>
> -	mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> -				      &mailbox->dma);
> +	mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> +				       &mailbox->dma);
>   	if (!mailbox->buf) {
>   		kfree(mailbox);
>   		return ERR_PTR(-ENOMEM);
>   	}
>
> -	memset(mailbox->buf, 0, MLX4_MAILBOX_SIZE);
> -
>   	return mailbox;
>   }
>   EXPORT_SYMBOL_GPL(mlx4_alloc_cmd_mailbox);
> --
> 1.9.1
>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

Thanks,
Tariq

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Tom Talpey @ 2016-11-30  2:33 UTC (permalink / raw)
  To: Doug Ledford, Steve Wise, 'Jason Gunthorpe',
	'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Steve Wise',
	'Mike Marciniszyn', 'Dennis Dalessandro',
	'Lijun Ou', 'Wei Hu(Xavier)',
	'Faisal Latif', 'Yishai Hadas',
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', 'Moni Shoua',
	'Or Gerlitz'
In-Reply-To: <cfdf28c6-4715-28d4-7da6-453fb6794c29-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On 11/29/2016 9:07 PM, Doug Ledford wrote:
> On 11/28/2016 12:08 PM, Steve Wise wrote:
>>>
>>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>>
>>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
>>>> +{
>>>> +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
>>>> +}
>>>
>>> Do the mlx drivers really register ports with different capabilities
>>> as the same ib_device? I'm not sure that should be allowed.
>>>
>>> I keep talking about how we need to get rid of the port_num in these
>>> sorts of places because it makes no sense...
>>>
>>
>> I agree.   Requiring the port number has implications that ripple up into the
>> rdma-rw api as well...
>>
>>
>
> In all fairness, there is no requirement that any two ports on the same
> device be the same link layer, or if the link layer is Ethernet, there
> is no requirement that they can't support both iWARP and RoCE.  The idea
> that the parent device defined the supported protocols for all ports of
> a device became wrong with the first mlx4 device that could do both IB
> and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
> device possibly in the future from someone else.

This one for instance?

http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE45211HL_QLE45212HL.pdf

I'd love to see any such device support protocol choice per
connection, not just per port. That of course would have
implications on the rdma connection manager api.


^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Doug Ledford @ 2016-11-30  2:07 UTC (permalink / raw)
  To: Steve Wise, 'Jason Gunthorpe', 'Leon Romanovsky'
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Steve Wise',
	'Mike Marciniszyn', 'Dennis Dalessandro',
	'Lijun Ou', 'Wei Hu(Xavier)',
	'Faisal Latif', 'Yishai Hadas',
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', 'Moni Shoua',
	'Or Gerlitz'
In-Reply-To: <05bd01d2499a$1c75f750$5561e5f0$@opengridcomputing.com>



On 11/28/2016 12:08 PM, Steve Wise wrote:
>>
>> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>>
>>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
>>> +{
>>> +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
>>> +}
>>
>> Do the mlx drivers really register ports with different capabilities
>> as the same ib_device? I'm not sure that should be allowed.
>>
>> I keep talking about how we need to get rid of the port_num in these
>> sorts of places because it makes no sense...
>>
> 
> I agree.   Requiring the port number has implications that ripple up into the
> rdma-rw api as well...
>  
> 

In all fairness, there is no requirement that any two ports on the same
device be the same link layer, or if the link layer is Ethernet, there
is no requirement that they can't support both iWARP and RoCE.  The idea
that the parent device defined the supported protocols for all ports of
a device became wrong with the first mlx4 device that could do both IB
and Ethernet.  And I think I've heard rumblings of a combined RoCE/iWARP
device possibly in the future from someone else.
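
To make the per-port point concrete, here is a small user-space sketch. It is not the kernel's ib_device; the struct, the flag values and the helper names are all hypothetical, and ports are assumed to be numbered from 1. It models one device whose two ports carry different protocols, so a capability query only has a well-defined answer when it is given a port number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits; the values are illustrative only. */
#define CAP_PROT_IB         (1u << 0)
#define CAP_PROT_ROCE       (1u << 1)
#define CAP_PROT_IWARP      (1u << 2)
#define CAP_PROT_RAW_PACKET (1u << 3)

#define MAX_PORTS 2

/* Simplified stand-in for a device: capabilities are stored per port. */
struct demo_device {
	uint32_t port_caps[MAX_PORTS + 1];	/* index 0 unused, ports start at 1 */
};

/* Per-port query, analogous in shape to rdma_protocol_raw_packet(). */
static bool demo_port_has(const struct demo_device *dev, uint8_t port_num,
			  uint32_t cap)
{
	assert(port_num >= 1 && port_num <= MAX_PORTS);
	return (dev->port_caps[port_num] & cap) != 0;
}

/* A device-wide answer only exists when every port agrees. */
static bool demo_all_ports_have(const struct demo_device *dev, uint32_t cap)
{
	for (uint8_t p = 1; p <= MAX_PORTS; p++)
		if (!demo_port_has(dev, p, cap))
			return false;
	return true;
}

/* Port 1 runs IB, port 2 runs RoCE plus raw packet. */
static struct demo_device demo_mixed_device(void)
{
	struct demo_device dev = {
		.port_caps = { 0, CAP_PROT_IB,
			       CAP_PROT_ROCE | CAP_PROT_RAW_PACKET },
	};
	return dev;
}
```

With the mixed device, demo_port_has() answers differently for port 1 and port 2, and demo_all_ports_have() is false for every protocol bit, which is why dropping port_num from such helpers would lose information on this kind of hardware.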

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG Key ID: 0E572FDD



^ permalink raw reply

* [PATCH for-next 6/6] IB/hns: Fix the IB device name
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

This patch fixes the name of the IB device so that it
matches libhns.

Signed-off-by: Lijun Ou <oulijun-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 28a8f24..eddb053 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -433,7 +433,7 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
 	spin_lock_init(&iboe->lock);
 
 	ib_dev = &hr_dev->ib_dev;
-	strlcpy(ib_dev->name, "hisi_%d", IB_DEVICE_NAME_MAX);
+	strlcpy(ib_dev->name, "hns_%d", IB_DEVICE_NAME_MAX);
 
 	ib_dev->owner			= THIS_MODULE;
 	ib_dev->node_type		= RDMA_NODE_IB_CA;
-- 
1.7.9.5



^ permalink raw reply related

* [PATCH for-next 5/6] IB/hns: Fix the bug when free cq
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford
  Cc: salil.mehta, xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk,
	lijun_nudt, linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>

From: Shaobo Xu <xushaobo2@huawei.com>

If the resources of a CQ are freed while a user case is executing, the
hardware cannot be notified on the hip06 SoC. The hardware will then hang
when it writes to the CQ buffer that has been released.

In order to solve this problem, the RoCE driver checks the CQE counter and
ensures that the outstanding CQEs have been written. Then the CQ buffer
can be released.
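
The waiting scheme described above can be sketched outside the driver as a bounded poll. This is only an illustration under stated assumptions: the register snapshot struct, the callback type and the fake register source are hypothetical stand-ins for the roce_read() calls in the patch, and the sleep between polls is left as a comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the two hardware registers the loop reads. */
struct demo_regs {
	bool     wcmd_empty;	/* CQE write-command queue is drained */
	uint32_t wr_cqe_cnt;	/* free-running count of CQEs written */
};

typedef struct demo_regs (*demo_read_fn)(void *ctx);

/*
 * Poll until the hardware is idle or has written at least min_cqe more
 * CQEs since the first read. Returns 0 on success and -1 once max_polls
 * polls have all failed.
 */
static int demo_wait_cqe_drain(demo_read_fn read_regs, void *ctx,
			       uint32_t min_cqe, int max_polls)
{
	uint32_t start = read_regs(ctx).wr_cqe_cnt;

	for (int i = 0; i < max_polls; i++) {
		struct demo_regs r = read_regs(ctx);

		if (r.wcmd_empty)
			return 0;

		/* Unsigned subtraction stays correct across counter wraparound. */
		if ((uint32_t)(r.wr_cqe_cnt - start) >= min_cqe)
			return 0;

		/* A driver would sleep here between polls. */
	}

	return -1;
}

/* Fake register source: the counter advances by a fixed step per read. */
struct demo_fake {
	uint32_t cnt;
	uint32_t step;
};

static struct demo_regs demo_fake_read(void *ctx)
{
	struct demo_fake *f = ctx;
	struct demo_regs r = { .wcmd_empty = false, .wr_cqe_cnt = f->cnt };

	f->cnt += f->step;
	return r;
}
```

Bounding the loop keeps a stuck counter from blocking the caller forever, and comparing the unsigned difference rather than the raw values means a counter that wraps past zero during the wait is still handled.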

Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_common.h |    2 +
 drivers/infiniband/hw/hns/hns_roce_cq.c     |   27 ++++++++------
 drivers/infiniband/hw/hns/hns_roce_device.h |    8 ++++
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |   53 +++++++++++++++++++++++++++
 4 files changed, 79 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_common.h b/drivers/infiniband/hw/hns/hns_roce_common.h
index a055632..4af403e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_common.h
+++ b/drivers/infiniband/hw/hns/hns_roce_common.h
@@ -354,6 +354,8 @@
 
 #define ROCEE_SDB_ISSUE_PTR_REG			0x758
 #define ROCEE_SDB_SEND_PTR_REG			0x75C
+#define ROCEE_CAEP_CQE_WCMD_EMPTY		0x850
+#define ROCEE_SCAEP_WR_CQE_CNT			0x8D0
 #define ROCEE_SDB_INV_CNT_REG			0x9A4
 #define ROCEE_SDB_RETRY_CNT_REG			0x9AC
 #define ROCEE_TSP_BP_ST_REG			0x9EC
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index c9f6c3d..ff9a6a3 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -179,8 +179,7 @@ static int hns_roce_hw2sw_cq(struct hns_roce_dev *dev,
 				 HNS_ROCE_CMD_TIMEOUT_MSECS);
 }
 
-static void hns_roce_free_cq(struct hns_roce_dev *hr_dev,
-			     struct hns_roce_cq *hr_cq)
+void hns_roce_free_cq(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq)
 {
 	struct hns_roce_cq_table *cq_table = &hr_dev->cq_table;
 	struct device *dev = &hr_dev->pdev->dev;
@@ -392,19 +391,25 @@ int hns_roce_ib_destroy_cq(struct ib_cq *ib_cq)
 {
 	struct hns_roce_dev *hr_dev = to_hr_dev(ib_cq->device);
 	struct hns_roce_cq *hr_cq = to_hr_cq(ib_cq);
+	int ret = 0;
 
-	hns_roce_free_cq(hr_dev, hr_cq);
-	hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
+	if (hr_dev->hw->destroy_cq) {
+		ret = hr_dev->hw->destroy_cq(ib_cq);
+	} else {
+		hns_roce_free_cq(hr_dev, hr_cq);
+		hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
 
-	if (ib_cq->uobject)
-		ib_umem_release(hr_cq->umem);
-	else
-		/* Free the buff of stored cq */
-		hns_roce_ib_free_cq_buf(hr_dev, &hr_cq->hr_buf, ib_cq->cqe);
+		if (ib_cq->uobject)
+			ib_umem_release(hr_cq->umem);
+		else
+			/* Free the buff of stored cq */
+			hns_roce_ib_free_cq_buf(hr_dev, &hr_cq->hr_buf,
+						ib_cq->cqe);
 
-	kfree(hr_cq);
+		kfree(hr_cq);
+	}
 
-	return 0;
+	return ret;
 }
 
 void hns_roce_cq_completion(struct hns_roce_dev *hr_dev, u32 cqn)
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index 1050829..d4f0fce 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -56,6 +56,12 @@
 #define HNS_ROCE_MAX_INNER_MTPT_NUM		0x7
 #define HNS_ROCE_MAX_MTPT_PBL_NUM		0x100000
 
+#define HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS	20
+#define HNS_ROCE_MAX_FREE_CQ_WAIT_CNT	\
+	(5000 / HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS)
+#define HNS_ROCE_CQE_WCMD_EMPTY_BIT		0x2
+#define HNS_ROCE_MIN_CQE_CNT			16
+
 #define HNS_ROCE_MAX_IRQ_NUM			34
 
 #define HNS_ROCE_COMP_VEC_NUM			32
@@ -528,6 +534,7 @@ struct hns_roce_hw {
 	int (*req_notify_cq)(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
 	int (*poll_cq)(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
 	int (*dereg_mr)(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr);
+	int (*destroy_cq)(struct ib_cq *ibcq);
 	void	*priv;
 };
 
@@ -734,6 +741,7 @@ struct ib_cq *hns_roce_ib_create_cq(struct ib_device *ib_dev,
 				    struct ib_udata *udata);
 
 int hns_roce_ib_destroy_cq(struct ib_cq *ib_cq);
+void hns_roce_free_cq(struct hns_roce_dev *hr_dev, struct hns_roce_cq *hr_cq);
 
 void hns_roce_cq_completion(struct hns_roce_dev *hr_dev, u32 cqn);
 void hns_roce_cq_event(struct hns_roce_dev *hr_dev, u32 cqn, int event_type);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index f67a3bf..b8111b0 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -3763,6 +3763,58 @@ int hns_roce_v1_destroy_qp(struct ib_qp *ibqp)
 	return 0;
 }
 
+int hns_roce_v1_destroy_cq(struct ib_cq *ibcq)
+{
+	struct hns_roce_dev *hr_dev = to_hr_dev(ibcq->device);
+	struct hns_roce_cq *hr_cq = to_hr_cq(ibcq);
+	struct device *dev = &hr_dev->pdev->dev;
+	u32 cqe_cnt_ori;
+	u32 cqe_cnt_cur;
+	u32 cq_buf_size;
+	int wait_time = 0;
+	int ret = 0;
+
+	hns_roce_free_cq(hr_dev, hr_cq);
+
+	/*
+	 * Before freeing cq buffer, we need to ensure that the outstanding CQE
+	 * have been written by checking the CQE counter.
+	 */
+	cqe_cnt_ori = roce_read(hr_dev, ROCEE_SCAEP_WR_CQE_CNT);
+	while (1) {
+		if (roce_read(hr_dev, ROCEE_CAEP_CQE_WCMD_EMPTY) &
+		    HNS_ROCE_CQE_WCMD_EMPTY_BIT)
+			break;
+
+		cqe_cnt_cur = roce_read(hr_dev, ROCEE_SCAEP_WR_CQE_CNT);
+		if ((cqe_cnt_cur - cqe_cnt_ori) >= HNS_ROCE_MIN_CQE_CNT)
+			break;
+
+		msleep(HNS_ROCE_EACH_FREE_CQ_WAIT_MSECS);
+		if (wait_time > HNS_ROCE_MAX_FREE_CQ_WAIT_CNT) {
+			dev_warn(dev, "Destroy cq 0x%lx timeout!\n",
+				hr_cq->cqn);
+			ret = -ETIMEDOUT;
+			break;
+		}
+		wait_time++;
+	}
+
+	hns_roce_mtt_cleanup(hr_dev, &hr_cq->hr_buf.hr_mtt);
+
+	if (ibcq->uobject)
+		ib_umem_release(hr_cq->umem);
+	else {
+		/* Free the buff of stored cq */
+		cq_buf_size = (ibcq->cqe + 1) * hr_dev->caps.cq_entry_sz;
+		hns_roce_buf_free(hr_dev, cq_buf_size, &hr_cq->hr_buf.hr_buf);
+	}
+
+	kfree(hr_cq);
+
+	return ret;
+}
+
 struct hns_roce_v1_priv hr_v1_priv;
 
 struct hns_roce_hw hns_roce_hw_v1 = {
@@ -3784,5 +3836,6 @@ struct hns_roce_hw hns_roce_hw_v1 = {
 	.req_notify_cq = hns_roce_v1_req_notify_cq,
 	.poll_cq = hns_roce_v1_poll_cq,
 	.dereg_mr = hns_roce_v1_dereg_mr,
+	.destroy_cq = hns_roce_v1_destroy_cq,
 	.priv = &hr_v1_priv,
 };
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH for-next 4/6] IB/hns: Delete the redundant memset operation
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

Delete the redundant memset operation because the memory allocated
by ib_alloc_device has already been zeroed.

Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 5e620f9..28a8f24 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -843,9 +843,6 @@ static int hns_roce_probe(struct platform_device *pdev)
 	if (!hr_dev)
 		return -ENOMEM;
 
-	memset((u8 *)hr_dev + sizeof(struct ib_device), 0,
-		sizeof(struct hns_roce_dev) - sizeof(struct ib_device));
-
 	hr_dev->pdev = pdev;
 	platform_set_drvdata(pdev, hr_dev);
 
-- 
1.7.9.5



^ permalink raw reply related

* [PATCH for-next 3/6] IB/hns: Fix the bug of setting port mtu
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: salil.mehta-hv44wF8Li93QT0dZR+AlfA,
	xavier.huwei-hv44wF8Li93QT0dZR+AlfA,
	oulijun-hv44wF8Li93QT0dZR+AlfA, xushaobo2-hv44wF8Li93QT0dZR+AlfA,
	mehta.salil.lnk-Re5JQEeQqe8AvxtiuMwx3w, lijun_nudt-9Onoh4P/yGk,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linuxarm-hv44wF8Li93QT0dZR+AlfA
In-Reply-To: <20161129231030.1105600-1-salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

From: "Wei Hu (Xavier)" <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>

In the hns_roce driver, we need not call iboe_get_mtu to subtract the
IB headers from the effective IBoE MTU because hr_dev->caps.max_mtu
has already been reduced.
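
A worked example of why the second reduction matters, as a user-space sketch. The helper below is hypothetical and only mimics the shape of the conversion (subtract header overhead, then round down to a standard IB MTU size); the overhead constant is an assumed illustrative value, not taken from the kernel headers.

```c
#include <stddef.h>

/* Assumed per-packet header overhead in bytes; illustrative only. */
#define DEMO_HDR_OVERHEAD 96

/* Standard IB MTU sizes, largest first. */
static const int demo_ib_mtus[] = { 4096, 2048, 1024, 512, 256 };

/*
 * Largest standard IB MTU that still fits once the header overhead is
 * subtracted from the link MTU. Returns 0 if none fits.
 */
static int demo_iboe_payload_mtu(int link_mtu)
{
	int room = link_mtu - DEMO_HDR_OVERHEAD;

	for (size_t i = 0; i < sizeof(demo_ib_mtus) / sizeof(demo_ib_mtus[0]); i++)
		if (room >= demo_ib_mtus[i])
			return demo_ib_mtus[i];

	return 0;
}
```

Feeding an already-reduced value such as 2048 back through this conversion yields 1024, because subtracting any non-zero overhead drops it below the 2048 step. That is the effect of converting caps.max_mtu a second time, and it holds regardless of the exact overhead value assumed here.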

Signed-off-by: Wei Hu (Xavier) <xavier.huwei-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
Signed-off-by: Salil Mehta <salil.mehta-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/hw/hns/hns_roce_main.c |   16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 0cedec0..5e620f9 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -72,18 +72,6 @@ static void hns_roce_set_mac(struct hns_roce_dev *hr_dev, u8 port, u8 *addr)
 	hr_dev->hw->set_mac(hr_dev, phy_port, addr);
 }
 
-static void hns_roce_set_mtu(struct hns_roce_dev *hr_dev, u8 port, int mtu)
-{
-	u8 phy_port = hr_dev->iboe.phy_port[port];
-	enum ib_mtu tmp;
-
-	tmp = iboe_get_mtu(mtu);
-	if (!tmp)
-		tmp = IB_MTU_256;
-
-	hr_dev->hw->set_mtu(hr_dev, phy_port, tmp);
-}
-
 static int hns_roce_add_gid(struct ib_device *device, u8 port_num,
 			    unsigned int index, const union ib_gid *gid,
 			    const struct ib_gid_attr *attr, void **context)
@@ -188,8 +176,8 @@ static int hns_roce_setup_mtu_mac(struct hns_roce_dev *hr_dev)
 	u8 i;
 
 	for (i = 0; i < hr_dev->caps.num_ports; i++) {
-		hns_roce_set_mtu(hr_dev, i,
-				 ib_mtu_enum_to_int(hr_dev->caps.max_mtu));
+		hr_dev->hw->set_mtu(hr_dev, hr_dev->iboe.phy_port[i],
+				    hr_dev->caps.max_mtu);
 		hns_roce_set_mac(hr_dev, i, hr_dev->iboe.netdevs[i]->dev_addr);
 	}
 
-- 
1.7.9.5



^ permalink raw reply related

* [PATCH for-next 2/6] IB/hns: Fix the bug when free mr
From: Salil Mehta @ 2016-11-29 23:10 UTC (permalink / raw)
  To: dledford
  Cc: salil.mehta, xavier.huwei, oulijun, xushaobo2, mehta.salil.lnk,
	lijun_nudt, linux-rdma, netdev, linux-kernel, linuxarm
In-Reply-To: <20161129231030.1105600-1-salil.mehta@huawei.com>

From: Shaobo Xu <xushaobo2@huawei.com>

If the resources of an MR are freed while a user case is executing, the
hardware cannot be notified on the hip06 SoC. The hardware will then hang
when it reads the payload through the PA that has been released.

In order to solve this problem, the RoCE driver creates 8 reserved loopback
QPs to ensure zero WQEs remain when an MR is freed. When the MAC address is
reset, we need to release the reserved loopback QPs and recreate them in
order to avoid loopback failure.

Signed-off-by: Shaobo Xu <xushaobo2@huawei.com>
Reviewed-by: Wei Hu (Xavier) <xavier.huwei@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
---
 drivers/infiniband/hw/hns/hns_roce_cmd.h    |    5 -
 drivers/infiniband/hw/hns/hns_roce_device.h |   10 +
 drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  485 +++++++++++++++++++++++++++
 drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   34 ++
 drivers/infiniband/hw/hns/hns_roce_main.c   |    5 +-
 drivers/infiniband/hw/hns/hns_roce_mr.c     |   21 +-
 6 files changed, 545 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_cmd.h b/drivers/infiniband/hw/hns/hns_roce_cmd.h
index ed14ad3..f5a9ee2 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cmd.h
+++ b/drivers/infiniband/hw/hns/hns_roce_cmd.h
@@ -58,11 +58,6 @@ enum {
 	HNS_ROCE_CMD_QUERY_QP		= 0x22,
 };
 
-struct hns_roce_cmd_mailbox {
-	void		       *buf;
-	dma_addr_t		dma;
-};
-
 int hns_roce_cmd_mbox(struct hns_roce_dev *hr_dev, u64 in_param, u64 out_param,
 		      unsigned long in_modifier, u8 op_modifier, u16 op,
 		      unsigned long timeout);
diff --git a/drivers/infiniband/hw/hns/hns_roce_device.h b/drivers/infiniband/hw/hns/hns_roce_device.h
index e48464d..1050829 100644
--- a/drivers/infiniband/hw/hns/hns_roce_device.h
+++ b/drivers/infiniband/hw/hns/hns_roce_device.h
@@ -388,6 +388,11 @@ struct hns_roce_cmdq {
 	u8			toggle;
 };
 
+struct hns_roce_cmd_mailbox {
+	void		       *buf;
+	dma_addr_t		dma;
+};
+
 struct hns_roce_dev;
 
 struct hns_roce_qp {
@@ -522,6 +527,7 @@ struct hns_roce_hw {
 			 struct ib_recv_wr **bad_recv_wr);
 	int (*req_notify_cq)(struct ib_cq *ibcq, enum ib_cq_notify_flags flags);
 	int (*poll_cq)(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+	int (*dereg_mr)(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr);
 	void	*priv;
 };
 
@@ -688,6 +694,10 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 				   u64 virt_addr, int access_flags,
 				   struct ib_udata *udata);
 int hns_roce_dereg_mr(struct ib_mr *ibmr);
+int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
+		       struct hns_roce_cmd_mailbox *mailbox,
+		       unsigned long mpt_index);
+unsigned long key_to_hw_index(u32 key);
 
 void hns_roce_buf_free(struct hns_roce_dev *hr_dev, u32 size,
 		       struct hns_roce_buf *buf);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
index aee1d01..f67a3bf 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.c
@@ -295,6 +295,8 @@ int hns_roce_v1_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_SQ_HEAD_M,
 			       SQ_DOORBELL_U32_4_SQ_HEAD_S,
 			      (qp->sq.head & ((qp->sq.wqe_cnt << 1) - 1)));
+		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_SL_M,
+			       SQ_DOORBELL_U32_4_SL_S, qp->sl);
 		roce_set_field(sq_db.u32_4, SQ_DOORBELL_U32_4_PORT_M,
 			       SQ_DOORBELL_U32_4_PORT_S, qp->phy_port);
 		roce_set_field(sq_db.u32_8, SQ_DOORBELL_U32_8_QPN_M,
@@ -622,6 +624,213 @@ static int hns_roce_db_ext_init(struct hns_roce_dev *hr_dev, u32 sdb_ext_mod,
 	return ret;
 }
 
+static struct hns_roce_qp *hns_roce_v1_create_lp_qp(struct hns_roce_dev *hr_dev,
+						    struct ib_pd *pd)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_qp_init_attr init_attr;
+	struct ib_qp *qp;
+
+	memset(&init_attr, 0, sizeof(struct ib_qp_init_attr));
+	init_attr.qp_type		= IB_QPT_RC;
+	init_attr.sq_sig_type		= IB_SIGNAL_ALL_WR;
+	init_attr.cap.max_recv_wr	= HNS_ROCE_MIN_WQE_NUM;
+	init_attr.cap.max_send_wr	= HNS_ROCE_MIN_WQE_NUM;
+
+	qp = hns_roce_create_qp(pd, &init_attr, NULL);
+	if (IS_ERR(qp)) {
+		dev_err(dev, "Create loop qp for mr free failed!");
+		return NULL;
+	}
+
+	return to_hr_qp(qp);
+}
+
+static int hns_roce_v1_rsv_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct hns_roce_caps *caps = &hr_dev->caps;
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_cq_init_attr cq_init_attr = { 0 };
+	struct hns_roce_free_mr *free_mr;
+	struct ib_qp_attr attr = { 0 };
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_qp *hr_qp;
+	struct ib_cq *cq;
+	struct ib_pd *pd;
+	u64 subnet_prefix;
+	int attr_mask = 0;
+	int i;
+	int ret;
+	u8 phy_port;
+	u8 sl;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	/* Reserved cq for loop qp */
+	cq_init_attr.cqe		= HNS_ROCE_MIN_WQE_NUM * 2;
+	cq_init_attr.comp_vector	= 0;
+	cq = hns_roce_ib_create_cq(&hr_dev->ib_dev, &cq_init_attr, NULL, NULL);
+	if (IS_ERR(cq)) {
+		dev_err(dev, "Create cq for reserved loop qp failed!\n");
+		return -ENOMEM;
+	}
+	free_mr->mr_free_cq = to_hr_cq(cq);
+	free_mr->mr_free_cq->ib_cq.device		= &hr_dev->ib_dev;
+	free_mr->mr_free_cq->ib_cq.uobject		= NULL;
+	free_mr->mr_free_cq->ib_cq.comp_handler		= NULL;
+	free_mr->mr_free_cq->ib_cq.event_handler	= NULL;
+	free_mr->mr_free_cq->ib_cq.cq_context		= NULL;
+	atomic_set(&free_mr->mr_free_cq->ib_cq.usecnt, 0);
+
+	pd = hns_roce_alloc_pd(&hr_dev->ib_dev, NULL, NULL);
+	if (IS_ERR(pd)) {
+		dev_err(dev, "Create pd for reserved loop qp failed!\n");
+		ret = -ENOMEM;
+		goto alloc_pd_failed;
+	}
+	free_mr->mr_free_pd = to_hr_pd(pd);
+	free_mr->mr_free_pd->ibpd.device  = &hr_dev->ib_dev;
+	free_mr->mr_free_pd->ibpd.uobject = NULL;
+	atomic_set(&free_mr->mr_free_pd->ibpd.usecnt, 0);
+
+	attr.qp_access_flags	= IB_ACCESS_REMOTE_WRITE;
+	attr.pkey_index		= 0;
+	attr.min_rnr_timer	= 0;
+	/* Disable read ability */
+	attr.max_dest_rd_atomic = 0;
+	attr.max_rd_atomic	= 0;
+	/* Use arbitrary values as rq_psn and sq_psn */
+	attr.rq_psn		= 0x0808;
+	attr.sq_psn		= 0x0808;
+	attr.retry_cnt		= 7;
+	attr.rnr_retry		= 7;
+	attr.timeout		= 0x12;
+	attr.path_mtu		= IB_MTU_256;
+	attr.ah_attr.ah_flags		= 1;
+	attr.ah_attr.static_rate	= 3;
+	attr.ah_attr.grh.sgid_index	= 0;
+	attr.ah_attr.grh.hop_limit	= 1;
+	attr.ah_attr.grh.flow_label	= 0;
+	attr.ah_attr.grh.traffic_class	= 0;
+
+	subnet_prefix = cpu_to_be64(0xfe80000000000000LL);
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		free_mr->mr_free_qp[i] = hns_roce_v1_create_lp_qp(hr_dev, pd);
+		if (!free_mr->mr_free_qp[i]) {
+			dev_err(dev, "Create loop qp failed!\n");
+			goto create_lp_qp_failed;
+		}
+		hr_qp = free_mr->mr_free_qp[i];
+
+		sl = i / caps->num_ports;
+
+		if (caps->num_ports == HNS_ROCE_MAX_PORTS)
+			phy_port = (i >= HNS_ROCE_MAX_PORTS) ? (i - 2) :
+				(i % caps->num_ports);
+		else
+			phy_port = i % caps->num_ports;
+
+		hr_qp->port		= phy_port + 1;
+		hr_qp->phy_port		= phy_port;
+		hr_qp->ibqp.qp_type	= IB_QPT_RC;
+		hr_qp->ibqp.device	= &hr_dev->ib_dev;
+		hr_qp->ibqp.uobject	= NULL;
+		atomic_set(&hr_qp->ibqp.usecnt, 0);
+		hr_qp->ibqp.pd		= pd;
+		hr_qp->ibqp.recv_cq	= cq;
+		hr_qp->ibqp.send_cq	= cq;
+
+		attr.ah_attr.port_num	= phy_port + 1;
+		attr.ah_attr.sl		= sl;
+		attr.port_num		= phy_port + 1;
+
+		attr.dest_qp_num	= hr_qp->qpn;
+		memcpy(attr.ah_attr.dmac, hr_dev->dev_addr[phy_port],
+		       MAC_ADDR_OCTET_NUM);
+
+		memcpy(attr.ah_attr.grh.dgid.raw,
+			&subnet_prefix, sizeof(u64));
+		memcpy(&attr.ah_attr.grh.dgid.raw[8],
+		       hr_dev->dev_addr[phy_port], 3);
+		memcpy(&attr.ah_attr.grh.dgid.raw[13],
+		       hr_dev->dev_addr[phy_port] + 3, 3);
+		attr.ah_attr.grh.dgid.raw[11] = 0xff;
+		attr.ah_attr.grh.dgid.raw[12] = 0xfe;
+		attr.ah_attr.grh.dgid.raw[8] ^= 2;
+
+		attr_mask |= IB_QP_PORT;
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_RESET, IB_QPS_INIT);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_INIT, IB_QPS_RTR);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+
+		ret = hr_dev->hw->modify_qp(&hr_qp->ibqp, &attr, attr_mask,
+					    IB_QPS_RTR, IB_QPS_RTS);
+		if (ret) {
+			dev_err(dev, "modify qp failed(%d)!\n", ret);
+			goto create_lp_qp_failed;
+		}
+	}
+
+	return 0;
+
+create_lp_qp_failed:
+	for (i -= 1; i >= 0; i--) {
+		hr_qp = free_mr->mr_free_qp[i];
+		if (hns_roce_v1_destroy_qp(&hr_qp->ibqp))
+			dev_err(dev, "Destroy qp %d for mr free failed!\n", i);
+	}
+
+	if (hns_roce_dealloc_pd(pd))
+		dev_err(dev, "Destroy pd for create_lp_qp failed!\n");
+
+alloc_pd_failed:
+	if (hns_roce_ib_destroy_cq(cq))
+		dev_err(dev, "Destroy cq for create_lp_qp failed!\n");
+
+	return -EINVAL;
+}
+
+static void hns_roce_v1_release_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_qp *hr_qp;
+	int ret;
+	int i;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		hr_qp = free_mr->mr_free_qp[i];
+		ret = hns_roce_v1_destroy_qp(&hr_qp->ibqp);
+		if (ret)
+			dev_err(dev, "Destroy qp %d for mr free failed(%d)!\n",
+				i, ret);
+	}
+
+	ret = hns_roce_ib_destroy_cq(&free_mr->mr_free_cq->ib_cq);
+	if (ret)
+		dev_err(dev, "Destroy cq for mr_free failed(%d)!\n", ret);
+
+	ret = hns_roce_dealloc_pd(&free_mr->mr_free_pd->ibpd);
+	if (ret)
+		dev_err(dev, "Destroy pd for mr_free failed(%d)!\n", ret);
+}
+
 static int hns_roce_db_init(struct hns_roce_dev *hr_dev)
 {
 	struct device *dev = &hr_dev->pdev->dev;
@@ -659,6 +868,223 @@ static int hns_roce_db_init(struct hns_roce_dev *hr_dev)
 	return 0;
 }
 
+static void hns_roce_v1_recreate_lp_qp_work_fn(struct work_struct *work)
+{
+	struct hns_roce_recreate_lp_qp_work *lp_qp_work;
+	struct hns_roce_dev *hr_dev;
+
+	lp_qp_work = container_of(work, struct hns_roce_recreate_lp_qp_work,
+				  work);
+	hr_dev = to_hr_dev(lp_qp_work->ib_dev);
+
+	hns_roce_v1_release_lp_qp(hr_dev);
+
+	if (hns_roce_v1_rsv_lp_qp(hr_dev))
+		dev_err(&hr_dev->pdev->dev, "create reserved qp failed\n");
+
+	if (lp_qp_work->comp_flag)
+		complete(lp_qp_work->comp);
+
+	kfree(lp_qp_work);
+}
+
+static int hns_roce_v1_recreate_lp_qp(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_recreate_lp_qp_work *lp_qp_work;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct completion comp;
+	unsigned long end =
+	  msecs_to_jiffies(HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS) + jiffies;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	lp_qp_work = kzalloc(sizeof(*lp_qp_work), GFP_KERNEL);
+	if (!lp_qp_work)
+		return -ENOMEM;
+	INIT_WORK(&(lp_qp_work->work), hns_roce_v1_recreate_lp_qp_work_fn);
+
+	lp_qp_work->ib_dev = &(hr_dev->ib_dev);
+	lp_qp_work->comp = &comp;
+	lp_qp_work->comp_flag = 1;
+
+	init_completion(lp_qp_work->comp);
+
+	queue_work(free_mr->free_mr_wq, &(lp_qp_work->work));
+
+	while (time_before_eq(jiffies, end)) {
+		if (try_wait_for_completion(&comp))
+			return 0;
+		msleep(HNS_ROCE_V1_RECREATE_LP_QP_WAIT_VALUE);
+	}
+
+	lp_qp_work->comp_flag = 0;
+	if (try_wait_for_completion(&comp))
+		return 0;
+
+	dev_warn(dev, "Recreating lp qp timed out after 10s!\n");
+	return -ETIMEDOUT;
+}
+
+static int hns_roce_v1_send_lp_wqe(struct hns_roce_qp *hr_qp)
+{
+	struct hns_roce_dev *hr_dev = to_hr_dev(hr_qp->ibqp.device);
+	struct device *dev = &hr_dev->pdev->dev;
+	struct ib_send_wr send_wr, *bad_wr;
+	int ret;
+
+	memset(&send_wr, 0, sizeof(send_wr));
+	send_wr.next	= NULL;
+	send_wr.num_sge	= 0;
+	send_wr.send_flags = 0;
+	send_wr.sg_list	= NULL;
+	send_wr.wr_id	= (unsigned long long)&send_wr;
+	send_wr.opcode	= IB_WR_RDMA_WRITE;
+
+	ret = hns_roce_v1_post_send(&hr_qp->ibqp, &send_wr, &bad_wr);
+	if (ret) {
+		dev_err(dev, "Post write wqe for mr free failed(%d)!\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void hns_roce_v1_mr_free_work_fn(struct work_struct *work)
+{
+	struct hns_roce_mr_free_work *mr_work;
+	struct ib_wc wc[HNS_ROCE_V1_RESV_QP];
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_cq *mr_free_cq;
+	struct hns_roce_v1_priv *priv;
+	struct hns_roce_dev *hr_dev;
+	struct hns_roce_mr *hr_mr;
+	struct hns_roce_qp *hr_qp;
+	struct device *dev;
+	unsigned long end =
+		msecs_to_jiffies(HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS) + jiffies;
+	int i;
+	int ret;
+	int ne;
+
+	mr_work = container_of(work, struct hns_roce_mr_free_work, work);
+	hr_mr = (struct hns_roce_mr *)mr_work->mr;
+	hr_dev = to_hr_dev(mr_work->ib_dev);
+	dev = &hr_dev->pdev->dev;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+	mr_free_cq = free_mr->mr_free_cq;
+
+	for (i = 0; i < HNS_ROCE_V1_RESV_QP; i++) {
+		hr_qp = free_mr->mr_free_qp[i];
+		ret = hns_roce_v1_send_lp_wqe(hr_qp);
+		if (ret) {
+			dev_err(dev,
+			     "Send wqe (qp:0x%lx) for mr free failed(%d)!\n",
+			     hr_qp->qpn, ret);
+			goto free_work;
+		}
+	}
+
+	ne = HNS_ROCE_V1_RESV_QP;
+	do {
+		ret = hns_roce_v1_poll_cq(&mr_free_cq->ib_cq, ne, wc);
+		if (ret < 0) {
+			dev_err(dev,
+			   "(qp:0x%lx) Poll cqe failed(%d) for mr 0x%x free! Remain %d cqe\n",
+			   hr_qp->qpn, ret, hr_mr->key, ne);
+			goto free_work;
+		}
+		ne -= ret;
+		msleep(HNS_ROCE_V1_FREE_MR_WAIT_VALUE);
+	} while (ne && time_before_eq(jiffies, end));
+
+	if (ne != 0)
+		dev_err(dev,
+			"Poll cqe for mr 0x%x free timeout! Remain %d cqe\n",
+			hr_mr->key, ne);
+
+free_work:
+	if (mr_work->comp_flag)
+		complete(mr_work->comp);
+	kfree(mr_work);
+}
+
+int hns_roce_v1_dereg_mr(struct hns_roce_dev *hr_dev, struct hns_roce_mr *mr)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_mr_free_work *mr_work;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	struct completion comp;
+	unsigned long end =
+		msecs_to_jiffies(HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS) + jiffies;
+	unsigned long start = jiffies;
+	int npages;
+	int ret = 0;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	if (mr->enabled) {
+		if (hns_roce_hw2sw_mpt(hr_dev, NULL, key_to_hw_index(mr->key)
+				       & (hr_dev->caps.num_mtpts - 1)))
+			dev_warn(dev, "HW2SW_MPT failed!\n");
+	}
+
+	mr_work = kzalloc(sizeof(*mr_work), GFP_KERNEL);
+	if (!mr_work) {
+		ret = -ENOMEM;
+		goto free_mr;
+	}
+
+	INIT_WORK(&(mr_work->work), hns_roce_v1_mr_free_work_fn);
+
+	mr_work->ib_dev = &(hr_dev->ib_dev);
+	mr_work->comp = &comp;
+	mr_work->comp_flag = 1;
+	mr_work->mr = (void *)mr;
+	init_completion(mr_work->comp);
+
+	queue_work(free_mr->free_mr_wq, &(mr_work->work));
+
+	while (time_before_eq(jiffies, end)) {
+		if (try_wait_for_completion(&comp))
+			goto free_mr;
+		msleep(HNS_ROCE_V1_FREE_MR_WAIT_VALUE);
+	}
+
+	mr_work->comp_flag = 0;
+	if (try_wait_for_completion(&comp))
+		goto free_mr;
+
+	dev_warn(dev, "Free mr work 0x%x over 50s and failed!\n", mr->key);
+	ret = -ETIMEDOUT;
+
+free_mr:
+	dev_dbg(dev, "Free mr 0x%x took %u us.\n",
+		mr->key, jiffies_to_usecs(jiffies) - jiffies_to_usecs(start));
+
+	if (mr->size != ~0ULL) {
+		npages = ib_umem_page_count(mr->umem);
+		dma_free_coherent(dev, npages * 8, mr->pbl_buf,
+				  mr->pbl_dma_addr);
+	}
+
+	hns_roce_bitmap_free(&hr_dev->mr_table.mtpt_bitmap,
+			     key_to_hw_index(mr->key), 0);
+
+	if (mr->umem)
+		ib_umem_release(mr->umem);
+
+	kfree(mr);
+
+	return ret;
+}
+
 static void hns_roce_db_free(struct hns_roce_dev *hr_dev)
 {
 	struct device *dev = &hr_dev->pdev->dev;
@@ -899,6 +1325,46 @@ static void hns_roce_tptr_free(struct hns_roce_dev *hr_dev)
 			  tptr_buf->buf, tptr_buf->map);
 }
 
+static int hns_roce_free_mr_init(struct hns_roce_dev *hr_dev)
+{
+	struct device *dev = &hr_dev->pdev->dev;
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+	int ret = 0;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	free_mr->free_mr_wq = create_singlethread_workqueue("hns_roce_free_mr");
+	if (!free_mr->free_mr_wq) {
+		dev_err(dev, "Create free mr workqueue failed!\n");
+		return -ENOMEM;
+	}
+
+	ret = hns_roce_v1_rsv_lp_qp(hr_dev);
+	if (ret) {
+		dev_err(dev, "Reserved loop qp failed(%d)!\n", ret);
+		flush_workqueue(free_mr->free_mr_wq);
+		destroy_workqueue(free_mr->free_mr_wq);
+	}
+
+	return ret;
+}
+
+static void hns_roce_free_mr_free(struct hns_roce_dev *hr_dev)
+{
+	struct hns_roce_free_mr *free_mr;
+	struct hns_roce_v1_priv *priv;
+
+	priv = (struct hns_roce_v1_priv *)hr_dev->hw->priv;
+	free_mr = &priv->free_mr;
+
+	flush_workqueue(free_mr->free_mr_wq);
+	destroy_workqueue(free_mr->free_mr_wq);
+
+	hns_roce_v1_release_lp_qp(hr_dev);
+}
+
 /**
  * hns_roce_v1_reset - reset RoCE
  * @hr_dev: RoCE device struct pointer
@@ -1100,10 +1566,19 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 		goto error_failed_des_qp_init;
 	}
 
+	ret = hns_roce_free_mr_init(hr_dev);
+	if (ret) {
+		dev_err(dev, "free mr init failed!\n");
+		goto error_failed_free_mr_init;
+	}
+
 	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_UP);
 
 	return 0;
 
+error_failed_free_mr_init:
+	hns_roce_des_qp_free(hr_dev);
+
 error_failed_des_qp_init:
 	hns_roce_tptr_free(hr_dev);
 
@@ -1121,6 +1596,7 @@ int hns_roce_v1_init(struct hns_roce_dev *hr_dev)
 void hns_roce_v1_exit(struct hns_roce_dev *hr_dev)
 {
 	hns_roce_port_enable(hr_dev, HNS_ROCE_PORT_DOWN);
+	hns_roce_free_mr_free(hr_dev);
 	hns_roce_des_qp_free(hr_dev);
 	hns_roce_tptr_free(hr_dev);
 	hns_roce_bt_free(hr_dev);
@@ -1161,6 +1637,14 @@ void hns_roce_v1_set_mac(struct hns_roce_dev *hr_dev, u8 phy_port, u8 *addr)
 	u32 *p;
 	u32 val;
 
+	/*
+	 * When the MAC address changes, loopback may fail
+	 * because smac no longer equals dmac.
+	 * Release the reserved QPs and create them again.
+	 */
+	if (hr_dev->hw->dereg_mr && hns_roce_v1_recreate_lp_qp(hr_dev))
+		dev_warn(&hr_dev->pdev->dev, "recreate lp qp timeout!\n");
+
 	p = (u32 *)(&addr[0]);
 	reg_smac_l = *p;
 	roce_raw_write(reg_smac_l, hr_dev->reg_base + ROCEE_SMAC_L_0_REG +
@@ -3299,5 +3783,6 @@ struct hns_roce_hw hns_roce_hw_v1 = {
 	.post_recv = hns_roce_v1_post_recv,
 	.req_notify_cq = hns_roce_v1_req_notify_cq,
 	.poll_cq = hns_roce_v1_poll_cq,
+	.dereg_mr = hns_roce_v1_dereg_mr,
 	.priv = &hr_v1_priv,
 };
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
index 1d250c0..b213b5e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v1.h
@@ -58,6 +58,7 @@
 #define HNS_ROCE_V1_PHY_UAR_NUM				8
 
 #define HNS_ROCE_V1_GID_NUM				16
+#define HNS_ROCE_V1_RESV_QP				8
 
 #define HNS_ROCE_V1_NUM_COMP_EQE			0x8000
 #define HNS_ROCE_V1_NUM_ASYNC_EQE			0x400
@@ -107,6 +108,10 @@
 #define HNS_ROCE_V1_DB_STAGE2				2
 #define HNS_ROCE_V1_CHECK_DB_TIMEOUT_MSECS		10000
 #define HNS_ROCE_V1_CHECK_DB_SLEEP_MSECS		20
+#define HNS_ROCE_V1_FREE_MR_TIMEOUT_MSECS		50000
+#define HNS_ROCE_V1_RECREATE_LP_QP_TIMEOUT_MSECS	10000
+#define HNS_ROCE_V1_FREE_MR_WAIT_VALUE			5
+#define HNS_ROCE_V1_RECREATE_LP_QP_WAIT_VALUE		20
 
 #define HNS_ROCE_BT_RSV_BUF_SIZE			(1 << 17)
 
@@ -969,6 +974,10 @@ struct hns_roce_sq_db {
 #define SQ_DOORBELL_U32_4_SQ_HEAD_M   \
 	(((1UL << 15) - 1) << SQ_DOORBELL_U32_4_SQ_HEAD_S)
 
+#define SQ_DOORBELL_U32_4_SL_S 16
+#define SQ_DOORBELL_U32_4_SL_M   \
+	(((1UL << 2) - 1) << SQ_DOORBELL_U32_4_SL_S)
+
 #define SQ_DOORBELL_U32_4_PORT_S 18
 #define SQ_DOORBELL_U32_4_PORT_M  (((1UL << 3) - 1) << SQ_DOORBELL_U32_4_PORT_S)
 
@@ -1015,14 +1024,39 @@ struct hns_roce_des_qp {
 	int	requeue_flag;
 };
 
+struct hns_roce_mr_free_work {
+	struct	work_struct work;
+	struct	ib_device *ib_dev;
+	struct	completion *comp;
+	int	comp_flag;
+	void	*mr;
+};
+
+struct hns_roce_recreate_lp_qp_work {
+	struct	work_struct work;
+	struct	ib_device *ib_dev;
+	struct	completion *comp;
+	int	comp_flag;
+};
+
+struct hns_roce_free_mr {
+	struct workqueue_struct *free_mr_wq;
+	struct hns_roce_qp *mr_free_qp[HNS_ROCE_V1_RESV_QP];
+	struct hns_roce_cq *mr_free_cq;
+	struct hns_roce_pd *mr_free_pd;
+};
+
 struct hns_roce_v1_priv {
 	struct hns_roce_db_table  db_table;
 	struct hns_roce_raq_table raq_table;
 	struct hns_roce_bt_table  bt_table;
 	struct hns_roce_tptr_table tptr_table;
 	struct hns_roce_des_qp des_qp;
+	struct hns_roce_free_mr free_mr;
 };
 
 int hns_dsaf_roce_reset(struct fwnode_handle *dsaf_fwnode, bool dereset);
+int hns_roce_v1_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc);
+int hns_roce_v1_destroy_qp(struct ib_qp *ibqp);
 
 #endif
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index 914d0ac..0cedec0 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -129,7 +129,6 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 {
 	struct device *dev = &hr_dev->pdev->dev;
 	struct net_device *netdev;
-	unsigned long flags;
 
 	netdev = hr_dev->iboe.netdevs[port];
 	if (!netdev) {
@@ -137,7 +136,7 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 		return -ENODEV;
 	}
 
-	spin_lock_irqsave(&hr_dev->iboe.lock, flags);
+	spin_lock_bh(&hr_dev->iboe.lock);
 
 	switch (event) {
 	case NETDEV_UP:
@@ -156,7 +155,7 @@ static int handle_en_event(struct hns_roce_dev *hr_dev, u8 port,
 		break;
 	}
 
-	spin_unlock_irqrestore(&hr_dev->iboe.lock, flags);
+	spin_unlock_bh(&hr_dev->iboe.lock);
 	return 0;
 }
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 9b8a1ad..4139abe 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -42,7 +42,7 @@ static u32 hw_index_to_key(unsigned long ind)
 	return (u32)(ind >> 24) | (ind << 8);
 }
 
-static unsigned long key_to_hw_index(u32 key)
+unsigned long key_to_hw_index(u32 key)
 {
 	return (key << 24) | (key >> 8);
 }
@@ -56,7 +56,7 @@ static int hns_roce_sw2hw_mpt(struct hns_roce_dev *hr_dev,
 				 HNS_ROCE_CMD_TIMEOUT_MSECS);
 }
 
-static int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
+int hns_roce_hw2sw_mpt(struct hns_roce_dev *hr_dev,
 			      struct hns_roce_cmd_mailbox *mailbox,
 			      unsigned long mpt_index)
 {
@@ -607,13 +607,20 @@ struct ib_mr *hns_roce_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 
 int hns_roce_dereg_mr(struct ib_mr *ibmr)
 {
+	struct hns_roce_dev *hr_dev = to_hr_dev(ibmr->device);
 	struct hns_roce_mr *mr = to_hr_mr(ibmr);
+	int ret = 0;
 
-	hns_roce_mr_free(to_hr_dev(ibmr->device), mr);
-	if (mr->umem)
-		ib_umem_release(mr->umem);
+	if (hr_dev->hw->dereg_mr) {
+		ret = hr_dev->hw->dereg_mr(hr_dev, mr);
+	} else {
+		hns_roce_mr_free(hr_dev, mr);
 
-	kfree(mr);
+		if (mr->umem)
+			ib_umem_release(mr->umem);
 
-	return 0;
+		kfree(mr);
+	}
+
+	return ret;
 }
-- 
1.7.9.5
