Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed
* [patch] net/mlx5: remove a duplicate condition
From: Dan Carpenter @ 2016-11-24 11:03 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Matan Barak, Leon Romanovsky, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-janitors-u79uwXL29TY76Z2rM5mHXA

We verified that MLX5_FLOW_CONTEXT_ACTION_COUNT was set on the first
line of the function so we don't need to check again here.

Signed-off-by: Dan Carpenter <dan.carpenter-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
Not a bugfix so it would go into -next

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
index 68ec4ea..a263d89 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.c
@@ -1307,8 +1307,7 @@ static bool counter_is_valid(struct mlx5_fc *counter, u32 action)
 		return false;
 
 	return (action & (MLX5_FLOW_CONTEXT_ACTION_DROP |
-			  MLX5_FLOW_CONTEXT_ACTION_FWD_DEST)) &&
-		(action & MLX5_FLOW_CONTEXT_ACTION_COUNT);
+			  MLX5_FLOW_CONTEXT_ACTION_FWD_DEST));
 }
 
 static bool dest_is_valid(struct mlx5_flow_destination *dest,
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH] net/mlx5: drop duplicate header delay.h
From: Geliang Tang @ 2016-11-24 13:58 UTC (permalink / raw)
  To: Saeed Mahameed, Matan Barak, Leon Romanovsky
  Cc: Geliang Tang, netdev, linux-rdma, linux-kernel
In-Reply-To: <15299de49216a2360976ca37ff774cae9d27d88b.1479991297.git.geliangtang@gmail.com>

Drop duplicate header delay.h from mlx5/core/main.c.

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index f28df33..d7a55eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -46,7 +46,6 @@
 #include <linux/mlx5/srq.h>
 #include <linux/debugfs.h>
 #include <linux/kmod.h>
-#include <linux/delay.h>
 #include <linux/mlx5/mlx5_ifc.h>
 #ifdef CONFIG_RFS_ACCEL
 #include <linux/cpu_rmap.h>
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH] net/mlx5: drop duplicate header delay.h
From: Matan Barak @ 2016-11-24 14:12 UTC (permalink / raw)
  To: Geliang Tang, Saeed Mahameed, Leon Romanovsky
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <03d5a2a0f03458cdb4f2b139cfabc11b5c334f95.1479990943.git.geliangtang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On 24/11/2016 15:58, Geliang Tang wrote:
> Drop duplicate header delay.h from mlx5/core/main.c.
>
> Signed-off-by: Geliang Tang <geliangtang-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/main.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index f28df33..d7a55eb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -46,7 +46,6 @@
>  #include <linux/mlx5/srq.h>
>  #include <linux/debugfs.h>
>  #include <linux/kmod.h>
> -#include <linux/delay.h>
>  #include <linux/mlx5/mlx5_ifc.h>
>  #ifdef CONFIG_RFS_ACCEL
>  #include <linux/cpu_rmap.h>
>

Thanks.
Acked-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/mlx5: drop duplicate header delay.h
From: Saeed Mahameed @ 2016-11-24 15:25 UTC (permalink / raw)
  To: Geliang Tang
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, Linux Netdev List,
	linux-rdma, linux-kernel
In-Reply-To: <03d5a2a0f03458cdb4f2b139cfabc11b5c334f95.1479990943.git.geliangtang@gmail.com>

On Thu, Nov 24, 2016 at 3:58 PM, Geliang Tang <geliangtang@gmail.com> wrote:
> Drop duplicate header delay.h from mlx5/core/main.c.
>
> Signed-off-by: Geliang Tang <geliangtang@gmail.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply

* Re: [patch] net/mlx5: remove a duplicate condition
From: Saeed Mahameed @ 2016-11-24 15:30 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Saeed Mahameed, Matan Barak, Leon Romanovsky, Linux Netdev List,
	linux-rdma, linux-kernel, kernel-janitors
In-Reply-To: <20161124110345.GC17225@mwanda>

On Thu, Nov 24, 2016 at 1:03 PM, Dan Carpenter <dan.carpenter@oracle.com> wrote:
> We verified that MLX5_FLOW_CONTEXT_ACTION_COUNT was set on the first
> line of the function so we don't need to check again here.
>
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-24 16:15 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161124000825.GA73280-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Wed, Nov 23, 2016 at 04:08:25PM -0800, Vishwanathapura, Niranjana wrote:

> In order to pass the hfi function pointers to the hfi_vnic ULP, I can,
> a) Have hfi_vnic ULP define an interface API for hfi1 driver to call to
> register its callback (as you pointed). Unfortunately there will be a module
> dependency here.
> Or,

That is probably backwards

> b) Add a new member ‘struct vnic_ops’ either to the ib_device structure or
> ib_port_immutable structure. As it is hfi1 specific, only hfi1 driver will
> set it. No module dependency here.

You can add a hfi1_get_vnic_ops(struct ib_device *) and implement it
in your module..

> And will move the hfi_vnic module under
> ‘drivers/infiniband/ulp/hfi_vnic’.

I would prefer drivers/net/ethernet

This is clearly not a ULP since it doesn't use verbs.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-24 16:24 UTC (permalink / raw)
  To: Sagalovitch, Serguei
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Koenig, Christian,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <SN1PR12MB0703CE07F4878243164299C4FEB70-z7L1TMIYDg6P/i5UxMCIqAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

On Thu, Nov 24, 2016 at 12:40:37AM +0000, Sagalovitch, Serguei wrote:
> On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> 
> > Perhaps I am not following what Serguei is asking for, but I
> > understood the desire was for a complex GPU allocator that could
> > migrate pages between GPU and CPU memory under control of the GPU
> > driver, among other things. The desire is for DMA to continue to work
> > even after these migrations happen.
> 
> The main issue is to  how to solve use cases when p2p is 
> requested/initiated via CPU pointers where such pointers could 
> point to non-system memory location e.g.  VRAM.  

Okay, but your list is conflating a whole bunch of problems..

 1) How to go from a __user pointer to a p2p DMA address
  a) How to validate, setup iommu and maybe worst case bounce buffer
     these p2p DMAs
 2) How to allow drivers (ie GPU allocator) dynamically
    remap pages in a VMA to/from p2p DMA addresses
 3) How to expose uncachable p2p DMA address to user space via mmap

> to allow "get_user_pages"  to work transparently similar 
> how it is/was done for "DAX Device" case. Unfortunately 
> based on my understanding "DAX Device" implementation 
> deal only with permanently  "locked" memory  (fixed location) 
> unrelated to "get_user_pages"/"put_page" scope  
> which doesn't satisfy requirements  for "eviction" / "moving" of 
> memory keeping CPU address intact.  

Hurm, isn't that issue with DAX only to do with being coherent with
the page cache?

A GPU allocator would not use the page cache, it would have to
construct VMAs some other way.

> My understanding is that It will not solve RDMA MR issue where "lock" 
> could be during the whole  application life but  (a) it will not make 
> RDMA MR case worse  (b) should be enough for all other cases for 
> "get_user_pages"/"put_page" controlled by  kernel.

Right. There is no solution to the RDMA MR issue on old hardware. Apps
that are using GPU+RDMA+Old hardware will have to use short lived MRs
and pay that performance cost, or give up on migration.

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-24 16:26 UTC (permalink / raw)
  To: Christian König
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <a33ec1cd-051f-8a24-0587-68707459c25c-5C7GfCeVMHo@public.gmane.org>

On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
> >There is certainly nothing about the hardware that cares
> >about ZONE_DEVICE vs System memory.
> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
> PCI BAR and another PCI device initiates a DMA to this address the DMA
> subsystem must be able to check if the interconnection really works.

I said the hardware doesn't care.. You are right, we still have an
outstanding problem in Linux of how to generically DMA map a P2P
address - which is a different issue from getting the P2P address from
a __user pointer...

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-24 16:42 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Koenig, Christian,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <91d28749-bc64-622f-56a1-26c00e6b462a-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
> >>> Only ODP hardware allows changing the DMA address on the fly, and it
> >>> works at the page table level. We do not need special handling for
> >>> RDMA.
> >>
> >> I am aware of ODP but, noted by others, it doesn't provide a general
> >> solution to the points above.
> > 
> > How do you mean?
> 
> I was only saying it wasn't general in that it wouldn't work for IB
> hardware that doesn't support ODP or other hardware  that doesn't do
> similar things (like an NVMe drive).

There are three cases to worry about:
 - Coherent long lived page table mirroring (RDMA ODP MR)
 - Non-coherent long lived page table mirroring (RDMA MR)
 - Short lived DMA mapping (everything else)

Like you say below we have to handle short lived in the usual way, and
that covers basically every device except IB MRs, including the
command queue on a NVMe drive.

> any complex allocators (GPU or otherwise) should respect that. And that
> seems like it should be the default way most of this works -- and I
> think it wouldn't actually take too much effort to make it all work now
> as is. (Our iopmem work is actually quite small and simple.)

Yes, absolutely, some kind of page pinning like locking is a hard
requirement.

> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
> memory working for some time. I'd say it's a good fit. The main question
> we've had is how to expose PCIe bars to userspace to be used as MRs and
> such.

Is there any progress on that?

I still don't quite get what iopmem was about.. I thought the
objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
over iopmem and still ending up with uncacheable mmaps still seems
like a non-starter to me...

Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
- do you still allow the CPU to access it? Or do you swap it back to
cachable memory if the CPU touches it?

One approach might be to mmap the uncachable ZONE_DEVICE memory and
mark it inaccessible to the CPU - DMA could still translate. If the
CPU needs it then the kernel migrates it to system memory so it
becomes cachable. ??

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-24 17:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Christian König
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161124162620.GC20818-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:
> On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
>> Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:
>>> There is certainly nothing about the hardware that cares
>>> about ZONE_DEVICE vs System memory.
>> Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
>> PCI BAR and another PCI device initiates a DMA to this address the DMA
>> subsystem must be able to check if the interconnection really works.
> I said the hardware doesn't care.. You are right, we still have an
> outstanding problem in Linux of how to generically DMA map a P2P
> address - which is a different issue from getting the P2P address from
> a __user pointer...
>
> Jason
I agreed but the problem is that one issue immediately introduce another 
one
to solve and so on (if we do not want to cut corners). I would think  that
a lot of them interconnected because the way how one problem could be
solved may impact solution for another.

btw: about "DMA map a p2p address": Right now to enable  p2p between 
devices
it is required/recommended to disable iommu support  (e.g. intel iommu 
driver
has special logic for graphics and  comment "Reserve all PCI MMIO to avoid
peer-to-peer access").

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-24 17:55 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Dan Williams
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <a33ec1cd-051f-8a24-0587-68707459c25c-5C7GfCeVMHo@public.gmane.org>

Hey,

On 24/11/16 02:45 AM, Christian König wrote:
> E.g. it can happen that PCI device A exports it's BAR using ZONE_DEVICE.
> Not PCI device B (a SATA device) can directly read/write to it because
> it is on the same bus segment, but PCI device C (a network card for
> example) can't because it is on a different bus segment and the bridge
> can't handle P2P transactions.

Yeah, that could be an issue but in our experience we have yet to see
it. We've tested with two separate PCI buses on different CPUs connected
through QPI links and it works fine. (It is rather slow but I understand
Intel has improved the bottleneck in newer CPUs than the ones we tested.)

It may just be older hardware that has this issue. I expect that as long
as a failed transfer can be handled gracefully by the initiator I don't
see a need to predetermine whether a device can see another devices memory.


Logan

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-24 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Koenig, Christian,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161124164249.GD20818-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>



On 24/11/16 09:42 AM, Jason Gunthorpe wrote:
> There are three cases to worry about:
>  - Coherent long lived page table mirroring (RDMA ODP MR)
>  - Non-coherent long lived page table mirroring (RDMA MR)
>  - Short lived DMA mapping (everything else)
> 
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Yes, this makes sense to me. Though I thought regular IB MRs with
regular memory currently pinned the pages (despite being long lived)
that's why we can run up against the "max locked memory" limit. It
doesn't seem so terrible if GPU memory had a similar restriction until
ODP like solutions get implemented.

>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.

> Is there any progress on that?

Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure which application gets what
memeroy.

* Regular DAX in the FS doesn't work at this time because the FS can
move the file you think your transfer to out from under you. Though I
understand there's been some work with XFS to solve that issue.

Though, we've been considering that the backed memory would be
non-volatile which adds some of this complexity. If the memory were
volatile the kernel would just need to do some relatively straight
forward allocation to user-space when asked. For example, with NVMe, the
kernel could give chunks of the CMB buffer to userspace via an mmap call
to /dev/nvmeX. Though I think there's been some push back against things
like that as well.

> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...

The latest incarnation of iopmem simply created a block device backed by
ZONE_DEVICE memory on a PCIe BAR. We then put a DAX FS on it and
user-space could mmap the files and send them to other devices to do P2P
transfers.

I don't think there was a hard objection to uncachable ZONE_DEVICE and
DAX. We did try our experimental hardware with cached ZONE_DEVICE and it
did work but the performance was beyond unusable (which may be a
hardware issue). In the end I feel the driver would have to decide the
most appropriate caching for the hardware and I don't understand why WC
or UC wouldn't work with ZONE_DEVICE.

Logan

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-25  2:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161124161545.GA20818-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
>On Wed, Nov 23, 2016 at 04:08:25PM -0800, Vishwanathapura, Niranjana wrote:
>
>> In order to pass the hfi function pointers to the hfi_vnic ULP, I can,
>> a) Have hfi_vnic ULP define an interface API for hfi1 driver to call to
>> register its callback (as you pointed). Unfortunately there will be a module
>> dependency here.
>> Or,
>
>That is probably backwards
>
>> b) Add a new member ‘struct vnic_ops’ either to the ib_device structure or
>> ib_port_immutable structure. As it is hfi1 specific, only hfi1 driver will
>> set it. No module dependency here.
>
>You can add a hfi1_get_vnic_ops(struct ib_device *) and implement it
>in your module..
>

In order to be truely device independent the hfi_vnic ULP should not depend on 
a device exported symbol. Instead device should register its functions with the 
ULP. Hence the approaches a) and b).

Niranjana

>Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Christoph Hellwig @ 2016-11-25  7:58 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Haggai Eran, linux-rdma@vger.kernel.org,
	linux-nvdimm@lists.01.org, Kuehling, Felix, Serguei Sagalovitch,
	Blinzer, Paul, linux-kernel@vger.kernel.org,
	dri-devel@lists.freedesktop.org, Sander, Ben, Jason Gunthorpe,
	Suthikulpanit, Suravee, linux-pci@vger.kernel.org,
	Deucher, Alexander, Dan Williams, Koenig, Christian,
	Linux-media@vger.kernel.org
In-Reply-To: <9cc22068-ede8-c1bc-5d8b-cf6224a7ce05@deltatee.com>

On Thu, Nov 24, 2016 at 11:11:34AM -0700, Logan Gunthorpe wrote:
> * Regular DAX in the FS doesn't work at this time because the FS can
> move the file you think your transfer to out from under you. Though I
> understand there's been some work with XFS to solve that issue.

The file system will never move anything under locked down pages,
locking down pages is used exactly to protect against that.  So as long
as we page structures available RDMA to/from device memory _from kernel
space_ is trivial, although for file systems to work properly you
really want a notification to the consumer if the file systems wants
to remove the mapping.  We have implemented that using FL_LAYOUTS locks
for NFSD, but only XFS supports it so far.  Without that a long term
locked down region of memory (e.g. a kernel MR) would prevent various
file operations that would simply hang.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply

* [Bug 188541] New: Function rsxx_pci_probe() does not set error code when the call to create_singlethread_workqueue() fails
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 10:31 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=188541

            Bug ID: 188541
           Summary: Function rsxx_pci_probe() does not set error code when
                    the call to create_singlethread_workqueue() fails
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Function create_singlethread_workqueue() returns a NULL pointer on failure. In
function rsxx_pci_probe() defined in file drivers/block/rsxx/core.c, variable
st keeps the error code, its value must be 0 at line 888. As a result, 0
(indicates no error) will be returned even when the call to function
create_singlethread_workqueue() at line 890 fails. Though this error may occur
rarely, it's better to explicitly assign "-ENOMEM" to st on the true branch of
the if-statement at line 891.

 758 static int rsxx_pci_probe(struct pci_dev *dev,
 759                     const struct pci_device_id *id)
 760 {
 761     struct rsxx_cardinfo *card;
 762     int st;
         ...
 876     card->ctrl = kzalloc(card->n_targets * sizeof(*card->ctrl),
GFP_KERNEL);
 877     if (!card->ctrl) {
 878         st = -ENOMEM;
 879         goto failed_dma_setup;
 880     }
 881 
 882     st = rsxx_dma_setup(card);
 883     if (st) {
 884         dev_info(CARD_TO_DEV(card),
 885             "Failed to setup DMA engine\n");
 886         goto failed_dma_setup;
 887     }
 888     // the value of st must be 0 when the code reaches here
 889     /************* Setup Card Event Handler *************/
 890     card->event_wq = create_singlethread_workqueue(DRIVER_NAME"_event");
 891     if (!card->event_wq) {
 892         dev_err(CARD_TO_DEV(card), "Failed card event setup.\n");
 893         goto failed_event_handler;          // insert "st = -ENOMEM"
before the jump instruction?
 894     }
         ...
 961     return 0;
 962 
 963 failed_create_dev:
 964     destroy_workqueue(card->event_wq);
 965     card->event_wq = NULL;
 966 failed_event_handler:
 967     rsxx_dma_destroy(card);
 968 failed_dma_setup:
 969 failed_compatiblity_check:
 970     destroy_workqueue(card->creg_ctrl.creg_wq);
 971     card->creg_ctrl.creg_wq = NULL;
         ...
 990 failed_ida_get:
 991     kfree(card);
 992 
 993     return st;
 994 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [Bug 188591] New: Function ioat_dma_self_test() does not set error code when the call to dma_mapping_error() fails
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 10:39 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=188591

            Bug ID: 188591
           Summary: Function ioat_dma_self_test() does not set error code
                    when the call to dma_mapping_error() fails
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Function dma_mapping_error() returns a NULL pointer on failure. In function
ioat_dma_self_test() defined in file drivers/dma/ioat/init.c, variable err
takes the error code. However, the value of err is 0 (indicates success) even
when the calls to dma_mapping_error() fail at line 341 or 346. Though this
error may occur rarely, it is better to assgin "-ENOMEM" to err before the jump
instructions at lines 343 and 348. Codes related to these two bugs are
summarised as follows.

ioat_dma_self_test @@ drivers/dma/ioat/init.c
 302 static int ioat_dma_self_test(struct ioatdma_device *ioat_dma)
 303 {
         ...
 313     int err = 0;
         ...
 332     dma_chan = container_of(dma->channels.next, struct dma_chan,
 333                 device_node);
 334     if (dma->device_alloc_chan_resources(dma_chan) < 1) {
 335         dev_err(dev, "selftest cannot allocate chan resource\n");
 336         err = -ENODEV;
 337         goto out;
 338     }
 339 
 340     dma_src = dma_map_single(dev, src, IOAT_TEST_SIZE, DMA_TO_DEVICE);
 341     if (dma_mapping_error(dev, dma_src)) {
 342         dev_err(dev, "mapping src buffer failed\n");
 343         goto free_resources;      // insert "err = -ENOMEM" before this
jump instruction?
 344     }
 345     dma_dest = dma_map_single(dev, dest, IOAT_TEST_SIZE, DMA_FROM_DEVICE);
 346     if (dma_mapping_error(dev, dma_dest)) {
 347         dev_err(dev, "mapping dest buffer failed\n");
 348         goto unmap_src;   // insert "err = -ENOMEM" before this jump
instruction?
 349     }
         ...
 387 unmap_dma:
 388     dma_unmap_single(dev, dma_dest, IOAT_TEST_SIZE, DMA_FROM_DEVICE);
 389 unmap_src:
 390     dma_unmap_single(dev, dma_src, IOAT_TEST_SIZE, DMA_TO_DEVICE);
 391 free_resources:
 392     dma->device_free_chan_resources(dma_chan);
 393 out:
 394     kfree(src);
 395     kfree(dest);
 396     return err;
 397 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [Bug 188601] New: Function ioat_xor_val_self_test() does not set error code when the call to dma_mapping_error() fails
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 10:40 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=188601

            Bug ID: 188601
           Summary: Function ioat_xor_val_self_test() does not set error
                    code when the call to dma_mapping_error() fails
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Function dma_mapping_error() returns a NULL pointer on failure. In function
ioat_xor_val_self_test() defined in file drivers/dma/ioat/init.c, variable err
takes the error code. However, the value of err is 0 (indicates success) even
when the call to dma_mapping_error() fails at line 830. Though this error may
occur rarely, it is better to assign "-ENOMEM" to err when dma_mapping_error()
fails. 
There are other 3 similar bugs when the call to dma_mapping_error() fail at
lines 838, 907 and 960. Codes related to these bugs are summarised as follows.

ioat_xor_val_self_test @@ drivers/dma/ioat/init.c
 761 static int ioat_xor_val_self_test(struct ioatdma_device *ioat_dma)
 762 {
         ...
 775     int err = 0;
         ...
 819     dma_chan = container_of(dma->channels.next, struct dma_chan,
 820                 device_node);
 821     if (dma->device_alloc_chan_resources(dma_chan) < 1) {
 822         err = -ENODEV;
 823         goto out;
 824     }
 825 
 826     /* test xor */
 827     op = IOAT_OP_XOR;
 828 
 829     dest_dma = dma_map_page(dev, dest, 0, PAGE_SIZE, DMA_FROM_DEVICE);
 830     if (dma_mapping_error(dev, dest_dma))
 831         goto free_resources;    // insert "err = -ENOMEM" before this jump
instruction?
 832  
 833     for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
 834         dma_srcs[i] = DMA_ERROR_CODE;
 835     for (i = 0; i < IOAT_NUM_SRC_TEST; i++) {
 836         dma_srcs[i] = dma_map_page(dev, xor_srcs[i], 0, PAGE_SIZE,
 837                        DMA_TO_DEVICE);
 838         if (dma_mapping_error(dev, dma_srcs[i]))
 839             goto dma_unmap;    // insert "err = -ENOMEM" before this jump
instruction?
 840     }
         ...
 904     for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
 905         dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
 906                        DMA_TO_DEVICE);
 907         if (dma_mapping_error(dev, dma_srcs[i]))
 908             goto dma_unmap;    // insert "err = -ENOMEM" before this jump
instruction?
 909     }
         ...
 957     for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++) {
 958         dma_srcs[i] = dma_map_page(dev, xor_val_srcs[i], 0, PAGE_SIZE,
 959                        DMA_TO_DEVICE);
 960         if (dma_mapping_error(dev, dma_srcs[i]))
 961             goto dma_unmap;    // insert "err = -ENOMEM" before this jump
instruction?
 962     }
         ...
1003 dma_unmap:
1004     if (op == IOAT_OP_XOR) {
1005         if (dest_dma != DMA_ERROR_CODE)
1006             dma_unmap_page(dev, dest_dma, PAGE_SIZE,
1007                        DMA_FROM_DEVICE);
1008         for (i = 0; i < IOAT_NUM_SRC_TEST; i++)
1009             if (dma_srcs[i] != DMA_ERROR_CODE)
1010                 dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
1011                            DMA_TO_DEVICE);
1012     } else if (op == IOAT_OP_XOR_VAL) {
1013         for (i = 0; i < IOAT_NUM_SRC_TEST + 1; i++)
1014             if (dma_srcs[i] != DMA_ERROR_CODE)
1015                 dma_unmap_page(dev, dma_srcs[i], PAGE_SIZE,
1016                            DMA_TO_DEVICE);
1017     }
1018 free_resources:
1019     dma->device_free_chan_resources(dma_chan);
1020 out:
1021     src_idx = IOAT_NUM_SRC_TEST;
1022     while (src_idx--)
1023         __free_page(xor_srcs[src_idx]);
1024     __free_page(dest);
1025     return err;
1026 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [Bug 188821] New: Function c4iw_rdev_open() does not set error code when the call to __get_free_page() fails.
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 11:01 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=188821

            Bug ID: 188821
           Summary: Function c4iw_rdev_open() does not set error code when
                    the call to __get_free_page() fails.
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Function __get_free_page() returns a NULL pointer if there is no enough memory.
In function c4iw_rdev_open defined in file
drivers/infiniband/hw/cxgb4/device.c, __get_free_page() is called and its
return value is checked against NULL (at line 831). When the return value is
NULL, the control flow jumps to label "destroy_ocqp_pool", and returns the
value of variable err. However, after the check of variable err at line 825,
the value of err must be 0. As a result, 0 (indicates success) may be returned
even when the memory allocation fails. Maybe it is better to assign "-ENOMEM"
to err before the jump instruction at line 832. Codes related to this bug are
summarised as follows.

c4iw_rdev_open @@ drivers/infiniband/hw/cxgb4/device.c
 752 static int c4iw_rdev_open(struct c4iw_rdev *rdev)
 753 {
 754     int err;
         ...
 824     err = c4iw_ocqp_pool_create(rdev);
 825     if (err) {
 826         printk(KERN_ERR MOD "error %d initializing ocqp pool\n", err);
 827         goto destroy_rqtpool;
 828     }
 829     rdev->status_page = (struct t4_dev_status_page *)
 830                 __get_free_page(GFP_KERNEL);
 831     if (!rdev->status_page)
 832         goto destroy_ocqp_pool;  // insert "err = -ENOMEM" before this
line?
         ...
 851     return 0;
 852 destroy_ocqp_pool:
 853     c4iw_ocqp_pool_destroy(rdev);
 854 destroy_rqtpool:
 855     c4iw_rqtpool_destroy(rdev);
 856 destroy_pblpool:
 857     c4iw_pblpool_destroy(rdev);
 858 destroy_resource:
 859     c4iw_destroy_resource(&rdev->resource);
 860     return err;
 861 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [Bug 188831] New: Function ocrdma_mbx_create_ah_tbl() does not set error code when the call to dma_alloc_coherent() fails
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 11:01 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=188831

            Bug ID: 188831
           Summary: Function ocrdma_mbx_create_ah_tbl() does not set error
                    code when the call to dma_alloc_coherent() fails
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Function dma_alloc_coherent() returns a NULL pointer if there is no enough
memory. Function ocrdma_mbx_create_ah_tbl() defined in file
drivers/infiniband/hw/ocrdma/ocrdma_hw.c will return 0 on success or negative
error codes on failures. It calls function dma_alloc_coherent() twice and
checks the return values against NULL (at lines 1681 and 1686). The control
flow jumps to label "mem_err_ah" and returns the value of variable status. The
value of status is 0 (see the check of variable status at line 1645). As a
result, the caller of ocrdma_mbx_create_ah_tbl() will be misled to believe all
goes well even the memory allocation fails. Maybe it is better to assign
"-ENOMEM" to variable status before the jump instructions at lines 1682 and
1687, or simply initialize status with "-ENOMEM" rather than "0" at line 1645.
Codes related to this bug are summarised as follows.

ocrdma_mbx_create_ah_tbl @@ drivers/infiniband/hw/ocrdma/ocrdma_hw.c
1642 static int ocrdma_mbx_create_ah_tbl(struct ocrdma_dev *dev)
1643 {
1644     int i;
1645     int status = 0;  // use "int status = -ENOMEM;" ?
         ...
1653     cmd = ocrdma_init_emb_mqe(OCRDMA_CMD_CREATE_AH_TBL, sizeof(*cmd));
1654     if (!cmd)
1655         return status;
         ...
1678     dev->av_tbl.pbl.va = dma_alloc_coherent(&pdev->dev, PAGE_SIZE,
1679                         &dev->av_tbl.pbl.pa,
1680                         GFP_KERNEL);
1681     if (dev->av_tbl.pbl.va == NULL)
1682         goto mem_err;
1683 
1684     dev->av_tbl.va = dma_alloc_coherent(&pdev->dev, dev->av_tbl.size,
1685                         &pa, GFP_KERNEL);
1686     if (dev->av_tbl.va == NULL)
1687         goto mem_err_ah;
         ...
1706     return 0;
1707 
1708 mbx_err:
1709     dma_free_coherent(&pdev->dev, dev->av_tbl.size, dev->av_tbl.va,
1710               dev->av_tbl.pa);
1711     dev->av_tbl.va = NULL;
1712 mem_err_ah:
1713     dma_free_coherent(&pdev->dev, PAGE_SIZE, dev->av_tbl.pbl.va,
1714               dev->av_tbl.pbl.pa);
1715     dev->av_tbl.pbl.va = NULL;
1716     dev->av_tbl.size = 0;
1717 mem_err:
1718     kfree(cmd);
1719     return status;
1720 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [Bug 189031] New: Function mlx4_ib_query_device() does not set error codes on some failures
From: bugzilla-daemon-590EEB7GvNiWaY/ihj7yzEB+6BGkLq7r @ 2016-11-25 11:18 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

https://bugzilla.kernel.org/show_bug.cgi?id=189031

            Bug ID: 189031
           Summary: Function mlx4_ib_query_device() does not set error
                    codes on some failures
           Product: Drivers
           Version: 2.5
    Kernel Version: linux-4.9-rc6
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Infiniband/RDMA
          Assignee: drivers_infiniband-rdma-ztI5WcYan/vQLgFONoPN62D2FQJk+8+b@public.gmane.org
          Reporter: bianpan2010-AvrBmmDjM4YnDS1+zs4M5A@public.gmane.org
        Regression: No

Functions kzalloc() and kmalloc() return NULL pointers if there is no enough
memory. They are called in the function mlx4_ib_query_device() defined in file
drivers/infiniband/hw/mlx4/main.c. If they return NULL pointers (see line 458),
function mlx4_ib_query_device() returns variable err. Because variable err is
checked at line 444, the value of err may be 0 when kzalloc() and kmalloc() are
called. As a result, 0 (indicates success) may be returned even on failures.
Though these errors may occur rarely, I think it is better to set correct error
code (e.g. -ENOMEM) when kzalloc() and kmalloc() returns NULL pointers. Codes
related to this bug are summarised as follows.

mlx4_ib_query_device @@ drivers/infiniband/hw/mlx4/main.c
 426 static int mlx4_ib_query_device(struct ib_device *ibdev,
 427                 struct ib_device_attr *props,
 428                 struct ib_udata *uhw)
 429 {
 430     struct mlx4_ib_dev *dev = to_mdev(ibdev);
 431     struct ib_smp *in_mad  = NULL;
 432     struct ib_smp *out_mad = NULL;
 433     int err = -ENOMEM;
         ...
 439     if (uhw->inlen) {
 440         if (uhw->inlen < sizeof(cmd))
 441             return -EINVAL;
 442 
 443         err = ib_copy_from_udata(&cmd, uhw, sizeof(cmd));
 444         if (err)
 445             return err;
 446 
 447         if (cmd.comp_mask)
 448             return -EINVAL;
 449 
 450         if (cmd.reserved)
 451             return -EINVAL;
 452     }
 453 
 454     resp.response_length = offsetof(typeof(resp), response_length) +
 455         sizeof(resp.response_length);
 456     in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
 457     out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
 458     if (!in_mad || !out_mad)
             // The value of err may be 0. Insert "err = -ENOMEM;" ?
 459         goto out;
         ...
 567 out:
 568     kfree(in_mad);
 569     kfree(out_mad);
 570 
 571     return err;
 572 }

Thanks very much!

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Christian König @ 2016-11-25 13:06 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Dan Williams
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <5e1de9ee-34f5-136d-a07e-f949d492864f-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

Am 24.11.2016 um 18:55 schrieb Logan Gunthorpe:
> Hey,
>
> On 24/11/16 02:45 AM, Christian König wrote:
>> E.g. it can happen that PCI device A exports it's BAR using ZONE_DEVICE.
>> Not PCI device B (a SATA device) can directly read/write to it because
>> it is on the same bus segment, but PCI device C (a network card for
>> example) can't because it is on a different bus segment and the bridge
>> can't handle P2P transactions.
> Yeah, that could be an issue but in our experience we have yet to see
> it. We've tested with two separate PCI buses on different CPUs connected
> through QPI links and it works fine. (It is rather slow but I understand
> Intel has improved the bottleneck in newer CPUs than the ones we tested.)

Well Serguei send me a couple of documents about QPI when we started to 
discuss this internally as well and that's exactly one of the cases I 
had in mind when writing this.

If I understood it correctly for such systems P2P is technical possible, 
but not necessary a good idea. Usually it is faster to just use a 
bouncing buffer when the peers are a bit "father" apart.

That this problem is solved on newer hardware is good, but doesn't helps 
us at all if we at want to support at least systems from the last five 
years or so.

> It may just be older hardware that has this issue. I expect that as long
> as a failed transfer can be handled gracefully by the initiator I don't
> see a need to predetermine whether a device can see another devices memory.

I don't want to predetermine whether a device can see another devices 
memory at get_user_pages() time.

My thinking was more going into the direction of a whitelist to figure 
out during dma_map_single()/dma_map_sg() time if we should use a 
bouncing buffer or not.

Christian.

>
>
> Logan

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Christian König @ 2016-11-25 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Logan Gunthorpe
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161124164249.GD20818-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

Am 24.11.2016 um 17:42 schrieb Jason Gunthorpe:
> On Wed, Nov 23, 2016 at 06:25:21PM -0700, Logan Gunthorpe wrote:
>>
>> On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>>>> works at the page table level. We do not need special handling for
>>>>> RDMA.
>>>> I am aware of ODP but, noted by others, it doesn't provide a general
>>>> solution to the points above.
>>> How do you mean?
>> I was only saying it wasn't general in that it wouldn't work for IB
>> hardware that doesn't support ODP or other hardware  that doesn't do
>> similar things (like an NVMe drive).
> There are three cases to worry about:
>   - Coherent long lived page table mirroring (RDMA ODP MR)
>   - Non-coherent long lived page table mirroring (RDMA MR)
>   - Short lived DMA mapping (everything else)
>
> Like you say below we have to handle short lived in the usual way, and
> that covers basically every device except IB MRs, including the
> command queue on a NVMe drive.

Well a problem which wasn't mentioned so far is that while GPUs do have 
a page table to mirror the CPU page table, they usually can't recover 
from page faults.

So what we do is making sure that all memory accessed by the GPU Jobs 
stays in place while those jobs run (pretty much the same pinning you do 
for the DMA).

But since this can lock down huge amounts of memory the whole command 
submission to GPUs is bound to the memory management. So when to much 
memory would get blocked by the GPU we block further command submissions 
until the situation resolves.

>> any complex allocators (GPU or otherwise) should respect that. And that
>> seems like it should be the default way most of this works -- and I
>> think it wouldn't actually take too much effort to make it all work now
>> as is. (Our iopmem work is actually quite small and simple.)
> Yes, absolutely, some kind of page pinning like locking is a hard
> requirement.
>
>> Yeah, we've had RDMA and O_DIRECT transfers to PCIe backed ZONE_DEVICE
>> memory working for some time. I'd say it's a good fit. The main question
>> we've had is how to expose PCIe bars to userspace to be used as MRs and
>> such.
> Is there any progress on that?
>
> I still don't quite get what iopmem was about.. I thought the
> objection to uncachable ZONE_DEVICE & DAX made sense, so running DAX
> over iopmem and still ending up with uncacheable mmaps still seems
> like a non-starter to me...
>
> Serguei, what is your plan in GPU land for migration? Ie if I have a
> CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable
> - do you still allow the CPU to access it? Or do you swap it back to
> cachable memory if the CPU touches it?

Depends on the policy in command, but currently it's the other way 
around most of the time.

E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids 
reading because that is slow, the GPU in turn can access it with full speed.

When we run out of VRAM we move those allocations to system memory and 
update both the CPU as well as the GPU page tables.

So that move is transparent for both userspace as well as shaders 
running on the GPU.

> One approach might be to mmap the uncachable ZONE_DEVICE memory and
> mark it inaccessible to the CPU - DMA could still translate. If the
> CPU needs it then the kernel migrates it to system memory so it
> becomes cachable. ??

The whole purpose of this effort is that we can do I/O on VRAM directly 
without migrating everything back to system memory.

Allowing this, but then doing the migration by the first touch of the 
CPU is clearly not a good idea.

Regards,
Christian.

>
> Jason

^ permalink raw reply

* Re: [PATCH 0/2] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
From: Martin K. Petersen @ 2016-11-25 15:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Bottomley, Martin K. Petersen, Doug Ledford,
	Christoph Hellwig, Sagi Grimberg, Max Gurtovoy,
	linux-scsi@vger.kernel.org, linux-rdma@vger.kernel.org
In-Reply-To: <39d8cb23-0406-e8c3-6e3a-a467ebe41470@sandisk.com>

>>>>> "Bart" == Bart Van Assche <bart.vanassche@sandisk.com> writes:

Bart> The SRP transport code must wait until ongoing .queuecommand() /
Bart> .queue_rq() callback function invocations have finished before
Bart> reconnecting at the transport layer level and also before invoking
Bart> .terminate_rport_io(). This is already the case for the single
Bart> queue path but not yet for the scsi-mq path. This patch series
Bart> realizes the proper serialization for the scsi-mq path. Compared
Bart> to last time these patches were posted, only the patch
Bart> descriptions and one comment have been changed.

Applied to 4.10/scsi-queue.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: [PATCH 0/2] SRP transport, scsi-mq: Wait for .queue_rq() if necessary
From: Martin K. Petersen @ 2016-11-25 15:19 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, James Bottomley, Doug Ledford, Christoph Hellwig,
	Sagi Grimberg, Max Gurtovoy, linux-scsi@vger.kernel.org,
	linux-rdma@vger.kernel.org
In-Reply-To: <yq1d1hjtwz5.fsf@sermon.lab.mkp.net>

>>>>> "Martin" == Martin K Petersen <martin.petersen@oracle.com> writes:

Hi Bart,

Martin> Applied to 4.10/scsi-queue.

2/2 needs a rebase and I'm not going to do another one this late in the
cycle. Please resend this patch once we hit 4.10 rc1.

Thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-25 16:45 UTC (permalink / raw)
  To: Christian König, Jason Gunthorpe, Dan Williams
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <c60815a1-aaac-52eb-1714-66abb28bdc01-5C7GfCeVMHo@public.gmane.org>



On 25/11/16 06:06 AM, Christian König wrote:
> Well Serguei send me a couple of documents about QPI when we started to
> discuss this internally as well and that's exactly one of the cases I
> had in mind when writing this.
> 
> If I understood it correctly for such systems P2P is technical possible,
> but not necessary a good idea. Usually it is faster to just use a
> bouncing buffer when the peers are a bit "father" apart.
> 
> That this problem is solved on newer hardware is good, but doesn't helps
> us at all if we at want to support at least systems from the last five
> years or so.

Well we have been testing with Sandy Bridge, I think the problem was
supposed to be fixed in Ivy but we never tested it so I can't say what
the performance turned out to be. Ivy is nearly 5 years old. I expect
this is something that will be improved more and more with subsequent
generations.

A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox