Linux RDMA and InfiniBand development


* Re: [PATCH v2] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Souptick Joarder @ 2016-11-29  7:25 UTC (permalink / raw)
  To: Sergei Shtylyov, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Rameshwar Sahu
In-Reply-To: <20161129065931.GA3245@gnr743-HP-ZBook-15>

Please ignore this v2 patch.

On Tue, Nov 29, 2016 at 12:29 PM, Souptick Joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc().
>
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
> v2:
>   - Address comment from sergei
>     Alignment was not proper
>
>  drivers/net/ethernet/mellanox/mlx4/cmd.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index e36bebc..96cdf9a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -2679,14 +2679,13 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
>         if (!mailbox)
>                 return ERR_PTR(-ENOMEM);
>
> -       mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> -                                     &mailbox->dma);
> +       mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> +                                      &mailbox->dma);
>         if (!mailbox->buf) {
>                 kfree(mailbox);
>                 return ERR_PTR(-ENOMEM);
>         }
>
> -       memset(mailbox->buf, 0, MLX4_MAILBOX_SIZE);
>
>         return mailbox;
>  }
> --
> 1.9.1
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Souptick Joarder @ 2016-11-29  6:59 UTC (permalink / raw)
  To: sergei.shtylyov-M4DtvfQ/ZS1MRgGoP+s0PdBPR1lH4CV8,
	yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w

In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
replaced by pci_pool_zalloc().

Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
v2:
  - Address comment from sergei
    Alignment was not proper

 drivers/net/ethernet/mellanox/mlx4/cmd.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index e36bebc..96cdf9a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -2679,14 +2679,13 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
 	if (!mailbox)
 		return ERR_PTR(-ENOMEM);
 
-	mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
-				      &mailbox->dma);
+	mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
+				       &mailbox->dma);
 	if (!mailbox->buf) {
 		kfree(mailbox);
 		return ERR_PTR(-ENOMEM);
 	}
 
-	memset(mailbox->buf, 0, MLX4_MAILBOX_SIZE);
 
 	return mailbox;
 }
-- 
1.9.1


^ permalink raw reply related
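The patch above relies on the zeroing allocator doing the work of the removed memset(). A minimal user-space sketch of that equivalence follows; toy_pool_alloc(), toy_pool_zalloc(), toy_is_zeroed() and TOY_MAILBOX_SIZE are hypothetical stand-ins built on malloc(), not the kernel's pci_pool API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size; stands in for MLX4_MAILBOX_SIZE. */
#define TOY_MAILBOX_SIZE 4096

/* Stand-in for pci_pool_alloc(): the returned memory is uninitialized. */
static void *toy_pool_alloc(size_t size)
{
	return malloc(size);
}

/* Stand-in for pci_pool_zalloc(): allocate and zero in a single call. */
static void *toy_pool_zalloc(size_t size)
{
	void *buf = toy_pool_alloc(size);

	if (buf)
		memset(buf, 0, size);
	return buf;
}

/* Returns 1 if every byte of buf is zero, 0 otherwise. */
static int toy_is_zeroed(const unsigned char *buf, size_t size)
{
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < size; i++)
		if (buf[i])
			return 0;
	return 1;
}
```

A caller that switches from toy_pool_alloc() to toy_pool_zalloc() can drop its own memset(), which is the shape of the change in mlx4_alloc_cmd_mailbox().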

* Re: [PATCH] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Souptick Joarder @ 2016-11-29  6:49 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: yishaih, netdev, linux-rdma, Rameshwar Sahu
In-Reply-To: <a1b6f877-c40b-656d-2278-d32af1a93bc7@cogentembedded.com>

On Tue, Nov 29, 2016 at 12:36 AM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> Hello.
>
> On 11/28/2016 04:28 PM, Souptick Joarder wrote:
>
>> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
>> replaced by pci_pool_zalloc()
>>
>> Signed-off-by: Souptick joarder <jrdr.linux@gmail.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/cmd.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> index e36bebc..ee3bd76 100644
>> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
>> @@ -2679,14 +2679,13 @@ struct mlx4_cmd_mailbox
>> *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
>>         if (!mailbox)
>>                 return ERR_PTR(-ENOMEM);
>>
>> -       mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool,
>> GFP_KERNEL,
>> +       mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool,
>> GFP_KERNEL,
>>                                       &mailbox->dma);
>
>
>    You need to realign the continuation line now, the way it was aligned in
> the original code.
>

Ok, I will do that.





>
> MBR, Sergei
>

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Or Gerlitz @ 2016-11-29  6:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Yishai Hadas, Matan Barak
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Steve Wise, Moni Shoua
In-Reply-To: <20161128222559.GB744-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Tue, Nov 29, 2016 at 12:25 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote:
>> On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

>> > Does the mlx drivers really register ports with different capabilities
>> > as the same ib_device? I'm not sure that should be allowed.

>> mlx4 yeah (practically for the last ~10 years) for instance Eth ports
>> don't support SMI -- this goes back to the fact that mlx4 devices are
>> single PCI function with potentially two ports and each port can be

> struct ib_device is not linked to a PCI device. AFAIK mlx4 created one
> ib_device for RoCE ports and one for IB,

no, it doesn't

> or at least it should or things are already broken.

Can you elaborate on what is broken with mlx4? I suspect we
even have down there some functionality which depends on that,
but again, I would first like to hear if/what is broken - I copied the maintainer.

>> set to different link layer. But this no more holds for mlx5, these
>> devices are function-per-port and hence IB device per port.

> Since it has nothing to do with pci devices, please do this properly.

again, mlx5 does this already.

Jason, patches 8-10 which carry the functional change I want to introduce
(allow mlx5 IB devices to be created when RoCE is not supported) stand
for themselves.

As I wrote, the stack is fully functional (i.e. no errors in the IB core etc.)
when only these patches are applied. E.g. things behave in a similar manner to
all the upstream iWARP drivers that refuse to create any QP which is not RC.

I am okay with putting aside patches 1-7, which I added per Doug's request so
that user-space applications can query which QPs are supported on a device, or
with getting patches 1-7 in, whatever works better for Doug and ppl. I don't
think it's fair to ask for a rewrite of the 10-year-old IB device-ing of things
done by mlx4 just to be able to introduce this reduced functionality (raw packet
QP only) of mlx5 devices.

Or.

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-29  6:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161125190509.GB16504-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Fri, Nov 25, 2016 at 12:05:09PM -0700, Jason Gunthorpe wrote:
>On Thu, Nov 24, 2016 at 06:13:50PM -0800, Vishwanathapura, Niranjana wrote:
>
>> In order to be truly device independent the hfi_vnic ULP should not depend
>> on a device exported symbol. Instead device should register its functions
>> with the ULP. Hence the approaches a) and b).
>
>It is not device independent, it is hard linked to hfi1, just like our
>other multi-component drivers.. So don't worry about that.
>

We would like to keep the design clean and avoid any tight coupling here (our 
original design in this series tackled these).
Any strong reason not to go with a) or b) ?

Niranjana

>Jason

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-29  6:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: ira.weiny, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161124161545.GA20818-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Thu, Nov 24, 2016 at 09:15:45AM -0700, Jason Gunthorpe wrote:
>> And will move the hfi_vnic module under
>> ‘drivers/infiniband/ulp/hfi_vnic’.
>
>I would prefer drivers/net/ethernet
>
>This is clearly not a ULP since it doesn't use verbs.
>

I understand it is not using verbs, but the control path (ib_device client) is
using verbs (IB MAD).
Our preference is to keep it somewhere under drivers/infiniband. Summarizing the
reasons again here:

- VNIC control driver (ib_device client) is an IB MAD agent.
- It is purely a software construct; it encapsulates Ethernet packets in Omni-Path
packets and depends on the hfi1 driver for HW access.

Doug,
Any comments?

>Jason

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-28 22:25 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Leon Romanovsky, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Steve Wise,
	Mike Marciniszyn, Dennis Dalessandro, Lijun Ou, Wei Hu(Xavier),
	Faisal Latif, Yishai Hadas, Selvin Xavier, Devesh Sharma,
	Mitesh Ahuja, Christian Benvenuti, Dave Goodell, Moni Shoua,
	Or Gerlitz
In-Reply-To: <CAJ3xEMiv6HCu-9fi12XtafxYWu-+gNPMbnfb-A4-+FrgR6KZNA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 28, 2016 at 10:57:00PM +0200, Or Gerlitz wrote:
> On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> >
> >> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
> >> +{
> >> +     return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
> >> +}
> >
> > Does the mlx drivers really register ports with different capabilities
> > as the same ib_device? I'm not sure that should be allowed.
> 
> mlx4 yeah (practically for the last ~10 years) for instance Eth ports
> don't support SMI -- this goes back to the fact that mlx4 devices are
> single PCI function with potentially two ports and each port can be

struct ib_device is not linked to a PCI device. AFAIK mlx4 created one
ib_device for RoCE ports and one for IB, or at least it should or
things are already broken.

> set to different link layer. But this no more holds for mlx5, these
> devices are function-per-port and hence IB device per port.

Since it has nothing to do with pci devices, please do this properly.

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-28 22:24 UTC (permalink / raw)
  To: Serguei Sagalovitch
  Cc: Haggai Eran, Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1ac2f9e7-f1ee-a2c9-0134-ffaa28c706af-5C7GfCeVMHo@public.gmane.org>

On Mon, Nov 28, 2016 at 04:55:23PM -0500, Serguei Sagalovitch wrote:

> >We haven't touched this in a long time and perhaps it changed, but there
> >definitely was a callback in the PeerDirect API to allow the GPU to
> >invalidate the mapping. That's what we don't want.

> I assume that you are talking about the "invalidate_peer_memory()" callback?
> I was told that it is the "last resort" because the HCA (and driver) is not
> able to handle it in a safe manner, so it basically "aborts" everything.

If it is a last resort to save system stability then kill the impacted
process, that will release the MRs.

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-28 21:55 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Haggai Eran
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <c8c25265-9f59-f3d6-6249-07500e73930e-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:
> On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
>> As soon as PeerDirect mapping is called, the GPU must not "move"
>> such memory. It is by PeerDirect design. It is similar to how it works
>> with system memory and RDMA MR: when "get_user_pages" is called then the
>> memory is pinned.
> We haven't touched this in a long time and perhaps it changed, but there
> definitely was a callback in the PeerDirect API to allow the GPU to
> invalidate the mapping. That's what we don't want.
I assume that you are talking about the "invalidate_peer_memory()" callback?
I was told that it is the "last resort" because the HCA (and driver) is not
able to handle it in a safe manner, so it basically "aborts" everything.

^ permalink raw reply

* [PATCH infiniband-diags] scripts: Add mkey support into ibhosts, ibswitches, and ibrouters
From: Hal Rosenstock @ 2016-11-28 21:47 UTC (permalink / raw)
  To: Weiny, Ira; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org


Signed-off-by: Hal Rosenstock <hal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
diff --git a/scripts/ibhosts.in b/scripts/ibhosts.in
index fda0541..c37260c 100644
--- a/scripts/ibhosts.in
+++ b/scripts/ibhosts.in
@@ -3,19 +3,32 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 usage() {
-	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
-	    "-P ca_port -t timeout_ms]"
+	echo Usage: `basename $0` "[-h] [<topology-file> | -y mkey" \
+	    "-C ca_name -P ca_port -t timeout_ms]"
 	exit -1
 }
 
 topofile=""
 ca_info=""
+mkey="0"
 
 while [ "$1" ]; do
 	case $1 in
 	-h | --help)
 		usage
 		;;
+	-y | --m_key)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		shift
+		mkey="$1"
+		;;
 	-P | --Port | -C | --Ca | -t | --timeout)
 		case $2 in
 		-*)
@@ -44,7 +57,7 @@ done
 if [ "$topofile" ]; then
 	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover $ca_info"
+	netcmd="$IBPATH/ibnetdiscover -y $mkey $ca_info"
 fi
 
 text="`eval $netcmd`"
diff --git a/scripts/ibrouters.in b/scripts/ibrouters.in
index ae66ca4..b3e5a1d 100644
--- a/scripts/ibrouters.in
+++ b/scripts/ibrouters.in
@@ -3,19 +3,32 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 usage() {
-	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
-	    "-P ca_port -t timeout_ms]"
+	echo Usage: `basename $0` "[-h] [<topology-file> | -y mkey" \
+	    "-C ca_name -P ca_port -t timeout_ms]"
 	exit -1
 }
 
 topofile=""
 ca_info=""
+mkey="0"
 
 while [ "$1" ]; do
 	case $1 in
 	-h | --help)
 		usage
 		;;
+	-y | --m_key)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		shift
+		mkey="$1"
+		;;
 	-P | --Port | -C | --Ca | -t | --timeout)
 		case $2 in
 		-*)
@@ -44,7 +57,7 @@ done
 if [ "$topofile" ]; then
 	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover $ca_info"
+	netcmd="$IBPATH/ibnetdiscover -y $mkey $ca_info"
 fi
 
 text="`eval $netcmd`"
diff --git a/scripts/ibswitches.in b/scripts/ibswitches.in
index 0f3aa91..743f1db 100644
--- a/scripts/ibswitches.in
+++ b/scripts/ibswitches.in
@@ -3,19 +3,32 @@
 IBPATH=${IBPATH:-@IBSCRIPTPATH@}
 
 usage() {
-	echo Usage: `basename $0` "[-h] [<topology-file> | -C ca_name" \
-	    "-P ca_port -t timeout_ms]"
+	echo Usage: `basename $0` "[-h] [<topology-file> | -y mkey" \
+	    "-C ca_name -P ca_port -t timeout_ms]"
 	exit -1
 }
 
 topofile=""
 ca_info=""
+mkey="0"
 
 while [ "$1" ]; do
 	case $1 in
 	-h | --help)
 		usage
 		;;
+	-y | --m_key)
+		case $2 in
+		-*)
+			usage
+			;;
+		esac
+		if [ x$2 = x ] ; then
+			usage
+		fi
+		shift
+		mkey="$1"
+		;;
 	-P | --Port | -C | --Ca | -t | --timeout)
 		case $2 in
 		-*)
@@ -44,7 +57,7 @@ done
 if [ "$topofile" ]; then
 	netcmd="cat $topofile"
 else
-	netcmd="$IBPATH/ibnetdiscover $ca_info"
+	netcmd="$IBPATH/ibnetdiscover -y $mkey $ca_info"
 fi
 
 text="`eval $netcmd`"

^ permalink raw reply related

* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-28 21:36 UTC (permalink / raw)
  To: Serguei Sagalovitch, Jason Gunthorpe, Haggai Eran
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <0d3d56e2-4d2b-85b7-9487-b7ae2aaea610-5C7GfCeVMHo@public.gmane.org>


On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
> As soon as PeerDirect mapping is called, the GPU must not "move"
> such memory. It is by PeerDirect design. It is similar to how it works
> with system memory and RDMA MR: when "get_user_pages" is called then the
> memory is pinned.

We haven't touched this in a long time and perhaps it changed, but there
definitely was a callback in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

Logan

^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Or Gerlitz @ 2016-11-28 20:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Doug Ledford,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Steve Wise,
	Mike Marciniszyn, Dennis Dalessandro, Lijun Ou, Wei Hu(Xavier),
	Faisal Latif, Yishai Hadas, Selvin Xavier, Devesh Sharma,
	Mitesh Ahuja, Christian Benvenuti, Dave Goodell, Moni Shoua,
	Or Gerlitz
In-Reply-To: <20161128170056.GC28381-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Mon, Nov 28, 2016 at 7:00 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
>
>> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
>> +{
>> +     return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
>> +}
>
> Does the mlx drivers really register ports with different capabilities
> as the same ib_device? I'm not sure that should be allowed.

mlx4 yeah (practically for the last ~10 years) for instance Eth ports
don't support SMI -- this goes back to the fact that mlx4 devices are
single PCI function with potentially two ports and each port can be
set to different link layer. But this no more holds for mlx5, these
devices are function-per-port and hence IB device per port. I guess we
have to swallow that pill and move on as newer devices don't have this
behavior, okay?

Or.

^ permalink raw reply
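The point under discussion, one device whose ports carry different capability bits, can be sketched in plain C. This is only an illustration under stated assumptions: struct toy_device, the TOY_CAP_* values and both helpers are hypothetical stand-ins modelled on the quoted rdma_protocol_raw_packet() helper, and the real RDMA_CORE_CAP_* bits differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define TOY_CAP_PROT_IB         0x1u
#define TOY_CAP_PROT_ROCE       0x2u
#define TOY_CAP_PROT_RAW_PACKET 0x4u
#define TOY_CAP_IB_SMI          0x8u

#define TOY_MAX_PORTS 2

struct toy_port_immutable {
	uint32_t core_cap_flags;
};

struct toy_device {
	/* Index 0 is unused: this sketch numbers ports from 1. */
	struct toy_port_immutable port_immutable[TOY_MAX_PORTS + 1];
};

/* True if the given port of the device has the capability bit set. */
static bool toy_port_has_cap(const struct toy_device *device,
			     uint8_t port_num, uint32_t cap)
{
	assert(port_num >= 1 && port_num <= TOY_MAX_PORTS);
	return (device->port_immutable[port_num].core_cap_flags & cap) != 0;
}

/* Same shape as the quoted helper, using the toy types. */
static bool toy_protocol_raw_packet(const struct toy_device *device,
				    uint8_t port_num)
{
	return toy_port_has_cap(device, port_num, TOY_CAP_PROT_RAW_PACKET);
}
```

With port 1 set to TOY_CAP_PROT_IB | TOY_CAP_IB_SMI and port 2 set to TOY_CAP_PROT_ROCE, the same toy_device answers the SMI query differently per port, which is the mlx4 situation described above.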

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-28 19:35 UTC (permalink / raw)
  To: Logan Gunthorpe, Jason Gunthorpe, Haggai Eran
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <f3bb8372-ae2e-2f5e-5505-4ecaddbfb16e-OTvnGxWRz7hWk0Htik3J/w@public.gmane.org>

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:
>
> On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>> destroyed immediately.
>> That sounds horrible. How can that possibly work? What if the MR is
>> being used when the GPU decides to migrate? I would not support that
>> upstream without a lot more explanation..
> Yup, this was our experience when playing around with PeerDirect. There
> was nothing we could do if the GPU decided to invalidate the P2P
> mapping.
As soon as PeerDirect mapping is called, the GPU must not "move"
such memory. It is by PeerDirect design. It is similar to how it works
with system memory and RDMA MR: when "get_user_pages" is called then the
memory is pinned.

^ permalink raw reply

* Re: [PATCH] ethernet :mellanox :mlx5: Replace pci_pool_alloc by pci_pool_zalloc
From: Sergei Shtylyov @ 2016-11-28 19:08 UTC (permalink / raw)
  To: Souptick Joarder, saeedm-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, leonro-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161128133740.GA688@gnr743-HP-ZBook-15>

On 11/28/2016 04:37 PM, Souptick Joarder wrote:

> In alloc_cmd_box(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
>
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> index 1e639f8..d96ebd4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> @@ -1063,14 +1063,13 @@ static struct mlx5_cmd_mailbox *alloc_cmd_box(struct mlx5_core_dev *dev,
>  	if (!mailbox)
>  		return ERR_PTR(-ENOMEM);
>
> -	mailbox->buf = pci_pool_alloc(dev->cmd.pool, flags,
> +	mailbox->buf = pci_pool_zalloc(dev->cmd.pool, flags,
>  				      &mailbox->dma);

    Same here, the & needs to start under 'dev' on the broken up line.

MBR, Sergei


^ permalink raw reply

* Re: [PATCH] ethernet :mellanox :mlx4: Replace pci_pool_alloc by pci_pool_zalloc
From: Sergei Shtylyov @ 2016-11-28 19:06 UTC (permalink / raw)
  To: Souptick Joarder, yishaih-VPRAkNaXOzVWk0Htik3J/w
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	sahu.rameshwar73-Re5JQEeQqe8AvxtiuMwx3w
In-Reply-To: <20161128132819.GA500@gnr743-HP-ZBook-15>

Hello.

On 11/28/2016 04:28 PM, Souptick Joarder wrote:

> In mlx4_alloc_cmd_mailbox(), pci_pool_alloc() followed by memset will be
> replaced by pci_pool_zalloc()
>
> Signed-off-by: Souptick joarder <jrdr.linux-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx4/cmd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index e36bebc..ee3bd76 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -2679,14 +2679,13 @@ struct mlx4_cmd_mailbox *mlx4_alloc_cmd_mailbox(struct mlx4_dev *dev)
>  	if (!mailbox)
>  		return ERR_PTR(-ENOMEM);
>
> -	mailbox->buf = pci_pool_alloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
> +	mailbox->buf = pci_pool_zalloc(mlx4_priv(dev)->cmd.pool, GFP_KERNEL,
>  				      &mailbox->dma);

    You need to realign the continuation line now, the way it was aligned in
the original code.

[...]

MBR, Sergei


^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-28 19:02 UTC (permalink / raw)
  To: Haggai Eran
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <1480357179.19407.13.camel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
> > > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > > user-space and the GPU not to migrate it. If they do, the MR gets
> > > destroyed immediately.
> > That sounds horrible. How can that possibly work? What if the MR is
> > being used when the GPU decides to migrate? 
> Naturally this doesn't support migration. The GPU is expected to pin
> these pages as long as the MR lives. The MR invalidation is done only as
> a last resort to keep system correctness.

That just forces applications to handle horrible unexpected
failures. If this sort of thing is needed for correctness then OOM
kill the offending process, don't corrupt its operation.

> I think it is similar to how non-ODP MRs rely on user-space today to
> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
> non-ODP MR's pages, you can still get yourself into a data corruption
> situation (HCA sees one page and the process sees another for the same
> virtual address). The pinning that we use only guarantees the HCA's page
> won't be reused.

That is not really data corruption - the data still goes where it was
originally destined. That is an application violating the
requirements of a MR. An application cannot munmap/mremap a VMA
while a non ODP MR points to it and then keep using the MR.

That is totally different from a GPU driver wanting to mess with
translation to physical pages.

> > From what I understand we are not really talking about kernel p2p,
> > everything proposed so far is being mediated by a userspace VMA, so
> > I'd focus on making that work.

> Fair enough, although we will need both eventually, and I hope the
> infrastructure can be shared to some degree.

What use case do you see for in kernel?

Presumably in-kernel could use a vmap or something and the same basic
flow?

Jason

^ permalink raw reply
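The madvise(MADV_DONTNEED) case mentioned above can be observed from user space without any RDMA hardware. The sketch below is Linux-specific and uses a hypothetical helper, toy_byte_after_dontneed(). It shows only the process side: after the call, a private anonymous page is expected to read back as zero-filled, so a device still holding the old pinned page would be looking at different data.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for madvise() and MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Write a non-zero marker into a private anonymous page, drop the page
 * with MADV_DONTNEED, and return the byte the process reads afterwards.
 * Returns -1 if any of the system calls fails.
 */
static int toy_byte_after_dontneed(unsigned char marker)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int seen;

	assert(marker != 0);	/* a zero marker could not show the effect */
	if (page <= 0)
		return -1;

	p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = marker;
	if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(p, (size_t)page);
		return -1;
	}

	seen = p[0];	/* the next touch faults in a fresh page */
	munmap(p, (size_t)page);
	return seen;
}
```

Nothing here registers an MR; the sketch only makes visible the "process sees another page" half of the scenario that Haggai describes.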

* [PATCH infiniband-diags] ibtracert.c: Enable m_key option
From: Hal Rosenstock @ 2016-11-28 18:47 UTC (permalink / raw)
  To: Weiny, Ira
  Cc: Dan Ben Yosef, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org


m_key option was mistakenly excluded in commit
ced1cd4758638468329392099ee82c6f2650ea0c:

Author: Jim Foraker <foraker1-i2BcT+NCU+M@public.gmane.org>
Date:   Thu May 31 12:11:33 2012 -0700

    infiniband-diags: Allow specification of an mkey to use on the command line
    
    Signed-off-by: Jim Foraker <foraker1-i2BcT+NCU+M@public.gmane.org>
    Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>

Signed-off-by: Dan Ben-Yosef <danby-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Hal Rosenstock <hal-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
diff --git a/src/ibtracert.c b/src/ibtracert.c
index d99d0e1..1da3d62 100644
--- a/src/ibtracert.c
+++ b/src/ibtracert.c
@@ -804,7 +804,7 @@ int main(int argc, char **argv)
 		NULL,
 	};
 
-	ibdiag_process_opts(argc, argv, NULL, "DKy", opts, process_opt,
+	ibdiag_process_opts(argc, argv, NULL, "DK", opts, process_opt,
 			    usage_args, usage_examples);
 
 	f = stdout;

^ permalink raw reply related
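The one-character change above makes sense if the string passed to ibdiag_process_opts() is read as a list of common options to exclude, so dropping 'y' from it re-enables the m_key option. That reading is an assumption here, and toy_common_opt_enabled() below is a hypothetical helper written for this sketch, not the infiniband-diags implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of an exclusion string: a common option letter is
 * enabled unless it appears in the exclude list. A NULL list excludes
 * nothing.
 */
static bool toy_common_opt_enabled(char opt, const char *exclude)
{
	/* strchr() would match the terminating NUL, so reject it up front. */
	assert(opt != '\0');
	return exclude == NULL || strchr(exclude, opt) == NULL;
}
```

Under this model, 'y' is disabled with the old string "DKy" and enabled with the new string "DK", while 'D' and 'K' stay excluded in both cases.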

* Re: Enabling peer to peer device transactions for PCIe devices
From: Haggai Eran @ 2016-11-28 18:36 UTC (permalink / raw)
  To: jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <314e9ef7-f60e-bf6b-d488-c585f1ea60e8-5C7GfCeVMHo@public.gmane.org>

On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> > 
> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that transparently
> > requires hardware support for paging, via something like HMM.
> > 
> > ...
> Maybe I am wrong, but my understanding is that the PeerDirect logic
> basically follows the "RDMA register MR" logic
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevents us from "terminating" the
> process in the "MMU notifier" case when we are very low on memory,
> which makes it similar to (not worse than) the PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaining piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's patches
> > are a good start.
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that HMM in its current form would fit the
> requirements of the generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access, but in the RDMA MR non-ODP case
> it will not work because the location of the memory must not
> change, so the memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.

> > 
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If the memory
> could be used many times, then there is no sense in rebuilding and
> programming the s/g tables each time if the location of the memory has
> not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Logan Gunthorpe @ 2016-11-28 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161128165751.GB28381-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>



On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>> On PeerDirect, we have some kind of a middle-ground solution for pinning
>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>> user-space and the GPU not to migrate it. If they do, the MR gets
>> destroyed immediately.
> 
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? I would not support that
> upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping. It just meant the application would fail or need complicated
logic to detect this and redo just about everything. And given that it
was a reasonably rare occurrence during development it probably means
not a lot of applications will be developed to handle it and most would
end up being randomly broken in environments with memory pressure.

Logan

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Haggai Eran @ 2016-11-28 18:19 UTC (permalink / raw)
  To: jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org
  Cc: John.Bridgman-5C7GfCeVMHo@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	Felix.Kuehling-5C7GfCeVMHo@public.gmane.org,
	serguei.sagalovitch-5C7GfCeVMHo@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Paul.Blinzer-5C7GfCeVMHo@public.gmane.org,
	ben.sander-5C7GfCeVMHo@public.gmane.org,
	Suravee.Suthikulpanit-5C7GfCeVMHo@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Alexander.Deucher-5C7GfCeVMHo@public.gmane.org, Max Gurtovoy,
	christian.koenig-5C7GfCeVMHo@public.gmane.org,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <20161128165751.GB28381-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things like
> > memory reclaim. If you have user-space access to the device's queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? 
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to keep system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.
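
The process side of that scenario can be reproduced without any RDMA
hardware. The following is a minimal C sketch, assuming Linux and a
private anonymous mapping. The function name is made up for illustration
and no MR is registered here, so only the process's view is shown. After
madvise(MADV_DONTNEED) the next access to the range is served from a fresh
zero-filled page, which is why a device still holding the old pinned page
would no longer agree with the process.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of any RDMA API.
 * Returns 1 if the page reads back as zero after MADV_DONTNEED (the
 * process now sees a different page), 0 if the old contents survived,
 * and -1 on error.
 */
int page_replaced_after_dontneed(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;
	int replaced;

	if (page <= 0)
		return -1;

	buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;

	/* Fault the page in and give it recognizable contents. */
	memset(buf, 0xab, (size_t)page);

	/* Tell the kernel the range is no longer needed. */
	if (madvise(buf, (size_t)page, MADV_DONTNEED) != 0) {
		munmap(buf, (size_t)page);
		return -1;
	}

	/* The next access is served from a new zero-filled page. */
	replaced = (buf[0] == 0 && buf[page - 1] == 0);

	munmap(buf, (size_t)page);
	return replaced;
}
```

With a non-ODP MR registered on the buffer, the pinned page holding 0xab
would stay with the HCA while the process reads zeros at the same virtual
address.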

> I would not support that
> upstream without a lot more explanation..
> 
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
> 
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.

^ permalink raw reply

* RE: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Steve Wise @ 2016-11-28 17:08 UTC (permalink / raw)
  To: 'Jason Gunthorpe', 'Leon Romanovsky'
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, 'Steve Wise',
	'Mike Marciniszyn', 'Dennis Dalessandro',
	'Lijun Ou', 'Wei Hu(Xavier)',
	'Faisal Latif', 'Yishai Hadas',
	'Selvin Xavier', 'Devesh Sharma',
	'Mitesh Ahuja', 'Christian Benvenuti',
	'Dave Goodell', 'Moni Shoua',
	'Or Gerlitz'
In-Reply-To: <20161128170056.GC28381-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

> 
> On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:
> 
> > +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
> > +{
> > +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
> > +}
> 
> Do the mlx drivers really register ports with different capabilities
> as the same ib_device? I'm not sure that should be allowed.
> 
> I keep talking about how we need to get rid of the port_num in these
> sorts of places because it makes no sense...
> 

I agree.   Requiring the port number has implications that ripple up into the
rdma-rw api as well...
 


^ permalink raw reply

* Re: [PATCH rdma-next 01/10] IB/core: Add raw packet protocol
From: Jason Gunthorpe @ 2016-11-28 17:00 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Steve Wise, Mike Marciniszyn,
	Dennis Dalessandro, Lijun Ou, Wei Hu(Xavier), Faisal Latif,
	Yishai Hadas, Selvin Xavier, Devesh Sharma, Mitesh Ahuja,
	Christian Benvenuti, Dave Goodell, Moni Shoua, Or Gerlitz
In-Reply-To: <1480258296-27032-2-git-send-email-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

On Sun, Nov 27, 2016 at 04:51:27PM +0200, Leon Romanovsky wrote:

> +static inline bool rdma_protocol_raw_packet(const struct ib_device *device, u8 port_num)
> +{
> +	return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_RAW_PACKET;
> +}

Do the mlx drivers really register ports with different capabilities
as the same ib_device? I'm not sure that should be allowed.

I keep talking about how we need to get rid of the port_num in these
sorts of places because it makes no sense...
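
To make the question concrete, here is a small user-space C sketch of the
per-port lookup and of a check for whether every port of a device
advertises the same capability flags. All names below (demo_device,
demo_port_immutable, DEMO_CAP_PROT_RAW_PACKET and the two functions) are
hypothetical stand-ins, not the kernel's definitions. The flag value is
arbitrary and ports are indexed from 0 here purely for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit only; not the kernel's RDMA_CORE_CAP_PROT_RAW_PACKET. */
#define DEMO_CAP_PROT_RAW_PACKET 0x01000000u

struct demo_port_immutable {
	uint32_t core_cap_flags;
};

struct demo_device {
	size_t num_ports;
	const struct demo_port_immutable *port_immutable; /* 0-based here */
};

/* Per-port query, mirroring the shape of the helper in the patch. */
bool demo_protocol_raw_packet(const struct demo_device *dev, size_t port)
{
	return (dev->port_immutable[port].core_cap_flags &
		DEMO_CAP_PROT_RAW_PACKET) != 0;
}

/*
 * True when every port carries identical capability flags, i.e. when a
 * per-device query without a port number would lose no information.
 */
bool demo_ports_uniform(const struct demo_device *dev)
{
	for (size_t i = 1; i < dev->num_ports; i++) {
		if (dev->port_immutable[i].core_cap_flags !=
		    dev->port_immutable[0].core_cap_flags)
			return false;
	}
	return true;
}
```

If demo_ports_uniform() held for every registered device, the port
argument could in principle be dropped from such helpers.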

Jason

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Jason Gunthorpe @ 2016-11-28 16:57 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix, Serguei Sagalovitch, Blinzer, Paul,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Sander, Ben, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Christian König,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <d9e064a0-9c47-3e41-3154-cece8c70a119-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:

> > Like in ODP, MMU notifiers/HMM are used to monitor for translation
> > changes. If a change comes in the GPU driver checks if an executing
> > command is touching those pages and blocks the MMU notifier until the
> > command flushes, then unfaults the page (blocking future commands) and
> > unblocks the mmu notifier.

> I think blocking mmu notifiers against something that is basically
> controlled by user-space can be problematic. This can block things like
> memory reclaim. If you have user-space access to the device's queues,
> user-space can block the mmu notifier forever.

Right, I mentioned that..

> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

I know people don't like requiring new hardware, but in this case we
really do need ODP hardware to get all the semantics people want..

> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that. Using

From what I understand we are not really talking about kernel p2p,
everything proposed so far is being mediated by a userspace VMA, so
I'd focus on making that work.

Jason

^ permalink raw reply

* Re: [PATCH rdma-next 1/4] IB/mlx4: Fix out-of-range array index in destroy qp flow
From: Bart Van Assche @ 2016-11-28 16:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Jack Morgenstein
In-Reply-To: <20161128083107.GB6380-2ukJVAZIZ/Y@public.gmane.org>

On 11/28/2016 12:31 AM, Leon Romanovsky wrote:
> I'm extra cautious with stable tags and prefer to finalize the stable
> checker in my submission scripts first, before adding them manually, and
> I plan to use it in the next kernel release.

Hello Leon,

Thanks for explaining your workflow. However, the question remains
whether or not stable tags should be added to the patches in this series.

Bart.

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-28 14:48 UTC (permalink / raw)
  To: Haggai Eran, Jason Gunthorpe, Christian König
  Cc: Bridgman, John,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	Kuehling, Felix,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Blinzer, Paul, Suthikulpanit, Suravee,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Deucher, Alexander, Max Gurtovoy, Sander, Ben,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <d9e064a0-9c47-3e41-3154-cece8c70a119-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On 2016-11-27 09:02 AM, Haggai Eran wrote
> On PeerDirect, we have some kind of a middle-ground solution for pinning
> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> user-space and the GPU not to migrate it. If they do, the MR gets
> destroyed immediately. This should work on legacy devices without ODP
> support, and allows the system to safely terminate a process that
> misbehaves. The downside of course is that it cannot transparently
> migrate memory but I think for user-space RDMA doing that transparently
> requires hardware support for paging, via something like HMM.
>
> ...
Maybe I am wrong, but my understanding is that the PeerDirect logic basically
follows the "RDMA register MR" logic, so basically nothing prevents us from
"terminating" the process in the "MMU notifier" case when we are very low on
memory, which makes it similar to (not worse than) the PeerDirect case.
>> I'm hearing most people say ZONE_DEVICE is the way to handle this,
>> which means the missing remaining piece for RDMA is some kind of DMA
>> core support for p2p address translation..
> Yes, this is definitely something we need. I think Will Davis's patches
> are a good start.
>
> Another thing I think is that while HMM is good for user-space
> applications, for kernel p2p use there is no need for that.
About HMM: I do not think that HMM in its current form would fit the
requirements of the generic P2P transfer case. My understanding is that at
the current stage HMM is good for "caching" system memory
in device memory for fast GPU access, but in the RDMA MR non-ODP case
it will not work because the location of the memory must not
change, so the memory should be allocated directly in PCIe memory.
> Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
> pages for the short duration as you wrote above could work fine for
> kernel uses in which we can guarantee they are short.
Potentially there is another issue related to pin/unpin. If the memory could
be used many times, then there is no sense in rebuilding and programming the
s/g tables each time if the location of the memory has not changed.

^ permalink raw reply
