Linux RDMA and InfiniBand development

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

* Re: Enabling peer to peer device transactions for PCIe devices
From: Serguei Sagalovitch @ 2016-11-22 18:59 UTC (permalink / raw)
  To: Dan Williams, Deucher, Alexander
  Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kuehling, Felix, Bridgman, John,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee,
	Blinzer, Paul,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <CAPcyv4i_5r2RVuV4F6V3ETbpKsf8jnMyQviZ7Legz3N4-v+9Og-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Dan,

I personally like "device-DAX" idea but my concerns are:

-  How well it will co-exists with the  DRM infrastructure / 
implementations
    in part dealing with CPU pointers?

-  How well we will be able to handle case when we need to "move"/"evict"
    memory/data to the new location so CPU pointer should point to the 
new physical location/address
     (and may be not in PCI device memory at all)?


Sincerely yours,
Serguei Sagalovitch

On 2016-11-22 01:11 PM, Dan Williams wrote:
> On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
> <Alexander.Deucher-5C7GfCeVMHo@public.gmane.org> wrote:
>> This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.  Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory.  Also in cases where both devices are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc.  Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
>>
>> Some use cases:
>> 1. Storage devices streaming directly to GPU device memory
>> 2. GPU device memory to GPU device memory streaming
>> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
>> 4. DVB/V4L/SDI devices streaming directly to storage devices
>>
>> Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.
>> - Device memory will be registered with Linux memory sub-system by created corresponding struct page structures for device memory
>> - get_user_pages_fast() will  return corresponding struct pages when CPU address points to the device memory
>> - put_page() will deal with struct pages for device memory
>>
> [..]
>> 4. iopmem
>> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)
> The change I suggest for this particular approach is to switch to
> "device-DAX" [1]. I.e. a character device for establishing DAX
> mappings rather than a block device plus a DAX filesystem. The pro of
> this approach is standard user pointers and struct pages rather than a
> new construct. The con is that this is done via an interface separate
> from the existing gpu and storage device. For example it would require
> a /dev/dax instance alongside a /dev/nvme interface, but I don't see
> that as a significant blocking concern.
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html

^ permalink raw reply

* Re: Enabling peer to peer device transactions for PCIe devices
From: Dan Williams @ 2016-11-22 18:11 UTC (permalink / raw)
  To: Deucher, Alexander
  Cc: Sagalovitch, Serguei,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Kuehling, Felix, Bridgman, John,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org,
	Koenig, Christian, Sander, Ben, Suthikulpanit, Suravee,
	Blinzer, Paul,
	Linux-media-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <MWHPR12MB169484839282E2D56124FA02F7B50-Gy0DoCVfaSW4WA4dJ5YXGAdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>

On Mon, Nov 21, 2016 at 12:36 PM, Deucher, Alexander
<Alexander.Deucher-5C7GfCeVMHo@public.gmane.org> wrote:
> This is certainly not the first time this has been brought up, but I'd like to try and get some consensus on the best way to move this forward.  Allowing devices to talk directly improves performance and reduces latency by avoiding the use of staging buffers in system memory.  Also in cases where both devices are behind a switch, it avoids the CPU entirely.  Most current APIs (DirectGMA, PeerDirect, CUDA, HSA) that deal with this are pointer based.  Ideally we'd be able to take a CPU virtual address and be able to get to a physical address taking into account IOMMUs, etc.  Having struct pages for the memory would allow it to work more generally and wouldn't require as much explicit support in drivers that wanted to use it.
>
> Some use cases:
> 1. Storage devices streaming directly to GPU device memory
> 2. GPU device memory to GPU device memory streaming
> 3. DVB/V4L/SDI devices streaming directly to GPU device memory
> 4. DVB/V4L/SDI devices streaming directly to storage devices
>
> Here is a relatively simple example of how this could work for testing.  This is obviously not a complete solution.
> - Device memory will be registered with Linux memory sub-system by created corresponding struct page structures for device memory
> - get_user_pages_fast() will  return corresponding struct pages when CPU address points to the device memory
> - put_page() will deal with struct pages for device memory
>
[..]
> 4. iopmem
> iopmem : A block device for PCIe memory (https://lwn.net/Articles/703895/)

The change I suggest for this particular approach is to switch to
"device-DAX" [1]. I.e. a character device for establishing DAX
mappings rather than a block device plus a DAX filesystem. The pro of
this approach is standard user pointers and struct pages rather than a
new construct. The con is that this is done via an interface separate
from the existing gpu and storage device. For example it would require
a /dev/dax instance alongside a /dev/nvme interface, but I don't see
that as a significant blocking concern.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-October/007496.html

^ permalink raw reply

* RE: [RFC] Avoid running out of local port in RDMA_CM
From: Hefty, Sean @ 2016-11-22 18:04 UTC (permalink / raw)
  To: Moni Shoua; +Cc: linux-rdma, Doug Ledford
In-Reply-To: <CAG9sBKMfGWv7-wgoSNxVqMs7TktcJ3a4J0QnL+TmPSm1kaQ28g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

> > I believe that any solution here should mimic the TCP/IP stack as
> closely as possible.  So I would rule out the re-use of a single port
> for all active connections.
> >
> > I think TCP matches on the full tuple <src port, src ip, dst port,
> dst ip>.  We should be safe to re-use port numbers as long as some
> other portion of the tuple changes.  Maybe that can be added as part of
> the port reservation/checking?
> >
> 
> At first the thought was to reuse ports as long as the dest IP between
> rdma_id is different but is this complication really necessary?
> RDMA_CM mimics socket API but wire protocol is different and source
> port has no role in transporting a packet from QP to QP. Do you see a
> real risk in reusing a port unconditionally?

The port number is carried in the packet and provided as the source address to the other side. By re-using a single port number, a server will see multiple connections all reporting the same source address (port + src IP).

The proposal will break iWarp. 

^ permalink raw reply

* Re: [RFC] Avoid running out of local port in RDMA_CM
From: Moni Shoua @ 2016-11-22 17:52 UTC (permalink / raw)
  To: Hefty, Sean; +Cc: linux-rdma, Doug Ledford
In-Reply-To: <1828884A29C6694DAF28B7E6B8A82373AB0B611D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>

On Tue, Nov 22, 2016 at 7:38 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> I believe that any solution here should mimic the TCP/IP stack as closely as possible.  So I would rule out the re-use of a single port for all active connections.
>
> I think TCP matches on the full tuple <src port, src ip, dst port, dst ip>.  We should be safe to re-use port numbers as long as some other portion of the tuple changes.  Maybe that can be added as part of the port reservation/checking?
>

At first the thought was to reuse ports as long as the dest IP between
rdma_id is different but is this complication really necessary?
RDMA_CM mimics socket API but wire protocol is different and source
port has no role in transporting a packet from QP to QP. Do you see a
real risk in reusing a port unconditionally?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [RFC] Avoid running out of local port in RDMA_CM
From: Hefty, Sean @ 2016-11-22 17:38 UTC (permalink / raw)
  To: Moni Shoua; +Cc: linux-rdma, Doug Ledford
In-Reply-To: <CAG9sBKMMuj5Qf7_jgL_EAwOAVAFZCp5Wf=NytqsHAezrys97pA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

I believe that any solution here should mimic the TCP/IP stack as closely as possible.  So I would rule out the re-use of a single port for all active connections.

I think TCP matches on the full tuple <src port, src ip, dst port, dst ip>.  We should be safe to re-use port numbers as long as some other portion of the tuple changes.  Maybe that can be added as part of the port reservation/checking?

> Introduction
> -----------------------------------------------------------------------
> ---------
> Like TCP/IP sockets, RDMA_CM connection identifier (rdma_id) is
> associated with
> local and remote addresses and local and remote ports. Values for
> these attributes
> are assigned during the life cycle of the rdma_id. In this RFC we focus
> on the
> value that is given to the local port of the rdma_id.
> While in TCP/IP protocol port numbers are part of the transport header,
> in
> InfiniBand they don't have a place. The way for an application to
> connect to
> a specific service is to use the communication manager and use a known
> Service
> ID (see CHAPTER 12:COMMUNICATION MANAGEMENT in the InfiniBandTM
> Architecture
> Specification Volume 1). The RDMA IP CM Service, which provides support
> for
> a socket-like connection model for RDMA-aware ULPs, replaces the
> Service ID with a
> 16 bit port number which is used as an identifier for a service.
> 
> The problem
> -----------------------------------------------------------------------
> ---------
> RDMA_CM requires binding of a connection identifier (rdma_id) to a
> local port.
> The passive side, the one calling rdma_listen(), usually binds
> explicitly to a
> well-known port. The active side, the one calling rmda_connect(),
> binds implicitly
> to a random port that rdma_cm chooses for it. This makes sense if we
> keep in mind
> that the port number is a way to identify a service. Binding to a port
> removes it
> from the pool of available ports until the rdma_id is destroyed at
> which time the
> port is returned to the pool. The problem starts when number of
> rdma_ids is larger
> than the number of available ports. The most likely scenario for this
> to happen is a
> node with many clients trying to connect to remote services. When the
> available port
> pool is empty the call to rdma_resolve_addr() fails when a free port
> number is requested
> from the pool.
> 
> Suggested Solution
> -----------------------------------------------------------------------
> ---------
> Extending the size of the pool is out of the question since we must to
> keep the
> 16 bit width of the port number to avoid backward compatibility issues.
> 1. Port number is a parameter to the function that generates Service
> ID.
> 2. Port number is part of the private data of the request MAD (see
> Annex
>    A11: RDMA IP CM Service in the InfiniBandTM Architecture
> Specification Volume 1)
> The other alternative is to reuse port numbers. Since port numbers are
> not part of the
> InfiniBand transport header we don't need to worry about wire protocol
> issues. Also,
> since binding to a port  on the active side doesn't create a conflict
> in service
> identification (since no one listens to active side rdma_id), it is
> safe to reuse a
> port number there when the port pool is empty.
> The suggested solution is to reserve one port as a global port for
> reuse and assign
> it under the following conditions
> 1. The request for binding is implicit and for any port (via
> cma_alloc_any_port())
> 2. The pool is empty
> 3. The ULP allows it
> 
> Risks
> -----------------------------------------------------------------------
> ---------
> RDMA_CM puts the local port number in the private data section of the
> CM request
> MAD. If this field is observed by an application or a traffic analyzer
> there
> might be a confusion. A way to minimize the risk is to reuse a port
> only if
> application allows it (say by setting an option to the rdma_id)

^ permalink raw reply

* Re: [PATCH 6/7] IB/rxe: Avoid missed completions in the CM/MAD
From: Moni Shoua @ 2016-11-22 17:14 UTC (permalink / raw)
  To: Andrew Boyer; +Cc: linux-rdma
In-Reply-To: <1479479809-10798-6-git-send-email-andrew.boyer-8PEkshWhKlo@public.gmane.org>

On Fri, Nov 18, 2016 at 4:36 PM, Andrew Boyer <andrew.boyer-8PEkshWhKlo@public.gmane.org> wrote:
> The MAD code uses the IB_CQ_REPORT_MISSED_EVENTS flag to avoid a
> race between posting CQEs and arming the CQ. Without this fix, the
> last completion might be left on the CQ, hanging the kthread
> waiting on MAD to complete.
> See ib_cq_poll_work().
>
Looks OK but I would edit the commit message a bit. This fix is
relevant not only for MAD and not only for workqueue polling context.
For example, iSER allocates CQ with SOFTIRQ polling context and is
also exposed to this bug (see ib_poll_handler)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-22 17:04 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Doug Ledford, linux-rdma, netdev, Dennis Dalessandro
In-Reply-To: <20161122015304.GB67988@knc-06.sc.intel.com>

On Mon, Nov 21, 2016 at 05:53:04PM -0800, Vishwanathapura, Niranjana wrote:
> There are many example drivers in kernel which are using bus_register() in
> an initcall.

There really are not, certainly not in major subsystems.

> We could add a custom Interface between HFI1 driver and hfi_vnic drivers
> without involving a bus.

hfi is already registering on the infiniband class, just use that.

> But using the existing bus model gave a lot of in-built flexibility in
> decoupling devices from the drivers.

If you want to have your own bus then you need your own hfi
subsystem. drivers/infiniband is not a dumping ground..

Jason

^ permalink raw reply

* RE: [PATCH 1/3] IB/mad: Fix an array index check
From: Weiny, Ira @ 2016-11-22 16:48 UTC (permalink / raw)
  To: Bart Van Assche, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Hefty, Sean
In-Reply-To: <396b3fe2-6a84-efed-07de-3e6381009ad1-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1726 bytes --]

> 
> The array ib_mad_mgmt_class_table.method_table has MAX_MGMT_CLASS
> (80) elements. Hence compare the array index with that value instead of with
> IB_MGMT_MAX_METHODS (128). This patch avoids that Coverity reports the
> following:
> 
> Overrunning array class->method_table of 80 8-byte elements at element index
> 127 (byte offset 1016) using index convert_mgmt_class(mad_hdr-
> >mgmt_class) (which evaluates to 127).
> 
> Fixes: commit b7ab0b19a85f ("IB/mad: Verify mgmt class in received MADs")
> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
> Cc: Sean Hefty <sean.hefty@intel.com>
> Cc: <stable@vger.kernel.org>

Thanks!

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> ---
>  drivers/infiniband/core/mad.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c index
> 40cbd6b..2395fe2 100644
> --- a/drivers/infiniband/core/mad.c
> +++ b/drivers/infiniband/core/mad.c
> @@ -1746,7 +1746,7 @@ find_mad_agent(struct ib_mad_port_private
> *port_priv,
>  			if (!class)
>  				goto out;
>  			if (convert_mgmt_class(mad_hdr->mgmt_class) >=
> -			    IB_MGMT_MAX_METHODS)
> +			    ARRAY_SIZE(class->method_table))
>  				goto out;
>  			method = class->method_table[convert_mgmt_class(
>  							mad_hdr-
> >mgmt_class)];
> --
> 2.10.2
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
N‹§²æìr¸›yúèšØb²X¬¶Ç§vØ^–)Þº{.nÇ+‰·¥Š{±ÙšŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hš‹»öì\x17/oSc¾™Ú³9˜uÀ¦æå‰È&jw¨®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿïêäz¹Þ–Šàþf£¢·hšˆ§~ˆmš

^ permalink raw reply

* Re: [PATCH 4/7] IB/rxe: Unblock loopback by moving skb_out increment
From: Moni Shoua @ 2016-11-22 16:13 UTC (permalink / raw)
  To: Andrew Boyer, Doug Ledford; +Cc: linux-rdma
In-Reply-To: <1479479809-10798-4-git-send-email-andrew.boyer-8PEkshWhKlo@public.gmane.org>

Acked-by: Moni Shoua <monis-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

On Fri, Nov 18, 2016 at 4:36 PM, Andrew Boyer <andrew.boyer-8PEkshWhKlo@public.gmane.org> wrote:
> skb_out is decremented in rxe_skb_tx_dtor(), which is not called in the
> loopback() path. Move the increment to the send() path rather than
> rxe_xmit_packet().
>
> Signed-off-by: Andrew Boyer <andrew.boyer-8PEkshWhKlo@public.gmane.org>
> ---
>  drivers/infiniband/sw/rxe/rxe_loc.h | 2 --
>  drivers/infiniband/sw/rxe/rxe_net.c | 2 ++
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 73849a5a..efe4c6a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -266,8 +266,6 @@ static inline int rxe_xmit_packet(struct rxe_dev *rxe, struct rxe_qp *qp,
>                 return err;
>         }
>
> -       atomic_inc(&qp->skb_out);
> -
>         if ((qp_type(qp) != IB_QPT_RC) &&
>             (pkt->mask & RXE_END_MASK)) {
>                 pkt->wqe->state = wqe_state_done;
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index ffff5a5..332ce52 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -455,6 +455,8 @@ static int send(struct rxe_dev *rxe, struct rxe_pkt_info *pkt,
>                 return -EAGAIN;
>         }
>
> +       if (pkt->qp)
> +               atomic_inc(&pkt->qp->skb_out);
>         kfree_skb(skb);
>
>         return 0;
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] i40iw: Fix incorrect assignment of SQ head
From: Henry Orosco @ 2016-11-22 15:44 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	e1000-rdma-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Henry Orosco

The SQ head is incorrectly incremented when the number
of WQEs required is greater than the number available.
The fix is to use the I40IW_RING_MOV_HEAD_BY_COUNT
macro. This checks for the SQ full condition first and
only if SQ has room for the request, then we move the
head appropriately.

Signed-off-by: Shiraz Saleem <shiraz.saleem-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Henry Orosco <henry.orosco-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/i40iw/i40iw_uk.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_uk.c b/drivers/infiniband/hw/i40iw/i40iw_uk.c
index 4d28c3c..3987cd8 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_uk.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_uk.c
@@ -175,12 +175,10 @@ u64 *i40iw_qp_get_next_send_wqe(struct i40iw_qp_uk *qp,
 		if (!*wqe_idx)
 			qp->swqe_polarity = !qp->swqe_polarity;
 	}
-
-	for (i = 0; i < wqe_size / I40IW_QP_WQE_MIN_SIZE; i++) {
-		I40IW_RING_MOVE_HEAD(qp->sq_ring, ret_code);
-		if (ret_code)
-			return NULL;
-	}
+	I40IW_RING_MOVE_HEAD_BY_COUNT(qp->sq_ring,
+				      wqe_size / I40IW_QP_WQE_MIN_SIZE, ret_code);
+	if (ret_code)
+		return NULL;
 
 	wqe = qp->sq_base[*wqe_idx].elem;
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH 2/7] IB/rxe: Advance the consumer pointer before posting the CQE
From: Yonatan Cohen @ 2016-11-22 15:24 UTC (permalink / raw)
  To: Andrew Boyer, monis-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479479809-10798-2-git-send-email-andrew.boyer-8PEkshWhKlo@public.gmane.org>

On 11/18/2016 4:36 PM, Andrew Boyer wrote:
> A simple userspace application might poll the CQ, find a completion,
> and then attempt to post a new WQE to the SQ. A spurious error can
> occur if the userspace application detects a full SQ in the instant
> before the kernel is able to advance the SQ consumer pointer.
>
> This is noticeable when using single-entry SQs with ibv_rc_pingpong()
> if lots of kernel and userspace library debugging is enabled.
>
> Signed-off-by: Andrew Boyer <andrew.boyer-8PEkshWhKlo@public.gmane.org>
> ---
>  drivers/infiniband/sw/rxe/rxe_comp.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index 6c5e29d..d46c49b 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -420,11 +420,12 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>  	    (wqe->wr.send_flags & IB_SEND_SIGNALED) ||
>  	    (qp->req.state == QP_STATE_ERROR)) {
>  		make_send_cqe(qp, wqe, &cqe);
> +		advance_consumer(qp->sq.queue);
>  		rxe_cq_post(qp->scq, &cqe, 0);
> +	} else {
> +		advance_consumer(qp->sq.queue);
>  	}
>
> -	advance_consumer(qp->sq.queue);
> -
>  	/*
>  	 * we completed something so let req run again
>  	 * if it is trying to fence
>
Reviewed-by: Yonatan Cohen <yonatanc-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 1/7] IB/rxe: Allocate enough space for an IPv6 addr
From: Yonatan Cohen @ 2016-11-22 15:21 UTC (permalink / raw)
  To: Andrew Boyer, monis-VPRAkNaXOzVWk0Htik3J/w,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1479479809-10798-1-git-send-email-andrew.boyer-8PEkshWhKlo@public.gmane.org>

On 11/18/2016 4:36 PM, Andrew Boyer wrote:
> Avoid smashing the stack when an ICRC error occurs on an IPv6 network.
>
> Signed-off-by: Andrew Boyer <andrew.boyer-8PEkshWhKlo@public.gmane.org>
> ---
>  drivers/infiniband/sw/rxe/rxe_recv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index 46f0628..b40ab8d 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -391,7 +391,7 @@ int rxe_rcv(struct sk_buff *skb)
>  			     payload_size(pkt));
>  	calc_icrc = cpu_to_be32(~calc_icrc);
>  	if (unlikely(calc_icrc != pack_icrc)) {
> -		char saddr[sizeof(struct in6_addr)];
> +		char saddr[64];
>
>  		if (skb->protocol == htons(ETH_P_IPV6))
>  			sprintf(saddr, "%pI6", &ipv6_hdr(skb)->saddr);
>
you fixed a bug here but i think the following would be better
than hard coding 64 bytes on the stack

--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -391,16 +391,14 @@ int rxe_rcv(struct sk_buff *skb)
                              payload_size(pkt));
         calc_icrc = cpu_to_be32(~calc_icrc);
         if (unlikely(calc_icrc != pack_icrc)) {
-               char saddr[sizeof(struct in6_addr)];

                 if (skb->protocol == htons(ETH_P_IPV6))
-                       sprintf(saddr, "%pI6", &ipv6_hdr(skb)->saddr);
+                       pr_warn_ratelimited("bad ICRC from %pI6\n", 
&ipv6_hdr(skb)->saddr);
                 else if (skb->protocol == htons(ETH_P_IP))
-                       sprintf(saddr, "%pI4", &ip_hdr(skb)->saddr);
+                       pr_warn_ratelimited("bad ICRC from %pI4\n", 
&ip_hdr(skb)->saddr);
                 else
-                       sprintf(saddr, "unknown");
+                       pr_warn_ratelimited("bad ICRC from unknown\n");

-               pr_warn_ratelimited("bad ICRC from %s\n", saddr);
                 goto drop;
         }

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [RFC] Avoid running out of local port in RDMA_CM
From: Moni Shoua @ 2016-11-22 10:12 UTC (permalink / raw)
  To: Sean Hefty; +Cc: linux-rdma, Doug Ledford

Introduction
--------------------------------------------------------------------------------
Like TCP/IP sockets, RDMA_CM connection identifier (rdma_id) is associated with
local and remote addresses and local and remote ports. Values for
these attributes
are assigned during the life cycle of the rdma_id. In this RFC we focus on the
value that is given to the local port of the rdma_id.
While in TCP/IP protocol port numbers are part of the transport header, in
InfiniBand they don't have a place. The way for an application to connect to
a specific service is to use the communication manager and use a known Service
ID (see CHAPTER 12:COMMUNICATION MANAGEMENT in the InfiniBandTM Architecture
Specification Volume 1). The RDMA IP CM Service, which provides support for
a socket-like connection model for RDMA-aware ULPs, replaces the
Service ID with a
16 bit port number which is used as an identifier for a service.

The problem
--------------------------------------------------------------------------------
RDMA_CM requires binding of a connection identifier (rdma_id) to a local port.
The passive side, the one calling rdma_listen(), usually binds explicitly to a
well-known port. The active side, the one calling rmda_connect(),
binds implicitly
to a random port that rdma_cm chooses for it. This makes sense if we
keep in mind
that the port number is a way to identify a service. Binding to a port
removes it
from the pool of available ports until the rdma_id is destroyed at
which time the
port is returned to the pool. The problem starts when number of
rdma_ids is larger
than the number of available ports. The most likely scenario for this
to happen is a
node with many clients trying to connect to remote services. When the
available port
pool is empty the call to rdma_resolve_addr() fails when a free port
number is requested
from the pool.

Suggested Solution
--------------------------------------------------------------------------------
Extending the size of the pool is out of the question since we must to keep the
16 bit width of the port number to avoid backward compatibility issues.
1. Port number is a parameter to the function that generates Service ID.
2. Port number is part of the private data of the request MAD (see Annex
   A11: RDMA IP CM Service in the InfiniBandTM Architecture
Specification Volume 1)
The other alternative is to reuse port numbers. Since port numbers are
not part of the
InfiniBand transport header we don't need to worry about wire protocol
issues. Also,
since binding to a port  on the active side doesn't create a conflict in service
identification (since no one listens to active side rdma_id), it is
safe to reuse a
port number there when the port pool is empty.
The suggested solution is to reserve one port as a global port for
reuse and assign
it under the following conditions
1. The request for binding is implicit and for any port (via
cma_alloc_any_port())
2. The pool is empty
3. The ULP allows it

Risks
--------------------------------------------------------------------------------
RDMA_CM puts the local port number in the private data section of the CM request
MAD. If this field is observed by an application or a traffic analyzer there
might be a confusion. A way to minimize the risk is to reuse a port only if
application allows it (say by setting an option to the rdma_id)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 2/3] IPoIB: Avoid reading an uninitialized member variable
From: Leon Romanovsky @ 2016-11-22  7:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Erez Shitrit
In-Reply-To: <fd4a2913-545e-bf2f-e352-a47fd50a954f-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]

On Mon, Nov 21, 2016 at 10:21:41AM -0800, Bart Van Assche wrote:
> This patch avoids that Coverity reports the following:
>
>     Using uninitialized value port_attr.state when calling printk
>
> Fixes: commit 94232d9ce817 ("IPoIB: Start multicast join process only on active ports")
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>

Except that it doesn't print the reason why ib_query_port failed,
it look good.
Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Thanks

> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 1909dd2..fddff40 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -575,8 +575,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
>  	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
>  		return;
>
> -	if (ib_query_port(priv->ca, priv->port, &port_attr) ||
> -	    port_attr.state != IB_PORT_ACTIVE) {
> +	if (ib_query_port(priv->ca, priv->port, &port_attr)) {
> +		ipoib_dbg(priv, "ib_query_port() failed\n");
> +		return;
> +	}
> +	if (port_attr.state != IB_PORT_ACTIVE) {
>  		ipoib_dbg(priv, "port state is not ACTIVE (state = %d) suspending join task\n",
>  			  port_attr.state);
>  		return;
> --
> 2.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH 3/3] IB/multicast: Check ib_find_pkey() return value
From: Leon Romanovsky @ 2016-11-22  7:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Sean Hefty
In-Reply-To: <950a0611-4aad-9b85-8d18-3dff54db2820-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2136 bytes --]

On Mon, Nov 21, 2016 at 10:22:17AM -0800, Bart Van Assche wrote:
> This patch avoids that Coverity complains about not checking the
> ib_find_pkey() return value.
>
> Fixes: commit 547af76521b3 ("IB/multicast: Report errors on multicast groups if P_key changes")
> Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
> Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> ---
>  drivers/infiniband/core/multicast.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
> index e51b739..322cb67 100644
> --- a/drivers/infiniband/core/multicast.c
> +++ b/drivers/infiniband/core/multicast.c
> @@ -518,8 +518,11 @@ static void join_handler(int status, struct ib_sa_mcmember_rec *rec,
>  		process_join_error(group, status);
>  	else {
>  		int mgids_changed, is_mgid0;
> -		ib_find_pkey(group->port->dev->device, group->port->port_num,
> -			     be16_to_cpu(rec->pkey), &pkey_index);
> +
> +		if (ib_find_pkey(group->port->dev->device,
> +				 group->port->port_num, be16_to_cpu(rec->pkey),
> +				 &pkey_index))
> +			pkey_index = MCAST_INVALID_PKEY_INDEX;

The coverity warning is valid, ib_find_pkey() can return ENOMEM
(ib_find_pkey()->ib_query_pkey()->(device->query_pkey())->mlx5_ib_query_pkey()->mlx5_query_mad_ifc_pkey()/mlx5_query_hca_vport_pkey())

I'm not sure that the error handling proposed is enough, pkey_index is initialized to
MCAST_INVALID_PKEY_INDEX at the beginning of function and in case of error it preserves
its value.

Don't you need to update process_group_error() function? It has code
flows with ib_find_pkey() without checking return value, which can be
anything too.

Thanks

>
>  		spin_lock_irq(&group->port->lock);
>  		if (group->state == MCAST_BUSY &&
> --
> 2.10.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-22  1:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161121233118.GA31132-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>

On Mon, Nov 21, 2016 at 04:31:18PM -0700, Jason Gunthorpe wrote:
>> >>>>+	ida_init(&hfi_vnic_ctrl_ida);
>> >>>>+	idr_init(&hfi_vnic_idr);
>> >>>>+
>> >>>>+	rc = bus_register(&hfi_vnic_bus);
>> >>>
>> >>>Why on earth do we need this? Didn't I give you enough grief for the
>> >>>psm stuff and now you want to create an entire subystem hidden away!?
>> >>>
>> >>>Use some netlink scheme to control your vnic like the rest of the net
>> >>>stack..
>> >>>
>> >>
>> >>The hfi_vnic_bus is only abstracting the HW independent functionality (like
>> >>Ethernet interface, encapsulation, IB MAD interface etc) with the HW
>> >>dependent functionality (sending/receiving packets on the wire).
>> >>Thus providing a cleaner interface between HW independent hfi_vnic Ethernet
>> >>and Control drivers and the HW dependent HFI1 driver.
>> >
>> >That doesn't explain anything, sound like you don't need it so get rid
>> >of it.
>
>> >>There is no other User interface here other than the standard Ethernet
>> >>interface through network stack.
>> >
>> >Good, then this isn't needed, because it doesn't provide a user interface.
>> >
>>
>> Can you explain what exactly you are asking to get rid of here and why?
>
>Get rid of the bus_register/etc as drivers do not get to call this.
>

There are many example drivers in kernel which are using bus_register() in
an initcall.
We could add a custom Interface between HFI1 driver and hfi_vnic drivers 
without involving a bus.
But using the existing bus model gave a lot of in-built flexibility in 
decoupling devices from the drivers.

Niranjana

>Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Jason Gunthorpe @ 2016-11-21 23:31 UTC (permalink / raw)
  To: Vishwanathapura, Niranjana
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA, Dennis Dalessandro
In-Reply-To: <20161121232629.GA67988-wPcXA7LoDC+1XWohqUldA0EOCMrvLtNR@public.gmane.org>

On Mon, Nov 21, 2016 at 03:26:29PM -0800, Vishwanathapura, Niranjana wrote:
 
> I did not see any example IB mad agent outside drivers/inifiniband
> folder.

You can be the first.

> I did see some netdev drivers outside the net/ folder (like ipoib and
> drivers/infiniband/hw/nes/).

It is very rare, and arguably ipoib was a mistake..

> The hfi_vnic Ethernet driver is entirely a soft driver without any HW access.
> Also, any interface changes between hfi_vnic and the HFI driver makes it
> difficult to manage if they are under two subsystems/maintainers.

We manage this already for all the rocee drivers that are based on
netdev drivers, you will be fine.

> We have 'rdmavt' and 'soft-roce' drivers under drivers/infiniband/sw/ folder
> which are soft drivers similar to hfi_vnic. So we decided to put
> 'hif_vnic'

As I said those are HCA drivers & support, not just random 'software'.

> >>>>+	ida_init(&hfi_vnic_ctrl_ida);
> >>>>+	idr_init(&hfi_vnic_idr);
> >>>>+
> >>>>+	rc = bus_register(&hfi_vnic_bus);
> >>>
> >>>Why on earth do we need this? Didn't I give you enough grief for the
> >>>psm stuff and now you want to create an entire subystem hidden away!?
> >>>
> >>>Use some netlink scheme to control your vnic like the rest of the net
> >>>stack..
> >>>
> >>
> >>The hfi_vnic_bus is only abstracting the HW independent functionality (like
> >>Ethernet interface, encapsulation, IB MAD interface etc) with the HW
> >>dependent functionality (sending/receiving packets on the wire).
> >>Thus providing a cleaner interface between HW independent hfi_vnic Ethernet
> >>and Control drivers and the HW dependent HFI1 driver.
> >
> >That doesn't explain anything, sound like you don't need it so get rid
> >of it.

> >>There is no other User interface here other than the standard Ethernet
> >>interface through network stack.
> >
> >Good, then this isn't needed, because it doesn't provide a user interface.
> >
> 
> Can you explain what exactly you are asking to get rid of here and why?

Get rid of the bus_register/etc as drivers do not get to call this.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver
From: Vishwanathapura, Niranjana @ 2016-11-21 23:26 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Doug Ledford, linux-rdma, netdev, Dennis Dalessandro
In-Reply-To: <20161121213930.GA30111@obsidianresearch.com>

On Mon, Nov 21, 2016 at 02:39:30PM -0700, Jason Gunthorpe wrote:
>On Mon, Nov 21, 2016 at 01:30:17PM -0800, Vishwanathapura, Niranjana wrote:
>> On Sat, Nov 19, 2016 at 12:04:45PM -0700, Jason Gunthorpe wrote:
>> >On Fri, Nov 18, 2016 at 02:42:10PM -0800, Vishwanathapura, Niranjana wrote:
>> >>+HFI-VNIC DRIVER
>> >>+M:	Dennis Dalessandro <dennis.dalessandro@intel.com>
>> >>+M:	Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
>> >>+L:	linux-rdma@vger.kernel.org
>> >>+S:	Supported
>> >>+F:	drivers/infiniband/sw/intel/vnic
>> >
>> >This is either a net driver or a ULP, no idea why it should go in this
>> >directory!?
>> >
>> >It sounds like an ethernet driver, so you should probably put it
>> >there...
>> >
>>
>> The hfi_vnic is an Ethernet driver. It is similar to ULP like ipoib, but
>> instead it is Ethernet over Omni-path here.
>> The VNIC Ethernet (hfi_vnic) driver encapsulates Ethernet packets in an
>> Omni-path header.
>> The hfi_vnic Ethernet driver do not access the HW. It interfaces with HFI1
>> driver which sends/receives Omni-Path encapsulated Ethernet frames from HW.
>> Also, the VNIC control path driver (VEMA) is an IB MAD agent which should be
>> under drivers/infiniband/.. .
>> Putting the VNIC Ethernet driver and the VNIC control driver together under
>> a single module (hfi_vnic.ko) provided a simpler interface between them.
>>
>> So, we have put the driver under drivers/infiniband/sw/intel for two reasons:
>> a) We have VNIC control driver (VEMA) which is an IB mad agent.
>> b) hfi_vnic Ethernet driver is dependent on HFI1 driver for sending/receving
>> Omni-path encapsulated Ethernet packets from HW.
>
>Sounds like this driver belongs under net/ someplace to me.
>
>NAK on drivers/infiniband/sw/ at least - that dir is only for HCA
>drivers.
>

I did not see any example IB mad agent outside drivers/inifiniband folder.
I did see some netdev drivers outside the net/ folder (like ipoib and 
drivers/infiniband/hw/nes/).
The hfi_vnic Ethernet driver is entirely a soft driver without any HW access.
Also, any interface changes between hfi_vnic and the HFI driver makes it 
difficult to manage if they are under two subsystems/maintainers.
So drivers/infiniband is probably more approriate for this driver.

We have 'rdmavt' and 'soft-roce' drivers under drivers/infiniband/sw/ folder 
which are soft drivers similar to hfi_vnic. So we decided to put 'hif_vnic' 
under there.
Other places under drivers/infiniband where we can put this driver are,
drivers/infiniband/ulp/hfi_vnic
drivers/infiniband/hw/hfi_vnic

>> >>+/* hfi_vnic_bus_init - initialize the hfi vnic bus drvier */
>> >>+static int hfi_vnic_bus_init(void)
>> >>+{
>> >>+	int rc;
>> >>+
>> >>+	ida_init(&hfi_vnic_ctrl_ida);
>> >>+	idr_init(&hfi_vnic_idr);
>> >>+
>> >>+	rc = bus_register(&hfi_vnic_bus);
>> >
>> >Why on earth do we need this? Didn't I give you enough grief for the
>> >psm stuff and now you want to create an entire subystem hidden away!?
>> >
>> >Use some netlink scheme to control your vnic like the rest of the net
>> >stack..
>> >
>>
>> The hfi_vnic_bus is only abstracting the HW independent functionality (like
>> Ethernet interface, encapsulation, IB MAD interface etc) with the HW
>> dependent functionality (sending/receiving packets on the wire).
>> Thus providing a cleaner interface between HW independent hfi_vnic Ethernet
>> and Control drivers and the HW dependent HFI1 driver.
>
>That doesn't explain anything, sound like you don't need it so get rid
>of it.
>
>> There is no other User interface here other than the standard Ethernet
>> interface through network stack.
>
>Good, then this isn't needed, because it doesn't provide a user interface.
>

Can you explain what exactly you are asking to get rid of here and why?

>> #ls /sys/bus/hfi_vnic_bus/devices/
>>    hfi_vnic_ctrl_00         /* control device for HFI instance 0 */
>>    hfi_vnic_00.01.00        /* first VNIC port on HFI instance 0, port 1 */
>
>Jason

^ permalink raw reply

* Re: [PATCH v5 2/9] IB/core: Replace semaphore sm_sem with an atomic wait
From: Arnd Bergmann @ 2016-11-21 22:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Binoy Jayan, linux-rdma@vger.kernel.org, Mustafa Ismail, Lijun Ou,
	Nicholas Bellinger, Leon Romanovsky, target-devel,
	Tatyana E Nikolova, Doug Ledford, Jenny Derzhavetz, Sagi Grimberg,
	Sean Hefty, Faisal Latif, Hal Rosenstock,
	Linux Kernel Mailing List, Wei Hu(Xavier), Mark Brown, Mark Bloch,
	Steve
In-Reply-To: <CA+55aFyd7d+C39ipC5XuZg1gJjJ+HFSgcPQU7Xb26Fsh6Z1_0A@mail.gmail.com>

On Monday, November 21, 2016 9:57:51 AM CET Linus Torvalds wrote:
> 
>  - semaphores are "old-fashioned mutexes". A mutex is better than a
> semaphore, but a semaphore is better than just about all the other
> alternatives. There's nothing _wrong_ with using a semaphore per se.
> 
> In this case, either use a semaphore or a mutex. If you are doing
> mutual exclusion, those are really the only two acceptable sleeping
> models.

The main problem with semaphores is that they are slowly spreading
into areas that really should be mutexes or completions. A couple
of years ago, we had only around 30 semaphores left in the kernel
and while a lot of those have been removed in the meantime, over
100 new ones have come in, the majority of them in the category
that can be trivially converted to a mutex or semaphore.

This in turn is not much of a problem, except to a certain degree
for preempt-rt users.

I suggested to Binoy that he could look into replacing the existing
semaphores one subsystem at a time under the assumption that we
could find a relatively easy alternative for every one of them
and then remove the implementation completely.

Christoph's suggestion is probably more productive here: let's
remove the ones that are obviously wrong or inferior, and only
once they have been taken care of we can look into whether it's
worth doing something about the rest or not.

	Arnd

^ permalink raw reply

* [PATCH 5/5] IB/srp: Make writing into the add_target sysfs attribute interruptible
From: Bart Van Assche @ 2016-11-21 21:58 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ba32b251-135f-c57d-44cc-658ae35a49ca-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

Avoid that shutdown of srp_daemon is delayed if add_target_mutex is
held by another process.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index bb9d73d..8ddc071 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3299,7 +3299,9 @@ static ssize_t srp_create_target(struct device *dev,
 	 */
 	scsi_host_get(target->scsi_host);
 
-	mutex_lock(&host->add_target_mutex);
+	ret = mutex_lock_interruptible(&host->add_target_mutex);
+	if (ret < 0)
+		goto put;
 
 	ret = srp_parse_options(buf, target);
 	if (ret)
@@ -3455,6 +3457,7 @@ static ssize_t srp_create_target(struct device *dev,
 out:
 	mutex_unlock(&host->add_target_mutex);
 
+put:
 	scsi_host_put(target->scsi_host);
 	if (ret < 0)
 		scsi_host_put(target->scsi_host);
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 4/5] IB/srp: Make mapping failures easier to debug
From: Bart Van Assche @ 2016-11-21 21:57 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ba32b251-135f-c57d-44cc-658ae35a49ca-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

Make it easier to figure out what is going on if memory mapping
fails because more memory regions than mr_per_cmd are needed.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 81cb27f..bb9d73d 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1274,8 +1274,12 @@ static int srp_map_finish_fmr(struct srp_map_state *state,
 	struct ib_pool_fmr *fmr;
 	u64 io_addr = 0;
 
-	if (state->fmr.next >= state->fmr.end)
+	if (state->fmr.next >= state->fmr.end) {
+		shost_printk(KERN_ERR, ch->target->scsi_host,
+			     PFX "Out of MRs (mr_per_cmd = %d)\n",
+			     ch->target->mr_per_cmd);
 		return -ENOMEM;
+	}
 
 	WARN_ON_ONCE(!dev->use_fmr);
 
@@ -1331,8 +1335,12 @@ static int srp_map_finish_fr(struct srp_map_state *state,
 	u32 rkey;
 	int n, err;
 
-	if (state->fr.next >= state->fr.end)
+	if (state->fr.next >= state->fr.end) {
+		shost_printk(KERN_ERR, ch->target->scsi_host,
+			     PFX "Out of MRs (mr_per_cmd = %d)\n",
+			     ch->target->mr_per_cmd);
 		return -ENOMEM;
+	}
 
 	WARN_ON_ONCE(!dev->use_fast_reg);
 
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 3/5] IB/srp: Make login failures easier to debug
From: Bart Van Assche @ 2016-11-21 21:57 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ba32b251-135f-c57d-44cc-658ae35a49ca-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

If login fails because memory region allocation failed it can be
hard to figure out what happened. Make it easier to figure out
why login failed by logging a message if ib_alloc_mr() fails.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index c216c6e..81cb27f 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -389,6 +389,9 @@ static struct srp_fr_pool *srp_create_fr_pool(struct ib_device *device,
 				 max_page_list_len);
 		if (IS_ERR(mr)) {
 			ret = PTR_ERR(mr);
+			if (ret == -ENOMEM)
+				pr_info("%s: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count\n",
+					dev_name(&device->dev));
 			goto destroy_pool;
 		}
 		d->mr = mr;
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 2/5] IB/srp: Introduce a local variable in srp_add_one()
From: Bart Van Assche @ 2016-11-21 21:57 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ba32b251-135f-c57d-44cc-658ae35a49ca-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

This patch makes the srp_add_one() code more compact and does not
change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 8bb720c..c216c6e 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3527,6 +3527,7 @@ static struct srp_host *srp_add_port(struct srp_device *device, u8 port)
 static void srp_add_one(struct ib_device *device)
 {
 	struct srp_device *srp_dev;
+	struct ib_device_attr *attr = &device->attrs;
 	struct srp_host *host;
 	int mr_page_shift, p;
 	u64 max_pages_per_mr;
@@ -3541,25 +3542,25 @@ static void srp_add_one(struct ib_device *device)
 	 * minimum of 4096 bytes. We're unlikely to build large sglists
 	 * out of smaller entries.
 	 */
-	mr_page_shift		= max(12, ffs(device->attrs.page_size_cap) - 1);
+	mr_page_shift		= max(12, ffs(attr->page_size_cap) - 1);
 	srp_dev->mr_page_size	= 1 << mr_page_shift;
 	srp_dev->mr_page_mask	= ~((u64) srp_dev->mr_page_size - 1);
-	max_pages_per_mr	= device->attrs.max_mr_size;
+	max_pages_per_mr	= attr->max_mr_size;
 	do_div(max_pages_per_mr, srp_dev->mr_page_size);
 	pr_debug("%s: %llu / %u = %llu <> %u\n", __func__,
-		 device->attrs.max_mr_size, srp_dev->mr_page_size,
+		 attr->max_mr_size, srp_dev->mr_page_size,
 		 max_pages_per_mr, SRP_MAX_PAGES_PER_MR);
 	srp_dev->max_pages_per_mr = min_t(u64, SRP_MAX_PAGES_PER_MR,
 					  max_pages_per_mr);
 
 	srp_dev->has_fmr = (device->alloc_fmr && device->dealloc_fmr &&
 			    device->map_phys_fmr && device->unmap_fmr);
-	srp_dev->has_fr = (device->attrs.device_cap_flags &
+	srp_dev->has_fr = (attr->device_cap_flags &
 			   IB_DEVICE_MEM_MGT_EXTENSIONS);
 	if (!never_register && !srp_dev->has_fmr && !srp_dev->has_fr) {
 		dev_warn(&device->dev, "neither FMR nor FR is supported\n");
 	} else if (!never_register &&
-		   device->attrs.max_mr_size >= 2 * srp_dev->mr_page_size) {
+		   attr->max_mr_size >= 2 * srp_dev->mr_page_size) {
 		srp_dev->use_fast_reg = (srp_dev->has_fr &&
 					 (!srp_dev->has_fmr || prefer_fr));
 		srp_dev->use_fmr = !srp_dev->use_fast_reg && srp_dev->has_fmr;
@@ -3572,13 +3573,13 @@ static void srp_add_one(struct ib_device *device)
 	if (srp_dev->use_fast_reg) {
 		srp_dev->max_pages_per_mr =
 			min_t(u32, srp_dev->max_pages_per_mr,
-			      device->attrs.max_fast_reg_page_list_len);
+			      attr->max_fast_reg_page_list_len);
 	}
 	srp_dev->mr_max_size	= srp_dev->mr_page_size *
 				   srp_dev->max_pages_per_mr;
 	pr_debug("%s: mr_page_shift = %d, device->max_mr_size = %#llx, device->max_fast_reg_page_list_len = %u, max_pages_per_mr = %d, mr_max_size = %#x\n",
-		 device->name, mr_page_shift, device->attrs.max_mr_size,
-		 device->attrs.max_fast_reg_page_list_len,
+		 device->name, mr_page_shift, attr->max_mr_size,
+		 attr->max_fast_reg_page_list_len,
 		 srp_dev->max_pages_per_mr, srp_dev->mr_max_size);
 
 	INIT_LIST_HEAD(&srp_dev->dev_list);
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 1/5] IB/srp: Fix CONFIG_DYNAMIC_DEBUG=n build
From: Bart Van Assche @ 2016-11-21 21:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <ba32b251-135f-c57d-44cc-658ae35a49ca-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>

Avoid that the kernel build fails as follows if dynamic debug support
is disabled:

drivers/infiniband/ulp/srp/ib_srp.c:2272:3: error: implicit declaration of function 'DEFINE_DYNAMIC_DEBUG_METADATA'
drivers/infiniband/ulp/srp/ib_srp.c:2272:33: error: 'ddm' undeclared (first use in this function)
drivers/infiniband/ulp/srp/ib_srp.c:2275:39: error: '_DPRINTK_FLAGS_PRINT' undeclared (first use in this function)

Signed-off-by: Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
---
 drivers/infiniband/ulp/srp/ib_srp.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index d980fb4..8bb720c 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -64,6 +64,11 @@ MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(DRV_VERSION);
 MODULE_INFO(release_date, DRV_RELDATE);
 
+#if !defined(CONFIG_DYNAMIC_DEBUG)
+#define DEFINE_DYNAMIC_DEBUG_METADATA(name, fmt)
+#define DYNAMIC_DEBUG_BRANCH(descriptor) false
+#endif
+
 static unsigned int srp_sg_tablesize;
 static unsigned int cmd_sg_entries;
 static unsigned int indirect_sg_entries;
@@ -1556,7 +1561,6 @@ static int srp_map_idb(struct srp_rdma_ch *ch, struct srp_request *req,
 	return 0;
 }
 
-#if defined(DYNAMIC_DATA_DEBUG)
 static void srp_check_mapping(struct srp_map_state *state,
 			      struct srp_rdma_ch *ch, struct srp_request *req,
 			      struct scatterlist *scat, int count)
@@ -1580,7 +1584,6 @@ static void srp_check_mapping(struct srp_map_state *state,
 		       scsi_bufflen(req->scmnd), desc_len, mr_len,
 		       state->ndesc, state->nmdesc);
 }
-#endif
 
 /**
  * srp_map_data() - map SCSI data buffer onto an SRP request
@@ -1669,14 +1672,12 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch,
 	if (ret < 0)
 		goto unmap;
 
-#if defined(DYNAMIC_DEBUG)
 	{
 		DEFINE_DYNAMIC_DEBUG_METADATA(ddm,
 			"Memory mapping consistency check");
-		if (unlikely(ddm.flags & _DPRINTK_FLAGS_PRINT))
+		if (DYNAMIC_DEBUG_BRANCH(ddm))
 			srp_check_mapping(&state, ch, req, scat, count);
 	}
-#endif
 
 	/* We've mapped the request, now pull as much of the indirect
 	 * descriptor table as we can into the command buffer. If this
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 0/5] IB/srp patches for kernel v4.10
From: Bart Van Assche @ 2016-11-21 21:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Max Gurtovoy, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hello Doug,

The patches in this series are what I came up with while testing the 
v4.9-rc5 IB/srp driver. Please consider these patches for inclusion in 
kernel v4.10. The individual patches in this series are:

0001-IB-srp-Fix-CONFIG_DYNAMIC_DEBUG-n-build.patch
0002-IB-srp-Introduce-a-local-variable-in-srp_add_one.patch
0003-IB-srp-Make-login-failures-easier-to-debug.patch
0004-IB-srp-Make-mapping-failures-easier-to-debug.patch
0005-IB-srp-Make-writing-into-the-add_target-sysfs-attrib.patch

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox