linux-rdma.vger.kernel.org archive mirror
* [PATCH V4 0/9] IP based RoCE GID Addressing
From: Or Gerlitz @ 2013-09-10 14:41 UTC
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, monis-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, Or Gerlitz

changes from V3:

  - dropped the uverbs infrastructure patch for extensions which is now upstream
    400dbc9 "IB/core: Infrastructure for extensible uverbs commands"

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 patch.
   
  - removed the assumption that the low level driver can provide the source mac
    and vlan in the struct ib_wc returned by ib_poll_cq, and adjusted the 
    ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some vlan related issues in the mlx4 driver

See below full listing of change-history.

Currently, the IB stack (core + drivers) handles RoCE (IBoE) GIDs as
encoding the MAC address of the related Ethernet net-device interface and
possibly its VLAN id.

This series changes RoCE GIDs to encode the IP addresses (IPv4 + IPv6)
of that Ethernet interface, under the following reasoning:

1. There are environments where the compute entity that runs the RoCE
stack is not aware that its traffic is vlan-tagged. As a result, that node
creates/assumes GIDs which are wrong from the viewpoint of a peer node that
is aware of the vlans.

Note that "node" here can be a physical node connected to an Ethernet switch acting in
access mode, talking to another node which does vlan insertion/stripping by itself.

Another example is an SRIOV Virtual Function configured to work in "VST"
mode (Virtual-Switch-Tagging), such that the hypervisor configures the HW eSwitch
to do vlan insertion for the vPORT representing that function.

2. When RoCE traffic is inspected (mirrored/trapped) in Ethernet switches for
monitoring and security purposes, it is much more natural for both humans and
automated utilities (...) to observe IP addresses at a certain offset in the L3
header of RoCE frames than MAC/VLANs (which are there anyway in the L2 header of
that frame, so they are not lost by this change).

3. Some advanced Bonding/Teaming modes, such as balance-alb and balance-tlb,
use multiple underlying devices in parallel, and hence packets always
carry the bond IP address but different streams have different source MACs.
The approach brought by this series is part of what would allow
supporting that for RoCE traffic too.

The 1st patch adds explicit handling of Ethernet L2 attributes (source/dest
mac and vlan_id) to the kernel IB core, in data-structures and CMA/CM code.
Previously, with MAC/VLAN based addressing, they were encoded in the GIDs,
whereas now they have to be resolved and placed separately from the IP based GIDs.
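
As a rough illustration of the difference (a user space sketch, not code
from the patches): the first helper builds a MAC/VLAN based GID, the second
an IP based GID for an IPv4 address. The names struct gid, gid_from_mac_vlan
and gid_from_ipv4 are hypothetical. The byte layout of the MAC/VLAN GID
(VLAN id in bytes 11-12 of a link-local, EUI-64 like address) and the use of
the IPv4-mapped form ::ffff:a.b.c.d are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A GID is 16 bytes, the same size as an IPv6 address. */
struct gid {
	uint8_t raw[16];
};

/*
 * MAC/VLAN based GID (illustrative): fe80::/64 prefix plus an EUI-64 like
 * interface id derived from the MAC.
 * ASSUMPTION: a valid VLAN id (< 0x1000) replaces the ff:fe filler bytes.
 */
static void gid_from_mac_vlan(struct gid *gid, const uint8_t mac[6], uint16_t vid)
{
	assert(gid && mac);
	memset(gid->raw, 0, sizeof(gid->raw));
	gid->raw[0] = 0xfe;
	gid->raw[1] = 0x80;
	memcpy(gid->raw + 8, mac, 3);
	gid->raw[8] ^= 2;               /* flip the universal/local bit */
	if (vid < 0x1000) {
		gid->raw[11] = vid >> 8;
		gid->raw[12] = vid & 0xff;
	} else {                        /* no VLAN: plain EUI-64 filler */
		gid->raw[11] = 0xff;
		gid->raw[12] = 0xfe;
	}
	memcpy(gid->raw + 13, mac + 3, 3);
}

/*
 * IP based GID (illustrative) for an IPv4 address.
 * ASSUMPTION: the address is stored in IPv4-mapped form, ::ffff:a.b.c.d.
 * MAC and VLAN play no part, so they must be carried separately.
 */
static void gid_from_ipv4(struct gid *gid, const uint8_t ip[4])
{
	assert(gid && ip);
	memset(gid->raw, 0, sizeof(gid->raw));
	gid->raw[10] = 0xff;
	gid->raw[11] = 0xff;
	memcpy(gid->raw + 12, ip, 4);
}
```

With the first helper, a node that knows about VLAN 100 and a node that
does not will compute different GIDs for the same MAC. With the second,
both compute the same GID for the same IP address, regardless of tagging.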

The 2nd patch modifies the CMA to cope with IP based GIDs, the 3rd/4th ones do
that for the mlx4_ib driver, and the 5th/6th ones for the ocrdma driver.

The 7th/8th patches add two extended uverbs and, respectively, two extended ucma
commands which are now exported to user space. They build on the recently
introduced infrastructure for extensible uverbs commands, which is now upstream.
The last patch denotes mlx4 support for the uverbs extended modify qp command.

These extended verbs will allow user space libraries to be enhanced so that they work
correctly over the modified scheme. All RC applications using librdmacm will not need
to be modified at all, since the change will be encapsulated in that library.

Or.

Full listing of change-history:

changes from V3:

  - dropped the uverbs Infrastructure patch for extensions which is now upstream
    400dbc9 "IB/core: Infrastructure for extensible uverbs commands"

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 patch.
   
  - removed the assumption that the low level driver can provide the source mac
    and vlan in the struct ib_wc returned by ib_poll_cq, and adjusted the 
    ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some vlan related issues in the mlx4 driver

changes from V2:

  - added handling of IP based GIDs in the ocrdma driver - patch #5, 
    as a result patches #5-8 of V1 became patches #6-9
  
changes from V1:

 - rebased the series against the latest kernel bits, which include Sean's 
   AF_IB changes to the rdma-cm
 
 - fixed bug in mlx4_ib where reset of the gid table was done for IB ports too
 
 - fixed build warnings and issues pointed out by sparse

 - introduced patch #1 which does the explicit handling of Ethernet L2 attributes, 
   source/dest mac and vlan_id in the kernel data-structures and CMA/CM code. 

 - use smac when modifying a QP --> find smac in passive side + additional fields
   to address structures

 - add support for the new QP attr in ib_modify_qp_is_ok(), specific to ll = ETH,
   and modified all low-level drivers to keep working after that change

 -- changes around uverbs:
 - use ah_ext as pointer in qp_attr passed from user space, so this 
   field by itself can be extended in the future
 - for kernel to user command responses, comp_mask is moved into the
   right place, which is after the non-extended command response fields
 - fixed bug in copy_qp_attr_ex under which some fields were copied to
   wrong locations
 - use new structure rdma_ucm_init_qp_attr_ex which is extendable (ucma)
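
The comp_mask placement mentioned above can be sketched as follows. This is
only an illustration of the general pattern, not the structures from the
patches: struct demo_resp, struct demo_resp_ex, the DEMO_RESP_EX_* bits and
demo_resp_ex_has() are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical response of an existing (non-extended) command. */
struct demo_resp {
	uint32_t qp_num;
	uint32_t reserved;
};

/* Hypothetical bits telling which optional fields are valid. */
#define DEMO_RESP_EX_VLAN (1u << 0)
#define DEMO_RESP_EX_SMAC (1u << 1)

/*
 * Hypothetical extended response: the non-extended fields come first and
 * keep their offsets, comp_mask follows them, and optional fields come
 * last so that more can be appended later.
 */
struct demo_resp_ex {
	struct demo_resp base;
	uint32_t comp_mask;
	uint32_t vlan_id;       /* valid only if DEMO_RESP_EX_VLAN is set */
	uint8_t  smac[6];       /* valid only if DEMO_RESP_EX_SMAC is set */
	uint8_t  reserved[2];
};

/* Returns nonzero if the optional field selected by 'bit' is valid. */
static int demo_resp_ex_has(const struct demo_resp_ex *resp, uint32_t bit)
{
	assert(resp);
	return (resp->comp_mask & bit) != 0;
}
```

A consumer that only understands the non-extended layout can still read
the leading fields, while a newer one checks comp_mask before touching any
optional field.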

changes from V0:

 - enhanced documentation of the mlx4_ib, uverbs and ucma patches
 - broke the mlx4_ib patch into two
 - broke the extended user space commands patch into two

Matan Barak (4):
  IB/core: Ethernet L2 attributes in verbs/cm structures
  IB/core: Add RoCE IP based addressing extensions for uverbs
  IB/core: Add RoCE IP based addressing extensions for rdma_ucm
  IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX

Moni Shoua (5):
  IB/CMA: RoCE IP based GID addressing
  IB/mlx4: Use RoCE IP based GIDs in the port GID table
  IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing
  IB/ocrdma: Populate GID table with IP based gids
  IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing

 drivers/infiniband/core/addr.c              |   97 ++++++-
 drivers/infiniband/core/cm.c                |   55 +++
 drivers/infiniband/core/cma.c               |   74 ++++-
 drivers/infiniband/core/sa_query.c          |   12 +-
 drivers/infiniband/core/ucma.c              |  193 ++++++++++-
 drivers/infiniband/core/uverbs.h            |    2 +
 drivers/infiniband/core/uverbs_cmd.c        |  359 ++++++++++++++++-----
 drivers/infiniband/core/uverbs_main.c       |    4 +-
 drivers/infiniband/core/uverbs_marshall.c   |  128 +++++++-
 drivers/infiniband/core/verbs.c             |   45 +++-
 drivers/infiniband/hw/ehca/ehca_qp.c        |    2 +-
 drivers/infiniband/hw/ipath/ipath_qp.c      |    2 +-
 drivers/infiniband/hw/mlx4/ah.c             |   40 +--
 drivers/infiniband/hw/mlx4/cq.c             |    9 +
 drivers/infiniband/hw/mlx4/main.c           |  477 +++++++++++++++++++--------
 drivers/infiniband/hw/mlx4/mlx4_ib.h        |    6 +-
 drivers/infiniband/hw/mlx4/qp.c             |  104 +++++--
 drivers/infiniband/hw/mlx5/qp.c             |    3 +-
 drivers/infiniband/hw/mthca/mthca_qp.c      |    3 +-
 drivers/infiniband/hw/ocrdma/ocrdma.h       |   12 +
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c    |    5 +-
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c    |   21 +-
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h    |    1 -
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |  138 +++------
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |    3 +-
 drivers/infiniband/hw/qib/qib_qp.c          |    2 +-
 drivers/net/ethernet/mellanox/mlx4/port.c   |   20 ++
 include/linux/mlx4/cq.h                     |   15 +-
 include/linux/mlx4/device.h                 |    1 +
 include/rdma/ib_addr.h                      |   84 +++--
 include/rdma/ib_cm.h                        |    1 +
 include/rdma/ib_marshall.h                  |   12 +
 include/rdma/ib_pack.h                      |    1 +
 include/rdma/ib_sa.h                        |    3 +
 include/rdma/ib_verbs.h                     |   23 ++-
 include/uapi/rdma/ib_user_sa.h              |   34 ++-
 include/uapi/rdma/ib_user_verbs.h           |  160 +++++++++-
 include/uapi/rdma/rdma_user_cm.h            |   29 ++-
 38 files changed, 1684 insertions(+), 496 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX
From: Or Gerlitz @ 2013-09-17 20:49 UTC
  To: Roland Dreier
  Cc: Jason Gunthorpe, Or Gerlitz, Devesh Sharma,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, monis, matanb

On Tue, Sep 17, 2013 at 8:50 PM, Roland Dreier wrote:
> On Thu, Sep 12, 2013 at 10:22 AM, Jason Gunthorpe wrote:
>> On Thu, Sep 12, 2013 at 03:24:46PM +0300, Or Gerlitz wrote:

>>> Let me clarify this. The idea is that current RoCE applications will
>>> run as is after they update "their" librdmacm, since its this
>>> library that works with the new uverbs entries.

>> Or, we are not supposed to break userspace. You can't insist that a
>> user space library be updated in-sync with the kernel.

> Agree.  This "IP based addressing" for RoCE looks like a big problem
> at the moment.  Let me reiterate my understanding, and you guys can
> correct me if I get something wrong:
>
>  - current addressing scheme is broken for virtualization use cases,
> because VMs may not know about what VLANs are in use.  (also there are
> issues around bonding modes that use different Ethernet addresses)

The current addressing is actually broken for vlan use cases, both
native and virtualized: for the virtualized case because of the argument
you mentioned, and for the native case because of one node being connected
to an Ethernet edge switch acting in access mode (that is, the switch does
vlan insertion/stripping) while the other node handles vlans by itself.
Each one will form a different GID for the other party.

>  - proposed change requires:
>    * all systems must update kernel at the same time, because old and
> new kernels cannot talk to each other
>    * all systems must update librdmacm when they update the kernel,
> because old librdmacm does not work with new kernel

> I understand that we want to fix the issue around VLAN tagged traffic
> from VMs, but I don't see how we can break the whole stack to
> accomplish that.  Isn't there some incremental way forward?

To begin with, we don't break the whole stack -- using the current
patch set, for ports whose link is IB, it's all business as usual, and this
holds at per-port resolution: if for a given device one port is IB and one
port is Eth, the existing librdmacm keeps working on the IB port.

Another fact to throw into the mix is that SRIOV VMs don't have RoCE now
(not supported by upstream). Actually we're holding off the submission of
the SRIOV RoCE patches because of the breakage with the current scheme
--> no need for backward compatibility here either. The vast majority,
if not all, of the Cloud use cases we are aware of which would use RoCE
need VST and need it to work right.

With vlans being broken already, I would say we need first and foremost to
fix that, and only/maybe later worry about backward compatibility for the few
native mode use cases that somehow manage to work around the buggy
gid format when they use vlans.

As for those who don't use vlans, which is also rare, as RoCE
works best over some lossless channel, which is typically achieved
using PFC over a vlan... we can use the fact that the IP based
addressing patches configure both the interface's IPv4 and IPv6 addresses
into the gid table.

Now, the IPv6 link-local address is actually also plugged into the gid
table by nodes running the old code, since this is how the non-vlan MAC
based GID is constructed. Using this fact, we can allow

1. the patched kernel to work with non-updated user space, as long as
it uses the GID which relates to an IPv6 link-local address

2. a node running the "old" code to talk with a "new" node over what the
old node sees as a non-vlan MAC based GID and the new node sees as an
IPv6 link-local gid.
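
To make the overlap concrete, here is a small user space sketch (not code
from the patches). It derives the EUI-64 style IPv6 link-local address from
a MAC, which is the value an old node would use as its non-vlan MAC based
GID and a new node would see as a link-local IP based gid. The function
names mac_to_link_local and gid_is_legacy_compatible are hypothetical, and
the compatibility check is deliberately simplistic: it only tests for the
link-local prefix.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* EUI-64 style IPv6 link-local address: fe80::/64 plus MAC-derived id. */
static void mac_to_link_local(struct in6_addr *addr, const uint8_t mac[6])
{
	assert(addr && mac);
	memset(addr, 0, sizeof(*addr));
	addr->s6_addr[0] = 0xfe;
	addr->s6_addr[1] = 0x80;
	addr->s6_addr[8] = mac[0] ^ 0x02;   /* flip the universal/local bit */
	addr->s6_addr[9] = mac[1];
	addr->s6_addr[10] = mac[2];
	addr->s6_addr[11] = 0xff;           /* EUI-64 filler bytes */
	addr->s6_addr[12] = 0xfe;
	addr->s6_addr[13] = mac[3];
	addr->s6_addr[14] = mac[4];
	addr->s6_addr[15] = mac[5];
}

/*
 * Hypothetical policy check on the "new" side: only link-local gids are
 * candidates for talking to old peers or un-updated user space.
 */
static bool gid_is_legacy_compatible(const struct in6_addr *gid)
{
	assert(gid);
	return IN6_IS_ADDR_LINKLOCAL(gid);
}
```

For example, for MAC 00:02:c9:aa:bb:cc the helper yields
fe80::202:c9ff:feaa:bbcc, whereas an IPv4 based gid such as
::ffff:192.168.1.10 would not pass the check.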

Sounds better?

Or.

