public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Fix local destination address resolution with VRF
@ 2025-09-07 16:08 Edward Srouji
  2025-09-07 16:08 ` [PATCH 1/4] RDMA/core: Squash a single user static function Edward Srouji
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-07 16:08 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal

From Parav:

Presently, address resolve routines consider a destination to be local
if the next-hop device of the resolved route for the destination is the
loopback netdevice. While this works for simple configurations, it fails
when the source and destination IP addresses belong to an enslaved
netdevice of a VRF.
In that case the next-hop device is the VRF itself, so packets are
generated with an incorrect destination MAC on the VRF netdevice and
ib_write_bw times out.

This patch series fixes that by determining whether a destination is
local based on the resolved route's type rather than on the next-hop
netdevice's loopback flag.
That approach resolves loopback traffic consistently both with and
without VRF configurations.

This series contains 4 patches:
  1/4: refactor address resolution code for reuse by subsequent patches
  2/4: resolve destination MAC via IP stack
  3/4: use route table entry instead of netdev loopback flag
  4/4: fix netdev lookup for IPoIB interfaces

Parav.

Parav Pandit (3):
  RDMA/core: Squash a single user static function
  RDMA/core: Resolve MAC of next-hop device without ARP support
  RDMA/core: Use route entry flag to decide on loopback traffic

Vlad Dumitrescu (1):
  IB/ipoib: Ignore L3 master device

 drivers/infiniband/core/addr.c            | 83 +++++++++++------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 21 +++---
 2 files changed, 50 insertions(+), 54 deletions(-)

-- 
2.21.3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] RDMA/core: Squash a single user static function
  2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
@ 2025-09-07 16:08 ` Edward Srouji
  2025-09-10  8:17   ` Leon Romanovsky
  2025-09-07 16:08 ` [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Edward Srouji @ 2025-09-07 16:08 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal

From: Parav Pandit <parav@nvidia.com>

In order to reduce dependencies in IFF_LOOPBACK in
route and neighbour resolution steps, squash the
static function to its single caller and simplify the
code. No functional change.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 49 ++++++++++++++--------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index be0743dac3ff..594e7ee335f7 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -465,34 +465,6 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
 	return ret;
 }
 
-static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr,
-			    const struct sockaddr *dst_in,
-			    const struct dst_entry *dst,
-			    const struct net_device *ndev)
-{
-	int ret = 0;
-
-	if (dst->dev->flags & IFF_LOOPBACK)
-		ret = rdma_translate_ip(dst_in, dev_addr);
-	else
-		rdma_copy_src_l2_addr(dev_addr, dst->dev);
-
-	/*
-	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
-	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
-	 * network type accordingly.
-	 */
-	if (has_gateway(dst, dst_in->sa_family) &&
-	    ndev->type != ARPHRD_INFINIBAND)
-		dev_addr->network = dst_in->sa_family == AF_INET ?
-						RDMA_NETWORK_IPV4 :
-						RDMA_NETWORK_IPV6;
-	else
-		dev_addr->network = RDMA_NETWORK_IB;
-
-	return ret;
-}
-
 static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 				 unsigned int *ndev_flags,
 				 const struct sockaddr *dst_in,
@@ -503,6 +475,7 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 	*ndev_flags = ndev->flags;
 	/* A physical device must be the RDMA device to use */
 	if (ndev->flags & IFF_LOOPBACK) {
+		int ret;
 		/*
 		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
 		 * loopback IP address. So if route is resolved to loopback
@@ -512,9 +485,27 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 		ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev), dst_in);
 		if (IS_ERR(ndev))
 			return -ENODEV;
+		ret = rdma_translate_ip(dst_in, dev_addr);
+		if (ret)
+			return ret;
+	} else {
+		rdma_copy_src_l2_addr(dev_addr, dst->dev);
 	}
 
-	return copy_src_l2_addr(dev_addr, dst_in, dst, ndev);
+	/*
+	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
+	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
+	 * network type accordingly.
+	 */
+	if (has_gateway(dst, dst_in->sa_family) &&
+	    ndev->type != ARPHRD_INFINIBAND)
+		dev_addr->network = dst_in->sa_family == AF_INET ?
+						RDMA_NETWORK_IPV4 :
+						RDMA_NETWORK_IPV6;
+	else
+		dev_addr->network = RDMA_NETWORK_IB;
+
+	return 0;
 }
 
 static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr)
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
  2025-09-07 16:08 ` [PATCH 1/4] RDMA/core: Squash a single user static function Edward Srouji
@ 2025-09-07 16:08 ` Edward Srouji
  2025-09-10  8:32   ` Leon Romanovsky
  2025-09-07 16:08 ` [PATCH 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Edward Srouji @ 2025-09-07 16:08 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal

From: Parav Pandit <parav@nvidia.com>

Currently, if the next-hop netdevice does not support ARP resolution,
the destination MAC address is silently set to zero without reporting
an error. This leads to incorrect behavior and may result in packet
transmission failures.

Fix this by deferring MAC resolution to the IP stack via neighbour
lookup, allowing proper resolution or error reporting as appropriate.

Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses (GIDs)")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 594e7ee335f7..ca86c482662f 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -454,14 +454,10 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
 {
 	int ret = 0;
 
-	if (ndev_flags & IFF_LOOPBACK) {
+	if (ndev_flags & IFF_LOOPBACK)
 		memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-	} else {
-		if (!(ndev_flags & IFF_NOARP)) {
-			/* If the device doesn't do ARP internally */
-			ret = fetch_ha(dst, addr, dst_in, seq);
-		}
-	}
+	else
+		ret = fetch_ha(dst, addr, dst_in, seq);
 	return ret;
 }
 
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] RDMA/core: Use route entry flag to decide on loopback traffic
  2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
  2025-09-07 16:08 ` [PATCH 1/4] RDMA/core: Squash a single user static function Edward Srouji
  2025-09-07 16:08 ` [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
@ 2025-09-07 16:08 ` Edward Srouji
  2025-09-07 16:08 ` [PATCH 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-07 16:08 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal

From: Parav Pandit <parav@nvidia.com>

addr_resolve() considers a destination to be local if the next-hop
device of the resolved route for the destination is the loopback
netdevice.

This fails when the source and destination IP addresses belong to
a netdev enslaved to a VRF netdev. In this case the next-hop device
is the VRF itself:

 $ ip link add name myvrf up type vrf table 100
 $ ip link set ens2f0np0 master myvrf up
 $ ip addr add 192.168.1.1/24 dev ens2f0np0
 $ ip route get 192.168.1.1 oif myvrf
 local 192.168.1.1 dev myvrf table 100 src 192.168.1.1 uid 0
    cache <local>

This results in packets being generated with an incorrect destination
MAC of the VRF netdevice and ib_write_bw failing with timeout.

Solve this by determining if a destination is local or not based on
the resolved route's type rather than based on its next-hop netdevice
loopback flag.

This enables to resolve loopback traffic with and without VRF
configurations in a uniform way.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index ca86c482662f..61596cda2b65 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -446,31 +446,40 @@ static int addr6_resolve(struct sockaddr *src_sock,
 }
 #endif
 
+static bool is_dst_local(const struct dst_entry *dst)
+{
+	if (dst->ops->family == AF_INET)
+		return !!(dst_rtable(dst)->rt_type & RTN_LOCAL);
+	else if (dst->ops->family == AF_INET6)
+		return !!(dst_rt6_info(dst)->rt6i_flags & RTF_LOCAL);
+	else
+		return false;
+}
+
 static int addr_resolve_neigh(const struct dst_entry *dst,
 			      const struct sockaddr *dst_in,
 			      struct rdma_dev_addr *addr,
-			      unsigned int ndev_flags,
 			      u32 seq)
 {
-	int ret = 0;
-
-	if (ndev_flags & IFF_LOOPBACK)
+	if (is_dst_local(dst)) {
+		/* When the destination is local entry, source and destination
+		 * are same. Skip the neighbour lookup.
+		 */
 		memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-	else
-		ret = fetch_ha(dst, addr, dst_in, seq);
-	return ret;
+		return 0;
+	}
+
+	return fetch_ha(dst, addr, dst_in, seq);
 }
 
 static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
-				 unsigned int *ndev_flags,
 				 const struct sockaddr *dst_in,
 				 const struct dst_entry *dst)
 {
 	struct net_device *ndev = READ_ONCE(dst->dev);
 
-	*ndev_flags = ndev->flags;
 	/* A physical device must be the RDMA device to use */
-	if (ndev->flags & IFF_LOOPBACK) {
+	if (is_dst_local(dst)) {
 		int ret;
 		/*
 		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
@@ -538,7 +547,6 @@ static int addr_resolve(struct sockaddr *src_in,
 			u32 seq)
 {
 	struct dst_entry *dst = NULL;
-	unsigned int ndev_flags = 0;
 	struct rtable *rt = NULL;
 	int ret;
 
@@ -575,7 +583,7 @@ static int addr_resolve(struct sockaddr *src_in,
 		rcu_read_unlock();
 		goto done;
 	}
-	ret = rdma_set_src_addr_rcu(addr, &ndev_flags, dst_in, dst);
+	ret = rdma_set_src_addr_rcu(addr, dst_in, dst);
 	rcu_read_unlock();
 
 	/*
@@ -583,7 +591,7 @@ static int addr_resolve(struct sockaddr *src_in,
 	 * only if src addr translation didn't fail.
 	 */
 	if (!ret && resolve_neigh)
-		ret = addr_resolve_neigh(dst, dst_in, addr, ndev_flags, seq);
+		ret = addr_resolve_neigh(dst, dst_in, addr, seq);
 
 	if (src_in->sa_family == AF_INET)
 		ip_rt_put(rt);
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] IB/ipoib: Ignore L3 master device
  2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
                   ` (2 preceding siblings ...)
  2025-09-07 16:08 ` [PATCH 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
@ 2025-09-07 16:08 ` Edward Srouji
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-07 16:08 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

Currently, all master upper netdevices (e.g., bond, VRF) are treated
equally.

When a VRF netdevice is used over an IPoIB netdevice, the expected
netdev resolution is on the lower IPoIB device which has the IP address
assigned to it and not the VRF device.

The rdma_cm module (CMA) tries to match incoming requests to a
particular netdevice. When successful, it also validates that the return
path points to the same device by performing a routing table lookup.
Currently, the former would resolve to the VRF netdevice, while the
latter to the correct lower IPoIB netdevice, leading to failure in
rdma_cm.

Improve this by ignoring the VRF master netdevice, if it exists, and
instead return the lower IPoIB device.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7acafc5c0e09..5b4d76e97437 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -351,26 +351,27 @@ static bool ipoib_is_dev_match_addr_rcu(const struct sockaddr *addr,
 }
 
 /*
- * Find the master net_device on top of the given net_device.
+ * Find the L2 master net_device on top of the given net_device.
  * @dev: base IPoIB net_device
  *
- * Returns the master net_device with a reference held, or the same net_device
- * if no master exists.
+ * Returns the L2 master net_device with reference held if the L2 master
+ * exists (such as bond netdevice), or returns same netdev with reference
+ * held when master does not exist or when L3 master (such as VRF netdev).
  */
 static struct net_device *ipoib_get_master_net_dev(struct net_device *dev)
 {
 	struct net_device *master;
 
 	rcu_read_lock();
+
 	master = netdev_master_upper_dev_get_rcu(dev);
+	if (!master || netif_is_l3_master(master))
+		master = dev;
+
 	dev_hold(master);
 	rcu_read_unlock();
 
-	if (master)
-		return master;
-
-	dev_hold(dev);
-	return dev;
+	return master;
 }
 
 struct ipoib_walk_data {
@@ -522,7 +523,7 @@ static struct net_device *ipoib_get_net_dev_by_params(
 	if (ret)
 		return NULL;
 
-	/* See if we can find a unique device matching the L2 parameters */
+	/* See if we can find a unique device matching the pkey and GID */
 	matches = __ipoib_get_net_dev_by_params(dev_list, port, pkey_index,
 						gid, NULL, &net_dev);
 
@@ -535,7 +536,7 @@ static struct net_device *ipoib_get_net_dev_by_params(
 
 	dev_put(net_dev);
 
-	/* Couldn't find a unique device with L2 parameters only. Use L3
+	/* Couldn't find a unique device with pkey and GID only. Use L3
 	 * address to uniquely match the net device */
 	matches = __ipoib_get_net_dev_by_params(dev_list, port, pkey_index,
 						gid, addr, &net_dev);
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] RDMA/core: Squash a single user static function
  2025-09-07 16:08 ` [PATCH 1/4] RDMA/core: Squash a single user static function Edward Srouji
@ 2025-09-10  8:17   ` Leon Romanovsky
  2025-09-10 10:51     ` Parav Pandit
  0 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2025-09-10  8:17 UTC (permalink / raw)
  To: Edward Srouji
  Cc: jgg, linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, kuba,
	tariqt, mbloch, gal

On Sun, Sep 07, 2025 at 07:08:30PM +0300, Edward Srouji wrote:
> From: Parav Pandit <parav@nvidia.com>
> 
> In order to reduce dependencies in IFF_LOOPBACK in
> route and neighbour resolution steps, squash the
> static function to its single caller and simplify the
> code. No functional change.

It needs more explanation why it is true. Before this change,
we set dev_addr->network to some value and returned error.
That error propagated upto process_one_req(), which handles only
some errors and ignores rest.

So now, we continue to handle REQ without proper req->addr->network.

Thanks

> 
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> Signed-off-by: Edward Srouji <edwards@nvidia.com>
> ---
>  drivers/infiniband/core/addr.c | 49 ++++++++++++++--------------------
>  1 file changed, 20 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index be0743dac3ff..594e7ee335f7 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -465,34 +465,6 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
>  	return ret;
>  }
>  
> -static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr,
> -			    const struct sockaddr *dst_in,
> -			    const struct dst_entry *dst,
> -			    const struct net_device *ndev)
> -{
> -	int ret = 0;
> -
> -	if (dst->dev->flags & IFF_LOOPBACK)
> -		ret = rdma_translate_ip(dst_in, dev_addr);
> -	else
> -		rdma_copy_src_l2_addr(dev_addr, dst->dev);
> -
> -	/*
> -	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> -	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> -	 * network type accordingly.
> -	 */
> -	if (has_gateway(dst, dst_in->sa_family) &&
> -	    ndev->type != ARPHRD_INFINIBAND)
> -		dev_addr->network = dst_in->sa_family == AF_INET ?
> -						RDMA_NETWORK_IPV4 :
> -						RDMA_NETWORK_IPV6;
> -	else
> -		dev_addr->network = RDMA_NETWORK_IB;
> -
> -	return ret;
> -}
> -
>  static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
>  				 unsigned int *ndev_flags,
>  				 const struct sockaddr *dst_in,
> @@ -503,6 +475,7 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
>  	*ndev_flags = ndev->flags;
>  	/* A physical device must be the RDMA device to use */
>  	if (ndev->flags & IFF_LOOPBACK) {
> +		int ret;
>  		/*
>  		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
>  		 * loopback IP address. So if route is resolved to loopback
> @@ -512,9 +485,27 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
>  		ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev), dst_in);
>  		if (IS_ERR(ndev))
>  			return -ENODEV;
> +		ret = rdma_translate_ip(dst_in, dev_addr);
> +		if (ret)
> +			return ret;
> +	} else {
> +		rdma_copy_src_l2_addr(dev_addr, dst->dev);
>  	}
>  
> -	return copy_src_l2_addr(dev_addr, dst_in, dst, ndev);
> +	/*
> +	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> +	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> +	 * network type accordingly.
> +	 */
> +	if (has_gateway(dst, dst_in->sa_family) &&
> +	    ndev->type != ARPHRD_INFINIBAND)
> +		dev_addr->network = dst_in->sa_family == AF_INET ?
> +						RDMA_NETWORK_IPV4 :
> +						RDMA_NETWORK_IPV6;
> +	else
> +		dev_addr->network = RDMA_NETWORK_IB;
> +
> +	return 0;
>  }
>  
>  static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr)
> -- 
> 2.21.3
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-07 16:08 ` [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
@ 2025-09-10  8:32   ` Leon Romanovsky
  2025-09-10 10:55     ` Parav Pandit
  0 siblings, 1 reply; 18+ messages in thread
From: Leon Romanovsky @ 2025-09-10  8:32 UTC (permalink / raw)
  To: Edward Srouji
  Cc: jgg, linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, kuba,
	tariqt, mbloch, gal

On Sun, Sep 07, 2025 at 07:08:31PM +0300, Edward Srouji wrote:
> From: Parav Pandit <parav@nvidia.com>
> 
> Currently, if the next-hop netdevice does not support ARP resolution,
> the destination MAC address is silently set to zero without reporting
> an error. 

Not an expert here, but from my understanding this is right behavior.
IFF_NOARP means "leave" MAC address as is (zero).

> This leads to incorrect behavior and may result in packet transmission failures.
> 
> Fix this by deferring MAC resolution to the IP stack via neighbour
> lookup, allowing proper resolution or error reporting as appropriate.

What is the difference here? For IPv4, neighbour lookup is ARP, no?

> 
> Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses (GIDs)")
> Signed-off-by: Parav Pandit <parav@nvidia.com>
> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> Signed-off-by: Edward Srouji <edwards@nvidia.com>
> ---
>  drivers/infiniband/core/addr.c | 10 +++-------
>  1 file changed, 3 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
> index 594e7ee335f7..ca86c482662f 100644
> --- a/drivers/infiniband/core/addr.c
> +++ b/drivers/infiniband/core/addr.c
> @@ -454,14 +454,10 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
>  {
>  	int ret = 0;
>  
> -	if (ndev_flags & IFF_LOOPBACK) {
> +	if (ndev_flags & IFF_LOOPBACK)
>  		memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
> -	} else {
> -		if (!(ndev_flags & IFF_NOARP)) {
> -			/* If the device doesn't do ARP internally */
> -			ret = fetch_ha(dst, addr, dst_in, seq);
> -		}
> -	}
> +	else
> +		ret = fetch_ha(dst, addr, dst_in, seq);
>  	return ret;
>  }
>  
> -- 
> 2.21.3
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 1/4] RDMA/core: Squash a single user static function
  2025-09-10  8:17   ` Leon Romanovsky
@ 2025-09-10 10:51     ` Parav Pandit
  2025-09-10 12:21       ` Leon Romanovsky
  0 siblings, 1 reply; 18+ messages in thread
From: Parav Pandit @ 2025-09-10 10:51 UTC (permalink / raw)
  To: Leon Romanovsky, Edward Srouji
  Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Cosmin Ratiu, Vlad Dumitrescu,
	kuba@kernel.org, Tariq Toukan, Mark Bloch, Gal Pressman



> From: Leon Romanovsky <leon@kernel.org>
> Sent: 10 September 2025 01:48 PM
> 
> On Sun, Sep 07, 2025 at 07:08:30PM +0300, Edward Srouji wrote:
> > From: Parav Pandit <parav@nvidia.com>
> >
> > In order to reduce dependencies in IFF_LOOPBACK in route and neighbour
> > resolution steps, squash the static function to its single caller and
> > simplify the code. No functional change.
> 
> It needs more explanation why it is true. 
After second look at the code, it is not true.
It is not true for the case when dev->flags has IFF_LOOPBACK and translate_ip() failed.
In existing code, when translate_ip() fails, code still sets dev_addr->network.
dev_addr->network is not referred if the process_one_req() has error.

> Before this change, we set dev_addr-
> >network to some value and returned error.
> That error propagated upto process_one_req(), which handles only some
> errors and ignores rest.
> 
> So now, we continue to handle REQ without proper req->addr->network.
>
Not exactly. When error is returned by the code addr->network is not filled up, which is correct thing to do.

So at best, commit message should be updated to remove "No functional change".

Will send v1.
 
> Thanks
> 
> >
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> > Signed-off-by: Edward Srouji <edwards@nvidia.com>
> > ---
> >  drivers/infiniband/core/addr.c | 49
> > ++++++++++++++--------------------
> >  1 file changed, 20 insertions(+), 29 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/addr.c
> > b/drivers/infiniband/core/addr.c index be0743dac3ff..594e7ee335f7
> > 100644
> > --- a/drivers/infiniband/core/addr.c
> > +++ b/drivers/infiniband/core/addr.c
> > @@ -465,34 +465,6 @@ static int addr_resolve_neigh(const struct dst_entry
> *dst,
> >  	return ret;
> >  }
> >
> > -static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr,
> > -			    const struct sockaddr *dst_in,
> > -			    const struct dst_entry *dst,
> > -			    const struct net_device *ndev)
> > -{
> > -	int ret = 0;
> > -
> > -	if (dst->dev->flags & IFF_LOOPBACK)
> > -		ret = rdma_translate_ip(dst_in, dev_addr);
> > -	else
> > -		rdma_copy_src_l2_addr(dev_addr, dst->dev);
> > -
> > -	/*
> > -	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> > -	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> > -	 * network type accordingly.
> > -	 */
> > -	if (has_gateway(dst, dst_in->sa_family) &&
> > -	    ndev->type != ARPHRD_INFINIBAND)
> > -		dev_addr->network = dst_in->sa_family == AF_INET ?
> > -						RDMA_NETWORK_IPV4 :
> > -						RDMA_NETWORK_IPV6;
> > -	else
> > -		dev_addr->network = RDMA_NETWORK_IB;
> > -
> > -	return ret;
> > -}
> > -
> >  static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
> >  				 unsigned int *ndev_flags,
> >  				 const struct sockaddr *dst_in,
> > @@ -503,6 +475,7 @@ static int rdma_set_src_addr_rcu(struct
> rdma_dev_addr *dev_addr,
> >  	*ndev_flags = ndev->flags;
> >  	/* A physical device must be the RDMA device to use */
> >  	if (ndev->flags & IFF_LOOPBACK) {
> > +		int ret;
> >  		/*
> >  		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
> >  		 * loopback IP address. So if route is resolved to loopback
> @@
> > -512,9 +485,27 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr
> *dev_addr,
> >  		ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev),
> dst_in);
> >  		if (IS_ERR(ndev))
> >  			return -ENODEV;
> > +		ret = rdma_translate_ip(dst_in, dev_addr);
> > +		if (ret)
> > +			return ret;
> > +	} else {
> > +		rdma_copy_src_l2_addr(dev_addr, dst->dev);
> >  	}
> >
> > -	return copy_src_l2_addr(dev_addr, dst_in, dst, ndev);
> > +	/*
> > +	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> > +	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> > +	 * network type accordingly.
> > +	 */
> > +	if (has_gateway(dst, dst_in->sa_family) &&
> > +	    ndev->type != ARPHRD_INFINIBAND)
> > +		dev_addr->network = dst_in->sa_family == AF_INET ?
> > +						RDMA_NETWORK_IPV4 :
> > +						RDMA_NETWORK_IPV6;
> > +	else
> > +		dev_addr->network = RDMA_NETWORK_IB;
> > +
> > +	return 0;
> >  }
> >
> >  static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr)
> > --
> > 2.21.3
> >
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-10  8:32   ` Leon Romanovsky
@ 2025-09-10 10:55     ` Parav Pandit
  2025-09-15 16:30       ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Parav Pandit @ 2025-09-10 10:55 UTC (permalink / raw)
  To: Leon Romanovsky, Edward Srouji
  Cc: jgg@ziepe.ca, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Cosmin Ratiu, Vlad Dumitrescu,
	kuba@kernel.org, Tariq Toukan, Mark Bloch, Gal Pressman


> From: Leon Romanovsky <leon@kernel.org>
> Sent: 10 September 2025 02:02 PM
> 
> On Sun, Sep 07, 2025 at 07:08:31PM +0300, Edward Srouji wrote:
> > From: Parav Pandit <parav@nvidia.com>
> >
> > Currently, if the next-hop netdevice does not support ARP resolution,
> > the destination MAC address is silently set to zero without reporting
> > an error.
> 
> Not an expert here, but from my understanding this is right behavior.
> IFF_NOARP means "leave" MAC address as is (zero).
>
Not really.
In the example of the VRF, the device does not resolve the ARP itself, but it's the enslaved device which resolves the neighbour.
Some ip vlan l2 devices do not do arp internally but depends on the bridge/stack to resolve.


> > This leads to incorrect behavior and may result in packet transmission
> failures.
> >
> > Fix this by deferring MAC resolution to the IP stack via neighbour
> > lookup, allowing proper resolution or error reporting as appropriate.
> 
> What is the difference here? For IPv4, neighbour lookup is ARP, no?
It is but it is not the only way. A device may not do ARP by itself but it relies on the rest of the stack like vrf or ip vlan mode to resolve.
A user may also set manual entry without explicit ARP.

> 
> >
> > Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses
> > (GIDs)")
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> > Signed-off-by: Edward Srouji <edwards@nvidia.com>
> > ---
> >  drivers/infiniband/core/addr.c | 10 +++-------
> >  1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/infiniband/core/addr.c
> > b/drivers/infiniband/core/addr.c index 594e7ee335f7..ca86c482662f
> > 100644
> > --- a/drivers/infiniband/core/addr.c
> > +++ b/drivers/infiniband/core/addr.c
> > @@ -454,14 +454,10 @@ static int addr_resolve_neigh(const struct
> > dst_entry *dst,  {
> >  	int ret = 0;
> >
> > -	if (ndev_flags & IFF_LOOPBACK) {
> > +	if (ndev_flags & IFF_LOOPBACK)
> >  		memcpy(addr->dst_dev_addr, addr->src_dev_addr,
> MAX_ADDR_LEN);
> > -	} else {
> > -		if (!(ndev_flags & IFF_NOARP)) {
> > -			/* If the device doesn't do ARP internally */
> > -			ret = fetch_ha(dst, addr, dst_in, seq);
> > -		}
> > -	}
> > +	else
> > +		ret = fetch_ha(dst, addr, dst_in, seq);
> >  	return ret;
> >  }
> >
> > --
> > 2.21.3
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/4] RDMA/core: Squash a single user static function
  2025-09-10 10:51     ` Parav Pandit
@ 2025-09-10 12:21       ` Leon Romanovsky
  0 siblings, 0 replies; 18+ messages in thread
From: Leon Romanovsky @ 2025-09-10 12:21 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Edward Srouji, jgg@ziepe.ca, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Cosmin Ratiu, Vlad Dumitrescu,
	kuba@kernel.org, Tariq Toukan, Mark Bloch, Gal Pressman

On Wed, Sep 10, 2025 at 10:51:08AM +0000, Parav Pandit wrote:
> 
> 
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: 10 September 2025 01:48 PM
> > 
> > On Sun, Sep 07, 2025 at 07:08:30PM +0300, Edward Srouji wrote:
> > > From: Parav Pandit <parav@nvidia.com>
> > >
> > > In order to reduce dependencies in IFF_LOOPBACK in route and neighbour
> > > resolution steps, squash the static function to its single caller and
> > > simplify the code. No functional change.
> > 
> > It needs more explanation why it is true. 
> After second look at the code, it is not true.
> It is not true for the case when dev->flags has IFF_LOOPBACK and translate_ip() failed.
> In existing code, when translate_ip() fails, code still sets dev_addr->network.
> dev_addr->network is not referred if the process_one_req() has error.
> 
> > Before this change, we set dev_addr-
> > >network to some value and returned error.
> > That error propagated upto process_one_req(), which handles only some
> > errors and ignores rest.
> > 
> > So now, we continue to handle REQ without proper req->addr->network.
> >
> Not exactly. When error is returned by the code addr->network is not filled up, which is correct thing to do.
> 
> So at best, commit message should be updated to remove "No functional change".
> 
> Will send v1.

Parav,

Please setup your e-mail client, your reply is mixed with mine and
repeats it.

Thanks

>  
> > Thanks
> > 
> > >
> > > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > > Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
> > > Signed-off-by: Edward Srouji <edwards@nvidia.com>
> > > ---
> > >  drivers/infiniband/core/addr.c | 49
> > > ++++++++++++++--------------------
> > >  1 file changed, 20 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/drivers/infiniband/core/addr.c
> > > b/drivers/infiniband/core/addr.c index be0743dac3ff..594e7ee335f7
> > > 100644
> > > --- a/drivers/infiniband/core/addr.c
> > > +++ b/drivers/infiniband/core/addr.c
> > > @@ -465,34 +465,6 @@ static int addr_resolve_neigh(const struct dst_entry
> > *dst,
> > >  	return ret;
> > >  }
> > >
> > > -static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr,
> > > -			    const struct sockaddr *dst_in,
> > > -			    const struct dst_entry *dst,
> > > -			    const struct net_device *ndev)
> > > -{
> > > -	int ret = 0;
> > > -
> > > -	if (dst->dev->flags & IFF_LOOPBACK)
> > > -		ret = rdma_translate_ip(dst_in, dev_addr);
> > > -	else
> > > -		rdma_copy_src_l2_addr(dev_addr, dst->dev);
> > > -
> > > -	/*
> > > -	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> > > -	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> > > -	 * network type accordingly.
> > > -	 */
> > > -	if (has_gateway(dst, dst_in->sa_family) &&
> > > -	    ndev->type != ARPHRD_INFINIBAND)
> > > -		dev_addr->network = dst_in->sa_family == AF_INET ?
> > > -						RDMA_NETWORK_IPV4 :
> > > -						RDMA_NETWORK_IPV6;
> > > -	else
> > > -		dev_addr->network = RDMA_NETWORK_IB;
> > > -
> > > -	return ret;
> > > -}
> > > -
> > >  static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
> > >  				 unsigned int *ndev_flags,
> > >  				 const struct sockaddr *dst_in,
> > > @@ -503,6 +475,7 @@ static int rdma_set_src_addr_rcu(struct
> > rdma_dev_addr *dev_addr,
> > >  	*ndev_flags = ndev->flags;
> > >  	/* A physical device must be the RDMA device to use */
> > >  	if (ndev->flags & IFF_LOOPBACK) {
> > > +		int ret;
> > >  		/*
> > >  		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
> > >  		 * loopback IP address. So if route is resolved to loopback
> > @@
> > > -512,9 +485,27 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr
> > *dev_addr,
> > >  		ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev),
> > dst_in);
> > >  		if (IS_ERR(ndev))
> > >  			return -ENODEV;
> > > +		ret = rdma_translate_ip(dst_in, dev_addr);
> > > +		if (ret)
> > > +			return ret;
> > > +	} else {
> > > +		rdma_copy_src_l2_addr(dev_addr, dst->dev);
> > >  	}
> > >
> > > -	return copy_src_l2_addr(dev_addr, dst_in, dst, ndev);
> > > +	/*
> > > +	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
> > > +	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
> > > +	 * network type accordingly.
> > > +	 */
> > > +	if (has_gateway(dst, dst_in->sa_family) &&
> > > +	    ndev->type != ARPHRD_INFINIBAND)
> > > +		dev_addr->network = dst_in->sa_family == AF_INET ?
> > > +						RDMA_NETWORK_IPV4 :
> > > +						RDMA_NETWORK_IPV6;
> > > +	else
> > > +		dev_addr->network = RDMA_NETWORK_IB;
> > > +
> > > +	return 0;
> > >  }
> > >
> > >  static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr)
> > > --
> > > 2.21.3
> > >
> > >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-10 10:55     ` Parav Pandit
@ 2025-09-15 16:30       ` Jason Gunthorpe
  2025-09-15 17:16         ` Parav Pandit
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2025-09-15 16:30 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Leon Romanovsky, Edward Srouji, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Cosmin Ratiu, Vlad Dumitrescu,
	kuba@kernel.org, Tariq Toukan, Mark Bloch, Gal Pressman

On Wed, Sep 10, 2025 at 10:55:36AM +0000, Parav Pandit wrote:
> > > This leads to incorrect behavior and may result in packet transmission
> > failures.
> > >
> > > Fix this by deferring MAC resolution to the IP stack via neighbour
> > > lookup, allowing proper resolution or error reporting as appropriate.
> > 
> > What is the difference here? For IPv4, neighbour lookup is ARP, no?
> It is but it is not the only way. A device may not do ARP by itself but it relies on the rest of the stack like vrf or ip vlan mode to resolve.
> A user may also set manual entry without explicit ARP.

I think it was just a mistake to use NOARP this way in RDMA, I looked
in the git history and there was no justification. That or it was
right in the 2.x days and netdev moved on to the current schem.

I expect to just call the neighbor functions and if they can't work
for some reason they should fail?

Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-15 16:30       ` Jason Gunthorpe
@ 2025-09-15 17:16         ` Parav Pandit
  0 siblings, 0 replies; 18+ messages in thread
From: Parav Pandit @ 2025-09-15 17:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Edward Srouji, linux-rdma@vger.kernel.org,
	linux-kernel@vger.kernel.org, Cosmin Ratiu, Vlad Dumitrescu,
	kuba@kernel.org, Tariq Toukan, Mark Bloch, Gal Pressman


On 15-09-2025 09:30 am, Jason Gunthorpe wrote:
> On Wed, Sep 10, 2025 at 10:55:36AM +0000, Parav Pandit wrote:
>>>> This leads to incorrect behavior and may result in packet transmission
>>> failures.
>>>> Fix this by deferring MAC resolution to the IP stack via neighbour
>>>> lookup, allowing proper resolution or error reporting as appropriate.
>>> What is the difference here? For IPv4, neighbour lookup is ARP, no?
>> It is but it is not the only way. A device may not do ARP by itself but it relies on the rest of the stack like vrf or ip vlan mode to resolve.
>> A user may also set manual entry without explicit ARP.
> I think it was just a mistake to use NOARP this way in RDMA, I looked
> in the git history and there was no justification. That or it was
> right in the 2.x days and netdev moved on to the current schem.
>
> I expect to just call the neighbor functions and if they can't work
> for some reason they should fail?
>
> Jason

Right. This is the patch does, to rely on the neighbour functions to 
resolve without depending on the NOARP.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 0/4] Fix local destination address resolution with VRF
  2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
                   ` (3 preceding siblings ...)
  2025-09-07 16:08 ` [PATCH 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
@ 2025-09-16 11:10 ` Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 1/4] RDMA/core: Squash a single user static function Edward Srouji
                     ` (4 more replies)
  4 siblings, 5 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-16 11:10 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal, idosch

From Parav:

Presently, address resolve routines consider a destination to be local
if the next-hop device of the resolved route for the destination is the 
loopback netdevice. While this works for simple configurations, it fails
when the source and destination IP addresses belong to an enslaved
netdevice of a VRF.
In that case the next-hop device is the VRF itself, so packets are 
generated with an incorrect destination MAC on the VRF netdevice and 
ib_write_bw times out.

This patch series fixes that by determining whether a destination is
local based on the resolved route's type rather than on the next-hop
netdevice's loopback flag.
That approach resolves loopback traffic consistently both with and 
without VRF configurations.

This series contains 4 patches:
  1/4: refactor address resolution code for reuse by subsequent patches
  2/4: resolve destination MAC via IP stack
  3/4: use route table entry instead of netdev loopback flag
  4/4: fix netdev lookup for IPoIB interfaces

Parav.

---

Changelog:
v0 -> v1:
- Addressed comments from Leon
- Updated commit message to reflect that dev_addr fields are invalid in
  case of failure in PATCH 1/4
- Removed incorrect commit log about 'no functional change' in PATCH 1/4

v0: https://lore-kernel.gnuweeb.org/lkml/20250907160833.56589-5-edwards@nvidia.com/T/

---

Parav Pandit (3):
  RDMA/core: Squash a single user static function
  RDMA/core: Resolve MAC of next-hop device without ARP support
  RDMA/core: Use route entry flag to decide on loopback traffic

Vlad Dumitrescu (1):
  IB/ipoib: Ignore L3 master device

 drivers/infiniband/core/addr.c            | 83 +++++++++++------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 21 +++---
 2 files changed, 50 insertions(+), 54 deletions(-)

-- 
2.21.3


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v1 1/4] RDMA/core: Squash a single user static function
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
@ 2025-09-16 11:11   ` Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-16 11:11 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal, idosch

From: Parav Pandit <parav@nvidia.com>

To reduce dependencies in IFF_LOOPBACK in route and neighbour resolution
steps, squash the static function to its single caller and simplify the
code.
Until now, network field was set even when neighbour resolution failed.
With this change, dev_addr output fields are valid only when resolution
is successful.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 49 ++++++++++++++--------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index be0743dac3ff..594e7ee335f7 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -465,34 +465,6 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
 	return ret;
 }
 
-static int copy_src_l2_addr(struct rdma_dev_addr *dev_addr,
-			    const struct sockaddr *dst_in,
-			    const struct dst_entry *dst,
-			    const struct net_device *ndev)
-{
-	int ret = 0;
-
-	if (dst->dev->flags & IFF_LOOPBACK)
-		ret = rdma_translate_ip(dst_in, dev_addr);
-	else
-		rdma_copy_src_l2_addr(dev_addr, dst->dev);
-
-	/*
-	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
-	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
-	 * network type accordingly.
-	 */
-	if (has_gateway(dst, dst_in->sa_family) &&
-	    ndev->type != ARPHRD_INFINIBAND)
-		dev_addr->network = dst_in->sa_family == AF_INET ?
-						RDMA_NETWORK_IPV4 :
-						RDMA_NETWORK_IPV6;
-	else
-		dev_addr->network = RDMA_NETWORK_IB;
-
-	return ret;
-}
-
 static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 				 unsigned int *ndev_flags,
 				 const struct sockaddr *dst_in,
@@ -503,6 +475,7 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 	*ndev_flags = ndev->flags;
 	/* A physical device must be the RDMA device to use */
 	if (ndev->flags & IFF_LOOPBACK) {
+		int ret;
 		/*
 		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
 		 * loopback IP address. So if route is resolved to loopback
@@ -512,9 +485,27 @@ static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
 		ndev = rdma_find_ndev_for_src_ip_rcu(dev_net(ndev), dst_in);
 		if (IS_ERR(ndev))
 			return -ENODEV;
+		ret = rdma_translate_ip(dst_in, dev_addr);
+		if (ret)
+			return ret;
+	} else {
+		rdma_copy_src_l2_addr(dev_addr, dst->dev);
 	}
 
-	return copy_src_l2_addr(dev_addr, dst_in, dst, ndev);
+	/*
+	 * If there's a gateway and type of device not ARPHRD_INFINIBAND,
+	 * we're definitely in RoCE v2 (as RoCE v1 isn't routable) set the
+	 * network type accordingly.
+	 */
+	if (has_gateway(dst, dst_in->sa_family) &&
+	    ndev->type != ARPHRD_INFINIBAND)
+		dev_addr->network = dst_in->sa_family == AF_INET ?
+						RDMA_NETWORK_IPV4 :
+						RDMA_NETWORK_IPV6;
+	else
+		dev_addr->network = RDMA_NETWORK_IB;
+
+	return 0;
 }
 
 static int set_addr_netns_by_gid_rcu(struct rdma_dev_addr *addr)
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 1/4] RDMA/core: Squash a single user static function Edward Srouji
@ 2025-09-16 11:11   ` Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-16 11:11 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal, idosch

From: Parav Pandit <parav@nvidia.com>

Currently, if the next-hop netdevice does not support ARP resolution,
the destination MAC address is silently set to zero without reporting
an error. This leads to incorrect behavior and may result in packet
transmission failures.

Fix this by deferring MAC resolution to the IP stack via neighbour
lookup, allowing proper resolution or error reporting as appropriate.

Fixes: 7025fcd36bd6 ("IB: address translation to map IP toIB addresses (GIDs)")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 594e7ee335f7..ca86c482662f 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -454,14 +454,10 @@ static int addr_resolve_neigh(const struct dst_entry *dst,
 {
 	int ret = 0;
 
-	if (ndev_flags & IFF_LOOPBACK) {
+	if (ndev_flags & IFF_LOOPBACK)
 		memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-	} else {
-		if (!(ndev_flags & IFF_NOARP)) {
-			/* If the device doesn't do ARP internally */
-			ret = fetch_ha(dst, addr, dst_in, seq);
-		}
-	}
+	else
+		ret = fetch_ha(dst, addr, dst_in, seq);
 	return ret;
 }
 
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 3/4] RDMA/core: Use route entry flag to decide on loopback traffic
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 1/4] RDMA/core: Squash a single user static function Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
@ 2025-09-16 11:11   ` Edward Srouji
  2025-09-16 11:11   ` [PATCH v1 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
  2025-09-18  9:24   ` [PATCH v1 0/4] Fix local destination address resolution with VRF Leon Romanovsky
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-16 11:11 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal, idosch

From: Parav Pandit <parav@nvidia.com>

addr_resolve() considers a destination to be local if the next-hop
device of the resolved route for the destination is the loopback
netdevice.

This fails when the source and destination IP addresses belong to
a netdev enslaved to a VRF netdev. In this case the next-hop device
is the VRF itself:

 $ ip link add name myvrf up type vrf table 100
 $ ip link set ens2f0np0 master myvrf up
 $ ip addr add 192.168.1.1/24 dev ens2f0np0
 $ ip route get 192.168.1.1 oif myvrf
 local 192.168.1.1 dev myvrf table 100 src 192.168.1.1 uid 0
    cache <local>

This results in packets being generated with an incorrect destination
MAC of the VRF netdevice and ib_write_bw failing with timeout.

Solve this by determining if a destination is local or not based on
the resolved route's type rather than based on its next-hop netdevice
loopback flag.

This enables to resolve loopback traffic with and without VRF
configurations in a uniform way.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/core/addr.c | 34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index ca86c482662f..61596cda2b65 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -446,31 +446,40 @@ static int addr6_resolve(struct sockaddr *src_sock,
 }
 #endif
 
+static bool is_dst_local(const struct dst_entry *dst)
+{
+	if (dst->ops->family == AF_INET)
+		return !!(dst_rtable(dst)->rt_type & RTN_LOCAL);
+	else if (dst->ops->family == AF_INET6)
+		return !!(dst_rt6_info(dst)->rt6i_flags & RTF_LOCAL);
+	else
+		return false;
+}
+
 static int addr_resolve_neigh(const struct dst_entry *dst,
 			      const struct sockaddr *dst_in,
 			      struct rdma_dev_addr *addr,
-			      unsigned int ndev_flags,
 			      u32 seq)
 {
-	int ret = 0;
-
-	if (ndev_flags & IFF_LOOPBACK)
+	if (is_dst_local(dst)) {
+		/* When the destination is local entry, source and destination
+		 * are same. Skip the neighbour lookup.
+		 */
 		memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-	else
-		ret = fetch_ha(dst, addr, dst_in, seq);
-	return ret;
+		return 0;
+	}
+
+	return fetch_ha(dst, addr, dst_in, seq);
 }
 
 static int rdma_set_src_addr_rcu(struct rdma_dev_addr *dev_addr,
-				 unsigned int *ndev_flags,
 				 const struct sockaddr *dst_in,
 				 const struct dst_entry *dst)
 {
 	struct net_device *ndev = READ_ONCE(dst->dev);
 
-	*ndev_flags = ndev->flags;
 	/* A physical device must be the RDMA device to use */
-	if (ndev->flags & IFF_LOOPBACK) {
+	if (is_dst_local(dst)) {
 		int ret;
 		/*
 		 * RDMA (IB/RoCE, iWarp) doesn't run on lo interface or
@@ -538,7 +547,6 @@ static int addr_resolve(struct sockaddr *src_in,
 			u32 seq)
 {
 	struct dst_entry *dst = NULL;
-	unsigned int ndev_flags = 0;
 	struct rtable *rt = NULL;
 	int ret;
 
@@ -575,7 +583,7 @@ static int addr_resolve(struct sockaddr *src_in,
 		rcu_read_unlock();
 		goto done;
 	}
-	ret = rdma_set_src_addr_rcu(addr, &ndev_flags, dst_in, dst);
+	ret = rdma_set_src_addr_rcu(addr, dst_in, dst);
 	rcu_read_unlock();
 
 	/*
@@ -583,7 +591,7 @@ static int addr_resolve(struct sockaddr *src_in,
 	 * only if src addr translation didn't fail.
 	 */
 	if (!ret && resolve_neigh)
-		ret = addr_resolve_neigh(dst, dst_in, addr, ndev_flags, seq);
+		ret = addr_resolve_neigh(dst, dst_in, addr, seq);
 
 	if (src_in->sa_family == AF_INET)
 		ip_rt_put(rt);
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v1 4/4] IB/ipoib: Ignore L3 master device
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
                     ` (2 preceding siblings ...)
  2025-09-16 11:11   ` [PATCH v1 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
@ 2025-09-16 11:11   ` Edward Srouji
  2025-09-18  9:24   ` [PATCH v1 0/4] Fix local destination address resolution with VRF Leon Romanovsky
  4 siblings, 0 replies; 18+ messages in thread
From: Edward Srouji @ 2025-09-16 11:11 UTC (permalink / raw)
  To: jgg, leon
  Cc: linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, edwards,
	kuba, tariqt, mbloch, gal, idosch

From: Vlad Dumitrescu <vdumitrescu@nvidia.com>

Currently, all master upper netdevices (e.g., bond, VRF) are treated
equally.

When a VRF netdevice is used over an IPoIB netdevice, the expected
netdev resolution is on the lower IPoIB device which has the IP address
assigned to it and not the VRF device.

The rdma_cm module (CMA) tries to match incoming requests to a
particular netdevice. When successful, it also validates that the return
path points to the same device by performing a routing table lookup.
Currently, the former would resolve to the VRF netdevice, while the
latter to the correct lower IPoIB netdevice, leading to failure in
rdma_cm.

Improve this by ignoring the VRF master netdevice, if it exists, and
instead return the lower IPoIB device.

Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Edward Srouji <edwards@nvidia.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 7acafc5c0e09..5b4d76e97437 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -351,26 +351,27 @@ static bool ipoib_is_dev_match_addr_rcu(const struct sockaddr *addr,
 }
 
 /*
- * Find the master net_device on top of the given net_device.
+ * Find the L2 master net_device on top of the given net_device.
  * @dev: base IPoIB net_device
  *
- * Returns the master net_device with a reference held, or the same net_device
- * if no master exists.
+ * Returns the L2 master net_device with reference held if the L2 master
+ * exists (such as bond netdevice), or returns same netdev with reference
+ * held when master does not exist or when L3 master (such as VRF netdev).
  */
 static struct net_device *ipoib_get_master_net_dev(struct net_device *dev)
 {
 	struct net_device *master;
 
 	rcu_read_lock();
+
 	master = netdev_master_upper_dev_get_rcu(dev);
+	if (!master || netif_is_l3_master(master))
+		master = dev;
+
 	dev_hold(master);
 	rcu_read_unlock();
 
-	if (master)
-		return master;
-
-	dev_hold(dev);
-	return dev;
+	return master;
 }
 
 struct ipoib_walk_data {
@@ -522,7 +523,7 @@ static struct net_device *ipoib_get_net_dev_by_params(
 	if (ret)
 		return NULL;
 
-	/* See if we can find a unique device matching the L2 parameters */
+	/* See if we can find a unique device matching the pkey and GID */
 	matches = __ipoib_get_net_dev_by_params(dev_list, port, pkey_index,
 						gid, NULL, &net_dev);
 
@@ -535,7 +536,7 @@ static struct net_device *ipoib_get_net_dev_by_params(
 
 	dev_put(net_dev);
 
-	/* Couldn't find a unique device with L2 parameters only. Use L3
+	/* Couldn't find a unique device with pkey and GID only. Use L3
 	 * address to uniquely match the net device */
 	matches = __ipoib_get_net_dev_by_params(dev_list, port, pkey_index,
 						gid, addr, &net_dev);
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v1 0/4] Fix local destination address resolution with VRF
  2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
                     ` (3 preceding siblings ...)
  2025-09-16 11:11   ` [PATCH v1 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
@ 2025-09-18  9:24   ` Leon Romanovsky
  4 siblings, 0 replies; 18+ messages in thread
From: Leon Romanovsky @ 2025-09-18  9:24 UTC (permalink / raw)
  To: Edward Srouji
  Cc: jgg, linux-rdma, linux-kernel, parav, cratiu, vdumitrescu, kuba,
	tariqt, mbloch, gal, idosch

On Tue, Sep 16, 2025 at 02:10:59PM +0300, Edward Srouji wrote:
> From Parav:
> 
> Presently, address resolve routines consider a destination to be local
> if the next-hop device of the resolved route for the destination is the 
> loopback netdevice. While this works for simple configurations, it fails
> when the source and destination IP addresses belong to an enslaved
> netdevice of a VRF.
> In that case the next-hop device is the VRF itself, so packets are 
> generated with an incorrect destination MAC on the VRF netdevice and 
> ib_write_bw times out.
> 
> This patch series fixes that by determining whether a destination is
> local based on the resolved route's type rather than on the next-hop
> netdevice's loopback flag.
> That approach resolves loopback traffic consistently both with and 
> without VRF configurations.
> 
> This series contains 4 patches:
>   1/4: refactor address resolution code for reuse by subsequent patches
>   2/4: resolve destination MAC via IP stack
>   3/4: use route table entry instead of netdev loopback flag
>   4/4: fix netdev lookup for IPoIB interfaces
> 
> Parav.
> 
> ---
> 
> Changelog:
> v0 -> v1:
> - Addressed comments from Leon
> - Updated commit message to reflect that dev_addr fields are invalid in
>   case of failure in PATCH 1/4
> - Removed incorrect commit log about 'no functional change' in PATCH 1/4
> 
> v0: https://lore-kernel.gnuweeb.org/lkml/20250907160833.56589-5-edwards@nvidia.com/T/
> 
> ---
> 
> Parav Pandit (3):
>   RDMA/core: Squash a single user static function
>   RDMA/core: Resolve MAC of next-hop device without ARP support
>   RDMA/core: Use route entry flag to decide on loopback traffic
> 
> Vlad Dumitrescu (1):
>   IB/ipoib: Ignore L3 master device
> 
>  drivers/infiniband/core/addr.c            | 83 +++++++++++------------
>  drivers/infiniband/ulp/ipoib/ipoib_main.c | 21 +++---
>  2 files changed, 50 insertions(+), 54 deletions(-)

I applied this series, but please never send series as reply-to. It
broke all flows: b4 automatic apply script, patchworks and "Thanks you .."
email.

Thanks

> 
> -- 
> 2.21.3
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-09-18  9:25 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-07 16:08 [PATCH 0/4] Fix local destination address resolution with VRF Edward Srouji
2025-09-07 16:08 ` [PATCH 1/4] RDMA/core: Squash a single user static function Edward Srouji
2025-09-10  8:17   ` Leon Romanovsky
2025-09-10 10:51     ` Parav Pandit
2025-09-10 12:21       ` Leon Romanovsky
2025-09-07 16:08 ` [PATCH 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
2025-09-10  8:32   ` Leon Romanovsky
2025-09-10 10:55     ` Parav Pandit
2025-09-15 16:30       ` Jason Gunthorpe
2025-09-15 17:16         ` Parav Pandit
2025-09-07 16:08 ` [PATCH 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
2025-09-07 16:08 ` [PATCH 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
2025-09-16 11:10 ` [PATCH v1 0/4] Fix local destination address resolution with VRF Edward Srouji
2025-09-16 11:11   ` [PATCH v1 1/4] RDMA/core: Squash a single user static function Edward Srouji
2025-09-16 11:11   ` [PATCH v1 2/4] RDMA/core: Resolve MAC of next-hop device without ARP support Edward Srouji
2025-09-16 11:11   ` [PATCH v1 3/4] RDMA/core: Use route entry flag to decide on loopback traffic Edward Srouji
2025-09-16 11:11   ` [PATCH v1 4/4] IB/ipoib: Ignore L3 master device Edward Srouji
2025-09-18  9:24   ` [PATCH v1 0/4] Fix local destination address resolution with VRF Leon Romanovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox