Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next 1/2] ip_gre: Make GRE and GRETAP devices always NETIF_F_LLTX
From: Peilin Ye @ 2022-04-21 22:44 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, Cong Wang, Eric Dumazet, netdev, linux-kernel,
	Peilin Ye
In-Reply-To: <cover.1650580763.git.peilin.ye@bytedance.com>

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we made o_seqno atomic_t.  Stop special-casing TUNNEL_SEQ, and
always mark GRE[TAP] devices as NETIF_F_LLTX, since we no longer need
the TX lock (&txq->_xmit_lock).

Depends on patch "ip_gre, ip6_gre: Fix race condition on o_seqno in
collect_md mode".

Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/ipv4/ip_gre.c | 50 +++++++++++++++++++++--------------------------
 1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 8cf86e42c1d1..a81c2964f70b 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -748,6 +748,7 @@ static netdev_tx_t gre_tap_xmit(struct sk_buff *skb,
 static void ipgre_link_update(struct net_device *dev, bool set_mtu)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
+	__be16 flags;
 	int len;
 
 	len = tunnel->tun_hlen;
@@ -763,19 +764,15 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
 	if (set_mtu)
 		dev->mtu = max_t(int, dev->mtu - len, 68);
 
-	if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-		if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
-		    tunnel->encap.type == TUNNEL_ENCAP_NONE) {
-			dev->features |= NETIF_F_GSO_SOFTWARE;
-			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
-		} else {
-			dev->features &= ~NETIF_F_GSO_SOFTWARE;
-			dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
-		}
-		dev->features |= NETIF_F_LLTX;
-	} else {
+	flags = tunnel->parms.o_flags;
+
+	if (flags & TUNNEL_SEQ ||
+	    (flags & TUNNEL_CSUM && tunnel->encap.type != TUNNEL_ENCAP_NONE)) {
+		dev->features &= ~NETIF_F_GSO_SOFTWARE;
 		dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
-		dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
+	} else {
+		dev->features |= NETIF_F_GSO_SOFTWARE;
+		dev->hw_features |= NETIF_F_GSO_SOFTWARE;
 	}
 }
 
@@ -949,6 +946,7 @@ static void ipgre_tunnel_setup(struct net_device *dev)
 static void __gre_tunnel_init(struct net_device *dev)
 {
 	struct ip_tunnel *tunnel;
+	__be16 flags;
 
 	tunnel = netdev_priv(dev);
 	tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
@@ -957,25 +955,21 @@ static void __gre_tunnel_init(struct net_device *dev)
 	tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
 	dev->needed_headroom = tunnel->hlen + sizeof(tunnel->parms.iph);
 
-	dev->features		|= GRE_FEATURES;
+	dev->features		|= GRE_FEATURES | NETIF_F_LLTX;
 	dev->hw_features	|= GRE_FEATURES;
 
-	if (!(tunnel->parms.o_flags & TUNNEL_SEQ)) {
-		/* TCP offload with GRE SEQ is not supported, nor
-		 * can we support 2 levels of outer headers requiring
-		 * an update.
-		 */
-		if (!(tunnel->parms.o_flags & TUNNEL_CSUM) ||
-		    (tunnel->encap.type == TUNNEL_ENCAP_NONE)) {
-			dev->features    |= NETIF_F_GSO_SOFTWARE;
-			dev->hw_features |= NETIF_F_GSO_SOFTWARE;
-		}
+	flags = tunnel->parms.o_flags;
 
-		/* Can use a lockless transmit, unless we generate
-		 * output sequences
-		 */
-		dev->features |= NETIF_F_LLTX;
-	}
+	/* TCP offload with GRE SEQ is not supported, nor can we support 2
+	 * levels of outer headers requiring an update.
+	 */
+	if (flags & TUNNEL_SEQ)
+		return;
+	if (flags & TUNNEL_CSUM && tunnel->encap.type != TUNNEL_ENCAP_NONE)
+		return;
+
+	dev->features |= NETIF_F_GSO_SOFTWARE;
+	dev->hw_features |= NETIF_F_GSO_SOFTWARE;
 }
 
 static int ipgre_tunnel_init(struct net_device *dev)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net-next 0/2] Make [IP6]GRE[TAP] devices always NETIF_F_LLTX
From: Peilin Ye @ 2022-04-21 22:43 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, Cong Wang, Eric Dumazet, netdev, linux-kernel,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Hi all,

This patchset depends on these fixes [1].  Since o_seqno is now atomic_t,
we can always turn on NETIF_F_LLTX for [IP6]GRE[TAP] devices, since we no
longer need the TX lock (&txq->_xmit_lock).

We could probably do the same thing to [IP6]ERSPAN devices as well, but
I'm not familiar with them yet.  For example, ERSPAN devices are
initialized as |= GRE_FEATURES in erspan_tunnel_init(), but I don't see
IP6ERSPAN devices being initialized as |= GRE6_FEATURES.  Please suggest
if I'm missing something, thanks!

[1] https://lore.kernel.org/netdev/cover.1650575919.git.peilin.ye@bytedance.com/

Thanks,
Peilin Ye (2):
  ip_gre: Make GRE and GRETAP devices always NETIF_F_LLTX
  ip6_gre: Make IP6GRE and IP6GRETAP devices always NETIF_F_LLTX

 net/ipv4/ip_gre.c  | 50 ++++++++++++++++++++--------------------------
 net/ipv6/ip6_gre.c | 34 ++++++++++++-------------------
 2 files changed, 35 insertions(+), 49 deletions(-)

-- 
2.20.1

^ permalink raw reply

* [PATCH net] net: dsa: flood multicast to CPU when slave has IFF_PROMISC
From: Vladimir Oltean @ 2022-04-21 22:42 UTC (permalink / raw)
  To: netdev
  Cc: Jakub Kicinski, David S. Miller, Paolo Abeni, Florian Fainelli,
	Andrew Lunn, Vivien Didelot, Vladimir Oltean

Certain DSA switches can eliminate flooding to the CPU when none of the
ports have the IFF_ALLMULTI or IFF_PROMISC flags set. This is done by
synthesizing a call to dsa_port_bridge_flags() for the CPU port, a call
which normally comes from the bridge driver via switchdev.

The bridge port flags and IFF_PROMISC|IFF_ALLMULTI have slightly
different semantics, and due to inattention/lack of proper testing, the
IFF_PROMISC flag allows unknown unicast to be flooded to the CPU, but
not unknown multicast.

This must be fixed by setting both BR_FLOOD (unicast) and BR_MCAST_FLOOD
in the synthesized dsa_port_bridge_flags() call, since IFF_PROMISC means
that packets should not be filtered regardless of their MAC DA.

Fixes: 7569459a52c9 ("net: dsa: manage flooding on the CPU ports")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/slave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 63da683d4660..5ee0aced9410 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -285,7 +285,7 @@ static void dsa_port_manage_cpu_flood(struct dsa_port *dp)
 		if (other_dp->slave->flags & IFF_ALLMULTI)
 			flags.val |= BR_MCAST_FLOOD;
 		if (other_dp->slave->flags & IFF_PROMISC)
-			flags.val |= BR_FLOOD;
+			flags.val |= BR_FLOOD | BR_MCAST_FLOOD;
 	}

 	err = dsa_port_pre_bridge_flags(dp, flags, NULL);
-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH v2 bpf 00/11] bpf: random unpopular userspace fixes (32 bit et al)
From: Alexander Lobakin @ 2022-04-21 22:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen,
	Daniel Borkmann, Andrii Nakryiko, Maciej Fijalkowski, Song Liu,
	Kumar Kartikeya Dwivedi, bpf, Network Development, LKML
In-Reply-To: <CAADnVQJJiBO5T3dvYaifhu3crmce7CH9b5ioc1u4=Y25SUxVRA@mail.gmail.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Wed, 20 Apr 2022 17:40:34 -0700

> On Wed, Apr 20, 2022 at 5:38 PM Alexander Lobakin <alobakin@pm.me> wrote:
>
> Again?
>
> -----BEGIN PGP MESSAGE-----
> Version: ProtonMail
>
> wcFMA165ASBBe6s8AQ/8C9y4TqXgASA5xBT7UIf2GyTQRjKWcy/6kT1dkjkF
> FldAOhehhgLYjLJzNAIkecOQfz/XNapW3GdrQDq11pq9Bzs1SJJekGXlHVIW
>
> Sorry I'm tossing the series out of patchwork.

Oh sorry, I was hoping upgrading Bridge would help >_<

Let me know if you're reading this particular message in your inbox
finely. Toke guessed it precisely regarding the per-recipient lists
-- Proton by default saves every address I've ever sent mails to to
Contacts and then tries to fetch PGP public keys for each contact.
Again, for some reason, for a couple addresses, including
ast@kernel.org, it managed to fetch something, but that something
was sorta broken. So at the end I've been having broken PGP for
the address I've never manually set or even wanted PGP.
If it's still messed, I'll contact support then. Sorry again for
this.

Thanks,
Al


^ permalink raw reply

* [net-next v1] net: Add a second bind table hashed by port and address
From: Joanne Koong @ 2022-04-21 22:14 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, kafai, davem, kuba, Joanne Koong

We currently have one tcp bind table (bhash) which hashes by port
number only. In the socket bind path, we check for bind conflicts by
traversing the specified port's inet_bind2_bucket while holding the
bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).

In instances where there are tons of sockets hashed to the same port
at different addresses, checking for a bind conflict is time-intensive
and can cause softirq cpu lockups, as well as stops new tcp connections
since __inet_inherit_port() also contests for the spinlock.

This patch proposes adding a second bind table, bhash2, that hashes by
port and ip address. Searching the bhash2 table leads to significantly
faster conflict resolution and less time holding the spinlock.
When experimentally testing this on a local server, the results for how
long a bind request takes were as follows:

when there are ~24k sockets already bound to the port -

ipv4:
before - 0.002317 seconds
with bhash2 - 0.000018 seconds

ipv6:
before - 0.002431 seconds
with bhash2 - 0.000021 seconds

when there are ~12 million sockets already bound to the port -

ipv4:
before - 7.498583 seconds
with bhash2 - 0.000021 seconds

ipv6:
before - 7.813554 seconds
with bhash2 - 0.000029 seconds

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/net/inet_connection_sock.h |   3 +
 include/net/inet_hashtables.h      |  56 ++++++-
 include/net/sock.h                 |  14 ++
 net/dccp/proto.c                   |  14 +-
 net/ipv4/inet_connection_sock.c    | 227 +++++++++++++++++++++--------
 net/ipv4/inet_hashtables.c         | 188 ++++++++++++++++++++++--
 net/ipv4/tcp.c                     |  14 +-
 7 files changed, 438 insertions(+), 78 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 3908296d103f..d89a78d10294 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -25,6 +25,7 @@
 #undef INET_CSK_CLEAR_TIMERS
 
 struct inet_bind_bucket;
+struct inet_bind2_bucket;
 struct tcp_congestion_ops;
 
 /*
@@ -57,6 +58,7 @@ struct inet_connection_sock_af_ops {
  *
  * @icsk_accept_queue:	   FIFO of established children
  * @icsk_bind_hash:	   Bind node
+ * @icsk_bind2_hash:	   Bind node in the bhash2 table
  * @icsk_timeout:	   Timeout
  * @icsk_retransmit_timer: Resend (no ack)
  * @icsk_rto:		   Retransmit timeout
@@ -84,6 +86,7 @@ struct inet_connection_sock {
 	struct inet_sock	  icsk_inet;
 	struct request_sock_queue icsk_accept_queue;
 	struct inet_bind_bucket	  *icsk_bind_hash;
+	struct inet_bind2_bucket  *icsk_bind2_hash;
 	unsigned long		  icsk_timeout;
  	struct timer_list	  icsk_retransmit_timer;
  	struct timer_list	  icsk_delack_timer;
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index f72ec113ae56..143a33d815c2 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -90,11 +90,30 @@ struct inet_bind_bucket {
 	struct hlist_head	owners;
 };
 
+struct inet_bind2_bucket {
+	possible_net_t		ib_net;
+	int			l3mdev;
+	unsigned short		port;
+	union {
+#if IS_ENABLED(CONFIG_IPV6)
+		struct in6_addr		v6_rcv_saddr;
+#endif
+		__be32			rcv_saddr;
+	};
+	struct hlist_node	node;		/* Node in the inet2_bind_hashbucket chain */
+	struct hlist_head	owners;		/* List of sockets hashed to this bucket */
+};
+
 static inline struct net *ib_net(struct inet_bind_bucket *ib)
 {
 	return read_pnet(&ib->ib_net);
 }
 
+static inline struct net *ib2_net(struct inet_bind2_bucket *ib)
+{
+	return read_pnet(&ib->ib_net);
+}
+
 #define inet_bind_bucket_for_each(tb, head) \
 	hlist_for_each_entry(tb, head, node)
 
@@ -103,6 +122,15 @@ struct inet_bind_hashbucket {
 	struct hlist_head	chain;
 };
 
+/* This is synchronized using the inet_bind_hashbucket's spinlock.
+ * Instead of having separate spinlocks, the inet_bind2_hashbucket can share
+ * the inet_bind_hashbucket's given that in every case where the bhash2 table
+ * is useful, a lookup in the bhash table also occurs.
+ */
+struct inet_bind2_hashbucket {
+	struct hlist_head	chain;
+};
+
 /* Sockets can be hashed in established or listening table.
  * We must use different 'nulls' end-of-chain value for all hash buckets :
  * A socket might transition from ESTABLISH to LISTEN state without
@@ -138,6 +166,11 @@ struct inet_hashinfo {
 	 */
 	struct kmem_cache		*bind_bucket_cachep;
 	struct inet_bind_hashbucket	*bhash;
+	/* The 2nd binding table hashed by port and address.
+	 * This is used primarily for expediting the resolution of bind conflicts.
+	 */
+	struct kmem_cache		*bind2_bucket_cachep;
+	struct inet_bind2_hashbucket	*bhash2;
 	unsigned int			bhash_size;
 
 	/* The 2nd listener table hashed by local port and address */
@@ -221,6 +254,27 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
 
+static inline bool check_bind_bucket_match(struct inet_bind_bucket *tb, struct net *net,
+					   const unsigned short port, int l3mdev)
+{
+	return net_eq(ib_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev;
+}
+
+struct inet_bind2_bucket *
+inet_bind2_bucket_create(struct kmem_cache *cachep, struct net *net,
+			 struct inet_bind2_hashbucket *head, const unsigned short port,
+			 int l3mdev, const struct sock *sk);
+
+void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb);
+
+struct inet_bind2_bucket *
+inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, const unsigned short port,
+		       int l3mdev, struct sock *sk, struct inet_bind2_hashbucket **head);
+
+bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, struct net *net,
+				       const unsigned short port, int l3mdev,
+				       const struct sock *sk);
+
 static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
 			       const u32 bhash_size)
 {
@@ -228,7 +282,7 @@ static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
 }
 
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
-		    const unsigned short snum);
+		    struct inet_bind2_bucket *tb2, const unsigned short snum);
 
 /* These can have wildcards, don't try too hard. */
 static inline u32 inet_lhashfn(const struct net *net, const unsigned short num)
diff --git a/include/net/sock.h b/include/net/sock.h
index c4b91fc19b9c..a2198d5674f6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -352,6 +352,7 @@ struct sk_filter;
   *	@sk_txtime_report_errors: set report errors mode for SO_TXTIME
   *	@sk_txtime_unused: unused txtime flags
   *	@ns_tracker: tracker for netns reference
+  *	@sk_bind2_node: bind node in the bhash2 table
   */
 struct sock {
 	/*
@@ -542,6 +543,7 @@ struct sock {
 #endif
 	struct rcu_head		sk_rcu;
 	netns_tracker		ns_tracker;
+	struct hlist_node	sk_bind2_node;
 };
 
 enum sk_pacing {
@@ -822,6 +824,16 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_add_head(&sk->sk_bind_node, list);
 }
 
+static inline void __sk_del_bind2_node(struct sock *sk)
+{
+	__hlist_del(&sk->sk_bind2_node);
+}
+
+static inline void sk_add_bind2_node(struct sock *sk, struct hlist_head *list)
+{
+	hlist_add_head(&sk->sk_bind2_node, list);
+}
+
 #define sk_for_each(__sk, list) \
 	hlist_for_each_entry(__sk, list, sk_node)
 #define sk_for_each_rcu(__sk, list) \
@@ -839,6 +851,8 @@ static inline void sk_add_bind_node(struct sock *sk,
 	hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
 #define sk_for_each_bound(__sk, list) \
 	hlist_for_each_entry(__sk, list, sk_bind_node)
+#define sk_for_each_bound_bhash2(__sk, list) \
+	hlist_for_each_entry(__sk, list, sk_bind2_node)
 
 /**
  * sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index a976b4d29892..e65768370170 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1121,6 +1121,12 @@ static int __init dccp_init(void)
 				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
 	if (!dccp_hashinfo.bind_bucket_cachep)
 		goto out_free_hashinfo2;
+	dccp_hashinfo.bind2_bucket_cachep =
+		kmem_cache_create("dccp_bind2_bucket",
+				  sizeof(struct inet_bind2_bucket), 0,
+				  SLAB_HWCACHE_ALIGN | SLAB_ACCOUNT, NULL);
+	if (!dccp_hashinfo.bind2_bucket_cachep)
+		goto out_free_bind_bucket_cachep;
 
 	/*
 	 * Size and allocate the main established and bind bucket
@@ -1151,7 +1157,7 @@ static int __init dccp_init(void)
 
 	if (!dccp_hashinfo.ehash) {
 		DCCP_CRIT("Failed to allocate DCCP established hash table");
-		goto out_free_bind_bucket_cachep;
+		goto out_free_bind2_bucket_cachep;
 	}
 
 	for (i = 0; i <= dccp_hashinfo.ehash_mask; i++)
@@ -1170,6 +1176,8 @@ static int __init dccp_init(void)
 			continue;
 		dccp_hashinfo.bhash = (struct inet_bind_hashbucket *)
 			__get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order);
+		dccp_hashinfo.bhash2 = (struct inet_bind2_hashbucket *)
+			__get_free_pages(GFP_ATOMIC | __GFP_NOWARN, bhash_order);
 	} while (!dccp_hashinfo.bhash && --bhash_order >= 0);
 
 	if (!dccp_hashinfo.bhash) {
@@ -1180,6 +1188,7 @@ static int __init dccp_init(void)
 	for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&dccp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
+		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}
 
 	rc = dccp_mib_init();
@@ -1214,6 +1223,8 @@ static int __init dccp_init(void)
 	inet_ehash_locks_free(&dccp_hashinfo);
 out_free_dccp_ehash:
 	free_pages((unsigned long)dccp_hashinfo.ehash, ehash_order);
+out_free_bind2_bucket_cachep:
+	kmem_cache_destroy(dccp_hashinfo.bind2_bucket_cachep);
 out_free_bind_bucket_cachep:
 	kmem_cache_destroy(dccp_hashinfo.bind_bucket_cachep);
 out_free_hashinfo2:
@@ -1222,6 +1233,7 @@ static int __init dccp_init(void)
 	dccp_hashinfo.bhash = NULL;
 	dccp_hashinfo.ehash = NULL;
 	dccp_hashinfo.bind_bucket_cachep = NULL;
+	dccp_hashinfo.bind2_bucket_cachep = NULL;
 	return rc;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 1e5b53c2bb26..482935f0c8f6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -117,6 +117,30 @@ bool inet_rcv_saddr_any(const struct sock *sk)
 	return !sk->sk_rcv_saddr;
 }
 
+static bool use_bhash2_on_bind(const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	int addr_type;
+
+	if (sk->sk_family == AF_INET6) {
+		addr_type = ipv6_addr_type(&sk->sk_v6_rcv_saddr);
+		return addr_type != IPV6_ADDR_ANY && addr_type != IPV6_ADDR_MAPPED;
+	}
+#endif
+	return sk->sk_rcv_saddr != htonl(INADDR_ANY);
+}
+
+static u32 get_bhash2_nulladdr_hash(const struct sock *sk, struct net *net, int port)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	struct in6_addr nulladdr = {};
+
+	if (sk->sk_family == AF_INET6)
+		return ipv6_portaddr_hash(net, &nulladdr, port);
+#endif
+	return ipv4_portaddr_hash(net, 0, port);
+}
+
 void inet_get_local_port_range(struct net *net, int *low, int *high)
 {
 	unsigned int seq;
@@ -130,16 +154,58 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
-static int inet_csk_bind_conflict(const struct sock *sk,
-				  const struct inet_bind_bucket *tb,
-				  bool relax, bool reuseport_ok)
+static bool bind_conflict_exist(const struct sock *sk, struct sock *sk2,
+				kuid_t sk_uid, bool relax, bool reuseport_cb_ok,
+				bool reuseport_ok)
+{
+	if (sk != sk2 && (!sk->sk_bound_dev_if || !sk2->sk_bound_dev_if ||
+			  sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+		if (sk->sk_reuse && sk2->sk_reuse && sk2->sk_state != TCP_LISTEN) {
+			if (!relax || (!reuseport_ok && sk->sk_reuseport && sk2->sk_reuseport &&
+				       reuseport_cb_ok && (sk2->sk_state == TCP_TIME_WAIT ||
+							   uid_eq(sk_uid, sock_i_uid(sk2)))))
+				return true;
+		} else if (!reuseport_ok || !sk->sk_reuseport || !sk2->sk_reuseport ||
+			   !reuseport_cb_ok || (sk2->sk_state != TCP_TIME_WAIT &&
+						!uid_eq(sk_uid, sock_i_uid(sk2)))) {
+			return true;
+		}
+	}
+	return false;
+}
+
+static bool check_bhash2_conflict(const struct sock *sk, struct inet_bind2_bucket *tb2,
+				  kuid_t sk_uid, bool relax, bool reuseport_cb_ok,
+				  bool reuseport_ok)
 {
 	struct sock *sk2;
-	bool reuseport_cb_ok;
-	bool reuse = sk->sk_reuse;
-	bool reuseport = !!sk->sk_reuseport;
-	struct sock_reuseport *reuseport_cb;
+
+	sk_for_each_bound_bhash2(sk2, &tb2->owners) {
+		if (sk->sk_family == AF_INET && ipv6_only_sock(sk2))
+			continue;
+
+		if (bind_conflict_exist(sk, sk2, sk_uid, relax,
+					reuseport_cb_ok, reuseport_ok))
+			return true;
+	}
+	return false;
+}
+
+/* This should be called only when the corresponding inet_bind_bucket spinlock is held */
+static int inet_csk_bind_conflict(const struct sock *sk, int port,
+				  struct inet_bind_bucket *tb,
+				  struct inet_bind2_bucket *tb2, /* may be null */
+				  bool relax, bool reuseport_ok)
+{
+	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
 	kuid_t uid = sock_i_uid((struct sock *)sk);
+	struct sock_reuseport *reuseport_cb;
+	struct inet_bind2_hashbucket *head2;
+	bool reuseport_cb_ok;
+	struct sock *sk2;
+	struct net *net;
+	int l3mdev;
+	u32 hash;
 
 	rcu_read_lock();
 	reuseport_cb = rcu_dereference(sk->sk_reuseport_cb);
@@ -150,36 +216,40 @@ static int inet_csk_bind_conflict(const struct sock *sk,
 	/*
 	 * Unlike other sk lookup places we do not check
 	 * for sk_net here, since _all_ the socks listed
-	 * in tb->owners list belong to the same net - the
-	 * one this bucket belongs to.
+	 * in tb->owners and tb2->owners list belong
+	 * to the same net
 	 */
 
-	sk_for_each_bound(sk2, &tb->owners) {
-		if (sk != sk2 &&
-		    (!sk->sk_bound_dev_if ||
-		     !sk2->sk_bound_dev_if ||
-		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
-			if (reuse && sk2->sk_reuse &&
-			    sk2->sk_state != TCP_LISTEN) {
-				if ((!relax ||
-				     (!reuseport_ok &&
-				      reuseport && sk2->sk_reuseport &&
-				      reuseport_cb_ok &&
-				      (sk2->sk_state == TCP_TIME_WAIT ||
-				       uid_eq(uid, sock_i_uid(sk2))))) &&
-				    inet_rcv_saddr_equal(sk, sk2, true))
-					break;
-			} else if (!reuseport_ok ||
-				   !reuseport || !sk2->sk_reuseport ||
-				   !reuseport_cb_ok ||
-				   (sk2->sk_state != TCP_TIME_WAIT &&
-				    !uid_eq(uid, sock_i_uid(sk2)))) {
-				if (inet_rcv_saddr_equal(sk, sk2, true))
-					break;
-			}
-		}
+	if (!use_bhash2_on_bind(sk)) {
+		sk_for_each_bound(sk2, &tb->owners)
+			if (bind_conflict_exist(sk, sk2, uid, relax,
+						reuseport_cb_ok, reuseport_ok) &&
+			    inet_rcv_saddr_equal(sk, sk2, true))
+				return true;
+
+		return false;
 	}
-	return sk2 != NULL;
+
+	if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, reuseport_ok))
+		return true;
+
+	net = sock_net(sk);
+
+	/* check there's no conflict with an existing IPV6_ADDR_ANY (if ipv6) or
+	 * INADDR_ANY (if ipv4) socket.
+	 */
+	hash = get_bhash2_nulladdr_hash(sk, net, port);
+	head2 = &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
+
+	l3mdev = inet_sk_bound_l3mdev(sk);
+	inet_bind_bucket_for_each(tb2, &head2->chain)
+		if (check_bind2_bucket_match_nulladdr(tb2, net, port, l3mdev, sk))
+			break;
+
+	if (tb2 && check_bhash2_conflict(sk, tb2, uid, relax, reuseport_cb_ok, reuseport_ok))
+		return true;
+
+	return false;
 }
 
 /*
@@ -187,16 +257,20 @@ static int inet_csk_bind_conflict(const struct sock *sk,
  * inet_bind_hashbucket lock held.
  */
 static struct inet_bind_hashbucket *
-inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *port_ret)
+inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret,
+			struct inet_bind2_bucket **tb2_ret,
+			struct inet_bind2_hashbucket **head2_ret, int *port_ret)
 {
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int port = 0;
+	struct inet_bind2_hashbucket *head2;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
-	bool relax = false;
 	int i, low, high, attempt_half;
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	bool relax = false;
+	int port = 0;
 	int l3mdev;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
@@ -235,10 +309,11 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 		head = &hinfo->bhash[inet_bhashfn(net, port,
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
+		tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-			    tb->port == port) {
-				if (!inet_csk_bind_conflict(sk, tb, relax, false))
+			if (check_bind_bucket_match(tb, net, port, l3mdev)) {
+				if (!inet_csk_bind_conflict(sk, port, tb, tb2,
+							    relax, false))
 					goto success;
 				goto next_port;
 			}
@@ -268,6 +343,8 @@ inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 success:
 	*port_ret = port;
 	*tb_ret = tb;
+	*tb2_ret = tb2;
+	*head2_ret = head2;
 	return head;
 }
 
@@ -363,54 +440,77 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 {
 	bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
 	struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
-	int ret = 1, port = snum;
+	bool bhash_created = false, bhash2_created = false;
+	struct inet_bind2_bucket *tb2 = NULL;
+	struct inet_bind2_hashbucket *head2;
+	struct inet_bind_bucket *tb = NULL;
 	struct inet_bind_hashbucket *head;
 	struct net *net = sock_net(sk);
-	struct inet_bind_bucket *tb = NULL;
+	int ret = 1, port = snum;
+	bool found_port = false;
 	int l3mdev;
 
 	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
-		head = inet_csk_find_open_port(sk, &tb, &port);
+		head = inet_csk_find_open_port(sk, &tb, &tb2, &head2, &port);
 		if (!head)
 			return ret;
+		if (tb && tb2)
+			goto success;
+		found_port = true;
+	} else {
+		head = &hinfo->bhash[inet_bhashfn(net, port,
+						  hinfo->bhash_size)];
+		spin_lock_bh(&head->lock);
+		inet_bind_bucket_for_each(tb, &head->chain)
+			if (check_bind_bucket_match(tb, net, port, l3mdev))
+				break;
+
+		tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
+	}
+
+	if (!tb) {
+		tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep, net, head,
+					     port, l3mdev);
 		if (!tb)
-			goto tb_not_found;
-		goto success;
+			goto fail_unlock;
+		bhash_created = true;
+	}
+
+	if (!tb2) {
+		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep,
+					       net, head2, port, l3mdev, sk);
+		if (!tb2)
+			goto fail_unlock;
+		bhash2_created = true;
 	}
-	head = &hinfo->bhash[inet_bhashfn(net, port,
-					  hinfo->bhash_size)];
-	spin_lock_bh(&head->lock);
-	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-		    tb->port == port)
-			goto tb_found;
-tb_not_found:
-	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port, l3mdev);
-	if (!tb)
-		goto fail_unlock;
-tb_found:
-	if (!hlist_empty(&tb->owners)) {
+
+	/* If we had to find an open port, we already checked for conflicts */
+	if (!found_port && !hlist_empty(&tb->owners)) {
 		if (sk->sk_reuse == SK_FORCE_REUSE)
 			goto success;
-
 		if ((tb->fastreuse > 0 && reuse) ||
 		    sk_reuseport_match(tb, sk))
 			goto success;
-		if (inet_csk_bind_conflict(sk, tb, true, true))
+		if (inet_csk_bind_conflict(sk, port, tb, tb2, true, true))
 			goto fail_unlock;
 	}
 success:
 	inet_csk_update_fastreuse(tb, sk);
-
 	if (!inet_csk(sk)->icsk_bind_hash)
-		inet_bind_hash(sk, tb, port);
+		inet_bind_hash(sk, tb, tb2, port);
 	WARN_ON(inet_csk(sk)->icsk_bind_hash != tb);
+	WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
 	ret = 0;
 
 fail_unlock:
+	if (ret) {
+		if (bhash_created)
+			inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+		if (bhash2_created)
+			inet_bind2_bucket_destroy(hinfo->bind2_bucket_cachep, tb2);
+	}
 	spin_unlock_bh(&head->lock);
 	return ret;
 }
@@ -957,6 +1057,7 @@ struct sock *inet_csk_clone_lock(const struct sock *sk,
 
 		inet_sk_set_state(newsk, TCP_SYN_RECV);
 		newicsk->icsk_bind_hash = NULL;
+		newicsk->icsk_bind2_hash = NULL;
 
 		inet_sk(newsk)->inet_dport = inet_rsk(req)->ir_rmt_port;
 		inet_sk(newsk)->inet_num = inet_rsk(req)->ir_num;
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 17440840a791..9f0bece06609 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -81,6 +81,41 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 	return tb;
 }
 
+struct inet_bind2_bucket *inet_bind2_bucket_create(struct kmem_cache *cachep,
+						   struct net *net,
+						   struct inet_bind2_hashbucket *head,
+						   const unsigned short port,
+						   int l3mdev,
+						   const struct sock *sk)
+{
+	struct inet_bind2_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
+
+	if (tb) {
+		write_pnet(&tb->ib_net, net);
+		tb->l3mdev    = l3mdev;
+		tb->port      = port;
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6)
+			tb->v6_rcv_saddr = sk->sk_v6_rcv_saddr;
+		else
+#endif
+			tb->rcv_saddr = sk->sk_rcv_saddr;
+		INIT_HLIST_HEAD(&tb->owners);
+		hlist_add_head(&tb->node, &head->chain);
+	}
+	return tb;
+}
+
+static bool bind2_bucket_addr_match(struct inet_bind2_bucket *tb2, struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return ipv6_addr_equal(&tb2->v6_rcv_saddr,
+				       &sk->sk_v6_rcv_saddr);
+#endif
+	return tb2->rcv_saddr == sk->sk_rcv_saddr;
+}
+
 /*
  * Caller must hold hashbucket lock for this tb with local BH disabled
  */
@@ -92,12 +127,25 @@ void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket
 	}
 }
 
+/* Caller must hold the lock for the corresponding hashbucket in the bhash table
+ * with local BH disabled
+ */
+void inet_bind2_bucket_destroy(struct kmem_cache *cachep, struct inet_bind2_bucket *tb)
+{
+	if (hlist_empty(&tb->owners)) {
+		__hlist_del(&tb->node);
+		kmem_cache_free(cachep, tb);
+	}
+}
+
 void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
-		    const unsigned short snum)
+		    struct inet_bind2_bucket *tb2, const unsigned short snum)
 {
 	inet_sk(sk)->inet_num = snum;
 	sk_add_bind_node(sk, &tb->owners);
 	inet_csk(sk)->icsk_bind_hash = tb;
+	sk_add_bind2_node(sk, &tb2->owners);
+	inet_csk(sk)->icsk_bind2_hash = tb2;
 }
 
 /*
@@ -109,6 +157,7 @@ static void __inet_put_port(struct sock *sk)
 	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
 			hashinfo->bhash_size);
 	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 
 	spin_lock(&head->lock);
@@ -117,6 +166,13 @@ static void __inet_put_port(struct sock *sk)
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
 	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+
+	if (inet_csk(sk)->icsk_bind2_hash) {
+		tb2 = inet_csk(sk)->icsk_bind2_hash;
+		__sk_del_bind2_node(sk);
+		inet_csk(sk)->icsk_bind2_hash = NULL;
+		inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
+	}
 	spin_unlock(&head->lock);
 }
 
@@ -133,14 +189,19 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
 	unsigned short port = inet_sk(child)->inet_num;
 	const int bhash = inet_bhashfn(sock_net(sk), port,
-			table->bhash_size);
+				       table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
+	struct inet_bind2_hashbucket *head_bhash2;
+	bool created_inet_bind_bucket = false;
+	struct net *net = sock_net(sk);
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
 	int l3mdev;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
-	if (unlikely(!tb)) {
+	tb2 = inet_csk(sk)->icsk_bind2_hash;
+	if (unlikely(!tb || !tb2)) {
 		spin_unlock(&head->lock);
 		return -ENOENT;
 	}
@@ -153,25 +214,45 @@ int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		 * as that of the child socket. We have to look up or
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), sock_net(sk)) &&
-			    tb->l3mdev == l3mdev && tb->port == port)
+			if (check_bind_bucket_match(tb, net, port, l3mdev))
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port,
-						     l3mdev);
+						     net, head, port, l3mdev);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				return -ENOMEM;
 			}
+			created_inet_bind_bucket = true;
 		}
 		inet_csk_update_fastreuse(tb, child);
+
+		goto bhash2_find;
+	} else if (!bind2_bucket_addr_match(tb2, child)) {
+		l3mdev = inet_sk_bound_l3mdev(sk);
+
+bhash2_find:
+		tb2 = inet_bind2_bucket_find(table, net, port, l3mdev, child,
+					     &head_bhash2);
+		if (!tb2) {
+			tb2 = inet_bind2_bucket_create(table->bind2_bucket_cachep,
+						       net, head_bhash2, port, l3mdev,
+						       child);
+			if (!tb2)
+				goto error;
+		}
 	}
-	inet_bind_hash(child, tb, port);
+	inet_bind_hash(child, tb, tb2, port);
 	spin_unlock(&head->lock);
 
 	return 0;
+
+error:
+	if (created_inet_bind_bucket)
+		inet_bind_bucket_destroy(table->bind_bucket_cachep, tb);
+	spin_unlock(&head->lock);
+	return -ENOMEM;
 }
 EXPORT_SYMBOL_GPL(__inet_inherit_port);
 
@@ -722,6 +803,71 @@ void inet_unhash(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(inet_unhash);
 
+static inline bool check_bind2_bucket_match(struct inet_bind2_bucket *tb, struct net *net,
+					    unsigned short port, int l3mdev, struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
+			ipv6_addr_equal(&tb->v6_rcv_saddr, &sk->sk_v6_rcv_saddr);
+	else
+#endif
+		return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
+			tb->rcv_saddr == sk->sk_rcv_saddr;
+}
+
+bool check_bind2_bucket_match_nulladdr(struct inet_bind2_bucket *tb, struct net *net,
+				       const unsigned short port, int l3mdev, const struct sock *sk)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	struct in6_addr nulladdr = {};
+
+	if (sk->sk_family == AF_INET6)
+		return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
+			ipv6_addr_equal(&tb->v6_rcv_saddr, &nulladdr);
+	else
+#endif
+		return net_eq(ib2_net(tb), net) && tb->port == port && tb->l3mdev == l3mdev &&
+			tb->rcv_saddr == 0;
+}
+
+static struct inet_bind2_hashbucket *
+inet_bhashfn_portaddr(struct inet_hashinfo *hinfo, const struct sock *sk,
+		      const struct net *net, unsigned short port)
+{
+	u32 hash;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (sk->sk_family == AF_INET6)
+		hash = ipv6_portaddr_hash(net, &sk->sk_v6_rcv_saddr, port);
+	else
+#endif
+		hash = ipv4_portaddr_hash(net, sk->sk_rcv_saddr, port);
+	return &hinfo->bhash2[hash & (hinfo->bhash_size - 1)];
+}
+
+/* This should only be called when the spinlock for the socket's corresponding
+ * bind_hashbucket is held
+ */
+struct inet_bind2_bucket *
+inet_bind2_bucket_find(struct inet_hashinfo *hinfo, struct net *net, const unsigned short port,
+		       int l3mdev, struct sock *sk, struct inet_bind2_hashbucket **head)
+{
+	struct inet_bind2_bucket *bhash2 = NULL;
+	struct inet_bind2_hashbucket *h;
+
+	h = inet_bhashfn_portaddr(hinfo, sk, net, port);
+	inet_bind_bucket_for_each(bhash2, &h->chain) {
+		if (check_bind2_bucket_match(bhash2, net, port, l3mdev, sk))
+			break;
+	}
+
+	if (head)
+		*head = h;
+
+	return bhash2;
+}
+
 /* RFC 6056 3.3.4.  Algorithm 4: Double-Hash Port Selection Algorithm
  * Note that we use 32bit integers (vs RFC 'short integers')
  * because 2^16 is not a multiple of num_ephemeral and this
@@ -740,10 +886,13 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 {
 	struct inet_hashinfo *hinfo = death_row->hashinfo;
 	struct inet_timewait_sock *tw = NULL;
+	struct inet_bind2_hashbucket *head2;
 	struct inet_bind_hashbucket *head;
 	int port = inet_sk(sk)->inet_num;
 	struct net *net = sock_net(sk);
+	struct inet_bind2_bucket *tb2;
 	struct inet_bind_bucket *tb;
+	bool tb_created = false;
 	u32 remaining, offset;
 	int ret, i, low, high;
 	int l3mdev;
@@ -797,8 +946,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 * the established check is already unique enough.
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
-			    tb->port == port) {
+			if (check_bind_bucket_match(tb, net, port, l3mdev)) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
 					goto next_port;
@@ -816,6 +964,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			spin_unlock_bh(&head->lock);
 			return -ENOMEM;
 		}
+		tb_created = true;
 		tb->fastreuse = -1;
 		tb->fastreuseport = -1;
 		goto ok;
@@ -831,6 +980,17 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	return -EADDRNOTAVAIL;
 
 ok:
+	/* Find the corresponding tb2 bucket since we need to
+	 * add the socket to the bhash2 table as well
+	 */
+	tb2 = inet_bind2_bucket_find(hinfo, net, port, l3mdev, sk, &head2);
+	if (!tb2) {
+		tb2 = inet_bind2_bucket_create(hinfo->bind2_bucket_cachep, net,
+					       head2, port, l3mdev, sk);
+		if (!tb2)
+			goto error;
+	}
+
 	/* If our first attempt found a candidate, skip next candidate
 	 * in 1/16 of cases to add some noise.
 	 */
@@ -839,7 +999,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
 
 	/* Head lock still held and bh's disabled */
-	inet_bind_hash(sk, tb, port);
+	inet_bind_hash(sk, tb, tb2, port);
 	if (sk_unhashed(sk)) {
 		inet_sk(sk)->inet_sport = htons(port);
 		inet_ehash_nolisten(sk, (struct sock *)tw, NULL);
@@ -851,6 +1011,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		inet_twsk_deschedule_put(tw);
 	local_bh_enable();
 	return 0;
+
+error:
+	if (tb_created)
+		inet_bind_bucket_destroy(hinfo->bind_bucket_cachep, tb);
+	spin_unlock_bh(&head->lock);
+	return -ENOMEM;
 }
 
 /*
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index cf18fbcbf123..5a143c9afd20 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4627,6 +4627,12 @@ void __init tcp_init(void)
 				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
 				  SLAB_ACCOUNT,
 				  NULL);
+	tcp_hashinfo.bind2_bucket_cachep =
+		kmem_cache_create("tcp_bind2_bucket",
+				  sizeof(struct inet_bind2_bucket), 0,
+				  SLAB_HWCACHE_ALIGN | SLAB_PANIC |
+				  SLAB_ACCOUNT,
+				  NULL);
 
 	/* Size and allocate the main established and bind bucket
 	 * hash tables.
@@ -4649,8 +4655,9 @@ void __init tcp_init(void)
 	if (inet_ehash_locks_alloc(&tcp_hashinfo))
 		panic("TCP: failed to alloc ehash_locks");
 	tcp_hashinfo.bhash =
-		alloc_large_system_hash("TCP bind",
-					sizeof(struct inet_bind_hashbucket),
+		alloc_large_system_hash("TCP bind bhash tables",
+					sizeof(struct inet_bind_hashbucket) +
+					sizeof(struct inet_bind2_hashbucket),
 					tcp_hashinfo.ehash_mask + 1,
 					17, /* one slot per 128 KB of memory */
 					0,
@@ -4659,9 +4666,12 @@ void __init tcp_init(void)
 					0,
 					64 * 1024);
 	tcp_hashinfo.bhash_size = 1U << tcp_hashinfo.bhash_size;
+	tcp_hashinfo.bhash2 =
+		(struct inet_bind2_hashbucket *)(tcp_hashinfo.bhash + tcp_hashinfo.bhash_size);
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
+		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}
 
 
-- 
2.30.2


^ permalink raw reply related

* Re: [PATCH v2] cgroup: Kill the parent controller when its last child is killed
From: Tejun Heo @ 2022-04-21 22:37 UTC (permalink / raw)
  To: Bui Quang Minh
  Cc: Michal Koutný, cgroups, kernel test robot, Zefan Li,
	Johannes Weiner, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, linux-kernel, netdev, bpf
In-Reply-To: <bdd4104d-390e-74c7-0de1-a275044831a5@gmail.com>

On Tue, Apr 05, 2022 at 09:58:01PM +0700, Bui Quang Minh wrote:
> @@ -5152,12 +5153,28 @@ static void css_release_work_fn(struct work_struct
> *work)
>                 container_of(work, struct cgroup_subsys_state,
> destroy_work);
>         struct cgroup_subsys *ss = css->ss;
>         struct cgroup *cgrp = css->cgroup;
> +       struct cgroup *parent = cgroup_parent(cgrp);
> 
>         mutex_lock(&cgroup_mutex);
> 
>         css->flags |= CSS_RELEASED;
>         list_del_rcu(&css->sibling);
> 
> +       /*
> +        * If parent doesn't have any children, start killing it.
> +        * And don't kill the default root.
> +        */
> +       if (parent && list_empty(&parent->self.children) &&
> +           parent->flags & CGRP_UMOUNT &&
> +           parent != &cgrp_dfl_root.cgrp &&
> +           !percpu_ref_is_dying(&parent->self.refcnt)) {
> +#ifdef CONFIG_CGROUP_BPF
> +               if (!percpu_ref_is_dying(&cgrp->bpf.refcnt))
> +                       cgroup_bpf_offline(parent);
> +#endif
> +               percpu_ref_kill(&parent->self.refcnt);
> +       }
> +
>         if (ss) {
>                 /* css release path */
>                 if (!list_empty(&css->rstat_css_node)) {
> 
> The idea is to set a flag in the umount path, in the rmdir it will destroy
> the css in case its direct parent is umounted, no recursive here. This is
> just an incomplete example, we may need to reset that flag when remounting.

I'm generally against adding complexities for this given that it's never
gonna be actually reliable. If adding one liner flush_workqueue makes life
easier in some cases, why not? But the root cause is something which can't
be solved from messing with release / umount paths and something we decided
against supporting.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH net-next 5/5] net: dsa: b53: mark as non-legacy
From: Florian Fainelli @ 2022-04-21 22:31 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Andrew Lunn, David S. Miller, Jakub Kicinski, netdev,
	Vivien Didelot, Vladimir Oltean
In-Reply-To: <E1nMSDS-00A2Ru-6J@rmk-PC.armlinux.org.uk>

Hi Russell,

On 2/22/22 02:16, Russell King (Oracle) wrote:
> The B53 driver does not make use of the speed, duplex, pause or
> advertisement in its phylink_mac_config() implementation, so it can be
> marked as a non-legacy driver.
> 
> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
> ---
>   drivers/net/dsa/b53/b53_common.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
> index 50a372dc32ae..83bf30349c26 100644
> --- a/drivers/net/dsa/b53/b53_common.c
> +++ b/drivers/net/dsa/b53/b53_common.c
> @@ -1346,6 +1346,12 @@ static void b53_phylink_get_caps(struct dsa_switch *ds, int port,
>   	/* Get the implementation specific capabilities */
>   	if (dev->ops->phylink_get_caps)
>   		dev->ops->phylink_get_caps(dev, port, config);
> +
> +	/* This driver does not make use of the speed, duplex, pause or the
> +	 * advertisement in its mac_config, so it is safe to mark this driver
> +	 * as non-legacy.
> +	 */
> +	config->legacy_pre_march2020 = false;

This patch appears to cause a regression for me, I am not sure why I did 
not notice it back when I tested it but I suspect it had to do with me 
testing only with a copper module and not with a fiber module.

Now that I tested it again, the SFP port (port 5 in my set-up) link up 
interrupt does not fire up when setting config->legacy_pre_march2020 to 
false.

Here is a working log with phylink debugging enabled:

# udhcpc -i sfp
udhcpc: started, v1.35.0
[   49.479637] bgmac-enet 18024000.ethernet eth2: Link is Up - 
1Gbps/Full - flow control off
[   49.488139] Generic PHY fixed-0:02: PHY state change UP -> RUNNING
[   49.488256] b53-srab-switch 18036000.ethernet-switch sfp: configuring 
for inband/1000base-x link mode
[   49.504062] b53-srab-switch 18036000.ethernet-switch sfp: major 
config 1000base-x
[   49.511800] b53-srab-switch 18036000.ethernet-switch sfp: 
phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown 
adv=0000000,00000201
[   49.527504] b53-srab-switch 18036000.ethernet-switch sfp: mac link down
[   49.535044] sfp sfp: SM: enter present:down:down event dev_up
[   49.541006] sfp sfp: tx disable 1 -> 0
[   49.544897] sfp sfp: SM: exit present:up:wait
[   49.549509] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
udhcpc: broadcasting discover
[   49.595185] sfp sfp: SM: enter present:up:wait event timeout
[   49.601064] sfp sfp: SM: exit present:up:link_up
[   52.388917] b53-srab-switch 18036000.ethernet-switch sfp: mac link up
[   52.396513] b53-srab-switch 18036000.ethernet-switch sfp: Link is Up 
- 1Gbps/Full - flow control rx/tx
[   52.406145] IPv6: ADDRCONF(NETDEV_CHANGE): sfp: link becomes ready
udhcpc: broadcasting discover
udhcpc: broadcasting select for 192.168.3.156, server 192.168.3.1
udhcpc: lease of 192.168.3.156 obtained from 192.168.3.1, lease time 600
deleting routers
adding dns 192.168.1.1

and one that is not working with phylink debugging enabled:

# udhcpc -i sfp
udhcpc: started, v1.35.0
[   27.863529] bgmac-enet 18024000.ethernet eth2: Link is Up - 
1Gbps/Full - flow control off
[   27.872021] Generic PHY fixed-0:02: PHY state change UP -> RUNNING
[   27.872120] b53-srab-switch 18036000.ethernet-switch sfp: configuring 
for inband/1000base-x link mode
[   27.887952] b53-srab-switch 18036000.ethernet-switch sfp: major 
config 1000base-x
[   27.895689] b53-srab-switch 18036000.ethernet-switch sfp: 
phylink_mac_config: mode=inband/1000base-x/Unknown/Unknown 
adv=0000000,00000201
[   27.895802] b53-srab-switch 18036000.ethernet-switch sfp: mac link down
[   27.911945] sfp sfp: SM: enter present:down:down event dev_up
[   27.923947] sfp sfp: tx disable 1 -> 0
[   27.927835] sfp sfp: SM: exit present:up:wait
[   27.932442] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
udhcpc: broadcasting discover
[   27.978181] sfp sfp: SM: enter present:up:wait event timeout
[   27.984056] sfp sfp: SM: exit present:up:link_up
[   30.686440] b53-srab-switch 18036000.ethernet-switch sfp: mac link up
udhcpc: broadcasting discover
udhcpc: broadcasting discover

The mac side appears to be UP but not no carrier is set to the sfp 
network device. Do you have any idea why that would happen?
-- 
Florian

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] ixgbe: xsk: get rid of redundant 'fallthrough'
From: Stephen Rothwell @ 2022-04-21 22:23 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, netdev, magnus.karlsson, linux-next
In-Reply-To: <20220421132126.471515-2-maciej.fijalkowski@intel.com>

[-- Attachment #1: Type: text/plain, Size: 1286 bytes --]

Hi Maciej,

On Thu, 21 Apr 2022 15:21:25 +0200 Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:
>
> Intel drivers translate actions returned from XDP programs to their own
> return codes that have the following mapping:
> 
> XDP_REDIRECT -> IXGBE_XDP_{REDIR,CONSUMED}
> XDP_TX -> IXGBE_XDP_{TX,CONSUMED}
> XDP_DROP -> IXGBE_XDP_CONSUMED
> XDP_ABORTED -> IXGBE_XDP_CONSUMED
> XDP_PASS -> IXGBE_XDP_PASS
> 
> Commit c7dd09fd4628 ("ixgbe, xsk: Terminate Rx side of NAPI when XSK Rx
> queue gets full") introduced new translation
> 
> XDP_REDIRECT -> IXGBE_XDP_EXIT
> 
> which is set when XSK RQ gets full and to indicate that driver should
> stop further Rx processing. This happens for unsuccessful
> xdp_do_redirect() so it is valuable to call trace_xdp_exception() for
> this case. In order to avoid IXGBE_XDP_EXIT -> IXGBE_XDP_CONSUMED
> overwrite, XDP_DROP case was moved above which in turn made the
> 'fallthrough' that is in XDP_ABORTED useless as it became the last label
> in the switch statement.
> 
> Simply drop this leftover.
> 
> Fixes: c7dd09fd4628 ("ixgbe, xsk: Terminate Rx side of NAPI when XSK Rx queue gets full")
> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

Reported-by ?
-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net 3/3] ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
From: Peilin Ye @ 2022-04-21 22:09 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, xeb@mail.ru, William Tu, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel, Peilin Ye
In-Reply-To: <cover.1650575919.git.peilin.ye@bytedance.com>

From: Peilin Ye <peilin.ye@bytedance.com>

As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
following sequence of events:

1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
   add ... external".  "ip" ignores "[o]seq" if "external" is specified,
   so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
   it uses lockless TX);
2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
   bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
3. gre_fb_xmit() or __gre6_xmit() processes these skb's:

	gre_build_header(skb, tun_hlen,
			 flags, protocol,
			 tunnel_id_to_key32(tun_info->key.tun_id),
			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
					      : 0);   ^^^^^^^^^^^^^^^^^

Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
making o_seqno atomic_t.

As mentioned by Eric Dumazet in commit b790e01aee74 ("ip_gre: lockless
xmit"), making o_seqno atomic_t increases "chance for packets being out
of order at receiver" when NETIF_F_LLTX is on.

Maybe a better fix would be:

1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
   they want the kernel to allow sequencing of outgoing packets;
2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
   with "oseq".

Unfortunately, that would break userspace.

We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
do it in separate patches to keep this fix minimal.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Fixes: 77a5196a804e ("gre: add sequence number for collect md mode.")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 include/net/ip6_tunnel.h | 2 +-
 include/net/ip_tunnels.h | 2 +-
 net/ipv4/ip_gre.c        | 6 +++---
 net/ipv6/ip6_gre.c       | 7 ++++---
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index a38c4f1e4e5c..74b369bddf49 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -58,7 +58,7 @@ struct ip6_tnl {
 
 	/* These fields used only by GRE */
 	__u32 i_seqno;	/* The last seen seqno	*/
-	__u32 o_seqno;	/* The last output seqno */
+	atomic_t o_seqno;	/* The last output seqno */
 	int hlen;       /* tun_hlen + encap_hlen */
 	int tun_hlen;	/* Precalculated header length */
 	int encap_hlen; /* Encap header length (FOU,GUE) */
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 0219fe907b26..3ec6146f8734 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -116,7 +116,7 @@ struct ip_tunnel {
 
 	/* These four fields used only by GRE */
 	u32		i_seqno;	/* The last seen seqno	*/
-	u32		o_seqno;	/* The last output seqno */
+	atomic_t	o_seqno;	/* The last output seqno */
 	int		tun_hlen;	/* Precalculated header length */
 
 	/* These four fields used only by ERSPAN */
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index ca70b92e80d9..8cf86e42c1d1 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -464,7 +464,7 @@ static void __gre_xmit(struct sk_buff *skb, struct net_device *dev,
 	/* Push GRE header. */
 	gre_build_header(skb, tunnel->tun_hlen,
 			 flags, proto, tunnel->parms.o_key,
-			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
+			 (flags & TUNNEL_SEQ) ? htonl(atomic_fetch_inc(&tunnel->o_seqno)) : 0);
 
 	ip_tunnel_xmit(skb, dev, tnl_params, tnl_params->protocol);
 }
@@ -502,7 +502,7 @@ static void gre_fb_xmit(struct sk_buff *skb, struct net_device *dev,
 		(TUNNEL_CSUM | TUNNEL_KEY | TUNNEL_SEQ);
 	gre_build_header(skb, tunnel_hlen, flags, proto,
 			 tunnel_id_to_key32(tun_info->key.tun_id),
-			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
+			 (flags & TUNNEL_SEQ) ? htonl(atomic_fetch_inc(&tunnel->o_seqno)) : 0);
 
 	ip_md_tunnel_xmit(skb, dev, IPPROTO_GRE, tunnel_hlen);
 
@@ -579,7 +579,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	gre_build_header(skb, 8, TUNNEL_SEQ,
-			 proto, 0, htonl(tunnel->o_seqno++));
+			 proto, 0, htonl(atomic_fetch_inc(&tunnel->o_seqno)));
 
 	ip_md_tunnel_xmit(skb, dev, IPPROTO_GRE, tunnel_hlen);
 
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index d9e4ac94eab4..5136959b3dc5 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -766,7 +766,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 		gre_build_header(skb, tun_hlen,
 				 flags, protocol,
 				 tunnel_id_to_key32(tun_info->key.tun_id),
-				 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
+				 (flags & TUNNEL_SEQ) ? htonl(atomic_fetch_inc(&tunnel->o_seqno))
 						      : 0);
 
 	} else {
@@ -777,7 +777,8 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 
 		gre_build_header(skb, tunnel->tun_hlen, flags,
 				 protocol, tunnel->parms.o_key,
-				 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
+				 (flags & TUNNEL_SEQ) ? htonl(atomic_fetch_inc(&tunnel->o_seqno))
+						      : 0);
 	}
 
 	return ip6_tnl_xmit(skb, dev, dsfield, fl6, encap_limit, pmtu,
@@ -1055,7 +1056,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 	/* Push GRE header. */
 	proto = (t->parms.erspan_ver == 1) ? htons(ETH_P_ERSPAN)
 					   : htons(ETH_P_ERSPAN2);
-	gre_build_header(skb, 8, TUNNEL_SEQ, proto, 0, htonl(t->o_seqno++));
+	gre_build_header(skb, 8, TUNNEL_SEQ, proto, 0, htonl(atomic_fetch_inc(&t->o_seqno)));
 
 	/* TooBig packet may have updated dst->dev's mtu */
 	if (!t->parms.collect_md && dst && dst_mtu(dst) > dst->dev->mtu)
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 2/3] ip6_gre: Make o_seqno start from 0 in native mode
From: Peilin Ye @ 2022-04-21 22:08 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, xeb@mail.ru, William Tu, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel, Peilin Ye
In-Reply-To: <cover.1650575919.git.peilin.ye@bytedance.com>

From: Peilin Ye <peilin.ye@bytedance.com>

For IP6GRE and IP6GRETAP devices, currently o_seqno starts from 1 in
native mode.  According to RFC 2890 2.2., "The first datagram is sent
with a sequence number of 0."  Fix it.

It is worth mentioning that o_seqno already starts from 0 in collect_md
mode, see the "if (tunnel->parms.collect_md)" clause in __gre6_xmit(),
where tunnel->o_seqno is passed to gre_build_header() before getting
incremented.

Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/ipv6/ip6_gre.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 976236736146..d9e4ac94eab4 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -724,6 +724,7 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 {
 	struct ip6_tnl *tunnel = netdev_priv(dev);
 	__be16 protocol;
+	__be16 flags;
 
 	if (dev->type == ARPHRD_ETHER)
 		IPCB(skb)->flags = 0;
@@ -739,7 +740,6 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 	if (tunnel->parms.collect_md) {
 		struct ip_tunnel_info *tun_info;
 		const struct ip_tunnel_key *key;
-		__be16 flags;
 		int tun_hlen;
 
 		tun_info = skb_tunnel_info_txcheck(skb);
@@ -770,15 +770,14 @@ static netdev_tx_t __gre6_xmit(struct sk_buff *skb,
 						      : 0);
 
 	} else {
-		if (tunnel->parms.o_flags & TUNNEL_SEQ)
-			tunnel->o_seqno++;
-
 		if (skb_cow_head(skb, dev->needed_headroom ?: tunnel->hlen))
 			return -ENOMEM;
 
-		gre_build_header(skb, tunnel->tun_hlen, tunnel->parms.o_flags,
+		flags = tunnel->parms.o_flags;
+
+		gre_build_header(skb, tunnel->tun_hlen, flags,
 				 protocol, tunnel->parms.o_key,
-				 htonl(tunnel->o_seqno));
+				 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 	}
 
 	return ip6_tnl_xmit(skb, dev, dsfield, fl6, encap_limit, pmtu,
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 1/3] ip_gre: Make o_seqno start from 0 in native mode
From: Peilin Ye @ 2022-04-21 22:07 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, xeb@mail.ru, William Tu, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel, Peilin Ye
In-Reply-To: <cover.1650575919.git.peilin.ye@bytedance.com>

From: Peilin Ye <peilin.ye@bytedance.com>

For GRE and GRETAP devices, currently o_seqno starts from 1 in native
mode.  According to RFC 2890 2.2., "The first datagram is sent with a
sequence number of 0."  Fix it.

It is worth mentioning that o_seqno already starts from 0 in collect_md
mode, see gre_fb_xmit(), where tunnel->o_seqno is passed to
gre_build_header() before getting incremented.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/ipv4/ip_gre.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 99db2e41ed10..ca70b92e80d9 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -459,14 +459,12 @@ static void __gre_xmit(struct sk_buff *skb, struct net_device *dev,
 		       __be16 proto)
 {
 	struct ip_tunnel *tunnel = netdev_priv(dev);
-
-	if (tunnel->parms.o_flags & TUNNEL_SEQ)
-		tunnel->o_seqno++;
+	__be16 flags = tunnel->parms.o_flags;
 
 	/* Push GRE header. */
 	gre_build_header(skb, tunnel->tun_hlen,
-			 tunnel->parms.o_flags, proto, tunnel->parms.o_key,
-			 htonl(tunnel->o_seqno));
+			 flags, proto, tunnel->parms.o_key,
+			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++) : 0);
 
 	ip_tunnel_xmit(skb, dev, tnl_params, tnl_params->protocol);
 }
-- 
2.20.1


^ permalink raw reply related

* [PATCH net 0/3] ip_gre, ip6_gre: o_seqno fixes
From: Peilin Ye @ 2022-04-21 22:06 UTC (permalink / raw)
  To: David S. Miller, Jakub Kicinski, Hideaki YOSHIFUJI, David Ahern,
	Paolo Abeni
  Cc: Peilin Ye, xeb@mail.ru, William Tu, Daniel Borkmann, Cong Wang,
	Eric Dumazet, Alexei Starovoitov, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Hi all,

As pointed out [1] by Jakub Kicinski, currently using TUNNEL_SEQ in
collect_md mode is racy for [IP6]GRE[TAP] devices, since they (typically,
e.g. if created using "ip") use lockless TX.

Patch [3/3] fixes it by making o_seqno atomic_t.

As mentioned by Eric Dumazet in commit b790e01aee74 ("ip_gre: lockless
xmit"), making o_seqno atomic_t increases "chance for packets being out
of order at receiver" when using lockless TX.

Another way to fix it would be: users must specify "external" and "oseq"
at the same time if they want the kernel to allow using TUNNEL_SEQ (e.g.
via eBPF) in collect_md mode, but that would break userspace.

I found another issue while reading the code: patches [1,2/3] make o_seqno
start from 0 in native mode, as described in RFC 2890 [2] section 2.2.:
"The first datagram is sent with a sequence number of 0."

Now we could make [IP6]GRE[TAP] (and probably [IP6]ERSPAN ?) devices
completely NETIF_F_LLTX, but that's out of scope of this fix and will be
sent as separate [net-next] patches.

[1] https://lore.kernel.org/netdev/20220415191133.0597a79a@kernel.org/
[2] https://datatracker.ietf.org/doc/html/rfc2890#section-2.2

Thanks,
Peilin Ye (3):
  ip_gre: Make o_seqno start from 0 in native mode
  ip6_gre: Make o_seqno start from 0 in native mode
  ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode

 include/net/ip6_tunnel.h |  2 +-
 include/net/ip_tunnels.h |  2 +-
 net/ipv4/ip_gre.c        | 12 +++++-------
 net/ipv6/ip6_gre.c       | 16 ++++++++--------
 4 files changed, 15 insertions(+), 17 deletions(-)

-- 
2.20.1

^ permalink raw reply

* Re: [GIT PULL] Networking for 5.18-rc4
From: pr-tracker-bot @ 2022-04-21 21:37 UTC (permalink / raw)
  To: Paolo Abeni; +Cc: torvalds, kuba, davem, netdev, linux-kernel
In-Reply-To: <20220421105218.18005-1-pabeni@redhat.com>

The pull request you sent on Thu, 21 Apr 2022 12:52:18 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-5.18-rc4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/59f0c2447e2553b0918b4a9fd38763a5c0587d02

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* Re: [PATCH] can: ctucanfd: Remove unused including <linux/version.h>
From: Pavel Pisa @ 2022-04-21 21:17 UTC (permalink / raw)
  To: Jiapeng Chong
  Cc: ondrej.ille, wg, mkl, davem, kuba, pabeni, linux-can, netdev,
	linux-kernel, Abaci Robot
In-Reply-To: <20220421202852.2693-1-jiapeng.chong@linux.alibaba.com>

Thanks for checking

On Thursday 21 of April 2022 22:28:52 Jiapeng Chong wrote:
> Eliminate the follow versioncheck warning:
>
> ./drivers/net/can/ctucanfd/ctucanfd_base.c: 34 linux/version.h not
> needed.
>
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

Acked-by: Pave Pisa <pisa@cmp.felk.cvut.cz>

> ---
>  drivers/net/can/ctucanfd/ctucanfd_base.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/can/ctucanfd/ctucanfd_base.c
> b/drivers/net/can/ctucanfd/ctucanfd_base.c index 7a4550f60abb..be90136be442
> 100644
> --- a/drivers/net/can/ctucanfd/ctucanfd_base.c
> +++ b/drivers/net/can/ctucanfd/ctucanfd_base.c
> @@ -31,7 +31,6 @@
>  #include <linux/can/error.h>
>  #include <linux/can/led.h>
>  #include <linux/pm_runtime.h>
> -#include <linux/version.h>
>
>  #include "ctucanfd.h"
>  #include "ctucanfd_kregs.h"


-- 
Yours sincerely

                Pavel Pisa
    phone:      +420 603531357
    e-mail:     pisa@cmp.felk.cvut.cz
    Department of Control Engineering FEE CVUT
    Karlovo namesti 13, 121 35, Prague 2
    university: http://control.fel.cvut.cz/
    personal:   http://cmp.felk.cvut.cz/~pisa
    projects:   https://www.openhub.net/accounts/ppisa
    CAN related:http://canbus.pages.fel.cvut.cz/
    Open Technologies Research Education and Exchange Services
    https://gitlab.fel.cvut.cz/otrees/org/-/wikis/home


^ permalink raw reply

* Re: [PATCH] can: ctucanfd: Remove unnecessary print function dev_err()
From: Pavel Pisa @ 2022-04-21 21:16 UTC (permalink / raw)
  To: Jiapeng Chong
  Cc: ondrej.ille, wg, mkl, davem, kuba, pabeni, linux-can, netdev,
	linux-kernel, Abaci Robot
In-Reply-To: <20220421203242.7335-1-jiapeng.chong@linux.alibaba.com>

Thanks for checking

On Thursday 21 of April 2022 22:32:42 Jiapeng Chong wrote:
> The print function dev_err() is redundant because platform_get_irq()
> already prints an error.
>
> Eliminate the follow coccicheck warnings:
>
> ./drivers/net/can/ctucanfd/ctucanfd_platform.c:67:2-9: line 67 is
> redundant because platform_get_irq() already prints an error.
>
> Reported-by: Abaci Robot <abaci@linux.alibaba.com>
> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>

Acked-by: Pave Pisa <pisa@cmp.felk.cvut.cz>

>  drivers/net/can/ctucanfd/ctucanfd_platform.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/can/ctucanfd/ctucanfd_platform.c
> b/drivers/net/can/ctucanfd/ctucanfd_platform.c index
> 5e4806068662..89d54c2151e1 100644
> --- a/drivers/net/can/ctucanfd/ctucanfd_platform.c
> +++ b/drivers/net/can/ctucanfd/ctucanfd_platform.c
> @@ -64,7 +64,6 @@ static int ctucan_platform_probe(struct platform_device
> *pdev) }
>  	irq = platform_get_irq(pdev, 0);
>  	if (irq < 0) {
> -		dev_err(dev, "Cannot find interrupt.\n");
>  		ret = irq;
>  		goto err;
>  	}


^ permalink raw reply

* Re: linux 5.17.1 disregarding ACK values resulting in stalled TCP connections
From: Eric Dumazet @ 2022-04-21 21:14 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Florian Westphal, Neal Cardwell, Jaco Kroon, netfilter-devel,
	netdev
In-Reply-To: <9c6d2d7-70b-bd12-ee14-7923664afb1@netfilter.org>

On Thu, Apr 7, 2022 at 5:48 AM Jozsef Kadlecsik <kadlec@netfilter.org> wrote:
>
> On Thu, 7 Apr 2022, Florian Westphal wrote:
>
> > Jozsef Kadlecsik <kadlec@netfilter.org> wrote:
> > > I'd merge the two conditions so that it'd cover both original condition
> > > branches:
> > >
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> > > index 8ec55cd72572..87375ce2f995 100644
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -556,33 +556,26 @@ static bool tcp_in_window(struct nf_conn *ct,
> > >                     }
> > >
> > >             }
> > > -   } else if (((state->state == TCP_CONNTRACK_SYN_SENT
> > > -                && dir == IP_CT_DIR_ORIGINAL)
> > > -              || (state->state == TCP_CONNTRACK_SYN_RECV
> > > -                && dir == IP_CT_DIR_REPLY))
> > > -              && after(end, sender->td_end)) {
> > > +   } else if (tcph->syn &&
> > > +              ((after(end, sender->td_end) &&
> > > +                (state->state == TCP_CONNTRACK_SYN_SENT ||
> > > +                 state->state == TCP_CONNTRACK_SYN_RECV)) ||
> > > +               (dir == IP_CT_DIR_REPLY &&
> > > +                state->state == TCP_CONNTRACK_SYN_SENT))) {
> >
> > Thats what I did as well, I merged the two branches but I made the
> > 2nd clause stricter to also consider the after() test; it would no
> > longer re-init for syn-acks when sequence did not advance.
>
> That's perfectly fine.
>
> But what about simultaneous syn? The TCP state is zeroed in the REPLY
> direction, so the after() test can easily be false and the state wouldn't
> be picked up. Therefore I extended the condition.
>

Hi Jozsef and Florian

Any updates for this issue ?

Thanks !

^ permalink raw reply

* "mm: uninline copy_overflow()" breaks i386 build in Mellanox MLX4
From: Mateusz Jończyk @ 2022-04-21 20:47 UTC (permalink / raw)
  To: netdev, linux-kernel@vger.kernel.org
  Cc: David Laight, Andrew Morton, Christophe Leroy, Anshuman Khandual,
	linux-rdma

Hello,

commit ad7489d5262d ("mm: uninline copy_overflow()")

breaks for me a build for i386 in the Mellanox MLX4 driver:

        In file included from ./arch/x86/include/asm/preempt.h:7,
                         from ./include/linux/preempt.h:78,
                         from ./include/linux/percpu.h:6,
                         from ./include/linux/context_tracking_state.h:5,
                         from ./include/linux/hardirq.h:5,
                         from drivers/net/ethernet/mellanox/mlx4/cq.c:37:
        In function ‘check_copy_size’,
            inlined from ‘copy_to_user’ at ./include/linux/uaccess.h:159:6,
            inlined from ‘mlx4_init_user_cqes’ at drivers/net/ethernet/mellanox/mlx4/cq.c:317:9,
            inlined from ‘mlx4_cq_alloc’ at drivers/net/ethernet/mellanox/mlx4/cq.c:394:10:
        ./include/linux/thread_info.h:228:4: error: call to ‘__bad_copy_from’ declared with attribute error: copy source size is too small
          228 |    __bad_copy_from();
              |    ^~~~~~~~~~~~~~~~~
        make[5]: *** [scripts/Makefile.build:288: drivers/net/ethernet/mellanox/mlx4/cq.o] Błąd 1
        make[4]: *** [scripts/Makefile.build:550: drivers/net/ethernet/mellanox/mlx4] Błąd 2
        make[3]: *** [scripts/Makefile.build:550: drivers/net/ethernet/mellanox] Błąd 2
        make[2]: *** [scripts/Makefile.build:550: drivers/net/ethernet] Błąd 2
        make[1]: *** [scripts/Makefile.build:550: drivers/net] Błąd 2

Reverting this commit fixes the build. Disabling Mellanox Ethernet drivers
in Kconfig (tested only with also disabling of all Infiniband support) also fixes the build.

It appears that uninlining of copy_overflow() causes GCC to analyze the code deeper.

The code in mlx4_init_user_cqes, for reference:

        static int mlx4_init_user_cqes(void *buf, int entries, int cqe_size)
        {
                int entries_per_copy = PAGE_SIZE / cqe_size;
                void *init_ents;
                int err = 0;
                int i;

                init_ents = kmalloc(PAGE_SIZE, GFP_KERNEL);
                // ...
                if (entries_per_copy < entries) {
                        // ...
                } else {
                        // BUG here
                        err = copy_to_user((void __user *)buf, init_ents,
                                        array_size(entries, cqe_size)) ?
                                -EFAULT : 0;
                }

                // ...
        }

My setup: Ubuntu 20.04, gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

I was using lightly modified Kconfig from Debian i386 Linux packages.

Greetings,

Mateusz Jończyk


^ permalink raw reply

* [PATCH] can: ctucanfd: Remove unnecessary print function dev_err()
From: Jiapeng Chong @ 2022-04-21 20:32 UTC (permalink / raw)
  To: pisa
  Cc: ondrej.ille, wg, mkl, davem, kuba, pabeni, linux-can, netdev,
	linux-kernel, Jiapeng Chong, Abaci Robot

The print function dev_err() is redundant because platform_get_irq()
already prints an error.

Eliminate the follow coccicheck warnings:

./drivers/net/can/ctucanfd/ctucanfd_platform.c:67:2-9: line 67 is
redundant because platform_get_irq() already prints an error.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
---
 drivers/net/can/ctucanfd/ctucanfd_platform.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/can/ctucanfd/ctucanfd_platform.c b/drivers/net/can/ctucanfd/ctucanfd_platform.c
index 5e4806068662..89d54c2151e1 100644
--- a/drivers/net/can/ctucanfd/ctucanfd_platform.c
+++ b/drivers/net/can/ctucanfd/ctucanfd_platform.c
@@ -64,7 +64,6 @@ static int ctucan_platform_probe(struct platform_device *pdev)
 	}
 	irq = platform_get_irq(pdev, 0);
 	if (irq < 0) {
-		dev_err(dev, "Cannot find interrupt.\n");
 		ret = irq;
 		goto err;
 	}
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* [PATCH] can: ctucanfd: Remove unused including <linux/version.h>
From: Jiapeng Chong @ 2022-04-21 20:28 UTC (permalink / raw)
  To: pisa
  Cc: ondrej.ille, wg, mkl, davem, kuba, pabeni, linux-can, netdev,
	linux-kernel, Jiapeng Chong, Abaci Robot

Eliminate the follow versioncheck warning:

./drivers/net/can/ctucanfd/ctucanfd_base.c: 34 linux/version.h not
needed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
---
 drivers/net/can/ctucanfd/ctucanfd_base.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/can/ctucanfd/ctucanfd_base.c b/drivers/net/can/ctucanfd/ctucanfd_base.c
index 7a4550f60abb..be90136be442 100644
--- a/drivers/net/can/ctucanfd/ctucanfd_base.c
+++ b/drivers/net/can/ctucanfd/ctucanfd_base.c
@@ -31,7 +31,6 @@
 #include <linux/can/error.h>
 #include <linux/can/led.h>
 #include <linux/pm_runtime.h>
-#include <linux/version.h>
 
 #include "ctucanfd.h"
 #include "ctucanfd_kregs.h"
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* Re: [PATCH v2 2/2] igc: Trigger proper interrupts in igc_xsk_wakeup
From: Tony Nguyen @ 2022-04-21 20:23 UTC (permalink / raw)
  To: jeff.evanson, Jesse Brandeburg, David S. Miller, Jakub Kicinski,
	Vedang Patel, Andre Guedes, Maciej Fijalkowski, Jithu Joseph,
	moderated list:INTEL ETHERNET DRIVERS,
	open list:NETWORKING DRIVERS, open list
  Cc: Jeff Evanson
In-Reply-To: <20220420012635.13733-3-jeff.evanson@qsc.com>


On 4/19/2022 6:26 PM, jeff.evanson@gmail.com wrote:
> From: Jeff Evanson <jeff.evanson@qsc.com>
>
> In igc_xsk_wakeup, trigger the proper interrupt based on whether flags
> contains XDP_WAKEUP_RX and/or XDP_WAKEUP_TX.
>
> Consider a scenario where the transmit queue interrupt is mapped to a
> different irq from the receive queue. If XDP_WAKEUP_TX is set in the
> flags argument, the interrupt for transmit queue must be triggered,
> otherwise the transmit queue's napi_struct will never be scheduled.
>
> In the case where both XDP_WAKEUP_TX and XDP_WAKEUP_RX are both set,
> the receive interrupt should always be triggered, but the transmit
> interrupt should only be triggered if its q_vector differs from the
> receive queue's interrupt.
>
> Fixes: fc9df2a0b520 ("igc: Enable RX via AF_XDP zero-copy")

You're missing your sign-off; same for patch 1. Also, there's a handful 
of issues being reported by checkpatch [1]

Thanks,

Tony

[1] 
https://patchwork.kernel.org/project/netdevbpf/patch/20220420012635.13733-3-jeff.evanson@qsc.com/


^ permalink raw reply

* Re: [PATCH bpf-next v2 2/7] bpf: per-cgroup lsm flavor
From: Stanislav Fomichev @ 2022-04-21 20:22 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: netdev, bpf, ast, daniel, andrii
In-Reply-To: <20220421194650.q5ada2zdyzvayu4a@apollo.legion>

On Thu, Apr 21, 2022 at 12:46 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, Apr 06, 2022 at 03:13:37AM IST, Stanislav Fomichev wrote:
> > Allow attaching to lsm hooks in the cgroup context.
> >
> > Attaching to per-cgroup LSM works exactly like attaching
> > to other per-cgroup hooks. New BPF_LSM_CGROUP is added
> > to trigger new mode; the actual lsm hook we attach to is
> > signaled via existing attach_btf_id.
> >
> > For the hooks that have 'struct socket' as its first argument,
> > we use the cgroup associated with that socket. For the rest,
> > we use 'current' cgroup (this is all on default hierarchy == v2 only).
> > Note that for the hooks that work on 'struct sock' we still
> > take the cgroup from 'current' because most of the time,
> > the 'sock' argument is not properly initialized.
> >
> > Behind the scenes, we allocate a shim program that is attached
> > to the trampoline and runs cgroup effective BPF programs array.
> > This shim has some rudimentary ref counting and can be shared
> > between several programs attaching to the same per-cgroup lsm hook.
> >
> > Note that this patch bloats cgroup size because we add 211
> > cgroup_bpf_attach_type(s) for simplicity sake. This will be
> > addressed in the subsequent patch.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/bpf-cgroup-defs.h |   6 ++
> >  include/linux/bpf.h             |  13 +++
> >  include/linux/bpf_lsm.h         |  14 ++++
> >  include/uapi/linux/bpf.h        |   1 +
> >  kernel/bpf/bpf_lsm.c            |  92 ++++++++++++++++++++
> >  kernel/bpf/btf.c                |  11 +++
> >  kernel/bpf/cgroup.c             | 116 ++++++++++++++++++++++---
> >  kernel/bpf/syscall.c            |  10 +++
> >  kernel/bpf/trampoline.c         | 144 ++++++++++++++++++++++++++++++++
> >  kernel/bpf/verifier.c           |   1 +
> >  tools/include/uapi/linux/bpf.h  |   1 +
> >  11 files changed, 397 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
> > index 695d1224a71b..6c661b4df9fa 100644
> > --- a/include/linux/bpf-cgroup-defs.h
> > +++ b/include/linux/bpf-cgroup-defs.h
> > @@ -10,6 +10,8 @@
> >
> >  struct bpf_prog_array;
> >
> > +#define CGROUP_LSM_NUM 211 /* will be addressed in the next patch */
> > +
> >  enum cgroup_bpf_attach_type {
> >       CGROUP_BPF_ATTACH_TYPE_INVALID = -1,
> >       CGROUP_INET_INGRESS = 0,
> > @@ -35,6 +37,10 @@ enum cgroup_bpf_attach_type {
> >       CGROUP_INET4_GETSOCKNAME,
> >       CGROUP_INET6_GETSOCKNAME,
> >       CGROUP_INET_SOCK_RELEASE,
> > +#ifdef CONFIG_BPF_LSM
> > +     CGROUP_LSM_START,
> > +     CGROUP_LSM_END = CGROUP_LSM_START + CGROUP_LSM_NUM - 1,
> > +#endif
> >       MAX_CGROUP_BPF_ATTACH_TYPE
> >  };
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 487aba40ce52..17bbe2f7b2be 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -807,6 +807,9 @@ static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
> >  #ifdef CONFIG_BPF_JIT
> >  int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
> >  int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
> > +int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> > +                                 struct bpf_attach_target_info *tgt_info);
> > +void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog);
> >  struct bpf_trampoline *bpf_trampoline_get(u64 key,
> >                                         struct bpf_attach_target_info *tgt_info);
> >  void bpf_trampoline_put(struct bpf_trampoline *tr);
> > @@ -865,6 +868,14 @@ static inline int bpf_trampoline_unlink_prog(struct bpf_prog *prog,
> >  {
> >       return -ENOTSUPP;
> >  }
> > +static inline int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> > +                                               struct bpf_attach_target_info *tgt_info)
> > +{
> > +     return -EOPNOTSUPP;
> > +}
> > +static inline void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog)
> > +{
> > +}
> >  static inline struct bpf_trampoline *bpf_trampoline_get(u64 key,
> >                                                       struct bpf_attach_target_info *tgt_info)
> >  {
> > @@ -980,6 +991,7 @@ struct bpf_prog_aux {
> >       u64 load_time; /* ns since boottime */
> >       u32 verified_insns;
> >       struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
> > +     int cgroup_atype; /* enum cgroup_bpf_attach_type */
> >       char name[BPF_OBJ_NAME_LEN];
> >  #ifdef CONFIG_SECURITY
> >       void *security;
> > @@ -2383,6 +2395,7 @@ void *bpf_arch_text_copy(void *dst, void *src, size_t len);
> >
> >  struct btf_id_set;
> >  bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
> > +int btf_id_set_index(const struct btf_id_set *set, u32 id);
> >
> >  #define MAX_BPRINTF_VARARGS          12
> >
> > diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
> > index 479c101546ad..7f0e59f5f9be 100644
> > --- a/include/linux/bpf_lsm.h
> > +++ b/include/linux/bpf_lsm.h
> > @@ -42,6 +42,9 @@ extern const struct bpf_func_proto bpf_inode_storage_get_proto;
> >  extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
> >  void bpf_inode_storage_free(struct inode *inode);
> >
> > +int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func);
> > +int bpf_lsm_hook_idx(u32 btf_id);
> > +
> >  #else /* !CONFIG_BPF_LSM */
> >
> >  static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id)
> > @@ -65,6 +68,17 @@ static inline void bpf_inode_storage_free(struct inode *inode)
> >  {
> >  }
> >
> > +static inline int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> > +                                        bpf_func_t *bpf_func)
> > +{
> > +     return -ENOENT;
> > +}
> > +
> > +static inline int bpf_lsm_hook_idx(u32 btf_id)
> > +{
> > +     return -EINVAL;
> > +}
> > +
> >  #endif /* CONFIG_BPF_LSM */
> >
> >  #endif /* _LINUX_BPF_LSM_H */
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index d14b10b85e51..bbe48a2dd852 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -998,6 +998,7 @@ enum bpf_attach_type {
> >       BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> >       BPF_PERF_EVENT,
> >       BPF_TRACE_KPROBE_MULTI,
> > +     BPF_LSM_CGROUP,
> >       __MAX_BPF_ATTACH_TYPE
> >  };
> >
> > diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> > index 064eccba641d..eca258ba71d8 100644
> > --- a/kernel/bpf/bpf_lsm.c
> > +++ b/kernel/bpf/bpf_lsm.c
> > @@ -35,6 +35,98 @@ BTF_SET_START(bpf_lsm_hooks)
> >  #undef LSM_HOOK
> >  BTF_SET_END(bpf_lsm_hooks)
> >
> > +static unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
> > +                                             const struct bpf_insn *insn)
> > +{
> > +     const struct bpf_prog *prog;
> > +     struct socket *sock;
> > +     struct cgroup *cgrp;
> > +     struct sock *sk;
> > +     int ret = 0;
> > +     u64 *regs;
> > +
> > +     regs = (u64 *)ctx;
> > +     sock = (void *)(unsigned long)regs[BPF_REG_0];
> > +     /*prog = container_of(insn, struct bpf_prog, insnsi);*/
> > +     prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > +
> > +     if (unlikely(!sock))
> > +             return 0;
> > +
> > +     sk = sock->sk;
> > +     if (unlikely(!sk))
> > +             return 0;
> > +
> > +     cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> > +     if (likely(cgrp))
> > +             ret = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[prog->aux->cgroup_atype],
> > +                                         ctx, bpf_prog_run, 0);
> > +     return ret;
> > +}
> > +
> > +static unsigned int __cgroup_bpf_run_lsm_current(const void *ctx,
> > +                                              const struct bpf_insn *insn)
> > +{
> > +     const struct bpf_prog *prog;
> > +     struct cgroup *cgrp;
> > +     int ret = 0;
> > +
> > +     if (unlikely(!current))
> > +             return 0;
> > +
> > +     /*prog = container_of(insn, struct bpf_prog, insnsi);*/
> > +     prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> > +
> > +     rcu_read_lock();
> > +     cgrp = task_dfl_cgroup(current);
> > +     if (likely(cgrp))
> > +             ret = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[prog->aux->cgroup_atype],
> > +                                         ctx, bpf_prog_run, 0);
> > +     rcu_read_unlock();
> > +     return ret;
> > +}
> > +
> > +int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> > +                          bpf_func_t *bpf_func)
> > +{
> > +     const struct btf_type *first_arg_type;
> > +     const struct btf_type *sock_type;
> > +     const struct btf *btf_vmlinux;
> > +     const struct btf_param *args;
> > +     s32 type_id;
> > +
> > +     if (!prog->aux->attach_func_proto ||
> > +         !btf_type_is_func_proto(prog->aux->attach_func_proto))
> > +             return -EINVAL;
> > +
> > +     if (btf_type_vlen(prog->aux->attach_func_proto) < 1)
> > +             return -EINVAL;
> > +
> > +     args = (const struct btf_param *)(prog->aux->attach_func_proto + 1);
> > +
> > +     btf_vmlinux = bpf_get_btf_vmlinux();
> > +     if (!btf_vmlinux)
>
> We should use IS_ERR_OR_NULL to check the result, e.g. see:
> 7ada3787e91c ("bpf: Check for NULL return from bpf_get_btf_vmlinux").

Oh, thanks, will do!

^ permalink raw reply

* [PATCH bpf-next] libbpf: also check /sys/kernel/tracing for tracefs files
From: Connor O'Brien @ 2022-04-21 20:11 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, netdev, bpf, linux-kernel, Connor O'Brien

libbpf looks for tracefs files only under debugfs, but tracefs may be
mounted even if debugfs is not. When /sys/kernel/debug/tracing is
absent, try looking under /sys/kernel/tracing instead.

Signed-off-by: Connor O'Brien <connoro@google.com>
---
 tools/lib/bpf/libbpf.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 68cc134d070d..6ef587329eb2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -10140,10 +10140,16 @@ static void gen_kprobe_legacy_event_name(char *buf, size_t buf_sz,
 		 __sync_fetch_and_add(&index, 1));
 }
 
+static bool debugfs_available(void)
+{
+	return !access("/sys/kernel/debug/tracing", F_OK);
+}
+
 static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
 				   const char *kfunc_name, size_t offset)
 {
-	const char *file = "/sys/kernel/debug/tracing/kprobe_events";
+	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/kprobe_events" :
+		"/sys/kernel/tracing/kprobe_events";
 
 	return append_to_file(file, "%c:%s/%s %s+0x%zx",
 			      retprobe ? 'r' : 'p',
@@ -10153,7 +10159,8 @@ static int add_kprobe_event_legacy(const char *probe_name, bool retprobe,
 
 static int remove_kprobe_event_legacy(const char *probe_name, bool retprobe)
 {
-	const char *file = "/sys/kernel/debug/tracing/kprobe_events";
+	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/kprobe_events" :
+		"/sys/kernel/tracing/kprobe_events";
 
 	return append_to_file(file, "-:%s/%s", retprobe ? "kretprobes" : "kprobes", probe_name);
 }
@@ -10163,7 +10170,8 @@ static int determine_kprobe_perf_type_legacy(const char *probe_name, bool retpro
 	char file[256];
 
 	snprintf(file, sizeof(file),
-		 "/sys/kernel/debug/tracing/events/%s/%s/id",
+		 debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
+		 "/sys/kernel/tracing/events/%s/%s/id",
 		 retprobe ? "kretprobes" : "kprobes", probe_name);
 
 	return parse_uint_from_file(file, "%d\n");
@@ -10493,7 +10501,8 @@ static void gen_uprobe_legacy_event_name(char *buf, size_t buf_sz,
 static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
 					  const char *binary_path, size_t offset)
 {
-	const char *file = "/sys/kernel/debug/tracing/uprobe_events";
+	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/uprobe_events" :
+		"/sys/kernel/tracing/uprobe_events";
 
 	return append_to_file(file, "%c:%s/%s %s:0x%zx",
 			      retprobe ? 'r' : 'p',
@@ -10503,7 +10512,8 @@ static inline int add_uprobe_event_legacy(const char *probe_name, bool retprobe,
 
 static inline int remove_uprobe_event_legacy(const char *probe_name, bool retprobe)
 {
-	const char *file = "/sys/kernel/debug/tracing/uprobe_events";
+	const char *file = debugfs_available() ? "/sys/kernel/debug/tracing/uprobe_events" :
+		"/sys/kernel/tracing/uprobe_events";
 
 	return append_to_file(file, "-:%s/%s", retprobe ? "uretprobes" : "uprobes", probe_name);
 }
@@ -10513,7 +10523,8 @@ static int determine_uprobe_perf_type_legacy(const char *probe_name, bool retpro
 	char file[512];
 
 	snprintf(file, sizeof(file),
-		 "/sys/kernel/debug/tracing/events/%s/%s/id",
+		 debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
+		 "/sys/kernel/tracing/events/%s/%s/id",
 		 retprobe ? "uretprobes" : "uprobes", probe_name);
 
 	return parse_uint_from_file(file, "%d\n");
@@ -11071,7 +11082,8 @@ static int determine_tracepoint_id(const char *tp_category,
 	int ret;
 
 	ret = snprintf(file, sizeof(file),
-		       "/sys/kernel/debug/tracing/events/%s/%s/id",
+		       debugfs_available() ? "/sys/kernel/debug/tracing/events/%s/%s/id" :
+		       "/sys/kernel/tracing/events/%s/%s/id",
 		       tp_category, tp_name);
 	if (ret < 0)
 		return -errno;
-- 
2.36.0.rc2.479.g8af0fa9b8e-goog


^ permalink raw reply related

* Re: [PATCH 0/5] net: atlantic: more fuzzing fixes
From: Grant Grundler @ 2022-04-21 19:53 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Igor Russkikh, Jakub Kicinski, Paolo Abeni, netdev,
	David S . Miller, LKML, Aashay Shringarpure, Yi Chou,
	Shervin Oloumi
In-Reply-To: <20220418231746.2464800-1-grundler@chromium.org>

Igor,
Will you have a chance to comment on this in the near future?
Should someone else review/integrate these patches?

I'm asking since I've seen no comments in the past three days.

cheers,
grant


On Mon, Apr 18, 2022 at 4:17 PM Grant Grundler <grundler@chromium.org> wrote:
>
> The Chrome OS fuzzing team posted a "Fuzzing" report for atlantic driver
> in Q4 2021 using Chrome OS v5.4 kernel and "Cable Matters
> Thunderbolt 3 to 10 Gb Ethernet" (b0 version):
>     https://docs.google.com/document/d/e/2PACX-1vT4oCGNhhy_AuUqpu6NGnW0N9HF_jxf2kS7raOpOlNRqJNiTHAtjiHRthXYSeXIRTgfeVvsEt0qK9qK/pub
>
> It essentially describes four problems:
> 1) validate rxd_wb->next_desc_ptr before populating buff->next
> 2) "frag[0] not initialized" case in aq_ring_rx_clean()
> 3) limit iterations handling fragments in aq_ring_rx_clean()
> 4) validate hw_head_ in hw_atl_b0_hw_ring_tx_head_update()
>
> I've added one "clean up" contribution:
>     "net: atlantic: reduce scope of is_rsc_complete"
>
> I tested the "original" patches using chromeos-v5.4 kernel branch:
>     https://chromium-review.googlesource.com/q/hashtag:pcinet-atlantic-2022q1+(status:open%20OR%20status:merged)
>
> The fuzzing team will retest using the chromeos-v5.4 patches and the b0 HW.
>
> I've forward ported those patches to 5.18-rc2 and compiled them but am
> currently unable to test them on 5.18-rc2 kernel (logistics problems).
>
> I'm confident in all but the last patch:
>    "net: atlantic: verify hw_head_ is reasonable"
>
> Please verify I'm not confusing how ring->sw_head and ring->sw_tail
> are used in hw_atl_b0_hw_ring_tx_head_update().
>
> Credit largely goes to Chrome OS Fuzzing team members:
>     Aashay Shringarpure, Yi Chou, Shervin Oloumi
>
> cheers,
> grant

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/7] bpf: per-cgroup lsm flavor
From: Kumar Kartikeya Dwivedi @ 2022-04-21 19:46 UTC (permalink / raw)
  To: Stanislav Fomichev; +Cc: netdev, bpf, ast, daniel, andrii
In-Reply-To: <20220405214342.1968262-3-sdf@google.com>

On Wed, Apr 06, 2022 at 03:13:37AM IST, Stanislav Fomichev wrote:
> Allow attaching to lsm hooks in the cgroup context.
>
> Attaching to per-cgroup LSM works exactly like attaching
> to other per-cgroup hooks. New BPF_LSM_CGROUP is added
> to trigger new mode; the actual lsm hook we attach to is
> signaled via existing attach_btf_id.
>
> For the hooks that have 'struct socket' as its first argument,
> we use the cgroup associated with that socket. For the rest,
> we use 'current' cgroup (this is all on default hierarchy == v2 only).
> Note that for the hooks that work on 'struct sock' we still
> take the cgroup from 'current' because most of the time,
> the 'sock' argument is not properly initialized.
>
> Behind the scenes, we allocate a shim program that is attached
> to the trampoline and runs cgroup effective BPF programs array.
> This shim has some rudimentary ref counting and can be shared
> between several programs attaching to the same per-cgroup lsm hook.
>
> Note that this patch bloats cgroup size because we add 211
> cgroup_bpf_attach_type(s) for simplicity sake. This will be
> addressed in the subsequent patch.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/linux/bpf-cgroup-defs.h |   6 ++
>  include/linux/bpf.h             |  13 +++
>  include/linux/bpf_lsm.h         |  14 ++++
>  include/uapi/linux/bpf.h        |   1 +
>  kernel/bpf/bpf_lsm.c            |  92 ++++++++++++++++++++
>  kernel/bpf/btf.c                |  11 +++
>  kernel/bpf/cgroup.c             | 116 ++++++++++++++++++++++---
>  kernel/bpf/syscall.c            |  10 +++
>  kernel/bpf/trampoline.c         | 144 ++++++++++++++++++++++++++++++++
>  kernel/bpf/verifier.c           |   1 +
>  tools/include/uapi/linux/bpf.h  |   1 +
>  11 files changed, 397 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
> index 695d1224a71b..6c661b4df9fa 100644
> --- a/include/linux/bpf-cgroup-defs.h
> +++ b/include/linux/bpf-cgroup-defs.h
> @@ -10,6 +10,8 @@
>
>  struct bpf_prog_array;
>
> +#define CGROUP_LSM_NUM 211 /* will be addressed in the next patch */
> +
>  enum cgroup_bpf_attach_type {
>  	CGROUP_BPF_ATTACH_TYPE_INVALID = -1,
>  	CGROUP_INET_INGRESS = 0,
> @@ -35,6 +37,10 @@ enum cgroup_bpf_attach_type {
>  	CGROUP_INET4_GETSOCKNAME,
>  	CGROUP_INET6_GETSOCKNAME,
>  	CGROUP_INET_SOCK_RELEASE,
> +#ifdef CONFIG_BPF_LSM
> +	CGROUP_LSM_START,
> +	CGROUP_LSM_END = CGROUP_LSM_START + CGROUP_LSM_NUM - 1,
> +#endif
>  	MAX_CGROUP_BPF_ATTACH_TYPE
>  };
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 487aba40ce52..17bbe2f7b2be 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -807,6 +807,9 @@ static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
>  #ifdef CONFIG_BPF_JIT
>  int bpf_trampoline_link_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
>  int bpf_trampoline_unlink_prog(struct bpf_prog *prog, struct bpf_trampoline *tr);
> +int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> +				    struct bpf_attach_target_info *tgt_info);
> +void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog);
>  struct bpf_trampoline *bpf_trampoline_get(u64 key,
>  					  struct bpf_attach_target_info *tgt_info);
>  void bpf_trampoline_put(struct bpf_trampoline *tr);
> @@ -865,6 +868,14 @@ static inline int bpf_trampoline_unlink_prog(struct bpf_prog *prog,
>  {
>  	return -ENOTSUPP;
>  }
> +static inline int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
> +						  struct bpf_attach_target_info *tgt_info)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void bpf_trampoline_unlink_cgroup_shim(struct bpf_prog *prog)
> +{
> +}
>  static inline struct bpf_trampoline *bpf_trampoline_get(u64 key,
>  							struct bpf_attach_target_info *tgt_info)
>  {
> @@ -980,6 +991,7 @@ struct bpf_prog_aux {
>  	u64 load_time; /* ns since boottime */
>  	u32 verified_insns;
>  	struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE];
> +	int cgroup_atype; /* enum cgroup_bpf_attach_type */
>  	char name[BPF_OBJ_NAME_LEN];
>  #ifdef CONFIG_SECURITY
>  	void *security;
> @@ -2383,6 +2395,7 @@ void *bpf_arch_text_copy(void *dst, void *src, size_t len);
>
>  struct btf_id_set;
>  bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
> +int btf_id_set_index(const struct btf_id_set *set, u32 id);
>
>  #define MAX_BPRINTF_VARARGS		12
>
> diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h
> index 479c101546ad..7f0e59f5f9be 100644
> --- a/include/linux/bpf_lsm.h
> +++ b/include/linux/bpf_lsm.h
> @@ -42,6 +42,9 @@ extern const struct bpf_func_proto bpf_inode_storage_get_proto;
>  extern const struct bpf_func_proto bpf_inode_storage_delete_proto;
>  void bpf_inode_storage_free(struct inode *inode);
>
> +int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog, bpf_func_t *bpf_func);
> +int bpf_lsm_hook_idx(u32 btf_id);
> +
>  #else /* !CONFIG_BPF_LSM */
>
>  static inline bool bpf_lsm_is_sleepable_hook(u32 btf_id)
> @@ -65,6 +68,17 @@ static inline void bpf_inode_storage_free(struct inode *inode)
>  {
>  }
>
> +static inline int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> +					   bpf_func_t *bpf_func)
> +{
> +	return -ENOENT;
> +}
> +
> +static inline int bpf_lsm_hook_idx(u32 btf_id)
> +{
> +	return -EINVAL;
> +}
> +
>  #endif /* CONFIG_BPF_LSM */
>
>  #endif /* _LINUX_BPF_LSM_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index d14b10b85e51..bbe48a2dd852 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -998,6 +998,7 @@ enum bpf_attach_type {
>  	BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
>  	BPF_PERF_EVENT,
>  	BPF_TRACE_KPROBE_MULTI,
> +	BPF_LSM_CGROUP,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>
> diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> index 064eccba641d..eca258ba71d8 100644
> --- a/kernel/bpf/bpf_lsm.c
> +++ b/kernel/bpf/bpf_lsm.c
> @@ -35,6 +35,98 @@ BTF_SET_START(bpf_lsm_hooks)
>  #undef LSM_HOOK
>  BTF_SET_END(bpf_lsm_hooks)
>
> +static unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
> +						const struct bpf_insn *insn)
> +{
> +	const struct bpf_prog *prog;
> +	struct socket *sock;
> +	struct cgroup *cgrp;
> +	struct sock *sk;
> +	int ret = 0;
> +	u64 *regs;
> +
> +	regs = (u64 *)ctx;
> +	sock = (void *)(unsigned long)regs[BPF_REG_0];
> +	/*prog = container_of(insn, struct bpf_prog, insnsi);*/
> +	prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> +
> +	if (unlikely(!sock))
> +		return 0;
> +
> +	sk = sock->sk;
> +	if (unlikely(!sk))
> +		return 0;
> +
> +	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
> +	if (likely(cgrp))
> +		ret = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[prog->aux->cgroup_atype],
> +					    ctx, bpf_prog_run, 0);
> +	return ret;
> +}
> +
> +static unsigned int __cgroup_bpf_run_lsm_current(const void *ctx,
> +						 const struct bpf_insn *insn)
> +{
> +	const struct bpf_prog *prog;
> +	struct cgroup *cgrp;
> +	int ret = 0;
> +
> +	if (unlikely(!current))
> +		return 0;
> +
> +	/*prog = container_of(insn, struct bpf_prog, insnsi);*/
> +	prog = (const struct bpf_prog *)((void *)insn - offsetof(struct bpf_prog, insnsi));
> +
> +	rcu_read_lock();
> +	cgrp = task_dfl_cgroup(current);
> +	if (likely(cgrp))
> +		ret = BPF_PROG_RUN_ARRAY_CG(cgrp->bpf.effective[prog->aux->cgroup_atype],
> +					    ctx, bpf_prog_run, 0);
> +	rcu_read_unlock();
> +	return ret;
> +}
> +
> +int bpf_lsm_find_cgroup_shim(const struct bpf_prog *prog,
> +			     bpf_func_t *bpf_func)
> +{
> +	const struct btf_type *first_arg_type;
> +	const struct btf_type *sock_type;
> +	const struct btf *btf_vmlinux;
> +	const struct btf_param *args;
> +	s32 type_id;
> +
> +	if (!prog->aux->attach_func_proto ||
> +	    !btf_type_is_func_proto(prog->aux->attach_func_proto))
> +		return -EINVAL;
> +
> +	if (btf_type_vlen(prog->aux->attach_func_proto) < 1)
> +		return -EINVAL;
> +
> +	args = (const struct btf_param *)(prog->aux->attach_func_proto + 1);
> +
> +	btf_vmlinux = bpf_get_btf_vmlinux();
> +	if (!btf_vmlinux)

We should use IS_ERR_OR_NULL to check the result, e.g. see:
7ada3787e91c ("bpf: Check for NULL return from bpf_get_btf_vmlinux").

> [...]

--
Kartikeya

^ permalink raw reply

* Re: [Patch bpf-next v1 1/4] tcp: introduce tcp_read_skb()
From: Cong Wang @ 2022-04-21 19:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Linux Kernel Network Developers, Cong Wang, Eric Dumazet,
	Daniel Borkmann, Jakub Sitnicki
In-Reply-To: <6255da425c4ad_57e1208f9@john.notmuch>

On Tue, Apr 12, 2022 at 1:00 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > This patch inroduces tcp_read_skb() based on tcp_read_sock(),
> > a preparation for the next patch which actually introduces
> > a new sock ops.
> >
> > TCP is special here, because it has tcp_read_sock() which is
> > mainly used by splice(). tcp_read_sock() supports partial read
> > and arbitrary offset, neither of them is needed for sockmap.
> >
> > Cc: Eric Dumazet <edumazet@google.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
>
> Thanks for doing this Cong comment/question inline.
>
> [...]
>
> > +int tcp_read_skb(struct sock *sk, read_descriptor_t *desc,
> > +              sk_read_actor_t recv_actor)
> > +{
> > +     struct sk_buff *skb;
> > +     struct tcp_sock *tp = tcp_sk(sk);
> > +     u32 seq = tp->copied_seq;
> > +     u32 offset;
> > +     int copied = 0;
> > +
> > +     if (sk->sk_state == TCP_LISTEN)
> > +             return -ENOTCONN;
> > +     while ((skb = tcp_recv_skb(sk, seq, &offset, true)) != NULL) {
>
> I'm trying to see why we might have an offset here if we always
> consume the entire skb. There is a comment in tcp_recv_skb around
> GRO packets, but not clear how this applies here if it does at all
> to me yet. Will read a bit more I guess.
>
> If the offset can be >0 than we also need to fix the recv_actor to
> account for the extra offset in the skb. As is the bpf prog might
> see duplicate data. This is a problem on the stream parser now.
>
> Then another fallout is if offset is zero than we could just do
> a skb_dequeue here and skip the tcp_recv_skb bool flag addition
> and upate.

I think it is mainly for splice(), and of course strparser, but none of
them is touched by my patchset.

>
> I'll continue reading after a few other things I need to get
> sorted this afternoon, but maybe you have the answer on hand.
>

Please let me know if you have any other comments.

Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox