netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>,
	Michal Kubecek <mkubecek@suse.cz>, Firo Yang <firo.yang@suse.com>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Sasha Levin <sashal@kernel.org>,
	rcu@vger.kernel.org, netdev@vger.kernel.org
Subject: [PATCH AUTOSEL 4.19 46/84] tcp/dccp: fix possible race __inet_lookup_established()
Date: Fri, 27 Dec 2019 12:43:14 -0500	[thread overview]
Message-ID: <20191227174352.6264-46-sashal@kernel.org> (raw)
In-Reply-To: <20191227174352.6264-1-sashal@kernel.org>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 8dbd76e79a16b45b2ccb01d2f2e08dbf64e71e40 ]

Michal Kubecek and Firo Yang did a very nice analysis of crashes
happening in __inet_lookup_established().

Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
(via a close()/socket()/listen() cycle) without a RCU grace period,
I should not have changed listeners linkage in their hash table.

They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
so that a lookup can detect a socket in a hash list was moved in
another one.

Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve
merge conflict for v4/v6 ordering fix"), we have to add
hlist_nulls_add_tail_rcu() helper.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Firo Yang <firo.yang@suse.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/rculist_nulls.h | 37 +++++++++++++++++++++++++++++++++++
 include/net/inet_hashtables.h | 12 +++++++++---
 include/net/sock.h            |  5 +++++
 net/ipv4/inet_diag.c          |  3 ++-
 net/ipv4/inet_hashtables.c    | 16 +++++++--------
 net/ipv4/tcp_ipv4.c           |  7 ++++---
 6 files changed, 65 insertions(+), 15 deletions(-)

diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index bc8206a8f30e..61974c4c566b 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -100,6 +100,43 @@ static inline void hlist_nulls_add_head_rcu(struct hlist_nulls_node *n,
 		first->pprev = &n->next;
 }
 
+/**
+ * hlist_nulls_add_tail_rcu
+ * @n: the element to add to the hash list.
+ * @h: the list to add to.
+ *
+ * Description:
+ * Adds the specified element to the specified hlist_nulls,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs.  Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
+					    struct hlist_nulls_head *h)
+{
+	struct hlist_nulls_node *i, *last = NULL;
+
+	/* Note: write side code, so rcu accessors are not needed. */
+	for (i = h->first; !is_a_nulls(i); i = i->next)
+		last = i;
+
+	if (last) {
+		n->next = last->next;
+		n->pprev = &last->next;
+		rcu_assign_pointer(hlist_next_rcu(last), n);
+	} else {
+		hlist_nulls_add_head_rcu(n, h);
+	}
+}
+
 /**
  * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
  * @tpos:	the type * to use as a loop cursor.
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..b875dcef173c 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -106,13 +106,19 @@ struct inet_bind_hashbucket {
 	struct hlist_head	chain;
 };
 
-/*
- * Sockets can be hashed in established or listening table
+/* Sockets can be hashed in established or listening table.
+ * We must use different 'nulls' end-of-chain value for all hash buckets :
+ * A socket might transition from ESTABLISH to LISTEN state without
+ * RCU grace period. A lookup in ehash table needs to handle this case.
  */
+#define LISTENING_NULLS_BASE (1U << 29)
 struct inet_listen_hashbucket {
 	spinlock_t		lock;
 	unsigned int		count;
-	struct hlist_head	head;
+	union {
+		struct hlist_head	head;
+		struct hlist_nulls_head	nulls_head;
+	};
 };
 
 /* This is for listening sockets, thus all sockets which possess wildcards. */
diff --git a/include/net/sock.h b/include/net/sock.h
index 4545a9ecc219..f359e5c94762 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -721,6 +721,11 @@ static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_h
 	hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
 }
 
+static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nulls_head *list)
+{
+	hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
+}
+
 static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
 {
 	sock_hold(sk);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 5731670c560b..9742b37afe1d 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -918,11 +918,12 @@ void inet_diag_dump_icsk(struct inet_hashinfo *hashinfo, struct sk_buff *skb,
 
 		for (i = s_i; i < INET_LHTABLE_SIZE; i++) {
 			struct inet_listen_hashbucket *ilb;
+			struct hlist_nulls_node *node;
 
 			num = 0;
 			ilb = &hashinfo->listening_hash[i];
 			spin_lock(&ilb->lock);
-			sk_for_each(sk, &ilb->head) {
+			sk_nulls_for_each(sk, node, &ilb->nulls_head) {
 				struct inet_sock *inet = inet_sk(sk);
 
 				if (!net_eq(sock_net(sk), net))
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 7be966a60801..900756b3defb 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -560,10 +560,11 @@ static int inet_reuseport_add_sock(struct sock *sk,
 				   struct inet_listen_hashbucket *ilb)
 {
 	struct inet_bind_bucket *tb = inet_csk(sk)->icsk_bind_hash;
+	const struct hlist_nulls_node *node;
 	struct sock *sk2;
 	kuid_t uid = sock_i_uid(sk);
 
-	sk_for_each_rcu(sk2, &ilb->head) {
+	sk_nulls_for_each_rcu(sk2, node, &ilb->nulls_head) {
 		if (sk2 != sk &&
 		    sk2->sk_family == sk->sk_family &&
 		    ipv6_only_sock(sk2) == ipv6_only_sock(sk) &&
@@ -599,9 +600,9 @@ int __inet_hash(struct sock *sk, struct sock *osk)
 	}
 	if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
 		sk->sk_family == AF_INET6)
-		hlist_add_tail_rcu(&sk->sk_node, &ilb->head);
+		__sk_nulls_add_node_tail_rcu(sk, &ilb->nulls_head);
 	else
-		hlist_add_head_rcu(&sk->sk_node, &ilb->head);
+		__sk_nulls_add_node_rcu(sk, &ilb->nulls_head);
 	inet_hash2(hashinfo, sk);
 	ilb->count++;
 	sock_set_flag(sk, SOCK_RCU_FREE);
@@ -650,11 +651,9 @@ void inet_unhash(struct sock *sk)
 		reuseport_detach_sock(sk);
 	if (ilb) {
 		inet_unhash2(hashinfo, sk);
-		 __sk_del_node_init(sk);
-		 ilb->count--;
-	} else {
-		__sk_nulls_del_node_init_rcu(sk);
+		ilb->count--;
 	}
+	__sk_nulls_del_node_init_rcu(sk);
 	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 unlock:
 	spin_unlock_bh(lock);
@@ -790,7 +789,8 @@ void inet_hashinfo_init(struct inet_hashinfo *h)
 
 	for (i = 0; i < INET_LHTABLE_SIZE; i++) {
 		spin_lock_init(&h->listening_hash[i].lock);
-		INIT_HLIST_HEAD(&h->listening_hash[i].head);
+		INIT_HLIST_NULLS_HEAD(&h->listening_hash[i].nulls_head,
+				      i + LISTENING_NULLS_BASE);
 		h->listening_hash[i].count = 0;
 	}
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index bfec48849735..5553f6a833f3 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2020,13 +2020,14 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct tcp_iter_state *st = seq->private;
 	struct net *net = seq_file_net(seq);
 	struct inet_listen_hashbucket *ilb;
+	struct hlist_nulls_node *node;
 	struct sock *sk = cur;
 
 	if (!sk) {
 get_head:
 		ilb = &tcp_hashinfo.listening_hash[st->bucket];
 		spin_lock(&ilb->lock);
-		sk = sk_head(&ilb->head);
+		sk = sk_nulls_head(&ilb->nulls_head);
 		st->offset = 0;
 		goto get_sk;
 	}
@@ -2034,9 +2035,9 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	++st->num;
 	++st->offset;
 
-	sk = sk_next(sk);
+	sk = sk_nulls_next(sk);
 get_sk:
-	sk_for_each_from(sk) {
+	sk_nulls_for_each_from(sk, node) {
 		if (!net_eq(sock_net(sk), net))
 			continue;
 		if (sk->sk_family == afinfo->family)
-- 
2.20.1


  parent reply	other threads:[~2019-12-27 17:48 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20191227174352.6264-1-sashal@kernel.org>
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 05/84] mwifiex: fix possible heap overflow in mwifiex_process_country_ie() Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 07/84] netfilter: ctnetlink: netns exit must wait for callbacks Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 08/84] mwifiex: Fix heap overflow in mmwifiex_process_tdls_action_frame() Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 12/84] netfilter: nf_queue: enqueue skbs with NULL dst Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 19/84] netfilter: nft_set_rbtree: bogus lookup/get on consecutive elements in named sets Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 20/84] netfilter: nf_tables: validate NFT_SET_ELEM_INTERVAL_END Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 21/84] netfilter: nf_tables: validate NFT_DATA_VALUE after nft_data_init() Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 22/84] netfilter: bridge: make sure to pull arp header in br_nf_forward_arp() Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 26/84] selftests: forwarding: Delete IPv6 address at the end Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 28/84] af_packet: set defaule value for tmo Sasha Levin
2019-12-27 17:42 ` [PATCH AUTOSEL 4.19 29/84] fjes: fix missed check in fjes_acpi_add Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 32/84] bnxt_en: Return error if FW returns more data than dump length Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 33/84] net: ena: fix napi handler misbehavior when the napi budget is zero Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 34/84] bpf, mips: Limit to 33 tail calls Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 37/84] samples: bpf: Replace symbol compare of trace_event Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 38/84] samples: bpf: fix syscall_tp due to unused syscall Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 40/84] net: usb: lan78xx: Fix suspend/resume PHY register access error Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 41/84] qede: Fix multicast mac configuration Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 45/84] bpf: Clear skb->tstamp in bpf_redirect when necessary Sasha Levin
2019-12-27 17:43 ` Sasha Levin [this message]
2020-01-02  8:01   ` [PATCH AUTOSEL 4.19 46/84] tcp/dccp: fix possible race __inet_lookup_established() Naresh Kamboju
2020-01-09 15:32     ` Sasha Levin
2020-01-09 17:07       ` Michal Kubecek
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 47/84] 6pack,mkiss: fix possible deadlock Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 48/84] net: marvell: mvpp2: phylink requires the link interrupt Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 49/84] bnx2x: Do not handle requests from VFs after parity Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 50/84] bnx2x: Fix logic to get total no. of PFs per engine Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 51/84] bonding: fix active-backup transition after link failure Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 52/84] gtp: do not allow adding duplicate tid and ms_addr pdp context Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 53/84] gtp: fix wrong condition in gtp_genl_dump_pdp() Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 54/84] gtp: avoid zero size hashtable Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 55/84] cxgb4: Fix kernel panic while accessing sge_info Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 56/84] net: usb: lan78xx: Fix error message format specifier Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 58/84] rfkill: Fix incorrect check to avoid NULL pointer dereference Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 61/84] net: gemini: Fix memory leak in gmac_setup_txqs Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 67/84] net: qlogic: Fix error paths in ql_alloc_large_buffers() Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 68/84] net: nfc: nci: fix a possible sleep-in-atomic-context bug in nci_uart_tty_receive() Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 69/84] net: stmmac: Do not accept invalid MTU values Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 70/84] net: stmmac: xgmac: Clear previous RX buffer size Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 71/84] net: stmmac: RX buffer size must be 16 byte aligned Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 72/84] net: stmmac: Always arm TX Timer at end of transmission start Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 75/84] net, sysctl: Fix compiler warning when only cBPF is present Sasha Levin
2019-12-27 17:43 ` [PATCH AUTOSEL 4.19 80/84] net: hisilicon: Fix a BUG trigered by wrong bytes_compl Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191227174352.6264-46-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=edumazet@google.com \
    --cc=firo.yang@suse.com \
    --cc=jakub.kicinski@netronome.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mkubecek@suse.cz \
    --cc=netdev@vger.kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).