From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 274C2184524; Mon, 15 Jun 2026 11:04:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781521478; cv=none; b=qgrM1s1WHbvhvnUQcNgF8GxqKi8c5LWOUX7VDIEVd9rnUaNknXJTHdwiEPnOoNTjk1O8fzNlShaxLopBtW3iulHLPtHDz7ZWOsU1Re8EnMnngtC0oVbBuzmXI0s8siR60cKutEEuNBwhxRMEkji6R6sGdbxNSF64Ra+v1PQmzUM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781521478; c=relaxed/simple; bh=LGFZYlwtFUjYOoGIPlj99bcTUkCYljLg6LEyaCpWaH4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=QJPK6ZgzoEgAPICcsKxN+dOLZ+GBeV4J1yfB+M/kidNXSpTH5Tssj0eAUmZDSsjovxn1LPlfz7Wmec3Ujm5l5qEcPKAVBsaBb3zNptVqmCgWuqaOZ1/BescZLKFK+fSmHnFK16DeNK4TEQ303zzujK2dijUWD1w5vJMPnyyb1nI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=gIP2phMA; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="gIP2phMA" Received: by smtp.kernel.org (Postfix) with ESMTPSA id B366A1F000E9; Mon, 15 Jun 2026 11:04:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781521476; bh=k17Q1r8PFXZh1xa77/C5cYwnq/wQ3L9dzO4gyrjHwO0=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=gIP2phMALxM6PhidrUGxUwM+wRDITWsWwf6N6OVGcpSLFOvLrPzEI/PY4ZfjuIJSw feJ0eb4u6MW+Wx3M3y7Yf4gvaDwHCv8m0r++wC3pi6EZ5ut22ZnZ///cS7n4x6MEiB 5FzbgnU8NRF7QX/+7mscHaZp/a6ZpaIiS3c566bD2TH673Gv8suPmtCQrMMDImw8CX NBZqJrjr3bUkgfaE837TUcbbe75UBitYzOW97Ycnsa9j1zFjln1KliPWj5FVXxihGY yx1AXvDhvjqeiU2noe3zHAkhcPx08j5J1KAYHQi9uaY++uEcwC2o1ib1ospPpX1qh8 XYwzKw+gDKprQ== From: Simon Horman To: lucien.xin@gmail.com Cc: Simon Horman , netdev@vger.kernel.org, linux-sctp@vger.kernel.org, davem@davemloft.net, kuba@kernel.org, edumazet@google.com, pabeni@redhat.com, marcelo.leitner@gmail.com, w@1wt.eu, zdi-disclosures@trendmicro.com Subject: Re: [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag Date: Mon, 15 Jun 2026 12:04:03 +0100 Message-ID: <20260615110402.753443-2-horms@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <1bbd999cf88fb1ea93f4a3743047bce9b51adc24.1781287178.git.lucien.xin@gmail.com> References: <1bbd999cf88fb1ea93f4a3743047bce9b51adc24.1781287178.git.lucien.xin@gmail.com> Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit This is an AI-generated review of your patch. The human sending this email has considered the AI review valid, or at least plausible. Full review at: https://netdev-ai.bots.linux.dev/sashiko/ --- sctp: hold socket lock when dumping endpoints in sctp_diag This change reworks SCTP_DIAG endpoint dumping so the socket lock is held across inet_sctp_diag_fill() to keep the bind address list stable, and moves position tracking into sctp_for_each_endpoint() which now drops the bucket read_lock_bh() before invoking the callback. A few questions on the new iterator and dump callback below. > diff --git a/net/sctp/diag.c b/net/sctp/diag.c > index d758f5c3e06e..9108272ca527 100644 > --- a/net/sctp/diag.c > +++ b/net/sctp/diag.c > @@ -373,42 +374,36 @@ static int sctp_ep_dump(struct sctp_endpoint *ep, void *p) > struct sk_buff *skb = commp->skb; > struct netlink_callback *cb = commp->cb; > const struct inet_diag_req_v2 *r = commp->r; > - struct net *net = sock_net(skb->sk); > struct inet_sock *inet = inet_sk(sk); > int err = 0; > > - if (!net_eq(sock_net(sk), net)) > + lock_sock(sk); > + if (sctp_sstate(sk, CLOSED)) > goto out; [Low] The new unconditional skip on sctp_sstate(sk, CLOSED) does not appear to be described in the changelog. SCTP_SS_CLOSED maps to TCP_CLOSE, so a userspace caller that asks for TCPF_CLOSE in idiag_states will silently receive nothing from the endpoint hash for entries in that state. Most hashed endpoints are LISTENING (sctp_hash_endpoint() is called from sctp_listen_start() right after inet_sk_set_state(sk, SCTP_SS_LISTENING)) so the practical effect is small, but is the intent to filter teardown windows (in which case ep->base.dead might be a more direct check) or to honour the user-requested idiag_states bitmask? Either way, would it be worth calling this behaviour change out in the commit message? > > - if (cb->args[4] < cb->args[1]) > - goto next; > - > - if (!(r->idiag_states & TCPF_LISTEN) && !list_empty(&ep->asocs)) > - goto next; > + if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) && > + !list_empty(&ep->asocs)) > + goto out; [ ... ] > @@ -479,41 +474,40 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb, > .r = r, > .net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN), > }; > - int pos = cb->args[2]; > + int pos; > > /* eps hashtable dumps > * args: > * 0 : if it will traversal listen sock > * 1 : to record the sock pos of this time's traversal > - * 4 : to work as a temporary variable to traversal list > */ > if (cb->args[0] == 0) { > - if (!(idiag_states & TCPF_LISTEN)) > - goto skip; > - if (sctp_for_each_endpoint(sctp_ep_dump, &commp)) > - goto done; > -skip: > + if (idiag_states & TCPF_LISTEN) { > + pos = cb->args[1]; > + if (sctp_for_each_endpoint(sctp_ep_dump, net, &pos, > + &commp)) { > + cb->args[1] = pos; > + return; > + } > + } > cb->args[0] = 1; > cb->args[1] = 0; > - cb->args[4] = 0; > } > > + if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE))) > + return; > + > /* asocs by transport hashtable dump > * args: > * 1 : to record the assoc pos of this time's traversal > * 2 : to record the transport pos of this time's traversal > * 3 : to mark if we have dumped the ep info of the current asoc > * 4 : to work as a temporary variable to traversal list > - * 5 : to save the sk we get from travelsing the tsp list. > */ > - if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE))) > - goto done; > - > + pos = cb->args[2]; > sctp_transport_traverse_process(sctp_sock_filter, sctp_sock_dump, > net, &pos, &commp); > cb->args[2] = pos; > - > -done: > cb->args[1] = cb->args[4]; > cb->args[4] = 0; > } [Low] The commit message states that args[4] tracking is removed, but the asocs-phase comment block still lists: * 4 : to work as a temporary variable to traversal list and the trailing epilogue still does: cb->args[1] = cb->args[4]; cb->args[4] = 0; Are those leftovers from the eps-phase rewrite, or are they still needed by the asocs phase? If args[4] is genuinely gone, should the comment and the epilogue be cleaned up too? The previous comment also listed: * 5 : to save the sk we get from travelsing the tsp list. which is dropped here without further mention. Is args[5] confirmed unused on the asocs path? > diff --git a/net/sctp/socket.c b/net/sctp/socket.c > index 66e12fb0c646..1ed405dedc01 100644 > --- a/net/sctp/socket.c > +++ b/net/sctp/socket.c > @@ -5369,24 +5369,44 @@ struct sctp_transport *sctp_transport_get_idx(struct net *net, > } > > int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *), > - void *p) { > - int err = 0; > - int hash = 0; > - struct sctp_endpoint *ep; > + struct net *net, int *pos, void *p) { > + int err, hash = 0, idx = 0, start; > struct sctp_hashbucket *head; > + struct sctp_endpoint *ep; > + struct sock *sk; > > for (head = sctp_ep_hashtable; hash < sctp_ep_hashsize; > hash++, head++) { > + start = idx; > +again: > + sk = NULL; > read_lock_bh(&head->lock); > sctp_for_each_hentry(ep, &head->chain) { > - err = cb(ep, p); > - if (err) > + if (sock_net(ep->base.sk) != net) > + continue; > + if (idx++ >= *pos) { > + sk = ep->base.sk; > + sock_hold(sk); > break; > + } > } > read_unlock_bh(&head->lock); > + > + if (sk) { > + err = cb(ep, p); [Critical] Can ep be freed before this cb(ep, p) call? The bucket read_lock_bh() is dropped above, and only sock_hold(sk) is taken. In sctp_endpoint_init() the reference direction is ep -> sk: refcount_set(&ep->base.refcnt, 1); ... sock_hold(sk); so taking sock_hold(sk) does not pin ep. A concurrent close(2) on the same socket runs: sctp_close() -> release_sock() /* lock_sock dropped here */ -> sk_common_release() -> sctp_destroy_sock() -> sctp_endpoint_free() -> sctp_unhash_endpoint() -> sctp_endpoint_put() /* refcount may reach 0 */ -> call_rcu(&ep->rcu, sctp_endpoint_destroy_rcu) The destroy callback then does sctp_sk(sk)->ep = NULL; sock_put(sk); kfree(ep). Between read_unlock_bh() and cb(ep, p), this iterator is no longer in any RCU read-side section, and sctp_ep_dump() then calls lock_sock(sk) which may sleep. That allows an RCU grace period to elapse before the callback dereferences ep at: struct sock *sk = ep->base.sk; ... if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) && !list_empty(&ep->asocs)) The sibling helpers in the same file pin the endpoint explicitly to avoid this: sctp_transport_traverse_process() if (sctp_endpoint_hold(ep)) { /* asoc can be peeled off */ ret = cb(ep, tsp, p); ... sctp_endpoint_put(ep); } Should sctp_for_each_endpoint() take sctp_endpoint_hold(ep) inside the bucket read_lock_bh() (while ep is still hashed and live) and a matching sctp_endpoint_put(ep) after cb returns? [Medium] Can this restart pattern miss or duplicate endpoints when the bucket is modified concurrently? After processing one ep the loop does: idx = start; goto again; which re-walks the bucket from the head and uses the running counter idx compared against *pos to skip already-emitted entries. __sctp_hash_endpoint() inserts at the hlist head and __sctp_unhash_endpoint() can delete arbitrary nodes, both under the bucket write_lock that this iterator releases between walks. A concurrent listen() on a sibling endpoint shifts every existing entry's index by +1, and a concurrent close() before the cursor shifts indices by -1. Under that pattern, can SCTP_DIAG consumers observe duplicated endpoints (the previously-emitted entry now sits at the new *pos) or miss live endpoints (the next un-emitted entry is shifted past the cursor)? The pre-patch sweep walked each bucket once with read_lock_bh held throughout, so this consistency window did not exist. [Low] For a bucket that contains N matching endpoints, does this restart pattern walk the bucket O(N^2) times? For a bucket {A,B,C} starting from *pos=0 the goto-again loop performs 1+2+3 traversals to dump three entries plus a final 3-entry walk that finds nothing and terminates with sk == NULL, which is nine list iterations to dump three endpoints. The pre-patch code walked each bucket exactly once because the callback ran while still holding read_lock_bh(). The commit message describes the change as simplifying endpoint traversal; could the description also note the new per-bucket scan cost, or could a cursor that survives unlock/relock be used instead? > + if (err) { > + sock_put(sk); > + return err; > + } > + sock_put(sk); > + (*pos)++; > + > + idx = start; > + goto again; > + } > } > > - return err; > + return 0; > } > EXPORT_SYMBOL_GPL(sctp_for_each_endpoint);