From: Simon Horman <horms@kernel.org>
To: lucien.xin@gmail.com
Cc: Simon Horman <horms@kernel.org>,
netdev@vger.kernel.org, linux-sctp@vger.kernel.org,
davem@davemloft.net, kuba@kernel.org, edumazet@google.com,
pabeni@redhat.com, marcelo.leitner@gmail.com, w@1wt.eu,
zdi-disclosures@trendmicro.com
Subject: Re: [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag
Date: Mon, 15 Jun 2026 12:04:03 +0100 [thread overview]
Message-ID: <20260615110402.753443-2-horms@kernel.org> (raw)
In-Reply-To: <1bbd999cf88fb1ea93f4a3743047bce9b51adc24.1781287178.git.lucien.xin@gmail.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
sctp: hold socket lock when dumping endpoints in sctp_diag
This change reworks SCTP_DIAG endpoint dumping so the socket lock is held
across inet_sctp_diag_fill() to keep the bind address list stable, and
moves position tracking into sctp_for_each_endpoint() which now drops the
bucket read_lock_bh() before invoking the callback.
A few questions on the new iterator and dump callback below.
> diff --git a/net/sctp/diag.c b/net/sctp/diag.c
> index d758f5c3e06e..9108272ca527 100644
> --- a/net/sctp/diag.c
> +++ b/net/sctp/diag.c
> @@ -373,42 +374,36 @@ static int sctp_ep_dump(struct sctp_endpoint *ep, void *p)
> struct sk_buff *skb = commp->skb;
> struct netlink_callback *cb = commp->cb;
> const struct inet_diag_req_v2 *r = commp->r;
> - struct net *net = sock_net(skb->sk);
> struct inet_sock *inet = inet_sk(sk);
> int err = 0;
>
> - if (!net_eq(sock_net(sk), net))
> + lock_sock(sk);
> + if (sctp_sstate(sk, CLOSED))
> goto out;
[Low]
The new unconditional skip on sctp_sstate(sk, CLOSED) does not appear to
be described in the changelog.
SCTP_SS_CLOSED maps to TCP_CLOSE, so a userspace caller that asks for
TCPF_CLOSE in idiag_states will silently receive nothing from the
endpoint hash for entries in that state.
Most hashed endpoints are LISTENING (sctp_hash_endpoint() is called from
sctp_listen_start() right after inet_sk_set_state(sk, SCTP_SS_LISTENING))
so the practical effect is small, but is the intent to filter teardown
windows (in which case ep->base.dead might be a more direct check) or to
honour the user-requested idiag_states bitmask? Either way, would it be
worth calling this behaviour change out in the commit message?
>
> - if (cb->args[4] < cb->args[1])
> - goto next;
> -
> - if (!(r->idiag_states & TCPF_LISTEN) && !list_empty(&ep->asocs))
> - goto next;
> + if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) &&
> + !list_empty(&ep->asocs))
> + goto out;
[ ... ]
> @@ -479,41 +474,40 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
> .r = r,
> .net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN),
> };
> - int pos = cb->args[2];
> + int pos;
>
> /* eps hashtable dumps
> * args:
> * 0 : if it will traversal listen sock
> * 1 : to record the sock pos of this time's traversal
> - * 4 : to work as a temporary variable to traversal list
> */
> if (cb->args[0] == 0) {
> - if (!(idiag_states & TCPF_LISTEN))
> - goto skip;
> - if (sctp_for_each_endpoint(sctp_ep_dump, &commp))
> - goto done;
> -skip:
> + if (idiag_states & TCPF_LISTEN) {
> + pos = cb->args[1];
> + if (sctp_for_each_endpoint(sctp_ep_dump, net, &pos,
> + &commp)) {
> + cb->args[1] = pos;
> + return;
> + }
> + }
> cb->args[0] = 1;
> cb->args[1] = 0;
> - cb->args[4] = 0;
> }
>
> + if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
> + return;
> +
> /* asocs by transport hashtable dump
> * args:
> * 1 : to record the assoc pos of this time's traversal
> * 2 : to record the transport pos of this time's traversal
> * 3 : to mark if we have dumped the ep info of the current asoc
> * 4 : to work as a temporary variable to traversal list
> - * 5 : to save the sk we get from travelsing the tsp list.
> */
> - if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
> - goto done;
> -
> + pos = cb->args[2];
> sctp_transport_traverse_process(sctp_sock_filter, sctp_sock_dump,
> net, &pos, &commp);
> cb->args[2] = pos;
> -
> -done:
> cb->args[1] = cb->args[4];
> cb->args[4] = 0;
> }
[Low]
The commit message states that args[4] tracking is removed, but the
asocs-phase comment block still lists:
* 4 : to work as a temporary variable to traversal list
and the trailing epilogue still does:
cb->args[1] = cb->args[4];
cb->args[4] = 0;
Are those leftovers from the eps-phase rewrite, or are they still needed
by the asocs phase? If args[4] is genuinely gone, should the comment
and the epilogue be cleaned up too?
The previous comment also listed:
* 5 : to save the sk we get from travelsing the tsp list.
which is dropped here without further mention. Is args[5] confirmed
unused on the asocs path?
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 66e12fb0c646..1ed405dedc01 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -5369,24 +5369,44 @@ struct sctp_transport *sctp_transport_get_idx(struct net *net,
> }
>
> int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
> - void *p) {
> - int err = 0;
> - int hash = 0;
> - struct sctp_endpoint *ep;
> + struct net *net, int *pos, void *p) {
> + int err, hash = 0, idx = 0, start;
> struct sctp_hashbucket *head;
> + struct sctp_endpoint *ep;
> + struct sock *sk;
>
> for (head = sctp_ep_hashtable; hash < sctp_ep_hashsize;
> hash++, head++) {
> + start = idx;
> +again:
> + sk = NULL;
> read_lock_bh(&head->lock);
> sctp_for_each_hentry(ep, &head->chain) {
> - err = cb(ep, p);
> - if (err)
> + if (sock_net(ep->base.sk) != net)
> + continue;
> + if (idx++ >= *pos) {
> + sk = ep->base.sk;
> + sock_hold(sk);
> break;
> + }
> }
> read_unlock_bh(&head->lock);
> +
> + if (sk) {
> + err = cb(ep, p);
[Critical]
Can ep be freed before this cb(ep, p) call?
The bucket read_lock_bh() is dropped above, and only sock_hold(sk) is
taken. In sctp_endpoint_init() the reference direction is
ep -> sk:
refcount_set(&ep->base.refcnt, 1);
...
sock_hold(sk);
so taking sock_hold(sk) does not pin ep.
A concurrent close(2) on the same socket runs:
sctp_close()
-> release_sock() /* lock_sock dropped here */
-> sk_common_release()
-> sctp_destroy_sock()
-> sctp_endpoint_free()
-> sctp_unhash_endpoint()
-> sctp_endpoint_put() /* refcount may reach 0 */
-> call_rcu(&ep->rcu, sctp_endpoint_destroy_rcu)
The destroy callback then does sctp_sk(sk)->ep = NULL; sock_put(sk);
kfree(ep).
Between read_unlock_bh() and cb(ep, p), this iterator is no longer in
any RCU read-side section, and sctp_ep_dump() then calls lock_sock(sk)
which may sleep. That allows an RCU grace period to elapse before the
callback dereferences ep at:
struct sock *sk = ep->base.sk;
...
if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) &&
!list_empty(&ep->asocs))
The sibling helpers in the same file pin the endpoint explicitly to
avoid this:
sctp_transport_traverse_process()
if (sctp_endpoint_hold(ep)) { /* asoc can be peeled off */
ret = cb(ep, tsp, p);
...
sctp_endpoint_put(ep);
}
Should sctp_for_each_endpoint() take sctp_endpoint_hold(ep) inside the
bucket read_lock_bh() (while ep is still hashed and live) and a matching
sctp_endpoint_put(ep) after cb returns?
[Medium]
Can this restart pattern miss or duplicate endpoints when the bucket is
modified concurrently?
After processing one ep the loop does:
idx = start;
goto again;
which re-walks the bucket from the head and uses the running counter
idx compared against *pos to skip already-emitted entries.
__sctp_hash_endpoint() inserts at the hlist head and
__sctp_unhash_endpoint() can delete arbitrary nodes, both under the
bucket write_lock that this iterator releases between walks. A
concurrent listen() on a sibling endpoint shifts every existing entry's
index by +1, and a concurrent close() before the cursor shifts indices
by -1.
Under that pattern, can SCTP_DIAG consumers observe duplicated
endpoints (the previously-emitted entry now sits at the new *pos) or
miss live endpoints (the next un-emitted entry is shifted past the
cursor)? The pre-patch sweep walked each bucket once with read_lock_bh
held throughout, so this consistency window did not exist.
[Low]
For a bucket that contains N matching endpoints, does this restart
pattern walk the bucket O(N^2) times?
For a bucket {A,B,C} starting from *pos=0 the goto-again loop performs
1+2+3 traversals to dump three entries plus a final 3-entry walk that
finds nothing and terminates with sk == NULL, which is nine list
iterations to dump three endpoints. The pre-patch code walked each
bucket exactly once because the callback ran while still holding
read_lock_bh().
The commit message describes the change as simplifying endpoint
traversal; could the description also note the new per-bucket scan
cost, or could a cursor that survives unlock/relock be used instead?
> + if (err) {
> + sock_put(sk);
> + return err;
> + }
> + sock_put(sk);
> + (*pos)++;
> +
> + idx = start;
> + goto again;
> + }
> }
>
> - return err;
> + return 0;
> }
> EXPORT_SYMBOL_GPL(sctp_for_each_endpoint);
prev parent reply other threads:[~2026-06-15 11:04 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 17:59 [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag Xin Long
2026-06-13 7:10 ` Willy Tarreau
2026-06-15 11:04 ` Simon Horman [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260615110402.753443-2-horms@kernel.org \
--to=horms@kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=linux-sctp@vger.kernel.org \
--cc=lucien.xin@gmail.com \
--cc=marcelo.leitner@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=w@1wt.eu \
--cc=zdi-disclosures@trendmicro.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox