Re: [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Simon Horman <horms@kernel.org>
To: lucien.xin@gmail.com
Cc: Simon Horman <horms@kernel.org>,
	netdev@vger.kernel.org, linux-sctp@vger.kernel.org,
	davem@davemloft.net, kuba@kernel.org, edumazet@google.com,
	pabeni@redhat.com, marcelo.leitner@gmail.com, w@1wt.eu,
	zdi-disclosures@trendmicro.com
Subject: Re: [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag
Date: Mon, 15 Jun 2026 12:04:03 +0100	[thread overview]
Message-ID: <20260615110402.753443-2-horms@kernel.org> (raw)
In-Reply-To: <1bbd999cf88fb1ea93f4a3743047bce9b51adc24.1781287178.git.lucien.xin@gmail.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
sctp: hold socket lock when dumping endpoints in sctp_diag

This change reworks SCTP_DIAG endpoint dumping so the socket lock is held
across inet_sctp_diag_fill() to keep the bind address list stable, and
moves position tracking into sctp_for_each_endpoint() which now drops the
bucket read_lock_bh() before invoking the callback.

A few questions on the new iterator and dump callback below.

> diff --git a/net/sctp/diag.c b/net/sctp/diag.c
> index d758f5c3e06e..9108272ca527 100644
> --- a/net/sctp/diag.c
> +++ b/net/sctp/diag.c
> @@ -373,42 +374,36 @@ static int sctp_ep_dump(struct sctp_endpoint *ep, void *p)
>  	struct sk_buff *skb = commp->skb;
>  	struct netlink_callback *cb = commp->cb;
>  	const struct inet_diag_req_v2 *r = commp->r;
> -	struct net *net = sock_net(skb->sk);
>  	struct inet_sock *inet = inet_sk(sk);
>  	int err = 0;
>  
> -	if (!net_eq(sock_net(sk), net))
> +	lock_sock(sk);
> +	if (sctp_sstate(sk, CLOSED))
>  		goto out;

[Low]
The new unconditional skip on sctp_sstate(sk, CLOSED) does not appear to
be described in the changelog.

SCTP_SS_CLOSED maps to TCP_CLOSE, so a userspace caller that asks for
TCPF_CLOSE in idiag_states will silently receive nothing from the
endpoint hash for entries in that state.

Most hashed endpoints are LISTENING (sctp_hash_endpoint() is called from
sctp_listen_start() right after inet_sk_set_state(sk, SCTP_SS_LISTENING))
so the practical effect is small, but is the intent to filter teardown
windows (in which case ep->base.dead might be a more direct check) or to
honour the user-requested idiag_states bitmask?  Either way, would it be
worth calling this behaviour change out in the commit message?

>  
> -	if (cb->args[4] < cb->args[1])
> -		goto next;
> -
> -	if (!(r->idiag_states & TCPF_LISTEN) && !list_empty(&ep->asocs))
> -		goto next;
> +	if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) &&
> +	    !list_empty(&ep->asocs))
> +		goto out;

[ ... ]

> @@ -479,41 +474,40 @@ static void sctp_diag_dump(struct sk_buff *skb, struct netlink_callback *cb,
>  		.r = r,
>  		.net_admin = netlink_net_capable(cb->skb, CAP_NET_ADMIN),
>  	};
> -	int pos = cb->args[2];
> +	int pos;
>  
>  	/* eps hashtable dumps
>  	 * args:
>  	 * 0 : if it will traversal listen sock
>  	 * 1 : to record the sock pos of this time's traversal
> -	 * 4 : to work as a temporary variable to traversal list
>  	 */
>  	if (cb->args[0] == 0) {
> -		if (!(idiag_states & TCPF_LISTEN))
> -			goto skip;
> -		if (sctp_for_each_endpoint(sctp_ep_dump, &commp))
> -			goto done;
> -skip:
> +		if (idiag_states & TCPF_LISTEN) {
> +			pos = cb->args[1];
> +			if (sctp_for_each_endpoint(sctp_ep_dump, net, &pos,
> +						   &commp)) {
> +				cb->args[1] = pos;
> +				return;
> +			}
> +		}
>  		cb->args[0] = 1;
>  		cb->args[1] = 0;
> -		cb->args[4] = 0;
>  	}
>  
> +	if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
> +		return;
> +
>  	/* asocs by transport hashtable dump
>  	 * args:
>  	 * 1 : to record the assoc pos of this time's traversal
>  	 * 2 : to record the transport pos of this time's traversal
>  	 * 3 : to mark if we have dumped the ep info of the current asoc
>  	 * 4 : to work as a temporary variable to traversal list
> -	 * 5 : to save the sk we get from travelsing the tsp list.
>  	 */
> -	if (!(idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)))
> -		goto done;
> -
> +	pos = cb->args[2];
>  	sctp_transport_traverse_process(sctp_sock_filter, sctp_sock_dump,
>  					net, &pos, &commp);
>  	cb->args[2] = pos;
> -
> -done:
>  	cb->args[1] = cb->args[4];
>  	cb->args[4] = 0;
>  }

[Low]
The commit message states that args[4] tracking is removed, but the
asocs-phase comment block still lists:

	 * 4 : to work as a temporary variable to traversal list

and the trailing epilogue still does:

	cb->args[1] = cb->args[4];
	cb->args[4] = 0;

Are those leftovers from the eps-phase rewrite, or are they still needed
by the asocs phase?  If args[4] is genuinely gone, should the comment
and the epilogue be cleaned up too?

The previous comment also listed:

	 * 5 : to save the sk we get from travelsing the tsp list.

which is dropped here without further mention.  Is args[5] confirmed
unused on the asocs path?

> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 66e12fb0c646..1ed405dedc01 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -5369,24 +5369,44 @@ struct sctp_transport *sctp_transport_get_idx(struct net *net,
>  }
>  
>  int sctp_for_each_endpoint(int (*cb)(struct sctp_endpoint *, void *),
> -			   void *p) {
> -	int err = 0;
> -	int hash = 0;
> -	struct sctp_endpoint *ep;
> +			   struct net *net, int *pos, void *p) {
> +	int err, hash = 0, idx = 0, start;
>  	struct sctp_hashbucket *head;
> +	struct sctp_endpoint *ep;
> +	struct sock *sk;
>  
>  	for (head = sctp_ep_hashtable; hash < sctp_ep_hashsize;
>  	     hash++, head++) {
> +		start = idx;
> +again:
> +		sk = NULL;
>  		read_lock_bh(&head->lock);
>  		sctp_for_each_hentry(ep, &head->chain) {
> -			err = cb(ep, p);
> -			if (err)
> +			if (sock_net(ep->base.sk) != net)
> +				continue;
> +			if (idx++ >= *pos) {
> +				sk = ep->base.sk;
> +				sock_hold(sk);
>  				break;
> +			}
>  		}
>  		read_unlock_bh(&head->lock);
> +
> +		if (sk) {
> +			err = cb(ep, p);

[Critical]
Can ep be freed before this cb(ep, p) call?

The bucket read_lock_bh() is dropped above, and only sock_hold(sk) is
taken.  In sctp_endpoint_init() the reference direction is
ep -> sk:

	refcount_set(&ep->base.refcnt, 1);
	...
	sock_hold(sk);

so taking sock_hold(sk) does not pin ep.

A concurrent close(2) on the same socket runs:

  sctp_close()
    -> release_sock()                  /* lock_sock dropped here */
    -> sk_common_release()
       -> sctp_destroy_sock()
          -> sctp_endpoint_free()
             -> sctp_unhash_endpoint()
             -> sctp_endpoint_put()    /* refcount may reach 0 */
                -> call_rcu(&ep->rcu, sctp_endpoint_destroy_rcu)

The destroy callback then does sctp_sk(sk)->ep = NULL; sock_put(sk);
kfree(ep).

Between read_unlock_bh() and cb(ep, p), this iterator is no longer in
any RCU read-side section, and sctp_ep_dump() then calls lock_sock(sk)
which may sleep.  That allows an RCU grace period to elapse before the
callback dereferences ep at:

	struct sock *sk = ep->base.sk;
	...
	if ((r->idiag_states & ~(TCPF_LISTEN | TCPF_CLOSE)) &&
	    !list_empty(&ep->asocs))

The sibling helpers in the same file pin the endpoint explicitly to
avoid this:

  sctp_transport_traverse_process()
	if (sctp_endpoint_hold(ep)) { /* asoc can be peeled off */
		ret = cb(ep, tsp, p);
		...
		sctp_endpoint_put(ep);
	}

Should sctp_for_each_endpoint() take sctp_endpoint_hold(ep) inside the
bucket read_lock_bh() (while ep is still hashed and live) and a matching
sctp_endpoint_put(ep) after cb returns?

[Medium]
Can this restart pattern miss or duplicate endpoints when the bucket is
modified concurrently?

After processing one ep the loop does:

			idx = start;
			goto again;

which re-walks the bucket from the head and uses the running counter
idx compared against *pos to skip already-emitted entries.

__sctp_hash_endpoint() inserts at the hlist head and
__sctp_unhash_endpoint() can delete arbitrary nodes, both under the
bucket write_lock that this iterator releases between walks.  A
concurrent listen() on a sibling endpoint shifts every existing entry's
index by +1, and a concurrent close() before the cursor shifts indices
by -1.

Under that pattern, can SCTP_DIAG consumers observe duplicated
endpoints (the previously-emitted entry now sits at the new *pos) or
miss live endpoints (the next un-emitted entry is shifted past the
cursor)?  The pre-patch sweep walked each bucket once with read_lock_bh
held throughout, so this consistency window did not exist.

[Low]
For a bucket that contains N matching endpoints, does this restart
pattern walk the bucket O(N^2) times?

For a bucket {A,B,C} starting from *pos=0 the goto-again loop performs
1+2+3 traversals to dump three entries plus a final 3-entry walk that
finds nothing and terminates with sk == NULL, which is nine list
iterations to dump three endpoints.  The pre-patch code walked each
bucket exactly once because the callback ran while still holding
read_lock_bh().

The commit message describes the change as simplifying endpoint
traversal; could the description also note the new per-bucket scan
cost, or could a cursor that survives unlock/relock be used instead?

> +			if (err) {
> +				sock_put(sk);
> +				return err;
> +			}
> +			sock_put(sk);
> +			(*pos)++;
> +
> +			idx = start;
> +			goto again;
> +		}
>  	}
>  
> -	return err;
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(sctp_for_each_endpoint);

next prev parent reply	other threads:[~2026-06-15 11:04 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12 17:59 [PATCH net] sctp: hold socket lock when dumping endpoints in sctp_diag Xin Long
2026-06-13  7:10 ` Willy Tarreau
2026-06-15 19:13   ` Xin Long
2026-06-15 11:04 ` Simon Horman [this message]
2026-06-15 18:24   ` Xin Long

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260615110402.753443-2-horms@kernel.org \
    --to=horms@kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=linux-sctp@vger.kernel.org \
    --cc=lucien.xin@gmail.com \
    --cc=marcelo.leitner@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=w@1wt.eu \
    --cc=zdi-disclosures@trendmicro.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.