Re: [PATCH net-next 2/2] net/smc: reduce TX slot contention with exclusive wait

From: Paolo Abeni <pabeni@redhat.com>
To: alibuda@linux.alibaba.com
Cc: davem@davemloft.net, dust.li@linux.alibaba.com,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	sidraya@linux.ibm.com, wenjia@linux.ibm.com,
	mjambigi@linux.ibm.com, horms@kernel.org,
	tonylu@linux.alibaba.com, guwen@linux.alibaba.com,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	linux-s390@vger.kernel.org, netdev@vger.kernel.org,
	oliver.yang@linux.alibaba.com, pasic@linux.ibm.com
Subject: Re: [PATCH net-next 2/2] net/smc: reduce TX slot contention with exclusive wait
Date: Tue, 12 May 2026 10:26:43 +0200
Message-ID: <20260512082643.89020-1-pabeni@redhat.com>
In-Reply-To: <20260508063718.101622-3-alibuda@linux.alibaba.com>

From: AI Reviewer <ai@example.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
net/smc: reduce TX slot contention with exclusive wait

This patch replaces the non-exclusive wait_event_interruptible_timeout()
in smc_wr_tx_get_free_slot() with an open-coded exclusive wait loop via
prepare_to_wait_exclusive(), so wake_up() on a freed TX slot wakes only
a single waiter instead of the entire wait queue.

> diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
> index 48037a3d97a3..0a6f2befb0e2 100644
> --- a/net/smc/smc_wr.c
> +++ b/net/smc/smc_wr.c
> @@ -171,17 +173,27 @@ int smc_wr_tx_get_free_slot(struct smc_link *link,
>  		if (rc)
>  			return rc;
>  	} else {
> -		rc = wait_event_interruptible_timeout(
> -			link->wr_tx_wait,
> -			!smc_link_sendable(link) ||
> -			lgr->terminating ||
> -			(smc_wr_tx_get_free_slot_index(link, &idx) != -EBUSY),
> -			SMC_WR_TX_WAIT_FREE_SLOT_TIME);
> -		if (!rc) {
> -			/* timeout - terminate link */
> -			smcr_link_down_cond_sched(link);
> -			return -EPIPE;
> +		rc = 0;
> +		for (;;) {
> +			prepare_to_wait_exclusive(&link->wr_tx_wait, &wait,
> +						  TASK_INTERRUPTIBLE);
> +			if (!smc_link_sendable(link) || lgr->terminating ||
> +			    smc_wr_tx_get_free_slot_index(link, &idx) != -EBUSY)
> +				break;
> +			timeout = schedule_timeout(timeout);
> +			if (!timeout) {
> +				/* timeout - terminate link */
> +				smcr_link_down_cond_sched(link);
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				rc = -ERESTARTSYS;
> +				break;
> +			}

[High]
With exclusive waiters, wake_up() on link->wr_tx_wait wakes only one
task per freed slot. If that waiter is about to exit via the
signal_pending() branch, can the freed slot be lost to the remaining
sleepers?

Consider two exclusive waiters A and B asleep in schedule_timeout():

  1. smc_wr_tx_put_slot() -> wake_up(&link->wr_tx_wait) is effectively
     __wake_up(..., nr_exclusive=1).  default_wake_function makes A
     runnable, autoremove_wake_function pulls A off the queue, and
     __wake_up_common stops -- B is not woken.

  2. A signal is delivered to A before it actually runs.  Because A is
     already TASK_RUNNING, signal_wake_up only sets TIF_SIGPENDING.

  3. A returns from schedule_timeout() with timeout > 0, so the
     !timeout branch is skipped.  signal_pending(current) is true, so
     the loop sets rc = -ERESTARTSYS and breaks without re-testing
     smc_wr_tx_get_free_slot_index() and without issuing any wake_up()
     on the way out.

  4. B stays asleep on the wait queue even though the wr_tx_mask bit
     for the freed slot is still available.  With no further
     completions, B sleeps until the full SMC_WR_TX_WAIT_FREE_SLOT_TIME
     expires, at which point smcr_link_down_cond_sched() tears down an
     otherwise healthy link.

The previous wait_event_interruptible_timeout() was non-exclusive, so
every waiter was woken on each wake_up() and re-evaluated the
condition, which avoided this class of lost-wakeup.

Should the signal path re-check the slot condition before returning,
or issue a wake_up() on exit so another exclusive waiter gets a chance
at the slot?
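
The A/B scenario above can be replayed in a small deterministic
userspace model. This is only a sketch: model_queue, model_wake_one()
and model_slot_stranded() are hypothetical names, the queue is a plain
FIFO array rather than a real wait queue, and "handoff" stands for the
second option in the question (a wake_up() on the signal exit path).

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WAITERS 8

/* Hypothetical userspace model of an exclusive wait queue; not kernel code. */
struct model_queue {
	int waiters[MAX_WAITERS];	/* queued task ids, FIFO order */
	int nr_waiters;
	int free_slots;			/* stands in for free bits in wr_tx_mask */
};

/* Model of wake_up() with nr_exclusive == 1: dequeue and return one
 * waiter id, or -1 if nobody is queued.
 */
int model_wake_one(struct model_queue *q)
{
	int id, i;

	if (q->nr_waiters == 0)
		return -1;
	id = q->waiters[0];
	for (i = 1; i < q->nr_waiters; i++)
		q->waiters[i - 1] = q->waiters[i];
	q->nr_waiters--;
	return id;
}

/* Replay steps 1-4 with waiters A (id 0) and B (id 1). Returns true if
 * a slot is left free while a waiter is still asleep.
 */
bool model_slot_stranded(bool handoff_on_signal)
{
	struct model_queue q = {
		.waiters = { 0, 1 }, .nr_waiters = 2, .free_slots = 0,
	};
	int woken;

	q.free_slots++;			/* step 1: a slot is put back */
	woken = model_wake_one(&q);	/* only A is woken and dequeued */
	assert(woken == 0);

	/* steps 2-3: A has a signal pending and leaves without claiming */
	if (handoff_on_signal) {
		woken = model_wake_one(&q);	/* pass the wake-up on to B */
		if (woken >= 0)
			q.free_slots--;		/* B claims the slot */
	}

	/* step 4: stranded if a slot is free but someone still sleeps */
	return q.free_slots > 0 && q.nr_waiters > 0;
}
```

In this model, model_slot_stranded(false) reports the stranded slot and
model_slot_stranded(true) does not. It says nothing about which of the
two suggested fixes is preferable in the real code.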

[Medium]
The commit message says this is a contention optimization and that
"teardown behavior is unchanged", but does it also change the errno
returned to callers on signal delivery?

Before this patch, wait_event_interruptible_timeout() returned
-ERESTARTSYS on signal; that value is non-zero, so the old if (!rc)
branch was skipped, control fell through to the trailing
if (idx == link->wr_tx_cnt) return -EPIPE, and callers saw -EPIPE.

After the patch, the signal path does:

    if (signal_pending(current)) {
        rc = -ERESTARTSYS;
        break;
    }
    ...
    finish_wait(&link->wr_tx_wait, &wait);
    if (rc)
        return rc;

so -ERESTARTSYS propagates up through smc_cdc_get_free_slot() (which
only remaps rc to -EPIPE when conn->killed is set),
smcr_cdc_get_slot_and_msg_send(), smc_cdc_get_slot_and_msg_send() and
smc_tx_sendmsg(). Userspace would then see a restarted syscall or
-EINTR instead of -EPIPE.

Note that the sibling smc_wr_reg_send() in the same file explicitly
converts -ERESTARTSYS to -EINTR.  Should this return-code change be
called out in the commit message (and possibly split out), given that
the stated intent of the patch is purely to reduce contention?
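
The before/after difference can be written down as two tiny userspace
functions. This is a sketch under assumptions: model_old_rc() and
model_new_rc() are hypothetical names that mirror only the tail of
smc_wr_tx_get_free_slot() as quoted in this review, idx == wr_tx_cnt is
taken to mean "no slot claimed", and a return of 0 stands for "continue
with the rest of the function".

```c
#include <assert.h>
#include <errno.h>

/* ERESTARTSYS is kernel-internal and absent from userspace <errno.h>;
 * 512 is the value defined in include/linux/errno.h.
 */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Model of the old tail. wait_rc is what
 * wait_event_interruptible_timeout() returned.
 */
int model_old_rc(long wait_rc, unsigned int idx, unsigned int wr_tx_cnt)
{
	assert(idx <= wr_tx_cnt);
	if (!wait_rc)
		return -EPIPE;		/* timeout - terminate link */
	if (idx == wr_tx_cnt)
		return -EPIPE;		/* woken without a slot, e.g. by a signal */
	return 0;
}

/* Model of the new tail. rc is non-zero only when the loop left via
 * the signal_pending() branch.
 */
int model_new_rc(int rc, unsigned int idx, unsigned int wr_tx_cnt)
{
	assert(idx <= wr_tx_cnt);
	if (rc)
		return rc;		/* -ERESTARTSYS now reaches the caller */
	if (idx == wr_tx_cnt)
		return -EPIPE;
	return 0;
}
```

With a signal and no slot claimed, the old model yields -EPIPE and the
new one yields -ERESTARTSYS. The timeout and success cases are the same
in both.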

[Low]
On timeout expiry the new loop takes the teardown path without
re-testing whether a slot became available:

> +			timeout = schedule_timeout(timeout);
> +			if (!timeout) {
> +				/* timeout - terminate link */
> +				smcr_link_down_cond_sched(link);
> +				break;
> +			}

The old wait_event_interruptible_timeout() expansion re-evaluated the
condition via ___wait_cond_timeout after schedule_timeout() returned
0, so a wake_up() racing with timer expiry could still report success
and let smc_wr_tx_get_free_slot_index() claim the slot.

In the narrow race where a slot is freed at the same instant the
timer fires, does this now tear the link down even though a slot was
actually available?
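
The difference in the expiry case can be modelled in userspace as
follows. This is a simplified sketch: model_old_wait_result() and
model_new_tears_down() are hypothetical names, and the first one only
approximates the post-sleep check described above (a true condition at
expiry turns a remaining time of 0 into 1).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the old post-sleep check: if the condition holds
 * when the timer has already expired, report 1 jiffy remaining so the
 * caller's if (!rc) timeout branch is skipped.
 */
long model_old_wait_result(bool cond_true, long remaining)
{
	assert(remaining >= 0);
	if (cond_true && !remaining)
		remaining = 1;
	return remaining;
}

/* Model of the new loop body right after schedule_timeout(). Returns
 * true when the teardown branch is taken.
 */
bool model_new_tears_down(bool cond_true, long remaining)
{
	(void)cond_true;	/* the quoted loop does not re-test it here */
	return remaining == 0;
}
```

For a slot freed exactly at expiry (cond_true with remaining == 0), the
old model returns a non-zero result while the new model takes the
teardown branch. How often the real race window is hit is not something
this model can show.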

>  		}
> +		finish_wait(&link->wr_tx_wait, &wait);
> +		if (rc)
> +			return rc;
>  		if (idx == link->wr_tx_cnt)
>  			return -EPIPE;
>  	}
-- 
This is an AI-generated review.


Thread overview: 5+ messages
2026-05-08  6:37 [PATCH net-next 0/2] net/smc: transition to RDMA core CQ pooling D. Wythe
2026-05-08  6:37 ` [PATCH net-next 1/2] " D. Wythe
2026-05-12  8:31   ` Paolo Abeni
2026-05-08  6:37 ` [PATCH net-next 2/2] net/smc: reduce TX slot contention with exclusive wait D. Wythe
2026-05-12  8:26   ` Paolo Abeni [this message]
