linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/3] fix two issues and optimize code on tpacket_snd()
@ 2025-07-10 10:26 Yun Lu
  2025-07-10 10:26 ` [PATCH v4 1/3] af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() Yun Lu
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Yun Lu @ 2025-07-10 10:26 UTC (permalink / raw)
  To: willemdebruijn.kernel, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel

From: Yun Lu <luyun@kylinos.cn>

This series fix two issues and optimize the code on tpacket_snd():
1, fix the SO_SNDTIMEO constraint not effective;
2, fix a soft lockup issue on a specific edge case;
3, optimize the packet_read_pending function called on tpacket_snd().

---
Changes in v4:
- Fix a typo and add the missing semicolon. Thanks: Simon Horman. 
- Split the second patch into two, one to fix, another to optimize.
  Thanks: Willem de Bruijn
- Link to v3: https://lore.kernel.org/all/20250709095653.62469-1-luyun_611@163.com/

Changes in v3:
- Split in two different patches.
- Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
- Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/

Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/

Yun Lu (3):
  af_packet: fix the SO_SNDTIMEO constraint not effective on
    tpacked_snd()
  af_packet: fix soft lockup issue caused by tpacket_snd()
  af_packet: optimize the packet_read_pending function called on
    tpacket_snd()

 net/packet/af_packet.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v4 1/3] af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd()
  2025-07-10 10:26 [PATCH v4 0/3] fix two issues and optimize code on tpacket_snd() Yun Lu
@ 2025-07-10 10:26 ` Yun Lu
  2025-07-10 10:26 ` [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd() Yun Lu
  2025-07-10 10:26 ` [PATCH v4 3/3] af_packet: optimize the packet_read_pending function called on tpacket_snd() Yun Lu
  2 siblings, 0 replies; 7+ messages in thread
From: Yun Lu @ 2025-07-10 10:26 UTC (permalink / raw)
  To: willemdebruijn.kernel, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel

From: Yun Lu <luyun@kylinos.cn>

Due to the changes in commit 581073f626e3 ("af_packet: do not call
packet_read_pending() from tpacket_destruct_skb()"), every time
tpacket_destruct_skb() is executed, the skb_completion is marked as
completed. When wait_for_completion_interruptible_timeout() returns
completed, the pending_refcnt has not yet been reduced to zero.
Therefore, when ph is NULL, the wait function may need to be called
multiple times until packet_read_pending() finally returns zero.

We should call sock_sndtimeo() only once, otherwise the SO_SNDTIMEO
constraint could be way off.

Fixes: 581073f626e3 ("af_packet: do not call packet_read_pending() from tpacket_destruct_skb()")
Cc: stable@kernel.org
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yun Lu <luyun@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
---
 net/packet/af_packet.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3d43f3eae759..7089b8c2a655 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,7 +2785,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen, copylen = 0;
-	long timeo = 0;
+	long timeo;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2839,6 +2839,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
 		size_max = dev->mtu + reserve + VLAN_HLEN;
 
+	timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
 	reinit_completion(&po->skb_completion);
 
 	do {
@@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
 			if (need_wait && skb) {
-				timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
 				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
 				if (timeo <= 0) {
 					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd()
  2025-07-10 10:26 [PATCH v4 0/3] fix two issues and optimize code on tpacket_snd() Yun Lu
  2025-07-10 10:26 ` [PATCH v4 1/3] af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() Yun Lu
@ 2025-07-10 10:26 ` Yun Lu
  2025-07-10 13:49   ` Willem de Bruijn
  2025-07-10 10:26 ` [PATCH v4 3/3] af_packet: optimize the packet_read_pending function called on tpacket_snd() Yun Lu
  2 siblings, 1 reply; 7+ messages in thread
From: Yun Lu @ 2025-07-10 10:26 UTC (permalink / raw)
  To: willemdebruijn.kernel, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel

From: Yun Lu <luyun@kylinos.cn>

When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.

If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.

In fact, skb is true for all but the first iterations of that loop, and
as long as pending_refcnt is not zero, even if incremented by a previous
call, wait_for_completion_interruptible_timeout() should be executed to
yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
the execution condition of this function should be modified to check if
pending_refcnt is not zero, instead of check skb.

As a result, packet_read_pending() may be called twice in the loop. This
will be optimized in the following patch.

Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Cc: stable@kernel.org
Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
Signed-off-by: Yun Lu <luyun@kylinos.cn>

---
Changes in v4:
- Split to the fix alone. Thanks: Willem de Bruijn.
- Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/

Changes in v3:
- Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
- Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/

Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
---
---
 net/packet/af_packet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7089b8c2a655..581a96ec8e1a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
-			if (need_wait && skb) {
+			if (need_wait && packet_read_pending(&po->tx_ring)) {
 				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
 				if (timeo <= 0) {
 					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v4 3/3] af_packet: optimize the packet_read_pending function called on tpacket_snd()
  2025-07-10 10:26 [PATCH v4 0/3] fix two issues and optimize code on tpacket_snd() Yun Lu
  2025-07-10 10:26 ` [PATCH v4 1/3] af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() Yun Lu
  2025-07-10 10:26 ` [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd() Yun Lu
@ 2025-07-10 10:26 ` Yun Lu
  2 siblings, 0 replies; 7+ messages in thread
From: Yun Lu @ 2025-07-10 10:26 UTC (permalink / raw)
  To: willemdebruijn.kernel, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel

From: Yun Lu <luyun@kylinos.cn>

Now the packet_read_pending() may be called twice in the do-while loop,
and this function is super expensive on hosts with a large number of cpu,
as it's per_cpu variable. In fact, the second call at the end can be
removed by reusing the ph to continue for the next iteration, and the ph
will be reassigned at the start of the next iteration.

Signed-off-by: Yun Lu <luyun@kylinos.cn>
---
 net/packet/af_packet.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 581a96ec8e1a..ea7219e0c23a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,12 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
 		if (unlikely(ph == NULL)) {
+			/* Note: packet_read_pending() might be slow if we
+			 * have to call it as it's per_cpu variable, but in
+			 * fast-path we don't have to call it, only when ph
+			 * is NULL, we need to check pending_refcnt.
+			 */
 			if (need_wait && packet_read_pending(&po->tx_ring)) {
 				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
 				if (timeo <= 0) {
 					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
 					goto out_put;
 				}
+				/* Just reuse ph to continue for the next iteration, and
+				 * ph will be reassigned at the start of the next iteration.
+				 */
+				ph = (void *)1;
 			}
 			/* check for additional frames */
 			continue;
@@ -2943,14 +2952,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
-	} while (likely((ph != NULL) ||
-		/* Note: packet_read_pending() might be slow if we have
-		 * to call it as it's per_cpu variable, but in fast-path
-		 * we already short-circuit the loop with the first
-		 * condition, and luckily don't have to go that path
-		 * anyway.
-		 */
-		 (need_wait && packet_read_pending(&po->tx_ring))));
+	} while (likely(ph != NULL));
 
 	err = len_sum;
 	goto out_put;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd()
  2025-07-10 10:26 ` [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd() Yun Lu
@ 2025-07-10 13:49   ` Willem de Bruijn
  2025-07-11  7:20     ` luyun
  0 siblings, 1 reply; 7+ messages in thread
From: Willem de Bruijn @ 2025-07-10 13:49 UTC (permalink / raw)
  To: Yun Lu, willemdebruijn.kernel, davem, edumazet, kuba, pabeni,
	horms
  Cc: netdev, linux-kernel

Yun Lu wrote:
> From: Yun Lu <luyun@kylinos.cn>
> 
> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
> pending_refcnt to decrement to zero before returning. The pending_refcnt
> is decremented by 1 when the skb->destructor function is called,
> indicating that the skb has been successfully sent and needs to be
> destroyed.
> 
> If an error occurs during this process, the tpacket_snd() function will
> exit and return error, but pending_refcnt may not yet have decremented to
> zero. Assuming the next send operation is executed immediately, but there
> are no available frames to be sent in tx_ring (i.e., packet_current_frame
> returns NULL), and skb is also NULL, the function will not execute
> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
> if the previous skb has completed transmission, the skb->destructor
> function can only be invoked in the ksoftirqd thread (assuming NAPI
> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
> operation happen to run on the same CPU, and the CPU trapped in the
> do-while loop without yielding, the ksoftirqd thread will not get
> scheduled to run. As a result, pending_refcnt will never be reduced to
> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
> lockup issue.
> 
> In fact, skb is true for all but the first iterations of that loop, and
> as long as pending_refcnt is not zero, even if incremented by a previous
> call, wait_for_completion_interruptible_timeout() should be executed to
> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
> the execution condition of this function should be modified to check if
> pending_refcnt is not zero, instead of check skb.
> 
> As a result, packet_read_pending() may be called twice in the loop. This
> will be optimized in the following patch.
> 
> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
> Cc: stable@kernel.org
> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
> Signed-off-by: Yun Lu <luyun@kylinos.cn>
> 
> ---
> Changes in v4:
> - Split to the fix alone. Thanks: Willem de Bruijn.
> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/
> 
> Changes in v3:
> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
> 
> Changes in v2:
> - Add a Fixes tag.
> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
> ---
> ---
>  net/packet/af_packet.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 7089b8c2a655..581a96ec8e1a 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>  		ph = packet_current_frame(po, &po->tx_ring,
>  					  TP_STATUS_SEND_REQUEST);
>  		if (unlikely(ph == NULL)) {
> -			if (need_wait && skb) {
> +			if (need_wait && packet_read_pending(&po->tx_ring)) {

Unfortunately I did not immediately fully appreciate Eric's
suggestion.

My comments was

    If [..] the extra packet_read_pending() is already present, not
    newly introduced with the fix

But of course that expensive call is newly introduced, so my
suggestion was invalid.

It's btw also not possible to mix net and net-next patches in a single
series like this (see Documentation/process/maintainer-netdev.rst).

But, instead of going back entirely to v2, perhaps we can make the
logic a bit more obvious by just having a while (1) at the end to show
that the only way to exit the loop (except errors) is in the ph == NULL
branch. And break in that loop directly.

There are two other ways to reach that while statement. A continue
on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph
is non-zero, so the condition is true anyway.

>  				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>  				if (timeo <= 0) {
>  					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
> -- 
> 2.43.0
> 



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd()
  2025-07-10 13:49   ` Willem de Bruijn
@ 2025-07-11  7:20     ` luyun
  2025-07-11 19:33       ` Willem de Bruijn
  0 siblings, 1 reply; 7+ messages in thread
From: luyun @ 2025-07-11  7:20 UTC (permalink / raw)
  To: Willem de Bruijn, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel


在 2025/7/10 21:49, Willem de Bruijn 写道:
> Yun Lu wrote:
>> From: Yun Lu <luyun@kylinos.cn>
>>
>> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
>> pending_refcnt to decrement to zero before returning. The pending_refcnt
>> is decremented by 1 when the skb->destructor function is called,
>> indicating that the skb has been successfully sent and needs to be
>> destroyed.
>>
>> If an error occurs during this process, the tpacket_snd() function will
>> exit and return error, but pending_refcnt may not yet have decremented to
>> zero. Assuming the next send operation is executed immediately, but there
>> are no available frames to be sent in tx_ring (i.e., packet_current_frame
>> returns NULL), and skb is also NULL, the function will not execute
>> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
>> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
>> if the previous skb has completed transmission, the skb->destructor
>> function can only be invoked in the ksoftirqd thread (assuming NAPI
>> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
>> operation happen to run on the same CPU, and the CPU trapped in the
>> do-while loop without yielding, the ksoftirqd thread will not get
>> scheduled to run. As a result, pending_refcnt will never be reduced to
>> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
>> lockup issue.
>>
>> In fact, skb is true for all but the first iterations of that loop, and
>> as long as pending_refcnt is not zero, even if incremented by a previous
>> call, wait_for_completion_interruptible_timeout() should be executed to
>> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
>> the execution condition of this function should be modified to check if
>> pending_refcnt is not zero, instead of check skb.
>>
>> As a result, packet_read_pending() may be called twice in the loop. This
>> will be optimized in the following patch.
>>
>> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
>> Cc: stable@kernel.org
>> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
>> Signed-off-by: Yun Lu <luyun@kylinos.cn>
>>
>> ---
>> Changes in v4:
>> - Split to the fix alone. Thanks: Willem de Bruijn.
>> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/
>>
>> Changes in v3:
>> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
>> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
>>
>> Changes in v2:
>> - Add a Fixes tag.
>> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
>> ---
>> ---
>>   net/packet/af_packet.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 7089b8c2a655..581a96ec8e1a 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>>   		ph = packet_current_frame(po, &po->tx_ring,
>>   					  TP_STATUS_SEND_REQUEST);
>>   		if (unlikely(ph == NULL)) {
>> -			if (need_wait && skb) {
>> +			if (need_wait && packet_read_pending(&po->tx_ring)) {
> Unfortunately I did not immediately fully appreciate Eric's
> suggestion.
>
> My comments was
>
>      If [..] the extra packet_read_pending() is already present, not
>      newly introduced with the fix
>
> But of course that expensive call is newly introduced, so my
> suggestion was invalid.
>
> It's btw also not possible to mix net and net-next patches in a single
> series like this (see Documentation/process/maintainer-netdev.rst).

Sorry, I misunderstood your comments. In the next version, I will 
combine the second and third patches together.

>
> But, instead of going back entirely to v2, perhaps we can make the
> logic a bit more obvious by just having a while (1) at the end to show
> that the only way to exit the loop (except errors) is in the ph == NULL
> branch. And break in that loop directly.
>
> There are two other ways to reach that while statement. A continue
> on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph
> is non-zero, so the condition is true anyway.

Following your suggestion, I tried modifying the code (as shown below),  
now the loop condition is still the same as origin, but the logic is now 
clearer and more obvious.

Thanks.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7089b8c2a655..be608f07441f 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2846,15 +2846,21 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
                 ph = packet_current_frame(po, &po->tx_ring,
                                           TP_STATUS_SEND_REQUEST);
                 if (unlikely(ph == NULL)) {
-                       if (need_wait && skb) {
+                       /* Note: packet_read_pending() might be slow if we
+                        * have to call it as it's per_cpu variable, but in
+                        * fast-path we don't have to call it, only when ph
+                        * is NULL, we need to check the pending_refcnt.
+                        */
+                       if (need_wait && 
packet_read_pending(&po->tx_ring)) {
                                 timeo = 
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
                                 if (timeo <= 0) {
                                         err = !timeo ? -ETIMEDOUT : 
-ERESTARTSYS;
                                         goto out_put;
                                 }
-                       }
-                       /* check for additional frames */
-                       continue;
+                               /* check for additional frames */
+                               continue;
+                       } else
+                               break;
                 }

                 skb = NULL;
@@ -2943,14 +2949,7 @@ static int tpacket_snd(struct packet_sock *po, 
struct msghdr *msg)
                 }
                 packet_increment_head(&po->tx_ring);
                 len_sum += tp_len;
-       } while (likely((ph != NULL) ||
-               /* Note: packet_read_pending() might be slow if we have
-                * to call it as it's per_cpu variable, but in fast-path
-                * we already short-circuit the loop with the first
-                * condition, and luckily don't have to go that path
-                * anyway.
-                */
-                (need_wait && packet_read_pending(&po->tx_ring))));
+       } while (1);

         err = len_sum;
         goto out_put;



>
>>   				timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
>>   				if (timeo <= 0) {
>>   					err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
>> -- 
>> 2.43.0
>>


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd()
  2025-07-11  7:20     ` luyun
@ 2025-07-11 19:33       ` Willem de Bruijn
  0 siblings, 0 replies; 7+ messages in thread
From: Willem de Bruijn @ 2025-07-11 19:33 UTC (permalink / raw)
  To: luyun, Willem de Bruijn, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel

luyun wrote:
> 
> 在 2025/7/10 21:49, Willem de Bruijn 写道:
> > Yun Lu wrote:
> >> From: Yun Lu <luyun@kylinos.cn>
> >>
> >> When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
> >> pending_refcnt to decrement to zero before returning. The pending_refcnt
> >> is decremented by 1 when the skb->destructor function is called,
> >> indicating that the skb has been successfully sent and needs to be
> >> destroyed.
> >>
> >> If an error occurs during this process, the tpacket_snd() function will
> >> exit and return error, but pending_refcnt may not yet have decremented to
> >> zero. Assuming the next send operation is executed immediately, but there
> >> are no available frames to be sent in tx_ring (i.e., packet_current_frame
> >> returns NULL), and skb is also NULL, the function will not execute
> >> wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
> >> will enter a do-while loop, waiting for pending_refcnt to be zero. Even
> >> if the previous skb has completed transmission, the skb->destructor
> >> function can only be invoked in the ksoftirqd thread (assuming NAPI
> >> threading is enabled). When both the ksoftirqd thread and the tpacket_snd
> >> operation happen to run on the same CPU, and the CPU trapped in the
> >> do-while loop without yielding, the ksoftirqd thread will not get
> >> scheduled to run. As a result, pending_refcnt will never be reduced to
> >> zero, and the do-while loop cannot exit, eventually leading to a CPU soft
> >> lockup issue.
> >>
> >> In fact, skb is true for all but the first iterations of that loop, and
> >> as long as pending_refcnt is not zero, even if incremented by a previous
> >> call, wait_for_completion_interruptible_timeout() should be executed to
> >> yield the CPU, allowing the ksoftirqd thread to be scheduled. Therefore,
> >> the execution condition of this function should be modified to check if
> >> pending_refcnt is not zero, instead of check skb.
> >>
> >> As a result, packet_read_pending() may be called twice in the loop. This
> >> will be optimized in the following patch.
> >>
> >> Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
> >> Cc: stable@kernel.org
> >> Suggested-by: LongJun Tang <tanglongjun@kylinos.cn>
> >> Signed-off-by: Yun Lu <luyun@kylinos.cn>
> >>
> >> ---
> >> Changes in v4:
> >> - Split to the fix alone. Thanks: Willem de Bruijn.
> >> - Link to v3: https://lore.kernel.org/all/20250709095653.62469-3-luyun_611@163.com/
> >>
> >> Changes in v3:
> >> - Simplify the code and reuse ph to continue. Thanks: Eric Dumazet.
> >> - Link to v2: https://lore.kernel.org/all/20250708020642.27838-1-luyun_611@163.com/
> >>
> >> Changes in v2:
> >> - Add a Fixes tag.
> >> - Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@163.com/
> >> ---
> >> ---
> >>   net/packet/af_packet.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> >> index 7089b8c2a655..581a96ec8e1a 100644
> >> --- a/net/packet/af_packet.c
> >> +++ b/net/packet/af_packet.c
> >> @@ -2846,7 +2846,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> >>   		ph = packet_current_frame(po, &po->tx_ring,
> >>   					  TP_STATUS_SEND_REQUEST);
> >>   		if (unlikely(ph == NULL)) {
> >> -			if (need_wait && skb) {
> >> +			if (need_wait && packet_read_pending(&po->tx_ring)) {
> > Unfortunately I did not immediately fully appreciate Eric's
> > suggestion.
> >
> > My comments was
> >
> >      If [..] the extra packet_read_pending() is already present, not
> >      newly introduced with the fix
> >
> > But of course that expensive call is newly introduced, so my
> > suggestion was invalid.
> >
> > It's btw also not possible to mix net and net-next patches in a single
> > series like this (see Documentation/process/maintainer-netdev.rst).
> 
> Sorry, I misunderstood your comments. In the next version, I will 
> combine the second and third patches together.

My original suggestion was just wrong, sorry. Thanks for revising again.
 
> >
> > But, instead of going back entirely to v2, perhaps we can make the
> > logic a bit more obvious by just having a while (1) at the end to show
> > that the only way to exit the loop (except errors) is in the ph == NULL
> > branch. And break in that loop directly.
> >
> > There are two other ways to reach that while statement. A continue
> > on PACKET_SOCK_TP_LOSS, or by regular control flow. In both cases, ph
> > is non-zero, so the condition is true anyway.
> 
> Following your suggestion, I tried modifying the code (as shown below),  
> now the loop condition is still the same as origin, but the logic is now 
> clearer and more obvious.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-11 19:33 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-10 10:26 [PATCH v4 0/3] fix two issues and optimize code on tpacket_snd() Yun Lu
2025-07-10 10:26 ` [PATCH v4 1/3] af_packet: fix the SO_SNDTIMEO constraint not effective on tpacked_snd() Yun Lu
2025-07-10 10:26 ` [PATCH v4 2/3] af_packet: fix soft lockup issue caused by tpacket_snd() Yun Lu
2025-07-10 13:49   ` Willem de Bruijn
2025-07-11  7:20     ` luyun
2025-07-11 19:33       ` Willem de Bruijn
2025-07-10 10:26 ` [PATCH v4 3/3] af_packet: optimize the packet_read_pending function called on tpacket_snd() Yun Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).