linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run
@ 2025-08-19  9:37 Yafang Shao
  2025-08-19 12:42 ` Daniel Borkmann
  0 siblings, 1 reply; 4+ messages in thread
From: Yafang Shao @ 2025-08-19  9:37 UTC (permalink / raw)
  To: davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel, Yafang Shao, Daniel Borkmann, Thomas Graf

During recent testing with the netem qdisc to inject delays into TCP
traffic, we observed that our CLS BPF program failed to function correctly
due to incorrect classid retrieval from task_get_classid(). The issue
manifests in the following call stack:

        bpf_get_cgroup_classid+5
        cls_bpf_classify+507
        __tcf_classify+90
        tcf_classify+217
        __dev_queue_xmit+798
        bond_dev_queue_xmit+43
        __bond_start_xmit+211
        bond_start_xmit+70
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __qdisc_run+102             <<<<< Issue location
        __dev_xmit_skb+1015
        __dev_queue_xmit+637
        neigh_hh_output+159
        ip_finish_output2+461
        __ip_finish_output+183
        ip_finish_output+41
        ip_output+120
        ip_local_out+94
        __ip_queue_xmit+394
        ip_queue_xmit+21
        __tcp_transmit_skb+2169
        tcp_write_xmit+959
        __tcp_push_pending_frames+55
        tcp_push+264
        tcp_sendmsg_locked+661
        tcp_sendmsg+45
        inet_sendmsg+67
        sock_sendmsg+98
        sock_write_iter+147
        vfs_write+786
        ksys_write+181
        __x64_sys_write+25
        do_syscall_64+56
        entry_SYSCALL_64_after_hwframe+100

The problem occurs when multiple tasks share a single qdisc. In such cases,
__qdisc_run() may transmit skbs created by different tasks. Consequently,
task_get_classid() retrieves an incorrect classid since it references the
current task's context rather than the skb's originating task.

Given that dev_queue_xmit() always executes with bh disabled, we can safely
use in_softirq() instead of in_serving_softirq() to properly identify the
softirq context and obtain the correct classid.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Thomas Graf <tgraf@suug.ch>
---
 include/net/cls_cgroup.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
index 7e78e7d6f015..fc9e0617a73c 100644
--- a/include/net/cls_cgroup.h
+++ b/include/net/cls_cgroup.h
@@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
 	 * calls by looking at the number of nested bh disable calls because
 	 * softirqs always disables bh.
 	 */
-	if (in_serving_softirq()) {
+	if (in_softirq()) {
 		struct sock *sk = skb_to_full_sk(skb);
 
 		/* If there is an sock_cgroup_classid we'll use that. */
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run
  2025-08-19  9:37 [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run Yafang Shao
@ 2025-08-19 12:42 ` Daniel Borkmann
  2025-08-19 13:26   ` Sebastian Andrzej Siewior
  2025-08-19 13:51   ` Yafang Shao
  0 siblings, 2 replies; 4+ messages in thread
From: Daniel Borkmann @ 2025-08-19 12:42 UTC (permalink / raw)
  To: Yafang Shao, davem, edumazet, kuba, pabeni, horms
  Cc: netdev, linux-kernel, Thomas Graf, Paul McKenney,
	Sebastian Andrzej Siewior, open list:BPF [RINGBUF],
	Willem de Bruijn

On 8/19/25 11:37 AM, Yafang Shao wrote:
> During recent testing with the netem qdisc to inject delays into TCP
> traffic, we observed that our CLS BPF program failed to function correctly
> due to incorrect classid retrieval from task_get_classid(). The issue
> manifests in the following call stack:
> 
>          bpf_get_cgroup_classid+5
>          cls_bpf_classify+507
>          __tcf_classify+90
>          tcf_classify+217
>          __dev_queue_xmit+798
>          bond_dev_queue_xmit+43
>          __bond_start_xmit+211
>          bond_start_xmit+70
>          dev_hard_start_xmit+142
>          sch_direct_xmit+161
>          __qdisc_run+102             <<<<< Issue location
>          __dev_xmit_skb+1015
>          __dev_queue_xmit+637
>          neigh_hh_output+159
>          ip_finish_output2+461
>          __ip_finish_output+183
>          ip_finish_output+41
>          ip_output+120
>          ip_local_out+94
>          __ip_queue_xmit+394
>          ip_queue_xmit+21
>          __tcp_transmit_skb+2169
>          tcp_write_xmit+959
>          __tcp_push_pending_frames+55
>          tcp_push+264
>          tcp_sendmsg_locked+661
>          tcp_sendmsg+45
>          inet_sendmsg+67
>          sock_sendmsg+98
>          sock_write_iter+147
>          vfs_write+786
>          ksys_write+181
>          __x64_sys_write+25
>          do_syscall_64+56
>          entry_SYSCALL_64_after_hwframe+100
> 
> The problem occurs when multiple tasks share a single qdisc. In such cases,
> __qdisc_run() may transmit skbs created by different tasks. Consequently,
> task_get_classid() retrieves an incorrect classid since it references the
> current task's context rather than the skb's originating task.
> 
> Given that dev_queue_xmit() always executes with bh disabled, we can safely
> use in_softirq() instead of in_serving_softirq() to properly identify the
> softirq context and obtain the correct classid.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Thomas Graf <tgraf@suug.ch>
> ---
>   include/net/cls_cgroup.h | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
> index 7e78e7d6f015..fc9e0617a73c 100644
> --- a/include/net/cls_cgroup.h
> +++ b/include/net/cls_cgroup.h
> @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
>   	 * calls by looking at the number of nested bh disable calls because
>   	 * softirqs always disables bh.
>   	 */
> -	if (in_serving_softirq()) {
> +	if (in_softirq()) {
>   		struct sock *sk = skb_to_full_sk(skb);
>   
>   		/* If there is an sock_cgroup_classid we'll use that. */

Hm, essentially you only want to use the fallback method of retrieving cgroup from
the socket when the dev_queue_xmit() was triggered via ksoftirqd, rather than from
a process via syscall all the way into dev_queue_xmit(). It gets more fuzzy presumably
when skbs are queued somewhere and then some other kthread calls the dev_queue_xmit().

Looking at in_softirq(), the comment says "the following macros are deprecated and
should not be used in new code", see commit 15115830c887 ("preempt: Cleanup the
macro maze a bit"). Maybe Sebastian or Paul has input on whether in_softirq() is
still supposed to be used.

Side question, did you investigate where the skb->sk association got orphaned
somewhere along the way?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run
  2025-08-19 12:42 ` Daniel Borkmann
@ 2025-08-19 13:26   ` Sebastian Andrzej Siewior
  2025-08-19 13:51   ` Yafang Shao
  1 sibling, 0 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-19 13:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Yafang Shao, davem, edumazet, kuba, pabeni, horms, netdev,
	linux-kernel, Thomas Graf, Paul McKenney, open list:BPF [RINGBUF],
	Willem de Bruijn

On 2025-08-19 14:42:07 [+0200], Daniel Borkmann wrote:
> > --- a/include/net/cls_cgroup.h
> > +++ b/include/net/cls_cgroup.h
> > @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
> >   	 * calls by looking at the number of nested bh disable calls because
> >   	 * softirqs always disables bh.
> >   	 */
> > -	if (in_serving_softirq()) {
> > +	if (in_softirq()) {
> >   		struct sock *sk = skb_to_full_sk(skb);
> >   		/* If there is an sock_cgroup_classid we'll use that. */
> Looking at in_softirq(), the comment says "the following macros are deprecated and
> should not be used in new code", see commit 15115830c887 ("preempt: Cleanup the
> macro maze a bit"). Maybe Sebastian or Paul has input on whether in_softirq() is
> still supposed to be used.

it listed in the deprecated section so I would tend to suggest
softirq_count(). But then in_softirq() kind of describes what is done
while the other two (in_irq, in_interrupt) were ambiguous and people
often mixed them up.
Let me poke someone and then we maybe commit to clean up so leave the
limbo state.

> Thanks,
> Daniel

Sebastian

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run
  2025-08-19 12:42 ` Daniel Borkmann
  2025-08-19 13:26   ` Sebastian Andrzej Siewior
@ 2025-08-19 13:51   ` Yafang Shao
  1 sibling, 0 replies; 4+ messages in thread
From: Yafang Shao @ 2025-08-19 13:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, edumazet, kuba, pabeni, horms, netdev, linux-kernel,
	Thomas Graf, Paul McKenney, Sebastian Andrzej Siewior,
	open list:BPF [RINGBUF], Willem de Bruijn

On Tue, Aug 19, 2025 at 8:42 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/19/25 11:37 AM, Yafang Shao wrote:
> > During recent testing with the netem qdisc to inject delays into TCP
> > traffic, we observed that our CLS BPF program failed to function correctly
> > due to incorrect classid retrieval from task_get_classid(). The issue
> > manifests in the following call stack:
> >
> >          bpf_get_cgroup_classid+5
> >          cls_bpf_classify+507
> >          __tcf_classify+90
> >          tcf_classify+217
> >          __dev_queue_xmit+798
> >          bond_dev_queue_xmit+43
> >          __bond_start_xmit+211
> >          bond_start_xmit+70
> >          dev_hard_start_xmit+142
> >          sch_direct_xmit+161
> >          __qdisc_run+102             <<<<< Issue location
> >          __dev_xmit_skb+1015
> >          __dev_queue_xmit+637
> >          neigh_hh_output+159
> >          ip_finish_output2+461
> >          __ip_finish_output+183
> >          ip_finish_output+41
> >          ip_output+120
> >          ip_local_out+94
> >          __ip_queue_xmit+394
> >          ip_queue_xmit+21
> >          __tcp_transmit_skb+2169
> >          tcp_write_xmit+959
> >          __tcp_push_pending_frames+55
> >          tcp_push+264
> >          tcp_sendmsg_locked+661
> >          tcp_sendmsg+45
> >          inet_sendmsg+67
> >          sock_sendmsg+98
> >          sock_write_iter+147
> >          vfs_write+786
> >          ksys_write+181
> >          __x64_sys_write+25
> >          do_syscall_64+56
> >          entry_SYSCALL_64_after_hwframe+100
> >
> > The problem occurs when multiple tasks share a single qdisc. In such cases,
> > __qdisc_run() may transmit skbs created by different tasks. Consequently,
> > task_get_classid() retrieves an incorrect classid since it references the
> > current task's context rather than the skb's originating task.
> >
> > Given that dev_queue_xmit() always executes with bh disabled, we can safely
> > use in_softirq() instead of in_serving_softirq() to properly identify the
> > softirq context and obtain the correct classid.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Thomas Graf <tgraf@suug.ch>
> > ---
> >   include/net/cls_cgroup.h | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/net/cls_cgroup.h b/include/net/cls_cgroup.h
> > index 7e78e7d6f015..fc9e0617a73c 100644
> > --- a/include/net/cls_cgroup.h
> > +++ b/include/net/cls_cgroup.h
> > @@ -63,7 +63,7 @@ static inline u32 task_get_classid(const struct sk_buff *skb)
> >        * calls by looking at the number of nested bh disable calls because
> >        * softirqs always disables bh.
> >        */
> > -     if (in_serving_softirq()) {
> > +     if (in_softirq()) {
> >               struct sock *sk = skb_to_full_sk(skb);
> >
> >               /* If there is an sock_cgroup_classid we'll use that. */
>
> Hm, essentially you only want to use the fallback method of retrieving cgroup from
> the socket when the dev_queue_xmit() was triggered via ksoftirqd, rather than from
> a process via syscall all the way into dev_queue_xmit(). It gets more fuzzy presumably
> when skbs are queued somewhere and then some other kthread calls the dev_queue_xmit().

Not only can other kthreads call dev_queue_xmit(), but user tasks can
also invoke it directly.
In the execution path:

  __dev_xmit_skb
  → __qdisc_run
    → qdisc_restart
      → skb = dequeue_skb(q, &validate, packets);

The SKB dequeued at this point might have been originally enqueued by
completely different tasks. This creates a scenario where the current
execution context doesn't match the original context that created and
enqueued the SKB.

>
> Looking at in_softirq(), the comment says "the following macros are deprecated and
> should not be used in new code", see commit 15115830c887 ("preempt: Cleanup the
> macro maze a bit"). Maybe Sebastian or Paul has input on whether in_softirq() is
> still supposed to be used.
>
> Side question, did you investigate where the skb->sk association got orphaned
> somewhere along the way?

The skb->sk is not orphaned in our observed scenario.

The setup to reproduce this issue are as follows,
1. Add network delay to the network interface:
   tc qdisc add dev bond0 root netem delay 1.5ms
2. Create two distinct net_cls cgroups, each running a network-intensive task
3. Initiate parallel TCP streams from both tasks to external servers.
Under this specific condition, the issue reliably occurs.
   The kernel eventually dequeues an SKB that originated from Task-A
while executing in the context of Task-B.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-08-19 13:51 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-19  9:37 [PATCH] net/cls_cgroup: Fix task_get_classid() during qdisc run Yafang Shao
2025-08-19 12:42 ` Daniel Borkmann
2025-08-19 13:26   ` Sebastian Andrzej Siewior
2025-08-19 13:51   ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).