netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Justin Iurman <justin.iurman@uliege.be>
To: Andrea Mayer <andrea.mayer@uniroma2.it>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Sebastian Sewior <bigeasy@linutronix.de>,
	Stanislav Fomichev <stfomichev@gmail.com>,
	Network Development <netdev@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Kuniyuki Iwashima <kuniyu@amazon.com>, bpf <bpf@vger.kernel.org>,
	Stefano Salsano <stefano.salsano@uniroma2.it>,
	Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Subject: Re: [PATCH net] net: lwtunnel: disable preemption when required
Date: Tue, 15 Apr 2025 11:10:01 +0200	[thread overview]
Message-ID: <3cee5141-c525-4e83-830e-bf21828aed51@uliege.be> (raw)
In-Reply-To: <20250415025416.0273812f0322a6b1728d9c7b@uniroma2.it>

On 4/15/25 02:54, Andrea Mayer wrote:> I have been looking into the 
behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
> 
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
> 
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
> 
>   ============================================
>    <IRQ>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    NF_HOOK.constprop.0+0x7e/0x2f0
>    ip_rcv+0x4bf/0x4d0
>    __netif_receive_skb_one_core+0x11c/0x130
>    process_backlog+0x277/0x980
>    __napi_poll.constprop.0+0x58/0x260
>    net_rx_action+0x396/0x6e0
>    handle_softirqs+0x116/0x640
>    do_softirq+0xa9/0xe0
>    </IRQ>
>   ============================================
> 
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
> 
>   ============================================
>    <TASK>
>    ...
>    lwtunnel_xmit+0x18/0x3f0
>    ip_finish_output2+0x45a/0xcc0
>    ip_output+0xe2/0x380
>    ip_push_pending_frames+0x17a/0x200
>    raw_sendmsg+0x9fa/0x1060
>    __sys_sendto+0x294/0x2e0
>    __x64_sys_sendto+0x6d/0x80
>    do_syscall_64+0x64/0x140
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>    </TASK>
>   ============================================
> 
> So also for the lwtunnel_xmit(), we need to make sure that the functions
> dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
> are protected, i.e. inside a local_bh_{disable/enable} block.

That's correct, and I ended up with the same conclusion as yours on the 
possible paths for lwtunnel_xmit() depending on the context (task vs 
irq). Based on your description, we're using a similar approach with 
eBPF :-) Note that paths are similar for lwtunnel_output() (see below).

> In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
> in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
> function. On my x86 architecture, preempt_count() accesses the per-CPU
> __preempt_count variable.
> 
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BH are not disabled. Therefore, following the
> logic above, we disable bottom halves (BH) on that particular CPU.
> 
> However, there is my opinion an issue that can occur: between the check on
> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> another CPU. As a result, the check on in_softirq() becomes ineffective because
> we may end up disabling BH on a CPU that is not the one we just checked (with
> if (in_softirq()) { ... }).

Hmm, I think it's correct... good catch. I went for this solution to (i) 
avoid useless nested BHs disable calls; and (ii) avoid ending up with a 
spaghetti graph of possible paths with or without BHs disabled (i.e., 
with single entry points, namely lwtunnel_xmit() and lwtunnel_output()), 
which otherwise makes it hard to maintain the code IMO.

So, if we want to follow what Alexei suggests (see his last response), 
we'd need to disable BHs in both ip_local_out() and ip6_local_out(). 
These are the common functions which are closest in depth, and so for 
both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of 
disabling BHs even when it may not be required. Indeed, ip_local_out() 
and ip6_local_out() both call dst_output(), which one is usually not 
lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call 
either).

The other solution is to always call local_bh_disable() in both 
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when 
they were already. Which was basically -v1 and received a NACK from Alexei.

At the moment, I'm not sure what's best.

  reply	other threads:[~2025-04-15  9:10 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-03  8:39 [PATCH net] net: lwtunnel: disable preemption when required Justin Iurman
2025-04-03 16:24 ` Stanislav Fomichev
2025-04-03 19:08   ` Justin Iurman
2025-04-03 20:35     ` Alexei Starovoitov
2025-04-04 14:19       ` Sebastian Sewior
2025-04-06  8:59         ` Justin Iurman
2025-04-07 17:54           ` Alexei Starovoitov
2025-04-11 18:34             ` Justin Iurman
2025-04-14 23:13               ` Alexei Starovoitov
2025-04-15  9:25                 ` Justin Iurman
2025-04-15 11:56                 ` Justin Iurman
2025-04-15  0:54               ` Andrea Mayer
2025-04-15  9:10                 ` Justin Iurman [this message]
2025-04-15 14:38                   ` Jakub Kicinski
2025-04-15 15:12                     ` Justin Iurman
2025-04-15 15:12                     ` Alexei Starovoitov
2025-04-15 17:12                       ` Justin Iurman
2025-04-03 23:38     ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3cee5141-c525-4e83-830e-bf21828aed51@uliege.be \
    --to=justin.iurman@uliege.be \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrea.mayer@uniroma2.it \
    --cc=bigeasy@linutronix.de \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=kuniyu@amazon.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=paolo.lungaroni@uniroma2.it \
    --cc=stefano.salsano@uniroma2.it \
    --cc=stfomichev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).