From: Justin Iurman <justin.iurman@uliege.be>
To: Andrea Mayer <andrea.mayer@uniroma2.it>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Sebastian Sewior <bigeasy@linutronix.de>,
Stanislav Fomichev <stfomichev@gmail.com>,
Network Development <netdev@vger.kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
Kuniyuki Iwashima <kuniyu@amazon.com>, bpf <bpf@vger.kernel.org>,
Stefano Salsano <stefano.salsano@uniroma2.it>,
Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Subject: Re: [PATCH net] net: lwtunnel: disable preemption when required
Date: Tue, 15 Apr 2025 11:10:01 +0200 [thread overview]
Message-ID: <3cee5141-c525-4e83-830e-bf21828aed51@uliege.be> (raw)
In-Reply-To: <20250415025416.0273812f0322a6b1728d9c7b@uniroma2.it>
On 4/15/25 02:54, Andrea Mayer wrote:> I have been looking into the
behavior of the lwtunnel_xmit() function in both
> task and softirq contexts. To facilitate this investigation, I have written a
> simple eBPF program that only prints messages to the trace pipe. This program
> is attached to the LWT BPF XMIT hook by configuring a route (on my test node)
> with a destination address (DA) pointing to an external node, referred to as
> x.x.x.x, within my testbed network.
>
> To trigger that LWT BPF XMIT instance from a softirq context, it is sufficient
> to receive (on the test node) a packet with a DA matching x.x.x.x. This packet
> is then processed through the forwarding path, eventually leading to the
> ip_output() function. Processing ends with a call to ip_finish_output2(), which
> then calls lwtunnel_xmit().
>
> Below is the stack trace from my testing machine, highlighting the key
> functions involved in this processing path:
>
> ============================================
> <IRQ>
> ...
> lwtunnel_xmit+0x18/0x3f0
> ip_finish_output2+0x45a/0xcc0
> ip_output+0xe2/0x380
> NF_HOOK.constprop.0+0x7e/0x2f0
> ip_rcv+0x4bf/0x4d0
> __netif_receive_skb_one_core+0x11c/0x130
> process_backlog+0x277/0x980
> __napi_poll.constprop.0+0x58/0x260
> net_rx_action+0x396/0x6e0
> handle_softirqs+0x116/0x640
> do_softirq+0xa9/0xe0
> </IRQ>
> ============================================
>
> Conversely, to trigger lwtunnel_xmit() from the task context, simply ping
> x.x.x.x on the same testing node. Below is the corresponding stack trace:
>
> ============================================
> <TASK>
> ...
> lwtunnel_xmit+0x18/0x3f0
> ip_finish_output2+0x45a/0xcc0
> ip_output+0xe2/0x380
> ip_push_pending_frames+0x17a/0x200
> raw_sendmsg+0x9fa/0x1060
> __sys_sendto+0x294/0x2e0
> __x64_sys_sendto+0x6d/0x80
> do_syscall_64+0x64/0x140
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
> </TASK>
> ============================================
>
> So also for the lwtunnel_xmit(), we need to make sure that the functions
> dev_xmit_recursion{_inc/dec}() and the necessary logic to avoid lwt recursion
> are protected, i.e. inside a local_bh_{disable/enable} block.
That's correct, and I ended up with the same conclusion as yours on the
possible paths for lwtunnel_xmit() depending on the context (task vs
irq). Based on your description, we're using a similar approach with
eBPF :-) Note that paths are similar for lwtunnel_output() (see below).
> In a non-preemptible real-time environment (i.e., when !PREEMPT_RT), the
> in_softirq() expands to softirq_count(), which in turn uses the preempt_count()
> function. On my x86 architecture, preempt_count() accesses the per-CPU
> __preempt_count variable.
>
> If in_softirq() returns 0, it indicates that no softirqs are currently being
> processed on the local CPU and BH are not disabled. Therefore, following the
> logic above, we disable bottom halves (BH) on that particular CPU.
>
> However, there is my opinion an issue that can occur: between the check on
> in_softirq() and the call to local_bh_disable(), the task may be scheduled on
> another CPU. As a result, the check on in_softirq() becomes ineffective because
> we may end up disabling BH on a CPU that is not the one we just checked (with
> if (in_softirq()) { ... }).
Hmm, I think it's correct... good catch. I went for this solution to (i)
avoid useless nested BHs disable calls; and (ii) avoid ending up with a
spaghetti graph of possible paths with or without BHs disabled (i.e.,
with single entry points, namely lwtunnel_xmit() and lwtunnel_output()),
which otherwise makes it hard to maintain the code IMO.
So, if we want to follow what Alexei suggests (see his last response),
we'd need to disable BHs in both ip_local_out() and ip6_local_out().
These are the common functions which are closest in depth, and so for
both lwtunnel_xmit() and lwtunnel_output(). But... at the "cost" of
disabling BHs even when it may not be required. Indeed, ip_local_out()
and ip6_local_out() both call dst_output(), which one is usually not
lwtunnel_output() (and there may not even be a lwtunnel_xmit() to call
either).
The other solution is to always call local_bh_disable() in both
lwtunnel_xmit() and lwtunnel_output(), at the cost of disabling BHs when
they were already. Which was basically -v1 and received a NACK from Alexei.
At the moment, I'm not sure what's best.
next prev parent reply other threads:[~2025-04-15 9:10 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-03 8:39 [PATCH net] net: lwtunnel: disable preemption when required Justin Iurman
2025-04-03 16:24 ` Stanislav Fomichev
2025-04-03 19:08 ` Justin Iurman
2025-04-03 20:35 ` Alexei Starovoitov
2025-04-04 14:19 ` Sebastian Sewior
2025-04-06 8:59 ` Justin Iurman
2025-04-07 17:54 ` Alexei Starovoitov
2025-04-11 18:34 ` Justin Iurman
2025-04-14 23:13 ` Alexei Starovoitov
2025-04-15 9:25 ` Justin Iurman
2025-04-15 11:56 ` Justin Iurman
2025-04-15 0:54 ` Andrea Mayer
2025-04-15 9:10 ` Justin Iurman [this message]
2025-04-15 14:38 ` Jakub Kicinski
2025-04-15 15:12 ` Justin Iurman
2025-04-15 15:12 ` Alexei Starovoitov
2025-04-15 17:12 ` Justin Iurman
2025-04-03 23:38 ` Jakub Kicinski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3cee5141-c525-4e83-830e-bf21828aed51@uliege.be \
--to=justin.iurman@uliege.be \
--cc=alexei.starovoitov@gmail.com \
--cc=andrea.mayer@uniroma2.it \
--cc=bigeasy@linutronix.de \
--cc=bpf@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=kuba@kernel.org \
--cc=kuniyu@amazon.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=paolo.lungaroni@uniroma2.it \
--cc=stefano.salsano@uniroma2.it \
--cc=stfomichev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).