* BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
@ 2023-11-06 23:03 Nelson, Shannon
2023-11-07 15:31 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 6+ messages in thread
From: Nelson, Shannon @ 2023-11-06 23:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
David Ahern, Jakub Kicinski, netdev, bpf, Daniel Borkmann,
Alexei Starovoitov, Andrii Nakryiko
While testing new code to support XDP in the ionic driver we found that
we could panic the kernel by running a bind/unbind loop on the target
interface of an xdp_redirect action. Obviously this is a stress test
that is abusing the system, but it does point to a window of opportunity
in bq_enqueue() and bq_xmit_all(). I believe that while the validity of
the target interface has been checked in __xdp_enqueue(), the interface
can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
use the interface. There is no locking or reference taken on the
interface to hold it in place before the target’s ndo_xdp_xmit() is called.
Below is a stack trace that our tester captured while running our test
code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
non-upstream kernel. But if you look at the current upstream code in
kernel/bpf/devmap.c I think you can see what we ran into.
Other than telling users to not abuse the system with a bind/unbind
loop, is there something we can do to limit the potential pain here?
Without knowing what interfaces might be targeted by the users’ XDP
programs, is there a step the originating driver can do to take
precautions? Did we simply miss a step in the driver, or is this an
actual problem in the devmap code?
Thanks,
sln
[ 6118.862868] general protection fault, probably for non-canonical
address 0x696867666564659a: 0000 [#1] PREEMPT SMP NOPTI
[ 6118.863026] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G
OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
[ 6118.863335] Hardware name: HPE ProLiant DL320 Gen11/ProLiant DL320
Gen11, BIOS 1.30 03/01/1023
[ 6118.863515] RIP: 0010:bq_xmit_all+0x8a/0x160
[ 6118.863704] Code: de 4c 89 f1 44 89 ea e8 b4 fc ff ff 89 c6 85 c0 0f
84 af 00 00 00 49 8b 86 d0 00 00 00 44 89 f9 48 89 da 4c 89 f7 89 74 24
04 <48> 8b 80 38 02 00 00 ff d0 0f 1f 00 31 c9 8b 74 24 04 85 c0 41 89
[ 6118.864137] RSP: 0018:ff733629002d8e50 EFLAGS: 00010246
[ 6118.864366] RAX: 6968676665646362 RBX: ffa53628f9ea47e8 RCX:
0000000000000001
[ 6118.864608] RDX: ffa53628f9ea47e8 RSI: 0000000000000010 RDI:
ff1fa0f2a46c8000
[ 6118.864859] RBP: ffa53628f9ea47f0 R08: ffffffffc082e700 R09:
ffffffffffffffff
[ 6118.865114] R10: 0000000000000040 R11: 0000000000000003 R12:
0000000000000010
[ 6118.865380] R13: 0000000000000010 R14: ff1fa0f2a46c8000 R15:
0000000000000001
[ 6118.865648] FS: 0000000000000000(0000) GS:ff1fa0f5af280000(0000)
knlGS:0000000000000000
[ 6118.865928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6118.866213] CR2: 0000557586bae5e6 CR3: 000000001d410004 CR4:
0000000000771ee0
[ 6118.866509] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 6118.866811] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7:
0000000000000400
[ 6118.867117] PKRU: 55555554
[ 6118.867423] Call Trace:
[ 6118.867731] <IRQ>
[ 6118.868040] __dev_flush+0x39/0xa0
[ 6118.868358] xdp_do_flush+0xa/0x20
[ 6118.868681] ionic_txrx_napi+0x1ba/0x1f0 [ionic]
[ 6118.869024] __napi_poll+0x27/0x170
[ 6118.869357] net_rx_action+0x233/0x2f0
[ 6118.869693] __do_softirq+0xc7/0x2ac
[ 6118.870037] __irq_exit_rcu+0xb5/0xe0
[ 6118.870383] common_interrupt+0x80/0xa0
[ 6118.870735] </IRQ>
[ 6118.871082] <TASK>
[ 6118.871431] asm_common_interrupt+0x22/0x40
[ 6118.871790] RIP: 0010:cpuidle_enter_state+0xd2/0x400
[ 6118.872162] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 f9 a9 8e ff 45 84
ff 74 12 9c 58 f6 c4 02 0f 85 12 03 00 00 31 ff e8 72 b7 94 ff fb 45 85
f6 <0f> 88 15 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d 04 82 49
[ 6118.872962] RSP: 0018:ff7336290010fe80 EFLAGS: 00000206
[ 6118.873377] RAX: ff1fa0f5af2aabc0 RBX: 0000000000000003 RCX:
000000000000001f
[ 6118.873804] RDX: 0000000000000000 RSI: 0000000055555555 RDI:
0000000000000000
[ 6118.874236] RBP: ffa53628faaa4400 R08: 00000590a8a55399 R09:
0000000000000001
[ 6118.874752] R10: 0000000000000400 R11: 0000000000000730 R12:
ffffffffbacace60
[ 6118.875282] R13: 00000590a8a55399 R14: 0000000000000003 R15:
0000000000000000
[ 6118.875737] cpuidle_enter+0x29/0x40
[ 6118.876198] cpuidle_idle_call+0x12c/0x1c0
[ 6118.876664] do_idle+0x7b/0xe0
[ 6118.877132] cpu_startup_entry+0x19/0x20
[ 6118.877604] start_secondary+0x116/0x140
[ 6118.878081] secondary_startup_64_no_verify+0xe5/0xeb
[ 6118.878562] </TASK>
[ 6118.879104] Modules linked in: ionic(OE) ib_core tls 8021q garp mrp
stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables
nfnetlink vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common nfit
libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
pmt_crashlog pmt_telemetry pmt_class kvm irqbypass rapl intel_cstate
cdc_eem intel_uncore idxd usbnet acpi_ipmi mei_me isst_if_mbox_pci
isst_if_mmio pcspkr isst_if_common intel_vsec mii idxd_bus mei hpilo
ipmi_si ipmi_devintf ipmi_msghandler acpi_tad acpi_power_meter xfs
libcrc32c sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper
drm_kms_helper ahci syscopyarea sysfillrect sysimgblt libahci
fb_sys_fops qat_4xxx drm libata crct10dif_pclmul crc32_pclmul intel_qat
crc32c_intel tg3 ghash_clmulni_intel crc8 hpwdt wmi dm_mirror
dm_region_hash dm_log dm_mod fuse [last unloaded: ionic]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
2023-11-06 23:03 BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target Nelson, Shannon
@ 2023-11-07 15:31 ` Toke Høiland-Jørgensen
2023-11-08 21:30 ` Nelson, Shannon
0 siblings, 1 reply; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-11-07 15:31 UTC (permalink / raw)
To: Nelson, Shannon, Jesper Dangaard Brouer, David Ahern,
Jakub Kicinski, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
Andrii Nakryiko
"Nelson, Shannon" <shannon.nelson@amd.com> writes:
> While testing new code to support XDP in the ionic driver we found that
> we could panic the kernel by running a bind/unbind loop on the target
> interface of an xdp_redirect action. Obviously this is a stress test
> that is abusing the system, but it does point to a window of opportunity
> in bq_enqueue() and bq_xmit_all(). I believe that while the validity of
> the target interface has been checked in __xdp_enqueue(), the interface
> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
> use the interface. There is no locking or reference taken on the
> interface to hold it in place before the target’s ndo_xdp_xmit() is called.
>
> Below is a stack trace that our tester captured while running our test
> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
> non-upstream kernel. But if you look at the current upstream code in
> kernel/bpf/devmap.c I think you can see what we ran into.
>
> Other than telling users to not abuse the system with a bind/unbind
> loop, is there something we can do to limit the potential pain here?
> Without knowing what interfaces might be targeted by the users’ XDP
> programs, is there a step the originating driver can do to take
> precautions? Did we simply miss a step in the driver, or is this an
> actual problem in the devmap code?
Sounds like a driver bug :)
The XDP redirect flow guarantees that all outstanding packets are
flushed within a single NAPI cycle, as documented here:
https://docs.kernel.org/bpf/redirect.html
So basically, the driver should be doing a two-step teardown: remove
global visibility of the resource in question, wait for all concurrent
users to finish, and *then* free the data structure. This corresponds to
the usual RCU protection: resources should be kept alive until all
concurrent RCU critical sections have exited on all CPUs. So if your
driver is removing an interface's data structure without waiting for
concurrent NAPI cycles to finish, that's a bug in the driver.
This kind of thing is what the synchronize_net() function is for; for a
usage example, see veth_napi_del_range(). My guess would be that you're
missing this as part of your driver teardown flow?
Another source of a bug like this could be that your driver does not in
fact call xdp_do_flush() before exiting its NAPI cycle, so that there
will be packets from the previous cycle in the bq queue, in which case
the assumption mentioned in the linked document obviously breaks down.
But that would also be a driver bug :)
-Toke
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
2023-11-07 15:31 ` Toke Høiland-Jørgensen
@ 2023-11-08 21:30 ` Nelson, Shannon
2023-11-08 23:10 ` Toke Høiland-Jørgensen
2023-11-09 1:52 ` Jakub Kicinski
0 siblings, 2 replies; 6+ messages in thread
From: Nelson, Shannon @ 2023-11-08 21:30 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
David Ahern, Jakub Kicinski, netdev, bpf, Daniel Borkmann,
Alexei Starovoitov, Andrii Nakryiko
On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
>
> "Nelson, Shannon" <shannon.nelson@amd.com> writes:
>
>> While testing new code to support XDP in the ionic driver we found that
>> we could panic the kernel by running a bind/unbind loop on the target
>> interface of an xdp_redirect action. Obviously this is a stress test
>> that is abusing the system, but it does point to a window of opportunity
>> in bq_enqueue() and bq_xmit_all(). I believe that while the validity of
>> the target interface has been checked in __xdp_enqueue(), the interface
>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
>> use the interface. There is no locking or reference taken on the
>> interface to hold it in place before the target’s ndo_xdp_xmit() is called.
>>
>> Below is a stack trace that our tester captured while running our test
>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
>> non-upstream kernel. But if you look at the current upstream code in
>> kernel/bpf/devmap.c I think you can see what we ran into.
>>
>> Other than telling users to not abuse the system with a bind/unbind
>> loop, is there something we can do to limit the potential pain here?
>> Without knowing what interfaces might be targeted by the users’ XDP
>> programs, is there a step the originating driver can do to take
>> precautions? Did we simply miss a step in the driver, or is this an
>> actual problem in the devmap code?
>
> Sounds like a driver bug :)
Entirely possible, wouldn't be our first ... :-)
>
> The XDP redirect flow guarantees that all outstanding packets are
> flushed within a single NAPI cycle, as documented here:
> https://docs.kernel.org/bpf/redirect.html
>
> So basically, the driver should be doing a two-step teardown: remove
> global visibility of the resource in question, wait for all concurrent
> users to finish, and *then* free the data structure. This corresponds to
> the usual RCU protection: resources should be kept alive until all
> concurrent RCU critical sections have exited on all CPUs. So if your
> driver is removing an interface's data structure without waiting for
> concurrent NAPI cycles to finish, that's a bug in the driver.
>
> This kind of thing is what the synchronize_net() function is for; for a
> usage example, see veth_napi_del_range(). My guess would be that you're
> missing this as part of your driver teardown flow?
Essentially, the first thing we do in the remove function is to call
unregister_netdev(), which has synchronize_net() in the path, so I don't
think this is missing from our scenario, but thanks for the hint, I'll
keep this in mind. I do see there are a couple of net drivers that are
more aggressive about calling it directly in some other parts of the
logic - I don't think that has a bearing on this issue, but I'll keep it
in mind.
>
> Another source of a bug like this could be that your driver does not in
> fact call xdp_do_flush() before exiting its NAPI cycle, so that there
> will be packets from the previous cycle in the bq queue, in which case
> the assumption mentioned in the linked document obviously breaks down.
> But that would also be a driver bug :)
We do call the xdp_do_flush() at the end of the NAPI cycle, just before
calling napi_complete_done().
>
> -Toke
>
Thanks for the notes - I'll have our tester spend some more time with
this using other drivers/interfaces as the targets to see if we can
gather more information on the scenario.
sln
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
2023-11-08 21:30 ` Nelson, Shannon
@ 2023-11-08 23:10 ` Toke Høiland-Jørgensen
2023-11-09 1:52 ` Jakub Kicinski
1 sibling, 0 replies; 6+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-11-08 23:10 UTC (permalink / raw)
To: Nelson, Shannon, Jesper Dangaard Brouer, David Ahern,
Jakub Kicinski, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
Andrii Nakryiko
"Nelson, Shannon" <shannon.nelson@amd.com> writes:
> On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
>>
>> "Nelson, Shannon" <shannon.nelson@amd.com> writes:
>>
>>> While testing new code to support XDP in the ionic driver we found that
>>> we could panic the kernel by running a bind/unbind loop on the target
>>> interface of an xdp_redirect action. Obviously this is a stress test
>>> that is abusing the system, but it does point to a window of opportunity
>>> in bq_enqueue() and bq_xmit_all(). I believe that while the validity of
>>> the target interface has been checked in __xdp_enqueue(), the interface
>>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
>>> use the interface. There is no locking or reference taken on the
>>> interface to hold it in place before the target’s ndo_xdp_xmit() is called.
>>>
>>> Below is a stack trace that our tester captured while running our test
>>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
>>> non-upstream kernel. But if you look at the current upstream code in
>>> kernel/bpf/devmap.c I think you can see what we ran into.
>>>
>>> Other than telling users to not abuse the system with a bind/unbind
>>> loop, is there something we can do to limit the potential pain here?
>>> Without knowing what interfaces might be targeted by the users’ XDP
>>> programs, is there a step the originating driver can do to take
>>> precautions? Did we simply miss a step in the driver, or is this an
>>> actual problem in the devmap code?
>>
>> Sounds like a driver bug :)
>
> Entirely possible, wouldn't be our first ... :-)
>
>>
>> The XDP redirect flow guarantees that all outstanding packets are
>> flushed within a single NAPI cycle, as documented here:
>> https://docs.kernel.org/bpf/redirect.html
>>
>> So basically, the driver should be doing a two-step teardown: remove
>> global visibility of the resource in question, wait for all concurrent
>> users to finish, and *then* free the data structure. This corresponds to
>> the usual RCU protection: resources should be kept alive until all
>> concurrent RCU critical sections have exited on all CPUs. So if your
>> driver is removing an interface's data structure without waiting for
>> concurrent NAPI cycles to finish, that's a bug in the driver.
>>
>> This kind of thing is what the synchronize_net() function is for; for a
>> usage example, see veth_napi_del_range(). My guess would be that you're
>> missing this as part of your driver teardown flow?
>
> Essentially, the first thing we do in the remove function is to call
> unregister_netdev(), which has synchronize_net() in the path, so I don't
> think this is missing from our scenario, but thanks for the hint, I'll
> keep this in mind. I do see there are a couple of net drivers that are
> more aggressive about calling it directly in some other parts of the
> logic - I don't think that has a bearing on this issue, but I'll keep it
> in mind.
Hmm, right, in fact unregister_netdev() has two such synchronize_net()
calls. The XDP queue is only guaranteed to be flushed after the second
one of those, though, and there's an 'ndo_uninit()' callback in-between
them. So I don't suppose your driver implements that ndo and does
something there that could cause the crash you're seeing?
Otherwise, the one thing I can think of is that maybe it can be related
to the fact that synchronize_net() turns into a
synchronize_rcu_expedited() if the rtnl lock is held (which it is in
this case if you're calling the parameter-less unregister_netdev()). I'm
not quite sure I grok the expedited wait thing, but it should be pretty
easy to check if this is the cause by making a change like the one below
and seeing if the issue goes away.
-Toke
diff --git a/net/core/dev.c b/net/core/dev.c
index e28a18e7069b..1a035a5f0b0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10932,7 +10932,7 @@ void synchronize_net(void)
{
might_sleep();
if (rtnl_is_locked())
- synchronize_rcu_expedited();
+ synchronize_rcu();
else
synchronize_rcu();
}
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
2023-11-08 21:30 ` Nelson, Shannon
2023-11-08 23:10 ` Toke Høiland-Jørgensen
@ 2023-11-09 1:52 ` Jakub Kicinski
2023-11-09 2:11 ` Nelson, Shannon
1 sibling, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2023-11-09 1:52 UTC (permalink / raw)
To: Nelson, Shannon
Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
David Ahern, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
Andrii Nakryiko
On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
> > Another source of a bug like this could be that your driver does not in
> > fact call xdp_do_flush() before exiting its NAPI cycle, so that there
> > will be packets from the previous cycle in the bq queue, in which case
> > the assumption mentioned in the linked document obviously breaks down.
> > But that would also be a driver bug :)
>
> We do call the xdp_do_flush() at the end of the NAPI cycle, just before
> calling napi_complete_done().
Just to be sure - flush has to happen on every cycle, not only
before calling napi_complete_done().
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
2023-11-09 1:52 ` Jakub Kicinski
@ 2023-11-09 2:11 ` Nelson, Shannon
0 siblings, 0 replies; 6+ messages in thread
From: Nelson, Shannon @ 2023-11-09 2:11 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
David Ahern, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
Andrii Nakryiko
On 11/8/2023 5:52 PM, Jakub Kicinski wrote:
>
> On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
>>> Another source of a bug like this could be that your driver does not in
>>> fact call xdp_do_flush() before exiting its NAPI cycle, so that there
>>> will be packets from the previous cycle in the bq queue, in which case
>>> the assumption mentioned in the linked document obviously breaks down.
>>> But that would also be a driver bug :)
>>
>> We do call the xdp_do_flush() at the end of the NAPI cycle, just before
>> calling napi_complete_done().
>
> Just to be sure - flush has to happen on every cycle, not only
> before calling napi_complete_done().
Ah, good catch. The notes in redirect.html say "Before exiting its NAPI
loop" and "must be called before napi_complete_done()". I interpreted
those together.
We'll make sure we do the flush at the end of every budget cycle and see
what happens.
Thanks all. This is exactly why talking out your problems with others
is a very good thing. And yes, this eventually will make it into an
upstream patch set - after we do a bunch more testing.
Cheers,
sln
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2023-11-09 2:11 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-06 23:03 BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target Nelson, Shannon
2023-11-07 15:31 ` Toke Høiland-Jørgensen
2023-11-08 21:30 ` Nelson, Shannon
2023-11-08 23:10 ` Toke Høiland-Jørgensen
2023-11-09 1:52 ` Jakub Kicinski
2023-11-09 2:11 ` Nelson, Shannon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox