* BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Nelson, Shannon @ 2023-11-06 23:03 UTC
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen, David Ahern,
    Jakub Kicinski, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
    Andrii Nakryiko

While testing new code to support XDP in the ionic driver we found that
we could panic the kernel by running a bind/unbind loop on the target
interface of an xdp_redirect action. Obviously this is a stress test
that is abusing the system, but it does point to a race window in
bq_enqueue() and bq_xmit_all(). I believe that while the validity of
the target interface has been checked in __xdp_enqueue(), the interface
can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
use it. No lock is held and no reference is taken on the interface to
hold it in place before the target’s ndo_xdp_xmit() is called.

Below is a stack trace that our tester captured while running our test
code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
non-upstream kernel. But if you look at the current upstream code in
kernel/bpf/devmap.c I think you can see what we ran into; a simplified
sketch of the path in question follows the trace below.

Other than telling users not to abuse the system with a bind/unbind
loop, is there something we can do to limit the potential pain here?
Without knowing which interfaces the users’ XDP programs might target,
is there a precaution the originating driver can take? Did we simply
miss a step in the driver, or is this an actual problem in the devmap
code?

Thanks,
sln

[ 6118.862868] general protection fault, probably for non-canonical address 0x696867666564659a: 0000 [#1] PREEMPT SMP NOPTI
[ 6118.863026] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G OE -------- --- 5.14.0-284.11.1.el9_2.x86_64 #1
[ 6118.863335] Hardware name: HPE ProLiant DL320 Gen11/ProLiant DL320 Gen11, BIOS 1.30 03/01/1023
[ 6118.863515] RIP: 0010:bq_xmit_all+0x8a/0x160
[ 6118.863704] Code: de 4c 89 f1 44 89 ea e8 b4 fc ff ff 89 c6 85 c0 0f 84 af 00 00 00 49 8b 86 d0 00 00 00 44 89 f9 48 89 da 4c 89 f7 89 74 24 04 <48> 8b 80 38 02 00 00 ff d0 0f 1f 00 31 c9 8b 74 24 04 85 c0 41 89
[ 6118.864137] RSP: 0018:ff733629002d8e50 EFLAGS: 00010246
[ 6118.864366] RAX: 6968676665646362 RBX: ffa53628f9ea47e8 RCX: 0000000000000001
[ 6118.864608] RDX: ffa53628f9ea47e8 RSI: 0000000000000010 RDI: ff1fa0f2a46c8000
[ 6118.864859] RBP: ffa53628f9ea47f0 R08: ffffffffc082e700 R09: ffffffffffffffff
[ 6118.865114] R10: 0000000000000040 R11: 0000000000000003 R12: 0000000000000010
[ 6118.865380] R13: 0000000000000010 R14: ff1fa0f2a46c8000 R15: 0000000000000001
[ 6118.865648] FS: 0000000000000000(0000) GS:ff1fa0f5af280000(0000) knlGS:0000000000000000
[ 6118.865928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6118.866213] CR2: 0000557586bae5e6 CR3: 000000001d410004 CR4: 0000000000771ee0
[ 6118.866509] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6118.866811] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 6118.867117] PKRU: 55555554
[ 6118.867423] Call Trace:
[ 6118.867731]  <IRQ>
[ 6118.868040]  __dev_flush+0x39/0xa0
[ 6118.868358]  xdp_do_flush+0xa/0x20
[ 6118.868681]  ionic_txrx_napi+0x1ba/0x1f0 [ionic]
[ 6118.869024]  __napi_poll+0x27/0x170
[ 6118.869357]  net_rx_action+0x233/0x2f0
[ 6118.869693]  __do_softirq+0xc7/0x2ac
[ 6118.870037]  __irq_exit_rcu+0xb5/0xe0
[ 6118.870383]  common_interrupt+0x80/0xa0
[ 6118.870735]  </IRQ>
[ 6118.871082]  <TASK>
[ 6118.871431]  asm_common_interrupt+0x22/0x40
[ 6118.871790] RIP: 0010:cpuidle_enter_state+0xd2/0x400
[ 6118.872162] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 f9 a9 8e ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 12 03 00 00 31 ff e8 72 b7 94 ff fb 45 85 f6 <0f> 88 15 01 00 00 49 63 d6 4c 2b 2c 24 48 8d 04 52 48 8d 04 82 49
[ 6118.872962] RSP: 0018:ff7336290010fe80 EFLAGS: 00000206
[ 6118.873377] RAX: ff1fa0f5af2aabc0 RBX: 0000000000000003 RCX: 000000000000001f
[ 6118.873804] RDX: 0000000000000000 RSI: 0000000055555555 RDI: 0000000000000000
[ 6118.874236] RBP: ffa53628faaa4400 R08: 00000590a8a55399 R09: 0000000000000001
[ 6118.874752] R10: 0000000000000400 R11: 0000000000000730 R12: ffffffffbacace60
[ 6118.875282] R13: 00000590a8a55399 R14: 0000000000000003 R15: 0000000000000000
[ 6118.875737]  cpuidle_enter+0x29/0x40
[ 6118.876198]  cpuidle_idle_call+0x12c/0x1c0
[ 6118.876664]  do_idle+0x7b/0xe0
[ 6118.877132]  cpu_startup_entry+0x19/0x20
[ 6118.877604]  start_secondary+0x116/0x140
[ 6118.878081]  secondary_startup_64_no_verify+0xe5/0xeb
[ 6118.878562]  </TASK>
[ 6118.879104] Modules linked in: ionic(OE) ib_core tls 8021q garp mrp stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat ipmi_ssif intel_rapl_msr intel_rapl_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel pmt_crashlog pmt_telemetry pmt_class kvm irqbypass rapl intel_cstate cdc_eem intel_uncore idxd usbnet acpi_ipmi mei_me isst_if_mbox_pci isst_if_mmio pcspkr isst_if_common intel_vsec mii idxd_bus mei hpilo ipmi_si ipmi_devintf ipmi_msghandler acpi_tad acpi_power_meter xfs libcrc32c sd_mod t10_pi sg mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper ahci syscopyarea sysfillrect sysimgblt libahci fb_sys_fops qat_4xxx drm libata crct10dif_pclmul crc32_pclmul intel_qat crc32c_intel tg3 ghash_clmulni_intel crc8 hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ionic]
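For reference, the path the trace lands in looks roughly like this (a
heavily simplified sketch of bq_xmit_all() from kernel/bpf/devmap.c;
struct layout and error handling are elided, so treat it as an
illustration of the window, not the literal upstream code):

/* Sketch: flushing the per-CPU bulk queue toward the redirect target. */
static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
{
	struct net_device *dev = bq->dev;  /* cached when frames were enqueued */

	if (unlikely(!bq->count))
		return;

	/*
	 * Nothing here pins dev: it was validated back in __xdp_enqueue(),
	 * and no reference or lock is taken before the indirect call below.
	 * The redirect machinery instead relies on the target staying alive
	 * for the duration of the NAPI cycle; if the interface can be torn
	 * down inside that window, this call goes through freed memory.
	 */
	dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

	/* The real code also handles partial sends and frees any frames
	 * the target did not accept.
	 */
	bq->count = 0;
}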
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Toke Høiland-Jørgensen @ 2023-11-07 15:31 UTC
To: Nelson, Shannon, Jesper Dangaard Brouer, David Ahern, Jakub Kicinski,
    netdev, bpf, Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko

"Nelson, Shannon" <shannon.nelson@amd.com> writes:

> While testing new code to support XDP in the ionic driver we found that
> we could panic the kernel by running a bind/unbind loop on the target
> interface of an xdp_redirect action. Obviously this is a stress test
> that is abusing the system, but it does point to a race window in
> bq_enqueue() and bq_xmit_all(). I believe that while the validity of
> the target interface has been checked in __xdp_enqueue(), the interface
> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
> use it. No lock is held and no reference is taken on the interface to
> hold it in place before the target’s ndo_xdp_xmit() is called.
>
> Below is a stack trace that our tester captured while running our test
> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
> non-upstream kernel. But if you look at the current upstream code in
> kernel/bpf/devmap.c I think you can see what we ran into.
>
> Other than telling users not to abuse the system with a bind/unbind
> loop, is there something we can do to limit the potential pain here?
> Without knowing which interfaces the users’ XDP programs might target,
> is there a precaution the originating driver can take? Did we simply
> miss a step in the driver, or is this an actual problem in the devmap
> code?

Sounds like a driver bug :)

The XDP redirect flow guarantees that all outstanding packets are
flushed within a single NAPI cycle, as documented here:
https://docs.kernel.org/bpf/redirect.html

So basically, the driver should be doing a two-step teardown: remove
global visibility of the resource in question, wait for all concurrent
users to finish, and *then* free the data structure. This corresponds
to the usual RCU protection: resources should be kept alive until all
concurrent RCU critical sections have exited on all CPUs. So if your
driver is removing an interface's data structure without waiting for
concurrent NAPI cycles to finish, that's a bug in the driver.

This kind of thing is what the synchronize_net() function is for; for a
usage example, see veth_napi_del_range(). My guess would be that you're
missing this as part of your driver teardown flow?

Another source of a bug like this could be that your driver does not in
fact call xdp_do_flush() before exiting its NAPI cycle, so that there
will be packets from the previous cycle in the bq queue, in which case
the assumption mentioned in the linked document obviously breaks down.
But that would also be a driver bug :)

-Toke
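For concreteness, the two-step teardown described above might look like
this in a driver's queue-teardown path, loosely mirroring
veth_napi_del_range(); the example_* names are hypothetical stand-ins,
not code from any real driver:

static void example_napi_teardown(struct example_queue *q)
{
	/*
	 * Step 1: remove global visibility, so that no new NAPI cycle
	 * can start redirecting frames into this queue.
	 */
	napi_disable(&q->napi);
	__netif_napi_del(&q->napi);

	/*
	 * Step 2: wait for all concurrent RCU read-side sections to
	 * finish, including in-flight NAPI cycles on other CPUs that
	 * may still hold this device in their XDP bulk queues.
	 */
	synchronize_net();

	/* Step 3: only now is it safe to free the backing structures. */
	example_free_rings(q);
}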
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Nelson, Shannon @ 2023-11-08 21:30 UTC
To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, David Ahern,
    Jakub Kicinski, netdev, bpf, Daniel Borkmann, Alexei Starovoitov,
    Andrii Nakryiko

On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
>
> "Nelson, Shannon" <shannon.nelson@amd.com> writes:
>
>> While testing new code to support XDP in the ionic driver we found that
>> we could panic the kernel by running a bind/unbind loop on the target
>> interface of an xdp_redirect action. Obviously this is a stress test
>> that is abusing the system, but it does point to a race window in
>> bq_enqueue() and bq_xmit_all(). I believe that while the validity of
>> the target interface has been checked in __xdp_enqueue(), the interface
>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
>> use it. No lock is held and no reference is taken on the interface to
>> hold it in place before the target’s ndo_xdp_xmit() is called.
>>
>> Below is a stack trace that our tester captured while running our test
>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
>> non-upstream kernel. But if you look at the current upstream code in
>> kernel/bpf/devmap.c I think you can see what we ran into.
>>
>> Other than telling users not to abuse the system with a bind/unbind
>> loop, is there something we can do to limit the potential pain here?
>> Without knowing which interfaces the users’ XDP programs might target,
>> is there a precaution the originating driver can take? Did we simply
>> miss a step in the driver, or is this an actual problem in the devmap
>> code?
>
> Sounds like a driver bug :)

Entirely possible, wouldn't be our first ... :-)

>
> The XDP redirect flow guarantees that all outstanding packets are
> flushed within a single NAPI cycle, as documented here:
> https://docs.kernel.org/bpf/redirect.html
>
> So basically, the driver should be doing a two-step teardown: remove
> global visibility of the resource in question, wait for all concurrent
> users to finish, and *then* free the data structure. This corresponds
> to the usual RCU protection: resources should be kept alive until all
> concurrent RCU critical sections have exited on all CPUs. So if your
> driver is removing an interface's data structure without waiting for
> concurrent NAPI cycles to finish, that's a bug in the driver.
>
> This kind of thing is what the synchronize_net() function is for; for a
> usage example, see veth_napi_del_range(). My guess would be that you're
> missing this as part of your driver teardown flow?

Essentially, the first thing we do in the remove function is call
unregister_netdev(), which has synchronize_net() in its path, so I don't
think this is missing from our scenario, but thanks for the hint. I do
see that a couple of net drivers are more aggressive about calling it
directly in other parts of their logic; I don't think that has a bearing
on this issue, but I'll keep it in mind.

>
> Another source of a bug like this could be that your driver does not in
> fact call xdp_do_flush() before exiting its NAPI cycle, so that there
> will be packets from the previous cycle in the bq queue, in which case
> the assumption mentioned in the linked document obviously breaks down.
> But that would also be a driver bug :)

We do call xdp_do_flush() at the end of the NAPI cycle, just before
calling napi_complete_done().

>
> -Toke
>

Thanks for the notes - I'll have our tester spend some more time with
this using other drivers/interfaces as the targets to see if we can
gather more information on the scenario.

sln
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Toke Høiland-Jørgensen @ 2023-11-08 23:10 UTC
To: Nelson, Shannon, Jesper Dangaard Brouer, David Ahern, Jakub Kicinski,
    netdev, bpf, Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko

"Nelson, Shannon" <shannon.nelson@amd.com> writes:

> On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote:
>>
>> "Nelson, Shannon" <shannon.nelson@amd.com> writes:
>>
>>> While testing new code to support XDP in the ionic driver we found that
>>> we could panic the kernel by running a bind/unbind loop on the target
>>> interface of an xdp_redirect action. Obviously this is a stress test
>>> that is abusing the system, but it does point to a race window in
>>> bq_enqueue() and bq_xmit_all(). I believe that while the validity of
>>> the target interface has been checked in __xdp_enqueue(), the interface
>>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to
>>> use it. No lock is held and no reference is taken on the interface to
>>> hold it in place before the target’s ndo_xdp_xmit() is called.
>>>
>>> Below is a stack trace that our tester captured while running our test
>>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a
>>> non-upstream kernel. But if you look at the current upstream code in
>>> kernel/bpf/devmap.c I think you can see what we ran into.
>>>
>>> Other than telling users not to abuse the system with a bind/unbind
>>> loop, is there something we can do to limit the potential pain here?
>>> Without knowing which interfaces the users’ XDP programs might target,
>>> is there a precaution the originating driver can take? Did we simply
>>> miss a step in the driver, or is this an actual problem in the devmap
>>> code?
>>
>> Sounds like a driver bug :)
>
> Entirely possible, wouldn't be our first ... :-)
>
>>
>> The XDP redirect flow guarantees that all outstanding packets are
>> flushed within a single NAPI cycle, as documented here:
>> https://docs.kernel.org/bpf/redirect.html
>>
>> So basically, the driver should be doing a two-step teardown: remove
>> global visibility of the resource in question, wait for all concurrent
>> users to finish, and *then* free the data structure. This corresponds
>> to the usual RCU protection: resources should be kept alive until all
>> concurrent RCU critical sections have exited on all CPUs. So if your
>> driver is removing an interface's data structure without waiting for
>> concurrent NAPI cycles to finish, that's a bug in the driver.
>>
>> This kind of thing is what the synchronize_net() function is for; for a
>> usage example, see veth_napi_del_range(). My guess would be that you're
>> missing this as part of your driver teardown flow?
>
> Essentially, the first thing we do in the remove function is call
> unregister_netdev(), which has synchronize_net() in its path, so I don't
> think this is missing from our scenario, but thanks for the hint. I do
> see that a couple of net drivers are more aggressive about calling it
> directly in other parts of their logic; I don't think that has a bearing
> on this issue, but I'll keep it in mind.

Hmm, right, in fact unregister_netdev() has two such synchronize_net()
calls. The XDP queue is only guaranteed to be flushed after the second
one of those, though, and there's an 'ndo_uninit()' callback in-between
them. So I don't suppose your driver implements that ndo and does
something there that could cause the crash you're seeing?

Otherwise, the one thing I can think of is that maybe it can be related
to the fact that synchronize_net() turns into a
synchronize_rcu_expedited() if the rtnl lock is held (which it is in
this case if you're calling the parameter-less unregister_netdev()).
I'm not quite sure I grok the expedited wait thing, but it should be
pretty easy to check whether this is the cause by making a change like
the one below and seeing if the issue goes away.

-Toke

diff --git a/net/core/dev.c b/net/core/dev.c
index e28a18e7069b..1a035a5f0b0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10932,7 +10932,7 @@ void synchronize_net(void)
 {
 	might_sleep();
 	if (rtnl_is_locked())
-		synchronize_rcu_expedited();
+		synchronize_rcu();
 	else
 		synchronize_rcu();
 }
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Jakub Kicinski @ 2023-11-09 1:52 UTC
To: Nelson, Shannon
Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, David Ahern,
    netdev, bpf, Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko

On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
> > Another source of a bug like this could be that your driver does not in
> > fact call xdp_do_flush() before exiting its NAPI cycle, so that there
> > will be packets from the previous cycle in the bq queue, in which case
> > the assumption mentioned in the linked document obviously breaks down.
> > But that would also be a driver bug :)
>
> We do call xdp_do_flush() at the end of the NAPI cycle, just before
> calling napi_complete_done().

Just to be sure - flush has to happen on every cycle, not only
before calling napi_complete_done().
* Re: BPF/XDP: kernel panic when removing an interface that is an xdp_redirect target
From: Nelson, Shannon @ 2023-11-09 2:11 UTC
To: Jakub Kicinski
Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, David Ahern,
    netdev, bpf, Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko

On 11/8/2023 5:52 PM, Jakub Kicinski wrote:
>
> On Wed, 8 Nov 2023 13:30:21 -0800 Nelson, Shannon wrote:
>>> Another source of a bug like this could be that your driver does not in
>>> fact call xdp_do_flush() before exiting its NAPI cycle, so that there
>>> will be packets from the previous cycle in the bq queue, in which case
>>> the assumption mentioned in the linked document obviously breaks down.
>>> But that would also be a driver bug :)
>>
>> We do call xdp_do_flush() at the end of the NAPI cycle, just before
>> calling napi_complete_done().
>
> Just to be sure - flush has to happen on every cycle, not only
> before calling napi_complete_done().

Ah, good catch. The notes in redirect.html say "Before exiting its NAPI
loop" and "must be called before napi_complete_done()", and I had read
those as a single condition. We'll make sure we do the flush at the end
of every budget cycle (roughly as in the sketch below) and see what
happens.

Thanks all. This is exactly why talking out your problems with others is
a very good thing. And yes, this will eventually make it into an
upstream patch set - after we do a bunch more testing.

Cheers,
sln
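For anyone following along later, the flush placement Jakub describes
would look roughly like this in a NAPI poll routine (a minimal sketch;
the example_* helpers are hypothetical, not ionic code):

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	int work_done = example_clean_rx(napi, budget);

	/*
	 * Flush pending XDP_REDIRECT frames on every poll, even when
	 * work_done == budget and napi_complete_done() is never reached;
	 * otherwise frames can sit in the bulk queue across NAPI cycles,
	 * breaking the single-cycle flush guarantee.
	 */
	xdp_do_flush();

	if (work_done < budget && napi_complete_done(napi, work_done))
		example_enable_interrupts(napi);

	return work_done;
}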