* Kernel memory leak in bnx2x driver with vxlan tunnel @ 2016-01-14 17:17 John 2016-01-19 21:07 ` Jesse Gross 0 siblings, 1 reply; 15+ messages in thread From: John @ 2016-01-14 17:17 UTC (permalink / raw) To: netdev; +Cc: tom, david.roth I'm getting what seems to be a kernel memory leak while doing a TCP throughput test between two VMs on identical systems, in order to test a broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version 2.4.90. The host system of the receiving (server) VM leaks memory during the throughput test. The memory leaks fast enough to make the system completely unusable within five minutes. Once I stop the throughput test, the memory stops leaking. A couple of times, the kernel on the host system has actually killed the qemu process for me, but this doesn't happen reliably. The leaked memory doesn't become available again even after the VM is killed. To investigate this, I compiled a 4.4.0-rc8 kernel with kmemleak. I can scan the leaking system during and after killing the throughput test and get the following stack trace over and over again: unreferenced object 0xffff880464f11488 (size 256): comm "softirq", pid 0, jiffies 4312675043 (age 379.184s) hex dump (first 32 bytes): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81550e28>] kmemleak_alloc+0x28/0x50 [<ffffffff811a8bac>] __kmalloc+0x11c/0x2a0 [<ffffffff8146972e>] metadata_dst_alloc+0x1e/0x40 [<ffffffff814ed476>] udp_tun_rx_dst+0x126/0x1c0 [<ffffffff8140be68>] vxlan_udp_encap_recv+0x148/0xb10 [<ffffffff814cabe9>] udp_queue_rcv_skb+0x1e9/0x480 [<ffffffff814cb77c>] __udp4_lib_rcv+0x45c/0x700 [<ffffffff814cbf0a>] udp_rcv+0x1a/0x20 [<ffffffff8149ae54>] ip_local_deliver_finish+0x94/0x1e0 [<ffffffff8149b150>] ip_local_deliver+0x60/0xd0 [<ffffffff8149ab39>] ip_rcv_finish+0x99/0x320 [<ffffffff8149b41e>] ip_rcv+0x25e/0x380 [<ffffffff814602db>] __netif_receive_skb_core+0x2cb/0xa00 [<ffffffff81460a26>] __netif_receive_skb+0x16/0x70 [<ffffffff81460aa3>] netif_receive_skb_internal+0x23/0x80 [<ffffffff814615f5>] napi_gro_receive+0xa5/0xd0 I pulled down the kernel tree from http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, and did a git bisect and got this: 58ce31cca1ffe057f4744c3f671e3e84606d3d4a is the first bad commit commit 58ce31cca1ffe057f4744c3f671e3e84606d3d4a Author: Tom Herbert <tom@herbertland.com> Date: Wed Aug 19 17:07:33 2015 -0700 vxlan: GRO support at tunnel layer Add calls to gro_cells infrastructure to do GRO when receiving on a tunnel. Testing: Ran 200 netperf TCP_STREAM instance - With fix (GRO enabled on VXLAN interface) Verify GRO is happening. 9084 MBps tput 3.44% CPU utilization - Without fix (GRO disabled on VXLAN interface) Verified no GRO is happening. 9084 MBps tput 5.54% CPU utilization Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net> :040000 040000 a7d49cb2e24ebddf620c01e27515cc756b32e46f c3951c16da75ff3e0db1322b8ccb3e61975b1242 M drivers :040000 040000 f36442958138eafdd472c58d06ea35be66990aa1 0e29d513e575dd11f459c59df71e05db074363de M include For the test I'm using two HP Proliant dl360gen9's. I put two matching broadcom PCIe cards in each machine and ran throughput tests between two VMs on either machine, using the throughput testing program iperf3. On each host we had a qemu VM attached to an OVS bridge; these bridges are connected over a VxLAN tunnel as detailed here: https://community.mellanox.com/docs/DOC-1446. The test went well with an Intel Niantic NIC, and I saw high (8.04 Gb/s) throughput over an eighteen hour throughput test. There was no memory leak. However, when I put in a Broadcom NIC on both systems I get the above memory leak, if they have a VM on the receving end of the test. lspci -v output for the Broadcom NIC below: 04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10) Subsystem: Hewlett-Packard Company HP FlexFabric 10Gb 2-port 534FLR-SFP+ Adapter Flags: bus master, fast devsel, latency 0 Memory at 97000000 (64-bit, prefetchable) [size=8M] Memory at 96800000 (64-bit, prefetchable) [size=8M] Memory at 98800000 (64-bit, prefetchable) [size=64K] [virtual] Expansion ROM at 98880000 [disabled] [size=512K] Capabilities: [48] Power Management version 3 Capabilities: [50] Vital Product Data Capabilities: [a0] MSI-X: Enable+ Count=32 Masked- Capabilities: [ac] Express Endpoint, MSI 00 Capabilities: [100] Advanced Error Reporting Capabilities: [13c] Device Serial Number 14-58-d0-ff-fe-52-5b-d8 Capabilities: [150] Power Budgeting <?> Capabilities: [160] Virtual Channel Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI) Capabilities: [1c0] Single Root I/O Virtualization (SR-IOV) Capabilities: [220] #15 Capabilities: [300] #19 Kernel driver in use: bnx2x ethtool -i info: driver: bnx2x version: 1.712.30-0 firmware-version: bc 7.8.24 bus-info: 0000:04:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes dmesg bnx2x output: [ 1.506071] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.712.30-0 (2014/02/10) [ 1.506205] bnx2x 0000:04:00.0: msix capability found [ 1.506297] bnx2x 0000:04:00.0: part number 394D4342-31383735-31543030-47303030 [ 1.555970] bnx2x 0000:04:00.1: msix capability found [ 1.556061] bnx2x 0000:04:00.1: part number 394D4342-31383735-31543030-47303030 [ 10.360477] bnx2x 0000:04:00.1 eth9: renamed from eth3 [ 10.584371] bnx2x 0000:04:00.0 rename3: renamed from eth1 [ 588.956002] bnx2x 0000:04:00.0 rename3: using MSI-X IRQs: sp 70 fp[0] 74 ... fp[7] 81 [ 589.208675] bnx2x 0000:04:00.0 rename3: Added vxlan dest port 4789 [ 640.159842] bnx2x 0000:04:00.1 eth10: renamed from eth9 [ 642.432216] bnx2x 0000:04:00.1 eth10: using MSI-X IRQs: sp 82 fp[0] 84 ... fp[7] 91 [ 642.700576] bnx2x 0000:04:00.1 eth10: Added vxlan dest port 4789 [ 1098.368845] bnx2x 0000:04:00.1 eth10: using MSI-X IRQs: sp 82 fp[0] 84 ... fp[7] 91 [ 1109.277182] bnx2x 0000:04:00.1 eth10_nolink: renamed from eth10 [ 1115.368873] bnx2x 0000:04:00.0 eth10: renamed from rename3 [ 1117.928156] bnx2x 0000:04:00.0 eth10: using MSI-X IRQs: sp 70 fp[0] 74 ... fp[7] 81 [ 1118.214861] bnx2x 0000:04:00.0 eth10: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit I've tried disabling all offloads (gro included) but the leak still happens. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-14 17:17 Kernel memory leak in bnx2x driver with vxlan tunnel John @ 2016-01-19 21:07 ` Jesse Gross 2016-01-19 22:47 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Jesse Gross @ 2016-01-19 21:07 UTC (permalink / raw) To: John; +Cc: Linux Kernel Network Developers, Tom Herbert, david.roth On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote: > I'm getting what seems to be a kernel memory leak while doing a TCP > throughput test between two VMs on identical systems, in order to test a > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version > 2.4.90. The host system of the receiving (server) VM leaks memory during the > throughput test. The memory leaks fast enough to make the system completely > unusable within five minutes. Once I stop the throughput test, the memory > stops > leaking. A couple of times, the kernel on the host system has actually > killed > the qemu process for me, but this doesn't happen reliably. The leaked memory > doesn't become available again even after the VM is killed. It looks like the problem is in napi_skb_finish(). If we when do GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself but not any of the associated elements. Historically, this would have been OK but these days we will have allocated a dst entry already for tunnel metadata, which will get leaked. If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a __kfree_skb(), which will release the dst entry. That would explain why some drivers have the problem but not others since the memory is laid out differently. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-19 21:07 ` Jesse Gross @ 2016-01-19 22:47 ` Eric Dumazet 2016-01-19 23:34 ` Jesse Gross 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-01-19 22:47 UTC (permalink / raw) To: Jesse Gross Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote: > On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote: > > I'm getting what seems to be a kernel memory leak while doing a TCP > > throughput test between two VMs on identical systems, in order to test a > > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version > > 2.4.90. The host system of the receiving (server) VM leaks memory during the > > throughput test. The memory leaks fast enough to make the system completely > > unusable within five minutes. Once I stop the throughput test, the memory > > stops > > leaking. A couple of times, the kernel on the host system has actually > > killed > > the qemu process for me, but this doesn't happen reliably. The leaked memory > > doesn't become available again even after the VM is killed. > > It looks like the problem is in napi_skb_finish(). If we when do > GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free == > NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself > but not any of the associated elements. Historically, this would have > been OK but these days we will have allocated a dst entry already for > tunnel metadata, which will get leaked. > > If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a > __kfree_skb(), which will release the dst entry. That would explain > why some drivers have the problem but not others since the memory is > laid out differently. Wow.... What is the purpose of using skb_dst_set() on skb before calling gro_cells_receive() exactly ? Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not helpful : Following patch create new tunnel flag which enable tunnel metadata collection on given device. This is rather strange since later the dst is thrown away with the skb_valid_dst() test. Note also that IP early demux is broken as well, since it does not use skb_valid_dst() but a simple : if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) { ... } ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-19 22:47 ` Eric Dumazet @ 2016-01-19 23:34 ` Jesse Gross 2016-01-19 23:51 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Jesse Gross @ 2016-01-19 23:34 UTC (permalink / raw) To: Eric Dumazet Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar, Thomas Graf On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote: >> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote: >> > I'm getting what seems to be a kernel memory leak while doing a TCP >> > throughput test between two VMs on identical systems, in order to test a >> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version >> > 2.4.90. The host system of the receiving (server) VM leaks memory during the >> > throughput test. The memory leaks fast enough to make the system completely >> > unusable within five minutes. Once I stop the throughput test, the memory >> > stops >> > leaking. A couple of times, the kernel on the host system has actually >> > killed >> > the qemu process for me, but this doesn't happen reliably. The leaked memory >> > doesn't become available again even after the VM is killed. >> >> It looks like the problem is in napi_skb_finish(). If we when do >> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free == >> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself >> but not any of the associated elements. Historically, this would have >> been OK but these days we will have allocated a dst entry already for >> tunnel metadata, which will get leaked. >> >> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a >> __kfree_skb(), which will release the dst entry. That would explain >> why some drivers have the problem but not others since the memory is >> laid out differently. > > > Wow.... What is the purpose of using skb_dst_set() on skb before calling > gro_cells_receive() exactly ? > > Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not > helpful : > > Following patch create new tunnel flag which enable > tunnel metadata collection on given device. Note that this isn't really the problem commit. The general issue is lightweight tunnels - the above commit is just adding support for GRE in a way that I think follows the existing model. > This is rather strange since later the dst is thrown away with the > skb_valid_dst() test. This isn't really an IP routing dst - it's the tunnel encapsulation information - so there isn't much to with it in ip_rcv_finish(). However, the information can be consumed in other places, such as through eBPF. > Note also that IP early demux is broken as well, since it does not use > skb_valid_dst() but a simple : > > if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) { > ... > } I agree that this should use the helper function. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-19 23:34 ` Jesse Gross @ 2016-01-19 23:51 ` Eric Dumazet 2016-01-20 0:00 ` Jesse Gross 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-01-19 23:51 UTC (permalink / raw) To: Jesse Gross Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar, Thomas Graf On Tue, 2016-01-19 at 15:34 -0800, Jesse Gross wrote: > On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote: > >> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote: > >> > I'm getting what seems to be a kernel memory leak while doing a TCP > >> > throughput test between two VMs on identical systems, in order to test a > >> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version > >> > 2.4.90. The host system of the receiving (server) VM leaks memory during the > >> > throughput test. The memory leaks fast enough to make the system completely > >> > unusable within five minutes. Once I stop the throughput test, the memory > >> > stops > >> > leaking. A couple of times, the kernel on the host system has actually > >> > killed > >> > the qemu process for me, but this doesn't happen reliably. The leaked memory > >> > doesn't become available again even after the VM is killed. > >> > >> It looks like the problem is in napi_skb_finish(). If we when do > >> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free == > >> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself > >> but not any of the associated elements. Historically, this would have > >> been OK but these days we will have allocated a dst entry already for > >> tunnel metadata, which will get leaked. > >> > >> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a > >> __kfree_skb(), which will release the dst entry. That would explain > >> why some drivers have the problem but not others since the memory is > >> laid out differently. > > > > > > Wow.... What is the purpose of using skb_dst_set() on skb before calling > > gro_cells_receive() exactly ? > > > > Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not > > helpful : > > > > Following patch create new tunnel flag which enable > > tunnel metadata collection on given device. > > Note that this isn't really the problem commit. The general issue is > lightweight tunnels - the above commit is just adding support for GRE > in a way that I think follows the existing model. I believe this commit added the first skb_dst_set() before gro_cells_receive(), in ip_tunnel_rcv(). This is already buggy. Then Tom in 58ce31cca1ffe057f4744c3f671e3e84606d3d4a added the gro_cells_receive() in vxlan_rcv(), which added another bug, because of existing skb_dst_set() call. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-19 23:51 ` Eric Dumazet @ 2016-01-20 0:00 ` Jesse Gross 2016-01-20 0:17 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Jesse Gross @ 2016-01-20 0:00 UTC (permalink / raw) To: Eric Dumazet Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar, Thomas Graf On Tue, Jan 19, 2016 at 3:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-01-19 at 15:34 -0800, Jesse Gross wrote: >> On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote: >> >> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote: >> >> > I'm getting what seems to be a kernel memory leak while doing a TCP >> >> > throughput test between two VMs on identical systems, in order to test a >> >> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version >> >> > 2.4.90. The host system of the receiving (server) VM leaks memory during the >> >> > throughput test. The memory leaks fast enough to make the system completely >> >> > unusable within five minutes. Once I stop the throughput test, the memory >> >> > stops >> >> > leaking. A couple of times, the kernel on the host system has actually >> >> > killed >> >> > the qemu process for me, but this doesn't happen reliably. The leaked memory >> >> > doesn't become available again even after the VM is killed. >> >> >> >> It looks like the problem is in napi_skb_finish(). If we when do >> >> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free == >> >> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself >> >> but not any of the associated elements. Historically, this would have >> >> been OK but these days we will have allocated a dst entry already for >> >> tunnel metadata, which will get leaked. >> >> >> >> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a >> >> __kfree_skb(), which will release the dst entry. That would explain >> >> why some drivers have the problem but not others since the memory is >> >> laid out differently. >> > >> > >> > Wow.... What is the purpose of using skb_dst_set() on skb before calling >> > gro_cells_receive() exactly ? >> > >> > Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not >> > helpful : >> > >> > Following patch create new tunnel flag which enable >> > tunnel metadata collection on given device. >> >> Note that this isn't really the problem commit. The general issue is >> lightweight tunnels - the above commit is just adding support for GRE >> in a way that I think follows the existing model. > > I believe this commit added the first skb_dst_set() before > gro_cells_receive(), in ip_tunnel_rcv(). > > This is already buggy. > > Then Tom in 58ce31cca1ffe057f4744c3f671e3e84606d3d4a added the > gro_cells_receive() in vxlan_rcv(), which added another bug, because of > existing skb_dst_set() call. Oh, I wasn't disagreeing that there is a problem. I was just trying to point out the conceptual root of the problem rather than the individual instances in different tunnels implementations. In any case, I think we can solve this by calling skb_release_head_state() in napi_skb_finish() and switching early demux to use skb_valid_dst(). ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 0:00 ` Jesse Gross @ 2016-01-20 0:17 ` Eric Dumazet 2016-01-20 0:51 ` Jesse Gross 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-01-20 0:17 UTC (permalink / raw) To: Jesse Gross Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar, Thomas Graf On Tue, 2016-01-19 at 16:00 -0800, Jesse Gross wrote: > Oh, I wasn't disagreeing that there is a problem. I was just trying to > point out the conceptual root of the problem rather than the > individual instances in different tunnels implementations. > > In any case, I think we can solve this by calling > skb_release_head_state() in napi_skb_finish() and switching early > demux to use skb_valid_dst(). Yes, but it will also have to call skb_dst_drop() or risk a leak of prior one. So what is the purpose of having a dst if we need to drop it ? Adding code in GRO would be fine if someone explains me the purpose of doing apparently useless work. (refcounting on dst is not exactly free) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 0:17 ` Eric Dumazet @ 2016-01-20 0:51 ` Jesse Gross 2016-01-20 1:31 ` Thomas Graf 0 siblings, 1 reply; 15+ messages in thread From: Jesse Gross @ 2016-01-20 0:51 UTC (permalink / raw) To: Eric Dumazet Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar, Thomas Graf On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-01-19 at 16:00 -0800, Jesse Gross wrote: > >> Oh, I wasn't disagreeing that there is a problem. I was just trying to >> point out the conceptual root of the problem rather than the >> individual instances in different tunnels implementations. >> >> In any case, I think we can solve this by calling >> skb_release_head_state() in napi_skb_finish() and switching early >> demux to use skb_valid_dst(). > > Yes, but it will also have to call skb_dst_drop() or risk a leak of > prior one. skb_dst_drop() is the key part but I mentioned skb_release_head_state() (which calls skb_dst_drop()) as a slightly more generic solution. But either one is fine. > So what is the purpose of having a dst if we need to drop it ? > > Adding code in GRO would be fine if someone explains me the purpose of > doing apparently useless work. > > (refcounting on dst is not exactly free) In the GRO case, the dst is only dropped on the packets which have been merged and therefore need to be freed (the GRO_MERGED_FREE case). It's not being thrown away for the overall frame, just metadata that has been duplicated on each individual frame, similar to the metadata in struct sk_buff itself. And while it is not used by the IP stack there are other consumers (eBPF/OVS/etc.). This entire process is controlled by the COLLECT_METADATA flag on tunnels, so there is no cost in situations where it is not actually used. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 0:51 ` Jesse Gross @ 2016-01-20 1:31 ` Thomas Graf 2016-01-20 1:40 ` Jesse Gross 2016-01-20 23:43 ` John 0 siblings, 2 replies; 15+ messages in thread From: Thomas Graf @ 2016-01-20 1:31 UTC (permalink / raw) To: Jesse Gross Cc: Eric Dumazet, John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On 01/19/16 at 04:51pm, Jesse Gross wrote: > On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > So what is the purpose of having a dst if we need to drop it ? > > > > Adding code in GRO would be fine if someone explains me the purpose of > > doing apparently useless work. > > > > (refcounting on dst is not exactly free) > > In the GRO case, the dst is only dropped on the packets which have > been merged and therefore need to be freed (the GRO_MERGED_FREE case). > It's not being thrown away for the overall frame, just metadata that > has been duplicated on each individual frame, similar to the metadata > in struct sk_buff itself. And while it is not used by the IP stack > there are other consumers (eBPF/OVS/etc.). This entire process is > controlled by the COLLECT_METADATA flag on tunnels, so there is no > cost in situations where it is not actually used. Right. There were thoughts around leveraging a per CPU scratch buffer without a refcount and turn it into a full reference when the packet gets enqueued somewhere but the need hasn't really come up yet. Jesse, is this what you have in mind: diff --git a/net/core/dev.c b/net/core/dev.c index cc9e365..3a5e96d 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb) break; case GRO_MERGED_FREE: - if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) + if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) { + skb_release_head_state(skb); kmem_cache_free(skbuff_head_cache, skb); - else + } else __kfree_skb(skb); break; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 1:31 ` Thomas Graf @ 2016-01-20 1:40 ` Jesse Gross 2016-01-20 23:43 ` John 1 sibling, 0 replies; 15+ messages in thread From: Jesse Gross @ 2016-01-20 1:40 UTC (permalink / raw) To: Thomas Graf Cc: Eric Dumazet, John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On Tue, Jan 19, 2016 at 5:31 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 01/19/16 at 04:51pm, Jesse Gross wrote: >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > So what is the purpose of having a dst if we need to drop it ? >> > >> > Adding code in GRO would be fine if someone explains me the purpose of >> > doing apparently useless work. >> > >> > (refcounting on dst is not exactly free) >> >> In the GRO case, the dst is only dropped on the packets which have >> been merged and therefore need to be freed (the GRO_MERGED_FREE case). >> It's not being thrown away for the overall frame, just metadata that >> has been duplicated on each individual frame, similar to the metadata >> in struct sk_buff itself. And while it is not used by the IP stack >> there are other consumers (eBPF/OVS/etc.). This entire process is >> controlled by the COLLECT_METADATA flag on tunnels, so there is no >> cost in situations where it is not actually used. > > Right. There were thoughts around leveraging a per CPU scratch > buffer without a refcount and turn it into a full reference when > the packet gets enqueued somewhere but the need hasn't really come > up yet. > > Jesse, is this what you have in mind: Yes, that's what I was thinking. I also realized that we should probably have a check in gro_list_prepare() to ensure that the dst of the packets that we are about to merge are equal. And we should fix IP early demux as Eric pointed out. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 1:31 ` Thomas Graf 2016-01-20 1:40 ` Jesse Gross @ 2016-01-20 23:43 ` John 2016-01-21 0:07 ` Eric Dumazet 1 sibling, 1 reply; 15+ messages in thread From: John @ 2016-01-20 23:43 UTC (permalink / raw) To: Thomas Graf, Jesse Gross Cc: Eric Dumazet, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On 01/19/2016 06:31 PM, Thomas Graf wrote: > On 01/19/16 at 04:51pm, Jesse Gross wrote: >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >>> So what is the purpose of having a dst if we need to drop it ? >>> >>> Adding code in GRO would be fine if someone explains me the purpose of >>> doing apparently useless work. >>> >>> (refcounting on dst is not exactly free) >> In the GRO case, the dst is only dropped on the packets which have >> been merged and therefore need to be freed (the GRO_MERGED_FREE case). >> It's not being thrown away for the overall frame, just metadata that >> has been duplicated on each individual frame, similar to the metadata >> in struct sk_buff itself. And while it is not used by the IP stack >> there are other consumers (eBPF/OVS/etc.). This entire process is >> controlled by the COLLECT_METADATA flag on tunnels, so there is no >> cost in situations where it is not actually used. > Right. There were thoughts around leveraging a per CPU scratch > buffer without a refcount and turn it into a full reference when > the packet gets enqueued somewhere but the need hasn't really come > up yet. > > Jesse, is this what you have in mind: > > diff --git a/net/core/dev.c b/net/core/dev.c > index cc9e365..3a5e96d 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb) > break; > > case GRO_MERGED_FREE: > - if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) > + if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) { > + skb_release_head_state(skb); > kmem_cache_free(skbuff_head_cache, skb); > - else > + } else > __kfree_skb(skb); > break; So I've tested the below patch (same as one above with minor modifications made to make it compile) and it worked - no memory leak. Should I submit this or...? diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 4355129..a8fac63 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2829,6 +2829,7 @@ int skb_zerocopy(struct sk_buff *to, struct sk_buff *from, void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen); void skb_scrub_packet(struct sk_buff *skb, bool xnet); +void skb_release_head_state(struct sk_buff *skb); unsigned int skb_gso_transport_seglen(const struct sk_buff *skb); struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t features); struct sk_buff *skb_vlan_untag(struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index ae00b89..76e3623 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4337,9 +4337,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb) break; case GRO_MERGED_FREE: - if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) + if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) { + skb_release_head_state(skb); kmem_cache_free(skbuff_head_cache, skb); - else + } else __kfree_skb(skb); break; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b2df375..45f6f50 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -633,7 +633,7 @@ fastpath: kmem_cache_free(skbuff_fclone_cache, fclones); } -static void skb_release_head_state(struct sk_buff *skb) +void skb_release_head_state(struct sk_buff *skb) { skb_dst_drop(skb); #ifdef CONFIG_XFRM ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-20 23:43 ` John @ 2016-01-21 0:07 ` Eric Dumazet 2016-01-21 0:19 ` Jesse Gross 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-01-21 0:07 UTC (permalink / raw) To: John Cc: Thomas Graf, Jesse Gross, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On Wed, 2016-01-20 at 16:43 -0700, John wrote: > > On 01/19/2016 06:31 PM, Thomas Graf wrote: > > On 01/19/16 at 04:51pm, Jesse Gross wrote: > >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >>> So what is the purpose of having a dst if we need to drop it ? > >>> > >>> Adding code in GRO would be fine if someone explains me the purpose of > >>> doing apparently useless work. > >>> > >>> (refcounting on dst is not exactly free) > >> In the GRO case, the dst is only dropped on the packets which have > >> been merged and therefore need to be freed (the GRO_MERGED_FREE case). > >> It's not being thrown away for the overall frame, just metadata that > >> has been duplicated on each individual frame, similar to the metadata > >> in struct sk_buff itself. And while it is not used by the IP stack > >> there are other consumers (eBPF/OVS/etc.). This entire process is > >> controlled by the COLLECT_METADATA flag on tunnels, so there is no > >> cost in situations where it is not actually used. > > Right. There were thoughts around leveraging a per CPU scratch > > buffer without a refcount and turn it into a full reference when > > the packet gets enqueued somewhere but the need hasn't really come > > up yet. > > > > Jesse, is this what you have in mind: > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index cc9e365..3a5e96d 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb) > > break; > > > > case GRO_MERGED_FREE: > > - if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) > > + if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) { > > + skb_release_head_state(skb); > > kmem_cache_free(skbuff_head_cache, skb); > > - else > > + } else > > __kfree_skb(skb); > > break; > So I've tested the below patch (same as one above with minor > modifications made to make it compile) and it worked - no memory leak. > Should I submit this or...? Unfortunately fix is not complete. As someone mentioned, GRO should not aggregate packets having different dst. This part is hard to achieve, as a pointer comparison wont be enough : Each skb has its own meta dst allocation. Quite frankly, I would rather disable GRO for packets with a dst, instead of making GRO dog slow. diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h index cf6c74550baa..124b8a5537e3 100644 --- a/include/net/gro_cells.h +++ b/include/net/gro_cells.h @@ -19,7 +19,10 @@ static inline void gro_cells_receive(struct gro_cells *gcells, struct sk_buff *s struct gro_cell *cell; struct net_device *dev = skb->dev; - if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) { + if (!gcells->cells || + skb->_skb_refdst || + skb_cloned(skb) || + !(dev->features & NETIF_F_GRO)) { netif_rx(skb); return; } ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-21 0:07 ` Eric Dumazet @ 2016-01-21 0:19 ` Jesse Gross 2016-01-21 0:34 ` Eric Dumazet 0 siblings, 1 reply; 15+ messages in thread From: Jesse Gross @ 2016-01-21 0:19 UTC (permalink / raw) To: Eric Dumazet Cc: John, Thomas Graf, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On Wed, Jan 20, 2016 at 4:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2016-01-20 at 16:43 -0700, John wrote: >> >> On 01/19/2016 06:31 PM, Thomas Graf wrote: >> > On 01/19/16 at 04:51pm, Jesse Gross wrote: >> >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> >>> So what is the purpose of having a dst if we need to drop it ? >> >>> >> >>> Adding code in GRO would be fine if someone explains me the purpose of >> >>> doing apparently useless work. >> >>> >> >>> (refcounting on dst is not exactly free) >> >> In the GRO case, the dst is only dropped on the packets which have >> >> been merged and therefore need to be freed (the GRO_MERGED_FREE case). >> >> It's not being thrown away for the overall frame, just metadata that >> >> has been duplicated on each individual frame, similar to the metadata >> >> in struct sk_buff itself. And while it is not used by the IP stack >> >> there are other consumers (eBPF/OVS/etc.). This entire process is >> >> controlled by the COLLECT_METADATA flag on tunnels, so there is no >> >> cost in situations where it is not actually used. >> > Right. There were thoughts around leveraging a per CPU scratch >> > buffer without a refcount and turn it into a full reference when >> > the packet gets enqueued somewhere but the need hasn't really come >> > up yet. >> > >> > Jesse, is this what you have in mind: >> > >> > diff --git a/net/core/dev.c b/net/core/dev.c >> > index cc9e365..3a5e96d 100644 >> > --- a/net/core/dev.c >> > +++ b/net/core/dev.c >> > @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb) >> > break; >> > >> > case GRO_MERGED_FREE: >> > - if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) >> > + if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) { >> > + skb_release_head_state(skb); >> > kmem_cache_free(skbuff_head_cache, skb); >> > - else >> > + } else >> > __kfree_skb(skb); >> > break; >> So I've tested the below patch (same as one above with minor >> modifications made to make it compile) and it worked - no memory leak. >> Should I submit this or...? > > Unfortunately fix is not complete. > > As someone mentioned, GRO should not aggregate packets having different > dst. > > This part is hard to achieve, as a pointer comparison wont be enough : > Each skb has its own meta dst allocation. > > Quite frankly, I would rather disable GRO for packets with a dst, > instead of making GRO dog slow. I have a patch that implements the comparison between dsts. For packets without a dst, there isn't really a cost and if we do have a dst then GRO is still a benefit. So it seems like it is worth doing, even if it is more expensive than is ideal. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-21 0:19 ` Jesse Gross @ 2016-01-21 0:34 ` Eric Dumazet 2016-01-21 2:27 ` Thomas Graf 0 siblings, 1 reply; 15+ messages in thread From: Eric Dumazet @ 2016-01-21 0:34 UTC (permalink / raw) To: Jesse Gross Cc: John, Thomas Graf, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On Wed, 2016-01-20 at 16:19 -0800, Jesse Gross wrote: > I have a patch that implements the comparison between dsts. For > packets without a dst, there isn't really a cost and if we do have a > dst then GRO is still a benefit. So it seems like it is worth doing, > even if it is more expensive than is ideal. You guys really want to kill GRO performance. Really the aggregation should happen at the first layer (ethernet device), instead of doing it after tunnel decap. This is already done for GRE, IPIP, SIT, ... GRO having to access metadata looks wrong, if you think about trying to do the same function in hardware (offload) ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Kernel memory leak in bnx2x driver with vxlan tunnel 2016-01-21 0:34 ` Eric Dumazet @ 2016-01-21 2:27 ` Thomas Graf 0 siblings, 0 replies; 15+ messages in thread From: Thomas Graf @ 2016-01-21 2:27 UTC (permalink / raw) To: Eric Dumazet Cc: Jesse Gross, John, Linux Kernel Network Developers, Tom Herbert, david.roth, Pravin B Shelar On 01/20/16 at 04:34pm, Eric Dumazet wrote: > On Wed, 2016-01-20 at 16:19 -0800, Jesse Gross wrote: > > > I have a patch that implements the comparison between dsts. For > > packets without a dst, there isn't really a cost and if we do have a > > dst then GRO is still a benefit. So it seems like it is worth doing, > > even if it is more expensive than is ideal. > > You guys really want to kill GRO performance. > > Really the aggregation should happen at the first layer (ethernet > device), instead of doing it after tunnel decap. > > This is already done for GRE, IPIP, SIT, ... > > GRO having to access metadata looks wrong, if you think about trying to > do the same function in hardware (offload) If I understand Jesse correctly then the added cost is only for metadata enabled packets. Though I agree, what's the benefit of doing GRO after decap? It seems like it's way too late and we've already paid the cost by going through the stack for each outer header packet. ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2016-01-21 2:27 UTC | newest] Thread overview: 15+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2016-01-14 17:17 Kernel memory leak in bnx2x driver with vxlan tunnel John 2016-01-19 21:07 ` Jesse Gross 2016-01-19 22:47 ` Eric Dumazet 2016-01-19 23:34 ` Jesse Gross 2016-01-19 23:51 ` Eric Dumazet 2016-01-20 0:00 ` Jesse Gross 2016-01-20 0:17 ` Eric Dumazet 2016-01-20 0:51 ` Jesse Gross 2016-01-20 1:31 ` Thomas Graf 2016-01-20 1:40 ` Jesse Gross 2016-01-20 23:43 ` John 2016-01-21 0:07 ` Eric Dumazet 2016-01-21 0:19 ` Jesse Gross 2016-01-21 0:34 ` Eric Dumazet 2016-01-21 2:27 ` Thomas Graf
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).