netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Kernel memory leak in bnx2x driver with vxlan tunnel
@ 2016-01-14 17:17 John
  2016-01-19 21:07 ` Jesse Gross
  0 siblings, 1 reply; 15+ messages in thread
From: John @ 2016-01-14 17:17 UTC (permalink / raw)
  To: netdev; +Cc: tom, david.roth

I'm getting what seems to be a kernel memory leak while doing a TCP
throughput test between two VMs on identical systems, in order to test a
broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
2.4.90. The host system of the receiving (server) VM leaks memory during the
throughput test. The memory leaks fast enough to make the system completely
unusable within five minutes. Once I stop the throughput test, the 
memory stops
leaking. A couple of times, the kernel on the host system has actually 
killed
the qemu process for me, but this doesn't happen reliably. The leaked memory
doesn't become available again even after the VM is killed.

To investigate this, I compiled a 4.4.0-rc8 kernel with kmemleak. I can 
scan the
leaking system during and after killing the throughput test and get the
following stack trace over and over again:

unreferenced object 0xffff880464f11488 (size 256):
   comm "softirq", pid 0, jiffies 4312675043 (age 379.184s)
   hex dump (first 32 bytes):
     6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
   backtrace:
     [<ffffffff81550e28>] kmemleak_alloc+0x28/0x50
     [<ffffffff811a8bac>] __kmalloc+0x11c/0x2a0
     [<ffffffff8146972e>] metadata_dst_alloc+0x1e/0x40
     [<ffffffff814ed476>] udp_tun_rx_dst+0x126/0x1c0
     [<ffffffff8140be68>] vxlan_udp_encap_recv+0x148/0xb10
     [<ffffffff814cabe9>] udp_queue_rcv_skb+0x1e9/0x480
     [<ffffffff814cb77c>] __udp4_lib_rcv+0x45c/0x700
     [<ffffffff814cbf0a>] udp_rcv+0x1a/0x20
     [<ffffffff8149ae54>] ip_local_deliver_finish+0x94/0x1e0
     [<ffffffff8149b150>] ip_local_deliver+0x60/0xd0
     [<ffffffff8149ab39>] ip_rcv_finish+0x99/0x320
     [<ffffffff8149b41e>] ip_rcv+0x25e/0x380
     [<ffffffff814602db>] __netif_receive_skb_core+0x2cb/0xa00
     [<ffffffff81460a26>] __netif_receive_skb+0x16/0x70
     [<ffffffff81460aa3>] netif_receive_skb_internal+0x23/0x80
     [<ffffffff814615f5>] napi_gro_receive+0xa5/0xd0

I pulled down the kernel tree from
http://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, and 
did a git
bisect and got this:

58ce31cca1ffe057f4744c3f671e3e84606d3d4a is the first bad commit
commit 58ce31cca1ffe057f4744c3f671e3e84606d3d4a
Author: Tom Herbert <tom@herbertland.com>
Date:   Wed Aug 19 17:07:33 2015 -0700

     vxlan: GRO support at tunnel layer

     Add calls to gro_cells infrastructure to do GRO when receiving on a 
tunnel.

     Testing:

     Ran 200 netperf TCP_STREAM instance

       - With fix (GRO enabled on VXLAN interface)

         Verify GRO is happening.

         9084 MBps tput
         3.44% CPU utilization

       - Without fix (GRO disabled on VXLAN interface)

         Verified no GRO is happening.

         9084 MBps tput
         5.54% CPU utilization

     Signed-off-by: Tom Herbert <tom@herbertland.com>
     Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 a7d49cb2e24ebddf620c01e27515cc756b32e46f 
c3951c16da75ff3e0db1322b8ccb3e61975b1242 M      drivers
:040000 040000 f36442958138eafdd472c58d06ea35be66990aa1 
0e29d513e575dd11f459c59df71e05db074363de M      include

For the test I'm using two HP Proliant dl360gen9's. I put two matching 
broadcom
PCIe cards in each machine and ran throughput tests between two VMs on 
either
machine, using the throughput testing program iperf3. On each host we 
had a qemu
VM attached to an OVS bridge; these bridges are connected over a VxLAN 
tunnel as
detailed here: https://community.mellanox.com/docs/DOC-1446.

The test went well with an Intel Niantic NIC, and I saw high (8.04 Gb/s)
throughput over an eighteen hour throughput test. There was no memory leak.
However, when I put in a Broadcom NIC on both systems I get the above memory
leak, if they have a VM on the receving end of the test. lspci -v output 
for the
Broadcom NIC below:

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57810 
10 Gigabit Ethernet (rev 10)
         Subsystem: Hewlett-Packard Company HP FlexFabric 10Gb 2-port 
534FLR-SFP+ Adapter
         Flags: bus master, fast devsel, latency 0
         Memory at 97000000 (64-bit, prefetchable) [size=8M]
         Memory at 96800000 (64-bit, prefetchable) [size=8M]
         Memory at 98800000 (64-bit, prefetchable) [size=64K]
         [virtual] Expansion ROM at 98880000 [disabled] [size=512K]
         Capabilities: [48] Power Management version 3
         Capabilities: [50] Vital Product Data
         Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
         Capabilities: [ac] Express Endpoint, MSI 00
         Capabilities: [100] Advanced Error Reporting
         Capabilities: [13c] Device Serial Number 14-58-d0-ff-fe-52-5b-d8
         Capabilities: [150] Power Budgeting <?>
         Capabilities: [160] Virtual Channel
         Capabilities: [1b8] Alternative Routing-ID Interpretation (ARI)
         Capabilities: [1c0] Single Root I/O Virtualization (SR-IOV)
         Capabilities: [220] #15
         Capabilities: [300] #19
         Kernel driver in use: bnx2x

ethtool -i info:
driver: bnx2x
version: 1.712.30-0
firmware-version: bc 7.8.24
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

dmesg bnx2x output:
[    1.506071] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver 
bnx2x 1.712.30-0 (2014/02/10)
[    1.506205] bnx2x 0000:04:00.0: msix capability found
[    1.506297] bnx2x 0000:04:00.0: part number 
394D4342-31383735-31543030-47303030
[    1.555970] bnx2x 0000:04:00.1: msix capability found
[    1.556061] bnx2x 0000:04:00.1: part number 
394D4342-31383735-31543030-47303030
[   10.360477] bnx2x 0000:04:00.1 eth9: renamed from eth3
[   10.584371] bnx2x 0000:04:00.0 rename3: renamed from eth1
[  588.956002] bnx2x 0000:04:00.0 rename3: using MSI-X  IRQs: sp 70 
fp[0] 74 ... fp[7] 81
[  589.208675] bnx2x 0000:04:00.0 rename3: Added vxlan dest port 4789
[  640.159842] bnx2x 0000:04:00.1 eth10: renamed from eth9
[  642.432216] bnx2x 0000:04:00.1 eth10: using MSI-X  IRQs: sp 82 fp[0] 
84 ... fp[7] 91
[  642.700576] bnx2x 0000:04:00.1 eth10: Added vxlan dest port 4789
[ 1098.368845] bnx2x 0000:04:00.1 eth10: using MSI-X  IRQs: sp 82 fp[0] 
84 ... fp[7] 91
[ 1109.277182] bnx2x 0000:04:00.1 eth10_nolink: renamed from eth10
[ 1115.368873] bnx2x 0000:04:00.0 eth10: renamed from rename3
[ 1117.928156] bnx2x 0000:04:00.0 eth10: using MSI-X  IRQs: sp 70 fp[0] 
74 ... fp[7] 81
[ 1118.214861] bnx2x 0000:04:00.0 eth10: NIC Link is Up, 10000 Mbps full 
duplex, Flow control: ON - receive & transmit

I've tried disabling all offloads (gro included) but the leak still 
happens.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-14 17:17 Kernel memory leak in bnx2x driver with vxlan tunnel John
@ 2016-01-19 21:07 ` Jesse Gross
  2016-01-19 22:47   ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jesse Gross @ 2016-01-19 21:07 UTC (permalink / raw)
  To: John; +Cc: Linux Kernel Network Developers, Tom Herbert, david.roth

On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote:
> I'm getting what seems to be a kernel memory leak while doing a TCP
> throughput test between two VMs on identical systems, in order to test a
> broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
> 2.4.90. The host system of the receiving (server) VM leaks memory during the
> throughput test. The memory leaks fast enough to make the system completely
> unusable within five minutes. Once I stop the throughput test, the memory
> stops
> leaking. A couple of times, the kernel on the host system has actually
> killed
> the qemu process for me, but this doesn't happen reliably. The leaked memory
> doesn't become available again even after the VM is killed.

It looks like the problem is in napi_skb_finish(). If we when do
GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free ==
NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself
but not any of the associated elements. Historically, this would have
been OK but these days we will have allocated a dst entry already for
tunnel metadata, which will get leaked.

If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a
__kfree_skb(), which will release the dst entry. That would explain
why some drivers have the problem but not others since the memory is
laid out differently.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-19 21:07 ` Jesse Gross
@ 2016-01-19 22:47   ` Eric Dumazet
  2016-01-19 23:34     ` Jesse Gross
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-01-19 22:47 UTC (permalink / raw)
  To: Jesse Gross
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar

On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote:
> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote:
> > I'm getting what seems to be a kernel memory leak while doing a TCP
> > throughput test between two VMs on identical systems, in order to test a
> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
> > 2.4.90. The host system of the receiving (server) VM leaks memory during the
> > throughput test. The memory leaks fast enough to make the system completely
> > unusable within five minutes. Once I stop the throughput test, the memory
> > stops
> > leaking. A couple of times, the kernel on the host system has actually
> > killed
> > the qemu process for me, but this doesn't happen reliably. The leaked memory
> > doesn't become available again even after the VM is killed.
> 
> It looks like the problem is in napi_skb_finish(). If we when do
> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free ==
> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself
> but not any of the associated elements. Historically, this would have
> been OK but these days we will have allocated a dst entry already for
> tunnel metadata, which will get leaked.
> 
> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a
> __kfree_skb(), which will release the dst entry. That would explain
> why some drivers have the problem but not others since the memory is
> laid out differently.


Wow.... What is the purpose of using skb_dst_set() on skb before calling
gro_cells_receive() exactly ?

Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not
helpful :

    Following patch create new tunnel flag which enable
    tunnel metadata collection on given device.

This is rather strange since later the dst is thrown away with the
skb_valid_dst() test.

Note also that IP early demux is broken as well, since it does not use
skb_valid_dst() but a simple :

if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) {
 ...
}

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-19 22:47   ` Eric Dumazet
@ 2016-01-19 23:34     ` Jesse Gross
  2016-01-19 23:51       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jesse Gross @ 2016-01-19 23:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar, Thomas Graf

On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote:
>> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote:
>> > I'm getting what seems to be a kernel memory leak while doing a TCP
>> > throughput test between two VMs on identical systems, in order to test a
>> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
>> > 2.4.90. The host system of the receiving (server) VM leaks memory during the
>> > throughput test. The memory leaks fast enough to make the system completely
>> > unusable within five minutes. Once I stop the throughput test, the memory
>> > stops
>> > leaking. A couple of times, the kernel on the host system has actually
>> > killed
>> > the qemu process for me, but this doesn't happen reliably. The leaked memory
>> > doesn't become available again even after the VM is killed.
>>
>> It looks like the problem is in napi_skb_finish(). If we when do
>> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free ==
>> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself
>> but not any of the associated elements. Historically, this would have
>> been OK but these days we will have allocated a dst entry already for
>> tunnel metadata, which will get leaked.
>>
>> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a
>> __kfree_skb(), which will release the dst entry. That would explain
>> why some drivers have the problem but not others since the memory is
>> laid out differently.
>
>
> Wow.... What is the purpose of using skb_dst_set() on skb before calling
> gro_cells_receive() exactly ?
>
> Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not
> helpful :
>
>     Following patch create new tunnel flag which enable
>     tunnel metadata collection on given device.

Note that this isn't really the problem commit. The general issue is
lightweight tunnels - the above commit is just adding support for GRE
in a way that I think follows the existing model.

> This is rather strange since later the dst is thrown away with the
> skb_valid_dst() test.

This isn't really an IP routing dst - it's the tunnel encapsulation
information - so there isn't much to with it in ip_rcv_finish().
However, the information can be consumed in other places, such as
through eBPF.

> Note also that IP early demux is broken as well, since it does not use
> skb_valid_dst() but a simple :
>
> if (sysctl_ip_early_demux && !skb_dst(skb) && !skb->sk) {
>  ...
> }

I agree that this should use the helper function.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-19 23:34     ` Jesse Gross
@ 2016-01-19 23:51       ` Eric Dumazet
  2016-01-20  0:00         ` Jesse Gross
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-01-19 23:51 UTC (permalink / raw)
  To: Jesse Gross
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar, Thomas Graf

On Tue, 2016-01-19 at 15:34 -0800, Jesse Gross wrote:
> On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote:
> >> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote:
> >> > I'm getting what seems to be a kernel memory leak while doing a TCP
> >> > throughput test between two VMs on identical systems, in order to test a
> >> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
> >> > 2.4.90. The host system of the receiving (server) VM leaks memory during the
> >> > throughput test. The memory leaks fast enough to make the system completely
> >> > unusable within five minutes. Once I stop the throughput test, the memory
> >> > stops
> >> > leaking. A couple of times, the kernel on the host system has actually
> >> > killed
> >> > the qemu process for me, but this doesn't happen reliably. The leaked memory
> >> > doesn't become available again even after the VM is killed.
> >>
> >> It looks like the problem is in napi_skb_finish(). If we when do
> >> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free ==
> >> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself
> >> but not any of the associated elements. Historically, this would have
> >> been OK but these days we will have allocated a dst entry already for
> >> tunnel metadata, which will get leaked.
> >>
> >> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a
> >> __kfree_skb(), which will release the dst entry. That would explain
> >> why some drivers have the problem but not others since the memory is
> >> laid out differently.
> >
> >
> > Wow.... What is the purpose of using skb_dst_set() on skb before calling
> > gro_cells_receive() exactly ?
> >
> > Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not
> > helpful :
> >
> >     Following patch create new tunnel flag which enable
> >     tunnel metadata collection on given device.
> 
> Note that this isn't really the problem commit. The general issue is
> lightweight tunnels - the above commit is just adding support for GRE
> in a way that I think follows the existing model.

I believe this commit added the first skb_dst_set() before
gro_cells_receive(), in ip_tunnel_rcv().

This is already buggy.

Then Tom in 58ce31cca1ffe057f4744c3f671e3e84606d3d4a added the
gro_cells_receive() in vxlan_rcv(), which added another bug, because of
existing skb_dst_set() call.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-19 23:51       ` Eric Dumazet
@ 2016-01-20  0:00         ` Jesse Gross
  2016-01-20  0:17           ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jesse Gross @ 2016-01-20  0:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar, Thomas Graf

On Tue, Jan 19, 2016 at 3:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-01-19 at 15:34 -0800, Jesse Gross wrote:
>> On Tue, Jan 19, 2016 at 2:47 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Tue, 2016-01-19 at 13:07 -0800, Jesse Gross wrote:
>> >> On Thu, Jan 14, 2016 at 9:17 AM, John <john.phillips5@hpe.com> wrote:
>> >> > I'm getting what seems to be a kernel memory leak while doing a TCP
>> >> > throughput test between two VMs on identical systems, in order to test a
>> >> > broadcom NIC's performance with a kernel 4.4.0-rc8 and OpenVSwitch version
>> >> > 2.4.90. The host system of the receiving (server) VM leaks memory during the
>> >> > throughput test. The memory leaks fast enough to make the system completely
>> >> > unusable within five minutes. Once I stop the throughput test, the memory
>> >> > stops
>> >> > leaking. A couple of times, the kernel on the host system has actually
>> >> > killed
>> >> > the qemu process for me, but this doesn't happen reliably. The leaked memory
>> >> > doesn't become available again even after the VM is killed.
>> >>
>> >> It looks like the problem is in napi_skb_finish(). If we when do
>> >> GRO_MERGED_FREE we have NAPI_GRO_CB(skb)->free ==
>> >> NAPI_GRO_FREE_STOLEN_HEAD then we will just free the skb memory itself
>> >> but not any of the associated elements. Historically, this would have
>> >> been OK but these days we will have allocated a dst entry already for
>> >> tunnel metadata, which will get leaked.
>> >>
>> >> If we don't have NAPI_GRO_FREE_STOLEN_HEAD then we'll do a
>> >> __kfree_skb(), which will release the dst entry. That would explain
>> >> why some drivers have the problem but not others since the memory is
>> >> laid out differently.
>> >
>> >
>> > Wow.... What is the purpose of using skb_dst_set() on skb before calling
>> > gro_cells_receive() exactly ?
>> >
>> > Commit 2e15ea390e6f4466655066d97e22ec66870a042c changelog is not
>> > helpful :
>> >
>> >     Following patch create new tunnel flag which enable
>> >     tunnel metadata collection on given device.
>>
>> Note that this isn't really the problem commit. The general issue is
>> lightweight tunnels - the above commit is just adding support for GRE
>> in a way that I think follows the existing model.
>
> I believe this commit added the first skb_dst_set() before
> gro_cells_receive(), in ip_tunnel_rcv().
>
> This is already buggy.
>
> Then Tom in 58ce31cca1ffe057f4744c3f671e3e84606d3d4a added the
> gro_cells_receive() in vxlan_rcv(), which added another bug, because of
> existing skb_dst_set() call.

Oh, I wasn't disagreeing that there is a problem. I was just trying to
point out the conceptual root of the problem rather than the
individual instances in different tunnels implementations.

In any case, I think we can solve this by calling
skb_release_head_state() in napi_skb_finish() and switching early
demux to use skb_valid_dst().

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20  0:00         ` Jesse Gross
@ 2016-01-20  0:17           ` Eric Dumazet
  2016-01-20  0:51             ` Jesse Gross
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-01-20  0:17 UTC (permalink / raw)
  To: Jesse Gross
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar, Thomas Graf

On Tue, 2016-01-19 at 16:00 -0800, Jesse Gross wrote:

> Oh, I wasn't disagreeing that there is a problem. I was just trying to
> point out the conceptual root of the problem rather than the
> individual instances in different tunnels implementations.
> 
> In any case, I think we can solve this by calling
> skb_release_head_state() in napi_skb_finish() and switching early
> demux to use skb_valid_dst().

Yes, but it will also have to call skb_dst_drop() or risk a leak of
prior one.

So what is the purpose of having a dst if we need to drop it ?

Adding code in GRO would be fine if someone explains me the purpose of
doing apparently useless work.

(refcounting on dst is not exactly free)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20  0:17           ` Eric Dumazet
@ 2016-01-20  0:51             ` Jesse Gross
  2016-01-20  1:31               ` Thomas Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Jesse Gross @ 2016-01-20  0:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John, Linux Kernel Network Developers, Tom Herbert, david.roth,
	Pravin B Shelar, Thomas Graf

On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2016-01-19 at 16:00 -0800, Jesse Gross wrote:
>
>> Oh, I wasn't disagreeing that there is a problem. I was just trying to
>> point out the conceptual root of the problem rather than the
>> individual instances in different tunnels implementations.
>>
>> In any case, I think we can solve this by calling
>> skb_release_head_state() in napi_skb_finish() and switching early
>> demux to use skb_valid_dst().
>
> Yes, but it will also have to call skb_dst_drop() or risk a leak of
> prior one.

skb_dst_drop() is the key part but I mentioned
skb_release_head_state() (which calls skb_dst_drop()) as a slightly
more generic solution. But either one is fine.

> So what is the purpose of having a dst if we need to drop it ?
>
> Adding code in GRO would be fine if someone explains me the purpose of
> doing apparently useless work.
>
> (refcounting on dst is not exactly free)

In the GRO case, the dst is only dropped on the packets which have
been merged and therefore need to be freed (the GRO_MERGED_FREE case).
It's not being thrown away for the overall frame, just metadata that
has been duplicated on each individual frame, similar to the metadata
in struct sk_buff itself. And while it is not used by the IP stack
there are other consumers (eBPF/OVS/etc.). This entire process is
controlled by the COLLECT_METADATA flag on tunnels, so there is no
cost in situations where it is not actually used.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20  0:51             ` Jesse Gross
@ 2016-01-20  1:31               ` Thomas Graf
  2016-01-20  1:40                 ` Jesse Gross
  2016-01-20 23:43                 ` John
  0 siblings, 2 replies; 15+ messages in thread
From: Thomas Graf @ 2016-01-20  1:31 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, John, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar

On 01/19/16 at 04:51pm, Jesse Gross wrote:
> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > So what is the purpose of having a dst if we need to drop it ?
> >
> > Adding code in GRO would be fine if someone explains me the purpose of
> > doing apparently useless work.
> >
> > (refcounting on dst is not exactly free)
> 
> In the GRO case, the dst is only dropped on the packets which have
> been merged and therefore need to be freed (the GRO_MERGED_FREE case).
> It's not being thrown away for the overall frame, just metadata that
> has been duplicated on each individual frame, similar to the metadata
> in struct sk_buff itself. And while it is not used by the IP stack
> there are other consumers (eBPF/OVS/etc.). This entire process is
> controlled by the COLLECT_METADATA flag on tunnels, so there is no
> cost in situations where it is not actually used.

Right. There were thoughts around leveraging a per CPU scratch
buffer without a refcount and turn it into a full reference when
the packet gets enqueued somewhere but the need hasn't really come
up yet.

Jesse, is this what you have in mind:

diff --git a/net/core/dev.c b/net/core/dev.c
index cc9e365..3a5e96d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
                break;
 
        case GRO_MERGED_FREE:
-               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+                       skb_release_head_state(skb);
                        kmem_cache_free(skbuff_head_cache, skb);
-               else
+               } else
                        __kfree_skb(skb);
                break;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20  1:31               ` Thomas Graf
@ 2016-01-20  1:40                 ` Jesse Gross
  2016-01-20 23:43                 ` John
  1 sibling, 0 replies; 15+ messages in thread
From: Jesse Gross @ 2016-01-20  1:40 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Eric Dumazet, John, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar

On Tue, Jan 19, 2016 at 5:31 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 01/19/16 at 04:51pm, Jesse Gross wrote:
>> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > So what is the purpose of having a dst if we need to drop it ?
>> >
>> > Adding code in GRO would be fine if someone explains me the purpose of
>> > doing apparently useless work.
>> >
>> > (refcounting on dst is not exactly free)
>>
>> In the GRO case, the dst is only dropped on the packets which have
>> been merged and therefore need to be freed (the GRO_MERGED_FREE case).
>> It's not being thrown away for the overall frame, just metadata that
>> has been duplicated on each individual frame, similar to the metadata
>> in struct sk_buff itself. And while it is not used by the IP stack
>> there are other consumers (eBPF/OVS/etc.). This entire process is
>> controlled by the COLLECT_METADATA flag on tunnels, so there is no
>> cost in situations where it is not actually used.
>
> Right. There were thoughts around leveraging a per CPU scratch
> buffer without a refcount and turn it into a full reference when
> the packet gets enqueued somewhere but the need hasn't really come
> up yet.
>
> Jesse, is this what you have in mind:

Yes, that's what I was thinking. I also realized that we should
probably have a check in gro_list_prepare() to ensure that the dst of
the packets that we are about to merge are equal. And we should fix IP
early demux as Eric pointed out.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20  1:31               ` Thomas Graf
  2016-01-20  1:40                 ` Jesse Gross
@ 2016-01-20 23:43                 ` John
  2016-01-21  0:07                   ` Eric Dumazet
  1 sibling, 1 reply; 15+ messages in thread
From: John @ 2016-01-20 23:43 UTC (permalink / raw)
  To: Thomas Graf, Jesse Gross
  Cc: Eric Dumazet, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar



On 01/19/2016 06:31 PM, Thomas Graf wrote:
> On 01/19/16 at 04:51pm, Jesse Gross wrote:
>> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> So what is the purpose of having a dst if we need to drop it ?
>>>
>>> Adding code in GRO would be fine if someone explains me the purpose of
>>> doing apparently useless work.
>>>
>>> (refcounting on dst is not exactly free)
>> In the GRO case, the dst is only dropped on the packets which have
>> been merged and therefore need to be freed (the GRO_MERGED_FREE case).
>> It's not being thrown away for the overall frame, just metadata that
>> has been duplicated on each individual frame, similar to the metadata
>> in struct sk_buff itself. And while it is not used by the IP stack
>> there are other consumers (eBPF/OVS/etc.). This entire process is
>> controlled by the COLLECT_METADATA flag on tunnels, so there is no
>> cost in situations where it is not actually used.
> Right. There were thoughts around leveraging a per CPU scratch
> buffer without a refcount and turn it into a full reference when
> the packet gets enqueued somewhere but the need hasn't really come
> up yet.
>
> Jesse, is this what you have in mind:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index cc9e365..3a5e96d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
>                  break;
>   
>          case GRO_MERGED_FREE:
> -               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
> +               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
> +                       skb_release_head_state(skb);
>                          kmem_cache_free(skbuff_head_cache, skb);
> -               else
> +               } else
>                          __kfree_skb(skb);
>                  break;
So I've tested the below patch (same as one above with minor 
modifications made to make it compile) and it worked - no memory leak. 
Should I submit this or...?

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4355129..a8fac63 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2829,6 +2829,7 @@ int skb_zerocopy(struct sk_buff *to, struct 
sk_buff *from,
  void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len);
  int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen);
  void skb_scrub_packet(struct sk_buff *skb, bool xnet);
+void skb_release_head_state(struct sk_buff *skb);
  unsigned int skb_gso_transport_seglen(const struct sk_buff *skb);
  struct sk_buff *skb_segment(struct sk_buff *skb, netdev_features_t 
features);
  struct sk_buff *skb_vlan_untag(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index ae00b89..76e3623 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4337,9 +4337,10 @@ static gro_result_t napi_skb_finish(gro_result_t 
ret, struct sk_buff *skb)
          break;

      case GRO_MERGED_FREE:
-        if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
+        if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
+                        skb_release_head_state(skb);
              kmem_cache_free(skbuff_head_cache, skb);
-        else
+                } else
              __kfree_skb(skb);
          break;

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375..45f6f50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -633,7 +633,7 @@ fastpath:
      kmem_cache_free(skbuff_fclone_cache, fclones);
  }

-static void skb_release_head_state(struct sk_buff *skb)
+void skb_release_head_state(struct sk_buff *skb)
  {
      skb_dst_drop(skb);
  #ifdef CONFIG_XFRM

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-20 23:43                 ` John
@ 2016-01-21  0:07                   ` Eric Dumazet
  2016-01-21  0:19                     ` Jesse Gross
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-01-21  0:07 UTC (permalink / raw)
  To: John
  Cc: Thomas Graf, Jesse Gross, Linux Kernel Network Developers,
	Tom Herbert, david.roth, Pravin B Shelar

On Wed, 2016-01-20 at 16:43 -0700, John wrote:
> 
> On 01/19/2016 06:31 PM, Thomas Graf wrote:
> > On 01/19/16 at 04:51pm, Jesse Gross wrote:
> >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>> So what is the purpose of having a dst if we need to drop it ?
> >>>
> >>> Adding code in GRO would be fine if someone explains me the purpose of
> >>> doing apparently useless work.
> >>>
> >>> (refcounting on dst is not exactly free)
> >> In the GRO case, the dst is only dropped on the packets which have
> >> been merged and therefore need to be freed (the GRO_MERGED_FREE case).
> >> It's not being thrown away for the overall frame, just metadata that
> >> has been duplicated on each individual frame, similar to the metadata
> >> in struct sk_buff itself. And while it is not used by the IP stack
> >> there are other consumers (eBPF/OVS/etc.). This entire process is
> >> controlled by the COLLECT_METADATA flag on tunnels, so there is no
> >> cost in situations where it is not actually used.
> > Right. There were thoughts around leveraging a per CPU scratch
> > buffer without a refcount and turn it into a full reference when
> > the packet gets enqueued somewhere but the need hasn't really come
> > up yet.
> >
> > Jesse, is this what you have in mind:
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index cc9e365..3a5e96d 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
> >                  break;
> >   
> >          case GRO_MERGED_FREE:
> > -               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
> > +               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
> > +                       skb_release_head_state(skb);
> >                          kmem_cache_free(skbuff_head_cache, skb);
> > -               else
> > +               } else
> >                          __kfree_skb(skb);
> >                  break;
> So I've tested the below patch (same as one above with minor 
> modifications made to make it compile) and it worked - no memory leak. 
> Should I submit this or...?

Unfortunately fix is not complete.

As someone mentioned, GRO should not aggregate packets having different
dst.

This part is hard to achieve, as a pointer comparison wont be enough :
Each skb has its own meta dst allocation.

Quite frankly, I would rather disable GRO for packets with a dst,
instead of making GRO dog slow.

diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
index cf6c74550baa..124b8a5537e3 100644
--- a/include/net/gro_cells.h
+++ b/include/net/gro_cells.h
@@ -19,7 +19,10 @@ static inline void gro_cells_receive(struct gro_cells *gcells, struct sk_buff *s
 	struct gro_cell *cell;
 	struct net_device *dev = skb->dev;
 
-	if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO)) {
+	if (!gcells->cells ||
+	    skb->_skb_refdst ||
+	    skb_cloned(skb) ||
+	    !(dev->features & NETIF_F_GRO)) {
 		netif_rx(skb);
 		return;
 	}

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-21  0:07                   ` Eric Dumazet
@ 2016-01-21  0:19                     ` Jesse Gross
  2016-01-21  0:34                       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Jesse Gross @ 2016-01-21  0:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John, Thomas Graf, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar

On Wed, Jan 20, 2016 at 4:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-01-20 at 16:43 -0700, John wrote:
>>
>> On 01/19/2016 06:31 PM, Thomas Graf wrote:
>> > On 01/19/16 at 04:51pm, Jesse Gross wrote:
>> >> On Tue, Jan 19, 2016 at 4:17 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >>> So what is the purpose of having a dst if we need to drop it ?
>> >>>
>> >>> Adding code in GRO would be fine if someone explains me the purpose of
>> >>> doing apparently useless work.
>> >>>
>> >>> (refcounting on dst is not exactly free)
>> >> In the GRO case, the dst is only dropped on the packets which have
>> >> been merged and therefore need to be freed (the GRO_MERGED_FREE case).
>> >> It's not being thrown away for the overall frame, just metadata that
>> >> has been duplicated on each individual frame, similar to the metadata
>> >> in struct sk_buff itself. And while it is not used by the IP stack
>> >> there are other consumers (eBPF/OVS/etc.). This entire process is
>> >> controlled by the COLLECT_METADATA flag on tunnels, so there is no
>> >> cost in situations where it is not actually used.
>> > Right. There were thoughts around leveraging a per CPU scratch
>> > buffer without a refcount and turn it into a full reference when
>> > the packet gets enqueued somewhere but the need hasn't really come
>> > up yet.
>> >
>> > Jesse, is this what you have in mind:
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index cc9e365..3a5e96d 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4548,9 +4548,10 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
>> >                  break;
>> >
>> >          case GRO_MERGED_FREE:
>> > -               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
>> > +               if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD) {
>> > +                       skb_release_head_state(skb);
>> >                          kmem_cache_free(skbuff_head_cache, skb);
>> > -               else
>> > +               } else
>> >                          __kfree_skb(skb);
>> >                  break;
>> So I've tested the below patch (same as one above with minor
>> modifications made to make it compile) and it worked - no memory leak.
>> Should I submit this or...?
>
> Unfortunately fix is not complete.
>
> As someone mentioned, GRO should not aggregate packets having different
> dst.
>
> This part is hard to achieve, as a pointer comparison wont be enough :
> Each skb has its own meta dst allocation.
>
> Quite frankly, I would rather disable GRO for packets with a dst,
> instead of making GRO dog slow.

I have a patch that implements the comparison between dsts. For
packets without a dst, there isn't really a cost and if we do have a
dst then GRO is still a benefit. So it seems like it is worth doing,
even if it is more expensive than is ideal.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-21  0:19                     ` Jesse Gross
@ 2016-01-21  0:34                       ` Eric Dumazet
  2016-01-21  2:27                         ` Thomas Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-01-21  0:34 UTC (permalink / raw)
  To: Jesse Gross
  Cc: John, Thomas Graf, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar

On Wed, 2016-01-20 at 16:19 -0800, Jesse Gross wrote:

> I have a patch that implements the comparison between dsts. For
> packets without a dst, there isn't really a cost and if we do have a
> dst then GRO is still a benefit. So it seems like it is worth doing,
> even if it is more expensive than is ideal.

You guys really want to kill GRO performance.

Really the aggregation should happen at the first layer (ethernet
device), instead of doing it after tunnel decap.

This is already done for GRE, IPIP, SIT, ...

GRO having to access metadata looks wrong, if you think about trying to
do the same function in hardware (offload)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Kernel memory leak in bnx2x driver with vxlan tunnel
  2016-01-21  0:34                       ` Eric Dumazet
@ 2016-01-21  2:27                         ` Thomas Graf
  0 siblings, 0 replies; 15+ messages in thread
From: Thomas Graf @ 2016-01-21  2:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesse Gross, John, Linux Kernel Network Developers, Tom Herbert,
	david.roth, Pravin B Shelar

On 01/20/16 at 04:34pm, Eric Dumazet wrote:
> On Wed, 2016-01-20 at 16:19 -0800, Jesse Gross wrote:
> 
> > I have a patch that implements the comparison between dsts. For
> > packets without a dst, there isn't really a cost and if we do have a
> > dst then GRO is still a benefit. So it seems like it is worth doing,
> > even if it is more expensive than is ideal.
> 
> You guys really want to kill GRO performance.
> 
> Really the aggregation should happen at the first layer (ethernet
> device), instead of doing it after tunnel decap.
> 
> This is already done for GRE, IPIP, SIT, ...
> 
> GRO having to access metadata looks wrong, if you think about trying to
> do the same function in hardware (offload)

If I understand Jesse correctly then the added cost is only for
metadata enabled packets. Though I agree, what's the benefit of
doing GRO after decap?

It seems like it's way too late and we've already paid the cost
by going through the stack for each outer header packet.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-01-21  2:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-14 17:17 Kernel memory leak in bnx2x driver with vxlan tunnel John
2016-01-19 21:07 ` Jesse Gross
2016-01-19 22:47   ` Eric Dumazet
2016-01-19 23:34     ` Jesse Gross
2016-01-19 23:51       ` Eric Dumazet
2016-01-20  0:00         ` Jesse Gross
2016-01-20  0:17           ` Eric Dumazet
2016-01-20  0:51             ` Jesse Gross
2016-01-20  1:31               ` Thomas Graf
2016-01-20  1:40                 ` Jesse Gross
2016-01-20 23:43                 ` John
2016-01-21  0:07                   ` Eric Dumazet
2016-01-21  0:19                     ` Jesse Gross
2016-01-21  0:34                       ` Eric Dumazet
2016-01-21  2:27                         ` Thomas Graf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).