netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ido Schimmel <idosch@idosch.org>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	netdev <netdev@vger.kernel.org>,
	Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
Date: Wed, 27 Apr 2022 18:34:44 +0300	[thread overview]
Message-ID: <YmlilMi5MmApVqTX@shredder> (raw)
In-Reply-To: <20220422201237.416238-1-eric.dumazet@gmail.com>

On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skbs
> frees outside of critical section where socket lock was held.
> 
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> being ideal.
> 
> For RPC traffic, recvmsg() has to return to user space right after
> skb payload has been consumed, meaning that BH handler has no chance
> to pick the skb before recvmsg() thread. This issue is more visible
> with BIG TCP, as more RPC fit one skb.
> 
> For RFS, even if BH handler picks the skbs, they are still picked
> from the cpu on which user thread is running.
> 
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
> 
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
> 
> This new per-cpu list is drained at the end of net_action_rx(),
> after incoming packets have been processed, to lower latencies.
> 
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) cases where the cpu does not
> run net_action_rx() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
> 
> Also, we do not bother draining the per-cpu list from dev_cpu_dead()
> This is because skbs in this list have no requirement on how fast
> they should be freed.
> 
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
> 
> Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
> and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
> page recycling strategy used by NIC driver (its page pool capacity
> being too small compared to number of skbs/pages held in sockets
> receive queues)
> 
> Note that this tuning was only done to demonstrate worse
> conditions for skb freeing for this particular test.
> These conditions can happen in more general production workload.

[...]

> Signed-off-by: Eric Dumazet <edumazet@google.com>

Eric, with this patch I'm seeing memory leaks such as these [1][2] after
boot. The system is using the igb driver for its management interface
[3]. The leaks disappear after reverting the patch.

Any ideas?

Let me know if you need more info. I can easily test a patch.

Thanks

[1]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888170143740 (size 216):
  comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff82571fc0>] napi_skb_cache_get+0xf0/0x180
    [<ffffffff8257206a>] __napi_build_skb+0x1a/0x60
    [<ffffffff8257b0f3>] napi_build_skb+0x23/0x350
    [<ffffffffa0469592>] igb_poll+0x2b72/0x5880 [igb]
    [<ffffffff825f9584>] __napi_poll.constprop.0+0xb4/0x480
    [<ffffffff825f9d5a>] net_rx_action+0x40a/0xc60
    [<ffffffff82e00295>] __do_softirq+0x295/0x9fe
    [<ffffffff81185bcc>] __irq_exit_rcu+0x11c/0x180
    [<ffffffff8118622a>] irq_exit_rcu+0xa/0x20
    [<ffffffff82bbed39>] common_interrupt+0xa9/0xc0
    [<ffffffff82c00b5e>] asm_common_interrupt+0x1e/0x40
    [<ffffffff824a186e>] cpuidle_enter_state+0x27e/0xcb0
    [<ffffffff824a236f>] cpuidle_enter+0x4f/0xa0
    [<ffffffff8126a290>] do_idle+0x3b0/0x4b0
    [<ffffffff8126a869>] cpu_startup_entry+0x19/0x20
    [<ffffffff810f4725>] start_secondary+0x265/0x340

[2]
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810ce3aac0 (size 216):
  comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8290bd3c>] __tcp_send_ack.part.0+0x6c/0x760
    [<ffffffff8291a062>] tcp_send_ack+0x82/0xa0
    [<ffffffff828cb6db>] __tcp_ack_snd_check+0x15b/0xa00
    [<ffffffff828f17fe>] tcp_rcv_established+0x198e/0x2120
    [<ffffffff829363b5>] tcp_v4_do_rcv+0x665/0x9a0
    [<ffffffff8293d8ae>] tcp_v4_rcv+0x2c1e/0x32f0
    [<ffffffff828610b3>] ip_protocol_deliver_rcu+0x53/0x2c0
    [<ffffffff828616eb>] ip_local_deliver+0x3cb/0x620
    [<ffffffff8285e66f>] ip_sublist_rcv_finish+0x9f/0x2c0
    [<ffffffff82860895>] ip_list_rcv_finish.constprop.0+0x525/0x6f0
    [<ffffffff82861f88>] ip_list_rcv+0x318/0x460
    [<ffffffff825f5e61>] __netif_receive_skb_list_core+0x541/0x8f0
    [<ffffffff825f8043>] netif_receive_skb_list_internal+0x763/0xdc0
    [<ffffffff826c3025>] napi_gro_complete.constprop.0+0x5a5/0x700
    [<ffffffff826c44ed>] dev_gro_receive+0xf2d/0x23f0
unreferenced object 0xffff888175e1afc0 (size 216):
  comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [<ffffffff82575539>] __alloc_skb+0x229/0x360
    [<ffffffff8258201c>] alloc_skb_with_frags+0x9c/0x720
    [<ffffffff8255f333>] sock_alloc_send_pskb+0x7b3/0x940
    [<ffffffff82876af4>] __ip_append_data+0x1874/0x36d0
    [<ffffffff8287f283>] ip_make_skb+0x263/0x2e0
    [<ffffffff82978afa>] udp_sendmsg+0x1c8a/0x29d0
    [<ffffffff829af94e>] inet_sendmsg+0x9e/0xe0
    [<ffffffff8255082d>] __sys_sendto+0x23d/0x360
    [<ffffffff82550a31>] __x64_sys_sendto+0xe1/0x1b0
    [<ffffffff82bbde05>] do_syscall_64+0x35/0x80
    [<ffffffff82c00068>] entry_SYSCALL_64_after_hwframe+0x44/0xae

[3]
# ethtool -i enp8s0
driver: igb
version: 5.18.0-rc3-custom-91743-g481c1b
firmware-version: 3.25, 0x80000708, 1.1824.0
expansion-rom-version: 
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

  parent reply	other threads:[~2022-04-27 15:34 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-22 20:12 [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists Eric Dumazet
2022-04-26  7:38 ` Paolo Abeni
2022-04-26 13:11   ` Eric Dumazet
2022-04-26 15:28     ` Paolo Abeni
2022-04-26 16:13       ` Eric Dumazet
2022-04-27  0:50 ` patchwork-bot+netdevbpf
2022-04-27 15:34 ` Ido Schimmel [this message]
2022-04-27 16:53   ` Eric Dumazet
2022-04-27 17:11     ` Eric Dumazet
2022-04-27 17:59       ` Eric Dumazet
2022-04-27 18:54         ` Eric Dumazet
2022-04-29 16:18 ` Qian Cai
2022-04-29 16:23   ` Eric Dumazet
2022-04-29 20:44     ` Qian Cai
2022-05-06  6:44 ` [net] 72fd55c0db: invoked_oom-killer:gfp_mask=0x kernel test robot
2022-05-06  8:15   ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YmlilMi5MmApVqTX@shredder \
    --to=idosch@idosch.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).