Date: Wed, 27 Apr 2022 18:34:44 +0300
From: Ido Schimmel
To: Eric Dumazet
Cc: "David S. Miller", Jakub Kicinski, Paolo Abeni, netdev, Eric Dumazet
Subject: Re: [PATCH v2 net-next] net: generalize skb freeing deferral to per-cpu lists
In-Reply-To: <20220422201237.416238-1-eric.dumazet@gmail.com>

On Fri, Apr 22, 2022 at 01:12:37PM -0700, Eric Dumazet wrote:
> From: Eric Dumazet
>
> Logic added in commit f35f821935d8 ("tcp: defer skb freeing after socket
> lock is released") helped bulk TCP flows to move the cost of skb frees
> outside of the critical section where the socket lock was held.
>
> But for RPC traffic, or hosts with RFS enabled, the solution is far from
> ideal.
>
> For RPC traffic, recvmsg() has to return to user space right after
> the skb payload has been consumed, meaning that the BH handler has no
> chance to pick the skb before the recvmsg() thread. This issue is more
> visible with BIG TCP, as more RPCs fit in one skb.
>
> For RFS, even if the BH handler picks the skbs, they are still picked
> from the cpu on which the user thread is running.
>
> Ideally, it is better to free the skbs (and associated page frags)
> on the cpu that originally allocated them.
>
> This patch removes the per socket anchor (sk->defer_list) and
> instead uses a per-cpu list, which will hold more skbs per round.
>
> This new per-cpu list is drained at the end of net_rx_action(),
> after incoming packets have been processed, to lower latencies.
>
> In normal conditions, skbs are added to the per-cpu list with
> no further action. In the (unlikely) case where the cpu does not
> run the net_rx_action() handler fast enough, we use an IPI to raise
> NET_RX_SOFTIRQ on the remote cpu.
>
> Also, we do not bother draining the per-cpu list from dev_cpu_dead().
> This is because skbs in this list have no requirement on how fast
> they should be freed.
>
> Note that we can add in the future a small per-cpu cache
> if we see any contention on sd->defer_lock.
>
> Tested on a pair of hosts with a 100Gbit NIC, RFS enabled,
> and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
> the page recycling strategy used by the NIC driver (its page pool
> capacity being too small compared to the number of skbs/pages held
> in socket receive queues).
>
> Note that this tuning was only done to demonstrate worse
> conditions for skb freeing for this particular test.
> These conditions can happen in more general production workloads.

[...]

> Signed-off-by: Eric Dumazet
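To make sure we are talking about the same thing, my understanding of the
scheme described above is roughly the sketch below. This is simplified,
and the helper and field names (defer_free_sketch, defer_list, defer_lock,
defer_count, defer_csd, alloc_cpu) are illustrative, not necessarily what
the patch actually uses:

  #include <linux/netdevice.h>
  #include <linux/skbuff.h>
  #include <linux/smp.h>
  #include <linux/spinlock.h>

  /* Producer side: a cpu consuming the skb (e.g. in recvmsg()) hands it
   * back to the cpu assumed to be recorded at allocation time, instead
   * of freeing it locally. Assumes softnet_data carries defer_list,
   * defer_lock, defer_count and a defer_csd whose callback raises
   * NET_RX_SOFTIRQ on the target cpu.
   */
  static void defer_free_sketch(struct sk_buff *skb)
  {
          int cpu = skb->alloc_cpu;       /* cpu that allocated the skb */
          struct softnet_data *sd;
          unsigned long flags;
          bool kick;

          /* Deferring buys nothing here; free immediately. */
          if (!cpu_online(cpu) || cpu == raw_smp_processor_id()) {
                  __kfree_skb(skb);
                  return;
          }

          sd = &per_cpu(softnet_data, cpu);
          spin_lock_irqsave(&sd->defer_lock, flags);
          skb->next = sd->defer_list;
          sd->defer_list = skb;
          /* Kick the remote cpu only on the empty->non-empty transition. */
          kick = (sd->defer_count++ == 0);
          spin_unlock_irqrestore(&sd->defer_lock, flags);

          /* IPI so the remote cpu runs net_rx_action() soon, in case it
           * is not already processing packets (and thus not flushing). */
          if (kick)
                  smp_call_function_single_async(cpu, &sd->defer_csd);
  }

  /* Consumer side: run at the end of net_rx_action() on the owning cpu. */
  static void defer_free_flush_sketch(struct softnet_data *sd)
  {
          struct sk_buff *skb, *next;
          unsigned long flags;

          spin_lock_irqsave(&sd->defer_lock, flags);
          skb = sd->defer_list;
          sd->defer_list = NULL;
          sd->defer_count = 0;
          spin_unlock_irqrestore(&sd->defer_lock, flags);

          while (skb) {
                  next = skb->next;
                  __kfree_skb(skb);
                  skb = next;
          }
  }

The point being that the expensive frees run on the cpu that allocated
the skbs, and the IPI is only needed when the list transitions from
empty to non-empty.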
Eric, with this patch I'm seeing memory leaks such as these [1][2] after
boot. The system is using the igb driver for its management interface [3].

The leaks disappear after reverting the patch.

Any ideas? Let me know if you need more info. I can easily test a patch.

Thanks

[1]

# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff888170143740 (size 216):
  comm "softirq", pid 0, jiffies 4294825261 (age 95.244s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 17 0f 81 88 ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [] napi_skb_cache_get+0xf0/0x180
    [] __napi_build_skb+0x1a/0x60
    [] napi_build_skb+0x23/0x350
    [] igb_poll+0x2b72/0x5880 [igb]
    [] __napi_poll.constprop.0+0xb4/0x480
    [] net_rx_action+0x40a/0xc60
    [] __do_softirq+0x295/0x9fe
    [] __irq_exit_rcu+0x11c/0x180
    [] irq_exit_rcu+0xa/0x20
    [] common_interrupt+0xa9/0xc0
    [] asm_common_interrupt+0x1e/0x40
    [] cpuidle_enter_state+0x27e/0xcb0
    [] cpuidle_enter+0x4f/0xa0
    [] do_idle+0x3b0/0x4b0
    [] cpu_startup_entry+0x19/0x20
    [] start_secondary+0x265/0x340

[2]

# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810ce3aac0 (size 216):
  comm "softirq", pid 0, jiffies 4294861408 (age 64.607s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [] __alloc_skb+0x229/0x360
    [] __tcp_send_ack.part.0+0x6c/0x760
    [] tcp_send_ack+0x82/0xa0
    [] __tcp_ack_snd_check+0x15b/0xa00
    [] tcp_rcv_established+0x198e/0x2120
    [] tcp_v4_do_rcv+0x665/0x9a0
    [] tcp_v4_rcv+0x2c1e/0x32f0
    [] ip_protocol_deliver_rcu+0x53/0x2c0
    [] ip_local_deliver+0x3cb/0x620
    [] ip_sublist_rcv_finish+0x9f/0x2c0
    [] ip_list_rcv_finish.constprop.0+0x525/0x6f0
    [] ip_list_rcv+0x318/0x460
    [] __netif_receive_skb_list_core+0x541/0x8f0
    [] netif_receive_skb_list_internal+0x763/0xdc0
    [] napi_gro_complete.constprop.0+0x5a5/0x700
    [] dev_gro_receive+0xf2d/0x23f0
unreferenced object 0xffff888175e1afc0 (size 216):
  comm "sshd", pid 1024, jiffies 4294861424 (age 64.591s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 c0 7b 07 81 88 ff ff 00 00 00 00 00 00 00 00  ..{.............
  backtrace:
    [] __alloc_skb+0x229/0x360
    [] alloc_skb_with_frags+0x9c/0x720
    [] sock_alloc_send_pskb+0x7b3/0x940
    [] __ip_append_data+0x1874/0x36d0
    [] ip_make_skb+0x263/0x2e0
    [] udp_sendmsg+0x1c8a/0x29d0
    [] inet_sendmsg+0x9e/0xe0
    [] __sys_sendto+0x23d/0x360
    [] __x64_sys_sendto+0xe1/0x1b0
    [] do_syscall_64+0x35/0x80
    [] entry_SYSCALL_64_after_hwframe+0x44/0xae

[3]

# ethtool -i enp8s0
driver: igb
version: 5.18.0-rc3-custom-91743-g481c1b
firmware-version: 3.25, 0x80000708, 1.1824.0
expansion-rom-version:
bus-info: 0000:08:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes