From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2E410343880; Wed, 20 May 2026 18:31:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779301915; cv=none; b=RTfAyH6avAx5w3QfKlnOu9Pyr5SJL1K4X8IsRRHeTc0PjA+IMhktrOSHY5yEoRriyPJfN1PRzQgo0rpb6PGdY7hSf8ZVvTXklo+DP8ToS45kKIeGhleK9z35Y677r5fhwa92ohGe69D5iCVduK1B4ZFdqPdRN2iDDH+AMrizT1Q= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779301915; c=relaxed/simple; bh=Tu7/KMDh4lsvpa/cItk8pgrdoRNOAiL4v1FEJQKvxTc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=BbXU7/i0h/PjPORKwM2GAq3xbg9qpMWH56B4kmtQ6n+9URpOc/cCohu9yxztVTDpX8mEj5n6lZpwefQAJ8aCndCMh1jd9EK984FjTrwl+SM7MrAmL1VxFNUDuOZbgtBfFpI8JqIkKSH1KCw4z6c1Hi3xh1JKuncSkmLV+Gj/+xs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b=kp1nQ7Et; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b="kp1nQ7Et" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 983861F000E9; Wed, 20 May 2026 18:31:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linuxfoundation.org; s=korg; t=1779301914; bh=Nu6peBTIxCZnQgWIsaeAlqIWJflFBcKGWB3xfc6REMg=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=kp1nQ7EtbAO6ZDJBow4Qx8OndkXbAxdxF2lcOwI7HkIGtsSjLzSZz1kuohGEsauSw IRB74ZWAF+Kc63LlkODHuCCxrn+uNn6kaATgTqnXpx+fmiOF23nG8n6pRDvtfIvo+M WAwIh/TsxK7RWGZkmERS9v7nHWA9M47dDNbtEc2c= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, =?UTF-8?q?H=C3=A5kon=20Bugge?= , Somasundaram Krishnasamy , Gerd Rausch , Allison Henderson , Jakub Kicinski , Sasha Levin Subject: [PATCH 6.6 067/508] net/rds: Optimize rds_ib_laddr_check Date: Wed, 20 May 2026 18:18:10 +0200 Message-ID: <20260520162100.055275886@linuxfoundation.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260520162058.573354582@linuxfoundation.org> References: <20260520162058.573354582@linuxfoundation.org> User-Agent: quilt/0.69 X-stable: review X-Patchwork-Hint: ignore Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 6.6-stable review patch. If anyone has any objections, please let me know. ------------------ From: Håkon Bugge [ Upstream commit 236f718ac885965fa886440b9898dfae185c9733 ] rds_ib_laddr_check() creates a CM_ID and attempts to bind the address in question to it. This in order to qualify the allegedly local address as a usable IB/RoCE address. In the field, ExaWatcher runs rds-ping to all ports in the fabric from all local ports. This using all active ToS'es. In a full rack system, we have 14 cell servers and eight db servers. Typically, 6 ToS'es are used. This implies 528 rds-ping invocations per ExaWatcher's "RDSinfo" interval. Adding to this, each rds-ping invocation creates eight sockets and binds the local address to them: socket(AF_RDS, SOCK_SEQPACKET, 0) = 3 bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 4 bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 5 bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 6 bind(6, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 7 bind(7, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 8 bind(8, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 9 bind(9, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 socket(AF_RDS, SOCK_SEQPACKET, 0) = 10 bind(10, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.36.2")}, 16) = 0 So, at every interval ExaWatcher executes rds-ping's, 4224 CM_IDs are allocated, considering this full-rack system. After the a CM_ID has been allocated, rdma_bind_addr() is called, with the port number being zero. This implies that the CMA will attempt to search for an un-used ephemeral port. Simplified, the algorithm is to start at a random position in the available port space, and then if needed, iterate until an un-used port is found. The book-keeping of used ports uses the idr system, which again uses slab to allocate new struct idr_layer's. The size is 2092 bytes and slab tries to reduce the wasted space. Hence, it chooses an order:3 allocation, for which 15 idr_layer structs will fit and only 1388 bytes are wasted per the 32KiB order:3 chunk. Although this order:3 allocation seems like a good space/speed trade-off, it does not resonate well with how it used by the CMA. The combination of the randomized starting point in the port space (which has close to zero spatial locality) and the close proximity in time of the 4224 invocations of the rds-ping's, creates a memory hog for order:3 allocations. These costly allocations may need reclaims and/or compaction. At worst, they may fail and produce a stack trace such as (from uek4): [] __inc_zone_page_state+0x35/0x40 [] page_add_file_rmap+0x57/0x60 [] remove_migration_pte+0x3f/0x3c0 [ksplice_6cn872bt_vmlinux_new] [] rmap_walk+0xd8/0x340 [] remove_migration_ptes+0x40/0x50 [] migrate_pages+0x3ec/0x890 [] compact_zone+0x32d/0x9a0 [] compact_zone_order+0x6d/0x90 [] try_to_compact_pages+0x102/0x270 [] __alloc_pages_direct_compact+0x46/0x100 [] __alloc_pages_nodemask+0x74b/0xaa0 [] alloc_pages_current+0x91/0x110 [] new_slab+0x38b/0x480 [] __slab_alloc+0x3b7/0x4a0 [ksplice_s0dk66a8_vmlinux_new] [] kmem_cache_alloc+0x1fb/0x250 [] idr_layer_alloc+0x36/0x90 [] idr_get_empty_slot+0x28c/0x3d0 [] idr_alloc+0x4d/0xf0 [] cma_alloc_port+0x4d/0xa0 [rdma_cm] [] rdma_bind_addr+0x2ae/0x5b0 [rdma_cm] [] rds_ib_laddr_check+0x83/0x2c0 [ksplice_6l2xst5i_rds_rdma_new] [] rds_trans_get_preferred+0x5b/0xa0 [rds] [] rds_bind+0x212/0x280 [rds] [] SYSC_bind+0xe6/0x120 [] SyS_bind+0xe/0x10 [] system_call_fastpath+0x18/0xd4 To avoid these excessive calls to rdma_bind_addr(), we optimize rds_ib_laddr_check() by simply checking if the address in question has been used before. The rds_rdma module keeps track of addresses associated with IB devices, and the function rds_ib_get_device() is used to determine if the address already has been qualified as a valid local address. If not found, we call the legacy rds_ib_laddr_check(), now renamed to rds_ib_laddr_check_cm(). Signed-off-by: Håkon Bugge Signed-off-by: Somasundaram Krishnasamy Signed-off-by: Gerd Rausch Signed-off-by: Allison Henderson Link: https://patch.msgid.link/20260408080420.540032-2-achender@kernel.org Signed-off-by: Jakub Kicinski Stable-dep-of: ebf71dd4aff4 ("net/rds: Restrict use of RDS/IB to the initial network namespace") Signed-off-by: Sasha Levin --- net/rds/ib.c | 20 ++++++++++++++++++-- net/rds/ib.h | 1 + net/rds/ib_rdma.c | 2 +- 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/net/rds/ib.c b/net/rds/ib.c index 9826fe7f9d008..996f007cd516b 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -403,8 +403,8 @@ static void rds6_ib_ic_info(struct socket *sock, unsigned int len, * allowed to influence which paths have priority. We could call userspace * asserting this policy "routing". */ -static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr, - __u32 scope_id) +static int rds_ib_laddr_check_cm(struct net *net, const struct in6_addr *addr, + __u32 scope_id) { int ret; struct rdma_cm_id *cm_id; @@ -489,6 +489,22 @@ static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr, return ret; } +static int rds_ib_laddr_check(struct net *net, const struct in6_addr *addr, + __u32 scope_id) +{ + struct rds_ib_device *rds_ibdev = NULL; + + if (ipv6_addr_v4mapped(addr)) { + rds_ibdev = rds_ib_get_device(addr->s6_addr32[3]); + if (rds_ibdev) { + rds_ib_dev_put(rds_ibdev); + return 0; + } + } + + return rds_ib_laddr_check_cm(net, addr, scope_id); +} + static void rds_ib_unregister_client(void) { ib_unregister_client(&rds_ib_client); diff --git a/net/rds/ib.h b/net/rds/ib.h index 2ba71102b1f1f..d6c1197731c1c 100644 --- a/net/rds/ib.h +++ b/net/rds/ib.h @@ -384,6 +384,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, __rds_ib_conn_error(conn, KERN_WARNING "RDS/IB: " fmt) /* ib_rdma.c */ +struct rds_ib_device *rds_ib_get_device(__be32 ipaddr); int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, struct in6_addr *ipaddr); void rds_ib_add_conn(struct rds_ib_device *rds_ibdev, struct rds_connection *conn); diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c index 30fca2169aa7a..468fd60d818ff 100644 --- a/net/rds/ib_rdma.c +++ b/net/rds/ib_rdma.c @@ -47,7 +47,7 @@ struct rds_ib_dereg_odp_mr { static void rds_ib_odp_mr_worker(struct work_struct *work); -static struct rds_ib_device *rds_ib_get_device(__be32 ipaddr) +struct rds_ib_device *rds_ib_get_device(__be32 ipaddr) { struct rds_ib_device *rds_ibdev; struct rds_ib_ipaddr *i_ipaddr; -- 2.53.0