Message-ID: <2b06dc5a-542e-440e-af7e-73c29204cda5@nvidia.com>
Date: Mon, 4 May 2026 14:33:57 +0300
From: Moshe Shemesh
To: Jakub Kicinski
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next 3/3] net/mlx5: Add VHCA_ID page management mode support
In-Reply-To: <20260503014501.4098393-1-kuba@kernel.org>
References: <20260501044156.260875-4-tariqt@nvidia.com>
 <20260503014501.4098393-1-kuba@kernel.org>

On 5/3/2026 4:45 AM, Jakub Kicinski wrote:
>
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> ---
> net/mlx5: Add VHCA_ID page management mode support
>
> This change adds a VHCA_ID-based page management mode that uses vhca_id
> instead of function_id as the effective function identifier when the
> firmware advertises icm_mng_function_id_mode. Boot pages keep FUNC_ID
> semantics for backward compatibility, and tracked pages are migrated to
> the vhca_id key after set_hca_cap().
>
> A few questions below on the per-type counter accounting, the boot-page
> migration helper, and locking on the FW-driven page path.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> index e0eafcf0c52a..d3eaefc5c0e0 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>> @@ -852,6 +852,48 @@ bool mlx5_esw_vport_vhca_id(struct mlx5_eswitch *esw, u16 vportn, u16 *vhca_id)
>>  	return true;
>>  }
>>
>> +u16 mlx5_esw_vhca_id_to_func_type(struct mlx5_core_dev *dev, u16 vhca_id)
>> +{
>> +	struct mlx5_eswitch *esw = dev->priv.eswitch;
>> +	struct mlx5_vport *vport;
>> +	unsigned long i;
>> +	u16 type;
>> +
>> +	if (vhca_id == MLX5_CAP_GEN(dev, vhca_id))
>> +		return MLX5_SELF;
>> +
>> +	if (!esw)
>> +		return MLX5_FUNC_TYPE_NONE;
>> +
>> +	mutex_lock(&esw->state_lock);
>
> This function is reached from give_pages()/reclaim_pages()/
> release_all_pages() via func_vhca_id_to_type(), which in turn runs
> from the pg_wq work handler triggered by firmware page-request EQE
> events. Does acquiring esw->state_lock on that path introduce a new
> lock dependency?
>
> Several eswitch paths (for example mlx5_esw_vport_enable(),
> mlx5_esw_vport_disable(), mlx5_eswitch_set_vport_mac()) hold
> state_lock while synchronously issuing firmware commands. Before this
> patch, the page path held no eswitch locks.
>
> Would it be safer to resolve the func_type outside of state_lock, for
> example by caching the vhca_id-to-type mapping separately, or by
> attaching the resolved type to the fw_page at give time so reclaim
> does not need to look it up again?
>

I don't think there is a real issue with the lock, but it is certainly better without it; I will try caching the vhca_id-to-type mapping.
>> +	mlx5_esw_for_each_vport(esw, i, vport) {
>> +		if (vport->vhca_id != vhca_id)
>> +			continue;
>> +
>> +		if (vport->vport == MLX5_VPORT_HOST_PF) {
>> +			type = MLX5_HOST_PF;
>> +			goto unlock;
>> +		}
>> +
>> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_SF)) {
>> +			type = MLX5_SF;
>> +			goto unlock;
>> +		}
>> +
>> +		if (xa_get_mark(&esw->vports, i, MLX5_ESW_VPT_VF)) {
>> +			type = MLX5_VF;
>> +			goto unlock;
>> +		}
>> +
>> +		type = MLX5_EC_VF;
>> +		goto unlock;
>
> Can this fall-through misattribute counters?
>
> If a matching vport is not HOST_PF and has neither the VPT_SF nor the
> VPT_VF mark, it is unconditionally classified as MLX5_EC_VF. Would an
> UPLINK, ECPF, manager, or adjacent-style vport whose vhca_id happens
> to match end up charged to the EC_VF bucket here?
>

The fall-through is correct in practice: the uplink is the physical port, not a separate HCA function, so FW does not request pages for it. The ECPF is the eswitch manager; if it requests pages here it can only be as self, which the MLX5_SELF condition above catches. Adjacent VFs are classified correctly by the VF mark.

>> +	}
>> +	type = MLX5_FUNC_TYPE_NONE;
>> +unlock:
>> +	mutex_unlock(&esw->state_lock);
>> +	return type;
>> +}
>> +
>
> [ ... ]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> index a242053f3a58..52cf341ad6b3 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
>> @@ -597,6 +597,9 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
>>  	if (MLX5_CAP_GEN_MAX(dev, release_all_pages))
>>  		MLX5_SET(cmd_hca_cap, set_hca_cap, release_all_pages, 1);
>>
>> +	if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode))
>> +		MLX5_SET(cmd_hca_cap, set_hca_cap, icm_mng_function_id_mode, 1);
>> +
>
> [ ...
]
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> index 77ffa31cc505..7ebe88aa3b3e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
>
> [ ... ]
>
>> @@ -89,12 +105,21 @@ static u16 func_id_to_type(struct mlx5_core_dev *dev, u16 func_id, bool ec_funct
>>  		return MLX5_SF;
>>  }
>>
>> +static u16 func_vhca_id_to_type(struct mlx5_core_dev *dev, u16 func_vhca_id,
>> +				bool ec_function)
>> +{
>> +	if (mlx5_page_mgt_mode_is_vhca_id(dev))
>> +		return mlx5_esw_vhca_id_to_func_type(dev, func_vhca_id);
>> +
>> +	return func_id_to_type(dev, func_vhca_id, ec_function);
>> +}
>
> Can the give/reclaim accounting become asymmetric in VHCA_ID mode?
>
> The per-type counters are only updated when the resolved type is not
> MLX5_FUNC_TYPE_NONE:
>
>     func_type = func_vhca_id_to_type(dev, func_vhca_id, ec_function);
>     if (func_type != MLX5_FUNC_TYPE_NONE)
>         dev->priv.page_counters[func_type] += npages;
>
> Since mlx5_esw_vhca_id_to_func_type() walks the eswitch vport table
> dynamically on every call, can the give and the corresponding reclaim
> resolve to different types?
>
> For example, if a give runs before vport->vhca_id is populated in
> mlx5_esw_vport_caps_get(), the lookup returns MLX5_FUNC_TYPE_NONE and
> the counter is not incremented. Later, when the vport is fully
> populated, reclaim resolves to a real type and decrements the counter
> below the amount ever added, which on a u32 drives it to a very large
> value.
>
> Similarly, if a vport is removed before reclaim, the increment at
> give time is recorded but the decrement at reclaim is skipped, so the
> counter leaks upward.
>
> Would caching the func_type on the fw_page at give time and reusing
> it on reclaim make the accounting symmetric by construction?

Following the first comment, I will try caching.

>
> [ ...
]
>
>> @@ -658,30 +708,101 @@ static int req_pages_handler(struct notifier_block *nb,
>>  	 * req->npages (and not min ()).
>>  	 */
>>  	req->npages = max_t(s32, npages, MAX_RECLAIM_NPAGES);
>> -	req->ec_function = ec_function;
>> +	if (!mlx5_page_mgt_mode_is_vhca_id(dev))
>> +		req->ec_function = ec_function;
>>  	req->release_all = release_all;
>>  	INIT_WORK(&req->work, pages_work_handler);
>>  	queue_work(dev->priv.pg_wq, &req->work);
>>  	return NOTIFY_OK;
>>  }
>>
>> +/*
>> + * After set_hca_cap(), the second satisfy_startup_pages(dev, 0) may see
>> + * VHCA_ID mode. If page_root_xa already has the PF entry from the first
>> + * (boot) call under FUNC_ID keys 0 or (ec_function << 16), migrate that
>> + * entry to the device vhca_id key so lookups use VHCA_ID semantics.
>> + */
>> +static int mlx5_pagealloc_migrate_pf_to_vhca_id(struct mlx5_core_dev *dev)
>> +{
>> +	u32 vhca_id_key, old_key;
>> +	struct rb_root *root;
>> +	struct fw_page *fwp;
>> +	struct rb_node *p;
>> +	bool ec_function;
>> +	int err;
>> +
>> +	if (xa_empty(&dev->priv.page_root_xa))
>> +		return 0;
>> +
>> +	vhca_id_key = MLX5_CAP_GEN(dev, vhca_id);
>> +	ec_function = mlx5_core_is_ecpf(dev);
>> +
>> +	old_key = ec_function ? (1U << 16) : 0;
>> +	root = xa_load(&dev->priv.page_root_xa, old_key);
>> +	if (!root)
>> +		return 0;
>
> Does this assume the boot-path func_vhca_id was always 0?
>
> The boot call to mlx5_cmd_query_pages() reads func_vhca_id directly
> from the firmware output, and give_pages() then uses that value to
> compute the key. The migration here instead hardcodes old_key as
> ec_function ? (1U << 16) : 0.
>
> If firmware returned a non-zero boot function_id, xa_load(old_key)
> returns NULL, the function silently returns 0, the caller flips the
> mode to VHCA_ID, and the original rb_root is orphaned in page_root_xa
> under the old key. Subsequent free_fwp()/find_fw_page() paths would
> then hit WARN_ON_ONCE(!root) and leak the DMA mappings and pages.
>
> Would it be more robust to look up the actual key used at boot
> (derived from the stored func_vhca_id), and to treat the "xa not
> empty but old_key absent" case as an invariant violation rather than
> silently succeeding?
>

During boot (mlx5_satisfy_startup_pages(dev, 1)), the mode is FUNC_ID. The mlx5_cmd_query_pages() call sets embedded_cpu_function in the request but leaves function_id as 0 (the PF queries its own pages). Firmware returns function_id in the output, which for the PF's own pages is always 0. The "silently returns 0" path when xa_load() returns NULL is also fine: it means no boot pages were allocated.

>> +
>> +	if (old_key == vhca_id_key)
>> +		return 0;
>> +
>> +	err = xa_insert(&dev->priv.page_root_xa, vhca_id_key, root, GFP_KERNEL);
>> +	if (err) {
>> +		mlx5_core_warn(dev,
>> +			       "failed to migrate page root key 0x%x to vhca_id 0x%x\n",
>> +			       old_key, vhca_id_key);
>> +		return err;
>> +	}
>> +
>> +	xa_erase(&dev->priv.page_root_xa, old_key);
>> +
>> +	for (p = rb_first(root); p; p = rb_next(p)) {
>> +		fwp = rb_entry(p, struct fw_page, rb_node);
>> +		fwp->function = vhca_id_key;
>> +	}
>
> Is the ordering here safe against any concurrent free_fwp()?
>
> Between xa_erase(old_key) and the loop that updates fwp->function,
> every fw_page still carries the old key while page_root_xa no longer
> resolves it. If a free_fwp() were to run in that window:
>
>     root = xa_load(&dev->priv.page_root_xa, fwp->function);
>     if (WARN_ON_ONCE(!root))
>         return;
>
> it would return early, skipping dma_unmap_page(), __free_page(), and
> kfree(fwp), leaking the DMA mapping and the backing page.
>
> No concurrent free path is structurally reachable today because this
> runs before the EQ notifier is registered in mlx5_pagealloc_start(),
> but would it be cleaner to update the fwp->function values first,
> then swap the xarray entries (or store the new value at a single key)
> so the two views cannot disagree?
>

Reordering is safer; I will reorder.
>> +
>> +	return 0;
>> +}
>> +
>>  int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot)
>>  {
>> -	u16 func_id;
>> +	bool ec_function = false;
>> +	u16 func_vhca_id;
>>  	s32 npages;
>>  	int err;
>>
>> -	err = mlx5_cmd_query_pages(dev, &func_id, &npages, boot);
>> +	/* When boot flag is set, the icm_mng_function_id_mode capability is
>> +	 * not yet set (only set after set_hca_cap()), so use FUNC_ID mode
>> +	 * for backward compatibility. When boot is false, set mode from
>> +	 * cap (set_hca_cap has run successfully).
>> +	 */
>> +	if (boot) {
>> +		mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_FUNC_ID);
>> +	} else {
>> +		if (MLX5_CAP_GEN_MAX(dev, icm_mng_function_id_mode) ==
>> +		    MLX5_ID_MODE_FUNCTION_VHCA_ID) {
>
> The comment just above says "set mode from cap (set_hca_cap has run
> successfully)", which reads as "use the current/negotiated cap value",
> but the check uses MLX5_CAP_GEN_MAX rather than MLX5_CAP_GEN. The
> sibling code in drivers/net/ethernet/mellanox/mlx5/core/debugfs.c uses
> MLX5_CAP_GEN(dev, icm_mng_function_id_mode) for the same semantic
> check.
>
> Could the comment and the _MAX usage be made consistent? If anyone
> later adds a conditional around the MLX5_SET() in handle_hca_cap()
> (for example a module parameter), the driver would start flipping to
> VHCA_ID mode based on _MAX without actually having enabled the feature
> in firmware.

OK, I will rewrite the comment.

>
>> +			err = mlx5_pagealloc_migrate_pf_to_vhca_id(dev);
>> +			if (err)
>> +				return err;
>> +			mlx5_page_mgt_mode_set(dev, MLX5_PAGE_MGT_MODE_VHCA_ID);
>> +		}
>> +	}
>
> Once page_mgt_mode is flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, it stays
> set in dev->priv and is only reset on the next boot=1 call. Between
> teardown and the next boot=1 path (for example during health
> recovery, PCI reset, or any async flow that runs page work before
> mlx5_satisfy_startup_pages(dev, 1) re-runs), would a FW page-request
> EQE observe a stale mode?
>
> The req_pages EQ notifier is registered via mlx5_pagealloc_start()
> and unregistered via mlx5_pagealloc_stop(), so this is not reachable
> today, but would it be worth resetting the mode explicitly on the
> teardown side rather than relying on the next reinit?

No. Once flipped to MLX5_PAGE_MGT_MODE_VHCA_ID, both driver and firmware keep using it until the function is reopened (the next boot=1 call).

>
> [ ... ]
>
>> @@ -751,6 +874,9 @@ int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev)
>>  	WARN(dev->priv.fw_pages,
>>  	     "FW pages counter is %d after reclaiming all pages\n",
>>  	     dev->priv.fw_pages);
>> +	if (mlx5_page_mgt_mode_is_vhca_id(dev) && !dev->priv.eswitch)
>> +		return 0;
>> +
>
> Does this guard address the asymmetric-counter case raised above?
>
> It only skips the per-type WARNs when the eswitch is entirely absent.
> In the common case where the eswitch is present but a vport's
> vhca_id/marks change between give and reclaim, the counters can still
> drift and these WARNs would still fire on normal teardown paths.
>

Following the first comment, I will try caching.

> [ ... ]