Date: Thu, 4 Dec 2025 11:00:07 +0200
X-Mailing-List: linux-kernel@vger.kernel.org
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH net] net/mlx5: Fix double unregister of HCA_PORTS component
To: Gerd Bayer, Saeed Mahameed, Leon Romanovsky, Tariq Toukan, Mark Bloch, Andrew Lunn, "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni, Shay Drory, Simon Horman
Cc: Lukas Wunner, Bjorn Helgaas, Niklas Schnelle, Farhan Ali, netdev@vger.kernel.org, linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org, linux-pci@vger.kernel.org
References: <20251202-fix_lag-v1-1-59e8177ffce0@linux.ibm.com>
From: Tariq Toukan
In-Reply-To: <20251202-fix_lag-v1-1-59e8177ffce0@linux.ibm.com>

On 02/12/2025 13:12, Gerd Bayer wrote:
> Clear hca_devcom_comp in device's private data after unregistering it in
> LAG teardown. Otherwise a slightly lagging second pass through
> mlx5_unload_one() might try to unregister it again and trip over
> use-after-free.
>
> On s390 almost all PCI level recovery events trigger two passes through
> mlx5_unload_one() - one through the poll_health() method and one through
> mlx5_pci_err_detected() as callback from generic PCI error recovery.
> While testing PCI error recovery paths with more kernel debug features
> enabled, this issue reproducibly led to kernel panics with the following
> call chain:
>
>  Unable to handle kernel pointer dereference in virtual kernel address space
>  Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6803 ESOP-2 FSI
>  Fault in home space mode while using kernel ASCE.
>  AS:00000000705c4007 R3:0000000000000024
>  Oops: 0038 ilc:3 [#1]SMP
>
>  CPU: 14 UID: 0 PID: 156 Comm: kmcheck Kdump: loaded Not tainted
>       6.18.0-20251130.rc7.git0.16131a59cab1.300.fc43.s390x+debug #1 PREEMPT
>
>  Krnl PSW : 0404e00180000000 0000020fc86aa1dc (__lock_acquire+0x5c/0x15f0)
>             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  Krnl GPRS: 0000000000000000 0000020f00000001 6b6b6b6b6b6b6c33 0000000000000000
>             0000000000000000 0000000000000000 0000000000000001 0000000000000000
>             0000000000000000 0000020fca28b820 0000000000000000 0000010a1ced8100
>             0000010a1ced8100 0000020fc9775068 0000018fce14f8b8 0000018fce14f7f8
>  Krnl Code: 0000020fc86aa1cc: e3b003400004 lg %r11,832
>             0000020fc86aa1d2: a7840211 brc 8,0000020fc86aa5f4
>            *0000020fc86aa1d6: c09000df0b25 larl %r9,0000020fca28b820
>            >0000020fc86aa1dc: d50790002000 clc 0(8,%r9),0(%r2)
>             0000020fc86aa1e2: a7840209 brc 8,0000020fc86aa5f4
>             0000020fc86aa1e6: c0e001100401 larl %r14,0000020fca8aa9e8
>             0000020fc86aa1ec: c01000e25a00 larl %r1,0000020fca2f55ec
>             0000020fc86aa1f2: a7eb00e8 aghi %r14,232
>
>  Call Trace:
>   __lock_acquire+0x5c/0x15f0
>   lock_acquire.part.0+0xf8/0x270
>   lock_acquire+0xb0/0x1b0
>   down_write+0x5a/0x250
>   mlx5_detach_device+0x42/0x110 [mlx5_core]
>   mlx5_unload_one_devl_locked+0x50/0xc0 [mlx5_core]
>   mlx5_unload_one+0x42/0x60 [mlx5_core]
>   mlx5_pci_err_detected+0x94/0x150 [mlx5_core]
>   zpci_event_attempt_error_recovery+0xcc/0x388
>
> Fixes: 5a977b5833b7 ("net/mlx5: Lag, move devcom registration to LAG layer")
> Signed-off-by: Gerd Bayer
> ---
> Hi Shay et al,
>
> while checking for potential regressions by Lukas Wunner's recent work
> on pci_save/restore_state() for the recoverability of mlx5 functions I
> consistently hit this bug. (Bjorn has queued this up for 6.19, according
> to [0] and [1])
>
> Apparently, the issue is unrelated to Lukas' work but can be reproduced
> with master. It appears to be timing-sensitive, since it shows up only
> when I use s390's debug_defconfig, but I think needs fixing anyhow, as
> timing can change for other reasons, too.
>
> I've spotted two additional places where the devcom reference is not
> cleared after calling mlx5_devcom_unregister_component() in
> drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c that I have not
> addressed with a patch, since I'm unclear about how to test these
> paths.
>
> Thanks,
> Gerd
>
> [0] https://lore.kernel.org/all/cover.1760274044.git.lukas@wunner.de/
> [1] https://lore.kernel.org/linux-pci/cover.1763483367.git.lukas@wunner.de/
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index 3db0387bf6dcb727a65df9d0253f242554af06db..8ec04a5f434dd4f717d6d556649fcc2a584db847 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -1413,6 +1413,7 @@ static int __mlx5_lag_dev_add_mdev(struct mlx5_core_dev *dev)
>  static void mlx5_lag_unregister_hca_devcom_comp(struct mlx5_core_dev *dev)
>  {
>  	mlx5_devcom_unregister_component(dev->priv.hca_devcom_comp);
> +	dev->priv.hca_devcom_comp = NULL;
>  }
>
>  static int mlx5_lag_register_hca_devcom_comp(struct mlx5_core_dev *dev)
>
> ---
> base-commit: 4a26e7032d7d57c998598c08a034872d6f0d3945
> change-id: 20251202-fix_lag-6a59b39a0b3c
>
> Best regards,

Thanks for your patch.

Acked-by: Tariq Toukan
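
For readers following the thread, the pattern the patch applies can be sketched in userspace C. This is only an illustration, not the driver code: toy_comp, toy_dev and all toy_* functions are hypothetical stand-ins, and the sketch assumes (as the fix appears to rely on) that the unregister helper treats a NULL pointer as a no-op. It also runs the two teardown passes sequentially, so it says nothing about locking between them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered devcom component. */
struct toy_comp {
	int id;
};

/* Hypothetical stand-in for the device's private data. */
struct toy_dev {
	struct toy_comp *comp;	/* plays the role of priv.hca_devcom_comp */
};

static int toy_free_count;	/* number of components actually freed */

/* Assumption: unregistering NULL is a harmless no-op. */
static void toy_unregister_component(struct toy_comp *comp)
{
	if (!comp)
		return;
	free(comp);
	toy_free_count++;
}

static int toy_register(struct toy_dev *dev, int id)
{
	dev->comp = malloc(sizeof(*dev->comp));
	if (!dev->comp)
		return -1;
	dev->comp->id = id;
	return 0;
}

/* Teardown that is safe to run twice: the stale pointer is cleared. */
static void toy_teardown(struct toy_dev *dev)
{
	toy_unregister_component(dev->comp);
	dev->comp = NULL;
}

/* Simulates two unload passes; returns how many frees happened. */
static int toy_demo_two_passes(void)
{
	struct toy_dev dev = { 0 };

	toy_free_count = 0;
	if (toy_register(&dev, 1))
		return -1;
	toy_teardown(&dev);	/* first pass, e.g. from health polling */
	toy_teardown(&dev);	/* second pass, e.g. from the PCI error callback */
	assert(dev.comp == NULL);
	return toy_free_count;
}
```

Without the `dev->comp = NULL;` line in toy_teardown(), the second pass would hand the already freed pointer back to toy_unregister_component(), which is the double unregister described in the commit message.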