From: James Hogan
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org, Sasha Neftin, Aleksandr Loktionov
Subject: Re:
 [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
Date: Mon, 29 Aug 2022 09:16:33 +0100
Message-ID: <3186253.aeNJFYEL58@saruman>
In-Reply-To: <2301866.ElGaqSPkdT@saruman>
References: <20220811151342.19059-1-vinicius.gomes@intel.com> <87o7wpxb1m.fsf@intel.com> <2301866.ElGaqSPkdT@saruman>
X-Mailing-List: netdev@vger.kernel.org

On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > James Hogan writes:
> > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > >> It was reported a RTNL deadlock in the igc driver that was causing
> > >> problems during suspend/resume.
> > >>
> > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > >> caused by taking RTNL in RPM resume path").
> > >>
> > >> Reported-by: James Hogan
> > >> Signed-off-by: Vinicius Costa Gomes
> > >> ---
> > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > >> enabled.
> > >
> > > Thanks for looking into this.
> > >
> > > This is identical to the patch I've been running for the last week. The
> > > deadlock is avoided, however I now occasionally see an assertion from
> > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > cases via the runtime_resume path, and a suspicious
> > > rcu_dereference_protected() warning (presumably due to the same issue
> > > of the lock not being taken). See here for details:
> > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> >
> > Oh, sorry. I missed the part that the rtnl assert splat was already
> > using similar/identical code to what I got/copied from igb.
> >
> > So what this seems to be telling us is that the "fix" from igb is only
> > hiding the issue,
>
> I suppose the patch just changes the assumption from "lock will never be
> held on runtime resume path" (incorrect, deadlock) to "lock will always be
> held on runtime resume path" (also incorrect, probably racy).
>
> > and we would need to remove the need for taking the
> > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > in that igb thread, iirc)
>
> (I'll defer to others on this. I'm pretty unfamiliar with networking code
> and this particular lock.)

It'd be great to have this longstanding issue properly fixed rather than
having to carry a patch locally that may not be lock safe.

Also, any tips for diagnosing the issue of the network link not coming back
up after resume? I sometimes have to unload and reload the driver module to
get it back again.

Cheers
James