* Re: [Intel-wired-lan] [PATCH] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-11 15:13 ` [PATCH] igc: fix deadlock caused by taking RTNL in RPM resume path Vinicius Costa Gomes
@ 2022-08-11 18:58 ` kernel test robot
2022-08-11 19:59 ` kernel test robot
2022-08-11 20:25 ` [WIP v2] " Vinicius Costa Gomes
2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2022-08-11 18:58 UTC
To: Vinicius Costa Gomes, jhogan
Cc: llvm, kbuild-all, Paul Menzel, netdev, Jesse Brandeburg,
Aleksandr Loktionov, intel-wired-lan
Hi Vinicius,
I love your patch! Yet something to improve:
[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v5.19 next-20220811]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vinicius-Costa-Gomes/igc-fix-deadlock-caused-by-taking-RTNL-in-RPM-resume-path/20220811-232032
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
config: i386-randconfig-a013 (https://download.01.org/0day-ci/archive/20220812/202208120244.a7CKRiFy-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 5f1c7e2cc5a3c07cbc2412e851a7283c1841f520)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/61ed7ed758f23a10549c5d4fdc82ef9356281cbf
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vinicius-Costa-Gomes/igc-fix-deadlock-caused-by-taking-RTNL-in-RPM-resume-path/20220811-232032
git checkout 61ed7ed758f23a10549c5d4fdc82ef9356281cbf
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash drivers/net/ethernet/intel/igc/
If you fix the issue, kindly add the following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
>> drivers/net/ethernet/intel/igc/igc_main.c:6838:26: error: use of undeclared identifier 'igc_suspend'; did you mean '__igc_suspend'?
SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
^~~~~~~~~~~
__igc_suspend
include/linux/pm.h:343:22: note: expanded from macro 'SET_SYSTEM_SLEEP_PM_OPS'
SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
^
include/linux/pm.h:313:26: note: expanded from macro 'SYSTEM_SLEEP_PM_OPS'
.suspend = pm_sleep_ptr(suspend_fn), \
^
include/linux/pm.h:439:65: note: expanded from macro 'pm_sleep_ptr'
#define pm_sleep_ptr(_ptr) PTR_IF(IS_ENABLED(CONFIG_PM_SLEEP), (_ptr))
^
include/linux/kernel.h:57:38: note: expanded from macro 'PTR_IF'
#define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL)
^
drivers/net/ethernet/intel/igc/igc_main.c:6706:27: note: '__igc_suspend' declared here
static int __maybe_unused __igc_suspend(struct device *dev)
^
>> drivers/net/ethernet/intel/igc/igc_main.c:6838:26: error: use of undeclared identifier 'igc_suspend'; did you mean '__igc_suspend'?
SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
^~~~~~~~~~~
__igc_suspend
include/linux/pm.h:343:22: note: expanded from macro 'SET_SYSTEM_SLEEP_PM_OPS'
SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
^
include/linux/pm.h:315:25: note: expanded from macro 'SYSTEM_SLEEP_PM_OPS'
.freeze = pm_sleep_ptr(suspend_fn), \
^
include/linux/pm.h:439:65: note: expanded from macro 'pm_sleep_ptr'
#define pm_sleep_ptr(_ptr) PTR_IF(IS_ENABLED(CONFIG_PM_SLEEP), (_ptr))
^
include/linux/kernel.h:57:38: note: expanded from macro 'PTR_IF'
#define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL)
^
drivers/net/ethernet/intel/igc/igc_main.c:6706:27: note: '__igc_suspend' declared here
static int __maybe_unused __igc_suspend(struct device *dev)
^
>> drivers/net/ethernet/intel/igc/igc_main.c:6838:26: error: use of undeclared identifier 'igc_suspend'; did you mean '__igc_suspend'?
SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
^~~~~~~~~~~
__igc_suspend
include/linux/pm.h:343:22: note: expanded from macro 'SET_SYSTEM_SLEEP_PM_OPS'
SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
^
include/linux/pm.h:317:27: note: expanded from macro 'SYSTEM_SLEEP_PM_OPS'
.poweroff = pm_sleep_ptr(suspend_fn), \
^
include/linux/pm.h:439:65: note: expanded from macro 'pm_sleep_ptr'
#define pm_sleep_ptr(_ptr) PTR_IF(IS_ENABLED(CONFIG_PM_SLEEP), (_ptr))
^
include/linux/kernel.h:57:38: note: expanded from macro 'PTR_IF'
#define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL)
^
drivers/net/ethernet/intel/igc/igc_main.c:6706:27: note: '__igc_suspend' declared here
static int __maybe_unused __igc_suspend(struct device *dev)
^
3 errors generated.
vim +6838 drivers/net/ethernet/intel/igc/igc_main.c
bc23aa949aeba0 Sasha Neftin 2020-01-29 6835
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6836 #ifdef CONFIG_PM
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6837 static const struct dev_pm_ops igc_pm_ops = {
9513d2a5dc7f3f Sasha Neftin 2019-11-14 @6838 SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6839 SET_RUNTIME_PM_OPS(igc_runtime_suspend, igc_runtime_resume,
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6840 igc_runtime_idle)
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6841 };
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6842 #endif
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6843
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* Re: [Intel-wired-lan] [PATCH] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-11 15:13 ` [PATCH] igc: fix deadlock caused by taking RTNL in RPM resume path Vinicius Costa Gomes
2022-08-11 18:58 ` [Intel-wired-lan] " kernel test robot
@ 2022-08-11 19:59 ` kernel test robot
2022-08-11 20:25 ` [WIP v2] " Vinicius Costa Gomes
2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2022-08-11 19:59 UTC
To: Vinicius Costa Gomes, jhogan
Cc: kbuild-all, Paul Menzel, netdev, Jesse Brandeburg,
Aleksandr Loktionov, intel-wired-lan
Hi Vinicius,
I love your patch! Yet something to improve:
[auto build test ERROR on tnguy-next-queue/dev-queue]
[also build test ERROR on linus/master v5.19 next-20220811]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vinicius-Costa-Gomes/igc-fix-deadlock-caused-by-taking-RTNL-in-RPM-resume-path/20220811-232032
base: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git dev-queue
config: x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20220812/202208120359.pPxeIJNZ-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-3) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/61ed7ed758f23a10549c5d4fdc82ef9356281cbf
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Vinicius-Costa-Gomes/igc-fix-deadlock-caused-by-taking-RTNL-in-RPM-resume-path/20220811-232032
git checkout 61ed7ed758f23a10549c5d4fdc82ef9356281cbf
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add the following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
In file included from arch/x86/include/asm/percpu.h:27,
from arch/x86/include/asm/nospec-branch.h:14,
from arch/x86/include/asm/paravirt_types.h:40,
from arch/x86/include/asm/ptrace.h:97,
from arch/x86/include/asm/math_emu.h:5,
from arch/x86/include/asm/processor.h:13,
from arch/x86/include/asm/timex.h:5,
from include/linux/timex.h:67,
from include/linux/time32.h:13,
from include/linux/time.h:60,
from include/linux/stat.h:19,
from include/linux/module.h:13,
from drivers/net/ethernet/intel/igc/igc_main.c:4:
>> drivers/net/ethernet/intel/igc/igc_main.c:6838:33: error: 'igc_suspend' undeclared here (not in a function); did you mean 'dpm_suspend'?
6838 | SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
| ^~~~~~~~~~~
include/linux/kernel.h:57:44: note: in definition of macro 'PTR_IF'
57 | #define PTR_IF(cond, ptr) ((cond) ? (ptr) : NULL)
| ^~~
include/linux/pm.h:313:20: note: in expansion of macro 'pm_sleep_ptr'
313 | .suspend = pm_sleep_ptr(suspend_fn), \
| ^~~~~~~~~~~~
include/linux/pm.h:343:9: note: in expansion of macro 'SYSTEM_SLEEP_PM_OPS'
343 | SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
| ^~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/igc/igc_main.c:6838:9: note: in expansion of macro 'SET_SYSTEM_SLEEP_PM_OPS'
6838 | SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
| ^~~~~~~~~~~~~~~~~~~~~~~
vim +6838 drivers/net/ethernet/intel/igc/igc_main.c
bc23aa949aeba0 Sasha Neftin 2020-01-29 6835
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6836 #ifdef CONFIG_PM
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6837 static const struct dev_pm_ops igc_pm_ops = {
9513d2a5dc7f3f Sasha Neftin 2019-11-14 @6838 SET_SYSTEM_SLEEP_PM_OPS(igc_suspend, igc_resume)
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6839 SET_RUNTIME_PM_OPS(igc_runtime_suspend, igc_runtime_resume,
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6840 igc_runtime_idle)
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6841 };
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6842 #endif
9513d2a5dc7f3f Sasha Neftin 2019-11-14 6843
--
0-DAY CI Kernel Test Service
https://01.org/lkp
* [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-11 15:13 ` [PATCH] igc: fix deadlock caused by taking RTNL in RPM resume path Vinicius Costa Gomes
2022-08-11 18:58 ` [Intel-wired-lan] " kernel test robot
2022-08-11 19:59 ` kernel test robot
@ 2022-08-11 20:25 ` Vinicius Costa Gomes
2022-08-11 21:41 ` James Hogan
2 siblings, 1 reply; 15+ messages in thread
From: Vinicius Costa Gomes @ 2022-08-11 20:25 UTC
To: jhogan
Cc: Vinicius Costa Gomes, Paul Menzel, Tony Nguyen, Jesse Brandeburg,
netdev, intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
An RTNL deadlock was reported in the igc driver, causing
problems during suspend/resume.
The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
caused by taking RTNL in RPM resume path").
Reported-by: James Hogan <jhogan@kernel.org>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
---
Sorry for the noise earlier, my kernel config didn't have runtime PM
enabled.
drivers/net/ethernet/intel/igc/igc_main.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index ebff0e04045d..9b0d4becfcfc 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6600,7 +6600,7 @@ static void igc_deliver_wake_packet(struct net_device *netdev)
netif_rx(skb);
}
-static int __maybe_unused igc_resume(struct device *dev)
+static int __maybe_unused __igc_resume(struct device *dev, bool rpm)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct net_device *netdev = pci_get_drvdata(pdev);
@@ -6642,20 +6642,27 @@ static int __maybe_unused igc_resume(struct device *dev)
wr32(IGC_WUS, ~0);
- rtnl_lock();
+ if (!rpm)
+ rtnl_lock();
if (!err && netif_running(netdev))
err = __igc_open(netdev, true);
if (!err)
netif_device_attach(netdev);
- rtnl_unlock();
+ if (!rpm)
+ rtnl_unlock();
return err;
}
static int __maybe_unused igc_runtime_resume(struct device *dev)
{
- return igc_resume(dev);
+ return __igc_resume(dev, true);
+}
+
+static int __maybe_unused igc_resume(struct device *dev)
+{
+ return __igc_resume(dev, false);
}
static int __maybe_unused igc_suspend(struct device *dev)
@@ -6719,7 +6726,7 @@ static pci_ers_result_t igc_io_error_detected(struct pci_dev *pdev,
* @pdev: Pointer to PCI device
*
* Restart the card from scratch, as if from a cold-boot. Implementation
- * resembles the first-half of the igc_resume routine.
+ * resembles the first-half of the __igc_resume routine.
**/
static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
{
@@ -6758,7 +6765,7 @@ static pci_ers_result_t igc_io_slot_reset(struct pci_dev *pdev)
*
* This callback is called when the error recovery driver tells us that
* its OK to resume normal operation. Implementation resembles the
- * second-half of the igc_resume routine.
+ * second-half of the __igc_resume routine.
*/
static void igc_io_resume(struct pci_dev *pdev)
{
--
2.37.1
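A minimal user-space sketch of the locking pattern in the diff above, assuming a single-threaded model. `model_rtnl_held`, `model_open()` and the `model_*resume()` helpers are hypothetical stand-ins, not igc or kernel APIs. The shared helper takes the lock only when `rpm` is false, so the runtime-PM wrapper relies on its caller already holding the lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded stand-in for RTNL: true while "held". */
static bool model_rtnl_held;

static void model_rtnl_lock(void)
{
	assert(!model_rtnl_held);	/* a real mutex would deadlock here */
	model_rtnl_held = true;
}

static void model_rtnl_unlock(void)
{
	assert(model_rtnl_held);
	model_rtnl_held = false;
}

/* Stand-in for the open path: reports whether it ran under the lock. */
static bool model_open(void)
{
	return model_rtnl_held;
}

/* Mirrors the shape of __igc_resume(dev, rpm): lock only if !rpm. */
static bool model_resume_common(bool rpm)
{
	bool ran_locked;

	if (!rpm)
		model_rtnl_lock();
	ran_locked = model_open();
	if (!rpm)
		model_rtnl_unlock();

	return ran_locked;
}

/* System resume wrapper: takes the lock itself. */
static bool model_resume(void)
{
	return model_resume_common(false);
}

/* Runtime resume wrapper: assumes the caller already holds the lock. */
static bool model_runtime_resume(void)
{
	return model_resume_common(true);
}
```

In this model, `model_runtime_resume()` returns true only when the caller took the lock first. Called with no lock held, it returns false, meaning the open path ran unlocked.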
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-11 20:25 ` [WIP v2] " Vinicius Costa Gomes
@ 2022-08-11 21:41 ` James Hogan
2022-08-13 0:05 ` Vinicius Costa Gomes
0 siblings, 1 reply; 15+ messages in thread
From: James Hogan @ 2022-08-11 21:41 UTC
To: Vinicius Costa Gomes
Cc: Vinicius Costa Gomes, Paul Menzel, Tony Nguyen, Jesse Brandeburg,
netdev, intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> It was reported a RTNL deadlock in the igc driver that was causing
> problems during suspend/resume.
>
> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> caused by taking RTNL in RPM resume path").
>
> Reported-by: James Hogan <jhogan@kernel.org>
> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> ---
> Sorry for the noise earlier, my kernel config didn't have runtime PM
> enabled.
Thanks for looking into this.
This is identical to the patch I've been running for the last week. The
deadlock is avoided, however I now occasionally see an assertion from
netif_set_real_num_tx_queues due to the lock not being taken in some cases via
the runtime_resume path, and a suspicious rcu_dereference_protected() warning
(presumably due to the same issue of the lock not being taken). See here for
details:
https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
Cheers
James
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-11 21:41 ` James Hogan
@ 2022-08-13 0:05 ` Vinicius Costa Gomes
2022-08-13 17:18 ` James Hogan
0 siblings, 1 reply; 15+ messages in thread
From: Vinicius Costa Gomes @ 2022-08-13 0:05 UTC
To: James Hogan
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
Hi James,
James Hogan <jhogan@kernel.org> writes:
> On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
>> It was reported a RTNL deadlock in the igc driver that was causing
>> problems during suspend/resume.
>>
>> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
>> caused by taking RTNL in RPM resume path").
>>
>> Reported-by: James Hogan <jhogan@kernel.org>
>> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> ---
>> Sorry for the noise earlier, my kernel config didn't have runtime PM
>> enabled.
>
> Thanks for looking into this.
>
> This is identical to the patch I've been running for the last week. The
> deadlock is avoided, however I now occasionally see an assertion from
> netif_set_real_num_tx_queues due to the lock not being taken in some cases via
> the runtime_resume path, and a suspicious rcu_dereference_protected() warning
> (presumably due to the same issue of the lock not being taken). See here for
> details:
> https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
Oh, sorry. I missed the part that the rtnl assert splat was already
using similar/identical code to what I got/copied from igb.
So what this seems to be telling us is that the "fix" from igb is only
hiding the issue, and we would need to remove the need for taking the
RTNL for the suspend/resume paths in igc and igb? (as someone else said
in that igb thread, iirc)
Cheers,
--
Vinicius
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-13 0:05 ` Vinicius Costa Gomes
@ 2022-08-13 17:18 ` James Hogan
2022-08-29 8:16 ` James Hogan
0 siblings, 1 reply; 15+ messages in thread
From: James Hogan @ 2022-08-13 17:18 UTC
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> James Hogan <jhogan@kernel.org> writes:
> > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> >> It was reported a RTNL deadlock in the igc driver that was causing
> >> problems during suspend/resume.
> >>
> >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> >> caused by taking RTNL in RPM resume path").
> >>
> >> Reported-by: James Hogan <jhogan@kernel.org>
> >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> >> ---
> >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> >> enabled.
> >
> > Thanks for looking into this.
> >
> > This is identical to the patch I've been running for the last week. The
> > deadlock is avoided, however I now occasionally see an assertion from
> > netif_set_real_num_tx_queues due to the lock not being taken in some cases
> > via the runtime_resume path, and a suspicious rcu_dereference_protected()
> > warning (presumably due to the same issue of the lock not being taken).
> > See here for details:
> > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
>
> Oh, sorry. I missed the part that the rtnl assert splat was already
> using similar/identical code to what I got/copied from igb.
>
> So what this seems to be telling us is that the "fix" from igb is only
> hiding the issue,
I suppose the patch just changes the assumption from "lock will never be held
on runtime resume path" (incorrect, deadlock) to "lock will always be held on
runtime resume path" (also incorrect, probably racy).
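The two assumptions can be laid out as a small truth table. This is an illustrative C sketch only: `runtime_resume_outcome()` and the `RESUME_*` names are hypothetical, and the model covers just two things, whether the runtime resume path takes the lock itself and whether its caller already holds it.

```c
#include <assert.h>
#include <stdbool.h>

enum resume_outcome {
	RESUME_OK,
	RESUME_DEADLOCK,	/* took a lock the caller already holds */
	RESUME_UNLOCKED_OPEN,	/* open path ran with nobody holding the lock */
};

/*
 * rpm_skips_lock == false models the original code (always lock);
 * rpm_skips_lock == true models the patch (never lock on runtime resume).
 */
static enum resume_outcome runtime_resume_outcome(bool rpm_skips_lock,
						   bool caller_holds_lock)
{
	if (!rpm_skips_lock)
		return caller_holds_lock ? RESUME_DEADLOCK : RESUME_OK;

	return caller_holds_lock ? RESUME_OK : RESUME_UNLOCKED_OPEN;
}
```

Neither setting of `rpm_skips_lock` yields `RESUME_OK` for both values of `caller_holds_lock`, which is why a fixed assumption in either direction is incorrect in this model.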
> and we would need to remove the need for taking the
> RTNL for the suspend/resume paths in igc and igb? (as someone else said
> in that igb thread, iirc)
(I'll defer to others on this. I'm pretty unfamiliar with networking code and
this particular lock.)
Cheers
James
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-13 17:18 ` James Hogan
@ 2022-08-29 8:16 ` James Hogan
2022-10-02 10:56 ` James Hogan
0 siblings, 1 reply; 15+ messages in thread
From: James Hogan @ 2022-08-29 8:16 UTC
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > James Hogan <jhogan@kernel.org> writes:
> > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > >> It was reported a RTNL deadlock in the igc driver that was causing
> > >> problems during suspend/resume.
> > >>
> > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > >> caused by taking RTNL in RPM resume path").
> > >>
> > >> Reported-by: James Hogan <jhogan@kernel.org>
> > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > >> ---
> > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > >> enabled.
> > >
> > > Thanks for looking into this.
> > >
> > > This is identical to the patch I've been running for the last week. The
> > > deadlock is avoided, however I now occasionally see an assertion from
> > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > cases
> > > via the runtime_resume path, and a suspicious
> > > rcu_dereference_protected()
> > > warning (presumably due to the same issue of the lock not being taken).
> > > See here for details:
> > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> >
> > Oh, sorry. I missed the part that the rtnl assert splat was already
> > using similar/identical code to what I got/copied from igb.
> >
> > So what this seems to be telling us is that the "fix" from igb is only
> > hiding the issue,
>
> I suppose the patch just changes the assumption from "lock will never be
> held on runtime resume path" (incorrect, deadlock) to "lock will always be
> held on runtime resume path" (also incorrect, probably racy).
>
> > and we would need to remove the need for taking the
> > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > in that igb thread, iirc)
>
> (I'll defer to others on this. I'm pretty unfamiliar with networking code
> and this particular lock.)
It'd be great to have this longstanding issue properly fixed rather than having
to carry a patch locally that may not be lock safe.
Also, any tips for diagnosing the issue of the network link not coming back up
after resume? I sometimes have to unload and reload the driver module to get
it back again.
Cheers
James
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-08-29 8:16 ` James Hogan
@ 2022-10-02 10:56 ` James Hogan
2023-08-14 11:04 ` James Hogan
0 siblings, 1 reply; 15+ messages in thread
From: James Hogan @ 2022-10-02 10:56 UTC
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
On Monday, 29 August 2022 09:16:33 BST James Hogan wrote:
> On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> > On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > > James Hogan <jhogan@kernel.org> writes:
> > > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > > >> It was reported a RTNL deadlock in the igc driver that was causing
> > > >> problems during suspend/resume.
> > > >>
> > > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > > >> caused by taking RTNL in RPM resume path").
> > > >>
> > > >> Reported-by: James Hogan <jhogan@kernel.org>
> > > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > > >> ---
> > > >> Sorry for the noise earlier, my kernel config didn't have runtime PM
> > > >> enabled.
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > This is identical to the patch I've been running for the last week.
> > > > The
> > > > deadlock is avoided, however I now occasionally see an assertion from
> > > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > > cases
> > > > via the runtime_resume path, and a suspicious
> > > > rcu_dereference_protected()
> > > > warning (presumably due to the same issue of the lock not being
> > > > taken).
> > > > See here for details:
> > > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> > >
> > > Oh, sorry. I missed the part that the rtnl assert splat was already
> > > using similar/identical code to what I got/copied from igb.
> > >
> > > So what this seems to be telling us is that the "fix" from igb is only
> > > hiding the issue,
> >
> > I suppose the patch just changes the assumption from "lock will never be
> > held on runtime resume path" (incorrect, deadlock) to "lock will always be
> > held on runtime resume path" (also incorrect, probably racy).
> >
> > > and we would need to remove the need for taking the
> > > RTNL for the suspend/resume paths in igc and igb? (as someone else said
> > > in that igb thread, iirc)
> >
> > (I'll defer to others on this. I'm pretty unfamiliar with networking code
> > and this particular lock.)
>
> It'd be great to have this longstanding issue properly fixed rather than
> having to carry a patch locally that may not be lock safe.
>
> Also, any tips for diagnosing the issue of the network link not coming back
> up after resume? I sometimes have to unload and reload the driver module to
> get it back again.
Any thoughts on this from anybody?
Cheers
James
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2022-10-02 10:56 ` James Hogan
@ 2023-08-14 11:04 ` James Hogan
2023-08-29 1:58 ` Vinicius Costa Gomes
0 siblings, 1 reply; 15+ messages in thread
From: James Hogan @ 2023-08-14 11:04 UTC
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov
On Sunday, 2 October 2022 11:56:28 BST James Hogan wrote:
> On Monday, 29 August 2022 09:16:33 BST James Hogan wrote:
> > On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
> > > On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
> > > > James Hogan <jhogan@kernel.org> writes:
> > > > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
> > > > >> It was reported a RTNL deadlock in the igc driver that was causing
> > > > >> problems during suspend/resume.
> > > > >>
> > > > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
> > > > >> caused by taking RTNL in RPM resume path").
> > > > >>
> > > > >> Reported-by: James Hogan <jhogan@kernel.org>
> > > > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
> > > > >> ---
> > > > >> Sorry for the noise earlier, my kernel config didn't have runtime
> > > > >> PM
> > > > >> enabled.
> > > > >
> > > > > Thanks for looking into this.
> > > > >
> > > > > This is identical to the patch I've been running for the last week.
> > > > > The
> > > > > deadlock is avoided, however I now occasionally see an assertion
> > > > > from
> > > > > netif_set_real_num_tx_queues due to the lock not being taken in some
> > > > > cases
> > > > > via the runtime_resume path, and a suspicious
> > > > > rcu_dereference_protected()
> > > > > warning (presumably due to the same issue of the lock not being
> > > > > taken).
> > > > > See here for details:
> > > > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
> > > >
> > > > Oh, sorry. I missed the part that the rtnl assert splat was already
> > > > using similar/identical code to what I got/copied from igb.
> > > >
> > > > So what this seems to be telling us is that the "fix" from igb is only
> > > > hiding the issue,
> > >
> > > I suppose the patch just changes the assumption from "lock will never be
> > > held on runtime resume path" (incorrect, deadlock) to "lock will always
> > > be
> > > held on runtime resume path" (also incorrect, probably racy).
> > >
> > > > and we would need to remove the need for taking the
> > > > RTNL for the suspend/resume paths in igc and igb? (as someone else
> > > > said
> > > > in that igb thread, iirc)
> > >
> > > (I'll defer to others on this. I'm pretty unfamiliar with networking
> > > code
> > > and this particular lock.)
> >
> > It'd be great to have this longstanding issue properly fixed rather than
> > having to carry a patch locally that may not be lock safe.
> >
> > Also, any tips for diagnosing the issue of the network link not coming
> > back
> > up after resume? I sometimes have to unload and reload the driver module
> > to
> > get it back again.
>
> Any thoughts on this from anybody?
Ping... I've been carrying this patch locally on archlinux for almost a year
now. Every time I update my kernel and forget to rebuild with the patch it
catches me out with deadlocks after resume, and even with the patch I
frequently have to reload the igc module after resume to get the network to
come up (which is preferable to deadlocks but still really sucks). I'd really
appreciate if it could get some attention.
Many thanks
James
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2023-08-14 11:04 ` James Hogan
@ 2023-08-29 1:58 ` Vinicius Costa Gomes
2023-09-03 17:57 ` James Hogan
0 siblings, 1 reply; 15+ messages in thread
From: Vinicius Costa Gomes @ 2023-08-29 1:58 UTC
To: James Hogan
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov, Neftin, Sasha
Hi James,
James Hogan <jhogan@kernel.org> writes:
> On Sunday, 2 October 2022 11:56:28 BST James Hogan wrote:
>> On Monday, 29 August 2022 09:16:33 BST James Hogan wrote:
>> > On Saturday, 13 August 2022 18:18:25 BST James Hogan wrote:
>> > > On Saturday, 13 August 2022 01:05:41 BST Vinicius Costa Gomes wrote:
>> > > > James Hogan <jhogan@kernel.org> writes:
>> > > > > On Thursday, 11 August 2022 21:25:24 BST Vinicius Costa Gomes wrote:
>> > > > >> It was reported a RTNL deadlock in the igc driver that was causing
>> > > > >> problems during suspend/resume.
>> > > > >>
>> > > > >> The solution is similar to commit ac8c58f5b535 ("igb: fix deadlock
>> > > > >> caused by taking RTNL in RPM resume path").
>> > > > >>
>> > > > >> Reported-by: James Hogan <jhogan@kernel.org>
>> > > > >> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>> > > > >> ---
>> > > > >> Sorry for the noise earlier, my kernel config didn't have runtime
>> > > > >> PM
>> > > > >> enabled.
>> > > > >
>> > > > > Thanks for looking into this.
>> > > > >
>> > > > > This is identical to the patch I've been running for the last week.
>> > > > > The
>> > > > > deadlock is avoided, however I now occasionally see an assertion
>> > > > > from
>> > > > > netif_set_real_num_tx_queues due to the lock not being taken in some
>> > > > > cases
>> > > > > via the runtime_resume path, and a suspicious
>> > > > > rcu_dereference_protected()
>> > > > > warning (presumably due to the same issue of the lock not being
>> > > > > taken).
>> > > > > See here for details:
>> > > > > https://lore.kernel.org/netdev/4765029.31r3eYUQgx@saruman/
>> > > >
>> > > > Oh, sorry. I missed the part that the rtnl assert splat was already
>> > > > using similar/identical code to what I got/copied from igb.
>> > > >
>> > > > So what this seems to be telling us is that the "fix" from igb is only
>> > > > hiding the issue,
>> > >
>> > > I suppose the patch just changes the assumption from "lock will never be
>> > > held on runtime resume path" (incorrect, deadlock) to "lock will always
>> > > be
>> > > held on runtime resume path" (also incorrect, probably racy).
>> > >
>> > > > and we would need to remove the need for taking the
>> > > > RTNL for the suspend/resume paths in igc and igb? (as someone else
>> > > > said
>> > > > in that igb thread, iirc)
>> > >
>> > > (I'll defer to others on this. I'm pretty unfamiliar with networking
>> > > code
>> > > and this particular lock.)
>> >
>> > It'd be great to have this longstanding issue properly fixed rather than
>> > having to carry a patch locally that may not be lock safe.
>> >
>> > Also, any tips for diagnosing the issue of the network link not coming
>> > back
>> > up after resume? I sometimes have to unload and reload the driver module
>> > to
>> > get it back again.
>>
>> Any thoughts on this from anybody?
>
> Ping... I've been carrying this patch locally on archlinux for almost a year
> now. Every time I update my kernel and forget to rebuild with the patch it
> catches me out with deadlocks after resume, and even with the patch I
> frequently have to reload the igc module after resume to get the network to
> come up (which is preferable to deadlocks but still really sucks). I'd really
> appreciate if it could get some attention.
I am setting up my test systems to reproduce the deadlocks, then let's
see what ideas come up for removing the need for those locks.
About the link failures, are there any error messages in the kernel
logs? (also, if you could share those logs, can be off-list, it would
help) I am trying to think what could be happening, and how to further
debug this.
Cheers,
--
Vinicius
* Re: [WIP v2] igc: fix deadlock caused by taking RTNL in RPM resume path
2023-08-29 1:58 ` Vinicius Costa Gomes
@ 2023-09-03 17:57 ` James Hogan
0 siblings, 0 replies; 15+ messages in thread
From: James Hogan @ 2023-09-03 17:57 UTC (permalink / raw)
To: Vinicius Costa Gomes
Cc: Paul Menzel, Tony Nguyen, Jesse Brandeburg, netdev,
intel-wired-lan, Sasha Neftin, Aleksandr Loktionov, Neftin, Sasha
On Tuesday, 29 August 2023 02:58:42 BST Vinicius Costa Gomes wrote:
> James Hogan <jhogan@kernel.org> writes:
> > On Sunday, 2 October 2022 11:56:28 BST James Hogan wrote:
> >> On Monday, 29 August 2022 09:16:33 BST James Hogan wrote:
> >> > It'd be great to have this longstanding issue properly fixed rather than
> >> > having to carry a patch locally that may not be lock safe.
> >> >
> >> > Also, any tips for diagnosing the issue of the network link not coming
> >> > back up after resume? I sometimes have to unload and reload the driver
> >> > module to get it back again.
> >>
> >> Any thoughts on this from anybody?
> >
> > Ping... I've been carrying this patch locally on archlinux for almost a
> > year now. Every time I update my kernel and forget to rebuild with the
> > patch it catches me out with deadlocks after resume, and even with the
> > patch I frequently have to reload the igc module after resume to get the
> > network to come up (which is preferable to deadlocks but still really
> > sucks). I'd really appreciate if it could get some attention.
>
> I am setting up my test systems to reproduce the deadlocks, then let's
> see what ideas happen about removing the need for those locks.
>
> About the link failures, are there any error messages in the kernel
> logs? (also, if you could share those logs, can be off-list, it would
> help) I am trying to think what could be happening, and how to further
> debug this.

Looking through the resume log, the only network/igc related items are these:

Sep 03 18:28:17 saruman kernel: igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.7180] manager: sleep: wake requested (sleeping: yes enabled: yes)
Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.7181] device (enp6s0): state change: activated -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Sep 03 18:28:17 saruman avahi-daemon[989]: Withdrawing address record for 192.168.1.239 on enp6s0.
Sep 03 18:28:17 saruman avahi-daemon[989]: Leaving mDNS multicast group on interface enp6s0.IPv4 with address 192.168.1.239.
Sep 03 18:28:17 saruman avahi-daemon[989]: Interface enp6s0.IPv4 no longer relevant for mDNS.
Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.8202] manager: NetworkManager state is now CONNECTED_GLOBAL
Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.8657] manager: NetworkManager state is now DISCONNECTED
Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.8660] device (enp6s0): state change: unmanaged -> unavailable (reason 'managed', sys-iface-state: 'external')
Sep 03 18:28:17 saruman systemd[1]: Starting Network Manager Script Dispatcher Service...
Sep 03 18:28:17 saruman systemd[1]: Started Network Manager Script Dispatcher Service.
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3075] device (enp6s0): carrier: link connected
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3076] device (enp6s0): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3080] policy: auto-activating connection 'Wired connection 1' (f6634f16-77ca-34f7-846a-8c41e15a8ad1)
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3082] device (enp6s0): Activation: starting connection 'Wired connection 1' (f6634f16-77ca-34f7-846a-8c41e15a8ad1)
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3082] device (enp6s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3083] manager: NetworkManager state is now CONNECTING
Sep 03 18:28:21 saruman kernel: igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Sep 03 18:28:21 saruman kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp6s0: link becomes ready
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3506] device (enp6s0): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3512] device (enp6s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:21 saruman NetworkManager[1016]: <info> [1693762101.3515] policy: set 'Wired connection 1' (enp6s0) as default for IPv4 routing and DNS
Sep 03 18:28:21 saruman avahi-daemon[989]: Joining mDNS multicast group on interface enp6s0.IPv4 with address 192.168.1.239.
Sep 03 18:28:21 saruman avahi-daemon[989]: New relevant interface enp6s0.IPv4 for mDNS.
Sep 03 18:28:21 saruman avahi-daemon[989]: Registering new address record for 192.168.1.239 on enp6s0.IPv4.
Sep 03 18:28:22 saruman systemd[1]: systemd-rfkill.service: Deactivated successfully.
Sep 03 18:28:23 saruman NetworkManager[1016]: <info> [1693762103.3544] device (enp6s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:23 saruman NetworkManager[1016]: <info> [1693762103.3553] device (enp6s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:23 saruman NetworkManager[1016]: <info> [1693762103.3554] device (enp6s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
Sep 03 18:28:23 saruman NetworkManager[1016]: <info> [1693762103.3555] manager: NetworkManager state is now CONNECTED_SITE
Sep 03 18:28:23 saruman NetworkManager[1016]: <info> [1693762103.3556] device (enp6s0): Activation: successful, device activated.
Sep 03 18:28:27 saruman NetworkManager[1016]: <info> [1693762107.3532] device (enp6s0): state change: activated -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')
Sep 03 18:28:27 saruman avahi-daemon[989]: Withdrawing address record for 192.168.1.239 on enp6s0.
Sep 03 18:28:27 saruman avahi-daemon[989]: Leaving mDNS multicast group on interface enp6s0.IPv4 with address 192.168.1.239.
Sep 03 18:28:27 saruman avahi-daemon[989]: Interface enp6s0.IPv4 no longer relevant for mDNS.
Sep 03 18:28:27 saruman NetworkManager[1016]: <info> [1693762107.5266] manager: NetworkManager state is now CONNECTED_LOCAL
Sep 03 18:28:27 saruman NetworkManager[1016]: <info> [1693762107.5267] manager: NetworkManager state is now DISCONNECTED
Sep 03 18:28:27 saruman systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
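
For anyone wanting to condense an excerpt like the one above, here is a minimal sketch that pulls out only the NetworkManager device state transitions as a timeline. The function name `state_timeline` and the regular expression are illustrative assumptions inferred from the lines in this log; they are not part of NetworkManager, journalctl, or any tool mentioned in this thread, and other log formats may need a different pattern.

```python
import re

# Pattern inferred from the journal lines above (an assumption, not an
# official NetworkManager log format specification).
STATE_CHANGE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"device \((?P<iface>[^)]+)\): state change: "
    r"(?P<old>[\w-]+) -> (?P<new>[\w-]+) \(reason '(?P<reason>[^']+)'"
)


def state_timeline(lines, iface):
    """Return (timestamp, old_state, new_state, reason) tuples for iface."""
    timeline = []
    for line in lines:
        m = STATE_CHANGE.search(line)
        if m and m.group("iface") == iface:
            timeline.append(
                (m.group("stamp"), m.group("old"), m.group("new"), m.group("reason"))
            )
    return timeline


# Three lines copied from the log above; the kernel line is ignored because
# it is not a NetworkManager state change.
sample = [
    "Sep 03 18:28:17 saruman NetworkManager[1016]: <info> [1693762097.7181] device (enp6s0): state change: activated -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')",
    "Sep 03 18:28:21 saruman kernel: igc 0000:06:00.0 enp6s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX",
    "Sep 03 18:28:27 saruman NetworkManager[1016]: <info> [1693762107.3532] device (enp6s0): state change: activated -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')",
]

for entry in state_timeline(sample, "enp6s0"):
    print(*entry, sep="  ")
```

Run over the full excerpt, this kind of filter makes the sequence easier to see: the device reaches "activated" at 18:28:23 and drops to "unavailable" with reason 'carrier-changed' at 18:28:27.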

As mentioned previously, CONFIG_PROVE_LOCKING=y and I'm seeing splats during
boot, notably "RTNL assertion failed at net/core/dev.c (2877)" and
"suspicious RCU usage".

Cheers
James