All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/5] net: netconsole: Fix netconsole unsafe locking
@ 2024-08-07  9:16 Breno Leitao
  2024-08-07  9:16 ` [PATCH net-next v2 1/5] net: netpoll: extract core of netpoll_cleanup Breno Leitao
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Breno Leitao @ 2024-08-07  9:16 UTC (permalink / raw)
  To: kuba, davem, edumazet, pabeni
  Cc: thevlad, thepacketgeek, riel, horms, netdev, linux-kernel,
	paulmck, davej

Problem:
=======

The current locking mechanism in netconsole is unsafe and suboptimal due
to the following issues:

1) Lock Release and Reacquisition Mid-Loop:

In netconsole_netdev_event(), the target_list_lock is released and
reacquired within a loop, potentially causing collisions and cleaning up
targets that are being enabled.

	int netconsole_netdev_event()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		list_for_each_entry(nt, &target_list, list) {
			spin_unlock_irqrestore(&target_list_lock, flags);
			__netpoll_cleanup(&nt->np);
			spin_lock_irqsave(&target_list_lock, flags);
		}
		spin_lock_irqsave(&target_list_lock, flags);
	...
	}

2) Non-Atomic Cleanup Operations:

In enabled_store(), the cleanup of structures is not atomic, risking
cleanup of structures that are in the process of being enabled.

	size_t enabled_store()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		nt->enabled = false;
		spin_unlock_irqrestore(&target_list_lock, flags);
		netpoll_cleanup(&nt->np);
	...
	}


These issues stem from the following limitations in netconsole's locking
design:

1) write_{ext_}msg() functions:

	a) Cannot sleep
	b) Must iterate through targets and send messages to all enabled entries.
	c) List iteration is protected by target_list_lock spinlock.

2) Network event handling in netconsole_netdev_event():

	a) Needs to sleep
	b) Requires iteration over the target list (holding
	   target_list_lock spinlock).
	c) Some events necessitate netpoll struct cleanup, which *needs*
	   to sleep.

The target_list_lock needs to be used by non-sleepable functions while
also protecting operations that may sleep, leading to the current unsafe
design.


Solution:
========

1) Dual Locking Mechanism:
	- Retain current target_list_lock for non-sleepable use cases.
	- Introduce target_cleanup_list_lock (mutex) for sleepable
	  operations.

2) Deferred Cleanup:
	- Implement atomic, deferred cleanup of structures using the new
	  mutex (target_cleanup_list_lock).
	- Avoid the `goto` in the middle of the list_for_each_entry

3) Separate Cleanup List:
	- Create target_cleanup_list for deferred cleanup, protected by
	  target_cleanup_list_lock.
	- This allows cleanup() to sleep without affecting message
	  transmission.
	- When iterating over targets, move devices needing cleanup to
	  target_cleanup_list.
	- Handle cleanup under the target_cleanup_list_lock mutex.

4) Make a clear locking hierarchy

	- The target_cleanup_list_lock takes precedence over target_list_lock.

	- Major Workflow Locking Sequences:
		a) Network Event Affecting Netpoll (netconsole_netdev_event):
			rtnl -> target_cleanup_list_lock -> target_list_lock

		b) Message Writing (write_msg()):
			console_lock -> target_list_lock

		c) Configfs Target Enable/Disable (enabled_store()):
			dynamic_netconsole_mutex -> target_cleanup_list_lock -> target_list_lock


This hierarchy ensures consistent lock acquisition order across
different operations, preventing deadlocks and maintaining proper
synchronization. The target_cleanup_list_lock's higher priority allows
for safe deferred cleanup operations without interfering with regular
message transmission protected by target_list_lock.  Each workflow
follows a specific locking sequence, ensuring that operations like
network event handling, message writing, and target management are
properly synchronized and do not conflict with each other.


Changelog:

v2:
  * The selftest has been removed from the patchset because veth is now
    IFF_DISABLE_NETPOLL. A new test will be sent separately.

v1:
  * https://lore.kernel.org/all/20240801161213.2707132-1-leitao@debian.org/

Breno Leitao (5):
  net: netpoll: extract core of netpoll_cleanup
  net: netconsole: Correct mismatched return types
  net: netconsole: Standardize variable naming
  net: netconsole: Unify Function Return Paths
  net: netconsole: Defer netpoll cleanup to avoid lock release during
    list traversal

 drivers/net/netconsole.c | 173 ++++++++++++++++++++++++---------------
 include/linux/netpoll.h  |   1 +
 net/core/netpoll.c       |  12 ++-
 3 files changed, 118 insertions(+), 68 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH net-next v2 0/5] net: netconsole: Fix netconsole unsafe locking
@ 2024-08-08 12:25 Breno Leitao
  2024-08-08 12:25 ` [PATCH net-next v2 5/5] net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal Breno Leitao
  0 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2024-08-08 12:25 UTC (permalink / raw)
  To: kuba, davem, edumazet, pabeni
  Cc: thepacketgeek, riel, horms, netdev, linux-kernel, paulmck, davej

Problem:
=======

The current locking mechanism in netconsole is unsafe and suboptimal due
to the following issues:

1) Lock Release and Reacquisition Mid-Loop:

In netconsole_netdev_event(), the target_list_lock is released and
reacquired within a loop, potentially causing collisions and cleaning up
targets that are being enabled.

	int netconsole_netdev_event()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		list_for_each_entry(nt, &target_list, list) {
			spin_unlock_irqrestore(&target_list_lock, flags);
			__netpoll_cleanup(&nt->np);
			spin_lock_irqsave(&target_list_lock, flags);
		}
		spin_lock_irqsave(&target_list_lock, flags);
	...
	}

2) Non-Atomic Cleanup Operations:

In enabled_store(), the cleanup of structures is not atomic, risking
cleanup of structures that are in the process of being enabled.

	size_t enabled_store()
	{
	...
		spin_lock_irqsave(&target_list_lock, flags);
		nt->enabled = false;
		spin_unlock_irqrestore(&target_list_lock, flags);
		netpoll_cleanup(&nt->np);
	...
	}


These issues stem from the following limitations in netconsole's locking
design:

1) write_{ext_}msg() functions:

	a) Cannot sleep
	b) Must iterate through targets and send messages to all enabled entries.
	c) List iteration is protected by target_list_lock spinlock.

2) Network event handling in netconsole_netdev_event():

	a) Needs to sleep
	b) Requires iteration over the target list (holding
	   target_list_lock spinlock).
	c) Some events necessitate netpoll struct cleanup, which *needs*
	   to sleep.

The target_list_lock needs to be used by non-sleepable functions while
also protecting operations that may sleep, leading to the current unsafe
design.


Solution:
========

1) Dual Locking Mechanism:
	- Retain current target_list_lock for non-sleepable use cases.
	- Introduce target_cleanup_list_lock (mutex) for sleepable
	  operations.

2) Deferred Cleanup:
	- Implement atomic, deferred cleanup of structures using the new
	  mutex (target_cleanup_list_lock).
	- Avoid the `goto` in the middle of the list_for_each_entry

3) Separate Cleanup List:
	- Create target_cleanup_list for deferred cleanup, protected by
	  target_cleanup_list_lock.
	- This allows cleanup() to sleep without affecting message
	  transmission.
	- When iterating over targets, move devices needing cleanup to
	  target_cleanup_list.
	- Handle cleanup under the target_cleanup_list_lock mutex.

4) Make a clear locking hierarchy

	- The target_cleanup_list_lock takes precedence over target_list_lock.

	- Major Workflow Locking Sequences:
		a) Network Event Affecting Netpoll (netconsole_netdev_event):
			rtnl -> target_cleanup_list_lock -> target_list_lock

		b) Message Writing (write_msg()):
			console_lock -> target_list_lock

		c) Configfs Target Enable/Disable (enabled_store()):
			dynamic_netconsole_mutex -> target_cleanup_list_lock -> target_list_lock


This hierarchy ensures consistent lock acquisition order across
different operations, preventing deadlocks and maintaining proper
synchronization. The target_cleanup_list_lock's higher priority allows
for safe deferred cleanup operations without interfering with regular
message transmission protected by target_list_lock.  Each workflow
follows a specific locking sequence, ensuring that operations like
network event handling, message writing, and target management are
properly synchronized and do not conflict with each other.


Changelog:

v3:
  * Move  netconsole_process_cleanups() function to inside
    CONFIG_NETCONSOLE_DYNAMIC block, avoiding Werror=unused-function
    (Jakub)

v2:
  * The selftest has been removed from the patchset because veth is now
    IFF_DISABLE_NETPOLL. A new test will be sent separately.
  * https://lore.kernel.org/all/20240807091657.4191542-1-leitao@debian.org/

v1:
  * https://lore.kernel.org/all/20240801161213.2707132-1-leitao@debian.org/

Breno Leitao (5):
  net: netpoll: extract core of netpoll_cleanup
  net: netconsole: Correct mismatched return types
  net: netconsole: Standardize variable naming
  net: netconsole: Unify Function Return Paths
  net: netconsole: Defer netpoll cleanup to avoid lock release during
    list traversal

 drivers/net/netconsole.c | 173 ++++++++++++++++++++++++---------------
 include/linux/netpoll.h  |   1 +
 net/core/netpoll.c       |  12 ++-
 3 files changed, 118 insertions(+), 68 deletions(-)

-- 
2.43.5


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-08-08 12:25 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-07  9:16 [PATCH net-next v2 0/5] net: netconsole: Fix netconsole unsafe locking Breno Leitao
2024-08-07  9:16 ` [PATCH net-next v2 1/5] net: netpoll: extract core of netpoll_cleanup Breno Leitao
2024-08-07  9:16 ` [PATCH net-next v2 2/5] net: netconsole: Correct mismatched return types Breno Leitao
2024-08-07  9:16 ` [PATCH net-next v2 3/5] net: netconsole: Standardize variable naming Breno Leitao
2024-08-07  9:16 ` [PATCH net-next v2 4/5] net: netconsole: Unify Function Return Paths Breno Leitao
2024-08-07  9:16 ` [PATCH net-next v2 5/5] net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal Breno Leitao
2024-08-07 13:54   ` Jakub Kicinski
  -- strict thread matches above, loose matches on Subject: below --
2024-08-08 12:25 [PATCH net-next v2 0/5] net: netconsole: Fix netconsole unsafe locking Breno Leitao
2024-08-08 12:25 ` [PATCH net-next v2 5/5] net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal Breno Leitao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.