netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Proper suspend/resume flow
@ 2014-03-26 11:58 Russell King - ARM Linux
  2014-03-27 10:10 ` Claudiu Manoil
  2014-03-29 16:37 ` Ben Hutchings
  0 siblings, 2 replies; 7+ messages in thread
From: Russell King - ARM Linux @ 2014-03-26 11:58 UTC (permalink / raw)
  To: netdev

I've been looking at the Freescale FEC driver recently, and I've noticed
that it's suspend path looks like this:

        if (netif_running(ndev)) {
                netif_device_detach(ndev);
                fec_stop(ndev);
        }

I believe this to be unsafe, because if the device is active, then we may
have packets in the transmit ring.  It may be possible for the ring to be
cleaned between netif_device_detach() and fec_stop(), resulting in a call
to netif_wake_queue(), which doesn't seem to take account of whether the
device has been detached.

I wondered if this was just limited to the freescale driver, and I found
a similar pattern in e1000e - __e1000e_shutdown() calls netif_device_detach()
as the very first thing, before doing anything else.  This seems to be a
common pattern.  8139cp.c does this, the second call seems to be rather
redundant:

        netif_device_detach (dev);
        netif_stop_queue (dev);

Tulip on the other hand shuts the hardware down and cleans the transmit
ring first before calling netif_device_detach(), thus ensuring that there
won't be any transmit completions before this call.  It doesn't stop the
queue before this, so presumably in a SMP environment, it is possible for
packets to get queued after the transmit ring has been cleaned.

So, what is the correct approach to suspending a net device?  When should
netif_device_detach() be called?

I think the right solution for a driver which does all its processing in
the NAPI poll function would be:

	if (netif_running(ndev)) {
		napi_disable(napi);
		netif_device_detach(ndev);
		disable_hardware();
	}

and upon resume, the reverse order.  The theory being:

- napi_disable() is to stop the driver processing its rings and possibly
  waking up the transmit queue(s) after the following netif_device_detach().
- netif_device_detach() to stop new packets being queued to the device.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-26 11:58 Proper suspend/resume flow Russell King - ARM Linux
@ 2014-03-27 10:10 ` Claudiu Manoil
  2014-03-27 10:57   ` Russell King - ARM Linux
  2014-03-29 16:37 ` Ben Hutchings
  1 sibling, 1 reply; 7+ messages in thread
From: Claudiu Manoil @ 2014-03-27 10:10 UTC (permalink / raw)
  To: Russell King - ARM Linux, netdev

On 3/26/2014 1:58 PM, Russell King - ARM Linux wrote:
> I think the right solution for a driver which does all its processing in
> the NAPI poll function would be:
>
> 	if (netif_running(ndev)) {
> 		napi_disable(napi);
> 		netif_device_detach(ndev);
> 		disable_hardware();
> 	}
>
> and upon resume, the reverse order.  The theory being:
>
> - napi_disable() is to stop the driver processing its rings and possibly
>    waking up the transmit queue(s) after the following netif_device_detach().
> - netif_device_detach() to stop new packets being queued to the device.


There's a risk to disabling the NAPI before stopping the transmission
(i.e. stopping the Tx queues).  The net stack might continue to enqueue
Tx packets and, if the Tx BD rings get full, the driver's start_xmit
will return TX_BUSY which leads to netdev watchdog triggering Tx
timeout (and the ndo_tx_timeout hook from the driver will re-enable the
Tx queues).
Maybe it's unlikely to get the BD rings full during suspend, but there's
another case too - BQL (if the driver supports BQL).  BQL needs a
confirmation for each xmitted byte, and the confirmation comes from
Tx completion processing from NAPI context (see
netdev_tx_completed_queue()).  If the confirmation is delayed too much,
BQL blocks the Tx queues to trigger the same netdev Tx timeout watchdog
mechanism.

I think there's a more general problem to this: how to properly stop
the Tx traffic from the driver.  And I think an approach to solve it
would be to stop the Tx queues (i.e. netif_tx_stop_all_queues()) before
disabling the NAPI processing, however the driver will need to use a
special state flag (like "DOWN" or "RESETTING") to prevent waking up
the Tx queues from NAPI while the Tx traffic is being stopped.
There are some drivers implementing such "synchronization" flags to
prevent the Tx congestion mechanism to wake up the Tx queues while the
device is resetting or brought down.  But there's no generic
implementation for this (there are differences in implementation from
driver to driver).

Claudiu

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-27 10:10 ` Claudiu Manoil
@ 2014-03-27 10:57   ` Russell King - ARM Linux
  0 siblings, 0 replies; 7+ messages in thread
From: Russell King - ARM Linux @ 2014-03-27 10:57 UTC (permalink / raw)
  To: Claudiu Manoil; +Cc: netdev

On Thu, Mar 27, 2014 at 12:10:33PM +0200, Claudiu Manoil wrote:
> On 3/26/2014 1:58 PM, Russell King - ARM Linux wrote:
>> I think the right solution for a driver which does all its processing in
>> the NAPI poll function would be:
>>
>> 	if (netif_running(ndev)) {
>> 		napi_disable(napi);
>> 		netif_device_detach(ndev);
>> 		disable_hardware();
>> 	}
>>
>> and upon resume, the reverse order.  The theory being:
>>
>> - napi_disable() is to stop the driver processing its rings and possibly
>>    waking up the transmit queue(s) after the following netif_device_detach().
>> - netif_device_detach() to stop new packets being queued to the device.
>
>
> There's a risk to disabling the NAPI before stopping the transmission
> (i.e. stopping the Tx queues).  The net stack might continue to enqueue
> Tx packets and, if the Tx BD rings get full, the driver's start_xmit
> will return TX_BUSY which leads to netdev watchdog triggering Tx
> timeout (and the ndo_tx_timeout hook from the driver will re-enable the
> Tx queues).

Part of netif_device_detach()'s work is to stop the transmit queues in a
similar manner to the driver reporting "ring full", but before doing so,
it also marks the device not present.

The watchdog is conditional upon netif_device_present() returning true,
but as the device has been marked not-present, this test will fail.

When the device is resumed, and netif_device_attach() is called,
__netdev_watchdog_up() will also be called which reschedules the watchdog
for a new timeout grace period.

So I don't think that's an issue - I think we're safe from the watchdog
triggering.

> Maybe it's unlikely to get the BD rings full during suspend, but there's
> another case too - BQL (if the driver supports BQL).  BQL needs a
> confirmation for each xmitted byte, and the confirmation comes from
> Tx completion processing from NAPI context (see
> netdev_tx_completed_queue()).  If the confirmation is delayed too much,
> BQL blocks the Tx queues to trigger the same netdev Tx timeout watchdog
> mechanism.

One of the things FEC does when it resumes is to clean the transmit queue,
which with BQL ends up resetting the BQL counts (via netdev_reset_queue()).
Maybe that's something which should happen on the suspend path to clean
everything up in preparation?

> I think there's a more general problem to this: how to properly stop
> the Tx traffic from the driver.  And I think an approach to solve it
> would be to stop the Tx queues (i.e. netif_tx_stop_all_queues()) before
> disabling the NAPI processing, however the driver will need to use a
> special state flag (like "DOWN" or "RESETTING") to prevent waking up
> the Tx queues from NAPI while the Tx traffic is being stopped.
> There are some drivers implementing such "synchronization" flags to
> prevent the Tx congestion mechanism to wake up the Tx queues while the
> device is resetting or brought down.  But there's no generic
> implementation for this (there are differences in implementation from
> driver to driver).

If what I've said above is correct, then I don't think any of that is
necessary, and the sequence I mentioned in my initial email should work
fine.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-26 11:58 Proper suspend/resume flow Russell King - ARM Linux
  2014-03-27 10:10 ` Claudiu Manoil
@ 2014-03-29 16:37 ` Ben Hutchings
  2014-03-29 18:03   ` Russell King - ARM Linux
  1 sibling, 1 reply; 7+ messages in thread
From: Ben Hutchings @ 2014-03-29 16:37 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2794 bytes --]

On Wed, 2014-03-26 at 11:58 +0000, Russell King - ARM Linux wrote:
> I've been looking at the Freescale FEC driver recently, and I've noticed
> that it's suspend path looks like this:
> 
>         if (netif_running(ndev)) {
>                 netif_device_detach(ndev);
>                 fec_stop(ndev);
>         }
> 
> I believe this to be unsafe, because if the device is active, then we may
> have packets in the transmit ring.  It may be possible for the ring to be
> cleaned between netif_device_detach() and fec_stop(), resulting in a call
> to netif_wake_queue(), which doesn't seem to take account of whether the
> device has been detached.
>
> I wondered if this was just limited to the freescale driver, and I found
> a similar pattern in e1000e - __e1000e_shutdown() calls netif_device_detach()
> as the very first thing, before doing anything else.  This seems to be a
> common pattern.  8139cp.c does this, the second call seems to be rather
> redundant:
> 
>         netif_device_detach (dev);
>         netif_stop_queue (dev);
> 
> Tulip on the other hand shuts the hardware down and cleans the transmit
> ring first before calling netif_device_detach(), thus ensuring that there
> won't be any transmit completions before this call.  It doesn't stop the
> queue before this, so presumably in a SMP environment, it is possible for
> packets to get queued after the transmit ring has been cleaned.
> 
> So, what is the correct approach to suspending a net device?  When should
> netif_device_detach() be called?
[...]

My advice here is based on Solarflare's extensive stress-testing of the
sfc driver control path and the race conditions it uncovered.  It did
not cover suspend/resume, but it does cover MTU and ring size changes
which involve the same sort of stop/start while the device is still
marked as running.

- Given the way the watchdog works, I think netif_device_detach() must
be called first if you are going to stop TX queues in a control
operation other than ndo_stop (commits e4abce853849, 29c69a488264).

- The driver must check netif_device_present() before calling
netif_wake_queue().  Perhaps netif_wake_queue() itself should check
that, if many drivers need to do it.

- Calls to netif_device_detach() may need to be synchronised with the TX
scheduler (commits c2f3b8e3a44b, 35205b211c8d).  To support this, the
function efx_device_detach_sync() maybe should be turned into a general
netif_device_detach_sync() (and then possibly needs to disable IRQs).

Ben.

-- 
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had thought.
... I realized that a large part of my life from then on was going to be spent
in finding mistakes in my own programs. - Maurice Wilkes, 1949

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-29 16:37 ` Ben Hutchings
@ 2014-03-29 18:03   ` Russell King - ARM Linux
  2014-03-29 20:08     ` Ben Hutchings
  0 siblings, 1 reply; 7+ messages in thread
From: Russell King - ARM Linux @ 2014-03-29 18:03 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

On Sat, Mar 29, 2014 at 04:37:26PM +0000, Ben Hutchings wrote:
> My advice here is based on Solarflare's extensive stress-testing of the
> sfc driver control path and the race conditions it uncovered.  It did
> not cover suspend/resume, but it does cover MTU and ring size changes
> which involve the same sort of stop/start while the device is still
> marked as running.

Thanks for your reply.

> - Given the way the watchdog works, I think netif_device_detach() must
> be called first if you are going to stop TX queues in a control
> operation other than ndo_stop (commits e4abce853849, 29c69a488264).
> 
> - The driver must check netif_device_present() before calling
> netif_wake_queue().  Perhaps netif_wake_queue() itself should check
> that, if many drivers need to do it.

I don't think that's sufficient.  Unless I'm missing something, I think
the situation below is possible:

Thread0					Thread1
					netif_device_present()
					- test_bit(__LINK_STATE_PRESENT)
netif_device_detach
- test_and_clear_bit(__LINK_STATE_PRESENT)
- netif_running(dev)
- netif_tx_stop_all_queues
  - netif_tx_stop_queue
    - set_bit(__QUEUE_STATE_DRV_XOFF)
					netif_wake_queue()
					- netif_tx_wake_queue()
					  - test_and_clear_bit(__QUEUE_STATE_DRV_XOFF)
					    - __netif_schedule()
					      - test_and_set_bit(__QDISC_STATE_SCHED)
					        - __netif_reschedule()
					          - sd->output_queue_tailp = &q->next_sched
						    - raise_softirq_irqoff(NET_TX_SOFTIRQ)
					...
					net_tx_action()
					qdisc_run()
					- qdisc_run_begin()
					- __qdisc_run()
					  - qdisc_restart()
					    - sch_direct_xmit()
					      - netif_xmit_frozen_or_stopped()
					        - ops->ndo_start_xmit()

In other words, it's possible for the driver's ndo_start_xmit() method
to be called after netif_device_detach() is called due to a call to
netif_wake_queue() - even if netif_device_present() is first checked.

> - Calls to netif_device_detach() may need to be synchronised with the TX
> scheduler (commits c2f3b8e3a44b, 35205b211c8d).  To support this, the
> function efx_device_detach_sync() maybe should be turned into a general
> netif_device_detach_sync() (and then possibly needs to disable IRQs).

Given the above (I don't see anything that checks to see whether the
device is present in the path from netif_wake_queue() through to calling
the device's ndo_start_xmit() function), I think you're right.  I also
think "if (netif_device_present()) netif_wake_queue();" needs to also be
done under the xmit lock, or the same lock which protects the call to
netif_device_detach() to prevent the above occuring.

However, I have a voice in the back of my mind which sounds like Alan Cox
saying "don't lock code, lock data" :p

I also think you're right about calling netif_stop_queue() or
netif_tx_disable() from contexts other than the xmit function or the
ndo_stop() method.  It's safe in ndo_stop() because netif_running()
will be false at that point, which disables the watchdog.

Calling it at other moments is potentially dangerous as a watchdog poll
(which happens every dev->watchdog_timeo) could occur while the queue
is temporarily disabled, and if it has been more than watchdog_timeo
after the last packet was queued, we'll get an instant watchdog warning.

That said, for the case where a network driver does all it's packet
processing in the NAPI poll function, I think calling napi_disable()
is a good way to ensure that the poll function is not running, and
therefore there are can be no netif_wake_queue() calls - or anything
other than the ndo_start_xmit touching the rings or the device.  This
is needed anyway to stop receive packet processing looking at its
ring.

So, I've now come to this sequence:

suspend()
{
	if (netif_running()) {
		napi_disable();
		netif_tx_lock();
		netif_device_detach();
		netif_tx_unlock();
	}
	... suspend device ...
}

resume()
{
	... resume device ...
	if (netif_running()) {
		netif_device_attach();
		napi_enable();
	}
}

as the way to safely suspend packet processing while suspended.  For
reconfiguration purposes (where hopefully the reconfiguration part
doesn't need to schedule), I think:

	... prepare for reconfiguration (eg, allocate new rings)
	napi_disable();
	netif_tx_lock();

	... reconfigure, swap state, etc ...

	netif_tx_unlock();
	napi_enable();

is sufficient to temporarily suspend packet processing.

I can see another re-work of my Freescale FEC patchset being required.

I'm also debating about rtnl_lock() in some places where netif_running()
is checked - it seems to me that checking netif_running() without holding
the rtnl_lock() is unsafe.  That said, the suspend paths should be single
threaded, so there should be no chance of the device coming up or down.
Even so, I see some drivers do use rtnl_lock() in their suspend/resume
paths.

However, FEC defers the transmit timeout processing to work queue context,
and doesn't take any locks:

        if (fep->delay_work.timeout) {
                fep->delay_work.timeout = false;
                fec_restart(fep->netdev, fep->full_duplex);
                netif_wake_queue(fep->netdev);
	}

Internally fec_restart() has:

	if (netif_running(ndev)) {
		netif_device_detach(ndev);
		napi_disable(&fep->napi);
		netif_stop_queue(ndev);
		netif_tx_lock_bh(ndev);
	}
	... restart processing including a udelay() ...
	if (netif_running(ndev)) {
		netif_tx_unlock_bh(ndev);
		netif_wake_queue(ndev);
		napi_enable(&fep->napi);
		netif_device_attach(ndev);
	}

and it means that netif_running() could change state in the middle of
this sequence.  Putting rtnl_lock() around that stops netif_running()
changing state, but it feels like a very big sledge hammer.  However,
with the rest of my changes to this driver, I may be able to get rid
of that work queue entirely, moving the restart entirely under the
ndo_tx_timeout() call.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-29 18:03   ` Russell King - ARM Linux
@ 2014-03-29 20:08     ` Ben Hutchings
  2014-03-29 20:19       ` Russell King - ARM Linux
  0 siblings, 1 reply; 7+ messages in thread
From: Ben Hutchings @ 2014-03-29 20:08 UTC (permalink / raw)
  To: Russell King - ARM Linux; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 1126 bytes --]

On Sat, 2014-03-29 at 18:03 +0000, Russell King - ARM Linux wrote:
[...]
> That said, for the case where a network driver does all it's packet
> processing in the NAPI poll function, I think calling napi_disable()
> is a good way to ensure that the poll function is not running, and
> therefore there are can be no netif_wake_queue() calls - or anything
> other than the ndo_start_xmit touching the rings or the device.  This
> is needed anyway to stop receive packet processing looking at its
> ring.
> 
> So, I've now come to this sequence:
> 
> suspend()
> {
> 	if (netif_running()) {
> 		napi_disable();
> 		netif_tx_lock();
> 		netif_device_detach();
> 		netif_tx_unlock();
> 	}
> 	... suspend device ...
> }
[...]

This is missing netif_stop_queue(), but I assume you do that somewhere
after netif_device_detach().  I think this should work.

Ben.

-- 
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had thought.
... I realized that a large part of my life from then on was going to be spent
in finding mistakes in my own programs. - Maurice Wilkes, 1949

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Proper suspend/resume flow
  2014-03-29 20:08     ` Ben Hutchings
@ 2014-03-29 20:19       ` Russell King - ARM Linux
  0 siblings, 0 replies; 7+ messages in thread
From: Russell King - ARM Linux @ 2014-03-29 20:19 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev

On Sat, Mar 29, 2014 at 08:08:14PM +0000, Ben Hutchings wrote:
> On Sat, 2014-03-29 at 18:03 +0000, Russell King - ARM Linux wrote:
> [...]
> > That said, for the case where a network driver does all it's packet
> > processing in the NAPI poll function, I think calling napi_disable()
> > is a good way to ensure that the poll function is not running, and
> > therefore there are can be no netif_wake_queue() calls - or anything
> > other than the ndo_start_xmit touching the rings or the device.  This
> > is needed anyway to stop receive packet processing looking at its
> > ring.
> > 
> > So, I've now come to this sequence:
> > 
> > suspend()
> > {
> > 	if (netif_running()) {
> > 		napi_disable();
> > 		netif_tx_lock();
> > 		netif_device_detach();
> > 		netif_tx_unlock();
> > 	}
> > 	... suspend device ...
> > }
> [...]
> 
> This is missing netif_stop_queue(), but I assume you do that somewhere
> after netif_device_detach().  I think this should work.

Ah, that's covered by netif_device_detach():

        if (test_and_clear_bit(__LINK_STATE_PRESENT, &dev->state) &&
            netif_running(dev)) {
                netif_tx_stop_all_queues(dev);
        }

provided __LINK_STATE_PRESENT was previously set, and the interface
was running.

netif_tx_stop_all_queues() is:

        for (i = 0; i < dev->num_tx_queues; i++) {
                struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
                netif_tx_stop_queue(txq);
        }

so with the above sequence, we end up stopping all queues under the
xmit lock.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-03-29 20:20 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-03-26 11:58 Proper suspend/resume flow Russell King - ARM Linux
2014-03-27 10:10 ` Claudiu Manoil
2014-03-27 10:57   ` Russell King - ARM Linux
2014-03-29 16:37 ` Ben Hutchings
2014-03-29 18:03   ` Russell King - ARM Linux
2014-03-29 20:08     ` Ben Hutchings
2014-03-29 20:19       ` Russell King - ARM Linux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).