[PATCH] netdev: enc28j60 kernel panic fix.

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] netdev: enc28j60 kernel panic fix.
@ 2016-05-04 17:51 David Russell
  2016-05-05  8:51 ` Francois Romieu
  0 siblings, 1 reply; 5+ messages in thread
From: David Russell @ 2016-05-04 17:51 UTC (permalink / raw)
  To: netdev

When connected directly to another system (not via a switch)
eventually a condition where a NULL pointer dereference occurs in
enc28j60_hw_tx() and this patch simply checks for that condition and
returns gracefully without causing a kernel panic.  I believe, but
have not investigated this is caused by a packet collision and am not
sure if the kernel tracks collisions or counts them as errors, so that
should probably be added if this is what's happening.  I'm also not
familiar with the linux kernel, so may have fixed this in a less than
ideal way.

Developer's Certificate of Origin 1.1

        By making a contribution to this project, I certify that:

        (d) I understand and agree that this project and the contribution
            are public and that a record of the contribution (including all
            personal information I submit with it, including my sign-off) is
            maintained indefinitely and may be redistributed consistent with
            this project or the open source license(s) involved.

Signed-off-by: David Russell <david@aprsworld.com>

---
diff --git a/drivers/net/ethernet/microchip/enc28j60.c
b/drivers/net/ethernet/microchip/enc28j60.c
index 86ea17e..36ac65f 100644
--- a/drivers/net/ethernet/microchip/enc28j60.c
+++ b/drivers/net/ethernet/microchip/enc28j60.c
@@ -1233,6 +1233,9 @@ static void enc28j60_irq_work_handler(struct
work_struct *work)
  */
 static void enc28j60_hw_tx(struct enc28j60_net *priv)
 {
+       if (!priv->tx_skb)
+               return;
+
        if (netif_msg_tx_queued(priv))
                printk(KERN_DEBUG DRV_NAME
                        ": Tx Packet Len:%d\n", priv->tx_skb->len);

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] netdev: enc28j60 kernel panic fix.
  2016-05-04 17:51 [PATCH] netdev: enc28j60 kernel panic fix David Russell
@ 2016-05-05  8:51 ` Francois Romieu
  2016-05-06 18:26   ` David Russell
  0 siblings, 1 reply; 5+ messages in thread
From: Francois Romieu @ 2016-05-05  8:51 UTC (permalink / raw)
  To: David Russell; +Cc: netdev

David Russell <david@aprsworld.com> :
> When connected directly to another system (not via a switch)
> eventually a condition where a NULL pointer dereference occurs in
> enc28j60_hw_tx() and this patch simply checks for that condition and
> returns gracefully without causing a kernel panic.  I believe, but
> have not investigated this is caused by a packet collision and am not
> sure if the kernel tracks collisions or counts them as errors, so that
> should probably be added if this is what's happening.  I'm also not
> familiar with the linux kernel, so may have fixed this in a less than
> ideal way.

Is it possible for EIR.EIR_TXERIF and EIR.EIR_TXIF to be set for the
same packet ?

If so the driver is intrinsically racy:
- EIR.EIR_TXIF completes transmission, clears tx_skb and enables queueing
  again (see netif_wake_queue in enc28j60_tx_clear)

- insert start_xmit here: tx_skb is set and enc28j60_hw_tx is scheduled
  for late execution (user context work)

- EIR.EIR_EIR.EIR_TXERIF issues same enc28j60_tx_clear and clears tx_skb

- enc28j60_hw_tx is run but tx_skb is NULL

> diff --git a/drivers/net/ethernet/microchip/enc28j60.c
> b/drivers/net/ethernet/microchip/enc28j60.c
> index 86ea17e..36ac65f 100644
> --- a/drivers/net/ethernet/microchip/enc28j60.c
> +++ b/drivers/net/ethernet/microchip/enc28j60.c
> @@ -1233,6 +1233,9 @@ static void enc28j60_irq_work_handler(struct
> work_struct *work)
>   */
>  static void enc28j60_hw_tx(struct enc28j60_net *priv)
>  {
> +       if (!priv->tx_skb)
> +               return;
> +
>         if (netif_msg_tx_queued(priv))
>                 printk(KERN_DEBUG DRV_NAME
>                         ": Tx Packet Len:%d\n", priv->tx_skb->len);

enc28j60_hw_tx isn't the culprit. It's the victim.

This change silences the bug but it does not fix it at all.

-- 
Ueimor

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] netdev: enc28j60 kernel panic fix.
  2016-05-05  8:51 ` Francois Romieu
@ 2016-05-06 18:26   ` David Russell
  2016-05-06 20:35     ` Francois Romieu
  0 siblings, 1 reply; 5+ messages in thread
From: David Russell @ 2016-05-06 18:26 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

I kind of thought my patch was at best incomplete.  When you state
this change silences the bug but does not fix it, what are the
implications of systems running this patch?  We have some production
systems using this patch.  They reboot daily, but have been solid.

In addition, if we sent you a pi and the ethernet controller and a
small but reasonable sum of money for your labor, would you be able to
properly fix it?  Short of that, do you have any recommendations on
quick overviews of the networking stack in the kernel and then
documentation on the various flags and such?

Thanks.

-David Russell
APRS World, LLC
http://www.aprsworld.com/


On Thu, May 5, 2016 at 3:51 AM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> David Russell <david@aprsworld.com> :
>> When connected directly to another system (not via a switch)
>> eventually a condition where a NULL pointer dereference occurs in
>> enc28j60_hw_tx() and this patch simply checks for that condition and
>> returns gracefully without causing a kernel panic.  I believe, but
>> have not investigated this is caused by a packet collision and am not
>> sure if the kernel tracks collisions or counts them as errors, so that
>> should probably be added if this is what's happening.  I'm also not
>> familiar with the linux kernel, so may have fixed this in a less than
>> ideal way.
>
> Is it possible for EIR.EIR_TXERIF and EIR.EIR_TXIF to be set for the
> same packet ?
>
> If so the driver is intrinsically racy:
> - EIR.EIR_TXIF completes transmission, clears tx_skb and enables queueing
>   again (see netif_wake_queue in enc28j60_tx_clear)
>
> - insert start_xmit here: tx_skb is set and enc28j60_hw_tx is scheduled
>   for late execution (user context work)
>
> - EIR.EIR_EIR.EIR_TXERIF issues same enc28j60_tx_clear and clears tx_skb
>
> - enc28j60_hw_tx is run but tx_skb is NULL
>
>> diff --git a/drivers/net/ethernet/microchip/enc28j60.c
>> b/drivers/net/ethernet/microchip/enc28j60.c
>> index 86ea17e..36ac65f 100644
>> --- a/drivers/net/ethernet/microchip/enc28j60.c
>> +++ b/drivers/net/ethernet/microchip/enc28j60.c
>> @@ -1233,6 +1233,9 @@ static void enc28j60_irq_work_handler(struct
>> work_struct *work)
>>   */
>>  static void enc28j60_hw_tx(struct enc28j60_net *priv)
>>  {
>> +       if (!priv->tx_skb)
>> +               return;
>> +
>>         if (netif_msg_tx_queued(priv))
>>                 printk(KERN_DEBUG DRV_NAME
>>                         ": Tx Packet Len:%d\n", priv->tx_skb->len);
>
> enc28j60_hw_tx isn't the culprit. It's the victim.
>
> This change silences the bug but it does not fix it at all.
>
> --
> Ueimor

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] netdev: enc28j60 kernel panic fix.
  2016-05-06 18:26   ` David Russell
@ 2016-05-06 20:35     ` Francois Romieu
  2016-05-09 16:59       ` David Russell
  0 siblings, 1 reply; 5+ messages in thread
From: Francois Romieu @ 2016-05-06 20:35 UTC (permalink / raw)
  To: David Russell; +Cc: netdev

(please don't top post)

David Russell <david@aprsworld.com> :
> I kind of thought my patch was at best incomplete.  When you state
> this change silences the bug but does not fix it, what are the
> implications of systems running this patch?  We have some production
> systems using this patch.  They reboot daily, but have been solid.

If my assumption is right it should drop an extra packet here and there.
No leak.

However transmit errors + transmit packets should still match the number
of times the driver calls enc28j60_send_packet (you would have to cook
your own stat to check the latter though).

> In addition, if we sent you a pi and the ethernet controller and a
> small but reasonable sum of money for your labor, would you be able to
> properly fix it ?

I'd rather see you testing my crap. :o)

Pi as multi-core (the expected race needs several cores or a netconsole
style transmit from an irq/bh context) ?

> Short of that, do you have any recommendations on quick overviews of
> the networking stack in the kernel and then documentation on the
> various flags and such?

A tad bit too high-level a question... Plain ctags + printk for a start ?

Does the patch below make a difference ?

Takes longer to crash counts as a difference.

diff --git a/drivers/net/ethernet/microchip/enc28j60.c b/drivers/net/ethernet/microchip/enc28j60.c
index 7066954..405fe3f 100644
--- a/drivers/net/ethernet/microchip/enc28j60.c
+++ b/drivers/net/ethernet/microchip/enc28j60.c
@@ -1170,7 +1170,8 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
 				enc28j60_dump_tsv(priv, "Tx Done", tsv);
 			}
 			enc28j60_tx_clear(ndev, err);
-			locked_reg_bfclr(priv, EIR, EIR_TXIF);
+			locked_reg_bfclr(priv, EIR, EIR_TXIF | EIR_TXERIF);
+			intflags &= ~EIR_TXERIF;
 		}
 		/* TX Error handler */
 		if ((intflags & EIR_TXERIF) != 0) {
@@ -1190,6 +1191,7 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
 			nolock_reg_bfclr(priv, ECON1, ECON1_TXRST);
 			nolock_txfifo_init(priv, TXSTART_INIT, TXEND_INIT);
 			mutex_unlock(&priv->lock);
+			locked_reg_bfclr(priv, EIR, EIR_TXIF | EIR_TXERIF);
 			/* Transmit Late collision check for retransmit */
 			if (TSV_GETBIT(tsv, TSV_TXLATECOLLISION)) {
 				if (netif_msg_tx_err(priv))
@@ -1203,7 +1205,6 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
 					enc28j60_tx_clear(ndev, true);
 			} else
 				enc28j60_tx_clear(ndev, true);
-			locked_reg_bfclr(priv, EIR, EIR_TXERIF);
 		}
 		/* RX Error handler */
 		if ((intflags & EIR_RXERIF) != 0) {

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] netdev: enc28j60 kernel panic fix.
  2016-05-06 20:35     ` Francois Romieu
@ 2016-05-09 16:59       ` David Russell
  0 siblings, 0 replies; 5+ messages in thread
From: David Russell @ 2016-05-09 16:59 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev

On Fri, May 6, 2016 at 3:35 PM, Francois Romieu <romieu@fr.zoreil.com> wrote:
> (please don't top post)

Sorry about that.

> David Russell <david@aprsworld.com> :
>> I kind of thought my patch was at best incomplete.  When you state
>> this change silences the bug but does not fix it, what are the
>> implications of systems running this patch?  We have some production
>> systems using this patch.  They reboot daily, but have been solid.
>
> If my assumption is right it should drop an extra packet here and there.
> No leak.

Lost packets are "acceptable" in this case as much as I hate "Good Enough".

> However transmit errors + transmit packets should still match the number
> of times the driver calls enc28j60_send_packet (you would have to cook
> your own stat to check the latter though).
>
>> In addition, if we sent you a pi and the ethernet controller and a
>> small but reasonable sum of money for your labor, would you be able to
>> properly fix it ?
>
> I'd rather see you testing my crap. :o)
>
> Pi as multi-core (the expected race needs several cores or a netconsole
> style transmit from an irq/bh context) ?

Yes, multi-core ARM.

>> Short of that, do you have any recommendations on quick overviews of
>> the networking stack in the kernel and then documentation on the
>> various flags and such?
>
> A tad bit too high-level a question... Plain ctags + printk for a start ?

I was afraid of that.  I prefer proper documentation/specifications in
addition to source.

> Does the patch below make a difference ?
>
> Takes longer to crash counts as a difference.

As far as I can tell it solves the problem of the kernel panicing.

> diff --git a/drivers/net/ethernet/microchip/enc28j60.c b/drivers/net/ethernet/microchip/enc28j60.c
> index 7066954..405fe3f 100644
> --- a/drivers/net/ethernet/microchip/enc28j60.c
> +++ b/drivers/net/ethernet/microchip/enc28j60.c
> @@ -1170,7 +1170,8 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
>                                 enc28j60_dump_tsv(priv, "Tx Done", tsv);
>                         }
>                         enc28j60_tx_clear(ndev, err);
> -                       locked_reg_bfclr(priv, EIR, EIR_TXIF);
> +                       locked_reg_bfclr(priv, EIR, EIR_TXIF | EIR_TXERIF);
> +                       intflags &= ~EIR_TXERIF;
>                 }
>                 /* TX Error handler */
>                 if ((intflags & EIR_TXERIF) != 0) {
> @@ -1190,6 +1191,7 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
>                         nolock_reg_bfclr(priv, ECON1, ECON1_TXRST);
>                         nolock_txfifo_init(priv, TXSTART_INIT, TXEND_INIT);
>                         mutex_unlock(&priv->lock);
> +                       locked_reg_bfclr(priv, EIR, EIR_TXIF | EIR_TXERIF);
>                         /* Transmit Late collision check for retransmit */
>                         if (TSV_GETBIT(tsv, TSV_TXLATECOLLISION)) {
>                                 if (netif_msg_tx_err(priv))
> @@ -1203,7 +1205,6 @@ static void enc28j60_irq_work_handler(struct work_struct *work)
>                                         enc28j60_tx_clear(ndev, true);
>                         } else
>                                 enc28j60_tx_clear(ndev, true);
> -                       locked_reg_bfclr(priv, EIR, EIR_TXERIF);
>                 }
>                 /* RX Error handler */
>                 if ((intflags & EIR_RXERIF) != 0) {

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2016-05-09 16:59 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-05-04 17:51 [PATCH] netdev: enc28j60 kernel panic fix David Russell
2016-05-05  8:51 ` Francois Romieu
2016-05-06 18:26   ` David Russell
2016-05-06 20:35     ` Francois Romieu
2016-05-09 16:59       ` David Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).