* via_rhine kernel crashes in 2.6.32 @ 2009-12-19 11:12 Andrey Rahmatullin 2009-12-20 20:03 ` Andrey Rahmatullin 0 siblings, 1 reply; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-19 11:12 UTC (permalink / raw) To: linux-kernel, Roger Luethi [-- Attachment #1: Type: text/plain, Size: 946 bytes --] Hello. Recently I've started to experience kernel crashes under heavy network load. It may be caused by switching to .32-rc0, but as .30 crashed too, it may be caused by switching from Vuze to Deluge (maybe it uses the network more actively). Unfortunately, most crashes can't be detected with netconsole, and they usually don't fit into 1280x1024 (I think the kernel prints 2-3 or more backtraces of random processes like hostapd and scmpc and then completely locks up). Here are two screenshots that look like complete one-screen traces: http://wrar.name/temp/P1010714.JPG - this is from .32-rc0 (v2.6.32-7123-g75b0803) http://wrar.name/temp/P1010711.JPG - this is from 2.6.30 (stock ALT Linux) Of course I have lots of "eth0: Transmit timed out" and some "WARNING: at net/sched/sch_generic.c:255 dev_watchdog" before crash, as described at http://bugzilla.kernel.org/show_bug.cgi?id=11663 -- WBR, wRAR (ALT Linux Team) [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-19 11:12 via_rhine kernel crashes in 2.6.32 Andrey Rahmatullin @ 2009-12-20 20:03 ` Andrey Rahmatullin 2009-12-21 12:03 ` Christian Kujau 0 siblings, 1 reply; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-20 20:03 UTC (permalink / raw) To: linux-kernel, Roger Luethi [-- Attachment #1: Type: text/plain, Size: 159 bytes --] Here are the proper screenshots from 2.6.33-rc1: http://wrar.name/temp/P1010724.JPG http://wrar.name/temp/P1010725.JPG -- WBR, wRAR (ALT Linux Team) [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-20 20:03 ` Andrey Rahmatullin @ 2009-12-21 12:03 ` Christian Kujau 2009-12-21 12:36 ` Andrey Rahmatullin 2009-12-21 18:18 ` Andrey Rahmatullin 0 siblings, 2 replies; 16+ messages in thread From: Christian Kujau @ 2009-12-21 12:03 UTC (permalink / raw) To: Andrey Rahmatullin; +Cc: LKML, Roger Luethi, netdev [netdev Cc'ed] On Mon, 21 Dec 2009 at 01:03, Andrey Rahmatullin wrote: > Here are the proper screenshots from 2.6.33-rc1: > http://wrar.name/temp/P1010724.JPG > http://wrar.name/temp/P1010725.JPG Wow, I had to rotate my laptop here. In you first posting [0] you said you "upgraded to .32" but had crashes in .30 too. What was the last working kernel? In the .jpgs one can see something along the lines of schedule_timeout, napi_disable, rhine_tx_timeout - the only patch I could find touching these things [1] altogether is rather old, from 2.6.24. Is 2.6.29 working for you? Christian. [0] http://lkml.org/lkml/2009/12/19/36 [1] http://lists.openwall.net/netdev/2007/08/28/13 -- BOFH excuse #19: floating point processor overflow ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-21 12:03 ` Christian Kujau @ 2009-12-21 12:36 ` Andrey Rahmatullin 2009-12-21 18:18 ` Andrey Rahmatullin 1 sibling, 0 replies; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-21 12:36 UTC (permalink / raw) To: Christian Kujau; +Cc: LKML, Roger Luethi, netdev [-- Attachment #1: Type: text/plain, Size: 641 bytes --] On Mon, Dec 21, 2009 at 04:03:06AM -0800, Christian Kujau wrote: > Wow, I had to rotate my laptop here. In you first posting [0] you said you > "upgraded to .32" but had crashes in .30 too. What was the last working > kernel? It was 2.6.31, but as I've already said I'm not sure it was because of kernel version alone. > In the .jpgs one can see something along the lines of > schedule_timeout, napi_disable, rhine_tx_timeout - the only patch I could > find touching these things [1] altogether is rather old, from 2.6.24. Is > 2.6.29 working for you? Well, I can try some old versions. -- WBR, wRAR (ALT Linux Team) [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-21 12:03 ` Christian Kujau 2009-12-21 12:36 ` Andrey Rahmatullin @ 2009-12-21 18:18 ` Andrey Rahmatullin 2009-12-21 19:32 ` Christian Kujau 1 sibling, 1 reply; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-21 18:18 UTC (permalink / raw) To: Christian Kujau; +Cc: LKML, Roger Luethi, netdev [-- Attachment #1: Type: text/plain, Size: 1057 bytes --] On Mon, Dec 21, 2009 at 04:03:06AM -0800, Christian Kujau wrote: > In the .jpgs one can see something along the lines of > schedule_timeout, napi_disable, rhine_tx_timeout - the only patch I could > find touching these things [1] altogether is rather old, from 2.6.24. Is > 2.6.29 working for you? I've installed 2.6.27, the earliest kernel supported by udev 149. I started to download two large files to my second machine (maybe 700 or 1000 Kbyte/s combined). Nothing happened. I stopped the downloads and started deluged. The kernel showed "netdev watchdog timeout", but nothing else happened. deluged opened something like 100 TCP connections and started to upload some data at ~20 Kbyte/s. Nothing happened. I resumed one of downloads, waited for some time, nothing happened. I resumed the second download, the kernel crashed into an endless stream of backtraces (did 2.6.27 support pause_on_oops?), containing "whatever from the idle thread", or smth like that, which was also in other crash logs. -- WBR, wRAR (ALT Linux Team) [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-21 18:18 ` Andrey Rahmatullin @ 2009-12-21 19:32 ` Christian Kujau 2009-12-22 12:32 ` Jarek Poplawski 0 siblings, 1 reply; 16+ messages in thread From: Christian Kujau @ 2009-12-21 19:32 UTC (permalink / raw) To: Andrey Rahmatullin; +Cc: LKML, Roger Luethi, netdev On Mon, 21 Dec 2009 at 23:18, Andrey Rahmatullin wrote: > I've installed 2.6.27, the earliest kernel supported by udev 149. I [...] > one of downloads, waited for some time, nothing happened. I resumed the > second download, the kernel crashed into an endless stream of backtraces > (did 2.6.27 support pause_on_oops?), containing "whatever from the idle > thread", or smth like that, which was also in other crash logs. So, 2.6.27 crashed as well. Was the backtrace similar to those on 2.6.30? I know it's a long shot, but since you seem to be able to reproduce this pretty reliably, can you try 2.6.23? Or at least something before bea3348eef27e6044b6161fd04c3152215f96411 [0]? Christian. [0] I'm *really* guessing here, if some netdev guru has some better understanding of the backtraces Andrey sent, please step forward. -- BOFH excuse #433: error: one bad user found in front of screen ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-21 19:32 ` Christian Kujau @ 2009-12-22 12:32 ` Jarek Poplawski 2009-12-22 13:21 ` Jarek Poplawski 0 siblings, 1 reply; 16+ messages in thread From: Jarek Poplawski @ 2009-12-22 12:32 UTC (permalink / raw) To: Christian Kujau; +Cc: Andrey Rahmatullin, LKML, Roger Luethi, netdev On 21-12-2009 20:32, Christian Kujau wrote: > On Mon, 21 Dec 2009 at 23:18, Andrey Rahmatullin wrote: >> I've installed 2.6.27, the earliest kernel supported by udev 149. I > [...] >> one of downloads, waited for some time, nothing happened. I resumed the >> second download, the kernel crashed into an endless stream of backtraces >> (did 2.6.27 support pause_on_oops?), containing "whatever from the idle >> thread", or smth like that, which was also in other crash logs. > > So, 2.6.27 crashed as well. Was the backtrace similar to those on 2.6.30? > I know it's a long shot, but since you seem to be able to reproduce this > pretty reliably, can you try 2.6.23? Or at least something before > bea3348eef27e6044b6161fd04c3152215f96411 [0]? > > Christian. > > [0] I'm *really* guessing here, if some netdev guru has some better > understanding of the backtraces Andrey sent, please step forward. It looks like napi_disable() should be illegal in ndo_tx_timeout(). Here is a patch which moves most of the timeout work to a workqueue, similarly to tg3 etc. It should prevent at least one of reported bugs. Alas I can't even check-compile it at the moment, so let me know on any problems. Jarek P. --- drivers/net/via-rhine.c | 41 ++++++++++++++++++++++++++++------------- 1 files changed, 28 insertions(+), 13 deletions(-) diff --git a/drivers/net/via-rhine.c b/drivers/net/via-rhine.c index 593e01f..125406b 100644 --- a/drivers/net/via-rhine.c +++ b/drivers/net/via-rhine.c @@ -102,6 +102,7 @@ static const int multicast_filter_limit = 32; #include <linux/ethtool.h> #include <linux/crc32.h> #include <linux/bitops.h> +#include <linux/workqueue.h> #include <asm/processor.h> /* Processor type for cache alignment. */ #include <asm/io.h> #include <asm/irq.h> @@ -389,6 +390,7 @@ struct rhine_private { struct net_device *dev; struct napi_struct napi; spinlock_t lock; + struct work_struct reset_task; /* Frequently used values: keep some adjacent for cache effect. */ u32 quirks; @@ -407,6 +409,7 @@ struct rhine_private { static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); static int rhine_open(struct net_device *dev); +static void rhine_reset_task(struct work_struct *work); static void rhine_tx_timeout(struct net_device *dev); static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev); @@ -775,6 +778,8 @@ static int __devinit rhine_init_one(struct pci_dev *pdev, dev->irq = pdev->irq; spin_lock_init(&rp->lock); + INIT_WORK(&rp->reset_task, rhine_reset_task); + rp->mii_if.dev = dev; rp->mii_if.mdio_read = mdio_read; rp->mii_if.mdio_write = mdio_write; @@ -1179,22 +1184,18 @@ static int rhine_open(struct net_device *dev) return 0; } -static void rhine_tx_timeout(struct net_device *dev) +static void rhine_reset_task(struct work_struct *work) { - struct rhine_private *rp = netdev_priv(dev); - void __iomem *ioaddr = rp->base; - - printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " - "%4.4x, resetting...\n", - dev->name, ioread16(ioaddr + IntrStatus), - mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + struct rhine_private *rp = container_of(work, struct rhine_private, + reset_task); + struct net_device *dev = rp->dev; /* protect against concurrent rx interrupts */ disable_irq(rp->pdev->irq); napi_disable(&rp->napi); - spin_lock(&rp->lock); + spin_lock_irq(&rp->lock); /* clear all descriptors */ free_tbufs(dev); @@ -1206,7 +1207,7 @@ static void rhine_tx_timeout(struct net_device *dev) rhine_chip_reset(dev); init_registers(dev); - spin_unlock(&rp->lock); + spin_unlock_irq(&rp->lock); enable_irq(rp->pdev->irq); dev->trans_start = jiffies; @@ -1214,6 +1215,19 @@ static void rhine_tx_timeout(struct net_device *dev) netif_wake_queue(dev); } +static void rhine_tx_timeout(struct net_device *dev) +{ + struct rhine_private *rp = netdev_priv(dev); + void __iomem *ioaddr = rp->base; + + printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " + "%4.4x, resetting...\n", + dev->name, ioread16(ioaddr + IntrStatus), + mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + + schedule_work(&rp->reset_task); +} + static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev) { @@ -1830,10 +1844,11 @@ static int rhine_close(struct net_device *dev) struct rhine_private *rp = netdev_priv(dev); void __iomem *ioaddr = rp->base; - spin_lock_irq(&rp->lock); - - netif_stop_queue(dev); napi_disable(&rp->napi); + cancel_work_sync(&rp->reset_task); + netif_stop_queue(dev); + + spin_lock_irq(&rp->lock); if (debug > 1) printk(KERN_DEBUG "%s: Shutting down ethercard, " ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-22 12:32 ` Jarek Poplawski @ 2009-12-22 13:21 ` Jarek Poplawski 2009-12-22 13:38 ` Jarek Poplawski 0 siblings, 1 reply; 16+ messages in thread From: Jarek Poplawski @ 2009-12-22 13:21 UTC (permalink / raw) To: Christian Kujau; +Cc: Andrey Rahmatullin, LKML, Roger Luethi, netdev On Tue, Dec 22, 2009 at 12:32:11PM +0000, Jarek Poplawski wrote: > It looks like napi_disable() should be illegal in ndo_tx_timeout(). > Here is a patch which moves most of the timeout work to a workqueue, > similarly to tg3 etc. It should prevent at least one of reported > bugs. Alas I can't even check-compile it at the moment, so let me > know on any problems. It seems I needlessly changed locking btw, so here it is again. Jarek P. --- (take 2) drivers/net/via-rhine.c | 37 ++++++++++++++++++++++++++----------- 1 files changed, 26 insertions(+), 11 deletions(-) diff --git a/drivers/net/via-rhine.c b/drivers/net/via-rhine.c index 593e01f..15a4063 100644 --- a/drivers/net/via-rhine.c +++ b/drivers/net/via-rhine.c @@ -102,6 +102,7 @@ static const int multicast_filter_limit = 32; #include <linux/ethtool.h> #include <linux/crc32.h> #include <linux/bitops.h> +#include <linux/workqueue.h> #include <asm/processor.h> /* Processor type for cache alignment. */ #include <asm/io.h> #include <asm/irq.h> @@ -389,6 +390,7 @@ struct rhine_private { struct net_device *dev; struct napi_struct napi; spinlock_t lock; + struct work_struct reset_task; /* Frequently used values: keep some adjacent for cache effect. */ u32 quirks; @@ -407,6 +409,7 @@ struct rhine_private { static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); static int rhine_open(struct net_device *dev); +static void rhine_reset_task(struct work_struct *work); static void rhine_tx_timeout(struct net_device *dev); static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev); @@ -775,6 +778,8 @@ static int __devinit rhine_init_one(struct pci_dev *pdev, dev->irq = pdev->irq; spin_lock_init(&rp->lock); + INIT_WORK(&rp->reset_task, rhine_reset_task); + rp->mii_if.dev = dev; rp->mii_if.mdio_read = mdio_read; rp->mii_if.mdio_write = mdio_write; @@ -1179,15 +1184,11 @@ static int rhine_open(struct net_device *dev) return 0; } -static void rhine_tx_timeout(struct net_device *dev) +static void rhine_reset_task(struct work_struct *work) { - struct rhine_private *rp = netdev_priv(dev); - void __iomem *ioaddr = rp->base; - - printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " - "%4.4x, resetting...\n", - dev->name, ioread16(ioaddr + IntrStatus), - mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + struct rhine_private *rp = container_of(work, struct rhine_private, + reset_task); + struct net_device *dev = rp->dev; /* protect against concurrent rx interrupts */ disable_irq(rp->pdev->irq); @@ -1214,6 +1215,19 @@ static void rhine_tx_timeout(struct net_device *dev) netif_wake_queue(dev); } +static void rhine_tx_timeout(struct net_device *dev) +{ + struct rhine_private *rp = netdev_priv(dev); + void __iomem *ioaddr = rp->base; + + printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " + "%4.4x, resetting...\n", + dev->name, ioread16(ioaddr + IntrStatus), + mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + + schedule_work(&rp->reset_task); +} + static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev) { @@ -1830,10 +1844,11 @@ static int rhine_close(struct net_device *dev) struct rhine_private *rp = netdev_priv(dev); void __iomem *ioaddr = rp->base; - spin_lock_irq(&rp->lock); - - netif_stop_queue(dev); napi_disable(&rp->napi); + cancel_work_sync(&rp->reset_task); + netif_stop_queue(dev); + + spin_lock_irq(&rp->lock); if (debug > 1) printk(KERN_DEBUG "%s: Shutting down ethercard, " ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-22 13:21 ` Jarek Poplawski @ 2009-12-22 13:38 ` Jarek Poplawski 2009-12-22 15:00 ` Andrey Rahmatullin 0 siblings, 1 reply; 16+ messages in thread From: Jarek Poplawski @ 2009-12-22 13:38 UTC (permalink / raw) To: Christian Kujau; +Cc: Andrey Rahmatullin, LKML, Roger Luethi, netdev On Tue, Dec 22, 2009 at 01:21:07PM +0000, Jarek Poplawski wrote: > On Tue, Dec 22, 2009 at 12:32:11PM +0000, Jarek Poplawski wrote: > > It looks like napi_disable() should be illegal in ndo_tx_timeout(). > > Here is a patch which moves most of the timeout work to a workqueue, > > similarly to tg3 etc. It should prevent at least one of reported > > bugs. Alas I can't even check-compile it at the moment, so let me > > know on any problems. > > It seems I needlessly changed locking btw, so here it is again. Hmm... On the other hand, it definitely needs at least _bh now... Sorry, Jarek P. --- (take 3) drivers/net/via-rhine.c | 41 ++++++++++++++++++++++++++++------------- 1 files changed, 28 insertions(+), 13 deletions(-) diff --git a/drivers/net/via-rhine.c b/drivers/net/via-rhine.c index 593e01f..125406b 100644 --- a/drivers/net/via-rhine.c +++ b/drivers/net/via-rhine.c @@ -102,6 +102,7 @@ static const int multicast_filter_limit = 32; #include <linux/ethtool.h> #include <linux/crc32.h> #include <linux/bitops.h> +#include <linux/workqueue.h> #include <asm/processor.h> /* Processor type for cache alignment. */ #include <asm/io.h> #include <asm/irq.h> @@ -389,6 +390,7 @@ struct rhine_private { struct net_device *dev; struct napi_struct napi; spinlock_t lock; + struct work_struct reset_task; /* Frequently used values: keep some adjacent for cache effect. */ u32 quirks; @@ -407,6 +409,7 @@ struct rhine_private { static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); static int rhine_open(struct net_device *dev); +static void rhine_reset_task(struct work_struct *work); static void rhine_tx_timeout(struct net_device *dev); static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev); @@ -775,6 +778,8 @@ static int __devinit rhine_init_one(struct pci_dev *pdev, dev->irq = pdev->irq; spin_lock_init(&rp->lock); + INIT_WORK(&rp->reset_task, rhine_reset_task); + rp->mii_if.dev = dev; rp->mii_if.mdio_read = mdio_read; rp->mii_if.mdio_write = mdio_write; @@ -1179,22 +1184,18 @@ static int rhine_open(struct net_device *dev) return 0; } -static void rhine_tx_timeout(struct net_device *dev) +static void rhine_reset_task(struct work_struct *work) { - struct rhine_private *rp = netdev_priv(dev); - void __iomem *ioaddr = rp->base; - - printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " - "%4.4x, resetting...\n", - dev->name, ioread16(ioaddr + IntrStatus), - mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + struct rhine_private *rp = container_of(work, struct rhine_private, + reset_task); + struct net_device *dev = rp->dev; /* protect against concurrent rx interrupts */ disable_irq(rp->pdev->irq); napi_disable(&rp->napi); - spin_lock(&rp->lock); + spin_lock_bh(&rp->lock); /* clear all descriptors */ free_tbufs(dev); @@ -1206,7 +1207,7 @@ static void rhine_tx_timeout(struct net_device *dev) rhine_chip_reset(dev); init_registers(dev); - spin_unlock(&rp->lock); + spin_unlock_bh(&rp->lock); enable_irq(rp->pdev->irq); dev->trans_start = jiffies; @@ -1214,6 +1215,19 @@ static void rhine_tx_timeout(struct net_device *dev) netif_wake_queue(dev); } +static void rhine_tx_timeout(struct net_device *dev) +{ + struct rhine_private *rp = netdev_priv(dev); + void __iomem *ioaddr = rp->base; + + printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " + "%4.4x, resetting...\n", + dev->name, ioread16(ioaddr + IntrStatus), + mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + + schedule_work(&rp->reset_task); +} + static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev) { @@ -1830,10 +1844,11 @@ static int rhine_close(struct net_device *dev) struct rhine_private *rp = netdev_priv(dev); void __iomem *ioaddr = rp->base; - spin_lock_irq(&rp->lock); - - netif_stop_queue(dev); napi_disable(&rp->napi); + cancel_work_sync(&rp->reset_task); + netif_stop_queue(dev); + + spin_lock_irq(&rp->lock); if (debug > 1) printk(KERN_DEBUG "%s: Shutting down ethercard, " ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-22 13:38 ` Jarek Poplawski @ 2009-12-22 15:00 ` Andrey Rahmatullin 2009-12-22 15:26 ` Roger Luethi 0 siblings, 1 reply; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-22 15:00 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Christian Kujau, LKML, Roger Luethi, netdev [-- Attachment #1: Type: text/plain, Size: 645 bytes --] On Tue, Dec 22, 2009 at 01:38:17PM +0000, Jarek Poplawski wrote: > > > It looks like napi_disable() should be illegal in ndo_tx_timeout(). > > > Here is a patch which moves most of the timeout work to a workqueue, > > > similarly to tg3 etc. It should prevent at least one of reported > > > bugs. Alas I can't even check-compile it at the moment, so let me > > > know on any problems. > > It seems I needlessly changed locking btw, so here it is again. > Hmm... On the other hand, it definitely needs at least _bh now... I've tried this patch. There are lots of "Transmit timed out", but no crashes. -- WBR, wRAR (ALT Linux Team) [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-22 15:00 ` Andrey Rahmatullin @ 2009-12-22 15:26 ` Roger Luethi 2009-12-22 17:36 ` [PATCH] net/via-rhine: Fix scheduling while atomic bugs Jarek Poplawski 2009-12-23 9:52 ` via_rhine kernel crashes in 2.6.32 Jarek Poplawski 0 siblings, 2 replies; 16+ messages in thread From: Roger Luethi @ 2009-12-22 15:26 UTC (permalink / raw) To: Andrey Rahmatullin; +Cc: Jarek Poplawski, Christian Kujau, LKML, netdev On Tue, 22 Dec 2009 20:00:45 +0500, Andrey Rahmatullin wrote: > On Tue, Dec 22, 2009 at 01:38:17PM +0000, Jarek Poplawski wrote: > > > > It looks like napi_disable() should be illegal in ndo_tx_timeout(). > > > > Here is a patch which moves most of the timeout work to a workqueue, > > > > similarly to tg3 etc. It should prevent at least one of reported > > > > bugs. Alas I can't even check-compile it at the moment, so let me > > > > know on any problems. > > > It seems I needlessly changed locking btw, so here it is again. > > Hmm... On the other hand, it definitely needs at least _bh now... > I've tried this patch. There are lots of "Transmit timed out", but no > crashes. ACK. Looks like you guys tracked down the crashing and fixed it (thanks!). I suspect we shouldn't have to reset due to timeouts that often, but that's another story. Roger ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH] net/via-rhine: Fix scheduling while atomic bugs 2009-12-22 15:26 ` Roger Luethi @ 2009-12-22 17:36 ` Jarek Poplawski 2009-12-24 5:54 ` David Miller 2009-12-23 9:52 ` via_rhine kernel crashes in 2.6.32 Jarek Poplawski 1 sibling, 1 reply; 16+ messages in thread From: Jarek Poplawski @ 2009-12-22 17:36 UTC (permalink / raw) To: David Miller Cc: Roger Luethi, Andrey Rahmatullin, Christian Kujau, LKML, netdev On Tue, Dec 22, 2009 at 04:26:59PM +0100, Roger Luethi wrote: > On Tue, 22 Dec 2009 20:00:45 +0500, Andrey Rahmatullin wrote: > > On Tue, Dec 22, 2009 at 01:38:17PM +0000, Jarek Poplawski wrote: > > > > > It looks like napi_disable() should be illegal in ndo_tx_timeout(). > > > > > Here is a patch which moves most of the timeout work to a workqueue, > > > > > similarly to tg3 etc. It should prevent at least one of reported > > > > > bugs. Alas I can't even check-compile it at the moment, so let me > > > > > know on any problems. > > > > It seems I needlessly changed locking btw, so here it is again. > > > Hmm... On the other hand, it definitely needs at least _bh now... > > I've tried this patch. There are lots of "Transmit timed out", but no > > crashes. > > ACK. Looks like you guys tracked down the crashing and fixed it (thanks!). > I suspect we shouldn't have to reset due to timeouts that often, but that's > another story. > > Roger Thanks everybody, Jarek P. -------------------> There are BUGs "scheduling while atomic" triggered by the timer rhine_tx_timeout(). They are caused by calling napi_disable() (with msleep()). This patch fixes it by moving most of the timer content to the workqueue function (similarly to other drivers, like tg3), with spin_lock() changed to BH version. Additionally, there is spin_lock_irq() moved in rhine_close() to exclude napi_disable() etc., also tg3's way. Reported-by: Andrey Rahmatullin <wrar@altlinux.org> Tested-by: Andrey Rahmatullin <wrar@altlinux.org> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Roger Luethi <rl@hellgate.ch> --- drivers/net/via-rhine.c | 41 ++++++++++++++++++++++++++++------------- 1 files changed, 28 insertions(+), 13 deletions(-) diff --git a/drivers/net/via-rhine.c b/drivers/net/via-rhine.c index 593e01f..125406b 100644 --- a/drivers/net/via-rhine.c +++ b/drivers/net/via-rhine.c @@ -102,6 +102,7 @@ static const int multicast_filter_limit = 32; #include <linux/ethtool.h> #include <linux/crc32.h> #include <linux/bitops.h> +#include <linux/workqueue.h> #include <asm/processor.h> /* Processor type for cache alignment. */ #include <asm/io.h> #include <asm/irq.h> @@ -389,6 +390,7 @@ struct rhine_private { struct net_device *dev; struct napi_struct napi; spinlock_t lock; + struct work_struct reset_task; /* Frequently used values: keep some adjacent for cache effect. */ u32 quirks; @@ -407,6 +409,7 @@ struct rhine_private { static int mdio_read(struct net_device *dev, int phy_id, int location); static void mdio_write(struct net_device *dev, int phy_id, int location, int value); static int rhine_open(struct net_device *dev); +static void rhine_reset_task(struct work_struct *work); static void rhine_tx_timeout(struct net_device *dev); static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev); @@ -775,6 +778,8 @@ static int __devinit rhine_init_one(struct pci_dev *pdev, dev->irq = pdev->irq; spin_lock_init(&rp->lock); + INIT_WORK(&rp->reset_task, rhine_reset_task); + rp->mii_if.dev = dev; rp->mii_if.mdio_read = mdio_read; rp->mii_if.mdio_write = mdio_write; @@ -1179,22 +1184,18 @@ static int rhine_open(struct net_device *dev) return 0; } -static void rhine_tx_timeout(struct net_device *dev) +static void rhine_reset_task(struct work_struct *work) { - struct rhine_private *rp = netdev_priv(dev); - void __iomem *ioaddr = rp->base; - - printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " - "%4.4x, resetting...\n", - dev->name, ioread16(ioaddr + IntrStatus), - mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + struct rhine_private *rp = container_of(work, struct rhine_private, + reset_task); + struct net_device *dev = rp->dev; /* protect against concurrent rx interrupts */ disable_irq(rp->pdev->irq); napi_disable(&rp->napi); - spin_lock(&rp->lock); + spin_lock_bh(&rp->lock); /* clear all descriptors */ free_tbufs(dev); @@ -1206,7 +1207,7 @@ static void rhine_tx_timeout(struct net_device *dev) rhine_chip_reset(dev); init_registers(dev); - spin_unlock(&rp->lock); + spin_unlock_bh(&rp->lock); enable_irq(rp->pdev->irq); dev->trans_start = jiffies; @@ -1214,6 +1215,19 @@ static void rhine_tx_timeout(struct net_device *dev) netif_wake_queue(dev); } +static void rhine_tx_timeout(struct net_device *dev) +{ + struct rhine_private *rp = netdev_priv(dev); + void __iomem *ioaddr = rp->base; + + printk(KERN_WARNING "%s: Transmit timed out, status %4.4x, PHY status " + "%4.4x, resetting...\n", + dev->name, ioread16(ioaddr + IntrStatus), + mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); + + schedule_work(&rp->reset_task); +} + static netdev_tx_t rhine_start_tx(struct sk_buff *skb, struct net_device *dev) { @@ -1830,10 +1844,11 @@ static int rhine_close(struct net_device *dev) struct rhine_private *rp = netdev_priv(dev); void __iomem *ioaddr = rp->base; - spin_lock_irq(&rp->lock); - - netif_stop_queue(dev); napi_disable(&rp->napi); + cancel_work_sync(&rp->reset_task); + netif_stop_queue(dev); + + spin_lock_irq(&rp->lock); if (debug > 1) printk(KERN_DEBUG "%s: Shutting down ethercard, " ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH] net/via-rhine: Fix scheduling while atomic bugs 2009-12-22 17:36 ` [PATCH] net/via-rhine: Fix scheduling while atomic bugs Jarek Poplawski @ 2009-12-24 5:54 ` David Miller 0 siblings, 0 replies; 16+ messages in thread From: David Miller @ 2009-12-24 5:54 UTC (permalink / raw) To: jarkao2; +Cc: rl, wrar, lists, linux-kernel, netdev From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 22 Dec 2009 18:36:42 +0100 > There are BUGs "scheduling while atomic" triggered by the timer > rhine_tx_timeout(). They are caused by calling napi_disable() (with > msleep()). This patch fixes it by moving most of the timer content to > the workqueue function (similarly to other drivers, like tg3), with > spin_lock() changed to BH version. > > Additionally, there is spin_lock_irq() moved in rhine_close() to > exclude napi_disable() etc., also tg3's way. > > Reported-by: Andrey Rahmatullin <wrar@altlinux.org> > Tested-by: Andrey Rahmatullin <wrar@altlinux.org> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Applied, thanks! ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-22 15:26 ` Roger Luethi 2009-12-22 17:36 ` [PATCH] net/via-rhine: Fix scheduling while atomic bugs Jarek Poplawski @ 2009-12-23 9:52 ` Jarek Poplawski 2009-12-23 16:21 ` Andrey Rahmatullin 1 sibling, 1 reply; 16+ messages in thread From: Jarek Poplawski @ 2009-12-23 9:52 UTC (permalink / raw) To: Roger Luethi; +Cc: Andrey Rahmatullin, Christian Kujau, LKML, netdev On 22-12-2009 16:26, Roger Luethi wrote: > On Tue, 22 Dec 2009 20:00:45 +0500, Andrey Rahmatullin wrote: >> On Tue, Dec 22, 2009 at 01:38:17PM +0000, Jarek Poplawski wrote: >>>>> It looks like napi_disable() should be illegal in ndo_tx_timeout(). >>>>> Here is a patch which moves most of the timeout work to a workqueue, >>>>> similarly to tg3 etc. It should prevent at least one of reported >>>>> bugs. Alas I can't even check-compile it at the moment, so let me >>>>> know on any problems. >>>> It seems I needlessly changed locking btw, so here it is again. >>> Hmm... On the other hand, it definitely needs at least _bh now... >> I've tried this patch. There are lots of "Transmit timed out", but no >> crashes. > > ACK. Looks like you guys tracked down the crashing and fixed it (thanks!). > I suspect we shouldn't have to reset due to timeouts that often, but that's > another story. BTW, it seems a change in 2.6.31 might trigger these timeouts more often than before. Andrey, could you try if this matters here? Thanks, Jarek P. --- (on top of net-2.6 with the previous "Fix scheduling..." patch) diff -Nurp a/drivers/net/via-rhine.c b/drivers/net/via-rhine.c --- a/drivers/net/via-rhine.c 2009-12-23 09:28:25.000000000 +0000 +++ b/drivers/net/via-rhine.c 2009-12-23 09:33:57.000000000 +0000 @@ -1226,6 +1226,7 @@ static void rhine_tx_timeout(struct net_ mdio_read(dev, rp->mii_if.phy_id, MII_BMSR)); schedule_work(&rp->reset_task); + netdev_get_tx_queue(dev, 0)->trans_start = jiffies; } static netdev_tx_t rhine_start_tx(struct sk_buff *skb, ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-23 9:52 ` via_rhine kernel crashes in 2.6.32 Jarek Poplawski @ 2009-12-23 16:21 ` Andrey Rahmatullin 2009-12-23 16:30 ` Jarek Poplawski 0 siblings, 1 reply; 16+ messages in thread From: Andrey Rahmatullin @ 2009-12-23 16:21 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Roger Luethi, Christian Kujau, LKML, netdev On Wed, Dec 23, 2009 at 09:52:03AM +0000, Jarek Poplawski wrote: > BTW, it seems a change in 2.6.31 might trigger these timeouts more > often than before. Andrey, could you try if this matters here? They appear once per 4 seconds with or without this patch. -- WBR, wRAR (ALT Linux Team) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: via_rhine kernel crashes in 2.6.32 2009-12-23 16:21 ` Andrey Rahmatullin @ 2009-12-23 16:30 ` Jarek Poplawski 0 siblings, 0 replies; 16+ messages in thread From: Jarek Poplawski @ 2009-12-23 16:30 UTC (permalink / raw) To: Andrey Rahmatullin; +Cc: Roger Luethi, Christian Kujau, LKML, netdev Andrey Rahmatullin wrote, On 12/23/2009 05:21 PM: > On Wed, Dec 23, 2009 at 09:52:03AM +0000, Jarek Poplawski wrote: >> BTW, it seems a change in 2.6.31 might trigger these timeouts more >> often than before. Andrey, could you try if this matters here? > They appear once per 4 seconds with or without this patch. > OK, thanks for checking this. Jarek P. ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2009-12-24 5:54 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-12-19 11:12 via_rhine kernel crashes in 2.6.32 Andrey Rahmatullin 2009-12-20 20:03 ` Andrey Rahmatullin 2009-12-21 12:03 ` Christian Kujau 2009-12-21 12:36 ` Andrey Rahmatullin 2009-12-21 18:18 ` Andrey Rahmatullin 2009-12-21 19:32 ` Christian Kujau 2009-12-22 12:32 ` Jarek Poplawski 2009-12-22 13:21 ` Jarek Poplawski 2009-12-22 13:38 ` Jarek Poplawski 2009-12-22 15:00 ` Andrey Rahmatullin 2009-12-22 15:26 ` Roger Luethi 2009-12-22 17:36 ` [PATCH] net/via-rhine: Fix scheduling while atomic bugs Jarek Poplawski 2009-12-24 5:54 ` David Miller 2009-12-23 9:52 ` via_rhine kernel crashes in 2.6.32 Jarek Poplawski 2009-12-23 16:21 ` Andrey Rahmatullin 2009-12-23 16:30 ` Jarek Poplawski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox