* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
[not found] <bug-19692-10286@https.bugzilla.kernel.org/>
@ 2010-10-04 20:53 ` Andrew Morton
2010-10-08 9:24 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2010-10-04 20:53 UTC (permalink / raw)
To: netdev; +Cc: bugzilla-daemon, bugme-daemon, eminak71, Anton Vorontsov
(switched to email. Please respond via emailed reply-to-all, not via the
bugzilla web interface).
On Mon, 4 Oct 2010 06:25:14 GMT
bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=19692
>
> Summary: linux-2.6.36-rc5 crash with gianfar ethernet at full
> line rate traffic
> Product: Drivers
> Version: 2.5
> Kernel Version: 2.6.36-rc5
> Platform: All
> OS/Version: Linux
> Tree: Mainline
> Status: NEW
> Severity: blocking
> Priority: P1
> Component: Network
> AssignedTo: drivers_network@kernel-bugs.osdl.org
> ReportedBy: eminak71@gmail.com
> CC: eminak71@gmail.com
> Regression: Yes
>
>
> My problem is kernel crash under full line rate random packet length
> ip network traffic.
> I'am using default unmodified kernel and default SMP kernel
> configuration, MPC8572DS development board and also using a hardware
> packet generator.
> My test is ip forwarding between eth0 and eth1, and Hardware packet
> generator produces full duplex, full line rate traffic with random
> packet length and random payload . After a few millions of packets
> passed, kernel produces this bellow two different crash messages . I
> have retry this scenario many times, crash occurs sometimes on
> skb_put, but mostly occurs on ip_rcv function. I have aplied same
> test to latest stable linux 2.6.35.6 kernel. Same errors produced.
>
>
> Here is crash logs:
>
>
>
> Thanks.
>
> Emin
>
>
> First type of crash:
>
> root@mpc8572ds:~# skb_over_panic: text:c0226280 len:1171 put:1171
> head:eed6d000 data:eed63040 tail:0xeed6d4d3 end:0xeed63660 dev:<NULL>
> ------------[ cut here ]------------
> kernel BUG at net/core/skbuff.c:127!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2 MPC8572 DS
> last sysfs file: /sys/devices/pci0002:03/0002:03:00.0/subsystem_device
> Modules linked in:
> NIP: c023bdcc LR: c023bdcc CTR: c01f3ff8
> REGS: effe7d70 TRAP: 0700 Not tainted (2.6.36-rc5)
> MSR: 00029000 <EE,ME,CE> CR: 22028024 XER: 20000000
> TASK = ef83e9a0[9] 'ksoftirqd/1' THREAD: ef856000 CPU: 1
> GPR00: c023bdcc effe7e20 ef83e9a0 0000007c 00021000 ffffffff c01f7b98 c03ccf1c
> GPR08: c03c69d4 c03f94b4 00c4e000 00000004 20028048 1001a108 ef211000 efb52d90
> GPR16: efb52e38 efb52870 00000000 ef211800 00000008 00000009 efb52800 00000037
> GPR24: ef24e180 ef2be040 00000000 ef211948 efb52b80 00000493 ef015940 ef386600
> NIP [c023bdcc] skb_put+0x8c/0x94
> LR [c023bdcc] skb_put+0x8c/0x94
> Call Trace:
> [effe7e20] [c023bdcc] skb_put+0x8c/0x94 (unreliable)
> [effe7e30] [c0226280] gfar_clean_rx_ring+0x104/0x4b8
> [effe7e90] [c02269dc] gfar_poll+0x3a8/0x60c
> [effe7f60] [c024928c] net_rx_action+0xf8/0x1a4
> [effe7fa0] [c0042524] __do_softirq+0xe0/0x178
> [effe7ff0] [c000e59c] call_do_softirq+0x14/0x24
> [ef857f50] [c0004840] do_softirq+0x90/0xa0
> [ef857f70] [c00430e4] run_ksoftirqd+0xb4/0x164
> [ef857fb0] [c00586b4] kthread+0x7c/0x80
> [ef857ff0] [c000e9a8] kernel_thread+0x4c/0x68
> Instruction dump:
> 81030098 2f800000 409e000c 3d20c037 3809a19c 3c60c037 7c8802a6 7d695b78
> 3863b010 90010008 4cc63182 4be016c5 <0fe00000> 48000000 9421fff0 7c0802a6
> Kernel panic - not syncing: Fatal exception in interrupt
> ---------------
>
> second type of crash:
>
> Faulting instruction address: 0xc026c1dc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2 MPC8572 DS
> last sysfs file: /sys/devices/pci0002:03/0002:03:00.0/subsystem_device
> Modules linked in:
> NIP: c026c1dc LR: c026bfac CTR: 00000000
> REGS: effebd00 TRAP: 0300 Not tainted (2.6.36-rc5)
> MSR: 00029000 <EE,ME,CE> CR: 42028042 XER: 00000000
> DEAR: 0000cad8, ESR: 00000000
> TASK = ef83cde0[3] 'ksoftirqd/0' THREAD: ef84a000 CPU: 0
> GPR00: 00000005 effebdb0 ef83cde0 00000000 000001b9 00000000 c1008060 00000000
> GPR08: 02c3f605 0000ca00 000005b9 0000ca00 b653a6c7 7af823f0 ef217000 efbab590
> GPR16: efbab638 efbab070 00000000 ef217800 00000008 00000018 efbab000 00000028
> GPR24: c03f971c c0410000 c0400000 c03f94b4 effea000 ef316e40 00000000 eecb685e
> NIP [c026c1dc] ip_rcv+0x3f8/0x808
> LR [c026bfac] ip_rcv+0x1c8/0x808
> Call Trace:
> [effebdb0] [c026c204] ip_rcv+0x420/0x808 (unreliable)
> [effebde0] [c02482dc] __netif_receive_skb+0x2f8/0x324
> [effebe10] [c02483a4] netif_receive_skb+0x9c/0xb0
> [effebe30] [c0226308] gfar_clean_rx_ring+0x18c/0x4b8
> [effebe90] [c02269dc] gfar_poll+0x3a8/0x60c
> [effebf60] [c024928c] net_rx_action+0xf8/0x1a4
> [effebfa0] [c0042524] __do_softirq+0xe0/0x178
> [effebff0] [c000e59c] call_do_softirq+0x14/0x24
> [ef84bf50] [c0004840] do_softirq+0x90/0xa0
> [ef84bf70] [c00430e4] run_ksoftirqd+0xb4/0x164
> [ef84bfb0] [c00586b4] kthread+0x7c/0x80
> [ef84bff0] [c000e9a8] kernel_thread+0x4c/0x68
> Instruction dump:
> 8148003c 318a0001 7d690194 91680038 9188003c 4bfffd78 7fa3eb78 48002a29
> 2f830000 40beff50 817d0048 5569003c <a00900d8> 2f800005 419e0034 2f800003
> Kernel panic - not syncing: Fatal exception in interrupt
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-04 20:53 ` [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic Andrew Morton
@ 2010-10-08 9:24 ` Jarek Poplawski
2010-10-09 12:10 ` emin ak
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-08 9:24 UTC (permalink / raw)
To: Andrew Morton
Cc: netdev, bugzilla-daemon, bugme-daemon, eminak71, Anton Vorontsov
Andrew Morton wrote:
> (switched to email. Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Mon, 4 Oct 2010 06:25:14 GMT
> bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=19692
>>
>> Summary: linux-2.6.36-rc5 crash with gianfar ethernet at full
>> line rate traffic
...
Emin, until there is something better I hope you could try this patch.
(not tested nor compiled)
Thanks,
Jarek P.
---
diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 4f7c3f3..db47b55 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2515,7 +2515,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
skb_recycle_check(skb, priv->rx_buffer_size +
RXBUF_ALIGNMENT)) {
gfar_align_skb(skb);
- __skb_queue_head(&priv->rx_recycle, skb);
+ skb_queue_head(&priv->rx_recycle, skb);
} else
dev_kfree_skb_any(skb);
@@ -2598,7 +2598,7 @@ struct sk_buff * gfar_new_skb(struct net_device *dev)
struct gfar_private *priv = netdev_priv(dev);
struct sk_buff *skb = NULL;
- skb = __skb_dequeue(&priv->rx_recycle);
+ skb = skb_dequeue(&priv->rx_recycle);
if (!skb)
skb = gfar_alloc_skb(dev);
@@ -2754,7 +2754,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
if (unlikely(!newskb))
newskb = skb;
else if (skb)
- __skb_queue_head(&priv->rx_recycle, skb);
+ skb_queue_head(&priv->rx_recycle, skb);
} else {
/* Increment the number of packets */
rx_queue->stats.rx_packets++;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-08 9:24 ` Jarek Poplawski
@ 2010-10-09 12:10 ` emin ak
2010-10-10 10:32 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: emin ak @ 2010-10-09 12:10 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
> Andrew Morton wrote:
>> (switched to email. Please respond via emailed reply-to-all, not via the
>> bugzilla web interface).
>>
>> On Mon, 4 Oct 2010 06:25:14 GMT
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=19692
>>>
>>> Summary: linux-2.6.36-rc5 crash with gianfar ethernet at full
>>> line rate traffic
> ...
>
> Emin, until there is something better I hope you could try this patch.
> (not tested nor compiled)
>
> Thanks,
> Jarek P.
Hi Jarek,
My test setup is at my office so that I will try your patch on Monday
immediately and turn you the results.
Thanks.
Emin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-09 12:10 ` emin ak
@ 2010-10-10 10:32 ` Jarek Poplawski
2010-10-15 8:58 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-10 10:32 UTC (permalink / raw)
To: emin ak
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
On Sat, Oct 09, 2010 at 03:10:25PM +0300, emin ak wrote:
> Hi Jarek,
> My test setup is at my office so that I will try your patch on Monday
> immediately and turn you the results.
No need to hurry, especially on Mondays, but thanks for the info.
Jarek P.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-10 10:32 ` Jarek Poplawski
@ 2010-10-15 8:58 ` Jarek Poplawski
2010-10-15 23:14 ` emin ak
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-15 8:58 UTC (permalink / raw)
To: emin ak
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
On 2010-10-10 12:32, Jarek Poplawski wrote:
> On Sat, Oct 09, 2010 at 03:10:25PM +0300, emin ak wrote:
>> Hi Jarek,
>> My test setup is at my office so that I will try your patch on Monday
>> immediately and turn you the results.
>
> No need to hurry, especially on Mondays, but thanks for the info.
Hi Emin,
Are there any problems with this testing?
Jarek P.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-15 8:58 ` Jarek Poplawski
@ 2010-10-15 23:14 ` emin ak
2010-10-16 19:48 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: emin ak @ 2010-10-15 23:14 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
> Hi Emin,
> Are there any problems with this testing?
>
> Jarek P.
>
Hi Jarek.
Sorry for delayed answer. As I promised, I have started the tests on
Monday (in spite of dealing with The Monday Syndrome:) I had applied
your patch and after two billions packet (and approximatly four hours)
passed kernel crashed with skb_over_panic error similar with first
type. To ensure the patch is failed or not, I rerun the same test
again. That time, surprisingly, it did'nt crashed again for two days
with same kernel. But this situation had occured before, I think
sometimes because of randomness of the applied ethernet traffic and
mostly because I cant apply all full line rate random traffic to my
target device because of wrong test setup (switch / hardware packet
generator settings etc..), it takes three or more day to crash the
kernel and sometimes it never crashes. My device only and only crashes
when I can apply full line rate random traffic. Before informing you
and the list with (official) test results, I want to be sure with the
truth of them. So that please let me apply more test to the target
device for a fews day more. After that I wish I'll came with good
news!
Thanks for your time and care alot.
Emin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-15 23:14 ` emin ak
@ 2010-10-16 19:48 ` Jarek Poplawski
2010-10-19 6:44 ` emin ak
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-16 19:48 UTC (permalink / raw)
To: emin ak
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
On Sat, Oct 16, 2010 at 02:14:01AM +0300, emin ak wrote:
> Hi Jarek.
> Sorry for delayed answer. As I promised, I have started the tests on
> Monday (in spite of dealing with The Monday Syndrome:) I had applied
> your patch and after two billions packet (and approximatly four hours)
> passed kernel crashed with skb_over_panic error similar with first
> type. To ensure the patch is failed or not, I rerun the same test
> again. That time, surprisingly, it did'nt crashed again for two days
> with same kernel. But this situation had occured before, I think
> sometimes because of randomness of the applied ethernet traffic and
> mostly because I cant apply all full line rate random traffic to my
> target device because of wrong test setup (switch / hardware packet
> generator settings etc..), it takes three or more day to crash the
> kernel and sometimes it never crashes. My device only and only crashes
> when I can apply full line rate random traffic. Before informing you
> and the list with (official) test results, I want to be sure with the
> truth of them. So that please let me apply more test to the target
> device for a fews day more. After that I wish I'll came with good
> news!
Hi Emin,
Sorry for being impatient. Actually, waiting is no problem for me. I
simply didn't know how much this bug is reproducible, and was a bit
mislead by your forecast of one day result (and terrified btw seeing
words Monday and immediately side by side ;-) So, a few days or weeks,
no problem, it's all up to you.
On the other hand, it looks like there might be something else/more,
so I'd suggest to add this last skb_over_panic to bugzilla too.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic
2010-10-16 19:48 ` Jarek Poplawski
@ 2010-10-19 6:44 ` emin ak
2010-10-19 10:06 ` [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic) Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: emin ak @ 2010-10-19 6:44 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov
Hi Jarek;
After 5 days and more then 20 billion packets passed without crash, it
seems that this patch is working for me, at least for crash type 2.
(For type 1, it only occured once and I can never reproduce this
again, but still trying. I think with this patch is also lowers the
risk for type 1.
For adding a new bug entry for skb_over_panic, before that I think I
must find a reliable way to make this type of crash reproducable,
otherwise I don't know how to test it if it solved or not.
Lastly, thanks a lot for your valuable help to overcome this problem
and also is there anything that I can do for testing / commiting this
patch to mainline?
Thanks
Emin
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-19 6:44 ` emin ak
@ 2010-10-19 10:06 ` Jarek Poplawski
2010-10-22 5:42 ` emin ak
2010-10-22 6:11 ` Eric Dumazet
0 siblings, 2 replies; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-19 10:06 UTC (permalink / raw)
To: David Miller
Cc: emin ak, Andrew Morton, netdev, bugzilla-daemon, bugme-daemon,
Anton Vorontsov, Andy Fleming
On Tue, Oct 19, 2010 at 09:44:33AM +0300, emin ak wrote:
> Hi Jarek;
> After 5 days and more then 20 billion packets passed without crash, it
> seems that this patch is working for me, at least for crash type 2.
> (For type 1, it only occured once and I can never reproduce this
> again, but still trying. I think with this patch is also lowers the
> risk for type 1.
It would be interesting to have a look if it's exactly type 1, because
skb_over_panic can happen for different reasons, e.g. like here:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=63b88b9041ceef8217f34de71a2e96f0c3f0fd3b
> For adding a new bug entry for skb_over_panic, before that I think I
> must find a reliable way to make this type of crash reproducable,
> otherwise I don't know how to test it if it solved or not.
Maybe for now let's try to get and see this type 1 again? Since the
recycle path is suspicious a bit to me, probably limiting memory or
slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
load might help.
> Lastly, thanks a lot for your valuable help to overcome this problem
> and also is there anything that I can do for testing / commiting this
> patch to mainline?
Here it is for David to handle the rest.
Thanks a lot for such an intense testing,
Jarek P.
--------------------------->
The rx_recycle queue is global per device but can be accesed by many
napi handlers at the same time, so it needs full skb_queue primitives
(with locking). Otherwise, various crashes caused by broken skbs are
possible.
This patch resolves, at least partly, bugzilla bug 19692. (Because of
some doubts that there could be still something around which is hard
to reproduce my proposal is to leave this bug opened for a month.)
Fixes commit: 0fd56bb5be6455d0d42241e65aed057244665e5e
Reported-by: emin ak <eminak71@gmail.com>
Tested-by: emin ak <eminak71@gmail.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
CC: Andy Fleming <afleming@freescale.com>
---
diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
index 4f7c3f3..db47b55 100644
--- a/drivers/net/gianfar.c
+++ b/drivers/net/gianfar.c
@@ -2515,7 +2515,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
skb_recycle_check(skb, priv->rx_buffer_size +
RXBUF_ALIGNMENT)) {
gfar_align_skb(skb);
- __skb_queue_head(&priv->rx_recycle, skb);
+ skb_queue_head(&priv->rx_recycle, skb);
} else
dev_kfree_skb_any(skb);
@@ -2598,7 +2598,7 @@ struct sk_buff * gfar_new_skb(struct net_device *dev)
struct gfar_private *priv = netdev_priv(dev);
struct sk_buff *skb = NULL;
- skb = __skb_dequeue(&priv->rx_recycle);
+ skb = skb_dequeue(&priv->rx_recycle);
if (!skb)
skb = gfar_alloc_skb(dev);
@@ -2754,7 +2754,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
if (unlikely(!newskb))
newskb = skb;
else if (skb)
- __skb_queue_head(&priv->rx_recycle, skb);
+ skb_queue_head(&priv->rx_recycle, skb);
} else {
/* Increment the number of packets */
rx_queue->stats.rx_packets++;
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-19 10:06 ` [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic) Jarek Poplawski
@ 2010-10-22 5:42 ` emin ak
2010-10-22 6:14 ` Eric Dumazet
2010-10-22 6:11 ` Eric Dumazet
1 sibling, 1 reply; 18+ messages in thread
From: emin ak @ 2010-10-22 5:42 UTC (permalink / raw)
To: Jarek Poplawski
Cc: David Miller, Andrew Morton, netdev, bugzilla-daemon,
bugme-daemon, Anton Vorontsov, Andy Fleming
Hi Jarek;
> Maybe for now let's try to get and see this type 1 again? Since the
> recycle path is suspicious a bit to me, probably limiting memory or
> slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
> load might help.
>
I'll do my best with your recommended test conditions. I have reserved
a machine and two ports of hw packet generator, now we'll see if this
error will occur again or not. (But I'am not sure if I want to see it
again:)
Thanks
Emin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-19 10:06 ` [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic) Jarek Poplawski
2010-10-22 5:42 ` emin ak
@ 2010-10-22 6:11 ` Eric Dumazet
2010-10-22 6:52 ` Jarek Poplawski
1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2010-10-22 6:11 UTC (permalink / raw)
To: Jarek Poplawski
Cc: David Miller, emin ak, Andrew Morton, netdev, bugzilla-daemon,
bugme-daemon, Anton Vorontsov, Andy Fleming
Le mardi 19 octobre 2010 à 10:06 +0000, Jarek Poplawski a écrit :
> On Tue, Oct 19, 2010 at 09:44:33AM +0300, emin ak wrote:
> > Hi Jarek;
> > After 5 days and more then 20 billion packets passed without crash, it
> > seems that this patch is working for me, at least for crash type 2.
> > (For type 1, it only occured once and I can never reproduce this
> > again, but still trying. I think with this patch is also lowers the
> > risk for type 1.
>
> It would be interesting to have a look if it's exactly type 1, because
> skb_over_panic can happen for different reasons, e.g. like here:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=63b88b9041ceef8217f34de71a2e96f0c3f0fd3b
>
> > For adding a new bug entry for skb_over_panic, before that I think I
> > must find a reliable way to make this type of crash reproducable,
> > otherwise I don't know how to test it if it solved or not.
>
> Maybe for now let's try to get and see this type 1 again? Since the
> recycle path is suspicious a bit to me, probably limiting memory or
> slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
> load might help.
>
> > Lastly, thanks a lot for your valuable help to overcome this problem
> > and also is there anything that I can do for testing / commiting this
> > patch to mainline?
>
> Here it is for David to handle the rest.
>
> Thanks a lot for such an intense testing,
> Jarek P.
> --------------------------->
>
> The rx_recycle queue is global per device but can be accesed by many
> napi handlers at the same time, so it needs full skb_queue primitives
> (with locking). Otherwise, various crashes caused by broken skbs are
> possible.
>
> This patch resolves, at least partly, bugzilla bug 19692. (Because of
> some doubts that there could be still something around which is hard
> to reproduce my proposal is to leave this bug opened for a month.)
>
> Fixes commit: 0fd56bb5be6455d0d42241e65aed057244665e5e
>
> Reported-by: emin ak <eminak71@gmail.com>
> Tested-by: emin ak <eminak71@gmail.com>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> CC: Andy Fleming <afleming@freescale.com>
> ---
> diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
> index 4f7c3f3..db47b55 100644
> --- a/drivers/net/gianfar.c
> +++ b/drivers/net/gianfar.c
> @@ -2515,7 +2515,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
> skb_recycle_check(skb, priv->rx_buffer_size +
> RXBUF_ALIGNMENT)) {
> gfar_align_skb(skb);
> - __skb_queue_head(&priv->rx_recycle, skb);
> + skb_queue_head(&priv->rx_recycle, skb);
> } else
> dev_kfree_skb_any(skb);
>
> @@ -2598,7 +2598,7 @@ struct sk_buff * gfar_new_skb(struct net_device *dev)
> struct gfar_private *priv = netdev_priv(dev);
> struct sk_buff *skb = NULL;
>
> - skb = __skb_dequeue(&priv->rx_recycle);
> + skb = skb_dequeue(&priv->rx_recycle);
> if (!skb)
> skb = gfar_alloc_skb(dev);
>
> @@ -2754,7 +2754,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
> if (unlikely(!newskb))
> newskb = skb;
> else if (skb)
> - __skb_queue_head(&priv->rx_recycle, skb);
> + skb_queue_head(&priv->rx_recycle, skb);
> } else {
> /* Increment the number of packets */
> rx_queue->stats.rx_packets++;
Are you sure its needed at all ?
Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
If not, there would be more bugs than only rx_recycle thing
vi +2822 drivers/net/gianfar.c
for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_rx_queues) {
if (test_bit(i, &serviced_queues))
continue;
rx_queue = priv->rx_queue[i];
tx_queue = priv->tx_queue[rx_queue->qindex];
tx_cleaned += gfar_clean_tx_ring(tx_queue);
rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
budget_per_queue);
rx_cleaned += rx_cleaned_per_queue;
if(rx_cleaned_per_queue < budget_per_queue) {
left_over_budget = left_over_budget +
(budget_per_queue - rx_cleaned_per_queue);
set_bit(i, &serviced_queues);
num_queues--;
}
}
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-22 5:42 ` emin ak
@ 2010-10-22 6:14 ` Eric Dumazet
2010-10-22 7:03 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2010-10-22 6:14 UTC (permalink / raw)
To: emin ak
Cc: Jarek Poplawski, David Miller, Andrew Morton, netdev,
bugzilla-daemon, bugme-daemon, Anton Vorontsov, Andy Fleming
Le vendredi 22 octobre 2010 à 08:42 +0300, emin ak a écrit :
> Hi Jarek;
>
> > Maybe for now let's try to get and see this type 1 again? Since the
> > recycle path is suspicious a bit to me, probably limiting memory or
> > slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
> > load might help.
> >
>
> I'll do my best with your recommended test conditions. I have reserved
> a machine and two ports of hw packet generator, now we'll see if this
> error will occur again or not. (But I'am not sure if I want to see it
> again:)
Really this rx recycle affair is more than suspicious to me too.
Is it really worth the pain ?
Are MTU changes handled correctly ? ( must flush rx_recycle queue...)
Using a single rx_recycle queue in a supposed multiqueue driver makes
absolutely _no_ sense to me. This adds an artificial contention point.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-22 6:11 ` Eric Dumazet
@ 2010-10-22 6:52 ` Jarek Poplawski
2010-10-22 8:52 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-22 6:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, emin ak, Andrew Morton, netdev, bugzilla-daemon,
bugme-daemon, Anton Vorontsov, Andy Fleming
On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote:
> Le mardi 19 octobre 2010 ?? 10:06 +0000, Jarek Poplawski a écrit :
> > On Tue, Oct 19, 2010 at 09:44:33AM +0300, emin ak wrote:
> > > Hi Jarek;
> > > After 5 days and more then 20 billion packets passed without crash, it
> > > seems that this patch is working for me, at least for crash type 2.
> > > (For type 1, it only occured once and I can never reproduce this
> > > again, but still trying. I think with this patch is also lowers the
> > > risk for type 1.
> >
> > It would be interesting to have a look if it's exactly type 1, because
> > skb_over_panic can happen for different reasons, e.g. like here:
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=63b88b9041ceef8217f34de71a2e96f0c3f0fd3b
> >
> > > For adding a new bug entry for skb_over_panic, before that I think I
> > > must find a reliable way to make this type of crash reproducable,
> > > otherwise I don't know how to test it if it solved or not.
> >
> > Maybe for now let's try to get and see this type 1 again? Since the
> > recycle path is suspicious a bit to me, probably limiting memory or
> > slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
> > load might help.
> >
> > > Lastly, thanks a lot for your valuable help to overcome this problem
> > > and also is there anything that I can do for testing / commiting this
> > > patch to mainline?
> >
> > Here it is for David to handle the rest.
> >
> > Thanks a lot for such an intense testing,
> > Jarek P.
> > --------------------------->
> >
> > The rx_recycle queue is global per device but can be accesed by many
> > napi handlers at the same time, so it needs full skb_queue primitives
> > (with locking). Otherwise, various crashes caused by broken skbs are
> > possible.
> >
> > This patch resolves, at least partly, bugzilla bug 19692. (Because of
> > some doubts that there could be still something around which is hard
> > to reproduce my proposal is to leave this bug opened for a month.)
> >
> > Fixes commit: 0fd56bb5be6455d0d42241e65aed057244665e5e
> >
> > Reported-by: emin ak <eminak71@gmail.com>
> > Tested-by: emin ak <eminak71@gmail.com>
> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> > CC: Andy Fleming <afleming@freescale.com>
> > ---
> > diff --git a/drivers/net/gianfar.c b/drivers/net/gianfar.c
> > index 4f7c3f3..db47b55 100644
> > --- a/drivers/net/gianfar.c
> > +++ b/drivers/net/gianfar.c
> > @@ -2515,7 +2515,7 @@ static int gfar_clean_tx_ring(struct gfar_priv_tx_q *tx_queue)
> > skb_recycle_check(skb, priv->rx_buffer_size +
> > RXBUF_ALIGNMENT)) {
> > gfar_align_skb(skb);
> > - __skb_queue_head(&priv->rx_recycle, skb);
> > + skb_queue_head(&priv->rx_recycle, skb);
> > } else
> > dev_kfree_skb_any(skb);
> >
> > @@ -2598,7 +2598,7 @@ struct sk_buff * gfar_new_skb(struct net_device *dev)
> > struct gfar_private *priv = netdev_priv(dev);
> > struct sk_buff *skb = NULL;
> >
> > - skb = __skb_dequeue(&priv->rx_recycle);
> > + skb = skb_dequeue(&priv->rx_recycle);
> > if (!skb)
> > skb = gfar_alloc_skb(dev);
> >
> > @@ -2754,7 +2754,7 @@ int gfar_clean_rx_ring(struct gfar_priv_rx_q *rx_queue, int rx_work_limit)
> > if (unlikely(!newskb))
> > newskb = skb;
> > else if (skb)
> > - __skb_queue_head(&priv->rx_recycle, skb);
> > + skb_queue_head(&priv->rx_recycle, skb);
> > } else {
> > /* Increment the number of packets */
> > rx_queue->stats.rx_packets++;
>
> Are you sure its needed at all ?
Yes, after Emin's testing I'm quite sure this fix is needed.
>
> Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
> and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
>
> If not, there would be more bugs than only rx_recycle thing
I didn't find what prevents running gfar_poll on many cpus and don't
claim there is no more bugs around.
Jarek P.
>
> vi +2822 drivers/net/gianfar.c
>
> for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_rx_queues) {
> if (test_bit(i, &serviced_queues))
> continue;
> rx_queue = priv->rx_queue[i];
> tx_queue = priv->tx_queue[rx_queue->qindex];
>
> tx_cleaned += gfar_clean_tx_ring(tx_queue);
> rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
> budget_per_queue);
> rx_cleaned += rx_cleaned_per_queue;
> if(rx_cleaned_per_queue < budget_per_queue) {
> left_over_budget = left_over_budget +
> (budget_per_queue - rx_cleaned_per_queue);
> set_bit(i, &serviced_queues);
> num_queues--;
> }
> }
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-22 6:14 ` Eric Dumazet
@ 2010-10-22 7:03 ` Jarek Poplawski
0 siblings, 0 replies; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-22 7:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: emin ak, David Miller, Andrew Morton, netdev, bugzilla-daemon,
bugme-daemon, Anton Vorontsov, Andy Fleming
On Fri, Oct 22, 2010 at 08:14:08AM +0200, Eric Dumazet wrote:
> Le vendredi 22 octobre 2010 ?? 08:42 +0300, emin ak a écrit :
> > Hi Jarek;
> >
> > > Maybe for now let's try to get and see this type 1 again? Since the
> > > recycle path is suspicious a bit to me, probably limiting memory or
> > > slowing tx (maybe different mtus on eth0 and 1) under heavy multi cpu
> > > load might help.
> > >
> >
> > I'll do my best with your recommended test conditions. I have reserved
> > a machine and two ports of hw packet generator, now we'll see if this
> > error will occur again or not. (But I'am not sure if I want to see it
> > again:)
>
> Really this rx recycle affair is more than suspicious to me too.
>
> Is it really worth the pain ?
>
> Are MTU changes handled correctly ? ( must flush rx_recycle queue...)
>
> Using a single rx_recycle queue in a supposed multiqueue driver makes
> absolutely _no_ sense to me. This adds an artificial contention point.
>
Yes, the design/need of this rx_recycle queue should be reconsidered.
But we don't know if the (other?) bug seen by Emin was connected, so
it would be better to try to reproduce it without too many changes.
Jarek P.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic)
2010-10-22 6:52 ` Jarek Poplawski
@ 2010-10-22 8:52 ` Jarek Poplawski
2010-10-26 17:42 ` [PATCH] gianfar: Fix crashes on RX path David Miller
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-22 8:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, emin ak, Andrew Morton, netdev, bugzilla-daemon,
bugme-daemon, Anton Vorontsov, Andy Fleming
On Fri, Oct 22, 2010 at 06:52:31AM +0000, Jarek Poplawski wrote:
> On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote:
...
> > Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
> > and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
> >
> > If not, there would be more bugs than only rx_recycle thing
>
> I didn't find what prevents running gfar_poll on many cpus and don't
> claim there is no more bugs around.
On the other hand, I don't see your point in the code below either.
These're only per gfargrp queues - not per device, aren't they?
Jarek P.
>
> >
> > vi +2822 drivers/net/gianfar.c
> >
> > for_each_set_bit(i, &gfargrp->rx_bit_map, priv->num_rx_queues) {
> > if (test_bit(i, &serviced_queues))
> > continue;
> > rx_queue = priv->rx_queue[i];
> > tx_queue = priv->tx_queue[rx_queue->qindex];
> >
> > tx_cleaned += gfar_clean_tx_ring(tx_queue);
> > rx_cleaned_per_queue = gfar_clean_rx_ring(rx_queue,
> > budget_per_queue);
> > rx_cleaned += rx_cleaned_per_queue;
> > if(rx_cleaned_per_queue < budget_per_queue) {
> > left_over_budget = left_over_budget +
> > (budget_per_queue - rx_cleaned_per_queue);
> > set_bit(i, &serviced_queues);
> > num_queues--;
> > }
> > }
> >
> >
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path
2010-10-22 8:52 ` Jarek Poplawski
@ 2010-10-26 17:42 ` David Miller
2010-10-26 21:20 ` Jarek Poplawski
0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2010-10-26 17:42 UTC (permalink / raw)
To: jarkao2
Cc: eric.dumazet, eminak71, akpm, netdev, bugzilla-daemon,
bugme-daemon, avorontsov, afleming
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 22 Oct 2010 08:52:48 +0000
> On Fri, Oct 22, 2010 at 06:52:31AM +0000, Jarek Poplawski wrote:
>> On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote:
> ...
>> > Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
>> > and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
>> >
>> > If not, there would be more bugs than only rx_recycle thing
>>
>> I didn't find what prevents running gfar_poll on many cpus and don't
>> claim there is no more bugs around.
>
> On the other hand, I don't see your point in the code below either.
> These're only per gfargrp queues - not per device, aren't they?
I am still not at the point where I feel confortable applying this bug
fix, in fact I am very far from that.
None of the logic is consistent in what we are saying causes the
problem.
Anything that would make the RX recycling code racy and corrupt
recycling queue of the gianfar driver, would also corrupt all of the
other RX side and other driver state.
The NAPI state is unary for gianfar, and inside of that singular
->poll() instance it iterates over the queues.
The only asynchronous path could be netpoll, but again that would
break lots of other things.
I want to be shown a code path that results in the recycle SKB
queue getting accessed in parallel on two cpus without protection
so we can understand what it is that we are fixing.
On another note, I also agree that removing this RX recycling crud
wouldn't be a bad idea. In addition to the MTU changing concerns Eric
Dumazet brought up, there are many other (less broken) ways to achieve
whatever performance gains recycling gives these devices.
Thanks.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path
2010-10-26 17:42 ` [PATCH] gianfar: Fix crashes on RX path David Miller
@ 2010-10-26 21:20 ` Jarek Poplawski
2010-10-26 21:23 ` David Miller
0 siblings, 1 reply; 18+ messages in thread
From: Jarek Poplawski @ 2010-10-26 21:20 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, eminak71, akpm, netdev, bugzilla-daemon,
bugme-daemon, avorontsov, afleming
On Tue, Oct 26, 2010 at 10:42:57AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Fri, 22 Oct 2010 08:52:48 +0000
>
> > On Fri, Oct 22, 2010 at 06:52:31AM +0000, Jarek Poplawski wrote:
> >> On Fri, Oct 22, 2010 at 08:11:57AM +0200, Eric Dumazet wrote:
> > ...
> >> > Gianfar claims to be multiqueue, but only one cpu can run gfar_poll()
> >> > and call gfar_clean_tx_ring() / gfar_clean_rx_ring()
> >> >
> >> > If not, there would be more bugs than only rx_recycle thing
> >>
> >> I didn't find what prevents running gfar_poll on many cpus and don't
> >> claim there is no more bugs around.
> >
> > On the other hand, I don't see your point in the code below either.
> > These're only per gfargrp queues - not per device, aren't they?
>
> I am still not at the point where I feel confortable applying this bug
> fix, in fact I am very far from that.
>
> None of the logic is consistent in what we are saying causes the
> problem.
>
> Anything that would make the RX recycling code racy and corrupt
> recycling queue of the gianfar driver, would also corrupt all of the
> other RX side and other driver state.
>
> The NAPI state is unary for gianfar, and inside of that singular
> ->poll() instance it iterates over the queues.
IMHO, the NAPI state is unary only for gfargrp, and multiple ->poll()
instances share a (unary) rx_recycle queue without proper locking.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH] gianfar: Fix crashes on RX path
2010-10-26 21:20 ` Jarek Poplawski
@ 2010-10-26 21:23 ` David Miller
0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2010-10-26 21:23 UTC (permalink / raw)
To: jarkao2
Cc: eric.dumazet, eminak71, akpm, netdev, bugzilla-daemon,
bugme-daemon, avorontsov, afleming
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 26 Oct 2010 23:20:42 +0200
> IMHO, the NAPI state is unary only for gfargrp, and multiple ->poll()
> instances share a (unary) rx_recycle queue without proper locking.
Ok, I see it now, thank you.
For now I'll apply your patch, but long term I think we should just
eradicate the recycling code from the entire tree.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2010-10-26 21:31 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <bug-19692-10286@https.bugzilla.kernel.org/>
2010-10-04 20:53 ` [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic Andrew Morton
2010-10-08 9:24 ` Jarek Poplawski
2010-10-09 12:10 ` emin ak
2010-10-10 10:32 ` Jarek Poplawski
2010-10-15 8:58 ` Jarek Poplawski
2010-10-15 23:14 ` emin ak
2010-10-16 19:48 ` Jarek Poplawski
2010-10-19 6:44 ` emin ak
2010-10-19 10:06 ` [PATCH] gianfar: Fix crashes on RX path (Was Re: [Bugme-new] [Bug 19692] New: linux-2.6.36-rc5 crash with gianfar ethernet at full line rate traffic) Jarek Poplawski
2010-10-22 5:42 ` emin ak
2010-10-22 6:14 ` Eric Dumazet
2010-10-22 7:03 ` Jarek Poplawski
2010-10-22 6:11 ` Eric Dumazet
2010-10-22 6:52 ` Jarek Poplawski
2010-10-22 8:52 ` Jarek Poplawski
2010-10-26 17:42 ` [PATCH] gianfar: Fix crashes on RX path David Miller
2010-10-26 21:20 ` Jarek Poplawski
2010-10-26 21:23 ` David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).