* [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: Krishna Kumar @ 2007-09-19 11:54 UTC (permalink / raw)
To: davem; +Cc: netdev, rdreier, general, Krishna Kumar
Hi Dave,
After applying Roland's NAPI patch, the system panics when I run multi-threaded
iperf (no stack trace at this time; it shows that the panic is in
net_tx_action).
I think the problem is:
In the "done < budget" case, ipoib_poll calls netif_rx_complete()
netif_rx_complete()
__netif_rx_complete()
__napi_complete()
list_del()
__list_del()
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
Due to a race with another completion (explained below, after the patch),
net_rx_action finds quota == 0 (even though done < budget earlier):

    net_rx_action()
        if (unlikely(!n->quota)) {
            n->quota = n->weight;
            list_move_tail()
                __list_del(POISON, POISON)
                <PANIC>
        }
whereas IPoIB calling netif_rx_reschedule() works fine, because:

    netif_rx_reschedule()
        __netif_rx_schedule()
            __napi_schedule()
                list_add_tail()    (this is OK)
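For context, the quota-refill branch of net_rx_action() that trips over the
poisoned pointers looks roughly like this in the NAPI series under discussion
(a paraphrased sketch, not the exact tree; field names as used above):

    /* Paraphrased from the NAPI rework being discussed (not exact upstream
     * code).  If another CPU has already driven n->quota to zero, this
     * branch runs even though __napi_complete() has already list_del()'ed
     * n->poll_list, so list_move_tail() dereferences LIST_POISON1/2.
     */
    if (unlikely(!n->quota)) {
            n->quota = n->weight;
            list_move_tail(&n->poll_list, list);  /* poll_list is poisoned here */
            continue;
    }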
Patch that fixes this:
diff -ruNp a/include/linux/netdevice.h b/include/linux/netdevice.h
--- a/include/linux/netdevice.h 2007-09-19 16:50:35.000000000 +0530
+++ b/include/linux/netdevice.h 2007-09-19 16:51:28.000000000 +0530
@@ -346,7 +346,7 @@ static inline void napi_schedule(struct
static inline void __napi_complete(struct napi_struct *n)
{
BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
- list_del(&n->poll_list);
+ __list_del(&n->poll_list);
smp_mb__before_clear_bit();
clear_bit(NAPI_STATE_SCHED, &n->state);
}
When I apply this patch, things work fine, but I get the napi_check_quota_bug()
warning. The race seems to happen as follows:
CPU#1: ipoib_poll(budget=100)
{
A. process 100 skbs
B. netif_rx_complete()
<Process unrelated interrupts; executes slower than steps C, D, E on
CPU#2>
F. ib_req_notify_cq() (no missed completions, do nothing)
G. return 100
H. return to net_rx_action, quota=99, subtract 100, quota=-1, BUG.
}
CPU#2: ipoib_ib_completion() : (starts and finishes entire line of execution
*after* step B and *before* H executes).
{
C. New skb comes, call netif_rx_schedule; set quota=100
D. do ipoib_poll(), process one skb, return work=1 to net_rx_action
E. net_rx_action: set quota=99
}
The reason both CPUs can execute poll simultaneously is that
netpoll_poll_lock() returns NULL (dev->npinfo == NULL). This results in a
negative napi refcount and the warning. I verified this is the reason by
saving the original quota before calling poll (in net_rx_action) and comparing
it with the final value after poll returns (before it gets updated); it
changes very often in multi-threaded testing (at least 4 threads, haven't seen
it with 2). In most cases the quota becomes -1, and I have seen up to -9, but
those are rarer.
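The check was along these lines (a hypothetical illustration of the
instrumentation described above, not a posted patch):

    /* In net_rx_action(), around the ->poll() call: remember the quota this
     * CPU saw and warn if it changed underneath us, which can only happen if
     * another CPU ran poll on the same napi_struct concurrently.
     */
    int saved_quota = n->quota;

    work = n->poll(n, budget);

    if (n->quota != saved_quota)
            printk(KERN_WARNING "napi %p: quota changed during poll (%d -> %d)\n",
                   n, saved_quota, n->quota);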
Note: during steps F-H and C-E, priv/napi is read/modified by both CPUs,
which is another bug related to the same race.
I guess the above patch is not required if this bug (in IPoIB) is fixed?
Roland, why can't we get rid of "poll_more"? We will get called again after
netif_rx_reschedule(), and it is cleaner to let the new execution handle
fresh completions. Is there a reason why this goto is required?
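For readers without the patch handy, the construct in question is roughly the
following (paraphrased from the IPoIB NAPI conversion; argument lists and flag
names are from memory and may differ slightly from Roland's actual patch):

    int ipoib_poll(struct napi_struct *napi, int budget)
    {
            struct ipoib_dev_priv *priv =
                    container_of(napi, struct ipoib_dev_priv, napi);
            struct net_device *dev = priv->dev;
            int done = 0;

    poll_more:
            /* ... poll the CQ, processing up to (budget - done) completions ... */

            if (done < budget) {
                    netif_rx_complete(dev, napi);
                    /* Re-arm the CQ; if completions were missed, try to
                     * re-acquire the NAPI context and keep polling. */
                    if (unlikely(ib_req_notify_cq(priv->cq,
                                                  IB_CQ_NEXT_COMP |
                                                  IB_CQ_REPORT_MISSED_EVENTS)) &&
                        netif_rx_reschedule(dev, napi))
                            goto poll_more;
            }

            return done;
    }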
Thanks,
- KK
* Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: Jan-Bernd Themann @ 2007-09-19 13:23 UTC (permalink / raw)
To: Krishna Kumar; +Cc: davem, netdev, rdreier, general, Christoph Raisch
Hi,
On Wednesday 19 September 2007 13:54, Krishna Kumar wrote:
> CPU#1: ipoib_poll(budget=100)
> {
> A. process 100 skbs
> B. netif_rx_complete()
> <Process unrelated interrupts; executes slower than steps C, D, E on
> CPU#2>
> F. ib_req_notify_cq() (no missed completions, do nothing)
> G. return 100
> H. return to net_rx_action, quota=99, subtract 100, quota=-1, BUG.
> }
>
> CPU#2: ipoib_ib_completion() : (starts and finishes entire line of execution
> *after* step B and *before* H executes).
> {
> C. New skb comes, call netif_rx_schedule; set quota=100
> D. do ipoib_poll(), process one skb, return work=1 to net_rx_action
> E. net_rx_action: set quota=99
> }
If I understood it right, the problem you describe (quota update in
__napi_schedule) can cause further problems when you choose the following
numbers:

CPU1: A. process 99 pkts
CPU1: B. netif_rx_complete()
CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered:
CPU2: C. set quota = 100 (__napi_schedule)
CPU2: D. call poll(), process 1 pkt
CPU2: D.2 call netif_rx_complete() (quota not exceeded)
CPU2: E. net_rx_action: set quota = 99
CPU1: F. net_rx_action: set quota = 99 - 99 = 0
CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called
Step G would fail as the device is not in the list due
to netif_rx_complete. This case can occur for all
devices running on an SMP system where interrupts are not pinned.
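Written out against the refill branch sketched earlier in the thread (again a
paraphrase; the real net_rx_action of that series may order things slightly
differently):

    /* CPU1 polled 99 packets (work = 99) and called netif_rx_complete();
     * CPU2 meanwhile rescheduled the device, refilled quota to 100, polled
     * one packet and completed again.  Back on CPU1 in net_rx_action():
     */
    n->quota -= work;                       /* 99 - 99 = 0 */
    if (unlikely(!n->quota)) {              /* taken, although the entry was  */
            n->quota = n->weight;           /* already deleted from the list  */
            list_move_tail(&n->poll_list, list);    /* step G */
    }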
Regards,
Jan-Bernd
* Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: Krishna Kumar2 @ 2007-09-20 5:10 UTC (permalink / raw)
To: Jan-Bernd Themann; +Cc: davem, general, netdev, Christoph Raisch, rdreier
Hi Jan-Bernd,
Jan-Bernd Themann <ossthema@de.ibm.com> wrote on 09/19/2007 06:53:48 PM:
> If I understood it right, the problem you describe (quota update in
> __napi_schedule) can cause further problems when you choose the following
> numbers:
>
> CPU1: A. process 99 pkts
> CPU1: B. netif_rx_complete()
> CPU2: interrupt occurs, netif_rx_schedule is called, net_rx_action triggered:
> CPU2: C. set quota = 100 (__napi_schedule)
> CPU2: D. call poll(), process 1 pkt
> CPU2: D.2 call netif_rx_complete() (quota not exceeded)
> CPU2: E. net_rx_action: set quota = 99
> CPU1: F. net_rx_action: set quota = 99 - 99 = 0
> CPU1: G. modify list (list_move_tail) although netif_rx_complete has been called
>
> Step G would fail as the device is not in the list due
> to netif_rx_complete. This case can occur for all
> devices running on an SMP system where interrupts are not pinned.
I think list_move should be OK whether the device is on the list or not.
But it should not come to that code since work (99) != weight (100).
If work == weight, then the driver would not have done the complete, and
next/prev would not be set to POISON.
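For reference, list_move_tail() unlinks the entry via its current prev/next
pointers, so the only thing that matters is whether an earlier list_del()
poisoned them (paraphrased from include/linux/list.h, with a comment added):

    static inline void list_move_tail(struct list_head *list,
                                      struct list_head *head)
    {
            __list_del(list->prev, list->next); /* oopses on LIST_POISON1/2 */
            list_add_tail(list, head);
    }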
I like the clean changes made by Dave to fix this, and will test it
today (if I can get my crashed system to come up).
Thanks,
- KK
* Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: David Miller @ 2007-09-20 5:12 UTC (permalink / raw)
To: krkumar2; +Cc: ossthema, general, netdev, raisch, rdreier
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 20 Sep 2007 10:40:33 +0530
> I like the clean changes made by Dave to fix this, and will test it
> today (if I can get my crashed system to come up).
I would very much appreciate this testing, as I'm rather sure we've
plugged up the most serious holes at this point.
* [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: Krishna Kumar2 @ 2007-09-20 5:54 UTC (permalink / raw)
To: David Miller; +Cc: netdev, rdreier, ossthema, general
Ran 4/16/64-thread iperf on the latest bits with this patch and saw no issues
after 30 mins. I used to consistently get the bug within 1-2 mins with just 4
threads prior to this patch.
Tested-by: Krishna Kumar <krkumar2@in.ibm.com>
(if any value in that)
thanks,
- KK
David Miller <davem@davemloft.net> wrote on 09/20/2007 10:42:24 AM:
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Thu, 20 Sep 2007 10:40:33 +0530
>
> > I like the clean changes made by Dave to fix this, and will test it
> > today (if I can get my crashed system to come up).
>
> I would very much appreciate this testing, as I'm rather sure we've
> plugged up the most serious holes at this point.
* [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: David Miller @ 2007-09-20 18:12 UTC (permalink / raw)
To: krkumar2; +Cc: netdev, rdreier, ossthema, general
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 20 Sep 2007 11:24:01 +0530
> Ran 4/16/64-thread iperf on the latest bits with this patch and saw no issues
> after 30 mins. I used to consistently get the bug within 1-2 mins with just 4
> threads prior to this patch.
>
> Tested-by: Krishna Kumar <krkumar2@in.ibm.com>
> (if any value in that)
There is much value in that :-) Thanks a lot, Krishna.
* [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: David Miller @ 2007-09-19 16:05 UTC (permalink / raw)
To: krkumar2; +Cc: netdev, rdreier, general
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Wed, 19 Sep 2007 17:24:03 +0530
> Note: during steps F-H and C-E, priv/napi is read/modified by both CPUs,
> which is another bug related to the same race.
>
> I guess the above patch is not required if this bug (in IPoIB) is fixed?
The NAPI_STATE_SCHED flag bit should provide all of the necessary
synchronization.
Only the setter of that bit should add the NAPI instance to the
polling list.
The polling loop runs atomically on the cpu where the NAPI instance
got added to the per-cpu polling list. And therefore decisions to
complete NAPI are serialized too.
That serialized completion decision is also when the list deletion
occurs.
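In code terms, this is the test_and_set_bit() discipline around
NAPI_STATE_SCHED (a sketch of the pattern, not a literal copy of the net core
code):

    /* Whoever wins this test_and_set_bit() owns the NAPI context: only that
     * CPU adds the napi to its per-cpu poll list, only that CPU runs its
     * ->poll(), and only that CPU may later clear the bit via
     * __napi_complete().
     */
    if (!test_and_set_bit(NAPI_STATE_SCHED, &n->state))
            __napi_schedule(n);     /* list_add_tail onto this CPU's poll list */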
I'm starting to suspect the whole problem comes from the resched facility,
and now I really don't blame Stephen for trying to delete it. Semantically it
really makes things very difficult, especially w.r.t. the atomicity of the
list handling.
* Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: Krishna Kumar2 @ 2007-09-20 5:18 UTC (permalink / raw)
To: David Miller; +Cc: general, netdev, rdreier
Hi Dave,
David Miller <davem@davemloft.net> wrote on 09/19/2007 09:35:57 PM:
> The NAPI_STATE_SCHED flag bit should provide all of the necessary
> synchronization.
>
> Only the setter of that bit should add the NAPI instance to the
> polling list.
>
> The polling loop runs atomically on the cpu where the NAPI instance
> got added to the per-cpu polling list. And therefore decisions to
> complete NAPI are serialized too.
>
> That serialized completion decision is also when the list deletion
> occurs.
About the "list deletion occurs", isn't the race I mentioned still present?
If done < budget, the driver does netif_rx_complete (at which time some
other
cpu can add this NAPI to their list). But the first cpu might do some more
actions on the napi, like ipoib_poll() calls request_notify_cq(priv->cq),
when other cpu might have started using this napi.
(net_rx_action's 'list_move' however will not execute since work != weight)
Thanks,
- KK
* [ofa-general] Re: [Bug, PATCH and another Bug] Was: Fix refcounting problem with netif_rx_reschedule()
From: David Miller @ 2007-09-20 18:12 UTC (permalink / raw)
To: krkumar2; +Cc: netdev, rdreier, general
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Thu, 20 Sep 2007 10:48:15 +0530
> About the "list deletion occurs", isn't the race I mentioned still present?
> If done < budget, the driver does netif_rx_complete (at which time some
> other cpu can add this NAPI to their list). But the first cpu might do some more
> actions on the napi, like ipoib_poll() calls request_notify_cq(priv->cq),
> when other cpu might have started using this napi.
>
> (net_rx_action's 'list_move' however will not execute since work != weight)
It is the driver's responsibility to adhere to the fact that once
netif_rx_complete() is called, the driver is explicitly relinquishing
ownership of the NAPI state.
It therefore must not access that NAPI state until it has successfully
acquired the NAPI_STATE_SCHED bit atomically, via a sched or resched.
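Expressed as a driver-side pattern (a sketch of the rule, not IPoIB's literal
code):

            netif_rx_complete(dev, napi);   /* NAPI state ownership released */

            /* From here on another CPU may already be polling 'napi'; the
             * NAPI state must not be touched again unless NAPI_STATE_SCHED
             * is re-acquired atomically, e.g.:
             */
            if (netif_rx_reschedule(dev, napi))     /* wins the SCHED bit again */
                    goto poll_more;                 /* safe to resume polling   */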