* NAPI rescheduling and the delay caused by it
From: Zoltan Kiss @ 2013-12-04 18:55 UTC (permalink / raw)
To: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
Hi,
Recently I've found a strange behaviour in NAPI, and I hope someone can
shed some light on it. The basic issue I see is that an instance is
scheduled on one CPU and then gets rescheduled on another one. I don't
understand why, as there shouldn't be any load on that CPU, but that alone
would be fine; the real problem is that there is a 3-4 msec delay between
the original scheduling and the actual run, and it causes terrible
single-stream TCP performance. Let me explain it with some ftrace logs.
I'm using Xen 4.3, the Dom0 kernel is a net-next 3.12 one, the iperf client
guest (TX) has the same kernel, and the iperf server (RX) is a stock
Debian 7. vif2.0-10088 is the TX netback thread (it mostly has to deal
only with the TCP ACKs) and vif1.0-9854 is the RX thread (it copies
the packets to the receiving guest). I've added some trace_printk's to
get more details, and the ftrace filter is:
echo net_rx_action kthread_should_stop xenvif_rx_action
xenvif_start_xmit xenvif_tx_action > /debug/tracing/set_ftrace_filter
Another important thing: I set the affinity of the vif interrupts and
vif2.0 to CPU#3, while vif1.0 is forced to run on CPU#2. That's not
necessarily ideal, but it can happen, and it does; that's why I started
to poke around with affinities. I've tried to pin the vCPUs to make sure no
other domains are running on the same pCPU, but it hasn't changed anything.
Here are the logs and my inline comments:
<idle>-0 [003] d.h. 73994.027432: xenvif_interrupt: vif1.0 interrupt
<idle>-0 [003] d.h. 73994.027433: xenvif_tx_interrupt: vif1.0
tx_interrupt napi_schedule for tx_action
<idle>-0 [003] d.h. 73994.027434: xenvif_rx_interrupt: vif1.0
rx_interrupt wake start_xmit
Domain 1 sent an interrupt, we kicked the vif1.0 (TX) thread and woke
NAPI to deal with the sent packet.
<idle>-0 [003] ..s. 73994.027436: net_rx_action <-__do_softirq
<idle>-0 [003] ..s. 73994.027436: xenvif_tx_action <-xenvif_poll
<idle>-0 [003] ..s. 73994.027460: xenvif_start_xmit
<-dev_hard_start_xmit
<idle>-0 [003] .ps. 73994.027465: xenvif_start_xmit: vif2.0
packet_queued
The sent packet is now at vif2.0's queue.
vif2.0-10088 [003] .... 73994.027475: kthread_should_stop <-xenvif_kthread
vif2.0-10088 [003] .... 73994.027475: xenvif_rx_action <-xenvif_kthread
vif2.0-10088 [003] .... 73994.027483: kthread_should_stop <-xenvif_kthread
vif2.0-10088 [003] .... 73994.027483: kthread_should_stop <-xenvif_kthread
vif2.0-10088 [003] .... 73994.027484: kthread_should_stop <-xenvif_kthread
The vif2.0 thread has now delivered the packet to Domain 2; it was probably
a TCP ACK anyway. The pattern of three kthread_should_stop calls marks that
the thread called schedule().
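For reference, the loop shape that produces this pattern looks roughly like
the sketch below; it is not the exact xen-netback source, and example_vif,
example_work_todo and example_rx_action are illustrative names.
kthread_should_stop() shows up once as the while condition and again each
time wait_event_interruptible() evaluates its condition around the
schedule() inside, hence the burst of calls right before the thread goes
back to sleep.

static int example_kthread(void *data)
{
	struct example_vif *vif = data;		/* illustrative type */

	while (!kthread_should_stop()) {
		/* Sleep until there is RX work or we are asked to stop;
		 * the condition is re-evaluated before and after the
		 * schedule() inside wait_event_interruptible(). */
		wait_event_interruptible(vif->wq,
					 example_work_todo(vif) ||
					 kthread_should_stop());
		if (kthread_should_stop())
			break;

		example_rx_action(vif);		/* copy packets to the guest */
		cond_resched();
	}
	return 0;
}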
<idle>-0 [003] d.h. 73994.027543: xenvif_interrupt: vif2.0 interrupt
<idle>-0 [003] d.h. 73994.027544: xenvif_tx_interrupt: vif2.0
tx_interrupt napi_schedule for tx_action
<idle>-0 [003] d.h. 73994.027545: xenvif_rx_interrupt: vif2.0
rx_interrupt wake start_xmit
Domain 2's iperf client sent a big packet, and here is the issue: although
we scheduled NAPI on CPU#3, it will run on CPU#2 (which in itself would be
fine), but there is an almost 4 millisecond delay! Both threads are
sleeping, and there is no other load on the system which would justify such a delay.
<idle>-0 [002] ..s. 73994.031248: net_rx_action <-__do_softirq
<idle>-0 [002] ..s. 73994.031249: xenvif_tx_action <-xenvif_poll
<idle>-0 [002] ..s. 73994.031323: xenvif_start_xmit
<-dev_hard_start_xmit
<idle>-0 [002] .ps. 73994.031329: xenvif_start_xmit: vif1.0
vif1.0-9854 [002] .... 73994.031338: kthread_should_stop <-xenvif_kthread
vif1.0-9854 [002] .... 73994.031338: xenvif_rx_action <-xenvif_kthread
vif1.0-9854 [002] .... 73994.031391: kthread_should_stop <-xenvif_kthread
vif1.0-9854 [002] .... 73994.031391: kthread_should_stop <-xenvif_kthread
vif1.0-9854 [002] .... 73994.031392: kthread_should_stop <-xenvif_kthread
NAPI called the poll, and our thread woke up to serve the packet.
So, my questions are:
- why is NAPI rescheduled on another CPU?
- why does it cause a 3-4 millisecond delay?
* Re: NAPI rescheduling and the delay caused by it
From: Eric Dumazet @ 2013-12-04 20:41 UTC (permalink / raw)
To: Zoltan Kiss
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On Wed, 2013-12-04 at 18:55 +0000, Zoltan Kiss wrote:
> So, my questions are:
> - why is NAPI rescheduled on another CPU?
> - why does it cause a 3-4 millisecond delay?
NAPI can not be scheduled on another cpu.
But at the time of the napi_schedule() call, the napi_struct can already
be scheduled by another cpu (NAPI_STATE_SCHED bit already set).
So I would say something kept the 'other' cpu from responding fast enough
to softirq events being ready for service.
(Another wakeup happened 3-4 millisec later)
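For reference, napi_schedule() is roughly the following (a simplified
paraphrase of include/linux/netdevice.h / net/core/dev.c from this era,
details trimmed). If NAPI_STATE_SCHED is already set the call is a no-op,
and the instance stays on the poll_list of whichever cpu set the bit first,
which is why the poll appears to 'move' in your trace:

static inline bool napi_schedule_prep(struct napi_struct *n)
{
	return !napi_disable_pending(n) &&
	       !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
}

static inline void napi_schedule(struct napi_struct *n)
{
	if (napi_schedule_prep(n))
		__napi_schedule(n);	/* add n to *this* cpu's poll_list
					 * and raise NET_RX_SOFTIRQ here */
}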
Really, I suspect your usage of netif_wake_queue() is simply wrong.
Check why we have netif_rx() and netif_rx_ni() variants.
And ask yourself if xenvif_notify_tx_completion() is correct, being
called from process context.
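For comparison, netif_rx_ni() is roughly the following (again paraphrased
from net/core/dev.c); the _ni variant exists precisely because raising a
softirq from plain process context does not run it immediately:

int netif_rx_ni(struct sk_buff *skb)
{
	int err;

	preempt_disable();
	err = netif_rx(skb);		/* queue skb, raise NET_RX_SOFTIRQ */
	if (local_softirq_pending())
		do_softirq();		/* run it now instead of leaving it
					 * to ksoftirqd / the next irq exit */
	preempt_enable();

	return err;
}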
* Re: NAPI rescheduling and the delay caused by it
From: Zoltan Kiss @ 2013-12-04 21:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On 04/12/13 20:41, Eric Dumazet wrote:
> On Wed, 2013-12-04 at 18:55 +0000, Zoltan Kiss wrote:
>
>> So, my questions are:
>> - why is NAPI rescheduled on another CPU?
>> - why does it cause a 3-4 millisecond delay?
>
> NAPI can not be scheduled on another cpu.
>
> But at the time of the napi_schedule() call, the napi_struct can already
> be scheduled by another cpu (NAPI_STATE_SCHED bit already set).
>
> So I would say something kept the 'other' cpu from responding fast enough
> to softirq events being ready for service.
>
> (Another wakeup happened 3-4 millisec later)
Oh, thanks! I forgot to mention that I have my grant mapping patches
applied. The callback that runs when the previous packet is sent to the
other vif schedules the NAPI instance on that other CPU. But it's still not
clear why it takes so long to serve that softirq!
> Really, I suspect your usage of netif_wake_queue() is simply wrong.
>
> Check why we have netif_rx() and netif_rx_ni() variants.
>
> And ask yourself if xenvif_notify_tx_completion() is correct, being
> called from process context.
So, at the moment we use netif_wake_queue to notify the stack that it
can call xenvif_start_xmit, i.e. that the thread is ready to accept new
packets for transmission. It is called when we get an interrupt from the
frontend (which marks that it has made room in the ring), and from
xenvif_notify_tx_completion at the end of the thread. The latter checks
whether queueing was stopped in the meantime and whether the guest made
space after our recent transmission.
I see that netif_rx_ni makes sure the softirq is executed, but I'm not sure
I get how it is related to wake_queue. Can you explain a bit more?
Thanks,
Zoli
* Re: NAPI rescheduling and the delay caused by it
2013-12-04 21:23 ` Zoltan Kiss
@ 2013-12-04 23:32 ` Eric Dumazet
2013-12-09 23:39 ` Zoltan Kiss
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-12-04 23:32 UTC (permalink / raw)
To: Zoltan Kiss
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On Wed, 2013-12-04 at 21:23 +0000, Zoltan Kiss wrote:
> I see that netif_rx_ni makes sure the softirq is executed, but I'm not
> sure I get how it is related to wake_queue. Can you explain a bit more?
>
Calling netif_wake_queue() from process context gives no guarantee that the
TX softirq is processed in the following cycles.
This interface is meant to be used from softirq context.
Try to enclose it in:

void xenvif_notify_tx_completion(struct xenvif *vif)
{
	if (netif_queue_stopped(vif->dev) && xenvif_rx_schedulable(vif)) {
		local_bh_disable();
		netif_wake_queue(vif->dev);
		local_bh_enable();
	}
}
* RE: NAPI rescheduling and the delay caused by it
From: Zoltan Kiss @ 2013-12-09 23:39 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
Hi,
I tried that, it didn't help in my case. I think xenvif_notify_tx_completion
is only a shortcut to wake the queue earlier; otherwise we would have to wait
for the interrupt from the guest to arrive (see rx_interrupt), even though we
already know it can be done. Or does this call from the kernel thread make
things worse than a later call from interrupt context?
I found another suspect however: my grant mapping patches do the unmapping
from the NAPI instance where we otherwise receive the packets from the guest.
But this means we call napi_schedule from the zerocopy callback, which can be
run by anyone who frees up that skb, including another VIF's RX thread (which
actually does the transmit TO the guest). I guess that might be bad.
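Roughly, the suspect pattern is the sketch below; this is hypothetical rather
than the actual patch, and example_vif, pending_ubuf and the container_of
layout are made up for illustration:

/* Destructor attached to SKBTX_DEV_ZEROCOPY skbs so the sending vif's
 * NAPI instance can unmap the grants.  Whoever frees the skb runs this,
 * possibly another vif's RX kthread, i.e. plain process context on a
 * different CPU, where napi_schedule() only marks NET_RX_SOFTIRQ
 * pending. */
static void example_zerocopy_callback(struct ubuf_info *ubuf,
				      bool zerocopy_success)
{
	struct example_vif *vif = container_of(ubuf, struct example_vif,
					       pending_ubuf);

	napi_schedule(&vif->napi);
}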
Zoli
* RE: NAPI rescheduling and the delay caused by it
From: Eric Dumazet @ 2013-12-10 1:14 UTC (permalink / raw)
To: Zoltan Kiss
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On Mon, 2013-12-09 at 23:39 +0000, Zoltan Kiss wrote:
> Hi,
>
> I tried that, it didn't help in my case. I think
> xenvif_notify_tx_completion is only a shortcut to wake the queue
> earlier; otherwise we would have to wait for the interrupt from the
> guest to arrive (see rx_interrupt), even though we already know it can
> be done. Or does this call from the kernel thread make things worse
> than a later call from interrupt context?
Well, the call through ksoftirqd only adds some latency and an extra
context switch. It is delayed until your current process exits the kernel
(or you need CONFIG_PREEMPT).
> I found another suspect however: my grant mapping patches do the
> unmapping from the NAPI instance where we otherwise receive the
> packets from the guest. But this means we call napi_schedule from the
> zerocopy callback, which can be run by anyone who frees up that skb,
> including another VIF's RX thread (which actually does the transmit
> TO the guest). I guess that might be bad.
Same problem: napi_schedule() is meant to be used from interrupt
context.
* RE: NAPI rescheduling and the delay caused by it
From: Eric Dumazet @ 2013-12-10 1:17 UTC (permalink / raw)
To: Zoltan Kiss
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On Mon, 2013-12-09 at 17:14 -0800, Eric Dumazet wrote:
> > I found another suspect however: my grant mapping patches do the
> > unmapping from the NAPI instance where we otherwise receive the
> > packets from the guest. But this means we call napi_schedule from the
> > zerocopy callback, which can be run by anyone who frees up that skb,
> > including another VIF's RX thread (which actually does the transmit
> > TO the guest). I guess that might be bad.
>
> Same problem: napi_schedule() is meant to be used from interrupt
> context.
>
Check commit ec13ee80145ccb95b00e6e610044bbd94a170051
for example.
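The relevant pattern there is to bracket the scheduling with
local_bh_disable()/local_bh_enable(), so that local_bh_enable() runs the
softirq we just raised instead of leaving it to ksoftirqd. A minimal sketch,
not the exact diff from that commit:

static void example_schedule_napi(struct napi_struct *napi)
{
	local_bh_disable();
	if (napi_schedule_prep(napi))
		__napi_schedule(napi);
	local_bh_enable();	/* pending NET_RX_SOFTIRQ runs here */
}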
* Re: NAPI rescheduling and the delay caused by it
From: Zoltan Kiss @ 2013-12-11 18:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev@vger.kernel.org, xen-devel@lists.xenproject.org,
Malcolm Crossley, Jonathan Davies, Paul Durrant, Wei Liu,
Ian Campbell
On 10/12/13 01:14, Eric Dumazet wrote:
> On Mon, 2013-12-09 at 23:39 +0000, Zoltan Kiss wrote:
>> I found another suspect however: my grant mapping patches do the
>> unmapping from the NAPI instance where we otherwise receive the
>> packets from the guest. But this means we call napi_schedule from the
>> zerocopy callback, which can be run by anyone who frees up that skb,
>> including another VIF's RX thread (which actually does the transmit
>> TO the guest). I guess that might be bad.
>
> Same problem : napi_schedule() is meant to be used from interrupt
> context.
Indeed, avoiding napi_schedule seems to solve the issue. Thanks for the
advice!
Zoli