* NAPI and TX
@ 2009-05-08 12:32 Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Krzysztof Halasa @ 2009-05-08 12:32 UTC
To: netdev
Hi,
Could NAPI or something similar be used for TX in addition to RX?
Perhaps some driver already does that?
I'm specifically thinking of IXP4xx Ethernet (and possibly WAN); these
are 266-667 MHz ARM XScale CPUs, and the interrupts handling
just-transmitted sk_buffs are a pain. Could I delay those interrupts
somehow?
--
Krzysztof Halasa
* Re: NAPI and TX
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
@ 2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:44 ` NAPI and TX David Miller
2 siblings, 0 replies; 15+ messages in thread
From: Ben Hutchings @ 2009-05-08 14:24 UTC
To: Krzysztof Halasa; +Cc: netdev
On Fri, 2009-05-08 at 14:32 +0200, Krzysztof Halasa wrote:
> Hi,
>
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
Yes, several drivers do this. Mostly they receive a single interrupt
for both RX and TX completions and use a single NAPI context. PCI NICs
using MSI-X can generate distinct IRQs.
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN); these
> are 266-667 MHz ARM XScale CPUs, and the interrupts handling
> just-transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?
Since there are separate IRQs for RX and TX completions, you need to
set up a second NAPI context (struct napi_struct) for TX completion
handling. Note that TX completions do not count against the "budget",
so your TX poll function should never need to return early.
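A minimal sketch of such a dedicated TX context; all foo_* names are
hypothetical driver helpers (only the napi_*/IRQ calls are the real API),
and the context is registered with netif_napi_add() just like the RX one:

static int foo_tx_poll(struct napi_struct *napi, int budget)
{
	struct foo_port *port = container_of(napi, struct foo_port, tx_napi);

	/* Free the sk_buffs of all completed TX descriptors. */
	foo_reap_tx_ring(port);

	/* TX work is not counted against the budget, so one pass is
	 * always enough; report no work done and re-arm the IRQ. */
	napi_complete(napi);
	foo_enable_tx_irq(port);
	return 0;
}

static irqreturn_t foo_tx_irq(int irq, void *dev_id)
{
	struct foo_port *port = dev_id;

	foo_disable_tx_irq(port);	/* stay quiet until polled */
	napi_schedule(&port->tx_napi);
	return IRQ_HANDLED;
}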
Ben.
--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
* [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
@ 2009-05-08 15:12 ` Eric Dumazet
2009-05-08 21:48 ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2009-05-08 15:12 UTC
To: Krzysztof Halasa; +Cc: netdev, David S. Miller
Krzysztof Halasa wrote:
> Hi,
>
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
>
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN); these
> are 266-667 MHz ARM XScale CPUs, and the interrupts handling
> just-transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?
This is a pain, yes, I agree. But some improvements can be made
(even on drivers already using NAPI, especially if the same interrupt
is used for both the RX and TX queues).
For example, we can avoid the dst_release() cache miss if it
is done in start_xmit(), and not later at TX completion time while freeing the skb.
I tried various patches in the past, but unfortunately it seems the
only safe way to do this is in the driver xmit itself, not in the core
network stack. This would need many patches, one for each driver.
Then, if your workload is UDP packets, a recent patch
avoided an expensive wakeup at TX completion time
(check commit bf368e4e70cd4e0f880923c44e95a4273d725ab4
http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
Unfortunately this patch is only in 2.6.30, since it is based
on new infrastructure from Davide Libenzi).
Then, additional work should be done to reduce cache line
misses in sock_wfree().
This function currently touches many different cache lines
(one for sk_wmem_alloc, one for testing sk_flags, one
to decrement sk_refcnt), and on UDP extra cache lines because
we call sock_def_write_space(). Yes, this sucks.
We could avoid touching sk_refcnt in several cases. We
need only one reference count for the whole set of packets attached to a
socket, provided we use atomic_add_return() and friends to detect the
0 transitions. That way, if a tcp stream always has some packets in flight,
TX completion would not have to decrement sk_refcnt.
The following patch (based on net-next-2.6) illustrates my point:
[PATCH] net: reduce number of reference taken on sk_refcnt
The current sk_wmem_alloc scheme takes one sk_refcnt reference for each
packet in flight. This hurts some workloads at TX completion time, because
sock_wfree() has at least three cache lines to touch
(one for sk_wmem_alloc, one for testing sk_flags, one
to decrement sk_refcnt).
We could use only one reference count, taken only when sk_wmem_alloc
changes from or to zero (i.e. one reference count for any number
of in-flight packets).
Not every atomic_add() must be changed to atomic_add_return(): if we
know the current sk_wmem_alloc is already non-zero, a plain add is enough.
This patch reduces by one the number of cache lines dirtied in sock_wfree(),
and the number of atomic operations in some workloads.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..d57e88b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
*
* Inlined as it's very short and called for pretty much every
* packet ever received.
+ * We take a reference on sock only if this is the first packet
+ * accounted for. (one reference count for any number of in flight packets)
*/
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
- sock_hold(sk);
skb->sk = sk;
skb->destructor = sock_wfree;
- atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+ if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+ skb->truesize)
+ sock_hold(sk);
}
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@ void __init sk_init(void)
void sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
+ int res;
/* In case it might be waiting for more memory. */
- atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+ res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
sk->sk_write_space(sk);
- sock_put(sk);
+ /*
+ * No more packets in flight, we must release sock refcnt
+ */
+ if (res == 0)
+ sock_put(sk);
}
/*
* Re: NAPI and TX
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
@ 2009-05-08 21:44 ` David Miller
2009-05-09 12:27 ` Krzysztof Halasa
2 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2009-05-08 21:44 UTC
To: khc; +Cc: netdev
From: Krzysztof Halasa <khc@pm.waw.pl>
Date: Fri, 08 May 2009 14:32:25 +0200
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
Many multiqueue drivers do exactly this. NIU is but one example.
Every discrete event from the chip which can generate a discrete
interrupt uses a unique NAPI context.
If a single interrupt is used to indicate RX and TX, using separate
NAPI contexts is not advisable. It's just going to add overhead.
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
@ 2009-05-08 21:48 ` David Miller
2009-05-09 12:13 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2009-05-08 21:48 UTC
To: dada1; +Cc: khc, netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 08 May 2009 17:12:39 +0200
> For example, we can avoid the dst_release() cache miss if it
> is done in start_xmit(), and not later at TX completion time while freeing the skb.
> I tried various patches in the past, but unfortunately it seems the
> only safe way to do this is in the driver xmit itself, not in the core
> network stack. This would need many patches, one for each driver.
There might be a way around having to hit every driver.
The case we can't muck with is when the route will be used.
Devices which create this kind of situation can be marked with
a flag bit in struct net_device. If that flag bit isn't set,
you can drop the DST in dev_hard_start_xmit().
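Something like this in dev_hard_start_xmit(), say (IFF_NEEDS_DST is a
hypothetical flag name, just to sketch the idea):

	/* Hypothetical: devices whose xmit path still needs the route
	 * set IFF_NEEDS_DST in dev->priv_flags; for everyone else the
	 * dst can be released here, while its cache line is still hot. */
	if (!(dev->priv_flags & IFF_NEEDS_DST) && skb->dst) {
		dst_release(skb->dst);
		skb->dst = NULL;
	}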
> [PATCH] net: reduce number of reference taken on sk_refcnt
>
> The current sk_wmem_alloc scheme takes one sk_refcnt reference for each
> packet in flight. This hurts some workloads at TX completion time, because
> sock_wfree() has at least three cache lines to touch
> (one for sk_wmem_alloc, one for testing sk_flags, one
> to decrement sk_refcnt).
>
> We could use only one reference count, taken only when sk_wmem_alloc
> changes from or to zero (i.e. one reference count for any number
> of in-flight packets).
>
> Not every atomic_add() must be changed to atomic_add_return(): if we
> know the current sk_wmem_alloc is already non-zero, a plain add is enough.
>
> This patch reduces by one the number of cache lines dirtied in sock_wfree(),
> and the number of atomic operations in some workloads.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
I like this idea. Let me know when you have at least some
basic performance numbers and wish to submit this formally.
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-08 21:48 ` David Miller
@ 2009-05-09 12:13 ` Eric Dumazet
2009-05-09 20:34 ` David Miller
2009-05-09 20:36 ` David Miller
0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2009-05-09 12:13 UTC
To: David Miller; +Cc: khc, netdev
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
>
>> For example, we can avoid the dst_release() cache miss if it
>> is done in start_xmit(), and not later at TX completion time while freeing the skb.
>> I tried various patches in the past, but unfortunately it seems the
>> only safe way to do this is in the driver xmit itself, not in the core
>> network stack. This would need many patches, one for each driver.
>
> There might be a way around having to hit every driver.
>
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct net_device. If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().
Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure)...
>
>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> The current sk_wmem_alloc scheme takes one sk_refcnt reference for each
>> packet in flight. This hurts some workloads at TX completion time, because
>> sock_wfree() has at least three cache lines to touch
>> (one for sk_wmem_alloc, one for testing sk_flags, one
>> to decrement sk_refcnt).
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> changes from or to zero (i.e. one reference count for any number
>> of in-flight packets).
>>
>> Not every atomic_add() must be changed to atomic_add_return(): if we
>> know the current sk_wmem_alloc is already non-zero, a plain add is enough.
>>
>> This patch reduces by one the number of cache lines dirtied in sock_wfree(),
>> and the number of atomic operations in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> I like this idea. Let me know when you have at least some
> basic performance numbers and wish to submit this formally.
Sure, but I am focusing right now on the opposite situation
(many tcp flows but with small in/out traffic, where this patch
has no impact, since I have only (0 <--> !0) transitions).
BTW, oprofile for this kind of workload gives a surprising result
(the timer stuff being *very* expensive).
CPU doing the NAPI stuff has this profile (columns: samples, cumulative samples, %, cumulative %):
88688 88688 9.7805 9.7805 lock_timer_base
72692 161380 8.0165 17.7970 bnx2_poll_work
66958 228338 7.3842 25.1812 mod_timer
47980 276318 5.2913 30.4724 __wake_up
43312 319630 4.7765 35.2489 task_rq_lock
43193 362823 4.7633 40.0122 __slab_alloc
36388 399211 4.0129 44.0251 __alloc_skb
30285 429496 3.3398 47.3650 skb_release_data
29236 458732 3.2242 50.5891 ip_rcv
29219 487951 3.2223 53.8114 resched_task
29094 517045 3.2085 57.0199 __inet_lookup_established
28695 545740 3.1645 60.1844 tcp_v4_rcv
27479 573219 3.0304 63.2148 sock_wfree
26722 599941 2.9469 66.1617 ip_route_input
21401 621342 2.3601 68.5218 select_task_rq_fair
19390 640732 2.1383 70.6601 __kfree_skb
17763 658495 1.9589 72.6190 sched_clock_cpu
17565 676060 1.9371 74.5561 try_to_wake_up
17366 693426 1.9151 76.4712 __enqueue_entity
16174 709600 1.7837 78.2549 update_curr
14323 723923 1.5795 79.8345 __kmalloc_track_caller
14003 737926 1.5443 81.3787 enqueue_task_fair
12456 750382 1.3737 82.7524 __tcp_prequeue
12212 762594 1.3467 84.0991 __wake_up_common
11437 774031 1.2613 85.3604 kmem_cache_alloc
10927 784958 1.2050 86.5654 place_entity
10535 795493 1.1618 87.7272 netif_receive_skb
9971 805464 1.0996 88.8268 ipt_do_table
8551 814015 0.9430 89.7698 internal_add_timer
* Re: NAPI and TX
2009-05-08 21:44 ` NAPI and TX David Miller
@ 2009-05-09 12:27 ` Krzysztof Halasa
0 siblings, 0 replies; 15+ messages in thread
From: Krzysztof Halasa @ 2009-05-09 12:27 UTC
To: David Miller; +Cc: netdev
David Miller <davem@davemloft.net> writes:
> Many multiqueue drivers do exactly this. NIU is but one example.
>
> Every discrete event from the chip which can generate a discrete
> interrupt uses a unique NAPI context.
>
> If a single interrupt is used to indicate RX and TX, using separate
> NAPI contexts is not advisable. It's just going to add overhead.
Great. Ben, David, Eric, thanks. I will try to use this.
--
Krzysztof Halasa
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-09 12:13 ` Eric Dumazet
@ 2009-05-09 20:34 ` David Miller
2009-05-09 20:40 ` David Miller
2009-05-09 20:36 ` David Miller
1 sibling, 1 reply; 15+ messages in thread
From: David Miller @ 2009-05-09 20:34 UTC
To: dada1; +Cc: khc, netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200
> BTW, oprofile for this kind of workload gives a surprising result
> (the timer stuff being *very* expensive).
> CPU doing the NAPI stuff has this profile:
>
> 88688 88688 9.7805 9.7805 lock_timer_base
> 72692 161380 8.0165 17.7970 bnx2_poll_work
> 66958 228338 7.3842 25.1812 mod_timer
> 47980 276318 5.2913 30.4724 __wake_up
> 43312 319630 4.7765 35.2489 task_rq_lock
To me this seems to indicate that timers keep getting scheduled on
different cpus.
It would probably be helped by Google's packet-processing-locality
patches, or some variant thereof.
Consider the case where we always send some message on CPU A and
then process the ACK on CPU B. We'll always be cancelling the
timer on a foreign cpu.
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-09 12:13 ` Eric Dumazet
2009-05-09 20:34 ` David Miller
@ 2009-05-09 20:36 ` David Miller
1 sibling, 0 replies; 15+ messages in thread
From: David Miller @ 2009-05-09 20:36 UTC
To: dada1; +Cc: khc, netdev
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200
> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 08 May 2009 17:12:39 +0200
>>
>>> For example, we can avoid the dst_release() cache miss if it
>>> is done in start_xmit(), and not later at TX completion time while freeing the skb.
>>> I tried various patches in the past, but unfortunately it seems the
>>> only safe way to do this is in the driver xmit itself, not in the core
>>> network stack. This would need many patches, one for each driver.
>>
>> There might be a way around having to hit every driver.
>>
>> The case we can't muck with is when the route will be used.
>> Devices which create this kind of situation can be marked with
>> a flag bit in struct net_device. If that flag bit isn't set,
>> you can drop the DST in dev_hard_start_xmit().
>
> Yes, this is a possibility, I'll think about it, thank you.
> I'll have to recall which devices would need this flag (loopback for sure)...
Another idea is to invert the flag, and set it in places such
as alloc_etherdev*().
This might speed up getting coverage for the most obvious cases.
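For instance (sketch only; IFF_XMIT_DST_RELEASE is an assumed name here
for the inverted flag, meaning "safe to drop the dst at xmit time"):

struct net_device *alloc_etherdev_mq(int sizeof_priv, unsigned int queue_count)
{
	struct net_device *dev = alloc_netdev_mq(sizeof_priv, "eth%d",
						 ether_setup, queue_count);

	/* Set the inverted flag by default for Ethernet devices;
	 * drivers that really need the route would clear it. */
	if (dev)
		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
	return dev;
}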
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-09 20:34 ` David Miller
@ 2009-05-09 20:40 ` David Miller
2009-05-10 7:09 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2009-05-09 20:40 UTC
To: dada1; +Cc: khc, netdev
From: David Miller <davem@davemloft.net>
Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
> Consider the case where we always send some message on CPU A and
> then process the ACK on CPU B. We'll always be cancelling the
> timer on a foreign cpu.
I should also mention that TCP has a peculiar optimization of timers
that is likely being thwarted by your workload. It never deletes
timers under normal operation; it simply lets them expire,
and the handler notices that there is "nothing to do" and returns.
But when the connection does shut down, we have to purge all of
these timers.
That could be another part of why you see timers in your profile.
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-09 20:40 ` David Miller
@ 2009-05-10 7:09 ` Eric Dumazet
2009-05-10 7:43 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2009-05-10 7:09 UTC
To: David Miller; +Cc: khc, netdev
David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>
>> Consider the case where we always send some message on CPU A and
>> then process the ACK on CPU B. We'll always be cancelling the
>> timer on a foreign cpu.
>
> I should also mention that TCP has a peculiar optimization of timers
> that is likely being thwarted by your workload. It never deletes
> timers under normal operation; it simply lets them expire,
> and the handler notices that there is "nothing to do" and returns.
Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
>
> But when the connection does shut down, we have to purge all of
> these timers.
>
> That could be another part of why you see timers in your profile.
>
>
Well, in my workload they should never expire, since the applications
exchange enough data in both directions, and there are no losses (Gigabit LAN context).
On the machine acting as a server (the one I am focusing on, of course),
each incoming frame:
- Contains the ACK for the previously sent frame
- Contains data provided by the client.
- Starts a timer for the delayed ACK
Then the server application reacts and sends a new payload, and the TCP stack:
- Sends a frame including the ACK for the previously received frame
- Contains data provided by the server application
- Starts a timer for retransmitting this frame if no ACK is received later.
So yes, each incoming and each outgoing frame is going to call mod_timer().
The problem is that incoming processing is done by CPU 0 (the one dedicated
to NAPI processing because of the stress situation, cpu 100% in softirq land),
while outgoing processing is done by the other cpus in the machine.
offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
So there are cache line ping-pongs, but oprofile seems to point
to spinlock contention in lock_timer_base(), and I don't know why...
shouldn't (in my workload) all the delack_timers belong to cpu 0, and
the retransmit_timers to the other cpus?
Or does mod_timer() never migrate an already established timer?
That would explain the lock contention on the timer_base; we should
take care of it if possible.
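For reference, the locking both paths go through is roughly the
2.6.30-era lock_timer_base() from kernel/timer.c (abridged, comments
mine): mod_timer() apparently does migrate a pending timer to the local
CPU's base, but only after taking the lock of whichever base the timer
currently lives on, which alone would explain the ping-pong:

static struct tvec_base *lock_timer_base(struct timer_list *timer,
					 unsigned long *flags)
{
	struct tvec_base *base;

	for (;;) {
		struct tvec_base *prelock_base = timer->base;
		base = tbase_get_base(prelock_base);
		if (likely(base != NULL)) {
			/* Possibly a remote CPU's base. */
			spin_lock_irqsave(&base->lock, *flags);
			if (likely(prelock_base == timer->base))
				return base;
			/* The timer migrated under us; retry. */
			spin_unlock_irqrestore(&base->lock, *flags);
		}
		cpu_relax();
	}
}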
Thanks David
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-10 7:09 ` Eric Dumazet
@ 2009-05-10 7:43 ` Eric Dumazet
2009-05-10 10:45 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2009-05-10 7:43 UTC
To: David Miller; +Cc: khc, netdev
Eric Dumazet wrote:
> David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>
>>> Consider the case where we always send some message on CPU A and
>>> then process the ACK on CPU B. We'll always be cancelling the
>>> timer on a foreign cpu.
>> I should also mention that TCP has a peculiar optimization of timers
>> that is likely being thwarted by your workload. It never deletes
>> timers under normal operation; it simply lets them expire,
>> and the handler notices that there is "nothing to do" and returns.
>
> Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
>
>> But when the connection does shut down, we have to purge all of
>> these timers.
>>
>> That could be another part of why you see timers in your profile.
>>
>>
>
> Well, in my workload they should never expire, since the applications
> exchange enough data in both directions, and there are no losses (Gigabit LAN context).
>
> On the machine acting as a server (the one I am focusing on, of course),
> each incoming frame:
>
> - Contains the ACK for the previously sent frame
> - Contains data provided by the client.
> - Starts a timer for the delayed ACK
>
> Then the server application reacts and sends a new payload, and the TCP stack:
> - Sends a frame including the ACK for the previously received frame
> - Contains data provided by the server application
> - Starts a timer for retransmitting this frame if no ACK is received later.
>
> So yes, each incoming and each outgoing frame is going to call mod_timer().
>
> The problem is that incoming processing is done by CPU 0 (the one dedicated
> to NAPI processing because of the stress situation, cpu 100% in softirq land),
> while outgoing processing is done by the other cpus in the machine.
>
> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>
> So there are cache line ping-pongs, but oprofile seems to point
> to spinlock contention in lock_timer_base(), and I don't know why...
> shouldn't (in my workload) all the delack_timers belong to cpu 0, and
> the retransmit_timers to the other cpus?
>
> Or does mod_timer() never migrate an already established timer?
>
> That would explain the lock contention on the timer_base; we should
> take care of it if possible.
>
ftrace is my friend :)
The problem is that the application, when doing its recv() call,
ends up calling tcp_send_delayed_ack() too.
So yes, the cpus are fighting over icsk_delack_timer and its
timer_base pretty hard.
2631.936051: finish_task_switch <-schedule
2631.936051: perf_counter_task_sched_in <-finish_task_switch
2631.936051: __perf_counter_sched_in <-perf_counter_task_sched_in
2631.936051: _spin_lock <-__perf_counter_sched_in
2631.936052: lock_sock_nested <-sk_wait_data
2631.936052: _spin_lock_bh <-lock_sock_nested
2631.936052: local_bh_disable <-_spin_lock_bh
2631.936052: local_bh_enable <-lock_sock_nested
2631.936052: finish_wait <-sk_wait_data
2631.936053: tcp_prequeue_process <-tcp_recvmsg
2631.936053: local_bh_disable <-tcp_prequeue_process
2631.936053: tcp_v4_do_rcv <-tcp_prequeue_process
2631.936053: tcp_rcv_established <-tcp_v4_do_rcv
2631.936054: local_bh_enable <-tcp_rcv_established
2631.936054: skb_copy_datagram_iovec <-tcp_rcv_established
2631.936054: memcpy_toiovec <-skb_copy_datagram_iovec
2631.936054: copy_to_user <-memcpy_toiovec
2631.936054: tcp_rcv_space_adjust <-tcp_rcv_established
2631.936055: local_bh_disable <-tcp_rcv_established
2631.936055: tcp_event_data_recv <-tcp_rcv_established
2631.936055: tcp_ack <-tcp_rcv_established
2631.936056: __kfree_skb <-tcp_ack
2631.936056: skb_release_head_state <-__kfree_skb
2631.936056: dst_release <-skb_release_head_state
2631.936056: skb_release_data <-__kfree_skb
2631.936056: put_page <-skb_release_data
2631.936057: kfree <-skb_release_data
2631.936057: kmem_cache_free <-__kfree_skb
2631.936057: tcp_valid_rtt_meas <-tcp_ack
2631.936058: bictcp_acked <-tcp_ack
2631.936058: bictcp_cong_avoid <-tcp_ack
2631.936058: tcp_is_cwnd_limited <-bictcp_cong_avoid
2631.936058: tcp_current_mss <-tcp_rcv_established
2631.936058: tcp_established_options <-tcp_current_mss
2631.936058: __tcp_push_pending_frames <-tcp_rcv_established
2631.936059: __tcp_ack_snd_check <-tcp_rcv_established
2631.936059: tcp_send_delayed_ack <-__tcp_ack_snd_check
2631.936059: sk_reset_timer <-tcp_send_delayed_ack
2631.936059: mod_timer <-sk_reset_timer
2631.936059: lock_timer_base <-mod_timer
2631.936059: _spin_lock_irqsave <-lock_timer_base
2631.936059: _spin_lock <-mod_timer
2631.936060: internal_add_timer <-mod_timer
2631.936064: _spin_unlock_irqrestore <-mod_timer
2631.936064: __kfree_skb <-tcp_rcv_established
2631.936064: skb_release_head_state <-__kfree_skb
2631.936064: dst_release <-skb_release_head_state
2631.936065: skb_release_data <-__kfree_skb
2631.936065: kfree <-skb_release_data
2631.936065: __slab_free <-kfree
2631.936065: add_partial <-__slab_free
2631.936065: _spin_lock <-add_partial
2631.936066: kmem_cache_free <-__kfree_skb
2631.936066: __slab_free <-kmem_cache_free
2631.936066: add_partial <-__slab_free
2631.936067: _spin_lock <-add_partial
2631.936067: local_bh_enable <-tcp_prequeue_process
2631.936067: tcp_cleanup_rbuf <-tcp_recvmsg
2631.936067: __tcp_select_window <-tcp_cleanup_rbuf
2631.936067: release_sock <-tcp_recvmsg
2631.936068: _spin_lock_bh <-release_sock
2631.936068: local_bh_disable <-_spin_lock_bh
2631.936068: _spin_unlock_bh <-release_sock
2631.936068: local_bh_enable_ip <-_spin_unlock_bh
2631.936068: fput <-sys_recvfrom
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-10 7:43 ` Eric Dumazet
@ 2009-05-10 10:45 ` Eric Dumazet
2009-05-19 4:58 ` David Miller
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2009-05-10 10:45 UTC
To: David Miller; +Cc: khc, netdev, satoru.satoh
Eric Dumazet wrote:
> Eric Dumazet wrote:
>> David Miller wrote:
>>> From: David Miller <davem@davemloft.net>
>>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>>
>>>> Consider the case where we always send some message on CPU A and
>>>> then process the ACK on CPU B. We'll always be cancelling the
>>>> timer on a foreign cpu.
>>> I should also mention that TCP has a peculiar optimization of timers
>>> that is likely being thwarted by your workload. It never deletes
>>> timers under normal operation; it simply lets them expire,
>>> and the handler notices that there is "nothing to do" and returns.
>> Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
>>
>>> But when the connection does shut down, we have to purge all of
>>> these timers.
>>>
>>> That could be another part of why you see timers in your profile.
>>>
>>>
>> Well, in my workload they should never expire, since the applications
>> exchange enough data in both directions, and there are no losses (Gigabit LAN context).
>>
>> On the machine acting as a server (the one I am focusing on, of course),
>> each incoming frame:
>>
>> - Contains the ACK for the previously sent frame
>> - Contains data provided by the client.
>> - Starts a timer for the delayed ACK
>>
>> Then the server application reacts and sends a new payload, and the TCP stack:
>> - Sends a frame including the ACK for the previously received frame
>> - Contains data provided by the server application
>> - Starts a timer for retransmitting this frame if no ACK is received later.
>>
>> So yes, each incoming and each outgoing frame is going to call mod_timer().
>>
>> The problem is that incoming processing is done by CPU 0 (the one dedicated
>> to NAPI processing because of the stress situation, cpu 100% in softirq land),
>> while outgoing processing is done by the other cpus in the machine.
>>
>> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
>> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>>
>> So there are cache line ping-pongs, but oprofile seems to point
>> to spinlock contention in lock_timer_base(), and I don't know why...
>> shouldn't (in my workload) all the delack_timers belong to cpu 0, and
>> the retransmit_timers to the other cpus?
>>
>> Or does mod_timer() never migrate an already established timer?
>>
>> That would explain the lock contention on the timer_base; we should
>> take care of it if possible.
>>
>
> ftrace is my friend :)
>
> The problem is that the application, when doing its recv() call,
> ends up calling tcp_send_delayed_ack() too.
>
> So yes, the cpus are fighting over icsk_delack_timer and its
> timer_base pretty hard.
>
>
Changing tcp_prequeue() to use the same delack timeout as tcp_send_delayed_ack()
helps a lot, but I don't know whether it is correct or not.
This spot was already changed recently in commit 0c266898b42fe4e4e2f9edfc9d3474c10f93aa6a
Author: Satoru SATOH <satoru.satoh@gmail.com>
Date: Mon May 4 11:11:01 2009 -0700
tcp: Fix tcp_prequeue() to get correct rto_min value
tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of
whether the actual value has been tuned. The following patches fix this and make
tcp_prequeue() get the actual value returned from tcp_rto_min().
but as the reader will reset the timeout to an even smaller value, I wonder if
we should not select this smaller value sooner, to avoid a mod_timer() ping-pong.
TCP_RTO_MIN (200 ms) was too big, but (3 * tcp_rto_min(sk)) / 4 (150 ms by
default) might be too big too...
New profile data for cpu 0 (no more timer stuff, and bandwidth improved by
more than 10%); back to a normal device driver and scheduler (wakeups) load:
104452 104452 13.3363 13.3363 bnx2_poll_work
56103 160555 7.1631 20.4994 __wake_up
38650 199205 4.9348 25.4342 task_rq_lock
37947 237152 4.8450 30.2792 __slab_alloc
37229 274381 4.7533 35.0326 tcp_v4_rcv
34781 309162 4.4408 39.4734 __alloc_skb
27977 337139 3.5721 43.0454 skb_release_data
26516 363655 3.3855 46.4309 ip_rcv
26399 390054 3.3706 49.8015 resched_task
26059 416113 3.3272 53.1287 __inet_lookup_established
25518 441631 3.2581 56.3868 sock_wfree
23515 465146 3.0024 59.3892 ip_route_input
19811 484957 2.5294 61.9186 select_task_rq_fair
16901 501858 2.1579 64.0765 __kfree_skb
16466 518324 2.1024 66.1788 __enqueue_entity
16008 534332 2.0439 68.2227 sched_clock_cpu
15486 549818 1.9772 70.2000 try_to_wake_up
14641 564459 1.8693 72.0693 update_curr
13725 578184 1.7524 73.8217 enqueue_task_fair
13304 591488 1.6986 75.5203 __kmalloc_track_caller
13190 604678 1.6841 77.2044 kmem_cache_alloc
12016 616694 1.5342 78.7386 __tcp_prequeue
10989 627683 1.4031 80.1416 __wake_up_common
10623 638306 1.3563 81.4980 netif_receive_skb
10004 648310 1.2773 82.7753 place_entity
Patch follows for RFC only (not Signed-off...), and is based on net-next-2.6
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19f4150..906f597 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible_poll(sk->sk_sleep,
POLLIN | POLLRDNORM | POLLRDBAND);
- if (!inet_csk_ack_scheduled(sk))
+ if (!inet_csk_ack_scheduled(sk)) {
+ unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
+
+ delay = min(inet_csk(sk)->icsk_ack.ato, delay);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
+ delay, TCP_RTO_MAX);
+ }
}
return 1;
}
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-10 10:45 ` Eric Dumazet
@ 2009-05-19 4:58 ` David Miller
2009-05-21 9:07 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2009-05-19 4:58 UTC
To: dada1; +Cc: khc, netdev, satoru.satoh
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 10 May 2009 12:45:56 +0200
> Patch follows for RFC only (not Signed-off...), and is based on net-next-2.6
Thanks for the analysis.
> @@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
> } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
> wake_up_interruptible_poll(sk->sk_sleep,
> POLLIN | POLLRDNORM | POLLRDBAND);
> - if (!inet_csk_ack_scheduled(sk))
> + if (!inet_csk_ack_scheduled(sk)) {
> + unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
> +
> + delay = min(inet_csk(sk)->icsk_ack.ato, delay);
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
> - (3 * tcp_rto_min(sk)) / 4,
> - TCP_RTO_MAX);
> + delay, TCP_RTO_MAX);
> + }
> }
> return 1;
I think this code is trying to aggressively stretch the ACK when
prequeueing, in order to make sure there is enough time to get
the process on the CPU and send a response, and thus piggyback
the ACK.
If that turns out not to really matter, or to matter less than your
problem, then we can make your change and I'm all for it.
* Re: [PATCH] net: reduce number of reference taken on sk_refcnt
2009-05-19 4:58 ` David Miller
@ 2009-05-21 9:07 ` Eric Dumazet
0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2009-05-21 9:07 UTC
To: David Miller; +Cc: khc, netdev, satoru.satoh
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 10 May 2009 12:45:56 +0200
>
>> Patch follows for RFC only (not Signed-off...), and is based on net-next-2.6
>
> Thanks for the analysis.
>
>> @@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>> } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
>> wake_up_interruptible_poll(sk->sk_sleep,
>> POLLIN | POLLRDNORM | POLLRDBAND);
>> - if (!inet_csk_ack_scheduled(sk))
>> + if (!inet_csk_ack_scheduled(sk)) {
>> + unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
>> +
>> + delay = min(inet_csk(sk)->icsk_ack.ato, delay);
>> inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
>> - (3 * tcp_rto_min(sk)) / 4,
>> - TCP_RTO_MAX);
>> + delay, TCP_RTO_MAX);
>> + }
>> }
>> return 1;
>
> I think this code is trying to aggressively stretch the ACK when
> prequeueing, in order to make sure there is enough time to get
> the process on the CPU and send a response, and thus piggyback
> the ACK.
>
> If that turns out not to really matter, or to matter less than your
> problem, then we can make your change and I'm all for it.
This change gave me about a 15% increase in bandwidth in a multiflow
tcp benchmark. But this optimization worked because tasks could be
woken up and send their answer within the same jiffy, 'rearming'
the xmit timer with the same value...
(135,000 messages received/sent per second in my benchmark, with 60 flows)
mod_timer() has a special heuristic to avoid calling __mod_timer():
int mod_timer(struct timer_list *timer, unsigned long expires)
{
/*
* This is a common optimization triggered by the
* networking code - if the timer is re-modified
* to be the same thing then just return:
*/
if (timer->expires == expires && timer_pending(timer))
return 1;
return __mod_timer(timer, expires, false);
}
With HZ=1000 and real applications (using more than 1 msec to process a
request), I suppose this kind of optimization is unlikely to trigger,
so we might extend the mod_timer() heuristic to avoid changing timer->expires
when the new value is almost the same as the previous one, instead of only
when it is "exactly the same value":
int mod_timer_unexact(struct timer_list *timer, unsigned long expires, long maxerror)
{
/*
* This is a common optimization triggered by the
* networking code - if the timer is re-modified
* to be about the same thing then just return:
*/
if (timer_pending(timer)) {
long delta = expires - timer->expires;
if (delta <= maxerror && delta >= -maxerror)
return 1;
}
return __mod_timer(timer, expires, false);
}
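A hypothetical call site, mirroring sk_reset_timer() from net/core/sock.c,
could then tolerate one jiffy of error before touching the (possibly
remote) timer base again:

void sk_reset_timer_unexact(struct sock *sk, struct timer_list *timer,
			    unsigned long expires)
{
	/* Sketch only: accept the timer firing up to one jiffy off
	 * rather than re-locking its tvec_base on every call. */
	if (!mod_timer_unexact(timer, expires, 1))
		sock_hold(sk);
}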
But to be effective, prequeue needs a blocked task
for each flow, and modern daemons prefer to use poll/epoll, so
prequeue is not used.
Another possibility would be to use a separate timer for prequeue's
exclusive use, instead of sharing the xmit timer.
Thread overview: 15 messages
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:48 ` David Miller
2009-05-09 12:13 ` Eric Dumazet
2009-05-09 20:34 ` David Miller
2009-05-09 20:40 ` David Miller
2009-05-10 7:09 ` Eric Dumazet
2009-05-10 7:43 ` Eric Dumazet
2009-05-10 10:45 ` Eric Dumazet
2009-05-19 4:58 ` David Miller
2009-05-21 9:07 ` Eric Dumazet
2009-05-09 20:36 ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2009-05-09 12:27 ` Krzysztof Halasa