From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org, satoru.satoh@gmail.com
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sun, 10 May 2009 12:45:56 +0200 [thread overview]
Message-ID: <4A06B064.9080801@cosmosbay.com> (raw)
In-Reply-To: <4A0685A0.8020002@cosmosbay.com>
Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> David Miller a écrit :
>>> From: David Miller <davem@davemloft.net>
>>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>>
>>>> Consider the case where we always send some message on CPU A and
>>>> then process the ACK on CPU B. We'll always be cancelling the
>>>> timer on a foreign cpu.
>>> I should also mention that TCP has a peculiar optimization of timers
>>> that is likely being thwarted by your workload. It never deletes
>>> timers under normal operation, it simply lets them still expire
>>> and the handler notices that there is "nothing to do" and returns.
>> Yes, you refer to INET_CSK_CLEAR_TIMERS condition, never set.
>>
>>> But when the connection does shut down, we have to purge all of
>>> these timers.
>>>
>>> That could be another part of why you see timers in your profile.
>>>
>>>
>> Well, in my workload they should never expire, since application exchange
>> enough data on both direction, and they are no losses (Gigabit LAN context)
>>
>> On machine acting as a server (the one I am focusing to, of course),
>> each incoming frame :
>>
>> - Contains ACK for the previous sent frame
>> - Contains data provided by the client.
>> - Starts a timer for delayed ACK
>>
>> Then server applications reacts and sends a new payload, and TCP stack
>> - Sends a frame including ACK for previous received frame
>> - Contains data provided by server application
>> - Starts a timer for retransmiting this frame if no ACK is received later.
>>
>> So yes, each incoming and each outgoing frame is going to call mod_timer()
>>
>> Problem is that incoming process is done by CPU 0 (the one that is dedicated
>> to NAPI processing because of stress situation, cpu 100% in softirq land),
>> and outgoing processing done by other cpus in the machine.
>>
>> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
>> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>>
>> So there are cache line ping-pongs, but oprofile seems to point
>> to a spinlock contention in lock_timer_base(), I dont know why...
>> shouldnt (in my workload) delack_timer all belongs to cpu 0, and
>> retransmit_timers to other cpus ?
>>
>> Or is mod_timer never migrates an already established timer ?
>>
>> That would explain the lock contention on timer_base, we should
>> take care of it if possible.
>>
>
> ftrace is my friend :)
>
> Problem is the application, when doing it recv() call
> is calling tcp_send_delayed_ack() too.
>
> So yes, cpus are fighting on icsk_delack_timer and their
> timer_base pretty hard.
>
>
Changing tcp_prequeue to use same delack timer than tcp_send_delayed_ack()
helps a lot, but I dont know if its correct or not.
It was already changed recently in commit 0c266898b42fe4e4e2f9edfc9d3474c10f93aa6a
Author: Satoru SATOH <satoru.satoh@gmail.com>
Date: Mon May 4 11:11:01 2009 -0700
tcp: Fix tcp_prequeue() to get correct rto_min value
tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of
the actual value might be tuned. The following patches fix this and make
tcp_prequeue get the actual value returns from tcp_rto_min().
but as reader will reset timeout to an even smaller value, I wonder if
we should not select this smaller value sooner, to avoid a mod_timer ping/pong
TCP_RTO_MIN (200 ms) was too big, but (3 * tcp_rto_min(sk)) / 4 might
be too big too...
New profile data of cpu0 (no more timer stuff, and improved bandwidth
by more than 10 %). back to normal device driver load and scheduler (wakeups)
104452 104452 13.3363 13.3363 bnx2_poll_work
56103 160555 7.1631 20.4994 __wake_up
38650 199205 4.9348 25.4342 task_rq_lock
37947 237152 4.8450 30.2792 __slab_alloc
37229 274381 4.7533 35.0326 tcp_v4_rcv
34781 309162 4.4408 39.4734 __alloc_skb
27977 337139 3.5721 43.0454 skb_release_data
26516 363655 3.3855 46.4309 ip_rcv
26399 390054 3.3706 49.8015 resched_task
26059 416113 3.3272 53.1287 __inet_lookup_established
25518 441631 3.2581 56.3868 sock_wfree
23515 465146 3.0024 59.3892 ip_route_input
19811 484957 2.5294 61.9186 select_task_rq_fair
16901 501858 2.1579 64.0765 __kfree_skb
16466 518324 2.1024 66.1788 __enqueue_entity
16008 534332 2.0439 68.2227 sched_clock_cpu
15486 549818 1.9772 70.2000 try_to_wake_up
14641 564459 1.8693 72.0693 update_curr
13725 578184 1.7524 73.8217 enqueue_task_fair
13304 591488 1.6986 75.5203 __kmalloc_track_caller
13190 604678 1.6841 77.2044 kmem_cache_alloc
12016 616694 1.5342 78.7386 __tcp_prequeue
10989 627683 1.4031 80.1416 __wake_up_common
10623 638306 1.3563 81.4980 netif_receive_skb
10004 648310 1.2773 82.7753 place_entity
Patch follows for RFC only (not Signed-of...), and based on net-next-2.6
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19f4150..906f597 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
wake_up_interruptible_poll(sk->sk_sleep,
POLLIN | POLLRDNORM | POLLRDBAND);
- if (!inet_csk_ack_scheduled(sk))
+ if (!inet_csk_ack_scheduled(sk)) {
+ unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
+
+ delay = min(inet_csk(sk)->icsk_ack.ato, delay);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- (3 * tcp_rto_min(sk)) / 4,
- TCP_RTO_MAX);
+ delay, TCP_RTO_MAX);
+ }
}
return 1;
}
next prev parent reply other threads:[~2009-05-10 10:46 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:48 ` David Miller
2009-05-09 12:13 ` Eric Dumazet
2009-05-09 20:34 ` David Miller
2009-05-09 20:40 ` David Miller
2009-05-10 7:09 ` Eric Dumazet
2009-05-10 7:43 ` Eric Dumazet
2009-05-10 10:45 ` Eric Dumazet [this message]
2009-05-19 4:58 ` David Miller
2009-05-21 9:07 ` Eric Dumazet
2009-05-09 20:36 ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2009-05-09 12:27 ` Krzysztof Halasa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A06B064.9080801@cosmosbay.com \
--to=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=khc@pm.waw.pl \
--cc=netdev@vger.kernel.org \
--cc=satoru.satoh@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).