netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org, satoru.satoh@gmail.com
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sun, 10 May 2009 12:45:56 +0200	[thread overview]
Message-ID: <4A06B064.9080801@cosmosbay.com> (raw)
In-Reply-To: <4A0685A0.8020002@cosmosbay.com>

Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> David Miller a écrit :
>>> From: David Miller <davem@davemloft.net>
>>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>>
>>>> Consider the case where we always send some message on CPU A and
>>>> then process the ACK on CPU B.  We'll always be cancelling the
>>>> timer on a foreign cpu.
>>> I should also mention that TCP has a peculiar optimization of timers
>>> that is likely being thwarted by your workload.  It never deletes
>>> timers under normal operation, it simply lets them still expire
>>> and the handler notices that there is "nothing to do" and returns.
>> Yes, you refer to INET_CSK_CLEAR_TIMERS condition, never set.
>>
>>> But when the connection does shut down, we have to purge all of
>>> these timers.
>>>
>>> That could be another part of why you see timers in your profile.
>>>
>>>
>> Well, in my workload they should never expire, since application exchange
>> enough data on both direction, and they are no losses (Gigabit LAN context)
>>
>> On machine acting as a server (the one I am focusing to, of course),
>> each incoming frame :
>>
>> - Contains ACK for the previous sent frame
>> - Contains data provided by the client.
>> - Starts a timer for delayed ACK
>>
>> Then server applications reacts and sends a new payload, and TCP stack
>> - Sends a frame including ACK for previous received frame
>> - Contains data provided by server application
>> - Starts a timer for retransmiting this frame if no ACK is received later.
>>
>> So yes, each incoming and each outgoing frame is going to call mod_timer()
>>
>> Problem is that incoming process is done by CPU 0 (the one that is dedicated
>> to NAPI processing because of stress situation, cpu 100% in softirq land),
>> and outgoing processing done by other cpus in the machine.
>>
>> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
>> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>>
>> So there are cache line ping-pongs, but oprofile seems to point
>> to a spinlock contention in lock_timer_base(), I dont know why...
>> shouldnt (in my workload) delack_timer all belongs to cpu 0, and 
>> retransmit_timers to other cpus ? 
>>
>> Or is mod_timer never migrates an already established timer ?
>>
>> That would explain the lock contention on timer_base, we should
>> take care of it if possible.
>>
> 
> ftrace is my friend :)
> 
> Problem is the application, when doing it recv() call
> is calling tcp_send_delayed_ack() too.
> 
> So yes, cpus are fighting on icsk_delack_timer and their
> timer_base pretty hard.
> 
> 

Changing tcp_prequeue to use same delack timer than tcp_send_delayed_ack()
helps a lot, but I dont know if its correct or not.

It was already changed recently in commit 0c266898b42fe4e4e2f9edfc9d3474c10f93aa6a

Author: Satoru SATOH <satoru.satoh@gmail.com>
Date:   Mon May 4 11:11:01 2009 -0700

    tcp: Fix tcp_prequeue() to get correct rto_min value

    tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of
    the actual value might be tuned. The following patches fix this and make
    tcp_prequeue get the actual value returns from tcp_rto_min().

but as reader will reset timeout to an even smaller value, I wonder if
we should not select this smaller value sooner, to avoid a mod_timer ping/pong

TCP_RTO_MIN (200 ms) was too big, but (3 * tcp_rto_min(sk)) / 4 might
be too big too...

New profile data of cpu0 (no more timer stuff, and improved bandwidth
by more than 10 %). back to normal device driver load and scheduler (wakeups)

104452   104452        13.3363  13.3363    bnx2_poll_work
56103    160555         7.1631  20.4994    __wake_up
38650    199205         4.9348  25.4342    task_rq_lock
37947    237152         4.8450  30.2792    __slab_alloc
37229    274381         4.7533  35.0326    tcp_v4_rcv
34781    309162         4.4408  39.4734    __alloc_skb
27977    337139         3.5721  43.0454    skb_release_data
26516    363655         3.3855  46.4309    ip_rcv
26399    390054         3.3706  49.8015    resched_task
26059    416113         3.3272  53.1287    __inet_lookup_established
25518    441631         3.2581  56.3868    sock_wfree
23515    465146         3.0024  59.3892    ip_route_input
19811    484957         2.5294  61.9186    select_task_rq_fair
16901    501858         2.1579  64.0765    __kfree_skb
16466    518324         2.1024  66.1788    __enqueue_entity
16008    534332         2.0439  68.2227    sched_clock_cpu
15486    549818         1.9772  70.2000    try_to_wake_up
14641    564459         1.8693  72.0693    update_curr
13725    578184         1.7524  73.8217    enqueue_task_fair
13304    591488         1.6986  75.5203    __kmalloc_track_caller
13190    604678         1.6841  77.2044    kmem_cache_alloc
12016    616694         1.5342  78.7386    __tcp_prequeue
10989    627683         1.4031  80.1416    __wake_up_common
10623    638306         1.3563  81.4980    netif_receive_skb
10004    648310         1.2773  82.7753    place_entity


Patch follows for RFC only (not Signed-of...), and based on net-next-2.6 

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19f4150..906f597 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 		wake_up_interruptible_poll(sk->sk_sleep,
 					   POLLIN | POLLRDNORM | POLLRDBAND);
-		if (!inet_csk_ack_scheduled(sk))
+		if (!inet_csk_ack_scheduled(sk)) {
+			unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
+
+			delay = min(inet_csk(sk)->icsk_ack.ato, delay);
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
-						  (3 * tcp_rto_min(sk)) / 4,
-						  TCP_RTO_MAX);
+						  delay, TCP_RTO_MAX);
+		}
 	}
 	return 1;
 }


  reply	other threads:[~2009-05-10 10:46 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:48   ` David Miller
2009-05-09 12:13     ` Eric Dumazet
2009-05-09 20:34       ` David Miller
2009-05-09 20:40         ` David Miller
2009-05-10  7:09           ` Eric Dumazet
2009-05-10  7:43             ` Eric Dumazet
2009-05-10 10:45               ` Eric Dumazet [this message]
2009-05-19  4:58                 ` David Miller
2009-05-21  9:07                   ` Eric Dumazet
2009-05-09 20:36       ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2009-05-09 12:27   ` Krzysztof Halasa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A06B064.9080801@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=davem@davemloft.net \
    --cc=khc@pm.waw.pl \
    --cc=netdev@vger.kernel.org \
    --cc=satoru.satoh@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).