From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sun, 10 May 2009 09:43:28 +0200
Message-ID: <4A0685A0.8020002@cosmosbay.com>
In-Reply-To: <4A067D9E.7050706@cosmosbay.com>
Eric Dumazet wrote:
> David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>
>>> Consider the case where we always send some message on CPU A and
>>> then process the ACK on CPU B. We'll always be cancelling the
>>> timer on a foreign cpu.
>> I should also mention that TCP has a peculiar optimization of timers
>> that is likely being thwarted by your workload. It never deletes
>> timers under normal operation, it simply lets them still expire
>> and the handler notices that there is "nothing to do" and returns.
>
> Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
>
>> But when the connection does shut down, we have to purge all of
>> these timers.
>>
>> That could be another part of why you see timers in your profile.
>>
>>
>
> Well, in my workload they should never expire, since the applications exchange
> enough data in both directions, and there are no losses (Gigabit LAN context).
>
> On the machine acting as a server (the one I am focusing on, of course),
> each incoming frame:
>
> - Contains ACK for the previous sent frame
> - Contains data provided by the client.
> - Starts a timer for delayed ACK
>
> Then the server application reacts and sends a new payload, and the TCP stack:
> - Sends a frame including the ACK for the previously received frame
> - Includes in it the data provided by the server application
> - Starts a timer for retransmitting this frame if no ACK is received later.
>
> So yes, each incoming and each outgoing frame is going to call mod_timer().
>
> The problem is that incoming processing is done by CPU 0 (the one dedicated
> to NAPI processing because of the stress situation, cpu 100% in softirq land),
> and outgoing processing is done by the other cpus in the machine.
>
> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>
> So there are cache line ping-pongs, but oprofile seems to point
> to spinlock contention in lock_timer_base(), and I don't know why...
> Shouldn't (in my workload) the delack timers all belong to cpu 0, and
> the retransmit timers to the other cpus?
>
> Or does mod_timer() never migrate an already established timer?
>
> That would explain the lock contention on timer_base; we should
> take care of it if possible.
>
ftrace is my friend :)

The problem is that the application, when doing its recv() call,
is calling tcp_send_delayed_ack() too.

So yes, cpus are fighting over icsk_delack_timer and their
timer_base pretty hard.
2631.936051: finish_task_switch <-schedule
2631.936051: perf_counter_task_sched_in <-finish_task_switch
2631.936051: __perf_counter_sched_in <-perf_counter_task_sched_in
2631.936051: _spin_lock <-__perf_counter_sched_in
2631.936052: lock_sock_nested <-sk_wait_data
2631.936052: _spin_lock_bh <-lock_sock_nested
2631.936052: local_bh_disable <-_spin_lock_bh
2631.936052: local_bh_enable <-lock_sock_nested
2631.936052: finish_wait <-sk_wait_data
2631.936053: tcp_prequeue_process <-tcp_recvmsg
2631.936053: local_bh_disable <-tcp_prequeue_process
2631.936053: tcp_v4_do_rcv <-tcp_prequeue_process
2631.936053: tcp_rcv_established <-tcp_v4_do_rcv
2631.936054: local_bh_enable <-tcp_rcv_established
2631.936054: skb_copy_datagram_iovec <-tcp_rcv_established
2631.936054: memcpy_toiovec <-skb_copy_datagram_iovec
2631.936054: copy_to_user <-memcpy_toiovec
2631.936054: tcp_rcv_space_adjust <-tcp_rcv_established
2631.936055: local_bh_disable <-tcp_rcv_established
2631.936055: tcp_event_data_recv <-tcp_rcv_established
2631.936055: tcp_ack <-tcp_rcv_established
2631.936056: __kfree_skb <-tcp_ack
2631.936056: skb_release_head_state <-__kfree_skb
2631.936056: dst_release <-skb_release_head_state
2631.936056: skb_release_data <-__kfree_skb
2631.936056: put_page <-skb_release_data
2631.936057: kfree <-skb_release_data
2631.936057: kmem_cache_free <-__kfree_skb
2631.936057: tcp_valid_rtt_meas <-tcp_ack
2631.936058: bictcp_acked <-tcp_ack
2631.936058: bictcp_cong_avoid <-tcp_ack
2631.936058: tcp_is_cwnd_limited <-bictcp_cong_avoid
2631.936058: tcp_current_mss <-tcp_rcv_established
2631.936058: tcp_established_options <-tcp_current_mss
2631.936058: __tcp_push_pending_frames <-tcp_rcv_established
2631.936059: __tcp_ack_snd_check <-tcp_rcv_established
2631.936059: tcp_send_delayed_ack <-__tcp_ack_snd_check
2631.936059: sk_reset_timer <-tcp_send_delayed_ack
2631.936059: mod_timer <-sk_reset_timer
2631.936059: lock_timer_base <-mod_timer
2631.936059: _spin_lock_irqsave <-lock_timer_base
2631.936059: _spin_lock <-mod_timer
2631.936060: internal_add_timer <-mod_timer
2631.936064: _spin_unlock_irqrestore <-mod_timer
2631.936064: __kfree_skb <-tcp_rcv_established
2631.936064: skb_release_head_state <-__kfree_skb
2631.936064: dst_release <-skb_release_head_state
2631.936065: skb_release_data <-__kfree_skb
2631.936065: kfree <-skb_release_data
2631.936065: __slab_free <-kfree
2631.936065: add_partial <-__slab_free
2631.936065: _spin_lock <-add_partial
2631.936066: kmem_cache_free <-__kfree_skb
2631.936066: __slab_free <-kmem_cache_free
2631.936066: add_partial <-__slab_free
2631.936067: _spin_lock <-add_partial
2631.936067: local_bh_enable <-tcp_prequeue_process
2631.936067: tcp_cleanup_rbuf <-tcp_recvmsg
2631.936067: __tcp_select_window <-tcp_cleanup_rbuf
2631.936067: release_sock <-tcp_recvmsg
2631.936068: _spin_lock_bh <-release_sock
2631.936068: local_bh_disable <-_spin_lock_bh
2631.936068: _spin_unlock_bh <-release_sock
2631.936068: local_bh_enable_ip <-_spin_unlock_bh
2631.936068: fput <-sys_recvfrom