From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sat, 09 May 2009 14:13:59 +0200 [thread overview]
Message-ID: <4A057387.4080308@cosmosbay.com> (raw)
In-Reply-To: <20090508.144859.152310605.davem@davemloft.net>
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
>
>> For example, we can avoid the dst_release() cache miss if this
>> is done in start_xmit(), and not later in TX completion while freeing skb.
>> I tried various patches in the past but unfortunatly it seems
>> only safe way to do this is in the driver xmit itself, not in core
>> network stack. This would need many patches, one for each driver.
>
> There might be a way around having to hit every driver.
>
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct netdevice. If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().
Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure)..
>
>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> Current sk_wmem_alloc schema uses a sk_refcnt taken for each packet
>> in flight. This hurts some workloads at TX completion time, because
>> sock_wfree() has three cache lines to touch at least.
>> (one for sk_wmem_alloc, one for testing sk_flags, one
>> to decrement sk_refcnt)
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> is changed from or to ZERO value (ie one reference count for any number
>> of in-flight packets)
>>
>> Not all atomic_add() must be changed to atomic_add_return(), if we
>> know current sk_wmem_alloc is already not null.
>>
>> This patch reduces by one number of cache lines dirtied in sock_wfree()
>> and number of atomic operation in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> I like this idea. Let me know when you have some at least
> basic performance numbers and wish to submit this formally.
Sure, but I am focusing right now on the opposite situation
(many tcp flows but with small in/out trafic, where this patch
has no impact, since I have only (0 <--> !0) transitions.)
BTW, oprofile for this kind of workload gives a surprising result.
(timer stuff being *very* expensive)
CPU doing the NAPI stuff has this profile :
88688 88688 9.7805 9.7805 lock_timer_base
72692 161380 8.0165 17.7970 bnx2_poll_work
66958 228338 7.3842 25.1812 mod_timer
47980 276318 5.2913 30.4724 __wake_up
43312 319630 4.7765 35.2489 task_rq_lock
43193 362823 4.7633 40.0122 __slab_alloc
36388 399211 4.0129 44.0251 __alloc_skb
30285 429496 3.3398 47.3650 skb_release_data
29236 458732 3.2242 50.5891 ip_rcv
29219 487951 3.2223 53.8114 resched_task
29094 517045 3.2085 57.0199 __inet_lookup_established
28695 545740 3.1645 60.1844 tcp_v4_rcv
27479 573219 3.0304 63.2148 sock_wfree
26722 599941 2.9469 66.1617 ip_route_input
21401 621342 2.3601 68.5218 select_task_rq_fair
19390 640732 2.1383 70.6601 __kfree_skb
17763 658495 1.9589 72.6190 sched_clock_cpu
17565 676060 1.9371 74.5561 try_to_wake_up
17366 693426 1.9151 76.4712 __enqueue_entity
16174 709600 1.7837 78.2549 update_curr
14323 723923 1.5795 79.8345 __kmalloc_track_caller
14003 737926 1.5443 81.3787 enqueue_task_fair
12456 750382 1.3737 82.7524 __tcp_prequeue
12212 762594 1.3467 84.0991 __wake_up_common
11437 774031 1.2613 85.3604 kmem_cache_alloc
10927 784958 1.2050 86.5654 place_entity
10535 795493 1.1618 87.7272 netif_receive_skb
9971 805464 1.0996 88.8268 ipt_do_table
8551 814015 0.9430 89.7698 internal_add_timer
next prev parent reply other threads:[~2009-05-09 12:14 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:48 ` David Miller
2009-05-09 12:13 ` Eric Dumazet [this message]
2009-05-09 20:34 ` David Miller
2009-05-09 20:40 ` David Miller
2009-05-10 7:09 ` Eric Dumazet
2009-05-10 7:43 ` Eric Dumazet
2009-05-10 10:45 ` Eric Dumazet
2009-05-19 4:58 ` David Miller
2009-05-21 9:07 ` Eric Dumazet
2009-05-09 20:36 ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2009-05-09 12:27 ` Krzysztof Halasa
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A057387.4080308@cosmosbay.com \
--to=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=khc@pm.waw.pl \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).