From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sat, 09 May 2009 14:13:59 +0200
Message-ID: <4A057387.4080308@cosmosbay.com>
In-Reply-To: <20090508.144859.152310605.davem@davemloft.net>
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
>
>> For example, we can avoid the dst_release() cache miss if this
>> is done in start_xmit(), and not later in TX completion while freeing skb.
>> I tried various patches in the past, but unfortunately it seems the
>> only safe way to do this is in the driver xmit itself, not in the core
>> network stack. This would need many patches, one for each driver.
>
> There might be a way around having to hit every driver.
>
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct netdevice. If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().
Yes, this is a possibility. I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure).
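The flag-bit idea can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `model_*` types, the `MODEL_DEV_KEEP_DST` flag name, and `model_hard_start_xmit()` are hypothetical stand-ins, not kernel identifiers, and the real `dev_hard_start_xmit()` does much more than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag: set by devices that still need the route after xmit
 * (loopback would be one such device, per the discussion above). */
#define MODEL_DEV_KEEP_DST 0x1u

struct model_dst { int refcnt; };            /* stands in for a dst entry */
struct model_dev { unsigned int flags; };    /* stands in for the net device */
struct model_skb { struct model_dst *dst; }; /* stands in for an skb */

/* Drop one reference on the route, if any. */
static void model_dst_release(struct model_dst *dst)
{
    if (dst)
        dst->refcnt--;
}

/* Model of the core transmit path: release the route early unless the
 * device has asked to keep it. */
static void model_hard_start_xmit(struct model_skb *skb,
                                  const struct model_dev *dev)
{
    assert(skb != NULL && dev != NULL);

    if (!(dev->flags & MODEL_DEV_KEEP_DST)) {
        model_dst_release(skb->dst);
        skb->dst = NULL; /* nothing left to release at TX completion */
    }
    /* A real implementation would now hand the skb to the driver. */
}
```

With this shape, only the devices that need the route have to be touched (to set the flag); every other driver gets the early release from the core path.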
>
>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> The current sk_wmem_alloc scheme takes a sk_refcnt reference for each
>> packet in flight. This hurts some workloads at TX completion time,
>> because sock_wfree() has to touch at least three cache lines
>> (one for sk_wmem_alloc, one for testing sk_flags, one
>> to decrement sk_refcnt).
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> changes from or to zero (i.e. one reference count for any number
>> of in-flight packets).
>>
>> Not every atomic_add() has to be changed to atomic_add_return(): a plain
>> atomic_add() is enough where we know sk_wmem_alloc is already nonzero.
>>
>> This patch reduces by one the number of cache lines dirtied in
>> sock_wfree() and the number of atomic operations in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> I like this idea. Let me know when you have some at least
> basic performance numbers and wish to submit this formally.
Sure, but right now I am focusing on the opposite situation
(many TCP flows, each with little in/out traffic), where this patch
has no impact, since there are only (0 <--> !0) transitions.
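The counting rule described in the patch text (one socket reference for any number of in-flight packets, taken or dropped only on the 0 <--> !0 transitions of sk_wmem_alloc) can be sketched as a user-space model with C11 atomics. The `model_*` names are hypothetical and this is not the patch itself; in particular, the sketch only shows the counting and does not attempt to handle the races a real kernel implementation has to deal with.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct sock; only the two counters matter. */
struct model_sock {
    atomic_int refcnt;     /* models sk_refcnt */
    atomic_int wmem_alloc; /* models sk_wmem_alloc */
};

/* Charge 'size' bytes when a packet is queued for transmit.
 * A reference is taken only on the 0 -> nonzero transition. */
static void model_set_owner_w(struct model_sock *sk, int size)
{
    assert(size > 0);
    /* atomic_fetch_add() returns the value held before the addition. */
    if (atomic_fetch_add(&sk->wmem_alloc, size) == 0)
        atomic_fetch_add(&sk->refcnt, 1);
}

/* Uncharge 'size' bytes at TX completion.
 * The reference is dropped only on the nonzero -> 0 transition. */
static void model_wfree(struct model_sock *sk, int size)
{
    int old = atomic_fetch_sub(&sk->wmem_alloc, size);

    assert(old >= size);
    if (old == size)
        atomic_fetch_sub(&sk->refcnt, 1);
}
```

In a workload where each flow has at most one packet in flight, every call hits a transition, so both counters are still modified each time; that is why the scheme brings no gain in the situation described above.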
BTW, oprofile for this kind of workload gives a surprising result:
the timer code is *very* expensive.
The CPU doing the NAPI work has this profile:
samples  cum. samples  %       cum. %   symbol name
88688    88688         9.7805  9.7805   lock_timer_base
72692    161380        8.0165  17.7970  bnx2_poll_work
66958    228338        7.3842  25.1812  mod_timer
47980    276318        5.2913  30.4724  __wake_up
43312    319630        4.7765  35.2489  task_rq_lock
43193    362823        4.7633  40.0122  __slab_alloc
36388    399211        4.0129  44.0251  __alloc_skb
30285    429496        3.3398  47.3650  skb_release_data
29236    458732        3.2242  50.5891  ip_rcv
29219    487951        3.2223  53.8114  resched_task
29094    517045        3.2085  57.0199  __inet_lookup_established
28695    545740        3.1645  60.1844  tcp_v4_rcv
27479    573219        3.0304  63.2148  sock_wfree
26722    599941        2.9469  66.1617  ip_route_input
21401    621342        2.3601  68.5218  select_task_rq_fair
19390    640732        2.1383  70.6601  __kfree_skb
17763    658495        1.9589  72.6190  sched_clock_cpu
17565    676060        1.9371  74.5561  try_to_wake_up
17366    693426        1.9151  76.4712  __enqueue_entity
16174    709600        1.7837  78.2549  update_curr
14323    723923        1.5795  79.8345  __kmalloc_track_caller
14003    737926        1.5443  81.3787  enqueue_task_fair
12456    750382        1.3737  82.7524  __tcp_prequeue
12212    762594        1.3467  84.0991  __wake_up_common
11437    774031        1.2613  85.3604  kmem_cache_alloc
10927    784958        1.2050  86.5654  place_entity
10535    795493        1.1618  87.7272  netif_receive_skb
9971     805464        1.0996  88.8268  ipt_do_table
8551     814015        0.9430  89.7698  internal_add_timer