All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Dumazet <dada1@cosmosbay.com>
To: David Miller <davem@davemloft.net>
Cc: khc@pm.waw.pl, netdev@vger.kernel.org
Subject: Re: [PATCH] net: reduce number of reference taken on sk_refcnt
Date: Sat, 09 May 2009 14:13:59 +0200	[thread overview]
Message-ID: <4A057387.4080308@cosmosbay.com> (raw)
In-Reply-To: <20090508.144859.152310605.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
> 
>> For example, we can avoid the dst_release() cache miss if this
>> is done in start_xmit(), and not later in TX completion while freeing skb.
>> I tried various patches in the past but unfortunatly it seems
>> only safe way to do this is in the driver xmit itself, not in core
>> network stack. This would need many patches, one for each driver.
> 
> There might be a way around having to hit every driver.
> 
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct netdevice.  If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().

Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure)..


> 
>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> Current sk_wmem_alloc schema uses a sk_refcnt taken for each packet
>> in flight. This hurts some workloads at TX completion time, because
>> sock_wfree() has three cache lines to touch at least.
>> (one for sk_wmem_alloc, one for testing sk_flags, one
>>  to decrement sk_refcnt)
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> is changed from or to ZERO value (ie one reference count for any number
>> of in-flight packets)
>>
>> Not all atomic_add() must be changed to atomic_add_return(), if we
>> know current sk_wmem_alloc is already not null.
>>
>> This patch reduces by one number of cache lines dirtied in sock_wfree()
>> and number of atomic operation in some workloads. 
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> I like this idea.  Let me know when you have some at least
> basic performance numbers and wish to submit this formally.

Sure, but I am focusing right now on the opposite situation
(many tcp flows but with small in/out trafic, where this patch
has no impact, since I have only (0 <--> !0) transitions.)

BTW, oprofile for this kind of workload gives a surprising result.
(timer stuff being *very* expensive)
CPU doing the NAPI stuff has this profile :

88688    88688          9.7805   9.7805    lock_timer_base
72692    161380         8.0165  17.7970    bnx2_poll_work
66958    228338         7.3842  25.1812    mod_timer
47980    276318         5.2913  30.4724    __wake_up
43312    319630         4.7765  35.2489    task_rq_lock
43193    362823         4.7633  40.0122    __slab_alloc
36388    399211         4.0129  44.0251    __alloc_skb
30285    429496         3.3398  47.3650    skb_release_data
29236    458732         3.2242  50.5891    ip_rcv
29219    487951         3.2223  53.8114    resched_task
29094    517045         3.2085  57.0199    __inet_lookup_established
28695    545740         3.1645  60.1844    tcp_v4_rcv
27479    573219         3.0304  63.2148    sock_wfree
26722    599941         2.9469  66.1617    ip_route_input
21401    621342         2.3601  68.5218    select_task_rq_fair
19390    640732         2.1383  70.6601    __kfree_skb
17763    658495         1.9589  72.6190    sched_clock_cpu
17565    676060         1.9371  74.5561    try_to_wake_up
17366    693426         1.9151  76.4712    __enqueue_entity
16174    709600         1.7837  78.2549    update_curr
14323    723923         1.5795  79.8345    __kmalloc_track_caller
14003    737926         1.5443  81.3787    enqueue_task_fair
12456    750382         1.3737  82.7524    __tcp_prequeue
12212    762594         1.3467  84.0991    __wake_up_common
11437    774031         1.2613  85.3604    kmem_cache_alloc
10927    784958         1.2050  86.5654    place_entity
10535    795493         1.1618  87.7272    netif_receive_skb
9971     805464         1.0996  88.8268    ipt_do_table
8551     814015         0.9430  89.7698    internal_add_timer


  reply	other threads:[~2009-05-09 12:14 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-08 12:32 NAPI and TX Krzysztof Halasa
2009-05-08 14:24 ` Ben Hutchings
2009-05-08 15:12 ` [PATCH] net: reduce number of reference taken on sk_refcnt Eric Dumazet
2009-05-08 21:48   ` David Miller
2009-05-09 12:13     ` Eric Dumazet [this message]
2009-05-09 20:34       ` David Miller
2009-05-09 20:40         ` David Miller
2009-05-10  7:09           ` Eric Dumazet
2009-05-10  7:43             ` Eric Dumazet
2009-05-10 10:45               ` Eric Dumazet
2009-05-19  4:58                 ` David Miller
2009-05-21  9:07                   ` Eric Dumazet
2009-05-09 20:36       ` David Miller
2009-05-08 21:44 ` NAPI and TX David Miller
2009-05-09 12:27   ` Krzysztof Halasa

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4A057387.4080308@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=davem@davemloft.net \
    --cc=khc@pm.waw.pl \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.