public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC] Dynamic percpu data allocator
@ 2002-05-23 13:08 Dipankar Sarma
  2002-05-24  4:37 ` BALBIR SINGH
  0 siblings, 1 reply; 15+ messages in thread
From: Dipankar Sarma @ 2002-05-23 13:08 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, Paul McKenney, lse-tech

[-- Attachment #1: Type: text/plain, Size: 1185 bytes --]

If a static percpu area is around, can a dynamic percpu data allocator
be far behind ;-)

As part of the scalable kernel primitives work for higher-end SMP
and NUMA architectures, we have been seeing an increasing need
for per-cpu data in various key areas. Rusty's percpu area
work has added a way in the 2.5 kernels to maintain static per-cpu
data. Inspired by that work, I have implemented a dynamic per-cpu
data allocator. Currently it is useful to us for -

1. Per-cpu data in dynamically allocated structures.
2. Per-cpu statistics and reference counters.
3. Per-cpu data in drivers/modules.
4. Scalable locking primitives like local-spin-only locks
   (or even big reader locks).

Included in this mail is a document that describes the allocator.
I would really appreciate it if people commented on it. I am
particularly interested in the eek-value of the interfaces,
especially the bit about keeping the type information in
a dummy variable in a union.

The actual patch will follow soon, unless someone convinces
me quickly that there is a saner way to do this.

Thanks
-- 
Dipankar Sarma  <dipankar@in.ibm.com> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

[-- Attachment #2: percpu_data.txt --]
[-- Type: text/plain, Size: 4302 bytes --]

                        Per-CPU Data Allocator
                        ----------------------

Interfaces
----------

The interfaces for the per-cpu data allocator are similar to Rusty's static
per-CPU data interfaces. One clear goal was to make sure that they
add no overhead on UP kernels, so on UP kernels they reduce to
ordinary variables (a rough sketch follows the list below). The basic
interfaces are these -

1. percpu_data_declare(type,var)
2. percpu_data_alloc(var)
3. percpu_data(var,cpu) 
4. this_percpu_data(var)
5. percpu_data_free(var)
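
As a rough illustration of the no-overhead-on-UP goal, on UP kernels
these could reduce to something like the following (this is an assumed
sketch of the obvious mapping, not necessarily what the actual patch
does) -

	#ifndef CONFIG_SMP
	/* Assumed UP mapping: plain variables, no allocation, no indirection */
	#define percpu_data_declare(type, var)	typeof(type) var
	#define percpu_data_alloc(var)		(0)	/* nothing to do, cannot fail */
	#define percpu_data(var, cpu)		(var)	/* single CPU, 'cpu' ignored */
	#define this_percpu_data(var)		(var)
	#define percpu_data_free(var)		do { } while (0)
	#endif /* !CONFIG_SMP */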

For example, we can declare the following structure -

	struct yy {
		int a;
		percpu_data_declare(int, b);
		int c;
		int d;
	};

We can allocate memory for the per-cpu int like this -

	struct yy y;

	if (percpu_data_alloc(y.b)) {
		/* Failed */
	}

To use it -
	
	cpu = smp_processor_id();
	percpu_data(y.b, cpu)++;

	or

	this_percpu_data(y.b)++;

To free the per-CPU data -

	percpu_data_free(y.b);

The data declaration interface is a bit unnatural, but I can't think of
anything better that would let me preserve the type information for the
original variable so that appropriate typecasting can be done for other
interfaces. percpu_data_declare(type,var) expands to -

	union {
		percpu_data_t *percpu;
		typeof(type) realtype;
	} var

percpu_data_t maintains the pointers necessary to look up the real
percpu data. 

	typedef struct {
		void *blkaddrs[NR_CPUS];
		struct percpu_data_blk *blkp;
	} percpu_data_t;

The type information is used (via typeof()) to
cast data accesses. 

	#define percpu_data(var,cpu) \
			(*((typeof(var.realtype) *)var.percpu->blkaddrs[cpu]))

Using a pointer to percpu_data_t adds
the overhead of an additional memory reference when accessing
percpu data. This can be avoided by embedding the percpu_data_t
structure, but since percpu_data_t contains an NR_CPUS-sized array,
embedding it changes structure sizes quite radically. It is a tradeoff;
we could go either way.
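
For comparison, a rough sketch of what the embedded variant could look
like (purely illustrative; the macro names here are mine, and this is
not what the current code does) -

	/* Embedded alternative: no extra dereference on access, but every
	 * enclosing structure grows by the size of the NR_CPUS array. */
	#define percpu_data_declare_embedded(type, var) \
		union { \
			percpu_data_t percpu;	/* embedded, not a pointer */ \
			typeof(type) realtype; \
		} var

	#define percpu_data_embedded(var, cpu) \
		(*((typeof(var.realtype) *)(var).percpu.blkaddrs[cpu]))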



Potential Uses
--------------

1. Scalable counters - they already use a scaled-down version of the
allocator internally. Per-cpu counters can reduce the overhead of cacheline
bouncing (a sketch follows this list).

2. Big reader lock - this need not be statically allocated anymore.

3. Per-CPU data in modules - Rusty's static per-cpu scheme doesn't
work in modules, at least I haven't seen a way to do this.

4. Scalable locks - per-cpu data is commonly used in scalable locks
like MCS locks.
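
As an illustration of use 1 above, a per-cpu reference counter built on
these interfaces might look roughly like this (an assumed sketch; the
pcpu_refcnt names are mine, not from the patch) -

	struct pcpu_refcnt {
		percpu_data_declare(long, count);
	};

	/* Caller must have done percpu_data_alloc(r->count) beforehand. */
	static inline void pcpu_refcnt_inc(struct pcpu_refcnt *r)
	{
		this_percpu_data(r->count)++;	/* touches only the local copy */
	}

	static inline long pcpu_refcnt_read(struct pcpu_refcnt *r)
	{
		long sum = 0;
		int cpu;

		/* Sum all copies; assumes all NR_CPUS copies were allocated. */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			sum += percpu_data(r->count, cpu);
		return sum;
	}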



Allocator
---------

The current approach is that unless there is interest in a dynamic
percpu data allocator, there is no point in spending too much time
writing a sophisticated one. 

Allocation Policy
-----------------

1. If the allocation request size is a factor of SMP_CACHE_BYTES,
then it will be interleaved to avoid fragmentation as much as possible.
If the request size is a multiple of SMP_CACHE_BYTES, fragmentation
will still be avoided. Anything else will result in fragmentation.
The current allocator doesn't make any attempt to use the fragmented
portion; in a sense it is like padding to the cache line boundary.

2. A simple binary search tree is used to maintain the memory
objects (possibly blocks of them) of different sizes. For 
interleaving, the objects are maintained in blocks, and a freelist 
mechanism similar to the slab allocator's is used to allocate objects 
from within a block. Each block is allocated from a kmem_cache, as
sketched below.
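
To make the policy concrete: with SMP_CACHE_BYTES of, say, 64, an
8-byte request (a factor) packs eight objects into each cache line with
no waste, a 128-byte request (a multiple) simply spans two lines, while
a 24-byte request uses only 48 bytes of every line and wastes the
remaining 16. A rough sketch of the per-size block descriptor implied
by the description above (field names are mine; the actual patch may
differ) -

	struct percpu_data_blk {
		unsigned int		objsize;	/* rounded request size */
		unsigned int		objs_per_blk;	/* objects in each per-CPU block */
		void			*blkaddr[NR_CPUS]; /* this block's per-CPU copies */
		void			*freelist;	/* slab-style free object list */
		kmem_cache_t		*cachep;	/* kmem_cache backing the blocks */
		struct percpu_data_blk	*left, *right;	/* search tree, keyed on size */
	};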

If there is sufficient interest in the per-cpu data allocator,
then I will revisit the allocator and see if fragmentation can
be reduced for non-multiples/non-factors of SMP_CACHE_BYTES.

For non-factor allocations, the residual part of the cache line
can be maintained and a best-factor-fit algorithm can be used
to allocate from it. This assumes that kernel allocation
requests are likely to contain repetitive patterns of similar sizes.

Alignment Issues
----------------

The current alignment strategy is this -

1. The minimum allocation size is sizeof(int).
2. Each block is aligned to a cache line boundary, and the size of any
   object allocated within a block is either a factor of the block
   size or equal to the block size. However, I am not sure whether this
   guarantees proper alignment on all architectures; we need to
   investigate this some more.
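
For example, with a 64-byte, cache-aligned block, 8-byte objects land at
offsets 0, 8, ..., 56, so every per-CPU copy is naturally 8-byte aligned.
In general the factor rule places each object at an offset that is a
multiple of its own size, which guarantees alignment up to the object
size but not beyond it; an architecture demanding stricter alignment
than the object size is the problem case hinted at above.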

I am sure there is a 69-bit transputer architecture somewhere that
breaks this allocator ;-)

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-05-30 13:56 Mala Anand
  2002-05-30 17:55 ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-05-30 13:56 UTC (permalink / raw)
  To: dipankar
  Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
	Paul McKenney, Rusty Russell

                                                                                                                                               
On 05/24/2002 01:13 AM, dipankar@beaverton.ibm.com wrote (sent by
lse-tech-admin@lists.sourceforge.net; To: BALBIR SINGH <balbir.singh@wipro.com>;
cc: linux-kernel@vger.kernel.org, Rusty Russell <rusty@rustcorp.com.au>,
Paul McKenney, lse-tech@lists.sourceforge.net; Subject: [Lse-tech] Re: [RFC]
Dynamic percpu data allocator):

>On Fri, May 24, 2002 at 10:07:59AM +0530, BALBIR SINGH wrote:
>> Hello, Dipankar,
>>
>> I would prefer to use the existing slab allocator for this.
>> I am not sure if I understand your requirements for the per-cpu
>> allocator correctly, please correct me if I do not.
>>
>> What I would like to see
>>
>> 1. Have per-cpu slabs instead of per-cpu cpucache_t. One should
>>    be able to tell for which caches we want per-cpu slabs. This
>>    way we can make even kmalloc per-cpu, since most kernel code
>>    would use and dispose of memory before migrating across cpus.
>>    I think this would be useful, but again I have no data to back it up.

>Allocating cpu-local memory is a different issue altogether.
>Eventually for NUMA support, we will have to do such allocations
>that supports choosing memory closest to a group of CPUs.

>The per-cpu data allocator allocates one copy for *each* CPU.
>It uses the slab allocator underneath. Eventually, when/if we have
>per-cpu/numa-node slab allocation, the per-cpu data allocator
>can allocate every CPU's copy from memory closest to it.

Does this mean that memory allocation will happen on "each" CPU?
Does the slab allocator allocate the memory on each cpu? Your per-cpu
data allocator sounds like the hot-list skbs in the tcpip stack,
in the sense that it is one level above the slab allocator and the list
is kept per cpu.  If the slab allocator is fixed for per-cpu, do you still
need this per-cpu data allocator?

_____________________________________________
Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088


^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-03 19:12 Mala Anand
  2002-06-03 19:48 ` Dipankar Sarma
  0 siblings, 1 reply; 15+ messages in thread
From: Mala Anand @ 2002-06-03 19:12 UTC (permalink / raw)
  To: dipankar
  Cc: BALBIR SINGH, linux-kernel, lse-tech, lse-tech-admin,
	Paul McKenney, Rusty Russell

On Thu, May 30, 2002 at 08:56:36AM -0500, Mala Anand wrote:
>> >The per-cpu data allocator allocates one copy for *each* CPU.
>> >It uses the slab allocator underneath. Eventually, when/if we have
>> >per-cpu/numa-node slab allocation, the per-cpu data allocator
>> >can allocate every CPU's copy from memory closest to it.
>>
>> Does this mean that memory allocation will happen on "each" CPU?
>> Does the slab allocator allocate the memory on each cpu? Your per-cpu
>> data allocator sounds like the hot-list skbs in the tcpip stack,
>> in the sense that it is one level above the slab allocator and the list
>> is kept per cpu.  If the slab allocator is fixed for per-cpu, do you still
>> need this per-cpu data allocator?

>Actually I don't know for sure what plans are afoot to fix the slab
>allocator for per-cpu. One plan I heard about was allocating from per-cpu
>pools rather than per-cpu copies. My requirements are similar to
>the hot-list skbs. I want to do this -

I looked at the slab code; per-cpu slab support is already implemented
by Manfred Spraul.
Look at cpu_data[NR_CPUS] in the kmem_cache_s structure.
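
For reference, the relevant 2.4 structures look roughly like this
(quoted from memory of mm/slab.c; the exact field names may differ
slightly):

	/* Rough sketch of the existing 2.4 per-cpu slab support: each cache
	 * keeps a small per-cpu array of hot objects in front of the shared
	 * slab lists - the cpu_data/cpudata array mentioned above. */
	typedef struct cpucache_s {
		unsigned int avail;	/* objects currently cached on this cpu */
		unsigned int limit;	/* max objects to keep per cpu */
	} cpucache_t;

	struct kmem_cache_s {
		/* ... */
	#ifdef CONFIG_SMP
		cpucache_t	*cpudata[NR_CPUS];
	#endif
		/* ... */
	};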


Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088




^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 12:05 Mala Anand
  0 siblings, 0 replies; 15+ messages in thread
From: Mala Anand @ 2002-06-04 12:05 UTC (permalink / raw)
  To: dipankar; +Cc: BALBIR SINGH, linux-kernel, Paul McKenney,
	'Rusty Russell'




>> For some time now, I have been thinking of implementing/supporting
>> PMEs (Performance Monitoring Events and Counters), so that we can
>> get real values (at least on x86) as compared to our guesses about
>> cacheline bouncing, etc. Do you know if somebody is already doing
>> this?

>You can use SGI kernprof to measure PMCs. See the SGI oss
>website for details. You can count the L2_LINES_IN event to
>get a measure of cache line bouncing.

I have profiled L2_LINES_OUT on a netperf tcp_stream workload.
The following is the profile from a 100Mb ethernet tcp_stream
4-adapter test on a baseline 2.4.17 kernel:

poll_idle [c0105280]: 121743
csum_partial_copy_generic [c0277f60]: 27951
schedule [c0114190]: 24853
do_softirq [c011b9e0]: 9130
mod_timer [c011eb10]: 6997
tcp_v4_rcv [c0258b70]: 6449
speedo_interrupt [c01a83e0]: 6262
__wake_up [c0114720]: 6143
tcp_recvmsg [c0249930]: 5199
USER [c0124a70]: 5081
speedo_start_xmit [c01a7fc0]: 4349
tcp_rcv_established [c0250c90]: 3724
tcp_data_wait [c02496d0]: 3610
speedo_rx [c01a8900]: 3358
handle_IRQ_event [c0108a60]: 3339
__kfree_skb [c0230520]: 2716
net_rx_action [c0233f10]: 2510
mcount [c02784e0]: 2261
ip_route_input [c023ee40]: 2028
ip_rcv [c0241180]: 1912
ip_queue_xmit [c0243950]: 1886
tcp_transmit_skb [c0252400]: 1877
__switch_to [c01059d0]: 1522
tcp_prequeue_process [c0249860]: 1461
skb_copy_and_csum_datagram_iovec [c0232590]: 1393
netif_rx [c0233ba0]: 1360
sock_recvmsg [c022cfc0]: 1358
eth_type_trans [c023a260]: 1350
tcp_event_data_recv [c024c320]: 1344
ip_output [c0243810]: 1333
speedo_tx_buffer_gc [c01a81d0]: 1239
tcp_v4_do_rcv [c0258a50]: 1169
kmalloc [c012dda0]: 1154
tcp_copy_to_iovec [c0250ae0]: 1138
fput [c0136d30]: 1124
sys_recvfrom [c022df40]: 1105
speedo_refill_rx_buf [c01a86d0]: 1045
kfree [c012df90]: 1015
dev_queue_xmit [c0233860]: 952
alloc_skb [c02301e0]: 936
do_gettimeofday [c010c520]: 919
system_call [c01070d8]: 918
fget [c0136e30]: 901
sys_socketcall [c022e610]: 887
skb_release_data [c0230420]: 882
ip_local_deliver [c0241020]: 808
skb_copy_and_csum_datagram [c02322a0]: 805
csum_partial [c0277e78]: 781
cleanup_rbuf [c02495d0]: 736
kfree_skbmem [c02304b0]: 696
sock_wfree [c022f380]: 600
inet_recvmsg [c0264720]: 596
speedo_refill_rx_buffers [c01a88b0]: 568
qdisc_restart [c023a4e0]: 568
__generic_copy_from_user [c02781d0]: 501
do_check_pgt_cache [c0112de0]: 494
sys_recv [c022e020]: 488
remove_wait_queue [c0115930]: 487
check_pgt_cache [c0124b30]: 483
add_wait_queue [c01158b0]: 430
__generic_copy_to_user [c0278180]: 429
tcp_send_delayed_ack [c0254e30]: 405
tcp_v4_checksum_init [c0258930]: 391
cpu_idle [c01052b0]: 386
sockfd_lookup [c022cd70]: 370
pfifo_fast_enqueue [c023a940]: 350
schedule_timeout [c0114060]: 314
pfifo_fast_dequeue [c023a9c0]: 304


To eliminate the cache-line bouncing, I applied IRQ and process
affinity. The L2_LINES_OUT profile with affinity:

poll_idle [c0105280]: 72241
csum_partial_copy_generic [c0289500]: 13838
schedule [c0114190]: 9036
speedo_interrupt [c01b9980]: 5066
do_softirq [c011b9e0]: 3922
USER [c0124c80]: 2573
tcp_recvmsg [c025aed0]: 2154
__wake_up [c0114720]: 1779
speedo_start_xmit [c01b9560]: 1654
mod_timer [c011eb10]: 1551
tcp_rcv_established [c0262230]: 1336
mcount [c0289a80]: 1298
tcp_transmit_skb [c02639a0]: 984
__switch_to [c01059d0]: 927
do_gettimeofday [c010c520]: 876
sys_socketcall [c023fbb0]: 872
ip_rcv [c0252720]: 872
ip_route_input [c02503e0]: 868
ip_queue_xmit [c0254ef0]: 805
system_call [c01070d8]: 772
tcp_data_wait [c025ac70]: 748
__kfree_skb [c0241ac0]: 640
tcp_v4_rcv [c026a110]: 629
do_check_pgt_cache [c0112de0]: 584
ip_output [c0254db0]: 575
net_rx_action [c02454b0]: 565
kfree [c012e1a0]: 556
fput [c0136f40]: 524
csum_partial [c0289418]: 514
handle_IRQ_event [c0108a60]: 507
sock_recvmsg [c023e560]: 479
cleanup_rbuf [c025ab70]: 430
skb_copy_and_csum_datagram [c0243840]: 428
dev_queue_xmit [c0244e00]: 409
sys_recvfrom [c023f4e0]: 404
kfree_skbmem [c0241a50]: 391
speedo_tx_buffer_gc [c01b9770]: 388
skb_copy_and_csum_datagram_iovec [c0243b30]: 383
kmalloc [c012dfb0]: 379
ip_local_deliver [c02525c0]: 362
netif_rx [c0245140]: 361
tcp_copy_to_iovec [c0262080]: 356
tcp_event_data_recv [c025d8c0]: 334
tcp_prequeue_process [c025ae00]: 319
fget [c0137040]: 300
tcp_v4_do_rcv [c0269ff0]: 285
add_wait_queue [c01158b0]: 267
skb_release_data [c02419c0]: 249
alloc_skb [c0241780]: 249
schedule_timeout [c0114060]: 238
remove_wait_queue [c0115930]: 233
sockfd_lookup [c023e310]: 232
__generic_copy_from_user [c0289770]: 224
sys_recv [c023f5c0]: 216


Regards,
    Mala


   Mala Anand
   E-mail:manand@us.ibm.com
   Linux Technology Center - Performance
   Phone:838-8088; Tie-line:678-8088





^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Lse-tech] Re: [RFC] Dynamic percpu data allocator
@ 2002-06-04 21:11 Paul McKenney
  0 siblings, 0 replies; 15+ messages in thread
From: Paul McKenney @ 2002-06-04 21:11 UTC (permalink / raw)
  To: Mala Anand; +Cc: BALBIR SINGH, dipankar, linux-kernel, 'Rusty Russell'


For whatever it is worth, here are the functions ranked in
decreasing order of savings due to the IRQ and process
affinity.  Column 2 is profile ticks with affinity, column 3
is without, and column 4 is the difference.

                              Thanx, Paul

                       poll_idle   72241 121743 49502
                        schedule    9036  24853 15817
       csum_partial_copy_generic   13838  27951 14113
                      tcp_v4_rcv     629   6449  5820
                       mod_timer    1551   6997  5446
                      do_softirq    3922   9130  5208
                       __wake_up    1779   6143  4364
                       speedo_rx       0   3358  3358
                     tcp_recvmsg    2154   5199  3045
                   tcp_data_wait     748   3610  2862
                handle_IRQ_event     507   3339  2832
               speedo_start_xmit    1654   4349  2695
                            USER    2573   5081  2508
             tcp_rcv_established    1336   3724  2388
                     __kfree_skb     640   2716  2076
                   net_rx_action     565   2510  1945
                  eth_type_trans       0   1350  1350
                speedo_interrupt    5066   6262  1196
                  ip_route_input     868   2028  1160
            tcp_prequeue_process     319   1461  1142
                   ip_queue_xmit     805   1886  1081
            speedo_refill_rx_buf       0   1045  1045
                          ip_rcv     872   1912  1040
             tcp_event_data_recv     334   1344  1010
skb_copy_and_csum_datagram_iovec     383   1393  1010
                        netif_rx     361   1360   999
                          mcount    1298   2261   963
                tcp_transmit_skb     984   1877   893
                   tcp_v4_do_rcv     285   1169   884
                    sock_recvmsg     479   1358   879
             speedo_tx_buffer_gc     388   1239   851
               tcp_copy_to_iovec     356   1138   782
                         kmalloc     379   1154   775
                       ip_output     575   1333   758
                    sys_recvfrom     404   1105   701
                       alloc_skb     249    936   687
                skb_release_data     249    882   633
                            fget     300    901   601
                            fput     524   1124   600
                      sock_wfree       0    600   600
                    inet_recvmsg       0    596   596
                     __switch_to     927   1522   595
                   qdisc_restart       0    568   568
        speedo_refill_rx_buffers       0    568   568
                  dev_queue_xmit     409    952   543
                 check_pgt_cache       0    483   483
                           kfree     556   1015   459
                ip_local_deliver     362    808   446
          __generic_copy_to_user       0    429   429
            tcp_send_delayed_ack       0    405   405
            tcp_v4_checksum_init       0    391   391
                        cpu_idle       0    386   386
      skb_copy_and_csum_datagram     428    805   377
              pfifo_fast_enqueue       0    350   350
                    cleanup_rbuf     430    736   306
                    kfree_skbmem     391    696   305
              pfifo_fast_dequeue       0    304   304
        __generic_copy_from_user     224    501   277
                        sys_recv     216    488   272
                    csum_partial     514    781   267
               remove_wait_queue     233    487   254
                  add_wait_queue     267    430   163
                     system_call     772    918   146
                   sockfd_lookup     232    370   138
                schedule_timeout     238    314    76
                 do_gettimeofday     876    919    43
                  sys_socketcall     872    887    15
              do_check_pgt_cache     584    494   -90

>>> For some time now, I have been thinking of implementing/supporting
>>> PMEs (Performance Monitoring Events and Counters), so that we can
>>> get real values (at least on x86) as compared to our guesses about
>>> cacheline bouncing, etc. Do you know if somebody is already doing
>>> this?
>
>>You can use SGI kernprof to measure PMCs. See the SGI oss
>>website for details. You can count the L2_LINES_IN event to
>>get a measure of cache line bouncing.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2002-06-04 21:11 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-05-23 13:08 [RFC] Dynamic percpu data allocator Dipankar Sarma
2002-05-24  4:37 ` BALBIR SINGH
2002-05-24  6:13   ` Dipankar Sarma
2002-05-24  8:38     ` [Lse-tech] " BALBIR SINGH
2002-05-24  9:13       ` Dipankar Sarma
2002-05-24 11:59         ` BALBIR SINGH
2002-05-24 14:38       ` Martin J. Bligh
  -- strict thread matches above, loose matches on Subject: below --
2002-05-30 13:56 Mala Anand
2002-05-30 17:55 ` Dipankar Sarma
2002-05-31  7:57   ` BALBIR SINGH
2002-05-31  8:40     ` Dipankar Sarma
2002-06-03 19:12 Mala Anand
2002-06-03 19:48 ` Dipankar Sarma
2002-06-04 12:05 Mala Anand
2002-06-04 21:11 Paul McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox