public inbox for linux-kernel@vger.kernel.org
* Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters
@ 2001-12-06 16:10 Niels Christiansen
  2001-12-07  8:54 ` Dipankar Sarma
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Niels Christiansen @ 2001-12-06 16:10 UTC (permalink / raw)
  To: kiran; +Cc: lse-tech, linux-kernel


Hi Kiran,

> Are you concerned with increase in memory used per counter here?  I
> suppose that must not be that much of an issue for a 16 processor box....

Nope, I'm concerned that if this mechanism is to be used for all counters,
the improvement in cache coherence might diminish to the point where the
additional overhead isn't worth it.

Arjan van de Ven voiced similar concerns, but he also said:

> There's several things where per cpu data is useful; low frequency
> statistics is not one of them in my opinion.

...which may be true for 4-ways and even 8-ways, but when you get to
32-ways and greater, you start seeing cache problems.  That was the
case on AIX, and per-cpu counters were one of the changes that helped
get the spectacular scalability on Regatta.

Anyway, since we just had a long thread going on NUMA topology, maybe
it would be proper to investigate whether there is a better way, such as
using the topology to decide where to put counters?  I think so, seeing
as most Intel-based 8-ways and above will have at least some NUMA in
them.

> Well, I wrote a simple kernel module which just increments a shared
> global counter a million times per processor in parallel, and compared
> it with the statctr which would be incremented a million times per
> processor in parallel..

I suspected that.  Would it be possible to do the test on the real
counters?

Niels Christiansen
IBM LTC, Kernel Performance


^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters
@ 2001-12-09 10:57 Manfred Spraul
  2001-12-10 16:32 ` Jack Steiner
  0 siblings, 1 reply; 24+ messages in thread
From: Manfred Spraul @ 2001-12-09 10:57 UTC (permalink / raw)
  To: Jack Steiner; +Cc: linux-kernel, lse-tech

>Assuming the slab allocator manages by node, kmem_cache_alloc_node() & 
>kmem_cache_alloc_cpu() would be identical (except for spelling :-). 
>Each would pick up the nodeid from the cpu_data struct, then allocate 
>from the slab cache for that node.
>

kmem_cache_alloc is simple - the complex operation is kmem_cache_free.

The current implementation
- assumes that virt_to_page() and reading one cacheline from the page 
structure is fast. Is that true for your setups?
- uses an array to batch several free calls together: if the array 
overflows, then up to 120 objects are freed in one call, to reduce 
cacheline thrashing.

If virt_to_page is fast, then a NUMA allocator would be a simple 
extension of the current implementation:

* one slab chain for each node, one spinlock for each node.
* 2 per-cpu arrays for each cpu: one for "correct node" kmem_cache_free 
calls, one for "foreign node" kmem_cache_free calls.
* kmem_cache_alloc allocates from the "correct node" per-cpu array, 
fallback to the per-node slab chain, then fallback to __get_free_pages.
* kmem_cache_free checks to which node the freed object belongs and adds 
it to the appropriate per-cpu array. The array overflow function then 
sorts the objects into the correct slab chains.

If virt_to_page is slow we need a different design. Currently it's 
called in every kmem_cache_free/kfree call.

--
    Manfred


* Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters
@ 2001-12-08 17:43 Niels Christiansen
  2001-12-09 11:46 ` Anton Blanchard
  0 siblings, 1 reply; 24+ messages in thread
From: Niels Christiansen @ 2001-12-08 17:43 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: lse-tech, linux-kernel

Anton Blanchard wrote:

| > > There's several things where per cpu data is useful; low frequency
| > > statistics is not one of them in my opinion.
| >
| > ...which may be true for 4-ways and even 8-ways but when you get to
| > 32-ways and greater, you start seeing cache problems.  That was the
| > case on AIX, and per-cpu counters were one of the changes that helped
| > get the spectacular scalability on Regatta.
|
| I agree there are large areas of improvement to be done wrt cacheline
| ping ponging (see my patch in 2.4.17-pre6 for one example), but we
| should do our own benchmarking and not look at what AIX has been doing.

Oh, please!  You voiced an opinion.  I presented facts.  Nobody suggested
we should not measure on Linux.  As a matter of fact, I suggested that
Kiran run the tests on the real counters, and he said he would.

Niels


* Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters
@ 2001-12-07  9:52 Niels Christiansen
  2001-12-07 10:10 ` Dipankar Sarma
  0 siblings, 1 reply; 24+ messages in thread
From: Niels Christiansen @ 2001-12-07  9:52 UTC (permalink / raw)
  To: dipankar; +Cc: kiran, lse-tech, linux-kernel


Hello Dipankar,

| > Anyway, since we just had a long thread going on NUMA topology, maybe
| > it would be proper to investigate if there is a better way, such as
| > using the topology to decide where to put counters?  I think so, seeing
| > as it is that most Intel based 8-ways and above will have at least some
| > NUMA in them.
|
| It should be easy to place the counters in appropriately close
| memory if linux gets good NUMA APIs built on top of the topology
| services. If we extend kmem_cache_alloc() to allocate memory
| in a particular NUMA node, we could simply do this for placing the
| counters -
| ...
| This would put the block of counters corresponding to a CPU in
| memory local to the NUMA node. If there are more sophisticated
| APIs available for suitable memory selection, those too can be made
| use of here.
|
| Is this the kind of thing you are looking at ?

I'm no NUMA person so I can't verify your code snippet, but if it does
what you say, yes, that is exactly what I meant:  we may have to deal
with both cache coherence and placement of counters in local memory.

Niels


* Re: [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters
@ 2001-12-05 15:02 Niels Christiansen
  2001-12-06 12:33 ` Ravikiran G Thirumalai
  0 siblings, 1 reply; 24+ messages in thread
From: Niels Christiansen @ 2001-12-05 15:02 UTC (permalink / raw)
  To: kiran; +Cc: lse-tech, linux-kernel

Hello, Kiran,

> Statistics counters are used in many places in the Linux kernel,
> including storage, network I/O subsystems etc.  These counters are not
> atomic since accuracy is not so important.  Nevertheless, frequent
> updating of these counters results in cacheline bouncing among various
> cpus in a multiprocessor environment.  This patch introduces a new set
> of interfaces, which should improve performance of such counters in an
> MP environment.  This implementation switches to code that is devoid
> of SMP overheads if these interfaces are used with a UP kernel.
>
> Comments are welcome :)
>
> Regards,
> Kiran

I'm wondering about the scope of this.  My Ethernet adapter with, maybe,
20 counter fields would have 20 counters allocated for each of my 16
processors.  The only way to get the total would be to use statctr_read()
to merge them.  Same for the who-knows-how-many IP counters, and so on.

How many and which counters were converted for the test you refer to?

I do like the idea of a uniform access mechanism, though.  It is well in
line with my thoughts about an architected interface for topology and
instrumentation, so I'll definitely get back to you as I try to collect
requirements.

Niels



end of thread, other threads:[~2001-12-11 23:27 UTC | newest]

Thread overview: 24+ messages
2001-12-06 16:10 [Lse-tech] [RFC] [PATCH] Scalable Statistics Counters Niels Christiansen
2001-12-07  8:54 ` Dipankar Sarma
2001-12-08 22:24   ` Paul Jackson
2001-12-09  3:46     ` Jack Steiner
2001-12-09  4:44       ` Paul Jackson
2001-12-09 17:34         ` Jack Steiner
2001-12-11 23:27           ` Paul Jackson
2001-12-07 11:39 ` Ravikiran G Thirumalai
2001-12-08 13:46 ` Anton Blanchard
  -- strict thread matches above, loose matches on Subject: below --
2001-12-09 10:57 Manfred Spraul
2001-12-10 16:32 ` Jack Steiner
2001-12-10 17:00   ` Manfred Spraul
2001-12-08 17:43 Niels Christiansen
2001-12-09 11:46 ` Anton Blanchard
2001-12-07  9:52 Niels Christiansen
2001-12-07 10:10 ` Dipankar Sarma
2001-12-05 15:02 Niels Christiansen
2001-12-06 12:33 ` Ravikiran G Thirumalai
2001-12-06 13:07   ` Arjan van de Ven
2001-12-06 14:09     ` Ravikiran G Thirumalai
2001-12-06 14:10       ` Arjan van de Ven
2001-12-06 19:35         ` Dipankar Sarma
2001-12-07 21:09     ` Alex Bligh - linux-kernel
2001-12-07 21:16       ` Arjan van de Ven
