public inbox for linux-kernel@vger.kernel.org
* perfmon2 vector argument question
@ 2006-06-19 20:40 Stephane Eranian
  2006-06-29  3:17 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Stephane Eranian @ 2006-06-19 20:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-ia64, perfmon

Hello,

The current perfmon2 API allows applications to pass vectors of arguments to
certain calls, in particular to the three functions that read and write PMU
registers. This approach was chosen because it is flexible: applications can
modify a single register or multiple registers in one call. It is also
extensible because it embeds no implicit knowledge of the actual number of
registers supported by the underlying hardware.

Before the actual system call logic runs, the argument vector must be copied
into a kernel buffer. This is required for security reasons and to handle
user-space faults safely. The familiar copy_from_user() and copy_to_user()
are invoked, and this must be done before interrupts are masked.

Vectors can have different sizes depending on the measurement and the PMU
model. Yet the vector must be copied into a kernel-level buffer. Today, we
allocate the kernel memory on demand, based on the size of the vector, using
kmalloc/kfree. To prevent abuse, we limit the size of the allocated region
via a perfmon2 tunable in sysfs. By default, it is set to one page.
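
A minimal user-space sketch of this copy-in path (function and constant names
are made up for illustration; malloc() and memcpy() stand in for the kernel's
kmalloc() and copy_from_user(), which would run before interrupts are masked):

```c
#include <stdlib.h>
#include <string.h>
#include <errno.h>

#define PFM_ARG_MAX 4096  /* stands in for the sysfs tunable; defaults to one page */

/* Copy a user vector of 'count' elements into a freshly allocated
 * kernel buffer, enforcing the per-call size limit. */
static int copy_in_args(const void *uaddr, size_t count, size_t elem_sz,
                        void **kbuf)
{
	size_t sz = count * elem_sz;

	if (sz == 0 || sz > PFM_ARG_MAX)
		return -EINVAL;		/* reject abusive vector sizes */

	*kbuf = malloc(sz);		/* kmalloc(sz, GFP_KERNEL) in-kernel */
	if (!*kbuf)
		return -ENOMEM;

	memcpy(*kbuf, uaddr, sz);	/* copy_from_user() in-kernel */
	return 0;
}
```

The caller would later kfree() the buffer when the call terminates, which is
exactly the cost discussed below.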

This implementation has worked fairly well, yet it costs some performance
because kmalloc/kfree are expensive (especially kfree). It also seems like
overkill to allocate a full page for small vectors.

I have run some experiments lately, and they confirmed that kmalloc/kfree and
the user-space copies account for a very large portion of the cost of calls
with multiple registers (I tried with 4). The copies are hard to avoid. One
thing we could do is reduce the size of the structs. Today, both struct
pfarg_pmd and struct pfarg_pmc have fields reserved for future extensions so
that we can extend them without breaking the ABI. It may be possible to
shrink those a little.

There are several ways to amortize or eliminate the kmalloc/kfree. First of
all, it is important to understand that multiple threads may call into a 
particular context at any time. All they need is access to the file descriptor.

An alternative that I have explored starts from the hypothesis that most
vectors are small. If they are small enough, we can avoid the kmalloc/kfree
by using a buffer allocated on the stack: if the vector has at most 8
elements, use the stack buffer; otherwise, fall back to the expensive
kmalloc/kfree path. I tried this experiment, with 8 as the threshold, and got
over 20% improvement for pfm_read_pmds(). The downside of this approach is
that kernel stack space is limited and we should avoid placing large buffers
on it. struct pfarg_pmd is about 176 bytes whereas pfarg_pmc_t is about 48
bytes, so 8 pfarg_pmd elements already take 1408 bytes, and that holds on all
architectures, including i386, where the default kernel stack is 2 pages
(8KB). Of course, the stack buffer size could be tuned per object type and
per architecture. Another downside is that when kmalloc is needed anyway, the
stack space is still consumed.
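
The fast/slow path split could look roughly like this (a sketch with invented
names; malloc/free stand in for kmalloc/kfree, and a flag makes the chosen
path visible for testing). Note that the stack buffer is reserved even when
the slow path is taken, which is the downside mentioned above:

```c
#include <stdlib.h>
#include <string.h>

#define PFM_STK_ELEMS 8			/* threshold from the experiment */

static int took_slow_path;		/* for illustration/testing only */

static int process_pmd_args(const void *uarg, size_t count, size_t elem_sz)
{
	unsigned char stkbuf[PFM_STK_ELEMS * 176]; /* ~sizeof(struct pfarg_pmd) */
	void *buf = stkbuf;

	took_slow_path = 0;
	if (count > PFM_STK_ELEMS || count * elem_sz > sizeof(stkbuf)) {
		buf = malloc(count * elem_sz);	/* kmalloc() in-kernel */
		if (!buf)
			return -1;
		took_slow_path = 1;	/* stkbuf is still consumed here */
	}

	memcpy(buf, uarg, count * elem_sz);	/* copy_from_user() in-kernel */
	/* ... operate on buf under the context lock ... */

	if (buf != stkbuf)
		free(buf);			/* kfree() in-kernel */
	return 0;
}
```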

It is important to note that we cannot use a one-element kernel buffer and
simply loop over the vector, because copy_from_user()/copy_to_user() must be
done without holding locks or masking interrupts. One would have to copy,
lock, do the perfmon operation, unlock, copy back, and loop to the next
element.
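
To see why that is unattractive, here is a sketch of the rejected one-element
scheme (all names invented; stub lock counters make the per-element overhead
visible, and memcpy stands in for the user copies that may fault and so
cannot run under the lock):

```c
#include <string.h>

static int lock_roundtrips;		/* counts lock/unlock pairs */

static void ctx_lock(void)   { lock_roundtrips++; }
static void ctx_unlock(void) { }

/* Process a vector one element at a time: each element forces its own
 * copy-in, lock, PMU access, unlock, copy-out sequence. */
static int rw_one_by_one(const unsigned char *uvec, size_t nelem,
			 size_t elem_sz)		/* elem_sz <= 256 */
{
	unsigned char elem[256];
	size_t i;

	for (i = 0; i < nelem; i++) {
		memcpy(elem, uvec + i * elem_sz, elem_sz); /* copy_from_user */
		ctx_lock();		/* lock only around the PMU access */
		/* ... read or write one PMU register described by elem ... */
		ctx_unlock();
		/* ... copy_to_user() the result back, then loop ... */
	}
	return 0;
}
```

With N elements this costs N lock round trips and 2N small user copies,
instead of one of each.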

Another approach that was suggested to me is to allocate on demand but not
kfree systematically when the call terminates. In other words, we amortize
the cost of the allocation by keeping the buffer around for the next caller.
To make this work, we would have to decompose spin_lock_irq*() into
spin_*lock() and local_irq_*able() to avoid a race condition. For the first
caller, the buffer would be allocated to fit the vector size (up to a limit,
as today). When the call terminates, the buffer is kept via a pointer in the
perfmon context. The next caller would check the pointer and size: if the
buffer is big enough, the user copy could proceed directly; otherwise a new
buffer would be allocated. This works only if it is okay to do the user copy
with some locks held. One issue I can see with this approach is that a
malicious user could create lots of contexts and make one call on each to max
out the per-context argument vector limit. With 1024 descriptors and a limit
of 1 page per context, it could pin 1024 kernel pages (non-pageable) for
nothing. Today, we do not have a global tunable for the argument vector size
limit, and adding one would be costly because multiple threads could contend
for it, and we would therefore need yet another lock.
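
The buffer-caching idea could be sketched as follows (struct and function
names are assumptions, malloc/free stand in for kmalloc/kfree; in-kernel the
caller would hold the decomposed spin lock here, with interrupts still
enabled, so the user copy can fault safely):

```c
#include <stdlib.h>

struct pfm_ctx_cache {
	void  *argbuf;		/* survives across calls, freed with the context */
	size_t argbuf_sz;
};

/* Return a buffer of at least 'sz' bytes, reusing the cached one when
 * it is large enough; otherwise replace it with a bigger allocation. */
static void *pfm_get_argbuf(struct pfm_ctx_cache *ctx, size_t sz)
{
	if (ctx->argbuf && ctx->argbuf_sz >= sz)
		return ctx->argbuf;	/* fast path: no allocation at all */

	free(ctx->argbuf);		/* kfree() in-kernel */
	ctx->argbuf = malloc(sz);	/* kmalloc() in-kernel */
	ctx->argbuf_sz = ctx->argbuf ? sz : 0;
	return ctx->argbuf;
}
```

This is what makes the per-context memory pinning possible: every context
keeps its largest buffer alive until the context is destroyed.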

I do not see another approach at this point.

Does someone have something else to propose?

If not, what is your opinion of the two approaches above?

Thanks.

-- 
-Stephane

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: perfmon2 vector argument question
  2006-06-19 20:40 perfmon2 vector argument question Stephane Eranian
@ 2006-06-29  3:17 ` Andrew Morton
  2006-06-30 10:24   ` Stephane Eranian
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2006-06-29  3:17 UTC (permalink / raw)
  To: eranian; +Cc: linux-kernel, linux-ia64, perfmon

On Mon, 19 Jun 2006 13:40:12 -0700
Stephane Eranian <eranian@hpl.hp.com> wrote:

> [...]
> 
> I do not see another approach at this point.
> 
> Does someone have something else to propose?
> 
> If not, what is your opinion of the two approaches above?
> 

The first approach should be fine - we do that in lots of places, such as
in core_sys_select().

Applications must be calling this thing at a heck of a rate for kfree()
overhead to matter.  I trust CONFIG_DEBUG_SLAB wasn't turned on...


* Re: perfmon2 vector argument question
  2006-06-29  3:17 ` Andrew Morton
@ 2006-06-30 10:24   ` Stephane Eranian
  0 siblings, 0 replies; 3+ messages in thread
From: Stephane Eranian @ 2006-06-30 10:24 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, linux-ia64, perfmon

Andrew,

On Wed, Jun 28, 2006 at 08:17:08PM -0700, Andrew Morton wrote:
> > 
> > Does someone have something else to propose?
> > 
> > If not, what is your opinion of the two approaches above?
> > 
> 
> The first approach should be fine - we do that in lots of places, such as
> in core_sys_select().
> 
Ok, that's good to know. I looked at the stack consumption on x86 and it
is comparable to what you do for core_sys_select().

> Applications must be calling this thing at a heck of a rate for kfree()
> overhead to matter.  I trust CONFIG_DEBUG_SLAB wasn't turned on...

That was using a micro-benchmark to stress certain paths in perfmon.
CONFIG_DEBUG_SLAB was not turned on.

Thanks.

-- 
-Stephane

