All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Arjan van de Ven <arjan@infradead.org>
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu, sandmann@redhat.com,
	tglx@tglx.de, hpa@zytor.com
Subject: Re: [PATCH] x86: add the debugfs interface for the sysprof tool
Date: Sat, 23 Feb 2008 00:11:30 -0800	[thread overview]
Message-ID: <20080223001130.d8922136.akpm@linux-foundation.org> (raw)
In-Reply-To: <20080219123756.6261c13c@laptopd505.fenrus.org>

On Tue, 19 Feb 2008 12:37:56 -0800 Arjan van de Ven <arjan@infradead.org> wrote:

> From: Soren Sandmann <sandmann@redhat.com>
> Subject: [PATCH] x86: add the debugfs interface for the sysprof tool
> 
> The sysprof tool is a very easy to use GUI tool to find out where
> userspace is spending CPU time. See
> http://www.daimi.au.dk/~sandmann/sysprof/
> for more information and screenshots on this tool.
> 
> Sysprof needs a 200 line kernel module to do it's work, this
> module puts some simple profiling data into debugfs.
> 
> ...
>

Seems a poor idea to me.  Sure, oprofile is "hard to set up", but not if
your distributor already did it for you.

Sidebar: the code uses the utterly crappy register_timer_hook() which 

a) is woefully misnamed and

b) is racy and 

c) will disrupt oprofile if it is being used.  And vice versa.



This code adds a new kernel->userspace interface which is not even
documented in code comments.  It appears to use a pollable debugfs file in
/proc somewhere, carrying an unspecified payload.


What happens when multiple processes are consuming data from the same
debugfs file?


>  arch/x86/Kconfig.debug    |   10 ++
>  arch/x86/kernel/Makefile  |    2 +
>  arch/x86/kernel/sysprof.c |  200 +++++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/sysprof.h |   34 ++++++++
>  4 files changed, 246 insertions(+), 0 deletions(-)
>  create mode 100644 arch/x86/kernel/sysprof.c
>  create mode 100644 arch/x86/kernel/sysprof.h
> 
> diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
> index 12c98ea..8eb06c0 100644
> --- a/arch/x86/Kconfig.debug
> +++ b/arch/x86/Kconfig.debug
> @@ -206,6 +206,16 @@ config MMIOTRACE_TEST
>  
>  	  Say N, unless you absolutely know what you are doing.
>  
> +config SYSPROF
> +	tristate "Enable sysprof userland performance sampler"
> +	depends on PROFILING

Missing dependency on DEBUG_FS

> +	help
> +	  This option enables the sysprof debugfs file that is used by the
> +	  sysprof tool. sysprof is a tool to show where in userspace CPU
> +	  time is spent.
> +
> +	  When in doubt, say N
> +

And it's x86-specific.

>  #
>  # IO delay types:
>  #
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 4a4260c..1e8fb66 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -80,6 +80,8 @@ obj-$(CONFIG_DEBUG_NX_TEST)	+= test_nx.o
>  obj-$(CONFIG_VMI)		+= vmi_32.o vmiclock_32.o
>  obj-$(CONFIG_PARAVIRT)		+= paravirt.o paravirt_patch_$(BITS).o
>  
> +obj-$(CONFIG_SYSPROF)		+= sysprof.o
> +
>  ifdef CONFIG_INPUT_PCSPKR
>  obj-y				+= pcspeaker.o
>  endif
> diff --git a/arch/x86/kernel/sysprof.c b/arch/x86/kernel/sysprof.c
> new file mode 100644
> index 0000000..6220b9f
> --- /dev/null
> +++ b/arch/x86/kernel/sysprof.c
> @@ -0,0 +1,200 @@
> +/* -*- c-basic-offset: 8 -*- */
> +
> +/* Sysprof -- Sampling, systemwide CPU profiler
> + * Copyright 2004, Red Hat, Inc.
> + * Copyright 2004, 2005, Soeren Sandmann
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/proc_fs.h>
> +#include <linux/poll.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/profile.h>
> +#include <linux/debugfs.h>
> +#include <asm/uaccess.h>

checkpatch used to warn that linux/uaccess.h is preferred over asm/uaccess.h
but this is another of those checkpatch features which seems to have
mysteriously disappeared, or it broke?

> +#include <asm/atomic.h>
> +
> +#include "sysprof.h"
> +
> +#define SAMPLES_PER_SECOND (200)
> +#define INTERVAL ((HZ <= SAMPLES_PER_SECOND)? 1 : (HZ / SAMPLES_PER_SECOND))
> +#define N_TRACES 256
> +
> +static struct sysprof_stacktrace stack_traces[N_TRACES];
> +static struct sysprof_stacktrace *head = &stack_traces[0];
> +static struct sysprof_stacktrace *tail = &stack_traces[0];

Access to head and tail appear to be racy.  See below.

> +static DECLARE_WAIT_QUEUE_HEAD(wait_for_trace);
> +static DECLARE_WAIT_QUEUE_HEAD(wait_for_exit);
> +
> +struct userspace_reader {
> +	struct task_struct *task;
> +	unsigned long cache_address;
> +	unsigned long *cache;
> +};
> +
> +struct stack_frame;
> +
> +struct stack_frame {
> +	struct stack_frame __user *next;
> +	unsigned long return_address;
> +};
> +
> +static int read_frame(struct stack_frame __user *frame_pointer,
> +					struct stack_frame *frame)
> +{
> +	if (__copy_from_user_inatomic(frame, frame_pointer,
> +					sizeof(struct stack_frame)))
> +		return 1;
> +	return 0;
> +}
> +
> +static DEFINE_PER_CPU(int, n_samples);
> +
> +static int timer_notify(struct pt_regs *regs)
> +{
> +	struct sysprof_stacktrace *trace = head;

We read `head' before taking the "lock".  Another CPU could modify it after
we took a local copy.

> +	int i;
> +	int is_user;
> +	static atomic_t in_timer_notify = ATOMIC_INIT(1);
> +	int n;
> +
> +	n = ++get_cpu_var(n_samples);
> +	put_cpu_var(n_samples);

Needlessly disables preemption.  Use __get_cpu_var().

> +	if (n % INTERVAL != 0)
> +		return 0;

It'd be nice to make INTERVAL a power of 2.

> +	/* 0: locked, 1: unlocked */
> +
> +	if (!atomic_dec_and_test(&in_timer_notify))
> +		goto out;

Why not use spin_trylock()?  Then you get lockdep support too.

And why not use spin_lock()?  At least a comment should be added explaining
and justifying this peculiar home-made-not-really-locking design.

> +	is_user = user_mode(regs);
> +
> +	if (!current || current->pid == 0)
> +		goto out;

And the changelog should explain and justify why we cannot profile root's
code.  I cannot begin to imagine why it was done and I cannot fathom why it
passed uncommented in documentation, code, changelog and "review" comments. 
It greatly reduces the usefulness of an already dubious feature.

If this limitation _was_ documented then I'd be in a position to ask what is
special about root, as opposed to some non-root user who has <unspecified>
capabilities.  And why we penalise a root who has dropped <unspecified>
capabilities.  etcetera.

Is this open-coded test of ->pid correct in a containerised environment?

> +	if (is_user && current->state != TASK_RUNNING)
> +		goto out;

Needs a comment (although this one is fairly obvious)

> +	if (!is_user) {
> +		/* kernel */
> +		trace->pid = current->pid;
> +		trace->truncated = 0;
> +		trace->n_addresses = 1;
> +
> +		/* 0x1 is taken by sysprof to mean "in kernel" */
> +		trace->addresses[0] = 0x1;
> +	} else {
> +		struct stack_frame __user *frame_pointer;
> +		struct stack_frame frame;
> +		memset(trace, 0, sizeof(struct sysprof_stacktrace));
> +
> +		trace->pid = current->pid;

This is ambiguous in a containerised environment.

Ingo, please be alert for anything which exposes raw pids to userspace.

I don't know what a correct and suitable interface might be - perhaps Pavel
or Eric can suggest something.

> +		trace->truncated = 0;
> +
> +		i = 0;
> +
> +		trace->addresses[i++] = regs->ip;
> +
> +		frame_pointer = (struct stack_frame __user *)regs->bp;
> +
> +		while (read_frame(frame_pointer, &frame) == 0 &&
> +		       i < SYSPROF_MAX_ADDRESSES &&
> +		       (unsigned long)frame_pointer >= regs->sp) {
> +			trace->addresses[i++] = frame.return_address;
> +			frame_pointer = frame.next;
> +		}

The (absent) interface documentation should explain what happens when a
fault causes this information to be truncated.

> +		trace->n_addresses = i;
> +
> +		if (i == SYSPROF_MAX_ADDRESSES)
> +			trace->truncated = 1;
> +		else
> +			trace->truncated = 0;
> +	}
> +
> +	if (head++ == &stack_traces[N_TRACES - 1])
> +		head = &stack_traces[0];

`head' can just merrily advance over `tail' and there is no notification to
userspace of the lost data.

> +	wake_up(&wait_for_trace);
> +
> +out:
> +	atomic_inc(&in_timer_notify);
> +	return 0;
> +}
> +
> +static ssize_t procfile_read(struct file *filp, char __user *buffer,
> +			size_t count, loff_t *ppos)
> +{
> +	ssize_t bcount;
> +	if (head == tail)
> +		return -EWOULDBLOCK;

Please put a blank line between end-of-variables and start-of-code.

This seems to be a wrong return value?  Shouldn't it just return zero if
there was nothing there?  As can happen if some other process is reading the
same debugfs file?

> +	BUG_ON(tail->pid == 0);

whee.  There was no locking above to prevent the tasks's pid from
transitioning from non-zero to zero after it was tested.  Which means this
is triggerable.  Perhaps the implicit locking due to cpu-pinnedness and
interruption will prevent that race.  If so, such a subtlety should be
commented, no?

> +	*ppos = 0;
> +	bcount = simple_read_from_buffer(buffer, count, ppos,
> +			tail, sizeof(struct sysprof_stacktrace));
> +
> +	if (tail++ == &stack_traces[N_TRACES - 1])
> +		tail = &stack_traces[0];

There is no locking for `tail', and afaict we support multiple simultaneous
readers.

> +	return bcount;
> +}

This reads a single item even if there were 100 queued, which is quite 
inefficient.

We already have infrastructure for bulk kernel->user transfer in
kernel/relay.c?

> +static unsigned int procfile_poll(struct file *filp, poll_table * poll_table)

checkpatch missed this coding-style error.

> +{
> +	if (head != tail)
> +		return POLLIN | POLLRDNORM;
> +
> +	poll_wait(filp, &wait_for_trace, poll_table);
> +
> +	if (head != tail)
> +		return POLLIN | POLLRDNORM;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations sysprof_fops = {
> +	.owner = THIS_MODULE,
> +	.read = procfile_read,
> +	.poll = procfile_poll,
> +};
> +
> +static struct dentry *debugfs_pe;
> +int init_module(void)
> +{
> +	debugfs_pe = debugfs_create_file("sysprof-trace", 0600, NULL, NULL,
> +				&sysprof_fops);
> +	if (!debugfs_pe)
> +		return -ENODEV;
> +	register_timer_hook(timer_notify);
> +	return 0;
> +}

This function will enable sysprof-trace even if prof_on==0,
prof_on==SLEEP_PROFILING, etc which is pointless.

> +void cleanup_module(void)
> +{
> +	unregister_timer_hook(timer_notify);
> +	debugfs_remove(debugfs_pe);
> +}
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Soeren Sandmann (sandmann@daimi.au.dk)");
> +MODULE_DESCRIPTION("Kernel driver for the sysprof performance analysis tool");
> diff --git a/arch/x86/kernel/sysprof.h b/arch/x86/kernel/sysprof.h
> new file mode 100644
> index 0000000..6e16d6f
> --- /dev/null
> +++ b/arch/x86/kernel/sysprof.h
> @@ -0,0 +1,34 @@
> +/* Sysprof -- Sampling, systemwide CPU profiler
> + * Copyright 2004, Red Hat, Inc.
> + * Copyright 2004, 2005, Soeren Sandmann
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +
> +#ifndef SYSPROF_MODULE_H
> +#define SYSPROF_MODULE_H
> +
> +#define SYSPROF_MAX_ADDRESSES 512
> +
> +struct sysprof_stacktrace {
> +    int	pid;		/* -1 if in kernel */
> +    int truncated;
> +    int n_addresses;	/* note: this can be 1 if the process was compiled
> +			 * with -fomit-frame-pointer or is otherwise weird
> +			 */
> +    unsigned long addresses[SYSPROF_MAX_ADDRESSES];
> +};

This is broken for 32-bit userspace running on a 64-bit kernel.  Unless
said userspace jumps through hoops and works out that it's running under a
64-bit kernel.

There might be alignment issues for addresses[], being at offset 12.


On Wed, 20 Feb 2008 10:10:22 +0100 Ingo Molnar <mingo@elte.hu> wrote:
>
> thanks, looks good to me - applied.

This is pretty distressing, frankly.  The threshold for getting code into
Linux should be much higher than this.

I do not have the time to review everything which goes into all the git
trees.  Better review, please.


  parent reply	other threads:[~2008-02-23  8:28 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-02-19 20:37 [PATCH] x86: add the debugfs interface for the sysprof tool Arjan van de Ven
2008-02-20  9:10 ` Ingo Molnar
2008-02-26  4:13   ` Anton Blanchard
2008-02-26  8:57     ` Ingo Molnar
2008-02-20 18:16 ` Peter Zijlstra
2008-02-20 18:39   ` Arjan van de Ven
2008-02-20 18:53     ` Peter Zijlstra
2008-02-20 18:58       ` Peter Zijlstra
2008-02-26  5:03         ` Anton Blanchard
2008-02-20 19:26       ` Arjan van de Ven
2008-02-20 20:58         ` Peter Zijlstra
2008-02-20 21:07           ` Arjan van de Ven
2008-02-20 21:44             ` Peter Zijlstra
2008-02-20 22:36               ` Arjan van de Ven
2008-02-23  8:11 ` Andrew Morton [this message]
2008-02-23 11:37   ` Ingo Molnar
2008-02-23 13:53     ` John Levon
2008-02-23 15:50       ` Ingo Molnar
2008-02-23 20:15       ` Soeren Sandmann
2008-02-23 20:42         ` Andrew Morton
2008-02-26  5:13         ` Anton Blanchard
2008-02-26  9:02           ` Ingo Molnar
2008-02-26 17:29             ` Andrew Morton
2008-02-27 14:05             ` Anton Blanchard
2008-02-24 13:10       ` Theodore Tso
2008-02-24 13:44         ` Ingo Molnar
2008-02-24 14:33           ` Ingo Molnar
2008-02-24 16:32         ` John Levon
2008-02-24 18:19           ` Theodore Tso
2008-02-23 18:40     ` Andrew Morton
2008-02-23 11:51   ` Pekka Enberg
2008-02-23 12:22     ` Ingo Molnar
2008-02-23 12:29       ` Pekka Enberg
2008-02-23 18:46     ` Andrew Morton
2008-02-24  2:49       ` Pekka Enberg
2008-02-24  3:12         ` Nicholas Miell
2008-02-26  6:27           ` Pekka Enberg
2008-02-26  6:48             ` Pekka Enberg
2008-02-26  8:55               ` Ingo Molnar
2008-02-23 14:54   ` Pekka Enberg
2008-02-23 19:01     ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20080223001130.d8922136.akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=arjan@infradead.org \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=sandmann@redhat.com \
    --cc=tglx@tglx.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.