* [Patch 0/9] Per-task delay accounting
@ 2006-03-14 0:40 Shailabh Nagar
2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
` (9 more replies)
0 siblings, 10 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14 0:40 UTC (permalink / raw)
To: linux-kernel; +Cc: Nick Piggin
This is the next iteration of the delay accounting patches
last posted at
http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
The major changes to the patches (and names of commenters whose concerns
have been addressed) are:
- considerable revision of the synchronous block I/O delay collection (Arjan)
Now most delays seen in I/O submission are not being collected.
- removal of dependency on schedstats by using common code (Nick)
- removal of sysctl or dynamic configurability of delay accounting (Arjan)
Now it it can be set on/off at boot time only.
- providing I/O delays through /proc/<tgid>/stats (Andi)
- revisions to the generic netlink interface (Jamal)
Unaddressed comments are suggestions from Jamal on
further improving the generic netlink interface which is ongoing.
Series
nstimestamp-diff.patch
delayacct-setup.patch
delayacct-blkio-init.patch
delayacct-blkio-collect.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-schedstats.patch
genetlink-utils.patch
delayacct-genetlink.patch
A short note on expected usage of this functionality follows which
addresses a comment from Nick in the previous iteration.
Basically, the patches measures a subset of the delays encountered
by a task waiting for a resource such as cpu, disk I/O and page
frames. Which subset is chosen depends on whether a task's delay
is something that can be controlled by playing with its priority
or rss limit etc.
e.g. cpu delays can directly be affected by its static CPU priority.
Similarly I/O delay can be affected by its I/O priority (assuming
one uses the right iosched). In page frame delay, we only count
the page faults due to swapin since that is directly affected by
the rss limits for a task (well, ok, process).
Delays can be used as feedback to decide whether something can/should
be done about a task (or group of tasks) which is not performing
as one expects.
http://www.research.ibm.com/journal/sj/362/amanaut.html
lists a fairly complicated way of using such data in
workload management.
At a simpler level, one can think of utilities that can use this data
to control individual apps using a simple feedback loop.
At this point we do not have such a utility.
Of course, its useful to visually inspect such data. Some of it is
being exported through /proc as suggested by Andi, but the primary
interface is through a netlink socket so that userspace can get
data for exiting tasks too.
The netlink socket also allows userspace apps to efficiently
collect delay data and group it in arbitrary ways
(as envisaged by CKRM, CSA or ELSA) for reporting or
control purposes.
--Shailabh
^ permalink raw reply [flat|nested] 56+ messages in thread* [Patch 1/9] timestamp diff 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar @ 2006-03-14 0:42 ` Shailabh Nagar 2006-03-14 1:01 ` Lee Revell 2006-03-15 10:23 ` Arjan van de Ven 2006-03-14 0:45 ` Patch 2/9] Initialization Shailabh Nagar ` (8 subsequent siblings) 9 siblings, 2 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:42 UTC (permalink / raw) To: linux-kernel; +Cc: Arjan van de Ven nstimestamp_diff.patch Add kernel utility function for measuring the interval (diff) between two timespec values Comments addressed (commenter) - returns difference as timespec instead of nsecs (Arjan) Signed-off-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Shailabh Nagar <nagar@us.ibm.com> include/linux/time.h | 14 ++++++++++++++ 1 files changed, 14 insertions(+) Index: linux-2.6.16-rc5/include/linux/time.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/time.h 2006-03-10 17:56:56.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/time.h 2006-03-10 17:57:26.000000000 -0500 @@ -147,6 +147,20 @@ extern struct timespec ns_to_timespec(co */ extern struct timeval ns_to_timeval(const nsec_t nsec); +/** + * timespec_diff - Return difference of timespecs as a timespec + * @start: first timespec + * @end: second timespec + * @ret: result timespec + * + * Returns the difference between @start and @end in @ret + */ +static inline void timespec_diff(struct timespec *start, struct timespec *end, + struct timespec *ret) +{ + ret->tv_sec = end->tv_sec - start->tv_sec; + ret->tv_nsec = end->tv_nsec - start->tv_nsec; +} #endif /* __KERNEL__ */ #define NFDBITS __NFDBITS ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar @ 2006-03-14 1:01 ` Lee Revell 2006-03-14 1:05 ` Shailabh Nagar 2006-03-15 10:23 ` Arjan van de Ven 1 sibling, 1 reply; 56+ messages in thread From: Lee Revell @ 2006-03-14 1:01 UTC (permalink / raw) To: nagar; +Cc: linux-kernel, Arjan van de Ven On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote: > + ret->tv_sec = end->tv_sec - start->tv_sec; > + ret->tv_nsec = end->tv_nsec - start->tv_nsec; What if end->tv_nsec is less than start->tv_nsec? Lee ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 1:01 ` Lee Revell @ 2006-03-14 1:05 ` Shailabh Nagar 2006-03-14 1:12 ` Lee Revell 0 siblings, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 1:05 UTC (permalink / raw) To: Lee Revell; +Cc: linux-kernel, Arjan van de Ven On Mon, 2006-03-13 at 20:01, Lee Revell wrote: > On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote: > > + ret->tv_sec = end->tv_sec - start->tv_sec; > > + ret->tv_nsec = end->tv_nsec - start->tv_nsec; > > What if end->tv_nsec is less than start->tv_nsec? The caller of the func can decide what to do. In our usage, we choose to not use the result of such a diff but others may want to return an error etc. --Shailabh > > Lee > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 1:05 ` Shailabh Nagar @ 2006-03-14 1:12 ` Lee Revell 2006-03-14 3:42 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: Lee Revell @ 2006-03-14 1:12 UTC (permalink / raw) To: nagar; +Cc: linux-kernel, Arjan van de Ven On Mon, 2006-03-13 at 20:05 -0500, Shailabh Nagar wrote: > On Mon, 2006-03-13 at 20:01, Lee Revell wrote: > > On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote: > > > + ret->tv_sec = end->tv_sec - start->tv_sec; > > > + ret->tv_nsec = end->tv_nsec - start->tv_nsec; > > > > What if end->tv_nsec is less than start->tv_nsec? > > The caller of the func can decide what to do. > In our usage, we choose to not use the result of such a diff > but others may want to return an error etc. You don't think it's a problem that 2.0000001s - 1.9999999s would return garbage rather than 0.0000002? Lee ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 1:12 ` Lee Revell @ 2006-03-14 3:42 ` Balbir Singh 2006-03-14 4:26 ` Shailabh Nagar 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-14 3:42 UTC (permalink / raw) To: Lee Revell; +Cc: nagar, linux-kernel, Arjan van de Ven > You don't think it's a problem that 2.0000001s - 1.9999999s would return > garbage rather than 0.0000002? > > Lee > The caller can use set_normalized_timespec() to fix that. Warm Regards, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 3:42 ` Balbir Singh @ 2006-03-14 4:26 ` Shailabh Nagar 2006-03-14 6:50 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 4:26 UTC (permalink / raw) To: balbir; +Cc: Lee Revell, linux-kernel, Arjan van de Ven Balbir Singh wrote: >>You don't think it's a problem that 2.0000001s - 1.9999999s would return >>garbage rather than 0.0000002? >> >> Sorry....you're right. I was thinking of invalid deltas (where end was before start). >>Lee >> >> >> > >The caller can use set_normalized_timespec() to fix that. > > Lee's point is valid....we should be doing the set_normalized_timespec. Will fix. Thanks, Shailabh >Warm Regards, >Balbir > > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 4:26 ` Shailabh Nagar @ 2006-03-14 6:50 ` Balbir Singh 0 siblings, 0 replies; 56+ messages in thread From: Balbir Singh @ 2006-03-14 6:50 UTC (permalink / raw) To: Shailabh Nagar; +Cc: Lee Revell, linux-kernel, Arjan van de Ven > Lee's point is valid....we should be doing the set_normalized_timespec. > Semantically that sounds correct. But the intension of the caller as in our case could be to convert the timespec to nsec_t. Calling set_normalized_timespec() might be an overhead in this case. Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar 2006-03-14 1:01 ` Lee Revell @ 2006-03-15 10:23 ` Arjan van de Ven 2006-03-15 10:28 ` Balbir Singh 1 sibling, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2006-03-15 10:23 UTC (permalink / raw) To: nagar; +Cc: linux-kernel > +static inline void timespec_diff(struct timespec *start, struct timespec *end, > + struct timespec *ret) > +{ > + ret->tv_sec = end->tv_sec - start->tv_sec; > + ret->tv_nsec = end->tv_nsec - start->tv_nsec; > +} > #endif /* __KERNEL__ */ I'd suggest normalizing the timespec; better to do it in such a function than in all callers of it.. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 1/9] timestamp diff 2006-03-15 10:23 ` Arjan van de Ven @ 2006-03-15 10:28 ` Balbir Singh 0 siblings, 0 replies; 56+ messages in thread From: Balbir Singh @ 2006-03-15 10:28 UTC (permalink / raw) To: Arjan van de Ven; +Cc: nagar, linux-kernel > > > +static inline void timespec_diff(struct timespec *start, struct timespec *end, > > + struct timespec *ret) > > +{ > > + ret->tv_sec = end->tv_sec - start->tv_sec; > > + ret->tv_nsec = end->tv_nsec - start->tv_nsec; > > +} > > #endif /* __KERNEL__ */ > > I'd suggest normalizing the timespec; better to do it in such a function > than in all callers of it.. > Yep, that also looks semantically correct. We will do so. Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Patch 2/9] Initialization 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar 2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar @ 2006-03-14 0:45 ` Shailabh Nagar 2006-03-14 10:54 ` Jes Sorensen 2006-03-15 10:24 ` Arjan van de Ven 2006-03-14 0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar ` (7 subsequent siblings) 9 siblings, 2 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:45 UTC (permalink / raw) To: linux-kernel; +Cc: Arjan van de Ven delayacct-setup.patch Initialization code related to collection of per-task "delay" statistics which measure how long it had to wait for cpu, sync block io, swapping etc. The collection of statistics and the interface are in other patches. This patch sets up the data structures and allows the statistics collection to be disabled through a kernel boot paramater. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Documentation/kernel-parameters.txt | 2 include/linux/delayacct.h | 48 ++++++++++++++++++ include/linux/sched.h | 17 ++++++ init/Kconfig | 13 +++++ init/main.c | 2 kernel/Makefile | 1 kernel/delayacct.c | 92 ++++++++++++++++++++++++++++++++++++ kernel/exit.c | 3 + kernel/fork.c | 2 9 files changed, 180 insertions(+) Index: linux-2.6.16-rc5/Documentation/kernel-parameters.txt =================================================================== --- linux-2.6.16-rc5.orig/Documentation/kernel-parameters.txt 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/Documentation/kernel-parameters.txt 2006-03-11 07:41:37.000000000 -0500 @@ -410,6 +410,8 @@ running once the system is up. Format: <area>[,<node>] See also Documentation/networking/decnet.txt. + delayacct [KNL] Enable per-task delay accounting + devfs= [DEVFS] See Documentation/filesystems/devfs/boot-options. Index: linux-2.6.16-rc5/kernel/Makefile =================================================================== --- linux-2.6.16-rc5.orig/kernel/Makefile 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/kernel/Makefile 2006-03-11 07:41:37.000000000 -0500 @@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o +obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is Index: linux-2.6.16-rc5/include/linux/delayacct.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:37.000000000 -0500 @@ -0,0 +1,48 @@ +/* delayacct.h - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKDELAYS_H +#define _LINUX_TASKDELAYS_H + +#include <linux/sched.h> + +#ifdef CONFIG_TASK_DELAY_ACCT +extern int delayacct_on; /* Delay accounting turned on/off */ +extern kmem_cache_t *delayacct_cache; +extern int delayacct_init(void); +extern void __delayacct_tsk_init(struct task_struct *); +extern void __delayacct_tsk_exit(struct task_struct *); + +static inline void delayacct_tsk_init(struct task_struct *tsk) +{ + /* reinitialize in case parent's non-null pointer was dup'ed*/ + tsk->delays = NULL; + if (unlikely(delayacct_on)) + __delayacct_tsk_init(tsk); +} + +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{ + if (tsk->delays) + __delayacct_tsk_exit(tsk); +} + +#else +static inline int delayacct_init(void) +{} +static inline void delayacct_tsk_init(struct task_struct *tsk) +{} +static inline void delayacct_tsk_exit(struct task_struct *tsk) +{} +#endif /* CONFIG_TASK_DELAY_ACCT */ +#endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.16-rc5/include/linux/sched.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:37.000000000 -0500 @@ -539,6 +539,20 @@ struct sched_info { extern struct file_operations proc_schedstat_operations; #endif +#ifdef CONFIG_TASK_DELAY_ACCT +struct task_delay_info { + spinlock_t lock; + + /* For each stat XXX, add following, aligned appropriately + * + * struct timespec XXX_start, XXX_end; + * u64 XXX_delay; + * u32 XXX_count; + */ +}; +#endif + + enum idle_type { SCHED_IDLE, @@ -870,6 +884,9 @@ struct task_struct { #endif atomic_t fs_excl; /* holding fs exclusive resources */ struct rcu_head rcu; +#ifdef CONFIG_TASK_DELAY_ACCT + struct task_delay_info *delays; +#endif }; static inline pid_t process_group(struct task_struct *tsk) Index: linux-2.6.16-rc5/init/Kconfig =================================================================== --- linux-2.6.16-rc5.orig/init/Kconfig 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/init/Kconfig 2006-03-11 07:41:37.000000000 -0500 @@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3 for processing it. A preliminary version of these tools is available at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>. +config TASK_DELAY_ACCT + bool "Enable per-task delay accounting (EXPERIMENTAL)" + help + Collect information on time spent by a task waiting for system + resources like cpu, synchronous block I/O completion and swapping + in pages. Such statistics can help in setting a task's priorities + relative to other tasks for cpu, io, rss limits etc. + + Unlike BSD process accounting, this information is available + continuously during the lifetime of a task. + + Say N if unsure. + config SYSCTL bool "Sysctl support" ---help--- Index: linux-2.6.16-rc5/init/main.c =================================================================== --- linux-2.6.16-rc5.orig/init/main.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/init/main.c 2006-03-11 07:41:37.000000000 -0500 @@ -47,6 +47,7 @@ #include <linux/rmap.h> #include <linux/mempolicy.h> #include <linux/key.h> +#include <linux/delayacct.h> #include <asm/io.h> #include <asm/bugs.h> @@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void proc_root_init(); #endif cpuset_init(); + delayacct_init(); check_bugs(); Index: linux-2.6.16-rc5/kernel/delayacct.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:37.000000000 -0500 @@ -0,0 +1,92 @@ +/* delayacct.c - per-task delay accounting + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#include <linux/sched.h> +#include <linux/slab.h> +#include <linux/time.h> +#include <linux/sysctl.h> +#include <linux/delayacct.h> + +int delayacct_on = 0; /* Delay accounting turned on/off */ +kmem_cache_t *delayacct_cache; + +static int __init delayacct_setup_enable(char *str) +{ + delayacct_on = 1; + return 1; +} +__setup("delayacct", delayacct_setup_enable); + +int delayacct_init(void) +{ + delayacct_cache = kmem_cache_create("delayacct_cache", + sizeof(struct task_delay_info), + 0, + SLAB_PANIC, + NULL, NULL); + if (!delayacct_cache) + return -ENOMEM; + delayacct_tsk_init(&init_task); + return 0; +} + +void __delayacct_tsk_init(struct task_struct *tsk) +{ + tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL); + if (tsk->delays) { + memset(tsk->delays, 0, sizeof(*tsk->delays)); + spin_lock_init(&tsk->delays->lock); + } +} + +void __delayacct_tsk_exit(struct task_struct *tsk) +{ + if (tsk->delays) { + kmem_cache_free(delayacct_cache, tsk->delays); + tsk->delays = NULL; + } +} + +/* + * Start accounting for a delay statistic using + * its starting timestamp (@start) + */ + +static inline void delayacct_start(struct timespec *start) +{ + do_posix_clock_monotonic_gettime(start); +} + +/* + * Finish delay accounting for a statistic using + * its timestamps (@start, @end), accumalator (@total) and @count + */ + +static inline void delayacct_end(struct timespec *start, struct timespec *end, + u64 *total, u32 *count) +{ + struct timespec ts; + nsec_t ns; + + do_posix_clock_monotonic_gettime(end); + timespec_diff(start, end, &ts); + ns = timespec_to_ns(&ts); + if (ns < 0) + return; + + spin_lock(¤t->delays->lock); + *total += ns; + (*count)++; + spin_unlock(¤t->delays->lock); +} + Index: linux-2.6.16-rc5/kernel/fork.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/fork.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/kernel/fork.c 2006-03-11 07:41:37.000000000 -0500 @@ -44,6 +44,7 @@ #include <linux/rmap.h> #include <linux/acct.h> #include <linux/cn_proc.h> +#include <linux/delayacct.h> #include <asm/pgtable.h> #include <asm/pgalloc.h> @@ -970,6 +971,7 @@ static task_t *copy_process(unsigned lon goto bad_fork_cleanup_put_domain; p->did_exec = 0; + delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ copy_flags(clone_flags, p); p->pid = pid; retval = -EFAULT; Index: linux-2.6.16-rc5/kernel/exit.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/exit.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/kernel/exit.c 2006-03-11 07:41:37.000000000 -0500 @@ -31,6 +31,7 @@ #include <linux/signal.h> #include <linux/cn_proc.h> #include <linux/mutex.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/unistd.h> @@ -842,6 +843,8 @@ fastcall NORET_TYPE void do_exit(long co preempt_count()); acct_update_integrals(tsk); + delayacct_tsk_exit(tsk); + if (tsk->mm) { update_hiwater_rss(tsk->mm); update_hiwater_vm(tsk->mm); ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Patch 2/9] Initialization 2006-03-14 0:45 ` Patch 2/9] Initialization Shailabh Nagar @ 2006-03-14 10:54 ` Jes Sorensen 2006-03-14 15:20 ` Shailabh Nagar 2006-03-15 10:24 ` Arjan van de Ven 1 sibling, 1 reply; 56+ messages in thread From: Jes Sorensen @ 2006-03-14 10:54 UTC (permalink / raw) To: nagar; +Cc: linux-kernel, Arjan van de Ven >>>>> "Shailabh" == Shailabh Nagar <nagar@watson.ibm.com> writes: Shailabh> delayacct-setup.patch Initialization code related to Shailabh> collection of per-task "delay" statistics which measure how Shailabh> long it had to wait for cpu, sync block io, swapping Shailabh> etc. The collection of statistics and the interface are in Shailabh> other patches. This patch sets up the data structures and Shailabh> allows the statistics collection to be disabled through a Shailabh> kernel boot paramater. Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT Shailabh> +struct task_delay_info { Shailabh> + spinlock_t lock; Shailabh> + Shailabh> + /* For each stat XXX, add following, aligned appropriately Shailabh> + * Shailabh> + * struct timespec XXX_start, XXX_end; Shailabh> + * u64 XXX_delay; Shailabh> + * u32 XXX_count; Shailabh> + */ Shailabh> +}; Shailabh> +#endif Hmmm I thought you were going to change this to do u64 some_delay u32 foo_count u32 bar_count u64 another_delay To avoid wasting space on 64 bit platforms. Jes ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Patch 2/9] Initialization 2006-03-14 10:54 ` Jes Sorensen @ 2006-03-14 15:20 ` Shailabh Nagar 0 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 15:20 UTC (permalink / raw) To: Jes Sorensen; +Cc: linux-kernel, Arjan van de Ven Jes Sorensen wrote: >>>>>>"Shailabh" == Shailabh Nagar <nagar@watson.ibm.com> writes: >>>>>> >>>>>> > >Shailabh> delayacct-setup.patch Initialization code related to >Shailabh> collection of per-task "delay" statistics which measure how >Shailabh> long it had to wait for cpu, sync block io, swapping >Shailabh> etc. The collection of statistics and the interface are in >Shailabh> other patches. This patch sets up the data structures and >Shailabh> allows the statistics collection to be disabled through a >Shailabh> kernel boot paramater. > >Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT >Shailabh> +struct task_delay_info { >Shailabh> + spinlock_t lock; >Shailabh> + >Shailabh> + /* For each stat XXX, add following, aligned appropriately >Shailabh> + * >Shailabh> + * struct timespec XXX_start, XXX_end; >Shailabh> + * u64 XXX_delay; >Shailabh> + * u32 XXX_count; >Shailabh> + */ >Shailabh> +}; >Shailabh> +#endif > >Hmmm > >I thought you were going to change this to do > >u64 some_delay >u32 foo_count >u32 bar_count >u64 another_delay > >To avoid wasting space on 64 bit platforms. > > Well, the "aligned appropriately" part of the comment was intended to let future users know that something like the above should be done. e.g in a subsequent patch (5/9) http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1919.html when we introduce the additional stat swapin_delay/count, a 64-bit friendly alignment is being done thus: u64 blkio_delay; /* wait for sync block io completion */ + u64 swapin_delay; /* wait for sync block io completion */ u32 blkio_count; + u32 swapin_count; The need for each stat, potentially, to need a separate set of timespec variables to reduce nesting problems does introduce another wrinkle in cache-friendliness since you'd ideally like to have all the XXX_* variables close together. Right now we're good since swapin doesn't need extra timespecs of its own. But future additions might need to be more careful. --Shailabh >Jes > > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Patch 2/9] Initialization 2006-03-14 0:45 ` Patch 2/9] Initialization Shailabh Nagar 2006-03-14 10:54 ` Jes Sorensen @ 2006-03-15 10:24 ` Arjan van de Ven 2006-03-15 12:37 ` Alan Cox 1 sibling, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2006-03-15 10:24 UTC (permalink / raw) To: nagar; +Cc: linux-kernel > + * This program is free software; you can redistribute it and/or modify it > + * under the terms of version 2.1 of the GNU Lesser General Public License > + * as published by the Free Software Foundation. > + * LGPL inside the kernel doesn't make a whole lot of sense.... better make it GPL. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Patch 2/9] Initialization 2006-03-15 10:24 ` Arjan van de Ven @ 2006-03-15 12:37 ` Alan Cox 2006-03-15 15:53 ` Shailabh Nagar 0 siblings, 1 reply; 56+ messages in thread From: Alan Cox @ 2006-03-15 12:37 UTC (permalink / raw) To: Arjan van de Ven; +Cc: nagar, linux-kernel On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote: > > + * This program is free software; you can redistribute it and/or modify it > > + * under the terms of version 2.1 of the GNU Lesser General Public License > > + * as published by the Free Software Foundation. > > + * > > LGPL inside the kernel doesn't make a whole lot of sense.... better make > it GPL. When you combine an LGPL and GPL work you get a GPL work so yes it would be clearer to mark it GPL as that is what it became as it was merged, but perhaps to add a note stating where it can be obtained under other licenses for other projects. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: Patch 2/9] Initialization 2006-03-15 12:37 ` Alan Cox @ 2006-03-15 15:53 ` Shailabh Nagar 0 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-15 15:53 UTC (permalink / raw) To: Alan Cox; +Cc: Arjan van de Ven, linux-kernel Alan Cox wrote: >On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote: > > >>>+ * This program is free software; you can redistribute it and/or modify it >>>+ * under the terms of version 2.1 of the GNU Lesser General Public License >>>+ * as published by the Free Software Foundation. >>>+ * >>> >>> >>LGPL inside the kernel doesn't make a whole lot of sense.... better make >>it GPL. >> >> > >When you combine an LGPL and GPL work you get a GPL work so yes it would >be clearer to mark it GPL as that is what it became as it was merged, >but perhaps to add a note stating where it can be obtained under other >licenses for other projects. > > > Thanks. The LGPL usage is a mistake in the core code....will change to GPL everywhere except the case below. There's no intent to have the kernel code available under other licenses etc. so thats not a problem. However, the confusion about what license to use for a header file that will need to be included in a potentially non-GPL user application persists. The header file include/linux/taskstats.h, created in patch 9/9 http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1925.html defines the user-kernel messaging interface and should probably continue to have LGPL, just to be absolutely safe legally. I've not been following the legal twists too carefully so its probably overkill. --Shailabh ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 3/9] Block I/O accounting initialization 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar 2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar 2006-03-14 0:45 ` Patch 2/9] Initialization Shailabh Nagar @ 2006-03-14 0:47 ` Shailabh Nagar 2006-03-15 10:27 ` Arjan van de Ven 2006-03-14 0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar ` (6 subsequent siblings) 9 siblings, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:47 UTC (permalink / raw) To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen delayacct-blkio-init.patch Setup variables and functions to collect per-task block I/O delays. Separating the collection part to a later patch for easier review and discussion. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> include/linux/delayacct.h | 16 ++++++++++++++++ include/linux/sched.h | 4 ++++ kernel/delayacct.c | 14 ++++++++++++++ 3 files changed, 34 insertions(+) Index: linux-2.6.16-rc5/include/linux/delayacct.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/delayacct.h 2006-03-11 07:41:37.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:38.000000000 -0500 @@ -22,6 +22,8 @@ extern kmem_cache_t *delayacct_cache; extern int delayacct_init(void); extern void __delayacct_tsk_init(struct task_struct *); extern void __delayacct_tsk_exit(struct task_struct *); +extern void __delayacct_blkio_start(void); +extern void __delayacct_blkio_end(void); static inline void delayacct_tsk_init(struct task_struct *tsk) { @@ -37,6 +39,16 @@ static inline void delayacct_tsk_exit(st __delayacct_tsk_exit(tsk); } +static inline void delayacct_blkio_start(void) +{ + if (unlikely(delayacct_on)) + __delayacct_blkio_start(); +} +static inline void delayacct_blkio_end(void) +{ + if (unlikely(delayacct_on)) + __delayacct_blkio_end(); +} #else static inline int delayacct_init(void) {} @@ -44,5 +56,9 @@ static inline void delayacct_tsk_init(st {} static inline void delayacct_tsk_exit(struct task_struct *tsk) {} +static inline void delayacct_blkio_start(void) +{} +static inline void delayacct_blkio_end(void) +{} #endif /* CONFIG_TASK_DELAY_ACCT */ #endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.16-rc5/kernel/delayacct.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:37.000000000 -0500 +++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:38.000000000 -0500 @@ -90,3 +90,17 @@ static inline void delayacct_end(struct spin_unlock(¤t->delays->lock); } +void __delayacct_blkio_start(void) +{ + if (current->delays) + delayacct_start(¤t->delays->blkio_start); +} + +void __delayacct_blkio_end(void) +{ + if (current->delays) + delayacct_end(¤t->delays->blkio_start, + ¤t->delays->blkio_end, + ¤t->delays->blkio_delay, + ¤t->delays->blkio_count); +} Index: linux-2.6.16-rc5/include/linux/sched.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:37.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:38.000000000 -0500 @@ -549,6 +549,10 @@ struct task_delay_info { * u64 XXX_delay; * u32 XXX_count; */ + + struct timespec blkio_start, blkio_end; + u64 blkio_delay; /* wait for sync block io completion */ + u32 blkio_count; }; #endif ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 3/9] Block I/O accounting initialization 2006-03-14 0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar @ 2006-03-15 10:27 ` Arjan van de Ven 2006-03-15 16:27 ` Shailabh Nagar 0 siblings, 1 reply; 56+ messages in thread From: Arjan van de Ven @ 2006-03-15 10:27 UTC (permalink / raw) To: nagar; +Cc: linux-kernel, Andi Kleen > > +static inline void delayacct_blkio_start(void) > +{ > + if (unlikely(delayacct_on)) > + __delayacct_blkio_start(); > +} I still think the unlikely() makes no sense here; at runtime it's either going to be always on or off (in the sense that switching it will be RARE). The cpus branch predictor will get that right; while if you force it unlikely you can even run the risk of getting a 100% miss on some architectures (not x86/x86-64) when you enable the accounting ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 3/9] Block I/O accounting initialization 2006-03-15 10:27 ` Arjan van de Ven @ 2006-03-15 16:27 ` Shailabh Nagar 0 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-15 16:27 UTC (permalink / raw) To: Arjan van de Ven; +Cc: linux-kernel, Andi Kleen, Greg KH Arjan van de Ven wrote: >> >>+static inline void delayacct_blkio_start(void) >>+{ >>+ if (unlikely(delayacct_on)) >>+ __delayacct_blkio_start(); >>+} >> >> > > >I still think the unlikely() makes no sense here; at runtime it's either >going to be always on or off (in the sense that switching it will be >RARE). The cpus branch predictor will get that right; while if you force >it unlikely you can even run the risk of getting a 100% miss on some >architectures (not x86/x86-64) when you enable the accounting > > > I don't understand why the fact that delayacct_on is not going to vary dynamically once a kernel has been booted should prevent the usage of "unlikely/likely" Perhaps you can help me understand. Here's the logic I was using: There are two kinds of users 1. Those who will not turn delay accounting on at boot time for a number of reasons (don't care about the functionality, don't want to incur any overhead however small). 2. The ones who do turn it on at boot time are willing to pay some overhead for the benefits such functionality brings. What that overhead exactly is will be known when we finish running some tests but we can safely assume it'll be non-zero. A distro will need to keep delay accounting configured to accomodate both kinds of users but because 1 is the common case, the default boot time option will be off and there is a strong incentive to reduce overhead for non-users. Using unlikely reduces the overhead for the common case 1 and increases overhead for case 2 (because of what you pointed out of 100% wrong decisions by the branch predictor). I don't know how much of overhead reduction is achieved by using unlikely (I'm assuming its non-trivial) or how much penalty is imposed by a 100% wrong prediction (I'm assuming its not very high). But under these assumptions, its better to use unlikely, make type 2 users pay extra overhead and save type 1 users from any. Is this reasoning accurate ? If not, we could easily switch the unlikely off. Regards, Shailabh (cc'ing Greg since he'd brought up the overhead for real benchmarks etc.) ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 4/9] Block I/O accounting collection 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (2 preceding siblings ...) 2006-03-14 0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar @ 2006-03-14 0:48 ` Shailabh Nagar 2006-03-14 0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar ` (5 subsequent siblings) 9 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:48 UTC (permalink / raw) To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen, Suparna Bhattacharya delayacct-blkio-collect.patch Collect per-task block I/O delay statistics. Unlike earlier iterations of the delay accounting patches, now delays are only collected for the actual I/O waits rather than try and cover the delays seen in I/O submission paths. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> kernel/sched.c | 5 +++++ 1 files changed, 5 insertions(+) Index: linux-2.6.16-rc5/kernel/sched.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/sched.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/kernel/sched.c 2006-03-11 07:41:39.000000000 -0500 @@ -49,6 +49,7 @@ #include <linux/syscalls.h> #include <linux/times.h> #include <linux/acct.h> +#include <linux/delayacct.h> #include <asm/tlb.h> #include <asm/unistd.h> @@ -4117,9 +4118,11 @@ void __sched io_schedule(void) { struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id()); + delayacct_blkio_start(); atomic_inc(&rq->nr_iowait); schedule(); atomic_dec(&rq->nr_iowait); + delayacct_blkio_end(); } EXPORT_SYMBOL(io_schedule); @@ -4129,9 +4132,11 @@ long __sched io_schedule_timeout(long ti struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id()); long ret; + delayacct_blkio_start(); atomic_inc(&rq->nr_iowait); ret = schedule_timeout(timeout); atomic_dec(&rq->nr_iowait); + delayacct_blkio_end(); return ret; } ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 5/9] Swapin delays 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (3 preceding siblings ...) 2006-03-14 0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar @ 2006-03-14 0:51 ` Shailabh Nagar 2006-03-14 0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar ` (4 subsequent siblings) 9 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:51 UTC (permalink / raw) To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen delayacct-swapin.patch Account separately for block I/O delays incurred as a result of swapin page faults since these are a result of insufficient rss limit. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> include/linux/sched.h | 5 ++++- kernel/delayacct.c | 17 ++++++++++++----- mm/memory.c | 4 ++++ 3 files changed, 20 insertions(+), 6 deletions(-) Index: linux-2.6.16-rc5/include/linux/sched.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:38.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:39.000000000 -0500 @@ -550,9 +550,11 @@ struct task_delay_info { * u32 XXX_count; */ - struct timespec blkio_start, blkio_end; + struct timespec blkio_start, blkio_end; /* Shared by blkio, swapin */ u64 blkio_delay; /* wait for sync block io completion */ + u64 swapin_delay; /* wait for sync block io completion */ u32 blkio_count; + u32 swapin_count; }; #endif @@ -949,6 +951,7 @@ static inline void put_task_struct(struc #define PF_BORROWED_MM 0x00400000 /* I am a kthread doing use_mm */ #define PF_RANDOMIZE 0x00800000 /* randomize virtual address space */ #define PF_SWAPWRITE 0x01000000 /* Allowed to write to swap */ +#define PF_SWAPIN 0x02000000 /* I am doing a swap in */ /* * Only the _current_ task can read/write to tsk->flags, but other Index: linux-2.6.16-rc5/kernel/delayacct.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:38.000000000 -0500 +++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:39.000000000 -0500 @@ -98,9 +98,16 @@ void __delayacct_blkio_start(void) void __delayacct_blkio_end(void) { - if (current->delays) - delayacct_end(¤t->delays->blkio_start, - ¤t->delays->blkio_end, - ¤t->delays->blkio_delay, - ¤t->delays->blkio_count); + if (current->delays) { + if (current->flags & PF_SWAPIN) /* Swapping a page in */ + delayacct_end(¤t->delays->blkio_start, + ¤t->delays->blkio_end, + ¤t->delays->swapin_delay, + ¤t->delays->swapin_count); + else /* Other block I/O */ + delayacct_end(¤t->delays->blkio_start, + ¤t->delays->blkio_end, + ¤t->delays->blkio_delay, + ¤t->delays->blkio_count); + } } Index: linux-2.6.16-rc5/mm/memory.c =================================================================== --- linux-2.6.16-rc5.orig/mm/memory.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/mm/memory.c 2006-03-11 07:41:39.000000000 -0500 @@ -1882,6 +1882,7 @@ static int do_swap_page(struct mm_struct entry = pte_to_swp_entry(orig_pte); again: + current->flags |= PF_SWAPIN; page = lookup_swap_cache(entry); if (!page) { swapin_readahead(entry, address, vma); @@ -1894,6 +1895,7 @@ again: page_table = pte_offset_map_lock(mm, pmd, address, &ptl); if (likely(pte_same(*page_table, orig_pte))) ret = VM_FAULT_OOM; + current->flags &= ~PF_SWAPOFF; goto unlock; } @@ -1905,6 +1907,8 @@ again: mark_page_accessed(page); lock_page(page); + current->flags &= ~PF_SWAPOFF; + if (!PageSwapCache(page)) { /* Page migration has occured */ unlock_page(page); ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 7/9] /proc interface for all I/O delays 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (4 preceding siblings ...) 2006-03-14 0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar @ 2006-03-14 0:53 ` Shailabh Nagar 2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar ` (3 subsequent siblings) 9 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:53 UTC (permalink / raw) To: linux-kernel; +Cc: Andi Kleen, Arjan van de Ven delayacct-procfs.patch Export I/O delays seen by a task through /proc/<tgid>/stats for use in top etc. Note that delays for I/O done for swapping in pages (swapin I/O) is clubbed together with all other I/O here (this is not the case for the other interface where the swapin I/O is kept distinct) Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> fs/proc/array.c | 6 ++++-- include/linux/delayacct.h | 10 ++++++++++ include/linux/jiffies.h | 6 ++++++ kernel/delayacct.c | 13 +++++++++++++ 4 files changed, 33 insertions(+), 2 deletions(-) Index: linux-2.6.16-rc5/fs/proc/array.c =================================================================== --- linux-2.6.16-rc5.orig/fs/proc/array.c 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/fs/proc/array.c 2006-03-11 07:41:40.000000000 -0500 @@ -75,6 +75,7 @@ #include <linux/times.h> #include <linux/cpuset.h> #include <linux/rcupdate.h> +#include <linux/delayacct.h> #include <asm/uaccess.h> #include <asm/pgtable.h> @@ -414,7 +415,7 @@ static int do_task_stat(struct task_stru res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \ %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \ -%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n", +%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n", task->pid, tcomm, state, @@ -459,7 +460,8 @@ static int do_task_stat(struct task_stru task->exit_signal, task_cpu(task), task->rt_priority, - task->policy); + task->policy, + delayacct_blkio_ticks(task)); if(mm) mmput(mm); return res; Index: linux-2.6.16-rc5/include/linux/delayacct.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/delayacct.h 2006-03-11 07:41:38.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/delayacct.h 2006-03-11 07:41:40.000000000 -0500 @@ -24,6 +24,7 @@ extern void __delayacct_tsk_init(struct extern void __delayacct_tsk_exit(struct task_struct *); extern void __delayacct_blkio_start(void); extern void __delayacct_blkio_end(void); +extern unsigned long long __delayacct_blkio_ticks(struct task_struct *); static inline void delayacct_tsk_init(struct task_struct *tsk) { @@ -49,6 +50,11 @@ static inline void delayacct_blkio_end(v if (unlikely(delayacct_on)) __delayacct_blkio_end(); } +static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk) +{ + if (unlikely(delayacct_on)) + return __delayacct_blkio_ticks(tsk); +} #else static inline int delayacct_init(void) {} @@ -60,5 +66,9 @@ static inline void delayacct_blkio_start {} static inline void delayacct_blkio_end(void) {} +static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk) +{ + return 0; +} #endif /* CONFIG_TASK_DELAY_ACCT */ #endif /* _LINUX_TASKDELAYS_H */ Index: linux-2.6.16-rc5/include/linux/jiffies.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/jiffies.h 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/jiffies.h 2006-03-11 07:41:40.000000000 -0500 @@ -291,6 +291,12 @@ static inline unsigned long usecs_to_jif #endif } +static inline unsigned long nsecs_to_jiffies(const nsec_t n) +{ + return (((u64)n * NSEC_CONVERSION) >> + (NSEC_JIFFIE_SC - SEC_JIFFIE_SC)) >> SEC_JIFFIE_SC; +} + /* * The TICK_NSEC - 1 rounds up the value to the next resolution. Note * that a remainder subtract here would not do the right thing as the Index: linux-2.6.16-rc5/kernel/delayacct.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/delayacct.c 2006-03-11 07:41:39.000000000 -0500 +++ linux-2.6.16-rc5/kernel/delayacct.c 2006-03-11 07:41:40.000000000 -0500 @@ -111,3 +111,16 @@ void __delayacct_blkio_end(void) ¤t->delays->blkio_count); } } + +unsigned long long __delayacct_blkio_ticks(struct task_struct *tsk) +{ + unsigned long long ret; + + if (!tsk->delays) + return 0; + + spin_lock(&tsk->delays->lock); + ret = nsec_to_clock_t(tsk->delays->blkio_delay + tsk->delays->swapin_delay); + spin_unlock(&tsk->delays->lock); + return ret; +} ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 8/9] generic netlink utility functions 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (5 preceding siblings ...) 2006-03-14 0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar @ 2006-03-14 0:55 ` Shailabh Nagar 2006-03-26 16:44 ` Balbir Singh 2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar ` (2 subsequent siblings) 9 siblings, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:55 UTC (permalink / raw) To: linux-kernel, netdev; +Cc: Jamal genetlink-utils.patch Two utilities for simplifying usage of NETLINK_GENERIC interface. Signed-off-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> include/net/genetlink.h | 20 ++++++++++++++++++++ 1 files changed, 20 insertions(+) Index: linux-2.6.16-rc5/include/net/genetlink.h =================================================================== --- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500 +++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500 @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct return nlmsg_unicast(genl_sock, skb, pid); } +/** + * gennlmsg_data - head of message payload + * @gnlh: genetlink messsage header + */ +static inline void *genlmsg_data(const struct genlmsghdr *gnlh) +{ + return ((unsigned char *) gnlh + GENL_HDRLEN); +} + +/** + * genlmsg_len - length of message payload + * @gnlh: genetlink message header + */ +static inline int genlmsg_len(const struct genlmsghdr *gnlh) +{ + struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh - + NLMSG_HDRLEN); + return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN); +} + #endif /* __NET_GENERIC_NETLINK_H */ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 8/9] generic netlink utility functions 2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar @ 2006-03-26 16:44 ` Balbir Singh 2006-03-26 17:06 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-26 16:44 UTC (permalink / raw) To: Shailabh Nagar; +Cc: linux-kernel, netdev, Jamal On Mon, Mar 13, 2006 at 07:55:05PM -0500, Shailabh Nagar wrote: > genetlink-utils.patch > > Two utilities for simplifying usage of NETLINK_GENERIC > interface. > > Signed-off-by: Balbir Singh <balbir@in.ibm.com> > Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> > > include/net/genetlink.h | 20 ++++++++++++++++++++ > 1 files changed, 20 insertions(+) > > Index: linux-2.6.16-rc5/include/net/genetlink.h > =================================================================== > --- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500 > +++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500 > @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct > return nlmsg_unicast(genl_sock, skb, pid); > } > > +/** > + * gennlmsg_data - head of message payload > + * @gnlh: genetlink messsage header > + */ > +static inline void *genlmsg_data(const struct genlmsghdr *gnlh) > +{ > + return ((unsigned char *) gnlh + GENL_HDRLEN); > +} > + > +/** > + * genlmsg_len - length of message payload > + * @gnlh: genetlink message header > + */ > +static inline int genlmsg_len(const struct genlmsghdr *gnlh) > +{ > + struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh - > + NLMSG_HDRLEN); > + return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN); > +} > + > #endif /* __NET_GENERIC_NETLINK_H */ > > Jamal, Does the implementation of these utilities look ok? We use genlmsg_data() in the delay accounting code but not genlmsg_len(), hence it might not be very well tested (just reviewed). Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 8/9] generic netlink utility functions 2006-03-26 16:44 ` Balbir Singh @ 2006-03-26 17:06 ` jamal 0 siblings, 0 replies; 56+ messages in thread From: jamal @ 2006-03-26 17:06 UTC (permalink / raw) To: balbir; +Cc: Shailabh Nagar, linux-kernel, netdev On Sun, 2006-26-03 at 22:14 +0530, Balbir Singh wrote: > Jamal, > > Does the implementation of these utilities look ok? We use genlmsg_data() > in the delay accounting code but not genlmsg_len(), hence it might not > be very well tested (just reviewed). > They look fine to me - please resubmit and CC Thomas Graf in case he sees it different. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (6 preceding siblings ...) 2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar @ 2006-03-14 0:56 ` Shailabh Nagar 2006-03-14 2:29 ` jamal 2006-03-14 2:33 ` Matt Helsley 2006-03-14 1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar 2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH 9 siblings, 2 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 0:56 UTC (permalink / raw) To: linux-kernel; +Cc: Jamal, netdev delayacct-genetlink.patch Create a generic netlink interface (NETLINK_GENERIC family), called "taskstats", for getting delay and cpu statistics of tasks and thread groups during their lifetime and when they exit. Comments addressed (all in response to Jamal) - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE - Use multicast only for events generated due to task exit, not for replies to commands - Align taskstats and taskstats_reply structures to 64 bit aligned. - Use dynamic ID generation for genetlink family More changes expected. Following comments will go into a Documentation file: When a task is alive, userspace can get its stats by sending a command containing its pid. Sending a tgid returns the sum of stats of the tasks belonging to that tgid (where such a sum makes sense). Together, the command interface allows stats for a large number of tasks to be collected more efficiently than would be possible through /proc or any per-pid interface. The netlink interface also sends the stats for each task to userspace when the task is exiting. This permits fine-grain accounting for short-lived tasks, which is important if userspace is doing its own aggregation of statistics based on some grouping of tasks (e.g. CSA jobs, ELSA banks or CKRM classes). If the exiting task belongs to a thread group (with more members than itself) , the latters delay stats are also sent out on the task's exit. This allows userspace to get accurate data at a per-tgid level while the tid's of a tgid are exiting one by one. The interface has been deliberately kept distinct from the delay accounting code since it is potentially usable by other kernel components that need to export per-pid/tgid data. The format of data returned to userspace is versioned and the command interface easily extensible to facilitate reuse. If reuse is not deemed useful enough, the naming, placement of functions and config options will be modified to make this an interface for delay accounting alone. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> include/linux/taskstats.h | 128 ++++++++++++++++++++++++ init/Kconfig | 16 ++- kernel/taskstats.c | 244 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 385 insertions(+), 3 deletions(-) Index: linux-2.6.16-rc5/include/linux/taskstats.h =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc5/include/linux/taskstats.h 2006-03-13 19:01:35.000000000 -0500 @@ -0,0 +1,128 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * To create the next version, bump up the taskstats_version variable + * and delineate the start of newly added fields with a comment indicating + * the version number. + */ + +#define TASKSTATS_VERSION 1 + +struct taskstats { + /* Maintain 64-bit alignment while extending */ + + /* Version 1 */ +#define TASKSTATS_NOPID -1 + __s64 pid; + __s64 tgid; + + /* XXX_count is number of delay values recorded. + * XXX_total is corresponding cumulative delay in nanoseconds + */ + +#define TASKSTATS_NOCPUSTATS 1 + __u64 cpu_count; + __u64 cpu_delay_total; /* wait, while runnable, for cpu */ + __u64 blkio_count; + __u64 blkio_delay_total; /* sync,block io completion wait*/ + __u64 swapin_count; + __u64 swapin_delay_total; /* swapin page fault wait*/ + + __u64 cpu_run_total; /* cpu running time + * no count available/provided */ +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. New commands should only be inserted at the enum's end + */ + +enum { + TASKSTATS_CMD_UNSPEC, /* Reserved */ + TASKSTATS_CMD_NONE, /* Not a valid cmd to send + * Marks data sent on task/tgid exit */ + TASKSTATS_CMD_LISTEN, /* Start listening */ + TASKSTATS_CMD_IGNORE, /* Stop listening */ + TASKSTATS_CMD_PID, /* Send stats for a pid */ + TASKSTATS_CMD_TGID, /* Send stats for a tgid */ +}; + +/* Parameters for commands + * New parameters should only be inserted at the struct's end + */ + +struct taskstats_cmd_param { + /* Maintain 64-bit alignment while extending */ + union { + __s64 pid; + __s64 tgid; + } id; +}; + +enum outtype { + TASKSTATS_REPLY_NONE = 1, /* Control cmd response */ + TASKSTATS_REPLY_PID, /* per-pid data cmd response*/ + TASKSTATS_REPLY_TGID, /* per-tgid data cmd response*/ + TASKSTATS_REPLY_EXIT_PID, /* Exiting task's stats */ + TASKSTATS_REPLY_EXIT_TGID, /* Exiting tgid's stats + * (sent on each tid's exit) */ +}; + +/* + * Reply sent from kernel + * Version number affects size/format of struct taskstats only + */ + +struct taskstats_reply { + /* Maintain 64-bit alignment while extending */ + __u16 outtype; /* Must be one of enum outtype */ + __u16 version; + __u32 err; + struct taskstats stats; /* Invalid if err != 0 */ +}; + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#define TASKSTATS_HDRLEN (NLMSG_SPACE(GENL_HDRLEN)) +#define TASKSTATS_BODYLEN (sizeof(struct taskstats_reply)) + +#ifdef __KERNEL__ + +#include <linux/sched.h> + +#ifdef CONFIG_TASKSTATS +extern void taskstats_exit_pid(struct task_struct *); +#else +static inline void taskstats_exit_pid(struct task_struct *tsk) +{} +#endif + +#endif /* __KERNEL__ */ +#endif /* _LINUX_TASKSTATS_H */ Index: linux-2.6.16-rc5/kernel/taskstats.c =================================================================== --- /dev/null 1970-01-01 00:00:00.000000000 +0000 +++ linux-2.6.16-rc5/kernel/taskstats.c 2006-03-13 19:01:35.000000000 -0500 @@ -0,0 +1,244 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include <linux/kernel.h> +#include <linux/taskstats.h> +#include <linux/delayacct.h> +#include <net/genetlink.h> +#include <asm/atomic.h> + +const int taskstats_version = TASKSTATS_VERSION; +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered = 0; + + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .hdrsize = 0, + .maxattr = 0, +}; + +/* Taskstat specific functions */ +static int prepare_reply(struct genl_info *info, u8 cmd, + struct sk_buff **skbp, struct taskstats_reply **replyp) +{ + struct sk_buff *skb; + struct taskstats_reply *reply; + + skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + family.id, 0, NLM_F_REQUEST, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, info->nlhdr->nlmsg_flags, + info->genlhdr->cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + skb_put(skb, TASKSTATS_BODYLEN); + + memset(reply, 0, sizeof(*reply)); + reply->version = taskstats_version; + reply->err = 0; + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, int replytype, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + struct taskstats_reply *reply; + int rc; + + reply = (struct taskstats_reply *)genlmsg_data(genlhdr); + reply->outtype = replytype; + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + else + return genlmsg_unicast(skb, pid); +} + +static inline void fill_pid(struct taskstats_reply *reply, pid_t pid, + struct task_struct *pidtsk) +{ + int rc; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + reply->err = EINVAL; + return; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + rc = delayacct_add_tsk(reply, tsk); + if (!rc) { + reply->stats.pid = (s64)tsk->pid; + reply->stats.tgid = (s64)tsk->tgid; + } else + reply->err = (rc < 0) ? -rc : rc ; + + put_task_struct(tsk); +} + +static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info) +{ + int rc; + struct sk_buff *rep_skb; + struct taskstats_reply *reply; + struct taskstats_cmd_param *param= info->userhdr; + + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply); + if (rc) + return rc; + fill_pid(reply, param->id.pid, NULL); + return send_reply(rep_skb, TASKSTATS_REPLY_PID, info->snd_pid, 0); +} + +static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid, + struct task_struct *tgidtsk) +{ + int rc; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + reply->err = EINVAL; + return; + } + } + tsk = first; + do { + rc = delayacct_add_tsk(reply, tsk); + if (rc) + break; + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + if (!rc) { + reply->stats.pid = (s64)TASKSTATS_NOPID; + reply->stats.tgid = (s64)tgid; + } else + reply->err = (rc < 0) ? -rc : rc ; +} + +static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info) +{ + int rc; + struct sk_buff *rep_skb; + struct taskstats_reply *reply; + struct taskstats_cmd_param *param= info->userhdr; + + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply); + if (rc) + return rc; + fill_tgid(reply, param->id.tgid, NULL); + return send_reply(rep_skb, TASKSTATS_REPLY_TGID, info->snd_pid, 0); +} + +/* Send pid data out on exit */ +void taskstats_exit_pid(struct task_struct *tsk) +{ + int rc; + struct sk_buff *rep_skb; + struct taskstats_reply *reply; + + /* + * tasks can start to exit very early. Ensure that the family + * is registered before notifications are sent out + */ + if (!family_registered) + return; + + rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply); + if (rc) + return; + fill_pid(reply, tsk->pid, tsk); + rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID, 0, 1); + + if (rc || thread_group_empty(tsk)) + return; + + /* Send tgid data too */ + rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply); + if (rc) + return; + fill_tgid(reply, tsk->tgid, tsk); + send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID, 0, 1); +} + +static struct genl_ops pid_ops = { + .cmd = TASKSTATS_CMD_PID, + .doit = taskstats_send_pid, +}; + +static struct genl_ops tgid_ops = { + .cmd = TASKSTATS_CMD_TGID, + .doit = taskstats_send_tgid, +}; + +static int __init taskstats_init(void) +{ + if (genl_register_family(&family)) + return -EFAULT; + family_registered = 1; + + if (genl_register_ops(&family, &pid_ops)) + goto err; + if (genl_register_ops(&family, &tgid_ops)) + goto err; + + return 0; +err: + genl_unregister_family(&family); + family_registered = 0; + return -EFAULT; +} + +late_initcall(taskstats_init); + Index: linux-2.6.16-rc5/init/Kconfig =================================================================== --- linux-2.6.16-rc5.orig/init/Kconfig 2006-03-13 18:51:30.000000000 -0500 +++ linux-2.6.16-rc5/init/Kconfig 2006-03-13 19:04:52.000000000 -0500 @@ -158,11 +158,21 @@ config TASK_DELAY_ACCT in pages. Such statistics can help in setting a task's priorities relative to other tasks for cpu, io, rss limits etc. - Unlike BSD process accounting, this information is available - continuously during the lifetime of a task. - Say N if unsure. +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + depends on TASK_DELAY_ACCT + default y + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. Like BSD accounting, they are sent to user + space on task exit. + + Say Y if unsure. + config SYSCTL bool "Sysctl support" ---help--- ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar @ 2006-03-14 2:29 ` jamal 2006-03-14 2:33 ` Matt Helsley 1 sibling, 0 replies; 56+ messages in thread From: jamal @ 2006-03-14 2:29 UTC (permalink / raw) To: nagar; +Cc: linux-kernel, netdev On Mon, 2006-13-03 at 19:56 -0500, Shailabh Nagar wrote: > delayacct-genetlink.patch > > Create a generic netlink interface (NETLINK_GENERIC family), > called "taskstats", for getting delay and cpu statistics of > tasks and thread groups during their lifetime and when they exit. > > Comments addressed (all in response to Jamal) > Note, you are still not following the standard scheme of doing things. Example: using command = GET and the message carrying the TGID to note which TGID is of interest. Instead you have command = TGID. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar 2006-03-14 2:29 ` jamal @ 2006-03-14 2:33 ` Matt Helsley 2006-03-14 2:48 ` jamal 2006-03-14 4:29 ` Shailabh Nagar 1 sibling, 2 replies; 56+ messages in thread From: Matt Helsley @ 2006-03-14 2:33 UTC (permalink / raw) To: Shailabh Nagar; +Cc: linux-kernel, Jamal, netdev On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote: <snip> > Comments addressed (all in response to Jamal) > > - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE The enums for these are still in the patch. See below. <snip> > +/* > + * Commands sent from userspace > + * Not versioned. New commands should only be inserted at the enum's end > + */ > + > +enum { > + TASKSTATS_CMD_UNSPEC, /* Reserved */ > + TASKSTATS_CMD_NONE, /* Not a valid cmd to send > + * Marks data sent on task/tgid exit */ > + TASKSTATS_CMD_LISTEN, /* Start listening */ > + TASKSTATS_CMD_IGNORE, /* Stop listening */ >From the description I thought you had eliminated these. > + TASKSTATS_CMD_PID, /* Send stats for a pid */ > + TASKSTATS_CMD_TGID, /* Send stats for a tgid */ > +}; Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply: > Note, you are still not following the standard scheme of doing things. > Example: using command = GET and the message carrying the TGID to note > which TGID is of interest. Instead you have command = TGID. > > cheers, > jamal meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I misunderstanding? <snip> Cheers, -Matt Helsley ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 2:33 ` Matt Helsley @ 2006-03-14 2:48 ` jamal 2006-03-14 4:18 ` Shailabh Nagar 2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh 2006-03-14 4:29 ` Shailabh Nagar 1 sibling, 2 replies; 56+ messages in thread From: jamal @ 2006-03-14 2:48 UTC (permalink / raw) To: Matt Helsley; +Cc: Shailabh Nagar, linux-kernel, netdev On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote: > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote: > Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply: > > Note, you are still not following the standard scheme of doing things. > > Example: using command = GET and the message carrying the TGID to note > > which TGID is of interest. Instead you have command = TGID. > > > meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to > TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I > misunderstanding? > I had a long description in an earlier email feedback; but the summary of it is the GET command is generic like TASKSTATS_CMD_GET; the message itself carries TLVs of what needs to be gotten which are either PID and/or TGID etc. Anyways, theres a long spill of what i am saying in that earlier email. Perhaps the current patch is a transition towards that? cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 2:48 ` jamal @ 2006-03-14 4:18 ` Shailabh Nagar 2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh 1 sibling, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 4:18 UTC (permalink / raw) To: hadi; +Cc: Matt Helsley, linux-kernel, netdev jamal wrote: >On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote: > > >>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote: >> >> > > > >>Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply: >> >> >>>Note, you are still not following the standard scheme of doing things. >>>Example: using command = GET and the message carrying the TGID to note >>>which TGID is of interest. Instead you have command = TGID. >>> >>> >>> > > > >>meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to >>TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I >>misunderstanding? >> >> >> > >I had a long description in an earlier email feedback; but the summary >of it is the GET command is generic like TASKSTATS_CMD_GET; the message >itself carries TLVs of what needs to be gotten which are >either PID and/or TGID etc. Anyways, theres a long spill of what i am >saying in that earlier email. Perhaps the current patch is a transition >towards that? > > Yes, the comments you'd made in the previous mail have not been incorporated and this is still the older version of the patch. We'd been avoiding TLV usage so far :-) >cheers, >jamal > > > ^ permalink raw reply [flat|nested] 56+ messages in thread
* [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 2:48 ` jamal 2006-03-14 4:18 ` Shailabh Nagar @ 2006-03-22 7:49 ` Balbir Singh 2006-03-23 14:04 ` jamal 1 sibling, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-22 7:49 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote: > On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote: > > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote: > > > I had a long description in an earlier email feedback; but the summary > of it is the GET command is generic like TASKSTATS_CMD_GET; the message > itself carries TLVs of what needs to be gotten which are > either PID and/or TGID etc. Anyways, theres a long spill of what i am > saying in that earlier email. Perhaps the current patch is a transition > towards that? > Hi, Jamal, Please find the updated version of delayacct-genetlink.patch. We hope this iteration is closer to your expectation. I have copied the enums you suggested in your previous review comments and used them. Comments addressed (in this patch) - Changed the code to use TLV's for data exchange between kernel and user space Thanks, Balbir Documentation for the patch Create a generic netlink interface (NETLINK_GENERIC family), called "taskstats", for getting delay and cpu statistics of tasks and thread groups during their lifetime and when they exit. More changes expected. Following comments will go into a Documentation file: When a task is alive, userspace can get its stats by sending a command containing its pid. Sending a tgid returns the sum of stats of the tasks belonging to that tgid (where such a sum makes sense). Together, the command interface allows stats for a large number of tasks to be collected more efficiently than would be possible through /proc or any per-pid interface. The netlink interface also sends the stats for each task to userspace when the task is exiting. This permits fine-grain accounting for short-lived tasks, which is important if userspace is doing its own aggregation of statistics based on some grouping of tasks (e.g. CSA jobs, ELSA banks or CKRM classes). If the exiting task belongs to a thread group (with more members than itself) , the latters delay stats are also sent out on the task's exit. This allows userspace to get accurate data at a per-tgid level while the tid's of a tgid are exiting one by one. The interface has been deliberately kept distinct from the delay accounting code since it is potentially usable by other kernel components that need to export per-pid/tgid data. The format of data returned to userspace is versioned and the command interface easily extensible to facilitate reuse. If reuse is not deemed useful enough, the naming, placement of functions and config options will be modified to make this an interface for delay accounting alone. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> --- include/linux/delayacct.h | 11 ++ include/linux/taskstats.h | 112 ++++++++++++++++++++ init/Kconfig | 16 ++ kernel/Makefile | 1 kernel/delayacct.c | 44 ++++++++ kernel/taskstats.c | 251 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 432 insertions(+), 3 deletions(-) diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h --- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530 @@ -15,6 +15,7 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/taskstats.h> #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ @@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct extern void __delayacct_blkio_start(void); extern void __delayacct_blkio_end(void); extern unsigned long long __delayacct_blkio_ticks(struct task_struct *); +extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *); static inline void delayacct_tsk_init(struct task_struct *tsk) { @@ -72,4 +74,13 @@ static inline unsigned long long delayac return 0; } #endif /* CONFIG_TASK_DELAY_ACCT */ +#ifdef CONFIG_TASKSTATS +static inline int delayacct_add_tsk(struct taskstats *d, + struct task_struct *tsk) +{ + if (!tsk->delays) + return -EINVAL; + return __delayacct_add_tsk(d, tsk); +} +#endif #endif /* _LINUX_TASKDELAYS_H */ diff -puN /dev/null include/linux/taskstats.h --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-22 13:12:01.000000000 +0530 @@ -0,0 +1,112 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * To create the next version, bump up the taskstats_version variable + * and delineate the start of newly added fields with a comment indicating + * the version number. + */ + +#define TASKSTATS_VERSION 1 +#define TASKSTATS_NOPID -1 + +struct taskstats { + /* Maintain 64-bit alignment while extending */ + /* Version 1 */ + + /* XXX_count is number of delay values recorded. + * XXX_total is corresponding cumulative delay in nanoseconds + */ + +#define TASKSTATS_NOCPUSTATS 1 + __u64 cpu_count; + __u64 cpu_delay_total; /* wait, while runnable, for cpu */ + __u64 blkio_count; + __u64 blkio_delay_total; /* sync,block io completion wait*/ + __u64 swapin_count; + __u64 swapin_delay_total; /* swapin page fault wait*/ + + __u64 cpu_run_total; /* cpu running time + * no count available/provided */ +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. New commands should only be inserted at the enum's end + */ + +enum { + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ + TASKSTATS_CMD_GET, /* user->kernel request */ + TASKSTATS_CMD_NEW, /* kernel->user event */ + __TASKSTATS_CMD_MAX, +}; + +#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1) + +enum { + TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */ + TASKSTATS_TYPE_TGID, /* Thread group id */ + TASKSTATS_TYPE_PID, /* Process id */ + TASKSTATS_TYPE_STATS, /* taskstats structure */ + __TASKSTATS_TYPE_MAX, +}; + +#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1) + +enum { + TASKSTATS_CMD_ATTR_UNSPEC = 0, + TASKSTATS_CMD_ATTR_PID, + TASKSTATS_CMD_ATTR_TGID, + __TASKSTATS_CMD_ATTR_MAX, +}; + +#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1) + +enum { + TASKSTATS_MSG_UNICAST, /* send data only to requester */ + TASKSTATS_MSG_MULTICAST, /* send data to a group */ +}; + + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#ifdef __KERNEL__ + +#include <linux/sched.h> + +#ifdef CONFIG_TASKSTATS +extern void taskstats_exit_pid(struct task_struct *); +#else +static inline void taskstats_exit_pid(struct task_struct *tsk) +{} +#endif + +#endif /* __KERNEL__ */ +#endif /* _LINUX_TASKSTATS_H */ diff -puN init/Kconfig~delayacct-genetlink init/Kconfig --- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530 @@ -158,11 +158,21 @@ config TASK_DELAY_ACCT in pages. Such statistics can help in setting a task's priorities relative to other tasks for cpu, io, rss limits etc. - Unlike BSD process accounting, this information is available - continuously during the lifetime of a task. - Say N if unsure. +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + depends on TASK_DELAY_ACCT + default y + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. Like BSD accounting, they are sent to user + space on task exit. + + Say Y if unsure. + config SYSCTL bool "Sysctl support" ---help--- diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c --- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-22 13:12:01.000000000 +0530 @@ -1,6 +1,7 @@ /* delayacct.c - per-task delay accounting * * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * Copyright (C) Balbir Singh, IBM Corp. 2006 * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License @@ -16,9 +17,12 @@ #include <linux/time.h> #include <linux/sysctl.h> #include <linux/delayacct.h> +#include <linux/taskstats.h> +#include <linux/mutex.h> int delayacct_on = 0; /* Delay accounting turned on/off */ kmem_cache_t *delayacct_cache; +static DEFINE_MUTEX(delayacct_exit_mutex); static int __init delayacct_setup_enable(char *str) { @@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st void __delayacct_tsk_exit(struct task_struct *tsk) { + /* + * Protect against racing thread group exits + */ + mutex_lock(&delayacct_exit_mutex); + taskstats_exit_pid(tsk); if (tsk->delays) { kmem_cache_free(delayacct_cache, tsk->delays); tsk->delays = NULL; } + mutex_unlock(&delayacct_exit_mutex); } /* @@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic spin_unlock(&tsk->delays->lock); return ret; } +#ifdef CONFIG_TASKSTATS +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ + nsec_t tmp; + struct timespec ts; + unsigned long t1,t2; + + /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */ + + tmp = (nsec_t)d->cpu_run_total ; + tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC; + d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp; + + /* No locking available for sched_info. Take snapshot first. */ + t1 = tsk->sched_info.pcnt; + t2 = tsk->sched_info.run_delay; + + d->cpu_count += t1; + + jiffies_to_timespec(t2, &ts); + tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts); + d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp; + + spin_lock(&tsk->delays->lock); + tmp = d->blkio_delay_total + tsk->delays->blkio_delay; + d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp; + tmp = d->swapin_delay_total + tsk->delays->swapin_delay; + d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp; + d->blkio_count += tsk->delays->blkio_count; + d->swapin_count += tsk->delays->swapin_count; + spin_unlock(&tsk->delays->lock); + return 0; +} +#endif /* CONFIG_TASKSTATS */ diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530 @@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o +obj-$(CONFIG_TASKSTATS) += taskstats.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is diff -puN /dev/null kernel/taskstats.c --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-22 13:12:07.000000000 +0530 @@ -0,0 +1,251 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include <linux/kernel.h> +#include <linux/taskstats.h> +#include <linux/delayacct.h> +#include <net/genetlink.h> +#include <asm/atomic.h> + +const int taskstats_version = TASKSTATS_VERSION; +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered = 0; + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .hdrsize = 0, + .maxattr = TASKSTATS_CMD_ATTR_MAX, +}; + +static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = { + [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 }, + [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 }, +}; + + +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, + void **replyp) +{ + struct sk_buff *skb; + void *reply; + + skb = nlmsg_new(NLMSG_GOODSIZE); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + family.id, 0, NLM_F_REQUEST, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, info->nlhdr->nlmsg_flags, + info->genlhdr->cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + void *reply; + int rc; + + reply = genlmsg_data(genlhdr); + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event == TASKSTATS_MSG_MULTICAST) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + return genlmsg_unicast(skb, pid); +} + +static inline int fill_pid(pid_t pid, struct task_struct *pidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + rc = delayacct_add_tsk(stats, tsk); + put_task_struct(tsk); + + return rc; + +} + +static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + } + tsk = first; + do { + rc = delayacct_add_tsk(stats, tsk); + if (rc) + break; + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + return rc; +} + +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) +{ + int rc; + struct sk_buff *rep_skb; + struct taskstats stats; + void *reply; + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply); + if (rc < 0) + return rc; + + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); + rc = fill_pid((pid_t)pid, NULL, &stats); + if (rc < 0) + return rc; + } + + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); + rc = fill_tgid((pid_t)tgid, NULL, &stats); + if (rc < 0) + return rc; + } + + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); + +nla_put_failure: + return genlmsg_cancel(rep_skb, reply); + +} + + +/* Send pid data out on exit */ +void taskstats_exit_pid(struct task_struct *tsk) +{ + int rc; + struct sk_buff *rep_skb; + void *reply; + struct taskstats stats; + + /* + * tasks can start to exit very early. Ensure that the family + * is registered before notifications are sent out + */ + if (!family_registered) + return; + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); + if (rc < 0) + return; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); + rc = fill_pid(tsk->pid, tsk, &stats); + if (rc < 0) + return; + + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + + if (rc || thread_group_empty(tsk)) + return; + + /* Send tgid data too */ + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); + if (rc < 0) + return; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); + rc = fill_tgid(tsk->tgid, tsk, &stats); + if (rc < 0) + return; + + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + +nla_put_failure: + genlmsg_cancel(rep_skb, reply); +} + +static struct genl_ops taskstats_ops = { + .cmd = TASKSTATS_CMD_GET, + .doit = taskstats_send_stats, + .policy = taskstats_cmd_get_policy, +}; + +static int __init taskstats_init(void) +{ + if (genl_register_family(&family)) + return -EFAULT; + family_registered = 1; + + if (genl_register_ops(&family, &taskstats_ops)) + goto err; + + return 0; +err: + genl_unregister_family(&family); + family_registered = 0; + return -EFAULT; +} + +late_initcall(taskstats_init); _ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh @ 2006-03-23 14:04 ` jamal 2006-03-23 15:41 ` Balbir Singh 2006-03-24 1:32 ` Balbir Singh 0 siblings, 2 replies; 56+ messages in thread From: jamal @ 2006-03-23 14:04 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev Hi Balbir, Looking good. This is a quick scan, so i didnt look at little details. Some comments embedded. On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote: > > diff -puN /dev/null include/linux/taskstats.h > + * The struct is versioned. Newer versions should only add fields to > + * the bottom of the struct to maintain backward compatibility. > + * > + * To create the next version, bump up the taskstats_version variable > + * and delineate the start of newly added fields with a comment indicating > + * the version number. > + */ > + > +enum { > + TASKSTATS_MSG_UNICAST, /* send data only to requester */ > + TASKSTATS_MSG_MULTICAST, /* send data to a group */ > +}; > + Above will never be used outside of the kernel. Should it go under the ifdef kernel below? > +#ifdef __KERNEL__ > + Note: some people will argue that you should probably have two header files. One for all kernel things and includes another which contains all the stuff above ifdef __KERNEL__ > + > +#endif /* __KERNEL__ */ > +#endif /* _LINUX_TASKSTATS_H */ > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 > +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530 > + > +const int taskstats_version = TASKSTATS_VERSION; > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; > +static int family_registered = 0; > + > +static struct genl_family family = { > + .id = GENL_ID_GENERATE, > + .name = TASKSTATS_GENL_NAME, > + .version = TASKSTATS_GENL_VERSION, > + .hdrsize = 0, Do you need to specify hdrsize of 0? > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, > + void **replyp) > +{ > + struct sk_buff *skb; > + void *reply; > + > + skb = nlmsg_new(NLMSG_GOODSIZE); Ok, getting a size of NLMSG_GOODSIZE is not a good idea. The max size youll ever get it seems to me is 2*32-bit-data TLVs + sizeof struct stats. Why dont you allocate that size? > + if (!skb) > + return -ENOMEM; > + > + if (!info) { > + int seq = get_cpu_var(taskstats_seqnum)++; > + put_cpu_var(taskstats_seqnum); > + > + reply = genlmsg_put(skb, 0, seq, > + family.id, 0, NLM_F_REQUEST, > + cmd, family.version); Double check if you need NLM_F_REQUEST > + } else > + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, > + family.id, 0, info->nlhdr->nlmsg_flags, > + info->genlhdr->cmd, family.version); A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the right thing? > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) > +{ > + int rc; > + struct sk_buff *rep_skb; > + struct taskstats stats; > + void *reply; > + > + memset(&stats, 0, sizeof(stats)); > + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply); Same comment as before: a response to a GET is a NEW; so info->genlhdr->cmd doesnt seem right. > + if (rc < 0) > + return rc; > + > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); > + rc = fill_pid((pid_t)pid, NULL, &stats); > + if (rc < 0) > + return rc; > + } > + > + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { > + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); > + rc = fill_tgid((pid_t)tgid, NULL, &stats); > + if (rc < 0) > + return rc; > + } > + Should there be at least either a pid or tgid? If yes, you need to validate here... > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); > + > +nla_put_failure: > + return genlmsg_cancel(rep_skb, reply); > + As a general comment double check your logic for errors; if you already have stashed something in the skb, you need to remove it etc. > +} > + > + > +/* Send pid data out on exit */ > +void taskstats_exit_pid(struct task_struct *tsk) > +{ > + int rc; > + struct sk_buff *rep_skb; > + void *reply; > + struct taskstats stats; > + > + /* > + * tasks can start to exit very early. Ensure that the family > + * is registered before notifications are sent out > + */ > + if (!family_registered) > + return; > + > + memset(&stats, 0, sizeof(stats)); > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); > + if (rc < 0) > + return; > + > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); > + rc = fill_pid(tsk->pid, tsk, &stats); > + if (rc < 0) > + return; > + > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); > + > + if (rc || thread_group_empty(tsk)) > + return; > + > + /* Send tgid data too */ > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); > + if (rc < 0) > + return; > + A single message with PID+TGID sounds reasonable. Why two messages with two stats? all you will need to do is get rid of the prepare_reply() above and NLA_PUT_U32() below (just like you do in a response to a GET. > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); > + rc = fill_tgid(tsk->tgid, tsk, &stats); > + if (rc < 0) > + return; > + > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); > + > +nla_put_failure: > + genlmsg_cancel(rep_skb, reply); > +} Other than the above comments - I believe you have it right. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-23 14:04 ` jamal @ 2006-03-23 15:41 ` Balbir Singh 2006-03-24 14:04 ` jamal 2006-03-24 1:32 ` Balbir Singh 1 sibling, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-23 15:41 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > > Hi Balbir, > > Looking good. > This is a quick scan, so i didnt look at little details. Thanks for your detailed feedback. Please find my response interspersed along with your comments. > Some comments embedded. > > On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote: > > > > > > > > > diff -puN /dev/null include/linux/taskstats.h > > > + * The struct is versioned. Newer versions should only add fields to > > + * the bottom of the struct to maintain backward compatibility. > > + * > > + * To create the next version, bump up the taskstats_version variable > > + * and delineate the start of newly added fields with a comment indicating > > + * the version number. > > + */ > > + > > > > > +enum { > > + TASKSTATS_MSG_UNICAST, /* send data only to requester */ > > + TASKSTATS_MSG_MULTICAST, /* send data to a group */ > > +}; > > + > > Above will never be used outside of the kernel. > Should it go under the ifdef kernel below? > Will do > > > +#ifdef __KERNEL__ > > + > > Note: some people will argue that you should probably have > two header files. One for all kernel things and includes another > which contains all the stuff above ifdef __KERNEL__ > > > + > > +#endif /* __KERNEL__ */ > > +#endif /* _LINUX_TASKSTATS_H */ > > > > > > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile > > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 > > +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530 > > > + > > +const int taskstats_version = TASKSTATS_VERSION; > > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; > > +static int family_registered = 0; > > + > > +static struct genl_family family = { > > + .id = GENL_ID_GENERATE, > > + .name = TASKSTATS_GENL_NAME, > > + .version = TASKSTATS_GENL_VERSION, > > + .hdrsize = 0, > > Do you need to specify hdrsize of 0? I will remove it > > > > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, > > + void **replyp) > > +{ > > + struct sk_buff *skb; > > + void *reply; > > + > > + skb = nlmsg_new(NLMSG_GOODSIZE); > > Ok, getting a size of NLMSG_GOODSIZE is not a good idea. > The max size youll ever get it seems to me is 2*32-bit-data TLVs + > sizeof struct stats. Why dont you allocate that size? > Will do > > + if (!skb) > > + return -ENOMEM; > > + > > + if (!info) { > > + int seq = get_cpu_var(taskstats_seqnum)++; > > + put_cpu_var(taskstats_seqnum); > > + > > + reply = genlmsg_put(skb, 0, seq, > > + family.id, 0, NLM_F_REQUEST, > > + cmd, family.version); > > Double check if you need NLM_F_REQUEST > It is not required, I will remove it > > + } else > > + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, > > + family.id, 0, info->nlhdr->nlmsg_flags, > > + info->genlhdr->cmd, family.version); > > A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the > right thing? > > I will change it > > > > > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) > > +{ > > + int rc; > > + struct sk_buff *rep_skb; > > + struct taskstats stats; > > + void *reply; > > + > > + memset(&stats, 0, sizeof(stats)); > > + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply); > > Same comment as before: a response to a GET is a NEW; so > info->genlhdr->cmd doesnt seem right. > I will change it > > + if (rc < 0) > > + return rc; > > + > > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { > > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); > > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); > > + rc = fill_pid((pid_t)pid, NULL, &stats); > > + if (rc < 0) > > + return rc; > > + } > > + > > + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { > > + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); > > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); > > + rc = fill_tgid((pid_t)tgid, NULL, &stats); > > + if (rc < 0) > > + return rc; > > + } > > + > > Should there be at least either a pid or tgid? If yes, you need to > validate here... > Yes, you are correct. One of my test cases caught it too.. But I did not want to untidy the code with if-else's which will keep growing if the attributes change in the future. I just followed the controller example. I will change it and validate it. Currently if the attribute is not valid, a stat of all zero's is returned back. > > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > > + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); > > + > > +nla_put_failure: > > + return genlmsg_cancel(rep_skb, reply); > > + > > As a general comment double check your logic for errors; if you already > have stashed something in the skb, you need to remove it etc. > Wouldn't genlmsg_cancel() take care of cleaning all attributes? > > +} > > + > > + > > +/* Send pid data out on exit */ > > +void taskstats_exit_pid(struct task_struct *tsk) > > +{ > > + int rc; > > + struct sk_buff *rep_skb; > > + void *reply; > > + struct taskstats stats; > > + > > + /* > > + * tasks can start to exit very early. Ensure that the family > > + * is registered before notifications are sent out > > + */ > > + if (!family_registered) > > + return; > > + > > + memset(&stats, 0, sizeof(stats)); > > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); > > + if (rc < 0) > > + return; > > + > > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); > > + rc = fill_pid(tsk->pid, tsk, &stats); > > + if (rc < 0) > > + return; > > + > > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > > + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); > > + > > + if (rc || thread_group_empty(tsk)) > > + return; > > + > > + /* Send tgid data too */ > > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); > > + if (rc < 0) > > + return; > > + > > A single message with PID+TGID sounds reasonable. Why two messages with > two stats? all you will need to do is get rid of the prepare_reply() > above and NLA_PUT_U32() below (just like you do in a response to a GET. > The reason for two stats is that for TGID, we return accumulated values (of all threads in the group) and for PID we return the value just for that pid. The return value is pid <stats for pid> tgid <accumulated stats for tgid> > > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); > > + rc = fill_tgid(tsk->tgid, tsk, &stats); > > + if (rc < 0) > > + return; > > + > > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); > > + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); > > + > > +nla_put_failure: > > + genlmsg_cancel(rep_skb, reply); > > +} > > > Other than the above comments - I believe you have it right. > Thanks, I will get back to you with the updated patch soon. > cheers, > jamal Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-23 15:41 ` Balbir Singh @ 2006-03-24 14:04 ` jamal 2006-03-24 14:54 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-03-24 14:04 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote: > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > > Should there be at least either a pid or tgid? If yes, you need to > > validate here... > > > > Yes, you are correct. One of my test cases caught it too.. But I did > not want to untidy the code with if-else's which will keep growing if > the attributes change in the future. I just followed the controller > example. I will change it and validate it. Currently if the attribute > is not valid, a stat of all zero's is returned back. > There are many ways to skin this cat. As an example: you could make pid and tgid global to the function and set them to 0. At the end of the if statements, you could check if at least one of them is set - if not you know none was passed and bail out. > > As a general comment double check your logic for errors; if you already > > have stashed something in the skb, you need to remove it etc. > > > > Wouldn't genlmsg_cancel() take care of cleaning all attributes? > it would for attribute setting - but not for others. As a general comment this is one of those areas where cutnpasting aka TheLinuxWay(tm) could result in errors. > > A single message with PID+TGID sounds reasonable. Why two messages with > > two stats? all you will need to do is get rid of the prepare_reply() > > above and NLA_PUT_U32() below (just like you do in a response to a GET. > > > > The reason for two stats is that for TGID, we return accumulated values > (of all threads in the group) and for PID we return the value just > for that pid. The return value is > Ok, I understand the dilemma now - but still not thrilled with having two messages. Perhaps you could have nesting of TLVs? This is widely used in the net code for example i.e: TLV = TASKSTATS_TYPE_TGID/PID TLV = TASKSTATS_TYPE_STATS Look at using nla_nest_start/end/cancel cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-24 14:04 ` jamal @ 2006-03-24 14:54 ` Balbir Singh 2006-03-25 1:19 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-24 14:54 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, Mar 24, 2006 at 09:04:21AM -0500, jamal wrote: > On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote: > > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > > > > Should there be at least either a pid or tgid? If yes, you need to > > > validate here... > > > > > > > Yes, you are correct. One of my test cases caught it too.. But I did > > not want to untidy the code with if-else's which will keep growing if > > the attributes change in the future. I just followed the controller > > example. I will change it and validate it. Currently if the attribute > > is not valid, a stat of all zero's is returned back. > > > > There are many ways to skin this cat. > As an example: you could make pid and tgid global to the function and > set them to 0. At the end of the if statements, you could check if at > least one of them is set - if not you know none was passed and bail out. The latest patch does fix it this issue. In the Changelog 6. taskstats_send_stats() now validates the command attributes and ensures that it either gets a PID or a TGID. If it gets both simultaneously the PID stats are sent. Is this change ok with you? > > > > As a general comment double check your logic for errors; if you already > > > have stashed something in the skb, you need to remove it etc. > > > > > > > Wouldn't genlmsg_cancel() take care of cleaning all attributes? > > > > it would for attribute setting - but not for others. As a general > comment this is one of those areas where cutnpasting aka TheLinuxWay(tm) > could result in errors. :-) I understand. What I have done is moved all the NLA_PUT_U32 to after verifying the return values of functions fill_*(). That way we do not stash anything into the skb if there are pending errors. > > > > > A single message with PID+TGID sounds reasonable. Why two messages with > > > two stats? all you will need to do is get rid of the prepare_reply() > > > above and NLA_PUT_U32() below (just like you do in a response to a GET. > > > > > > > The reason for two stats is that for TGID, we return accumulated values > > (of all threads in the group) and for PID we return the value just > > for that pid. The return value is > > > > Ok, I understand the dilemma now - but still not thrilled with having > two messages. > Perhaps you could have nesting of TLVs? This is widely used in the net > code for example > i.e: > > TLV = TASKSTATS_TYPE_TGID/PID > TLV = TASKSTATS_TYPE_STATS > > Look at using nla_nest_start/end/cancel Hmm... Would it be ok to send one message with the following format 1. TLV=TASKSTATS_TYPE_PID 2. TLV=TASKSTATS_TYPE_STATS 3. TLV=TASKSTATS_TYPE_TGID 4. TLV=TASKSTATS_TYPE_STATS It would still be one message, except that 3 and 4 would be optional. What do you think? > > cheers, > jamal Thanks for your comments, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-24 14:54 ` Balbir Singh @ 2006-03-25 1:19 ` jamal 2006-03-25 9:41 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-03-25 1:19 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote: > Hmm... Would it be ok to send one message with the following format > > 1. TLV=TASKSTATS_TYPE_PID > 2. TLV=TASKSTATS_TYPE_STATS > 3. TLV=TASKSTATS_TYPE_TGID > 4. TLV=TASKSTATS_TYPE_STATS > > It would still be one message, except that 3 and 4 would be optional. > What do you think? > No, that wont work since #2 and #4 are basically the same TLV. [Recall that "T" is used to index an array]. Your other alternative is to have #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS although that would smell a little. Dont be afraid to do the nest, it will be a little painful initially but i am sure once you figure it out you will appreciate it. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 1:19 ` jamal @ 2006-03-25 9:41 ` Balbir Singh 2006-03-25 12:52 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-25 9:41 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, Mar 24, 2006 at 08:19:25PM -0500, jamal wrote: > On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote: > > > Hmm... Would it be ok to send one message with the following format > > > > 1. TLV=TASKSTATS_TYPE_PID > > 2. TLV=TASKSTATS_TYPE_STATS > > 3. TLV=TASKSTATS_TYPE_TGID > > 4. TLV=TASKSTATS_TYPE_STATS > > > > It would still be one message, except that 3 and 4 would be optional. > > What do you think? > > > > No, that wont work since #2 and #4 are basically the same TLV. [Recall > that "T" is used to index an array]. Your other alternative is to have > #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS > although that would smell a little. > Dont be afraid to do the nest, it will be a little painful initially but > i am sure once you figure it out you will appreciate it. > Thanks for the advice, I will dive into nesting. I could not find any in tree users who use nesting, so I have a few questions nla_nest_start() accepts two parameters an skb and an attribute type. Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to contain the nested attributes TASKSTATS_TYPE_AGGR TASKSTATS_TYPE_PID/TGID TASKSTATS_TYPE_STATS but this will lead to TASKSTATS_TYPE_AGGR TASKSTATS_TYPE_PID TASKSTATS_TYPE_STATS TASKSTATS_TYPE_AGGR TASKSTATS_TYPE_TGID TASKSTATS_TYPE_STATS being returned from taskstats_exit_pid(). The other option is to nest TASKSTATS_TYPE_PID/TGID TASKSTATS_TYPE_STATS but the problem with this approach is, nla_len contains the length of all attributes including the nested attribute. So it is hard to find the offset of TASKSTATS_TYPE_STATS in the buffer. Do I understand NLA nesting at all? May be I am missing something obvious. Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 9:41 ` Balbir Singh @ 2006-03-25 12:52 ` jamal 2006-03-25 15:36 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-03-25 12:52 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote: > > Thanks for the advice, I will dive into nesting. I could not find any > in tree users who use nesting, so I have a few questions > Hrm - I have to say i am suprised theres nothing; i could have sworn Thomas had done some conversions already. > nla_nest_start() accepts two parameters an skb and an attribute type. > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to > contain the nested attributes > > TASKSTATS_TYPE_AGGR > TASKSTATS_TYPE_PID/TGID > TASKSTATS_TYPE_STATS > > > but this will lead to > > TASKSTATS_TYPE_AGGR > TASKSTATS_TYPE_PID > TASKSTATS_TYPE_STATS > TASKSTATS_TYPE_AGGR > TASKSTATS_TYPE_TGID > TASKSTATS_TYPE_STATS > > being returned from taskstats_exit_pid(). > no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice. Again invoke the rule i cited earlier. What you could do instead is a second AGGR; and your nesting would be: TASKSTATS_TYPE_AGGR1 <--- nest start with this type TASKSTATS_TYPE_PID <-- NLA_U32_PUT TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE <-- nest end of TASKSTATS_TYPE_AGGR1 TASKSTATS_TYPE_AGGR2 <--- nest start with this type TASKSTATS_TYPE_TGID <-- NLA_U32_PUT TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE <-- nest end of TASKSTATS_TYPE_AGGR2 > The other option is to nest > > TASKSTATS_TYPE_PID/TGID > TASKSTATS_TYPE_STATS > The advantage being you dont introduce another T. > but the problem with this approach is, nla_len contains the length of > all attributes including the nested attribute. So it is hard to find > the offset of TASKSTATS_TYPE_STATS in the buffer. > So you would distinguish the two as have something like: TASKSTATS_TYPE_PID u32 pid TASKSTATS_TYPE_STATS TASKSTATS_TYPE_TGID u32 tgid TASKSTATS_TYPE_STATS or TASKSTATS_TYPE_PID u32 pid TASKSTATS_TYPE_TGID u32 tgid both should be fine. The difference between the two is the length in the second case will be 4 and in the other case will be larger. But come to think of it, this will introduce unneeded semantics; you have very few items to do, so forget it. Go with scheme #1 but change the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 12:52 ` jamal @ 2006-03-25 15:36 ` Balbir Singh 2006-03-25 17:48 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-25 15:36 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote: > On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote: > > > > > Thanks for the advice, I will dive into nesting. I could not find any > > in tree users who use nesting, so I have a few questions > > > > Hrm - I have to say i am suprised theres nothing; i could have sworn > Thomas had done some conversions already. I used cscope to check for global references and callers of nla_nest_.*. I could not find anything. > > > nla_nest_start() accepts two parameters an skb and an attribute type. > > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to > > contain the nested attributes > > > > TASKSTATS_TYPE_AGGR > > TASKSTATS_TYPE_PID/TGID > > TASKSTATS_TYPE_STATS > > > > > > but this will lead to > > > > TASKSTATS_TYPE_AGGR > > TASKSTATS_TYPE_PID > > TASKSTATS_TYPE_STATS > > TASKSTATS_TYPE_AGGR > > TASKSTATS_TYPE_TGID > > TASKSTATS_TYPE_STATS > > > > being returned from taskstats_exit_pid(). > > > > no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice. > Again invoke the rule i cited earlier. Yes, thats why I wanted to point it out to you. Thanks for explaining the rule. > What you could do instead is a second AGGR; and your nesting would be: > > TASKSTATS_TYPE_AGGR1 <--- nest start with this type > TASKSTATS_TYPE_PID <-- NLA_U32_PUT > TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE > <-- nest end of TASKSTATS_TYPE_AGGR1 > TASKSTATS_TYPE_AGGR2 <--- nest start with this type > TASKSTATS_TYPE_TGID <-- NLA_U32_PUT > TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE > <-- nest end of TASKSTATS_TYPE_AGGR2 > > > The other option is to nest > > > > TASKSTATS_TYPE_PID/TGID > > TASKSTATS_TYPE_STATS > > > > The advantage being you dont introduce another T. > > > but the problem with this approach is, nla_len contains the length of > > all attributes including the nested attribute. So it is hard to find > > the offset of TASKSTATS_TYPE_STATS in the buffer. > > > > So you would distinguish the two as have something like: > > TASKSTATS_TYPE_PID > u32 pid > TASKSTATS_TYPE_STATS > TASKSTATS_TYPE_TGID > u32 tgid > TASKSTATS_TYPE_STATS > or > TASKSTATS_TYPE_PID > u32 pid > TASKSTATS_TYPE_TGID > u32 tgid > > both should be fine. The difference between the two is the length in the > second case will be 4 and in the other case will be larger. > > But come to think of it, this will introduce unneeded semantics; you > have very few items to do, so forget it. Go with scheme #1 but change > the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID. > I prefer #1 as well. The overloaded use of the same type with different lengths can be confusing. > cheers, > jamal > Here is another attempt (one more iteration) at trying to get it right. Thank you for your patience and help in getting it right. Changelog --------- As discussed in our email. Thanks, Balbir Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> --- include/linux/delayacct.h | 11 + include/linux/taskstats.h | 113 +++++++++++++++++ init/Kconfig | 16 ++ kernel/Makefile | 1 kernel/delayacct.c | 44 ++++++ kernel/taskstats.c | 291 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 473 insertions(+), 3 deletions(-) diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h --- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530 @@ -15,6 +15,7 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/taskstats.h> #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ @@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct extern void __delayacct_blkio_start(void); extern void __delayacct_blkio_end(void); extern unsigned long long __delayacct_blkio_ticks(struct task_struct *); +extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *); static inline void delayacct_tsk_init(struct task_struct *tsk) { @@ -72,4 +74,13 @@ static inline unsigned long long delayac return 0; } #endif /* CONFIG_TASK_DELAY_ACCT */ +#ifdef CONFIG_TASKSTATS +static inline int delayacct_add_tsk(struct taskstats *d, + struct task_struct *tsk) +{ + if (!tsk->delays) + return -EINVAL; + return __delayacct_add_tsk(d, tsk); +} +#endif #endif /* _LINUX_TASKDELAYS_H */ diff -puN /dev/null include/linux/taskstats.h --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-25 20:56:55.000000000 +0530 @@ -0,0 +1,113 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * To create the next version, bump up the taskstats_version variable + * and delineate the start of newly added fields with a comment indicating + * the version number. + */ + +#define TASKSTATS_VERSION 1 +#define TASKSTATS_NOPID -1 + +struct taskstats { + /* Maintain 64-bit alignment while extending */ + /* Version 1 */ + + /* XXX_count is number of delay values recorded. + * XXX_total is corresponding cumulative delay in nanoseconds + */ + +#define TASKSTATS_NOCPUSTATS 1 + __u64 cpu_count; + __u64 cpu_delay_total; /* wait, while runnable, for cpu */ + __u64 blkio_count; + __u64 blkio_delay_total; /* sync,block io completion wait*/ + __u64 swapin_count; + __u64 swapin_delay_total; /* swapin page fault wait*/ + + __u64 cpu_run_total; /* cpu running time + * no count available/provided */ +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. New commands should only be inserted at the enum's end + */ + +enum { + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ + TASKSTATS_CMD_GET, /* user->kernel request */ + TASKSTATS_CMD_NEW, /* kernel->user event */ + __TASKSTATS_CMD_MAX, +}; + +#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1) + +enum { + TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */ + TASKSTATS_TYPE_PID, /* Process id */ + TASKSTATS_TYPE_TGID, /* Thread group id */ + TASKSTATS_TYPE_STATS, /* taskstats structure */ + TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */ + TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */ + __TASKSTATS_TYPE_MAX, +}; + +#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1) + +enum { + TASKSTATS_CMD_ATTR_UNSPEC = 0, + TASKSTATS_CMD_ATTR_PID, + TASKSTATS_CMD_ATTR_TGID, + __TASKSTATS_CMD_ATTR_MAX, +}; + +#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1) + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#ifdef __KERNEL__ + +#include <linux/sched.h> + +enum { + TASKSTATS_MSG_UNICAST, /* send data only to requester */ + TASKSTATS_MSG_MULTICAST, /* send data to a group */ +}; + +#ifdef CONFIG_TASKSTATS +extern void taskstats_exit_pid(struct task_struct *); +#else +static inline void taskstats_exit_pid(struct task_struct *tsk) +{} +#endif + +#endif /* __KERNEL__ */ +#endif /* _LINUX_TASKSTATS_H */ diff -puN init/Kconfig~delayacct-genetlink init/Kconfig --- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530 @@ -158,11 +158,21 @@ config TASK_DELAY_ACCT in pages. Such statistics can help in setting a task's priorities relative to other tasks for cpu, io, rss limits etc. - Unlike BSD process accounting, this information is available - continuously during the lifetime of a task. - Say N if unsure. +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + depends on TASK_DELAY_ACCT + default y + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. Like BSD accounting, they are sent to user + space on task exit. + + Say Y if unsure. + config SYSCTL bool "Sysctl support" ---help--- diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c --- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-25 20:56:55.000000000 +0530 @@ -1,6 +1,7 @@ /* delayacct.c - per-task delay accounting * * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * Copyright (C) Balbir Singh, IBM Corp. 2006 * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License @@ -16,9 +17,12 @@ #include <linux/time.h> #include <linux/sysctl.h> #include <linux/delayacct.h> +#include <linux/taskstats.h> +#include <linux/mutex.h> int delayacct_on = 0; /* Delay accounting turned on/off */ kmem_cache_t *delayacct_cache; +static DEFINE_MUTEX(delayacct_exit_mutex); static int __init delayacct_setup_enable(char *str) { @@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st void __delayacct_tsk_exit(struct task_struct *tsk) { + /* + * Protect against racing thread group exits + */ + mutex_lock(&delayacct_exit_mutex); + taskstats_exit_pid(tsk); if (tsk->delays) { kmem_cache_free(delayacct_cache, tsk->delays); tsk->delays = NULL; } + mutex_unlock(&delayacct_exit_mutex); } /* @@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic spin_unlock(&tsk->delays->lock); return ret; } +#ifdef CONFIG_TASKSTATS +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ + nsec_t tmp; + struct timespec ts; + unsigned long t1,t2; + + /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */ + + tmp = (nsec_t)d->cpu_run_total ; + tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC; + d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp; + + /* No locking available for sched_info. Take snapshot first. */ + t1 = tsk->sched_info.pcnt; + t2 = tsk->sched_info.run_delay; + + d->cpu_count += t1; + + jiffies_to_timespec(t2, &ts); + tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts); + d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp; + + spin_lock(&tsk->delays->lock); + tmp = d->blkio_delay_total + tsk->delays->blkio_delay; + d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp; + tmp = d->swapin_delay_total + tsk->delays->swapin_delay; + d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp; + d->blkio_count += tsk->delays->blkio_count; + d->swapin_count += tsk->delays->swapin_count; + spin_unlock(&tsk->delays->lock); + return 0; +} +#endif /* CONFIG_TASKSTATS */ diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530 @@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o +obj-$(CONFIG_TASKSTATS) += taskstats.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is diff -puN /dev/null kernel/taskstats.c --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-25 20:57:19.000000000 +0530 @@ -0,0 +1,291 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include <linux/kernel.h> +#include <linux/taskstats.h> +#include <linux/delayacct.h> +#include <net/genetlink.h> +#include <asm/atomic.h> + +const int taskstats_version = TASKSTATS_VERSION; +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered = 0; + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .maxattr = TASKSTATS_CMD_ATTR_MAX, +}; + +static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = { + [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 }, + [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 }, +}; + + +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, + void **replyp, size_t size) +{ + struct sk_buff *skb; + void *reply; + + /* + * If new attributes are added, please revisit this allocation + */ + skb = nlmsg_new(size); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + family.id, 0, 0, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, 0, + cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + void *reply; + int rc; + + reply = genlmsg_data(genlhdr); + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event == TASKSTATS_MSG_MULTICAST) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + return genlmsg_unicast(skb, pid); +} + +static inline int fill_pid(pid_t pid, struct task_struct *pidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + rc = delayacct_add_tsk(stats, tsk); + put_task_struct(tsk); + + return rc; + +} + +static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + } + tsk = first; + do { + rc = delayacct_add_tsk(stats, tsk); + if (rc) + break; + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + return rc; +} + +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) +{ + int rc = 0; + struct sk_buff *rep_skb; + struct taskstats stats; + void *reply; + size_t size; + struct nlattr *na; + + /* + * Size includes space for nested attribute as well + * The returned data is of the format + * TASKSTATS_TYPE_AGGR_PID/TGID + * --> TASKSTATS_TYPE_PID/TGID + * --> TASKSTATS_TYPE_STATS + */ + size = nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + return rc; + + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); + rc = fill_pid((pid_t)pid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); + rc = fill_tgid((pid_t)tgid, NULL, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); + } else { + rc = -EINVAL; + goto err; + } + + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + nla_nest_end(rep_skb, na); + + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); + +nla_put_failure: + return genlmsg_cancel(rep_skb, reply); +err: + nlmsg_free(rep_skb); + return rc; +} + + +/* Send pid data out on exit */ +void taskstats_exit_pid(struct task_struct *tsk) +{ + int rc = 0; + struct sk_buff *rep_skb; + void *reply; + struct taskstats stats; + size_t size; + int is_thread_group = !thread_group_empty(tsk); + struct nlattr *na; + + /* + * tasks can start to exit very early. Ensure that the family + * is registered before notifications are sent out + */ + if (!family_registered) + return; + + /* + * Size includes space for nested attributes + */ + size = nla_total_size(sizeof(u32)) + + nla_total_size(sizeof(struct taskstats)) + nla_total_size(0); + + if (is_thread_group) + size = 2 * size; // PID + STATS + TGID + STATS + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size); + if (rc < 0) + return; + + rc = fill_pid(tsk->pid, tsk, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + nla_nest_end(rep_skb, na); + + if (!is_thread_group) { + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + return; + } + + memset(&stats, 0, sizeof(stats)); + rc = fill_tgid(tsk->tgid, tsk, &stats); + if (rc < 0) + goto err; + + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID); + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + nla_nest_end(rep_skb, na); + + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + +nla_put_failure: + genlmsg_cancel(rep_skb, reply); + return; +err: + nlmsg_free(rep_skb); +} + +static struct genl_ops taskstats_ops = { + .cmd = TASKSTATS_CMD_GET, + .doit = taskstats_send_stats, + .policy = taskstats_cmd_get_policy, +}; + +static int __init taskstats_init(void) +{ + if (genl_register_family(&family)) + return -EFAULT; + family_registered = 1; + + if (genl_register_ops(&family, &taskstats_ops)) + goto err; + + return 0; +err: + genl_unregister_family(&family); + family_registered = 0; + return -EFAULT; +} + +late_initcall(taskstats_init); _ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 15:36 ` Balbir Singh @ 2006-03-25 17:48 ` jamal 2006-03-25 18:22 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-03-25 17:48 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote: > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote: I didnt pay attention to failure paths etc; i suppose your testing should catch those. Getting there, a couple more comments: > +enum { > + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ > + TASKSTATS_CMD_GET, /* user->kernel request */ > + TASKSTATS_CMD_NEW, /* kernel->user event */ Should the comment read "kernel->user event/get-response" > + > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) > +{ > + > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); > + rc = fill_pid((pid_t)pid, NULL, &stats); > + if (rc < 0) > + goto err; > + > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { in regards to the elseif above: Could you not have both PID and TGID passed? From my earlier understanding it seemed legit, no? if answer is yes, then you will have to do your sizes + reply TLVs at the end. Also in regards to the nesting, isnt there a need for nla_nest_cancel in case of failures to add TLVs? cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 17:48 ` jamal @ 2006-03-25 18:22 ` Balbir Singh 2006-03-26 14:05 ` jamal 0 siblings, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-25 18:22 UTC (permalink / raw) To: hadi; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On 3/25/06, jamal <hadi@cyberus.ca> wrote: > On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote: > > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote: > > > I didnt pay attention to failure paths etc; i suppose your testing > should catch those. Getting there, a couple more comments: > Yes, I have tried several negative test cases. > > > +enum { > > + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ > > + TASKSTATS_CMD_GET, /* user->kernel request */ > > + TASKSTATS_CMD_NEW, /* kernel->user event */ > > Should the comment read "kernel->user event/get-response" > Yes, good catch. I will update the comment. > > > + > > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) > > +{ > > > > + > > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { > > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); > > + rc = fill_pid((pid_t)pid, NULL, &stats); > > + if (rc < 0) > > + goto err; > > + > > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID); > > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); > > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { > > in regards to the elseif above: > Could you not have both PID and TGID passed? From my earlier > understanding it seemed legit, no? if answer is yes, then you will have > to do your sizes + reply TLVs at the end. No, we cannot have both passed. If we pass both a PID and a TGID and then the code returns just the stats for the PID. > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in > case of failures to add TLVs? > I thought about it, but when I looked at the code of genlmsg_cancel() and nla_nest_cancel(). It seemed that genlmsg_cancel() should suffice. <snippet> static inline int genlmsg_cancel(struct sk_buff *skb, void *hdr) { return nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN); } static inline int nlmsg_cancel(struct sk_buff *skb, struct nlmsghdr *nlh) { skb_trim(skb, (unsigned char *) nlh - skb->data); return -1; } static inline int nla_nest_cancel(struct sk_buff *skb, struct nlattr *start) { if (start) skb_trim(skb, (unsigned char *) start - skb->data); return -1; } </snippet> genlmsg_cancel() seemed more generic, since it handles skb_trim from the nlmsghdr down to skb->data, where as nla_test_cancel() does it only from the start of the nested attributes to skb->data. Is my understanding correct? > cheers, > jamal > Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-25 18:22 ` Balbir Singh @ 2006-03-26 14:05 ` jamal 2006-03-26 16:40 ` Balbir Singh 0 siblings, 1 reply; 56+ messages in thread From: jamal @ 2006-03-26 14:05 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote: > No, we cannot have both passed. If we pass both a PID and a TGID and > then the code returns just the stats for the PID. > ok, that clears it then; i think you are ready to go. > > > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in > > case of failures to add TLVs? > > > > I thought about it, but when I looked at the code of genlmsg_cancel() > and nla_nest_cancel(). It seemed that genlmsg_cancel() should > suffice. > If your policy is to never send a message if anything fails, then you are fine. What would be really useful now that you understand this, is if you can help extending/cleaning the document i sent you. Or send me a table of contents of how it would have flowed better for you. cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-26 14:05 ` jamal @ 2006-03-26 16:40 ` Balbir Singh 0 siblings, 0 replies; 56+ messages in thread From: Balbir Singh @ 2006-03-26 16:40 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Sun, Mar 26, 2006 at 09:05:18AM -0500, jamal wrote: > On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote: > > > > No, we cannot have both passed. If we pass both a PID and a TGID and > > then the code returns just the stats for the PID. > > > > ok, that clears it then; i think you are ready to go. Cool! Thanks for all your help and review. > > > > > > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in > > > case of failures to add TLVs? > > > > > > > I thought about it, but when I looked at the code of genlmsg_cancel() > > and nla_nest_cancel(). It seemed that genlmsg_cancel() should > > suffice. > > > > If your policy is to never send a message if anything fails, then you > are fine. > > What would be really useful now that you understand this, is if you can > help extending/cleaning the document i sent you. Or send me a table of > contents of how it would have flowed better for you. I will start looking at the document and see what changes can be made. I will try and update the document from a genetlink programmers perspective i.e. things to know, avoid, etc. > > cheers, > jamal > > Thanks, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-23 14:04 ` jamal 2006-03-23 15:41 ` Balbir Singh @ 2006-03-24 1:32 ` Balbir Singh 2006-03-24 14:11 ` jamal 1 sibling, 1 reply; 56+ messages in thread From: Balbir Singh @ 2006-03-24 1:32 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > > Hi Balbir, > > Looking good. > This is a quick scan, so i didnt look at little details. > Some comments embedded. Hi, Jamal, I tried addressing your comments in this new version. Changelog --------- 1. Moved TASKSTATS_MSG_* to under #ifdef __KERNEL__ 2. Got rid of .hdrsize = 0 in genl_family family 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats) 4. Got rid of NLM_F_REQUEST, all flags passed down to user space are now 0 5. The response to TASKSTATS_CMD_GET is TASKSTATS_CMD_NEW 6. taskstats_send_stats() now validates the command attributes and ensures that it either gets a PID or a TGID. If it gets both simultaneously the PID stats are sent. 7. Do not put the PID/TGID into the skb if there are errors in fill_pid() or fill_tgid(). Thanks, Balbir Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> --- include/linux/delayacct.h | 11 + include/linux/taskstats.h | 111 ++++++++++++++++++++ init/Kconfig | 16 ++ kernel/Makefile | 1 kernel/delayacct.c | 44 +++++++ kernel/taskstats.c | 255 ++++++++++++++++++++++++++++++++++++++++++++++ 6 files changed, 435 insertions(+), 3 deletions(-) diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h --- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530 @@ -15,6 +15,7 @@ #define _LINUX_TASKDELAYS_H #include <linux/sched.h> +#include <linux/taskstats.h> #ifdef CONFIG_TASK_DELAY_ACCT extern int delayacct_on; /* Delay accounting turned on/off */ @@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct extern void __delayacct_blkio_start(void); extern void __delayacct_blkio_end(void); extern unsigned long long __delayacct_blkio_ticks(struct task_struct *); +extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *); static inline void delayacct_tsk_init(struct task_struct *tsk) { @@ -72,4 +74,13 @@ static inline unsigned long long delayac return 0; } #endif /* CONFIG_TASK_DELAY_ACCT */ +#ifdef CONFIG_TASKSTATS +static inline int delayacct_add_tsk(struct taskstats *d, + struct task_struct *tsk) +{ + if (!tsk->delays) + return -EINVAL; + return __delayacct_add_tsk(d, tsk); +} +#endif #endif /* _LINUX_TASKDELAYS_H */ diff -puN /dev/null include/linux/taskstats.h --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-24 06:49:24.000000000 +0530 @@ -0,0 +1,111 @@ +/* taskstats.h - exporting per-task statistics + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of version 2.1 of the GNU Lesser General Public License + * as published by the Free Software Foundation. + * + * This program is distributed in the hope that it would be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + */ + +#ifndef _LINUX_TASKSTATS_H +#define _LINUX_TASKSTATS_H + +/* Format for per-task data returned to userland when + * - a task exits + * - listener requests stats for a task + * + * The struct is versioned. Newer versions should only add fields to + * the bottom of the struct to maintain backward compatibility. + * + * To create the next version, bump up the taskstats_version variable + * and delineate the start of newly added fields with a comment indicating + * the version number. + */ + +#define TASKSTATS_VERSION 1 +#define TASKSTATS_NOPID -1 + +struct taskstats { + /* Maintain 64-bit alignment while extending */ + /* Version 1 */ + + /* XXX_count is number of delay values recorded. + * XXX_total is corresponding cumulative delay in nanoseconds + */ + +#define TASKSTATS_NOCPUSTATS 1 + __u64 cpu_count; + __u64 cpu_delay_total; /* wait, while runnable, for cpu */ + __u64 blkio_count; + __u64 blkio_delay_total; /* sync,block io completion wait*/ + __u64 swapin_count; + __u64 swapin_delay_total; /* swapin page fault wait*/ + + __u64 cpu_run_total; /* cpu running time + * no count available/provided */ +}; + + +#define TASKSTATS_LISTEN_GROUP 0x1 + +/* + * Commands sent from userspace + * Not versioned. New commands should only be inserted at the enum's end + */ + +enum { + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */ + TASKSTATS_CMD_GET, /* user->kernel request */ + TASKSTATS_CMD_NEW, /* kernel->user event */ + __TASKSTATS_CMD_MAX, +}; + +#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1) + +enum { + TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */ + TASKSTATS_TYPE_TGID, /* Thread group id */ + TASKSTATS_TYPE_PID, /* Process id */ + TASKSTATS_TYPE_STATS, /* taskstats structure */ + __TASKSTATS_TYPE_MAX, +}; + +#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1) + +enum { + TASKSTATS_CMD_ATTR_UNSPEC = 0, + TASKSTATS_CMD_ATTR_PID, + TASKSTATS_CMD_ATTR_TGID, + __TASKSTATS_CMD_ATTR_MAX, +}; + +#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1) + +/* NETLINK_GENERIC related info */ + +#define TASKSTATS_GENL_NAME "TASKSTATS" +#define TASKSTATS_GENL_VERSION 0x1 + +#ifdef __KERNEL__ + +#include <linux/sched.h> + +enum { + TASKSTATS_MSG_UNICAST, /* send data only to requester */ + TASKSTATS_MSG_MULTICAST, /* send data to a group */ +}; + +#ifdef CONFIG_TASKSTATS +extern void taskstats_exit_pid(struct task_struct *); +#else +static inline void taskstats_exit_pid(struct task_struct *tsk) +{} +#endif + +#endif /* __KERNEL__ */ +#endif /* _LINUX_TASKSTATS_H */ diff -puN init/Kconfig~delayacct-genetlink init/Kconfig --- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530 @@ -158,11 +158,21 @@ config TASK_DELAY_ACCT in pages. Such statistics can help in setting a task's priorities relative to other tasks for cpu, io, rss limits etc. - Unlike BSD process accounting, this information is available - continuously during the lifetime of a task. - Say N if unsure. +config TASKSTATS + bool "Export task/process statistics through netlink (EXPERIMENTAL)" + depends on TASK_DELAY_ACCT + default y + help + Export selected statistics for tasks/processes through the + generic netlink interface. Unlike BSD process accounting, the + statistics are available during the lifetime of tasks/processes as + responses to commands. Like BSD accounting, they are sent to user + space on task exit. + + Say Y if unsure. + config SYSCTL bool "Sysctl support" ---help--- diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c --- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-24 06:49:24.000000000 +0530 @@ -1,6 +1,7 @@ /* delayacct.c - per-task delay accounting * * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * Copyright (C) Balbir Singh, IBM Corp. 2006 * * This program is free software; you can redistribute it and/or modify it * under the terms of version 2.1 of the GNU Lesser General Public License @@ -16,9 +17,12 @@ #include <linux/time.h> #include <linux/sysctl.h> #include <linux/delayacct.h> +#include <linux/taskstats.h> +#include <linux/mutex.h> int delayacct_on = 0; /* Delay accounting turned on/off */ kmem_cache_t *delayacct_cache; +static DEFINE_MUTEX(delayacct_exit_mutex); static int __init delayacct_setup_enable(char *str) { @@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st void __delayacct_tsk_exit(struct task_struct *tsk) { + /* + * Protect against racing thread group exits + */ + mutex_lock(&delayacct_exit_mutex); + taskstats_exit_pid(tsk); if (tsk->delays) { kmem_cache_free(delayacct_cache, tsk->delays); tsk->delays = NULL; } + mutex_unlock(&delayacct_exit_mutex); } /* @@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic spin_unlock(&tsk->delays->lock); return ret; } +#ifdef CONFIG_TASKSTATS +int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk) +{ + nsec_t tmp; + struct timespec ts; + unsigned long t1,t2; + + /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */ + + tmp = (nsec_t)d->cpu_run_total ; + tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC; + d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp; + + /* No locking available for sched_info. Take snapshot first. */ + t1 = tsk->sched_info.pcnt; + t2 = tsk->sched_info.run_delay; + + d->cpu_count += t1; + + jiffies_to_timespec(t2, &ts); + tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts); + d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp; + + spin_lock(&tsk->delays->lock); + tmp = d->blkio_delay_total + tsk->delays->blkio_delay; + d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp; + tmp = d->swapin_delay_total + tsk->delays->swapin_delay; + d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp; + d->blkio_count += tsk->delays->blkio_count; + d->swapin_count += tsk->delays->swapin_count; + spin_unlock(&tsk->delays->lock); + return 0; +} +#endif /* CONFIG_TASKSTATS */ diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530 +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530 @@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/ obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o +obj-$(CONFIG_TASKSTATS) += taskstats.o ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y) # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is diff -puN /dev/null kernel/taskstats.c --- /dev/null 2004-06-24 23:34:38.000000000 +0530 +++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-24 06:50:03.000000000 +0530 @@ -0,0 +1,255 @@ +/* + * taskstats.c - Export per-task statistics to userland + * + * Copyright (C) Shailabh Nagar, IBM Corp. 2006 + * (C) Balbir Singh, IBM Corp. 2006 + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + */ + +#include <linux/kernel.h> +#include <linux/taskstats.h> +#include <linux/delayacct.h> +#include <net/genetlink.h> +#include <asm/atomic.h> + +const int taskstats_version = TASKSTATS_VERSION; +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 }; +static int family_registered = 0; + +static struct genl_family family = { + .id = GENL_ID_GENERATE, + .name = TASKSTATS_GENL_NAME, + .version = TASKSTATS_GENL_VERSION, + .maxattr = TASKSTATS_CMD_ATTR_MAX, +}; + +static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = { + [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 }, + [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 }, +}; + + +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp, + void **replyp) +{ + struct sk_buff *skb; + void *reply; + + /* + * If new attributes are added, please revisit this allocation + */ + skb = nlmsg_new((2 * sizeof(u32)) + sizeof(struct taskstats)); + if (!skb) + return -ENOMEM; + + if (!info) { + int seq = get_cpu_var(taskstats_seqnum)++; + put_cpu_var(taskstats_seqnum); + + reply = genlmsg_put(skb, 0, seq, + family.id, 0, 0, + cmd, family.version); + } else + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq, + family.id, 0, 0, + cmd, family.version); + if (reply == NULL) { + nlmsg_free(skb); + return -EINVAL; + } + + *skbp = skb; + *replyp = reply; + return 0; +} + +static int send_reply(struct sk_buff *skb, pid_t pid, int event) +{ + struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data); + void *reply; + int rc; + + reply = genlmsg_data(genlhdr); + + rc = genlmsg_end(skb, reply); + if (rc < 0) { + nlmsg_free(skb); + return rc; + } + + if (event == TASKSTATS_MSG_MULTICAST) + return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP); + return genlmsg_unicast(skb, pid); +} + +static inline int fill_pid(pid_t pid, struct task_struct *pidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk = pidtsk; + + if (!pidtsk) { + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (!tsk) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + get_task_struct(tsk); + read_unlock(&tasklist_lock); + } else + get_task_struct(tsk); + + rc = delayacct_add_tsk(stats, tsk); + put_task_struct(tsk); + + return rc; + +} + +static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk, + struct taskstats *stats) +{ + int rc; + struct task_struct *tsk, *first; + + first = tgidtsk; + read_lock(&tasklist_lock); + if (!first) { + first = find_task_by_pid(tgid); + if (!first) { + read_unlock(&tasklist_lock); + return -ESRCH; + } + } + tsk = first; + do { + rc = delayacct_add_tsk(stats, tsk); + if (rc) + break; + } while_each_thread(first, tsk); + read_unlock(&tasklist_lock); + + return rc; +} + +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info) +{ + int rc; + struct sk_buff *rep_skb; + struct taskstats stats; + void *reply; + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply); + if (rc < 0) + return rc; + + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) { + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]); + rc = fill_pid((pid_t)pid, NULL, &stats); + if (rc < 0) + return rc; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid); + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) { + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]); + rc = fill_tgid((pid_t)tgid, NULL, &stats); + if (rc < 0) + return rc; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid); + } else { + return -EINVAL; + } + + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST); + +nla_put_failure: + return genlmsg_cancel(rep_skb, reply); + +} + + +/* Send pid data out on exit */ +void taskstats_exit_pid(struct task_struct *tsk) +{ + int rc; + struct sk_buff *rep_skb; + void *reply; + struct taskstats stats; + + /* + * tasks can start to exit very early. Ensure that the family + * is registered before notifications are sent out + */ + if (!family_registered) + return; + + memset(&stats, 0, sizeof(stats)); + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); + if (rc < 0) + return; + + rc = fill_pid(tsk->pid, tsk, &stats); + if (rc < 0) + return; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + + if (rc || thread_group_empty(tsk)) + return; + + /* Send tgid data too */ + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply); + if (rc < 0) + return; + + rc = fill_tgid(tsk->tgid, tsk, &stats); + if (rc < 0) + return; + + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid); + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats); + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST); + +nla_put_failure: + genlmsg_cancel(rep_skb, reply); +} + +static struct genl_ops taskstats_ops = { + .cmd = TASKSTATS_CMD_GET, + .doit = taskstats_send_stats, + .policy = taskstats_cmd_get_policy, +}; + +static int __init taskstats_init(void) +{ + if (genl_register_family(&family)) + return -EFAULT; + family_registered = 1; + + if (genl_register_ops(&family, &taskstats_ops)) + goto err; + + return 0; +err: + genl_unregister_family(&family); + family_registered = 0; + return -EFAULT; +} + +late_initcall(taskstats_init); _ ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-24 1:32 ` Balbir Singh @ 2006-03-24 14:11 ` jamal 2006-03-24 14:19 ` jamal 2006-03-24 14:59 ` Balbir Singh 0 siblings, 2 replies; 56+ messages in thread From: jamal @ 2006-03-24 14:11 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote: > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats) Not the right size; the u32 covers the V part of TLV. The T = 16 bits and L = 16 bits. And if you nest TLVs, then it gets more interesting. Look at using proper macros instead of hard coding like you did. grep for something like RTA_SPACE and perhaps send a patch to make it generic for netlink.h cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-24 14:11 ` jamal @ 2006-03-24 14:19 ` jamal 2006-03-24 14:59 ` Balbir Singh 1 sibling, 0 replies; 56+ messages in thread From: jamal @ 2006-03-24 14:19 UTC (permalink / raw) To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, 2006-24-03 at 09:11 -0500, jamal wrote: > Look at using proper macros instead of hard coding like you did. > grep for something like RTA_SPACE and perhaps send a patch to make it > generic for netlink.h > actually Thomas already has this in netlink.h Look at using things like: nla_attr_size() make sure padding is taken care of etc (read: use the right macros). cheers, jamal ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting 2006-03-24 14:11 ` jamal 2006-03-24 14:19 ` jamal @ 2006-03-24 14:59 ` Balbir Singh 1 sibling, 0 replies; 56+ messages in thread From: Balbir Singh @ 2006-03-24 14:59 UTC (permalink / raw) To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev On Fri, Mar 24, 2006 at 09:11:58AM -0500, jamal wrote: > On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote: > > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote: > > > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats) > > Not the right size; the u32 covers the V part of TLV. The T = 16 bits > and L = 16 bits. And if you nest TLVs, then it gets more interesting. > > Look at using proper macros instead of hard coding like you did. > grep for something like RTA_SPACE and perhaps send a patch to make it > generic for netlink.h > > cheers, > jamal > My bad, but I was wondering why my test case did not segfault until I saw the implementation of nlmsg_new :-) I will fix this and use nla_total_size() to calculate the correct sizes including padding and TLV. Thanks again, Balbir ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting 2006-03-14 2:33 ` Matt Helsley 2006-03-14 2:48 ` jamal @ 2006-03-14 4:29 ` Shailabh Nagar 1 sibling, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 4:29 UTC (permalink / raw) To: Matt Helsley; +Cc: linux-kernel, Jamal, netdev Matt Helsley wrote: >On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote: ><snip> > > > >>Comments addressed (all in response to Jamal) >> >>- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE >> >> > >The enums for these are still in the patch. See below. > ><snip> > > > >>+/* >>+ * Commands sent from userspace >>+ * Not versioned. New commands should only be inserted at the enum's end >>+ */ >>+ >>+enum { >>+ TASKSTATS_CMD_UNSPEC, /* Reserved */ >>+ TASKSTATS_CMD_NONE, /* Not a valid cmd to send >>+ * Marks data sent on task/tgid exit */ >>+ TASKSTATS_CMD_LISTEN, /* Start listening */ >>+ TASKSTATS_CMD_IGNORE, /* Stop listening */ >> >> > >>From the description I thought you had eliminated these. > > Yup, typo. the commands aren't registered but the enums hang on. Will fix. --Shailabh ^ permalink raw reply [flat|nested] 56+ messages in thread
* [Patch 6/9] cpu delay collection 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (7 preceding siblings ...) 2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar @ 2006-03-14 1:01 ` Shailabh Nagar 2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH 9 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 1:01 UTC (permalink / raw) To: linux-kernel; +Cc: Nick Piggin delayacct-schedstats.patch Collect cpu delays through the task-related schedstats functions, extracted as common code to remove dependence of delay accounting on rest of schedstats. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> include/linux/sched.h | 8 +++++--- kernel/sched.c | 49 +++++++++++++++++++++++++++++++++++++------------ 2 files changed, 42 insertions(+), 15 deletions(-) Index: linux-2.6.16-rc5/include/linux/sched.h =================================================================== --- linux-2.6.16-rc5.orig/include/linux/sched.h 2006-03-11 07:41:39.000000000 -0500 +++ linux-2.6.16-rc5/include/linux/sched.h 2006-03-11 07:41:40.000000000 -0500 @@ -524,7 +524,7 @@ typedef struct prio_array prio_array_t; struct backing_dev_info; struct reclaim_state; -#ifdef CONFIG_SCHEDSTATS +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info { /* cumulative counters */ unsigned long cpu_time, /* time spent on the cpu */ @@ -536,8 +536,10 @@ struct sched_info { last_queued; /* when we were last queued to run */ }; +#ifdef CONFIG_SCHEDSTATS extern struct file_operations proc_schedstat_operations; -#endif +#endif /* CONFIG_SCHEDSTATS */ +#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */ #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info { @@ -735,7 +737,7 @@ struct task_struct { cpumask_t cpus_allowed; unsigned int time_slice, first_time_slice; -#ifdef CONFIG_SCHEDSTATS +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif Index: linux-2.6.16-rc5/kernel/sched.c =================================================================== --- linux-2.6.16-rc5.orig/kernel/sched.c 2006-03-11 07:41:39.000000000 -0500 +++ linux-2.6.16-rc5/kernel/sched.c 2006-03-11 07:41:40.000000000 -0500 @@ -473,9 +473,26 @@ struct file_operations proc_schedstat_op .release = single_release, }; +static inline void rq_sched_info_arrive(struct runqueue *rq, + unsigned long diff) +{ + if (rq) { + rq->rq_sched_info.run_delay += diff; + rq->rq_sched_info.pcnt++; + } +} + +static inline void rq_sched_info_depart(struct runqueue *rq, + unsigned long diff) +{ + if (rq) + rq->rq_sched_info.cpu_time += diff; +} # define schedstat_inc(rq, field) do { (rq)->field++; } while (0) # define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) #else /* !CONFIG_SCHEDSTATS */ +static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {} +static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {} # define schedstat_inc(rq, field) do { } while (0) # define schedstat_add(rq, field, amt) do { } while (0) #endif @@ -495,7 +512,19 @@ static inline runqueue_t *this_rq_lock(v return rq; } +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) + +static inline int sched_info_on(void) +{ #ifdef CONFIG_SCHEDSTATS + return 1; +#endif +#ifdef CONFIG_TASK_DELAY_ACCT + extern int delayacct_on; + return delayacct_on; +#endif +} + /* * Called when a process is dequeued from the active array and given * the cpu. We should note that with the exception of interactive @@ -524,7 +553,6 @@ static inline void sched_info_dequeued(t static void sched_info_arrive(task_t *t) { unsigned long now = jiffies, diff = 0; - struct runqueue *rq = task_rq(t); if (t->sched_info.last_queued) diff = now - t->sched_info.last_queued; @@ -533,11 +561,7 @@ static void sched_info_arrive(task_t *t) t->sched_info.last_arrival = now; t->sched_info.pcnt++; - if (!rq) - return; - - rq->rq_sched_info.run_delay += diff; - rq->rq_sched_info.pcnt++; + rq_sched_info_arrive(task_rq(t), diff); } /* @@ -557,8 +581,9 @@ static void sched_info_arrive(task_t *t) */ static inline void sched_info_queued(task_t *t) { - if (!t->sched_info.last_queued) - t->sched_info.last_queued = jiffies; + if (sched_info_on()) + if (!t->sched_info.last_queued) + t->sched_info.last_queued = jiffies; } /* @@ -567,13 +592,10 @@ static inline void sched_info_queued(tas */ static inline void sched_info_depart(task_t *t) { - struct runqueue *rq = task_rq(t); unsigned long diff = jiffies - t->sched_info.last_arrival; t->sched_info.cpu_time += diff; - - if (rq) - rq->rq_sched_info.cpu_time += diff; + rq_sched_info_depart(task_rq(t), diff); } /* @@ -585,6 +607,9 @@ static inline void sched_info_switch(tas { struct runqueue *rq = task_rq(prev); + if (!sched_info_on()) + return; + /* * prev now departs the cpu. It's not interesting to record * stats about how efficient we were at scheduling the idle ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Per-task delay accounting 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar ` (8 preceding siblings ...) 2006-03-14 1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar @ 2006-03-14 19:28 ` Greg KH 2006-03-14 20:49 ` Shailabh Nagar 2006-03-23 15:16 ` [Patch 0/9] Performance Shailabh Nagar 9 siblings, 2 replies; 56+ messages in thread From: Greg KH @ 2006-03-14 19:28 UTC (permalink / raw) To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: > This is the next iteration of the delay accounting patches > last posted at > http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html Do you have any benchmark numbers with this patch applied and with it not applied? Last I heard it was a measurable decrease for some "important" benchmark results... thanks, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Per-task delay accounting 2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH @ 2006-03-14 20:49 ` Shailabh Nagar 2006-03-14 21:24 ` Greg KH 2006-03-23 15:16 ` [Patch 0/9] Performance Shailabh Nagar 1 sibling, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 20:49 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, Nick Piggin Greg KH wrote: >On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: > > >>This is the next iteration of the delay accounting patches >>last posted at >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html >> >> > >Do you have any benchmark numbers with this patch applied and with it >not applied? > None yet. Wanted to iron out the collection/utility aspects a bit before going into the performance impact. But this seems as good a time as any to collect some stats using the usual suspects lmbench, kernbench, hackbench etc. > Last I heard it was a measurable decrease for some >"important" benchmark results... > > Might have been from an older iteration where schedstats was fully enabled. But no point speculating....will run with this set of patches and see what shakes out. One point about the overhead is that it depends on the frequency with which data is collected. So a proper test would probably be a comparison of a non-patched kernel with a) patches applied but delay accounting not turned on at boot i.e. cost of the checks b) delay accounting turned on but not being read c) delay accounting turned on and data read for all tasks at some "reasonable" rate Will that be good ? Other suggestions welcome. >thanks, > >greg k-h > > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Per-task delay accounting 2006-03-14 20:49 ` Shailabh Nagar @ 2006-03-14 21:24 ` Greg KH 2006-03-14 21:59 ` Shailabh Nagar 0 siblings, 1 reply; 56+ messages in thread From: Greg KH @ 2006-03-14 21:24 UTC (permalink / raw) To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote: > Greg KH wrote: > > >On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: > > > > > >>This is the next iteration of the delay accounting patches > >>last posted at > >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html > >> > >> > > > >Do you have any benchmark numbers with this patch applied and with it > >not applied? > > > None yet. Wanted to iron out the collection/utility aspects a bit before > going into > the performance impact. > > But this seems as good a time as any to collect some stats > using the usual suspects lmbench, kernbench, hackbench etc. > > >Last I heard it was a measurable decrease for some > >"important" benchmark results... > > > > > Might have been from an older iteration where schedstats was fully enabled. > But no point speculating....will run with this set of patches and see > what shakes out. > > One point about the overhead is that it depends on the frequency with > which data is > collected. So a proper test would probably be a comparison of a > non-patched kernel > with > a) patches applied but delay accounting not turned on at boot i.e. cost > of the checks > b) delay accounting turned on but not being read This is probably the most important one, as that is what distros will be looking at. They will have to enable the option, but will not "turn it on". > c) delay accounting turned on and data read for all tasks at some > "reasonable" rate > > Will that be good ? Other suggestions welcome. How about real benchmarks? The ones that the big companies look at? I know you have access to them :) thanks, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Per-task delay accounting 2006-03-14 21:24 ` Greg KH @ 2006-03-14 21:59 ` Shailabh Nagar 0 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-14 21:59 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, Nick Piggin Greg KH wrote: >On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote: > > >>Greg KH wrote: >> >> >> >>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: >>> >>> >>> >>> >>>>This is the next iteration of the delay accounting patches >>>>last posted at >>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html >>>> >>>> >>>> >>>> >>>Do you have any benchmark numbers with this patch applied and with it >>>not applied? >>> >>> >>> >>None yet. Wanted to iron out the collection/utility aspects a bit before >>going into >>the performance impact. >> >>But this seems as good a time as any to collect some stats >>using the usual suspects lmbench, kernbench, hackbench etc. >> >> >> >>>Last I heard it was a measurable decrease for some >>>"important" benchmark results... >>> >>> >>> >>> >>Might have been from an older iteration where schedstats was fully enabled. >>But no point speculating....will run with this set of patches and see >>what shakes out. >> >>One point about the overhead is that it depends on the frequency with >>which data is >>collected. So a proper test would probably be a comparison of a >>non-patched kernel >>with >>a) patches applied but delay accounting not turned on at boot i.e. cost >>of the checks >>b) delay accounting turned on but not being read >> >> > >This is probably the most important one, as that is what distros will be >looking at. They will have to enable the option, but will not "turn it >on". > > I guess you meant a), not b) but yes, will run them in all these modes. > > >>c) delay accounting turned on and data read for all tasks at some >>"reasonable" rate >> >>Will that be good ? Other suggestions welcome. >> >> > >How about real benchmarks? The ones that the big companies look at? I >know you have access to them :) > > Hmm...though you also know, from working for some "big company", that it might take a while to get such data since one has to stand in queue :-) I'll try, and also explore the OSDL STP's DBT tests. Thanks, Shailabh >thanks, > >greg k-h > > ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Performance 2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH 2006-03-14 20:49 ` Shailabh Nagar @ 2006-03-23 15:16 ` Shailabh Nagar 2006-03-25 2:38 ` Greg KH 1 sibling, 1 reply; 56+ messages in thread From: Shailabh Nagar @ 2006-03-23 15:16 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, Nick Piggin Greg KH wrote: > On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: > >>This is the next iteration of the delay accounting patches >>last posted at >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html > > > Do you have any benchmark numbers with this patch applied and with it > not applied? Last I heard it was a measurable decrease for some > "important" benchmark results... > > thanks, > > greg k-h Here are some numbers for the latest set of posted patches using microbenchmarks hackbench, kernbench and lmbench. I was trying to get the real/big benchmark numbers too but it looks like getting a run whose numbers can be trusted will take a bit longer than expected. Preliminary runs of transaction processing benchmarks indicate that overhead actually decreases with the patch (as also seen in some of the lmbench numbers below). --Shailabh Results highlights - Configuring delay accounting adds < 0.5% overhead in most cases and even reduces overhead in some cases - Enabling delay accounting has similar results with a maximum overhead of 1.2% for hackbench , most other overheads < 1% and reduction in overhead in some cases Base Vanilla 2.6.16-rc6 kernel without any patches applied +patch Delay accounting configured but not enabled at boot +patch+enable Delay accounting enabled at boot but no stats read Hackbench --------- 200 groups, using sockets Elapsed time, in seconds, lower better %Ovhd Time Base 0 12.468 +patch 0.4% 12.523 +patch+enable 1.2% 12.622 Kernbench --------- Average of 5 iterations Elapsed time, in seconds, lower better %Ovhd Elapsed Base 0 195.776 +patch 0.2% 196.246 +patch+enable 0.3% 196.282 Lmbench ------- Processor, Processes - times in microseconds - smaller is better ---------------------------------------------------------------- Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos TCP inst hndl proc proc proc --------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ---- base Linux 2.6.16- 2783 0.17 0.33 5.17 6.49 13.4 0.64 2.61 146. 610. 9376 +patch Linux 2.6.16- 2781 0.17 0.32 4.75 5.85 13.0 0.64 2.62 145. 628. 9393 +patch+en Linux 2.6.16- 2784 0.17 0.32 4.71 6.14 13.4 0.64 2.60 150. 616. 9402 Context switching - times in microseconds - smaller is better ------------------------------------------------------------- Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw --------- ------------- ----- ------ ------ ------ ------ ------- ------- base Linux 2.6.16- 4.340 4.9600 7.3300 6.5700 30.3 10.4 36.0 +patch Linux 2.6.16- 4.390 4.9800 7.3100 6.5900 29.7 9.62000 35.8 +patch+en Linux 2.6.16- 4.560 5.0800 7.2400 5.6900 22.7 10.3 33.8 *Local* Communication latencies in microseconds - smaller is better ------------------------------------------------------------------- Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn --------- ------------- ----- ----- ---- ----- ----- ----- ----- ---- base Linux 2.6.16- 4.340 15.9 12.2 18.3 24.9 21.5 29.1 45.3 +patch Linux 2.6.16- 4.390 15.7 11.8 18.6 22.2 22.0 29.1 44.8 +patch+en Linux 2.6.16- 4.560 15.6 12.1 18.9 25.3 21.9 27.1 45.1 File & VM system latencies in microseconds - smaller is better -------------------------------------------------------------- Host OS 0K File 10K File Mmap Prot Page Create Delete Create Delete Latency Fault Fault --------- ------------- ------ ------ ------ ------ ------- ----- ----- base Linux 2.6.16- 39.8 58.0 112.0 82.6 8417.0 0.838 2.00000 +patch Linux 2.6.16- 39.6 58.2 111.0 82.3 8392.0 0.864 2.00000 +patch+en Linux 2.6.16- 39.6 59.1 112.8 83.2 8308.0 0.821 2.00000 *Local* Communication bandwidths in MB/s - bigger is better ----------------------------------------------------------- Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem UNIX reread reread (libc) (hand) read write --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- ----- base Linux 2.6.16- 676. 616. 620. 1658.0 2030.6 759.6 825.9 2032 1177. +patch Linux 2.6.16- 627. 165. 616. 1649.9 2030.9 766.1 834.1 2030 1187. +patch+en Linux 2.6.16- 633. 148. 603. 1569.7 2030.9 757.2 835.3 2030 1174. Memory latencies in nanoseconds - smaller is better (WARNING - may not be correct, check graphs) --------------------------------------------------- Host OS Mhz L1 $ L2 $ Main mem Guesses --------- ------------- ---- ----- ------ -------- ------- base Linux 2.6.16- 2783 0.719 6.5960 110.5 +patch Linux 2.6.16- 2781 0.720 6.5980 111.0 +patch+en Linux 2.6.16- 2784 0.720 6.5970 110.7 ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Performance 2006-03-23 15:16 ` [Patch 0/9] Performance Shailabh Nagar @ 2006-03-25 2:38 ` Greg KH 2006-03-27 18:28 ` Shailabh Nagar 0 siblings, 1 reply; 56+ messages in thread From: Greg KH @ 2006-03-25 2:38 UTC (permalink / raw) To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote: > Greg KH wrote: > > On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: > > > >>This is the next iteration of the delay accounting patches > >>last posted at > >> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html > > > > > > Do you have any benchmark numbers with this patch applied and with it > > not applied? Last I heard it was a measurable decrease for some > > "important" benchmark results... > > > > thanks, > > > > greg k-h > > Here are some numbers for the latest set of posted patches > using microbenchmarks hackbench, kernbench and lmbench. > > I was trying to get the real/big benchmark numbers too but > it looks like getting a run whose numbers can be trusted > will take a bit longer than expected. Preliminary runs of > transaction processing benchmarks indicate that overhead > actually decreases with the patch (as also seen in some of > the lmbench numbers below). That's good to hear. But your .5% is noticable on the +patch results, which I don't think people who take performance issues seriously will like (that's real money for the big vendors.) And distros will be forced to enable that option in their kernels, so those vendors will have to get that percentage back some other way... thanks, greg k-h ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [Patch 0/9] Performance 2006-03-25 2:38 ` Greg KH @ 2006-03-27 18:28 ` Shailabh Nagar 0 siblings, 0 replies; 56+ messages in thread From: Shailabh Nagar @ 2006-03-27 18:28 UTC (permalink / raw) To: Greg KH; +Cc: linux-kernel, Nick Piggin Greg KH wrote: >On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote: > > >>Greg KH wrote: >> >> >>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote: >>> >>> >>> >>>>This is the next iteration of the delay accounting patches >>>>last posted at >>>> http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html >>>> >>>> >>>Do you have any benchmark numbers with this patch applied and with it >>>not applied? Last I heard it was a measurable decrease for some >>>"important" benchmark results... >>> >>>thanks, >>> >>>greg k-h >>> >>> >>Here are some numbers for the latest set of posted patches >>using microbenchmarks hackbench, kernbench and lmbench. >> >>I was trying to get the real/big benchmark numbers too but >>it looks like getting a run whose numbers can be trusted >>will take a bit longer than expected. Preliminary runs of >>transaction processing benchmarks indicate that overhead >>actually decreases with the patch (as also seen in some of >>the lmbench numbers below). >> >> > >That's good to hear. > >But your .5% is noticable on the +patch results, which I don't think >people who take performance issues seriously will like (that's real >money for the big vendors.) And distros will be forced to enable that >option in their kernels, so those vendors will have to get that >percentage back some other way... > > Sorry, missed your response. Yes, even the slight deterioration might be an issue for distros. We discovered one memcpy, lack of use of "__read_mostly" and another "unlikely" that might help with the 0.5% but other than that don't see any major way of reducing overhead further for the +patch case. I'll be posting another iteration of the patches with these changes and corresponding results (as well as the changes for the netlink interface which has been stabilized after incorporating Jamal's comments). Lets see what that does. Thanks, Shailabh >thanks, > >greg k-h > > ^ permalink raw reply [flat|nested] 56+ messages in thread
end of thread, other threads:[~2006-03-27 18:28 UTC | newest] Thread overview: 56+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-03-14 0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar 2006-03-14 0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar 2006-03-14 1:01 ` Lee Revell 2006-03-14 1:05 ` Shailabh Nagar 2006-03-14 1:12 ` Lee Revell 2006-03-14 3:42 ` Balbir Singh 2006-03-14 4:26 ` Shailabh Nagar 2006-03-14 6:50 ` Balbir Singh 2006-03-15 10:23 ` Arjan van de Ven 2006-03-15 10:28 ` Balbir Singh 2006-03-14 0:45 ` Patch 2/9] Initialization Shailabh Nagar 2006-03-14 10:54 ` Jes Sorensen 2006-03-14 15:20 ` Shailabh Nagar 2006-03-15 10:24 ` Arjan van de Ven 2006-03-15 12:37 ` Alan Cox 2006-03-15 15:53 ` Shailabh Nagar 2006-03-14 0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar 2006-03-15 10:27 ` Arjan van de Ven 2006-03-15 16:27 ` Shailabh Nagar 2006-03-14 0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar 2006-03-14 0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar 2006-03-14 0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar 2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar 2006-03-26 16:44 ` Balbir Singh 2006-03-26 17:06 ` jamal 2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar 2006-03-14 2:29 ` jamal 2006-03-14 2:33 ` Matt Helsley 2006-03-14 2:48 ` jamal 2006-03-14 4:18 ` Shailabh Nagar 2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh 2006-03-23 14:04 ` jamal 2006-03-23 15:41 ` Balbir Singh 2006-03-24 14:04 ` jamal 2006-03-24 14:54 ` Balbir Singh 2006-03-25 1:19 ` jamal 2006-03-25 9:41 ` Balbir Singh 2006-03-25 12:52 ` jamal 2006-03-25 15:36 ` Balbir Singh 2006-03-25 17:48 ` jamal 2006-03-25 18:22 ` Balbir Singh 2006-03-26 14:05 ` jamal 2006-03-26 16:40 ` Balbir Singh 2006-03-24 1:32 ` Balbir Singh 2006-03-24 14:11 ` jamal 2006-03-24 14:19 ` jamal 2006-03-24 14:59 ` Balbir Singh 2006-03-14 4:29 ` Shailabh Nagar 2006-03-14 1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar 2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH 2006-03-14 20:49 ` Shailabh Nagar 2006-03-14 21:24 ` Greg KH 2006-03-14 21:59 ` Shailabh Nagar 2006-03-23 15:16 ` [Patch 0/9] Performance Shailabh Nagar 2006-03-25 2:38 ` Greg KH 2006-03-27 18:28 ` Shailabh Nagar
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox