public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [Patch 0/9] Per-task delay accounting
@ 2006-03-14  0:40 Shailabh Nagar
  2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
                   ` (9 more replies)
  0 siblings, 10 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Nick Piggin

This is the next iteration of the delay accounting patches
last posted at
	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html

The major changes to the patches (and names of commenters whose concerns
have been addressed) are:

- considerable revision of the synchronous block I/O delay collection (Arjan)
Now most delays seen in I/O submission are not being collected.
- removal of dependency on schedstats by using common code (Nick) 
- removal of sysctl or dynamic configurability of delay accounting (Arjan)
Now it it can be set on/off at boot time only.
- providing I/O delays through /proc/<tgid>/stats (Andi)
- revisions to the generic netlink interface (Jamal)

Unaddressed comments are suggestions from Jamal on 
further improving the generic netlink interface which is ongoing.

Series 

nstimestamp-diff.patch
delayacct-setup.patch
delayacct-blkio-init.patch
delayacct-blkio-collect.patch
delayacct-swapin.patch
delayacct-procfs.patch
delayacct-schedstats.patch
genetlink-utils.patch
delayacct-genetlink.patch


A short note on expected usage of this functionality follows which
addresses a comment from Nick in the previous iteration.
Basically, the patches measures a subset of the delays encountered 
by a task waiting for a resource such as cpu, disk I/O and page 
frames. Which subset is chosen depends on whether a task's delay 
is something that can be controlled by playing with its priority 
or rss limit etc.
e.g. cpu delays can directly be affected by its static CPU priority.
Similarly I/O delay can be affected by its I/O priority (assuming 
one uses the right iosched). In page frame delay, we only count 
the page faults due to swapin since that is directly affected by 
the rss limits for a task (well, ok, process). 

Delays can be used as feedback to decide whether something can/should
be done about a task (or group of tasks) which is not performing 
as one expects. 

http://www.research.ibm.com/journal/sj/362/amanaut.html
lists a fairly complicated way of using such data in 
workload management.

At a simpler level, one can think of utilities that can use this data
to control individual apps using a simple feedback loop.
At this point we do not have such a utility. 

Of course, its useful to visually inspect such data. Some of it is 
being exported through /proc as suggested by Andi, but the primary 
interface is through a netlink socket so that userspace can get 
data for exiting tasks too. 

The netlink socket also allows userspace apps to efficiently 
collect  delay data and group it in arbitrary ways 
(as envisaged by CKRM, CSA or ELSA) for reporting or 
control purposes. 


--Shailabh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 1/9] timestamp diff
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
@ 2006-03-14  0:42 ` Shailabh Nagar
  2006-03-14  1:01   ` Lee Revell
  2006-03-15 10:23   ` Arjan van de Ven
  2006-03-14  0:45 ` Patch 2/9] Initialization Shailabh Nagar
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:42 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven

nstimestamp_diff.patch

Add kernel utility function for measuring the
interval (diff) between two timespec values

Comments addressed (commenter)

- returns difference as timespec instead of nsecs (Arjan)

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>


 include/linux/time.h |   14 ++++++++++++++
 1 files changed, 14 insertions(+)

Index: linux-2.6.16-rc5/include/linux/time.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/time.h	2006-03-10 17:56:56.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/time.h	2006-03-10 17:57:26.000000000 -0500
@@ -147,6 +147,20 @@ extern struct timespec ns_to_timespec(co
  */
 extern struct timeval ns_to_timeval(const nsec_t nsec);
 
+/**
+ * timespec_diff - Return difference of timespecs as a timespec
+ * @start:	first timespec
+ * @end:	second timespec
+ * @ret:	result timespec
+ *
+ * Returns the difference between @start and @end in @ret
+ */
+static inline void timespec_diff(struct timespec *start, struct timespec *end,
+				 struct timespec *ret)
+{
+	ret->tv_sec = end->tv_sec - start->tv_sec;
+	ret->tv_nsec = end->tv_nsec - start->tv_nsec;
+}
 #endif /* __KERNEL__ */
 
 #define NFDBITS			__NFDBITS



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Patch 2/9] Initialization
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
  2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
@ 2006-03-14  0:45 ` Shailabh Nagar
  2006-03-14 10:54   ` Jes Sorensen
  2006-03-15 10:24   ` Arjan van de Ven
  2006-03-14  0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven

delayacct-setup.patch

Initialization code related to collection of per-task "delay" 
statistics which measure how long it had to wait for cpu, 
sync block io, swapping etc. The collection of statistics and 
the interface are in other patches. This patch sets up the data
structures and allows the statistics collection to be disabled 
through a  kernel boot paramater.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 Documentation/kernel-parameters.txt |    2 
 include/linux/delayacct.h           |   48 ++++++++++++++++++
 include/linux/sched.h               |   17 ++++++
 init/Kconfig                        |   13 +++++
 init/main.c                         |    2 
 kernel/Makefile                     |    1 
 kernel/delayacct.c                  |   92 ++++++++++++++++++++++++++++++++++++
 kernel/exit.c                       |    3 +
 kernel/fork.c                       |    2 
 9 files changed, 180 insertions(+)

Index: linux-2.6.16-rc5/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/kernel-parameters.txt	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/Documentation/kernel-parameters.txt	2006-03-11 07:41:37.000000000 -0500
@@ -410,6 +410,8 @@ running once the system is up.
 			Format: <area>[,<node>]
 			See also Documentation/networking/decnet.txt.
 
+	delayacct	[KNL] Enable per-task delay accounting
+
 	devfs=		[DEVFS]
 			See Documentation/filesystems/devfs/boot-options.
 
Index: linux-2.6.16-rc5/kernel/Makefile
===================================================================
--- linux-2.6.16-rc5.orig/kernel/Makefile	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/Makefile	2006-03-11 07:41:37.000000000 -0500
@@ -34,6 +34,7 @@ obj-$(CONFIG_DETECT_SOFTLOCKUP) += softl
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
+obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/include/linux/delayacct.h	2006-03-11 07:41:37.000000000 -0500
@@ -0,0 +1,48 @@
+/* delayacct.h - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+extern int delayacct_on;	/* Delay accounting turned on/off */
+extern kmem_cache_t *delayacct_cache;
+extern int delayacct_init(void);
+extern void __delayacct_tsk_init(struct task_struct *);
+extern void __delayacct_tsk_exit(struct task_struct *);
+
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{
+	/* reinitialize in case parent's non-null pointer was dup'ed*/
+	tsk->delays = NULL;
+	if (unlikely(delayacct_on))
+		__delayacct_tsk_init(tsk);
+}
+
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{
+	if (tsk->delays)
+		__delayacct_tsk_exit(tsk);
+}
+
+#else
+static inline int delayacct_init(void)
+{}
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{}
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{}
+#endif /* CONFIG_TASK_DELAY_ACCT */
+#endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h	2006-03-11 07:41:37.000000000 -0500
@@ -539,6 +539,20 @@ struct sched_info {
 extern struct file_operations proc_schedstat_operations;
 #endif
 
+#ifdef CONFIG_TASK_DELAY_ACCT
+struct task_delay_info {
+	spinlock_t	lock;
+
+	/* For each stat XXX, add following, aligned appropriately
+	 *
+	 * struct timespec XXX_start, XXX_end;
+	 * u64 XXX_delay;
+	 * u32 XXX_count;
+	 */
+};
+#endif
+
+
 enum idle_type
 {
 	SCHED_IDLE,
@@ -870,6 +884,9 @@ struct task_struct {
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
+#ifdef	CONFIG_TASK_DELAY_ACCT
+	struct task_delay_info *delays;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.16-rc5/init/Kconfig
===================================================================
--- linux-2.6.16-rc5.orig/init/Kconfig	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/init/Kconfig	2006-03-11 07:41:37.000000000 -0500
@@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3
 	  for processing it. A preliminary version of these tools is available
 	  at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
 
+config TASK_DELAY_ACCT
+	bool "Enable per-task delay accounting (EXPERIMENTAL)"
+	help
+	  Collect information on time spent by a task waiting for system
+	  resources like cpu, synchronous block I/O completion and swapping
+	  in pages. Such statistics can help in setting a task's priorities
+	  relative to other tasks for cpu, io, rss limits etc.
+
+	  Unlike BSD process accounting, this information is available
+	  continuously during the lifetime of a task.
+
+	  Say N if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
Index: linux-2.6.16-rc5/init/main.c
===================================================================
--- linux-2.6.16-rc5.orig/init/main.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/init/main.c	2006-03-11 07:41:37.000000000 -0500
@@ -47,6 +47,7 @@
 #include <linux/rmap.h>
 #include <linux/mempolicy.h>
 #include <linux/key.h>
+#include <linux/delayacct.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -537,6 +538,7 @@ asmlinkage void __init start_kernel(void
 	proc_root_init();
 #endif
 	cpuset_init();
+	delayacct_init();
 
 	check_bugs();
 
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/kernel/delayacct.c	2006-03-11 07:41:37.000000000 -0500
@@ -0,0 +1,92 @@
+/* delayacct.c - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+#include <linux/sysctl.h>
+#include <linux/delayacct.h>
+
+int delayacct_on = 0;		/* Delay accounting turned on/off */
+kmem_cache_t *delayacct_cache;
+
+static int __init delayacct_setup_enable(char *str)
+{
+	delayacct_on = 1;
+	return 1;
+}
+__setup("delayacct", delayacct_setup_enable);
+
+int delayacct_init(void)
+{
+	delayacct_cache = kmem_cache_create("delayacct_cache",
+					    sizeof(struct task_delay_info),
+					    0,
+					    SLAB_PANIC,
+					    NULL, NULL);
+	if (!delayacct_cache)
+		return -ENOMEM;
+	delayacct_tsk_init(&init_task);
+	return 0;
+}
+
+void __delayacct_tsk_init(struct task_struct *tsk)
+{
+	tsk->delays = kmem_cache_alloc(delayacct_cache, SLAB_KERNEL);
+	if (tsk->delays) {
+		memset(tsk->delays, 0, sizeof(*tsk->delays));
+		spin_lock_init(&tsk->delays->lock);
+	}
+}
+
+void __delayacct_tsk_exit(struct task_struct *tsk)
+{
+	if (tsk->delays) {
+		kmem_cache_free(delayacct_cache, tsk->delays);
+		tsk->delays = NULL;
+	}
+}
+
+/*
+ * Start accounting for a delay statistic using
+ * its starting timestamp (@start)
+ */
+
+static inline void delayacct_start(struct timespec *start)
+{
+	do_posix_clock_monotonic_gettime(start);
+}
+
+/*
+ * Finish delay accounting for a statistic using
+ * its timestamps (@start, @end), accumalator (@total) and @count
+ */
+
+static inline void delayacct_end(struct timespec *start, struct timespec *end,
+				u64 *total, u32 *count)
+{
+	struct timespec ts;
+	nsec_t ns;
+
+	do_posix_clock_monotonic_gettime(end);
+	timespec_diff(start, end, &ts);
+	ns = timespec_to_ns(&ts);
+	if (ns < 0)
+		return;
+
+	spin_lock(&current->delays->lock);
+	*total += ns;
+	(*count)++;
+	spin_unlock(&current->delays->lock);
+}
+
Index: linux-2.6.16-rc5/kernel/fork.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/fork.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/fork.c	2006-03-11 07:41:37.000000000 -0500
@@ -44,6 +44,7 @@
 #include <linux/rmap.h>
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
+#include <linux/delayacct.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -970,6 +971,7 @@ static task_t *copy_process(unsigned lon
 		goto bad_fork_cleanup_put_domain;
 
 	p->did_exec = 0;
+	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
 	copy_flags(clone_flags, p);
 	p->pid = pid;
 	retval = -EFAULT;
Index: linux-2.6.16-rc5/kernel/exit.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/exit.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/exit.c	2006-03-11 07:41:37.000000000 -0500
@@ -31,6 +31,7 @@
 #include <linux/signal.h>
 #include <linux/cn_proc.h>
 #include <linux/mutex.h>
+#include <linux/delayacct.h>
 
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -842,6 +843,8 @@ fastcall NORET_TYPE void do_exit(long co
 				preempt_count());
 
 	acct_update_integrals(tsk);
+	delayacct_tsk_exit(tsk);
+
 	if (tsk->mm) {
 		update_hiwater_rss(tsk->mm);
 		update_hiwater_vm(tsk->mm);



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 3/9] Block I/O accounting initialization
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
  2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
  2006-03-14  0:45 ` Patch 2/9] Initialization Shailabh Nagar
@ 2006-03-14  0:47 ` Shailabh Nagar
  2006-03-15 10:27   ` Arjan van de Ven
  2006-03-14  0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen

delayacct-blkio-init.patch

Setup variables and functions to collect per-task
block I/O delays.

Separating the collection part to a later patch
for easier review and discussion.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>


 include/linux/delayacct.h |   16 ++++++++++++++++
 include/linux/sched.h     |    4 ++++
 kernel/delayacct.c        |   14 ++++++++++++++
 3 files changed, 34 insertions(+)

Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/delayacct.h	2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/delayacct.h	2006-03-11 07:41:38.000000000 -0500
@@ -22,6 +22,8 @@ extern kmem_cache_t *delayacct_cache;
 extern int delayacct_init(void);
 extern void __delayacct_tsk_init(struct task_struct *);
 extern void __delayacct_tsk_exit(struct task_struct *);
+extern void __delayacct_blkio_start(void);
+extern void __delayacct_blkio_end(void);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -37,6 +39,16 @@ static inline void delayacct_tsk_exit(st
 		__delayacct_tsk_exit(tsk);
 }
 
+static inline void delayacct_blkio_start(void)
+{
+	if (unlikely(delayacct_on))
+		__delayacct_blkio_start();
+}
+static inline void delayacct_blkio_end(void)
+{
+	if (unlikely(delayacct_on))
+		__delayacct_blkio_end();
+}
 #else
 static inline int delayacct_init(void)
 {}
@@ -44,5 +56,9 @@ static inline void delayacct_tsk_init(st
 {}
 static inline void delayacct_tsk_exit(struct task_struct *tsk)
 {}
+static inline void delayacct_blkio_start(void)
+{}
+static inline void delayacct_blkio_end(void)
+{}
 #endif /* CONFIG_TASK_DELAY_ACCT */
 #endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c	2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c	2006-03-11 07:41:38.000000000 -0500
@@ -90,3 +90,17 @@ static inline void delayacct_end(struct 
 	spin_unlock(&current->delays->lock);
 }
 
+void __delayacct_blkio_start(void)
+{
+	if (current->delays)
+		delayacct_start(&current->delays->blkio_start);
+}
+
+void __delayacct_blkio_end(void)
+{
+	if (current->delays)
+		delayacct_end(&current->delays->blkio_start,
+				&current->delays->blkio_end,
+				&current->delays->blkio_delay,
+				&current->delays->blkio_count);
+}
Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h	2006-03-11 07:41:37.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h	2006-03-11 07:41:38.000000000 -0500
@@ -549,6 +549,10 @@ struct task_delay_info {
 	 * u64 XXX_delay;
 	 * u32 XXX_count;
 	 */
+
+	struct timespec blkio_start, blkio_end;
+	u64 blkio_delay;	/* wait for sync block io completion */
+	u32 blkio_count;
 };
 #endif
 



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 4/9] Block I/O accounting collection
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (2 preceding siblings ...)
  2006-03-14  0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar
@ 2006-03-14  0:48 ` Shailabh Nagar
  2006-03-14  0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen, Suparna Bhattacharya

delayacct-blkio-collect.patch

Collect per-task block I/O delay statistics.

Unlike earlier iterations of the delay accounting
patches, now delays are only collected for the actual
I/O waits rather than try and cover the delays seen in 
I/O submission paths.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 kernel/sched.c |    5 +++++
 1 files changed, 5 insertions(+)

Index: linux-2.6.16-rc5/kernel/sched.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/sched.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/kernel/sched.c	2006-03-11 07:41:39.000000000 -0500
@@ -49,6 +49,7 @@
 #include <linux/syscalls.h>
 #include <linux/times.h>
 #include <linux/acct.h>
+#include <linux/delayacct.h>
 #include <asm/tlb.h>
 
 #include <asm/unistd.h>
@@ -4117,9 +4118,11 @@ void __sched io_schedule(void)
 {
 	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
 
+	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
 	schedule();
 	atomic_dec(&rq->nr_iowait);
+	delayacct_blkio_end();
 }
 
 EXPORT_SYMBOL(io_schedule);
@@ -4129,9 +4132,11 @@ long __sched io_schedule_timeout(long ti
 	struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
 	long ret;
 
+	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
 	atomic_dec(&rq->nr_iowait);
+	delayacct_blkio_end();
 	return ret;
 }
 



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 5/9] Swapin delays
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (3 preceding siblings ...)
  2006-03-14  0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar
@ 2006-03-14  0:51 ` Shailabh Nagar
  2006-03-14  0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Arjan van de Ven, Andi Kleen

delayacct-swapin.patch

Account separately for block I/O delays 
incurred as a result of swapin page faults since
these are a result of insufficient rss limit.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 include/linux/sched.h |    5 ++++-
 kernel/delayacct.c    |   17 ++++++++++++-----
 mm/memory.c           |    4 ++++
 3 files changed, 20 insertions(+), 6 deletions(-)

Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h	2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h	2006-03-11 07:41:39.000000000 -0500
@@ -550,9 +550,11 @@ struct task_delay_info {
 	 * u32 XXX_count;
 	 */
 
-	struct timespec blkio_start, blkio_end;
+	struct timespec blkio_start, blkio_end;	/* Shared by blkio, swapin */
 	u64 blkio_delay;	/* wait for sync block io completion */
+	u64 swapin_delay;	/* wait for sync block io completion */
 	u32 blkio_count;
+	u32 swapin_count;
 };
 #endif
 
@@ -949,6 +951,7 @@ static inline void put_task_struct(struc
 #define PF_BORROWED_MM	0x00400000	/* I am a kthread doing use_mm */
 #define PF_RANDOMIZE	0x00800000	/* randomize virtual address space */
 #define PF_SWAPWRITE	0x01000000	/* Allowed to write to swap */
+#define PF_SWAPIN	0x02000000	/* I am doing a swap in */
 
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c	2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c	2006-03-11 07:41:39.000000000 -0500
@@ -98,9 +98,16 @@ void __delayacct_blkio_start(void)
 
 void __delayacct_blkio_end(void)
 {
-	if (current->delays)
-		delayacct_end(&current->delays->blkio_start,
-				&current->delays->blkio_end,
-				&current->delays->blkio_delay,
-				&current->delays->blkio_count);
+	if (current->delays) {
+		if (current->flags & PF_SWAPIN)	/* Swapping a page in */
+			delayacct_end(&current->delays->blkio_start,
+				      &current->delays->blkio_end,
+				      &current->delays->swapin_delay,
+				      &current->delays->swapin_count);
+		else	/* Other block I/O */
+			delayacct_end(&current->delays->blkio_start,
+				      &current->delays->blkio_end,
+				      &current->delays->blkio_delay,
+				      &current->delays->blkio_count);
+	}
 }
Index: linux-2.6.16-rc5/mm/memory.c
===================================================================
--- linux-2.6.16-rc5.orig/mm/memory.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/mm/memory.c	2006-03-11 07:41:39.000000000 -0500
@@ -1882,6 +1882,7 @@ static int do_swap_page(struct mm_struct
 
 	entry = pte_to_swp_entry(orig_pte);
 again:
+	current->flags |= PF_SWAPIN;
 	page = lookup_swap_cache(entry);
 	if (!page) {
  		swapin_readahead(entry, address, vma);
@@ -1894,6 +1895,7 @@ again:
 			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
 			if (likely(pte_same(*page_table, orig_pte)))
 				ret = VM_FAULT_OOM;
+			current->flags &= ~PF_SWAPOFF;
 			goto unlock;
 		}
 
@@ -1905,6 +1907,8 @@ again:
 
 	mark_page_accessed(page);
 	lock_page(page);
+	current->flags &= ~PF_SWAPOFF;
+
 	if (!PageSwapCache(page)) {
 		/* Page migration has occured */
 		unlock_page(page);



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 7/9] /proc interface for all I/O delays
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (4 preceding siblings ...)
  2006-03-14  0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar
@ 2006-03-14  0:53 ` Shailabh Nagar
  2006-03-14  0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:53 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, Arjan van de Ven

delayacct-procfs.patch

Export I/O delays seen by a task through /proc/<tgid>/stats
for use in top etc.

Note that delays for I/O done for swapping in pages (swapin I/O) is 
clubbed together with all other I/O here (this is not the 
case for the other interface where the swapin I/O is kept distinct)

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>


 fs/proc/array.c           |    6 ++++--
 include/linux/delayacct.h |   10 ++++++++++
 include/linux/jiffies.h   |    6 ++++++
 kernel/delayacct.c        |   13 +++++++++++++
 4 files changed, 33 insertions(+), 2 deletions(-)

Index: linux-2.6.16-rc5/fs/proc/array.c
===================================================================
--- linux-2.6.16-rc5.orig/fs/proc/array.c	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/fs/proc/array.c	2006-03-11 07:41:40.000000000 -0500
@@ -75,6 +75,7 @@
 #include <linux/times.h>
 #include <linux/cpuset.h>
 #include <linux/rcupdate.h>
+#include <linux/delayacct.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -414,7 +415,7 @@ static int do_task_stat(struct task_stru
 
 	res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
 %lu %lu %lu %lu %lu %ld %ld %ld %ld %d %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
 		task->pid,
 		tcomm,
 		state,
@@ -459,7 +460,8 @@ static int do_task_stat(struct task_stru
 		task->exit_signal,
 		task_cpu(task),
 		task->rt_priority,
-		task->policy);
+		      task->policy,
+		delayacct_blkio_ticks(task));
 	if(mm)
 		mmput(mm);
 	return res;
Index: linux-2.6.16-rc5/include/linux/delayacct.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/delayacct.h	2006-03-11 07:41:38.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/delayacct.h	2006-03-11 07:41:40.000000000 -0500
@@ -24,6 +24,7 @@ extern void __delayacct_tsk_init(struct 
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
+extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -49,6 +50,11 @@ static inline void delayacct_blkio_end(v
 	if (unlikely(delayacct_on))
 		__delayacct_blkio_end();
 }
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+	if (unlikely(delayacct_on))
+		return __delayacct_blkio_ticks(tsk);
+}
 #else
 static inline int delayacct_init(void)
 {}
@@ -60,5 +66,9 @@ static inline void delayacct_blkio_start
 {}
 static inline void delayacct_blkio_end(void)
 {}
+static inline unsigned long long delayacct_blkio_ticks(struct task_struct *tsk)
+{
+	return 0;
+}
 #endif /* CONFIG_TASK_DELAY_ACCT */
 #endif /* _LINUX_TASKDELAYS_H */
Index: linux-2.6.16-rc5/include/linux/jiffies.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/jiffies.h	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/jiffies.h	2006-03-11 07:41:40.000000000 -0500
@@ -291,6 +291,12 @@ static inline unsigned long usecs_to_jif
 #endif
 }
 
+static inline unsigned long nsecs_to_jiffies(const nsec_t n)
+{
+	return (((u64)n * NSEC_CONVERSION) >>
+		(NSEC_JIFFIE_SC - SEC_JIFFIE_SC)) >> SEC_JIFFIE_SC;
+}
+
 /*
  * The TICK_NSEC - 1 rounds up the value to the next resolution.  Note
  * that a remainder subtract here would not do the right thing as the
Index: linux-2.6.16-rc5/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/delayacct.c	2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/kernel/delayacct.c	2006-03-11 07:41:40.000000000 -0500
@@ -111,3 +111,16 @@ void __delayacct_blkio_end(void)
 				      &current->delays->blkio_count);
 	}
 }
+
+unsigned long long __delayacct_blkio_ticks(struct task_struct *tsk)
+{
+	unsigned long long ret;
+
+	if (!tsk->delays)
+		return 0;
+
+	spin_lock(&tsk->delays->lock);
+	ret = nsec_to_clock_t(tsk->delays->blkio_delay + tsk->delays->swapin_delay);
+	spin_unlock(&tsk->delays->lock);
+	return ret;
+}



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 8/9] generic netlink utility functions
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (5 preceding siblings ...)
  2006-03-14  0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar
@ 2006-03-14  0:55 ` Shailabh Nagar
  2006-03-26 16:44   ` Balbir Singh
  2006-03-14  0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:55 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Jamal

genetlink-utils.patch

Two utilities for simplifying usage of NETLINK_GENERIC
interface.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 include/net/genetlink.h |   20 ++++++++++++++++++++
 1 files changed, 20 insertions(+)

Index: linux-2.6.16-rc5/include/net/genetlink.h
===================================================================
--- linux-2.6.16-rc5.orig/include/net/genetlink.h	2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/net/genetlink.h	2006-03-11 07:41:41.000000000 -0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
 	return nlmsg_unicast(genl_sock, skb, pid);
 }
 
+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+       return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+       struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+                                                   NLMSG_HDRLEN);
+       return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif	/* __NET_GENERIC_NETLINK_H */



^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (6 preceding siblings ...)
  2006-03-14  0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
@ 2006-03-14  0:56 ` Shailabh Nagar
  2006-03-14  2:29   ` jamal
  2006-03-14  2:33   ` Matt Helsley
  2006-03-14  1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar
  2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH
  9 siblings, 2 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  0:56 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jamal, netdev

delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family), 
called "taskstats", for getting delay and cpu statistics of 
tasks and thread groups during their lifetime and when they exit. 

Comments addressed (all in response to Jamal)

- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
- Use multicast only for events generated due to task exit, not for 
replies to commands
- Align taskstats and taskstats_reply structures to 64 bit aligned.
- Use dynamic ID generation for genetlink family

More changes expected. Following comments will go into a 
Documentation file:

When a task is alive, userspace can get its stats by sending a 
command containing its pid. Sending a tgid returns the sum of stats 
of the tasks belonging to that tgid (where such a sum makes sense). 
Together, the command interface allows stats for a large number of 
tasks to be collected more efficiently than would be possible 
through /proc or any per-pid interface. 

The netlink interface also sends the stats for each task to userspace 
when the task is exiting. This permits fine-grain accounting for 
short-lived tasks, which is important if userspace is doing its own 
aggregation of statistics based on some grouping of tasks 
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay 
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to 
userspace is versioned and the command interface easily extensible to 
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay 
accounting alone.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

 include/linux/taskstats.h |  128 ++++++++++++++++++++++++
 init/Kconfig              |   16 ++-
 kernel/taskstats.c        |  244 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 385 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc5/include/linux/taskstats.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/include/linux/taskstats.h	2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *     - a task exits
+ *     - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION      1
+
+struct taskstats {
+       /* Maintain 64-bit alignment while extending */
+
+       /* Version 1 */
+#define TASKSTATS_NOPID        -1
+       __s64   pid;
+       __s64   tgid;
+
+       /* XXX_count is number of delay values recorded.
+        * XXX_total is corresponding cumulative delay in nanoseconds
+        */
+
+#define TASKSTATS_NOCPUSTATS   1
+       __u64   cpu_count;
+       __u64   cpu_delay_total;        /* wait, while runnable, for cpu */
+       __u64   blkio_count;
+       __u64   blkio_delay_total;      /* sync,block io completion wait*/
+       __u64   swapin_count;
+       __u64   swapin_delay_total;     /* swapin page fault wait*/
+
+       __u64   cpu_run_total;          /* cpu running time
+                                        * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+       TASKSTATS_CMD_UNSPEC,           /* Reserved */
+       TASKSTATS_CMD_NONE,             /* Not a valid cmd to send
+                                        * Marks data sent on task/tgid exit */
+       TASKSTATS_CMD_LISTEN,           /* Start listening */
+       TASKSTATS_CMD_IGNORE,           /* Stop listening */
+       TASKSTATS_CMD_PID,              /* Send stats for a pid */
+       TASKSTATS_CMD_TGID,             /* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+       /* Maintain 64-bit alignment while extending */
+       union {
+               __s64   pid;
+               __s64   tgid;
+       } id;
+};
+
+enum outtype {
+       TASKSTATS_REPLY_NONE = 1,       /* Control cmd response */
+       TASKSTATS_REPLY_PID,            /* per-pid data cmd response*/
+       TASKSTATS_REPLY_TGID,           /* per-tgid data cmd response*/
+       TASKSTATS_REPLY_EXIT_PID,       /* Exiting task's stats */
+       TASKSTATS_REPLY_EXIT_TGID,      /* Exiting tgid's stats
+                                        * (sent on each tid's exit) */
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+       /* Maintain 64-bit alignment while extending */
+       __u16 outtype;                  /* Must be one of enum outtype */
+       __u16 version;
+       __u32 err;
+       struct taskstats stats;         /* Invalid if err != 0 */
+};
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME    "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#define TASKSTATS_HDRLEN       (NLMSG_SPACE(GENL_HDRLEN))
+#define TASKSTATS_BODYLEN      (sizeof(struct taskstats_reply))
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16-rc5/kernel/taskstats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/kernel/taskstats.c	2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,244 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+
+static struct genl_family family = {
+       .id             = GENL_ID_GENERATE,
+       .name           = TASKSTATS_GENL_NAME,
+       .version        = TASKSTATS_GENL_VERSION,
+       .hdrsize        = 0,
+       .maxattr        = 0,
+};
+
+/* Taskstat specific functions */
+static int prepare_reply(struct genl_info *info, u8 cmd,
+                        struct sk_buff **skbp, struct taskstats_reply **replyp)
+{
+       struct sk_buff *skb;
+       struct taskstats_reply *reply;
+
+       skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN);
+       if (!skb)
+               return -ENOMEM;
+
+       if (!info) {
+               int seq = get_cpu_var(taskstats_seqnum)++;
+               put_cpu_var(taskstats_seqnum);
+
+               reply = genlmsg_put(skb, 0, seq,
+                                   family.id, 0, NLM_F_REQUEST,
+                                   cmd, family.version);
+       } else
+               reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+                                   family.id, 0, info->nlhdr->nlmsg_flags,
+                                   info->genlhdr->cmd, family.version);
+       if (reply == NULL) {
+               nlmsg_free(skb);
+               return -EINVAL;
+       }
+       skb_put(skb, TASKSTATS_BODYLEN);
+
+       memset(reply, 0, sizeof(*reply));
+       reply->version = taskstats_version;
+       reply->err = 0;
+
+       *skbp = skb;
+       *replyp = reply;
+       return 0;
+}
+
+static int send_reply(struct sk_buff *skb, int replytype, pid_t pid, int event)
+{
+       struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+       struct taskstats_reply *reply;
+       int rc;
+
+       reply = (struct taskstats_reply *)genlmsg_data(genlhdr);
+       reply->outtype = replytype;
+
+       rc = genlmsg_end(skb, reply);
+       if (rc < 0) {
+               nlmsg_free(skb);
+               return rc;
+       }
+
+       if (event)
+               return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+       else
+               return genlmsg_unicast(skb, pid);
+}
+
+static inline void fill_pid(struct taskstats_reply *reply, pid_t pid,
+                           struct task_struct *pidtsk)
+{
+       int rc;
+       struct task_struct *tsk = pidtsk;
+
+       if (!pidtsk) {
+               read_lock(&tasklist_lock);
+               tsk = find_task_by_pid(pid);
+               if (!tsk) {
+                       read_unlock(&tasklist_lock);
+                       reply->err = EINVAL;
+                       return;
+               }
+               get_task_struct(tsk);
+               read_unlock(&tasklist_lock);
+       } else
+               get_task_struct(tsk);
+
+       rc = delayacct_add_tsk(reply, tsk);
+       if (!rc) {
+               reply->stats.pid = (s64)tsk->pid;
+               reply->stats.tgid = (s64)tsk->tgid;
+       } else
+               reply->err = (rc < 0) ? -rc : rc ;
+
+       put_task_struct(tsk);
+}
+
+static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info)
+{
+       int rc;
+       struct sk_buff *rep_skb;
+       struct taskstats_reply *reply;
+       struct taskstats_cmd_param *param= info->userhdr;
+
+       rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+       if (rc)
+               return rc;
+       fill_pid(reply, param->id.pid, NULL);
+       return send_reply(rep_skb, TASKSTATS_REPLY_PID, info->snd_pid, 0);
+}
+
+static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid,
+                            struct task_struct *tgidtsk)
+{
+       int rc;
+       struct task_struct *tsk, *first;
+
+       first = tgidtsk;
+       read_lock(&tasklist_lock);
+       if (!first) {
+               first = find_task_by_pid(tgid);
+               if (!first) {
+                       read_unlock(&tasklist_lock);
+                       reply->err = EINVAL;
+                       return;
+               }
+       }
+       tsk = first;
+       do {
+               rc = delayacct_add_tsk(reply, tsk);
+               if (rc)
+                       break;
+       } while_each_thread(first, tsk);
+       read_unlock(&tasklist_lock);
+
+       if (!rc) {
+               reply->stats.pid = (s64)TASKSTATS_NOPID;
+               reply->stats.tgid = (s64)tgid;
+       } else
+               reply->err = (rc < 0) ? -rc : rc ;
+}
+
+static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info)
+{
+       int rc;
+       struct sk_buff *rep_skb;
+       struct taskstats_reply *reply;
+       struct taskstats_cmd_param *param= info->userhdr;
+
+       rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+       if (rc)
+               return rc;
+       fill_tgid(reply, param->id.tgid, NULL);
+       return send_reply(rep_skb, TASKSTATS_REPLY_TGID, info->snd_pid, 0);
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+       int rc;
+       struct sk_buff *rep_skb;
+       struct taskstats_reply *reply;
+
+       /*
+        * tasks can start to exit very early. Ensure that the family
+        * is registered before notifications are sent out
+        */
+       if (!family_registered)
+               return;
+
+       rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+       if (rc)
+               return;
+       fill_pid(reply, tsk->pid, tsk);
+       rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID, 0, 1);
+
+       if (rc || thread_group_empty(tsk))
+               return;
+
+       /* Send tgid data too */
+       rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+       if (rc)
+               return;
+       fill_tgid(reply, tsk->tgid, tsk);
+       send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID, 0, 1);
+}
+
+static struct genl_ops pid_ops = {
+       .cmd            = TASKSTATS_CMD_PID,
+       .doit           = taskstats_send_pid,
+};
+
+static struct genl_ops tgid_ops = {
+       .cmd            = TASKSTATS_CMD_TGID,
+       .doit           = taskstats_send_tgid,
+};
+
+static int __init taskstats_init(void)
+{
+       if (genl_register_family(&family))
+               return -EFAULT;
+       family_registered = 1;
+
+       if (genl_register_ops(&family, &pid_ops))
+               goto err;
+       if (genl_register_ops(&family, &tgid_ops))
+               goto err;
+
+       return 0;
+err:
+       genl_unregister_family(&family);
+       family_registered = 0;
+       return -EFAULT;
+}
+
+late_initcall(taskstats_init);
+
Index: linux-2.6.16-rc5/init/Kconfig
===================================================================
--- linux-2.6.16-rc5.orig/init/Kconfig	2006-03-13 18:51:30.000000000 -0500
+++ linux-2.6.16-rc5/init/Kconfig	2006-03-13 19:04:52.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
@ 2006-03-14  1:01   ` Lee Revell
  2006-03-14  1:05     ` Shailabh Nagar
  2006-03-15 10:23   ` Arjan van de Ven
  1 sibling, 1 reply; 56+ messages in thread
From: Lee Revell @ 2006-03-14  1:01 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, Arjan van de Ven

On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> +       ret->tv_sec = end->tv_sec - start->tv_sec;
> +       ret->tv_nsec = end->tv_nsec - start->tv_nsec; 

What if end->tv_nsec is less than start->tv_nsec?

Lee


^ permalink raw reply	[flat|nested] 56+ messages in thread

* [Patch 6/9] cpu delay collection
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (7 preceding siblings ...)
  2006-03-14  0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
@ 2006-03-14  1:01 ` Shailabh Nagar
  2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH
  9 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  1:01 UTC (permalink / raw)
  To: linux-kernel; +Cc: Nick Piggin

delayacct-schedstats.patch

Collect cpu delays through the 
task-related schedstats functions, extracted
as common code to remove dependence of delay 
accounting on rest of schedstats. 

Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>

 include/linux/sched.h |    8 +++++---
 kernel/sched.c        |   49 +++++++++++++++++++++++++++++++++++++------------
 2 files changed, 42 insertions(+), 15 deletions(-)

Index: linux-2.6.16-rc5/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc5.orig/include/linux/sched.h	2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/include/linux/sched.h	2006-03-11 07:41:40.000000000 -0500
@@ -524,7 +524,7 @@ typedef struct prio_array prio_array_t;
 struct backing_dev_info;
 struct reclaim_state;
 
-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 struct sched_info {
 	/* cumulative counters */
 	unsigned long	cpu_time,	/* time spent on the cpu */
@@ -536,8 +536,10 @@ struct sched_info {
 			last_queued;	/* when we were last queued to run */
 };
 
+#ifdef CONFIG_SCHEDSTATS
 extern struct file_operations proc_schedstat_operations;
-#endif
+#endif /* CONFIG_SCHEDSTATS */
+#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 struct task_delay_info {
@@ -735,7 +737,7 @@ struct task_struct {
 	cpumask_t cpus_allowed;
 	unsigned int time_slice, first_time_slice;
 
-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
 	struct sched_info sched_info;
 #endif
 
Index: linux-2.6.16-rc5/kernel/sched.c
===================================================================
--- linux-2.6.16-rc5.orig/kernel/sched.c	2006-03-11 07:41:39.000000000 -0500
+++ linux-2.6.16-rc5/kernel/sched.c	2006-03-11 07:41:40.000000000 -0500
@@ -473,9 +473,26 @@ struct file_operations proc_schedstat_op
 	.release = single_release,
 };
 
+static inline void rq_sched_info_arrive(struct runqueue *rq,
+						unsigned long diff)
+{
+	if (rq) {
+		rq->rq_sched_info.run_delay += diff;
+		rq->rq_sched_info.pcnt++;
+	}
+}
+
+static inline void rq_sched_info_depart(struct runqueue *rq,
+						unsigned long diff)
+{
+	if (rq)
+		rq->rq_sched_info.cpu_time += diff;
+}
 # define schedstat_inc(rq, field)	do { (rq)->field++; } while (0)
 # define schedstat_add(rq, field, amt)	do { (rq)->field += (amt); } while (0)
 #else /* !CONFIG_SCHEDSTATS */
+static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {}
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {}
 # define schedstat_inc(rq, field)	do { } while (0)
 # define schedstat_add(rq, field, amt)	do { } while (0)
 #endif
@@ -495,7 +512,19 @@ static inline runqueue_t *this_rq_lock(v
 	return rq;
 }
 
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+
+static inline int sched_info_on(void)
+{
 #ifdef CONFIG_SCHEDSTATS
+	return 1;
+#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+	extern int delayacct_on;
+	return delayacct_on;
+#endif
+}
+
 /*
  * Called when a process is dequeued from the active array and given
  * the cpu.  We should note that with the exception of interactive
@@ -524,7 +553,6 @@ static inline void sched_info_dequeued(t
 static void sched_info_arrive(task_t *t)
 {
 	unsigned long now = jiffies, diff = 0;
-	struct runqueue *rq = task_rq(t);
 
 	if (t->sched_info.last_queued)
 		diff = now - t->sched_info.last_queued;
@@ -533,11 +561,7 @@ static void sched_info_arrive(task_t *t)
 	t->sched_info.last_arrival = now;
 	t->sched_info.pcnt++;
 
-	if (!rq)
-		return;
-
-	rq->rq_sched_info.run_delay += diff;
-	rq->rq_sched_info.pcnt++;
+	rq_sched_info_arrive(task_rq(t), diff);
 }
 
 /*
@@ -557,8 +581,9 @@ static void sched_info_arrive(task_t *t)
  */
 static inline void sched_info_queued(task_t *t)
 {
-	if (!t->sched_info.last_queued)
-		t->sched_info.last_queued = jiffies;
+	if (sched_info_on())
+		if (!t->sched_info.last_queued)
+			t->sched_info.last_queued = jiffies;
 }
 
 /*
@@ -567,13 +592,10 @@ static inline void sched_info_queued(tas
  */
 static inline void sched_info_depart(task_t *t)
 {
-	struct runqueue *rq = task_rq(t);
 	unsigned long diff = jiffies - t->sched_info.last_arrival;
 
 	t->sched_info.cpu_time += diff;
-
-	if (rq)
-		rq->rq_sched_info.cpu_time += diff;
+	rq_sched_info_depart(task_rq(t), diff);
 }
 
 /*
@@ -585,6 +607,9 @@ static inline void sched_info_switch(tas
 {
 	struct runqueue *rq = task_rq(prev);
 
+ 	if (!sched_info_on())
+		return;
+
 	/*
 	 * prev now departs the cpu.  It's not interesting to record
 	 * stats about how efficient we were at scheduling the idle



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  1:01   ` Lee Revell
@ 2006-03-14  1:05     ` Shailabh Nagar
  2006-03-14  1:12       ` Lee Revell
  0 siblings, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  1:05 UTC (permalink / raw)
  To: Lee Revell; +Cc: linux-kernel, Arjan van de Ven

On Mon, 2006-03-13 at 20:01, Lee Revell wrote:
> On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> > +       ret->tv_sec = end->tv_sec - start->tv_sec;
> > +       ret->tv_nsec = end->tv_nsec - start->tv_nsec; 
> 
> What if end->tv_nsec is less than start->tv_nsec?

The caller of the func can decide what to do. 
In our usage, we choose to not use the result of such a diff
but others may want to return an error etc.

--Shailabh
  
> 
> Lee
> 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  1:05     ` Shailabh Nagar
@ 2006-03-14  1:12       ` Lee Revell
  2006-03-14  3:42         ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: Lee Revell @ 2006-03-14  1:12 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, Arjan van de Ven

On Mon, 2006-03-13 at 20:05 -0500, Shailabh Nagar wrote:
> On Mon, 2006-03-13 at 20:01, Lee Revell wrote:
> > On Mon, 2006-03-13 at 19:42 -0500, Shailabh Nagar wrote:
> > > +       ret->tv_sec = end->tv_sec - start->tv_sec;
> > > +       ret->tv_nsec = end->tv_nsec - start->tv_nsec; 
> > 
> > What if end->tv_nsec is less than start->tv_nsec?
> 
> The caller of the func can decide what to do. 
> In our usage, we choose to not use the result of such a diff
> but others may want to return an error etc.

You don't think it's a problem that 2.0000001s - 1.9999999s would return
garbage rather than 0.0000002?

Lee


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
@ 2006-03-14  2:29   ` jamal
  2006-03-14  2:33   ` Matt Helsley
  1 sibling, 0 replies; 56+ messages in thread
From: jamal @ 2006-03-14  2:29 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, netdev

On Mon, 2006-13-03 at 19:56 -0500, Shailabh Nagar wrote:
> delayacct-genetlink.patch
> 
> Create a generic netlink interface (NETLINK_GENERIC family), 
> called "taskstats", for getting delay and cpu statistics of 
> tasks and thread groups during their lifetime and when they exit. 
> 
> Comments addressed (all in response to Jamal)
> 

Note, you are still not following the standard scheme of doing things.
Example: using command = GET and the message carrying the TGID to note
which TGID is of interest. Instead you have command = TGID.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
  2006-03-14  2:29   ` jamal
@ 2006-03-14  2:33   ` Matt Helsley
  2006-03-14  2:48     ` jamal
  2006-03-14  4:29     ` Shailabh Nagar
  1 sibling, 2 replies; 56+ messages in thread
From: Matt Helsley @ 2006-03-14  2:33 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: linux-kernel, Jamal, netdev

On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
<snip>

> Comments addressed (all in response to Jamal)
> 
> - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE

The enums for these are still in the patch. See below.

<snip>

> +/*
> + * Commands sent from userspace
> + * Not versioned. New commands should only be inserted at the enum's end
> + */
> +
> +enum {
> +       TASKSTATS_CMD_UNSPEC,           /* Reserved */
> +       TASKSTATS_CMD_NONE,             /* Not a valid cmd to send
> +                                        * Marks data sent on task/tgid exit */
> +       TASKSTATS_CMD_LISTEN,           /* Start listening */
> +       TASKSTATS_CMD_IGNORE,           /* Stop listening */

>From the description I thought you had eliminated these.

> +       TASKSTATS_CMD_PID,              /* Send stats for a pid */
> +       TASKSTATS_CMD_TGID,             /* Send stats for a tgid */
> +};

Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> Note, you are still not following the standard scheme of doing things.
> Example: using command = GET and the message carrying the TGID to note
> which TGID is of interest. Instead you have command = TGID.
> 
> cheers,
> jamal

meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
misunderstanding?

<snip>

Cheers,
	-Matt Helsley


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  2:33   ` Matt Helsley
@ 2006-03-14  2:48     ` jamal
  2006-03-14  4:18       ` Shailabh Nagar
  2006-03-22  7:49       ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
  2006-03-14  4:29     ` Shailabh Nagar
  1 sibling, 2 replies; 56+ messages in thread
From: jamal @ 2006-03-14  2:48 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Shailabh Nagar, linux-kernel, netdev

On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:

> Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> > Note, you are still not following the standard scheme of doing things.
> > Example: using command = GET and the message carrying the TGID to note
> > which TGID is of interest. Instead you have command = TGID.
> > 

> meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
> TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
> misunderstanding?
> 

I had a long description in an earlier email feedback; but the summary
of it is the GET command is generic like TASKSTATS_CMD_GET; the message
itself carries TLVs of what needs to be gotten which are 
either PID and/or TGID etc. Anyways, theres a long spill of what i am
saying in that earlier email. Perhaps the current patch is a transition
towards that?

cheers,
jamal 


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  1:12       ` Lee Revell
@ 2006-03-14  3:42         ` Balbir Singh
  2006-03-14  4:26           ` Shailabh Nagar
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-14  3:42 UTC (permalink / raw)
  To: Lee Revell; +Cc: nagar, linux-kernel, Arjan van de Ven

> You don't think it's a problem that 2.0000001s - 1.9999999s would return
> garbage rather than 0.0000002?
>
> Lee
>

The caller can use set_normalized_timespec() to fix that.

Warm Regards,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  2:48     ` jamal
@ 2006-03-14  4:18       ` Shailabh Nagar
  2006-03-22  7:49       ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
  1 sibling, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  4:18 UTC (permalink / raw)
  To: hadi; +Cc: Matt Helsley, linux-kernel, netdev

jamal wrote:

>On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
>  
>
>>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>>    
>>
>
>  
>
>>Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
>>    
>>
>>>Note, you are still not following the standard scheme of doing things.
>>>Example: using command = GET and the message carrying the TGID to note
>>>which TGID is of interest. Instead you have command = TGID.
>>>
>>>      
>>>
>
>  
>
>>meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
>>TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
>>misunderstanding?
>>
>>    
>>
>
>I had a long description in an earlier email feedback; but the summary
>of it is the GET command is generic like TASKSTATS_CMD_GET; the message
>itself carries TLVs of what needs to be gotten which are 
>either PID and/or TGID etc. Anyways, theres a long spill of what i am
>saying in that earlier email. Perhaps the current patch is a transition
>towards that?
>  
>

Yes, the comments you'd made in the previous mail have not been
incorporated and this is still the older version of the patch.
We'd been avoiding TLV usage so far :-)

>cheers,
>jamal 
>
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  3:42         ` Balbir Singh
@ 2006-03-14  4:26           ` Shailabh Nagar
  2006-03-14  6:50             ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  4:26 UTC (permalink / raw)
  To: balbir; +Cc: Lee Revell, linux-kernel, Arjan van de Ven

Balbir Singh wrote:

>>You don't think it's a problem that 2.0000001s - 1.9999999s would return
>>garbage rather than 0.0000002?
>>    
>>
Sorry....you're right. I was thinking of invalid deltas (where end was 
before start).

>>Lee
>>
>>    
>>
>
>The caller can use set_normalized_timespec() to fix that.
>  
>
Lee's point is valid....we should be doing the set_normalized_timespec.

Will fix.
Thanks,
Shailabh

>Warm Regards,
>Balbir
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  2:33   ` Matt Helsley
  2006-03-14  2:48     ` jamal
@ 2006-03-14  4:29     ` Shailabh Nagar
  1 sibling, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14  4:29 UTC (permalink / raw)
  To: Matt Helsley; +Cc: linux-kernel, Jamal, netdev

Matt Helsley wrote:

>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
><snip>
>
>  
>
>>Comments addressed (all in response to Jamal)
>>
>>- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
>>    
>>
>
>The enums for these are still in the patch. See below.
>
><snip>
>
>  
>
>>+/*
>>+ * Commands sent from userspace
>>+ * Not versioned. New commands should only be inserted at the enum's end
>>+ */
>>+
>>+enum {
>>+       TASKSTATS_CMD_UNSPEC,           /* Reserved */
>>+       TASKSTATS_CMD_NONE,             /* Not a valid cmd to send
>>+                                        * Marks data sent on task/tgid exit */
>>+       TASKSTATS_CMD_LISTEN,           /* Start listening */
>>+       TASKSTATS_CMD_IGNORE,           /* Stop listening */
>>    
>>
>
>>From the description I thought you had eliminated these.
>  
>

Yup, typo. the commands aren't registered but the enums hang on.
Will fix.

--Shailabh


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  4:26           ` Shailabh Nagar
@ 2006-03-14  6:50             ` Balbir Singh
  0 siblings, 0 replies; 56+ messages in thread
From: Balbir Singh @ 2006-03-14  6:50 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: Lee Revell, linux-kernel, Arjan van de Ven

> Lee's point is valid....we should be doing the set_normalized_timespec.
>

Semantically that sounds correct. But the intension of the caller as in
our case could be to convert the timespec to nsec_t. Calling
set_normalized_timespec() might be an overhead in this case. 

Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Patch 2/9] Initialization
  2006-03-14  0:45 ` Patch 2/9] Initialization Shailabh Nagar
@ 2006-03-14 10:54   ` Jes Sorensen
  2006-03-14 15:20     ` Shailabh Nagar
  2006-03-15 10:24   ` Arjan van de Ven
  1 sibling, 1 reply; 56+ messages in thread
From: Jes Sorensen @ 2006-03-14 10:54 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, Arjan van de Ven

>>>>> "Shailabh" == Shailabh Nagar <nagar@watson.ibm.com> writes:

Shailabh> delayacct-setup.patch Initialization code related to
Shailabh> collection of per-task "delay" statistics which measure how
Shailabh> long it had to wait for cpu, sync block io, swapping
Shailabh> etc. The collection of statistics and the interface are in
Shailabh> other patches. This patch sets up the data structures and
Shailabh> allows the statistics collection to be disabled through a
Shailabh> kernel boot paramater.

Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT
Shailabh> +struct task_delay_info {
Shailabh> +	spinlock_t	lock;
Shailabh> +
Shailabh> +	/* For each stat XXX, add following, aligned appropriately
Shailabh> +	 *
Shailabh> +	 * struct timespec XXX_start, XXX_end;
Shailabh> +	 * u64 XXX_delay;
Shailabh> +	 * u32 XXX_count;
Shailabh> +	 */
Shailabh> +};
Shailabh> +#endif

Hmmm

I thought you were going to change this to do

u64 some_delay
u32 foo_count
u32 bar_count
u64 another_delay

To avoid wasting space on 64 bit platforms.

Jes

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Patch 2/9] Initialization
  2006-03-14 10:54   ` Jes Sorensen
@ 2006-03-14 15:20     ` Shailabh Nagar
  0 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14 15:20 UTC (permalink / raw)
  To: Jes Sorensen; +Cc: linux-kernel, Arjan van de Ven

Jes Sorensen wrote:

>>>>>>"Shailabh" == Shailabh Nagar <nagar@watson.ibm.com> writes:
>>>>>>            
>>>>>>
>
>Shailabh> delayacct-setup.patch Initialization code related to
>Shailabh> collection of per-task "delay" statistics which measure how
>Shailabh> long it had to wait for cpu, sync block io, swapping
>Shailabh> etc. The collection of statistics and the interface are in
>Shailabh> other patches. This patch sets up the data structures and
>Shailabh> allows the statistics collection to be disabled through a
>Shailabh> kernel boot paramater.
>
>Shailabh> +#ifdef CONFIG_TASK_DELAY_ACCT
>Shailabh> +struct task_delay_info {
>Shailabh> +	spinlock_t	lock;
>Shailabh> +
>Shailabh> +	/* For each stat XXX, add following, aligned appropriately
>Shailabh> +	 *
>Shailabh> +	 * struct timespec XXX_start, XXX_end;
>Shailabh> +	 * u64 XXX_delay;
>Shailabh> +	 * u32 XXX_count;
>Shailabh> +	 */
>Shailabh> +};
>Shailabh> +#endif
>
>Hmmm
>
>I thought you were going to change this to do
>
>u64 some_delay
>u32 foo_count
>u32 bar_count
>u64 another_delay
>
>To avoid wasting space on 64 bit platforms.
>  
>

Well, the "aligned appropriately" part of the comment was intended to let
future users know that something like the above should be done.

e.g in a subsequent patch (5/9)
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1919.html
when we introduce the additional stat swapin_delay/count, a 64-bit friendly
alignment is being done thus:

u64 blkio_delay; /* wait for sync block io completion */
+ u64 swapin_delay; /* wait for sync block io completion */
u32 blkio_count;
+ u32 swapin_count;

The need for each stat, potentially, to need a separate set of timespec 
variables
to reduce nesting problems does introduce another wrinkle in 
cache-friendliness
since you'd ideally like to have all the XXX_* variables close together. 
Right now
we're good since swapin doesn't need extra timespecs of its own. But future
additions might need to be more careful.

--Shailabh


>Jes
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Per-task delay accounting
  2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
                   ` (8 preceding siblings ...)
  2006-03-14  1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar
@ 2006-03-14 19:28 ` Greg KH
  2006-03-14 20:49   ` Shailabh Nagar
  2006-03-23 15:16   ` [Patch 0/9] Performance Shailabh Nagar
  9 siblings, 2 replies; 56+ messages in thread
From: Greg KH @ 2006-03-14 19:28 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin

On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> This is the next iteration of the delay accounting patches
> last posted at
> 	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html

Do you have any benchmark numbers with this patch applied and with it
not applied?  Last I heard it was a measurable decrease for some
"important" benchmark results...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Per-task delay accounting
  2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH
@ 2006-03-14 20:49   ` Shailabh Nagar
  2006-03-14 21:24     ` Greg KH
  2006-03-23 15:16   ` [Patch 0/9] Performance Shailabh Nagar
  1 sibling, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14 20:49 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Nick Piggin

Greg KH wrote:

>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>  
>
>>This is the next iteration of the delay accounting patches
>>last posted at
>>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>    
>>
>
>Do you have any benchmark numbers with this patch applied and with it
>not applied? 
>
None yet. Wanted to iron out the collection/utility aspects a bit before 
going into
the performance impact.

But this seems as good a time as any to collect some stats
using the usual suspects lmbench, kernbench, hackbench etc.

> Last I heard it was a measurable decrease for some
>"important" benchmark results...
>  
>
Might have been from an older iteration where schedstats was fully enabled.
But no point speculating....will run with this set of patches and see 
what shakes out.

One point about the overhead is that it depends on the frequency with 
which data is
collected. So a proper test would probably be a comparison of a 
non-patched kernel
with
a) patches applied but delay accounting not turned on at boot i.e. cost 
of the checks
b) delay accounting turned on but not being read
c) delay accounting turned on and data read for all tasks at some 
"reasonable" rate

Will that be good  ? Other suggestions welcome.

>thanks,
>
>greg k-h
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Per-task delay accounting
  2006-03-14 20:49   ` Shailabh Nagar
@ 2006-03-14 21:24     ` Greg KH
  2006-03-14 21:59       ` Shailabh Nagar
  0 siblings, 1 reply; 56+ messages in thread
From: Greg KH @ 2006-03-14 21:24 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin

On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote:
> Greg KH wrote:
> 
> >On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> > 
> >
> >>This is the next iteration of the delay accounting patches
> >>last posted at
> >>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
> >>   
> >>
> >
> >Do you have any benchmark numbers with this patch applied and with it
> >not applied? 
> >
> None yet. Wanted to iron out the collection/utility aspects a bit before 
> going into
> the performance impact.
> 
> But this seems as good a time as any to collect some stats
> using the usual suspects lmbench, kernbench, hackbench etc.
> 
> >Last I heard it was a measurable decrease for some
> >"important" benchmark results...
> > 
> >
> Might have been from an older iteration where schedstats was fully enabled.
> But no point speculating....will run with this set of patches and see 
> what shakes out.
> 
> One point about the overhead is that it depends on the frequency with 
> which data is
> collected. So a proper test would probably be a comparison of a 
> non-patched kernel
> with
> a) patches applied but delay accounting not turned on at boot i.e. cost 
> of the checks
> b) delay accounting turned on but not being read

This is probably the most important one, as that is what distros will be
looking at.  They will have to enable the option, but will not "turn it
on".

> c) delay accounting turned on and data read for all tasks at some 
> "reasonable" rate
> 
> Will that be good  ? Other suggestions welcome.

How about real benchmarks?  The ones that the big companies look at?  I
know you have access to them :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Per-task delay accounting
  2006-03-14 21:24     ` Greg KH
@ 2006-03-14 21:59       ` Shailabh Nagar
  0 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-14 21:59 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Nick Piggin

Greg KH wrote:

>On Tue, Mar 14, 2006 at 03:49:16PM -0500, Shailabh Nagar wrote:
>  
>
>>Greg KH wrote:
>>
>>    
>>
>>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>>>
>>>
>>>      
>>>
>>>>This is the next iteration of the delay accounting patches
>>>>last posted at
>>>>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>>>  
>>>>
>>>>        
>>>>
>>>Do you have any benchmark numbers with this patch applied and with it
>>>not applied? 
>>>
>>>      
>>>
>>None yet. Wanted to iron out the collection/utility aspects a bit before 
>>going into
>>the performance impact.
>>
>>But this seems as good a time as any to collect some stats
>>using the usual suspects lmbench, kernbench, hackbench etc.
>>
>>    
>>
>>>Last I heard it was a measurable decrease for some
>>>"important" benchmark results...
>>>
>>>
>>>      
>>>
>>Might have been from an older iteration where schedstats was fully enabled.
>>But no point speculating....will run with this set of patches and see 
>>what shakes out.
>>
>>One point about the overhead is that it depends on the frequency with 
>>which data is
>>collected. So a proper test would probably be a comparison of a 
>>non-patched kernel
>>with
>>a) patches applied but delay accounting not turned on at boot i.e. cost 
>>of the checks
>>b) delay accounting turned on but not being read
>>    
>>
>
>This is probably the most important one, as that is what distros will be
>looking at.  They will have to enable the option, but will not "turn it
>on".
>  
>
I guess you meant a), not b) but yes, will run them in all these modes.

>  
>
>>c) delay accounting turned on and data read for all tasks at some 
>>"reasonable" rate
>>
>>Will that be good  ? Other suggestions welcome.
>>    
>>
>
>How about real benchmarks?  The ones that the big companies look at?  I
>know you have access to them :)
>  
>
Hmm...though you also know, from working for some "big company",  that 
it might
take a while to get such data since one has to stand in queue :-)
I'll try, and also explore the OSDL STP's DBT tests.


Thanks,
Shailabh

>thanks,
>
>greg k-h
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
  2006-03-14  1:01   ` Lee Revell
@ 2006-03-15 10:23   ` Arjan van de Ven
  2006-03-15 10:28     ` Balbir Singh
  1 sibling, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2006-03-15 10:23 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel


> +static inline void timespec_diff(struct timespec *start, struct timespec *end,
> +				 struct timespec *ret)
> +{
> +	ret->tv_sec = end->tv_sec - start->tv_sec;
> +	ret->tv_nsec = end->tv_nsec - start->tv_nsec;
> +}
>  #endif /* __KERNEL__ */

I'd suggest normalizing the timespec; better to do it in such a function
than in all callers of it..



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Patch 2/9] Initialization
  2006-03-14  0:45 ` Patch 2/9] Initialization Shailabh Nagar
  2006-03-14 10:54   ` Jes Sorensen
@ 2006-03-15 10:24   ` Arjan van de Ven
  2006-03-15 12:37     ` Alan Cox
  1 sibling, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2006-03-15 10:24 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel


> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of version 2.1 of the GNU Lesser General Public License
> + * as published by the Free Software Foundation.
> + *

LGPL inside the kernel doesn't make a whole lot of sense.... better make
it GPL.


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 3/9] Block I/O accounting initialization
  2006-03-14  0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar
@ 2006-03-15 10:27   ` Arjan van de Ven
  2006-03-15 16:27     ` Shailabh Nagar
  0 siblings, 1 reply; 56+ messages in thread
From: Arjan van de Ven @ 2006-03-15 10:27 UTC (permalink / raw)
  To: nagar; +Cc: linux-kernel, Andi Kleen


>  
> +static inline void delayacct_blkio_start(void)
> +{
> +	if (unlikely(delayacct_on))
> +		__delayacct_blkio_start();
> +}


I still think the unlikely() makes no sense here; at runtime it's either
going to be always on or off (in the sense that switching it will be
RARE). The cpus branch predictor will get that right; while if you force
it unlikely you can even run the risk of getting a 100% miss on some
architectures (not x86/x86-64) when you enable the accounting



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 1/9] timestamp diff
  2006-03-15 10:23   ` Arjan van de Ven
@ 2006-03-15 10:28     ` Balbir Singh
  0 siblings, 0 replies; 56+ messages in thread
From: Balbir Singh @ 2006-03-15 10:28 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: nagar, linux-kernel

>
> > +static inline void timespec_diff(struct timespec *start, struct timespec *end,
> > +                              struct timespec *ret)
> > +{
> > +     ret->tv_sec = end->tv_sec - start->tv_sec;
> > +     ret->tv_nsec = end->tv_nsec - start->tv_nsec;
> > +}
> >  #endif /* __KERNEL__ */
>
> I'd suggest normalizing the timespec; better to do it in such a function
> than in all callers of it..
>

Yep, that also looks semantically correct. We will do so.

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Patch 2/9] Initialization
  2006-03-15 10:24   ` Arjan van de Ven
@ 2006-03-15 12:37     ` Alan Cox
  2006-03-15 15:53       ` Shailabh Nagar
  0 siblings, 1 reply; 56+ messages in thread
From: Alan Cox @ 2006-03-15 12:37 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: nagar, linux-kernel

On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote:
> > + * This program is free software; you can redistribute it and/or modify it
> > + * under the terms of version 2.1 of the GNU Lesser General Public License
> > + * as published by the Free Software Foundation.
> > + *
> 
> LGPL inside the kernel doesn't make a whole lot of sense.... better make
> it GPL.

When you combine an LGPL and GPL work you get a GPL work so yes it would
be clearer to mark it GPL as that is what it became as it was merged,
but perhaps to add a note stating where it can be obtained under other
licenses for other projects.



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: Patch 2/9] Initialization
  2006-03-15 12:37     ` Alan Cox
@ 2006-03-15 15:53       ` Shailabh Nagar
  0 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-15 15:53 UTC (permalink / raw)
  To: Alan Cox; +Cc: Arjan van de Ven, linux-kernel

Alan Cox wrote:

>On Mer, 2006-03-15 at 11:24 +0100, Arjan van de Ven wrote:
>  
>
>>>+ * This program is free software; you can redistribute it and/or modify it
>>>+ * under the terms of version 2.1 of the GNU Lesser General Public License
>>>+ * as published by the Free Software Foundation.
>>>+ *
>>>      
>>>
>>LGPL inside the kernel doesn't make a whole lot of sense.... better make
>>it GPL.
>>    
>>
>
>When you combine an LGPL and GPL work you get a GPL work so yes it would
>be clearer to mark it GPL as that is what it became as it was merged,
>but perhaps to add a note stating where it can be obtained under other
>licenses for other projects.
>
>  
>
Thanks. The LGPL usage is a mistake in the core code....will change to GPL
everywhere except the  case below. There's no intent to have the kernel 
code available
under other licenses etc. so thats not a problem.

However, the confusion about what license to use for a header file that 
will need to be
included in a potentially non-GPL user application persists. The header file
    include/linux/taskstats.h, created in patch 9/9
    http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.1/1925.html

defines the user-kernel messaging interface and should probably continue 
to have LGPL,
just to be absolutely safe legally. I've not been following the legal 
twists too carefully so
its probably overkill.

--Shailabh







^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 3/9] Block I/O accounting initialization
  2006-03-15 10:27   ` Arjan van de Ven
@ 2006-03-15 16:27     ` Shailabh Nagar
  0 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-15 16:27 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, Andi Kleen, Greg KH

Arjan van de Ven wrote:

>> 
>>+static inline void delayacct_blkio_start(void)
>>+{
>>+	if (unlikely(delayacct_on))
>>+		__delayacct_blkio_start();
>>+}
>>    
>>
>
>
>I still think the unlikely() makes no sense here; at runtime it's either
>going to be always on or off (in the sense that switching it will be
>RARE). The cpus branch predictor will get that right; while if you force
>it unlikely you can even run the risk of getting a 100% miss on some
>architectures (not x86/x86-64) when you enable the accounting
>
>  
>
I don't understand why the fact that delayacct_on is not going to vary
dynamically once a kernel has been booted should prevent the usage of 
"unlikely/likely"
Perhaps you can help me understand. Here's the logic I was using:

There are two kinds of users

1. Those who will not turn delay accounting on at boot time for a number of
reasons (don't care about the functionality, don't want to incur any 
overhead
however small).

2. The ones who do turn it on at boot time are willing to pay some overhead
for the benefits such functionality brings. What that overhead exactly 
is will be known
when we finish running some tests but we can safely assume it'll be 
non-zero.

A distro will need to keep delay accounting configured to accomodate 
both kinds
of users but because 1 is the common case,  the default boot time option 
will be off and
there is a strong incentive to reduce overhead for non-users.

Using unlikely reduces the overhead for the common case 1 and increases 
overhead for
case 2 (because of what you pointed out of 100% wrong decisions by the 
branch predictor).

I don't know how much of overhead reduction is achieved by using 
unlikely (I'm assuming its
non-trivial) or how much penalty is imposed by a 100% wrong prediction 
(I'm assuming its not
very high).

But under these assumptions, its better to use unlikely, make type 2 
users pay extra overhead
and save type 1 users from any.

Is this reasoning accurate ? If not, we could easily switch the unlikely 
off.

Regards,
Shailabh

(cc'ing Greg since he'd brought up the overhead for real benchmarks etc.)

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-14  2:48     ` jamal
  2006-03-14  4:18       ` Shailabh Nagar
@ 2006-03-22  7:49       ` Balbir Singh
  2006-03-23 14:04         ` jamal
  1 sibling, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-22  7:49 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
> 
> 
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are 
> either PID and/or TGID etc. Anyways, theres a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
> 

Hi, Jamal,

Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.

Comments addressed (in this patch)

- Changed the code to use TLV's for data exchange between kernel and
  user space

Thanks,
Balbir


Documentation for the patch

Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.

More changes expected. Following comments will go into a
Documentation file:

When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.

The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

---

 include/linux/delayacct.h |   11 ++
 include/linux/taskstats.h |  112 ++++++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   44 ++++++++
 kernel/taskstats.c        |  251 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 432 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h	2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
 	return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+				    struct task_struct *tsk)
+{
+	if (!tsk->delays)
+		return -EINVAL;
+	return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h	2006-03-22 13:12:01.000000000 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+#define TASKSTATS_NOPID	-1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+	/* Version 1 */
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_CMD_GET,		/* user->kernel request */
+	TASKSTATS_CMD_NEW,		/* kernel->user event */
+	__TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+	TASKSTATS_TYPE_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_TYPE_TGID,		/* Thread group id */
+	TASKSTATS_TYPE_PID,		/* Process id */
+	TASKSTATS_TYPE_STATS,		/* taskstats structure */
+	__TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+	TASKSTATS_CMD_ATTR_UNSPEC = 0,
+	TASKSTATS_CMD_ATTR_PID,
+	TASKSTATS_CMD_ATTR_TGID,
+	__TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+enum {
+	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
+	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
+};
+
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig	2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c	2006-03-22 13:12:01.000000000 +0530
@@ -1,6 +1,7 @@
 /* delayacct.c - per-task delay accounting
  *
  * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
 
 static int __init delayacct_setup_enable(char *str)
 {
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
 		tsk->delays = NULL;
 	}
+	mutex_unlock(&delayacct_exit_mutex);
 }
 
 /*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
 	spin_unlock(&tsk->delays->lock);
 	return ret;
 }
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+	return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c	2006-03-22 13:12:07.000000000 +0530
@@ -0,0 +1,251 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.hdrsize        = 0,
+	.maxattr        = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
+	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+			 void **replyp)
+{
+	struct sk_buff *skb;
+	void *reply;
+
+	skb = nlmsg_new(NLMSG_GOODSIZE);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, NLM_F_REQUEST,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, info->nlhdr->nlmsg_flags,
+				    info->genlhdr->cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	void *reply;
+	int rc;
+
+	reply = genlmsg_data(genlhdr);
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event == TASKSTATS_MSG_MULTICAST)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+			   struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(stats, tsk);
+	put_task_struct(tsk);
+
+	return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+			    struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(stats, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats stats;
+	void *reply;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc < 0)
+		return rc;
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+		rc = fill_pid((pid_t)pid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+	}
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+		rc = fill_tgid((pid_t)tgid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+	}
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+	return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	void *reply;
+	struct taskstats stats;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+	rc = fill_pid(tsk->pid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+	if (rc || thread_group_empty(tsk))
+		return;
+
+	/* Send tgid data too */
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	rc = fill_tgid(tsk->tgid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+	genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+	.cmd            = TASKSTATS_CMD_GET,
+	.doit           = taskstats_send_stats,
+	.policy		= taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+        family_registered = 1;
+
+	if (genl_register_ops(&family, &taskstats_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-22  7:49       ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
@ 2006-03-23 14:04         ` jamal
  2006-03-23 15:41           ` Balbir Singh
  2006-03-24  1:32           ` Balbir Singh
  0 siblings, 2 replies; 56+ messages in thread
From: jamal @ 2006-03-23 14:04 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev


Hi Balbir,

Looking good.
This is a quick scan, so i didnt look at little details.
Some comments embedded.

On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:




>  

> diff -puN /dev/null include/linux/taskstats.h

> + * The struct is versioned. Newer versions should only add fields to
> + * the bottom of the struct to maintain backward compatibility.
> + *
> + * To create the next version, bump up the taskstats_version variable
> + * and delineate the start of newly added fields with a comment indicating
> + * the version number.
> + */
> +



> +enum {
> +	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
> +	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
> +};
> +

Above will never be used outside of the kernel.
Should it go under the ifdef kernel below?


> +#ifdef __KERNEL__
> +

Note: some people will argue that you should probably have
two header files. One for all kernel things and includes another
which contains all the stuff above ifdef __KERNEL__

> +
> +#endif /* __KERNEL__ */
> +#endif /* _LINUX_TASKSTATS_H */




> diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> --- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
> +++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530

> +
> +const int taskstats_version = TASKSTATS_VERSION;
> +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> +static int family_registered = 0;
> +
> +static struct genl_family family = {
> +	.id             = GENL_ID_GENERATE,
> +	.name           = TASKSTATS_GENL_NAME,
> +	.version        = TASKSTATS_GENL_VERSION,
> +	.hdrsize        = 0,

Do you need to specify hdrsize of 0?


> +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> +			 void **replyp)
> +{
> +	struct sk_buff *skb;
> +	void *reply;
> +
> +	skb = nlmsg_new(NLMSG_GOODSIZE);

Ok,  getting a size of NLMSG_GOODSIZE is not a good idea. 
The max size youll ever get it seems to me is 2*32-bit-data TLVs +
sizeof struct stats. Why dont you allocate that size?

> +	if (!skb)
> +		return -ENOMEM;
> +
> +	if (!info) {
> +		int seq = get_cpu_var(taskstats_seqnum)++;
> +		put_cpu_var(taskstats_seqnum);
> +
> +		reply = genlmsg_put(skb, 0, seq,
> +				    family.id, 0, NLM_F_REQUEST,
> +				    cmd, family.version);

Double check if you need NLM_F_REQUEST

> +	} else
> +		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> +				    family.id, 0, info->nlhdr->nlmsg_flags,
> +				    info->genlhdr->cmd, family.version);

A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
right thing?





> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{
> +	int rc;
> +	struct sk_buff *rep_skb;
> +	struct taskstats stats;
> +	void *reply;
> +
> +	memset(&stats, 0, sizeof(stats));
> +	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);

Same comment as before: a response to a GET is a NEW; so
info->genlhdr->cmd doesnt seem right.

> +	if (rc < 0)
> +		return rc;
> +
> +	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> +		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> +		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> +		rc = fill_pid((pid_t)pid, NULL, &stats);
> +		if (rc < 0)
> +			return rc;
> +	}
> +
> +	if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> +		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> +		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> +		rc = fill_tgid((pid_t)tgid, NULL, &stats);
> +		if (rc < 0)
> +			return rc;
> +	}
> +

Should there be at least either a pid or tgid? If yes, you need to
validate here...

> +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> +	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> +
> +nla_put_failure:
> +	return genlmsg_cancel(rep_skb, reply);
> +

As a general comment double check your logic for errors; if you already
have stashed something in the skb, you need to remove it etc.

> +}
> +
> +
> +/* Send pid data out on exit */
> +void taskstats_exit_pid(struct task_struct *tsk)
> +{
> +	int rc;
> +	struct sk_buff *rep_skb;
> +	void *reply;
> +	struct taskstats stats;
> +
> +	/*
> +	 * tasks can start to exit very early. Ensure that the family
> +	 * is registered before notifications are sent out
> +	 */
> +	if (!family_registered)
> +		return;
> +
> +	memset(&stats, 0, sizeof(stats));
> +	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> +	if (rc < 0)
> +		return;
> +
> +	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> +	rc = fill_pid(tsk->pid, tsk, &stats);
> +	if (rc < 0)
> +		return;
> +
> +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> +	rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> +	if (rc || thread_group_empty(tsk))
> +		return;
> +
> +	/* Send tgid data too */
> +	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> +	if (rc < 0)
> +		return;
> +

A single message with PID+TGID sounds reasonable. Why two messages with
two stats? all you will need to do is get rid of the prepare_reply()
above and NLA_PUT_U32() below (just like you do in a response to a GET.

> +	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> +	rc = fill_tgid(tsk->tgid, tsk, &stats);
> +	if (rc < 0)
> +		return;
> +
> +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> +	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> +nla_put_failure:
> +	genlmsg_cancel(rep_skb, reply);
> +}


Other than the above comments - I believe you have it right.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Performance
  2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH
  2006-03-14 20:49   ` Shailabh Nagar
@ 2006-03-23 15:16   ` Shailabh Nagar
  2006-03-25  2:38     ` Greg KH
  1 sibling, 1 reply; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-23 15:16 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Nick Piggin

Greg KH wrote:
> On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>
>>This is the next iteration of the delay accounting patches
>>last posted at
>>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>
>
> Do you have any benchmark numbers with this patch applied and with it
> not applied?  Last I heard it was a measurable decrease for some
> "important" benchmark results...
>
> thanks,
>
> greg k-h

Here are some numbers for the latest set of posted patches
using microbenchmarks hackbench, kernbench and lmbench.

I was trying to get the real/big benchmark numbers too but
it looks like getting a run whose numbers can be trusted
will take a bit longer than expected. Preliminary runs of
transaction processing benchmarks indicate that overhead
actually decreases with the patch (as also seen in some of
the lmbench numbers below).

--Shailabh



Results highlights

- Configuring delay accounting adds < 0.5%
  overhead in most cases and even reduces overhead
  in some cases

- Enabling delay accounting has similar results
  with a maximum overhead of 1.2% for hackbench
  , most other overheads < 1% and reduction in
  overhead in some cases



Base
	Vanilla 2.6.16-rc6 kernel
	without any patches applied
+patch	
	Delay accounting configured
	but not enabled at boot
+patch+enable
	Delay accounting enabled at boot
	but no stats read

Hackbench
---------
200 groups, using sockets
Elapsed time, in seconds, lower better

		%Ovhd	Time 	
Base		0	12.468	
+patch		0.4%	12.523	
+patch+enable	1.2%	12.622	

Kernbench
---------
Average of 5 iterations
Elapsed time, in seconds, lower better

		%Ovhd	Elapsed	
Base		0	195.776
+patch		0.2%	196.246
+patch+enable	0.3%	196.282

Lmbench
-------

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host                 OS  Mhz null null      open selct sig  sig  fork exec sh
                             call  I/O stat clos TCP   inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
base      Linux 2.6.16- 2783 0.17 0.33 5.17 6.49  13.4 0.64 2.61 146. 610. 9376
+patch    Linux 2.6.16- 2781 0.17 0.32 4.75 5.85  13.0 0.64 2.62 145. 628. 9393
+patch+en Linux 2.6.16- 2784 0.17 0.32 4.71 6.14  13.4 0.64 2.60 150. 616. 9402

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host                 OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
base      Linux 2.6.16- 4.340 4.9600 7.3300 6.5700   30.3    10.4    36.0
+patch    Linux 2.6.16- 4.390 4.9800 7.3100 6.5900   29.7 9.62000    35.8
+patch+en Linux 2.6.16- 4.560 5.0800 7.2400 5.6900   22.7    10.3    33.8

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
base      Linux 2.6.16- 4.340  15.9 12.2  18.3  24.9  21.5  29.1 45.3
+patch    Linux 2.6.16- 4.390  15.7 11.8  18.6  22.2  22.0  29.1 44.8
+patch+en Linux 2.6.16- 4.560  15.6 12.1  18.9  25.3  21.9  27.1 45.1

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host                 OS   0K File      10K File      Mmap    Prot    Page
                        Create Delete Create Delete  Latency Fault   Fault
--------- ------------- ------ ------ ------ ------  ------- -----   -----
base      Linux 2.6.16-   39.8   58.0  112.0   82.6   8417.0 0.838 2.00000
+patch    Linux 2.6.16-   39.6   58.2  111.0   82.3   8392.0 0.864 2.00000
+patch+en Linux 2.6.16-   39.6   59.1  112.8   83.2   8308.0 0.821 2.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
base      Linux 2.6.16- 676. 616. 620. 1658.0 2030.6  759.6  825.9 2032 1177.
+patch    Linux 2.6.16- 627. 165. 616. 1649.9 2030.9  766.1  834.1 2030 1187.
+patch+en Linux 2.6.16- 633. 148. 603. 1569.7 2030.9  757.2  835.3 2030 1174.

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host                 OS   Mhz  L1 $   L2 $    Main mem    Guesses
--------- -------------  ---- ----- ------    --------    -------
base      Linux 2.6.16-  2783 0.719 6.5960  110.5
+patch    Linux 2.6.16-  2781 0.720 6.5980  111.0
+patch+en Linux 2.6.16-  2784 0.720 6.5970  110.7



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-23 14:04         ` jamal
@ 2006-03-23 15:41           ` Balbir Singh
  2006-03-24 14:04             ` jamal
  2006-03-24  1:32           ` Balbir Singh
  1 sibling, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-23 15:41 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> Hi Balbir,
> 
> Looking good.
> This is a quick scan, so i didnt look at little details.

Thanks for your detailed feedback. Please find my response interspersed
along with your comments.

> Some comments embedded.
> 
> On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:
> 
> 
> 
> 
> >  
> 
> > diff -puN /dev/null include/linux/taskstats.h
> 
> > + * The struct is versioned. Newer versions should only add fields to
> > + * the bottom of the struct to maintain backward compatibility.
> > + *
> > + * To create the next version, bump up the taskstats_version variable
> > + * and delineate the start of newly added fields with a comment indicating
> > + * the version number.
> > + */
> > +
> 
> 
> 
> > +enum {
> > +	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
> > +	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
> > +};
> > +
> 
> Above will never be used outside of the kernel.
> Should it go under the ifdef kernel below?
>

Will do
 
> 
> > +#ifdef __KERNEL__
> > +
> 
> Note: some people will argue that you should probably have
> two header files. One for all kernel things and includes another
> which contains all the stuff above ifdef __KERNEL__
> 
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* _LINUX_TASKSTATS_H */
> 
> 
> 
> 
> > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
> > +++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530
> 
> > +
> > +const int taskstats_version = TASKSTATS_VERSION;
> > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> > +static int family_registered = 0;
> > +
> > +static struct genl_family family = {
> > +	.id             = GENL_ID_GENERATE,
> > +	.name           = TASKSTATS_GENL_NAME,
> > +	.version        = TASKSTATS_GENL_VERSION,
> > +	.hdrsize        = 0,
> 
> Do you need to specify hdrsize of 0?

I will remove it

> 
> 
> > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> > +			 void **replyp)
> > +{
> > +	struct sk_buff *skb;
> > +	void *reply;
> > +
> > +	skb = nlmsg_new(NLMSG_GOODSIZE);
> 
> Ok,  getting a size of NLMSG_GOODSIZE is not a good idea. 
> The max size youll ever get it seems to me is 2*32-bit-data TLVs +
> sizeof struct stats. Why dont you allocate that size?
> 

Will do

> > +	if (!skb)
> > +		return -ENOMEM;
> > +
> > +	if (!info) {
> > +		int seq = get_cpu_var(taskstats_seqnum)++;
> > +		put_cpu_var(taskstats_seqnum);
> > +
> > +		reply = genlmsg_put(skb, 0, seq,
> > +				    family.id, 0, NLM_F_REQUEST,
> > +				    cmd, family.version);
> 
> Double check if you need NLM_F_REQUEST
> 

It is not required, I will remove it

> > +	} else
> > +		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> > +				    family.id, 0, info->nlhdr->nlmsg_flags,
> > +				    info->genlhdr->cmd, family.version);
> 
> A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
> right thing?
> 
> 

I will change it

> 
> 
> 
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
> > +	int rc;
> > +	struct sk_buff *rep_skb;
> > +	struct taskstats stats;
> > +	void *reply;
> > +
> > +	memset(&stats, 0, sizeof(stats));
> > +	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
> 
> Same comment as before: a response to a GET is a NEW; so
> info->genlhdr->cmd doesnt seem right.
> 

I will change it

> > +	if (rc < 0)
> > +		return rc;
> > +
> > +	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > +		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > +		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > +		rc = fill_pid((pid_t)pid, NULL, &stats);
> > +		if (rc < 0)
> > +			return rc;
> > +	}
> > +
> > +	if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> > +		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> > +		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> > +		rc = fill_tgid((pid_t)tgid, NULL, &stats);
> > +		if (rc < 0)
> > +			return rc;
> > +	}
> > +
> 
> Should there be at least either a pid or tgid? If yes, you need to
> validate here...
>

Yes, you are correct. One of my test cases caught it too.. But I did
not want to untidy the code with if-else's which will keep growing if
the attributes change in the future. I just followed the controller
example. I will change it and validate it. Currently if the attribute
is not valid, a stat of all zero's is returned back.
 
> > +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > +	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> > +
> > +nla_put_failure:
> > +	return genlmsg_cancel(rep_skb, reply);
> > +
> 
> As a general comment double check your logic for errors; if you already
> have stashed something in the skb, you need to remove it etc.
> 

Wouldn't genlmsg_cancel() take care of cleaning all attributes?


> > +}
> > +
> > +
> > +/* Send pid data out on exit */
> > +void taskstats_exit_pid(struct task_struct *tsk)
> > +{
> > +	int rc;
> > +	struct sk_buff *rep_skb;
> > +	void *reply;
> > +	struct taskstats stats;
> > +
> > +	/*
> > +	 * tasks can start to exit very early. Ensure that the family
> > +	 * is registered before notifications are sent out
> > +	 */
> > +	if (!family_registered)
> > +		return;
> > +
> > +	memset(&stats, 0, sizeof(stats));
> > +	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > +	if (rc < 0)
> > +		return;
> > +
> > +	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> > +	rc = fill_pid(tsk->pid, tsk, &stats);
> > +	if (rc < 0)
> > +		return;
> > +
> > +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > +	rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > +	if (rc || thread_group_empty(tsk))
> > +		return;
> > +
> > +	/* Send tgid data too */
> > +	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > +	if (rc < 0)
> > +		return;
> > +
> 
> A single message with PID+TGID sounds reasonable. Why two messages with
> two stats? all you will need to do is get rid of the prepare_reply()
> above and NLA_PUT_U32() below (just like you do in a response to a GET.
> 

The reason for two stats is that for TGID, we return accumulated values
(of all threads in the group) and for PID we return the value just
for that pid. The return value is

pid
<stats for pid>
tgid
<accumulated stats for tgid>

> > +	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> > +	rc = fill_tgid(tsk->tgid, tsk, &stats);
> > +	if (rc < 0)
> > +		return;
> > +
> > +	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > +	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > +nla_put_failure:
> > +	genlmsg_cancel(rep_skb, reply);
> > +}
> 
> 
> Other than the above comments - I believe you have it right.
> 

Thanks, I will get back to you with the updated patch soon.

> cheers,
> jamal

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-23 14:04         ` jamal
  2006-03-23 15:41           ` Balbir Singh
@ 2006-03-24  1:32           ` Balbir Singh
  2006-03-24 14:11             ` jamal
  1 sibling, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-24  1:32 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> Hi Balbir,
> 
> Looking good.
> This is a quick scan, so i didnt look at little details.
> Some comments embedded.

Hi, Jamal,

I tried addressing your comments in this new version.

Changelog
---------
1. Moved TASKSTATS_MSG_* to under #ifdef __KERNEL__
2. Got rid of .hdrsize = 0 in genl_family family
3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
4. Got rid of NLM_F_REQUEST, all flags passed down to user space are now 0
5. The response to TASKSTATS_CMD_GET is TASKSTATS_CMD_NEW
6. taskstats_send_stats() now validates the command attributes and ensures
   that it either gets a PID or a TGID. If it gets both simultaneously
   the PID stats are sent.
7. Do not put the PID/TGID into the skb if there are errors in fill_pid() or
   fill_tgid().

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  111 ++++++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   44 +++++++
 kernel/taskstats.c        |  255 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 435 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h	2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
 	return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+				    struct task_struct *tsk)
+{
+	if (!tsk->delays)
+		return -EINVAL;
+	return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h	2006-03-24 06:49:24.000000000 +0530
@@ -0,0 +1,111 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+#define TASKSTATS_NOPID	-1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+	/* Version 1 */
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_CMD_GET,		/* user->kernel request */
+	TASKSTATS_CMD_NEW,		/* kernel->user event */
+	__TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+	TASKSTATS_TYPE_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_TYPE_TGID,		/* Thread group id */
+	TASKSTATS_TYPE_PID,		/* Process id */
+	TASKSTATS_TYPE_STATS,		/* taskstats structure */
+	__TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+	TASKSTATS_CMD_ATTR_UNSPEC = 0,
+	TASKSTATS_CMD_ATTR_PID,
+	TASKSTATS_CMD_ATTR_TGID,
+	__TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
+	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig	2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c	2006-03-24 06:49:24.000000000 +0530
@@ -1,6 +1,7 @@
 /* delayacct.c - per-task delay accounting
  *
  * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
 
 static int __init delayacct_setup_enable(char *str)
 {
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
 		tsk->delays = NULL;
 	}
+	mutex_unlock(&delayacct_exit_mutex);
 }
 
 /*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
 	spin_unlock(&tsk->delays->lock);
 	return ret;
 }
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+	return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c	2006-03-24 06:50:03.000000000 +0530
@@ -0,0 +1,255 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.maxattr        = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
+	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+			 void **replyp)
+{
+	struct sk_buff *skb;
+	void *reply;
+
+	/*
+	 * If new attributes are added, please revisit this allocation
+	 */
+	skb = nlmsg_new((2 * sizeof(u32)) + sizeof(struct taskstats));
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	void *reply;
+	int rc;
+
+	reply = genlmsg_data(genlhdr);
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event == TASKSTATS_MSG_MULTICAST)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+			   struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(stats, tsk);
+	put_task_struct(tsk);
+
+	return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+			    struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(stats, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats stats;
+	void *reply;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return rc;
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+		rc = fill_pid((pid_t)pid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+		rc = fill_tgid((pid_t)tgid, NULL, &stats);
+		if (rc < 0)
+			return rc;
+
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+	} else {
+		return -EINVAL;
+	}
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+	return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	void *reply;
+	struct taskstats stats;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	rc = fill_pid(tsk->pid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+	if (rc || thread_group_empty(tsk))
+		return;
+
+	/* Send tgid data too */
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+	if (rc < 0)
+		return;
+
+	rc = fill_tgid(tsk->tgid, tsk, &stats);
+	if (rc < 0)
+		return;
+
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+	genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+	.cmd            = TASKSTATS_CMD_GET,
+	.doit           = taskstats_send_stats,
+	.policy		= taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+        family_registered = 1;
+
+	if (genl_register_ops(&family, &taskstats_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-23 15:41           ` Balbir Singh
@ 2006-03-24 14:04             ` jamal
  2006-03-24 14:54               ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-03-24 14:04 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:

> > Should there be at least either a pid or tgid? If yes, you need to
> > validate here...
> >
> 
> Yes, you are correct. One of my test cases caught it too.. But I did
> not want to untidy the code with if-else's which will keep growing if
> the attributes change in the future. I just followed the controller
> example. I will change it and validate it. Currently if the attribute
> is not valid, a stat of all zero's is returned back.
>  

There are many ways to skin this cat.
As an example: you could make pid and tgid global to the function and
set them to 0. At the end of the if statements, you could check if at
least one of them is set - if not you know none was passed and bail out.

> > As a general comment double check your logic for errors; if you already
> > have stashed something in the skb, you need to remove it etc.
> > 
> 
> Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> 

it would for attribute setting - but not for others. As a general
comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
could result in errors.


> > A single message with PID+TGID sounds reasonable. Why two messages with
> > two stats? all you will need to do is get rid of the prepare_reply()
> > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> > 
> 
> The reason for two stats is that for TGID, we return accumulated values
> (of all threads in the group) and for PID we return the value just
> for that pid. The return value is
> 

Ok, I understand the dilemma now - but still not thrilled with having
two messages.
Perhaps you could have nesting of TLVs? This is widely used in the net
code for example
i.e:

TLV = TASKSTATS_TYPE_TGID/PID
    TLV = TASKSTATS_TYPE_STATS

Look at using nla_nest_start/end/cancel

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-24  1:32           ` Balbir Singh
@ 2006-03-24 14:11             ` jamal
  2006-03-24 14:19               ` jamal
  2006-03-24 14:59               ` Balbir Singh
  0 siblings, 2 replies; 56+ messages in thread
From: jamal @ 2006-03-24 14:11 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:

> 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)

Not the right size; the u32 covers the V part of TLV. The T = 16 bits
and L = 16 bits. And if you nest TLVs, then it gets more interesting.

Look at using proper macros instead of hard coding like you did.
grep for something like RTA_SPACE and perhaps send a patch to make it
generic for netlink.h

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-24 14:11             ` jamal
@ 2006-03-24 14:19               ` jamal
  2006-03-24 14:59               ` Balbir Singh
  1 sibling, 0 replies; 56+ messages in thread
From: jamal @ 2006-03-24 14:19 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, 2006-24-03 at 09:11 -0500, jamal wrote:

> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
> 

actually Thomas already has this in netlink.h
Look at using things like:
nla_attr_size()

make sure padding is taken care of etc (read: use the right macros).

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-24 14:04             ` jamal
@ 2006-03-24 14:54               ` Balbir Singh
  2006-03-25  1:19                 ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-24 14:54 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, Mar 24, 2006 at 09:04:21AM -0500, jamal wrote:
> On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> > > Should there be at least either a pid or tgid? If yes, you need to
> > > validate here...
> > >
> > 
> > Yes, you are correct. One of my test cases caught it too.. But I did
> > not want to untidy the code with if-else's which will keep growing if
> > the attributes change in the future. I just followed the controller
> > example. I will change it and validate it. Currently if the attribute
> > is not valid, a stat of all zero's is returned back.
> >  
> 
> There are many ways to skin this cat.
> As an example: you could make pid and tgid global to the function and
> set them to 0. At the end of the if statements, you could check if at
> least one of them is set - if not you know none was passed and bail out.

The latest patch does fix it this issue. In the Changelog
6. taskstats_send_stats() now validates the command attributes and ensures
  that it either gets a PID or a TGID. If it gets both simultaneously
  the PID stats are sent.

Is this change ok with you?

> 
> > > As a general comment double check your logic for errors; if you already
> > > have stashed something in the skb, you need to remove it etc.
> > > 
> > 
> > Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> > 
> 
> it would for attribute setting - but not for others. As a general
> comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
> could result in errors.

:-) I understand.
What I have done is moved all the NLA_PUT_U32 to after verifying the
return values of functions fill_*(). That way we do not stash anything into the
skb if there are pending errors.

> 
> 
> > > A single message with PID+TGID sounds reasonable. Why two messages with
> > > two stats? all you will need to do is get rid of the prepare_reply()
> > > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> > > 
> > 
> > The reason for two stats is that for TGID, we return accumulated values
> > (of all threads in the group) and for PID we return the value just
> > for that pid. The return value is
> > 
> 
> Ok, I understand the dilemma now - but still not thrilled with having
> two messages.
> Perhaps you could have nesting of TLVs? This is widely used in the net
> code for example
> i.e:
> 
> TLV = TASKSTATS_TYPE_TGID/PID
>     TLV = TASKSTATS_TYPE_STATS
> 
> Look at using nla_nest_start/end/cancel

Hmm... Would it be ok to send one message with the following format

1. TLV=TASKSTATS_TYPE_PID
2. TLV=TASKSTATS_TYPE_STATS
3. TLV=TASKSTATS_TYPE_TGID
4. TLV=TASKSTATS_TYPE_STATS

It would still be one message, except that 3 and 4 would be optional.
What do you think?

> 
> cheers,
> jamal

Thanks for your comments,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-24 14:11             ` jamal
  2006-03-24 14:19               ` jamal
@ 2006-03-24 14:59               ` Balbir Singh
  1 sibling, 0 replies; 56+ messages in thread
From: Balbir Singh @ 2006-03-24 14:59 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, Mar 24, 2006 at 09:11:58AM -0500, jamal wrote:
> On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 
> > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
> 
> Not the right size; the u32 covers the V part of TLV. The T = 16 bits
> and L = 16 bits. And if you nest TLVs, then it gets more interesting.
> 
> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
> 
> cheers,
> jamal
>

My bad,  but I was wondering why my test case did not segfault until
I saw the implementation of nlmsg_new :-)

I will fix this and use nla_total_size() to calculate the correct sizes
including padding and TLV.

Thanks again,
Balbir

 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-24 14:54               ` Balbir Singh
@ 2006-03-25  1:19                 ` jamal
  2006-03-25  9:41                   ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-03-25  1:19 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:

> Hmm... Would it be ok to send one message with the following format
> 
> 1. TLV=TASKSTATS_TYPE_PID
> 2. TLV=TASKSTATS_TYPE_STATS
> 3. TLV=TASKSTATS_TYPE_TGID
> 4. TLV=TASKSTATS_TYPE_STATS
> 
> It would still be one message, except that 3 and 4 would be optional.
> What do you think?
> 

No, that wont work since #2 and #4 are basically the same TLV. [Recall
that "T" is used to index an array]. Your other alternative is to have
#4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
although that would smell a little.
Dont be afraid to do the nest, it will be a little painful initially but
i am sure once you figure it out you will appreciate it.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Performance
  2006-03-23 15:16   ` [Patch 0/9] Performance Shailabh Nagar
@ 2006-03-25  2:38     ` Greg KH
  2006-03-27 18:28       ` Shailabh Nagar
  0 siblings, 1 reply; 56+ messages in thread
From: Greg KH @ 2006-03-25  2:38 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: linux-kernel, Nick Piggin

On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote:
> Greg KH wrote:
> > On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
> >
> >>This is the next iteration of the delay accounting patches
> >>last posted at
> >>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
> >
> >
> > Do you have any benchmark numbers with this patch applied and with it
> > not applied?  Last I heard it was a measurable decrease for some
> > "important" benchmark results...
> >
> > thanks,
> >
> > greg k-h
> 
> Here are some numbers for the latest set of posted patches
> using microbenchmarks hackbench, kernbench and lmbench.
> 
> I was trying to get the real/big benchmark numbers too but
> it looks like getting a run whose numbers can be trusted
> will take a bit longer than expected. Preliminary runs of
> transaction processing benchmarks indicate that overhead
> actually decreases with the patch (as also seen in some of
> the lmbench numbers below).

That's good to hear.

But your .5% is noticable on the +patch results, which I don't think
people who take performance issues seriously will like (that's real
money for the big vendors.)  And distros will be forced to enable that
option in their kernels, so those vendors will have to get that
percentage back some other way...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25  1:19                 ` jamal
@ 2006-03-25  9:41                   ` Balbir Singh
  2006-03-25 12:52                     ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-25  9:41 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Fri, Mar 24, 2006 at 08:19:25PM -0500, jamal wrote:
> On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:
> 
> > Hmm... Would it be ok to send one message with the following format
> > 
> > 1. TLV=TASKSTATS_TYPE_PID
> > 2. TLV=TASKSTATS_TYPE_STATS
> > 3. TLV=TASKSTATS_TYPE_TGID
> > 4. TLV=TASKSTATS_TYPE_STATS
> > 
> > It would still be one message, except that 3 and 4 would be optional.
> > What do you think?
> > 
> 
> No, that wont work since #2 and #4 are basically the same TLV. [Recall
> that "T" is used to index an array]. Your other alternative is to have
> #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
> although that would smell a little.
> Dont be afraid to do the nest, it will be a little painful initially but
> i am sure once you figure it out you will appreciate it.
>

Thanks for the advice, I will dive into nesting. I could not find any 
in tree users who use nesting, so I have a few questions

nla_nest_start() accepts two parameters an skb and an attribute type.
Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
contain the nested attributes 

TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_PID/TGID
   TASKSTATS_TYPE_STATS

but this will lead to


TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_PID
   TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_AGGR
   TASKSTATS_TYPE_TGID
   TASKSTATS_TYPE_STATS

being returned from taskstats_exit_pid().

The other option is to nest

TASKSTATS_TYPE_PID/TGID
   TASKSTATS_TYPE_STATS

but the problem with this approach is, nla_len contains the length of
all attributes including the nested attribute. So it is hard to find
the offset of TASKSTATS_TYPE_STATS in the buffer.

Do I understand NLA nesting at all? May be I am missing something obvious.

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25  9:41                   ` Balbir Singh
@ 2006-03-25 12:52                     ` jamal
  2006-03-25 15:36                       ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-03-25 12:52 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:

> 
> Thanks for the advice, I will dive into nesting. I could not find any 
> in tree users who use nesting, so I have a few questions
> 

Hrm - I have to say i am suprised theres nothing; i could have sworn
Thomas had done some conversions already.

> nla_nest_start() accepts two parameters an skb and an attribute type.
> Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> contain the nested attributes 
> 
> TASKSTATS_TYPE_AGGR
>    TASKSTATS_TYPE_PID/TGID
>    TASKSTATS_TYPE_STATS
> 
>
> but this will lead to
> 
> TASKSTATS_TYPE_AGGR
>    TASKSTATS_TYPE_PID
>    TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_AGGR
>    TASKSTATS_TYPE_TGID
>    TASKSTATS_TYPE_STATS
> 
> being returned from taskstats_exit_pid().
> 

no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
Again invoke the rule i cited earlier.
What you could do instead is a second AGGR; and your nesting would be:

TASKSTATS_TYPE_AGGR1 <--- nest start with this type
   TASKSTATS_TYPE_PID <-- NLA_U32_PUT
   TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
                     <-- nest end of TASKSTATS_TYPE_AGGR1
TASKSTATS_TYPE_AGGR2 <--- nest start with this type
   TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
   TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
                       <-- nest end of TASKSTATS_TYPE_AGGR2

> The other option is to nest
> 
> TASKSTATS_TYPE_PID/TGID
>    TASKSTATS_TYPE_STATS
> 

The advantage being you dont introduce another T.

> but the problem with this approach is, nla_len contains the length of
> all attributes including the nested attribute. So it is hard to find
> the offset of TASKSTATS_TYPE_STATS in the buffer.
> 

So you would distinguish the two as have something like:

TASKSTATS_TYPE_PID
   u32 pid
   TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_TGID
   u32 tgid
   TASKSTATS_TYPE_STATS
or
TASKSTATS_TYPE_PID
   u32 pid
TASKSTATS_TYPE_TGID
   u32 tgid

both should be fine. The difference between the two is the length in the
second case will be 4 and in the other case will be larger. 

But come to think of it, this will introduce unneeded semantics; you
have very few items to do, so forget it. Go with scheme #1 but change
the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.

cheers,
jamal



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25 12:52                     ` jamal
@ 2006-03-25 15:36                       ` Balbir Singh
  2006-03-25 17:48                         ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-25 15:36 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:
> 
> > 
> > Thanks for the advice, I will dive into nesting. I could not find any 
> > in tree users who use nesting, so I have a few questions
> > 
> 
> Hrm - I have to say i am suprised theres nothing; i could have sworn
> Thomas had done some conversions already.

I used cscope to check for global references and callers of nla_nest_.*.
I could not find anything.

> 
> > nla_nest_start() accepts two parameters an skb and an attribute type.
> > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> > contain the nested attributes 
> > 
> > TASKSTATS_TYPE_AGGR
> >    TASKSTATS_TYPE_PID/TGID
> >    TASKSTATS_TYPE_STATS
> > 
> >
> > but this will lead to
> > 
> > TASKSTATS_TYPE_AGGR
> >    TASKSTATS_TYPE_PID
> >    TASKSTATS_TYPE_STATS
> > TASKSTATS_TYPE_AGGR
> >    TASKSTATS_TYPE_TGID
> >    TASKSTATS_TYPE_STATS
> > 
> > being returned from taskstats_exit_pid().
> > 
> 
> no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
> Again invoke the rule i cited earlier.

Yes, thats why I wanted to point it out to you. Thanks for explaining the
rule.

> What you could do instead is a second AGGR; and your nesting would be:
> 
> TASKSTATS_TYPE_AGGR1 <--- nest start with this type
>    TASKSTATS_TYPE_PID <-- NLA_U32_PUT
>    TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
>                      <-- nest end of TASKSTATS_TYPE_AGGR1
> TASKSTATS_TYPE_AGGR2 <--- nest start with this type
>    TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
>    TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
>                        <-- nest end of TASKSTATS_TYPE_AGGR2
> 
> > The other option is to nest
> > 
> > TASKSTATS_TYPE_PID/TGID
> >    TASKSTATS_TYPE_STATS
> > 
> 
> The advantage being you dont introduce another T.
> 
> > but the problem with this approach is, nla_len contains the length of
> > all attributes including the nested attribute. So it is hard to find
> > the offset of TASKSTATS_TYPE_STATS in the buffer.
> > 
> 
> So you would distinguish the two as have something like:
> 
> TASKSTATS_TYPE_PID
>    u32 pid
>    TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_TGID
>    u32 tgid
>    TASKSTATS_TYPE_STATS
> or
> TASKSTATS_TYPE_PID
>    u32 pid
> TASKSTATS_TYPE_TGID
>    u32 tgid
> 
> both should be fine. The difference between the two is the length in the
> second case will be 4 and in the other case will be larger. 
> 
> But come to think of it, this will introduce unneeded semantics; you
> have very few items to do, so forget it. Go with scheme #1 but change
> the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.
>

I prefer #1 as well. The overloaded use of the same type with different lengths
can be confusing.
 
> cheers,
> jamal
> 

Here is another attempt (one more iteration) at trying to get it right.
Thank you for your patience and help in getting it right.

Changelog
---------

As discussed in our email.

Thanks,
Balbir

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

---

 include/linux/delayacct.h |   11 +
 include/linux/taskstats.h |  113 +++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   44 ++++++
 kernel/taskstats.c        |  291 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 473 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h	2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct 
 extern void __delayacct_blkio_start(void);
 extern void __delayacct_blkio_end(void);
 extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
 	return 0;
 }
 #endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+				    struct task_struct *tsk)
+{
+	if (!tsk->delays)
+		return -EINVAL;
+	return __delayacct_add_tsk(d, tsk);
+}
+#endif
 #endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h	2006-03-25 20:56:55.000000000 +0530
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+#define TASKSTATS_NOPID	-1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+	/* Version 1 */
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_CMD_GET,		/* user->kernel request */
+	TASKSTATS_CMD_NEW,		/* kernel->user event */
+	__TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+	TASKSTATS_TYPE_UNSPEC = 0,	/* Reserved */
+	TASKSTATS_TYPE_PID,		/* Process id */
+	TASKSTATS_TYPE_TGID,		/* Thread group id */
+	TASKSTATS_TYPE_STATS,		/* taskstats structure */
+	TASKSTATS_TYPE_AGGR_PID,	/* contains pid + stats */
+	TASKSTATS_TYPE_AGGR_TGID,	/* contains tgid + stats */
+	__TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+	TASKSTATS_CMD_ATTR_UNSPEC = 0,
+	TASKSTATS_CMD_ATTR_PID,
+	TASKSTATS_CMD_ATTR_TGID,
+	__TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+	TASKSTATS_MSG_UNICAST,		/* send data only to requester */
+	TASKSTATS_MSG_MULTICAST,	/* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig	2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c	2006-03-25 20:56:55.000000000 +0530
@@ -1,6 +1,7 @@
 /* delayacct.c - per-task delay accounting
  *
  * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
 
 static int __init delayacct_setup_enable(char *str)
 {
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
 		tsk->delays = NULL;
 	}
+	mutex_unlock(&delayacct_exit_mutex);
 }
 
 /*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
 	spin_unlock(&tsk->delays->lock);
 	return ret;
 }
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+	return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink	2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile	2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c	2006-03-25 20:57:19.000000000 +0530
@@ -0,0 +1,291 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.maxattr        = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+	[TASKSTATS_CMD_ATTR_PID]  = { .type = NLA_U32 },
+	[TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+			 void **replyp, size_t size)
+{
+	struct sk_buff *skb;
+	void *reply;
+
+	/*
+	 * If new attributes are added, please revisit this allocation
+	 */
+	skb = nlmsg_new(size);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, 0,
+				    cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	void *reply;
+	int rc;
+
+	reply = genlmsg_data(genlhdr);
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event == TASKSTATS_MSG_MULTICAST)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+			   struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(stats, tsk);
+	put_task_struct(tsk);
+
+	return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+			    struct taskstats *stats)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			return -ESRCH;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(stats, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc = 0;
+	struct sk_buff *rep_skb;
+	struct taskstats stats;
+	void *reply;
+	size_t size;
+	struct nlattr *na;
+
+	/*
+	 * Size includes space for nested attribute as well
+	 * The returned data is of the format
+	 * TASKSTATS_TYPE_AGGR_PID/TGID
+	 * --> TASKSTATS_TYPE_PID/TGID
+	 * --> TASKSTATS_TYPE_STATS
+	 */
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+	if (rc < 0)
+		return rc;
+
+	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+		rc = fill_pid((pid_t)pid, NULL, &stats);
+		if (rc < 0)
+			goto err;
+
+		na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+		u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+		rc = fill_tgid((pid_t)tgid, NULL, &stats);
+		if (rc < 0)
+			goto err;
+
+		na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+	} else {
+		rc = -EINVAL;
+		goto err;
+	}
+
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+	return  genlmsg_cancel(rep_skb, reply);
+err:
+	nlmsg_free(rep_skb);
+	return rc;
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc = 0;
+	struct sk_buff *rep_skb;
+	void *reply;
+	struct taskstats stats;
+	size_t size;
+	int is_thread_group = !thread_group_empty(tsk);
+	struct nlattr *na;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	/*
+	 * Size includes space for nested attributes
+	 */
+	size = nla_total_size(sizeof(u32)) +
+		nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+	if (is_thread_group)
+		size = 2 * size;	// PID + STATS + TGID + STATS
+
+	memset(&stats, 0, sizeof(stats));
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+	if (rc < 0)
+		return;
+
+	rc = fill_pid(tsk->pid, tsk, &stats);
+	if (rc < 0)
+		goto err;
+
+	na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	if (!is_thread_group) {
+		send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+		return;
+	}
+
+	memset(&stats, 0, sizeof(stats));
+	rc = fill_tgid(tsk->tgid, tsk, &stats);
+	if (rc < 0)
+		goto err;
+
+	na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+	nla_nest_end(rep_skb, na);
+
+	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+	genlmsg_cancel(rep_skb, reply);
+	return;
+err:
+	nlmsg_free(rep_skb);
+}
+
+static struct genl_ops taskstats_ops = {
+	.cmd            = TASKSTATS_CMD_GET,
+	.doit           = taskstats_send_stats,
+	.policy		= taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+        family_registered = 1;
+
+	if (genl_register_ops(&family, &taskstats_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25 15:36                       ` Balbir Singh
@ 2006-03-25 17:48                         ` jamal
  2006-03-25 18:22                           ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-03-25 17:48 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:


I didnt pay attention to failure paths etc; i suppose your testing
should catch those. Getting there, a couple more comments:


> +enum {
> +	TASKSTATS_CMD_UNSPEC = 0,	/* Reserved */
> +	TASKSTATS_CMD_GET,		/* user->kernel request */
> +	TASKSTATS_CMD_NEW,		/* kernel->user event */

Should the comment read "kernel->user event/get-response"


> +
> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{


> +
> +	if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> +		u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> +		rc = fill_pid((pid_t)pid, NULL, &stats);
> +		if (rc < 0)
> +			goto err;
> +
> +		na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> +		NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> +	} else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {

in regards to the elseif above:
Could you not have both PID and TGID passed? From my earlier
understanding it seemed legit, no? if answer is yes, then you will have
to do your sizes + reply TLVs at the end.

Also in regards to the nesting, isnt there a need for nla_nest_cancel in
case of failures to add TLVs?

cheers,
jamal



^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25 17:48                         ` jamal
@ 2006-03-25 18:22                           ` Balbir Singh
  2006-03-26 14:05                             ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-25 18:22 UTC (permalink / raw)
  To: hadi; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On 3/25/06, jamal <hadi@cyberus.ca> wrote:
> On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
>
>
> I didnt pay attention to failure paths etc; i suppose your testing
> should catch those. Getting there, a couple more comments:
>

Yes, I have tried several negative test cases.

>
> > +enum {
> > +     TASKSTATS_CMD_UNSPEC = 0,       /* Reserved */
> > +     TASKSTATS_CMD_GET,              /* user->kernel request */
> > +     TASKSTATS_CMD_NEW,              /* kernel->user event */
>
> Should the comment read "kernel->user event/get-response"
>

Yes, good catch. I will update the comment.

>
> > +
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
>
>
> > +
> > +     if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > +             u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > +             rc = fill_pid((pid_t)pid, NULL, &stats);
> > +             if (rc < 0)
> > +                     goto err;
> > +
> > +             na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> > +             NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > +     } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
>
> in regards to the elseif above:
> Could you not have both PID and TGID passed? From my earlier
> understanding it seemed legit, no? if answer is yes, then you will have
> to do your sizes + reply TLVs at the end.

No, we cannot have both passed. If we pass both a PID and a TGID and
then the code returns just the stats for the PID.

>
> Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> case of failures to add TLVs?
>

I thought about it, but when I looked at the code of genlmsg_cancel()
and nla_nest_cancel().  It seemed that genlmsg_cancel() should
suffice.

<snippet>
static inline int genlmsg_cancel(struct sk_buff *skb, void *hdr)
{
        return nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN);
}

static inline int nlmsg_cancel(struct sk_buff *skb, struct nlmsghdr *nlh)
{
        skb_trim(skb, (unsigned char *) nlh - skb->data);

        return -1;
}

static inline int nla_nest_cancel(struct sk_buff *skb, struct nlattr *start)
{
        if (start)
                skb_trim(skb, (unsigned char *) start - skb->data);

        return -1;
}

</snippet>

genlmsg_cancel() seemed more generic, since it handles skb_trim from
the nlmsghdr down to skb->data, where as nla_test_cancel() does it
only from the start of the nested attributes to skb->data.

Is my understanding correct?


> cheers,
> jamal
>

Thanks,
Balbir

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-25 18:22                           ` Balbir Singh
@ 2006-03-26 14:05                             ` jamal
  2006-03-26 16:40                               ` Balbir Singh
  0 siblings, 1 reply; 56+ messages in thread
From: jamal @ 2006-03-26 14:05 UTC (permalink / raw)
  To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:


> No, we cannot have both passed. If we pass both a PID and a TGID and
> then the code returns just the stats for the PID.
> 

ok, that clears it then; i think you are ready to go.

> >
> > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > case of failures to add TLVs?
> >
> 
> I thought about it, but when I looked at the code of genlmsg_cancel()
> and nla_nest_cancel().  It seemed that genlmsg_cancel() should
> suffice.
> 

If your policy is to never send a message if anything fails, then you
are fine.

What would be really useful now that you understand this, is if you can
help extending/cleaning the document i sent you. Or send me a table of
contents of how it would have flowed better for you.

cheers,
jamal




^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
  2006-03-26 14:05                             ` jamal
@ 2006-03-26 16:40                               ` Balbir Singh
  0 siblings, 0 replies; 56+ messages in thread
From: Balbir Singh @ 2006-03-26 16:40 UTC (permalink / raw)
  To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev

On Sun, Mar 26, 2006 at 09:05:18AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:
> 
> 
> > No, we cannot have both passed. If we pass both a PID and a TGID and
> > then the code returns just the stats for the PID.
> > 
> 
> ok, that clears it then; i think you are ready to go.

Cool! Thanks for all your help and review.

> 
> > >
> > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > > case of failures to add TLVs?
> > >
> > 
> > I thought about it, but when I looked at the code of genlmsg_cancel()
> > and nla_nest_cancel().  It seemed that genlmsg_cancel() should
> > suffice.
> > 
> 
> If your policy is to never send a message if anything fails, then you
> are fine.
> 
> What would be really useful now that you understand this, is if you can
> help extending/cleaning the document i sent you. Or send me a table of
> contents of how it would have flowed better for you.

I will start looking at the document and see what changes can be made.
I will try and update the document from a genetlink programmers perspective
i.e. things to know, avoid, etc.

> 
> cheers,
> jamal
> 
>

Thanks,
Balbir 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 8/9] generic netlink utility functions
  2006-03-14  0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
@ 2006-03-26 16:44   ` Balbir Singh
  2006-03-26 17:06     ` jamal
  0 siblings, 1 reply; 56+ messages in thread
From: Balbir Singh @ 2006-03-26 16:44 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: linux-kernel, netdev, Jamal

On Mon, Mar 13, 2006 at 07:55:05PM -0500, Shailabh Nagar wrote:
> genetlink-utils.patch
> 
> Two utilities for simplifying usage of NETLINK_GENERIC
> interface.
> 
> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
> 
>  include/net/genetlink.h |   20 ++++++++++++++++++++
>  1 files changed, 20 insertions(+)
> 
> Index: linux-2.6.16-rc5/include/net/genetlink.h
> ===================================================================
> --- linux-2.6.16-rc5.orig/include/net/genetlink.h	2006-03-11 07:41:32.000000000 -0500
> +++ linux-2.6.16-rc5/include/net/genetlink.h	2006-03-11 07:41:41.000000000 -0500
> @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
>  	return nlmsg_unicast(genl_sock, skb, pid);
>  }
>  
> +/**
> + * gennlmsg_data - head of message payload
> + * @gnlh: genetlink messsage header
> + */
> +static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
> +{
> +       return ((unsigned char *) gnlh + GENL_HDRLEN);
> +}
> +
> +/**
> + * genlmsg_len - length of message payload
> + * @gnlh: genetlink message header
> + */
> +static inline int genlmsg_len(const struct genlmsghdr *gnlh)
> +{
> +       struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
> +                                                   NLMSG_HDRLEN);
> +       return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
> +}
> +
>  #endif	/* __NET_GENERIC_NETLINK_H */
> 
>

Jamal,

Does the implementation of these utilities look ok? We use genlmsg_data()
in the delay accounting code but not genlmsg_len(), hence it might not
be very well tested (just reviewed).

Thanks,
Balbir 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 8/9] generic netlink utility functions
  2006-03-26 16:44   ` Balbir Singh
@ 2006-03-26 17:06     ` jamal
  0 siblings, 0 replies; 56+ messages in thread
From: jamal @ 2006-03-26 17:06 UTC (permalink / raw)
  To: balbir; +Cc: Shailabh Nagar, linux-kernel, netdev

On Sun, 2006-26-03 at 22:14 +0530, Balbir Singh wrote:

> Jamal,
> 
> Does the implementation of these utilities look ok? We use genlmsg_data()
> in the delay accounting code but not genlmsg_len(), hence it might not
> be very well tested (just reviewed).
> 

They look fine to me - please resubmit and CC Thomas Graf in case he
sees it different.

cheers,
jamal


^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [Patch 0/9] Performance
  2006-03-25  2:38     ` Greg KH
@ 2006-03-27 18:28       ` Shailabh Nagar
  0 siblings, 0 replies; 56+ messages in thread
From: Shailabh Nagar @ 2006-03-27 18:28 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-kernel, Nick Piggin

Greg KH wrote:

>On Thu, Mar 23, 2006 at 10:16:41AM -0500, Shailabh Nagar wrote:
>  
>
>>Greg KH wrote:
>>    
>>
>>>On Mon, Mar 13, 2006 at 07:40:34PM -0500, Shailabh Nagar wrote:
>>>
>>>      
>>>
>>>>This is the next iteration of the delay accounting patches
>>>>last posted at
>>>>	http://www.ussg.iu.edu/hypermail/linux/kernel/0602.3/0893.html
>>>>        
>>>>
>>>Do you have any benchmark numbers with this patch applied and with it
>>>not applied?  Last I heard it was a measurable decrease for some
>>>"important" benchmark results...
>>>
>>>thanks,
>>>
>>>greg k-h
>>>      
>>>
>>Here are some numbers for the latest set of posted patches
>>using microbenchmarks hackbench, kernbench and lmbench.
>>
>>I was trying to get the real/big benchmark numbers too but
>>it looks like getting a run whose numbers can be trusted
>>will take a bit longer than expected. Preliminary runs of
>>transaction processing benchmarks indicate that overhead
>>actually decreases with the patch (as also seen in some of
>>the lmbench numbers below).
>>    
>>
>
>That's good to hear.
>
>But your .5% is noticable on the +patch results, which I don't think
>people who take performance issues seriously will like (that's real
>money for the big vendors.)  And distros will be forced to enable that
>option in their kernels, so those vendors will have to get that
>percentage back some other way...
>  
>

Sorry, missed your response.

Yes, even the slight deterioration might be an issue for distros. We 
discovered one memcpy,
lack of use of "__read_mostly" and another "unlikely" that might help 
with the 0.5% but other than
that don't see any major way of reducing overhead further for the +patch 
case.

I'll be posting another iteration of the patches with these changes and 
corresponding results
(as well as the changes for the netlink interface which has been 
stabilized after incorporating Jamal's
comments). Lets see what that does.

Thanks,
Shailabh

>thanks,
>
>greg k-h
>  
>


^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2006-03-27 18:28 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-14  0:40 [Patch 0/9] Per-task delay accounting Shailabh Nagar
2006-03-14  0:42 ` [Patch 1/9] timestamp diff Shailabh Nagar
2006-03-14  1:01   ` Lee Revell
2006-03-14  1:05     ` Shailabh Nagar
2006-03-14  1:12       ` Lee Revell
2006-03-14  3:42         ` Balbir Singh
2006-03-14  4:26           ` Shailabh Nagar
2006-03-14  6:50             ` Balbir Singh
2006-03-15 10:23   ` Arjan van de Ven
2006-03-15 10:28     ` Balbir Singh
2006-03-14  0:45 ` Patch 2/9] Initialization Shailabh Nagar
2006-03-14 10:54   ` Jes Sorensen
2006-03-14 15:20     ` Shailabh Nagar
2006-03-15 10:24   ` Arjan van de Ven
2006-03-15 12:37     ` Alan Cox
2006-03-15 15:53       ` Shailabh Nagar
2006-03-14  0:47 ` [Patch 3/9] Block I/O accounting initialization Shailabh Nagar
2006-03-15 10:27   ` Arjan van de Ven
2006-03-15 16:27     ` Shailabh Nagar
2006-03-14  0:48 ` [Patch 4/9] Block I/O accounting collection Shailabh Nagar
2006-03-14  0:51 ` [Patch 5/9] Swapin delays Shailabh Nagar
2006-03-14  0:53 ` [Patch 7/9] /proc interface for all I/O delays Shailabh Nagar
2006-03-14  0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
2006-03-26 16:44   ` Balbir Singh
2006-03-26 17:06     ` jamal
2006-03-14  0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
2006-03-14  2:29   ` jamal
2006-03-14  2:33   ` Matt Helsley
2006-03-14  2:48     ` jamal
2006-03-14  4:18       ` Shailabh Nagar
2006-03-22  7:49       ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
2006-03-23 14:04         ` jamal
2006-03-23 15:41           ` Balbir Singh
2006-03-24 14:04             ` jamal
2006-03-24 14:54               ` Balbir Singh
2006-03-25  1:19                 ` jamal
2006-03-25  9:41                   ` Balbir Singh
2006-03-25 12:52                     ` jamal
2006-03-25 15:36                       ` Balbir Singh
2006-03-25 17:48                         ` jamal
2006-03-25 18:22                           ` Balbir Singh
2006-03-26 14:05                             ` jamal
2006-03-26 16:40                               ` Balbir Singh
2006-03-24  1:32           ` Balbir Singh
2006-03-24 14:11             ` jamal
2006-03-24 14:19               ` jamal
2006-03-24 14:59               ` Balbir Singh
2006-03-14  4:29     ` Shailabh Nagar
2006-03-14  1:01 ` [Patch 6/9] cpu delay collection Shailabh Nagar
2006-03-14 19:28 ` [Patch 0/9] Per-task delay accounting Greg KH
2006-03-14 20:49   ` Shailabh Nagar
2006-03-14 21:24     ` Greg KH
2006-03-14 21:59       ` Shailabh Nagar
2006-03-23 15:16   ` [Patch 0/9] Performance Shailabh Nagar
2006-03-25  2:38     ` Greg KH
2006-03-27 18:28       ` Shailabh Nagar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox