* [Patch 0/8] per-task delay accounting
@ 2006-04-22 2:16 Shailabh Nagar
2006-04-22 2:23 ` [Patch 1/8] Setup Shailabh Nagar
` (8 more replies)
0 siblings, 9 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:16 UTC (permalink / raw)
To: linux-kernel
Cc: LSE, Jes Sorensen, Peter Chubb, Erich Focht, Levent Serinol,
Jay Lan
Here are the delay accounting patches again. I'm starting a new
thread instead of replying to the earlier one because the code has been
refactored a bit.
The previous posting
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/1776.html
of these patches elicited several review comments from Andrew Morton,
all of which have been addressed.
The other main thread of the comments was whether other accounting
stakeholders would be ok with this interface. Towards this end,
I'd posted an overview of what the other packages do (which didn't seem
to make the archives) and some of the stakeholders responded.
I'll repost the analysis as a reply to this post. Meanwhile, here's
the list of the stakeholders identified by Andrew and a summary of status
of their comments.
1. CSA accounting/PAGG/JOB: Jay Lan <jlan@engr.sgi.com>
Raised several points
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0604.1/0397.html
all of which have been addressed in this set of patches.
2. per-process IO statistics: Levent Serinol <lserinol@gmail.com>
No response.
I'd ascertained that its needs are a subset of CSA.
3. per-cpu time statistics: Erich Focht <efocht@ess.nec.de>
No response.
I'd ascertained that its needs can be met by taskstats
interface whenever these statistics are submitted for inclusion.
4. Microstate accounting: Peter Chubb <peterc@gelato.unsw.edu.au>
Mentioned overlap of patches with delay accounting
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0603.3/2286.html
and also that a /proc interface was preferable due to convenience.
My position is that the netlink interface is a superset of /proc due to
the former's ability to supply exit-time data.
5. ELSA: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Confirmed that ELSA is not a direct user of a new kernel statistics
interface since it is a consumer of CSA or BSD accounting's statistics.
6. pnotify: Jes Sorensen <jes@sgi.com>
(taken over pnotify from Erik Jacobson)
Informed over private email that pnotify replacement is
being worked on.
I'd ascertained that pnotify (or its replacement) will not be
concerned with exporting data to userspace or collecting any stats.
That's left to the kernel module that uses pnotify to get
notifications. CSA is one expected user of pnotify.
Hence CSA's concerns are the only ones relevant to pnotify as well.
7. Scalable statistics counters with /proc reporting:
Ravikiran G Thirumalai, Dipankar Sarma <dipankar@in.ibm.com>
Confirmed these counters aren't relevant to this discussion.
--Shailabh
Series
delayacct-setup.patch
delayacct-blkio-swapin.patch
delayacct-schedstats.patch
genetlink-utils.patch
taskstats-setup.patch
delayacct-taskstats.patch
delayacct-doc.patch
delayacct-procfs.patch
* [Patch 1/8] Setup
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
@ 2006-04-22 2:23 ` Shailabh Nagar
2006-04-24 2:02 ` Randy.Dunlap
2006-04-22 2:29 ` [Patch 2/8] Sync block I/O and swapin delay collection Shailabh Nagar
` (7 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:23 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixes comments by akpm
- unnecessary initialization of delayacct_on
- use kmem_cache_zalloc
- redundant check in __delayacct_tsk_exit
delayacct-setup.patch
Initialization code related to collection of per-task "delay"
statistics, which measure how long a task had to wait for cpu,
sync block io, swapping etc. The collection of statistics and
the interface are in other patches. This patch sets up the data
structures and allows the statistics collection, which is off by
default, to be enabled through a kernel boot parameter.
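As a rough user-space illustration of the setup described above (not the
kernel code itself), the sketch below allocates a per-task statistics block
only when accounting is switched on, and frees it at exit. The names
demo_task, demo_delay_info, demo_delayacct_on, demo_task_init and
demo_task_exit are hypothetical, and calloc() merely stands in for
kmem_cache_zalloc().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct task_delay_info. */
struct demo_delay_info {
	unsigned int flags;
	unsigned long long total_ns;
	unsigned int count;
};

/* Hypothetical stand-in for the relevant part of task_struct. */
struct demo_task {
	struct demo_delay_info *delays;	/* NULL while accounting is off */
};

/* Mirrors delayacct_on: stays 0 unless accounting is enabled at boot. */
int demo_delayacct_on;

/* Called for every new task; allocates only if accounting is enabled. */
void demo_task_init(struct demo_task *tsk)
{
	assert(tsk != NULL);
	/* Reset first: a struct copied from the parent may hold its pointer. */
	tsk->delays = NULL;
	if (demo_delayacct_on)
		tsk->delays = calloc(1, sizeof(*tsk->delays));
}

/* Called at task exit; harmless when nothing was allocated. */
void demo_task_exit(struct demo_task *tsk)
{
	assert(tsk != NULL);
	free(tsk->delays);
	tsk->delays = NULL;
}
```

Because every collection hook can test the pointer before touching the
counters, a task created while accounting is off costs one NULL check per
hook and no extra memory.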
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Documentation/kernel-parameters.txt | 2
include/linux/delayacct.h | 69 ++++++++++++++++++++++++++++
include/linux/sched.h | 21 ++++++++
include/linux/time.h | 10 ++++
init/Kconfig | 13 +++++
init/main.c | 2
kernel/Makefile | 1
kernel/delayacct.c | 87 ++++++++++++++++++++++++++++++++++++
kernel/exit.c | 3 +
kernel/fork.c | 2
10 files changed, 210 insertions(+)
Index: linux-2.6.17-rc1/Documentation/kernel-parameters.txt
===================================================================
--- linux-2.6.17-rc1.orig/Documentation/kernel-parameters.txt 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/Documentation/kernel-parameters.txt 2006-04-14 14:59:21.000000000 -0400
@@ -430,6 +430,8 @@ running once the system is up.
Format: <area>[,<node>]
See also Documentation/networking/decnet.txt.
+ delayacct [KNL] Enable per-task delay accounting
+
devfs= [DEVFS]
See Documentation/filesystems/devfs/boot-options.
Index: linux-2.6.17-rc1/kernel/Makefile
===================================================================
--- linux-2.6.17-rc1.orig/kernel/Makefile 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/kernel/Makefile 2006-04-21 19:39:28.000000000 -0400
@@ -38,6 +38,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_RELAY) += relay.o
+obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux-2.6.17-rc1/include/linux/delayacct.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400
@@ -0,0 +1,69 @@
+/* delayacct.h - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ *
+ */
+
+#ifndef _LINUX_TASKDELAYS_H
+#define _LINUX_TASKDELAYS_H
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASK_DELAY_ACCT
+
+extern int delayacct_on; /* Delay accounting turned on/off */
+extern kmem_cache_t *delayacct_cache;
+extern void delayacct_init(void);
+extern void __delayacct_tsk_init(struct task_struct *);
+extern void __delayacct_tsk_exit(struct task_struct *);
+
+static inline void delayacct_set_flag(int flag)
+{
+ if (current->delays)
+ current->delays->flags |= flag;
+}
+
+static inline void delayacct_clear_flag(int flag)
+{
+ if (current->delays)
+ current->delays->flags &= ~flag;
+}
+
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{
+ /* reinitialize in case parent's non-null pointer was dup'ed*/
+ tsk->delays = NULL;
+ if (unlikely(delayacct_on))
+ __delayacct_tsk_init(tsk);
+}
+
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{
+ if (tsk->delays)
+ __delayacct_tsk_exit(tsk);
+}
+
+#else
+static inline void delayacct_set_flag(int flag)
+{}
+static inline void delayacct_clear_flag(int flag)
+{}
+static inline void delayacct_init(void)
+{}
+static inline void delayacct_tsk_init(struct task_struct *tsk)
+{}
+static inline void delayacct_tsk_exit(struct task_struct *tsk)
+{}
+#endif /* CONFIG_TASK_DELAY_ACCT */
+
+#endif
Index: linux-2.6.17-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 19:39:29.000000000 -0400
@@ -536,6 +536,24 @@ struct sched_info {
extern struct file_operations proc_schedstat_operations;
#endif
+#ifdef CONFIG_TASK_DELAY_ACCT
+struct task_delay_info {
+ spinlock_t lock;
+ unsigned int flags; /* Private per-task flags */
+
+ /* For each stat XXX, add following, aligned appropriately
+ *
+ * struct timespec XXX_start, XXX_end;
+ * u64 XXX_delay;
+ * u32 XXX_count;
+ *
+ * Atomicity of updates to XXX_delay, XXX_count protected by
+ * single lock above (split into XXX_lock if contention is an issue).
+ */
+};
+#endif
+
+
enum idle_type
{
SCHED_IDLE,
@@ -882,6 +900,9 @@ struct task_struct {
atomic_t fs_excl; /* holding fs exclusive resources */
struct rcu_head rcu;
+#ifdef CONFIG_TASK_DELAY_ACCT
+ struct task_delay_info *delays;
+#endif
};
static inline pid_t process_group(struct task_struct *tsk)
Index: linux-2.6.17-rc1/init/Kconfig
===================================================================
--- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 19:39:28.000000000 -0400
@@ -150,6 +150,19 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
+config TASK_DELAY_ACCT
+ bool "Enable per-task delay accounting (EXPERIMENTAL)"
+ help
+ Collect information on time spent by a task waiting for system
+ resources like cpu, synchronous block I/O completion and swapping
+ in pages. Such statistics can help in setting a task's priorities
+ relative to other tasks for cpu, io, rss limits etc.
+
+ Unlike BSD process accounting, this information is available
+ continuously during the lifetime of a task.
+
+ Say N if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
Index: linux-2.6.17-rc1/init/main.c
===================================================================
--- linux-2.6.17-rc1.orig/init/main.c 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/init/main.c 2006-04-21 19:39:28.000000000 -0400
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/delayacct.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -541,6 +542,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
+ delayacct_init();
check_bugs();
Index: linux-2.6.17-rc1/kernel/delayacct.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 19:39:29.000000000 -0400
@@ -0,0 +1,87 @@
+/* delayacct.c - per-task delay accounting
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/time.h>
+#include <linux/sysctl.h>
+#include <linux/delayacct.h>
+
+int delayacct_on __read_mostly; /* Delay accounting turned on/off */
+kmem_cache_t *delayacct_cache;
+
+static int __init delayacct_setup_enable(char *str)
+{
+ delayacct_on = 1;
+ return 1;
+}
+__setup("delayacct", delayacct_setup_enable);
+
+void delayacct_init(void)
+{
+ delayacct_cache = kmem_cache_create("delayacct_cache",
+ sizeof(struct task_delay_info),
+ 0,
+ SLAB_PANIC,
+ NULL, NULL);
+ delayacct_tsk_init(&init_task);
+}
+
+void __delayacct_tsk_init(struct task_struct *tsk)
+{
+ tsk->delays = kmem_cache_zalloc(delayacct_cache, SLAB_KERNEL);
+ if (tsk->delays)
+ spin_lock_init(&tsk->delays->lock);
+}
+
+void __delayacct_tsk_exit(struct task_struct *tsk)
+{
+ kmem_cache_free(delayacct_cache, tsk->delays);
+ tsk->delays = NULL;
+}
+
+/*
+ * Start accounting for a delay statistic using
+ * its starting timestamp (@start)
+ */
+
+static inline void delayacct_start(struct timespec *start)
+{
+ do_posix_clock_monotonic_gettime(start);
+}
+
+/*
+ * Finish delay accounting for a statistic using
+ * its timestamps (@start, @end), accumulator (@total) and @count
+ */
+
+static inline void delayacct_end(struct timespec *start, struct timespec *end,
+ u64 *total, u32 *count)
+{
+ struct timespec ts;
+ s64 ns;
+
+ do_posix_clock_monotonic_gettime(end);
+ timespec_sub(start, end, &ts);
+ ns = timespec_to_ns(&ts);
+ if (ns < 0)
+ return;
+
+ spin_lock(&current->delays->lock);
+ *total += ns;
+ (*count)++;
+ spin_unlock(&current->delays->lock);
+}
+
Index: linux-2.6.17-rc1/kernel/fork.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/fork.c 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/kernel/fork.c 2006-04-14 14:59:21.000000000 -0400
@@ -44,6 +44,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/delayacct.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -989,6 +990,7 @@ static task_t *copy_process(unsigned lon
goto bad_fork_cleanup_put_domain;
p->did_exec = 0;
+ delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
copy_flags(clone_flags, p);
p->pid = pid;
retval = -EFAULT;
Index: linux-2.6.17-rc1/kernel/exit.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/exit.c 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/kernel/exit.c 2006-04-21 19:39:28.000000000 -0400
@@ -34,6 +34,7 @@
#include <linux/mutex.h>
#include <linux/futex.h>
#include <linux/compat.h>
+#include <linux/delayacct.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -893,6 +894,7 @@ fastcall NORET_TYPE void do_exit(long co
preempt_count());
acct_update_integrals(tsk);
+
if (tsk->mm) {
update_hiwater_rss(tsk->mm);
update_hiwater_vm(tsk->mm);
@@ -909,6 +911,7 @@ fastcall NORET_TYPE void do_exit(long co
if (unlikely(tsk->compat_robust_list))
compat_exit_robust_list(tsk);
#endif
+ delayacct_tsk_exit(tsk);
exit_mm(tsk);
exit_sem(tsk);
Index: linux-2.6.17-rc1/include/linux/time.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/time.h 2006-04-13 10:55:54.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/time.h 2006-04-14 14:59:21.000000000 -0400
@@ -68,6 +68,16 @@ extern unsigned long mktime(const unsign
extern void set_normalized_timespec(struct timespec *ts, time_t sec, long nsec);
/*
+ * sub = end - start, in normalized form
+ */
+static inline void timespec_sub(struct timespec *start, struct timespec *end,
+ struct timespec *sub)
+{
+ set_normalized_timespec(sub, end->tv_sec - start->tv_sec,
+ end->tv_nsec - start->tv_nsec);
+}
+
+/*
* Returns true if the timespec is norm, false if denorm:
*/
#define timespec_valid(ts) \
* [Patch 2/8] Sync block I/O and swapin delay collection
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
2006-04-22 2:23 ` [Patch 1/8] Setup Shailabh Nagar
@ 2006-04-22 2:29 ` Shailabh Nagar
2006-04-22 2:33 ` [Patch 3/8] cpu delay collection via schedstats Shailabh Nagar
` (6 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:29 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixes comments by akpm
- avoid creating new per-process flag PF_SWAPIN
delayacct-blkio-swapin.patch
Collect per-task block I/O delay statistics.
Unlike earlier iterations of the delay accounting
patches, delays are now collected only for the actual
I/O waits rather than also trying to cover the delays seen in
I/O submission paths.
Account separately for block I/O delays
incurred as a result of swapin page faults whose
frequency can be affected by the task/process' rss limit.
Hence swapin delays can act as feedback for rss limit changes
independent of I/O priority changes.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
include/linux/delayacct.h | 25 +++++++++++++++++++++++++
include/linux/sched.h | 6 ++++++
kernel/delayacct.c | 19 +++++++++++++++++++
kernel/sched.c | 5 +++++
mm/memory.c | 4 ++++
5 files changed, 59 insertions(+)
Index: linux-2.6.17-rc1/include/linux/delayacct.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/delayacct.h 2006-04-21 22:27:18.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 22:27:19.000000000 -0400
@@ -19,6 +19,13 @@
#include <linux/sched.h>
+/*
+ * Per-task flags relevant to delay accounting
+ * maintained privately to avoid exhausting similar flags in sched.h:PF_*
+ * Used to set current->delays->flags
+ */
+#define DELAYACCT_PF_SWAPIN 0x00000001 /* I am doing a swapin */
+
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -26,6 +33,8 @@ extern kmem_cache_t *delayacct_cache;
extern void delayacct_init(void);
extern void __delayacct_tsk_init(struct task_struct *);
extern void __delayacct_tsk_exit(struct task_struct *);
+extern void __delayacct_blkio_start(void);
+extern void __delayacct_blkio_end(void);
static inline void delayacct_set_flag(int flag)
{
@@ -53,6 +62,18 @@ static inline void delayacct_tsk_exit(st
__delayacct_tsk_exit(tsk);
}
+static inline void delayacct_blkio_start(void)
+{
+ if (current->delays)
+ __delayacct_blkio_start();
+}
+
+static inline void delayacct_blkio_end(void)
+{
+ if (current->delays)
+ __delayacct_blkio_end();
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -64,6 +85,10 @@ static inline void delayacct_tsk_init(st
{}
static inline void delayacct_tsk_exit(struct task_struct *tsk)
{}
+static inline void delayacct_blkio_start(void)
+{}
+static inline void delayacct_blkio_end(void)
+{}
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif
Index: linux-2.6.17-rc1/kernel/delayacct.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/delayacct.c 2006-04-21 22:27:18.000000000 -0400
+++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 22:27:19.000000000 -0400
@@ -85,3 +85,22 @@ static inline void delayacct_end(struct
spin_unlock(&current->delays->lock);
}
+void __delayacct_blkio_start(void)
+{
+ delayacct_start(&current->delays->blkio_start);
+}
+
+void __delayacct_blkio_end(void)
+{
+ if (current->delays->flags & DELAYACCT_PF_SWAPIN)
+ /* Swapin block I/O */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->swapin_delay,
+ &current->delays->swapin_count);
+ else /* Other block I/O */
+ delayacct_end(&current->delays->blkio_start,
+ &current->delays->blkio_end,
+ &current->delays->blkio_delay,
+ &current->delays->blkio_count);
+}
Index: linux-2.6.17-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-21 22:27:18.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 22:27:19.000000000 -0400
@@ -550,6 +550,12 @@ struct task_delay_info {
* Atomicity of updates to XXX_delay, XXX_count protected by
* single lock above (split into XXX_lock if contention is an issue).
*/
+
+ struct timespec blkio_start, blkio_end; /* Shared by blkio, swapin */
+ u64 blkio_delay; /* wait for sync block io completion */
+ u64 swapin_delay; /* wait for swapin block io completion */
+ u32 blkio_count;
+ u32 swapin_count;
};
#endif
Index: linux-2.6.17-rc1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/sched.c 2006-04-21 22:27:18.000000000 -0400
+++ linux-2.6.17-rc1/kernel/sched.c 2006-04-21 22:27:19.000000000 -0400
@@ -50,6 +50,7 @@
#include <linux/times.h>
#include <linux/acct.h>
#include <linux/kprobes.h>
+#include <linux/delayacct.h>
#include <asm/tlb.h>
#include <asm/unistd.h>
@@ -4144,9 +4145,11 @@ void __sched io_schedule(void)
{
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
schedule();
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
}
EXPORT_SYMBOL(io_schedule);
@@ -4156,9 +4159,11 @@ long __sched io_schedule_timeout(long ti
struct runqueue *rq = &per_cpu(runqueues, raw_smp_processor_id());
long ret;
+ delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
ret = schedule_timeout(timeout);
atomic_dec(&rq->nr_iowait);
+ delayacct_blkio_end();
return ret;
}
Index: linux-2.6.17-rc1/mm/memory.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/memory.c 2006-04-21 22:27:18.000000000 -0400
+++ linux-2.6.17-rc1/mm/memory.c 2006-04-21 22:27:19.000000000 -0400
@@ -48,6 +48,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/delayacct.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -1880,6 +1881,7 @@ static int do_swap_page(struct mm_struct
entry = pte_to_swp_entry(orig_pte);
again:
+ delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry);
if (!page) {
swapin_readahead(entry, address, vma);
@@ -1892,6 +1894,7 @@ again:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte)))
ret = VM_FAULT_OOM;
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
goto unlock;
}
@@ -1903,6 +1906,7 @@ again:
mark_page_accessed(page);
lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
if (!PageSwapCache(page)) {
/* Page migration has occured */
unlock_page(page);
* [Patch 3/8] cpu delay collection via schedstats
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
2006-04-22 2:23 ` [Patch 1/8] Setup Shailabh Nagar
2006-04-22 2:29 ` [Patch 2/8] Sync block I/O and swapin delay collection Shailabh Nagar
@ 2006-04-22 2:33 ` Shailabh Nagar
2006-04-22 2:35 ` [Patch 4/8] Utilities for genetlink usage Shailabh Nagar
` (5 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:33 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixes comments by akpm
- comments about locking used in rq_sched_info_arrive/depart
No fix needed/possible
- redundant extern declaration of delayacct_on in sched.h
suggested location (delayacct.h) cannot be used as it includes sched.h
extern declaration moved to where it's needed
- move unlikely declaration inside sched_info_on
Function only returns constants. Cannot be done.
- removal of #if defined in sched_fork (Dave Hansen)
Refactoring suggested does not work if only SCHEDSTATS is configured
delayacct-schedstats.patch
Make the task-related schedstats functions
callable by delay accounting even if schedstats
collection isn't turned on. This removes the
dependency of delay accounting on schedstats.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
include/linux/sched.h | 21 +++++++++++++++---
kernel/sched.c | 56 ++++++++++++++++++++++++++++++++++----------------
2 files changed, 56 insertions(+), 21 deletions(-)
Index: linux-2.6.17-rc1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/sched.h 2006-04-21 20:29:13.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/sched.h 2006-04-21 20:29:15.000000000 -0400
@@ -521,7 +521,7 @@ typedef struct prio_array prio_array_t;
struct backing_dev_info;
struct reclaim_state;
-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info {
/* cumulative counters */
unsigned long cpu_time, /* time spent on the cpu */
@@ -532,9 +532,11 @@ struct sched_info {
unsigned long last_arrival, /* when we last ran on a cpu */
last_queued; /* when we were last queued to run */
};
+#endif /* defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) */
+#ifdef CONFIG_SCHEDSTATS
extern struct file_operations proc_schedstat_operations;
-#endif
+#endif /* CONFIG_SCHEDSTATS */
#ifdef CONFIG_TASK_DELAY_ACCT
struct task_delay_info {
@@ -557,8 +559,19 @@ struct task_delay_info {
u32 blkio_count;
u32 swapin_count;
};
-#endif
+#endif /* CONFIG_TASK_DELAY_ACCT */
+static inline int sched_info_on(void)
+{
+#ifdef CONFIG_SCHEDSTATS
+ return 1;
+#elif defined(CONFIG_TASK_DELAY_ACCT)
+ extern int delayacct_on;
+ return delayacct_on;
+#else
+ return 0;
+#endif
+}
enum idle_type
{
@@ -744,7 +757,7 @@ struct task_struct {
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
struct sched_info sched_info;
#endif
Index: linux-2.6.17-rc1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/sched.c 2006-04-21 20:29:13.000000000 -0400
+++ linux-2.6.17-rc1/kernel/sched.c 2006-04-21 20:29:15.000000000 -0400
@@ -469,9 +469,32 @@ struct file_operations proc_schedstat_op
.release = single_release,
};
+/*
+ * Expects runqueue lock to be held for atomicity of update
+ */
+static inline void rq_sched_info_arrive(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq) {
+ rq->rq_sched_info.run_delay += diff;
+ rq->rq_sched_info.pcnt++;
+ }
+}
+
+/*
+ * Expects runqueue lock to be held for atomicity of update
+ */
+static inline void rq_sched_info_depart(struct runqueue *rq,
+ unsigned long diff)
+{
+ if (rq)
+ rq->rq_sched_info.cpu_time += diff;
+}
# define schedstat_inc(rq, field) do { (rq)->field++; } while (0)
# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0)
#else /* !CONFIG_SCHEDSTATS */
+static inline void rq_sched_info_arrive(struct runqueue *rq, unsigned long diff) {}
+static inline void rq_sched_info_depart(struct runqueue *rq, unsigned long diff) {}
# define schedstat_inc(rq, field) do { } while (0)
# define schedstat_add(rq, field, amt) do { } while (0)
#endif
@@ -491,7 +514,7 @@ static inline runqueue_t *this_rq_lock(v
return rq;
}
-#ifdef CONFIG_SCHEDSTATS
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
/*
* Called when a process is dequeued from the active array and given
* the cpu. We should note that with the exception of interactive
@@ -520,7 +543,6 @@ static inline void sched_info_dequeued(t
static void sched_info_arrive(task_t *t)
{
unsigned long now = jiffies, diff = 0;
- struct runqueue *rq = task_rq(t);
if (t->sched_info.last_queued)
diff = now - t->sched_info.last_queued;
@@ -529,11 +551,7 @@ static void sched_info_arrive(task_t *t)
t->sched_info.last_arrival = now;
t->sched_info.pcnt++;
- if (!rq)
- return;
-
- rq->rq_sched_info.run_delay += diff;
- rq->rq_sched_info.pcnt++;
+ rq_sched_info_arrive(task_rq(t), diff);
}
/*
@@ -553,8 +571,9 @@ static void sched_info_arrive(task_t *t)
*/
static inline void sched_info_queued(task_t *t)
{
- if (!t->sched_info.last_queued)
- t->sched_info.last_queued = jiffies;
+ if (unlikely(sched_info_on()))
+ if (!t->sched_info.last_queued)
+ t->sched_info.last_queued = jiffies;
}
/*
@@ -563,13 +582,10 @@ static inline void sched_info_queued(tas
*/
static inline void sched_info_depart(task_t *t)
{
- struct runqueue *rq = task_rq(t);
unsigned long diff = jiffies - t->sched_info.last_arrival;
t->sched_info.cpu_time += diff;
-
- if (rq)
- rq->rq_sched_info.cpu_time += diff;
+ rq_sched_info_depart(task_rq(t), diff);
}
/*
@@ -577,7 +593,7 @@ static inline void sched_info_depart(tas
* their time slice. (This may also be called when switching to or from
* the idle task.) We are only called when prev != next.
*/
-static inline void sched_info_switch(task_t *prev, task_t *next)
+static inline void __sched_info_switch(task_t *prev, task_t *next)
{
struct runqueue *rq = task_rq(prev);
@@ -592,10 +608,15 @@ static inline void sched_info_switch(tas
if (next != rq->idle)
sched_info_arrive(next);
}
+static inline void sched_info_switch(task_t *prev, task_t *next)
+{
+ if (unlikely(sched_info_on()))
+ __sched_info_switch(prev, next);
+}
#else
#define sched_info_queued(t) do { } while (0)
#define sched_info_switch(t, next) do { } while (0)
-#endif /* CONFIG_SCHEDSTATS */
+#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */
/*
* Adding/removing a task to/from a priority array:
@@ -1351,8 +1372,9 @@ void fastcall sched_fork(task_t *p, int
p->state = TASK_RUNNING;
INIT_LIST_HEAD(&p->run_list);
p->array = NULL;
-#ifdef CONFIG_SCHEDSTATS
- memset(&p->sched_info, 0, sizeof(p->sched_info));
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+ if (unlikely(sched_info_on()))
+ memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
p->oncpu = 0;
* [Patch 4/8] Utilities for genetlink usage
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (2 preceding siblings ...)
2006-04-22 2:33 ` [Patch 3/8] cpu delay collection via schedstats Shailabh Nagar
@ 2006-04-22 2:35 ` Shailabh Nagar
2006-04-22 2:37 ` [Patch 5/8] taskstats interface Shailabh Nagar
` (4 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:35 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan, Jamal, Thomas Graf, netdev
genetlink-utils.patch
Two utilities that simplify use of the NETLINK_GENERIC
interface.
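To show what the two helpers compute, here is a user-space sketch that lays
out a message as netlink header, generic netlink header, then payload, and
recovers the payload pointer and length from the generic netlink header
alone. The demo_* structs and constants are hypothetical mirrors of the
kernel definitions, and demo_build_msg() exists only to create test input.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of struct nlmsghdr (16 bytes). */
struct demo_nlmsghdr {
	uint32_t nlmsg_len;	/* total length, headers included */
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

/* Hypothetical mirror of struct genlmsghdr (4 bytes). */
struct demo_genlmsghdr {
	uint8_t cmd;
	uint8_t version;
	uint16_t reserved;
};

#define DEMO_NLMSG_HDRLEN ((int)sizeof(struct demo_nlmsghdr))
#define DEMO_GENL_HDRLEN  ((int)sizeof(struct demo_genlmsghdr))

/* The payload starts right after the generic netlink header. */
const void *demo_genlmsg_data(const void *gnlh)
{
	return (const unsigned char *)gnlh + DEMO_GENL_HDRLEN;
}

/* Payload length is the total length minus both headers. */
int demo_genlmsg_len(const void *gnlh)
{
	struct demo_nlmsghdr nlh;

	/* The netlink header sits immediately before the genetlink header. */
	memcpy(&nlh, (const unsigned char *)gnlh - DEMO_NLMSG_HDRLEN,
	       sizeof(nlh));
	return (int)nlh.nlmsg_len - DEMO_GENL_HDRLEN - DEMO_NLMSG_HDRLEN;
}

/* Write nlmsghdr | genlmsghdr | payload into buf; return the genl header. */
void *demo_build_msg(unsigned char *buf, const void *payload, size_t n)
{
	struct demo_nlmsghdr nlh = { 0 };
	struct demo_genlmsghdr gnlh = { 0 };

	assert(buf != NULL && payload != NULL);
	nlh.nlmsg_len = (uint32_t)(DEMO_NLMSG_HDRLEN + DEMO_GENL_HDRLEN + n);
	memcpy(buf, &nlh, sizeof(nlh));
	memcpy(buf + DEMO_NLMSG_HDRLEN, &gnlh, sizeof(gnlh));
	memcpy(buf + DEMO_NLMSG_HDRLEN + DEMO_GENL_HDRLEN, payload, n);
	return buf + DEMO_NLMSG_HDRLEN;
}
```

The sketch copies headers with memcpy() instead of casting, so it does not
depend on the alignment of the caller's buffer.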
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
include/net/genetlink.h | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+)
Index: linux-2.6.17-rc1/include/net/genetlink.h
===================================================================
--- linux-2.6.17-rc1.orig/include/net/genetlink.h 2006-04-21 19:39:29.000000000 -0400
+++ linux-2.6.17-rc1/include/net/genetlink.h 2006-04-21 20:29:19.000000000 -0400
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
}
+/**
+ * genlmsg_data - head of message payload
+ * @gnlh: genetlink message header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+ return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+ struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+ NLMSG_HDRLEN);
+ return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
#endif /* __NET_GENERIC_NETLINK_H */
* [Patch 5/8] taskstats interface
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (3 preceding siblings ...)
2006-04-22 2:35 ` [Patch 4/8] Utilities for genetlink usage Shailabh Nagar
@ 2006-04-22 2:37 ` Shailabh Nagar
2006-04-27 1:12 ` Jay Lan
2006-04-22 2:39 ` [Patch 6/8] delay accounting usage of " Shailabh Nagar
` (3 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:37 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixes comments by jlan@engr.sgi.com
- separate out taskstats interface from delay accounting completely including
separate documentation
- permit different accounting subsystems to fill in parts of common
structure separately before common taskstats code sends it out on genetlink
- send common structure to userspace after update_hiwater_rss and before
exit_mm in do_exit
Fixes comments by akpm
- comment to indicate locking used for taskstats struct
- whitespace issues
- unnecessary use of constant taskstats_version
- uninline fill_pid(), fill_tgid()
- unnecessary cast to pid_t in taskstats_send_stats()
- too early evaluation of thread_group_empty() in taskstats_exit_pid
- returning -EFAULT on genl_register_family failure in taskstats_init
- comment for late_initcall of taskstats_init
No fix needed
- moving kmem_cache_free of tsk->delays outside the exit mutex
(mutex shifted and tsk->delays freeing being done elsewhere now)
- __delayacct_add_tsk returning -EINVAL if delay accounting isn't enabled
user should know that no values can be returned
returning zero would be misleading
- combining fill_pid(), fill_tgid() into a common function
combined code convoluted and less readable
taskstats-setup.patch
Create a "taskstats" interface based on generic netlink
(NETLINK_GENERIC family), for getting statistics of
tasks and thread groups during their lifetime and when they exit.
The interface is intended for use by multiple accounting packages
though it is being created in the context of delay accounting.
This patch creates the interface without populating the
fields of the data that is sent to the user in response to a command
or upon the exit of a task. Each accounting package interested in using
taskstats has to provide an additional patch to add its stats to the
common structure.
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Documentation/accounting/taskstats.txt | 146 +++++++++++++++
include/linux/taskstats.h | 85 ++++++++
include/linux/taskstats_kern.h | 55 +++++
init/Kconfig | 15 +
init/main.c | 2
kernel/Makefile | 1
kernel/exit.c | 7
kernel/taskstats.c | 321 +++++++++++++++++++++++++++++++++
8 files changed, 629 insertions(+), 3 deletions(-)
Index: linux-2.6.17-rc1/include/linux/taskstats.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/include/linux/taskstats.h 2006-04-21 20:31:11.000000000 -0400
@@ -0,0 +1,85 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ *
+ * To add new fields
+ * a) bump up TASKSTATS_VERSION
+ * b) add comment indicating new version number at end of struct
+ * c) add new fields after version comment; maintain 64-bit alignment
+ */
+
+#define TASKSTATS_VERSION 1
+
+struct taskstats {
+
+ /* Version 1 */
+
+ int filler_avoids_empty_struct_warnings;
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ * prior to __TASKSTATS_CMD_MAX
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request/get-response */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */
+ TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.17-rc1/init/Kconfig
===================================================================
--- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 20:29:22.000000000 -0400
@@ -150,6 +150,18 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ default n
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say N if unsure.
+
config TASK_DELAY_ACCT
bool "Enable per-task delay accounting (EXPERIMENTAL)"
help
@@ -158,9 +170,6 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.
- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.
config SYSCTL
Index: linux-2.6.17-rc1/kernel/Makefile
===================================================================
--- linux-2.6.17-rc1.orig/kernel/Makefile 2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/kernel/Makefile 2006-04-21 20:29:22.000000000 -0400
@@ -39,6 +39,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux-2.6.17-rc1/kernel/taskstats.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/kernel/taskstats.c 2006-04-21 20:29:22.000000000 -0400
@@ -0,0 +1,321 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats_kern.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+kmem_cache_t *taskstats_cache;
+static DEFINE_MUTEX(taskstats_exit_mutex);
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp, size_t size)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new(size);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc = 0;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ /*
+ * Each accounting subsystem adds calls to its functions to
+ * fill in relevant parts of struct taskstats as follows
+ *
+ * rc = per-task-foo(stats, tsk);
+ * if (rc)
+ * goto err;
+ */
+
+err:
+ put_task_struct(tsk);
+ return rc;
+
+}
+
+static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc = 0;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ /*
+ * Each accounting subsystem adds calls to its functions to
+ * fill in relevant parts of struct taskstats as follows
+ *
+ * rc = per-task-foo(stats, tsk);
+ * if (rc)
+ * break;
+ */
+
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ /*
+ * Accounting subsystems can also add calls here if they don't
+ * wish to aggregate statistics for per-tgid stats
+ */
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+ size_t size;
+ struct nlattr *na;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid(pid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ stats);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid(tgid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ stats);
+ } else {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ nla_nest_end(rep_skb, na);
+
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ rc = genlmsg_cancel(rep_skb, reply);
+err:
+ nlmsg_free(rep_skb);
+ return rc;
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats,
+ struct taskstats *tgidstats)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ size_t size;
+ int is_thread_group;
+ struct nlattr *na;
+
+ if (!family_registered)
+ return;
+
+ mutex_lock(&taskstats_exit_mutex);
+
+ is_thread_group = !thread_group_empty(tsk);
+ rc = 0;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ if (is_thread_group)
+ size = 2 * size; /* PID + STATS + TGID + STATS */
+
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ goto ret;
+
+ if (!tidstats)
+ goto err_skb;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, *tidstats);
+ nla_nest_end(rep_skb, na);
+
+ if (!is_thread_group || !tgidstats) {
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ goto ret;
+ }
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, *tgidstats);
+ nla_nest_end(rep_skb, na);
+
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ goto ret;
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+ /* fall through: the skb was never sent and must be freed */
+err_skb:
+ nlmsg_free(rep_skb);
+ret:
+ mutex_unlock(&taskstats_exit_mutex);
+ return;
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+/* Needed early in initialization */
+void __init taskstats_init_early(void)
+{
+ taskstats_cache = kmem_cache_create("taskstats_cache",
+ sizeof(struct taskstats),
+ 0, SLAB_PANIC, NULL, NULL);
+}
+
+static int __init taskstats_init(void)
+{
+ int rc;
+
+ rc = genl_register_family(&family);
+ if (rc)
+ return rc;
+ family_registered = 1;
+
+ if ((rc = genl_register_ops(&family, &taskstats_ops)) < 0)
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return rc;
+}
+
+/*
+ * late initcall ensures initialization of statistics collection
+ * mechanisms precedes initialization of the taskstats interface
+ */
+late_initcall(taskstats_init);
Index: linux-2.6.17-rc1/kernel/exit.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/exit.c 2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/kernel/exit.c 2006-04-21 20:29:22.000000000 -0400
@@ -35,6 +35,7 @@
#include <linux/futex.h>
#include <linux/compat.h>
#include <linux/delayacct.h>
+#include <linux/taskstats_kern.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -847,6 +848,7 @@ static void exit_notify(struct task_stru
fastcall NORET_TYPE void do_exit(long code)
{
struct task_struct *tsk = current;
+ struct taskstats *tidstats, *tgidstats;
int group_dead;
profile_task_exit(tsk);
@@ -893,6 +895,8 @@ fastcall NORET_TYPE void do_exit(long co
current->comm, current->pid,
preempt_count());
+ taskstats_exit_alloc(&tidstats, &tgidstats);
+
acct_update_integrals(tsk);
if (tsk->mm) {
@@ -911,7 +915,10 @@ fastcall NORET_TYPE void do_exit(long co
if (unlikely(tsk->compat_robust_list))
compat_exit_robust_list(tsk);
#endif
+ taskstats_exit_send(tsk, tidstats, tgidstats);
+ taskstats_exit_free(tidstats, tgidstats);
delayacct_tsk_exit(tsk);
+
exit_mm(tsk);
exit_sem(tsk);
Index: linux-2.6.17-rc1/include/linux/taskstats_kern.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/include/linux/taskstats_kern.h 2006-04-21 20:29:22.000000000 -0400
@@ -0,0 +1,55 @@
+/* taskstats_kern.h - kernel header for per-task statistics interface
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ */
+
+#ifndef _LINUX_TASKSTATS_KERN_H
+#define _LINUX_TASKSTATS_KERN_H
+
+#include <linux/taskstats.h>
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern kmem_cache_t *taskstats_cache;
+
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
+ struct taskstats **ptgidstats)
+{
+ *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+ *ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+}
+
+static inline void taskstats_exit_free(struct taskstats *tidstats,
+ struct taskstats *tgidstats)
+{
+ if (tidstats)
+ kmem_cache_free(taskstats_cache, tidstats);
+ if (tgidstats)
+ kmem_cache_free(taskstats_cache, tgidstats);
+}
+
+extern void taskstats_exit_send(struct task_struct *, struct taskstats *,
+ struct taskstats *);
+extern void taskstats_init_early(void);
+
+#else
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
+ struct taskstats **ptgidstats)
+{}
+static inline void taskstats_exit_free(struct taskstats *ptidstats,
+ struct taskstats *ptgidstats)
+{}
+static inline void taskstats_exit_send(struct task_struct *tsk,
+ struct taskstats *tidstats, struct taskstats *tgidstats) {}
+static inline void taskstats_init_early(void)
+{}
+#endif /* CONFIG_TASKSTATS */
+
+#endif
+
Index: linux-2.6.17-rc1/Documentation/accounting/taskstats.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/Documentation/accounting/taskstats.txt 2006-04-21 20:29:22.000000000 -0400
@@ -0,0 +1,146 @@
+Per-task statistics interface
+-----------------------------
+
+
+Taskstats is a netlink-based interface for sending per-task and
+per-process statistics from the kernel to userspace.
+
+Taskstats was designed for the following benefits:
+
+- efficiently provide statistics during lifetime of a task and on its exit
+- unified interface for multiple accounting subsystems
+- extensibility for use by future accounting patches
+
+Terminology
+-----------
+
+"pid", "tid" and "task" are used interchangeably and refer to the standard
+Linux task defined by struct task_struct. per-pid stats are the same as
+per-task stats.
+
+"tgid", "process" and "thread group" are used interchangeably and refer to the
+tasks that share an mm_struct i.e. the traditional Unix process. Despite the
+use of tgid, there is no special treatment for the task that is thread group
+leader - a process is deemed alive as long as it has any task belonging to it.
+
+Usage
+-----
+
+To get statistics during a task's lifetime, userspace opens a unicast netlink
+socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
+The response contains statistics for a task (if pid is specified) or the sum of
+statistics for all tasks of the process (if tgid is specified).
+
+To obtain statistics for tasks which are exiting, userspace opens a multicast
+netlink socket. Each time a task exits, two records are sent by the kernel to
+each listener on the multicast socket. The first is the exiting task's per-pid
+statistics and the second is the sum for all tasks of the process to which the
+task belongs (the task does not need to be the thread group leader). The need
+for per-tgid stats to be sent for each exiting task is explained in the
+"per-tgid stats" section below.
+
+
+Interface
+---------
+
+The user-kernel interface is encapsulated in include/linux/taskstats.h
+
+To avoid this documentation becoming obsolete as the interface evolves, only
+an outline of the current version is given. taskstats.h always overrides the
+description here.
+
+struct taskstats is the common accounting structure for both per-pid and
+per-tgid data. It is versioned and can be extended by each accounting subsystem
+that is added to the kernel. The fields and their semantics are defined in the
+taskstats.h file.
+
+The data exchanged between user and kernel space is a netlink message belonging
+to the NETLINK_GENERIC family and using the netlink attributes interface.
+The messages are in the format
+
+ +----------+- - -+-------------+-------------------+
+ | nlmsghdr | Pad | genlmsghdr | taskstats payload |
+ +----------+- - -+-------------+-------------------+
+
+
+The taskstats payload is one of the following three kinds:
+
+1. Commands: Sent from user to kernel. The payload is one attribute, of type
+TASKSTATS_CMD_ATTR_PID/TGID, containing a u32 pid or tgid in the attribute
+payload. The pid/tgid denotes the task/process for which userspace wants
+statistics.
+
+2. Response for a command: sent from the kernel in response to a userspace
+command. The payload is a series of three attributes of type:
+
+a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates
+a pid/tgid will be followed by some stats.
+
+b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
+are being returned.
+
+c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
+same structure is used for both per-pid and per-tgid stats.
+
+3. New message sent by kernel whenever a task exits. The payload consists of a
+ series of attributes of the following type:
+
+a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
+b) TASKSTATS_TYPE_PID: contains exiting task's pid
+c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
+d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
+e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
+f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
+
+
+per-tgid stats
+--------------
+
+Taskstats provides per-process stats, in addition to per-task stats, since
+resource management is often done at a process granularity and aggregating task
+stats in userspace alone is inefficient and potentially inaccurate (due to lack
+of atomicity).
+
+However, maintaining per-process, in addition to per-task stats, within the
+kernel has space and time overheads. Hence the taskstats implementation
+dynamically sums up the per-task stats for each task belonging to a process
+whenever per-process stats are needed.
+
+Not maintaining per-tgid stats creates a problem when userspace is interested
+in getting these stats when the process dies i.e. the last thread of
+a process exits. It isn't possible to simply return some aggregated per-process
+statistic from the kernel.
+
+The approach taken by taskstats is to return the per-tgid stats *each* time
+a task exits, in addition to the per-pid stats for that task. Userspace can
+maintain task<->process mappings and use them to maintain the per-process stats
+in userspace, updating the aggregate appropriately as the tasks of a process
+exit.
+
+Extending taskstats
+-------------------
+
+There are two ways to extend the taskstats interface to export more
+per-task/process stats as patches to collect them get added to the kernel
+in future:
+
+1. Adding more fields to the end of the existing struct taskstats. Backward
+ compatibility is ensured by the version number within the
+ structure. Userspace will use only the fields of the struct that correspond
+ to the version it is using.
+
+2. Defining separate statistic structs and using the netlink attributes
+ interface to return them. Since userspace processes each netlink attribute
+ independently, it can always ignore attributes whose type it does not
+ understand (because it is using an older version of the interface).
+
+
+Choosing between 1. and 2. is a matter of trading off flexibility and
+overhead. If only a few fields need to be added, then 1. is the preferable
+path since the kernel and userspace don't need to incur the overhead of
+processing new netlink attributes. But if the new fields expand the existing
+struct too much, requiring disparate userspace accounting utilities to
+unnecessarily receive large structures whose fields are of no interest, then
+extending the attributes structure would be worthwhile.
+
+----
\ No newline at end of file
Index: linux-2.6.17-rc1/init/main.c
===================================================================
--- linux-2.6.17-rc1.orig/init/main.c 2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/init/main.c 2006-04-21 20:29:22.000000000 -0400
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/taskstats_kern.h>
#include <linux/delayacct.h>
#include <asm/io.h>
@@ -542,6 +543,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
+ taskstats_init_early();
delayacct_init();
check_bugs();
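As an illustration of the reply layout described in taskstats.txt in this patch: taskstats_send_stats() emits an AGGR_PID or AGGR_TGID attribute that nests an id attribute and a stats attribute. A userspace listener might walk that layout roughly as sketched below. This assumes the standard netlink attribute header (16-bit length including the header, 16-bit type, payload padded to 4 bytes). The sketch_* names are hypothetical, and the type values simply mirror the TASKSTATS_TYPE_* enum in this patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values mirror the TASKSTATS_TYPE_* enum in this patch. */
enum {
	SKETCH_TYPE_PID = 1,
	SKETCH_TYPE_TGID = 2,
	SKETCH_TYPE_STATS = 3,
	SKETCH_TYPE_AGGR_PID = 4,
	SKETCH_TYPE_AGGR_TGID = 5,
};

/* Assumed netlink attribute header; nla_len includes the header itself. */
struct sketch_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

#define SKETCH_NLA_ALIGN(len)	(((len) + 3U) & ~3U)
#define SKETCH_NLA_HDRLEN	((size_t)SKETCH_NLA_ALIGN(sizeof(struct sketch_nlattr)))

/*
 * Find the first attribute of the given type in buf[0..len).
 * Returns a pointer to its payload and stores the payload length,
 * or returns NULL if the attribute is absent or the buffer is malformed.
 */
static const unsigned char *sketch_find_attr(const unsigned char *buf,
					     size_t len, uint16_t type,
					     size_t *payload_len)
{
	size_t off = 0;

	while (off + SKETCH_NLA_HDRLEN <= len) {
		struct sketch_nlattr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.nla_len < SKETCH_NLA_HDRLEN || hdr.nla_len > len - off)
			return NULL;
		if (hdr.nla_type == type) {
			*payload_len = hdr.nla_len - SKETCH_NLA_HDRLEN;
			return buf + off + SKETCH_NLA_HDRLEN;
		}
		off += SKETCH_NLA_ALIGN(hdr.nla_len);
	}
	return NULL;
}

/*
 * Pull the pid (or tgid) out of an AGGR_PID (or AGGR_TGID) nest.
 * Returns 0 on success, -1 if the nest or the id attribute is missing.
 */
static int sketch_get_aggr_id(const unsigned char *msg, size_t len,
			      uint16_t aggr_type, uint16_t id_type,
			      uint32_t *id)
{
	size_t nest_len, id_len;
	const unsigned char *nest, *p;

	nest = sketch_find_attr(msg, len, aggr_type, &nest_len);
	if (!nest)
		return -1;
	p = sketch_find_attr(nest, nest_len, id_type, &id_len);
	if (!p || id_len != sizeof(*id))
		return -1;
	memcpy(id, p, sizeof(*id));
	return 0;
}
```

The same sketch_find_attr() call with SKETCH_TYPE_STATS on the nest would locate the struct taskstats payload. A listener built this way skips attribute types it does not recognise, which is what option 2 under "Extending taskstats" relies on.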
* [Patch 6/8] delay accounting usage of taskstats interface
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (4 preceding siblings ...)
2006-04-22 2:37 ` [Patch 5/8] taskstats interface Shailabh Nagar
@ 2006-04-22 2:39 ` Shailabh Nagar
2006-04-22 2:40 ` [Patch 7/8] documentation Shailabh Nagar
` (2 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:39 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixes comments by akpm (on earlier patch now incorporated here)
- detailed comments on atomicity rules of accounting fields
- replace use of nsec_t
delayacct-taskstats.patch
Usage of taskstats interface by delay accounting.
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
include/linux/delayacct.h | 11 ++++++++++
include/linux/taskstats.h | 48 +++++++++++++++++++++++++++++++++++++++++++++-
init/Kconfig | 1
kernel/delayacct.c | 42 ++++++++++++++++++++++++++++++++++++++++
kernel/taskstats.c | 9 +++++++-
5 files changed, 109 insertions(+), 2 deletions(-)
Index: linux-2.6.17-rc1/include/linux/delayacct.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/delayacct.h 2006-04-21 20:29:13.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 20:42:41.000000000 -0400
@@ -18,6 +18,7 @@
#define _LINUX_TASKDELAYS_H
#include <linux/sched.h>
+#include <linux/taskstats_kern.h>
/*
* Per-task flags relevant to delay accounting
@@ -35,6 +36,7 @@ extern void __delayacct_tsk_init(struct
extern void __delayacct_tsk_exit(struct task_struct *);
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
static inline void delayacct_set_flag(int flag)
{
@@ -74,6 +76,13 @@ static inline void delayacct_blkio_end(v
__delayacct_blkio_end();
}
+static inline int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -89,6 +98,8 @@ static inline void delayacct_blkio_start
{}
static inline void delayacct_blkio_end(void)
{}
+static inline int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{ return 0; }
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif
Index: linux-2.6.17-rc1/kernel/delayacct.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/delayacct.c 2006-04-21 20:29:13.000000000 -0400
+++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 20:40:03.000000000 -0400
@@ -104,3 +104,45 @@ void __delayacct_blkio_end(void)
&current->delays->blkio_delay,
&current->delays->blkio_count);
}
+
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ s64 tmp;
+ struct timespec ts;
+ unsigned long t1,t2,t3;
+
+
+ tmp = (s64)d->cpu_run_real_total;
+ tmp += (u64)(tsk->utime + tsk->stime) * TICK_NSEC;
+ d->cpu_run_real_total = (tmp < (s64)d->cpu_run_real_total) ? 0 : tmp;
+
+ /* No locking available for sched_info (and too expensive to add one)
+ * Mitigate by taking snapshot of values
+ */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+ t3 = tsk->sched_info.cpu_time;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (s64)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (s64)d->cpu_delay_total) ? 0 : tmp;
+
+ tmp = (s64)d->cpu_run_virtual_total + (s64)jiffies_to_usecs(t3) * 1000;
+ d->cpu_run_virtual_total =
+ (tmp < (s64)d->cpu_run_virtual_total) ? 0 : tmp;
+
+ /* zero XXX_total, non-zero XXX_count implies XXX stat overflowed */
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total) ? 0 : tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total) ? 0 : tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+
+ return 0;
+}
Index: linux-2.6.17-rc1/kernel/taskstats.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/taskstats.c 2006-04-21 20:29:22.000000000 -0400
+++ linux-2.6.17-rc1/kernel/taskstats.c 2006-04-21 20:40:03.000000000 -0400
@@ -18,6 +18,7 @@
#include <linux/kernel.h>
#include <linux/taskstats_kern.h>
+#include <linux/delayacct.h>
#include <net/genetlink.h>
#include <asm/atomic.h>
@@ -119,7 +120,9 @@ static int fill_pid(pid_t pid, struct ta
* goto err;
*/
-err:
+ rc = delayacct_add_tsk(stats, tsk);
+
+ /* Define err: label here if needed */
put_task_struct(tsk);
return rc;
@@ -151,6 +154,10 @@ static int fill_tgid(pid_t tgid, struct
* break;
*/
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+
} while_each_thread(first, tsk);
read_unlock(&tasklist_lock);
Index: linux-2.6.17-rc1/init/Kconfig
===================================================================
--- linux-2.6.17-rc1.orig/init/Kconfig 2006-04-21 20:29:22.000000000 -0400
+++ linux-2.6.17-rc1/init/Kconfig 2006-04-21 20:40:03.000000000 -0400
@@ -164,6 +164,7 @@ config TASKSTATS
config TASK_DELAY_ACCT
bool "Enable per-task delay accounting (EXPERIMENTAL)"
+ depends on TASKSTATS
help
Collect information on time spent by a task waiting for system
resources like cpu, synchronous block I/O completion and swapping
Index: linux-2.6.17-rc1/include/linux/taskstats.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/taskstats.h 2006-04-21 20:31:11.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/taskstats.h 2006-04-21 20:45:17.000000000 -0400
@@ -35,7 +35,53 @@ struct taskstats {
/* Version 1 */
- int filler_avoids_empty_struct_warnings;
+ /* Delay accounting fields start
+ *
+ * All values, until comment "Delay accounting fields end" are
+ * available only if delay accounting is enabled, even though the last
+ * few fields are not delays
+ *
+ * xxx_count is the number of delay values recorded
+ * xxx_delay_total is the corresponding cumulative delay in nanoseconds
+ *
+ * xxx_delay_total wraps around to zero on overflow
+ * xxx_count incremented regardless of overflow
+ */
+
+ /* Delay waiting for cpu, while runnable
+ * count, delay_total NOT updated atomically
+ */
+ __u64 cpu_count;
+ __u64 cpu_delay_total;
+
+ /* Following four fields atomically updated using task->delays->lock */
+
+ /* Delay waiting for synchronous block I/O to complete
+ * does not account for delays in I/O submission
+ */
+ __u64 blkio_count;
+ __u64 blkio_delay_total;
+
+ /* Delay waiting for page fault I/O (swap in only) */
+ __u64 swapin_count;
+ __u64 swapin_delay_total;
+
+ /* cpu "wall-clock" running time
+ * On some architectures, value will adjust for cpu time stolen
+ * from the kernel in involuntary waits due to virtualization.
+ * Value is cumulative, in nanoseconds, without a corresponding count
+ * and wraps around to zero silently on overflow
+ */
+ __u64 cpu_run_real_total;
+
+ /* cpu "virtual" running time
+ * Uses time intervals seen by the kernel i.e. no adjustment
+ * for kernel's involuntary waits due to virtualization.
+ * Value is cumulative, in nanoseconds, without a corresponding count
+ * and wraps around to zero silently on overflow
+ */
+ __u64 cpu_run_virtual_total;
+ /* Delay accounting fields end */
};
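The header comment above says xxx_delay_total wraps around to zero on overflow while xxx_count keeps incrementing. A minimal sketch of that accumulation rule for the unsigned totals, along the lines of what __delayacct_add_tsk() does for blkio_delay_total and swapin_delay_total, is shown below. The helper name is hypothetical and is not part of the patch.

```c
#include <stdint.h>

/*
 * Add delta to total. If the unsigned sum wrapped around, report 0 so
 * that "zero total with a non-zero count" signals that the statistic
 * overflowed.
 */
static uint64_t sketch_add_or_zero(uint64_t total, uint64_t delta)
{
	uint64_t sum = total + delta;	/* unsigned wraparound is well defined */

	return (sum < total) ? 0 : sum;
}
```

A consumer that sees a zero total alongside a non-zero count can then treat the total as invalid rather than as "no delay".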
* [Patch 7/8] documentation
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (5 preceding siblings ...)
2006-04-22 2:39 ` [Patch 6/8] delay accounting usage of " Shailabh Nagar
@ 2006-04-22 2:40 ` Shailabh Nagar
2006-04-22 2:42 ` [Patch 8/8] /proc export of aggregated block I/O delays Shailabh Nagar
2006-04-25 15:07 ` [Patch 0/8] per-task delay accounting Shailabh Nagar
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:40 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
delayacct-doc.patch
Some documentation for delay accounting.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Documentation/accounting/delay-accounting.txt | 115 +++++++
Documentation/accounting/getdelays.c | 376 ++++++++++++++++++++++++++
Documentation/accounting/taskstats.txt | 2
3 files changed, 493 insertions(+)
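A note for readers of delay-accounting.txt below: the "difference of two
successive readings" it describes can be sketched in userspace roughly as
follows. This is an illustrative sketch only and not part of the patch;
struct delay_sample, counter_delta() and avg_delay_ns() are made-up names.
It assumes a 64-bit counter wraps at most once between the two readings, in
which case plain unsigned subtraction still yields the right interval value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for one xxx_count / xxx_delay_total pair,
 * as read from struct taskstats at one point in time. */
struct delay_sample {
	uint64_t count;        /* number of delays recorded so far */
	uint64_t delay_total;  /* cumulative delay in nanoseconds */
};

/* Interval value of a cumulative counter. Unsigned arithmetic is
 * modulo 2^64, so the result is still correct if the counter wrapped
 * to zero once between the two readings. */
static uint64_t counter_delta(uint64_t earlier, uint64_t later)
{
	return later - earlier;
}

/* Average delay per event, in nanoseconds, over the interval between
 * two samples. Returns 0 if no delay was recorded in the interval. */
static uint64_t avg_delay_ns(const struct delay_sample *prev,
			     const struct delay_sample *cur)
{
	uint64_t events;

	assert(prev && cur);
	events = counter_delta(prev->count, cur->count);
	if (events == 0)
		return 0;
	return counter_delta(prev->delay_total, cur->delay_total) / events;
}
```

For instance, taking the cpu figures from the sample getdelays output in the
documentation (count 7876, delay total 24001500) against an all-zero earlier
sample, avg_delay_ns() comes to 3047 ns per recorded cpu delay.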
Index: linux-2.6.17-rc1/Documentation/accounting/delay-accounting.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/Documentation/accounting/delay-accounting.txt 2006-04-21 20:50:22.000000000 -0400
@@ -0,0 +1,115 @@
+Delay accounting
+----------------
+
+Tasks encounter delays in execution when they wait
+for some kernel resource to become available e.g. a
+runnable task may wait for a free CPU to run on.
+
+The per-task delay accounting functionality measures
+the delays experienced by a task while
+
+a) waiting for a CPU (while being runnable)
+b) completion of synchronous block I/O initiated by the task
+c) swapping in pages
+
+and makes these statistics available to userspace through
+the taskstats interface.
+
+Such delays provide feedback for setting a task's cpu priority,
+io priority and rss limit values appropriately. Long delays for an
+important task could be a trigger for raising its corresponding priority.
+
+The functionality, through its use of the taskstats interface, also provides
+delay statistics aggregated for all tasks (or threads) belonging to a
+thread group (corresponding to a traditional Unix process). This is a commonly
+needed aggregation that is more efficiently done by the kernel.
+
+Userspace utilities, particularly resource management applications, can also
+aggregate delay statistics into arbitrary groups. To enable this, delay
+statistics of a task are available both during its lifetime and at its
+exit, ensuring continuous and complete monitoring can be done.
+
+
+Interface
+---------
+
+Delay accounting uses the taskstats interface which is described
+in detail in a separate document in this directory. Taskstats returns a
+generic data structure to userspace corresponding to per-pid and per-tgid
+statistics. The delay accounting functionality populates specific fields of
+this structure. See
+ include/linux/taskstats.h
+for a description of the fields pertaining to delay accounting.
+These fields are generally counters returning the cumulative
+delay seen for cpu, sync block I/O, swapin etc.
+
+Taking the difference of two successive readings of a given
+counter (say cpu_delay_total) for a task will give the delay
+experienced by the task waiting for the corresponding resource
+in that interval.
+
+When a task exits, records containing the per-task and per-process statistics
+are sent to userspace without requiring a command. More details are given in
+the taskstats interface description.
+
+The getdelays.c userspace utility in this directory allows simple commands to
+be run and the corresponding delay statistics to be displayed. It also serves
+as an example of using the taskstats interface.
+
+Usage
+-----
+
+Compile the kernel with
+ CONFIG_TASK_DELAY_ACCT=y
+ CONFIG_TASKSTATS=y
+
+Enable the accounting at boot time by adding
+the following to the kernel boot options
+ delayacct
+
+and after the system has booted up, use a utility
+similar to getdelays.c to access the delays
+seen by a given task or a task group (tgid).
+The utility also allows a given command to be
+executed and the corresponding delays to be
+seen.
+
+General format of the getdelays command
+
+getdelays [-t tgid] [-p pid] [-c cmd...]
+
+
+Get delays, since system boot, for pid 10
+# ./getdelays -p 10
+(output similar to next case)
+
+Get sum of delays, since system boot, for all pids with tgid 5
+# ./getdelays -t 5
+
+
+CPU count real total virtual total delay total
+ 7876 92005750 100000000 24001500
+IO count delay total
+ 0 0
+MEM count delay total
+ 0 0
+
+Get delays seen in executing a given simple command
+# ./getdelays -c ls /
+
+bin data1 data3 data5 dev home media opt root srv sys usr
+boot data2 data4 data6 etc lib mnt proc sbin subdomain tmp var
+
+
+CPU count real total virtual total delay total
+ 6 4000250 4000000 0
+IO count delay total
+ 0 0
+MEM count delay total
+ 0 0
+
+
+
+
+
+
Index: linux-2.6.17-rc1/Documentation/accounting/getdelays.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.17-rc1/Documentation/accounting/getdelays.c 2006-04-21 20:53:54.000000000 -0400
@@ -0,0 +1,376 @@
+/* getdelays.c
+ *
+ * Utility to get per-pid and per-tgid delay accounting statistics
+ * Also illustrates usage of the taskstats interface
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2005
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <unistd.h>
+#include <poll.h>
+#include <string.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <signal.h>
+
+#include <linux/genetlink.h>
+#include <linux/taskstats.h>
+
+/*
+ * Generic macros for dealing with netlink sockets. Might be duplicated
+ * elsewhere. It is recommended that commercial grade applications use
+ * libnl or libnetlink and use the interfaces provided by the library
+ */
+#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN)
+#define NLA_DATA(na) ((void *)((char*)(na) + NLA_HDRLEN))
+#define NLA_PAYLOAD(len) (len - NLA_HDRLEN)
+
+#define err(code, fmt, arg...) do { printf(fmt, ##arg); exit(code); } while (0)
+int done = 0;
+
+/*
+ * Create a raw netlink socket and bind
+ */
+static int create_nl_socket(int protocol, int groups)
+{
+ socklen_t addr_len;
+ int fd;
+ struct sockaddr_nl local;
+
+ fd = socket(AF_NETLINK, SOCK_RAW, protocol);
+ if (fd < 0)
+ return -1;
+
+ memset(&local, 0, sizeof(local));
+ local.nl_family = AF_NETLINK;
+ local.nl_groups = groups;
+
+ if (bind(fd, (struct sockaddr *) &local, sizeof(local)) < 0)
+ goto error;
+
+ return fd;
+ error:
+ close(fd);
+ return -1;
+}
+
+int sendto_fd(int s, const char *buf, int bufLen)
+{
+ struct sockaddr_nl nladdr;
+ int r;
+
+ memset(&nladdr, 0, sizeof(nladdr));
+ nladdr.nl_family = AF_NETLINK;
+
+ while ((r = sendto(s, buf, bufLen, 0, (struct sockaddr *) &nladdr,
+ sizeof(nladdr))) < bufLen) {
+ if (r > 0) {
+ buf += r;
+ bufLen -= r;
+ } else if (errno != EAGAIN)
+ return -1;
+ }
+ return 0;
+}
+
+/*
+ * Probe the controller in genetlink to find the family id
+ * for the TASKSTATS family
+ */
+int get_family_id(int sd)
+{
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } family_req;
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } ans;
+
+ int id = 0; /* stays 0 if no family id attribute is found */
+ struct nlattr *na;
+ int rep_len;
+
+ /* Get family name */
+ family_req.n.nlmsg_type = GENL_ID_CTRL;
+ family_req.n.nlmsg_flags = NLM_F_REQUEST;
+ family_req.n.nlmsg_seq = 0;
+ family_req.n.nlmsg_pid = getpid();
+ family_req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ family_req.g.cmd = CTRL_CMD_GETFAMILY;
+ family_req.g.version = 0x1;
+ na = (struct nlattr *) GENLMSG_DATA(&family_req);
+ na->nla_type = CTRL_ATTR_FAMILY_NAME;
+ na->nla_len = strlen(TASKSTATS_GENL_NAME) + 1 + NLA_HDRLEN;
+ strcpy(NLA_DATA(na), TASKSTATS_GENL_NAME);
+ family_req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+ if (sendto_fd(sd, (char *) &family_req, family_req.n.nlmsg_len) < 0)
+ err(1, "error sending message via Netlink\n");
+
+ rep_len = recv(sd, &ans, sizeof(ans), 0);
+
+ if (rep_len < 0)
+ err(1, "error receiving reply message via Netlink\n");
+
+
+ /* Validate response message */
+ if (!NLMSG_OK((&ans.n), rep_len))
+ err(1, "invalid reply message received via Netlink\n");
+
+ if (ans.n.nlmsg_type == NLMSG_ERROR) { /* error */
+ printf("error received NACK - leaving\n");
+ exit(1);
+ }
+
+
+ na = (struct nlattr *) GENLMSG_DATA(&ans);
+ na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+ if (na->nla_type == CTRL_ATTR_FAMILY_ID) {
+ id = *(__u16 *) NLA_DATA(na);
+ }
+ return id;
+}
+
+void print_taskstats(struct taskstats *t)
+{
+ printf("\n\nCPU %15s%15s%15s%15s\n"
+ " %15llu%15llu%15llu%15llu\n"
+ "IO %15s%15s\n"
+ " %15llu%15llu\n"
+ "MEM %15s%15s\n"
+ " %15llu%15llu\n\n",
+ "count", "real total", "virtual total", "delay total",
+ t->cpu_count, t->cpu_run_real_total, t->cpu_run_virtual_total,
+ t->cpu_delay_total,
+ "count", "delay total",
+ t->blkio_count, t->blkio_delay_total,
+ "count", "delay total", t->swapin_count, t->swapin_delay_total);
+}
+
+void sigchld(int sig)
+{
+ done = 1;
+}
+
+int main(int argc, char *argv[])
+{
+ int rc;
+ int sk_nl;
+ struct nlmsghdr *nlh;
+ struct genlmsghdr *genlhdr;
+ char *buf;
+ struct taskstats_cmd_param *param;
+ __u16 id;
+ struct nlattr *na;
+
+ /* For receiving */
+ struct sockaddr_nl kern_nla, from_nla;
+ socklen_t from_nla_len;
+ int recv_len;
+ struct taskstats_reply *reply;
+
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } req;
+
+ struct {
+ struct nlmsghdr n;
+ struct genlmsghdr g;
+ char buf[256];
+ } ans;
+
+ int nl_sd = -1;
+ int rep_len;
+ int len = 0;
+ int aggr_len, len2;
+ struct sockaddr_nl nladdr;
+ pid_t tid = 0;
+ pid_t rtid = 0;
+ int cmd_type = TASKSTATS_TYPE_TGID;
+ int c, status;
+ int forking = 0;
+ struct sigaction act = {
+ .sa_handler = SIG_IGN,
+ .sa_mask = SA_NOMASK,
+ };
+ struct sigaction tact ;
+
+ if (argc < 3) {
+ printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+ exit(-1);
+ }
+
+ tact.sa_handler = sigchld;
+ sigemptyset(&tact.sa_mask);
+ if (sigaction(SIGCHLD, &tact, NULL) < 0)
+ err(1, "sigaction failed for SIGCHLD\n");
+
+ while (1) {
+
+ c = getopt(argc, argv, "t:p:c:");
+ if (c < 0)
+ break;
+
+ switch (c) {
+ case 't':
+ tid = atoi(optarg);
+ if (!tid)
+ err(1, "Invalid tgid\n");
+ cmd_type = TASKSTATS_CMD_ATTR_TGID;
+ break;
+ case 'p':
+ tid = atoi(optarg);
+ if (!tid)
+ err(1, "Invalid pid\n");
+ cmd_type = TASKSTATS_CMD_ATTR_PID;
+ break;
+ case 'c':
+ opterr = 0;
+ tid = fork();
+ if (tid < 0)
+ err(1, "fork failed\n");
+
+ if (tid == 0) { /* child process */
+ if (execvp(argv[optind - 1], &argv[optind - 1]) < 0) {
+ exit(-1);
+ }
+ }
+ forking = 1;
+ break;
+ default:
+ printf("usage %s [-t tgid][-p pid][-c cmd]\n", argv[0]);
+ exit(-1);
+ break;
+ }
+ if (c == 'c')
+ break;
+ }
+
+ /* Construct Netlink request message */
+
+ /* Send Netlink request message & get reply */
+
+ if ((nl_sd =
+ create_nl_socket(NETLINK_GENERIC, TASKSTATS_LISTEN_GROUP)) < 0)
+ err(1, "error creating Netlink socket\n");
+
+
+ id = get_family_id(nl_sd);
+
+ /* Send command needed */
+ req.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+ req.n.nlmsg_type = id;
+ req.n.nlmsg_flags = NLM_F_REQUEST;
+ req.n.nlmsg_seq = 0;
+ req.n.nlmsg_pid = tid;
+ req.g.cmd = TASKSTATS_CMD_GET;
+ na = (struct nlattr *) GENLMSG_DATA(&req);
+ na->nla_type = cmd_type;
+ na->nla_len = sizeof(unsigned int) + NLA_HDRLEN;
+ *(__u32 *) NLA_DATA(na) = tid;
+ req.n.nlmsg_len += NLMSG_ALIGN(na->nla_len);
+
+
+ if (!forking && sendto_fd(nl_sd, (char *) &req, req.n.nlmsg_len) < 0)
+ err(1, "error sending message via Netlink\n");
+
+ act.sa_handler = SIG_IGN;
+ sigemptyset(&act.sa_mask);
+ if (sigaction(SIGINT, &act, NULL) < 0)
+ err(1, "sigaction failed for SIGINT\n");
+
+ do {
+ int i;
+ struct pollfd pfd;
+ int pollres;
+
+ pfd.events = 0xffff & ~POLLOUT;
+ pfd.fd = nl_sd;
+ pollres = poll(&pfd, 1, 5000);
+ if (pollres < 0 || done) {
+ break;
+ }
+
+ rep_len = recv(nl_sd, &ans, sizeof(ans), 0);
+ nladdr.nl_family = AF_NETLINK;
+ nladdr.nl_groups = TASKSTATS_LISTEN_GROUP;
+
+ if (rep_len < 0) { /* check recv() before reading ans */
+ err(1, "error receiving reply message via Netlink\n");
+ break;
+ }
+
+ if (ans.n.nlmsg_type == NLMSG_ERROR) { /* error */
+ printf("error received NACK - leaving\n");
+ exit(1);
+ }
+
+ /* Validate response message */
+ if (!NLMSG_OK((&ans.n), rep_len))
+ err(1, "invalid reply message received via Netlink\n");
+
+ rep_len = GENLMSG_PAYLOAD(&ans.n);
+
+ na = (struct nlattr *) GENLMSG_DATA(&ans);
+ len = 0;
+ i = 0;
+ while (len < rep_len) {
+ len += NLA_ALIGN(na->nla_len);
+ switch (na->nla_type) {
+ case TASKSTATS_TYPE_AGGR_PID:
+ /* Fall through */
+ case TASKSTATS_TYPE_AGGR_TGID:
+ aggr_len = NLA_PAYLOAD(na->nla_len);
+ len2 = 0;
+ /* For nested attributes, na follows */
+ na = (struct nlattr *) NLA_DATA(na);
+ done = 0;
+ while (len2 < aggr_len) {
+ switch (na->nla_type) {
+ case TASKSTATS_TYPE_PID:
+ rtid = *(int *) NLA_DATA(na);
+ break;
+ case TASKSTATS_TYPE_TGID:
+ rtid = *(int *) NLA_DATA(na);
+ break;
+ case TASKSTATS_TYPE_STATS:
+ if (rtid == tid) {
+ print_taskstats((struct taskstats *)
+ NLA_DATA(na));
+ done = 1;
+ }
+ break;
+ }
+ len2 += NLA_ALIGN(na->nla_len);
+ na = (struct nlattr *) ((char *) na + NLA_ALIGN(na->nla_len));
+ if (done)
+ break;
+ }
+ }
+ na = (struct nlattr *) (GENLMSG_DATA(&ans) + len);
+ if (done)
+ break;
+ }
+ if (done)
+ break;
+ }
+ while (1);
+
+ close(nl_sd);
+ return 0;
+}
Index: linux-2.6.17-rc1/Documentation/accounting/taskstats.txt
===================================================================
--- linux-2.6.17-rc1.orig/Documentation/accounting/taskstats.txt 2006-04-21 20:29:22.000000000 -0400
+++ linux-2.6.17-rc1/Documentation/accounting/taskstats.txt 2006-04-21 20:50:22.000000000 -0400
@@ -39,6 +39,8 @@ belongs (the task does not need to be th
per-tgid stats to be sent for each exiting task is explained in the Advanced
Usage section below.
+getdelays.c is a simple utility demonstrating usage of the taskstats interface
+for reporting delay accounting statistics.
Interface
---------
^ permalink raw reply [flat|nested] 23+ messages in thread
* [Patch 8/8] /proc export of aggregated block I/O delays
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (6 preceding siblings ...)
2006-04-22 2:40 ` [Patch 7/8] documentation Shailabh Nagar
@ 2006-04-22 2:42 ` Shailabh Nagar
2006-04-22 7:46 ` [Lse-tech] " Andi Kleen
2006-04-25 15:07 ` [Patch 0/8] per-task delay accounting Shailabh Nagar
8 siblings, 1 reply; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-22 2:42 UTC (permalink / raw)
To: linux-kernel; +Cc: LSE, Jay Lan
Changelog
Fixed comments by akpm
- use __u64 for delayacct_blkio_ticks() return type
- redundant check for tsk->delays in __delayacct_blkio_ticks()
delayacct-procfs.patch
Export I/O delays seen by a task through /proc/<tgid>/stat
for use in top etc.
Note that delays for I/O done for swapping in pages (swapin I/O) are
clubbed together with all other I/O here (this is not the
case in the netlink interface, where the swapin I/O is kept distinct).
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
fs/proc/array.c | 6 ++++--
include/linux/delayacct.h | 10 ++++++++++
kernel/delayacct.c | 12 ++++++++++++
3 files changed, 26 insertions(+), 2 deletions(-)
Index: linux-2.6.17-rc1/fs/proc/array.c
===================================================================
--- linux-2.6.17-rc1.orig/fs/proc/array.c 2006-04-21 19:39:28.000000000 -0400
+++ linux-2.6.17-rc1/fs/proc/array.c 2006-04-21 20:55:09.000000000 -0400
@@ -75,6 +75,7 @@
#include <linux/times.h>
#include <linux/cpuset.h>
#include <linux/rcupdate.h>
+#include <linux/delayacct.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -412,7 +413,7 @@ static int do_task_stat(struct task_stru
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %llu\n",
task->pid,
tcomm,
state,
@@ -456,7 +457,8 @@ static int do_task_stat(struct task_stru
task->exit_signal,
task_cpu(task),
task->rt_priority,
- task->policy);
+ task->policy,
+ delayacct_blkio_ticks(task));
if(mm)
mmput(mm);
return res;
Index: linux-2.6.17-rc1/include/linux/delayacct.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/delayacct.h 2006-04-21 20:42:41.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 20:55:58.000000000 -0400
@@ -37,6 +37,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
+extern __u64 __delayacct_blkio_ticks(struct task_struct *);
static inline void delayacct_set_flag(int flag)
{
@@ -83,6 +84,13 @@ static inline int delayacct_add_tsk(stru
return __delayacct_add_tsk(d, tsk);
}
+static inline __u64 delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ if (tsk->delays)
+ return __delayacct_blkio_ticks(tsk);
+ return 0;
+}
+
#else
static inline void delayacct_set_flag(int flag)
{}
@@ -100,6 +108,8 @@ static inline void delayacct_blkio_end(v
{}
static inline int delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
{ return 0; }
+static inline __u64 delayacct_blkio_ticks(struct task_struct *tsk)
+{ return 0; }
#endif /* CONFIG_TASK_DELAY_ACCT */
#endif
Index: linux-2.6.17-rc1/kernel/delayacct.c
===================================================================
--- linux-2.6.17-rc1.orig/kernel/delayacct.c 2006-04-21 20:40:03.000000000 -0400
+++ linux-2.6.17-rc1/kernel/delayacct.c 2006-04-21 20:55:09.000000000 -0400
@@ -146,3 +146,15 @@ int __delayacct_add_tsk(struct taskstats
return 0;
}
+
+__u64 __delayacct_blkio_ticks(struct task_struct *tsk)
+{
+ __u64 ret;
+
+ spin_lock(&tsk->delays->lock);
+ ret = nsec_to_clock_t(tsk->delays->blkio_delay +
+ tsk->delays->swapin_delay);
+ spin_unlock(&tsk->delays->lock);
+ return ret;
+}
+
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Lse-tech] [Patch 8/8] /proc export of aggregated block I/O delays
2006-04-22 2:42 ` [Patch 8/8] /proc export of aggregated block I/O delays Shailabh Nagar
@ 2006-04-22 7:46 ` Andi Kleen
0 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2006-04-22 7:46 UTC (permalink / raw)
To: lse-tech; +Cc: Shailabh Nagar, linux-kernel, Jay Lan
On Saturday 22 April 2006 04:42, Shailabh Nagar wrote:
> Changelog
> Fixed comments by akpm
> - use __u64 for delayacct_blkio_ticks() return type
> - redundant check for tsk->delays in __delayacct_blkio_ticks()
I think these basic statistics in /proc are quite useful. Hopefully
top etc. would learn quickly about them too so that normal
people can actually make use of it.
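As a rough illustration of what a tool like top would have to do, here is a
sketch that picks the new value out of a stat line. It is not taken from any
existing tool; parse_blkio_ticks() and ticks_to_seconds() are made-up names.
It assumes the value is the last field on the line, as in this patch, which
would stop being true if more fields were appended later.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract the last whitespace-separated field of a /proc/<pid>/stat
 * line. With this patch applied that field is the aggregated block
 * I/O delay in clock ticks; this positional assumption would break
 * if further fields were appended later.
 * Returns 0 on success, -1 if no numeric trailing field is found. */
static int parse_blkio_ticks(const char *stat_line, unsigned long long *ticks)
{
	const char *end, *start, *rparen;
	char *stop;

	assert(stat_line && ticks);

	/* The command name is parenthesised and may itself contain
	 * spaces or ')', so only look past the last ')'. */
	rparen = strrchr(stat_line, ')');
	if (!rparen)
		return -1;

	/* Trim trailing newline/spaces, then back up to the field start. */
	end = stat_line + strlen(stat_line);
	while (end > rparen + 1 && (end[-1] == '\n' || end[-1] == ' '))
		end--;
	start = end;
	while (start > rparen + 1 && start[-1] != ' ')
		start--;
	if (start == end)
		return -1;

	*ticks = strtoull(start, &stop, 10);
	return stop == end ? 0 : -1;
}

/* Convert clock ticks to seconds; ticks_per_sec would normally come
 * from sysconf(_SC_CLK_TCK). */
static double ticks_to_seconds(unsigned long long ticks, long ticks_per_sec)
{
	assert(ticks_per_sec > 0);
	return (double)ticks / (double)ticks_per_sec;
}
```

Since the exported value lumps swapin I/O together with other block I/O, a
tool reading it this way cannot separate the two; that needs the netlink
interface.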
-Andi
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Patch 1/8] Setup
2006-04-22 2:23 ` [Patch 1/8] Setup Shailabh Nagar
@ 2006-04-24 2:02 ` Randy.Dunlap
2006-04-24 17:26 ` Shailabh Nagar
0 siblings, 1 reply; 23+ messages in thread
From: Randy.Dunlap @ 2006-04-24 2:02 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: linux-kernel, lse-tech, jlan
On Fri, 21 Apr 2006 22:23:25 -0400 Shailabh Nagar wrote:
> Index: linux-2.6.17-rc1/include/linux/delayacct.h
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400
> @@ -0,0 +1,69 @@
> +/* delayacct.h - per-task delay accounting
> + */
> +
> +#ifndef _LINUX_TASKDELAYS_H
> +#define _LINUX_TASKDELAYS_H
Probably _LINUX_DELAYACCT_H.
Or if I add linux/taskdelays.h, what #include guard should I use?
---
~Randy
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Patch 1/8] Setup
2006-04-24 2:02 ` Randy.Dunlap
@ 2006-04-24 17:26 ` Shailabh Nagar
0 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-24 17:26 UTC (permalink / raw)
To: Randy.Dunlap; +Cc: linux-kernel, lse-tech, jlan
Randy.Dunlap wrote:
>On Fri, 21 Apr 2006 22:23:25 -0400 Shailabh Nagar wrote:
>
>
>
>>Index: linux-2.6.17-rc1/include/linux/delayacct.h
>>===================================================================
>>--- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>+++ linux-2.6.17-rc1/include/linux/delayacct.h 2006-04-21 19:39:29.000000000 -0400
>>@@ -0,0 +1,69 @@
>>+/* delayacct.h - per-task delay accounting
>>+ */
>>+
>>+#ifndef _LINUX_TASKDELAYS_H
>>+#define _LINUX_TASKDELAYS_H
>>
>>
>
>Probably _LINUX_DELAYACCT_H.
>
>
Yup. Hangover from old name...will fix.
>Or if I add linux/taskdelays.h, what #include guard should I use?
>
>---
>~Randy
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Patch 0/8] per-task delay accounting
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
` (7 preceding siblings ...)
2006-04-22 2:42 ` [Patch 8/8] /proc export of aggregated block I/O delays Shailabh Nagar
@ 2006-04-25 15:07 ` Shailabh Nagar
8 siblings, 0 replies; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-25 15:07 UTC (permalink / raw)
To: Shailabh Nagar
Cc: linux-kernel, LSE, Jes Sorensen, Peter Chubb, Erich Focht,
Levent Serinol, Jay Lan
Here's a repost of my overview of the other stakeholders.
For some reason, lkml keeps rejecting this and its earlier post
wasn't archived either. Retrying.
Following Andrew's suggestion, here's my quick overview
of the various other accounting packages that have been
proposed on lse-tech with a focus on whether they can
utilize the netlink-based taskstats interface being proposed
by the delay accounting patches.
Please note that unification of statistics *collection* is not
being discussed since that kind of merger can be done as these
patches get accepted, if at all, into the kernel. To try and unify right
away would hold every patch (esp. delay accounting !)
hostage to the problems in every other patch unnecessarily. As
long as the interface can be unified, the merger of the collection bits
can always happen without affecting user space.
Stakeholders of each of these patches, on cc, are requested to
please correct any misunderstandings of what their patches do
so we can make forward progress.
--Shailabh
Summary
The following can use the taskstats netlink-based
interface by extending the returned data structure
- Comprehensive System Accounting
- per-process I/O stats
- Microstate accounting
- per cpu time stats
The following patches' interface needs are independent
of taskstats or subsumed by one of above:
- Enhanced Linux System Accounting
- pnotify
- scalable statistics counters
Details
(please correct if these are misunderstood)
1. Comprehensive System Accounting (Jay Lan)
--------------------------------------------
- Collect various per-task statistics and write an accounting record
containing these stats at task exit. Interface similar to BSD process
accounting but the accounting record structure is quite different.
- CSA could utilize some stats collected/exported by delay accounting
blkio wait time
cpu run time for task
- CSA only needs data to be available at task exit, not during the
task's lifetime. Moreover, at task exit, it needs the accounting record
to be written to a file.
- CSA could utilize delay accounting's taskstats netlink interface to
gather task data at exit through a userspace utility that then writes it
out to its expected file.
To do so, CSA would need the taskstats struct to be extended with
whatever additional stats it needs. The additional stats could be
selectively exported only on task exit to avoid imposing a space burden
on users of delay accounting who query a process's statistics during its
lifetime.
Collection of the additional stats needed by CSA may be tied to pnotify
and job patches which are still being reviewed/considered for
acceptance. As such, unification in the collection of stats can be
deferred until status of pnotify/job/CSA patches becomes more clear.
2. per-process I/O statistics (Levent Serinol)
----------------------------------------------
- Exports task->{rchar,wchar} through /proc/tgid/iostat
(earlier version proposed export through /proc/tgid/stats)
- No new stats collection. Just export of existing task fields
- Problem with accepting the patch stems from the accuracy of the
statistics in these fields. The fields are updated only in three cases
today (sys_read/write, sys_readv/writev, do_sendfile), so they aren't
accurate: async I/O and memory-mapped I/O, at the very least, are not
counted.
CSA patches also export these fields through their accounting record
but don't appear to be doing anything to improve accuracy of collection
(or maybe it doesn't matter to them).
BSD accounting, which ought to be using the sum of these fields for its
ac_io field, doesn't (it hardcodes the output to zero).
When the fate of task->rchar/wchar is decided, based on CSA's needs,
those fields can be easily added to taskstats.
3. per-cpu time statistics (Erich Focht)
----------------------------------------
- Collects time spent by a task on each cpu of a system
and exports it through new interface /proc/tgid/cpu
- Statistic is needed for performance analysis/debugging (like
schedstats) and not for production systems.
- Unsure why push for acceptance was abandoned. Possibly due to one or
more of:
space overhead of allocating NR_CPUS variables in task_struct
time overhead of collecting the data ?
- Can use taskstats interface to export the data by adding needed fields
to struct taskstats and bumping up the version.
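To make the "add fields and bump the version" idea concrete, here is a
sketch of how a userspace reader could cope with both old and new layouts.
Everything in it is hypothetical: the demo_stats layout, its leading version
field and the percpu_ns field are invented for illustration and are not
taken from struct taskstats. The assumption is that new fields are only ever
appended, so an older, shorter payload is a prefix of the newer one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stats layout; fields are only ever appended. */
struct demo_stats {
	uint16_t version;          /* layout version, always first */
	uint64_t cpu_count;        /* present since version 1 */
	uint64_t cpu_delay_total;  /* present since version 1 */
	uint64_t percpu_ns;        /* hypothetical version 2 addition */
};

/* Size of the version 1 prefix of the structure. */
#define DEMO_STATS_V1_LEN offsetof(struct demo_stats, percpu_ns)

/* Copy a received payload into *out. Fields the sender did not supply
 * are left as zero. Returns the sender's version, or -1 if the payload
 * is too short to hold even the version 1 fields. */
static int demo_stats_read(const void *payload, size_t len,
			   struct demo_stats *out)
{
	assert(payload && out);

	if (len < DEMO_STATS_V1_LEN)
		return -1;
	if (len > sizeof(*out))
		len = sizeof(*out);  /* ignore fields newer than we know */

	memset(out, 0, sizeof(*out));
	memcpy(out, payload, len);
	return out->version;
}
```

With this shape an old reader keeps working against a newer kernel, and a
new reader sees zeros for fields an older kernel does not send.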
4. Microstate accounting (Peter Chubb)
--------------------------------------
- Measure time spent by a thread in various interesting states, while
accounting for interrupts, and export through /proc/tid/msa and
through a syscall interface
- Interesting states have some overlap with delay accounting
- Exporting of per-task stats can be done through taskstats netlink
interface
5. Enhanced Linux System Accounting (Guillaume Thouvenin)
---------------------------------------------------------
- Group tasks at a user level into "jobs" and aggregate, at user level,
per-task statistics collected by CSA and/or BSD process accounting.
- ELSA does not introduce any new requirement for either collection or
export of statistics from the kernel. It can use either BSD and/or CSA's
method of using an accounting file.
- ELSA needs notification of forks and exits which it can already get
through the process events connector in the kernel.
Hence ELSA's needs are either met by the kernel today or are a strict
subset of CSA (since BSD accounting is already there).
6. pnotify (Erik Jacobson)
--------------------------
- Infrastructure for kernel modules to be notified when an event (like
fork/exit/exec) happens to a task. Also provides some per-task data for
the modules' convenience
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Patch 5/8] taskstats interface
2006-04-22 2:37 ` [Patch 5/8] taskstats interface Shailabh Nagar
@ 2006-04-27 1:12 ` Jay Lan
2006-04-27 4:00 ` Shailabh Nagar
0 siblings, 1 reply; 23+ messages in thread
From: Jay Lan @ 2006-04-27 1:12 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: linux-kernel, LSE
Hi Shailabh,
Thanks for your effort in taskstats interface! Really appreciated!
I think this interface can offer a good foundation for other packages
to build on.
Here are a few more comments:
1) You mentioned the "version number within the (taskstats)
structure" in taskstats.txt and a few other places, but i do not see
that field defined in struct taskstats in taskstats.h?
2) In taskstats.txt "Extending taskstats" section, you mentioned two
ways to extend the interface. The second method looks like a method
to encourage other package developers to create their own interface
(ie, not taskstats) based on generic netlink to avoid reading a large
number of fields of no interest to a particular application? I will be
fine with this as long as it is understood and agreed.
Alternatively, you may have considered the pros and cons of #ifdef
fields specific to only one accounting package in the struct taskstats.
If you do, care to share your thoughts? Specific payload information
can be carried in the version field. I am sure the version number of
struct taskstats does not need 64 bits. With the version number and
payload info, application can surely interpret the taskstats data correctly.
3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
Usage section below...", but that section does not exist.
4) In do_exit() routine, you do:
+ taskstats_exit_alloc(&tidstats, &tgidstats);
The tidstats and tgidstats are checked in taskstats_exit_send() in
taskstats.c for allocation failure, but a lot has been processed before
the check. The allocation failure happens when system is stressed in
memory. I think we want to do the check earlier?
Regards,
- jay
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [Patch 5/8] taskstats interface
2006-04-27 1:12 ` Jay Lan
@ 2006-04-27 4:00 ` Shailabh Nagar
2006-04-27 6:42 ` [Lse-tech] " Balbir Singh
0 siblings, 1 reply; 23+ messages in thread
From: Shailabh Nagar @ 2006-04-27 4:00 UTC (permalink / raw)
To: Jay Lan; +Cc: linux-kernel, LSE
Jay Lan wrote:
>Hi Shailabh,
>
>Thanks for your effort in taskstats interface! Really appreciated!
>I think this interface can offer a good foundation for other packages
>to build on.
>
>Here are a few more comments:
>
>1) You mentioned the "version number within the (taskstats)
> structure" in taskstats.txt and a few other places, but i do not see
> that field defined in struct taskstats in taskstats.h?
>
>
Missed out on that. Need to add it back in.
>2) In taskstats.txt "Extending taskstats" section, you mentioned two
> ways to extend the interface. The second method looks like a method
> to encourage other package developers to create their own interface
> (ie, not taskstats) based on generic netlink to avoid reading a large
> number of fields of no interest to a particular application? I will be
> fine with this as long as it is understood and agreed.
>
>
Yes, the second method is for other packages, which have very little in
common with struct taskstats, to extend the stats returned (using netlink
attribs to extend rather than extending the structure).
> Alternatively, you may have considered the pros and cons of #ifdef
> fields specific to only one accounting package in the struct taskstats.
> If you do, care to share your thoughts?
>
I'd rather avoid doing an #ifdef'ed definition of the fields based on
configuration of one or the other accounting package...it'll add
complexity for userspace parsing of the structure.
It's quite acceptable for the fields to contain zero if the
corresponding package isn't configured.
>Specific payload information
> can be carried in the version field. I am sure the version number of
> struct taskstats does not need 64 bits. With the version number and
> payload info, application can surely interpret the taskstats data correctly.
>
>
By "payload info" you mean some sort of bitmask (or encoding) which
specifies which fields are present or absent ? I suppose that could be
done but it adds unnecessary complexity ? e.g once delay accounting is
there, all six to eight fields corresponding to it will be present...I
don't see much value in further being able to configure cpu delays, mem
delays etc. separately. Is that different for CSA ?
>3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
> Usage section below...", but that section does not exist.
>
>
Thanks for pointing it out. Should replace it with "per-tgid stats section".
>4) In do_exit() routine, you do:
>+ taskstats_exit_alloc(&tidstats, &tgidstats);
>
> The tidstats and tgidstats are checked in taskstats_exit_send() in
> taskstats.c for allocation failure, but a lot has been processed before
> the check. The allocation failure happens when system is stressed in
> memory. I think we want to do the check earlier?
>
>
Since accounting is non-critical, I didn't see the need for doing the
check earlier if we're not going to do anything about it. The first use
of the allocated structure is in taskstats_exit_send(), where filling of
the stats is not done if allocation failed. What would you suggest we do,
on allocation failure, if the check is performed immediately after the
alloc ?
--Shailabh
>
>Regards,
> - jay
>
>
>
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-27 4:00 ` Shailabh Nagar
@ 2006-04-27 6:42 ` Balbir Singh
2006-04-27 17:52 ` Jay Lan
0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-04-27 6:42 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: Jay Lan, linux-kernel, LSE
On Thu, Apr 27, 2006 at 12:00:43AM -0400, Shailabh Nagar wrote:
> Jay Lan wrote:
>
> >Hi Shailabh,
> >
> >Thanks for your effort in taskstats interface! Really appreciated!
> >I think this interface can offer a good foundation for other packages
> >to build on.
> >
> >Here are a few more comments:
> >
> >1) You mentioned the "version number within the (taskstats)
> > structure" in taskstats.txt and a few other places, but i do not see
> > that field defined in struct taskstats in taskstats.h?
> >
> >
> Missed out on that. Need to add it back in.
There is a version field in genl_family as well, which can also be used
for versioning. When a user space tool queries for the family
id, it can obtain and interpret the version information.
>
> >2) In taskstats.txt "Extending taskstats" section, you mentioned two
> > ways to extend the interface. The second method looks like a method
> > to encoureage other package developers to create their own interface
> > (ie, not taskstats) based on generic netlink to avoid reading large
> >number
> > of fields not interested to other particular applications? I will be
> >fine
> > with this as long as it is understood and agreed.
> >
> >
> Yes, the second method is for other packages, which have very little in
> common with the struct
> taskstats to extend the stats returned (using netlink attribs to extend
> rather than extending the structure).
The second method will require the following
1. An API to return the length of data it wants to fill in
2. Another API to fill in the statistics along with the type -
Like Shailabh mentioned, this will require creating a new TASKSTATS_TYPE_XXXX
>
> > Alternatively, you may have considered the pros and cons of #ifdef
> > fields specific to only one accounting package in the struct taskstats.
> > If you do, care to share your thoughts?
> >
> I'd rather avoid doing an #ifdef'ed definition of the fields based on
> configuration of one or the other
> accounting package...it'll add complexity for userspace parsing of the
> structure.
>
> Its quite acceptable to have the fields have zero as content if the
> corresponding package isn't configured.
>
I agree with Shailabh, building in knowledge of other subsystems into
taskstats.h might not be the best choice.
>
> >Specific payload information
> > can be carried in the version field. I am sure the version number of
> >struct
> > taskstats does not need 64 bits. With the version number and payload
> > info, application can surely interpret the taskstats data correctly.
> >
> >
> By "payload info" you mean some sort of bitmask (or encoding) which
> specifies which fields are present
> or absent ? I suppose that could be done but it adds unnecessary
> complexity ? e.g once delay accounting is there,
> all six to eight fields corresponding to it will be present...I don't
> see much value in further being able to configure
> cpu delays, mem delays etc. separately. Is that different for CSA ?
Netlink attributes can be used to determine which attribute types are
present in the payload. libnl does a great job of providing a good set of
APIs to determine all attribute types present. This is one of the biggest
advantages I see of genetlink (attributes are optional and can co-exist
simultaneously)
>
>
> >3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
> > Usage section below...", but that section does not exist.
> >
> >
> Thanks for pointing it out. Should replace it with "per-tgid stats section".
>
> >4) In do_exit() routine, you do:
> >+ taskstats_exit_alloc(&tidstats, &tgidstats);
> >
> > The tidstats and tgidstats are checked in taskstats_exit_send() in
> > taskstats.c for allocation failure, but a lot has been processed before
> > the check. The allocation failure happens when system is stressed in
> > memory. I think we want to do the check earlier?
> >
> >
> Since accounting is non-critical, I didn't see the need for doing the
> check earlier if we're not going to do
> anything about it. The first use of the allocated structure is in the
> taskstats_exit_send() where filling of the
> stats is not done if allocation failed. What would you suggest we do, on
> allocation failure, if the check is
> performed immediately after the alloc ?
>
> --Shailabh
>
> >
> >Regards,
> >- jay
> >
> >
> >
>
>
>
<snip>
<--- Balbir
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-27 6:42 ` [Lse-tech] " Balbir Singh
@ 2006-04-27 17:52 ` Jay Lan
2006-04-27 18:27 ` Balbir Singh
0 siblings, 1 reply; 23+ messages in thread
From: Jay Lan @ 2006-04-27 17:52 UTC (permalink / raw)
To: balbir; +Cc: Shailabh Nagar, linux-kernel, LSE
Balbir Singh wrote:
> On Thu, Apr 27, 2006 at 12:00:43AM -0400, Shailabh Nagar wrote:
>
>>Jay Lan wrote:
>>
>>
>>>Hi Shailabh,
>>>
>>>Thanks for your effort in taskstats interface! Really appreciated!
>>>I think this interface can offer a good foundation for other packages
>>>to build on.
>>>
>>>Here are a few more comments:
>>>
>>>1) You mentioned the "version number within the (taskstats)
>>> structure" in taskstats.txt and a few other places, but i do not see
>>> that field defined in struct taskstats in taskstats.h?
>>>
>>>
>>
>>Missed out on that. Need to add it back in.
>
>
> There is a version field in genl_family as well. That can be used
> for versioning as well. When we user space tool queries for the family
> id, it can obtain and interpret the version information.
Hi Shailabh and Balbir,
Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing?
If they are meant to serve different purposes, we still need it.
>
>
>>>2) In taskstats.txt "Extending taskstats" section, you mentioned two
>>> ways to extend the interface. The second method looks like a method
>>> to encoureage other package developers to create their own interface
>>> (ie, not taskstats) based on generic netlink to avoid reading large
>>>number
>>> of fields not interested to other particular applications? I will be
>>>fine
>>> with this as long as it is understood and agreed.
>>>
>>>
>>
>>Yes, the second method is for other packages, which have very little in
>>common with the struct
>>taskstats to extend the stats returned (using netlink attribs to extend
>>rather than extending the structure).
>
>
> The second method will require the following
>
> 1. An API to return the length of data it wants to fill in
> 2. Another API to fill in the statistics along with the type -
> Like Shailabh mentioned, this will require creating a new TASKSTATS_TYPE_XXXX
>
>
>>> Alternatively, you may have considered the pros and cons of #ifdef
>>> fields specific to only one accounting package in the struct taskstats.
>>> If you do, care to share your thoughts?
>>>
>>
>>I'd rather avoid doing an #ifdef'ed definition of the fields based on
>>configuration of one or the other
>>accounting package...it'll add complexity for userspace parsing of the
>>structure.
>>
>>Its quite acceptable to have the fields have zero as content if the
>>corresponding package isn't configured.
>>
>
>
> I agree with Shailabh, building in knowledge of other subsystems into
> taskstats.h might not be the best choice.
>
>
>>>Specific payload information
>>> can be carried in the version field. I am sure the version number of
>>>struct
>>> taskstats does not need 64 bits. With the version number and payload
>>> info, application can surely interpret the taskstats data correctly.
>>>
>>>
>>
>>By "payload info" you mean some sort of bitmask (or encoding) which
>>specifies which fields are present
>>or absent ? I suppose that could be done but it adds unnecessary
>>complexity ? e.g once delay accounting is there,
>>all six to eight fields corresponding to it will be present...I don't
>>see much value in further being able to configure
>>cpu delays, mem delays etc. separately. Is that different for CSA ?
I was thinking of a bitmask thing. But instead of keying specific
fields, one bit may be used to key delay accounting, and another bit
for CSA, et al. This way you do not need to have CSA-specific fields
in the payload and applications know how to correctly interpret the
payload. Taskstats and application do not need to have knowledge of
accounting packages, only need to set the bitmasks correctly.
When we start sending sys stats of each task to userland, that is
a lot of data. Note that BSD accounting even uses the encode_comp_t()
routine to compress data into a 13-bit fraction with a 3-bit exponent
field to shrink its size. Even though you do not need to care
about those zeros in taskstats, they still need to be delivered
through the netlink socket.
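
For readers unfamiliar with the compressed format mentioned above, here is a
minimal userspace sketch of a 13-bit fraction with a 3-bit base-8 exponent.
It is modelled on the kernel's encode_comp_t() but is not a copy of it; the
names encode_comp() and decode_comp() are made up for illustration, and the
decoder is only an assumption about how a userspace tool would expand the
value again.

```c
#include <assert.h>
#include <stdint.h>

#define MANTSIZE 13                       /* 13-bit fraction (mantissa) */
#define EXPSIZE  3                        /* 3-bit exponent, base 8 */
#define MAXFRACT ((1u << MANTSIZE) - 1)   /* largest mantissa: 8191 */

/* Hypothetical helper: squeeze a counter into 16 bits (lossy). */
static uint16_t encode_comp(unsigned long long value)
{
	unsigned int exp = 0, rnd = 0;

	while (value > MAXFRACT) {
		/* Remember the highest bit that is about to be dropped. */
		rnd = (unsigned int)(value & (1u << (EXPSIZE - 1)));
		value >>= EXPSIZE;        /* divide by 8 */
		exp++;
	}

	/* Round up if needed, and renormalise if the mantissa overflows. */
	if (rnd && ++value > MAXFRACT) {
		value >>= EXPSIZE;
		exp++;
	}

	/* Values above 8191 * 8^7 do not fit in this format. */
	assert(exp < (1u << EXPSIZE));

	return (uint16_t)((exp << MANTSIZE) + value);
}

/* Hypothetical helper: expand the 16-bit value back to an approximation. */
static unsigned long long decode_comp(uint16_t c)
{
	unsigned long long mant = c & MAXFRACT;
	unsigned int exp = (c >> MANTSIZE) & ((1u << EXPSIZE) - 1);

	return mant << (EXPSIZE * exp);
}
```

The encoding is lossy: values up to 8191 survive unchanged, while 8196 is
stored as mantissa 1025 with exponent 1 and decodes to 8200. That loss of
precision is the price of the 16-bit size.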
I must admit that this may create a point of failure due to the
payload info not set correctly according to the CONFIG flags.
The idea was to eliminate the need for method #2, but maybe
method #2 is better...
I am a little confused after reading Balbir's reply. It seems to
me that Shailabh suggested to create a different struct to contain
stats data. Is that also what Balbir talked about? If a different
package builds a different taskstat-like interface as suggested
in #2, would the data travel on the same socket as delay
accounting?
>
>
> Netlink attributes can be used to determine which attribute types are
> present in the payload. libnl does a great job of providing a good set of
> APIs to determine all attribute types present. This is one of the biggest
> advantages I see of genetlink (attributes are optional and can co-exist
> simultaneously)
>
>
>>
>>>3) In taskstats.txt "Usage" section, you mentioned "... in the Advanced
>>> Usage section below...", but that section does not exist.
>>>
>>>
>>
>>Thanks for pointing it out. Should replace it with "per-tgid stats section".
>>
>>
>>>4) In do_exit() routine, you do:
>>>+ taskstats_exit_alloc(&tidstats, &tgidstats);
>>>
>>> The tidstats and tgidstats are checked in taskstats_exit_send() in
>>> taskstats.c for allocation failure, but a lot has been processed before
>>> the check. The allocation failure happens when system is stressed in
>>> memory. I think we want to do the check earlier?
>>>
>>>
>>
>>Since accounting is non-critical, I didn't see the need for doing the
>>check earlier if we're not going to do
>>anything about it. The first use of the allocated structure is in the
>>taskstats_exit_send() where filling of the
>>stats is not done if allocation failed. What would you suggest we do, on
>>allocation failure, if the check is
>>performed immediately after the alloc ?
I would suggest to do the check at the beginning of
taskstats_exit_send() before mutex_lock(&taskstats_exit_mutex).
Regards,
- jay
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-27 17:52 ` Jay Lan
@ 2006-04-27 18:27 ` Balbir Singh
2006-04-27 19:34 ` Jay Lan
0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-04-27 18:27 UTC (permalink / raw)
To: Jay Lan; +Cc: Shailabh Nagar, linux-kernel, LSE
Hi Jay,
> Hi Shailabh and Balbir,
>
> Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing?
> If they are meant to serve different purposes, we still need it.
>
Yes, that's true. But for now, from what I can see, one version should
be sufficient.
<snip>
> I was thinking of a bitmask thing. But instead of keying specific
> fields, one bit may be used to key delay accounting, and another bit
> for CSA, el at. This way you do not need to have CSA-specifc fields
> in the payload and applications know how to correctly interpret the
> payload. Taskstats and application do not need to have knowledge of
> accounting packages, only need to set the bitmasks correctly.
>
Yes, but scanning the entire payload for various types is also feasible. It is
a bit slow, but feasible and generally the recommended approach for
dealing with genetlink types. What you are saying is still possible, the
application can ignore types it does not understand.
> When we start sending sys stats of each tasks to userland, that is
> s lot of data. Note that BSD accounting even uses encode_comp_t()
> routine to compress data into a 13-bit fraction with 3-bit exponent
> field to shrink its size. Even though you do not need to care
> about those zero's in taskstats, they still need to be delievered
> through netlink socket.
Yes, that's true. We can leave the decision of compressing, etc. to the
specific subsystem. It can encode it and the user level application
can decode the data.
>
> I must admit that this may create a point of failure due to the
> payload info not set correctly according to the CONFIG flags.
>
> The idea was to eliminate the need of #2 methods, but maybe
> #2 method is better...
>
> I am a little confused after reading Balbir's reply. It seems to
> me that Shailabh suggested to create a different struct to contain
> stats data. Is that also what Balbir talked about? If a different
> package builds a different taskstat-like interface as suggested
> in #2, would the data travel on the same socket as delay
> accounting?
Sorry for the confusion. Yes, even I would recommend creating a different
struct for the stats data. The data will pass over the same socket as delay
accounting (separate sockets can be used, but that would become inefficient).
>
> I would suggest to do the check at the beginning of
> taskstats_exit_send() before mutex_lock(&taskstats_exit_mutex).
Good suggestion, we can move the check to that point.
>
> Regards,
> - jay
--
Warm Regards,
<--- Balbir
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-27 18:27 ` Balbir Singh
@ 2006-04-27 19:34 ` Jay Lan
2006-04-28 2:59 ` Balbir Singh
0 siblings, 1 reply; 23+ messages in thread
From: Jay Lan @ 2006-04-27 19:34 UTC (permalink / raw)
To: balbir; +Cc: Shailabh Nagar, linux-kernel, LSE
Hi Balbir,
Balbir Singh wrote:
>>Are TASKSTATS_GENL_VERSION and TASKSTATS_VERSION the same thing?
>>If they are meant to serve different purposes, we still need it.
>>
>
>
> Yes, thats true. But for now from what I can see, one version should
> be sufficient.
If we envision a need for it in the future, we'd better put it in
today. It would be nice to have the revision number at the beginning of
the struct. Shailabh's instructions say to add new fields after existing
fields.
>
> <snip>
>
>
>>I was thinking of a bitmask thing. But instead of keying specific
>>fields, one bit may be used to key delay accounting, and another bit
>>for CSA, el at. This way you do not need to have CSA-specifc fields
>>in the payload and applications know how to correctly interpret the
>>payload. Taskstats and application do not need to have knowledge of
>>accounting packages, only need to set the bitmasks correctly.
>>
>
>
> Yes, but scanning the entire payload for various types is also feasible. It is
> a bit slow, but feasible and generally the recommended approach for
> dealing with genetlink types. What you are saying is still possible, the
> application can ignore types it does not understand.
>
>
>>When we start sending sys stats of each tasks to userland, that is
>>s lot of data. Note that BSD accounting even uses encode_comp_t()
>>routine to compress data into a 13-bit fraction with 3-bit exponent
>>field to shrink its size. Even though you do not need to care
>>about those zero's in taskstats, they still need to be delievered
>>through netlink socket.
>
>
> Yes, thats true. We can leave the decision of compressing, etc to the
> specific subsystem. It can encode it and the user level application
> can decode the data.
I am sorry that I did not make myself clear. My suggestion of using
the bitmask payload info is to be combined with #ifdef CONFIG_* to
eliminate unnecessary fields from the traffic. I am concerned about
losing data due to the application not reading data fast enough.
Well, we can revisit this suggestion when we start losing data
though. ;-)
Regards,
- jay
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-27 19:34 ` Jay Lan
@ 2006-04-28 2:59 ` Balbir Singh
2006-04-28 18:20 ` Jay Lan
0 siblings, 1 reply; 23+ messages in thread
From: Balbir Singh @ 2006-04-28 2:59 UTC (permalink / raw)
To: Jay Lan; +Cc: Shailabh Nagar, linux-kernel, LSE
> If we envision a need of it in the future, we'd better put it in
> today. It would be nice to have the revision number at beginning of
> the struct. Shailabh's instruction says to add new field after existing
> fields.
>
Yes, true. It does not hurt to have a version number for taskstats.
I will add it in.
<snip>
>
> I am sorry that i did not make myself clear. My suggestion of using
> the bitmask payload info is to be combined with #ifdef CONFIG_* to
> eliminate unnecessary fields from the traffic. I am concerned about
> losing data due to application not reading data fast enough.
>
> Well, we can revisit this suggestion when we start losing data
> though. ;-)
Like Shailabh said #ifdef CONFIG_* adds complexity for userspace parsing
of the structure, but if it helps avoid sending unnecessary data we
can consider using that approach.
Would something like the structure below be useful?
struct csastats {
#if defined(CONFIG_CSA) || defined(CONFIG_CSA_MODULE)
char acctent[sizeof(struct acctcsa) +
sizeof(struct acctmem) +
sizeof(struct acctio)];
int filled;
#endif
};
The filled member can be a bool or an int to indicate that the structure
contains meaningful data and the CONFIG_* is used to control the
inclusion of meaningful fields. Instead of using a bitmap we use
the filled member.
Is this what you had in mind?
--
<--- Balbir
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-28 2:59 ` Balbir Singh
@ 2006-04-28 18:20 ` Jay Lan
2006-04-28 18:35 ` Balbir Singh
0 siblings, 1 reply; 23+ messages in thread
From: Jay Lan @ 2006-04-28 18:20 UTC (permalink / raw)
To: balbir; +Cc: Shailabh Nagar, linux-kernel, LSE
Balbir Singh wrote:
>>If we envision a need of it in the future, we'd better put it in
>>today. It would be nice to have the revision number at beginning of
>>the struct. Shailabh's instruction says to add new field after existing
>>fields.
>>
>>
>
>Yes, true. It does not hurt to have a version number for taskstats.
>I will add it in.
>
><snip>
>
>
>>I am sorry that i did not make myself clear. My suggestion of using
>>the bitmask payload info is to be combined with #ifdef CONFIG_* to
>>eliminate unnecessary fields from the traffic. I am concerned about
>>losing data due to application not reading data fast enough.
>>
>>Well, we can revisit this suggestion when we start losing data
>>though. ;-)
>>
>
>Like Shailabh said #ifdef CONFIG_* adds complexity for userspace parsing
>of the structure, but if it helps avoid sending unnecessary data we
>can consider using that approach.
>
>Would something like the structure below be useful?
>
>struct csastats {
>#if defined(CONFIG_CSA) || defined(CONFIG_CSA_MODULE)
> char acctent[sizeof(struct acctcsa) +
> sizeof(struct acctmem) +
> sizeof(struct acctio)];
> int filled;
>#endif
>};
>
>The filled member can be a bool or an int to indicate that the structure
>contains meaningful data and the CONFIG_* is used to control the
>inclusion of meaningful fields. Instead of using a bitmap we use
>the filled member.
>
>Is this what you had in mind?
>
Not exactly. The payload information must always be available to the
application.
On second thought, the idea of one big taskstats struct with many
#ifdef CONFIG_* blocks is not really a good idea. My goal is to cut down
unnecessary data being transferred through the socket.
Here is my Take 2. We can have a taskstats header containing the taskstats
version and other general fields useful to more than one taskstats
application, including payload information. Then, we define
accounting subsystem specific structs for delayacct, csa, etc.
The kernel/{delayacct.c,csa.c,etc.c} set the payload information and
fill the buffer with the desired subsystem structs. The header thus contains
enough information to tell applications how to map the data following
the header.
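
The "Take 2" layout above can be sketched in userspace C as follows. This is
only an illustration of the idea, not code from any posted patch: every name
here (ts_header, TS_PAYLOAD_*, ts_delayacct, ts_csa, ts_pack, ts_unpack) and
every field inside the subsystem structs is hypothetical. The header carries a
version and a payload bitmask, and the subsystem structs follow in bit order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TS_VERSION 1

/* Hypothetical payload bits, one per accounting subsystem. */
#define TS_PAYLOAD_DELAYACCT (1u << 0)
#define TS_PAYLOAD_CSA       (1u << 1)

/* Hypothetical common header that is always sent. */
struct ts_header {
	uint16_t version;
	uint16_t payload;   /* which subsystem structs follow, in bit order */
	uint32_t pid;
};

/* Hypothetical subsystem structs; the fields are placeholders. */
struct ts_delayacct { uint64_t cpu_delay_ns; uint64_t blkio_delay_ns; };
struct ts_csa       { uint64_t read_bytes;   uint64_t write_bytes; };

/* Pack the header plus whichever structs are supplied (NULL means absent).
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t ts_pack(unsigned char *buf, size_t cap, uint32_t pid,
		      const struct ts_delayacct *d, const struct ts_csa *c)
{
	struct ts_header h = { .version = TS_VERSION, .payload = 0, .pid = pid };
	size_t need = sizeof(h), off = sizeof(h);

	assert(sizeof(struct ts_header) == 8);   /* layout has no padding */

	if (d) { h.payload |= TS_PAYLOAD_DELAYACCT; need += sizeof(*d); }
	if (c) { h.payload |= TS_PAYLOAD_CSA;       need += sizeof(*c); }
	if (need > cap)
		return 0;

	memcpy(buf, &h, sizeof(h));
	if (d) { memcpy(buf + off, d, sizeof(*d)); off += sizeof(*d); }
	if (c) { memcpy(buf + off, c, sizeof(*c)); off += sizeof(*c); }
	return off;
}

/* Unpack using only the header's bitmask to locate each struct.
 * h, d and c must be non-NULL. Returns 0 on success, -1 if truncated. */
static int ts_unpack(const unsigned char *buf, size_t len, struct ts_header *h,
		     struct ts_delayacct *d, struct ts_csa *c)
{
	size_t off = sizeof(*h);

	if (len < sizeof(*h))
		return -1;
	memcpy(h, buf, sizeof(*h));

	if (h->payload & TS_PAYLOAD_DELAYACCT) {
		if (len - off < sizeof(*d))
			return -1;
		memcpy(d, buf + off, sizeof(*d));
		off += sizeof(*d);
	}
	if (h->payload & TS_PAYLOAD_CSA) {
		if (len - off < sizeof(*c))
			return -1;
		memcpy(c, buf + off, sizeof(*c));
		off += sizeof(*c);
	}
	return 0;
}
```

A message for a kernel without delay accounting then costs only the header
plus the CSA struct, which is the traffic saving this proposal is after. The
trade-off is that the reader must know the size of every struct whose bit is
set in order to find the ones after it.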
Would IBM propose more accounting subsystems besides delayacct?
If we only see delayacct and csa on the horizon, this scheme is really
not necessary since delayacct does not have as much data (as csa :))
and csa can use part of the delayacct data. You gain more than
csa can benefit from this. ;-) I guess i just speak from design point
of view. :)
But, if one day somebody who does not need a paycheck decides
to convert BSD accounting to use taskstats interface, this can
be helpful.
Thanks,
- jay
* Re: [Lse-tech] Re: [Patch 5/8] taskstats interface
2006-04-28 18:20 ` Jay Lan
@ 2006-04-28 18:35 ` Balbir Singh
0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-04-28 18:35 UTC (permalink / raw)
To: Jay Lan; +Cc: Shailabh Nagar, linux-kernel, LSE
> >Is this what you had in mind?
> >
> No exactly. The payload information must be always available for
> application.
>
> On a second thought, the idea of one big taskstats struct with many
> #ifconfig is not really a good idea. My goal is to cut down unnecessary
> data being transfered throught the socket.
Yes, so we agree that #ifdef CONFIG_* is not good.
>
> Here is my Take 2. We can have a taskstats header containing taskstats
> version and other general fields useful to more than one taskstats
> application including a payload information. Then, we define
> accounting subsystem specific structs for delayacct, csa, etc.
> The kernel/{delayacct.c,csa.c,etc.c} set the payload information and
> fill the buffer with desired subsystem structs. The header thus contain
> enough information to tell applications how to map the data following
> the header.
I agree with this suggestion.
Each netlink attribute contains the following fields (also referred to as TLV)
+------+--------+-------+
| Type | Length | Value |
+------+--------+-------+
The type is meant to serve the purpose of the header you describe. The
type value can be used by the application to map the data.
getdelays.c is a sample application posted in the previous patches;
it interprets data based on type.
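
A minimal userspace sketch of type-driven parsing follows. It is not taken
from getdelays.c and does not use libnl; tlv_hdr, TLV_ALIGN, tlv_put, tlv_find
and the EX_TYPE_* values are hypothetical names for illustration. The header
mirrors the layout of struct nlattr, where the 16-bit length (header plus
value) is stored before the 16-bit type and each attribute is padded to a
4-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct nlattr: length covers header + value. */
struct tlv_hdr {
	uint16_t len;
	uint16_t type;
};

#define TLV_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)
#define TLV_HDRLEN   TLV_ALIGN(sizeof(struct tlv_hdr))

/* Hypothetical attribute types, for illustration only. */
enum { EX_TYPE_PID = 1, EX_TYPE_STATS = 2 };

/* Append one attribute at offset off. Returns the new offset,
 * or 0 if the attribute does not fit in the buffer. */
static size_t tlv_put(unsigned char *buf, size_t off, size_t cap,
		      uint16_t type, const void *val, uint16_t vlen)
{
	struct tlv_hdr h;
	size_t total;

	assert(off <= cap);
	assert(vlen <= UINT16_MAX - TLV_HDRLEN);

	h.len = (uint16_t)(TLV_HDRLEN + vlen);
	h.type = type;
	total = TLV_ALIGN(h.len);
	if (cap - off < total)
		return 0;

	memset(buf + off, 0, total);              /* zero the padding */
	memcpy(buf + off, &h, sizeof(h));
	memcpy(buf + off + TLV_HDRLEN, val, vlen);
	return off + total;
}

/* Return a pointer to the value of the first attribute of type want,
 * skipping every attribute of any other type. NULL if absent or malformed. */
static const unsigned char *tlv_find(const unsigned char *buf, size_t len,
				     uint16_t want, uint16_t *vlen)
{
	size_t off = 0;

	while (off + TLV_HDRLEN <= len) {
		struct tlv_hdr h;

		memcpy(&h, buf + off, sizeof(h));
		if (h.len < TLV_HDRLEN || h.len > len - off)
			return NULL;              /* malformed attribute */
		if (h.type == want) {
			*vlen = (uint16_t)(h.len - TLV_HDRLEN);
			return buf + off + TLV_HDRLEN;
		}
		off += TLV_ALIGN(h.len);          /* unknown type: skip it */
	}
	return NULL;
}
```

Because the reader steps over any type it does not recognise, a new
accounting subsystem can add its own attribute without breaking older
applications, at the cost of a linear scan through the payload.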
>
> Would IBM propose more accounting subsystems besides delayacct?
> If we only see delayacct and csa on the horizon, this scheme is really
> not necessary since delayacct does not have as much data (as csa :))
> and csa can use part of the delayacct data. You gain more than
> csa can benefit from this. ;-) I guess i just speak from design point
> of view. :)
>
> But, if one day somebody who does not need a paycheck decides
> to convert BSD accounting to use taskstats interface, this can
> be helpful.
>
Yes, I think in the long term it would be more useful to use the scheme
of adding subsystem structs. taskstats.txt explains the process of
extending taskstats. Point #2 is the same as what we have just discussed.
Could you please see if the text needs any changes based on our discussions
so far (taskstats.txt was posted in the delayacct-doc.patch).
> Thanks,
> - jay
>
>
--
<--- Balbir
* [Patch 5/8] taskstats interface
@ 2006-05-02 6:18 Balbir Singh
0 siblings, 0 replies; 23+ messages in thread
From: Balbir Singh @ 2006-05-02 6:18 UTC (permalink / raw)
To: linux-kernel; +Cc: lse-tech, jlan
Changelog
Fixes comments by jlan@engr.sgi.com
- separate out taskstats interface from delay accounting completely including
separate documentation
- permit different accounting subsystems to fill in parts of common
structure separately before common taskstats code sends it out on genetlink
- send common structure to userspace after update_hiwater_rss and before
exit_mm in do_exit
- fix references to sections
Fixes comments by akpm
- comment to indicate locking used for taskstats struct
- whitespace issues
- unnecessary use of constant taskstats_version
- uninline fill_pid(), fill_tgid()
- unnecessary cast to pid_t in taskstats_send_stats()
- too early evaluation of thread_group_empty() in taskstats_exit_pid
- returning -EFAULT on genl_register_family failure in taskstats_init
- comment for late_initcall of taskstats_init
No fix needed
- moving kmem_cache_free of tsk->delays outside the exit mutex
(mutex shifted and tsk->delays freeing being done elsewhere now)
- __delayacct_add_tsk returning -EINVAL if delay accounting isn't enabled
user should know that no values can be returned
returning zero would be misleading
- combining fill_pid(), fill_tgid() into a common function
combined code convoluted and less readable
taskstats-setup.patch
Create a "taskstats" interface based on generic netlink
(NETLINK_GENERIC family), for getting statistics of
tasks and thread groups during their lifetime and when they exit.
The interface is intended for use by multiple accounting packages
though it is being created in the context of delay accounting.
This patch creates the interface without populating the
fields of the data that is sent to the user in response to a command
or upon the exit of a task. Each accounting package interested in using
taskstats has to provide an additional patch to add its stats to the
common structure.
Signed-off-by: Shailabh Nagar <nagar@us.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---
Documentation/accounting/taskstats.txt | 146 ++++++++++++++
include/linux/taskstats.h | 85 ++++++++
include/linux/taskstats_kern.h | 57 +++++
init/Kconfig | 12 +
init/main.c | 2
kernel/Makefile | 1
kernel/exit.c | 7
kernel/taskstats.c | 329 +++++++++++++++++++++++++++++++++
8 files changed, 639 insertions(+)
diff -puN /dev/null Documentation/accounting/taskstats.txt
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-balbir/Documentation/accounting/taskstats.txt 2006-05-02 10:34:36.000000000 +0530
@@ -0,0 +1,146 @@
+Per-task statistics interface
+-----------------------------
+
+
+Taskstats is a netlink-based interface for sending per-task and
+per-process statistics from the kernel to userspace.
+
+Taskstats was designed for the following benefits:
+
+- efficiently provide statistics during lifetime of a task and on its exit
+- unified interface for multiple accounting subsystems
+- extensibility for use by future accounting patches
+
+Terminology
+-----------
+
+"pid", "tid" and "task" are used interchangeably and refer to the standard
+Linux task defined by struct task_struct. per-pid stats are the same as
+per-task stats.
+
+"tgid", "process" and "thread group" are used interchangeably and refer to the
+tasks that share an mm_struct i.e. the traditional Unix process. Despite the
+use of tgid, there is no special treatment for the task that is thread group
+leader - a process is deemed alive as long as it has any task belonging to it.
+
+Usage
+-----
+
+To get statistics during a task's lifetime, userspace opens a unicast netlink
+socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
+The response contains statistics for a task (if pid is specified) or the sum of
+statistics for all tasks of the process (if tgid is specified).
+
+To obtain statistics for tasks which are exiting, userspace opens a multicast
+netlink socket. Each time a task exits, two records are sent by the kernel to
+each listener on the multicast socket. The first is the per-pid task's
+statistics and the second is the sum for all tasks of the process to which the
+task belongs (the task does not need to be the thread group leader). The need
+for per-tgid stats to be sent for each exiting task is explained in the
+per-tgid stats section below.
+
+
+Interface
+---------
+
+The user-kernel interface is encapsulated in include/linux/taskstats.h
+
+To avoid this documentation becoming obsolete as the interface evolves, only
+an outline of the current version is given. taskstats.h always overrides the
+description here.
+
+struct taskstats is the common accounting structure for both per-pid and
+per-tgid data. It is versioned and can be extended by each accounting subsystem
+that is added to the kernel. The fields and their semantics are defined in the
+taskstats.h file.
+
+The data exchanged between user and kernel space is a netlink message belonging
+to the NETLINK_GENERIC family and using the netlink attributes interface.
+The messages are in the format
+
+ +----------+- - -+-------------+-------------------+
+ | nlmsghdr | Pad | genlmsghdr | taskstats payload |
+ +----------+- - -+-------------+-------------------+
+
+
+The taskstats payload is one of the following three kinds:
+
+1. Commands: Sent from user to kernel. The payload is one attribute, of type
+TASKSTATS_CMD_ATTR_PID/TGID, containing a u32 pid or tgid in the attribute
+payload. The pid/tgid denotes the task/process for which userspace wants
+statistics.
+
+2. Response for a command: sent from the kernel in response to a userspace
+command. The payload is a series of three attributes of type:
+
+a) TASKSTATS_TYPE_AGGR_PID/TGID : attribute containing no payload but indicates
+a pid/tgid will be followed by some stats.
+
+b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
+is being returned.
+
+c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
+same structure is used for both per-pid and per-tgid stats.
+
+3. New message sent by kernel whenever a task exits. The payload consists of a
+ series of attributes of the following type:
+
+a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
+b) TASKSTATS_TYPE_PID: contains exiting task's pid
+c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
+d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
+e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
+f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
+
+
+per-tgid stats
+--------------
+
+Taskstats provides per-process stats, in addition to per-task stats, since
+resource management is often done at a process granularity and aggregating task
+stats in userspace alone is inefficient and potentially inaccurate (due to lack
+of atomicity).
+
+However, maintaining per-process, in addition to per-task stats, within the
+kernel has space and time overheads. Hence the taskstats implementation
+dynamically sums up the per-task stats for each task belonging to a process
+whenever per-process stats are needed.
+
+Not maintaining per-tgid stats creates a problem when userspace is interested
+in getting these stats when the process dies i.e. the last thread of
+a process exits. It isn't possible to simply return some aggregated per-process
+statistic from the kernel.
+
+The approach taken by taskstats is to return the per-tgid stats *each* time
+a task exits, in addition to the per-pid stats for that task. Userspace can
+maintain task<->process mappings and use them to maintain the per-process stats
+in userspace, updating the aggregate appropriately as the tasks of a process
+exit.
+
+Extending taskstats
+-------------------
+
+There are two ways to extend the taskstats interface to export more
+per-task/process stats as patches to collect them get added to the kernel
+in future:
+
+1. Adding more fields to the end of the existing struct taskstats. Backward
+ compatibility is ensured by the version number within the
+ structure. Userspace will use only the fields of the struct that correspond
+ to the version it is using.
+
+2. Defining separate statistic structs and using the netlink attributes
+ interface to return them. Since userspace processes each netlink attribute
+ independently, it can always ignore attributes whose type it does not
+ understand (because it is using an older version of the interface).
+
+
+Choosing between 1. and 2. is a matter of trading off flexibility and
+overhead. If only a few fields need to be added, then 1. is the preferable
+path since the kernel and userspace don't need to incur the overhead of
+processing new netlink attributes. But if the new fields expand the existing
+struct too much, requiring disparate userspace accounting utilities to
+unnecessarily receive large structures whose fields are of no interest, then
+2., returning them as separate netlink attributes, would be worthwhile.
+
+----
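Reviewer note: the following is a minimal userspace sketch of how a
listener might walk the attribute stream described in the documentation
above, skipping attribute types it does not understand. It uses neither
libnl nor the kernel headers. struct attr_hdr is a simplified stand-in for
struct nlattr (assumed 4-byte header, 4-byte alignment), and walk_attrs()
and struct exit_ids are hypothetical names that are not part of this patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attribute types, copied from the taskstats.h added by this patch. */
enum {
	TASKSTATS_TYPE_UNSPEC = 0,
	TASKSTATS_TYPE_PID,
	TASKSTATS_TYPE_TGID,
	TASKSTATS_TYPE_STATS,
	TASKSTATS_TYPE_AGGR_PID,
	TASKSTATS_TYPE_AGGR_TGID
};

/* Assumption: simplified stand-in for struct nlattr. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define ATTR_HDRLEN	sizeof(struct attr_hdr)
#define ATTR_ALIGN(n)	(((size_t)(n) + 3) & ~(size_t)3)

/* Hypothetical result record: ids seen in one exit message. */
struct exit_ids {
	uint32_t pid;
	int have_pid;
	uint32_t tgid;
	int have_tgid;
};

/* Walk attributes in buf[0..len), descending into the AGGR_* nests. */
static int walk_attrs(const unsigned char *buf, size_t len,
		      struct exit_ids *out)
{
	while (len >= ATTR_HDRLEN) {
		struct attr_hdr h;
		const unsigned char *payload;
		size_t plen, step;

		memcpy(&h, buf, sizeof(h));
		if (h.len < ATTR_HDRLEN || h.len > len)
			return -1;	/* malformed attribute */
		payload = buf + ATTR_HDRLEN;
		plen = h.len - ATTR_HDRLEN;

		switch (h.type) {
		case TASKSTATS_TYPE_AGGR_PID:
		case TASKSTATS_TYPE_AGGR_TGID:
			if (walk_attrs(payload, plen, out) < 0)
				return -1;
			break;
		case TASKSTATS_TYPE_PID:
			if (plen < sizeof(uint32_t))
				return -1;
			memcpy(&out->pid, payload, sizeof(uint32_t));
			out->have_pid = 1;
			break;
		case TASKSTATS_TYPE_TGID:
			if (plen < sizeof(uint32_t))
				return -1;
			memcpy(&out->tgid, payload, sizeof(uint32_t));
			out->have_tgid = 1;
			break;
		case TASKSTATS_TYPE_STATS:
			/* struct taskstats payload; decoding omitted here */
			break;
		default:
			/* unknown type from a newer kernel: skip it */
			break;
		}

		step = ATTR_ALIGN(h.len);
		if (step > len)
			step = len;	/* last attribute may lack padding */
		buf += step;
		len -= step;
	}
	return 0;
}
```

A listener would call walk_attrs() on the payload that follows the generic
netlink header. Because unknown types fall through to the default case, an
older listener keeps working if a newer kernel adds attributes, which is
the property extension method 2. relies on.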
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-balbir/include/linux/taskstats.h 2006-05-02 10:35:22.000000000 +0530
@@ -0,0 +1,85 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ *
+ * To add new fields
+ * a) bump up TASKSTATS_VERSION
+ * b) add comment indicating new version number at end of struct
+ * c) add new fields after version comment; maintain 64-bit alignment
+ */
+
+#define TASKSTATS_VERSION 1
+
+struct taskstats {
+
+ /* Version 1 */
+
+ int filler_avoids_empty_struct_warnings;
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ * prior to __TASKSTATS_CMD_MAX
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request/get-response */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */
+ TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#endif /* _LINUX_TASKSTATS_H */
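Reviewer note: a rough sketch of how userspace might honour the "only add
fields at the bottom" rule from the header comment above. The struct in
this patch has only a filler field, so the layouts below (struct stats_v1,
struct stats_v2, their field names, and the leading version field) are
hypothetical, as are read_stats() and CLIENT_VERSION. The idea shown is
only that a client copies the prefix it knows and trusts no field newer
than the smaller of the two versions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical version 1 layout; fields keep 64-bit alignment. */
struct stats_v1 {
	uint16_t version;
	uint16_t pad[3];
	uint64_t cpu_delay_ns;
};

/* Hypothetical version 2 layout: new field appended at the end only. */
struct stats_v2 {
	uint16_t version;
	uint16_t pad[3];
	uint64_t cpu_delay_ns;
	uint64_t blkio_delay_ns;	/* added in version 2 */
};

#define CLIENT_VERSION 2	/* newest layout this client understands */

/*
 * Copy the known prefix of a received stats payload into *out.
 * Returns the highest version whose fields are valid in *out,
 * or -1 if the payload is shorter than the version 1 layout.
 */
static int read_stats(const void *payload, size_t len, struct stats_v2 *out)
{
	size_t n = len < sizeof(*out) ? len : sizeof(*out);

	if (len < sizeof(struct stats_v1))
		return -1;

	/* Fields the sender did not supply stay zero. */
	memset(out, 0, sizeof(*out));
	memcpy(out, payload, n);

	return out->version < CLIENT_VERSION ? out->version : CLIENT_VERSION;
}
```

With a version 1 sender the function returns 1 and blkio_delay_ns stays
zero; with a sender newer than the client the extra trailing bytes are
ignored and the function returns CLIENT_VERSION.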
diff -puN /dev/null include/linux/taskstats_kern.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-balbir/include/linux/taskstats_kern.h 2006-05-02 09:47:24.000000000 +0530
@@ -0,0 +1,57 @@
+/* taskstats_kern.h - kernel header for per-task statistics interface
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ */
+
+#ifndef _LINUX_TASKSTATS_KERN_H
+#define _LINUX_TASKSTATS_KERN_H
+
+#include <linux/taskstats.h>
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern kmem_cache_t *taskstats_cache;
+
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
+ struct taskstats **ptgidstats)
+{
+ *ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+ *ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+}
+
+static inline void taskstats_exit_free(struct taskstats *tidstats,
+ struct taskstats *tgidstats)
+{
+ if (tidstats)
+ kmem_cache_free(taskstats_cache, tidstats);
+ if (tgidstats)
+ kmem_cache_free(taskstats_cache, tgidstats);
+}
+
+extern void taskstats_exit_send(struct task_struct *, struct taskstats *,
+ struct taskstats *);
+extern void taskstats_init_early(void);
+
+#else
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
+ struct taskstats **ptgidstats)
+{}
+static inline void taskstats_exit_free(struct taskstats *ptidstats,
+ struct taskstats *ptgidstats)
+{}
+static inline void taskstats_exit_send(struct task_struct *tsk,
+ struct taskstats *tidstats,
+ struct taskstats *tgidstats)
+{}
+static inline void taskstats_init_early(void)
+{}
+#endif /* CONFIG_TASKSTATS */
+
+#endif
+
diff -puN init/Kconfig~taskstats-setup init/Kconfig
--- linux-2.6.17-rc3/init/Kconfig~taskstats-setup 2006-05-02 09:47:24.000000000 +0530
+++ linux-2.6.17-rc3-balbir/init/Kconfig 2006-05-02 10:35:22.000000000 +0530
@@ -150,6 +150,18 @@ config BSD_PROCESS_ACCT_V3
for processing it. A preliminary version of these tools is available
at <http://www.physik3.uni-rostock.de/tim/kernel/utils/acct/>.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ default n
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say N if unsure.
+
config TASK_DELAY_ACCT
bool "Enable per-task delay accounting (EXPERIMENTAL)"
help
diff -puN init/main.c~taskstats-setup init/main.c
--- linux-2.6.17-rc3/init/main.c~taskstats-setup 2006-05-02 09:47:24.000000000 +0530
+++ linux-2.6.17-rc3-balbir/init/main.c 2006-05-02 09:47:24.000000000 +0530
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/taskstats_kern.h>
#include <linux/delayacct.h>
#include <asm/io.h>
@@ -542,6 +543,7 @@ asmlinkage void __init start_kernel(void
proc_root_init();
#endif
cpuset_init();
+ taskstats_init_early();
delayacct_init();
check_bugs();
diff -puN kernel/exit.c~taskstats-setup kernel/exit.c
--- linux-2.6.17-rc3/kernel/exit.c~taskstats-setup 2006-05-02 09:47:24.000000000 +0530
+++ linux-2.6.17-rc3-balbir/kernel/exit.c 2006-05-02 09:47:24.000000000 +0530
@@ -36,6 +36,7 @@
#include <linux/compat.h>
#include <linux/pipe_fs_i.h>
#include <linux/delayacct.h>
+#include <linux/taskstats_kern.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -848,6 +849,7 @@ static void exit_notify(struct task_stru
fastcall NORET_TYPE void do_exit(long code)
{
struct task_struct *tsk = current;
+ struct taskstats *tidstats, *tgidstats;
int group_dead;
profile_task_exit(tsk);
@@ -894,6 +896,8 @@ fastcall NORET_TYPE void do_exit(long co
current->comm, current->pid,
preempt_count());
+ taskstats_exit_alloc(&tidstats, &tgidstats);
+
acct_update_integrals(tsk);
if (tsk->mm) {
update_hiwater_rss(tsk->mm);
@@ -911,7 +915,10 @@ fastcall NORET_TYPE void do_exit(long co
if (unlikely(tsk->compat_robust_list))
compat_exit_robust_list(tsk);
#endif
+ taskstats_exit_send(tsk, tidstats, tgidstats);
+ taskstats_exit_free(tidstats, tgidstats);
delayacct_tsk_exit(tsk);
+
exit_mm(tsk);
exit_sem(tsk);
diff -puN kernel/Makefile~taskstats-setup kernel/Makefile
--- linux-2.6.17-rc3/kernel/Makefile~taskstats-setup 2006-05-02 09:47:24.000000000 +0530
+++ linux-2.6.17-rc3-balbir/kernel/Makefile 2006-05-02 09:47:24.000000000 +0530
@@ -39,6 +39,7 @@ obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.17-rc3-balbir/kernel/taskstats.c 2006-05-02 10:36:27.000000000 +0530
@@ -0,0 +1,329 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats_kern.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+kmem_cache_t *taskstats_cache;
+static DEFINE_MUTEX(taskstats_exit_mutex);
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1]
+__read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp, size_t size)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new(size);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc = 0;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ /*
+ * Each accounting subsystem adds calls to its functions to
+ * fill in relevant parts of struct taskstats as follows
+ *
+ * rc = per-task-foo(stats, tsk);
+ * if (rc)
+ * goto err;
+ */
+
+err:
+ put_task_struct(tsk);
+ return rc;
+
+}
+
+static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc = 0;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ /*
+ * Each accounting subsystem adds calls to its functions to
+ * fill in relevant parts of struct taskstats as follows
+ *
+ * rc = per-task-foo(stats, tsk);
+ * if (rc)
+ * break;
+ */
+
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ /*
+ * Accounting subsystems can also add calls here if they don't
+ * wish to aggregate statistics for per-tgid stats
+ */
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+ size_t size;
+ struct nlattr *na;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid(pid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ stats);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid(tgid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ stats);
+ } else {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ nla_nest_end(rep_skb, na);
+
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+err:
+ nlmsg_free(rep_skb);
+ return rc;
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats,
+ struct taskstats *tgidstats)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ size_t size;
+ int is_thread_group;
+ struct nlattr *na;
+
+ if (!family_registered || !tidstats)
+ return;
+
+ mutex_lock(&taskstats_exit_mutex);
+
+ is_thread_group = !thread_group_empty(tsk);
+ rc = 0;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ if (is_thread_group)
+ size = 2 * size; /* PID + STATS + TGID + STATS */
+
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ goto ret;
+
+ rc = fill_pid(tsk->pid, tsk, tidstats);
+ if (rc < 0)
+ goto err_skb;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ *tidstats);
+ nla_nest_end(rep_skb, na);
+
+ if (!is_thread_group || !tgidstats) {
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ goto ret;
+ }
+
+ rc = fill_tgid(tsk->pid, tsk, tgidstats);
+ if (rc < 0)
+ goto err_skb;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
+ *tgidstats);
+ nla_nest_end(rep_skb, na);
+
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ goto ret;
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+ goto ret;
+err_skb:
+ nlmsg_free(rep_skb);
+ret:
+ mutex_unlock(&taskstats_exit_mutex);
+ return;
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+/* Needed early in initialization */
+void __init taskstats_init_early(void)
+{
+ taskstats_cache = kmem_cache_create("taskstats_cache",
+ sizeof(struct taskstats),
+ 0, SLAB_PANIC, NULL, NULL);
+}
+
+static int __init taskstats_init(void)
+{
+ int rc;
+
+ rc = genl_register_family(&family);
+ if (rc)
+ return rc;
+ family_registered = 1;
+
+ if ((rc = genl_register_ops(&family, &taskstats_ops)) < 0)
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return rc;
+}
+
+/*
+ * late initcall ensures initialization of statistics collection
+ * mechanisms precedes initialization of the taskstats interface
+ */
+late_initcall(taskstats_init);
_
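Reviewer note: the reply sizing in taskstats_send_stats() and
taskstats_exit_send() can be checked by hand. The sketch below mirrors the
arithmetic of nla_total_size() (4-byte attribute header, payload padded to
a 4-byte boundary) in plain userspace C. attr_total_size() and
exit_msg_size() are hypothetical names used only for this illustration,
and the 4-byte stats size reflects the single int filler in this patch's
struct taskstats on common ABIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors nla_total_size(): 4-byte header plus payload, 4-byte aligned. */
static size_t attr_total_size(size_t payload)
{
	return (4 + payload + 3) & ~(size_t)3;
}

/*
 * Space for one nested group: id attribute (u32), stats attribute,
 * and the empty nest header. Doubled when the exiting task belongs
 * to a thread group, since a TGID group follows the PID group.
 */
static size_t exit_msg_size(size_t stats_size, int is_thread_group)
{
	size_t size = attr_total_size(sizeof(uint32_t)) +
		      attr_total_size(stats_size) +
		      attr_total_size(0);

	if (is_thread_group)
		size *= 2;	/* PID + STATS + TGID + STATS */
	return size;
}
```

For a 4-byte struct taskstats this gives 8 + 8 + 4 = 20 bytes for a
single-threaded task and 40 bytes for a task in a thread group. The sum
grows with sizeof(struct taskstats) once later patches add real fields,
which is why prepare_reply() asks for the allocation to be revisited when
attributes are added.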
end of thread, other threads:[~2006-05-02 6:21 UTC | newest]
Thread overview: 23+ messages
2006-04-22 2:16 [Patch 0/8] per-task delay accounting Shailabh Nagar
2006-04-22 2:23 ` [Patch 1/8] Setup Shailabh Nagar
2006-04-24 2:02 ` Randy.Dunlap
2006-04-24 17:26 ` Shailabh Nagar
2006-04-22 2:29 ` [Patch 2/8] Sync block I/O and swapin delay collection Shailabh Nagar
2006-04-22 2:33 ` [Patch 3/8] cpu delay collection via schedstats Shailabh Nagar
2006-04-22 2:35 ` [Patch 4/8] Utilities for genetlink usage Shailabh Nagar
2006-04-22 2:37 ` [Patch 5/8] taskstats interface Shailabh Nagar
2006-04-27 1:12 ` Jay Lan
2006-04-27 4:00 ` Shailabh Nagar
2006-04-27 6:42 ` [Lse-tech] " Balbir Singh
2006-04-27 17:52 ` Jay Lan
2006-04-27 18:27 ` Balbir Singh
2006-04-27 19:34 ` Jay Lan
2006-04-28 2:59 ` Balbir Singh
2006-04-28 18:20 ` Jay Lan
2006-04-28 18:35 ` Balbir Singh
2006-04-22 2:39 ` [Patch 6/8] delay accounting usage of " Shailabh Nagar
2006-04-22 2:40 ` [Patch 7/8] documentation Shailabh Nagar
2006-04-22 2:42 ` [Patch 8/8] /proc export of aggregated block I/O delays Shailabh Nagar
2006-04-22 7:46 ` [Lse-tech] " Andi Kleen
2006-04-25 15:07 ` [Patch 0/8] per-task delay accounting Shailabh Nagar
-- strict thread matches above, loose matches on Subject: below --
2006-05-02 6:18 [Patch 5/8] taskstats interface Balbir Singh