netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Patch 7/7] Generic netlink interface (delay accounting)
       [not found] <1141026996.5785.38.camel@elinux04.optonline.net>
@ 2006-02-27  8:31 ` Shailabh Nagar
       [not found]   ` <1141045194.5363.12.camel@localhost.localdomain>
  0 siblings, 1 reply; 11+ messages in thread
From: Shailabh Nagar @ 2006-02-27  8:31 UTC (permalink / raw)
  To: linux-kernel; +Cc: lse-tech, netdev

delayacct-genetlink.patch

Create a generic netlink interface (NETLINK_GENERIC family), 
called "taskstats", for getting delay and cpu statistics of 
tasks and thread groups during their lifetime and when they exit. 

The cpu stats are available only if CONFIG_SCHEDSTATS is enabled.

When a task is alive, userspace can get its stats by sending a 
command containing its pid. Sending a tgid returns the sum of stats 
of the tasks belonging to that tgid (where such a sum makes sense). 
Together, the command interface allows stats for a large number of 
tasks to be collected more efficiently than would be possible 
through /proc or any per-pid interface. 

The netlink interface also sends the stats for each task to userspace 
when the task is exiting. This permits fine-grain accounting for 
short-lived tasks, which is important if userspace is doing its own 
aggregation of statistics based on some grouping of tasks 
(e.g. CSA jobs, ELSA banks or CKRM classes).

If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.

The interface has been deliberately kept distinct from the delay 
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to 
userspace is versioned and the command interface easily extensible to 
facilitate reuse.

If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay 
accounting alone.

Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

 include/linux/delayacct.h |    2 
 include/linux/taskstats.h |  125 +++++++++++++++++++++
 init/Kconfig              |   16 ++
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   49 ++++++++
 kernel/taskstats.c        |  266 ++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 456 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc4/include/linux/delayacct.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/delayacct.h	2006-02-27 01:53:01.000000000 -0500
+++ linux-2.6.16-rc4/include/linux/delayacct.h	2006-02-27 01:53:03.000000000 -0500
@@ -16,6 +16,7 @@
 
 #include <linux/sched.h>
 #include <linux/sysctl.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -27,6 +28,7 @@ extern void __delayacct_tsk_init(struct 
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio(void);
 extern void __delayacct_swapin(void);
+extern int delayacct_add_tsk(struct taskstats_reply *, struct task_struct *);
 
 static inline void delayacct_tsk_early_init(struct task_struct *tsk)
 {
Index: linux-2.6.16-rc4/include/linux/taskstats.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc4/include/linux/taskstats.h	2006-02-27 01:53:03.000000000 -0500
@@ -0,0 +1,125 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+
+struct taskstats {
+
+	/* Version 1 */
+#define TASKSTATS_NOPID	-1
+	pid_t	pid;
+	pid_t	tgid;
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC,		/* Reserved */
+	TASKSTATS_CMD_NONE,		/* Not a valid cmd to send
+					 * Marks data sent on task/tgid exit */
+	TASKSTATS_CMD_LISTEN,		/* Start listening */
+	TASKSTATS_CMD_IGNORE,		/* Stop listening */
+	TASKSTATS_CMD_PID,		/* Send stats for a pid */
+	TASKSTATS_CMD_TGID,		/* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+	union {
+		pid_t	pid;
+		pid_t	tgid;
+	} id;
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+	enum outtype {
+		TASKSTATS_REPLY_NONE = 1,	/* Control cmd response */
+		TASKSTATS_REPLY_PID,		/* per-pid data cmd response*/
+		TASKSTATS_REPLY_TGID,		/* per-tgid data cmd response*/
+		TASKSTATS_REPLY_EXIT_PID,	/* Exiting task's stats */
+		TASKSTATS_REPLY_EXIT_TGID,	/* Exiting tgid's stats
+						 * (sent on each tid's exit) */
+	} outtype;
+	__u32 version;
+	__u32 err;
+	struct taskstats stats;			/* Invalid if err != 0 */
+};
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+/* Following must be > NLMSG_MIN_TYPE */
+#define TASKSTATS_GENL_ID	0x123
+
+#define TASKSTATS_HDRLEN	(NLMSG_SPACE(GENL_HDRLEN))
+#define TASKSTATS_BODYLEN	(sizeof(struct taskstats_reply))
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16-rc4/kernel/Makefile
===================================================================
--- linux-2.6.16-rc4.orig/kernel/Makefile	2006-02-27 01:52:54.000000000 -0500
+++ linux-2.6.16-rc4/kernel/Makefile	2006-02-27 01:53:03.000000000 -0500
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
Index: linux-2.6.16-rc4/kernel/taskstats.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc4/kernel/taskstats.c	2006-02-27 01:53:03.000000000 -0500
@@ -0,0 +1,266 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static atomic_t taskstats_group_listeners = ATOMIC_INIT(0);
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+
+static struct genl_family family = {
+//	.id             = GENL_ID_GENERATE,
+	.id             = TASKSTATS_GENL_ID,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.hdrsize        = 0,
+	.maxattr        = 0,
+};
+
+#define genlmsg_data(genlhdr)	((char *)genlhdr + GENL_HDRLEN)
+
+/* Taskstat specific functions */
+
+static int taskstats_listen(struct sk_buff *skb, struct genl_info *info)
+{
+	atomic_inc(&taskstats_group_listeners);
+	return 0;
+}
+
+static int taskstats_ignore(struct sk_buff *skb, struct genl_info *info)
+{
+	atomic_dec(&taskstats_group_listeners);
+	return 0;
+}
+
+static int prepare_reply(struct genl_info *info, u8 cmd,
+			 struct sk_buff **skbp, struct taskstats_reply **replyp)
+{
+	struct sk_buff *skb;
+	struct taskstats_reply *reply;
+
+	skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, NLM_F_REQUEST,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, info->nlhdr->nlmsg_flags,
+				    info->genlhdr->cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+	skb_put(skb, TASKSTATS_BODYLEN);
+
+	memset(reply, 0, sizeof(*reply));
+	reply->version = taskstats_version;
+	reply->err = 0;
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, int replytype)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	struct taskstats_reply *reply;
+
+	reply = (struct taskstats_reply *)genlmsg_data(genlhdr);
+	reply->outtype = replytype;
+
+	genlmsg_end(skb, genlhdr);
+	return genlmsg_multicast(skb, 0, TASKSTATS_LISTEN_GROUP);
+}
+
+static inline void fill_pid(struct taskstats_reply *reply, pid_t pid, struct task_struct *pidtsk)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			reply->err = EINVAL;
+			return;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(reply, tsk);
+	if (!rc) {
+		reply->stats.pid = tsk->pid;
+		reply->stats.tgid = tsk->tgid;
+	} else
+		reply->err = (rc < 0) ? -rc : rc ;
+
+	put_task_struct(tsk);
+}
+
+static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+	struct taskstats_cmd_param *param= info->userhdr;
+
+	if (atomic_read(&taskstats_group_listeners) < 1)
+		return -EINVAL;
+
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc)
+		return rc;
+	fill_pid(reply, param->id.pid, NULL);
+	return send_reply(rep_skb, TASKSTATS_REPLY_PID);
+}
+
+static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid, struct task_struct *tgidtsk)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			reply->err = EINVAL;
+			return;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(reply, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	if (!rc) {
+		reply->stats.pid = TASKSTATS_NOPID;
+		reply->stats.tgid = tgid;
+	} else
+		reply->err = (rc < 0) ? -rc : rc ;
+}
+
+static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+	struct taskstats_cmd_param *param= info->userhdr;
+
+	if (atomic_read(&taskstats_group_listeners) < 1)
+		return -EINVAL;
+
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc)
+		return rc;
+	fill_tgid(reply, param->id.tgid, NULL);
+	return send_reply(rep_skb, TASKSTATS_REPLY_TGID);
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+
+	if (atomic_read(&taskstats_group_listeners) < 1)
+		return;
+
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+	if (rc)
+		return;
+	fill_pid(reply, tsk->pid, tsk);
+	rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID);
+
+	if (rc || thread_group_empty(tsk))
+		return;
+
+	/* Send tgid data too */
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+	if (rc)
+		return;
+	fill_tgid(reply, tsk->tgid, tsk);
+	send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID);
+}
+
+static struct genl_ops listen_ops = {
+	.cmd            = TASKSTATS_CMD_LISTEN,
+	.doit           = taskstats_listen,
+};
+
+static struct genl_ops ignore_ops = {
+	.cmd            = TASKSTATS_CMD_IGNORE,
+	.doit           = taskstats_ignore,
+};
+
+static struct genl_ops pid_ops = {
+	.cmd            = TASKSTATS_CMD_PID,
+	.doit           = taskstats_send_pid,
+};
+
+static struct genl_ops tgid_ops = {
+	.cmd            = TASKSTATS_CMD_TGID,
+	.doit           = taskstats_send_tgid,
+};
+
+static int family_registered = 0;
+
+static int __init taskstats_init(void)
+{
+        if (genl_register_family(&family))
+                return -EFAULT;
+        family_registered = 1;
+
+        if (genl_register_ops(&family, &listen_ops))
+               goto err;
+        if (genl_register_ops(&family, &ignore_ops))
+                goto err;
+        if (genl_register_ops(&family, &pid_ops))
+		goto err;
+        if (genl_register_ops(&family, &tgid_ops))
+                goto err;
+
+        return 0;
+err:
+        genl_unregister_family(&family);
+        family_registered = 0;
+        return -EFAULT;
+}
+
+late_initcall(taskstats_init);
+
Index: linux-2.6.16-rc4/init/Kconfig
===================================================================
--- linux-2.6.16-rc4.orig/init/Kconfig	2006-02-27 01:52:54.000000000 -0500
+++ linux-2.6.16-rc4/init/Kconfig	2006-02-27 01:53:03.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
Index: linux-2.6.16-rc4/kernel/delayacct.c
===================================================================
--- linux-2.6.16-rc4.orig/kernel/delayacct.c	2006-02-27 01:53:01.000000000 -0500
+++ linux-2.6.16-rc4/kernel/delayacct.c	2006-02-27 01:53:03.000000000 -0500
@@ -17,6 +17,7 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
@@ -61,6 +62,7 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+ 	taskstats_exit_pid(tsk);
 	task_lock(tsk);
 	if (tsk->delays) {
 		kmem_cache_free(delayacct_cache, tsk->delays);
@@ -209,3 +211,50 @@ int delayacct_sysctl_handler(ctl_table *
 		delayacct_on = prev;
 	return ret;
 }
+
+#ifdef CONFIG_TASKSTATS
+
+int delayacct_add_tsk(struct taskstats_reply *reply, struct task_struct *tsk)
+{
+	struct taskstats *d = &reply->stats;
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	if (!tsk->delays || !delayacct_on)
+		return -EINVAL;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+#ifdef CONFIG_SCHEDSTATS
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0:tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0:tmp;
+#else
+	/* Non-zero XXX_total,zero XXX_count implies XXX stat unavailable */
+	d->cpu_count = 0;
+	d->cpu_run_total = d->cpu_delay_total = TASKSTATS_NOCPUSTATS;
+#endif
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0:tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0:tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+
+	return 0;
+}
+
+#endif /* CONFIG_TASKSTATS */

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Patch 7/7] Generic netlink interface (delay accounting)
       [not found]       ` <1141652556.5185.64.camel@localhost.localdomain>
@ 2006-03-06 17:00         ` Shailabh Nagar
  2006-03-07 14:38           ` [Lse-tech] " jamal
  0 siblings, 1 reply; 11+ messages in thread
From: Shailabh Nagar @ 2006-03-06 17:00 UTC (permalink / raw)
  To: hadi; +Cc: netdev, linux-kernel, lse-tech

Jamal,

Pls keep lkml and lse-tech on cc since some of this affects the usage
of delay accounting.


jamal wrote:

>Hi Shailabh,
>Apologies for taking a week to respond ..
>
>On Mon, 2006-27-02 at 15:26 -0500, Shailabh Nagar wrote: 
>  
>
>>jamal wrote:
>>    
>>
>
>  
>
>>Yes, the current intent is to allow multiple listeners to receive the 
>>responses sent by the kernel.
>>    
>>
>
>Responses or events? There is a difference:
>Response implies the program in user space requested (ex a GET) for that
>information and is receiving such info.
>Event implies the program in user space asked to be informed of changes
>in the kernel. Example an exit would be considered an event. 
>Events are received by virtue of registering to a multicast group.
>[..] 
>  
>
My design was to have the listener get both responses (what I call 
replies in the code)
as well as events (data sent on exit of pid)

>>Since this interface (taskstats) is currently designed for that 
>>possibility, having multiple listeners, one for
>>each "component" such as delay accounting, is the model we're using.
>>We expect each component to have a pair of userspace programs, one for 
>>sending commands and the other
>>to "listen" to all replies + data generated on task exits. 
>>    
>>
>
>You need to have a sender of GETs essentially and a listener of events.
>Those are two connections. The replies of a get from user1 will not be
>sent to user2 as well - unless ... thats what you are trying to achieve;
>the question is why?
>  
>
Yes, I was trying to have an asymmetric model where the userspace sender 
of GETs
doesn't receive the reply as a unicast. Rather the reply is sent by 
multicast (alongwith all the
event data).

Reason for this unintuitive design was to make it easier to process the 
returned data.

The expected usage of delay accounting is to periodically "sample" the 
delays for all
tasks (or tgids) in the system. Also, get the delays from exiting pids 
(lets forget how tgid exit
is handled for now...irrelevant to this discussion).

Using the above two pieces of data, userspace can aggregate the "delays" 
seen by any
grouping of tasks that it chooses to implement.

In this usage scenario, its more efficient to have one receiver get both 
response and event
data and process in a loop.

However, we could switch to the model you suggest and use a 
multithreaded send/receive
userspace utility.

>>The listener 
>>is expected to register/deregister interest through
>>TASKSTATS_CMD_LISTEN and IGNORE.
>>
>>    
>>
>
>It is not necessary if you follow the model i described.
>
>  
>
>>>How does this correlate to TASKSTATS_CMD_LISTEN/IGNORE?
>>> 
>>>
>>>      
>>>
>>See above. Its mainly an optimization so that if no listener is present, 
>>there's no need to generate the data.
>>
>>    
>>
>
>Also not necessary - There is a recent netlink addition to make sure
>that events dont get sent if no listeners exist.
>genetlink needs to be extended. For now assume such a thing exists.
>  
>
Ok. Will this addition work for both unicast and multicast modes ?



>  
>
>>>>+
>>>>        
>>>>
>
>  
>
>>Good point. Should check for users sending it as a cmd and treat it as a 
>>noop. 
>>    
>>
>
>More like return an -EINVAL
>  
>
Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
automatically for us since we've not registered the command ?
 

>  
>
>>I'm just using
>>this as a placeholder for data thats returned without being requested.
>>
>>    
>>
>
>So it is unconditional?
>  
>
Yes.

>  
>
>>Come to think of it, there's no real reason to have a genlmsghdr for 
>>returned data, is there ?
>>    
>>
>
>All messages should be consistent whether they are sent from user
>or kernel.
>  
>
Ok. will retain genetlink header.

>>Other than to copy the genlmsghdr that was sent so user can identify 
>>which command was sent
>>(and I'm doing that through the reply type, perhaps redundantly).
>>
>>    
>>
>
>yes, that is a useful trick. Just make sure they are reflected
>correctly.
>
>  
>
>>Actually, the next iteration of the code will move to dynamically 
>>generated ID. But yes, will need to check for that.
>>
>>    
>>
>
>Also if you can provide feedback whether the doc i sent was any use
>and what wasnt clear etc.
>  
>
Will do.

>>Thanks for the review.
>>Couple of questions about general netlink:
>>is it intended to remain a size that will always be aligned to the 
>>NLMSG_ALIGNTO so that (NLMSG_DATA(nlhdr) + GENL_HDRLEN) can always 
>>be used as a pointer to the genlmsghdr ?
>>
>>    
>>
>
>I am not sure i followed.
>The whole message (nlhdr, genlhdr, optionalhdr, TLVs) has to be in 
>the end 32 bit aligned.
>  
>
Ok , so separate padding isn't needed to make the genlhdr, optionalhdr 
and TLV parts aligned
too.

>>Adding some macros like genlmsg_data(nlh) would be handy (currently I 
>>just define and use it locally).
>>
>>    
>>
>
>Send a patch.
>  
>
will do.


Thanks,
Shailabh

>cheers,
>jamal
>
>  
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-06 17:00         ` Shailabh Nagar
@ 2006-03-07 14:38           ` jamal
  2006-03-08 21:56             ` Shailabh Nagar
  0 siblings, 1 reply; 11+ messages in thread
From: jamal @ 2006-03-07 14:38 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: netdev, linux-kernel, lse-tech

On Mon, 2006-06-03 at 12:00 -0500, Shailabh Nagar wrote:

> My design was to have the listener get both responses (what I call 
> replies in the code) as well as events (data sent on exit of pid)
> 

I think i may not be doing justice explaining this, so let me be more
elaborate so we can be in sync.
Here is the classical way of doing things:

- Assume several apps in user space and a target in the kernel (this
could be reversed or combined in many ways, but the sake of
simplicity/clarity make the above assumption).
- suppose we have five user space apps A, B, C, D, E; these processes
would typically do one of the following class of activities:

a) configure (ADD/NEW/DEL etc). This is issued towards the kernel to
set/create/delete/flush some scalar attribute or vector. These sorts of
commands are synchronous. i.e you issue them, you expect a response
(which may indicate success/failure etc). The response is unicast; the
effect of what they affected may cause an event which may be multicast.

b) query(GET). This is issued towards the kernel to query state of
configured items. These class of commands are also synchronous. There
are special cases of the query which dump everything in the target -
literally called "dumps". The response is unicast.

c) events. These are _asynchronous_ messages issued by the kernel to
indicate some happening in the kernel. The event may be caused by #a
above or any other activity in the kernel. Events are multicast.
To receive them you have to register for the multicast group. You do so
via sockets. You can register to many multicast group.

For clarity again assume we have a multicast group where announcements
of pids exiting is seen and C and D are registered to such a multicast
group.
Suppose process A exits. That would fall under #c above. C and D will be
notified.
Suppose B configures something in the kernel that forces the kernel to
have process E exit and that such an operation is successful. B will get
acknowledgement it succeeded (unicast). C and D will get notified
(multicast). 
Suppose C issued a GET to find details about a specific pid, then only C
will get that info back (message is unicast).
[A response message to a GET is typically designed to be the same as an
ADD message i.e one should be able to take exactly the same message,
change one or two things and shove back into the kernel to configure].
Suppose D issued a GET with dump flag, then D will get the details of
all pids (message is unicast).

Is this clear? Is there more than the above you need?

There are no hard rules on what you need to be multicasting and as an
example you could send periodic(aka time based) samples from the kernel
on a multicast channel and that would be received by all. It did seem
odd that you want to have a semi-promiscous mode where a response to a
GET is multicast. If that is still what you want to achieve, then you
should.

> However, we could switch to the model you suggest and use a 
> multithreaded send/receive userspace utility.
> 

This is more of the classical way of doing things. 


> >There is a recent netlink addition to make sure
> >that events dont get sent if no listeners exist.
> >genetlink needs to be extended. For now assume such a thing exists.
> >  
> >
> Ok. Will this addition work for both unicast and multicast modes ?
> 

If you never open a connection to the kernel, nothing will be generated
towards user space. 
There are other techniques to rate limit event generation as well (one
such technique is a nagle-like algorithm used by xfrm).

> >
> Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
> automatically for us since we've not registered the command ?
>  

Yes, please in your doc feedback remind me of this,

> >
> >Also if you can provide feedback whether the doc i sent was any use
> >and what wasnt clear etc.
> >  
> >
> Will do.
> 

also take a look at the excellent documentation Thomas Graf has put in
the kernel for all the utilities for manipulating netlink messages and
tell me if that should also be put in this doc (It is listed as a TODO).


cheers,
jamal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-07 14:38           ` [Lse-tech] " jamal
@ 2006-03-08 21:56             ` Shailabh Nagar
  2006-03-09 14:37               ` [UPDATED PATCH] " Balbir Singh
  0 siblings, 1 reply; 11+ messages in thread
From: Shailabh Nagar @ 2006-03-08 21:56 UTC (permalink / raw)
  To: hadi; +Cc: netdev, linux-kernel, lse-tech

jamal wrote:

>On Mon, 2006-06-03 at 12:00 -0500, Shailabh Nagar wrote:
>
>  
>
>>My design was to have the listener get both responses (what I call 
>>replies in the code) as well as events (data sent on exit of pid)
>>
>>    
>>
>
>I think i may not be doing justice explaining this, so let me be more
>elaborate so we can be in sync.
>Here is the classical way of doing things:
>
>- Assume several apps in user space and a target in the kernel (this
>could be reversed or combined in many ways, but the sake of
>simplicity/clarity make the above assumption).
>- suppose we have five user space apps A, B, C, D, E; these processes
>would typically do one of the following class of activities:
>
>a) configure (ADD/NEW/DEL etc). This is issued towards the kernel to
>set/create/delete/flush some scalar attribute or vector. These sorts of
>commands are synchronous. i.e you issue them, you expect a response
>(which may indicate success/failure etc). The response is unicast; the
>effect of what they affected may cause an event which may be multicast.
>
>b) query(GET). This is issued towards the kernel to query state of
>configured items. These class of commands are also synchronous. There
>are special cases of the query which dump everything in the target -
>literally called "dumps". The response is unicast.
>
>c) events. These are _asynchronous_ messages issued by the kernel to
>indicate some happening in the kernel. The event may be caused by #a
>above or any other activity in the kernel. Events are multicast.
>To receive them you have to register for the multicast group. You do so
>via sockets. You can register to many multicast group.
>
>For clarity again assume we have a multicast group where announcements
>of pids exiting is seen and C and D are registered to such a multicast
>group.
>Suppose process A exits. That would fall under #c above. C and D will be
>notified.
>Suppose B configures something in the kernel that forces the kernel to
>have process E exit and that such an operation is successful. B will get
>acknowledgement it succeeded (unicast). C and D will get notified
>(multicast). 
>Suppose C issued a GET to find details about a specific pid, then only C
>will get that info back (message is unicast).
>[A response message to a GET is typically designed to be the same as an
>ADD message i.e one should be able to take exactly the same message,
>change one or two things and shove back into the kernel to configure].
>Suppose D issued a GET with dump flag, then D will get the details of
>all pids (message is unicast).
>
>Is this clear? Is there more than the above you need?
>  
>
Thanks for the clarification of the usage model. While our needs are 
certainly much less complex,
it is useful to know the range of options.

>There are no hard rules on what you need to be multicasting and as an
>example you could send periodic(aka time based) samples from the kernel
>on a multicast channel and that would be received by all. It did seem
>odd that you want to have a semi-promiscous mode where a response to a
>GET is multicast. If that is still what you want to achieve, then you
>should.
>  
>
Ok, we'll probably switch to the classical mode you suggest since the 
"efficient processing"
rationale for choosing to operate in the semi-promiscous mode earlier 
can be overcome by
writing a multi-threaded userspace utility.

>>However, we could switch to the model you suggest and use a 
>>multithreaded send/receive userspace utility.
>>
>>    
>>
>
>This is more of the classical way of doing things. 
>
>
>  
>
>>>There is a recent netlink addition to make sure
>>>that events dont get sent if no listeners exist.
>>>genetlink needs to be extended. For now assume such a thing exists.
>>> 
>>>
>>>      
>>>
>>Ok. Will this addition work for both unicast and multicast modes ?
>>
>>    
>>
>
>If you never open a connection to the kernel, nothing will be generated
>towards user space. 
>There are other techniques to rate limit event generation as well (one
>such technique is a nagle-like algorithm used by xfrm).
>
>  
>
>>Will this be necessary ? Isn't genl_rcv_msg() going to return a -EOPNOTSUPP
>>automatically for us since we've not registered the command ?
>> 
>>    
>>
>
>Yes, please in your doc feedback remind me of this,
>
>  
>
>>>Also if you can provide feedback whether the doc i sent was any use
>>>and what wasnt clear etc.
>>> 
>>>
>>>      
>>>
>>Will do.
>>
>>    
>>
>
>also take a look at the excellent documentation Thomas Graf has put in
>the kernel for all the utilities for manipulating netlink messages and
>tell me if that should also be put in this doc (It is listed as a TODO).
>
>
>  
>
Ok.

Thanks,
Shailabh

>cheers,
>jamal
>
>  
>



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [UPDATED PATCH] Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-08 21:56             ` Shailabh Nagar
@ 2006-03-09 14:37               ` Balbir Singh
  2006-03-09 16:06                 ` Shailabh Nagar
  2006-03-10 14:53                 ` jamal
  0 siblings, 2 replies; 11+ messages in thread
From: Balbir Singh @ 2006-03-09 14:37 UTC (permalink / raw)
  To: Shailabh Nagar; +Cc: hadi, netdev, linux-kernel, lse-tech

> Thanks for the clarification of the usage model. While our needs are 
> certainly much less complex,
> it is useful to know the range of options.
> 
> >There are no hard rules on what you need to be multicasting and as an
> >example you could send periodic(aka time based) samples from the kernel
> >on a multicast channel and that would be received by all. It did seem
> >odd that you want to have a semi-promiscous mode where a response to a
> >GET is multicast. If that is still what you want to achieve, then you
> >should.
> > 
> >>>Also if you can provide feedback whether the doc i sent was any use
> >>>and what wasnt clear etc.
> >also take a look at the excellent documentation Thomas Graf has put in
> >the kernel for all the utilities for manipulating netlink messages and
> >tell me if that should also be put in this doc (It is listed as a TODO).

Hello, Jamal,

Please find the latest version of the patch for review. The genetlink
code has been updated as per your review comments. The changelog is provided
below

1. Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
2. Provide generic functions called genlmsg_data() and genlmsg_len()
   in linux/net/genetlink.h
3. Do not multicast all replies, multicast only events generated due
   to task exit.
4. The taskstats and taskstats_reply structures are now 64 bit aligned.
5. Family id is dynamically generated.

Please let us know if we missed something out.

Thanks,
Balbir


Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>

---

 include/linux/delayacct.h |    2 
 include/linux/taskstats.h |  128 ++++++++++++++++++++++++
 include/net/genetlink.h   |   20 +++
 init/Kconfig              |   16 ++-
 kernel/Makefile           |    1 
 kernel/delayacct.c        |   56 ++++++++++
 kernel/taskstats.c        |  244 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 464 insertions(+), 3 deletions(-)

diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16-rc5/include/linux/delayacct.h~delayacct-genetlink	2006-03-09 17:15:31.000000000 +0530
+++ linux-2.6.16-rc5-balbir/include/linux/delayacct.h	2006-03-09 17:15:31.000000000 +0530
@@ -15,6 +15,7 @@
 #define _LINUX_TASKDELAYS_H
 
 #include <linux/sched.h>
+#include <linux/taskstats.h>
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 extern int delayacct_on;	/* Delay accounting turned on/off */
@@ -24,6 +25,7 @@ extern void __delayacct_tsk_init(struct 
 extern void __delayacct_tsk_exit(struct task_struct *);
 extern void __delayacct_blkio(void);
 extern void __delayacct_swapin(void);
+extern int delayacct_add_tsk(struct taskstats_reply *, struct task_struct *);
 
 static inline void delayacct_tsk_init(struct task_struct *tsk)
 {
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-rc5-balbir/include/linux/taskstats.h	2006-03-09 19:28:54.000000000 +0530
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ *	- a task exits
+ *	- listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION	1
+
+struct taskstats {
+	/* Maintain 64-bit alignment while extending */
+
+	/* Version 1 */
+#define TASKSTATS_NOPID	-1
+	__s64	pid;
+	__s64	tgid;
+
+	/* XXX_count is number of delay values recorded.
+	 * XXX_total is corresponding cumulative delay in nanoseconds
+	 */
+
+#define TASKSTATS_NOCPUSTATS	1
+	__u64	cpu_count;
+	__u64	cpu_delay_total;	/* wait, while runnable, for cpu */
+	__u64	blkio_count;
+	__u64	blkio_delay_total;	/* sync,block io completion wait*/
+	__u64	swapin_count;
+	__u64	swapin_delay_total;	/* swapin page fault wait*/
+
+	__u64	cpu_run_total;		/* cpu running time
+					 * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP	0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+	TASKSTATS_CMD_UNSPEC,		/* Reserved */
+	TASKSTATS_CMD_NONE,		/* Not a valid cmd to send
+					 * Marks data sent on task/tgid exit */
+	TASKSTATS_CMD_LISTEN,		/* Start listening */
+	TASKSTATS_CMD_IGNORE,		/* Stop listening */
+	TASKSTATS_CMD_PID,		/* Send stats for a pid */
+	TASKSTATS_CMD_TGID,		/* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+	/* Maintain 64-bit alignment while extending */
+	union {
+		__s64	pid;
+		__s64	tgid;
+	} id;
+};
+
+enum outtype {
+	TASKSTATS_REPLY_NONE = 1,	/* Control cmd response */
+	TASKSTATS_REPLY_PID,		/* per-pid data cmd response*/
+	TASKSTATS_REPLY_TGID,		/* per-tgid data cmd response*/
+	TASKSTATS_REPLY_EXIT_PID,	/* Exiting task's stats */
+	TASKSTATS_REPLY_EXIT_TGID,	/* Exiting tgid's stats
+					 * (sent on each tid's exit) */
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+	/* Maintain 64-bit alignment while extending */
+	__u16 outtype;			/* Must be one of enum outtype */
+	__u16 version;
+	__u32 err;
+	struct taskstats stats;		/* Invalid if err != 0 */
+};
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME	"TASKSTATS"
+#define TASKSTATS_GENL_VERSION	0x1
+
+#define TASKSTATS_HDRLEN	(NLMSG_SPACE(GENL_HDRLEN))
+#define TASKSTATS_BODYLEN	(sizeof(struct taskstats_reply))
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16-rc5/init/Kconfig~delayacct-genetlink	2006-03-09 17:15:31.000000000 +0530
+++ linux-2.6.16-rc5-balbir/init/Kconfig	2006-03-09 17:15:31.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
 	  in pages. Such statistics can help in setting a task's priorities
 	  relative to other tasks for cpu, io, rss limits etc.
 
-	  Unlike BSD process accounting, this information is available
-	  continuously during the lifetime of a task.
-
 	  Say N if unsure.
 
+config TASKSTATS
+	bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+	depends on TASK_DELAY_ACCT
+	default y
+	help
+	  Export selected statistics for tasks/processes through the
+	  generic netlink interface. Unlike BSD process accounting, the
+	  statistics are available during the lifetime of tasks/processes as
+	  responses to commands. Like BSD accounting, they are sent to user
+	  space on task exit.
+
+	  Say Y if unsure.
+
 config SYSCTL
 	bool "Sysctl support"
 	---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16-rc5/kernel/delayacct.c~delayacct-genetlink	2006-03-09 17:15:31.000000000 +0530
+++ linux-2.6.16-rc5-balbir/kernel/delayacct.c	2006-03-09 17:15:31.000000000 +0530
@@ -16,9 +16,12 @@
 #include <linux/time.h>
 #include <linux/sysctl.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
 
 int delayacct_on = 0;		/* Delay accounting turned on/off */
 kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
 
 static int __init delayacct_setup_enable(char *str)
 {
@@ -51,8 +54,14 @@ void __delayacct_tsk_init(struct task_st
 
 void __delayacct_tsk_exit(struct task_struct *tsk)
 {
+	/*
+	 * Protect against racing thread group exits
+	 */
+	mutex_lock(&delayacct_exit_mutex);
+	taskstats_exit_pid(tsk);
 	kmem_cache_free(delayacct_cache, tsk->delays);
 	tsk->delays = NULL;
+	mutex_unlock(&delayacct_exit_mutex);
 }
 
 static inline nsec_t delayacct_measure(void)
@@ -97,3 +106,50 @@ void __delayacct_swapin(void)
 	current->delays->swapin_count++;
 	spin_unlock(&current->delays->lock);
 }
+
+#ifdef CONFIG_TASKSTATS
+
+int delayacct_add_tsk(struct taskstats_reply *reply, struct task_struct *tsk)
+{
+	struct taskstats *d = &reply->stats;
+	nsec_t tmp;
+	struct timespec ts;
+	unsigned long t1,t2;
+
+	if (!tsk->delays || !delayacct_on)
+		return -EINVAL;
+
+	/* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+#ifdef CONFIG_SCHEDSTATS
+
+	tmp = (nsec_t)d->cpu_run_total ;
+	tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+	d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0:tmp;
+
+	/* No locking available for sched_info. Take snapshot first. */
+	t1 = tsk->sched_info.pcnt;
+	t2 = tsk->sched_info.run_delay;
+
+	d->cpu_count += t1;
+
+	jiffies_to_timespec(t2, &ts);
+	tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+	d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0:tmp;
+#else
+	/* Non-zero XXX_total,zero XXX_count implies XXX stat unavailable */
+	d->cpu_count = 0;
+	d->cpu_run_total = d->cpu_delay_total = TASKSTATS_NOCPUSTATS;
+#endif
+	spin_lock(&tsk->delays->lock);
+	tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+	d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0:tmp;
+	tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+	d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0:tmp;
+	d->blkio_count += tsk->delays->blkio_count;
+	d->swapin_count += tsk->delays->swapin_count;
+	spin_unlock(&tsk->delays->lock);
+
+	return 0;
+}
+
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16-rc5/kernel/Makefile~delayacct-genetlink	2006-03-09 17:15:31.000000000 +0530
+++ linux-2.6.16-rc5-balbir/kernel/Makefile	2006-03-09 17:15:31.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
 
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null	2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-rc5-balbir/kernel/taskstats.c	2006-03-09 18:52:47.000000000 +0530
@@ -0,0 +1,244 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ *           (C) Balbir Singh,   IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+
+static struct genl_family family = {
+	.id             = GENL_ID_GENERATE,
+	.name           = TASKSTATS_GENL_NAME,
+	.version        = TASKSTATS_GENL_VERSION,
+	.hdrsize        = 0,
+	.maxattr        = 0,
+};
+
+/* Taskstat specific functions */
+static int prepare_reply(struct genl_info *info, u8 cmd,
+			 struct sk_buff **skbp, struct taskstats_reply **replyp)
+{
+	struct sk_buff *skb;
+	struct taskstats_reply *reply;
+
+	skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN);
+	if (!skb)
+		return -ENOMEM;
+
+	if (!info) {
+		int seq = get_cpu_var(taskstats_seqnum)++;
+		put_cpu_var(taskstats_seqnum);
+
+		reply = genlmsg_put(skb, 0, seq,
+				    family.id, 0, NLM_F_REQUEST,
+				    cmd, family.version);
+	} else
+		reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+				    family.id, 0, info->nlhdr->nlmsg_flags,
+				    info->genlhdr->cmd, family.version);
+	if (reply == NULL) {
+		nlmsg_free(skb);
+		return -EINVAL;
+	}
+	skb_put(skb, TASKSTATS_BODYLEN);
+
+	memset(reply, 0, sizeof(*reply));
+	reply->version = taskstats_version;
+	reply->err = 0;
+
+	*skbp = skb;
+	*replyp = reply;
+	return 0;
+}
+
+static int send_reply(struct sk_buff *skb, int replytype, pid_t pid, int event)
+{
+	struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+	struct taskstats_reply *reply;
+	int rc;
+
+	reply = (struct taskstats_reply *)genlmsg_data(genlhdr);
+	reply->outtype = replytype;
+
+	rc = genlmsg_end(skb, reply);
+	if (rc < 0) {
+		nlmsg_free(skb);
+		return rc;
+	}
+
+	if (event)
+		return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+	else
+		return genlmsg_unicast(skb, pid);
+}
+
+static inline void fill_pid(struct taskstats_reply *reply, pid_t pid,
+			    struct task_struct *pidtsk)
+{
+	int rc;
+	struct task_struct *tsk = pidtsk;
+
+	if (!pidtsk) {
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pid);
+		if (!tsk) {
+			read_unlock(&tasklist_lock);
+			reply->err = EINVAL;
+			return;
+		}
+		get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(tsk);
+
+	rc = delayacct_add_tsk(reply, tsk);
+	if (!rc) {
+		reply->stats.pid = (s64)tsk->pid;
+		reply->stats.tgid = (s64)tsk->tgid;
+	} else
+		reply->err = (rc < 0) ? -rc : rc ;
+
+	put_task_struct(tsk);
+}
+
+static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+	struct taskstats_cmd_param *param= info->userhdr;
+
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc)
+		return rc;
+	fill_pid(reply, param->id.pid, NULL);
+	return send_reply(rep_skb, TASKSTATS_REPLY_PID, info->snd_pid, 0);
+}
+
+static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid,
+			     struct task_struct *tgidtsk)
+{
+	int rc;
+	struct task_struct *tsk, *first;
+
+	first = tgidtsk;
+	read_lock(&tasklist_lock);
+	if (!first) {
+		first = find_task_by_pid(tgid);
+		if (!first) {
+			read_unlock(&tasklist_lock);
+			reply->err = EINVAL;
+			return;
+		}
+	}
+	tsk = first;
+	do {
+		rc = delayacct_add_tsk(reply, tsk);
+		if (rc)
+			break;
+	} while_each_thread(first, tsk);
+	read_unlock(&tasklist_lock);
+
+	if (!rc) {
+		reply->stats.pid = (s64)TASKSTATS_NOPID;
+		reply->stats.tgid = (s64)tgid;
+	} else
+		reply->err = (rc < 0) ? -rc : rc ;
+}
+
+static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+	struct taskstats_cmd_param *param= info->userhdr;
+
+	rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+	if (rc)
+		return rc;
+	fill_tgid(reply, param->id.tgid, NULL);
+	return send_reply(rep_skb, TASKSTATS_REPLY_TGID, info->snd_pid, 0);
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+	int rc;
+	struct sk_buff *rep_skb;
+	struct taskstats_reply *reply;
+
+	/*
+	 * tasks can start to exit very early. Ensure that the family
+	 * is registered before notifications are sent out
+	 */
+	if (!family_registered)
+		return;
+
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+	if (rc)
+		return;
+	fill_pid(reply, tsk->pid, tsk);
+	rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID, 0, 1);
+
+	if (rc || thread_group_empty(tsk))
+		return;
+
+	/* Send tgid data too */
+	rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+	if (rc)
+		return;
+	fill_tgid(reply, tsk->tgid, tsk);
+	send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID, 0, 1);
+}
+
+static struct genl_ops pid_ops = {
+	.cmd            = TASKSTATS_CMD_PID,
+	.doit           = taskstats_send_pid,
+};
+
+static struct genl_ops tgid_ops = {
+	.cmd            = TASKSTATS_CMD_TGID,
+	.doit           = taskstats_send_tgid,
+};
+
+static int __init taskstats_init(void)
+{
+	if (genl_register_family(&family))
+		return -EFAULT;
+	family_registered = 1;
+
+	if (genl_register_ops(&family, &pid_ops))
+		goto err;
+	if (genl_register_ops(&family, &tgid_ops))
+		goto err;
+
+	return 0;
+err:
+	genl_unregister_family(&family);
+	family_registered = 0;
+	return -EFAULT;
+}
+
+late_initcall(taskstats_init);
+
diff -puN include/net/genetlink.h~delayacct-genetlink include/net/genetlink.h
--- linux-2.6.16-rc5/include/net/genetlink.h~delayacct-genetlink	2006-03-09 17:15:31.000000000 +0530
+++ linux-2.6.16-rc5-balbir/include/net/genetlink.h	2006-03-09 17:48:39.000000000 +0530
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
 	return nlmsg_unicast(genl_sock, skb, pid);
 }
 
+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+	return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+	struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+						    NLMSG_HDRLEN);
+	return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
 #endif	/* __NET_GENERIC_NETLINK_H */
_


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-09 14:37               ` [UPDATED PATCH] " Balbir Singh
@ 2006-03-09 16:06                 ` Shailabh Nagar
  2006-03-10 14:53                 ` jamal
  1 sibling, 0 replies; 11+ messages in thread
From: Shailabh Nagar @ 2006-03-09 16:06 UTC (permalink / raw)
  To: balbir; +Cc: hadi, netdev, linux-kernel, lse-tech

Balbir Singh wrote:

<snip>

>Hello, Jamal,
>
>Please find the latest version of the patch for review. The genetlink
>code has been updated as per your review comments. The changelog is provided
>below
>
>1. Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
>2. Provide generic functions called genlmsg_data() and genlmsg_len()
>   in linux/net/genetlink.h
>  
>
Balbir,
it might be a good idea to split 2. out separately, since it has generic 
value beyond the
delay accounting patches (just like we did for the timespec_diff_ns change)


Thanks,
Shailabh

>3. Do not multicast all replies, multicast only events generated due
>   to task exit.
>4. The taskstats and taskstats_reply structures are now 64 bit aligned.
>5. Family id is dynamically generated.
>
>Please let us know if we missed something out.
>
>Thanks,
>Balbir
>
>
>Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
>Signed-off-by: Balbir Singh <balbir@in.ibm.com>
>
>---
>
> include/linux/delayacct.h |    2 
> include/linux/taskstats.h |  128 ++++++++++++++++++++++++
> include/net/genetlink.h   |   20 +++
> init/Kconfig              |   16 ++-
> kernel/Makefile           |    1 
> kernel/delayacct.c        |   56 ++++++++++
> kernel/taskstats.c        |  244 ++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 464 insertions(+), 3 deletions(-)
>
>  
>
<snip>



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-09 14:37               ` [UPDATED PATCH] " Balbir Singh
  2006-03-09 16:06                 ` Shailabh Nagar
@ 2006-03-10 14:53                 ` jamal
  2006-03-10 15:35                   ` [UPDATED PATCH] Re: [Lse-tech] " jamal
  2006-03-10 16:39                   ` [UPDATED PATCH] " Balbir Singh
  1 sibling, 2 replies; 11+ messages in thread
From: jamal @ 2006-03-10 14:53 UTC (permalink / raw)
  To: balbir; +Cc: Shailabh Nagar, netdev, linux-kernel, lse-tech

On Thu, 2006-09-03 at 20:07 +0530, Balbir Singh wrote:

> Please find the latest version of the patch for review. The genetlink
> code has been updated as per your review comments. The changelog is provided
> below
> 
> 1. Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
> 2. Provide generic functions called genlmsg_data() and genlmsg_len()
>    in linux/net/genetlink.h
> 3. Do not multicast all replies, multicast only events generated due
>    to task exit.
> 4. The taskstats and taskstats_reply structures are now 64 bit aligned.
> 5. Family id is dynamically generated.
> 
> Please let us know if we missed something out.

Design still shaky IMO - now that i think i may understand what your end
goal is. 
Using the principles i described in earlier email, the problem you are
trying to solve is:

a) shipping of the taskstats from kernel to user-space asynchronously to
all listeners on multicast channel/group TASKSTATS_LISTEN_GRP
at the point when some process exits.
b) responding to queries issued by the user to the kernel for taskstats
of a particular defined tgid and/or pid combination. 

Did i summarize your goals correctly?

So lets stat with #b:
i) the message is multicast; there has to be a user space app registered
to the multicast group otherwise nothing goes to user space.
ii) user space issues a GET and it seems to me the appropriate naming
for the response is a NEW.

Lets go to #a:
The issued multicast messages are also NEW and no different from the
ones sent in response to a GET.

Having said that then, you have the following commands:

enum {
       TASKSTATS_CMD_UNSPEC,           /* Reserved */
       TASKSTATS_CMD_GET,              /* user -> kernel query*/
       TASKSTATS_CMD_NEW,             /* kernel -> user update */
};

You also need the following TLVs

enum {
       TASKSTATS_TYPE_UNSPEC,           /* Reserved */
       TASKSTATS_TYPE_TGID,              /* The TGID */
       TASKSTATS_TYPE_PID,             /* The PID */
       TASKSTATS_TYPE_STATS, 		/* carries the taskstats */
       TASKSTATS_TYPE_VERS,		/* carries the version */
};

Refer to the doc i passed you and provide feedback if how to use the
above is not obvious.
 
The use of TLVs above implies that any of these can be optionally
appearing.
So when you are going from user->kernel with a GET a in #a above, then
you can specify the PID and/or TGID and you dont need to specify the
STATS and this would be perfectly legal.

On kernel->user (in the case of response to #a or async notifiation as
in #b) you really dont need to specify the TG/PID since they appear in
the STATS etc.
I take it you dont want to configure the values of taskstats from user
space, otherwise user->kernel will send a NEW as well.
I also take it dumping doesnt apply to you, so you dont need a dump
callback in your kernel code.
>From what i described so far, you dont really need a header for yourself
either (maybe you need one to just store the VERSION?)

I didnt understand the point to the err field you had in the reply.
Netlink already does issue errors which can be read via perror. If this
is different from what you need, it may be an excuse to have your own
header.

I hope this helps.

cheers,
jamal





-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-10 14:53                 ` jamal
@ 2006-03-10 15:35                   ` jamal
  2006-03-10 16:39                   ` [UPDATED PATCH] " Balbir Singh
  1 sibling, 0 replies; 11+ messages in thread
From: jamal @ 2006-03-10 15:35 UTC (permalink / raw)
  To: balbir; +Cc: Shailabh Nagar, netdev, linux-kernel, lse-tech

On Fri, 2006-10-03 at 09:53 -0500, jamal wrote:

> 
> a) shipping of the taskstats from kernel to user-space asynchronously to
> all listeners on multicast channel/group TASKSTATS_LISTEN_GRP
> at the point when some process exits.
> b) responding to queries issued by the user to the kernel for taskstats
> of a particular defined tgid and/or pid combination. 
> 
> Did i summarize your goals correctly?
> 
> So lets stat with #b:
> i) the message is multicast; there has to be a user space app registered
> to the multicast group otherwise nothing goes to user space.

I mispoke:
The above applies to #a. 
For #b, the message from/to kernel to user is unicast.


cheers,
jamal

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-10 14:53                 ` jamal
  2006-03-10 15:35                   ` [UPDATED PATCH] Re: [Lse-tech] " jamal
@ 2006-03-10 16:39                   ` Balbir Singh
  2006-03-11 13:30                     ` jamal
  1 sibling, 1 reply; 11+ messages in thread
From: Balbir Singh @ 2006-03-10 16:39 UTC (permalink / raw)
  To: jamal; +Cc: Shailabh Nagar, netdev, linux-kernel, lse-tech

On Fri, Mar 10, 2006 at 09:53:53AM -0500, jamal wrote:
> On Thu, 2006-09-03 at 20:07 +0530, Balbir Singh wrote:
> > Please let us know if we missed something out.
> 
> Design still shaky IMO - now that i think i may understand what your end
> goal is. 
> Using the principles i described in earlier email, the problem you are
> trying to solve is:
> 
> a) shipping of the taskstats from kernel to user-space asynchronously to
> all listeners on multicast channel/group TASKSTATS_LISTEN_GRP
> at the point when some process exits.
> b) responding to queries issued by the user to the kernel for taskstats
> of a particular defined tgid and/or pid combination. 
> 
> Did i summarize your goals correctly?

Yes, you did.

> 
> So lets stat with #b:
> i) the message is multicast; there has to be a user space app registered
> to the multicast group otherwise nothing goes to user space.
> ii) user space issues a GET and it seems to me the appropriate naming
> for the response is a NEW.
> 
> Lets go to #a:
> The issued multicast messages are also NEW and no different from the
> ones sent in response to a GET.
> 
> Having said that then, you have the following commands:
> 
> enum {
>        TASKSTATS_CMD_UNSPEC,           /* Reserved */
>        TASKSTATS_CMD_GET,              /* user -> kernel query*/
>        TASKSTATS_CMD_NEW,             /* kernel -> user update */
> };
> 
> You also need the following TLVs
> 
> enum {
>        TASKSTATS_TYPE_UNSPEC,           /* Reserved */
>        TASKSTATS_TYPE_TGID,              /* The TGID */
>        TASKSTATS_TYPE_PID,             /* The PID */
>        TASKSTATS_TYPE_STATS, 		/* carries the taskstats */
>        TASKSTATS_TYPE_VERS,		/* carries the version */
> };
> 
> Refer to the doc i passed you and provide feedback if how to use the
> above is not obvious.
>  

I will look at the document, just got hold of it.

> The use of TLVs above implies that any of these can be optionally
> appearing.
> So when you are going from user->kernel with a GET a in #a above, then
> you can specify the PID and/or TGID and you dont need to specify the
> STATS and this would be perfectly legal.
> 
> On kernel->user (in the case of response to #a or async notifiation as
> in #b) you really dont need to specify the TG/PID since they appear in
> the STATS etc.

I see your point now. I am looking at other users of netlink like
rtnetlink and I see the classical usage.

We can implement TLV's in our code, but for the most part the data we exchange
between the user <-> kernel has all the TLV's listed in the enum above.
The major differnece is the type (pid/tgid). Hence we created a structure
(taskstats) instead of using TLV's.

> I take it you dont want to configure the values of taskstats from user
> space, otherwise user->kernel will send a NEW as well.

Your understanding is correct.

> I also take it dumping doesnt apply to you, so you dont need a dump
> callback in your kernel code.

Yes, this is correct as well.

> >From what i described so far, you dont really need a header for yourself
> either (maybe you need one to just store the VERSION?)
> 

True, we do not need a header.

> I didnt understand the point to the err field you had in the reply.
> Netlink already does issue errors which can be read via perror. If this
> is different from what you need, it may be an excuse to have your own
> header.
> 

Hmm.. Will look into this.

> I hope this helps.
> 

Yes, it does immensely. Thanks for the detailed feedback.

> cheers,
> jamal
> 
> 

Warm Regards,
Balbir


-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-10 16:39                   ` [UPDATED PATCH] " Balbir Singh
@ 2006-03-11 13:30                     ` jamal
  2006-03-13 16:21                       ` [UPDATED PATCH] Re: [Lse-tech] " Balbir Singh
  0 siblings, 1 reply; 11+ messages in thread
From: jamal @ 2006-03-11 13:30 UTC (permalink / raw)
  To: balbir; +Cc: Shailabh Nagar, netdev, linux-kernel, lse-tech

On Fri, 2006-10-03 at 22:09 +0530, Balbir Singh wrote:
> On Fri, Mar 10, 2006 at 09:53:53AM -0500, jamal wrote: 

> > On kernel->user (in the case of response to #a or async notifiation as
> > in #b) you really dont need to specify the TG/PID since they appear in
> > the STATS etc.
> 
> I see your point now. I am looking at other users of netlink like
> rtnetlink and I see the classical usage.
> 
> We can implement TLV's in our code, but for the most part the data we exchange
> between the user <-> kernel has all the TLV's listed in the enum above.
>
> The major differnece is the type (pid/tgid). Hence we created a structure
> (taskstats) instead of using TLV's.

Something to remember:

1) TLVs are essentially giving you the flexibility to send optionally
appearing elements. It is up to the receiver (in the kernel or user
space) to check for the presence of mandatory elements or execute things
depending on the presence of certain TLVs. Example in your case:
if the tgid TLV appears then the user is requesting for that TLV
if the pid appears then they are requesting for that
if both appear then it is the && of the two.
You should always ignore TLVs you dont understand - to allow for forward
compatibility.

2)  The "T" part is essentially also encoding (semantically) what size
the value is; the "L" part is useful for validation. So the receiver
will always know what the size of the TLV is by definition and uses the
L to make sure it is the right size. Reject what is of the wrong size.

cheers,
jamal



-------------------------------------------------------
This SF.Net email is sponsored by xPML, a groundbreaking scripting language
that extends applications into web and mobile media. Attend the live webcast
and join the prime developer group breaking into this new coding territory!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [UPDATED PATCH] Re: [Lse-tech] Re: [Patch 7/7] Generic netlink interface (delay accounting)
  2006-03-11 13:30                     ` jamal
@ 2006-03-13 16:21                       ` Balbir Singh
  0 siblings, 0 replies; 11+ messages in thread
From: Balbir Singh @ 2006-03-13 16:21 UTC (permalink / raw)
  To: jamal; +Cc: Shailabh Nagar, netdev, linux-kernel, lse-tech

On Sat, Mar 11, 2006 at 08:30:49AM -0500, jamal wrote:
> On Fri, 2006-10-03 at 22:09 +0530, Balbir Singh wrote:
> > On Fri, Mar 10, 2006 at 09:53:53AM -0500, jamal wrote: 
> 
> > > On kernel->user (in the case of response to #a or async notifiation as
> > > in #b) you really dont need to specify the TG/PID since they appear in
> > > the STATS etc.
> > 
> > I see your point now. I am looking at other users of netlink like
> > rtnetlink and I see the classical usage.
> > 
> > We can implement TLV's in our code, but for the most part the data we exchange
> > between the user <-> kernel has all the TLV's listed in the enum above.
> >
> > The major differnece is the type (pid/tgid). Hence we created a structure
> > (taskstats) instead of using TLV's.
> 
> Something to remember:
> 
> 1) TLVs are essentially giving you the flexibility to send optionally
> appearing elements. It is up to the receiver (in the kernel or user
> space) to check for the presence of mandatory elements or execute things
> depending on the presence of certain TLVs. Example in your case:
> if the tgid TLV appears then the user is requesting for that TLV
> if the pid appears then they are requesting for that
> if both appear then it is the && of the two.
> You should always ignore TLVs you dont understand - to allow for forward
> compatibility.
> 
> 2)  The "T" part is essentially also encoding (semantically) what size
> the value is; the "L" part is useful for validation. So the receiver
> will always know what the size of the TLV is by definition and uses the
> L to make sure it is the right size. Reject what is of the wrong size.
> 
> cheers,
> jamal

Thanks for the clarification, I will try and adapt our genetlink to use
TLV's, I can see the benefits - we will work on this as an evolutionary change
to our code.

Warm Regards,
Balbir

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2006-03-13 16:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1141026996.5785.38.camel@elinux04.optonline.net>
2006-02-27  8:31 ` [Patch 7/7] Generic netlink interface (delay accounting) Shailabh Nagar
     [not found]   ` <1141045194.5363.12.camel@localhost.localdomain>
     [not found]     ` <4403608E.1050304@watson.ibm.com>
     [not found]       ` <1141652556.5185.64.camel@localhost.localdomain>
2006-03-06 17:00         ` Shailabh Nagar
2006-03-07 14:38           ` [Lse-tech] " jamal
2006-03-08 21:56             ` Shailabh Nagar
2006-03-09 14:37               ` [UPDATED PATCH] " Balbir Singh
2006-03-09 16:06                 ` Shailabh Nagar
2006-03-10 14:53                 ` jamal
2006-03-10 15:35                   ` [UPDATED PATCH] Re: [Lse-tech] " jamal
2006-03-10 16:39                   ` [UPDATED PATCH] " Balbir Singh
2006-03-11 13:30                     ` jamal
2006-03-13 16:21                       ` [UPDATED PATCH] Re: [Lse-tech] " Balbir Singh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).