* [Patch 8/9] generic netlink utility functions
[not found] <1142296834.5858.3.camel@elinux04.optonline.net>
@ 2006-03-14 0:55 ` Shailabh Nagar
2006-03-26 16:44 ` Balbir Singh
2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
1 sibling, 1 reply; 26+ messages in thread
From: Shailabh Nagar @ 2006-03-14 0:55 UTC (permalink / raw)
To: linux-kernel, netdev; +Cc: Jamal
genetlink-utils.patch
Two utilities for simplifying usage of NETLINK_GENERIC
interface.
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
include/net/genetlink.h | 20 ++++++++++++++++++++
1 files changed, 20 insertions(+)
Index: linux-2.6.16-rc5/include/net/genetlink.h
===================================================================
--- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500
+++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500
@@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
return nlmsg_unicast(genl_sock, skb, pid);
}
+/**
+ * gennlmsg_data - head of message payload
+ * @gnlh: genetlink messsage header
+ */
+static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
+{
+ return ((unsigned char *) gnlh + GENL_HDRLEN);
+}
+
+/**
+ * genlmsg_len - length of message payload
+ * @gnlh: genetlink message header
+ */
+static inline int genlmsg_len(const struct genlmsghdr *gnlh)
+{
+ struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
+ NLMSG_HDRLEN);
+ return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
+}
+
#endif /* __NET_GENERIC_NETLINK_H */
^ permalink raw reply [flat|nested] 26+ messages in thread
* [Patch 9/9] Generic netlink interface for delay accounting
[not found] <1142296834.5858.3.camel@elinux04.optonline.net>
2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
@ 2006-03-14 0:56 ` Shailabh Nagar
2006-03-14 2:29 ` jamal
2006-03-14 2:33 ` Matt Helsley
1 sibling, 2 replies; 26+ messages in thread
From: Shailabh Nagar @ 2006-03-14 0:56 UTC (permalink / raw)
To: linux-kernel; +Cc: Jamal, netdev
delayacct-genetlink.patch
Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.
Comments addressed (all in response to Jamal)
- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
- Use multicast only for events generated due to task exit, not for
replies to commands
- Align taskstats and taskstats_reply structures to 64 bit aligned.
- Use dynamic ID generation for genetlink family
More changes expected. Following comments will go into a
Documentation file:
When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.
The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).
If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.
The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.
If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
include/linux/taskstats.h | 128 ++++++++++++++++++++++++
init/Kconfig | 16 ++-
kernel/taskstats.c | 244 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 385 insertions(+), 3 deletions(-)
Index: linux-2.6.16-rc5/include/linux/taskstats.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/include/linux/taskstats.h 2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,128 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+
+ /* Version 1 */
+#define TASKSTATS_NOPID -1
+ __s64 pid;
+ __s64 tgid;
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC, /* Reserved */
+ TASKSTATS_CMD_NONE, /* Not a valid cmd to send
+ * Marks data sent on task/tgid exit */
+ TASKSTATS_CMD_LISTEN, /* Start listening */
+ TASKSTATS_CMD_IGNORE, /* Stop listening */
+ TASKSTATS_CMD_PID, /* Send stats for a pid */
+ TASKSTATS_CMD_TGID, /* Send stats for a tgid */
+};
+
+/* Parameters for commands
+ * New parameters should only be inserted at the struct's end
+ */
+
+struct taskstats_cmd_param {
+ /* Maintain 64-bit alignment while extending */
+ union {
+ __s64 pid;
+ __s64 tgid;
+ } id;
+};
+
+enum outtype {
+ TASKSTATS_REPLY_NONE = 1, /* Control cmd response */
+ TASKSTATS_REPLY_PID, /* per-pid data cmd response*/
+ TASKSTATS_REPLY_TGID, /* per-tgid data cmd response*/
+ TASKSTATS_REPLY_EXIT_PID, /* Exiting task's stats */
+ TASKSTATS_REPLY_EXIT_TGID, /* Exiting tgid's stats
+ * (sent on each tid's exit) */
+};
+
+/*
+ * Reply sent from kernel
+ * Version number affects size/format of struct taskstats only
+ */
+
+struct taskstats_reply {
+ /* Maintain 64-bit alignment while extending */
+ __u16 outtype; /* Must be one of enum outtype */
+ __u16 version;
+ __u32 err;
+ struct taskstats stats; /* Invalid if err != 0 */
+};
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#define TASKSTATS_HDRLEN (NLMSG_SPACE(GENL_HDRLEN))
+#define TASKSTATS_BODYLEN (sizeof(struct taskstats_reply))
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
Index: linux-2.6.16-rc5/kernel/taskstats.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-rc5/kernel/taskstats.c 2006-03-13 19:01:35.000000000 -0500
@@ -0,0 +1,244 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = 0,
+};
+
+/* Taskstat specific functions */
+static int prepare_reply(struct genl_info *info, u8 cmd,
+ struct sk_buff **skbp, struct taskstats_reply **replyp)
+{
+ struct sk_buff *skb;
+ struct taskstats_reply *reply;
+
+ skb = nlmsg_new(TASKSTATS_HDRLEN + TASKSTATS_BODYLEN);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, NLM_F_REQUEST,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, info->nlhdr->nlmsg_flags,
+ info->genlhdr->cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+ skb_put(skb, TASKSTATS_BODYLEN);
+
+ memset(reply, 0, sizeof(*reply));
+ reply->version = taskstats_version;
+ reply->err = 0;
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, int replytype, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ struct taskstats_reply *reply;
+ int rc;
+
+ reply = (struct taskstats_reply *)genlmsg_data(genlhdr);
+ reply->outtype = replytype;
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ else
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline void fill_pid(struct taskstats_reply *reply, pid_t pid,
+ struct task_struct *pidtsk)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ reply->err = EINVAL;
+ return;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(reply, tsk);
+ if (!rc) {
+ reply->stats.pid = (s64)tsk->pid;
+ reply->stats.tgid = (s64)tsk->tgid;
+ } else
+ reply->err = (rc < 0) ? -rc : rc ;
+
+ put_task_struct(tsk);
+}
+
+static int taskstats_send_pid(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+ struct taskstats_cmd_param *param= info->userhdr;
+
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc)
+ return rc;
+ fill_pid(reply, param->id.pid, NULL);
+ return send_reply(rep_skb, TASKSTATS_REPLY_PID, info->snd_pid, 0);
+}
+
+static inline void fill_tgid(struct taskstats_reply *reply, pid_t tgid,
+ struct task_struct *tgidtsk)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ reply->err = EINVAL;
+ return;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(reply, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ if (!rc) {
+ reply->stats.pid = (s64)TASKSTATS_NOPID;
+ reply->stats.tgid = (s64)tgid;
+ } else
+ reply->err = (rc < 0) ? -rc : rc ;
+}
+
+static int taskstats_send_tgid(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+ struct taskstats_cmd_param *param= info->userhdr;
+
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc)
+ return rc;
+ fill_tgid(reply, param->id.tgid, NULL);
+ return send_reply(rep_skb, TASKSTATS_REPLY_TGID, info->snd_pid, 0);
+}
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats_reply *reply;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+ if (rc)
+ return;
+ fill_pid(reply, tsk->pid, tsk);
+ rc = send_reply(rep_skb, TASKSTATS_REPLY_EXIT_PID, 0, 1);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NONE, &rep_skb, &reply);
+ if (rc)
+ return;
+ fill_tgid(reply, tsk->tgid, tsk);
+ send_reply(rep_skb, TASKSTATS_REPLY_EXIT_TGID, 0, 1);
+}
+
+static struct genl_ops pid_ops = {
+ .cmd = TASKSTATS_CMD_PID,
+ .doit = taskstats_send_pid,
+};
+
+static struct genl_ops tgid_ops = {
+ .cmd = TASKSTATS_CMD_TGID,
+ .doit = taskstats_send_tgid,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &pid_ops))
+ goto err;
+ if (genl_register_ops(&family, &tgid_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
+
Index: linux-2.6.16-rc5/init/Kconfig
===================================================================
--- linux-2.6.16-rc5.orig/init/Kconfig 2006-03-13 18:51:30.000000000 -0500
+++ linux-2.6.16-rc5/init/Kconfig 2006-03-13 19:04:52.000000000 -0500
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.
- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
@ 2006-03-14 2:29 ` jamal
2006-03-14 2:33 ` Matt Helsley
1 sibling, 0 replies; 26+ messages in thread
From: jamal @ 2006-03-14 2:29 UTC (permalink / raw)
To: nagar; +Cc: linux-kernel, netdev
On Mon, 2006-13-03 at 19:56 -0500, Shailabh Nagar wrote:
> delayacct-genetlink.patch
>
> Create a generic netlink interface (NETLINK_GENERIC family),
> called "taskstats", for getting delay and cpu statistics of
> tasks and thread groups during their lifetime and when they exit.
>
> Comments addressed (all in response to Jamal)
>
Note, you are still not following the standard scheme of doing things.
Example: using command = GET and the message carrying the TGID to note
which TGID is of interest. Instead you have command = TGID.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
2006-03-14 2:29 ` jamal
@ 2006-03-14 2:33 ` Matt Helsley
2006-03-14 2:48 ` jamal
2006-03-14 4:29 ` Shailabh Nagar
1 sibling, 2 replies; 26+ messages in thread
From: Matt Helsley @ 2006-03-14 2:33 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: linux-kernel, Jamal, netdev
On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
<snip>
> Comments addressed (all in response to Jamal)
>
> - Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
The enums for these are still in the patch. See below.
<snip>
> +/*
> + * Commands sent from userspace
> + * Not versioned. New commands should only be inserted at the enum's end
> + */
> +
> +enum {
> + TASKSTATS_CMD_UNSPEC, /* Reserved */
> + TASKSTATS_CMD_NONE, /* Not a valid cmd to send
> + * Marks data sent on task/tgid exit */
> + TASKSTATS_CMD_LISTEN, /* Start listening */
> + TASKSTATS_CMD_IGNORE, /* Stop listening */
>From the description I thought you had eliminated these.
> + TASKSTATS_CMD_PID, /* Send stats for a pid */
> + TASKSTATS_CMD_TGID, /* Send stats for a tgid */
> +};
Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> Note, you are still not following the standard scheme of doing things.
> Example: using command = GET and the message carrying the TGID to note
> which TGID is of interest. Instead you have command = TGID.
>
> cheers,
> jamal
meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
misunderstanding?
<snip>
Cheers,
-Matt Helsley
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 2:33 ` Matt Helsley
@ 2006-03-14 2:48 ` jamal
2006-03-14 4:18 ` Shailabh Nagar
2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
2006-03-14 4:29 ` Shailabh Nagar
1 sibling, 2 replies; 26+ messages in thread
From: jamal @ 2006-03-14 2:48 UTC (permalink / raw)
To: Matt Helsley; +Cc: Shailabh Nagar, linux-kernel, netdev
On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
> Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
> > Note, you are still not following the standard scheme of doing things.
> > Example: using command = GET and the message carrying the TGID to note
> > which TGID is of interest. Instead you have command = TGID.
> >
> meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
> TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
> misunderstanding?
>
I had a long description in an earlier email feedback; but the summary
of it is the GET command is generic like TASKSTATS_CMD_GET; the message
itself carries TLVs of what needs to be gotten which are
either PID and/or TGID etc. Anyways, theres a long spill of what i am
saying in that earlier email. Perhaps the current patch is a transition
towards that?
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 2:48 ` jamal
@ 2006-03-14 4:18 ` Shailabh Nagar
2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
1 sibling, 0 replies; 26+ messages in thread
From: Shailabh Nagar @ 2006-03-14 4:18 UTC (permalink / raw)
To: hadi; +Cc: Matt Helsley, linux-kernel, netdev
jamal wrote:
>On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
>
>
>>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>>
>>
>
>
>
>>Jamal, was your Mon, 13 Mar 2006 21:29:09 -0500 reply:
>>
>>
>>>Note, you are still not following the standard scheme of doing things.
>>>Example: using command = GET and the message carrying the TGID to note
>>>which TGID is of interest. Instead you have command = TGID.
>>>
>>>
>>>
>
>
>
>>meant to suggest that TASKSTATS_CMD_(P|TG)ID should be renamed to
>>TASKSTATS_CMD_GET_(P|TG)ID ? Is that sufficient? Or am I
>>misunderstanding?
>>
>>
>>
>
>I had a long description in an earlier email feedback; but the summary
>of it is the GET command is generic like TASKSTATS_CMD_GET; the message
>itself carries TLVs of what needs to be gotten which are
>either PID and/or TGID etc. Anyways, theres a long spill of what i am
>saying in that earlier email. Perhaps the current patch is a transition
>towards that?
>
>
Yes, the comments you'd made in the previous mail have not been
incorporated and this is still the older version of the patch.
We'd been avoiding TLV usage so far :-)
>cheers,
>jamal
>
>
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 2:33 ` Matt Helsley
2006-03-14 2:48 ` jamal
@ 2006-03-14 4:29 ` Shailabh Nagar
1 sibling, 0 replies; 26+ messages in thread
From: Shailabh Nagar @ 2006-03-14 4:29 UTC (permalink / raw)
To: Matt Helsley; +Cc: linux-kernel, Jamal, netdev
Matt Helsley wrote:
>On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
><snip>
>
>
>
>>Comments addressed (all in response to Jamal)
>>
>>- Eliminated TASKSTATS_CMD_LISTEN and TASKSTATS_CMD_IGNORE
>>
>>
>
>The enums for these are still in the patch. See below.
>
><snip>
>
>
>
>>+/*
>>+ * Commands sent from userspace
>>+ * Not versioned. New commands should only be inserted at the enum's end
>>+ */
>>+
>>+enum {
>>+ TASKSTATS_CMD_UNSPEC, /* Reserved */
>>+ TASKSTATS_CMD_NONE, /* Not a valid cmd to send
>>+ * Marks data sent on task/tgid exit */
>>+ TASKSTATS_CMD_LISTEN, /* Start listening */
>>+ TASKSTATS_CMD_IGNORE, /* Stop listening */
>>
>>
>
>>From the description I thought you had eliminated these.
>
>
Yup, typo. the commands aren't registered but the enums hang on.
Will fix.
--Shailabh
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-14 2:48 ` jamal
2006-03-14 4:18 ` Shailabh Nagar
@ 2006-03-22 7:49 ` Balbir Singh
2006-03-23 14:04 ` jamal
1 sibling, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-22 7:49 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Mon, Mar 13, 2006 at 09:48:26PM -0500, jamal wrote:
> On Mon, 2006-13-03 at 18:33 -0800, Matt Helsley wrote:
> > On Mon, 2006-03-13 at 19:56 -0500, Shailabh Nagar wrote:
>
>
> I had a long description in an earlier email feedback; but the summary
> of it is the GET command is generic like TASKSTATS_CMD_GET; the message
> itself carries TLVs of what needs to be gotten which are
> either PID and/or TGID etc. Anyways, theres a long spill of what i am
> saying in that earlier email. Perhaps the current patch is a transition
> towards that?
>
Hi, Jamal,
Please find the updated version of delayacct-genetlink.patch. We hope
this iteration is closer to your expectation. I have copied the enums
you suggested in your previous review comments and used them.
Comments addressed (in this patch)
- Changed the code to use TLV's for data exchange between kernel and
user space
Thanks,
Balbir
Documentation for the patch
Create a generic netlink interface (NETLINK_GENERIC family),
called "taskstats", for getting delay and cpu statistics of
tasks and thread groups during their lifetime and when they exit.
More changes expected. Following comments will go into a
Documentation file:
When a task is alive, userspace can get its stats by sending a
command containing its pid. Sending a tgid returns the sum of stats
of the tasks belonging to that tgid (where such a sum makes sense).
Together, the command interface allows stats for a large number of
tasks to be collected more efficiently than would be possible
through /proc or any per-pid interface.
The netlink interface also sends the stats for each task to userspace
when the task is exiting. This permits fine-grain accounting for
short-lived tasks, which is important if userspace is doing its own
aggregation of statistics based on some grouping of tasks
(e.g. CSA jobs, ELSA banks or CKRM classes).
If the exiting task belongs to a thread group (with more members than itself)
, the latters delay stats are also sent out on the task's exit. This allows
userspace to get accurate data at a per-tgid level while the tid's of a tgid
are exiting one by one.
The interface has been deliberately kept distinct from the delay
accounting code since it is potentially usable by other kernel components
that need to export per-pid/tgid data. The format of data returned to
userspace is versioned and the command interface easily extensible to
facilitate reuse.
If reuse is not deemed useful enough, the naming, placement of functions
and config options will be modified to make this an interface for delay
accounting alone.
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---
include/linux/delayacct.h | 11 ++
include/linux/taskstats.h | 112 ++++++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 ++++++++
kernel/taskstats.c | 251 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 432 insertions(+), 3 deletions(-)
diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H
#include <linux/sched.h>
+#include <linux/taskstats.h>
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-22 13:12:01.000000000 +0530
@@ -0,0 +1,112 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.
- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-22 13:12:01.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}
/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-22 13:12:07.000000000 +0530
@@ -0,0 +1,251 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .hdrsize = 0,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ skb = nlmsg_new(NLMSG_GOODSIZE);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, NLM_F_REQUEST,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, info->nlhdr->nlmsg_flags,
+ info->genlhdr->cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+ }
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
@ 2006-03-23 14:04 ` jamal
2006-03-23 15:41 ` Balbir Singh
2006-03-24 1:32 ` Balbir Singh
0 siblings, 2 replies; 26+ messages in thread
From: jamal @ 2006-03-23 14:04 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
Hi Balbir,
Looking good.
This is a quick scan, so i didnt look at little details.
Some comments embedded.
On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:
>
> diff -puN /dev/null include/linux/taskstats.h
> + * The struct is versioned. Newer versions should only add fields to
> + * the bottom of the struct to maintain backward compatibility.
> + *
> + * To create the next version, bump up the taskstats_version variable
> + * and delineate the start of newly added fields with a comment indicating
> + * the version number.
> + */
> +
> +enum {
> + TASKSTATS_MSG_UNICAST, /* send data only to requester */
> + TASKSTATS_MSG_MULTICAST, /* send data to a group */
> +};
> +
Above will never be used outside of the kernel.
Should it go under the ifdef kernel below?
> +#ifdef __KERNEL__
> +
Note: some people will argue that you should probably have
two header files. One for all kernel things and includes another
which contains all the stuff above ifdef __KERNEL__
> +
> +#endif /* __KERNEL__ */
> +#endif /* _LINUX_TASKSTATS_H */
> diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
> +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
> +
> +const int taskstats_version = TASKSTATS_VERSION;
> +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> +static int family_registered = 0;
> +
> +static struct genl_family family = {
> + .id = GENL_ID_GENERATE,
> + .name = TASKSTATS_GENL_NAME,
> + .version = TASKSTATS_GENL_VERSION,
> + .hdrsize = 0,
Do you need to specify hdrsize of 0?
> +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> + void **replyp)
> +{
> + struct sk_buff *skb;
> + void *reply;
> +
> + skb = nlmsg_new(NLMSG_GOODSIZE);
Ok, getting a size of NLMSG_GOODSIZE is not a good idea.
The max size youll ever get it seems to me is 2*32-bit-data TLVs +
sizeof struct stats. Why dont you allocate that size?
> + if (!skb)
> + return -ENOMEM;
> +
> + if (!info) {
> + int seq = get_cpu_var(taskstats_seqnum)++;
> + put_cpu_var(taskstats_seqnum);
> +
> + reply = genlmsg_put(skb, 0, seq,
> + family.id, 0, NLM_F_REQUEST,
> + cmd, family.version);
Double check if you need NLM_F_REQUEST
> + } else
> + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> + family.id, 0, info->nlhdr->nlmsg_flags,
> + info->genlhdr->cmd, family.version);
A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
right thing?
> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{
> + int rc;
> + struct sk_buff *rep_skb;
> + struct taskstats stats;
> + void *reply;
> +
> + memset(&stats, 0, sizeof(stats));
> + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
Same comment as before: a response to a GET is a NEW; so
info->genlhdr->cmd doesnt seem right.
> + if (rc < 0)
> + return rc;
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> + rc = fill_pid((pid_t)pid, NULL, &stats);
> + if (rc < 0)
> + return rc;
> + }
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> + rc = fill_tgid((pid_t)tgid, NULL, &stats);
> + if (rc < 0)
> + return rc;
> + }
> +
Should there be at least either a pid or tgid? If yes, you need to
validate here...
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> +
> +nla_put_failure:
> + return genlmsg_cancel(rep_skb, reply);
> +
As a general comment double check your logic for errors; if you already
have stashed something in the skb, you need to remove it etc.
> +}
> +
> +
> +/* Send pid data out on exit */
> +void taskstats_exit_pid(struct task_struct *tsk)
> +{
> + int rc;
> + struct sk_buff *rep_skb;
> + void *reply;
> + struct taskstats stats;
> +
> + /*
> + * tasks can start to exit very early. Ensure that the family
> + * is registered before notifications are sent out
> + */
> + if (!family_registered)
> + return;
> +
> + memset(&stats, 0, sizeof(stats));
> + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> + rc = fill_pid(tsk->pid, tsk, &stats);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> + if (rc || thread_group_empty(tsk))
> + return;
> +
> + /* Send tgid data too */
> + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> + if (rc < 0)
> + return;
> +
A single message with PID+TGID sounds reasonable. Why two messages with
two stats? all you will need to do is get rid of the prepare_reply()
above and NLA_PUT_U32() below (just like you do in a response to a GET.
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> + rc = fill_tgid(tsk->tgid, tsk, &stats);
> + if (rc < 0)
> + return;
> +
> + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> +
> +nla_put_failure:
> + genlmsg_cancel(rep_skb, reply);
> +}
Other than the above comments - I believe you have it right.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-23 14:04 ` jamal
@ 2006-03-23 15:41 ` Balbir Singh
2006-03-24 14:04 ` jamal
2006-03-24 1:32 ` Balbir Singh
1 sibling, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-23 15:41 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> Hi Balbir,
>
> Looking good.
> This is a quick scan, so i didnt look at little details.
Thanks for your detailed feedback. Please find my response interspersed
along with your comments.
> Some comments embedded.
>
> On Wed, 2006-22-03 at 13:19 +0530, Balbir Singh wrote:
>
>
>
>
> >
>
> > diff -puN /dev/null include/linux/taskstats.h
>
> > + * The struct is versioned. Newer versions should only add fields to
> > + * the bottom of the struct to maintain backward compatibility.
> > + *
> > + * To create the next version, bump up the taskstats_version variable
> > + * and delineate the start of newly added fields with a comment indicating
> > + * the version number.
> > + */
> > +
>
>
>
> > +enum {
> > + TASKSTATS_MSG_UNICAST, /* send data only to requester */
> > + TASKSTATS_MSG_MULTICAST, /* send data to a group */
> > +};
> > +
>
> Above will never be used outside of the kernel.
> Should it go under the ifdef kernel below?
>
Will do
>
> > +#ifdef __KERNEL__
> > +
>
> Note: some people will argue that you should probably have
> two header files. One for all kernel things and includes another
> which contains all the stuff above ifdef __KERNEL__
>
> > +
> > +#endif /* __KERNEL__ */
> > +#endif /* _LINUX_TASKSTATS_H */
>
>
>
>
> > diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
> > --- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
> > +++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
>
> > +
> > +const int taskstats_version = TASKSTATS_VERSION;
> > +static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
> > +static int family_registered = 0;
> > +
> > +static struct genl_family family = {
> > + .id = GENL_ID_GENERATE,
> > + .name = TASKSTATS_GENL_NAME,
> > + .version = TASKSTATS_GENL_VERSION,
> > + .hdrsize = 0,
>
> Do you need to specify hdrsize of 0?
I will remove it
>
>
> > +static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
> > + void **replyp)
> > +{
> > + struct sk_buff *skb;
> > + void *reply;
> > +
> > + skb = nlmsg_new(NLMSG_GOODSIZE);
>
> Ok, getting a size of NLMSG_GOODSIZE is not a good idea.
> The max size youll ever get it seems to me is 2*32-bit-data TLVs +
> sizeof struct stats. Why dont you allocate that size?
>
Will do
> > + if (!skb)
> > + return -ENOMEM;
> > +
> > + if (!info) {
> > + int seq = get_cpu_var(taskstats_seqnum)++;
> > + put_cpu_var(taskstats_seqnum);
> > +
> > + reply = genlmsg_put(skb, 0, seq,
> > + family.id, 0, NLM_F_REQUEST,
> > + cmd, family.version);
>
> Double check if you need NLM_F_REQUEST
>
It is not required, I will remove it
> > + } else
> > + reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
> > + family.id, 0, info->nlhdr->nlmsg_flags,
> > + info->genlhdr->cmd, family.version);
>
> A Response to a GET is a NEW. So i dont think info->genlhdr->cmd is the
> right thing?
>
>
I will change it
>
>
>
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
> > + int rc;
> > + struct sk_buff *rep_skb;
> > + struct taskstats stats;
> > + void *reply;
> > +
> > + memset(&stats, 0, sizeof(stats));
> > + rc = prepare_reply(info, info->genlhdr->cmd, &rep_skb, &reply);
>
> Same comment as before: a response to a GET is a NEW; so
> info->genlhdr->cmd doesnt seem right.
>
I will change it
> > + if (rc < 0)
> > + return rc;
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > + if (rc < 0)
> > + return rc;
> > + }
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
> > + u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
> > + rc = fill_tgid((pid_t)tgid, NULL, &stats);
> > + if (rc < 0)
> > + return rc;
> > + }
> > +
>
> Should there be at least either a pid or tgid? If yes, you need to
> validate here...
>
Yes, you are correct. One of my test cases caught it too.. But I did
not want to untidy the code with if-else's which will keep growing if
the attributes change in the future. I just followed the controller
example. I will change it and validate it. Currently if the attribute
is not valid, a stat of all zero's is returned back.
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
> > +
> > +nla_put_failure:
> > + return genlmsg_cancel(rep_skb, reply);
> > +
>
> As a general comment double check your logic for errors; if you already
> have stashed something in the skb, you need to remove it etc.
>
Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> > +}
> > +
> > +
> > +/* Send pid data out on exit */
> > +void taskstats_exit_pid(struct task_struct *tsk)
> > +{
> > + int rc;
> > + struct sk_buff *rep_skb;
> > + void *reply;
> > + struct taskstats stats;
> > +
> > + /*
> > + * tasks can start to exit very early. Ensure that the family
> > + * is registered before notifications are sent out
> > + */
> > + if (!family_registered)
> > + return;
> > +
> > + memset(&stats, 0, sizeof(stats));
> > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
> > + rc = fill_pid(tsk->pid, tsk, &stats);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > + if (rc || thread_group_empty(tsk))
> > + return;
> > +
> > + /* Send tgid data too */
> > + rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
> > + if (rc < 0)
> > + return;
> > +
>
> A single message with PID+TGID sounds reasonable. Why two messages with
> two stats? all you will need to do is get rid of the prepare_reply()
> above and NLA_PUT_U32() below (just like you do in a response to a GET.
>
The reason for two stats is that for TGID, we return accumulated values
(of all threads in the group) and for PID we return the value just
for that pid. The return value is
pid
<stats for pid>
tgid
<accumulated stats for tgid>
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
> > + rc = fill_tgid(tsk->tgid, tsk, &stats);
> > + if (rc < 0)
> > + return;
> > +
> > + NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
> > + send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
> > +
> > +nla_put_failure:
> > + genlmsg_cancel(rep_skb, reply);
> > +}
>
>
> Other than the above comments - I believe you have it right.
>
Thanks, I will get back to you with the updated patch soon.
> cheers,
> jamal
Thanks,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-23 14:04 ` jamal
2006-03-23 15:41 ` Balbir Singh
@ 2006-03-24 1:32 ` Balbir Singh
2006-03-24 14:11 ` jamal
1 sibling, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-24 1:32 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> Hi Balbir,
>
> Looking good.
> This is a quick scan, so i didnt look at little details.
> Some comments embedded.
Hi, Jamal,
I tried addressing your comments in this new version.
Changelog
---------
1. Moved TASKSTATS_MSG_* to under #ifdef __KERNEL__
2. Got rid of .hdrsize = 0 in genl_family family
3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
4. Got rid of NLM_F_REQUEST, all flags passed down to user space are now 0
5. The response to TASKSTATS_CMD_GET is TASKSTATS_CMD_NEW
6. taskstats_send_stats() now validates the command attributes and ensures
that it either gets a PID or a TGID. If it gets both simultaneously
the PID stats are sent.
7. Do not put the PID/TGID into the skb if there are errors in fill_pid() or
fill_tgid().
Thanks,
Balbir
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---
include/linux/delayacct.h | 11 +
include/linux/taskstats.h | 111 ++++++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 +++++++
kernel/taskstats.c | 255 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 435 insertions(+), 3 deletions(-)
diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H
#include <linux/sched.h>
+#include <linux/taskstats.h>
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-24 06:49:24.000000000 +0530
@@ -0,0 +1,111 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.
- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-24 06:49:24.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}
/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-24 06:50:03.000000000 +0530
@@ -0,0 +1,255 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new((2 * sizeof(u32)) + sizeof(struct taskstats));
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ return rc;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ } else {
+ return -EINVAL;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ rc = send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+ if (rc || thread_group_empty(tsk))
+ return;
+
+ /* Send tgid data too */
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply);
+ if (rc < 0)
+ return;
+
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ return;
+
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-23 15:41 ` Balbir Singh
@ 2006-03-24 14:04 ` jamal
2006-03-24 14:54 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: jamal @ 2006-03-24 14:04 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> > Should there be at least either a pid or tgid? If yes, you need to
> > validate here...
> >
>
> Yes, you are correct. One of my test cases caught it too.. But I did
> not want to untidy the code with if-else's which will keep growing if
> the attributes change in the future. I just followed the controller
> example. I will change it and validate it. Currently if the attribute
> is not valid, a stat of all zero's is returned back.
>
There are many ways to skin this cat.
As an example: you could make pid and tgid global to the function and
set them to 0. At the end of the if statements, you could check if at
least one of them is set - if not you know none was passed and bail out.
> > As a general comment double check your logic for errors; if you already
> > have stashed something in the skb, you need to remove it etc.
> >
>
> Wouldn't genlmsg_cancel() take care of cleaning all attributes?
>
it would for attribute setting - but not for others. As a general
comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
could result in errors.
> > A single message with PID+TGID sounds reasonable. Why two messages with
> > two stats? all you will need to do is get rid of the prepare_reply()
> > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> >
>
> The reason for two stats is that for TGID, we return accumulated values
> (of all threads in the group) and for PID we return the value just
> for that pid. The return value is
>
Ok, I understand the dilemma now - but still not thrilled with having
two messages.
Perhaps you could have nesting of TLVs? This is widely used in the net
code for example
i.e:
TLV = TASKSTATS_TYPE_TGID/PID
TLV = TASKSTATS_TYPE_STATS
Look at using nla_nest_start/end/cancel
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-24 1:32 ` Balbir Singh
@ 2006-03-24 14:11 ` jamal
2006-03-24 14:19 ` jamal
2006-03-24 14:59 ` Balbir Singh
0 siblings, 2 replies; 26+ messages in thread
From: jamal @ 2006-03-24 14:11 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
> 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
Not the right size; the u32 covers the V part of TLV. The T = 16 bits
and L = 16 bits. And if you nest TLVs, then it gets more interesting.
Look at using proper macros instead of hard coding like you did.
grep for something like RTA_SPACE and perhaps send a patch to make it
generic for netlink.h
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-24 14:11 ` jamal
@ 2006-03-24 14:19 ` jamal
2006-03-24 14:59 ` Balbir Singh
1 sibling, 0 replies; 26+ messages in thread
From: jamal @ 2006-03-24 14:19 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, 2006-24-03 at 09:11 -0500, jamal wrote:
> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
>
actually Thomas already has this in netlink.h
Look at using things like:
nla_attr_size()
make sure padding is taken care of etc (read: use the right macros).
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-24 14:04 ` jamal
@ 2006-03-24 14:54 ` Balbir Singh
2006-03-25 1:19 ` jamal
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-24 14:54 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, Mar 24, 2006 at 09:04:21AM -0500, jamal wrote:
> On Thu, 2006-23-03 at 21:11 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> > > Should there be at least either a pid or tgid? If yes, you need to
> > > validate here...
> > >
> >
> > Yes, you are correct. One of my test cases caught it too.. But I did
> > not want to untidy the code with if-else's which will keep growing if
> > the attributes change in the future. I just followed the controller
> > example. I will change it and validate it. Currently if the attribute
> > is not valid, a stat of all zero's is returned back.
> >
>
> There are many ways to skin this cat.
> As an example: you could make pid and tgid global to the function and
> set them to 0. At the end of the if statements, you could check if at
> least one of them is set - if not you know none was passed and bail out.
The latest patch does fix it this issue. In the Changelog
6. taskstats_send_stats() now validates the command attributes and ensures
that it either gets a PID or a TGID. If it gets both simultaneously
the PID stats are sent.
Is this change ok with you?
>
> > > As a general comment double check your logic for errors; if you already
> > > have stashed something in the skb, you need to remove it etc.
> > >
> >
> > Wouldn't genlmsg_cancel() take care of cleaning all attributes?
> >
>
> it would for attribute setting - but not for others. As a general
> comment this is one of those areas where cutnpasting aka TheLinuxWay(tm)
> could result in errors.
:-) I understand.
What I have done is moved all the NLA_PUT_U32 to after verifying the
return values of functions fill_*(). That way we do not stash anything into the
skb if there are pending errors.
>
>
> > > A single message with PID+TGID sounds reasonable. Why two messages with
> > > two stats? all you will need to do is get rid of the prepare_reply()
> > > above and NLA_PUT_U32() below (just like you do in a response to a GET.
> > >
> >
> > The reason for two stats is that for TGID, we return accumulated values
> > (of all threads in the group) and for PID we return the value just
> > for that pid. The return value is
> >
>
> Ok, I understand the dilemma now - but still not thrilled with having
> two messages.
> Perhaps you could have nesting of TLVs? This is widely used in the net
> code for example
> i.e:
>
> TLV = TASKSTATS_TYPE_TGID/PID
> TLV = TASKSTATS_TYPE_STATS
>
> Look at using nla_nest_start/end/cancel
Hmm... Would it be ok to send one message with the following format
1. TLV=TASKSTATS_TYPE_PID
2. TLV=TASKSTATS_TYPE_STATS
3. TLV=TASKSTATS_TYPE_TGID
4. TLV=TASKSTATS_TYPE_STATS
It would still be one message, except that 3 and 4 would be optional.
What do you think?
>
> cheers,
> jamal
Thanks for your comments,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-24 14:11 ` jamal
2006-03-24 14:19 ` jamal
@ 2006-03-24 14:59 ` Balbir Singh
1 sibling, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2006-03-24 14:59 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, Mar 24, 2006 at 09:11:58AM -0500, jamal wrote:
> On Fri, 2006-24-03 at 07:02 +0530, Balbir Singh wrote:
> > On Thu, Mar 23, 2006 at 09:04:46AM -0500, jamal wrote:
>
> > 3. nlmsg_new() now allocates for 2*u32 + sizeof(taskstats)
>
> Not the right size; the u32 covers the V part of TLV. The T = 16 bits
> and L = 16 bits. And if you nest TLVs, then it gets more interesting.
>
> Look at using proper macros instead of hard coding like you did.
> grep for something like RTA_SPACE and perhaps send a patch to make it
> generic for netlink.h
>
> cheers,
> jamal
>
My bad, but I was wondering why my test case did not segfault until
I saw the implementation of nlmsg_new :-)
I will fix this and use nla_total_size() to calculate the correct sizes
including padding and TLV.
Thanks again,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-24 14:54 ` Balbir Singh
@ 2006-03-25 1:19 ` jamal
2006-03-25 9:41 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: jamal @ 2006-03-25 1:19 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:
> Hmm... Would it be ok to send one message with the following format
>
> 1. TLV=TASKSTATS_TYPE_PID
> 2. TLV=TASKSTATS_TYPE_STATS
> 3. TLV=TASKSTATS_TYPE_TGID
> 4. TLV=TASKSTATS_TYPE_STATS
>
> It would still be one message, except that 3 and 4 would be optional.
> What do you think?
>
No, that wont work since #2 and #4 are basically the same TLV. [Recall
that "T" is used to index an array]. Your other alternative is to have
#4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
although that would smell a little.
Dont be afraid to do the nest, it will be a little painful initially but
i am sure once you figure it out you will appreciate it.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 1:19 ` jamal
@ 2006-03-25 9:41 ` Balbir Singh
2006-03-25 12:52 ` jamal
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-25 9:41 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Fri, Mar 24, 2006 at 08:19:25PM -0500, jamal wrote:
> On Fri, 2006-24-03 at 20:24 +0530, Balbir Singh wrote:
>
> > Hmm... Would it be ok to send one message with the following format
> >
> > 1. TLV=TASKSTATS_TYPE_PID
> > 2. TLV=TASKSTATS_TYPE_STATS
> > 3. TLV=TASKSTATS_TYPE_TGID
> > 4. TLV=TASKSTATS_TYPE_STATS
> >
> > It would still be one message, except that 3 and 4 would be optional.
> > What do you think?
> >
>
> No, that wont work since #2 and #4 are basically the same TLV. [Recall
> that "T" is used to index an array]. Your other alternative is to have
> #4 perhaps called TASKSTATS_TGID_STATS and #2 TASKSTATS_PID_STATS
> although that would smell a little.
> Dont be afraid to do the nest, it will be a little painful initially but
> i am sure once you figure it out you will appreciate it.
>
Thanks for the advice, I will dive into nesting. I could not find any
in tree users who use nesting, so I have a few questions
nla_nest_start() accepts two parameters an skb and an attribute type.
Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
contain the nested attributes
TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_PID/TGID
TASKSTATS_TYPE_STATS
but this will lead to
TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_PID
TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_AGGR
TASKSTATS_TYPE_TGID
TASKSTATS_TYPE_STATS
being returned from taskstats_exit_pid().
The other option is to nest
TASKSTATS_TYPE_PID/TGID
TASKSTATS_TYPE_STATS
but the problem with this approach is, nla_len contains the length of
all attributes including the nested attribute. So it is hard to find
the offset of TASKSTATS_TYPE_STATS in the buffer.
Do I understand NLA nesting at all? May be I am missing something obvious.
Thanks,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 9:41 ` Balbir Singh
@ 2006-03-25 12:52 ` jamal
2006-03-25 15:36 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: jamal @ 2006-03-25 12:52 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:
>
> Thanks for the advice, I will dive into nesting. I could not find any
> in tree users who use nesting, so I have a few questions
>
Hrm - I have to say i am suprised theres nothing; i could have sworn
Thomas had done some conversions already.
> nla_nest_start() accepts two parameters an skb and an attribute type.
> Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> contain the nested attributes
>
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_PID/TGID
> TASKSTATS_TYPE_STATS
>
>
> but this will lead to
>
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_PID
> TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_AGGR
> TASKSTATS_TYPE_TGID
> TASKSTATS_TYPE_STATS
>
> being returned from taskstats_exit_pid().
>
no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
Again invoke the rule i cited earlier.
What you could do instead is a second AGGR; and your nesting would be:
TASKSTATS_TYPE_AGGR1 <--- nest start with this type
TASKSTATS_TYPE_PID <-- NLA_U32_PUT
TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
<-- nest end of TASKSTATS_TYPE_AGGR1
TASKSTATS_TYPE_AGGR2 <--- nest start with this type
TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
<-- nest end of TASKSTATS_TYPE_AGGR2
> The other option is to nest
>
> TASKSTATS_TYPE_PID/TGID
> TASKSTATS_TYPE_STATS
>
The advantage being you dont introduce another T.
> but the problem with this approach is, nla_len contains the length of
> all attributes including the nested attribute. So it is hard to find
> the offset of TASKSTATS_TYPE_STATS in the buffer.
>
So you would distinguish the two as have something like:
TASKSTATS_TYPE_PID
u32 pid
TASKSTATS_TYPE_STATS
TASKSTATS_TYPE_TGID
u32 tgid
TASKSTATS_TYPE_STATS
or
TASKSTATS_TYPE_PID
u32 pid
TASKSTATS_TYPE_TGID
u32 tgid
both should be fine. The difference between the two is the length in the
second case will be 4 and in the other case will be larger.
But come to think of it, this will introduce unneeded semantics; you
have very few items to do, so forget it. Go with scheme #1 but change
the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 12:52 ` jamal
@ 2006-03-25 15:36 ` Balbir Singh
2006-03-25 17:48 ` jamal
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-25 15:36 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 15:11 +0530, Balbir Singh wrote:
>
> >
> > Thanks for the advice, I will dive into nesting. I could not find any
> > in tree users who use nesting, so I have a few questions
> >
>
> Hrm - I have to say i am suprised theres nothing; i could have sworn
> Thomas had done some conversions already.
I used cscope to check for global references and callers of nla_nest_.*.
I could not find anything.
>
> > nla_nest_start() accepts two parameters an skb and an attribute type.
> > Do I have to create a new attribute type like TASKSTATS_TYPE_AGGR to
> > contain the nested attributes
> >
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_PID/TGID
> > TASKSTATS_TYPE_STATS
> >
> >
> > but this will lead to
> >
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_PID
> > TASKSTATS_TYPE_STATS
> > TASKSTATS_TYPE_AGGR
> > TASKSTATS_TYPE_TGID
> > TASKSTATS_TYPE_STATS
> >
> > being returned from taskstats_exit_pid().
> >
>
> no this is wrong by virtue of having TASKSTATS_TYPE_AGGR twice.
> Again invoke the rule i cited earlier.
Yes, thats why I wanted to point it out to you. Thanks for explaining the
rule.
> What you could do instead is a second AGGR; and your nesting would be:
>
> TASKSTATS_TYPE_AGGR1 <--- nest start with this type
> TASKSTATS_TYPE_PID <-- NLA_U32_PUT
> TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
> <-- nest end of TASKSTATS_TYPE_AGGR1
> TASKSTATS_TYPE_AGGR2 <--- nest start with this type
> TASKSTATS_TYPE_TGID <-- NLA_U32_PUT
> TASKSTATS_TYPE_STATS <-- NAL_PUT_TYPE
> <-- nest end of TASKSTATS_TYPE_AGGR2
>
> > The other option is to nest
> >
> > TASKSTATS_TYPE_PID/TGID
> > TASKSTATS_TYPE_STATS
> >
>
> The advantage being you dont introduce another T.
>
> > but the problem with this approach is, nla_len contains the length of
> > all attributes including the nested attribute. So it is hard to find
> > the offset of TASKSTATS_TYPE_STATS in the buffer.
> >
>
> So you would distinguish the two as have something like:
>
> TASKSTATS_TYPE_PID
> u32 pid
> TASKSTATS_TYPE_STATS
> TASKSTATS_TYPE_TGID
> u32 tgid
> TASKSTATS_TYPE_STATS
> or
> TASKSTATS_TYPE_PID
> u32 pid
> TASKSTATS_TYPE_TGID
> u32 tgid
>
> both should be fine. The difference between the two is the length in the
> second case will be 4 and in the other case will be larger.
>
> But come to think of it, this will introduce unneeded semantics; you
> have very few items to do, so forget it. Go with scheme #1 but change
> the names to TASKSTATS_TYPE_AGGR_PID and TASKSTATS_TYPE_AGGR_TGID.
>
I prefer #1 as well. The overloaded use of the same type with different lengths
can be confusing.
> cheers,
> jamal
>
Here is another attempt (one more iteration) at trying to get it right.
Thank you for your patience and help in getting it right.
Changelog
---------
As discussed in our email.
Thanks,
Balbir
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
---
include/linux/delayacct.h | 11 +
include/linux/taskstats.h | 113 +++++++++++++++++
init/Kconfig | 16 ++
kernel/Makefile | 1
kernel/delayacct.c | 44 ++++++
kernel/taskstats.c | 291 ++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 473 insertions(+), 3 deletions(-)
diff -puN include/linux/delayacct.h~delayacct-genetlink include/linux/delayacct.h
--- linux-2.6.16/include/linux/delayacct.h~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/delayacct.h 2006-03-22 11:56:03.000000000 +0530
@@ -15,6 +15,7 @@
#define _LINUX_TASKDELAYS_H
#include <linux/sched.h>
+#include <linux/taskstats.h>
#ifdef CONFIG_TASK_DELAY_ACCT
extern int delayacct_on; /* Delay accounting turned on/off */
@@ -25,6 +26,7 @@ extern void __delayacct_tsk_exit(struct
extern void __delayacct_blkio_start(void);
extern void __delayacct_blkio_end(void);
extern unsigned long long __delayacct_blkio_ticks(struct task_struct *);
+extern int __delayacct_add_tsk(struct taskstats *, struct task_struct *);
static inline void delayacct_tsk_init(struct task_struct *tsk)
{
@@ -72,4 +74,13 @@ static inline unsigned long long delayac
return 0;
}
#endif /* CONFIG_TASK_DELAY_ACCT */
+#ifdef CONFIG_TASKSTATS
+static inline int delayacct_add_tsk(struct taskstats *d,
+ struct task_struct *tsk)
+{
+ if (!tsk->delays)
+ return -EINVAL;
+ return __delayacct_add_tsk(d, tsk);
+}
+#endif
#endif /* _LINUX_TASKDELAYS_H */
diff -puN /dev/null include/linux/taskstats.h
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/include/linux/taskstats.h 2006-03-25 20:56:55.000000000 +0530
@@ -0,0 +1,113 @@
+/* taskstats.h - exporting per-task statistics
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#ifndef _LINUX_TASKSTATS_H
+#define _LINUX_TASKSTATS_H
+
+/* Format for per-task data returned to userland when
+ * - a task exits
+ * - listener requests stats for a task
+ *
+ * The struct is versioned. Newer versions should only add fields to
+ * the bottom of the struct to maintain backward compatibility.
+ *
+ * To create the next version, bump up the taskstats_version variable
+ * and delineate the start of newly added fields with a comment indicating
+ * the version number.
+ */
+
+#define TASKSTATS_VERSION 1
+#define TASKSTATS_NOPID -1
+
+struct taskstats {
+ /* Maintain 64-bit alignment while extending */
+ /* Version 1 */
+
+ /* XXX_count is number of delay values recorded.
+ * XXX_total is corresponding cumulative delay in nanoseconds
+ */
+
+#define TASKSTATS_NOCPUSTATS 1
+ __u64 cpu_count;
+ __u64 cpu_delay_total; /* wait, while runnable, for cpu */
+ __u64 blkio_count;
+ __u64 blkio_delay_total; /* sync,block io completion wait*/
+ __u64 swapin_count;
+ __u64 swapin_delay_total; /* swapin page fault wait*/
+
+ __u64 cpu_run_total; /* cpu running time
+ * no count available/provided */
+};
+
+
+#define TASKSTATS_LISTEN_GROUP 0x1
+
+/*
+ * Commands sent from userspace
+ * Not versioned. New commands should only be inserted at the enum's end
+ */
+
+enum {
+ TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
+ TASKSTATS_CMD_GET, /* user->kernel request */
+ TASKSTATS_CMD_NEW, /* kernel->user event */
+ __TASKSTATS_CMD_MAX,
+};
+
+#define TASKSTATS_CMD_MAX (__TASKSTATS_CMD_MAX - 1)
+
+enum {
+ TASKSTATS_TYPE_UNSPEC = 0, /* Reserved */
+ TASKSTATS_TYPE_PID, /* Process id */
+ TASKSTATS_TYPE_TGID, /* Thread group id */
+ TASKSTATS_TYPE_STATS, /* taskstats structure */
+ TASKSTATS_TYPE_AGGR_PID, /* contains pid + stats */
+ TASKSTATS_TYPE_AGGR_TGID, /* contains tgid + stats */
+ __TASKSTATS_TYPE_MAX,
+};
+
+#define TASKSTATS_TYPE_MAX (__TASKSTATS_TYPE_MAX - 1)
+
+enum {
+ TASKSTATS_CMD_ATTR_UNSPEC = 0,
+ TASKSTATS_CMD_ATTR_PID,
+ TASKSTATS_CMD_ATTR_TGID,
+ __TASKSTATS_CMD_ATTR_MAX,
+};
+
+#define TASKSTATS_CMD_ATTR_MAX (__TASKSTATS_CMD_ATTR_MAX - 1)
+
+/* NETLINK_GENERIC related info */
+
+#define TASKSTATS_GENL_NAME "TASKSTATS"
+#define TASKSTATS_GENL_VERSION 0x1
+
+#ifdef __KERNEL__
+
+#include <linux/sched.h>
+
+enum {
+ TASKSTATS_MSG_UNICAST, /* send data only to requester */
+ TASKSTATS_MSG_MULTICAST, /* send data to a group */
+};
+
+#ifdef CONFIG_TASKSTATS
+extern void taskstats_exit_pid(struct task_struct *);
+#else
+static inline void taskstats_exit_pid(struct task_struct *tsk)
+{}
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_TASKSTATS_H */
diff -puN init/Kconfig~delayacct-genetlink init/Kconfig
--- linux-2.6.16/init/Kconfig~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/init/Kconfig 2006-03-22 11:56:03.000000000 +0530
@@ -158,11 +158,21 @@ config TASK_DELAY_ACCT
in pages. Such statistics can help in setting a task's priorities
relative to other tasks for cpu, io, rss limits etc.
- Unlike BSD process accounting, this information is available
- continuously during the lifetime of a task.
-
Say N if unsure.
+config TASKSTATS
+ bool "Export task/process statistics through netlink (EXPERIMENTAL)"
+ depends on TASK_DELAY_ACCT
+ default y
+ help
+ Export selected statistics for tasks/processes through the
+ generic netlink interface. Unlike BSD process accounting, the
+ statistics are available during the lifetime of tasks/processes as
+ responses to commands. Like BSD accounting, they are sent to user
+ space on task exit.
+
+ Say Y if unsure.
+
config SYSCTL
bool "Sysctl support"
---help---
diff -puN kernel/delayacct.c~delayacct-genetlink kernel/delayacct.c
--- linux-2.6.16/kernel/delayacct.c~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/delayacct.c 2006-03-25 20:56:55.000000000 +0530
@@ -1,6 +1,7 @@
/* delayacct.c - per-task delay accounting
*
* Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * Copyright (C) Balbir Singh, IBM Corp. 2006
*
* This program is free software; you can redistribute it and/or modify it
* under the terms of version 2.1 of the GNU Lesser General Public License
@@ -16,9 +17,12 @@
#include <linux/time.h>
#include <linux/sysctl.h>
#include <linux/delayacct.h>
+#include <linux/taskstats.h>
+#include <linux/mutex.h>
int delayacct_on = 0; /* Delay accounting turned on/off */
kmem_cache_t *delayacct_cache;
+static DEFINE_MUTEX(delayacct_exit_mutex);
static int __init delayacct_setup_enable(char *str)
{
@@ -51,10 +55,16 @@ void __delayacct_tsk_init(struct task_st
void __delayacct_tsk_exit(struct task_struct *tsk)
{
+ /*
+ * Protect against racing thread group exits
+ */
+ mutex_lock(&delayacct_exit_mutex);
+ taskstats_exit_pid(tsk);
if (tsk->delays) {
kmem_cache_free(delayacct_cache, tsk->delays);
tsk->delays = NULL;
}
+ mutex_unlock(&delayacct_exit_mutex);
}
/*
@@ -124,3 +134,37 @@ unsigned long long __delayacct_blkio_tic
spin_unlock(&tsk->delays->lock);
return ret;
}
+#ifdef CONFIG_TASKSTATS
+int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
+{
+ nsec_t tmp;
+ struct timespec ts;
+ unsigned long t1,t2;
+
+ /* zero XXX_total,non-zero XXX_count implies XXX stat overflowed */
+
+ tmp = (nsec_t)d->cpu_run_total ;
+ tmp += (u64)(tsk->utime+tsk->stime)*TICK_NSEC;
+ d->cpu_run_total = (tmp < (nsec_t)d->cpu_run_total)? 0: tmp;
+
+ /* No locking available for sched_info. Take snapshot first. */
+ t1 = tsk->sched_info.pcnt;
+ t2 = tsk->sched_info.run_delay;
+
+ d->cpu_count += t1;
+
+ jiffies_to_timespec(t2, &ts);
+ tmp = (nsec_t)d->cpu_delay_total + timespec_to_ns(&ts);
+ d->cpu_delay_total = (tmp < (nsec_t)d->cpu_delay_total)? 0: tmp;
+
+ spin_lock(&tsk->delays->lock);
+ tmp = d->blkio_delay_total + tsk->delays->blkio_delay;
+ d->blkio_delay_total = (tmp < d->blkio_delay_total)? 0: tmp;
+ tmp = d->swapin_delay_total + tsk->delays->swapin_delay;
+ d->swapin_delay_total = (tmp < d->swapin_delay_total)? 0: tmp;
+ d->blkio_count += tsk->delays->blkio_count;
+ d->swapin_count += tsk->delays->swapin_count;
+ spin_unlock(&tsk->delays->lock);
+ return 0;
+}
+#endif /* CONFIG_TASKSTATS */
diff -puN kernel/Makefile~delayacct-genetlink kernel/Makefile
--- linux-2.6.16/kernel/Makefile~delayacct-genetlink 2006-03-22 11:56:03.000000000 +0530
+++ linux-2.6.16-balbir/kernel/Makefile 2006-03-22 11:56:03.000000000 +0530
@@ -35,6 +35,7 @@ obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
+obj-$(CONFIG_TASKSTATS) += taskstats.o
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff -puN /dev/null kernel/taskstats.c
--- /dev/null 2004-06-24 23:34:38.000000000 +0530
+++ linux-2.6.16-balbir/kernel/taskstats.c 2006-03-25 20:57:19.000000000 +0530
@@ -0,0 +1,291 @@
+/*
+ * taskstats.c - Export per-task statistics to userland
+ *
+ * Copyright (C) Shailabh Nagar, IBM Corp. 2006
+ * (C) Balbir Singh, IBM Corp. 2006
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/taskstats.h>
+#include <linux/delayacct.h>
+#include <net/genetlink.h>
+#include <asm/atomic.h>
+
+const int taskstats_version = TASKSTATS_VERSION;
+static DEFINE_PER_CPU(__u32, taskstats_seqnum) = { 0 };
+static int family_registered = 0;
+
+static struct genl_family family = {
+ .id = GENL_ID_GENERATE,
+ .name = TASKSTATS_GENL_NAME,
+ .version = TASKSTATS_GENL_VERSION,
+ .maxattr = TASKSTATS_CMD_ATTR_MAX,
+};
+
+static struct nla_policy taskstats_cmd_get_policy[TASKSTATS_CMD_ATTR_MAX+1] __read_mostly = {
+ [TASKSTATS_CMD_ATTR_PID] = { .type = NLA_U32 },
+ [TASKSTATS_CMD_ATTR_TGID] = { .type = NLA_U32 },
+};
+
+
+static int prepare_reply(struct genl_info *info, u8 cmd, struct sk_buff **skbp,
+ void **replyp, size_t size)
+{
+ struct sk_buff *skb;
+ void *reply;
+
+ /*
+ * If new attributes are added, please revisit this allocation
+ */
+ skb = nlmsg_new(size);
+ if (!skb)
+ return -ENOMEM;
+
+ if (!info) {
+ int seq = get_cpu_var(taskstats_seqnum)++;
+ put_cpu_var(taskstats_seqnum);
+
+ reply = genlmsg_put(skb, 0, seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ } else
+ reply = genlmsg_put(skb, info->snd_pid, info->snd_seq,
+ family.id, 0, 0,
+ cmd, family.version);
+ if (reply == NULL) {
+ nlmsg_free(skb);
+ return -EINVAL;
+ }
+
+ *skbp = skb;
+ *replyp = reply;
+ return 0;
+}
+
+static int send_reply(struct sk_buff *skb, pid_t pid, int event)
+{
+ struct genlmsghdr *genlhdr = nlmsg_data((struct nlmsghdr *)skb->data);
+ void *reply;
+ int rc;
+
+ reply = genlmsg_data(genlhdr);
+
+ rc = genlmsg_end(skb, reply);
+ if (rc < 0) {
+ nlmsg_free(skb);
+ return rc;
+ }
+
+ if (event == TASKSTATS_MSG_MULTICAST)
+ return genlmsg_multicast(skb, pid, TASKSTATS_LISTEN_GROUP);
+ return genlmsg_unicast(skb, pid);
+}
+
+static inline int fill_pid(pid_t pid, struct task_struct *pidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk = pidtsk;
+
+ if (!pidtsk) {
+ read_lock(&tasklist_lock);
+ tsk = find_task_by_pid(pid);
+ if (!tsk) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ get_task_struct(tsk);
+ read_unlock(&tasklist_lock);
+ } else
+ get_task_struct(tsk);
+
+ rc = delayacct_add_tsk(stats, tsk);
+ put_task_struct(tsk);
+
+ return rc;
+
+}
+
+static inline int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
+ struct taskstats *stats)
+{
+ int rc;
+ struct task_struct *tsk, *first;
+
+ first = tgidtsk;
+ read_lock(&tasklist_lock);
+ if (!first) {
+ first = find_task_by_pid(tgid);
+ if (!first) {
+ read_unlock(&tasklist_lock);
+ return -ESRCH;
+ }
+ }
+ tsk = first;
+ do {
+ rc = delayacct_add_tsk(stats, tsk);
+ if (rc)
+ break;
+ } while_each_thread(first, tsk);
+ read_unlock(&tasklist_lock);
+
+ return rc;
+}
+
+static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ struct taskstats stats;
+ void *reply;
+ size_t size;
+ struct nlattr *na;
+
+ /*
+ * Size includes space for nested attribute as well
+ * The returned data is of the format
+ * TASKSTATS_TYPE_AGGR_PID/TGID
+ * --> TASKSTATS_TYPE_PID/TGID
+ * --> TASKSTATS_TYPE_STATS
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(info, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return rc;
+
+ if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
+ u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
+ rc = fill_pid((pid_t)pid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
+ } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
+ u32 tgid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_TGID]);
+ rc = fill_tgid((pid_t)tgid, NULL, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, tgid);
+ } else {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ return send_reply(rep_skb, info->snd_pid, TASKSTATS_MSG_UNICAST);
+
+nla_put_failure:
+ return genlmsg_cancel(rep_skb, reply);
+err:
+ nlmsg_free(rep_skb);
+ return rc;
+}
+
+
+/* Send pid data out on exit */
+void taskstats_exit_pid(struct task_struct *tsk)
+{
+ int rc = 0;
+ struct sk_buff *rep_skb;
+ void *reply;
+ struct taskstats stats;
+ size_t size;
+ int is_thread_group = !thread_group_empty(tsk);
+ struct nlattr *na;
+
+ /*
+ * tasks can start to exit very early. Ensure that the family
+ * is registered before notifications are sent out
+ */
+ if (!family_registered)
+ return;
+
+ /*
+ * Size includes space for nested attributes
+ */
+ size = nla_total_size(sizeof(u32)) +
+ nla_total_size(sizeof(struct taskstats)) + nla_total_size(0);
+
+ if (is_thread_group)
+ size = 2 * size; // PID + STATS + TGID + STATS
+
+ memset(&stats, 0, sizeof(stats));
+ rc = prepare_reply(NULL, TASKSTATS_CMD_NEW, &rep_skb, &reply, size);
+ if (rc < 0)
+ return;
+
+ rc = fill_pid(tsk->pid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, (u32)tsk->pid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ if (!is_thread_group) {
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+ return;
+ }
+
+ memset(&stats, 0, sizeof(stats));
+ rc = fill_tgid(tsk->tgid, tsk, &stats);
+ if (rc < 0)
+ goto err;
+
+ na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
+ NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+ NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS, stats);
+ nla_nest_end(rep_skb, na);
+
+ send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
+
+nla_put_failure:
+ genlmsg_cancel(rep_skb, reply);
+ return;
+err:
+ nlmsg_free(rep_skb);
+}
+
+static struct genl_ops taskstats_ops = {
+ .cmd = TASKSTATS_CMD_GET,
+ .doit = taskstats_send_stats,
+ .policy = taskstats_cmd_get_policy,
+};
+
+static int __init taskstats_init(void)
+{
+ if (genl_register_family(&family))
+ return -EFAULT;
+ family_registered = 1;
+
+ if (genl_register_ops(&family, &taskstats_ops))
+ goto err;
+
+ return 0;
+err:
+ genl_unregister_family(&family);
+ family_registered = 0;
+ return -EFAULT;
+}
+
+late_initcall(taskstats_init);
_
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 15:36 ` Balbir Singh
@ 2006-03-25 17:48 ` jamal
2006-03-25 18:22 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: jamal @ 2006-03-25 17:48 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
I didnt pay attention to failure paths etc; i suppose your testing
should catch those. Getting there, a couple more comments:
> +enum {
> + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
> + TASKSTATS_CMD_GET, /* user->kernel request */
> + TASKSTATS_CMD_NEW, /* kernel->user event */
Should the comment read "kernel->user event/get-response"
> +
> +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> +{
> +
> + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> + rc = fill_pid((pid_t)pid, NULL, &stats);
> + if (rc < 0)
> + goto err;
> +
> + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
in regards to the elseif above:
Could you not have both PID and TGID passed? From my earlier
understanding it seemed legit, no? if answer is yes, then you will have
to do your sizes + reply TLVs at the end.
Also in regards to the nesting, isnt there a need for nla_nest_cancel in
case of failures to add TLVs?
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 17:48 ` jamal
@ 2006-03-25 18:22 ` Balbir Singh
2006-03-26 14:05 ` jamal
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-25 18:22 UTC (permalink / raw)
To: hadi; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On 3/25/06, jamal <hadi@cyberus.ca> wrote:
> On Sat, 2006-25-03 at 21:06 +0530, Balbir Singh wrote:
> > On Sat, Mar 25, 2006 at 07:52:13AM -0500, jamal wrote:
>
>
> I didnt pay attention to failure paths etc; i suppose your testing
> should catch those. Getting there, a couple more comments:
>
Yes, I have tried several negative test cases.
>
> > +enum {
> > + TASKSTATS_CMD_UNSPEC = 0, /* Reserved */
> > + TASKSTATS_CMD_GET, /* user->kernel request */
> > + TASKSTATS_CMD_NEW, /* kernel->user event */
>
> Should the comment read "kernel->user event/get-response"
>
Yes, good catch. I will update the comment.
>
> > +
> > +static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
> > +{
>
>
> > +
> > + if (info->attrs[TASKSTATS_CMD_ATTR_PID]) {
> > + u32 pid = nla_get_u32(info->attrs[TASKSTATS_CMD_ATTR_PID]);
> > + rc = fill_pid((pid_t)pid, NULL, &stats);
> > + if (rc < 0)
> > + goto err;
> > +
> > + na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_PID);
> > + NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_PID, pid);
> > + } else if (info->attrs[TASKSTATS_CMD_ATTR_TGID]) {
>
> in regards to the elseif above:
> Could you not have both PID and TGID passed? From my earlier
> understanding it seemed legit, no? if answer is yes, then you will have
> to do your sizes + reply TLVs at the end.
No, we cannot have both passed. If we pass both a PID and a TGID and
then the code returns just the stats for the PID.
>
> Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> case of failures to add TLVs?
>
I thought about it, but when I looked at the code of genlmsg_cancel()
and nla_nest_cancel(). It seemed that genlmsg_cancel() should
suffice.
<snippet>
static inline int genlmsg_cancel(struct sk_buff *skb, void *hdr)
{
return nlmsg_cancel(skb, hdr - GENL_HDRLEN - NLMSG_HDRLEN);
}
static inline int nlmsg_cancel(struct sk_buff *skb, struct nlmsghdr *nlh)
{
skb_trim(skb, (unsigned char *) nlh - skb->data);
return -1;
}
static inline int nla_nest_cancel(struct sk_buff *skb, struct nlattr *start)
{
if (start)
skb_trim(skb, (unsigned char *) start - skb->data);
return -1;
}
</snippet>
genlmsg_cancel() seemed more generic, since it handles skb_trim from
the nlmsghdr down to skb->data, where as nla_test_cancel() does it
only from the start of the nested attributes to skb->data.
Is my understanding correct?
> cheers,
> jamal
>
Thanks,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-25 18:22 ` Balbir Singh
@ 2006-03-26 14:05 ` jamal
2006-03-26 16:40 ` Balbir Singh
0 siblings, 1 reply; 26+ messages in thread
From: jamal @ 2006-03-26 14:05 UTC (permalink / raw)
To: balbir; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:
> No, we cannot have both passed. If we pass both a PID and a TGID and
> then the code returns just the stats for the PID.
>
ok, that clears it then; i think you are ready to go.
> >
> > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > case of failures to add TLVs?
> >
>
> I thought about it, but when I looked at the code of genlmsg_cancel()
> and nla_nest_cancel(). It seemed that genlmsg_cancel() should
> suffice.
>
If your policy is to never send a message if anything fails, then you
are fine.
What would be really useful now that you understand this, is if you can
help extending/cleaning the document i sent you. Or send me a table of
contents of how it would have flowed better for you.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC][UPDATED PATCH 2.6.16] [Patch 9/9] Generic netlink interface for delay accounting
2006-03-26 14:05 ` jamal
@ 2006-03-26 16:40 ` Balbir Singh
0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2006-03-26 16:40 UTC (permalink / raw)
To: jamal; +Cc: Matt Helsley, Shailabh Nagar, linux-kernel, netdev
On Sun, Mar 26, 2006 at 09:05:18AM -0500, jamal wrote:
> On Sat, 2006-25-03 at 23:52 +0530, Balbir Singh wrote:
>
>
> > No, we cannot have both passed. If we pass both a PID and a TGID and
> > then the code returns just the stats for the PID.
> >
>
> ok, that clears it then; i think you are ready to go.
Cool! Thanks for all your help and review.
>
> > >
> > > Also in regards to the nesting, isnt there a need for nla_nest_cancel in
> > > case of failures to add TLVs?
> > >
> >
> > I thought about it, but when I looked at the code of genlmsg_cancel()
> > and nla_nest_cancel(). It seemed that genlmsg_cancel() should
> > suffice.
> >
>
> If your policy is to never send a message if anything fails, then you
> are fine.
>
> What would be really useful now that you understand this, is if you can
> help extending/cleaning the document i sent you. Or send me a table of
> contents of how it would have flowed better for you.
I will start looking at the document and see what changes can be made.
I will try and update the document from a genetlink programmers perspective
i.e. things to know, avoid, etc.
>
> cheers,
> jamal
>
>
Thanks,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 8/9] generic netlink utility functions
2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
@ 2006-03-26 16:44 ` Balbir Singh
2006-03-26 17:06 ` jamal
0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2006-03-26 16:44 UTC (permalink / raw)
To: Shailabh Nagar; +Cc: linux-kernel, netdev, Jamal
On Mon, Mar 13, 2006 at 07:55:05PM -0500, Shailabh Nagar wrote:
> genetlink-utils.patch
>
> Two utilities for simplifying usage of NETLINK_GENERIC
> interface.
>
> Signed-off-by: Balbir Singh <balbir@in.ibm.com>
> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
>
> include/net/genetlink.h | 20 ++++++++++++++++++++
> 1 files changed, 20 insertions(+)
>
> Index: linux-2.6.16-rc5/include/net/genetlink.h
> ===================================================================
> --- linux-2.6.16-rc5.orig/include/net/genetlink.h 2006-03-11 07:41:32.000000000 -0500
> +++ linux-2.6.16-rc5/include/net/genetlink.h 2006-03-11 07:41:41.000000000 -0500
> @@ -150,4 +150,24 @@ static inline int genlmsg_unicast(struct
> return nlmsg_unicast(genl_sock, skb, pid);
> }
>
> +/**
> + * gennlmsg_data - head of message payload
> + * @gnlh: genetlink messsage header
> + */
> +static inline void *genlmsg_data(const struct genlmsghdr *gnlh)
> +{
> + return ((unsigned char *) gnlh + GENL_HDRLEN);
> +}
> +
> +/**
> + * genlmsg_len - length of message payload
> + * @gnlh: genetlink message header
> + */
> +static inline int genlmsg_len(const struct genlmsghdr *gnlh)
> +{
> + struct nlmsghdr *nlh = (struct nlmsghdr *)((unsigned char *)gnlh -
> + NLMSG_HDRLEN);
> + return (nlh->nlmsg_len - GENL_HDRLEN - NLMSG_HDRLEN);
> +}
> +
> #endif /* __NET_GENERIC_NETLINK_H */
>
>
Jamal,
Does the implementation of these utilities look ok? We use genlmsg_data()
in the delay accounting code but not genlmsg_len(), hence it might not
be very well tested (just reviewed).
Thanks,
Balbir
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [Patch 8/9] generic netlink utility functions
2006-03-26 16:44 ` Balbir Singh
@ 2006-03-26 17:06 ` jamal
0 siblings, 0 replies; 26+ messages in thread
From: jamal @ 2006-03-26 17:06 UTC (permalink / raw)
To: balbir; +Cc: Shailabh Nagar, linux-kernel, netdev
On Sun, 2006-26-03 at 22:14 +0530, Balbir Singh wrote:
> Jamal,
>
> Does the implementation of these utilities look ok? We use genlmsg_data()
> in the delay accounting code but not genlmsg_len(), hence it might not
> be very well tested (just reviewed).
>
They look fine to me - please resubmit and CC Thomas Graf in case he
sees it different.
cheers,
jamal
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2006-03-26 17:06 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1142296834.5858.3.camel@elinux04.optonline.net>
2006-03-14 0:55 ` [Patch 8/9] generic netlink utility functions Shailabh Nagar
2006-03-26 16:44 ` Balbir Singh
2006-03-26 17:06 ` jamal
2006-03-14 0:56 ` [Patch 9/9] Generic netlink interface for delay accounting Shailabh Nagar
2006-03-14 2:29 ` jamal
2006-03-14 2:33 ` Matt Helsley
2006-03-14 2:48 ` jamal
2006-03-14 4:18 ` Shailabh Nagar
2006-03-22 7:49 ` [RFC][UPDATED PATCH 2.6.16] " Balbir Singh
2006-03-23 14:04 ` jamal
2006-03-23 15:41 ` Balbir Singh
2006-03-24 14:04 ` jamal
2006-03-24 14:54 ` Balbir Singh
2006-03-25 1:19 ` jamal
2006-03-25 9:41 ` Balbir Singh
2006-03-25 12:52 ` jamal
2006-03-25 15:36 ` Balbir Singh
2006-03-25 17:48 ` jamal
2006-03-25 18:22 ` Balbir Singh
2006-03-26 14:05 ` jamal
2006-03-26 16:40 ` Balbir Singh
2006-03-24 1:32 ` Balbir Singh
2006-03-24 14:11 ` jamal
2006-03-24 14:19 ` jamal
2006-03-24 14:59 ` Balbir Singh
2006-03-14 4:29 ` Shailabh Nagar
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).