public inbox for linux-kernel@vger.kernel.org
* [0/2][ANNOUNCE] nproc: netlink access to /proc information
@ 2004-08-27 12:24 Roger Luethi
  2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
                   ` (3 more replies)
  0 siblings, 4 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
	Paul Jackson

[ Cc: contributors to recent, related thread ]

nproc is an attempt to address the current problems with /proc. In
short, it exposes the same information via netlink (so far implemented
for a small subset of the fields).

This patch is experimental. I'm posting it to get the discussion going.

Problems with /proc
===================
The information in /proc comes in a number of different formats, for
example:

- /proc/PID/stat works for parsers. However, because it is not
  self-documenting, it can never shrink: it contains a growing number
  of dead fields because legacy tools expect them to be there. To make
  things worse, there is no N/A value, which makes a field value of 0
  ambiguous.

- /proc/pid/status is self-documenting. No N/A value is necessary --
  fields can easily be added, removed, and reordered. Too easily, maybe.
  Tool maintainers complain about parsing overhead and unstable file
  formats.

- /proc/slabinfo is something of a hybrid and tries to avoid the
  weaknesses of other formats.

So a key problem is that it's hard to make an interface that is easy
for both humans and parsers to read. The amount of human-readable
information in /proc has been growing and there's no way all these
files will be rewritten again to favor parsers.

Another problem with /proc is speed. If we put all information in a few
large files, the kernel needs to calculate many fields even if a tool
is only interested in one of them. OTOH, if the information is split
into many small files, VFS and related overhead increases if a tool
needs to read many files just for the information on one single process.

In summary, /proc suffers from diverging goals of its two groups of
users (human readers and parsers), and it doesn't scale well for tools
monitoring many fields or many processes.

Overview
========
This patch implements an alternative method of querying the kernel
with well-defined messages through netlink.

Each piece of information ("field") like MemFree or VmRSS is given a
32-bit ID:

bits
 0-15 a unique ID
16-23 reserved
24-27 data type (u32, unsigned long, u64, string)
28-31 the scope (process, global)

Four operations exist to query the kernel:

NPROC_GET_LIST
--------------
This request has no payload. The kernel answers with a sequence of u32
values. The first one announces the number of fields known to the kernel,
the rest of the message lists all of them by ID.

NPROC_GET_LIST allows a tool to check which fields are still available
and -- if the tool author is so inclined -- to discover new fields
dynamically.

NPROC_GET_LABEL
---------------
A label request contains a u32 value indicating the type of label
and one key for which a label is wanted. The kernel returns a string
containing the label. Label types are field (useful for dynamically
discovered fields) and ksym.

NPROC_GET_GLOBAL
----------------
A request for one or more fields with a global scope (e.g. MemFree,
nr_dirty) contains a u32 value announcing the number of requested
fields and a matching sequence of field IDs.

The kernel replies with one netlink message containing the requested
fields. A string field is led by a u32 value indicating the remaining
length of the field. I didn't want to offer any strings outside of
the label operation initially, but having to make an extra call for,
say, every process name seemed a bit excessive.

NPROC_GET_PS
------------
For fields with a process scope (e.g. VmSize, wchan), a request starts
as above but adds an additional part: the selector. The only selector
implemented so far takes a list of u32 PID values.

At the moment, the kernel sends a separate netlink message for every
process.

Results
=======
- The new interface is self-documenting.

- There is no need to ever parse strings on either side of the
  user/kernel space barrier.

- Fields that have become meaningless or are unmaintained can simply
  be removed. Tools can easily detect if fields are missing, and which
  ones. (Of course that does not imply that any field is fair game
  to remove from the kernel.)

- Any number and combination of fields can be gathered with one single
  message exchange (as long as they are in the same scope).

- The kernel only calculates fields as requested (where it makes sense,
  see __task_mem for an example).

- The conflict between human-readable and machine-parsable files is
  solved by providing a separate interface for each.

- While parsing answers is vastly easier for tools, there is hardly
  any additional complexity in the kernel (except for the process
  selector, which is optional as it goes beyond the functionality
  offered by /proc).

- If we're lucky, we may even be able to save memory on small systems
  that want to do away with /proc but need access to some of the
  information it provides.

I haven't implemented any form of access control. One possibility is
to use some of the reserved bits in the ID field to indicate access
restrictions to both kernel and user space (e.g. everyone, process owner,
root) and add some LSM hook for those needing fine-grained control.

It would also be easy to add semantics that won't work in /proc (for
instance a simple mechanism for repetitive requests -- just add an
optional frequency or interval flag). Whether that is desirable or not
is a separate discussion, though.

There are obvious speed optimizations I haven't tried. I meant to
conduct some performance tests, but I'm not sure what a meaningful
benchmark on the /proc file side is. Suggestions?

Roger


* [1/2][PATCH] nproc: netlink access to /proc information
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
@ 2004-08-27 12:24 ` Roger Luethi
  2004-08-27 13:39   ` Roger Luethi
  2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
	Paul Jackson

The current code duplicates some data gathering logic from elsewhere
in the kernel. The code can be trivially shared if the existing users
in proc split data gathering and string creation.

The patch should apply against any current 2.6 kernel.

 include/linux/netlink.h |    1 
 include/linux/nproc.h   |   93 ++++++
 init/Kconfig            |    7 
 kernel/Makefile         |    1 
 kernel/nproc.c          |  690 ++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 792 insertions(+)

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/include/linux/netlink.h linux-2.6.8-nproc/include/linux/netlink.h
--- linux-2.6.8/include/linux/netlink.h	2004-08-27 10:08:20.000000000 +0200
+++ linux-2.6.8-nproc/include/linux/netlink.h	2004-08-27 10:20:07.000000000 +0200
@@ -15,6 +15,7 @@
 #define NETLINK_ARPD		8
 #define NETLINK_AUDIT		9	/* auditing */
 #define NETLINK_ROUTE6		11	/* af_inet6 route comm channel */
+#define NETLINK_NPROC		12	/* /proc information */
 #define NETLINK_IP6_FW		13
 #define NETLINK_DNRTMSG		14	/* DECnet routing messages */
 #define NETLINK_TAPBASE		16	/* 16 to 31 are ethertap */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/include/linux/nproc.h linux-2.6.8-nproc/include/linux/nproc.h
--- linux-2.6.8/include/linux/nproc.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.8-nproc/include/linux/nproc.h	2004-08-27 10:20:07.000000000 +0200
@@ -0,0 +1,93 @@
+#ifndef _LINUX_NPROC_H
+#define _LINUX_NPROC_H
+
+#include <linux/config.h>
+
+#ifdef CONFIG_NPROC
+
+#define NPROC_BASE		0x10
+#define NPROC_GET_LIST		(NPROC_BASE+0)
+#define NPROC_GET_LABEL		(NPROC_BASE+1)
+#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
+#define NPROC_GET_PS		(NPROC_BASE+3)
+
+#define NPROC_SCOPE_MASK	0xF0000000
+#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
+#define NPROC_SCOPE_PROCESS	0x20000000
+#define NPROC_SCOPE_LABEL	0x30000000
+
+#define NPROC_TYPE_MASK		0x0F000000
+#define NPROC_TYPE_STRING	0x01000000
+#define NPROC_TYPE_U32		0x02000000
+#define NPROC_TYPE_UL		0x03000000
+#define NPROC_TYPE_U64		0x04000000
+
+#define NPROC_SELECT_ALL	0x00000001
+#define NPROC_SELECT_PID	0x00000002
+#define NPROC_SELECT_UID	0x00000003
+
+#define NPROC_LABEL_FIELD	0x00000001
+#define NPROC_LABEL_KSYM	0x00000002
+
+struct nproc_field {
+	__u32 id;
+	const char *label;
+};
+
+#define NPROC_PID		(0x00000001 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NAME		(0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+/* Amount of free memory (pages) */
+#define NPROC_MEMFREE		(0x00000004 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* Size of a page (bytes) */
+#define NPROC_PAGESIZE		(0x00000005 | NPROC_TYPE_U32    | NPROC_SCOPE_GLOBAL)
+/* There's no guarantee about anything with jiffies. Still useful for some. */
+#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
+/* Process: VM size (KiB) */
+#define NPROC_VMSIZE		(0x00000010 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: locked memory (KiB) */
+#define NPROC_VMLOCK		(0x00000011 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size (KiB) */
+#define NPROC_VMRSS		(0x00000012 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA		(0x00000013 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK		(0x00000014 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE		(0x00000015 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB		(0x00000016 | NPROC_TYPE_U32    | NPROC_SCOPE_PROCESS)
+#define NPROC_NR_DIRTY		(0x00000051 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK	(0x00000052 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE	(0x00000053 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS	(0x00000054 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED		(0x00000055 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB		(0x00000056 | NPROC_TYPE_UL     | NPROC_SCOPE_GLOBAL)
+#define NPROC_WCHAN		(0x00000100 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME	(0x00000101 | NPROC_TYPE_STRING)
+
+#ifdef __KERNEL__
+static struct nproc_field labels[] = {
+	{ NPROC_PID,			"PID" },
+	{ NPROC_NAME,			"Name" },
+	{ NPROC_MEMFREE,		"MemFree" },
+	{ NPROC_PAGESIZE,		"PageSize" },
+	{ NPROC_JIFFIES,		"Jiffies" },
+	{ NPROC_VMSIZE,			"VmSize" },
+	{ NPROC_VMLOCK,			"VmLock" },
+	{ NPROC_VMRSS,			"VmRSS" },
+	{ NPROC_VMDATA,			"VmData" },
+	{ NPROC_VMSTACK,		"VmStack" },
+	{ NPROC_VMEXE,			"VmExe" },
+	{ NPROC_VMLIB,			"VmLib" },
+	{ NPROC_NR_DIRTY,		"nr_dirty" },
+	{ NPROC_NR_WRITEBACK,		"nr_writeback" },
+	{ NPROC_NR_UNSTABLE,		"nr_unstable" },
+	{ NPROC_NR_PG_TABLE_PGS,	"nr_page_table_pages" },
+	{ NPROC_NR_MAPPED,		"nr_mapped" },
+	{ NPROC_NR_SLAB,		"nr_slab" },
+	{ NPROC_WCHAN,			"wchan" },
+#ifdef CONFIG_KALLSYMS
+	{ NPROC_WCHAN_NAME,		"wchan_symbol" },
+#endif
+};
+#endif /* __KERNEL__ */
+
+#endif /* CONFIG_NPROC */
+
+#endif /* _LINUX_NPROC_H */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/kernel/Makefile linux-2.6.8-nproc/kernel/Makefile
--- linux-2.6.8/kernel/Makefile	2004-08-27 10:08:20.000000000 +0200
+++ linux-2.6.8-nproc/kernel/Makefile	2004-08-27 10:20:07.000000000 +0200
@@ -15,6 +15,7 @@ obj-$(CONFIG_SMP) += cpu.o
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += module.o
 obj-$(CONFIG_KALLSYMS) += kallsyms.o
+obj-$(CONFIG_NPROC) += nproc.o
 obj-$(CONFIG_PM) += power/
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_COMPAT) += compat.o
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/kernel/nproc.c linux-2.6.8-nproc/kernel/nproc.c
--- linux-2.6.8/kernel/nproc.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.8-nproc/kernel/nproc.c	2004-08-27 10:20:07.000000000 +0200
@@ -0,0 +1,690 @@
+/*
+ * nproc.c
+ *
+ * netlink interface to /proc information.
+ *
+ */
+
+#include <linux/skbuff.h>
+#include <net/sock.h>
+#include <linux/swap.h>		/* nr_free_pages() */
+#include <linux/kallsyms.h>	/* kallsyms_lookup() */
+#include <linux/nproc.h>
+
+//#define DEBUG
+
+/* There must be like 5 million dprintk definitions, so let's add some more */
+#ifdef DEBUG
+#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
+#else
+#define pdebug(x,args...)
+#define pwarn(x,args...)
+#endif
+
+#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
+
+static struct sock *nproc_sock = NULL;
+
+struct task_mem {
+	u32	vmdata;
+	u32	vmstack;
+	u32	vmexe;
+	u32	vmlib;
+};
+
+struct task_mem_cheap {
+	u32	vmsize;
+	u32	vmlock;
+	u32	vmrss;
+};
+
+/*
+ * __task_mem/__task_mem_cheap basically duplicate the MMU version of
+ * task_mem, but they are split by cost and work on structs.
+ */
+
+void __task_mem(struct task_struct *tsk, struct task_mem *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
+		struct vm_area_struct *vma;
+
+		down_read(&mm->mmap_sem);
+		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
+			if (!vma->vm_file) {
+				data += len;
+				if (vma->vm_flags & VM_GROWSDOWN)
+					stack += len;
+				continue;
+			}
+			if (vma->vm_flags & VM_WRITE)
+				continue;
+			if (vma->vm_flags & VM_EXEC) {
+				exec += len;
+				if (vma->vm_flags & VM_EXECUTABLE)
+					continue;
+				lib += len;
+			}
+		}
+		res->vmdata = data - stack;
+		res->vmstack = stack;
+		res->vmexe = exec - lib;
+		res->vmlib = lib;
+		up_read(&mm->mmap_sem);
+
+		mmput(mm);
+	}
+}
+
+void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
+{
+	struct mm_struct *mm = get_task_mm(tsk);
+	if (mm) {
+		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
+		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
+		res->vmrss = mm->rss << (PAGE_SHIFT-10);
+		mmput(mm);
+	}
+}
+
+/*
+ * page_alloc.c already has an extra function broken out to fill a
+ * struct with information. Cool. Not sure whether pgpgin/pgpgout
+ * should be left as is or nailed down as kbytes.
+ */
+struct page_state *__vmstat(void)
+{
+	struct page_state *ps;
+	ps = kmalloc(sizeof(*ps), GFP_KERNEL);
+	if (!ps)
+		return ERR_PTR(-ENOMEM);
+	get_full_page_state(ps);
+	ps->pgpgin /= 2;	/* sectors -> kbytes */
+	ps->pgpgout /= 2;
+	return ps;
+}
+
+/*
+ * Allocate and prefill an skb. The nlmsghdr provided to the function
+ * is a pointer to the respective struct in the request message.
+ */
+struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len)
+{
+	__u32 seq = nlh->nlmsg_seq;
+	__u16 type = nlh->nlmsg_type;
+	__u32 pid = nlh->nlmsg_pid;
+	struct sk_buff *skb2 = 0;
+
+	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
+	if (!skb2) {
+		skb2 = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
+	goto out;
+
+nlmsg_failure:				/* Used by NLMSG_PUT */
+	kfree_skb(skb2);
+	skb2 = ERR_PTR(-ENOMEM);	/* callers check IS_ERR, not NULL */
+out:
+	return skb2;
+}
+
+#define mstore(value, id, buf)						\
+({									\
+	u32 _type = id & NPROC_TYPE_MASK;				\
+	switch (_type) {						\
+		case NPROC_TYPE_U32: {					\
+			__u32 *p = (u32 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_UL: {					\
+			unsigned long *p = (unsigned long *)buf;	\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		case NPROC_TYPE_U64: {					\
+			__u64 *p = (u64 *)buf;				\
+			*p = value;					\
+			buf = (char *)++p;				\
+			break;						\
+		}							\
+		default:						\
+			perror("Huh? Bad type!\n");			\
+	}								\
+})
+
+/*
+ * Build and send a netlink msg for one PID.
+ */
+int nproc_pid_fields(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk)
+{
+	int i;
+	int err;
+	struct task_mem tsk_mem;
+	struct task_mem_cheap tsk_mem_cheap;
+	u32 fcnt = fdata[0];
+	u32 *fields = &fdata[1];
+	struct sk_buff *skb2;
+	char *buf;
+	struct nlmsghdr *nlh2;
+
+	tsk_mem.vmdata = (~0);
+	tsk_mem_cheap.vmsize = (~0);
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+	nlh2 = (struct nlmsghdr *)skb2->data;
+	buf = NLMSG_DATA(nlh2);
+
+	for (i = 0; i < fcnt; i++) {
+		switch (fields[i]) {
+			case NPROC_PID:
+				mstore(tsk->pid, NPROC_PID, buf);
+				break;
+			case NPROC_VMSIZE:
+			case NPROC_VMLOCK:
+			case NPROC_VMRSS:
+				if (tsk_mem_cheap.vmsize == (~0))
+					__task_mem_cheap(tsk, &tsk_mem_cheap);
+				switch (fields[i]) {
+					case NPROC_VMSIZE:
+						mstore(tsk_mem_cheap.vmsize, NPROC_VMSIZE, buf);
+						break;
+					case NPROC_VMLOCK:
+						mstore(tsk_mem_cheap.vmlock, NPROC_VMLOCK, buf);
+						break;
+					case NPROC_VMRSS:
+						mstore(tsk_mem_cheap.vmrss, NPROC_VMRSS, buf);
+						break;
+				}
+				break;
+			case NPROC_VMDATA:
+			case NPROC_VMSTACK:
+			case NPROC_VMEXE:
+			case NPROC_VMLIB:
+				if (tsk_mem.vmdata == (~0))
+					__task_mem(tsk, &tsk_mem);
+				switch (fields[i]) {
+					case NPROC_VMDATA:
+						mstore(tsk_mem.vmdata, NPROC_VMDATA, buf);
+						break;
+					case NPROC_VMSTACK:
+						mstore(tsk_mem.vmstack, NPROC_VMSTACK, buf);
+						break;
+					case NPROC_VMEXE:
+						mstore(tsk_mem.vmexe, NPROC_VMEXE, buf);
+						break;
+					case NPROC_VMLIB:
+						mstore(tsk_mem.vmlib, NPROC_VMLIB, buf);
+						break;
+				}
+				break;
+			case NPROC_JIFFIES:
+				mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+				break;
+			case NPROC_WCHAN:
+				mstore(get_wchan(tsk), NPROC_WCHAN, buf);
+				pdebug("pid %d wchan: %lu.\n", tsk->pid,
+						get_wchan(tsk));
+				break;
+			case NPROC_NAME:
+				mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf);
+				strncpy(buf, tsk->comm, sizeof(tsk->comm));
+				buf += sizeof(tsk->comm);
+				break;
+			default:
+				pwarn("Unknown field %#x.\n", fields[i]);
+		}
+	}
+	err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, MSG_DONTWAIT);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+/*
+ * Iterate over a list of PIDs.
+ */
+int nproc_select_pid(struct nlmsghdr *nlh, u32 left, u32 *fdata, u32 len, u32 *sdata)
+{
+	int i;
+	int err = 0;
+	u32 tcnt;
+	u32 *pids;
+
+	if (left < sizeof(tcnt))
+		goto err_inval;
+	left -= sizeof(tcnt);
+
+	tcnt = sdata[0];
+
+	if (left < (tcnt * sizeof(u32)))
+		goto err_inval;
+	left -= tcnt * sizeof(u32);
+
+	pids = &sdata[1];
+
+	for (i = 0; i < tcnt; i++) {
+		task_t *tsk;
+		read_lock(&tasklist_lock);
+		tsk = find_task_by_pid(pids[i]);
+		if (tsk)
+			get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+		if (!tsk) {
+			err = -ESRCH;
+			goto out;
+		}
+		pdebug("task found for pid %d: %s.\n", pids[i], tsk->comm);
+		err = nproc_pid_fields(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static u32 __reply_size_special(u32 id)
+{
+	u32 len = 0;
+
+	switch (id) {
+		case NPROC_NAME:
+			len = sizeof(u32) +
+				sizeof(((struct task_struct*)0)->comm);
+			break;
+		default:
+			pwarn("Unknown field size in %#x.\n", id);
+	}
+	return len;
+}
+
+/*
+ * Calculates the size of a reply message payload. Alternatively, we could have
+ * the user space caller supply a number along with the request and bail
+ * out or realloc later if we find the allocation was too small. More
+ * responsibility in user space, but faster.
+ */
+static u32 *__reply_size (u32 *data, u32 *left, u32 *len)
+{
+	u32 *fields;
+	u32 fcnt;
+	int i;
+	*len = 0;
+
+	if (*left < sizeof(fcnt))
+		goto err_inval;
+	*left -= sizeof(fcnt);
+
+	fcnt = data[0];
+
+	if (*left < (fcnt * sizeof(u32)))
+		goto err_inval;
+	*left -= fcnt * sizeof(u32);
+
+	fields = &data[1];
+
+	pdebug("for %d fields:\n", fcnt);
+	for (i = 0; i < fcnt; i++) {
+		u32 id = fields[i];
+		u32 type = id & NPROC_TYPE_MASK;
+		pdebug("        %#8.8x.\n", fields[i]);
+		switch (type) {
+			case NPROC_TYPE_U32:
+				*len += sizeof(u32);
+				break;
+			case NPROC_TYPE_UL:
+				*len += sizeof(unsigned long);
+				break;
+			case NPROC_TYPE_U64:
+				*len += sizeof(u64);
+				break;
+			default: {		/* Special cases */
+				u32 slen;
+				slen = __reply_size_special(id);
+				if (slen)
+					*len += slen;
+				else
+					goto err_inval;
+			}
+		}
+	}
+
+	return &fields[fcnt];
+
+err_inval:
+	return ERR_PTR(-EINVAL);
+}
+
+/*
+ * Call the chosen process selector. Not much to choose from right now.
+ */
+static int nproc_get_ps(struct sk_buff *skb, struct nlmsghdr *nlh)
+{
+	int err;
+	u32 len;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *sdata;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+
+	sdata = __reply_size(data, &left, &len);
+	if (IS_ERR(sdata)) {
+		err = PTR_ERR(sdata);
+		goto out;
+	}
+
+	switch (*sdata) {
+#if 0
+		case NPROC_SELECT_ALL:
+			err = nproc_select_all(nlh, data, len, sdata + 1);
+			break;
+#endif
+		case NPROC_SELECT_PID:
+			err = nproc_select_pid(nlh, left, data, len,
+					sdata + 1);
+			break;
+#if 0
+		case NPROC_SELECT_UID:
+			err = nproc_select_uid(sdata + 1);
+			break;
+#endif
+		default:
+			pwarn("Unknown selection method %#x.\n", *sdata);
+			goto err_inval;
+	}
+
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static int nproc_get_global(struct nlmsghdr *nlh)
+{
+	int err, i;
+	u32 len;
+	void *errp;
+	struct sk_buff *skb2;
+	char *buf;
+	u32 fcnt;
+	struct page_state *ps = NULL;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 *fields;
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	errp = __reply_size(data, &left, &len);
+	if (IS_ERR(errp)) {
+		err = PTR_ERR(errp);
+		goto out;
+	}
+
+	fcnt = data[0];
+	fields = &data[1];
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	for (i = 0; i < fcnt; i++) {
+		u32 id = fields[i];
+		switch (id) {
+			case NPROC_NR_DIRTY:
+			case NPROC_NR_WRITEBACK:
+			case NPROC_NR_UNSTABLE:
+			case NPROC_NR_PG_TABLE_PGS:
+			case NPROC_NR_MAPPED:
+			case NPROC_NR_SLAB:
+				if (!ps) {
+					ps = __vmstat();
+					if (IS_ERR(ps)) {
+						err = PTR_ERR(ps);
+						ps = NULL;
+						kfree_skb(skb2);
+						goto out;
+					}
+				}
+				switch (id) {
+					case NPROC_NR_DIRTY:
+						mstore(ps->nr_dirty, NPROC_NR_DIRTY, buf);
+						break;
+					case NPROC_NR_WRITEBACK:
+						mstore(ps->nr_writeback, NPROC_NR_WRITEBACK, buf);
+						break;
+					case NPROC_NR_UNSTABLE:
+						mstore(ps->nr_unstable, NPROC_NR_UNSTABLE, buf);
+						break;
+					case NPROC_NR_PG_TABLE_PGS:
+						mstore(ps->nr_page_table_pages, NPROC_NR_PG_TABLE_PGS, buf);
+						break;
+					case NPROC_NR_MAPPED:
+						mstore(ps->nr_mapped, NPROC_NR_MAPPED, buf);
+						break;
+					case NPROC_NR_SLAB:
+						mstore(ps->nr_slab, NPROC_NR_SLAB, buf);
+						break;
+				}
+				break;
+			case NPROC_MEMFREE:
+				mstore(nr_free_pages(), NPROC_MEMFREE, buf);
+				break;
+			case NPROC_PAGESIZE:
+				mstore(PAGE_SIZE, NPROC_PAGESIZE, buf);
+				break;
+			case NPROC_JIFFIES:
+				mstore(get_jiffies_64(), NPROC_JIFFIES, buf);
+				break;
+			default:
+				pwarn("Unknown field requested %#x.\n",
+						fields[i]);
+				goto err_inval;
+		}
+	}
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT);
+	if (err > 0)
+		err = 0;
+out:
+	kfree(ps);
+	return err;
+
+err_inval:
+	kfree_skb(skb2);
+	kfree(ps);
+	return -EINVAL;
+}
+
+static int nproc_get_label(struct nlmsghdr *nlh)
+{
+	int err;
+	struct sk_buff *skb2;
+	const char *label;
+	char *buf;
+	int len;
+	u32 ltype;
+	u32 *data = NLMSG_DATA(nlh);
+	u32 left = nlh->nlmsg_len - sizeof(*nlh);
+
+	if (left < sizeof(ltype))
+		goto err_inval;
+
+	ltype = data[0];
+	left -= sizeof(ltype);
+
+	if (ltype == NPROC_LABEL_FIELD) {
+		int i;
+		u32 id;
+		
+		if (left < sizeof(id))
+			goto err_inval;
+
+		id = data[1];
+
+		for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
+			;	/* Do nothing */
+
+		if (i == ARRAY_SIZE(labels)) {
+			pwarn("No matching label found for %#x.\n", id);
+			goto err_inval;
+		}
+
+		label = labels[i].label;
+
+	}
+	else if (ltype == NPROC_LABEL_KSYM) {
+		char *modname;
+		unsigned long wchan, size, offset;
+		char namebuf[128];
+		if (left < sizeof(unsigned long))
+			goto err_inval;
+
+		wchan = (unsigned long)data[1];
+		label = kallsyms_lookup(wchan, &size, &offset, &modname,
+				namebuf);
+		if (!label) {
+			pwarn("No ksym found for %#lx.\n", wchan);
+			goto err_inval;
+		}
+	}
+	else {
+		pwarn("Unknown label type %#x.\n", ltype);
+		goto err_inval;
+	}
+
+	len = strlen(label) + 1;
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+
+	strncpy(buf, label, len);
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static int nproc_get_list(struct nlmsghdr *nlh)
+{
+	int err, i, cnt, len;
+	struct sk_buff *skb2;
+	u32 *buf;
+
+	cnt = ARRAY_SIZE(labels);
+	len = (cnt + 1) * sizeof(u32);
+
+	skb2 = nproc_alloc_nlmsg(nlh, len);
+	if (IS_ERR(skb2)) {
+		err = PTR_ERR(skb2);
+		goto out;
+	}
+
+	buf = NLMSG_DATA((struct nlmsghdr *)skb2->data);
+	buf[0] = cnt;
+	for (i = 0; i < cnt; i++)
+		buf[i + 1] = labels[i].id;
+
+	err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT);
+	if (err > 0)
+		err = 0;
+out:
+	return err;
+}
+
+static __inline__ int nproc_process_msg(struct sk_buff *skb,
+		struct nlmsghdr *nlh)
+{
+	int err;
+
+	if (!(nlh->nlmsg_flags & NLM_F_REQUEST))
+		return 0;
+
+	nlh->nlmsg_pid = NETLINK_CB(skb).pid;
+
+	switch (nlh->nlmsg_type) {
+		case NPROC_GET_LIST:
+			err = nproc_get_list(nlh);
+			break;
+		case NPROC_GET_LABEL:
+			err = nproc_get_label(nlh);
+			break;
+		case NPROC_GET_GLOBAL:
+			err = nproc_get_global(nlh);
+			break;
+		case NPROC_GET_PS:
+			err = nproc_get_ps(skb, nlh);
+			break;
+		default:
+			pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type);
+			err = -EINVAL;
+	}
+	return err;
+
+}
+
+static int nproc_receive_skb(struct sk_buff *skb)
+{
+	int err = 0;
+	struct nlmsghdr *nlh;
+
+	if (skb->len < NLMSG_LENGTH(0))
+		goto err_inval;
+
+	nlh = (struct nlmsghdr *)skb->data;
+	if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){
+		pwarn("Invalid packet.\n");
+		goto err_inval;
+	}
+
+	err = nproc_process_msg(skb, nlh);
+	if (err || nlh->nlmsg_flags & NLM_F_ACK) {
+		pdebug("err %d, type %#x, flags %#x, seq %#x.\n", err,
+				nlh->nlmsg_type, nlh->nlmsg_flags,
+				nlh->nlmsg_seq);
+		netlink_ack(skb, nlh, err);
+	}
+
+	return err;
+
+err_inval:
+	return -EINVAL;
+}
+
+static void nproc_receive(struct sock *sk, int len)
+{
+	struct sk_buff *skb;
+
+	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+		nproc_receive_skb(skb);
+		kfree_skb(skb);
+	}
+}
+
+static int nproc_init(void)
+{
+	nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+	if (!nproc_sock) {
+		perror("No netlink socket for nproc.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+module_init(nproc_init);
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/init/Kconfig linux-2.6.8-nproc/init/Kconfig
--- linux-2.6.8/init/Kconfig	2004-08-27 13:33:21.680899010 +0200
+++ linux-2.6.8-nproc/init/Kconfig	2004-08-27 13:28:33.104788111 +0200
@@ -141,6 +141,13 @@ config SYSCTL
 	  building a kernel for install/rescue disks or your system is very
 	  limited in memory.
 
+config NPROC
+	bool "Netlink interface to /proc information"
+	depends on PROC_FS && EXPERIMENTAL
+	default y
+	help
+	  Nproc is a netlink interface to /proc information.
+
 config AUDIT
 	bool "Auditing support"
 	default y if SECURITY_SELINUX


* [2/2][sample code] nproc: user space app
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
  2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
@ 2004-08-27 12:24 ` Roger Luethi
  2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
  2004-08-27 16:23 ` William Lee Irwin III
  3 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
	Paul Jackson

On a system running a kernel with nproc, the sample program (below)
spits out the information it can gather from the kernel: Available
fields, data types, and associated values.

Obviously, real tools would have their own labels (and help texts) for
well-known fields. Scope information and the NPROC_GET_LABEL operation
allow them to provide additional information in a meaningful context.

The sample program does have some extra knowledge beyond the bare
interface: It knows that a time stamp (global scope) can be requested
within process scope as well, and it knows how to request a symbol name
from a wchan value.

Sample output:
----------------------------------------------------------------------------
================ Available fields =======================
    ----id---- --------label------- -scope- ----type-----
# 0 0x22000001 PID                  process __u32
# 1 0x21000002 Name                 process string
# 2 0x12000004 MemFree              global  __u32
# 3 0x12000005 PageSize             global  __u32
# 4 0x14000006 Jiffies              global  __u64
# 5 0x22000010 VmSize               process __u32
# 6 0x22000011 VmLock               process __u32
# 7 0x22000012 VmRSS                process __u32
# 8 0x22000013 VmData               process __u32
# 9 0x22000014 VmStack              process __u32
#10 0x22000015 VmExe                process __u32
#11 0x22000016 VmLib                process __u32
#12 0x13000051 nr_dirty             global  unsigned long
#13 0x13000052 nr_writeback         global  unsigned long
#14 0x13000053 nr_unstable          global  unsigned long
#15 0x13000054 nr_page_table_pages  global  unsigned long
#16 0x13000055 nr_mapped            global  unsigned long
#17 0x13000056 nr_slab              global  unsigned long
#18 0x23000100 wchan                process unsigned long
#19 0x01000101 wchan_symbol         (    0) string

================ Global fields ==========================
    ----id---- --------label------- --value---
# 0 0x12000004 MemFree                   97926
# 1 0x12000005 PageSize                   4096
# 2 0x14000006 Jiffies              4298132669
# 3 0x13000051 nr_dirty                     10
# 4 0x13000052 nr_writeback                  0
# 5 0x13000053 nr_unstable                   0
# 6 0x13000054 nr_page_table_pages         405
# 7 0x13000055 nr_mapped                 36021
# 8 0x13000056 nr_slab                    5956

================ Process fields =========================
---------------- process PID 14318 ----------------------
    ----id---- --------label------- --value---
# 0 0x14000006 Jiffies              4298132669
# 1 0x22000001 PID                       14318
# 2 0x21000002 Name                 tst
# 3 0x22000010 VmSize                     1456
# 4 0x22000011 VmLock                        0
# 5 0x22000012 VmRSS                       360
# 6 0x22000013 VmData                      272
# 7 0x22000014 VmStack                      12
# 8 0x22000015 VmExe                         8
# 9 0x22000016 VmLib                      1140
#10 0x23000100 wchan                         0
---------------- process PID     1 ----------------------
    ----id---- --------label------- --value---
# 0 0x14000006 Jiffies              4298132669
# 1 0x22000001 PID                           1
# 2 0x21000002 Name                 init
# 3 0x22000010 VmSize                     1340
# 4 0x22000011 VmLock                        0
# 5 0x22000012 VmRSS                       468
# 6 0x22000013 VmData                      144
# 7 0x22000014 VmStack                       4
# 8 0x22000015 VmExe                        28
# 9 0x22000016 VmLib                      1140
#10 0x23000100 wchan                0xc01924f9 (ksym: do_select)

1000 iterations for both processes:
	CPU time : 0.000000s
	Wall time: 0.008305s
============================================================================
Sample code below:
----------------------------------------------------------------------------
#include <asm/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Sample code to demonstrate nproc usage */

//#include <linux/nproc.h>

#define NPROC_BASE		0x10
#define NPROC_GET_LIST		(NPROC_BASE+0)
#define NPROC_GET_LABEL		(NPROC_BASE+1)
#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
#define NPROC_GET_PS		(NPROC_BASE+3)

#define NETLINK_NPROC		12

#define NPROC_SCOPE_MASK	0xF0000000
#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
#define NPROC_SCOPE_PROCESS	0x20000000
#define NPROC_SCOPE_LABEL	0x30000000

#define NPROC_TYPE_MASK		0x0F000000
#define NPROC_TYPE_STRING	0x01000000
#define NPROC_TYPE_U32		0x02000000
#define NPROC_TYPE_UL		0x03000000
#define NPROC_TYPE_U64		0x04000000

#define NPROC_SELECT_ALL	0x00000001
#define NPROC_SELECT_PID	0x00000002
#define NPROC_SELECT_UID	0x00000003

#define NPROC_LABEL_FIELD	0x00000001
#define NPROC_LABEL_KSYM	0x00000002

struct nproc_field {
	__u32	id;
	const char *label;
};

#define NPROC_JIFFIES		(0x00000006 | NPROC_TYPE_U64    | NPROC_SCOPE_GLOBAL)
#define NPROC_WCHAN		(0x00000100 | NPROC_TYPE_UL     | NPROC_SCOPE_PROCESS)


//#define DEBUG

#ifdef DEBUG
#define pdebug(x,args...) printf("%s:%d " x, __func__ , __LINE__, ##args)
#else
#define pdebug(x,args...)
#endif
/* Note: shadows libc's perror() and does not report errno */
#define perror(x,args...) fprintf(stderr, "%s:%d " x, __func__ , __LINE__, ##args)

static __u32 seq_nr;
static pid_t pid;
static int nsk;		/* netlink socket */

struct proc_message {
	struct nlmsghdr nlh;
	__u32 data[256];
};

int open_netlink()
{
	if ((nsk = socket(PF_NETLINK, SOCK_RAW, NETLINK_NPROC)) == -1) {
		perror("Failed to open netlink proc socket.\n");
		exit(1);
	}
	return nsk;
}

void send_request(struct proc_message *req)
{
	int sent;

	req->nlh.nlmsg_flags = NLM_F_REQUEST;
	req->nlh.nlmsg_seq = seq_nr++;
	req->nlh.nlmsg_pid = pid;

	if ((sent = send(nsk, req, req->nlh.nlmsg_len, 0)) == -1) {
		perror("Failed to send netlink proc msg.\n");
		exit(1);
	}
	pdebug("sent %d bytes seq %#x type %#x \n", sent,
			req->nlh.nlmsg_seq, req->nlh.nlmsg_type);
}

void *get_reply(__u32 type, struct proc_message *ans)
{
	int len;
	if ((len = recv(nsk, ans, sizeof(struct proc_message), 0)) == -1) {
		perror("Failed to read netlink proc msg.\n");
		exit(1);
	}

	if (!NLMSG_OK(&ans->nlh, len)) {
		perror("Bad netlink msg.\n");
		exit(1);
	}

	if (ans->nlh.nlmsg_type != type) {
		perror("read %d bytes seq %#x type %#x len %d\n", len,
				ans->nlh.nlmsg_seq, ans->nlh.nlmsg_type,
				ans->nlh.nlmsg_len);
		exit(1);
	}
	else
		pdebug("read %d bytes seq %#x type %#x len %d\n", len,
				ans->nlh.nlmsg_seq, ans->nlh.nlmsg_type,
				ans->nlh.nlmsg_len);

	return NLMSG_DATA(&ans->nlh);
}

void *get_global(__u32 num, struct proc_message *nlmsg)
{
	int len = num * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	nlmsg->nlh.nlmsg_type = NPROC_GET_GLOBAL;

	send_request(nlmsg);

	return get_reply(NPROC_GET_GLOBAL, nlmsg);
}

void get_ps(__u32 num, struct proc_message *nlmsg)
{
	int len = num * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	nlmsg->nlh.nlmsg_type = NPROC_GET_PS;

	send_request(nlmsg);
}

char *get_label(struct proc_message *nlmsg)
{
	nlmsg->nlh.nlmsg_type = NPROC_GET_LABEL;

	send_request(nlmsg);

	return get_reply(NPROC_GET_LABEL, nlmsg);
}

char *get_field_label(__u32 id, struct proc_message *nlmsg)
{
	__u32 *buf = NLMSG_DATA(&nlmsg->nlh);
	int len = 2 * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);

	buf[0] = NPROC_LABEL_FIELD;
	buf[1] = id;

	return get_label(nlmsg);
}

char *get_ksym(unsigned long wchan, struct proc_message *nlmsg)
{
	__u32 *buf = NLMSG_DATA(&nlmsg->nlh);
	unsigned long *addr;
	int len = sizeof(__u32) + sizeof(unsigned long);

	*buf++ = NPROC_LABEL_KSYM;
	addr = (unsigned long *)buf;
	*addr = wchan;

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);

	return get_label(nlmsg);
}

__u32 *get_list(struct proc_message *nlmsg)
{
	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(0);
	nlmsg->nlh.nlmsg_type = NPROC_GET_LIST;

	send_request(nlmsg);

	return get_reply(NPROC_GET_LIST, nlmsg);
}

void print_ps(char *res, int psc, struct nproc_field *ps_label)
{
	int i;
	struct proc_message nlmsg;

	printf("    ----id---- --------label------- --value---\n");
	for (i = 0; i < psc; i++) {
		const char *label = ps_label[i].label;
		__u32 id    = ps_label[i].id;
		__u32 type  = id & NPROC_TYPE_MASK;

		printf("#%2d %#x %-20s ", i, id, label);
		switch (type) {
			case NPROC_TYPE_U32: {
				__u32 *p = (__u32 *)res;
				printf("%10u\n", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_UL: {
				unsigned long *p = (unsigned long *)res;
				if ((id == NPROC_WCHAN) && *p) {
					printf("%#8lx ", *p);
					printf("(ksym: %s)\n", get_ksym(*p,
								&nlmsg));
				}
				else
					printf("%10lu\n", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_U64: {
				__u64 *p = (__u64 *)res;
				printf("%10llu\n", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_STRING: {
				__u32 *len = (__u32 *)res;
				char *p = res + sizeof(__u32);
				printf("%s\n", p);
				res += *len + sizeof(__u32);
				break;
			}
			default:
				printf("(?)\t");
		}
	}
}


#define MAX_FIELDS	64

int main() {
	struct proc_message flist;
	struct proc_message nlmsg;
	struct proc_message gl_msg;
	struct proc_message ps_msg;

	__u32 *fields;
	__u32 *gl = NLMSG_DATA(&gl_msg.nlh);
	__u32 *ps = NLMSG_DATA(&ps_msg.nlh);

	struct nproc_field gl_label[MAX_FIELDS];
	struct nproc_field ps_label[MAX_FIELDS];

	char *res;

	int i;
	int ac, glc = 0, psc = 0;		/* Count fields */

	int cpu_0;
	struct timeval tv0, tv1;
	float wall;
	struct timezone tz;

	pid = getpid();
	nsk = open_netlink();

	fields = get_list(&flist);
	ac = *fields++;


	*gl++ = 0;		/* Reserve space for field count */
	*ps++ = 0;

	*ps++ = NPROC_JIFFIES;	/* Special: both global and ps context */
	ps_label[psc].id = NPROC_JIFFIES;
	ps_label[psc++].label = strdup(get_field_label(NPROC_JIFFIES, &nlmsg));

	printf("================ Available fields =======================\n");
	printf("    ----id---- --------label------- -scope- ----type-----\n");
	for (i = 0; i < ac; i++) {
		char *label;
		__u32 scope, type;

		scope = fields[i] & NPROC_SCOPE_MASK;
		type  = fields[i] & NPROC_TYPE_MASK;
		label = strdup(get_field_label(fields[i], &nlmsg));

		printf("#%2d %#8.8x %-20s ", i, fields[i], label);
		switch (scope) {
			case NPROC_SCOPE_GLOBAL:
				printf("global  ");
				*gl++ = fields[i];
				gl_label[glc].id = fields[i];
				gl_label[glc++].label = label;
				break;
			case NPROC_SCOPE_PROCESS:
				printf("process ");
				*ps++ = fields[i];
				ps_label[psc].id = fields[i];
				ps_label[psc++].label = label;
				break;
			default:
				printf("(%#5x) ", scope);
		}
		switch (type) {
			case NPROC_TYPE_U32:
				printf("__u32");
				break;
			case NPROC_TYPE_UL:
				printf("unsigned long");
				break;
			case NPROC_TYPE_U64:
				printf("__u64");
				break;
			case NPROC_TYPE_STRING:
				printf("string");
				break;
			default:
				printf("type: (%#8.8x)\t", type);
		}
		if ((glc == MAX_FIELDS) || (psc == MAX_FIELDS)) {
			perror("Array too small.\n");
			exit(1);
		}
		printf("\n");
	}

	gl = NLMSG_DATA(&gl_msg.nlh);
	*gl = glc;

	res = get_global(glc + 1, &gl_msg);

	printf("\n================ Global fields ==========================\n");
	printf("    ----id---- --------label------- --value---\n");
	for (i = 0; i < glc; i++) {
		const char *label = gl_label[i].label;
		__u32 id    = gl_label[i].id;
		__u32 type  = id & NPROC_TYPE_MASK;

		printf("#%2d %#8.8x %-20s ", i, id, label);
		switch (type) {
			case NPROC_TYPE_U32: {
				__u32 *p = (__u32 *)res;
				printf("%10u", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_UL: {
				unsigned long *p = (unsigned long *)res;
				printf("%10lu", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_U64: {
				__u64 *p = (__u64 *)res;
				printf("%10llu", *p);
				res = (char *)++p;
				break;
			}
			case NPROC_TYPE_STRING: {
				__u32 *len = (__u32 *)res;
				char *p = res + sizeof(__u32);
				printf("%s", p);
				res += *len + sizeof(__u32);
				break;
			}
			default:
				printf("(?)");
		}
		printf("\n");
	}

	printf("\n================ Process fields =========================\n");

	*ps++ = NPROC_SELECT_PID;
	*ps++ = 2;		// Number of PIDs to follow
	*ps++ = pid;
	*ps++ = 1;		/* PID 1 (init) */

	ps = NLMSG_DATA(&ps_msg.nlh);
	*ps = psc;

	get_ps(psc + 1 + 4, &ps_msg);

	res = get_reply(NPROC_GET_PS, &nlmsg);

	printf("---------------- process PID %5d ----------------------\n", pid);
	print_ps(res, psc, ps_label);

	res = get_reply(NPROC_GET_PS, &nlmsg);
	printf("---------------- process PID %5d ----------------------\n", 1);
	print_ps(res, psc, ps_label);

	gettimeofday(&tv0, &tz);
	cpu_0 = clock();

#define RUNS 1000
	for (i = 0; i < RUNS; i++) {
		get_ps(psc + 1 + 4, &ps_msg);
		get_reply(NPROC_GET_PS, &nlmsg);
		get_reply(NPROC_GET_PS, &nlmsg);
	}

	printf("\n%d iterations for both processes:\n", RUNS);
	printf("\tCPU time : %fs\n", (float)(clock() - cpu_0)/CLOCKS_PER_SEC);
	gettimeofday(&tv1,&tz);
	wall = (float) tv1.tv_sec - tv0.tv_sec +
		(tv1.tv_usec - tv0.tv_usec) / 1.0e6;
	printf("\tWall time: %fs\n", wall);

	return 0;
}
----------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [1/2][PATCH] nproc: netlink access to /proc information
  2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
@ 2004-08-27 13:39   ` Roger Luethi
  0 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 13:39 UTC (permalink / raw)
  To: linux-kernel, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson

I failed to mention that the patch is missing some rather basic locking
(say, in nproc_select_pid). Yeah, it is _that_ experimental :-/. I ignored
the locking issue when mulling over the semantics of the new interface and
forgot it later.

The patch below should be an improvement.

Roger

--- kernel/nproc.c.01	2004-08-27 15:38:36.686602557 +0200
+++ kernel/nproc.c	2004-08-27 15:38:36.686602557 +0200
@@ -278,18 +278,23 @@ int nproc_select_pid(struct nlmsghdr *nl
 
 	for (i = 0; i < tcnt; i++) {
 		task_t *tsk;
+		read_lock(&tasklist_lock);
 		tsk = find_task_by_pid(pids[i]);
+		if (tsk)
+			get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+		if (!tsk)
+			goto err_srch;
 		pdebug("task found for pid %d: %s.\n", pids[i], tsk->comm);
-		if (!tsk) {
-			err = -ESRCH;
-			goto out;
-		}
 		err = nproc_pid_fields(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
 	}
 
-out:
 	return err;
 
+err_srch:
+	return -ESRCH;
+
 err_inval:
 	return -EINVAL;
 }

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
  2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
  2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi
@ 2004-08-27 14:50 ` James Morris
  2004-08-27 15:26   ` Roger Luethi
  2004-08-27 16:23 ` William Lee Irwin III
  3 siblings, 1 reply; 39+ messages in thread
From: James Morris @ 2004-08-27 14:50 UTC (permalink / raw)
  To: Roger Luethi
  Cc: linux-kernel, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson, Chris Wright, Stephen Smalley

On Fri, 27 Aug 2004, Roger Luethi wrote:

> At the moment, the kernel sends a separate netlink message for every
> process.

You should look at the way rtnetlink dumps large amounts of data to  
userspace.

> I haven't implemented any form of access control. One possibility is
> to use some of the reserved bits in the ID field to indicate access
> restrictions to both kernel and user space (e.g. everyone, process owner,
> root) 

So, user tools would all need to be privileged?  That sounds problematic.

> and add some LSM hook for those needing fine-grained control.

Control over the user request, or what the kernel returns?  If the latter, 
LSM is not really a filtering API.


- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
@ 2004-08-27 15:26   ` Roger Luethi
  0 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 15:26 UTC (permalink / raw)
  To: James Morris
  Cc: linux-kernel, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson, Chris Wright, Stephen Smalley

On Fri, 27 Aug 2004 10:50:23 -0400, James Morris wrote:
> On Fri, 27 Aug 2004, Roger Luethi wrote:
> 
> > At the moment, the kernel sends a separate netlink message for every
> > process.
> 
> You should look at the way rtnetlink dumps large amounts of data to  
> userspace.

At this point, I am just using a working prototype to gauge the interest
in an improved interface. Other than that, I agree. This would be one
of the "speed optimizations I haven't tried".

> > I haven't implemented any form of access control. One possibility is
> > to use some of the reserved bits in the ID field to indicate access
> > restrictions to both kernel and user space (e.g. everyone, process owner,
> > root) 
> 
> So, user tools would all need to be privileged?  That sounds problematic.

It just means that not all the pieces required to make this a merge
candidate have been implemented. I focused on the core infrastructure
needed for the basic protocol.

Adding some access control that is about as smart as file permissions
in /proc is fairly easy (we have the caller pid and netlink_skb_parms
as a starting point). We only have read permissions to care about. It's
trivial to flag each field as "world readable", "owner only" (for fields
with process scope), and "root only". That covers pretty much what
/proc permissions achieve. While I am confident that this will work,
others may have better ideas for access control.

Roger


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
                   ` (2 preceding siblings ...)
  2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
@ 2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
                     ` (2 more replies)
  3 siblings, 3 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 16:23 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> Problems with /proc
> ===================
> The information in /proc comes in a number of different formats, for
> example:
> - /proc/PID/stat works for parsers. However, because it is not
>   self-documenting, it can never shrink, It contains a growing number
>   of dead fields -- legacy tools expect them to be there. To make things
>   worse, there is no N/A value, which makes a field value 0 ambiguous.
> - /proc/pid/status is self-documenting. No N/A value is necessary --
>   fields can easily be added, removed, and reordered. Too easily, maybe.
>   Tool maintainers complain about parsing overhead and unstable file
>   formats.
> - /proc/slabinfo is something of a hybrid and tries to avoid the
>   weaknesses of other formats.
> So a key problem is that it's hard to make an interface that is both
> easy for humans and parsers to read. The amount of human-readable
> information in /proc has been growing and there's no way all these
> files will be rewritten again to favor parsers.

These are many of the same issues raised in rusty's "current /proc/ of
shit" thread from a while back.


On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> Another problem with /proc is speed. If we put all information in a few
> large files, the kernel needs to calculate many fields even if a tool
> is only interested in one of them. OTOH, if the informations is split
> into many small files, VFS and related overhead increases if a tool
> needs to read many files just for the information on one single process.
> In summary, /proc suffers from diverging goals of its two groups of
> users (human readers and parsers), and it doesn't scale well for tools
> monitoring many fields or many processes.

There are more maintainability benefits from the interface improvement
than speed benefits. How many processes did you microbenchmark with?
I see no evidence that this will be a speedup with large numbers of
processes, as the problematic algorithms are preserved wholesale.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
@ 2004-08-27 16:37   ` Albert Cahalan
  2004-08-27 16:41     ` William Lee Irwin III
  2004-08-27 17:01   ` Roger Luethi
  2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
  2 siblings, 1 reply; 39+ messages in thread
From: Albert Cahalan @ 2004-08-27 16:37 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Fri, 2004-08-27 at 12:23, William Lee Irwin III wrote:

> I see no evidence that this will be a speedup with large numbers of
> processes, as the problematic algorithms are preserved wholesale.

Well, as far as THAT goes, I thought your tree-based
lookup was nice. I assume you still have the code.

What we got instead was a sort of cached directory
offset computation, which looks great... until you
hit the bad case. I suggest that the people trying to
reduce latency should try "top -d 0 -b >> /dev/null"
while running something like the SDET benchmark.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:37   ` Albert Cahalan
@ 2004-08-27 16:41     ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 16:41 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Fri, 2004-08-27 at 12:23, William Lee Irwin III wrote:
>> I see no evidence that this will be a speedup with large numbers of
>> processes, as the problematic algorithms are preserved wholesale.

On Fri, Aug 27, 2004 at 12:37:40PM -0400, Albert Cahalan wrote:
> Well, as far as THAT goes, I thought your tree-based
> lookup was nice. I assume you still have the code.
> What we got instead was a sort of cached directory
> offset computation, which looks great... until you
> hit the bad case. I suggest that the people trying to
> reduce latency should try "top -d 0 -b >> /dev/null"
> while running something like the SDET benchmark.

I can resurrect that easily enough.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
@ 2004-08-27 17:01   ` Roger Luethi
  2004-08-27 17:08     ` William Lee Irwin III
  2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
  2 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 17:01 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
> These are many of the same issues raised in rusty's "current /proc/ of
> shit" thread from a while back.

The problems are not new. The driver stuff has been moved out to sysfs
in the meantime, though, and the information that is being added to
/proc these days is usually human-readable and a pain to parse.

> On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> > Another problem with /proc is speed. If we put all information in a few
> > large files, the kernel needs to calculate many fields even if a tool
> > is only interested in one of them. OTOH, if the informations is split
> > into many small files, VFS and related overhead increases if a tool
> > needs to read many files just for the information on one single process.
> > In summary, /proc suffers from diverging goals of its two groups of
> > users (human readers and parsers), and it doesn't scale well for tools
> > monitoring many fields or many processes.
> 
> There are more maintainability benefits from the interface improvement
> than speed benefits.

Agreed. That has been my initial motivation. Speed is a bonus.

> How many processes did you microbenchmark with?

Nothing worth mentioning. I have nothing in /proc space to compare
to. I was hoping someone would suggest a /proc based benchmark.

> I see no evidence that this will be a speedup with large numbers of
> processes, as the problematic algorithms are preserved wholesale.

It doesn't fundamentally change the complexity, but I expect the
reduction in overhead to be noticeable, mostly due to:
- no more string parsing.
- fewer system calls.
- fewer cycles wasted on calculating unnecessary data fields.

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 17:01   ` Roger Luethi
@ 2004-08-27 17:08     ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 17:08 UTC (permalink / raw)
  To: linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
>> I see no evidence that this will be a speedup with large numbers of
>> processes, as the problematic algorithms are preserved wholesale.

On Fri, Aug 27, 2004 at 07:01:43PM +0200, Roger Luethi wrote:
> It doesn't fundamentally change the complexity, but I expect the
> reduction in overhead to be noticeable, mostly due to:
> - no more string parsing.
> - fewer system calls.
> - fewer cycles wasted on calculating unnecessary data fields.

After some closer review it appears recent algorithmic improvements
are largely orthogonal to your interface change; the new interface
may just call the improved algorithms.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [BENCHMARK] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
  2004-08-27 17:01   ` Roger Luethi
@ 2004-08-28 19:45   ` Roger Luethi
  2004-08-28 19:56     ` William Lee Irwin III
  2 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-28 19:45 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
> than speed benefits. How many processes did you microbenchmark with?

Executive summary: I wrote a benchmark to compare /proc and nproc
performance. The results are as expected: parsing even the simplest
strings is expensive, and /proc performance does not scale if we have
to open and close many files, which is the common case.

In a situation with many processes p and fields/files f the delivery
overhead is roughly O(p) for nproc and O(p*f) for /proc.

The difference becomes even more pronounced if a /proc file request
triggers an expensive in-kernel computation for fields that are not
of interest but part of the file, or if human-readable files need to
be parsed.


Benchmark: I chose the most favorable scenario for /proc I could think
of: reading a single, easy-to-parse file per process and finding every
data item useful.  I picked /proc/pid/statm. For nproc, I chose seven fields
that are calculated with the same resource usage as the fields in statm:
NPROC_VMSIZE, NPROC_VMLOCK, NPROC_VMRSS, NPROC_VMDATA, NPROC_VMSTACK,
NPROC_VMEXE, and NPROC_VMLIB.


Numbers:
* The first run is basically lseek+read:
/proc/pid/statm for 1000 processes, 1000 times, lseek
CPU time : 7.080000s
Wall time: 7.636732s

* The second run adds a simple sscanf call to dump seven values
  into seven variables:
/proc/pid/statm for 1000 processes, 1000 times, lseek (scanf)
CPU time : 10.230000s
Wall time: 10.958432s

* If we watch p processes with f files each, we typically hit the
  file descriptor limit before p * f == 1024. From then on, lseek is
  useless, we have to resort to opening and closing files:
/proc/pid/statm for 1000 processes, 1000 times, open
CPU time : 14.920000s
Wall time: 16.087339s

* Again, parsing the string comes at a cost:
/proc/pid/statm for 1000 processes, 1000 times, open (scanf)
CPU time : 18.110000s
Wall time: 19.457451s

* What happens if we need to read 2 simple /proc files (14 fields)
  per process?
/proc/pid/statm (2x) for 1000 processes, 1000 times, open (scanf)
CPU time : 30.250000s
Wall time: 32.650314s

* 10000 processes at 3 files each (27 fields)
/proc/pid/statm (3x) for 10000 processes, 1000 times, open (scanf)
CPU time : 450.630000s
Wall time: 500.265503s

* nproc delivering said 7 fields:
nproc for 1000 processes, 1000 times, one process per request
CPU time : 7.910000s
Wall time: 8.473371s

* 200 processes per request, but still 1000 reply messages. If we stuffed
  a bunch of them into every message, performance would improve further.
nproc for 1000 processes, 1000 times, 200 processes per request
CPU time : 6.350000s
Wall time: 6.817391s

* There's no large penalty if we need additional fields:
14 nproc fields for 1000 processes, 1000 times, one process per request
CPU time : 8.680000s
Wall time: 9.328828s

27 nproc fields for 10000 processes, 1000 times, one process per request
CPU time : 88.270000s
Wall time: 98.664330s

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
@ 2004-08-28 19:56     ` William Lee Irwin III
  2004-08-28 20:14       ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-28 19:56 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sat, Aug 28, 2004 at 09:45:46PM +0200, Roger Luethi wrote:
> Executive summary: I wrote a benchmark to compare /proc and nproc
> performance. The results are as expected: Parsing even the most simple
> strings is expensive. /proc performance does not scale if we have to
> open and close many files, which is the common case.
> In a situation with many processes p and fields/files f the delivery
> overhead is roughly O(p) for nproc and O(p*f) for /proc.
> The difference becomes even more pronounced if a /proc file request
> triggers an expensive in-kernel computation for fields that are not
> of interest but part of the file, or if human-readable files need to
> be parsed.
> Benchmark: I chose the most favorable scenario for /proc I could think
> of: Reading a single, easy to parse file per process and find every data
> item useful.  I picked /proc/pid/statm. For nproc, I chose seven fields
> that are calculated with the same resource usage as the fields in statm:
> NPROC_VMSIZE, NPROC_VMLOCK, NPROC_VMRSS, NPROC_VMDATA, NPROC_VMSTACK,
> NPROC_VMEXE, and NPROC_VMLIB.

These numbers are somewhat at variance with my experience in the area,
as I see that the internal algorithms actually dominate the runtime
of the /proc/ algorithms. Could you describe the processes used for the
benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 19:56     ` William Lee Irwin III
@ 2004-08-28 20:14       ` Roger Luethi
  2004-08-29 16:05         ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-28 20:14 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sat, 28 Aug 2004 12:56:47 -0700, William Lee Irwin III wrote:
> These numbers are somewhat at variance with my experience in the area,
> as I see that the internal algorithms actually dominate the runtime
> of the /proc/ algorithms. Could you describe the processes used for the
> benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?

The status/maps numbers below are not only typical, but identical for
all tasks. I'm forking off a defined number of children and then query
their status from the parent.

Because I was interested in delivery overhead, I purposely built a
benchmark without computationally expensive fields. Expensive field
computation hurts /proc more than nproc, because the latter lets you
compute only the fields that are currently needed.

Roger

Name:   nprocbench
State:  T (stopped)
SleepAVG:       0%
Tgid:   6400
Pid:    6400
PPid:   2120
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    100     100     100     100
FDSize: 32
Groups: 4 10 11 18 19 20 27 100 250 
VmSize:     1336 kB
VmLck:         0 kB
VmRSS:       304 kB
VmData:      144 kB
VmStk:        16 kB
VmExe:        12 kB
VmLib:      1140 kB
Threads:        1
SigPnd: 0000000000000000
ShdPnd: 0000000000080000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000

08048000-0804b000 r-xp 00000000 03:45 160990     /home/rl/nproc/nprocbench
0804b000-0804c000 rw-p 00002000 03:45 160990     /home/rl/nproc/nprocbench
0804c000-0806d000 rw-p 0804c000 00:00 0 
40000000-40013000 r-xp 00000000 03:42 11356336   /lib/ld-2.3.3.so
40013000-40014000 rw-p 00012000 03:42 11356336   /lib/ld-2.3.3.so
40014000-40015000 rw-p 40014000 00:00 0 
40032000-4013c000 r-xp 00000000 03:42 11356337   /lib/libc-2.3.3.so
4013c000-40140000 rw-p 00109000 03:42 11356337   /lib/libc-2.3.3.so
40140000-40142000 rw-p 40140000 00:00 0 
bfffc000-c0000000 rw-p bfffc000 00:00 0 
ffffe000-fffff000 ---p 00000000 00:00 0 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 20:14       ` Roger Luethi
@ 2004-08-29 16:05         ` William Lee Irwin III
  2004-08-29 17:02           ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 16:05 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sat, 28 Aug 2004 12:56:47 -0700, William Lee Irwin III wrote:
>> These numbers are somewhat at variance with my experience in the area,
>> as I see that the internal algorithms actually dominate the runtime
>> of the /proc/ algorithms. Could you describe the processes used for the
>> benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?

On Sat, Aug 28, 2004 at 10:14:35PM +0200, Roger Luethi wrote:
> The status/maps numbers below are not only typical, but identical for
> all tasks. I'm forking off a defined number of children and then query
> their status from the parent.
> Because I was interested in delivery overhead, I built on purpose a
> benchmark without computationally expensive fields. Expensive field
> computation hurts /proc more than nproc because the latter allows you
> to have only the currently needed fields computed.

Okay, these explain some of the difference. I usually see issues with
around 10000 processes with fully populated virtual address spaces and
several hundred vmas each, varying between 200 and 1000, mostly
concentrated somewhere just above 300.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 16:05         ` William Lee Irwin III
@ 2004-08-29 17:02           ` Roger Luethi
  2004-08-29 17:20             ` William Lee Irwin III
  2004-08-31 15:34             ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
  0 siblings, 2 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 17:02 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 09:05:42 -0700, William Lee Irwin III wrote:
> Okay, these explain some of the difference. I usually see issues with
> around 10000 processes with fully populated virtual address spaces and
> several hundred vmas each, varying between 200 and 1000, mostly
> concentrated somewhere just above 300.

I agree, that should make quite a difference. As you said, we are
working on orthogonal areas: My current focus is on data delivery (sane
semantics and minimal overhead), while you seem to be more interested
in better data gathering.

I profiled "top -d 0 -b > /dev/null" for about 100 and 10^5 processes.

When monitoring 100 (real-world) processes, /proc specific overhead
(_IO_vfscanf_internal, number, __d_lookup, vsnprintf, etc.) amounts to
about one third of total resource usage.

==> 100 processes: top -d 0 -b > /dev/null <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
20439    12.2035  libc-2.3.3.so            _IO_vfscanf_internal
15852     9.4647  vmlinux                  number
11635     6.9469  vmlinux                  task_statm
9286      5.5444  libc-2.3.3.so            _IO_vfprintf_internal
9128      5.4500  vmlinux                  proc_pid_stat
5395      3.2212  vmlinux                  __d_lookup
4738      2.8289  vmlinux                  vsnprintf
4123      2.4617  libc-2.3.3.so            _IO_default_xsputn_internal
4110      2.4540  libc-2.3.3.so            __i686.get_pc_thunk.bx
3712      2.2163  libc-2.3.3.so            _IO_putc_internal
3504      2.0921  vmlinux                  link_path_walk
3417      2.0402  libc-2.3.3.so            ____strtoul_l_internal
3363      2.0079  libc-2.3.3.so            ____strtol_l_internal
2250      1.3434  libncurses.so.5.4        _nc_outch
2116      1.2634  libc-2.3.3.so            _IO_sputbackc_internal
2006      1.1977  top                      task_show
1851      1.1052  vmlinux                  pid_revalidate

With 10^5 additional dummy processes, resource usage is dominated by
attempts to get a current list of pids. My own benchmark walked a list
of known pids, so that was not an issue. I bet though that nproc can
provide more efficient means to get such a list than getdents (we could
even allow a user to ask for a message on process creation/kill).

So basically that's just another place where nproc-based tools would
trounce /proc-based ones (that piece is vaporware today, though).
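For comparison, the getdents-based enumeration that /proc tools rely on
today boils down to a userspace loop like this (a sketch; list_pids is
an illustrative name, not code from either side):

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Collect up to max pids by scanning /proc directory entries.
 * Returns the number found, or -1 if /proc cannot be opened. */
int list_pids(pid_t *buf, int max)
{
    DIR *d = opendir("/proc");
    struct dirent *de;
    int n = 0;

    if (!d)
        return -1;
    while (n < max && (de = readdir(d)) != NULL) {
        /* process entries are the purely numeric directory names */
        if (isdigit((unsigned char)de->d_name[0]))
            buf[n++] = (pid_t)atol(de->d_name);
    }
    closedir(d);
    return n;
}
```

Every refresh pays for the full directory scan (and a getdents pass on
the kernel side); a netlink request, or creation/exit notifications,
could hand back the same list in one message.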

==> 10000 processes: top -d 0 -b > /dev/null <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
35855    36.0707  vmlinux                  get_tgid_list
9366      9.4223  vmlinux                  pid_alive
7077      7.1196  libc-2.3.3.so            _IO_vfscanf_internal
5386      5.4184  vmlinux                  number
3664      3.6860  vmlinux                  proc_pid_stat
3077      3.0955  libc-2.3.3.so            _IO_vfprintf_internal
2136      2.1489  vmlinux                  __d_lookup
1720      1.7303  vmlinux                  vsnprintf
1451      1.4597  libc-2.3.3.so            __i686.get_pc_thunk.bx
1409      1.4175  libc-2.3.3.so            _IO_default_xsputn_internal
1258      1.2656  libc-2.3.3.so            _IO_putc_internal
1225      1.2324  vmlinux                  link_path_walk
1210      1.2173  libc-2.3.3.so            ____strtoul_l_internal
1199      1.2062  vmlinux                  task_statm
1157      1.1640  libc-2.3.3.so            ____strtol_l_internal
794       0.7988  libc-2.3.3.so            _IO_sputbackc_internal
776       0.7807  libncurses.so.5.4        _nc_outch

The remaining profiles are for two benchmarks from my previous message.

Field computation is more prominent than with top because the benchmark
uses a known list of pids and parsing is kept at a trivial level.

==> /proc/pid/statm (2x) for 10000 processes <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
7430      9.9485  libc-2.3.3.so            _IO_vfscanf_internal
6195      8.2948  vmlinux                  __d_lookup
5477      7.3335  vmlinux                  task_statm
5082      6.8046  vmlinux                  number
3227      4.3208  vmlinux                  link_path_walk
3050      4.0838  libc-2.3.3.so            ____strtol_l_internal
2116      2.8332  libc-2.3.3.so            _IO_vfprintf_internal
2064      2.7636  vmlinux                  vsnprintf
1664      2.2280  vmlinux                  atomic_dec_and_lock
1551      2.0767  vmlinux                  task_dumpable
1497      2.0044  vmlinux                  pid_revalidate
1419      1.9000  vmlinux                  system_call
1401      1.8759  vmlinux                  pid_alive
1244      1.6657  libc-2.3.3.so            _IO_sputbackc_internal
1175      1.5733  vmlinux                  dnotify_parent
1060      1.4193  libc-2.3.3.so            _IO_default_xsputn_internal
922       1.2345  vmlinux                  file_move

nproc removes most of the delivery overhead, so field computation is
now dominant. Strictly speaking, it should be even higher because the
benchmark requests the same fields three times, but they only get
computed once in such a case.
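The compute-once behavior can be pictured with a small sketch
(hypothetical names, not the actual nproc code): each reply carries a
per-task cache with a valid flag, so repeated field IDs in one request
hit the cache instead of recomputing.

```c
#include <assert.h>

static int mem_computations;     /* counts expensive computations */

struct task_sample {
    long vm_rss;
    int have_mem;                /* cache-valid flag */
};

/* Stand-in for the expensive walk over the task's mm. */
static long compute_mem(void)
{
    mem_computations++;
    return 1234;
}

/* Field accessor: computes at most once per task_sample. */
static long field_vm_rss(struct task_sample *s)
{
    if (!s->have_mem) {
        s->vm_rss = compute_mem();
        s->have_mem = 1;
    }
    return s->vm_rss;
}
```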

==> 27 nproc fields for 10000 processes, one process per request <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
7647     25.0894  vmlinux                  __task_mem
2125      6.9720  vmlinux                  find_pid
1884      6.1813  vmlinux                  nproc_pid_fields
1488      4.8820  vmlinux                  __task_mem_cheap
1161      3.8092  vmlinux                  mmgrab
978       3.2088  vmlinux                  netlink_recvmsg
944       3.0972  vmlinux                  alloc_skb
935       3.0677  vmlinux                  __might_sleep
751       2.4640  vmlinux                  nproc_select_pid
738       2.4213  vmlinux                  system_call
691       2.2671  vmlinux                  skb_dequeue
636       2.0867  vmlinux                  netlink_sendmsg
630       2.0670  vmlinux                  __copy_from_user_ll
624       2.0473  vmlinux                  sockfd_lookup
621       2.0375  vmlinux                  kfree
602       1.9751  vmlinux                  __reply_size
523       1.7159  vmlinux                  fget

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:02           ` Roger Luethi
@ 2004-08-29 17:20             ` William Lee Irwin III
  2004-08-29 17:52               ` Roger Luethi
  2004-08-29 19:07               ` Paul Jackson
  2004-08-31 15:34             ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
  1 sibling, 2 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 17:20 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 09:05:42 -0700, William Lee Irwin III wrote:
>> Okay, these explain some of the difference. I usually see issues with
>> around 10000 processes with fully populated virtual address spaces and
>> several hundred vmas each, varying between 200 and 1000, mostly
>> concentrated somewhere just above 300.

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> I agree, that should make quite a difference. As you said, we are
> working on orthogonal areas: My current focus is on data delivery (sane
> semantics and minimal overhead), while you seem to be more interested
> in better data gathering.

Yes, there doesn't seem to be any conflict between the code we're
working on. These benchmark results are very useful for quantifying the
relative importance of the overheads under more typical conditions.


On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> I profiled "top -d 0 -b > /dev/null" for about 100 and 10^5 processes.
> When monitoring 100 (real-world) processes, /proc specific overhead
> (_IO_vfscanf_internal, number, __d_lookup, vsnprintf, etc.) amounts to
> about one third of total resource usage.
> ==> 100 processes: top -d 0 -b > /dev/null <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               symbol name
> 20439    12.2035  libc-2.3.3.so            _IO_vfscanf_internal
> 15852     9.4647  vmlinux                  number
> 11635     6.9469  vmlinux                  task_statm
> 9286      5.5444  libc-2.3.3.so            _IO_vfprintf_internal
> 9128      5.4500  vmlinux                  proc_pid_stat

Lexical analysis is cpu-intensive, probably due to the cache misses
taken while traversing the strings. This is likely inherent in string
processing interfaces.


On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> With 10^5 additional dummy processes, resource usage is dominated by
> attempts to get a current list of pids. My own benchmark walked a list
> of known pids, so that was not an issue. I bet though that nproc can
> provide more efficient means to get such a list than getdents (we could
> even allow a user to ask for a message on process creation/kill).
> So basically that's just another place where nproc-based tools would
> trounce /proc-based ones (that piece is vaporware today, though).
> ==> 10000 processes: top -d 0 -b > /dev/null <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               symbol name
> 35855    36.0707  vmlinux                  get_tgid_list
> 9366      9.4223  vmlinux                  pid_alive
> 7077      7.1196  libc-2.3.3.so            _IO_vfscanf_internal
> 5386      5.4184  vmlinux                  number
> 3664      3.6860  vmlinux                  proc_pid_stat

get_tgid_list() is a sad story I don't have time to go into in depth.
The short version is that larger systems are extremely sensitive to
hold time for writes on the tasklist_lock, and this shows up at
scales small enough that we don't need SGI to tell us about it
(though still beyond personal financial resources).


On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> The remaining profiles are for two benchmarks from my previous message.
> Field computation is more prominent than with top because the benchmark
> uses a known list of pids and parsing is kept at a trivial level.
> ==> /proc/pid/statm (2x) for 10000 processes <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               symbol name
> 7430      9.9485  libc-2.3.3.so            _IO_vfscanf_internal
> 6195      8.2948  vmlinux                  __d_lookup
> 5477      7.3335  vmlinux                  task_statm
> 5082      6.8046  vmlinux                  number
> 3227      4.3208  vmlinux                  link_path_walk

scanf() is still very pronounced here; I wonder how well-optimized
glibc's implementation is, or if otherwise it may be useful to
circumvent it with a more specialized parser if its generality
requirements preclude faster execution.
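A specialized parser of the sort suggested here is easy to sketch: a
statm line is just space-separated decimal fields, so a hand-rolled
digit loop sidesteps scanf's generality (illustrative code, not from
procps):

```c
#include <assert.h>

/* Parse up to max space-separated decimal fields from s.
 * Returns the number of fields parsed. */
static int parse_statm(const char *s, unsigned long *out, int max)
{
    int n = 0;

    while (*s && n < max) {
        unsigned long v = 0;

        while (*s == ' ')
            s++;
        if (*s < '0' || *s > '9')
            break;
        while (*s >= '0' && *s <= '9')
            v = v * 10 + (unsigned long)(*s++ - '0');
        out[n++] = v;
    }
    return n;
}
```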


On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> nproc removes most of the delivery overhead, so field computation is
> now dominant. Strictly speaking, it should be even higher because the
> benchmark requests the same fields three times, but they only get
> computed once in such a case.
> ==> 27 nproc fields for 10000 processes, one process per request <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               symbol name
> 7647     25.0894  vmlinux                  __task_mem
> 2125      6.9720  vmlinux                  find_pid
> 1884      6.1813  vmlinux                  nproc_pid_fields
> 1488      4.8820  vmlinux                  __task_mem_cheap
> 1161      3.8092  vmlinux                  mmgrab

It looks like I'm going after the right culprit(s) for the lower-level
algorithms from this.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:20             ` William Lee Irwin III
@ 2004-08-29 17:52               ` Roger Luethi
  2004-08-29 18:16                 ` William Lee Irwin III
  2004-08-29 19:07               ` Paul Jackson
  1 sibling, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 17:52 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
> > ==> 10000 processes: top -d 0 -b > /dev/null <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name               symbol name
> > 35855    36.0707  vmlinux                  get_tgid_list
> > 9366      9.4223  vmlinux                  pid_alive
> > 7077      7.1196  libc-2.3.3.so            _IO_vfscanf_internal
> > 5386      5.4184  vmlinux                  number
> > 3664      3.6860  vmlinux                  proc_pid_stat
> 
> get_tgid_list() is a sad story I don't have time to go into in depth.
> The short version is that larger systems are extremely sensitive to
> hold time for writes on the tasklist_lock, and this shows up at
> scales small enough that we don't need SGI to tell us about it
> (though still beyond personal financial resources).

I am confident that this problem (as far as process monitoring is
concerned) could be addressed with differential notification.

> > ==> /proc/pid/statm (2x) for 10000 processes <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name               symbol name
> > 7430      9.9485  libc-2.3.3.so            _IO_vfscanf_internal
> > 6195      8.2948  vmlinux                  __d_lookup
> > 5477      7.3335  vmlinux                  task_statm
> > 5082      6.8046  vmlinux                  number
> > 3227      4.3208  vmlinux                  link_path_walk
> 
> scanf() is still very pronounced here; I wonder how well-optimized
> glibc's implementation is, or if otherwise it may be useful to
> circumvent it with a more specialized parser if its generality
> requirements preclude faster execution.

I'd much rather remove unnecessary overhead than optimize code for
overhead processing. Note that number() takes out 7% and that's the
_kernel_ printing numbers for user space to parse back. And __d_lookup
is another /proc souvenir you get to keep as long as you use /proc.

> > ==> 27 nproc fields for 10000 processes, one process per request <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name               symbol name
> > 7647     25.0894  vmlinux                  __task_mem
> > 2125      6.9720  vmlinux                  find_pid
> > 1884      6.1813  vmlinux                  nproc_pid_fields
> > 1488      4.8820  vmlinux                  __task_mem_cheap
> > 1161      3.8092  vmlinux                  mmgrab
> 
> It looks like I'm going after the right culprit(s) for the lower-level
> algorithms from this.

Well, __task_mem is prominent here because I don't call other computation
functions. vmstat ain't cheap, and wchan is horribly expensive if the
kernel does the ksym translation. Etc. pp.

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:52               ` Roger Luethi
@ 2004-08-29 18:16                 ` William Lee Irwin III
  2004-08-29 19:00                   ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 18:16 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> get_tgid_list() is a sad story I don't have time to go into in depth.
>> The short version is that larger systems are extremely sensitive to
>> hold time for writes on the tasklist_lock, and this shows up at
>> scales small enough that we don't need SGI to tell us about it
>> (though still beyond personal financial resources).

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> I am confident that this problem (as far as process monitoring is
> concerned) could be addressed with differential notification.

I'm a bit squeamish about that given that mmlist_lock and tasklist_lock
are both problematic and yet another global structure to fiddle with in
the process creation and destruction path threatens similar trouble.

Also, what guarantee is there that the notification events come
sufficiently slowly for a single task to process, particularly when that
task may not have a whole cpu's resources to marshal to the task?
Queueing them sounds less than ideal due to resource consumption, and
if notifications are dropped most of the efficiency gains are lost. So
I question that a bit.

I have a vague notion that userspace should intelligently schedule
inquiries so requests are made at a rate the app can process and so
that the app doesn't consume excessive amounts of cpu. In such an
arrangement screen refresh events don't trigger a full scan of the
tasklist, but rather only an incremental partial rescan of it, whose
work is limited for the above cpu bandwidth concerns.


On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> scanf() is still very pronounced here; I wonder how well-optimized
>> glibc's implementation is, or if otherwise it may be useful to
>> circumvent it with a more specialized parser if its generality
>> requirements preclude faster execution.

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> I'd much rather remove unnecessary overhead than optimize code for
> overhead processing. Note that number() takes out 7% and that's the
> _kernel_ printing numbers for user space to parse back. And __d_lookup
> is another /proc souvenir you get to keep as long as you use /proc.

I'm expecting very very long lifetimes for legacy kernel versions and
userspace predating the merge of nproc, so it's not entirely irrelevant,
though backports aren't exactly something I relish.


On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> It looks like I'm going after the right culprit(s) for the lower-level
>> algorithms from this.

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> Well, __task_mem is prominent here because I don't call other computation
> functions. vmstat ain't cheap, and wchan is horribly expensive if the
> kernel does the ksym translation. Etc. pp.

task_mem() is generally prominent when the processes have large numbers
of vmas, and also due to acquisition of ->mmap_sem.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 18:16                 ` William Lee Irwin III
@ 2004-08-29 19:00                   ` Roger Luethi
  2004-08-29 20:17                     ` Albert Cahalan
  0 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 19:00 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > I am confident that this problem (as far as process monitoring is
> > concerned) could be addressed with differential notification.
> 
> I'm a bit squeamish about that given that mmlist_lock and tasklist_lock
> are both problematic and yet another global structure to fiddle with in
> the process creation and destruction path threatens similar trouble.

The numbers look so bad that for many cases it's going to be a
significant win if we simply call nproc_send_note in said paths. But
I'll admit that I've been entertaining thoughts about a global queue
or something to send notifications in batches.

> Also, what guarantee is there that the notification events come
> sufficiently slowly for a single task to process, particularly when that
> task may not have a whole cpu's resources to marshal to the task?

A more likely guarantee is that a process that can't keep up with
differential updates won't be able to process the whole list, either.
Well, unless the system is loaded with tons of short-lived processes
that wouldn't even make the full process list by the time it's pulled.
But in such a case, a complete list of tasks won't do you much good,
either, because by the time you are ready to query the kernel for
details the tasks are gone.

> Queueing them sounds less than ideal due to resource consumption, and
> if notifications are dropped most of the efficiency gains are lost. So
> I question that a bit.

Point. Task discovery is not an exact science anyway, though.

I'd still expect differential notification to be useful in most
non-pathological cases, but I concede it's nowhere near as clear-cut
as nproc per se is.

> I have a vague notion that userspace should intelligently schedule
> inquiries so requests are made at a rate the app can process and so
> that the app doesn't consume excessive amounts of cpu. In such an
> arrangement screen refresh events don't trigger a full scan of the
> tasklist, but rather only an incremental partial rescan of it, whose
> work is limited for the above cpu bandwidth concerns.

While I'm not sure I understand how that partial rescan (or its limits)
would be defined, I agree with the general idea. There is indeed plenty
of room for improvement in a smart user space. For instance, most apps
show only the top n processes. So if an app shows the top 20 memory
users, it could use nproc to get a complete list of pid+vmrss, and then
request all the expensive fields only for the top 20 in that list.
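The two-phase approach could look like this in a client (hypothetical
names; the nproc calls are stubbed out as plain arrays here):

```c
#include <assert.h>
#include <stdlib.h>

/* Phase one result: one cheap field per process. */
struct cheap {
    int pid;
    long vmrss;
};

static int by_vmrss_desc(const void *a, const void *b)
{
    long ra = ((const struct cheap *)a)->vmrss;
    long rb = ((const struct cheap *)b)->vmrss;

    return (rb > ra) - (rb < ra);   /* descending by vmrss */
}

/* Sort the cheap list and return the pids of the top n memory users;
 * phase two would request the expensive fields for these pids only. */
static int top_memory_users(struct cheap *all, int total, int n, int *out)
{
    int i, k = total < n ? total : n;

    qsort(all, total, sizeof(*all), by_vmrss_desc);
    for (i = 0; i < k; i++)
        out[i] = all[i].pid;
    return k;
}
```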

> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > I'd much rather remove unnecessary overhead than optimize code for
> > overhead processing. Note that number() takes out 7% and that's the
> > _kernel_ printing numbers for user space to parse back. And __d_lookup
> > is another /proc souvenir you get to keep as long as you use /proc.
> 
> I'm expecting very very long lifetimes for legacy kernel versions and
> userspace predating the merge of nproc, so it's not entirely irrelevant,
> though backports aren't exactly something I relish.

Uhm... Optimized string parsing would require updated user space
anyway. OTOH, I can buy the legacy kernel argument, so if you want to
rewrite the user space tools, go wild :-). You may find that there are
issues more serious than string parsing:

$ ps --version
procps version 3.2.3
$ ps -o pid
  PID
 2089
 2139
$ strace ps -o pid 2>&1|grep 'open("/proc/'|wc -l
325

<whine>

> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > Well, __task_mem is prominent here because I don't call other computation
> > functions. vmstat ain't cheap, and wchan is horribly expensive if the
> > kernel does the ksym translation. Etc. pp.
> 
> task_mem() is generally prominent when the processes have large numbers
> of vmas, and also due to acquisition of ->mmap_sem.

Makes sense. I just wanted to make sure I wasn't misleading you.

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:20             ` William Lee Irwin III
  2004-08-29 17:52               ` Roger Luethi
@ 2004-08-29 19:07               ` Paul Jackson
  2004-08-29 19:17                 ` William Lee Irwin III
  1 sibling, 1 reply; 39+ messages in thread
From: Paul Jackson @ 2004-08-29 19:07 UTC (permalink / raw)
  To: William Lee Irwin III; +Cc: rl, linux-kernel, albert

> get_tgid_list() is a sad story I don't have time to go into in depth.
> The short version is that larger systems are extremely sensitive to

Thanks, Roger and William, for your good work here.  I'm sure that SGI's
big berthas will benefit.

In glancing at the get_tgid_list() I see it is careful to only pick off
20 (PROC_MAXPIDS) slots at a time.  But elsewhere in the kernel, I see
several uses of "do_each_thread()" which rip through the entire task
list in a single shot.

Is there a simple explanation for why it is ok in one place to take on
the entire task list in a single sweep, but in another it is important
to drop the lock every 20 slots?

From the code and nice comments, I see that:
  (1) the work that had to be done by proc_pid_readdir(), the caller of
      get_tgid_list(), required dropping the task list lock, and
  (2) so the harvested tgid's had to be stashed in a temp buffer.

So perhaps the reason for not doing this in a single pass is:
  (3) it was not doable or not desirable (which one?) to size that temp
      buffer large enough to hold all the harvested tgid's in one pass.

But my understanding is losing the scent of the trail at this point.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 19:07               ` Paul Jackson
@ 2004-08-29 19:17                 ` William Lee Irwin III
  2004-08-29 19:49                   ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 19:17 UTC (permalink / raw)
  To: Paul Jackson; +Cc: rl, linux-kernel, albert

At some point in the past, I wrote:
>> get_tgid_list() is a sad story I don't have time to go into in depth.
>> The short version is that larger systems are extremely sensitive to

On Sun, Aug 29, 2004 at 12:07:33PM -0700, Paul Jackson wrote:
> Thanks, Roger and William, for your good work here.  I'm sure that SGI's
> big bertha's will benefit.
> In glancing at the get_tgid_list() I see it is careful to only pick off
> 20 (PROC_MAXPIDS) slots at a time.  But elsewhere in the kernel, I see
> several uses of "do_each_thread()" which rip through the entire task
> list in a single shot.
> Is there a simple explanation for why it is ok in one place to take on
> the entire task list in a single sweep, but in another it is important
> to drop the lock every 20 slots?

PROC_MAXPIDS is the size of the buffer used to temporarily store the
pid's while doing user copies, so that potentially blocking operations
may be done to transmit the pid's to userspace.

Introducing another whole-tasklist scan, even if feasible, is probably
not a good idea.
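In other words, the existing pattern looks roughly like this (a
userspace sketch of the proc_pid_readdir batching, not the actual
kernel code; the lock counter stands in for tasklist_lock):

```c
#include <assert.h>

#define BATCH 20   /* PROC_MAXPIDS */

static int lock_acquisitions;
static void tl_lock(void)   { lock_acquisitions++; }
static void tl_unlock(void) { }

/* Harvest all of tasklist[] into out[], holding the lock for at most
 * BATCH entries at a time so blocking work can happen in between. */
static int harvest(const int *tasklist, int ntasks, int *out)
{
    int done = 0;

    while (done < ntasks) {
        int batch[BATCH], i, n = 0;

        tl_lock();
        while (done + n < ntasks && n < BATCH) {
            batch[n] = tasklist[done + n];
            n++;
        }
        tl_unlock();
        for (i = 0; i < n; i++)     /* the "may block" part */
            out[done + i] = batch[i];
        done += n;
    }
    return done;
}
```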


On Sun, Aug 29, 2004 at 12:07:33PM -0700, Paul Jackson wrote:
> From the code and nice comments, I see that:
>   (1) the work that had to be done by proc_pid_readdir(), the caller of
>       get_tgid_list(), required dropping the task list lock, and
>   (2) so the harvested tgid's had to be stashed in a temp buffer.
> So perhaps the reason for not doing this in a single pass is:
>   (3) it was not doable or not desirable (which one?) to size that temp
>       buffer large enough to hold all the harvested tgid's in one pass.
> But my understanding is losing the scent of the trail at this point.

Using a larger, dynamically-allocated buffer may be better, e.g.
allocating a page to buffer pid's with.

A solution to the problem of the quadratic algorithm I wrote long ago
restructured the tasklist as an rbtree so that the position in the
tasklist could be recovered in O(lg(n)) time. Unfortunately, this
increases the write hold time of tasklist_lock.


-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 19:17                 ` William Lee Irwin III
@ 2004-08-29 19:49                   ` Roger Luethi
  2004-08-29 20:25                     ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 19:49 UTC (permalink / raw)
  To: William Lee Irwin III, Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 12:17:07 -0700, William Lee Irwin III wrote:
> > In glancing at the get_tgid_list() I see it is careful to only pick off
> > 20 (PROC_MAXPIDS) slots at a time.  But elsewhere in the kernel, I see
> > several uses of "do_each_thread()" which rip through the entire task
> > list in a single shot.
> > Is there a simple explanation for why it is ok in one place to take on
> > the entire task list in a single sweep, but in another it is important
> > to drop the lock every 20 slots?
> 
[...]
> Introducing another whole-tasklist scan, even if feasible, is probably
> not a good idea.

I'm not sure whether I should participate in that discussion. I'll risk
discrediting nproc with wild speculations on a subject I haven't really
looked into yet. Ah well...

As far as nproc (and process monitoring) is concerned, we aren't really
interested in walking a complete process list. All we care about is
which pids exist right now. How about a bit field, maintained by the
kernel, to indicate for each pid whether it exists or not? This would
amount to 4 KiB by default and 512 KiB for PID_MAX_LIMIT (4 million
processes). Maintenance cost would be one atomic bit operation per
process creation/deletion. No contested locks.

The list for the nproc user could be prepared based on the bit field
(or simply memcpy'd), no tasklist_lock or walking linked lists required.
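For concreteness, a sketch of that bit field (plain C standing in for
the kernel's set_bit/clear_bit; the sizes match the 4 KiB figure for
the default pid_max of 32768):

```c
#include <assert.h>
#include <string.h>

#define PID_MAX 32768
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

/* 32768 bits = 4 KiB, regardless of word size. */
static unsigned long pid_map[PID_MAX / BITS_PER_LONG];

static void pid_set(int pid)
{
    pid_map[pid / BITS_PER_LONG] |= 1UL << (pid % BITS_PER_LONG);
}

static void pid_clear(int pid)
{
    pid_map[pid / BITS_PER_LONG] &= ~(1UL << (pid % BITS_PER_LONG));
}

/* Snapshot the map (the memcpy above) and expand it into a pid list;
 * no lock and no tasklist walk required. */
static int pids_from_snapshot(int *out, int max)
{
    unsigned long snap[PID_MAX / BITS_PER_LONG];
    int pid, n = 0;

    memcpy(snap, pid_map, sizeof(snap));
    for (pid = 0; pid < PID_MAX && n < max; pid++)
        if (snap[pid / BITS_PER_LONG] & (1UL << (pid % BITS_PER_LONG)))
            out[n++] = pid;
    return n;
}
```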

What am I missing?

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 19:00                   ` Roger Luethi
@ 2004-08-29 20:17                     ` Albert Cahalan
  2004-08-29 20:46                       ` William Lee Irwin III
                                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Albert Cahalan @ 2004-08-29 20:17 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

> Roger Luethi writes:
> On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:

>>> I am confident that this problem (as far as process
>>> monitoring is concerned) could be addressed with
>>> differential notification.
...
>> Also, what guarantee is there that the notification
>> events come sufficiently slowly for a single task to
>> process, particularly when that task may not have a whole
>> cpu's resources to marshal to the task?
>
> A more likely guarantee is that a process that can't
> keep up with differential updates won't be able to
> process the whole list, either.

When the reader falls behind, keep supplying differential
updates as long as practical. When this starts to eat up
lots of memory, switch to supplying the full list until
the reader catches up again.
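That fallback is simple to sketch (hypothetical structure, not from
the patch): bound the queue, and on overflow drop the backlog and flag
the reader for a full-list resync.

```c
#include <assert.h>

#define QCAP 8   /* illustrative bound on queued events */

struct notif_queue {
    int ev[QCAP];
    int n;
    int need_resync;   /* reader must re-fetch the full pid list */
};

/* Queue one creation/exit event; on overflow, discard the backlog
 * and switch the reader to full-list mode. */
static void post_event(struct notif_queue *q, int pid)
{
    if (q->need_resync)
        return;                 /* already lost; full list pending */
    if (q->n == QCAP) {
        q->n = 0;               /* drop the backlog ... */
        q->need_resync = 1;     /* ... and fall back to a full list */
        return;
    }
    q->ev[q->n++] = pid;
}
```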

>> I have a vague notion that userspace should intelligently schedule
>> inquiries so requests are made at a rate the app can process and so
>> that the app doesn't consume excessive amounts of cpu. In such an
>> arrangement screen refresh events don't trigger a full scan of the
>> tasklist, but rather only an incremental partial rescan of it, whose
>> work is limited for the above cpu bandwidth concerns.

If you won't scan, why update the display? This boils down
to simply setting a lower refresh rate or using "nice".

> While I'm not sure I understand how that partial rescan (or its limits)
> would be defined, I agree with the general idea. There is indeed plenty
> of room for improvement in a smart user space. For instance, most apps
> show only the top n processes. So if an app shows the top 20 memory
> users, it could use nproc to get a complete list of pid+vmrss, and then
> request all the expensive fields only for the top 20 in that list.

This is crummy. It's done for wchan, since that is so horribly
expensive, but I'm not liking the larger race condition window.
Remember that PIDs get reused. There isn't a generation counter
or UUID that can be checked.

> Uhm... Optimized string parsing would require updated user space
> anyway. OTOH, I can buy the legacy kernel argument, so if you want to
> rewrite the user space tools, go wild :-). You may find that there are
> issues more serious than string parsing:
>
> $ ps --version
> procps version 3.2.3
> $ ps -o pid
>   PID
>  2089
>  2139
> $ strace ps -o pid 2>&1|grep 'open("/proc/'|wc -l
> 325
>
> <whine>

While "pid" makes a nice extreme example, note that ps must
handle arbitrary cases like "pmem,comm,wchan,ppid,session".

Now, I direct your attention to "Introduction to Algorithms",
by Cormen, Leiserson, and Rivest. Find the section entitled
"The set-covering problem". It's page 974, section 37.3, in
my version of the book. An example of this would be the
determination of the minimum set of /proc files needed to
supply some required set of process attributes.

Look familiar? It's NP-hard. To me, that just sounds bad. :-)

While there are decent (?) approximations that run in
polynomial time, they are generally overkill. It is very
common to need both the stat and status files. Selection,
sorting, and display all may require data.
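One such polynomial-time approximation is the standard greedy heuristic: repeatedly pick the file that supplies the most still-missing fields. A toy version with made-up file/field bitmasks (the real procps tables differ):

```c
#define NFILES 4

/* which fields each per-process /proc file supplies, one bit per
 * field -- purely illustrative masks, not the actual file contents */
static const unsigned supplies[NFILES] = {
    0x0F,  /* "stat":   pid, ppid, state, tty */
    0x31,  /* "status": pid, uid, vmsize      */
    0x40,  /* "statm":  rss                   */
    0x80,  /* "wchan":  wchan                 */
};

/* greedy set cover: ln(n)-approximate, runs in O(files * fields) */
static unsigned greedy_cover(unsigned wanted)
{
    unsigned chosen = 0;              /* bitmask of files to open */

    while (wanted) {
        int best = -1, best_gain = 0;
        for (int i = 0; i < NFILES; i++) {
            int gain = __builtin_popcount(supplies[i] & wanted);
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        if (best < 0)
            break;                    /* some field is not coverable */
        chosen |= 1u << best;
        wanted &= ~supplies[best];
    }
    return chosen;
}
```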

But hey, we can go ahead and compute NP-hard problems in
userspace if that makes the kernel less complicated. :-)
Just remember that if I say "this is hard", I mean it.





* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 19:49                   ` Roger Luethi
@ 2004-08-29 20:25                     ` William Lee Irwin III
  2004-08-31 10:16                       ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 20:25 UTC (permalink / raw)
  To: Roger Luethi; +Cc: Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 12:17:07 -0700, William Lee Irwin III wrote:
>> Introducing another whole-tasklist scan, even if feasible, is probably
>> not a good idea.

On Sun, Aug 29, 2004 at 09:49:26PM +0200, Roger Luethi wrote:
> I'm not sure whether I should participate in that discussion. I'll risk
> discrediting nproc with wild speculations on a subject I haven't really
> looked into yet. Ah well...

There isn't much to speculate about here; reducing the arrival rate to
tasklist_lock is okay, but it can't be held forever or use unbounded
allocations or anything like that.


On Sun, Aug 29, 2004 at 09:49:26PM +0200, Roger Luethi wrote:
> As far as nproc (and process monitoring) is concerned, we aren't really
> interested in walking a complete process list. All we care about is
> which pids exist right now. How about a bit field, maintained by the
> kernel, to indicate for each pid whether it exists or not? This would
> amount to 4 KiB by default and 512 KiB for PID_MAX_LIMIT (4 million
> processes). Maintenance cost would be one atomic bit operation per
> process creation/deletion. No contested locks.
> The list for the nproc user could be prepared based on the bit field
> (or simply memcpy'd), no tasklist_lock or walking linked lists required.
> What am I missing?

The pid bitmap could be exported to userspace rather easily.


-- wli


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:17                     ` Albert Cahalan
@ 2004-08-29 20:46                       ` William Lee Irwin III
  2004-08-29 21:45                         ` Albert Cahalan
  2004-08-29 21:41                       ` Roger Luethi
  2004-08-30 10:31                       ` Paulo Marques
  2 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 20:46 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>>> Also, what guarantee is there that the notification
>>> events come sufficiently slowly for a single task to
>>> process, particularly when that task may not have a whole
>>> cpu's resources to marshal to the task?

Roger Luethi writes:
>> A more likely guarantee is that a process that can't
>> keep up with differential updates won't be able to
>> process the whole list, either.

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> When the reader falls behind, keep supplying differential
> updates as long as practical. When this starts to eat up
> lots of memory, switch to supplying the full list until
> the reader catches up again.

You shouldn't have to try to scan the set of all tasks in any bounded
period of time or rely on differential updates. Scanning some part of
the list of a bounded size, updating the state based on what was
scanned, and reporting the rest as if it hadn't changed is the strategy
I'm describing.


On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>>> I have a vague notion that userspace should intelligently schedule
>>> inquiries so requests are made at a rate the app can process and so
>>> that the app doesn't consume excessive amounts of cpu. In such an
>>> arrangement screen refresh events don't trigger a full scan of the
>>> tasklist, but rather only an incremental partial rescan of it, whose
>>> work is limited for the above cpu bandwidth concerns.

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> If you won't scan, why update the display? This boils down
> to simply setting a lower refresh rate or using "nice".

Some updates can be captured, merely not all. Updating the state given
what was captured during the partial scan and then displaying the state
derived from what could be captured in the refresh interval is more
useful than being nonfunctional at the lower refresh intervals or
needlessly beating the kernel in some futile attempt to exhaustively
search an impossibly huge dataset in some time bound that can't be
satisfied.


Roger Luethi writes:
>> While I'm not sure I understand how that partial rescan (or its limits)
>> would be defined, I agree with the general idea. There is indeed plenty
>> of room for improvement in a smart user space. For instance, most apps
>> show only the top n processes. So if an app shows the top 20 memory
>> users, it could use nproc to get a complete list of pid+vmrss, and then
>> request all the expensive fields only for the top 20 in that list.

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.
> Remember that PIDs get reused. There isn't a generation counter
> or UUID that can be checked.

One shouldn't really need to care; periodically rechecking the fields
of an active pid should suffice. You don't really care whether it's the
same task or not, just that the fields are up-to-date and whether any
task with that pid exists.


Roger Luethi writes:
>> Uhm... Optimized string parsing would require updated user space
>> anyway. OTOH, I can buy the legacy kernel argument, so if you want to
>> rewrite the user space tools, go wild :-). You may find that there are
>> issues more serious than string parsing:
[...]

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> While "pid" makes a nice extreme example, note that ps must
> handle arbitrary cases like "pmem,comm,wchan,ppid,session".
> Now, I direct your attention to "Introduction to Algorithms",
> by Cormen, Leiserson, and Rivest. Find the section entitled
> "The set-covering problem". It's page 974, section 37.3, in
> my version of the book. An example of this would be the
> determination of the minimum set of /proc files needed to
> supply some required set of process attributes.
> Look familiar? It's NP-hard. To me, that just sounds bad. :-)
> While there are decent (?) approximations that run in
> polynomial time, they are generally overkill. It is very
> common to need both the stat and status files. Selection,
> sorting, and display all may require data.
> But hey, we can go ahead and compute NP-hard problems in
> userspace if that makes the kernel less complicated. :-)
> Just remember that if I say "this is hard", I mean it.

Actually, the problem size is so small it shouldn't be problematic.
There are only 13 /proc/ files associated with a process, so exhaustive
search over 2**13 - 1 == 8191 nonempty subsets, e.g. queueing by size
and checking for the satisfiability of the reporting, will suffice.
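With so few files, brute force over every nonempty subset is indeed cheap; a sketch over a made-up 4-file table (13 files would merely raise the loop bound to 2**13):

```c
#define NFILES 4

/* which fields each per-process /proc file supplies, one bit per
 * field -- illustrative masks only */
static const unsigned supplies[NFILES] = {
    0x0F,  /* "stat"   */
    0x31,  /* "status" */
    0x40,  /* "statm"  */
    0x80,  /* "wchan"  */
};

/* exhaustive minimum cover: try all nonempty subsets, keep the
 * smallest one whose union supplies every wanted field */
static unsigned exhaustive_cover(unsigned wanted)
{
    unsigned best = 0;
    int best_sz = NFILES + 1;

    for (unsigned s = 1; s < (1u << NFILES); s++) {
        unsigned got = 0;
        for (int i = 0; i < NFILES; i++)
            if (s & (1u << i))
                got |= supplies[i];
        int sz = __builtin_popcount(s);
        if ((got & wanted) == wanted && sz < best_sz) {
            best = s;
            best_sz = sz;
        }
    }
    return best;                      /* 0 if no subset covers 'wanted' */
}
```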


-- wli


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:17                     ` Albert Cahalan
  2004-08-29 20:46                       ` William Lee Irwin III
@ 2004-08-29 21:41                       ` Roger Luethi
  2004-08-29 23:31                         ` Albert Cahalan
  2004-08-30 10:31                       ` Paulo Marques
  2 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 21:41 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 16:17:26 -0400, Albert Cahalan wrote:
> When the reader falls behind, keep supplying differential
> updates as long as practical. When this starts to eat up
> lots of memory, switch to supplying the full list until
> the reader catches up again.

I think it should be up to the reader to request stuff, so I'd probably
just have the kernel notify the client that there won't be any more
differential updates. Then the client can decide what to do now.

But I'd have to play around with this to see what works.

> > While I'm not sure I understand how that partial rescan (or its limits)
> > would be defined, I agree with the general idea. There is indeed plenty
> > of room for improvement in a smart user space. For instance, most apps
> > show only the top n processes. So if an app shows the top 20 memory
> > users, it could use nproc to get a complete list of pid+vmrss, and then
> > request all the expensive fields only for the top 20 in that list.
> 
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.

The races left with nproc are much smaller. There is of course the question
of whether the pid still exists by the time you query the kernel about it.
But you get all the information in one go (although the process may still
disappear while the kernel prepares the requested info).

> > $ ps --version
> > procps version 3.2.3
> > $ ps -o pid
> >   PID
> >  2089
> >  2139
> > $ strace ps -o pid 2>&1|grep 'open("/proc/'|wc -l
> > 325
> >
> > <whine>
> 
> While "pid" makes a nice extreme example, note that ps must
> handle arbitrary cases like "pmem,comm,wchan,ppid,session".
> 
> Now, I direct your attention to "Introduction to Algorithms",
> by Cormen, Leiserson, and Rivest. Find the section entitled
[...]
> Just remember that if I say "this is hard", I mean it.

Entertaining, but you missed the point: I am not terribly impressed with
the fact that ps opens two files (stat, statm) for _every_ _single_
_process_ if all I want to know is, say, the name of PID 42 (example
taken from ps(1): ps -p 42 -o comm=).

And FWIW, you don't need the "minimum set of /proc files needed to
supply some required set of process attributes". Any set that supplies
the required fields will do, and you can get an excellent approximation
in O(n).

I suspect Cormen, Leiserson, and Rivest would take exception to your
assertion that the ps tools can't be improved, or even that doing so is hard.

Roger



* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:46                       ` William Lee Irwin III
@ 2004-08-29 21:45                         ` Albert Cahalan
  2004-08-29 22:11                           ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Albert Cahalan @ 2004-08-29 21:45 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > When the reader falls behind, keep supplying differential
> > updates as long as practical. When this starts to eat up
> > lots of memory, switch to supplying the full list until
> > the reader catches up again.
> 
> You shouldn't have to try to scan the set of all tasks in any bounded
> period of time or rely on differential updates. Scanning some part of
> the list of a bounded size, updating the state based on what was
> scanned, and reporting the rest as if it hadn't changed is the strategy
> I'm describing.

That's defective. Users will not like it.

> > If you won't scan, why update the display? This boils down
> > to simply setting a lower refresh rate or using "nice".
> 
> Some updates can be captured, merely not all. Updating the
> state given what was captured during the partial scan and
> then displaying the state derived from what could be
> captured in the refresh interval is more useful than being
> nonfunctional at the lower refresh intervals or needlessly
> beating the kernel in some futile attempt to exhaustively
> search an impossibly huge dataset in some time bound that
> can't be satisfied.

nice -n 19 top

> Roger Luethi writes:
> >> While I'm not sure I understand how that partial rescan (or its limits)
> >> would be defined, I agree with the general idea. There is indeed plenty
> >> of room for improvement in a smart user space. For instance, most apps
> >> show only the top n processes. So if an app shows the top 20 memory
> >> users, it could use nproc to get a complete list of pid+vmrss, and then
> >> request all the expensive fields only for the top 20 in that list.
> 
> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > This is crummy. It's done for wchan, since that is so horribly
> > expensive, but I'm not liking the larger race condition window.
> > Remember that PIDs get reused. There isn't a generation counter
> > or UUID that can be checked.
> 
> One shouldn't really need to care; periodically rechecking the fields
> of an active pid should suffice. You don't really care whether it's the
> same task or not, just that the fields are up-to-date and whether any
> task with that pid exists.

People use the procps tools to kill processes.
Bad data leads to bad decisions.

> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > While "pid" makes a nice extreme example, note that ps must
> > handle arbitrary cases like "pmem,comm,wchan,ppid,session".
> > Now, I direct your attention to "Introduction to Algorithms",
> > by Cormen, Leiserson, and Rivest. Find the section entitled
> > "The set-covering problem". It's page 974, section 37.3, in
> > my version of the book. An example of this would be the
> > determination of the minimum set of /proc files needed to
> > supply some required set of process attributes.
> > Look familiar? It's NP-hard. To me, that just sounds bad. :-)
> > While there are decent (?) approximations that run in
> > polynomial time, they are generally overkill. It is very
> > common to need both the stat and status files. Selection,
> > sorting, and display all may require data.
> > But hey, we can go ahead and compute NP-hard problems in
> > userspace if that makes the kernel less complicated. :-)
> > Just remember that if I say "this is hard", I mean it.
> 
> Actually, the problem size is so small it shouldn't be problematic.
> There are only 13 /proc/ files associated with a process, so exhaustive
> search over 2**13 - 1 == 8191 nonempty subsets, e.g. queueing by size
> and checking for the satisfiability of the reporting, will suffice.

Nice! Checking for satisfiability is only NP-complete...

I do get your point, but I expect to see more /proc files
as time passes. Also, there is the issue of maintainability.

Example 1: It has crossed my mind to add separate files
for the least security-critical data, so that an SE Linux
system with moderate security could provide some minimal
amount of basic info to normal users.

Example 2: There could be files containing only data
that is easy to generate or that needs the same locking.

Even with the "ps -o pid" example given, opening /proc/*/stat
is required to get the tty. Opening /proc/*/status is nearly
required; one can do stat() on the directory to get that
via st_uid though.
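The stat() trick works because procfs sets the ownership of each per-process directory to the task's uid; a small Linux-specific illustration:

```c
#include <stdio.h>
#include <sys/stat.h>

/* owner of a process without opening /proc/PID/status:
 * the procfs directory's st_uid is the task's uid */
static int proc_uid(int pid)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof(path), "/proc/%d", pid);
    if (stat(path, &st) < 0)
        return -1;                    /* no such pid, or /proc not mounted */
    return (int)st.st_uid;
}
```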




* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 21:45                         ` Albert Cahalan
@ 2004-08-29 22:11                           ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 22:11 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> You shouldn't have to try to scan the set of all tasks in any bounded
>> period of time or rely on differential updates. Scanning some part of
>> the list of a bounded size, updating the state based on what was
>> scanned, and reporting the rest as if it hadn't changed is the strategy
>> I'm describing.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> That's defective. Users will not like it.

Scarcely. The task can't be done in realtime. The data will be stale
by the time it's reported anyway. Limiting the amount of sampling done
is vastly superior to beating the kernel's reporting interfaces to
death in a totally futile attempt to achieve infeasible consistencies,
burning ridiculous amounts of cpu in the process, and reporting
gibberish in the end anyway.


On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> Some updates can be captured, merely not all. Updating the
>> state given what was captured during the partial scan and
>> then displaying the state derived from what could be
>> captured in the refresh interval is more useful than being
>> nonfunctional at the lower refresh intervals or needlessly
>> beating the kernel in some futile attempt to exhaustively
>> search an impossibly huge dataset in some time bound that
>> can't be satisfied.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> nice -n 19 top

No, hard cpu limits are required, and even then it just spews gibberish
and very slowly. The current algorithms are nonfunctional with any
substantial number of processes.


On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> One shouldn't really need to care; periodically rechecking the fields
>> of an active pid should suffice. You don't really care whether it's the
>> same task or not, just that the fields are up-to-date and whether any
>> task with that pid exists.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> People use the procps tools to kill processes.
> Bad data leads to bad decisions.

Refusal to rate limit sampling doesn't make the data more coherent in
the presence of large numbers of tasks.


On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> Actually, the problem size is so small it shouldn't be problematic.
>> There are only 13 /proc/ files associated with a process, so exhaustive
>> search over 2**13 - 1 == 8191 nonempty subsets, e.g. queueing by size
>> and checking for the satisfiability of the reporting, will suffice.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> Nice! Checking for satisfiability is only NP-complete...
> I do get your point, but I expect to see more /proc files
> as time passes. Also, there is the issue of maintainability.

No, that's not general satisfiability. Each field to be reported needs
at least one out of some set of subsets of /proc/ files associated with
a process to be included in those parsed. Checking for inclusion of one
of a field's required subsets for each field suffices. The number of
subsets of /proc/ files from which a field is calculable is bounded by
some small constant. It must be constant, as there are a finite number
of fields, and the constant is small, as this is some specific set and
the precise upper bound can be found, and if/when it is found, it is
very likely to be well under a tenth of the total number of subsets of
/proc/ files associated with a process.


On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> Example 1: It has crossed my mind to add separate files
> for the least security-critical data, so that an SE Linux
> system with moderate security could provide some minimal
> amount of basic info to normal users.
> Example 2: There could be files containing only data
> that is easy to generate or that needs the same locking.
> Even with the "ps -o pid" example given, opening /proc/*/stat
> is required to get the tty. Opening /proc/*/status is nearly
> required; one can do stat() on the directory to get that
> via st_uid though.

I don't have a whole lot to say on this subject. These sound reasonable.


-- wli


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 21:41                       ` Roger Luethi
@ 2004-08-29 23:31                         ` Albert Cahalan
  2004-08-30  7:16                           ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: Albert Cahalan @ 2004-08-29 23:31 UTC (permalink / raw)
  To: Roger Luethi
  Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 17:41, Roger Luethi wrote:

> And FWIW, you don't need the "minimum set of /proc
> files needed to supply some required set of process
> attributes". Any set that supplies the required fields
> will do, and you can get an excellent approximation
> in O(n).

You got that, and you didn't like it.

I'm sure it wouldn't be hard to hack up some
special-case optimization for the cases you've
listed. As soon as I do so, you'll find another
special case. Ultimately, you ARE asking to have
procps solve the NP-hard set-covering problem.

There are several good reasons to not go down
that path. The potential for increasing numbers
of /proc files in the future is one. Another is
the very limited benefit; typical ps usage does
require much of that data. Maintainability is yet
another reason; ps does more than just spit out the
data. It is very useful to have a decent selection
of data items that will always be available for
process selection, sorting, and any other use.
The potential for adding bugs is great.

That said, I do at times tweak the code used to
select data sources. Perhaps I should add a new
/proc/*/basics file for the most popular items.
This would make fancy set-covering choices more
profitable.




* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 23:31                         ` Albert Cahalan
@ 2004-08-30  7:16                           ` Roger Luethi
  0 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-30  7:16 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 19:31:17 -0400, Albert Cahalan wrote:
> select data sources. Perhaps I should add a new
> /proc/*/basics file for the most popular items.

It should come as no surprise that I am not keen on making any semantic
changes to /proc just to help the tools. nproc is a vastly superior interface.

Roger


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:17                     ` Albert Cahalan
  2004-08-29 20:46                       ` William Lee Irwin III
  2004-08-29 21:41                       ` Roger Luethi
@ 2004-08-30 10:31                       ` Paulo Marques
  2004-08-30 10:53                         ` William Lee Irwin III
  2 siblings, 1 reply; 39+ messages in thread
From: Paulo Marques @ 2004-08-30 10:31 UTC (permalink / raw)
  To: Albert Cahalan
  Cc: Roger Luethi, William Lee Irwin III, linux-kernel mailing list,
	Paul Jackson

Albert Cahalan wrote:
>...
> 
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.
> Remember that PIDs get reused. There isn't a generation counter
> or UUID that can be checked.

I just wanted to call your attention to the kallsyms speedup patch that 
is now on the -mm tree.

It should improve wchan speed. My benchmarks for kallsyms_lookup (the 
function that was responsible for the wchan time) went from 1340us to 0.5us.

So maybe this is enough not to make wchan a special case anymore...

-- 
Paulo Marques - www.grupopie.com

To err is human, but to really foul things up requires a computer.
Farmers' Almanac, 1978


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-30 10:31                       ` Paulo Marques
@ 2004-08-30 10:53                         ` William Lee Irwin III
  2004-08-30 12:23                           ` Paulo Marques
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-30 10:53 UTC (permalink / raw)
  To: Paulo Marques
  Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list,
	Paul Jackson

Albert Cahalan wrote:
>> This is crummy. It's done for wchan, since that is so horribly
>> expensive, but I'm not liking the larger race condition window.
>> Remember that PIDs get reused. There isn't a generation counter
>> or UUID that can be checked.

On Mon, Aug 30, 2004 at 11:31:43AM +0100, Paulo Marques wrote:
> I just wanted to call your attention to the kallsyms speedup patch that 
> is now on the -mm tree.
> It should improve wchan speed. My benchmarks for kallsyms_lookup (the 
> function that was responsible for the wchan time) went from 1340us to 0.5us.
> So maybe this is enough not to make wchan a special case anymore...

This seems to go wrong on big-endian machines; any chance you could look
over your stuff and try to figure out what endianness issues it may have?


-- wli


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-30 10:53                         ` William Lee Irwin III
@ 2004-08-30 12:23                           ` Paulo Marques
  2004-08-30 12:28                             ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Paulo Marques @ 2004-08-30 12:23 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list,
	Paul Jackson

William Lee Irwin III wrote:
> Albert Cahalan wrote:
> 
>>>This is crummy. It's done for wchan, since that is so horribly
>>>expensive, but I'm not liking the larger race condition window.
>>>Remember that PIDs get reused. There isn't a generation counter
>>>or UUID that can be checked.
> 
> 
> On Mon, Aug 30, 2004 at 11:31:43AM +0100, Paulo Marques wrote:
> 
>>I just wanted to call your attention to the kallsyms speedup patch that 
>>is now on the -mm tree.
>>It should improve wchan speed. My benchmarks for kallsyms_lookup (the 
>>function that was responsible for the wchan time) went from 1340us to 0.5us.
>>So maybe this is enough not to make wchan a special case anymore...
> 
> 
> This seems to go wrong on big-endian machines; any chance you could look
> over your stuff and try to figure out what endianness issues it may have?

I went over the code, but at first glance couldn't find an obvious 
trouble spot. I don't have big-endian hardware myself so this is hard to 
test.
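For reference, the usual shape of such a bug is code that reinterprets a wider integer's bytes in place; whether that is what actually bit kallsyms here is speculation, but assembling values from explicit byte positions avoids the hazard entirely:

```c
#include <stdint.h>
#include <string.h>

/* detect host byte order at runtime */
static int is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* endian-safe load: build the value from explicit byte positions
 * instead of casting the buffer to uint32_t* */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0]       | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}
```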

Just a few questions to help me out in finding the problem:

- is this really an endianness problem or is it a 64-bit integer problem?

- are you cross compiling the kernel?

Thanks in advance,

-- 
Paulo Marques - www.grupopie.com

To err is human, but to really foul things up requires a computer.
Farmers' Almanac, 1978


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-30 12:23                           ` Paulo Marques
@ 2004-08-30 12:28                             ` William Lee Irwin III
  2004-08-30 13:43                               ` Paulo Marques
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-30 12:28 UTC (permalink / raw)
  To: Paulo Marques
  Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list,
	Paul Jackson

William Lee Irwin III wrote:
>> This seems to go wrong on big-endian machines; any chance you could look
>> over your stuff and try to figure out what endianness issues it may have?

On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> I went over the code but at a first glance couldn't find a notorius 
> trouble spot. I don't have big-endian hardware myself so this is hard to 
> test.
> Just a few questions to help me out in finding the problem:
> - is this really an endianess problem or is it a 64-bit integer problem?

Works fine on x86-64 and alpha. Prints gibberish on sparc64.


On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> - are you cross compiling the kernel?
> Thanks in advance,

No. All native.


-- wli


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-30 12:28                             ` William Lee Irwin III
@ 2004-08-30 13:43                               ` Paulo Marques
  0 siblings, 0 replies; 39+ messages in thread
From: Paulo Marques @ 2004-08-30 13:43 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list,
	Paul Jackson

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> 
>>>This seems to go wrong on big-endian machines; any chance you could look
>>>over your stuff and try to figure out what endianness issues it may have?
> 
> 
> On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> 
>>I went over the code but at a first glance couldn't find a notorius 
>>trouble spot. I don't have big-endian hardware myself so this is hard to 
>>test.
>>Just a few questions to help me out in finding the problem:
>>- is this really an endianess problem or is it a 64-bit integer problem?
> 
> 
> Works fine on x86-64 and alpha. Prints gibberish on sparc64.
> 
> 
> On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> 
>>- are you cross compiling the kernel?
>>Thanks in advance,
> 
> 
> No. All native.

Can you send me a ".tmp_kallsyms2.S" obtained after a kernel build on a 
sparc64, so that I can isolate the problem between scripts/kallsyms.c 
and kernel/kallsyms.c?  (Maybe gzip'ed and in private, because this can 
be a big file...)

Thanks for all the help in debugging this.

-- 
Paulo Marques - www.grupopie.com

To err is human, but to really foul things up requires a computer.
Farmers' Almanac, 1978


* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:25                     ` William Lee Irwin III
@ 2004-08-31 10:16                       ` Roger Luethi
  0 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-31 10:16 UTC (permalink / raw)
  To: William Lee Irwin III, Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 13:25:43 -0700, William Lee Irwin III wrote:
> > The list for the nproc user could be prepared based on the bit field
> > (or simply memcpy'd), no tasklist_lock or walking linked lists required.
> > What am I missing?
> 
> The pid bitmap could be exported to userspace rather easily.

I implemented an "all processes" selector based on that. Remaining pieces
are access control and a method for dumping large amounts of data (10 -
1000 KB) to user space.

Roger

* [BENCHMARK] nproc: Look Ma, No get_tgid_list!
  2004-08-29 17:02           ` Roger Luethi
  2004-08-29 17:20             ` William Lee Irwin III
@ 2004-08-31 15:34             ` Roger Luethi
  2004-08-31 19:38               ` William Lee Irwin III
  1 sibling, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-31 15:34 UTC (permalink / raw)
  To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

This posting demonstrates a new method of monitoring all processes in
a large system.

You may remember what a /proc based tool does when monitoring some
10^4 processes -- it spends its time in the kernel hanging on to the
tasklist_lock read lock:

==> 10000 processes: top -d 0 -b > /dev/null <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name               symbol name
35855    36.0707  vmlinux                  get_tgid_list
9366      9.4223  vmlinux                  pid_alive
7077      7.1196  libc-2.3.3.so            _IO_vfscanf_internal
5386      5.4184  vmlinux                  number
3664      3.6860  vmlinux                  proc_pid_stat
3077      3.0955  libc-2.3.3.so            _IO_vfprintf_internal
2136      2.1489  vmlinux                  __d_lookup
1720      1.7303  vmlinux                  vsnprintf
1451      1.4597  libc-2.3.3.so            __i686.get_pc_thunk.bx
1409      1.4175  libc-2.3.3.so            _IO_default_xsputn_internal
1258      1.2656  libc-2.3.3.so            _IO_putc_internal
1225      1.2324  vmlinux                  link_path_walk
1210      1.2173  libc-2.3.3.so            ____strtoul_l_internal
1199      1.2062  vmlinux                  task_statm
1157      1.1640  libc-2.3.3.so            ____strtol_l_internal
794       0.7988  libc-2.3.3.so            _IO_sputbackc_internal
776       0.7807  libncurses.so.5.4        _nc_outch

Here's a profile for an nproc based tool monitoring the same set
of processes:

==> 10000 processes: nprocbench <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name                 symbol name
8641     24.8626  vmlinux                  __task_mem
2778      7.9931  vmlinux                  find_pid
2536      7.2968  vmlinux                  finish_task_switch
1872      5.3863  vmlinux                  netlink_recvmsg
1637      4.7101  vmlinux                  nproc_pid_fields
1373      3.9505  vmlinux                  __wake_up
1218      3.5045  vmlinux                  __copy_to_user_ll
1134      3.2628  vmlinux                  __task_mem_cheap
944       2.7162  vmlinux                  mmgrab
876       2.5205  vmlinux                  nproc_ps_do_pid
568       1.6343  vmlinux                  skb_dequeue
526       1.5135  libc-2.3.3.so            __recv
514       1.4789  vmlinux                  alloc_skb
510       1.4674  vmlinux                  __might_sleep
485       1.3955  vmlinux                  skb_release_data
463       1.3322  vmlinux                  netlink_attachskb
363       1.0445  vmlinux                  sys_recvfrom

Resource usage is now dominated by field computation, rather than by
delivery overhead. By now it should be clear that nproc is not only a
cleaner interface with lower overhead for tools, it also scales a lot
better than /proc.

Roger

* Re: [BENCHMARK] nproc: Look Ma, No get_tgid_list!
  2004-08-31 15:34             ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
@ 2004-08-31 19:38               ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-31 19:38 UTC (permalink / raw)
  To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Tue, Aug 31, 2004 at 05:34:32PM +0200, Roger Luethi wrote:
> This posting demonstrates a new method of monitoring all processes in
> a large system.
> You may remember what a /proc based tool does when monitoring some
> 10^4 processes -- it spends its time in the kernel hanging on to the
> tasklist_lock read lock:
> ==> 10000 processes: top -d 0 -b > /dev/null <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               symbol name
> 35855    36.0707  vmlinux                  get_tgid_list
> 9366      9.4223  vmlinux                  pid_alive
> 7077      7.1196  libc-2.3.3.so            _IO_vfscanf_internal
> 5386      5.4184  vmlinux                  number
> 3664      3.6860  vmlinux                  proc_pid_stat
[...]

The most crucial issue for larger systems is removing the rather easily
triggerable rwlock starvation. Perhaps dipankar's /proc/ -only tasklist
RCU patch can resolve that.


On Tue, Aug 31, 2004 at 05:34:32PM +0200, Roger Luethi wrote:
> Here's a profile for an nproc based tool monitoring the same set
> of processes:
> ==> 10000 processes: nprocbench <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        app name                 symbol name
> 8641     24.8626  vmlinux                  __task_mem
> 2778      7.9931  vmlinux                  find_pid
> 2536      7.2968  vmlinux                  finish_task_switch
> 1872      5.3863  vmlinux                  netlink_recvmsg
> 1637      4.7101  vmlinux                  nproc_pid_fields
[...]
> Resource usage is now dominated by field computation, rather than by
> delivery overhead. By now it should be clear that nproc is not only a
> cleaner interface with lower overhead for tools, it also scales a lot
> better than /proc.

With this in hand we can probably ignore the /proc/ -related efficiency
issues in favor of any method preventing the rwlock starvation, e.g.
dipankar's /proc/ -only tasklist RCU patch.


-- wli

end of thread, other threads:[~2004-09-01  0:28 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
2004-08-27 13:39   ` Roger Luethi
2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi
2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
2004-08-27 15:26   ` Roger Luethi
2004-08-27 16:23 ` William Lee Irwin III
2004-08-27 16:37   ` Albert Cahalan
2004-08-27 16:41     ` William Lee Irwin III
2004-08-27 17:01   ` Roger Luethi
2004-08-27 17:08     ` William Lee Irwin III
2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
2004-08-28 19:56     ` William Lee Irwin III
2004-08-28 20:14       ` Roger Luethi
2004-08-29 16:05         ` William Lee Irwin III
2004-08-29 17:02           ` Roger Luethi
2004-08-29 17:20             ` William Lee Irwin III
2004-08-29 17:52               ` Roger Luethi
2004-08-29 18:16                 ` William Lee Irwin III
2004-08-29 19:00                   ` Roger Luethi
2004-08-29 20:17                     ` Albert Cahalan
2004-08-29 20:46                       ` William Lee Irwin III
2004-08-29 21:45                         ` Albert Cahalan
2004-08-29 22:11                           ` William Lee Irwin III
2004-08-29 21:41                       ` Roger Luethi
2004-08-29 23:31                         ` Albert Cahalan
2004-08-30  7:16                           ` Roger Luethi
2004-08-30 10:31                       ` Paulo Marques
2004-08-30 10:53                         ` William Lee Irwin III
2004-08-30 12:23                           ` Paulo Marques
2004-08-30 12:28                             ` William Lee Irwin III
2004-08-30 13:43                               ` Paulo Marques
2004-08-29 19:07               ` Paul Jackson
2004-08-29 19:17                 ` William Lee Irwin III
2004-08-29 19:49                   ` Roger Luethi
2004-08-29 20:25                     ` William Lee Irwin III
2004-08-31 10:16                       ` Roger Luethi
2004-08-31 15:34             ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
2004-08-31 19:38               ` William Lee Irwin III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox