* [0/2][ANNOUNCE] nproc: netlink access to /proc information
@ 2004-08-27 12:24 Roger Luethi
From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw)
To: linux-kernel
Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
Paul Jackson
[ Cc: contributors to recent, related thread ]
nproc is an attempt to address the current problems with /proc. In
short, it exposes the same information via netlink (implemented for a
small subset).
This patch is experimental. I'm posting it to get the discussion going.
Problems with /proc
===================
The information in /proc comes in a number of different formats, for
example:
- /proc/PID/stat works for parsers. However, because it is not
self-documenting, it can never shrink. It contains a growing number
of dead fields -- legacy tools expect them to be there. To make things
worse, there is no N/A value, which makes a field value of 0 ambiguous.
- /proc/pid/status is self-documenting. No N/A value is necessary --
fields can easily be added, removed, and reordered. Too easily, maybe.
Tool maintainers complain about parsing overhead and unstable file
formats.
- /proc/slabinfo is something of a hybrid and tries to avoid the
weaknesses of other formats.
So a key problem is that it's hard to make an interface that is both
easy for humans and parsers to read. The amount of human-readable
information in /proc has been growing and there's no way all these
files will be rewritten again to favor parsers.
Another problem with /proc is speed. If we put all information in a few
large files, the kernel needs to calculate many fields even if a tool
is only interested in one of them. OTOH, if the information is split
into many small files, VFS and related overhead increases if a tool
needs to read many files just for the information on one single process.
In summary, /proc suffers from diverging goals of its two groups of
users (human readers and parsers), and it doesn't scale well for tools
monitoring many fields or many processes.
Overview
========
This patch implements an alternative method of querying the kernel
with well-defined messages through netlink.
Each piece of information ("field") like MemFree or VmRSS is given a
32 bit ID:
bits
0-15 a unique ID
16-23 reserved
24-27 data type (u32, unsigned long, u64, string)
28-31 the scope (process, global)
Four operations exist to query the kernel:
NPROC_GET_LIST
--------------
This request has no payload. The kernel answers with a sequence of u32
values. The first one announces the number of fields known to the kernel,
the rest of the message lists all of them by IDs.
NPROC_GET_LIST allows a tool to check which fields are still available
and -- if the tool author is so inclined -- to discover new fields
dynamically.
NPROC_GET_LABEL
---------------
A label request contains a u32 value indicating the type of label
and one key for which a label is wanted. The kernel returns a string
containing the label. Label types are field (useful for dynamically
discovered fields) and ksym.
NPROC_GET_GLOBAL
----------------
A request for one or more fields with a global scope (e.g. MemFree,
nr_dirty) contains a u32 value announcing the number of requested
fields and a matching sequence of field IDs.
The kernel replies with one netlink message containing the requested
fields. A string field is led by a u32 value indicating the remaining
length of the field. I didn't want to offer any strings outside of
the label operation initially, but having to make an extra call for,
say, every process name seemed a bit excessive.
NPROC_SCOPE_PROCESS
-------------------
For fields with a process scope (e.g. VmSize, wchan), a request starts
as above. It adds an additional part, though: the selector. The only
selector implemented so far takes a list of u32 PID values.
At the moment, the kernel sends a separate netlink message for every
process.
Results
=======
- The new interface is self-documenting.
- There is no need to ever parse strings on either side of the
user/kernel space barrier.
- Fields that have become meaningless or are unmaintained are simply
removed. Tools can easily detect if fields (and which ones) are
missing. (Of course that does not imply that any field is fair game
to remove from the kernel.)
- Any number and combination of fields can be gathered with one single
message exchange (as long as they are in the same scope).
- The kernel only calculates fields as requested (where it makes sense,
see __task_mem for an example).
- The conflict between human-readable and machine-parsable files is
solved by providing an interface each.
- While parsing answers is vastly easier for tools, there is hardly any
additional complexity in the kernel (except for the process selector
which is optional as it goes beyond the functionality offered by
/proc).
- If we're lucky, we may even be able to save memory on small systems
that want to do away with /proc but need access to some of the
information it provides.
I haven't implemented any form of access control. One possibility is
to use some of the reserved bits in the ID field to indicate access
restrictions to both kernel and user space (e.g. everyone, process owner,
root) and add some LSM hook for those needing fine-grained control.
It would also be easy to add semantics that won't work in /proc (for
instance a simple mechanism for repetitive requests -- just add an
optional frequency or interval flag). Whether that is desirable or not
is a separate discussion, though.
There are obvious speed optimizations I haven't tried. I meant to
conduct some performance tests, but I'm not sure what a meaningful
benchmark on the /proc file side is. Suggestions?
Roger
^ permalink raw reply [flat|nested] 39+ messages in thread* [1/2][PATCH] nproc: netlink access to /proc information 2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi @ 2004-08-27 12:24 ` Roger Luethi 2004-08-27 13:39 ` Roger Luethi 2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi ` (2 subsequent siblings) 3 siblings, 1 reply; 39+ messages in thread From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw) To: linux-kernel Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh, Paul Jackson The current code duplicates some data gathering logic from elsewhere in the kernel. The code can be trivially shared if the exisiting users in proc split data gathering and string creation. The patch should apply against any current 2.6 kernel. include/linux/netlink.h | 1 include/linux/nproc.h | 93 ++++++ init/Kconfig | 7 kernel/Makefile | 1 kernel/nproc.c | 690 ++++++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 792 insertions(+) diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/include/linux/netlink.h linux-2.6.8-nproc/include/linux/netlink.h --- linux-2.6.8/include/linux/netlink.h 2004-08-27 10:08:20.000000000 +0200 +++ linux-2.6.8-nproc/include/linux/netlink.h 2004-08-27 10:20:07.000000000 +0200 @@ -15,6 +15,7 @@ #define NETLINK_ARPD 8 #define NETLINK_AUDIT 9 /* auditing */ #define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ +#define NETLINK_NPROC 12 /* /proc information */ #define NETLINK_IP6_FW 13 #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ #define NETLINK_TAPBASE 16 /* 16 to 31 are ethertap */ diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/include/linux/nproc.h linux-2.6.8-nproc/include/linux/nproc.h --- linux-2.6.8/include/linux/nproc.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.8-nproc/include/linux/nproc.h 2004-08-27 10:20:07.000000000 +0200 @@ -0,0 +1,93 @@ +#ifndef _LINUX_NPROC_H +#define _LINUX_NPROC_H + +#include 
<linux/config.h> + +#ifdef CONFIG_NPROC + +#define NPROC_BASE 0x10 +#define NPROC_GET_LIST (NPROC_BASE+0) +#define NPROC_GET_LABEL (NPROC_BASE+1) +#define NPROC_GET_GLOBAL (NPROC_BASE+2) +#define NPROC_GET_PS (NPROC_BASE+3) + +#define NPROC_SCOPE_MASK 0xF0000000 +#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */ +#define NPROC_SCOPE_PROCESS 0x20000000 +#define NPROC_SCOPE_LABEL 0x30000000 + +#define NPROC_TYPE_MASK 0x0F000000 +#define NPROC_TYPE_STRING 0x01000000 +#define NPROC_TYPE_U32 0x02000000 +#define NPROC_TYPE_UL 0x03000000 +#define NPROC_TYPE_U64 0x04000000 + +#define NPROC_SELECT_ALL 0x00000001 +#define NPROC_SELECT_PID 0x00000002 +#define NPROC_SELECT_UID 0x00000003 + +#define NPROC_LABEL_FIELD 0x00000001 +#define NPROC_LABEL_KSYM 0x00000002 + +struct nproc_field { + __u32 id; + const char *label; +}; + +#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS) +/* Amount of free memory (pages) */ +#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* Size of a page (bytes) */ +#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* There's no guarantee about anything with jiffies. Still useful for some. 
*/ +#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL) +/* Process: VM size (KiB) */ +#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: locked memory (KiB) */ +#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: Memory resident size (KiB) */ +#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_WCHAN (0x00000100 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS) +#define NPROC_WCHAN_NAME (0x00000101 | NPROC_TYPE_STRING) + +#ifdef __KERNEL__ +static struct nproc_field labels[] = { + { NPROC_PID, "PID" }, + { NPROC_NAME, "Name" }, + { NPROC_MEMFREE, "MemFree" }, + { NPROC_PAGESIZE, "PageSize" }, + { NPROC_JIFFIES, "Jiffies" }, + { NPROC_VMSIZE, "VmSize" }, + { NPROC_VMLOCK, "VmLock" }, + { NPROC_VMRSS, "VmRSS" }, + { NPROC_VMDATA, "VmData" }, + { NPROC_VMSTACK, "VmStack" }, + { NPROC_VMEXE, "VmExe" }, + { NPROC_VMLIB, "VmLib" }, + { NPROC_NR_DIRTY, "nr_dirty" }, + { NPROC_NR_WRITEBACK, "nr_writeback" }, + { NPROC_NR_UNSTABLE, "nr_unstable" }, + { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages" }, + { NPROC_NR_MAPPED, "nr_mapped" }, + { NPROC_NR_SLAB, "nr_slab" }, + { NPROC_WCHAN, "wchan" }, +#ifdef CONFIG_KALLSYMS + { 
NPROC_WCHAN_NAME, "wchan_symbol" }, +#endif +}; +#endif /* __KERNEL__ */ + +#endif /* CONFIG_NPROC */ + +#endif /* _LINUX_NPROC_H */ diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/kernel/Makefile linux-2.6.8-nproc/kernel/Makefile --- linux-2.6.8/kernel/Makefile 2004-08-27 10:08:20.000000000 +0200 +++ linux-2.6.8-nproc/kernel/Makefile 2004-08-27 10:20:07.000000000 +0200 @@ -15,6 +15,7 @@ obj-$(CONFIG_SMP) += cpu.o obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_KALLSYMS) += kallsyms.o +obj-$(CONFIG_NPROC) += nproc.o obj-$(CONFIG_PM) += power/ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_COMPAT) += compat.o diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/kernel/nproc.c linux-2.6.8-nproc/kernel/nproc.c --- linux-2.6.8/kernel/nproc.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.8-nproc/kernel/nproc.c 2004-08-27 10:20:07.000000000 +0200 @@ -0,0 +1,690 @@ +/* + * nproc.c + * + * netlink interface to /proc information. + * + */ + +#include <linux/skbuff.h> +#include <net/sock.h> +#include <linux/swap.h> /* nr_free_pages() */ +#include <linux/kallsyms.h> /* kallsyms_lookup() */ +#include <linux/nproc.h> + +//#define DEBUG + +/* There must be like 5 million dprintk definitions, so let's add some more */ +#ifdef DEBUG +#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args) +#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args) +#else +#define pdebug(x,args...) +#define pwarn(x,args...) +#endif + +#define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args) + +static struct sock *nproc_sock = NULL; + +struct task_mem { + u32 vmdata; + u32 vmstack; + u32 vmexe; + u32 vmlib; +}; + +struct task_mem_cheap { + u32 vmsize; + u32 vmlock; + u32 vmrss; +}; + +/* + * __task_mem/__task_mem_cheap basically duplicate the MMU version of + * task_mem, but they are split by cost and work on structs. 
+ */ + +void __task_mem(struct task_struct *tsk, struct task_mem *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + unsigned long data = 0, stack = 0, exec = 0, lib = 0; + struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long len = (vma->vm_end - vma->vm_start) >> 10; + if (!vma->vm_file) { + data += len; + if (vma->vm_flags & VM_GROWSDOWN) + stack += len; + continue; + } + if (vma->vm_flags & VM_WRITE) + continue; + if (vma->vm_flags & VM_EXEC) { + exec += len; + if (vma->vm_flags & VM_EXECUTABLE) + continue; + lib += len; + } + } + res->vmdata = data - stack; + res->vmstack = stack; + res->vmexe = exec - lib; + res->vmlib = lib; + up_read(&mm->mmap_sem); + + mmput(mm); + } +} + +void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + res->vmsize = mm->total_vm << (PAGE_SHIFT-10); + res->vmlock = mm->locked_vm << (PAGE_SHIFT-10); + res->vmrss = mm->rss << (PAGE_SHIFT-10); + mmput(mm); + } +} + +/* + * page_alloc.c already has an extra function broken out to fill a + * struct with information. Cool. Not sure whether pgpgin/pgpgout + * should be left as is or nailed down as kbytes. + */ +struct page_state *__vmstat(void) +{ + struct page_state *ps; + ps = kmalloc(sizeof(*ps), GFP_KERNEL); + if (!ps) + return ERR_PTR(-ENOMEM); + get_full_page_state(ps); + ps->pgpgin /= 2; /* sectors -> kbytes */ + ps->pgpgout /= 2; + return ps; +} + +/* + * Allocate and prefill an skb. The nlmsghdr provided to the function + * is a pointer to the respective struct in the request message. 
+ */ +struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len) +{ + __u32 seq = nlh->nlmsg_seq; + __u16 type = nlh->nlmsg_type; + __u32 pid = nlh->nlmsg_pid; + struct sk_buff *skb2 = 0; + + skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL); + if (!skb2) { + skb2 = ERR_PTR(-ENOMEM); + goto out; + } + + NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len)); + goto out; + +nlmsg_failure: /* Used by NLMSG_PUT */ + kfree_skb(skb2); + skb2 = NULL; +out: + return skb2; +} + +#define mstore(value, id, buf) \ +({ \ + u32 _type = id & NPROC_TYPE_MASK; \ + switch (_type) { \ + case NPROC_TYPE_U32: { \ + __u32 *p = (u32 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_UL: { \ + unsigned long *p = (unsigned long *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_U64: { \ + __u64 *p = (u64 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + default: \ + perror("Huh? Bad type!\n"); \ + } \ +}) + +/* + * Build and send a netlink msg for one PID. 
+ */ +int nproc_pid_fields(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk) +{ + int i; + int err; + struct task_mem tsk_mem; + struct task_mem_cheap tsk_mem_cheap; + u32 fcnt = fdata[0]; + u32 *fields = &fdata[1]; + struct sk_buff *skb2; + char *buf; + struct nlmsghdr *nlh2; + + tsk_mem.vmdata = (~0); + tsk_mem_cheap.vmsize = (~0); + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + nlh2 = (struct nlmsghdr *)skb2->data; + buf = NLMSG_DATA(nlh2); + + for (i = 0; i < fcnt; i++) { + switch (fields[i]) { + case NPROC_PID: + mstore(tsk->pid, NPROC_PID, buf); + break; + case NPROC_VMSIZE: + case NPROC_VMLOCK: + case NPROC_VMRSS: + if (tsk_mem_cheap.vmsize == (~0)) + __task_mem_cheap(tsk, &tsk_mem_cheap); + switch (fields[i]) { + case NPROC_VMSIZE: + mstore(tsk_mem_cheap.vmsize, NPROC_VMSIZE, buf); + break; + case NPROC_VMLOCK: + mstore(tsk_mem_cheap.vmlock, NPROC_VMLOCK, buf); + break; + case NPROC_VMRSS: + mstore(tsk_mem_cheap.vmrss, NPROC_VMRSS, buf); + break; + } + break; + case NPROC_VMDATA: + case NPROC_VMSTACK: + case NPROC_VMEXE: + case NPROC_VMLIB: + if (tsk_mem.vmdata == (~0)) + __task_mem(tsk, &tsk_mem); + switch (fields[i]) { + case NPROC_VMDATA: + mstore(tsk_mem.vmdata, NPROC_VMDATA, buf); + break; + case NPROC_VMSTACK: + mstore(tsk_mem.vmstack, NPROC_VMSTACK, buf); + break; + case NPROC_VMEXE: + mstore(tsk_mem.vmexe, NPROC_VMEXE, buf); + break; + case NPROC_VMLIB: + mstore(tsk_mem.vmlib, NPROC_VMLIB, buf); + break; + } + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + case NPROC_WCHAN: + mstore(get_wchan(tsk), NPROC_WCHAN, buf); + pdebug("pid %d wchan: %lu.\n", tsk->pid, + get_wchan(tsk)); + break; + case NPROC_NAME: + mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf); + strncpy(buf, tsk->comm, sizeof(tsk->comm)); + buf += sizeof(tsk->comm); + break; + default: + pwarn("Unknown field %#x.\n", fields[i]); + } + } + err = netlink_unicast(nproc_sock, skb2, 
nlh2->nlmsg_pid, MSG_DONTWAIT); + if (err > 0) + err = 0; +out: + return err; +} + +/* + * Iterate over a list of PIDs. + */ +int nproc_select_pid(struct nlmsghdr *nlh, u32 left, u32 *fdata, u32 len, u32 *sdata) +{ + int i; + int err = 0; + u32 tcnt; + u32 *pids; + + if (left < sizeof(tcnt)) + goto err_inval; + left -= sizeof(tcnt); + + tcnt = sdata[0]; + + if (left < (tcnt * sizeof(u32))) + goto err_inval; + left -= tcnt * sizeof(u32); + + pids = &sdata[1]; + + for (i = 0; i < tcnt; i++) { + task_t *tsk; + tsk = find_task_by_pid(pids[i]); + pdebug("task found for pid %d: %s.\n", pids[i], tsk->comm); + if (!tsk) { + err = -ESRCH; + goto out; + } + err = nproc_pid_fields(nlh, fdata, len, tsk); + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +static u32 __reply_size_special(u32 id) +{ + u32 len = 0; + + switch (id) { + case NPROC_NAME: + len = sizeof(u32) + + sizeof(((struct task_struct*)0)->comm); + break; + default: + pwarn("Unknown field size in %#x.\n", id); + } + return len; +} + +/* + * Calculates the size of a reply message payload. Alternatively, we could have + * the user space caller supply a number along with the request and bail + * out or realloc later if we find the allocation was too small. More + * responsibility in user space, but faster. 
+ */ +static u32 *__reply_size (u32 *data, u32 *left, u32 *len) +{ + u32 *fields; + u32 fcnt; + int i; + *len = 0; + + if (*left < sizeof(fcnt)) + goto err_inval; + *left -= sizeof(fcnt); + + fcnt = data[0]; + + if (*left < (fcnt * sizeof(u32))) + goto err_inval; + *left -= fcnt * sizeof(u32); + + fields = &data[1]; + + pdebug("for %d fields:\n", fcnt); + for (i = 0; i < fcnt; i++) { + u32 id = fields[i]; + u32 type = id & NPROC_TYPE_MASK; + pdebug(" %#8.8x.\n", fields[i]); + switch (type) { + case NPROC_TYPE_U32: + *len += sizeof(u32); + break; + case NPROC_TYPE_UL: + *len += sizeof(unsigned long); + break; + case NPROC_TYPE_U64: + *len += sizeof(u64); + break; + default: { /* Special cases */ + u32 slen; + slen = __reply_size_special(id); + if (slen) + *len += slen; + else + goto err_inval; + } + } + } + + return &fields[fcnt]; + +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Call the chosen process selector. Not much to choose from right now. + */ +static int nproc_get_ps(struct sk_buff *skb, struct nlmsghdr *nlh) +{ + int err; + u32 len; + u32 *data = NLMSG_DATA(nlh); + u32 *sdata; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + + sdata = __reply_size(data, &left, &len); + if (IS_ERR(sdata)) { + err = PTR_ERR(sdata); + goto out; + } + + switch (*sdata) { +#if 0 + case NPROC_SELECT_ALL: + err = nproc_select_all(nlh, data, len, sdata + 1); + break; +#endif + case NPROC_SELECT_PID: + err = nproc_select_pid(nlh, left, data, len, + sdata + 1); + break; +#if 0 + case NPROC_SELECT_UID: + err = nproc_select_uid(sdata + 1); + break; +#endif + default: + pwarn("Unknown selection method %#x.\n", *sdata); + goto err_inval; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +static int nproc_get_global(struct nlmsghdr *nlh) +{ + int err, i, len; + void *errp; + struct sk_buff *skb2; + char *buf; + u32 fcnt; + struct page_state *ps = NULL; + u32 *data = NLMSG_DATA(nlh); + u32 *fields; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + errp = __reply_size(data, 
&left, &len); + if (IS_ERR(errp)) { + err = PTR_ERR(errp); + goto out; + } + + fcnt = data[0]; + fields = &data[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + for (i = 0; i < fcnt; i++) { + u32 id = fields[i]; + switch (id) { + case NPROC_NR_DIRTY: + case NPROC_NR_WRITEBACK: + case NPROC_NR_UNSTABLE: + case NPROC_NR_PG_TABLE_PGS: + case NPROC_NR_MAPPED: + case NPROC_NR_SLAB: + if (!ps) + ps = __vmstat(); + switch (id) { + case NPROC_NR_DIRTY: + mstore(ps->nr_dirty, NPROC_NR_DIRTY, buf); + break; + case NPROC_NR_WRITEBACK: + mstore(ps->nr_writeback, NPROC_NR_WRITEBACK, buf); + break; + case NPROC_NR_UNSTABLE: + mstore(ps->nr_unstable, NPROC_NR_UNSTABLE, buf); + break; + case NPROC_NR_PG_TABLE_PGS: + mstore(ps->nr_page_table_pages, NPROC_NR_PG_TABLE_PGS, buf); + break; + case NPROC_NR_MAPPED: + mstore(ps->nr_mapped, NPROC_NR_MAPPED, buf); + break; + case NPROC_NR_SLAB: + mstore(ps->nr_slab, NPROC_NR_SLAB, buf); + break; + } + break; + case NPROC_MEMFREE: + mstore(nr_free_pages(), NPROC_MEMFREE, buf); + break; + case NPROC_PAGESIZE: + mstore(PAGE_SIZE, NPROC_PAGESIZE, buf); + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + default: + pwarn("Unknown field requested %#x.\n", + fields[i]); + goto err_inval; + } + } + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT); + if (err > 0) + err = 0; +out: + kfree(ps); + return err; + +err_inval: + kfree(ps); + return -EINVAL; +} + +static int nproc_get_label(struct nlmsghdr *nlh) +{ + int err; + struct sk_buff *skb2; + const char *label; + char *buf; + int len; + u32 ltype; + u32 *data = NLMSG_DATA(nlh); + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + if (left < sizeof(ltype)) + goto err_inval; + + ltype = data[0]; + left -= sizeof(ltype); + + if (ltype == NPROC_LABEL_FIELD) { + int i; + u32 id; + + if (left < sizeof(id)) + goto err_inval; + + id = 
data[1]; + + for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++) + ; /* Do nothing */ + + if (labels[i].id != id) { + pwarn("No matching label found for %#x.\n", id); + goto err_inval; + } + + label = labels[i].label; + + } + else if (ltype == NPROC_LABEL_KSYM) { + char *modname; + unsigned long wchan, size, offset; + char namebuf[128]; + if (left < sizeof(unsigned long)) + goto err_inval; + + wchan = (unsigned long)data[1]; + label = kallsyms_lookup(wchan, &size, &offset, &modname, + namebuf); + if (!label) { + pwarn("No ksym found for %#lx.\n", wchan); + goto err_inval; + } + } + else { + pwarn("Unknown label type %#x.\n", ltype); + goto err_inval; + } + + len = strlen(label) + 1; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + strncpy(buf, label, len); + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT); + if (err > 0) + err = 0; +out: + return err; + +err_inval: + return -EINVAL; +} + +static int nproc_get_list(struct nlmsghdr *nlh) +{ + int err, i, cnt, len; + struct sk_buff *skb2; + u32 *buf; + + cnt = ARRAY_SIZE(labels); + len = (cnt + 1) * sizeof(u32); + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + buf[0] = cnt; + for (i = 0; i < cnt; i++) + buf[i + 1] = labels[i].id; + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, MSG_DONTWAIT); + if (err > 0) + err = 0; +out: + return err; +} + +static __inline__ int nproc_process_msg(struct sk_buff *skb, + struct nlmsghdr *nlh) +{ + int err; + + if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) + return 0; + + nlh->nlmsg_pid = NETLINK_CB(skb).pid; + + switch (nlh->nlmsg_type) { + case NPROC_GET_LIST: + err = nproc_get_list(nlh); + break; + case NPROC_GET_LABEL: + err = nproc_get_label(nlh); + break; + case NPROC_GET_GLOBAL: + err = nproc_get_global(nlh); + break; + case 
NPROC_GET_PS: + err = nproc_get_ps(skb, nlh); + break; + default: + pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type); + err = -EINVAL; + } + return err; + +} + +static int nproc_receive_skb(struct sk_buff *skb) +{ + int err = 0; + struct nlmsghdr *nlh; + + if (skb->len < NLMSG_LENGTH(0)) + goto err_inval; + + nlh = (struct nlmsghdr *)skb->data; + if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){ + pwarn("Invalid packet.\n"); + goto err_inval; + } + + err = nproc_process_msg(skb, nlh); + if (err || nlh->nlmsg_flags & NLM_F_ACK) { + pdebug("err %d, type %#x, flags %#x, seq %#x.\n", err, + nlh->nlmsg_type, nlh->nlmsg_flags, + nlh->nlmsg_seq); + netlink_ack(skb, nlh, err); + } + + return err; + +err_inval: + return -EINVAL; +} + +static void nproc_receive(struct sock *sk, int len) +{ + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { + nproc_receive_skb(skb); + kfree_skb(skb); + } +} + +static int nproc_init(void) +{ + nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive); + + if (!nproc_sock) { + perror("No netlink socket for nproc.\n"); + return -ENODEV; + } + + return 0; +} + +module_init(nproc_init); diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.8/init/Kconfig linux-2.6.8-nproc/init/Kconfig --- linux-2.6.8/init/Kconfig 2004-08-27 13:33:21.680899010 +0200 +++ linux-2.6.8-nproc/init/Kconfig 2004-08-27 13:28:33.104788111 +0200 @@ -141,6 +141,13 @@ config SYSCTL building a kernel for install/rescue disks or your system is very limited in memory. +config NPROC + bool "Netlink interface to /proc information" + depends on PROC_FS && EXPERIMENTAL + default y + help + Nproc is a netlink interface to /proc information. + config AUDIT bool "Auditing support" default y if SECURITY_SELINUX ^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [1/2][PATCH] nproc: netlink access to /proc information
From: Roger Luethi @ 2004-08-27 13:39 UTC (permalink / raw)
To: linux-kernel, Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
	Paul Jackson

I failed to mention that the patch is missing some rather basic locking
(say, in nproc_select_pid). Yeah, it is _that_ experimental :-/.

I ignored the locking issue when mulling over the semantics of the new
interface and forgot it later. The patch below should be an improvement.

Roger

--- kernel/nproc.c.01	2004-08-27 15:38:36.686602557 +0200
+++ kernel/nproc.c	2004-08-27 15:38:36.686602557 +0200
@@ -278,18 +278,23 @@ int nproc_select_pid(struct nlmsghdr *nl
 
 	for (i = 0; i < tcnt; i++) {
 		task_t *tsk;
+		read_lock(&tasklist_lock);
 		tsk = find_task_by_pid(pids[i]);
+		if (tsk)
+			get_task_struct(tsk);
+		read_unlock(&tasklist_lock);
+		if (!tsk)
+			goto err_srch;
 		pdebug("task found for pid %d: %s.\n", pids[i], tsk->comm);
-		if (!tsk) {
-			err = -ESRCH;
-			goto out;
-		}
 		err = nproc_pid_fields(nlh, fdata, len, tsk);
+		put_task_struct(tsk);
 	}
 
-out:
 	return err;
 
+err_srch:
+	return -ESRCH;
+
 err_inval:
 	return -EINVAL;
 }
* [2/2][sample code] nproc: user space app 2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi 2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi @ 2004-08-27 12:24 ` Roger Luethi 2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris 2004-08-27 16:23 ` William Lee Irwin III 3 siblings, 0 replies; 39+ messages in thread From: Roger Luethi @ 2004-08-27 12:24 UTC (permalink / raw) To: linux-kernel Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh, Paul Jackson On a system running a kernel with nproc, the sample program (below) spits out the information it can gather from the kernel: Available fields, data types, and associated values. Obviously, real tools would have their own labels (and help texts) for well-known fields. Scope information and the NPROC_GET_LABEL operation allow them to provide additional information in a meaningful context. The sample program does have some extra knowledge beyond the bare interface: It knows that a time stamp (global scope) can be requested within process scope as well, and it knows how to request a symbol name from a wchan value. 
Sample output:
----------------------------------------------------------------------------
================ Available fields =======================
    ----id----  --------label-------  -scope-  ----type-----
# 0 0x22000001  PID                   process  __u32
# 1 0x21000002  Name                  process  string
# 2 0x12000004  MemFree               global   __u32
# 3 0x12000005  PageSize              global   __u32
# 4 0x14000006  Jiffies               global   __u64
# 5 0x22000010  VmSize                process  __u32
# 6 0x22000011  VmLock                process  __u32
# 7 0x22000012  VmRSS                 process  __u32
# 8 0x22000013  VmData                process  __u32
# 9 0x22000014  VmStack               process  __u32
#10 0x22000015  VmExe                 process  __u32
#11 0x22000016  VmLib                 process  __u32
#12 0x13000051  nr_dirty              global   unsigned long
#13 0x13000052  nr_writeback          global   unsigned long
#14 0x13000053  nr_unstable           global   unsigned long
#15 0x13000054  nr_page_table_pages   global   unsigned long
#16 0x13000055  nr_mapped             global   unsigned long
#17 0x13000056  nr_slab               global   unsigned long
#18 0x23000100  wchan                 process  unsigned long
#19 0x01000101  wchan_symbol          (    0)  string

================ Global fields ==========================
    ----id----  --------label-------  --value---
# 0 0x12000004  MemFree                    97926
# 1 0x12000005  PageSize                    4096
# 2 0x14000006  Jiffies               4298132669
# 3 0x13000051  nr_dirty                      10
# 4 0x13000052  nr_writeback                   0
# 5 0x13000053  nr_unstable                    0
# 6 0x13000054  nr_page_table_pages          405
# 7 0x13000055  nr_mapped                  36021
# 8 0x13000056  nr_slab                     5956

================ Process fields =========================
---------------- process PID 14318 ----------------------
    ----id----  --------label-------  --value---
# 0 0x14000006  Jiffies               4298132669
# 1 0x22000001  PID                        14318
# 2 0x21000002  Name                  tst
# 3 0x22000010  VmSize                      1456
# 4 0x22000011  VmLock                         0
# 5 0x22000012  VmRSS                        360
# 6 0x22000013  VmData                       272
# 7 0x22000014  VmStack                       12
# 8 0x22000015  VmExe                          8
# 9 0x22000016  VmLib                       1140
#10 0x23000100  wchan                          0
---------------- process PID     1 ----------------------
    ----id----  --------label-------  --value---
# 0 0x14000006  Jiffies               4298132669
# 1 0x22000001  PID                            1
# 2 0x21000002  Name                  init
# 3 0x22000010  VmSize                      1340
# 4 0x22000011  VmLock                         0
# 5 0x22000012  VmRSS                        468
# 6 0x22000013  VmData                       144
# 7 0x22000014  VmStack                        4
# 8 0x22000015  VmExe                         28
# 9 0x22000016  VmLib                       1140
#10 0x23000100  wchan                 0xc01924f9 (ksym: do_select)

1000 iterations for both processes:
	CPU time : 0.000000s
	Wall time: 0.008305s
============================================================================

Sample code below:
----------------------------------------------------------------------------
#include <asm/types.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>

/* Sample code to demonstrate nproc usage */

//#include "<linux/nproc.h>"
#define NPROC_BASE 0x10
#define NPROC_GET_LIST		(NPROC_BASE+0)
#define NPROC_GET_LABEL		(NPROC_BASE+1)
#define NPROC_GET_GLOBAL	(NPROC_BASE+2)
#define NPROC_GET_PS		(NPROC_BASE+3)

#define NETLINK_NPROC 12

#define NPROC_SCOPE_MASK	0xF0000000
#define NPROC_SCOPE_GLOBAL	0x10000000	/* Global w/o arguments */
#define NPROC_SCOPE_PROCESS	0x20000000
#define NPROC_SCOPE_LABEL	0x30000000

#define NPROC_TYPE_MASK		0x0F000000
#define NPROC_TYPE_STRING	0x01000000
#define NPROC_TYPE_U32		0x02000000
#define NPROC_TYPE_UL		0x03000000
#define NPROC_TYPE_U64		0x04000000

#define NPROC_SELECT_ALL	0x00000001
#define NPROC_SELECT_PID	0x00000002
#define NPROC_SELECT_UID	0x00000003

#define NPROC_LABEL_FIELD	0x00000001
#define NPROC_LABEL_KSYM	0x00000002

struct nproc_field {
	__u32 id;
	const char *label;
};

#define NPROC_JIFFIES	(0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
#define NPROC_WCHAN	(0x00000100 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)

//#define DEBUG
#ifdef DEBUG
#define pdebug(x,args...) printf("%s:%d " x, __func__ , __LINE__, ##args)
#else
#define pdebug(x,args...)
#endif
#define perror(x,args...) fprintf(stderr, "%s:%d " x, __func__ , __LINE__, ##args)

static __u32 seq_nr;
static pid_t pid;
static int nsk;		/* netlink socket */

struct proc_message {
	struct nlmsghdr nlh;
	__u32 data[256];
};

int open_netlink()
{
	if ((nsk = socket(PF_NETLINK, SOCK_RAW, NETLINK_NPROC)) == -1) {
		perror("Failed to open netlink proc socket.\n");
		exit(1);
	}
	return nsk;
}

void send_request(struct proc_message *req)
{
	int sent;

	req->nlh.nlmsg_flags = NLM_F_REQUEST;
	req->nlh.nlmsg_seq = seq_nr++;
	req->nlh.nlmsg_pid = pid;

	if ((sent = send(nsk, req, req->nlh.nlmsg_len, 0)) == -1) {
		perror("Failed to send netlink proc msg.\n");
		exit(1);
	}
	pdebug("sent %d bytes seq %#x type %#x \n", sent,
	       req->nlh.nlmsg_seq, req->nlh.nlmsg_type);
}

void *get_reply(__u32 type, struct proc_message *ans)
{
	int len;

	if ((len = recv(nsk, ans, sizeof(struct proc_message), 0)) == -1) {
		perror("Failed to read netlink proc msg.\n");
		exit(1);
	};
	if (!NLMSG_OK((&(*ans).nlh), len)) {
		perror("Bad netlink msg.\n");
		exit(1);
	}
	if (ans->nlh.nlmsg_type != type) {
		perror("read %d bytes seq %#x type %#x len %d\n", len,
		       ans->nlh.nlmsg_seq, ans->nlh.nlmsg_type,
		       ans->nlh.nlmsg_len);
		exit(1);
	} else
		pdebug("read %d bytes seq %#x type %#x len %d\n", len,
		       ans->nlh.nlmsg_seq, ans->nlh.nlmsg_type,
		       ans->nlh.nlmsg_len);
	return NLMSG_DATA(&ans->nlh);
}

void *get_global(__u32 num, struct proc_message *nlmsg)
{
	int len = num * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	nlmsg->nlh.nlmsg_type = NPROC_GET_GLOBAL;
	send_request(nlmsg);
	return get_reply(NPROC_GET_GLOBAL, nlmsg);
}

void get_ps(__u32 num, struct proc_message *nlmsg)
{
	int len = num * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	nlmsg->nlh.nlmsg_type = NPROC_GET_PS;
	send_request(nlmsg);
}

char *get_label(struct proc_message *nlmsg)
{
	nlmsg->nlh.nlmsg_type = NPROC_GET_LABEL;
	send_request(nlmsg);
	return get_reply(NPROC_GET_LABEL, nlmsg);
}

char *get_field_label(__u32 id, struct proc_message *nlmsg)
{
	__u32 *buf = NLMSG_DATA(&nlmsg->nlh);
	int len = 2 * sizeof(__u32);

	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	buf[0] = NPROC_LABEL_FIELD;
	buf[1] = id;
	return get_label(nlmsg);
}

char *get_ksym(unsigned long wchan, struct proc_message *nlmsg)
{
	__u32 *buf = NLMSG_DATA(&nlmsg->nlh);
	unsigned long *addr;
	int len = sizeof(__u32) + sizeof(unsigned long);

	*buf++ = NPROC_LABEL_KSYM;
	addr = (unsigned long *)buf;
	*addr = wchan;
	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(len);
	return get_label(nlmsg);
}

__u32 *get_list(struct proc_message *nlmsg)
{
	nlmsg->nlh.nlmsg_len = NLMSG_LENGTH(0);
	nlmsg->nlh.nlmsg_type = NPROC_GET_LIST;
	send_request(nlmsg);
	return get_reply(NPROC_GET_LIST, nlmsg);
}

void print_ps(char *res, int psc, struct nproc_field *ps_label)
{
	int i;
	struct proc_message nlmsg;

	printf("    ----id----  --------label-------  --value---\n");
	for (i = 0; i < psc; i++) {
		const char *label = ps_label[i].label;
		__u32 id = ps_label[i].id;
		__u32 type = id & NPROC_TYPE_MASK;

		printf("#%2d %#x  %-20s  ", i, id, label);
		switch (type) {
		case NPROC_TYPE_U32: {
			__u32 *p = (__u32 *)res;
			printf("%10u\n", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_UL: {
			unsigned long *p = (unsigned long *)res;
			if ((id == NPROC_WCHAN) && *p) {
				printf("%#8lx ", *p);
				printf("(ksym: %s)\n", get_ksym(*p, &nlmsg));
			} else
				printf("%10lu\n", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_U64: {
			__u64 *p = (__u64 *)res;
			printf("%10llu\n", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_STRING: {
			__u32 *len = (__u32 *)res;
			char *p = res + sizeof(__u32);
			printf("%s\n", p);
			res += *len + sizeof(__u32);
			break;
		}
		default:
			printf("(?)\t");
		}
	}
}

#define MAX_FIELDS 64

int main()
{
	struct proc_message flist;
	struct proc_message nlmsg;
	struct proc_message gl_msg;
	struct proc_message ps_msg;
	__u32 *fields;
	__u32 *gl = NLMSG_DATA(&gl_msg.nlh);
	__u32 *ps = NLMSG_DATA(&ps_msg.nlh);
	struct nproc_field gl_label[MAX_FIELDS];
	struct nproc_field ps_label[MAX_FIELDS];
	char *res;
	int i;
	int ac, glc = 0, psc = 0;	/* Count fields */
	int cpu_0;
	struct timeval tv0, tv1;
	float wall;
	struct timezone tz;

	pid = getpid();
	nsk = open_netlink();
	fields = get_list(&flist);
	ac = *fields++;

	*gl++ = 0;	/* Reserve space for field count */
	*ps++ = 0;
	*ps++ = NPROC_JIFFIES;	/* Special: both global and ps context */
	ps_label[psc].id = NPROC_JIFFIES;
	ps_label[psc++].label = strdup(get_field_label(NPROC_JIFFIES, &nlmsg));

	printf("================ Available fields =======================\n");
	printf("    ----id----  --------label-------  -scope-  ----type-----\n");
	for (i = 0; i < ac; i++) {
		char *label;
		__u32 scope, type;

		scope = fields[i] & NPROC_SCOPE_MASK;
		type = fields[i] & NPROC_TYPE_MASK;
		label = strdup(get_field_label(fields[i], &nlmsg));
		printf("#%2d %#8.8x  %-20s  ", i, fields[i], label);
		switch (scope) {
		case NPROC_SCOPE_GLOBAL:
			printf("global   ");
			*gl++ = fields[i];
			gl_label[glc].id = fields[i];
			gl_label[glc++].label = label;
			break;
		case NPROC_SCOPE_PROCESS:
			printf("process  ");
			*ps++ = fields[i];
			ps_label[psc].id = fields[i];
			ps_label[psc++].label = label;
			break;
		default:
			printf("(%#5x)  ", scope);
		}
		switch (type) {
		case NPROC_TYPE_U32:
			printf("__u32");
			break;
		case NPROC_TYPE_UL:
			printf("unsigned long");
			break;
		case NPROC_TYPE_U64:
			printf("__u64");
			break;
		case NPROC_TYPE_STRING:
			printf("string");
			break;
		default:
			printf("type: (%#8.8x)\t", type);
		}
		if ((glc == MAX_FIELDS) || (psc == MAX_FIELDS)) {
			perror("Array too small.\n");
			exit(1);
		}
		printf("\n");
	}

	gl = NLMSG_DATA(&gl_msg.nlh);
	*gl = glc;
	res = get_global(glc + 1, &gl_msg);

	printf("\n================ Global fields ==========================\n");
	printf("    ----id----  --------label-------  --value---\n");
	for (i = 0; i < glc; i++) {
		const char *label = gl_label[i].label;
		__u32 id = gl_label[i].id;
		__u32 type = id & NPROC_TYPE_MASK;

		printf("#%2d %#8.8x  %-20s  ", i, id, label);
		switch (type) {
		case NPROC_TYPE_U32: {
			__u32 *p = (__u32 *)res;
			printf("%10u", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_UL: {
			unsigned long *p = (unsigned long *)res;
			printf("%10lu", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_U64: {
			__u64 *p = (__u64 *)res;
			printf("%10llu", *p);
			res = (char *)++p;
			break;
		}
		case NPROC_TYPE_STRING: {
			__u32 *len = (__u32 *)res;
			char *p = res + sizeof(__u32);
			printf("%s", p);
			res += *len + sizeof(__u32);
			break;
		}
		default:
			printf("(?)");
		}
		printf("\n");
	}

	printf("\n================ Process fields =========================\n");
	*ps++ = NPROC_SELECT_PID;
	*ps++ = 2;	// Number of PIDs to follow
	*ps++ = pid;
	*ps++ = 1;
	ps = NLMSG_DATA(&ps_msg.nlh);
	*ps = psc;
	get_ps(psc + 1 + 4, &ps_msg);

	res = get_reply(NPROC_GET_PS, &nlmsg);
	printf("---------------- process PID %5d ----------------------\n", pid);
	print_ps(res, psc, ps_label);

	res = get_reply(NPROC_GET_PS, &nlmsg);
	printf("---------------- process PID %5d ----------------------\n", 1);
	print_ps(res, psc, ps_label);

	gettimeofday(&tv0, &tz);
	cpu_0 = clock();
#define RUNS 1000
	for (i = 0; i < RUNS; i++) {
		get_ps(psc + 1 + 4, &ps_msg);
		get_reply(NPROC_GET_PS, &nlmsg);
		get_reply(NPROC_GET_PS, &nlmsg);
	}
	printf("\n%d iterations for both processes:\n", RUNS);
	printf("\tCPU time : %fs\n", (float)(clock() - cpu_0) / CLOCKS_PER_SEC);
	gettimeofday(&tv1, &tz);
	wall = (float) tv1.tv_sec - tv0.tv_sec +
	       (tv1.tv_usec - tv0.tv_usec) / 1.0e6;
	printf("\tWall time: %fs\n", wall);

	return 0;
}
----------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
  2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
  2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi
@ 2004-08-27 14:50 ` James Morris
  2004-08-27 15:26   ` Roger Luethi
  2004-08-27 16:23 ` William Lee Irwin III
  3 siblings, 1 reply; 39+ messages in thread
From: James Morris @ 2004-08-27 14:50 UTC (permalink / raw)
To: Roger Luethi
Cc: linux-kernel, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson, Chris Wright, Stephen Smalley

On Fri, 27 Aug 2004, Roger Luethi wrote:

> At the moment, the kernel sends a separate netlink message for every
> process.

You should look at the way rtnetlink dumps large amounts of data to
userspace.

> I haven't implemented any form of access control. One possibility is
> to use some of the reserved bits in the ID field to indicate access
> restrictions to both kernel and user space (e.g. everyone, process owner,
> root)

So, user tools would all need to be privileged? That sounds problematic.

> and add some LSM hook for those needing fine-grained control.

Control over the user request, or what the kernel returns? If the
latter, LSM is not really a filtering API.

- James
--
James Morris <jmorris@redhat.com>

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
@ 2004-08-27 15:26   ` Roger Luethi
  0 siblings, 0 replies; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 15:26 UTC (permalink / raw)
To: James Morris
Cc: linux-kernel, Albert Cahalan, William Lee Irwin III,
	Martin J. Bligh, Paul Jackson, Chris Wright, Stephen Smalley

On Fri, 27 Aug 2004 10:50:23 -0400, James Morris wrote:
> On Fri, 27 Aug 2004, Roger Luethi wrote:
>
> > At the moment, the kernel sends a separate netlink message for every
> > process.
>
> You should look at the way rtnetlink dumps large amounts of data to
> userspace.

At this point, I am just using a working prototype to gauge the interest
in an improved interface. Other than that, I agree. This would be one of
the "speed optimizations I haven't tried".

> > I haven't implemented any form of access control. One possibility is
> > to use some of the reserved bits in the ID field to indicate access
> > restrictions to both kernel and user space (e.g. everyone, process owner,
> > root)
>
> So, user tools would all need to be privileged? That sounds problematic.

It just means that not all the pieces that would be required to make
this a merge candidate have been implemented. I focused on the basic
infrastructure that is needed for the basic protocol.

Adding some access control that is about as smart as file permissions
in /proc is fairly easy (we have the caller pid and netlink_skb_parms
as a starting point). We only have read permissions to care about. It's
trivial to flag each field as "world readable", "owner only" (for fields
with process scope), and "root only". That covers pretty much what /proc
permissions achieve.

While I am confident that this will work, others may have better ideas
for access control.

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
  ` (2 preceding siblings ...)
  2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
@ 2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
  ` (2 more replies)
  3 siblings, 3 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 16:23 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> Problems with /proc
> ===================
> The information in /proc comes in a number of different formats, for
> example:
> - /proc/PID/stat works for parsers. However, because it is not
>   self-documenting, it can never shrink. It contains a growing number
>   of dead fields -- legacy tools expect them to be there. To make things
>   worse, there is no N/A value, which makes a field value 0 ambiguous.
> - /proc/pid/status is self-documenting. No N/A value is necessary --
>   fields can easily be added, removed, and reordered. Too easily, maybe.
>   Tool maintainers complain about parsing overhead and unstable file
>   formats.
> - /proc/slabinfo is something of a hybrid and tries to avoid the
>   weaknesses of other formats.
> So a key problem is that it's hard to make an interface that is both
> easy for humans and parsers to read. The amount of human-readable
> information in /proc has been growing and there's no way all these
> files will be rewritten again to favor parsers.

These are many of the same issues raised in rusty's "current /proc/ of
shit" thread from a while back.

On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> Another problem with /proc is speed. If we put all information in a few
> large files, the kernel needs to calculate many fields even if a tool
> is only interested in one of them. OTOH, if the information is split
> into many small files, VFS and related overhead increases if a tool
> needs to read many files just for the information on one single process.
> In summary, /proc suffers from diverging goals of its two groups of
> users (human readers and parsers), and it doesn't scale well for tools
> monitoring many fields or many processes.

There are more maintainability benefits from the interface improvement
than speed benefits. How many processes did you microbenchmark with?
I see no evidence that this will be a speedup with large numbers of
processes, as the problematic algorithms are preserved wholesale.

-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
@ 2004-08-27 16:37   ` Albert Cahalan
  2004-08-27 16:41     ` William Lee Irwin III
  2004-08-27 17:01   ` Roger Luethi
  2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
  2 siblings, 1 reply; 39+ messages in thread
From: Albert Cahalan @ 2004-08-27 16:37 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Fri, 2004-08-27 at 12:23, William Lee Irwin III wrote:
> I see no evidence that this will be a speedup with large numbers of
> processes, as the problematic algorithms are preserved wholesale.

Well, as far as THAT goes, I thought your tree-based
lookup was nice. I assume you still have the code.
What we got instead was a sort of cached directory
offset computation, which looks great... until you
hit the bad case. I suggest that the people trying to
reduce latency should try "top -d 0 -b >> /dev/null"
while running something like the SDET benchmark.

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:37 ` Albert Cahalan
@ 2004-08-27 16:41   ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 16:41 UTC (permalink / raw)
To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Fri, 2004-08-27 at 12:23, William Lee Irwin III wrote:
>> I see no evidence that this will be a speedup with large numbers of
>> processes, as the problematic algorithms are preserved wholesale.

On Fri, Aug 27, 2004 at 12:37:40PM -0400, Albert Cahalan wrote:
> Well, as far as THAT goes, I thought your tree-based
> lookup was nice. I assume you still have the code.
> What we got instead was a sort of cached directory
> offset computation, which looks great... until you
> hit the bad case. I suggest that the people trying to
> reduce latency should try "top -d 0 -b >> /dev/null"
> while running something like the SDET benchmark.

I can resurrect that easily enough.

-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
@ 2004-08-27 17:01   ` Roger Luethi
  2004-08-27 17:08     ` William Lee Irwin III
  2004-08-28 19:45   ` [BENCHMARK] " Roger Luethi
  2 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-27 17:01 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
> These are many of the same issues raised in rusty's "current /proc/ of
> shit" thread from a while back.

The problems are not new. The driver stuff has been outsourced to sysfs
in the meantime, though, and the information that is being added to
/proc these days is usually human-readable and a pain to parse.

> On Fri, Aug 27, 2004 at 02:24:12PM +0200, Roger Luethi wrote:
> > Another problem with /proc is speed. If we put all information in a few
> > large files, the kernel needs to calculate many fields even if a tool
> > is only interested in one of them. OTOH, if the information is split
> > into many small files, VFS and related overhead increases if a tool
> > needs to read many files just for the information on one single process.
> > In summary, /proc suffers from diverging goals of its two groups of
> > users (human readers and parsers), and it doesn't scale well for tools
> > monitoring many fields or many processes.
>
> There are more maintainability benefits from the interface improvement
> than speed benefits.

Agreed. That has been my initial motivation. Speed is a bonus.

> How many processes did you microbenchmark with?

Nothing worth mentioning. I have nothing in /proc space to compare to.
I was hoping someone would suggest a /proc based benchmark.

> I see no evidence that this will be a speedup with large numbers of
> processes, as the problematic algorithms are preserved wholesale.

It doesn't fundamentally change the complexity, but I expect the
reduction in overhead to be noticeable, mostly due to:

- no more string parsing.
- fewer system calls.
- fewer cycles wasted on calculating unnecessary data fields.

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [0/2][ANNOUNCE] nproc: netlink access to /proc information
  2004-08-27 17:01 ` Roger Luethi
@ 2004-08-27 17:08   ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-27 17:08 UTC (permalink / raw)
To: linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
>> I see no evidence that this will be a speedup with large numbers of
>> processes, as the problematic algorithms are preserved wholesale.

On Fri, Aug 27, 2004 at 07:01:43PM +0200, Roger Luethi wrote:
> It doesn't fundamentally change the complexity, but I expect the
> reduction in overhead to be noticeable, mostly due to:
> - no more string parsing.
> - fewer system calls.
> - fewer cycles wasted on calculating unnecessary data fields.

After some closer review it appears recent algorithmic improvements are
largely orthogonal to your interface change; the new interface may just
call the improved algorithms.

-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread
* [BENCHMARK] nproc: netlink access to /proc information
  2004-08-27 16:23 ` William Lee Irwin III
  2004-08-27 16:37   ` Albert Cahalan
  2004-08-27 17:01   ` Roger Luethi
@ 2004-08-28 19:45   ` Roger Luethi
  2004-08-28 19:56     ` William Lee Irwin III
  2 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-28 19:45 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Fri, 27 Aug 2004 09:23:08 -0700, William Lee Irwin III wrote:
> than speed benefits. How many processes did you microbenchmark with?

Executive summary: I wrote a benchmark to compare /proc and nproc
performance. The results are as expected: Parsing even the most simple
strings is expensive. /proc performance does not scale if we have to
open and close many files, which is the common case.

In a situation with many processes p and fields/files f the delivery
overhead is roughly O(p) for nproc and O(p*f) for /proc.

The difference becomes even more pronounced if a /proc file request
triggers an expensive in-kernel computation for fields that are not
of interest but part of the file, or if human-readable files need to
be parsed.

Benchmark: I chose the most favorable scenario for /proc I could think
of: Reading a single, easy to parse file per process and finding every
data item useful. I picked /proc/pid/statm. For nproc, I chose seven
fields that are calculated with the same resource usage as the fields
in statm: NPROC_VMSIZE, NPROC_VMLOCK, NPROC_VMRSS, NPROC_VMDATA,
NPROC_VMSTACK, NPROC_VMEXE, and NPROC_VMLIB.

Numbers:

* The first run is basically lseek+read:

	/proc/pid/statm for 1000 processes, 1000 times, lseek
	CPU time : 7.080000s
	Wall time: 7.636732s

* The second run adds a simple sscanf call to dump seven values into
  seven variables:

	/proc/pid/statm for 1000 processes, 1000 times, lseek (scanf)
	CPU time : 10.230000s
	Wall time: 10.958432s

* If we watch p processes with f files each, we typically hit the file
  descriptor limit before p * f == 1024. From then on, lseek is useless,
  we have to resort to opening and closing files:

	/proc/pid/statm for 1000 processes, 1000 times, open
	CPU time : 14.920000s
	Wall time: 16.087339s

* Again, parsing the string comes at a cost:

	/proc/pid/statm for 1000 processes, 1000 times, open (scanf)
	CPU time : 18.110000s
	Wall time: 19.457451s

* What happens if we need to read 2 simple /proc files (14 fields) per
  process?

	/proc/pid/statm (2x) for 1000 processes, 1000 times, open (scanf)
	CPU time : 30.250000s
	Wall time: 32.650314s

* 10000 processes at 3 files each (27 fields):

	/proc/pid/statm (3x) for 10000 processes, 1000 times, open (scanf)
	CPU time : 450.630000s
	Wall time: 500.265503s

* nproc delivering said 7 fields:

	nproc for 1000 processes, 1000 times, one process per request
	CPU time : 7.910000s
	Wall time: 8.473371s

* 200 processes per request, but still 1000 reply messages. If we
  stuffed a bunch of them into every message, performance would improve
  further.

	nproc for 1000 processes, 1000 times, 200 processes per request
	CPU time : 6.350000s
	Wall time: 6.817391s

* There's no large penalty if we need additional fields:

	14 nproc fields for 1000 processes, 1000 times, one process per request
	CPU time : 8.680000s
	Wall time: 9.328828s

	27 nproc fields for 10000 processes, 1000 times, one process per request
	CPU time : 88.270000s
	Wall time: 98.664330s

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 19:45 ` [BENCHMARK] " Roger Luethi
@ 2004-08-28 19:56   ` William Lee Irwin III
  2004-08-28 20:14     ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-28 19:56 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sat, Aug 28, 2004 at 09:45:46PM +0200, Roger Luethi wrote:
> Executive summary: I wrote a benchmark to compare /proc and nproc
> performance. The results are as expected: Parsing even the most simple
> strings is expensive. /proc performance does not scale if we have to
> open and close many files, which is the common case.
> In a situation with many processes p and fields/files f the delivery
> overhead is roughly O(p) for nproc and O(p*f) for /proc.
> The difference becomes even more pronounced if a /proc file request
> triggers an expensive in-kernel computation for fields that are not
> of interest but part of the file, or if human-readable files need to
> be parsed.
> Benchmark: I chose the most favorable scenario for /proc I could think
> of: Reading a single, easy to parse file per process and find every data
> item useful. I picked /proc/pid/statm. For nproc, I chose seven fields
> that are calculated with the same resource usage as the fields in statm:
> NPROC_VMSIZE, NPROC_VMLOCK, NPROC_VMRSS, NPROC_VMDATA, NPROC_VMSTACK,
> NPROC_VMEXE, and NPROC_VMLIB.

These numbers are somewhat at variance with my experience in the area,
as I see that the internal algorithms actually dominate the runtime
of the /proc/ algorithms. Could you describe the processes used for the
benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?

-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 19:56 ` William Lee Irwin III
@ 2004-08-28 20:14   ` Roger Luethi
  2004-08-29 16:05     ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-28 20:14 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sat, 28 Aug 2004 12:56:47 -0700, William Lee Irwin III wrote:
> These numbers are somewhat at variance with my experience in the area,
> as I see that the internal algorithms actually dominate the runtime
> of the /proc/ algorithms. Could you describe the processes used for the
> benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?

The status/maps numbers below are not only typical, but identical for
all tasks. I'm forking off a defined number of children and then
querying their status from the parent.

Because I was interested in delivery overhead, I built on purpose a
benchmark without computationally expensive fields. Expensive field
computation hurts /proc more than nproc because the latter allows you
to have only the currently needed fields computed.

Roger

Name:	nprocbench
State:	T (stopped)
SleepAVG:	0%
Tgid:	6400
Pid:	6400
PPid:	2120
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	100	100	100	100
FDSize:	32
Groups:	4 10 11 18 19 20 27 100 250
VmSize:	    1336 kB
VmLck:	       0 kB
VmRSS:	     304 kB
VmData:	     144 kB
VmStk:	      16 kB
VmExe:	      12 kB
VmLib:	    1140 kB
Threads:	1
SigPnd:	0000000000000000
ShdPnd:	0000000000080000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	0000000000000000
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000

08048000-0804b000 r-xp 00000000 03:45 160990   /home/rl/nproc/nprocbench
0804b000-0804c000 rw-p 00002000 03:45 160990   /home/rl/nproc/nprocbench
0804c000-0806d000 rw-p 0804c000 00:00 0
40000000-40013000 r-xp 00000000 03:42 11356336 /lib/ld-2.3.3.so
40013000-40014000 rw-p 00012000 03:42 11356336 /lib/ld-2.3.3.so
40014000-40015000 rw-p 40014000 00:00 0
40032000-4013c000 r-xp 00000000 03:42 11356337 /lib/libc-2.3.3.so
4013c000-40140000 rw-p 00109000 03:42 11356337 /lib/libc-2.3.3.so
40140000-40142000 rw-p 40140000 00:00 0
bfffc000-c0000000 rw-p bfffc000 00:00 0
ffffe000-fffff000 ---p 00000000 00:00 0

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-28 20:14 ` Roger Luethi
@ 2004-08-29 16:05   ` William Lee Irwin III
  2004-08-29 17:02     ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 16:05 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sat, 28 Aug 2004 12:56:47 -0700, William Lee Irwin III wrote:
>> These numbers are somewhat at variance with my experience in the area,
>> as I see that the internal algorithms actually dominate the runtime
>> of the /proc/ algorithms. Could you describe the processes used for the
>> benchmarks, e.g. typical /proc/$PID/status and /proc/$PID/maps for them?

On Sat, Aug 28, 2004 at 10:14:35PM +0200, Roger Luethi wrote:
> The status/maps numbers below are not only typical, but identical for
> all tasks. I'm forking off a defined number of children and then query
> their status from the parent.
> Because I was interested in delivery overhead, I built on purpose a
> benchmark without computationally expensive fields. Expensive field
> computation hurts /proc more than nproc because the latter allows you
> to have only the currently needed fields computed.

Okay, these explain some of the difference. I usually see issues with
around 10000 processes with fully populated virtual address spaces and
several hundred vmas each, varying between 200 to 1000, mostly
concentrated at somewhere just above 300.

-- wli

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [BENCHMARK] nproc: netlink access to /proc information 2004-08-29 16:05 ` William Lee Irwin III @ 2004-08-29 17:02 ` Roger Luethi 2004-08-29 17:20 ` William Lee Irwin III 2004-08-31 15:34 ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi 0 siblings, 2 replies; 39+ messages in thread From: Roger Luethi @ 2004-08-29 17:02 UTC (permalink / raw) To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson On Sun, 29 Aug 2004 09:05:42 -0700, William Lee Irwin III wrote: > Okay, these explain some of the difference. I usually see issues with > around 10000 processes with fully populated virtual address spaces and > several hundred vmas each, varying between 200 to 1000, mostly > concentrated at somewhere just above 300. I agree, that should make quite a difference. As you said, we are working on orthogonal areas: My current focus is on data delivery (sane semantics and minimal overhead), while you seem to be more interested in better data gathering. I profiled "top -d 0 -b > /dev/null" for about 100 and 10^5 processes. When monitoring 100 (real-world) processes, /proc specific overhead (_IO_vfscanf_internal, number, __d_lookup, vsnprintf, etc.) amounts to about one third of total resource usage. 
==> 100 processes: top -d 0 -b > /dev/null <== CPU: CPU with timer interrupt, speed 0 MHz (estimated) Profiling through timer interrupt samples % image name symbol name 20439 12.2035 libc-2.3.3.so _IO_vfscanf_internal 15852 9.4647 vmlinux number 11635 6.9469 vmlinux task_statm 9286 5.5444 libc-2.3.3.so _IO_vfprintf_internal 9128 5.4500 vmlinux proc_pid_stat 5395 3.2212 vmlinux __d_lookup 4738 2.8289 vmlinux vsnprintf 4123 2.4617 libc-2.3.3.so _IO_default_xsputn_internal 4110 2.4540 libc-2.3.3.so __i686.get_pc_thunk.bx 3712 2.2163 libc-2.3.3.so _IO_putc_internal 3504 2.0921 vmlinux link_path_walk 3417 2.0402 libc-2.3.3.so ____strtoul_l_internal 3363 2.0079 libc-2.3.3.so ____strtol_l_internal 2250 1.3434 libncurses.so.5.4 _nc_outch 2116 1.2634 libc-2.3.3.so _IO_sputbackc_internal 2006 1.1977 top task_show 1851 1.1052 vmlinux pid_revalidate With 10^5 additional dummy processes, resource usage is dominated by attempts to get a current list of pids. My own benchmark walked a list of known pids, so that was not an issue. I bet though that nproc can provide more efficient means to get such a list than getdents (we could even allow a user to ask for a message on process creation/kill). So basically that's just another place where nproc-based tools would trounce /proc-based ones (that piece is vaporware today, though). 
==> 10000 processes: top -d 0 -b > /dev/null <== CPU: CPU with timer interrupt, speed 0 MHz (estimated) Profiling through timer interrupt samples % image name symbol name 35855 36.0707 vmlinux get_tgid_list 9366 9.4223 vmlinux pid_alive 7077 7.1196 libc-2.3.3.so _IO_vfscanf_internal 5386 5.4184 vmlinux number 3664 3.6860 vmlinux proc_pid_stat 3077 3.0955 libc-2.3.3.so _IO_vfprintf_internal 2136 2.1489 vmlinux __d_lookup 1720 1.7303 vmlinux vsnprintf 1451 1.4597 libc-2.3.3.so __i686.get_pc_thunk.bx 1409 1.4175 libc-2.3.3.so _IO_default_xsputn_internal 1258 1.2656 libc-2.3.3.so _IO_putc_internal 1225 1.2324 vmlinux link_path_walk 1210 1.2173 libc-2.3.3.so ____strtoul_l_internal 1199 1.2062 vmlinux task_statm 1157 1.1640 libc-2.3.3.so ____strtol_l_internal 794 0.7988 libc-2.3.3.so _IO_sputbackc_internal 776 0.7807 libncurses.so.5.4 _nc_outch The remaining profiles are for two benchmarks from my previous message. Field computation is more prominent than with top because the benchmark uses a known list of pids and parsing is kept at a trivial level. ==> /prod/pid/statm (2x) for 10000 processes <== CPU: CPU with timer interrupt, speed 0 MHz (estimated) Profiling through timer interrupt samples % image name symbol name 7430 9.9485 libc-2.3.3.so _IO_vfscanf_internal 6195 8.2948 vmlinux __d_lookup 5477 7.3335 vmlinux task_statm 5082 6.8046 vmlinux number 3227 4.3208 vmlinux link_path_walk 3050 4.0838 libc-2.3.3.so ____strtol_l_internal 2116 2.8332 libc-2.3.3.so _IO_vfprintf_internal 2064 2.7636 vmlinux vsnprintf 1664 2.2280 vmlinux atomic_dec_and_lock 1551 2.0767 vmlinux task_dumpable 1497 2.0044 vmlinux pid_revalidate 1419 1.9000 vmlinux system_call 1401 1.8759 vmlinux pid_alive 1244 1.6657 libc-2.3.3.so _IO_sputbackc_internal 1175 1.5733 vmlinux dnotify_parent 1060 1.4193 libc-2.3.3.so _IO_default_xsputn_internal 922 1.2345 vmlinux file_move nproc removes most of the delivery overhead so field computation is now dominant. 
Strictly speaking, it should be even higher because the benchmark
requests the same fields three times, but they only get computed once
in such a case.

==> 27 nproc fields for 10000 processes, one process per request <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name  symbol name
 7647    25.0894  vmlinux     __task_mem
 2125     6.9720  vmlinux     find_pid
 1884     6.1813  vmlinux     nproc_pid_fields
 1488     4.8820  vmlinux     __task_mem_cheap
 1161     3.8092  vmlinux     mmgrab
  978     3.2088  vmlinux     netlink_recvmsg
  944     3.0972  vmlinux     alloc_skb
  935     3.0677  vmlinux     __might_sleep
  751     2.4640  vmlinux     nproc_select_pid
  738     2.4213  vmlinux     system_call
  691     2.2671  vmlinux     skb_dequeue
  636     2.0867  vmlinux     netlink_sendmsg
  630     2.0670  vmlinux     __copy_from_user_ll
  624     2.0473  vmlinux     sockfd_lookup
  621     2.0375  vmlinux     kfree
  602     1.9751  vmlinux     __reply_size
  523     1.7159  vmlinux     fget

Roger

^ permalink raw reply	[flat|nested] 39+ messages in thread
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:02 ` Roger Luethi
@ 2004-08-29 17:20   ` William Lee Irwin III
  2004-08-29 17:52     ` Roger Luethi
  2004-08-29 19:07     ` Paul Jackson
  2004-08-31 15:34   ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
  1 sibling, 2 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 17:20 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 09:05:42 -0700, William Lee Irwin III wrote:
>> Okay, these explain some of the difference. I usually see issues with
>> around 10000 processes with fully populated virtual address spaces and
>> several hundred vmas each, varying between 200 to 1000, mostly
>> concentrated at somewhere just above 300.

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> I agree, that should make quite a difference. As you said, we are
> working on orthogonal areas: My current focus is on data delivery (sane
> semantics and minimal overhead), while you seem to be more interested
> in better data gathering.

Yes, there doesn't seem to be any conflict between the code we're
working on. These benchmark results are very useful for quantifying the
relative importance of the overheads under more typical conditions.

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> I profiled "top -d 0 -b > /dev/null" for about 100 and 10^5 processes.
> When monitoring 100 (real-world) processes, /proc specific overhead
> (_IO_vfscanf_internal, number, __d_lookup, vsnprintf, etc.) amounts to
> about one third of total resource usage.
> ==> 100 processes: top -d 0 -b > /dev/null <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name     symbol name
> 20439    12.2035  libc-2.3.3.so  _IO_vfscanf_internal
> 15852     9.4647  vmlinux        number
> 11635     6.9469  vmlinux        task_statm
>  9286     5.5444  libc-2.3.3.so  _IO_vfprintf_internal
>  9128     5.4500  vmlinux        proc_pid_stat

Lexical analysis is cpu-intensive, probably due to the cache misses
taken while traversing the strings. This is likely inherent in string
processing interfaces.

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> With 10^5 additional dummy processes, resource usage is dominated by
> attempts to get a current list of pids. My own benchmark walked a list
> of known pids, so that was not an issue. I bet though that nproc can
> provide more efficient means to get such a list than getdents (we could
> even allow a user to ask for a message on process creation/kill).
> So basically that's just another place where nproc-based tools would
> trounce /proc-based ones (that piece is vaporware today, though).
>
> ==> 10000 processes: top -d 0 -b > /dev/null <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name     symbol name
> 35855    36.0707  vmlinux        get_tgid_list
>  9366     9.4223  vmlinux        pid_alive
>  7077     7.1196  libc-2.3.3.so  _IO_vfscanf_internal
>  5386     5.4184  vmlinux        number
>  3664     3.6860  vmlinux        proc_pid_stat

get_tgid_list() is a sad story I don't have time to go into in depth.
The short version is that larger systems are extremely sensitive to
hold time for writes on the tasklist_lock, and this being on scales
not needing SGI participation to tell us (though scales beyond personal
financial resources still).

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> The remaining profiles are for two benchmarks from my previous message.
> Field computation is more prominent than with top because the benchmark
> uses a known list of pids and parsing is kept at a trivial level.
>
> ==> /proc/pid/statm (2x) for 10000 processes <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name     symbol name
>  7430     9.9485  libc-2.3.3.so  _IO_vfscanf_internal
>  6195     8.2948  vmlinux        __d_lookup
>  5477     7.3335  vmlinux        task_statm
>  5082     6.8046  vmlinux        number
>  3227     4.3208  vmlinux        link_path_walk

scanf() is still very pronounced here; I wonder how well-optimized
glibc's implementation is, or if otherwise it may be useful to
circumvent it with a more specialized parser if its generality
requirements preclude faster execution.

On Sun, Aug 29, 2004 at 07:02:48PM +0200, Roger Luethi wrote:
> nproc removes most of the delivery overhead so field computation is
> now dominant. Strictly speaking, it should be even higher because the
> benchmark requests the same fields three times, but they only get
> computed once in such a case.
>
> ==> 27 nproc fields for 10000 processes, one process per request <==
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name  symbol name
>  7647    25.0894  vmlinux     __task_mem
>  2125     6.9720  vmlinux     find_pid
>  1884     6.1813  vmlinux     nproc_pid_fields
>  1488     4.8820  vmlinux     __task_mem_cheap
>  1161     3.8092  vmlinux     mmgrab

It looks like I'm going after the right culprit(s) for the lower-level
algorithms from this.

-- wli
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:20   ` William Lee Irwin III
@ 2004-08-29 17:52     ` Roger Luethi
  2004-08-29 18:16       ` William Lee Irwin III
  2004-08-29 19:07     ` Paul Jackson
  1 sibling, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 17:52 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
> > ==> 10000 processes: top -d 0 -b > /dev/null <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name     symbol name
> > 35855    36.0707  vmlinux        get_tgid_list
> >  9366     9.4223  vmlinux        pid_alive
> >  7077     7.1196  libc-2.3.3.so  _IO_vfscanf_internal
> >  5386     5.4184  vmlinux        number
> >  3664     3.6860  vmlinux        proc_pid_stat
>
> get_tgid_list() is a sad story I don't have time to go into in depth.
> The short version is that larger systems are extremely sensitive to
> hold time for writes on the tasklist_lock, and this being on scales
> not needing SGI participation to tell us (though scales beyond personal
> financial resources still).

I am confident that this problem (as far as process monitoring is
concerned) could be addressed with differential notification.

> > ==> /proc/pid/statm (2x) for 10000 processes <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name     symbol name
> >  7430     9.9485  libc-2.3.3.so  _IO_vfscanf_internal
> >  6195     8.2948  vmlinux        __d_lookup
> >  5477     7.3335  vmlinux        task_statm
> >  5082     6.8046  vmlinux        number
> >  3227     4.3208  vmlinux        link_path_walk
>
> scanf() is still very pronounced here; I wonder how well-optimized
> glibc's implementation is, or if otherwise it may be useful to
> circumvent it with a more specialized parser if its generality
> requirements preclude faster execution.

I'd much rather remove unnecessary overhead than optimize code for
overhead processing.
Note that number() takes out 7% and that's the _kernel_ printing
numbers for user space to parse back. And __d_lookup is another /proc
souvenir you get to keep as long as you use /proc.

> > ==> 27 nproc fields for 10000 processes, one process per request <==
> > CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> > Profiling through timer interrupt
> > samples  %        image name  symbol name
> >  7647    25.0894  vmlinux     __task_mem
> >  2125     6.9720  vmlinux     find_pid
> >  1884     6.1813  vmlinux     nproc_pid_fields
> >  1488     4.8820  vmlinux     __task_mem_cheap
> >  1161     3.8092  vmlinux     mmgrab
>
> It looks like I'm going after the right culprit(s) for the lower-level
> algorithms from this.

Well, __task_mem is prominent here because I don't call other
computation functions. vmstat ain't cheap, and wchan is horribly
expensive if the kernel does the ksym translation. Etc. pp.

Roger
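[ Editor's sketch: the round-trip overhead Roger describes -- number()
  in the kernel printing integers that userspace scanf() then parses
  back -- can be contrasted with a binary interface in a few lines.
  Python is used purely for illustration; the 64-bit fixed-width layout
  below is an assumption, not nproc's actual netlink wire format. ]

```python
import struct

fields = (1234, 567, 89, 10, 0, 321, 0)  # made-up statm-like values

def text_round_trip(values):
    """What /proc does: kernel's number() formats integers to ASCII,
    userspace's scanf() lexes the ASCII back into integers."""
    line = " ".join(str(v) for v in values)          # kernel side
    return tuple(int(tok) for tok in line.split())   # user side

def binary_round_trip(values):
    """What a netlink-style interface permits: fixed-width fields,
    no formatting or lexing at either end, just copies."""
    payload = struct.pack("7q", *values)   # kernel side: plain copy-out
    return struct.unpack("7q", payload)    # user side: plain copy-in

# Both deliver identical data; only the text path pays for
# number()/_IO_vfscanf_internal on every field of every process.
assert text_round_trip(fields) == fields
assert binary_round_trip(fields) == fields
```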
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 17:52     ` Roger Luethi
@ 2004-08-29 18:16       ` William Lee Irwin III
  2004-08-29 19:00         ` Roger Luethi
  0 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 18:16 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> get_tgid_list() is a sad story I don't have time to go into in depth.
>> The short version is that larger systems are extremely sensitive to
>> hold time for writes on the tasklist_lock, and this being on scales
>> not needing SGI participation to tell us (though scales beyond personal
>> financial resources still).

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> I am confident that this problem (as far as process monitoring is
> concerned) could be addressed with differential notification.

I'm a bit squeamish about that given that mmlist_lock and tasklist_lock
are both problematic and yet another global structure to fiddle with in
the process creation and destruction path threatens similar trouble.
Also, what guarantee is there that the notification events come
sufficiently slowly for a single task to process, particularly when that
task may not have a whole cpu's resources to marshal to the task?
Queueing them sounds less than ideal due to resource consumption, and
if notifications are dropped most of the efficiency gains are lost. So
I question that a bit.

I have a vague notion that userspace should intelligently schedule
inquiries so requests are made at a rate the app can process and so
that the app doesn't consume excessive amounts of cpu. In such an
arrangement screen refresh events don't trigger a full scan of the
tasklist, but rather only an incremental partial rescan of it, whose
work is limited for the above cpu bandwidth concerns.
On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> scanf() is still very pronounced here; I wonder how well-optimized
>> glibc's implementation is, or if otherwise it may be useful to
>> circumvent it with a more specialized parser if its generality
>> requirements preclude faster execution.

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> I'd much rather remove unnecessary overhead than optimize code for
> overhead processing. Note that number() takes out 7% and that's the
> _kernel_ printing numbers for user space to parse back. And __d_lookup
> is another /proc souvenir you get to keep as long as you use /proc.

I'm expecting very very long lifetimes for legacy kernel versions and
userspace predating the merge of nproc, so it's not entirely irrelevant,
though backports aren't exactly something I relish.

On Sun, 29 Aug 2004 10:20:22 -0700, William Lee Irwin III wrote:
>> It looks like I'm going after the right culprit(s) for the lower-level
>> algorithms from this.

On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> Well, __task_mem is prominent here because I don't call other
> computation functions. vmstat ain't cheap, and wchan is horribly
> expensive if the kernel does the ksym translation. Etc. pp.

task_mem() is generally prominent when the processes have large numbers
of vmas, and also due to acquisition of ->mmap_sem.

-- wli
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 18:16       ` William Lee Irwin III
@ 2004-08-29 19:00         ` Roger Luethi
  2004-08-29 20:17           ` Albert Cahalan
  0 siblings, 1 reply; 39+ messages in thread
From: Roger Luethi @ 2004-08-29 19:00 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > I am confident that this problem (as far as process monitoring is
> > concerned) could be addressed with differential notification.
>
> I'm a bit squeamish about that given that mmlist_lock and tasklist_lock
> are both problematic and yet another global structure to fiddle with in
> the process creation and destruction path threatens similar trouble.

The numbers look so bad that for many cases it's going to be a
significant win if we simply call nproc_send_note in said paths. But
I'll admit that I've been entertaining thoughts about a global queue or
something to send notifications in batches.

> Also, what guarantee is there that the notification events come
> sufficiently slowly for a single task to process, particularly when that
> task may not have a whole cpu's resources to marshal to the task?

A more likely guarantee is that a process that can't keep up with
differential updates won't be able to process the whole list, either.
Well, unless the system is loaded with tons of short-lived processes
that wouldn't even make the full process list by the time it's pulled.
But in such a case, a complete list of tasks won't do you much good,
either, because by the time you are ready to query the kernel for
details the tasks are gone.

> Queueing them sounds less than ideal due to resource consumption, and
> if notifications are dropped most of the efficiency gains are lost. So
> I question that a bit.

Point. Task discovery is not an exact science anyway, though.
I'd still expect differential notification to be useful in most
non-pathological cases, but I concede it's nowhere near as clear-cut
as nproc per se is.

> I have a vague notion that userspace should intelligently schedule
> inquiries so requests are made at a rate the app can process and so
> that the app doesn't consume excessive amounts of cpu. In such an
> arrangement screen refresh events don't trigger a full scan of the
> tasklist, but rather only an incremental partial rescan of it, whose
> work is limited for the above cpu bandwidth concerns.

While I'm not sure I understand how that partial rescan (or its limits)
would be defined, I agree with the general idea. There is indeed plenty
of room for improvement in a smart user space. For instance, most apps
show only the top n processes. So if an app shows the top 20 memory
users, it could use nproc to get a complete list of pid+vmrss, and then
request all the expensive fields only for the top 20 in that list.

> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > I'd much rather remove unnecessary overhead than optimize code for
> > overhead processing. Note that number() takes out 7% and that's the
> > _kernel_ printing numbers for user space to parse back. And __d_lookup
> > is another /proc souvenir you get to keep as long as you use /proc.
>
> I'm expecting very very long lifetimes for legacy kernel versions and
> userspace predating the merge of nproc, so it's not entirely irrelevant,
> though backports aren't exactly something I relish.

Uhm... Optimized string parsing would require updated user space
anyway. OTOH, I can buy the legacy kernel argument, so if you want to
rewrite the user space tools, go wild :-).
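[ Editor's sketch: the two-pass top-20 strategy described above, in
  Python for illustration. Everything here is hypothetical: nproc has
  no userspace API yet, so cheap_fields() and expensive_fields() are
  stand-ins fed from a made-up pid table. ]

```python
# pid -> vm_rss in KiB; made-up data standing in for a cheap bulk query.
SAMPLE = {101: 52000, 102: 4100, 103: 980000, 104: 12, 105: 77000}

def cheap_fields(pids):
    """First pass: one inexpensive field (vm_rss) for every pid."""
    return {pid: SAMPLE[pid] for pid in pids}

def expensive_fields(pids):
    """Second pass: costly fields (wchan etc.) for a few pids only.
    The wchan value is a placeholder, not real kernel output."""
    return {pid: {"wchan": "poll_schedule"} for pid in pids}

def top_n_by_rss(pids, n):
    """Rank everything by the cheap field, then pay the expensive
    query cost only for the n processes actually displayed."""
    rss = cheap_fields(pids)
    top = sorted(rss, key=rss.get, reverse=True)[:n]
    return {pid: dict(vm_rss=rss[pid], **expensive_fields([pid])[pid])
            for pid in top}

assert list(top_n_by_rss(SAMPLE, 2)) == [103, 105]
```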
You may find that there are issues more serious than string parsing:

$ ps --version
procps version 3.2.3
$ ps -o pid
  PID
 2089
 2139
$ strace ps -o pid 2>&1 | grep 'open("/proc/' | wc -l
325

<whine>

> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
> > Well, __task_mem is prominent here because I don't call other
> > computation functions. vmstat ain't cheap, and wchan is horribly
> > expensive if the kernel does the ksym translation. Etc. pp.
>
> task_mem() is generally prominent when the processes have large numbers
> of vmas, and also due to acquisition of ->mmap_sem.

Makes sense. I just wanted to make sure I wasn't misleading you.

Roger
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 19:00         ` Roger Luethi
@ 2004-08-29 20:17           ` Albert Cahalan
  2004-08-29 20:46             ` William Lee Irwin III
  ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Albert Cahalan @ 2004-08-29 20:17 UTC (permalink / raw)
To: Roger Luethi
Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

Roger Luethi writes:
> On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>> On Sun, Aug 29, 2004 at 07:52:45PM +0200, Roger Luethi wrote:
>>> I am confident that this problem (as far as process
>>> monitoring is concerned) could be addressed with
>>> differential notification.
...
>> Also, what guarantee is there that the notification
>> events come sufficiently slowly for a single task to
>> process, particularly when that task may not have a whole
>> cpu's resources to marshal to the task?
>
> A more likely guarantee is that a process that can't
> keep up with differential updates won't be able to
> process the whole list, either.

When the reader falls behind, keep supplying differential
updates as long as practical. When this starts to eat up
lots of memory, switch to supplying the full list until
the reader catches up again.

>> I have a vague notion that userspace should intelligently schedule
>> inquiries so requests are made at a rate the app can process and so
>> that the app doesn't consume excessive amounts of cpu. In such an
>> arrangement screen refresh events don't trigger a full scan of the
>> tasklist, but rather only an incremental partial rescan of it, whose
>> work is limited for the above cpu bandwidth concerns.

If you won't scan, why update the display? This boils down
to simply setting a lower refresh rate or using "nice".

> While I'm not sure I understand how that partial rescan (or its limits)
> would be defined, I agree with the general idea. There is indeed plenty
> of room for improvement in a smart user space.
> For instance, most apps show only the top n processes. So if an app
> shows the top 20 memory users, it could use nproc to get a complete
> list of pid+vmrss, and then request all the expensive fields only
> for the top 20 in that list.

This is crummy. It's done for wchan, since that is so horribly
expensive, but I'm not liking the larger race condition window.
Remember that PIDs get reused. There isn't a generation counter
or UUID that can be checked.

> Uhm... Optimized string parsing would require updated user space
> anyway. OTOH, I can buy the legacy kernel argument, so if you want to
> rewrite the user space tools, go wild :-). You may find that there are
> issues more serious than string parsing:
>
> $ ps --version
> procps version 3.2.3
> $ ps -o pid
>   PID
>  2089
>  2139
> $ strace ps -o pid 2>&1 | grep 'open("/proc/' | wc -l
> 325
>
> <whine>

While "pid" makes a nice extreme example, note that ps must
handle arbitrary cases like "pmem,comm,wchan,ppid,session".

Now, I direct your attention to "Introduction to Algorithms",
by Cormen, Leiserson, and Rivest. Find the section entitled
"The set-covering problem". It's page 974, section 37.3, in
my version of the book. An example of this would be the
determination of the minimum set of /proc files needed to
supply some required set of process attributes.

Look familiar? It's NP-hard. To me, that just sounds bad. :-)

While there are decent (?) approximations that run in
polynomial time, they are generally overkill. It is very
common to need both the stat and status files. Selection,
sorting, and display all may require data.

But hey, we can go ahead and compute NP-hard problems in
userspace if that makes the kernel less complicated. :-)
Just remember that if I say "this is hard", I mean it.
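[ Editor's sketch: the polynomial-time approximation Albert alludes to
  is the standard greedy set-cover heuristic from CLRS: repeatedly pick
  the /proc file that covers the most still-missing attributes. The
  file-to-attribute mapping below is illustrative, not a faithful
  inventory of /proc. ]

```python
# Illustrative map of per-process /proc files to the attributes
# each one can supply (simplified; not the real field lists).
PROC_FILES = {
    "stat":   {"pid", "comm", "ppid", "session", "tty", "utime", "stime"},
    "status": {"pid", "comm", "uid", "gid", "vmrss", "vmsize", "sigcgt"},
    "statm":  {"vmsize", "vmrss", "shared"},
    "wchan":  {"wchan"},
}

def greedy_cover(wanted, files=PROC_FILES):
    """Greedy set-cover approximation: ln(n)-factor of optimal,
    but polynomial time instead of NP-hard exact minimization."""
    remaining, chosen = set(wanted), []
    while remaining:
        best = max(files, key=lambda f: len(files[f] & remaining))
        if not files[best] & remaining:
            raise ValueError("attributes not satisfiable: %r" % remaining)
        chosen.append(best)
        remaining -= files[best]
    return chosen

# "ps -o comm,ppid,session,wchan" resolves to opening stat + wchan.
assert set(greedy_cover({"comm", "ppid", "session", "wchan"})) == {"stat", "wchan"}
```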
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:17           ` Albert Cahalan
@ 2004-08-29 20:46             ` William Lee Irwin III
  2004-08-29 21:45               ` Albert Cahalan
  2004-08-29 21:41             ` Roger Luethi
  2004-08-30 10:31             ` Paulo Marques
  2 siblings, 1 reply; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 20:46 UTC (permalink / raw)
To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>>> Also, what guarantee is there that the notification
>>> events come sufficiently slowly for a single task to
>>> process, particularly when that task may not have a whole
>>> cpu's resources to marshal to the task?

Roger Luethi writes:
>> A more likely guarantee is that a process that can't
>> keep up with differential updates won't be able to
>> process the whole list, either.

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> When the reader falls behind, keep supplying differential
> updates as long as practical. When this starts to eat up
> lots of memory, switch to supplying the full list until
> the reader catches up again.

You shouldn't have to try to scan the set of all tasks in any bounded
period of time or rely on differential updates. Scanning some part of
the list of a bounded size, updating the state based on what was
scanned, and reporting the rest as if it hadn't changed is the strategy
I'm describing.

On Sun, 29 Aug 2004 11:16:27 -0700, William Lee Irwin III wrote:
>>> I have a vague notion that userspace should intelligently schedule
>>> inquiries so requests are made at a rate the app can process and so
>>> that the app doesn't consume excessive amounts of cpu. In such an
>>> arrangement screen refresh events don't trigger a full scan of the
>>> tasklist, but rather only an incremental partial rescan of it, whose
>>> work is limited for the above cpu bandwidth concerns.
On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> If you won't scan, why update the display? This boils down
> to simply setting a lower refresh rate or using "nice".

Some updates can be captured, merely not all. Updating the
state given what was captured during the partial scan and
then displaying the state derived from what could be
captured in the refresh interval is more useful than being
nonfunctional at the lower refresh intervals or needlessly
beating the kernel in some futile attempt to exhaustively
search an impossibly huge dataset in some time bound that
can't be satisfied.

Roger Luethi writes:
>> While I'm not sure I understand how that partial rescan (or its limits)
>> would be defined, I agree with the general idea. There is indeed plenty
>> of room for improvement in a smart user space. For instance, most apps
>> show only the top n processes. So if an app shows the top 20 memory
>> users, it could use nproc to get a complete list of pid+vmrss, and then
>> request all the expensive fields only for the top 20 in that list.

On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.
> Remember that PIDs get reused. There isn't a generation counter
> or UUID that can be checked.

One shouldn't really need to care; periodically rechecking the fields
of an active pid should suffice. You don't really care whether it's the
same task or not, just that the fields are up-to-date and whether any
task with that pid exists.

Roger Luethi writes:
>> Uhm... Optimized string parsing would require updated user space
>> anyway. OTOH, I can buy the legacy kernel argument, so if you want to
>> rewrite the user space tools, go wild :-). You may find that there are
>> issues more serious than string parsing: [...]
On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> While "pid" makes a nice extreme example, note that ps must
> handle arbitrary cases like "pmem,comm,wchan,ppid,session".
> Now, I direct your attention to "Introduction to Algorithms",
> by Cormen, Leiserson, and Rivest. Find the section entitled
> "The set-covering problem". It's page 974, section 37.3, in
> my version of the book. An example of this would be the
> determination of the minimum set of /proc files needed to
> supply some required set of process attributes.
> Look familiar? It's NP-hard. To me, that just sounds bad. :-)
> While there are decent (?) approximations that run in
> polynomial time, they are generally overkill. It is very
> common to need both the stat and status files. Selection,
> sorting, and display all may require data.
> But hey, we can go ahead and compute NP-hard problems in
> userspace if that makes the kernel less complicated. :-)
> Just remember that if I say "this is hard", I mean it.

Actually, the problem size is so small it shouldn't be problematic.
There are only 13 /proc/ files associated with a process, so exhaustive
search over 2**13 - 1 == 8191 nonempty subsets, e.g. queueing by size
and checking for the satisfiability of the reporting, will suffice.

-- wli
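[ Editor's sketch: the exhaustive search wli proposes -- enumerate
  subsets in increasing size order ("queueing by size") and take the
  first whose union covers the request; with 13 files that is at most
  8191 candidates. The file contents are again illustrative, and only
  four of the 13 files are modeled here. ]

```python
from itertools import combinations

# Illustrative map of per-process /proc files to supplied attributes
# (simplified; not the real field lists).
PROC_FILES = {
    "stat":   {"pid", "comm", "ppid", "session", "tty", "utime", "stime"},
    "status": {"pid", "comm", "uid", "gid", "vmrss", "vmsize", "sigcgt"},
    "statm":  {"vmsize", "vmrss", "shared"},
    "wchan":  {"wchan"},
}

def minimum_cover(wanted, files=PROC_FILES):
    """Exact minimum cover by brute force: since subsets are tried in
    increasing size order, the first covering subset is a smallest one."""
    names = list(files)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            covered = set().union(*(files[f] for f in subset))
            if set(wanted) <= covered:
                return set(subset)
    return None  # not satisfiable from any combination of files

assert minimum_cover({"vmrss", "uid"}) == {"status"}
```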
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 20:46             ` William Lee Irwin III
@ 2004-08-29 21:45               ` Albert Cahalan
  2004-08-29 22:11                 ` William Lee Irwin III
  0 siblings, 1 reply; 39+ messages in thread
From: Albert Cahalan @ 2004-08-29 21:45 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > When the reader falls behind, keep supplying differential
> > updates as long as practical. When this starts to eat up
> > lots of memory, switch to supplying the full list until
> > the reader catches up again.
>
> You shouldn't have to try to scan the set of all tasks in any bounded
> period of time or rely on differential updates. Scanning some part of
> the list of a bounded size, updating the state based on what was
> scanned, and reporting the rest as if it hadn't changed is the strategy
> I'm describing.

That's defective. Users will not like it.

> > If you won't scan, why update the display? This boils down
> > to simply setting a lower refresh rate or using "nice".
>
> Some updates can be captured, merely not all. Updating the
> state given what was captured during the partial scan and
> then displaying the state derived from what could be
> captured in the refresh interval is more useful than being
> nonfunctional at the lower refresh intervals or needlessly
> beating the kernel in some futile attempt to exhaustively
> search an impossibly huge dataset in some time bound that
> can't be satisfied.

nice -n 19 top

> Roger Luethi writes:
> >> While I'm not sure I understand how that partial rescan (or its limits)
> >> would be defined, I agree with the general idea. There is indeed plenty
> >> of room for improvement in a smart user space. For instance, most apps
> >> show only the top n processes.
> >> So if an app shows the top 20 memory
> >> users, it could use nproc to get a complete list of pid+vmrss, and then
> >> request all the expensive fields only for the top 20 in that list.
>
> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > This is crummy. It's done for wchan, since that is so horribly
> > expensive, but I'm not liking the larger race condition window.
> > Remember that PIDs get reused. There isn't a generation counter
> > or UUID that can be checked.
>
> One shouldn't really need to care; periodically rechecking the fields
> of an active pid should suffice. You don't really care whether it's the
> same task or not, just that the fields are up-to-date and whether any
> task with that pid exists.

People use the procps tools to kill processes.
Bad data leads to bad decisions.

> On Sun, Aug 29, 2004 at 04:17:26PM -0400, Albert Cahalan wrote:
> > While "pid" makes a nice extreme example, note that ps must
> > handle arbitrary cases like "pmem,comm,wchan,ppid,session".
> > Now, I direct your attention to "Introduction to Algorithms",
> > by Cormen, Leiserson, and Rivest. Find the section entitled
> > "The set-covering problem". It's page 974, section 37.3, in
> > my version of the book. An example of this would be the
> > determination of the minimum set of /proc files needed to
> > supply some required set of process attributes.
> > Look familiar? It's NP-hard. To me, that just sounds bad. :-)
> > While there are decent (?) approximations that run in
> > polynomial time, they are generally overkill. It is very
> > common to need both the stat and status files. Selection,
> > sorting, and display all may require data.
> > But hey, we can go ahead and compute NP-hard problems in
> > userspace if that makes the kernel less complicated. :-)
> > Just remember that if I say "this is hard", I mean it.
>
> Actually, the problem size is so small it shouldn't be problematic.
> There are only 13 /proc/ files associated with a process, so exhaustive
> search over 2**13 - 1 == 8191 nonempty subsets, e.g. queueing by size
> and checking for the satisfiability of the reporting, will suffice.

Nice! Checking for satisfiability is only NP-complete...

I do get your point, but I expect to see more /proc files
as time passes. Also, there is the issue of maintainability.

Example 1: It has crossed my mind to add separate files
for the least security-critical data, so that an SE Linux
system with moderate security could provide some minimal
amount of basic info to normal users.

Example 2: There could be files containing only data
that is easy to generate or that needs the same locking.

Even with the "ps -o pid" example given, opening /proc/*/stat
is required to get the tty. Opening /proc/*/status is nearly
required; one can do stat() on the directory to get that
via st_uid though.
* Re: [BENCHMARK] nproc: netlink access to /proc information
  2004-08-29 21:45               ` Albert Cahalan
@ 2004-08-29 22:11                 ` William Lee Irwin III
  0 siblings, 0 replies; 39+ messages in thread
From: William Lee Irwin III @ 2004-08-29 22:11 UTC (permalink / raw)
To: Albert Cahalan; +Cc: Roger Luethi, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> You shouldn't have to try to scan the set of all tasks in any bounded
>> period of time or rely on differential updates. Scanning some part of
>> the list of a bounded size, updating the state based on what was
>> scanned, and reporting the rest as if it hadn't changed is the strategy
>> I'm describing.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> That's defective. Users will not like it.

Scarcely. The task can't be done in realtime. The data will be stale by
the time it's reported anyway. Limiting the amount of sampling done is
vastly superior to beating the kernel's reporting interfaces to death
in a totally futile attempt to achieve infeasible consistencies,
burning ridiculous amounts of cpu in the process, and reporting
gibberish in the end anyway.

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> Some updates can be captured, merely not all. Updating the
>> state given what was captured during the partial scan and
>> then displaying the state derived from what could be
>> captured in the refresh interval is more useful than being
>> nonfunctional at the lower refresh intervals or needlessly
>> beating the kernel in some futile attempt to exhaustively
>> search an impossibly huge dataset in some time bound that
>> can't be satisfied.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> nice -n 19 top

No, hard cpu limits are required, and even then it just spews gibberish
and very slowly. The current algorithms are nonfunctional with any
substantial number of processes.
On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> One shouldn't really need to care; periodically rechecking the fields
>> of an active pid should suffice. You don't really care whether it's
>> the same task or not, just that the fields are up-to-date and whether
>> any task with that pid exists.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> People use the procps tools to kill processes.
> Bad data leads to bad decisions.

Refusal to rate limit sampling doesn't make the data more coherent in
the presence of large numbers of tasks.

On Sun, 2004-08-29 at 16:46, William Lee Irwin III wrote:
>> Actually, the problem size is so small it shouldn't be problematic.
>> There are only 13 /proc/ files associated with a process, so
>> exhaustive search over 2**13 - 1 == 8191 nonempty subsets, e.g.
>> queueing by size and checking for the satisfiability of the
>> reporting, will suffice.

On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> Nice! Checking for satisfiability is only NP-complete...
> I do get your point, but I expect to see more /proc files
> as time passes. Also, there is the issue of maintainability.

No, that's not general satisfiability. Each field to be reported needs
at least one out of some set of subsets of /proc/ files associated with
a process to be included in those parsed. Checking each field for the
inclusion of one of its required subsets suffices. The number of
subsets of /proc/ files from which a field is calculable is bounded by
some small constant: it must be constant, as there are a finite number
of fields, and it is small, as this is some specific set whose precise
upper bound can be found, and if/when it is found, it is very likely to
be well under a tenth of the total number of subsets of /proc/ files
associated with a process.
On Sun, Aug 29, 2004 at 05:45:47PM -0400, Albert Cahalan wrote:
> Example 1: It has crossed my mind to add separate files
> for the least security-critical data, so that an SE Linux
> system with moderate security could provide some minimal
> amount of basic info to normal users.
> Example 2: There could be files containing only data
> that is easy to generate or that needs the same locking.
> Even with the "ps -o pid" example given, opening /proc/*/stat
> is required to get the tty. Opening /proc/*/status is nearly
> required; one can do stat() on the directory to get that
> via st_uid though.

I don't have a whole lot to say on this subject. These sound
reasonable.

-- wli
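The per-field check Irwin describes is cheap: a chosen set of /proc
files satisfies a request iff, for every requested field, at least one
of that field's source subsets is fully included in the chosen set. A
minimal userspace sketch -- the field-to-subsets table below is purely
illustrative, not the real procps mapping:

```python
# Sketch of the per-field satisfiability check described above.
# FIELD_SOURCES maps each field to the subsets of /proc files that can
# supply it. The table is illustrative only, not the real /proc layout.
FIELD_SOURCES = {
    "pid":   [frozenset(["stat"]), frozenset(["status"])],
    "comm":  [frozenset(["stat"]), frozenset(["status"])],
    "rss":   [frozenset(["statm"]), frozenset(["status"])],
    "wchan": [frozenset(["wchan"])],
}

def satisfies(chosen_files, fields):
    """True if parsing `chosen_files` yields every requested field.

    For each field, at least one of its source subsets must be entirely
    included in the chosen set -- a linear scan, no SAT solving.
    """
    chosen = frozenset(chosen_files)
    return all(
        any(subset <= chosen for subset in FIELD_SOURCES[f])
        for f in fields
    )
```

Per field the work is bounded by the small constant number of source
subsets, so checking a candidate set is linear in the number of
requested fields.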
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Roger Luethi @ 2004-08-29 21:41 UTC (permalink / raw)
To: Albert Cahalan
Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 16:17:26 -0400, Albert Cahalan wrote:
> When the reader falls behind, keep supplying differential
> updates as long as practical. When this starts to eat up
> lots of memory, switch to supplying the full list until
> the reader catches up again.

I think it should be up to the reader to request stuff, so I'd probably
just have the kernel notify the client that there won't be any more
differential updates. Then the client can decide what to do. But I'd
have to play around with this to see what works.

> > While I'm not sure I understand how that partial rescan (or its
> > limits) would be defined, I agree with the general idea. There is
> > indeed plenty of room for improvement in a smart user space. For
> > instance, most apps show only the top n processes. So if an app
> > shows the top 20 memory users, it could use nproc to get a complete
> > list of pid+vmrss, and then request all the expensive fields only
> > for the top 20 in that list.
>
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.

The races left with nproc are much smaller. There is of course the
question of whether the pid still exists by the time you query the
kernel about it. But you get all the information in one go (although
the process may still disappear while the kernel prepares the requested
info).
> > $ ps --version
> > procps version 3.2.3
> > $ ps -o pid
> >   PID
> >  2089
> >  2139
> > $ strace ps -o pid 2>&1|grep 'open("/proc/'|wc -l
> > 325
> >
> > <whine>
>
> While "pid" makes a nice extreme example, note that ps must
> handle arbitrary cases like "pmem,comm,wchan,ppid,session".
>
> Now, I direct your attention to "Introduction to Algorithms",
> by Cormen, Leiserson, and Rivest. Find the section entitled
[...]
> Just remember that if I say "this is hard", I mean it.

Entertaining, but you missed the point: I am not terribly impressed
with the fact that ps opens two files (stat, statm) for _every_
_single_ _process_ if all I want to know is, say, the name of PID 42
(example taken from ps(1): ps -p 42 -o comm=).

And FWIW, you don't need the "minimum set of /proc files needed to
supply some required set of process attributes". Any set that supplies
the required fields will do, and you can get an excellent approximation
in O(n). I suspect Cormen, Leiserson, and Rivest would take exception
to your assertion that ps tools can't be improved. Or even that doing
so is hard.

Roger
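The O(n) approximation Luethi alludes to is presumably the classic
greedy heuristic for set cover: repeatedly open the file that supplies
the most still-missing fields. A sketch under that assumption, with a
purely illustrative file-to-fields table (not the real /proc layout):

```python
# Greedy approximation for choosing which /proc files to parse. The
# file-to-fields mapping is illustrative only, not the real /proc layout.
FILE_FIELDS = {
    "stat":   {"pid", "comm", "state", "ppid", "tty"},
    "statm":  {"size", "rss"},
    "status": {"pid", "comm", "uid", "vmrss"},
    "wchan":  {"wchan"},
}

def choose_files(wanted):
    """Return a near-minimal list of files covering all wanted fields."""
    missing = set(wanted)
    chosen = []
    while missing:
        # Pick the file covering the most still-uncovered fields.
        best = max(FILE_FIELDS, key=lambda f: len(FILE_FIELDS[f] & missing))
        if not FILE_FIELDS[best] & missing:
            raise ValueError("fields not satisfiable: %r" % missing)
        chosen.append(best)
        missing -= FILE_FIELDS[best]
    return chosen
```

The greedy choice is within a logarithmic factor of the true minimum,
which is the best guarantee any known polynomial-time algorithm offers
for set cover -- and with only around 13 files per process, the gap is
academic.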
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Albert Cahalan @ 2004-08-29 23:31 UTC (permalink / raw)
To: Roger Luethi
Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 2004-08-29 at 17:41, Roger Luethi wrote:
> And FWIW, you don't need the "minimum set of /proc
> files needed to supply some required set of process
> attributes". Any set that supplies the required fields
> will do, and you can get an excellent approximation
> in O(n).

You got that, and you didn't like it. I'm sure it wouldn't be hard to
hack up some special-case optimization for the cases you've listed. As
soon as I do so, you'll find another special case. Ultimately, you ARE
asking to have procps solve the NP-hard set-covering problem.

There are several good reasons not to go down that path. The potential
for increasing numbers of /proc files in the future is one. Another is
the very limited benefit; typical ps usage does require much of that
data. Maintainability is yet another reason; ps does more than just
spit out the data. It is very useful to have a decent selection of data
items that will always be available for process selection, sorting, and
any other use. The potential for adding bugs is great.

That said, I do at times tweak the code used to select data sources.
Perhaps I should add a new /proc/*/basics file for the most popular
items. This would make fancy set-covering choices more profitable.
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Roger Luethi @ 2004-08-30 7:16 UTC (permalink / raw)
To: Albert Cahalan
Cc: William Lee Irwin III, linux-kernel mailing list, Paul Jackson

On Sun, 29 Aug 2004 19:31:17 -0400, Albert Cahalan wrote:
> select data sources. Perhaps I should add a new
> /proc/*/basics file for the most popular items.

It shouldn't surprise anyone that I am not keen on making any semantic
changes to /proc in order to help tools. nproc is a vastly superior
interface.

Roger
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Paulo Marques @ 2004-08-30 10:31 UTC (permalink / raw)
To: Albert Cahalan
Cc: Roger Luethi, William Lee Irwin III, linux-kernel mailing list, Paul Jackson

Albert Cahalan wrote:
> ...
> This is crummy. It's done for wchan, since that is so horribly
> expensive, but I'm not liking the larger race condition window.
> Remember that PIDs get reused. There isn't a generation counter
> or UUID that can be checked.

I just wanted to call your attention to the kallsyms speedup patch that
is now in the -mm tree. It should improve wchan speed. My benchmarks
for kallsyms_lookup (the function that was responsible for the wchan
time) went from 1340us to 0.5us.

So maybe this is enough not to make wchan a special case anymore...

-- 
Paulo Marques - www.grupopie.com

To err is human, but to really foul things up requires a computer.
Farmers' Almanac, 1978
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: William Lee Irwin III @ 2004-08-30 10:53 UTC (permalink / raw)
To: Paulo Marques
Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list, Paul Jackson

Albert Cahalan wrote:
>> This is crummy. It's done for wchan, since that is so horribly
>> expensive, but I'm not liking the larger race condition window.
>> Remember that PIDs get reused. There isn't a generation counter
>> or UUID that can be checked.

On Mon, Aug 30, 2004 at 11:31:43AM +0100, Paulo Marques wrote:
> I just wanted to call your attention to the kallsyms speedup patch
> that is now in the -mm tree.
> It should improve wchan speed. My benchmarks for kallsyms_lookup (the
> function that was responsible for the wchan time) went from 1340us to
> 0.5us.
> So maybe this is enough not to make wchan a special case anymore...

This seems to go wrong on big-endian machines; any chance you could
look over your stuff and try to figure out what endianness issues it
may have?

-- wli
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Paulo Marques @ 2004-08-30 12:23 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list, Paul Jackson

William Lee Irwin III wrote:
> This seems to go wrong on big-endian machines; any chance you could
> look over your stuff and try to figure out what endianness issues it
> may have?

I went over the code but at a first glance couldn't find a notorious
trouble spot. I don't have big-endian hardware myself, so this is hard
to test.

Just a few questions to help me out in finding the problem:
- is this really an endianness problem, or is it a 64-bit integer
  problem?
- are you cross compiling the kernel?

Thanks in advance,

-- 
Paulo Marques - www.grupopie.com
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: William Lee Irwin III @ 2004-08-30 12:28 UTC (permalink / raw)
To: Paulo Marques
Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list, Paul Jackson

William Lee Irwin III wrote:
>> This seems to go wrong on big-endian machines; any chance you could
>> look over your stuff and try to figure out what endianness issues it
>> may have?

On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> I went over the code but at a first glance couldn't find a notorious
> trouble spot. I don't have big-endian hardware myself, so this is
> hard to test.
> Just a few questions to help me out in finding the problem:
> - is this really an endianness problem, or is it a 64-bit integer
>   problem?

Works fine on x86-64 and alpha. Prints gibberish on sparc64.

On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
> - are you cross compiling the kernel?

No. All native.

-- wli
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Paulo Marques @ 2004-08-30 13:43 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Roger Luethi, linux-kernel mailing list, Paul Jackson

William Lee Irwin III wrote:
> On Mon, Aug 30, 2004 at 01:23:51PM +0100, Paulo Marques wrote:
>> - is this really an endianness problem, or is it a 64-bit integer
>>   problem?
>
> Works fine on x86-64 and alpha. Prints gibberish on sparc64.
>
>> - are you cross compiling the kernel?
>
> No. All native.

Can you send me a ".tmp_kallsyms2.S" obtained after a kernel build on a
sparc64, so that I can isolate the problem between scripts/kallsyms.c
and kernel/kallsyms.c? (maybe gzip'ed and in private, because this can
be a big file...)

Thanks for all the help in debugging this.

-- 
Paulo Marques - www.grupopie.com
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Paul Jackson @ 2004-08-29 19:07 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: rl, linux-kernel, albert

> get_tgid_list() is a sad story I don't have time to go into in depth.
> The short version is that larger systems are extremely sensitive to

Thanks, Roger and William, for your good work here. I'm sure that SGI's
big berthas will benefit.

In glancing at get_tgid_list() I see it is careful to only pick off 20
(PROC_MAXPIDS) slots at a time. But elsewhere in the kernel, I see
several uses of "do_each_thread()" which rip through the entire task
list in a single shot.

Is there a simple explanation for why it is ok in one place to take on
the entire task list in a single sweep, but in another it is important
to drop the lock every 20 slots?

From the code and nice comments, I see that:
 (1) the work that had to be done by proc_pid_readdir(), the caller of
     get_tgid_list(), required dropping the task list lock, and
 (2) so the harvested tgid's had to be stashed in a temp buffer.

So perhaps the reason for not doing this in a single pass is:
 (3) it was not doable or not desirable (which one?) to size that temp
     buffer large enough to hold all the harvested tgid's in one pass.

But my understanding is losing the scent of the trail at this point.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: William Lee Irwin III @ 2004-08-29 19:17 UTC (permalink / raw)
To: Paul Jackson; +Cc: rl, linux-kernel, albert

At some point in the past, I wrote:
>> get_tgid_list() is a sad story I don't have time to go into in depth.
>> The short version is that larger systems are extremely sensitive to

On Sun, Aug 29, 2004 at 12:07:33PM -0700, Paul Jackson wrote:
> Thanks, Roger and William, for your good work here. I'm sure that
> SGI's big berthas will benefit.
> In glancing at get_tgid_list() I see it is careful to only pick off
> 20 (PROC_MAXPIDS) slots at a time. But elsewhere in the kernel, I see
> several uses of "do_each_thread()" which rip through the entire task
> list in a single shot.
> Is there a simple explanation for why it is ok in one place to take on
> the entire task list in a single sweep, but in another it is important
> to drop the lock every 20 slots?

PROC_MAXPIDS is the size of the buffer used to temporarily store the
pids while doing user copies, so that potentially blocking operations
may be done to transmit the pids to userspace. Introducing another
whole-tasklist scan, even if feasible, is probably not a good idea.

On Sun, Aug 29, 2004 at 12:07:33PM -0700, Paul Jackson wrote:
> From the code and nice comments, I see that:
> (1) the work that had to be done by proc_pid_readdir(), the caller of
>     get_tgid_list(), required dropping the task list lock, and
> (2) so the harvested tgid's had to be stashed in a temp buffer.
> So perhaps the reason for not doing this in a single pass is:
> (3) it was not doable or not desirable (which one?) to size that temp
>     buffer large enough to hold all the harvested tgid's in one pass.
> But my understanding is losing the scent of the trail at this point.
Using a larger, dynamically-allocated buffer may be better, e.g.
allocating a page to buffer pids with. A solution to the problem of the
quadratic algorithm I wrote long ago restructured the tasklist as an
rbtree so that the position in the tasklist could be recovered in
O(lg(n)) time. Unfortunately, this increases the write hold time of
tasklist_lock.

-- wli
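The PROC_MAXPIDS pattern under discussion -- hold the lock only long
enough to fill a small private buffer, then release it so the blocking
user copy can proceed -- can be sketched in userspace Python. The name
and batch size of 20 come from the thread; everything else, including
resuming by index, is illustrative:

```python
import threading

PROC_MAXPIDS = 20  # batch size, as in the get_tgid_list() discussion

tasklist_lock = threading.Lock()
tasklist = list(range(100, 400))  # stand-in for the kernel task list

def harvest_tgids():
    """Copy the pid list in PROC_MAXPIDS-sized batches.

    The lock is held only while filling the small temp buffer; between
    batches it is dropped so blocking work (the copy to userspace) can
    proceed without starving tasklist writers.
    """
    out, pos = [], 0
    while True:
        with tasklist_lock:
            batch = tasklist[pos:pos + PROC_MAXPIDS]
        if not batch:
            break
        # ...the blocking copy_to_user() equivalent would happen here,
        # with the lock dropped...
        out.extend(batch)
        pos += len(batch)
    return out
```

Resuming by position between batches is exactly the weak spot noted
above: if tasks come and go while the lock is dropped, an index-based
resume can skip or repeat entries, hence the mention of an rbtree to
recover the scan position in O(lg n).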
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Roger Luethi @ 2004-08-29 19:49 UTC (permalink / raw)
To: William Lee Irwin III, Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 12:17:07 -0700, William Lee Irwin III wrote:
> > In glancing at get_tgid_list() I see it is careful to only pick off
> > 20 (PROC_MAXPIDS) slots at a time. But elsewhere in the kernel, I
> > see several uses of "do_each_thread()" which rip through the entire
> > task list in a single shot.
> > Is there a simple explanation for why it is ok in one place to take
> > on the entire task list in a single sweep, but in another it is
> > important to drop the lock every 20 slots?
> [...]
> Introducing another whole-tasklist scan, even if feasible, is probably
> not a good idea.

I'm not sure whether I should participate in that discussion. I'll risk
discrediting nproc with wild speculations on a subject I haven't really
looked into yet. Ah well...

As far as nproc (and process monitoring) is concerned, we aren't really
interested in walking a complete process list. All we care about is
which pids exist right now. How about a bit field, maintained by the
kernel, to indicate for each pid whether it exists or not? This would
amount to 4 KiB by default and 512 KiB for PID_MAX_LIMIT (4 million
processes). Maintenance cost would be one atomic bit operation per
process creation/deletion. No contested locks.

The list for the nproc user could be prepared based on the bit field
(or simply memcpy'd), no tasklist_lock or walking linked lists
required. What am I missing?

Roger
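The sizes quoted above check out: 32768 default pids / 8 bits per byte
= 4 KiB, and 4M pids / 8 = 512 KiB. A userspace sketch of such a pid
bitmap -- illustrative only; in the kernel the set/clear would be
atomic bit operations on shared memory:

```python
PID_MAX_DEFAULT = 32768          # 32768 / 8 = 4 KiB of bitmap
PID_MAX_LIMIT = 4 * 1024 * 1024  # 4M pids / 8 = 512 KiB of bitmap

class PidBitmap:
    """One bit per pid: set on process creation, cleared on exit.

    A plain bytearray stands in for the kernel's shared bitmap; the
    per-pid update cost is a single bit operation, with no lock taken.
    """
    def __init__(self, max_pid=PID_MAX_DEFAULT):
        self.bits = bytearray(max_pid // 8)

    def set(self, pid):
        self.bits[pid >> 3] |= 1 << (pid & 7)

    def clear(self, pid):
        self.bits[pid >> 3] &= ~(1 << (pid & 7))

    def snapshot(self):
        """All currently existing pids, read from the bitmap alone --
        no tasklist walk, no tasklist_lock."""
        return [i * 8 + b
                for i, byte in enumerate(self.bits) if byte
                for b in range(8) if byte >> b & 1]
```

Preparing the pid list for an nproc client is then a scan (or memcpy)
of at most 512 KiB, independent of how the tasklist is organized.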
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: William Lee Irwin III @ 2004-08-29 20:25 UTC (permalink / raw)
To: Roger Luethi; +Cc: Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 12:17:07 -0700, William Lee Irwin III wrote:
>> Introducing another whole-tasklist scan, even if feasible, is
>> probably not a good idea.

On Sun, Aug 29, 2004 at 09:49:26PM +0200, Roger Luethi wrote:
> I'm not sure whether I should participate in that discussion. I'll
> risk discrediting nproc with wild speculations on a subject I haven't
> really looked into yet. Ah well...

There isn't much to speculate about here; reducing the arrival rate to
tasklist_lock is okay, but it can't be held forever or use unbounded
allocations or anything like that.

On Sun, Aug 29, 2004 at 09:49:26PM +0200, Roger Luethi wrote:
> As far as nproc (and process monitoring) is concerned, we aren't
> really interested in walking a complete process list. All we care
> about is which pids exist right now. How about a bit field, maintained
> by the kernel, to indicate for each pid whether it exists or not? This
> would amount to 4 KiB by default and 512 KiB for PID_MAX_LIMIT (4
> million processes). Maintenance cost would be one atomic bit operation
> per process creation/deletion. No contested locks.
> The list for the nproc user could be prepared based on the bit field
> (or simply memcpy'd), no tasklist_lock or walking linked lists
> required. What am I missing?

The pid bitmap could be exported to userspace rather easily.

-- wli
* Re: [BENCHMARK] nproc: netlink access to /proc information
From: Roger Luethi @ 2004-08-31 10:16 UTC (permalink / raw)
To: William Lee Irwin III, Paul Jackson, linux-kernel, albert

On Sun, 29 Aug 2004 13:25:43 -0700, William Lee Irwin III wrote:
> > The list for the nproc user could be prepared based on the bit field
> > (or simply memcpy'd), no tasklist_lock or walking linked lists
> > required. What am I missing?
>
> The pid bitmap could be exported to userspace rather easily.

I implemented an "all processes" selector based on that. Remaining
pieces are access control and a method for dumping large amounts of
data (10 - 1000 KB) to user space.

Roger
* [BENCHMARK] nproc: Look Ma, No get_tgid_list!
From: Roger Luethi @ 2004-08-31 15:34 UTC (permalink / raw)
To: William Lee Irwin III, linux-kernel, Albert Cahalan, Paul Jackson

This posting demonstrates a new method of monitoring all processes in a
large system.

You may remember what a /proc based tool does when monitoring some 10^5
processes -- it spends its time in the kernel hanging on to a read-held
tasklist_lock:

==> 10000 processes: top -d 0 -b > /dev/null <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name         symbol name
35855    36.0707  vmlinux            get_tgid_list
 9366     9.4223  vmlinux            pid_alive
 7077     7.1196  libc-2.3.3.so      _IO_vfscanf_internal
 5386     5.4184  vmlinux            number
 3664     3.6860  vmlinux            proc_pid_stat
 3077     3.0955  libc-2.3.3.so      _IO_vfprintf_internal
 2136     2.1489  vmlinux            __d_lookup
 1720     1.7303  vmlinux            vsnprintf
 1451     1.4597  libc-2.3.3.so      __i686.get_pc_thunk.bx
 1409     1.4175  libc-2.3.3.so      _IO_default_xsputn_internal
 1258     1.2656  libc-2.3.3.so      _IO_putc_internal
 1225     1.2324  vmlinux            link_path_walk
 1210     1.2173  libc-2.3.3.so      ____strtoul_l_internal
 1199     1.2062  vmlinux            task_statm
 1157     1.1640  libc-2.3.3.so      ____strtol_l_internal
  794     0.7988  libc-2.3.3.so      _IO_sputbackc_internal
  776     0.7807  libncurses.so.5.4  _nc_outch

Here's a profile for an nproc based tool monitoring the same set of
processes:

==> 10000 processes: nprocbench <==
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        app name       symbol name
 8641    24.8626  vmlinux        __task_mem
 2778     7.9931  vmlinux        find_pid
 2536     7.2968  vmlinux        finish_task_switch
 1872     5.3863  vmlinux        netlink_recvmsg
 1637     4.7101  vmlinux        nproc_pid_fields
 1373     3.9505  vmlinux        __wake_up
 1218     3.5045  vmlinux        __copy_to_user_ll
 1134     3.2628  vmlinux        __task_mem_cheap
  944     2.7162  vmlinux        mmgrab
  876     2.5205  vmlinux        nproc_ps_do_pid
  568     1.6343  vmlinux        skb_dequeue
  526     1.5135  libc-2.3.3.so  __recv
  514     1.4789  vmlinux        alloc_skb
  510     1.4674  vmlinux        __might_sleep
  485     1.3955  vmlinux        skb_release_data
  463     1.3322  vmlinux        netlink_attachskb
  363     1.0445  vmlinux        sys_recvfrom

Resource usage is now dominated by field computation, rather than by
delivery overhead. By now it should be clear that nproc is not only a
cleaner interface with lower overhead for tools, it also scales a lot
better than /proc.

Roger
* Re: [BENCHMARK] nproc: Look Ma, No get_tgid_list!
From: William Lee Irwin III @ 2004-08-31 19:38 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel, Albert Cahalan, Paul Jackson

On Tue, Aug 31, 2004 at 05:34:32PM +0200, Roger Luethi wrote:
> This posting demonstrates a new method of monitoring all processes in
> a large system.
> You may remember what a /proc based tool does when monitoring some
> 10^5 processes -- it spends its time in the kernel hanging on to a
> read-held tasklist_lock:
> ==> 10000 processes: top -d 0 -b > /dev/null <==
> samples  %        image name  symbol name
> 35855    36.0707  vmlinux     get_tgid_list
>  9366     9.4223  vmlinux     pid_alive
[...]

The most crucial issue for larger systems is removing the rather easily
triggerable rwlock starvation. Perhaps dipankar's /proc/ -only tasklist
RCU patch can resolve that.

On Tue, Aug 31, 2004 at 05:34:32PM +0200, Roger Luethi wrote:
> Here's a profile for an nproc based tool monitoring the same set
> of processes:
> ==> 10000 processes: nprocbench <==
> samples  %        app name  symbol name
>  8641    24.8626  vmlinux   __task_mem
>  2778     7.9931  vmlinux   find_pid
[...]
> Resource usage is now dominated by field computation, rather than by
> delivery overhead. By now it should be clear that nproc is not only a
> cleaner interface with lower overhead for tools, it also scales a lot
> better than /proc.
With this in hand we can probably ignore the /proc/ -related efficiency
issues in favor of any method preventing the rwlock starvation, e.g.
dipankar's /proc/ -only tasklist RCU patch.

-- wli
end of thread, other threads:[~2004-09-01  0:28 UTC | newest]

Thread overview: 39+ messages:
2004-08-27 12:24 [0/2][ANNOUNCE] nproc: netlink access to /proc information Roger Luethi
2004-08-27 12:24 ` [1/2][PATCH] " Roger Luethi
2004-08-27 13:39   ` Roger Luethi
2004-08-27 12:24 ` [2/2][sample code] nproc: user space app Roger Luethi
2004-08-27 14:50 ` [0/2][ANNOUNCE] nproc: netlink access to /proc information James Morris
2004-08-27 15:26   ` Roger Luethi
2004-08-27 16:23 ` William Lee Irwin III
2004-08-27 16:37   ` Albert Cahalan
2004-08-27 16:41     ` William Lee Irwin III
2004-08-27 17:01   ` Roger Luethi
2004-08-27 17:08     ` William Lee Irwin III
2004-08-28 19:45 ` [BENCHMARK] " Roger Luethi
2004-08-28 19:56   ` William Lee Irwin III
2004-08-28 20:14     ` Roger Luethi
2004-08-29 16:05       ` William Lee Irwin III
2004-08-29 17:02         ` Roger Luethi
2004-08-29 17:20           ` William Lee Irwin III
2004-08-29 17:52             ` Roger Luethi
2004-08-29 18:16               ` William Lee Irwin III
2004-08-29 19:00                 ` Roger Luethi
2004-08-29 20:17                   ` Albert Cahalan
2004-08-29 20:46                     ` William Lee Irwin III
2004-08-29 21:45                       ` Albert Cahalan
2004-08-29 22:11                         ` William Lee Irwin III
2004-08-29 21:41                     ` Roger Luethi
2004-08-29 23:31                       ` Albert Cahalan
2004-08-30  7:16                         ` Roger Luethi
2004-08-30 10:31                     ` Paulo Marques
2004-08-30 10:53                       ` William Lee Irwin III
2004-08-30 12:23                         ` Paulo Marques
2004-08-30 12:28                           ` William Lee Irwin III
2004-08-30 13:43                             ` Paulo Marques
2004-08-29 19:07           ` Paul Jackson
2004-08-29 19:17             ` William Lee Irwin III
2004-08-29 19:49               ` Roger Luethi
2004-08-29 20:25                 ` William Lee Irwin III
2004-08-31 10:16                   ` Roger Luethi
2004-08-31 15:34 ` [BENCHMARK] nproc: Look Ma, No get_tgid_list! Roger Luethi
2004-08-31 19:38   ` William Lee Irwin III