* [0/1][ANNOUNCE] nproc v2: netlink access to /proc information
@ 2004-09-08 18:40 Roger Luethi
2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
2004-09-16 21:43 ` nproc: So? Roger Luethi
0 siblings, 2 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-08 18:40 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
Paul Jackson
I am submitting nproc, a new netlink interface to process information,
for review and possible inclusion in mainline.
The problems /proc poses for parsers are widely known: parsing is both
difficult and slow (for a more detailed discussion, see
http://marc.theaimsgroup.com/?l=linux-kernel&m=109361019528995). What
follows is an overview showing how nproc fares in those areas.
Roger
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Clean Interface
---------------
The main motivation was to clean up the mess that is /proc semantics
and provide a clean interface for tools to gather process information.
Nproc does not add new knowledge to the kernel (some redundancy remains
until routines are shared with /proc). Instead, it offers existing
information in a form that works for tools. In fact, a tool can pass
the buffer read from the netlink socket directly as a va_list to vprintf
(strings require a trivial extra operation).
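
To illustrate the idea of printing reply fields with kernel-supplied
format strings, here is a minimal user-space sketch. It does not use the
va_list shortcut; it walks the payload field by field, which is the more
portable route. The names demo_field and demo_format_row() are made up
for this example, and the string layout (a u32 length followed by that
many bytes) is an assumption read off the way the patch stores
NPROC_NAME.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Type bits as defined in the patch's include/linux/nproc.h. */
#define NPROC_TYPE_MASK   0x07000000u
#define NPROC_TYPE_STRING 0x01000000u
#define NPROC_TYPE_U32    0x02000000u

/* Hypothetical user-space view of one field: its ID plus the format
 * string a NPROC_LABEL_FIELD_FMT request would return for it. */
struct demo_field {
	uint32_t id;
	const char *fmt;
};

/*
 * Walk a reply payload and print each field with its own format string,
 * appending '|' after every field. Only u32 and string fields are
 * handled in this sketch. Returns the number of characters written to
 * out, or -1 on malformed input or insufficient space.
 */
static int demo_format_row(const unsigned char *buf, size_t len,
			   const struct demo_field *f, size_t nfields,
			   char *out, size_t outsz)
{
	size_t pos = 0, used = 0;

	assert(buf && f && out && outsz > 0);
	out[0] = '\0';

	for (size_t i = 0; i < nfields; i++) {
		uint32_t v;
		int n;

		if (len - pos < sizeof(v))
			return -1;
		memcpy(&v, buf + pos, sizeof(v));
		pos += sizeof(v);

		switch (f[i].id & NPROC_TYPE_MASK) {
		case NPROC_TYPE_U32:
			n = snprintf(out + used, outsz - used, f[i].fmt,
				     (unsigned)v);
			break;
		case NPROC_TYPE_STRING: {
			char s[64];

			/* v is the string length; terminate defensively. */
			if (v >= sizeof(s) || len - pos < v)
				return -1;
			memcpy(s, buf + pos, v);
			s[v] = '\0';
			pos += v;
			n = snprintf(out + used, outsz - used, f[i].fmt, s);
			break;
		}
		default:
			return -1;
		}
		if (n < 0 || (size_t)n >= outsz - used)
			return -1;
		used += (size_t)n;

		if (used + 1 >= outsz)
			return -1;
		out[used++] = '|';
		out[used] = '\0';
	}
	return (int)used;
}
```

Fed a payload holding a PID, a name and a VmSize value together with
the formats "%5u", "%-15s" and "%8u", the function produces one row in
the style of the listing that follows.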
A small user-space app can present a view like the one below based on
zero prior knowledge about the fields the kernel has to offer. While I
don't envision that as common for tools in the future, it demonstrates
what can be done with little effort. This is not a mock-up, by the way:
the nprocdemo tool exists (lines truncated to fit 80 columns).
MemFree |PageSize|Jiffies   |nr_dirty|nr_writeback|nr_unstable|[...]
____page|____byte|__________|____page|________page|_______page|[...]
    7546|    4096|   1917203|       1|           0|          0|[...]

PID  |Name           |VmSize  |VmLock  |VmRSS   |VmData  |VmStack |[...]
_____|_______________|_____KiB|_____KiB|_____KiB|_____KiB|_____KiB|[...]
    1|init           |    1340|       0|     468|     144|       4|[...]
    2|ksoftirqd/0    |       0|       0|       0|       0|       0|[...]
    3|events/0       |       0|       0|       0|       0|       0|[...]
    4|khelper        |       0|       0|       0|       0|       0|[...]
    5|netlink/0      |       0|       0|       0|       0|       0|[...]
    6|kacpid         |       0|       0|       0|       0|       0|[...]
   23|kblockd/0      |       0|       0|       0|       0|       0|[...]
   24|khubd          |       0|       0|       0|       0|       0|[...]
   36|pdflush        |       0|       0|       0|       0|       0|[...]
   37|pdflush        |       0|       0|       0|       0|       0|[...]
   38|kswapd0        |       0|       0|       0|       0|       0|[...]
   39|aio/0          |       0|       0|       0|       0|       0|[...]
  671|kseriod        |       0|       0|       0|       0|       0|[...]
  686|reiserfs/0     |       0|       0|       0|       0|       0|[...]
  851|udevd          |    1320|       0|     360|     144|       4|[...]
 9159|syslogd        |    1516|       0|     588|     272|      16|[...]
 9382|gpm            |    1540|       0|     468|     152|       4|[...]
 9452|klogd          |    1468|       0|     432|     276|       8|[...]
 9478|hddtemp        |    1692|       0|     848|     472|      16|[...]
 9486|login          |    2152|       0|    1204|     392|      36|[...]
 9487|agetty         |    1340|       0|     488|     156|       4|[...]
 9488|agetty         |    1340|       0|     488|     156|       4|[...]
 9489|agetty         |    1340|       0|     488|     156|       4|[...]
 9490|agetty         |    1340|       0|     488|     156|       4|[...]
 9491|agetty         |    1340|       0|     488|     156|       4|[...]
 9598|zsh            |    4748|       0|    1688|     532|      20|[...]
[...]
Performance
-----------
I measured how long "ps ax" and nprocdemo each take to write a complete
process table dump for 5000 tasks to /dev/null 100 times.
ps ax (5 process fields):
real 1m0.472s
user 0m18.227s
sys 0m28.545s
nprocdemo (automatic field discovery, reading and printing 11 process
fields + 9 global fields):
real 0m9.064s
user 0m2.491s
sys 0m1.554s
The details of resource usage for the benchmarks show that /proc based
tools are suffering badly from the inefficiency of three(!) conversions
between data and strings (kernel produces strings from numbers, app
converts back to numbers, app converts numbers again to strings for
printing).
For nproc based tools, only one conversion remains.
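
The difference between the two paths can be sketched in a few lines of
user-space C. This is an illustration only: demo_proc_path() and
demo_nproc_path() are hypothetical names, the "VmSize:" line merely
imitates the style of /proc/pid/status, and the binary payload is
assumed to be a single native-endian u32.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* /proc-style path: three conversions for a single value. */
static int demo_proc_path(uint32_t kernel_value, char *out, size_t outsz)
{
	char text[32];
	unsigned long parsed;

	assert(out && outsz > 0);

	/* 1. Producer renders the number as text. */
	snprintf(text, sizeof(text), "VmSize:\t%8u kB\n",
		 (unsigned)kernel_value);

	/* 2. Consumer parses the text back into a number. */
	parsed = strtoul(text + strlen("VmSize:\t"), NULL, 10);

	/* 3. Consumer renders the number as text again for display. */
	return snprintf(out, outsz, "%8lu", parsed);
}

/* nproc-style path: the value arrives in binary, one conversion left. */
static int demo_nproc_path(const unsigned char *payload, char *out,
			   size_t outsz)
{
	uint32_t v;

	assert(payload && out && outsz > 0);
	memcpy(&v, payload, sizeof(v));
	return snprintf(out, outsz, "%8u", (unsigned)v);
}
```

Both functions yield the same eight-character column; the first one gets
there through two extra round trips between numbers and strings, which
is where symbols such as number, vsnprintf, _IO_vfscanf_internal and the
strtol family in the ps profile come from.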
# ps ax > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples % image name app name symbol name
6524 14.0613 vmlinux ps number
4828 10.4058 libc-2.3.3.so ps _IO_vfscanf_internal
2740 5.9056 vmlinux ps vsnprintf
2689 5.7956 vmlinux ps proc_pid_stat
1807 3.8946 vmlinux ps __d_lookup
1676 3.6123 libc-2.3.3.so ps ____strtol_l_internal
1335 2.8773 vmlinux ps link_path_walk
1133 2.4420 libproc-3.2.3.so ps status2proc
1094 2.3579 vmlinux ps render_sigset_t
1088 2.3450 libc-2.3.3.so ps _IO_vfprintf_internal
1086 2.3407 libc-2.3.3.so ps __GI_strchr
885 1.9075 libc-2.3.3.so ps ____strtoul_l_internal
800 1.7242 vmlinux ps pid_revalidate
581 1.2522 vmlinux ps proc_pid_status
551 1.1876 libc-2.3.3.so ps _IO_sputbackc_internal
529 1.1402 vmlinux ps system_call
524 1.1294 libc-2.3.3.so ps _IO_default_xsputn_internal
476 1.0259 libc-2.3.3.so ps __i686.get_pc_thunk.bx
466 1.0044 vmlinux ps get_tgid_list
442 0.9526 vmlinux ps atomic_dec_and_lock
373 0.8039 vmlinux ps dput
311 0.6703 libc-2.3.3.so ps __GI___strtol_internal
274 0.5906 vmlinux ps __copy_to_user_ll
272 0.5862 vmlinux ps path_lookup
270 0.5819 vmlinux ps strncpy_from_user
262 0.5647 libproc-3.2.3.so ps escape_str
259 0.5582 vmlinux ps page_address
249 0.5367 libc-2.3.3.so ps __GI_____strtoull_l_internal
244 0.5259 libc-2.3.3.so ps __GI_strlen
# nprocdemo > /dev/null
CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples % image name app name symbol name
1142 15.9208 libc-2.3.3.so nprocdemo _IO_vfprintf_internal
1072 14.9449 vmlinux vmlinux __task_mem
611 8.5181 libc-2.3.3.so nprocdemo _IO_new_file_xsputn
445 6.2038 vmlinux vmlinux nproc_pid_fields
244 3.4016 vmlinux vmlinux get_wchan
235 3.2762 vmlinux nprocdemo __copy_to_user_ll
233 3.2483 vmlinux vmlinux find_pid
215 2.9974 vmlinux vmlinux finish_task_switch
208 2.8998 vmlinux nprocdemo netlink_recvmsg
158 2.2027 vmlinux nprocdemo __wake_up
153 2.1330 libc-2.3.3.so nprocdemo __find_specmb
149 2.0772 vmlinux nprocdemo finish_task_switch
146 2.0354 libc-2.3.3.so nprocdemo __i686.get_pc_thunk.bx
114 1.5893 vmlinux vmlinux get_task_mm
94 1.3105 vmlinux nprocdemo skb_release_data
87 1.2129 vmlinux vmlinux nproc_ps_do_pid
76 1.0595 vmlinux vmlinux alloc_skb
72 1.0038 vmlinux nprocdemo system_call
68 0.9480 libc-2.3.3.so nprocdemo _IO_padn_internal
65 0.9062 libc-2.3.3.so nprocdemo read_int
64 0.8922 libc-2.3.3.so nprocdemo __recv
63 0.8783 vmlinux vmlinux netlink_attachskb
61 0.8504 vmlinux nprocdemo kfree
56 0.7807 vmlinux vmlinux __kmalloc
55 0.7668 vmlinux vmlinux schedule
47 0.6552 vmlinux vmlinux __task_mem_cheap
42 0.5855 vmlinux nprocdemo sys_socketcall
40 0.5576 vmlinux nprocdemo fget
37 0.5158 nprocdemo nprocdemo nproc_get_reply
EOT
^ permalink raw reply [flat|nested] 69+ messages in thread
* [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-08 18:40 [0/1][ANNOUNCE] nproc v2: netlink access to /proc information Roger Luethi
@ 2004-09-08 18:41 ` Roger Luethi
2004-09-09 0:35 ` William Lee Irwin III
2004-09-09 11:53 ` Stephen Smalley
2004-09-16 21:43 ` nproc: So? Roger Luethi
1 sibling, 2 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-08 18:41 UTC (permalink / raw)
To: Andrew Morton, linux-kernel
Cc: Albert Cahalan, William Lee Irwin III, Martin J. Bligh,
Paul Jackson

A few notes:
- Access control can be implemented easily. Right now it would be bloat,
  though -- the vast majority of fields in /proc are world-readable
  (/proc/pid/environ being the notable exception).
- Additional process selectors (e.g. select by UID) are not hard to
  add, either, should there ever be a need.
- There are a few things I'm not sure about: For instance, what is a good
  return value for mm_struct related fields wrt kernel threads? I picked
  0, but ~(0) might be preferable because it's distinct.
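
The trade-off in the last note can be shown from the consumer's side.
The sketch below assumes the kernel reported ~(0) for mm_struct related
fields of kernel threads; the patch as posted reports 0. DEMO_NO_MM and
demo_render_kib() are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sentinel: "task has no mm_struct" (a kernel thread). */
#define DEMO_NO_MM (~(uint32_t)0)

/*
 * Render a KiB value into an 8-character column. With a distinct
 * sentinel, a tool can tell "no address space" apart from a process
 * that really uses 0 KiB; with 0 as the return value it cannot.
 */
static int demo_render_kib(uint32_t kib, char *out, size_t outsz)
{
	assert(out && outsz > 0);
	if (kib == DEMO_NO_MM)
		return snprintf(out, outsz, "%8s", "-");
	return snprintf(out, outsz, "%8u", (unsigned)kib);
}
```

The cost of the sentinel is that every consumer has to know about it; a
tool that prints the raw value would show 4294967295 instead of 0.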
Signed-off-by: Roger Luethi <rl@hellgate.ch> diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/netlink.h linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h --- linux-2.6.9-rc1-bk13/include/linux/netlink.h 2004-09-06 18:53:17.000000000 +0200 +++ linux-2.6.9-rc1-bk13-nproc/include/linux/netlink.h 2004-09-06 19:50:56.000000000 +0200 @@ -15,6 +15,7 @@ #define NETLINK_ARPD 8 #define NETLINK_AUDIT 9 /* auditing */ #define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ +#define NETLINK_NPROC 12 /* /proc information */ #define NETLINK_IP6_FW 13 #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ #define NETLINK_TAPBASE 16 /* 16 to 31 are ethertap */ diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/nproc.h linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h --- linux-2.6.9-rc1-bk13/include/linux/nproc.h 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.9-rc1-bk13-nproc/include/linux/nproc.h 2004-09-08 18:56:41.763526856 +0200 @@ -0,0 +1,119 @@ +#ifndef _LINUX_NPROC_H +#define _LINUX_NPROC_H + +#include <linux/config.h> + +#ifndef __KERNEL__ +#define CONFIG_NPROC +#endif + +#ifdef CONFIG_NPROC + +/* Request types */ +#define NPROC_BASE 0x10 +#define NPROC_GET_FIELD_LIST (NPROC_BASE+0) +#define NPROC_GET_LABEL (NPROC_BASE+1) +#define NPROC_GET_GLOBAL (NPROC_BASE+2) +#define NPROC_GET_PS (NPROC_BASE+3) +#define NPROC_GET_PID_LIST (NPROC_BASE+4) + +/* Request flags */ + + +/* Field scopes */ +#define NPROC_SCOPE_MASK 0x70000000 +#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */ +#define NPROC_SCOPE_PROCESS 0x20000000 +#define NPROC_SCOPE_LABEL 0x30000000 + +/* Data types */ +#define NPROC_TYPE_MASK 0x07000000 +#define NPROC_TYPE_STRING 0x01000000 +#define NPROC_TYPE_U32 0x02000000 +#define NPROC_TYPE_UL 0x03000000 +#define NPROC_TYPE_U64 0x04000000 + +/* Access control (unused) */ +#define NPROC_PERM_MASK 0x00300000 +#define NPROC_PERM_USER 0x00100000 +#define NPROC_PERM_ROOT 
0x00200000 + +/* Selectors */ +#define NPROC_SELECT_ALL 0x00000001 +#define NPROC_SELECT_PID 0x00000002 +#define NPROC_SELECT_UID 0x00000003 + +/* Labels */ +#define NPROC_LABEL_FIELD_NAME 0x00000001 +#define NPROC_LABEL_FIELD_FMT 0x00000002 +#define NPROC_LABEL_FIELD_UNIT 0x00000003 +#define NPROC_LABEL_WCHAN 0x00000004 + +/* Field IDs (unique key in bits 0 - 15) */ +#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL) +#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS) +/* Amount of free memory (pages) */ +#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* Size of a page (bytes) */ +#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* There's no guarantee about anything with jiffies. Still useful for some. */ +#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL) +/* Process: VM size (KiB) */ +#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: locked memory (KiB) */ +#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: Memory resident size (KiB) */ +#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | 
NPROC_SCOPE_GLOBAL) +#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS) +#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING) + +#ifdef __KERNEL__ +struct nproc_field { + __u32 id; + const char *label; + const char *fmt; + const char *unit; +}; + +static struct nproc_field labels[] = { + { NPROC_PID, "PID", "%5u", "" }, + { NPROC_NAME, "Name", "%-15s","" }, + { NPROC_MEMFREE, "MemFree", "%8u", "page" }, + { NPROC_PAGESIZE, "PageSize", "%4u", "byte" }, + { NPROC_JIFFIES, "Jiffies", "%10u", "" }, + { NPROC_VMSIZE, "VmSize", "%8u", "KiB" }, + { NPROC_VMLOCK, "VmLock", "%8u", "KiB" }, + { NPROC_VMRSS, "VmRSS", "%8u", "KiB" }, + { NPROC_VMDATA, "VmData", "%8u", "KiB" }, + { NPROC_VMSTACK, "VmStack", "%8u", "KiB" }, + { NPROC_VMEXE, "VmExe", "%8u", "KiB" }, + { NPROC_VMLIB, "VmLib", "%8u", "KiB" }, + { NPROC_UID, "UID", "%5u", "" }, + { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" }, + { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" }, + { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" }, + { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" }, + { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" }, + { NPROC_NR_SLAB, "nr_slab", "%8u", "page" }, + { NPROC_WCHAN, "wchan", "%p", "" }, +#ifdef CONFIG_KALLSYMS + { NPROC_WCHAN_NAME, "wchan_symbol", "%s"}, +#endif +}; +#endif /* __KERNEL__ */ + +#endif /* CONFIG_NPROC */ + +#endif /* _LINUX_NPROC_H */ diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/include/linux/pid.h linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h --- linux-2.6.9-rc1-bk13/include/linux/pid.h 2004-09-06 18:53:17.000000000 +0200 +++ linux-2.6.9-rc1-bk13-nproc/include/linux/pid.h 2004-09-06 19:50:56.000000000 +0200 @@ -37,6 +37,7 @@ extern void FASTCALL(detach_pid(struct t extern struct pid *FASTCALL(find_pid(enum pid_type, int)); extern int alloc_pidmap(void); +extern void *get_pid_map(int); extern void FASTCALL(free_pidmap(int)); 
extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread); diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/Makefile linux-2.6.9-rc1-bk13-nproc/kernel/Makefile --- linux-2.6.9-rc1-bk13/kernel/Makefile 2004-09-06 18:53:17.000000000 +0200 +++ linux-2.6.9-rc1-bk13-nproc/kernel/Makefile 2004-09-06 19:50:56.000000000 +0200 @@ -15,6 +15,7 @@ obj-$(CONFIG_SMP) += cpu.o spinlock.o obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_KALLSYMS) += kallsyms.o +obj-$(CONFIG_NPROC) += nproc.o obj-$(CONFIG_PM) += power/ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_COMPAT) += compat.o diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/nproc.c linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c --- linux-2.6.9-rc1-bk13/kernel/nproc.c 1970-01-01 01:00:00.000000000 +0100 +++ linux-2.6.9-rc1-bk13-nproc/kernel/nproc.c 2004-09-08 18:34:49.000000000 +0200 @@ -0,0 +1,851 @@ +/* + * nproc.c + * + * netlink interface to /proc information. + */ + +#include <linux/skbuff.h> +#include <net/sock.h> +#include <linux/swap.h> /* nr_free_pages() */ +#include <linux/kallsyms.h> /* kallsyms_lookup() */ +#include <linux/pid.h> /* get_pid_map() */ +#include <linux/nproc.h> +#include <asm/bitops.h> + +//#define DEBUG + +/* There must be like 5 million dprintk definitions, so let's add some more */ +#ifdef DEBUG +#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args) +#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args) +#else +#define pdebug(x,args...) +#define pwarn(x,args...) +#endif + +#define perror(x,args...) 
printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args) + +static struct sock *nproc_sock = NULL; + +struct task_mem { + u32 vmdata; + u32 vmstack; + u32 vmexe; + u32 vmlib; +}; + +struct task_mem_cheap { + u32 vmsize; + u32 vmlock; + u32 vmrss; +}; + +/* + * __task_mem/__task_mem_cheap basically duplicate the MMU version of + * task_mem, but they are split by cost and work on structs. + */ + +static void __task_mem(struct task_struct *tsk, struct task_mem *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + unsigned long data = 0, stack = 0, exec = 0, lib = 0; + struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long len = (vma->vm_end - vma->vm_start) >> 10; + if (!vma->vm_file) { + data += len; + if (vma->vm_flags & VM_GROWSDOWN) + stack += len; + continue; + } + if (vma->vm_flags & VM_WRITE) + continue; + if (vma->vm_flags & VM_EXEC) { + exec += len; + if (vma->vm_flags & VM_EXECUTABLE) + continue; + lib += len; + } + } + res->vmdata = data - stack; + res->vmstack = stack; + res->vmexe = exec - lib; + res->vmlib = lib; + up_read(&mm->mmap_sem); + + mmput(mm); + } else { + res->vmdata = 0; + res->vmstack = 0; + res->vmexe = 0; + res->vmlib = 0; + } +} + +static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + res->vmsize = mm->total_vm << (PAGE_SHIFT-10); + res->vmlock = mm->locked_vm << (PAGE_SHIFT-10); + res->vmrss = mm->rss << (PAGE_SHIFT-10); + mmput(mm); + } else { + res->vmsize = 0; + res->vmlock = 0; + res->vmrss = 0; + } +} + +/* + * page_alloc.c already has an extra function broken out to fill a + * struct with information. Cool. Not sure whether pgpgin/pgpgout + * should be left as is or nailed down as kbytes. 
+ */ +static struct page_state *__vmstat(void) +{ + struct page_state *ps; + ps = kmalloc(sizeof(*ps), GFP_KERNEL); + if (!ps) + return ERR_PTR(-ENOMEM); + get_full_page_state(ps); + ps->pgpgin /= 2; /* sectors -> kbytes */ + ps->pgpgout /= 2; + return ps; +} + +/* + * Allocate and prefill an skb. The nlmsghdr provided to the function + * is a pointer to the respective struct in the request message. + */ +static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len) +{ + __u32 seq = nlh->nlmsg_seq; + __u16 type = nlh->nlmsg_type; + __u32 pid = nlh->nlmsg_pid; + struct sk_buff *skb2 = 0; + + skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL); + if (!skb2) { + skb2 = ERR_PTR(-ENOMEM); + goto out; + } + + NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len)); +out: + return skb2; + +nlmsg_failure: /* Used by NLMSG_PUT */ + kfree_skb(skb2); + return NULL; +} + +#define mstore(value, id, buf) \ +({ \ + u32 _type = id & NPROC_TYPE_MASK; \ + switch (_type) { \ + case NPROC_TYPE_U32: { \ + __u32 *p = (u32 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_UL: { \ + unsigned long *p = (unsigned long *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_U64: { \ + __u64 *p = (u64 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + default: \ + perror("Huh? 
Bad type!\n"); \ + } \ +}) + +static char *nproc_ps_field(u32 id, char *buf, task_t *tsk) +{ + struct task_mem tsk_mem; + struct task_mem_cheap tsk_mem_cheap; + + tsk_mem.vmdata = (~0); + tsk_mem_cheap.vmsize = (~0); + + switch (id) { + case NPROC_PID: + mstore(tsk->pid, NPROC_PID, buf); + break; + case NPROC_UID: + mstore(tsk->uid, NPROC_UID, buf); + break; + case NPROC_VMSIZE: + case NPROC_VMLOCK: + case NPROC_VMRSS: + if (tsk_mem_cheap.vmsize == (~0)) + __task_mem_cheap(tsk, &tsk_mem_cheap); + + switch (id) { + case NPROC_VMSIZE: + mstore(tsk_mem_cheap.vmsize, + NPROC_VMSIZE, buf); + break; + case NPROC_VMLOCK: + mstore(tsk_mem_cheap.vmlock, + NPROC_VMLOCK, buf); + break; + case NPROC_VMRSS: + mstore(tsk_mem_cheap.vmrss, + NPROC_VMRSS, buf); + break; + } + break; + case NPROC_VMDATA: + case NPROC_VMSTACK: + case NPROC_VMEXE: + case NPROC_VMLIB: + if (tsk_mem.vmdata == (~0)) + __task_mem(tsk, &tsk_mem); + + switch (id) { + case NPROC_VMDATA: + mstore(tsk_mem.vmdata, NPROC_VMDATA, + buf); + break; + case NPROC_VMSTACK: + mstore(tsk_mem.vmstack, NPROC_VMSTACK, + buf); + break; + case NPROC_VMEXE: + mstore(tsk_mem.vmexe, NPROC_VMEXE, buf); + break; + case NPROC_VMLIB: + mstore(tsk_mem.vmlib, NPROC_VMLIB, buf); + break; + } + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + case NPROC_WCHAN: + mstore(get_wchan(tsk), NPROC_WCHAN, buf); + break; + case NPROC_NAME: + mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf); + strncpy(buf, tsk->comm, sizeof(tsk->comm)); + buf += sizeof(tsk->comm); + break; + case NPROC_NOP_UL: + mstore(0, NPROC_TYPE_UL, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + goto err_inval; + } + return buf; +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Build and send a netlink msg for one PID. 
+ */ +static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk) +{ + int i; + int err = 0; + struct sk_buff *skb2; + char *buf; + struct nlmsghdr *nlh2; + u32 fcnt, *fields; + + fcnt = fdata[0]; + fields = &fdata[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + nlh2 = (struct nlmsghdr *)skb2->data; + buf = NLMSG_DATA(nlh2); + + for (i = 0; i < fcnt; i++) { + buf = nproc_ps_field(fields[i], buf, tsk); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + goto out_free; + } + } + err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0); + if (err > 0) + err = 0; + return err; +out_free: + kfree_skb(skb2); +out: + return err; +} + +/* + * Find task for given pid, grab task lock (caller must unlock). + */ +static task_t *nproc_ps_get_task(int pid) +{ + task_t *tsk; + + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (tsk) + get_task_struct(tsk); + read_unlock(&tasklist_lock); + return tsk; +} + +/* + * Iterate over a list of PIDs. + */ +static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata) +{ + int i; + int err = 0; + u32 tcnt; + u32 *pids; + + if (left < sizeof(tcnt)) + goto err_inval; + left -= sizeof(tcnt); + + tcnt = sdata[0]; + + if (left < (tcnt * sizeof(u32))) + goto err_inval; + left -= tcnt * sizeof(u32); + + if (left) + pwarn("%d bytes left.\n", left); + + pids = &sdata[1]; + + for (i = 0; i < tcnt; i++) { + task_t *tsk; + tsk = nproc_ps_get_task(pids[i]); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8) +#define BITS_PER_PAGE (PAGE_SIZE*8) + +/* + * Iterate over all PIDs. 
+ */ +static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len) +{ + void *map; + int offset, i; + int err = 0; + + for (i = 0; i < PIDMAP_ENTRIES; i++) { + + map = get_pid_map(i); + if (!map) /* done -- there are no holes in pidmap_array */ + break; + if (IS_ERR(map)) /* No PIDs used in this map */ + continue; + offset = 0; + for ( ; ; ) { + int pid; + task_t *tsk; + offset = find_next_bit(map, BITS_PER_PAGE, ++offset); + if (offset >= BITS_PER_PAGE) + break; + pid = offset + i * BITS_PER_PAGE; + tsk = nproc_ps_get_task(pid); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + } + +out: + return err; +} + +static u32 __reply_size_special(u32 id) +{ + u32 len = 0; + + switch (id) { + case NPROC_NAME: + len = sizeof(u32) + + sizeof(((struct task_struct*)0)->comm); + break; + default: + pwarn("Unknown field size in %#x.\n", id); + } + return len; +} + +/* + * Calculates the size of a reply message payload. Alternatively, we could have + * the user space caller supply a number along with the request and bail + * out or realloc later if we find the allocation was too small. More + * responsibility in user space, but faster. 
+ */ +static u32 *__reply_size (u32 *data, u32 *left, u32 *len) +{ + u32 *fields; + u32 fcnt; + int i; + *len = 0; + + if (*left < sizeof(fcnt)) + goto err_inval; + *left -= sizeof(fcnt); + + fcnt = data[0]; + + if (*left < (fcnt * sizeof(u32))) + goto err_inval; + *left -= fcnt * sizeof(u32); + + fields = &data[1]; + + for (i = 0; i < fcnt; i++) { + u32 id = fields[i]; + u32 type = id & NPROC_TYPE_MASK; + pdebug(" %#8.8x.\n", fields[i]); + switch (type) { + case NPROC_TYPE_U32: + *len += sizeof(u32); + break; + case NPROC_TYPE_UL: + *len += sizeof(unsigned long); + break; + case NPROC_TYPE_U64: + *len += sizeof(u64); + break; + default: { /* Special cases */ + u32 slen; + slen = __reply_size_special(id); + if (slen) + *len += slen; + else + goto err_inval; + } + } + } + + return &fields[fcnt]; + +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Call the chosen process selector. Adding additional selectors + * (e.g. select by uid) is easy, but is there a need? + */ +static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid) +{ + int err; + u32 len; + u32 *data = NLMSG_DATA(nlh); + u32 *sdata; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + + sdata = __reply_size(data, &left, &len); + if (IS_ERR(sdata)) { + err = PTR_ERR(sdata); + goto out; + } + + if (left < sizeof(u32)) + goto err_inval; + left -= sizeof(u32); + + switch (*sdata) { + case NPROC_SELECT_ALL: + if (left) + pwarn("%d bytes left.\n", left); + err = nproc_ps_select_all(nlh, data, len); + break; + case NPROC_SELECT_PID: + err = nproc_ps_select_pid(nlh, data, len, + left, sdata + 1); + break; + default: + pwarn("Unknown selection method %#x.\n", *sdata); + goto err_inval; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +static char *nproc_global_field(u32 id, char *buf) +{ + struct page_state *ps = NULL; + + switch (id) { + case NPROC_NR_DIRTY: + case NPROC_NR_WRITEBACK: + case NPROC_NR_UNSTABLE: + case NPROC_NR_PG_TABLE_PGS: + case NPROC_NR_MAPPED: + case NPROC_NR_SLAB: + if (!ps) { + ps = 
__vmstat(); + if (IS_ERR(ps)) { /* Just pass it on */ + buf = (void *)ps; + ps = NULL; + goto out; + } + } + switch (id) { + case NPROC_NR_DIRTY: + mstore(ps->nr_dirty, NPROC_NR_DIRTY, + buf); + break; + case NPROC_NR_WRITEBACK: + mstore(ps->nr_writeback, + NPROC_NR_WRITEBACK, + buf); + break; + case NPROC_NR_UNSTABLE: + mstore(ps->nr_unstable, + NPROC_NR_UNSTABLE, + buf); + break; + case NPROC_NR_PG_TABLE_PGS: + mstore(ps->nr_page_table_pages, + NPROC_NR_PG_TABLE_PGS, + buf); + break; + case NPROC_NR_MAPPED: + mstore(ps->nr_mapped, NPROC_NR_MAPPED, + buf); + break; + case NPROC_NR_SLAB: + mstore(ps->nr_slab, NPROC_NR_SLAB, buf); + break; + } + break; + case NPROC_MEMFREE: + mstore(nr_free_pages(), NPROC_MEMFREE, buf); + break; + case NPROC_PAGESIZE: + mstore(PAGE_SIZE, NPROC_PAGESIZE, buf); + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + buf = ERR_PTR(-EINVAL); + goto out; + } + kfree(ps); +out: + return buf; +} + +static int nproc_get_global(struct nlmsghdr *nlh) +{ + int err, i; + void *errp; + struct sk_buff *skb2; + char *buf; + u32 fcnt, len; + u32 *data = NLMSG_DATA(nlh); + u32 *fields; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + errp = __reply_size(data, &left, &len); + if (IS_ERR(errp)) { + err = PTR_ERR(errp); + goto out; + } + if (left) + pwarn("%d bytes left.\n", left); + + fcnt = data[0]; + fields = &data[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + for (i = 0; i < fcnt; i++) { + buf = nproc_global_field(fields[i], buf); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + kfree_skb(skb2); + goto out; + } + } + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static int find_id(__u32 *data, __u32 *left) +{ + int i; + u32 id; + + if (*left < sizeof(id)) + goto err_inval; + *left -= 
sizeof(sizeof(id)); + + if (*left) + pwarn("%d bytes left.\n", *left); + id = data[1]; + + for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++) + ; /* Do nothing */ + + if (labels[i].id != id) { + pwarn("No matching label found for %#x.\n", id); + goto err_inval; + } + + return i; + +err_inval: + return -EINVAL; +} + + +static int nproc_get_label(struct nlmsghdr *nlh) +{ + int err; + struct sk_buff *skb2; + const char *label; + char *buf; + int len; + u32 ltype; + u32 *data = NLMSG_DATA(nlh); + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + if (left < sizeof(ltype)) + goto err_inval; + left -= sizeof(ltype); + + ltype = data[0]; + + if (ltype == NPROC_LABEL_FIELD_NAME) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].label; + } + else if (ltype == NPROC_LABEL_FIELD_UNIT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].unit; + } + else if (ltype == NPROC_LABEL_FIELD_FMT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].fmt; + } + else if (ltype == NPROC_LABEL_WCHAN) { + char *modname; + unsigned long wchan, size, offset; + char namebuf[128]; + + if (left < sizeof(unsigned long)) + goto err_inval; + left -= sizeof(unsigned long); + + if (left) + pwarn("%d bytes left.\n", left); + + wchan = (unsigned long)data[1]; + label = kallsyms_lookup(wchan, &size, &offset, &modname, + namebuf); + + if (!label) { + pwarn("No ksym found for %#lx.\n", wchan); + goto err_inval; + } + } + else { + pwarn("Unknown label type %#x.\n", ltype); + goto err_inval; + } + + len = strlen(label) + 1; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + strncpy(buf, label, len); + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; + +err_inval: + return -EINVAL; +} + +static int 
nproc_get_list(struct nlmsghdr *nlh) +{ + int err, i, cnt, len; + struct sk_buff *skb2; + u32 *buf; + + cnt = ARRAY_SIZE(labels); + len = (cnt + 1) * sizeof(u32); + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + buf[0] = cnt; + for (i = 0; i < cnt; i++) + buf[i + 1] = labels[i].id; + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static __inline__ int nproc_process_msg(struct sk_buff *skb, + struct nlmsghdr *nlh) +{ + int err = 0; + uid_t uid; + kernel_cap_t caps; + + if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) + goto out; + + nlh->nlmsg_pid = NETLINK_CB(skb).pid; + uid = NETLINK_CB(skb).creds.uid; + caps = NETLINK_CB(skb).eff_cap; + + switch (nlh->nlmsg_type) { + case NPROC_GET_FIELD_LIST: + err = nproc_get_list(nlh); + break; + case NPROC_GET_LABEL: + err = nproc_get_label(nlh); + break; + case NPROC_GET_GLOBAL: + err = nproc_get_global(nlh); + break; + case NPROC_GET_PS: + err = nproc_get_ps(nlh, uid); + break; + default: + pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type); + err = -EINVAL; + } +out: + return err; + +} + +static int nproc_receive_skb(struct sk_buff *skb) +{ + int err = 0; + struct nlmsghdr *nlh; + + if (skb->len < NLMSG_LENGTH(0)) + goto err_inval; + + nlh = (struct nlmsghdr *)skb->data; + if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){ + pwarn("Invalid packet.\n"); + goto err_inval; + } + + err = nproc_process_msg(skb, nlh); + if (err || nlh->nlmsg_flags & NLM_F_ACK) { + pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err, + nlh->nlmsg_type, nlh->nlmsg_flags, + nlh->nlmsg_seq); + netlink_ack(skb, nlh, err); + } + + return err; + +err_inval: + return -EINVAL; +} + +static void nproc_receive(struct sock *sk, int len) +{ + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { + nproc_receive_skb(skb); + kfree_skb(skb); + } +} + 
+static int nproc_init(void) +{ + nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive); + + if (!nproc_sock) { + pwarn("No netlink socket for nproc.\n"); + return -ENODEV; + } + + return 0; +} + +module_init(nproc_init); diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/kernel/pid.c linux-2.6.9-rc1-bk13-nproc/kernel/pid.c --- linux-2.6.9-rc1-bk13/kernel/pid.c 2004-09-06 18:53:17.000000000 +0200 +++ linux-2.6.9-rc1-bk13-nproc/kernel/pid.c 2004-09-06 19:52:59.000000000 +0200 @@ -146,6 +146,17 @@ failure: return -1; } +void *get_pid_map(int idx) +{ + pidmap_t *map = pidmap_array + idx; + if (!map->page) + return NULL; + else if (atomic_read(&map->nr_free) == BITS_PER_PAGE) + return ERR_PTR(-1); + else + return map->page; +} + struct pid * fastcall find_pid(enum pid_type type, int nr) { struct hlist_node *elem; diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-bk13/init/Kconfig linux-2.6.9-rc1-bk13-nproc/init/Kconfig --- linux-2.6.9-rc1-bk13/init/Kconfig 2004-09-06 18:53:17.000000000 +0200 +++ linux-2.6.9-rc1-bk13-nproc/init/Kconfig 2004-09-06 19:50:56.000000000 +0200 @@ -139,6 +139,13 @@ config SYSCTL building a kernel for install/rescue disks or your system is very limited in memory. +config NPROC + bool "Netlink interface to /proc information" + depends on PROC_FS && EXPERIMENTAL + default y + help + Nproc is a netlink interface to /proc information. + config AUDIT bool "Auditing support" default y if SECURITY_SELINUX ^ permalink raw reply [flat|nested] 69+ messages in thread
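
Going by the parsing code in the patch above (__reply_size(),
nproc_get_ps() and nproc_ps_select_pid()), the payload of a NPROC_GET_PS
request appears to be a sequence of u32 words: a field count, the field
IDs, a selector, and for NPROC_SELECT_PID a PID count followed by the
PIDs. The following user-space sketch builds such a payload. It is an
interpretation of the patch, not a documented format; the function name
demo_build_ps_payload() is made up, and the netlink header and socket
handling are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Selector values as defined in the patch's include/linux/nproc.h. */
#define NPROC_SELECT_ALL 0x00000001u
#define NPROC_SELECT_PID 0x00000002u

/*
 * Fill buf with a NPROC_GET_PS payload:
 *   [nfields][field IDs...][selector]([npids][PIDs...])
 * npids == 0 selects all processes. Returns the number of u32 words
 * written, or 0 if buf (cap words) is too small.
 */
static size_t demo_build_ps_payload(uint32_t *buf, size_t cap,
				    const uint32_t *fields, uint32_t nfields,
				    const uint32_t *pids, uint32_t npids)
{
	size_t need = 1 + (size_t)nfields + 1 +
		      (npids ? 1 + (size_t)npids : 0);
	size_t n = 0;

	assert(buf && (fields || nfields == 0) && (pids || npids == 0));
	if (cap < need)
		return 0;

	buf[n++] = nfields;
	for (uint32_t i = 0; i < nfields; i++)
		buf[n++] = fields[i];

	if (npids == 0) {
		buf[n++] = NPROC_SELECT_ALL;
	} else {
		buf[n++] = NPROC_SELECT_PID;
		buf[n++] = npids;
		for (uint32_t i = 0; i < npids; i++)
			buf[n++] = pids[i];
	}
	return n;
}
```

A caller would copy the resulting words behind a struct nlmsghdr with
nlmsg_type set to NPROC_GET_PS and NLM_F_REQUEST in nlmsg_flags, since
nproc_process_msg() ignores messages without that flag.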
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
@ 2004-09-09 0:35 ` William Lee Irwin III
2004-09-09 0:43 ` William Lee Irwin III
2004-09-09 18:43 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
2004-09-09 11:53 ` Stephen Smalley
1 sibling, 2 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 0:35 UTC (permalink / raw)
To: Roger Luethi
Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
>   though -- the vast majority of fields in /proc are world-readable
>   (/proc/pid/environ being the notable exception).
> - Additional process selectors (e.g. select by UID) are not hard to
>   add, either, should there ever be a need.
> - There are a few things I'm not sure about: For instance, what is a good
>   return value for mm_struct related fields wrt kernel threads? I picked
>   0, but ~(0) might be preferable because it's distinct.
> Signed-off-by: Roger Luethi <rl@hellgate.ch>

Any chance you could convert these to use the new vm statistics
accounting?

-- wli

^ permalink raw reply	[flat|nested] 69+ messages in thread
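On the quoted question of 0 versus ~(0) for mm_struct related fields of kernel threads: the practical difference shows up on the consumer side. A rough sketch, assuming the kernel were changed to return all bits set for a task without an mm (the posted patch returns 0). NPROC_NO_MM and format_vm_kib are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sentinel: all bits set means "task has no mm_struct". */
#define NPROC_NO_MM UINT32_MAX

/*
 * Format a VmSize-style u32 (KiB) for display. A distinct sentinel lets
 * the tool print "-" for kernel threads while still showing a genuine 0.
 */
const char *format_vm_kib(uint32_t kib, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	if (kib == NPROC_NO_MM)
		snprintf(buf, len, "-");
	else
		snprintf(buf, len, "%" PRIu32, kib);
	return buf;
}
```

With 0 as the return value the two cases are indistinguishable to the tool. The cost of ~(0) is that a real value of exactly 2^32 - 1 KiB would collide with the sentinel, which seems unlikely in practice but is worth stating in the interface description.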
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 0:35 ` William Lee Irwin III
@ 2004-09-09 0:43 ` William Lee Irwin III
2004-09-09 1:15 ` William Lee Irwin III
2004-09-09 18:43 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
1 sibling, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 0:43 UTC (permalink / raw)
To: Roger Luethi
Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
> Any chance you could convert these to use the new vm statistics
> accounting?

Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
For that you will need to #ifdef on CONFIG_MMU and use the methods
in fs/proc/task_nommu.c and so on.

-- wli

^ permalink raw reply	[flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 0:43 ` William Lee Irwin III
@ 2004-09-09 1:15 ` William Lee Irwin III
2004-09-09 1:17 ` [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4 William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 1:15 UTC (permalink / raw)
To: Roger Luethi
Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>> Any chance you could convert these to use the new vm statistics
>> accounting?

On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
> For that you will need to #ifdef on CONFIG_MMU and use the methods
> in fs/proc/task_nommu.c and so on.

This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
whatsoever to the underlying code were made; rather, this merely
resolves offsets so it applies cleanly.
Compiletested on ia64.

-- wli Index: mm4-2.6.9-rc1/include/linux/netlink.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/netlink.h 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/netlink.h 2004-09-08 17:45:27.500658296 -0700 @@ -15,6 +15,7 @@ #define NETLINK_ARPD 8 #define NETLINK_AUDIT 9 /* auditing */ #define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ +#define NETLINK_NPROC 12 /* /proc information */ #define NETLINK_IP6_FW 13 #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ #define NETLINK_KEVENT 15 /* Kernel messages to userspace */ Index: mm4-2.6.9-rc1/include/linux/nproc.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/nproc.h 2004-04-25 12:31:02.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/nproc.h 2004-09-08 17:45:27.501634858 -0700 @@ -0,0 +1,119 @@ +#ifndef _LINUX_NPROC_H +#define _LINUX_NPROC_H + +#include <linux/config.h> + +#ifndef __KERNEL__ +#define CONFIG_NPROC +#endif + +#ifdef CONFIG_NPROC + +/* Request types */ +#define NPROC_BASE 0x10 +#define NPROC_GET_FIELD_LIST (NPROC_BASE+0) +#define NPROC_GET_LABEL (NPROC_BASE+1) +#define NPROC_GET_GLOBAL (NPROC_BASE+2) +#define NPROC_GET_PS (NPROC_BASE+3) +#define NPROC_GET_PID_LIST (NPROC_BASE+4) + +/* Request flags */ + + +/* Field scopes */ +#define NPROC_SCOPE_MASK 0x70000000 +#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */ +#define NPROC_SCOPE_PROCESS 0x20000000 +#define NPROC_SCOPE_LABEL 0x30000000 + +/* Data types */ +#define NPROC_TYPE_MASK 0x07000000 +#define NPROC_TYPE_STRING 0x01000000 +#define NPROC_TYPE_U32 0x02000000 +#define NPROC_TYPE_UL 0x03000000 +#define NPROC_TYPE_U64 0x04000000 + +/* Access control (unused) */ +#define NPROC_PERM_MASK 0x00300000 +#define NPROC_PERM_USER 0x00100000 +#define NPROC_PERM_ROOT 0x00200000 + +/* Selectors */ +#define NPROC_SELECT_ALL 0x00000001 +#define NPROC_SELECT_PID 0x00000002 +#define NPROC_SELECT_UID 
0x00000003 + +/* Labels */ +#define NPROC_LABEL_FIELD_NAME 0x00000001 +#define NPROC_LABEL_FIELD_FMT 0x00000002 +#define NPROC_LABEL_FIELD_UNIT 0x00000003 +#define NPROC_LABEL_WCHAN 0x00000004 + +/* Field IDs (unique key in bits 0 - 15) */ +#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL) +#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS) +/* Amount of free memory (pages) */ +#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* Size of a page (bytes) */ +#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* There's no guarantee about anything with jiffies. Still useful for some. */ +#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL) +/* Process: VM size (KiB) */ +#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: locked memory (KiB) */ +#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: Memory resident size (KiB) */ +#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_WCHAN (0x00000080 | 
NPROC_TYPE_UL | NPROC_SCOPE_PROCESS) +#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING) + +#ifdef __KERNEL__ +struct nproc_field { + __u32 id; + const char *label; + const char *fmt; + const char *unit; +}; + +static struct nproc_field labels[] = { + { NPROC_PID, "PID", "%5u", "" }, + { NPROC_NAME, "Name", "%-15s","" }, + { NPROC_MEMFREE, "MemFree", "%8u", "page" }, + { NPROC_PAGESIZE, "PageSize", "%4u", "byte" }, + { NPROC_JIFFIES, "Jiffies", "%10u", "" }, + { NPROC_VMSIZE, "VmSize", "%8u", "KiB" }, + { NPROC_VMLOCK, "VmLock", "%8u", "KiB" }, + { NPROC_VMRSS, "VmRSS", "%8u", "KiB" }, + { NPROC_VMDATA, "VmData", "%8u", "KiB" }, + { NPROC_VMSTACK, "VmStack", "%8u", "KiB" }, + { NPROC_VMEXE, "VmExe", "%8u", "KiB" }, + { NPROC_VMLIB, "VmLib", "%8u", "KiB" }, + { NPROC_UID, "UID", "%5u", "" }, + { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" }, + { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" }, + { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" }, + { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" }, + { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" }, + { NPROC_NR_SLAB, "nr_slab", "%8u", "page" }, + { NPROC_WCHAN, "wchan", "%p", "" }, +#ifdef CONFIG_KALLSYMS + { NPROC_WCHAN_NAME, "wchan_symbol", "%s"}, +#endif +}; +#endif /* __KERNEL__ */ + +#endif /* CONFIG_NPROC */ + +#endif /* _LINUX_NPROC_H */ Index: mm4-2.6.9-rc1/include/linux/pid.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/pid.h 2004-09-08 06:10:36.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/pid.h 2004-09-08 17:45:27.501634858 -0700 @@ -37,6 +37,7 @@ extern struct pid *FASTCALL(find_pid(enum pid_type, int)); extern int alloc_pidmap(void); +extern void *get_pid_map(int); extern void FASTCALL(free_pidmap(int)); extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread); Index: mm4-2.6.9-rc1/init/Kconfig =================================================================== --- 
mm4-2.6.9-rc1.orig/init/Kconfig 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/init/Kconfig 2004-09-08 17:45:27.504564546 -0700 @@ -139,6 +139,13 @@ building a kernel for install/rescue disks or your system is very limited in memory. +config NPROC + bool "Netlink interface to /proc information" + depends on PROC_FS && EXPERIMENTAL + default y + help + Nproc is a netlink interface to /proc information. + config AUDIT bool "Auditing support" default y if SECURITY_SELINUX Index: mm4-2.6.9-rc1/kernel/Makefile =================================================================== --- mm4-2.6.9-rc1.orig/kernel/Makefile 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/kernel/Makefile 2004-09-08 17:45:27.501634858 -0700 @@ -15,6 +15,7 @@ obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_KALLSYMS) += kallsyms.o +obj-$(CONFIG_NPROC) += nproc.o obj-$(CONFIG_PM) += power/ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_KEXEC) += kexec.o Index: mm4-2.6.9-rc1/kernel/nproc.c =================================================================== --- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-04-25 12:31:02.000000000 -0700 +++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-08 17:45:27.503587983 -0700 @@ -0,0 +1,851 @@ +/* + * nproc.c + * + * netlink interface to /proc information. + */ + +#include <linux/skbuff.h> +#include <net/sock.h> +#include <linux/swap.h> /* nr_free_pages() */ +#include <linux/kallsyms.h> /* kallsyms_lookup() */ +#include <linux/pid.h> /* get_pid_map() */ +#include <linux/nproc.h> +#include <asm/bitops.h> + +//#define DEBUG + +/* There must be like 5 million dprintk definitions, so let's add some more */ +#ifdef DEBUG +#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args) +#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args) +#else +#define pdebug(x,args...) +#define pwarn(x,args...) +#endif + +#define perror(x,args...) 
printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args) + +static struct sock *nproc_sock = NULL; + +struct task_mem { + u32 vmdata; + u32 vmstack; + u32 vmexe; + u32 vmlib; +}; + +struct task_mem_cheap { + u32 vmsize; + u32 vmlock; + u32 vmrss; +}; + +/* + * __task_mem/__task_mem_cheap basically duplicate the MMU version of + * task_mem, but they are split by cost and work on structs. + */ + +static void __task_mem(struct task_struct *tsk, struct task_mem *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + unsigned long data = 0, stack = 0, exec = 0, lib = 0; + struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long len = (vma->vm_end - vma->vm_start) >> 10; + if (!vma->vm_file) { + data += len; + if (vma->vm_flags & VM_GROWSDOWN) + stack += len; + continue; + } + if (vma->vm_flags & VM_WRITE) + continue; + if (vma->vm_flags & VM_EXEC) { + exec += len; + if (vma->vm_flags & VM_EXECUTABLE) + continue; + lib += len; + } + } + res->vmdata = data - stack; + res->vmstack = stack; + res->vmexe = exec - lib; + res->vmlib = lib; + up_read(&mm->mmap_sem); + + mmput(mm); + } else { + res->vmdata = 0; + res->vmstack = 0; + res->vmexe = 0; + res->vmlib = 0; + } +} + +static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + res->vmsize = mm->total_vm << (PAGE_SHIFT-10); + res->vmlock = mm->locked_vm << (PAGE_SHIFT-10); + res->vmrss = mm->rss << (PAGE_SHIFT-10); + mmput(mm); + } else { + res->vmsize = 0; + res->vmlock = 0; + res->vmrss = 0; + } +} + +/* + * page_alloc.c already has an extra function broken out to fill a + * struct with information. Cool. Not sure whether pgpgin/pgpgout + * should be left as is or nailed down as kbytes. 
+ */ +static struct page_state *__vmstat(void) +{ + struct page_state *ps; + ps = kmalloc(sizeof(*ps), GFP_KERNEL); + if (!ps) + return ERR_PTR(-ENOMEM); + get_full_page_state(ps); + ps->pgpgin /= 2; /* sectors -> kbytes */ + ps->pgpgout /= 2; + return ps; +} + +/* + * Allocate and prefill an skb. The nlmsghdr provided to the function + * is a pointer to the respective struct in the request message. + */ +static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len) +{ + __u32 seq = nlh->nlmsg_seq; + __u16 type = nlh->nlmsg_type; + __u32 pid = nlh->nlmsg_pid; + struct sk_buff *skb2 = 0; + + skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL); + if (!skb2) { + skb2 = ERR_PTR(-ENOMEM); + goto out; + } + + NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len)); +out: + return skb2; + +nlmsg_failure: /* Used by NLMSG_PUT */ + kfree_skb(skb2); + return NULL; +} + +#define mstore(value, id, buf) \ +({ \ + u32 _type = id & NPROC_TYPE_MASK; \ + switch (_type) { \ + case NPROC_TYPE_U32: { \ + __u32 *p = (u32 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_UL: { \ + unsigned long *p = (unsigned long *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_U64: { \ + __u64 *p = (u64 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + default: \ + perror("Huh? 
Bad type!\n"); \ + } \ +}) + +static char *nproc_ps_field(u32 id, char *buf, task_t *tsk) +{ + struct task_mem tsk_mem; + struct task_mem_cheap tsk_mem_cheap; + + tsk_mem.vmdata = (~0); + tsk_mem_cheap.vmsize = (~0); + + switch (id) { + case NPROC_PID: + mstore(tsk->pid, NPROC_PID, buf); + break; + case NPROC_UID: + mstore(tsk->uid, NPROC_UID, buf); + break; + case NPROC_VMSIZE: + case NPROC_VMLOCK: + case NPROC_VMRSS: + if (tsk_mem_cheap.vmsize == (~0)) + __task_mem_cheap(tsk, &tsk_mem_cheap); + + switch (id) { + case NPROC_VMSIZE: + mstore(tsk_mem_cheap.vmsize, + NPROC_VMSIZE, buf); + break; + case NPROC_VMLOCK: + mstore(tsk_mem_cheap.vmlock, + NPROC_VMLOCK, buf); + break; + case NPROC_VMRSS: + mstore(tsk_mem_cheap.vmrss, + NPROC_VMRSS, buf); + break; + } + break; + case NPROC_VMDATA: + case NPROC_VMSTACK: + case NPROC_VMEXE: + case NPROC_VMLIB: + if (tsk_mem.vmdata == (~0)) + __task_mem(tsk, &tsk_mem); + + switch (id) { + case NPROC_VMDATA: + mstore(tsk_mem.vmdata, NPROC_VMDATA, + buf); + break; + case NPROC_VMSTACK: + mstore(tsk_mem.vmstack, NPROC_VMSTACK, + buf); + break; + case NPROC_VMEXE: + mstore(tsk_mem.vmexe, NPROC_VMEXE, buf); + break; + case NPROC_VMLIB: + mstore(tsk_mem.vmlib, NPROC_VMLIB, buf); + break; + } + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + case NPROC_WCHAN: + mstore(get_wchan(tsk), NPROC_WCHAN, buf); + break; + case NPROC_NAME: + mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf); + strncpy(buf, tsk->comm, sizeof(tsk->comm)); + buf += sizeof(tsk->comm); + break; + case NPROC_NOP_UL: + mstore(0, NPROC_TYPE_UL, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + goto err_inval; + } + return buf; +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Build and send a netlink msg for one PID. 
+ */ +static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk) +{ + int i; + int err = 0; + struct sk_buff *skb2; + char *buf; + struct nlmsghdr *nlh2; + u32 fcnt, *fields; + + fcnt = fdata[0]; + fields = &fdata[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + nlh2 = (struct nlmsghdr *)skb2->data; + buf = NLMSG_DATA(nlh2); + + for (i = 0; i < fcnt; i++) { + buf = nproc_ps_field(fields[i], buf, tsk); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + goto out_free; + } + } + err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0); + if (err > 0) + err = 0; + return err; +out_free: + kfree_skb(skb2); +out: + return err; +} + +/* + * Find task for given pid, grab task lock (caller must unlock). + */ +static task_t *nproc_ps_get_task(int pid) +{ + task_t *tsk; + + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (tsk) + get_task_struct(tsk); + read_unlock(&tasklist_lock); + return tsk; +} + +/* + * Iterate over a list of PIDs. + */ +static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata) +{ + int i; + int err = 0; + u32 tcnt; + u32 *pids; + + if (left < sizeof(tcnt)) + goto err_inval; + left -= sizeof(tcnt); + + tcnt = sdata[0]; + + if (left < (tcnt * sizeof(u32))) + goto err_inval; + left -= tcnt * sizeof(u32); + + if (left) + pwarn("%d bytes left.\n", left); + + pids = &sdata[1]; + + for (i = 0; i < tcnt; i++) { + task_t *tsk; + tsk = nproc_ps_get_task(pids[i]); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8) +#define BITS_PER_PAGE (PAGE_SIZE*8) + +/* + * Iterate over all PIDs. 
+ */ +static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len) +{ + void *map; + int offset, i; + int err = 0; + + for (i = 0; i < PIDMAP_ENTRIES; i++) { + + map = get_pid_map(i); + if (!map) /* done -- there are no holes in pidmap_array */ + break; + if (IS_ERR(map)) /* No PIDs used in this map */ + continue; + offset = 0; + for ( ; ; ) { + int pid; + task_t *tsk; + offset = find_next_bit(map, BITS_PER_PAGE, ++offset); + if (offset >= BITS_PER_PAGE) + break; + pid = offset + i * BITS_PER_PAGE; + tsk = nproc_ps_get_task(pid); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + } + +out: + return err; +} + +static u32 __reply_size_special(u32 id) +{ + u32 len = 0; + + switch (id) { + case NPROC_NAME: + len = sizeof(u32) + + sizeof(((struct task_struct*)0)->comm); + break; + default: + pwarn("Unknown field size in %#x.\n", id); + } + return len; +} + +/* + * Calculates the size of a reply message payload. Alternatively, we could have + * the user space caller supply a number along with the request and bail + * out or realloc later if we find the allocation was too small. More + * responsibility in user space, but faster. 
+ */ +static u32 *__reply_size (u32 *data, u32 *left, u32 *len) +{ + u32 *fields; + u32 fcnt; + int i; + *len = 0; + + if (*left < sizeof(fcnt)) + goto err_inval; + *left -= sizeof(fcnt); + + fcnt = data[0]; + + if (*left < (fcnt * sizeof(u32))) + goto err_inval; + *left -= fcnt * sizeof(u32); + + fields = &data[1]; + + for (i = 0; i < fcnt; i++) { + u32 id = fields[i]; + u32 type = id & NPROC_TYPE_MASK; + pdebug(" %#8.8x.\n", fields[i]); + switch (type) { + case NPROC_TYPE_U32: + *len += sizeof(u32); + break; + case NPROC_TYPE_UL: + *len += sizeof(unsigned long); + break; + case NPROC_TYPE_U64: + *len += sizeof(u64); + break; + default: { /* Special cases */ + u32 slen; + slen = __reply_size_special(id); + if (slen) + *len += slen; + else + goto err_inval; + } + } + } + + return &fields[fcnt]; + +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Call the chosen process selector. Adding additional selectors + * (e.g. select by uid) is easy, but is there a need? + */ +static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid) +{ + int err; + u32 len; + u32 *data = NLMSG_DATA(nlh); + u32 *sdata; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + + sdata = __reply_size(data, &left, &len); + if (IS_ERR(sdata)) { + err = PTR_ERR(sdata); + goto out; + } + + if (left < sizeof(u32)) + goto err_inval; + left -= sizeof(u32); + + switch (*sdata) { + case NPROC_SELECT_ALL: + if (left) + pwarn("%d bytes left.\n", left); + err = nproc_ps_select_all(nlh, data, len); + break; + case NPROC_SELECT_PID: + err = nproc_ps_select_pid(nlh, data, len, + left, sdata + 1); + break; + default: + pwarn("Unknown selection method %#x.\n", *sdata); + goto err_inval; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +static char *nproc_global_field(u32 id, char *buf) +{ + struct page_state *ps = NULL; + + switch (id) { + case NPROC_NR_DIRTY: + case NPROC_NR_WRITEBACK: + case NPROC_NR_UNSTABLE: + case NPROC_NR_PG_TABLE_PGS: + case NPROC_NR_MAPPED: + case NPROC_NR_SLAB: + if (!ps) { + ps = 
__vmstat(); + if (IS_ERR(ps)) { /* Just pass it on */ + buf = (void *)ps; + ps = NULL; + goto out; + } + } + switch (id) { + case NPROC_NR_DIRTY: + mstore(ps->nr_dirty, NPROC_NR_DIRTY, + buf); + break; + case NPROC_NR_WRITEBACK: + mstore(ps->nr_writeback, + NPROC_NR_WRITEBACK, + buf); + break; + case NPROC_NR_UNSTABLE: + mstore(ps->nr_unstable, + NPROC_NR_UNSTABLE, + buf); + break; + case NPROC_NR_PG_TABLE_PGS: + mstore(ps->nr_page_table_pages, + NPROC_NR_PG_TABLE_PGS, + buf); + break; + case NPROC_NR_MAPPED: + mstore(ps->nr_mapped, NPROC_NR_MAPPED, + buf); + break; + case NPROC_NR_SLAB: + mstore(ps->nr_slab, NPROC_NR_SLAB, buf); + break; + } + break; + case NPROC_MEMFREE: + mstore(nr_free_pages(), NPROC_MEMFREE, buf); + break; + case NPROC_PAGESIZE: + mstore(PAGE_SIZE, NPROC_PAGESIZE, buf); + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + buf = ERR_PTR(-EINVAL); + goto out; + } + kfree(ps); +out: + return buf; +} + +static int nproc_get_global(struct nlmsghdr *nlh) +{ + int err, i; + void *errp; + struct sk_buff *skb2; + char *buf; + u32 fcnt, len; + u32 *data = NLMSG_DATA(nlh); + u32 *fields; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + errp = __reply_size(data, &left, &len); + if (IS_ERR(errp)) { + err = PTR_ERR(errp); + goto out; + } + if (left) + pwarn("%d bytes left.\n", left); + + fcnt = data[0]; + fields = &data[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + for (i = 0; i < fcnt; i++) { + buf = nproc_global_field(fields[i], buf); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + kfree_skb(skb2); + goto out; + } + } + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static int find_id(__u32 *data, __u32 *left) +{ + int i; + u32 id; + + if (*left < sizeof(id)) + goto err_inval; + *left -= 
sizeof(sizeof(id)); + + if (*left) + pwarn("%d bytes left.\n", *left); + id = data[1]; + + for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++) + ; /* Do nothing */ + + if (labels[i].id != id) { + pwarn("No matching label found for %#x.\n", id); + goto err_inval; + } + + return i; + +err_inval: + return -EINVAL; +} + + +static int nproc_get_label(struct nlmsghdr *nlh) +{ + int err; + struct sk_buff *skb2; + const char *label; + char *buf; + int len; + u32 ltype; + u32 *data = NLMSG_DATA(nlh); + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + if (left < sizeof(ltype)) + goto err_inval; + left -= sizeof(ltype); + + ltype = data[0]; + + if (ltype == NPROC_LABEL_FIELD_NAME) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].label; + } + else if (ltype == NPROC_LABEL_FIELD_UNIT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].unit; + } + else if (ltype == NPROC_LABEL_FIELD_FMT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].fmt; + } + else if (ltype == NPROC_LABEL_WCHAN) { + char *modname; + unsigned long wchan, size, offset; + char namebuf[128]; + + if (left < sizeof(unsigned long)) + goto err_inval; + left -= sizeof(unsigned long); + + if (left) + pwarn("%d bytes left.\n", left); + + wchan = (unsigned long)data[1]; + label = kallsyms_lookup(wchan, &size, &offset, &modname, + namebuf); + + if (!label) { + pwarn("No ksym found for %#lx.\n", wchan); + goto err_inval; + } + } + else { + pwarn("Unknown label type %#x.\n", ltype); + goto err_inval; + } + + len = strlen(label) + 1; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + strncpy(buf, label, len); + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; + +err_inval: + return -EINVAL; +} + +static int 
nproc_get_list(struct nlmsghdr *nlh) +{ + int err, i, cnt, len; + struct sk_buff *skb2; + u32 *buf; + + cnt = ARRAY_SIZE(labels); + len = (cnt + 1) * sizeof(u32); + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + buf[0] = cnt; + for (i = 0; i < cnt; i++) + buf[i + 1] = labels[i].id; + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static __inline__ int nproc_process_msg(struct sk_buff *skb, + struct nlmsghdr *nlh) +{ + int err = 0; + uid_t uid; + kernel_cap_t caps; + + if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) + goto out; + + nlh->nlmsg_pid = NETLINK_CB(skb).pid; + uid = NETLINK_CB(skb).creds.uid; + caps = NETLINK_CB(skb).eff_cap; + + switch (nlh->nlmsg_type) { + case NPROC_GET_FIELD_LIST: + err = nproc_get_list(nlh); + break; + case NPROC_GET_LABEL: + err = nproc_get_label(nlh); + break; + case NPROC_GET_GLOBAL: + err = nproc_get_global(nlh); + break; + case NPROC_GET_PS: + err = nproc_get_ps(nlh, uid); + break; + default: + pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type); + err = -EINVAL; + } +out: + return err; + +} + +static int nproc_receive_skb(struct sk_buff *skb) +{ + int err = 0; + struct nlmsghdr *nlh; + + if (skb->len < NLMSG_LENGTH(0)) + goto err_inval; + + nlh = (struct nlmsghdr *)skb->data; + if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){ + pwarn("Invalid packet.\n"); + goto err_inval; + } + + err = nproc_process_msg(skb, nlh); + if (err || nlh->nlmsg_flags & NLM_F_ACK) { + pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err, + nlh->nlmsg_type, nlh->nlmsg_flags, + nlh->nlmsg_seq); + netlink_ack(skb, nlh, err); + } + + return err; + +err_inval: + return -EINVAL; +} + +static void nproc_receive(struct sock *sk, int len) +{ + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { + nproc_receive_skb(skb); + kfree_skb(skb); + } +} + 
+static int nproc_init(void)
+{
+	nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive);
+
+	if (!nproc_sock) {
+		pwarn("No netlink socket for nproc.\n");
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+module_init(nproc_init);
Index: mm4-2.6.9-rc1/kernel/pid.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/pid.c	2004-09-08 06:10:54.000000000 -0700
+++ mm4-2.6.9-rc1/kernel/pid.c	2004-09-08 17:45:27.504564546 -0700
@@ -148,6 +148,17 @@
 	return -1;
 }
 
+void *get_pid_map(int idx)
+{
+	pidmap_t *map = pidmap_array + idx;
+	if (!map->page)
+		return NULL;
+	else if (atomic_read(&map->nr_free) == BITS_PER_PAGE)
+		return ERR_PTR(-1);
+	else
+		return map->page;
+}
+
 struct pid * fastcall find_pid(enum pid_type type, int nr)
 {
 	struct hlist_node *elem;

^ permalink raw reply	[flat|nested] 69+ messages in thread
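The comment above __reply_size() in the patch mentions that user space could compute the reply size itself. As an illustration of what that would involve, here is a sketch that mirrors the kernel's per-type accounting for one process record. nproc_expected_len and ASSUMED_COMM_LEN are hypothetical names; the value 16 for sizeof(tsk->comm) is an assumption about 2.6 kernels, not something the patch states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values copied from the patch's include/linux/nproc.h. */
#define NPROC_TYPE_MASK     0x07000000u
#define NPROC_TYPE_STRING   0x01000000u
#define NPROC_TYPE_U32      0x02000000u
#define NPROC_TYPE_UL       0x03000000u
#define NPROC_TYPE_U64      0x04000000u
#define NPROC_SCOPE_PROCESS 0x20000000u
#define NPROC_NAME (0x00000002u | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)

/* Assumption: sizeof(tsk->comm) in the running kernel. */
#define ASSUMED_COMM_LEN 16

/*
 * Hypothetical user-space mirror of __reply_size(): payload bytes the
 * kernel would emit per process for the requested field IDs.
 * Returns 0 if any ID has no known size, where the kernel would answer
 * with -EINVAL.
 */
size_t nproc_expected_len(const uint32_t *ids, size_t cnt)
{
	size_t len = 0;
	size_t i;

	assert(ids != NULL || cnt == 0);
	for (i = 0; i < cnt; i++) {
		switch (ids[i] & NPROC_TYPE_MASK) {
		case NPROC_TYPE_U32:
			len += sizeof(uint32_t);
			break;
		case NPROC_TYPE_UL:
			/* Only right if user and kernel agree on long. */
			len += sizeof(unsigned long);
			break;
		case NPROC_TYPE_U64:
			len += sizeof(uint64_t);
			break;
		default:
			/* Strings: NPROC_NAME is a u32 length plus comm. */
			if (ids[i] != NPROC_NAME)
				return 0;
			len += sizeof(uint32_t) + ASSUMED_COMM_LEN;
			break;
		}
	}
	return len;
}
```

Under these assumptions a request for PID, Name, VmSize and Jiffies yields 4 + 20 + 4 + 8 = 36 bytes per process. The NPROC_TYPE_UL case is the fragile one: a 32-bit tool on a 64-bit kernel would compute the wrong size, which may be an argument for keeping the calculation in the kernel as the patch does.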
* [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4
2004-09-09 1:15 ` William Lee Irwin III
@ 2004-09-09 1:17 ` William Lee Irwin III
2004-09-09 1:21 ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 1:17 UTC (permalink / raw)
To: Roger Luethi
Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Wed, Sep 08, 2004 at 05:35:29PM -0700, William Lee Irwin III wrote:
>>> Any chance you could convert these to use the new vm statistics
>>> accounting?

On Wed, Sep 08, 2004 at 05:43:20PM -0700, William Lee Irwin III wrote:
>> Hmm, there's a more serious issue; CONFIG_MMU=n will barf on these.
>> For that you will need to #ifdef on CONFIG_MMU and use the methods
>> in fs/proc/task_nommu.c and so on.

On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
> whatsoever to the underlying code were made; rather, this merely
> resolves offsets so it applies cleanly.
> Compiletested on ia64.

Repost with appropriate Subject: line.

-- wli Index: mm4-2.6.9-rc1/include/linux/netlink.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/netlink.h 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/netlink.h 2004-09-08 17:45:27.500658296 -0700 @@ -15,6 +15,7 @@ #define NETLINK_ARPD 8 #define NETLINK_AUDIT 9 /* auditing */ #define NETLINK_ROUTE6 11 /* af_inet6 route comm channel */ +#define NETLINK_NPROC 12 /* /proc information */ #define NETLINK_IP6_FW 13 #define NETLINK_DNRTMSG 14 /* DECnet routing messages */ #define NETLINK_KEVENT 15 /* Kernel messages to userspace */ Index: mm4-2.6.9-rc1/include/linux/nproc.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/nproc.h 2004-04-25 12:31:02.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/nproc.h 2004-09-08 17:45:27.501634858 -0700 @@ -0,0 +1,119 @@ +#ifndef _LINUX_NPROC_H +#define _LINUX_NPROC_H + +#include <linux/config.h> + +#ifndef __KERNEL__ +#define CONFIG_NPROC +#endif + +#ifdef CONFIG_NPROC + +/* Request types */ +#define NPROC_BASE 0x10 +#define NPROC_GET_FIELD_LIST (NPROC_BASE+0) +#define NPROC_GET_LABEL (NPROC_BASE+1) +#define NPROC_GET_GLOBAL (NPROC_BASE+2) +#define NPROC_GET_PS (NPROC_BASE+3) +#define NPROC_GET_PID_LIST (NPROC_BASE+4) + +/* Request flags */ + + +/* Field scopes */ +#define NPROC_SCOPE_MASK 0x70000000 +#define NPROC_SCOPE_GLOBAL 0x10000000 /* Global w/o arguments */ +#define NPROC_SCOPE_PROCESS 0x20000000 +#define NPROC_SCOPE_LABEL 0x30000000 + +/* Data types */ +#define NPROC_TYPE_MASK 0x07000000 +#define NPROC_TYPE_STRING 0x01000000 +#define NPROC_TYPE_U32 0x02000000 +#define NPROC_TYPE_UL 0x03000000 +#define NPROC_TYPE_U64 0x04000000 + +/* Access control (unused) */ +#define NPROC_PERM_MASK 0x00300000 +#define NPROC_PERM_USER 0x00100000 +#define NPROC_PERM_ROOT 0x00200000 + +/* Selectors */ +#define NPROC_SELECT_ALL 0x00000001 +#define NPROC_SELECT_PID 0x00000002 +#define NPROC_SELECT_UID 
0x00000003 + +/* Labels */ +#define NPROC_LABEL_FIELD_NAME 0x00000001 +#define NPROC_LABEL_FIELD_FMT 0x00000002 +#define NPROC_LABEL_FIELD_UNIT 0x00000003 +#define NPROC_LABEL_WCHAN 0x00000004 + +/* Field IDs (unique key in bits 0 - 15) */ +#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL) +#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS) +/* Amount of free memory (pages) */ +#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* Size of a page (bytes) */ +#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL) +/* There's no guarantee about anything with jiffies. Still useful for some. */ +#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL) +/* Process: VM size (KiB) */ +#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: locked memory (KiB) */ +#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +/* Process: Memory resident size (KiB) */ +#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS) +#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL) +#define NPROC_WCHAN (0x00000080 | 
NPROC_TYPE_UL | NPROC_SCOPE_PROCESS) +#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING) + +#ifdef __KERNEL__ +struct nproc_field { + __u32 id; + const char *label; + const char *fmt; + const char *unit; +}; + +static struct nproc_field labels[] = { + { NPROC_PID, "PID", "%5u", "" }, + { NPROC_NAME, "Name", "%-15s","" }, + { NPROC_MEMFREE, "MemFree", "%8u", "page" }, + { NPROC_PAGESIZE, "PageSize", "%4u", "byte" }, + { NPROC_JIFFIES, "Jiffies", "%10u", "" }, + { NPROC_VMSIZE, "VmSize", "%8u", "KiB" }, + { NPROC_VMLOCK, "VmLock", "%8u", "KiB" }, + { NPROC_VMRSS, "VmRSS", "%8u", "KiB" }, + { NPROC_VMDATA, "VmData", "%8u", "KiB" }, + { NPROC_VMSTACK, "VmStack", "%8u", "KiB" }, + { NPROC_VMEXE, "VmExe", "%8u", "KiB" }, + { NPROC_VMLIB, "VmLib", "%8u", "KiB" }, + { NPROC_UID, "UID", "%5u", "" }, + { NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" }, + { NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" }, + { NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" }, + { NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" }, + { NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" }, + { NPROC_NR_SLAB, "nr_slab", "%8u", "page" }, + { NPROC_WCHAN, "wchan", "%p", "" }, +#ifdef CONFIG_KALLSYMS + { NPROC_WCHAN_NAME, "wchan_symbol", "%s"}, +#endif +}; +#endif /* __KERNEL__ */ + +#endif /* CONFIG_NPROC */ + +#endif /* _LINUX_NPROC_H */ Index: mm4-2.6.9-rc1/include/linux/pid.h =================================================================== --- mm4-2.6.9-rc1.orig/include/linux/pid.h 2004-09-08 06:10:36.000000000 -0700 +++ mm4-2.6.9-rc1/include/linux/pid.h 2004-09-08 17:45:27.501634858 -0700 @@ -37,6 +37,7 @@ extern struct pid *FASTCALL(find_pid(enum pid_type, int)); extern int alloc_pidmap(void); +extern void *get_pid_map(int); extern void FASTCALL(free_pidmap(int)); extern void switch_exec_pids(struct task_struct *leader, struct task_struct *thread); Index: mm4-2.6.9-rc1/init/Kconfig =================================================================== --- 
mm4-2.6.9-rc1.orig/init/Kconfig 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/init/Kconfig 2004-09-08 17:45:27.504564546 -0700 @@ -139,6 +139,13 @@ building a kernel for install/rescue disks or your system is very limited in memory. +config NPROC + bool "Netlink interface to /proc information" + depends on PROC_FS && EXPERIMENTAL + default y + help + Nproc is a netlink interface to /proc information. + config AUDIT bool "Auditing support" default y if SECURITY_SELINUX Index: mm4-2.6.9-rc1/kernel/Makefile =================================================================== --- mm4-2.6.9-rc1.orig/kernel/Makefile 2004-09-08 06:10:50.000000000 -0700 +++ mm4-2.6.9-rc1/kernel/Makefile 2004-09-08 17:45:27.501634858 -0700 @@ -15,6 +15,7 @@ obj-$(CONFIG_UID16) += uid16.o obj-$(CONFIG_MODULES) += module.o obj-$(CONFIG_KALLSYMS) += kallsyms.o +obj-$(CONFIG_NPROC) += nproc.o obj-$(CONFIG_PM) += power/ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o obj-$(CONFIG_KEXEC) += kexec.o Index: mm4-2.6.9-rc1/kernel/nproc.c =================================================================== --- mm4-2.6.9-rc1.orig/kernel/nproc.c 2004-04-25 12:31:02.000000000 -0700 +++ mm4-2.6.9-rc1/kernel/nproc.c 2004-09-08 17:45:27.503587983 -0700 @@ -0,0 +1,851 @@ +/* + * nproc.c + * + * netlink interface to /proc information. + */ + +#include <linux/skbuff.h> +#include <net/sock.h> +#include <linux/swap.h> /* nr_free_pages() */ +#include <linux/kallsyms.h> /* kallsyms_lookup() */ +#include <linux/pid.h> /* get_pid_map() */ +#include <linux/nproc.h> +#include <asm/bitops.h> + +//#define DEBUG + +/* There must be like 5 million dprintk definitions, so let's add some more */ +#ifdef DEBUG +#define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args) +#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args) +#else +#define pdebug(x,args...) +#define pwarn(x,args...) +#endif + +#define perror(x,args...) 
printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args) + +static struct sock *nproc_sock = NULL; + +struct task_mem { + u32 vmdata; + u32 vmstack; + u32 vmexe; + u32 vmlib; +}; + +struct task_mem_cheap { + u32 vmsize; + u32 vmlock; + u32 vmrss; +}; + +/* + * __task_mem/__task_mem_cheap basically duplicate the MMU version of + * task_mem, but they are split by cost and work on structs. + */ + +static void __task_mem(struct task_struct *tsk, struct task_mem *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + unsigned long data = 0, stack = 0, exec = 0, lib = 0; + struct vm_area_struct *vma; + + down_read(&mm->mmap_sem); + for (vma = mm->mmap; vma; vma = vma->vm_next) { + unsigned long len = (vma->vm_end - vma->vm_start) >> 10; + if (!vma->vm_file) { + data += len; + if (vma->vm_flags & VM_GROWSDOWN) + stack += len; + continue; + } + if (vma->vm_flags & VM_WRITE) + continue; + if (vma->vm_flags & VM_EXEC) { + exec += len; + if (vma->vm_flags & VM_EXECUTABLE) + continue; + lib += len; + } + } + res->vmdata = data - stack; + res->vmstack = stack; + res->vmexe = exec - lib; + res->vmlib = lib; + up_read(&mm->mmap_sem); + + mmput(mm); + } else { + res->vmdata = 0; + res->vmstack = 0; + res->vmexe = 0; + res->vmlib = 0; + } +} + +static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res) +{ + struct mm_struct *mm = get_task_mm(tsk); + if (mm) { + res->vmsize = mm->total_vm << (PAGE_SHIFT-10); + res->vmlock = mm->locked_vm << (PAGE_SHIFT-10); + res->vmrss = mm->rss << (PAGE_SHIFT-10); + mmput(mm); + } else { + res->vmsize = 0; + res->vmlock = 0; + res->vmrss = 0; + } +} + +/* + * page_alloc.c already has an extra function broken out to fill a + * struct with information. Cool. Not sure whether pgpgin/pgpgout + * should be left as is or nailed down as kbytes. 
+ */ +static struct page_state *__vmstat(void) +{ + struct page_state *ps; + ps = kmalloc(sizeof(*ps), GFP_KERNEL); + if (!ps) + return ERR_PTR(-ENOMEM); + get_full_page_state(ps); + ps->pgpgin /= 2; /* sectors -> kbytes */ + ps->pgpgout /= 2; + return ps; +} + +/* + * Allocate and prefill an skb. The nlmsghdr provided to the function + * is a pointer to the respective struct in the request message. + */ +static struct sk_buff *nproc_alloc_nlmsg(struct nlmsghdr *nlh, u32 len) +{ + __u32 seq = nlh->nlmsg_seq; + __u16 type = nlh->nlmsg_type; + __u32 pid = nlh->nlmsg_pid; + struct sk_buff *skb2 = 0; + + skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL); + if (!skb2) { + skb2 = ERR_PTR(-ENOMEM); + goto out; + } + + NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len)); +out: + return skb2; + +nlmsg_failure: /* Used by NLMSG_PUT */ + kfree_skb(skb2); + return NULL; +} + +#define mstore(value, id, buf) \ +({ \ + u32 _type = id & NPROC_TYPE_MASK; \ + switch (_type) { \ + case NPROC_TYPE_U32: { \ + __u32 *p = (u32 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_UL: { \ + unsigned long *p = (unsigned long *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + case NPROC_TYPE_U64: { \ + __u64 *p = (u64 *)buf; \ + *p = value; \ + buf = (char *)++p; \ + break; \ + } \ + default: \ + perror("Huh? 
Bad type!\n"); \ + } \ +}) + +static char *nproc_ps_field(u32 id, char *buf, task_t *tsk) +{ + struct task_mem tsk_mem; + struct task_mem_cheap tsk_mem_cheap; + + tsk_mem.vmdata = (~0); + tsk_mem_cheap.vmsize = (~0); + + switch (id) { + case NPROC_PID: + mstore(tsk->pid, NPROC_PID, buf); + break; + case NPROC_UID: + mstore(tsk->uid, NPROC_UID, buf); + break; + case NPROC_VMSIZE: + case NPROC_VMLOCK: + case NPROC_VMRSS: + if (tsk_mem_cheap.vmsize == (~0)) + __task_mem_cheap(tsk, &tsk_mem_cheap); + + switch (id) { + case NPROC_VMSIZE: + mstore(tsk_mem_cheap.vmsize, + NPROC_VMSIZE, buf); + break; + case NPROC_VMLOCK: + mstore(tsk_mem_cheap.vmlock, + NPROC_VMLOCK, buf); + break; + case NPROC_VMRSS: + mstore(tsk_mem_cheap.vmrss, + NPROC_VMRSS, buf); + break; + } + break; + case NPROC_VMDATA: + case NPROC_VMSTACK: + case NPROC_VMEXE: + case NPROC_VMLIB: + if (tsk_mem.vmdata == (~0)) + __task_mem(tsk, &tsk_mem); + + switch (id) { + case NPROC_VMDATA: + mstore(tsk_mem.vmdata, NPROC_VMDATA, + buf); + break; + case NPROC_VMSTACK: + mstore(tsk_mem.vmstack, NPROC_VMSTACK, + buf); + break; + case NPROC_VMEXE: + mstore(tsk_mem.vmexe, NPROC_VMEXE, buf); + break; + case NPROC_VMLIB: + mstore(tsk_mem.vmlib, NPROC_VMLIB, buf); + break; + } + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + case NPROC_WCHAN: + mstore(get_wchan(tsk), NPROC_WCHAN, buf); + break; + case NPROC_NAME: + mstore(sizeof(tsk->comm), NPROC_TYPE_U32, buf); + strncpy(buf, tsk->comm, sizeof(tsk->comm)); + buf += sizeof(tsk->comm); + break; + case NPROC_NOP_UL: + mstore(0, NPROC_TYPE_UL, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + goto err_inval; + } + return buf; +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Build and send a netlink msg for one PID. 
+ */ +static int nproc_pid_msg(struct nlmsghdr *nlh, u32 *fdata, u32 len, task_t *tsk) +{ + int i; + int err = 0; + struct sk_buff *skb2; + char *buf; + struct nlmsghdr *nlh2; + u32 fcnt, *fields; + + fcnt = fdata[0]; + fields = &fdata[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + nlh2 = (struct nlmsghdr *)skb2->data; + buf = NLMSG_DATA(nlh2); + + for (i = 0; i < fcnt; i++) { + buf = nproc_ps_field(fields[i], buf, tsk); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + goto out_free; + } + } + err = netlink_unicast(nproc_sock, skb2, nlh2->nlmsg_pid, 0); + if (err > 0) + err = 0; + return err; +out_free: + kfree_skb(skb2); +out: + return err; +} + +/* + * Find task for given pid, grab task lock (caller must unlock). + */ +static task_t *nproc_ps_get_task(int pid) +{ + task_t *tsk; + + read_lock(&tasklist_lock); + tsk = find_task_by_pid(pid); + if (tsk) + get_task_struct(tsk); + read_unlock(&tasklist_lock); + return tsk; +} + +/* + * Iterate over a list of PIDs. + */ +static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata) +{ + int i; + int err = 0; + u32 tcnt; + u32 *pids; + + if (left < sizeof(tcnt)) + goto err_inval; + left -= sizeof(tcnt); + + tcnt = sdata[0]; + + if (left < (tcnt * sizeof(u32))) + goto err_inval; + left -= tcnt * sizeof(u32); + + if (left) + pwarn("%d bytes left.\n", left); + + pids = &sdata[1]; + + for (i = 0; i < tcnt; i++) { + task_t *tsk; + tsk = nproc_ps_get_task(pids[i]); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8) +#define BITS_PER_PAGE (PAGE_SIZE*8) + +/* + * Iterate over all PIDs. 
+ */ +static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len) +{ + void *map; + int offset, i; + int err = 0; + + for (i = 0; i < PIDMAP_ENTRIES; i++) { + + map = get_pid_map(i); + if (!map) /* done -- there are no holes in pidmap_array */ + break; + if (IS_ERR(map)) /* No PIDs used in this map */ + continue; + offset = 0; + for ( ; ; ) { + int pid; + task_t *tsk; + offset = find_next_bit(map, BITS_PER_PAGE, ++offset); + if (offset >= BITS_PER_PAGE) + break; + pid = offset + i * BITS_PER_PAGE; + tsk = nproc_ps_get_task(pid); + if (!tsk) + continue; + err = nproc_pid_msg(nlh, fdata, len, tsk); + put_task_struct(tsk); + if (err) + goto out; + } + } + +out: + return err; +} + +static u32 __reply_size_special(u32 id) +{ + u32 len = 0; + + switch (id) { + case NPROC_NAME: + len = sizeof(u32) + + sizeof(((struct task_struct*)0)->comm); + break; + default: + pwarn("Unknown field size in %#x.\n", id); + } + return len; +} + +/* + * Calculates the size of a reply message payload. Alternatively, we could have + * the user space caller supply a number along with the request and bail + * out or realloc later if we find the allocation was too small. More + * responsibility in user space, but faster. 
+ */ +static u32 *__reply_size (u32 *data, u32 *left, u32 *len) +{ + u32 *fields; + u32 fcnt; + int i; + *len = 0; + + if (*left < sizeof(fcnt)) + goto err_inval; + *left -= sizeof(fcnt); + + fcnt = data[0]; + + if (*left < (fcnt * sizeof(u32))) + goto err_inval; + *left -= fcnt * sizeof(u32); + + fields = &data[1]; + + for (i = 0; i < fcnt; i++) { + u32 id = fields[i]; + u32 type = id & NPROC_TYPE_MASK; + pdebug(" %#8.8x.\n", fields[i]); + switch (type) { + case NPROC_TYPE_U32: + *len += sizeof(u32); + break; + case NPROC_TYPE_UL: + *len += sizeof(unsigned long); + break; + case NPROC_TYPE_U64: + *len += sizeof(u64); + break; + default: { /* Special cases */ + u32 slen; + slen = __reply_size_special(id); + if (slen) + *len += slen; + else + goto err_inval; + } + } + } + + return &fields[fcnt]; + +err_inval: + return ERR_PTR(-EINVAL); +} + +/* + * Call the chosen process selector. Adding additional selectors + * (e.g. select by uid) is easy, but is there a need? + */ +static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid) +{ + int err; + u32 len; + u32 *data = NLMSG_DATA(nlh); + u32 *sdata; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + + sdata = __reply_size(data, &left, &len); + if (IS_ERR(sdata)) { + err = PTR_ERR(sdata); + goto out; + } + + if (left < sizeof(u32)) + goto err_inval; + left -= sizeof(u32); + + switch (*sdata) { + case NPROC_SELECT_ALL: + if (left) + pwarn("%d bytes left.\n", left); + err = nproc_ps_select_all(nlh, data, len); + break; + case NPROC_SELECT_PID: + err = nproc_ps_select_pid(nlh, data, len, + left, sdata + 1); + break; + default: + pwarn("Unknown selection method %#x.\n", *sdata); + goto err_inval; + } + +out: + return err; + +err_inval: + return -EINVAL; +} + +static char *nproc_global_field(u32 id, char *buf) +{ + struct page_state *ps = NULL; + + switch (id) { + case NPROC_NR_DIRTY: + case NPROC_NR_WRITEBACK: + case NPROC_NR_UNSTABLE: + case NPROC_NR_PG_TABLE_PGS: + case NPROC_NR_MAPPED: + case NPROC_NR_SLAB: + if (!ps) { + ps = 
__vmstat(); + if (IS_ERR(ps)) { /* Just pass it on */ + buf = (void *)ps; + ps = NULL; + goto out; + } + } + switch (id) { + case NPROC_NR_DIRTY: + mstore(ps->nr_dirty, NPROC_NR_DIRTY, + buf); + break; + case NPROC_NR_WRITEBACK: + mstore(ps->nr_writeback, + NPROC_NR_WRITEBACK, + buf); + break; + case NPROC_NR_UNSTABLE: + mstore(ps->nr_unstable, + NPROC_NR_UNSTABLE, + buf); + break; + case NPROC_NR_PG_TABLE_PGS: + mstore(ps->nr_page_table_pages, + NPROC_NR_PG_TABLE_PGS, + buf); + break; + case NPROC_NR_MAPPED: + mstore(ps->nr_mapped, NPROC_NR_MAPPED, + buf); + break; + case NPROC_NR_SLAB: + mstore(ps->nr_slab, NPROC_NR_SLAB, buf); + break; + } + break; + case NPROC_MEMFREE: + mstore(nr_free_pages(), NPROC_MEMFREE, buf); + break; + case NPROC_PAGESIZE: + mstore(PAGE_SIZE, NPROC_PAGESIZE, buf); + break; + case NPROC_JIFFIES: + mstore(get_jiffies_64(), NPROC_JIFFIES, buf); + break; + default: + pwarn("Unknown field ID %#x.\n", id); + buf = ERR_PTR(-EINVAL); + goto out; + } + kfree(ps); +out: + return buf; +} + +static int nproc_get_global(struct nlmsghdr *nlh) +{ + int err, i; + void *errp; + struct sk_buff *skb2; + char *buf; + u32 fcnt, len; + u32 *data = NLMSG_DATA(nlh); + u32 *fields; + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + errp = __reply_size(data, &left, &len); + if (IS_ERR(errp)) { + err = PTR_ERR(errp); + goto out; + } + if (left) + pwarn("%d bytes left.\n", left); + + fcnt = data[0]; + fields = &data[1]; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + for (i = 0; i < fcnt; i++) { + buf = nproc_global_field(fields[i], buf); + if (IS_ERR(buf)) { + err = PTR_ERR(buf); + kfree_skb(skb2); + goto out; + } + } + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static int find_id(__u32 *data, __u32 *left) +{ + int i; + u32 id; + + if (*left < sizeof(id)) + goto err_inval; + *left -= 
sizeof(sizeof(id)); + + if (*left) + pwarn("%d bytes left.\n", *left); + id = data[1]; + + for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++) + ; /* Do nothing */ + + if (labels[i].id != id) { + pwarn("No matching label found for %#x.\n", id); + goto err_inval; + } + + return i; + +err_inval: + return -EINVAL; +} + + +static int nproc_get_label(struct nlmsghdr *nlh) +{ + int err; + struct sk_buff *skb2; + const char *label; + char *buf; + int len; + u32 ltype; + u32 *data = NLMSG_DATA(nlh); + u32 left = nlh->nlmsg_len - sizeof(*nlh); + + if (left < sizeof(ltype)) + goto err_inval; + left -= sizeof(ltype); + + ltype = data[0]; + + if (ltype == NPROC_LABEL_FIELD_NAME) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].label; + } + else if (ltype == NPROC_LABEL_FIELD_UNIT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].unit; + } + else if (ltype == NPROC_LABEL_FIELD_FMT) { + int idx; + idx = find_id(data, &left); + if (idx < 0) + goto err_inval; + label = labels[idx].fmt; + } + else if (ltype == NPROC_LABEL_WCHAN) { + char *modname; + unsigned long wchan, size, offset; + char namebuf[128]; + + if (left < sizeof(unsigned long)) + goto err_inval; + left -= sizeof(unsigned long); + + if (left) + pwarn("%d bytes left.\n", left); + + wchan = (unsigned long)data[1]; + label = kallsyms_lookup(wchan, &size, &offset, &modname, + namebuf); + + if (!label) { + pwarn("No ksym found for %#lx.\n", wchan); + goto err_inval; + } + } + else { + pwarn("Unknown label type %#x.\n", ltype); + goto err_inval; + } + + len = strlen(label) + 1; + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + + strncpy(buf, label, len); + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; + +err_inval: + return -EINVAL; +} + +static int 
nproc_get_list(struct nlmsghdr *nlh) +{ + int err, i, cnt, len; + struct sk_buff *skb2; + u32 *buf; + + cnt = ARRAY_SIZE(labels); + len = (cnt + 1) * sizeof(u32); + + skb2 = nproc_alloc_nlmsg(nlh, len); + if (IS_ERR(skb2)) { + err = PTR_ERR(skb2); + goto out; + } + + buf = NLMSG_DATA((struct nlmsghdr *)skb2->data); + buf[0] = cnt; + for (i = 0; i < cnt; i++) + buf[i + 1] = labels[i].id; + + err = netlink_unicast(nproc_sock, skb2, nlh->nlmsg_pid, 0); + if (err > 0) + err = 0; +out: + return err; +} + +static __inline__ int nproc_process_msg(struct sk_buff *skb, + struct nlmsghdr *nlh) +{ + int err = 0; + uid_t uid; + kernel_cap_t caps; + + if (!(nlh->nlmsg_flags & NLM_F_REQUEST)) + goto out; + + nlh->nlmsg_pid = NETLINK_CB(skb).pid; + uid = NETLINK_CB(skb).creds.uid; + caps = NETLINK_CB(skb).eff_cap; + + switch (nlh->nlmsg_type) { + case NPROC_GET_FIELD_LIST: + err = nproc_get_list(nlh); + break; + case NPROC_GET_LABEL: + err = nproc_get_label(nlh); + break; + case NPROC_GET_GLOBAL: + err = nproc_get_global(nlh); + break; + case NPROC_GET_PS: + err = nproc_get_ps(nlh, uid); + break; + default: + pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type); + err = -EINVAL; + } +out: + return err; + +} + +static int nproc_receive_skb(struct sk_buff *skb) +{ + int err = 0; + struct nlmsghdr *nlh; + + if (skb->len < NLMSG_LENGTH(0)) + goto err_inval; + + nlh = (struct nlmsghdr *)skb->data; + if (skb->len < nlh->nlmsg_len || nlh->nlmsg_len < sizeof(*nlh)){ + pwarn("Invalid packet.\n"); + goto err_inval; + } + + err = nproc_process_msg(skb, nlh); + if (err || nlh->nlmsg_flags & NLM_F_ACK) { + pwarn("err %d, type %#x, flags %#x, seq %#x.\n", err, + nlh->nlmsg_type, nlh->nlmsg_flags, + nlh->nlmsg_seq); + netlink_ack(skb, nlh, err); + } + + return err; + +err_inval: + return -EINVAL; +} + +static void nproc_receive(struct sock *sk, int len) +{ + struct sk_buff *skb; + + while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) { + nproc_receive_skb(skb); + kfree_skb(skb); + } +} + 
+static int nproc_init(void) +{ + nproc_sock = netlink_kernel_create(NETLINK_NPROC, nproc_receive); + + if (!nproc_sock) { + pwarn("No netlink socket for nproc.\n"); + return -ENODEV; + } + + return 0; +} + +module_init(nproc_init); Index: mm4-2.6.9-rc1/kernel/pid.c =================================================================== --- mm4-2.6.9-rc1.orig/kernel/pid.c 2004-09-08 06:10:54.000000000 -0700 +++ mm4-2.6.9-rc1/kernel/pid.c 2004-09-08 17:45:27.504564546 -0700 @@ -148,6 +148,17 @@ return -1; } +void *get_pid_map(int idx) +{ + pidmap_t *map = pidmap_array + idx; + if (!map->page) + return NULL; + else if (atomic_read(&map->nr_free) == BITS_PER_PAGE) + return ERR_PTR(-1); + else + return map->page; +} + struct pid * fastcall find_pid(enum pid_type type, int nr) { struct hlist_node *elem; ^ permalink raw reply [flat|nested] 69+ messages in thread
* [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y
2004-09-09 1:17 ` [1/2] rediff nproc v2 vs. 2.6.9-rc1-mm4 William Lee Irwin III
@ 2004-09-09 1:21 ` William Lee Irwin III
2004-09-09 1:22 ` William Lee Irwin III
2004-09-09 1:26 ` [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c William Lee Irwin III
0 siblings, 2 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 1:21 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson
On Wed, Sep 08, 2004 at 06:15:49PM -0700, William Lee Irwin III wrote:
>> This is a straight rediff of nproc vs. 2.6.9-rc1-mm4. No changes
>> whatsoever to the underlying code were made; rather, this merely
>> resolves offsets so it applies cleanly.
>> Compiletested on ia64.
On Wed, Sep 08, 2004 at 06:17:08PM -0700, William Lee Irwin III wrote:
> Repost with appropriate Subject: line.
Make __task_mem() and __task_mem_cheap() use the appropriate methods
for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
The new methods for /proc/ accounting involve using counters kept in
the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
does not involve acquiring mm->mmap_sem for any per-mm statistics. The
CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
-- wli
Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 17:45:27.503587983 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
@@ -44,44 +44,20 @@
  * __task_mem/__task_mem_cheap basically duplicate the MMU version of
  * task_mem, but they are split by cost and work on structs.
  */
-
+#ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
 {
 	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
-		unsigned long data = 0, stack = 0, exec = 0, lib = 0;
-		struct vm_area_struct *vma;
-
-		down_read(&mm->mmap_sem);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
-			unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
-			if (!vma->vm_file) {
-				data += len;
-				if (vma->vm_flags & VM_GROWSDOWN)
-					stack += len;
-				continue;
-			}
-			if (vma->vm_flags & VM_WRITE)
-				continue;
-			if (vma->vm_flags & VM_EXEC) {
-				exec += len;
-				if (vma->vm_flags & VM_EXECUTABLE)
-					continue;
-				lib += len;
-			}
-		}
-		res->vmdata = data - stack;
-		res->vmstack = stack;
-		res->vmexe = exec - lib;
-		res->vmlib = lib;
-		up_read(&mm->mmap_sem);
+	if (!mm)
+		memset(res, 0, sizeof(struct task_mem));
+	else {
+		res->vmdata = (mm->total_vm - mm->shared_vm - mm->stack_vm)
+							<< (PAGE_SHIFT - 10);
+		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
+		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
+		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
 
 		mmput(mm);
-	} else {
-		res->vmdata = 0;
-		res->vmstack = 0;
-		res->vmexe = 0;
-		res->vmlib = 0;
 	}
 }
 
@@ -99,6 +75,80 @@
 		res->vmrss = 0;
 	}
 }
+#else /* !CONFIG_MMU */
+static void __task_mem(task_t *task, struct task_mem *stats)
+{
+	struct mm_struct *mm = get_task_mm(task)
+
+	if (!mm)
+		memset(stats, 0, sizeof(struct task_mem));
+	else {
+		unsigned long bytes = 0, sbytes = 0, slack = 0;
+		struct mm_tblk_struct *tblk;
+
+		down_read(&mm->mmap_sem);
+		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
+			if (!tblk->rblock)
+				continue;
+			bytes += kobjsize(tblk);
+			if (atomic_read(&mm->mm_count) > 1) ||
+					tblk->rblock->refcount > 1) {
+				sbytes += kobjsize(tblk->rblock->kblock);
+				sbytes += kobjsize(tblk->rblock);
+			} else {
+				bytes += kobjsize(tblk->rblock->kblock);
+				bytes += kobjsize(tblk->rblock);
+				slack += kobjsize(tblock->rblock->kblock);
+			}
+		}
+		if (atomic_read(&mm->mm_count) > 1)
+			sbytes += kobjsize(mm);
+		else
+			bytes += kobjsize(mm);
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+		if (task->fs && atomic_read(&task->fs->count) > 1)
+			sbytes += kobjsize(task->files);
+		else
+			bytes += kobjsize(task->files);
+		if (task->sighand && atomic_read(&task->sighand->count) > 1)
+			sbytes += kobjsize(task->sighand);
+		else
+			bytes += kobjsize(task->sighand);
+		bytes += kobjsize(task);
+		/* some interpretation is needed */
+		stats->vmdata = bytes;
+		stats->vmstack = sbytes;
+		stats->vmexe = stats->vmlib = 0;
+	}
+}
+
+static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
+{
+	struct mm_struct *mm = get_task_mm(task);
+	struct mm_tblock_struct *tblk;
+	int size;
+
+	memset(stats, 0, sizeof(struct task_mem_cheap));
+	stats->vmrss += kobjsize(mm);
+	down_read(&mm->mmap_sem);
+	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
+		if (tblk->next)
+			stats->vmrss += kobjsize(tblk->next);
+		if (tblk->rblock) {
+			stats->vmsize += kobjsize(tblk->rblock);
+			stats->vmrss += kobjsize(tblk->rblock);
+			stats->vmrss += kobjsize(tblk->rblock->kblock);
+		}
+	}
+	stats->vmrss += mm->end_code - mm->start_code;
+	stats->vmrss += mm->start_stack - mm->start_data;
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	stats->vmrss >>= 10;
+	stats->vmsize >>= 10;
+}
+#endif /* !CONFIG_MMU */
 
 /*
  * page_alloc.c already has an extra function broken out to fill a
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y
2004-09-09 1:21 ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
@ 2004-09-09 1:22 ` William Lee Irwin III
2004-09-09 1:26 ` [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c William Lee Irwin III
1 sibling, 0 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 1:22 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson
On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
Once again, compiletested only on ia64.
-- wli
^ permalink raw reply [flat|nested] 69+ messages in thread
* [3/2] round up text memory to the nearest page in fs/proc/task_mmu.c
2004-09-09 1:21 ` [2/2] handle CONFIG_MMU=n and use new vm stats for CONFIG_MMU=y William Lee Irwin III
2004-09-09 1:22 ` William Lee Irwin III
@ 2004-09-09 1:26 ` William Lee Irwin III
1 sibling, 0 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 1:26 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson
On Wed, Sep 08, 2004 at 06:21:37PM -0700, William Lee Irwin III wrote:
> Make __task_mem() and __task_mem_cheap() use the appropriate methods
> for CONFIG_MMU=y and add some attempt at correct code for CONFIG_MMU=n.
> The new methods for /proc/ accounting involve using counters kept in
> the mm instead of iteration over vmas. For the CONFIG_MMU=y case this
> does not involve acquiring mm->mmap_sem for any per-mm statistics. The
> CONFIG_MMU=n case still needs iteration over tblocks to calculate them.
Round up text memory to the nearest page to resolve potential alignment
anomalies in reported statistics. Compiletested on ia64.
-- wli
Index: mm4-2.6.9-rc1/fs/proc/task_mmu.c
===================================================================
--- mm4-2.6.9-rc1.orig/fs/proc/task_mmu.c	2004-09-08 06:10:35.000000000 -0700
+++ mm4-2.6.9-rc1/fs/proc/task_mmu.c	2004-09-08 18:27:39.401017905 -0700
@@ -9,7 +9,7 @@
 	unsigned long data, text, lib;
 
 	data = mm->total_vm - mm->shared_vm - mm->stack_vm;
-	text = (mm->end_code - mm->start_code) >> 10;
+	text = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 	lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
 	buffer += sprintf(buffer,
 		"VmSize:\t%8lu kB\n"
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 0:35 ` William Lee Irwin III
2004-09-09 0:43 ` William Lee Irwin III
@ 2004-09-09 18:43 ` Roger Luethi
2004-09-09 18:49 ` William Lee Irwin III
1 sibling, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 18:43 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson
On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote:
> On Wed, Sep 08, 2004 at 08:41:30PM +0200, Roger Luethi wrote:
> > A few notes:
> > - Access control can be implemented easily. Right now it would be bloat,
> > though -- the vast majority of fields in /proc are world-readable
> > (/proc/pid/environ being the notable exception).
> > - Additional process selectors (e.g. select by UID) are not hard to
> > add, either, should there ever be a need.
> > - There are a few things I'm not sure about: For instance, what is a good
> > return value for mm_struct related fields wrt kernel threads? I picked
> > 0, but ~(0) might be preferable because it's distinct.
> > Signed-off-by: Roger Luethi <rl@hellgate.ch>
>
> Any chance you could convert these to use the new vm statistics
> accounting?
Mea culpa. I copied the routines wholesale from 2.6.7 when I started
work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't
noticed the work that had gone into field computation already. So for
CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap
now. The routines can be merged.
!CONFIG_MMU is a different story. Presumably, it needs a change in the
fields that are offered (cp. task_mem in fs/proc/task_nommu.c).
FWIW, my preferred solution would be to have only one routine task_mem
to fill the respective struct for nproc and /proc.
There seems to be a discrepancy between current task_mem in
fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
for the nproc !CONFIG_MMU case. Can you explain?
Roger
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 18:43 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi @ 2004-09-09 18:49 ` William Lee Irwin III 2004-09-09 19:00 ` William Lee Irwin III 2004-09-09 19:11 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi 0 siblings, 2 replies; 69+ messages in thread From: William Lee Irwin III @ 2004-09-09 18:49 UTC (permalink / raw) To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson On Wed, 08 Sep 2004 17:35:29 -0700, William Lee Irwin III wrote: >> Any chance you could convert these to use the new vm statistics >> accounting? On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote: > Mea culpa. I copied the routines wholesale from 2.6.7 when I started > work on nproc. They still seemed to work with 2.6.9-rc1-bk13, I hadn't > noticed the work that had gone into field computation already. So for > CONFIG_MMU, values in both __task_mem and __task_mem_cheap are cheap > now. The routines can be merged. > !CONFIG_MMU is a different story. Presumably, it needs a change in the > fields that are offered (cp. task_mem in fs/proc/task_nommu.c). > FWIW, my prefered solution would be to have only one routine task_mem > to fill the respective struct for nproc and /proc. I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation patch atop the others I sent. On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote: > There seems to be a discrepancy between current task_mem in > fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote > for the nproc !CONFIG_MMU case. Can you explain? I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I did, however, have to mangle the things via guesswork to avoid adding the new fields, which I really wanted you to arrange for or comment on as they are a matter of interface. Also, could you be more specific about these discrepancies? 
-- wli

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 18:49 ` William Lee Irwin III
@ 2004-09-09 19:00 ` William Lee Irwin III
2004-09-09 19:02 ` [4/2] consolidate __task_mem() and __task_mem_cheap() William Lee Irwin III
2004-09-09 19:11 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
1 sibling, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:00 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, Sep 09, 2004 at 11:49:33AM -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.

Consolidate __task_mem() and __task_mem_cheap() now that both have been
made cheap, and also combine struct task_mem with struct task_mem_cheap.
Also adjust various users of *_cheap to the new terminology so no trace
of the *_cheap bits remains. Compiletested on ia64.

Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
 	u32 vmstack;
 	u32 vmexe;
 	u32 vmlib;
-};
-
-struct task_mem_cheap {
 	u32 vmsize;
 	u32 vmlock;
 	u32 vmrss;
 };
 
 /*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
  */
 #ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
 		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
 		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
-		mmput(mm);
-	}
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
-	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
 		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
 		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
 		res->vmrss = mm->rss << (PAGE_SHIFT-10);
 		mmput(mm);
-	} else {
-		res->vmsize = 0;
-		res->vmlock = 0;
-		res->vmrss = 0;
 	}
 }
 #else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
 		unsigned long bytes = 0, sbytes = 0, slack = 0;
 		struct mm_tblk_struct *tblk;
 
+		stats->vmrss += kobjsize(mm);
 		down_read(&mm->mmap_sem);
 		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
-			if (!tblk->rblock)
+			if (tblk->next)
+				stats->vmrss += kobjsize(tblk->next);
+			if (tblk->rblock) {
+				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock->kblock);
+			} else
 				continue;
 			bytes += kobjsize(tblk);
 			if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
 		stats->vmdata = bytes;
 		stats->vmstack = sbytes;
 		stats->vmexe = stats->vmlib = 0;
+		stats->vmrss += mm->end_code - mm->start_code;
+		stats->vmrss += mm->start_stack - mm->start_data;
+		stats->vmrss >>= 10;
+		stats->vmsize >>= 10;
 	}
 }
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
-	struct mm_struct *mm = get_task_mm(task);
-	struct mm_tblock_struct *tblk;
-	int size;
-
-	memset(stats, 0, sizeof(struct task_mem_cheap));
-	stats->vmrss += kobjsize(mm);
-	down_read(&mm->mmap_sem);
-	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
-		if (tblk->next)
-			stats->vmrss += kobjsize(tblk->next);
-		if (tblk->rblock) {
-			stats->vmsize += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock->kblock);
-		}
-	}
-	stats->vmrss += mm->end_code - mm->start_code;
-	stats->vmrss += mm->start_stack - mm->start_data;
-	up_read(&mm->mmap_sem);
-	mmput(mm);
-	stats->vmrss >>= 10;
-	stats->vmsize >>= 10;
-}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -223,10 +193,9 @@
 static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
 {
 	struct task_mem tsk_mem;
-	struct task_mem_cheap tsk_mem_cheap;
 
 	tsk_mem.vmdata = (~0);
-	tsk_mem_cheap.vmsize = (~0);
+	tsk_mem.vmsize = (~0);
 
 	switch (id) {
 	case NPROC_PID:
@@ -238,20 +207,20 @@
 	case NPROC_VMSIZE:
 	case NPROC_VMLOCK:
 	case NPROC_VMRSS:
-		if (tsk_mem_cheap.vmsize == (~0))
-			__task_mem_cheap(tsk, &tsk_mem_cheap);
+		if (tsk_mem.vmsize == (~0))
+			__task_mem(tsk, &tsk_mem);
 		switch (id) {
 		case NPROC_VMSIZE:
-			mstore(tsk_mem_cheap.vmsize,
+			mstore(tsk_mem.vmsize,
 					NPROC_VMSIZE, buf);
 			break;
 		case NPROC_VMLOCK:
-			mstore(tsk_mem_cheap.vmlock,
+			mstore(tsk_mem.vmlock,
 					NPROC_VMLOCK, buf);
 			break;
 		case NPROC_VMRSS:
-			mstore(tsk_mem_cheap.vmrss,
+			mstore(tsk_mem.vmrss,
 					NPROC_VMRSS, buf);
 			break;
 		}

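For readers following the arithmetic in the patch: the MMU branch counts
pages and shifts left by (PAGE_SHIFT - 10) to get KiB, while the nommu
branch sums bytes and shifts right by 10. Below is a minimal userspace
sketch of both conversions, assuming 4 KiB pages. SKETCH_PAGE_SHIFT and
the helper names are made up for illustration and are not part of the
patch.

```c
#include <stdint.h>

/*
 * Assumption for illustration only: 4 KiB pages. In the kernel,
 * PAGE_SHIFT comes from the architecture headers.
 */
#define SKETCH_PAGE_SHIFT 12

/* Page count to KiB: multiply by (page size / 1024). */
static inline uint32_t pages_to_kib(unsigned long pages)
{
	return (uint32_t)(pages << (SKETCH_PAGE_SHIFT - 10));
}

/* Byte count to KiB, rounding down (what ">>= 10" does). */
static inline uint32_t bytes_to_kib(unsigned long bytes)
{
	return (uint32_t)(bytes >> 10);
}
```

Note that the byte-based path truncates, so anything under 1024 bytes
is reported as 0 KiB.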
* [4/2] consolidate __task_mem() and __task_mem_cheap()
2004-09-09 19:00 ` William Lee Irwin III
@ 2004-09-09 19:02 ` William Lee Irwin III
2004-09-09 19:07 ` Roger Luethi
0 siblings, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:02 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, Sep 09, 2004 at 12:00:24PM -0700, William Lee Irwin III wrote:
> Consolidate __task_mem() and __task_mem_cheap() now that both have been
> made cheap, and also combine struct task_mem with struct task_mem_cheap.
> Also adjust various users of *_cheap to the new terminology so no trace
> of the *_cheap bits remains. Compiletested on ia64.

Repost with appropriate Subject: line.

Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-08 18:11:24.826811093 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
@@ -32,17 +32,14 @@
 	u32 vmstack;
 	u32 vmexe;
 	u32 vmlib;
-};
-
-struct task_mem_cheap {
 	u32 vmsize;
 	u32 vmlock;
 	u32 vmrss;
 };
 
 /*
- * __task_mem/__task_mem_cheap basically duplicate the MMU version of
- * task_mem, but they are split by cost and work on structs.
+ * __task_mem() basically duplicates the MMU and nommu versions of
+ * task_mem() from fs/proc/task_mmu.c and fs/proc/task_nommu.c
  */
 #ifdef CONFIG_MMU
 static void __task_mem(struct task_struct *tsk, struct task_mem *res)
@@ -57,22 +54,10 @@
 		res->vmstack = mm->stack_vm << (PAGE_SHIFT - 10);
 		res->vmexe = PAGE_ALIGN(mm->end_code - mm->start_code) >> 10;
 		res->vmlib = (mm->exec_vm << (PAGE_SHIFT - 10)) - res->vmexe;
-		mmput(mm);
-	}
-}
-
-static void __task_mem_cheap(struct task_struct *tsk, struct task_mem_cheap *res)
-{
-	struct mm_struct *mm = get_task_mm(tsk);
-	if (mm) {
 		res->vmsize = mm->total_vm << (PAGE_SHIFT-10);
 		res->vmlock = mm->locked_vm << (PAGE_SHIFT-10);
 		res->vmrss = mm->rss << (PAGE_SHIFT-10);
 		mmput(mm);
-	} else {
-		res->vmsize = 0;
-		res->vmlock = 0;
-		res->vmrss = 0;
 	}
 }
 #else /* !CONFIG_MMU */
@@ -86,9 +71,16 @@
 		unsigned long bytes = 0, sbytes = 0, slack = 0;
 		struct mm_tblk_struct *tblk;
 
+		stats->vmrss += kobjsize(mm);
 		down_read(&mm->mmap_sem);
 		for (tblk = &mm->context.tblk; tblk; tblk = tblk->next) {
-			if (!tblk->rblock)
+			if (tblk->next)
+				stats->vmrss += kobjsize(tblk->next);
+			if (tblk->rblock) {
+				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock);
+				stats->vmrss += kobjsize(tblk->rblock->kblock);
+			} else
 				continue;
 			bytes += kobjsize(tblk);
 			if (atomic_read(&mm->mm_count) > 1) ||
@@ -120,34 +112,12 @@
 		stats->vmdata = bytes;
 		stats->vmstack = sbytes;
 		stats->vmexe = stats->vmlib = 0;
+		stats->vmrss += mm->end_code - mm->start_code;
+		stats->vmrss += mm->start_stack - mm->start_data;
+		stats->vmrss >>= 10;
+		stats->vmsize >>= 10;
 	}
 }
-
-static void __task_mem_cheap(task_t *task, struct task_mem_cheap *stats)
-{
-	struct mm_struct *mm = get_task_mm(task);
-	struct mm_tblock_struct *tblk;
-	int size;
-
-	memset(stats, 0, sizeof(struct task_mem_cheap));
-	stats->vmrss += kobjsize(mm);
-	down_read(&mm->mmap_sem);
-	for (tblk = &mm->context.block; tblk; tblk = tblk->next) {
-		if (tblk->next)
-			stats->vmrss += kobjsize(tblk->next);
-		if (tblk->rblock) {
-			stats->vmsize += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock);
-			stats->vmrss += kobjsize(tblk->rblock->kblock);
-		}
-	}
-	stats->vmrss += mm->end_code - mm->start_code;
-	stats->vmrss += mm->start_stack - mm->start_data;
-	up_read(&mm->mmap_sem);
-	mmput(mm);
-	stats->vmrss >>= 10;
-	stats->vmsize >>= 10;
-}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -223,10 +193,9 @@
 static char *nproc_ps_field(u32 id, char *buf, task_t *tsk)
 {
 	struct task_mem tsk_mem;
-	struct task_mem_cheap tsk_mem_cheap;
 
 	tsk_mem.vmdata = (~0);
-	tsk_mem_cheap.vmsize = (~0);
+	tsk_mem.vmsize = (~0);
 
 	switch (id) {
 	case NPROC_PID:
@@ -238,20 +207,20 @@
 	case NPROC_VMSIZE:
 	case NPROC_VMLOCK:
 	case NPROC_VMRSS:
-		if (tsk_mem_cheap.vmsize == (~0))
-			__task_mem_cheap(tsk, &tsk_mem_cheap);
+		if (tsk_mem.vmsize == (~0))
+			__task_mem(tsk, &tsk_mem);
 		switch (id) {
 		case NPROC_VMSIZE:
-			mstore(tsk_mem_cheap.vmsize,
+			mstore(tsk_mem.vmsize,
 					NPROC_VMSIZE, buf);
 			break;
 		case NPROC_VMLOCK:
-			mstore(tsk_mem_cheap.vmlock,
+			mstore(tsk_mem.vmlock,
 					NPROC_VMLOCK, buf);
 			break;
 		case NPROC_VMRSS:
-			mstore(tsk_mem_cheap.vmrss,
+			mstore(tsk_mem.vmrss,
 					NPROC_VMRSS, buf);
 			break;
 		}

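The switch in nproc_ps_field() relies on a sentinel: vmsize starts out
as ~0 and __task_mem() is only called while that marker is still in
place. A rough userspace sketch of the pattern follows. The names
(mem_sketch, fill_mem, get_field) and the values are stand-ins chosen
for illustration; this is not the kernel code.

```c
#include <stdint.h>

struct mem_sketch {
	uint32_t vmsize;
	uint32_t vmlock;
	uint32_t vmrss;
};

/* Marker meaning "the struct has not been filled in yet". */
#define UNFILLED ((uint32_t)~0u)

/* Counts how often the gathering routine actually ran. */
static int fill_calls;

/* Stand-in for __task_mem(): gathers all values in one go. */
static inline void fill_mem(struct mem_sketch *m)
{
	fill_calls++;
	m->vmsize = 1024;	/* made-up values */
	m->vmlock = 0;
	m->vmrss = 512;
}

/* Return one field, gathering the whole struct only on first use. */
static inline uint32_t get_field(struct mem_sketch *m, int which)
{
	if (m->vmsize == UNFILLED)
		fill_mem(m);
	switch (which) {
	case 0:
		return m->vmsize;
	case 1:
		return m->vmlock;
	default:
		return m->vmrss;
	}
}
```

One consequence of a sentinel like this is that a genuine value of ~0
cannot be told apart from "not yet filled".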
* Re: [4/2] consolidate __task_mem() and __task_mem_cheap()
2004-09-09 19:02 ` [4/2] consolidate __task_mem() and __task_mem_cheap() William Lee Irwin III
@ 2004-09-09 19:07 ` Roger Luethi
2004-09-09 19:15 ` [5/2] fix nommu VSZ reporting in consolidated task_mem() William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 19:07 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
> + stats->vmrss += mm->end_code - mm->start_code;

s/vmrss/vmsize/ ?

* [5/2] fix nommu VSZ reporting in consolidated task_mem()
2004-09-09 19:07 ` Roger Luethi
@ 2004-09-09 19:15 ` William Lee Irwin III
0 siblings, 0 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:15 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:02:14 -0700, William Lee Irwin III wrote:
>> + stats->vmrss += mm->end_code - mm->start_code;

On Thu, Sep 09, 2004 at 09:07:57PM +0200, Roger Luethi wrote:
> s/vmrss/vmsize/ ?

This follows fs/proc/task_nommu.c:task_statm, which ->vmsize would not.
vmsize would be the sum of kobjsize(tblk->rblock->kblock) for each
tblock, which actually does need fixing in the above.

-- wli

Index: mm4-2.6.9-rc1/kernel/nproc.c
===================================================================
--- mm4-2.6.9-rc1.orig/kernel/nproc.c	2004-09-09 12:00:44.649267323 -0700
+++ mm4-2.6.9-rc1/kernel/nproc.c	2004-09-09 12:18:01.876793680 -0700
@@ -77,7 +77,7 @@
 			if (tblk->next)
 				stats->vmrss += kobjsize(tblk->next);
 			if (tblk->rblock) {
-				stats->vmsize += kobjsize(tblk->rblock);
+				stats->vmsize += kobjsize(tblk->rblock->kblock);
 				stats->vmrss += kobjsize(tblk->rblock);
 				stats->vmrss += kobjsize(tblk->rblock->kblock);
 			} else

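To make the distinction between the two counters concrete, here is a
userspace mock of the nommu accounting as it stands after this fix.
It is only a sketch: the *_sketch types are hypothetical, each object
carries a size field standing in for what kobjsize() would report, and
locking and reference counting are left out.

```c
#include <stddef.h>

/* Mock objects; "size" stands in for kobjsize() of the object. */
struct kblock_sketch { unsigned long size; };
struct rblock_sketch { unsigned long size; struct kblock_sketch *kblock; };
struct tblock_sketch {
	unsigned long size;
	struct rblock_sketch *rblock;
	struct tblock_sketch *next;
};

struct vm_totals { unsigned long vmsize, vmrss; };	/* in KiB */

/*
 * mm_size:    stand-in for kobjsize(mm)
 * code_bytes: stand-in for mm->end_code - mm->start_code
 * data_bytes: stand-in for mm->start_stack - mm->start_data
 */
static inline struct vm_totals
account_nommu(unsigned long mm_size, const struct tblock_sketch *head,
	      unsigned long code_bytes, unsigned long data_bytes)
{
	struct vm_totals t = { 0, 0 };
	const struct tblock_sketch *tblk;

	t.vmrss += mm_size;
	for (tblk = head; tblk; tblk = tblk->next) {
		if (tblk->next)
			t.vmrss += tblk->next->size;	/* list node itself */
		if (tblk->rblock) {
			/* vmsize counts only the mapped blocks ... */
			t.vmsize += tblk->rblock->kblock->size;
			/* ... vmrss also counts the bookkeeping. */
			t.vmrss += tblk->rblock->size;
			t.vmrss += tblk->rblock->kblock->size;
		}
	}
	t.vmrss += code_bytes;
	t.vmrss += data_bytes;
	t.vmrss >>= 10;		/* bytes to KiB */
	t.vmsize >>= 10;
	return t;
}
```

With one 8192-byte block, 64 bytes of rblock bookkeeping, a 512-byte
mm, 4096 bytes of code and 2048 bytes of data, this yields a vmsize of
8 KiB and a vmrss of 14 KiB (14912 bytes, rounded down).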
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 18:49 ` William Lee Irwin III
2004-09-09 19:00 ` William Lee Irwin III
@ 2004-09-09 19:11 ` Roger Luethi
2004-09-09 19:23 ` William Lee Irwin III
2004-09-11 22:25 ` Albert Cahalan
1 sibling, 2 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 19:11 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> patch atop the others I sent.

I have a few minor changes coming up as well.

One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
ifdef them out of the list of offered fields for that configuration (and
maybe in nproc_ps_field as well).

> On Thu, Sep 09, 2004 at 08:43:01PM +0200, Roger Luethi wrote:
> > There seems to be a discrepancy between current task_mem in
> > fs/proc/task_nommu.c and the __task_mem{,_cheap} routines you wrote
> > for the nproc !CONFIG_MMU case. Can you explain?
>
> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
> did, however, have to mangle the things via guesswork to avoid adding
> the new fields, which I really wanted you to arrange for or comment on
> as they are a matter of interface. Also, could you be more specific
> about these discrepancies?

task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
offers VmData, VmStack, VmRSS, VmSize.

Roger

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 19:11 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
@ 2004-09-09 19:23 ` William Lee Irwin III
2004-09-09 21:19 ` Roger Luethi
2004-09-10 15:30 ` Roger Luethi
2004-09-11 22:25 ` Albert Cahalan
1 sibling, 2 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 19:23 UTC (permalink / raw)
To: Roger Luethi; +Cc: Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
>> patch atop the others I sent.

On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> I have a few minor changes coming up as well.

I rest assured that nothing I've written thus far will apply to or be
included in any of it, as a matter of course (nothing specific to you).

On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).

This may be; I'll leave that decision to you as the interface designer.

On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
>> I'm not aware of a discrepancy with the fs/proc/task_nommu.c code; I
>> did, however, have to mangle the things via guesswork to avoid adding
>> the new fields, which I really wanted you to arrange for or comment on
>> as they are a matter of interface. Also, could you be more specific
>> about these discrepancies?

On Thu, Sep 09, 2004 at 09:11:42PM +0200, Roger Luethi wrote:
> task_nommu.c offers Mem, Slack, and Shared. __task_mem for !CONFIG_MMU
> offers VmData, VmStack, VmRSS, VmSize.

I took the structure fields to be just an argument passing convention
giving the nommu case an identical prototype much like the helpers in
fs/proc/task_{no,}mmu.c. Using different field names etc. is also
feasible, of course. I'll wait for your updates to follow up further.

-- wli

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 19:23 ` William Lee Irwin III
@ 2004-09-09 21:19 ` Roger Luethi
2004-09-10 15:30 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 21:19 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> I took the structure fields to be just an argument passing convention
> giving the nommu case an identical prototype much like the helpers in

That seems rather confusing. We must special-case for !CONFIG_MMU anyway
because field IDs are tied to meaning, i.e. systems export different
sets of fields depending on this configuration setting. The proc
filesystem does the same; the difference is that a changing set is
easier to handle with nproc.

Roger

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 19:23 ` William Lee Irwin III
2004-09-09 21:19 ` Roger Luethi
@ 2004-09-10 15:30 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-10 15:30 UTC (permalink / raw)
To: William Lee Irwin III, Andrew Morton, linux-kernel, Albert Cahalan, Paul Jackson

On Thu, 09 Sep 2004 12:23:13 -0700, William Lee Irwin III wrote:
> feasible, of course. I'll wait for your updates to follow up further.

Incremental update below. It contains a reorganization of the field
IDs (something I expected to do based on feedback) and minor tweaks in
error handling. I'll post a full patch once the MMU stuff is sorted out.

Roger

diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/include/linux/nproc.h linux-2.6.9-rc1-mm4.02/include/linux/nproc.h
--- linux-2.6.9-rc1-mm4.01/include/linux/nproc.h	2004-09-10 17:19:34.018727960 +0200
+++ linux-2.6.9-rc1-mm4.02/include/linux/nproc.h	2004-09-10 14:43:13.000000000 +0200
@@ -49,35 +49,57 @@
 #define NPROC_LABEL_FIELD_UNIT 0x00000003
 #define NPROC_LABEL_WCHAN 0x00000004
 
-/* Field IDs (unique key in bits 0 - 15) */
-#define NPROC_NOP_UL (0x00000020 | NPROC_TYPE_UL)
-#define NPROC_PID (0x00000001 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_NAME (0x00000002 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
-/* Amount of free memory (pages) */
-#define NPROC_MEMFREE (0x00000004 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
-/* Size of a page (bytes) */
-#define NPROC_PAGESIZE (0x00000005 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------------------- misc */
 /* There's no guarantee about anything with jiffies. Still useful for some. */
-#define NPROC_JIFFIES (0x00000006 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
-/* Process: VM size (KiB) */
-#define NPROC_VMSIZE (0x00000010 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-/* Process: locked memory (KiB) */
-#define NPROC_VMLOCK (0x00000011 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-/* Process: Memory resident size (KiB) */
-#define NPROC_VMRSS (0x00000012 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMDATA (0x00000013 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMSTACK (0x00000014 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMEXE (0x00000015 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_VMLIB (0x00000016 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_UID (0x00000018 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
-#define NPROC_NR_DIRTY (0x00000051 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_WRITEBACK (0x00000052 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_UNSTABLE (0x00000053 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_PG_TABLE_PGS (0x00000054 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_MAPPED (0x00000055 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_NR_SLAB (0x00000056 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
-#define NPROC_WCHAN (0x00000080 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
-#define NPROC_WCHAN_NAME (0x00000081 | NPROC_TYPE_STRING)
+#define NPROC_JIFFIES (0x00000001 | NPROC_TYPE_U64 | NPROC_SCOPE_GLOBAL)
+/* Field IDs (unique key in bits 0 - 15) */
+#define NPROC_NOP_UL (0x00000002 | NPROC_TYPE_UL)
+/* Size of a page */
+#define NPROC_PAGESIZE (0x00000003 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* --------------------------------------------------------- /proc/PID/status */
+#define NPROC_NAME (0x00000100 | NPROC_TYPE_STRING | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE (0x00000101 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_STATE_NAME (0x00000102 | NPROC_TYPE_STRING)
+#define NPROC_SLEEP_TIME (0x00000103 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TOTAL_TIME (0x00000104 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_PID (0x00000105 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TGID (0x00000106 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_PPID (0x00000107 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_TRACER_PID (0x00000108 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_UID (0x00000109 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_EUID (0x00000110 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_SUID (0x00000111 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_FSUID (0x00000112 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_GID (0x00000113 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_EGID (0x00000114 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_SGID (0x00000115 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_FSGID (0x00000116 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: VM size */
+#define NPROC_VMSIZE (0x00000117 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: locked memory */
+#define NPROC_VMLOCK (0x00000118 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* Process: Memory resident size */
+#define NPROC_VMRSS (0x00000119 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMDATA (0x00000120 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMSTACK (0x00000121 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMEXE (0x00000122 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+#define NPROC_VMLIB (0x00000123 | NPROC_TYPE_U32 | NPROC_SCOPE_PROCESS)
+/* ------------------------------------------------------------- /proc/vmstat */
+#define NPROC_NR_DIRTY (0x00000214 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_WRITEBACK (0x00000215 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_UNSTABLE (0x00000216 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_PG_TABLE_PGS (0x00000217 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_MAPPED (0x00000218 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+#define NPROC_NR_SLAB (0x00000219 | NPROC_TYPE_UL | NPROC_SCOPE_GLOBAL)
+/* ------------------------------------------------------------ /proc/meminfo */
+/* Amount of free memory */
+#define NPROC_MEMFREE (0x00000320 | NPROC_TYPE_U32 | NPROC_SCOPE_GLOBAL)
+/* ---------------------------------------------------------- /proc/PID/wchan */
+#define NPROC_WCHAN (0x00000421 | NPROC_TYPE_UL | NPROC_SCOPE_PROCESS)
+#define NPROC_WCHAN_NAME (0x00000422 | NPROC_TYPE_STRING)
+/* ----------------------------------------------------------- /proc/PID/stat */
+/* ---------------------------------------------------------- /proc/PID/statm */
+
 #ifdef __KERNEL__
 
 struct nproc_field {
@@ -88,11 +110,11 @@ struct nproc_field {
 };
 
 static struct nproc_field labels[] = {
-	{ NPROC_PID, "PID", "%5u", "" },
-	{ NPROC_NAME, "Name", "%-15s","" },
-	{ NPROC_MEMFREE, "MemFree", "%8u", "page" },
-	{ NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
 	{ NPROC_JIFFIES, "Jiffies", "%10u", "" },
+	{ NPROC_PAGESIZE, "PageSize", "%4u", "byte" },
+	{ NPROC_NAME, "Name", "%-15s","" },
+	{ NPROC_PID, "PID", "%5u", "" },
+	{ NPROC_UID, "UID", "%5u", "" },
 	{ NPROC_VMSIZE, "VmSize", "%8u", "KiB" },
 	{ NPROC_VMLOCK, "VmLock", "%8u", "KiB" },
 	{ NPROC_VMRSS, "VmRSS", "%8u", "KiB" },
@@ -100,16 +122,16 @@ static struct nproc_field labels[] = {
 	{ NPROC_VMSTACK, "VmStack", "%8u", "KiB" },
 	{ NPROC_VMEXE, "VmExe", "%8u", "KiB" },
 	{ NPROC_VMLIB, "VmLib", "%8u", "KiB" },
-	{ NPROC_UID, "UID", "%5u", "" },
 	{ NPROC_NR_DIRTY, "nr_dirty", "%8d", "page" },
 	{ NPROC_NR_WRITEBACK, "nr_writeback", "%8u", "page" },
 	{ NPROC_NR_UNSTABLE, "nr_unstable", "%8u", "page" },
 	{ NPROC_NR_PG_TABLE_PGS, "nr_page_table_pages", "%8u", "page" },
 	{ NPROC_NR_MAPPED, "nr_mapped", "%8u", "page" },
 	{ NPROC_NR_SLAB, "nr_slab", "%8u", "page" },
+	{ NPROC_MEMFREE, "MemFree", "%8u", "page" },
 	{ NPROC_WCHAN, "wchan", "%p", "" },
 #ifdef CONFIG_KALLSYMS
-	{ NPROC_WCHAN_NAME, "wchan_symbol", "%s"},
+	{ NPROC_WCHAN_NAME, "wchan_symbol", "%s", ""},
 #endif
 };
 #endif /* __KERNEL__ */
diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/kernel/nproc.c linux-2.6.9-rc1-mm4.02/kernel/nproc.c
--- linux-2.6.9-rc1-mm4.01/kernel/nproc.c	2004-09-10 17:19:34.034725528 +0200
+++ linux-2.6.9-rc1-mm4.02/kernel/nproc.c	2004-09-10 12:04:28.000000000 +0200
@@ -17,12 +17,11 @@
 /* There must be like 5 million dprintk definitions, so let's add some more */
 #ifdef DEBUG
 #define pdebug(x,args...) printk(KERN_DEBUG "%s:%d " x, __func__ , __LINE__, ##args)
-#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
 #else
 #define pdebug(x,args...)
-#define pwarn(x,args...)
 #endif
 
+#define pwarn(x,args...) printk(KERN_WARNING "%s:%d " x, __func__ , __LINE__, ##args)
 #define perror(x,args...) printk(KERN_ERR "%s:%d " x, __func__ , __LINE__, ##args)
 
 static struct sock *nproc_sock = NULL;
@@ -129,18 +128,18 @@ static struct sk_buff *nproc_alloc_nlmsg
 	struct sk_buff *skb2 = 0;
 
 	skb2 = alloc_skb(NLMSG_SPACE(len), GFP_KERNEL);
-	if (!skb2) {
-		skb2 = ERR_PTR(-ENOMEM);
-		goto out;
-	}
+	if (!skb2)
+		goto err_out;
 
 	NLMSG_PUT(skb2, pid, seq, type, NLMSG_ALIGN(len));
-out:
-	return skb2;
+	goto out;
 
 nlmsg_failure: /* Used by NLMSG_PUT */
 	kfree_skb(skb2);
-	return NULL;
+err_out:
+	skb2 = ERR_PTR(-ENOMEM);
+out:
+	return skb2;
 }
 
 #define mstore(value, id, buf) \
@@ -634,18 +633,17 @@ static int find_id(__u32 *data, __u32 *l
 	pwarn("%d bytes left.\n", *left);
 
 	id = data[1];
-	for (i = 0; i < ARRAY_SIZE(labels) && labels[i].id != id; i++)
-		; /* Do nothing */
-
-	if (labels[i].id != id) {
-		pwarn("No matching label found for %#x.\n", id);
-		goto err_inval;
+	for (i = 0; i < ARRAY_SIZE(labels); i++) {
+		if (labels[i].id == id)
+			goto out;
 	}
 
-	return i;
+	pwarn("No matching label found for %#x.\n", id);
 
 err_inval:
 	return -EINVAL;
+out:
+	return i;
 }
 
 diff -uNp -X /home/rl/data/doc/kernel/dontdiff-2.6 linux-2.6.9-rc1-mm4.01/init/Kconfig linux-2.6.9-rc1-mm4.02/init/Kconfig
--- linux-2.6.9-rc1-mm4.01/init/Kconfig	2004-09-10 17:19:34.040724616 +0200
+++ linux-2.6.9-rc1-mm4.02/init/Kconfig	2004-09-10 00:32:36.000000000 +0200
@@ -141,10 +141,11 @@ config SYSCTL
 
 config NPROC
 	bool "Netlink interface to /proc information"
-	depends on PROC_FS && EXPERIMENTAL
+	depends on EXPERIMENTAL && !SECURITY
 	default y
 	help
-	  Nproc is a netlink interface to /proc information.
+	  Nproc is a netlink interface to /proc information. Its benefits
+	  are clean semantics and high performance.
 
 config AUDIT
 	bool "Auditing support"

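The find_id() change in the patch is worth a second look: when no
label matched, the old loop left i equal to ARRAY_SIZE(labels) and the
following test then read labels[i], one element past the end of the
array. A standalone sketch of the rewritten lookup follows. The table
contents and the names label_sketch and find_label are invented for
illustration, and only the low key bits of the IDs are used (no type
or scope flags).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct label_sketch {
	uint32_t id;
	const char *name;
};

/* Hypothetical table; not the kernel's labels[] array. */
static const struct label_sketch labels_sketch[] = {
	{ 0x0105, "PID" },
	{ 0x0117, "VmSize" },
	{ 0x0119, "VmRSS" },
};

/*
 * Return the index of the entry matching id, or -EINVAL if there is
 * none. The table is only indexed while i is known to be in range.
 */
static inline int find_label(uint32_t id)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(labels_sketch); i++) {
		if (labels_sketch[i].id == id)
			return (int)i;
	}
	return -EINVAL;
}
```

A caller would treat any negative return as "unknown field" and
otherwise use the result as an index into the table.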
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 19:11 ` [1/1][PATCH] nproc v2: netlink access to /proc information Roger Luethi
2004-09-09 19:23 ` William Lee Irwin III
@ 2004-09-11 22:25 ` Albert Cahalan
2004-09-12 4:58 ` William Lee Irwin III
2004-09-14 5:59 ` Roger Luethi
1 sibling, 2 replies; 69+ messages in thread
From: Albert Cahalan @ 2004-09-11 22:25 UTC (permalink / raw)
To: Roger Luethi
Cc: William Lee Irwin III, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson

On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
> On Thu, 09 Sep 2004 11:49:33 -0700, William Lee Irwin III wrote:
> > I'll follow up shortly with a task_mem()/task_mem_cheap() consolidation
> > patch atop the others I sent.
>
> I have a few minor changes coming up as well.
>
> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> ifdef them out of the list of offered fields for that configuration (and
> maybe in nproc_ps_field as well).

No. First of all, I think they can be offered. Until proven
otherwise, I'll assume that the !CONFIG_MMU case is buggy.

Second of all, removal will make the !CONFIG_MMU systems
less compatible with the rest of the world. This will
mean that fewer apps can run on !CONFIG_MMU boxes. It's
the same problem as "All the world's a VAX". It's better that
the apps work; an author working on a Pentium 4 Xeon is
likely to write code that relies on the fields and might
not really understand what "no MMU" is all about.

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-11 22:25 ` Albert Cahalan
@ 2004-09-12 4:58 ` William Lee Irwin III
2004-09-14 5:59 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-12 4:58 UTC (permalink / raw)
To: Albert Cahalan
Cc: Roger Luethi, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson

On Thu, 2004-09-09 at 15:11, Roger Luethi wrote:
>> I have a few minor changes coming up as well.
>> One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
>> ifdef them out of the list of offered fields for that configuration (and
>> maybe in nproc_ps_field as well).

On Sat, Sep 11, 2004 at 06:25:56PM -0400, Albert Cahalan wrote:
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.
> Second of all, removal will make the !CONFIG_MMU systems
> less compatible with the rest of the world. This will
> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> the same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.

Would the nommu bits I wrote be satisfactory for you?

-- wli

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-11 22:25 ` Albert Cahalan
2004-09-12 4:58 ` William Lee Irwin III
@ 2004-09-14 5:59 ` Roger Luethi
2004-09-14 6:18 ` William Lee Irwin III
1 sibling, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-14 5:59 UTC (permalink / raw)
To: Albert Cahalan
Cc: William Lee Irwin III, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson

On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
> > One nitpick: As vmexe and vmlib are always 0 for !CONFIG_MMU, we should
> > ifdef them out of the list of offered fields for that configuration (and
> > maybe in nproc_ps_field as well).
>
> No. First of all, I think they can be offered. Until proven
> otherwise, I'll assume that the !CONFIG_MMU case is buggy.

I agree with you that those specific fields should be offered for
!CONFIG_MMU. However, if for some reason they cannot carry a value
that fits the field description, they should not be offered at all. The
ambiguity of having 0 mean either "0" or "this field is not available"
is bad. Trying to read a specific field _can_ fail, and applications
had better handle that case (it's still trivial compared to having to
parse different /proc file layouts depending on the configuration).

> mean that fewer apps can run on !CONFIG_MMU boxes. It's
> the same problem as "All the world's a VAX". It's better that
> the apps work; an author working on a Pentium 4 Xeon is
> likely to write code that relies on the fields and might
> not really understand what "no MMU" is all about.

The presumed wrong assumptions underlying broken tools of the future
are not a good base for designing a new interface. My interest is in
making it easy to write correct applications (or in fixing broken apps
that won't work, say, on !CONFIG_MMU systems).

Roger

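The point about 0 being ambiguous can be illustrated with a small
sketch of a reader that reports availability separately from the
value. Everything here is hypothetical (field_sketch, read_field, the
IDs and the choice of -ENOENT); it says nothing about how nproc itself
signals a missing field, only what handling the failure case could look
like on the application side.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct field_sketch {
	uint32_t id;
	int offered;		/* nonzero if this system exports the field */
	uint32_t value;
};

/*
 * Store the field's value in *out and return 0, or return -ENOENT
 * (leaving *out untouched) if the field is unknown or not offered.
 * A stored value of 0 therefore always means a real zero.
 */
static inline int read_field(const struct field_sketch *tbl, size_t n,
			     uint32_t id, uint32_t *out)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].id != id)
			continue;
		if (!tbl[i].offered)
			return -ENOENT;
		*out = tbl[i].value;
		return 0;
	}
	return -ENOENT;
}
```

An application can then print a placeholder or skip the column when the
call fails, instead of displaying a misleading 0.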
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-14 5:59 ` Roger Luethi
@ 2004-09-14 6:18 ` William Lee Irwin III
2004-09-14 6:23 ` William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-14 6:18 UTC (permalink / raw)
To: Roger Luethi
Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson

On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> No. First of all, I think they can be offered. Until proven
>> otherwise, I'll assume that the !CONFIG_MMU case is buggy.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> I agree with you that those specific fields should be offered for
> !CONFIG_MMU. However, if for some reason they cannot carry a value
> that fits the field description, they should not be offered at all. The
> ambiguity of having 0 mean either "0" or "this field is not available"
> is bad. Trying to read a specific field _can_ fail, and applications
> had better handle that case (it's still trivial compared to having to
> parse different /proc file layouts depending on the configuration).

Apart from doing something it's supposed to for !CONFIG_MMU and using
the internal kernel accounting I set up for the CONFIG_MMU=y case I'm
not very concerned about this. I have a vague notion there should
probably be some consistency with the /proc/ precedent but am not
particularly tied to it. We should probably ask Greg Ungerer (the
maintainer of the external MMU-less patches) about what he prefers
since it's likely we can't anticipate all of the !CONFIG_MMU concerns.

On Sat, 11 Sep 2004 18:25:56 -0400, Albert Cahalan wrote:
>> mean that fewer apps can run on !CONFIG_MMU boxes. It's
>> the same problem as "All the world's a VAX". It's better that
>> the apps work; an author working on a Pentium 4 Xeon is
>> likely to write code that relies on the fields and might
>> not really understand what "no MMU" is all about.

On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> The presumed wrong assumptions underlying broken tools of the future
> are not a good base for designing a new interface. My interest is in
> making it easy to write correct applications (or in fixing broken apps
> that won't work, say, on !CONFIG_MMU systems).

I don't really know what the approach to app compatibility used by
userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my
knowledge of the CONFIG_MMU usage models and/or whatever userspace is
used in tandem with it outside the VM's internals is rather scant.

-- wli

* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 6:18 ` William Lee Irwin III @ 2004-09-14 6:23 ` William Lee Irwin III 2004-09-14 7:47 ` Greg Ungerer 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 6:23 UTC (permalink / raw) To: Greg Ungerer Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson, Roger Luethi Greg, could you comment on this since there are some people having trouble figuring out what's going on with VM-related /proc/ fields for !CONFIG_MMU. Please forgive the top-posting, it made more sense to quote the text below in this instance. On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote: >> I agree with you that those specific fields should be offered for >> !CONFIG_MMU. However, if for some reason they cannot carry a value >> that fits the field description, they should not be offered at all. The >> ambiguity of having 0 mean either "0" or "this field is not available" >> is bad. Trying to read a specific field _can_ fail, and applications >> had better handle that case (it's still trivial compared to having to >> parse different /proc file layouts depending on the configuration). On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote: > Apart from doing something it's supposed to for !CONFIG_MMU and using > the internal kernel accounting I set up for the CONFIG_MMU=y case I'm > not very concerned about this. I have a vague notion there should > probably be some consistency with the /proc/ precedent but am not > particularly tied to it. We should probably ask Greg Ungerer (the > maintainer of the external MMU-less patches) about what he prefers > since it's likely we can't anticipate all of the !CONFIG_MMU concerns. On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote: >> The presumed wrong assumptions underlying broken tools of the future >> are not a good base for designing a new interface. 
My interest is in >> making it easy to write correct applications (or in fixing broken apps >> that won't work, say, on !CONFIG_MMU systems). On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote: > I don't really know what the approach to app compatibility used by > userspace for !CONFIG_MMU is; I'll refer you to Greg Ungerer as my > knowledge of the CONFIG_MMU usage models and/or whatever userspace > is used in tandem with it outside the VM's internals is rather scant. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 6:23 ` William Lee Irwin III @ 2004-09-14 7:47 ` Greg Ungerer 2004-09-14 8:27 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: Greg Ungerer @ 2004-09-14 7:47 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Andrew Morton OSDL, linux-kernel mailing list, Paul Jackson, Roger Luethi Hi William, Roger, William Lee Irwin III wrote: > Greg, could you comment on this since there are some people having > trouble figuring out what's going on with VM-related /proc/ fields for > !CONFIG_MMU. Please forgive the top-posting, it made more sense to > quote the text below in this instance. Yeah, the !CONFIG_MMU code behind this is probably a little stale. The thinking has mostly been to keep things as much the same as possible, even if the fields didn't have a sensible meaning in non-mmu space. > On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote: > >>>I agree with you that those specific fields should be offered for >>>!CONFIG_MMU. However, if for some reason they cannot carry a value >>>that fits the field description, they should not be offered at all. The >>>ambiguity of having 0 mean either "0" or "this field is not available" >>>is bad. Trying to read a specific field _can_ fail, and applications >>>had better handle that case (it's still trivial compared to having to >>>parse different /proc file layouts depending on the configuration). In at least one case this is true now, as you mention for the VmXxx fields. But looking at these now I think we could actually implement most of them in a sensible way for the no-mmu case. Size, Exe, Lib, Stk, etc all apply with their conventional meanings. > On Mon, Sep 13, 2004 at 11:18:00PM -0700, William Lee Irwin III wrote: > >>Apart from doing something it's supposed to for !CONFIG_MMU and using >>the internal kernel accounting I set up for the CONFIG_MMU=y case I'm >>not very concerned about this. 
I have a vague notion there should >>probably be some consistency with the /proc/ precedent but am not >>particularly tied to it. We should probably ask Greg Ungerer (the >>maintainer of the external MMU-less patches) about what he prefers >>since it's likely we can't anticipate all of the !CONFIG_MMU concerns. > > > On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote: > >>>The presumed wrong assumptions underlying broken tools of the future >>>are not a good base for designing a new interface. My interest is in >>>making it easy to write correct applications (or in fixing broken apps >>>that won't work, say, on !CONFIG_MMU systems). Reality for non-mmu targets is that most apps just won't be fixed for them, so we try real hard to make the world look like it is just like any other linux architecture. I think !CONFIG_MMU case can be cleaned up to make it almost identical to the CONFIG_MMU case, and reporting sensible values for just about all fields. Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com SnapGear -- a CyberGuard Company PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-14 7:47 ` Greg Ungerer
@ 2004-09-14 8:27 ` Roger Luethi
0 siblings, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-14 8:27 UTC (permalink / raw)
To: Greg Ungerer
Cc: William Lee Irwin III, Albert Cahalan, Andrew Morton OSDL,
linux-kernel mailing list, Paul Jackson
On Tue, 14 Sep 2004 17:47:52 +1000, Greg Ungerer wrote:
> Yeah, the !CONFIG_MMU code behind this is probably a little stale.
> The thinking has mostly been to keep things as much the same as
> possible, even if the fields didn't have a sensible meaning in
> non-mmu space.
With nproc, tool authors won't need to write any special-casing code
for non-MMU. All they need to handle is the possibility that a field
they ask for does not exist. (Of course it doesn't hurt if they know
how to deal with non-MMU specific fields if any exist)
> >On Tue, Sep 14, 2004 at 07:59:46AM +0200, Roger Luethi wrote:
> >
> >>>I agree with you that those specific fields should be offered for
> >>>!CONFIG_MMU. However, if for some reason they cannot carry a value
> >>>that fits the field description, they should not be offered at all. The
> >>>ambiguity of having 0 mean either "0" or "this field is not available"
> >>>is bad. Trying to read a specific field _can_ fail, and applications
> >>>had better handle that case (it's still trivial compared to having to
> >>>parse different /proc file layouts depending on the configuration).
>
> In at least one case this is true now, as you mention for the
> VmXxx fields. But looking at these now I think we could actually
> implement most of them in a sensible way for the no-mmu case.
> Size, Exe, Lib, Stk, etc all apply with their conventional
> meanings.
It seems we all agree on that. What I'd object to is offering fields
like Size, Exe, etc. and filling them with values that are wrong
(e.g. returning always 0 for Exe). In such a case, the field is simply
not offered and asking for it an error. That's not a problem we can
solve for tool authors: Allowing them to distinguish between N/A and 0
is a property of the interface, and using that interface means knowing
how to deal with that distinction.
Roger
^ permalink raw reply [flat|nested] 69+ messages in thread
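The N/A-versus-0 distinction discussed above can be illustrated with a small user-space sketch. This is not nproc code: `struct offered_field`, `read_field`, and the `FIELD_*` codes are hypothetical names chosen for illustration, and the field table stands in for whatever set of fields a given kernel configuration would offer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes; not taken from the nproc patch. */
enum field_status { FIELD_OK = 0, FIELD_NOT_OFFERED = -1 };

/* One field offered by an imaginary kernel, with its current value. */
struct offered_field {
    const char *name;
    uint32_t value;
};

/*
 * Look up a field by name in the offered set. A field that is not offered
 * is reported through the return code, so a stored value of 0 always means
 * "zero" and never "not available".
 */
static enum field_status read_field(const struct offered_field *offered,
                                    size_t count, const char *name,
                                    uint32_t *out)
{
    assert(offered != NULL && name != NULL && out != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(offered[i].name, name) == 0) {
            *out = offered[i].value;
            return FIELD_OK;
        }
    }
    return FIELD_NOT_OFFERED;
}
```

A caller would check the return code before looking at the value, which is the "applications had better handle that case" requirement in miniature.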
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
2004-09-09 0:35 ` William Lee Irwin III
@ 2004-09-09 11:53 ` Stephen Smalley
2004-09-09 17:22 ` William Lee Irwin III
1 sibling, 1 reply; 69+ messages in thread
From: Stephen Smalley @ 2004-09-09 11:53 UTC (permalink / raw)
To: Roger Luethi
Cc: Andrew Morton, lkml, Albert Cahalan, William Lee Irwin III,
Martin J. Bligh, Paul Jackson
On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
> A few notes:
> - Access control can be implemented easily. Right now it would be bloat,
> though -- the vast majority of fields in /proc are world-readable
> (/proc/pid/environ being the notable exception).
They aren't world readable when using a security module like SELinux;
they are then typically only accessible by processes in the same
security domain, aside from processes in privileged domains.
security_task_to_inode() hook sets the security attributes on the
/proc/pid inodes based on their security context, and then
security_inode_permission() hook controls access to them. So you need
at least comparable controls.
--
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 11:53 ` Stephen Smalley
@ 2004-09-09 17:22 ` William Lee Irwin III
2004-09-09 17:53 ` Roger Luethi
0 siblings, 1 reply; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-09 17:22 UTC (permalink / raw)
To: Stephen Smalley
Cc: Roger Luethi, Andrew Morton, lkml, Albert Cahalan,
Martin J. Bligh, Paul Jackson
On Wed, 2004-09-08 at 14:41, Roger Luethi wrote:
>> A few notes:
>> - Access control can be implemented easily. Right now it would be bloat,
>> though -- the vast majority of fields in /proc are world-readable
>> (/proc/pid/environ being the notable exception).
On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> They aren't world readable when using a security module like SELinux;
> they are then typically only accessible by processes in the same
> security domain, aside from processes in privileged domains.
> security_task_to_inode() hook sets the security attributes on the
> /proc/pid inodes based on their security context, and then
> security_inode_permission() hook controls access to them. So you need
> at least comparable controls.
Can you make a more specific suggestion regarding the controls to use?
It's a bit awkward for those highly unfamiliar with the subsystem to
invent new methods for the security layer independently, so it's likely
best some guidance (e.g. function prototype) be given.
-- wli
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 17:22 ` William Lee Irwin III
@ 2004-09-09 17:53 ` Roger Luethi
2004-09-09 20:01 ` Stephen Smalley
2004-09-09 20:44 ` Chris Wright
0 siblings, 2 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 17:53 UTC (permalink / raw)
To: William Lee Irwin III, Stephen Smalley, Andrew Morton, lkml,
Albert Cahalan, Martin J. Bligh, Paul Jackson
On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote:
> On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote:
> > They aren't world readable when using a security module like SELinux;
> > they are then typically only accessible by processes in the same
> > security domain, aside from processes in privileged domains.
> > security_task_to_inode() hook sets the security attributes on the
> > /proc/pid inodes based on their security context, and then
> > security_inode_permission() hook controls access to them. So you need
> > at least comparable controls.
>
> Can you make a more specific suggestion regarding the controls to use?
> It's a bit awkward for those highly unfamiliar with the subsystem to
For the same reason, I'm not comfortable with implementing SELinux type
access controls myself. How about:
config NPROC
	depends on !SECURITY_SELINUX
Adding access control later won't be a problem for anyone who groks
SELinux.
Roger
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 17:53 ` Roger Luethi @ 2004-09-09 20:01 ` Stephen Smalley 2004-09-09 20:48 ` Chris Wright 2004-09-09 20:55 ` Roger Luethi 2004-09-09 20:44 ` Chris Wright 1 sibling, 2 replies; 69+ messages in thread From: Stephen Smalley @ 2004-09-09 20:01 UTC (permalink / raw) To: Roger Luethi Cc: William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris, Chris Wright On Thu, 2004-09-09 at 13:53, Roger Luethi wrote: > On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote: > > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote: > > > They aren't world readable when using a security module like SELinux; > > > they are then typically only accessible by processes in the same > > > security domain, aside from processes in privileged domains. > > > security_task_to_inode() hook sets the security attributes on the > > > /proc/pid inodes based on their security context, and then > > > security_inode_permission() hook controls access to them. So you need > > > at least comparable controls. > > > > Can you make a more specific suggestion regarding the controls to use? > > It's a bit awkward for those highly unfamiliar with the subsystem to > > For the same reason, I'm not comfortable with implementing SELinux type > access controls myself. How about: > > config NPROC > depends on !SECURITY_SELINUX > > Adding access control later won't be a problem for anyone who groks > SELinux. Well, it isn't that easy, or at least I don't think it is. The problem is that there is no way presently to convey the sender's security credentials (beyond the existing uid, cap information), since the LSM patches for adding security fields and hooks for managing skb security fields were rejected. 
The best we can do at present is pass along the sender pid, uid, and cap, and the security module can look up the pid if it chooses to get the security field (but is naturally subject to races in that situation). Most obvious place to hook would be nproc_ps_get_task; we could then perform a check based on the sender's credentials and the target task's credentials, and simply return NULL if permission is not granted for that pair, thus skipping that task as if it didn't exist. That requires propagating the sender's credentials down to that function. Untested patch below. Index: linux-2.6/include/linux/security.h =================================================================== RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v retrieving revision 1.37 diff -u -p -r1.37 security.h --- linux-2.6/include/linux/security.h 16 Jun 2004 14:49:42 -0000 1.37 +++ linux-2.6/include/linux/security.h 9 Sep 2004 19:38:23 -0000 @@ -632,6 +632,13 @@ struct swap_info_struct; * security attributes, e.g. for /proc/pid inodes. * @p contains the task_struct for the task. * @inode contains the inode structure for the inode. + * @task_getstate: + * Check permission before getting the state of a task. + * @pid contains the pid of the requesting process. + * @p contains the task_struct for the target task. + * @uid contains the uid of the requesting process. + * @caps contains the capability set of the requesting process. + * Return 0 if permission is granted. * * Security hooks for Netlink messaging. 
* @@ -1153,6 +1160,7 @@ struct security_operations { unsigned long arg5); void (*task_reparent_to_init) (struct task_struct * p); void (*task_to_inode)(struct task_struct *p, struct inode *inode); + int (*task_getstate)(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps); int (*ipc_permission) (struct kern_ipc_perm * ipcp, short flag); @@ -1756,6 +1764,11 @@ static inline void security_task_to_inod security_ops->task_to_inode(p, inode); } +static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps) +{ + return security_ops->task_getstate(pid, p, uid, caps); +} + static inline int security_ipc_permission (struct kern_ipc_perm *ipcp, short flag) { @@ -2389,6 +2402,11 @@ static inline void security_task_reparen static inline void security_task_to_inode(struct task_struct *p, struct inode *inode) { } +static inline int security_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps) +{ + return 0; +} + static inline int security_ipc_permission (struct kern_ipc_perm *ipcp, short flag) { Index: linux-2.6/security/dummy.c =================================================================== RCS file: /nfshome/pal/CVS/linux-2.6/security/dummy.c,v retrieving revision 1.34 diff -u -p -r1.34 dummy.c --- linux-2.6/security/dummy.c 16 Jun 2004 14:49:42 -0000 1.34 +++ linux-2.6/security/dummy.c 9 Sep 2004 19:39:01 -0000 @@ -619,6 +619,12 @@ static void dummy_task_reparent_to_init static void dummy_task_to_inode(struct task_struct *p, struct inode *inode) { } + +static int dummy_task_getstate(pid_t pid, struct task_struct *p, uid_t uid, kernel_cap_t caps) +{ + return 0; +} + static int dummy_ipc_permission (struct kern_ipc_perm *ipcp, short flag) { return 0; @@ -979,6 +985,7 @@ void security_fixup_ops (struct security set_to_dummy_if_null(ops, task_prctl); set_to_dummy_if_null(ops, task_reparent_to_init); set_to_dummy_if_null(ops, task_to_inode); + set_to_dummy_if_null(ops, task_getstate); 
set_to_dummy_if_null(ops, ipc_permission); set_to_dummy_if_null(ops, msg_msg_alloc_security); set_to_dummy_if_null(ops, msg_msg_free_security); --- linux-2.6/kernel/nproc.c.orig 2004-09-09 15:51:25.727833776 -0400 +++ linux-2.6/kernel/nproc.c 2004-09-09 15:30:19.171379624 -0400 @@ -296,7 +296,7 @@ out: /* * Find task for given pid, grab task lock (caller must unlock). */ -static task_t *nproc_ps_get_task(int pid) +static task_t *nproc_ps_get_task(struct nlmsghdr *nlh, int pid, uid_t uid, kernel_cap_t caps) { task_t *tsk; @@ -305,13 +305,17 @@ static task_t *nproc_ps_get_task(int pid if (tsk) get_task_struct(tsk); read_unlock(&tasklist_lock); + if (tsk && security_task_getstate(nlh->nlmsg_pid, tsk, uid, caps)) { + put_task_struct(tsk); + return NULL; + } return tsk; } /* * Iterate over a list of PIDs. */ -static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata) +static int nproc_ps_select_pid(struct nlmsghdr *nlh, u32 *fdata, u32 len, u32 left, u32 *sdata, uid_t uid, kernel_cap_t caps) { int i; int err = 0; @@ -335,7 +339,7 @@ static int nproc_ps_select_pid(struct nl for (i = 0; i < tcnt; i++) { task_t *tsk; - tsk = nproc_ps_get_task(pids[i]); + tsk = nproc_ps_get_task(nlh, pids[i], uid, caps); if (!tsk) continue; err = nproc_pid_msg(nlh, fdata, len, tsk); @@ -357,7 +361,7 @@ err_inval: /* * Iterate over all PIDs. */ -static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len) +static int nproc_ps_select_all(struct nlmsghdr *nlh, u32 *fdata, u32 len, uid_t uid, kernel_cap_t caps) { void *map; int offset, i; @@ -378,7 +382,7 @@ static int nproc_ps_select_all(struct nl if (offset >= BITS_PER_PAGE) break; pid = offset + i * BITS_PER_PAGE; - tsk = nproc_ps_get_task(pid); + tsk = nproc_ps_get_task(nlh, pid, uid, caps); if (!tsk) continue; err = nproc_pid_msg(nlh, fdata, len, tsk); @@ -467,7 +471,7 @@ err_inval: * Call the chosen process selector. Adding additional selectors * (e.g. 
select by uid) is easy, but is there a need? */ -static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid) +static int nproc_get_ps(struct nlmsghdr *nlh, uid_t uid, kernel_cap_t caps) { int err; u32 len; @@ -490,11 +494,11 @@ static int nproc_get_ps(struct nlmsghdr case NPROC_SELECT_ALL: if (left) pwarn("%d bytes left.\n", left); - err = nproc_ps_select_all(nlh, data, len); + err = nproc_ps_select_all(nlh, data, len, uid, caps); break; case NPROC_SELECT_PID: err = nproc_ps_select_pid(nlh, data, len, - left, sdata + 1); + left, sdata + 1, uid, caps); break; default: pwarn("Unknown selection method %#x.\n", *sdata); @@ -787,7 +791,7 @@ static __inline__ int nproc_process_msg( err = nproc_get_global(nlh); break; case NPROC_GET_PS: - err = nproc_get_ps(nlh, uid); + err = nproc_get_ps(nlh, uid, caps); break; default: pwarn("Unknown msg type %#x.\n", nlh->nlmsg_type); -- Stephen Smalley <sds@epoch.ncsc.mil> National Security Agency ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 20:01 ` Stephen Smalley @ 2004-09-09 20:48 ` Chris Wright 2004-09-10 12:11 ` Stephen Smalley 2004-09-09 20:55 ` Roger Luethi 1 sibling, 1 reply; 69+ messages in thread From: Chris Wright @ 2004-09-09 20:48 UTC (permalink / raw) To: Stephen Smalley Cc: Roger Luethi, William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris, Chris Wright * Stephen Smalley (sds@epoch.ncsc.mil) wrote: > Well, it isn't that easy, or at least I don't think it is. The problem > is that there is no way presently to convey the sender's security > credentials (beyond the existing uid, cap information), since the LSM > patches for adding security fields and hooks for managing skb security > fields were rejected. The best we can do at present is pass along the > sender pid, uid, and cap, and the security module can look up the pid if > it chooses to get the security field (but is naturally subject to races > in that situation). > > Most obvious place to hook would be nproc_ps_get_task; we could then > perform a check based on the sender's credentials and the target task's > credentials, and simply return NULL if permission is not granted for > that pair, thus skipping that task as if it didn't exist. That requires > propagating the sender's credentials down to that function. > > Untested patch below. > > Index: linux-2.6/include/linux/security.h > =================================================================== > RCS file: /nfshome/pal/CVS/linux-2.6/include/linux/security.h,v > retrieving revision 1.37 > diff -u -p -r1.37 security.h > --- linux-2.6/include/linux/security.h 16 Jun 2004 14:49:42 -0000 1.37 > +++ linux-2.6/include/linux/security.h 9 Sep 2004 19:38:23 -0000 > @@ -632,6 +632,13 @@ struct swap_info_struct; > * security attributes, e.g. for /proc/pid inodes. > * @p contains the task_struct for the task. > * @inode contains the inode structure for the inode. 
> + * @task_getstate: > + * Check permission before getting the state of a task. > + * @pid contains the pid of the requesting process. > + * @p contains the task_struct for the target task. > + * @uid contains the uid of the requesting process. > + * @caps contains the capability set of the requesting process. > + * Return 0 if permission is granted. Why caps? thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 20:48 ` Chris Wright
@ 2004-09-10 12:11 ` Stephen Smalley
0 siblings, 0 replies; 69+ messages in thread
From: Stephen Smalley @ 2004-09-10 12:11 UTC (permalink / raw)
To: Chris Wright
Cc: Roger Luethi, William Lee Irwin III, Andrew Morton, lkml,
Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris
On Thu, 2004-09-09 at 16:48, Chris Wright wrote:
> > + * @task_getstate:
> > + * Check permission before getting the state of a task.
> > + * @pid contains the pid of the requesting process.
> > + * @p contains the task_struct for the target task.
> > + * @uid contains the uid of the requesting process.
> > + * @caps contains the capability set of the requesting process.
> > + * Return 0 if permission is granted.
>
> Why caps?
It is readily available in the netlink skb parms, and someone might
want to use it, e.g. a security module might limit a requesting process
to only getting state of other processes with the same uid unless the
requesting process has some capability.
--
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency
^ permalink raw reply [flat|nested] 69+ messages in thread
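The kind of policy described above (same uid, unless the requester holds some capability) might look roughly like the following user-space sketch. It is only an illustration of the decision logic: `cap_set_t`, `CAP_BIT_INSPECT`, `struct requester`, `struct target`, and `example_task_getstate` are hypothetical stand-ins, not the kernel's `kernel_cap_t`, a real capability, or the proposed hook itself. Only the "return 0 if permission is granted" convention is taken from the patch comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t cap_set_t;          /* stand-in for kernel_cap_t */
#define CAP_BIT_INSPECT (1u << 0)    /* hypothetical capability bit */

/* Credentials of the process that sent the request. */
struct requester {
    int pid;
    unsigned int uid;
    cap_set_t caps;
};

/* The task whose state is being requested. */
struct target {
    int pid;
    unsigned int uid;
};

/*
 * Example policy: a requester may read the state of tasks with its own
 * uid; reading other users' tasks needs the (hypothetical) capability.
 * Returns 0 if permission is granted, -1 otherwise.
 */
static int example_task_getstate(const struct requester *req,
                                 const struct target *tgt)
{
    assert(req != NULL && tgt != NULL);
    if (req->uid == tgt->uid)
        return 0;
    if (req->caps & CAP_BIT_INSPECT)
        return 0;
    return -1;
}
```

Under this policy a denied task would simply be skipped by the caller, matching the "as if it didn't exist" behavior suggested for nproc_ps_get_task.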
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 20:01 ` Stephen Smalley 2004-09-09 20:48 ` Chris Wright @ 2004-09-09 20:55 ` Roger Luethi 2004-09-09 21:05 ` Chris Wright 2004-09-09 21:25 ` Roger Luethi 1 sibling, 2 replies; 69+ messages in thread From: Roger Luethi @ 2004-09-09 20:55 UTC (permalink / raw) To: Stephen Smalley Cc: William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris, Chris Wright On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote: > > For the same reason, I'm not comfortable with implementing SELinux type > > access controls myself. How about: > > > > config NPROC > > depends on !SECURITY_SELINUX > > > > Adding access control later won't be a problem for anyone who groks > > SELinux. > [...] > Most obvious place to hook would be nproc_ps_get_task; we could then > perform a check based on the sender's credentials and the target task's > credentials, and simply return NULL if permission is not granted for > that pair, thus skipping that task as if it didn't exist. That requires > propagating the sender's credentials down to that function. > > Untested patch below. I used a somewhat different approach in my development tree (not SELinuxy, though): Most fields were world readable, some required credentials. I don't have any strong feelings on access control, so I'd be happy with any mechanism that doesn't completely botch performance. Anyway, I do not consider lack of access controls to be a showstopper. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 20:55 ` Roger Luethi @ 2004-09-09 21:05 ` Chris Wright 2004-09-09 21:25 ` Roger Luethi 1 sibling, 0 replies; 69+ messages in thread From: Chris Wright @ 2004-09-09 21:05 UTC (permalink / raw) To: Stephen Smalley, William Lee Irwin III, Andrew Morton, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris, Chris Wright * Roger Luethi (rl@hellgate.ch) wrote: > On Thu, 09 Sep 2004 16:01:06 -0400, Stephen Smalley wrote: > > > For the same reason, I'm not comfortable with implementing SELinux type > > > access controls myself. How about: > > > > > > config NPROC > > > depends on !SECURITY_SELINUX > > > > > > Adding access control later won't be a problem for anyone who groks > > > SELinux. > > > [...] > > Most obvious place to hook would be nproc_ps_get_task; we could then > > perform a check based on the sender's credentials and the target task's > > credentials, and simply return NULL if permission is not granted for > > that pair, thus skipping that task as if it didn't exist. That requires > > propagating the sender's credentials down to that function. > > > > Untested patch below. > > I used a somewhat different approach in my development tree (not > SELinuxy, though): Most fields were world readable, some required > credentials. > > I don't have any strong feelings on access control, so I'd be happy > with any mechanism that doesn't completely botch performance. Anyway, > I do not consider lack of access controls to be a showstopper. Some of these things become quite sensitive, esp across setuid, etc. For prototyping, I agree, not a showstopper. For merging, it should be figured out properly. thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 20:55 ` Roger Luethi
2004-09-09 21:05 ` Chris Wright
@ 2004-09-09 21:25 ` Roger Luethi
2004-09-11 22:36 ` Albert Cahalan
1 sibling, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-09 21:25 UTC (permalink / raw)
To: Stephen Smalley, William Lee Irwin III, Andrew Morton, lkml,
Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
Chris Wright
On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> I used a somewhat different approach in my development tree (not
> SELinuxy, though): Most fields were world readable, some required
> credentials.
I forgot to mention that you can see the remnants of that approach in
<linux/nproc.h>: I used two bits of the field ID to define per-field
access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
Roger
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-09 21:25 ` Roger Luethi
@ 2004-09-11 22:36 ` Albert Cahalan
2004-09-12 5:00 ` William Lee Irwin III
2004-09-14 6:44 ` Roger Luethi
0 siblings, 2 replies; 69+ messages in thread
From: Albert Cahalan @ 2004-09-11 22:36 UTC (permalink / raw)
To: Roger Luethi
Cc: Stephen Smalley, William Lee Irwin III, Andrew Morton OSDL, lkml,
Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
Chris Wright
On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
> On Thu, 09 Sep 2004 22:55:31 +0200, Roger Luethi wrote:
> > I used a somewhat different approach in my development tree (not
> > SELinuxy, though): Most fields were world readable, some required
> > credentials.
>
> I forgot to mention that you can see the remnants of that approach in
> <linux/nproc.h>: I used two bits of the field ID to define per-field
> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
Besides the low-security and high-security choices,
I'd like to see a medium-security choice.
low: everybody sees everything
medium: everybody sees something; privileged user sees all
high: must be privileged
This might mean that asking for stuff like EIP and WCHAN
causes you to see fewer processes.
If partial info is returned for a process, I'd like to
also get a bitmap of valid fields. Special "not valid"
values are a pain to deal with.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-11 22:36 ` Albert Cahalan
@ 2004-09-12 5:00 ` William Lee Irwin III
2004-09-14 6:44 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-12 5:00 UTC (permalink / raw)
To: Albert Cahalan
Cc: Roger Luethi, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris,
Chris Wright
On Thu, 2004-09-09 at 17:25, Roger Luethi wrote:
>> I forgot to mention that you can see the remnants of that approach in
>> <linux/nproc.h>: I used two bits of the field ID to define per-field
>> access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT).
On Sat, Sep 11, 2004 at 06:36:53PM -0400, Albert Cahalan wrote:
> Besides the low-security and high-security choices,
> I'd like to see a medium-security choice.
> low: everybody sees everything
> medium: everybody sees something; privileged user sees all
> high: must be privileged
> This might mean that asking for stuff like EIP and WCHAN
> causes you to see fewer processes.
> If partial info is returned for a process, I'd like to
> also get a bitmap of valid fields. Special "not valid"
> values are a pain to deal with.
That's an interesting observation. Perhaps the union of the mmu and
nommu fields should be nominally reported alongside a bitmap of the
useful fields?
-- wli
^ permalink raw reply [flat|nested] 69+ messages in thread
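The "bitmap of valid fields" idea could be sketched in user space as follows. This is only one possible reading of the suggestion: the `struct task_reply` layout, the 32-field limit, and the `reply_set` / `reply_get` helpers are assumptions made for illustration and do not describe any nproc message format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FIELDS 32   /* assumed limit so one 32-bit word can cover all fields */

/* Hypothetical per-task reply: every slot is present, a bitmap marks the valid ones. */
struct task_reply {
    uint32_t valid;               /* bit i set => values[i] is meaningful */
    uint32_t values[MAX_FIELDS];
};

/* Producer side: store a value and mark its slot as valid. */
static void reply_set(struct task_reply *r, unsigned int idx, uint32_t value)
{
    assert(r != NULL && idx < MAX_FIELDS);
    r->values[idx] = value;
    r->valid |= UINT32_C(1) << idx;
}

/* Consumer side: returns false if the field was not filled in for this task. */
static bool reply_get(const struct task_reply *r, unsigned int idx, uint32_t *out)
{
    assert(r != NULL && out != NULL && idx < MAX_FIELDS);
    if (!(r->valid & (UINT32_C(1) << idx)))
        return false;
    *out = r->values[idx];
    return true;
}
```

With such a layout a legitimately zero value and a missing value stay distinguishable without reserving a special "not valid" number, at the cost of one extra word per reply.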
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-11 22:36 ` Albert Cahalan 2004-09-12 5:00 ` William Lee Irwin III @ 2004-09-14 6:44 ` Roger Luethi 2004-09-14 7:10 ` William Lee Irwin III 1 sibling, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 6:44 UTC (permalink / raw) To: Albert Cahalan Cc: Stephen Smalley, William Lee Irwin III, Andrew Morton OSDL, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson, James Morris, Chris Wright On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote: > > I forgot to mention that you can see the remnants of that approach in > > <linux/nproc.h>: I used two bits of the field ID to define per-field > > access restrictions (NPROC_PERM_USER, NPROC_PERM_ROOT). > > Besides the low-security and high-security choices, > I'd like to see a medium-security choice. > > low: everybody sees everything > medium: everybody sees something; privileged user sees all > high: must be privileged > > This might mean that asking for stuff like EIP and WCHAN > causes you to see fewer processes. I'm not sure I understand you correctly, but the combination of NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your description: - If the access control bits for a field are cleared, any process/user can get that field information for any process. - If the access control bits are set to NPROC_PERM_USER, only root and the owner of a process can read the field for that process. - For NPROC_PERM_ROOT, only root can ever read such a field. I picked that design because it captures the essence of what /proc does today. > If partial info is returned for a process, I'd like to > also get a bitmap of valid fields. Special "not valid" > values are a pain to deal with. If an app asks for a field it has no or partial permission for, the set of processes returned is trimmed accordingly. Since an application will expect this behavior based on the access control bits, no guessing is involved here. 
If an app asks for a non-existent field (not supported on this
architecture or obsolete), it will get an error back. No guessing
involved here, either. We could report the bad field ID back, but it's
easy for user-space to figure out and it's not in the fast path (for
user space).

The tricky case is if an app asks for an offered field without permission
problems, but the field is not available in that particular context. The
only instance of this that comes to mind is the combination of mm_struct
related fields and kernel threads. Neither returning an error nor skipping
affected processes seems a good solution. In this special case, the current
nproc code returns 0, but that's probably not optimal. Currently,
my preferred solution would be to return ~(0).

I'm not convinced yet that making message formats more complex (adding
bitmaps or lists of applicable fields or something) for one special
case is a better idea.

Roger
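The per-field access rules described in this message could be checked
roughly as follows. This is a minimal sketch, not code from the patch:
the bit positions chosen for NPROC_PERM_USER and NPROC_PERM_ROOT, the
helper name field_readable(), and the treatment of "both bits set" as
root-only are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions. The real values are defined in the patch's
 * <linux/nproc.h>, which is not reproduced in this part of the thread. */
#define NPROC_PERM_USER 0x40000000u
#define NPROC_PERM_ROOT 0x80000000u
#define NPROC_PERM_MASK (NPROC_PERM_USER | NPROC_PERM_ROOT)

/* Returns true if a reader with uid reader_uid may read field_id for a
 * task owned by task_uid, following the three rules in the message:
 *   no bits set      -> anyone may read the field
 *   NPROC_PERM_USER  -> only root and the owner of the task
 *   NPROC_PERM_ROOT  -> only root
 * Both bits set is treated as root-only here (an assumption). */
static bool field_readable(uint32_t field_id, unsigned reader_uid,
                           unsigned task_uid)
{
    uint32_t perm = field_id & NPROC_PERM_MASK;

    if (perm == 0)
        return true;                    /* world-readable field */
    if (reader_uid == 0)
        return true;                    /* root can read everything */
    if (perm == NPROC_PERM_USER)
        return reader_uid == task_uid;  /* owner-only field */
    return false;                       /* root-only field */
}
```

Because the restriction travels in the field ID itself, a tool could run
the same test in user space to predict which tasks a request will cover,
even for fields it does not otherwise know.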
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 6:44 ` Roger Luethi @ 2004-09-14 7:10 ` William Lee Irwin III 2004-09-14 7:55 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 7:10 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote: >> This might mean that asking for stuff like EIP and WCHAN >> causes you to see fewer processes. On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote: > I'm not sure I understand you correctly, but the combination of > NPROC_PERM_USER and NPROC_PERM_ROOT already seems to fit your > description: > - If the access control bits for a field are cleared, any process/user > can get that field information for any process. > - If the access control bits are set to NPROC_PERM_USER, only root and > the owner of a process can read the field for that process. > - For NPROC_PERM_ROOT, only root can ever read such a field. > I picked that design because it captures the essence of what /proc > does today. The concern appears to be that the tools might interpret failed permission checks as indications of process nonexistence. I don't regard this as particularly pressing, as properly-written apps should check the specific value of errno (in particular to retry when EAGAIN is received in numerous contexts). On Sat, 11 Sep 2004 18:36:53 -0400, Albert Cahalan wrote: >> If partial info is returned for a process, I'd like to >> also get a bitmap of valid fields. Special "not valid" >> values are a pain to deal with. On Tue, Sep 14, 2004 at 08:44:03AM +0200, Roger Luethi wrote: > If an app asks for a field it has no or partial permission for, the set > of processes returned is trimmed accordingly. Since an application will > expect this behavior based on the access control bits, no guessing is > involved here. 
> If an app asks for a non-existent field (not supported on this
> architecture or obsolete), it will get an error back. No guessing
> involved here, either. We could report the bad field ID back, but it's
> easy for user-space to figure out and it's not in the fast path (for
> user space).
> The tricky case is if an app asks for an offered field without permission
> problems, but the field is not available in that particular context. The
> only instance of this that comes to mind is the combination of mm_struct
> related fields and kernel threads. Neither returning an error nor skipping
> affected processes seems a good solution. In this special case, the current
> nproc code returns 0, but that's probably not optimal. Currently,
> my preferred solution would be to return ~(0).
> I'm not convinced yet that making message formats more complex (adding
> bitmaps or lists of applicable fields or something) for one special
> case is a better idea.

Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
done if the fields are measured in units such that the top bit is never
set for any feasible value, then a fully qualified error return could
simply be returned as (unsigned long)(-err). I suspect VSZ may be
problematic wrt. overflows even for 32-bit, not just for 31-bit.

-- wli
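The in-band error encoding suggested above, and the VSZ concern that
comes with it, can be illustrated with a small sketch. It uses a fixed
32-bit field instead of unsigned long so that the behaviour is the same
on every host; the helper names are hypothetical and nothing here is
taken from the nproc patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Encode an errno value in a 32-bit field as its two's complement,
 * the 32-bit analogue of (unsigned long)(-err). */
static uint32_t encode_error(int err)
{
    return (uint32_t)(-err);
}

/* Under the proposal, any value with the top bit set is read as an error. */
static bool is_error(uint32_t value)
{
    return (value & 0x80000000u) != 0;
}

/* Recover the errno value. Only meaningful when is_error(value) is true. */
static int decode_error(uint32_t value)
{
    return (int)(uint32_t)(0u - value);
}
```

With this scheme a virtual size of 3 GiB reported in bytes is 0xC0000000,
which has the top bit set and would be misread as an error. The same size
reported in KiB is 0x00300000 and stays unambiguous, which is why the
choice of units matters for the proposal.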
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-14 7:10 ` William Lee Irwin III
@ 2004-09-14 7:55 ` Roger Luethi
2004-09-14 8:01 ` William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-14 7:55 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote:
> > - If the access control bits for a field are cleared, any process/user
> >   can get that field information for any process.
> > - If the access control bits are set to NPROC_PERM_USER, only root and
> >   the owner of a process can read the field for that process.
> > - For NPROC_PERM_ROOT, only root can ever read such a field.
> > I picked that design because it captures the essence of what /proc
> > does today.
>
> The concern appears to be that the tools might interpret failed
> permission checks as indications of process nonexistence. I don't
> regard this as particularly pressing, as properly-written apps should
> check the specific value of errno (in particular to retry when EAGAIN
> is received in numerous contexts).

I would expect a tool to refrain from asking for fields with restricted
access if it needs a complete overview over existing processes. It can
always ask for restricted fields in a second request (the vast majority
of fields are world-readable anyway).

> > processes seems a good solution. In this special case, the current
> > nproc code returns 0, but that's probably not optimal. Currently,
> > my preferred solution would be to return ~(0).
> > I'm not convinced yet that making message formats more complex (adding
> > bitmaps or lists of applicable fields or something) for one special
> > case is a better idea.
>
> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be
> done if the fields are measured in units such that the top bit is never
> set for any feasible value, then a fully qualified error return could
> simply be returned as (unsigned long)(-err). I suspect VSZ may be
> problematic wrt. overflows even for 32-bit, not just for 31-bit.

Yeah, that makes me nervous. There are just too many ways this can go
wrong or be misinterpreted in user space. Currently, nproc does not
indicate the type of error at all, because a properly written user-space
app will either not hit an error or be able to figure out what the
problem was based on the available information. I suppose if we wanted
to change that (which doesn't sound unreasonable), the proper way would
be to return error flags with an error message (delivered via netlink).

Roger
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 7:55 ` Roger Luethi @ 2004-09-14 8:01 ` William Lee Irwin III 2004-09-14 9:27 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 8:01 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote: >> The concern appears to be that the tools might interpret failed >> permission checks as indications of process nonexistence. I don't >> regard this as particularly pressing, as properly-written apps should >> check the specific value of errno (in particular to retry when EAGAIN >> is received in numerous contexts). On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote: > I would expect a tool to refrain from asking for fields with restricted > access if it needs a complete overview over existing processes. It can > always ask for restricted fields in a second request (the vast majority > of fields are world-readable anyway). That expectation can't be entirely relied upon, as the restrictions may not be predictable. On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote: >> Distinguishing between EPERM, ENOSYS, ENOENT, etc. could probably be >> done if the fields are measured in units such that the top bit is never >> set for any feasible value, then a fully qualified error return could >> simply be returned as (unsigned long)(-err). I suspect VSZ may be >> problematic wrt. overflows even for 32-bit, not just for 31-bit. On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote: > Yeah, that makes me nervous. There are just too many ways this can go > wrong or be misinterpreted in user space. 
Currently, nproc does not > indicate the type of error at all, because a properly written user-space > app will either not hit an error or be able to figure out what the > problem was based on the available information. I suppose if we wanted > to change that (which doesn't sound unreasonable), the proper way would > be to return error flags with an error message (delivered via netlink). This kind of error reporting is better still, as the fields then won't be polluted with invalid data under any circumstance (assuming the code can report subsets of the fields or some such, which I presume to be the case given that avoiding reporting potentially computationally expensive fields was one of the original motivators of the patch). -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 8:01 ` William Lee Irwin III @ 2004-09-14 9:27 ` Roger Luethi 2004-09-14 15:37 ` William Lee Irwin III 0 siblings, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 9:27 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote: > On Tue, 14 Sep 2004 00:10:58 -0700, William Lee Irwin III wrote: > >> The concern appears to be that the tools might interpret failed > >> permission checks as indications of process nonexistence. I don't > >> regard this as particularly pressing, as properly-written apps should > >> check the specific value of errno (in particular to retry when EAGAIN > >> is received in numerous contexts). > > On Tue, Sep 14, 2004 at 09:55:08AM +0200, Roger Luethi wrote: > > I would expect a tool to refrain from asking for fields with restricted > > access if it needs a complete overview over existing processes. It can > > always ask for restricted fields in a second request (the vast majority > > of fields are world-readable anyway). > > That expectation can't be entirely relied upon, as the restrictions may > not be predictable. They should be. For the simple design I described the access restrictions are part of the field ID, so a tool can deduce the exact type of access restrictions even if it doesn't know the field. There's plenty of space left for additional access control flags in the field ID. If it gets much more complex, the application (let alone the kernel) has to have some knowledge of the security model anyway, so we could have simple operations that allow a tool to discover how access restrictions apply to the supported fields. > > problem was based on the available information. 
I suppose if we wanted
> > to change that (which doesn't sound unreasonable), the proper way would
> > be to return error flags with an error message (delivered via netlink).
>
> This kind of error reporting is better still, as the fields then won't
> be polluted with invalid data under any circumstance (assuming the code
> can report subsets of the fields or some such, which I presume to be
> the case given that avoiding reporting potentially computationally
> expensive fields was one of the original motivators of the patch).

It cannot easily, and I don't think it wants to. The reason it's hard to
just reply with a subset is that the kernel does not send any description
of the reply content other than the serial number of the request --
it's up to the tool to know what it asked for. So if you remove a field,
you'd have to let user-space know which field you removed. Sending only
the allowed subset makes handling on both sides more complicated --
the kernel needs to build different kinds of messages in answer to one
request, and user-space tools need to be able to parse that.

The way the interface works now, though, is that a tool can rely on
the content of the reply to match the request. This makes the common
case both easy to write and fast.

Let me break it down once again:

- If a tool asks for a field the kernel doesn't know about, that's a
  fatal error. An error message is returned, nothing else (this can be
  discovered before any other reply is delivered).

- If a tool specifically asks for a process which doesn't exist,
  nothing is returned. We could return an error indicating that. Might
  be a good idea.

- If a tool asks for a field it doesn't have permission to read, it usually
  does have permission to read that field for some tasks (e.g. same owner),
  but not for others. So for some replies to one request, all requested
  fields will contain meaningful values.
What about the replies that describe the tasks where the tool must not read at least some of the requested values? I chose to simply skip those tasks. We could also send an error message ("some tasks omitted") or send a complete reply with the restricted fields zeroed and a special flag set ("some fields in this reply zeroed due to access control"). I'm really afraid of over-engineering something here, though. The fields requested by tools like ps and top by default are all world readable in /proc. I showed that solutions fit right in should we ever need access control for real-world applications. For now, I'd rather not extend the interface significantly unless the current semantics are clearly insufficient. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 9:27 ` Roger Luethi @ 2004-09-14 15:37 ` William Lee Irwin III 2004-09-14 16:01 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 15:37 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote: >> That expectation can't be entirely relied upon, as the restrictions may >> not be predictable. On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > They should be. For the simple design I described the access restrictions > are part of the field ID, so a tool can deduce the exact type of access > restrictions even if it doesn't know the field. There's plenty of space > left for additional access control flags in the field ID. No, in general races of the form "permissions were altered after I checked them" can happen. On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > If it gets much more complex, the application (let alone the kernel) > has to have some knowledge of the security model anyway, so we could have > simple operations that allow a tool to discover how access restrictions > apply to the supported fields. Checking that system calls succeeded is a minimum requirement at all times. Misinterpreting error returns is the app's fault. On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote: >> This kind of error reporting is better still, as the fields then won't >> be polluted with invalid data under any circumstance (assuming the code >> can report subsets of the fields or some such, which I presume to be >> the case given that avoiding reporting potentially computationally >> expensive fields was one of the original motivators of the patch). 
On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> It cannot easily, and I don't think it wants to. The reason it's hard to
> just reply with a subset is that the kernel does not send any description
> of the reply content other than the serial number of the request --
> it's up to the tool to know what it asked for. So if you remove a field,
> you'd have to let user-space know which field you removed. Sending only
> the allowed subset makes handling on both sides more complicated --
> the kernel needs to build different kinds of messages in answer to one
> request, and user-space tools need to be able to parse that.

Irritating. That must mean you can't ask for specific fields.

On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> The way the interface works now, though, is that a tool can rely on
> the content of the reply to match the request. This makes the common
> case both easy to write and fast.
> Let me break it down once again:
> - If a tool asks for a field the kernel doesn't know about, that's a
>   fatal error. An error message is returned, nothing else (this can be
>   discovered before any other reply is delivered).

If you can't ask for specific fields you're dead anyway.

On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool specifically asks for a process which doesn't exist,
>   nothing is returned. We could return an error indicating that. Might
>   be a good idea.

ESRCH and ENOENT sound good.

On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> - If a tool asks for a field it doesn't have permission to read, it usually
>   does have permission to read that field for some tasks (e.g. same owner),
>   but not for others. So for some replies to one request, all requested
>   fields will contain meaningful values. What about the replies that
>   describe the tasks where the tool must not read at least some of the
>   requested values? I chose to simply skip those tasks.
This is the bit about being dead already if you can't request subsets of fields and/or one field at a time. On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > We could also send an error message ("some tasks omitted") or send a > complete reply with the restricted fields zeroed and a special flag set > ("some fields in this reply zeroed due to access control"). > I'm really afraid of over-engineering something here, though. The fields > requested by tools like ps and top by default are all world readable > in /proc. I showed that solutions fit right in should we ever need > access control for real-world applications. For now, I'd rather not > extend the interface significantly unless the current semantics are > clearly insufficient. Well, "return this set of fields" means there's only one type of request necessary, and userspace merely iterates through the subsets obtained by striking out fields to which accesses caused errors until either the set is empty or the call succeeds. One field at a time at all times also means there's only one type of request necessary. So I don't see overengineering happening here, merely that "either all succeed or all fail" is a semantic that creates hardships for userspace; both the alternatives are simple. -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 15:37 ` William Lee Irwin III @ 2004-09-14 16:01 ` Roger Luethi 2004-09-14 16:37 ` William Lee Irwin III 0 siblings, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 16:01 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: > On Tue, 14 Sep 2004 01:01:32 -0700, William Lee Irwin III wrote: > >> That expectation can't be entirely relied upon, as the restrictions may > >> not be predictable. > > On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > > They should be. For the simple design I described the access restrictions > > are part of the field ID, so a tool can deduce the exact type of access > > restrictions even if it doesn't know the field. There's plenty of space > > left for additional access control flags in the field ID. > > No, in general races of the form "permissions were altered after I > checked them" can happen. Can you make an example? Some scenario where this would be important? > On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > > If it gets much more complex, the application (let alone the kernel) > > has to have some knowledge of the security model anyway, so we could have > > simple operations that allow a tool to discover how access restrictions > > apply to the supported fields. > > Checking that system calls succeeded is a minimum requirement at all > times. Misinterpreting error returns is the app's fault. It's async. You can't rely on return values. They'd have to be in netlink messages. > On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote: > > It cannot easily, and I don't think it wants to. 
The reason it's hard to
> > just reply with a subset is that the kernel does not send any description
> > of the reply content other than the serial number of the request --
> > it's up to the tool to know what it asked for. So if you remove a field,
> > you'd have to let user-space know which field you removed. Sending only
> > the allowed subset makes handling on both sides more complicated --
> > the kernel needs to build different kinds of messages in answer to one
> > request, and user-space tools need to be able to parse that.
>
> Irritating. That must mean you can't ask for specific fields.

How so? For process fields, the request block is one u32 indicating the
number of field IDs to follow, then a bunch of u32 containing field IDs.
Any subset of field IDs, in any order of the tool's choosing.

The kernel replies with one message per process, each message containing
all the fields the tool requested, in the same order.

> On Tue, Sep 14, 2004 at 11:27:48AM +0200, Roger Luethi wrote:
> > We could also send an error message ("some tasks omitted") or send a
> > complete reply with the restricted fields zeroed and a special flag set
> > ("some fields in this reply zeroed due to access control").
> > I'm really afraid of over-engineering something here, though. The fields
> > requested by tools like ps and top by default are all world readable
> > in /proc. I showed that solutions fit right in should we ever need
> > access control for real-world applications. For now, I'd rather not
> > extend the interface significantly unless the current semantics are
> > clearly insufficient.
>
> Well, "return this set of fields" means there's only one type of
> request necessary, and userspace merely iterates through the subsets
> obtained by striking out fields to which accesses caused errors until
> either the set is empty or the call succeeds. One field at a time at
> all times also means there's only one type of request necessary. So I

One field at a time at all times is unnecessarily slow.

Roger
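The request block described in this message (one u32 count followed by
that many u32 field IDs) could be assembled as in the sketch below. Only
the payload layout comes from the message; the function name, the
host-byte-order encoding and the omission of the surrounding netlink
header are assumptions, and the field IDs used with it would be
placeholders.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes "count, id[0], ..., id[count-1]" as consecutive u32 values into
 * buf. Returns the number of bytes written, or 0 if buf is too small.
 * The netlink message header that would precede this block is left out. */
static size_t build_field_request(void *buf, size_t buflen,
                                  const uint32_t *ids, uint32_t count)
{
    size_t need = ((size_t)count + 1) * sizeof(uint32_t);

    if (buflen < need)
        return 0;

    /* First word: how many field IDs follow. */
    memcpy(buf, &count, sizeof(count));

    /* Then the IDs, in the order the tool wants them back. */
    if (count > 0)
        memcpy((unsigned char *)buf + sizeof(count), ids,
               (size_t)count * sizeof(uint32_t));

    return need;
}
```

Since each reply carries the requested fields in the same order, the tool
can walk a reply with the very array of IDs it passed in here, without
any per-field description from the kernel.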
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 16:01 ` Roger Luethi @ 2004-09-14 16:37 ` William Lee Irwin III 2004-09-14 17:15 ` Roger Luethi 2004-09-14 18:37 ` Chris Wright 0 siblings, 2 replies; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 16:37 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: >> No, in general races of the form "permissions were altered after I >> checked them" can happen. On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > Can you make an example? Some scenario where this would be important? Not particularly. It largely means poorly-coded apps may report gibberish. On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: >> Checking that system calls succeeded is a minimum requirement at all >> times. Misinterpreting error returns is the app's fault. On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > It's async. You can't rely on return values. They'd have to be in > netlink messages. That's fine. Do these error messages specify which field access(es) caused the error? On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: >> Irritating. That must mean you can't ask for specific fields. On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > How so? For process fields, the request block is one u32 indicating the > number of field IDs to follow, then a bunch of u32 containing field IDs. > Any subset of field IDs, in any order of the tool's choosing. > The kernel replies with one message per process, each message containing > all the fields the tool requested, in the same order. Then assuming the error messages indicate which field access(es) caused the error(s), you're already done; userspace must merely retry the request with the offending fields cast out. 
Otherwise, you're still done: userspace can merely retry the field accesses one at a time (though it's nicer to say which ones caused the errors). On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: >> Well, "return this set of fields" means there's only one type of >> request necessary, and userspace merely iterates through the subsets >> obtained by striking out fields to which accesses caused errors until >> either the set is empty or the call succeeds. One field at a time at >> all times also means there's only one type of request necessary. So I On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > One field at a time at all times is unnecessarily slow. Yes, that was the "slower and stupider than thou" option. You've already vectorized field access requests, of which I heartily approve. -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
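The retry strategy suggested above (resend the request with the offending
fields cast out) amounts to filtering the field ID list between attempts.
A minimal sketch follows; it assumes the error message names the rejected
field IDs, which the interface as posted does not do, and both helper
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if id occurs in set[0..n-1]. */
static bool contains(const uint32_t *set, uint32_t n, uint32_t id)
{
    for (uint32_t i = 0; i < n; i++)
        if (set[i] == id)
            return true;
    return false;
}

/* Removes every ID listed in bad[] from ids[] in place, preserving the
 * order of the remaining IDs. Returns the new number of IDs, which the
 * caller uses as the count for the retried request. */
static uint32_t strike_fields(uint32_t *ids, uint32_t n,
                              const uint32_t *bad, uint32_t nbad)
{
    uint32_t kept = 0;

    for (uint32_t i = 0; i < n; i++)
        if (!contains(bad, nbad, ids[i]))
            ids[kept++] = ids[i];
    return kept;
}
```

A caller would loop: send the request, and on an error reply strike the
reported fields and resend, stopping when the request succeeds or the
list becomes empty. Without field IDs in the error, the fallback is the
slower one-field-at-a-time probing mentioned in the message.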
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 16:37 ` William Lee Irwin III @ 2004-09-14 17:15 ` Roger Luethi 2004-09-14 17:43 ` William Lee Irwin III 2004-09-14 18:37 ` Chris Wright 1 sibling, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 17:15 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote: > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: > >> No, in general races of the form "permissions were altered after I > >> checked them" can happen. > > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > > Can you make an example? Some scenario where this would be important? > > Not particularly. It largely means poorly-coded apps may report gibberish. If we are still talking about the same thing here, gibberish is a rather strong word. In the design I proposed access control affects the subset of tasks returned as a result -- the tool would still display meaningful information for the tasks it got replies for. Anyway, if the access restrictions are hard-coded into the field ID, then it's only the credentials that can change, and I can't see a race there at the moment. > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: > >> Checking that system calls succeeded is a minimum requirement at all > >> times. Misinterpreting error returns is the app's fault. > > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > > It's async. You can't rely on return values. They'd have to be in > > netlink messages. > > That's fine. Do these error messages specify which field access(es) > caused the error? They don't, because the access control I had in my dev tree silently skipped tasks containing fields the process had no permission to read. IOW, access control works as an implicit task selector. 
And security wise that's clean because the kernel does not reveal any information about other processes to the querying task (not even evidence of their existence). > Then assuming the error messages indicate which field access(es) caused > the error(s), you're already done; userspace must merely retry the > request with the offending fields cast out. Otherwise, you're still > done: userspace can merely retry the field accesses one at a time > (though it's nicer to say which ones caused the errors). Agreed on every point. The question I am pondering is: Does nproc need access control right now? It's more work in kernel and user space and adds new opportunities to introduce bugs. The merits seem rather dubious right now, considering that all the fields used by current process info tools (files /proc/pid{cmdline, stat, statm, status, wchan}) are world readable. So my preference is to wait with access control until we know where and how it is necessary. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 17:15 ` Roger Luethi @ 2004-09-14 17:43 ` William Lee Irwin III 2004-09-14 18:45 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 17:43 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote: >> Not particularly. It largely means poorly-coded apps may report gibberish. On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote: > If we are still talking about the same thing here, gibberish is a rather > strong word. In the design I proposed access control affects the subset > of tasks returned as a result -- the tool would still display meaningful > information for the tasks it got replies for. That sounds bizarre. I'd expect some kind of reply, even if merely an error. I suppose "no reply" could be interpreted as "ESRCH", though this means distinguishing between "some field caused an error" and "the thing is dead" means the app has to fall back to requesting fields one at a time. On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote: > Anyway, if the access restrictions are hard-coded into the field ID, > then it's only the credentials that can change, and I can't see a race > there at the moment. The race is in the app, not the kernel, so there's nothing to fix in the kernel apart from distinctions between ESRCH and EPERM in error reporting (otherwise the app is helpless to resolve the ambiguity). On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote: >> That's fine. Do these error messages specify which field access(es) >> caused the error? On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote: > They don't, because the access control I had in my dev tree silently > skipped tasks containing fields the process had no permission to read. 
> IOW, access control works as an implicit task selector. And security > wise that's clean because the kernel does not reveal any information > about other processes to the querying task (not even evidence of their > existence). If all errors are handled with "no reply", userspace loses some efficiency, as it's forced to retry field accesses one at a time and wait for timeouts on each of them for a dead/inaccessible task. On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote: >> Then assuming the error messages indicate which field access(es) caused >> the error(s), you're already done; userspace must merely retry the >> request with the offending fields cast out. Otherwise, you're still >> done: userspace can merely retry the field accesses one at a time >> (though it's nicer to say which ones caused the errors). On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote: > Agreed on every point. > The question I am pondering is: Does nproc need access control right now? > It's more work in kernel and user space and adds new opportunities to > introduce bugs. The merits seem rather dubious right now, considering > that all the fields used by current process info tools (files > /proc/pid{cmdline, stat, statm, status, wchan}) are world readable. > So my preference is to wait with access control until we know where > and how it is necessary. This I can't answer. -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 17:43 ` William Lee Irwin III @ 2004-09-14 18:45 ` Roger Luethi 2004-09-14 19:07 ` William Lee Irwin III 0 siblings, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 18:45 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 10:43:25 -0700, William Lee Irwin III wrote: > On Tue, 14 Sep 2004 09:37:12 -0700, William Lee Irwin III wrote: > >> Not particularly. It largely means poorly-coded apps may report gibberish. > > On Tue, Sep 14, 2004 at 07:15:25PM +0200, Roger Luethi wrote: > > If we are still talking about the same thing here, gibberish is a rather > > strong word. In the design I proposed access control affects the subset > > of tasks returned as a result -- the tool would still display meaningful > > information for the tasks it got replies for. > > That sounds bizarre. I'd expect some kind of reply, even if merely an > error. I suppose "no reply" could be interpreted as "ESRCH", though > this means distinguishing between "some field caused an error" and > "the thing is dead" means the app has to fall back to requesting fields > one at a time. I suppose you are thinking of a request that lists a number of PIDs along with a number of field IDs. In that case yes, I agree that it makes sense to provide some explicit feedback to the tool once we add access control (before that, there is no ambiguity: a missing answer means ESRCH). The most common request, though, won't provide a list of pids, it will only provide a list of field IDs and select all processes in the system (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't ask for any specific process to begin with, ESRCH doesn't make sense here. 
And for a system that looks anything like /proc does today, fields that are capable of triggering EPERM are few and far between, certainly not something you are hitting unexpectedly in the fast path of a process monitoring tool. Thanks, by the way, for all the feedback that helped me realize that I have so far failed to explain the design well enough. I will try to work on that. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 18:45 ` Roger Luethi @ 2004-09-14 19:07 ` William Lee Irwin III 2004-09-14 19:31 ` Roger Luethi 2004-09-15 11:44 ` Roger Luethi 0 siblings, 2 replies; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 19:07 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote: > I suppose you are thinking of a request that lists a number of PIDs along > with a number of field IDs. In that case yes, I agree that it makes sense > to provide some explicit feedback to the tool once we add access control > (before that, there is no ambiguity: a missing answer means ESRCH). > The most common request, though, won't provide a list of pids, it will > only provide a list of field IDs and select all processes in the system > (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't > ask for any specific process to begin with, ESRCH doesn't make sense > here. And for a system that looks anything like /proc does today, > fields that are capable of triggering EPERM are few and far between, > certainly not something you are hitting unexpectedly in the fast path > of a process monitoring tool. Okay, so what kinds of errors are returned in this case, if any, or (worst case) are the offending tasks completely silently dropped? On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote: > Thanks, by the way, for all the feedback that helped me realize that > I have so far failed to explain the design well enough. I will try to > work on that. Thanks; while I could in principle expend more effort to understand the netlink code, it's likely swifter to be given such commentary. -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 19:07 ` William Lee Irwin III @ 2004-09-14 19:31 ` Roger Luethi 2004-09-14 19:36 ` William Lee Irwin III 2004-09-15 11:44 ` Roger Luethi 1 sibling, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 19:31 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote: > On Tue, Sep 14, 2004 at 08:45:18PM +0200, Roger Luethi wrote: > > I suppose you are thinking of a request that lists a number of PIDs along > > with a number of field IDs. In that case yes, I agree that it makes sense > > to provide some explicit feedback to the tool once we add access control > > (before that, there is no ambiguity: a missing answer means ESRCH). > > The most common request, though, won't provide a list of pids, it will > > only provide a list of field IDs and select all processes in the system > > (NPROC_SELECT_ALL). There is no ambiguity here, either: The tool didn't > > ask for any specific process to begin with, ESRCH doesn't make sense > > here. And for a system that looks anything like /proc does today, > > fields that are capable of triggering EPERM are few and far between, > > certainly not something you are hitting unexpectedly in the fast path > > of a process monitoring tool. > > Okay, so what kinds of errors are returned in this case, if any, or > (worst case) are the offending tasks completely silently dropped? In published code: No access control whatsoever. In dev tree: Silently dropped. Possible: Any kind of error and additional information that makes sense (we have netlink messages as a transport, after all). That said, I don't think dropping tasks silently is a "worst case" in this scenario. 
Whatever your error report is going to be, it will boil down to saying "some tasks that may or may not live by the time you read this have been skipped because some fields that you knew had access restrictions prevented providing the information in those cases, and I must be cautious about not revealing any sensitive information to you so sorry I can't be more helpful". What's a tool going to do with that? If it cares to get a complete snapshot, it can simply send two requests: One with and one without restricted fields. So the tool would, say, request PID/VmSize in the first message and environ in the second message. Since only the owner can read the environment, the second request would yield answers only for a subset of the total process table. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 19:31 ` Roger Luethi @ 2004-09-14 19:36 ` William Lee Irwin III 2004-09-14 19:50 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: William Lee Irwin III @ 2004-09-14 19:36 UTC (permalink / raw) To: Roger Luethi Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote: >> Okay, so what kinds of errors are returned in this case, if any, or >> (worst case) are the offending tasks completely silently dropped? On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote: > In published code: No access control whatsoever. In dev tree: Silently > dropped. Possible: Any kind of error and additional information that > makes sense (we have netlink messages as a transport, after all). I'm not sure what to make of this. On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote: > That said, I don't think dropping tasks silently is a "worst case" > in this scenario. Whatever your error report is going to be, it will > boil down to saying "some tasks that may or may not live by the time > you read this have been skipped because some fields that you knew had > access restrictions prevented providing the information in those cases, > and I must be cautious about not revealing any sensitive information > to you so sorry I can't be more helpful". What's a tool going to do > with that? If it cares to get a complete snapshot, it can simply send > two requests: One with and one without restricted fields. > So the tool would, say, request PID/VmSize in the first message and > environ in the second message. Since only the owner can read the > environment, the second request would yield answers only for a subset > of the total process table. This sounds safe enough, though it's unclear how to predict what fields may be restricted. 
I suppose one doesn't try and requests one field at a time for all tasks in this model of interaction with userspace. -- wli ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-14 19:36 ` William Lee Irwin III
@ 2004-09-14 19:50 ` Roger Luethi
0 siblings, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-14 19:50 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:36:26 -0700, William Lee Irwin III wrote:
> On Tue, Sep 14, 2004 at 09:31:39PM +0200, Roger Luethi wrote:
> > In published code: No access control whatsoever. In dev tree: Silently
> > dropped. Possible: Any kind of error and additional information that
> > makes sense (we have netlink messages as a transport, after all).
>
> I'm not sure what to make of this.

I was just trying to say that anything is possible (there are no
limitations inherent to the design), but I prefer it the way it is now.
I don't feel strongly about it should something different turn out to
be the preferred method of tool authors.

> This sounds safe enough, though it's unclear how to predict what fields
> may be restricted. I suppose one doesn't try and requests one field at

Simple: The fact that a field is subject to access restrictions is part
of the field ID. You can check that nproc.h contains this:

/* Access control (unused) */
#define NPROC_PERM_MASK 0x00300000
#define NPROC_PERM_USER 0x00100000
#define NPROC_PERM_ROOT 0x00200000

So even if a tool were to discover a new, previously unknown field
offered by the kernel, it could immediately tell that access
restrictions apply and what type they are (in case you wonder, there's
extra space in reserve to cover additional types of restrictions,
including some catch-all thing (say
NPROC_PERM_COMPLEX_WHICH_MEANS_YOU_HAD_BETTER_KNOW_WHAT_YOU'RE_DOING)).

So nproc can cover everything /proc does today and is ready to go way
beyond that -- should that ever be deemed a good thing.
Roger
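
The point made in the message above, that a tool can tell from the field ID alone whether access restrictions apply, can be sketched in a few lines of C. The three NPROC_PERM_* values are the ones quoted from nproc.h (where they are marked as unused); the function names nproc_field_is_restricted() and nproc_split_by_access() are hypothetical helpers invented for this illustration and are not part of the posted patch. The second helper shows one way a tool might prepare the "two requests" approach discussed earlier in the thread: one request with unrestricted fields, one with restricted fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Access control bits, as quoted from nproc.h in the message above. */
#define NPROC_PERM_MASK 0x00300000
#define NPROC_PERM_USER 0x00100000
#define NPROC_PERM_ROOT 0x00200000

/* Hypothetical helper: nonzero if the field ID carries any restriction. */
int nproc_field_is_restricted(uint32_t field_id)
{
    return (field_id & NPROC_PERM_MASK) != 0;
}

/*
 * Hypothetical helper (not part of nproc): split a field list into the
 * fields without restrictions and the fields with restrictions, so that
 * a tool can send them as two separate requests. Both output arrays
 * must have room for n entries. The number of restricted fields is
 * stored in *n_restricted; the number of unrestricted fields is
 * returned. The relative order of the fields is preserved.
 */
size_t nproc_split_by_access(const uint32_t *ids, size_t n,
                             uint32_t *open_ids, uint32_t *restricted_ids,
                             size_t *n_restricted)
{
    size_t n_open = 0;
    size_t i;

    assert(ids && open_ids && restricted_ids && n_restricted);
    *n_restricted = 0;
    for (i = 0; i < n; i++) {
        if (nproc_field_is_restricted(ids[i]))
            restricted_ids[(*n_restricted)++] = ids[i];
        else
            open_ids[n_open++] = ids[i];
    }
    return n_open;
}
```

A tool that gets no reply for a task on the restricted request, but does get one on the unrestricted request, would then know the task exists and that only the restricted fields were withheld.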
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-14 19:07 ` William Lee Irwin III
2004-09-14 19:31 ` Roger Luethi
@ 2004-09-15 11:44 ` Roger Luethi
2004-09-15 20:02 ` Roger Luethi
1 sibling, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-15 11:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Tue, 14 Sep 2004 12:07:47 -0700, William Lee Irwin III wrote:
> Thanks; while I could in principle expend more effort to understand the
> netlink code, it's likely swifter to be given such commentary.

This message aims at showing how nproc works for user space. If you need
additional or a different kind of documentation, let me know.

Roger

Field ID
========
In order to extract a specific value from the proc filesystem, a tool
combines the file path and some method to determine the appropriate
offset into that file (depending on the file based on keyword,
white-space separated column, etc.). At this point, the tool applies its
knowledge of the specific field format to convert the string back to
what it stands for.

Nproc, on the other hand, uses field IDs to identify information.

Each field ID (32 bit) contains a number of sub fields:

bits
 0-15  Content ID. For instance, 0x117 is the virtual memory size of a
       process.
20-21  Access control ID. Type of access control restrictions that apply
       to this field. Currently unused.
24-26  Data type ID. Defines the return type which is one of u32,
       unsigned long, u64, or string.
28-30  Scope ID. Defines the scope for which a field is valid. Scope can
       be process (e.g. VmSize) or global (e.g. MemFree).

The remaining bits are reserved for future use.

Some details on sub-fields:

Content ID (bits 0-15)
----------
Bits 8-15 are used to indicate the /proc file in which a field occurs
and 0-7 to indicate the field within that file (where applicable).
There's no magic to that other than the fact that it makes it easier for
humans to check nproc.h.

Content IDs are immutable and identical on all platforms. Thus, the
meaning of any content ID, once assigned, must never ever change!

Data type ID (24-26)
------------
It's no problem to define additional (even complex) data types should
the need arise. For numbers, the data type simply defines the size of
the container (32 bit, long, 64 bit). For strings, the string itself is
prepended with a u32 indicating the length of the string.

Scope ID (28-30)
--------
The scope ID is just another piece of information for tools with
automatic field discovery (see example below).

Examples
========
A few examples of how the mechanisms are used:

Simple
------
A tool like vmstat(8) starts from a bunch of IDs for global fields it's
interested in. After opening the socket, it sends one NPROC_GET_GLOBAL
request containing said field IDs to the kernel.

The kernel sends one reply for vmstat to read: A va_list containing the
result for each requested field ID. Unit conversion (if necessary) can
typically be done in place. Format string and buffer are directly passed
to vprintf(3). Done.

Detecting obsolete fields
-------------------------
An NPROC_GET_FIELD_LIST request can be used at start-up to determine the
field IDs that are offered by the kernel. If an app requests an obsolete
field anyway (being optimistic is faster for the common case), it will
get an error message back and can determine the cause from there.

I don't expect this to happen more often than it has in the past
(disappearing fields suck), but it's a clean way to handle such an
event.

Field autodiscovery
-------------------
A tool may be interested in printing all information available about a
set of processes it is monitoring. At start-up, it sends
NPROC_GET_FIELD_LIST and finds a new field it doesn't know about.
From the field ID, the tool can deduce that the unknown field:

- is in process scope and thus interesting for its task. That's all it
  takes to add the new field to the NPROC_GET_PS request sent to the
  kernel (along with a list of monitored PIDs). If the reply for a PID
  is missing from the result, the PID has died.
- needs 32 bits to store the result

With three label calls on the new field ID, the app determines that the
kernel suggests "VmShared" as a label, "%8u" for formatting, and that
the unit is "KiB". (This may sound like bloat or overkill, but all these
strings are already available via /proc for many fields, just in a
processed form that makes it impractical to get the individual elements
back.)

The tool appends the format string for the new field to its own format
string and can now proceed like the tool in the first, trivial example.

Dealing with strings
--------------------
Most strings are really static labels (e.g. the label for a field ID or
the symbol name for wchan). In those cases, it's up to user-space to ask
for a label and cache the result as necessary.

There are some cases, though, where the label is transient. At least one
of them, the process name, is important enough to justify strings in
regular (as opposed to label) replies. Otherwise, the process and its
name may be gone by the time a tool gets around to asking for it based
on a PID it received. As there are no unique task identifiers, there are
races possible and correct caching is hard if not impossible.

But how can we still get a valid va_list back? A library function in
user space takes care of that.
For a given list of field IDs, it replaces every string type field with
a NOP (reply size: unsigned long) and appends the string type field ID
to the end of the list:

u32   u32    u32
PID | NAME | VMSIZE

becomes

u32   u32   u32      u32
PID | NOP | VMSIZE | NAME

Now it's trivial to fix the replies:

u32   unsigned long   u32    u32   string
1   | 0             | 1340 | 16  | init
                             ^-- space used for this string

becomes

u32   unsigned long               u32    u32   string
1   | <pointer to first string> | 1340 | 16  | init

Anticipating type changes
-------------------------
Some fields may grow in size (e.g. NPROC_PID may move from u32 to
unsigned long or u64). If a field is not available from the kernel, a
smart tool can check the list of field IDs for a field with the same
content ID but a different data type and print that instead.
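
The field ID layout and the string handling described in the message above can be illustrated with a short C sketch. The masks follow directly from the bit ranges given in the message. Two values are assumptions rather than facts from nproc.h: the data type value 1 for strings is inferred from the Name field ID (0x21000100) that appears in the timing tables later in this thread, and the NOP field ID (0x03000002) is taken from the same tables. All function and macro names starting with fid_ or FID_ are hypothetical and do not come from the posted patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sub-field masks derived from the bit ranges given in the message. */
#define FID_CONTENT_MASK 0x0000ffffu    /* bits 0-15 */
#define FID_PERM_MASK    0x00300000u    /* bits 20-21 */
#define FID_TYPE_MASK    0x07000000u    /* bits 24-26 */
#define FID_SCOPE_MASK   0x70000000u    /* bits 28-30 */

/* Assumptions read off field IDs quoted elsewhere in the thread. */
#define FID_TYPE_STRING  1u             /* assumed: Name is 0x21000100 */
#define FID_NOP          0x03000002u    /* NOP ID from the timing tables */

unsigned fid_content(uint32_t id) { return id & FID_CONTENT_MASK; }
unsigned fid_perm(uint32_t id)    { return (id & FID_PERM_MASK) >> 20; }
unsigned fid_type(uint32_t id)    { return (id & FID_TYPE_MASK) >> 24; }
unsigned fid_scope(uint32_t id)   { return (id & FID_SCOPE_MASK) >> 28; }

/*
 * Rewrite a request list as described above: every string type field is
 * replaced in place by a NOP, and the string field ID itself is appended
 * to the end of the list. "out" must not alias "in" and must have room
 * for 2 * n entries. Returns the length of the rewritten list.
 */
size_t fid_move_strings_to_end(const uint32_t *in, size_t n, uint32_t *out)
{
    size_t len = n;
    size_t i;

    assert(in && out);
    for (i = 0; i < n; i++) {
        if (fid_type(in[i]) == FID_TYPE_STRING) {
            out[i] = FID_NOP;       /* placeholder slot for a pointer */
            out[len++] = in[i];     /* string data goes to the tail */
        } else {
            out[i] = in[i];
        }
    }
    return len;
}
```

Applied to the list PID | NAME | VMSIZE from the message, this yields PID | NOP | VMSIZE | NAME. Patching the pointer into the NOP slot of each reply is left out here because the reply layout is only outlined in the message.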
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-15 11:44 ` Roger Luethi
@ 2004-09-15 20:02 ` Roger Luethi
2004-09-15 20:20 ` William Lee Irwin III
0 siblings, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-15 20:02 UTC (permalink / raw)
To: William Lee Irwin III, Albert Cahalan, Stephen Smalley,
Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris,
Chris Wright

Here's another thing we haven't been able to do with /proc: Finding out
the relative cost of computing the elements we offer to user space.

I ran a test program against 2.6.9-rc2-bk1 + nproc to get:

Testing all process fields, best out of 10
FieldID     CPU (s)   Wall (s)  Label
0x03000002  0.140000  0.202728  NOP
0x21000100  0.150000  0.210021  Name
0x22000105  0.120000  0.204886  PID
0x22000109  0.130000  0.205319  UID
0x22000117  0.140000  0.215275  VmSize
0x22000118  0.130000  0.214240  VmLock
0x22000119  0.120000  0.214870  VmRSS
0x22000120  0.160000  1.020574  VmData
0x22000121  0.140000  1.021185  VmStack
0x22000122  0.170000  1.021619  VmExe
0x22000123  0.170000  1.020045  VmLib
0x23000421  0.140000  0.220748  wchan

Ignore the absolute values (I requested each field individually for all
processes on my workstation, 1000 times). The cost of walking all vmas
for VmData & Co. is very visible.

Roger
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-15 20:02 ` Roger Luethi
@ 2004-09-15 20:20 ` William Lee Irwin III
2004-09-15 20:33 ` Roger Luethi
2004-09-15 20:44 ` Roger Luethi
0 siblings, 2 replies; 69+ messages in thread
From: William Lee Irwin III @ 2004-09-15 20:20 UTC (permalink / raw)
To: Roger Luethi
Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Wed, Sep 15, 2004 at 10:02:30PM +0200, Roger Luethi wrote:
> Here's another thing we haven't been able to do with /proc: Finding out
> the relative cost of computing the elements we offer to user space.
> I ran a test program against 2.6.9-rc2-bk1 + nproc to get:
> Testing all process fields, best out of 10
> FieldID     CPU (s)   Wall (s)  Label
> 0x03000002  0.140000  0.202728  NOP
> 0x21000100  0.150000  0.210021  Name
> 0x22000105  0.120000  0.204886  PID
> 0x22000109  0.130000  0.205319  UID
> 0x22000117  0.140000  0.215275  VmSize
> 0x22000118  0.130000  0.214240  VmLock
> 0x22000119  0.120000  0.214870  VmRSS
> 0x22000120  0.160000  1.020574  VmData
> 0x22000121  0.140000  1.021185  VmStack
> 0x22000122  0.170000  1.021619  VmExe
> 0x22000123  0.170000  1.020045  VmLib
> 0x23000421  0.140000  0.220748  wchan
> Ignore the absolute values (I requested each field individually for all
> processes on my workstation, 1000 times). The cost of walking all vmas
> for VmData & Co. is very visible.

Try this again after applying my updates, which make it equivalent to the
algorithms used internally by fs/proc/task_mmu.c.

-- wli
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-15 20:20 ` William Lee Irwin III @ 2004-09-15 20:33 ` Roger Luethi 2004-09-15 20:44 ` Roger Luethi 1 sibling, 0 replies; 69+ messages in thread From: Roger Luethi @ 2004-09-15 20:33 UTC (permalink / raw) To: William Lee Irwin III Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Albert Cahalan, Paul Jackson, James Morris, Chris Wright On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote: > Try this again after applying my updates, which make it equivalent to the > algorithms used internally by fs/proc/task_mmu.c. That doesn't sound very interesting. The results are predictable. The point of my previous message was that we can easily identify expensive fields. Ah well, compiling patched kernel anyway. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information
2004-09-15 20:20 ` William Lee Irwin III
2004-09-15 20:33 ` Roger Luethi
@ 2004-09-15 20:44 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-15 20:44 UTC (permalink / raw)
To: William Lee Irwin III
Cc: Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml,
Albert Cahalan, Paul Jackson, James Morris, Chris Wright

On Wed, 15 Sep 2004 13:20:28 -0700, William Lee Irwin III wrote:
> > Ignore the absolute values (I requested each field individually for all
> > processes on my workstation, 1000 times). The cost of walking all vmas
> > for VmData & Co. is very visible.
>
> Try this again after applying my updates, which make it equivalent to the
> algorithms used internally by fs/proc/task_mmu.c.

Here you go:

Testing all process fields, best out of 10
FieldID     CPU (s)   Wall (s)  Label
0x03000002  0.130000  0.208989  NOP
0x21000100  0.150000  0.222867  Name
0x22000105  0.140000  0.216126  PID
0x22000109  0.140000  0.218058  UID
0x22000117  0.140000  0.231467  VmSize
0x22000118  0.140000  0.227863  VmLock
0x22000119  0.140000  0.229867  VmRSS
0x22000120  0.140000  0.226822  VmData
0x22000121  0.140000  0.228589  VmStack
0x22000122  0.130000  0.229107  VmExe
0x22000123  0.140000  0.228584  VmLib
0x23000421  0.140000  0.230716  wchan
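
One way to read the two timing tables is to subtract the NOP row from each field's row, which leaves the time attributable to computing that field rather than to the request round trip. The helper below is a minimal sketch of that arithmetic; the name marginal_cost() is hypothetical, and treating the NOP row as a baseline is an interpretation of the tables, not something stated in the thread.

```c
#include <assert.h>

/*
 * Hypothetical helper: wall time a field adds on top of the NOP request,
 * using the same units as the tables (seconds for the whole test run).
 */
double marginal_cost(double field_wall, double nop_wall)
{
    assert(nop_wall > 0.0);
    return field_wall - nop_wall;
}
```

With the numbers above, VmData adds roughly 0.82 s over the NOP baseline in the first run (1.020574 against 0.202728) and roughly 0.018 s in the second run (0.226822 against 0.208989).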
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 16:37 ` William Lee Irwin III 2004-09-14 17:15 ` Roger Luethi @ 2004-09-14 18:37 ` Chris Wright 2004-09-14 18:55 ` Roger Luethi 1 sibling, 1 reply; 69+ messages in thread From: Chris Wright @ 2004-09-14 18:37 UTC (permalink / raw) To: William Lee Irwin III Cc: Roger Luethi, Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Paul Jackson, James Morris, Chris Wright * William Lee Irwin III (wli@holomorphy.com) wrote: > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: > >> No, in general races of the form "permissions were altered after I > >> checked them" can happen. > > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > > Can you make an example? Some scenario where this would be important? > > Not particularly. It largely means poorly-coded apps may report gibberish. Canonical example is access(2) followed by open(2), not really relevant in this case. However, exec setuid root app...when do you check, and when to you fill in data to send back to user? For /proc, this type of check happens often (see things like may_ptrace_attach and task_dumpable in fs/proc/base.c). thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 18:37 ` Chris Wright @ 2004-09-14 18:55 ` Roger Luethi 2004-09-14 19:05 ` Chris Wright 0 siblings, 1 reply; 69+ messages in thread From: Roger Luethi @ 2004-09-14 18:55 UTC (permalink / raw) To: Chris Wright Cc: William Lee Irwin III, Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Paul Jackson, James Morris On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote: > * William Lee Irwin III (wli@holomorphy.com) wrote: > > On Tue, 14 Sep 2004 08:37:58 -0700, William Lee Irwin III wrote: > > >> No, in general races of the form "permissions were altered after I > > >> checked them" can happen. > > > > On Tue, Sep 14, 2004 at 06:01:50PM +0200, Roger Luethi wrote: > > > Can you make an example? Some scenario where this would be important? > > > > Not particularly. It largely means poorly-coded apps may report gibberish. > > Canonical example is access(2) followed by open(2), not really relevant > in this case. However, exec setuid root app...when do you check, and > when to you fill in data to send back to user? For /proc, this type of > check happens often (see things like may_ptrace_attach and > task_dumpable in fs/proc/base.c). For nproc, the procedure looks like this: A tool send(2)s a request, credentials are attached to skb. Based on said credentials, the kernel is free to provide (netlink_unicast to originating socket) or withhold information. In this regard, nproc works like other netlink interfaces. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 18:55 ` Roger Luethi @ 2004-09-14 19:05 ` Chris Wright 2004-09-14 21:12 ` Roger Luethi 0 siblings, 1 reply; 69+ messages in thread From: Chris Wright @ 2004-09-14 19:05 UTC (permalink / raw) To: Chris Wright, William Lee Irwin III, Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Paul Jackson, James Morris * Roger Luethi (rl@hellgate.ch) wrote: > On Tue, 14 Sep 2004 11:37:36 -0700, Chris Wright wrote: > > Canonical example is access(2) followed by open(2), not really relevant > > in this case. However, exec setuid root app...when do you check, and > > when to you fill in data to send back to user? For /proc, this type of > > check happens often (see things like may_ptrace_attach and > > task_dumpable in fs/proc/base.c). > > For nproc, the procedure looks like this: A tool send(2)s a request, > credentials are attached to skb. Based on said credentials, the kernel > is free to provide (netlink_unicast to originating socket) or withhold > information. In this regard, nproc works like other netlink interfaces. Understood. Question is, if the request is for data that's associated with a task that is in the middle of an execve(setuid_root_app), does the credential-check/skb-fill for response happen atomically w.r.t. said execve? IOW, is it possible to pass credential check, then fill data that's become sensitive since the check happened? thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-14 19:05 ` Chris Wright @ 2004-09-14 21:12 ` Roger Luethi 0 siblings, 0 replies; 69+ messages in thread From: Roger Luethi @ 2004-09-14 21:12 UTC (permalink / raw) To: Chris Wright Cc: William Lee Irwin III, Albert Cahalan, Stephen Smalley, Andrew Morton OSDL, lkml, Paul Jackson, James Morris On Tue, 14 Sep 2004 12:05:09 -0700, Chris Wright wrote: > Understood. Question is, if the request is for data that's associated > with a task that is in the middle of an execve(setuid_root_app), does > the credential-check/skb-fill for response happen atomically w.r.t. said > execve? IOW, is it possible to pass credential check, then fill data > that's become sensitive since the check happened? It shouldn't be once we implement access control. I don't pretend to know what the best way is to prevent that. Checking several times just shrinks the race window, so I suppose we'd have to lock the source data structures down prior to checking credentials and copying data. Roger ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [1/1][PATCH] nproc v2: netlink access to /proc information 2004-09-09 17:53 ` Roger Luethi 2004-09-09 20:01 ` Stephen Smalley @ 2004-09-09 20:44 ` Chris Wright 1 sibling, 0 replies; 69+ messages in thread From: Chris Wright @ 2004-09-09 20:44 UTC (permalink / raw) To: William Lee Irwin III, Stephen Smalley, Andrew Morton, lkml, Albert Cahalan, Martin J. Bligh, Paul Jackson * Roger Luethi (rl@hellgate.ch) wrote: > On Thu, 09 Sep 2004 10:22:00 -0700, William Lee Irwin III wrote: > > On Thu, Sep 09, 2004 at 07:53:31AM -0400, Stephen Smalley wrote: > > > They aren't world readable when using a security module like SELinux; > > > they are then typically only accessible by processes in the same > > > security domain, aside from processes in privileged domains. > > > security_task_to_inode() hook sets the security attributes on the > > > /proc/pid inodes based on their security context, and then > > > security_inode_permission() hook controls access to them. So you need > > > at least comparable controls. > > > > Can you make a more specific suggestion regarding the controls to use? > > It's a bit awkward for those highly unfamiliar with the subsystem to > > For the same reason, I'm not comfortable with implementing SELinux type > access controls myself. How about: > > config NPROC > depends on !SECURITY_SELINUX > It's not just SELinux, it's any security module (i.e. CONFIG_SECURITY for starters). thanks, -chris -- Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net ^ permalink raw reply [flat|nested] 69+ messages in thread
* nproc: So?
2004-09-08 18:40 [0/1][ANNOUNCE] nproc v2: netlink access to /proc information Roger Luethi
2004-09-08 18:41 ` [1/1][PATCH] " Roger Luethi
@ 2004-09-16 21:43 ` Roger Luethi
1 sibling, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-16 21:43 UTC (permalink / raw)
To: linux-kernel

I have received some constructive criticism and suggestions, but I
didn't see any comments on the desirability of nproc in mainline.
Initially meant to be a proof-of-concept, nproc has become an interface
that is much cleaner and faster than procfs can ever hope to be (it
takes some reading of procps or libgtop code to appreciate the
complexity that is /proc file parsing today), and every change in /proc
files widens the gap. I presented source code, benchmarks, and design
documentation to substantiate my claims; I can post the user-space code
somewhere if there's interest.

So I'm wondering if everybody's waiting for me to answer some important
question I overlooked, or if there is a general sentiment that this
project is not worth pursuing.

Roger
* Re: nproc: So?
@ 2004-09-17 16:55 Albert Cahalan
2004-09-17 17:51 ` Roger Luethi
0 siblings, 1 reply; 69+ messages in thread
From: Albert Cahalan @ 2004-09-17 16:55 UTC (permalink / raw)
To: linux-kernel mailing list; +Cc: rl

Roger Luethi writes:

> I have received some constructive criticism and suggestions,
> but I didn't see any comments on the desirability of nproc in
> mainline. Initially meant to be a proof-of-concept, nproc has
> become an interface that is much cleaner and faster than procfs
> can ever hope to be (it takes some reading of procps or libgtop
> code to appreciate the complexity that is /proc file parsing today),

You spotted the perfect hash lookup? :-)

> and every change in /proc files widens the gap. I presented
> source code, benchmarks, and design documentation to substantiate
> my claims; I can post the user-space code somewhere if there's
> interest.
>
> So I'm wondering if everybody's waiting for me to answer some
> important question I overlooked, or if there is a general
> sentiment that this project is not worth pursuing.

I'm very glad to see numerical proof that /proc is crap.
If nproc does nothing else, it's still been useful.

The funny varargs/vsprintf/whatever encoding is useless to me,
as are the labels.

The nicest thing about netlink is, I think, that it might make
a practical interface for incremental update. As processes run
or get modified, monitoring apps might get notified. I did not
see mention of this being implemented, and I would take quite
some time to support it, so it's a long-term goal. (of course,
people can always submit procps patches to support this)

I doubt that it is good to break down the data into so many
different items. It seems sensible to break down the data by
locking requirements.

I could use an opaque per-process cookie for process identification.
This would protect from PID reuse, and might allow for faster
lookup. Perhaps it contains: PID, address of task_struct, and the
system-wide or per-cpu fork count from process creation.

Something like the stat() syscall would be pretty decent.

Well, whatever... In any case, I'd need to see some working
code for the libproc library. My net connection dies for hours
at a time, so don't expect speedy anything right now.

BTW, I have a 32-bit big-endian system with char being
unsigned by default. The varargs stuff is odd too.

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: nproc: So?
2004-09-17 16:55 Albert Cahalan
@ 2004-09-17 17:51 ` Roger Luethi
2004-09-18 12:40 ` Albert Cahalan
0 siblings, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-17 17:51 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list

On Fri, 17 Sep 2004 12:55:32 -0400, Albert Cahalan wrote:
> Roger Luethi writes:
> > I have received some constructive criticism and suggestions,
> > but I didn't see any comments on the desirability of nproc in
> > mainline. Initially meant to be a proof-of-concept, nproc has
> > become an interface that is much cleaner and faster than procfs
> > can ever hope to be (it takes some reading of procps or libgtop
> > code to appreciate the complexity that is /proc file parsing today),
>
> You spotted the perfect hash lookup? :-)

I never claimed nproc is perfect. Solutions with comparable performance
and simplicity are conceivable, but none of them will work anything
like procfs.

> The funny varargs/vsprintf/whatever encoding is useless to me,

Actually, that's just a by-product of the design. It is what you get
when you put all the fields back to back. The only addition I made
kernel-side to make this easy to exploit was the introduction of a
NOP field.

> as are the labels.

Yup. The labels are not useful for the tools you maintain.

> The nicest thing about netlink is, I think, that it might make
> a practical interface for incremental update. As processes run
> or get modified, monitoring apps might get notified. I did not
> see mention of this being implemented, and I would take quite
> some time to support it, so it's a long-term goal. (of course,
> people can always submit procps patches to support this)

Sounds like what wli and I have discussed as differential updates a few
weeks ago. I agree that would be nice, for now the goal was to suggest
something that's cleaner and faster than procfs. Extensions are easy to
add later.

> I doubt that it is good to break down the data into so many
> different items. It seems sensible to break down the data by
> locking requirements.

True if you consider a static set of fields that never changes.
Problematic otherwise, because as soon as you start grouping fields
together, you need an agreement between kernel and user-space on the
contents of these groups.

With nproc, the kernel is free to group fields together for computation
(even the first release calculated all the fields that needed VMA walks
in one go).

> I could use an opaque per-process cookie for process identification.
> This would protect from PID reuse, and might allow for faster
> lookup. Perhaps it contains: PID, address of task_struct, and the
> system-wide or per-cpu fork count from process creation.

Agreed, that would be useful. And it would be easy to integrate with
nproc. Just add a field to return the cookie and a selector based on
cookies rather than PIDs.

> Something like the stat() syscall would be pretty decent.

You lost me there.

Roger

^ permalink raw reply [flat|nested] 69+ messages in thread
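A minimal sketch of what "all the fields back to back" could mean for a
reader of the reply, assuming every requested field is a 32-bit value and
the values arrive in the same order as the requested field IDs. The field
IDs and the function name are invented for illustration; they are not the
nproc constants, and strings and the NOP field are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical field IDs, invented for this sketch; not the nproc constants. */
enum sketch_field { SK_PID = 1, SK_VMSIZE = 2, SK_VMRSS = 3 };

/*
 * Look up the value of field `want` in a reply whose 32-bit values sit
 * back to back, in the same order as the requested field IDs in `req`.
 * Returns 0 on success, -1 if the field was not requested or the
 * reply is too short to hold it.
 */
int sketch_get_field(const enum sketch_field *req, size_t nreq,
                     const unsigned char *reply, size_t len,
                     enum sketch_field want, uint32_t *out)
{
    for (size_t i = 0; i < nreq; i++) {
        if (req[i] != want)
            continue;
        if ((i + 1) * sizeof(uint32_t) > len)
            return -1;
        /* memcpy sidesteps alignment and aliasing problems */
        memcpy(out, reply + i * sizeof(uint32_t), sizeof(uint32_t));
        return 0;
    }
    return -1;
}
```

Because the position of a value is implied by the request, nothing in the
reply has to be parsed; the cost is that sender and receiver must agree on
the width of every field.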
* Re: nproc: So?
2004-09-17 17:51 ` Roger Luethi
@ 2004-09-18 12:40 ` Albert Cahalan
2004-09-19 10:39 ` Roger Luethi
0 siblings, 1 reply; 69+ messages in thread
From: Albert Cahalan @ 2004-09-18 12:40 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel mailing list

On Fri, 2004-09-17 at 13:51, Roger Luethi wrote:
> On Fri, 17 Sep 2004 12:55:32 -0400, Albert Cahalan wrote:
> > The nicest thing about netlink is, I think, that it might make
> > a practical interface for incremental update. As processes run
> > or get modified, monitoring apps might get notified. I did not
> > see mention of this being implemented, and I would take quite
> > some time to support it, so it's a long-term goal. (of course,
> > people can always submit procps patches to support this)
>
> Sounds like what wli and I have discussed as differential updates
> a few weeks ago. I agree that would be nice, for now the goal was
> to suggest something that's cleaner and faster than procfs.
> Extensions are easy to add later.

To me, this looks like the killer feature. You could even
skip the regular process info. Simply return process identification
cookies that could be passed into a separate syscall to get
the information.

> > I doubt that it is good to break down the data into so many
> > different items. It seems sensible to break down the data by
> > locking requirements.
>
> True if you consider a static set of fields that never changes. Problematic
> otherwise, because as soon as you start grouping fields together, you need
> an agreement between kernel and user-space on the contents of these groups.

I suppose this is small potatoes compared to the overhead
of dealing with ASCII, but individual field handling would
be a bit slower.

For initial libproc support, I'd start by requesting info
in groups that match what /proc provides today.

> > I could use an opaque per-process cookie for process identification.
> > This would protect from PID reuse, and might allow for faster
> > lookup. Perhaps it contains: PID, address of task_struct, and the
> > system-wide or per-cpu fork count from process creation.
>
> Agreed, that would be useful. And it would be easy to integrate with
> nproc. Just add a field to return the cookie and a selector based on
> cookies rather than PIDs.
>
> > Something like the stat() syscall would be pretty decent.
>
> You lost me there.

The stat() call simply fills in a struct. Given a per-process
cookie (or a PID if you tolerate the race conditions), a syscall
similar to stat() could fill in a struct.

^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: nproc: So?
2004-09-18 12:40 ` Albert Cahalan
@ 2004-09-19 10:39 ` Roger Luethi
2004-09-19 12:29 ` Albert Cahalan
0 siblings, 1 reply; 69+ messages in thread
From: Roger Luethi @ 2004-09-19 10:39 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list

On Sat, 18 Sep 2004 08:40:12 -0400, Albert Cahalan wrote:
> To me, this looks like the killer feature. You could even
> skip the regular process info. Simply return process identification
> cookies that could be passed into a separate syscall to get
> the information.

Do you mean "return cookies for all existing processes"? Or "return
cookies for all processes created since X" (if so, what's X?) ?

> > True if you consider a static set of fields that never changes. Problematic
> > otherwise, because as soon as you start grouping fields together, you need
> > an agreement between kernel and user-space on the contents of these groups.
>
> I suppose this is small potatoes compared to the overhead
> of dealing with ASCII, but individual field handling would
> be a bit slower.

Correct.

> For initial libproc support, I'd start by requesting info
> in groups that match what /proc provides today.

Makes perfect sense. You can pre-assemble an array of field IDs, hand
them over to the kernel, and get the requested fields in the requested
order.

> The stat() call simply fills in a struct. Given a per-process
> cookie (or a PID if you tolerate the race conditions), a syscall
> similar to stat() could fill in a struct.

With nproc as-is you can send a request that matches your desired struct
and cast the result to a pointer to your struct.

An application can build its own cookie simply by always requesting a
set of fields that _together_ can be used to identify a process.
I reckon that PID + process creation timestamp would be a good
combination (except that the latter is not currently available).
The creation of the complete reply to a request is atomic per process,
the race is gone.

What is not possible right now is selecting processes based on a cookie
-- the only selectors so far are "all of them" and "select by PID".

Roger

^ permalink raw reply [flat|nested] 69+ messages in thread
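The application-built cookie described above could be sketched roughly
as follows. The struct, its field widths, and the helper name are
assumptions made for illustration, and, as the message notes, a process
creation timestamp was not among the available fields at the time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space cookie: a PID plus the process creation time. */
struct sketch_cookie {
    uint32_t pid;
    uint64_t start_time; /* units are an assumption; it only has to differ
                          * between two processes that reuse one PID */
};

/* Two samples describe the same process only if both parts match. */
bool sketch_same_process(const struct sketch_cookie *a,
                         const struct sketch_cookie *b)
{
    return a->pid == b->pid && a->start_time == b->start_time;
}
```

A tool would store the cookie with each sample and discard cached data
whenever a later reply for the same PID carries a different creation
time.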
* Re: nproc: So?
2004-09-19 10:39 ` Roger Luethi
@ 2004-09-19 12:29 ` Albert Cahalan
2004-09-19 13:57 ` Roger Luethi
0 siblings, 1 reply; 69+ messages in thread
From: Albert Cahalan @ 2004-09-19 12:29 UTC (permalink / raw)
To: Roger Luethi; +Cc: linux-kernel mailing list

On Sun, 2004-09-19 at 06:39, Roger Luethi wrote:
> On Sat, 18 Sep 2004 08:40:12 -0400, Albert Cahalan wrote:
> > To me, this looks like the killer feature. You could even
> > skip the regular process info. Simply return process identification
> > cookies that could be passed into a separate syscall to get
> > the information.
>
> Do you mean "return cookies for all existing processes"? Or "return
> cookies for all processes created since X" (if so, what's X?) ?

First, queue cookies for all existing processes.
Then, as process data changes, queue cookies for
processes that need to be examined again. Suppress
queueing of cookies for processes that are already
in the queue so things don't get too backed up.
If memory usage exceeds some adjustable limit, then
switch to supplying all processes until the backlog
is gone.

I realize that the implementation may prove difficult.

> With nproc as-is you can send a request that matches your desired struct
> and cast the result to a pointer to your struct.

Either that's marketing, or I missed something. :-)

Can I force specific data sizes? Can I force a string to
be NUL-terminated or a NUL-padded fixed-length buffer?
Can I request padding bytes to be skipped over?

^ permalink raw reply [flat|nested] 69+ messages in thread
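The queueing scheme proposed above (suppress cookies that are already
pending, fall back to a full scan once a limit is hit) might look like
this in user-space terms. It is only a sketch under stated assumptions:
cookies are treated as plain 64-bit values, the limit is a small
compile-time constant, and duplicates are found by a linear scan. None
of these names come from nproc or the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_QLEN 4 /* deliberately tiny; stands in for the adjustable limit */

struct sketch_queue {
    uint64_t cookie[SKETCH_QLEN];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of queued entries */
    bool overflow; /* set once a cookie had to be dropped */
};

/* Queue a cookie unless it is already pending.
 * Returns true if the cookie is pending afterwards. */
bool sketch_enqueue(struct sketch_queue *q, uint64_t c)
{
    for (size_t i = 0; i < q->count; i++)
        if (q->cookie[(q->head + i) % SKETCH_QLEN] == c)
            return true; /* already queued: suppress the duplicate */
    if (q->count == SKETCH_QLEN) {
        q->overflow = true; /* consumer should rescan all processes */
        return false;
    }
    q->cookie[(q->head + q->count) % SKETCH_QLEN] = c;
    q->count++;
    return true;
}

/* Pop the oldest cookie; returns false if the queue is empty. */
bool sketch_dequeue(struct sketch_queue *q, uint64_t *c)
{
    if (q->count == 0)
        return false;
    *c = q->cookie[q->head];
    q->head = (q->head + 1) % SKETCH_QLEN;
    q->count--;
    return true;
}
```

The overflow flag models the "switch to supplying all processes" step: a
consumer that sees it set would do one complete scan and then clear it.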
* Re: nproc: So?
2004-09-19 12:29 ` Albert Cahalan
@ 2004-09-19 13:57 ` Roger Luethi
0 siblings, 0 replies; 69+ messages in thread
From: Roger Luethi @ 2004-09-19 13:57 UTC (permalink / raw)
To: Albert Cahalan; +Cc: linux-kernel mailing list

On Sun, 19 Sep 2004 08:29:57 -0400, Albert Cahalan wrote:
> > Do you mean "return cookies for all existing processes"? Or "return
> > cookies for all processes created since X" (if so, what's X?) ?
>
> First, queue cookies for all existing processes.
> Then, as process data changes, queue cookies for
> processes that need to be examined again. Suppress
> queueing of cookies for processes that are already
> in the queue so things don't get too backed up.
> If memory usage exceeds some adjustable limit, then
> switch to supplying all processes until the backlog
> is gone.

How is the kernel to know which changes of process data require
re-examination? In all likelihood, any tool is only going to be
interested in certain changes, not in others.

> I realize that the implementation may prove difficult.

It seems reasonable (and useful) to notify tools if new processes get
created. It is certainly possible to have additional events (like field
changes) trigger notifications, but this would probably become rather
intrusive and expensive.

> > With nproc as-is you can send a request that matches your desired struct
> > and cast the result to a pointer to your struct.
>
> Either that's marketing, or I missed something. :-)
>
> Can I force specific data sizes? Can I force a string to
> be NUL-terminated or a NUL-padded fixed-length buffer?
> Can I request padding bytes to be skipped over?

No, your data types have to match what the kernel offers. What I was
referring to was your request for "info in groups that match what /proc
provides today". What you _can_ do with nproc is, say, ask it to return
a pointer to something like this:

struct statm_extended {
        __u32   pid;            /*
        __u32   namelen;         * My simple cookie
        char    name[16];        */
        __u32   resident;       /*
        __u32   shared;          *
        __u32   trs;             * /proc/PID/statm content
        __u32   lrs;             *
        __u32   drs;             *
        __u32   dt;              */
};

Roger

^ permalink raw reply [flat|nested] 69+ messages in thread
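As a rough user-space illustration of overlaying such a struct on a
reply, the sketch below copies one record out of a byte buffer. It
assumes the reply holds exactly the fields of statm_extended in that
order, with the name occupying a fixed 16 bytes after its length;
uint32_t stands in for __u32, and the helper name is invented. How the
buffer is actually received over netlink is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the struct in the message, with uint32_t for __u32. */
struct statm_extended {
    uint32_t pid;
    uint32_t namelen;
    char     name[16];
    uint32_t resident;
    uint32_t shared;
    uint32_t trs;
    uint32_t lrs;
    uint32_t drs;
    uint32_t dt;
};

/* Copy one reply record into the struct.
 * Returns 0 on success, -1 if the buffer is too short. */
int sketch_read_statm(const unsigned char *reply, size_t len,
                      struct statm_extended *out)
{
    if (len < sizeof(*out))
        return -1;
    /* memcpy rather than a pointer cast: no alignment assumptions */
    memcpy(out, reply, sizeof(*out));
    return 0;
}
```

All members are 4-byte aligned and the char array is a multiple of four
bytes long, so on common ABIs the compiler inserts no padding and the
struct is 48 bytes, matching the back-to-back field layout assumed here.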