[RFC] on general object IDs again

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC] on general object IDs again
@ 2012-01-11 16:19 Cyrill Gorcunov
  2012-01-11 17:25 ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2012-01-11 16:19 UTC (permalink / raw)
  To: LKML
  Cc: Andrew Morton, Kyle Moffett, Tejun Heo, Pavel Emelyanov,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin, KOSAKI Motohiro

Hi all,

here is an idea on general object IDs for kernel which could be
exported to user-space apps. The generation is done via the new
syscall __NR_gen_obj_id. The user-space application is supposed
to put a request and provide memory for encrypted data.

The patch is far from being complete and not all things are
implemented yet (and for namespaces there should be a patch
from Eric Biederman soon, so I left GEN_OBJ_ID_NS just in case,
will drop them).

What also is not yet done -- 

 - there should be a sysctl entry which would allow to completely
   disable this feature if one day we find that aes doesn't provide
   enough security or someone simply doesn't need this feature enabled

 - there should be kind of periodical cookies update routine so that
   one can't use same keys and salt forever, the time period is controlled
   by sysctl as well

 - there should be "disable updating key/salt for a series of syscalls",
   so one could be sure he got results generate with same key/salt so
   results can be sorted out in user-space, via sysctl too I think
   (ie one need to draw resources affinity for series of pids and while
    IDs are generated the key/salt should not be updated). For this sake
   gen_obj_lock is added.

 - no IDs for file descriptors yet, I thought something like -- user-space
   provides pid and array of fd, the kernel calculates IDs for every fd then

Regardless the things which are not implemented yet, I would like to hear
people *opinions* on such approach in general, and if it's acceptable at all.

Eric B., I know such approach is exactly what you mentioned as bad design
but I thought in various ways and didn't find any better solution
yet.

In turn doing everything in a manner of syscall "compare resources of two pids"
gives up very slow speed because we will have to compare every pid resources
with every other pids we're checkpointing.

Inability to restore this dynamic IDs doesn't look like a problem for me since
they are supposed to be read-only and used for sameness test.

Again, if someone has a better idea please don't hesitate to mention. Thanks!

P.S. Sorry if I forgot to CC someone who were replying me plreviously.

	Cyrill
---
 arch/x86/include/asm/unistd_64.h |    2 
 include/linux/gen_obj_id.h       |   39 ++++++++
 mm/Kconfig                       |   17 +++
 mm/Makefile                      |    1 
 mm/gen_obj_id.c                  |  183 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 242 insertions(+)

Index: linux-2.6.git/arch/x86/include/asm/unistd_64.h
===================================================================
--- linux-2.6.git.orig/arch/x86/include/asm/unistd_64.h
+++ linux-2.6.git/arch/x86/include/asm/unistd_64.h
@@ -686,6 +686,8 @@ __SYSCALL(__NR_getcpu, sys_getcpu)
 __SYSCALL(__NR_process_vm_readv, sys_process_vm_readv)
 #define __NR_process_vm_writev			311
 __SYSCALL(__NR_process_vm_writev, sys_process_vm_writev)
+#define __NR_gen_obj_id				312
+__SYSCALL(__NR_gen_obj_id, sys_gen_obj_id)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.git/include/linux/gen_obj_id.h
===================================================================
--- /dev/null
+++ linux-2.6.git/include/linux/gen_obj_id.h
@@ -0,0 +1,39 @@
+#ifndef _LINUX_GEN_OBJ_ID_H
+#define _LINUX_GEN_OBJ_ID_H
+
+#include <linux/err.h>
+
+enum {
+	GEN_OBJ_ID_FILE,
+	GEN_OBJ_ID_VM,
+	GEN_OBJ_ID_FILES,
+	GEN_OBJ_ID_FS,
+	GEN_OBJ_ID_SIGHAND,
+	GEN_OBJ_ID_IO,
+	GEN_OBJ_ID_SYSVSEM,
+	GEN_OBJ_ID_NS_NET,
+	GEN_OBJ_ID_NS_PID,
+	GEN_OBJ_ID_NS_MNT,
+	GEN_OBJ_ID_NS_IPC,
+	GEN_OBJ_ID_NS_UTS,
+
+	GEN_OBJ_ID_TYPES,
+};
+
+#define GEN_OBJ_ID_MASK(type)	(1UL << (type))
+
+struct gen_obj_id_req {
+	__u64	mask;		/* request mask */
+	__u64	pid;		/* pid of a process to ask */
+	__u64	arg;		/* if needed */
+};
+
+struct gen_obj_id_fd {
+	__u32	nr_fd;
+	__u32	fd[0];
+};
+
+/* minimum bytes of buffer a user must provide */
+#define GEN_OBJ_ID_USER_SIZE	16
+
+#endif /* _LINUX_GEN_OBJ_ID_H */
Index: linux-2.6.git/mm/Kconfig
===================================================================
--- linux-2.6.git.orig/mm/Kconfig
+++ linux-2.6.git/mm/Kconfig
@@ -373,3 +373,20 @@ config CLEANCACHE
 	  in a negligible performance hit.
 
 	  If unsure, say Y to enable cleancache
+
+config GENERIC_OBJECT_ID
+	bool "Enable generic object ID infrastructure"
+	depends on CHECKPOINT_RESTORE
+	depends on CRYPTO_SHA1
+	default n
+	help
+	  Turn on functionality that can generate IDs for kernel objects,
+	  which are exported to the userspace.
+
+	  It is useful if you need to examine kernel objects and test
+	  if they are shared between several tasks. These IDs should never
+	  be used for anything but the "sameness" test. The IDs are dynamic
+	  and valid only while object is alive. Once it get freed or kernel
+	  is rebooted, the IDs will be changed.
+
+	  If unsure, say N here.
Index: linux-2.6.git/mm/Makefile
===================================================================
--- linux-2.6.git.orig/mm/Makefile
+++ linux-2.6.git/mm/Makefile
@@ -51,3 +51,4 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoiso
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
+obj-$(CONFIG_GENERIC_OBJECT_ID) += gen_obj_id.o
Index: linux-2.6.git/mm/gen_obj_id.c
===================================================================
--- /dev/null
+++ linux-2.6.git/mm/gen_obj_id.c
@@ -0,0 +1,183 @@
+#include <linux/kernel.h>
+#include <linux/capability.h>
+#include <linux/rwsem.h>
+#include <linux/bitops.h>
+#include <linux/scatterlist.h>
+#include <linux/syscalls.h>
+#include <linux/crypto.h>
+#include <crypto/aes.h>
+#include <linux/rwlock.h>
+#include <linux/rwsem.h>
+#include <linux/sysctl.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+
+#include <linux/gen_obj_id.h>
+
+static u8 gen_obj_cookie[GEN_OBJ_ID_TYPES][2][AES_KEYSIZE_128] __read_mostly;
+
+#define COOKIE_KEY(type)	&gen_obj_cookie[type][0][0]
+#define	COOKIE_ID(type)		&gen_obj_cookie[type][1][0]
+
+static DECLARE_RWSEM(gen_obj_lock);
+
+/*
+ * TODO
+ *
+ *  - enable/disable via sysfs
+ *  - cookie update interval (kthread ???)
+ *  - cookie lock/unlock if updating
+ *  - lock when a process does a series of syscalls,
+ *    when unlocked after that -- update all cookies
+ *    immediately
+ */
+
+static void gen_obj_refresh_cookies(void)
+{
+	int i;
+	for (i = 0; i < GEN_OBJ_ID_TYPES; i++) {
+		get_random_bytes(COOKIE_KEY(i), AES_KEYSIZE_128);
+		get_random_bytes(COOKIE_ID(i), AES_KEYSIZE_128);
+	}
+}
+
+static int encrypt_obj_id(int type, struct crypto_cipher *tfm,
+			  unsigned long obj_id, void __user *buf)
+{
+	__u8 ciphertext[AES_KEYSIZE_128];
+	__u8 expanded[AES_KEYSIZE_128];
+	int ret = 0;
+
+	memcpy(expanded, COOKIE_ID(type), AES_KEYSIZE_128);
+	*(unsigned long *)expanded += obj_id;
+
+	ret = crypto_cipher_setkey(tfm, COOKIE_KEY(type), AES_KEYSIZE_128);
+	if (ret)
+		goto err;
+
+	crypto_cipher_encrypt_one(tfm, ciphertext, expanded);
+	ret = copy_to_user(buf, ciphertext, AES_KEYSIZE_128);
+
+err:
+	return ret;
+}
+
+SYSCALL_DEFINE3(gen_obj_id, struct gen_obj_id_req __user *, request, void, __user *buf, long, len)
+{
+	struct task_struct *task;
+	struct crypto_cipher *tfm;
+	struct gen_obj_id_req req;
+	unsigned long obj_id;
+	unsigned long mask;
+	int ret = 0;
+	int bit;
+
+	BUILD_BUG_ON(GEN_OBJ_ID_TYPES < (sizeof(long) / BITS_PER_BYTE));
+	BUILD_BUG_ON(GEN_OBJ_ID_USER_SIZE < AES_BLOCK_SIZE);
+
+	if (len < GEN_OBJ_ID_USER_SIZE)
+		return -EINVAL;
+
+	if (copy_from_user(&req, (void __user *)request, sizeof(req)))
+		return -EFAULT;
+
+	tfm = crypto_alloc_cipher("aes", 0, CRYPTO_ALG_ASYNC);
+	if (IS_ERR(tfm)) {
+		ret = PTR_ERR(tfm);
+		return ret;
+	}
+
+	rcu_read_lock();
+	task = find_task_by_vpid((pid_t)req.pid);
+	if (!task) {
+		rcu_read_unlock();
+		return -ENOENT;
+	}
+	get_task_struct(task);
+	rcu_read_unlock();
+
+	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+		ret = -EACCES;
+		goto err;
+	}
+
+	mask = (unsigned long)req.mask;
+	if (!mask) {
+		ret = -ENOENT;
+		goto err;
+	}
+
+	down_read(&gen_obj_lock);
+
+	for_each_set_bit(bit, &mask, sizeof(mask)) {
+
+		obj_id = 0;
+
+		switch (bit) {
+		case GEN_OBJ_ID_FILE:
+			break;
+		case GEN_OBJ_ID_VM:
+			obj_id = (unsigned long)task->mm;
+			break;
+		case GEN_OBJ_ID_FILES:
+			obj_id = (unsigned long)task->files;
+			break;
+		case GEN_OBJ_ID_FS:
+			obj_id = (unsigned long)task->fs;
+			break;
+		case GEN_OBJ_ID_SIGHAND:
+			obj_id = (unsigned long)task->sighand;
+			break;
+		case GEN_OBJ_ID_IO:
+			obj_id = (unsigned long)task->io_context;
+			break;
+		case GEN_OBJ_ID_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+			obj_id = (unsigned long)task->sysvsem.undo_list;
+#endif
+			break;
+		case GEN_OBJ_ID_NS_NET:
+		case GEN_OBJ_ID_NS_PID:
+		case GEN_OBJ_ID_NS_MNT:
+		case GEN_OBJ_ID_NS_IPC:
+		case GEN_OBJ_ID_NS_UTS:
+		default:
+			ret = -EINVAL;
+			goto err_unlock;
+		}
+
+		if (obj_id) {
+			ret = encrypt_obj_id(bit, tfm, obj_id, buf);
+			if (ret)
+				goto err_unlock;
+		}
+
+		if (len < GEN_OBJ_ID_USER_SIZE) {
+			ret = -ENOSPC;
+			goto err_unlock;
+		}
+
+		buf += GEN_OBJ_ID_USER_SIZE;
+		len -= GEN_OBJ_ID_USER_SIZE;
+	}
+
+err_unlock:
+	up_read(&gen_obj_lock);
+err:
+	crypto_free_cipher(tfm);
+	put_task_struct(task);
+
+	return ret;
+}
+
+static __init int gen_obj_cookie_init(void)
+{
+	gen_obj_refresh_cookies();
+	return 0;
+}
+late_initcall(gen_obj_cookie_init);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 16:19 [RFC] on general object IDs again Cyrill Gorcunov
@ 2012-01-11 17:25 ` KOSAKI Motohiro
  2012-01-11 17:59   ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 17:25 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Kyle Moffett, Tejun Heo, Pavel Emelyanov,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

> In turn doing everything in a manner of syscall "compare resources of two pids"
> gives up very slow speed because we will have to compare every pid resources
> with every other pids we're checkpointing.

Could you please write down pseudo code of your userland compare logic? When
and why do you need? I still don't like this idea, but maybe I need to
understand your
requerement before putting negative comments.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 17:25 ` KOSAKI Motohiro
@ 2012-01-11 17:59   ` Cyrill Gorcunov
  2012-01-11 18:19     ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2012-01-11 17:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Andrew Morton, Kyle Moffett, Tejun Heo, Pavel Emelyanov,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On Wed, Jan 11, 2012 at 12:25:04PM -0500, KOSAKI Motohiro wrote:
> > In turn doing everything in a manner of syscall "compare resources of two pids"
> > gives up very slow speed because we will have to compare every pid resources
> > with every other pids we're checkpointing.
> 
> Could you please write down pseudo code of your userland compare logic? When
> and why do you need? I still don't like this idea, but maybe I need to
> understand your
> requerement before putting negative comments.
> 

Hi Kosaki,

the idea on user-space is something like

 - collect all pids to dump
 - collect IDs for every pid
 - sort the IDs obtained
 - find the same IDs (which will be kind of find intersections in a sets of IDs) and
   set up CLONE_ flags on restore procedure as appropriate (for example if
   GEN_OBJ_ID_VM IDs for two or more tasks are the same we need to use CLONE_VM
   at restore time, and so on).

	Cyrill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 17:59   ` Cyrill Gorcunov
@ 2012-01-11 18:19     ` KOSAKI Motohiro
  2012-01-11 18:22       ` Pavel Emelyanov
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 18:19 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: LKML, Andrew Morton, Kyle Moffett, Tejun Heo, Pavel Emelyanov,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

> Hi Kosaki,
>
> the idea on user-space is something like
>
>  - collect all pids to dump
>  - collect IDs for every pid
>  - sort the IDs obtained
>  - find the same IDs (which will be kind of find intersections in a sets of IDs) and
>   set up CLONE_ flags on restore procedure as appropriate (for example if
>   GEN_OBJ_ID_VM IDs for two or more tasks are the same we need to use CLONE_VM
>   at restore time, and so on).

Then, you only need to compare. not any other calculation. i.e. only
need id uniqueness.
And any resource are referenced from tasks. so, can you reuse pid for
this? example,
two taska share one mm.

task-a(pid: 100)
              |-----------------mm
task-b(pid: 200)


gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 18:19     ` KOSAKI Motohiro
@ 2012-01-11 18:22       ` Pavel Emelyanov
  2012-01-11 18:31         ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Emelyanov @ 2012-01-11 18:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On 01/11/2012 10:19 PM, KOSAKI Motohiro wrote:
>> Hi Kosaki,
>>
>> the idea on user-space is something like
>>
>>  - collect all pids to dump
>>  - collect IDs for every pid
>>  - sort the IDs obtained
>>  - find the same IDs (which will be kind of find intersections in a sets of IDs) and
>>   set up CLONE_ flags on restore procedure as appropriate (for example if
>>   GEN_OBJ_ID_VM IDs for two or more tasks are the same we need to use CLONE_VM
>>   at restore time, and so on).
> 
> Then, you only need to compare. not any other calculation. i.e. only
> need id uniqueness.
> And any resource are referenced from tasks. so, can you reuse pid for
> this? example,
> two taska share one mm.
> 
> task-a(pid: 100)
>               |-----------------mm
> task-b(pid: 200)
> 
> 
> gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)

We can, but determining the youngest pid for an mm struct is O(N) algo.
Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).

> .
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 18:22       ` Pavel Emelyanov
@ 2012-01-11 18:31         ` Cyrill Gorcunov
  2012-01-11 19:29           ` KOSAKI Motohiro
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2012-01-11 18:31 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pavel Emelyanov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On Wed, Jan 11, 2012 at 10:22:29PM +0400, Pavel Emelyanov wrote:
> On 01/11/2012 10:19 PM, KOSAKI Motohiro wrote:
> >> Hi Kosaki,
> >>
> >> the idea on user-space is something like
> >>
> >>  - collect all pids to dump
> >>  - collect IDs for every pid
> >>  - sort the IDs obtained
> >>  - find the same IDs (which will be kind of find intersections in a sets of IDs) and
> >>   set up CLONE_ flags on restore procedure as appropriate (for example if
> >>   GEN_OBJ_ID_VM IDs for two or more tasks are the same we need to use CLONE_VM
> >>   at restore time, and so on).
> > 
> > Then, you only need to compare. not any other calculation. i.e. only
> > need id uniqueness.
> > And any resource are referenced from tasks. so, can you reuse pid for
> > this? example,
> > two taska share one mm.
> > 
> > task-a(pid: 100)
> >               |-----------------mm
> > task-b(pid: 200)
> > 
> > 
> > gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)
> 
> We can, but determining the youngest pid for an mm struct is O(N) algo.
> Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).
> 

Yeah, exactly. If not the speed problem we would simply stick
with Andrew's proposal as two-id-are-the-same(pid1, pid2)
syscall. But when we get a number of pids to dump we need the
resource affinity picture over them all.

	Cyrill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 18:31         ` Cyrill Gorcunov
@ 2012-01-11 19:29           ` KOSAKI Motohiro
  2012-01-11 19:36             ` Pavel Emelyanov
                               ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 19:29 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

>> > Then, you only need to compare. not any other calculation. i.e. only
>> > need id uniqueness.
>> > And any resource are referenced from tasks. so, can you reuse pid for
>> > this? example,
>> > two taska share one mm.
>> >
>> > task-a(pid: 100)
>> >               |-----------------mm
>> > task-b(pid: 200)
>> >
>> >
>> > gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)
>>
>> We can, but determining the youngest pid for an mm struct is O(N) algo.
>> Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).
>
> Yeah, exactly. If not the speed problem we would simply stick
> with Andrew's proposal as two-id-are-the-same(pid1, pid2)
> syscall.

Why O(N^2) is matter? Typical HPC system have mere a few hundred pids.
so, O(N^2)
is not slow. How do you mesure Andrew's proposal?

If you have 1000 pids and each syscall need 10usec,

1000 * 1000 * 10 = 10,000,000usec = 10sec. But, important thing is, almost all
processes don't share fs, mm and other structs. then, if we check
reference count
before task traversal, required time may reduce 1/10x - 1/100x.


> But when we get a number of pids to dump we need the
> resource affinity picture over them all.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 19:29           ` KOSAKI Motohiro
@ 2012-01-11 19:36             ` Pavel Emelyanov
  2012-01-11 19:50               ` KOSAKI Motohiro
  2012-01-11 19:59             ` Cyrill Gorcunov
  2012-01-11 20:19             ` Eric W. Biederman
  2 siblings, 1 reply; 15+ messages in thread
From: Pavel Emelyanov @ 2012-01-11 19:36 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On 01/11/2012 11:29 PM, KOSAKI Motohiro wrote:
>>>> Then, you only need to compare. not any other calculation. i.e. only
>>>> need id uniqueness.
>>>> And any resource are referenced from tasks. so, can you reuse pid for
>>>> this? example,
>>>> two taska share one mm.
>>>>
>>>> task-a(pid: 100)
>>>>               |-----------------mm
>>>> task-b(pid: 200)
>>>>
>>>>
>>>> gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)
>>>
>>> We can, but determining the youngest pid for an mm struct is O(N) algo.
>>> Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).
>>
>> Yeah, exactly. If not the speed problem we would simply stick
>> with Andrew's proposal as two-id-are-the-same(pid1, pid2)
>> syscall.
> 
> Why O(N^2) is matter? Typical HPC system have mere a few hundred pids.
> so, O(N^2)
> is not slow. How do you mesure Andrew's proposal?
> 
> If you have 1000 pids and each syscall need 10usec,
> 
> 1000 * 1000 * 10 = 10,000,000usec = 10sec. But, important thing is, almost all
> processes don't share fs, mm and other structs. then, if we check
> reference count
> before task traversal, required time may reduce 1/10x - 1/100x.

This might work for mm_structs, although quite a lot apps now do have threads and
this mm->users check will be negative. But how about open files? Once we entered the
get-the-youngest-file-owner routine we need to take locks and with 1000 tasks the
overhead is not 1000 syscalls, but 1000 (syscalls + locks).

> 
>> But when we get a number of pids to dump we need the
>> resource affinity picture over them all.
> .
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 19:36             ` Pavel Emelyanov
@ 2012-01-11 19:50               ` KOSAKI Motohiro
  2012-01-11 19:57                 ` Pavel Emelyanov
  0 siblings, 1 reply; 15+ messages in thread
From: KOSAKI Motohiro @ 2012-01-11 19:50 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

> This might work for mm_structs, although quite a lot apps now do have threads and
> this mm->users check will be negative. But how about open files? Once we entered the
> get-the-youngest-file-owner routine we need to take locks and with 1000 tasks the
> overhead is not 1000 syscalls, but 1000 (syscalls + locks).

Unfortunately, I have no knowledge your program have what assumption
of file modify. When I reviewed mincore patch from you, you said you
assume any file are preserved. And If nobody change the files, why do
we need to care file locks? I have no knowledge a grand picture of
your design at all.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 19:50               ` KOSAKI Motohiro
@ 2012-01-11 19:57                 ` Pavel Emelyanov
  0 siblings, 0 replies; 15+ messages in thread
From: Pavel Emelyanov @ 2012-01-11 19:57 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On 01/11/2012 11:50 PM, KOSAKI Motohiro wrote:
>> This might work for mm_structs, although quite a lot apps now do have threads and
>> this mm->users check will be negative. But how about open files? Once we entered the
>> get-the-youngest-file-owner routine we need to take locks and with 1000 tasks the
>> overhead is not 1000 syscalls, but 1000 (syscalls + locks).
> 
> Unfortunately, I have no knowledge your program have what assumption
> of file modify. When I reviewed mincore patch from you, you said you
> assume any file are preserved. And If nobody change the files, why do
> we need to care file locks? I have no knowledge a grand picture of
> your design at all.

By locking I mean the internal kernel locks that preserve kernel objects
integrity. E.g. to check for a task's file (the struct file) being the one
you're interested in you need to lock the task lock (call the task_lock(p))
in order to get the fdtable pointer.

By the way, if you have a file and want to have its ID the complexity is not
just O(N^2) for N tasks. For each task you should walk the whole fdtable which
just makes things worse.

That said, I'd stick to the IDs that are calculated out of the object pointer.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 19:29           ` KOSAKI Motohiro
  2012-01-11 19:36             ` Pavel Emelyanov
@ 2012-01-11 19:59             ` Cyrill Gorcunov
  2012-01-11 20:19             ` Eric W. Biederman
  2 siblings, 0 replies; 15+ messages in thread
From: Cyrill Gorcunov @ 2012-01-11 19:59 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Pavel Emelyanov, LKML, Andrew Morton, Kyle Moffett, Tejun Heo,
	Glauber Costa, Andi Kleen, Matt Helsley, Pekka Enberg,
	Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan, Herbert Xu,
	David S. Miller, Eric W. Biederman, Andrey Vagin

On Wed, Jan 11, 2012 at 02:29:48PM -0500, KOSAKI Motohiro wrote:
> >> > Then, you only need to compare. not any other calculation. i.e. only
> >> > need id uniqueness.
> >> > And any resource are referenced from tasks. so, can you reuse pid for
> >> > this? example,
> >> > two taska share one mm.
> >> >
> >> > task-a(pid: 100)
> >> >               |-----------------mm
> >> > task-b(pid: 200)
> >> >
> >> >
> >> > gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)
> >>
> >> We can, but determining the youngest pid for an mm struct is O(N) algo.
> >> Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).
> >
> > Yeah, exactly. If not the speed problem we would simply stick
> > with Andrew's proposal as two-id-are-the-same(pid1, pid2)
> > syscall.
> 
> Why O(N^2) is matter? Typical HPC system have mere a few hundred pids.
> so, O(N^2) is not slow. How do you mesure Andrew's proposal?
> 

I consider quadratic approach only as a path where nothing
else can go. So to be fair -- I didn't measure such syscall.

> If you have 1000 pids and each syscall need 10usec,
> 
> 1000 * 1000 * 10 = 10,000,000usec = 10sec. But, important thing is, almost all
> processes don't share fs, mm and other structs. then, if we check
> reference count before task traversal, required time may reduce 1/10x - 1/100x.
> 

Sure thing, until some number of pids this will work (since compare two pointers
is very fast), but I fear eventually we will hit situation were such trade off
beat us. Also note that we do ask not only mm IDs, but ->files, ->signals,
->sysv.undo-list and so on (and who knows what else might be needed
in future). Since there is a hardware support for AES encoding on new
CPUs I think this is significant.

But again, Kosaki, if there some other fast way to retrieve such info,
it should be considered of course.

Technically for us plain kernel pointers would be enough unde root-only
approach but I've been strongly adviced to export such IDs via safe way
to a regular users as well (you could find the former patches from Pavel
in LKML archives).

	Cyrill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 19:29           ` KOSAKI Motohiro
  2012-01-11 19:36             ` Pavel Emelyanov
  2012-01-11 19:59             ` Cyrill Gorcunov
@ 2012-01-11 20:19             ` Eric W. Biederman
  2012-01-11 20:24               ` Pavel Emelyanov
  2 siblings, 1 reply; 15+ messages in thread
From: Eric W. Biederman @ 2012-01-11 20:19 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Cyrill Gorcunov, Pavel Emelyanov, LKML, Andrew Morton,
	Kyle Moffett, Tejun Heo, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Herbert Xu, David S. Miller, Andrey Vagin

KOSAKI Motohiro <kosaki.motohiro@gmail.com> writes:

>>> > Then, you only need to compare. not any other calculation. i.e. only
>>> > need id uniqueness.
>>> > And any resource are referenced from tasks. so, can you reuse pid for
>>> > this? example,
>>> > two taska share one mm.
>>> >
>>> > task-a(pid: 100)
>>> >               |-----------------mm
>>> > task-b(pid: 200)
>>> >
>>> >
>>> > gen_obj_id(task-b, GEN_OBJ_ID_VM) return 100. (youngest pid of referenced tasks)
>>>
>>> We can, but determining the youngest pid for an mm struct is O(N) algo.
>>> Having N tasks with N mm_structs getting the sharing picture becomes O(N^2).
>>
>> Yeah, exactly. If not the speed problem we would simply stick
>> with Andrew's proposal as two-id-are-the-same(pid1, pid2)
>> syscall.
>
> Why O(N^2) is matter? Typical HPC system have mere a few hundred pids.
> so, O(N^2)
> is not slow. How do you mesure Andrew's proposal?
>
> If you have 1000 pids and each syscall need 10usec,
>
> 1000 * 1000 * 10 = 10,000,000usec = 10sec. But, important thing is, almost all
> processes don't share fs, mm and other structs. then, if we check
> reference count
> before task traversal, required time may reduce 1/10x - 1/100x.

A couple of things.
1) This is generally useful debug information so if we can we should
   design something that is enabled everywhere.

2) Something like avoiding a O(N^2) check is easy.  Just store the pid
   when we create the structure.

3) Which leads to the very obvious question.  Why don't we just assign
   an arbitrary id to all of the interesting structures?  The code to do
   that is trivial and the implementation is secure in a way that
   hashing pointers never will be.

   I thought for a little bit that struct file might be too light weight
   for that to make sense but on second glance struct file is fat enough
   that I think we could squeeze in another 64 bits without noticeably
   bloating the structure.

At a first case things like the mm struct really aren't interesting.
99+% of the time the sharing is well know and running around discovering
it is just a waste of time.

It is struct file that is really interesting.  Even on the most trivial
unix system struct file is shared between multiple processes and
multiple open file descriptors in the same process.

And it is struct file where there are a lot of them that the performance
difference matters.

....

Hmm.  Come to think of it I don't think an object comparison system call
is half as worthless as it has been made out to be.

The justification for not doing an object comparison system call is that
using an object comparison system call will result in a O(N^2)
algorithm.   For file descriptors we can avoid most of that by comparing
the file the file descriptor points to, and the offset of the file
descriptor.  So we should not need an all to all comparison.

In the second place the only reason why we would need O(N^2) complexity
instead of O(NlogN) is if the comparison system call only compares for
identity instead of returning a result that allows us to order the
objects we are worrying about.

Now the only possible problem with comparing pointers and returning
the objects in memory order is that some evil attacker might be
able to use that to their advantage.   However the in memory order
is generally derivable from the creation order of the objects so
I don't think it is worth worry about at this point.

Furthermore we can preserve the order of objects across a checkpoint by
carefully controlling the order that the objects are created when we
restore the system.

Part of me wants to say: go assign an id to everything that is
interesting, but now that I have thought through the comparison
system call it is an obvious and trivial solution with no practical
impact on the rest of the kernel.

Guys please go back and implement a comparison system call that
orders the objects and does not just compare for identity.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 20:19             ` Eric W. Biederman
@ 2012-01-11 20:24               ` Pavel Emelyanov
  2012-01-11 20:34                 ` Cyrill Gorcunov
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Emelyanov @ 2012-01-11 20:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: KOSAKI Motohiro, Cyrill Gorcunov, LKML, Andrew Morton,
	Kyle Moffett, Tejun Heo, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Herbert Xu, David S. Miller, Andrey Vagin

> In the second place the only reason why we would need O(N^2) complexity
> instead of O(NlogN) is if the comparison system call only compares for
> identity instead of returning a result that allows us to order the
> objects we are worrying about.

Hm... I like this one. As I said we'd have to do the sorting anyway so
having the less-equals-above comparator right in the kernel sounds OK.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 20:24               ` Pavel Emelyanov
@ 2012-01-11 20:34                 ` Cyrill Gorcunov
  2012-01-11 20:45                   ` Eric W. Biederman
  0 siblings, 1 reply; 15+ messages in thread
From: Cyrill Gorcunov @ 2012-01-11 20:34 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Eric W. Biederman, KOSAKI Motohiro, LKML, Andrew Morton,
	Kyle Moffett, Tejun Heo, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Herbert Xu, David S. Miller, Andrey Vagin

On Thu, Jan 12, 2012 at 12:24:33AM +0400, Pavel Emelyanov wrote:
> > In the second place the only reason why we would need O(N^2) complexity
> > instead of O(NlogN) is if the comparison system call only compares for
> > identity instead of returning a result that allows us to order the
> > objects we are worrying about.
> 
> Hm... I like this one. As I said we'd have to do the sorting anyway so
> having the less-equals-above comparator right in the kernel sounds OK.
> 

So it would look something like

	sys_cmp_kid(pid1, pid2, kid-type)

(where kid is kernel-id), right ? Or I miss something ?

	Cyrill

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] on general object IDs again
  2012-01-11 20:34                 ` Cyrill Gorcunov
@ 2012-01-11 20:45                   ` Eric W. Biederman
  0 siblings, 0 replies; 15+ messages in thread
From: Eric W. Biederman @ 2012-01-11 20:45 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Pavel Emelyanov, KOSAKI Motohiro, LKML, Andrew Morton,
	Kyle Moffett, Tejun Heo, Glauber Costa, Andi Kleen, Matt Helsley,
	Pekka Enberg, Eric Dumazet, Vasiliy Kulikov, Alexey Dobriyan,
	Herbert Xu, David S. Miller, Andrey Vagin

Cyrill Gorcunov <gorcunov@gmail.com> writes:

> On Thu, Jan 12, 2012 at 12:24:33AM +0400, Pavel Emelyanov wrote:
>> > In the second place the only reason why we would need O(N^2) complexity
>> > instead of O(NlogN) is if the comparison system call only compares for
>> > identity instead of returning a result that allows us to order the
>> > objects we are worrying about.
>> 
>> Hm... I like this one. As I said we'd have to do the sorting anyway so
>> having the less-equals-above comparator right in the kernel sounds OK.
>> 
>
> So it would look something like
>
> 	sys_cmp_kid(pid1, pid2, kid-type)
>
> (where kid is kernel-id), right ? Or I miss something ?

Pretty much.

For those objects you can have more than one of like file descriptors we
may want more parameters, or perhaps a family of syscalls.

We also probably don't want to want to talk about ids.  Because we
aren't comparing ids, we are in principle comparing the objects
themselves and returning an arbitrary ordering.

Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-01-11 20:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-11 16:19 [RFC] on general object IDs again Cyrill Gorcunov
2012-01-11 17:25 ` KOSAKI Motohiro
2012-01-11 17:59   ` Cyrill Gorcunov
2012-01-11 18:19     ` KOSAKI Motohiro
2012-01-11 18:22       ` Pavel Emelyanov
2012-01-11 18:31         ` Cyrill Gorcunov
2012-01-11 19:29           ` KOSAKI Motohiro
2012-01-11 19:36             ` Pavel Emelyanov
2012-01-11 19:50               ` KOSAKI Motohiro
2012-01-11 19:57                 ` Pavel Emelyanov
2012-01-11 19:59             ` Cyrill Gorcunov
2012-01-11 20:19             ` Eric W. Biederman
2012-01-11 20:24               ` Pavel Emelyanov
2012-01-11 20:34                 ` Cyrill Gorcunov
2012-01-11 20:45                   ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox