Linux userland API discussions
* Re: How Not To Use kref (was Re: kdbus: add code for buses, domains and endpoints)
From: David Herrmann @ 2014-11-04  9:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Greg Kroah-Hartman, Linux API, linux-kernel, John Stultz,
	Arnd Bergmann, Tejun Heo, Marcel Holtmann, Ryan Lortie,
	Bastien Nocera, Djalal Harouni, Simon McVittie, Daniel Mack,
	Alban Crequy, javier.martinez-ZGY8ohtN/8pPYcu2f3hruQ,
	Tom Gundersen, Linus Torvalds
In-Reply-To: <20141030233801.GF7996-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

Hi Al

On Fri, Oct 31, 2014 at 12:38 AM, Al Viro <viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org> wrote:
> On Wed, Oct 29, 2014 at 03:00:52PM -0700, Greg Kroah-Hartman wrote:
>
>> +static void __kdbus_domain_user_free(struct kref *kref)
>> +{
>> +     struct kdbus_domain_user *user =
>> +             container_of(kref, struct kdbus_domain_user, kref);
>> +
>> +     BUG_ON(atomic_read(&user->buses) > 0);
>> +     BUG_ON(atomic_read(&user->connections) > 0);
>> +
>> +     mutex_lock(&user->domain->lock);
>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +     idr_remove(&user->domain->user_idr, user->idr);
>> +     hash_del(&user->hentry);
>         ^^^^^^^^^^^^^^^^^^^^^^^^
>> +     mutex_unlock(&user->domain->lock);
>> +
>> +     kdbus_domain_unref(user->domain);
>> +     kfree(user);
>> +}
>
>> +struct kdbus_domain_user *kdbus_domain_user_unref(struct kdbus_domain_user *u)
>> +{
>> +     if (u)
>> +             kref_put(&u->kref, __kdbus_domain_user_free);
>> +     return NULL;
>> +}
>
> If you remove an object from some search structures, taking the lock in
> destructor is Too Fucking Late(tm).  Somebody might have already found
> that puppy and decided to pick it (all under that lock) just as we'd
> got to that point in destructor and blocked there.  Oops...

Nice catch! I fixed it up via kref_get_unless_zero(). This has the
side-effect that there might be multiple domain_user objects for the
same user, but all but one will have ref==0. They don't carry any
valuable data in those cases, so we're fine. We will just end up using
the next one, or creating a new one.
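
To illustrate the pattern (this is not the kdbus code): a minimal
userspace sketch, assuming C11 atomics stand in for the kernel's kref.
The names struct user_obj, obj_get_unless_zero(), obj_put() and
table_find() are hypothetical. The lookup takes a reference only while
the count is still non-zero, so an entry whose destructor has already
started is skipped instead of being handed out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object kept in a lookup table. */
struct user_obj {
	atomic_int refs;	/* 0 means the destructor is already running */
	int uid;		/* lookup key */
};

/* Take a reference only if the count is still non-zero. */
static bool obj_get_unless_zero(struct user_obj *obj)
{
	int old = atomic_load(&obj->refs);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&obj->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool obj_put(struct user_obj *obj)
{
	int old = atomic_fetch_sub(&obj->refs, 1);

	assert(old > 0);
	return old == 1;
}

/* Lookup that skips dying entries. In real code this walk would run
 * under the lock that also protects removal from the table. */
static struct user_obj *table_find(struct user_obj *table, size_t n, int uid)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i].uid == uid && obj_get_unless_zero(&table[i]))
			return &table[i];
	}
	return NULL;
}
```

With two entries for the same uid, one at refs == 0 and one live,
table_find() returns the live one. If none is live, it returns NULL and
the caller would create a fresh object.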

Thanks a lot!
David


* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Richard Cochran @ 2014-11-04  8:27 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: John Stultz, Andy Lutomirski, Pawel Moll, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <3430954.VNaFmamXmP@wuerfel>

On Tue, Nov 04, 2014 at 09:01:31AM +0100, Arnd Bergmann wrote:
> On Monday 03 November 2014 17:11:53 John Stultz wrote:
> > I've got some thoughts on what a possible interface that wouldn't be
> > awful could look like, but I'm still hesitant because I don't really
> > know if exposing this sort of data is actually a good idea long term.
>  
> I was also thinking (while working on an unrelated patch) we could use
> a system call like
> 
> int clock_getoffset(clockid_t clkid, struct timespec *offs);
> 
> that returns the current offset between CLOCK_REALTIME and the
> requested timebase. It is of course racy, but so is every use
> of CLOCK_REALTIME. We could also use a reference other than
> CLOCK_REALTIME that might be more stable, but passing two arbitrary
> clocks as input would make this much more complex to implement.

No, it is really easy to implement. Just drop the idea of "atomic". It
really is not necessary or even possible.

Thanks,
Richard


* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Richard Cochran @ 2014-11-04  8:24 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, Pawel Moll, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <CALAqxLXfy5P0kg-W7hL+Jf1iYv758+-2cTdZwsY8kAns1nvEmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 03, 2014 at 05:11:53PM -0800, John Stultz wrote:
> On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> > If you're going to add double-stamped packets, can you also add a
> > syscall to read multiple clocks at once, atomically?  Or can you
> > otherwise add a non-perf mechanism to get at this data?

Does not need to be "atomic". In fact it cannot be atomic in the
general case. Some clocks are read over memory mapped registers, but
others are read over slow and sleepy buses like PCIe or MDIO.

> > Because the realtime to monotonic offset is really quite useful for
> > things like this, and it seems silly to make people actually open a
> > perf_event to get at it.
> 
> So this comes up periodically, but I don't think I've seen an interface
> proposal that was decent yet.

We have ioctl PTP_SYS_OFFSET that alternately reads a dynamic clock
and CLOCK_REALTIME a given number of times. This is done without locks
or any kind of "atomic" guarantees, and it works quite well in
practice. The user can pick the number of repetitions to deal with
noisy run time environments, and usually it is a simple matter of
picking the reading with the shortest duration. However, the user is
free to do statistics over the readings in any way he wants.
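
A loose userspace analogue of that approach, as a minimal sketch: it
assumes plain clock_gettime() calls stand in for the ioctl, and
read_ns() and clock_offset_ns() are hypothetical names. Each sample
brackets one reading of the second clock between two readings of the
reference clock, and the sample with the shortest bracket wins.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read one clock as signed nanoseconds. */
static int64_t read_ns(clockid_t id)
{
	struct timespec ts;
	int ret = clock_gettime(id, &ts);

	assert(ret == 0);
	(void)ret;
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Estimate (other - ref) in nanoseconds from n_samples bracketed
 * readings. Each sample reads ref, other, ref; the sample with the
 * shortest ref-to-ref window is assumed to be the least disturbed.
 * If best_window is non-NULL it receives that window length.
 * No locking and no atomicity are involved.
 */
static int64_t clock_offset_ns(clockid_t ref, clockid_t other,
			       int n_samples, int64_t *best_window)
{
	int64_t best = INT64_MAX, offset = 0;

	for (int i = 0; i < n_samples; i++) {
		int64_t t1 = read_ns(ref);
		int64_t t2 = read_ns(other);
		int64_t t3 = read_ns(ref);
		int64_t window = t3 - t1;

		if (window < best) {
			best = window;
			/* Compare 'other' against the window midpoint. */
			offset = t2 - (t1 + window / 2);
		}
	}
	if (best_window)
		*best_window = best;
	return offset;
}
```

For example, clock_offset_ns(CLOCK_REALTIME, CLOCK_MONOTONIC, 10, NULL)
gives one estimate of the monotonic-to-realtime offset. A caller that
prefers averaging or filtering could keep every sample instead of only
the best one.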

It would be nice to have a syscall (people have requested one) that
takes two clockid_t arguments but otherwise works like PTP_SYS_OFFSET.

We really will never have to support more than two clocks. The
application will pick one clock as the reference and then measure each
of the other clocks relative to it, one at a time. The performance
should be perfectly adequate, even better than reading three or more
at once (with the understanding that these are "software" time
stamps).

Thanks,
Richard


* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Arnd Bergmann @ 2014-11-04  8:01 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, Pawel Moll, Richard Cochran, Steven Rostedt,
	Ingo Molnar, Peter Zijlstra, Paul Mackerras,
	Arnaldo Carvalho de Melo, Masami Hiramatsu, Christopher Covington,
	Namhyung Kim, David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <CALAqxLXfy5P0kg-W7hL+Jf1iYv758+-2cTdZwsY8kAns1nvEmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Monday 03 November 2014 17:11:53 John Stultz wrote:
> I've got some thoughts on what a possible interface that wouldn't be
> awful could look like, but I'm still hesitant because I don't really
> know if exposing this sort of data is actually a good idea long term.
 
I was also thinking (while working on an unrelated patch) we could use
a system call like

int clock_getoffset(clockid_t clkid, struct timespec *offs);

that returns the current offset between CLOCK_REALTIME and the
requested timebase. It is of course racy, but so is every use
of CLOCK_REALTIME. We could also use a reference other than
CLOCK_REALTIME that might be more stable, but passing two arbitrary
clocks as input would make this much more complex to implement.

	Arnd


* Re: [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
From: Peter Zijlstra @ 2014-11-04  7:23 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo, John Stultz, Masami Hiramatsu,
	Christopher Covington, Namhyung Kim, David Ahern, Thomas Gleixner,
	Tomeu Vizoso, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415060918-19954-2-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

On Tue, Nov 04, 2014 at 12:28:36AM +0000, Pawel Moll wrote:

> +int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;

const ?

>  /*
>   * perf samples are done in some very critical code paths (NMIs).
>   * If they take too much CPU time, the system can lock up and not
> @@ -324,7 +326,7 @@ extern __weak const char *perf_pmu_name(void)
>  
>  static inline u64 perf_clock(void)
>  {
> -	return local_clock();
> +	return ktime_get_mono_fast_ns();
>  }

Do we maybe want to make it boot-time switchable back to local_clock for
people with bad systems and/or backwards compat issues?

Everybody using Core2 and older will very much not want to have this
unless they've got a very good reason for wanting it.


* Re: [PATCH v3 2/3] perf: Userspace event
From: Namhyung Kim @ 2014-11-04  6:33 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, David Ahern,
	Thomas Gleixner, Tomeu Vizoso, linux-kernel, linux-api,
	Pawel Moll
In-Reply-To: <1415060918-19954-3-git-send-email-pawel.moll@arm.com>

Hi Pawel,

On Tue,  4 Nov 2014 00:28:37 +0000, Pawel Moll wrote:
> +	/*
> +	 * Data in userspace event record is transparent for the kernel
> +	 *
> +	 * Userspace perf tool code maintains a list of known types with
> +	 * reference implementations of parsers for the data field.
> +	 *
> +	 * Overall size of the record (including type and size fields)
> +	 * is always aligned to 8 bytes by adding padding after the data.
> +	 *
> +	 * struct {
> +	 *	struct perf_event_header	header;
> +	 *	u32				type;
> +	 *	u32				size;

The struct perf_event_header also has 'size' field and it has the entire
length of the record so it's redundant.  Also there's 'misc' field in the
perf_event_header and I guess it can be used as 'type' info as it's
mostly for cpumode and we are in user mode by definition.

Thanks,
Namhyung


> +	 *	char				data[size];
> +	 *	char				__padding[-size & 7];
> +	 * 	struct sample_id		sample_id;
> +	 * };
> +	 */
> +	PERF_RECORD_UEVENT			= 11,
> +
>  	PERF_RECORD_MAX,			/* non-ABI */
>  };
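
For reference, the __padding[-size & 7] expression in the quoted layout
rounds the data up to the next 8-byte boundary. A minimal sketch of
that arithmetic follows; pad_to_8() and uevent_payload_len() are
hypothetical helper names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes of padding needed to round 'size' up to a multiple of 8. */
static uint32_t pad_to_8(uint32_t size)
{
	/* Unsigned negation wraps, so (-size & 7) == (8 - size % 8) % 8. */
	return -size & 7u;
}

/* Hypothetical helper: length of type + size + data + padding, i.e.
 * the record without perf_event_header and the trailing sample_id. */
static size_t uevent_payload_len(uint32_t size)
{
	size_t len = 2 * sizeof(uint32_t) + size + pad_to_8(size);

	assert(len % 8 == 0);
	return len;
}
```

With size = 5 the padding is 3 bytes and the payload is 16 bytes; with
size = 8 no padding is added and the payload is again 16 bytes.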


* [PATCH net-next 7/7] bpf: remove test map scaffolding and use proper types
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415069656-14138-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

Proper types and function helpers are ready. Use them in the verifier
testsuite and remove the temporary stubs.

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
 kernel/bpf/test_stub.c      |   56 +++++++------------------------------------
 samples/bpf/test_verifier.c |   14 +++++------
 2 files changed, 16 insertions(+), 54 deletions(-)

diff --git a/kernel/bpf/test_stub.c b/kernel/bpf/test_stub.c
index fcaddff4003e..0ceae1e6e8b5 100644
--- a/kernel/bpf/test_stub.c
+++ b/kernel/bpf/test_stub.c
@@ -18,26 +18,18 @@ struct bpf_context {
 	u64 arg2;
 };
 
-static u64 test_func(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
-{
-	return 0;
-}
-
-static struct bpf_func_proto test_funcs[] = {
-	[BPF_FUNC_unspec] = {
-		.func = test_func,
-		.gpl_only = true,
-		.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
-		.arg1_type = ARG_CONST_MAP_PTR,
-		.arg2_type = ARG_PTR_TO_MAP_KEY,
-	},
-};
-
 static const struct bpf_func_proto *test_func_proto(enum bpf_func_id func_id)
 {
-	if (func_id < 0 || func_id >= ARRAY_SIZE(test_funcs))
+	switch (func_id) {
+	case BPF_FUNC_map_lookup_elem:
+		return &bpf_map_lookup_elem_proto;
+	case BPF_FUNC_map_update_elem:
+		return &bpf_map_update_elem_proto;
+	case BPF_FUNC_map_delete_elem:
+		return &bpf_map_delete_elem_proto;
+	default:
 		return NULL;
-	return &test_funcs[func_id];
+	}
 }
 
 static const struct bpf_context_access {
@@ -78,38 +70,8 @@ static struct bpf_prog_type_list tl_prog = {
 	.type = BPF_PROG_TYPE_UNSPEC,
 };
 
-static struct bpf_map *test_map_alloc(union bpf_attr *attr)
-{
-	struct bpf_map *map;
-
-	map = kzalloc(sizeof(*map), GFP_USER);
-	if (!map)
-		return ERR_PTR(-ENOMEM);
-
-	map->key_size = attr->key_size;
-	map->value_size = attr->value_size;
-	map->max_entries = attr->max_entries;
-	return map;
-}
-
-static void test_map_free(struct bpf_map *map)
-{
-	kfree(map);
-}
-
-static struct bpf_map_ops test_map_ops = {
-	.map_alloc = test_map_alloc,
-	.map_free = test_map_free,
-};
-
-static struct bpf_map_type_list tl_map = {
-	.ops = &test_map_ops,
-	.type = BPF_MAP_TYPE_UNSPEC,
-};
-
 static int __init register_test_ops(void)
 {
-	bpf_register_map_type(&tl_map);
 	bpf_register_prog_type(&tl_prog);
 	return 0;
 }
diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 63402742345e..b96175e90363 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -261,7 +261,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 			BPF_EXIT_INSN(),
 		},
 		.fixup = {2},
@@ -417,7 +417,7 @@ static struct bpf_test tests[] = {
 			BPF_ALU64_REG(BPF_MOV, BPF_REG_2, BPF_REG_10),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_delete_elem),
 			BPF_EXIT_INSN(),
 		},
 		.errstr = "fd 0 is not pointing to valid bpf_map",
@@ -430,7 +430,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
 			BPF_EXIT_INSN(),
 		},
@@ -445,7 +445,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
 			BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
 			BPF_EXIT_INSN(),
@@ -461,7 +461,7 @@ static struct bpf_test tests[] = {
 			BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 			BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
 			BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
 			BPF_EXIT_INSN(),
@@ -548,7 +548,7 @@ static struct bpf_test tests[] = {
 			BPF_ST_MEM(BPF_DW, BPF_REG_2, -56, 0),
 			BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -56),
 			BPF_LD_MAP_FD(BPF_REG_1, 0),
-			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_unspec),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_delete_elem),
 			BPF_EXIT_INSN(),
 		},
 		.fixup = {24},
@@ -659,7 +659,7 @@ static int create_map(void)
 	long long key, value = 0;
 	int map_fd;
 
-	map_fd = bpf_create_map(BPF_MAP_TYPE_UNSPEC, sizeof(key), sizeof(value), 1024);
+	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 1024);
 	if (map_fd < 0) {
 		printf("failed to create map '%s'\n", strerror(errno));
 	}
-- 
1.7.9.5


* [PATCH net-next 6/7] bpf: allow eBPF programs to use maps
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415069656-14138-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
map accessors to eBPF programs

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
Note, these helpers are exposed as '.gpl_only = false', so non-GPL eBPF programs
can use them. That was requested by AndyL and DavidL before.

 include/linux/bpf.h      |    5 +++
 include/uapi/linux/bpf.h |    3 ++
 kernel/bpf/Makefile      |    2 +-
 kernel/bpf/helpers.c     |   88 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/helpers.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 51e9242e4803..75e94eaa228b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -133,4 +133,9 @@ struct bpf_prog *bpf_prog_get(u32 ufd);
 /* verify correctness of eBPF program */
 int bpf_check(struct bpf_prog *fp, union bpf_attr *attr);
 
+/* verifier prototypes for helper functions called from eBPF programs */
+extern struct bpf_func_proto bpf_map_lookup_elem_proto;
+extern struct bpf_func_proto bpf_map_update_elem_proto;
+extern struct bpf_func_proto bpf_map_delete_elem_proto;
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9811d012b766..84a7fc3a23ec 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -160,6 +160,9 @@ union bpf_attr {
  */
 enum bpf_func_id {
 	BPF_FUNC_unspec,
+	BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */
+	BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value, flags) */
+	BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */
 	__BPF_FUNC_MAX_ID,
 };
 
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 72ec98ba2d42..a5ae60f0b0a2 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,5 +1,5 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o helpers.o
 ifdef CONFIG_TEST_BPF
 obj-$(CONFIG_BPF_SYSCALL) += test_stub.o
 endif
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
new file mode 100644
index 000000000000..3fa78babe728
--- /dev/null
+++ b/kernel/bpf/helpers.c
@@ -0,0 +1,88 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/rcupdate.h>
+
+/* called from eBPF program under rcu lock
+ *
+ * if kernel subsystem is allowing eBPF programs to call this function,
+ * inside its own verifier_ops->get_func_proto() callback it should return
+ * bpf_map_lookup_elem_proto, so that verifier can properly check the arguments
+ */
+static u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	/* verifier checked that R1 contains a valid pointer to bpf_map
+	 * and R2 points to a program stack and map->key_size bytes were
+	 * initialized
+	 */
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+	void *key = (void *) (unsigned long) r2;
+	void *value;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	value = map->ops->map_lookup_elem(map, key);
+
+	/* lookup() returns either pointer to element value or NULL
+	 * which is the meaning of PTR_TO_MAP_VALUE_OR_NULL type
+	 */
+	return (unsigned long) value;
+}
+
+struct bpf_func_proto bpf_map_lookup_elem_proto = {
+	.func = bpf_map_lookup_elem,
+	.gpl_only = false,
+	.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_MAP_KEY,
+};
+
+/* called from eBPF program under rcu lock */
+static u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+	void *key = (void *) (unsigned long) r2;
+	void *value = (void *) (unsigned long) r3;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return map->ops->map_update_elem(map, key, value, r4);
+}
+
+struct bpf_func_proto bpf_map_update_elem_proto = {
+	.func = bpf_map_update_elem,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_MAP_KEY,
+	.arg3_type = ARG_PTR_TO_MAP_VALUE,
+	.arg4_type = ARG_ANYTHING,
+};
+
+/* called from eBPF program under rcu lock */
+static u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
+{
+	struct bpf_map *map = (struct bpf_map *) (unsigned long) r1;
+	void *key = (void *) (unsigned long) r2;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	return map->ops->map_delete_elem(map, key);
+}
+
+struct bpf_func_proto bpf_map_delete_elem_proto = {
+	.func = bpf_map_delete_elem,
+	.gpl_only = false,
+	.ret_type = RET_INTEGER,
+	.arg1_type = ARG_CONST_MAP_PTR,
+	.arg2_type = ARG_PTR_TO_MAP_KEY,
+};
-- 
1.7.9.5


* [PATCH net-next 5/7] bpf: add a testsuite for eBPF maps
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet,
	linux-api-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1415069656-14138-1-git-send-email-ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>

. check error conditions and sanity of hash and array map APIs
. check large maps (that kernel gracefully switches to vmalloc from kmalloc)
. check multi-process parallel access and stress test

Signed-off-by: Alexei Starovoitov <ast-uqk4Ao+rVK5Wk0Htik3J/w@public.gmane.org>
---
Eventually it can be moved to tools/testing/selftests/bpf/, but for now
keep it in samples/bpf/, since that's where all subsequent samples will go.

 samples/bpf/Makefile    |    3 +-
 samples/bpf/libbpf.c    |    3 +-
 samples/bpf/libbpf.h    |    2 +-
 samples/bpf/test_maps.c |  287 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 292 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/test_maps.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 634391797856..0718d9ce4619 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -2,9 +2,10 @@
 obj- := dummy.o
 
 # List of programs to build
-hostprogs-y := test_verifier
+hostprogs-y := test_verifier test_maps
 
 test_verifier-objs := test_verifier.o libbpf.o
+test_maps-objs := test_maps.o libbpf.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index ff6504420738..17bb520eb57f 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -27,12 +27,13 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 	return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
-int bpf_update_elem(int fd, void *key, void *value)
+int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags)
 {
 	union bpf_attr attr = {
 		.map_fd = fd,
 		.key = ptr_to_u64(key),
 		.value = ptr_to_u64(value),
+		.flags = flags,
 	};
 
 	return syscall(__NR_bpf, BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index 8a31babeca5d..f8678e5f48bf 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -6,7 +6,7 @@ struct bpf_insn;
 
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 		   int max_entries);
-int bpf_update_elem(int fd, void *key, void *value);
+int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
 int bpf_lookup_elem(int fd, void *key, void *value);
 int bpf_delete_elem(int fd, void *key);
 int bpf_get_next_key(int fd, void *key, void *next_key);
diff --git a/samples/bpf/test_maps.c b/samples/bpf/test_maps.c
new file mode 100644
index 000000000000..91614031aed0
--- /dev/null
+++ b/samples/bpf/test_maps.c
@@ -0,0 +1,287 @@
+/*
+ * Testsuite for eBPF maps
+ *
+ * Copyright (c) 2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <unistd.h>
+#include <linux/bpf.h>
+#include <errno.h>
+#include <string.h>
+#include <assert.h>
+#include <sys/wait.h>
+#include <stdlib.h>
+#include "libbpf.h"
+
+/* sanity tests for map API */
+static void test_hashmap_sanity(int i, void *data)
+{
+	long long key, next_key, value;
+	int map_fd;
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value), 2);
+	if (map_fd < 0) {
+		printf("failed to create hashmap '%s'\n", strerror(errno));
+		exit(1);
+	}
+
+	key = 1;
+	value = 1234;
+	/* insert key=1 element */
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_UPDATE_OR_CREATE) == 0);
+
+	value = 0;
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == -1 &&
+	       errno == EEXIST);
+
+	assert(bpf_update_elem(map_fd, &key, &value, -1) == -1 && errno == EINVAL);
+
+	/* check that key=1 can be found */
+	assert(bpf_lookup_elem(map_fd, &key, &value) == 0 && value == 1234);
+
+	key = 2;
+	/* check that key=2 is not found */
+	assert(bpf_lookup_elem(map_fd, &key, &value) == -1 && errno == ENOENT);
+
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_UPDATE_ONLY) == -1 &&
+	       errno == ENOENT);
+
+	/* insert key=2 element */
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == 0);
+
+	/* key=1 and key=2 were inserted, check that key=0 cannot be inserted
+	 * due to max_entries limit
+	 */
+	key = 0;
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == -1 &&
+	       errno == E2BIG);
+
+	/* check that key = 0 doesn't exist */
+	assert(bpf_delete_elem(map_fd, &key) == -1 && errno == ENOENT);
+
+	/* iterate over two elements */
+	assert(bpf_get_next_key(map_fd, &key, &next_key) == 0 &&
+	       next_key == 2);
+	assert(bpf_get_next_key(map_fd, &next_key, &next_key) == 0 &&
+	       next_key == 1);
+	assert(bpf_get_next_key(map_fd, &next_key, &next_key) == -1 &&
+	       errno == ENOENT);
+
+	/* delete both elements */
+	key = 1;
+	assert(bpf_delete_elem(map_fd, &key) == 0);
+	key = 2;
+	assert(bpf_delete_elem(map_fd, &key) == 0);
+	assert(bpf_delete_elem(map_fd, &key) == -1 && errno == ENOENT);
+
+	key = 0;
+	/* check that map is empty */
+	assert(bpf_get_next_key(map_fd, &key, &next_key) == -1 &&
+	       errno == ENOENT);
+	close(map_fd);
+}
+
+static void test_arraymap_sanity(int i, void *data)
+{
+	int key, next_key, map_fd;
+	long long value;
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value), 2);
+	if (map_fd < 0) {
+		printf("failed to create arraymap '%s'\n", strerror(errno));
+		exit(1);
+	}
+
+	key = 1;
+	value = 1234;
+	/* insert key=1 element */
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_UPDATE_OR_CREATE) == 0);
+
+	value = 0;
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == -1 &&
+	       errno == EINVAL);
+
+	/* check that key=1 can be found */
+	assert(bpf_lookup_elem(map_fd, &key, &value) == 0 && value == 1234);
+
+	key = 0;
+	/* check that key=0 is also found and zero initialized */
+	assert(bpf_lookup_elem(map_fd, &key, &value) == 0 && value == 0);
+
+
+	/* key=0 and key=1 were inserted, check that key=2 cannot be inserted
+	 * due to max_entries limit
+	 */
+	key = 2;
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_UPDATE_ONLY) == -1 &&
+	       errno == E2BIG);
+
+	/* check that key = 2 doesn't exist */
+	assert(bpf_lookup_elem(map_fd, &key, &value) == -1 && errno == ENOENT);
+
+	/* iterate over two elements */
+	assert(bpf_get_next_key(map_fd, &key, &next_key) == 0 &&
+	       next_key == 0);
+	assert(bpf_get_next_key(map_fd, &next_key, &next_key) == 0 &&
+	       next_key == 1);
+	assert(bpf_get_next_key(map_fd, &next_key, &next_key) == -1 &&
+	       errno == ENOENT);
+
+	/* delete shouldn't succeed */
+	key = 1;
+	assert(bpf_delete_elem(map_fd, &key) == -1 && errno == EINVAL);
+
+	close(map_fd);
+}
+
+#define MAP_SIZE (32 * 1024)
+static void test_map_large(void)
+{
+	struct bigkey {
+		int a;
+		char b[116];
+		long long c;
+	} key;
+	int map_fd, i, value;
+
+	/* allocate 4Mbyte of memory */
+	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
+				MAP_SIZE);
+	if (map_fd < 0) {
+		printf("failed to create large map '%s'\n", strerror(errno));
+		exit(1);
+	}
+
+	for (i = 0; i < MAP_SIZE; i++) {
+		key = (struct bigkey) {.c = i};
+		value = i;
+		assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == 0);
+	}
+	key.c = -1;
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == -1 &&
+	       errno == E2BIG);
+
+	/* iterate through all elements */
+	for (i = 0; i < MAP_SIZE; i++)
+		assert(bpf_get_next_key(map_fd, &key, &key) == 0);
+	assert(bpf_get_next_key(map_fd, &key, &key) == -1 && errno == ENOENT);
+
+	key.c = 0;
+	assert(bpf_lookup_elem(map_fd, &key, &value) == 0 && value == 0);
+	key.a = 1;
+	assert(bpf_lookup_elem(map_fd, &key, &value) == -1 && errno == ENOENT);
+
+	close(map_fd);
+}
+
+/* fork N children and wait for them to complete */
+static void run_parallel(int tasks, void (*fn)(int i, void *data), void *data)
+{
+	pid_t pid[tasks];
+	int i;
+
+	for (i = 0; i < tasks; i++) {
+		pid[i] = fork();
+		if (pid[i] == 0) {
+			fn(i, data);
+			exit(0);
+		} else if (pid[i] == -1) {
+			printf("couldn't spawn #%d process\n", i);
+			exit(1);
+		}
+	}
+	for (i = 0; i < tasks; i++) {
+		int status;
+
+		assert(waitpid(pid[i], &status, 0) == pid[i]);
+		assert(status == 0);
+	}
+}
+
+static void test_map_stress(void)
+{
+	run_parallel(100, test_hashmap_sanity, NULL);
+	run_parallel(100, test_arraymap_sanity, NULL);
+}
+
+#define TASKS 1024
+#define DO_UPDATE 1
+#define DO_DELETE 0
+static void do_work(int fn, void *data)
+{
+	int map_fd = ((int *)data)[0];
+	int do_update = ((int *)data)[1];
+	int i;
+	int key, value;
+
+	for (i = fn; i < MAP_SIZE; i += TASKS) {
+		key = value = i;
+		if (do_update)
+			assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == 0);
+		else
+			assert(bpf_delete_elem(map_fd, &key) == 0);
+	}
+}
+
+static void test_map_parallel(void)
+{
+	int i, map_fd, key = 0, value = 0;
+	int data[2];
+
+	map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
+				MAP_SIZE);
+	if (map_fd < 0) {
+		printf("failed to create map for parallel test '%s'\n",
+		       strerror(errno));
+		exit(1);
+	}
+
+	data[0] = map_fd;
+	data[1] = DO_UPDATE;
+	/* use the same map_fd in children to add elements to this map
+	 * child_0 adds key=0, key=1024, key=2048, ...
+	 * child_1 adds key=1, key=1025, key=2049, ...
+	 * child_1023 adds key=1023, ...
+	 */
+	run_parallel(TASKS, do_work, data);
+
+	/* check that key=0 is already there */
+	assert(bpf_update_elem(map_fd, &key, &value, BPF_MAP_CREATE_ONLY) == -1 &&
+	       errno == EEXIST);
+
+	/* check that all elements were inserted */
+	key = -1;
+	for (i = 0; i < MAP_SIZE; i++)
+		assert(bpf_get_next_key(map_fd, &key, &key) == 0);
+	assert(bpf_get_next_key(map_fd, &key, &key) == -1 && errno == ENOENT);
+
+	/* another check for all elements */
+	for (i = 0; i < MAP_SIZE; i++) {
+		key = MAP_SIZE - i - 1;
+		assert(bpf_lookup_elem(map_fd, &key, &value) == 0 &&
+		       value == key);
+	}
+
+	/* now let's delete all elements in parallel */
+	data[1] = DO_DELETE;
+	run_parallel(TASKS, do_work, data);
+
+	/* nothing should be left */
+	key = -1;
+	assert(bpf_get_next_key(map_fd, &key, &key) == -1 && errno == ENOENT);
+}
+
+int main(void)
+{
+	test_hashmap_sanity(0, NULL);
+	test_arraymap_sanity(0, NULL);
+	test_map_large();
+	test_map_parallel();
+	test_map_stress();
+	printf("test_maps: OK\n");
+	return 0;
+}
-- 
1.7.9.5

* [PATCH net-next 4/7] bpf: fix BPF_MAP_LOOKUP_ELEM command return code
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1415069656-14138-1-git-send-email-ast@plumgrid.com>

Fix the errno returned by the BPF_MAP_LOOKUP_ELEM command so that it matches
what the bpf manpage describes in commit b4fc1a460f30 ("Merge branch 'bpf-next'"):
-----
BPF_MAP_LOOKUP_ELEM
    int bpf_lookup_elem(int fd, void *key, void *value)
    {
        union bpf_attr attr = {
            .map_fd = fd,
            .key = ptr_to_u64(key),
            .value = ptr_to_u64(value),
        };

        return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }
    bpf() syscall looks up an element with given key in  a  map  fd.
    If  element  is found it returns zero and stores element's value
    into value.  If element is not found  it  returns  -1  and  sets
    errno to ENOENT.

and further down in manpage:

   ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM,  indicates  that
          element with given key was not found.
-----

In general, all BPF commands return ENOENT when a map element is not found
(including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
flags == BPF_MAP_UPDATE_ONLY).

Subsequent patch adds a testsuite to check return values for all of
these combinations.
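
To illustrate the convention from the caller's side, here is a minimal
user-space sketch. It does not call the real bpf() syscall: struct toy_map,
toy_lookup_elem() and lookup_or_default() are hypothetical stand-ins that only
mimic the documented "return -1 and set errno to ENOENT" behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a map: a small fixed table of key/value pairs. */
struct toy_entry {
	int key;
	int value;
	int used;
};

#define TOY_SLOTS 4

struct toy_map {
	struct toy_entry slot[TOY_SLOTS];
};

/* Mimics the documented wrapper convention: 0 on success,
 * -1 with errno set to ENOENT when the key is absent.
 */
static int toy_lookup_elem(const struct toy_map *m, int key, int *value)
{
	size_t i;

	assert(m && value);
	for (i = 0; i < TOY_SLOTS; i++) {
		if (m->slot[i].used && m->slot[i].key == key) {
			*value = m->slot[i].value;
			return 0;
		}
	}
	errno = ENOENT;
	return -1;
}

/* Caller-side pattern: ENOENT means "fall back to a default",
 * any other errno is reported as a real failure.
 */
static int lookup_or_default(const struct toy_map *m, int key, int dflt,
			     int *out)
{
	if (toy_lookup_elem(m, key, out) == 0)
		return 0;
	if (errno == ENOENT) {
		*out = dflt;
		return 0;
	}
	return -1;
}
```

The point of a single "not found" errno is that a caller like
lookup_or_default() can handle a missing key without special-casing each
command.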

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

I don't think this patch is needed for 'net', since 'net' has syscall shell
only. Actual map types and their implementations are being introduced by
this set of patches.

 kernel/bpf/syscall.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0d03bf317a2..088ac0b1b106 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -169,7 +169,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (copy_from_user(key, ukey, map->key_size) != 0)
 		goto free_key;
 
-	err = -ESRCH;
+	err = -ENOENT;
 	rcu_read_lock();
 	value = map->ops->map_lookup_elem(map, key);
 	if (!value)
-- 
1.7.9.5

* [PATCH net-next 3/7] bpf: add array type of eBPF maps
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1415069656-14138-1-git-send-email-ast@plumgrid.com>

add new map type BPF_MAP_TYPE_ARRAY and its implementation

- optimized for fastest possible lookup()
  . in the future verifier/JIT may recognize lookup() with constant key
    and optimize it into constant pointer. Can optimize non-constant
    key into direct pointer arithmetic as well, since pointers and
    value_size are constant for the life of the eBPF program.
    In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
    while preserving concurrent access to this map from user space

- two main use cases for array type:
  . 'global' eBPF variables: array of 1 element with key=0 and value is a
    collection of 'global' variables which programs can use to keep the state
    between events
  . aggregation of tracing events into fixed set of buckets

- all array elements pre-allocated and zero initialized at init time

- key is an index into the array and can only be 4 bytes

- map_delete_elem() returns EINVAL, since elements cannot be deleted

- map_update_elem() replaces elements in a non-atomic way
  (for atomic updates hashtable type should be used instead)
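
The properties listed above can be modelled in user space with a short
sketch. This is not the kernel code from this patch: struct toy_array and the
toy_array_*() helpers are hypothetical names, and calloc() stands in for the
kernel allocators.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical user-space model of the array map described above. */
struct toy_array {
	uint32_t max_entries;
	uint32_t value_size;
	uint32_t elem_size;	/* value_size rounded up to 8 bytes */
	char *values;		/* max_entries * elem_size bytes */
};

static struct toy_array *toy_array_alloc(uint32_t max_entries,
					 uint32_t value_size)
{
	struct toy_array *a;

	if (max_entries == 0 || value_size == 0)
		return NULL;

	a = calloc(1, sizeof(*a));
	if (!a)
		return NULL;

	a->max_entries = max_entries;
	a->value_size = value_size;
	a->elem_size = (value_size + 7u) & ~7u;
	/* calloc() pre-allocates and zero-initializes every element */
	a->values = calloc(max_entries, a->elem_size);
	if (!a->values) {
		free(a);
		return NULL;
	}
	return a;
}

static void toy_array_free(struct toy_array *a)
{
	if (!a)
		return;
	free(a->values);
	free(a);
}

/* The key is just an index; out-of-range keys yield NULL. */
static void *toy_array_lookup(struct toy_array *a, uint32_t index)
{
	assert(a);
	if (index >= a->max_entries)
		return NULL;
	return a->values + (size_t)a->elem_size * index;
}

/* Overwrites the element in place; a concurrent reader could observe
 * a partially written value, hence "non-atomic".
 */
static int toy_array_update(struct toy_array *a, uint32_t index,
			    const void *value)
{
	void *slot = toy_array_lookup(a, index);

	if (!slot)
		return -E2BIG;
	memcpy(slot, value, a->value_size);
	return 0;
}

/* Elements are pre-allocated and cannot be deleted. */
static int toy_array_delete(struct toy_array *a, uint32_t index)
{
	(void)a;
	(void)index;
	return -EINVAL;
}
```

Because lookup is only a bounds check plus pointer arithmetic, it is easy to
see why a verifier/JIT could later replace the call with a direct pointer.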

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---

Note, from eBPF program and from user space, all map types are accessed
through the same API.

Example of using array type for 'global' variables from eBPF program:
struct globals {
    u64 lat_ave;
    u64 lat_sum;
    u64 missed;
    u64 max_lat;
    int num_samples;
};

struct bpf_map_def SEC("maps") global_map = {
    .type = BPF_MAP_TYPE_ARRAY,
    .key_size = sizeof(int),
    .value_size = sizeof(struct globals),
    .max_entries = 1,
};

int bpf_prog(struct bpf_context *ctx)
{
    ...
    int ind = 0;
    struct globals *g = bpf_map_lookup_elem(&global_map, &ind);
    if (!g)
            return 0;
    if (g->lat_ave == 0) {
            g->num_samples++;
            g->lat_sum += delta;
            if (g->num_samples >= 100) {
                    g->lat_ave = g->lat_sum / g->num_samples;
    ...

The future verifier/JIT optimization will replace bpf_map_lookup_elem()
call inside eBPF program with const pointer to element value of key=0,
so that eBPF program will have no penalty whatsoever to access such
'global' variables.
At the same time, user space can access these 'globals' via the common map API.

Full example of both kernel and user side follows in later patches.

 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/Makefile      |    2 +-
 kernel/bpf/arraymap.c    |  150 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 152 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/arraymap.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c071f9e3a454..9811d012b766 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -112,6 +112,7 @@ enum bpf_cmd {
 enum bpf_map_type {
 	BPF_MAP_TYPE_UNSPEC,
 	BPF_MAP_TYPE_HASH,
+	BPF_MAP_TYPE_ARRAY,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 2c0ec7f9da78..72ec98ba2d42 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,5 +1,5 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o arraymap.o
 ifdef CONFIG_TEST_BPF
 obj-$(CONFIG_BPF_SYSCALL) += test_stub.o
 endif
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
new file mode 100644
index 000000000000..60212672ec9c
--- /dev/null
+++ b/kernel/bpf/arraymap.c
@@ -0,0 +1,150 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/err.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/mm.h>
+
+struct bpf_array {
+	struct bpf_map map;
+	u32 elem_size;
+	char value[0] __aligned(8);
+};
+
+/* Called from syscall */
+static struct bpf_map *array_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_array *array;
+	u32 elem_size;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size == 0)
+		return ERR_PTR(-EINVAL);
+
+	elem_size = round_up(attr->value_size, 8);
+
+	/* allocate all map elements and zero-initialize them */
+	array = kzalloc(sizeof(*array) + attr->max_entries * elem_size,
+			GFP_USER | __GFP_NOWARN);
+	if (!array) {
+		array = vzalloc(sizeof(*array) + attr->max_entries * elem_size);
+		if (!array)
+			return ERR_PTR(-ENOMEM);
+	}
+
+	/* copy mandatory map attributes */
+	array->map.key_size = attr->key_size;
+	array->map.value_size = attr->value_size;
+	array->map.max_entries = attr->max_entries;
+
+	array->elem_size = elem_size;
+
+	return &array->map;
+
+}
+
+/* Called from syscall or from eBPF program */
+static void *array_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u32 index = *(u32 *)key;
+
+	if (index >= array->map.max_entries)
+		return NULL;
+
+	return array->value + array->elem_size * index;
+}
+
+/* Called from syscall */
+static int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u32 index = *(u32 *)key;
+	u32 *next = (u32 *)next_key;
+
+	if (index >= array->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == array->map.max_entries - 1)
+		return -ENOENT;
+
+	*next = index + 1;
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
+				 u64 map_flags)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+	u32 index = *(u32 *)key;
+
+	if (map_flags > BPF_MAP_UPDATE_ONLY)
+		/* unknown flags */
+		return -EINVAL;
+
+	if (map_flags == BPF_MAP_CREATE_ONLY)
+		return -EINVAL;
+
+	if (index >= array->map.max_entries)
+		/* all elements were pre-allocated, cannot insert a new one */
+		return -E2BIG;
+
+	memcpy(array->value + array->elem_size * index, value, map->value_size);
+	return 0;
+}
+
+/* Called from syscall or from eBPF program */
+static int array_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void array_map_free(struct bpf_map *map)
+{
+	struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding programs to complete
+	 * and free the array
+	 */
+	synchronize_rcu();
+
+	kvfree(array);
+}
+
+static struct bpf_map_ops array_ops = {
+	.map_alloc = array_map_alloc,
+	.map_free = array_map_free,
+	.map_get_next_key = array_map_get_next_key,
+	.map_lookup_elem = array_map_lookup_elem,
+	.map_update_elem = array_map_update_elem,
+	.map_delete_elem = array_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+	.ops = &array_ops,
+	.type = BPF_MAP_TYPE_ARRAY,
+};
+
+static int __init register_array_map(void)
+{
+	bpf_register_map_type(&tl);
+	return 0;
+}
+late_initcall(register_array_map);
-- 
1.7.9.5

* [PATCH net-next 2/7] bpf: add hashtable type of eBPF maps
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1415069656-14138-1-git-send-email-ast@plumgrid.com>

add new map type BPF_MAP_TYPE_HASH and its implementation

- maps are created/destroyed by userspace. Both userspace and eBPF programs
  can lookup/update/delete elements from the map

- eBPF programs can be called in_irq(), so use spin_lock_irqsave() mechanism
  for concurrent updates

- key/value are opaque ranges of bytes (aligned to 8 bytes)

- user space provides 3 configuration attributes via BPF syscall:
  key_size, value_size, max_entries

- map takes care of allocating/freeing key/value pairs

- map_update_elem() must fail to insert new element when max_entries
  limit is reached to make sure that eBPF programs cannot exhaust memory

- map_update_elem() replaces elements in an atomic way

- optimized for speed of lookup() which can be called multiple times from
  eBPF program which itself is triggered by high volume of events
  . in the future JIT compiler may recognize lookup() call and optimize it
    further, since key_size is constant for life of eBPF program
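
As a rough illustration of the sizing arithmetic the hashtable relies on,
here is a small user-space sketch. The toy_*() helpers are hypothetical names
written out for this example; they only mirror the idea of rounding the bucket
count up to a power of two so that bucket selection is a mask, and of padding
the key to 8 bytes inside each element.

```c
#include <assert.h>
#include <stdint.h>

/* Round n up to the next power of two (1 <= n <= 2^31). A user-space
 * analogue of the kernel's roundup_pow_of_two(), for illustration only.
 */
static uint32_t toy_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	assert(n >= 1 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/* With a power-of-two bucket count, masking replaces the modulo. */
static uint32_t toy_select_bucket(uint32_t hash, uint32_t n_buckets)
{
	return hash & (n_buckets - 1);
}

/* Per-element size: header, then key padded to 8 bytes, then value. */
static uint32_t toy_elem_size(uint32_t hdr_size, uint32_t key_size,
			      uint32_t value_size)
{
	return hdr_size + ((key_size + 7u) & ~7u) + value_size;
}
```

For example, max_entries = 1000 gives 1024 buckets, and a 5-byte key occupies
8 bytes in the element so that the value that follows stays 8-byte aligned.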

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/uapi/linux/bpf.h |    1 +
 kernel/bpf/Makefile      |    2 +-
 kernel/bpf/hashtab.c     |  362 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 364 insertions(+), 1 deletion(-)
 create mode 100644 kernel/bpf/hashtab.c

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 19c7ae4a4dd5..c071f9e3a454 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -111,6 +111,7 @@ enum bpf_cmd {
 
 enum bpf_map_type {
 	BPF_MAP_TYPE_UNSPEC,
+	BPF_MAP_TYPE_HASH,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 0daf7f6ae7df..2c0ec7f9da78 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -1,5 +1,5 @@
 obj-y := core.o
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o hashtab.o
 ifdef CONFIG_TEST_BPF
 obj-$(CONFIG_BPF_SYSCALL) += test_stub.o
 endif
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
new file mode 100644
index 000000000000..9fa3227c7165
--- /dev/null
+++ b/kernel/bpf/hashtab.c
@@ -0,0 +1,362 @@
+/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <linux/jhash.h>
+#include <linux/filter.h>
+#include <linux/vmalloc.h>
+
+struct bpf_htab {
+	struct bpf_map map;
+	struct hlist_head *buckets;
+	spinlock_t lock;
+	u32 count;	/* number of elements in this hashtable */
+	u32 n_buckets;	/* number of hash buckets */
+	u32 elem_size;	/* size of each element in bytes */
+};
+
+/* each htab element is struct htab_elem + key + value */
+struct htab_elem {
+	struct hlist_node hash_node;
+	struct rcu_head rcu;
+	u32 hash;
+	char key[0] __aligned(8);
+};
+
+/* Called from syscall */
+static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_htab *htab;
+	int err, i;
+
+	htab = kzalloc(sizeof(*htab), GFP_USER);
+	if (!htab)
+		return ERR_PTR(-ENOMEM);
+
+	/* mandatory map attributes */
+	htab->map.key_size = attr->key_size;
+	htab->map.value_size = attr->value_size;
+	htab->map.max_entries = attr->max_entries;
+
+	/* check sanity of attributes.
+	 * value_size == 0 may be allowed in the future to use map as a set
+	 */
+	err = -EINVAL;
+	if (htab->map.max_entries == 0 || htab->map.key_size == 0 ||
+	    htab->map.value_size == 0)
+		goto free_htab;
+
+	/* hash table size must be power of 2 */
+	htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
+
+	err = -E2BIG;
+	if (htab->map.key_size > MAX_BPF_STACK)
+		/* eBPF programs initialize keys on stack, so they cannot be
+		 * larger than max stack size
+		 */
+		goto free_htab;
+
+	err = -ENOMEM;
+	htab->buckets = kmalloc_array(htab->n_buckets, sizeof(struct hlist_head),
+				      GFP_USER | __GFP_NOWARN);
+
+	if (!htab->buckets) {
+		htab->buckets = vmalloc(htab->n_buckets * sizeof(struct hlist_head));
+		if (!htab->buckets)
+			goto free_htab;
+	}
+
+	for (i = 0; i < htab->n_buckets; i++)
+		INIT_HLIST_HEAD(&htab->buckets[i]);
+
+	spin_lock_init(&htab->lock);
+	htab->count = 0;
+
+	htab->elem_size = sizeof(struct htab_elem) +
+			  round_up(htab->map.key_size, 8) +
+			  htab->map.value_size;
+	return &htab->map;
+
+free_htab:
+	kfree(htab);
+	return ERR_PTR(err);
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len)
+{
+	return jhash(key, key_len, 0);
+}
+
+static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
+{
+	return &htab->buckets[hash & (htab->n_buckets - 1)];
+}
+
+static struct htab_elem *lookup_elem_raw(struct hlist_head *head, u32 hash,
+					 void *key, u32 key_size)
+{
+	struct htab_elem *l;
+
+	hlist_for_each_entry_rcu(l, head, hash_node)
+		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+			return l;
+
+	return NULL;
+}
+
+/* Called from syscall or from eBPF program */
+static void *htab_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	u32 hash, key_size;
+
+	/* Must be called with rcu_read_lock. */
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l)
+		return l->key + round_up(map->key_size, 8);
+
+	return NULL;
+}
+
+/* Called from syscall */
+static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l, *next_l;
+	u32 hash, key_size;
+	int i;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	head = select_bucket(htab, hash);
+
+	/* lookup the key */
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (!l) {
+		i = 0;
+		goto find_first_elem;
+	}
+
+	/* key was found, get next key in the same bucket */
+	next_l = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
+				  struct htab_elem, hash_node);
+
+	if (next_l) {
+		/* if next elem in this hash list is non-zero, just return it */
+		memcpy(next_key, next_l->key, key_size);
+		return 0;
+	}
+
+	/* no more elements in this hash list, go to the next bucket */
+	i = hash & (htab->n_buckets - 1);
+	i++;
+
+find_first_elem:
+	/* iterate over buckets */
+	for (; i < htab->n_buckets; i++) {
+		head = select_bucket(htab, i);
+
+		/* pick first element in the bucket */
+		next_l = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(head)),
+					  struct htab_elem, hash_node);
+		if (next_l) {
+			/* if it's not empty, just return it */
+			memcpy(next_key, next_l->key, key_size);
+			return 0;
+		}
+	}
+
+	/* iterated over all buckets and all elements */
+	return -ENOENT;
+}
+
+/* Called from syscall or from eBPF program */
+static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
+				u64 map_flags)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct htab_elem *l_new, *l_old;
+	struct hlist_head *head;
+	unsigned long flags;
+	u32 key_size;
+	int ret;
+
+	if (map_flags > BPF_MAP_UPDATE_ONLY)
+		/* unknown flags */
+		return -EINVAL;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	/* allocate new element outside of lock */
+	l_new = kmalloc(htab->elem_size, GFP_ATOMIC);
+	if (!l_new)
+		return -ENOMEM;
+
+	key_size = map->key_size;
+
+	memcpy(l_new->key, key, key_size);
+	memcpy(l_new->key + round_up(key_size, 8), value, map->value_size);
+
+	l_new->hash = htab_map_hash(l_new->key, key_size);
+
+	/* bpf_map_update_elem() can be called in_irq() */
+	spin_lock_irqsave(&htab->lock, flags);
+
+	head = select_bucket(htab, l_new->hash);
+
+	l_old = lookup_elem_raw(head, l_new->hash, key, key_size);
+
+	if (!l_old && unlikely(htab->count >= map->max_entries)) {
+		/* if elem with this 'key' doesn't exist and we've reached
+		 * max_entries limit, fail insertion of new elem
+		 */
+		ret = -E2BIG;
+		goto err;
+	}
+
+	if (l_old && map_flags == BPF_MAP_CREATE_ONLY) {
+		/* elem already exists */
+		ret = -EEXIST;
+		goto err;
+	}
+
+	if (!l_old && map_flags == BPF_MAP_UPDATE_ONLY) {
+		/* elem doesn't exist, cannot update it */
+		ret = -ENOENT;
+		goto err;
+	}
+
+	/* add new element to the head of the list, so that concurrent
+	 * search will find it before old elem
+	 */
+	hlist_add_head_rcu(&l_new->hash_node, head);
+	if (l_old) {
+		hlist_del_rcu(&l_old->hash_node);
+		kfree_rcu(l_old, rcu);
+	} else {
+		htab->count++;
+	}
+	spin_unlock_irqrestore(&htab->lock, flags);
+
+	return 0;
+err:
+	spin_unlock_irqrestore(&htab->lock, flags);
+	kfree(l_new);
+	return ret;
+}
+
+/* Called from syscall or from eBPF program */
+static int htab_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_head *head;
+	struct htab_elem *l;
+	unsigned long flags;
+	u32 hash, key_size;
+	int ret = -ENOENT;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size);
+
+	spin_lock_irqsave(&htab->lock, flags);
+
+	head = select_bucket(htab, hash);
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+
+	if (l) {
+		hlist_del_rcu(&l->hash_node);
+		htab->count--;
+		kfree_rcu(l, rcu);
+		ret = 0;
+	}
+
+	spin_unlock_irqrestore(&htab->lock, flags);
+	return ret;
+}
+
+static void delete_all_elements(struct bpf_htab *htab)
+{
+	int i;
+
+	for (i = 0; i < htab->n_buckets; i++) {
+		struct hlist_head *head = select_bucket(htab, i);
+		struct hlist_node *n;
+		struct htab_elem *l;
+
+		hlist_for_each_entry_safe(l, n, head, hash_node) {
+			hlist_del_rcu(&l->hash_node);
+			htab->count--;
+			kfree(l);
+		}
+	}
+}
+
+/* Called when map->refcnt goes to zero, either from workqueue or from syscall */
+static void htab_map_free(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+
+	/* at this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete
+	 */
+	synchronize_rcu();
+
+	/* some of kfree_rcu() callbacks for elements of this map may not have
+	 * executed. It's ok. Proceed to free residual elements and map itself
+	 */
+	delete_all_elements(htab);
+	kvfree(htab->buckets);
+	kfree(htab);
+}
+
+static struct bpf_map_ops htab_ops = {
+	.map_alloc = htab_map_alloc,
+	.map_free = htab_map_free,
+	.map_get_next_key = htab_map_get_next_key,
+	.map_lookup_elem = htab_map_lookup_elem,
+	.map_update_elem = htab_map_update_elem,
+	.map_delete_elem = htab_map_delete_elem,
+};
+
+static struct bpf_map_type_list tl = {
+	.ops = &htab_ops,
+	.type = BPF_MAP_TYPE_HASH,
+};
+
+static int __init register_htab_map(void)
+{
+	bpf_register_map_type(&tl);
+	return 0;
+}
+late_initcall(register_htab_map);
-- 
1.7.9.5

* [PATCH net-next 1/7] bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel
In-Reply-To: <1415069656-14138-1-git-send-email-ast@plumgrid.com>

the current meaning of BPF_MAP_UPDATE_ELEM syscall command is:
either update existing map element or create a new one.
Initially the plan was to add a new command to handle the case of
'create new element if it didn't exist', but 'flags' style looks
cleaner and overall diff is much smaller (more code reused), so add 'flags'
attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
enum {
  BPF_MAP_UPDATE_OR_CREATE = 0, /* add new element or update existing */
  BPF_MAP_CREATE_ONLY,          /* add new element if it didn't exist */
  BPF_MAP_UPDATE_ONLY           /* update existing element */
};

BPF_MAP_CREATE_ONLY can fail with EEXIST if element already exists.
BPF_MAP_UPDATE_ONLY can fail with ENOENT if element doesn't exist.

Userspace will call it as:
int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
{
    union bpf_attr attr = {
        .map_fd = fd,
        .key = ptr_to_u64(key),
        .value = ptr_to_u64(value),
        .flags = flags,
    };

    return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}
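
To make the three flag values concrete, here is a user-space sketch of how an
implementation might evaluate them. It models a map with max_entries == 1;
struct toy_slot, toy_update() and the TOY_* constants are hypothetical names,
and the order of checks follows the hashtable implementation in patch 2/7 of
this series rather than anything mandated by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the flag values introduced by this patch. */
enum toy_update_flags {
	TOY_UPDATE_OR_CREATE = 0,	/* add new element or update existing */
	TOY_CREATE_ONLY,		/* add new element if it didn't exist */
	TOY_UPDATE_ONLY			/* update existing element */
};

/* A one-element "map": enough to exercise every outcome. */
struct toy_slot {
	int used;
	int key;
	int value;
};

/* Returns 0 on success or a negative errno, kernel style. */
static int toy_update(struct toy_slot *s, int key, int value, uint64_t flags)
{
	int exists;

	assert(s);
	if (flags > TOY_UPDATE_ONLY)
		return -EINVAL;		/* unknown flags */

	exists = s->used && s->key == key;

	if (!exists && s->used)
		return -E2BIG;		/* map is full, cannot insert */
	if (exists && flags == TOY_CREATE_ONLY)
		return -EEXIST;		/* element already exists */
	if (!exists && flags == TOY_UPDATE_ONLY)
		return -ENOENT;		/* nothing to update */

	s->used = 1;
	s->key = key;
	s->value = value;
	return 0;
}
```

Note that user space sees these results through the usual syscall convention:
the wrapper returns -1 and errno carries EEXIST or ENOENT.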

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h      |    2 +-
 include/uapi/linux/bpf.h |   10 +++++++++-
 kernel/bpf/syscall.c     |    4 ++--
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3cf91754a957..51e9242e4803 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -22,7 +22,7 @@ struct bpf_map_ops {
 
 	/* funcs callable from userspace and from eBPF programs */
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
-	int (*map_update_elem)(struct bpf_map *map, void *key, void *value);
+	int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
 	int (*map_delete_elem)(struct bpf_map *map, void *key);
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d18316f9e9c4..19c7ae4a4dd5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -82,7 +82,7 @@ enum bpf_cmd {
 
 	/* create or update key/value pair in a given map
 	 * err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
-	 * Using attr->map_fd, attr->key, attr->value
+	 * Using attr->map_fd, attr->key, attr->value, attr->flags
 	 * returns zero or negative error
 	 */
 	BPF_MAP_UPDATE_ELEM,
@@ -117,6 +117,13 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_UNSPEC,
 };
 
+/* flags for BPF_MAP_UPDATE_ELEM command */
+enum bpf_map_update_flags {
+	BPF_MAP_UPDATE_OR_CREATE = 0,	/* add new element or update existing */
+	BPF_MAP_CREATE_ONLY,		/* add new element if it didn't exist */
+	BPF_MAP_UPDATE_ONLY		/* update existing element */
+};
+
 union bpf_attr {
 	struct { /* anonymous struct used by BPF_MAP_CREATE command */
 		__u32	map_type;	/* one of enum bpf_map_type */
@@ -132,6 +139,7 @@ union bpf_attr {
 			__aligned_u64 value;
 			__aligned_u64 next_key;
 		};
+		__u64		flags;
 	};
 
 	struct { /* anonymous struct used by BPF_PROG_LOAD command */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ba61c8c16032..c0d03bf317a2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -190,7 +190,7 @@ err_put:
 	return err;
 }
 
-#define BPF_MAP_UPDATE_ELEM_LAST_FIELD value
+#define BPF_MAP_UPDATE_ELEM_LAST_FIELD flags
 
 static int map_update_elem(union bpf_attr *attr)
 {
@@ -231,7 +231,7 @@ static int map_update_elem(union bpf_attr *attr)
 	 * therefore all map accessors rely on this fact, so do the same here
 	 */
 	rcu_read_lock();
-	err = map->ops->map_update_elem(map, key, value);
+	err = map->ops->map_update_elem(map, key, value, attr->flags);
 	rcu_read_unlock();
 
 free_value:
-- 
1.7.9.5

* [PATCH net-next 0/7] implementation of eBPF maps
From: Alexei Starovoitov @ 2014-11-04  2:54 UTC (permalink / raw)
  To: David S. Miller
  Cc: Ingo Molnar, Andy Lutomirski, Daniel Borkmann,
	Hannes Frederic Sowa, Eric Dumazet, linux-api, netdev,
	linux-kernel

Hi All,

this set of patches adds implementation of HASH and ARRAY types of eBPF maps
which were described in manpage in commit b4fc1a460f30("Merge branch 'bpf-next'")

The difference vs previous version of these patches from August:
- added 'flags' attribute to BPF_MAP_UPDATE_ELEM
- in HASH type implementation removed per-map kmem_cache.
  I was doing kmem_cache_create() for every map to enable selective slub
  debugging to check for overflows and leaks. Now it's not needed, so just
  use normal kmalloc() for map elements.
- added ARRAY type which was mentioned in manpage, but wasn't public yet
- added map testsuite and removed temporary bits from test_stubs

Note, eBPF programs cannot be attached to events yet.
It will come in the next set.

Alexei Starovoitov (7):
  bpf: add 'flags' attribute to BPF_MAP_UPDATE_ELEM command
  bpf: add hashtable type of eBPF maps
  bpf: add array type of eBPF maps
  bpf: fix BPF_MAP_LOOKUP_ELEM command return code
  bpf: add a testsuite for eBPF maps
  bpf: allow eBPF programs to use maps
  bpf: remove test map scaffolding and use proper types

 include/linux/bpf.h         |    7 +-
 include/uapi/linux/bpf.h    |   15 +-
 kernel/bpf/Makefile         |    2 +-
 kernel/bpf/arraymap.c       |  150 ++++++++++++++++++
 kernel/bpf/hashtab.c        |  362 +++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/helpers.c        |   88 +++++++++++
 kernel/bpf/syscall.c        |    6 +-
 kernel/bpf/test_stub.c      |   56 ++-----
 samples/bpf/Makefile        |    3 +-
 samples/bpf/libbpf.c        |    3 +-
 samples/bpf/libbpf.h        |    2 +-
 samples/bpf/test_maps.c     |  287 ++++++++++++++++++++++++++++++++++
 samples/bpf/test_verifier.c |   14 +-
 13 files changed, 932 insertions(+), 63 deletions(-)
 create mode 100644 kernel/bpf/arraymap.c
 create mode 100644 kernel/bpf/hashtab.c
 create mode 100644 kernel/bpf/helpers.c
 create mode 100644 samples/bpf/test_maps.c

-- 
1.7.9.5

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-04  1:59 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jnagal-hpIqsD4AKlfQT0dZR+AlfA
In-Reply-To: <1414783141-6947-8-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>

This patch enables cgroup mounting inside userns when a process
has appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
  fs/kernfs/mount.c      | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
  include/linux/kernfs.h |  2 ++
  kernel/cgroup.c        | 46 +++++++++++++++++++++++++++++++++++++++++++++-
  3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
  	return NULL;
  }

+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn)
+{
+	struct dentry *dentry;
+	struct inode *inode;
+
+	BUG_ON(sb->s_op != &kernfs_sops);
+
+	/* inode for the given kernfs_node should already exist. */
+	inode = ilookup(sb, kn->ino);
+	if (!inode) {
+		pr_debug("kernfs: could not get inode for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return ERR_PTR(-EINVAL);
+	}
+
+	/* instantiate and link root dentry */
+	dentry = d_obtain_root(inode);
+	if (IS_ERR(dentry)) {
+		pr_debug("kernfs: could not get dentry for '");
+		pr_cont_kernfs_path(kn);
+		pr_cont("'.\n");
+		return dentry;
+	}
+
+	/* If this is a new dentry, set it up. We need kernfs_mutex because this
+	 * may be called by callers other than kernfs_fill_super. */
+	mutex_lock(&kernfs_mutex);
+	if (!dentry->d_fsdata) {
+		kernfs_get(kn);
+		dentry->d_fsdata = kn;
+	} else {
+		WARN_ON(dentry->d_fsdata != kn);
+	}
+	mutex_unlock(&kernfs_mutex);
+
+	return dentry;
+}
+
  static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
  {
  	struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
  struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
  struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+				  struct kernfs_node *kn);
  struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
  				       unsigned int flags, void *priv);
  void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..8008c4c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1389,6 +1389,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
  			return -ENOENT;
  	}

+	/* If inside a non-init cgroup namespace, only allow default hierarchy
+	 * to be mounted.
+	 */
+	if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+	    !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+		return -EINVAL;
+	}
+
  	if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
  		pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
  		if (nr_opts != 1) {
@@ -1581,6 +1589,15 @@ static void init_cgroup_root(struct cgroup_root *root,
  		set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
  }

+struct dentry *cgroupns_get_root(struct super_block *sb,
+				 struct cgroup_namespace *ns)
+{
+	struct dentry *nsdentry;
+
+	nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+	return nsdentry;
+}
+
  static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
  {
  	LIST_HEAD(tmp_links);
@@ -1685,6 +1702,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
  	int ret;
  	int i;
  	bool new_sb;
+	struct cgroup_namespace *ns =
+		get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+	/* Check if the caller has permission to mount. */
+	if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+		put_cgroup_ns(ns);
+		return ERR_PTR(-EPERM);
+	}

  	/*
  	 * The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1842,28 @@ out_free:
  	kfree(opts.release_agent);
  	kfree(opts.name);

-	if (ret)
+	if (ret) {
+		put_cgroup_ns(ns);
  		return ERR_PTR(ret);
+	}

  	dentry = kernfs_mount(fs_type, flags, root->kf_root,
  				CGROUP_SUPER_MAGIC, &new_sb);
+
+	if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+		/* If this mount is for the default hierarchy in non-init cgroup
+		 * namespace, then instead of root cgroup's dentry, we return
+		 * the dentry corresponding to the cgroupns->root_cgrp.
+		 */
+		if (ns != &init_cgroup_ns) {
+			struct dentry *nsdentry;
+
+			nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+			dput(dentry);
+			dentry = nsdentry;
+		}
+	}
+
  	if (IS_ERR(dentry) || !new_sb)
  		cgroup_put(&root->cgrp);

@@ -1834,6 +1876,7 @@ out_free:
  		deactivate_super(pinned_sb);
  	}

+	put_cgroup_ns(ns);
  	return dentry;
  }

@@ -1862,6 +1905,7 @@ static struct file_system_type cgroup_fs_type = {
  	.name = "cgroup",
  	.mount = cgroup_mount,
  	.kill_sb = cgroup_kill_sb,
+	.fs_flags = FS_USERNS_MOUNT,
  };

  static struct kobject *cgroup_kobj;
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces
From: Aditya Kali @ 2014-11-04  1:56 UTC (permalink / raw)
  To: tj-DgEjT+Ai2ygdnm+yROfE0A, lizefan-hv44wF8Li93QT0dZR+AlfA,
	serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA, luto-kltTT9wpgjJwATOyAt5JVQ,
	ebiederm-aS9lmoZGLiVWk0Htik3J/w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, mingo-H+wXaHxf7aLQT0dZR+AlfA
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
In-Reply-To: <1414783141-6947-6-git-send-email-adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>


Introduce the ability to create a new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred to as cgroupns-root).
The main purpose of a cgroup namespace is to virtualize the contents
of the /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
  they will see a relative path from their cgroupns-root).
For a correctly set up container this enables container tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking the system-level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
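
For illustration only (not part of the patch): a minimal userspace sketch
of how the 'unshare' path described above could be exercised. It assumes
CLONE_NEWCGROUP has the value 0x02000000 (an assumption here; the flag is
defined elsewhere in the series) and calls the syscall directly so it does
not depend on libc knowing the flag. The helper names print_self_cgroup()
and show_cgroupns_effect() are hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* assumed value, see note above */
#endif

/* Copy /proc/self/cgroup to stdout; returns the line count or -1. */
static int print_self_cgroup(const char *label)
{
	char line[4096];
	int n = 0;
	FILE *f = fopen("/proc/self/cgroup", "r");

	if (!f)
		return -1;
	printf("%s:\n", label);
	while (fgets(line, sizeof(line), f)) {
		printf("  %s", line);
		n++;
	}
	fclose(f);
	return n;
}

/*
 * Show /proc/self/cgroup before and after unsharing the cgroup
 * namespace.  Returns 0 on success or -errno (e.g. -EPERM without
 * CAP_SYS_ADMIN, -EINVAL on a kernel without cgroup namespaces).
 */
static int show_cgroupns_effect(void)
{
	print_self_cgroup("before unshare");
	if (syscall(SYS_unshare, CLONE_NEWCGROUP) < 0)
		return -errno;
	print_self_cgroup("after unshare");
	return 0;
}
```

On a kernel with this series applied, the paths printed after the unshare
would be expected to be relative to the caller's cgroupns-root rather than
to the global root.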
  fs/proc/namespaces.c             |   1 +
  include/linux/cgroup.h           |  18 +++++-
  include/linux/cgroup_namespace.h |  36 +++++++++++
  include/linux/nsproxy.h          |   2 +
  include/linux/proc_ns.h          |   4 ++
  kernel/Makefile                  |   2 +-
  kernel/cgroup.c                  |  14 +++++
  kernel/cgroup_namespace.c        | 127 +++++++++++++++++++++++++++++++++++++++
  kernel/fork.c                    |   2 +-
  kernel/nsproxy.c                 |  19 +++++-
  10 files changed, 220 insertions(+), 5 deletions(-)
  create mode 100644 include/linux/cgroup_namespace.h
  create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
  	&userns_operations,
  #endif
  	&mntns_operations,
+	&cgroupns_operations,
  };

  static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
  #include <linux/seq_file.h>
  #include <linux/kernfs.h>
  #include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

  #ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
  #endif
  };

+struct cgroup_namespace {
+	atomic_t		count;
+	unsigned int		proc_inum;
+	struct user_namespace	*user_ns;
+	struct cgroup		*root_cgrp;
+};
+
  extern struct cgroup_root cgrp_dfl_root;
  extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
  	return kernfs_name(cgrp->kn, buf, buflen);
  }

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+						 struct cgroup *cgrp, char *buf,
+						 size_t buflen)
+{
+	return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
  static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
  					      size_t buflen)
  {
-	return kernfs_path(cgrp->kn, buf, buflen);
+	return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
  }

  static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+	return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+		struct cgroup_namespace *ns)
+{
+	if (ns)
+		atomic_inc(&ns->count);
+	return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+	if (ns && atomic_dec_and_test(&ns->count))
+		free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					       struct user_namespace *user_ns,
+					       struct cgroup_namespace *old_ns);
+
+#endif  /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
  struct uts_namespace;
  struct ipc_namespace;
  struct pid_namespace;
+struct cgroup_namespace;
  struct fs_struct;

  /*
@@ -33,6 +34,7 @@ struct nsproxy {
  	struct mnt_namespace *mnt_ns;
  	struct pid_namespace *pid_ns_for_children;
  	struct net 	     *net_ns;
+	struct cgroup_namespace *cgroup_ns;
  };
  extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

  struct pid_namespace;
  struct nsproxy;
+struct task_struct;
+struct inode;

  struct proc_ns_operations {
  	const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
  extern const struct proc_ns_operations pidns_operations;
  extern const struct proc_ns_operations userns_operations;
  extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

  /*
   * We always define these enumerators
@@ -37,6 +40,7 @@ enum {
  	PROC_UTS_INIT_INO	= 0xEFFFFFFEU,
  	PROC_USER_INIT_INO	= 0xEFFFFFFDU,
  	PROC_PID_INIT_INO	= 0xEFFFFFFCU,
+	PROC_CGROUP_INIT_INO	= 0xEFFFFFFBU,
  };

  #ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
  obj-$(CONFIG_KEXEC) += kexec.o
  obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
  obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
  obj-$(CONFIG_CPUSETS) += cpuset.o
  obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
  #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
  #include <linux/kthread.h>
  #include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

  #include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
  static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
  			      bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+	.count = {
+		.counter = 1,
+	},
+	.proc_inum = PROC_CGROUP_INIT_INO,
+	.user_ns = &init_user_ns,
+	.root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
  /* IDR wrappers which synchronize using cgroup_idr_lock */
  static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
  			    gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
  	parent = cgroup_kn_lock_live(parent_kn);
  	if (!parent)
  		return -ENODEV;
+
  	root = parent->root;

  	/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
  	unsigned long key;
  	int ssid, err;

+	get_user_ns(init_cgroup_ns.user_ns);
+
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
  	BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ *  Copyright (C) 2014 Google Inc.
+ *
+ *  Author: Aditya Kali (adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org)
+ *
+ *  This program is free software; you can redistribute it and/or modify it
+ *  under the terms of the GNU General Public License as published by the Free
+ *  Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+	struct cgroup_namespace *new_ns;
+
+	new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+	if (new_ns)
+		atomic_set(&new_ns->count, 1);
+	return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+	cgroup_put(ns->root_cgrp);
+	put_user_ns(ns->user_ns);
+	proc_free_inum(ns->proc_inum);
+	kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+					struct user_namespace *user_ns,
+					struct cgroup_namespace *old_ns)
+{
+	struct cgroup_namespace *new_ns = NULL;
+	struct cgroup *cgrp = NULL;
+	int err;
+
+	BUG_ON(!old_ns);
+
+	if (!(flags & CLONE_NEWCGROUP))
+		return get_cgroup_ns(old_ns);
+
+	/* Allow only sysadmin to create cgroup namespace. */
+	err = -EPERM;
+	if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+		goto err_out;
+
+	/* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+	 */
+	cgrp = get_task_cgroup(current);
+
+	err = -ENOMEM;
+	new_ns = alloc_cgroup_ns();
+	if (!new_ns)
+		goto err_out;
+
+	err = proc_alloc_inum(&new_ns->proc_inum);
+	if (err)
+		goto err_out;
+
+	new_ns->user_ns = get_user_ns(user_ns);
+	new_ns->root_cgrp = cgrp;
+
+	return new_ns;
+
+err_out:
+	if (cgrp)
+		cgroup_put(cgrp);
+	kfree(new_ns);
+	return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+	pr_info("setns not supported for cgroup namespace\n");
+	return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+	struct cgroup_namespace *ns = NULL;
+	struct nsproxy *nsproxy;
+
+	task_lock(task);
+	nsproxy = task->nsproxy;
+	if (nsproxy) {
+		ns = nsproxy->cgroup_ns;
+		get_cgroup_ns(ns);
+	}
+	task_unlock(task);
+
+	return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+	put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+	struct cgroup_namespace *cgroup_ns = ns;
+
+	return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+	.name		= "cgroup",
+	.type		= CLONE_NEWCGROUP,
+	.get		= cgroupns_get,
+	.put		= cgroupns_put,
+	.install	= cgroupns_install,
+	.inum		= cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+	return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
  				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
-				CLONE_NEWUSER|CLONE_NEWPID))
+				CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
  		return -EINVAL;
  	/*
  	 * Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
  #include <linux/proc_ns.h>
  #include <linux/file.h>
  #include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

  static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
  #ifdef CONFIG_NET
  	.net_ns			= &init_net,
  #endif
+	.cgroup_ns		= &init_cgroup_ns,
  };

  static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  		goto out_pid;
  	}

+	new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+					    tsk->nsproxy->cgroup_ns);
+	if (IS_ERR(new_nsp->cgroup_ns)) {
+		err = PTR_ERR(new_nsp->cgroup_ns);
+		goto out_cgroup;
+	}
+
  	new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
  	if (IS_ERR(new_nsp->net_ns)) {
  		err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
  	return new_nsp;

  out_net:
+	if (new_nsp->cgroup_ns)
+		put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
  	if (new_nsp->pid_ns_for_children)
  		put_pid_ns(new_nsp->pid_ns_for_children);
  out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
  	struct nsproxy *new_ns;

  	if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			      CLONE_NEWPID | CLONE_NEWNET)))) {
+			      CLONE_NEWPID | CLONE_NEWNET |
+			      CLONE_NEWCGROUP)))) {
  		get_nsproxy(old_ns);
  		return 0;
  	}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
  		put_ipc_ns(ns->ipc_ns);
  	if (ns->pid_ns_for_children)
  		put_pid_ns(ns->pid_ns_for_children);
+	if (ns->cgroup_ns)
+		put_cgroup_ns(ns->cgroup_ns);
  	put_net(ns->net_ns);
  	kmem_cache_free(nsproxy_cachep, ns);
  }
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
  	int err = 0;

  	if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
-			       CLONE_NEWNET | CLONE_NEWPID)))
+			       CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
  		return 0;

  	user_ns = new_cred ? new_cred->user_ns : current_user_ns();
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Andy Lutomirski @ 2014-11-04  1:25 UTC (permalink / raw)
  To: John Stultz
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <CALAqxLXfy5P0kg-W7hL+Jf1iYv758+-2cTdZwsY8kAns1nvEmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 3, 2014 at 5:11 PM, John Stultz <john.stultz-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
>>> From: Pawel Moll <mail-g3xvULXeDMYS+FvcfC7Uqw@public.gmane.org>
>>> Thomas suggested solution which gets down to my original proposal for
>>> sched/monotonic clock correlation - an additional sample type so events
>>> can be "double stamped" using different clock sources providing
>>> synchronisation points for later time approximation. I've just extended
>>> the implementation with configuration value to select the clock source.
>>> If the first patch (making perf timestamps monotonic) gets accepted,
>>> there will be no immediate need for this one, but I'd like to gain some
>>> feedback anyway.
>>>
>>
>> I have nothing intelligent to add to the potential Thomas/Ingo
>> showdown, but I do have a related thought. :)
>>
>> If you're going to add double-stamped packets, can you also add a
>> syscall to read multiple clocks at once, atomically?  Or can you
>> otherwise add a non-perf mechanism to get at this data?
>>
>> Because the realtime to monotonic offset is really quite useful for
>> things like this, and it seems silly to make people actually open a
>> perf_event to get at it.
>
> So this comes up periodically, but I don't think I've seen an interface
> proposal that was decent yet.
>
> Also, if you want to read multiple clocks at once, do you stop at two,
> or three, or... there's possibly quite a few.  Additionally some
> clocks may not be possible to read atomically (perf/sched clock and
> system time for example may be based on different underlying
> clocksources). The general idea feels like it's creeping towards some
> "atomically expose all timekeeping state" mega-interface.
>
> I've got some thoughts on what a possible interface that wouldn't be
> awful could look like, but I'm still hesitant because I don't really
> know if exposing this sort of data is actually a good idea long term.

My only real thought here is that, if perf is going to try to do this,
then presumably it should be reasonably integrated w/ the core timing
code.  I.e. if perf does this, then presumably the core code should
know about it and there should be a core interface to it.

--Andy

>
> thanks
> -john



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: John Stultz @ 2014-11-04  1:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Pawel Moll, Richard Cochran, Steven Rostedt, Ingo Molnar,
	Peter Zijlstra, Paul Mackerras, Arnaldo Carvalho de Melo,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <CALCETrXGoevmD_avz5sQfbbD624vpLW5=-8ovzTPT_5wzNFnVA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 3, 2014 at 4:58 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
>> From: Pawel Moll <mail-g3xvULXeDMYS+FvcfC7Uqw@public.gmane.org>
>> Thomas suggested solution which gets down to my original proposal for
>> sched/monotonic clock correlation - an additional sample type so events
>> can be "double stamped" using different clock sources providing
>> synchronisation points for later time approximation. I've just extended
>> the implementation with configuration value to select the clock source.
>> If the first patch (making perf timestamps monotonic) gets accepted,
>> there will be no immediate need for this one, but I'd like to gain some
>> feedback anyway.
>>
>
> I have nothing intelligent to add to the potential Thomas/Ingo
> showdown, but I do have a related thought. :)
>
> If you're going to add double-stamped packets, can you also add a
> syscall to read multiple clocks at once, atomically?  Or can you
> otherwise add a non-perf mechanism to get at this data?
>
> Because the realtime to monotonic offset is really quite useful for
> things like this, and it seems silly to make people actually open a
> perf_event to get at it.

So this comes up periodically, but I don't think I've seen an interface
proposal that was decent yet.

Also, if you want to read multiple clocks at once, do you stop at two,
or three, or... there's possibly quite a few.  Additionally some
clocks may not be possible to read atomically (perf/sched clock and
system time for example may be based on different underlying
clocksources). The general idea feels like it's creeping towards some
"atomically expose all timekeeping state" mega-interface.

I've got some thoughts on what a possible interface that wouldn't be
awful could look like, but I'm still hesitant because I don't really
know if exposing this sort of data is actually a good idea long term.

thanks
-john

^ permalink raw reply

* Re: [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Andy Lutomirski @ 2014-11-04  0:58 UTC (permalink / raw)
  To: Pawel Moll
  Cc: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Linux API,
	Pawel Moll
In-Reply-To: <1415060918-19954-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

On Mon, Nov 3, 2014 at 4:28 PM, Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org> wrote:
> From: Pawel Moll <mail-g3xvULXeDMYS+FvcfC7Uqw@public.gmane.org>
> Thomas suggested solution which gets down to my original proposal for
> sched/monotonic clock correlation - an additional sample type so events
> can be "double stamped" using different clock sources providing
> synchronisation points for later time approximation. I've just extended
> the implementation with configuration value to select the clock source.
> If the first patch (making perf timestamps monotonic) gets accepted,
> there will be no immediate need for this one, but I'd like to gain some
> feedback anyway.
>

I have nothing intelligent to add to the potential Thomas/Ingo
showdown, but I do have a related thought. :)

If you're going to add double-stamped packets, can you also add a
syscall to read multiple clocks at once, atomically?  Or can you
otherwise add a non-perf mechanism to get at this data?

Because the realtime to monotonic offset is really quite useful for
things like this, and it seems silly to make people actually open a
perf_event to get at it.

--Andy

^ permalink raw reply

* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Aditya Kali @ 2014-11-04  0:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Ingo Molnar
In-Reply-To: <CALCETrXeG2t=fW9HbkirDZudw9pbDwoqDq5ygJBkBMbqqoDAvw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 3, 2014 at 4:17 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>>> -               if (nr_opts != 1) {
>>>>>>>> +               if (nr_opts > 1) {
>>>>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>>                         return -EINVAL;
>>>>>>>
>>>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>>>
>>>>>>
>>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>>> cgroupns does the right thing automatically.
>>>>>>
>>>>>
>>>>> This is a debatable point, but it's not what I meant.  Won't your code
>>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>>
>>>>
>>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>>> at least 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>>> here.
>>>
>>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>>
>>
>> Yes. Hence this change makes sure that we don't return EINVAL when
>> nr_opts == 0 or nr_opts == 1 :)
>> That way, both of the following are equivalent when inside non-init cgroupns:
>>
>> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
>> (2) $ mount -t cgroup cgroup mountpoint
>>
>> Any other mount option will trigger the error here.
>
> I still don't get it.  Can you walk me through why mount -o
> some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
>

Argh! You are right. I was totally convinced that this works. But it
clearly doesn't if you specify 1 legit mount option. I wanted to make
it work for both cases (1) and (2) above. But then this check will
have to be changed :(
Sorry about the back and forth. I am just going to make it return
EINVAL if __DEVEL__sane_behavior is not specified as suggested in the
beginning.
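
To make the failing case concrete, here is a toy model of the two checks
(a sketch only, not the kernel code). It relies on the point agreed above
that an implicitly set __DEVEL__sane_behavior is not counted in nr_opts;
the helper names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* nr_opts as the check sees it: an implicitly set flag is not counted. */
static int count_opts(bool flag_explicit, int other_opts)
{
	return (flag_explicit ? 1 : 0) + other_opts;
}

/* Existing check: the flag must be the one and only option. */
static int check_ne_1(int nr_opts)
{
	return nr_opts != 1 ? -EINVAL : 0;
}

/* Check from the quoted hunk: tolerate zero or one option. */
static int check_gt_1(int nr_opts)
{
	return nr_opts > 1 ? -EINVAL : 0;
}
```

With an implicit flag plus one unrelated option, count_opts(false, 1) is 1,
so check_gt_1() accepts the mount, which is the hole Andy pointed out.
Requiring the flag to be given explicitly keeps nr_opts honest, and then
check_ne_1() rejects count_opts(true, 1) as intended.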

> --Andy

-- 
Aditya

^ permalink raw reply

* [PATCH v3 3/3] perf: Sample additional clock value
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll, Pawel Moll
In-Reply-To: <1415060918-19954-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

From: Pawel Moll <mail-g3xvULXeDMYS+FvcfC7Uqw@public.gmane.org>

This patch adds an option to sample value of an additional clock with
any perf event, with the aim of allowing time correlation between
data coming from perf and other sources like hardware trace which is
timestamped with an external clock.

The idea is to generate periodic perf records containing timestamps from
two different sources, sampled as close to each other as possible. This
allows performing simple linear approximation:

        perf event    other event
       -----O--------------+-------------O------> t_other
            :              |             :
            :              V             :
       -----O----------------------------O------> t_perf

A user can request such samples for any standard perf event, with the most
obvious examples of cpu-clock (hrtimer) at selected frequency or other
periodic events like sched:sched_switch.

In order to do this, PERF_SAMPLE_CLOCK has to be set in struct
perf_event_attr.sample_type and a type of the clock to be sampled
selected by setting perf_event_attr.clock to a value corresponding to
a POSIX clock clk_id (see "man 2 clock_gettime").

Currently three clocks are implemented: CLOCK_REALTIME = 0,
CLOCK_MONOTONIC = 1 and CLOCK_MONOTONIC_RAW = 4. The clock field is
5 bits wide to allow for future extension to custom, non-POSIX clock
sources (MAX_CLOCKS for those is 16, see include/uapi/linux/time.h) like
ARM CoreSight (hardware trace) timestamp generator.

Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
---
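
For illustration only (not part of the patch): a sketch of the linear
approximation described above, as a post-processing tool might do it. The
struct and function names are hypothetical; each synchronisation point
stands for one sample carrying both the perf timestamp and the additional
clock value.

```c
#include <stdint.h>

/* One double-stamped sample: the same instant on both clocks. */
struct sync_point {
	uint64_t t_perf;	/* perf timestamp, ns */
	uint64_t t_other;	/* additional clock value, ns */
};

/*
 * Map a timestamp taken on the "other" clock onto the perf timeline by
 * linear interpolation between two synchronisation points a and b.
 * Requires a->t_other != b->t_other; t_other outside [a, b] is
 * extrapolated along the same line.
 */
static uint64_t other_to_perf(const struct sync_point *a,
			      const struct sync_point *b, uint64_t t_other)
{
	/* rate of the perf clock relative to the other clock */
	double slope = (double)(b->t_perf - a->t_perf) /
		       (double)(b->t_other - a->t_other);
	/* signed distance from the first sync point, kept in integers */
	int64_t delta = (int64_t)(t_other - a->t_other);

	return a->t_perf + (int64_t)(slope * (double)delta);
}
```

The closer together the two synchronisation points are, the less any drift
between the clocks affects the result, which is the reason for sampling
them periodically.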
 include/linux/perf_event.h      |  2 ++
 include/uapi/linux/perf_event.h | 16 ++++++++++++++--
 kernel/events/core.c            | 30 ++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 867415d..395d6ed 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,8 @@ struct perf_sample_data {
 	 * Transaction flags for abort events:
 	 */
 	u64				txn;
+	/* Clock value (additional timestamp for time correlation) */
+	u64				clock;
 };
 
 /* default value for data source */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9a64eb1..53a7a72 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -137,8 +137,9 @@ enum perf_event_sample_format {
 	PERF_SAMPLE_DATA_SRC			= 1U << 15,
 	PERF_SAMPLE_IDENTIFIER			= 1U << 16,
 	PERF_SAMPLE_TRANSACTION			= 1U << 17,
+	PERF_SAMPLE_CLOCK			= 1U << 18,
 
-	PERF_SAMPLE_MAX = 1U << 18,		/* non-ABI */
+	PERF_SAMPLE_MAX = 1U << 19,		/* non-ABI */
 };
 
 /*
@@ -304,7 +305,16 @@ struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				uevents        :  1, /* allow uevents into the buffer */
-				__reserved_1   : 38;
+
+				/*
+				 * clock: one of the POSIX clock IDs:
+				 *
+				 * 0 - CLOCK_REALTIME
+				 * 1 - CLOCK_MONOTONIC
+				 * 4 - CLOCK_MONOTONIC_RAW
+				 */
+				clock          :  5, /* clock type */
+				__reserved_1   : 33;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -544,6 +554,7 @@ enum perf_event_type {
 	 * 	{ u64			id;       } && PERF_SAMPLE_ID
 	 * 	{ u64			stream_id;} && PERF_SAMPLE_STREAM_ID
 	 * 	{ u32			cpu, res; } && PERF_SAMPLE_CPU
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 *	{ u64			id;	  } && PERF_SAMPLE_IDENTIFIER
 	 * } && perf_event_attr::sample_id_all
 	 *
@@ -687,6 +698,7 @@ enum perf_event_type {
 	 *	{ u64			weight;   } && PERF_SAMPLE_WEIGHT
 	 *	{ u64			data_src; } && PERF_SAMPLE_DATA_SRC
 	 *	{ u64			transaction; } && PERF_SAMPLE_TRANSACTION
+	 *	{ u64			clock;    } && PERF_SAMPLE_CLOCK
 	 * };
 	 */
 	PERF_RECORD_SAMPLE			= 9,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 3738e9c..611e2f7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1232,6 +1232,9 @@ static void perf_event__header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		size += sizeof(data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->header_size = size;
 }
 
@@ -1259,6 +1262,9 @@ static void perf_event__id_header_size(struct perf_event *event)
 	if (sample_type & PERF_SAMPLE_CPU)
 		size += sizeof(data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		size += sizeof(data->clock);
+
 	event->id_header_size = size;
 }
 
@@ -4599,6 +4605,24 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
 		data->cpu_entry.cpu	 = raw_smp_processor_id();
 		data->cpu_entry.reserved = 0;
 	}
+
+	if (sample_type & PERF_SAMPLE_CLOCK) {
+		switch (event->attr.clock) {
+		case CLOCK_REALTIME:
+			data->clock = ktime_get_real_ns();
+			break;
+		case CLOCK_MONOTONIC:
+			data->clock = ktime_get_mono_fast_ns();
+			break;
+		case CLOCK_MONOTONIC_RAW:
+			data->clock = ktime_get_raw_ns();
+			break;
+		default:
+			data->clock = 0;
+			break;
+		}
+	}
+
 }
 
 void perf_event_header__init_id(struct perf_event_header *header,
@@ -4629,6 +4653,9 @@ static void __perf_event__output_id_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_CPU)
 		perf_output_put(handle, data->cpu_entry);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (sample_type & PERF_SAMPLE_IDENTIFIER)
 		perf_output_put(handle, data->id);
 }
@@ -4857,6 +4884,9 @@ void perf_output_sample(struct perf_output_handle *handle,
 	if (sample_type & PERF_SAMPLE_TRANSACTION)
 		perf_output_put(handle, data->txn);
 
+	if (sample_type & PERF_SAMPLE_CLOCK)
+		perf_output_put(handle, data->clock);
+
 	if (!event->attr.watermark) {
 		int wakeup_events = event->attr.wakeup_events;
 
-- 
1.8.3.2


* [PATCH v3 2/3] perf: Userspace event
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel, linux-api, Pawel Moll, Pawel Moll
In-Reply-To: <1415060918-19954-1-git-send-email-pawel.moll@arm.com>

From: Pawel Moll <mail@pawelmoll.com>

This patch adds a PR_TASK_PERF_UEVENT prctl call which any process can
use to inject custom data into the perf data stream as a new
PERF_RECORD_UEVENT record, provided that the process is being observed
or is running on a CPU being observed by the perf framework.

The prctl call takes the following arguments:

	prctl(PR_TASK_PERF_UEVENT, type, size, data, flags);

- type: a number describing the content of the data that follows. The
  kernel does not interpret it and merely passes it on in the perf
  data, so its meaning must be agreed between the event producer (the
  process being observed) and the consumer (the performance analysis
  tool). The perf userspace tool will contain a repository of "well
  known" types and reference implementations of their decoders.
- size: Length in bytes of the data.
- data: Pointer to the data.
- flags: Reserved for future use. Always pass zero.

Perf contexts that are supposed to receive events generated with the
prctl above must be opened with perf_event_attr.uevents set to 1. The
PERF_RECORD_UEVENT records consist of a standard perf event header, a
32-bit type value, a 32-bit data size and the data itself, followed by
padding to align the overall record size to 8 bytes and an optional,
standard sample_id field.
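
The layout described above can be sketched as a small size calculation.
This is only an illustration: the helper names are made up for this
example, and the 8-byte header size assumes the usual perf_event_header
layout (u32 type, u16 misc, u16 size). The optional sample_id that may
follow the padding is not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of struct perf_event_header: u32 type, u16 misc, u16 size. */
#define UEVENT_HEADER_SIZE 8u

/* The header must keep the type/size/data fields 8-byte aligned. */
static_assert(UEVENT_HEADER_SIZE % 8 == 0, "header size must be a multiple of 8");

/* Padding bytes appended after the data: (-size & 7), as in the patch. */
static inline uint32_t uevent_padding(uint32_t size)
{
	return -size & 7u;
}

/*
 * Size of a PERF_RECORD_UEVENT record carrying `size` bytes of data:
 * header + 32-bit type + 32-bit size + data + padding.
 * Hypothetical helper; the trailing sample_id is excluded.
 */
static inline uint32_t uevent_record_size(uint32_t size)
{
	return (uint32_t)(UEVENT_HEADER_SIZE + 2 * sizeof(uint32_t) +
			  size + uevent_padding(size));
}
```

For the 8-byte "Message" payload this gives 24 bytes, which matches
sizeof(struct perf_uevent) + ALIGN(size, sizeof(u64)) in the kernel code.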

Example use cases:

- a "perf_printf"-like mechanism to add logging messages to perf data;
  in the simplest case it can be just

	prctl(PR_TASK_PERF_UEVENT, 0, 8, "Message", 0);

- synchronisation of performance data generated in user space with the
  perf stream coming from the kernel. For example, a marker can be
  inserted by a JIT engine after it has generated a portion of the
  code, but before the code is executed for the first time, allowing
  the post-processor to pick the correct debugging information.
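
A minimal sketch of the "perf_printf" use case from user space, assuming
a kernel with this patch applied. PR_TASK_PERF_UEVENT is defined locally
with the value proposed here (43), since released headers do not carry
it; perf_uevent_puts() is a hypothetical helper name. On a kernel
without the patch the call should be expected to fail, or the number may
mean something else entirely, so the error path matters.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

/* Value proposed by this patch; an assumption, not a released ABI. */
#ifndef PR_TASK_PERF_UEVENT
#define PR_TASK_PERF_UEVENT 43
#endif

/*
 * Hypothetical helper: inject a NUL-terminated string as a uevent of
 * the given type. Returns 0 on success or a negative errno value.
 */
static int perf_uevent_puts(unsigned long type, const char *msg)
{
	unsigned long size;

	assert(msg != NULL);

	/* Include the terminating NUL, as in the 8-byte "Message" example. */
	size = strlen(msg) + 1;

	/* flags (fifth argument) are reserved and must be zero. */
	if (prctl(PR_TASK_PERF_UEVENT, type, size, (unsigned long)msg, 0UL) == -1)
		return -errno;

	return 0;
}
```

A caller would use perf_uevent_puts(0, "Message") and treat a negative
return value as "not recorded" rather than as a fatal error.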

Signed-off-by: Pawel Moll <pawel.moll@arm.com>
---
 include/linux/perf_event.h      |  4 +++
 include/uapi/linux/perf_event.h | 23 ++++++++++++-
 include/uapi/linux/prctl.h      | 10 ++++++
 kernel/events/core.c            | 71 +++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c                    |  5 +++
 5 files changed, 112 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index ba490d5..867415d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -721,6 +721,8 @@ extern int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks
 extern void perf_event_exec(void);
 extern void perf_event_comm(struct task_struct *tsk, bool exec);
 extern void perf_event_fork(struct task_struct *tsk);
+extern int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		       const char __user *data);
 
 /* Callchains */
 DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry);
@@ -830,6 +832,8 @@ static inline void perf_event_mmap(struct vm_area_struct *vma)		{ }
 static inline void perf_event_exec(void)				{ }
 static inline void perf_event_comm(struct task_struct *tsk, bool exec)	{ }
 static inline void perf_event_fork(struct task_struct *tsk)		{ }
+static inline int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		              const char __user *data)			{ return -1; };
 static inline void perf_event_init(void)				{ }
 static inline int  perf_swevent_get_recursion_context(void)		{ return -1; }
 static inline void perf_swevent_put_recursion_context(int rctx)		{ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 9d84540..9a64eb1 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -303,7 +303,8 @@ struct perf_event_attr {
 				exclude_callchain_user   : 1, /* exclude user callchains */
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
-				__reserved_1   : 39;
+				uevents        :  1, /* allow uevents into the buffer */
+				__reserved_1   : 38;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -712,6 +713,26 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_MMAP2			= 10,
 
+	/*
+	 * Data in userspace event record is transparent for the kernel
+	 *
+	 * Userspace perf tool code maintains a list of known types with
+	 * reference implementations of parsers for the data field.
+	 *
+	 * Overall size of the record (including type and size fields)
+	 * is always aligned to 8 bytes by adding padding after the data.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				type;
+	 *	u32				size;
+	 *	char				data[size];
+	 *	char				__padding[-size & 7];
+	 * 	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_UEVENT			= 11,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75..2a6852f 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,14 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Perf userspace event generation
+ *
+ * second argument: event type
+ * third argument:  data size
+ * fourth argument: pointer to data
+ * fifth argument:  flags (currently unused, pass 0)
+ */
+#define PR_TASK_PERF_UEVENT	43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ea3d6d3..3738e9c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5565,6 +5565,77 @@ static void perf_log_throttle(struct perf_event *event, int enable)
 }
 
 /*
+ * Userspace-generated event
+ */
+
+struct perf_uevent {
+	struct perf_event_header	header;
+	u32				type;
+	u32				size;
+	u8				data[0];
+};
+
+static void perf_uevent_output(struct perf_event *event, void *data)
+{
+	struct perf_uevent *uevent = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	int size = uevent->header.size;
+
+	if (!event->attr.uevents)
+		return;
+
+	perf_event_header__init_id(&uevent->header, &sample, event);
+
+	if (perf_output_begin(&handle, event, uevent->header.size) != 0)
+		goto out;
+	perf_output_put(&handle, uevent->header);
+	perf_output_put(&handle, uevent->type);
+	perf_output_put(&handle, uevent->size);
+	__output_copy(&handle, uevent->data, uevent->size);
+
+	/* Padding to align overall data size to 8 bytes */
+	perf_output_skip(&handle, -uevent->size & (sizeof(u64) - 1));
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	uevent->header.size = size;
+}
+
+int perf_uevent(struct task_struct *tsk, u32 type, u32 size,
+		const char __user *data)
+{
+	struct perf_uevent *uevent;
+
+	/* Need some reasonable limit */
+	if (size > PAGE_SIZE)
+		return -E2BIG;
+
+	uevent = kmalloc(sizeof(*uevent) + size, GFP_KERNEL);
+	if (!uevent)
+		return -ENOMEM;
+
+	uevent->header.type = PERF_RECORD_UEVENT;
+	uevent->header.size = sizeof(*uevent) + ALIGN(size, sizeof(u64));
+
+	uevent->type = type;
+	uevent->size = size;
+	if (copy_from_user(uevent->data, data, size)) {
+		kfree(uevent);
+		return -EFAULT;
+	}
+
+	perf_event_aux(perf_uevent_output, uevent, NULL);
+
+	kfree(uevent);
+
+	return 0;
+}
+
+
+/*
  * Generic event overflow handling, sampling.
  */
 
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0..1c83677 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2121,6 +2121,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_TASK_PERF_EVENTS_ENABLE:
 		error = perf_event_task_enable();
 		break;
+	case PR_TASK_PERF_UEVENT:
+		if (arg5 != 0)
+			return -EINVAL;
+		error = perf_uevent(me, arg2, arg3, (char __user *)arg4);
+		break;
 	case PR_GET_TIMERSLACK:
 		error = current->timer_slack_ns;
 		break;
-- 
1.8.3.2


* [PATCH v3 1/3] perf: Use monotonic clock as a source for timestamps
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll
In-Reply-To: <1415060918-19954-1-git-send-email-pawel.moll-5wv7dgnIgG8@public.gmane.org>

Until now, the perf framework never defined the meaning of the
timestamps captured as the PERF_SAMPLE_TIME sample type. The values
were obtained from the local (sched) clock, which is unavailable in
userspace. This made it impossible to correlate perf data with any
other events. Other tracing solutions have the source configurable
(ftrace) or just share a common time domain between kernel and
userspace (LTTng).

Follow the trend by using the monotonic clock, which is readily
available as POSIX CLOCK_MONOTONIC.

Also add a read-only sysctl "perf_sample_time_clk_id" attribute which
the user can read to obtain the clk_id to be used with the POSIX clock
API (e.g. clock_gettime()) to obtain a time value comparable with perf
samples.
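
A rough sketch of how user space might consume the sysctl, assuming it
shows up as /proc/sys/kernel/perf_sample_time_clk_id (it is added to
kern_table). The function names are invented for this example, and the
code falls back to CLOCK_MONOTONIC when the file is missing, as it will
be on a kernel without this patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Assumed location of the sysctl added by this patch. */
#define PERF_CLK_ID_SYSCTL "/proc/sys/kernel/perf_sample_time_clk_id"

/* Read the clk_id used for perf timestamps; default to CLOCK_MONOTONIC. */
static clockid_t perf_sample_clk_id(void)
{
	int id = CLOCK_MONOTONIC;
	FILE *f = fopen(PERF_CLK_ID_SYSCTL, "r");

	if (f) {
		if (fscanf(f, "%d", &id) != 1)
			id = CLOCK_MONOTONIC;
		fclose(f);
	}

	return (clockid_t)id;
}

/* Current time in nanoseconds on the perf sample clock, or 0 on error. */
static uint64_t perf_comparable_now_ns(void)
{
	struct timespec ts;

	if (clock_gettime(perf_sample_clk_id(), &ts) != 0)
		return 0;

	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
```

A value obtained this way can be logged next to application events and
compared directly with PERF_SAMPLE_TIME values in the captured data.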

Signed-off-by: Pawel Moll <pawel.moll-5wv7dgnIgG8@public.gmane.org>
---

Ingo, I remember your comments about this approach in the past, but
during discussions at LPC Thomas was convinced that it's the right
thing to do - see cover letter for the series...

 include/linux/perf_event.h | 1 +
 kernel/events/core.c       | 4 +++-
 kernel/sysctl.c            | 7 +++++++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..ba490d5 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -738,6 +738,7 @@ extern int sysctl_perf_event_paranoid;
 extern int sysctl_perf_event_mlock;
 extern int sysctl_perf_event_sample_rate;
 extern int sysctl_perf_cpu_time_max_percent;
+extern int sysctl_perf_sample_time_clk_id;
 
 extern void perf_sample_event_took(u64 sample_len_ns);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..ea3d6d3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -234,6 +234,8 @@ int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int sysctl_perf_sample_time_clk_id = CLOCK_MONOTONIC;
+
 /*
  * perf samples are done in some very critical code paths (NMIs).
  * If they take too much CPU time, the system can lock up and not
@@ -324,7 +326,7 @@ extern __weak const char *perf_pmu_name(void)
 
 static inline u64 perf_clock(void)
 {
-	return local_clock();
+	return ktime_get_mono_fast_ns();
 }
 
 static inline struct perf_cpu_context *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 15f2511..cb75f5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1094,6 +1094,13 @@ static struct ctl_table kern_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+	{
+		.procname	= "perf_sample_time_clk_id",
+		.data		= &sysctl_perf_sample_time_clk_id,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec,
+	},
 #endif
 #ifdef CONFIG_KMEMCHECK
 	{
-- 
1.8.3.2


* [PATCH v3 0/3] perf: User/kernel time correlation and event generation
From: Pawel Moll @ 2014-11-04  0:28 UTC (permalink / raw)
  To: Richard Cochran, Steven Rostedt, Ingo Molnar, Peter Zijlstra,
	Paul Mackerras, Arnaldo Carvalho de Melo, John Stultz,
	Masami Hiramatsu, Christopher Covington, Namhyung Kim,
	David Ahern, Thomas Gleixner, Tomeu Vizoso
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Pawel Moll

From: Pawel Moll <mail-g3xvULXeDMYS+FvcfC7Uqw@public.gmane.org>

Hello again,

Back to the subject, this time with a slightly different angle...

I've organised a session on the subject during the tracing
minisummit at LPC 2014 in Dusseldorf. Notes from the discussion were
taken by Steven Rostedt (thanks Steve!):

http://www.linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_Tracing.txt

The following three patches address three main topics. They are pretty
much orthogonal and (subject to small and obvious modifications) could
be applied independently from each other.

An executive summary of both discussion and the patches:

1. User/kernel perf timestamps correlation

Thomas suggested that, instead of jumping through hoops of correlation,
perf should simply generate monotonic clock timestamps instead of
using sched clock. Peter and I pointed out that Ingo didn't like this
idea as monotonic can be slow, but apparently the cases when it is are
irrelevant. Thomas offered to fly to Budapest to physically convince
Ingo - I hope it won't be necessary and he will be able to achieve
this here, on the mailing lists :-)

Setting aside potential performance problems, it would be a really
great solution, unifying all trace systems (perf, ftrace and LTTng)
in this respect. I'm more than happy to work on potential improvements
in the area of the monotonic clock if that would help the cause.

If it is a definite no-go, we still have the third patch, allowing post-
capture correlation based on synchronisation events.

2. User event generation

Everyone present agreed that it would be a very-nice-to-have feature.
There was some discussion about implementation details, so I welcome
feedback and comments regarding my take on the matter.

3. Correlation with external timestamps

This is a new issue, which surfaced recently while I was working on
hardware trace infrastructure. It can timestamp trace packets, but is
using yet another, completely different time source to do this.

Thomas suggested a solution which comes down to my original proposal
for sched/monotonic clock correlation - an additional sample type so
events can be "double stamped" using different clock sources, providing
synchronisation points for later time approximation. I've just extended
the implementation with a configuration value to select the clock
source. If the first patch (making perf timestamps monotonic) gets
accepted, there will be no immediate need for this one, but I'd like to
gain some feedback anyway.
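
To illustrate what "later time approximation" could look like in a
post-processing tool, here is a sketch that maps a value from the
external clock onto the perf time line by linear interpolation between
two double-stamped samples. The struct and function names are
hypothetical, and the linear model is an assumption: it only holds if
the two clocks drift slowly between synchronisation points.

```c
#include <assert.h>
#include <stdint.h>

/* One double-stamped sample: the same instant seen on both clocks. */
struct sync_point {
	uint64_t perf_time;	/* PERF_SAMPLE_TIME value */
	uint64_t ext_clock;	/* PERF_SAMPLE_CLOCK value */
};

/*
 * Convert an external clock value to perf time by interpolating
 * between sync points a and b, where a precedes b and ext lies
 * between their ext_clock values.
 */
static uint64_t ext_to_perf_time(struct sync_point a, struct sync_point b,
				 uint64_t ext)
{
	long double frac;

	assert(a.ext_clock < b.ext_clock && a.perf_time <= b.perf_time);
	assert(ext >= a.ext_clock && ext <= b.ext_clock);

	/* Position of ext within [a.ext_clock, b.ext_clock], from 0 to 1. */
	frac = (long double)(ext - a.ext_clock) /
	       (long double)(b.ext_clock - a.ext_clock);

	/* Same relative position within [a.perf_time, b.perf_time], rounded. */
	return a.perf_time +
	       (uint64_t)(frac * (long double)(b.perf_time - a.perf_time) + 0.5L);
}
```

The denser the synchronisation points, the smaller the error introduced
by the linear assumption.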


That's all for this series. Previous versions:

- RFC: http://www.spinics.net/lists/kernel/msg1824419.html
- v1: http://thread.gmane.org/gmane.linux.kernel/1790231
- v2: http://thread.gmane.org/gmane.linux.kernel/1793272

Pawel Moll (3):
  perf: Use monotonic clock as a source for timestamps
  perf: Userspace event
  perf: Sample additional clock value

 include/linux/perf_event.h      |   7 +++
 include/uapi/linux/perf_event.h |  37 +++++++++++++-
 include/uapi/linux/prctl.h      |  10 ++++
 kernel/events/core.c            | 105 +++++++++++++++++++++++++++++++++++++++-
 kernel/sys.c                    |   5 ++
 kernel/sysctl.c                 |   7 +++
 6 files changed, 168 insertions(+), 3 deletions(-)

-- 
1.8.3.2


* Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns
From: Andy Lutomirski @ 2014-11-04  0:17 UTC (permalink / raw)
  To: Aditya Kali
  Cc: Linux API, Linux Containers, Serge Hallyn,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Eric W. Biederman, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Ingo Molnar
In-Reply-To: <CAGr1F2EV4p_nJP_oMe3N8pBPedAZHbdB=XCMPjSEZTC9jmZoAg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <luto-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> wrote:
>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <adityakali-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>>         if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>                 pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>> -               if (nr_opts != 1) {
>>>>>>> +               if (nr_opts > 1) {
>>>>>>>                         pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>                         return -EINVAL;
>>>>>>
>>>>>> This looks wrong.  But, if you make the change above, then it'll be right.
>>>>>>
>>>>>
>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>> cgroupns does the right thing automatically.
>>>>>
>>>>
>>>> This is a debatable point, but it's not what I meant.  Won't your code
>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>
>>>
>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>> here.
>>
>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>
>
> Yes. Hence this change makes sure that we don't return EINVAL when
> nr_opts == 0 or nr_opts == 1 :)
> That way, both of the following are equivalent when inside non-init cgroupns:
>
> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
> (2) $ mount -t cgroup cgroup mountpoint
>
> Any other mount option will trigger the error here.

I still don't get it.  Can you walk me through why mount -o
some_other_option -t cgroup cgroup mountpoint causes -EINVAL?

--Andy


