* [PATCH v2 net-next 09/10] samples/bpf: tracepoint example
From: Alexei Starovoitov @ 2016-04-07 1:43 UTC
To: Steven Rostedt
Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
netdev, linux-kernel, kernel-team
In-Reply-To: <1459993411-2754735-1-git-send-email-ast@fb.com>
Modify offwaketime to use the sched/sched_switch tracepoint
instead of a kprobe on finish_task_switch.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
samples/bpf/offwaketime_kern.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c
index c0aa5a9b9c48..983629a31c79 100644
--- a/samples/bpf/offwaketime_kern.c
+++ b/samples/bpf/offwaketime_kern.c
@@ -73,7 +73,7 @@ int waker(struct pt_regs *ctx)
return 0;
}
-static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
+static inline int update_counts(void *ctx, u32 pid, u64 delta)
{
struct key_t key = {};
struct wokeby_t *woke;
@@ -100,15 +100,33 @@ static inline int update_counts(struct pt_regs *ctx, u32 pid, u64 delta)
return 0;
}
+#if 1
+/* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
+struct sched_switch_args {
+ unsigned long long pad;
+ char prev_comm[16];
+ int prev_pid;
+ int prev_prio;
+ long long prev_state;
+ char next_comm[16];
+ int next_pid;
+ int next_prio;
+};
+SEC("tracepoint/sched/sched_switch")
+int oncpu(struct sched_switch_args *ctx)
+{
+ /* record previous thread sleep time */
+ u32 pid = ctx->prev_pid;
+#else
SEC("kprobe/finish_task_switch")
int oncpu(struct pt_regs *ctx)
{
struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
+ /* record previous thread sleep time */
+ u32 pid = _(p->pid);
+#endif
u64 delta, ts, *tsp;
- u32 pid;
- /* record previous thread sleep time */
- pid = _(p->pid);
ts = bpf_ktime_get_ns();
bpf_map_update_elem(&start, &pid, &ts, BPF_ANY);
--
2.8.0
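The struct in this sample has to mirror the byte layout listed in the tracepoint's format file. A compile-time check such as the following sketch catches a mismatch early. It assumes the offsets a 64-bit kernel reports for sched_switch (prev_comm at 8, prev_pid at 24, prev_state at 32 with size 8, next_comm at 40, next_pid at 56, next_prio at 60); re-read the format file on the target kernel before trusting these numbers. The helper `sched_switch_fields_end()` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the sample's struct sched_switch_args. */
struct sched_switch_args {
	unsigned long long pad;	/* covers the 8 bytes of common_* fields */
	char prev_comm[16];
	int prev_pid;
	int prev_prio;
	long long prev_state;
	char next_comm[16];
	int next_pid;
	int next_prio;
};

/* Assumed offsets from a 64-bit sched_switch format file; verify locally. */
_Static_assert(offsetof(struct sched_switch_args, prev_comm) == 8, "prev_comm");
_Static_assert(offsetof(struct sched_switch_args, prev_pid) == 24, "prev_pid");
_Static_assert(offsetof(struct sched_switch_args, prev_prio) == 28, "prev_prio");
_Static_assert(offsetof(struct sched_switch_args, prev_state) == 32, "prev_state");
_Static_assert(offsetof(struct sched_switch_args, next_comm) == 40, "next_comm");
_Static_assert(offsetof(struct sched_switch_args, next_pid) == 56, "next_pid");
_Static_assert(offsetof(struct sched_switch_args, next_prio) == 60, "next_prio");

/* Hypothetical helper: one past the last byte a program may read. */
static size_t sched_switch_fields_end(void)
{
	return offsetof(struct sched_switch_args, next_prio) +
	       sizeof(((struct sched_switch_args *)0)->next_prio);
}
```

If any offset drifts (for example because prev_state changes size), the build fails instead of the program silently reading the wrong bytes.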
* [PATCH v2 net-next 07/10] bpf: sanitize bpf tracepoint access
From: Alexei Starovoitov @ 2016-04-07 1:43 UTC
To: Steven Rostedt
Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
netdev, linux-kernel, kernel-team
In-Reply-To: <1459993411-2754735-1-git-send-email-ast@fb.com>
During bpf program loading, remember the last byte of ctx accessed, and
when attaching the program to a tracepoint, check that the program
doesn't access bytes beyond those defined by the tracepoint fields.
This also disallows access to __dynamic_array fields, but that can be
relaxed in the future.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/bpf.h | 1 +
include/linux/trace_events.h | 1 +
kernel/bpf/verifier.c | 6 +++++-
kernel/events/core.c | 8 ++++++++
kernel/trace/trace_events.c | 18 ++++++++++++++++++
5 files changed, 33 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 198f6ace70ec..b2365a6eba3d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -131,6 +131,7 @@ struct bpf_prog_type_list {
struct bpf_prog_aux {
atomic_t refcnt;
u32 used_map_cnt;
+ u32 max_ctx_offset;
const struct bpf_verifier_ops *ops;
struct bpf_map **used_maps;
struct bpf_prog *prog;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 56f795e6a093..fe6441203b59 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -569,6 +569,7 @@ extern int trace_define_field(struct trace_event_call *call, const char *type,
int is_signed, int filter_type);
extern int trace_add_event_call(struct trace_event_call *call);
extern int trace_remove_event_call(struct trace_event_call *call);
+extern int trace_event_get_offsets(struct trace_event_call *call);
#define is_signed_type(type) (((type)(-1)) < (type)1)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e9b771..58792fed5678 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -652,8 +652,12 @@ static int check_ctx_access(struct verifier_env *env, int off, int size,
enum bpf_access_type t)
{
if (env->prog->aux->ops->is_valid_access &&
- env->prog->aux->ops->is_valid_access(off, size, t))
+ env->prog->aux->ops->is_valid_access(off, size, t)) {
+ /* remember the offset of last byte accessed in ctx */
+ if (env->prog->aux->max_ctx_offset < off + size)
+ env->prog->aux->max_ctx_offset = off + size;
return 0;
+ }
verbose("invalid bpf_context access off=%d size=%d\n", off, size);
return -EACCES;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e5ffe97d6166..9a01019ff7c8 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7133,6 +7133,14 @@ static int perf_event_set_bpf_prog(struct perf_event *event, u32 prog_fd)
return -EINVAL;
}
+ if (is_tracepoint) {
+ int off = trace_event_get_offsets(event->tp_event);
+
+ if (prog->aux->max_ctx_offset > off) {
+ bpf_prog_put(prog);
+ return -EACCES;
+ }
+ }
event->tp_event->prog = prog;
return 0;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 05ddc0820771..ced963049e0a 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -204,6 +204,24 @@ static void trace_destroy_fields(struct trace_event_call *call)
}
}
+/*
+ * run-time version of trace_event_get_offsets_<call>() that returns the last
+ * accessible offset of trace fields excluding __dynamic_array bytes
+ */
+int trace_event_get_offsets(struct trace_event_call *call)
+{
+ struct ftrace_event_field *tail;
+ struct list_head *head;
+
+ head = trace_get_fields(call);
+ /*
+ * head->next points to the last field with the largest offset,
+ * since it was added last by trace_define_field()
+ */
+ tail = list_first_entry(head, struct ftrace_event_field, link);
+ return tail->offset + tail->size;
+}
+
int trace_event_raw_init(struct trace_event_call *call)
{
int id;
--
2.8.0
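The check in this patch reduces to two numbers: the highest `off + size` the verifier saw, and the end of the last tracepoint field. A userspace model of that comparison follows. It is only a sketch: `struct tp_field`, `struct prog_aux_model`, `record_ctx_access()`, `fields_end()` and `check_attach()` are hypothetical names that mirror, but are not, the kernel code, and the sched_switch offsets are the assumed 64-bit ones. Like trace_event_get_offsets(), it assumes the last-defined field has the largest offset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct tp_field { int offset; int size; };
struct prog_aux_model { int max_ctx_offset; };

/* Verifier side: remember the end of the furthest ctx access. */
static void record_ctx_access(struct prog_aux_model *aux, int off, int size)
{
	if (aux->max_ctx_offset < off + size)
		aux->max_ctx_offset = off + size;
}

/* Like trace_event_get_offsets(): end of the last field (definition order). */
static int fields_end(const struct tp_field *fields, size_t n)
{
	const struct tp_field *last = &fields[n - 1];

	return last->offset + last->size;
}

/* Attach side: refuse programs that read past the defined fields. */
static int check_attach(const struct prog_aux_model *aux, int end)
{
	return aux->max_ctx_offset > end ? -EACCES : 0;
}

/* sched_switch fields (assumed 64-bit offsets), excluding the common_* header. */
static const struct tp_field sched_switch_fields[] = {
	{ 8, 16 }, { 24, 4 }, { 28, 4 }, { 32, 8 },
	{ 40, 16 }, { 56, 4 }, { 60, 4 },
};
```

Under this model a program that reads `prev_pid` (offset 24, size 4) attaches fine, while one that reads 4 bytes at offset 64 is rejected with -EACCES.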
* [PATCH v2 net-next 05/10] bpf: register BPF_PROG_TYPE_TRACEPOINT program type
From: Alexei Starovoitov @ 2016-04-07 1:43 UTC
To: Steven Rostedt
Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
netdev, linux-kernel, kernel-team
In-Reply-To: <1459993411-2754735-1-git-send-email-ast@fb.com>
Register the tracepoint bpf program type and let it call the same set
of helper functions as BPF_PROG_TYPE_KPROBE, except for
bpf_perf_event_output() and bpf_get_stackid().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
kernel/trace/bpf_trace.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e4ffb3ace5f..3e5ebe3254d2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -268,7 +268,7 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = {
.arg5_type = ARG_CONST_STACK_SIZE,
};
-static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+static const struct bpf_func_proto *tracing_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
case BPF_FUNC_map_lookup_elem:
@@ -295,12 +295,20 @@ static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_perf_event_read:
return &bpf_perf_event_read_proto;
+ default:
+ return NULL;
+ }
+}
+
+static const struct bpf_func_proto *kprobe_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
case BPF_FUNC_perf_event_output:
return &bpf_perf_event_output_proto;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto;
default:
- return NULL;
+ return tracing_func_proto(func_id);
}
}
@@ -332,9 +340,42 @@ static struct bpf_prog_type_list kprobe_tl = {
.type = BPF_PROG_TYPE_KPROBE,
};
+static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
+{
+ switch (func_id) {
+ case BPF_FUNC_perf_event_output:
+ case BPF_FUNC_get_stackid:
+ return NULL;
+ default:
+ return tracing_func_proto(func_id);
+ }
+}
+
+static bool tp_prog_is_valid_access(int off, int size, enum bpf_access_type type)
+{
+ if (off < sizeof(void *) || off >= PERF_MAX_TRACE_SIZE)
+ return false;
+ if (type != BPF_READ)
+ return false;
+ if (off % size != 0)
+ return false;
+ return true;
+}
+
+static const struct bpf_verifier_ops tracepoint_prog_ops = {
+ .get_func_proto = tp_prog_func_proto,
+ .is_valid_access = tp_prog_is_valid_access,
+};
+
+static struct bpf_prog_type_list tracepoint_tl = {
+ .ops = &tracepoint_prog_ops,
+ .type = BPF_PROG_TYPE_TRACEPOINT,
+};
+
static int __init register_kprobe_prog_ops(void)
{
bpf_register_prog_type(&kprobe_tl);
+ bpf_register_prog_type(&tracepoint_tl);
return 0;
}
late_initcall(register_kprobe_prog_ops);
--
2.8.0
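tp_prog_is_valid_access() applies three rules: the offset must lie in [sizeof(void *), PERF_MAX_TRACE_SIZE), the access must be a read, and it must be naturally aligned. The same rules, lifted into a standalone function for experimentation, look like the sketch below. The value 2048 for PERF_MAX_TRACE_SIZE and the `access_model` enum are assumptions for illustration, not taken from this patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PERF_MAX_TRACE_SIZE 2048	/* assumed value */

enum access_model { ACCESS_READ, ACCESS_WRITE };

static bool tp_access_ok(int off, int size, enum access_model type)
{
	if (off < (int)sizeof(void *) || off >= PERF_MAX_TRACE_SIZE)
		return false;	/* below the first word or past the trace buffer */
	if (type != ACCESS_READ)
		return false;	/* tracepoint ctx is read-only */
	if (off % size != 0)
		return false;	/* misaligned access */
	return true;
}
```

Only the start offset is bounded here; patch 07 adds the tighter per-tracepoint check against the end of the last field.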
* [PATCH v2 net-next 03/10] perf: split perf_trace_buf_prepare into alloc and update parts
From: Alexei Starovoitov @ 2016-04-07 1:43 UTC
To: Steven Rostedt
Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
netdev, linux-kernel, kernel-team
In-Reply-To: <1459993411-2754735-1-git-send-email-ast@fb.com>
The split allows moving the expensive update of 'struct trace_entry' to a later phase.
Repurpose the unused 1st argument of perf_tp_event() to indicate the event type.
While splitting, use the temp variable 'rctx' instead of '*rctx' to avoid
unnecessary loads done by the compiler due to -fno-strict-aliasing.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/perf_event.h | 2 +-
include/linux/trace_events.h | 8 ++++----
include/trace/perf.h | 8 ++++----
kernel/events/core.c | 6 ++++--
kernel/trace/trace_event_perf.c | 39 ++++++++++++++++++++-------------------
kernel/trace/trace_kprobe.c | 10 ++++++----
kernel/trace/trace_syscalls.c | 13 +++++++------
kernel/trace/trace_uprobe.c | 5 +++--
8 files changed, 49 insertions(+), 42 deletions(-)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e89f7199c223..eb41b535ef38 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1016,7 +1016,7 @@ static inline bool perf_paranoid_kernel(void)
}
extern void perf_event_init(void);
-extern void perf_tp_event(u64 addr, u64 count, void *record,
+extern void perf_tp_event(u16 event_type, u64 count, void *record,
int entry_size, struct pt_regs *regs,
struct hlist_head *head, int rctx,
struct task_struct *task);
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 0810f81b6db2..56f795e6a093 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -605,15 +605,15 @@ extern void perf_trace_del(struct perf_event *event, int flags);
extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
char *filter_str);
extern void ftrace_profile_free_filter(struct perf_event *event);
-extern void *perf_trace_buf_prepare(int size, unsigned short type,
- struct pt_regs **regs, int *rctxp);
+void perf_trace_buf_update(void *record, u16 type);
+void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp);
static inline void
-perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
+perf_trace_buf_submit(void *raw_data, int size, int rctx, u16 type,
u64 count, struct pt_regs *regs, void *head,
struct task_struct *task)
{
- perf_tp_event(addr, count, raw_data, size, regs, head, rctx, task);
+ perf_tp_event(type, count, raw_data, size, regs, head, rctx, task);
}
#endif
diff --git a/include/trace/perf.h b/include/trace/perf.h
index 6f7e37869065..77cd9043b7e4 100644
--- a/include/trace/perf.h
+++ b/include/trace/perf.h
@@ -53,8 +53,7 @@ perf_trace_##call(void *__data, proto) \
sizeof(u64)); \
__entry_size -= sizeof(u32); \
\
- entry = perf_trace_buf_prepare(__entry_size, \
- event_call->event.type, &__regs, &rctx); \
+ entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx); \
if (!entry) \
return; \
\
@@ -64,8 +63,9 @@ perf_trace_##call(void *__data, proto) \
\
{ assign; } \
\
- perf_trace_buf_submit(entry, __entry_size, rctx, 0, \
- __count, __regs, head, __task); \
+ perf_trace_buf_submit(entry, __entry_size, rctx, \
+ event_call->event.type, __count, __regs, \
+ head, __task); \
}
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de24fbce5277..d8512883c0a0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6987,7 +6987,7 @@ static int perf_tp_event_match(struct perf_event *event,
return 1;
}
-void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
+void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
struct pt_regs *regs, struct hlist_head *head, int rctx,
struct task_struct *task)
{
@@ -6999,9 +6999,11 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
.data = record,
};
- perf_sample_data_init(&data, addr, 0);
+ perf_sample_data_init(&data, 0, 0);
data.raw = &raw;
+ perf_trace_buf_update(record, event_type);
+
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_tp_event_match(event, &data, regs))
perf_swevent_event(event, count, &data, regs);
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 7a68afca8249..5a927075977f 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -260,42 +260,43 @@ void perf_trace_del(struct perf_event *p_event, int flags)
tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event);
}
-void *perf_trace_buf_prepare(int size, unsigned short type,
- struct pt_regs **regs, int *rctxp)
+void *perf_trace_buf_alloc(int size, struct pt_regs **regs, int *rctxp)
{
- struct trace_entry *entry;
- unsigned long flags;
char *raw_data;
- int pc;
+ int rctx;
BUILD_BUG_ON(PERF_MAX_TRACE_SIZE % sizeof(unsigned long));
if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
- "perf buffer not large enough"))
+ "perf buffer not large enough"))
return NULL;
- pc = preempt_count();
-
- *rctxp = perf_swevent_get_recursion_context();
- if (*rctxp < 0)
+ *rctxp = rctx = perf_swevent_get_recursion_context();
+ if (rctx < 0)
return NULL;
if (regs)
- *regs = this_cpu_ptr(&__perf_regs[*rctxp]);
- raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
+ *regs = this_cpu_ptr(&__perf_regs[rctx]);
+ raw_data = this_cpu_ptr(perf_trace_buf[rctx]);
/* zero the dead bytes from align to not leak stack to user */
memset(&raw_data[size - sizeof(u64)], 0, sizeof(u64));
+ return raw_data;
+}
+EXPORT_SYMBOL_GPL(perf_trace_buf_alloc);
+NOKPROBE_SYMBOL(perf_trace_buf_alloc);
+
+void perf_trace_buf_update(void *record, u16 type)
+{
+ struct trace_entry *entry = record;
+ int pc = preempt_count();
+ unsigned long flags;
- entry = (struct trace_entry *)raw_data;
local_save_flags(flags);
tracing_generic_entry_update(entry, flags, pc);
entry->type = type;
-
- return raw_data;
}
-EXPORT_SYMBOL_GPL(perf_trace_buf_prepare);
-NOKPROBE_SYMBOL(perf_trace_buf_prepare);
+NOKPROBE_SYMBOL(perf_trace_buf_update);
#ifdef CONFIG_FUNCTION_TRACER
static void
@@ -319,13 +320,13 @@ perf_ftrace_function_call(unsigned long ip, unsigned long parent_ip,
memset(&regs, 0, sizeof(regs));
perf_fetch_caller_regs(&regs);
- entry = perf_trace_buf_prepare(ENTRY_SIZE, TRACE_FN, NULL, &rctx);
+ entry = perf_trace_buf_alloc(ENTRY_SIZE, NULL, &rctx);
if (!entry)
return;
entry->ip = ip;
entry->parent_ip = parent_ip;
- perf_trace_buf_submit(entry, ENTRY_SIZE, rctx, 0,
+ perf_trace_buf_submit(entry, ENTRY_SIZE, rctx, TRACE_FN,
1, &regs, head, NULL);
#undef ENTRY_SIZE
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 919e0ddd8fcc..5546eec0505f 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1149,14 +1149,15 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_alloc(size, NULL, &rctx);
if (!entry)
return;
entry->ip = (unsigned long)tk->rp.kp.addr;
memset(&entry[1], 0, dsize);
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
- perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+ perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
+ head, NULL);
}
NOKPROBE_SYMBOL(kprobe_perf_func);
@@ -1184,14 +1185,15 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
size = ALIGN(__size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_alloc(size, NULL, &rctx);
if (!entry)
return;
entry->func = (unsigned long)tk->rp.kp.addr;
entry->ret_ip = (unsigned long)ri->ret_addr;
store_trace_args(sizeof(*entry), &tk->tp, regs, (u8 *)&entry[1], dsize);
- perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+ perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
+ head, NULL);
}
NOKPROBE_SYMBOL(kretprobe_perf_func);
#endif /* CONFIG_PERF_EVENTS */
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index e78f364cc192..b2b6efc083a4 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -587,15 +587,16 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
size = ALIGN(size + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
- rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size,
- sys_data->enter_event->event.type, NULL, &rctx);
+ rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
return;
rec->nr = syscall_nr;
syscall_get_arguments(current, regs, 0, sys_data->nb_args,
(unsigned long *)&rec->args);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
+ perf_trace_buf_submit(rec, size, rctx,
+ sys_data->enter_event->event.type, 1, regs,
+ head, NULL);
}
static int perf_sysenter_enable(struct trace_event_call *call)
@@ -660,14 +661,14 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
size -= sizeof(u32);
- rec = (struct syscall_trace_exit *)perf_trace_buf_prepare(size,
- sys_data->exit_event->event.type, NULL, &rctx);
+ rec = perf_trace_buf_alloc(size, NULL, &rctx);
if (!rec)
return;
rec->nr = syscall_nr;
rec->ret = syscall_get_return_value(current, regs);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head, NULL);
+ perf_trace_buf_submit(rec, size, rctx, sys_data->exit_event->event.type,
+ 1, regs, head, NULL);
}
static int perf_sysexit_enable(struct trace_event_call *call)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 7915142c89e4..c53485441c88 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1131,7 +1131,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
if (hlist_empty(head))
goto out;
- entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
+ entry = perf_trace_buf_alloc(size, NULL, &rctx);
if (!entry)
goto out;
@@ -1152,7 +1152,8 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
memset(data + len, 0, size - esize - len);
}
- perf_trace_buf_submit(entry, size, rctx, 0, 1, regs, head, NULL);
+ perf_trace_buf_submit(entry, size, rctx, call->event.type, 1, regs,
+ head, NULL);
out:
preempt_enable();
}
--
2.8.0
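The kprobe and syscall callers above size their records as `ALIGN(__size + sizeof(u32), sizeof(u64)) - sizeof(u32)`, presumably so that the record plus the u32 size word perf stores in front of raw sample data stays u64 aligned. The sketch below works that arithmetic through and models the alloc/update split with a heap buffer standing in for the per-cpu one. `perf_entry_size()`, `buf_alloc_model()`, `buf_update_model()` and `struct trace_entry_model` are hypothetical userspace stand-ins, not the kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) / (a) * (a))
#define MAX_TRACE_SIZE 2048	/* assumed, like PERF_MAX_TRACE_SIZE */

/* Laid out like struct trace_entry: the common header of every record. */
struct trace_entry_model {
	unsigned short type;
	unsigned char flags;
	unsigned char preempt_count;
	int pid;
};

static char *trace_buf;	/* stands in for the per-cpu perf_trace_buf */

/* Record size: payload plus a u32 header rounded up to u64, minus the header. */
static int perf_entry_size(int payload)
{
	int size = ALIGN_UP(payload + (int)sizeof(uint32_t), (int)sizeof(uint64_t));

	return size - (int)sizeof(uint32_t);
}

/* "alloc" step: hand out the buffer and zero its last u64 so padding never leaks. */
static void *buf_alloc_model(int size)
{
	if (size < (int)sizeof(uint64_t) || size > MAX_TRACE_SIZE)
		return NULL;
	if (!trace_buf && !(trace_buf = malloc(MAX_TRACE_SIZE)))
		return NULL;
	memset(&trace_buf[size - sizeof(uint64_t)], 0, sizeof(uint64_t));
	return trace_buf;
}

/* "update" step: fill the header only when a perf consumer will read it. */
static void buf_update_model(void *record, unsigned short type, int pid)
{
	struct trace_entry_model *entry = record;

	entry->type = type;
	entry->flags = 0;		/* the kernel fills flags and preempt_count */
	entry->preempt_count = 0;	/* via tracing_generic_entry_update() */
	entry->pid = pid;
}
```

For example, a 21-byte payload becomes a 28-byte record (25 rounded up to 32, minus 4). Because the header fill lives in its own step, a consumer that never looks at the header can skip it, which is the point of the split.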
* [PATCH v2 net-next 00/10] allow bpf attach to tracepoints
From: Alexei Starovoitov @ 2016-04-07 1:43 UTC
To: Steven Rostedt
Cc: Peter Zijlstra, David S . Miller, Ingo Molnar, Daniel Borkmann,
Arnaldo Carvalho de Melo, Wang Nan, Josef Bacik, Brendan Gregg,
netdev, linux-kernel, kernel-team
Hi Steven, Peter,
v1->v2: addressed Peter's comments:
- fixed wording in patch 1, added ack
- refactored 2nd patch into 3:
2/10 remove unused __perf_addr macro which frees up
an argument in perf_trace_buf_submit
3/10 split perf_trace_buf_prepare into alloc and update parts, so that bpf
programs don't have to pay the performance penalty of updating struct trace_entry,
which bpf is not going to access
4/10 the actual addition of the bpf filter to the perf tracepoint handler is now trivial,
and a bpf prog can be used as a proper filter of tracepoints
v1 cover:
Last time we discussed bpf+tracepoints was a year ago [1], and the reason
we didn't proceed with that approach was that bpf would expose the arguments
arg1, arg2 of a trace_xx(arg1, arg2) call to the bpf program, and that was
considered an unnecessary extension of the ABI. Back then I wanted
to avoid the cost of the buffer alloc and field assign part in all
of the tracepoints, but it looks like, when optimized, the cost is acceptable.
So this new approach doesn't expose any new ABI to bpf programs.
The program looks at tracepoint fields after they have been copied
by perf_trace_xx(), as described in /sys/kernel/debug/tracing/events/xxx/format.
We made a tool [2] that takes the fields from /sys/.../format and works as:
$ tplist.py -v random:urandom_read
int got_bits;
int pool_left;
int input_left;
These fields can then be copy-pasted into a bpf program like:
struct urandom_read {
__u64 hidden_pad;
int got_bits;
int pool_left;
int input_left;
};
and the program can use it:
SEC("tracepoint/random/urandom_read")
int bpf_prog(struct urandom_read *ctx)
{
return ctx->pool_left > 0 ? 1 : 0;
}
This way the program can access tracepoint fields faster than an
equivalent bpf+kprobe program, which is the main goal of these patches.
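The copy-paste step described above is mechanical: each `field:` line of a format file carries a declaration, an offset and a size. tplist.py itself is Python (part of bcc); the C sketch below is a hypothetical equivalent, not its actual code. It parses such lines with sscanf(), skips the common_* header fields (assumed to occupy the first 8 bytes, which `hidden_pad` stands in for), and prints a struct like `struct urandom_read` above. It does not handle __data_loc (dynamic array) fields.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct tp_field {
	char decl[128];		/* e.g. "int got_bits" */
	unsigned int offset;
	unsigned int size;
};

/* Parse one "field:" line; returns 1 on success, 0 for any other line. */
static int parse_format_field(const char *line, struct tp_field *f)
{
	return sscanf(line, " field:%127[^;]; offset:%u; size:%u;",
		      f->decl, &f->offset, &f->size) == 3;
}

/* common_type, common_pid, ... make up the header that hidden_pad covers. */
static int is_common_field(const struct tp_field *f)
{
	return strstr(f->decl, " common_") != NULL;
}

/* Print a bpf-side struct for the given format-file lines. */
static void emit_struct(FILE *out, const char *name,
			const char *const *lines, size_t n)
{
	struct tp_field f;
	size_t i;

	fprintf(out, "struct %s {\n\t__u64 hidden_pad;\n", name);
	for (i = 0; i < n; i++)
		if (parse_format_field(lines[i], &f) && !is_common_field(&f))
			fprintf(out, "\t%s;\n", f.decl);
	fprintf(out, "};\n");
}
```

Feeding emit_struct() the lines of the urandom_read format file would print got_bits, pool_left and input_left after the pad, in file order.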
Patches 1-4 are simple changes on the perf core side; please review.
I'd like to take the whole set via net-next tree, since the rest of
the patches might conflict with other bpf work going on in net-next
and we want to avoid cross-tree merge conflicts.
Alternatively we can put patches 1-4 into both tip and net-next.
Patch 9 is an example of access to tracepoint fields from bpf prog.
Patch 10 is a micro benchmark for bpf+kprobe vs bpf+tracepoint.
Note that for actual tracing tools the user doesn't need to
run tplist.py and copy-paste fields manually; the tools do it
automatically. For example, the argdist tool [3] can be used as:
$ argdist -H 't:block:block_rq_complete():u32:nr_sector'
where 'nr_sector' is the name of a tracepoint field taken from
/sys/kernel/debug/tracing/events/block/block_rq_complete/format
and the appropriate bpf program is generated on the fly.
[1] http://thread.gmane.org/gmane.linux.kernel.api/8127/focus=8165
[2] https://github.com/iovisor/bcc/blob/master/tools/tplist.py
[3] https://github.com/iovisor/bcc/blob/master/tools/argdist.py
Alexei Starovoitov (10):
perf: optimize perf_fetch_caller_regs
perf: remove unused __addr variable
perf: split perf_trace_buf_prepare into alloc and update parts
perf, bpf: allow bpf programs attach to tracepoints
bpf: register BPF_PROG_TYPE_TRACEPOINT program type
bpf: support bpf_get_stackid() and bpf_perf_event_output() in
tracepoint programs
bpf: sanitize bpf tracepoint access
samples/bpf: add tracepoint support to bpf loader
samples/bpf: tracepoint example
samples/bpf: add tracepoint vs kprobe performance tests
include/linux/bpf.h | 2 +
include/linux/perf_event.h | 4 +-
include/linux/trace_events.h | 9 +-
include/trace/perf.h | 23 +++--
include/trace/trace_events.h | 3 -
include/uapi/linux/bpf.h | 1 +
kernel/bpf/stackmap.c | 2 +-
kernel/bpf/verifier.c | 6 +-
kernel/events/core.c | 27 ++++--
kernel/trace/bpf_trace.c | 85 ++++++++++++++++-
kernel/trace/trace_event_perf.c | 40 ++++----
kernel/trace/trace_events.c | 18 ++++
kernel/trace/trace_kprobe.c | 10 +-
kernel/trace/trace_syscalls.c | 13 +--
kernel/trace/trace_uprobe.c | 5 +-
samples/bpf/Makefile | 5 +
samples/bpf/bpf_load.c | 26 ++++-
samples/bpf/offwaketime_kern.c | 26 ++++-
samples/bpf/test_overhead_kprobe_kern.c | 41 ++++++++
samples/bpf/test_overhead_tp_kern.c | 36 +++++++
samples/bpf/test_overhead_user.c | 162 ++++++++++++++++++++++++++++++++
21 files changed, 475 insertions(+), 69 deletions(-)
create mode 100644 samples/bpf/test_overhead_kprobe_kern.c
create mode 100644 samples/bpf/test_overhead_tp_kern.c
create mode 100644 samples/bpf/test_overhead_user.c
--
2.8.0
* RE: [PATCH net-next V3 00/16] net: fec: cleanup and fixes
From: Fugang Duan @ 2016-04-07 1:23 UTC
To: Troy Kisky, netdev@vger.kernel.org, davem@davemloft.net,
lznuaa@gmail.com
Cc: andrew@lunn.ch, stillcompiling@gmail.com, arnd@arndb.de,
sergei.shtylyov@cogentembedded.com, gerg@uclinux.org,
Fabio Estevam, johannes@sipsolutions.net, l.stach@pengutronix.de,
linux-arm-kernel@lists.infradead.org, tremyfr@gmail.com
In-Reply-To: <57053C94.2000009@boundarydevices.com>
From: Troy Kisky <troy.kisky@boundarydevices.com> Sent: Thursday, April 07, 2016 12:43 AM
> To: Fugang Duan <fugang.duan@nxp.com>; netdev@vger.kernel.org;
> davem@davemloft.net; lznuaa@gmail.com
> Cc: Fabio Estevam <fabio.estevam@nxp.com>; l.stach@pengutronix.de;
> andrew@lunn.ch; tremyfr@gmail.com; gerg@uclinux.org; linux-arm-
> kernel@lists.infradead.org; johannes@sipsolutions.net;
> stillcompiling@gmail.com; sergei.shtylyov@cogentembedded.com;
> arnd@arndb.de
> Subject: Re: [PATCH net-next V3 00/16] net: fec: cleanup and fixes
>
> On 4/6/2016 1:51 AM, Fugang Duan wrote:
> > From: Troy Kisky <troy.kisky@boundarydevices.com> Sent: Wednesday,
> > April 06, 2016 10:26 AM
> >> To: netdev@vger.kernel.org; davem@davemloft.net; Fugang Duan
> >> <fugang.duan@nxp.com>; lznuaa@gmail.com
> >> Cc: Fabio Estevam <fabio.estevam@nxp.com>; l.stach@pengutronix.de;
> >> andrew@lunn.ch; tremyfr@gmail.com; gerg@uclinux.org; linux-arm-
> >> kernel@lists.infradead.org; johannes@sipsolutions.net;
> >> stillcompiling@gmail.com; sergei.shtylyov@cogentembedded.com;
> >> arnd@arndb.de; Troy Kisky <troy.kisky@boundarydevices.com>
> >> Subject: [PATCH net-next V3 00/16] net: fec: cleanup and fixes
> >>
> >> V3 has
> >>
> >> 1 dropped patch "net: fec: print more debug info in fec_timeout"
> >> 2 new patches
> >> 0002-net-fec-remove-unused-interrupt-FEC_ENET_TS_TIMER.patch
> >> 0003-net-fec-return-IRQ_HANDLED-if-fec_ptp_check_pps_even.patch
> >>
> >> 1 combined patch
> >> 0004-net-fec-pass-rxq-txq-to-fec_enet_rx-tx_queue-instead.patch
> >>
> >> The changes are noted on individual patches
> >>
> >> My measured performance of this series is
> >>
> >> before patch set
> >> 365 Mbits/sec Tx/407 RX
> >>
> >> after patch set
> >> 374 Tx/427 Rx
> >>
> >
> > I doubt the performance data; I validated it on an i.MX6q sabresd board with
> > the latest commit (4da46cebbd3b) in the net tree.
>
>
>
> I was doing UDP tests, as outlined in my V2 cover letter. Also, my CPU is 1 GHz.
> Is yours 1.2 GHz?
>
It is also 1 GHz.
I will double-check UDP performance.
>
>
>
> > root@imx6qdlsolo:~# uname -a
> > Linux imx6qdlsolo 4.6.0-rc1-00318-g4da46ce #180 SMP Wed Apr 6 16:24:09 CST 2016 armv7l GNU/Linux
>
>
> This is the V2 patch that I dropped.
>
> I will force update my local net-next_master branch, to make testing this series
> easier.
> Note that my local net-next_master branch has about 19 patches on top of this
> series.
> so,
>
> tkisky@office-server2:~/linux-imx6$ git reset --hard HEAD~19
> HEAD is now at a125da7 net: fec: don't set cbd_bufaddr unless no mapping error
>
I see.
>
> >
> > TCP RX performance is 602Mbps, but TX is only 325Mbps; the TX path has some
> > performance issue in the net tree.
> > I will dig into it.
> >
> >
>
>
> More testing is always better. Thanks
>
>
> >>
> >> Troy Kisky (16):
> >> net: fec: only check queue 0 if RXF_0/TXF_0 interrupt is set
> >> net: fec: remove unused interrupt FEC_ENET_TS_TIMER
> >> net: fec: return IRQ_HANDLED if fec_ptp_check_pps_event handled it
> >> net: fec: pass rxq/txq to fec_enet_rx/tx_queue instead of queue_id
> >> net: fec: reduce interrupts
> >> net: fec: split off napi routine with 3 queues
> >> net: fec: don't clear all rx queue bits when just one is being checked
> >> net: fec: set cbd_sc without relying on previous value
> >> net: fec: eliminate calls to fec_enet_get_prevdesc
> >> net: fec: move restart test for efficiency
> >> net: fec: clear cbd_sc after transmission to help with debugging
> >> net: fec: dump all tx queues in fec_dump
> >> net: fec: detect tx int lost
> >> net: fec: create subroutine reset_tx_queue
> >> net: fec: call dma_unmap_single on mapped tx buffers at restart
> >> net: fec: don't set cbd_bufaddr unless no mapping error
> >>
> >> drivers/net/ethernet/freescale/fec.h | 10 +-
> >> drivers/net/ethernet/freescale/fec_main.c | 410 ++++++++++++++++--------------
> >> 2 files changed, 218 insertions(+), 202 deletions(-)
> >>
> >> --
> >> 2.5.0
> >
* Re: [PATCH net-next V3 00/16] net: fec: cleanup and fixes
From: Troy Kisky @ 2016-04-07 1:09 UTC
To: David Miller
Cc: netdev, fugang.duan, lznuaa, fabio.estevam, l.stach, andrew,
tremyfr, gerg, linux-arm-kernel, johannes, stillcompiling,
sergei.shtylyov, arnd
In-Reply-To: <20160406.172029.2094647648319882709.davem@davemloft.net>
On 4/6/2016 2:20 PM, David Miller wrote:
>
> This is a way too large patch series.
>
> Please split it up into smaller, more logical, pieces.
>
> Thanks.
>
If you apply the 1st 3 that have been acked, I'll be at 13.
Would I then send the next 5 for V4, and when that is applied
send another V4 with the next 8 that have already been acked?
Thanks
Troy
* Re: [PATCH net-next V3 05/16] net: fec: reduce interrupts
From: Troy Kisky @ 2016-04-07 0:42 UTC
To: David Miller
Cc: netdev, fugang.duan, lznuaa, fabio.estevam, l.stach, andrew,
tremyfr, gerg, linux-arm-kernel, johannes, stillcompiling,
sergei.shtylyov, arnd
In-Reply-To: <20160406.172008.266926707628676037.davem@davemloft.net>
On 4/6/2016 2:20 PM, David Miller wrote:
> From: Troy Kisky <troy.kisky@boundarydevices.com>
> Date: Tue, 5 Apr 2016 19:25:51 -0700
>
>> By clearing the NAPI interrupts in the NAPI routine
>> and not in the interrupt handler, we can reduce the
>> number of interrupts. We also don't need any status
>> variables as the registers are still valid.
>>
>> Also, notice that if budget pkts are received, the
>> next call to fec_enet_rx_napi will now continue to
>> receive the previously pending packets.
>>
>> To test that this actually reduces interrupts, try
>> this command before/after patch
>>
>> cat /proc/interrupts |grep ether; \
>> ping -s2800 192.168.0.201 -f -c1000 ; \
>> cat /proc/interrupts |grep ether
>>
>> For me, there are 2996 interrupts before this patch
>> and 2010 interrupts after it.
>>
>> Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com>
>
> I really don't think this is a good idea at all.
>
> I would instead really rather see you stash away the
> status register values into some piece of software state,
> and then re-read them before you are about to finish a
> NAPI poll cycle.
>
>
Sure, that's an easy change. But if a TX interrupt is what caused the
interrupt, interrupts get masked, and then an RX packet arrives before
NAPI runs, do you want TX serviced first, or RX?
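David's alternative keeps acknowledging events in the interrupt handler, but stashes the event bits in software state and re-reads the hardware register before the NAPI poll completes. The userspace model below sketches that control flow only. The register, the bit values `EV_RXF`/`EV_TXF`, and all function names are hypothetical, the write-1-to-clear acknowledge is simulated, and budget handling is omitted. The TX-before-RX order inside the loop is an arbitrary choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EV_RXF 0x1u	/* hypothetical "rx frame" event bit */
#define EV_TXF 0x2u	/* hypothetical "tx frame" event bit */

struct fake_fec {
	uint32_t ievent;	/* simulated hardware event register */
	uint32_t events;	/* software copy stashed for NAPI */
	int rx_serviced;
	int tx_serviced;
};

/* Read the event register and acknowledge (clear) what was read. */
static uint32_t read_and_ack(struct fake_fec *fep)
{
	uint32_t ev = fep->ievent;

	fep->ievent &= ~ev;	/* models a write-1-to-clear register */
	return ev;
}

/* Interrupt handler: stash and ack events; a driver would then schedule NAPI. */
static void fake_irq(struct fake_fec *fep)
{
	fep->events |= read_and_ack(fep);
}

/* NAPI poll: service stashed events and re-read before completing. */
static void fake_napi_poll(struct fake_fec *fep)
{
	do {
		uint32_t ev = fep->events;

		fep->events = 0;
		if (ev & EV_TXF)
			fep->tx_serviced++;
		if (ev & EV_RXF)
			fep->rx_serviced++;
		/* pick up anything that arrived while we were busy */
		fep->events |= read_and_ack(fep);
	} while (fep->events);
	/* a driver would now call napi_complete() and unmask interrupts */
}
```

In this model an RX event that lands between the interrupt and the poll is found by the re-read, so it is serviced in the same poll without another interrupt.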
* panic in __inet_hash
From: David Ahern @ 2016-04-07 0:02 UTC
To: netdev@vger.kernel.org, edumazet
Rebased to top of tree a few minutes ago to test v6 of the multipath
patch. I'm hitting a panic in __inet_hash:
[ 17.264247] BUG: unable to handle kernel paging request at ffffffffffffffa8
[ 17.265015] IP: [<ffffffff816e2b3c>] __inet_hash+0x11c/0x1c0
[ 17.265015] PGD 1e07067 PUD 1e09067 PMD 0
[ 17.265015] Oops: 0000 [#1] SMP
[ 17.265015] Modules linked in:
[ 17.265015] CPU: 0 PID: 1488 Comm: zebra Not tainted 4.6.0-rc1+ #5
[ 17.265015] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
[ 17.265015] task: ffff880011010800 ti: ffff880017bb0000 task.ti: ffff880017bb0000
[ 17.265015] RIP: 0010:[<ffffffff816e2b3c>] [<ffffffff816e2b3c>] __inet_hash+0x11c/0x1c0
[ 17.265015] RSP: 0018:ffff880017bb3e48 EFLAGS: 00010293
[ 17.265015] RAX: 0000000000000002 RBX: 0000000000000004 RCX: 0000000000000002
[ 17.265015] RDX: 0000000000000000 RSI: ffff880011010fa0 RDI: ffff880011010800
[ 17.265015] RBP: ffff880017bb3e88 R08: ffffffff8282b080 R09: ffffffffffffff98
[ 17.265015] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82ea80c0
[ 17.265015] R13: ffffffff82ea7f80 R14: 0000000000000000 R15: ffff88001b618000
[ 17.265015] FS: 00007f947df9d740(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[ 17.265015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 17.265015] CR2: ffffffffffffffa8 CR3: 00000000179d7000 CR4: 00000000000006f0
[ 17.265015] Stack:
[ 17.265015] ffffffff82e95150 0000006e00000005 ffffffff8170b540 ffff88001b618000
[ 17.265015] 0000000000000000 0000000000000003 0000000000000000 00000000ffffffff
[ 17.265015] ffff880017bb3ea0 ffffffff816e2c19 ffff88001b618000 ffff880017bb3ec0
[ 17.265015] Call Trace:
[ 17.265015] [<ffffffff8170b540>] ? udp4_seq_show+0x150/0x150
[ 17.265015] [<ffffffff816e2c19>] inet_hash+0x39/0x60
[ 17.265015] [<ffffffff816e44dd>] inet_csk_listen_start+0x9d/0xc0
[ 17.265015] [<ffffffff81719c2c>] inet_listen+0x9c/0xe0
[ 17.265015] [<ffffffff8166b5ae>] SyS_listen+0x4e/0x80
[ 17.265015] [<ffffffff81002eac>] do_syscall_64+0x5c/0x2b0
[ 17.265015] [<ffffffff8181829a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 17.265015] Code: 4b 54 9e ff 41 f6 c6 01 74 13 e9 88 00 00 00 4d 8b 36 e8 38 54 9e ff 41 f6 c6 01 75 7a 4d 8d 4e 98 4d 39 cf 74 e9 41 0f b7 47 10 <66> 41 39 46 a8 75 dd 41 0f b6 46 ab 89 c2 41 32 57 13 83 e2 20
[ 17.265015] RIP [<ffffffff816e2b3c>] __inet_hash+0x11c/0x1c0
[ 17.265015] RSP <ffff880017bb3e48>
[ 17.265015] CR2: ffffffffffffffa8
[ 17.265015] ---[ end trace d7340320507851f4 ]---
git bisect points to:
dsa@kenny:~/kernel-2.git$ git bisect bad
3b24d854cb35383c30642116e5992fd619bdc9bc is the first bad commit
commit 3b24d854cb35383c30642116e5992fd619bdc9bc
Author: Eric Dumazet <edumazet@google.com>
Date: Fri Apr 1 08:52:17 2016 -0700
tcp/dccp: do not touch listener sk_refcnt under synflood
When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.
By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt
Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.
Peak performance under SYNFLOOD is increased by ~33% :
On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps
Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
:040000 040000 77d3f9ba3c3ec4c3443416f7536fe6d3ee9ec1a1 b797845f7bbad79b5875bfa969ebfe9759c0b8b9 M include
:040000 040000 0325378c93011f87012e5a6c064b09c2d59e052b 32f90a3ec258f4625765c33d4718b2370e69e862 M net
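One reading of the oops follows. It is an editorial decoding of the Code: bytes and is not stated in the report.

- `4d 8d 4e 98` is `lea -0x68(%r14),%r9`.
- The faulting `66 41 39 46 a8` is `cmp %ax,-0x58(%r14)`.
- With R14 equal to 0, as in the register dump, these give exactly the R09 and CR2 values above. That is what walking onto a NULL list pointer and then applying a container_of()-style offset looks like.
- The earlier `41 f6 c6 01` (`test $0x1,%r14b`) treats only odd pointers as hlist_nulls end markers. A plain NULL terminator would therefore not stop such a loop.

The sketch below only reproduces that address arithmetic. The offsets come from the instruction bytes. Mapping them to particular struct sock fields is an assumption left open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets taken from the Code: bytes; the struct field names are not known here. */
#define NODE_OFF  0x68u /* lea -0x68(%r14),%r9: entry = node - 0x68 */
#define FIELD_OFF 0x10u /* movzwl 0x10(%r15),%eax: 16-bit field at entry + 0x10 */

/* hlist_nulls end markers have bit 0 set; a plain NULL pointer does not. */
static bool is_nulls_marker(uint64_t node)
{
	return (node & 1u) != 0;
}

/* container_of()-style arithmetic done on integers, so the sketch has no UB. */
static uint64_t entry_from_node(uint64_t node)
{
	return node - NODE_OFF;
}

/* Address read by the faulting cmp instruction. */
static uint64_t faulting_addr(uint64_t node)
{
	return entry_from_node(node) + FIELD_OFF;
}
```

With `node == 0`, `entry_from_node()` yields ffffffffffffff98 (R09) and `faulting_addr()` yields ffffffffffffffa8 (CR2).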
^ permalink raw reply
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Florian Westphal @ 2016-04-06 23:45 UTC (permalink / raw)
To: Paul Moore
Cc: Florian Westphal, Paolo Abeni, linux-security-module,
David S. Miller, James Morris, Andreas Gruenbacher,
Stephen Smalley, netdev, selinux
In-Reply-To: <CAHC9VhSZP1QxT-pQeHAW9m1xmyDsbFeA=9b8SdeWgS0RiDTOBA@mail.gmail.com>
Paul Moore <paul@paul-moore.com> wrote:
> On Wed, Apr 6, 2016 at 6:14 PM, Florian Westphal <fw@strlen.de> wrote:
> > netfilter hooks are per namespace -- so hooks do get unregistered when a
> > netns is destroyed.
>
> Looking around, I see the global and per-namespace registration
> functions (nf_register_hook and nf_register_net_hook, respectively),
> but I'm looking to see if/how newly created namespaces inherit
> netfilter hooks from the init network namespace ... if you can create
> a network namespace and dodge the SELinux hooks, that isn't a good
> thing from a SELinux point of view, although it might be a plus
> depending on how you view Paolo's original patches ;)
Heh :-)
If you use nf_register_net_hook, the hook is only registered in that
namespace.
If you use nf_register_hook, the hook is put on a global list and
registered in all existing namespaces.
New namespaces will have the hook added as well (see
netfilter_net_init -> nf_register_hook_list in netfilter/core.c).
Since nf_register_hook is used, it should be impossible to get a netns
that doesn't call these hooks.
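The registration behaviour described above can be modelled with a toy user-space sketch. The toy_* names are hypothetical. They only mirror the roles of nf_register_hook(), nf_register_net_hook() and netfilter_net_init() as this message describes them. The real code in net/netfilter/core.c uses different data structures and locking.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_HOOKS 8
#define MAX_NETNS 4

/* Toy stand-in for a network namespace's hook list. */
struct toy_net {
	int hooks[MAX_HOOKS];
	size_t nhooks;
	bool alive;
};

static int global_hooks[MAX_HOOKS]; /* hooks registered "globally" */
static size_t nglobal;
static struct toy_net nets[MAX_NETNS];

static void net_add_hook(struct toy_net *net, int hook)
{
	if (net->nhooks < MAX_HOOKS)
		net->hooks[net->nhooks++] = hook;
}

/* Models nf_register_net_hook(): only this namespace gets the hook. */
static void toy_register_net_hook(struct toy_net *net, int hook)
{
	net_add_hook(net, hook);
}

/* Models nf_register_hook(): record globally, add to every live netns. */
static void toy_register_hook(int hook)
{
	if (nglobal < MAX_HOOKS)
		global_hooks[nglobal++] = hook;
	for (size_t i = 0; i < MAX_NETNS; i++)
		if (nets[i].alive)
			net_add_hook(&nets[i], hook);
}

/* Models netfilter_net_init(): a new netns starts with all global hooks. */
static struct toy_net *toy_netns_create(void)
{
	for (size_t i = 0; i < MAX_NETNS; i++) {
		if (nets[i].alive)
			continue;
		nets[i].alive = true;
		nets[i].nhooks = 0;
		for (size_t j = 0; j < nglobal; j++)
			net_add_hook(&nets[i], global_hooks[j]);
		return &nets[i];
	}
	return NULL;
}

static bool toy_net_has_hook(const struct toy_net *net, int hook)
{
	for (size_t i = 0; i < net->nhooks; i++)
		if (net->hooks[i] == hook)
			return true;
	return false;
}
```

A namespace created after a global registration still gets the global hook. It never sees a hook that was registered only in another namespace.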
> > Do you think it makes sense to rework the patch to delay registering
> > of the netfilter hooks until the system is in a state where they're
> > needed, without the 'unregister' aspect?
>
> I would need to see the patch to say for certain, but in principle
> that seems perfectly reasonable and I think would satisfy both the
> netdev and SELinux camps - good suggestion. My main goal is to drop
> the selinux_nf_ip_init() entirely so it can't be used as a ROP gadget.
>
> We might even be able to trim the secmark_active and peerlbl_active
> checks in the SELinux netfilter hooks (an earlier attempt at
> optimization; contrary to popular belief, I do care about SELinux
> performance), although that would mean that enabling the network
> access controls would be one way ... I guess you can disregard that
> last bit, I'm thinking aloud again.
One way is fine I think.
> > Ideally this would even be per netns -- in a perfect world we would
> > be able to make it so that a new netns is created with an empty
> > hook list.
>
> In general SELinux doesn't care about namespaces, for reasons that are
> sorta beyond the scope of this conversation, so I would like to stick
> to an all-or-nothing approach to enabling the SELinux netfilter hooks
> across namespaces. Perhaps we can revisit this at a later time, but
> let's keep it simple right now.
Okay, I'd prefer to stick to your recommendation anyway wrt. selinux
(Casey, I read your comment regarding smack. Noted, we don't want to
break smack either...)
I think that in this case the entire question is:
In your experience, how likely is a config where
selinux is enabled BUT the hooks are not needed (i.e., where we hit the

	if (!selinux_policycap_netpeer)
		return NF_ACCEPT;
	if (!secmark_active && !peerlbl_active)
		return NF_ACCEPT;

tests inside the hooks)? If such setups are uncommon, we should just
drop this idea or at least put it on the back burner until the more expensive
netfilter hooks (conntrack, cough) are out of the way.
Thanks,
Florian
^ permalink raw reply
* Re: [net-next 03/16] fm10k: Avoid crashing the kernel
From: Keller, Jacob E @ 2016-04-06 23:24 UTC (permalink / raw)
To: davem@davemloft.net, Kirsher, Jeffrey T
Cc: nhorman@redhat.com, netdev@vger.kernel.org, Allan, Bruce W,
sassmann@redhat.com, jogreene@redhat.com
In-Reply-To: <20160405.121234.207383895896842448.davem@davemloft.net>
On Tue, 2016-04-05 at 12:12 -0400, David Miller wrote:
> As Joe suggested, it is not reasonable to expect all compilers to be
> able to figure out that all of the index increments in this
> function lead to a specific constant value.
>
> Your only option is to either keep the code as-is, or add proper
> error reporting to this function and to all callers, in order to
> handle the situation at run time, which I realize is exactly what you
> are trying to avoid.
>
> If this crashes at run time with the BUG_ON(), it's going to happen
> really quickly when you bring the interface up. So I don't see the
> run time check as so tragic.
So we're ok with just not changing this then, and living with a crash?
That's fine with me, I suppose.
Should the second WARN_ONCE be changed to a BUG_ON as well to get a
crash instead of just a warning?
Thanks,
Jake
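For comparison, the "proper error reporting" option mentioned above might look roughly like the following sketch. The table, `fill_table()` and `bring_up()` are hypothetical and are not taken from fm10k. The point is only that a mismatch in the final index is returned to callers as an error instead of triggering BUG_ON().

```c
#include <errno.h>
#include <stddef.h>

#define NUM_ENTRIES 4 /* hypothetical expected number of entries */

struct entry {
	const char *name;
};

/*
 * Fill the table using index increments, then verify the final index at
 * run time. In real code the increments may be conditional, so the compiler
 * cannot necessarily prove i == NUM_ENTRIES at build time.
 */
static int fill_table(struct entry *tbl, size_t cap, size_t *count)
{
	size_t i = 0;

	if (cap < NUM_ENTRIES)
		return -EINVAL;

	tbl[i++].name = "rx_packets";
	tbl[i++].name = "tx_packets";
	tbl[i++].name = "rx_bytes";
	tbl[i++].name = "tx_bytes";

	if (i != NUM_ENTRIES)
		return -EINVAL; /* report instead of BUG_ON() */

	*count = i;
	return 0;
}

/* Caller propagates the error, e.g. failing interface bring-up cleanly. */
static int bring_up(void)
{
	struct entry tbl[NUM_ENTRIES];
	size_t n = 0;
	int err = fill_table(tbl, NUM_ENTRIES, &n);

	if (err)
		return err;
	return (int)n;
}
```

The cost is that every caller must handle the error. Those extra paths are exactly what a BUG_ON() avoids.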
^ permalink raw reply
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Paul Moore @ 2016-04-06 23:15 UTC (permalink / raw)
To: Florian Westphal
Cc: Paolo Abeni, linux-security-module, David S. Miller, James Morris,
Andreas Gruenbacher, Stephen Smalley, netdev, selinux
In-Reply-To: <20160406221445.GA26807@breakpoint.cc>
On Wed, Apr 6, 2016 at 6:14 PM, Florian Westphal <fw@strlen.de> wrote:
> netfilter hooks are per namespace -- so hooks do get unregistered when a
> netns is destroyed.
Looking around, I see the global and per-namespace registration
functions (nf_register_hook and nf_register_net_hook, respectively),
but I'm looking to see if/how newly created namespaces inherit
netfilter hooks from the init network namespace ... if you can create
a network namespace and dodge the SELinux hooks, that isn't a good
thing from a SELinux point of view, although it might be a plus
depending on how you view Paolo's original patches ;)
> Do you think it makes sense to rework the patch to delay registering
> of the netfilter hooks until the system is in a state where they're
> needed, without the 'unregister' aspect?
I would need to see the patch to say for certain, but in principle
that seems perfectly reasonable and I think would satisfy both the
netdev and SELinux camps - good suggestion. My main goal is to drop
the selinux_nf_ip_init() entirely so it can't be used as a ROP gadget.
We might even be able to trim the secmark_active and peerlbl_active
checks in the SELinux netfilter hooks (an earlier attempt at
optimization; contrary to popular belief, I do care about SELinux
performance), although that would mean that enabling the network
access controls would be one way ... I guess you can disregard that
last bit, I'm thinking aloud again.
> Ideally this would even be per netns -- in a perfect world we would
> be able to make it so that a new netns is created with an empty
> hook list.
In general SELinux doesn't care about namespaces, for reasons that are
sorta beyond the scope of this conversation, so I would like to stick
to an all-or-nothing approach to enabling the SELinux netfilter hooks
across namespaces. Perhaps we can revisit this at a later time, but
let's keep it simple right now.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Boot failure when using NFS on OMAP based evms
From: Franklin S Cooper Jr. @ 2016-04-06 23:12 UTC (permalink / raw)
To: samanthakumar, willemb, edumazet, tony, mugunthanvnm
Cc: netdev, linux-kernel, davem, kuznet, jmorris, yoshfuji, kaber,
nsekhar, linux-omap
Hi All,
Currently linux-next is failing to boot via NFS on my AM335x GP evm,
AM437x GP evm and Beagle X15. I bisected the problem down to the commit
"udp: remove headers from UDP packets before queueing".
I had to revert the following three commits to get things working again:
e6afc8ace6dd5cef5e812f26c72579da8806f5ac
udp: remove headers from UDP packets before queueing
627d2d6b550094d88f9e518e15967e7bf906ebbf
udp: enable MSG_PEEK at non-zero offset
b9bb53f3836f4eb2bdeb3447be11042bd29c2408
sock: convert sk_peek_offset functions to WRITE_ONCE
I'm using omap2plus_defconfig for my config.
Below bootlogs are from my AM335x GP evm:
Working
http://pastebin.ubuntu.com/15661989/ (Linux-next 3 patches reverted)
Failing
http://pastebin.ubuntu.com/15661318/ (Linux-next)
^ permalink raw reply
* Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
From: Alexei Starovoitov @ 2016-04-06 23:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Brenden Blanco, davem, netdev, tom, Or Gerlitz, daniel,
john.fastabend
In-Reply-To: <20160406220100.0df04925@redhat.com>
On Wed, Apr 06, 2016 at 10:01:00PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 6 Apr 2016 21:48:48 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh
> > then I hit this strange 14.5Mpps limit (proto 17: 14505558 drops/s).
> > And the RX 4x CPUs are starting to NOT use 100% in softirq, they have
> > some cycles attributed to %idle. (I verified generator is sending at
> > 24Mpps).
...
> > If I change the program to not touch packet data (don't call
> > load_byte()) then the performance increase to 14.6Mpps (single
> > flow/cpu). And the RX CPU is mostly idle... mlx4_en_process_rx_cq()
> > and page alloc/free functions taking the time.
Please try it with module param log_num_mgm_entry_size=-1
It should get to 20Mpps when bpf doesn't touch the packet.
> Before someone else points out the obvious... I forgot to enable JIT.
> Enable it::
>
> # echo 1 > /proc/sys/net/core/bpf_jit_enable
>
> Performance increased to: 10.8Mpps (proto 17: 10819446 drops/s)
>
> Samples: 51K of event 'cycles', Event count (approx.): 56775706510
> Overhead Command Shared Object Symbol
> + 55.90% ksoftirqd/7 [kernel.vmlinux] [k] sk_load_byte_positive_offset
> + 10.71% ksoftirqd/7 [mlx4_en] [k] mlx4_en_alloc_frags
...
> It is very likely a cache-miss in sk_load_byte_positive_offset().
yes, likely due to missing ddio as you said.
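To put the rates in this thread into perspective, a packet rate converts directly into an average time budget per packet. This is a simple arithmetic helper and not part of any tool used in the thread. It assumes the load is spread evenly across `ncpus` receiving CPUs.

```c
/* Average per-packet time budget in nanoseconds, assuming the packet rate
 * is spread evenly across ncpus receiving CPUs. */
static double ns_per_packet(double pps, unsigned int ncpus)
{
	return (double)ncpus * 1e9 / pps;
}
```

Some values from the thread under that assumption:

- The JIT-enabled 10.8 Mpps run on a single CPU leaves roughly 92 ns per packet.
- The suggested 20 Mpps target leaves 50 ns.
- 14.5 Mpps spread over 4 RX CPUs leaves about 276 ns per packet per CPU.

At these budgets, a single cache miss in sk_load_byte_positive_offset() is a large fraction of the total.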
^ permalink raw reply
* Re: [RFC PATCH 2/2] selinux: implement support for dynamic net hook [de-]registration
From: Casey Schaufler @ 2016-04-06 22:32 UTC (permalink / raw)
To: Paolo Abeni, linux-security-module
Cc: David S. Miller, James Morris, Paul Moore, Andreas Gruenbacher,
Stephen Smalley, Florian Westphal, netdev
In-Reply-To: <b7f1a4578e0617c7361dd392282e03556aa37b46.1459934322.git.pabeni@redhat.com>
On 4/6/2016 2:51 AM, Paolo Abeni wrote:
> This patch leverages the netlbl_changed() hook to perform on demand
> registration and deregistration of the netfilter hooks and the
> socket_sock_rcv_skb hook.
>
> With default policy and empty netfilter/netlabel configuration, the
> above hooks are not registered and this allows avoiding nf_hook_slow
> in the xmit path and socket_sock_rcv_skb() in the rx path.
There is no reason to assume that there is a relationship between
a netlabel configuration and a netfilter configuration. Smack always
has a netlabel configuration. Security modules (e.g. AppArmor) may
well use netfilter without netlabel.
Please stop assuming that security == SELinux.
>
> This gives measurable network performance improvement in both
> directions.
In the case where SELinux is enabled and netfilter is not.
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> security/selinux/hooks.c | 72 +++++++++++++++++++++++++++++++------
> security/selinux/include/security.h | 1 +
> security/selinux/ss/services.c | 1 +
> security/selinux/xfrm.c | 4 +++
> 4 files changed, 68 insertions(+), 10 deletions(-)
>
> diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
> index 912deee..a3baa69 100644
> --- a/security/selinux/hooks.c
> +++ b/security/selinux/hooks.c
> @@ -4745,11 +4745,13 @@ static int selinux_secmark_relabel_packet(u32 sid)
> static void selinux_secmark_refcount_inc(void)
> {
> atomic_inc(&selinux_secmark_refcount);
> + selinux_net_update();
> }
>
> static void selinux_secmark_refcount_dec(void)
> {
> atomic_dec(&selinux_secmark_refcount);
> + selinux_net_update();
> }
>
> static void selinux_req_classify_flow(const struct request_sock *req,
> @@ -4836,6 +4838,11 @@ static int selinux_tun_dev_open(void *security)
> return 0;
> }
>
> +static void selinux_netlbl_changed(void)
> +{
> + selinux_net_update();
> +}
> +
> static int selinux_nlmsg_perm(struct sock *sk, struct sk_buff *skb)
> {
> int err = 0;
> @@ -6091,7 +6098,6 @@ static struct security_hook_list selinux_hooks[] = {
> LSM_HOOK_INIT(socket_getsockopt, selinux_socket_getsockopt),
> LSM_HOOK_INIT(socket_setsockopt, selinux_socket_setsockopt),
> LSM_HOOK_INIT(socket_shutdown, selinux_socket_shutdown),
> - LSM_HOOK_INIT(socket_sock_rcv_skb, selinux_socket_sock_rcv_skb),
> LSM_HOOK_INIT(socket_getpeersec_stream,
> selinux_socket_getpeersec_stream),
> LSM_HOOK_INIT(socket_getpeersec_dgram, selinux_socket_getpeersec_dgram),
> @@ -6113,6 +6119,7 @@ static struct security_hook_list selinux_hooks[] = {
> LSM_HOOK_INIT(tun_dev_attach_queue, selinux_tun_dev_attach_queue),
> LSM_HOOK_INIT(tun_dev_attach, selinux_tun_dev_attach),
> LSM_HOOK_INIT(tun_dev_open, selinux_tun_dev_open),
> + LSM_HOOK_INIT(netlbl_changed, selinux_netlbl_changed),
>
> #ifdef CONFIG_SECURITY_NETWORK_XFRM
> LSM_HOOK_INIT(xfrm_policy_alloc_security, selinux_xfrm_policy_alloc),
> @@ -6145,6 +6152,11 @@ static struct security_hook_list selinux_hooks[] = {
> #endif
> };
>
> +/* dynamically registered/unregistered */
> +static struct security_hook_list selinux_sock_hooks[] = {
> + LSM_HOOK_INIT(socket_sock_rcv_skb, selinux_socket_sock_rcv_skb),
> +};
> +
> static __init int selinux_init(void)
> {
> if (!security_module_enable("selinux")) {
> @@ -6240,7 +6252,9 @@ static struct nf_hook_ops selinux_nf_ops[] = {
> #endif /* IPV6 */
> };
>
> -static int __init selinux_nf_ip_init(void)
> +static bool nf_hooks_registered;
> +
> +static int selinux_nf_ip_init(void)
> {
> int err;
>
> @@ -6253,25 +6267,21 @@ static int __init selinux_nf_ip_init(void)
> if (err)
> panic("SELinux: nf_register_hooks: error %d\n", err);
>
> + nf_hooks_registered = true;
> return 0;
> }
>
> -__initcall(selinux_nf_ip_init);
> -
> -#ifdef CONFIG_SECURITY_SELINUX_DISABLE
> static void selinux_nf_ip_exit(void)
> {
> printk(KERN_DEBUG "SELinux: Unregistering netfilter hooks\n");
>
> nf_unregister_hooks(selinux_nf_ops, ARRAY_SIZE(selinux_nf_ops));
> + nf_hooks_registered = false;
> }
> -#endif
>
> #else /* CONFIG_NETFILTER */
>
> -#ifdef CONFIG_SECURITY_SELINUX_DISABLE
> #define selinux_nf_ip_exit()
> -#endif
>
> #endif /* CONFIG_NETFILTER */
>
> @@ -6300,8 +6310,8 @@ int selinux_disable(void)
> /* Try to destroy the avc node cache */
> avc_disable();
>
> - /* Unregister netfilter hooks. */
> - selinux_nf_ip_exit();
> + /* Unregister net hooks. */
> + selinux_net_update();
>
> /* Unregister selinuxfs. */
> exit_sel_fs();
> @@ -6309,3 +6319,45 @@ int selinux_disable(void)
> return 0;
> }
> #endif
> +
> +DEFINE_MUTEX(selinux_net_mutex);
> +
> +static bool nf_hooks_required(void)
> +{
> + return (selinux_secmark_enabled() || selinux_peerlbl_enabled() ||
> + !selinux_policycap_netpeer) && selinux_enabled;
> +}
> +
> +static bool sock_hooks_required(void)
> +{
> + return (!selinux_policycap_netpeer || selinux_secmark_enabled() ||
> + selinux_peerlbl_enabled()) && selinux_enabled;
> +}
> +
> +static bool sock_hooks_registered;
> +
> +void selinux_net_update(void)
> +{
> + if ((nf_hooks_required() == nf_hooks_registered) &&
> + (sock_hooks_required() == sock_hooks_registered))
> + return;
> +
> + mutex_lock(&selinux_net_mutex);
> + if (nf_hooks_required() != nf_hooks_registered) {
> + if (!nf_hooks_registered)
> + selinux_nf_ip_init();
> + else
> + selinux_nf_ip_exit();
> + }
> +
> + if (sock_hooks_required() != sock_hooks_registered) {
> + if (!sock_hooks_registered)
> + security_add_hooks(selinux_sock_hooks,
> + ARRAY_SIZE(selinux_sock_hooks));
> + else
> + security_delete_hooks(selinux_sock_hooks,
> + ARRAY_SIZE(selinux_sock_hooks));
> + sock_hooks_registered = !sock_hooks_registered;
> + }
> + mutex_unlock(&selinux_net_mutex);
> +}
> diff --git a/security/selinux/include/security.h b/security/selinux/include/security.h
> index 38feb55..0428ab4 100644
> --- a/security/selinux/include/security.h
> +++ b/security/selinux/include/security.h
> @@ -261,6 +261,7 @@ extern void selinux_status_update_setenforce(int enforcing);
> extern void selinux_status_update_policyload(int seqno);
> extern void selinux_complete_init(void);
> extern int selinux_disable(void);
> +extern void selinux_net_update(void);
> extern void exit_sel_fs(void);
> extern struct path selinux_null;
> extern struct vfsmount *selinuxfs_mount;
> diff --git a/security/selinux/ss/services.c b/security/selinux/ss/services.c
> index ebda973..c509018 100644
> --- a/security/selinux/ss/services.c
> +++ b/security/selinux/ss/services.c
> @@ -2016,6 +2016,7 @@ static void security_load_policycaps(void)
> POLICYDB_CAPABILITY_OPENPERM);
> selinux_policycap_alwaysnetwork = ebitmap_get_bit(&policydb.policycaps,
> POLICYDB_CAPABILITY_ALWAYSNETWORK);
> + selinux_net_update();
> }
>
> static int security_preserve_bools(struct policydb *p);
> diff --git a/security/selinux/xfrm.c b/security/selinux/xfrm.c
> index 56e354f..cc2b0d4 100644
> --- a/security/selinux/xfrm.c
> +++ b/security/selinux/xfrm.c
> @@ -112,6 +112,7 @@ static int selinux_xfrm_alloc_user(struct xfrm_sec_ctx **ctxp,
>
> *ctxp = ctx;
> atomic_inc(&selinux_xfrm_refcount);
> + selinux_net_update();
> return 0;
>
> err:
> @@ -128,6 +129,7 @@ static void selinux_xfrm_free(struct xfrm_sec_ctx *ctx)
> return;
>
> atomic_dec(&selinux_xfrm_refcount);
> + selinux_net_update();
> kfree(ctx);
> }
>
> @@ -303,6 +305,7 @@ int selinux_xfrm_policy_clone(struct xfrm_sec_ctx *old_ctx,
> if (!new_ctx)
> return -ENOMEM;
> atomic_inc(&selinux_xfrm_refcount);
> + selinux_net_update();
> *new_ctxp = new_ctx;
>
> return 0;
> @@ -370,6 +373,7 @@ int selinux_xfrm_state_alloc_acquire(struct xfrm_state *x,
>
> x->security = ctx;
> atomic_inc(&selinux_xfrm_refcount);
> + selinux_net_update();
> out:
> kfree(ctx_str);
> return rc;
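selinux_net_update() in the quoted patch follows a common pattern. It first does an unlocked check of required-vs-registered state, then re-checks under a mutex before toggling registration. Below is a user-space sketch of that pattern, under these assumptions:

- It uses a single predicate. In the patch, nf_hooks_required() and sock_hooks_required() evaluate the same expression with the terms reordered.
- The flags and counters are hypothetical stand-ins for selinux_secmark_enabled(), selinux_peerlbl_enabled(), selinux_policycap_netpeer and the nf/LSM registration calls.
- The registration flag uses C11 atomics so the unlocked read is well defined in user space.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the SELinux state consulted by the patch. */
static bool secmark_on, peerlbl_on;
static bool policycap_netpeer = true;
static bool module_enabled = true;

static atomic_bool hooks_registered;
static unsigned int register_calls, unregister_calls;
static pthread_mutex_t net_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same shape as nf_hooks_required()/sock_hooks_required() in the patch. */
static bool hooks_required(void)
{
	return (secmark_on || peerlbl_on || !policycap_netpeer) && module_enabled;
}

static void net_update(void)
{
	/* Unlocked fast path: nothing to do if the state already matches. */
	if (hooks_required() == atomic_load(&hooks_registered))
		return;

	pthread_mutex_lock(&net_mutex);
	/* Re-check under the lock: another caller may have won the race. */
	if (hooks_required() != atomic_load(&hooks_registered)) {
		if (!atomic_load(&hooks_registered))
			register_calls++;   /* would register the nf/LSM hooks */
		else
			unregister_calls++; /* would unregister them */
		atomic_store(&hooks_registered, !atomic_load(&hooks_registered));
	}
	pthread_mutex_unlock(&net_mutex);
}
```

Because of the re-check under the lock, concurrent callers cannot register twice. Whether the unlocked pre-check is safe in the kernel depends on how the inputs are published, which this sketch does not model.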
^ permalink raw reply
* [PATCH net-next v2] macvlan: Support interface operstate properly
From: Debabrata Banerjee @ 2016-04-06 22:36 UTC (permalink / raw)
To: Nikolay Aleksandrov, Patrick McHardy, netdev; +Cc: Debabrata Banerjee
Set appropriate macvlan interface status based on lower device and our
status. Can be up, down, or lowerlayerdown.
de7d244d0 improved operstate by setting it from unknown to up; however,
it did not handle transferring down or lowerlayerdown.
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
---
v2: Fix locking and update commit message
drivers/net/macvlan.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 45 insertions(+), 2 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 2bcf1f3..306124ba 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -91,6 +91,7 @@ static struct macvlan_port *macvlan_port_get_rtnl(const struct net_device *dev)
}
#define macvlan_port_exists(dev) (dev->priv_flags & IFF_MACVLAN_PORT)
+#define is_macvlan(dev) (dev->priv_flags & IFF_MACVLAN)
static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
const unsigned char *addr)
@@ -1242,6 +1243,28 @@ static int macvlan_changelink_sources(struct macvlan_dev *vlan, u32 mode,
return 0;
}
+static void macvlan_set_operstate(struct net_device *lowerdev,
+ struct net_device *dev)
+{
+ unsigned char newstate = dev->operstate;
+
+ if (!(dev->flags & IFF_UP))
+ newstate = IF_OPER_DOWN;
+ else if ((lowerdev->flags & IFF_UP) && netif_oper_up(lowerdev))
+ newstate = IF_OPER_UP;
+ else
+ newstate = IF_OPER_LOWERLAYERDOWN;
+
+ write_lock_bh(&dev_base_lock);
+ if (dev->operstate != newstate) {
+ dev->operstate = newstate;
+ write_unlock_bh(&dev_base_lock);
+ netdev_state_change(dev);
+ } else {
+ write_unlock_bh(&dev_base_lock);
+ }
+}
+
int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
{
@@ -1324,6 +1347,7 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
list_add_tail_rcu(&vlan->list, &port->vlans);
netif_stacked_transfer_operstate(lowerdev, dev);
+ macvlan_set_operstate(lowerdev, dev);
linkwatch_fire_event(dev);
return 0;
@@ -1518,17 +1542,36 @@ static int macvlan_device_event(struct notifier_block *unused,
struct macvlan_port *port;
LIST_HEAD(list_kill);
- if (!macvlan_port_exists(dev))
+ if (!macvlan_port_exists(dev) && !is_macvlan(dev))
+ return NOTIFY_DONE;
+
+ if (is_macvlan(dev)) {
+ vlan = netdev_priv(dev);
+
+ switch (event) {
+ case NETDEV_UP:
+ case NETDEV_DOWN:
+ case NETDEV_CHANGE:
+ netif_stacked_transfer_operstate(vlan->lowerdev,
+ vlan->dev);
+ macvlan_set_operstate(vlan->lowerdev, vlan->dev);
+ break;
+ }
+
return NOTIFY_DONE;
+ }
port = macvlan_port_get_rtnl(dev);
switch (event) {
case NETDEV_UP:
+ case NETDEV_DOWN:
case NETDEV_CHANGE:
- list_for_each_entry(vlan, &port->vlans, list)
+ list_for_each_entry(vlan, &port->vlans, list) {
netif_stacked_transfer_operstate(vlan->lowerdev,
vlan->dev);
+ macvlan_set_operstate(vlan->lowerdev, vlan->dev);
+ }
break;
case NETDEV_FEAT_CHANGE:
list_for_each_entry(vlan, &port->vlans, list) {
--
2.8.0
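The state selection in macvlan_set_operstate() above reduces to a small decision table. The user-space sketch below restates it as a pure function so each combination can be checked.

- The enum and function names are hypothetical.
- `lower_oper_up` stands in for netif_oper_up(lowerdev).
- A `true` return from update_operstate() marks where the patch calls netdev_state_change().

```c
#include <stdbool.h>

/* Hypothetical mirror of the IF_OPER_* values used by the patch. */
enum toy_oper { TOY_OPER_DOWN, TOY_OPER_LOWERLAYERDOWN, TOY_OPER_UP };

/* Pick the new state the same way macvlan_set_operstate() does. */
static enum toy_oper pick_operstate(bool dev_up, bool lower_up, bool lower_oper_up)
{
	if (!dev_up)
		return TOY_OPER_DOWN;
	if (lower_up && lower_oper_up)
		return TOY_OPER_UP;
	return TOY_OPER_LOWERLAYERDOWN;
}

/* Update *cur; return true only if it changed (i.e. a notification is due). */
static bool update_operstate(enum toy_oper *cur, bool dev_up, bool lower_up,
			     bool lower_oper_up)
{
	enum toy_oper next = pick_operstate(dev_up, lower_up, lower_oper_up);

	if (*cur == next)
		return false;
	*cur = next;
	return true;
}
```

Only calling the notifier on an actual change keeps repeated NETDEV_CHANGE events on the lower device from producing duplicate notifications for the macvlan.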
^ permalink raw reply related
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Florian Westphal @ 2016-04-06 22:14 UTC (permalink / raw)
To: Paul Moore
Cc: Paolo Abeni, linux-security-module, David S. Miller, James Morris,
Andreas Gruenbacher, Stephen Smalley, Florian Westphal, netdev,
selinux
In-Reply-To: <CAHC9VhS6aJdPR6xTH2-ehikS5qvj6jFbZAUtzoBXp+WC9Ugi=Q@mail.gmail.com>
Paul Moore <paul@paul-moore.com> wrote:
> On Wed, Apr 6, 2016 at 5:51 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > Currently, selinux always registers iptables POSTROUTING hooks regardless of
> > whether the running policy needs them to perform any action.
> >
> > The socket_sock_rcv_skb() hook is also always registered, but it can result in a no-op
> > depending on the current policy configuration.
> >
> > The above invocations in the kernel datapath cause measurable
> > overhead in networking performance tests.
> >
> > This patch series adds explicit notification for netlabel status changes
> > (other relevant status changes, like xfrm and secmark, are already notified to
> > LSM) and uses this information in selinux to register the above hooks only when
> > the current status makes them relevant, deregistering them when they are no-ops.
> >
> > In a netperf UDP_STREAM test with small packets, avoiding the LSM hooks overhead
> > gives about 5% performance improvement on rx and about 8% on tx.
>
> [NOTE: added the SELinux mailing list to the CC line, please include
> when submitting SELinux patches]
>
> While I appreciate the patch and the work that went into development
> and testing, I'm going to reject this patch on the grounds that it
> conflicts with work we've just started thinking about which should
> bring some tangible security benefit.
>
> The recent addition of post-init read only memory opens up some
> interesting possibilities for SELinux and LSMs in general, the thing
> which we've just started looking at is marking the LSM hook structure
> read only after init. There are some complicating factors for
> SELinux, but I'm confident those can be resolved, and from what I can
> tell marking the hooks read only will have no effect on other LSMs.
> While marking the LSM hook structure read only doesn't directly affect the
> SELinux netfilter hooks, once we remove the ability to deregister the
> LSM hooks we will have no need to support deregistering netfilter
> hooks and I expect we will drop that functionality as well to help
> decrease the risk of tampering.
netfilter hooks are per namespace -- so hooks do get unregistered when a
netns is destroyed.
Do you think it makes sense to rework the patch to delay registering
of the netfilter hooks until the system is in a state where they're
needed, without the 'unregister' aspect?
Ideally this would even be per netns -- in a perfect world we would
be able to make it so that a new netns is created with an empty
hook list.
Thanks.
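The quoted text above discusses marking the LSM hook structure read only after init. The kernel does this with the __ro_after_init annotation, which this sketch does not reproduce. As a rough user-space analogue, a hook table can be placed on its own page and write-protected with mprotect() once initialization is done. The table layout and `allow_all()` are hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

typedef int (*rcv_hook_fn)(int pkt);

struct hook_table {
	rcv_hook_fn sock_rcv_skb;
};

static int allow_all(int pkt)
{
	(void)pkt;
	return 0; /* accept everything */
}

/* Set up the table on its own page, then drop write permission. */
static const struct hook_table *hook_table_init(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *mem;
	struct hook_table *t;

	if (page <= 0)
		return NULL;
	mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return NULL;

	t = mem;
	t->sock_rcv_skb = allow_all;

	/* "End of init": from here on, writes to the table fault. */
	if (mprotect(mem, (size_t)page, PROT_READ) != 0) {
		munmap(mem, (size_t)page);
		return NULL;
	}
	return t;
}
```

Once the page is sealed, overwriting a hook pointer faults instead of silently redirecting control flow. That is the tampering concern behind dropping hook deregistration altogether.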
^ permalink raw reply
* Re: [PATCH net-next] net: intel: remove dead links
From: Jeff Kirsher @ 2016-04-06 22:08 UTC (permalink / raw)
To: David Miller, jbenc
Cc: netdev, jesse.brandeburg, shannon.nelson, carolyn.wyborny,
donald.c.skidmore, bruce.w.allan, john.ronciak, mitch.a.williams,
intel-wired-lan
In-Reply-To: <20160406.165451.42694079252458681.davem@davemloft.net>
[-- Attachment #1: Type: text/plain, Size: 1040 bytes --]
On Wed, 2016-04-06 at 16:54 -0400, David Miller wrote:
> From: Jiri Benc <jbenc@redhat.com>
> Date: Tue, 5 Apr 2016 16:25:07 +0200
>
> >
> > The Kconfig for Intel NICs references two different URLs for the
> > "Adapter & Driver ID Guide". Neither of those two links works. The
> > current URL seems to be
> > http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005584.html
> > but given it's apparently constantly changing, there's no point in
> > having it in the help text.
> >
> > Just keep a generic pointer to http://support.intel.com. Hopefully,
> > this one will have a longer life. It still works, at least.
> >
> > Furthermore, remove a link to "the latest Intel PRO/100 network
> > driver for Linux", as it has no place in the mainline kernel and the
> > latest Linux driver it offers is from 2006, anyway.
> >
> > Signed-off-by: Jiri Benc <jbenc@redhat.com>
> I expect Jeff to pull this into his tree, thanks.
Yep, got it queued up.
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
^ permalink raw reply
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Paul Moore @ 2016-04-06 21:43 UTC (permalink / raw)
To: Casey Schaufler
Cc: Paolo Abeni, linux-security-module, David S. Miller, James Morris,
Andreas Gruenbacher, Stephen Smalley, Florian Westphal, netdev
In-Reply-To: <5705819A.3030809@schaufler-ca.com>
On Wed, Apr 6, 2016 at 5:37 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> On 4/6/2016 2:51 AM, Paolo Abeni wrote:
>> Currently, selinux always registers iptables POSTROUTING hooks regardless of
>> whether the running policy needs them to perform any action.
>>
>> The socket_sock_rcv_skb() hook is also always registered, but it can result in a no-op
>> depending on the current policy configuration.
>>
>> The above invocations in the kernel datapath cause measurable
>> overhead in networking performance tests.
>>
>> This patch series adds explicit notification for netlabel status changes
>> (other relevant status changes, like xfrm and secmark, are already notified to
>> LSM) and uses this information in selinux to register the above hooks only when
>> the current status makes them relevant, deregistering them when they are no-ops.
>>
>> In a netperf UDP_STREAM test with small packets, avoiding the LSM hooks overhead
>> gives about 5% performance improvement on rx and about 8% on tx.
>>
>> Paolo Abeni (2):
>> security: add hook for netlabel status change notification
>> selinux: implement support for dynamic net hook [de-]registration
>>
>> include/linux/lsm_hooks.h | 6 ++++
>> include/linux/security.h | 5 +++
>> net/netlabel/netlabel_cipso_v4.c | 8 +++--
>> net/netlabel/netlabel_unlabeled.c | 5 ++-
>> security/security.c | 7 ++++
>> security/selinux/hooks.c | 72 +++++++++++++++++++++++++++++++------
>> security/selinux/include/security.h | 1 +
>> security/selinux/ss/services.c | 1 +
>> security/selinux/xfrm.c | 4 +++
>> 9 files changed, 96 insertions(+), 13 deletions(-)
>>
>
> Is there a patch 1/2?
Yes, there was (it was the "security: add hook ..." patch), but for
some reason it hasn't hit the archive that I normally use. Odd.
I'll fwd the patch to you off-list so as not to spam everyone again.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Casey Schaufler @ 2016-04-06 21:43 UTC (permalink / raw)
To: Paolo Abeni, linux-security-module
Cc: David S. Miller, James Morris, Paul Moore, Andreas Gruenbacher,
Stephen Smalley, Florian Westphal, netdev
In-Reply-To: <cover.1459934322.git.pabeni@redhat.com>
On 4/6/2016 2:51 AM, Paolo Abeni wrote:
> Currently, selinux always registers iptables POSTROUTING hooks regardless of
> whether the running policy needs them to perform any action.
>
> The socket_sock_rcv_skb() hook is also always registered, but it can result in a no-op
> depending on the current policy configuration.
>
> The above invocations in the kernel datapath cause measurable
> overhead in networking performance tests.
>
> This patch series adds explicit notification for netlabel status changes
> (other relevant status changes, like xfrm and secmark, are already notified to
> LSM) and uses this information in selinux to register the above hooks only when
> the current status makes them relevant, deregistering them when they are no-ops.
>
> In a netperf UDP_STREAM test with small packets, avoiding the LSM hooks overhead
> gives about 5% performance improvement on rx and about 8% on tx.
>
> Paolo Abeni (2):
> security: add hook for netlabel status change notification
> selinux: implement support for dynamic net hook [de-]registration
Did you consider the fact that netlabel and the LSM socket
hooks are used by Smack as well as SELinux? Did you measure the
impact that your changes have on Smack? Does Smack even work
with your changes?
>
> include/linux/lsm_hooks.h | 6 ++++
> include/linux/security.h | 5 +++
> net/netlabel/netlabel_cipso_v4.c | 8 +++--
> net/netlabel/netlabel_unlabeled.c | 5 ++-
> security/security.c | 7 ++++
> security/selinux/hooks.c | 72 +++++++++++++++++++++++++++++++------
> security/selinux/include/security.h | 1 +
> security/selinux/ss/services.c | 1 +
> security/selinux/xfrm.c | 4 +++
> 9 files changed, 96 insertions(+), 13 deletions(-)
>
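The cover letter describes registering the hooks only while some labeling mechanism (netlabel, xfrm or secmark) makes them relevant. A minimal userspace sketch of that toggle logic follows, assuming a single "needed" decision derived from those three flags; `struct net_label_state`, `hooks_needed()` and `on_status_change()` are hypothetical names, and the comments only mark where a kernel implementation would register or unregister its netfilter hooks. It is not the code from the patch series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags for the three labeling sources named in the
 * cover letter: netlabel, xfrm and secmark. */
struct net_label_state {
	bool netlabel_enabled;
	bool xfrm_enabled;
	bool secmark_enabled;
};

/* Hypothetical stand-ins for "hooks are currently registered" and
 * counters so the toggling can be observed. */
static bool hooks_registered;
static int register_calls, unregister_calls;

/* The hooks are only worth running if at least one source is active. */
static bool hooks_needed(const struct net_label_state *s)
{
	return s->netlabel_enabled || s->xfrm_enabled || s->secmark_enabled;
}

/* Called on every status-change notification; only touches the hooks
 * when the "needed" answer actually flips. */
static void on_status_change(const struct net_label_state *s)
{
	bool need = hooks_needed(s);

	if (need && !hooks_registered) {
		hooks_registered = true;	/* kernel: register nf hooks here */
		register_calls++;
	} else if (!need && hooks_registered) {
		hooks_registered = false;	/* kernel: unregister nf hooks here */
		unregister_calls++;
	}

	/* Registration and unregistration must strictly alternate. */
	assert(register_calls - unregister_calls == (hooks_registered ? 1 : 0));
}
```

Because `on_status_change()` only acts when the answer flips, repeated notifications are cheap no-ops, and the hooks stay out of the datapath whenever no labeling mechanism is in use.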
* Re: [PATCH net-next 0/7] sctp: support sctp_diag in kernel
From: marcelo.leitner @ 2016-04-06 21:42 UTC
To: David Miller; +Cc: lucien.xin, netdev, linux-sctp, vyasevich, daniel
In-Reply-To: <20160406.161312.1132930117025273561.davem@davemloft.net>
On Wed, Apr 06, 2016 at 04:13:12PM -0400, David Miller wrote:
> From: Xin Long <lucien.xin@gmail.com>
> Date: Tue, 5 Apr 2016 12:06:25 +0800
>
> > This patchset adds an sctp_diag module that implements the diag interface
> > for sctp in the kernel.
> ...
>
> This series looks generally fine to me, but I'd like to see some review from
> SCTP experts before I apply this series.
>
> Thanks.
Will finish reviewing it tomorrow. Thanks for reviewing it already.
* Re: [PATCH] macvlan: Support interface operstate properly
From: Nikolay Aleksandrov @ 2016-04-06 21:42 UTC
To: Banerjee, Debabrata, Patrick McHardy, netdev@vger.kernel.org
In-Reply-To: <53A6C2CA-9C49-4F85-97F6-DD60276AFCC2@akamai.com>
On 04/06/2016 11:26 PM, Banerjee, Debabrata wrote:
> On 4/6/16, 5:03 PM, "Nikolay Aleksandrov" <nikolay@cumulusnetworks.com> wrote:
>
>
>> On 04/06/2016 10:30 PM, Debabrata Banerjee wrote:
>>> Set the appropriate macvlan interface status based on the lower device's
>>> status and our own. It can be up, down, or lowerlayerdown.
>>>
>>> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
>>>
>>
>> May I ask what exactly you're fixing here? I recently had to make macvlan's
>> operstates more accurate, and I haven't experienced any wrong behaviour since commit
>> de7d244d0a35 ("macvlan: make operstate and carrier more accurate").
>
> Yes, I saw the other patch; it's an improvement over where things stood when I started working on this.
>
>
>> Also, it's linkwatch's job to take care of the proper operstate. We can use
>> netif_stacked_transfer_operstate() to help it, but I don't think setting
>> operstate directly is a good idea.
>
> This patch was modeled after __hsr_set_operstate(), but I agree there are probably
> better ways to do it. I'm not sure why netif_stacked_transfer_operstate() doesn't do
> this itself; in the case of a layered device, my patch actually uses the other
> possible state, lowerlayerdown. Without the patch, operstate goes directly to
> down.
>
>>
>> One more thing: you cannot call netdev_state_change() under the write_lock, as it
>> may sleep.
>
> You're right. I can resubmit with the call moved out of the critical section, if the
> patch is likely to be taken.
>
I don't know if it'll be taken, but you can submit v2 for review. I'll review and
test it tomorrow, as it's late here and I'm tired. :-)
Since this is not a bug fix, I'd suggest targeting net-next, and you
don't have to CC linux-kernel@vger.kernel.org.
Thanks for the explanation. I misread part of the patch at first and was confused,
but I get the idea now.
Cheers,
Nik
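The patch description gives three possible results for the macvlan device: up, down, or lowerlayerdown. One way to express that decision is sketched below as a plain function, assuming the upper device's own administrative state and a single "lower device is up" flag are the only inputs. `upper_operstate()` and the `OPER_*` enum are hypothetical names; the enum values mirror the RFC 2863 `IF_OPER_*` constants in `<linux/if.h>`. This only illustrates the decision table, not where the kernel should apply it (the thread notes that linkwatch owns the operstate).

```c
#include <assert.h>
#include <stdbool.h>

/* RFC 2863 operational states; values mirror IF_OPER_* in <linux/if.h>. */
enum oper_state {
	OPER_UNKNOWN = 0,
	OPER_NOTPRESENT = 1,
	OPER_DOWN = 2,
	OPER_LOWERLAYERDOWN = 3,
	OPER_TESTING = 4,
	OPER_DORMANT = 5,
	OPER_UP = 6,
};

/* Guard against the local copy drifting from the kernel's numbering. */
static_assert(OPER_LOWERLAYERDOWN == 3 && OPER_UP == 6,
	      "values must match IF_OPER_* in <linux/if.h>");

/* Hypothetical helper: derive the upper (macvlan) device's operstate
 * from its own admin state and whether the lower device is up. */
static enum oper_state upper_operstate(bool upper_admin_up, bool lower_up)
{
	if (!upper_admin_up)
		return OPER_DOWN;		/* we are administratively down */
	if (!lower_up)
		return OPER_LOWERLAYERDOWN;	/* we are up, the lower device is not */
	return OPER_UP;
}
```

The distinction the patch cares about is the middle branch: without it, an administratively-up macvlan on a dead lower device would report plain down instead of lowerlayerdown.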
* Re: [RFC PATCH 0/2] selinux: avoid nf hooks overhead when not needed
From: Casey Schaufler @ 2016-04-06 21:37 UTC
To: Paolo Abeni, linux-security-module
Cc: David S. Miller, James Morris, Paul Moore, Andreas Gruenbacher,
Stephen Smalley, Florian Westphal, netdev
In-Reply-To: <cover.1459934322.git.pabeni@redhat.com>
On 4/6/2016 2:51 AM, Paolo Abeni wrote:
> Currently, selinux always registers iptables POSTROUTING hooks, regardless of
> whether the running policy needs them to perform any action.
>
> Similarly, the socket_sock_rcv_skb() hook is always registered, even though it
> can be a no-op depending on the current policy configuration.
>
> These invocations in the kernel datapath cause measurable overhead in
> networking performance tests.
>
> This patch series adds explicit notification of netlabel status changes
> (other relevant status changes, such as xfrm and secmark, are already notified
> to the LSM) and uses this information in selinux to register the above hooks
> only when the current status makes them relevant, deregistering them when they
> would be a no-op.
>
> Avoiding the LSM hook overhead gives about a 5% performance improvement on rx
> and about 8% on tx in a netperf UDP_STREAM test with small packets.
>
> Paolo Abeni (2):
> security: add hook for netlabel status change notification
> selinux: implement support for dynamic net hook [de-]registration
>
> include/linux/lsm_hooks.h | 6 ++++
> include/linux/security.h | 5 +++
> net/netlabel/netlabel_cipso_v4.c | 8 +++--
> net/netlabel/netlabel_unlabeled.c | 5 ++-
> security/security.c | 7 ++++
> security/selinux/hooks.c | 72 +++++++++++++++++++++++++++++++------
> security/selinux/include/security.h | 1 +
> security/selinux/ss/services.c | 1 +
> security/selinux/xfrm.c | 4 +++
> 9 files changed, 96 insertions(+), 13 deletions(-)
>
Is there a patch 1/2?
* Re: [PATCH] macvlan: Support interface operstate properly
From: Banerjee, Debabrata @ 2016-04-06 21:26 UTC
To: Nikolay Aleksandrov, Patrick McHardy, netdev@vger.kernel.org
In-Reply-To: <570579AA.6050400@cumulusnetworks.com>
On 4/6/16, 5:03 PM, "Nikolay Aleksandrov" <nikolay@cumulusnetworks.com> wrote:
>On 04/06/2016 10:30 PM, Debabrata Banerjee wrote:
>> Set the appropriate macvlan interface status based on the lower device's
>> status and our own. It can be up, down, or lowerlayerdown.
>>
>> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
>>
>
>May I ask what exactly you're fixing here? I recently had to make macvlan's
>operstates more accurate, and I haven't experienced any wrong behaviour since commit
>de7d244d0a35 ("macvlan: make operstate and carrier more accurate").
Yes, I saw the other patch; it's an improvement over where things stood when I started working on this.
>Also, it's linkwatch's job to take care of the proper operstate. We can use
>netif_stacked_transfer_operstate() to help it, but I don't think setting
>operstate directly is a good idea.
This patch was modeled after __hsr_set_operstate(), but I agree there are probably
better ways to do it. I'm not sure why netif_stacked_transfer_operstate() doesn't do
this itself; in the case of a layered device, my patch actually uses the other
possible state, lowerlayerdown. Without the patch, operstate goes directly to
down.
>
>One more thing: you cannot call netdev_state_change() under the write_lock, as it
>may sleep.
You're right. I can resubmit with the call moved out of the critical section, if the
patch is likely to be taken.
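The fix discussed here, moving the `netdev_state_change()` call out of the critical section, follows a common pattern: decide under the lock, remember whether anything changed, and notify only after unlocking. A minimal userspace model of that pattern follows; `lock_held`, `model_write_lock()`, `model_write_unlock()`, `notify_state_change()` and `set_state()` are hypothetical stand-ins, with an `assert()` playing the role of a might-sleep check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: 'lock_held' stands in for holding a write lock,
 * a context in which sleeping is forbidden. */
static bool lock_held;
static int notifications;
static int cur_state;

static void model_write_lock(void)   { lock_held = true; }
static void model_write_unlock(void) { lock_held = false; }

/* Stand-in for a notifier that may sleep: it must never run while the
 * lock is held, mimicking a might-sleep check. */
static void notify_state_change(void)
{
	assert(!lock_held);
	notifications++;
}

/* Update the state under the lock, remember whether it changed, and
 * only send the notification after the lock has been dropped. */
static void set_state(int new_state)
{
	bool changed;

	model_write_lock();
	changed = (cur_state != new_state);
	cur_state = new_state;
	model_write_unlock();

	if (changed)
		notify_state_change();
}
```

Moving the `notify_state_change()` call back inside the locked region would trip the assert, which is the userspace analogue of the problem pointed out above.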
* Re: [PATCH] macvlan: Support interface operstate properly
From: Nikolay Aleksandrov @ 2016-04-06 21:27 UTC
To: Debabrata Banerjee, Patrick McHardy, netdev
In-Reply-To: <570579AA.6050400@cumulusnetworks.com>
On 04/06/2016 11:03 PM, Nikolay Aleksandrov wrote:
> On 04/06/2016 10:30 PM, Debabrata Banerjee wrote:
>> Set the appropriate macvlan interface status based on the lower device's
>> status and our own. It can be up, down, or lowerlayerdown.
>>
>> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
>>
>
> May I ask what exactly you're fixing here? I recently had to make macvlan's
> operstates more accurate, and I haven't experienced any wrong behaviour since commit
> de7d244d0a35 ("macvlan: make operstate and carrier more accurate").
> Also, it's linkwatch's job to take care of the proper operstate. We can use
> netif_stacked_transfer_operstate() to help it, but I don't think setting
> operstate directly is a good idea.
I misread part of the change and get it now. My comment below still stands, though:
> One more thing: you cannot call netdev_state_change() under the write_lock, as it
> may sleep.
>