Netdev List
 help / color / mirror / Atom feed
* [PATCH v6 bpf-next 3/9] tools/bpf: sync include/uapi/linux/bpf.h
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

sync bpf.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 60b99b730a41..86f7c438d40f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2422,7 +2422,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(spin_lock),			\
+	FN(spin_unlock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3056,4 +3058,7 @@ struct bpf_line_info {
 	__u32	line_col;
 };
 
+struct bpf_spin_lock {
+	__u32	val;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 1/9] bpf: introduce bpf_spin_lock
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
bpf program serialize access to other variables.

Example:
struct hash_elem {
    int cnt;
    struct bpf_spin_lock lock;
};
struct hash_elem * val = bpf_map_lookup_elem(&hash_map, &key);
if (val) {
    bpf_spin_lock(&val->lock);
    val->cnt++;
    bpf_spin_unlock(&val->lock);
}

Restrictions and safety checks:
- bpf_spin_lock is only allowed inside HASH and ARRAY maps.
- BTF description of the map is mandatory for safety analysis.
- bpf program can take one bpf_spin_lock at a time, since two or more can
  cause dead locks.
- only one 'struct bpf_spin_lock' is allowed per map element.
  It drastically simplifies implementation yet allows bpf program to use
  any number of bpf_spin_locks.
- when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
- bpf program must bpf_spin_unlock() before return.
- bpf program can access 'struct bpf_spin_lock' only via
  bpf_spin_lock()/bpf_spin_unlock() helpers.
- load/store into 'struct bpf_spin_lock lock;' field is not allowed.
- to use bpf_spin_lock() helper the BTF description of map value must be
  a struct and have 'struct bpf_spin_lock anyname;' field at the top level.
  Nested lock inside another struct is not allowed.
- syscall map_lookup doesn't copy bpf_spin_lock field to user space.
- syscall map_update and program map_update do not update bpf_spin_lock field.
- bpf_spin_lock cannot be on the stack or inside networking packet.
  bpf_spin_lock can only be inside HASH or ARRAY map value.
- bpf_spin_lock is available to root only and to all program types.
- bpf_spin_lock is not allowed in inner maps of map-in-map.
- ld_abs is not allowed inside spin_lock-ed region.
- tracing progs and socket filter progs cannot use bpf_spin_lock due to
  insufficient preemption checks

Implementation details:
- cgroup-bpf class of programs can nest with xdp/tc programs.
  Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
  Other solutions to avoid nested bpf_spin_lock are possible.
  Like making sure that all networking progs run with softirq disabled.
  spin_lock_irqsave is the simplest and doesn't add overhead to the
  programs that don't use it.
- arch_spinlock_t is used when its implemented as queued_spin_lock
- archs can force their own arch_spinlock_t
- on architectures where queued_spin_lock is not available and
  sizeof(arch_spinlock_t) != sizeof(__u32) trivial lock is used.
- presence of bpf_spin_lock inside map value could have been indicated via
  extra flag during map_create, but specifying it via BTF is cleaner.
  It provides introspection for map key/value and reduces user mistakes.

Next steps:
- allow bpf_spin_lock in other map types (like cgroup local storage)
- introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
  to request kernel to grab bpf_spin_lock before rewriting the value.
  That will serialize access to map elements.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h          |  37 +++++++-
 include/linux/bpf_verifier.h |   1 +
 include/linux/btf.h          |   1 +
 include/uapi/linux/bpf.h     |   7 +-
 kernel/Kconfig.locks         |   3 +
 kernel/bpf/arraymap.c        |   7 +-
 kernel/bpf/btf.c             |  42 +++++++++
 kernel/bpf/core.c            |   2 +
 kernel/bpf/hashtab.c         |  21 +++--
 kernel/bpf/helpers.c         |  80 +++++++++++++++++
 kernel/bpf/map_in_map.c      |   5 ++
 kernel/bpf/syscall.c         |  21 ++++-
 kernel/bpf/verifier.c        | 169 ++++++++++++++++++++++++++++++++++-
 net/core/filter.c            |  16 +++-
 14 files changed, 386 insertions(+), 26 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0394f1f9213b..2ae615b48bb8 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -72,14 +72,15 @@ struct bpf_map {
 	u32 value_size;
 	u32 max_entries;
 	u32 map_flags;
-	u32 pages;
+	int spin_lock_off; /* >=0 valid offset, <0 error */
 	u32 id;
 	int numa_node;
 	u32 btf_key_type_id;
 	u32 btf_value_type_id;
 	struct btf *btf;
+	u32 pages;
 	bool unpriv_array;
-	/* 55 bytes hole */
+	/* 51 bytes hole */
 
 	/* The 3rd and 4th cacheline with misc members to avoid false sharing
 	 * particularly with refcounting.
@@ -91,6 +92,34 @@ struct bpf_map {
 	char name[BPF_OBJ_NAME_LEN];
 };
 
+static inline bool map_value_has_spin_lock(const struct bpf_map *map)
+{
+	return map->spin_lock_off >= 0;
+}
+
+static inline void check_and_init_map_lock(struct bpf_map *map, void *dst)
+{
+	if (likely(!map_value_has_spin_lock(map)))
+		return;
+	*(struct bpf_spin_lock *)(dst + map->spin_lock_off) =
+		(struct bpf_spin_lock){};
+}
+
+/* copy everything but bpf_spin_lock */
+static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
+{
+	if (unlikely(map_value_has_spin_lock(map))) {
+		u32 off = map->spin_lock_off;
+
+		memcpy(dst, src, off);
+		memcpy(dst + off + sizeof(struct bpf_spin_lock),
+		       src + off + sizeof(struct bpf_spin_lock),
+		       map->value_size - off - sizeof(struct bpf_spin_lock));
+	} else {
+		memcpy(dst, src, map->value_size);
+	}
+}
+
 struct bpf_offload_dev;
 struct bpf_offloaded_map;
 
@@ -162,6 +191,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_CTX,		/* pointer to context */
 	ARG_ANYTHING,		/* any (initialized) argument is ok */
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
+	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 };
 
 /* type of values returned from helper functions */
@@ -879,7 +909,8 @@ extern const struct bpf_func_proto bpf_msg_redirect_hash_proto;
 extern const struct bpf_func_proto bpf_msg_redirect_map_proto;
 extern const struct bpf_func_proto bpf_sk_redirect_hash_proto;
 extern const struct bpf_func_proto bpf_sk_redirect_map_proto;
-
+extern const struct bpf_func_proto bpf_spin_lock_proto;
+extern const struct bpf_func_proto bpf_spin_unlock_proto;
 extern const struct bpf_func_proto bpf_get_local_storage_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 0620e418dde5..69f7a3449eda 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -148,6 +148,7 @@ struct bpf_verifier_state {
 	/* call stack tracking */
 	struct bpf_func_state *frame[MAX_CALL_FRAMES];
 	u32 curframe;
+	u32 active_spin_lock;
 	bool speculative;
 };
 
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 12502e25e767..455d31b55828 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -50,6 +50,7 @@ u32 btf_id(const struct btf *btf);
 bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   const struct btf_member *m,
 			   u32 expected_offset, u32 expected_size);
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
 
 #ifdef CONFIG_BPF_SYSCALL
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 60b99b730a41..86f7c438d40f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2422,7 +2422,9 @@ union bpf_attr {
 	FN(map_peek_elem),		\
 	FN(msg_push_data),		\
 	FN(msg_pop_data),		\
-	FN(rc_pointer_rel),
+	FN(rc_pointer_rel),		\
+	FN(spin_lock),			\
+	FN(spin_unlock),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -3056,4 +3058,7 @@ struct bpf_line_info {
 	__u32	line_col;
 };
 
+struct bpf_spin_lock {
+	__u32	val;
+};
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 84d882f3e299..fbba478ae522 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -242,6 +242,9 @@ config QUEUED_SPINLOCKS
 	def_bool y if ARCH_USE_QUEUED_SPINLOCKS
 	depends on SMP
 
+config BPF_ARCH_SPINLOCK
+	bool
+
 config ARCH_USE_QUEUED_RWLOCKS
 	bool
 
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 25632a75d630..d6d979910a2a 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -270,9 +270,10 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 		memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
 		       value, map->value_size);
 	else
-		memcpy(array->value +
-		       array->elem_size * (index & array->index_mask),
-		       value, map->value_size);
+		copy_map_value(map,
+			       array->value +
+			       array->elem_size * (index & array->index_mask),
+			       value);
 	return 0;
 }
 
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 3d661f0606fe..7019c1f05cab 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -355,6 +355,11 @@ static bool btf_type_is_struct(const struct btf_type *t)
 	return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
 }
 
+static bool __btf_type_is_struct(const struct btf_type *t)
+{
+	return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT;
+}
+
 static bool btf_type_is_array(const struct btf_type *t)
 {
 	return BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY;
@@ -2045,6 +2050,43 @@ static void btf_struct_log(struct btf_verifier_env *env,
 	btf_verifier_log(env, "size=%u vlen=%u", t->size, btf_type_vlen(t));
 }
 
+/* find 'struct bpf_spin_lock' in map value.
+ * return >= 0 offset if found
+ * and < 0 in case of error
+ */
+int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
+{
+	const struct btf_member *member;
+	u32 i, off = -ENOENT;
+
+	if (!__btf_type_is_struct(t))
+		return -EINVAL;
+
+	for_each_member(i, t, member) {
+		const struct btf_type *member_type = btf_type_by_id(btf,
+								    member->type);
+		if (!__btf_type_is_struct(member_type))
+			continue;
+		if (member_type->size != sizeof(struct bpf_spin_lock))
+			continue;
+		if (strcmp(__btf_name_by_offset(btf, member_type->name_off),
+			   "bpf_spin_lock"))
+			continue;
+		if (off != -ENOENT)
+			/* only one 'struct bpf_spin_lock' is allowed */
+			return -E2BIG;
+		off = btf_member_bit_offset(t, member);
+		if (off % 8)
+			/* valid C code cannot generate such BTF */
+			return -EINVAL;
+		off /= 8;
+		if (off % __alignof__(struct bpf_spin_lock))
+			/* valid struct bpf_spin_lock will be 4 byte aligned */
+			return -EINVAL;
+	}
+	return off;
+}
+
 static void btf_struct_seq_show(const struct btf *btf, const struct btf_type *t,
 				u32 type_id, void *data, u8 bits_offset,
 				struct seq_file *m)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index a7bcb23bee84..550d9f9298f3 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2001,6 +2001,8 @@ const struct bpf_func_proto bpf_map_delete_elem_proto __weak;
 const struct bpf_func_proto bpf_map_push_elem_proto __weak;
 const struct bpf_func_proto bpf_map_pop_elem_proto __weak;
 const struct bpf_func_proto bpf_map_peek_elem_proto __weak;
+const struct bpf_func_proto bpf_spin_lock_proto __weak;
+const struct bpf_func_proto bpf_spin_unlock_proto __weak;
 
 const struct bpf_func_proto bpf_get_prandom_u32_proto __weak;
 const struct bpf_func_proto bpf_get_smp_processor_id_proto __weak;
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4b7c76765d9d..6d3b22c5ad68 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -718,21 +718,12 @@ static bool fd_htab_map_needs_adjust(const struct bpf_htab *htab)
 	       BITS_PER_LONG == 64;
 }
 
-static u32 htab_size_value(const struct bpf_htab *htab, bool percpu)
-{
-	u32 size = htab->map.value_size;
-
-	if (percpu || fd_htab_map_needs_adjust(htab))
-		size = round_up(size, 8);
-	return size;
-}
-
 static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 					 void *value, u32 key_size, u32 hash,
 					 bool percpu, bool onallcpus,
 					 struct htab_elem *old_elem)
 {
-	u32 size = htab_size_value(htab, percpu);
+	u32 size = htab->map.value_size;
 	bool prealloc = htab_is_prealloc(htab);
 	struct htab_elem *l_new, **pl_new;
 	void __percpu *pptr;
@@ -770,10 +761,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			l_new = ERR_PTR(-ENOMEM);
 			goto dec_count;
 		}
+		check_and_init_map_lock(&htab->map,
+					l_new->key + round_up(key_size, 8));
 	}
 
 	memcpy(l_new->key, key, key_size);
 	if (percpu) {
+		size = round_up(size, 8);
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
 		} else {
@@ -791,8 +785,13 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 
 		if (!prealloc)
 			htab_elem_set_ptr(l_new, key_size, pptr);
-	} else {
+	} else if (fd_htab_map_needs_adjust(htab)) {
+		size = round_up(size, 8);
 		memcpy(l_new->key + round_up(key_size, 8), value, size);
+	} else {
+		copy_map_value(&htab->map,
+			       l_new->key + round_up(key_size, 8),
+			       value);
 	}
 
 	l_new->hash = hash;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a74972b07e74..29a8a4e62b8a 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -221,6 +221,86 @@ const struct bpf_func_proto bpf_get_current_comm_proto = {
 	.arg2_type	= ARG_CONST_SIZE,
 };
 
+#if defined(CONFIG_QUEUED_SPINLOCKS) || defined(CONFIG_BPF_ARCH_SPINLOCK)
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+	arch_spinlock_t *l = (void *)lock;
+	union {
+		__u32 val;
+		arch_spinlock_t lock;
+	} u = { .lock = __ARCH_SPIN_LOCK_UNLOCKED };
+
+	compiletime_assert(u.val == 0, "__ARCH_SPIN_LOCK_UNLOCKED not 0");
+	BUILD_BUG_ON(sizeof(*l) != sizeof(__u32));
+	BUILD_BUG_ON(sizeof(*lock) != sizeof(__u32));
+	arch_spin_lock(l);
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+	arch_spinlock_t *l = (void *)lock;
+
+	arch_spin_unlock(l);
+}
+
+#else
+
+static inline void __bpf_spin_lock(struct bpf_spin_lock *lock)
+{
+	atomic_t *l = (void *)lock;
+
+	BUILD_BUG_ON(sizeof(*l) != sizeof(*lock));
+	do {
+		atomic_cond_read_relaxed(l, !VAL);
+	} while (atomic_xchg(l, 1));
+}
+
+static inline void __bpf_spin_unlock(struct bpf_spin_lock *lock)
+{
+	atomic_t *l = (void *)lock;
+
+	atomic_set_release(l, 0);
+}
+
+#endif
+
+static DEFINE_PER_CPU(unsigned long, irqsave_flags);
+
+notrace BPF_CALL_1(bpf_spin_lock, struct bpf_spin_lock *, lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__bpf_spin_lock(lock);
+	this_cpu_write(irqsave_flags, flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_spin_lock_proto = {
+	.func		= bpf_spin_lock,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
+};
+
+notrace BPF_CALL_1(bpf_spin_unlock, struct bpf_spin_lock *, lock)
+{
+	unsigned long flags;
+
+	flags = this_cpu_read(irqsave_flags);
+	__bpf_spin_unlock(lock);
+	local_irq_restore(flags);
+	return 0;
+}
+
+const struct bpf_func_proto bpf_spin_unlock_proto = {
+	.func		= bpf_spin_unlock,
+	.gpl_only	= false,
+	.ret_type	= RET_VOID,
+	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
+};
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 52378d3e34b3..583346a0ab29 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -37,6 +37,11 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		return ERR_PTR(-EINVAL);
 	}
 
+	if (map_value_has_spin_lock(inner_map)) {
+		fdput(f);
+		return ERR_PTR(-ENOTSUPP);
+	}
+
 	inner_map_meta_size = sizeof(*inner_map_meta);
 	/* In some cases verifier needs to access beyond just base map. */
 	if (inner_map->ops == &array_map_ops)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b155cd17c1bd..ebf0a673cb83 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -463,7 +463,7 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
-static int map_check_btf(const struct bpf_map *map, const struct btf *btf,
+static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 			 u32 btf_key_id, u32 btf_value_id)
 {
 	const struct btf_type *key_type, *value_type;
@@ -478,6 +478,21 @@ static int map_check_btf(const struct bpf_map *map, const struct btf *btf,
 	if (!value_type || value_size != map->value_size)
 		return -EINVAL;
 
+	map->spin_lock_off = btf_find_spin_lock(btf, value_type);
+
+	if (map_value_has_spin_lock(map)) {
+		if (map->map_type != BPF_MAP_TYPE_HASH &&
+		    map->map_type != BPF_MAP_TYPE_ARRAY)
+			return -ENOTSUPP;
+		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
+		    map->value_size) {
+			WARN_ONCE(1,
+				  "verifier bug spin_lock_off %d value_size %d\n",
+				  map->spin_lock_off, map->value_size);
+			return -EFAULT;
+		}
+	}
+
 	if (map->ops->map_check_btf)
 		ret = map->ops->map_check_btf(map, btf, key_type, value_type);
 
@@ -542,6 +557,8 @@ static int map_create(union bpf_attr *attr)
 		map->btf = btf;
 		map->btf_key_type_id = attr->btf_key_type_id;
 		map->btf_value_type_id = attr->btf_value_type_id;
+	} else {
+		map->spin_lock_off = -EINVAL;
 	}
 
 	err = security_bpf_map_alloc(map);
@@ -740,7 +757,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 			err = -ENOENT;
 		} else {
 			err = 0;
-			memcpy(value, ptr, value_size);
+			copy_map_value(map, value, ptr);
 		}
 		rcu_read_unlock();
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8c1c21cd50b4..b6af9d2cb236 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -213,6 +213,7 @@ struct bpf_call_arg_meta {
 	s64 msize_smax_value;
 	u64 msize_umax_value;
 	int ptr_id;
+	int func_id;
 };
 
 static DEFINE_MUTEX(bpf_verifier_lock);
@@ -351,6 +352,12 @@ static bool reg_is_refcounted(const struct bpf_reg_state *reg)
 	return type_is_refcounted(reg->type);
 }
 
+static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
+{
+	return reg->type == PTR_TO_MAP_VALUE &&
+		map_value_has_spin_lock(reg->map_ptr);
+}
+
 static bool reg_is_refcounted_or_null(const struct bpf_reg_state *reg)
 {
 	return type_is_refcounted_or_null(reg->type);
@@ -712,6 +719,7 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
 	}
 	dst_state->speculative = src->speculative;
 	dst_state->curframe = src->curframe;
+	dst_state->active_spin_lock = src->active_spin_lock;
 	for (i = 0; i <= src->curframe; i++) {
 		dst = dst_state->frame[i];
 		if (!dst) {
@@ -1483,6 +1491,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
 	if (err)
 		verbose(env, "R%d max value is outside of the array range\n",
 			regno);
+
+	if (map_value_has_spin_lock(reg->map_ptr)) {
+		u32 lock = reg->map_ptr->spin_lock_off;
+
+		/* if any part of struct bpf_spin_lock can be touched by
+		 * load/store reject this program
+		 */
+		if ((reg->smin_value + off <= lock &&
+		     lock < reg->umax_value + off + size) ||
+		    (reg->smin_value + off < lock + sizeof(struct bpf_spin_lock) &&
+		     lock + sizeof(struct bpf_spin_lock) <= reg->umax_value + off + size)) {
+			verbose(env, "bpf_spin_lock cannot be accessed directly by load/store\n");
+			return -EACCES;
+		}
+	}
 	return err;
 }
 
@@ -2192,6 +2215,91 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 	}
 }
 
+/* Implementation details:
+ * bpf_map_lookup returns PTR_TO_MAP_VALUE_OR_NULL
+ * Two bpf_map_lookups (even with the same key) will have different reg->id.
+ * For traditional PTR_TO_MAP_VALUE the verifier clears reg->id after
+ * value_or_null->value transition, since the verifier only cares about
+ * the range of access to valid map value pointer and doesn't care about actual
+ * address of the map element.
+ * For maps with 'struct bpf_spin_lock' inside map value the verifier keeps
+ * reg->id > 0 after value_or_null->value transition. By doing so
+ * two bpf_map_lookups will be considered two different pointers that
+ * point to different bpf_spin_locks.
+ * The verifier allows taking only one bpf_spin_lock at a time to avoid
+ * dead-locks.
+ * Since only one bpf_spin_lock is allowed the checks are simpler than
+ * reg_is_refcounted() logic. The verifier needs to remember only
+ * one spin_lock instead of array of acquired_refs.
+ * cur_state->active_spin_lock remembers which map value element got locked
+ * and clears it after bpf_spin_unlock.
+ */
+static int process_spin_lock(struct bpf_verifier_env *env, int regno,
+			     bool is_lock)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	struct bpf_verifier_state *cur = env->cur_state;
+	bool is_const = tnum_is_const(reg->var_off);
+	struct bpf_map *map = reg->map_ptr;
+	u64 val = reg->var_off.value;
+
+	if (reg->type != PTR_TO_MAP_VALUE) {
+		verbose(env, "R%d is not a pointer to map_value\n", regno);
+		return -EINVAL;
+	}
+	if (!is_const) {
+		verbose(env,
+			"R%d doesn't have constant offset. bpf_spin_lock has to be at the constant offset\n",
+			regno);
+		return -EINVAL;
+	}
+	if (!map->btf) {
+		verbose(env,
+			"map '%s' has to have BTF in order to use bpf_spin_lock\n",
+			map->name);
+		return -EINVAL;
+	}
+	if (!map_value_has_spin_lock(map)) {
+		if (map->spin_lock_off == -E2BIG)
+			verbose(env,
+				"map '%s' has more than one 'struct bpf_spin_lock'\n",
+				map->name);
+		else if (map->spin_lock_off == -ENOENT)
+			verbose(env,
+				"map '%s' doesn't have 'struct bpf_spin_lock'\n",
+				map->name);
+		else
+			verbose(env,
+				"map '%s' is not a struct type or bpf_spin_lock is mangled\n",
+				map->name);
+		return -EINVAL;
+	}
+	if (map->spin_lock_off != val + reg->off) {
+		verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock'\n",
+			val + reg->off);
+		return -EINVAL;
+	}
+	if (is_lock) {
+		if (cur->active_spin_lock) {
+			verbose(env,
+				"Locking two bpf_spin_locks are not allowed\n");
+			return -EINVAL;
+		}
+		cur->active_spin_lock = reg->id;
+	} else {
+		if (!cur->active_spin_lock) {
+			verbose(env, "bpf_spin_unlock without taking a lock\n");
+			return -EINVAL;
+		}
+		if (cur->active_spin_lock != reg->id) {
+			verbose(env, "bpf_spin_unlock of different lock\n");
+			return -EINVAL;
+		}
+		cur->active_spin_lock = 0;
+	}
+	return 0;
+}
+
 static bool arg_type_is_mem_ptr(enum bpf_arg_type type)
 {
 	return type == ARG_PTR_TO_MEM ||
@@ -2268,6 +2376,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			return -EFAULT;
 		}
 		meta->ptr_id = reg->id;
+	} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
+		if (meta->func_id == BPF_FUNC_spin_lock) {
+			if (process_spin_lock(env, regno, true))
+				return -EACCES;
+		} else if (meta->func_id == BPF_FUNC_spin_unlock) {
+			if (process_spin_lock(env, regno, false))
+				return -EACCES;
+		} else {
+			verbose(env, "verifier internal error\n");
+			return -EFAULT;
+		}
 	} else if (arg_type_is_mem_ptr(arg_type)) {
 		expected_type = PTR_TO_STACK;
 		/* One exception here. In case function allows for NULL to be
@@ -2887,6 +3006,7 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		return err;
 	}
 
+	meta.func_id = func_id;
 	/* check args */
 	err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &meta);
 	if (err)
@@ -4473,7 +4593,8 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
 		} else if (reg->type == PTR_TO_SOCKET_OR_NULL) {
 			reg->type = PTR_TO_SOCKET;
 		}
-		if (is_null || !reg_is_refcounted(reg)) {
+		if (is_null || !(reg_is_refcounted(reg) ||
+				 reg_may_point_to_spin_lock(reg))) {
 			/* We don't need id from this point onwards anymore,
 			 * thus we should better reset it, so that state
 			 * pruning has chances to take effect.
@@ -4871,6 +4992,11 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 		return err;
 	}
 
+	if (env->cur_state->active_spin_lock) {
+		verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
+		return -EINVAL;
+	}
+
 	if (regs[BPF_REG_6].type != PTR_TO_CTX) {
 		verbose(env,
 			"at the time of BPF_LD_ABS|IND R6 != pointer to skb\n");
@@ -5607,8 +5733,11 @@ static bool regsafe(struct bpf_reg_state *rold, struct bpf_reg_state *rcur,
 	case PTR_TO_MAP_VALUE:
 		/* If the new min/max/var_off satisfy the old ones and
 		 * everything else matches, we are OK.
-		 * We don't care about the 'id' value, because nothing
-		 * uses it for PTR_TO_MAP_VALUE (only for ..._OR_NULL)
+		 * 'id' is not compared, since it's only used for maps with
+		 * bpf_spin_lock inside map element and in such cases if
+		 * the rest of the prog is valid for one map element then
+		 * it's valid for all map elements regardless of the key
+		 * used in bpf_map_lookup()
 		 */
 		return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
 		       range_within(rold, rcur) &&
@@ -5811,6 +5940,9 @@ static bool states_equal(struct bpf_verifier_env *env,
 	if (old->speculative && !cur->speculative)
 		return false;
 
+	if (old->active_spin_lock != cur->active_spin_lock)
+		return false;
+
 	/* for states to be equal callsites have to be the same
 	 * and all frame states need to be equivalent
 	 */
@@ -6229,6 +6361,12 @@ static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
+				if (env->cur_state->active_spin_lock &&
+				    (insn->src_reg == BPF_PSEUDO_CALL ||
+				     insn->imm != BPF_FUNC_spin_unlock)) {
+					verbose(env, "function calls are not allowed while holding a lock\n");
+					return -EINVAL;
+				}
 				if (insn->src_reg == BPF_PSEUDO_CALL)
 					err = check_func_call(env, insn, &env->insn_idx);
 				else
@@ -6259,6 +6397,11 @@ static int do_check(struct bpf_verifier_env *env)
 					return -EINVAL;
 				}
 
+				if (env->cur_state->active_spin_lock) {
+					verbose(env, "bpf_spin_unlock is missing\n");
+					return -EINVAL;
+				}
+
 				if (state->curframe) {
 					/* exit from nested function */
 					env->prev_insn_idx = env->insn_idx;
@@ -6356,6 +6499,19 @@ static int check_map_prealloc(struct bpf_map *map)
 		!(map->map_flags & BPF_F_NO_PREALLOC);
 }
 
+static bool is_tracing_prog_type(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_KPROBE:
+	case BPF_PROG_TYPE_TRACEPOINT:
+	case BPF_PROG_TYPE_PERF_EVENT:
+	case BPF_PROG_TYPE_RAW_TRACEPOINT:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 					struct bpf_map *map,
 					struct bpf_prog *prog)
@@ -6378,6 +6534,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		}
 	}
 
+	if ((is_tracing_prog_type(prog->type) ||
+	     prog->type == BPF_PROG_TYPE_SOCKET_FILTER) &&
+	    map_value_has_spin_lock(map)) {
+		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
+		return -EINVAL;
+	}
+
 	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
 	    !bpf_offload_prog_map_match(prog, map)) {
 		verbose(env, "offload device mismatch between prog and map\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 41984ad4b9b4..3a49f68eda10 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5314,10 +5314,20 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 		return &bpf_tail_call_proto;
 	case BPF_FUNC_ktime_get_ns:
 		return &bpf_ktime_get_ns_proto;
+	default:
+		break;
+	}
+
+	if (!capable(CAP_SYS_ADMIN))
+		return NULL;
+
+	switch (func_id) {
+	case BPF_FUNC_spin_lock:
+		return &bpf_spin_lock_proto;
+	case BPF_FUNC_spin_unlock:
+		return &bpf_spin_unlock_proto;
 	case BPF_FUNC_trace_printk:
-		if (capable(CAP_SYS_ADMIN))
-			return bpf_get_trace_printk_proto();
-		/* else, fall through */
+		return bpf_get_trace_printk_proto();
 	default:
 		return NULL;
 	}
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 4/9] selftests/bpf: add bpf_spin_lock verifier tests
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

add bpf_spin_lock tests to test_verifier.c that don't require
latest llvm with BTF support

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/test_verifier.c   | 104 +++++-
 .../selftests/bpf/verifier/spin_lock.c        | 331 ++++++++++++++++++
 2 files changed, 434 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/spin_lock.c

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c5e22422a852..ae01aedb65a0 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -32,6 +32,7 @@
 #include <linux/bpf_perf_event.h>
 #include <linux/bpf.h>
 #include <linux/if_ether.h>
+#include <linux/btf.h>
 
 #include <bpf/bpf.h>
 
@@ -49,7 +50,7 @@
 
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	13
+#define MAX_NR_MAPS	14
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -76,6 +77,7 @@ struct bpf_test {
 	int fixup_map_in_map[MAX_FIXUPS];
 	int fixup_cgroup_storage[MAX_FIXUPS];
 	int fixup_percpu_cgroup_storage[MAX_FIXUPS];
+	int fixup_map_spin_lock[MAX_FIXUPS];
 	const char *errstr;
 	const char *errstr_unpriv;
 	uint32_t retval, retval_unpriv, insn_processed;
@@ -381,6 +383,98 @@ static int create_cgroup_storage(bool percpu)
 	return fd;
 }
 
+#define BTF_INFO_ENC(kind, kind_flag, vlen) \
+	((!!(kind_flag) << 31) | ((kind) << 24) | ((vlen) & BTF_MAX_VLEN))
+#define BTF_TYPE_ENC(name, info, size_or_type) \
+	(name), (info), (size_or_type)
+#define BTF_INT_ENC(encoding, bits_offset, nr_bits) \
+	((encoding) << 24 | (bits_offset) << 16 | (nr_bits))
+#define BTF_TYPE_INT_ENC(name, encoding, bits_offset, bits, sz) \
+	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_INT, 0, 0), sz), \
+	BTF_INT_ENC(encoding, bits_offset, bits)
+#define BTF_MEMBER_ENC(name, type, bits_offset) \
+	(name), (type), (bits_offset)
+
+struct btf_raw_data {
+	__u32 raw_types[64];
+	const char *str_sec;
+	__u32 str_sec_size;
+};
+
+/* struct bpf_spin_lock {
+ *   int val;
+ * };
+ * struct val {
+ *   int cnt;
+ *   struct bpf_spin_lock l;
+ * };
+ */
+static const char btf_str_sec[] = "\0bpf_spin_lock\0val\0cnt\0l";
+static __u32 btf_raw_types[] = {
+	/* int */
+	BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),  /* [1] */
+	/* struct bpf_spin_lock */                      /* [2] */
+	BTF_TYPE_ENC(1, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 1), 4),
+	BTF_MEMBER_ENC(15, 1, 0), /* int val; */
+	/* struct val */                                /* [3] */
+	BTF_TYPE_ENC(15, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 2), 8),
+	BTF_MEMBER_ENC(19, 1, 0), /* int cnt; */
+	BTF_MEMBER_ENC(23, 2, 32),/* struct bpf_spin_lock l; */
+};
+
+static int load_btf(void)
+{
+	struct btf_header hdr = {
+		.magic = BTF_MAGIC,
+		.version = BTF_VERSION,
+		.hdr_len = sizeof(struct btf_header),
+		.type_len = sizeof(btf_raw_types),
+		.str_off = sizeof(btf_raw_types),
+		.str_len = sizeof(btf_str_sec),
+	};
+	void *ptr, *raw_btf;
+	int btf_fd;
+
+	ptr = raw_btf = malloc(sizeof(hdr) + sizeof(btf_raw_types) +
+			       sizeof(btf_str_sec));
+
+	memcpy(ptr, &hdr, sizeof(hdr));
+	ptr += sizeof(hdr);
+	memcpy(ptr, btf_raw_types, hdr.type_len);
+	ptr += hdr.type_len;
+	memcpy(ptr, btf_str_sec, hdr.str_len);
+	ptr += hdr.str_len;
+
+	btf_fd = bpf_load_btf(raw_btf, ptr - raw_btf, 0, 0, 0);
+	free(raw_btf);
+	if (btf_fd < 0)
+		return -1;
+	return btf_fd;
+}
+
+static int create_map_spin_lock(void)
+{
+	struct bpf_create_map_attr attr = {
+		.name = "test_map",
+		.map_type = BPF_MAP_TYPE_ARRAY,
+		.key_size = 4,
+		.value_size = 8,
+		.max_entries = 1,
+		.btf_key_type_id = 1,
+		.btf_value_type_id = 3,
+	};
+	int fd, btf_fd;
+
+	btf_fd = load_btf();
+	if (btf_fd < 0)
+		return -1;
+	attr.btf_fd = btf_fd;
+	fd = bpf_create_map_xattr(&attr);
+	if (fd < 0)
+		printf("Failed to create map with spin_lock\n");
+	return fd;
+}
+
 static char bpf_vlog[UINT_MAX >> 8];
 
 static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
@@ -399,6 +493,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_in_map = test->fixup_map_in_map;
 	int *fixup_cgroup_storage = test->fixup_cgroup_storage;
 	int *fixup_percpu_cgroup_storage = test->fixup_percpu_cgroup_storage;
+	int *fixup_map_spin_lock = test->fixup_map_spin_lock;
 
 	if (test->fill_helper)
 		test->fill_helper(test);
@@ -515,6 +610,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_map_stacktrace++;
 		} while (*fixup_map_stacktrace);
 	}
+	if (*fixup_map_spin_lock) {
+		map_fds[13] = create_map_spin_lock();
+		do {
+			prog[*fixup_map_spin_lock].imm = map_fds[13];
+			fixup_map_spin_lock++;
+		} while (*fixup_map_spin_lock);
+	}
 }
 
 static int set_admin(bool admin)
diff --git a/tools/testing/selftests/bpf/verifier/spin_lock.c b/tools/testing/selftests/bpf/verifier/spin_lock.c
new file mode 100644
index 000000000000..d829eef372a4
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/spin_lock.c
@@ -0,0 +1,331 @@
+{
+	"spin_lock: test1 success",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = ACCEPT,
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test2 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test3 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test4 direct ld/st",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_6, 3),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "cannot be accessed directly",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test5 call within a locked region",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_get_prandom_u32),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "calls are not allowed",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test6 missing unlock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "unlock is missing",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test7 unlock without lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "without taking a lock",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test8 double lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_6, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "calls are not allowed",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test9 different lock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3, 11 },
+	.result = REJECT,
+	.errstr = "unlock of different lock",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test10 lock in subprog without unlock",
+	.insns = {
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 1, 0, 5),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 1),
+	BPF_EXIT_INSN(),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 3 },
+	.result = REJECT,
+	.errstr = "unlock is missing",
+	.result_unpriv = REJECT,
+	.errstr_unpriv = "",
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+},
+{
+	"spin_lock: test11 ld_abs under lock",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_ST_MEM(BPF_W, BPF_REG_10, -4, 0),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
+	BPF_LD_MAP_FD(BPF_REG_1,
+		      0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_lock),
+	BPF_LD_ABS(BPF_B, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_7),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_spin_unlock),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.fixup_map_spin_lock = { 4 },
+	.result = REJECT,
+	.errstr = "inside bpf_spin_lock",
+	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+},
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 2/9] bpf: add support for bpf_spin_lock to cgroup local storage
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Allow 'struct bpf_spin_lock' to reside inside cgroup local storage.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/local_storage.c | 2 ++
 kernel/bpf/syscall.c       | 3 ++-
 kernel/bpf/verifier.c      | 2 ++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 07a34ef562a0..0295427f06e2 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -147,6 +147,7 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 		return -ENOMEM;
 
 	memcpy(&new->data[0], value, map->value_size);
+	check_and_init_map_lock(map, new->data);
 
 	new = xchg(&storage->buf, new);
 	kfree_rcu(new, rcu);
@@ -483,6 +484,7 @@ struct bpf_cgroup_storage *bpf_cgroup_storage_alloc(struct bpf_prog *prog,
 		storage->buf = kmalloc_node(size, flags, map->numa_node);
 		if (!storage->buf)
 			goto enomem;
+		check_and_init_map_lock(map, storage->buf->data);
 	} else {
 		storage->percpu_buf = __alloc_percpu_gfp(size, 8, flags);
 		if (!storage->percpu_buf)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ebf0a673cb83..b29e6dc44650 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -482,7 +482,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
 
 	if (map_value_has_spin_lock(map)) {
 		if (map->map_type != BPF_MAP_TYPE_HASH &&
-		    map->map_type != BPF_MAP_TYPE_ARRAY)
+		    map->map_type != BPF_MAP_TYPE_ARRAY &&
+		    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE)
 			return -ENOTSUPP;
 		if (map->spin_lock_off + sizeof(struct bpf_spin_lock) >
 		    map->value_size) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b6af9d2cb236..dd1bd7485a57 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3089,6 +3089,8 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
 		regs[BPF_REG_0].map_ptr = meta.map_ptr;
 		if (fn->ret_type == RET_PTR_TO_MAP_VALUE) {
 			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE;
+			if (map_value_has_spin_lock(meta.map_ptr))
+				regs[BPF_REG_0].id = ++env->id_gen;
 		} else {
 			regs[BPF_REG_0].type = PTR_TO_MAP_VALUE_OR_NULL;
 			regs[BPF_REG_0].id = ++env->id_gen;
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 7/9] tools/bpf: sync uapi/bpf.h
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

add BPF_F_LOCK definition to tools/include/uapi/linux/bpf.h

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 86f7c438d40f..1777fa0c61e4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -267,6 +267,7 @@ enum bpf_attach_type {
 #define BPF_ANY		0 /* create new element or update existing */
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
+#define BPF_F_LOCK	4 /* spin_lock-ed map_lookup/map_update */
 
 /* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 6/9] bpf: introduce BPF_F_LOCK flag
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Introduce BPF_F_LOCK flag for map_lookup and map_update syscall commands
and for map_update() helper function.
In all these cases take a lock of existing element (which was provided
in BTF description) before copying (in or out) the rest of map value.

Implementation details that are part of uapi:

Array:
The array map takes the element lock for lookup/update.

Hash:
hash map also takes the lock for lookup/update and tries to avoid the bucket lock.
If old element exists it takes the element lock and updates the element in place.
If element doesn't exist it allocates new one and inserts into hash table
while holding the bucket lock.
In rare case the hashmap has to take both the bucket lock and the element lock
to update old value in place.

Cgroup local storage:
It is similar to array. update in place and lookup are done with lock taken.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h        |  2 ++
 include/uapi/linux/bpf.h   |  1 +
 kernel/bpf/arraymap.c      | 24 ++++++++++++++--------
 kernel/bpf/hashtab.c       | 42 +++++++++++++++++++++++++++++++++++---
 kernel/bpf/helpers.c       | 16 +++++++++++++++
 kernel/bpf/local_storage.c | 14 ++++++++++++-
 kernel/bpf/syscall.c       | 25 +++++++++++++++++++++--
 7 files changed, 110 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2ae615b48bb8..bd169a7bcc93 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -119,6 +119,8 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
 		memcpy(dst, src, map->value_size);
 	}
 }
+void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
+			   bool lock_src);
 
 struct bpf_offload_dev;
 struct bpf_offloaded_map;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 86f7c438d40f..1777fa0c61e4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -267,6 +267,7 @@ enum bpf_attach_type {
 #define BPF_ANY		0 /* create new element or update existing */
 #define BPF_NOEXIST	1 /* create new element if it didn't exist */
 #define BPF_EXIST	2 /* update existing element */
+#define BPF_F_LOCK	4 /* spin_lock-ed map_lookup/map_update */
 
 /* flags for BPF_MAP_CREATE command */
 #define BPF_F_NO_PREALLOC	(1U << 0)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index d6d979910a2a..c72e0d8e1e65 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -253,8 +253,9 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 {
 	struct bpf_array *array = container_of(map, struct bpf_array, map);
 	u32 index = *(u32 *)key;
+	char *val;
 
-	if (unlikely(map_flags > BPF_EXIST))
+	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
 
@@ -262,18 +263,25 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
 		/* all elements were pre-allocated, cannot insert a new one */
 		return -E2BIG;
 
-	if (unlikely(map_flags == BPF_NOEXIST))
+	if (unlikely(map_flags & BPF_NOEXIST))
 		/* all elements already exist */
 		return -EEXIST;
 
-	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
+	if (unlikely((map_flags & BPF_F_LOCK) &&
+		     !map_value_has_spin_lock(map)))
+		return -EINVAL;
+
+	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
 		memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
 		       value, map->value_size);
-	else
-		copy_map_value(map,
-			       array->value +
-			       array->elem_size * (index & array->index_mask),
-			       value);
+	} else {
+		val = array->value +
+			array->elem_size * (index & array->index_mask);
+		if (map_flags & BPF_F_LOCK)
+			copy_map_value_locked(map, val, value, false);
+		else
+			copy_map_value(map, val, value);
+	}
 	return 0;
 }
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 6d3b22c5ad68..937776531998 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -804,11 +804,11 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 static int check_flags(struct bpf_htab *htab, struct htab_elem *l_old,
 		       u64 map_flags)
 {
-	if (l_old && map_flags == BPF_NOEXIST)
+	if (l_old && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST)
 		/* elem already exists */
 		return -EEXIST;
 
-	if (!l_old && map_flags == BPF_EXIST)
+	if (!l_old && (map_flags & ~BPF_F_LOCK) == BPF_EXIST)
 		/* elem doesn't exist, cannot update it */
 		return -ENOENT;
 
@@ -827,7 +827,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	u32 key_size, hash;
 	int ret;
 
-	if (unlikely(map_flags > BPF_EXIST))
+	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
 		/* unknown flags */
 		return -EINVAL;
 
@@ -840,6 +840,28 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
+	if (unlikely(map_flags & BPF_F_LOCK)) {
+		if (unlikely(!map_value_has_spin_lock(map)))
+			return -EINVAL;
+		/* find an element without taking the bucket lock */
+		l_old = lookup_nulls_elem_raw(head, hash, key, key_size,
+					      htab->n_buckets);
+		ret = check_flags(htab, l_old, map_flags);
+		if (ret)
+			return ret;
+		if (l_old) {
+			/* grab the element lock and update value in place */
+			copy_map_value_locked(map,
+					      l_old->key + round_up(key_size, 8),
+					      value, false);
+			return 0;
+		}
+		/* fall through, grab the bucket lock and lookup again.
+		 * 99.9% chance that the element won't be found,
+		 * but second lookup under lock has to be done.
+		 */
+	}
+
 	/* bpf_map_update_elem() can be called in_irq() */
 	raw_spin_lock_irqsave(&b->lock, flags);
 
@@ -849,6 +871,20 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (ret)
 		goto err;
 
+	if (unlikely(l_old && (map_flags & BPF_F_LOCK))) {
+		/* first lookup without the bucket lock didn't find the element,
+		 * but second lookup with the bucket lock found it.
+		 * This case is highly unlikely, but has to be dealt with:
+		 * grab the element lock in addition to the bucket lock
+		 * and update element in place
+		 */
+		copy_map_value_locked(map,
+				      l_old->key + round_up(key_size, 8),
+				      value, false);
+		ret = 0;
+		goto err;
+	}
+
 	l_new = alloc_htab_elem(htab, key, value, key_size, hash, false, false,
 				l_old);
 	if (IS_ERR(l_new)) {
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 29a8a4e62b8a..8218be1ee6ef 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -301,6 +301,22 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
 	.arg1_type	= ARG_PTR_TO_SPIN_LOCK,
 };
 
+void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
+			   bool lock_src)
+{
+	struct bpf_spin_lock *lock;
+
+	if (lock_src)
+		lock = src + map->spin_lock_off;
+	else
+		lock = dst + map->spin_lock_off;
+	preempt_disable();
+	____bpf_spin_lock(lock);
+	copy_map_value(map, dst, src);
+	____bpf_spin_unlock(lock);
+	preempt_enable();
+}
+
 #ifdef CONFIG_CGROUPS
 BPF_CALL_0(bpf_get_current_cgroup_id)
 {
diff --git a/kernel/bpf/local_storage.c b/kernel/bpf/local_storage.c
index 0295427f06e2..6b572e2de7fb 100644
--- a/kernel/bpf/local_storage.c
+++ b/kernel/bpf/local_storage.c
@@ -131,7 +131,14 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 	struct bpf_cgroup_storage *storage;
 	struct bpf_storage_buffer *new;
 
-	if (flags != BPF_ANY && flags != BPF_EXIST)
+	if (unlikely(flags & ~(BPF_F_LOCK | BPF_EXIST | BPF_NOEXIST)))
+		return -EINVAL;
+
+	if (unlikely(flags & BPF_NOEXIST))
+		return -EINVAL;
+
+	if (unlikely((flags & BPF_F_LOCK) &&
+		     !map_value_has_spin_lock(map)))
 		return -EINVAL;
 
 	storage = cgroup_storage_lookup((struct bpf_cgroup_storage_map *)map,
@@ -139,6 +146,11 @@ static int cgroup_storage_update_elem(struct bpf_map *map, void *_key,
 	if (!storage)
 		return -ENOENT;
 
+	if (flags & BPF_F_LOCK) {
+		copy_map_value_locked(map, storage->buf->data, value, false);
+		return 0;
+	}
+
 	new = kmalloc_node(sizeof(struct bpf_storage_buffer) +
 			   map->value_size,
 			   __GFP_ZERO | GFP_ATOMIC | __GFP_NOWARN,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b29e6dc44650..0834958f1dc4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -682,7 +682,7 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size)
 }
 
 /* last field in 'union bpf_attr' used by this command */
-#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
+#define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags
 
 static int map_lookup_elem(union bpf_attr *attr)
 {
@@ -698,6 +698,9 @@ static int map_lookup_elem(union bpf_attr *attr)
 	if (CHECK_ATTR(BPF_MAP_LOOKUP_ELEM))
 		return -EINVAL;
 
+	if (attr->flags & ~BPF_F_LOCK)
+		return -EINVAL;
+
 	f = fdget(ufd);
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
@@ -708,6 +711,12 @@ static int map_lookup_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
+	if ((attr->flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
 	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
@@ -758,7 +767,13 @@ static int map_lookup_elem(union bpf_attr *attr)
 			err = -ENOENT;
 		} else {
 			err = 0;
-			copy_map_value(map, value, ptr);
+			if (attr->flags & BPF_F_LOCK)
+				/* lock 'ptr' and copy everything but lock */
+				copy_map_value_locked(map, value, ptr, true);
+			else
+				copy_map_value(map, value, ptr);
+			/* mask lock, since value wasn't zero inited */
+			check_and_init_map_lock(map, value);
 		}
 		rcu_read_unlock();
 	}
@@ -818,6 +833,12 @@ static int map_update_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
+	if ((attr->flags & BPF_F_LOCK) &&
+	    !map_value_has_spin_lock(map)) {
+		err = -EINVAL;
+		goto err_put;
+	}
+
 	key = __bpf_copy_key(ukey, map->key_size);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 5/9] selftests/bpf: add bpf_spin_lock C test
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

add bpf_spin_lock C based test that requires latest llvm with BTF support

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/Makefile         |   2 +-
 tools/testing/selftests/bpf/bpf_helpers.h    |   4 +
 tools/testing/selftests/bpf/test_progs.c     |  43 +++++++-
 tools/testing/selftests/bpf/test_spin_lock.c | 108 +++++++++++++++++++
 4 files changed, 155 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_spin_lock.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8993e9c8f410..302b8e70dec9 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 6c77cf7bedce..6a0ce0f055c5 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -172,6 +172,10 @@ static int (*bpf_skb_vlan_pop)(void *ctx) =
 	(void *) BPF_FUNC_skb_vlan_pop;
 static int (*bpf_rc_pointer_rel)(void *ctx, int rel_x, int rel_y) =
 	(void *) BPF_FUNC_rc_pointer_rel;
+static void (*bpf_spin_lock)(struct bpf_spin_lock *lock) =
+	(void *) BPF_FUNC_spin_lock;
+static void (*bpf_spin_unlock)(struct bpf_spin_lock *lock) =
+	(void *) BPF_FUNC_spin_unlock;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index d8940b8b2f8d..d2e71d697340 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -28,7 +28,7 @@ typedef __u16 __sum16;
 #include <sys/wait.h>
 #include <sys/types.h>
 #include <fcntl.h>
-
+#include <pthread.h>
 #include <linux/bpf.h>
 #include <linux/err.h>
 #include <bpf/bpf.h>
@@ -1985,6 +1985,46 @@ static void test_flow_dissector(void)
 	bpf_object__close(obj);
 }
 
+static void *test_spin_lock(void *arg)
+{
+	__u32 duration, retval;
+	int err, prog_fd = *(u32 *) arg;
+
+	err = bpf_prog_test_run(prog_fd, 10000, &pkt_v4, sizeof(pkt_v4),
+				NULL, NULL, &retval, &duration);
+	CHECK(err || retval, "",
+	      "err %d errno %d retval %d duration %d\n",
+	      err, errno, retval, duration);
+	pthread_exit(arg);
+}
+
+static void test_spinlock(void)
+{
+	const char *file = "./test_spin_lock.o";
+	pthread_t thread_id[4];
+	struct bpf_object *obj;
+	int prog_fd;
+	int err = 0, i;
+	void *ret;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_CGROUP_SKB, &obj, &prog_fd);
+	if (err) {
+		printf("test_spin_lock:bpf_prog_load errno %d\n", errno);
+		goto close_prog;
+	}
+	for (i = 0; i < 4; i++)
+		assert(pthread_create(&thread_id[i], NULL,
+				      &test_spin_lock, &prog_fd) == 0);
+	for (i = 0; i < 4; i++)
+		assert(pthread_join(thread_id[i], &ret) == 0 &&
+		       ret == (void *)&prog_fd);
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
+}
+
 int main(void)
 {
 	srand(time(NULL));
@@ -2013,6 +2053,7 @@ int main(void)
 	test_queue_stack_map(QUEUE);
 	test_queue_stack_map(STACK);
 	test_flow_dissector();
+	test_spinlock();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
diff --git a/tools/testing/selftests/bpf/test_spin_lock.c b/tools/testing/selftests/bpf/test_spin_lock.c
new file mode 100644
index 000000000000..40f904312090
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_spin_lock.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+#include <linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+struct hmap_elem {
+	volatile int cnt;
+	struct bpf_spin_lock lock;
+	int test_padding;
+};
+
+struct bpf_map_def SEC("maps") hmap = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct hmap_elem),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(hmap, int, struct hmap_elem);
+
+
+struct cls_elem {
+	struct bpf_spin_lock lock;
+	volatile int cnt;
+};
+
+struct bpf_map_def SEC("maps") cls_map = {
+	.type = BPF_MAP_TYPE_CGROUP_STORAGE,
+	.key_size = sizeof(struct bpf_cgroup_storage_key),
+	.value_size = sizeof(struct cls_elem),
+};
+
+BPF_ANNOTATE_KV_PAIR(cls_map, struct bpf_cgroup_storage_key,
+		     struct cls_elem);
+
+struct bpf_vqueue {
+	struct bpf_spin_lock lock;
+	/* 4 byte hole */
+	unsigned long long lasttime;
+	int credit;
+	unsigned int rate;
+};
+
+struct bpf_map_def SEC("maps") vqueue = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct bpf_vqueue),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(vqueue, int, struct bpf_vqueue);
+#define CREDIT_PER_NS(delta, rate) (((delta) * rate) >> 20)
+
+SEC("spin_lock_demo")
+int bpf_sping_lock_test(struct __sk_buff *skb)
+{
+	volatile int credit = 0, max_credit = 100, pkt_len = 64;
+	struct hmap_elem zero = {}, *val;
+	unsigned long long curtime;
+	struct bpf_vqueue *q;
+	struct cls_elem *cls;
+	int key = 0;
+	int err = 0;
+
+	val = bpf_map_lookup_elem(&hmap, &key);
+	if (!val) {
+		bpf_map_update_elem(&hmap, &key, &zero, 0);
+		val = bpf_map_lookup_elem(&hmap, &key);
+		if (!val) {
+			err = 1;
+			goto err;
+		}
+	}
+	/* spin_lock in hash map run time test */
+	bpf_spin_lock(&val->lock);
+	if (val->cnt)
+		val->cnt--;
+	else
+		val->cnt++;
+	if (val->cnt != 0 && val->cnt != 1)
+		err = 1;
+	bpf_spin_unlock(&val->lock);
+
+	/* spin_lock in array. virtual queue demo */
+	q = bpf_map_lookup_elem(&vqueue, &key);
+	if (!q)
+		goto err;
+	curtime = bpf_ktime_get_ns();
+	bpf_spin_lock(&q->lock);
+	q->credit += CREDIT_PER_NS(curtime - q->lasttime, q->rate);
+	q->lasttime = curtime;
+	if (q->credit > max_credit)
+		q->credit = max_credit;
+	q->credit -= pkt_len;
+	credit = q->credit;
+	bpf_spin_unlock(&q->lock);
+
+	/* spin_lock in cgroup local storage */
+	cls = bpf_get_local_storage(&cls_map, 0);
+	bpf_spin_lock(&cls->lock);
+	cls->cnt++;
+	bpf_spin_unlock(&cls->lock);
+
+err:
+	return err;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 8/9] libbpf: introduce bpf_map_lookup_elem_flags()
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Introduce
int bpf_map_lookup_elem_flags(int fd, const void *key, void *value, __u64 flags)
helper to lookup array/hash/cgroup_local_storage elements with BPF_F_LOCK flag.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf.c      | 13 +++++++++++++
 tools/lib/bpf/bpf.h      |  2 ++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 16 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 88cbd110ae58..3defad77dc7a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -368,6 +368,19 @@ int bpf_map_lookup_elem(int fd, const void *key, void *value)
 	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_lookup_elem_flags(int fd, const void *key, void *value, __u64 flags)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = ptr_to_u64(key);
+	attr.value = ptr_to_u64(value);
+	attr.flags = flags;
+
+	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
 int bpf_map_lookup_and_delete_elem(int fd, const void *key, void *value)
 {
 	union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 8f09de482839..ed09eed2dc3b 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -110,6 +110,8 @@ LIBBPF_API int bpf_map_update_elem(int fd, const void *key, const void *value,
 				   __u64 flags);
 
 LIBBPF_API int bpf_map_lookup_elem(int fd, const void *key, void *value);
+LIBBPF_API int bpf_map_lookup_elem_flags(int fd, const void *key, void *value,
+					 __u64 flags);
 LIBBPF_API int bpf_map_lookup_and_delete_elem(int fd, const void *key,
 					      void *value);
 LIBBPF_API int bpf_map_delete_elem(int fd, const void *key);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 266bc95d0142..f6f96fc38c50 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -130,4 +130,5 @@ LIBBPF_0.0.2 {
 		bpf_probe_helper;
 		bpf_probe_map_type;
 		bpf_probe_prog_type;
+		bpf_map_lookup_elem_flags;
 } LIBBPF_0.0.1;
-- 
2.20.0


^ permalink raw reply related

* [PATCH v6 bpf-next 9/9] selftests/bpf: test for BPF_F_LOCK
From: Alexei Starovoitov @ 2019-01-31  5:05 UTC (permalink / raw)
  To: davem; +Cc: daniel, peterz, jannh, netdev, kernel-team
In-Reply-To: <20190131050546.3634009-1-ast@kernel.org>

Add C based test that runs 4 bpf programs in parallel
that update the same hash and array maps.
And another 2 threads that read from these two maps
via lookup(key, value, BPF_F_LOCK) api
to make sure the user space sees consistent value in both
hash and array elements while user space races with kernel bpf progs.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/testing/selftests/bpf/Makefile        |  2 +-
 tools/testing/selftests/bpf/test_map_lock.c | 66 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c    | 74 +++++++++++++++++++++
 3 files changed, 141 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_map_lock.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 302b8e70dec9..9de0a1ae8d65 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ BPF_OBJ_FILES = \
 	sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_xdp_vlan.o \
-	xdp_dummy.o test_map_in_map.o test_spin_lock.o
+	xdp_dummy.o test_map_in_map.o test_spin_lock.o test_map_lock.o
 
 # Objects are built with default compilation flags and with sub-register
 # code-gen enabled.
diff --git a/tools/testing/selftests/bpf/test_map_lock.c b/tools/testing/selftests/bpf/test_map_lock.c
new file mode 100644
index 000000000000..af8cc68ed2f9
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_map_lock.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2019 Facebook
+#include <linux/bpf.h>
+#include <linux/version.h>
+#include "bpf_helpers.h"
+
+#define VAR_NUM 16
+
+struct hmap_elem {
+	struct bpf_spin_lock lock;
+	int var[VAR_NUM];
+};
+
+struct bpf_map_def SEC("maps") hash_map = {
+	.type = BPF_MAP_TYPE_HASH,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct hmap_elem),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(hash_map, int, struct hmap_elem);
+
+struct array_elem {
+	struct bpf_spin_lock lock;
+	int var[VAR_NUM];
+};
+
+struct bpf_map_def SEC("maps") array_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(struct array_elem),
+	.max_entries = 1,
+};
+
+BPF_ANNOTATE_KV_PAIR(array_map, int, struct array_elem);
+
+SEC("map_lock_demo")
+int bpf_map_lock_test(struct __sk_buff *skb)
+{
+	struct hmap_elem zero = {}, *val;
+	int rnd = bpf_get_prandom_u32();
+	int key = 0, err = 1, i;
+	struct array_elem *q;
+
+	val = bpf_map_lookup_elem(&hash_map, &key);
+	if (!val)
+		goto err;
+	/* spin_lock in hash map */
+	bpf_spin_lock(&val->lock);
+	for (i = 0; i < VAR_NUM; i++)
+		val->var[i] = rnd;
+	bpf_spin_unlock(&val->lock);
+
+	/* spin_lock in array */
+	q = bpf_map_lookup_elem(&array_map, &key);
+	if (!q)
+		goto err;
+	bpf_spin_lock(&q->lock);
+	for (i = 0; i < VAR_NUM; i++)
+		q->var[i] = rnd;
+	bpf_spin_unlock(&q->lock);
+	err = 0;
+err:
+	return err;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index d2e71d697340..a08d026ac396 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -2025,6 +2025,79 @@ static void test_spinlock(void)
 	bpf_object__close(obj);
 }
 
+static void *parallel_map_access(void *arg)
+{
+	int err, map_fd = *(u32 *) arg;
+	int vars[17], i, j, rnd, key = 0;
+
+	for (i = 0; i < 10000; i++) {
+		err = bpf_map_lookup_elem_flags(map_fd, &key, vars, BPF_F_LOCK);
+		if (err) {
+			printf("lookup failed\n");
+			error_cnt++;
+			goto out;
+		}
+		if (vars[0] != 0) {
+			printf("lookup #%d var[0]=%d\n", i, vars[0]);
+			error_cnt++;
+			goto out;
+		}
+		rnd = vars[1];
+		for (j = 2; j < 17; j++) {
+			if (vars[j] == rnd)
+				continue;
+			printf("lookup #%d var[1]=%d var[%d]=%d\n",
+			       i, rnd, j, vars[j]);
+			error_cnt++;
+			goto out;
+		}
+	}
+out:
+	pthread_exit(arg);
+}
+
+static void test_map_lock(void)
+{
+	const char *file = "./test_map_lock.o";
+	int prog_fd, map_fd[2], vars[17] = {};
+	pthread_t thread_id[6];
+	struct bpf_object *obj;
+	int err = 0, key = 0, i;
+	void *ret;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_CGROUP_SKB, &obj, &prog_fd);
+	if (err) {
+		printf("test_map_lock:bpf_prog_load errno %d\n", errno);
+		goto close_prog;
+	}
+	map_fd[0] = bpf_find_map(__func__, obj, "hash_map");
+	if (map_fd[0] < 0)
+		goto close_prog;
+	map_fd[1] = bpf_find_map(__func__, obj, "array_map");
+	if (map_fd[1] < 0)
+		goto close_prog;
+
+	bpf_map_update_elem(map_fd[0], &key, vars, BPF_F_LOCK);
+
+	for (i = 0; i < 4; i++)
+		assert(pthread_create(&thread_id[i], NULL,
+				      &test_spin_lock, &prog_fd) == 0);
+	for (i = 4; i < 6; i++)
+		assert(pthread_create(&thread_id[i], NULL,
+				      &parallel_map_access, &map_fd[i - 4]) == 0);
+	for (i = 0; i < 4; i++)
+		assert(pthread_join(thread_id[i], &ret) == 0 &&
+		       ret == (void *)&prog_fd);
+	for (i = 4; i < 6; i++)
+		assert(pthread_join(thread_id[i], &ret) == 0 &&
+		       ret == (void *)&map_fd[i - 4]);
+	goto close_prog_noerr;
+close_prog:
+	error_cnt++;
+close_prog_noerr:
+	bpf_object__close(obj);
+}
+
 int main(void)
 {
 	srand(time(NULL));
@@ -2054,6 +2127,7 @@ int main(void)
 	test_queue_stack_map(STACK);
 	test_flow_dissector();
 	test_spinlock();
+	test_map_lock();
 
 	printf("Summary: %d PASSED, %d FAILED\n", pass_cnt, error_cnt);
 	return error_cnt ? EXIT_FAILURE : EXIT_SUCCESS;
-- 
2.20.0


^ permalink raw reply related

* Re: [PATCH iproute2-next] Introduce ip-brctl shell script
From: David Ahern @ 2019-01-31  5:12 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Phil Sutter, Eric Garver, Tomas Dolezal, Stephen Hemminger,
	Lennert Buytenhek, netdev
In-Reply-To: <20190130115555.61868ab9@redhat.com>

On 1/30/19 3:55 AM, Stefano Brivio wrote:
>> I get your intent, but this seems more appropriate for you / Red Hat to
>> carry than something we want to distribute as part of iproute2.
> 
> Sure, I could also do that, but:
> 
> - me creating another project: similar maintenance burden for
>   distribution maintainers as keeping bridge-utils around,
>   for something that won't have any active development
> 
> - carrying it in a single distribution downstream: I would have gone
>   that way if I thought it wouldn't be useful for others. I myself use
>   (also) distributions other than Fedora/RHEL and this would feel
>   just... wrong
> 
> Why do you think it's not appropriate to distribute this as part of
> iproute2? Too ugly? Bloated? Anything I can improve?
> 
> I think it would be appropriate because it intimately depends on
> ip-link -- it's really nothing more than a helper for iproute2 tools.
> 

Again, I understand your point ... I still, too often, type ifconfig
from long in-grained muscle memory.

This is a convenience wrapper around commands packaged in iproute2. If
iproute2 adds this wrapper, it will have to carry it and maintain it
forever. Distributions (Fedora, RHEL, Debian, etc) may see it
differently and decide to add this patch onto iproute2 that they
distribute as a means for dropping bridge-utils. That's a reasonable
migration choice. It is just not something upstream iproute2 should carry.

^ permalink raw reply

* Re: Question on ptr_ring linux header
From: fei phung @ 2019-01-31  5:16 UTC (permalink / raw)
  To: mst, feiphung, netdev
In-Reply-To: <CAF8QhUg9oSBy14tZUkRtxFQP6_Y_4CbFttt6HOCG2pgnQY1SaA@mail.gmail.com>

Hi,

/*
 * Filename: circ_ring.c
 * Version: 1.0
 * Description: A circular buffer using API from
 * https://github.com/torvalds/linux/blob/master/include/linux/ptr_ring.h
 */

ptr_ring's void** queue is just giving data race problem, running
consume() together with [assignment of pointers+produce()] will
definitely give rise to data race

mutex or lock cannot help in this case. Please correct me if wrong

Regards,
Phung

^ permalink raw reply

* Re: [PATCH] bpf: selftests: handle sparse CPU allocations
From: Y Song @ 2019-01-31  5:25 UTC (permalink / raw)
  To: Martynas Pumputis; +Cc: netdev, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <20190130091900.4044-1-m@lambda.lt>

[ My reply somehow rejected by netdev, this is to send it again. ]

On Wed, Jan 30, 2019 at 1:19 AM Martynas Pumputis <m@lambda.lt> wrote:
>
> Previously, bpf_num_possible_cpus() had a bug when calculating a
> number of possible CPUs in the case of sparse CPU allocations, as
> it was considering only the first range or element of
> /sys/devices/system/cpu/possible.
>
> E.g. in the case of "0,2-3" (CPU 1 is not available), the function
> returned 1 instead of 3.
>
> This patch fixes the function by making it parse all CPU ranges and
> elements.
>
> Signed-off-by: Martynas Pumputis <m@lambda.lt>
> ---
>  tools/testing/selftests/bpf/bpf_util.h | 29 +++++++++++++++++---------
>  1 file changed, 19 insertions(+), 10 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
> index 315a44fa32af..8cab50408204 100644
> --- a/tools/testing/selftests/bpf/bpf_util.h
> +++ b/tools/testing/selftests/bpf/bpf_util.h
> @@ -13,7 +13,7 @@ static inline unsigned int bpf_num_possible_cpus(void)
>         unsigned int start, end, possible_cpus = 0;
>         char buff[128];
>         FILE *fp;
> -       int n;
> +       int n, i, j = 0;
>
>         fp = fopen(fcpu, "r");
>         if (!fp) {
> @@ -21,17 +21,26 @@ static inline unsigned int bpf_num_possible_cpus(void)
>                 exit(1);
>         }
>
> -       while (fgets(buff, sizeof(buff), fp)) {
> -               n = sscanf(buff, "%u-%u", &start, &end);
> -               if (n == 0) {
> -                       printf("Failed to retrieve # possible CPUs!\n");
> -                       exit(1);
> -               } else if (n == 1) {
> -                       end = start;
> +       if (!fgets(buff, sizeof(buff), fp)) {
> +               printf("Failed to read %s!\n", fcpu);
> +               exit(1);
> +       }
> +
> +       for (i = 0; i <= strlen(buff); i++) {
> +               if (buff[i] == ',' || buff[i] == '\0') {
> +                       buff[i] = '\0';

This does not sound right. For example, the cpu list "0,2-3",
you will change "," to '\0" so buffer becomes "0\02-3".
The next iteration you will get strlen(buff) = 1.
The "2-3" will be skipped.

> +                       n = sscanf(&buff[j], "%u-%u", &start, &end);
> +                       if (n <= 0) {
> +                               printf("Failed to retrieve # possible CPUs!\n");
> +                               exit(1);
> +                       } else if (n == 1) {
> +                               end = start;
> +                       }
> +                       possible_cpus += end - start + 1;
> +                       j = i + 1;
>                 }
> -               possible_cpus = start == 0 ? end + 1 : 0;
> -               break;
>         }
> +
>         fclose(fp);
>
>         return possible_cpus;
> --
> 2.20.1
>

^ permalink raw reply

* Re: [PATCH] ipconfig: make the wait for carrier timeout configurable
From: David Miller @ 2019-01-31  5:36 UTC (permalink / raw)
  To: martink
  Cc: kuznet, yoshfuji, netdev, linux-kernel, manfred.schlaegl,
	martin.kepplinger
In-Reply-To: <20190128113005.7683-1-martink@posteo.de>

From: Martin Kepplinger <martink@posteo.de>
Date: Mon, 28 Jan 2019 12:30:05 +0100

> From: Manfred Schlaegl <manfred.schlaegl@ginzinger.com>
> 
> commit 3fb72f1e6e61 ("ipconfig wait for carrier") added a
> "wait for carrier" policy, with a fixed worst case maximum wait
> of two minutes.
> 
> This makes the wait for carrier timeout configurable (0 - 240 seconds).
> 
> The informative timeout messages introduced with
> commit 5e404cd65860 ("ipconfig: add informative timeout messages while
> waiting for carrier") were adapted. Message output is done in a fixed
> interval of 20 seconds, just like before (240/12).
> 
> Signed-off-by: Manfred Schlaegl <manfred.schlaegl@ginzinger.com>
> Signed-off-by: Martin Kepplinger <martin.kepplinger@ginzinger.com>

I would prefer that this be implemented using a kernel command line
parameter.

That way you don't need to rebuild the kernel to adjust the value.

Thanks.

^ permalink raw reply

* Re: [PATCH net v4] l2tp: fix reading optional fields of L2TPv3
From: David Miller @ 2019-01-31  5:45 UTC (permalink / raw)
  To: jian.w.wen; +Cc: netdev, eric.dumazet, gnault
In-Reply-To: <20190130065514.13076-1-jian.w.wen@oracle.com>

From: Jacob Wen <jian.w.wen@oracle.com>
Date: Wed, 30 Jan 2019 14:55:14 +0800

> Use pskb_may_pull() to make sure the optional fields are in skb linear
> parts, so we can safely read them later.
> 
> It's easy to reproduce the issue with a net driver that supports paged
> skb data. Just create a L2TPv3 over IP tunnel and then generates some
> network traffic.
> Once reproduced, rx err in /sys/kernel/debug/l2tp/tunnels will increase.
> 
> Changes in v4:
> 1. s/l2tp_v3_pull_opt/l2tp_v3_ensure_opt_in_linear/
> 2. s/tunnel->version != L2TP_HDR_VER_2/tunnel->version == L2TP_HDR_VER_3/
> 3. Add 'Fixes' in commit messages.
> 
> Changes in v3:
> 1. To keep consistency, move the code out of l2tp_recv_common.
> 2. Use "net" instead of "net-next", since this is a bug fix.
> 
> Changes in v2:
> 1. Only fix L2TPv3 to make code simple.
>    To fix both L2TPv3 and L2TPv2, we'd better refactor l2tp_recv_common.
>    It's complicated to do so.
> 2. Reloading pointers after pskb_may_pull
> 
> Fixes: f7faffa3ff8e ("l2tp: Add L2TPv3 protocol support")
> Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
> Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
> 
> Signed-off-by: Jacob Wen <jian.w.wen@oracle.com>
> Acked-by: Guillaume Nault <gnault@redhat.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] strparser: Return if socket does not have required number of bytes
From: David Miller @ 2019-01-31  5:59 UTC (permalink / raw)
  To: vakul.garg; +Cc: netdev, borisp, aviadye, davejwatson, doronrk
In-Reply-To: <20190130072933.23281-1-vakul.garg@nxp.com>

From: Vakul Garg <vakul.garg@nxp.com>
Date: Wed, 30 Jan 2019 07:31:44 +0000

> Function strp_data_ready() should peek the associated socket to check
> whether it has the required number of bytes available before queueing
> work or initiating socket read via strp_read_sock(). This saves cpu
> cycles because strp_read_sock() is called only when required amount of
> data is available.
> 
> Signed-off-by: Vakul Garg <vakul.garg@nxp.com>

You can't do this, I think.  It's racy and the user socket owned check
is absolutely necessary before you do the need bytes check.

^ permalink raw reply

* Re: [PATCH net] ipvlan, l3mdev: fix broken l3s mode wrt local routes
From: David Miller @ 2019-01-31  6:14 UTC (permalink / raw)
  To: daniel; +Cc: netdev, maheshb, dsa, fw, m
In-Reply-To: <20190130114948.24227-1-daniel@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 30 Jan 2019 12:49:48 +0100

> While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
> I ran into the issue that while l3 mode is working fine, l3s mode
> does not have any connectivity to kube-apiserver and hence all pods
> end up in Error state as well. The ipvlan master device sits on
> top of a bond device and hostns traffic to kube-apiserver (also running
> in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
> where the latter is the address of the bond0. While in l3 mode, a
> curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
> works fine from hostns, neither of them do in case of l3s. In the
> latter only a curl to https://127.0.0.1:37573 appeared to work where
> for local addresses of bond0 I saw kernel suddenly starting to emit
> ARP requests to query HW address of bond0 which remained unanswered
> and neighbor entries in INCOMPLETE state. These ARP requests only
> happen while in l3s.
> 
> Debugging this further, I found the issue is that l3s mode is piggy-
> backing on l3 master device, and in this case local routes are using
> l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
> f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
> if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
> a loopback"). I found that reverting them back into using the
> net->loopback_dev fixed ipvlan l3s connectivity and got everything
> working for the CNI.
> 
> Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
> l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
> on l3 master device is to get the l3mdev_ip_rcv() receive hook for
> setting the dst entry of the input route without adding its own
> ipvlan specific hacks into the receive path, however, any l3 domain
> semantics beyond just that are breaking l3s operation. Note that
> ipvlan also has the ability to dynamically switch its internal
> operation from l3 to l3s for all ports via ipvlan_set_port_mode()
> at runtime. In any case, l3 vs l3s soley distinguishes itself by
> 'de-confusing' netfilter through switching skb->dev to ipvlan slave
> device late in NF_INET_LOCAL_IN before handing the skb to L4.
> 
> Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
> if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
> without any additional l3mdev semantics on top. This should also have
> minimal impact since dev->priv_flags is already hot in cache. With
> this set, l3s mode is working fine and I also get things like
> masquerading pod traffic on the ipvlan master properly working.
> 
>   [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
> 
> Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
> Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
> Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: Heiner Kallweit @ 2019-01-31  6:21 UTC (permalink / raw)
  To: David Chang; +Cc: Peter Ceiley, Realtek linux nic maintainers, netdev
In-Reply-To: <20190131023240.GF25745@linux-kyyb.suse>

David, thanks for the link to the bug ticket.
I think only a proper bisect can help to find the offending commit.

Heiner


On 31.01.2019 03:32, David Chang wrote:
> Hi,
> 
> We had a similr case here.
> - Realtek r8169 receive performance regression in kernel 4.19
>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
> 
> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
> The major symptom is there are many rx_missed count.
> 
> 
> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
>> Hi Peter,
>>
>> recently I had somebody where pcie_aspm=off for whatever reason didn't
>> do the trick, can you also check with pcie_aspm.policy=performance.
> 
> We will give it a try later.
> 
>> And please check with "ethtool -S <if>" whether the chip statistics
>> show a significant number of errors.
>>
>> If this doesn't help you may have to bisect to find the offending commit.
> 
> We had tried fallback driver to a few previous commits as following,
> but with no luck.
> 
> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
> 
> Thanks,
> David Chang
> 
>>
>> Heiner
>>
>>
>> On 30.01.2019 10:59, Peter Ceiley wrote:
>>> Hi Heiner,
>>>
>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
>>> and this made no difference.
>>>
>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
>>> subsequently loaded the module in the running 4.19.18 kernel. I can
>>> confirm that this immediately resolved the issue and access to the NFS
>>> shares operated as expected.
>>>
>>> I presume this means it is an issue with the r8169 driver included in
>>> 4.19 onwards?
>>>
>>> To answer your last questions:
>>>
>>> Base Board Information
>>>     Manufacturer: Alienware
>>>     Product Name: 0PGRP5
>>>     Version: A02
>>>
>>> ... and yes, the RTL8168 is the onboard network chip.
>>>
>>> Regards,
>>>
>>> Peter.
>>>
>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> I think the vendor driver doesn't enable ASPM per default.
>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
>>>> Few older systems seem to have issues with ASPM, what kind of
>>>> system / mainboard are you using? The RTL8168 is the onboard
>>>> network chip?
>>>>
>>>> Rgds, Heiner
>>>>
>>>>
>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
>>>>> Hi Heiner,
>>>>>
>>>>> Thanks, I'll do some more testing. It might not be the driver - I
>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
>>>>> a good idea.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Peter.
>>>>>
>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> at a first glance it doesn't look like a typical driver issue.
>>>>>> What you could do:
>>>>>>
>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
>>>>>>
>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
>>>>>>
>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
>>>>>>
>>>>>> Any specific reason why you think root cause is in the driver and not
>>>>>> elsewhere in the network subsystem?
>>>>>>
>>>>>> Heiner
>>>>>>
>>>>>>
>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
>>>>>>> Hi Heiner,
>>>>>>>
>>>>>>> Thanks for getting back to me.
>>>>>>>
>>>>>>> No, I don't use jumbo packets.
>>>>>>>
>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
>>>>>>> establishing a connection and is most notable, for example, on my
>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
>>>>>>> larger directories) to list the contents of each directory. Once a
>>>>>>> transfer begins on a file, I appear to get good bandwidth.
>>>>>>>
>>>>>>> I'm unsure of the best scientific data to provide you in order to
>>>>>>> troubleshoot this issue. Running the following
>>>>>>>
>>>>>>>     netstat -s |grep retransmitted
>>>>>>>
>>>>>>> shows a steady increase in retransmitted segments each time I list the
>>>>>>> contents of a remote directory, for example, running 'ls' on a
>>>>>>> directory containing 345 media files did the following using kernel
>>>>>>> 4.19.18:
>>>>>>>
>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
>>>>>>> the following:
>>>>>>>     real    0m19.867s
>>>>>>>     user    0m0.012s
>>>>>>>     sys    0m0.036s
>>>>>>>
>>>>>>> The same command shows no retransmitted segments running kernel
>>>>>>> 4.18.16 and 'time' showed:
>>>>>>>     real    0m0.300s
>>>>>>>     user    0m0.004s
>>>>>>>     sys    0m0.007s
>>>>>>>
>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
>>>>>>>
>>>>>>> dmesg XID:
>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
>>>>>>>
>>>>>>> # lspci -vv
>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
>>>>>>>     Interrupt: pin A routed to IRQ 19
>>>>>>>     Region 0: I/O ports at d000 [size=256]
>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
>>>>>>>     Capabilities: [40] Power Management version 3
>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>>>         Address: 0000000000000000  Data: 0000
>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
>>>>>>> <512ns, L1 <64us
>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>>> SlotPowerLimit 10.000W
>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
>>>>>>> Latency L0s unlimited, L1 <64us
>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
>>>>>>> OBFF Via message/WAKE#
>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
>>>>>>> OBFF Disabled
>>>>>>>              AtomicOpsCtl: ReqEn-
>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>>>>>>              Transmit Margin: Normal Operating Range,
>>>>>>> EnterModifiedCompliance- ComplianceSOS-
>>>>>>>              Compliance De-emphasis: -6dB
>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
>>>>>>> EqualizationComplete-, EqualizationPhase1-
>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>>>>>>         Vector table: BAR=4 offset=00000000
>>>>>>>         PBA: BAR=4 offset=00000800
>>>>>>>     Capabilities: [d0] Vital Product Data
>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>>>>>>         Not readable
>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
>>>>>>> ECRCChkCap+ ECRCChkEn-
>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
>>>>>>>     Capabilities: [140 v1] Virtual Channel
>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
>>>>>>>         Ctrl:    ArbSelect=Fixed
>>>>>>>         Status:    InProgress-
>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
>>>>>>>             Status:    NegoPending- InProgress-
>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
>>>>>>>         Max snoop latency: 71680ns
>>>>>>>         Max no snoop latency: 71680ns
>>>>>>>     Kernel driver in use: r8169
>>>>>>>     Kernel modules: r8169
>>>>>>>
>>>>>>> Please let me know if you have any other ideas in terms of testing.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Peter.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have been experiencing very poor network performance since Kernel
>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
>>>>>>>>>
>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
>>>>>>>>> 4.20.4 & 4.19.18).
>>>>>>>>>
>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
>>>>>>>>> differ in that I still have a network connection. I have attempted to
>>>>>>>>> reload the driver on a running system, but this does not improve the
>>>>>>>>> situation.
>>>>>>>>>
>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
>>>>>>>>>
>>>>>>>>> lshw shows:
>>>>>>>>>        description: Ethernet interface
>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
>>>>>>>>>        physical id: 0
>>>>>>>>>        bus info: pci@0000:03:00.0
>>>>>>>>>        logical name: enp3s0
>>>>>>>>>        version: 0c
>>>>>>>>>        serial:
>>>>>>>>>        size: 1Gbit/s
>>>>>>>>>        capacity: 1Gbit/s
>>>>>>>>>        width: 64 bits
>>>>>>>>>        clock: 33MHz
>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
>>>>>>>>> 1000bt-fd autonegotiation
>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
>>>>>>>>>
>>>>>>>>> Kind Regards,
>>>>>>>>>
>>>>>>>>> Peter.
>>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> the description "poor network performance" is quite vague, therefore:
>>>>>>>>
>>>>>>>> - Can you provide any measurements?
>>>>>>>> - iperf results before and after
>>>>>>>> - statistics about dropped packets (rx and/or tx)
>>>>>>>> - Do you use jumbo packets?
>>>>>>>>
>>>>>>>> Also help would be a "lspci -vv" output for the network card and
>>>>>>>> the dmesg output line with the chip XID.
>>>>>>>>
>>>>>>>> Heiner
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply

* RE: [PATCH net-next] strparser: Return if socket does not have required number of bytes
From: Vakul Garg @ 2019-01-31  6:25 UTC (permalink / raw)
  To: David Miller
  Cc: netdev@vger.kernel.org, borisp@mellanox.com, aviadye@mellanox.com,
	davejwatson@fb.com, doronrk@fb.com
In-Reply-To: <20190130.215950.1443919738399463157.davem@davemloft.net>



> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Thursday, January 31, 2019 11:30 AM
> To: Vakul Garg <vakul.garg@nxp.com>
> Cc: netdev@vger.kernel.org; borisp@mellanox.com;
> aviadye@mellanox.com; davejwatson@fb.com; doronrk@fb.com
> Subject: Re: [PATCH net-next] strparser: Return if socket does not have
> required number of bytes
> 
> From: Vakul Garg <vakul.garg@nxp.com>
> Date: Wed, 30 Jan 2019 07:31:44 +0000
> 
> > Function strp_data_ready() should peek the associated socket to check
> > whether it has the required number of bytes available before queueing
> > work or initiating socket read via strp_read_sock(). This saves cpu
> > cycles because strp_read_sock() is called only when required amount of
> > data is available.
> >
> > Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
> 
> You can't do this, I think.  It's racy and the user socket owned check is
> absolutely necessary before you do the need bytes check.

Do you mean to say that 'peek_len' operation can't be invoked if the socket is already owned by some other context?

My understanding by reading following link is that we can to peek_len operation without acquiring sock lock.

https://elixir.bootlin.com/linux/v5.0-rc4/source/include/linux/net.h#L191

Can you please kindly elaborate the problem?


^ permalink raw reply

* Re: [PATCH net 0/3] net: stmmac: Misc fixes
From: David Miller @ 2019-01-31  6:29 UTC (permalink / raw)
  To: jose.abreu; +Cc: netdev, joao.pinto, peppe.cavallaro, alexandre.torgue
In-Reply-To: <cover.1548859967.git.joabreu@synopsys.com>

From: Jose Abreu <jose.abreu@synopsys.com>
Date: Wed, 30 Jan 2019 15:54:18 +0100

> Some misc fixes for stmmac targeting -net.

Series applied.

^ permalink raw reply

* Re: [ethtool 1/6] ethtool: move option parsing related code into function
From: Jeff Kirsher @ 2019-01-31  6:30 UTC (permalink / raw)
  To: John W. Linville, Nicholas Nunley; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20190130210831.GD19981@tuxdriver.com>

[-- Attachment #1: Type: text/plain, Size: 3069 bytes --]

On Wed, 2019-01-30 at 16:08 -0500, John W. Linville wrote:
> On Thu, Jan 17, 2019 at 03:03:08PM -0800, Nicholas Nunley wrote:
> > Move option parsing code into find_option function.
> > 
> > No behavior changes.
> > 
> > Based on patch by Kan Liang <kan.liang@intel.com>
> > 
> > Signed-off-by: Nicholas Nunley <nicholas.d.nunley@intel.com>
> 
> Well, after looking at this series for a while I had decided to apply it.
> But when I applied it and did a 'make distcheck', I got this:
> 
> ...
> 
> gcc -DTEST_ETHTOOL -g -O2   -o test-features test_features-test-
> features.o test_features-test-common.o test_features-ethtool.o
> test_features-rxclass.o test_features-amd8111e.o test_features-de2104x.o
> test_features-dsa.o test_features-e100.o test_features-e1000.o
> test_features-et131x.o test_features-igb.o test_features-fec_8xx.o
> test_features-ibm_emac.o test_features-ixgb.o test_features-ixgbe.o
> test_features-natsemi.o test_features-pcnet32.o test_features-realtek.o
> test_features-tg3.o test_features-marvell.o test_features-vioc.o
> test_features-smsc911x.o test_features-at76c50x-usb.o test_features-sfc.o 
> test_features-stmmac.o test_features-sff-common.o test_features-sfpid.o
> test_features-sfpdiag.o test_features-ixgbevf.o test_features-tse.o
> test_features-vmxnet3.o test_features-qsfp.o test_features-fjes.o
> test_features-lan78xx.o -lm 
> make[2]: Leaving directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> make  check-TESTS
> make[2]: Entering directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> make[3]: Entering directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> FAIL: test-cmdline
> PASS: test-features
> =========================================================================
> ===
> Testsuite summary for ethtool 4.19
> =========================================================================
> ===
> # TOTAL: 2
> # PASS:  1
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  1
> # XPASS: 0
> # ERROR: 0
> =========================================================================
> ===
> See ./test-suite.log
> Please report to netdev@vger.kernel.org
> =========================================================================
> ===
> make[3]: *** [Makefile:1942: test-suite.log] Error 1
> make[3]: Leaving directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> make[2]: *** [Makefile:2050: check-TESTS] Error 2
> make[2]: Leaving directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> make[1]: *** [Makefile:2264: check-am] Error 2
> make[1]: Leaving directory '/home/linville/git/ethtool/ethtool-
> 4.19/_build/sub'
> make: *** [Makefile:2186: distcheck] Error 1
> 
> ...
> 
> I'll attach ./ethtool-4.19/_build/sub/test-suite.log to this
> message. Obviously we need whatever additional changes are needed to
> get 'make check-TESTS' to pass legitimately.

Interesting that we did not see this in testing.  I will work with Nick to
provide an updated patch, resolves your distcheck failure.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: Heiner Kallweit @ 2019-01-31  6:35 UTC (permalink / raw)
  To: David Chang; +Cc: Peter Ceiley, Realtek linux nic maintainers, netdev
In-Reply-To: <b8ed124a-fa79-e242-af03-4d24e4d6c56e@gmail.com>

Hi David, two more things:

1. Could you please test a recent linux-next kernel?
2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
   and compare them.

Heiner


On 31.01.2019 07:21, Heiner Kallweit wrote:
> David, thanks for the link to the bug ticket.
> I think only a proper bisect can help to find the offending commit.
> 
> Heiner
> 
> 
> On 31.01.2019 03:32, David Chang wrote:
>> Hi,
>>
>> We had a similr case here.
>> - Realtek r8169 receive performance regression in kernel 4.19
>>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
>>
>> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
>> The major symptom is there are many rx_missed count.
>>
>>
>> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
>>> Hi Peter,
>>>
>>> recently I had somebody where pcie_aspm=off for whatever reason didn't
>>> do the trick, can you also check with pcie_aspm.policy=performance.
>>
>> We will give it a try later.
>>
>>> And please check with "ethtool -S <if>" whether the chip statistics
>>> show a significant number of errors.
>>>
>>> If this doesn't help you may have to bisect to find the offending commit.
>>
>> We had tried fallback driver to a few previous commits as following,
>> but with no luck.
>>
>> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
>> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
>> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
>> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
>> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
>>
>> Thanks,
>> David Chang
>>
>>>
>>> Heiner
>>>
>>>
>>> On 30.01.2019 10:59, Peter Ceiley wrote:
>>>> Hi Heiner,
>>>>
>>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
>>>> and this made no difference.
>>>>
>>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
>>>> subsequently loaded the module in the running 4.19.18 kernel. I can
>>>> confirm that this immediately resolved the issue and access to the NFS
>>>> shares operated as expected.
>>>>
>>>> I presume this means it is an issue with the r8169 driver included in
>>>> 4.19 onwards?
>>>>
>>>> To answer your last questions:
>>>>
>>>> Base Board Information
>>>>     Manufacturer: Alienware
>>>>     Product Name: 0PGRP5
>>>>     Version: A02
>>>>
>>>> ... and yes, the RTL8168 is the onboard network chip.
>>>>
>>>> Regards,
>>>>
>>>> Peter.
>>>>
>>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> I think the vendor driver doesn't enable ASPM per default.
>>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
>>>>> Few older systems seem to have issues with ASPM, what kind of
>>>>> system / mainboard are you using? The RTL8168 is the onboard
>>>>> network chip?
>>>>>
>>>>> Rgds, Heiner
>>>>>
>>>>>
>>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
>>>>>> Hi Heiner,
>>>>>>
>>>>>> Thanks, I'll do some more testing. It might not be the driver - I
>>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
>>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
>>>>>> a good idea.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Peter.
>>>>>>
>>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> at a first glance it doesn't look like a typical driver issue.
>>>>>>> What you could do:
>>>>>>>
>>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
>>>>>>>
>>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
>>>>>>>
>>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
>>>>>>>
>>>>>>> Any specific reason why you think root cause is in the driver and not
>>>>>>> elsewhere in the network subsystem?
>>>>>>>
>>>>>>> Heiner
>>>>>>>
>>>>>>>
>>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
>>>>>>>> Hi Heiner,
>>>>>>>>
>>>>>>>> Thanks for getting back to me.
>>>>>>>>
>>>>>>>> No, I don't use jumbo packets.
>>>>>>>>
>>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
>>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
>>>>>>>> establishing a connection and is most notable, for example, on my
>>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
>>>>>>>> larger directories) to list the contents of each directory. Once a
>>>>>>>> transfer begins on a file, I appear to get good bandwidth.
>>>>>>>>
>>>>>>>> I'm unsure of the best scientific data to provide you in order to
>>>>>>>> troubleshoot this issue. Running the following
>>>>>>>>
>>>>>>>>     netstat -s |grep retransmitted
>>>>>>>>
>>>>>>>> shows a steady increase in retransmitted segments each time I list the
>>>>>>>> contents of a remote directory, for example, running 'ls' on a
>>>>>>>> directory containing 345 media files did the following using kernel
>>>>>>>> 4.19.18:
>>>>>>>>
>>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
>>>>>>>> the following:
>>>>>>>>     real    0m19.867s
>>>>>>>>     user    0m0.012s
>>>>>>>>     sys    0m0.036s
>>>>>>>>
>>>>>>>> The same command shows no retransmitted segments running kernel
>>>>>>>> 4.18.16 and 'time' showed:
>>>>>>>>     real    0m0.300s
>>>>>>>>     user    0m0.004s
>>>>>>>>     sys    0m0.007s
>>>>>>>>
>>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
>>>>>>>>
>>>>>>>> dmesg XID:
>>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
>>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
>>>>>>>>
>>>>>>>> # lspci -vv
>>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
>>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
>>>>>>>>     Interrupt: pin A routed to IRQ 19
>>>>>>>>     Region 0: I/O ports at d000 [size=256]
>>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
>>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
>>>>>>>>     Capabilities: [40] Power Management version 3
>>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
>>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
>>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>>>>         Address: 0000000000000000  Data: 0000
>>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
>>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
>>>>>>>> <512ns, L1 <64us
>>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>>>> SlotPowerLimit 10.000W
>>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
>>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
>>>>>>>> Latency L0s unlimited, L1 <64us
>>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
>>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
>>>>>>>> OBFF Via message/WAKE#
>>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
>>>>>>>> OBFF Disabled
>>>>>>>>              AtomicOpsCtl: ReqEn-
>>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>>>>>>>              Transmit Margin: Normal Operating Range,
>>>>>>>> EnterModifiedCompliance- ComplianceSOS-
>>>>>>>>              Compliance De-emphasis: -6dB
>>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
>>>>>>>> EqualizationComplete-, EqualizationPhase1-
>>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>>>>>>>         Vector table: BAR=4 offset=00000000
>>>>>>>>         PBA: BAR=4 offset=00000800
>>>>>>>>     Capabilities: [d0] Vital Product Data
>>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>>>>>>>         Not readable
>>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
>>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
>>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
>>>>>>>> ECRCChkCap+ ECRCChkEn-
>>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
>>>>>>>>     Capabilities: [140 v1] Virtual Channel
>>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
>>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
>>>>>>>>         Ctrl:    ArbSelect=Fixed
>>>>>>>>         Status:    InProgress-
>>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
>>>>>>>>             Status:    NegoPending- InProgress-
>>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
>>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
>>>>>>>>         Max snoop latency: 71680ns
>>>>>>>>         Max no snoop latency: 71680ns
>>>>>>>>     Kernel driver in use: r8169
>>>>>>>>     Kernel modules: r8169
>>>>>>>>
>>>>>>>> Please let me know if you have any other ideas in terms of testing.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Peter.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have been experiencing very poor network performance since Kernel
>>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
>>>>>>>>>>
>>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
>>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
>>>>>>>>>> 4.20.4 & 4.19.18).
>>>>>>>>>>
>>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
>>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
>>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
>>>>>>>>>> differ in that I still have a network connection. I have attempted to
>>>>>>>>>> reload the driver on a running system, but this does not improve the
>>>>>>>>>> situation.
>>>>>>>>>>
>>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
>>>>>>>>>>
>>>>>>>>>> lshw shows:
>>>>>>>>>>        description: Ethernet interface
>>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
>>>>>>>>>>        physical id: 0
>>>>>>>>>>        bus info: pci@0000:03:00.0
>>>>>>>>>>        logical name: enp3s0
>>>>>>>>>>        version: 0c
>>>>>>>>>>        serial:
>>>>>>>>>>        size: 1Gbit/s
>>>>>>>>>>        capacity: 1Gbit/s
>>>>>>>>>>        width: 64 bits
>>>>>>>>>>        clock: 33MHz
>>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
>>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
>>>>>>>>>> 1000bt-fd autonegotiation
>>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
>>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
>>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
>>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
>>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
>>>>>>>>>>
>>>>>>>>>> Kind Regards,
>>>>>>>>>>
>>>>>>>>>> Peter.
>>>>>>>>>>
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> the description "poor network performance" is quite vague, therefore:
>>>>>>>>>
>>>>>>>>> - Can you provide any measurements?
>>>>>>>>> - iperf results before and after
>>>>>>>>> - statistics about dropped packets (rx and/or tx)
>>>>>>>>> - Do you use jumbo packets?
>>>>>>>>>
>>>>>>>>> Also help would be a "lspci -vv" output for the network card and
>>>>>>>>> the dmesg output line with the chip XID.
>>>>>>>>>
>>>>>>>>> Heiner
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply

* [PATCH net-next] r8169: improve WoL handling
From: Heiner Kallweit @ 2019-01-31  6:43 UTC (permalink / raw)
  To: Realtek linux nic maintainers, David Miller; +Cc: netdev@vger.kernel.org

WoL handling for the RTL8168 family is a little bit tricky because of
different types of broken BIOS and/or chip quirks.

Two known issues:
1. Network properly resumes from suspend only if WoL is enabled in the chip.
2. Some notebooks wake up immediately if system is suspended and network
   device is wakeup-enabled.

Few patches tried to deal with this:
7edf6d314cd0 ("r8169: disable WOL per default")
18041b523692 ("r8169: restore previous behavior to accept BIOS WoL
settings")

Currently we have the situation that the chip WoL settings as set by
the BIOS are respected (to prevent issue 1), but the device doesn't get
wakeup-enabled (to prevent issue 2).

This leads to another issue:
If systemd is told to set WoL it first checks whether the requested
settings are active already (and does nothing if yes). Due to the chip
WoL flags being set properly systemd assumes that WoL is configured
properly in our case. Result is that device doesn't get wakeup-enabled
and WoL doesn't work (until it's set e.g. by ethtool).

This patch now:
- leaves the chip WoL settings as is (to prevent issue 1)
- keeps the behavior to not wakeup-enable the device initially
  (to prevent issue 2)
- In addition we report WoL as being disabled in get_wol, matching
  that device isn't wakeup-enabled. If systemd is told to enable WoL,
  it will therefore detect that it has to do something and will
  call set_wol.

Of course the user still has the option to override this with
e.g. ethtool.

Because there are lots of consumer mainboards with members of the
RTL8168 family as onboard network, there's a certain chance that also
the new approach may trigger a BIOS bug and cause issues. Therefore
don't remove __rtl8169_get_wol() completely.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
---
 drivers/net/ethernet/realtek/r8169.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 3e650bd9e..2dab28115 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1371,6 +1371,8 @@ static void rtl_link_chg_patch(struct rtl8169_private *tp)
 
 #define WAKE_ANY (WAKE_PHY | WAKE_MAGIC | WAKE_UCAST | WAKE_BCAST | WAKE_MCAST)
 
+/* Don't delete it completely, in case we need to re-enable it */
+#if 0
 static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
 {
 	u8 options;
@@ -1405,6 +1407,7 @@ static u32 __rtl8169_get_wol(struct rtl8169_private *tp)
 
 	return wolopts;
 }
+#endif
 
 static void rtl8169_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
 {
@@ -4284,7 +4287,7 @@ static void rtl_wol_suspend_quirk(struct rtl8169_private *tp)
 
 static bool rtl_wol_pll_power_down(struct rtl8169_private *tp)
 {
-	if (!__rtl8169_get_wol(tp))
+	if (!device_may_wakeup(tp_to_dev(tp)))
 		return false;
 
 	phy_speed_down(tp->phydev, false);
@@ -7441,8 +7444,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		return rc;
 	}
 
-	tp->saved_wolopts = __rtl8169_get_wol(tp);
-
 	mutex_init(&tp->wk.mutex);
 	INIT_WORK(&tp->wk.work, rtl_task);
 	u64_stats_init(&tp->rx_stats.syncp);
-- 
2.20.1


^ permalink raw reply related

* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: Heiner Kallweit @ 2019-01-31  6:49 UTC (permalink / raw)
  To: David Chang; +Cc: Peter Ceiley, Realtek linux nic maintainers, netdev
In-Reply-To: <4d832c16-8830-b746-a818-6026c2e6725c@gmail.com>

And one more inquiry ..

So far I read about the issue only in combination with NFS.
Does the issue also occur with iperf or some other type of
high network load?

Heiner


On 31.01.2019 07:35, Heiner Kallweit wrote:
> Hi David, two more things:
> 
> 1. Could you please test a recent linux-next kernel?
> 2. Please get a register dump (ethtool -d <if>) from 4.18 and 4.19
>    and compare them.
> 
> Heiner
> 
> 
> On 31.01.2019 07:21, Heiner Kallweit wrote:
>> David, thanks for the link to the bug ticket.
>> I think only a proper bisect can help to find the offending commit.
>>
>> Heiner
>>
>>
>> On 31.01.2019 03:32, David Chang wrote:
>>> Hi,
>>>
>>> We had a similr case here.
>>> - Realtek r8169 receive performance regression in kernel 4.19
>>>   https://bugzilla.suse.com/show_bug.cgi?id=1119649
>>>
>>> kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
>>> The major symptom is there are many rx_missed count.
>>>
>>>
>>> On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
>>>> Hi Peter,
>>>>
>>>> recently I had somebody where pcie_aspm=off for whatever reason didn't
>>>> do the trick, can you also check with pcie_aspm.policy=performance.
>>>
>>> We will give it a try later.
>>>
>>>> And please check with "ethtool -S <if>" whether the chip statistics
>>>> show a significant number of errors.
>>>>
>>>> If this doesn't help you may have to bisect to find the offending commit.
>>>
>>> We had tried fallback driver to a few previous commits as following,
>>> but with no luck.
>>>
>>> 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
>>> 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
>>> a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
>>> 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
>>> e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
>>>
>>> Thanks,
>>> David Chang
>>>
>>>>
>>>> Heiner
>>>>
>>>>
>>>> On 30.01.2019 10:59, Peter Ceiley wrote:
>>>>> Hi Heiner,
>>>>>
>>>>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
>>>>> and this made no difference.
>>>>>
>>>>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
>>>>> subsequently loaded the module in the running 4.19.18 kernel. I can
>>>>> confirm that this immediately resolved the issue and access to the NFS
>>>>> shares operated as expected.
>>>>>
>>>>> I presume this means it is an issue with the r8169 driver included in
>>>>> 4.19 onwards?
>>>>>
>>>>> To answer your last questions:
>>>>>
>>>>> Base Board Information
>>>>>     Manufacturer: Alienware
>>>>>     Product Name: 0PGRP5
>>>>>     Version: A02
>>>>>
>>>>> ... and yes, the RTL8168 is the onboard network chip.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Peter.
>>>>>
>>>>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> I think the vendor driver doesn't enable ASPM per default.
>>>>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
>>>>>> Few older systems seem to have issues with ASPM, what kind of
>>>>>> system / mainboard are you using? The RTL8168 is the onboard
>>>>>> network chip?
>>>>>>
>>>>>> Rgds, Heiner
>>>>>>
>>>>>>
>>>>>> On 29.01.2019 07:20, Peter Ceiley wrote:
>>>>>>> Hi Heiner,
>>>>>>>
>>>>>>> Thanks, I'll do some more testing. It might not be the driver - I
>>>>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
>>>>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
>>>>>>> a good idea.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Peter.
>>>>>>>
>>>>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> at a first glance it doesn't look like a typical driver issue.
>>>>>>>> What you could do:
>>>>>>>>
>>>>>>>> - Test the r8169.c from 4.18 on top of 4.19.
>>>>>>>>
>>>>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
>>>>>>>>
>>>>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
>>>>>>>>
>>>>>>>> Any specific reason why you think root cause is in the driver and not
>>>>>>>> elsewhere in the network subsystem?
>>>>>>>>
>>>>>>>> Heiner
>>>>>>>>
>>>>>>>>
>>>>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
>>>>>>>>> Hi Heiner,
>>>>>>>>>
>>>>>>>>> Thanks for getting back to me.
>>>>>>>>>
>>>>>>>>> No, I don't use jumbo packets.
>>>>>>>>>
>>>>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
>>>>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
>>>>>>>>> establishing a connection and is most notable, for example, on my
>>>>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
>>>>>>>>> larger directories) to list the contents of each directory. Once a
>>>>>>>>> transfer begins on a file, I appear to get good bandwidth.
>>>>>>>>>
>>>>>>>>> I'm unsure of the best scientific data to provide you in order to
>>>>>>>>> troubleshoot this issue. Running the following
>>>>>>>>>
>>>>>>>>>     netstat -s |grep retransmitted
>>>>>>>>>
>>>>>>>>> shows a steady increase in retransmitted segments each time I list the
>>>>>>>>> contents of a remote directory, for example, running 'ls' on a
>>>>>>>>> directory containing 345 media files did the following using kernel
>>>>>>>>> 4.19.18:
>>>>>>>>>
>>>>>>>>> increased retransmitted segments by 21 and the 'time' command showed
>>>>>>>>> the following:
>>>>>>>>>     real    0m19.867s
>>>>>>>>>     user    0m0.012s
>>>>>>>>>     sys    0m0.036s
>>>>>>>>>
>>>>>>>>> The same command shows no retransmitted segments running kernel
>>>>>>>>> 4.18.16 and 'time' showed:
>>>>>>>>>     real    0m0.300s
>>>>>>>>>     user    0m0.004s
>>>>>>>>>     sys    0m0.007s
>>>>>>>>>
>>>>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
>>>>>>>>>
>>>>>>>>> dmesg XID:
>>>>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
>>>>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
>>>>>>>>>
>>>>>>>>> # lspci -vv
>>>>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
>>>>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
>>>>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>>>>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
>>>>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
>>>>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
>>>>>>>>>     Latency: 0, Cache Line Size: 64 bytes
>>>>>>>>>     Interrupt: pin A routed to IRQ 19
>>>>>>>>>     Region 0: I/O ports at d000 [size=256]
>>>>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
>>>>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
>>>>>>>>>     Capabilities: [40] Power Management version 3
>>>>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
>>>>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
>>>>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>>>>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
>>>>>>>>>         Address: 0000000000000000  Data: 0000
>>>>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
>>>>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
>>>>>>>>> <512ns, L1 <64us
>>>>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>>>>>>>>> SlotPowerLimit 10.000W
>>>>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
>>>>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>>>>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
>>>>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
>>>>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
>>>>>>>>> Latency L0s unlimited, L1 <64us
>>>>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
>>>>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
>>>>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
>>>>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
>>>>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>>>>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
>>>>>>>>> OBFF Via message/WAKE#
>>>>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>>>>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
>>>>>>>>> OBFF Disabled
>>>>>>>>>              AtomicOpsCtl: ReqEn-
>>>>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
>>>>>>>>>              Transmit Margin: Normal Operating Range,
>>>>>>>>> EnterModifiedCompliance- ComplianceSOS-
>>>>>>>>>              Compliance De-emphasis: -6dB
>>>>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
>>>>>>>>> EqualizationComplete-, EqualizationPhase1-
>>>>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>>>>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
>>>>>>>>>         Vector table: BAR=4 offset=00000000
>>>>>>>>>         PBA: BAR=4 offset=00000800
>>>>>>>>>     Capabilities: [d0] Vital Product Data
>>>>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
>>>>>>>>>         Not readable
>>>>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
>>>>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>> RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>>>>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt-
>>>>>>>>> RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>>>>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
>>>>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>>>>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn-
>>>>>>>>> ECRCChkCap+ ECRCChkEn-
>>>>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>>>>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
>>>>>>>>>     Capabilities: [140 v1] Virtual Channel
>>>>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
>>>>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
>>>>>>>>>         Ctrl:    ArbSelect=Fixed
>>>>>>>>>         Status:    InProgress-
>>>>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
>>>>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
>>>>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
>>>>>>>>>             Status:    NegoPending- InProgress-
>>>>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
>>>>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
>>>>>>>>>         Max snoop latency: 71680ns
>>>>>>>>>         Max no snoop latency: 71680ns
>>>>>>>>>     Kernel driver in use: r8169
>>>>>>>>>     Kernel modules: r8169
>>>>>>>>>
>>>>>>>>> Please let me know if you have any other ideas in terms of testing.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Peter.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have been experiencing very poor network performance since Kernel
>>>>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
>>>>>>>>>>>
>>>>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
>>>>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
>>>>>>>>>>> 4.20.4 & 4.19.18).
>>>>>>>>>>>
>>>>>>>>>>> If someone could guide me in the right direction, I'm happy to help
>>>>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
>>>>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
>>>>>>>>>>> differ in that I still have a network connection. I have attempted to
>>>>>>>>>>> reload the driver on a running system, but this does not improve the
>>>>>>>>>>> situation.
>>>>>>>>>>>
>>>>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
>>>>>>>>>>>
>>>>>>>>>>> lshw shows:
>>>>>>>>>>>        description: Ethernet interface
>>>>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>>>>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
>>>>>>>>>>>        physical id: 0
>>>>>>>>>>>        bus info: pci@0000:03:00.0
>>>>>>>>>>>        logical name: enp3s0
>>>>>>>>>>>        version: 0c
>>>>>>>>>>>        serial:
>>>>>>>>>>>        size: 1Gbit/s
>>>>>>>>>>>        capacity: 1Gbit/s
>>>>>>>>>>>        width: 64 bits
>>>>>>>>>>>        clock: 33MHz
>>>>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list
>>>>>>>>>>> ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd
>>>>>>>>>>> 1000bt-fd autonegotiation
>>>>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169
>>>>>>>>>>> duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25
>>>>>>>>>>> latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
>>>>>>>>>>>        resources: irq:19 ioport:d000(size=256)
>>>>>>>>>>> memory:f7b00000-f7b00fff memory:f2100000-f2103fff
>>>>>>>>>>>
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> the description "poor network performance" is quite vague, therefore:
>>>>>>>>>>
>>>>>>>>>> - Can you provide any measurements?
>>>>>>>>>> - iperf results before and after
>>>>>>>>>> - statistics about dropped packets (rx and/or tx)
>>>>>>>>>> - Do you use jumbo packets?
>>>>>>>>>>
>>>>>>>>>> Also help would be a "lspci -vv" output for the network card and
>>>>>>>>>> the dmesg output line with the chip XID.
>>>>>>>>>>
>>>>>>>>>> Heiner
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply

* [PATCH] can: flexcan: fix timeout when set small bitrate
From: Joakim Zhang @ 2019-01-31  6:58 UTC (permalink / raw)
  To: mkl@pengutronix.de, linux-can@vger.kernel.org
  Cc: wg@grandegger.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, dl-linux-imx, Aisheng Dong,
	Joakim Zhang

From: Dong Aisheng <aisheng.dong@nxp.com>

Current we can meet timeout issue when setting a small bitrate
like 10000 as follows:
root@imx6qdlsolo:~# ip link set can0 up type can bitrate 10000
A link change request failed with some changes committed already.
Interface can0 may have been left with an inconsistent configuration,
please check.
RTNETLINK answers: Connection timed out

It is caused by calling of flexcan_chip_unfreeze() timeout.

Originally the code is using usleep_range(10, 20) for unfreeze operation,
but the patch (8badd65 can: flexcan: avoid calling usleep_range from
interrupt context) changed it into udelay(10) which is only a half delay
of before, there're also some other delay changes.

After only changed unfreeze delay back to udelay(20), the issue is gone.
So other timeout values are kept the same as 8badd65 changed.

Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
---
 drivers/net/can/flexcan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index 2bca867bcfaa..1d3a9053bbeb 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -530,7 +530,7 @@ static int flexcan_chip_unfreeze(struct flexcan_priv *priv)
 	priv->write(reg, &regs->mcr);
 
 	while (timeout-- && (priv->read(&regs->mcr) & FLEXCAN_MCR_FRZ_ACK))
-		udelay(10);
+		udelay(20);
 
 	if (priv->read(&regs->mcr) & FLEXCAN_MCR_FRZ_ACK)
 		return -ETIMEDOUT;
-- 
2.17.1


^ permalink raw reply related

* [PATCH btf 0/3] Add BTF types deduplication algorithm
From: Andrii Nakryiko @ 2019-01-31  6:58 UTC (permalink / raw)
  To: netdev, ast, yhs, kafai, acme, daniel, kernel-team; +Cc: Andrii Nakryiko

This patch series adds BTF deduplication algorithm to libbpf. This algorithm
allows to take BTF type information containing duplicate per-compilation unit
information and reduce it to equivalent set of BTF types with no duplication without
loss of information. It also deduplicates strings and removes those strings that
are not referenced from any BTF type (and line information in .BTF.ext section,
if any).

Algorithm also resolves struct/union forward declarations into concrete BTF types
across multiple compilation units to facilitate better deduplication ratio. If
undesired, this resolution can be disabled through specifying corresponding options.

When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
only 29247 out of initial 3073497 BTF type descriptors.

Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
support is also described more verbosely at:
https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html

Andrii Nakryiko (3):
  btf: extract BTF type size calculation
  btf: add BTF types deduplication algorithm
  selftests/btf: add initial BTF dedup tests

 tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
 tools/lib/bpf/btf.h                    |   11 +
 tools/lib/bpf/libbpf.map               |    3 +
 tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
 4 files changed, 2333 insertions(+), 67 deletions(-)

-- 
2.17.1


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox