Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next v2 1/1] net/mlx5: Fix code style issue in mlx driver
From: David Miller @ 2019-02-05  2:02 UTC (permalink / raw)
  To: saeedm; +Cc: xiangxia.m.yue, netdev
In-Reply-To: <CALzJLG_uLE5c-yHYPGXV5Pp2w4nEYGBU9tzJwNMABNiC1V6HTg@mail.gmail.com>

From: Saeed Mahameed <saeedm@dev.mellanox.co.il>
Date: Mon, 4 Feb 2019 15:26:51 -0800

> On Thu, Jan 31, 2019 at 5:20 PM <xiangxia.m.yue@gmail.com> wrote:
>>
>> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>>
>> Add the tab before '}' and keep the code style consistent.
>>
>> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
> 
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> 
> Dave, you can take this patch to net-next.

Great, applied.

^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: fix libbpf_print
From: Alexei Starovoitov @ 2019-02-05  1:51 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Stanislav Fomichev, netdev@vger.kernel.org, davem@davemloft.net,
	ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <f7b5339e-b682-c00a-afaf-a00e08aff5f7@fb.com>

On Tue, Feb 05, 2019 at 12:37:29AM +0000, Yonghong Song wrote:
> 
> 
> On 2/4/19 4:20 PM, Stanislav Fomichev wrote:
> > With the recent print rework we now have the following problem:
> > pr_{warning,info,debug} expand to __pr which calls libbpf_print.
> > libbpf_print does va_start and calls __libbpf_pr with va_list argument.
> > In __base_pr we again do va_start. Because the next argument is a
> > va_list, we don't get correct pointer to the argument (and print noting
> > in my case, I don't know why it doesn't crash tbh).
> > 
> > Fix this by changing libbpf_print_fn_t signature to accept va_list and
> > remove unneeded calls to va_start in the existing users.
> > 
> > Alternatively, this can we solved by exporting __libbpf_pr and
> > changing __pr macro to (and killing libbpf_print):
> > {
> > 	if (__libbpf_pr)
> > 		__libbpf_pr(level, "libbpf: " fmt, ##__VA_ARGS__)
> > }
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> 
> It is my mistake. My early version did passed correctly and later
> on I made some changes and did not test properly. Thanks for the fix!
> 
> Acked-by: Yonghong Song <yhs@fb.com>

argh.
Applied. Thanks for the fix.
Yonghong, how was the earlier patch set tested?
It sounds that nothing should have worked.
How perf changes were tested?


^ permalink raw reply

* Re: [PATCH btf 0/3] Add BTF types deduplication algorithm
From: Andrii Nakryiko @ 2019-02-05  1:34 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andrii Nakryiko, netdev, Alexei Starovoitov, Yonghong Song,
	Martin Lau, Arnaldo Carvalho de Melo, kernel-team
In-Reply-To: <f31c224c-6f4c-ef04-887c-680e6c8facd4@iogearbox.net>

On Mon, Feb 4, 2019 at 5:32 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Hi Andrii,
>
> On 01/31/2019 07:58 AM, Andrii Nakryiko wrote:
> > This patch series adds BTF deduplication algorithm to libbpf. This algorithm
> > allows to take BTF type information containing duplicate per-compilation unit
> > information and reduce it to equivalent set of BTF types with no duplication without
> > loss of information. It also deduplicates strings and removes those strings that
> > are not referenced from any BTF type (and line information in .BTF.ext section,
> > if any).
> >
> > Algorithm also resolves struct/union forward declarations into concrete BTF types
> > across multiple compilation units to facilitate better deduplication ratio. If
> > undesired, this resolution can be disabled through specifying corresponding options.
> >
> > When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
> > the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
> > only 29247 out of initial 3073497 BTF type descriptors.
> >
> > Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
> > support is also described more verbosely at:
> > https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
> >
> > Andrii Nakryiko (3):
> >   btf: extract BTF type size calculation
> >   btf: add BTF types deduplication algorithm
> >   selftests/btf: add initial BTF dedup tests
> >
> >  tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
> >  tools/lib/bpf/btf.h                    |   11 +
> >  tools/lib/bpf/libbpf.map               |    3 +
> >  tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
> >  4 files changed, 2333 insertions(+), 67 deletions(-)
>
> This would need a proper rebase for bpf-next so that it applies w/o
> bigger conflicts. Please also change the libbpf.map and place the newly
> exported functions under LIBBPF_0.0.2 (as everything under 0.0.1 was from
> prior released kernel).

Done. Thanks!

>
> Thanks,
> Daniel

^ permalink raw reply

* [PATCH btf v2 2/3] btf: add BTF types deduplication algorithm
From: Andrii Nakryiko @ 2019-02-05  1:29 UTC (permalink / raw)
  To: netdev, ast, yhs, daniel, acme, kernel-team, kafai, ecree,
	andrii.nakryiko
  Cc: Andrii Nakryiko
In-Reply-To: <20190205012946.1590917-1-andriin@fb.com>

This patch implements BTF types deduplication algorithm. It allows to
greatly compress typical output of pahole's DWARF-to-BTF conversion or
LLVM's compilation output by detecting and collapsing identical types emitted in
isolation per compilation unit. Algorithm also resolves struct/union forward
declarations into concrete BTF types representing referenced struct/union. If
undesired, this resolution can be disabled through specifying corresponding options.

Algorithm itself and its application to Linux kernel's BTF types is
described in details at:
https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/btf.c      | 1741 ++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/btf.h      |    7 +
 tools/lib/bpf/libbpf.map |    1 +
 3 files changed, 1749 insertions(+)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 06bd1a625ff4..e5097be16018 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -849,3 +849,1744 @@ __u32 btf_ext__line_info_rec_size(const struct btf_ext *btf_ext)
 {
 	return btf_ext->line_info.rec_size;
 }
+
+struct btf_dedup;
+
+static struct btf_dedup *btf_dedup_new(struct btf *btf, struct btf_ext *btf_ext,
+				       const struct btf_dedup_opts *opts);
+static void btf_dedup_free(struct btf_dedup *d);
+static int btf_dedup_strings(struct btf_dedup *d);
+static int btf_dedup_prim_types(struct btf_dedup *d);
+static int btf_dedup_struct_types(struct btf_dedup *d);
+static int btf_dedup_ref_types(struct btf_dedup *d);
+static int btf_dedup_compact_types(struct btf_dedup *d);
+static int btf_dedup_remap_types(struct btf_dedup *d);
+
+/*
+ * Deduplicate BTF types and strings.
+ *
+ * BTF dedup algorithm takes as an input `struct btf` representing `.BTF` ELF
+ * section with all BTF type descriptors and string data. It overwrites that
+ * memory in-place with deduplicated types and strings without any loss of
+ * information. If optional `struct btf_ext` representing '.BTF.ext' ELF section
+ * is provided, all the strings referenced from .BTF.ext section are honored
+ * and updated to point to the right offsets after deduplication.
+ *
+ * If function returns with error, type/string data might be garbled and should
+ * be discarded.
+ *
+ * More verbose and detailed description of both problem btf_dedup is solving,
+ * as well as solution could be found at:
+ * https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
+ *
+ * Problem description and justification
+ * =====================================
+ *
+ * BTF type information is typically emitted either as a result of conversion
+ * from DWARF to BTF or directly by compiler. In both cases, each compilation
+ * unit contains information about a subset of all the types that are used
+ * in an application. These subsets are frequently overlapping and contain a lot
+ * of duplicated information when later concatenated together into a single
+ * binary. This algorithm ensures that each unique type is represented by single
+ * BTF type descriptor, greatly reducing resulting size of BTF data.
+ *
+ * Compilation unit isolation and subsequent duplication of data is not the only
+ * problem. The same type hierarchy (e.g., struct and all the type that struct
+ * references) in different compilation units can be represented in BTF to
+ * various degrees of completeness (or, rather, incompleteness) due to
+ * struct/union forward declarations.
+ *
+ * Let's take a look at an example, that we'll use to better understand the
+ * problem (and solution). Suppose we have two compilation units, each using
+ * same `struct S`, but each of them having incomplete type information about
+ * struct's fields:
+ *
+ * // CU #1:
+ * struct S;
+ * struct A {
+ *	int a;
+ *	struct A* self;
+ *	struct S* parent;
+ * };
+ * struct B;
+ * struct S {
+ *	struct A* a_ptr;
+ *	struct B* b_ptr;
+ * };
+ *
+ * // CU #2:
+ * struct S;
+ * struct A;
+ * struct B {
+ *	int b;
+ *	struct B* self;
+ *	struct S* parent;
+ * };
+ * struct S {
+ *	struct A* a_ptr;
+ *	struct B* b_ptr;
+ * };
+ *
+ * In case of CU #1, BTF data will know only that `struct B` exist (but no
+ * more), but will know the complete type information about `struct A`. While
+ * for CU #2, it will know full type information about `struct B`, but will
+ * only know about forward declaration of `struct A` (in BTF terms, it will
+ * have `BTF_KIND_FWD` type descriptor with name `B`).
+ *
+ * This compilation unit isolation means that it's possible that there is no
+ * single CU with complete type information describing structs `S`, `A`, and
+ * `B`. Also, we might get tons of duplicated and redundant type information.
+ *
+ * Additional complication we need to keep in mind comes from the fact that
+ * types, in general, can form graphs containing cycles, not just DAGs.
+ *
+ * While algorithm does deduplication, it also merges and resolves type
+ * information (unless disabled throught `struct btf_opts`), whenever possible.
+ * E.g., in the example above with two compilation units having partial type
+ * information for structs `A` and `B`, the output of algorithm will emit
+ * a single copy of each BTF type that describes structs `A`, `B`, and `S`
+ * (as well as type information for `int` and pointers), as if they were defined
+ * in a single compilation unit as:
+ *
+ * struct A {
+ *	int a;
+ *	struct A* self;
+ *	struct S* parent;
+ * };
+ * struct B {
+ *	int b;
+ *	struct B* self;
+ *	struct S* parent;
+ * };
+ * struct S {
+ *	struct A* a_ptr;
+ *	struct B* b_ptr;
+ * };
+ *
+ * Algorithm summary
+ * =================
+ *
+ * Algorithm completes its work in 6 separate passes:
+ *
+ * 1. Strings deduplication.
+ * 2. Primitive types deduplication (int, enum, fwd).
+ * 3. Struct/union types deduplication.
+ * 4. Reference types deduplication (pointers, typedefs, arrays, funcs, func
+ *    protos, and const/volatile/restrict modifiers).
+ * 5. Types compaction.
+ * 6. Types remapping.
+ *
+ * Algorithm determines canonical type descriptor, which is a single
+ * representative type for each truly unique type. This canonical type is the
+ * one that will go into final deduplicated BTF type information. For
+ * struct/unions, it is also the type that algorithm will merge additional type
+ * information into (while resolving FWDs), as it discovers it from data in
+ * other CUs. Each input BTF type eventually gets either mapped to itself, if
+ * that type is canonical, or to some other type, if that type is equivalent
+ * and was chosen as canonical representative. This mapping is stored in
+ * `btf_dedup->map` array. This map is also used to record STRUCT/UNION that
+ * FWD type got resolved to.
+ *
+ * To facilitate fast discovery of canonical types, we also maintain canonical
+ * index (`btf_dedup->dedup_table`), which maps type descriptor's signature hash
+ * (i.e., hashed kind, name, size, fields, etc) into a list of canonical types
+ * that match that signature. With sufficiently good choice of type signature
+ * hashing function, we can limit number of canonical types for each unique type
+ * signature to a very small number, allowing to find canonical type for any
+ * duplicated type very quickly.
+ *
+ * Struct/union deduplication is the most critical part and algorithm for
+ * deduplicating structs/unions is described in greater details in comments for
+ * `btf_dedup_is_equiv` function.
+ */
+int btf__dedup(struct btf *btf, struct btf_ext *btf_ext,
+	       const struct btf_dedup_opts *opts)
+{
+	struct btf_dedup *d = btf_dedup_new(btf, btf_ext, opts);
+	int err;
+
+	if (IS_ERR(d)) {
+		pr_debug("btf_dedup_new failed: %ld", PTR_ERR(d));
+		return -EINVAL;
+	}
+
+	err = btf_dedup_strings(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_strings failed:%d\n", err);
+		goto done;
+	}
+	err = btf_dedup_prim_types(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_prim_types failed:%d\n", err);
+		goto done;
+	}
+	err = btf_dedup_struct_types(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_struct_types failed:%d\n", err);
+		goto done;
+	}
+	err = btf_dedup_ref_types(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_ref_types failed:%d\n", err);
+		goto done;
+	}
+	err = btf_dedup_compact_types(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_compact_types failed:%d\n", err);
+		goto done;
+	}
+	err = btf_dedup_remap_types(d);
+	if (err < 0) {
+		pr_debug("btf_dedup_remap_types failed:%d\n", err);
+		goto done;
+	}
+
+done:
+	btf_dedup_free(d);
+	return err;
+}
+
+#define BTF_DEDUP_TABLE_SIZE_LOG 14
+#define BTF_DEDUP_TABLE_MOD ((1 << BTF_DEDUP_TABLE_SIZE_LOG) - 1)
+#define BTF_UNPROCESSED_ID ((__u32)-1)
+#define BTF_IN_PROGRESS_ID ((__u32)-2)
+
+struct btf_dedup_node {
+	struct btf_dedup_node *next;
+	__u32 type_id;
+};
+
+struct btf_dedup {
+	/* .BTF section to be deduped in-place */
+	struct btf *btf;
+	/*
+	 * Optional .BTF.ext section. When provided, any strings referenced
+	 * from it will be taken into account when deduping strings
+	 */
+	struct btf_ext *btf_ext;
+	/*
+	 * This is a map from any type's signature hash to a list of possible
+	 * canonical representative type candidates. Hash collisions are
+	 * ignored, so even types of various kinds can share same list of
+	 * candidates, which is fine because we rely on subsequent
+	 * btf_xxx_equal() checks to authoritatively verify type equality.
+	 */
+	struct btf_dedup_node **dedup_table;
+	/* Canonical types map */
+	__u32 *map;
+	/* Hypothetical mapping, used during type graph equivalence checks */
+	__u32 *hypot_map;
+	__u32 *hypot_list;
+	size_t hypot_cnt;
+	size_t hypot_cap;
+	/* Various option modifying behavior of algorithm */
+	struct btf_dedup_opts opts;
+};
+
+struct btf_str_ptr {
+	const char *str;
+	__u32 new_off;
+	bool used;
+};
+
+struct btf_str_ptrs {
+	struct btf_str_ptr *ptrs;
+	const char *data;
+	__u32 cnt;
+	__u32 cap;
+};
+
+static inline __u32 hash_combine(__u32 h, __u32 value)
+{
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+	return h * 37 + value * GOLDEN_RATIO_PRIME;
+#undef GOLDEN_RATIO_PRIME
+}
+
+#define for_each_hash_node(table, hash, node) \
+	for (node = table[hash & BTF_DEDUP_TABLE_MOD]; node; node = node->next)
+
+static int btf_dedup_table_add(struct btf_dedup *d, __u32 hash, __u32 type_id)
+{
+	struct btf_dedup_node *node = malloc(sizeof(struct btf_dedup_node));
+
+	if (!node)
+		return -ENOMEM;
+	node->type_id = type_id;
+	node->next = d->dedup_table[hash & BTF_DEDUP_TABLE_MOD];
+	d->dedup_table[hash & BTF_DEDUP_TABLE_MOD] = node;
+	return 0;
+}
+
+static int btf_dedup_hypot_map_add(struct btf_dedup *d,
+				   __u32 from_id, __u32 to_id)
+{
+	if (d->hypot_cnt == d->hypot_cap) {
+		__u32 *new_list;
+
+		d->hypot_cap += max(16, d->hypot_cap / 2);
+		new_list = realloc(d->hypot_list, sizeof(__u32) * d->hypot_cap);
+		if (!new_list)
+			return -ENOMEM;
+		d->hypot_list = new_list;
+	}
+	d->hypot_list[d->hypot_cnt++] = from_id;
+	d->hypot_map[from_id] = to_id;
+	return 0;
+}
+
+static void btf_dedup_clear_hypot_map(struct btf_dedup *d)
+{
+	int i;
+
+	for (i = 0; i < d->hypot_cnt; i++)
+		d->hypot_map[d->hypot_list[i]] = BTF_UNPROCESSED_ID;
+	d->hypot_cnt = 0;
+}
+
+static void btf_dedup_table_free(struct btf_dedup *d)
+{
+	struct btf_dedup_node *head, *tmp;
+	int i;
+
+	if (!d->dedup_table)
+		return;
+
+	for (i = 0; i < (1 << BTF_DEDUP_TABLE_SIZE_LOG); i++) {
+		while (d->dedup_table[i]) {
+			tmp = d->dedup_table[i];
+			d->dedup_table[i] = tmp->next;
+			free(tmp);
+		}
+
+		head = d->dedup_table[i];
+		while (head) {
+			tmp = head;
+			head = head->next;
+			free(tmp);
+		}
+	}
+
+	free(d->dedup_table);
+	d->dedup_table = NULL;
+}
+
+static void btf_dedup_free(struct btf_dedup *d)
+{
+	btf_dedup_table_free(d);
+
+	free(d->map);
+	d->map = NULL;
+
+	free(d->hypot_map);
+	d->hypot_map = NULL;
+
+	free(d->hypot_list);
+	d->hypot_list = NULL;
+
+	free(d);
+}
+
+static struct btf_dedup *btf_dedup_new(struct btf *btf, struct btf_ext *btf_ext,
+				       const struct btf_dedup_opts *opts)
+{
+	struct btf_dedup *d = calloc(1, sizeof(struct btf_dedup));
+	int i, err = 0;
+
+	if (!d)
+		return ERR_PTR(-ENOMEM);
+
+	d->btf = btf;
+	d->btf_ext = btf_ext;
+
+	d->dedup_table = calloc(1 << BTF_DEDUP_TABLE_SIZE_LOG,
+				sizeof(struct btf_dedup_node *));
+	if (!d->dedup_table) {
+		err = -ENOMEM;
+		goto done;
+	}
+
+	d->map = malloc(sizeof(__u32) * (1 + btf->nr_types));
+	if (!d->map) {
+		err = -ENOMEM;
+		goto done;
+	}
+	/* special BTF "void" type is made canonical immediately */
+	d->map[0] = 0;
+	for (i = 1; i <= btf->nr_types; i++)
+		d->map[i] = BTF_UNPROCESSED_ID;
+
+	d->hypot_map = malloc(sizeof(__u32) * (1 + btf->nr_types));
+	if (!d->hypot_map) {
+		err = -ENOMEM;
+		goto done;
+	}
+	for (i = 0; i <= btf->nr_types; i++)
+		d->hypot_map[i] = BTF_UNPROCESSED_ID;
+
+	d->opts.dont_resolve_fwds = opts && opts->dont_resolve_fwds;
+
+done:
+	if (err) {
+		btf_dedup_free(d);
+		return ERR_PTR(err);
+	}
+
+	return d;
+}
+
+typedef int (*str_off_fn_t)(__u32 *str_off_ptr, void *ctx);
+
+/*
+ * Iterate over all possible places in .BTF and .BTF.ext that can reference
+ * string and pass pointer to it to a provided callback `fn`.
+ */
+static int btf_for_each_str_off(struct btf_dedup *d, str_off_fn_t fn, void *ctx)
+{
+	void *line_data_cur, *line_data_end;
+	int i, j, r, rec_size;
+	struct btf_type *t;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		t = d->btf->types[i];
+		r = fn(&t->name_off, ctx);
+		if (r)
+			return r;
+
+		switch (BTF_INFO_KIND(t->info)) {
+		case BTF_KIND_STRUCT:
+		case BTF_KIND_UNION: {
+			struct btf_member *m = (struct btf_member *)(t + 1);
+			__u16 vlen = BTF_INFO_VLEN(t->info);
+
+			for (j = 0; j < vlen; j++) {
+				r = fn(&m->name_off, ctx);
+				if (r)
+					return r;
+				m++;
+			}
+			break;
+		}
+		case BTF_KIND_ENUM: {
+			struct btf_enum *m = (struct btf_enum *)(t + 1);
+			__u16 vlen = BTF_INFO_VLEN(t->info);
+
+			for (j = 0; j < vlen; j++) {
+				r = fn(&m->name_off, ctx);
+				if (r)
+					return r;
+				m++;
+			}
+			break;
+		}
+		case BTF_KIND_FUNC_PROTO: {
+			struct btf_param *m = (struct btf_param *)(t + 1);
+			__u16 vlen = BTF_INFO_VLEN(t->info);
+
+			for (j = 0; j < vlen; j++) {
+				r = fn(&m->name_off, ctx);
+				if (r)
+					return r;
+				m++;
+			}
+			break;
+		}
+		default:
+			break;
+		}
+	}
+
+	if (!d->btf_ext)
+		return 0;
+
+	line_data_cur = d->btf_ext->line_info.info;
+	line_data_end = d->btf_ext->line_info.info + d->btf_ext->line_info.len;
+	rec_size = d->btf_ext->line_info.rec_size;
+
+	while (line_data_cur < line_data_end) {
+		struct btf_ext_info_sec *sec = line_data_cur;
+		struct bpf_line_info_min *line_info;
+		__u32 num_info = sec->num_info;
+
+		r = fn(&sec->sec_name_off, ctx);
+		if (r)
+			return r;
+
+		line_data_cur += sizeof(struct btf_ext_info_sec);
+		for (i = 0; i < num_info; i++) {
+			line_info = line_data_cur;
+			r = fn(&line_info->file_name_off, ctx);
+			if (r)
+				return r;
+			r = fn(&line_info->line_off, ctx);
+			if (r)
+				return r;
+			line_data_cur += rec_size;
+		}
+	}
+
+	return 0;
+}
+
+static int str_sort_by_content(const void *a1, const void *a2)
+{
+	const struct btf_str_ptr *p1 = a1;
+	const struct btf_str_ptr *p2 = a2;
+
+	return strcmp(p1->str, p2->str);
+}
+
+static int str_sort_by_offset(const void *a1, const void *a2)
+{
+	const struct btf_str_ptr *p1 = a1;
+	const struct btf_str_ptr *p2 = a2;
+
+	if (p1->str != p2->str)
+		return p1->str < p2->str ? -1 : 1;
+	return 0;
+}
+
+static int btf_dedup_str_ptr_cmp(const void *str_ptr, const void *pelem)
+{
+	const struct btf_str_ptr *p = pelem;
+
+	if (str_ptr != p->str)
+		return (const char *)str_ptr < p->str ? -1 : 1;
+	return 0;
+}
+
+static int btf_str_mark_as_used(__u32 *str_off_ptr, void *ctx)
+{
+	struct btf_str_ptrs *strs;
+	struct btf_str_ptr *s;
+
+	if (*str_off_ptr == 0)
+		return 0;
+
+	strs = ctx;
+	s = bsearch(strs->data + *str_off_ptr, strs->ptrs, strs->cnt,
+		    sizeof(struct btf_str_ptr), btf_dedup_str_ptr_cmp);
+	if (!s)
+		return -EINVAL;
+	s->used = true;
+	return 0;
+}
+
+static int btf_str_remap_offset(__u32 *str_off_ptr, void *ctx)
+{
+	struct btf_str_ptrs *strs;
+	struct btf_str_ptr *s;
+
+	if (*str_off_ptr == 0)
+		return 0;
+
+	strs = ctx;
+	s = bsearch(strs->data + *str_off_ptr, strs->ptrs, strs->cnt,
+		    sizeof(struct btf_str_ptr), btf_dedup_str_ptr_cmp);
+	if (!s)
+		return -EINVAL;
+	*str_off_ptr = s->new_off;
+	return 0;
+}
+
+/*
+ * Dedup string and filter out those that are not referenced from either .BTF
+ * or .BTF.ext (if provided) sections.
+ *
+ * This is done by building index of all strings in BTF's string section,
+ * then iterating over all entities that can reference strings (e.g., type
+ * names, struct field names, .BTF.ext line info, etc) and marking corresponding
+ * strings as used. After that all used strings are deduped and compacted into
+ * sequential blob of memory and new offsets are calculated. Then all the string
+ * references are iterated again and rewritten using new offsets.
+ */
+static int btf_dedup_strings(struct btf_dedup *d)
+{
+	const struct btf_header *hdr = d->btf->hdr;
+	char *start = (char *)d->btf->nohdr_data + hdr->str_off;
+	char *end = start + d->btf->hdr->str_len;
+	char *p = start, *tmp_strs = NULL;
+	struct btf_str_ptrs strs = {
+		.cnt = 0,
+		.cap = 0,
+		.ptrs = NULL,
+		.data = start,
+	};
+	int i, j, err = 0, grp_idx;
+	bool grp_used;
+
+	/* build index of all strings */
+	while (p < end) {
+		if (strs.cnt + 1 > strs.cap) {
+			struct btf_str_ptr *new_ptrs;
+
+			strs.cap += max(strs.cnt / 2, 16);
+			new_ptrs = realloc(strs.ptrs,
+					   sizeof(strs.ptrs[0]) * strs.cap);
+			if (!new_ptrs) {
+				err = -ENOMEM;
+				goto done;
+			}
+			strs.ptrs = new_ptrs;
+		}
+
+		strs.ptrs[strs.cnt].str = p;
+		strs.ptrs[strs.cnt].used = false;
+
+		p += strlen(p) + 1;
+		strs.cnt++;
+	}
+
+	/* temporary storage for deduplicated strings */
+	tmp_strs = malloc(d->btf->hdr->str_len);
+	if (!tmp_strs) {
+		err = -ENOMEM;
+		goto done;
+	}
+
+	/* mark all used strings */
+	strs.ptrs[0].used = true;
+	err = btf_for_each_str_off(d, btf_str_mark_as_used, &strs);
+	if (err)
+		goto done;
+
+	/* sort strings by context, so that we can identify duplicates */
+	qsort(strs.ptrs, strs.cnt, sizeof(strs.ptrs[0]), str_sort_by_content);
+
+	/*
+	 * iterate groups of equal strings and if any instance in a group was
+	 * referenced, emit single instance and remember new offset
+	 */
+	p = tmp_strs;
+	grp_idx = 0;
+	grp_used = strs.ptrs[0].used;
+	/* iterate past end to avoid code duplication after loop */
+	for (i = 1; i <= strs.cnt; i++) {
+		/*
+		 * when i == strs.cnt, we want to skip string comparison and go
+		 * straight to handling last group of strings (otherwise we'd
+		 * need to handle last group after the loop w/ duplicated code)
+		 */
+		if (i < strs.cnt &&
+		    !strcmp(strs.ptrs[i].str, strs.ptrs[grp_idx].str)) {
+			grp_used = grp_used || strs.ptrs[i].used;
+			continue;
+		}
+
+		/*
+		 * this check would have been required after the loop to handle
+		 * last group of strings, but due to <= condition in a loop
+		 * we avoid that duplication
+		 */
+		if (grp_used) {
+			int new_off = p - tmp_strs;
+			__u32 len = strlen(strs.ptrs[grp_idx].str);
+
+			memmove(p, strs.ptrs[grp_idx].str, len + 1);
+			for (j = grp_idx; j < i; j++)
+				strs.ptrs[j].new_off = new_off;
+			p += len + 1;
+		}
+
+		if (i < strs.cnt) {
+			grp_idx = i;
+			grp_used = strs.ptrs[i].used;
+		}
+	}
+
+	/* replace original strings with deduped ones */
+	d->btf->hdr->str_len = p - tmp_strs;
+	memmove(start, tmp_strs, d->btf->hdr->str_len);
+	end = start + d->btf->hdr->str_len;
+
+	/* restore original order for further binary search lookups */
+	qsort(strs.ptrs, strs.cnt, sizeof(strs.ptrs[0]), str_sort_by_offset);
+
+	/* remap string offsets */
+	err = btf_for_each_str_off(d, btf_str_remap_offset, &strs);
+	if (err)
+		goto done;
+
+	d->btf->hdr->str_len = end - start;
+
+done:
+	free(tmp_strs);
+	free(strs.ptrs);
+	return err;
+}
+
+static __u32 btf_hash_common(struct btf_type *t)
+{
+	__u32 h;
+
+	h = hash_combine(0, t->name_off);
+	h = hash_combine(h, t->info);
+	h = hash_combine(h, t->size);
+	return h;
+}
+
+static bool btf_equal_common(struct btf_type *t1, struct btf_type *t2)
+{
+	return t1->name_off == t2->name_off &&
+	       t1->info == t2->info &&
+	       t1->size == t2->size;
+}
+
+/* Calculate type signature hash of INT. */
+static __u32 btf_hash_int(struct btf_type *t)
+{
+	__u32 info = *(__u32 *)(t + 1);
+	__u32 h;
+
+	h = btf_hash_common(t);
+	h = hash_combine(h, info);
+	return h;
+}
+
+/* Check structural equality of two INTs. */
+static bool btf_equal_int(struct btf_type *t1, struct btf_type *t2)
+{
+	__u32 info1, info2;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+	info1 = *(__u32 *)(t1 + 1);
+	info2 = *(__u32 *)(t2 + 1);
+	return info1 == info2;
+}
+
+/* Calculate type signature hash of ENUM. */
+static __u32 btf_hash_enum(struct btf_type *t)
+{
+	struct btf_enum *member = (struct btf_enum *)(t + 1);
+	__u32 vlen = BTF_INFO_VLEN(t->info);
+	__u32 h = btf_hash_common(t);
+	int i;
+
+	for (i = 0; i < vlen; i++) {
+		h = hash_combine(h, member->name_off);
+		h = hash_combine(h, member->val);
+		member++;
+	}
+	return h;
+}
+
+/* Check structural equality of two ENUMs. */
+static bool btf_equal_enum(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_enum *m1, *m2;
+	__u16 vlen;
+	int i;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+
+	vlen = BTF_INFO_VLEN(t1->info);
+	m1 = (struct btf_enum *)(t1 + 1);
+	m2 = (struct btf_enum *)(t2 + 1);
+	for (i = 0; i < vlen; i++) {
+		if (m1->name_off != m2->name_off || m1->val != m2->val)
+			return false;
+		m1++;
+		m2++;
+	}
+	return true;
+}
+
+/*
+ * Calculate type signature hash of STRUCT/UNION, ignoring referenced type IDs,
+ * as referenced type IDs equivalence is established separately during type
+ * graph equivalence check algorithm.
+ */
+static __u32 btf_hash_struct(struct btf_type *t)
+{
+	struct btf_member *member = (struct btf_member *)(t + 1);
+	__u32 vlen = BTF_INFO_VLEN(t->info);
+	__u32 h = btf_hash_common(t);
+	int i;
+
+	for (i = 0; i < vlen; i++) {
+		h = hash_combine(h, member->name_off);
+		h = hash_combine(h, member->offset);
+		/* no hashing of referenced type ID, it can be unresolved yet */
+		member++;
+	}
+	return h;
+}
+
+/*
+ * Check structural compatibility of two FUNC_PROTOs, ignoring referenced type
+ * IDs. This check is performed during type graph equivalence check and
+ * referenced types equivalence is checked separately.
+ */
+static bool btf_equal_struct(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_member *m1, *m2;
+	__u16 vlen;
+	int i;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+
+	vlen = BTF_INFO_VLEN(t1->info);
+	m1 = (struct btf_member *)(t1 + 1);
+	m2 = (struct btf_member *)(t2 + 1);
+	for (i = 0; i < vlen; i++) {
+		if (m1->name_off != m2->name_off || m1->offset != m2->offset)
+			return false;
+		m1++;
+		m2++;
+	}
+	return true;
+}
+
+/*
+ * Calculate type signature hash of ARRAY, including referenced type IDs,
+ * under assumption that they were already resolved to canonical type IDs and
+ * are not going to change.
+ */
+static __u32 btf_hash_array(struct btf_type *t)
+{
+	struct btf_array *info = (struct btf_array *)(t + 1);
+	__u32 h = btf_hash_common(t);
+
+	h = hash_combine(h, info->type);
+	h = hash_combine(h, info->index_type);
+	h = hash_combine(h, info->nelems);
+	return h;
+}
+
+/*
+ * Check exact equality of two ARRAYs, taking into account referenced
+ * type IDs, under assumption that they were already resolved to canonical
+ * type IDs and are not going to change.
+ * This function is called during reference types deduplication to compare
+ * ARRAY to potential canonical representative.
+ */
+static bool btf_equal_array(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_array *info1, *info2;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+
+	info1 = (struct btf_array *)(t1 + 1);
+	info2 = (struct btf_array *)(t2 + 1);
+	return info1->type == info2->type &&
+	       info1->index_type == info2->index_type &&
+	       info1->nelems == info2->nelems;
+}
+
+/*
+ * Check structural compatibility of two ARRAYs, ignoring referenced type
+ * IDs. This check is performed during type graph equivalence check and
+ * referenced types equivalence is checked separately.
+ */
+static bool btf_compat_array(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_array *info1, *info2;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+
+	info1 = (struct btf_array *)(t1 + 1);
+	info2 = (struct btf_array *)(t2 + 1);
+	return info1->nelems == info2->nelems;
+}
+
+/*
+ * Calculate type signature hash of FUNC_PROTO, including referenced type IDs,
+ * under assumption that they were already resolved to canonical type IDs and
+ * are not going to change.
+ */
+static inline __u32 btf_hash_fnproto(struct btf_type *t)
+{
+	struct btf_param *member = (struct btf_param *)(t + 1);
+	__u16 vlen = BTF_INFO_VLEN(t->info);
+	__u32 h = btf_hash_common(t);
+	int i;
+
+	for (i = 0; i < vlen; i++) {
+		h = hash_combine(h, member->name_off);
+		h = hash_combine(h, member->type);
+		member++;
+	}
+	return h;
+}
+
+/*
+ * Check exact equality of two FUNC_PROTOs, taking into account referenced
+ * type IDs, under assumption that they were already resolved to canonical
+ * type IDs and are not going to change.
+ * This function is called during reference types deduplication to compare
+ * FUNC_PROTO to potential canonical representative.
+ */
+static inline bool btf_equal_fnproto(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_param *m1, *m2;
+	__u16 vlen;
+	int i;
+
+	if (!btf_equal_common(t1, t2))
+		return false;
+
+	vlen = BTF_INFO_VLEN(t1->info);
+	m1 = (struct btf_param *)(t1 + 1);
+	m2 = (struct btf_param *)(t2 + 1);
+	for (i = 0; i < vlen; i++) {
+		if (m1->name_off != m2->name_off || m1->type != m2->type)
+			return false;
+		m1++;
+		m2++;
+	}
+	return true;
+}
+
+/*
+ * Check structural compatibility of two FUNC_PROTOs, ignoring referenced type
+ * IDs. This check is performed during type graph equivalence check and
+ * referenced types equivalence is checked separately.
+ */
+static inline bool btf_compat_fnproto(struct btf_type *t1, struct btf_type *t2)
+{
+	struct btf_param *m1, *m2;
+	__u16 vlen;
+	int i;
+
+	/* skip return type ID */
+	if (t1->name_off != t2->name_off || t1->info != t2->info)
+		return false;
+
+	vlen = BTF_INFO_VLEN(t1->info);
+	m1 = (struct btf_param *)(t1 + 1);
+	m2 = (struct btf_param *)(t2 + 1);
+	for (i = 0; i < vlen; i++) {
+		if (m1->name_off != m2->name_off)
+			return false;
+		m1++;
+		m2++;
+	}
+	return true;
+}
+
+/*
+ * Deduplicate primitive types, that can't reference other types, by calculating
+ * their type signature hash and comparing them with any possible canonical
+ * candidate. If no canonical candidate matches, type itself is marked as
+ * canonical and is added into `btf_dedup->dedup_table` as another candidate.
+ */
+static int btf_dedup_prim_type(struct btf_dedup *d, __u32 type_id)
+{
+	struct btf_type *t = d->btf->types[type_id];
+	struct btf_type *cand;
+	struct btf_dedup_node *cand_node;
+	/* if we don't find equivalent type, then we are canonical */
+	__u32 new_id = type_id;
+	__u32 h;
+
+	switch (BTF_INFO_KIND(t->info)) {
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_ARRAY:
+	case BTF_KIND_STRUCT:
+	case BTF_KIND_UNION:
+	case BTF_KIND_FUNC:
+	case BTF_KIND_FUNC_PROTO:
+		return 0;
+
+	case BTF_KIND_INT:
+		h = btf_hash_int(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_int(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+
+	case BTF_KIND_ENUM:
+		h = btf_hash_enum(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_enum(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+
+	case BTF_KIND_FWD:
+		h = btf_hash_common(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_common(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	d->map[type_id] = new_id;
+	if (type_id == new_id && btf_dedup_table_add(d, h, type_id))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btf_dedup_prim_types(struct btf_dedup *d)
+{
+	int i, err;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		err = btf_dedup_prim_type(d, i);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
+/*
+ * Check whether type is already mapped into canonical one (could be to itself).
+ */
+static inline bool is_type_mapped(struct btf_dedup *d, uint32_t type_id)
+{
+	return d->map[type_id] <= BTF_MAX_TYPE;
+}
+
+/*
+ * Resolve type ID into its canonical type ID, if any; otherwise return original
+ * type ID. If type is FWD and is resolved into STRUCT/UNION already, follow
+ * STRUCT/UNION link and resolve it into canonical type ID as well.
+ */
+static inline __u32 resolve_type_id(struct btf_dedup *d, __u32 type_id)
+{
+	while (is_type_mapped(d, type_id) && d->map[type_id] != type_id)
+		type_id = d->map[type_id];
+	return type_id;
+}
+
+/*
+ * Resolve FWD to underlying STRUCT/UNION, if any; otherwise return original
+ * type ID.
+ */
+static uint32_t resolve_fwd_id(struct btf_dedup *d, uint32_t type_id)
+{
+	__u32 orig_type_id = type_id;
+
+	if (BTF_INFO_KIND(d->btf->types[type_id]->info) != BTF_KIND_FWD)
+		return type_id;
+
+	while (is_type_mapped(d, type_id) && d->map[type_id] != type_id)
+		type_id = d->map[type_id];
+
+	if (BTF_INFO_KIND(d->btf->types[type_id]->info) != BTF_KIND_FWD)
+		return type_id;
+
+	return orig_type_id;
+}
+
+
+static inline __u16 btf_fwd_kind(struct btf_type *t)
+{
+	return BTF_INFO_KFLAG(t->info) ? BTF_KIND_UNION : BTF_KIND_STRUCT;
+}
+
+/*
+ * Check equivalence of BTF type graph formed by candidate struct/union (we'll
+ * call it "candidate graph" in this description for brevity) to a type graph
+ * formed by (potential) canonical struct/union ("canonical graph" for brevity
+ * here, though keep in mind that not all types in canonical graph are
+ * necessarily canonical representatives themselves, some of them might be
+ * duplicates or its uniqueness might not have been established yet).
+ * Returns:
+ *  - >0, if type graphs are equivalent;
+ *  -  0, if not equivalent;
+ *  - <0, on error.
+ *
+ * Algorithm performs side-by-side DFS traversal of both type graphs and checks
+ * equivalence of BTF types at each step. If at any point BTF types in candidate
+ * and canonical graphs are not compatible structurally, whole graphs are
+ * incompatible. If types are structurally equivalent (i.e., all information
+ * except referenced type IDs is exactly the same), a mapping from `canon_id` to
+ * a `cand_id` is recored in hypothetical mapping (`btf_dedup->hypot_map`).
+ * If a type references other types, then those referenced types are checked
+ * for equivalence recursively.
+ *
+ * During DFS traversal, if we find that for current `canon_id` type we
+ * already have some mapping in hypothetical map, we check for two possible
+ * situations:
+ *   - `canon_id` is mapped to exactly the same type as `cand_id`. This will
+ *     happen when type graphs have cycles. In this case we assume those two
+ *     types are equivalent.
+ *   - `canon_id` is mapped to different type. This is contradiction in our
+ *     hypothetical mapping, because same graph in canonical graph corresponds
+ *     to two different types in candidate graph, which for equivalent type
+ *     graphs shouldn't happen. This condition terminates equivalence check
+ *     with negative result.
+ *
+ * If type graphs traversal exhausts types to check and find no contradiction,
+ * then type graphs are equivalent.
+ *
+ * When checking types for equivalence, there is one special case: FWD types.
+ * If FWD type resolution is allowed and one of the types (either from canonical
+ * or candidate graph) is FWD and other is STRUCT/UNION (depending on FWD's kind
+ * flag) and their names match, hypothetical mapping is updated to point from
+ * FWD to STRUCT/UNION. If graphs will be determined as equivalent successfully,
+ * this mapping will be used to record FWD -> STRUCT/UNION mapping permanently.
+ *
+ * Technically, this could lead to incorrect FWD to STRUCT/UNION resolution,
+ * if there are two exactly named (or anonymous) structs/unions that are
+ * compatible structurally, one of which has FWD field, while other is concrete
+ * STRUCT/UNION, but according to C sources they are different structs/unions
+ * that are referencing different types with the same name. This is extremely
+ * unlikely to happen, but btf_dedup API allows to disable FWD resolution if
+ * this logic is causing problems.
+ *
+ * Doing FWD resolution means that both candidate and/or canonical graphs can
+ * consists of portions of the graph that come from multiple compilation units.
+ * This is due to the fact that types within single compilation unit are always
+ * deduplicated and FWDs are already resolved, if referenced struct/union
+ * definiton is available. So, if we had unresolved FWD and found corresponding
+ * STRUCT/UNION, they will be from different compilation units. This
+ * consequently means that when we "link" FWD to corresponding STRUCT/UNION,
+ * type graph will likely have at least two different BTF types that describe
+ * same type (e.g., most probably there will be two different BTF types for the
+ * same 'int' primitive type) and could even have "overlapping" parts of type
+ * graph that describe same subset of types.
+ *
+ * This in turn means that our assumption that each type in canonical graph
+ * must correspond to exactly one type in candidate graph might not hold
+ * anymore and will make it harder to detect contradictions using hypothetical
+ * map. To handle this problem, we allow to follow FWD -> STRUCT/UNION
+ * resolution only in canonical graph. FWDs in candidate graphs are never
+ * resolved. To see why it's OK, let's check all possible situations w.r.t. FWDs
+ * that can occur:
+ *   - Both types in canonical and candidate graphs are FWDs. If they are
+ *     structurally equivalent, then they can either be both resolved to the
+ *     same STRUCT/UNION or not resolved at all. In both cases they are
+ *     equivalent and there is no need to resolve FWD on candidate side.
+ *   - Both types in canonical and candidate graphs are concrete STRUCT/UNION,
+ *     so nothing to resolve as well, algorithm will check equivalence anyway.
+ *   - Type in canonical graph is FWD, while type in candidate is concrete
+ *     STRUCT/UNION. In this case candidate graph comes from single compilation
+ *     unit, so there is exactly one BTF type for each unique C type. After
+ *     resolving FWD into STRUCT/UNION, there might be more than one BTF type
+ *     in canonical graph mapping to single BTF type in candidate graph, but
+ *     because hypothetical mapping maps from canonical to candidate types, it's
+ *     alright, and we still maintain the property of having single `canon_id`
+ *     mapping to single `cand_id` (there could be two different `canon_id`
+ *     mapped to the same `cand_id`, but it's not contradictory).
+ *   - Type in canonical graph is concrete STRUCT/UNION, while type in candidate
+ *     graph is FWD. In this case we are just going to check compatibility of
+ *     STRUCT/UNION and corresponding FWD, and if they are compatible, we'll
+ *     assume that whatever STRUCT/UNION FWD resolves to must be equivalent to
+ *     a concrete STRUCT/UNION from canonical graph. If the rest of type graphs
+ *     turn out equivalent, we'll re-resolve FWD to concrete STRUCT/UNION from
+ *     canonical graph.
+ */
+static int btf_dedup_is_equiv(struct btf_dedup *d, __u32 cand_id,
+			      __u32 canon_id)
+{
+	struct btf_type *cand_type;
+	struct btf_type *canon_type;
+	__u32 hypot_type_id;
+	__u16 cand_kind;
+	__u16 canon_kind;
+	int i, eq;
+
+	/* if both resolve to the same canonical, they must be equivalent */
+	if (resolve_type_id(d, cand_id) == resolve_type_id(d, canon_id))
+		return 1;
+
+	canon_id = resolve_fwd_id(d, canon_id);
+
+	hypot_type_id = d->hypot_map[canon_id];
+	if (hypot_type_id <= BTF_MAX_TYPE)
+		return hypot_type_id == cand_id;
+
+	if (btf_dedup_hypot_map_add(d, canon_id, cand_id))
+		return -ENOMEM;
+
+	cand_type = d->btf->types[cand_id];
+	canon_type = d->btf->types[canon_id];
+	cand_kind = BTF_INFO_KIND(cand_type->info);
+	canon_kind = BTF_INFO_KIND(canon_type->info);
+
+	if (cand_type->name_off != canon_type->name_off)
+		return 0;
+
+	/* FWD <--> STRUCT/UNION equivalence check, if enabled */
+	if (!d->opts.dont_resolve_fwds
+	    && (cand_kind == BTF_KIND_FWD || canon_kind == BTF_KIND_FWD)
+	    && cand_kind != canon_kind) {
+		__u16 real_kind;
+		__u16 fwd_kind;
+
+		if (cand_kind == BTF_KIND_FWD) {
+			real_kind = canon_kind;
+			fwd_kind = btf_fwd_kind(cand_type);
+		} else {
+			real_kind = cand_kind;
+			fwd_kind = btf_fwd_kind(canon_type);
+		}
+		return fwd_kind == real_kind;
+	}
+
+	if (cand_type->info != canon_type->info)
+		return 0;
+
+	switch (cand_kind) {
+	case BTF_KIND_INT:
+		return btf_equal_int(cand_type, canon_type);
+
+	case BTF_KIND_ENUM:
+		return btf_equal_enum(cand_type, canon_type);
+
+	case BTF_KIND_FWD:
+		return btf_equal_common(cand_type, canon_type);
+
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_FUNC:
+		return btf_dedup_is_equiv(d, cand_type->type, canon_type->type);
+
+	case BTF_KIND_ARRAY: {
+		struct btf_array *cand_arr, *canon_arr;
+
+		if (!btf_compat_array(cand_type, canon_type))
+			return 0;
+		cand_arr = (struct btf_array *)(cand_type + 1);
+		canon_arr = (struct btf_array *)(canon_type + 1);
+		eq = btf_dedup_is_equiv(d,
+			cand_arr->index_type, canon_arr->index_type);
+		if (eq <= 0)
+			return eq;
+		return btf_dedup_is_equiv(d, cand_arr->type, canon_arr->type);
+	}
+
+	case BTF_KIND_STRUCT:
+	case BTF_KIND_UNION: {
+		struct btf_member *cand_m, *canon_m;
+		__u16 vlen;
+
+		if (!btf_equal_struct(cand_type, canon_type))
+			return 0;
+		vlen = BTF_INFO_VLEN(cand_type->info);
+		cand_m = (struct btf_member *)(cand_type + 1);
+		canon_m = (struct btf_member *)(canon_type + 1);
+		for (i = 0; i < vlen; i++) {
+			eq = btf_dedup_is_equiv(d, cand_m->type, canon_m->type);
+			if (eq <= 0)
+				return eq;
+			cand_m++;
+			canon_m++;
+		}
+
+		return 1;
+	}
+
+	case BTF_KIND_FUNC_PROTO: {
+		struct btf_param *cand_p, *canon_p;
+		__u16 vlen;
+
+		if (!btf_compat_fnproto(cand_type, canon_type))
+			return 0;
+		eq = btf_dedup_is_equiv(d, cand_type->type, canon_type->type);
+		if (eq <= 0)
+			return eq;
+		vlen = BTF_INFO_VLEN(cand_type->info);
+		cand_p = (struct btf_param *)(cand_type + 1);
+		canon_p = (struct btf_param *)(canon_type + 1);
+		for (i = 0; i < vlen; i++) {
+			eq = btf_dedup_is_equiv(d, cand_p->type, canon_p->type);
+			if (eq <= 0)
+				return eq;
+			cand_p++;
+			canon_p++;
+		}
+		return 1;
+	}
+
+	default:
+		return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * Use hypothetical mapping, produced by successful type graph equivalence
+ * check, to augment existing struct/union canonical mapping, where possible.
+ *
+ * If BTF_KIND_FWD resolution is allowed, this mapping is also used to record
+ * FWD -> STRUCT/UNION correspondence as well. FWD resolution is bidirectional:
+ * it doesn't matter if FWD type was part of canonical graph or candidate one,
+ * we are recording the mapping anyway. As opposed to carefulness required
+ * for struct/union correspondence mapping (described below), for FWD resolution
+ * it's not important, as by the time that FWD type (reference type) will be
+ * deduplicated all structs/unions will be deduped already anyway.
+ *
+ * Recording STRUCT/UNION mapping is purely a performance optimization and is
+ * not required for correctness. It needs to be done carefully to ensure that
+ * struct/union from candidate's type graph is not mapped into corresponding
+ * struct/union from canonical type graph that itself hasn't been resolved into
+ * canonical representative. The only guarantee we have is that canonical
+ * struct/union was determined as canonical and that won't change. But any
+ * types referenced through that struct/union fields could have been not yet
+ * resolved, so in case like that it's too early to establish any kind of
+ * correspondence between structs/unions.
+ *
+ * No canonical correspondence is derived for primitive types (they are already
+ * deduplicated completely already anyway) or reference types (they rely on
+ * stability of struct/union canonical relationship for equivalence checks).
+ */
+static void btf_dedup_merge_hypot_map(struct btf_dedup *d)
+{
+	__u32 cand_type_id, targ_type_id;
+	__u16 t_kind, c_kind;
+	__u32 t_id, c_id;
+	int i;
+
+	for (i = 0; i < d->hypot_cnt; i++) {
+		cand_type_id = d->hypot_list[i];
+		targ_type_id = d->hypot_map[cand_type_id];
+		t_id = resolve_type_id(d, targ_type_id);
+		c_id = resolve_type_id(d, cand_type_id);
+		t_kind = BTF_INFO_KIND(d->btf->types[t_id]->info);
+		c_kind = BTF_INFO_KIND(d->btf->types[c_id]->info);
+		/*
+		 * Resolve FWD into STRUCT/UNION.
+		 * It's ok to resolve FWD into STRUCT/UNION that's not yet
+		 * mapped to canonical representative (as opposed to
+		 * STRUCT/UNION <--> STRUCT/UNION mapping logic below), because
+		 * eventually that struct is going to be mapped and all resolved
+		 * FWDs will automatically resolve to correct canonical
+		 * representative. This will happen before ref type deduping,
+		 * which critically depends on stability of these mapping. This
+		 * stability is not a requirement for STRUCT/UNION equivalence
+		 * checks, though.
+		 */
+		if (t_kind != BTF_KIND_FWD && c_kind == BTF_KIND_FWD)
+			d->map[c_id] = t_id;
+		else if (t_kind == BTF_KIND_FWD && c_kind != BTF_KIND_FWD)
+			d->map[t_id] = c_id;
+
+		if ((t_kind == BTF_KIND_STRUCT || t_kind == BTF_KIND_UNION) &&
+		    c_kind != BTF_KIND_FWD &&
+		    is_type_mapped(d, c_id) &&
+		    !is_type_mapped(d, t_id)) {
+			/*
+			 * as a perf optimization, we can map struct/union
+			 * that's part of type graph we just verified for
+			 * equivalence. We can do that for struct/union that has
+			 * canonical representative only, though.
+			 */
+			d->map[t_id] = c_id;
+		}
+	}
+}
+
+/*
+ * Deduplicate struct/union types.
+ *
+ * For each struct/union type its type signature hash is calculated, taking
+ * into account type's name, size, number, order and names of fields, but
+ * ignoring type ID's referenced from fields, because they might not be deduped
+ * completely until after reference types deduplication phase. This type hash
+ * is used to iterate over all potential canonical types, sharing same hash.
+ * For each canonical candidate we check whether type graphs that they form
+ * (through referenced types in fields and so on) are equivalent using algorithm
+ * implemented in `btf_dedup_is_equiv`. If such equivalence is found and
+ * BTF_KIND_FWD resolution is allowed, then hypothetical mapping
+ * (btf_dedup->hypot_map) produced by aforementioned type graph equivalence
+ * algorithm is used to record FWD -> STRUCT/UNION mapping. It's also used to
+ * potentially map other structs/unions to their canonical representatives,
+ * if such relationship hasn't yet been established. This speeds up algorithm
+ * by eliminating some of the duplicate work.
+ *
+ * If no matching canonical representative was found, struct/union is marked
+ * as canonical for itself and is added into btf_dedup->dedup_table hash map
+ * for further look ups.
+ */
+static int btf_dedup_struct_type(struct btf_dedup *d, __u32 type_id)
+{
+	struct btf_dedup_node *cand_node;
+	struct btf_type *t;
+	/* if we don't find equivalent type, then we are canonical */
+	__u32 new_id = type_id;
+	__u16 kind;
+	__u32 h;
+
+	/* already deduped or is in process of deduping (loop detected) */
+	if (d->map[type_id] <= BTF_MAX_TYPE)
+		return 0;
+
+	t = d->btf->types[type_id];
+	kind = BTF_INFO_KIND(t->info);
+
+	if (kind != BTF_KIND_STRUCT && kind != BTF_KIND_UNION)
+		return 0;
+
+	h = btf_hash_struct(t);
+	for_each_hash_node(d->dedup_table, h, cand_node) {
+		int eq;
+
+		btf_dedup_clear_hypot_map(d);
+		eq = btf_dedup_is_equiv(d, type_id, cand_node->type_id);
+		if (eq < 0)
+			return eq;
+		if (!eq)
+			continue;
+		new_id = cand_node->type_id;
+		btf_dedup_merge_hypot_map(d);
+		break;
+	}
+
+	d->map[type_id] = new_id;
+	if (type_id == new_id && btf_dedup_table_add(d, h, type_id))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int btf_dedup_struct_types(struct btf_dedup *d)
+{
+	int i, err;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		err = btf_dedup_struct_type(d, i);
+		if (err)
+			return err;
+	}
+	return 0;
+}
+
+/*
+ * Deduplicate reference type.
+ *
+ * Once all primitive and struct/union types got deduplicated, we can easily
+ * deduplicate all other (reference) BTF types. This is done in two steps:
+ *
+ * 1. Resolve all referenced type IDs into their canonical type IDs. This
+ * resolution can be done either immediately for primitive or struct/union types
+ * (because they were deduped in previous two phases) or recursively for
+ * reference types. Recursion will always terminate at either primitive or
+ * struct/union type, at which point we can "unwind" chain of reference types
+ * one by one. There is no danger of encountering cycles because in C type
+ * system the only way to form type cycle is through struct/union, so any chain
+ * of reference types, even those taking part in a type cycle, will inevitably
+ * reach struct/union at some point.
+ *
+ * 2. Once all referenced type IDs are resolved into canonical ones, BTF type
+ * becomes "stable", in the sense that no further deduplication will cause
+ * any changes to it. With that, it's now possible to calculate type's signature
+ * hash (this time taking into account referenced type IDs) and loop over all
+ * potential canonical representatives. If no match was found, current type
+ * will become canonical representative of itself and will be added into
+ * btf_dedup->dedup_table as another possible canonical representative.
+ */
+static int btf_dedup_ref_type(struct btf_dedup *d, __u32 type_id)
+{
+	struct btf_dedup_node *cand_node;
+	struct btf_type *t, *cand;
+	/* if we don't find equivalent type, then we are representative type */
+	__u32 new_id = type_id;
+	__u32 h, ref_type_id;
+
+	if (d->map[type_id] == BTF_IN_PROGRESS_ID)
+		return -ELOOP;
+	if (d->map[type_id] <= BTF_MAX_TYPE)
+		return resolve_type_id(d, type_id);
+
+	t = d->btf->types[type_id];
+	d->map[type_id] = BTF_IN_PROGRESS_ID;
+
+	switch (BTF_INFO_KIND(t->info)) {
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_FUNC:
+		ref_type_id = btf_dedup_ref_type(d, t->type);
+		if (ref_type_id < 0)
+			return ref_type_id;
+		t->type = ref_type_id;
+
+		h = btf_hash_common(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_common(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+
+	case BTF_KIND_ARRAY: {
+		struct btf_array *info = (struct btf_array *)(t + 1);
+
+		ref_type_id = btf_dedup_ref_type(d, info->type);
+		if (ref_type_id < 0)
+			return ref_type_id;
+		info->type = ref_type_id;
+
+		ref_type_id = btf_dedup_ref_type(d, info->index_type);
+		if (ref_type_id < 0)
+			return ref_type_id;
+		info->index_type = ref_type_id;
+
+		h = btf_hash_array(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_array(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+	}
+
+	case BTF_KIND_FUNC_PROTO: {
+		struct btf_param *param;
+		__u16 vlen;
+		int i;
+
+		ref_type_id = btf_dedup_ref_type(d, t->type);
+		if (ref_type_id < 0)
+			return ref_type_id;
+		t->type = ref_type_id;
+
+		vlen = BTF_INFO_VLEN(t->info);
+		param = (struct btf_param *)(t + 1);
+		for (i = 0; i < vlen; i++) {
+			ref_type_id = btf_dedup_ref_type(d, param->type);
+			if (ref_type_id < 0)
+				return ref_type_id;
+			param->type = ref_type_id;
+			param++;
+		}
+
+		h = btf_hash_fnproto(t);
+		for_each_hash_node(d->dedup_table, h, cand_node) {
+			cand = d->btf->types[cand_node->type_id];
+			if (btf_equal_fnproto(t, cand)) {
+				new_id = cand_node->type_id;
+				break;
+			}
+		}
+		break;
+	}
+
+	default:
+		return -EINVAL;
+	}
+
+	d->map[type_id] = new_id;
+	if (type_id == new_id && btf_dedup_table_add(d, h, type_id))
+		return -ENOMEM;
+
+	return new_id;
+}
+
+static int btf_dedup_ref_types(struct btf_dedup *d)
+{
+	int i, err;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		err = btf_dedup_ref_type(d, i);
+		if (err < 0)
+			return err;
+	}
+	btf_dedup_table_free(d);
+	return 0;
+}
+
+/*
+ * Compact types.
+ *
+ * After we established for each type its corresponding canonical representative
+ * type, we now can eliminate types that are not canonical and leave only
+ * canonical ones layed out sequentially in memory by copying them over
+ * duplicates. During compaction btf_dedup->hypot_map array is reused to store
+ * a map from original type ID to a new compacted type ID, which will be used
+ * during next phase to "fix up" type IDs, referenced from struct/union and
+ * reference types.
+ */
+static int btf_dedup_compact_types(struct btf_dedup *d)
+{
+	struct btf_type **new_types;
+	__u32 next_type_id = 1;
+	char *types_start, *p;
+	int i, len;
+
+	/* we are going to reuse hypot_map to store compaction remapping */
+	d->hypot_map[0] = 0;
+	for (i = 1; i <= d->btf->nr_types; i++)
+		d->hypot_map[i] = BTF_UNPROCESSED_ID;
+
+	types_start = d->btf->nohdr_data + d->btf->hdr->type_off;
+	p = types_start;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		if (d->map[i] != i)
+			continue;
+
+		len = btf_type_size(d->btf->types[i]);
+		if (len < 0)
+			return len;
+
+		memmove(p, d->btf->types[i], len);
+		d->hypot_map[i] = next_type_id;
+		d->btf->types[next_type_id] = (struct btf_type *)p;
+		p += len;
+		next_type_id++;
+	}
+
+	/* shrink struct btf's internal types index and update btf_header */
+	d->btf->nr_types = next_type_id - 1;
+	d->btf->types_size = d->btf->nr_types;
+	d->btf->hdr->type_len = p - types_start;
+	new_types = realloc(d->btf->types,
+			    (1 + d->btf->nr_types) * sizeof(struct btf_type *));
+	if (!new_types)
+		return -ENOMEM;
+	d->btf->types = new_types;
+
+	/* make sure string section follows type information without gaps */
+	d->btf->hdr->str_off = p - (char *)d->btf->nohdr_data;
+	memmove(p, d->btf->strings, d->btf->hdr->str_len);
+	d->btf->strings = p;
+	p += d->btf->hdr->str_len;
+
+	d->btf->data_size = p - (char *)d->btf->data;
+	return 0;
+}
+
+/*
+ * Figure out final (deduplicated and compacted) type ID for provided original
+ * `type_id` by first resolving it into corresponding canonical type ID and
+ * then mapping it to a deduplicated type ID, stored in btf_dedup->hypot_map,
+ * which is populated during compaction phase.
+ */
+static int btf_dedup_remap_type_id(struct btf_dedup *d, __u32 type_id)
+{
+	__u32 resolved_type_id, new_type_id;
+
+	resolved_type_id = resolve_type_id(d, type_id);
+	new_type_id = d->hypot_map[resolved_type_id];
+	if (new_type_id > BTF_MAX_TYPE)
+		return -EINVAL;
+	return new_type_id;
+}
+
+/*
+ * Remap referenced type IDs into deduped type IDs.
+ *
+ * After BTF types are deduplicated and compacted, their final type IDs may
+ * differ from original ones. The map from original to a corresponding
+ * deduped type ID is stored in btf_dedup->hypot_map and is populated during
+ * compaction phase. During remapping phase we are rewriting all type IDs
+ * referenced from any BTF type (e.g., struct fields, func proto args, etc) to
+ * their final deduped type IDs.
+ */
+static int btf_dedup_remap_type(struct btf_dedup *d, __u32 type_id)
+{
+	struct btf_type *t = d->btf->types[type_id];
+	int i, r;
+
+	switch (BTF_INFO_KIND(t->info)) {
+	case BTF_KIND_INT:
+	case BTF_KIND_ENUM:
+		break;
+
+	case BTF_KIND_FWD:
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_FUNC:
+		r = btf_dedup_remap_type_id(d, t->type);
+		if (r < 0)
+			return r;
+		t->type = r;
+		break;
+
+	case BTF_KIND_ARRAY: {
+		struct btf_array *arr_info = (struct btf_array *)(t + 1);
+
+		r = btf_dedup_remap_type_id(d, arr_info->type);
+		if (r < 0)
+			return r;
+		arr_info->type = r;
+		r = btf_dedup_remap_type_id(d, arr_info->index_type);
+		if (r < 0)
+			return r;
+		arr_info->index_type = r;
+		break;
+	}
+
+	case BTF_KIND_STRUCT:
+	case BTF_KIND_UNION: {
+		struct btf_member *member = (struct btf_member *)(t + 1);
+		__u16 vlen = BTF_INFO_VLEN(t->info);
+
+		for (i = 0; i < vlen; i++) {
+			r = btf_dedup_remap_type_id(d, member->type);
+			if (r < 0)
+				return r;
+			member->type = r;
+			member++;
+		}
+		break;
+	}
+
+	case BTF_KIND_FUNC_PROTO: {
+		struct btf_param *param = (struct btf_param *)(t + 1);
+		__u16 vlen = BTF_INFO_VLEN(t->info);
+
+		r = btf_dedup_remap_type_id(d, t->type);
+		if (r < 0)
+			return r;
+		t->type = r;
+
+		for (i = 0; i < vlen; i++) {
+			r = btf_dedup_remap_type_id(d, param->type);
+			if (r < 0)
+				return r;
+			param->type = r;
+			param++;
+		}
+		break;
+	}
+
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int btf_dedup_remap_types(struct btf_dedup *d)
+{
+	int i, r;
+
+	for (i = 1; i <= d->btf->nr_types; i++) {
+		r = btf_dedup_remap_type(d, i);
+		if (r < 0)
+			return r;
+	}
+	return 0;
+}
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index 258c87e9f55d..c739de7ed993 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -84,6 +84,13 @@ LIBBPF_API int btf_ext__reloc_line_info(const struct btf *btf,
 LIBBPF_API __u32 btf_ext__func_info_rec_size(const struct btf_ext *btf_ext);
 LIBBPF_API __u32 btf_ext__line_info_rec_size(const struct btf_ext *btf_ext);
 
+struct btf_dedup_opts {
+	bool dont_resolve_fwds;
+};
+
+LIBBPF_API int btf__dedup(struct btf *btf, struct btf_ext *btf_ext,
+			  const struct btf_dedup_opts *opts);
+
 #ifdef __cplusplus
 } /* extern "C" */
 #endif
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7990e857e003..7e4a8c1e1c1c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -133,6 +133,7 @@ LIBBPF_0.0.2 {
 		bpf_map_lookup_elem_flags;
 		bpf_object__find_map_fd_by_name;
 		bpf_get_link_xdp_id;
+		btf__dedup;
 		btf__get_map_kv_tids;
 		btf_ext__free;
 		btf_ext__func_info_rec_size;
-- 
2.17.1


^ permalink raw reply related

* [PATCH btf v2 3/3] selftests/btf: add initial BTF dedup tests
From: Andrii Nakryiko @ 2019-02-05  1:29 UTC (permalink / raw)
  To: netdev, ast, yhs, daniel, acme, kernel-team, kafai, ecree,
	andrii.nakryiko
  Cc: Andrii Nakryiko
In-Reply-To: <20190205012946.1590917-1-andriin@fb.com>

This patch sets up a new kind of tests (BTF dedup tests) and tests few aspects of
BTF dedup algorithm. More complete set of tests will come in follow up patches.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/btf.c                    |  12 +
 tools/lib/bpf/btf.h                    |   3 +
 tools/lib/bpf/libbpf.map               |   2 +
 tools/testing/selftests/bpf/test_btf.c | 535 ++++++++++++++++++++++++-
 4 files changed, 537 insertions(+), 15 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index e5097be16018..4949f8840bda 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -237,6 +237,11 @@ static int btf_parse_type_sec(struct btf *btf)
 	return 0;
 }
 
+__u32 btf__get_nr_types(const struct btf *btf)
+{
+	return btf->nr_types;
+}
+
 const struct btf_type *btf__type_by_id(const struct btf *btf, __u32 type_id)
 {
 	if (type_id > btf->nr_types)
@@ -427,6 +432,13 @@ int btf__fd(const struct btf *btf)
 	return btf->fd;
 }
 
+void btf__get_strings(const struct btf *btf, const char **strings,
+		      __u32 *str_len)
+{
+	*strings = btf->strings;
+	*str_len = btf->hdr->str_len;
+}
+
 const char *btf__name_by_offset(const struct btf *btf, __u32 offset)
 {
 	if (offset < btf->hdr->str_len)
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index c739de7ed993..25a9d2db035d 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -59,11 +59,14 @@ LIBBPF_API void btf__free(struct btf *btf);
 LIBBPF_API struct btf *btf__new(__u8 *data, __u32 size);
 LIBBPF_API __s32 btf__find_by_name(const struct btf *btf,
 				   const char *type_name);
+LIBBPF_API __u32 btf__get_nr_types(const struct btf *btf);
 LIBBPF_API const struct btf_type *btf__type_by_id(const struct btf *btf,
 						  __u32 id);
 LIBBPF_API __s64 btf__resolve_size(const struct btf *btf, __u32 type_id);
 LIBBPF_API int btf__resolve_type(const struct btf *btf, __u32 type_id);
 LIBBPF_API int btf__fd(const struct btf *btf);
+LIBBPF_API void btf__get_strings(const struct btf *btf, const char **strings,
+				 __u32 *str_len);
 LIBBPF_API const char *btf__name_by_offset(const struct btf *btf, __u32 offset);
 LIBBPF_API int btf__get_from_id(__u32 id, struct btf **btf);
 LIBBPF_API int btf__get_map_kv_tids(const struct btf *btf, char *map_name,
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7e4a8c1e1c1c..89c1149e32ee 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -135,6 +135,8 @@ LIBBPF_0.0.2 {
 		bpf_get_link_xdp_id;
 		btf__dedup;
 		btf__get_map_kv_tids;
+		btf__get_nr_types;
+		btf__get_strings;
 		btf_ext__free;
 		btf_ext__func_info_rec_size;
 		btf_ext__line_info_rec_size;
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index aebaeff5a5a0..91080311f3e1 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -79,12 +79,21 @@ static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
 	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_INT, 0, 0), sz),	\
 	BTF_INT_ENC(encoding, bits_offset, bits)
 
+#define BTF_FWD_ENC(name, kind_flag) \
+	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_FWD, kind_flag, 0), 0)
+
 #define BTF_ARRAY_ENC(type, index_type, nr_elems)	\
 	(type), (index_type), (nr_elems)
 #define BTF_TYPE_ARRAY_ENC(type, index_type, nr_elems) \
 	BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_ARRAY, 0, 0), 0), \
 	BTF_ARRAY_ENC(type, index_type, nr_elems)
 
+#define BTF_STRUCT_ENC(name, nr_elems, sz)	\
+	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, nr_elems), sz)
+
+#define BTF_UNION_ENC(name, nr_elems, sz)	\
+	BTF_TYPE_ENC(name, BTF_INFO_ENC(BTF_KIND_UNION, 0, nr_elems), sz)
+
 #define BTF_MEMBER_ENC(name, type, bits_offset)	\
 	(name), (type), (bits_offset)
 #define BTF_ENUM_ENC(name, val) (name), (val)
@@ -100,6 +109,12 @@ static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
 #define BTF_CONST_ENC(type) \
 	BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_CONST, 0, 0), type)
 
+#define BTF_VOLATILE_ENC(type) \
+	BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_VOLATILE, 0, 0), type)
+
+#define BTF_RESTRICT_ENC(type) \
+	BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_RESTRICT, 0, 0), type)
+
 #define BTF_FUNC_PROTO_ENC(ret_type, nargs) \
 	BTF_TYPE_ENC(0, BTF_INFO_ENC(BTF_KIND_FUNC_PROTO, 0, nargs), ret_type)
 
@@ -112,6 +127,10 @@ static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
 #define BTF_END_RAW 0xdeadbeef
 #define NAME_TBD 0xdeadb33f
 
+#define NAME_NTH(N) (0xffff0000 | N)
+#define IS_NAME_NTH(X) ((X & 0xffff0000) == 0xffff0000)
+#define GET_NAME_NTH_IDX(X) (X & 0x0000ffff)
+
 #define MAX_NR_RAW_U32 1024
 #define BTF_LOG_BUF_SIZE 65535
 
@@ -120,12 +139,14 @@ static struct args {
 	unsigned int file_test_num;
 	unsigned int get_info_test_num;
 	unsigned int info_raw_test_num;
+	unsigned int dedup_test_num;
 	bool raw_test;
 	bool file_test;
 	bool get_info_test;
 	bool pprint_test;
 	bool always_log;
 	bool info_raw_test;
+	bool dedup_test;
 } args;
 
 static char btf_log_buf[BTF_LOG_BUF_SIZE];
@@ -2836,11 +2857,13 @@ static void *btf_raw_create(const struct btf_header *hdr,
 			    const char **ret_next_str)
 {
 	const char *next_str = str, *end_str = str + str_sec_size;
+	const char **strs_idx = NULL, **tmp_strs_idx;
+	int strs_cap = 0, strs_cnt = 0, next_str_idx = 0;
 	unsigned int size_needed, offset;
 	struct btf_header *ret_hdr;
-	int i, type_sec_size;
+	int i, type_sec_size, err = 0;
 	uint32_t *ret_types;
-	void *raw_btf;
+	void *raw_btf = NULL;
 
 	type_sec_size = get_raw_sec_size(raw_types);
 	if (CHECK(type_sec_size < 0, "Cannot get nr_raw_types"))
@@ -2855,17 +2878,44 @@ static void *btf_raw_create(const struct btf_header *hdr,
 	memcpy(raw_btf, hdr, sizeof(*hdr));
 	offset = sizeof(*hdr);
 
+	/* Index strings */
+	while ((next_str = get_next_str(next_str, end_str))) {
+		if (strs_cnt == strs_cap) {
+			strs_cap += max(16, strs_cap / 2);
+			tmp_strs_idx = realloc(strs_idx,
+					       sizeof(*strs_idx) * strs_cap);
+			if (CHECK(!tmp_strs_idx,
+				  "Cannot allocate memory for strs_idx")) {
+				err = -1;
+				goto done;
+			}
+			strs_idx = tmp_strs_idx;
+		}
+		strs_idx[strs_cnt++] = next_str;
+		next_str += strlen(next_str);
+	}
+
 	/* Copy type section */
 	ret_types = raw_btf + offset;
 	for (i = 0; i < type_sec_size / sizeof(raw_types[0]); i++) {
 		if (raw_types[i] == NAME_TBD) {
-			next_str = get_next_str(next_str, end_str);
-			if (CHECK(!next_str, "Error in getting next_str")) {
-				free(raw_btf);
-				return NULL;
+			if (CHECK(next_str_idx == strs_cnt,
+				  "Error in getting next_str #%d",
+				  next_str_idx)) {
+				err = -1;
+				goto done;
 			}
-			ret_types[i] = next_str - str;
-			next_str += strlen(next_str);
+			ret_types[i] = strs_idx[next_str_idx++] - str;
+		} else if (IS_NAME_NTH(raw_types[i])) {
+			int idx = GET_NAME_NTH_IDX(raw_types[i]);
+
+			if (CHECK(idx <= 0 || idx > strs_cnt,
+				  "Error getting string #%d, strs_cnt:%d",
+				  idx, strs_cnt)) {
+				err = -1;
+				goto done;
+			}
+			ret_types[i] = strs_idx[idx-1] - str;
 		} else {
 			ret_types[i] = raw_types[i];
 		}
@@ -2882,8 +2932,17 @@ static void *btf_raw_create(const struct btf_header *hdr,
 
 	*btf_size = size_needed;
 	if (ret_next_str)
-		*ret_next_str = next_str;
+		*ret_next_str =
+			next_str_idx < strs_cnt ? strs_idx[next_str_idx] : NULL;
 
+done:
+	if (err) {
+		if (raw_btf)
+			free(raw_btf);
+		if (strs_idx)
+			free(strs_idx);
+		return NULL;
+	}
 	return raw_btf;
 }
 
@@ -5552,20 +5611,450 @@ static int test_info_raw(void)
 	return err;
 }
 
+struct btf_raw_data {
+	__u32 raw_types[MAX_NR_RAW_U32];
+	const char *str_sec;
+	__u32 str_sec_size;
+};
+
+struct btf_dedup_test {
+	const char *descr;
+	struct btf_raw_data input;
+	struct btf_raw_data expect;
+	struct btf_dedup_opts opts;
+};
+
+const struct btf_dedup_test dedup_tests[] = {
+
+{
+	.descr = "dedup: unused strings filtering",
+	.input = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_TYPE_INT_ENC(NAME_NTH(5), BTF_INT_SIGNED, 0, 64, 8),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0unused\0int\0foo\0bar\0long"),
+	},
+	.expect = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 64, 8),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0long"),
+	},
+	.opts = {
+		.dont_resolve_fwds = false,
+	},
+},
+{
+	.descr = "dedup: strings deduplication",
+	.input = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 64, 8),
+			BTF_TYPE_INT_ENC(NAME_NTH(3), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_TYPE_INT_ENC(NAME_NTH(4), BTF_INT_SIGNED, 0, 64, 8),
+			BTF_TYPE_INT_ENC(NAME_NTH(5), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0long int\0int\0long int\0int"),
+	},
+	.expect = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 64, 8),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0long int"),
+	},
+	.opts = {
+		.dont_resolve_fwds = false,
+	},
+},
+{
+	.descr = "dedup: struct example #1",
+	/*
+	 * struct s {
+	 *	struct s *next;
+	 *	const int *a;
+	 *	int b[16];
+	 *	int c;
+	 * }
+	 */
+	.input = {
+		.raw_types = {
+			/* int */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),	/* [1] */
+			/* int[16] */
+			BTF_TYPE_ARRAY_ENC(1, 1, 16),					/* [2] */
+			/* struct s { */
+			BTF_STRUCT_ENC(NAME_NTH(2), 4, 84),				/* [3] */
+				BTF_MEMBER_ENC(NAME_NTH(3), 4, 0),	/* struct s *next;	*/
+				BTF_MEMBER_ENC(NAME_NTH(4), 5, 64),	/* const int *a;	*/
+				BTF_MEMBER_ENC(NAME_NTH(5), 2, 128),	/* int b[16];		*/
+				BTF_MEMBER_ENC(NAME_NTH(6), 1, 640),	/* int c;		*/
+			/* ptr -> [3] struct s */
+			BTF_PTR_ENC(3),							/* [4] */
+			/* ptr -> [6] const int */
+			BTF_PTR_ENC(6),							/* [5] */
+			/* const -> [1] int */
+			BTF_CONST_ENC(1),						/* [6] */
+
+			/* full copy of the above */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),	/* [7] */
+			BTF_TYPE_ARRAY_ENC(7, 7, 16),					/* [8] */
+			BTF_STRUCT_ENC(NAME_NTH(2), 4, 84),				/* [9] */
+				BTF_MEMBER_ENC(NAME_NTH(3), 10, 0),
+				BTF_MEMBER_ENC(NAME_NTH(4), 11, 64),
+				BTF_MEMBER_ENC(NAME_NTH(5), 8, 128),
+				BTF_MEMBER_ENC(NAME_NTH(6), 7, 640),
+			BTF_PTR_ENC(9),							/* [10] */
+			BTF_PTR_ENC(12),						/* [11] */
+			BTF_CONST_ENC(7),						/* [12] */
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0s\0next\0a\0b\0c\0"),
+	},
+	.expect = {
+		.raw_types = {
+			/* int */
+			BTF_TYPE_INT_ENC(NAME_NTH(4), BTF_INT_SIGNED, 0, 32, 4),	/* [1] */
+			/* int[16] */
+			BTF_TYPE_ARRAY_ENC(1, 1, 16),					/* [2] */
+			/* struct s { */
+			BTF_STRUCT_ENC(NAME_NTH(6), 4, 84),				/* [3] */
+				BTF_MEMBER_ENC(NAME_NTH(5), 4, 0),	/* struct s *next;	*/
+				BTF_MEMBER_ENC(NAME_NTH(1), 5, 64),	/* const int *a;	*/
+				BTF_MEMBER_ENC(NAME_NTH(2), 2, 128),	/* int b[16];		*/
+				BTF_MEMBER_ENC(NAME_NTH(3), 1, 640),	/* int c;		*/
+			/* ptr -> [3] struct s */
+			BTF_PTR_ENC(3),							/* [4] */
+			/* ptr -> [6] const int */
+			BTF_PTR_ENC(6),							/* [5] */
+			/* const -> [1] int */
+			BTF_CONST_ENC(1),						/* [6] */
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0a\0b\0c\0int\0next\0s"),
+	},
+	.opts = {
+		.dont_resolve_fwds = false,
+	},
+},
+{
+	.descr = "dedup: all possible kinds (no duplicates)",
+	.input = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 8),		/* [1] int */
+			BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_ENUM, 0, 2), 4),	/* [2] enum */
+				BTF_ENUM_ENC(NAME_TBD, 0),
+				BTF_ENUM_ENC(NAME_TBD, 1),
+			BTF_FWD_ENC(NAME_TBD, 1 /* union kind_flag */),			/* [3] fwd */
+			BTF_TYPE_ARRAY_ENC(2, 1, 7),					/* [4] array */
+			BTF_STRUCT_ENC(NAME_TBD, 1, 4),					/* [5] struct */
+				BTF_MEMBER_ENC(NAME_TBD, 1, 0),
+			BTF_UNION_ENC(NAME_TBD, 1, 4),					/* [6] union */
+				BTF_MEMBER_ENC(NAME_TBD, 1, 0),
+			BTF_TYPEDEF_ENC(NAME_TBD, 1),					/* [7] typedef */
+			BTF_PTR_ENC(0),							/* [8] ptr */
+			BTF_CONST_ENC(8),						/* [9] const */
+			BTF_VOLATILE_ENC(8),						/* [10] volatile */
+			BTF_RESTRICT_ENC(8),						/* [11] restrict */
+			BTF_FUNC_PROTO_ENC(1, 2),					/* [12] func_proto */
+				BTF_FUNC_PROTO_ARG_ENC(NAME_TBD, 1),
+				BTF_FUNC_PROTO_ARG_ENC(NAME_TBD, 8),
+			BTF_FUNC_ENC(NAME_TBD, 12),					/* [13] func */
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0A\0B\0C\0D\0E\0F\0G\0H\0I\0J\0K\0L\0M"),
+	},
+	.expect = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_TBD, BTF_INT_SIGNED, 0, 32, 8),		/* [1] int */
+			BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_ENUM, 0, 2), 4),	/* [2] enum */
+				BTF_ENUM_ENC(NAME_TBD, 0),
+				BTF_ENUM_ENC(NAME_TBD, 1),
+			BTF_FWD_ENC(NAME_TBD, 1 /* union kind_flag */),			/* [3] fwd */
+			BTF_TYPE_ARRAY_ENC(2, 1, 7),					/* [4] array */
+			BTF_STRUCT_ENC(NAME_TBD, 1, 4),					/* [5] struct */
+				BTF_MEMBER_ENC(NAME_TBD, 1, 0),
+			BTF_UNION_ENC(NAME_TBD, 1, 4),					/* [6] union */
+				BTF_MEMBER_ENC(NAME_TBD, 1, 0),
+			BTF_TYPEDEF_ENC(NAME_TBD, 1),					/* [7] typedef */
+			BTF_PTR_ENC(0),							/* [8] ptr */
+			BTF_CONST_ENC(8),						/* [9] const */
+			BTF_VOLATILE_ENC(8),						/* [10] volatile */
+			BTF_RESTRICT_ENC(8),						/* [11] restrict */
+			BTF_FUNC_PROTO_ENC(1, 2),					/* [12] func_proto */
+				BTF_FUNC_PROTO_ARG_ENC(NAME_TBD, 1),
+				BTF_FUNC_PROTO_ARG_ENC(NAME_TBD, 8),
+			BTF_FUNC_ENC(NAME_TBD, 12),					/* [13] func */
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0A\0B\0C\0D\0E\0F\0G\0H\0I\0J\0K\0L\0M"),
+	},
+	.opts = {
+		.dont_resolve_fwds = false,
+	},
+},
+{
+	.descr = "dedup: no int duplicates",
+	.input = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 8),
+			/* different name */
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 32, 8),
+			/* different encoding */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_CHAR, 0, 32, 8),
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_BOOL, 0, 32, 8),
+			/* different bit offset */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 8, 32, 8),
+			/* different bit size */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 27, 8),
+			/* different byte size */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0some other int"),
+	},
+	.expect = {
+		.raw_types = {
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 8),
+			/* different name */
+			BTF_TYPE_INT_ENC(NAME_NTH(2), BTF_INT_SIGNED, 0, 32, 8),
+			/* different encoding */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_CHAR, 0, 32, 8),
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_BOOL, 0, 32, 8),
+			/* different bit offset */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 8, 32, 8),
+			/* different bit size */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 27, 8),
+			/* different byte size */
+			BTF_TYPE_INT_ENC(NAME_NTH(1), BTF_INT_SIGNED, 0, 32, 4),
+			BTF_END_RAW,
+		},
+		BTF_STR_SEC("\0int\0some other int"),
+	},
+	.opts = {
+		.dont_resolve_fwds = false,
+	},
+},
+
+};
+
+static int btf_type_size(const struct btf_type *t)
+{
+	int base_size = sizeof(struct btf_type);
+	__u16 vlen = BTF_INFO_VLEN(t->info);
+	__u16 kind = BTF_INFO_KIND(t->info);
+
+	switch (kind) {
+	case BTF_KIND_FWD:
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_FUNC:
+		return base_size;
+	case BTF_KIND_INT:
+		return base_size + sizeof(__u32);
+	case BTF_KIND_ENUM:
+		return base_size + vlen * sizeof(struct btf_enum);
+	case BTF_KIND_ARRAY:
+		return base_size + sizeof(struct btf_array);
+	case BTF_KIND_STRUCT:
+	case BTF_KIND_UNION:
+		return base_size + vlen * sizeof(struct btf_member);
+	case BTF_KIND_FUNC_PROTO:
+		return base_size + vlen * sizeof(struct btf_param);
+	default:
+		fprintf(stderr, "Unsupported BTF_KIND:%u\n", kind);
+		return -EINVAL;
+	}
+}
+
+static void dump_btf_strings(const char *strs, __u32 len)
+{
+	const char *cur = strs;
+	int i = 0;
+
+	while (cur < strs + len) {
+		fprintf(stderr, "string #%d: '%s'\n", i, cur);
+		cur += strlen(cur) + 1;
+		i++;
+	}
+}
+
+static int do_test_dedup(unsigned int test_num)
+{
+	const struct btf_dedup_test *test = &dedup_tests[test_num - 1];
+	int err = 0, i;
+	__u32 test_nr_types, expect_nr_types, test_str_len, expect_str_len;
+	void *raw_btf;
+	unsigned int raw_btf_size;
+	struct btf *test_btf = NULL, *expect_btf = NULL;
+	const char *ret_test_next_str, *ret_expect_next_str;
+	const char *test_strs, *expect_strs;
+	const char *test_str_cur, *test_str_end;
+	const char *expect_str_cur, *expect_str_end;
+
+	fprintf(stderr, "BTF dedup test[%u] (%s):", test_num, test->descr);
+
+	raw_btf = btf_raw_create(&hdr_tmpl, test->input.raw_types,
+				 test->input.str_sec, test->input.str_sec_size,
+				 &raw_btf_size, &ret_test_next_str);
+	if (!raw_btf)
+		return -1;
+	test_btf = btf__new((__u8 *)raw_btf, raw_btf_size);
+	free(raw_btf);
+	if (CHECK(IS_ERR(test_btf), "invalid test_btf errno:%ld",
+		  PTR_ERR(test_btf))) {
+		err = -1;
+		goto done;
+	}
+
+	raw_btf = btf_raw_create(&hdr_tmpl, test->expect.raw_types,
+				 test->expect.str_sec,
+				 test->expect.str_sec_size,
+				 &raw_btf_size, &ret_expect_next_str);
+	if (!raw_btf)
+		return -1;
+	expect_btf = btf__new((__u8 *)raw_btf, raw_btf_size);
+	free(raw_btf);
+	if (CHECK(IS_ERR(expect_btf), "invalid expect_btf errno:%ld",
+		  PTR_ERR(expect_btf))) {
+		err = -1;
+		goto done;
+	}
+
+	err = btf__dedup(test_btf, NULL, &test->opts);
+	if (CHECK(err, "btf_dedup failed errno:%d", err)) {
+		err = -1;
+		goto done;
+	}
+
+	btf__get_strings(test_btf, &test_strs, &test_str_len);
+	btf__get_strings(expect_btf, &expect_strs, &expect_str_len);
+	if (CHECK(test_str_len != expect_str_len,
+		  "test_str_len:%u != expect_str_len:%u",
+		  test_str_len, expect_str_len)) {
+		fprintf(stderr, "\ntest strings:\n");
+		dump_btf_strings(test_strs, test_str_len);
+		fprintf(stderr, "\nexpected strings:\n");
+		dump_btf_strings(expect_strs, expect_str_len);
+		err = -1;
+		goto done;
+	}
+
+	test_str_cur = test_strs;
+	test_str_end = test_strs + test_str_len;
+	expect_str_cur = expect_strs;
+	expect_str_end = expect_strs + expect_str_len;
+	while (test_str_cur < test_str_end && expect_str_cur < expect_str_end) {
+		size_t test_len, expect_len;
+
+		test_len = strlen(test_str_cur);
+		expect_len = strlen(expect_str_cur);
+		if (CHECK(test_len != expect_len,
+			  "test_len:%zu != expect_len:%zu "
+			  "(test_str:%s, expect_str:%s)",
+			  test_len, expect_len, test_str_cur, expect_str_cur)) {
+			err = -1;
+			goto done;
+		}
+		if (CHECK(strcmp(test_str_cur, expect_str_cur),
+			  "test_str:%s != expect_str:%s",
+			  test_str_cur, expect_str_cur)) {
+			err = -1;
+			goto done;
+		}
+		test_str_cur += test_len + 1;
+		expect_str_cur += expect_len + 1;
+	}
+	if (CHECK(test_str_cur != test_str_end,
+		  "test_str_cur:%p != test_str_end:%p",
+		  test_str_cur, test_str_end)) {
+		err = -1;
+		goto done;
+	}
+
+	test_nr_types = btf__get_nr_types(test_btf);
+	expect_nr_types = btf__get_nr_types(expect_btf);
+	if (CHECK(test_nr_types != expect_nr_types,
+		  "test_nr_types:%u != expect_nr_types:%u",
+		  test_nr_types, expect_nr_types)) {
+		err = -1;
+		goto done;
+	}
+
+	for (i = 1; i <= test_nr_types; i++) {
+		const struct btf_type *test_type, *expect_type;
+		int test_size, expect_size;
+
+		test_type = btf__type_by_id(test_btf, i);
+		expect_type = btf__type_by_id(expect_btf, i);
+		test_size = btf_type_size(test_type);
+		expect_size = btf_type_size(expect_type);
+
+		if (CHECK(test_size != expect_size,
+			  "type #%d: test_size:%d != expect_size:%u",
+			  i, test_size, expect_size)) {
+			err = -1;
+			goto done;
+		}
+		if (CHECK(memcmp((void *)test_type,
+				 (void *)expect_type,
+				 test_size),
+			  "type #%d: contents differ", i)) {
+			err = -1;
+			goto done;
+		}
+	}
+
+done:
+	if (!err)
+		fprintf(stderr, "OK");
+	if (!IS_ERR(test_btf))
+		btf__free(test_btf);
+	if (!IS_ERR(expect_btf))
+		btf__free(expect_btf);
+
+	return err;
+}
+
+static int test_dedup(void)
+{
+	unsigned int i;
+	int err = 0;
+
+	if (args.dedup_test_num)
+		return count_result(do_test_dedup(args.dedup_test_num));
+
+	for (i = 1; i <= ARRAY_SIZE(dedup_tests); i++)
+		err |= count_result(do_test_dedup(i));
+
+	return err;
+}
+
 static void usage(const char *cmd)
 {
 	fprintf(stderr, "Usage: %s [-l] [[-r btf_raw_test_num (1 - %zu)] |\n"
 			"\t[-g btf_get_info_test_num (1 - %zu)] |\n"
 			"\t[-f btf_file_test_num (1 - %zu)] |\n"
 			"\t[-k btf_prog_info_raw_test_num (1 - %zu)] |\n"
-			"\t[-p (pretty print test)]]\n",
+			"\t[-p (pretty print test)] |\n"
+			"\t[-d btf_dedup_test_num (1 - %zu)]]\n",
 		cmd, ARRAY_SIZE(raw_tests), ARRAY_SIZE(get_info_tests),
-		ARRAY_SIZE(file_tests), ARRAY_SIZE(info_raw_tests));
+		ARRAY_SIZE(file_tests), ARRAY_SIZE(info_raw_tests),
+		ARRAY_SIZE(dedup_tests));
 }
 
 static int parse_args(int argc, char **argv)
 {
-	const char *optstr = "lpk:f:r:g:";
+	const char *optstr = "hlpk:f:r:g:d:";
 	int opt;
 
 	while ((opt = getopt(argc, argv, optstr)) != -1) {
@@ -5592,12 +6081,16 @@ static int parse_args(int argc, char **argv)
 			args.info_raw_test_num = atoi(optarg);
 			args.info_raw_test = true;
 			break;
+		case 'd':
+			args.dedup_test_num = atoi(optarg);
+			args.dedup_test = true;
+			break;
 		case 'h':
 			usage(argv[0]);
 			exit(0);
 		default:
-				usage(argv[0]);
-				return -1;
+			usage(argv[0]);
+			return -1;
 		}
 	}
 
@@ -5633,6 +6126,14 @@ static int parse_args(int argc, char **argv)
 		return -1;
 	}
 
+	if (args.dedup_test_num &&
+	    (args.dedup_test_num < 1 ||
+	     args.dedup_test_num > ARRAY_SIZE(dedup_tests))) {
+		fprintf(stderr, "BTF dedup test number must be [1 - %zu]\n",
+			ARRAY_SIZE(dedup_tests));
+		return -1;
+	}
+
 	return 0;
 }
 
@@ -5668,14 +6169,18 @@ int main(int argc, char **argv)
 	if (args.info_raw_test)
 		err |= test_info_raw();
 
+	if (args.dedup_test)
+		err |= test_dedup();
+
 	if (args.raw_test || args.get_info_test || args.file_test ||
-	    args.pprint_test || args.info_raw_test)
+	    args.pprint_test || args.info_raw_test || args.dedup_test)
 		goto done;
 
 	err |= test_raw();
 	err |= test_get_info();
 	err |= test_file();
 	err |= test_info_raw();
+	err |= test_dedup();
 
 done:
 	print_summary();
-- 
2.17.1


^ permalink raw reply related

* [PATCH btf v2 1/3] btf: extract BTF type size calculation
From: Andrii Nakryiko @ 2019-02-05  1:29 UTC (permalink / raw)
  To: netdev, ast, yhs, daniel, acme, kernel-team, kafai, ecree,
	andrii.nakryiko
  Cc: Andrii Nakryiko
In-Reply-To: <20190205012946.1590917-1-andriin@fb.com>

This pre-patch extracts calculation of amount of space taken by BTF type descriptor
for later reuse by btf_dedup functionality.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/lib/bpf/btf.c | 98 +++++++++++++++++++++------------------------
 1 file changed, 46 insertions(+), 52 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 7ec0463354db..06bd1a625ff4 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -182,6 +182,37 @@ static int btf_parse_str_sec(struct btf *btf)
 	return 0;
 }
 
+static int btf_type_size(struct btf_type *t)
+{
+	int base_size = sizeof(struct btf_type);
+	__u16 vlen = BTF_INFO_VLEN(t->info);
+
+	switch (BTF_INFO_KIND(t->info)) {
+	case BTF_KIND_FWD:
+	case BTF_KIND_CONST:
+	case BTF_KIND_VOLATILE:
+	case BTF_KIND_RESTRICT:
+	case BTF_KIND_PTR:
+	case BTF_KIND_TYPEDEF:
+	case BTF_KIND_FUNC:
+		return base_size;
+	case BTF_KIND_INT:
+		return base_size + sizeof(__u32);
+	case BTF_KIND_ENUM:
+		return base_size + vlen * sizeof(struct btf_enum);
+	case BTF_KIND_ARRAY:
+		return base_size + sizeof(struct btf_array);
+	case BTF_KIND_STRUCT:
+	case BTF_KIND_UNION:
+		return base_size + vlen * sizeof(struct btf_member);
+	case BTF_KIND_FUNC_PROTO:
+		return base_size + vlen * sizeof(struct btf_param);
+	default:
+		pr_debug("Unsupported BTF_KIND:%u\n", BTF_INFO_KIND(t->info));
+		return -EINVAL;
+	}
+}
+
 static int btf_parse_type_sec(struct btf *btf)
 {
 	struct btf_header *hdr = btf->hdr;
@@ -191,41 +222,13 @@ static int btf_parse_type_sec(struct btf *btf)
 
 	while (next_type < end_type) {
 		struct btf_type *t = next_type;
-		__u16 vlen = BTF_INFO_VLEN(t->info);
+		int type_size;
 		int err;
 
-		next_type += sizeof(*t);
-		switch (BTF_INFO_KIND(t->info)) {
-		case BTF_KIND_INT:
-			next_type += sizeof(int);
-			break;
-		case BTF_KIND_ARRAY:
-			next_type += sizeof(struct btf_array);
-			break;
-		case BTF_KIND_STRUCT:
-		case BTF_KIND_UNION:
-			next_type += vlen * sizeof(struct btf_member);
-			break;
-		case BTF_KIND_ENUM:
-			next_type += vlen * sizeof(struct btf_enum);
-			break;
-		case BTF_KIND_FUNC_PROTO:
-			next_type += vlen * sizeof(struct btf_param);
-			break;
-		case BTF_KIND_FUNC:
-		case BTF_KIND_TYPEDEF:
-		case BTF_KIND_PTR:
-		case BTF_KIND_FWD:
-		case BTF_KIND_VOLATILE:
-		case BTF_KIND_CONST:
-		case BTF_KIND_RESTRICT:
-			break;
-		default:
-			pr_debug("Unsupported BTF_KIND:%u\n",
-			     BTF_INFO_KIND(t->info));
-			return -EINVAL;
-		}
-
+		type_size = btf_type_size(t);
+		if (type_size < 0)
+			return type_size;
+		next_type += type_size;
 		err = btf_add_type(btf, t);
 		if (err)
 			return err;
@@ -252,21 +255,6 @@ static bool btf_type_is_void_or_null(const struct btf_type *t)
 	return !t || btf_type_is_void(t);
 }
 
-static __s64 btf_type_size(const struct btf_type *t)
-{
-	switch (BTF_INFO_KIND(t->info)) {
-	case BTF_KIND_INT:
-	case BTF_KIND_STRUCT:
-	case BTF_KIND_UNION:
-	case BTF_KIND_ENUM:
-		return t->size;
-	case BTF_KIND_PTR:
-		return sizeof(void *);
-	default:
-		return -EINVAL;
-	}
-}
-
 #define MAX_RESOLVE_DEPTH 32
 
 __s64 btf__resolve_size(const struct btf *btf, __u32 type_id)
@@ -280,11 +268,16 @@ __s64 btf__resolve_size(const struct btf *btf, __u32 type_id)
 	t = btf__type_by_id(btf, type_id);
 	for (i = 0; i < MAX_RESOLVE_DEPTH && !btf_type_is_void_or_null(t);
 	     i++) {
-		size = btf_type_size(t);
-		if (size >= 0)
-			break;
-
 		switch (BTF_INFO_KIND(t->info)) {
+		case BTF_KIND_INT:
+		case BTF_KIND_STRUCT:
+		case BTF_KIND_UNION:
+		case BTF_KIND_ENUM:
+			size = t->size;
+			goto done;
+		case BTF_KIND_PTR:
+			size = sizeof(void *);
+			goto done;
 		case BTF_KIND_TYPEDEF:
 		case BTF_KIND_VOLATILE:
 		case BTF_KIND_CONST:
@@ -308,6 +301,7 @@ __s64 btf__resolve_size(const struct btf *btf, __u32 type_id)
 	if (size < 0)
 		return -EINVAL;
 
+done:
 	if (nelems && size > UINT32_MAX / nelems)
 		return -E2BIG;
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH btf v2 0/3] Add BTF types deduplication algorithm
From: Andrii Nakryiko @ 2019-02-05  1:29 UTC (permalink / raw)
  To: netdev, ast, yhs, daniel, acme, kernel-team, kafai, ecree,
	andrii.nakryiko
  Cc: Andrii Nakryiko

This patch series adds BTF deduplication algorithm to libbpf. This algorithm
allows to take BTF type information containing duplicate per-compilation unit
information and reduce it to equivalent set of BTF types with no duplication without
loss of information. It also deduplicates strings and removes those strings that
are not referenced from any BTF type (and line information in .BTF.ext section,
if any).

Algorithm also resolves struct/union forward declarations into concrete BTF types
across multiple compilation units to facilitate better deduplication ratio. If
undesired, this resolution can be disabled through specifying corresponding options.

When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
only 29247 out of initial 3073497 BTF type descriptors.

Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
support is also described more verbosely at:
https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html

v1->v2:
- rebase on latest bpf-next
- err_log/elog -> pr_debug
- btf__dedup, btf__get_strings, btf__get_nr_types listed under 0.0.2 version

Andrii Nakryiko (3):
  btf: extract BTF type size calculation
  btf: add BTF types deduplication algorithm
  selftests/btf: add initial BTF dedup tests

 tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
 tools/lib/bpf/btf.h                    |   10 +
 tools/lib/bpf/libbpf.map               |    3 +
 tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
 4 files changed, 2332 insertions(+), 67 deletions(-)

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH bpf-next 1/6] bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
From: Martin Lau @ 2019-02-05  0:50 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev@vger.kernel.org, Alexei Starovoitov, Kernel Team,
	Lawrence Brakmo
In-Reply-To: <51fe0012-e294-078a-4fc4-6151f8b55195@iogearbox.net>

On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> Hi Martin,
> 
> On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > In kernel, it is common to check "!skb->sk && sk_fullsock(skb->sk)"
> > before accessing the fields in sock.  For example, in __netdev_pick_tx:
> > 
> > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > 			    struct net_device *sb_dev)
> > {
> > 	/* ... */
> > 
> > 	struct sock *sk = skb->sk;
> > 
> > 		if (queue_index != new_index && sk &&
> > 		    sk_fullsock(sk) &&
> > 		    rcu_access_pointer(sk->sk_dst_cache))
> > 			sk_tx_queue_set(sk, new_index);
> > 
> > 	/* ... */
> > 
> > 	return queue_index;
> > }
> > 
> > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> > where a few of the convert_ctx_access() in filter.c has already been
> > accessing the skb->sk sock_common's fields,
> > e.g. sock_ops_convert_ctx_access().
> > 
> > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > Some of the fileds in "bpf_sock" will not be directly
> > accessible through the "__sk_buff->sk" pointer.  It is limited
> > by the new "bpf_sock_common_is_valid_access()".
> > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> >      are not allowed.
> > 
> > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > can be used to get a sk with all accessible fields in "bpf_sock".
> > This helper is added to both cg_skb and sched_(cls|act).
> > 
> > int cg_skb_foo(struct __sk_buff *skb) {
> > 	struct bpf_sock *sk;
> > 	__u32 family;
> > 
> > 	sk = skb->sk;
> > 	if (!sk)
> > 		return 1;
> > 
> > 	sk = bpf_sk_fullsock(sk);
> > 	if (!sk)
> > 		return 1;
> > 
> > 	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > 		return 1;
> > 
> > 	/* some_traffic_shaping(); */
> > 
> > 	return 1;
> > }
> > 
> > (1) The sk is read only
> > 
> > (2) There is no new "struct bpf_sock_common" introduced.
> > 
> > (3) Future kernel sock's members could be added to bpf_sock only
> >     instead of repeatedly adding at multiple places like currently
> >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > 
> > (4) After "sk = skb->sk", the reg holding sk is in type
> >     PTR_TO_SOCK_COMMON_OR_NULL.
> > 
> > (5) After bpf_sk_fullsock(), the return type will be in type
> >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> >     bpf_sk_lookup_xxx().
> > 
> >     However, bpf_sk_fullsock() does not take refcnt.  The
> >     acquire_reference_state() is only depending on the return type now.
> >     To avoid it, a new is_acquire_function() is checked before calling
> >     acquire_reference_state().
> 
> Bit unfortunate that a helper like bpf_sk_fullsock() would be needed, after
> all this is more of an implementation detail which we would expose here to
> the developer.
> 
> Is there a specific reason why fetching skb->sk couldn't already be of the
> type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> needed and most logic we have today could already be reused (modulo refcnt
> avoidance)?
Not all running context has a fullsock (PTR_TO_SOCKET_OR_NULL).

Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
not sure a sk (e.g. tw sock) can always be traced back to a full sk.

In term of the patch implementation, it is not much difference.  It is a bit
simplier without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL) but
not a lot.  I have tried both.

The "fullsock" has already been exposed in another form.
e.g. In sock_ops, the tcp_sock fields is not read if it is not a fullsock
while other sock_common fields will still be available.  The bpf_prog
can test the sock_ops->is_fullsock for what to do.

> 
> In particular, do you need the skb->sk without the full-sk part somewhere
> (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> parent where you could also access remaining bpf_sock fields?
I am thinking more on what if the bpf_prog only needs the fields from
sock_common (e.g. the src/dst ip/port) and skb already has
other needed info (e.g. protocol/mark/priority).
Enforing skb->sk must be a fullsock will unnecessarily limit those
bpf_prog from seeing all skb.

A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
tester.  I think having one "struct bpf_sock" is better and less confusing.

Later, for the running context that is sure to have a fullsock,
skb->sk can directly have PTR_TO_SOCKET_OR_NULL instead of
PTR_TO_SOCK_COMMON_OR_NULL.

Thanks,
Martin

> 
> This could then also be plugged into bpf_tcp_sock() given this needs to be
> full sk anyway.

^ permalink raw reply

* Re: [PATCH v2] net: dsa: mv88e6xxx: Revise irq setup ordering
From: John David Anglin @ 2019-02-05  0:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <20190204231410.GG3397@lunn.ch>

On 2019-02-04 6:14 p.m., Andrew Lunn wrote:
> On Mon, Feb 04, 2019 at 04:59:13PM -0500, John David Anglin wrote:
>> This change fixes a race condition in the setup of hardware irqs and the
>> code enabling PHY link detection in the mv88e6xxx driver.
>>
>> This race was observed on the espressobin board where the GPIO interrupt
>> controller only supports edge interrupts.  If the INTn output pin goes low
>> before the GPIO interrupt is enabled, PHY link interrupts are not detected.
>>
>> With this change, we
>> 1) force INTn high by clearing all interrupt enables in global 1 control 1,
>> 2) setup the hardware irq, and then
>> 3) perform the remaining common setup.
>>
>> This simplifies the setup and allows some unnecessary code to be removed.
> Hi Dave
>
> I took a closer look now. I don't actually see why the current code is
> wrong.
The problem is INTn can go low before the interrupt handler for it is
registered and enabled.
As a result, interrupts never occur if link happens to come up before
the interrupt handler
completes being enabled.
>
> mv88e6xxx_g1_irq_setup() calls mv88e6xxx_g1_irq_setup_common() and
> then registers the interrupt handler.
>
> mv88e6xxx_g1_irq_setup_common() does what you want, it masks all
> interrupts in the hardware and clears any pending interrupts which can
> be cleared.
>
> The change you made is actually dangerous. As soon as you request the
> interrupt, it is live, it can fire, and call
> mv88e6xxx_g1_irq_thread_work(). That needs the irq domain. But the
> change you made defers the creating of the domain until after the
> interrupt is registered. So we can de-refernece a NULL pointer in the
> interrupt handler.
This can't happen.  The domain is setup immediately after registering
the GPIO interrupt.
The interrupt can't fire until one of the enables is set.  These are set
by mv88e6xxx_g2_irq_setup(),
mv88e6xxx_g1_atu_prob_irq_setup() and
mv88e6xxx_g1_vtu_prob_irq_setup().  These irqs
are setup after mv88e6xxx_g1_irq_setup()/mv88e6xxx_irq_poll_setup() is
called.  Thus, the
irq domain is setup before the GPIO interrupt can fire.

I have tested both hardware and polled interrupts using espressobin with
v4.20.6 and networking
starts correctly.

Dave

-- 
John David Anglin  dave.anglin@bell.net


^ permalink raw reply

* Re: [PATCH bpf-next] libbpf: fix libbpf_print
From: Yonghong Song @ 2019-02-05  0:37 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev@vger.kernel.org
  Cc: davem@davemloft.net, ast@kernel.org, daniel@iogearbox.net
In-Reply-To: <20190205002055.80759-1-sdf@google.com>



On 2/4/19 4:20 PM, Stanislav Fomichev wrote:
> With the recent print rework we now have the following problem:
> pr_{warning,info,debug} expand to __pr which calls libbpf_print.
> libbpf_print does va_start and calls __libbpf_pr with va_list argument.
> In __base_pr we again do va_start. Because the next argument is a
> va_list, we don't get correct pointer to the argument (and print noting
> in my case, I don't know why it doesn't crash tbh).
> 
> Fix this by changing libbpf_print_fn_t signature to accept va_list and
> remove unneeded calls to va_start in the existing users.
> 
> Alternatively, this can we solved by exporting __libbpf_pr and
> changing __pr macro to (and killing libbpf_print):
> {
> 	if (__libbpf_pr)
> 		__libbpf_pr(level, "libbpf: " fmt, ##__VA_ARGS__)
> }
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>

It is my mistake. My early version did passed correctly and later
on I made some changes and did not test properly. Thanks for the fix!

Acked-by: Yonghong Song <yhs@fb.com>


> ---
>   tools/lib/bpf/libbpf.c                         | 14 ++++----------
>   tools/lib/bpf/libbpf.h                         |  3 +--
>   tools/perf/util/bpf-loader.c                   | 10 ++--------
>   tools/testing/selftests/bpf/test_btf.c         | 13 ++-----------
>   tools/testing/selftests/bpf/test_libbpf_open.c | 10 ++--------
>   tools/testing/selftests/bpf/test_progs.c       | 10 ++--------
>   6 files changed, 13 insertions(+), 47 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 84ca6c2bea91..47969aa0faf8 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -54,22 +54,16 @@
>   
>   #define __printf(a, b)	__attribute__((format(printf, a, b)))
>   
> -__printf(2, 3)
> -static int __base_pr(enum libbpf_print_level level, const char *format, ...)
> +static int __base_pr(enum libbpf_print_level level, const char *format,
> +		     va_list args)
>   {
> -	va_list args;
> -	int err;
> -
>   	if (level == LIBBPF_DEBUG)
>   		return 0;
>   
> -	va_start(args, format);
> -	err = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return err;
> +	return vfprintf(stderr, format, args);
>   }
>   
> -static __printf(2, 3) libbpf_print_fn_t __libbpf_pr = __base_pr;
> +static libbpf_print_fn_t __libbpf_pr = __base_pr;
>   
>   void libbpf_set_print(libbpf_print_fn_t fn)
>   {
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 19dbc1bed960..69a7c25eaccc 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -54,8 +54,7 @@ enum libbpf_print_level {
>   };
>   
>   typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level,
> -				 const char *, ...)
> -	__attribute__((format(printf, 2, 3)));
> +				 const char *, va_list ap);
>   
>   LIBBPF_API void libbpf_set_print(libbpf_print_fn_t fn);
>   
> diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
> index 38afdbe6a9e0..037d8ff6a634 100644
> --- a/tools/perf/util/bpf-loader.c
> +++ b/tools/perf/util/bpf-loader.c
> @@ -25,15 +25,9 @@
>   #include "c++/clang-c.h"
>   
>   static int libbpf_perf_print(enum libbpf_print_level level __attribute__((unused)),
> -			      const char *fmt, ...)
> +			      const char *fmt, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
> -	va_start(args, fmt);
> -	ret = veprintf(1, verbose, pr_fmt(fmt), args);
> -	va_end(args);
> -	return ret;
> +	return veprintf(1, verbose, pr_fmt(fmt), args);
>   }
>   
>   struct bpf_prog_priv {
> diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
> index aebaeff5a5a0..5afab823ffbe 100644
> --- a/tools/testing/selftests/bpf/test_btf.c
> +++ b/tools/testing/selftests/bpf/test_btf.c
> @@ -52,19 +52,10 @@ static int count_result(int err)
>   	return err;
>   }
>   
> -#define __printf(a, b)	__attribute__((format(printf, a, b)))
> -
> -__printf(2, 3)
>   static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
> -		     const char *format, ...)
> +		     const char *format, va_list args)
>   {
> -	va_list args;
> -	int err;
> -
> -	va_start(args, format);
> -	err = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return err;
> +	return vfprintf(stderr, format, args);
>   }
>   
>   #define BTF_INFO_ENC(kind, kind_flag, vlen)			\
> diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
> index b9ff3bf76544..1909ecf4d999 100644
> --- a/tools/testing/selftests/bpf/test_libbpf_open.c
> +++ b/tools/testing/selftests/bpf/test_libbpf_open.c
> @@ -36,19 +36,13 @@ static void usage(char *argv[])
>   
>   static bool debug = 0;
>   static int libbpf_debug_print(enum libbpf_print_level level,
> -			      const char *fmt, ...)
> +			      const char *fmt, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
>   	if (level == LIBBPF_DEBUG && !debug)
>   		return 0;
>   
> -	va_start(args, fmt);
>   	fprintf(stderr, "[%d] ", level);
> -	ret = vfprintf(stderr, fmt, args);
> -	va_end(args);
> -	return ret;
> +	return vfprintf(stderr, fmt, args);
>   }
>   
>   #define EXIT_FAIL_LIBBPF EXIT_FAILURE
> diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
> index 55d05102e7bf..c52bd90fbb34 100644
> --- a/tools/testing/selftests/bpf/test_progs.c
> +++ b/tools/testing/selftests/bpf/test_progs.c
> @@ -1785,18 +1785,12 @@ static void test_task_fd_query_tp(void)
>   }
>   
>   static int libbpf_debug_print(enum libbpf_print_level level,
> -			      const char *format, ...)
> +			      const char *format, va_list args)
>   {
> -	va_list args;
> -	int ret;
> -
>   	if (level == LIBBPF_DEBUG)
>   		return 0;
>   
> -	va_start(args, format);
> -	ret = vfprintf(stderr, format, args);
> -	va_end(args);
> -	return ret;
> +	return vfprintf(stderr, format, args);
>   }
>   
>   static void test_reference_tracking()
> 

^ permalink raw reply

* Re: [PATCH btf 0/3] Add BTF types deduplication algorithm
From: Daniel Borkmann @ 2019-02-05  0:26 UTC (permalink / raw)
  To: Andrii Nakryiko, netdev, ast, yhs, kafai, acme, kernel-team
In-Reply-To: <20190131065837.3380288-1-andriin@fb.com>

Hi Andrii,

On 01/31/2019 07:58 AM, Andrii Nakryiko wrote:
> This patch series adds BTF deduplication algorithm to libbpf. This algorithm
> allows to take BTF type information containing duplicate per-compilation unit
> information and reduce it to equivalent set of BTF types with no duplication without
> loss of information. It also deduplicates strings and removes those strings that
> are not referenced from any BTF type (and line information in .BTF.ext section,
> if any).
> 
> Algorithm also resolves struct/union forward declarations into concrete BTF types
> across multiple compilation units to facilitate better deduplication ratio. If
> undesired, this resolution can be disabled through specifying corresponding options.
> 
> When applied to BTF data emitted by pahole's DWARF->BTF converter, it reduces
> the overall size of .BTF section by about 65x, from about 112MB to 1.75MB, leaving
> only 29247 out of initial 3073497 BTF type descriptors.
> 
> Algorithm with minor differences and preliminary results before FUNC/FUNC_PROTO
> support is also described more verbosely at:
> https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html
> 
> Andrii Nakryiko (3):
>   btf: extract BTF type size calculation
>   btf: add BTF types deduplication algorithm
>   selftests/btf: add initial BTF dedup tests
> 
>  tools/lib/bpf/btf.c                    | 1851 +++++++++++++++++++++++-
>  tools/lib/bpf/btf.h                    |   11 +
>  tools/lib/bpf/libbpf.map               |    3 +
>  tools/testing/selftests/bpf/test_btf.c |  535 ++++++-
>  4 files changed, 2333 insertions(+), 67 deletions(-)

This would need a proper rebase for bpf-next so that it applies w/o
bigger conflicts. Please also change the libbpf.map and place the newly
exported functions under LIBBPF_0.0.2 (as everything under 0.0.1 was from
prior released kernel).

Thanks,
Daniel

^ permalink raw reply

* [PATCH bpf-next] libbpf: fix libbpf_print
From: Stanislav Fomichev @ 2019-02-05  0:20 UTC (permalink / raw)
  To: netdev; +Cc: yhs, davem, ast, daniel, Stanislav Fomichev

With the recent print rework we now have the following problem:
pr_{warning,info,debug} expand to __pr which calls libbpf_print.
libbpf_print does va_start and calls __libbpf_pr with va_list argument.
In __base_pr we again do va_start. Because the next argument is a
va_list, we don't get correct pointer to the argument (and print noting
in my case, I don't know why it doesn't crash tbh).

Fix this by changing libbpf_print_fn_t signature to accept va_list and
remove unneeded calls to va_start in the existing users.

Alternatively, this can we solved by exporting __libbpf_pr and
changing __pr macro to (and killing libbpf_print):
{
	if (__libbpf_pr)
		__libbpf_pr(level, "libbpf: " fmt, ##__VA_ARGS__)
}

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/lib/bpf/libbpf.c                         | 14 ++++----------
 tools/lib/bpf/libbpf.h                         |  3 +--
 tools/perf/util/bpf-loader.c                   | 10 ++--------
 tools/testing/selftests/bpf/test_btf.c         | 13 ++-----------
 tools/testing/selftests/bpf/test_libbpf_open.c | 10 ++--------
 tools/testing/selftests/bpf/test_progs.c       | 10 ++--------
 6 files changed, 13 insertions(+), 47 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 84ca6c2bea91..47969aa0faf8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -54,22 +54,16 @@
 
 #define __printf(a, b)	__attribute__((format(printf, a, b)))
 
-__printf(2, 3)
-static int __base_pr(enum libbpf_print_level level, const char *format, ...)
+static int __base_pr(enum libbpf_print_level level, const char *format,
+		     va_list args)
 {
-	va_list args;
-	int err;
-
 	if (level == LIBBPF_DEBUG)
 		return 0;
 
-	va_start(args, format);
-	err = vfprintf(stderr, format, args);
-	va_end(args);
-	return err;
+	return vfprintf(stderr, format, args);
 }
 
-static __printf(2, 3) libbpf_print_fn_t __libbpf_pr = __base_pr;
+static libbpf_print_fn_t __libbpf_pr = __base_pr;
 
 void libbpf_set_print(libbpf_print_fn_t fn)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 19dbc1bed960..69a7c25eaccc 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -54,8 +54,7 @@ enum libbpf_print_level {
 };
 
 typedef int (*libbpf_print_fn_t)(enum libbpf_print_level level,
-				 const char *, ...)
-	__attribute__((format(printf, 2, 3)));
+				 const char *, va_list ap);
 
 LIBBPF_API void libbpf_set_print(libbpf_print_fn_t fn);
 
diff --git a/tools/perf/util/bpf-loader.c b/tools/perf/util/bpf-loader.c
index 38afdbe6a9e0..037d8ff6a634 100644
--- a/tools/perf/util/bpf-loader.c
+++ b/tools/perf/util/bpf-loader.c
@@ -25,15 +25,9 @@
 #include "c++/clang-c.h"
 
 static int libbpf_perf_print(enum libbpf_print_level level __attribute__((unused)),
-			      const char *fmt, ...)
+			      const char *fmt, va_list args)
 {
-	va_list args;
-	int ret;
-
-	va_start(args, fmt);
-	ret = veprintf(1, verbose, pr_fmt(fmt), args);
-	va_end(args);
-	return ret;
+	return veprintf(1, verbose, pr_fmt(fmt), args);
 }
 
 struct bpf_prog_priv {
diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index aebaeff5a5a0..5afab823ffbe 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -52,19 +52,10 @@ static int count_result(int err)
 	return err;
 }
 
-#define __printf(a, b)	__attribute__((format(printf, a, b)))
-
-__printf(2, 3)
 static int __base_pr(enum libbpf_print_level level __attribute__((unused)),
-		     const char *format, ...)
+		     const char *format, va_list args)
 {
-	va_list args;
-	int err;
-
-	va_start(args, format);
-	err = vfprintf(stderr, format, args);
-	va_end(args);
-	return err;
+	return vfprintf(stderr, format, args);
 }
 
 #define BTF_INFO_ENC(kind, kind_flag, vlen)			\
diff --git a/tools/testing/selftests/bpf/test_libbpf_open.c b/tools/testing/selftests/bpf/test_libbpf_open.c
index b9ff3bf76544..1909ecf4d999 100644
--- a/tools/testing/selftests/bpf/test_libbpf_open.c
+++ b/tools/testing/selftests/bpf/test_libbpf_open.c
@@ -36,19 +36,13 @@ static void usage(char *argv[])
 
 static bool debug = 0;
 static int libbpf_debug_print(enum libbpf_print_level level,
-			      const char *fmt, ...)
+			      const char *fmt, va_list args)
 {
-	va_list args;
-	int ret;
-
 	if (level == LIBBPF_DEBUG && !debug)
 		return 0;
 
-	va_start(args, fmt);
 	fprintf(stderr, "[%d] ", level);
-	ret = vfprintf(stderr, fmt, args);
-	va_end(args);
-	return ret;
+	return vfprintf(stderr, fmt, args);
 }
 
 #define EXIT_FAIL_LIBBPF EXIT_FAILURE
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 55d05102e7bf..c52bd90fbb34 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -1785,18 +1785,12 @@ static void test_task_fd_query_tp(void)
 }
 
 static int libbpf_debug_print(enum libbpf_print_level level,
-			      const char *format, ...)
+			      const char *format, va_list args)
 {
-	va_list args;
-	int ret;
-
 	if (level == LIBBPF_DEBUG)
 		return 0;
 
-	va_start(args, format);
-	ret = vfprintf(stderr, format, args);
-	va_end(args);
-	return ret;
+	return vfprintf(stderr, format, args);
 }
 
 static void test_reference_tracking()
-- 
2.20.1.611.gfbb209baf1-goog


^ permalink raw reply related

* [net-next][PATCH 2/5] rds: rdma: add consumer reject
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

For legacy protocol version incompatibility with non linux RDS,
consumer reject reason being used to convey it to peer. But the
choice of reject reason value as '1' was really poor.

Anyway for interoperability reasons with shipping products,
it needs to be supported. For any future versions, properly
encoded reject reason should to be used.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/ib_cm.c          |  6 ++++--
 net/rds/rdma_transport.c | 12 ++++++++++++
 net/rds/rdma_transport.h |  6 ++++++
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0eeae09..a1c3ad3 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -734,8 +734,10 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 
 	/* Check whether the remote protocol version matches ours. */
 	version = rds_ib_protocol_compatible(event, isv6);
-	if (!version)
+	if (!version) {
+		err = RDS_RDMA_REJ_INCOMPAT;
 		goto out;
+	}
 
 	dp = event->param.conn.private_data;
 	if (isv6) {
@@ -851,7 +853,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 	if (conn)
 		mutex_unlock(&conn->c_cm_lock);
 	if (err)
-		rdma_reject(cm_id, NULL, 0);
+		rdma_reject(cm_id, &err, sizeof(int));
 	return destroy;
 }
 
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 6b0f57c..63cbc6b 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -51,6 +51,8 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 	struct rds_connection *conn = cm_id->context;
 	struct rds_transport *trans;
 	int ret = 0;
+	int *err;
+	u8 len;
 
 	rdsdebug("conn %p id %p handling event %u (%s)\n", conn, cm_id,
 		 event->event, rdma_event_msg(event->event));
@@ -106,8 +108,18 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 		break;
 
 	case RDMA_CM_EVENT_REJECTED:
+		if (!conn)
+			break;
+		err = (int *)rdma_consumer_reject_data(cm_id, event, &len);
+		if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) {
+			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
+				&conn->c_laddr, &conn->c_faddr);
+			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+			rds_conn_drop(conn);
+		}
 		rdsdebug("Connection rejected: %s\n",
 			 rdma_reject_msg(cm_id, event->status));
+		break;
 		/* FALLTHROUGH */
 	case RDMA_CM_EVENT_ADDR_ERROR:
 	case RDMA_CM_EVENT_ROUTE_ERROR:
diff --git a/net/rds/rdma_transport.h b/net/rds/rdma_transport.h
index 200d313..bfafd4a 100644
--- a/net/rds/rdma_transport.h
+++ b/net/rds/rdma_transport.h
@@ -11,6 +11,12 @@
 
 #define RDS_RDMA_RESOLVE_TIMEOUT_MS     5000
 
+/* Below reject reason is for legacy interoperability issue with non-linux
+ * RDS endpoints where older version incompatibility is conveyed via value 1.
+ * For future version(s), proper encoded reject reason should be be used.
+ */
+#define RDS_RDMA_REJ_INCOMPAT		1
+
 int rds_rdma_conn_connect(struct rds_connection *conn);
 int rds_rdma_cm_event_handler(struct rdma_cm_id *cm_id,
 			      struct rdma_cm_event *event);
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 0/5] rds: add tos support
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar

RDS applications make use of tos to classify database traffic.
This feature has been used in shipping products from 2.6.32 based
kernels. Its tied with RDS v4.1 protocol version and the compatibility
gets negotiated as part of connections setup.

Patchset keeps full backward compatibility using existing connection
negotiation scheme. Currently the feature is exploited by RDMA
transport and for TCP transport the user tos values are mapped to
same default class (0).

For RDMA transports, RDMA CM service type API is used to
set up different SL(service lanes) and the IB fabric is configured
for tos mapping using Subnet Manager(SL to VL mappings).
Similarly for ROCE fabric, user priority is mapped with different
DSCP code points which are associated with different switch queues
in the fabric.

The original code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved significantly over period of time.

Thanks to Yanjun for doing testing with various combinations of host like
v3.1<->v4.1, v4.1.<->v3.1, v4.1 upstream to shipping v4.1 etc etc

Patchset is also available on below git tree.

The following changes since commit cc7335786f7278d66bdcf96d3d411edfcb01be51:

  socket: fix for Add SO_TIMESTAMP[NS]_NEW (2019-02-03 20:36:11 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux.git for_net-next-5.1/rds-tos-v4

for you to fetch changes up to fd261ce6a30e01ad67c416e2c67e263024b3a6f9:

  rds: rdma: update rdma transport for tos (2019-02-04 14:59:13 -0800)

----------------------------------------------------------------

Santosh Shilimkar (5):
  rds: make v3.1 as compat version
  rds: rdma: add consumer reject
  rds: add type of service(tos) infrastructure
  rds: add transport specific tos_map hook
  rds: rdma: update rdma transport for tos

 include/uapi/linux/rds.h | 11 ++++++++
 net/rds/af_rds.c         | 37 ++++++++++++++++++++++++-
 net/rds/connection.c     | 21 ++++++++------
 net/rds/ib.c             | 11 ++++++++
 net/rds/ib.h             |  4 ++-
 net/rds/ib_cm.c          | 72 +++++++++++++++++++++++++++---------------------
 net/rds/ib_recv.c        |  4 +--
 net/rds/ib_send.c        |  5 ++--
 net/rds/rdma_transport.c | 14 ++++++++++
 net/rds/rdma_transport.h |  6 ++++
 net/rds/rds.h            | 14 ++++++++--
 net/rds/recv.c           |  1 +
 net/rds/send.c           |  7 +++--
 net/rds/tcp.c            |  8 ++++++
 net/rds/tcp_listen.c     |  2 +-
 net/rds/threads.c        |  1 +
 16 files changed, 166 insertions(+), 52 deletions(-)

-- 
1.9.1

^ permalink raw reply

* [net-next][PATCH 4/5] rds: add transport specific tos_map hook
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

RDMA transport maps user tos to underline virtual lanes(VL)
for IB or DSCP values. RDMA CM transport abstract thats for
RDS. TCP transport makes use of default priority 0 and maps
all user tos values to it.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/af_rds.c | 10 ++++++----
 net/rds/ib.c     | 10 ++++++++++
 net/rds/rds.h    |  1 +
 net/rds/tcp.c    |  7 +++++++
 4 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 9045158..d6cc97f 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -255,16 +255,18 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
 static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
 	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
-	rds_tos_t tos;
+	rds_tos_t utos, tos = 0;
 
 	switch (cmd) {
 	case SIOCRDSSETTOS:
-		if (get_user(tos, (rds_tos_t __user *)arg))
+		if (get_user(utos, (rds_tos_t __user *)arg))
 			return -EFAULT;
 
 		if (rs->rs_transport &&
-		    rs->rs_transport->t_type == RDS_TRANS_TCP)
-			tos = 0;
+		    rs->rs_transport->get_tos_map)
+			tos = rs->rs_transport->get_tos_map(utos);
+		else
+			return -ENOIOCTLCMD;
 
 		spin_lock_bh(&rds_sock_lock);
 		if (rs->rs_tos || rs->rs_conn) {
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 21b6588..2da9b75 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -515,6 +515,15 @@ void rds_ib_exit(void)
 	rds_ib_mr_exit();
 }
 
+static u8 rds_ib_get_tos_map(u8 tos)
+{
+	/* 1:1 user to transport map for RDMA transport.
+	 * In future, if custom map is desired, hook can export
+	 * user configurable map.
+	 */
+	return tos;
+}
+
 struct rds_transport rds_ib_transport = {
 	.laddr_check		= rds_ib_laddr_check,
 	.xmit_path_complete	= rds_ib_xmit_path_complete,
@@ -537,6 +546,7 @@ struct rds_transport rds_ib_transport = {
 	.sync_mr		= rds_ib_sync_mr,
 	.free_mr		= rds_ib_free_mr,
 	.flush_mrs		= rds_ib_flush_mrs,
+	.get_tos_map		= rds_ib_get_tos_map,
 	.t_owner		= THIS_MODULE,
 	.t_name			= "infiniband",
 	.t_unloading		= rds_ib_is_unloading,
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 7e52b92..0d8f67c 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -574,6 +574,7 @@ struct rds_transport {
 	void (*free_mr)(void *trans_private, int invalidate);
 	void (*flush_mrs)(void);
 	bool (*t_unloading)(struct rds_connection *conn);
+	u8 (*get_tos_map)(u8 tos);
 };
 
 /* Bind hash table key length.  It is the sum of the size of a struct
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index eb68519..fd26941 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -453,6 +453,12 @@ static void rds_tcp_destroy_conns(void)
 
 static void rds_tcp_exit(void);
 
+static u8 rds_tcp_get_tos_map(u8 tos)
+{
+	/* all user tos mapped to default 0 for TCP transport */
+	return 0;
+}
+
 struct rds_transport rds_tcp_transport = {
 	.laddr_check		= rds_tcp_laddr_check,
 	.xmit_path_prepare	= rds_tcp_xmit_path_prepare,
@@ -467,6 +473,7 @@ struct rds_transport rds_tcp_transport = {
 	.inc_free		= rds_tcp_inc_free,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
+	.get_tos_map		= rds_tcp_get_tos_map,
 	.t_owner		= THIS_MODULE,
 	.t_name			= "tcp",
 	.t_type			= RDS_TRANS_TCP,
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 1/5] rds: make v3.1 as compat version
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

Mark RDSv3.1 as compat version and add v4.1 version macro's.
Subsequent patches enable TOS(Type of Service) feature which is
tied with v4.1 for RDMA transport.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/connection.c |  1 +
 net/rds/ib_cm.c      | 40 +++++++++++++++++++++++-----------------
 net/rds/rds.h        |  4 ++++
 net/rds/threads.c    |  1 +
 4 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 3bd2f4a..1ab14b6 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -139,6 +139,7 @@ static void __rds_conn_path_init(struct rds_connection *conn,
 	atomic_set(&cp->cp_state, RDS_CONN_DOWN);
 	cp->cp_send_gen = 0;
 	cp->cp_reconnect_jiffies = 0;
+	cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
 	INIT_DELAYED_WORK(&cp->cp_send_w, rds_send_worker);
 	INIT_DELAYED_WORK(&cp->cp_recv_w, rds_recv_worker);
 	INIT_DELAYED_WORK(&cp->cp_conn_w, rds_connect_worker);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index bfbb31f..0eeae09 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -133,23 +133,24 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		rds_ib_set_flow_control(conn, be32_to_cpu(credit));
 	}
 
-	if (conn->c_version < RDS_PROTOCOL(3, 1)) {
-		pr_notice("RDS/IB: Connection <%pI6c,%pI6c> version %u.%u no longer supported\n",
-			  &conn->c_laddr, &conn->c_faddr,
-			  RDS_PROTOCOL_MAJOR(conn->c_version),
-			  RDS_PROTOCOL_MINOR(conn->c_version));
-		set_bit(RDS_DESTROY_PENDING, &conn->c_path[0].cp_flags);
-		rds_conn_destroy(conn);
-		return;
-	} else {
-		pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
-			  ic->i_active_side ? "Active" : "Passive",
-			  &conn->c_laddr, &conn->c_faddr,
-			  RDS_PROTOCOL_MAJOR(conn->c_version),
-			  RDS_PROTOCOL_MINOR(conn->c_version),
-			  ic->i_flowctl ? ", flow control" : "");
+	if (conn->c_version < RDS_PROTOCOL_VERSION) {
+		if (conn->c_version != RDS_PROTOCOL_COMPAT_VERSION) {
+			pr_notice("RDS/IB: Connection <%pI6c,%pI6c> version %u.%u no longer supported\n",
+				  &conn->c_laddr, &conn->c_faddr,
+				  RDS_PROTOCOL_MAJOR(conn->c_version),
+				  RDS_PROTOCOL_MINOR(conn->c_version));
+			rds_conn_destroy(conn);
+			return;
+		}
 	}
 
+	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
+		  ic->i_active_side ? "Active" : "Passive",
+		  &conn->c_laddr, &conn->c_faddr,
+		  RDS_PROTOCOL_MAJOR(conn->c_version),
+		  RDS_PROTOCOL_MINOR(conn->c_version),
+		  ic->i_flowctl ? ", flow control" : "");
+
 	atomic_set(&ic->i_cq_quiesce, 0);
 
 	/* Init rings and fill recv. this needs to wait until protocol
@@ -184,6 +185,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 					    NULL);
 	}
 
+	conn->c_proposed_version = conn->c_version;
 	rds_connect_complete(conn);
 }
 
@@ -667,6 +669,9 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 		version = RDS_PROTOCOL_3_0;
 		while ((common >>= 1) != 0)
 			version++;
+	} else if (RDS_PROTOCOL_COMPAT_VERSION ==
+		   RDS_PROTOCOL(major, minor)) {
+		version = RDS_PROTOCOL_COMPAT_VERSION;
 	} else {
 		if (isv6)
 			printk_ratelimited(KERN_NOTICE "RDS: Connection from %pI6c using incompatible protocol version %u.%u\n",
@@ -861,7 +866,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 
 	/* If the peer doesn't do protocol negotiation, we must
 	 * default to RDSv3.0 */
-	rds_ib_set_protocol(conn, RDS_PROTOCOL_3_0);
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_VERSION);
 	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
 
 	ret = rds_ib_setup_qp(conn);
@@ -870,7 +875,8 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 		goto out;
 	}
 
-	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp, RDS_PROTOCOL_VERSION,
+	rds_ib_cm_fill_conn_param(conn, &conn_param, &dp,
+				  conn->c_proposed_version,
 				  UINT_MAX, UINT_MAX, isv6);
 	ret = rdma_connect(cm_id, &conn_param);
 	if (ret)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 4ffe100..660023f 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -19,10 +19,13 @@
  */
 #define RDS_PROTOCOL_3_0	0x0300
 #define RDS_PROTOCOL_3_1	0x0301
+#define RDS_PROTOCOL_4_0	0x0400
+#define RDS_PROTOCOL_4_1	0x0401
 #define RDS_PROTOCOL_VERSION	RDS_PROTOCOL_3_1
 #define RDS_PROTOCOL_MAJOR(v)	((v) >> 8)
 #define RDS_PROTOCOL_MINOR(v)	((v) & 255)
 #define RDS_PROTOCOL(maj, min)	(((maj) << 8) | min)
+#define RDS_PROTOCOL_COMPAT_VERSION	RDS_PROTOCOL_3_1
 
 /* The following ports, 16385, 18634, 18635, are registered with IANA as
  * the ports to be used for RDS over TCP and UDP.  Currently, only RDS over
@@ -151,6 +154,7 @@ struct rds_connection {
 	struct rds_cong_map	*c_fcong;
 
 	/* Protocol version */
+	unsigned int		c_proposed_version;
 	unsigned int		c_version;
 	possible_net_t		c_net;
 
diff --git a/net/rds/threads.c b/net/rds/threads.c
index e64f9e4..32dc50f 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -93,6 +93,7 @@ void rds_connect_path_complete(struct rds_conn_path *cp, int curr)
 		queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
 	}
 	rcu_read_unlock();
+	cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
 }
 EXPORT_SYMBOL_GPL(rds_connect_path_complete);
 
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 5/5] rds: rdma: update rdma transport for tos
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

For RDMA transports, RDS TOS is an extension of IB QoS(Annex A13)
to provide clients the ability to segregate traffic flows for
different type of data. RDMA CM abstract it for ULPs using
rdma_set_service_type(). Internally, each traffic flow is
represented by a connection with all of its independent resources
like that of a normal connection, and is differentiated by
service type. In other words, there can be multiple qp connections
between an IP pair and each supports a unique service type.

The feature has been added from RDSv4.1 onwards and supports
rolling upgrades. RDMA connection metadata also carries the tos
information to set up SL on end to end context. The original
code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved over period of time.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 net/rds/ib.h             |  4 +++-
 net/rds/ib_cm.c          | 32 +++++++++++++++++---------------
 net/rds/ib_recv.c        |  4 ++--
 net/rds/ib_send.c        |  5 +++--
 net/rds/rdma_transport.c |  1 +
 net/rds/send.c           |  5 +++--
 6 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 71ff356..752f922 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -67,7 +67,9 @@ struct rds_ib_conn_priv_cmn {
 	u8			ricpc_protocol_major;
 	u8			ricpc_protocol_minor;
 	__be16			ricpc_protocol_minor_mask;	/* bitmask */
-	__be32			ricpc_reserved1;
+	u8			ricpc_dp_toss;
+	u8			ripc_reserved1;
+	__be16			ripc_reserved2;
 	__be64			ricpc_ack_seq;
 	__be32			ricpc_credit;	/* non-zero enables flow ctl */
 };
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 70518e3..66c6eb5 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -144,9 +144,9 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn, struct rdma_cm_even
 		}
 	}
 
-	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c> version %u.%u%s\n",
+	pr_notice("RDS/IB: %s conn connected <%pI6c,%pI6c,%d> version %u.%u%s\n",
 		  ic->i_active_side ? "Active" : "Passive",
-		  &conn->c_laddr, &conn->c_faddr,
+		  &conn->c_laddr, &conn->c_faddr, conn->c_tos,
 		  RDS_PROTOCOL_MAJOR(conn->c_version),
 		  RDS_PROTOCOL_MINOR(conn->c_version),
 		  ic->i_flowctl ? ", flow control" : "");
@@ -222,6 +222,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
 			dp->ricp_v6.dp_ack_seq =
 			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+			dp->ricp_v6.dp_cmn.ricpc_dp_toss = conn->c_tos;
 
 			conn_param->private_data = &dp->ricp_v6;
 			conn_param->private_data_len = sizeof(dp->ricp_v6);
@@ -236,6 +237,7 @@ static void rds_ib_cm_fill_conn_param(struct rds_connection *conn,
 			    cpu_to_be16(RDS_IB_SUPPORTED_PROTOCOLS);
 			dp->ricp_v4.dp_ack_seq =
 			    cpu_to_be64(rds_ib_piggyb_ack(ic));
+			dp->ricp_v4.dp_cmn.ricpc_dp_toss = conn->c_tos;
 
 			conn_param->private_data = &dp->ricp_v4;
 			conn_param->private_data_len = sizeof(dp->ricp_v4);
@@ -391,10 +393,9 @@ static void rds_ib_qp_event_handler(struct ib_event *event, void *data)
 		rdma_notify(ic->i_cm_id, IB_EVENT_COMM_EST);
 		break;
 	default:
-		rdsdebug("Fatal QP Event %u (%s) "
-			"- connection %pI6c->%pI6c, reconnecting\n",
-			event->event, ib_event_msg(event->event),
-			&conn->c_laddr, &conn->c_faddr);
+		rdsdebug("Fatal QP Event %u (%s) - connection %pI6c->%pI6c, reconnecting\n",
+			 event->event, ib_event_msg(event->event),
+			 &conn->c_laddr, &conn->c_faddr);
 		rds_conn_drop(conn);
 		break;
 	}
@@ -662,11 +663,11 @@ static u32 rds_ib_protocol_compatible(struct rdma_cm_event *event, bool isv6)
 
 	/* Even if len is crap *now* I still want to check it. -ASG */
 	if (event->param.conn.private_data_len < data_len || major == 0)
-		return RDS_PROTOCOL_3_0;
+		return RDS_PROTOCOL_4_0;
 
 	common = be16_to_cpu(mask) & RDS_IB_SUPPORTED_PROTOCOLS;
-	if (major == 3 && common) {
-		version = RDS_PROTOCOL_3_0;
+	if (major == 4 && common) {
+		version = RDS_PROTOCOL_4_0;
 		while ((common >>= 1) != 0)
 			version++;
 	} else if (RDS_PROTOCOL_COMPAT_VERSION ==
@@ -778,15 +779,16 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 		daddr6 = &d_mapped_addr;
 	}
 
-	rdsdebug("saddr %pI6c daddr %pI6c RDSv%u.%u lguid 0x%llx fguid "
-		 "0x%llx\n", saddr6, daddr6,
-		 RDS_PROTOCOL_MAJOR(version), RDS_PROTOCOL_MINOR(version),
+	rdsdebug("saddr %pI6c daddr %pI6c RDSv%u.%u lguid 0x%llx fguid 0x%llx, tos:%d\n",
+		 saddr6, daddr6, RDS_PROTOCOL_MAJOR(version),
+		 RDS_PROTOCOL_MINOR(version),
 		 (unsigned long long)be64_to_cpu(lguid),
-		 (unsigned long long)be64_to_cpu(fguid));
+		 (unsigned long long)be64_to_cpu(fguid), dp_cmn->ricpc_dp_toss);
 
 	/* RDS/IB is not currently netns aware, thus init_net */
 	conn = rds_conn_create(&init_net, daddr6, saddr6,
-			       &rds_ib_transport, 0, GFP_KERNEL, ifindex);
+			       &rds_ib_transport, dp_cmn->ricpc_dp_toss,
+			       GFP_KERNEL, ifindex);
 	if (IS_ERR(conn)) {
 		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
 		conn = NULL;
@@ -868,7 +870,7 @@ int rds_ib_cm_initiate_connect(struct rdma_cm_id *cm_id, bool isv6)
 
 	/* If the peer doesn't do protocol negotiation, we must
 	 * default to RDSv3.0 */
-	rds_ib_set_protocol(conn, RDS_PROTOCOL_VERSION);
+	rds_ib_set_protocol(conn, RDS_PROTOCOL_4_1);
 	ic->i_flowctl = rds_ib_sysctl_flow_control;	/* advertise flow control */
 
 	ret = rds_ib_setup_qp(conn);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2f16146..d395eec 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -986,9 +986,9 @@ void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic,
 	} else {
 		/* We expect errors as the qp is drained during shutdown */
 		if (rds_conn_up(conn) || rds_conn_connecting(conn))
-			rds_ib_conn_error(conn, "recv completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
+			rds_ib_conn_error(conn, "recv completion on <%pI6c,%pI6c, %d> had status %u (%s), disconnecting and reconnecting\n",
 					  &conn->c_laddr, &conn->c_faddr,
-					  wc->status,
+					  conn->c_tos, wc->status,
 					  ib_wc_status_msg(wc->status));
 	}
 
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4e0c36a..09c46f2 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -305,8 +305,9 @@ void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc)
 
 	/* We expect errors as the qp is drained during shutdown */
 	if (wc->status != IB_WC_SUCCESS && rds_conn_up(conn)) {
-		rds_ib_conn_error(conn, "send completion on <%pI6c,%pI6c> had status %u (%s), disconnecting and reconnecting\n",
-				  &conn->c_laddr, &conn->c_faddr, wc->status,
+		rds_ib_conn_error(conn, "send completion on <%pI6c,%pI6c,%d> had status %u (%s), disconnecting and reconnecting\n",
+				  &conn->c_laddr, &conn->c_faddr,
+				  conn->c_tos, wc->status,
 				  ib_wc_status_msg(wc->status));
 	}
 }
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index e37f915..46bce83 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -83,6 +83,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 		break;
 
 	case RDMA_CM_EVENT_ADDR_RESOLVED:
+		rdma_set_service_type(cm_id, conn->c_tos);
 		/* XXX do we need to clean up if this fails? */
 		ret = rdma_resolve_route(cm_id,
 					 RDS_RDMA_RESOLVE_TIMEOUT_MS);
diff --git a/net/rds/send.c b/net/rds/send.c
index c555e12..166dd57 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1277,12 +1277,13 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* rds_conn_create has a spinlock that runs with IRQ off.
 	 * Caching the conn in the socket helps a lot. */
-	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr)) {
+	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr) &&
+	    rs->rs_tos == rs->rs_conn->c_tos) {
 		conn = rs->rs_conn;
 	} else {
 		conn = rds_conn_create_outgoing(sock_net(sock->sk),
 						&rs->rs_bound_addr, &daddr,
-						rs->rs_transport, 0,
+						rs->rs_transport, rs->rs_tos,
 						sock->sk->sk_allocation,
 						scope_id);
 		if (IS_ERR(conn)) {
-- 
1.9.1


^ permalink raw reply related

* [net-next][PATCH 3/5] rds: add type of service(tos) infrastructure
From: Santosh Shilimkar @ 2019-02-05  0:04 UTC (permalink / raw)
  To: netdev, davem; +Cc: yanjun.zhu, santosh.shilimkar
In-Reply-To: <1549325089-16572-1-git-send-email-santosh.shilimkar@oracle.com>

RDS Service type (TOS) is user-defined and needs to be configured
via RDS IOCTL interface. It must be set before initiating any
traffic and once set the TOS can not be changed. All out-going
traffic from the socket will be associated with its TOS.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
---
 include/uapi/linux/rds.h | 11 +++++++++++
 net/rds/af_rds.c         | 35 ++++++++++++++++++++++++++++++++++-
 net/rds/connection.c     | 20 +++++++++++---------
 net/rds/ib.c             |  1 +
 net/rds/ib_cm.c          |  2 +-
 net/rds/rdma_transport.c |  1 +
 net/rds/rds.h            |  9 +++++++--
 net/rds/recv.c           |  1 +
 net/rds/send.c           |  6 +++---
 net/rds/tcp.c            |  1 +
 net/rds/tcp_listen.c     |  2 +-
 11 files changed, 72 insertions(+), 17 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 8b73cb6..5d0f76c 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -69,6 +69,12 @@
 #define RDS_TRANS_COUNT	3
 #define	RDS_TRANS_NONE	(~0)
 
+/* IOCTLS commands for SOL_RDS */
+#define SIOCRDSSETTOS		(SIOCPROTOPRIVATE)
+#define SIOCRDSGETTOS		(SIOCPROTOPRIVATE + 1)
+
+typedef __u8	rds_tos_t;
+
 /*
  * Control message types for SOL_RDS.
  *
@@ -149,6 +155,7 @@ struct rds_info_connection {
 	__be32		faddr;
 	__u8		transport[TRANSNAMSIZ];		/* null term ascii */
 	__u8		flags;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_connection {
@@ -171,6 +178,7 @@ struct rds_info_message {
 	__be16		lport;
 	__be16		fport;
 	__u8		flags;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_message {
@@ -214,6 +222,7 @@ struct rds_info_tcp_socket {
 	__u32           last_sent_nxt;
 	__u32           last_expected_una;
 	__u32           last_seen_una;
+	__u8		tos;
 } __attribute__((packed));
 
 struct rds6_info_tcp_socket {
@@ -240,6 +249,7 @@ struct rds_info_rdma_connection {
 	__u32		max_send_sge;
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
+	__u8		tos;
 };
 
 struct rds6_info_rdma_connection {
@@ -253,6 +263,7 @@ struct rds6_info_rdma_connection {
 	__u32		max_send_sge;
 	__u32		rdma_mr_max;
 	__u32		rdma_mr_size;
+	__u8		tos;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 65571a6..9045158 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -254,7 +254,38 @@ static __poll_t rds_poll(struct file *file, struct socket *sock,
 
 static int rds_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
 {
-	return -ENOIOCTLCMD;
+	struct rds_sock *rs = rds_sk_to_rs(sock->sk);
+	rds_tos_t tos;
+
+	switch (cmd) {
+	case SIOCRDSSETTOS:
+		if (get_user(tos, (rds_tos_t __user *)arg))
+			return -EFAULT;
+
+		if (rs->rs_transport &&
+		    rs->rs_transport->t_type == RDS_TRANS_TCP)
+			tos = 0;
+
+		spin_lock_bh(&rds_sock_lock);
+		if (rs->rs_tos || rs->rs_conn) {
+			spin_unlock_bh(&rds_sock_lock);
+			return -EINVAL;
+		}
+		rs->rs_tos = tos;
+		spin_unlock_bh(&rds_sock_lock);
+		break;
+	case SIOCRDSGETTOS:
+		spin_lock_bh(&rds_sock_lock);
+		tos = rs->rs_tos;
+		spin_unlock_bh(&rds_sock_lock);
+		if (put_user(tos, (rds_tos_t __user *)arg))
+			return -EFAULT;
+		break;
+	default:
+		return -ENOIOCTLCMD;
+	}
+
+	return 0;
 }
 
 static int rds_cancel_sent_to(struct rds_sock *rs, char __user *optval,
@@ -650,6 +681,8 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol)
 	spin_lock_init(&rs->rs_rdma_lock);
 	rs->rs_rdma_keys = RB_ROOT;
 	rs->rs_rx_traces = 0;
+	rs->rs_tos = 0;
+	rs->rs_conn = NULL;
 
 	spin_lock_bh(&rds_sock_lock);
 	list_add_tail(&rs->rs_item, &rds_sock_list);
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 1ab14b6..7ea134f 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -84,7 +84,7 @@ static struct rds_connection *rds_conn_lookup(struct net *net,
 					      const struct in6_addr *laddr,
 					      const struct in6_addr *faddr,
 					      struct rds_transport *trans,
-					      int dev_if)
+					      u8 tos, int dev_if)
 {
 	struct rds_connection *conn, *ret = NULL;
 
@@ -92,6 +92,7 @@ static struct rds_connection *rds_conn_lookup(struct net *net,
 		if (ipv6_addr_equal(&conn->c_faddr, faddr) &&
 		    ipv6_addr_equal(&conn->c_laddr, laddr) &&
 		    conn->c_trans == trans &&
+		    conn->c_tos == tos &&
 		    net == rds_conn_net(conn) &&
 		    conn->c_dev_if == dev_if) {
 			ret = conn;
@@ -160,7 +161,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp,
+						gfp_t gfp, u8 tos,
 						int is_outgoing,
 						int dev_if)
 {
@@ -172,7 +173,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
 
 	rcu_read_lock();
-	conn = rds_conn_lookup(net, head, laddr, faddr, trans, dev_if);
+	conn = rds_conn_lookup(net, head, laddr, faddr, trans, tos, dev_if);
 	if (conn &&
 	    conn->c_loopback &&
 	    conn->c_trans != &rds_loop_transport &&
@@ -206,6 +207,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	conn->c_isv6 = !ipv6_addr_v4mapped(laddr);
 	conn->c_faddr = *faddr;
 	conn->c_dev_if = dev_if;
+	conn->c_tos = tos;
 
 #if IS_ENABLED(CONFIG_IPV6)
 	/* If the local address is link local, set c_bound_if to be the
@@ -298,7 +300,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		struct rds_connection *found;
 
 		found = rds_conn_lookup(net, head, laddr, faddr, trans,
-					dev_if);
+					tos, dev_if);
 		if (found) {
 			struct rds_conn_path *cp;
 			int i;
@@ -333,10 +335,10 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 struct rds_connection *rds_conn_create(struct net *net,
 				       const struct in6_addr *laddr,
 				       const struct in6_addr *faddr,
-				       struct rds_transport *trans, gfp_t gfp,
-				       int dev_if)
+				       struct rds_transport *trans, u8 tos,
+				       gfp_t gfp, int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 0, dev_if);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, tos, 0, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create);
 
@@ -344,9 +346,9 @@ struct rds_connection *rds_conn_create_outgoing(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp, int dev_if)
+						u8 tos, gfp_t gfp, int dev_if)
 {
-	return __rds_conn_create(net, laddr, faddr, trans, gfp, 1, dev_if);
+	return __rds_conn_create(net, laddr, faddr, trans, gfp, tos, 1, dev_if);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create_outgoing);
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 9d7b758..21b6588 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -301,6 +301,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection *conn,
 
 	iinfo->src_addr = conn->c_laddr.s6_addr32[3];
 	iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
+	iinfo->tos = conn->c_tos;
 
 	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
 	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index a1c3ad3..70518e3 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -786,7 +786,7 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 
 	/* RDS/IB is not currently netns aware, thus init_net */
 	conn = rds_conn_create(&init_net, daddr6, saddr6,
-			       &rds_ib_transport, GFP_KERNEL, ifindex);
+			       &rds_ib_transport, 0, GFP_KERNEL, ifindex);
 	if (IS_ERR(conn)) {
 		rdsdebug("rds_conn_create failed (%ld)\n", PTR_ERR(conn));
 		conn = NULL;
diff --git a/net/rds/rdma_transport.c b/net/rds/rdma_transport.c
index 63cbc6b..e37f915 100644
--- a/net/rds/rdma_transport.c
+++ b/net/rds/rdma_transport.c
@@ -115,6 +115,7 @@ static int rds_rdma_cm_event_handler_cmn(struct rdma_cm_id *cm_id,
 			pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n",
 				&conn->c_laddr, &conn->c_faddr);
 			conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION;
+			conn->c_tos = 0;
 			rds_conn_drop(conn);
 		}
 		rdsdebug("Connection rejected: %s\n",
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 660023f..7e52b92 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -158,6 +158,9 @@ struct rds_connection {
 	unsigned int		c_version;
 	possible_net_t		c_net;
 
+	/* TOS */
+	u8			c_tos;
+
 	struct list_head	c_map_item;
 	unsigned long		c_map_queued;
 
@@ -652,6 +655,7 @@ struct rds_sock {
 	u8			rs_rx_traces;
 	u8			rs_rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
 	struct rds_msg_zcopy_queue rs_zcookie_queue;
+	u8			rs_tos;
 };
 
 static inline struct rds_sock *rds_sk_to_rs(const struct sock *sk)
@@ -760,13 +764,14 @@ struct rds_sock *rds_find_bound(const struct in6_addr *addr, __be16 port,
 struct rds_connection *rds_conn_create(struct net *net,
 				       const struct in6_addr *laddr,
 				       const struct in6_addr *faddr,
-				       struct rds_transport *trans, gfp_t gfp,
+				       struct rds_transport *trans,
+				       u8 tos, gfp_t gfp,
 				       int dev_if);
 struct rds_connection *rds_conn_create_outgoing(struct net *net,
 						const struct in6_addr *laddr,
 						const struct in6_addr *faddr,
 						struct rds_transport *trans,
-						gfp_t gfp, int dev_if);
+						u8 tos, gfp_t gfp, int dev_if);
 void rds_conn_shutdown(struct rds_conn_path *cpath);
 void rds_conn_destroy(struct rds_connection *conn);
 void rds_conn_drop(struct rds_connection *conn);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 6bb6b16..853de48 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -782,6 +782,7 @@ void rds_inc_info_copy(struct rds_incoming *inc,
 
 	minfo.seq = be64_to_cpu(inc->i_hdr.h_sequence);
 	minfo.len = be32_to_cpu(inc->i_hdr.h_len);
+	minfo.tos = inc->i_conn->c_tos;
 
 	if (flip) {
 		minfo.laddr = daddr;
diff --git a/net/rds/send.c b/net/rds/send.c
index fd8b687..c555e12 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1277,12 +1277,12 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 
 	/* rds_conn_create has a spinlock that runs with IRQ off.
 	 * Caching the conn in the socket helps a lot. */
-	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))
+	if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr)) {
 		conn = rs->rs_conn;
-	else {
+	} else {
 		conn = rds_conn_create_outgoing(sock_net(sock->sk),
 						&rs->rs_bound_addr, &daddr,
-						rs->rs_transport,
+						rs->rs_transport, 0,
 						sock->sk->sk_allocation,
 						scope_id);
 		if (IS_ERR(conn)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index c16f0a3..eb68519 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -267,6 +267,7 @@ static void rds_tcp_tc_info(struct socket *rds_sock, unsigned int len,
 		tsinfo.last_sent_nxt = tc->t_last_sent_nxt;
 		tsinfo.last_expected_una = tc->t_last_expected_una;
 		tsinfo.last_seen_una = tc->t_last_seen_una;
+		tsinfo.tos = tc->t_cpath->cp_conn->c_tos;
 
 		rds_info_copy(iter, &tsinfo, sizeof(tsinfo));
 	}
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index c12203f..810a3a4 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -200,7 +200,7 @@ int rds_tcp_accept_one(struct socket *sock)
 
 	conn = rds_conn_create(sock_net(sock->sk),
 			       my_addr, peer_addr,
-			       &rds_tcp_transport, GFP_KERNEL, dev_if);
+			       &rds_tcp_transport, 0, GFP_KERNEL, dev_if);
 
 	if (IS_ERR(conn)) {
 		ret = PTR_ERR(conn);
-- 
1.9.1


^ permalink raw reply related

* Re: [PATCH v1] net: dsa: qca8k: implement DT-based ports <-> phy translation
From: Christian Lamparter @ 2019-02-04 23:55 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: netdev, Florian Fainelli, Vivien Didelot
In-Reply-To: <20190204222641.GE3397@lunn.ch>

Hello Andrew and Florian.

I concated both replies into this post.

On Monday, February 4, 2019 11:26:41 PM CET Andrew Lunn wrote:
> On Mon, Feb 04, 2019 at 10:35:55PM +0100, Christian Lamparter wrote:
> > The QCA8337 enumerates 5 PHYs on the MDC/MDIO access: PHY0-PHY4.
> > Based on the System Block Diagram in Section 1.2 of the
> > QCA8337's datasheet. These PHYs are internally connected
> > to MACs of PORT 1 - PORT 5.
> 
> Hi Christian
> 
> Is the off-by-one the problem here?
> 
Yes, this was triggered by a MII_PHYSID1 and MII_PHYSID2 read for a 
non-existent PHY at address >0x5< coming from dsa_register_switch()
during boot.

I added a WARN_ON to qca8k_phy_read() to see from where the reads are
coming from:

[    6.218168] WARNING: CPU: 0 PID: 21 at qca8k_phy_read+0x3c/0x68
[    6.224062] Modules linked in:
[    6.227107] CPU: 0 PID: 21 Comm: kworker/0:1 Tainted: G        W         4.19.16 #0
[    6.234732] Workqueue: events deferred_probe_work_func
[    6.239849] NIP:  c0308528 LR: c0308528 CTR: c0257d30
[    6.244878] REGS: cf485b80 TRAP: 0700   Tainted: G        W          (4.19.16)
[    6.252064] MSR:  00029000 <CE,EE,ME>  CR: 28002222  XER: 00000000
[    6.258224]
[    6.258224] GPR00: c0308528 cf485c30 cf47c000 00000005 00000000 00000215 c09d61e4 c09d3de0
[    6.258224] GPR08: 00021000 c09bfb00 c09bfb00 00000007 24002444 00000000 c003f81c cf466060
[    6.258224] GPR16: ffffffff cf633f60 cf633f6c 00000001 cf633f40 c0589f7c 006080c0 c09bede8
[    6.258224] GPR24: cf633f40 00000000 00000000 fffff000 00000003 cdf21790 00000003 00000005
[    6.292952] NIP [c0308528] qca8k_phy_read+0x3c/0x68
[    6.297808] LR [c0308528] qca8k_phy_read+0x3c/0x68
[    6.302574] Call Trace:
[    6.305013] [cf485c30] [c0308528] qca8k_phy_read+0x3c/0x68 (unreliable)
[    6.311602] [cf485c50] [c0300530] mdiobus_read+0x6c/0x9c
[    6.316894] [cf485c70] [c02ffdcc] get_phy_device+0x188/0x204
[    6.322527] [cf485cd0] [c0300740] mdiobus_scan+0x20/0x160
[    6.327901] [cf485d00] [c0300a3c] __mdiobus_register+0x1bc/0x2a8
[    6.333884] [cf485d30] [c047e690] dsa_register_switch+0x6a0/0x904
[    6.339954] [cf485db0] [c0300dd8] mdio_probe+0x40/0x88
[    6.345070] [cf485dc0] [c026947c] really_probe+0x168/0x300
[    6.350530] [cf485df0] [c0269b44] driver_probe_device+0x410/0x460
[    6.356594] [cf485e10] [c02672cc] bus_for_each_drv+0x5c/0xc0
[    6.362229] [cf485e40] [c0269270] __device_attach+0xa8/0x144
[    6.367862] [cf485e70] [c0268430] bus_probe_device+0x3c/0xc0
[    6.373495] [cf485e90] [c02689dc] deferred_probe_work_func+0x70/0xac
[    6.379821] [cf485eb0] [c003a2c4] process_one_work+0x25c/0x3b0
[    6.385633] [cf485ed0] [c003a708] worker_thread+0x2f0/0x434
[    6.391180] [cf485f10] [c003f950] kthread+0x134/0x138
[    6.396209] [cf485f40] [c000d23c] ret_from_kernel_thread+0x14/0x1c
[    6.402357] Instruction dump:
[    6.405313] 93c10018 7cbe2b78 93e1001c 7c9f2378 93a10014 83a30018 4bfffe4d 3d20c058
[    6.413027] 7c651b78 7fe4fb78 3869c970 4bd4e35d <0fe00000> 80010024 7fc5f378 807d0004
[    6.420916] ---[ end trace 00aeb6767b21cd36 ]---

If I disable port@5 (see qca8k.txt example)
<https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/net/dsa/qca8k.txt#L103>

then problem went away (but of course, this makes the WAN port useless).

When I looked more into it, I started to notice that the mdiobus_scan
started from address >1< (not 0). It was skipping PHY0 (which does exist!)
and then the extract from qca8k.txt suddenly made sense:

|The integrated switch subnode should be specified according to the binding
|described in dsa/dsa.txt. As the QCA8K switches do not have a N:N mapping of
|port and PHY id, each subnode describing a port needs to have a valid phandle
|referencing the internal PHY connected to it. The CPU port of this switch is
|always port 0.

From what I can tell, the slave mdio port-numbers (i.e 0 - 5/6 on the qca8k)
are being used directly as phy-addresses on the slave-bus. And since the port0
is a cpu port it gets skipped when the ds->phy_mii_mask is created in
dsa_switch_setup():

| ds->phys_mii_mask |= dsa_user_ports(ds);  // 0x3E

<https://elixir.bootlin.com/linux/v5.0-rc5/source/net/dsa/dsa2.c#L350>

However, ds->phys_mii_mask is used to calculate the slave's phy_mask in
dsa_slave_mii_bus_init()

|ds->slave_mii_bus->phy_mask = ~ds->phys_mii_mask; // ~0x3E

<https://github.com/torvalds/linux/blob/master/net/dsa/slave.c#L61>

which is then later used by __mdiobus_register() to scan for the
supposedly existing PHY at 0x1 - 0x5.

|	for (i = 0; i < PHY_MAX_ADDR; i++) {
|		if ((bus->phy_mask & (1 << i)) == 0) {
|			struct phy_device *phydev;
|
|			phydev = mdiobus_scan(bus, i);
|			if (IS_ERR(phydev) && (PTR_ERR(phydev) != -ENODEV)) {
|				err = PTR_ERR(phydev);
|				goto error;
|			}
|		}
|	}

So, I'm looking for a way to get this "-1" somewhere and this version
was the best justification I came up with. Because as Florian said,
this is supposed to work for different drivers as well.

---

On Monday, February 4, 2019 11:11:41 PM CET Florian Fainelli wrote:
> On 2/4/19 1:35 PM, Christian Lamparter wrote:
> > The QCA8337 enumerates 5 PHYs on the MDC/MDIO access: PHY0-PHY4.
> > Based on the System Block Diagram in Section 1.2 of the
> > QCA8337's datasheet. These PHYs are internally connected
> > to MACs of PORT 1 - PORT 5. However, neither qca8k's slave
> > mdio access functions qca8k_phy_read()/qca8k_phy_write()
> > nor the dsa framework is set up for that.
> > 
> > This version of the patch uses the existing phy-handle
> > properties of each specified DSA Port in the DT to map
> > each PORT/MAC to its exposed PHY on the MDIO bus. This
> > is supported by the current binding document qca8k.txt
> > as well.
> 
> I don't think you should have to do any of this translation, because you
> can do a couple of things with DSA/Device Tree:
> 
> - you can not provide a phy-handle property at all, in which case, the
> core DSA layer assumes that the PHY is part of the switch's internal
> MDIO bus which is implictly created by dsa_slave_mii_bus_create()
> 
> - you can specify a phy-handle property and then the PHY device tree
> node can be placed pretty much anywhere in Device Tree, including on a
> separate MDIO bus Device Tre node which is "external" to the switch
> 
> In either case, the PHY device's MDIO bus parent and its address are
> taken care of by drivers/of/of_mdio.c. You can look at mx88e6xxx for how
> it deals with its internal vs. external MDIO bus controller and that
> driver is used on a wide variety of cconfiguration.

Hm, this sounds to me like the qca8k driver might be cheating a bit. Though,
I can't really tell, I found "stub" translation routines in the mv88e6060
driver called mv88e6060_port_to_phy_addr(). But it just checks the range.
I think the issue here is that qca8k_phy_read/write don't actually operate
on the "internal" mdio-bus of the switch. Instead the mdiobus_read/write
in qca8k_phy_read/write operates on the provided mdio-bus (by the 
ethernet-mac or gpio-mdio, etc...). But let's see what else can be done 
(maybe qca8k_phy_read/write() is just wrong and should be dropped?).

Regards,
Christian



^ permalink raw reply

* Re: [PATCH mlx5-next 12/12] net/mlx5: Set ODP SRQ support in firmware
From: Saeed Mahameed @ 2019-02-04 23:54 UTC (permalink / raw)
  To: jgg@ziepe.ca, leon@kernel.org
  Cc: Majd Dibbiny, Moni Shoua, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, dledford@redhat.com
In-Reply-To: <20190204212319.GE10237@ziepe.ca>

On Mon, 2019-02-04 at 14:23 -0700, Jason Gunthorpe wrote:
> On Sun, Feb 03, 2019 at 11:03:11AM +0200, Leon Romanovsky wrote:
> > On Thu, Jan 31, 2019 at 04:28:44PM -0700, Jason Gunthorpe wrote:
> > > On Tue, Jan 22, 2019 at 08:48:51AM +0200, Leon Romanovsky wrote:
> > > > From: Moni Shoua <monis@mellanox.com>
> > > > 
> > > > To avoid compatibility issue with older kernels the firmware
> > > > doesn't
> > > > allow SRQ to work with ODP unless kernel asks for it.
> > > > 
> > > > Signed-off-by: Moni Shoua <monis@mellanox.com>
> > > > Reviewed-by: Majd Dibbiny <majd@mellanox.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> > > >  .../net/ethernet/mellanox/mlx5/core/main.c    | 53
> > > > +++++++++++++++++++
> > > >  include/linux/mlx5/device.h                   |  3 ++
> > > >  include/linux/mlx5/mlx5_ifc.h                 |  1 +
> > > >  3 files changed, 57 insertions(+)
> > > > 
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > index be81b319b0dc..b3a76df0cf6c 100644
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> > > > @@ -459,6 +459,53 @@ static int handle_hca_cap_atomic(struct
> > > > mlx5_core_dev *dev)
> > > >  	return err;
> > > >  }
> > > > 
> > > > +static int handle_hca_cap_odp(struct mlx5_core_dev *dev)
> > > > +{
> > > > +	void *set_ctx;
> > > > +	void *set_hca_cap;
> > > > +	int set_sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
> > > > +	int err;
> > > > +
> > > > +	if (!MLX5_CAP_GEN(dev, pg))
> > > > +		return 0;
> > > 
> > > Should a
> > > 
> > >     if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
> > >              return 0;
> > > 
> > > Be here?
> > 
> > We had similar discussion in mlx5_ib main.c, but here we are
> > talking
> > about mlx5_core code, which from my point of view should represent
> > the
> > real HW capabilities without relation to kernel compilation mode.
> 
> This switch is to tell the FW that the mlx5_ib module supports the
> new
> protocol - so having it in core code at all is really weird. I assume
> there is some startup sequence reason?
> 

Yes, sadly this must be in startup, set_hca_cap requests must come
prior to init_hca command.

> Since the modularity is already wrecked it seems like an odd
> reason not to add the if..
> 

Agree, even better, let's compile out the whole function. I would even
consider having a separate file in mlx5/core for IB related start-up
procedures :).

> Jason

^ permalink raw reply

* Re: [PATCH mlx5-next 12/12] net/mlx5: Set ODP SRQ support in firmware
From: Saeed Mahameed @ 2019-02-04 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe, leon@kernel.org, dledford@redhat.com
  Cc: Majd Dibbiny, Moni Shoua, Leon Romanovsky,
	linux-rdma@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20190122064851.6032-13-leon@kernel.org>

On Tue, 2019-01-22 at 08:48 +0200, Leon Romanovsky wrote:
> From: Moni Shoua <monis@mellanox.com>
> 
> To avoid compatibility issue with older kernels the firmware doesn't
> allow SRQ to work with ODP unless kernel asks for it.
> 
> Signed-off-by: Moni Shoua <monis@mellanox.com>
> Reviewed-by: Majd Dibbiny <majd@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  .../net/ethernet/mellanox/mlx5/core/main.c    | 53
> +++++++++++++++++++
>  include/linux/mlx5/device.h                   |  3 ++
>  include/linux/mlx5/mlx5_ifc.h                 |  1 +
>  3 files changed, 57 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index be81b319b0dc..b3a76df0cf6c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -459,6 +459,53 @@ static int handle_hca_cap_atomic(struct
> mlx5_core_dev *dev)
>  	return err;
>  }
>  
> +static int handle_hca_cap_odp(struct mlx5_core_dev *dev)
> +{
> +	void *set_ctx;
> +	void *set_hca_cap;
> +	int set_sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
> +	int err;
> +

reversed xmas tree.

> +	if (!MLX5_CAP_GEN(dev, pg))
> +		return 0;
> +
> +	err = mlx5_core_get_caps(dev, MLX5_CAP_ODP);
> +	if (err)
> +		return err;
> +
> +	/**
> +	 * If all bits are cleared we shouldn't try to set it
> +	 * or we might fail while trying to access a reserved bit.
> +	 */

"set them" not "set it" ? 

to me this is a redundant comment, the code is self explanatory.

> +	if (!(MLX5_CAP_ODP_MAX(dev, ud_odp_caps.srq_receive) ||
> +	      MLX5_CAP_ODP_MAX(dev, rc_odp_caps.srq_receive) ||
> +	      MLX5_CAP_ODP_MAX(dev, xrc_odp_caps.srq_receive)))
> +		return 0;
> +
> +	set_ctx = kzalloc(set_sz, GFP_KERNEL);
> +	if (!set_ctx)
> +		return -ENOMEM;
> +
> +	set_hca_cap = MLX5_ADDR_OF(set_hca_cap_in, set_ctx,
> capability);
> +	memcpy(set_hca_cap, dev->caps.hca_cur[MLX5_CAP_ODP],
> +	       MLX5_ST_SZ_BYTES(odp_cap));
> +
> +	/* set ODP SRQ support for RC/UD and XRC transports */
> +	MLX5_SET(odp_cap, set_hca_cap, ud_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, ud_odp_caps.srq_receive)));
> +
> +	MLX5_SET(odp_cap, set_hca_cap, rc_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, rc_odp_caps.srq_receive)));
> +
> +	MLX5_SET(odp_cap, set_hca_cap, xrc_odp_caps.srq_receive,
> +		 (MLX5_CAP_ODP_MAX(dev, xrc_odp_caps.srq_receive)));
> +

Redundant parentheses around the 3rd parameter. 

> +	err = set_caps(dev, set_ctx, set_sz,
> MLX5_SET_HCA_CAP_OP_MOD_ODP);
> +
> +	kfree(set_ctx);
> +	return err;
> +}
> +
>  static int handle_hca_cap(struct mlx5_core_dev *dev)
>  {
>  	void *set_ctx = NULL;
> @@ -926,6 +973,12 @@ static int mlx5_load_one(struct mlx5_core_dev
> *dev, struct mlx5_priv *priv,
>  		goto reclaim_boot_pages;
>  	}
>  
> +	err = handle_hca_cap_odp(dev);
> +	if (err) {
> +		dev_err(&pdev->dev, "handle_hca_cap_odp failed\n");
> +		goto reclaim_boot_pages;
> +	}
> +
>  	err = mlx5_satisfy_startup_pages(dev, 0);
>  	if (err) {
>  		dev_err(&pdev->dev, "failed to allocate init pages\n");
> diff --git a/include/linux/mlx5/device.h
> b/include/linux/mlx5/device.h
> index 8c4a820bd4c1..0845a227a7b2 100644
> --- a/include/linux/mlx5/device.h
> +++ b/include/linux/mlx5/device.h
> @@ -1201,6 +1201,9 @@ enum mlx5_qcam_feature_groups {
>  #define MLX5_CAP_ODP(mdev, cap)\
>  	MLX5_GET(odp_cap, mdev->caps.hca_cur[MLX5_CAP_ODP], cap)
>  
> +#define MLX5_CAP_ODP_MAX(mdev, cap)\
> +	MLX5_GET(odp_cap, mdev->caps.hca_max[MLX5_CAP_ODP], cap)
> +
>  #define MLX5_CAP_VECTOR_CALC(mdev, cap) \
>  	MLX5_GET(vector_calc_cap, \
>  		 mdev->caps.hca_cur[MLX5_CAP_VECTOR_CALC], cap)
> diff --git a/include/linux/mlx5/mlx5_ifc.h
> b/include/linux/mlx5/mlx5_ifc.h
> index 5407db8ba8e1..c5c679390fbd 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -72,6 +72,7 @@ enum {
>  
>  enum {
>  	MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE        = 0x0,
> +	MLX5_SET_HCA_CAP_OP_MOD_ODP                   = 0x2,
>  	MLX5_SET_HCA_CAP_OP_MOD_ATOMIC                = 0x3,
>  };
>  

^ permalink raw reply

* [PATCH net-next 2/2] net: marvell: mvpp2: fix lack of link interrupts
From: Russell King @ 2019-02-04 23:35 UTC (permalink / raw)
  To: Antoine Tenart, Maxime Chevallier; +Cc: Baruch Siach, David S. Miller, netdev

Sven Auhagen reports that if he changes a SFP+ module for a SFP module
on the Macchiatobin Single Shot, the link does not come back up.  For
Sven, it is as easy as:

- Insert a SFP+ module connected, and use ping6 to verify link is up.
- Remove SFP+ module
- Insert SFP 1000base-X module use ping6 to verify link is up: Link
  up event did not trigger and the link is down

but that doesn't show the problem for me.  Locally, this has been
reproduced by:

- Boot with no modules.
- Insert SFP+ module, confirm link is up.
- Replace module with 25000base-X module.  Confirm link is up.
- Set remote end down, link is reported as dropped at both ends.
- Set remote end up, link is reported up at remote end, but not local
  end due to lack of link interrupt.

Fix this by setting up both GMAC and XLG interrupts for port 0, but
only unmasking the appropriate interrupt according to the current mode
set in the mac_config() method.  However, only do the mask/unmask
dance when we are really changing the link mode to avoid missing any
link interrupts.

Tested-by: Sven Auhagen <sven.auhagen@voleatech.de>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
Possibly needs backporting to stable kernels, but would conflict with
patch 1. That could be solved by re-ordering the patches.

 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index c54ed19dcdae..d8974c446f8e 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -1133,7 +1133,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 {
 	u32 val;
 
-	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
+	if (port->phylink ||
+	    phy_interface_mode_is_rgmii(port->phy_interface) ||
 	    phy_interface_mode_is_8023z(port->phy_interface) ||
 	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_MASK);
@@ -4635,6 +4636,7 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 			     const struct phylink_link_state *state)
 {
 	struct mvpp2_port *port = netdev_priv(dev);
+	bool change_interface = port->phy_interface != state->interface;
 
 	/* Check for invalid configuration */
 	if (state->interface == PHY_INTERFACE_MODE_10GKR && port->gop_id != 0) {
@@ -4644,14 +4646,16 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 
 	/* Make sure the port is disabled when reconfiguring the mode */
 	mvpp2_port_disable(port);
+	if (change_interface) {
+		mvpp22_gop_mask_irq(port);
 
-	if (port->priv->hw_version == MVPP22 &&
-	    port->phy_interface != state->interface) {
-		port->phy_interface = state->interface;
+		if (port->priv->hw_version == MVPP22) {
+			port->phy_interface = state->interface;
 
-		/* Reconfigure the serdes lanes */
-		phy_power_off(port->comphy);
-		mvpp22_mode_reconfigure(port);
+			/* Reconfigure the serdes lanes */
+			phy_power_off(port->comphy);
+			mvpp22_mode_reconfigure(port);
+		}
 	}
 
 	/* mac (re)configuration */
@@ -4665,6 +4669,9 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 	if (port->priv->hw_version == MVPP21 && port->flags & MVPP2_F_LOOPBACK)
 		mvpp2_port_loopback_set(port, state);
 
+	if (change_interface)
+		mvpp22_gop_unmask_irq(port);
+
 	mvpp2_port_enable(port);
 }
 
-- 
2.7.4


^ permalink raw reply related

* [PATCH net-next 1/2] net: marvell: mvpp2: use phy_interface_mode_is_8023z() helper
From: Russell King @ 2019-02-04 23:35 UTC (permalink / raw)
  To: Antoine Tenart, Maxime Chevallier; +Cc: Baruch Siach, David S. Miller, netdev

Use the phy_interface_mode_is_8023z() helper for detecting interface
modes that use 802.3z serial encoding.  This is equivalent to testing
for both 1000base-X and 2500base-X.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
---
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 38 ++++++++++---------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index f1dab0b55769..c54ed19dcdae 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -1090,9 +1090,8 @@ static void mvpp22_gop_unmask_irq(struct mvpp2_port *port)
 	u32 val;
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		/* Enable the GMAC link status irq for this port */
 		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
 		val |= MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
@@ -1122,9 +1121,8 @@ static void mvpp22_gop_mask_irq(struct mvpp2_port *port)
 	}
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_SUM_MASK);
 		val &= ~MVPP22_GMAC_INT_SUM_MASK_LINK_STAT;
 		writel(val, port->base + MVPP22_GMAC_INT_SUM_MASK);
@@ -1136,9 +1134,8 @@ static void mvpp22_gop_setup_irq(struct mvpp2_port *port)
 	u32 val;
 
 	if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-	    port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+	    phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_MASK);
 		val |= MVPP22_GMAC_INT_MASK_LINK_STAT;
 		writel(val, port->base + MVPP22_GMAC_INT_MASK);
@@ -1259,9 +1256,8 @@ static void mvpp2_port_loopback_set(struct mvpp2_port *port,
 	else
 		val &= ~MVPP2_GMAC_GMII_LB_EN_MASK;
 
-	if (port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-	    port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    port->phy_interface == PHY_INTERFACE_MODE_2500BASEX)
+	if (phy_interface_mode_is_8023z(port->phy_interface) ||
+	    port->phy_interface == PHY_INTERFACE_MODE_SGMII)
 		val |= MVPP2_GMAC_PCS_LB_EN_MASK;
 	else
 		val &= ~MVPP2_GMAC_PCS_LB_EN_MASK;
@@ -2487,9 +2483,8 @@ static irqreturn_t mvpp2_link_status_isr(int irq, void *dev_id)
 				link = true;
 		}
 	} else if (phy_interface_mode_is_rgmii(port->phy_interface) ||
-		   port->phy_interface == PHY_INTERFACE_MODE_SGMII ||
-		   port->phy_interface == PHY_INTERFACE_MODE_1000BASEX ||
-		   port->phy_interface == PHY_INTERFACE_MODE_2500BASEX) {
+		   phy_interface_mode_is_8023z(port->phy_interface) ||
+		   port->phy_interface == PHY_INTERFACE_MODE_SGMII) {
 		val = readl(port->base + MVPP22_GMAC_INT_STAT);
 		if (val & MVPP22_GMAC_INT_STAT_LINK) {
 			event = true;
@@ -4581,8 +4576,7 @@ static void mvpp2_gmac_config(struct mvpp2_port *port, unsigned int mode,
 	ctrl0 &= ~MVPP2_GMAC_PORT_TYPE_MASK;
 	ctrl2 &= ~(MVPP2_GMAC_PORT_RESET_MASK | MVPP2_GMAC_PCS_ENABLE_MASK);
 
-	if (state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    state->interface == PHY_INTERFACE_MODE_2500BASEX) {
+	if (phy_interface_mode_is_8023z(state->interface)) {
 		/* 1000BaseX and 2500BaseX ports cannot negotiate speed nor can
 		 * they negotiate duplex: they are always operating with a fixed
 		 * speed of 1000/2500Mbps in full duplex, so force 1000/2500
@@ -4602,9 +4596,8 @@ static void mvpp2_gmac_config(struct mvpp2_port *port, unsigned int mode,
 	if (phylink_test(state->advertising, Asym_Pause))
 		an |= MVPP2_GMAC_FC_ADV_ASM_EN;
 
-	if (state->interface == PHY_INTERFACE_MODE_SGMII ||
-	    state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-	    state->interface == PHY_INTERFACE_MODE_2500BASEX) {
+	if (phy_interface_mode_is_8023z(state->interface) ||
+	    state->interface == PHY_INTERFACE_MODE_SGMII) {
 		an |= MVPP2_GMAC_IN_BAND_AUTONEG;
 		ctrl2 |= MVPP2_GMAC_INBAND_AN_MASK | MVPP2_GMAC_PCS_ENABLE_MASK;
 
@@ -4665,9 +4658,8 @@ static void mvpp2_mac_config(struct net_device *dev, unsigned int mode,
 	if (state->interface == PHY_INTERFACE_MODE_10GKR)
 		mvpp2_xlg_config(port, mode, state);
 	else if (phy_interface_mode_is_rgmii(state->interface) ||
-		 state->interface == PHY_INTERFACE_MODE_SGMII ||
-		 state->interface == PHY_INTERFACE_MODE_1000BASEX ||
-		 state->interface == PHY_INTERFACE_MODE_2500BASEX)
+		 phy_interface_mode_is_8023z(state->interface) ||
+		 state->interface == PHY_INTERFACE_MODE_SGMII)
 		mvpp2_gmac_config(port, mode, state);
 
 	if (port->priv->hw_version == MVPP21 && port->flags & MVPP2_F_LOOPBACK)
-- 
2.7.4


^ permalink raw reply related

* Re: [PATCH net-next v2 1/1] net/mlx5: Fix code style issue in mlx driver
From: Saeed Mahameed @ 2019-02-04 23:26 UTC (permalink / raw)
  To: Tonghao Zhang; +Cc: David S. Miller, Linux Netdev List
In-Reply-To: <1548797004-45137-1-git-send-email-xiangxia.m.yue@gmail.com>

On Thu, Jan 31, 2019 at 5:20 PM <xiangxia.m.yue@gmail.com> wrote:
>
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
>
> Add the tab before '}' and keep the code style consistent.
>
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>

Dave, you can take this patch to net-next.

Thanks,
Saeed.

^ permalink raw reply

* Re: [PATCH v2] net: dsa: mv88e6xxx: Revise irq setup ordering
From: Andrew Lunn @ 2019-02-04 23:14 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <824d011b-3692-69c3-5e2c-58e950a80abf@bell.net>

On Mon, Feb 04, 2019 at 04:59:13PM -0500, John David Anglin wrote:
> This change fixes a race condition in the setup of hardware irqs and the
> code enabling PHY link detection in the mv88e6xxx driver.
> 
> This race was observed on the espressobin board where the GPIO interrupt
> controller only supports edge interrupts.  If the INTn output pin goes low
> before the GPIO interrupt is enabled, PHY link interrupts are not detected.
> 
> With this change, we
> 1) force INTn high by clearing all interrupt enables in global 1 control 1,
> 2) setup the hardware irq, and then
> 3) perform the remaining common setup.
> 
> This simplifies the setup and allows some unnecessary code to be removed.

Hi Dave

I took a closer look now. I don't actually see why the current code is
wrong.

mv88e6xxx_g1_irq_setup() calls mv88e6xxx_g1_irq_setup_common() and
then registers the interrupt handler.

mv88e6xxx_g1_irq_setup_common() does what you want, it masks all
interrupts in the hardware and clears any pending interrupts which can
be cleared.

The change you made is actually dangerous. As soon as you request the
interrupt, it is live, it can fire, and call
mv88e6xxx_g1_irq_thread_work(). That needs the irq domain. But the
change you made defers the creating of the domain until after the
interrupt is registered. So we can de-refernece a NULL pointer in the
interrupt handler.

	  Andrew

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox