Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net] be2net: Signal that the device cannot transmit during reconfiguration
From: David Miller @ 2019-07-16 19:41 UTC (permalink / raw)
  To: bpoirier
  Cc: sathya.perla, ajit.khaparde, sriharsha.basavapatna, somnath.kotur,
	fyang, saeedm, netdev
In-Reply-To: <20190716081655.7676-1-bpoirier@suse.com>

From: Benjamin Poirier <bpoirier@suse.com>
Date: Tue, 16 Jul 2019 17:16:55 +0900

> While changing the number of interrupt channels, be2net stops adapter
> operation (including netif_tx_disable()) but it doesn't signal that it
> cannot transmit. This may lead dev_watchdog() to falsely trigger during
> that time.
> 
> Add the missing call to netif_carrier_off(), following the pattern used in
> many other drivers. netif_carrier_on() is already taken care of in
> be_open().
> 
> Signed-off-by: Benjamin Poirier <bpoirier@suse.com>

Applied.

^ permalink raw reply

* [PATCH net-next v1] fix: taprio: Change type of txtime-delay parameter to u32
From: Vedang Patel @ 2019-07-16 19:52 UTC (permalink / raw)
  To: netdev
  Cc: jeffrey.t.kirsher, davem, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, jakub.kicinski, m-karicheri2,
	sergei.shtylyov, eric.dumazet, aaron.f.brown, stephen,
	Vedang Patel

During the review of the iproute2 patches for txtime-assist mode, it was
pointed out that it does not make sense for the txtime-delay parameter to
be negative. So, change the type of the parameter from s32 to u32.

Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 include/uapi/linux/pkt_sched.h | 2 +-
 net/sched/sch_taprio.c         | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 1f623252abe8..18f185299f47 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1174,7 +1174,7 @@ enum {
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
 	TCA_TAPRIO_ATTR_FLAGS, /* u32 */
-	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
+	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* u32 */
 	__TCA_TAPRIO_ATTR_MAX,
 };
 
diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
index 388750ddc57a..c39db507ba3f 100644
--- a/net/sched/sch_taprio.c
+++ b/net/sched/sch_taprio.c
@@ -75,7 +75,7 @@ struct taprio_sched {
 	struct sched_gate_list __rcu *admin_sched;
 	struct hrtimer advance_timer;
 	struct list_head taprio_list;
-	int txtime_delay;
+	u32 txtime_delay;
 };
 
 static ktime_t sched_base_time(const struct sched_gate_list *sched)
@@ -1113,7 +1113,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
 			goto unlock;
 		}
 
-		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
+		q->txtime_delay = nla_get_u32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
 	}
 
 	if (!TXTIME_ASSIST_IS_ENABLED(taprio_flags) &&
@@ -1430,7 +1430,7 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
 		goto options_error;
 
 	if (q->txtime_delay &&
-	    nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
+	    nla_put_u32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
 		goto options_error;
 
 	if (oper && dump_schedule(skb, oper))
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 1/6] Update kernel headers
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel

The type for txtime-delay parameter will change from s32 to u32. So,
make the corresponding change in the ABI file as well.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 include/uapi/linux/pkt_sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 1f623252abe8..18f185299f47 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -1174,7 +1174,7 @@ enum {
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
 	TCA_TAPRIO_ATTR_FLAGS, /* u32 */
-	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
+	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* u32 */
 	__TCA_TAPRIO_ATTR_MAX,
 };
 
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 2/6] etf: Add skip_sock_check
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563306789-2908-1-git-send-email-vedang.patel@intel.com>

ETF Qdisc currently checks for a socket with SO_TXTIME socket option. If
either is not present, the packet is dropped. In the future commits, we
want other Qdiscs to add packet with launchtime to the ETF Qdisc. Also,
there are some packets (e.g. ICMP packets) which may not have a socket
associated with them.  So, add an option to skip this check.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_etf.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tc/q_etf.c b/tc/q_etf.c
index 76aca476c61d..c2090589bc64 100644
--- a/tc/q_etf.c
+++ b/tc/q_etf.c
@@ -130,6 +130,13 @@ static int etf_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "skip_sock_check") == 0) {
+			if (opt.flags & TC_ETF_SKIP_SOCK_CHECK) {
+				fprintf(stderr, "etf: duplicate \"skip_sock_check\" specification\n");
+				return -1;
+			}
+
+			opt.flags |= TC_ETF_SKIP_SOCK_CHECK;
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -171,8 +178,10 @@ static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	print_uint(PRINT_ANY, "delta", "delta %d ", qopt->delta);
 	print_string(PRINT_ANY, "offload", "offload %s ",
 				(qopt->flags & TC_ETF_OFFLOAD_ON) ? "on" : "off");
-	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s",
+	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s ",
 				(qopt->flags & TC_ETF_DEADLINE_MODE_ON) ? "on" : "off");
+	print_string(PRINT_ANY, "skip_sock_check", "skip_sock_check %s",
+				(qopt->flags & TC_ETF_SKIP_SOCK_CHECK) ? "on" : "off");
 
 	return 0;
 }
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 5/6] tc: etf: Add documentation for skip-skb-check.
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563306789-2908-1-git-send-email-vedang.patel@intel.com>

Document the newly added option (skip-skb-check) on the etf man-page.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-etf.8 | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/man/man8/tc-etf.8 b/man/man8/tc-etf.8
index 30a12de7d2c7..2e01a591dbaa 100644
--- a/man/man8/tc-etf.8
+++ b/man/man8/tc-etf.8
@@ -106,6 +106,16 @@ referred to as "Launch Time" or "Time-Based Scheduling" by the
 documentation of network interface controllers.
 The default is for this option to be disabled.
 
+.TP
+skip_skb_check
+.br
+.BR etf(8)
+currently drops any packet which does not have a socket associated with it or
+if the socket does not have SO_TXTIME socket option set. But, this will not
+work if the launchtime is set by another entity inside the kernel (e.g. some
+other Qdisc). Setting the skip_skb_check will skip checking for a socket
+associated with the packet.
+
 .SH EXAMPLES
 
 ETF is used to enforce a Quality of Service. It controls when each
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 4/6] taprio: add support for setting txtime_delay.
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563306789-2908-1-git-send-email-vedang.patel@intel.com>

This adds support for setting the txtime_delay parameter which is useful
for the txtime offload mode of taprio.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 1db2aba6efe7..b9954436b0f9 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -52,7 +52,7 @@ static void explain(void)
 		"		[num_tc NUMBER] [map P0 P1 ...] "
 		"		[queues COUNT@OFFSET COUNT@OFFSET COUNT@OFFSET ...] "
 		"		[ [sched-entry index cmd gate-mask interval] ... ] "
-		"		[base-time time] "
+		"		[base-time time] [txtime-delay delay]"
 		"\n"
 		"CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
 }
@@ -160,6 +160,7 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	struct list_head sched_entries;
 	struct rtattr *tail, *l;
 	__u32 taprio_flags = 0;
+	__u32 txtime_delay = 0;
 	__s64 cycle_time = 0;
 	__s64 base_time = 0;
 	int err, idx;
@@ -293,6 +294,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				return -1;
 			}
 
+		} else if (strcmp(*argv, "txtime-delay") == 0) {
+			NEXT_ARG();
+			if (txtime_delay != 0) {
+				fprintf(stderr, "taprio: duplicate \"txtime-delay\" specification\n");
+				return -1;
+			}
+			if (get_u32(&txtime_delay, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -315,6 +327,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
+	if (txtime_delay)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_TXTIME_DELAY, &txtime_delay, sizeof(txtime_delay));
+
 	if (base_time)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_BASE_TIME, &base_time, sizeof(base_time));
 
@@ -464,6 +479,13 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 		print_0xhex(PRINT_ANY, "flags", " flags %#x", flags);
 	}
 
+	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]) {
+		__u32 txtime_delay;
+
+		txtime_delay = rta_getattr_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
+		print_uint(PRINT_ANY, "txtime_delay", " txtime delay %d", txtime_delay);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 6/6] tc: taprio: Update documentation
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563306789-2908-1-git-send-email-vedang.patel@intel.com>

Add documentation for the latest options, flags and txtime-delay, to the
taprio manpage.

This also adds an example to run tc in txtime offload mode.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-taprio.8 | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/man/man8/tc-taprio.8 b/man/man8/tc-taprio.8
index 850be9b03649..e1d19ba19089 100644
--- a/man/man8/tc-taprio.8
+++ b/man/man8/tc-taprio.8
@@ -112,6 +112,26 @@ means that traffic class 0 is "active" for that schedule entry.
 long that state defined by <command> and <gate mask> should be held
 before moving to the next entry.
 
+.TP
+flags
+.br
+Specifies different modes for taprio. Currently, only txtime-assist is
+supported which can be enabled by setting it to 0x1. In this mode, taprio will
+set the transmit timestamp depending on the interval in which the packet needs
+to be transmitted. It will then utililize the
+.BR etf(8)
+qdisc to sort and transmit the packets at the right time. The second example
+can be used as a reference to configure this mode.
+
+.TP
+txtime-delay
+.br
+This parameter is specific to the txtime offload mode. It specifies the maximum
+time a packet might take to reach the network card from the taprio qdisc. The
+value should always be greater than the delta specified in the
+.BR etf(8)
+qdisc.
+
 .SH EXAMPLES
 
 The following example shows how an traffic schedule with three traffic
@@ -137,6 +157,26 @@ reference CLOCK_TAI. The schedule is composed of three entries each of
               clockid CLOCK_TAI
 .EE
 
+Following is an example to enable the txtime offload mode in taprio. See
+.BR etf(8)
+for more information about configuring the ETF qdisc.
+
+.EX
+# tc qdisc replace dev eth0 parent root handle 100 taprio \\
+              num_tc 3 \\
+              map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+              queues 1@0 1@0 1@0 \\
+              base-time 1528743495910289987 \\
+              sched-entry S 01 300000 \\
+              sched-entry S 02 300000 \\
+              sched-entry S 04 400000 \\
+              flags 0x1 \\
+              txtime-delay 200000 \\
+              clockid CLOCK_TAI
+
+# tc qdisc replace dev $IFACE parent 100:1 etf skip_skb_check \\
+              offload delta 200000 clockid CLOCK_TAI
+.EE
 
 .SH AUTHORS
 Vinicius Costa Gomes <vinicius.gomes@intel.com>
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v4 3/6] taprio: Add support for setting flags
From: Vedang Patel @ 2019-07-16 19:53 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563306789-2908-1-git-send-email-vedang.patel@intel.com>

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This allows a new parameter, flags, to be passed to taprio. Currently, it
only supports enabling the txtime-assist mode. But, we plan to add
different modes for taprio (e.g. hardware offloading) and this parameter
will be useful in enabling those modes.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 62c8c591da99..1db2aba6efe7 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -159,6 +159,7 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	__s64 cycle_time_extension = 0;
 	struct list_head sched_entries;
 	struct rtattr *tail, *l;
+	__u32 taprio_flags = 0;
 	__s64 cycle_time = 0;
 	__s64 base_time = 0;
 	int err, idx;
@@ -281,6 +282,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "flags") == 0) {
+			NEXT_ARG();
+			if (taprio_flags) {
+				fprintf(stderr, "taprio: duplicate \"flags\" specification\n");
+				return -1;
+			}
+			if (get_u32(&taprio_flags, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -297,6 +309,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (clockid != CLOCKID_INVALID)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_CLOCKID, &clockid, sizeof(clockid));
 
+	if (taprio_flags)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_FLAGS, &taprio_flags, sizeof(taprio_flags));
+
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
@@ -442,6 +457,13 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 
 	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
 
+	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
+		__u32 flags;
+
+		flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
+		print_0xhex(PRINT_ANY, "flags", " flags %#x", flags);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* Re: [PATCH bpf 1/2] selftests/bpf: fix test_verifier/test_maps make dependencies
From: Stanislav Fomichev @ 2019-07-16 19:55 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, netdev, daniel, ast, andrii.nakryiko, kernel-team,
	Ilya Leoshkevich, Stanislav Fomichev, Martin KaFai Lau
In-Reply-To: <20190716193837.2808971-1-andriin@fb.com>

On 07/16, Andrii Nakryiko wrote:
> e46fc22e60a4 ("selftests/bpf: make directory prerequisites order-only")
> exposed existing problem in Makefile for test_verifier and test_maps tests:
> their dependency on auto-generated header file with a list of all tests wasn't
> recorded explicitly. This patch fixes these issues.
Why adding it explicitly fixes it? At least for test_verifier, we have
the following rule:

	test_verifier.c: $(VERIFIER_TESTS_H)

And there should be implicit/builtin test_verifier -> test_verifier.c
dependency rule.

Same for maps, I guess:

	$(OUTPUT)/test_maps: map_tests/*.c
	test_maps.c: $(MAP_TESTS_H)

So why is it not working as is? What I'm I missing?

^ permalink raw reply

* Re: [PATCH net-next v1] fix: taprio: Change type of txtime-delay parameter to u32
From: Patel, Vedang @ 2019-07-16 20:15 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: Kirsher, Jeffrey T, David Miller, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, intel-wired-lan@lists.osuosl.org, Gomes, Vinicius,
	l@dorileo.org, Jakub Kicinski, Murali Karicheri, Sergei Shtylyov,
	Eric Dumazet, Brown, Aaron F, Stephen Hemminger
In-Reply-To: <1563306738-2779-1-git-send-email-vedang.patel@intel.com>

I request that this patch should also be considered for the net tree since it fixes the data type of of the txtime_delay parameter and should go in with the iproute2 patches which implement support for txtime-assist mode. 

Thanks,
Vedang Patel

> On Jul 16, 2019, at 12:52 PM, Patel, Vedang <vedang.patel@intel.com> wrote:
> 
> During the review of the iproute2 patches for txtime-assist mode, it was
> pointed out that it does not make sense for the txtime-delay parameter to
> be negative. So, change the type of the parameter from s32 to u32.
> 
> Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
> Reported-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>
> ---
> include/uapi/linux/pkt_sched.h | 2 +-
> net/sched/sch_taprio.c         | 6 +++---
> 2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> index 1f623252abe8..18f185299f47 100644
> --- a/include/uapi/linux/pkt_sched.h
> +++ b/include/uapi/linux/pkt_sched.h
> @@ -1174,7 +1174,7 @@ enum {
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME, /* s64 */
> 	TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME_EXTENSION, /* s64 */
> 	TCA_TAPRIO_ATTR_FLAGS, /* u32 */
> -	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* s32 */
> +	TCA_TAPRIO_ATTR_TXTIME_DELAY, /* u32 */
> 	__TCA_TAPRIO_ATTR_MAX,
> };
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 388750ddc57a..c39db507ba3f 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -75,7 +75,7 @@ struct taprio_sched {
> 	struct sched_gate_list __rcu *admin_sched;
> 	struct hrtimer advance_timer;
> 	struct list_head taprio_list;
> -	int txtime_delay;
> +	u32 txtime_delay;
> };
> 
> static ktime_t sched_base_time(const struct sched_gate_list *sched)
> @@ -1113,7 +1113,7 @@ static int taprio_change(struct Qdisc *sch, struct nlattr *opt,
> 			goto unlock;
> 		}
> 
> -		q->txtime_delay = nla_get_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
> +		q->txtime_delay = nla_get_u32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
> 	}
> 
> 	if (!TXTIME_ASSIST_IS_ENABLED(taprio_flags) &&
> @@ -1430,7 +1430,7 @@ static int taprio_dump(struct Qdisc *sch, struct sk_buff *skb)
> 		goto options_error;
> 
> 	if (q->txtime_delay &&
> -	    nla_put_s32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
> +	    nla_put_u32(skb, TCA_TAPRIO_ATTR_TXTIME_DELAY, q->txtime_delay))
> 		goto options_error;
> 
> 	if (oper && dump_schedule(skb, oper))
> -- 
> 2.7.3
> 


^ permalink raw reply

* Re: [PATCH V35 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Matthew Garrett @ 2019-07-16 20:32 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: James Morris, LSM List, Linux Kernel Mailing List, Linux API,
	David Howells, Alexei Starovoitov, Network Development,
	Chun-Yi Lee
In-Reply-To: <5d363f09-d649-5693-45c0-bb99d69f0f38@iogearbox.net>

On Mon, Jul 15, 2019 at 3:54 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> Hmm, does security_locked_down() ever return a code > 0 or why do you
> have the double check on return code? If not, then for clarity the
> ret code from security_locked_down() should be checked as 'ret < 0'
> as well and out label should be at the memset directly instead.

It doesn't, so I'll update. Thanks!

^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Alexei Starovoitov @ 2019-07-16 20:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190710141548.132193-1-joel@joelfernandes.org>

On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> Hi,

why are you cc-ing the whole world for this patch set?
I'll reply to all as well, but I suspect a bunch of folks consider it spam.
Please read Documentation/bpf/bpf_devel_QA.rst

Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
so I'm not sure this set reached public mailing lists.

> These patches make it possible to attach BPF programs directly to tracepoints
> using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> attach to be alive. This has the following benefits:
> 
> 1. Simplified Security: In Android, we have finer-grained security controls to
> specific ftrace trace events using SELinux labels. We control precisely who is
> allowed to enable an ftrace event already. By adding a node to ftrace for
> attaching BPF programs, we can use the same mechanism to further control who is
> allowed to attach to a trace event.
> 
> 2. Process lifetime: In Android we are adding usecases where a tracing program
> needs to be attached all the time to a tracepoint, for the full life time of
> the system. Such as to gather statistics where there no need for a detach for
> the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> means keeping a process alive all the time.  However, in Android our BPF loader
> currently (for hardeneded security) involves just starting a process at boot
> time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> don't keep this process alive all the time. It is more suitable to do a
> one-shot attach of the program using ftrace and not need to have a process
> alive all the time anymore for this. Such process also needs elevated
> privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> anyway so by design Android's bpfloader runs once at init and exits.
> 
> This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> The following commands can be written into it:
> attach:<fd>     Attaches BPF prog fd to tracepoint
> detach:<fd>     Detaches BPF prog fd to tracepoint

Looks like, to detach a program the user needs to read a text file,
parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
get a binary FD, convert it back to text and write as a text back into this file.
I think this is just a single example why text based apis are not accepted
in bpf anymore.

Through the patch set you call it ftrace. As far as I can see, this set
has zero overlap with ftrace. There is no ftrace-bpf connection here at all
that we discussed in the past Steven. It's all quite confusing.

I suggest android to solve sticky raw_tracepoint problem with user space deamon.
The reasons, you point out why user daemon cannot be used, sound weak to me.
Another acceptable solution would be to introduce pinning of raw_tp objects.
bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
be natural extension.

^ permalink raw reply

* [Patch net v2] net_sched: unset TCQ_F_CAN_BYPASS when adding filters
From: Cong Wang @ 2019-07-16 20:57 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Eric Dumazet

For qdisc's that support TC filters and set TCQ_F_CAN_BYPASS,
notably fq_codel, it makes no sense to let packets bypass the TC
filters we setup in any scenario, otherwise our packets steering
policy could not be enforced.

This can be reproduced easily with the following script:

 ip li add dev dummy0 type dummy
 ifconfig dummy0 up
 tc qd add dev dummy0 root fq_codel
 tc filter add dev dummy0 parent 8001: protocol arp basic action mirred egress redirect dev lo
 tc filter add dev dummy0 parent 8001: protocol ip basic action mirred egress redirect dev lo
 ping -I dummy0 192.168.112.1

Without this patch, packets are sent directly to dummy0 without
hitting any of the filters. With this patch, packets are redirected
to loopback as expected.

This fix is not perfect, it only unsets the flag but does not set it back
because we have to save the information somewhere in the qdisc if we
really want that. Note, both fq_codel and sfq clear this flag in their
->bind_tcf() but this is clearly not sufficient when we don't use any
class ID.

Fixes: 23624935e0c4 ("net_sched: TCQ_F_CAN_BYPASS generalization")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_api.c      | 1 +
 net/sched/sch_fq_codel.c | 2 --
 net/sched/sch_sfq.c      | 2 --
 3 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 278014e26aec..d144233423c5 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2152,6 +2152,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false, rtnl_held);
 		tfilter_put(tp, fh);
+		q->flags &= ~TCQ_F_CAN_BYPASS;
 	}
 
 errout:
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index e2faf33d282b..d59fbcc745d1 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -596,8 +596,6 @@ static unsigned long fq_codel_find(struct Qdisc *sch, u32 classid)
 static unsigned long fq_codel_bind(struct Qdisc *sch, unsigned long parent,
 			      u32 classid)
 {
-	/* we cannot bypass queue discipline anymore */
-	sch->flags &= ~TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 420bd8411677..68404a9d2ce4 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -824,8 +824,6 @@ static unsigned long sfq_find(struct Qdisc *sch, u32 classid)
 static unsigned long sfq_bind(struct Qdisc *sch, unsigned long parent,
 			      u32 classid)
 {
-	/* we cannot bypass queue discipline anymore */
-	sch->flags &= ~TCQ_F_CAN_BYPASS;
 	return 0;
 }
 
-- 
2.21.0


^ permalink raw reply related

* [Patch net v2 1/2] fib: relax source validation check for loopback packets
From: Cong Wang @ 2019-07-16 20:58 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, Julian Anastasov, David Ahern

In a rare case where we redirect local packets from veth to lo,
these packets fail to pass the source validation when rp_filter
is turned on, as the tracing shows:

  <...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0
  <...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130

So, the fib table lookup returns eth0 as the nexthop even though
the packets are local and should be routed to loopback nonetheless,
but they can't pass the dev match check in fib_info_nh_uses_dev()
without this patch.

It should be safe to relax this check for this special case, as
normally packets coming out of loopback device still have skb_dst
so they won't even hit this slow path.

Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/fib_frontend.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 317339cd7f03..e8bc939b56dd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -388,6 +388,11 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fib_combine_itag(itag, &res);
 
 	dev_match = fib_info_nh_uses_dev(res.fi, dev);
+	/* This is not common, loopback packets retain skb_dst so normally they
+	 * would not even hit this slow path.
+	 */
+	dev_match = dev_match || (res.type == RTN_LOCAL &&
+				  dev == net->loopback_dev);
 	if (dev_match) {
 		ret = FIB_RES_NHC(res)->nhc_scope >= RT_SCOPE_HOST;
 		return ret;
-- 
2.21.0


^ permalink raw reply related

* [Patch net v2 2/2] selftests: add a test case for rp filter
From: Cong Wang @ 2019-07-16 20:58 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, David Ahern

Add a test case to simulate the loopback packet case fixed
in the previous patch.

This test gets passed after the fix:

IPv4 rp_filter tests
    TEST: rp_filter passes local packets                                [ OK ]
    TEST: rp_filter passes loopback packets                             [ OK ]

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 tools/testing/selftests/net/fib_tests.sh | 30 +++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 9457aaeae092..a9e45471edfe 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -9,12 +9,13 @@ ret=0
 ksft_skip=4
 
 # all tests in this script. Can be overridden with -t option
-TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw"
+TESTS="unregister down carrier nexthop ipv6_rt ipv4_rt ipv6_addr_metric ipv4_addr_metric ipv6_route_metrics ipv4_route_metrics ipv4_route_v6_gw rp_filter"
 
 VERBOSE=0
 PAUSE_ON_FAIL=no
 PAUSE=no
 IP="ip -netns ns1"
+NETNS="ip netns exec ns1"
 
 log_test()
 {
@@ -433,6 +434,32 @@ fib_carrier_test()
 	fib_carrier_unicast_test
 }
 
+fib_rp_filter_test()
+{
+	echo
+	echo "IPv4 rp_filter tests"
+
+	setup
+
+	$IP link set dev lo address 52:54:00:6a:c7:5e
+	$IP link set dev dummy0 address 52:54:00:6a:c7:5e
+	echo 1 | $NETNS tee /proc/sys/net/ipv4/conf/all/rp_filter > /dev/null
+	echo 1 | $NETNS tee /proc/sys/net/ipv4/conf/all/accept_local > /dev/null
+	echo 1 | $NETNS tee /proc/sys/net/ipv4/conf/all/route_localnet > /dev/null
+
+	$NETNS tc qd add dev dummy0 parent root handle 1: fq_codel
+	$NETNS tc filter add dev dummy0 parent 1: protocol arp basic action mirred egress redirect dev lo
+	$NETNS tc filter add dev dummy0 parent 1: protocol ip basic action mirred egress redirect dev lo
+
+	run_cmd "ip netns exec ns1 ping -I dummy0 -w1 -c1 198.51.100.1"
+	log_test $? 0 "rp_filter passes local packets"
+
+	run_cmd "ip netns exec ns1 ping -I dummy0 -w1 -c1 127.0.0.1"
+	log_test $? 0 "rp_filter passes loopback packets"
+
+	cleanup
+}
+
 ################################################################################
 # Tests on nexthop spec
 
@@ -1557,6 +1584,7 @@ do
 	fib_unreg_test|unregister)	fib_unreg_test;;
 	fib_down_test|down)		fib_down_test;;
 	fib_carrier_test|carrier)	fib_carrier_test;;
+	fib_rp_filter_test|rp_filter)	fib_rp_filter_test;;
 	fib_nexthop_test|nexthop)	fib_nexthop_test;;
 	ipv6_route_test|ipv6_rt)	ipv6_route_test;;
 	ipv4_route_test|ipv4_rt)	ipv4_route_test;;
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net v2] skbuff: fix compilation warnings in skb_dump()
From: David Miller @ 2019-07-16 21:12 UTC (permalink / raw)
  To: cai; +Cc: willemb, joe, clang-built-linux, netdev, linux-kernel
In-Reply-To: <1563291785-6545-1-git-send-email-cai@lca.pw>

From: Qian Cai <cai@lca.pw>
Date: Tue, 16 Jul 2019 11:43:05 -0400

> The commit 6413139dfc64 ("skbuff: increase verbosity when dumping skb
> data") introduced a few compilation warnings.
> 
> net/core/skbuff.c:766:32: warning: format specifies type 'unsigned
> short' but the argument has type 'unsigned int' [-Wformat]
>                        level, sk->sk_family, sk->sk_type,
> sk->sk_protocol);
>                                              ^~~~~~~~~~~
> net/core/skbuff.c:766:45: warning: format specifies type 'unsigned
> short' but the argument has type 'unsigned int' [-Wformat]
>                        level, sk->sk_family, sk->sk_type,
> sk->sk_protocol);
> ^~~~~~~~~~~~~~~
> 
> Fix them by using the proper types.
> 
> Fixes: 6413139dfc64 ("skbuff: increase verbosity when dumping skb data")
> Signed-off-by: Qian Cai <cai@lca.pw>

Applied, thank you.

^ permalink raw reply

* Re: [PATCH 4/9] ipv4: add lockdep condition to fix for_each_entry (v1)
From: David Miller @ 2019-07-16 21:12 UTC (permalink / raw)
  To: paulmck
  Cc: joel, linux-kernel, kuznet, bhelgaas, bp, c0d1n61at3, edumazet,
	gregkh, yoshfuji, hpa, mingo, corbet, josh, keescook,
	kernel-hardening, kernel-team, jiangshanlai, lenb, linux-acpi,
	linux-doc, linux-pci, linux-pm, mathieu.desnoyers, neilb, netdev,
	oleg, pavel, peterz, rjw, rasmus.villemoes, rcu, rostedt, tj,
	tglx, will, x86
In-Reply-To: <20190716183955.GF14271@linux.ibm.com>

From: "Paul E. McKenney" <paulmck@linux.ibm.com>
Date: Tue, 16 Jul 2019 11:39:55 -0700

> On Mon, Jul 15, 2019 at 10:37:00AM -0400, Joel Fernandes (Google) wrote:
>> Using the previous support added, use it for adding lockdep conditions
>> to list usage here.
>> 
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> We need an ack or better from the subsystem maintainer for this one.

Acked-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply

* Re: [PATCH -next] iwlwifi: dbg: work around clang bug by marking debug strings static
From: Luca Coelho @ 2019-07-16 21:14 UTC (permalink / raw)
  To: Nick Desaulniers, Joe Perches, Kalle Valo
  Cc: Arnd Bergmann, Nathan Chancellor, Johannes Berg,
	Emmanuel Grumbach, Intel Linux Wireless, David S. Miller,
	Shahar S Matityahu, Sara Sharon, linux-wireless, netdev, LKML,
	clang-built-linux
In-Reply-To: <CAKwvOdkD_r2YBqRDy-uTGMG1YeRF8KokxjikR0XLkXLsdJca0g@mail.gmail.com>

On Tue, 2019-07-16 at 10:28 -0700, Nick Desaulniers wrote:
> On Thu, Jul 11, 2019 at 7:15 PM Joe Perches <joe@perches.com> wrote:
> > On Thu, 2019-07-11 at 17:17 -0700, Nick Desaulniers wrote:
> > > Commit r353569 in prerelease Clang-9 is producing a linkage failure:
> > > 
> > > ld: drivers/net/wireless/intel/iwlwifi/fw/dbg.o:
> > > in function `_iwl_fw_dbg_apply_point':
> > > dbg.c:(.text+0x827a): undefined reference to `__compiletime_assert_2387'
> > > 
> > > when the following configs are enabled:
> > > - CONFIG_IWLWIFI
> > > - CONFIG_IWLMVM
> > > - CONFIG_KASAN
> > > 
> > > Work around the issue for now by marking the debug strings as `static`,
> > > which they probably should be any ways.
> > > 
> > > Link: https://bugs.llvm.org/show_bug.cgi?id=42580
> > > Link: https://github.com/ClangBuiltLinux/linux/issues/580
> > > Reported-by: Arnd Bergmann <arnd@arndb.de>
> > > Reported-by: Nathan Chancellor <natechancellor@gmail.com>
> > > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
> > > ---
> > >  drivers/net/wireless/intel/iwlwifi/fw/dbg.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/wireless/intel/iwlwifi/fw/dbg.c b/drivers/net/wireless/intel/iwlwifi/fw/dbg.c
> > > index e411ac98290d..f8c90ea4e9b4 100644
> > > --- a/drivers/net/wireless/intel/iwlwifi/fw/dbg.c
> > > +++ b/drivers/net/wireless/intel/iwlwifi/fw/dbg.c
> > > @@ -2438,7 +2438,7 @@ static void iwl_fw_dbg_info_apply(struct iwl_fw_runtime *fwrt,
> > >  {
> > >       u32 img_name_len = le32_to_cpu(dbg_info->img_name_len);
> > >       u32 dbg_cfg_name_len = le32_to_cpu(dbg_info->dbg_cfg_name_len);
> > > -     const char err_str[] =
> > > +     static const char err_str[] =
> > >               "WRT: ext=%d. Invalid %s name length %d, expected %d\n";
> > 
> > Better still would be to use the format string directly
> > in both locations instead of trying to deduplicate it
> > via storing it into a separate pointer.
> > 
> > Let the compiler/linker consolidate the format.
> > It's smaller object code, allows format/argument verification,
> > and is simpler for humans to understand.
> 
> Whichever Kalle prefers, I just want my CI green again.

We already changed this in a later internal patch, which will reach the
mainline (-next) soon.  So let's skip this for now.

--
Cheers,
Luca.


^ permalink raw reply

* Re: [PATCH -next] iwlwifi: dbg: work around clang bug by marking debug strings static
From: Nick Desaulniers @ 2019-07-16 21:18 UTC (permalink / raw)
  To: Luca Coelho
  Cc: Joe Perches, Kalle Valo, Arnd Bergmann, Nathan Chancellor,
	Johannes Berg, Emmanuel Grumbach, Intel Linux Wireless,
	David S. Miller, Shahar S Matityahu, Sara Sharon, linux-wireless,
	netdev, LKML, clang-built-linux
In-Reply-To: <da053a97d771eff0ad8db37e644108ed2fad25a3.camel@coelho.fi>

On Tue, Jul 16, 2019 at 2:15 PM Luca Coelho <luca@coelho.fi> wrote:
>
> On Tue, 2019-07-16 at 10:28 -0700, Nick Desaulniers wrote:
> > On Thu, Jul 11, 2019 at 7:15 PM Joe Perches <joe@perches.com> wrote:
> > > Better still would be to use the format string directly
> > > in both locations instead of trying to deduplicate it
> > > via storing it into a separate pointer.
> > >
> > > Let the compiler/linker consolidate the format.
> > > It's smaller object code, allows format/argument verification,
> > > and is simpler for humans to understand.
> >
> > Whichever Kalle prefers, I just want my CI green again.
>
> We already changed this in a later internal patch, which will reach the
> mainline (-next) soon.  So let's skip this for now.

Ok, but this has now regressed into mainline and is blocking Linaro's
ToolChain Working Group's CI, so if you could send a bugfix ASAP we'd
appreciate it.
-- 
Thanks,
~Nick Desaulniers

^ permalink raw reply

* Re: [PATCH net-next v1] fix: taprio: Change type of txtime-delay parameter to u32
From: David Miller @ 2019-07-16 21:19 UTC (permalink / raw)
  To: vedang.patel
  Cc: netdev, jeffrey.t.kirsher, jhs, xiyou.wangcong, jiri,
	intel-wired-lan, vinicius.gomes, l, jakub.kicinski, m-karicheri2,
	sergei.shtylyov, eric.dumazet, aaron.f.brown, stephen
In-Reply-To: <1563306738-2779-1-git-send-email-vedang.patel@intel.com>

From: Vedang Patel <vedang.patel@intel.com>
Date: Tue, 16 Jul 2019 12:52:18 -0700

> During the review of the iproute2 patches for txtime-assist mode, it was
> pointed out that it does not make sense for the txtime-delay parameter to
> be negative. So, change the type of the parameter from s32 to u32.
> 
> Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode")
> Reported-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Vedang Patel <vedang.patel@intel.com>

You should have targetted this at 'net' as that's the only tree open
right now.

I'll apply this.

^ permalink raw reply

* Re: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace
From: Joel Fernandes @ 2019-07-16 21:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: linux-kernel, Adrian Ratiu, Alexei Starovoitov, bpf,
	Brendan Gregg, connoro, Daniel Borkmann, duyuchao, Ingo Molnar,
	jeffv, Karim Yaghmour, kernel-team, linux-kselftest,
	Manali Shukla, Manjo Raja Rao, Martin KaFai Lau, Masami Hiramatsu,
	Matt Mullins, Michal Gregorczyk, Michal Gregorczyk,
	Mohammad Husain, namhyung, namhyung, netdev, paul.chaignon,
	primiano, Qais Yousef, Shuah Khan, Song Liu, Srinivas Ramana,
	Steven Rostedt, Tamir Carmeli, Yonghong Song
In-Reply-To: <20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com>

On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> > Hi,
> 
> why are you cc-ing the whole world for this patch set?

Well, the whole world happens to be interested in BPF on Android.

> I'll reply to all as well, but I suspect a bunch of folks consider it spam.
> Please read Documentation/bpf/bpf_devel_QA.rst

Ok, I'll read it.

> Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
> so I'm not sure this set reached public mailing lists.

Certainly the CC list here is not added to folks who consider it spam. All
the folks added have been interested in BPF on Android at various points of
time. Is this CC list really that large? It has around 24 email addresses or
so. I can trim it a bit if needed. Also, you sound like as if people are
screaming at me to stop emailing them, certainly that's not the case and no
one has told me it is spam.

And, it did reach the public archive btw:
https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com/T/#m1460ba463b78312e38b68b8c118f673d2ead9446

> > These patches make it possible to attach BPF programs directly to tracepoints
> > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> > attach to be alive. This has the following benefits:
> > 
> > 1. Simplified Security: In Android, we have finer-grained security controls to
> > specific ftrace trace events using SELinux labels. We control precisely who is
> > allowed to enable an ftrace event already. By adding a node to ftrace for
> > attaching BPF programs, we can use the same mechanism to further control who is
> > allowed to attach to a trace event.
> > 
> > 2. Process lifetime: In Android we are adding usecases where a tracing program
> > needs to be attached all the time to a tracepoint, for the full life time of
> > the system. Such as to gather statistics where there no need for a detach for
> > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> > means keeping a process alive all the time.  However, in Android our BPF loader
> > currently (for hardeneded security) involves just starting a process at boot
> > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> > don't keep this process alive all the time. It is more suitable to do a
> > one-shot attach of the program using ftrace and not need to have a process
> > alive all the time anymore for this. Such process also needs elevated
> > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> > anyway so by design Android's bpfloader runs once at init and exits.
> > 
> > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> > The following commands can be written into it:
> > attach:<fd>     Attaches BPF prog fd to tracepoint
> > detach:<fd>     Detaches BPF prog fd to tracepoint
> 
> Looks like, to detach a program the user needs to read a text file,
> parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
> get a binary FD, convert it back to text and write as a text back into this file.
> I think this is just a single example why text based apis are not accepted
> in bpf anymore.

This can also be considered a tracefs API.

And we can certainly change the detach to accept program ids as well if
that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'.

By the way, I can also list the set of cumbersome steps needed to attach a
BPF program using perf and I bet it will be longer ;-)

> Through the patch set you call it ftrace. As far as I can see, this set
> has zero overlap with ftrace. There is no ftrace-bpf connection here at all
> that we discussed in the past Steven. It's all quite confusing.

It depends on what you mean by ftrace, may be I can call it 'trace events' or
something if it is less ambiguious. All of this has been collectively called
ftrace before.

I am not sure if you you are making sense actually, trace_events mechanism is
a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even
the documentation file name has the word ftrace in it.

I have also spoken to Steven before about this, I don't think he ever told me
there is no connection so again I am a bit lost at your comments.

> I suggest android to solve sticky raw_tracepoint problem with user space deamon.
> The reasons, you point out why user daemon cannot be used, sound weak to me.

I don't think it is weak. It seems overkill to have a daemon for a trace
event that is say supposed to be attached to all the time for the lifetime of
the system. Why should there be a daemon consuming resources if it is active
all the time?

In Android, we are very careful about spawning useless processes and leaving
them alive for the lifetime of the system - for no good reason. Our security
teams also don't like this, and they can comment more.

> Another acceptable solution would be to introduce pinning of raw_tp objects.
> bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
> be natural extension.

I don't think the pinning solves the security problem, it just solves the
process lifetime problem. Currently, attaching trace events through perf
requires CAP_SYS_ADMIN. However, with ftrace events, we already control
security of events by labeling the nodes in tracefs and granting access to
the labeled context through the selinux policies. Having a 'bpf' node in
tracefs for events, and granting access to the labels is a natural extension.

I also thought about the pinning idea before, but we also want to add support
for not just raw tracepoints, but also regular tracepoints (events if you
will). I am hesitant to add a new BPF API just for creating regular
tracepoints and then pinning those as well.

I don't see why a new bpf node for a trace event is a bad idea, really.
tracefs is how we deal with trace events on Android. We do it in production
systems. This is a natural extension to that and fits with the security model
well.

thanks,

 - Joel

^ permalink raw reply

* Re: [PATCH ghak90 V6 02/10] audit: add container id
From: Paul Moore @ 2019-07-16 21:39 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: Tycho Andersen, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, netfilter-devel, sgrubb, omosnace,
	dhowells, simo, Eric Paris, Serge Hallyn, ebiederm, nhorman
In-Reply-To: <20190716193828.xvm67iv5jyypvvxp@madcap2.tricolour.ca>

On Tue, Jul 16, 2019 at 3:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-15 16:38, Paul Moore wrote:
> > On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-05-29 11:29, Paul Moore wrote:
> >
> > ...
> >
> > > > The idea is that only container orchestrators should be able to
> > > > set/modify the audit container ID, and since setting the audit
> > > > container ID can have a significant effect on the records captured
> > > > (and their routing to multiple daemons when we get there) modifying
> > > > the audit container ID is akin to modifying the audit configuration
> > > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).  We did consider allowing for a chain of nested audit
> > > > container IDs, but the implications of doing so are significant
> > > > (implementation mess, runtime cost, etc.) so we are leaving that out
> > > > of this effort.
> > >
> > > We had previously discussed the idea of restricting
> > > orchestrators/engines from only being able to set the audit container
> > > identifier on their own descendants, but it was discarded.  I've added a
> > > check to ensure this is now enforced.
> >
> > When we weren't allowing nested orchestrators it wasn't necessary, but
> > with the move to support nesting I believe this will be a requirement.
> > We might also need/want to restrict audit container ID changes if a
> > descendant is acting as a container orchestrator and managing one or
> > more audit container IDs; although I'm less certain of the need for
> > this.
>
> I was of the opinion it was necessary before with single-layer parallel
> orchestrators/engines.

One of the many things we've disagreed on, but it doesn't really
matter at this point.

> > > I've also added a check to ensure that a process can't set its own audit
> > > container identifier ...
> >
> > What does this protect against, or what problem does this solve?
> > Considering how easy it is to fork/exec, it seems like this could be
> > trivially bypassed.
>
> Well, for starters, it would remove one layer of nesting.  It would
> separate the functional layers of processes.

This doesn't seem like something we need to protect against, what's
the harm?  My opinion at this point is that we should only add
restrictions to protect against problematic or dangerous situations; I
don't believe one extra layer of nesting counts as either.

Perhaps the container folks on the To/CC line can comment on this?  If
there is a valid reason for this restriction, great, let's do it,
otherwise it seems like an unnecessary hard coded policy to me.

> Other than that, it seems
> like a gut feeling that it is just wrong to allow it.  It seems like a
> layer violation that one container orchestrator/engine could set its own
> audit container identifier and then set its children as well.  It would
> be its own parent.

I suspect you are right that the current crop of container engines
won't do this, but who knows what we'll be doing with "containers" 5,
or even 10, years from now.  With that in mind, let me ask the
question again: is allowing an orchestrator the ability to set its own
audit container ID problematic and/or dangerous?

> It would make it harder to verify adherance to descendancy and inheritance rules.

The audit log should contain all the information needed to track that,
right?  If it doesn't, then I think we have a problem with the
information we are logging.  Right?

> > > ... and that if the identifier is already set, then the
> > > orchestrator/engine must be in a descendant user namespace from the
> > > orchestrator that set the previously inherited audit container
> > > identifier.
> >
> > You lost me here ... although I don't like the idea of relying on X
> > namespace inheritance for a hard coded policy on setting the audit
> > container ID; we've worked hard to keep this independent of any
> > definition of a "container" and it would sadden me greatly if we had
> > to go back on that.
>
> This would seem to be the one concession I'm reluctantly making to try
> to solve this nested container orchestrator/engine challenge.

As I said, you lost me on this - how does this help?  A more detailed
explanation of how this helps resolve the nesting problem would be
useful.

> Would backing off on that descendant user namespace requirement and only
> require that a nested audit container identifier only be permitted on a
> descendant task be sufficient?  It may for this use case, but I suspect
> not for additional audit daemons (we're not there yet) and message
> routing to those daemons.
>
> The one difference here is that it does not depend on this if the audit
> container identifier has not already been set.

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Re: [PATCH bpf 1/2] selftests/bpf: fix test_verifier/test_maps make dependencies
From: Andrii Nakryiko @ 2019-07-16 21:40 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Andrii Nakryiko, bpf, Networking, Daniel Borkmann,
	Alexei Starovoitov, Kernel Team, Ilya Leoshkevich,
	Stanislav Fomichev, Martin KaFai Lau
In-Reply-To: <20190716195544.GB14834@mini-arch>

On Tue, Jul 16, 2019 at 12:55 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 07/16, Andrii Nakryiko wrote:
> > e46fc22e60a4 ("selftests/bpf: make directory prerequisites order-only")
> > exposed existing problem in Makefile for test_verifier and test_maps tests:
> > their dependency on auto-generated header file with a list of all tests wasn't
> > recorded explicitly. This patch fixes these issues.
> Why adding it explicitly fixes it? At least for test_verifier, we have
> the following rule:
>
>         test_verifier.c: $(VERIFIER_TESTS_H)
>
> And there should be implicit/builtin test_verifier -> test_verifier.c
> dependency rule.
>
> Same for maps, I guess:
>
>         $(OUTPUT)/test_maps: map_tests/*.c
>         test_maps.c: $(MAP_TESTS_H)
>
> So why is it not working as is? What I'm I missing?

I don't know exactly why it's not working, but it's clearly because of
that. It's the only difference between how test_progs are set up,
which didn't break, and test_maps/test_verifier, which did.

Feel free to figure it out through a maze of Makefiles why it didn't
work as expected, but this definitely fixed a breakage (at least for
me).

^ permalink raw reply

* Re: [Patch net v2 2/2] selftests: add a test case for rp filter
From: Cong Wang @ 2019-07-16 21:42 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: David Ahern
In-Reply-To: <20190716205804.19775-2-xiyou.wangcong@gmail.com>

On Tue, Jul 16, 2019 at 1:58 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Add a test case to simulate the loopback packet case fixed
> in the previous patch.

Well, I actually have to create a dummy1 to simulate it, as dummy0
is the gateway interface assigned to an IP address.

I will send v3.

Thanks.

^ permalink raw reply

* OOM triggered by SCTP
From: Marek Majkowski @ 2019-07-16 21:47 UTC (permalink / raw)
  To: vyasevich, nhorman, marcelo.leitner, linux-sctp; +Cc: netdev, kernel-team

Morning,

My poor man's fuzzer found something interesting in SCTP. It seems
like creating large number of SCTP sockets + some magic dance, upsets
a memory subsystem related to SCTP. The sequence:

 - create SCTP socket
 - call setsockopts (SCTP_EVENTS)
 - call bind(::1, port)
 - call sendmsg(long buffer, MSG_CONFIRM, ::1, port)
 - close SCTP socket
 - repeat couple thousand times

Full code:
https://gist.github.com/majek/bd083dae769804d39134ce01f4f802bb#file-test_sctp-c

I'm running it on virtme the simplest way:
$ virtme-run --show-boot-console --rw --pwd --kimg bzImage --memory
512M --script-sh ./test_sctp

Originally I was running it inside net namespace, and just having a
localhost interface is sufficient to trigger the problem.

Kernel is 5.2.1 (with KASAN and such, but that shouldn't be a factor).
In some tests I saw a message that might indicate something funny
hitting neighbor table:

neighbour: ndisc_cache: neighbor table overflow!

I'm not addr-decoding the stack trace, since it seems unrelated to the
root cause.

Cheers,
    Marek

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox