Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr
From: Daniel Borkmann @ 2019-07-15 22:21 UTC (permalink / raw)
  To: Stanislav Fomichev, netdev, bpf; +Cc: davem, ast, Yonghong Song
In-Reply-To: <20190715163956.204061-1-sdf@google.com>

On 7/15/19 6:39 PM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0c62@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Younhong and the
> maintainers.)
> 
> Cc: Yonghong Song <yhs@fb.com>
> 
> Stanislav Fomichev (5):
>   bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>   bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
>     msg_src_ip6
>   selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>   selftests/bpf: add selftests for wide loads
>   bpf: sync bpf.h to tools/
> 
>  include/linux/filter.h                        |  2 +-
>  include/uapi/linux/bpf.h                      |  4 +-
>  net/core/filter.c                             | 24 ++++--
>  tools/include/uapi/linux/bpf.h                |  4 +-
>  .../selftests/bpf/verifier/wide_access.c      | 73 +++++++++++++++++++
>  .../selftests/bpf/verifier/wide_store.c       | 36 ---------
>  6 files changed, 95 insertions(+), 48 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>  delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf] selftests/bpf: put test_stub.o into $(OUTPUT)
From: Daniel Borkmann @ 2019-07-15 22:20 UTC (permalink / raw)
  To: Ilya Leoshkevich, bpf, netdev; +Cc: gor, heiko.carstens
In-Reply-To: <20190712135950.91600-1-iii@linux.ibm.com>

On 7/12/19 3:59 PM, Ilya Leoshkevich wrote:
> Add a rule to put test_stub.o in $(OUTPUT) and change the references to
> it accordingly. This prevents test_stub.o from being created in the
> source directory.
> 
> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH iproute2 net-next v2 1/6] Kernel header update for hardware offloading changes.
From: Patel, Vedang @ 2019-07-15 22:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Stephen Hemminger, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Gomes, Vinicius, netdev@vger.kernel.org, Dorileo, Leandro,
	Jakub Kicinski, Murali Karicheri
In-Reply-To: <2c13cf19-0b4a-2149-1624-040191cedad9@gmail.com>

Ok I will send out the patches again. 

Thanks,
Vedang

> On Jul 15, 2019, at 1:16 PM, David Ahern <dsahern@gmail.com> wrote:
> 
> On 7/15/19 1:50 PM, Stephen Hemminger wrote:
>> On Mon, 15 Jul 2019 19:40:19 +0000
>> "Patel, Vedang" <vedang.patel@intel.com> wrote:
>> 
>>> Hi Stephen, 
>>> 
>>> The kernel patches corresponding to this series have been merged. I just wanted to check whether these iproute2 related patches are on your TODO list.
>>> 
>>> Let me know if you need any information from me on these patches.
>>> 
>>> Thanks,
>>> Vedang Patel
>> 
>> 
>> David Ahern handles iproute2 next
>> 
>> https://patchwork.ozlabs.org/patch/1111466/
>> 
> 
> given the long time delay between when the iproute2 patches were posted
> and when the kernel side was accepted you will need to re-send the
> iproute2 patches.


^ permalink raw reply

* [PATCH iproute2 net-next v3 1/5] etf: Add skip_sock_check
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel

ETF Qdisc currently checks for a socket with SO_TXTIME socket option. If
either is not present, the packet is dropped. In the future commits, we
want other Qdiscs to add packet with launchtime to the ETF Qdisc. Also,
there are some packets (e.g. ICMP packets) which may not have a socket
associated with them.  So, add an option to skip this check.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_etf.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tc/q_etf.c b/tc/q_etf.c
index 76aca476c61d..c2090589bc64 100644
--- a/tc/q_etf.c
+++ b/tc/q_etf.c
@@ -130,6 +130,13 @@ static int etf_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "skip_sock_check") == 0) {
+			if (opt.flags & TC_ETF_SKIP_SOCK_CHECK) {
+				fprintf(stderr, "etf: duplicate \"skip_sock_check\" specification\n");
+				return -1;
+			}
+
+			opt.flags |= TC_ETF_SKIP_SOCK_CHECK;
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -171,8 +178,10 @@ static int etf_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	print_uint(PRINT_ANY, "delta", "delta %d ", qopt->delta);
 	print_string(PRINT_ANY, "offload", "offload %s ",
 				(qopt->flags & TC_ETF_OFFLOAD_ON) ? "on" : "off");
-	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s",
+	print_string(PRINT_ANY, "deadline_mode", "deadline_mode %s ",
 				(qopt->flags & TC_ETF_DEADLINE_MODE_ON) ? "on" : "off");
+	print_string(PRINT_ANY, "skip_sock_check", "skip_sock_check %s",
+				(qopt->flags & TC_ETF_SKIP_SOCK_CHECK) ? "on" : "off");
 
 	return 0;
 }
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

From: Vinicius Costa Gomes <vinicius.gomes@intel.com>

This allows a new parameter, flags, to be passed to taprio. Currently, it
only supports enabling the txtime-assist mode. But, we plan to add
different modes for taprio (e.g. hardware offloading) and this parameter
will be useful in enabling those modes.

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 62c8c591da99..930ecb9d1eef 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -154,6 +154,7 @@ static struct sched_entry *create_entry(uint32_t gatemask, uint32_t interval, ui
 static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 			    char **argv, struct nlmsghdr *n, const char *dev)
 {
+	__u32 taprio_flags = UINT32_MAX;
 	__s32 clockid = CLOCKID_INVALID;
 	struct tc_mqprio_qopt opt = { };
 	__s64 cycle_time_extension = 0;
@@ -281,6 +282,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				explain_clockid(*argv);
 				return -1;
 			}
+		} else if (strcmp(*argv, "flags") == 0) {
+			NEXT_ARG();
+			if (taprio_flags != UINT32_MAX) {
+				fprintf(stderr, "taprio: duplicate \"flags\" specification\n");
+				return -1;
+			}
+			if (get_u32(&taprio_flags, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -297,6 +309,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (clockid != CLOCKID_INVALID)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_CLOCKID, &clockid, sizeof(clockid));
 
+	if (taprio_flags != UINT32_MAX)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_FLAGS, &taprio_flags, sizeof(taprio_flags));
+
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
@@ -405,6 +420,7 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	struct rtattr *tb[TCA_TAPRIO_ATTR_MAX + 1];
 	struct tc_mqprio_qopt *qopt = 0;
 	__s32 clockid = CLOCKID_INVALID;
+	__u32 taprio_flags = 0;
 	int i;
 
 	if (opt == NULL)
@@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 
 	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
 
+	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
+		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
+		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 3/5] taprio: add support for setting txtime_delay.
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

This adds support for setting the txtime_delay parameter which is useful
for the txtime offload mode of taprio.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 tc/q_taprio.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/tc/q_taprio.c b/tc/q_taprio.c
index 930ecb9d1eef..13bd5c44cac9 100644
--- a/tc/q_taprio.c
+++ b/tc/q_taprio.c
@@ -52,7 +52,7 @@ static void explain(void)
 		"		[num_tc NUMBER] [map P0 P1 ...] "
 		"		[queues COUNT@OFFSET COUNT@OFFSET COUNT@OFFSET ...] "
 		"		[ [sched-entry index cmd gate-mask interval] ... ] "
-		"		[base-time time] "
+		"		[base-time time] [txtime-delay delay]"
 		"\n"
 		"CLOCKID must be a valid SYS-V id (i.e. CLOCK_TAI)\n");
 }
@@ -162,6 +162,7 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	struct rtattr *tail, *l;
 	__s64 cycle_time = 0;
 	__s64 base_time = 0;
+	__s32 txtime_delay = 0;
 	int err, idx;
 
 	INIT_LIST_HEAD(&sched_entries);
@@ -293,6 +294,17 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 				return -1;
 			}
 
+		} else if (strcmp(*argv, "txtime-delay") == 0) {
+			NEXT_ARG();
+			if (txtime_delay != 0) {
+				fprintf(stderr, "taprio: duplicate \"txtime-delay\" specification\n");
+				return -1;
+			}
+			if (get_s32(&txtime_delay, *argv, 0)) {
+				PREV_ARG();
+				return -1;
+			}
+
 		} else if (strcmp(*argv, "help") == 0) {
 			explain();
 			return -1;
@@ -315,6 +327,9 @@ static int taprio_parse_opt(struct qdisc_util *qu, int argc,
 	if (opt.num_tc > 0)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_PRIOMAP, &opt, sizeof(opt));
 
+	if (txtime_delay)
+		addattr_l(n, 1024, TCA_TAPRIO_ATTR_TXTIME_DELAY, &txtime_delay, sizeof(txtime_delay));
+
 	if (base_time)
 		addattr_l(n, 1024, TCA_TAPRIO_ATTR_SCHED_BASE_TIME, &base_time, sizeof(base_time));
 
@@ -421,6 +436,7 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	struct tc_mqprio_qopt *qopt = 0;
 	__s32 clockid = CLOCKID_INVALID;
 	__u32 taprio_flags = 0;
+	__s32 txtime_delay = 0;
 	int i;
 
 	if (opt == NULL)
@@ -463,6 +479,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
 	}
 
+	if (tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]) {
+		txtime_delay = rta_getattr_s32(tb[TCA_TAPRIO_ATTR_TXTIME_DELAY]);
+		print_int(PRINT_ANY, "txtime_delay", " txtime delay %d", txtime_delay);
+	}
+
 	print_schedule(f, tb);
 
 	if (tb[TCA_TAPRIO_ATTR_ADMIN_SCHED]) {
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 4/5] tc: etf: Add documentation for skip-skb-check.
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

Document the newly added option (skip-skb-check) on the etf man-page.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-etf.8 | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/man/man8/tc-etf.8 b/man/man8/tc-etf.8
index 30a12de7d2c7..2e01a591dbaa 100644
--- a/man/man8/tc-etf.8
+++ b/man/man8/tc-etf.8
@@ -106,6 +106,16 @@ referred to as "Launch Time" or "Time-Based Scheduling" by the
 documentation of network interface controllers.
 The default is for this option to be disabled.
 
+.TP
+skip_skb_check
+.br
+.BR etf(8)
+currently drops any packet which does not have a socket associated with it or
+if the socket does not have SO_TXTIME socket option set. But, this will not
+work if the launchtime is set by another entity inside the kernel (e.g. some
+other Qdisc). Setting the skip_skb_check will skip checking for a socket
+associated with the packet.
+
 .SH EXAMPLES
 
 ETF is used to enforce a Quality of Service. It controls when each
-- 
2.7.3


^ permalink raw reply related

* [PATCH iproute2 net-next v3 5/5] tc: taprio: Update documentation
From: Vedang Patel @ 2019-07-15 22:51 UTC (permalink / raw)
  To: netdev
  Cc: jhs, xiyou.wangcong, jiri, stephen, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern,
	Vedang Patel
In-Reply-To: <1563231104-19912-1-git-send-email-vedang.patel@intel.com>

Add documentation for the latest options, flags and txtime-delay, to the
taprio manpage.

This also adds an example to run tc in txtime offload mode.

Signed-off-by: Vedang Patel <vedang.patel@intel.com>
---
 man/man8/tc-taprio.8 | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/man/man8/tc-taprio.8 b/man/man8/tc-taprio.8
index 850be9b03649..e1d19ba19089 100644
--- a/man/man8/tc-taprio.8
+++ b/man/man8/tc-taprio.8
@@ -112,6 +112,26 @@ means that traffic class 0 is "active" for that schedule entry.
 long that state defined by <command> and <gate mask> should be held
 before moving to the next entry.
 
+.TP
+flags
+.br
+Specifies different modes for taprio. Currently, only txtime-assist is
+supported which can be enabled by setting it to 0x1. In this mode, taprio will
+set the transmit timestamp depending on the interval in which the packet needs
+to be transmitted. It will then utililize the
+.BR etf(8)
+qdisc to sort and transmit the packets at the right time. The second example
+can be used as a reference to configure this mode.
+
+.TP
+txtime-delay
+.br
+This parameter is specific to the txtime offload mode. It specifies the maximum
+time a packet might take to reach the network card from the taprio qdisc. The
+value should always be greater than the delta specified in the
+.BR etf(8)
+qdisc.
+
 .SH EXAMPLES
 
 The following example shows how an traffic schedule with three traffic
@@ -137,6 +157,26 @@ reference CLOCK_TAI. The schedule is composed of three entries each of
               clockid CLOCK_TAI
 .EE
 
+Following is an example to enable the txtime offload mode in taprio. See
+.BR etf(8)
+for more information about configuring the ETF qdisc.
+
+.EX
+# tc qdisc replace dev eth0 parent root handle 100 taprio \\
+              num_tc 3 \\
+              map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \\
+              queues 1@0 1@0 1@0 \\
+              base-time 1528743495910289987 \\
+              sched-entry S 01 300000 \\
+              sched-entry S 02 300000 \\
+              sched-entry S 04 400000 \\
+              flags 0x1 \\
+              txtime-delay 200000 \\
+              clockid CLOCK_TAI
+
+# tc qdisc replace dev $IFACE parent 100:1 etf skip_skb_check \\
+              offload delta 200000 clockid CLOCK_TAI
+.EE
 
 .SH AUTHORS
 Vinicius Costa Gomes <vinicius.gomes@intel.com>
-- 
2.7.3


^ permalink raw reply related

* Re: [PATCH V35 23/29] bpf: Restrict bpf when kernel lockdown is in confidentiality mode
From: Daniel Borkmann @ 2019-07-15 22:54 UTC (permalink / raw)
  To: Matthew Garrett, jmorris
  Cc: linux-security-module, linux-kernel, linux-api, David Howells,
	Alexei Starovoitov, Matthew Garrett, netdev, Chun-Yi Lee
In-Reply-To: <20190715195946.223443-24-matthewgarrett@google.com>

On 7/15/19 9:59 PM, Matthew Garrett wrote:
> From: David Howells <dhowells@redhat.com>
> 
> bpf_read() and bpf_read_str() could potentially be abused to (eg) allow
> private keys in kernel memory to be leaked. Disable them if the kernel
> has been locked down in confidentiality mode.
> 
> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Signed-off-by: Matthew Garrett <mjg59@google.com>
> cc: netdev@vger.kernel.org
> cc: Chun-Yi Lee <jlee@suse.com>
> cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/linux/security.h     |  1 +
>  kernel/trace/bpf_trace.c     | 10 ++++++++++
>  security/lockdown/lockdown.c |  1 +
>  3 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 987d8427f091..8dd1741a52cd 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -118,6 +118,7 @@ enum lockdown_reason {
>  	LOCKDOWN_INTEGRITY_MAX,
>  	LOCKDOWN_KCORE,
>  	LOCKDOWN_KPROBES,
> +	LOCKDOWN_BPF_READ,
>  	LOCKDOWN_CONFIDENTIALITY_MAX,
>  };
>  
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index ca1255d14576..605908da61c5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -142,7 +142,12 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const void *, unsafe_ptr)
>  {
>  	int ret;
>  
> +	ret = security_locked_down(LOCKDOWN_BPF_READ);
> +	if (ret)
> +		goto out;
> +
>  	ret = probe_kernel_read(dst, unsafe_ptr, size);
> +out:
>  	if (unlikely(ret < 0))
>  		memset(dst, 0, size);

Hmm, does security_locked_down() ever return a code > 0 or why do you
have the double check on return code? If not, then for clarity the
ret code from security_locked_down() should be checked as 'ret < 0'
as well and out label should be at the memset directly instead.

> @@ -569,6 +574,10 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
>  {
>  	int ret;
>  
> +	ret = security_locked_down(LOCKDOWN_BPF_READ);
> +	if (ret)
> +		goto out;
> +
>  	/*
>  	 * The strncpy_from_unsafe() call will likely not fill the entire
>  	 * buffer, but that's okay in this circumstance as we're probing
> @@ -579,6 +588,7 @@ BPF_CALL_3(bpf_probe_read_str, void *, dst, u32, size,
>  	 * is returned that can be used for bpf_perf_event_output() et al.
>  	 */
>  	ret = strncpy_from_unsafe(dst, unsafe_ptr, size);
> +out:
>  	if (unlikely(ret < 0))
>  		memset(dst, 0, size);

Ditto.

Thanks,
Daniel

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Andrii Nakryiko @ 2019-07-15 22:55 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers,
	Yonghong Song
In-Reply-To: <87k1cj3b69.fsf@netronome.com>

On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
>
>
> Andrii Nakryiko writes:
>
> > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >>
> >>
> >> Andrii Nakryiko writes:
> >>
> >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> >> >>
> >> >> This is an RFC based on latest bpf-next about acclerating insn patching
> >> >> speed, it is now near the shape of final PATCH set, and we could see the
> >> >> changes migrating to list patching would brings, so send out for
> >> >> comments. Most of the info are in cover letter. I splitted the code in a
> >> >> way to show API migration more easily.
> >> >
> >> >
> >> > Hey Jiong,
> >> >
> >> >
> >> > Sorry, took me a while to get to this and learn more about instruction
> >> > patching. Overall this looks good and I think is a good direction.
> >> > I'll post high-level feedback here, and some more
> >> > implementation-specific ones in corresponding patches.
> >>
> >> Great, thanks very much for the feedbacks. Most of your feedbacks are
> >> hitting those pain points I exactly had ran into. For some of them, I
> >> thought similar solutions like yours, but failed due to various
> >> reasons. Let's go through them again, I could have missed some important
> >> things.
> >>
> >> Please see my replies below.
> >
> > Thanks for thoughtful reply :)
> >
> >>
> >> >>
> >> >> Test Results
> >> >> ===
> >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
> >> >>     modes (interpreter, JIT, JIT with blinding).
> >> >>
> >> >>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
> >> >>     patching time from 5100s (nearly one and a half hour) to less than
> >> >>     0.5s for 1M insn patching.
> >> >>
> >> >> Known Issues
> >> >> ===
> >> >>   - The following warning is triggered when running scale test which
> >> >>     contains 1M insns and patching:
> >> >>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
> >> >>
> >> >>     This is caused by existing code, it can be reproduced on bpf-next
> >> >>     master with jit blinding enabled, then run scale unit test, it will
> >> >>     shown up after half an hour. After this set, patching is very fast, so
> >> >>     it shows up quickly.
> >> >>
> >> >>   - No line info adjustment support when doing insn delete, subprog adj
> >> >>     is with bug when doing insn delete as well. Generally, removal of insns
> >> >>     could possibly cause remove of entire line or subprog, therefore
> >> >>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
> >> >>     don't have good idea and clean code for integrating this into the
> >> >>     linearization code at the moment, will do more experimenting,
> >> >>     appreciate ideas and suggestions on this.
> >> >
> >> > Is there any specific problem to detect which line info to delete? Or
> >> > what am I missing besides careful implementation?
> >>
> >> Mostly line info and subprog info are range info which covers a range of
> >> insns. Deleting insns could causing you adjusting the range or removing one
> >> range entirely. subprog info could be fully recalcuated during
> >> linearization while line info I need some careful implementation and I
> >> failed to have clean code for this during linearization also as said no
> >> unit tests to help me understand whether the code is correct or not.
> >>
> >
> > Ok, that's good that it's just about clean implementation. Try to
> > implement it as clearly as possible. Then post it here, and if it can
> > be improved someone (me?) will try to help to clean it up further.
> >
> > Not a big expert on line info, so can't comment on that,
> > unfortunately. Maybe Yonghong can chime in (cc'ed)
> >
> >
> >> I will described this latter, spent too much time writing the following
> >> reply. Might worth an separate discussion thread.
> >>
> >> >>
> >> >>     Insn delete doesn't happen on normal programs, for example Cilium
> >> >>     benchmarks, and happens rarely on test_progs, so the test coverage is
> >> >>     not good. That's also why this RFC have a full pass on selftest with
> >> >>     this known issue.
> >> >
> >> > I hope you'll add test for deletion (and w/ corresponding line info)
> >> > in final patch set :)
> >>
> >> Will try. Need to spend some time on BTF format.
> >> >
> >> >>
> >> >>   - Could further use mem pool to accelerate the speed, changes are trivial
> >> >>     on top of this RFC, and could be 2x extra faster. Not included in this
> >> >>     RFC as reducing the algo complexity from quadratic to linear of insn
> >> >>     number is the first step.
> >> >
> >> > Honestly, I think that would add more complexity than necessary, and I
> >> > think we can further speed up performance without that, see below.
> >> >
> >> >>
> >> >> Background
> >> >> ===
> >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
> >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
> >> >> remove insns.
> >> >>
> >> >> At the moment, insn patching is quadratic of insn number, this is due to
> >> >> branch targets of jump insns needs to be adjusted, and the algo used is:
> >> >>
> >> >>   for insn inside prog
> >> >>     patch insn + regeneate bpf prog
> >> >>     for insn inside new prog
> >> >>       adjust jump target
> >> >>
> >> >> This is causing significant time spending when a bpf prog requires large
> >> >> amount of patching on different insns. Benchmarking shows it could take
> >> >> more than half minutes to finish patching when patching number is more
> >> >> than 50K, and the time spent could be more than one hour when patching
> >> >> number is around 1M.
> >> >>
> >> >>   15000   :    3s
> >> >>   45000   :   29s
> >> >>   95000   :  125s
> >> >>   195000  :  712s
> >> >>   1000000 : 5100s
> >> >>
> >> >> This RFC introduces new patching infrastructure. Before doing insn
> >> >> patching, insns in bpf prog are turned into a singly linked list, insert
> >> >> new insns just insert new list node, delete insns just set delete flag.
> >> >> And finally, the list is linearized back into array, and branch target
> >> >> adjustment is done for all jump insns during linearization. This algo
> >> >> brings the time complexity from quadratic to linear of insn number.
> >> >>
> >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
> >> >> on medium sized prog, and for a 1M patching it reduce the time from 5100s
> >> >> to less than 0.5s.
> >> >>
> >> >> Patching API
> >> >> ===
> >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
> >> >> where only BPF insns are patched. The other is "verification layer" where
> >> >> insns have corresponding aux info as well high level subprog info, so
> >> >> insn patching means aux info needs to be patched as well, and subprog info
> >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
> >> >> should always be updated after insn patching.
> >> >>
> >> >> So, list creation, destroy, insert, delete is the same for both layer,
> >> >> but lineration is different. "verification layer" patching require extra
> >> >> work. Therefore the patch APIs are:
> >> >>
> >> >>    list creation:                bpf_create_list_insn
> >> >>    list patch:                   bpf_patch_list_insn
> >> >>    list pre-patch:               bpf_prepatch_list_insn
> >> >
> >> > I think pre-patch name is very confusing, until I read full
> >> > description I couldn't understand what it's supposed to be used for.
> >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
> >> > me wondering whether instruction buffer is inserted after instruction,
> >> > or instruction is replaced with a bunch of instructions.
> >> >
> >> > So how about two more specific names:
> >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> >> > instruction with a list of patch instructions)
> >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> >> > one is pretty clear).
> >>
> >> My sense on English word is not great, will switch to above which indeed
> >> reads more clear.
> >>
> >> >>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
> >> >>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
> >> >
> >> > These two functions are both quite involved, as well as share a lot of
> >> > common code. I'd rather have one linearize instruction, that takes env
> >> > as an optional parameter. If env is specified (which is the case for
> >> > all cases except for constant blinding pass), then adjust aux_data and
> >> > subprogs along the way.
> >>
> >> Two version of lineration and how to unify them was a painpoint to me. I
> >> thought to factor out some of the common code out, but it actually doesn't
> >> count much, the final size counting + insnsi resize parts are the same,
> >> then things start to diverge since the "Copy over insn" loop.
> >>
> >> verifier layer needs to copy and initialize aux data etc. And jump
> >> relocation is different. At core layer, the use case is JIT blinding which
> >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
> >
> > Sorry, I didn't get what "could expand an jump_imm insn into a
> > and/or/jump_reg sequence", maybe you can clarify if I'm missing
> > something.
> >
> > But from your cover letter description, core layer has no jumps at
> > all, while verifier has jumps inside patch buffer. So, if you support
> > jumps inside of patch buffer, it will automatically work for core
> > layer. Or what am I missing?
>
> I meant in core layer (JIT blinding), there is the following patching:
>
> input:
>   insn 0             insn 0
>   insn 1             insn 1
>   jmp_imm   >>       mov_imm  \
>   insn 2             xor_imm    insn seq expanded from jmp_imm
>   insn 3             jmp_reg  /
>                      insn 2
>                      insn 3
>
>
> jmp_imm is the insn that will be patched, and the actually transformation
> is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
> at the end of the patch buffer, must jump to the same destination as the
> original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
> be relocated, and the jump destination is outside of patch buffer.


Ok, great, thanks for explaining, yeah it's definitely something that
we should be able to support. BUT. It got me thinking a bit more and I
think I have simpler and more elegant solution now, again, supporting
both core-layer and verifier-layer operations.

struct bpf_patchable_insn {
   struct bpf_patchable_insn *next;
   struct bpf_insn insn;
   int orig_idx; /* original non-patched index */
   int new_idx;  /* new index, will be filled only during linearization */
};

struct bpf_patcher {
    /* dummy head node of a chain of patchable instructions */
    struct bpf_patchable_insn insn_head;
    /* dynamic array of size(original instruction count)
     * this is a map from original instruction index to a first
     * patchable instruction that replaced that instruction (or
     * just original instruction as bpf_patchable_insn).
     */
    int *orig_idx_to_patchable_insn;
    int cnt;
};

Few points, but it should be pretty clear just from comments and definitions:
1. When you created bpf_patcher, you create patchabe_insn list, fill
orig_idx_to_patchable_insn map to store proper pointers. This array is
NEVER changed after that.
2. When replacing instruction, you re-use struct bpf_patchable_insn
for first patched instruction, then append after that (not prepend to
next instruction to not disrupt orig_idx -> patchable_insn mapping).
3. During linearizations, you first traverse the chain of instructions
and trivially assing new_idxs.
4. No need for patchabe_insn->target anymore. All jumps use relative
instruction offsets, right? So when you need to determine new
instruction index during linearization, you just do (after you
calculated new instruction indicies):

func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
   int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
   int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
   adjust_jmp_offset(insn->insn, new_jmp_idx) - insn->orig_idx;
}

The idea is that we want to support quick look-up by original
instruction index. That's what orig_idx_to_patchable_insn provides. On
the other hand, no existing instruction is ever referencing newly
patched instruction by its new offset, so with careful implementation,
you can transparently support all the cases, regardless if it's in
core layer or verifier layer (so, e.g., verifier layer patched
instructions now will be able to jump out of patched buffer, if
necessary, neat, right?).

It is cleaner than everything we've discussed so far. Unless I missed
something critical (it's all quite convoluted, so I might have
forgotten some parts already). Let me know what you think.


>
> This means for core layer (jit blinding), it needs to take care of insn
> inside patch buffer.
>
> > Just compared two version of linearize side by side. From what I can
> > see, unified version could look like this, high-level:
> >
> > 1. Count final insn count (but see my other suggestions how to avoid
> > that altogether). If not changed - exit.
> > 2. Realloc insn buffer, copy just instructions (not aux_data yet).
> > Build idx_map, if necessary.
> > 3. (if env) then bpf_patchable_insn has aux_data, so now do another
> > pass and copy it into resulting array.
> > 4. (if env) Copy sub info. Though I'd see if we can just reuse old
> > ones and just adjust offsets. I'm not sure why we need to allocate new
> > array, subprogram count shouldn't change, right?
>
> If there is no dead insn elimination opt, then we could just adjust
> offsets. When there is insn deleting, I feel the logic becomes more
> complex. One subprog could be completely deleted or partially deleted, so
> I feel just recalculate the whole subprog info as a side-product is
> much simpler.

What's the situation where entirety of subprog can be deleted?


>
> > 5. (common) Relocate jumps. Not clear why core layer doesn't care
> > about PATCHED (or, alternatively, why verifier layer cares).
>
> See above, in this RFC, core layer care PATCHED during relocating jumps,
> and verifier layer doesn't.
>
> > And again, with targets pointer it will look totally different (and
> > simpler).
>
> Yes, will see how the code looks.
>
> > 6. (if env) adjust subprogs
> > 7. (common) Adjust prog's line info.
> >
> > The devil is in the details, but I think this will still be better if
> > contained in one function if a bunch of `if (env)` checks. Still
> > pretty linear.
> >
> >> jump_reg is at the end of the patch buffer, it should be relocated. While
> >> all use case in verifier layer, no jump in the prog will be patched and all
> >> new jumps in patch buffer will jump inside the buffer locally so no need to
> >> resolve.
> >>
> >> And yes we could unify them into one and control the diverge using
> >> argument, but then where to place the function is an issue. My
> >> understanding is verifier.c is designed to be on top of core.c and core.c
> >> should not reference and no need to be aware of any verifier specific data
> >> structures, for example env or bpf_aux_insn_data etc.
> >
> > Func prototype where it is. Maybe forward-declare verifier env struct.
> > Implementation in verifier.c?
> >
> >>
> >> So, in this RFC, I had choosed to write separate linerization function for
> >> core and verifier layer. Does this make sense?
> >
> > See above. Let's still try to make it better.
> >
> >>
> >> >
> >> > This would keep logic less duplicated and shouldn't complexity beyond
> >> > few null checks in few places.
> >> >
> >> >>    list destroy:                 bpf_destroy_list_insn
> >> >>
> >> >
> >> > I'd also add a macro foreach_list_insn instead of explicit for loops
> >> > in multiple places. That would also allow to skip deleted instructions
> >> > transparently.
> >> >
> >> >> list patch could change the insn at patch point, it will invalid the aux
> >> >
> >> > typo: invalid -> invalidate
> >>
> >> Ack.
> >>
> >> >
> >> >> info at patching point. list pre-patch insert new insns before patch point
> >> >> where the insn and associated aux info are not touched, it is used for
> >> >> example in convert_ctx_access when generating prologue.
> >> >>
> >> >> Typical API sequence for one patching pass:
> >> >>
> >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    bpf_prog = bpf_linearize_list_insn(list)
> >> >>    bpf_destroy_list_insn(list)
> >> >>
> >> >> Several patching passes could also share the same list:
> >> >>
> >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic1;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    for (elem = list; elem; elem = elem->next)
> >> >>       patch_buf = gen_patch_buf_logic2;
> >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> >> >>    bpf_prog = bpf_linearize_list_insn(list)
> >> >>    bpf_destroy_list_insn(list)
> >> >>
> >> >> but note new inserted insns int early passes won't have aux info except
> >> >> zext info. So, if one patch pass requires all aux info updated and
> >> >> recalculated for all insns including those pathced, it should first
> >> >> linearize the old list, then re-create the list. The RFC always create and
> >> >> linearize the list for each migrated patching pass separately.
> >> >
> >> > I think we should do just one list creation, few passes of patching
> >> > and then linearize once. That will save quite a lot of memory
> >> > allocation and will speed up a lot of things. All the verifier
> >> > patching happens one after the other without any other functionality
> >> > in between, so there shouldn't be any problem.
> >>
> >> Yes, as mentioned above, it is possible and I had tried to do it in an very
> >> initial impl. IIRC convert_ctx_access + fixup_bpf_calls could share the
> >> same list, but then the 32-bit zero extension insertion pass requires
> >> aux.zext_dst set properly for all instructions including those patched
> >
> > So zext_dst. Seems like it's easily calculatable, so doesn't seem like
> > it even needs to be accessed from aux_data.
> >
> > But. I can see at least two ways to do this:
> > 1. those patching passes that care about aux_data, should just do
> > extra check for NULL. Because when we adjust insns now, we just leave
> > zero-initialized aux_data, except for zext_dst and seen. So it's easy
> > to default to them if aux_data is NULL for patchable_insn.
> > 2. just allocate and fill them out them when applying patch insns
> > buffer. It's not a duplication, we already fill them out during
> > patching today. So just do the same, except through malloc()'ed
> > pointer instead. At the end they will be copied into linear resulting
> > array during linearization (uniformly with non-patched insns).
> >
> >> one which we need to linearize the list first (as we set zext_dst during
> >> linerization), or the other choice is we do the zext_dst initialization
> >> during bpf_patch_list_insn, but this then make bpf_patch_list_insn diverge
> >> between core and verifier layer.
> >
> > List construction is much simpler, even if we have to have extra
> > check, similar to `if (env) { do_extra(); }`, IMO, it's fine.
> >
> >>
> >> > As for aux_data. We can solve that even more simply and reliably by
> >> > storing a pointer along the struct bpf_list_insn
> >>
> >> This is exactly what I had implemented initially, but then the issue is how
> >> to handle aux_data for patched insn? IIRC I was leave it as a NULL pointer,
> >> but later found zext_dst info is required for all insns, so I end up
> >> duplicating zext_dst in bpf_list_insn.
> >
> > See above. No duplication. You have a pointer. Whether aux_data is in
> > original array or was malloc()'ed, doesn't matter. But no duplication
> > of fields.
> >
> >>
> >> This leads me worrying we need to keep duplicating fields there as soon as
> >> there is new similar requirements in future patching pass and I thought it
> >> might be better to just reference the aux_data inside env using orig_idx,
> >> this avoids duplicating information, but we need to make sure used fields
> >> inside aux_data for patched insn update-to-date during linearization or
> >> patching list.
> >>
> >> > (btw, how about calling it bpf_patchable_insn?).
> >>
> >> No preference, will use this one.
> >>
> >> > Here's how I propose to represent this patchable instruction:
> >> >
> >> > struct bpf_list_insn {
> >> >        struct bpf_insn insn;
> >> >        struct bpf_list_insn *next;
> >> >        struct bpf_list_insn *target;
> >> >        struct bpf_insn_aux_data *aux_data;
> >> >        s32 orig_idx; // can repurpose this to have three meanings:
> >> >                      // -2 - deleted
> >> >                      // -1 - patched/inserted insn
> >> >                      // >=0 - original idx
> >>
> >> I actually had experimented the -2/-1/0 trick, exactly the same number
> >> assignment :) IIRC the code was not clear compared with using flag, the
> >> reason seems to be:
> >>   1. we still need orig_idx of an patched insn somehow, meaning negate the
> >>      index.
> >
> > Not following, original index with be >=0, no?
> >
> >>   2. somehow somecode need to know whether one insn is deleted or patched
> >>      after the negation, so I end up with some ugly code.
> >
> > So that's why you'll have constants with descriptive name for -2 and -1.
> >
> >>
> >> Anyway, I might had not thought hard enough on this, I will retry using the
> >> special index instead of flag, hopefully I could have clean code this time.
> >>
> >
> > Yeah, please try again. All those `orig_idx = insn->orig_idx - 1; if
> > (orig_idx >= 0) { ... }` are very confusing.
> >
> >> > };
> >> >
> >> > The idea would be as follows:
> >> > 1. when creating original list, target pointer will point directly to
> >> > a patchable instruction wrapper for jumps/calls. This will allow to
> >> > stop tracking and re-calculating jump offsets and instruction indicies
> >> > until linearization.
> >>
> >> Not sure I have followed the idea of "target" pointer. At the moment we are
> >> using index mapping array (generated as by-product during coping insn).
> >>
> >> While the "target" pointer means to during list initialization, each jump
> >> insn will have target initialized to the list node of the converted jump
> >> destination insn, and all those non-jump insns are with NULL? Then during
> >> linearization you assign index to each list node (could be done as
> >> by-product of other pass) before insn coping which could then relocate the
> >> insn during the coping as the "target" would have final index calculated?
> >> Am I following correctly?
> >
> > Yes, I think you are understanding correctly what I'm saying. For
> > implementation, you can do it in few ways, through few passes or with
> > some additional data, is less important. See what's cleanest.
> >
> >>
> >> > 2. aux_data is also filled at that point. Later at linearization time
> >> > you'd just iterate over all the instructions in final order and copy
> >> > original aux_data, if it's present. And then just repace env's
> >> > aux_data array at the end, should be very simple and fast.
> >>
> >> As explained, I am worried making aux_data a pointer will causing
> >> duplicating some fields into list_insn if the fields are required for
> >> patched insns.
> >
> > Addressed above, I don't think there will be any duplication, because
> > we pass aux_data by pointer.
> >
> >>
> >> > 3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
> >> > same list of instructions and those passes will just keep inserting
> >> > instruction buffers. Given we have restriction that all the jumps are
> >> > only within patch buffer, it will be trivial to construct proper
> >> > patchable instruction wrappers for newly added instructions, with NULL
> >> > for aux_data and possibly non-NULL target (if it's a JMP insn).
> >> > 4. After those passes, linearize, adjust subprogs (for this you'll
> >> > probably still need to create index mapping, right?), copy or create
> >> > new aux_data.
> >> > 5. Done.
> >> >
> >> > What do you think? I think this should be overall simpler and faster.
> >> > But let me know if I'm missing something.
> >>
> >> Thanks for all these thoughts, they are very good suggestions and reminds
> >> me to revisit some points I had forgotten. I will do the following things:
> >>
> >>   1. retry the negative index solution to eliminate flag if the result code
> >>      could be clean.
> >>   2. the "target" pointer seems make sense, it makes list_insn bigger but
> >>      normally space trade with time, so I will try to implement it to see
> >>      how the code looks like.
> >>   3. I still have concerns on making aux_data as pointer. Mostly due to
> >>      patched insn will have NULL pointer and in case aux info of patched
> >>      insn is required, we need to duplicate info inside list_insn. For
> >>      example 32-bit zext opt requires zext_dst.
> >>
> >
> >
> > So one more thing I wanted to suggest. I'll try to keep high-level
> > suggestions here.
> >
> > What about having a wrapper for patchable_insn list, where you can
> > store some additional data, like final count and whatever else. It
> > will eliminate some passes (counting) and will make list handling
> > easier (because you can have a dummy head pointer, so no special
> > handling of first element
>
> Will try it.
>
> > you had this concern in patch #1, I
> > believe). But it will be clear if it's beneficial once implemented.
>
> >> Regards,
> >> Jiong
> >>
> >> >>
> >> >> Compared with old patching code, this new infrastructure has much less core
> >> >> code, even though the final code has a couple of extra lines but that is
> >> >> mostly due to for list based infrastructure, we need to do more error
> >> >> checks, so the list and associated aux data structure could be freed when
> >> >> errors happens.
> >> >>
> >> >> Patching Restrictions
> >> >> ===
> >> >>   - For core layer, the linearization assume no new jumps inside patch buf.
> >> >>     Currently, the only user of this layer is jit blinding.
> >> >>   - For verifier layer, there could be new jumps inside patch buf, but
> >> >>     they should have branch target resolved themselves, meaning new jumps
> >> >>     doesn't jump to insns out of the patch buf. This is the case for all
> >> >>     existing verifier layer users.
> >> >>   - bpf_insn_aux_data for all patched insns including the one at patch
> >> >>     point are invalidated, only 32-bit zext info will be recalcuated.
> >> >>     If the aux data of insn at patch point needs to be retained, it is
> >> >>     purely insn insertion, so need to use the pre-patch API.
> >> >>
> >> >> I plan to send out a PATCH set once I finished insn deletion line info adj
> >> >> support, please have a looks at this RFC, and appreciate feedbacks.
> >> >>
> >> >> Jiong Wang (8):
> >> >>   bpf: introducing list based insn patching infra to core layer
> >> >>   bpf: extend list based insn patching infra to verification layer
> >> >>   bpf: migrate jit blinding to list patching infra
> >> >>   bpf: migrate convert_ctx_accesses to list patching infra
> >> >>   bpf: migrate fixup_bpf_calls to list patching infra
> >> >>   bpf: migrate zero extension opt to list patching infra
> >> >>   bpf: migrate insn remove to list patching infra
> >> >>   bpf: delete all those code around old insn patching infrastructure
> >> >>
> >> >>  include/linux/bpf_verifier.h |   1 -
> >> >>  include/linux/filter.h       |  27 +-
> >> >>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
> >> >>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
> >> >>  4 files changed, 580 insertions(+), 528 deletions(-)
> >> >>
> >> >> --
> >> >> 2.7.4
> >> >>
> >>
>

^ permalink raw reply

* Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed
From: Andrii Nakryiko @ 2019-07-15 23:00 UTC (permalink / raw)
  To: Jiong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Edward Cree, Naveen N. Rao,
	Andrii Nakryiko, Jakub Kicinski, bpf, Networking, oss-drivers,
	Yonghong Song
In-Reply-To: <CAEf4BzYDAVUgajz4=dRTu5xQDddp5pi2s=T1BdFmRLZjOwGypQ@mail.gmail.com>

On Mon, Jul 15, 2019 at 3:55 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> >
> >
> > Andrii Nakryiko writes:
> >
> > > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@netronome.com> wrote:
> > >>
> > >>
> > >> Andrii Nakryiko writes:
> > >>
> > >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@netronome.com> wrote:
> > >> >>
> > >> >> This is an RFC based on latest bpf-next about acclerating insn patching
> > >> >> speed, it is now near the shape of final PATCH set, and we could see the
> > >> >> changes migrating to list patching would brings, so send out for
> > >> >> comments. Most of the info are in cover letter. I splitted the code in a
> > >> >> way to show API migration more easily.
> > >> >
> > >> >
> > >> > Hey Jiong,
> > >> >
> > >> >
> > >> > Sorry, took me a while to get to this and learn more about instruction
> > >> > patching. Overall this looks good and I think is a good direction.
> > >> > I'll post high-level feedback here, and some more
> > >> > implementation-specific ones in corresponding patches.
> > >>
> > >> Great, thanks very much for the feedbacks. Most of your feedbacks are
> > >> hitting those pain points I exactly had ran into. For some of them, I
> > >> thought similar solutions like yours, but failed due to various
> > >> reasons. Let's go through them again, I could have missed some important
> > >> things.
> > >>
> > >> Please see my replies below.
> > >
> > > Thanks for thoughtful reply :)
> > >
> > >>
> > >> >>
> > >> >> Test Results
> > >> >> ===
> > >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
> > >> >>     modes (interpreter, JIT, JIT with blinding).
> > >> >>
> > >> >>   - Benchmarking shows 10 ~ 15x faster on medium sized prog, and reduce
> > >> >>     patching time from 5100s (nearly one and a half hour) to less than
> > >> >>     0.5s for 1M insn patching.
> > >> >>
> > >> >> Known Issues
> > >> >> ===
> > >> >>   - The following warning is triggered when running scale test which
> > >> >>     contains 1M insns and patching:
> > >> >>       warning of mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
> > >> >>
> > >> >>     This is caused by existing code, it can be reproduced on bpf-next
> > >> >>     master with jit blinding enabled, then run scale unit test, it will
> > >> >>     shown up after half an hour. After this set, patching is very fast, so
> > >> >>     it shows up quickly.
> > >> >>
> > >> >>   - No line info adjustment support when doing insn delete, subprog adj
> > >> >>     is with bug when doing insn delete as well. Generally, removal of insns
> > >> >>     could possibly cause remove of entire line or subprog, therefore
> > >> >>     entries of prog->aux->linfo or env->subprog needs to be deleted. I
> > >> >>     don't have good idea and clean code for integrating this into the
> > >> >>     linearization code at the moment, will do more experimenting,
> > >> >>     appreciate ideas and suggestions on this.
> > >> >
> > >> > Is there any specific problem to detect which line info to delete? Or
> > >> > what am I missing besides careful implementation?
> > >>
> > >> Mostly line info and subprog info are range info which covers a range of
> > >> insns. Deleting insns could causing you adjusting the range or removing one
> > >> range entirely. subprog info could be fully recalcuated during
> > >> linearization while line info I need some careful implementation and I
> > >> failed to have clean code for this during linearization also as said no
> > >> unit tests to help me understand whether the code is correct or not.
> > >>
> > >
> > > Ok, that's good that it's just about clean implementation. Try to
> > > implement it as clearly as possible. Then post it here, and if it can
> > > be improved someone (me?) will try to help to clean it up further.
> > >
> > > Not a big expert on line info, so can't comment on that,
> > > unfortunately. Maybe Yonghong can chime in (cc'ed)
> > >
> > >
> > >> I will described this latter, spent too much time writing the following
> > >> reply. Might worth an separate discussion thread.
> > >>
> > >> >>
> > >> >>     Insn delete doesn't happen on normal programs, for example Cilium
> > >> >>     benchmarks, and happens rarely on test_progs, so the test coverage is
> > >> >>     not good. That's also why this RFC have a full pass on selftest with
> > >> >>     this known issue.
> > >> >
> > >> > I hope you'll add test for deletion (and w/ corresponding line info)
> > >> > in final patch set :)
> > >>
> > >> Will try. Need to spend some time on BTF format.
> > >> >
> > >> >>
> > >> >>   - Could further use mem pool to accelerate the speed, changes are trivial
> > >> >>     on top of this RFC, and could be 2x extra faster. Not included in this
> > >> >>     RFC as reducing the algo complexity from quadratic to linear of insn
> > >> >>     number is the first step.
> > >> >
> > >> > Honestly, I think that would add more complexity than necessary, and I
> > >> > think we can further speed up performance without that, see below.
> > >> >
> > >> >>
> > >> >> Background
> > >> >> ===
> > >> >> This RFC aims to accelerate BPF insn patching speed, patching means expand
> > >> >> one bpf insn at any offset inside bpf prog into a set of new insns, or
> > >> >> remove insns.
> > >> >>
> > >> >> At the moment, insn patching is quadratic of insn number, this is due to
> > >> >> branch targets of jump insns needs to be adjusted, and the algo used is:
> > >> >>
> > >> >>   for insn inside prog
> > >> >>     patch insn + regeneate bpf prog
> > >> >>     for insn inside new prog
> > >> >>       adjust jump target
> > >> >>
> > >> >> This is causing significant time spending when a bpf prog requires large
> > >> >> amount of patching on different insns. Benchmarking shows it could take
> > >> >> more than half minutes to finish patching when patching number is more
> > >> >> than 50K, and the time spent could be more than one hour when patching
> > >> >> number is around 1M.
> > >> >>
> > >> >>   15000   :    3s
> > >> >>   45000   :   29s
> > >> >>   95000   :  125s
> > >> >>   195000  :  712s
> > >> >>   1000000 : 5100s
> > >> >>
> > >> >> This RFC introduces new patching infrastructure. Before doing insn
> > >> >> patching, insns in bpf prog are turned into a singly linked list, insert
> > >> >> new insns just insert new list node, delete insns just set delete flag.
> > >> >> And finally, the list is linearized back into array, and branch target
> > >> >> adjustment is done for all jump insns during linearization. This algo
> > >> >> brings the time complexity from quadratic to linear of insn number.
> > >> >>
> > >> >> Benchmarking shows the new patching infrastructure could be 10 ~ 15x faster
> > >> >> on medium sized prog, and for a 1M patching it reduce the time from 5100s
> > >> >> to less than 0.5s.
> > >> >>
> > >> >> Patching API
> > >> >> ===
> > >> >> Insn patching could happen on two layers inside BPF. One is "core layer"
> > >> >> where only BPF insns are patched. The other is "verification layer" where
> > >> >> insns have corresponding aux info as well high level subprog info, so
> > >> >> insn patching means aux info needs to be patched as well, and subprog info
> > >> >> needs to be adjusted. BPF prog also has debug info associated, so line info
> > >> >> should always be updated after insn patching.
> > >> >>
> > >> >> So, list creation, destroy, insert, delete is the same for both layer,
> > >> >> but lineration is different. "verification layer" patching require extra
> > >> >> work. Therefore the patch APIs are:
> > >> >>
> > >> >>    list creation:                bpf_create_list_insn
> > >> >>    list patch:                   bpf_patch_list_insn
> > >> >>    list pre-patch:               bpf_prepatch_list_insn
> > >> >
> > >> > I think pre-patch name is very confusing, until I read full
> > >> > description I couldn't understand what it's supposed to be used for.
> > >> > Speaking of bpf_patch_list_insn, patch is also generic enough to leave
> > >> > me wondering whether instruction buffer is inserted after instruction,
> > >> > or instruction is replaced with a bunch of instructions.
> > >> >
> > >> > So how about two more specific names:
> > >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> > >> > instruction with a list of patch instructions)
> > >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> > >> > one is pretty clear).
> > >>
> > >> My sense on English word is not great, will switch to above which indeed
> > >> reads more clear.
> > >>
> > >> >>    list lineration (core layer): prog = bpf_linearize_list_insn(prog, list)
> > >> >>    list lineration (veri layer): env = verifier_linearize_list_insn(env, list)
> > >> >
> > >> > These two functions are both quite involved, as well as share a lot of
> > >> > common code. I'd rather have one linearize instruction, that takes env
> > >> > as an optional parameter. If env is specified (which is the case for
> > >> > all cases except for constant blinding pass), then adjust aux_data and
> > >> > subprogs along the way.
> > >>
> > >> Two version of lineration and how to unify them was a painpoint to me. I
> > >> thought to factor out some of the common code out, but it actually doesn't
> > >> count much, the final size counting + insnsi resize parts are the same,
> > >> then things start to diverge since the "Copy over insn" loop.
> > >>
> > >> verifier layer needs to copy and initialize aux data etc. And jump
> > >> relocation is different. At core layer, the use case is JIT blinding which
> > >> could expand an jump_imm insn into a and/or/jump_reg sequence, and the
> > >
> > > Sorry, I didn't get what "could expand an jump_imm insn into a
> > > and/or/jump_reg sequence", maybe you can clarify if I'm missing
> > > something.
> > >
> > > But from your cover letter description, core layer has no jumps at
> > > all, while verifier has jumps inside patch buffer. So, if you support
> > > jumps inside of patch buffer, it will automatically work for core
> > > layer. Or what am I missing?
> >
> > I meant in core layer (JIT blinding), there is the following patching:
> >
> > input:
> >   insn 0             insn 0
> >   insn 1             insn 1
> >   jmp_imm   >>       mov_imm  \
> >   insn 2             xor_imm    insn seq expanded from jmp_imm
> >   insn 3             jmp_reg  /
> >                      insn 2
> >                      insn 3
> >
> >
> > jmp_imm is the insn that will be patched, and the actually transformation
> > is to expand it into mov_imm/xor_imm/jmp_reg sequence. "jmp_reg", sitting
> > at the end of the patch buffer, must jump to the same destination as the
> > original jmp_imm, so "jmp_reg" is an insn inside patch buffer but should
> > be relocated, and the jump destination is outside of patch buffer.
>
>
> Ok, great, thanks for explaining, yeah it's definitely something that
> we should be able to support. BUT. It got me thinking a bit more and I
> think I have simpler and more elegant solution now, again, supporting
> both core-layer and verifier-layer operations.
>
> struct bpf_patchable_insn {
>    struct bpf_patchable_insn *next;
>    struct bpf_insn insn;
>    int orig_idx; /* original non-patched index */
>    int new_idx;  /* new index, will be filled only during linearization */
> };
>
> struct bpf_patcher {
>     /* dummy head node of a chain of patchable instructions */
>     struct bpf_patchable_insn insn_head;
>     /* dynamic array of size(original instruction count)
>      * this is a map from original instruction index to a first
>      * patchable instruction that replaced that instruction (or
>      * just original instruction as bpf_patchable_insn).
>      */
>     int *orig_idx_to_patchable_insn;
>     int cnt;
> };
>
> Few points, but it should be pretty clear just from comments and definitions:
> 1. When you created bpf_patcher, you create patchabe_insn list, fill
> orig_idx_to_patchable_insn map to store proper pointers. This array is
> NEVER changed after that.
> 2. When replacing instruction, you re-use struct bpf_patchable_insn
> for first patched instruction, then append after that (not prepend to
> next instruction to not disrupt orig_idx -> patchable_insn mapping).
> 3. During linearizations, you first traverse the chain of instructions
> and trivially assing new_idxs.
> 4. No need for patchabe_insn->target anymore. All jumps use relative
> instruction offsets, right? So when you need to determine new
> instruction index during linearization, you just do (after you
> calculated new instruction indicies):
>
> func adjust_jmp(struct bpf_patcher* patcher, struct bpf_patchable_insn *insn) {
>    int old_jmp_idx = insn->orig_idx + jmp_offset_of(insn->insn);
>    int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
>    adjust_jmp_offset(insn->insn, new_jmp_idx) - insn->orig_idx;
> }

Forgot to mention. To handle deleted insns, you can either traverse
till you find first non-deleted instruction after that one, or during
filling of new_idx, just make sure to that new_idx of deleted
instruction is the same as the first non-deleted instruction's
new_idx.

For subprogs (presuming there is no case where we can just eliminate
entire subprog), you can just do simple look up from original index to
a new index. No need to copy/recalculate, etc. It's just orig_idx ->
new_idx mapping.

>
> The idea is that we want to support quick look-up by original
> instruction index. That's what orig_idx_to_patchable_insn provides. On
> the other hand, no existing instruction is ever referencing newly
> patched instruction by its new offset, so with careful implementation,
> you can transparently support all the cases, regardless if it's in
> core layer or verifier layer (so, e.g., verifier layer patched
> instructions now will be able to jump out of patched buffer, if
> necessary, neat, right?).
>
> It is cleaner than everything we've discussed so far. Unless I missed
> something critical (it's all quite convoluted, so I might have
> forgotten some parts already). Let me know what you think.
>
>
> >
> > This means for core layer (jit blinding), it needs to take care of insn
> > inside patch buffer.
> >
> > > Just compared two version of linearize side by side. From what I can
> > > see, unified version could look like this, high-level:
> > >
> > > 1. Count final insn count (but see my other suggestions how to avoid
> > > that altogether). If not changed - exit.
> > > 2. Realloc insn buffer, copy just instructions (not aux_data yet).
> > > Build idx_map, if necessary.
> > > 3. (if env) then bpf_patchable_insn has aux_data, so now do another
> > > pass and copy it into resulting array.
> > > 4. (if env) Copy sub info. Though I'd see if we can just reuse old
> > > ones and just adjust offsets. I'm not sure why we need to allocate new
> > > array, subprogram count shouldn't change, right?
> >
> > If there is no dead insn elimination opt, then we could just adjust
> > offsets. When there is insn deleting, I feel the logic becomes more
> > complex. One subprog could be completely deleted or partially deleted, so
> > I feel just recalculate the whole subprog info as a side-product is
> > much simpler.
>
> What's the situation where entirety of subprog can be deleted?
>
>
> >
> > > 5. (common) Relocate jumps. Not clear why core layer doesn't care
> > > about PATCHED (or, alternatively, why verifier layer cares).
> >
> > See above, in this RFC, core layer care PATCHED during relocating jumps,
> > and verifier layer doesn't.
> >
> > > And again, with targets pointer it will look totally different (and
> > > simpler).
> >
> > Yes, will see how the code looks.
> >
> > > 6. (if env) adjust subprogs
> > > 7. (common) Adjust prog's line info.
> > >
> > > The devil is in the details, but I think this will still be better if
> > > contained in one function if a bunch of `if (env)` checks. Still
> > > pretty linear.
> > >
> > >> jump_reg is at the end of the patch buffer, it should be relocated. While
> > >> all use case in verifier layer, no jump in the prog will be patched and all
> > >> new jumps in patch buffer will jump inside the buffer locally so no need to
> > >> resolve.
> > >>
> > >> And yes we could unify them into one and control the diverge using
> > >> argument, but then where to place the function is an issue. My
> > >> understanding is verifier.c is designed to be on top of core.c and core.c
> > >> should not reference and no need to be aware of any verifier specific data
> > >> structures, for example env or bpf_aux_insn_data etc.
> > >
> > > Func prototype where it is. Maybe forward-declare verifier env struct.
> > > Implementation in verifier.c?
> > >
> > >>
> > >> So, in this RFC, I had choosed to write separate linerization function for
> > >> core and verifier layer. Does this make sense?
> > >
> > > See above. Let's still try to make it better.
> > >
> > >>
> > >> >
> > >> > This would keep logic less duplicated and shouldn't complexity beyond
> > >> > few null checks in few places.
> > >> >
> > >> >>    list destroy:                 bpf_destroy_list_insn
> > >> >>
> > >> >
> > >> > I'd also add a macro foreach_list_insn instead of explicit for loops
> > >> > in multiple places. That would also allow to skip deleted instructions
> > >> > transparently.
> > >> >
> > >> >> list patch could change the insn at patch point, it will invalid the aux
> > >> >
> > >> > typo: invalid -> invalidate
> > >>
> > >> Ack.
> > >>
> > >> >
> > >> >> info at patching point. list pre-patch insert new insns before patch point
> > >> >> where the insn and associated aux info are not touched, it is used for
> > >> >> example in convert_ctx_access when generating prologue.
> > >> >>
> > >> >> Typical API sequence for one patching pass:
> > >> >>
> > >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    bpf_prog = bpf_linearize_list_insn(list)
> > >> >>    bpf_destroy_list_insn(list)
> > >> >>
> > >> >> Several patching passes could also share the same list:
> > >> >>
> > >> >>    struct bpf_list_insn list = bpf_create_list_insn(struct bpf_prog);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic1;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    for (elem = list; elem; elem = elem->next)
> > >> >>       patch_buf = gen_patch_buf_logic2;
> > >> >>       elem = bpf_patch_list_insn(elem, patch_buf, cnt);
> > >> >>    bpf_prog = bpf_linearize_list_insn(list)
> > >> >>    bpf_destroy_list_insn(list)
> > >> >>
> > >> >> but note new inserted insns int early passes won't have aux info except
> > >> >> zext info. So, if one patch pass requires all aux info updated and
> > >> >> recalculated for all insns including those pathced, it should first
> > >> >> linearize the old list, then re-create the list. The RFC always create and
> > >> >> linearize the list for each migrated patching pass separately.
> > >> >
> > >> > I think we should do just one list creation, few passes of patching
> > >> > and then linearize once. That will save quite a lot of memory
> > >> > allocation and will speed up a lot of things. All the verifier
> > >> > patching happens one after the other without any other functionality
> > >> > in between, so there shouldn't be any problem.
> > >>
> > >> Yes, as mentioned above, it is possible and I had tried to do it in an very
> > >> initial impl. IIRC convert_ctx_access + fixup_bpf_calls could share the
> > >> same list, but then the 32-bit zero extension insertion pass requires
> > >> aux.zext_dst set properly for all instructions including those patched
> > >
> > > So zext_dst. Seems like it's easily calculatable, so doesn't seem like
> > > it even needs to be accessed from aux_data.
> > >
> > > But. I can see at least two ways to do this:
> > > 1. those patching passes that care about aux_data, should just do
> > > extra check for NULL. Because when we adjust insns now, we just leave
> > > zero-initialized aux_data, except for zext_dst and seen. So it's easy
> > > to default to them if aux_data is NULL for patchable_insn.
> > > 2. just allocate and fill them out them when applying patch insns
> > > buffer. It's not a duplication, we already fill them out during
> > > patching today. So just do the same, except through malloc()'ed
> > > pointer instead. At the end they will be copied into linear resulting
> > > array during linearization (uniformly with non-patched insns).
> > >
> > >> one which we need to linearize the list first (as we set zext_dst during
> > >> linerization), or the other choice is we do the zext_dst initialization
> > >> during bpf_patch_list_insn, but this then make bpf_patch_list_insn diverge
> > >> between core and verifier layer.
> > >
> > > List construction is much simpler, even if we have to have extra
> > > check, similar to `if (env) { do_extra(); }`, IMO, it's fine.
> > >
> > >>
> > >> > As for aux_data. We can solve that even more simply and reliably by
> > >> > storing a pointer along the struct bpf_list_insn
> > >>
> > >> This is exactly what I had implemented initially, but then the issue is how
> > >> to handle aux_data for patched insn? IIRC I was leave it as a NULL pointer,
> > >> but later found zext_dst info is required for all insns, so I end up
> > >> duplicating zext_dst in bpf_list_insn.
> > >
> > > See above. No duplication. You have a pointer. Whether aux_data is in
> > > original array or was malloc()'ed, doesn't matter. But no duplication
> > > of fields.
> > >
> > >>
> > >> This leads me worrying we need to keep duplicating fields there as soon as
> > >> there is new similar requirements in future patching pass and I thought it
> > >> might be better to just reference the aux_data inside env using orig_idx,
> > >> this avoids duplicating information, but we need to make sure used fields
> > >> inside aux_data for patched insn update-to-date during linearization or
> > >> patching list.
> > >>
> > >> > (btw, how about calling it bpf_patchable_insn?).
> > >>
> > >> No preference, will use this one.
> > >>
> > >> > Here's how I propose to represent this patchable instruction:
> > >> >
> > >> > struct bpf_list_insn {
> > >> >        struct bpf_insn insn;
> > >> >        struct bpf_list_insn *next;
> > >> >        struct bpf_list_insn *target;
> > >> >        struct bpf_insn_aux_data *aux_data;
> > >> >        s32 orig_idx; // can repurpose this to have three meanings:
> > >> >                      // -2 - deleted
> > >> >                      // -1 - patched/inserted insn
> > >> >                      // >=0 - original idx
> > >>
> > >> I actually had experimented the -2/-1/0 trick, exactly the same number
> > >> assignment :) IIRC the code was not clear compared with using flag, the
> > >> reason seems to be:
> > >>   1. we still need orig_idx of an patched insn somehow, meaning negate the
> > >>      index.
> > >
> > > Not following, original index with be >=0, no?
> > >
> > >>   2. somehow somecode need to know whether one insn is deleted or patched
> > >>      after the negation, so I end up with some ugly code.
> > >
> > > So that's why you'll have constants with descriptive name for -2 and -1.
> > >
> > >>
> > >> Anyway, I might had not thought hard enough on this, I will retry using the
> > >> special index instead of flag, hopefully I could have clean code this time.
> > >>
> > >
> > > Yeah, please try again. All those `orig_idx = insn->orig_idx - 1; if
> > > (orig_idx >= 0) { ... }` are very confusing.
> > >
> > >> > };
> > >> >
> > >> > The idea would be as follows:
> > >> > 1. when creating original list, target pointer will point directly to
> > >> > a patchable instruction wrapper for jumps/calls. This will allow to
> > >> > stop tracking and re-calculating jump offsets and instruction indicies
> > >> > until linearization.
> > >>
> > >> Not sure I have followed the idea of "target" pointer. At the moment we are
> > >> using index mapping array (generated as by-product during coping insn).
> > >>
> > >> While the "target" pointer means to during list initialization, each jump
> > >> insn will have target initialized to the list node of the converted jump
> > >> destination insn, and all those non-jump insns are with NULL? Then during
> > >> linearization you assign index to each list node (could be done as
> > >> by-product of other pass) before insn coping which could then relocate the
> > >> insn during the coping as the "target" would have final index calculated?
> > >> Am I following correctly?
> > >
> > > Yes, I think you are understanding correctly what I'm saying. For
> > > implementation, you can do it in few ways, through few passes or with
> > > some additional data, is less important. See what's cleanest.
> > >
> > >>
> > >> > 2. aux_data is also filled at that point. Later at linearization time
> > >> > you'd just iterate over all the instructions in final order and copy
> > >> > original aux_data, if it's present. And then just repace env's
> > >> > aux_data array at the end, should be very simple and fast.
> > >>
> > >> As explained, I am worried making aux_data a pointer will causing
> > >> duplicating some fields into list_insn if the fields are required for
> > >> patched insns.
> > >
> > > Addressed above, I don't think there will be any duplication, because
> > > we pass aux_data by pointer.
> > >
> > >>
> > >> > 3. during fix_bpf_calls, zext, ctx rewrite passes, we'll reuse the
> > >> > same list of instructions and those passes will just keep inserting
> > >> > instruction buffers. Given we have restriction that all the jumps are
> > >> > only within patch buffer, it will be trivial to construct proper
> > >> > patchable instruction wrappers for newly added instructions, with NULL
> > >> > for aux_data and possibly non-NULL target (if it's a JMP insn).
> > >> > 4. After those passes, linearize, adjust subprogs (for this you'll
> > >> > probably still need to create index mapping, right?), copy or create
> > >> > new aux_data.
> > >> > 5. Done.
> > >> >
> > >> > What do you think? I think this should be overall simpler and faster.
> > >> > But let me know if I'm missing something.
> > >>
> > >> Thanks for all these thoughts, they are very good suggestions and reminds
> > >> me to revisit some points I had forgotten. I will do the following things:
> > >>
> > >>   1. retry the negative index solution to eliminate flag if the result code
> > >>      could be clean.
> > >>   2. the "target" pointer seems make sense, it makes list_insn bigger but
> > >>      normally space trade with time, so I will try to implement it to see
> > >>      how the code looks like.
> > >>   3. I still have concerns on making aux_data as pointer. Mostly due to
> > >>      patched insn will have NULL pointer and in case aux info of patched
> > >>      insn is required, we need to duplicate info inside list_insn. For
> > >>      example 32-bit zext opt requires zext_dst.
> > >>
> > >
> > >
> > > So one more thing I wanted to suggest. I'll try to keep high-level
> > > suggestions here.
> > >
> > > What about having a wrapper for patchable_insn list, where you can
> > > store some additional data, like final count and whatever else. It
> > > will eliminate some passes (counting) and will make list handling
> > > easier (because you can have a dummy head pointer, so no special
> > > handling of first element
> >
> > Will try it.
> >
> > > you had this concern in patch #1, I
> > > believe). But it will be clear if it's beneficial once implemented.
> >
> > >> Regards,
> > >> Jiong
> > >>
> > >> >>
> > >> >> Compared with old patching code, this new infrastructure has much less core
> > >> >> code, even though the final code has a couple of extra lines but that is
> > >> >> mostly due to for list based infrastructure, we need to do more error
> > >> >> checks, so the list and associated aux data structure could be freed when
> > >> >> errors happens.
> > >> >>
> > >> >> Patching Restrictions
> > >> >> ===
> > >> >>   - For core layer, the linearization assume no new jumps inside patch buf.
> > >> >>     Currently, the only user of this layer is jit blinding.
> > >> >>   - For verifier layer, there could be new jumps inside patch buf, but
> > >> >>     they should have branch target resolved themselves, meaning new jumps
> > >> >>     doesn't jump to insns out of the patch buf. This is the case for all
> > >> >>     existing verifier layer users.
> > >> >>   - bpf_insn_aux_data for all patched insns including the one at patch
> > >> >>     point are invalidated, only 32-bit zext info will be recalcuated.
> > >> >>     If the aux data of insn at patch point needs to be retained, it is
> > >> >>     purely insn insertion, so need to use the pre-patch API.
> > >> >>
> > >> >> I plan to send out a PATCH set once I finished insn deletion line info adj
> > >> >> support, please have a looks at this RFC, and appreciate feedbacks.
> > >> >>
> > >> >> Jiong Wang (8):
> > >> >>   bpf: introducing list based insn patching infra to core layer
> > >> >>   bpf: extend list based insn patching infra to verification layer
> > >> >>   bpf: migrate jit blinding to list patching infra
> > >> >>   bpf: migrate convert_ctx_accesses to list patching infra
> > >> >>   bpf: migrate fixup_bpf_calls to list patching infra
> > >> >>   bpf: migrate zero extension opt to list patching infra
> > >> >>   bpf: migrate insn remove to list patching infra
> > >> >>   bpf: delete all those code around old insn patching infrastructure
> > >> >>
> > >> >>  include/linux/bpf_verifier.h |   1 -
> > >> >>  include/linux/filter.h       |  27 +-
> > >> >>  kernel/bpf/core.c            | 431 +++++++++++++++++-----------
> > >> >>  kernel/bpf/verifier.c        | 649 +++++++++++++++++++------------------------
> > >> >>  4 files changed, 580 insertions(+), 528 deletions(-)
> > >> >>
> > >> >> --
> > >> >> 2.7.4
> > >> >>
> > >>
> >

^ permalink raw reply

* Re: [PATCH v2 bpf-next] selftests/bpf: remove logic duplication in test_verifier.c
From: Daniel Borkmann @ 2019-07-15 23:03 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, netdev, ast
  Cc: andrii.nakryiko, kernel-team, Krzesimir Nowak
In-Reply-To: <20190712174441.4089282-1-andriin@fb.com>

On 7/12/19 7:44 PM, Andrii Nakryiko wrote:
> test_verifier tests can specify single- and multi-runs tests. Internally
> logic of handling them is duplicated. Get rid of it by making single run
> retval/data specification to be a first run spec.
> 
> Cc: Krzesimir Nowak <krzesimir@kinvolk.io>
> Signed-off-by: Andrii Nakryiko <andriin@fb.com>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Stephen Hemminger @ 2019-07-15 23:37 UTC (permalink / raw)
  To: Vedang Patel
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern
In-Reply-To: <1563231104-19912-2-git-send-email-vedang.patel@intel.com>

On Mon, 15 Jul 2019 15:51:41 -0700
Vedang Patel <vedang.patel@intel.com> wrote:

> @@ -405,6 +420,7 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>  	struct rtattr *tb[TCA_TAPRIO_ATTR_MAX + 1];
>  	struct tc_mqprio_qopt *qopt = 0;
>  	__s32 clockid = CLOCKID_INVALID;
> +	__u32 taprio_flags = 0;
>  	int i;
>  
>  	if (opt == NULL)
> @@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
>  
>  	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
>  
> +	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
> +		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
> +		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
> +	}
> +

Overall this looks fine, but three small comments:
1. It is better not to do unnecessary variable initialization
2. It is better to move variables into the basic block where they are used.
3. Use the print_0xhex() instead of print_uint() for hex values. The difference
   is that in the JSON output, print_uint would be decimal but the print_0xhex
   is always hex.  And use "flags %#x" so that it is clear you are printing flags in hex.



^ permalink raw reply

* Re: [PATCH iproute2 net-next v3 3/5] taprio: add support for setting txtime_delay.
From: Stephen Hemminger @ 2019-07-15 23:38 UTC (permalink / raw)
  To: Vedang Patel
  Cc: netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	leandro.maciel.dorileo, jakub.kicinski, m-karicheri2, dsahern
In-Reply-To: <1563231104-19912-3-git-send-email-vedang.patel@intel.com>

On Mon, 15 Jul 2019 15:51:42 -0700
Vedang Patel <vedang.patel@intel.com> wrote:

> +			if (get_s32(&txtime_delay, *argv, 0)) {

Is txtime_delay of a negative value meaningful?

^ permalink raw reply

* Re: [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Jakub Kicinski @ 2019-07-16  0:15 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Vedang Patel, netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	leandro.maciel.dorileo, m-karicheri2, dsahern
In-Reply-To: <20190715163743.2c6cec2b@hermes.lan>

On Mon, 15 Jul 2019 16:37:43 -0700, Stephen Hemminger wrote:
> On Mon, 15 Jul 2019 15:51:41 -0700
> Vedang Patel <vedang.patel@intel.com> wrote:
> > @@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
> >  
> >  	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
> >  
> > +	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
> > +		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
> > +		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
> > +	}
>[...]
> 3. Use the print_0xhex() instead of print_uint() for hex values. The difference
>    is that in the JSON output, print_uint would be decimal but the print_0xhex
>    is always hex.  And use "flags %#x" so that it is clear you are printing flags in hex.

In my humble personal experience scripting tests using iproute2 and
bpftool with Python I found printing the "hex string" instead of just
outputing the integer value counter productive :( Even tho it looks
better to the eye, JSON is primarily for machine processing and hex
strings have to be manually converted.

^ permalink raw reply

* Re: [pull request][net 0/3] Mellanox, mlx5 fixes 2019-07-15
From: David Miller @ 2019-07-16  0:20 UTC (permalink / raw)
  To: saeedm; +Cc: netdev
In-Reply-To: <20190715200940.31799-1-saeedm@mellanox.com>

From: Saeed Mahameed <saeedm@mellanox.com>
Date: Mon, 15 Jul 2019 20:09:53 +0000

> This pull request provides mlx5 TC flower and tunnel fixes for
> kernel 5.2 from Eli and Vlad.
> 
> Please pull and let me know if there is any problem.

Pulled, thanks Saeed.

^ permalink raw reply

* Re: [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Stephen Hemminger @ 2019-07-16  0:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Vedang Patel, netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	leandro.maciel.dorileo, m-karicheri2, dsahern
In-Reply-To: <20190715171515.248460a6@cakuba.netronome.com>

On Mon, 15 Jul 2019 17:15:15 -0700
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> On Mon, 15 Jul 2019 16:37:43 -0700, Stephen Hemminger wrote:
> > On Mon, 15 Jul 2019 15:51:41 -0700
> > Vedang Patel <vedang.patel@intel.com> wrote:  
> > > @@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
> > >  
> > >  	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
> > >  
> > > +	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
> > > +		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
> > > +		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
> > > +	}  
> >[...]
> > 3. Use the print_0xhex() instead of print_uint() for hex values. The difference
> >    is that in the JSON output, print_uint would be decimal but the print_0xhex
> >    is always hex.  And use "flags %#x" so that it is clear you are printing flags in hex.  
> 
> In my humble personal experience scripting tests using iproute2 and
> bpftool with Python I found printing the "hex string" instead of just
> outputing the integer value counter productive :( Even tho it looks
> better to the eye, JSON is primarily for machine processing and hex
> strings have to be manually converted.

If it is hex on normal output, it should be hex on JSON output.
And what ever the normal output format is has to be accepted on command line as input.

^ permalink raw reply

* [bpf-next RFC 0/6] Introduce a BPF helper to generate SYN cookies
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov

From: Petar Penkov <ppenkov@google.com>

This patch series introduces a BPF helper function that allows generating SYN
cookies from BPF.

The first two patches in the series modify several TCP helper functions to
allow for SKB-less operation, as is the case with XDP.

The third patch introduces the bpf_tcp_gen_syncookie helper function which
generates a SYN cookie for either XDP or TC programs. 

The last three patches sync tools/ and add a test. 

The primary design consideration I see in the patch series is the return value
of the helper function. Currently bpf_tcp_gen_syncookie returns a 64-bit value
that contains both the 32-bit syncookie, and the 16-bit mss value which is
encoded in the cookie. On error, it would return a negative value instead. I
chose this over writing the cookie into the provided TCP packet to avoid writing
packet data as currently if a helper changes the packet data, the first argument
has to point to the context (can this be relaxed?). 

To make the API cleaner we can instead return something like the struct below
though the return type would then not really be RET_INTEGER or any of the
currently existing return types.
struct bpf_syncookie {
	u16 error; // or u8 error, u8 unused for future use
	u16 mss;
	u32 syncookie;
}

Petar Penkov (6):
  tcp: tcp_syn_flood_action read port from socket
  tcp: add skb-less helpers to retrieve SYN cookie
  bpf: add bpf_tcp_gen_syncookie helper
  bpf: sync bpf.h to tools/
  selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers
  selftests/bpf: add test for bpf_tcp_gen_syncookie

 include/net/tcp.h                             | 11 +++
 include/uapi/linux/bpf.h                      | 30 ++++++-
 net/core/filter.c                             | 62 +++++++++++++
 net/ipv4/tcp_input.c                          | 87 +++++++++++++++++--
 net/ipv4/tcp_ipv4.c                           |  8 ++
 net/ipv6/tcp_ipv6.c                           |  8 ++
 tools/include/uapi/linux/bpf.h                | 37 +++++++-
 tools/testing/selftests/bpf/bpf_helpers.h     |  3 +
 .../bpf/progs/test_tcp_check_syncookie_kern.c | 28 ++++--
 .../bpf/test_tcp_check_syncookie_user.c       | 61 +++++++++++--
 10 files changed, 313 insertions(+), 22 deletions(-)

-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply

* [bpf-next RFC 1/6] tcp: tcp_syn_flood_action read port from socket
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This allows us to call this function before an SKB has been
allocated.

Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 net/ipv4/tcp_input.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c21e8a22fb3b..8892df6de1d4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6422,9 +6422,7 @@ EXPORT_SYMBOL(inet_reqsk_alloc);
 /*
  * Return true if a syncookie should be sent
  */
-static bool tcp_syn_flood_action(const struct sock *sk,
-				 const struct sk_buff *skb,
-				 const char *proto)
+static bool tcp_syn_flood_action(const struct sock *sk, const char *proto)
 {
 	struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
 	const char *msg = "Dropping request";
@@ -6444,7 +6442,7 @@ static bool tcp_syn_flood_action(const struct sock *sk,
 	    net->ipv4.sysctl_tcp_syncookies != 2 &&
 	    xchg(&queue->synflood_warned, 1) == 0)
 		net_info_ratelimited("%s: Possible SYN flooding on port %d. %s.  Check SNMP counters.\n",
-				     proto, ntohs(tcp_hdr(skb)->dest), msg);
+				     proto, sk->sk_num, msg);
 
 	return want_cookie;
 }
@@ -6487,7 +6485,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	 */
 	if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
 	     inet_csk_reqsk_queue_is_full(sk)) && !isn) {
-		want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
+		want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
 		if (!want_cookie)
 			goto drop;
 	}
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* [bpf-next RFC 2/6] tcp: add skb-less helpers to retrieve SYN cookie
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This patch allows generation of a SYN cookie before an SKB has been
allocated, as is the case at XDP.

Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 include/net/tcp.h    | 11 ++++++
 net/ipv4/tcp_input.c | 79 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_ipv4.c  |  8 +++++
 net/ipv6/tcp_ipv6.c  |  8 +++++
 4 files changed, 106 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cca3c59b98bf..a128e22c0d5d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -414,6 +414,17 @@ void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
 		       int estab, struct tcp_fastopen_cookie *foc);
 const u8 *tcp_parse_md5sig_option(const struct tcphdr *th);
 
+/*
+ *	BPF SKB-less helpers
+ */
+u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
+			 struct tcphdr *tch, u32 *cookie);
+u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph,
+			 struct tcphdr *tch, u32 *cookie);
+u16 tcp_get_syncookie(struct request_sock_ops *rsk_ops,
+		      const struct tcp_request_sock_ops *af_ops,
+		      struct sock *sk, void *iph, struct tcphdr *tch,
+		      u32 *cookie);
 /*
  *	TCP v4 functions exported for the inet6 API
  */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8892df6de1d4..1406d7e0953c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3782,6 +3782,52 @@ static void smc_parse_options(const struct tcphdr *th,
 #endif
 }
 
+/* Try to parse the MSS option from the TCP header. Return 0 on failure, clamped
+ * value on success.
+ *
+ * Invoked for BPF SYN cookie generation, so th should be a SYN.
+ */
+static u16 tcp_parse_mss_option(const struct net *net, const struct tcphdr *th,
+				u16 user_mss)
+{
+	const unsigned char *ptr = (const unsigned char *)(th + 1);
+	int length = (th->doff * 4) - sizeof(struct tcphdr);
+	u16 mss = 0;
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		switch (opcode) {
+		case TCPOPT_EOL:
+			return mss;
+		case TCPOPT_NOP:	/* Ref: RFC 793 section 3.1 */
+			length--;
+			continue;
+		default:
+			if (length < 2)
+				return mss;
+			opsize = *ptr++;
+			if (opsize < 2) /* "silly options" */
+				return mss;
+			if (opsize > length)
+				return mss;	/* fail on partial options */
+			if (opcode == TCPOPT_MSS && opsize == TCPOLEN_MSS) {
+				u16 in_mss = get_unaligned_be16(ptr);
+
+				if (in_mss) {
+					if (user_mss && user_mss < in_mss)
+						in_mss = user_mss;
+					mss = in_mss;
+				}
+			}
+			ptr += opsize - 2;
+			length -= opsize;
+		}
+	}
+	return mss;
+}
+
 /* Look for tcp options. Normally only called on SYN and SYNACK packets.
  * But, this can also be called on packets in the established flow when
  * the fast version below fails.
@@ -6464,6 +6510,39 @@ static void tcp_reqsk_record_syn(const struct sock *sk,
 	}
 }
 
+u16 tcp_get_syncookie(struct request_sock_ops *rsk_ops,
+		      const struct tcp_request_sock_ops *af_ops,
+		      struct sock *sk, void *iph, struct tcphdr *th,
+		      u32 *cookie)
+{
+	u16 mss = 0;
+#ifdef CONFIG_SYN_COOKIES
+	bool is_v4 = rsk_ops->family == AF_INET;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (sock_net(sk)->ipv4.sysctl_tcp_syncookies != 2 &&
+	    !inet_csk_reqsk_queue_is_full(sk))
+		return 0;
+
+	if (!tcp_syn_flood_action(sk, rsk_ops->slab_name))
+		return 0;
+
+	if (sk_acceptq_is_full(sk)) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
+		return 0;
+	}
+
+	mss = tcp_parse_mss_option(sock_net(sk), th, tp->rx_opt.user_mss);
+	if (!mss)
+		mss = af_ops->mss_clamp;
+
+	tcp_synq_overflow(sk);
+	*cookie = is_v4 ? __cookie_v4_init_sequence(iph, th, &mss)
+			: __cookie_v6_init_sequence(iph, th, &mss);
+#endif
+	return mss;
+}
+
 int tcp_conn_request(struct request_sock_ops *rsk_ops,
 		     const struct tcp_request_sock_ops *af_ops,
 		     struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..0e06e59784bd 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1515,6 +1515,14 @@ static struct sock *tcp_v4_cookie_check(struct sock *sk, struct sk_buff *skb)
 	return sk;
 }
 
+u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
+			 struct tcphdr *tch, u32 *cookie)
+{
+	return tcp_get_syncookie(&tcp_request_sock_ops,
+				 &tcp_request_sock_ipv4_ops, sk, iph, tch,
+				 cookie);
+}
+
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index d56a9019a0fe..ce46cdba54bc 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1058,6 +1058,14 @@ static struct sock *tcp_v6_cookie_check(struct sock *sk, struct sk_buff *skb)
 	return sk;
 }
 
+u16 tcp_v6_get_syncookie(struct sock *sk, struct ipv6hdr *iph,
+			 struct tcphdr *tch, u32 *cookie)
+{
+	return tcp_get_syncookie(&tcp6_request_sock_ops,
+				 &tcp_request_sock_ipv6_ops, sk, iph, tch,
+				 cookie);
+}
+
 static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 {
 	if (skb->protocol == htons(ETH_P_IP))
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* [bpf-next RFC 3/6] bpf: add bpf_tcp_gen_syncookie helper
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

This helper function allows BPF programs to try to generate SYN
cookies, given a reference to a listener socket. The function works
from XDP and with an skb context since bpf_skc_lookup_tcp can lookup a
socket in both cases.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/bpf.h | 30 ++++++++++++++++++-
 net/core/filter.c        | 62 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6f68438aa4ed..abf4a85c76d1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2713,6 +2713,33 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, on the listening socket in *sk*.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ip6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains **sizeof**\ (**struct tcphdr**).
+ *
+ *	Return
+ *		On success, lower 32 bits hold the generated SYN cookie in
+ *		network order and the higher 32 bits hold the MSS value for that
+ *		cookie.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** SYN cookie cannot be issued due to error
+ *
+ *		**-ENOENT** SYN cookie should not be issued (no SYN flood)
+ *
+ *		**-ENOTSUPP** kernel configuration does not enable SYN cookies
+ *
+ *		**-EPROTONOSUPPORT** *sk* family is not AF_INET/AF_INET6
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2824,7 +2851,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(tcp_gen_syncookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 47f6386fb17a..109fd1e286f4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5850,6 +5850,64 @@ static const struct bpf_func_proto bpf_tcp_check_syncookie_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_tcp_gen_syncookie, struct sock *, sk, void *, iph, u32, iph_len,
+	   struct tcphdr *, th, u32, th_len)
+{
+#ifdef CONFIG_SYN_COOKIES
+	u32 cookie;
+	u16 mss;
+
+	if (unlikely(th_len < sizeof(*th)))
+		return -EINVAL;
+
+	if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state != TCP_LISTEN)
+		return -EINVAL;
+
+	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies)
+		return -EINVAL;
+
+	if (!th->syn || th->ack || th->fin || th->rst)
+		return -EINVAL;
+
+	switch (sk->sk_family) {
+	case AF_INET:
+		if (unlikely(iph_len < sizeof(struct iphdr)))
+			return -EINVAL;
+		mss = tcp_v4_get_syncookie(sk, iph, th, &cookie);
+		break;
+
+#if IS_BUILTIN(CONFIG_IPV6)
+	case AF_INET6:
+		if (unlikely(iph_len < sizeof(struct ipv6hdr)))
+			return -EINVAL;
+		mss = tcp_v6_get_syncookie(sk, iph, th, &cookie);
+		break;
+#endif /* CONFIG_IPV6 */
+
+	default:
+		return -EPROTONOSUPPORT;
+	}
+	if (mss <= 0)
+		return -ENOENT;
+
+	return htonl(cookie) | ((u64)mss << 32);
+#else
+	return -ENOTSUPP;
+#endif /* CONFIG_SYN_COOKIES */
+}
+
+static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = {
+	.func		= bpf_tcp_gen_syncookie,
+	.gpl_only	= true, /* __cookie_v*_init_sequence() is GPL */
+	.pkt_access	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -6135,6 +6193,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_check_syncookie_proto;
 	case BPF_FUNC_skb_ecn_set_ce:
 		return &bpf_skb_ecn_set_ce_proto;
+	case BPF_FUNC_tcp_gen_syncookie:
+		return &bpf_tcp_gen_syncookie_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
@@ -6174,6 +6234,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_skc_lookup_tcp_proto;
 	case BPF_FUNC_tcp_check_syncookie:
 		return &bpf_tcp_check_syncookie_proto;
+	case BPF_FUNC_tcp_gen_syncookie:
+		return &bpf_tcp_gen_syncookie_proto;
 #endif
 	default:
 		return bpf_base_func_proto(func_id);
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* [bpf-next RFC 5/6] selftests/bpf: bpf_tcp_gen_syncookie->bpf_helpers
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Expose bpf_tcp_gen_syncookie to selftests.

Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 tools/testing/selftests/bpf/bpf_helpers.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 5a3d92c8bec8..19f01e967402 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -228,6 +228,9 @@ static void *(*bpf_sk_storage_get)(void *map, struct bpf_sock *sk,
 static int (*bpf_sk_storage_delete)(void *map, struct bpf_sock *sk) =
 	(void *)BPF_FUNC_sk_storage_delete;
 static int (*bpf_send_signal)(unsigned sig) = (void *)BPF_FUNC_send_signal;
+static long long (*bpf_tcp_gen_syncookie)(struct bpf_sock *sk, void *ip,
+					  int ip_len, void *tcp, int tcp_len) =
+	(void *) BPF_FUNC_tcp_gen_syncookie;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* [bpf-next RFC 4/6] bpf: sync bpf.h to tools/
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Sync updated documentation for bpf_redirect_map.

Sync the bpf_tcp_gen_syncookie helper function definition with the one
in tools/uapi.

Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 tools/include/uapi/linux/bpf.h | 37 +++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f506c68b2612..abf4a85c76d1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1571,8 +1571,11 @@ union bpf_attr {
  * 		but this is only implemented for native XDP (with driver
  * 		support) as of this writing).
  *
- * 		All values for *flags* are reserved for future usage, and must
- * 		be left at zero.
+ * 		The lower two bits of *flags* are used as the return code if
+ * 		the map lookup fails. This is so that the return value can be
+ * 		one of the XDP program return codes up to XDP_TX, as chosen by
+ * 		the caller. Any higher bits in the *flags* argument must be
+ * 		unset.
  *
  * 		When used to redirect packets to net devices, this helper
  * 		provides a high performance increase over **bpf_redirect**\ ().
@@ -2710,6 +2713,33 @@ union bpf_attr {
  *		**-EPERM** if no permission to send the *sig*.
  *
  *		**-EAGAIN** if bpf program can try again.
+ *
+ * s64 bpf_tcp_gen_syncookie(struct bpf_sock *sk, void *iph, u32 iph_len, struct tcphdr *th, u32 th_len)
+ *	Description
+ *		Try to issue a SYN cookie for the packet with corresponding
+ *		IP/TCP headers, *iph* and *th*, on the listening socket in *sk*.
+ *
+ *		*iph* points to the start of the IPv4 or IPv6 header, while
+ *		*iph_len* contains **sizeof**\ (**struct iphdr**) or
+ *		**sizeof**\ (**struct ip6hdr**).
+ *
+ *		*th* points to the start of the TCP header, while *th_len*
+ *		contains **sizeof**\ (**struct tcphdr**).
+ *
+ *	Return
+ *		On success, lower 32 bits hold the generated SYN cookie in
+ *		network order and the higher 32 bits hold the MSS value for that
+ *		cookie.
+ *
+ *		On failure, the returned value is one of the following:
+ *
+ *		**-EINVAL** SYN cookie cannot be issued due to error
+ *
+ *		**-ENOENT** SYN cookie should not be issued (no SYN flood)
+ *
+ *		**-ENOTSUPP** kernel configuration does not enable SYN cookies
+ *
+ *		**-EPROTONOSUPPORT** *sk* family is not AF_INET/AF_INET6
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2821,7 +2851,8 @@ union bpf_attr {
 	FN(strtoul),			\
 	FN(sk_storage_get),		\
 	FN(sk_storage_delete),		\
-	FN(send_signal),
+	FN(send_signal),		\
+	FN(tcp_gen_syncookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* [bpf-next RFC 6/6] selftests/bpf: add test for bpf_tcp_gen_syncookie
From: Petar Penkov @ 2019-07-16  0:26 UTC (permalink / raw)
  To: netdev, bpf; +Cc: davem, ast, daniel, edumazet, lmb, sdf, Petar Penkov
In-Reply-To: <20190716002650.154729-1-ppenkov.kernel@gmail.com>

From: Petar Penkov <ppenkov@google.com>

Modify the existing bpf_tcp_check_syncookie test to also generate a
SYN cookie, pass the packet to the kernel, and verify that the two
cookies are the same (and both valid). Since cloned SKBs are skipped
during generic XDP, this test does not issue a SYN cookie when run in
XDP mode. We therefore only check that a valid SYN cookie was issued at
the TC hook.

Additionally, verify that the MSS for that SYN cookie is within
expected range.

Signed-off-by: Petar Penkov <ppenkov@google.com>
---
 .../bpf/progs/test_tcp_check_syncookie_kern.c | 28 +++++++--
 .../bpf/test_tcp_check_syncookie_user.c       | 61 ++++++++++++++++---
 2 files changed, 76 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c b/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
index 1ab095bcacd8..229832766f42 100644
--- a/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcp_check_syncookie_kern.c
@@ -19,8 +19,8 @@
 struct bpf_map_def SEC("maps") results = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(__u32),
-	.value_size = sizeof(__u64),
-	.max_entries = 1,
+	.value_size = sizeof(__u32),
+	.max_entries = 3,
 };
 
 static __always_inline void check_syncookie(void *ctx, void *data,
@@ -33,8 +33,10 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 	struct ipv6hdr *ipv6h;
 	struct tcphdr *tcph;
 	int ret;
+	__u32 key_mss = 2;
+	__u32 key_gen = 1;
 	__u32 key = 0;
-	__u64 value = 1;
+	__s64 seq_mss;
 
 	ethh = data;
 	if (ethh + 1 > data_end)
@@ -66,6 +68,8 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		if (sk->state != BPF_TCP_LISTEN)
 			goto release;
 
+		seq_mss = bpf_tcp_gen_syncookie(sk, ipv4h, sizeof(*ipv4h),
+						tcph, sizeof(*tcph));
 		ret = bpf_tcp_check_syncookie(sk, ipv4h, sizeof(*ipv4h),
 					      tcph, sizeof(*tcph));
 		break;
@@ -95,6 +99,9 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		if (sk->state != BPF_TCP_LISTEN)
 			goto release;
 
+		seq_mss = bpf_tcp_gen_syncookie(sk, ipv6h, sizeof(*ipv6h),
+						tcph, sizeof(*tcph));
+
 		ret = bpf_tcp_check_syncookie(sk, ipv6h, sizeof(*ipv6h),
 					      tcph, sizeof(*tcph));
 		break;
@@ -103,8 +110,19 @@ static __always_inline void check_syncookie(void *ctx, void *data,
 		return;
 	}
 
-	if (ret == 0)
-		bpf_map_update_elem(&results, &key, &value, 0);
+	if (seq_mss > 0) {
+		__u32 cookie = bpf_ntohl((__u32)seq_mss);
+		__u32 mss = seq_mss >> 32;
+
+		bpf_map_update_elem(&results, &key_gen, &cookie, 0);
+		bpf_map_update_elem(&results, &key_mss, &mss, 0);
+	}
+
+	if (ret == 0) {
+		__u32 cookie = bpf_ntohl(tcph->ack_seq) - 1;
+
+		bpf_map_update_elem(&results, &key, &cookie, 0);
+	}
 
 release:
 	bpf_sk_release(sk);
diff --git a/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c b/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
index 87829c86c746..f3ff49ceb481 100644
--- a/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
+++ b/tools/testing/selftests/bpf/test_tcp_check_syncookie_user.c
@@ -2,6 +2,7 @@
 // Copyright (c) 2018 Facebook
 // Copyright (c) 2019 Cloudflare
 
+#include <limits.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -77,7 +78,7 @@ static int connect_to_server(int server_fd)
 	return fd;
 }
 
-static int get_map_fd_by_prog_id(int prog_id)
+static int get_map_fd_by_prog_id(int prog_id, bool *xdp)
 {
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
@@ -104,6 +105,8 @@ static int get_map_fd_by_prog_id(int prog_id)
 		goto err;
 	}
 
+	*xdp = info.type == BPF_PROG_TYPE_XDP;
+
 	map_fd = bpf_map_get_fd_by_id(map_ids[0]);
 	if (map_fd < 0)
 		log_err("Failed to get fd by map id %d", map_ids[0]);
@@ -113,18 +116,32 @@ static int get_map_fd_by_prog_id(int prog_id)
 	return map_fd;
 }
 
-static int run_test(int server_fd, int results_fd)
+static int run_test(int server_fd, int results_fd, bool xdp)
 {
 	int client = -1, srv_client = -1;
 	int ret = 0;
 	__u32 key = 0;
-	__u64 value = 0;
+	__u32 key_gen = 1;
+	__u32 key_mss = 2;
+	__u32 value = 0;
+	__u32 value_gen = 0;
+	__u32 value_mss = 0;
 
 	if (bpf_map_update_elem(results_fd, &key, &value, 0) < 0) {
 		log_err("Can't clear results");
 		goto err;
 	}
 
+	if (bpf_map_update_elem(results_fd, &key_gen, &value_gen, 0) < 0) {
+		log_err("Can't clear results");
+		goto err;
+	}
+
+	if (bpf_map_update_elem(results_fd, &key_mss, &value_mss, 0) < 0) {
+		log_err("Can't clear results");
+		goto err;
+	}
+
 	client = connect_to_server(server_fd);
 	if (client == -1)
 		goto err;
@@ -140,8 +157,35 @@ static int run_test(int server_fd, int results_fd)
 		goto err;
 	}
 
-	if (value != 1) {
-		log_err("Didn't match syncookie: %llu", value);
+	if (value == 0) {
+		log_err("Didn't match syncookie: %u", value);
+		goto err;
+	}
+
+	if (bpf_map_lookup_elem(results_fd, &key_gen, &value_gen) < 0) {
+		log_err("Can't lookup result");
+		goto err;
+	}
+
+	if (xdp && value_gen == 0) {
+		// SYN packets do not get passed through generic XDP, skip the
+		// rest of the test.
+		log_err("Did not find SYN cookie at XDP.");
+		goto out;
+	}
+
+	if (bpf_map_lookup_elem(results_fd, &key_mss, &value_mss) < 0) {
+		log_err("Can't lookup result");
+		goto err;
+	}
+
+	if (value != value_gen) {
+		log_err("BPF generated cookie does not match kernel one");
+		goto err;
+	}
+
+	if (value_mss < 536 || value_mss > USHRT_MAX) {
+		log_err("Unexpected MSS retrieved");
 		goto err;
 	}
 
@@ -163,13 +207,14 @@ int main(int argc, char **argv)
 	int server_v6 = -1;
 	int results = -1;
 	int err = 0;
+	bool xdp;
 
 	if (argc < 2) {
 		fprintf(stderr, "Usage: %s prog_id\n", argv[0]);
 		exit(1);
 	}
 
-	results = get_map_fd_by_prog_id(atoi(argv[1]));
+	results = get_map_fd_by_prog_id(atoi(argv[1]), &xdp);
 	if (results < 0) {
 		log_err("Can't get map");
 		goto err;
@@ -194,10 +239,10 @@ int main(int argc, char **argv)
 	if (server_v6 == -1)
 		goto err;
 
-	if (run_test(server, results))
+	if (run_test(server, results, xdp))
 		goto err;
 
-	if (run_test(server_v6, results))
+	if (run_test(server_v6, results, xdp))
 		goto err;
 
 	printf("ok\n");
-- 
2.22.0.510.g264f2c817a-goog


^ permalink raw reply related

* Re: [PATCH iproute2 net-next v3 2/5] taprio: Add support for setting flags
From: Jakub Kicinski @ 2019-07-16  0:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Vedang Patel, netdev, jhs, xiyou.wangcong, jiri, vinicius.gomes,
	leandro.maciel.dorileo, m-karicheri2, dsahern
In-Reply-To: <20190715172422.4e127da2@hermes.lan>

On Mon, 15 Jul 2019 17:24:22 -0700, Stephen Hemminger wrote:
> On Mon, 15 Jul 2019 17:15:15 -0700
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> > On Mon, 15 Jul 2019 16:37:43 -0700, Stephen Hemminger wrote:  
> > > On Mon, 15 Jul 2019 15:51:41 -0700
> > > Vedang Patel <vedang.patel@intel.com> wrote:    
> > > > @@ -442,6 +458,11 @@ static int taprio_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
> > > >  
> > > >  	print_string(PRINT_ANY, "clockid", "clockid %s", get_clock_name(clockid));
> > > >  
> > > > +	if (tb[TCA_TAPRIO_ATTR_FLAGS]) {
> > > > +		taprio_flags = rta_getattr_u32(tb[TCA_TAPRIO_ATTR_FLAGS]);
> > > > +		print_uint(PRINT_ANY, "flags", " flags %x", taprio_flags);
> > > > +	}    
> > >[...]
> > > 3. Use the print_0xhex() instead of print_uint() for hex values. The difference
> > >    is that in the JSON output, print_uint would be decimal but the print_0xhex
> > >    is always hex.  And use "flags %#x" so that it is clear you are printing flags in hex.    
> > 
> > In my humble personal experience scripting tests using iproute2 and
> > bpftool with Python I found printing the "hex string" instead of just
> > outputing the integer value counter productive :( Even tho it looks
> > better to the eye, JSON is primarily for machine processing and hex
> > strings have to be manually converted.  
> 
> If it is hex on normal output, it should be hex on JSON output.
> And what ever the normal output format is has to be accepted on command line as input.

Ah, I forgot the output == input principle in iproute2!
In any case if there was ever a vote whether to limit this principle to
non-JSON output, and make machines' life easier, I'd vote 'yes' :)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox