Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 2/2] bpf, selftest: fix checksum value for test #13
From: Paolo Pisati @ 2019-07-11  9:31 UTC (permalink / raw)
  To: --in-reply-to=, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S . Miller,
	Shuah Khan, Jakub Kicinski, Jiong Wang
  Cc: netdev, bpf, linux-kselftest, linux-kernel
In-Reply-To: <1562837513-745-1-git-send-email-p.pisati@gmail.com>

From: Paolo Pisati <paolo.pisati@canonical.com>

Signed-off-by: Paolo Pisati <paolo.pisati@canonical.com>
---
 tools/testing/selftests/bpf/verifier/array_access.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/verifier/array_access.c b/tools/testing/selftests/bpf/verifier/array_access.c
index bcb83196e459..4698f560d756 100644
--- a/tools/testing/selftests/bpf/verifier/array_access.c
+++ b/tools/testing/selftests/bpf/verifier/array_access.c
@@ -255,7 +255,7 @@
 	.prog_type = BPF_PROG_TYPE_SCHED_CLS,
 	.fixup_map_array_ro = { 3 },
 	.result = ACCEPT,
-	.retval = -29,
+	.retval = 28,
 },
 {
 	"invalid write map access into a read-only array 1",
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Eric Dumazet @ 2019-07-11  9:19 UTC (permalink / raw)
  To: Christoph Paasch, Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, David Miller, netdev,
	Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell, Tyler Hicks,
	Yuchung Cheng, Bruce Curtis, Jonathan Lemon, Dustin Marquess
In-Reply-To: <B600B3AB-559E-44C1-869C-7309DB28850E@gmail.com>



On 7/11/19 9:28 AM, Christoph Paasch wrote:
> 
> 
>> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>>>
>>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>>>
>>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>>> -The send socket was non-blocking
>>> -SO_SNDBUF set to 128KiB
>>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>>> -Load was on both systems: a while(1) program spinning on each CPU core
>>> -The receiver was on an older unaffected kernel
>>>
>>
>> SO_SNDBUF to 128KB does not permit to recover from heavy losses,
>> since skbs needs to be allocated for retransmits.
> 
> Would it make sense to always allow the alloc in tcp_fragment when coming from __tcp_retransmit_skb() through the retransmit-timer ?

4.15+ kernels have :

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf &&
    tcp_queue != TCP_FRAG_IN_WRITE_QUEUE)) {


Meaning that things like TLP will succeed.

Anything we add in TCP stack to overcome the SO_SNDBUF by twice the limit _will_ be exploited at scale.

I am not sure we want to continue to support small SO_SNDBUF values, as this makes no sense today.

We use 64 MB auto tuning limit, and /proc/sys/net/ipv4/tcp_notsent_lowat to 1 MB.

I would rather work (when net-next reopens) on better collapsing at rtx to allow reduction of the overhead.


Something like :

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f6a9c95a48edb234e4d4e21bf585744fbaf9a0a7..d5c85986209cd162cf39edb787b1385cb2c8b630 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2860,7 +2860,7 @@ static int __net_init tcp_sk_init(struct net *net)
        net->ipv4.sysctl_tcp_early_retrans = 3;
        net->ipv4.sysctl_tcp_recovery = TCP_RACK_LOSS_DETECTION;
        net->ipv4.sysctl_tcp_slow_start_after_idle = 1; /* By default, RFC2861 behavior.  */
-       net->ipv4.sysctl_tcp_retrans_collapse = 1;
+       net->ipv4.sysctl_tcp_retrans_collapse = 3;
        net->ipv4.sysctl_tcp_max_reordering = 300;
        net->ipv4.sysctl_tcp_dsack = 1;
        net->ipv4.sysctl_tcp_app_win = 31;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d61264cf89ef66b229ecf797c1abfb7fcdab009f..05cd264f98b084f62eaf2ef9d6e14a392670d02c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3015,8 +3015,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 
        next_skb_size = next_skb->len;
 
-       BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
-
        if (next_skb_size) {
                if (next_skb_size <= skb_availroom(skb))
                        skb_copy_bits(next_skb, 0, skb_put(skb, next_skb_size),
@@ -3054,8 +3052,6 @@ static bool tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 /* Check if coalescing SKBs is legal. */
 static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb)
 {
-       if (tcp_skb_pcount(skb) > 1)
-               return false;
        if (skb_cloned(skb))
                return false;
        /* Some heuristics for collapsing over SACK'd could be invented */
@@ -3114,7 +3110,7 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct tcp_sock *tp = tcp_sk(sk);
        unsigned int cur_mss;
-       int diff, len, err;
+       int diff, len, maxlen, err;
 
 
        /* Inconclusive MTU probe */
@@ -3165,12 +3161,13 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb, int segs)
                        return -ENOMEM;
 
                diff = tcp_skb_pcount(skb);
+               maxlen = (sock_net(sk)->ipv4.sysctl_tcp_retrans_collapse & 2) ? len : cur_mss;
+               if (skb->len < maxlen)
+                       tcp_retrans_try_collapse(sk, skb, maxlen);
                tcp_set_skb_tso_segs(skb, cur_mss);
                diff -= tcp_skb_pcount(skb);
                if (diff)
                        tcp_adjust_pcount(sk, skb, diff);
-               if (skb->len < cur_mss)
-                       tcp_retrans_try_collapse(sk, skb, cur_mss);
        }
 
        /* RFC3168, section 6.1.1.1. ECN fallback */

^ permalink raw reply related

* Re: [PATCH net-next] net/mlx5e: Provide cb_list pointer when setting up tc block on rep
From: Vlad Buslov @ 2019-07-11  9:13 UTC (permalink / raw)
  To: wenxu
  Cc: Vlad Buslov, netdev@vger.kernel.org, jhs@mojatatu.com,
	xiyou.wangcong@gmail.com, jiri@resnulli.us, davem@davemloft.net,
	pablo@netfilter.org, Saeed Mahameed
In-Reply-To: <6465040c-d22f-f4b6-0232-11ff4af81753@ucloud.cn>


On Thu 11 Jul 2019 at 03:35, wenxu <wenxu@ucloud.cn> wrote:
> 在 2019/7/11 2:25, Vlad Buslov 写道:
>> Recent refactoring of tc block offloads infrastructure introduced new
>> flow_block_cb_setup_simple() method intended to be used as unified way for
>> all drivers to register offload callbacks. However, commit that actually
>> extended all users (drivers) with block cb list and provided it to
>> flow_block infra missed mlx5 en_rep. This leads to following NULL-pointer
>> dereference when creating Qdisc:
>>
>> [  278.385175] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [  278.393233] #PF: supervisor read access in kernel mode
>> [  278.399446] #PF: error_code(0x0000) - not-present page
>> [  278.405847] PGD 8000000850e73067 P4D 8000000850e73067 PUD 8620cd067 PMD 0
>> [  278.414141] Oops: 0000 [#1] SMP PTI
>> [  278.419019] CPU: 7 PID: 3369 Comm: tc Not tainted 5.2.0-rc6+ #492
>> [  278.426580] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
>> [  278.435853] RIP: 0010:flow_block_cb_setup_simple+0xc4/0x190
>> [  278.442953] Code: 10 48 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 00 48 05 00 01 00 00 49 89 40 08 31 c0 c3 b8 a1 ff ff ff c3 f3 c3 <48> 8b 06 48 39 c6 75 0a eb 1a 48 8b 00 48 39 c6 74 12
>>  48 3b 50 28
>> [  278.464829] RSP: 0018:ffffaf07c3f97990 EFLAGS: 00010246
>> [  278.471648] RAX: 0000000000000000 RBX: ffff9b43ed4c7680 RCX: ffff9b43d5f80840
>> [  278.480408] RDX: ffffffffc0491650 RSI: 0000000000000000 RDI: ffffaf07c3f97998
>> [  278.489110] RBP: ffff9b43ddff9000 R08: ffff9b43d5f80840 R09: 0000000000000001
>> [  278.497838] R10: 0000000000000009 R11: 00000000000003ad R12: ffffaf07c3f97c08
>> [  278.506595] R13: ffff9b43d5f80000 R14: ffff9b43ed4c7680 R15: ffff9b43dfa20b40
>> [  278.515374] FS:  00007f796be1b400(0000) GS:ffff9b43ef840000(0000) knlGS:0000000000000000
>> [  278.525099] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  278.532453] CR2: 0000000000000000 CR3: 0000000840398002 CR4: 00000000001606e0
>> [  278.541197] Call Trace:
>> [  278.545252]  tcf_block_offload_cmd.isra.52+0x7e/0xb0
>> [  278.551871]  tcf_block_get_ext+0x365/0x3e0
>> [  278.557569]  qdisc_create+0x15c/0x4e0
>> [  278.562859]  ? kmem_cache_alloc_trace+0x1a2/0x1c0
>> [  278.569235]  tc_modify_qdisc+0x1c8/0x780
>> [  278.574761]  rtnetlink_rcv_msg+0x291/0x340
>> [  278.580518]  ? _cond_resched+0x15/0x40
>> [  278.585856]  ? rtnl_calcit.isra.29+0x120/0x120
>> [  278.591868]  netlink_rcv_skb+0x4a/0x110
>> [  278.597198]  netlink_unicast+0x1a0/0x250
>> [  278.602601]  netlink_sendmsg+0x2c1/0x3c0
>> [  278.608022]  sock_sendmsg+0x5b/0x60
>> [  278.612969]  ___sys_sendmsg+0x289/0x310
>> [  278.618231]  ? do_wp_page+0x99/0x730
>> [  278.623216]  ? page_add_new_anon_rmap+0xbe/0x140
>> [  278.629298]  ? __handle_mm_fault+0xc84/0x1360
>> [  278.635113]  ? __sys_sendmsg+0x5e/0xa0
>> [  278.640285]  __sys_sendmsg+0x5e/0xa0
>> [  278.645239]  do_syscall_64+0x5b/0x1b0
>> [  278.650274]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  278.656697] RIP: 0033:0x7f796abdeb87
>> [  278.661628] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00 00 00 53
>>  48 89 f3 48
>> [  278.683248] RSP: 002b:00007ffde213ba48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>> [  278.692245] RAX: ffffffffffffffda RBX: 000000005d261e6f RCX: 00007f796abdeb87
>> [  278.700862] RDX: 0000000000000000 RSI: 00007ffde213bab0 RDI: 0000000000000003
>> [  278.709527] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
>> [  278.718167] R10: 000000000000000c R11: 0000000000000246 R12: 0000000000000001
>> [  278.726743] R13: 000000000067b580 R14: 0000000000000000 R15: 0000000000000000
>> [  278.735302] Modules linked in: dummy vxlan ip6_udp_tunnel udp_tunnel sch_ingress nfsv3 nfs_acl nfs lockd grace fscache bridge stp llc sunrpc mlx5_ib ib_uverbs intel_rapl ib_core sb_edac x86_pkg_temp_
>> thermal intel_powerclamp coretemp kvm_intel kvm mlx5_core irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel igb ghash_clmulni_intel ses mei_me enclosure mlxfw ipmi_ssif intel_cstate iTCO_wdt ptp mei
>> pps_core iTCO_vendor_support pcspkr joydev intel_uncore i2c_i801 ipmi_si lpc_ich intel_rapl_perf ioatdma wmi dca pcc_cpufreq ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad ast i2c_algo_bit drm_k
>> ms_helper ttm drm mpt3sas raid_class scsi_transport_sas
>> [  278.802263] CR2: 0000000000000000
>> [  278.807170] ---[ end trace b1f0a442a279e66f ]---
>>
>> Extend en_rep with new static mlx5e_rep_block_cb_list list and pass it to
>> flow_block_cb_setup_simple() function instead of hardcoded NULL pointer.
>>
>> Fixes: 955bcb6ea0df ("drivers: net: use flow block API")
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>> index 10ef90a7bddd..7245d287633d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
>> @@ -1175,6 +1175,8 @@ static int mlx5e_rep_setup_tc_cb(enum tc_setup_type type, void *type_data,
>>  	}
>>  }
>>  
>> +static LIST_HEAD(mlx5e_rep_block_cb_list);
>> +
>
> I think it is not necessary needs a extra LIST_HEAD, the early mlx5e_block_cb_list is ok
>
> The early patch http://patchwork.ozlabs.org/patch/1130439/ is enough.

Yes, but in that case we will mix mlx5e_rep_setup_tc_cb rep callback and
mlx5e_rep_indr_setup_block_cb indirect callback in single list. I might
be missing something, but I don't see why we would want that.

>
>>  static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
>>  			      void *type_data)
>>  {
>> @@ -1182,7 +1184,8 @@ static int mlx5e_rep_setup_tc(struct net_device *dev, enum tc_setup_type type,
>>  
>>  	switch (type) {
>>  	case TC_SETUP_BLOCK:
>> -		return flow_block_cb_setup_simple(type_data, NULL,
>> +		return flow_block_cb_setup_simple(type_data,
>> +						  &mlx5e_rep_block_cb_list,
>>  						  mlx5e_rep_setup_tc_cb,
>>  						  priv, priv, true);
>>  	default:


^ permalink raw reply

* [PATCH v3 bpf] selftests/bpf: do not ignore clang failures
From: Ilya Leoshkevich @ 2019-07-11  9:12 UTC (permalink / raw)
  To: bpf, netdev; +Cc: andrii.nakryiko, daniel, liu.song.a23, Ilya Leoshkevich

When compiling an eBPF prog fails, make still returns 0, because
failing clang command's output is piped to llc and therefore its
exit status is ignored.

When clang fails, pipe the string "clang failed" to llc. This will make
llc fail with an informative error message. This solution was chosen
over using pipefail, having separate targets or getting rid of llc
invocation due to its simplicity.

In addition, pull Kbuild.include in order to get .DELETE_ON_ERROR target,
which would cause partial .o files to be removed.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
v1->v2: use intermediate targets instead of pipefail
v2->v3: pipe "clang failed" instead of using intermediate targets

tools/testing/selftests/bpf/Makefile | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e36356e2377e..e375f399b7a6 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../../../../scripts/Kbuild.include
 
 LIBDIR := ../../../lib
 BPFDIR := $(LIBDIR)/bpf
@@ -185,8 +186,8 @@ $(ALU32_BUILD_DIR)/test_progs_32: prog_tests/*.c
 
 $(ALU32_BUILD_DIR)/%.o: progs/%.c $(ALU32_BUILD_DIR) \
 					$(ALU32_BUILD_DIR)/test_progs_32
-	$(CLANG) $(CLANG_FLAGS) \
-		 -O2 -target bpf -emit-llvm -c $< -o - |      \
+	($(CLANG) $(CLANG_FLAGS) -O2 -target bpf -emit-llvm -c $< -o - || \
+		echo "clang failed") | \
 	$(LLC) -march=bpf -mattr=+alu32 -mcpu=$(CPU) $(LLC_FLAGS) \
 		-filetype=obj -o $@
 ifeq ($(DWARF2BTF),y)
@@ -197,16 +198,16 @@ endif
 # Have one program compiled without "-target bpf" to test whether libbpf loads
 # it successfully
 $(OUTPUT)/test_xdp.o: progs/test_xdp.c
-	$(CLANG) $(CLANG_FLAGS) \
-		-O2 -emit-llvm -c $< -o - | \
+	($(CLANG) $(CLANG_FLAGS) -O2 -emit-llvm -c $< -o - || \
+		echo "clang failed") | \
 	$(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
 ifeq ($(DWARF2BTF),y)
 	$(BTF_PAHOLE) -J $@
 endif
 
 $(OUTPUT)/%.o: progs/%.c
-	$(CLANG) $(CLANG_FLAGS) \
-		 -O2 -target bpf -emit-llvm -c $< -o - |      \
+	($(CLANG) $(CLANG_FLAGS) -O2 -target bpf -emit-llvm -c $< -o - || \
+		echo "clang failed") | \
 	$(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
 ifeq ($(DWARF2BTF),y)
 	$(BTF_PAHOLE) -J $@
-- 
2.21.0


^ permalink raw reply related

* [PATCH net-next iproute2 v2 3/3] tc: flower: Add matching on conntrack info
From: Paul Blakey @ 2019-07-11  8:14 UTC (permalink / raw)
  To: Jiri Pirko, Paul Blakey, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	Marcelo Ricardo Leitner, netdev, David Miller, Aaron Conole,
	Zhike Wang
  Cc: Rony Efraim, nst-kernel, John Hurley, Simon Horman, Justin Pettit
In-Reply-To: <1562832867-32347-1-git-send-email-paulb@mellanox.com>

Matches on conntrack state, zone, mark, and label.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
---
 man/man8/tc-flower.8 |  35 +++++++
 tc/f_flower.c        | 276 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 310 insertions(+), 1 deletion(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index adff41e..04ee194 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -289,6 +289,41 @@ bits is assumed.
 .TQ
 .BI enc_ttl " NUMBER"
 .TQ
+.BR
+.TP
+.BI ct_state " CT_STATE"
+.TQ
+.BI ct_zone " CT_MASKED_ZONE"
+.TQ
+.BI ct_mark " CT_MASKED_MARK"
+.TQ
+.BI ct_label " CT_MASKED_LABEL"
+Matches on connection tracking info
+.RS
+.TP
+.I CT_STATE
+Match the connection state, and can ne combination of [{+|-}flag] flags, where flag can be one of
+.RS
+.TP
+trk - Tracked connection.
+.TP
+new - New connection.
+.TP
+est - Established connection.
+.TP
+Example: +trk+est
+.RE
+.TP
+.I CT_MASKED_ZONE
+Match the connection zone, and can be masked.
+.TP
+.I CT_MASKED_MARK
+32bit match on the connection mark, and can be masked.
+.TP
+.I CT_MASKED_LABEL
+128bit match on the connection label, and can be masked.
+.RE
+.TP
 .BI geneve_opts " OPTIONS"
 Match on IP tunnel metadata. Key id
 .I NUMBER
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 70d40d3..a2a2301 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -82,9 +82,14 @@ static void explain(void)
 		"			enc_ttl MASKED-IP_TTL |\n"
 		"			geneve_opts MASKED-OPTIONS |\n"
 		"			ip_flags IP-FLAGS | \n"
-		"			enc_dst_port [ port_number ] }\n"
+		"			enc_dst_port [ port_number ] |\n"
+		"			ct_state MASKED_CT_STATE |\n"
+		"			ct_label MASKED_CT_LABEL |\n"
+		"			ct_mark MASKED_CT_MARK |\n"
+		"			ct_zone MASKED_CT_ZONE }\n"
 		"	FILTERID := X:Y:Z\n"
 		"	MASKED_LLADDR := { LLADDR | LLADDR/MASK | LLADDR/BITS }\n"
+		"	MASKED_CT_STATE := combination of {+|-} and flags trk,est,new\n"
 		"	ACTION-SPEC := ... look at individual actions\n"
 		"\n"
 		"NOTE:	CLASSID, IP-PROTO are parsed as hexadecimal input.\n"
@@ -214,6 +219,159 @@ static int flower_parse_matching_flags(char *str,
 	return 0;
 }
 
+static int flower_parse_u16(char *str, int value_type, int mask_type,
+			    struct nlmsghdr *n)
+{
+	__u16 value, mask;
+	char *slash;
+
+	slash = strchr(str, '/');
+	if (slash)
+		*slash = '\0';
+
+	if (get_u16(&value, str, 0))
+		return -1;
+
+	if (slash) {
+		if (get_u16(&mask, slash + 1, 0))
+			return -1;
+	} else {
+		mask = UINT16_MAX;
+	}
+
+	addattr16(n, MAX_MSG, value_type, value);
+	addattr16(n, MAX_MSG, mask_type, mask);
+
+	return 0;
+}
+
+static int flower_parse_u32(char *str, int value_type, int mask_type,
+			    struct nlmsghdr *n)
+{
+	__u32 value, mask;
+	char *slash;
+
+	slash = strchr(str, '/');
+	if (slash)
+		*slash = '\0';
+
+	if (get_u32(&value, str, 0))
+		return -1;
+
+	if (slash) {
+		if (get_u32(&mask, slash + 1, 0))
+			return -1;
+	} else {
+		mask = UINT32_MAX;
+	}
+
+	addattr32(n, MAX_MSG, value_type, value);
+	addattr32(n, MAX_MSG, mask_type, mask);
+
+	return 0;
+}
+
+static int flower_parse_ct_mark(char *str, struct nlmsghdr *n)
+{
+	return flower_parse_u32(str,
+				TCA_FLOWER_KEY_CT_MARK,
+				TCA_FLOWER_KEY_CT_MARK_MASK,
+				n);
+}
+
+static int flower_parse_ct_zone(char *str, struct nlmsghdr *n)
+{
+	return flower_parse_u16(str,
+				TCA_FLOWER_KEY_CT_ZONE,
+				TCA_FLOWER_KEY_CT_ZONE_MASK,
+				n);
+}
+
+static int flower_parse_ct_labels(char *str, struct nlmsghdr *n)
+{
+#define LABELS_SIZE	16
+	uint8_t labels[LABELS_SIZE], lmask[LABELS_SIZE];
+	char *slash, *mask = NULL;
+	size_t slen, slen_mask = 0;
+
+	slash = index(str, '/');
+	if (slash) {
+		*slash = 0;
+		mask = slash + 1;
+		slen_mask = strlen(mask);
+	}
+
+	slen = strlen(str);
+	if (slen > LABELS_SIZE * 2 || slen_mask > LABELS_SIZE * 2) {
+		char errmsg[128];
+
+		snprintf(errmsg, sizeof(errmsg),
+				"%zd Max allowed size %d",
+				slen, LABELS_SIZE*2);
+		invarg(errmsg, str);
+	}
+
+	if (hex2mem(str, labels, slen / 2) < 0)
+		invarg("labels must be a hex string\n", str);
+	addattr_l(n, MAX_MSG, TCA_FLOWER_KEY_CT_LABELS, labels, slen / 2);
+
+	if (mask) {
+		if (hex2mem(mask, lmask, slen_mask / 2) < 0)
+			invarg("labels mask must be a hex string\n", mask);
+	} else {
+		memset(lmask, 0xff, sizeof(lmask));
+		slen_mask = sizeof(lmask) * 2;
+	}
+	addattr_l(n, MAX_MSG, TCA_FLOWER_KEY_CT_LABELS_MASK, lmask,
+		  slen_mask / 2);
+
+	return 0;
+}
+
+static struct flower_ct_states {
+	char *str;
+	int flag;
+} flower_ct_states[] = {
+	{ "trk", TCA_FLOWER_KEY_CT_FLAGS_TRACKED },
+	{ "new", TCA_FLOWER_KEY_CT_FLAGS_NEW },
+	{ "est", TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED },
+};
+
+static int flower_parse_ct_state(char *str, struct nlmsghdr *n)
+{
+	int flags = 0, mask = 0,  len, i;
+	bool p;
+
+	while (*str != '\0') {
+		if (*str == '+')
+			p = true;
+		else if (*str == '-')
+			p = false;
+		else
+			return -1;
+
+		for (i = 0; i < ARRAY_SIZE(flower_ct_states); i++) {
+			len = strlen(flower_ct_states[i].str);
+			if (strncmp(str + 1, flower_ct_states[i].str, len))
+				continue;
+
+			if (p)
+				flags |= flower_ct_states[i].flag;
+			mask |= flower_ct_states[i].flag;
+			break;
+		}
+
+		if (i == ARRAY_SIZE(flower_ct_states))
+			return -1;
+
+		str += len + 1;
+	}
+
+	addattr16(n, MAX_MSG, TCA_FLOWER_KEY_CT_STATE, flags);
+	addattr16(n, MAX_MSG, TCA_FLOWER_KEY_CT_STATE_MASK, mask);
+	return 0;
+}
+
 static int flower_parse_ip_proto(char *str, __be16 eth_type, int type,
 				 __u8 *p_ip_proto, struct nlmsghdr *n)
 {
@@ -898,6 +1056,34 @@ static int flower_parse_opt(struct filter_util *qu, char *handle,
 			flags |= TCA_CLS_FLAGS_SKIP_HW;
 		} else if (matches(*argv, "skip_sw") == 0) {
 			flags |= TCA_CLS_FLAGS_SKIP_SW;
+		} else if (matches(*argv, "ct_state") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ct_state(*argv, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"ct_state\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "ct_zone") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ct_zone(*argv, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"ct_zone\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "ct_mark") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ct_mark(*argv, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"ct_mark\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "ct_label") == 0) {
+			NEXT_ARG();
+			ret = flower_parse_ct_labels(*argv, n);
+			if (ret < 0) {
+				fprintf(stderr, "Illegal \"ct_label\"\n");
+				return -1;
+			}
 		} else if (matches(*argv, "indev") == 0) {
 			NEXT_ARG();
 			if (check_ifname(*argv))
@@ -1590,6 +1776,85 @@ static void flower_print_tcp_flags(const char *name, struct rtattr *flags_attr,
 	print_string(PRINT_ANY, name, namefrm, out);
 }
 
+static void flower_print_ct_state(struct rtattr *flags_attr,
+				  struct rtattr *mask_attr)
+{
+	SPRINT_BUF(out);
+	uint16_t state;
+	uint16_t state_mask;
+	size_t done = 0;
+	int i;
+
+	if (!flags_attr)
+		return;
+
+	state = rta_getattr_u16(flags_attr);
+	if (mask_attr)
+		state_mask = rta_getattr_u16(mask_attr);
+	else
+		state_mask = UINT16_MAX;
+
+	for (i = 0; i < ARRAY_SIZE(flower_ct_states); i++) {
+		if (!(state_mask & flower_ct_states[i].flag))
+			continue;
+
+		if (state & flower_ct_states[i].flag)
+			done += sprintf(out + done, "+%s",
+					flower_ct_states[i].str);
+		else
+			done += sprintf(out + done, "-%s",
+					flower_ct_states[i].str);
+	}
+
+	print_string(PRINT_ANY, "ct_state", "\n  ct_state %s", out);
+}
+
+static void flower_print_ct_label(struct rtattr *attr,
+				  struct rtattr *mask_attr)
+{
+	const unsigned char *str;
+	bool print_mask = false;
+	int data_len, i;
+	SPRINT_BUF(out);
+	char *p;
+
+	if (!attr)
+		return;
+
+	data_len = RTA_PAYLOAD(attr);
+	hexstring_n2a(RTA_DATA(attr), data_len, out, sizeof(out));
+	p = out + data_len*2;
+
+	data_len = RTA_PAYLOAD(attr);
+	str = RTA_DATA(mask_attr);
+	if (data_len != 16)
+		print_mask = true;
+	for (i = 0; !print_mask && i < data_len; i++) {
+		if (str[i] != 0xff)
+			print_mask = true;
+	}
+	if (print_mask) {
+		*p++ = '/';
+		hexstring_n2a(RTA_DATA(mask_attr), data_len, p,
+			      sizeof(out)-(p-out));
+		p += data_len*2;
+	}
+	*p = '\0';
+
+	print_string(PRINT_ANY, "ct_label", "\n  ct_label %s", out);
+}
+
+static void flower_print_ct_zone(struct rtattr *attr,
+				 struct rtattr *mask_attr)
+{
+	print_masked_u16("ct_zone", attr, mask_attr);
+}
+
+static void flower_print_ct_mark(struct rtattr *attr,
+				 struct rtattr *mask_attr)
+{
+	print_masked_u32("ct_mark", attr, mask_attr);
+}
 
 static void flower_print_key_id(const char *name, struct rtattr *attr)
 {
@@ -1949,6 +2214,15 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
 				    tb[TCA_FLOWER_KEY_FLAGS],
 				    tb[TCA_FLOWER_KEY_FLAGS_MASK]);
 
+	flower_print_ct_state(tb[TCA_FLOWER_KEY_CT_STATE],
+			      tb[TCA_FLOWER_KEY_CT_STATE_MASK]);
+	flower_print_ct_zone(tb[TCA_FLOWER_KEY_CT_ZONE],
+			     tb[TCA_FLOWER_KEY_CT_ZONE_MASK]);
+	flower_print_ct_mark(tb[TCA_FLOWER_KEY_CT_MARK],
+			     tb[TCA_FLOWER_KEY_CT_MARK_MASK]);
+	flower_print_ct_label(tb[TCA_FLOWER_KEY_CT_LABELS],
+			      tb[TCA_FLOWER_KEY_CT_LABELS_MASK]);
+
 	close_json_object();
 
 	if (tb[TCA_FLOWER_FLAGS]) {
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next iproute2 v2 2/3] tc: Introduce tc ct action
From: Paul Blakey @ 2019-07-11  8:14 UTC (permalink / raw)
  To: Jiri Pirko, Paul Blakey, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	Marcelo Ricardo Leitner, netdev, David Miller, Aaron Conole,
	Zhike Wang
  Cc: Rony Efraim, nst-kernel, John Hurley, Simon Horman, Justin Pettit
In-Reply-To: <1562832867-32347-1-git-send-email-paulb@mellanox.com>

New tc action to send packets to conntrack module, commit
them, and set a zone, labels, mark, and nat on the connection.

It can also clear the packet's conntrack state by using clear.

Usage:
   ct clear
   ct commit [force] [zone] [mark] [label] [nat]
   ct [nat] [zone]

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Roi Dayan <roid@mellanox.com>
---
 include/uapi/linux/tc_act/tc_ct.h |  41 ++++
 tc/Makefile                       |   1 +
 tc/m_ct.c                         | 497 ++++++++++++++++++++++++++++++++++++++
 tc/tc_util.c                      |  44 ++++
 tc/tc_util.h                      |   4 +
 5 files changed, 587 insertions(+)
 create mode 100644 include/uapi/linux/tc_act/tc_ct.h
 create mode 100644 tc/m_ct.c

diff --git a/include/uapi/linux/tc_act/tc_ct.h b/include/uapi/linux/tc_act/tc_ct.h
new file mode 100644
index 0000000..5fb1d7a
--- /dev/null
+++ b/include/uapi/linux/tc_act/tc_ct.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef __UAPI_TC_CT_H
+#define __UAPI_TC_CT_H
+
+#include <linux/types.h>
+#include <linux/pkt_cls.h>
+
+enum {
+	TCA_CT_UNSPEC,
+	TCA_CT_PARMS,
+	TCA_CT_TM,
+	TCA_CT_ACTION,		/* u16 */
+	TCA_CT_ZONE,		/* u16 */
+	TCA_CT_MARK,		/* u32 */
+	TCA_CT_MARK_MASK,	/* u32 */
+	TCA_CT_LABELS,		/* u128 */
+	TCA_CT_LABELS_MASK,	/* u128 */
+	TCA_CT_NAT_IPV4_MIN,	/* be32 */
+	TCA_CT_NAT_IPV4_MAX,	/* be32 */
+	TCA_CT_NAT_IPV6_MIN,	/* struct in6_addr */
+	TCA_CT_NAT_IPV6_MAX,	/* struct in6_addr */
+	TCA_CT_NAT_PORT_MIN,	/* be16 */
+	TCA_CT_NAT_PORT_MAX,	/* be16 */
+	TCA_CT_PAD,
+	__TCA_CT_MAX
+};
+
+#define TCA_CT_MAX (__TCA_CT_MAX - 1)
+
+#define TCA_CT_ACT_COMMIT	(1 << 0)
+#define TCA_CT_ACT_FORCE	(1 << 1)
+#define TCA_CT_ACT_CLEAR	(1 << 2)
+#define TCA_CT_ACT_NAT		(1 << 3)
+#define TCA_CT_ACT_NAT_SRC	(1 << 4)
+#define TCA_CT_ACT_NAT_DST	(1 << 5)
+
+struct tc_ct {
+	tc_gen;
+};
+
+#endif /* __UAPI_TC_CT_H */
diff --git a/tc/Makefile b/tc/Makefile
index 09ff369..14171a2 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -53,6 +53,7 @@ TCMODULES += m_ctinfo.o
 TCMODULES += m_bpf.o
 TCMODULES += m_tunnel_key.o
 TCMODULES += m_sample.o
+TCMODULES += m_ct.o
 TCMODULES += p_ip.o
 TCMODULES += p_ip6.o
 TCMODULES += p_icmp.o
diff --git a/tc/m_ct.c b/tc/m_ct.c
new file mode 100644
index 0000000..8589cb9
--- /dev/null
+++ b/tc/m_ct.c
@@ -0,0 +1,497 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/* -
+ * m_ct.c     Connection tracking action
+ *
+ * Authors:   Paul Blakey <paulb@mellanox.com>
+ *            Yossi Kuperman <yossiku@mellanox.com>
+ *            Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include "utils.h"
+#include "tc_util.h"
+#include <linux/tc_act/tc_ct.h>
+
+static void
+usage(void)
+{
+	fprintf(stderr,
+		"Usage: ct clear\n"
+		"	ct commit [force] [zone ZONE] [mark MASKED_MARK] [label MASKED_LABEL] [nat NAT_SPEC]\n"
+		"	ct [nat] [zone ZONE]\n"
+		"Where: ZONE is the conntrack zone table number\n"
+		"	NAT_SPEC is {src|dst} addr addr1[-addr2] [port port1[-port2]]\n"
+		"\n");
+	exit(-1);
+}
+
+static int ct_parse_nat_addr_range(const char *str, struct nlmsghdr *n)
+{
+	inet_prefix addr = { .family = AF_UNSPEC, };
+	char *addr1, *addr2 = 0;
+	SPRINT_BUF(buffer);
+	int attr;
+	int ret;
+
+	strncpy(buffer, str, sizeof(buffer) - 1);
+
+	addr1 = buffer;
+	addr2 = strchr(addr1, '-');
+	if (addr2) {
+		*addr2 = '\0';
+		addr2++;
+	}
+
+	ret = get_addr(&addr, addr1, AF_UNSPEC);
+	if (ret)
+		return ret;
+	attr = addr.family == AF_INET ? TCA_CT_NAT_IPV4_MIN :
+					TCA_CT_NAT_IPV6_MIN;
+	addattr_l(n, MAX_MSG, attr, addr.data, addr.bytelen);
+
+	if (addr2) {
+		ret = get_addr(&addr, addr2, addr.family);
+		if (ret)
+			return ret;
+	}
+	attr = addr.family == AF_INET ? TCA_CT_NAT_IPV4_MAX :
+					TCA_CT_NAT_IPV6_MAX;
+	addattr_l(n, MAX_MSG, attr, addr.data, addr.bytelen);
+
+	return 0;
+}
+
+static int ct_parse_nat_port_range(const char *str, struct nlmsghdr *n)
+{
+	char *port1, *port2 = 0;
+	SPRINT_BUF(buffer);
+	__be16 port;
+	int ret;
+
+	strncpy(buffer, str, sizeof(buffer) - 1);
+
+	port1 = buffer;
+	port2 = strchr(port1, '-');
+	if (port2) {
+		*port2 = '\0';
+		port2++;
+	}
+
+	ret = get_be16(&port, port1, 10);
+	if (ret)
+		return -1;
+	addattr16(n, MAX_MSG, TCA_CT_NAT_PORT_MIN, port);
+
+	if (port2) {
+		ret = get_be16(&port, port2, 10);
+		if (ret)
+			return -1;
+	}
+	addattr16(n, MAX_MSG, TCA_CT_NAT_PORT_MAX, port);
+
+	return 0;
+}
+
+
+static int ct_parse_u16(char *str, int value_type, int mask_type,
+			struct nlmsghdr *n)
+{
+	__u16 value, mask;
+	char *slash = 0;
+
+	if (mask_type != TCA_CT_UNSPEC) {
+		slash = strchr(str, '/');
+		if (slash)
+			*slash = '\0';
+	}
+
+	if (get_u16(&value, str, 0))
+		return -1;
+
+	if (slash) {
+		if (get_u16(&mask, slash + 1, 0))
+			return -1;
+	} else {
+		mask = UINT16_MAX;
+	}
+
+	addattr16(n, MAX_MSG, value_type, value);
+	if (mask_type != TCA_CT_UNSPEC)
+		addattr16(n, MAX_MSG, mask_type, mask);
+
+	return 0;
+}
+
+static int ct_parse_u32(char *str, int value_type, int mask_type,
+			struct nlmsghdr *n)
+{
+	__u32 value, mask;
+	char *slash;
+
+	slash = strchr(str, '/');
+	if (slash)
+		*slash = '\0';
+
+	if (get_u32(&value, str, 0))
+		return -1;
+
+	if (slash) {
+		if (get_u32(&mask, slash + 1, 0))
+			return -1;
+	} else {
+		mask = UINT32_MAX;
+	}
+
+	addattr32(n, MAX_MSG, value_type, value);
+	addattr32(n, MAX_MSG, mask_type, mask);
+
+	return 0;
+}
+
+static int ct_parse_mark(char *str, struct nlmsghdr *n)
+{
+	return ct_parse_u32(str, TCA_CT_MARK, TCA_CT_MARK_MASK, n);
+}
+
+static int ct_parse_labels(char *str, struct nlmsghdr *n)
+{
+#define LABELS_SIZE	16
+	uint8_t labels[LABELS_SIZE], lmask[LABELS_SIZE];
+	char *slash, *mask = NULL;
+	size_t slen, slen_mask = 0;
+
+	slash = index(str, '/');
+	if (slash) {
+		*slash = 0;
+		mask = slash+1;
+		slen_mask = strlen(mask);
+	}
+
+	slen = strlen(str);
+	if (slen > LABELS_SIZE*2 || slen_mask > LABELS_SIZE*2) {
+		char errmsg[128];
+
+		snprintf(errmsg, sizeof(errmsg),
+				"%zd Max allowed size %d",
+				slen, LABELS_SIZE*2);
+		invarg(errmsg, str);
+	}
+
+	if (hex2mem(str, labels, slen/2) < 0)
+		invarg("ct: labels must be a hex string\n", str);
+	addattr_l(n, MAX_MSG, TCA_CT_LABELS, labels, slen/2);
+
+	if (mask) {
+		if (hex2mem(mask, lmask, slen_mask/2) < 0)
+			invarg("ct: labels mask must be a hex string\n", mask);
+	} else {
+		memset(lmask, 0xff, sizeof(lmask));
+		slen_mask = sizeof(lmask)*2;
+	}
+	addattr_l(n, MAX_MSG, TCA_CT_LABELS_MASK, lmask, slen_mask/2);
+
+	return 0;
+}
+
+static int
+parse_ct(struct action_util *a, int *argc_p, char ***argv_p, int tca_id,
+		struct nlmsghdr *n)
+{
+	struct tc_ct sel = {};
+	char **argv = *argv_p;
+	struct rtattr *tail;
+	int argc = *argc_p;
+	int ct_action = 0;
+	int ret;
+
+	tail = addattr_nest(n, MAX_MSG, tca_id);
+
+	if (argc && matches(*argv, "ct") == 0)
+		NEXT_ARG_FWD();
+
+	while (argc > 0) {
+		if (matches(*argv, "zone") == 0) {
+			NEXT_ARG();
+
+			if (ct_parse_u16(*argv,
+					 TCA_CT_ZONE, TCA_CT_UNSPEC, n)) {
+				fprintf(stderr, "ct: Illegal \"zone\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "nat") == 0) {
+			ct_action |= TCA_CT_ACT_NAT;
+
+			NEXT_ARG();
+			if (matches(*argv, "src") == 0)
+				ct_action |= TCA_CT_ACT_NAT_SRC;
+			else if (matches(*argv, "dst") == 0)
+				ct_action |= TCA_CT_ACT_NAT_DST;
+			else
+				continue;
+
+			NEXT_ARG();
+			if (matches(*argv, "addr") != 0)
+				usage();
+
+			NEXT_ARG();
+			ret = ct_parse_nat_addr_range(*argv, n);
+			if (ret) {
+				fprintf(stderr, "ct: Illegal nat address range\n");
+				return -1;
+			}
+
+			NEXT_ARG_FWD();
+			if (matches(*argv, "port") != 0)
+				continue;
+
+			NEXT_ARG();
+			ret = ct_parse_nat_port_range(*argv, n);
+			if (ret) {
+				fprintf(stderr, "ct: Illegal nat port range\n");
+				return -1;
+			}
+		} else if (matches(*argv, "clear") == 0) {
+			ct_action |= TCA_CT_ACT_CLEAR;
+		} else if (matches(*argv, "commit") == 0) {
+			ct_action |= TCA_CT_ACT_COMMIT;
+		} else if (matches(*argv, "force") == 0) {
+			ct_action |= TCA_CT_ACT_FORCE;
+		} else if (matches(*argv, "index") == 0) {
+			NEXT_ARG();
+			if (get_u32(&sel.index, *argv, 10)) {
+				fprintf(stderr, "ct: Illegal \"index\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "mark") == 0) {
+			NEXT_ARG();
+
+			ret = ct_parse_mark(*argv, n);
+			if (ret) {
+				fprintf(stderr, "ct: Illegal \"mark\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "label") == 0) {
+			NEXT_ARG();
+
+			ret = ct_parse_labels(*argv, n);
+			if (ret) {
+				fprintf(stderr, "ct: Illegal \"label\"\n");
+				return -1;
+			}
+		} else if (matches(*argv, "help") == 0) {
+			usage();
+		} else {
+			break;
+		}
+		NEXT_ARG_FWD();
+	}
+
+	if (ct_action & TCA_CT_ACT_CLEAR &&
+	    ct_action & ~TCA_CT_ACT_CLEAR) {
+		fprintf(stderr, "ct: clear can only be used alone\n");
+		return -1;
+	}
+
+	if (ct_action & TCA_CT_ACT_NAT_SRC &&
+	    ct_action & TCA_CT_ACT_NAT_DST) {
+		fprintf(stderr, "ct: src and dst nat can't be used together\n");
+		return -1;
+	}
+
+	if ((ct_action & TCA_CT_ACT_COMMIT) &&
+	    (ct_action & TCA_CT_ACT_NAT) &&
+	    !(ct_action & (TCA_CT_ACT_NAT_SRC | TCA_CT_ACT_NAT_DST))) {
+		fprintf(stderr, "ct: commit and nat must set src or dst\n");
+		return -1;
+	}
+
+	if (!(ct_action & TCA_CT_ACT_COMMIT) &&
+	    (ct_action & (TCA_CT_ACT_NAT_SRC | TCA_CT_ACT_NAT_DST))) {
+		fprintf(stderr, "ct: src or dst is only valid if commit is set\n");
+		return -1;
+	}
+
+	parse_action_control_dflt(&argc, &argv, &sel.action, false,
+				  TC_ACT_PIPE);
+	NEXT_ARG_FWD();
+
+	addattr16(n, MAX_MSG, TCA_CT_ACTION, ct_action);
+	addattr_l(n, MAX_MSG, TCA_CT_PARMS, &sel, sizeof(sel));
+	addattr_nest_end(n, tail);
+
+	*argc_p = argc;
+	*argv_p = argv;
+	return 0;
+}
+
+static int ct_sprint_port(char *buf, const char *prefix, struct rtattr *attr)
+{
+	if (!attr)
+		return 0;
+
+	return sprintf(buf, "%s%d", prefix, rta_getattr_be16(attr));
+}
+
+static int ct_sprint_ip_addr(char *buf, const char *prefix,
+			     struct rtattr *attr)
+{
+	int family;
+	size_t len;
+
+	if (!attr)
+		return 0;
+
+	len = RTA_PAYLOAD(attr);
+
+	if (len == 4)
+		family = AF_INET;
+	else if (len == 16)
+		family = AF_INET6;
+	else
+		return 0;
+
+	return sprintf(buf, "%s%s", prefix, rt_addr_n2a_rta(family, attr));
+}
+
+static void ct_print_nat(int ct_action, struct rtattr **tb)
+{
+	size_t done = 0;
+	char out[256] = "";
+	bool nat;
+
+	if (!(ct_action & TCA_CT_ACT_NAT))
+		return;
+
+	if (ct_action & TCA_CT_ACT_NAT_SRC) {
+		nat = true;
+		done += sprintf(out + done, "src");
+	} else if (ct_action & TCA_CT_ACT_NAT_DST) {
+		nat = true;
+		done += sprintf(out + done, "dst");
+	}
+
+	if (nat) {
+		done += ct_sprint_ip_addr(out + done, " addr ",
+					  tb[TCA_CT_NAT_IPV4_MIN]);
+		done += ct_sprint_ip_addr(out + done, " addr ",
+					  tb[TCA_CT_NAT_IPV6_MIN]);
+		if (tb[TCA_CT_NAT_IPV4_MAX] &&
+		    memcmp(RTA_DATA(tb[TCA_CT_NAT_IPV4_MIN]),
+			   RTA_DATA(tb[TCA_CT_NAT_IPV4_MAX]), 4))
+			done += ct_sprint_ip_addr(out + done, "-",
+						  tb[TCA_CT_NAT_IPV4_MAX]);
+		else if (tb[TCA_CT_NAT_IPV6_MAX] &&
+			    memcmp(RTA_DATA(tb[TCA_CT_NAT_IPV6_MIN]),
+				   RTA_DATA(tb[TCA_CT_NAT_IPV6_MAX]), 16))
+			done += ct_sprint_ip_addr(out + done, "-",
+						  tb[TCA_CT_NAT_IPV6_MAX]);
+		done += ct_sprint_port(out + done, " port ",
+				       tb[TCA_CT_NAT_PORT_MIN]);
+		if (tb[TCA_CT_NAT_PORT_MAX] &&
+		    memcmp(RTA_DATA(tb[TCA_CT_NAT_PORT_MIN]),
+			   RTA_DATA(tb[TCA_CT_NAT_PORT_MAX]), 2))
+			done += ct_sprint_port(out + done, "-",
+					       tb[TCA_CT_NAT_PORT_MAX]);
+	}
+
+	if (done)
+		print_string(PRINT_ANY, "nat", " nat %s", out);
+	else
+		print_string(PRINT_ANY, "nat", " nat", "");
+}
+
+static void ct_print_labels(struct rtattr *attr,
+			    struct rtattr *mask_attr)
+{
+	const unsigned char *str;
+	bool print_mask = false;
+	char out[256], *p;
+	int data_len, i;
+
+	if (!attr)
+		return;
+
+	data_len = RTA_PAYLOAD(attr);
+	hexstring_n2a(RTA_DATA(attr), data_len, out, sizeof(out));
+	p = out + data_len*2;
+
+	data_len = RTA_PAYLOAD(attr);
+	str = RTA_DATA(mask_attr);
+	if (data_len != 16)
+		print_mask = true;
+	for (i = 0; !print_mask && i < data_len; i++) {
+		if (str[i] != 0xff)
+			print_mask = true;
+	}
+	if (print_mask) {
+		*p++ = '/';
+		hexstring_n2a(RTA_DATA(mask_attr), data_len, p,
+			      sizeof(out)-(p-out));
+		p += data_len*2;
+	}
+	*p = '\0';
+
+	print_string(PRINT_ANY, "label", " label %s", out);
+}
+
+static int print_ct(struct action_util *au, FILE *f, struct rtattr *arg)
+{
+	struct rtattr *tb[TCA_CT_MAX + 1];
+	const char *commit;
+	struct tc_ct *p;
+	int ct_action = 0;
+
+	if (arg == NULL)
+		return -1;
+
+	parse_rtattr_nested(tb, TCA_CT_MAX, arg);
+	if (tb[TCA_CT_PARMS] == NULL) {
+		print_string(PRINT_FP, NULL, "%s", "[NULL ct parameters]");
+		return -1;
+	}
+
+	p = RTA_DATA(tb[TCA_CT_PARMS]);
+
+	print_string(PRINT_ANY, "kind", "%s", "ct");
+
+	if (tb[TCA_CT_ACTION])
+		ct_action = rta_getattr_u16(tb[TCA_CT_ACTION]);
+	if (ct_action & TCA_CT_ACT_COMMIT) {
+		commit = ct_action & TCA_CT_ACT_FORCE ?
+			 "commit force" : "commit";
+		print_string(PRINT_ANY, "action", " %s", commit);
+	} else if (ct_action & TCA_CT_ACT_CLEAR) {
+		print_string(PRINT_ANY, "action", " %s", "clear");
+	}
+
+	print_masked_u32("mark", tb[TCA_CT_MARK], tb[TCA_CT_MARK_MASK]);
+	print_masked_u16("zone", tb[TCA_CT_ZONE], NULL);
+	ct_print_labels(tb[TCA_CT_LABELS], tb[TCA_CT_LABELS_MASK]);
+	ct_print_nat(ct_action, tb);
+
+	print_action_control(f, " ", p->action, "");
+
+	print_uint(PRINT_ANY, "index", "\n\t index %u", p->index);
+	print_int(PRINT_ANY, "ref", " ref %d", p->refcnt);
+	print_int(PRINT_ANY, "bind", " bind %d", p->bindcnt);
+
+	if (show_stats) {
+		if (tb[TCA_CT_TM]) {
+			struct tcf_t *tm = RTA_DATA(tb[TCA_CT_TM]);
+
+			print_tm(f, tm);
+		}
+	}
+	print_string(PRINT_FP, NULL, "%s", "\n ");
+
+	return 0;
+}
+
+struct action_util ct_action_util = {
+	.id = "ct",
+	.parse_aopt = parse_ct,
+	.print_aopt = print_ct,
+};
diff --git a/tc/tc_util.c b/tc/tc_util.c
index 53d15e0..8e461ba 100644
--- a/tc/tc_util.c
+++ b/tc/tc_util.c
@@ -913,3 +913,47 @@ compat_xstats:
 	if (tb[TCA_XSTATS] && xstats)
 		*xstats = tb[TCA_XSTATS];
 }
+
+void print_masked_u32(const char *name, struct rtattr *attr,
+		      struct rtattr *mask_attr)
+{
+	__u32 value, mask;
+	SPRINT_BUF(namefrm);
+	SPRINT_BUF(out);
+	size_t done;
+
+	if (!attr)
+		return;
+
+	value = rta_getattr_u32(attr);
+	mask = mask_attr ? rta_getattr_u32(mask_attr) : UINT32_MAX;
+
+	done = sprintf(out, "%u", value);
+	if (mask != UINT32_MAX)
+		sprintf(out + done, "/0x%x", mask);
+
+	sprintf(namefrm, " %s %%s", name);
+	print_string(PRINT_ANY, name, namefrm, out);
+}
+
+void print_masked_u16(const char *name, struct rtattr *attr,
+		      struct rtattr *mask_attr)
+{
+	__u16 value, mask;
+	SPRINT_BUF(namefrm);
+	SPRINT_BUF(out);
+	size_t done;
+
+	if (!attr)
+		return;
+
+	value = rta_getattr_u16(attr);
+	mask = mask_attr ? rta_getattr_u16(mask_attr) : UINT16_MAX;
+
+	done = sprintf(out, "%u", value);
+	if (mask != UINT16_MAX)
+		sprintf(out + done, "/0x%x", mask);
+
+	sprintf(namefrm, " %s %%s", name);
+	print_string(PRINT_ANY, name, namefrm, out);
+}
diff --git a/tc/tc_util.h b/tc/tc_util.h
index eb4b60d..0c3425a 100644
--- a/tc/tc_util.h
+++ b/tc/tc_util.h
@@ -127,4 +127,8 @@ int action_a2n(char *arg, int *result, bool allow_num);
 
 bool tc_qdisc_block_exists(__u32 block_index);
 
+void print_masked_u32(const char *name, struct rtattr *attr,
+		      struct rtattr *mask_attr);
+void print_masked_u16(const char *name, struct rtattr *attr,
+		      struct rtattr *mask_attr);
 #endif
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next iproute2 v2 1/3] tc: add NLA_F_NESTED flag to all actions options nested block
From: Paul Blakey @ 2019-07-11  8:14 UTC (permalink / raw)
  To: Jiri Pirko, Paul Blakey, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	Marcelo Ricardo Leitner, netdev, David Miller, Aaron Conole,
	Zhike Wang
  Cc: Rony Efraim, nst-kernel, John Hurley, Simon Horman, Justin Pettit
In-Reply-To: <1562832867-32347-1-git-send-email-paulb@mellanox.com>

Strict netlink validation now requires this flag on all nested
attributes, add it for action options.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
---
 tc/m_action.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tc/m_action.c b/tc/m_action.c
index ab6bc0a..2d36a69 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -214,7 +214,8 @@ done0:
 			tail = addattr_nest(n, MAX_MSG, ++prio);
 			addattr_l(n, MAX_MSG, TCA_ACT_KIND, k, strlen(k) + 1);
 
-			ret = a->parse_aopt(a, &argc, &argv, TCA_ACT_OPTIONS,
+			ret = a->parse_aopt(a, &argc, &argv,
+					    TCA_ACT_OPTIONS | NLA_F_NESTED,
 					    n);
 
 			if (ret < 0) {
-- 
1.8.3.1


^ permalink raw reply related

* [PATCH net-next iproute2 v2 0/3] net/sched: Introduce tc connection tracking
From: Paul Blakey @ 2019-07-11  8:14 UTC (permalink / raw)
  To: Jiri Pirko, Paul Blakey, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	Marcelo Ricardo Leitner, netdev, David Miller, Aaron Conole,
	Zhike Wang
  Cc: Rony Efraim, nst-kernel, John Hurley, Simon Horman, Justin Pettit

Hi,

This patch series add connection tracking capabilities in tc.
It does so via a new tc action, called act_ct, and new tc flower classifier matching.
Act ct and relevant flower matches, are still under review in net-next mailing list.

Usage is as follows:
$ tc qdisc add dev ens1f0_0 ingress
$ tc qdisc add dev ens1f0_1 ingress

$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 2
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_state +trk+new \
  action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 1
$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 1 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_0

Changelog:
V1->V2:
	Removed pkt_cls changes (as it was merged already)

Paul Blakey (3):
  tc: add NLA_F_NESTED flag to all actions options nested block
  tc: Introduce tc ct action
  tc: flower: Add matching on conntrack info

 include/uapi/linux/tc_act/tc_ct.h |  41 ++++
 man/man8/tc-flower.8              |  35 +++
 tc/Makefile                       |   1 +
 tc/f_flower.c                     | 276 ++++++++++++++++++++-
 tc/m_action.c                     |   3 +-
 tc/m_ct.c                         | 497 ++++++++++++++++++++++++++++++++++++++
 tc/tc_util.c                      |  44 ++++
 tc/tc_util.h                      |   4 +
 8 files changed, 899 insertions(+), 2 deletions(-)
 create mode 100644 include/uapi/linux/tc_act/tc_ct.h
 create mode 100644 tc/m_ct.c

-- 
1.8.3.1


^ permalink raw reply

* Re: [PATCH net-next 1/3] net: flow_offload: remove netns parameter from flow_block_cb_alloc()
From: Jiri Pirko @ 2019-07-11  8:08 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711001235.20686-1-pablo@netfilter.org>

Thu, Jul 11, 2019 at 02:12:33AM CEST, pablo@netfilter.org wrote:
>No need to annotate the netns on the flow block callback object,
>flow_block_cb_is_busy() already checks for used blocks.
>
>Fixes: d63db30c8537 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()")
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next 2/3] net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_t
From: Jiri Pirko @ 2019-07-11  8:07 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711001235.20686-2-pablo@netfilter.org>

Thu, Jul 11, 2019 at 02:12:34AM CEST, pablo@netfilter.org wrote:
>Rename this type definition and adapt users.
>
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: flow_offload: add flow_block structure and use it
From: Jiri Pirko @ 2019-07-11  8:06 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, davem, netdev, jakub.kicinski
In-Reply-To: <20190711001235.20686-3-pablo@netfilter.org>

Thu, Jul 11, 2019 at 02:12:35AM CEST, pablo@netfilter.org wrote:
>This object stores the flow block callbacks that are attached to this
>block. This patch restores block sharing.
>
>Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions")
>Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
>---
> include/net/flow_offload.h        |  5 +++++
> include/net/netfilter/nf_tables.h |  5 +++--
> include/net/sch_generic.h         |  2 +-
> net/core/flow_offload.c           |  2 +-
> net/netfilter/nf_tables_api.c     |  2 +-
> net/netfilter/nf_tables_offload.c |  5 +++--
> net/sched/cls_api.c               | 10 +++++++---
> 7 files changed, 21 insertions(+), 10 deletions(-)
>
>diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
>index 98bf3af5c84d..e50d94736829 100644
>--- a/include/net/flow_offload.h
>+++ b/include/net/flow_offload.h
>@@ -248,6 +248,10 @@ enum flow_block_binder_type {
> 	FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS,
> };
> 
>+struct flow_block {
>+	struct list_head cb_list;
>+};
>+
> struct netlink_ext_ack;
> 
> struct flow_block_offload {
>@@ -255,6 +259,7 @@ struct flow_block_offload {
> 	enum flow_block_binder_type binder_type;
> 	bool block_shared;
> 	struct net *net;
>+	struct flow_block *block;
> 	struct list_head cb_list;
> 	struct list_head *driver_block_list;
> 	struct netlink_ext_ack *extack;
>diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
>index 35dfdd9f69b3..00658462f89b 100644
>--- a/include/net/netfilter/nf_tables.h
>+++ b/include/net/netfilter/nf_tables.h
>@@ -11,6 +11,7 @@
> #include <linux/rhashtable.h>
> #include <net/netfilter/nf_flow_table.h>
> #include <net/netlink.h>
>+#include <net/flow_offload.h>
> 
> struct module;
> 
>@@ -951,7 +952,7 @@ struct nft_stats {
>  *	@stats: per-cpu chain stats
>  *	@chain: the chain
>  *	@dev_name: device name that this base chain is attached to (if any)
>- *	@cb_list: list of flow block callbacks (for hardware offload)
>+ *	@block: flow block (for hardware offload)
>  */
> struct nft_base_chain {
> 	struct nf_hook_ops		ops;
>@@ -961,7 +962,7 @@ struct nft_base_chain {
> 	struct nft_stats __percpu	*stats;
> 	struct nft_chain		chain;
> 	char 				dev_name[IFNAMSIZ];
>-	struct list_head		cb_list;
>+	struct flow_block		block;
> };
> 
> static inline struct nft_base_chain *nft_base_chain(const struct nft_chain *chain)
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 9482e060483b..58041cb0ce15 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -399,7 +399,7 @@ struct tcf_block {
> 	refcount_t refcnt;
> 	struct net *net;
> 	struct Qdisc *q;
>-	struct list_head cb_list;
>+	struct flow_block flow;

It is not a "flow", that is confusing. It should be named "flow_block".


> 	struct list_head owner_list;
> 	bool keep_dst;
> 	unsigned int offloadcnt; /* Number of oddloaded filters */
>diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
>index a800fa78d96c..935c7f81a9ef 100644
>--- a/net/core/flow_offload.c
>+++ b/net/core/flow_offload.c
>@@ -198,7 +198,7 @@ struct flow_block_cb *flow_block_cb_lookup(struct flow_block_offload *f,
> {
> 	struct flow_block_cb *block_cb;
> 
>-	list_for_each_entry(block_cb, f->driver_block_list, driver_list) {
>+	list_for_each_entry(block_cb, &f->block->cb_list, list) {

Please made struct flow_block *block and argument of cb_lookup instead
of struct flow_block_offload *f (as it was previously).


> 		if (block_cb->cb == cb &&
> 		    block_cb->cb_ident == cb_ident)
> 			return block_cb;
>diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
>index ed17a7c29b86..c565f146435b 100644
>--- a/net/netfilter/nf_tables_api.c
>+++ b/net/netfilter/nf_tables_api.c
>@@ -1662,7 +1662,7 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
> 
> 		chain->flags |= NFT_BASE_CHAIN | flags;
> 		basechain->policy = NF_ACCEPT;
>-		INIT_LIST_HEAD(&basechain->cb_list);
>+		INIT_LIST_HEAD(&basechain->block.cb_list);
> 	} else {
> 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
> 		if (chain == NULL)
>diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
>index 2c3302845f67..2a184277ee58 100644
>--- a/net/netfilter/nf_tables_offload.c
>+++ b/net/netfilter/nf_tables_offload.c
>@@ -116,7 +116,7 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
> 	struct flow_block_cb *block_cb;
> 	int err;
> 
>-	list_for_each_entry(block_cb, &basechain->cb_list, list) {
>+	list_for_each_entry(block_cb, &basechain->block.cb_list, list) {
> 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
> 		if (err < 0)
> 			return err;
>@@ -154,7 +154,7 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
> static int nft_flow_offload_bind(struct flow_block_offload *bo,
> 				 struct nft_base_chain *basechain)
> {
>-	list_splice(&bo->cb_list, &basechain->cb_list);
>+	list_splice(&bo->cb_list, &basechain->block.cb_list);
> 	return 0;
> }
> 
>@@ -198,6 +198,7 @@ static int nft_flow_offload_chain(struct nft_trans *trans,
> 		return -EOPNOTSUPP;
> 
> 	bo.command = cmd;
>+	bo.block = &basechain->block;
> 	bo.binder_type = FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS;
> 	bo.extack = &extack;
> 	INIT_LIST_HEAD(&bo.cb_list);
>diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>index 51fbe6e95a92..66181961ad6f 100644
>--- a/net/sched/cls_api.c
>+++ b/net/sched/cls_api.c
>@@ -691,6 +691,8 @@ static void tc_indr_block_ing_cmd(struct tc_indr_block_dev *indr_dev,
> 	if (!indr_dev->block)
> 		return;
> 
>+	bo.block = &indr_dev->block->flow;
>+
> 	indr_block_cb->cb(indr_dev->dev, indr_block_cb->cb_priv, TC_SETUP_BLOCK,
> 			  &bo);
> 	tcf_block_setup(indr_dev->block, &bo);
>@@ -775,6 +777,7 @@ static void tc_indr_block_call(struct tcf_block *block, struct net_device *dev,
> 		.command	= command,
> 		.binder_type	= ei->binder_type,
> 		.net		= dev_net(dev),
>+		.block		= &block->flow,
> 		.block_shared	= tcf_block_shared(block),
> 		.extack		= extack,
> 	};
>@@ -810,6 +813,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
> 	bo.net = dev_net(dev);
> 	bo.command = command;
> 	bo.binder_type = ei->binder_type;
>+	bo.block = &block->flow;
> 	bo.block_shared = tcf_block_shared(block);
> 	bo.extack = extack;
> 	INIT_LIST_HEAD(&bo.cb_list);
>@@ -988,7 +992,7 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
> 	}
> 	mutex_init(&block->lock);
> 	INIT_LIST_HEAD(&block->chain_list);
>-	INIT_LIST_HEAD(&block->cb_list);
>+	INIT_LIST_HEAD(&block->flow.cb_list);

With introduction of struct flow_block, please introduce also a helper
to init this struct. Does not look right to init it from user codes
(tc/nft).


> 	INIT_LIST_HEAD(&block->owner_list);
> 	INIT_LIST_HEAD(&block->chain0.filter_chain_list);
> 
>@@ -1570,7 +1574,7 @@ static int tcf_block_bind(struct tcf_block *block,
> 
> 		i++;
> 	}
>-	list_splice(&bo->cb_list, &block->cb_list);
>+	list_splice(&bo->cb_list, &block->flow.cb_list);
> 
> 	return 0;
> 
>@@ -3155,7 +3159,7 @@ int tc_setup_cb_call(struct tcf_block *block, enum tc_setup_type type,
> 	if (block->nooffloaddevcnt && err_stop)
> 		return -EOPNOTSUPP;
> 
>-	list_for_each_entry(block_cb, &block->cb_list, list) {
>+	list_for_each_entry(block_cb, &block->flow.cb_list, list) {
> 		err = block_cb->cb(type, type_data, block_cb->cb_priv);
> 		if (err) {
> 			if (err_stop)
>-- 
>2.11.0
>
>

^ permalink raw reply

* [PATCH net-next] netfilter: nf_table_offload: Fix zero prio of flow_cls_common_offload
From: wenxu @ 2019-07-11  8:03 UTC (permalink / raw)
  To: pablo, davem; +Cc: netfilter-devel, netdev

From: wenxu <wenxu@ucloud.cn>

The flow_cls_common_offload prio should be not zero

It leads the invalid table prio in hw.

# nft add table netdev firewall
# nft add chain netdev firewall acl { type filter hook ingress device mlx_pf0vf0 priority - 300 \; }
# nft add rule netdev firewall acl ip daddr 1.1.1.7 drop
Error: Could not process rule: Invalid argument

kernel log
mlx5_core 0000:81:00.0: E-Switch: Failed to create FDB Table err -22 (table prio: 65535, level: 0, size: 4194304)

Fixes: c9626a2cbdb2 ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
---
 net/netfilter/nf_tables_offload.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 2c33028..01d8133 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -7,6 +7,8 @@
 #include <net/netfilter/nf_tables_offload.h>
 #include <net/pkt_cls.h>
 
+#define FLOW_OFFLOAD_DEFAUT_PRIO 1U
+
 static struct nft_flow_rule *nft_flow_rule_alloc(int num_actions)
 {
 	struct nft_flow_rule *flow;
@@ -107,6 +109,7 @@ static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
 					struct netlink_ext_ack *extack)
 {
 	common->protocol = proto;
+	common->prio = TC_H_MAKE(FLOW_OFFLOAD_DEFAUT_PRIO << 16, 0);
 	common->extack = extack;
 }
 
-- 
1.8.3.1


^ permalink raw reply related

* Re:  Re: linux-next: build failure after merge of the net-next tree
From: Bernard Metzler @ 2019-07-11  8:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Stephen Rothwell, Doug Ledford, David Miller,
	Networking, Linux Next Mailing List, Linux Kernel Mailing List
In-Reply-To: <20190710175212.GM2887@mellanox.com>

-----"Jason Gunthorpe" <jgg@mellanox.com> wrote: -----

>To: "Leon Romanovsky" <leon@kernel.org>, "Bernard Metzler"
><bmt@zurich.ibm.com>
>From: "Jason Gunthorpe" <jgg@mellanox.com>
>Date: 07/10/2019 07:52PM
>Cc: "Stephen Rothwell" <sfr@canb.auug.org.au>, "Doug Ledford"
><dledford@redhat.com>, "David Miller" <davem@davemloft.net>,
>"Networking" <netdev@vger.kernel.org>, "Linux Next Mailing List"
><linux-next@vger.kernel.org>, "Linux Kernel Mailing List"
><linux-kernel@vger.kernel.org>
>Subject: [EXTERNAL] Re: linux-next: build failure after merge of the
>net-next tree
>
>On Tue, Jul 09, 2019 at 09:43:46AM +0300, Leon Romanovsky wrote:
>> On Tue, Jul 09, 2019 at 01:56:36PM +1000, Stephen Rothwell wrote:
>> > Hi all,
>> >
>> > After merging the net-next tree, today's linux-next build (x86_64
>> > allmodconfig) failed like this:
>> >
>> > drivers/infiniband/sw/siw/siw_cm.c: In function
>'siw_create_listen':
>> > drivers/infiniband/sw/siw/siw_cm.c:1978:3: error: implicit
>declaration of function 'for_ifa'; did you mean 'fork_idle'?
>[-Werror=implicit-function-declaration]
>> >    for_ifa(in_dev)
>> >    ^~~~~~~
>> >    fork_idle
>> > drivers/infiniband/sw/siw/siw_cm.c:1978:18: error: expected ';'
>before '{' token
>> >    for_ifa(in_dev)
>> >                   ^
>> >                   ;
>> >    {
>> >    ~
>> >
>> > Caused by commit
>> >
>> >   6c52fdc244b5 ("rdma/siw: connection management")
>> >
>> > from the rdma tree.  I don't know why this didn't fail after I
>mereged
>> > that tree.
>> 
>> I had the same question, because I have this fix for a couple of
>days already.
>> 
>> From 56c9e15ec670af580daa8c3ffde9503af3042d67 Mon Sep 17 00:00:00
>2001
>> From: Leon Romanovsky <leonro@mellanox.com>
>> Date: Sun, 7 Jul 2019 10:43:42 +0300
>> Subject: [PATCH] Fixup to build SIW issue
>> 
>> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>>  drivers/infiniband/sw/siw/siw_cm.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>> 
>> diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>b/drivers/infiniband/sw/siw/siw_cm.c
>> index 8e618cb7261f..c883bf514341 100644
>> +++ b/drivers/infiniband/sw/siw/siw_cm.c
>> @@ -1954,6 +1954,7 @@ static void siw_drop_listeners(struct
>iw_cm_id *id)
>>  int siw_create_listen(struct iw_cm_id *id, int backlog)
>>  {
>>  	struct net_device *dev = to_siw_dev(id->device)->netdev;
>> +	const struct in_ifaddr *ifa;
>>  	int rv = 0, listeners = 0;
>> 
>>  	siw_dbg(id->device, "id 0x%p: backlog %d\n", id, backlog);
>> @@ -1975,8 +1976,7 @@ int siw_create_listen(struct iw_cm_id *id,
>int backlog)
>>  			id, &s_laddr.sin_addr, ntohs(s_laddr.sin_port),
>>  			&s_raddr->sin_addr, ntohs(s_raddr->sin_port));
>> 
>> -		for_ifa(in_dev)
>> -		{
>> +		in_dev_for_each_ifa_rcu(ifa, in_dev) {
>>  			if (ipv4_is_zeronet(s_laddr.sin_addr.s_addr) ||
>
>Hum. There is no rcu lock held here and we can't use RCU anyhow as
>siw_listen_address will sleep.
>
>I think this needs to use rtnl, as below. Bernard, please urgently
>confirm. Thanks
>

Hi Jason,

That listen will not sleep. The socket is just marked
listening. Accepting a new connection is handled asynchronously
by a work handler, kicked by a socket callback
(siw_cm_llp_state_change).

But, I think you are correct, we are missing the
rcu_read_lock/unlock around that iteration. Could we please add
that (see below)?

Thanks very much!
Bernard.

diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
index c883bf514341..c5c060103993 100644
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -1976,6 +1976,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
                        id, &s_laddr.sin_addr, ntohs(s_laddr.sin_port),
                        &s_raddr->sin_addr, ntohs(s_raddr->sin_port));
 
+               rcu_read_lock();
                in_dev_for_each_ifa_rcu(ifa, in_dev) {
                        if (ipv4_is_zeronet(s_laddr.sin_addr.s_addr) ||
                            s_laddr.sin_addr.s_addr == ifa->ifa_address) {
@@ -1988,6 +1989,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
                                        listeners++;
                        }
                }
+               rcu_read_unlock();
                in_dev_put(in_dev);
        } else if (id->local_addr.ss_family == AF_INET6) {
                struct inet6_dev *in6_dev = in6_dev_get(dev);



>diff --git a/drivers/infiniband/sw/siw/siw_cm.c
>b/drivers/infiniband/sw/siw/siw_cm.c
>index 8e618cb7261f62..ee98e96a5bfaba 100644
>--- a/drivers/infiniband/sw/siw/siw_cm.c
>+++ b/drivers/infiniband/sw/siw/siw_cm.c
>@@ -1965,6 +1965,7 @@ int siw_create_listen(struct iw_cm_id *id, int
>backlog)
> 	 */
> 	if (id->local_addr.ss_family == AF_INET) {
> 		struct in_device *in_dev = in_dev_get(dev);
>+		const struct in_ifaddr *ifa;
> 		struct sockaddr_in s_laddr, *s_raddr;
> 
> 		memcpy(&s_laddr, &id->local_addr, sizeof(s_laddr));
>@@ -1975,8 +1976,8 @@ int siw_create_listen(struct iw_cm_id *id, int
>backlog)
> 			id, &s_laddr.sin_addr, ntohs(s_laddr.sin_port),
> 			&s_raddr->sin_addr, ntohs(s_raddr->sin_port));
> 
>-		for_ifa(in_dev)
>-		{
>+		rtnl_lock();
>+		in_dev_for_each_ifa_rtnl(ifa, in_dev) {
> 			if (ipv4_is_zeronet(s_laddr.sin_addr.s_addr) ||
> 			    s_laddr.sin_addr.s_addr == ifa->ifa_address) {
> 				s_laddr.sin_addr.s_addr = ifa->ifa_address;
>@@ -1988,7 +1989,7 @@ int siw_create_listen(struct iw_cm_id *id, int
>backlog)
> 					listeners++;
> 			}
> 		}
>-		endfor_ifa(in_dev);
>+		rtnl_unlock();
> 		in_dev_put(in_dev);
> 	} else if (id->local_addr.ss_family == AF_INET6) {
> 		struct inet6_dev *in6_dev = in6_dev_get(dev);
>
>


^ permalink raw reply related

* Re: NEIGH: BUG, double timer add, state is 8
From: Lorenzo Bianconi @ 2019-07-11  7:54 UTC (permalink / raw)
  To: David Ahern; +Cc: Marek Majkowski, David Miller, netdev, kernel-team
In-Reply-To: <c383ea93-4257-f31c-e259-f71169f7baef@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

> On 7/5/19 11:30 AM, Lorenzo Bianconi wrote:
> > looking at the reproducer it seems to me the issue is due to the use of
> > 'NTF_USE' from userspace.
> > Should we unschedule the neigh timer if we are in IN_TIMER receiving this
> > flag from userspace? (taking appropriate locking)
> 
> I think you are right. Do you want to send a patch?

Hi David,

thx for the feedback. Sure, I will post a patch soon.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* RE: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"
From: Jan Szewczyk @ 2019-07-11  7:41 UTC (permalink / raw)
  To: David Ahern, Stefano Brivio
  Cc: davem@davemloft.net, netdev@vger.kernel.org, Wei Wang,
	Martin KaFai Lau, Eric Dumazet
In-Reply-To: <5981b8d0-cfdf-d230-fa22-cfcfaa5ee4b9@gmail.com>

Hi guys!

Yes, that's exactly it! Thank you very much, so now I know what is happening 😊.

Thanks again for your help!

BR,
Jan Szewczyk

-----Original Message-----
From: David Ahern <dsahern@gmail.com> 
Sent: Wednesday, July 10, 2019 21:13
To: Stefano Brivio <sbrivio@redhat.com>; Jan Szewczyk <jan.szewczyk@ericsson.com>
Cc: davem@davemloft.net; netdev@vger.kernel.org; Wei Wang <weiwan@google.com>; Martin KaFai Lau <kafai@fb.com>; Eric Dumazet <edumazet@google.com>
Subject: Re: Question about linux kernel commit: "net/ipv6: move metrics from dst to rt6_info"

On 7/10/19 1:09 PM, Stefano Brivio wrote:
> Jan,
> 
> On Wed, 10 Jul 2019 12:59:41 +0000
> Jan Szewczyk <jan.szewczyk@ericsson.com> wrote:
> 
>> Hi!
>> I digged up a little further and maybe it's not a problem with MTU 
>> itself. I checked every entry I get from RTM_GETROUTE netlink message 
>> and after triggering "too big packet" by pinging ipv6address I get 
>> exactly the same messages on 4.12 and 4.18, except that the one with 
>> that pinged ipv6address is missing on 4.18 at all. What is weird - 
>> it's visible when running "ip route get to ipv6address". Do you know 
>> why there is a mismatch there?
> 
> If I understand you correctly, an implementation equivalent to 'ip -6 
> route list show' (using the NLM_F_DUMP flag) won't show the so-called 
> route exception, while 'ip -6 route get' shows it.
> 
> If that's the case: that was broken by commit 2b760fcf5cfb ("ipv6: 
> hook up exception table to store dst cache") that landed in 4.15, and 
> fixed by net-next commit 1e47b4837f3b ("ipv6: Dump route exceptions if 
> requested"). For more details, see the log of this commit itself.
> 

ah, good point. My mind locked on RTM_GETROUTE as a specific route request not a dump.

^ permalink raw reply

* Re: [RFC] virtio-net: share receive_*() and add_recvbuf_*() with virtio-vsock
From: Jason Wang @ 2019-07-11  7:37 UTC (permalink / raw)
  To: Stefano Garzarella, Michael S. Tsirkin, Stefan Hajnoczi
  Cc: virtualization, netdev
In-Reply-To: <20190710153707.twmzgmwqqw3pstos@steredhat>

On 2019/7/10 下午11:37, Stefano Garzarella wrote:
> Hi,
> as Jason suggested some months ago, I looked better at the virtio-net driver to
> understand if we can reuse some parts also in the virtio-vsock driver, since we
> have similar challenges (mergeable buffers, page allocation, small
> packets, etc.).
>
> Initially, I would add the skbuff in the virtio-vsock in order to re-use
> receive_*() functions.

Yes, that will be a good step.

> Then I would move receive_[small, big, mergeable]() and
> add_recvbuf_[small, big, mergeable]() outside of virtio-net driver, in order to
> call them also from virtio-vsock. I need to do some refactoring (e.g. leave the
> XDP part on the virtio-net driver), but I think it is feasible.
>
> The idea is to create a virtio-skb.[h,c] where put these functions and a new
> object where stores some attributes needed (e.g. hdr_len ) and status (e.g.
> some fields of struct receive_queue).

My understanding is we could be more ambitious here. Do you see any 
blocker for reusing virtio-net directly? It's better to reuse not only 
the functions but also the logic like NAPI to avoid re-inventing 
something buggy and duplicated.

> This is an idea of virtio-skb.h that
> I have in mind:
>      struct virtskb;

What fields do you want to store in virtskb? It looks to be exist 
sk_buff is flexible enough to us?

>
>      struct sk_buff *virtskb_receive_small(struct virtskb *vs, ...);
>      struct sk_buff *virtskb_receive_big(struct virtskb *vs, ...);
>      struct sk_buff *virtskb_receive_mergeable(struct virtskb *vs, ...);
>
>      int virtskb_add_recvbuf_small(struct virtskb*vs, ...);
>      int virtskb_add_recvbuf_big(struct virtskb *vs, ...);
>      int virtskb_add_recvbuf_mergeable(struct virtskb *vs, ...);
>
> For the Guest->Host path it should be easier, so maybe I can add a
> "virtskb_send(struct virtskb *vs, struct sk_buff *skb)" with a part of the code
> of xmit_skb().

I may miss something, but I don't see any thing that prevents us from 
using xmit_skb() directly.

>
> Let me know if you have in mind better names or if I should put these function
> in another place.
>
> I would like to leave the control part completely separate, so, for example,
> the two drivers will negotiate the features independently and they will call
> the right virtskb_receive_*() function based on the negotiation.

If it's one the issue of negotiation, we can simply change the 
virtnet_probe() to deal with different devices.

>
> I already started to work on it, but before to do more steps and send an RFC
> patch, I would like to hear your opinion.
> Do you think that makes sense?
> Do you see any issue or a better solution?

I still think we need to seek a way of adding some codes on virtio-net.c 
directly if there's no huge different in the processing of TX/RX. That 
would save us a lot time.

Thanks

>
> Thanks in advance,
> Stefano

^ permalink raw reply

* Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
From: Christoph Paasch @ 2019-07-11  7:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Prout, Andrew - LLSC - MITLL, David Miller, netdev,
	Greg Kroah-Hartman, Jonathan Looney, Neal Cardwell, Tyler Hicks,
	Yuchung Cheng, Bruce Curtis, Jonathan Lemon, Dustin Marquess
In-Reply-To: <b1dfd327-a784-6609-3c83-dab42c3c7eda@gmail.com>



> On Jul 10, 2019, at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 7/10/19 8:53 PM, Prout, Andrew - LLSC - MITLL wrote:
>> 
>> Our initial rollout was v4.14.130, but I reproduced it with v4.14.132 as well, reliably for the samba test and once (not reliably) with synthetic test I was trying. A patched v4.14.132 with this patch partially reverted (just the four lines from tcp_fragment deleted) passed the samba test.
>> 
>> The synthetic test was a pair of simple send/recv test programs under the following conditions:
>> -The send socket was non-blocking
>> -SO_SNDBUF set to 128KiB
>> -The receiver NIC was being flooded with traffic from multiple hosts (to induce packet loss/retransmits)
>> -Load was on both systems: a while(1) program spinning on each CPU core
>> -The receiver was on an older unaffected kernel
>> 
> 
> SO_SNDBUF to 128KB does not permit to recover from heavy losses,
> since skbs needs to be allocated for retransmits.

Would it make sense to always allow the alloc in tcp_fragment when coming from __tcp_retransmit_skb() through the retransmit-timer ?

AFAICS, the crasher was when an attacker sends "fake" SACK-blocks. Thus, we would still be protected from too much fragmentation, but at least would always allow the retransmission to go out.


Christoph

> 
> The bug we fixed allowed remote attackers to crash all linux hosts,
> 
> I am afraid we have to enforce the real SO_SNDBUF limit, finally.
> 
> Even a cushion of 128KB per socket is dangerous, for servers with millions of TCP sockets.
> 
> You will either have to set SO_SNDBUF to higher values, or let autotuning in place.
> Or revert the patches and allow attackers hit you badly.
> 


^ permalink raw reply

* RE: [PATCH v6 rdma-next 0/6] RDMA/qedr: Use the doorbell overflow recovery mechanism for RDMA
From: Michal Kalderon @ 2019-07-11  7:23 UTC (permalink / raw)
  To: Gal Pressman, Ariel Elior, jgg@ziepe.ca, dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, davem@davemloft.net,
	netdev@vger.kernel.org, sleybo@amazon.com
In-Reply-To: <7b2f2205-6b5d-c9e7-2d59-296367e517ac@amazon.com>

> From: linux-rdma-owner@vger.kernel.org <linux-rdma-
> owner@vger.kernel.org> On Behalf Of Gal Pressman
> 
> On 09/07/2019 17:17, Michal Kalderon wrote:
> > This patch series uses the doorbell overflow recovery mechanism
> > introduced in commit 36907cd5cd72 ("qed: Add doorbell overflow
> > recovery mechanism") for rdma ( RoCE and iWARP )
> >
> > The first three patches modify the core code to contain helper
> > functions for managing mmap_xa inserting, getting and freeing entries.
> > The code was taken almost as is from the efa driver.
> > There is still an open discussion on whether we should take this even
> > further and make the entire mmap generic. Until a decision is made, I
> > only created the database API and modified the efa and qedr driver to
> > use it. The doorbell recovery code will be based on the common code.
> >
> > Efa driver was compile tested only.
> 
> For the whole series:
> Tested-by: Gal Pressman <galpress@amazon.com>

Thanks Gal!


^ permalink raw reply

* Re: [PATCH net-next iproute2 2/3] tc: Introduce tc ct action
From: Paul Blakey @ 2019-07-11  7:21 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Jiri Pirko, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	netdev@vger.kernel.org, David Miller, Aaron Conole, Zhike Wang,
	Justin Pettit, John Hurley, Rony Efraim, nst-kernel@redhat.com,
	Simon Horman
In-Reply-To: <20190709153657.GF3390@localhost.localdomain>


On 7/9/2019 6:36 PM, Marcelo Ricardo Leitner wrote:
> On Tue, Jul 09, 2019 at 06:58:36AM +0000, Paul Blakey wrote:
>> On 7/8/2019 8:54 PM, Marcelo Ricardo Leitner wrote:
>>> On Sun, Jul 07, 2019 at 11:53:47AM +0300, Paul Blakey wrote:
>>>> New tc action to send packets to conntrack module, commit
>>>> them, and set a zone, labels, mark, and nat on the connection.
>>>>
>>>> It can also clear the packet's conntrack state by using clear.
>>>>
>>>> Usage:
>>>>      ct clear
>>>>      ct commit [force] [zone] [mark] [label] [nat]
>>> Isn't the 'commit' also optional? More like
>>>       ct [commit [force]] [zone] [mark] [label] [nat]
>>>
>>>>      ct [nat] [zone]
>>>>
>>>> Signed-off-by: Paul Blakey <paulb@mellanox.com>
>>>> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>>>> Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
>>>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>>>> Acked-by: Roi Dayan <roid@mellanox.com>
>>>> ---
>>> ...
>>>> +static void
>>>> +usage(void)
>>>> +{
>>>> +	fprintf(stderr,
>>>> +		"Usage: ct clear\n"
>>>> +		"	ct commit [force] [zone ZONE] [mark MASKED_MARK] [label MASKED_LABEL] [nat NAT_SPEC]\n"
>>> Ditto here then.
>>
>> In commit msg and here, it means there is multiple modes of operation. I
>> think it's easier to split those.
> Yep, that is good.
> More below.
>
>> "ct clear" to clear it , not other options can be added here.
>>
>> "ct commit  [force].... " sends to conntrack and commit a connection,
>> and only for commit can you specify force mark  label, and nat with
>> nat_spec....
>>
>> and the last one, "ct [nat] [zone ZONE]" is to just send the packet to
>> conntrack on some zone [optional], restore nat [optional].
>>
>>
>>>> +		"	ct [nat] [zone ZONE]\n"
>>>> +		"Where: ZONE is the conntrack zone table number\n"
>>>> +		"	NAT_SPEC is {src|dst} addr addr1[-addr2] [port port1[-port2]]\n"
>>>> +		"\n");
>>>> +	exit(-1);
>>>> +}
>>> ...
>>>
>>> The validation below doesn't enforce that commit must be there for
>>> such case.
>> which case? commit is optional. the above are the three valid patterns.
> That's the point. But the 2nd example is saying 'commit' word is
> mandatory in that mode. It is written as it is a command that was
> selected.
>
> One may use just:
>      ct [zone]
> And not
>      ct commit [zone]
> Right?

It is optional in the overall syntax.


But I split it into modes:

clear, commit, and "restore" (I unofficial call it like that, because it 
usually used to get the +est state on the packet and can restore nat, it 
doesn't actually restore anything for the first packet on the -trk rule)

It is mandatory in the second mode (commit), if you don't specify commit 
or clear, you can only use the third form - "restore", which is to send 
to ct on some optional zone, and optionally and restore nat (so we get 
ct [zone] [nat]).

I think this syntax is easy, maybe I can label them as the modes of 
operation above (then I'll need to name the restore one better :)).

If there is a different syntax you think might be easier I'll change to 
that.


Thanks,

Paul.







^ permalink raw reply

* Re: [PATCH net-next v6 0/4] net/sched: Introduce tc connection tracking
From: Paul Blakey @ 2019-07-11  7:12 UTC (permalink / raw)
  To: David Miller
  Cc: Jiri Pirko, Roi Dayan, Yossi Kuperman, Oz Shlomo,
	marcelo.leitner@gmail.com, netdev@vger.kernel.org,
	aconole@redhat.com, wangzhike@jd.com, Rony Efraim,
	nst-kernel@redhat.com, john.hurley@netronome.com,
	simon.horman@netronome.com, jpettit@ovn.org
In-Reply-To: <20190709.121402.1804664264408465946.davem@davemloft.net>


On 7/9/2019 10:14 PM, David Miller wrote:
> From: Paul Blakey <paulb@mellanox.com>
> Date: Tue,  9 Jul 2019 10:30:47 +0300
>
>> This patch series add connection tracking capabilities in tc sw datapath.
>> It does so via a new tc action, called act_ct, and new tc flower classifier matching
>> on conntrack state, mark and label.
>   ...
>
> Ok, I applied this, but two things:
>
> 1) You owe Cong Wang an explanation, a real detailed one, about the L2
>     vs L3 design of this feature.  I did not see you address his feedback,
>     but if you did I apologize.
>
> 2) Because the MPLS changes went in first, TCA_ID_CT ended up in a
>     different spot in the enumeration and therefore the value is
>     different.
>
> Thanks.



Thanks!

Re 1, I provided one in "Re: [PATCH net-next v2 0/4] net/sched: 
Introduce tc connection tracking", hope that's enough.


^ permalink raw reply

* [PATCH v2 bpf-next 2/3] selftests/bpf: add trickier size resolution tests
From: Andrii Nakryiko @ 2019-07-11  6:53 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko
In-Reply-To: <20190711065307.2425636-1-andriin@fb.com>

Add more BTF tests, validating that size resolution logic is correct in
few trickier cases.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/test_btf.c | 88 ++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_btf.c b/tools/testing/selftests/bpf/test_btf.c
index 8351cb5f4a20..3d617e806054 100644
--- a/tools/testing/selftests/bpf/test_btf.c
+++ b/tools/testing/selftests/bpf/test_btf.c
@@ -3417,6 +3417,94 @@ static struct btf_raw_test raw_tests[] = {
 	.value_type_id = 1,
 	.max_entries = 4,
 },
+/*
+ * typedef int arr_t[16];
+ * struct s {
+ *	arr_t *a;
+ * };
+ */
+{
+	.descr = "struct->ptr->typedef->array->int size resolution",
+	.raw_types = {
+		BTF_STRUCT_ENC(NAME_TBD, 1, 8),			/* [1] */
+		BTF_MEMBER_ENC(NAME_TBD, 2, 0),
+		BTF_PTR_ENC(3),					/* [2] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 4),			/* [3] */
+		BTF_TYPE_ARRAY_ENC(5, 5, 16),			/* [4] */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),	/* [5] */
+		BTF_END_RAW,
+	},
+	BTF_STR_SEC("\0s\0a\0arr_t"),
+	.map_type = BPF_MAP_TYPE_ARRAY,
+	.map_name = "ptr_mod_chain_size_resolve_map",
+	.key_size = sizeof(int),
+	.value_size = sizeof(int) * 16,
+	.key_type_id = 5 /* int */,
+	.value_type_id = 3 /* arr_t */,
+	.max_entries = 4,
+},
+/*
+ * typedef int arr_t[16][8][4];
+ * struct s {
+ *	arr_t *a;
+ * };
+ */
+{
+	.descr = "struct->ptr->typedef->multi-array->int size resolution",
+	.raw_types = {
+		BTF_STRUCT_ENC(NAME_TBD, 1, 8),			/* [1] */
+		BTF_MEMBER_ENC(NAME_TBD, 2, 0),
+		BTF_PTR_ENC(3),					/* [2] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 4),			/* [3] */
+		BTF_TYPE_ARRAY_ENC(5, 7, 16),			/* [4] */
+		BTF_TYPE_ARRAY_ENC(6, 7, 8),			/* [5] */
+		BTF_TYPE_ARRAY_ENC(7, 7, 4),			/* [6] */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),	/* [7] */
+		BTF_END_RAW,
+	},
+	BTF_STR_SEC("\0s\0a\0arr_t"),
+	.map_type = BPF_MAP_TYPE_ARRAY,
+	.map_name = "multi_arr_size_resolve_map",
+	.key_size = sizeof(int),
+	.value_size = sizeof(int) * 16 * 8 * 4,
+	.key_type_id = 7 /* int */,
+	.value_type_id = 3 /* arr_t */,
+	.max_entries = 4,
+},
+/*
+ * typedef int int_t;
+ * typedef int_t arr3_t[4];
+ * typedef arr3_t arr2_t[8];
+ * typedef arr2_t arr1_t[16];
+ * struct s {
+ *	arr1_t *a;
+ * };
+ */
+{
+	.descr = "typedef/multi-arr mix size resolution",
+	.raw_types = {
+		BTF_STRUCT_ENC(NAME_TBD, 1, 8),			/* [1] */
+		BTF_MEMBER_ENC(NAME_TBD, 2, 0),
+		BTF_PTR_ENC(3),					/* [2] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 4),			/* [3] */
+		BTF_TYPE_ARRAY_ENC(5, 10, 16),			/* [4] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 6),			/* [5] */
+		BTF_TYPE_ARRAY_ENC(7, 10, 8),			/* [6] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 8),			/* [7] */
+		BTF_TYPE_ARRAY_ENC(9, 10, 4),			/* [8] */
+		BTF_TYPEDEF_ENC(NAME_TBD, 10),			/* [9] */
+		BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),	/* [10] */
+		BTF_END_RAW,
+	},
+	BTF_STR_SEC("\0s\0a\0arr1_t\0arr2_t\0arr3_t\0int_t"),
+	.map_type = BPF_MAP_TYPE_ARRAY,
+	.map_name = "typedef_arra_mix_size_resolve_map",
+	.key_size = sizeof(int),
+	.value_size = sizeof(int) * 16 * 8 * 4,
+	.key_type_id = 10 /* int */,
+	.value_type_id = 3 /* arr_t */,
+	.max_entries = 4,
+},
 
 }; /* struct btf_raw_test raw_tests[] */
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 3/3] selftests/bpf: use typedef'ed arrays as map values
From: Andrii Nakryiko @ 2019-07-11  6:53 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko
In-Reply-To: <20190711065307.2425636-1-andriin@fb.com>

Convert few tests that couldn't use typedef'ed arrays due to kernel bug.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c     | 3 ++-
 tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c | 3 +--
 tools/testing/selftests/bpf/progs/test_stacktrace_map.c      | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c b/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
index d06b47a09097..33254b771384 100644
--- a/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
+++ b/tools/testing/selftests/bpf/progs/test_get_stack_rawtp.c
@@ -47,11 +47,12 @@ struct {
  * issue and avoid complicated C programming massaging.
  * This is an acceptable workaround since there is one entry here.
  */
+typedef __u64 raw_stack_trace_t[2 * MAX_STACK_RAWTP];
 struct {
 	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
 	__uint(max_entries, 1);
 	__type(key, __u32);
-	__u64 (*value)[2 * MAX_STACK_RAWTP];
+	__type(value, raw_stack_trace_t);
 } rawdata_map SEC(".maps");
 
 SEC("tracepoint/raw_syscalls/sys_enter")
diff --git a/tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c b/tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c
index bbfc8337b6f0..f5638e26865d 100644
--- a/tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c
+++ b/tools/testing/selftests/bpf/progs/test_stacktrace_build_id.c
@@ -36,8 +36,7 @@ struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
 	__uint(max_entries, 128);
 	__type(key, __u32);
-	/* there seems to be a bug in kernel not handling typedef properly */
-	struct bpf_stack_build_id (*value)[PERF_MAX_STACK_DEPTH];
+	__type(value, stack_trace_t);
 } stack_amap SEC(".maps");
 
 /* taken from /sys/kernel/debug/tracing/events/random/urandom_read/format */
diff --git a/tools/testing/selftests/bpf/progs/test_stacktrace_map.c b/tools/testing/selftests/bpf/progs/test_stacktrace_map.c
index 803c15dc109d..fa0be3e10a10 100644
--- a/tools/testing/selftests/bpf/progs/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/progs/test_stacktrace_map.c
@@ -35,7 +35,7 @@ struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY);
 	__uint(max_entries, 16384);
 	__type(key, __u32);
-	__u64 (*value)[PERF_MAX_STACK_DEPTH];
+	__type(value, stack_trace_t);
 } stack_amap SEC(".maps");
 
 /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
-- 
2.17.1


^ permalink raw reply related

* [PATCH v2 bpf-next 0/3] fix BTF verification size resolution
From: Andrii Nakryiko @ 2019-07-11  6:53 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko

BTF size resolution logic isn't always resolving type size correctly, leading
to erroneous map creation failures due to value size mismatch.

This patch set:
1. fixes the issue (patch #1);
2. adds tests for trickier cases (patch #2);
3. and converts few test cases utilizing BTF-defined maps, that previously
   couldn't use typedef'ed arrays due to kernel bug (patch #3).

Patch #1 can be applied against bpf tree, but selftest ones (#2 and #3) have
to go against bpf-next for now.

Andrii Nakryiko (3):
  bpf: fix BTF verifier size resolution logic
  selftests/bpf: add trickier size resolution tests
  selftests/bpf: use typedef'ed arrays as map values

 kernel/bpf/btf.c                              | 14 ++-
 .../bpf/progs/test_get_stack_rawtp.c          |  3 +-
 .../bpf/progs/test_stacktrace_build_id.c      |  3 +-
 .../selftests/bpf/progs/test_stacktrace_map.c |  2 +-
 tools/testing/selftests/bpf/test_btf.c        | 88 +++++++++++++++++++
 5 files changed, 102 insertions(+), 8 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH v2 bpf-next 1/3] bpf: fix BTF verifier size resolution logic
From: Andrii Nakryiko @ 2019-07-11  6:53 UTC (permalink / raw)
  To: bpf, netdev, ast, daniel, yhs
  Cc: andrii.nakryiko, kernel-team, Andrii Nakryiko, Martin KaFai Lau
In-Reply-To: <20190711065307.2425636-1-andriin@fb.com>

BTF verifier has a size resolution bug which in some circumstances leads to
invalid size resolution for, e.g., TYPEDEF modifier.  This happens if we have
[1] PTR -> [2] TYPEDEF -> [3] ARRAY, in which case due to being in pointer
context ARRAY size won't be resolved (because for pointer it doesn't matter, so
it's a sink in pointer context), but it will be permanently remembered as zero
for TYPEDEF and TYPEDEF will be marked as RESOLVED. Eventually ARRAY size will
be resolved correctly, but TYPEDEF resolved_size won't be updated anymore.
This, subsequently, will lead to erroneous map creation failure, if that
TYPEDEF is specified as either key or value, as key_size/value_size won't
correspond to resolved size of TYPEDEF (kernel will believe it's zero).

Note, that if BTF was ordered as [1] ARRAY <- [2] TYPEDEF <- [3] PTR, this
won't be a problem, as by the time we get to TYPEDEF, ARRAY's size is already
calculated and stored.

This bug manifests itself in rejecting BTF-defined maps that use array
typedef as a value type:

typedef int array_t[16];

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __type(value, array_t); /* i.e., array_t *value; */
} test_map SEC(".maps");

The fix consists on not relying on modifier's resolved_size and instead using
modifier's resolved_id (type ID for "concrete" type to which modifier
eventually resolves) and doing size determination for that resolved type. This
allow to preserve existing "early DFS termination" logic for PTR or
STRUCT_OR_ARRAY contexts, but still do correct size determination for modifier
types.

Fixes: eb3f595dab40 ("bpf: btf: Validate type reference")
Cc: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
---
 kernel/bpf/btf.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index cad09858a5f2..22fe8b155e51 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -1073,11 +1073,18 @@ const struct btf_type *btf_type_id_size(const struct btf *btf,
 				 !btf_type_is_var(size_type)))
 			return NULL;

-		size = btf->resolved_sizes[size_type_id];
 		size_type_id = btf->resolved_ids[size_type_id];
 		size_type = btf_type_by_id(btf, size_type_id);
 		if (btf_type_nosize_or_null(size_type))
 			return NULL;
+		else if (btf_type_has_size(size_type))
+			size = size_type->size;
+		else if (btf_type_is_array(size_type))
+			size = btf->resolved_sizes[size_type_id];
+		else if (btf_type_is_ptr(size_type))
+			size = sizeof(void *);
+		else
+			return NULL;
 	}

 	*type_id = size_type_id;
@@ -1602,7 +1609,6 @@ static int btf_modifier_resolve(struct btf_verifier_env *env,
 	const struct btf_type *next_type;
 	u32 next_type_id = t->type;
 	struct btf *btf = env->btf;
-	u32 next_type_size = 0;

 	next_type = btf_type_by_id(btf, next_type_id);
 	if (!next_type || btf_type_is_resolve_source_only(next_type)) {
@@ -1620,7 +1626,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env,
 	 * save us a few type-following when we use it later (e.g. in
 	 * pretty print).
 	 */
-	if (!btf_type_id_size(btf, &next_type_id, &next_type_size)) {
+	if (!btf_type_id_size(btf, &next_type_id, NULL)) {
 		if (env_type_is_resolved(env, next_type_id))
 			next_type = btf_type_id_resolve(btf, &next_type_id);

@@ -1633,7 +1639,7 @@ static int btf_modifier_resolve(struct btf_verifier_env *env,
 		}
 	}

-	env_stack_pop_resolved(env, next_type_id, next_type_size);
+	env_stack_pop_resolved(env, next_type_id, 0);

 	return 0;
 }
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] net/mlx5e: Move priv variable into case statement in mlx5e_setup_tc
From: Nathan Chancellor @ 2019-07-11  6:09 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller,
	Linux Netdev List, RDMA mailing list, linux-kernel,
	clang-built-linux
In-Reply-To: <CALzJLG9Aw=sVPDiewHr+4Jiuaod_1q=10vzMzCUVg-rCCXD6cQ@mail.gmail.com>

On Wed, Jul 10, 2019 at 11:02:00PM -0700, Saeed Mahameed wrote:
> On Wed, Jul 10, 2019 at 12:05 PM Nathan Chancellor
> <natechancellor@gmail.com> wrote:
> >
> > There is an unused variable warning on arm64 defconfig when
> > CONFIG_MLX5_ESWITCH is unset:
> >
> > drivers/net/ethernet/mellanox/mlx5/core/en_main.c:3467:21: warning:
> > unused variable 'priv' [-Wunused-variable]
> >         struct mlx5e_priv *priv = netdev_priv(dev);
> >                            ^
> > 1 warning generated.
> >
> > Move it down into the case statement where it is used.
> >
> > Fixes: 4e95bc268b91 ("net: flow_offload: add flow_block_cb_setup_simple()")
> > Link: https://github.com/ClangBuiltLinux/linux/issues/597
> > Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 6d0ae87c8ded..651eb714eb5b 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -3464,15 +3464,16 @@ static LIST_HEAD(mlx5e_block_cb_list);
> >  static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
> >                           void *type_data)
> >  {
> > -       struct mlx5e_priv *priv = netdev_priv(dev);
> > -
> >         switch (type) {
> >  #ifdef CONFIG_MLX5_ESWITCH
> > -       case TC_SETUP_BLOCK:
> > +       case TC_SETUP_BLOCK: {
> > +               struct mlx5e_priv *priv = netdev_priv(dev);
> > +
> >                 return flow_block_cb_setup_simple(type_data,
> >                                                   &mlx5e_block_cb_list,
> >                                                   mlx5e_setup_tc_block_cb,
> >                                                   priv, priv, true);
> > +       }
> 
> Hi Nathan,
> 
> We have another patch internally that fixes this, and it is already
> queued up in my queue.
> it works differently as we want to pass priv instead of netdev to
> mlx5e_setup_tc_mqprio below,
> which will also solve warning ..
> 
> So i would like to submit that patch if it is ok with you ?

Hi Saeed,

Whatever works best for you, I just care that the warning gets fixed,
not how it is done :) I wouldn't mind being put on CC so I can pick it
up for my local tests.

Thanks for the follow up!
Nathan

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox