Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Issue with ping source address display
From: 吉藤英明 @ 2016-04-12 22:59 UTC (permalink / raw)
  To: Daniele Orlandi; +Cc: network dev
In-Reply-To: <570D35CF.9070006@orlandi.com>

Hi,

2016-04-13 2:52 GMT+09:00 Daniele Orlandi <daniele@orlandi.com>:
>
> Hello,
>
> More than one year ago I posted the following message but it hasn't
> received a reply, now I've been stung by a similar issue, you may want
> to investigate:
>
>
> I noticed that when ping receives ICMP messages from different sources
> the first IP address is always used and displayed:
>
>
> vihai@seviolab:~$ ping -V
> ping utility, iputils-s20121221
>
> This is a (simulated) flapping route:
>
> vihai@seviolab:~$ ping 10.254.10.140
> PING 10.254.10.140 (10.254.10.140) 56(84) bytes of data.
> From 192.168.1.1 icmp_seq=1 Destination Host Unreachable
> From 192.168.1.1 icmp_seq=2 Destination Host Unreachable
> From 192.168.1.1 icmp_seq=3 Destination Host Unreachable
> 64 bytes from 192.168.1.1: icmp_seq=4 ttl=61 time=24.7 ms
> 64 bytes from 192.168.1.1: icmp_seq=5 ttl=61 time=25.6 ms
> 64 bytes from 192.168.1.1: icmp_seq=6 ttl=61 time=69.6 ms
> From 192.168.1.1 icmp_seq=7 Destination Host Unreachable
> From 192.168.1.1 icmp_seq=8 Destination Host Unreachable
> From 192.168.1.1 icmp_seq=9 Destination Host Unreachable
> ^C
> --- 10.254.10.140 ping statistics ---
> 9 packets transmitted, 3 received, +6 errors, 66% packet loss, time 8001ms
> rtt min/avg/max/mdev = 24.797/40.061/69.692/20.955 ms
>
>
> The sources, however are different:
>
> vihai@seviolab:~$ sudo tcpdump -n icmp
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
> ^[OA^[OA^[OA17:09:55.932981 IP 192.168.1.21 > 10.254.10.140: ICMP echo
> request, id 9278, seq 1, length 64
> 17:09:55.933234 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
> 17:09:56.933169 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 2, length 64
> 17:09:56.933416 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
> 17:09:57.933160 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 3, length 64
> 17:09:57.933404 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
> 17:09:58.933163 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 4, length 64
> 17:09:58.957939 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
> 9278, seq 4, length 64
> 17:09:59.935050 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 5, length 64
> 17:09:59.960724 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
> 9278, seq 5, length 64
> 17:10:00.936177 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 6, length 64
> 17:10:01.005849 IP 10.254.10.140 > 192.168.1.21: ICMP echo reply, id
> 9278, seq 6, length 64
> 17:10:01.936313 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 7, length 64
> 17:10:01.936626 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
> 17:10:02.935321 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 8, length 64
> 17:10:02.935591 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
> 17:10:03.934322 IP 192.168.1.21 > 10.254.10.140: ICMP echo request, id
> 9278, seq 9, length 64
> 17:10:03.934613 IP 192.168.1.1 > 192.168.1.21: ICMP host 10.254.10.140
> unreachable, length 92
>

Thank you for your report.  I'll try fixing it.

--yoshfuji

>
>
>
> Tried with a different ping implementation (RouterOS) and the behaviour
> seems correct:
>
> [vihai@SevioLab SW1] > ping 10.254.10.140
> HOST                                     SIZE TTL TIME  STATUS
> 192.168.1.1                             84  64 0ms   host unreachable
> 192.168.1.1                             84  64 0ms   host unreachable
> 192.168.1.1                             84  64 0ms   host unreachable
> 10.254.10.140                           56  61 20ms
> 10.254.10.140                           56  61 46ms
> 10.254.10.140                           56  61 37ms
> 192.168.1.1                             84  64 0ms   host unreachable
> 192.168.1.1                             84  64 0ms   host unreachable
> 192.168.1.1                             84  64 0ms   host unreachable
>     sent=9 received=3 packet-loss=66% min-rtt=20ms avg-rtt=34ms max-rtt=46ms
>
>
> Recently I was pinging with IPv6, a router in between filtered the
> packet, however the shown source address was not the right one:
>
> root@monitor:~# ping6 -i 0.2 www.google.com
> PING www.google.com(mil01s25-in-x04.1e100.net) 56 data bytes
> From mil01s25-in-x04.1e100.net icmp_seq=1 Destination unreachable: No route
> From mil01s25-in-x04.1e100.net icmp_seq=2 Destination unreachable: No route
> From mil01s25-in-x04.1e100.net icmp_seq=3 Destination unreachable: No route
>
> 19:33:19.589285 IP6 2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de >
> mil01s25-in-x04.1e100.net: ICMP6, echo request, seq 1, length 64
> 19:33:19.611666 IP6 mix-br2.intercom.it >
> 2a01:2d8:aca0:fce:944e:c8ff:fe4d:96de: ICMP6, destination unreachable,
> unreachable route mil01s25-in-x04.1e100.net, length 112
>
>
> Bye,
>

^ permalink raw reply

* [PATCH iproute2] ss: 64bit inode numbers
From: Eric Dumazet @ 2016-04-12 22:22 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Lets prepare for a possibility to have 64bit inode numbers for sockets,
even if the kernel currently enforces 32bit numbers.

Presumably, if both kernel and userland are 64bit (no 32bit emulation),
kernel could switch to 64bit inode numbers soon.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 misc/ss.c |   38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 38cf331..3355da5 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -377,7 +377,7 @@ static FILE *ephemeral_ports_open(void)
 
 struct user_ent {
 	struct user_ent	*next;
-	unsigned int	ino;
+	__u64		ino;
 	int		pid;
 	int		fd;
 	char		*process;
@@ -388,17 +388,18 @@ struct user_ent {
 #define USER_ENT_HASH_SIZE	256
 struct user_ent *user_ent_hash[USER_ENT_HASH_SIZE];
 
-static int user_ent_hashfn(unsigned int ino)
+static int user_ent_hashfn(__u64 ino)
 {
-	int val = (ino >> 24) ^ (ino >> 16) ^ (ino >> 8) ^ ino;
+	unsigned int val = (unsigned int)ino;
 
+	val ^= (unsigned int)(ino >> 32);
+	val ^= (val >> 16);
+	val ^= (val >> 8);
 	return val & (USER_ENT_HASH_SIZE - 1);
 }
 
-static void user_ent_add(unsigned int ino, char *process,
-					int pid, int fd,
-					char *proc_ctx,
-					char *sock_ctx)
+static void user_ent_add(__u64 ino, char *process, int pid, int fd,
+			 char *proc_ctx, char *sock_ctx)
 {
 	struct user_ent *p, **pp;
 
@@ -494,7 +495,7 @@ static void user_ent_hash_build(void)
 
 		while ((d1 = readdir(dir1)) != NULL) {
 			const char *pattern = "socket:[";
-			unsigned int ino;
+			__u64 ino;
 			char lnk[64];
 			int fd;
 			ssize_t link_len;
@@ -513,7 +514,7 @@ static void user_ent_hash_build(void)
 			if (strncmp(lnk, pattern, strlen(pattern)))
 				continue;
 
-			sscanf(lnk, "socket:[%u]", &ino);
+			sscanf(lnk, "socket:[%llu]", &ino);
 
 			snprintf(tmp, sizeof(tmp), "%s/%d/fd/%s",
 					root, pid, d1->d_name);
@@ -549,7 +550,7 @@ enum entry_types {
 };
 
 #define ENTRY_BUF_SIZE 512
-static int find_entry(unsigned int ino, char **buf, int type)
+static int find_entry(__u64 ino, char **buf, int type)
 {
 	struct user_ent *p;
 	int cnt = 0;
@@ -728,10 +729,10 @@ struct sockstat {
 	int		    rport;
 	int		    state;
 	int		    rq, wq;
-	unsigned int ino;
-	unsigned int uid;
+	unsigned int	    uid;
 	int		    refcnt;
 	unsigned int	    iface;
+	__u64		    ino;
 	unsigned long long  sk;
 	char *name;
 	char *peer_name;
@@ -801,7 +802,7 @@ static void sock_details_print(struct sockstat *s)
 	if (s->uid)
 		printf(" uid:%u", s->uid);
 
-	printf(" ino:%u", s->ino);
+	printf(" ino:%llu", s->ino);
 	printf(" sk:%llx", s->sk);
 }
 
@@ -1799,7 +1800,7 @@ static int tcp_show_line(char *line, const struct filter *f, int family)
 		return 0;
 
 	opt[0] = 0;
-	n = sscanf(data, "%x %x:%x %x:%x %x %d %d %u %d %llx %d %d %d %d %d %[^\n]\n",
+	n = sscanf(data, "%x %x:%x %x:%x %x %d %d %llu %d %llx %d %d %d %d %d %[^\n]\n",
 		   &s.ss.state, &s.ss.wq, &s.ss.rq,
 		   &s.timer, &s.timeout, &s.retrans, &s.ss.uid, &s.probes,
 		   &s.ss.ino, &s.ss.refcnt, &s.ss.sk, &rto, &ato, &s.qack, &s.cwnd,
@@ -2481,7 +2482,7 @@ static int dgram_show_line(char *line, const struct filter *f, int family)
 		return 0;
 
 	opt[0] = 0;
-	n = sscanf(data, "%x %x:%x %*x:%*x %*x %d %*d %u %d %llx %[^\n]\n",
+	n = sscanf(data, "%x %x:%x %*x:%*x %*x %d %*d %llu %d %llx %[^\n]\n",
 	       &s.state, &s.wq, &s.rq,
 	       &s.uid, &s.ino,
 	       &s.refcnt, &s.sk, opt);
@@ -2829,7 +2830,7 @@ static int unix_show(struct filter *f)
 		u->name = NULL;
 		u->peer_name = NULL;
 
-		if (sscanf(buf, "%x: %x %x %x %x %x %d %s",
+		if (sscanf(buf, "%x: %x %x %x %x %x %llu %s",
 			   &u->rport, &u->rq, &u->wq, &flags, &u->type,
 			   &u->state, &u->ino, name) < 8)
 			name[0] = 0;
@@ -3085,9 +3086,10 @@ static int packet_show_line(char *buf, const struct filter *f, int fam)
 {
 	unsigned long long sk;
 	struct sockstat stat = {};
-	int type, prot, iface, state, rq, uid, ino;
+	int type, prot, iface, state, rq, uid;
+	__u64 ino;
 
-	sscanf(buf, "%llx %*d %d %x %d %d %u %u %u",
+	sscanf(buf, "%llx %*d %d %x %d %d %u %u %llu",
 			&sk,
 			&type, &prot, &iface, &state,
 			&rq, &uid, &ino);

^ permalink raw reply related

* Re: [PATCH] mwifiex: fix possible NULL dereference
From: Andy Shevchenko @ 2016-04-12 22:15 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Sudip Mukherjee, Amitkumar Karwar, Nishant Sarmukadam, Kalle Valo,
	linux-kernel@vger.kernel.org, open list:TI WILINK WIRELES...,
	netdev, Sudip Mukherjee
In-Reply-To: <867653EE-136B-488D-8F85-716B1ED9BD1A@intel.com>

On Tue, Apr 12, 2016 at 8:43 PM, Rustad, Mark D <mark.d.rustad@intel.com> wrote:
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
>
>> On Mon, Apr 11, 2016 at 6:27 PM, Sudip Mukherjee
>> <sudipm.mukherjee@gmail.com> wrote:
>>>
>>> From: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
>>>
>>> We have a check for card just after dereferencing it. So if it is NULL
>>> we have already dereferenced it before its check. Lets dereference it
>>> after checking card for NULL.

>>> --- a/drivers/net/wireless/marvell/mwifiex/pcie.c
>>> +++ b/drivers/net/wireless/marvell/mwifiex/pcie.c
>>> @@ -2884,10 +2884,11 @@ static void mwifiex_unregister_dev(struct
>>> mwifiex_adapter *adapter)
>>>  {
>>>         struct pcie_service_card *card = adapter->card;>>
>>
>> Let's say it's 0.
>>
>>>        const struct mwifiex_pcie_card_reg *reg;
>>> -       struct pci_dev *pdev = card->dev;>>
>>
>> This would be equal to offset of dev member in pcie_service_card struct.
>>
>> Nothing wrong here.
>
> Actually, that is not true. The dereference of card tells the compiler that
> card can't be NULL, so it is free to eliminate the check below.
> Unbelievably, this can even happen for a reference such as &ptr->thing where
> the pointer isn't even actually dereferenced at all!

Hmm... Can we look at the result assembly? If I'm not mistaken,
compiler wouldn't even try to calculate pdev pointer before first use
of it.


>
>>> +       struct pci_dev *pdev;
>>>         int i;
>>>
>>>         if (card) {
>>> +               pdev = card->dev;
>>>                 if (card->msix_enable) {
>>>                         for (i = 0; i < MWIFIEX_NUM_MSIX_VECTORS; i++)


-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply

* [PATCH net-next 5/5] bpf, samples: add test cases for raw stack
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1460497071.git.daniel@iogearbox.net>

This adds test cases mostly around ARG_PTR_TO_RAW_STACK to check the
verifier behaviour.

  [...]
  #84 raw_stack: no skb_load_bytes OK
  #85 raw_stack: skb_load_bytes, no init OK
  #86 raw_stack: skb_load_bytes, init OK
  #87 raw_stack: skb_load_bytes, spilled regs around bounds OK
  #88 raw_stack: skb_load_bytes, spilled regs corruption OK
  #89 raw_stack: skb_load_bytes, spilled regs corruption 2 OK
  #90 raw_stack: skb_load_bytes, spilled regs + data OK
  #91 raw_stack: skb_load_bytes, invalid access 1 OK
  #92 raw_stack: skb_load_bytes, invalid access 2 OK
  #93 raw_stack: skb_load_bytes, invalid access 3 OK
  #94 raw_stack: skb_load_bytes, invalid access 4 OK
  #95 raw_stack: skb_load_bytes, invalid access 5 OK
  #96 raw_stack: skb_load_bytes, invalid access 6 OK
  #97 raw_stack: skb_load_bytes, large access OK
  Summary: 98 PASSED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/test_verifier.c | 268 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 268 insertions(+)

diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
index 4b51a90..9eba8d1 100644
--- a/samples/bpf/test_verifier.c
+++ b/samples/bpf/test_verifier.c
@@ -309,6 +309,19 @@ static struct bpf_test tests[] = {
 		.result_unpriv = REJECT,
 	},
 	{
+		"check valid spill/fill, skb mark",
+		.insns = {
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_1),
+			BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_6, -8),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.result_unpriv = ACCEPT,
+	},
+	{
 		"check corrupted spill/fill",
 		.insns = {
 			/* spill R1(ctx) into stack */
@@ -1180,6 +1193,261 @@ static struct bpf_test tests[] = {
 		.result_unpriv = REJECT,
 		.result = ACCEPT,
 	},
+	{
+		"raw_stack: no skb_load_bytes",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			/* Call to skb_load_bytes() omitted. */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid read from stack off -8+0 size 8",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, no init",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, init",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+			BPF_ST_MEM(BPF_DW, BPF_REG_6, 0, 0xcafe),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, spilled regs around bounds",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -16),
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, -8), /* spill ctx from R1 */
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1,  8), /* spill ctx from R1 */
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, -8), /* fill ctx into R0 */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,  8), /* fill ctx into R2 */
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2,
+				    offsetof(struct __sk_buff, priority)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_2),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, spilled regs corruption",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -8),
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0), /* spill ctx from R1 */
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0), /* fill ctx into R0 */
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "R0 invalid mem access 'inv'",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, spilled regs corruption 2",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -16),
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, -8), /* spill ctx from R1 */
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1,  0), /* spill ctx from R1 */
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1,  8), /* spill ctx from R1 */
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, -8), /* fill ctx into R0 */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,  8), /* fill ctx into R2 */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_6,  0), /* fill ctx into R3 */
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2,
+				    offsetof(struct __sk_buff, priority)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_2),
+			BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_3,
+				    offsetof(struct __sk_buff, pkt_type)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_3),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "R3 invalid mem access 'inv'",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, spilled regs + data",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -16),
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, -8), /* spill ctx from R1 */
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1,  0), /* spill ctx from R1 */
+			BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1,  8), /* spill ctx from R1 */
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, -8), /* fill ctx into R0 */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6,  8), /* fill ctx into R2 */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_6,  0), /* fill data into R3 */
+			BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0,
+				    offsetof(struct __sk_buff, mark)),
+			BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2,
+				    offsetof(struct __sk_buff, priority)),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_2),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_3),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 1",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -513),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-513 access_size=8",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 2",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -1),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 8),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-1 access_size=8",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 3",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 0xffffffff),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 0xffffffff),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-1 access_size=-1",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 4",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -1),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 0x7fffffff),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-1 access_size=2147483647",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 5",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -512),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 0x7fffffff),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-512 access_size=2147483647",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, invalid access 6",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -512),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 0),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = REJECT,
+		.errstr = "invalid stack type R3 off=-512 access_size=0",
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
+	{
+		"raw_stack: skb_load_bytes, large access",
+		.insns = {
+			BPF_MOV64_IMM(BPF_REG_2, 4),
+			BPF_ALU64_REG(BPF_MOV, BPF_REG_6, BPF_REG_10),
+			BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, -512),
+			BPF_MOV64_REG(BPF_REG_3, BPF_REG_6),
+			BPF_MOV64_IMM(BPF_REG_4, 512),
+			BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_skb_load_bytes),
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 0),
+			BPF_EXIT_INSN(),
+		},
+		.result = ACCEPT,
+		.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+	},
 };
 
 static int probe_filter_length(struct bpf_insn *fp)
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 4/5] bpf, samples: don't zero data when not needed
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1460497071.git.daniel@iogearbox.net>

Remove the zero initialization in the sample programs where appropriate.
Note that this is an optimization which is now possible, old programs
still doing the zero initialization are just fine as well. Also, make
sure we don't have padding issues when we don't memset() the entire
struct anymore.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 samples/bpf/offwaketime_kern.c | 10 ++++++----
 samples/bpf/tracex1_kern.c     |  4 +---
 samples/bpf/tracex2_kern.c     |  4 ++--
 samples/bpf/tracex5_kern.c     |  6 +++---
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c
index 983629a..e7d9a0a 100644
--- a/samples/bpf/offwaketime_kern.c
+++ b/samples/bpf/offwaketime_kern.c
@@ -11,7 +11,7 @@
 #include <linux/version.h>
 #include <linux/sched.h>
 
-#define _(P) ({typeof(P) val = 0; bpf_probe_read(&val, sizeof(val), &P); val;})
+#define _(P) ({typeof(P) val; bpf_probe_read(&val, sizeof(val), &P); val;})
 
 #define MINBLOCK_US	1
 
@@ -61,7 +61,7 @@ SEC("kprobe/try_to_wake_up")
 int waker(struct pt_regs *ctx)
 {
 	struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
-	struct wokeby_t woke = {};
+	struct wokeby_t woke;
 	u32 pid;
 
 	pid = _(p->pid);
@@ -75,17 +75,19 @@ int waker(struct pt_regs *ctx)
 
 static inline int update_counts(void *ctx, u32 pid, u64 delta)
 {
-	struct key_t key = {};
 	struct wokeby_t *woke;
 	u64 zero = 0, *val;
+	struct key_t key;
 
+	__builtin_memset(&key.waker, 0, sizeof(key.waker));
 	bpf_get_current_comm(&key.target, sizeof(key.target));
 	key.tret = bpf_get_stackid(ctx, &stackmap, STACKID_FLAGS);
+	key.wret = 0;
 
 	woke = bpf_map_lookup_elem(&wokeby, &pid);
 	if (woke) {
 		key.wret = woke->ret;
-		__builtin_memcpy(&key.waker, woke->name, TASK_COMM_LEN);
+		__builtin_memcpy(&key.waker, woke->name, sizeof(key.waker));
 		bpf_map_delete_elem(&wokeby, &pid);
 	}
 
diff --git a/samples/bpf/tracex1_kern.c b/samples/bpf/tracex1_kern.c
index 3f450a8..107da14 100644
--- a/samples/bpf/tracex1_kern.c
+++ b/samples/bpf/tracex1_kern.c
@@ -23,16 +23,14 @@ int bpf_prog1(struct pt_regs *ctx)
 	/* attaches to kprobe netif_receive_skb,
 	 * looks for packets on loobpack device and prints them
 	 */
-	char devname[IFNAMSIZ] = {};
+	char devname[IFNAMSIZ];
 	struct net_device *dev;
 	struct sk_buff *skb;
 	int len;
 
 	/* non-portable! works for the given kernel only */
 	skb = (struct sk_buff *) PT_REGS_PARM1(ctx);
-
 	dev = _(skb->dev);
-
 	len = _(skb->len);
 
 	bpf_probe_read(devname, sizeof(devname), dev->name);
diff --git a/samples/bpf/tracex2_kern.c b/samples/bpf/tracex2_kern.c
index 6d6eefd..5e11c20 100644
--- a/samples/bpf/tracex2_kern.c
+++ b/samples/bpf/tracex2_kern.c
@@ -66,7 +66,7 @@ struct hist_key {
 	char comm[16];
 	u64 pid_tgid;
 	u64 uid_gid;
-	u32 index;
+	u64 index;
 };
 
 struct bpf_map_def SEC("maps") my_hist_map = {
@@ -82,7 +82,7 @@ int bpf_prog3(struct pt_regs *ctx)
 	long write_size = PT_REGS_PARM3(ctx);
 	long init_val = 1;
 	long *value;
-	struct hist_key key = {};
+	struct hist_key key;
 
 	key.index = log2l(write_size);
 	key.pid_tgid = bpf_get_current_pid_tgid();
diff --git a/samples/bpf/tracex5_kern.c b/samples/bpf/tracex5_kern.c
index b3f4295..f95f232 100644
--- a/samples/bpf/tracex5_kern.c
+++ b/samples/bpf/tracex5_kern.c
@@ -22,7 +22,7 @@ struct bpf_map_def SEC("maps") progs = {
 SEC("kprobe/seccomp_phase1")
 int bpf_prog1(struct pt_regs *ctx)
 {
-	struct seccomp_data sd = {};
+	struct seccomp_data sd;
 
 	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM1(ctx));
 
@@ -40,7 +40,7 @@ int bpf_prog1(struct pt_regs *ctx)
 /* we jump here when syscall number == __NR_write */
 PROG(__NR_write)(struct pt_regs *ctx)
 {
-	struct seccomp_data sd = {};
+	struct seccomp_data sd;
 
 	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM1(ctx));
 	if (sd.args[2] == 512) {
@@ -53,7 +53,7 @@ PROG(__NR_write)(struct pt_regs *ctx)
 
 PROG(__NR_read)(struct pt_regs *ctx)
 {
-	struct seccomp_data sd = {};
+	struct seccomp_data sd;
 
 	bpf_probe_read(&sd, sizeof(sd), (void *)PT_REGS_PARM1(ctx));
 	if (sd.args[2] > 128 && sd.args[2] <= 1024) {
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 3/5] bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1460497071.git.daniel@iogearbox.net>

This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
also be removed since the verifier already makes sure we stay within bounds
on stack buffers.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/helpers.c     | 17 +++++++++++----
 kernel/trace/bpf_trace.c | 10 ++++++---
 net/core/filter.c        | 57 +++++++++++++++++++++++++++++++++---------------
 3 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 50da680..ad7a057 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -163,17 +163,26 @@ static u64 bpf_get_current_comm(u64 r1, u64 size, u64 r3, u64 r4, u64 r5)
 	struct task_struct *task = current;
 	char *buf = (char *) (long) r1;
 
-	if (!task)
-		return -EINVAL;
+	if (unlikely(!task))
+		goto err_clear;
 
-	strlcpy(buf, task->comm, min_t(size_t, size, sizeof(task->comm)));
+	strncpy(buf, task->comm, size);
+
+	/* Verifier guarantees that size > 0. For task->comm exceeding
+	 * size, guarantee that buf is %NUL-terminated. Unconditionally
+	 * done here to save the size test.
+	 */
+	buf[size - 1] = 0;
 	return 0;
+err_clear:
+	memset(buf, 0, size);
+	return -EINVAL;
 }
 
 const struct bpf_func_proto bpf_get_current_comm_proto = {
 	.func		= bpf_get_current_comm,
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_STACK,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
 	.arg2_type	= ARG_CONST_STACK_SIZE,
 };
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 413ec56..6855878 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -62,17 +62,21 @@ EXPORT_SYMBOL_GPL(trace_call_bpf);
 static u64 bpf_probe_read(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 {
 	void *dst = (void *) (long) r1;
-	int size = (int) r2;
+	int ret, size = (int) r2;
 	void *unsafe_ptr = (void *) (long) r3;
 
-	return probe_kernel_read(dst, unsafe_ptr, size);
+	ret = probe_kernel_read(dst, unsafe_ptr, size);
+	if (unlikely(ret < 0))
+		memset(dst, 0, size);
+
+	return ret;
 }
 
 static const struct bpf_func_proto bpf_probe_read_proto = {
 	.func		= bpf_probe_read,
 	.gpl_only	= true,
 	.ret_type	= RET_INTEGER,
-	.arg1_type	= ARG_PTR_TO_STACK,
+	.arg1_type	= ARG_PTR_TO_RAW_STACK,
 	.arg2_type	= ARG_CONST_STACK_SIZE,
 	.arg3_type	= ARG_ANYTHING,
 };
diff --git a/net/core/filter.c b/net/core/filter.c
index e8486ba..5d2ac2b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1409,16 +1409,19 @@ static u64 bpf_skb_load_bytes(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
 	unsigned int len = (unsigned int) r4;
 	void *ptr;
 
-	if (unlikely((u32) offset > 0xffff || len > MAX_BPF_STACK))
-		return -EFAULT;
+	if (unlikely((u32) offset > 0xffff))
+		goto err_clear;
 
 	ptr = skb_header_pointer(skb, offset, len, to);
 	if (unlikely(!ptr))
-		return -EFAULT;
+		goto err_clear;
 	if (ptr != to)
 		memcpy(to, ptr, len);
 
 	return 0;
+err_clear:
+	memset(to, 0, len);
+	return -EFAULT;
 }
 
 static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
@@ -1427,7 +1430,7 @@ static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
-	.arg3_type	= ARG_PTR_TO_STACK,
+	.arg3_type	= ARG_PTR_TO_RAW_STACK,
 	.arg4_type	= ARG_CONST_STACK_SIZE,
 };
 
@@ -1756,12 +1759,19 @@ static u64 bpf_skb_get_tunnel_key(u64 r1, u64 r2, u64 size, u64 flags, u64 r5)
 	struct bpf_tunnel_key *to = (struct bpf_tunnel_key *) (long) r2;
 	const struct ip_tunnel_info *info = skb_tunnel_info(skb);
 	u8 compat[sizeof(struct bpf_tunnel_key)];
+	void *to_orig = to;
+	int err;
 
-	if (unlikely(!info || (flags & ~(BPF_F_TUNINFO_IPV6))))
-		return -EINVAL;
-	if (ip_tunnel_info_af(info) != bpf_tunnel_key_af(flags))
-		return -EPROTO;
+	if (unlikely(!info || (flags & ~(BPF_F_TUNINFO_IPV6)))) {
+		err = -EINVAL;
+		goto err_clear;
+	}
+	if (ip_tunnel_info_af(info) != bpf_tunnel_key_af(flags)) {
+		err = -EPROTO;
+		goto err_clear;
+	}
 	if (unlikely(size != sizeof(struct bpf_tunnel_key))) {
+		err = -EINVAL;
 		switch (size) {
 		case offsetof(struct bpf_tunnel_key, tunnel_label):
 		case offsetof(struct bpf_tunnel_key, tunnel_ext):
@@ -1771,12 +1781,12 @@ static u64 bpf_skb_get_tunnel_key(u64 r1, u64 r2, u64 size, u64 flags, u64 r5)
 			 * a common path later on.
 			 */
 			if (ip_tunnel_info_af(info) != AF_INET)
-				return -EINVAL;
+				goto err_clear;
 set_compat:
 			to = (struct bpf_tunnel_key *)compat;
 			break;
 		default:
-			return -EINVAL;
+			goto err_clear;
 		}
 	}
 
@@ -1793,9 +1803,12 @@ set_compat:
 	}
 
 	if (unlikely(size != sizeof(struct bpf_tunnel_key)))
-		memcpy((void *)(long) r2, to, size);
+		memcpy(to_orig, to, size);
 
 	return 0;
+err_clear:
+	memset(to_orig, 0, size);
+	return err;
 }
 
 static const struct bpf_func_proto bpf_skb_get_tunnel_key_proto = {
@@ -1803,7 +1816,7 @@ static const struct bpf_func_proto bpf_skb_get_tunnel_key_proto = {
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
-	.arg2_type	= ARG_PTR_TO_STACK,
+	.arg2_type	= ARG_PTR_TO_RAW_STACK,
 	.arg3_type	= ARG_CONST_STACK_SIZE,
 	.arg4_type	= ARG_ANYTHING,
 };
@@ -1813,16 +1826,26 @@ static u64 bpf_skb_get_tunnel_opt(u64 r1, u64 r2, u64 size, u64 r4, u64 r5)
 	struct sk_buff *skb = (struct sk_buff *) (long) r1;
 	u8 *to = (u8 *) (long) r2;
 	const struct ip_tunnel_info *info = skb_tunnel_info(skb);
+	int err;
 
 	if (unlikely(!info ||
-		     !(info->key.tun_flags & TUNNEL_OPTIONS_PRESENT)))
-		return -ENOENT;
-	if (unlikely(size < info->options_len))
-		return -ENOMEM;
+		     !(info->key.tun_flags & TUNNEL_OPTIONS_PRESENT))) {
+		err = -ENOENT;
+		goto err_clear;
+	}
+	if (unlikely(size < info->options_len)) {
+		err = -ENOMEM;
+		goto err_clear;
+	}
 
 	ip_tunnel_info_opts_get(to, info);
+	if (size > info->options_len)
+		memset(to + info->options_len, 0, size - info->options_len);
 
 	return info->options_len;
+err_clear:
+	memset(to, 0, size);
+	return err;
 }
 
 static const struct bpf_func_proto bpf_skb_get_tunnel_opt_proto = {
@@ -1830,7 +1853,7 @@ static const struct bpf_func_proto bpf_skb_get_tunnel_opt_proto = {
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
-	.arg2_type	= ARG_PTR_TO_STACK,
+	.arg2_type	= ARG_PTR_TO_RAW_STACK,
 	.arg3_type	= ARG_CONST_STACK_SIZE,
 };
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 2/5] bpf, verifier: add ARG_PTR_TO_RAW_STACK type
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1460497071.git.daniel@iogearbox.net>

When passing buffers from eBPF stack space into a helper function, we have
ARG_PTR_TO_STACK argument type for helpers available. The verifier makes sure
that such buffers are initialized, within boundaries, etc.

However, the downside with this is that we have a couple of helper functions
such as bpf_skb_load_bytes() that fill out the passed buffer in the expected
success case anyway, so zero initializing them prior to the helper call is
unneeded/wasted instructions in the eBPF program that can be avoided.

Therefore, add a new helper function argument type called ARG_PTR_TO_RAW_STACK.
The idea is to skip the STACK_MISC check in check_stack_boundary() and color
the related stack slots as STACK_MISC after we checked all call arguments.

Helper functions using ARG_PTR_TO_RAW_STACK must make sure that every path of
the helper function will fill the provided buffer area, so that we cannot leak
any uninitialized stack memory. This f.e. means that error paths need to
memset() the buffers, but the expected fast-path doesn't have to do this
anymore.

Since there's no such helper needing more than at most one ARG_PTR_TO_RAW_STACK
argument, we can keep it simple and don't need to check for multiple areas.
Should in future such a use-case really appear, we have check_raw_mode() that
will make sure we implement support for it first.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h   |  5 +++++
 kernel/bpf/verifier.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b2365a6..5fb3c61 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -66,6 +66,11 @@ enum bpf_arg_type {
 	 * functions that access data on eBPF program stack
 	 */
 	ARG_PTR_TO_STACK,	/* any pointer to eBPF program stack */
+	ARG_PTR_TO_RAW_STACK,	/* any pointer to eBPF program stack, area does not
+				 * need to be initialized, helper function must fill
+				 * all bytes or clear them in error case.
+				 */
+
 	ARG_CONST_STACK_SIZE,	/* number of bytes accessed from stack */
 	ARG_CONST_STACK_SIZE_OR_ZERO, /* number of bytes accessed from stack or 0 */
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 202f8f7..9c843a5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -207,6 +207,9 @@ struct verifier_env {
 
 struct bpf_call_arg_meta {
 	struct bpf_map *map_ptr;
+	bool raw_mode;
+	int regno;
+	int access_size;
 };
 
 /* verbose verifier prints what it's seeing
@@ -789,7 +792,8 @@ static int check_xadd(struct verifier_env *env, struct bpf_insn *insn)
  * and all elements of stack are initialized
  */
 static int check_stack_boundary(struct verifier_env *env, int regno,
-				int access_size, bool zero_size_allowed)
+				int access_size, bool zero_size_allowed,
+				struct bpf_call_arg_meta *meta)
 {
 	struct verifier_state *state = &env->cur_state;
 	struct reg_state *regs = state->regs;
@@ -815,6 +819,12 @@ static int check_stack_boundary(struct verifier_env *env, int regno,
 		return -EACCES;
 	}
 
+	if (meta && meta->raw_mode) {
+		meta->access_size = access_size;
+		meta->regno = regno;
+		return 0;
+	}
+
 	for (i = 0; i < access_size; i++) {
 		if (state->stack_slot_type[MAX_BPF_STACK + off + i] != STACK_MISC) {
 			verbose("invalid indirect read from stack off %d+%d size %d\n",
@@ -859,7 +869,8 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 		expected_type = CONST_PTR_TO_MAP;
 	} else if (arg_type == ARG_PTR_TO_CTX) {
 		expected_type = PTR_TO_CTX;
-	} else if (arg_type == ARG_PTR_TO_STACK) {
+	} else if (arg_type == ARG_PTR_TO_STACK ||
+		   arg_type == ARG_PTR_TO_RAW_STACK) {
 		expected_type = PTR_TO_STACK;
 		/* One exception here. In case function allows for NULL to be
 		 * passed in as argument, it's a CONST_IMM type. Final test
@@ -867,6 +878,7 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 		 */
 		if (reg->type == CONST_IMM && reg->imm == 0)
 			expected_type = CONST_IMM;
+		meta->raw_mode = arg_type == ARG_PTR_TO_RAW_STACK;
 	} else {
 		verbose("unsupported arg_type %d\n", arg_type);
 		return -EFAULT;
@@ -896,7 +908,7 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 		err = check_stack_boundary(env, regno, meta->map_ptr->key_size,
-					   false);
+					   false, NULL);
 	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
 		/* bpf_map_xxx(..., map_ptr, ..., value) call:
 		 * check [value, value + map->value_size) validity
@@ -907,7 +919,8 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 		err = check_stack_boundary(env, regno,
-					   meta->map_ptr->value_size, false);
+					   meta->map_ptr->value_size,
+					   false, NULL);
 	} else if (arg_type == ARG_CONST_STACK_SIZE ||
 		   arg_type == ARG_CONST_STACK_SIZE_OR_ZERO) {
 		bool zero_size_allowed = (arg_type == ARG_CONST_STACK_SIZE_OR_ZERO);
@@ -922,7 +935,7 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 			return -EACCES;
 		}
 		err = check_stack_boundary(env, regno - 1, reg->imm,
-					   zero_size_allowed);
+					   zero_size_allowed, meta);
 	}
 
 	return err;
@@ -953,6 +966,24 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 	return 0;
 }
 
+static int check_raw_mode(const struct bpf_func_proto *fn)
+{
+	int count = 0;
+
+	if (fn->arg1_type == ARG_PTR_TO_RAW_STACK)
+		count++;
+	if (fn->arg2_type == ARG_PTR_TO_RAW_STACK)
+		count++;
+	if (fn->arg3_type == ARG_PTR_TO_RAW_STACK)
+		count++;
+	if (fn->arg4_type == ARG_PTR_TO_RAW_STACK)
+		count++;
+	if (fn->arg5_type == ARG_PTR_TO_RAW_STACK)
+		count++;
+
+	return count > 1 ? -EINVAL : 0;
+}
+
 static int check_call(struct verifier_env *env, int func_id)
 {
 	struct verifier_state *state = &env->cur_state;
@@ -984,6 +1015,15 @@ static int check_call(struct verifier_env *env, int func_id)
 
 	memset(&meta, 0, sizeof(meta));
 
+	/* We only support one arg being in raw mode at the moment, which
+	 * is sufficient for the helper functions we have right now.
+	 */
+	err = check_raw_mode(fn);
+	if (err) {
+		verbose("kernel subsystem misconfigured func %d\n", func_id);
+		return err;
+	}
+
 	/* check args */
 	err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &meta);
 	if (err)
@@ -1001,6 +1041,15 @@ static int check_call(struct verifier_env *env, int func_id)
 	if (err)
 		return err;
 
+	/* Mark slots with STACK_MISC in case of raw mode, stack offset
+	 * is inferred from register state.
+	 */
+	for (i = 0; i < meta.access_size; i++) {
+		err = check_mem_access(env, meta.regno, i, BPF_B, BPF_WRITE, -1);
+		if (err)
+			return err;
+	}
+
 	/* reset caller saved regs */
 	for (i = 0; i < CALLER_SAVED_REGS; i++) {
 		reg = regs + caller_saved[i];
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 1/5] bpf, verifier: add bpf_call_arg_meta for passing meta data
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann
In-Reply-To: <cover.1460497071.git.daniel@iogearbox.net>

Currently, when the verifier checks calls in check_call() function, we
call check_func_arg() for all 5 arguments e.g. to make sure expected types
are correct. In some cases, we collect meta data (here: map pointer) to
perform additional checks such as checking stack boundary on key/value
sizes for subsequent arguments. As we're going to extend the meta data,
add a generic struct bpf_call_arg_meta that we can use for passing into
check_func_arg().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 kernel/bpf/verifier.c | 40 +++++++++++++++++++++++-----------------
 1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6c5d7cd..202f8f7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -205,6 +205,10 @@ struct verifier_env {
 #define BPF_COMPLEXITY_LIMIT_INSNS	65536
 #define BPF_COMPLEXITY_LIMIT_STACK	1024
 
+struct bpf_call_arg_meta {
+	struct bpf_map *map_ptr;
+};
+
 /* verbose verifier prints what it's seeing
  * bpf_check() is called under lock, so no race to access these global vars
  */
@@ -822,7 +826,8 @@ static int check_stack_boundary(struct verifier_env *env, int regno,
 }
 
 static int check_func_arg(struct verifier_env *env, u32 regno,
-			  enum bpf_arg_type arg_type, struct bpf_map **mapp)
+			  enum bpf_arg_type arg_type,
+			  struct bpf_call_arg_meta *meta)
 {
 	struct reg_state *reg = env->cur_state.regs + regno;
 	enum bpf_reg_type expected_type;
@@ -875,14 +880,13 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 
 	if (arg_type == ARG_CONST_MAP_PTR) {
 		/* bpf_map_xxx(map_ptr) call: remember that map_ptr */
-		*mapp = reg->map_ptr;
-
+		meta->map_ptr = reg->map_ptr;
 	} else if (arg_type == ARG_PTR_TO_MAP_KEY) {
 		/* bpf_map_xxx(..., map_ptr, ..., key) call:
 		 * check that [key, key + map->key_size) are within
 		 * stack limits and initialized
 		 */
-		if (!*mapp) {
+		if (!meta->map_ptr) {
 			/* in function declaration map_ptr must come before
 			 * map_key, so that it's verified and known before
 			 * we have to check map_key here. Otherwise it means
@@ -891,19 +895,19 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
 			verbose("invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
-		err = check_stack_boundary(env, regno, (*mapp)->key_size,
+		err = check_stack_boundary(env, regno, meta->map_ptr->key_size,
 					   false);
 	} else if (arg_type == ARG_PTR_TO_MAP_VALUE) {
 		/* bpf_map_xxx(..., map_ptr, ..., value) call:
 		 * check [value, value + map->value_size) validity
 		 */
-		if (!*mapp) {
+		if (!meta->map_ptr) {
 			/* kernel subsystem misconfigured verifier */
 			verbose("invalid map_ptr to access map->value\n");
 			return -EACCES;
 		}
-		err = check_stack_boundary(env, regno, (*mapp)->value_size,
-					   false);
+		err = check_stack_boundary(env, regno,
+					   meta->map_ptr->value_size, false);
 	} else if (arg_type == ARG_CONST_STACK_SIZE ||
 		   arg_type == ARG_CONST_STACK_SIZE_OR_ZERO) {
 		bool zero_size_allowed = (arg_type == ARG_CONST_STACK_SIZE_OR_ZERO);
@@ -954,8 +958,8 @@ static int check_call(struct verifier_env *env, int func_id)
 	struct verifier_state *state = &env->cur_state;
 	const struct bpf_func_proto *fn = NULL;
 	struct reg_state *regs = state->regs;
-	struct bpf_map *map = NULL;
 	struct reg_state *reg;
+	struct bpf_call_arg_meta meta;
 	int i, err;
 
 	/* find function prototype */
@@ -978,20 +982,22 @@ static int check_call(struct verifier_env *env, int func_id)
 		return -EINVAL;
 	}
 
+	memset(&meta, 0, sizeof(meta));
+
 	/* check args */
-	err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &map);
+	err = check_func_arg(env, BPF_REG_1, fn->arg1_type, &meta);
 	if (err)
 		return err;
-	err = check_func_arg(env, BPF_REG_2, fn->arg2_type, &map);
+	err = check_func_arg(env, BPF_REG_2, fn->arg2_type, &meta);
 	if (err)
 		return err;
-	err = check_func_arg(env, BPF_REG_3, fn->arg3_type, &map);
+	err = check_func_arg(env, BPF_REG_3, fn->arg3_type, &meta);
 	if (err)
 		return err;
-	err = check_func_arg(env, BPF_REG_4, fn->arg4_type, &map);
+	err = check_func_arg(env, BPF_REG_4, fn->arg4_type, &meta);
 	if (err)
 		return err;
-	err = check_func_arg(env, BPF_REG_5, fn->arg5_type, &map);
+	err = check_func_arg(env, BPF_REG_5, fn->arg5_type, &meta);
 	if (err)
 		return err;
 
@@ -1013,18 +1019,18 @@ static int check_call(struct verifier_env *env, int func_id)
 		 * can check 'value_size' boundary of memory access
 		 * to map element returned from bpf_map_lookup_elem()
 		 */
-		if (map == NULL) {
+		if (meta.map_ptr == NULL) {
 			verbose("kernel subsystem misconfigured verifier\n");
 			return -EINVAL;
 		}
-		regs[BPF_REG_0].map_ptr = map;
+		regs[BPF_REG_0].map_ptr = meta.map_ptr;
 	} else {
 		verbose("unknown return type %d of func %d\n",
 			fn->ret_type, func_id);
 		return -EINVAL;
 	}
 
-	err = check_map_func_compatibility(map, func_id);
+	err = check_map_func_compatibility(meta.map_ptr, func_id);
 	if (err)
 		return err;
 
-- 
1.9.3

^ permalink raw reply related

* [PATCH net-next 0/5] BPF updates
From: Daniel Borkmann @ 2016-04-12 22:10 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, tgraf, bblanco, brendan.d.gregg, netdev,
	Daniel Borkmann

This series adds a new verifier argument type called
ARG_PTR_TO_RAW_STACK and converts related helpers to make
use of it. Basic idea is that we can save init of stack
memory when the helper function is guaranteed to fully
fill out the passed buffer in every path. Series also adds
test cases and converts samples. For more details, please
see individual patches.

Thanks!

Daniel Borkmann (5):
  bpf, verifier: add bpf_call_arg_meta for passing meta data
  bpf, verifier: add ARG_PTR_TO_RAW_STACK type
  bpf: convert relevant helper args to ARG_PTR_TO_RAW_STACK
  bpf, samples: don't zero data when not needed
  bpf, samples: add test cases for raw stack

 include/linux/bpf.h            |   5 +
 kernel/bpf/helpers.c           |  17 ++-
 kernel/bpf/verifier.c          |  97 +++++++++++----
 kernel/trace/bpf_trace.c       |  10 +-
 net/core/filter.c              |  57 ++++++---
 samples/bpf/offwaketime_kern.c |  10 +-
 samples/bpf/test_verifier.c    | 268 +++++++++++++++++++++++++++++++++++++++++
 samples/bpf/tracex1_kern.c     |   4 +-
 samples/bpf/tracex2_kern.c     |   4 +-
 samples/bpf/tracex5_kern.c     |   6 +-
 10 files changed, 421 insertions(+), 57 deletions(-)

-- 
1.9.3

^ permalink raw reply

* Re: [PATCH] sctp: add support for RPS and RFS
From: Tom Herbert @ 2016-04-12 21:58 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Linux Kernel Network Developers, Neil Horman, Vlad Yasevich,
	linux-sctp
In-Reply-To: <20160412215728.GI15005@localhost.localdomain>

On Tue, Apr 12, 2016 at 2:57 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> On Tue, Apr 12, 2016 at 02:50:45PM -0700, Tom Herbert wrote:
>> On Tue, Apr 12, 2016 at 2:11 PM, Marcelo Ricardo Leitner
>> <marcelo.leitner@gmail.com> wrote:
>> > This patch adds what's missing to properly support RPS and RFS on SCTP,
>> > as some of it is already implemented in common calls.
>> >
>> > Having support for RPS and RFS allows better scaling specially because
>> > not all NICs support hashing SCTP headers.
>> >
>> > Save the hash right when we dequeue a skb from inqueue so we do it only
>> > once per skb instead of per chunk. New sockets will then inherit the
>> > hash through sctp_copy_sock().
>> >
>> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
>> > ---
>> >  net/sctp/inqueue.c | 3 +++
>> >  net/sctp/socket.c  | 3 +++
>> >  2 files changed, 6 insertions(+)
>> >
>> > diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
>> > index 7e8a16c77039e1b70ef89f3e862dbb332bcc614f..b335ffcef0b901b0e71bf0843057dbf5877da31b 100644
>> > --- a/net/sctp/inqueue.c
>> > +++ b/net/sctp/inqueue.c
>> > @@ -163,6 +163,9 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
>> >                 chunk->singleton = 1;
>> >                 ch = (sctp_chunkhdr_t *) chunk->skb->data;
>> >                 chunk->data_accepted = 0;
>> > +
>> > +               if (chunk->asoc)
>> > +                       sock_rps_save_rxhash(chunk->asoc->base.sk, chunk->skb);
>> >         }
>> >
>> >         chunk->chunk_hdr = ch;
>> > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
>> > index 878d28eda1a68dc639d7b9f3f663d7518f21bb32..36697f85ce48be39f41ddedd9f53369c7f9e28d8 100644
>> > --- a/net/sctp/socket.c
>> > +++ b/net/sctp/socket.c
>> > @@ -6430,6 +6430,8 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
>> >
>> >         poll_wait(file, sk_sleep(sk), wait);
>> >
>> > +       sock_rps_record_flow(sk);
>> > +
>> >         /* A TCP-style listening socket becomes readable when the accept queue
>> >          * is not empty.
>> >          */
>> > @@ -7186,6 +7188,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
>> >         newsk->sk_lingertime = sk->sk_lingertime;
>> >         newsk->sk_rcvtimeo = sk->sk_rcvtimeo;
>> >         newsk->sk_sndtimeo = sk->sk_sndtimeo;
>> > +       newsk->sk_rxhash = sk->sk_rxhash;
>> >
>> >         newinet = inet_sk(newsk);
>> >
>> HI Marcelo,
>>
>> sock_rps_record_flow should probably be in sctp_poll also (like it is
>> in tcp_poll).
>>
>> Thanks,
>> Tom
>
> Hi Tom,
>
> Yes, that's the middle chunk on this patch, no?
>
> Note that for udp it's done after the actual poll, while for tcp it's
> being saved before it. I went more udp-style on this one.
>

Yes, I missed that.

> Thanks,
> Marcelo

^ permalink raw reply

* Re: [PATCH] sctp: add support for RPS and RFS
From: Marcelo Ricardo Leitner @ 2016-04-12 21:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, Neil Horman, Vlad Yasevich,
	linux-sctp
In-Reply-To: <CALx6S35by22eLr-FOFj_RQvOLZm+=cyKYn1MeeD3QQqEEcsRAg@mail.gmail.com>

On Tue, Apr 12, 2016 at 02:50:45PM -0700, Tom Herbert wrote:
> On Tue, Apr 12, 2016 at 2:11 PM, Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> > This patch adds what's missing to properly support RPS and RFS on SCTP,
> > as some of it is already implemented in common calls.
> >
> > Having support for RPS and RFS allows better scaling specially because
> > not all NICs support hashing SCTP headers.
> >
> > Save the hash right when we dequeue a skb from inqueue so we do it only
> > once per skb instead of per chunk. New sockets will then inherit the
> > hash through sctp_copy_sock().
> >
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> > ---
> >  net/sctp/inqueue.c | 3 +++
> >  net/sctp/socket.c  | 3 +++
> >  2 files changed, 6 insertions(+)
> >
> > diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
> > index 7e8a16c77039e1b70ef89f3e862dbb332bcc614f..b335ffcef0b901b0e71bf0843057dbf5877da31b 100644
> > --- a/net/sctp/inqueue.c
> > +++ b/net/sctp/inqueue.c
> > @@ -163,6 +163,9 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
> >                 chunk->singleton = 1;
> >                 ch = (sctp_chunkhdr_t *) chunk->skb->data;
> >                 chunk->data_accepted = 0;
> > +
> > +               if (chunk->asoc)
> > +                       sock_rps_save_rxhash(chunk->asoc->base.sk, chunk->skb);
> >         }
> >
> >         chunk->chunk_hdr = ch;
> > diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> > index 878d28eda1a68dc639d7b9f3f663d7518f21bb32..36697f85ce48be39f41ddedd9f53369c7f9e28d8 100644
> > --- a/net/sctp/socket.c
> > +++ b/net/sctp/socket.c
> > @@ -6430,6 +6430,8 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >
> >         poll_wait(file, sk_sleep(sk), wait);
> >
> > +       sock_rps_record_flow(sk);
> > +
> >         /* A TCP-style listening socket becomes readable when the accept queue
> >          * is not empty.
> >          */
> > @@ -7186,6 +7188,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
> >         newsk->sk_lingertime = sk->sk_lingertime;
> >         newsk->sk_rcvtimeo = sk->sk_rcvtimeo;
> >         newsk->sk_sndtimeo = sk->sk_sndtimeo;
> > +       newsk->sk_rxhash = sk->sk_rxhash;
> >
> >         newinet = inet_sk(newsk);
> >
> HI Marcelo,
> 
> sock_rps_record_flow should probably be in sctp_poll also (like it is
> in tcp_poll).
> 
> Thanks,
> Tom

Hi Tom,

Yes, that's the middle chunk on this patch, no?

Note that for udp it's done after the actual poll, while for tcp it's
being saved before it. I went more udp-style on this one.

Thanks,
Marcelo

^ permalink raw reply

* Re: [PATCH] sctp: add support for RPS and RFS
From: Tom Herbert @ 2016-04-12 21:50 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: Linux Kernel Network Developers, Neil Horman, Vlad Yasevich,
	linux-sctp
In-Reply-To: <d0b676c036479fe304218a05e1b03a7ba274426c.1460495323.git.marcelo.leitner@gmail.com>

On Tue, Apr 12, 2016 at 2:11 PM, Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> This patch adds what's missing to properly support RPS and RFS on SCTP,
> as some of it is already implemented in common calls.
>
> Having support for RPS and RFS allows better scaling specially because
> not all NICs support hashing SCTP headers.
>
> Save the hash right when we dequeue a skb from inqueue so we do it only
> once per skb instead of per chunk. New sockets will then inherit the
> hash through sctp_copy_sock().
>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  net/sctp/inqueue.c | 3 +++
>  net/sctp/socket.c  | 3 +++
>  2 files changed, 6 insertions(+)
>
> diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
> index 7e8a16c77039e1b70ef89f3e862dbb332bcc614f..b335ffcef0b901b0e71bf0843057dbf5877da31b 100644
> --- a/net/sctp/inqueue.c
> +++ b/net/sctp/inqueue.c
> @@ -163,6 +163,9 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
>                 chunk->singleton = 1;
>                 ch = (sctp_chunkhdr_t *) chunk->skb->data;
>                 chunk->data_accepted = 0;
> +
> +               if (chunk->asoc)
> +                       sock_rps_save_rxhash(chunk->asoc->base.sk, chunk->skb);
>         }
>
>         chunk->chunk_hdr = ch;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 878d28eda1a68dc639d7b9f3f663d7518f21bb32..36697f85ce48be39f41ddedd9f53369c7f9e28d8 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -6430,6 +6430,8 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
>
>         poll_wait(file, sk_sleep(sk), wait);
>
> +       sock_rps_record_flow(sk);
> +
>         /* A TCP-style listening socket becomes readable when the accept queue
>          * is not empty.
>          */
> @@ -7186,6 +7188,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
>         newsk->sk_lingertime = sk->sk_lingertime;
>         newsk->sk_rcvtimeo = sk->sk_rcvtimeo;
>         newsk->sk_sndtimeo = sk->sk_sndtimeo;
> +       newsk->sk_rxhash = sk->sk_rxhash;
>
>         newinet = inet_sk(newsk);
>
HI Marcelo,

sock_rps_record_flow should probably be in sctp_poll also (like it is
in tcp_poll).

Thanks,
Tom

> --
> 2.5.0
>

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <1460492999.6473.599.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/12/2016 01:29 PM, Eric Dumazet wrote:
> On Tue, 2016-04-12 at 13:23 -0700, Ben Greear wrote:
>
>> It worked well enough for years that I didn't even know other algorithms were
>> available.  It was broken around 4.0 time, and I reported it to the list,
>> and no one seemed to really care enough to do anything about it.  I changed
>> to reno and ignored the problem as well.
>>
>> It is trivially easy to see the regression when using ath10k NIC, and from this email
>> thread, I guess other NICs have similar issues.
>
> Since it is so trivial, why don't you start a bisection ?

I vaguely remember doing a bisect, but I can't find any email about
that, so maybe I didn't.  At any rate, it is somewhere between 3.17 and 4.0.
 From memory, it was between 3.19 and 4.0, but I am not certain of that.

Neil's suggestion, from the thread below, is that it was likely:  "605ad7f tcp: refine TSO autosizing"

Here is previous email thread:

https://www.mail-archive.com/netdev@vger.kernel.org/msg80803.html

This one has a link to a pcap I made at the time:

https://www.mail-archive.com/netdev@vger.kernel.org/msg80890.html

>
> I asked a capture, I did not say ' switch to Reno or whatever ', right ?
>
> Guessing is nice, but investigating and fixing is better.
>
> Do not assume that nothing can be done, please ?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* [PATCH] sctp: add support for RPS and RFS
From: Marcelo Ricardo Leitner @ 2016-04-12 21:11 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, Vlad Yasevich, linux-sctp

This patch adds what's missing to properly support RPS and RFS on SCTP,
as some of it is already implemented in common calls.

Having support for RPS and RFS allows better scaling specially because
not all NICs support hashing SCTP headers.

Save the hash right when we dequeue a skb from inqueue so we do it only
once per skb instead of per chunk. New sockets will then inherit the
hash through sctp_copy_sock().

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/sctp/inqueue.c | 3 +++
 net/sctp/socket.c  | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 7e8a16c77039e1b70ef89f3e862dbb332bcc614f..b335ffcef0b901b0e71bf0843057dbf5877da31b 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -163,6 +163,9 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
 		chunk->singleton = 1;
 		ch = (sctp_chunkhdr_t *) chunk->skb->data;
 		chunk->data_accepted = 0;
+
+		if (chunk->asoc)
+			sock_rps_save_rxhash(chunk->asoc->base.sk, chunk->skb);
 	}
 
 	chunk->chunk_hdr = ch;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 878d28eda1a68dc639d7b9f3f663d7518f21bb32..36697f85ce48be39f41ddedd9f53369c7f9e28d8 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6430,6 +6430,8 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
 
 	poll_wait(file, sk_sleep(sk), wait);
 
+	sock_rps_record_flow(sk);
+
 	/* A TCP-style listening socket becomes readable when the accept queue
 	 * is not empty.
 	 */
@@ -7186,6 +7188,7 @@ void sctp_copy_sock(struct sock *newsk, struct sock *sk,
 	newsk->sk_lingertime = sk->sk_lingertime;
 	newsk->sk_rcvtimeo = sk->sk_rcvtimeo;
 	newsk->sk_sndtimeo = sk->sk_sndtimeo;
+	newsk->sk_rxhash = sk->sk_rxhash;
 
 	newinet = inet_sk(newsk);
 
-- 
2.5.0

^ permalink raw reply related

* Re: [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: John Allen @ 2016-04-12 21:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Thomas Falcon, netdev, linuxppc-dev
In-Reply-To: <1460491940.6473.592.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/12/2016 03:12 PM, Eric Dumazet wrote:
> On Tue, 2016-04-12 at 14:38 -0500, John Allen wrote:
>> Moves tx completion processing out of interrupt context, deferring work
>> using a wait queue. With this work now deferred, we must account for the
>> possibility that skbs can be sent faster than we can process completion
>> requests in which case the tx buffer will overflow. If the tx buffer is
>> full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
>> queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
>> when all pending tx completion requests have been cleared.
> 
> 1) Why is this needed ?

In the current ibmvnic implementation, tx completion processing is done in
interrupt context. Depending on the load, this can block further
interrupts for a long time. This patch just creates a bottom half so that
when a tx completion interrupt comes in, we can defer the majority of the
work and exit interrupt context quickly.

> 
> 2) If it is needed, why is this not done in a generic way, so that other
> drivers can use this ?

I'm still fairly new to network driver development so I'm not in tune with
the needs of other drivers. My assumption was that the wait queue data
structure was a reasonably generic way to handle something like this. Is
there a more appropriate/generic way of implementing a bottom half for
this purpose?

-John
> 
> Thanks.
> 
> 
> 
> 

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Eric Dumazet @ 2016-04-12 20:29 UTC (permalink / raw)
  To: Ben Greear
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <570D5944.5070100@candelatech.com>

On Tue, 2016-04-12 at 13:23 -0700, Ben Greear wrote:

> It worked well enough for years that I didn't even know other algorithms were
> available.  It was broken around 4.0 time, and I reported it to the list,
> and no one seemed to really care enough to do anything about it.  I changed
> to reno and ignored the problem as well.
> 
> It is trivially easy to see the regression when using ath10k NIC, and from this email
> thread, I guess other NICs have similar issues.

Since it is so trivial, why don't you start a bisection ?

I asked a capture, I did not say ' switch to Reno or whatever ', right ?

Guessing is nice, but investigating and fixing is better.

Do not assume that nothing can be done, please ?

Thanks.

^ permalink raw reply

* Re: Issue with ping source address display
From: Julian Anastasov @ 2016-04-12 20:27 UTC (permalink / raw)
  To: Daniele Orlandi; +Cc: netdev
In-Reply-To: <570D50C8.4010409@orlandi.com>


	Hello,

On Tue, 12 Apr 2016, Daniele Orlandi wrote:

> On 12/04/2016 21:38, Julian Anastasov wrote:
> > 
> > 	What is the kernel version?
> 
> Much time passed and I don't remember, however the latest test with IPv6
> ping was made on this host:
> 
> root@monitor:~# uname -a
> Linux monitor 4.4.0-17-generic #33-Ubuntu SMP Tue Mar 29 17:17:28 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux

	I asked because I remember for such problem but
at the same time your first report was from Sep 2014, long
before the bad period: 4.0 - 4.1. Can you find and try
this fix from 4.2?:

commit 34b99df4e6256ddafb663c6de0711dceceddfe0e
Author: Julian Anastasov <ja@ssi.bg>
Date:   Tue Jun 23 08:34:39 2015 +0300

    ip: report the original address of ICMP messages
    
    ICMP messages can trigger ICMP and local errors. In this case
    serr->port is 0 and starting from Linux 4.0 we do not return
    the original target address to the error queue readers.
    Add function to define which errors provide addr_offset.
    With this fix my ping command is not silent anymore.

	Because I'm not sure if your kernel includes it.
I also remember that the ping utility can use random memory
when such 4.0/4.1 kernel returns msg_namelen=0:

http://marc.info/?l=linux-netdev&m=143509000420707&w=2

Thread:
http://marc.info/?t=143503781500001&r=1&w=2

Regards

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 20:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <1460492253.6473.596.camel@edumazet-glaptop3.roam.corp.google.com>

On 04/12/2016 01:17 PM, Eric Dumazet wrote:
> On Tue, 2016-04-12 at 13:11 -0700, Ben Greear wrote:
>> On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
>>> On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
>>>> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
>>>>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>>>>
>>>>
>>>> If you are using 'Cubic' TCP congestion control, then please try
>>>> something different.
>>>> It was broken last I checked, at least when used with the ath10k driver.
>>>>
>>>
>>> Thanks Ben, this indeed seems to be the issue !
>>> Switching to reno got me to max throughput instantly.
>>>
>>> I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?
>>
>> I think at the time it was blamed on ath10k and no one cared to try to fix it.
>>
>> Or, maybe no one really uses CUBIC anymore?
>>
>> Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
>> this code better could give it a try.
>
> Well, cubic seems to work in many cases, assuming they are not too many
> drops.
>
> Assuming one flow can get nominal speed in few RTT is kind a dream, and
> so far nobody claimed a CC was able to do that, while still being fair
> and resilient.
>
> TCP CC are full of heuristics, and by definition heuristics that were
> working 6 years ago might need to be refreshed.
>
> We are still maintaining Cubic for sure.

It worked well enough for years that I didn't even know other algorithms were
available.  It was broken around 4.0 time, and I reported it to the list,
and no one seemed to really care enough to do anything about it.  I changed
to reno and ignored the problem as well.

It is trivially easy to see the regression when using ath10k NIC, and from this email
thread, I guess other NICs have similar issues.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Eric Dumazet @ 2016-04-12 20:17 UTC (permalink / raw)
  To: Ben Greear
  Cc: Machani, Yaniv, netdev, David S. Miller, Eric Dumazet,
	Neal Cardwell, Yuchung Cheng, Nandita Dukkipati, open list,
	Kama, Meirav
In-Reply-To: <570D566B.8050302@candelatech.com>

On Tue, 2016-04-12 at 13:11 -0700, Ben Greear wrote:
> On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
> > On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
> >> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
> >>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
> >>>>
> >>
> >> If you are using 'Cubic' TCP congestion control, then please try
> >> something different.
> >> It was broken last I checked, at least when used with the ath10k driver.
> >>
> >
> > Thanks Ben, this indeed seems to be the issue !
> > Switching to reno got me to max throughput instantly.
> >
> > I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?
> 
> I think at the time it was blamed on ath10k and no one cared to try to fix it.
> 
> Or, maybe no one really uses CUBIC anymore?
> 
> Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
> this code better could give it a try.

Well, cubic seems to work in many cases, assuming they are not too many
drops.

Assuming one flow can get nominal speed in few RTT is kind a dream, and
so far nobody claimed a CC was able to do that, while still being fair
and resilient.

TCP CC are full of heuristics, and by definition heuristics that were
working 6 years ago might need to be refreshed.

We are still maintaining Cubic for sure.

^ permalink raw reply

* Re: [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: Eric Dumazet @ 2016-04-12 20:12 UTC (permalink / raw)
  To: John Allen; +Cc: Thomas Falcon, netdev, linuxppc-dev
In-Reply-To: <570D4EBC.60409@linux.vnet.ibm.com>

On Tue, 2016-04-12 at 14:38 -0500, John Allen wrote:
> Moves tx completion processing out of interrupt context, deferring work
> using a wait queue. With this work now deferred, we must account for the
> possibility that skbs can be sent faster than we can process completion
> requests in which case the tx buffer will overflow. If the tx buffer is
> full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
> queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
> when all pending tx completion requests have been cleared.

1) Why is this needed ?

2) If it is needed, why is this not done in a generic way, so that other
drivers can use this ?

Thanks.

^ permalink raw reply

* Re: TCP reaching to maximum throughput after a long time
From: Ben Greear @ 2016-04-12 20:11 UTC (permalink / raw)
  To: Machani, Yaniv, netdev
  Cc: Eric Dumazet, David S. Miller, Eric Dumazet, Neal Cardwell,
	Yuchung Cheng, Nandita Dukkipati, open list, Kama, Meirav
In-Reply-To: <AE1C82FB3D0EC64DB1F752C81CBD110139101889@DFRE01.ent.ti.com>

On 04/12/2016 12:31 PM, Machani, Yaniv wrote:
> On Tue, Apr 12, 2016 at 18:04:52, Ben Greear wrote:
>> On 04/12/2016 07:52 AM, Eric Dumazet wrote:
>>> On Tue, 2016-04-12 at 12:17 +0000, Machani, Yaniv wrote:
>>>>
>>
>> If you are using 'Cubic' TCP congestion control, then please try
>> something different.
>> It was broken last I checked, at least when used with the ath10k driver.
>>
>
> Thanks Ben, this indeed seems to be the issue !
> Switching to reno got me to max throughput instantly.
>
> I'm still looking through the thread you have shared, but from what I understand there is no planned fix for it ?

I think at the time it was blamed on ath10k and no one cared to try to fix it.

Or, maybe no one really uses CUBIC anymore?

Either way, I have no plans to try to fix CUBIC, but maybe someone who knows
this code better could give it a try.

Thanks,
Ben


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH v3 1/2] sctp: compress bit-wide flags to a bitfield on sctp_sock
From: Neil Horman @ 2016-04-12 19:50 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev, Vlad Yasevich, linux-sctp, David Laight, Jakub Sitnicki
In-Reply-To: <1d6d26426e6f1eee2851aed47b77ffe8bc335210.1460144373.git.marcelo.leitner@gmail.com>

On Fri, Apr 08, 2016 at 04:41:27PM -0300, Marcelo Ricardo Leitner wrote:
> It wastes space and gets worse as we add new flags, so convert bit-wide
> flags to a bitfield.
> 
> Currently it already saves 4 bytes in sctp_sock, which are left as holes
> in it for now. The whole struct needs packing, which should be done in
> another patch.
> 
> Note that do_auto_asconf cannot be merged, as explained in the comment
> before it.
> 
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> ---
>  include/net/sctp/structs.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 6df1ce7a411c548bda4163840a90578b6e1b4cfe..1a6a626904bba4223b7921bbb4be41c2550271a7 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -210,14 +210,14 @@ struct sctp_sock {
>  	int user_frag;
>  
>  	__u32 autoclose;
> -	__u8 nodelay;
> -	__u8 disable_fragments;
> -	__u8 v4mapped;
> -	__u8 frag_interleave;
>  	__u32 adaptation_ind;
>  	__u32 pd_point;
> -	__u8 recvrcvinfo;
> -	__u8 recvnxtinfo;
> +	__u16	nodelay:1,
> +		disable_fragments:1,
> +		v4mapped:1,
> +		frag_interleave:1,
> +		recvrcvinfo:1,
> +		recvnxtinfo:1;
>  
>  	atomic_t pd_mode;
>  	/* Receive to here while partial delivery is in effect. */
> -- 
> 2.5.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

I've not run it myself, but this series looks reasonable

Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply

* Re: Issue with ping source address display
From: Daniele Orlandi @ 2016-04-12 19:47 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: netdev
In-Reply-To: <alpine.LFD.2.11.1604122235020.1611@ja.home.ssi.bg>

[-- Attachment #1: Type: text/plain, Size: 696 bytes --]

On 12/04/2016 21:38, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Tue, 12 Apr 2016, Daniele Orlandi wrote:
> 
>> I noticed that when ping receives ICMP messages from different sources
>> the first IP address is always used and displayed:
>>
>>
>> vihai@seviolab:~$ ping -V
>> ping utility, iputils-s20121221
> 
> 	What is the kernel version?

Much time passed and I don't remember, however the latest test with IPv6
ping was made on this host:

root@monitor:~# uname -a
Linux monitor 4.4.0-17-generic #33-Ubuntu SMP Tue Mar 29 17:17:28 UTC
2016 x86_64 x86_64 x86_64 GNU/Linux

root@monitor:~# ping -V
ping utility, iputils-s20121221

Thanks,

-- 
  Daniele Orlandi


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4326 bytes --]

^ permalink raw reply

* Re: Issue with ping source address display
From: Julian Anastasov @ 2016-04-12 19:38 UTC (permalink / raw)
  To: Daniele Orlandi; +Cc: netdev
In-Reply-To: <570D35CF.9070006@orlandi.com>


	Hello,

On Tue, 12 Apr 2016, Daniele Orlandi wrote:

> I noticed that when ping receives ICMP messages from different sources
> the first IP address is always used and displayed:
> 
> 
> vihai@seviolab:~$ ping -V
> ping utility, iputils-s20121221

	What is the kernel version?

Regards

^ permalink raw reply

* [PATCH net-next] ibmvnic: Defer tx completion processing using a wait queue
From: John Allen @ 2016-04-12 19:38 UTC (permalink / raw)
  To: Thomas Falcon; +Cc: netdev, linuxppc-dev

Moves tx completion processing out of interrupt context, deferring work
using a wait queue. With this work now deferred, we must account for the
possibility that skbs can be sent faster than we can process completion
requests in which case the tx buffer will overflow. If the tx buffer is
full, ibmvnic_xmit will return NETDEV_TX_BUSY and stop the current tx
queue. Subsequently, the queue will be restarted in ibmvnic_complete_tx
when all pending tx completion requests have been cleared.

Signed-off-by: John Allen <jallen@linux.vnet.ibm.com>
---
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 864cb21..641e340 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -105,6 +105,7 @@ static int pending_scrq(struct ibmvnic_adapter *,
 			struct ibmvnic_sub_crq_queue *);
 static union sub_crq *ibmvnic_next_scrq(struct ibmvnic_adapter *,
 					struct ibmvnic_sub_crq_queue *);
+static int ibmvnic_tx_work(void *data);
 static int ibmvnic_poll(struct napi_struct *napi, int data);
 static void send_map_query(struct ibmvnic_adapter *adapter);
 static void send_request_map(struct ibmvnic_adapter *, dma_addr_t, __be32, u8);
@@ -437,6 +438,17 @@ static int ibmvnic_open(struct net_device *netdev)

 		tx_pool->consumer_index = 0;
 		tx_pool->producer_index = 0;
+
+		init_waitqueue_head(&tx_pool->ibmvnic_tx_comp_q);
+		tx_pool->work_thread =
+			kthread_run(ibmvnic_tx_work, adapter->tx_scrq[i],
+				    "%s_%s_%d",
+				    IBMVNIC_NAME, adapter->netdev->name, i);
+		if (IS_ERR(tx_pool->work_thread)) {
+			dev_err(dev, "Couldn't create kernel thread: %ld\n",
+				PTR_ERR(tx_pool->work_thread));
+			goto thread_failed;
+		}
 	}
 	adapter->bounce_buffer_size =
 	    (netdev->mtu + ETH_HLEN - 1) / PAGE_SIZE + 1;
@@ -477,6 +489,9 @@ bounce_map_failed:
 bounce_alloc_failed:
 	i = tx_subcrqs - 1;
 	kfree(adapter->tx_pool[i].free_map);
+thread_failed:
+	for (j = 0; j < i; j++)
+		kthread_stop(adapter->tx_pool[j].work_thread);
 tx_fm_alloc_failed:
 	free_long_term_buff(adapter, &adapter->tx_pool[i].long_term_buff);
 tx_ltb_alloc_failed:
@@ -731,6 +746,16 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 	}

 	index = tx_pool->free_map[tx_pool->consumer_index];
+	/* tx queue full */
+	if ((index + 1) % adapter->max_tx_entries_per_subcrq ==
+	    tx_pool->free_map[tx_pool->producer_index]) {
+		netif_tx_stop_queue(netdev_get_tx_queue(netdev, queue_num));
+		tx_send_failed++;
+		tx_dropped++;
+		ret = NETDEV_TX_BUSY;
+		goto out;
+	}
+
 	offset = index * adapter->req_mtu;
 	dst = tx_pool->long_term_buff.buff + offset;
 	memset(dst, 0, adapter->req_mtu);
@@ -1314,6 +1339,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
 {
 	struct device *dev = &adapter->vdev->dev;
 	struct ibmvnic_tx_buff *txbuff;
+	struct netdev_queue *txq;
 	union sub_crq *next;
 	int index;
 	int i, j;
@@ -1361,6 +1387,10 @@ restart_loop:
 		next->tx_comp.first = 0;
 	}

+	txq = netdev_get_tx_queue(adapter->netdev, scrq->pool_index);
+	if (netif_tx_queue_stopped(txq))
+		netif_tx_wake_queue(txq);
+
 	enable_scrq_irq(adapter, scrq);

 	if (pending_scrq(adapter, scrq)) {
@@ -1371,13 +1401,35 @@ restart_loop:
 	return 0;
 }

+static int ibmvnic_tx_work(void *data)
+{
+	int rc;
+	struct ibmvnic_sub_crq_queue *scrq = data;
+	struct ibmvnic_adapter *adapter = scrq->adapter;
+
+	while (1) {
+		rc = wait_event_interruptible(adapter->
+					      tx_pool[scrq->pool_index].
+					      ibmvnic_tx_comp_q,
+					      pending_scrq(adapter, scrq));
+		BUG_ON(rc);
+
+		if (kthread_should_stop())
+			break;
+
+		disable_scrq_irq(adapter, scrq);
+
+		ibmvnic_complete_tx(adapter, scrq);
+	}
+	return 0;
+}
+
 static irqreturn_t ibmvnic_interrupt_tx(int irq, void *instance)
 {
 	struct ibmvnic_sub_crq_queue *scrq = instance;
 	struct ibmvnic_adapter *adapter = scrq->adapter;

-	disable_scrq_irq(adapter, scrq);
-	ibmvnic_complete_tx(adapter, scrq);
+	wake_up(&adapter->tx_pool[scrq->pool_index].ibmvnic_tx_comp_q);

 	return IRQ_HANDLED;
 }

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox