Netdev List

* Re: kernel BUG at lib/string.c:LINE! (4)
From: Julian Anastasov @ 2018-05-16 22:43 UTC
  To: syzbot
  Cc: coreteam, davem, fw, horms, kadlec, linux-kernel, lvs-devel,
	netdev, netfilter-devel, pablo, syzkaller-bugs, wensong
In-Reply-To: <0000000000006764b4056c5476d9@google.com>


	Hello,

On Wed, 16 May 2018, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:    0b7d9978406f Merge branch 'Microsemi-Ocelot-Ethernet-switc..
> git tree:       net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16e91017800000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=aac887f77319868646df
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1665d637800000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10517107800000
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+aac887f77319868646df@syzkaller.appspotmail.com
> 
> IPVS: Unknown mcast interface: veth1_to???a????????????
> IPVS: Unknown mcast interface: veth1_to???a????????????
> IPVS: Unknown mcast interface: veth1_to???a????????????
> detected buffer overflow in strlen
> ------------[ cut here ]------------
> kernel BUG at lib/string.c:1052!
> invalid opcode: 0000 [#1] SMP KASAN
> Dumping ftrace buffer:
>   (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051
> RSP: 0018:ffff8801c976f800 EFLAGS: 00010282
> RAX: 0000000000000022 RBX: 0000000000000040 RCX: 0000000000000000
> RDX: 0000000000000022 RSI: ffffffff8160f6f1 RDI: ffffed00392edef6
> RBP: ffff8801c976f800 R08: ffff8801cf4c62c0 R09: ffffed003b5e4fb0
> R10: ffffed003b5e4fb0 R11: ffff8801daf27d87 R12: ffff8801c976fa20
> R13: ffff8801c976fae4 R14: ffff8801c976fae0 R15: 000000000000048b
> FS:  00007fd99f75e700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000200001c0 CR3: 00000001d6843000 CR4: 00000000001406e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> strlen include/linux/string.h:270 [inline]
> strlcpy include/linux/string.h:293 [inline]
> do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388
> nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
> nf_setsockopt+0x7d/0xd0 net/netfilter/nf_sockopt.c:115
> ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253
> udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487
> ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917
> tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057
> sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046
> __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
> __do_sys_setsockopt net/socket.c:1914 [inline]
> __se_sys_setsockopt net/socket.c:1911 [inline]
> __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x447369
> RSP: 002b:00007fd99f75dda8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 00000000006e39e4 RCX: 0000000000447369
> RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003
> RBP: 0000000000000000 R08: 0000000000000018 R09: 0000000000000000
> R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000006e39e0
> R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0000000000000001
> Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb de
> 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90 90 90
> 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56
> RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: ffff8801c976f800
> ---[ end trace 624046f2d9af7702 ]---

	Just to let you know that I have tested a patch with
syzbot and will do more tests before submitting it.

Regards

--
Julian Anastasov <ja@ssi.bg>

* Re: [PATCH bpf-next] bpf: fix sock hashmap kmalloc warning
From: Daniel Borkmann @ 2018-05-16 22:58 UTC
  To: Yonghong Song, ast, netdev; +Cc: kernel-team
In-Reply-To: <20180516210626.776403-1-yhs@fb.com>

On 05/16/2018 11:06 PM, Yonghong Song wrote:
> syzbot reported a kernel warning below:
>   WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   Kernel panic - not syncing: panic_on_warn set ...
> 
>   CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9
>   Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   Call Trace:
>    __dump_stack lib/dump_stack.c:77 [inline]
>    dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>    panic+0x22f/0x4de kernel/panic.c:184
>    __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
>    report_bug+0x252/0x2d0 lib/bug.c:186
>    fixup_bug arch/x86/kernel/traps.c:178 [inline]
>    do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
>    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>    invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
>   RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996
>   RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246
>   RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7
>   RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1
>   RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700
>   R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280
>   R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0
>    __do_kmalloc mm/slab.c:3713 [inline]
>    __kmalloc+0x25/0x760 mm/slab.c:3727
>    kmalloc include/linux/slab.h:517 [inline]
>    map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858
>    __do_sys_bpf kernel/bpf/syscall.c:2131 [inline]
>    __se_sys_bpf kernel/bpf/syscall.c:2096 [inline]
>    __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096
>    do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The test case is against a sock hashmap with a key size of 0xffffffe1.
> Such a large key size causes the code below in sock_hash_alloc()
> to overflow and produce a smaller elem_size, so map creation
> succeeds.
>     htab->elem_size = sizeof(struct htab_elem) +
>                       round_up(htab->map.key_size, 8);
> 
> Later, when map_get_next_key is called, the kernel tries
> to allocate the key, fails, and issues
> the above warning.
> 
> Similar to hashtab, ensure the key size is at most
> MAX_BPF_STACK for a successful map creation.
> 
> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com
> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf-next, thanks Yonghong!
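
For reference, the wrap-around described in the quoted commit message can be modelled in userspace. This is a minimal sketch, assuming the element size is held in 32 bits; HDR_SIZE is a hypothetical stand-in for sizeof(struct htab_elem), MAX_KEY_SIZE assumes MAX_BPF_STACK is 512, and the function names are illustrative rather than kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for sizeof(struct htab_elem); the real value differs. */
#define HDR_SIZE 48u
/* Assumed value of MAX_BPF_STACK. */
#define MAX_KEY_SIZE 512u

/* Round x up to a multiple of 8. */
static uint32_t round_up8(uint32_t x)
{
	return (x + 7u) & ~7u;
}

/* Same shape as the quoted elem_size computation, truncated to 32 bits. */
static uint32_t elem_size_for(uint32_t key_size)
{
	return HDR_SIZE + round_up8(key_size);
}

/* The bound the commit message describes: reject oversized keys up front. */
static bool key_size_ok(uint32_t key_size)
{
	return key_size <= MAX_KEY_SIZE;
}
```

Under these assumptions a key size of 0xffffffe1 produces an element size of 24 bytes, far smaller than the key itself, which is why map creation succeeds and the failure only appears later.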

* Re: [PATCH bpf-next] libbpf: add ifindex to enable offload support
From: Daniel Borkmann @ 2018-05-16 22:59 UTC
  To: Jakub Kicinski, alexei.starovoitov; +Cc: oss-drivers, netdev, David Beckett
In-Reply-To: <20180516210249.6486-1-jakub.kicinski@netronome.com>

On 05/16/2018 11:02 PM, Jakub Kicinski wrote:
> From: David Beckett <david.beckett@netronome.com>
> 
> BPF programs currently can only be offloaded using iproute2. This
> patch will allow programs to be offloaded using libbpf calls.
> 
> Signed-off-by: David Beckett <david.beckett@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Applied to bpf-next, thanks guys!

* Re: [PATCH] bpf: add __printf verification to bpf_verifier_vlog
From: Daniel Borkmann @ 2018-05-16 22:59 UTC
  To: Mathieu Malaterre, Alexei Starovoitov; +Cc: netdev, linux-kernel
In-Reply-To: <20180516202741.20861-1-malat@debian.org>

On 05/16/2018 10:27 PM, Mathieu Malaterre wrote:
> __printf is useful to verify format and arguments. The ‘bpf_verifier_vlog’
> function is used twice in verifier.c; in both cases the calling function
> already uses the __printf gcc attribute.
> 
> Remove the following warning, triggered with W=1:
> 
>   kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
> 
> Signed-off-by: Mathieu Malaterre <malat@debian.org>

Looks good, applied to bpf-next, thanks Mathieu!

* [PATCH net-next 0/8] tcp: default RACK loss recovery
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng

This patch set implements the features corresponding to the
draft-ietf-tcpm-rack-03 version of the RACK draft.
https://datatracker.ietf.org/meeting/101/materials/slides-101-tcpm-update-on-tcp-rack-00

1. SACK: implement equivalent DUPACK threshold heuristic in RACK to
   replace existing RFC6675 recovery (tcp_mark_head_lost).

2. Non-SACK: simplify RFC6582 NewReno implementation

3. RTO: apply RACK's time-based approach to avoid spuriously
   marking very recently sent packets lost.

4. With (1)(2)(3), make RACK the exclusive fast recovery mechanism to
   mark losses based on time on S/ACK. Tail loss probe and F-RTO remain
   enabled by default as complementary mechanisms to send probes in
   CA_Open and CA_Loss states. The probes would solicit S/ACKs to trigger
   RACK time-based loss detection.

All Google web and internal servers have been running RACK-only mode
(4) for a while now. A/B experiments indicate RACK/TLP on average
reduces recovery latency by 10% compared to RFC6675. RFC6675
is default-off now but can be enabled by disabling RACK (sysctl
net.ipv4.tcp_recovery=0) in case of unforeseen issues.

Yuchung Cheng (8):
  tcp: support DUPACK threshold in RACK
  tcp: disable RFC6675 loss detection
  tcp: simpler NewReno implementation
  tcp: account lost retransmit after timeout
  tcp: new helper tcp_timeout_mark_lost
  tcp: separate loss marking and state update on RTO
  tcp: new helper tcp_rack_skb_timeout
  tcp: don't mark recently sent packets lost on RTO

 Documentation/networking/ip-sysctl.txt |  4 +-
 include/net/tcp.h                      |  5 ++
 net/ipv4/tcp_input.c                   | 99 ++++++++++++++------------
 net/ipv4/tcp_recovery.c                | 80 ++++++++++++++++-----
 4 files changed, 124 insertions(+), 64 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog

* [PATCH net-next 1/8] tcp: support DUPACK threshold in RACK
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

This patch adds support for the classic DUPACK threshold rule
(#DupThresh) in RACK.

When the number of packets SACKed is greater than or equal to the
threshold, RACK sets the reordering window to zero, which
immediately marks all the unsacked packets below the highest SACKed
sequence lost. Since this approach is known to not work well with
reordering, RACK only uses it if no reordering has been observed.

The DUPACK threshold rule is a particularly useful extension to the
fast recoveries triggered by the RACK reordering timer, for example on
data-center transfers where the RTT is much smaller than a timer
tick, or on high-RTT paths where the default RTT/4 may take too long.

Note that this patch differs slightly from RFC6675. RFC6675
considers a packet lost when at least #DupThresh higher-sequence
packets are SACKed.

With RACK, for connections that have seen reordering, RACK
continues to use a dynamically-adaptive time-based reordering
window to detect losses. But for connections on which we have not
yet seen reordering, this patch considers a packet lost when at
least one higher sequence packet is SACKed and the total number
of SACKed packets is at least DupThresh. For example, suppose a
connection has not seen reordering, and sends 10 packets, and
packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2
lost. RACK considers packets 1, 2, 4, 6 lost.
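
The 10-packet example above can be checked with a small userspace model. This is a sketch only, assuming DupThresh is 3 and representing packets as bits in a mask; the function names are hypothetical and none of this is kernel code.

```c
#include <stdint.h>

#define DUP_THRESH 3

/* Bit (k - 1) of a mask stands for packet k, with k in 1..n and n <= 31. */

static int popcount32(uint32_t x)
{
	int c = 0;

	for (; x; x &= x - 1)
		c++;
	return c;
}

/* Number of SACKed packets with a higher sequence than packet k. */
static int sacked_above(uint32_t sacked, int k)
{
	/* Shifting right by k drops packets 1..k from the mask. */
	return popcount32(sacked >> k);
}

/* RFC6675 rule: lost if at least DupThresh higher packets are SACKed. */
static uint32_t rfc6675_lost(uint32_t sacked, int n)
{
	uint32_t lost = 0;

	for (int k = 1; k <= n; k++) {
		uint32_t bit = 1u << (k - 1);

		if (!(sacked & bit) && sacked_above(sacked, k) >= DUP_THRESH)
			lost |= bit;
	}
	return lost;
}

/* Rule described in this patch for connections without reordering:
 * lost if some higher packet is SACKed and the total number of
 * SACKed packets is at least DupThresh.
 */
static uint32_t rack_dupthresh_lost(uint32_t sacked, int n)
{
	uint32_t lost = 0;

	if (popcount32(sacked) < DUP_THRESH)
		return 0;
	for (int k = 1; k <= n; k++) {
		uint32_t bit = 1u << (k - 1);

		if (!(sacked & bit) && sacked_above(sacked, k) >= 1)
			lost |= bit;
	}
	return lost;
}
```

With packets 3, 5, 7 SACKed (mask 0x54) out of 10, rfc6675_lost() returns 0x03 (packets 1 and 2) and rack_dupthresh_lost() returns 0x2b (packets 1, 2, 4, 6).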

There is some small risk of spurious retransmits here due to
reordering. However, this is mostly limited to the first flight of
a connection on which the sender receives SACKs from reordering.
And RFC 6675 and FACK loss detection have a similar risk on the
first flight with reordering (it's just that the risk of spurious
retransmits from reordering was slightly narrower for those older
algorithms due to the margin of 3*MSS).

Also, the minimum reordering window is reduced from 1 msec to 0
to recover more quickly on short-RTT transfers. RACK is therefore more
aggressive in marking packets lost during recovery, which reduces the
reordering window timeouts.
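
The resulting window selection can be modelled in userspace as follows. This is a sketch that follows the logic of tcp_rack_reo_wnd() in the diff below; struct rack_model and its field names are hypothetical, and srtt_us here is an unscaled value, whereas the kernel field is shifted right by 3 before use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the connection state involved. */
struct rack_model {
	bool reord_seen;        /* reordering has been observed */
	bool in_recovery;       /* congestion state is Recovery or Loss */
	bool dupthresh_off;     /* tcp_recovery has the 0x4 bit set */
	uint32_t sacked_out;    /* number of SACKed packets */
	uint32_t reordering;    /* DUPACK threshold */
	uint32_t min_rtt_us;    /* minimum RTT in microseconds */
	uint32_t srtt_us;       /* smoothed RTT in microseconds, unscaled */
	uint32_t reo_wnd_steps; /* multiplier grown on DSACKs */
};

/* Reordering window in microseconds under the rules of this patch. */
static uint32_t model_reo_wnd(const struct rack_model *m)
{
	if (!m->reord_seen) {
		/* No reordering seen: be aggressive during recovery ... */
		if (m->in_recovery)
			return 0;
		/* ... or when the DUPACK threshold is reached. */
		if (m->sacked_out >= m->reordering && !m->dupthresh_off)
			return 0;
	}

	/* Otherwise allow min_rtt/4 per step, capped at the smoothed RTT. */
	uint32_t wnd = (m->min_rtt_us / 4) * m->reo_wnd_steps;

	return wnd < m->srtt_us ? wnd : m->srtt_us;
}
```

For example, with a 40 ms minimum RTT, one step, and no reordering seen, the window is 0 once three packets are SACKed, and 10 ms if the threshold heuristic is disabled.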

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 Documentation/networking/ip-sysctl.txt |  1 +
 include/net/tcp.h                      |  1 +
 net/ipv4/tcp_recovery.c                | 40 +++++++++++++++++---------
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 59afc9a10b4f..13bbac50dc8b 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -451,6 +451,7 @@ tcp_recovery - INTEGER
 	RACK: 0x1 enables the RACK loss detection for fast detection of lost
 	      retransmissions and tail drops.
 	RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
+	RACK: 0x4 disables RACK's DUPACK threshold heuristic

 	Default: 0x1

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3b1d617b0110..85000c85ddcd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -245,6 +245,7 @@ extern long sysctl_tcp_mem[3];

 #define TCP_RACK_LOSS_DETECTION  0x1 /* Use RACK to detect losses */
 #define TCP_RACK_STATIC_REO_WND  0x2 /* Use static RACK reo wnd */
+#define TCP_RACK_NO_DUPTHRESH    0x4 /* Do not use DUPACK threshold in RACK */

 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 3a81720ac0c4..1c1bdf12a96f 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -21,6 +21,32 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
 	return t1 > t2 || (t1 == t2 && after(seq1, seq2));
 }

+u32 tcp_rack_reo_wnd(const struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if (!tp->rack.reord) {
+		/* If reordering has not been observed, be aggressive during
+		 * the recovery or starting the recovery by DUPACK threshold.
+		 */
+		if (inet_csk(sk)->icsk_ca_state >= TCP_CA_Recovery)
+			return 0;
+
+		if (tp->sacked_out >= tp->reordering &&
+		    !(sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_NO_DUPTHRESH))
+			return 0;
+	}
+
+	/* To be more reordering resilient, allow min_rtt/4 settling delay.
+	 * Use min_rtt instead of the smoothed RTT because reordering is
+	 * often a path property and less related to queuing or delayed ACKs.
+	 * Upon receiving DSACKs, linearly increase the window up to the
+	 * smoothed RTT.
+	 */
+	return min((tcp_min_rtt(tp) >> 2) * tp->rack.reo_wnd_steps,
+		   tp->srtt_us >> 3);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet lost, if some packet sent later has been (s)acked.
@@ -44,23 +70,11 @@ static bool tcp_rack_sent_after(u64 t1, u64 t2, u32 seq1, u32 seq2)
 static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 min_rtt = tcp_min_rtt(tp);
 	struct sk_buff *skb, *n;
 	u32 reo_wnd;

 	*reo_timeout = 0;
-	/* To be more reordering resilient, allow min_rtt/4 settling delay
-	 * (lower-bounded to 1000uS). We use min_rtt instead of the smoothed
-	 * RTT because reordering is often a path property and less related
-	 * to queuing or delayed ACKs.
-	 */
-	reo_wnd = 1000;
-	if ((tp->rack.reord || inet_csk(sk)->icsk_ca_state < TCP_CA_Recovery) &&
-	    min_rtt != ~0U) {
-		reo_wnd = max((min_rtt >> 2) * tp->rack.reo_wnd_steps, reo_wnd);
-		reo_wnd = min(reo_wnd, tp->srtt_us >> 3);
-	}
-
+	reo_wnd = tcp_rack_reo_wnd(sk);
 	list_for_each_entry_safe(skb, n, &tp->tsorted_sent_queue,
 				 tcp_tsorted_anchor) {
 		struct tcp_skb_cb *scb = TCP_SKB_CB(skb);
-- 
2.17.0.441.gb46fe60e1d-goog

* [PATCH net-next 2/8] tcp: disable RFC6675 loss detection
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

This patch disables RFC6675 loss detection and makes sysctl
net.ipv4.tcp_recovery control a binary choice between RACK
(1) and RFC6675 (0).
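
How the sysctl bits combine can be sketched in userspace. The three flag values are the ones documented in this series; the helper names are hypothetical, and the sketch only models the selection, not the recovery code itself.

```c
#include <stdbool.h>

/* Flag values as documented for net.ipv4.tcp_recovery in this series. */
#define TCP_RACK_LOSS_DETECTION 0x1
#define TCP_RACK_STATIC_REO_WND 0x2
#define TCP_RACK_NO_DUPTHRESH   0x4

/* RACK is selected when the low bit is set. */
static bool uses_rack(unsigned int tcp_recovery)
{
	return (tcp_recovery & TCP_RACK_LOSS_DETECTION) != 0;
}

/* With this patch, the RFC6675 rules apply only when RACK is off. */
static bool uses_rfc6675(unsigned int tcp_recovery)
{
	return !uses_rack(tcp_recovery);
}

/* RACK's DUPACK threshold heuristic needs RACK on and 0x4 clear. */
static bool rack_dupthresh_enabled(unsigned int tcp_recovery)
{
	return uses_rack(tcp_recovery) &&
	       !(tcp_recovery & TCP_RACK_NO_DUPTHRESH);
}
```

A value of 0x1 (the default) selects RACK with the threshold heuristic, 0x5 selects RACK without it, and 0 falls back to RFC6675.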

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 Documentation/networking/ip-sysctl.txt |  3 ++-
 net/ipv4/tcp_input.c                   | 12 ++++++++----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 13bbac50dc8b..ea304a23c8d7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -449,7 +449,8 @@ tcp_recovery - INTEGER
 	features.
 
 	RACK: 0x1 enables the RACK loss detection for fast detection of lost
-	      retransmissions and tail drops.
+	      retransmissions and tail drops. It also subsumes and disables
+	      RFC6675 recovery for SACK connections.
 	RACK: 0x2 makes RACK's reordering window static (min_rtt/4).
 	RACK: 0x4 disables RACK's DUPACK threshold heuristic
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b188e0d75edd..ccbe04f80040 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2035,6 +2035,11 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tp->sacked_out + 1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+	return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* Linux NewReno/SACK/ECN state machine.
  * --------------------------------------
  *
@@ -2141,7 +2146,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 		return true;
 
 	/* Not-A-Trick#2 : Classic rule... */
-	if (tcp_dupack_heuristics(tp) > tp->reordering)
+	if (!tcp_is_rack(sk) && tcp_dupack_heuristics(tp) > tp->reordering)
 		return true;
 
 	return false;
@@ -2722,8 +2727,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	/* Use RACK to detect loss */
-	if (sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
+	if (tcp_is_rack(sk)) {
 		u32 prior_retrans = tp->retrans_out;
 
 		tcp_rack_mark_lost(sk);
@@ -2862,7 +2866,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		fast_rexmit = 1;
 	}
 
-	if (do_lost)
+	if (!tcp_is_rack(sk) && do_lost)
 		tcp_update_scoreboard(sk, fast_rexmit);
 	*rexmit = REXMIT_LOST;
 }
-- 
2.17.0.441.gb46fe60e1d-goog

* [PATCH net-next 3/8] tcp: simpler NewReno implementation
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

This is a rewrite of the NewReno loss recovery implementation that is
simpler and standalone, for readability and better performance by
using fewer states.

Note that NewReno refers to RFC6582 as a modification to the fast
recovery algorithm. It is used only if the connection does not
support SACK in Linux. It should not be confused with the Reno
(AIMD) congestion control.
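
The trigger condition can be sketched as a pure function. This is a userspace model of the rule stated in the comment above tcp_newreno_mark_lost() in this patch; the enum and function names are hypothetical, and the state ordering is assumed to match the kernel's congestion states.

```c
#include <stdbool.h>

/* Assumed ordering of congestion states, lowest to highest. */
enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

/* Decide whether the first unacked packet should be marked lost:
 * a) before recovery, once the DUPACK count reaches the threshold;
 * b) during recovery, when an ACK acknowledges new data.
 */
static bool newreno_mark_head_lost(enum ca_state state,
				   unsigned int dupacks,
				   unsigned int dupthresh,
				   bool snd_una_advanced)
{
	if (state < CA_RECOVERY && dupacks >= dupthresh)
		return true;
	if (state == CA_RECOVERY && snd_una_advanced)
		return true;
	return false;
}
```

Only one packet is considered per call, which is what keeps this model, and the rewrite it mirrors, small.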

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 include/net/tcp.h       |  1 +
 net/ipv4/tcp_input.c    | 19 +++++++++++--------
 net/ipv4/tcp_recovery.c | 27 +++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 85000c85ddcd..d7f81325bee5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 			     u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ccbe04f80040..076206873e3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2223,9 +2223,7 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_is_reno(tp)) {
-		tcp_mark_head_lost(sk, 1, 1);
-	} else {
+	if (tcp_is_sack(tp)) {
 		int sacked_upto = tp->sacked_out - tp->reordering;
 		if (sacked_upto >= 0)
 			tcp_mark_head_lost(sk, sacked_upto, 0);
@@ -2723,11 +2721,16 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
 	return false;
 }
 
-static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
+static void tcp_identify_packet_loss(struct sock *sk, int *ack_flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_is_rack(sk)) {
+	if (tcp_rtx_queue_empty(sk))
+		return;
+
+	if (unlikely(tcp_is_reno(tp))) {
+		tcp_newreno_mark_lost(sk, *ack_flag & FLAG_SND_UNA_ADVANCED);
+	} else if (tcp_is_rack(sk)) {
 		u32 prior_retrans = tp->retrans_out;
 
 		tcp_rack_mark_lost(sk);
@@ -2823,11 +2826,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 			tcp_try_keep_open(sk);
 			return;
 		}
-		tcp_rack_identify_loss(sk, ack_flag);
+		tcp_identify_packet_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
 		tcp_process_loss(sk, flag, is_dupack, rexmit);
-		tcp_rack_identify_loss(sk, ack_flag);
+		tcp_identify_packet_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
 			return;
@@ -2844,7 +2847,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		tcp_rack_identify_loss(sk, ack_flag);
+		tcp_identify_packet_loss(sk, ack_flag);
 		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 1c1bdf12a96f..299b0e38aa9a 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -216,3 +216,30 @@ void tcp_rack_update_reo_wnd(struct sock *sk, struct rate_sample *rs)
 		tp->rack.reo_wnd_steps = 1;
 	}
 }
+
+/* RFC6582 NewReno recovery for non-SACK connection. It simply retransmits
+ * the next unacked packet upon receiving
+ * a) three or more DUPACKs to start the fast recovery
+ * b) an ACK acknowledging new data during the fast recovery.
+ */
+void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced)
+{
+	const u8 state = inet_csk(sk)->icsk_ca_state;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if ((state < TCP_CA_Recovery && tp->sacked_out >= tp->reordering) ||
+	    (state == TCP_CA_Recovery && snd_una_advanced)) {
+		struct sk_buff *skb = tcp_rtx_queue_head(sk);
+		u32 mss;
+
+		if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
+			return;
+
+		mss = tcp_skb_mss(skb);
+		if (tcp_skb_pcount(skb) > 1 && skb->len > mss)
+			tcp_fragment(sk, TCP_FRAG_IN_RTX_QUEUE, skb,
+				     mss, mss, GFP_ATOMIC);
+
+		tcp_skb_mark_lost_uncond_verify(tp, skb);
+	}
+}
-- 
2.17.0.441.gb46fe60e1d-goog

* [PATCH net-next 4/8] tcp: account lost retransmit after timeout
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

The previous approach for the lost and retransmit bits was to
wipe the slate clean: zero all the lost and retransmit bits,
correspondingly zero the lost_out and retrans_out counters, and
then add back the lost bits (and correspondingly increment lost_out).

The new approach is to treat this very much like marking packets
lost in fast recovery. We don’t wipe the slate clean. We just say
that for all packets that were not yet marked sacked or lost, we now
mark them as lost in exactly the same way we do for fast recovery.

This fixes the lost retransmit accounting at RTO time and greatly
simplifies the RTO code by sharing much of the logic with Fast
Recovery.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 include/net/tcp.h       |  1 +
 net/ipv4/tcp_input.c    | 18 +++---------------
 net/ipv4/tcp_recovery.c |  4 ++--
 3 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7f81325bee5..402484ed9b57 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1878,6 +1878,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 076206873e3e..6fb0a28977a0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,7 +1929,6 @@ void tcp_enter_loss(struct sock *sk)
 	struct sk_buff *skb;
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
 	bool is_reneg;			/* is receiver reneging on SACKs? */
-	bool mark_lost;
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1945,9 +1944,6 @@ void tcp_enter_loss(struct sock *sk)
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	tp->retrans_out = 0;
-	tp->lost_out = 0;
-
 	if (tcp_is_reno(tp))
 		tcp_reset_reno_sack(tp);
 
@@ -1959,21 +1955,13 @@ void tcp_enter_loss(struct sock *sk)
 		/* Mark SACK reneging until we recover from this loss event. */
 		tp->is_sack_reneg = 1;
 	}
-	tcp_clear_all_retrans_hints(tp);
-
 	skb_rbtree_walk_from(skb) {
-		mark_lost = (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) ||
-			     is_reneg);
-		if (mark_lost)
-			tcp_sum_lost(tp, skb);
-		TCP_SKB_CB(skb)->sacked &= (~TCPCB_TAGBITS)|TCPCB_SACKED_ACKED;
-		if (mark_lost) {
+		if (is_reneg)
 			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
-			tp->lost_out += tcp_skb_pcount(skb);
-		}
+		tcp_mark_skb_lost(sk, skb);
 	}
 	tcp_verify_left_out(tp);
+	tcp_clear_all_retrans_hints(tp);
 
 	/* Timeout in disordered state after receiving substantial DUPACKs
 	 * suggests that the degree of reordering is over-estimated.
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 299b0e38aa9a..b2f9be388bf3 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -2,7 +2,7 @@
 #include <linux/tcp.h>
 #include <net/tcp.h>
 
-static void tcp_rack_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
+void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -95,7 +95,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 		remaining = tp->rack.rtt_us + reo_wnd -
 			    tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
 		if (remaining <= 0) {
-			tcp_rack_mark_skb_lost(sk, skb);
+			tcp_mark_skb_lost(sk, skb);
 			list_del_init(&skb->tcp_tsorted_anchor);
 		} else {
 			/* Record maximum wait time */
-- 
2.17.0.441.gb46fe60e1d-goog

* [PATCH net-next 5/8] tcp: new helper tcp_timeout_mark_lost
From: Yuchung Cheng @ 2018-05-16 23:40 UTC
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

Refactor using a new helper, tcp_timeout_mark_lost(), that marks packets
lost upon RTO.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 net/ipv4/tcp_input.c | 50 +++++++++++++++++++++++++-------------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6fb0a28977a0..af32accda2a9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,18 +1917,43 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
 	tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
-/* Enter Loss state. If we detect SACK reneging, forget all SACK information
+/* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
  */
+static void tcp_timeout_mark_lost(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	bool is_reneg;			/* is receiver reneging on SACKs? */
+
+	skb = tcp_rtx_queue_head(sk);
+	is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+	if (is_reneg) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
+		tp->sacked_out = 0;
+		/* Mark SACK reneging until we recover from this loss event. */
+		tp->is_sack_reneg = 1;
+	} else if (tcp_is_reno(tp)) {
+		tcp_reset_reno_sack(tp);
+	}
+
+	skb_rbtree_walk_from(skb) {
+		if (is_reneg)
+			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+		tcp_mark_skb_lost(sk, skb);
+	}
+	tcp_verify_left_out(tp);
+	tcp_clear_all_retrans_hints(tp);
+}
+
+/* Enter Loss state. */
 void tcp_enter_loss(struct sock *sk)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct net *net = sock_net(sk);
-	struct sk_buff *skb;
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;
-	bool is_reneg;			/* is receiver reneging on SACKs? */
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
@@ -1944,24 +1969,7 @@ void tcp_enter_loss(struct sock *sk)
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
-	if (tcp_is_reno(tp))
-		tcp_reset_reno_sack(tp);
-
-	skb = tcp_rtx_queue_head(sk);
-	is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
-	if (is_reneg) {
-		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
-		tp->sacked_out = 0;
-		/* Mark SACK reneging until we recover from this loss event. */
-		tp->is_sack_reneg = 1;
-	}
-	skb_rbtree_walk_from(skb) {
-		if (is_reneg)
-			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
-		tcp_mark_skb_lost(sk, skb);
-	}
-	tcp_verify_left_out(tp);
-	tcp_clear_all_retrans_hints(tp);
+	tcp_timeout_mark_lost(sk);
 
 	/* Timeout in disordered state after receiving substantial DUPACKs
 	 * suggests that the degree of reordering is over-estimated.
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* [PATCH net-next 6/8] tcp: separate loss marking and state update on RTO
From: Yuchung Cheng @ 2018-05-16 23:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

Previously when TCP times out, it first updates cwnd and ssthresh,
marks packets lost, and then updates congestion state again. This
was fine because everything not yet delivered is marked lost,
so the inflight is always 0 and cwnd can be safely set to 1 to
retransmit one packet on timeout.

But the inflight may not always be 0 on timeout if TCP changes to
mark packets lost based on packet sent time. Therefore we must
first mark the packet lost, then set the cwnd based on the
(updated) inflight.

This is not a pure refactor. Congestion control may potentially
break if it uses (not yet updated) inflight to compute ssthresh.
Fortunately no existing congestion control module does that.
Also it changes the inflight when CA_LOSS_EVENT is called, and only
westwood processes such an event but does not use inflight.

This change has two other minor side benefits:
1) consistent with Fast Recovery s.t. the inflight is updated
   first before tcp_enter_recovery flips state to CA_Recovery.

2) avoid intertwining loss marking with state update, making the
   code more readable.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af32accda2a9..1ccc97b368c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1955,6 +1955,8 @@ void tcp_enter_loss(struct sock *sk)
 	struct net *net = sock_net(sk);
 	bool new_recovery = icsk->icsk_ca_state < TCP_CA_Recovery;

+	tcp_timeout_mark_lost(sk);
+
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
 	    !after(tp->high_seq, tp->snd_una) ||
@@ -1969,8 +1971,6 @@ void tcp_enter_loss(struct sock *sk)
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;

-	tcp_timeout_mark_lost(sk);
-
 	/* Timeout in disordered state after receiving substantial DUPACKs
 	 * suggests that the degree of reordering is over-estimated.
 	 */
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* [PATCH net-next 7/8] tcp: new helper tcp_rack_skb_timeout
From: Yuchung Cheng @ 2018-05-16 23:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack
to prepare the final RTO change.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 include/net/tcp.h       |  2 ++
 net/ipv4/tcp_input.c    | 10 +++++-----
 net/ipv4/tcp_recovery.c |  9 +++++++--
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 402484ed9b57..b46d0f9adbdb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1880,6 +1880,8 @@ void tcp_init(void);
 /* tcp_recovery.c */
 void tcp_mark_skb_lost(struct sock *sk, struct sk_buff *skb);
 void tcp_newreno_mark_lost(struct sock *sk, bool snd_una_advanced);
+extern s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb,
+				u32 reo_wnd);
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 			     u64 xmit_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1ccc97b368c7..ba8a8e3464aa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,6 +1917,11 @@ static inline void tcp_init_undo(struct tcp_sock *tp)
 	tp->undo_retrans = tp->retrans_out ? : -1;
 }
 
+static bool tcp_is_rack(const struct sock *sk)
+{
+	return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
+}
+
 /* If we detect SACK reneging, forget all SACK information
  * and reset tags completely, otherwise preserve SACKs. If receiver
  * dropped its ofo queue, we will know this due to reneging detection.
@@ -2031,11 +2036,6 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tp->sacked_out + 1;
 }
 
-static bool tcp_is_rack(const struct sock *sk)
-{
-	return sock_net(sk)->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION;
-}
-
 /* Linux NewReno/SACK/ECN state machine.
  * --------------------------------------
  *
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index b2f9be388bf3..30cbfb69b1de 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -47,6 +47,12 @@ u32 tcp_rack_reo_wnd(const struct sock *sk)
 		   tp->srtt_us >> 3);
 }
 
+s32 tcp_rack_skb_timeout(struct tcp_sock *tp, struct sk_buff *skb, u32 reo_wnd)
+{
+	return tp->rack.rtt_us + reo_wnd -
+	       tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+}
+
 /* RACK loss detection (IETF draft draft-ietf-tcpm-rack-01):
  *
  * Marks a packet lost, if some packet sent later has been (s)acked.
@@ -92,8 +98,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 		/* A packet is lost if it has not been s/acked beyond
 		 * the recent RTT plus the reordering window.
 		 */
-		remaining = tp->rack.rtt_us + reo_wnd -
-			    tcp_stamp_us_delta(tp->tcp_mstamp, skb->skb_mstamp);
+		remaining = tcp_rack_skb_timeout(tp, skb, reo_wnd);
 		if (remaining <= 0) {
 			tcp_mark_skb_lost(sk, skb);
 			list_del_init(&skb->tcp_tsorted_anchor);
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related
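
A minimal userspace sketch of the arithmetic behind tcp_rack_skb_timeout()
in the 7/8 patch above. struct rack_state and rack_skb_timeout() are
hypothetical stand-ins, not kernel API, and the computation is widened to
64 bits here (an assumption of this sketch) so that a negative result is
well defined in portable C.

```c
#include <stdint.h>

/* Hypothetical stand-in for the tcp_sock fields the helper reads. */
struct rack_state {
	uint32_t rtt_us;	/* most recent RACK RTT, in microseconds */
	uint64_t now_us;	/* current time (tp->tcp_mstamp in the patch) */
};

/* Time left before a packet sent at sent_us is deemed lost:
 *   RACK RTT + reordering window - time elapsed since it was sent.
 * A result <= 0 means the packet has already timed out.
 */
int32_t rack_skb_timeout(const struct rack_state *rs, uint64_t sent_us,
			 uint32_t reo_wnd_us)
{
	int64_t elapsed = (int64_t)(rs->now_us - sent_us);

	return (int32_t)((int64_t)rs->rtt_us + reo_wnd_us - elapsed);
}
```

With rtt_us = 10000, a reordering window of 2000 and a packet sent 15000 us
ago, the result is -3000, so a caller would treat the packet as timed out.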

* [PATCH net-next 8/8] tcp: don't mark recently sent packets lost on RTO
From: Yuchung Cheng @ 2018-05-16 23:40 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, ncardwell, soheil, priyarjha, Yuchung Cheng
In-Reply-To: <20180516234017.172775-1-ycheng@google.com>

An RTO event indicates the head has not been acked for a long time
after its last (re)transmission. But the other packets are not
necessarily lost if they have been only sent recently (for example
due to application limit). This patch avoids marking packets sent
within the last RTT as lost on an RTO event, using logic similar to
TCP RACK detection.

Normally the head (SND.UNA) would be marked lost since RTO should
fire strictly after the head was sent. An exception is when the
most recent RACK RTT measurement is larger than the (previous)
RTO. To address this exception the head is always marked lost.

Congestion control interaction: since we may not mark every packet
lost, the congestion window may be more than 1 (inflight plus 1).
But only one packet will be retransmitted after RTO, since
tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The
connection still performs slow start from one packet (with Cubic
congestion control).

This commit was tested in an A/B test with Google web servers,
and showed a reduction of 2% in (spurious) retransmits post
timeout (SlowStartRetrans), and correspondingly reduced DSACKs
(DSACKIgnoredOld) by 7%.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Priyaranjan Jha <priyarjha@google.com>
---
 net/ipv4/tcp_input.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ba8a8e3464aa..0bf032839548 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1929,11 +1929,11 @@ static bool tcp_is_rack(const struct sock *sk)
 static void tcp_timeout_mark_lost(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb, *head;
 	bool is_reneg;			/* is receiver reneging on SACKs? */

-	skb = tcp_rtx_queue_head(sk);
-	is_reneg = skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED);
+	head = tcp_rtx_queue_head(sk);
+	is_reneg = head && (TCP_SKB_CB(head)->sacked & TCPCB_SACKED_ACKED);
 	if (is_reneg) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSACKRENEGING);
 		tp->sacked_out = 0;
@@ -1943,9 +1943,13 @@ static void tcp_timeout_mark_lost(struct sock *sk)
 		tcp_reset_reno_sack(tp);
 	}

+	skb = head;
 	skb_rbtree_walk_from(skb) {
 		if (is_reneg)
 			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_ACKED;
+		else if (tcp_is_rack(sk) && skb != head &&
+			 tcp_rack_skb_timeout(tp, skb, 0) > 0)
+			continue; /* Don't mark recently sent ones lost yet */
 		tcp_mark_skb_lost(sk, skb);
 	}
 	tcp_verify_left_out(tp);
@@ -1972,7 +1976,7 @@ void tcp_enter_loss(struct sock *sk)
 		tcp_ca_event(sk, CA_EVENT_LOSS);
 		tcp_init_undo(tp);
 	}
-	tp->snd_cwnd	   = 1;
+	tp->snd_cwnd	   = tcp_packets_in_flight(tp) + 1;
 	tp->snd_cwnd_cnt   = 0;
 	tp->snd_cwnd_stamp = tcp_jiffies32;

-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related
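
The marking rule from the 8/8 patch above can be sketched in userspace
roughly as follows. This assumes no SACK reneging and no retransmissions in
flight; struct pkt and rto_mark_lost() are hypothetical names used for
illustration only, not kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the retransmit queue. */
struct pkt {
	uint64_t sent_us;	/* time of last (re)transmission */
	bool sacked;		/* already SACKed by the receiver */
	bool lost;		/* marked lost by this pass */
};

/* Walk the queue on RTO: always mark the head lost, but leave packets
 * sent less than one RACK RTT ago alone. Returns the new cwnd, i.e.
 * the packets still considered in flight plus one.
 */
uint32_t rto_mark_lost(struct pkt *q, size_t n, uint64_t now_us,
		       uint32_t rack_rtt_us)
{
	uint32_t inflight = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int64_t remaining;

		if (q[i].sacked)
			continue;	/* delivered: neither lost nor in flight */
		remaining = (int64_t)rack_rtt_us -
			    (int64_t)(now_us - q[i].sent_us);
		if (i != 0 && remaining > 0) {
			inflight++;	/* recently sent, do not mark lost yet */
			continue;
		}
		q[i].lost = true;
	}
	return inflight + 1;
}
```

Because the head is exempt from the timing test, it is marked lost even when
the RACK RTT exceeds the time since it was sent, which mirrors the exception
described in the commit message.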

* [bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace
From: John Fastabend @ 2018-05-16 23:38 UTC (permalink / raw)
  To: ast, daniel; +Cc: netdev

When an error occurs in the sockmap element update logic, also pass
the error up to the user.

Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 79f5e89..c6de139 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -1875,7 +1875,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
 		write_unlock_bh(&osock->sk_callback_lock);
 	}
 out:
-	return 0;
+	return err;
 }
 
 int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)

^ permalink raw reply related

* [PATCH bpf] bpf: fix truncated jump targets on heavy expansions
From: Daniel Borkmann @ 2018-05-16 23:44 UTC (permalink / raw)
  To: alexei.starovoitov; +Cc: netdev, Daniel Borkmann

Recently during testing, I ran into the following panic:

  [  207.892422] Internal error: Accessing user space memory outside uaccess.h routines: 96000004 [#1] SMP
  [  207.901637] Modules linked in: binfmt_misc [...]
  [  207.966530] CPU: 45 PID: 2256 Comm: test_verifier Tainted: G        W         4.17.0-rc3+ #7
  [  207.974956] Hardware name: FOXCONN R2-1221R-A4/C2U4N_MB, BIOS G31FB18A 03/31/2017
  [  207.982428] pstate: 60400005 (nZCv daif +PAN -UAO)
  [  207.987214] pc : bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  207.992603] lr : 0xffff000000bdb754
  [  207.996080] sp : ffff000013703ca0
  [  207.999384] x29: ffff000013703ca0 x28: 0000000000000001
  [  208.004688] x27: 0000000000000001 x26: 0000000000000000
  [  208.009992] x25: ffff000013703ce0 x24: ffff800fb4afcb00
  [  208.015295] x23: ffff00007d2f5038 x22: ffff00007d2f5000
  [  208.020599] x21: fffffffffeff2a6f x20: 000000000000000a
  [  208.025903] x19: ffff000009578000 x18: 0000000000000a03
  [  208.031206] x17: 0000000000000000 x16: 0000000000000000
  [  208.036510] x15: 0000ffff9de83000 x14: 0000000000000000
  [  208.041813] x13: 0000000000000000 x12: 0000000000000000
  [  208.047116] x11: 0000000000000001 x10: ffff0000089e7f18
  [  208.052419] x9 : fffffffffeff2a6f x8 : 0000000000000000
  [  208.057723] x7 : 000000000000000a x6 : 00280c6160000000
  [  208.063026] x5 : 0000000000000018 x4 : 0000000000007db6
  [  208.068329] x3 : 000000000008647a x2 : 19868179b1484500
  [  208.073632] x1 : 0000000000000000 x0 : ffff000009578c08
  [  208.078938] Process test_verifier (pid: 2256, stack limit = 0x0000000049ca7974)
  [  208.086235] Call trace:
  [  208.088672]  bpf_skb_load_helper_8_no_cache+0x34/0xc0
  [  208.093713]  0xffff000000bdb754
  [  208.096845]  bpf_test_run+0x78/0xf8
  [  208.100324]  bpf_prog_test_run_skb+0x148/0x230
  [  208.104758]  sys_bpf+0x314/0x1198
  [  208.108064]  el0_svc_naked+0x30/0x34
  [  208.111632] Code: 91302260 f9400001 f9001fa1 d2800001 (29500680)
  [  208.117717] ---[ end trace 263cb8a59b5bf29f ]---

The program itself which caused this had a long jump over the whole
instruction sequence where all of the inner instructions required
heavy expansions into multiple BPF instructions. Additionally, I
had BPF hardening enabled, which requires yet another rewrite of all
constant values in order to blind them. Each time we rewrite insns,
bpf_adj_branches() potentially needs to adjust branch targets
which cross the patchlet boundary to accommodate the additional
delta. Eventually that led to the case where the target offset could
no longer fit into insn->off's upper 0x7fff limit, at which point the
offset wraps around and becomes negative (in s16 universe), or vice
versa depending on the jump direction.

Therefore it becomes necessary to detect and reject any such occasions
in a generic way for native eBPF and cBPF to eBPF migrations. For
the latter we can simply check bounds in the bpf_convert_filter()'s
BPF_EMIT_JMP helper macro and bail out once we surpass limits. The
bpf_patch_insn_single() for native eBPF (and cBPF to eBPF in case
of subsequent hardening) is a bit more complex in that we need to
detect such truncations before hitting the bpf_prog_realloc(). Thus
the latter is split into an extra pass to probe problematic offsets
on the original program in order to fail early. With that in place
and carefully tested I no longer hit the panic and the rewrites are
rejected properly. The above example panic I've seen on bpf-next,
though the issue itself is generic in that a guard against this issue
in bpf seems more appropriate in this case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 [ Will follow up with an additional test case in bpf-next. ]

 kernel/bpf/core.c | 100 ++++++++++++++++++++++++++++++++++++++++--------------
 net/core/filter.c |  11 ++++--
 2 files changed, 84 insertions(+), 27 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba03ec3..6ef6746 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -218,47 +218,84 @@ int bpf_prog_calc_tag(struct bpf_prog *fp)
 	return 0;
 }
 
-static void bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta)
+static int bpf_adj_delta_to_imm(struct bpf_insn *insn, u32 pos, u32 delta,
+				u32 curr, const bool probe_pass)
 {
+	const s64 imm_min = S32_MIN, imm_max = S32_MAX;
+	s64 imm = insn->imm;
+
+	if (curr < pos && curr + imm + 1 > pos)
+		imm += delta;
+	else if (curr > pos + delta && curr + imm + 1 <= pos + delta)
+		imm -= delta;
+	if (imm < imm_min || imm > imm_max)
+		return -ERANGE;
+	if (!probe_pass)
+		insn->imm = imm;
+	return 0;
+}
+
+static int bpf_adj_delta_to_off(struct bpf_insn *insn, u32 pos, u32 delta,
+				u32 curr, const bool probe_pass)
+{
+	const s32 off_min = S16_MIN, off_max = S16_MAX;
+	s32 off = insn->off;
+
+	if (curr < pos && curr + off + 1 > pos)
+		off += delta;
+	else if (curr > pos + delta && curr + off + 1 <= pos + delta)
+		off -= delta;
+	if (off < off_min || off > off_max)
+		return -ERANGE;
+	if (!probe_pass)
+		insn->off = off;
+	return 0;
+}
+
+static int bpf_adj_branches(struct bpf_prog *prog, u32 pos, u32 delta,
+			    const bool probe_pass)
+{
+	u32 i, insn_cnt = prog->len + (probe_pass ? delta : 0);
 	struct bpf_insn *insn = prog->insnsi;
-	u32 i, insn_cnt = prog->len;
-	bool pseudo_call;
-	u8 code;
-	int off;
+	int ret = 0;
 
 	for (i = 0; i < insn_cnt; i++, insn++) {
+		u8 code;
+
+		/* In the probing pass we still operate on the original,
+		 * unpatched image in order to check overflows before we
+		 * do any other adjustments. Therefore skip the patchlet.
+		 */
+		if (probe_pass && i == pos) {
+			i += delta + 1;
+			insn++;
+		}
 		code = insn->code;
-		if (BPF_CLASS(code) != BPF_JMP)
-			continue;
-		if (BPF_OP(code) == BPF_EXIT)
+		if (BPF_CLASS(code) != BPF_JMP ||
+		    BPF_OP(code) == BPF_EXIT)
 			continue;
+		/* Adjust offset of jmps if we cross patch boundaries. */
 		if (BPF_OP(code) == BPF_CALL) {
-			if (insn->src_reg == BPF_PSEUDO_CALL)
-				pseudo_call = true;
-			else
+			if (insn->src_reg != BPF_PSEUDO_CALL)
 				continue;
+			ret = bpf_adj_delta_to_imm(insn, pos, delta, i,
+						   probe_pass);
 		} else {
-			pseudo_call = false;
+			ret = bpf_adj_delta_to_off(insn, pos, delta, i,
+						   probe_pass);
 		}
-		off = pseudo_call ? insn->imm : insn->off;
-
-		/* Adjust offset of jmps if we cross boundaries. */
-		if (i < pos && i + off + 1 > pos)
-			off += delta;
-		else if (i > pos + delta && i + off + 1 <= pos + delta)
-			off -= delta;
-
-		if (pseudo_call)
-			insn->imm = off;
-		else
-			insn->off = off;
+		if (ret)
+			break;
 	}
+
+	return ret;
 }
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len)
 {
 	u32 insn_adj_cnt, insn_rest, insn_delta = len - 1;
+	const u32 cnt_max = S16_MAX;
 	struct bpf_prog *prog_adj;
 
 	/* Since our patchlet doesn't expand the image, we're done. */
@@ -269,6 +306,15 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 
 	insn_adj_cnt = prog->len + insn_delta;
 
+	/* Reject anything that would potentially let the insn->off
+	 * target overflow when we have excessive program expansions.
+	 * We need to probe here before we do any reallocation where
+	 * we afterwards may not fail anymore.
+	 */
+	if (insn_adj_cnt > cnt_max &&
+	    bpf_adj_branches(prog, off, insn_delta, true))
+		return NULL;
+
 	/* Several new instructions need to be inserted. Make room
 	 * for them. Likely, there's no need for a new allocation as
 	 * last page could have large enough tailroom.
@@ -294,7 +340,11 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 		sizeof(*patch) * insn_rest);
 	memcpy(prog_adj->insnsi + off, patch, sizeof(*patch) * len);
 
-	bpf_adj_branches(prog_adj, off, insn_delta);
+	/* We are guaranteed to not fail at this point, otherwise
+	 * the ship has sailed to reverse to the original state. An
+	 * overflow cannot happen at this point.
+	 */
+	BUG_ON(bpf_adj_branches(prog_adj, off, insn_delta, false));
 
 	return prog_adj;
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index e77c30c..201ff36b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -481,11 +481,18 @@ static int bpf_convert_filter(struct sock_filter *prog, int len,
 
 #define BPF_EMIT_JMP							\
 	do {								\
+		const s32 off_min = S16_MIN, off_max = S16_MAX;		\
+		s32 off;						\
+									\
 		if (target >= len || target < 0)			\
 			goto err;					\
-		insn->off = addrs ? addrs[target] - addrs[i] - 1 : 0;	\
+		off = addrs ? addrs[target] - addrs[i] - 1 : 0;		\
 		/* Adjust pc relative offset for 2nd or 3rd insn. */	\
-		insn->off -= insn - tmp_insns;				\
+		off -= insn - tmp_insns;				\
+		/* Reject anything not fitting into insn->off. */	\
+		if (off < off_min || off > off_max)			\
+			goto err;					\
+		insn->off = off;					\
 	} while (0)
 
 		case BPF_JMP | BPF_JA:
-- 
2.9.5

^ permalink raw reply related
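
A minimal userspace sketch of the range check this patch adds for
insn->off, loosely modelled on bpf_adj_delta_to_off() above. The name
adj_jump_off() is hypothetical, and unlike the kernel code this sketch does
all index arithmetic in signed 64-bit integers (an assumption made here to
keep the comparisons free of unsigned wraparound).

```c
#include <errno.h>
#include <stdint.h>

/* Adjust a 16-bit relative jump offset after 'delta' extra instructions
 * were inserted at index 'pos'. 'curr' is the index of the jump itself.
 * Returns 0 on success, or -ERANGE if the adjusted offset no longer
 * fits into 16 bits; in that case *off_p is left untouched.
 */
int adj_jump_off(int16_t *off_p, uint32_t pos, uint32_t delta, uint32_t curr)
{
	int64_t off = *off_p;
	int64_t c = curr, p = pos, d = delta;

	if (c < p && c + off + 1 > p)
		off += d;		/* forward jump across the patchlet */
	else if (c > p + d && c + off + 1 <= p + d)
		off -= d;		/* backward jump across the patchlet */
	if (off < INT16_MIN || off > INT16_MAX)
		return -ERANGE;		/* would be truncated when stored */
	*off_p = (int16_t)off;
	return 0;
}
```

For example, a forward jump with offset 32760 that crosses a patchlet growing
by 20 instructions would need offset 32780, which exceeds 0x7fff, so the
sketch rejects it instead of storing a truncated value.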

* Re: [bpf-next PATCH] bpf: sockmap, on update propagate errors back to userspace
From: Daniel Borkmann @ 2018-05-16 23:50 UTC (permalink / raw)
  To: John Fastabend, ast; +Cc: netdev
In-Reply-To: <20180516233814.15266.67859.stgit@john-Precision-Tower-5810>

On 05/17/2018 01:38 AM, John Fastabend wrote:
> When an error happens in the update sockmap element logic also pass
> the err up to the user.
> 
> Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap")
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>

Agree, applied to bpf-next, thanks John!

^ permalink raw reply

* Re: [RFC bpf-next 04/11] bpf: Add PTR_TO_SOCKET verifier type
From: Joe Stringer @ 2018-05-16 23:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend,
	Martin KaFai Lau
In-Reply-To: <20180515023718.3zluffqkf52buc25@ast-mbp>

On 14 May 2018 at 19:37, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, May 09, 2018 at 02:07:02PM -0700, Joe Stringer wrote:
>> Teach the verifier a little bit about a new type of pointer, a
>> PTR_TO_SOCKET. This pointer type is accessed from BPF through the
>> 'struct bpf_sock' structure.
>>
>> Signed-off-by: Joe Stringer <joe@wand.net.nz>
>> ---
>>  include/linux/bpf.h          | 19 +++++++++-
>>  include/linux/bpf_verifier.h |  2 ++
>>  kernel/bpf/verifier.c        | 86 ++++++++++++++++++++++++++++++++++++++------
>>  net/core/filter.c            | 30 +++++++++-------
>>  4 files changed, 114 insertions(+), 23 deletions(-)
>
> Ack for patches 1-3. In this one few nits:
>
>> @@ -1723,6 +1752,16 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
>>               err = check_packet_access(env, regno, off, size, false);
>>               if (!err && t == BPF_READ && value_regno >= 0)
>>                       mark_reg_unknown(env, regs, value_regno);
>> +
>> +     } else if (reg->type == PTR_TO_SOCKET) {
>> +             if (t == BPF_WRITE) {
>> +                     verbose(env, "cannot write into socket\n");
>> +                     return -EACCES;
>> +             }
>> +             err = check_sock_access(env, regno, off, size, t);
>> +             if (!err && t == BPF_READ && value_regno >= 0)
>
> t == BPF_READ check is unnecessary.
>
>> @@ -5785,7 +5845,13 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>>
>>       if (ret == 0)
>>               /* program is valid, convert *(u32*)(ctx + off) accesses */
>> -             ret = convert_ctx_accesses(env);
>> +             ret = convert_ctx_accesses(env, env->ops->convert_ctx_access,
>> +                                        PTR_TO_CTX);
>> +
>> +     if (ret == 0)
>> +             /* Convert *(u32*)(sock_ops + off) accesses */
>> +             ret = convert_ctx_accesses(env, bpf_sock_convert_ctx_access,
>> +                                        PTR_TO_SOCKET);
>
> Overall looks great.
> Only this part is missing for PTR_TO_SOCKET:
>      } else if (dst_reg_type != *prev_dst_type &&
>                 (dst_reg_type == PTR_TO_CTX ||
>                  *prev_dst_type == PTR_TO_CTX)) {
>              verbose(env, "same insn cannot be used with different pointers\n");
>              return -EINVAL;
> similar logic has to be added.
> Otherwise the following will be accepted:
>
> R1 = sock_ptr
> goto X;
> ...
> R1 = some_other_valid_ptr;
> goto X;
> ...
>
> R2 = *(u32 *)(R1 + 0);
> this will be rewritten for first branch,
> but it's wrong for second.
>

Thanks for the review, will address these comments.

^ permalink raw reply

* [PATCH net] erspan: fix invalid erspan version.
From: William Tu @ 2018-05-17  0:24 UTC (permalink / raw)
  To: netdev; +Cc: gvrose8192

ERSPAN only supports versions 1 and 2.  When packets are sent to an
erspan device which does not have a proper version number set,
drop the packets.  In a real case, we observed multicast packets
sent to the erspan pernet device, erspan0, which does not have
an erspan version configured.

Reported-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
---
 net/ipv4/ip_gre.c  | 4 +++-
 net/ipv6/ip6_gre.c | 5 ++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2409e648454d..2d8efeecf619 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -734,10 +734,12 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
 		erspan_build_header(skb, ntohl(tunnel->parms.o_key),
 				    tunnel->index,
 				    truncate, true);
-	else
+	else if (tunnel->erspan_ver == 2)
 		erspan_build_header_v2(skb, ntohl(tunnel->parms.o_key),
 				       tunnel->dir, tunnel->hwid,
 				       truncate, true);
+	else
+		goto free_skb;
 
 	tunnel->parms.o_flags &= ~TUNNEL_KEY;
 	__gre_xmit(skb, dev, &tunnel->parms.iph, htons(ETH_P_ERSPAN));
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index bede77f24784..d20072fc38cb 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -991,11 +991,14 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
 			erspan_build_header(skb, ntohl(t->parms.o_key),
 					    t->parms.index,
 					    truncate, false);
-		else
+		else if (t->parms.erspan_ver == 2)
 			erspan_build_header_v2(skb, ntohl(t->parms.o_key),
 					       t->parms.dir,
 					       t->parms.hwid,
 					       truncate, false);
+		else
+			goto tx_err;
+
 		fl6.daddr = t->parms.raddr;
 	}
 
-- 
2.7.4

^ permalink raw reply related
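
The control-flow change in the erspan patch above boils down to replacing a
catch-all else with an explicit version test. A small sketch, where
enum erspan_action and erspan_pick_action() are hypothetical names invented
for illustration:

```c
/* Hypothetical outcome of the transmit-path version check. */
enum erspan_action {
	ERSPAN_BUILD_V1,
	ERSPAN_BUILD_V2,
	ERSPAN_DROP,
};

/* Before the patch, anything that was not version 1 took the v2 path.
 * After the patch, only versions 1 and 2 build a header; any other
 * value (for example an unconfigured 0) drops the packet.
 */
enum erspan_action erspan_pick_action(int erspan_ver)
{
	if (erspan_ver == 1)
		return ERSPAN_BUILD_V1;
	else if (erspan_ver == 2)
		return ERSPAN_BUILD_V2;
	return ERSPAN_DROP;
}
```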

* Re: [PATCH v3] {net, IB}/mlx5: Use 'kvfree()' for memory allocated by 'kvzalloc()'
From: Saeed Mahameed @ 2018-05-17  0:42 UTC (permalink / raw)
  To: christophe.jaillet@wanadoo.fr, Matan Barak, jgg@ziepe.ca,
	davem@davemloft.net, leon@kernel.org, dledford@redhat.com
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, kernel-janitors@vger.kernel.org
In-Reply-To: <20180516190720.11633-1-christophe.jaillet@wanadoo.fr>

On Wed, 2018-05-16 at 21:07 +0200, Christophe JAILLET wrote:
> When 'kvzalloc()' is used to allocate memory, 'kvfree()' must be used
> to free it.
> 
> Fixes: 1cbe6fc86ccfe ("IB/mlx5: Add support for CQE compressing")
> Fixes: fed9ce22bf8ae ("net/mlx5: E-Switch, Add API to create vport rx rules")
> Fixes: 9efa75254593d ("net/mlx5_core: Introduce access functions to query vport RoCE fields")
> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
> ---
> v1 -> v2: More places to update have been added to the patch
> v2 -> v3: Add Fixes tag
> 
> 3 patches with one Fixes tag each should probably be better, but
> honestly, I won't send a v4.
> Feel free to split it if needed.

Applied to mlx5-next, thanks Christophe!

> ---
>  drivers/infiniband/hw/mlx5/cq.c                            | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/vport.c            | 6 +++---
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
> index 77d257ec899b..6d52ea03574e 100644
> --- a/drivers/infiniband/hw/mlx5/cq.c
> +++ b/drivers/infiniband/hw/mlx5/cq.c
> @@ -849,7 +849,7 @@ static int create_cq_user(struct mlx5_ib_dev *dev, struct ib_udata *udata,
>  	return 0;
>  
>  err_cqb:
> -	kfree(*cqb);
> +	kvfree(*cqb);
>  
>  err_db:
>  	mlx5_ib_db_unmap_user(to_mucontext(context), &cq->db);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> index 35e256eb2f6e..b123f8a52ad8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
> @@ -663,7 +663,7 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw)
>  
>  	esw->offloads.vport_rx_group = g;
>  out:
> -	kfree(flow_group_in);
> +	kvfree(flow_group_in);
>  	return err;
>  }
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/vport.c b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> index 177e076b8d17..719cecb182c6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/vport.c
> @@ -511,7 +511,7 @@ int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev,
>  	*system_image_guid = MLX5_GET64(query_nic_vport_context_out, out,
>  					nic_vport_context.system_image_guid);
>  
> -	kfree(out);
> +	kvfree(out);
>  
>  	return 0;
>  }
> @@ -531,7 +531,7 @@ int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid)
>  	*node_guid = MLX5_GET64(query_nic_vport_context_out, out,
>  				nic_vport_context.node_guid);
>  
> -	kfree(out);
> +	kvfree(out);
>  
>  	return 0;
>  }
> @@ -587,7 +587,7 @@ int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev,
>  	*qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out,
>  				   nic_vport_context.qkey_violation_counter);
>  
> -	kfree(out);
> +	kvfree(out);
>  
>  	return 0;
>  }

^ permalink raw reply

* Re: [RFC bpf-next 06/11] bpf: Add reference tracking to verifier
From: Joe Stringer @ 2018-05-17  1:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Joe Stringer, daniel, netdev, ast, john fastabend,
	Martin KaFai Lau
In-Reply-To: <20180515030415.lhw7jhrl7uowt4la@ast-mbp>

On 14 May 2018 at 20:04, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, May 09, 2018 at 02:07:04PM -0700, Joe Stringer wrote:
>> Allow helper functions to acquire a reference and return it into a
>> register. Specific pointer types such as the PTR_TO_SOCKET will
>> implicitly represent such a reference. The verifier must ensure that
>> these references are released exactly once in each path through the
>> program.
>>
>> To achieve this, this commit assigns an id to the pointer and tracks it
>> in the 'bpf_func_state', then when the function or program exits,
>> verifies that all of the acquired references have been freed. When the
>> pointer is passed to a function that frees the reference, it is removed
>> from the 'bpf_func_state` and all existing copies of the pointer in
>> registers are marked invalid.
>>
>> Signed-off-by: Joe Stringer <joe@wand.net.nz>
>> ---
>>  include/linux/bpf_verifier.h |  18 ++-
>>  kernel/bpf/verifier.c        | 295 ++++++++++++++++++++++++++++++++++++++++---
>>  2 files changed, 292 insertions(+), 21 deletions(-)
>>
>> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
>> index 9dcd87f1d322..8dbee360b3ec 100644
>> --- a/include/linux/bpf_verifier.h
>> +++ b/include/linux/bpf_verifier.h
>> @@ -104,6 +104,11 @@ struct bpf_stack_state {
>>       u8 slot_type[BPF_REG_SIZE];
>>  };
>>
>> +struct bpf_reference_state {
>> +     int id;
>> +     int insn_idx; /* allocation insn */
>
> the insn_idx is for more verbose messages, right?
> It doesn't seem to affect the safety of algorithm.
> Please add a comment to clarify that.

Yup, will do.

>> +/* Acquire a pointer id from the env and update the state->refs to include
>> + * this new pointer reference.
>> + * On success, returns a valid pointer id to associate with the register
>> + * On failure, returns a negative errno.
>> + */
>> +static int acquire_reference_state(struct bpf_verifier_env *env, int insn_idx)
>> +{
>> +     struct bpf_func_state *state = cur_func(env);
>> +     int new_ofs = state->acquired_refs;
>> +     int id, err;
>> +
>> +     err = realloc_reference_state(state, state->acquired_refs + 1, true);
>> +     if (err)
>> +             return err;
>> +     id = ++env->id_gen;
>> +     state->refs[new_ofs].id = id;
>> +     state->refs[new_ofs].insn_idx = insn_idx;
>
> I thought that we may avoid this extra 'ref_state' array if we store
> 'id' into 'aux' array which is one to one to array of instructions
> and avoid this expensive reallocs, but then I realized we can go
> through the same instruction that returns a pointer to socket
> multiple times and every time it needs to be different 'id' and
> tracked independently, so yeah. All that infra is necessary.
> Would be good to document the algorithm a bit more.

Good point, I'll add these details to the bpf_reference_state definition.
Will consider other areas that could receive some docs attention.

>> @@ -2498,6 +2711,15 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn
>>                       return err;
>>       }
>>
>> +     /* If the function is a release() function, mark all copies of the same
>> +      * pointer as "freed" in all registers and in the stack.
>> +      */
>> +     if (is_release_function(func_id)) {
>> +             err = release_reference(env);
>
> I think this can be improved if check_func_arg() stores ptr_id into meta.
> Then this loop
>  for (i = BPF_REG_1; i < BPF_REG_6; i++) {
>        if (reg_is_refcounted(&regs[i])) {
> in release_reference() won't be needed.

That's a nice cleanup.
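
A rough sketch of the shape this cleanup could take, as a userspace model
under stated assumptions: struct reg, struct call_meta, check_arg() and
id_to_release() are hypothetical stand-ins, not the verifier's actual
types or functions. The idea is that the argument check, which visits
each argument register anyway, records the pointer id, so the release
path can read it from the meta instead of scanning the registers again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_ARG_REGS 5 /* models the five argument registers R1..R5 */

struct reg {
	bool refcounted;
	int id;
};

struct call_meta {
	int ptr_id; /* 0 means no refcounted argument was seen */
};

/* Called once per argument while it is being type-checked anyway. */
static void check_arg(const struct reg *reg, struct call_meta *meta)
{
	assert(reg != NULL && meta != NULL);
	if (reg->refcounted)
		meta->ptr_id = reg->id;
}

/* Models the argument-check pass; the id falls out as a side effect. */
static int id_to_release(const struct reg regs[NUM_ARG_REGS])
{
	struct call_meta meta = { .ptr_id = 0 };
	int i;

	for (i = 0; i < NUM_ARG_REGS; i++)
		check_arg(&regs[i], &meta);
	return meta.ptr_id;
}
```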

> Also the macros from the previous patch look ugly, but considering this patch
> I guess it's justified. At least I don't see a better way of doing it.

Completely agree, ugly, but I also didn't see a great alternative.

^ permalink raw reply

* pull-request: bpf-next 2018-05-17
From: Daniel Borkmann @ 2018-05-17  1:09 UTC (permalink / raw)
  To: davem; +Cc: daniel, ast, netdev

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
   in the kernel tables from an XDP or tc BPF program. The helper
   provides a fast-path for forwarding packets. The API supports
   IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
   implemented in this initial work, from David (Ahern).

2) Just a tiny diff, but a huge feature enabled for the nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index,
   thus opening the door for defining a fully programmable RSS/n-tuple
   filter replacement. Once BPF has decided on a queue, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based, similar to
   devmap. However, unlike devmap, where an ifindex has a 1:1 mapping
   into the map, there are use cases with sockets that need to be
   referenced using longer keys. Hence, sockhash map is added, reusing
   as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocated through an IDR, similar to
   BPF maps and progs. It also makes BTF accessible to user
   space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
   through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. So far
   build_id could not be parsed there due to the up_read() of
   current->mm->mmap_sem. This work defers the up_read() via a
   per-cpu irq_work so that at least limited support can be enabled,
   from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that reduces the number of
   emitted instructions; tested real-world programs shrank by
   three percent, from Daniel.

7) Add ifindex parameter to the libbpf loader in order to enable
   BPF offload support. Right now only iproute2 can load offloaded
   BPF and this will also enable libbpf for direct integration into
   other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
   RST format since this is the appropriate standard the kernel is
   moving to for all documentation. Also add an overview README.rst,
   from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list we can still allow gcc to check
   the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
    is a bash 4.0+ feature which is not guaranteed to be available
    when calling out to shell, therefore use a more portable variant,
    from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
    instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
    overly large key size is used in the hashmap, which then causes
    overflows in htab->elem_size. Reject bogus attr->key_size early
    in sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
    --build-id is always enabled so that test_stacktrace_build_id[_nmi]
    won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
    header infrastructure which point to one of the arch specific
    uapi headers. This was needed in order to fix a build error on
    some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
    code. And also a bpf.h tools uapi header sync in order to fix a
    selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
    section of the BPF documentation, from Wang.
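
For readers unfamiliar with the class of bug in item 12 above, a minimal
userspace sketch of the "reject early" pattern follows. Everything here
is illustrative: compute_elem_size(), HYPOTHETICAL_MAX_KEY_SIZE and
ELEM_HEADER_SIZE are made-up names and values, not what sock_hash_alloc()
actually uses. The point is only that bounding the user-supplied key size
first means the later size arithmetic cannot wrap.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define HYPOTHETICAL_MAX_KEY_SIZE 512u /* illustrative limit, not the kernel's */
#define ELEM_HEADER_SIZE 48u           /* illustrative per-element overhead */

/* Validates key_size, then derives the element size from it. */
static int compute_elem_size(uint32_t key_size, uint32_t *elem_size)
{
	assert(elem_size != NULL);
	if (key_size == 0)
		return -EINVAL;
	if (key_size > HYPOTHETICAL_MAX_KEY_SIZE)
		return -E2BIG;
	/* Round the key up to 8 bytes; cannot wrap since key_size is bounded. */
	*elem_size = ELEM_HEADER_SIZE + ((key_size + 7u) & ~7u);
	return 0;
}
```

Without the bound, a key size close to UINT32_MAX would wrap in the
rounding step and produce a tiny elem_size.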

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!

----------------------------------------------------------------

The following changes since commit 53a7bdfb2a2756cce8003b90817f8a6fb4d830d9:

  dt-bindings: dsa: Remove unnecessary #address/#size-cells (2018-05-08 20:28:44 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to e23afe5e7cba89cd0744c5218eda1b3553455c17:

  bpf: sockmap, on update propagate errors back to userspace (2018-05-17 01:48:22 +0200)

----------------------------------------------------------------
Alexei Starovoitov (4):
      Merge branch 'bpf-jit-cleanups'
      Merge branch 'fix-samples'
      Merge branch 'convert-doc-to-rst'
      selftests/bpf: make sure build-id is on

Björn Töpel (1):
      xsk: fix 64-bit division

Daniel Borkmann (14):
      Merge branch 'bpf-btf-id'
      Merge branch 'bpf-nfp-programmable-rss'
      Merge branch 'bpf-fib-lookup-helper'
      Merge branch 'bpf-perf-rb-libbpf'
      Merge branch 'bpf-stackmap-nmi'
      bpf, mips: remove unused function
      bpf, sparc: remove unused variable
      bpf, x64: clean up retpoline emission slightly
      bpf, arm32: save 4 bytes of unneeded stack space
      bpf, arm64: save 4 bytes of unneeded stack space
      bpf, arm64: optimize 32/64 immediate emission
      bpf, arm64: save 4 bytes in prologue when ebpf insns came from cbpf
      bpf: add ld64 imm test cases
      Merge branch 'bpf-sock-hashmap'

David Ahern (10):
      net/ipv6: Rename fib6_lookup to fib6_node_lookup
      net/ipv6: Rename rt6_multipath_select
      net/ipv6: Extract table lookup from ip6_pol_route
      net/ipv6: Refactor fib6_rule_action
      net/ipv6: Add fib6_lookup
      net/ipv6: Update fib6 tracepoint to take fib6_info
      net/ipv6: Add fib lookup stubs for use in bpf helper
      bpf: Provide helper to do forwarding lookups in kernel FIB table
      samples/bpf: Add example of ipv4 and ipv6 forwarding in XDP
      samples/bpf: Decrement ttl in fib forwarding example

David Beckett (1):
      libbpf: add ifindex to enable offload support

Jakub Kicinski (14):
      bpf: xdp: allow offloads to store into rx_queue_index
      nfp: bpf: support setting the RX queue index
      tools: bpftool: use PERF_SAMPLE_TIME instead of reading the clock
      samples: bpf: rename struct bpf_map_def to avoid conflict with libbpf
      samples: bpf: compile and link against full libbpf
      tools: bpf: move the event reading loop to libbpf
      tools: bpf: improve comments in libbpf.h
      tools: bpf: don't complain about no kernel version for networking code
      samples: bpf: convert some XDP samples from bpf_load to libbpf
      samples: bpf: include bpf/bpf.h instead of local libbpf.h
      samples: bpf: rename libbpf.h to bpf_insn.h
      samples: bpf: fix build after move to compiling full libbpf.a
      samples: bpf: move libbpf from object dependencies to libs
      samples: bpf: make the build less noisy

Jesper Dangaard Brouer (5):
      bpf, doc: add basic README.rst file
      bpf, doc: rename txt files to rst files
      bpf, doc: convert bpf_design_QA.rst to use RST formatting
      bpf, doc: convert bpf_devel_QA.rst to use RST formatting
      bpf, doc: howto use/run the BPF selftests

Joe Stringer (1):
      selftests/bpf: Fix bash reference in Makefile

John Fastabend (5):
      bpf: sockmap, refactor sockmap routines to work with hashmap
      bpf: sockmap, add hash map support
      bpf: selftest additions for SOCKHASH
      bpf: bpftool, support for sockhash
      bpf: sockmap, on update propagate errors back to userspace

Martin KaFai Lau (6):
      bpf: btf: Avoid WARN_ON when CONFIG_REFCOUNT_FULL=y
      bpf: btf: Introduce BTF ID
      bpf: btf: Add struct bpf_btf_info
      bpf: btf: Some test_btf clean up
      bpf: btf: Update tools/include/uapi/linux/btf.h with BTF ID
      bpf: btf: Tests for BPF_OBJ_GET_INFO_BY_FD and BPF_BTF_GET_FD_BY_ID

Mathieu Malaterre (1):
      bpf: add __printf verification to bpf_verifier_vlog

Prashant Bhole (2):
      bpf: sync tools bpf.h uapi header
      samples/bpf: xdp_monitor, accept short options

Sirio Balmelli (2):
      selftests/bpf: add architecture-agnostic headers
      selftests/bpf: ignore build products

Song Liu (2):
      bpf: enable stackmap with build_id in nmi context
      bpf: add selftest for stackmap with build_id in NMI context

Wang YanQing (1):
      bpf, doc: clarification for the meaning of 'id'

Yonghong Song (1):
      bpf: fix sock hashmap kmalloc warning

 Documentation/bpf/README.rst                      |  36 ++
 Documentation/bpf/bpf_design_QA.rst               | 221 ++++++++
 Documentation/bpf/bpf_design_QA.txt               | 156 ------
 Documentation/bpf/bpf_devel_QA.rst                | 640 +++++++++++++++++++++
 Documentation/bpf/bpf_devel_QA.txt                | 570 -------------------
 Documentation/networking/filter.txt               |  15 +-
 arch/arm/net/bpf_jit_32.c                         |  13 +-
 arch/arm64/net/bpf_jit_comp.c                     | 115 ++--
 arch/mips/net/ebpf_jit.c                          |  26 -
 arch/sparc/net/bpf_jit_comp_64.c                  |   1 -
 arch/x86/include/asm/nospec-branch.h              |  29 +-
 drivers/net/ethernet/netronome/nfp/bpf/fw.h       |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/jit.c      |  47 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.c     |  11 +
 drivers/net/ethernet/netronome/nfp/bpf/main.h     |   8 +
 drivers/net/ethernet/netronome/nfp/bpf/verifier.c |  28 +-
 drivers/net/ethernet/netronome/nfp/nfp_asm.h      |  22 +-
 include/linux/bpf.h                               |  10 +-
 include/linux/bpf_types.h                         |   1 +
 include/linux/bpf_verifier.h                      |   4 +-
 include/linux/btf.h                               |   2 +
 include/linux/filter.h                            |   3 +-
 include/net/addrconf.h                            |  14 +
 include/net/ip6_fib.h                             |  21 +-
 include/net/tcp.h                                 |   3 +-
 include/trace/events/fib6.h                       |  14 +-
 include/uapi/linux/bpf.h                          | 142 ++++-
 init/Kconfig                                      |   1 +
 kernel/bpf/btf.c                                  | 136 ++++-
 kernel/bpf/core.c                                 |   1 +
 kernel/bpf/sockmap.c                              | 644 +++++++++++++++++++---
 kernel/bpf/stackmap.c                             |  59 +-
 kernel/bpf/syscall.c                              |  41 +-
 kernel/bpf/verifier.c                             |  16 +-
 net/core/filter.c                                 | 365 +++++++++++-
 net/ipv6/addrconf_core.c                          |  33 +-
 net/ipv6/af_inet6.c                               |   6 +-
 net/ipv6/fib6_rules.c                             | 138 ++++-
 net/ipv6/ip6_fib.c                                |  21 +-
 net/ipv6/route.c                                  |  76 +--
 net/xdp/xdp_umem.c                                |   2 +-
 samples/bpf/Makefile                              | 166 +++---
 samples/bpf/{libbpf.h => bpf_insn.h}              |   8 +-
 samples/bpf/bpf_load.c                            |  12 +-
 samples/bpf/bpf_load.h                            |   6 +-
 samples/bpf/cookie_uid_helper_example.c           |   2 +-
 samples/bpf/cpustat_user.c                        |   2 +-
 samples/bpf/fds_example.c                         |   4 +-
 samples/bpf/lathist_user.c                        |   2 +-
 samples/bpf/load_sock_ops.c                       |   2 +-
 samples/bpf/lwt_len_hist_user.c                   |   2 +-
 samples/bpf/map_perf_test_user.c                  |   2 +-
 samples/bpf/sock_example.c                        |   3 +-
 samples/bpf/sock_example.h                        |   1 -
 samples/bpf/sockex1_user.c                        |   2 +-
 samples/bpf/sockex2_user.c                        |   2 +-
 samples/bpf/sockex3_user.c                        |   2 +-
 samples/bpf/syscall_tp_user.c                     |   2 +-
 samples/bpf/tc_l2_redirect_user.c                 |   2 +-
 samples/bpf/test_cgrp2_array_pin.c                |   2 +-
 samples/bpf/test_cgrp2_attach.c                   |   3 +-
 samples/bpf/test_cgrp2_attach2.c                  |   3 +-
 samples/bpf/test_cgrp2_sock.c                     |   3 +-
 samples/bpf/test_cgrp2_sock2.c                    |   3 +-
 samples/bpf/test_current_task_under_cgroup_user.c |   2 +-
 samples/bpf/test_lru_dist.c                       |   2 +-
 samples/bpf/test_map_in_map_user.c                |   2 +-
 samples/bpf/test_overhead_user.c                  |   2 +-
 samples/bpf/test_probe_write_user_user.c          |   2 +-
 samples/bpf/trace_output_user.c                   |   8 +-
 samples/bpf/tracex1_user.c                        |   2 +-
 samples/bpf/tracex2_user.c                        |   2 +-
 samples/bpf/tracex3_user.c                        |   2 +-
 samples/bpf/tracex4_user.c                        |   2 +-
 samples/bpf/tracex5_user.c                        |   2 +-
 samples/bpf/tracex6_user.c                        |   2 +-
 samples/bpf/tracex7_user.c                        |   2 +-
 samples/bpf/xdp1_user.c                           |  31 +-
 samples/bpf/xdp_adjust_tail_user.c                |  36 +-
 samples/bpf/xdp_fwd_kern.c                        | 138 +++++
 samples/bpf/xdp_fwd_user.c                        | 136 +++++
 samples/bpf/xdp_monitor_user.c                    |   6 +-
 samples/bpf/xdp_redirect_cpu_user.c               |   2 +-
 samples/bpf/xdp_redirect_map_user.c               |   2 +-
 samples/bpf/xdp_redirect_user.c                   |   2 +-
 samples/bpf/xdp_router_ipv4_user.c                |   2 +-
 samples/bpf/xdp_rxq_info_user.c                   |  46 +-
 samples/bpf/xdp_tx_iptunnel_user.c                |   2 +-
 samples/bpf/xdpsock_user.c                        |   2 +-
 tools/bpf/bpftool/.gitignore                      |   3 +
 tools/bpf/bpftool/map.c                           |   1 +
 tools/bpf/bpftool/map_perf_ring.c                 |  83 +--
 tools/include/uapi/asm/bitsperlong.h              |  18 +
 tools/include/uapi/asm/errno.h                    |  18 +
 tools/include/uapi/linux/bpf.h                    | 143 ++++-
 tools/lib/bpf/Makefile                            |   2 +-
 tools/lib/bpf/bpf.c                               |  12 +
 tools/lib/bpf/bpf.h                               |   3 +
 tools/lib/bpf/libbpf.c                            | 125 ++++-
 tools/lib/bpf/libbpf.h                            |  62 ++-
 tools/testing/selftests/bpf/.gitignore            |   1 +
 tools/testing/selftests/bpf/Makefile              |  12 +-
 tools/testing/selftests/bpf/bpf_helpers.h         |  11 +
 tools/testing/selftests/bpf/bpf_rand.h            |  80 +++
 tools/testing/selftests/bpf/test_btf.c            | 478 ++++++++++++----
 tools/testing/selftests/bpf/test_progs.c          | 140 ++++-
 tools/testing/selftests/bpf/test_sockhash_kern.c  |   5 +
 tools/testing/selftests/bpf/test_sockmap.c        |  27 +-
 tools/testing/selftests/bpf/test_sockmap_kern.c   | 343 +-----------
 tools/testing/selftests/bpf/test_sockmap_kern.h   | 363 ++++++++++++
 tools/testing/selftests/bpf/test_verifier.c       |  62 +++
 tools/testing/selftests/bpf/trace_helpers.c       |  87 +--
 tools/testing/selftests/bpf/trace_helpers.h       |  11 +-
 tools/testing/selftests/bpf/urandom_read.c        |  10 +-
 114 files changed, 4600 insertions(+), 1865 deletions(-)
 create mode 100644 Documentation/bpf/README.rst
 create mode 100644 Documentation/bpf/bpf_design_QA.rst
 delete mode 100644 Documentation/bpf/bpf_design_QA.txt
 create mode 100644 Documentation/bpf/bpf_devel_QA.rst
 delete mode 100644 Documentation/bpf/bpf_devel_QA.txt
 rename samples/bpf/{libbpf.h => bpf_insn.h} (98%)
 create mode 100644 samples/bpf/xdp_fwd_kern.c
 create mode 100644 samples/bpf/xdp_fwd_user.c
 create mode 100644 tools/bpf/bpftool/.gitignore
 create mode 100644 tools/include/uapi/asm/bitsperlong.h
 create mode 100644 tools/include/uapi/asm/errno.h
 create mode 100644 tools/testing/selftests/bpf/bpf_rand.h
 create mode 100644 tools/testing/selftests/bpf/test_sockhash_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_sockmap_kern.h

^ permalink raw reply

* Re: [PATCH 34/40] atm: simplify procfs code
From: Eric W. Biederman @ 2018-05-17  1:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Alexander Viro, Alexey Dobriyan,
	Greg Kroah-Hartman, Jiri Slaby, Alessandro Zummo,
	Alexandre Belloni, linux-acpi, drbd-dev, linux-ide, netdev,
	linux-rtc, megaraidlinux.pdl, linux-scsi, devel, linux-afs,
	linux-ext4, jfs-discussion, netfilter-devel, linux-kernel
In-Reply-To: <20180515141232.GD31296@lst.de>

Christoph Hellwig <hch@lst.de> writes:

> On Sat, May 05, 2018 at 07:51:18AM -0500, Eric W. Biederman wrote:
>> Christoph Hellwig <hch@lst.de> writes:
>> 
>> > Use remove_proc_subtree to remove the whole subtree on cleanup, and
>> > unwind the registration loop into individual calls.  Switch to use
>> > proc_create_seq where applicable.
>> 
>> Can you please explain why you are removing the error handling when
>> you are unwinding the registration loop?
>
> Because there is no point in handling these errors.  The code works
> perfectly fine without procfs, or without the given proc files, and the
> removal works just fine if they don't exist either.  This is a very
> common pattern in various parts of the kernel already.
>
> I'll document it better in the changelog.

Thank you.  That is the kind of thing that could be a signal of
inattentiveness and problems, especially when it is not documented.

Eric

^ permalink raw reply

* linux-next: manual merge of the net-next tree with the vfs tree
From: Stephen Rothwell @ 2018-05-17  1:34 UTC (permalink / raw)
  To: David Miller, Networking, Al Viro
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Christoph Hellwig, Chris Novakovic

[-- Attachment #1: Type: text/plain, Size: 3612 bytes --]

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/ipv4/ipconfig.c

between commits:

  3f3942aca6da ("proc: introduce proc_create_single{,_data}")
  c04d2cb2009f ("ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_servers")

from the vfs tree and commit:

  4d019b3f80dc ("ipconfig: Create /proc/net/ipconfig directory")

from the net-next tree.

I fixed it up (see below - there may be more to do) and can carry the
fix as necessary. This is now fixed as far as linux-next is concerned,
but any non trivial conflicts should be mentioned to your upstream
maintainer when your tree is submitted for merging.  You may also want
to consider cooperating with the maintainer of the conflicting tree to
minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/ipv4/ipconfig.c
index bbcbcc113d19,86c9f755de3d..000000000000
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@@ -1282,6 -1317,74 +1317,61 @@@ static int pnp_seq_show(struct seq_fil
  			   &ic_servaddr);
  	return 0;
  }
 -
 -static int pnp_seq_open(struct inode *indoe, struct file *file)
 -{
 -	return single_open(file, pnp_seq_show, NULL);
 -}
 -
 -static const struct file_operations pnp_seq_fops = {
 -	.open		= pnp_seq_open,
 -	.read		= seq_read,
 -	.llseek		= seq_lseek,
 -	.release	= single_release,
 -};
 -
+ /* Create the /proc/net/ipconfig directory */
+ static int __init ipconfig_proc_net_init(void)
+ {
+ 	ipconfig_dir = proc_net_mkdir(&init_net, "ipconfig", init_net.proc_net);
+ 	if (!ipconfig_dir)
+ 		return -ENOMEM;
+ 
+ 	return 0;
+ }
+ 
+ /* Create a new file under /proc/net/ipconfig */
+ static int ipconfig_proc_net_create(const char *name,
+ 				    const struct file_operations *fops)
+ {
+ 	char *pname;
+ 	struct proc_dir_entry *p;
+ 
+ 	if (!ipconfig_dir)
+ 		return -ENOMEM;
+ 
+ 	pname = kasprintf(GFP_KERNEL, "%s%s", "ipconfig/", name);
+ 	if (!pname)
+ 		return -ENOMEM;
+ 
+ 	p = proc_create(pname, 0444, init_net.proc_net, fops);
+ 	kfree(pname);
+ 	if (!p)
+ 		return -ENOMEM;
+ 
+ 	return 0;
+ }
+ 
+ /* Write NTP server IP addresses to /proc/net/ipconfig/ntp_servers */
+ static int ntp_servers_seq_show(struct seq_file *seq, void *v)
+ {
+ 	int i;
+ 
+ 	for (i = 0; i < CONF_NTP_SERVERS_MAX; i++) {
+ 		if (ic_ntp_servers[i] != NONE)
+ 			seq_printf(seq, "%pI4\n", &ic_ntp_servers[i]);
+ 	}
+ 	return 0;
+ }
+ 
+ static int ntp_servers_seq_open(struct inode *inode, struct file *file)
+ {
+ 	return single_open(file, ntp_servers_seq_show, NULL);
+ }
+ 
+ static const struct file_operations ntp_servers_seq_fops = {
+ 	.open		= ntp_servers_seq_open,
+ 	.read		= seq_read,
+ 	.llseek		= seq_lseek,
+ 	.release	= single_release,
+ };
  #endif /* CONFIG_PROC_FS */
  
  /*
@@@ -1356,8 -1459,20 +1446,20 @@@ static int __init ip_auto_config(void
  	int err;
  	unsigned int i;
  
+ 	/* Initialise all name servers and NTP servers to NONE (but only if the
+ 	 * "ip=" or "nfsaddrs=" kernel command line parameters weren't decoded,
+ 	 * otherwise we'll overwrite the IP addresses specified there)
+ 	 */
+ 	if (ic_set_manually == 0) {
+ 		ic_nameservers_predef();
+ 		ic_ntp_servers_predef();
+ 	}
+ 
  #ifdef CONFIG_PROC_FS
 -	proc_create("pnp", 0444, init_net.proc_net, &pnp_seq_fops);
 +	proc_create_single("pnp", 0444, init_net.proc_net, pnp_seq_show);
+ 
+ 	if (ipconfig_proc_net_init() == 0)
+ 		ipconfig_proc_net_create("ntp_servers", &ntp_servers_seq_fops);
  #endif /* CONFIG_PROC_FS */
  
  	if (!ic_enable)

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH ghak81 V3 3/3] audit: collect audit task parameters
From: kbuild test robot @ 2018-05-17  1:43 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: kbuild-all, Linux-Audit Mailing List, LKML,
	Linux NetDev Upstream Mailing List, Netfilter Devel List,
	Linux Security Module list, Integrity Measurement Architecture,
	SElinux list, Eric Paris, Paul Moore, Steve Grubb, Ingo Molnar,
	David Howells, Richard Guy Briggs
In-Reply-To: <fbed63483b5206009ee43ae889b30d43051f386c.1526430313.git.rgb@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 2340 bytes --]

Hi Richard,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on next-20180516]
[cannot apply to linus/master tip/sched/core v4.17-rc5 v4.17-rc4 v4.17-rc3 v4.17-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Richard-Guy-Briggs/audit-group-task-params/20180517-090703
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   kernel/fork.c: In function 'copy_process':
>> kernel/fork.c:1739:3: error: 'struct task_struct' has no member named 'audit'
     p->audit = NULL;
      ^~

vim +1739 kernel/fork.c

  1728	
  1729		p->default_timer_slack_ns = current->timer_slack_ns;
  1730	
  1731		task_io_accounting_init(&p->ioac);
  1732		acct_clear_integrals(p);
  1733	
  1734		posix_cpu_timers_init(p);
  1735	
  1736		p->start_time = ktime_get_ns();
  1737		p->real_start_time = ktime_get_boot_ns();
  1738		p->io_context = NULL;
> 1739		p->audit = NULL;
  1740		cgroup_fork(p);
  1741	#ifdef CONFIG_NUMA
  1742		p->mempolicy = mpol_dup(p->mempolicy);
  1743		if (IS_ERR(p->mempolicy)) {
  1744			retval = PTR_ERR(p->mempolicy);
  1745			p->mempolicy = NULL;
  1746			goto bad_fork_cleanup_threadgroup_lock;
  1747		}
  1748	#endif
  1749	#ifdef CONFIG_CPUSETS
  1750		p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
  1751		p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
  1752		seqcount_init(&p->mems_allowed_seq);
  1753	#endif
  1754	#ifdef CONFIG_TRACE_IRQFLAGS
  1755		p->irq_events = 0;
  1756		p->hardirqs_enabled = 0;
  1757		p->hardirq_enable_ip = 0;
  1758		p->hardirq_enable_event = 0;
  1759		p->hardirq_disable_ip = _THIS_IP_;
  1760		p->hardirq_disable_event = 0;
  1761		p->softirqs_enabled = 1;
  1762		p->softirq_enable_ip = _THIS_IP_;
  1763		p->softirq_enable_event = 0;
  1764		p->softirq_disable_ip = 0;
  1765		p->softirq_disable_event = 0;
  1766		p->hardirq_context = 0;
  1767		p->softirq_context = 0;
  1768	#endif
  1769	
  1770		p->pagefault_disabled = 0;
  1771	
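
One common cause of this class of error is a member that is only declared
under a config option while a caller assigns it unconditionally; whether
that is what happened in this patch is an assumption, not something the
report states. A minimal userspace sketch of the usual remedy follows,
with hypothetical names throughout (struct task, task_audit_init(),
HYPOTHETICAL_CONFIG_AUDIT): the access is hidden behind a helper that has
a stub variant when the option is off.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL_CONFIG_AUDIT stands in for a Kconfig option and is
 * deliberately left undefined here, as in a tinyconfig build.
 */
struct audit_info;

struct task {
	int pid;
#ifdef HYPOTHETICAL_CONFIG_AUDIT
	struct audit_info *audit;
#endif
};

#ifdef HYPOTHETICAL_CONFIG_AUDIT
static inline void task_audit_init(struct task *t)
{
	assert(t != NULL);
	t->audit = NULL;
}
static inline int task_has_audit(void) { return 1; }
#else
/* Stub: the member does not exist, so there is nothing to initialise. */
static inline void task_audit_init(struct task *t)
{
	assert(t != NULL);
}
static inline int task_has_audit(void) { return 0; }
#endif
```

A caller can then invoke task_audit_init() unconditionally and the code
builds with the option on or off.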

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6329 bytes --]
