>From feff53ceeb01f37c4298612e1c84fcdfb4dcd99f Mon Sep 17 00:00:00 2001 From: Daniel Borkmann Date: Mon, 25 Jul 2016 18:06:12 +0200 Subject: [PATCH 01/10] udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skb [ Upstream commit ba66bbe5480a012108958a71cff88b23dce84956 ] After a612769774a3 ("udp: prevent bugcheck if filter truncates packet too much"), there followed various other fixes for similar cases such as f4979fcea7fd ("rose: limit sk_filter trim to payload"). Latter introduced a new helper sk_filter_trim_cap(), where we can pass the trim limit directly to the socket filter handling. Make use of it here as well with sizeof(struct udphdr) as lower cap limit and drop the extra skb->len test in UDP's input path. Signed-off-by: Daniel Borkmann Cc: Willem de Bruijn Acked-by: Willem de Bruijn Signed-off-by: David S. Miller --- net/ipv4/udp.c | 4 +--- net/ipv6/udp.c | 4 +--- 2 files changed, 2 insertions(+), 6 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 4aed8fc..e61f7cd 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -1581,9 +1581,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) udp_lib_checksum_complete(skb)) goto csum_error; - if (sk_filter(sk, skb)) - goto drop; - if (unlikely(skb->len < sizeof(struct udphdr))) + if (sk_filter_trim_cap(sk, skb, sizeof(struct udphdr))) goto drop; udp_csum_pull_header(skb); diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index acc09705..42a2edf 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -618,9 +618,7 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) udp_lib_checksum_complete(skb)) goto csum_error; - if (sk_filter(sk, skb)) - goto drop; - if (unlikely(skb->len < sizeof(struct udphdr))) + if (sk_filter_trim_cap(sk, skb, sizeof(struct udphdr))) goto drop; udp_csum_pull_header(skb); -- 2.1.0 >From f8561e442d29f3a00ab657c275f1bc68c9dab6bb Mon Sep 17 00:00:00 2001 From: Mark Bloch Date: Thu, 21 Jul 2016 11:52:55 +0300 Subject: [PATCH 02/10] net/bonding: Enforce active-backup policy for IPoIB bonds [ Upstream commit 1533e77315220dc1d5ec3bd6d9fe32e2aa0a74c0 ] When using an IPoIB bond currently only active-backup mode is a valid use case and this commit strengthens it. Since commit 2ab82852a270 ("net/bonding: Enable bonding to enslave netdevices not supporting set_mac_address()") was introduced till 4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence the fail over mac policy always applied to IPoIB bonds. With the introduction of commit 492a7e67ff83 ("IB/IPoIB: Allow setting the device address"), that doesn't hold and practically IPoIB bonds are broken as of that. To fix it, lets go to fail over mac if the device doesn't support the ndo OR this is IPoIB device. As a by-product, this commit also prevents a stack corruption which occurred when trying to copy 20 bytes (IPoIB) device address to a sockaddr struct that has only 16 bytes of storage. Signed-off-by: Mark Bloch Signed-off-by: Or Gerlitz Signed-off-by: Saeed Mahameed Acked-by: Andy Gospodarek Signed-off-by: Jay Vosburgh Signed-off-by: David S. Miller --- drivers/net/bonding/bond_main.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index a2afa3b..4d79819 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1422,7 +1422,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) return -EINVAL; } - if (slave_ops->ndo_set_mac_address == NULL) { + if (slave_dev->type == ARPHRD_INFINIBAND && + BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + netdev_warn(bond_dev, "Type (%d) supports only active-backup mode\n", + slave_dev->type); + res = -EOPNOTSUPP; + goto err_undo_flags; + } + + if (!slave_ops->ndo_set_mac_address || + slave_dev->type == ARPHRD_INFINIBAND) { netdev_warn(bond_dev, "The slave device specified does not support setting the MAC address\n"); if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP && bond->params.fail_over_mac != BOND_FOM_ACTIVE) { -- 2.1.0 >From f0e152b3f0b9ab321d69ccc7359eb7371be1d163 Mon Sep 17 00:00:00 2001 From: Ido Schimmel Date: Fri, 22 Jul 2016 14:56:20 +0300 Subject: [PATCH 03/10] bridge: Fix incorrect re-injection of LLDP packets [ Upstream commit baedbe55884c003819f5c8c063ec3d2569414296 ] Commit 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") caused LLDP packets arriving through a bridge port to be re-injected to the Rx path with skb->dev set to the bridge device, but this breaks the lldpad daemon. The lldpad daemon opens a packet socket with protocol set to ETH_P_LLDP for any valid device on the system, which doesn't not include soft devices such as bridge and VLAN. Since packet sockets (ptype_base) are processed in the Rx path after the Rx handler, LLDP packets with skb->dev set to the bridge device never reach the lldpad daemon. Fix this by making the bridge's Rx handler re-inject LLDP packets with RX_HANDLER_PASS, which effectively restores the behaviour prior to the mentioned commit. This means netfilter will never receive LLDP packets coming through a bridge port, as I don't see a way in which we can have okfn() consume the packet without breaking existing behaviour. I've already carried out a similar fix for STP packets in commit 56fae404fb2c ("bridge: Fix incorrect re-injection of STP packets"). Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Ido Schimmel Reviewed-by: Jiri Pirko Cc: Florian Westphal Cc: John Fastabend Signed-off-by: David S. Miller --- net/bridge/br_input.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c index 43d2cd8..28d5ec2 100644 --- a/net/bridge/br_input.c +++ b/net/bridge/br_input.c @@ -288,6 +288,14 @@ rx_handler_result_t br_handle_frame(struct sk_buff **pskb) case 0x01: /* IEEE MAC (Pause) */ goto drop; + case 0x0E: /* 802.1AB LLDP */ + fwd_mask |= p->br->group_fwd_mask; + if (fwd_mask & (1u << dest[5])) + goto forward; + *pskb = skb; + __br_handle_local_finish(skb); + return RX_HANDLER_PASS; + default: /* Allow selective forwarding for most other protocols */ fwd_mask |= p->br->group_fwd_mask; -- 2.1.0 >From 0993829fe12ff067e16673660be2f799e1f842ae Mon Sep 17 00:00:00 2001 From: Mike Manning Date: Fri, 22 Jul 2016 18:32:11 +0100 Subject: [PATCH 04/10] net: ipv6: Always leave anycast and multicast groups on link down [ Upstream commit ea06f7176413e2538d13bb85b65387d0917943d9 ] Default kernel behavior is to delete IPv6 addresses on link down, which entails deletion of the multicast and the subnet-router anycast addresses. These deletions do not happen with sysctl setting to keep global IPv6 addresses on link down, so every link down/up causes an increment of the anycast and multicast refcounts. These bogus refcounts may stop these addrs from being removed on subsequent calls to delete them. The solution is to leave the groups for the multicast and subnet anycast on link down for the callflow when global IPv6 addresses are kept. Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional") Signed-off-by: Mike Manning Acked-by: David Ahern Signed-off-by: David S. Miller --- net/ipv6/addrconf.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 47f837a..047c75a 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -3562,6 +3562,10 @@ restart: if (state != INET6_IFADDR_STATE_DEAD) { __ipv6_ifa_notify(RTM_DELADDR, ifa); inet6addr_notifier_call_chain(NETDEV_DOWN, ifa); + } else { + if (idev->cnf.forwarding) + addrconf_leave_anycast(ifa); + addrconf_leave_solict(ifa->idev, &ifa->addr); } write_lock_bh(&idev->lock); -- 2.1.0 >From 9809c8e278461bc4367e6b09018cfcf87943efe0 Mon Sep 17 00:00:00 2001 From: Marcelo Ricardo Leitner Date: Sat, 23 Jul 2016 00:32:48 -0300 Subject: [PATCH 05/10] sctp: fix BH handling on socket backlog [ Upstream commit eefc1b1d105ee4d2ce907833ce675f1e9599b5e3 ] Now that the backlog processing is called with BH enabled, we have to disable BH before taking the socket lock via bh_lock_sock() otherwise it may dead lock: sctp_backlog_rcv() bh_lock_sock(sk); if (sock_owned_by_user(sk)) { if (sk_add_backlog(sk, skb, sk->sk_rcvbuf)) sctp_chunk_free(chunk); else backloged = 1; } else sctp_inq_push(inqueue, chunk); bh_unlock_sock(sk); while sctp_inq_push() was disabling/enabling BH, but enabling BH triggers pending softirq, which then may try to re-lock the socket in sctp_rcv(). [ 219.187215] [ 219.187217] [] _raw_spin_lock+0x20/0x30 [ 219.187223] [] sctp_rcv+0x48c/0xba0 [sctp] [ 219.187225] [] ? nf_iterate+0x62/0x80 [ 219.187226] [] ip_local_deliver_finish+0x94/0x1e0 [ 219.187228] [] ip_local_deliver+0x6f/0xf0 [ 219.187229] [] ? ip_rcv_finish+0x3b0/0x3b0 [ 219.187230] [] ip_rcv_finish+0xd8/0x3b0 [ 219.187232] [] ip_rcv+0x282/0x3a0 [ 219.187233] [] ? update_curr+0x66/0x180 [ 219.187235] [] __netif_receive_skb_core+0x524/0xa90 [ 219.187236] [] ? update_cfs_shares+0x30/0xf0 [ 219.187237] [] ? __enqueue_entity+0x6c/0x70 [ 219.187239] [] ? enqueue_entity+0x204/0xdf0 [ 219.187240] [] __netif_receive_skb+0x18/0x60 [ 219.187242] [] process_backlog+0x9e/0x140 [ 219.187243] [] net_rx_action+0x22c/0x370 [ 219.187245] [] __do_softirq+0x112/0x2e7 [ 219.187247] [] do_softirq_own_stack+0x1c/0x30 [ 219.187247] [ 219.187248] [] do_softirq.part.14+0x38/0x40 [ 219.187249] [] __local_bh_enable_ip+0x7d/0x80 [ 219.187254] [] sctp_inq_push+0x68/0x80 [sctp] [ 219.187258] [] sctp_backlog_rcv+0x151/0x1c0 [sctp] [ 219.187260] [] __release_sock+0x87/0xf0 [ 219.187261] [] release_sock+0x30/0xa0 [ 219.187265] [] sctp_accept+0x17d/0x210 [sctp] [ 219.187266] [] ? prepare_to_wait_event+0xf0/0xf0 [ 219.187268] [] inet_accept+0x3c/0x130 [ 219.187269] [] SYSC_accept4+0x103/0x210 [ 219.187271] [] ? _raw_spin_unlock_bh+0x1a/0x20 [ 219.187272] [] ? release_sock+0x8c/0xa0 [ 219.187276] [] ? sctp_inet_listen+0x62/0x1b0 [sctp] [ 219.187277] [] SyS_accept+0x10/0x20 Fixes: 860fbbc343bf ("sctp: prepare for socket backlog behavior change") Cc: Eric Dumazet Signed-off-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- net/sctp/input.c | 2 ++ net/sctp/inqueue.c | 2 -- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/net/sctp/input.c b/net/sctp/input.c index 47cf460..f093322 100644 --- a/net/sctp/input.c +++ b/net/sctp/input.c @@ -328,6 +328,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb) */ sk = rcvr->sk; + local_bh_disable(); bh_lock_sock(sk); if (sock_owned_by_user(sk)) { @@ -339,6 +340,7 @@ int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb) sctp_inq_push(inqueue, chunk); bh_unlock_sock(sk); + local_bh_enable(); /* If the chunk was backloged again, don't drop refs */ if (backloged) diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c index 9d87bba..b335ffc 100644 --- a/net/sctp/inqueue.c +++ b/net/sctp/inqueue.c @@ -89,12 +89,10 @@ void sctp_inq_push(struct sctp_inq *q, struct sctp_chunk *chunk) * Eventually, we should clean up inqueue to not rely * on the BH related data structures. */ - local_bh_disable(); list_add_tail(&chunk->list, &q->in_chunk_list); if (chunk->asoc) chunk->asoc->stats.ipackets++; q->immediate.func(&q->immediate); - local_bh_enable(); } /* Peek at the next chunk on the inqeue. */ -- 2.1.0 >From 4e76e1c6648ac314ba982866a292dccbf93cab6c Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Sat, 23 Jul 2016 07:43:50 +0200 Subject: [PATCH 06/10] net/irda: fix NULL pointer dereference on memory allocation failure [ Upstream commit d3e6952cfb7ba5f4bfa29d4803ba91f96ce1204d ] I ran into this: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 2 PID: 2012 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 task: ffff8800b745f2c0 ti: ffff880111740000 task.ti: ffff880111740000 RIP: 0010:[] [] irttp_connect_request+0x36/0x710 RSP: 0018:ffff880111747bb8 EFLAGS: 00010286 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000069dd8358 RDX: 0000000000000009 RSI: 0000000000000027 RDI: 0000000000000048 RBP: ffff880111747c00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000069dd8358 R11: 1ffffffff0759723 R12: 0000000000000000 R13: ffff88011a7e4780 R14: 0000000000000027 R15: 0000000000000000 FS: 00007fc738404700(0000) GS:ffff88011af00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc737fdfb10 CR3: 0000000118087000 CR4: 00000000000006e0 Stack: 0000000000000200 ffff880111747bd8 ffffffff810ee611 ffff880119f1f220 ffff880119f1f4f8 ffff880119f1f4f0 ffff88011a7e4780 ffff880119f1f232 ffff880119f1f220 ffff880111747d58 ffffffff82bca542 0000000000000000 Call Trace: [] irda_connect+0x562/0x1190 [] SYSC_connect+0x202/0x2a0 [] SyS_connect+0x9/0x10 [] do_syscall_64+0x19c/0x410 [] entry_SYSCALL64_slow_path+0x25/0x25 Code: 41 89 ca 48 89 e5 41 57 41 56 41 55 41 54 41 89 d7 53 48 89 fb 48 83 c7 48 48 89 fa 41 89 f6 48 c1 ea 03 48 83 ec 20 4c 8b 65 10 <0f> b6 04 02 84 c0 74 08 84 c0 0f 8e 4c 04 00 00 80 7b 48 00 74 RIP [] irttp_connect_request+0x36/0x710 RSP ---[ end trace 4cda2588bc055b30 ]--- The problem is that irda_open_tsap() can fail and leave self->tsap = NULL, and then irttp_connect_request() almost immediately dereferences it. Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum Signed-off-by: David S. Miller --- net/irda/af_irda.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c index 923abd6..8d2f7c9 100644 --- a/net/irda/af_irda.c +++ b/net/irda/af_irda.c @@ -1024,8 +1024,11 @@ static int irda_connect(struct socket *sock, struct sockaddr *uaddr, } /* Check if we have opened a local TSAP */ - if (!self->tsap) - irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (!self->tsap) { + err = irda_open_tsap(self, LSAP_ANY, addr->sir_name); + if (err) + goto out; + } /* Move to connecting socket, start sending Connect Requests */ sock->state = SS_CONNECTING; -- 2.1.0 >From 716cc6f0afc29f7f5f1dccb8ac4ff6b08b83f425 Mon Sep 17 00:00:00 2001 From: Vegard Nossum Date: Sat, 23 Jul 2016 09:42:35 +0200 Subject: [PATCH 07/10] net/sctp: terminate rhashtable walk correctly [ Upstream commit 5fc382d87517707ad77ea4c9c12e2a3fde2c838a ] I was seeing a lot of these: BUG: sleeping function called from invalid context at mm/slab.h:388 in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2 Preemption disabled at:[] rhashtable_walk_start+0x46/0x150 [] preempt_count_add+0x1fb/0x280 [] _raw_spin_lock+0x12/0x40 [] console_unlock+0x2f7/0x930 [] vprintk_emit+0x2fb/0x520 [] vprintk_default+0x1a/0x20 [] printk+0x94/0xb0 [] print_stack_trace+0xe0/0x170 [] ___might_sleep+0x3be/0x460 [] __might_sleep+0x90/0x1a0 [] kmem_cache_alloc+0x153/0x1e0 [] rhashtable_walk_init+0xfe/0x2d0 [] sctp_transport_walk_start+0x1e/0x60 [] sctp_transport_seq_start+0x4d/0x150 [] seq_read+0x27b/0x1180 [] proc_reg_read+0xbc/0x180 [] __vfs_read+0xdb/0x610 [] vfs_read+0xea/0x2d0 [] SyS_pread64+0x11b/0x150 [] do_syscall_64+0x19c/0x410 [] return_from_SYSCALL_64+0x0/0x6a [] 0xffffffffffffffff Apparently we always need to call rhashtable_walk_stop(), even when rhashtable_walk_start() fails: * rhashtable_walk_start - Start a hash table walk * @iter: Hash table iterator * * Start a hash table walk. Note that we take the RCU lock in all * cases including when we return an error. So you must always call * rhashtable_walk_stop to clean up. otherwise we never call rcu_read_unlock() and we get the splat above. Fixes: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag") See-also: 53fa1036 ("sctp: fix some rhashtable functions using in sctp proc/diag") See-also: f2dba9c6 ("rhashtable: Introduce rhashtable_walk_*") Cc: Xin Long Cc: Herbert Xu Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum Acked-by: Marcelo Ricardo Leitner Signed-off-by: David S. Miller --- net/sctp/socket.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/sctp/socket.c b/net/sctp/socket.c index 67154b8..7f5689a 100644 --- a/net/sctp/socket.c +++ b/net/sctp/socket.c @@ -4301,6 +4301,7 @@ int sctp_transport_walk_start(struct rhashtable_iter *iter) err = rhashtable_walk_start(iter); if (err && err != -EAGAIN) { + rhashtable_walk_stop(iter); rhashtable_walk_exit(iter); return err; } -- 2.1.0 >From a1dc5a3ce16bab0f568cbb0a68bc68da63a664ef Mon Sep 17 00:00:00 2001 From: Manish Chopra Date: Mon, 25 Jul 2016 19:07:46 +0300 Subject: [PATCH 08/10] qed: Fix setting/clearing bit in completion bitmap [ Upstream commit 59d3f1ceb69b54569685d0c34dff16a1e0816b19 ] Slowpath completion handling is incorrectly changing SPQ_RING_SIZE bits instead of a single one. Fixes: 76a9a3642a0b ("qed: fix handling of concurrent ramrods") Signed-off-by: Manish Chopra Signed-off-by: Yuval Mintz Signed-off-by: David S. Miller --- drivers/net/ethernet/qlogic/qed/qed_spq.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c b/drivers/net/ethernet/qlogic/qed/qed_spq.c index b122f60..03601df 100644 --- a/drivers/net/ethernet/qlogic/qed/qed_spq.c +++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c @@ -809,13 +809,12 @@ int qed_spq_completion(struct qed_hwfn *p_hwfn, * in a bitmap and increasing the chain consumer only * for the first successive completed entries. */ - bitmap_set(p_spq->p_comp_bitmap, pos, SPQ_RING_SIZE); + __set_bit(pos, p_spq->p_comp_bitmap); while (test_bit(p_spq->comp_bitmap_idx, p_spq->p_comp_bitmap)) { - bitmap_clear(p_spq->p_comp_bitmap, - p_spq->comp_bitmap_idx, - SPQ_RING_SIZE); + __clear_bit(p_spq->comp_bitmap_idx, + p_spq->p_comp_bitmap); p_spq->comp_bitmap_idx++; qed_chain_return_produced(&p_spq->chain); } -- 2.1.0 >From d7d53adcacbbbe2a3f6bfb4d6ea6a482e028d495 Mon Sep 17 00:00:00 2001 From: Beniamino Galvani Date: Tue, 26 Jul 2016 12:24:53 +0200 Subject: [PATCH 09/10] macsec: ensure rx_sa is set when validation is disabled [ Upstream commit e3a3b626010a14fe067f163c2c43409d5afcd2a9 ] macsec_decrypt() is not called when validation is disabled and so macsec_skb_cb(skb)->rx_sa is not set; but it is used later in macsec_post_decrypt(), ensure that it's always initialized. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Beniamino Galvani Acked-by: Sabrina Dubroca Signed-off-by: David S. Miller --- drivers/net/macsec.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index 8bcd78f..a70b6c4 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -942,7 +942,6 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb, } macsec_skb_cb(skb)->req = req; - macsec_skb_cb(skb)->rx_sa = rx_sa; skb->dev = dev; aead_request_set_callback(req, 0, macsec_decrypt_done, skb); @@ -1169,6 +1168,8 @@ static rx_handler_result_t macsec_handle_frame(struct sk_buff **pskb) } } + macsec_skb_cb(skb)->rx_sa = rx_sa; + /* Disabled && !changed text => skip validation */ if (hdr->tci_an & MACSEC_TCI_C || secy->validate_frames != MACSEC_VALIDATE_DISABLED) -- 2.1.0 >From fc4814047308a5d3beb2f6284ed7cfb73c032e9d Mon Sep 17 00:00:00 2001 From: Soheil Hassas Yeganeh Date: Fri, 29 Jul 2016 09:34:02 -0400 Subject: [PATCH 10/10] tcp: consider recv buf for the initial window scale [ Upstream commit f626300a3e776ccc9671b0dd94698fb3aa315966 ] tcp_select_initial_window() intends to advertise a window scaling for the maximum possible window size. To do so, it considers the maximum of net.ipv4.tcp_rmem[2] and net.core.rmem_max as the only possible upper-bounds. However, users with CAP_NET_ADMIN can use SO_RCVBUFFORCE to set the socket's receive buffer size to values larger than net.ipv4.tcp_rmem[2] and net.core.rmem_max. Thus, SO_RCVBUFFORCE is effectively ignored by tcp_select_initial_window(). To fix this, consider the maximum of net.ipv4.tcp_rmem[2], net.core.rmem_max and socket's initial buffer space. Fixes: b0573dea1fb3 ("[NET]: Introduce SO_{SND,RCV}BUFFORCE socket options") Signed-off-by: Soheil Hassas Yeganeh Suggested-by: Neal Cardwell Acked-by: Neal Cardwell Signed-off-by: David S. Miller --- net/ipv4/tcp_output.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index e00e972..700b72c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -236,7 +236,8 @@ void tcp_select_initial_window(int __space, __u32 mss, /* Set window scaling on max possible window * See RFC1323 for an explanation of the limit to 14 */ - space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max); + space = max_t(u32, space, sysctl_tcp_rmem[2]); + space = max_t(u32, space, sysctl_rmem_max); space = min_t(u32, space, *window_clamp); while (space > 65535 && (*rcv_wscale) < 14) { space >>= 1; -- 2.1.0