Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-03-13  8:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Alexander Duyck, bhelgaas, alexander.h.duyck, linux-pci,
	virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, mheyne, liang-min.wang, mark.d.rustad, hch
In-Reply-To: <1520928772.28745.53.camel@infradead.org>

On Tue, Mar 13, 2018 at 08:12:52AM +0000, David Woodhouse wrote:
> I'd also *really* like to see a way to enable this for PFs which don't
> have (and don't need) a driver. We seem to have lost that along the
> way.

We've been forth and back on that.  I agree that not having any driver
just seems dangerous.  If your PF really does nothing we should just
have a trivial pf_stub driver that does nothing but wiring up
pci_sriov_configure_simple.  We can then add PCI IDs to it either
statically, or using the dynamic ids mechanism.

^ permalink raw reply

* [PATCH] xfrm: fix rcu_read_unlock usage in xfrm_local_error
From: Taehee Yoo @ 2018-03-13  8:26 UTC (permalink / raw)
  To: davem; +Cc: netdev, steffen.klassert, Taehee Yoo

In the xfrm_local_error, rcu_read_unlock should be called when afinfo
is not NULL. because xfrm_state_get_afinfo calls rcu_read_unlock
if afinfo is NULL.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
 net/xfrm/xfrm_output.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 2346867..89b178a7 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -285,8 +285,9 @@ void xfrm_local_error(struct sk_buff *skb, int mtu)
 		return;
 
 	afinfo = xfrm_state_get_afinfo(proto);
-	if (afinfo)
+	if (afinfo) {
 		afinfo->local_error(skb, mtu);
-	rcu_read_unlock();
+		rcu_read_unlock();
+	}
 }
 EXPORT_SYMBOL_GPL(xfrm_local_error);
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH v2 iproute2-next 0/6] cm_id, cq, mr, and pd resource tracking
From: Leon Romanovsky @ 2018-03-13  8:32 UTC (permalink / raw)
  To: David Ahern; +Cc: Steve Wise, stephen, netdev, linux-rdma
In-Reply-To: <87cc06e1-0e54-12f7-a9b0-326e568c3d6f@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 892 bytes --]

On Mon, Mar 12, 2018 at 10:53:03AM -0700, David Ahern wrote:
> On 3/12/18 8:16 AM, Steve Wise wrote:
> > Hey all,
> >
> > The kernel side of this series has been merged for rdma-next [1].  Let me
> > know if this iproute2 series can be merged, of if it needs more changes.
> >
>
> The problem is that iproute2 headers are synced to kernel headers from
> DaveM's tree (net-next mainly). I take it this series will not appear in
> Dave's tree until after a merge through Linus' tree. Correct?

David,

Technically, you are right, and we would like to ask you for an extra tweak
to the flow for the RDMAtool, because current scheme causes delays at least
cycle.

Every RDMAtool's patchset which requires changes to headers is always
includes header patch, can you please accept those series and once you
are bringing new net-next headers from Linus, simply overwrite all our
headers?

Thanks

>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: BUG_ON triggered in skb_segment
From: Steffen Klassert @ 2018-03-13  8:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yonghong Song, Daniel Borkmann, Alexei Starovoitov, netdev,
	Martin Lau
In-Reply-To: <c679bf38-ef11-7ba3-c24f-fa618bb0d956@gmail.com>

On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/12/2018 11:08 PM, Yonghong Song wrote:
> > 
> > 
> > On 3/12/18 11:04 PM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 03/12/2018 10:45 PM, Yonghong Song wrote:
> > > > ...
> > > > Setup:
> > > > =====
> > > > 
> > > > The test will involve three machines:
> > > > ï¿½ï¿½ M_ipv6 <-> M_nat <-> M_ipv4
> > > > 
> > > > The M_nat will do ipv4<->ipv6 address translation and then
> > > > forward packet
> > > > to proper destination. The control plane will configure M_nat properly
> > > > will understand virtual ipv4 address for machine M_ipv6, and
> > > > virtual ipv6 address for machine M_ipv4.
> > > > 
> > > > M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
> > > > The program uses bpf_skb_change_proto to do protocol conversion.
> > > > bpf_skb_change_proto will adjust skb header_len and len properly
> > > > based on protocol change.
> > > > After the conversion, the program will make proper change on
> > > > ethhdr and ip4/6 header, recalculate checksum, and send the packet out
> > > > through bpf_redirect.
> > > > 
> > > > Experiment:
> > > > ===========
> > > > 
> > > > MTU: 1500B for all three machines.
> > > > 
> > > > The tso/lro/gro are enabled on the M_nat box.
> > > > 
> > > > ping works on both ways of M_ipv6 <-> M_ipv4.
> > > > It works for transfering a small file (4KB) between M_ipv6 and
> > > > M_ipv4 (both ways).
> > > > Transfering a large file (e.g., 4MB) from M_ipv6 to M_ipv4,
> > > > failed with the above BUG_ON, really fast.
> > > > Did not really test from M_ipv4 to M_ipv6 with large file.
> > > > 
> > > > The error path likely to be (also from the above call stack):
> > > > ï¿½ï¿½ nic -> lro/gro -> bpf_program -> gso (BUG_ON)

Just out of curiosity, are these packets created with LRO or GRO?
Usually LRO is disabled if forwarding is enabled on a machine,
because segmented LGO packets are likely corrupt.

These packets take an alternative redirect path, so not sure what
happens here.

> > > > 
> > > > In one of experiments, I explicitly printed the skb->len and
> > > > skb->data_len. The values are below:
> > > > ï¿½ï¿½ skb_segment: len 2856, data_len 2686
> > > > They should be equal to avoid BUG.
> > > > 
> > > > In another experiment, I got:
> > > > ï¿½ï¿½ skb_segment: len 1428, data_len 1258
> > > > 
> > > > In both cases, the difference is 170 bytes. Not sure whether
> > > > this is just a coincidence or not.
> > > > 
> > > > Workaround:
> > > > ===========
> > > > 
> > > > A workaround to avoid BUG_ON is to disable lro/gro. This way,
> > > > kernel will not receive big packets and hence gso is not really called.
> > > > 
> > > > I am not familiar with gso code. Does anybody hit this BUG_ON before?
> > > > Any suggestion on how to debug this?
> > > > 
> > > 
> > > skb_segment() works if incoming GRO packet is not modified in its
> > > geometry.
> > > 
> > > In your case it seems you had to adjust gso_size (calling
> > > skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
> > > skb_segment() badly, because geometry changes, unless you had
> > > specific MTU/MSS restrictions.
> > > 
> > > You will have to make skb_segment() more generic if you really want this.
> > 
> > In net/core/filter.c function bpf_skb_change_proto, which is called
> > in the bpf program, does some GSO adjustment. Could you help check
> > whether it satisfies my above use case or not? Thanks!
> 
> As I said this  helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
> header) - sizeof(ipv4 header))
> 
> So it wont work if skb_segment() is called after this change.

Even HW TSO use gso_size to segment the packets. Would'nt this
result in broken packets too, if gso_size is modified on a
forwarding path?

> 
> Not clear why the GRO packet is not sent as is (as a TSO packet) since
> mlx4/mlx5 NICs certainly support TSO.

If the packets are generated with GRO, there could be data chained
at the frag_list pointer. Most NICs can't offload such skbs, so if
skb_segment() can't split at the frag_list pointer, it will just
segment the packets based on gso_size.

^ permalink raw reply

* Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support
From: David Woodhouse @ 2018-03-13  8:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Duyck, bhelgaas, alexander.h.duyck, linux-pci,
	virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, mheyne, liang-min.wang, mark.d.rustad
In-Reply-To: <20180313081628.GA618@lst.de>

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]



On Tue, 2018-03-13 at 09:16 +0100, Christoph Hellwig wrote:
> On Tue, Mar 13, 2018 at 08:12:52AM +0000, David Woodhouse wrote:
> > 
> > I'd also *really* like to see a way to enable this for PFs which don't
> > have (and don't need) a driver. We seem to have lost that along the
> > way.
> We've been forth and back on that.  I agree that not having any driver
> just seems dangerous.  If your PF really does nothing we should just
> have a trivial pf_stub driver that does nothing but wiring up
> pci_sriov_configure_simple.  We can then add PCI IDs to it either
> statically, or using the dynamic ids mechanism.

Or just add it to the existing pci-stub. What's the point in having a
new driver? 

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
From: Greg Kroah-Hartman @ 2018-03-13  8:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Luis R. Rodriguez, Jessica Yu, Linus Torvalds, Mimi Zohar,
	Djalal Harouni, David Miller, Andy Lutomirski, Kees Cook,
	Alexei Starovoitov, Al Viro, Daniel Borkmann, Network Development,
	Linux Kernel Mailing List, kernel-team, Linux API
In-Reply-To: <1f3e2c92-1e09-cc77-3b90-e33d7da35a2f@fb.com>

On Mon, Mar 12, 2018 at 10:22:00AM -0700, Alexei Starovoitov wrote:
> On 3/10/18 7:34 AM, Luis R. Rodriguez wrote:
> > Also,
> > 
> > Alexei you never answered my questions out aliases with the umh modules.
> > Long term this important to consider.
> 
> aliases always felt like a crutch to me.
> I can see an argument when they're used as 'alias pci:* foo'
> but the way it's used in networking with ip_set_* and nf-* is
> something I prefer not to ever do again.
> Definitely no aliases for bpfilter umh.

I agree, let's not do that if at all possible for these types of
binaries.

greg k-h

^ permalink raw reply

* Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support
From: Christoph Hellwig @ 2018-03-13  8:54 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Christoph Hellwig, Alexander Duyck, bhelgaas, alexander.h.duyck,
	linux-pci, virtio-dev, kvm, netdev, dan.daly, linux-kernel,
	linux-nvme, keith.busch, netanel, mheyne, liang-min.wang,
	mark.d.rustad
In-Reply-To: <1520930719.11412.2.camel@infradead.org>

On Tue, Mar 13, 2018 at 08:45:19AM +0000, David Woodhouse wrote:
> 
> 
> On Tue, 2018-03-13 at 09:16 +0100, Christoph Hellwig wrote:
> > On Tue, Mar 13, 2018 at 08:12:52AM +0000, David Woodhouse wrote:
> > > 
> > > I'd also *really* like to see a way to enable this for PFs which don't
> > > have (and don't need) a driver. We seem to have lost that along the
> > > way.
> > We've been forth and back on that.  I agree that not having any driver
> > just seems dangerous.  If your PF really does nothing we should just
> > have a trivial pf_stub driver that does nothing but wiring up
> > pci_sriov_configure_simple.  We can then add PCI IDs to it either
> > statically, or using the dynamic ids mechanism.
> 
> Or just add it to the existing pci-stub. What's the point in having a
> new driver? 

Because binding to pci-stub means that you'd now enable the simple
SR-IOV for any device bound to PCI stub.  Which often might be the wrong
thing.

^ permalink raw reply

* Re: [PATCH iproute2] Revert "iproute: "list/flush/save default" selected all of the routes"
From: Alexander Zubkov @ 2018-03-13  8:46 UTC (permalink / raw)
  To: Luca Boccassi, Stephen Hemminger; +Cc: netdev@vger.kernel.org
In-Reply-To: <1520890670.31389.1.camel@debian.org>

Hello.

May be the better way would be to change how "all"/"any" argument behaves? My original concern was about "default" only. I agree too, that "all" or "any" should work for all routes. But not for the default.

12.03.2018, 22:37, "Luca Boccassi" <bluca@debian.org>:
> On Mon, 2018-03-12 at 14:03 -0700, Stephen Hemminger wrote:
>>  This reverts commit 9135c4d6037ff9f1818507bac0049fc44db8c3d2.
>>
>>  Debian maintainer found that basic command:
>>          # ip route flush all
>>  No longer worked as expected which breaks user scripts and
>>  expectations. It no longer flushed all IPv4 routes.
>>
>>  Reported-by: Luca Boccassi <bluca@debian.org>
>>  Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>>  ---
>>   ip/iproute.c | 65 ++++++++++++++++++------------------------------
>>  ------------
>>   lib/utils.c  | 13 ++++++++++++
>>   2 files changed, 32 insertions(+), 46 deletions(-)
>
> Tested-by: Luca Boccassi <bluca@debian.org>
>
> Thanks, solves the problem. I'll backport it to Debian.
>
> Alexander, reproducing the issue is quite simple - before that commit,
> ip route ls all showed all routes, but with the change it started
> showing only the default table. Same for ip route flush.
>
> --
> Kind regards,
> Luca Boccassi

^ permalink raw reply

* [PATCH net] qed: Use after free in qed_rdma_free()
From: Dan Carpenter @ 2018-03-13  9:09 UTC (permalink / raw)
  To: Ariel Elior, Michal Kalderon; +Cc: everest-linux-l2, netdev, kernel-janitors

We're dereferencing "p_hwfn->p_rdma_info" but that is freed on the line
before in qed_rdma_resc_free(p_hwfn).

Fixes: 9de506a547c0 ("qed: Free RoCE ILT Memory on rmmod qedr")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
index f3ee6538b553..a411f9c702a1 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
@@ -379,8 +379,8 @@ static void qed_rdma_free(struct qed_hwfn *p_hwfn)
 	DP_VERBOSE(p_hwfn, QED_MSG_RDMA, "Freeing RDMA\n");
 
 	qed_rdma_free_reserved_lkey(p_hwfn);
-	qed_rdma_resc_free(p_hwfn);
 	qed_cxt_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
+	qed_rdma_resc_free(p_hwfn);
 }
 
 static void qed_rdma_get_guid(struct qed_hwfn *p_hwfn, u8 *guid)

^ permalink raw reply related

* Re: [pci PATCH v5 3/4] ena: Migrate over to unmanaged SR-IOV support
From: David Woodhouse @ 2018-03-13  9:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alexander Duyck, bhelgaas, alexander.h.duyck, linux-pci,
	virtio-dev, kvm, netdev, dan.daly, linux-kernel, linux-nvme,
	keith.busch, netanel, mheyne, liang-min.wang, mark.d.rustad
In-Reply-To: <20180313085442.GA1537@lst.de>

[-- Attachment #1: Type: text/plain, Size: 715 bytes --]

On Tue, 2018-03-13 at 09:54 +0100, Christoph Hellwig wrote:
> On Tue, Mar 13, 2018 at 08:45:19AM +0000, David Woodhouse wrote:
> Because binding to pci-stub means that you'd now enable the simple
> SR-IOV for any device bound to PCI stub.  Which often might be the wrong
> thing.

No, *using* it would be the wrong thing (bad root; no biscuit).

Except when the PF doesn't have SR-IOV capability anyway, in which case
who cares.

Or when the PF does have SR-IOV capability and root has ensure that
she's doing the right thing.

I understand the arguments about disallowing root from doing bad
things. Not that I agree with them. But simply changing to a
*different* driver seems pointless.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply

* Re: Problem with bridge (mcast-to-ucast + hairpin) and Broadcom's 802.11f in their FullMAC fw
From: Arend van Spriel @ 2018-03-13  9:18 UTC (permalink / raw)
  To: Felix Fietkau, Rafał Miłecki, Linus Lüssing,
	Franky Lin, Hante Meuleman, Chi-Hsien Lin, Wright Feng,
	Pieter-Paul Giesberts
  Cc: Network Development,
	bridge-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	open list:BROADCOM BRCM80211 IEEE802.11n WIRELESS DRIVER,
	brcm80211-dev-list-+wT8y+m8/X5BDgjK7y7TUQ
In-Reply-To: <c24304cf-7e0d-137a-94b1-d8f4e61a3b70-Vt+b4OUoWG0@public.gmane.org>

On 3/13/2018 8:20 AM, Felix Fietkau wrote:
> [resent with fixed typo in linux-wireless address]
>
> On 2018-02-27 11:08, Rafał Miłecki wrote:
>> I've problem when using OpenWrt/LEDE on a home router with Broadcom's
>> FullMAC WiFi chipset.
>>
>>
>> First of all OpenWrt/LEDE uses bridge interface for LAN network with:
>> 1) IFLA_BRPORT_MCAST_TO_UCAST
>> 2) Clients isolation in hostapd
>> 3) Hairpin mode enabled
>>
>> For more details please see Linus's patch description:
>> https://patchwork.kernel.org/patch/9530669/
>> and maybe hairpin mode patch:
>> https://lwn.net/Articles/347344/
>>
>> Short version: in that setup packets received from a bridged wireless
>> interface can be handled back to it for transmission.
>>
>>
>> Now, Broadcom's firmware for their FullMAC chipsets in AP mode
>> supports an obsoleted 802.11f AKA IAPP standard. It's a roaming
>> standard that was replaced by 802.11r.
>>
>> Whenever a new station associates, firmware generates a packet like:
>> ff ff ff ff  ff ff ec 10  7b 5f ?? ??  00 06 00 01  af 81 01 00
>> (just masked 2 bytes of my MAC)
>>
>> For mode details you can see discussion in my brcmfmac patch thread:
>> https://patchwork.kernel.org/patch/10191451/
>>
>>
>> The problem is that bridge (in setup as above) handles such a packet
>> back to the device.
>>
>> That makes Broadcom's FullMAC firmware believe that a given station
>> just connected to another AP in a network (which doesn't even exist).
>> As a result firmware immediately disassociates that station. It's
>> simply impossible to connect to the router. Every association is
>> followed by immediate disassociation.
>>
>>
>> Can you see any solution for this problem? Is that an option to stop
>> multicast-to-unicast from touching 802.11f packets? Some other ideas?
>> Obviously I can't modify Broadcom's firmware and drop that obsoleted
>> standard.
> Let's look at it from a different angle: Since these packets are
> forwarded as normal packets by the bridge, and the Broadcom firmware
> reacts to them in this nasty way, that's basically local DoS security
> issue. In my opinion that matters a lot more than having support for an
> obsolete feature that almost nobody will ever want to use.
>
> I think the right approach to deal with this issue is to drop these
> garbage packets in both the receive and transmit path of brcmfmac.

My approach was to get rid of it in firmware as this never made it into 
the 802.11 spec. So I asked internally whether it was still used. Turns 
out that we still rely on it for some customers. I am fine with dropping 
these "garbage" packets, but given that there is still use for it I 
would like to see that under a Kconfig flag. Dropping it may be the default.

Regards,
Arend

^ permalink raw reply

* RE: [PATCH net] qed: Use after free in qed_rdma_free()
From: Kalderon, Michal @ 2018-03-13  9:40 UTC (permalink / raw)
  To: Dan Carpenter, Elior, Ariel
  Cc: Dept-Eng Everest Linux L2, netdev@vger.kernel.org,
	kernel-janitors@vger.kernel.org
In-Reply-To: <20180313090938.GA17609@mwanda>

> From: Dan Carpenter [mailto:dan.carpenter@oracle.com]
> Sent: Tuesday, March 13, 2018 11:10 AM
> To: Elior, Ariel <Ariel.Elior@cavium.com>; Kalderon, Michal
> <Michal.Kalderon@cavium.com>
> Cc: Dept-Eng Everest Linux L2 <Dept-EngEverestLinuxL2@cavium.com>;
> netdev@vger.kernel.org; kernel-janitors@vger.kernel.org
> Subject: [PATCH net] qed: Use after free in qed_rdma_free()
> 
> We're dereferencing "p_hwfn->p_rdma_info" but that is freed on the line
> before in qed_rdma_resc_free(p_hwfn).
> 
> Fixes: 9de506a547c0 ("qed: Free RoCE ILT Memory on rmmod qedr")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> index f3ee6538b553..a411f9c702a1 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> @@ -379,8 +379,8 @@ static void qed_rdma_free(struct qed_hwfn
> *p_hwfn)
>  	DP_VERBOSE(p_hwfn, QED_MSG_RDMA, "Freeing RDMA\n");
> 
>  	qed_rdma_free_reserved_lkey(p_hwfn);
> -	qed_rdma_resc_free(p_hwfn);
>  	qed_cxt_free_proto_ilt(p_hwfn, p_hwfn->p_rdma_info->proto);
> +	qed_rdma_resc_free(p_hwfn);
>  }
> 
>  static void qed_rdma_get_guid(struct qed_hwfn *p_hwfn, u8 *guid)
Thanks,

Acked-by: Michal Kalderon <Michal.Kalderon@cavium.com>

^ permalink raw reply

* [PATCH V2 net 0/1] net/smc: listen socket closing
From: Ursula Braun @ 2018-03-13  9:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun

Hi Dave,

last week you asked for a better solution using generic infrastructure
to fix the closing of a listening SMC socket.
This made me realize that flush_work() is an appropriate way to make
sure the smc_tcp_listen_worker has finished processing (incl. the
release of the internal clcsock.)

Thanks, Ursula

Ursula Braun (1):
  net/smc: simplify wait when closing listen socket

 net/smc/af_smc.c    |  4 ----
 net/smc/smc_close.c | 25 +++----------------------
 2 files changed, 3 insertions(+), 26 deletions(-)

-- 
2.13.5

^ permalink raw reply

* [PATCH V2 net 1/1] net/smc: simplify wait when closing listen socket
From: Ursula Braun @ 2018-03-13  9:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180313094154.65533-1-ubraun@linux.vnet.ibm.com>

Closing of a listen socket wakes up kernel_accept() of
smc_tcp_listen_worker(), and then has to wait till smc_tcp_listen_worker()
gives up the internal clcsock. The wait logic introduced with
commit 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
might wait longer than necessary. This patch implements the idea to
implement the wait just with flush_work(), and gets rid of the extra
smc_close_wait_listen_clcsock() function.

Fixes: 127f49705823 ("net/smc: release clcsock from tcp_listen_worker")
Reported-by: Hans Wippel <hwippel@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
 net/smc/af_smc.c    |  4 ----
 net/smc/smc_close.c | 25 +++----------------------
 2 files changed, 3 insertions(+), 26 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 8cc97834d4f6..1e0d780855c3 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -978,10 +978,6 @@ static void smc_tcp_listen_work(struct work_struct *work)
 		lsmc->clcsock = NULL;
 	}
 	release_sock(lsk);
-	/* no more listening, wake up smc_close_wait_listen_clcsock and
-	 * accept
-	 */
-	lsk->sk_state_change(lsk);
 	sock_put(&lsmc->sk); /* sock_hold in smc_listen */
 }
 
diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
index e339c0186dcf..fa41d9881741 100644
--- a/net/smc/smc_close.c
+++ b/net/smc/smc_close.c
@@ -30,27 +30,6 @@ static void smc_close_cleanup_listen(struct sock *parent)
 		smc_close_non_accepted(sk);
 }
 
-static void smc_close_wait_listen_clcsock(struct smc_sock *smc)
-{
-	DEFINE_WAIT_FUNC(wait, woken_wake_function);
-	struct sock *sk = &smc->sk;
-	signed long timeout;
-
-	timeout = SMC_CLOSE_WAIT_LISTEN_CLCSOCK_TIME;
-	add_wait_queue(sk_sleep(sk), &wait);
-	do {
-		release_sock(sk);
-		if (smc->clcsock)
-			timeout = wait_woken(&wait, TASK_UNINTERRUPTIBLE,
-					     timeout);
-		sched_annotate_sleep();
-		lock_sock(sk);
-		if (!smc->clcsock)
-			break;
-	} while (timeout);
-	remove_wait_queue(sk_sleep(sk), &wait);
-}
-
 /* wait for sndbuf data being transmitted */
 static void smc_close_stream_wait(struct smc_sock *smc, long timeout)
 {
@@ -204,9 +183,11 @@ int smc_close_active(struct smc_sock *smc)
 			rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
 			/* wake up kernel_accept of smc_tcp_listen_worker */
 			smc->clcsock->sk->sk_data_ready(smc->clcsock->sk);
-			smc_close_wait_listen_clcsock(smc);
 		}
 		smc_close_cleanup_listen(sk);
+		release_sock(sk);
+		flush_work(&smc->tcp_listen_work);
+		lock_sock(sk);
 		break;
 	case SMC_ACTIVE:
 		smc_close_stream_wait(smc, timeout);
-- 
2.13.5

^ permalink raw reply related

* Re: [PATCH bpf-next v4 1/2] bpf: extend stackmap to save binary_build_id+offset instead of address
From: kbuild test robot @ 2018-03-13  9:53 UTC (permalink / raw)
  To: Song Liu
  Cc: kbuild-all, netdev, ast, peterz, daniel, kernel-team, hannes,
	qinteng, Song Liu
In-Reply-To: <20180312203957.2025833-2-songliubraving@fb.com>

Hi Song,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20180309]
[also build test WARNING on v4.16-rc5]
[cannot apply to linus/master v4.16-rc4 v4.16-rc3 v4.16-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Song-Liu/bpf-stackmap-with-build_id-offset/20180313-085825


coccinelle warnings: (new ones prefixed by >>)

>> kernel/bpf/stackmap.c:177:2-3: Unneeded semicolon

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* [PATCH] bpf: fix semicolon.cocci warnings
From: kbuild test robot @ 2018-03-13  9:53 UTC (permalink / raw)
  To: Song Liu
  Cc: kbuild-all, netdev, ast, peterz, daniel, kernel-team, hannes,
	qinteng, Song Liu
In-Reply-To: <20180312203957.2025833-2-songliubraving@fb.com>

From: Fengguang Wu <fengguang.wu@intel.com>

kernel/bpf/stackmap.c:177:2-3: Unneeded semicolon


 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Fixes: fd5c09555695 ("bpf: extend stackmap to save binary_build_id+offset instead of address")
CC: Song Liu <songliubraving@fb.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---

 stackmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -174,7 +174,7 @@ static inline int stack_map_parse_build_
 		if (new_offs <= note_offs)  /* overflow */
 			break;
 		note_offs = new_offs;
-	};
+	}
 	return -EINVAL;
 }
 

^ permalink raw reply

* [PATCH 1/2] udp: Move the udp sysctl to namespace.
From: Tonghao Zhang @ 2018-03-13  9:57 UTC (permalink / raw)
  To: davem, rshearma, edumazet; +Cc: netdev, Tonghao Zhang

This patch moves the udp_rmem_min, udp_wmem_min
to namespace and init the udp_l3mdev_accept explicitly.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 include/net/netns/ipv4.h   |  3 ++
 net/ipv4/sysctl_net_ipv4.c | 32 ++++++++---------
 net/ipv4/udp.c             | 86 +++++++++++++++++++++++++++-------------------
 net/ipv6/udp.c             | 52 ++++++++++++++--------------
 4 files changed, 96 insertions(+), 77 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3a970e4..382bfd7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -168,6 +168,9 @@ struct netns_ipv4 {
 	atomic_t tfo_active_disable_times;
 	unsigned long tfo_active_disable_stamp;
 
+	int sysctl_udp_wmem_min;
+	int sysctl_udp_rmem_min;
+
 #ifdef CONFIG_NET_L3_MASTER_DEV
 	int sysctl_udp_l3mdev_accept;
 #endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 011de9a..5b72d97 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -520,22 +520,6 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
-	{
-		.procname	= "udp_rmem_min",
-		.data		= &sysctl_udp_rmem_min,
-		.maxlen		= sizeof(sysctl_udp_rmem_min),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one
-	},
-	{
-		.procname	= "udp_wmem_min",
-		.data		= &sysctl_udp_wmem_min,
-		.maxlen		= sizeof(sysctl_udp_wmem_min),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one
-	},
 	{ }
 };
 
@@ -1167,6 +1151,22 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+	{
+		.procname	= "udp_rmem_min",
+		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_udp_rmem_min),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
+	{
+		.procname	= "udp_wmem_min",
+		.data		= &init_net.ipv4.sysctl_udp_wmem_min,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_udp_wmem_min),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
 	{ }
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3013404..7ae77f2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -122,12 +122,6 @@
 long sysctl_udp_mem[3] __read_mostly;
 EXPORT_SYMBOL(sysctl_udp_mem);
 
-int sysctl_udp_rmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_rmem_min);
-
-int sysctl_udp_wmem_min __read_mostly;
-EXPORT_SYMBOL(sysctl_udp_wmem_min);
-
 atomic_long_t udp_memory_allocated;
 EXPORT_SYMBOL(udp_memory_allocated);
 
@@ -2533,35 +2527,35 @@ int udp_abort(struct sock *sk, int err)
 EXPORT_SYMBOL_GPL(udp_abort);
 
 struct proto udp_prot = {
-	.name		   = "UDP",
-	.owner		   = THIS_MODULE,
-	.close		   = udp_lib_close,
-	.connect	   = ip4_datagram_connect,
-	.disconnect	   = udp_disconnect,
-	.ioctl		   = udp_ioctl,
-	.init		   = udp_init_sock,
-	.destroy	   = udp_destroy_sock,
-	.setsockopt	   = udp_setsockopt,
-	.getsockopt	   = udp_getsockopt,
-	.sendmsg	   = udp_sendmsg,
-	.recvmsg	   = udp_recvmsg,
-	.sendpage	   = udp_sendpage,
-	.release_cb	   = ip4_datagram_release_cb,
-	.hash		   = udp_lib_hash,
-	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v4_rehash,
-	.get_port	   = udp_v4_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
-	.sysctl_wmem	   = &sysctl_udp_wmem_min,
-	.sysctl_rmem	   = &sysctl_udp_rmem_min,
-	.obj_size	   = sizeof(struct udp_sock),
-	.h.udp_table	   = &udp_table,
+	.name			= "UDP",
+	.owner			= THIS_MODULE,
+	.close			= udp_lib_close,
+	.connect		= ip4_datagram_connect,
+	.disconnect		= udp_disconnect,
+	.ioctl			= udp_ioctl,
+	.init			= udp_init_sock,
+	.destroy		= udp_destroy_sock,
+	.setsockopt		= udp_setsockopt,
+	.getsockopt		= udp_getsockopt,
+	.sendmsg		= udp_sendmsg,
+	.recvmsg		= udp_recvmsg,
+	.sendpage		= udp_sendpage,
+	.release_cb		= ip4_datagram_release_cb,
+	.hash			= udp_lib_hash,
+	.unhash			= udp_lib_unhash,
+	.rehash			= udp_v4_rehash,
+	.get_port		= udp_v4_get_port,
+	.memory_allocated	= &udp_memory_allocated,
+	.sysctl_mem		= sysctl_udp_mem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_udp_wmem_min),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_udp_rmem_min),
+	.obj_size		= sizeof(struct udp_sock),
+	.h.udp_table		= &udp_table,
 #ifdef CONFIG_COMPAT
-	.compat_setsockopt = compat_udp_setsockopt,
-	.compat_getsockopt = compat_udp_getsockopt,
+	.compat_setsockopt	= compat_udp_setsockopt,
+	.compat_getsockopt	= compat_udp_getsockopt,
 #endif
-	.diag_destroy	   = udp_abort,
+	.diag_destroy		= udp_abort,
 };
 EXPORT_SYMBOL(udp_prot);
 
@@ -2831,6 +2825,21 @@ u32 udp_flow_hashrnd(void)
 }
 EXPORT_SYMBOL(udp_flow_hashrnd);
 
+static int __net_init udp_sysctl_init(struct net *net)
+{
+	net->ipv4.sysctl_udp_rmem_min = SK_MEM_QUANTUM;
+	net->ipv4.sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	net->ipv4.sysctl_udp_l3mdev_accept = 0;
+#endif
+	return 0;
+}
+
+static struct pernet_operations __net_initdata udp_sysctl_ops = {
+	.init       = udp_sysctl_init,
+};
+
 void __init udp_init(void)
 {
 	unsigned long limit;
@@ -2843,8 +2852,12 @@ void __init udp_init(void)
 	sysctl_udp_mem[1] = limit;
 	sysctl_udp_mem[2] = sysctl_udp_mem[0] * 2;
 
-	sysctl_udp_rmem_min = SK_MEM_QUANTUM;
-	sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_udp_rmem_min = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_udp_wmem_min = SK_MEM_QUANTUM;
+
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	init_net.ipv4.sysctl_udp_l3mdev_accept = 0;
+#endif
 
 	/* 16 spinlocks per cpu */
 	udp_busylocks_log = ilog2(nr_cpu_ids) + 4;
@@ -2854,4 +2867,7 @@ void __init udp_init(void)
 		panic("UDP: failed to alloc udp_busylocks\n");
 	for (i = 0; i < (1U << udp_busylocks_log); i++)
 		spin_lock_init(udp_busylocks + i);
+
+	if (register_pernet_subsys(&udp_sysctl_ops))
+		panic("UDP: failed to init sysctl parameters.\n");
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 52e3ea0..ad30f5e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1509,34 +1509,34 @@ void udp6_proc_exit(struct net *net)
 /* ------------------------------------------------------------------------ */
 
 struct proto udpv6_prot = {
-	.name		   = "UDPv6",
-	.owner		   = THIS_MODULE,
-	.close		   = udp_lib_close,
-	.connect	   = ip6_datagram_connect,
-	.disconnect	   = udp_disconnect,
-	.ioctl		   = udp_ioctl,
-	.init		   = udp_init_sock,
-	.destroy	   = udpv6_destroy_sock,
-	.setsockopt	   = udpv6_setsockopt,
-	.getsockopt	   = udpv6_getsockopt,
-	.sendmsg	   = udpv6_sendmsg,
-	.recvmsg	   = udpv6_recvmsg,
-	.release_cb	   = ip6_datagram_release_cb,
-	.hash		   = udp_lib_hash,
-	.unhash		   = udp_lib_unhash,
-	.rehash		   = udp_v6_rehash,
-	.get_port	   = udp_v6_get_port,
-	.memory_allocated  = &udp_memory_allocated,
-	.sysctl_mem	   = sysctl_udp_mem,
-	.sysctl_wmem	   = &sysctl_udp_wmem_min,
-	.sysctl_rmem	   = &sysctl_udp_rmem_min,
-	.obj_size	   = sizeof(struct udp6_sock),
-	.h.udp_table	   = &udp_table,
+	.name			= "UDPv6",
+	.owner			= THIS_MODULE,
+	.close			= udp_lib_close,
+	.connect		= ip6_datagram_connect,
+	.disconnect		= udp_disconnect,
+	.ioctl			= udp_ioctl,
+	.init			= udp_init_sock,
+	.destroy		= udpv6_destroy_sock,
+	.setsockopt		= udpv6_setsockopt,
+	.getsockopt		= udpv6_getsockopt,
+	.sendmsg		= udpv6_sendmsg,
+	.recvmsg		= udpv6_recvmsg,
+	.release_cb		= ip6_datagram_release_cb,
+	.hash			= udp_lib_hash,
+	.unhash			= udp_lib_unhash,
+	.rehash			= udp_v6_rehash,
+	.get_port		= udp_v6_get_port,
+	.memory_allocated	= &udp_memory_allocated,
+	.sysctl_mem		= sysctl_udp_mem,
+	.sysctl_wmem_offset     = offsetof(struct net, ipv4.sysctl_udp_wmem_min),
+	.sysctl_rmem_offset     = offsetof(struct net, ipv4.sysctl_udp_rmem_min),
+	.obj_size		= sizeof(struct udp6_sock),
+	.h.udp_table		= &udp_table,
 #ifdef CONFIG_COMPAT
-	.compat_setsockopt = compat_udpv6_setsockopt,
-	.compat_getsockopt = compat_udpv6_getsockopt,
+	.compat_setsockopt	= compat_udpv6_setsockopt,
+	.compat_getsockopt	= compat_udpv6_getsockopt,
 #endif
-	.diag_destroy      = udp_abort,
+	.diag_destroy		= udp_abort,
 };
 
 static struct inet_protosw udpv6_protosw = {
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH 2/2] doc: Change the udp/sctp rmem/wmem default value.
From: Tonghao Zhang @ 2018-03-13  9:57 UTC (permalink / raw)
  To: davem, rshearma, edumazet; +Cc: netdev, Tonghao Zhang
In-Reply-To: <1520935050-14547-1-git-send-email-xiangxia.m.yue@gmail.com>

The SK_MEM_QUANTUM was changed from PAGE_SIZE to 4096.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 Documentation/networking/ip-sysctl.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 783675a..1d11207 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -755,13 +755,13 @@ udp_rmem_min - INTEGER
 	Minimal size of receive buffer used by UDP sockets in moderation.
 	Each UDP socket is able to use the size for receiving data, even if
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-	Default: 1 page
+	Default: 4K
 
 udp_wmem_min - INTEGER
 	Minimal size of send buffer used by UDP sockets in moderation.
 	Each UDP socket is able to use the size for sending data, even if
 	total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
-	Default: 1 page
+	Default: 4K
 
 CIPSOv4 Variables:
 
@@ -2101,7 +2101,7 @@ sctp_rmem - vector of 3 INTEGERs: min, default, max
 	It is guaranteed to each SCTP socket (but not association) even
 	under moderate memory pressure.
 
-	Default: 1 page
+	Default: 4K
 
 sctp_wmem  - vector of 3 INTEGERs: min, default, max
 	Currently this tunable has no effect.
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PATCH net 4/5] tcp: prevent bogus undos when SACK is not enabled
From: Ilpo Järvinen @ 2018-03-13 10:24 UTC (permalink / raw)
  To: David Miller; +Cc: Yuchung Cheng, ncardwell, Netdev, edumazet
In-Reply-To: <20180309.102357.98405985881146907.davem@davemloft.net>

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

On Fri, 9 Mar 2018, David Miller wrote:

> From: Ilpo Jï¿œrvinen <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 9 Mar 2018 16:11:47 +0200 (EET)
> 
> > Unfortunately I don't have now permission to publish the time-seq
> > graph about it but I've tried to improve the changelog messages so
> > that you can better understand under which conditions the problem
> > occurs.
> 
> It is indeed extremely unfortunate that you wish to justify a change
> for which you cannot provide the supporting data at all.

Here is the time-seqno graph about the issue:

https://www.cs.helsinki.fi/u/ijjarvin/linux/nonsackbugs/recovery_undo_bug.pdf

First the correct CC action (wnd reduction) occurs; then bogus undo 
causes bursting back to the window with which the congestion losses 
occurred earlier; because of the burst, some packets get lost due to 
congestion again.

The sender is actually somewhat lucky here: If only one packet would get 
lost instead of three, the same process would repeat for the next recovery 
(as cumulative ACK to high_seq condition would reoccur).

-- 
 i.

^ permalink raw reply

* [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

If SACK is not enabled and the first cumulative ACK after the RTO
retransmission covers more than the retransmitted skb, a spurious
FRTO undo will trigger (assuming FRTO is enabled for that RTO).
The reason is that any non-retransmitted segment acknowledged will
set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
no indication that it would have been delivered for real (the
scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
case so the check for that bit won't help like it does with SACK).
Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
in tcp_process_loss.

We need to use more strict condition for non-SACK case and check
that none of the cumulatively ACKed segments were retransmitted
to prove that progress is due to original transmissions. Only then
keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
non-SACK case.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a26c09..c60745c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				pkts_acked = rexmit_acked + newdata_acked;

 			tcp_remove_reno_sacks(sk, pkts_acked);
+
+			/* If any of the cumulatively ACKed segments was
+			 * retransmitted, non-SACK case cannot confirm that
+			 * progress was due to original transmission due to
+			 * lack of TCPCB_SACKED_ACKED bits even if some of
+			 * the packets may have been never retransmitted.
+			 */
+			if (flag & FLAG_RETRANS_DATA_ACKED)
+				flag &= ~FLAG_ORIG_SACK_ACKED;
 		} else {
 			int delta;

-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

A miscalculation for the number of acknowledged packets occurs during
RTO recovery whenever SACK is not enabled and a cumulative ACK covers
any non-retransmitted skbs. The reason is that pkts_acked value
calculated in tcp_clean_rtx_queue is not correct for slow start after
RTO as it may include segments that were not lost and therefore did
not need retransmissions in the slow start following the RTO. Then
tcp_slow_start will add the excess into cwnd bloating it and
triggering a burst.

Instead, we want to pass only the number of retransmitted segments
that were covered by the cumulative ACK (and potentially newly sent
data segments too if the cumulative ACK covers that far).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9a1b3c1..4a26c09 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3027,6 +3027,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 	long seq_rtt_us = -1L;
 	long ca_rtt_us = -1L;
 	u32 pkts_acked = 0;
+	u32 rexmit_acked = 0;
+	u32 newdata_acked = 0;
 	u32 last_in_flight = 0;
 	bool rtt_update;
 	int flag = 0;
@@ -3056,8 +3058,10 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (unlikely(sacked & TCPCB_RETRANS)) {
-			if (sacked & TCPCB_SACKED_RETRANS)
+			if (sacked & TCPCB_SACKED_RETRANS) {
 				tp->retrans_out -= acked_pcount;
+				rexmit_acked += acked_pcount;
+			}
 			flag |= FLAG_RETRANS_DATA_ACKED;
 		} else if (!(sacked & TCPCB_SACKED_ACKED)) {
 			last_ackt = skb->skb_mstamp;
@@ -3070,6 +3074,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 				reord = start_seq;
 			if (!after(scb->end_seq, tp->high_seq))
 				flag |= FLAG_ORIG_SACK_ACKED;
+			else
+				newdata_acked += acked_pcount;
 		}
 
 		if (sacked & TCPCB_SACKED_ACKED) {
@@ -3151,6 +3157,14 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack,
 		}
 
 		if (tcp_is_reno(tp)) {
+			/* Due to discontinuity on RTO in the artificial
+			 * sacked_out calculations, TCP must restrict
+			 * pkts_acked without SACK to rexmits and new data
+			 * segments
+			 */
+			if (icsk->icsk_ca_state == TCP_CA_Loss)
+				pkts_acked = rexmit_acked + newdata_acked;
+
 			tcp_remove_reno_sacks(sk, pkts_acked);
 		} else {
 			int delta;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 4/5] tcp: prevent bogus undos when SACK is not enabled
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

When a cumulative ACK lands to high_seq at the end of loss
recovery and SACK is not enabled, the sender needs to avoid
false fast retransmits (RFC6582). The avoidance mechanisms is
implemented by remaining in the loss recovery CA state until
one additional cumulative ACK arrives. During the operation of
this avoidance mechanism, there is internal transient in the
use of state variables which will always trigger a bogus undo.

When we enter to this transient state in tcp_try_undo_recovery,
tcp_any_retrans_done is often (always?) false resulting in
clearing retrans_stamp. On the next cumulative ACK,
tcp_try_undo_recovery again executes because CA state still
remains in the same recovery state and tcp_may_undo will always
return true because tcp_packet_delayed has this condition:
    return !tp->retrans_stamp || ...

Check if the false fast retransmit transient avoidance is in
progress in tcp_packet_delayed to avoid bogus undos. Since snd_una
has advanced already on this ACK but CA state still remains
unchanged (CA state is updated slightly later than undo is
checked), prior_snd_una needs to be passed to tcp_packet_delayed
(instead of tp->snd_una). Passing prior_snd_una around to
the tcp_packet_delayed makes this change look more involved than
it really is.

The additional checks done in this change only affect non-SACK
case, the SACK case remains the same.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 72ecfbb..270aa48 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2241,10 +2241,17 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
+static inline bool tcp_packet_delayed(const struct tcp_sock *tp,
+				      const u32 prior_snd_una)
 {
-	return !tp->retrans_stamp ||
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	if (!tp->retrans_stamp) {
+		/* Sender will be in a transient state with cleared
+		 * retrans_stamp during false fast retransmit prevention
+		 * mechanism
+		 */
+		return !tcp_false_fast_retrans_possible(tp, prior_snd_una);
+	}
+	return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
 }
 
 /* Undo procedures. */
@@ -2334,17 +2341,19 @@ static void tcp_undo_cwnd_reduction(struct sock *sk, bool unmark_loss)
 	tp->rack.advanced = 1; /* Force RACK to re-exam losses */
 }
 
-static inline bool tcp_may_undo(const struct tcp_sock *tp)
+static inline bool tcp_may_undo(const struct tcp_sock *tp,
+				const u32 prior_snd_una)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	return tp->undo_marker &&
+	       (!tp->undo_retrans || tcp_packet_delayed(tp, prior_snd_una));
 }
 
 /* People celebrate: "We love our President!" */
-static bool tcp_try_undo_recovery(struct sock *sk)
+static bool tcp_try_undo_recovery(struct sock *sk, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(tp, prior_snd_una)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2391,11 +2400,12 @@ static bool tcp_try_undo_dsack(struct sock *sk)
 }
 
 /* Undo during loss recovery after partial ACK or using F-RTO. */
-static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo)
+static bool tcp_try_undo_loss(struct sock *sk, const u32 prior_snd_una,
+			      bool frto_undo)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (frto_undo || tcp_may_undo(tp)) {
+	if (frto_undo || tcp_may_undo(tp, prior_snd_una)) {
 		tcp_undo_cwnd_reduction(sk, true);
 
 		DBGUNDO(sk, "partial loss");
@@ -2628,13 +2638,13 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
 static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
-			     int *rexmit)
+			     int *rexmit, const u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool recovered = !before(tp->snd_una, tp->high_seq);
 
 	if ((flag & FLAG_SND_UNA_ADVANCED) &&
-	    tcp_try_undo_loss(sk, false))
+	    tcp_try_undo_loss(sk, prior_snd_una, false))
 		return;
 
 	if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */
@@ -2642,7 +2652,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 		 * lost, i.e., never-retransmitted data are (s)acked.
 		 */
 		if ((flag & FLAG_ORIG_SACK_ACKED) &&
-		    tcp_try_undo_loss(sk, true))
+		    tcp_try_undo_loss(sk, prior_snd_una, true))
 			return;
 
 		if (after(tp->snd_nxt, tp->high_seq)) {
@@ -2665,7 +2675,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
 
 	if (recovered) {
 		/* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */
-		tcp_try_undo_recovery(sk);
+		tcp_try_undo_recovery(sk, prior_snd_una);
 		return;
 	}
 	if (tcp_is_reno(tp)) {
@@ -2685,7 +2695,7 @@ static bool tcp_try_undo_partial(struct sock *sk, u32 prior_snd_una)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tp->undo_marker && tcp_packet_delayed(tp)) {
+	if (tp->undo_marker && tcp_packet_delayed(tp, prior_snd_una)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit. Check reordering.
 		 */
@@ -2788,7 +2798,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		case TCP_CA_Recovery:
 			if (tcp_is_reno(tp))
 				tcp_reset_reno_sack(tp);
-			if (tcp_try_undo_recovery(sk))
+			if (tcp_try_undo_recovery(sk, prior_snd_una))
 				return;
 			tcp_end_cwnd_reduction(sk);
 			break;
@@ -2815,7 +2825,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
 		tcp_rack_identify_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
-		tcp_process_loss(sk, flag, is_dupack, rexmit);
+		tcp_process_loss(sk, flag, is_dupack, rexmit, prior_snd_una);
 		tcp_rack_identify_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 0/5] tcp: fixes to non-SACK TCP
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov

Here is a series of fixes to issues that occur when SACK is not
enabled for a TCP connection. These are not fixes to just some
remote corner cases of recovery but many/almost all recoveries
without SACK will be impacted by one (or even more than one) of
them. The sender-side changes (1-4) are not mainly, if any, to
improve performance (throughput) but address correctness
(congestion control responses should not get incorrectly
reverted) and burstiness (that may cause additional problems
later as some of the packets in such bursts may get dropped
needing again to resort to loss recovery that is likely
similarly bursty).

v1 -> v2:
- Tried to improve changelogs
- Reworked FRTO undo fix location
- Removed extra parenthesis from EXPR (and while at it, reverse
  also the sides of &&)
- Pass prior_snd_una rather than flag around to avoid moving
  tcp_packet_delayed call
- Pass tp instead of sk. Sk was there only due to a subsequent
  change (that I think is only net-next material) limiting the
  use of the transient state to only RTO recoveries as it won't
  be needed after NewReno recovery that won't do unnecessary
  rexmits like the non-SACK RTO recovery does

v2 -> v3:
- Remove unnecessarily added braces

tcp: feed correct number of pkts acked to cc
tcp: prevent bogus FRTO undos with non-SACK flows
tcp: move false FR condition into
tcp: prevent bogus undos when SACK is not enabled
tcp: send real dupACKs by locking advertized

^ permalink raw reply

* [PATCH v3 net 3/5] tcp: move false FR condition into tcp_false_fast_retrans_possible()
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

No functional changes. This change simplifies the next change
slightly.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c60745c..72ecfbb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2211,6 +2211,17 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 	}
 }
 
+/* False fast retransmits may occur when SACK is not in use under certain
+ * conditions (RFC6582). The sender MUST hold old state until something
+ * *above* high_seq is ACKed to prevent triggering such false fast
+ * retransmits. SACK TCP is safe.
+ */
+static bool tcp_false_fast_retrans_possible(const struct tcp_sock *tp,
+					    const u32 snd_una)
+{
+	return tcp_is_reno(tp) && (snd_una == tp->high_seq);
+}
+
 static bool tcp_tsopt_ecr_before(const struct tcp_sock *tp, u32 when)
 {
 	return tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
@@ -2350,10 +2361,10 @@ static bool tcp_try_undo_recovery(struct sock *sk)
 	} else if (tp->rack.reo_wnd_persist) {
 		tp->rack.reo_wnd_persist--;
 	}
-	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
-		/* Hold old state until something *above* high_seq
-		 * is ACKed. For Reno it is MUST to prevent false
-		 * fast retransmits (RFC2582). SACK TCP is safe. */
+	if (tcp_false_fast_retrans_possible(tp, tp->snd_una)) {
+		/* Hold old state until something *above* high_seq is ACKed
+		 * if false fast retransmit is possible.
+		 */
 		if (!tcp_any_retrans_done(sk))
 			tp->retrans_stamp = 0;
 		return true;
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3 net 5/5] tcp: send real dupACKs by locking advertized window for non-SACK flows
From: Ilpo Järvinen @ 2018-03-13 10:25 UTC (permalink / raw)
  To: netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Sergei Shtylyov
In-Reply-To: <1520936711-16784-1-git-send-email-ilpo.jarvinen@helsinki.fi>

Currently, the TCP code is overly eager to increase window on
almost every ACK. It makes those ACKs that the receiver should
sent as dupACKs look like they update window that is not
considered a real dupACK by the non-SACK sender-side code.
Therefore the sender needs to resort to RTO to recover
losses as fast retransmit/fast recovery cannot be triggered
by such masked duplicate ACKs.

This change makes sure that when an ofo segment is received,
no change to window is applied if we are going to send a dupACK.
Even with this change, the window updates keep being maintained
but that occurs "behind the scenes". That is, this change does
not interfere with memory management of the flow which could
have long-term impact for the progress of the flow but only
prevents those updates being seen on the wire on short-term.
It's ok to change the window for non-dupACKs such as the first
ACK after ofo arrivals start if that ACK was using delayed ACKs
and also whenever the ack field advances. As ack field typically
advances once per RTT as the first hole is retransmitted, the
window updates are not delayed entirely during long recoveries.

Even before this fix, tcp_select_window did not allow ACK
shrinking the window for duplicate ACKs (that was previously
even called "treason" but the old charmy message is gone now).
The advertized window can only shrink when also ack field
changes which will not be blocked by this change as it is not
duplicate ACK.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h   |  3 ++-
 net/ipv4/tcp_input.c  |  5 ++++-
 net/ipv4/tcp_output.c | 43 +++++++++++++++++++++++++------------------
 3 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 8f4c549..e239662 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@ struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		dupack_wnd_lock :1, /* Non-SACK constant rwnd dupacks needed? */
+		unused:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 270aa48..4ff192b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4626,6 +4626,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	bool fragstolen;
 	int eaten;
 
@@ -4669,7 +4670,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			 * gap in queue is filled.
 			 */
 			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
-				inet_csk(sk)->icsk_ack.pingpong = 0;
+				icsk->icsk_ack.pingpong = 0;
 		}
 
 		if (tp->rx_opt.num_sacks)
@@ -4719,6 +4720,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 		goto queue_and_out;
 	}
 
+	if (tcp_is_reno(tp) && !(icsk->icsk_ack.pending & ICSK_ACK_TIMER))
+		tp->dupack_wnd_lock = 1;
 	tcp_data_queue_ofo(sk, skb);
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 6818042..45fbe92 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -249,25 +249,32 @@ static u16 tcp_select_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 old_win = tp->rcv_wnd;
-	u32 cur_win = tcp_receive_window(tp);
-	u32 new_win = __tcp_select_window(sk);
-
-	/* Never shrink the offered window */
-	if (new_win < cur_win) {
-		/* Danger Will Robinson!
-		 * Don't update rcv_wup/rcv_wnd here or else
-		 * we will not be able to advertise a zero
-		 * window in time.  --DaveM
-		 *
-		 * Relax Will Robinson.
-		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+	u32 cur_win;
+	u32 new_win;
+
+	if (tp->dupack_wnd_lock) {
+		new_win = old_win;
+		tp->dupack_wnd_lock = 0;
+	} else {
+		cur_win = tcp_receive_window(tp);
+		new_win = __tcp_select_window(sk);
+		/* Never shrink the offered window */
+		if (new_win < cur_win) {
+			/* Danger Will Robinson!
+			 * Don't update rcv_wup/rcv_wnd here or else
+			 * we will not be able to advertise a zero
+			 * window in time.  --DaveM
+			 *
+			 * Relax Will Robinson.
+			 */
+			if (new_win == 0)
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
+		tp->rcv_wnd = new_win;
+		tp->rcv_wup = tp->rcv_nxt;
 	}
-	tp->rcv_wnd = new_win;
-	tp->rcv_wup = tp->rcv_nxt;
 
 	/* Make sure we do not exceed the maximum possible
 	 * scaled window.
-- 
2.7.4

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox